Discussion:
[Xcb] update to libxcb 1.12 breaks 32bit applications
G. Schlisio
2016-06-04 10:29:45 UTC
Hi,

i am using ArchLinux and we recently received an update for libxcb 1.12.
to run 32bit applications in 64bit environments we have a multilib
repository, providing 32bit versions of required libs, lib32-libxcb in
this case.
with the new version of xcb several 32bit applications stopped working
and crash immediately on startup. one of these apps is acroread [0], it
was also reported on zsnes.
as i maintain the former, i collected some stack trace [1] and gdb
backtrace [2], of which i cannot make sense. is this helpful to you to
identify the root of this problem?

i am aware that acroread is an old crappy piece of ancient software, but
that's not the point here :)

i'll be glad to provide you with more information if needed.

best regards
georg

[0] https://aur.archlinux.org/packages/acroread
[1] http://pastebin.com/FhQM3izE
[2] http://pastebin.com/f3Py5LDn
Uli Schlachter
2016-06-05 07:53:04 UTC
Hi,

On 04.06.2016 at 12:29, G. Schlisio wrote:
[...]
Post by G. Schlisio
as i maintain the former, i collected some stack trace [1]
That's strace output and not a stack trace. The only possible helpful thing in
there is:

--- SIGSEGV {si_signo=SIGSEGV, si_code=SI_KERNEL, si_addr=0} ---

so this is a NULL-pointer dereference.
Post by G. Schlisio
and gdb
backtrace [2], of which i cannot make sense. is this helpful to you to
identify the root of this problem?
[...]

No, it's not. Please recompile libxcb with debug symbols enabled so that the
stack trace says more than just ??.

Cheers,
Uli

P.S.: Since I have a feeling that you might ask, Google found:
https://wiki.archlinux.org/index.php/Debug_-_Getting_Traces
--
- He made himself, me nothing, you nothing out of the dust
- Er machte sich mir nichts, dir nichts aus dem Staub
Uli Schlachter
2016-06-05 13:00:25 UTC
Hi,
Post by Uli Schlachter
Please recompile libxcb with debug symbols enabled so that the
stack trace says more than just ??.
done so (hopefully) and uploaded a trace here [0].
Program received signal SIGSEGV, Segmentation fault.
0xf50ed6d1 in remove_finished_readers (completed=<optimized out>,
prev_reader=0x980d0c8) at xcb_in.c:107
107 while(*prev_reader &&
XCB_SEQUENCE_COMPARE((*prev_reader)->request, <=, completed))
[...]
[0] http://pastebin.com/LSQktdui
Weird. I have no idea. The gdb output says "prev_reader=0x980d0c8", so this
can't be where the NULL pointer comes from. *prev_reader can't be the problem
either, because this value is checked.

Does strace still report that this crashes with a NULL pointer dereference? This
seems to be impossible, given that GDB says it is not a NULL pointer.

Also, looking through the changes since 1.11.1, I don't even see a commit that
touches this part of the code. Weird.

Sorry & cheers,
Uli
--
Please do not look into the laser with your remaining eye.
- Vincent Ebert
G. Schlisio
2016-06-05 13:09:49 UTC
Post by Uli Schlachter
Weird. I have no idea. The gdb output says "prev_reader=0x980d0c8", so this
can't be where the NULL pointer comes from. *prev_reader can't be the problem
either, because this value is checked.
Does strace still report that this crashes with a NULL pointer dereference? This
seems to be impossible, given that GDB says it is not a NULL pointer.
strace still says:
--- SIGSEGV {si_signo=SIGSEGV, si_code=SI_KERNEL, si_addr=0} ---

i have to make a little change in the starter script to use gdb. the
script sets the environment to use all the outdated libs shipped with
acroread.
the change is to add /usr/bin/gdb to the exec statement. that change has
(obviously) to be undone before running strace.
otoh this should not make any difference for this issue

Post by Uli Schlachter
Also, looking through the changes since 1.11.1, I don't even see a commit that
touches this part of the code. Weird.
i asked other affected people to provide traces as well, but no response
yet.
James Cloos
2016-06-05 19:25:15 UTC
G>> Program received signal SIGSEGV, Segmentation fault.
G>> 0xf50ed6d1 in remove_finished_readers (completed=<optimized out>,
G>> prev_reader=0x980d0c8) at xcb_in.c:107
G>> 107 while(*prev_reader &&
G>> XCB_SEQUENCE_COMPARE((*prev_reader)->request, <=, completed))

US> Weird. I have no idea. The gdb output says "prev_reader=0x980d0c8",
US> so this can't be where the NULL pointer comes from. *prev_reader
US> can't be the problem either, because this value is checked.

Since it is (*prev_reader)->request that is a double-dereference, yes?
So the uint32_t at 0x980d0c8, which should be a pointer to a struct, is
presumably the problem.

-JimC
--
James Cloos <***@jhcloos.com> OpenPGP: 0x997A9F17ED7DAEA6
G. Schlisio
2016-06-06 11:37:11 UTC
Post by James Cloos
Since it is (*prev_reader)->request that is a double-dereference, yes?
So the uint32_t at 0x980d0c8, which should be a pointer to a struct, is
presumably the problem.
i know far too little about c programming to fully understand what you
are telling me. i am very sorry.
meanwhile some folks provided us with some traces from other programs,
you find them attached in the arch bug tracker [0].
hope this helps.

georg

[0] https://bugs.archlinux.org/task/49560
James Cloos
2016-06-06 15:43:58 UTC
Post by James Cloos
Since it is (*prev_reader)->request that is a double-dereference, yes?
So the uint32_t at 0x980d0c8, which should be a pointer to a struct, is
presumably the problem.
GS> i have way to few clue about c programming to fully understand what you
GS> are telling me. i am very sorry.

That bit was primarily intended for Uli.

For reference, a struct's members are accessed via the . operator and
members of pointers to structs with the -> operator. So in the above
snippet, *prev_reader is a pointer to a struct, making prev_reader a
pointer to a pointer to a struct. So prev_reader has to be dereferenced
twice. The first worked; the second, presumably, is what caused the
segv.

But the backtrace might be a bit better if you try it w/ cairo compiled
with -O0 instead of what arch uses by default (presumably -O2).

And, when getting the backtrace, try this at the gdb prompt:

p prev_reader
p *prev_reader
p **prev_reader

and include that info, too.

-JimC
--
James Cloos <***@jhcloos.com> OpenPGP: 0x997A9F17ED7DAEA6
James Cloos
2016-06-06 19:02:06 UTC
GS> ok, so:

GS> (gdb) run
GS> Starting program: /opt/Adobe/Reader9/Reader/intellinux/bin/acroread
GS> [Thread debugging using libthread_db enabled]
GS> Using host libthread_db library "/usr/lib/libthread_db.so.1".

GS> Program received signal SIGSEGV, Segmentation fault.
GS> 0xf50ed6d1 in remove_finished_readers (completed=<optimized out>,
GS> prev_reader=0x9806650) at xcb_in.c:107
GS> 107 xcb_in.c: No such file or directory.
GS> (gdb) p prev_reader
GS> $1 = (reader_list **) 0x9806650
GS> (gdb) p *prev_reader
GS> $2 = (reader_list *) 0xffffc5b0
GS> (gdb) p **prev_reader
GS> $3 = {request = 1, data = 0xffffc5c0, next = 0x0}

I'm not sure what the problem is, then. That looks like it should work
fine, given the code in that function.

Next being 0x0 is OK. The while code is written to expect that and stop
looping once *prev_reader is set to 0x0.

-JimC
--
James Cloos <***@jhcloos.com> OpenPGP: 0x997A9F17ED7DAEA6
G. Schlisio
2016-06-07 07:44:51 UTC
Post by James Cloos
I'm not sure what the problem is, then. That looks like it should work
fine, given the code in that function.
building libxcb with O0 or O1 fixes this breakage, as someone discovered
on our bugtracker.
there also was a hint about arch recently switching to gcc6, might we
see a bug in that one?
G. Schlisio
2016-06-08 12:53:23 UTC
Post by G. Schlisio
Post by James Cloos
I'm not sure what the problem is, then. That looks like it should work
fine, given the code in that function.
building libxcb with O0 or O1 fixes this breakage, as someone discovered
on our bugtracker.
there also was a hint about arch recently switching to gcc6, might we
see a bug in that one?
compiling libxcb 1.12 with gcc 5.3 and -O2 results in no crash in my
setup. looks like the bug is in gcc6 then.
thank you guys for helping!
Rémi Cardona
2016-06-08 19:34:01 UTC
Post by G. Schlisio
compiling libxcb 1.12 with gcc 5.3 and -O2 results in no crash in my
setup. looks like the bug is in gcc6 then.
thank you guys for helping!
Does gcc's UBSan report anything? It could still very well be a bug in xcb.

Cheers,

Rémi
G. Schlisio
2016-06-08 19:46:44 UTC
Post by Rémi Cardona
Does gcc's UBSan report anything? It could still very well be a bug in xcb.
never heard of it, sounds interesting.
i compiled libxcb with gcc6 with -O2 and -fsanitize=undefined (that's
what i understood from [0]) but there is no runtime error message.
did i do it wrong?

[0]
https://developerblog.redhat.com/2014/10/16/gcc-undefined-behavior-sanitizer-ubsan/
Ran Benita
2016-06-09 07:02:10 UTC
Post by G. Schlisio
Post by Rémi Cardona
Does gcc's UBSan report anything? It could still very well be a bug in xcb.
never heard of it, sounds interesting.
i compiled libxcb with gcc6 with -O2 and -fsanitize=undefined (thats
what i understood from [0]) but there is no runtime error message.
did i do it wrong?
I haven't read the code, but I also think a gcc bug is not very likely
(though not impossible).

Can you try running the program under valgrind (without sanitizers)?

To install, do `pacman -S valgrind`. Then, instead of `./my_program`,
run `valgrind ./my_program`.

There might be some noise, but I'd be interested in

- Whether it still crashes.
- Whether there's a report before the crash (or where the crash would
have happened).

You can attach the valgrind output if there is any.

Ran
Ran Benita
2016-06-09 10:21:12 UTC
Post by Ran Benita
Can you try running the program under valgrind (without sanitizers).
There might be some noise, but I'd be interested in
- Whether it still crashes.
yes, it does
Post by Ran Benita
- Whether there's a report before the crash (or where the crash would
have happened).
i am not sure what a crash report would look like, but i don't see anything
looking like one in the output (see below)
Post by Ran Benita
You can attach the valgrind output if there is any.
find it here: http://pastebin.com/1RWh34HW
It looks like it spawns subprocesses (some startup script maybe). Please
try `valgrind --trace-children ./my_program` then.

Since the program is multilib, you should install `valgrind-multilib`
package instead of `valgrind` one. Sorry, I forgot that.

If you can reproduce this with *any* program other than acrobat reader
(e.g. zsnes) that would be better. I did try zsnes (Chrono Trigger) and
it ran fine.
Ran Benita
2016-06-09 10:22:21 UTC
Post by Ran Benita
It looks like it spawns subprocesses (some startup script maybe). Please
try `valgrind --trace-children ./my_program` then.
Meant to write `--trace-children=yes`.
G. Schlisio
2016-06-09 11:46:56 UTC
Post by Ran Benita
Post by Ran Benita
It looks like it spawns subprocesses (some startup script maybe). Please
try `valgrind --trace-children ./my_program` then.
Meant to write `--trace-children=yes`.
ok, with valgrind-multilib and tracing children i get this:
http://pastebin.com/nhyJ4hY3

it crashes, but at first glance i don't see anything like a crash report.
Ran Benita
2016-06-09 13:24:25 UTC
Post by G. Schlisio
Post by Ran Benita
Post by Ran Benita
It looks like it spawns subprocesses (some startup script maybe). Please
try `valgrind --trace-children ./my_program` then.
Meant to write `--trace-children=yes`.
http://pastebin.com/nhyJ4hY3
it crashes, but at first glance i don't see anything like a crash report.
There are some errors, but they seem unrelated. So, this didn't help.

If you are willing to put in a bit of effort, given that the program
works with 1.11 but not with 1.12, the best way would be to find the git
commit which caused the regression. The easiest way to do that is to use
"git bisect".

You would do this like so (from your libxcb git clone):

git checkout master
git bisect start
git bisect bad
git bisect good 1.11

Now, it will switch to some commit; you then need to recompile and test
if the problem is there or not. If it still crashes, write

git bisect bad

if it doesn't, write

git bisect good

You should repeat this (about 6 times) until git homes in on the first
bad commit (you can also automate the testing step with `git bisect
run`, see `man git-bisect`).
G. Schlisio
2016-06-09 19:08:04 UTC
Post by Ran Benita
There are some errors, but seem unrelated. So, this didn't help.
If you are willing to put in a bit of effort, given that the program
works with 1.11 but not with 1.12, the best way would be to find the git
commit which caused the regression. The easiest way to do that is to use
"git bisect".

ok, i will do the bisecting tomorrow, but first i should test whether
compiling 1.11.1 with gcc6 breaks things as well. atm it doesn't compile,
so i'll have to try harder.
i'll report back with results (hopefully!).
