Title | Telemetry system calls mps_io_flush() without mps_io_create() |
Status | closed |
Priority | essential |
Assigned user | Gareth Rees |
Organization | Ravenbrook |
Description | Bruce Mitchener reported on IRC that the amcssth test case was hanging: brucem: any chance there's an infinite loop in amcssth? 11:04 (on FreeBSD x86) 11:05 (or a deadlock) 11:05 err, sorry, Linux x86 11:05 https://jenkins.opendylan.org/job/mps-linux-precise-x86/15/console |
Analysis | I ran this test case in GDB on lii6gc. I was able to get it to hang with the following backtrace in thread 2: #0 0x00007ffff76039cb in ?? () from /lib/x86_64-linux-gnu/libc.so.6 #1 0x00007ffff757f3c0 in ?? () from /lib/x86_64-linux-gnu/libc.so.6 #2 0x00007ffff757e7f8 in ?? () from /lib/x86_64-linux-gnu/libc.so.6 #3 0x000000000043baa0 in mps_io_flush (mps_io=<optimized out>) at mpsioan.c:79 #4 0x00000000004269f8 in EventSync () at event.c:149 #5 0x0000000000426ac6 in EventFlush (kind=4) at event.c:90 #6 0x0000000000437280 in amcFixInPlace (pool=<optimized out>, seg=0x7ffff7f7f978, ss=0x7ffff7500b60, refIO=<optimized out>) at poolamc.c:1715 (Line 79 of mpsioan.c is the call to fflush(f). Not sure why the name "fflush" is missing from the backtrace.) This is the backtrace in thread 1: #0 0x00007ffff75387a4 in sigsuspend () from /lib/x86_64-linux-gnu/libc.so.6 #1 0x0000000000435583 in suspendSignalHandler (sig=<optimized out>, info=<optimized out>, context=0x7fffffffdf00) at pthrdext.c:99 #2 <signal handler called> #3 0x00007ffff75e892d in write () from /lib/x86_64-linux-gnu/libc.so.6 #4 0x00007ffff757b883 in _IO_file_write () from /lib/x86_64-linux-gnu/libc.so.6 #5 0x00007ffff757b74a in ?? () from /lib/x86_64-linux-gnu/libc.so.6 #6 0x00007ffff757ceb5 in _IO_do_write () from /lib/x86_64-linux-gnu/libc.so.6 #7 0x00007ffff757bdc0 in _IO_file_sync () from /lib/x86_64-linux-gnu/libc.so.6 #8 0x00007ffff7570cfb in fflush () from /lib/x86_64-linux-gnu/libc.so.6 #9 0x0000000000401e5a in test (arg=<optimized out>, s=<optimized out>) at amcssth.c:271 #10 0x000000000043680d in ProtTramp (resultReturn=0x7fffffffe570, f=0x401b22 <test>, p=0x7ffff7ff6000, s=0) at protix.c:132 #11 0x000000000040613c in mps_tramp (r_o=0x7fffffffe570, f=0x401b22 <test>, p=0x7ffff7ff6000, s=0) at mpsi.c:1378 #12 0x00000000004020f2 in main (argc=<optimized out>, argv=0x7fffffffe678) at amcssth.c:334 So it looks as if there's deadlock here between the two simultaneous fflush() calls. How can that be possible? POSIX I/O is supposed to be thread-safe. But looking at the argument to mps_io_flush(), it's clear that something has gone wrong. eventIO has not been initialized (because event logging was not requested) but mps_io_flush() is being called anyway: (gdb) frame 4 #4 0x00000000004269f8 in EventSync () at event.c:149 149 (void)mps_io_flush(eventIO); (gdb) p eventIO $2 = (mps_io_t) 0x0 (gdb) p eventIOInited $3 = 0 (gdb) p EventKindControl $4 = 0 If the environment variable MPS_TELEMETRY_CONTROL is not set then no events need to be written to the telemetry stream, and so mps_io_create() is never called, and so eventIO is never initialized. But EventSync() always calls mps_io_flush(). See event.c [1]. So we need to guard the call to mps_io_flush(). So why does fflush(NULL) deadlock against fflush(stdout) on Linux but not on OS X? Well, fflush(NULL) flushes all open files. |
How found | automated_test |
Evidence | [1] //info.ravenbrook.com/project/mps/master/code/event.c |
Observed in | 1.111.0 |
Created by | Gareth Rees |
Created on | 2013-05-30 14:53:19 |
Last modified by | Gareth Rees |
Last modified on | 2013-06-04 15:19:42 |
History | 2013-05-30 GDR Created. |
Change | Effect | Date | User | Description |
---|---|---|---|---|
182322 | closed | 2013-05-30 15:24:41 | Gareth Rees | Guard the calls to mps_io_flush so that it can only be called after mps_io_create. |