Title | MPS assertion in trace.c: RefSetSub(ss.unfixedSummary, SegSummary(seg)) |
Status | closed |
Priority | essential |
Assigned user | Richard Kistruck |
Organization | Ravenbrook |
Description | MPS assertion in trace.c: RefSetSub(ss.unfixedSummary, SegSummary(seg)) Repeatable: not always. (originally wasn't, now appears to be on MacTel, see comments on 2007-03-09) Recurrence: 5% or more of mpsicv runs with random seeds Platforms: w3i3mv, xcppgc Varieties: CI, HI, II Age bounds: 1.106.0:Yes Related Jobs: job001543: mpsicv on Mac OS X does not use reg&stack scanner. RHSK 2006-12-13 __w3i3mv__ Assert fires: MPS ASSERTION FAILURE: RefSetSub(ss.unfixedSummary, SegSummary(seg)) .\trace.c 1111 in mpsicv on w3i3mv platform: Using master/...@161206, CI (cool) build. Also in version/1.106/...@155175 == release/1.106.0, CI (cool). Also in master/...@161213, both CI and HI builds (hot: which now means with AVERs but without DEBUG = DIAGNOSTICS = STATISTICs = METERs -- see job001545 & job001546). Repeatable: No. Repeating the randomize seed doesn't make it fail again. Example seeds that have failed once: 2715, 7909, 23634, 24772, 18186, Recurrence: Yes (<1 hour). Now seen about 10 times, from w3i3mv CI or HI mpsicv. Looping tests produce a failure fairly often. Typical successful iterations-before-failure: 0, 42, 5, 14. RHSK 2006-12-14 __xcppgc__ mpsicv on xcppgc (<= 1.107.0) is a bit different: the reg&stack scanner is not used. See job001543: mpsicv on Mac OS X does not use reg&stack scanner. version/1.107/...@161223 == release/1.107.0: xcppgc\ci\mpsicv gave 0 failures in 7 runs xcppgc\hi\mpsicv gave 2 failures in 7 runs; seeds: 27271, 27283. xcppgc\ti\mpsicv gave 0 failures in 7 runs xcppgc\ii\mpsicv gave 3 failures in 7 runs; seeds: 28869, 28896, 28923 and with fixed seed 23954, 0 failures in 12 runs (3 each ci, hi, ti, ii) HI and II have AVERs and checking at CheckLevelMINIMAL, but no DIAGNOSTICS = STATISTICs = METERs. |
Analysis | RHSK 2006-12-13 This assert reports that the old SegSummary(seg) was incomplete. Imagine that the segment is a box containing some refs; the box has a lid (the MPS Shield) so we should know when any new ref is put in the box; we should keep the label on the box (the summary) correct, except at certain defined times (eg. while seg-scan is in progress?). We have just (totally or partially) scanned the seg, accumulating the summary of all refs-before-fix ("unfixed") into ss.unfixedSummary. SegSummary should have had these already. (We are about to update SegSummary with the summary of refs-after-fix, at least if it was a total scan). (Hmmm... the assert only makes sense if unfixedSummary was inited to Empty at the start of scanning *this* segment. If not, it might have picked up some zone bits from other (previous) segments, in which case it's not surprising that it's not a subset of SegSummary() for *this* seg.) RHSK 2006-12-18 See detailed analysis: http://www.ravenbrook.com/project/mps/doc/2006-12-18/job001548-summary/ See development branch: http://info.ravenbrook.com/project/mps/branch/2006-12-15/unfixed-summary/ DRJ 2007-03-01 Can't reproduce on master/...@161872 using OS X on Intel. Two loops of: while : ; do ./xci3gc/hi/mpsicv || break; done gave up on first loop after 29 successful runs; second one gave up at 15 successful runs. Note: There is no stack scanner on this configuration (yet). DRJ 2007-03-09 After implementing reg scanner for Intel Darwin (change 161877) and then a proper protection module (change 161902) I can now reproduce this on my Intel MacBook (Intel Darwin). The first time I tried it, the loop: while : ; do ./xci3gc/hi/mpsicv || break; done stops with: MPS ASSERTION FAILURE: RefSetSub(ss.unfixedSummary, SegSummary(seg)) trace.c 1111 Abort trap After 76 runs (seed was 3670). Hmm, maybe I should've just left it to run longer earlier. 
Moreover, right now on the master sources at change level 161907 on my MacBook the 3670 seed makes the failure repeatable. ./xci3gc/hi/mpsicv 3670 always fails. So does: ./xci3gc/hi/mpsicv 10259 DRJ 2007-03-09 Also fails on lii4gc. But not always repeatably. Sometimes with seeds: 14742, 14884, 15025 It appears to be very easy to fail though. Usually only a few different trials before one fails. And often some seeds, like 15025, appear to fail > 50% of the time. Yay! gdb works on this platform so I can catch an example failure in the debugger. Yumm. RHSK 2007-03-19 Failure appears to be during emergency tracing (xcppgc/hi/mpsicv). See http://info.ravenbrook.com/project/mps...12-15/unfixed-summary/code/a1oEmerg.txt See http://info.ravenbrook.com/project/mps...12-15/unfixed-summary/code/a1pEmerg.txt RHSK 2007-04-18 If a pool causes MPS_FIX1() to be applied to the same ref *twice* in the same scan, then ss.unfixedSummary becomes 'polluted' with new (fixed) refs and is therefore no longer an accurate statement of the seg's summary before this scan started. The only problem this causes is to trip the .verify.segsummary AVER in trace.c. This may happen when a pool class cannot remember whether it has already fixed the ref. In the case of the AMC poolclass, this happens by design when scanning a boarded segment under emergency tracing, and a new mark is made on the segment: http://info.ravenbrook.com/mail/2007/03/24/11-05-17/0.txt http://info.ravenbrook.com/mail/2007/03/26/17-05-58/0.txt (For interest, note that in a fwd-buffered mobile seg, MPS_FIX1 may get applied to the same ref more than once, but not in the same scan.) The fix is to detect the rare circumstances where this re-fix in the same scan may have occurred. As far as we know at the moment, this is only the AMC boarded seg under ET case. In these circumstances, deal with the polluted unfixedSummary by moving it into fixedSummary, and clearing unfixedSummary. 
For an "alternative correction", and other checks we should do, see: http://info.ravenbrook.com/mail/2007/03/26/17-28-37/0.txt |
How found | unknown |
Evidence | master/...@161206, w3i3mv, CI, mpsicv http://info.ravenbrook.com/mail/2006/12/12/12-07-19/0.txt http://info.ravenbrook.com/mail/2006/12/13/10-55-06/0.txt |
Observed in | 1.106.2 |
Created by | Richard Kistruck |
Created on | 2006-12-13 16:48:00 |
Last modified by | Gareth Rees |
Last modified on | 2014-04-12 22:05:52 |
History | 2006-12-13 RHSK Created; made critical. 2006-12-14 RHSK Also on Mac OS X PowerPC. 2006-12-14 RHSK Summarise occurrence. 2006-12-18 RHSK Analysis: link to doc for detailed analysis. 2006-12-18 RHSK Analysis: link to development branch unfixed-summary. 2007-03-01 DRJ Can't reproduce on MacTel. 2007-03-09 DRJ Reproduced on MacTel. And Repeatable. 2007-03-09 DRJ Reproduced on Linux 2007-03-19 RHSK Failure appears to be during emergency tracing. (fix link) 2007-04-18 RHSK Solved on 2007-03-24. Describe defect and fix. |
Change | Effect | Date | User | Description |
---|---|---|---|---|
162001 | closed | 2007-03-25 17:05:50 | Richard Kistruck | MPS br/unfixed-summary: amcScanNailed: Show how summaries change when amcScanNailed loops. Highlight cases that would (previously) have failed .verify.segsummary. Count the loops. Show whether it wasTotal. AMCSegSketch: correct it to show stalo and neo the right way round. |
162000 | closed | 2007-03-25 15:59:05 | Richard Kistruck | MPS br/unfixed-summary: if amcScanNailed looped, ss.unfixedSummary is not accurate, so move all of the ScanStateSummary into ss.fixedSummary, so that <impl/trace/#verify.segsummary> does not erroneously fail. See also log file a2nNailedLoopReset.txt. |
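
The defect and the correction described in the Analysis (2007-04-18) and in change 162000 can be illustrated with a small stand-alone model. This is only a sketch under stated assumptions, not the MPS code: the names RefSetSketch, ScanSummarySketch, zone_bit, refset_sub, fix_slot, reset_after_rescan and demo_rescan are all hypothetical, and the zone of a ref is assumed (purely for illustration) to be five address bits starting at bit 20. It models a summary as one bit per zone, shows how fixing the same ref twice in one scan puts the ref's new zone into the "unfixed" summary, and shows how folding the unfixed summary into the fixed one lets a subset check of the RefSetSub kind pass again.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a ref set: one bit per address zone. */
typedef uint32_t RefSetSketch;
#define REFSET_EMPTY ((RefSetSketch)0)

/* ASSUMPTION for illustration: the zone is address bits 20..24. */
static RefSetSketch zone_bit(uintptr_t ref)
{
    return (RefSetSketch)1 << ((ref >> 20) & 31u);
}

/* Nonzero iff every zone in a is also in b (a is a subset of b). */
static int refset_sub(RefSetSketch a, RefSetSketch b)
{
    return (a & ~b) == 0;
}

/* Hypothetical model of the two summaries kept during a scan. */
typedef struct {
    RefSetSketch unfixed; /* zones of refs as seen before fixing */
    RefSetSketch fixed;   /* zones of refs as left after fixing */
} ScanSummarySketch;

/* Model of fixing one slot: note the old zone, store the new ref,
 * note the new zone. */
static void fix_slot(ScanSummarySketch *ss, uintptr_t *slot,
                     uintptr_t new_ref)
{
    ss->unfixed |= zone_bit(*slot);
    *slot = new_ref;
    ss->fixed |= zone_bit(new_ref);
}

/* Model of the correction: the unfixed summary can no longer be
 * trusted, so move it into the fixed summary and clear it. */
static void reset_after_rescan(ScanSummarySketch *ss)
{
    assert(ss != 0);
    ss->fixed |= ss->unfixed;
    ss->unfixed = REFSET_EMPTY;
}

/* Fix the same slot twice in one scan; return the result of the
 * subset check against the pre-scan segment summary. */
static int demo_rescan(int apply_correction)
{
    uintptr_t slot = (uintptr_t)3 << 20;       /* ref in zone 3 */
    RefSetSketch seg_summary = zone_bit(slot); /* accurate before scan */
    ScanSummarySketch ss = { REFSET_EMPTY, REFSET_EMPTY };

    fix_slot(&ss, &slot, (uintptr_t)7 << 20);  /* pass 1: moved to zone 7 */
    fix_slot(&ss, &slot, slot);                /* pass 2: same slot again */

    if (apply_correction)
        reset_after_rescan(&ss);
    return refset_sub(ss.unfixed, seg_summary);
}
```

In this model demo_rescan(0) returns 0, because zone 7 has leaked into the unfixed summary while the pre-scan segment summary only contains zone 3; demo_rescan(1) returns 1, because the unfixed summary is empty after the reset. The fixed summary then holds both zones, which is a conservative over-approximation rather than an exact summary.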