This document contains incomplete and informal notes concerning the investigation of MPS job001548: assertion in trace.c: RefSetSub(ss.unfixedSummary, SegSummary(seg)).
Not confidential. Readership: MPS developers.
Imagine that the segment is a box containing some refs; the box has a lid (the MPS Shield) so we should know when any new ref is put in the box. We should keep the label on the box (the summary) correct, except at certain defined times (eg. while seg-scan is in progress?).
Something must have gone wrong at one of these three steps:
We look at what's in the box (or default to UNIV), put a label on the box, and put the lid on (write-protect).
Over some time, we take the lid on and off, some new refs get put into the box, and we try to keep the label correct. The writes can be from:
(during scan) we look at what was in the box, and check it against the label.
In a picture: DSC00687.JPG.
The final check is failing.
(For the rest of this document, stick to MPS terminology: 'label' = summary; 'box' = seg or zone usually; 'lid' = shield.)
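The box-and-label picture can be made concrete with a toy model. This is a minimal sketch, assuming a summary is a bitset with one bit per zone and that the failing check is a subset test. The names (toy_refset, toy_zone, toy_summarise), the 64-zone width, and the zone shift of 20 are all hypothetical, chosen for illustration; they are not the MPS definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t toy_refset;            /* assumption: one bit per zone */
#define TOY_REFSET_EMPTY ((toy_refset)0)
#define TOY_REFSET_UNIV  (~(toy_refset)0)
#define TOY_ZONE_SHIFT   20             /* hypothetical zone size of 1 MiB */

/* Map a ref to its zone number (0..63). */
static unsigned toy_zone(uintptr_t ref)
{
  return (unsigned)((ref >> TOY_ZONE_SHIFT) & 63u);
}

/* Add the zone of ref to the set. */
static toy_refset toy_refset_add(toy_refset rs, uintptr_t ref)
{
  return rs | ((toy_refset)1 << toy_zone(ref));
}

/* Nonzero iff every zone in sub is also in super. */
static int toy_refset_sub(toy_refset sub, toy_refset super)
{
  return (sub & ~super) == 0;
}

/* Look at what is in the box and compute the label for it. */
static toy_refset toy_summarise(const uintptr_t *refs, size_t n)
{
  toy_refset rs = TOY_REFSET_EMPTY;
  size_t i;
  for (i = 0; i < n; ++i)
    rs = toy_refset_add(rs, refs[i]);
  return rs;
}
```

In this model a label of TOY_REFSET_UNIV is always safe but useless, and the final check fails exactly when a ref has got into the box whose zone bit is missing from the label.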
Tricky situations:
The relevant code hasn't changed much in a while, and the failures aren't very common. Both of these suggest that the code fails only in a fairly unusual combination of circumstances. So it's worth looking at data at the time of failure, to see if some circumstances (eg. nailed seg) are always present.
It's easy to make this programmatic: hack in "if (!assert-cond) { Describe(); printf data; etc }" before the assert, and run to crash several times.
See out-segdesc01.txt for a sample.
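The "dump before the assert" hack might look roughly like the following. This is only a sketch under stated assumptions: struct toy_seg, its fields, and toy_summary_ok are hypothetical stand-ins, not the real segment structure or Describe functions, and the subset test is written out by hand rather than calling RefSetSub.

```c
#include <assert.h>
#include <stdio.h>

typedef unsigned long toy_refset;       /* assumption: one bit per zone */

/* Hypothetical stand-in for the interesting parts of a segment. */
struct toy_seg {
  unsigned long id;
  toy_refset summary;
  int nailed;
};

/* Returns 1 if unfixed is a subset of the seg summary.
 * Otherwise dumps the circumstances to out and returns 0,
 * so the caller can still assert afterwards. */
static int toy_summary_ok(FILE *out, toy_refset unfixed,
                          const struct toy_seg *seg)
{
  toy_refset missing = unfixed & ~seg->summary;
  if (missing != 0) {
    fprintf(out,
            "seg %lu: summary %#lx, unfixedSummary %#lx, "
            "missing zones %#lx, nailed %d\n",
            seg->id, seg->summary, unfixed, missing, seg->nailed);
    return 0;
  }
  return 1;
}
```

At the assert site this would be used as "ok = toy_summary_ok(stderr, unfixed, seg); assert(ok);", so each crash leaves one line of data that can be compared across runs.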
Also: could output telemetry. But that's not human friendly, and it would take me ages to wade through it :-(
On the other hand, we don't have a general purpose "CheckAllSummariesNow()" function. Writing one would help here, and also catch other present or future defects.
Thinking about the issues, and learning the source, is really useful for me. Not necessarily fastest, but loads of genuine extra benefit. See "How the code is supposed to work", below.
mpsicv is an internal test that can go inside mps.h. Perhaps it's just doing something illegal? Better have a look inside, and add lots of printfs as mpsicv goes along.
mpsicv successfully completes its "for(200000 objects)" loop, with the 30-or-so collections that print out "Collection %u, %lu objects".
Failure happens when mpsicv then calls arena_commit_test(), which allocates memory until it hits commit limit, forcing full collections, which sometimes trigger the assert. See a1f and a1g1stFull.txt (the sixth ASSERT in a1g... shows nPolls is not always 1.000).
Even though I don't know all the invariants, or all the times when the seg summary is valid, I can still write a CheckThisSummary() function, and run it at various known-good times, such as ArenaEnter/Leave.
How hard can it be? Should I use pool->scan or pool->walk? Scan should only see grey things. Walk should only see black things. Hmmm, in AMCWalk:
"/* NB, segments containing a mix of colours (i.e., nailed segs) are not handled properly: No objects are walked @@@@ */"
Using scan it would be:
ScanStateInit()
replace ss->fix
ShieldExpose()
PoolScan()
ShieldCover()
Also see ArenaFormattedObjectsWalk() [walk.c]
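The scan-based CheckThisSummary() idea can be sketched in a toy model. Everything here is an assumption made for illustration: struct toy_scan_state, struct toy_seg, and the toy_ functions are hypothetical, the "shield" is just a flag, and the toy scan visits every ref in the seg, which sidesteps the grey/black question raised above. None of these are the real MPS signatures.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t toy_refset;            /* assumption: one bit per zone */

struct toy_scan_state;
typedef void (*toy_fix_fn)(struct toy_scan_state *ss, uintptr_t ref);

struct toy_scan_state {
  toy_fix_fn fix;                       /* called once per ref found */
  toy_refset seen;                      /* summary of refs seen so far */
};

struct toy_seg {
  const uintptr_t *refs;                /* the refs stored in the seg */
  size_t nrefs;
  toy_refset summary;                   /* the recorded summary */
  int exposed;                          /* stand-in for shield state */
};

/* Replacement fix: only records the zone, never moves or marks. */
static void toy_fix_summarise(struct toy_scan_state *ss, uintptr_t ref)
{
  ss->seen |= (toy_refset)1 << ((ref >> 20) & 63u);
}

/* Stand-in for the pool scan: applies ss->fix to every ref. */
static void toy_seg_scan(struct toy_scan_state *ss, const struct toy_seg *seg)
{
  size_t i;
  for (i = 0; i < seg->nrefs; ++i)
    ss->fix(ss, seg->refs[i]);
}

/* Returns 1 iff the refs actually in the seg are covered by its summary. */
static int toy_check_this_summary(struct toy_seg *seg)
{
  struct toy_scan_state ss;
  ss.fix = toy_fix_summarise;           /* "replace ss->fix" */
  ss.seen = 0;
  seg->exposed = 1;                     /* expose stand-in */
  toy_seg_scan(&ss, seg);               /* scan stand-in */
  seg->exposed = 0;                     /* cover stand-in */
  return (ss.seen & ~seg->summary) == 0;
}
```

The point of replacing the fix function is that the check observes refs without changing anything, so it could in principle be run at known-good times without disturbing a trace in progress.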
Here are some notes on the parts of code I have studied while investigating the defect.
One tricky issue is partial scans of a segment: seg may be part grey (must scan), part white (should not scan).
I have worked out in my head how this ought to work. See http://info.ravenbrook.com/mail/2006/12/15/11-42-40/0.txt "keeping summaries during partial scans".
I wrote an abstract walk-through of a trace: example-abstract-trace.txt. Some further notes follow:
When a collection trace ends (and we reclaim all white objects) we can replace the old summary with the summary of black-for-this-trace objects. Arbitrarily calling this trace "1" (one), I call this summary "t1b".
What do we encounter during scan? We find *all* refs in all *grey* objects (and, optionally, in black objects too, though that's a waste).
We encounter five types of ref:
unfixedSummary is the accumulated summary of 1, 2, 3, and 4.
t1b is the accumulated summary of 1, 2, 3, and 5.
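The relationship between the two accumulated summaries can be written down as set unions. This is a sketch only: it assumes each of the five ref types has its own summary (indexed 0..4 for types 1..5), and the names toy_refset, toy_union, toy_unfixed_summary, and toy_t1b are hypothetical. It says nothing about what the five types are.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t toy_refset;            /* assumption: one bit per zone */
enum { TOY_NKINDS = 5 };                /* ref types 1..5, stored at 0..4 */

/* Union of the per-type summaries selected by include[]. */
static toy_refset toy_union(const toy_refset by_kind[TOY_NKINDS],
                            const int include[TOY_NKINDS])
{
  toy_refset rs = 0;
  int i;
  for (i = 0; i < TOY_NKINDS; ++i)
    if (include[i])
      rs |= by_kind[i];
  return rs;
}

/* unfixedSummary: accumulated summary of types 1, 2, 3 and 4. */
static toy_refset toy_unfixed_summary(const toy_refset by_kind[TOY_NKINDS])
{
  static const int include[TOY_NKINDS] = { 1, 1, 1, 1, 0 };
  return toy_union(by_kind, include);
}

/* t1b: accumulated summary of types 1, 2, 3 and 5. */
static toy_refset toy_t1b(const toy_refset by_kind[TOY_NKINDS])
{
  static const int include[TOY_NKINDS] = { 1, 1, 1, 0, 1 };
  return toy_union(by_kind, include);
}
```

Written this way it is plain that the two summaries differ only in types 4 and 5, so neither is a subset of the other in general.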
What does the current scan and fix code actually do?
See new notes at design.mps.shield.
2006-12-18 RHSK Created.
2006-12-18 RHSK Approaches. How current code works.
2006-12-18 RHSK Link out-segdesc01.txt. What's mpsicv doing?
2006-12-21 RHSK Link design/shield
2007-01-04 RHSK Three steps to wrong summary: link to picture.
2007-01-04 RHSK Fails in arena_commit_test.
This document is copyright © 2006-2007 Ravenbrook Limited. All rights reserved. This is an open source license. Contact Ravenbrook for commercial licensing options.
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
This software is provided by the copyright holders and contributors "as is" and any express or implied warranties, including, but not limited to, the implied warranties of merchantability, fitness for a particular purpose, or non-infringement, are disclaimed. In no event shall the copyright holders and contributors be liable for any direct, indirect, incidental, special, exemplary, or consequential damages (including, but not limited to, procurement of substitute goods or services; loss of use, data, or profits; or business interruption) however caused and on any theory of liability, whether in contract, strict liability, or tort (including negligence or otherwise) arising in any way out of the use of this software, even if advised of the possibility of such damage.