Notes on MPS job001548: assertion in trace.c: RefSetSub(ss.unfixedSummary, SegSummary(seg))

This document contains incomplete and informal notes concerning the investigation of MPS job001548: assertion in trace.c: RefSetSub(ss.unfixedSummary, SegSummary(seg)).

Not confidential. Readership: MPS developers.

Introduction

Imagine that the segment is a box containing some refs; the box has a lid (the MPS Shield) so we should know when any new ref is put in the box. We should keep the label on the box (the summary) correct, except at certain defined times (eg. while seg-scan is in progress?).

Something must have gone wrong at one of these three steps:

1. We look at what's in the box (or default to UNIV), put a label on the box, and put the lid on (write-protect).
2. Over some time, we take the lid on and off, some new refs get put into the box, and we try to keep the label correct. The writes can be from:
   1. mutator
   2. fix that updates a ref
   3. preserve-by-copy
3. (During scan) we look at what was in the box, and check it against the label.

In a picture: DSC00687.JPG.

The final check is failing.
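The three steps can be modelled in a few lines of C. This is a toy sketch, not MPS code: every name beginning with Toy or toy, the fixed-size ref array, and the 20-bit zone shift are assumptions made for illustration. Only two ideas are taken from the notes above: the label is a set of zones, and the final check is a subset test.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model only: none of these names are MPS identifiers. */
typedef uintptr_t ToyRef;
typedef uintptr_t ToyRefSet;            /* one bit per zone */

#define TOY_ZONE_SHIFT 20u              /* assumed zone size, for illustration */
#define TOY_ZONES ((unsigned)(sizeof(ToyRefSet) * CHAR_BIT))
#define TOY_SEG_REFS 8

typedef struct ToySeg {
  ToyRef refs[TOY_SEG_REFS];            /* the contents of the box */
  size_t count;                         /* how many refs are in use */
  ToyRefSet summary;                    /* the label */
  int shielded;                         /* the lid: nonzero = write-protected */
} ToySeg;

/* The single-zone set containing the zone that ref points into. */
static ToyRefSet toyZoneOf(ToyRef ref)
{
  unsigned zone = (unsigned)((ref >> TOY_ZONE_SHIFT) % TOY_ZONES);
  return (ToyRefSet)1 << zone;
}

/* Subset test: every zone in a is also in b. */
static int toyRefSetSub(ToyRefSet a, ToyRefSet b)
{
  return (a & ~b) == 0;
}

/* Step 1: look in the box, write the label, put the lid on. */
static void toySegLabel(ToySeg *seg)
{
  ToyRefSet s = 0;
  size_t i;
  for (i = 0; i < seg->count; ++i)
    s |= toyZoneOf(seg->refs[i]);
  seg->summary = s;
  seg->shielded = 1;
}

/* Step 2: a write that lifts the lid and keeps the label correct. */
static void toySegWrite(ToySeg *seg, size_t i, ToyRef ref)
{
  assert(i < seg->count);
  seg->shielded = 0;
  seg->refs[i] = ref;
  seg->summary |= toyZoneOf(ref);       /* union the new ref into the label */
  seg->shielded = 1;
}

/* Step 2 gone wrong: a write that does not update the label. */
static void toySegWriteUnrecorded(ToySeg *seg, size_t i, ToyRef ref)
{
  assert(i < seg->count);
  seg->refs[i] = ref;
}

/* Step 3: look at what is in the box and check it against the label. */
static int toySegCheck(const ToySeg *seg)
{
  ToyRefSet unfixed = 0;
  size_t i;
  for (i = 0; i < seg->count; ++i)
    unfixed |= toyZoneOf(seg->refs[i]);
  return toyRefSetSub(unfixed, seg->summary);
}
```

In this model toySegCheck() keeps returning true after any number of toySegWrite() calls, and returns false once toySegWriteUnrecorded() stores a ref into a zone the label does not cover. That is the shape of the failure being investigated, not a claim about which step actually goes wrong.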

(For the rest of this document, stick to MPS terminology: 'label' = summary; 'box' = seg or zone usually; 'lid' = shield.)

Questions

.q.valid: There ARE certain times when the summary is allowed to be wrong. What times are those? (Don't know).
.q.gen: Is the code that generated the summary correct?
.q.check: Is the code that finally checks the summary correct? At the time we check it, is the summary supposed to be valid?
.q.shield: Is the shield code keeping the seg write-protected? For any time the shield is down we need guarantees about what refs might get written into the seg.
.q.maintain: Is the code that unions a newly-added ref into the current summary correct?

Tricky situations:

.sit.preserve-into: seg we are preserve-by-copying into. If we preserve into a seg we are currently scanning, the newly-preserved object must be scanned in *this* scan (there is no mechanism for putting the seg back on the grey-list).
.sit.multipage: seg that spans several OS pages
.sit.zone-boundary: seg that straddles a zone boundary
.sit.nailed: nailed seg

Approach

Special circumstances?

The relevant code hasn't changed much in a while, and the failures aren't very common. Both of these suggest that the code fails only in a fairly unusual combination of circumstances. So it's worth looking at data at the time of failure, to see if some circumstances (eg. nailed seg) are always present.

It's easy to make this programmatic: hack in if (!assert-cond) { Describe(); printf data; etc } before the assert, and run to crash several times.
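A minimal sketch of that hack, using hypothetical toy types rather than the real MPS structures. ToySeg, its nailed and pages fields, and all the toy* functions are invented for illustration; in the MPS itself the dump would presumably call the existing Describe methods instead.

```c
#include <assert.h>
#include <stdio.h>

/* Toy stand-ins for illustration only; not MPS types or functions. */
typedef unsigned long ToyRefSet;

typedef struct ToySeg {
  ToyRefSet summary;   /* recorded summary of the segment */
  int nailed;          /* nonzero if the segment is nailed */
  unsigned pages;      /* number of OS pages the segment spans */
} ToySeg;

static unsigned long toyFailureCount = 0;

/* Subset test: every zone in a is also in b. */
static int toyRefSetSub(ToyRefSet a, ToyRefSet b)
{
  return (a & ~b) == 0;
}

/* Print the circumstances we want to correlate across failures. */
static void toySegDescribe(const ToySeg *seg, ToyRefSet unfixed, FILE *stream)
{
  fprintf(stream, "summary check failed (failure %lu)\n", toyFailureCount);
  fprintf(stream, "  unfixedSummary %#lx\n", unfixed);
  fprintf(stream, "  seg summary    %#lx\n", seg->summary);
  fprintf(stream, "  extra zones    %#lx\n", unfixed & ~seg->summary);
  fprintf(stream, "  nailed %d, pages %u\n", seg->nailed, seg->pages);
}

/* Dump data if the condition is false; return the condition. */
static int toySummaryCheckVerbose(const ToySeg *seg, ToyRefSet unfixed,
                                  FILE *stream)
{
  int ok = toyRefSetSub(unfixed, seg->summary);
  if (!ok) {
    ++toyFailureCount;
    toySegDescribe(seg, unfixed, stream);
    fflush(stream);    /* get the data out before any abort */
  }
  return ok;
}

/* Call site: dump first, then let the original assertion fire. */
static void toyScanEnd(const ToySeg *seg, ToyRefSet unfixed)
{
  int ok = toySummaryCheckVerbose(seg, unfixed, stderr);
  assert(ok);
  (void)ok;            /* avoid an unused-variable warning under NDEBUG */
}
```

Keeping the dump outside the assert expression means the data is still printed in builds where assertions are compiled out.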

See out-segdesc01.txt for a sample.

Also: could output telemetry. But that's not human friendly, and it would take me ages to wade through it :-(

Write a General-purpose Check Function

On the other hand, we don't have a general purpose "CheckAllSummariesNow()" function. Writing one would help here, and also catch other present or future defects.

Use the Source...

Thinking about the issues, and learning the source, is really useful for me. Not necessarily fastest, but loads of genuine extra benefit. See "How the code is supposed to work", below.

What is mpsicv doing anyway?

mpsicv is an internal test, that can go inside mps.h. Perhaps it's just doing something illegal? Better have a look inside. And add lots of printfs as mpsicv goes along.

mpsicv successfully completes its "for(200000 objects)" loop, with the 30-or-so collections that print out "Collection %u, %lu objects".

Failure happens when mpsicv then calls arena_commit_test(), which allocates memory until it hits commit limit, forcing full collections, which sometimes trigger the assert. See a1f and a1g1stFull.txt (the sixth ASSERT in a1g... shows nPolls is not always 1.000).

A general purpose CheckAllSummariesNow() function

Even though I don't know all the invariants, or all the times when the seg summary is valid, I can still write a CheckThisSummary() function, and run it at various known-good times, such as ArenaEnter/Leave.

How hard can it be? Should I use pool->scan or pool->walk? Scan should only see grey things. Walk should only see black things. Hmmm, in AMCWalk:

"/* NB, segments containing a mix of colours (i.e., nailed segs) are not handled properly: No objects are walked @@@@ */"

Using scan it would be:

  ScanStateInit()
  replace ss->fix
  ShieldExpose()
  PoolScan()
  ShieldCover()

Also see ArenaFormattedObjectsWalk() [walk.c]
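The scan-based outline above might look like the following toy model. It is a sketch under stated assumptions, not MPS code: the Toy* types, the expose-depth counter standing in for ShieldExpose/ShieldCover, and the 20-bit zone shift are all invented here. The point it illustrates is the "replace ss->fix" step: the replacement fix only records the zone of each ref it is shown and never updates the ref, so the check leaves the segment unchanged.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model only: none of these names are MPS identifiers. */
typedef uintptr_t ToyRef;
typedef uintptr_t ToyRefSet;

#define TOY_ZONE_SHIFT 20u              /* assumed zone size, for illustration */
#define TOY_ZONES ((unsigned)(sizeof(ToyRefSet) * CHAR_BIT))

typedef struct ToyScanState ToyScanState;
typedef void (*ToyFixFn)(ToyScanState *ss, ToyRef *refIO);

struct ToyScanState {
  ToyFixFn fix;                         /* stands in for ss->fix */
  ToyRefSet seen;                       /* zones of every ref encountered */
};

typedef struct ToySeg {
  ToyRef *refs;
  size_t count;
  ToyRefSet summary;                    /* the recorded summary */
  int exposeDepth;                      /* > 0 while the shield is lowered */
} ToySeg;

static ToyRefSet toyZoneOf(ToyRef ref)
{
  unsigned zone = (unsigned)((ref >> TOY_ZONE_SHIFT) % TOY_ZONES);
  return (ToyRefSet)1 << zone;
}

/* Replacement fix: record the ref's zone, leave *refIO untouched. */
static void toyCheckingFix(ToyScanState *ss, ToyRef *refIO)
{
  ss->seen |= toyZoneOf(*refIO);
}

/* Stand-in for the pool's scan method: show every ref to ss->fix. */
static void toySegScan(ToyScanState *ss, ToySeg *seg)
{
  size_t i;
  assert(seg->exposeDepth > 0);         /* must not scan a covered seg */
  for (i = 0; i < seg->count; ++i)
    ss->fix(ss, &seg->refs[i]);
}

/* Returns nonzero if every ref in seg is covered by seg->summary. */
static int toyCheckThisSummary(ToySeg *seg)
{
  ToyScanState ss;
  ss.seen = 0;                          /* cf. ScanStateInit() */
  ss.fix = toyCheckingFix;              /* cf. replace ss->fix */
  ++seg->exposeDepth;                   /* cf. ShieldExpose() */
  toySegScan(&ss, seg);                 /* cf. PoolScan() */
  --seg->exposeDepth;                   /* cf. ShieldCover() */
  return (ss.seen & ~seg->summary) == 0;
}
```

A CheckAllSummariesNow() would then loop over all segments calling this at a known-good time. Whether a real scan method can safely be driven this way for every pool class, including segments of mixed colour, is exactly the open question raised above.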

How the code is supposed to work

Here are some notes on the parts of code I have studied while investigating the defect.

Partial scans

One tricky issue is partial scans of a segment: seg may be part grey (must scan), part white (should not scan).

I have worked out in my head how this ought to work. See http://info.ravenbrook.com/mail/2006/12/15/11-42-40/0.txt "keeping summaries during partial scans".

I wrote an abstract walk-through of a trace: example-abstract-trace.txt. Some further notes follow:

When a collection trace ends (and we reclaim all white objects) we can replace the old summary with the summary of black-for-this-trace objects. Arbitrarily calling this trace "1" (one), I call this summary "t1b".

What do we encounter during scan? We find *all* refs in all *grey* objects (and, optionally, in black objects too, though that's a waste).

We encounter five types of ref:

1. obviously non-white: refs that aren't in the white zoneset;
2. non-white (but in a zone that has some white objs);
3. white becomes grey, but ref unchanged because object is preserved in place;
4. old white that needs replacement (broken heart, weak);
5. new grey replacement for old white (snapped-out, or splatted).

unfixedSummary is the accumulated summary of 1, 2, 3, and 4.

t1b is the accumulated summary of 1, 2, 3, and 5.
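The two definitions can be restated as code. This is only a sketch of the bookkeeping described above, with invented names (ToyRefKind, ToyScannedRef, toyAccumulate) and with each ref reduced to the single-zone set it points into; it says nothing about how the real scan and fix code computes these values.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only: none of these names are MPS identifiers. */
typedef unsigned long ToyRefSet;

typedef enum ToyRefKind {
  ToyRefObviouslyNonWhite = 1,  /* 1. not in the white zoneset */
  ToyRefNonWhite,               /* 2. non-white, in a zone with white objs */
  ToyRefPreservedInPlace,       /* 3. white becomes grey, ref unchanged */
  ToyRefOldWhite,               /* 4. old white that needs replacement */
  ToyRefNewGrey                 /* 5. new grey replacement for old white */
} ToyRefKind;

typedef struct ToyScannedRef {
  ToyRefKind kind;
  ToyRefSet zone;               /* single-zone set the ref points into */
} ToyScannedRef;

typedef struct ToySummaries {
  ToyRefSet unfixed;            /* accumulated over kinds 1, 2, 3 and 4 */
  ToyRefSet t1b;                /* accumulated over kinds 1, 2, 3 and 5 */
} ToySummaries;

static ToySummaries toyAccumulate(const ToyScannedRef *refs, size_t n)
{
  ToySummaries s = { 0, 0 };
  size_t i;
  for (i = 0; i < n; ++i) {
    ToyRefKind kind = refs[i].kind;
    assert(kind >= ToyRefObviouslyNonWhite && kind <= ToyRefNewGrey);
    if (kind != ToyRefNewGrey)
      s.unfixed |= refs[i].zone;  /* everything except the replacements */
    if (kind != ToyRefOldWhite)
      s.t1b |= refs[i].zone;      /* everything except the replaced refs */
  }
  return s;
}
```

The two results differ only in zones contributed by kinds 4 and 5, so they coincide whenever no ref is replaced during the scan.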

What does the current scan and fix code actually do?

Shield

See new notes at design.mps.shield.

B. Document History

  2006-12-18  RHSK  Created.
  2006-12-18  RHSK  Approaches.  How current code works.
  2006-12-18  RHSK  Link out-segdesc01.txt.  What's mpsicv doing?
  2006-12-21  RHSK  Link design/shield
  2007-01-04  RHSK  Three steps to wrong summary: link to picture.
  2007-01-04  RHSK  Fails in arena_commit_test.

C. Copyright and License

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
Redistributions in any form must be accompanied by information on how to obtain complete source code for this software and any accompanying software that uses this software. The source code must either be included in the distribution or be available for no more than the cost of distribution plus a nominal fee, and must be freely redistributable under reasonable conditions. For an executable file, complete source code means the source code for all modules it contains. It does not include source code for modules or files that typically accompany the major components of the operating system on which the executable file runs.

This software is provided by the copyright holders and contributors "as is" and any express or implied warranties, including, but not limited to, the implied warranties of merchantability, fitness for a particular purpose, or non-infringement, are disclaimed. In no event shall the copyright holders and contributors be liable for any direct, indirect, incidental, special, exemplary, or consequential damages (including, but not limited to, procurement of substitute goods or services; loss of use, data, or profits; or business interruption) however caused and on any theory of liability, whether in contract, strict liability, or tort (including negligence or otherwise) arising in any way out of the use of this software, even if advised of the possibility of such damage.