MPS issue job001989

Title: MPS _gc_start messages may cause assert or infinite loop
Status: closed
Priority: essential
Assigned user: Richard Kistruck
Organization: Ravenbrook
Description: MPS _gc_start messages may cause assert or infinite loop

Related jobs:
job001570 "MPS _gc_start messages may change while client reads them, or be silently skipped"

If the client enables _gc_start messages with:
  mps_message_type_enable(arena, mps_message_type_gc_start());
and a collection starts before the client has collected (with
mps_message_get) the _gc_start message from the preceding
collection, then the MPS message queue becomes corrupted. This
causes an assert, or (in varieties without asserts) prevents the
message from being removed from the queue. This unremovable
message is likely to cause the client to loop forever if it
attempts to Get the message, and will cause such a loop if the
client calls mps_arena_destroy.

A typical assert is:

  MPS ASSERTION FAILURE: !RingIsSingle(_old)
  message.c, line 164

Mitigation: clients that do not enable _gc_start messages are
unaffected; a client that enables them and then frequently checks
for and collects them can apparently prevent the defect occurring.
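
For illustration, here is a minimal sketch of such a client routine
(assuming the arena already exists and _gc_start messages have been
enabled as above; it uses the public message functions
mps_message_get, mps_message_discard, and mps_message_gc_start_why):

  #include "mps.h"
  #include <stdio.h>

  /* Sketch: collect and discard every pending _gc_start message.
     Calling this often (e.g. once per iteration of the client's main
     loop) keeps the queue empty, so no _gc_start message is still
     queued when the next collection begins. */
  static void drain_gc_start_messages(mps_arena_t arena)
  {
    mps_message_t message;
    while (mps_message_get(&message, arena,
                           mps_message_type_gc_start())) {
      printf("collection started: %s\n",
             mps_message_gc_start_why(arena, message));
      mps_message_discard(arena, message);
    }
  }
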
Analysis: RHSK 2008-11-28
Discovered when looking at the code. There is one GCSTART message,
statically allocated, per pre-allocated TraceStruct. If it's been
Posted but not yet Got, then it's on the arena message Ring. When
the trace ends, the TraceStruct is Del'd from arena->busyTraces,
even though the MessageStruct et al is still in use. When a new
trace starts, it re-uses that TraceStruct, treating it as
uninitialised memory, calls TraceCreate, which calls MessageInit,
which calls RingInit, thus corrupting the arena message Ring.
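
To make the corruption concrete, here is a stand-alone sketch using a
simplified Node type and ringInit/ringAppend functions invented for
illustration (not the real RingStruct operations in the MPS sources):
re-initialising a node that is still linked into a ring turns the node
into a ring of one while its neighbours continue to point at it.

  #include <assert.h>
  #include <stdio.h>

  /* Simplified stand-in for an MPS ring node: a circular
     doubly-linked list with a sentinel. */
  typedef struct Node { struct Node *next, *prev; } Node;

  static void ringInit(Node *node)            /* make node a ring of one */
  {
    node->next = node;
    node->prev = node;
  }

  static void ringAppend(Node *ring, Node *node)  /* insert node before ring */
  {
    node->next = ring;
    node->prev = ring->prev;
    ring->prev->next = node;
    ring->prev = node;
  }

  int main(void)
  {
    Node queue, gcStartMsg;          /* queue sentinel and one message */

    ringInit(&queue);
    ringInit(&gcStartMsg);
    ringAppend(&queue, &gcStartMsg); /* message Posted onto the queue */

    /* A new trace re-uses the TraceStruct and re-initialises the
       message's ring node while it is still linked into the queue: */
    ringInit(&gcStartMsg);

    /* The message node now points only at itself ... */
    assert(gcStartMsg.next == &gcStartMsg && gcStartMsg.prev == &gcStartMsg);
    /* ... but the queue sentinel still points at the message node,
       so walking the queue reaches a node that loops on itself. */
    assert(queue.next == &gcStartMsg && queue.prev == &gcStartMsg);
    printf("queue still reaches the re-initialised message node\n");
    return 0;
  }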

Now it's right and proper that GCSTART messages are pre-allocated,
in accordance with the principle that we avoid allocating when
starting a trace. But it's just wrong to statically allocate them:
they should be dynamically allocated with ControlAlloc, like other
messages. Yes: if the client enables GCSTART and never collects
the messages, then they will consume more & more memory. This is
just not a problem: the client should not fail to collect messages
it has enabled!

Surprisingly, the message queue limps on for a long time after
corruption with a single RingInited GCSTART message. Appending new
messages (e.g. finalization messages) works, and heals the
next-direction of the ring, which allows these messages to then be
Got as normal. Re-adding the same GCSTART message, when already
present, also 'works'. But RingRemove simply leaves the ring
unchanged; this causes an infinite loop.
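
Again as an illustrative sketch (same simplified Node type as above,
with a ringRemove invented to mirror the standard doubly-linked
unsplice), this shows why removal leaves the corrupted ring unchanged
and why a drain loop then never terminates:

  #include <stdio.h>

  typedef struct Node { struct Node *next, *prev; } Node;

  static void ringRemove(Node *old)   /* standard doubly-linked unsplice */
  {
    old->next->prev = old->prev;
    old->prev->next = old->next;
    old->next = old;                  /* re-initialise old as a ring of one */
    old->prev = old;
  }

  int main(void)
  {
    Node queue, msg;

    /* Reproduce the corrupted state: the message node points only at
       itself, but the queue sentinel still points at the node. */
    queue.next = &msg;  queue.prev = &msg;
    msg.next   = &msg;  msg.prev   = &msg;

    /* Because the node's neighbours are itself, the unsplice writes
       only the node's own fields and the queue is left unchanged. */
    ringRemove(&msg);
    printf("queue.next %s the message node\n",
           queue.next == &msg ? "still points at" : "no longer points at");

    /* So a drain loop such as
         while (queue.next != &queue) ringRemove(queue.next);
       never terminates; in varieties with asserts, a check like the
       one quoted above (!RingIsSingle(_old)) fails first. */
    return 0;
  }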

RHSK 2008-12-19
Fixed with a complete redesign of GC message lifecycle.
How found: inspection
Evidence: none
Observed in: 1.108.2
Introduced in: 1.107.0
Created by: Richard Kistruck
Created on: 2008-11-28 17:24:45
Last modified by: Gareth Rees
Last modified on: 2014-04-12 22:05:22
History: 2008-11-28 RHSK Created.
2008-12-19 RHSK Fixed with a complete redesign of GC message lifecycle.

Fixes

Change Effect Date User Description
166995 closed 2008-12-19 16:25:20 Richard Kistruck MPS br/timing design/message-gc: Complete re-write, for new GC message lifecycle. See job001989.
166993 closed 2008-12-19 14:27:48 Richard Kistruck MPS br/timing: rename z001989a.c as zmess.c
166919 closed 2008-12-11 16:08:47 Richard Kistruck MPS br/timing z001989a.c: oops, re-enable test of delayed getting
  of messages. Demonstrates fix for job001989.
166918 closed 2008-12-11 16:06:14 Richard Kistruck MPS br/timing TraceIdMessages: Create them at mps_arena_create time,
  re-create them immediately after a trace sends its last message.
  Store them, one per TraceId, in ArenaStruct. Cope correctly if they
  cannot be created (ControlAlloc fails). Destroy them correctly on
  mps_arena_destroy. See job001989 and design/message-gc#lifecycle
  (not yet written).
166881 open 2008-12-05 22:14:56 Richard Kistruck MPS br/timing z001989a: get collection begin/end messages.
  FAILS: thereby demonstrating defect job001989.