MPS issue job003539

Title	MPS pause times are not well regulated
Status	closed
Priority	optional
Assigned user	Richard Brooksby
Organization	Ravenbrook
Description	The MPS is overperforming on incremental short pause times on modern processors, causing hundreds or thousands of short pauses per second, when this isn't required for good user interaction and is generally inefficient due to the incremental bookkeeping overhead and context switching into and out of the MPS. Some preliminary hacks indicate that there could be a significant performance improvement by regulating pause times better <`https://info.ravenbrook.com/mail/2013/06/18/18-22-24/0/`>. The "phasers on stun" hack involved using the RDTSC timer to avoid returning to the mutator before 100ms of CPU time had passed (or collection completed). However this simplistic hack didn't regulate the gaps between pauses, so probably isn't sufficient for production. Non-interactive programs like Clasp don't want to pay the cost of incremental collection and barrier hits at all: they need maximum throughput instead. See [3].
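For illustration only, the "phasers on stun" idea described above (stay in the collector until a time budget has elapsed or the collection completes) might look roughly like the following sketch. Every name here (`collector`, `clock_now`, `collect_step`, `pause_with_budget`) is hypothetical and none of them is an MPS API. The clock is simulated so the example is deterministic, where the original hack read the RDTSC timer. Like the hack, this sketch does nothing to regulate the gaps between pauses.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for illustration; these are NOT MPS APIs. */
typedef uint64_t ticks_t;

typedef struct collector {
    ticks_t now;         /* simulated clock, in microseconds */
    unsigned work_left;  /* increments of collection work remaining */
    ticks_t step_cost;   /* simulated time consumed by one increment */
} collector;

/* Read the simulated clock. A real version might read RDTSC or a
 * CPU-time clock instead (assumption). */
static ticks_t clock_now(const collector *c)
{
    return c->now;
}

/* Perform one increment of collection work and advance the clock. */
static void collect_step(collector *c)
{
    if (c->work_left > 0) {
        c->work_left--;
        c->now += c->step_cost;
    }
}

/* Keep collecting until either the time budget has elapsed or the
 * collection has completed. Returns the number of increments done. */
static unsigned pause_with_budget(collector *c, ticks_t budget)
{
    ticks_t start = clock_now(c);
    unsigned steps = 0;

    assert(budget > 0);
    while (c->work_left > 0) {
        collect_step(c);
        steps++;
        if (clock_now(c) - start >= budget)
            break;  /* budget spent: return to the mutator */
    }
    return steps;
}
```

With a 100 ms budget (100000 simulated microseconds) and increments costing 1 ms each, a collection with 1000 increments outstanding does 100 of them before returning to the mutator, while a collection with only 5 increments outstanding finishes early.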
Analysis	High-level design in <`https://info.ravenbrook.com/mail/2013/07/05/00-46-27/0/`>. Göran is keen for "phasers on stun" (pause time regulation) in CVM and wants that to go ahead. DL cannot currently reproduce this speedup. At the moment the frequency of collections is determined by ArenaPollALLOCTIME, and there is a calculation as to how much work to do in a TraceQuantum. This currently works out at around 1 MB. Setting ArenaPollALLOCTIME to be larger doesn't appear to improve performance. Completing a collection in TraceQuantum also doesn't appear to affect performance. I just can't reproduce the speedup at present. DL did a further experiment of reducing the ALLOCTIME to 4096 (from 65536), and did get a 2-3% slowdown. It also seems to run marginally slower with a much larger value (655360). I think this parameter is fairly well tuned. It possibly saves 0.5% on startup time with a large value, but this is hard to measure accurately. I guess the 25% speedup seen was related to the performance problems we were having before we made some improvements (see job003536). I recommend we do not proceed with this at present. GDR 2014-05-15: After discussion with DL and RB this is back on the agenda. In the light of the conflicting findings reported above it's important to figure out what's going on. Maybe the experiments were carried out on different operating systems with different overheads for context switching and barrier hits?
How found	unknown
Evidence	[1] "Not the CET and MPS status report" <`https://info.ravenbrook.com/mail/2013/06/19/16-24-16/0/`> [2] MPS strategy discussion <`https://info.ravenbrook.com/mail/2014/05/15/19-19-13/0/`> [3] `https://info.ravenbrook.com/mail/2014/08/20/13-48-51/0/`
Created by	Richard Brooksby
Created on	2013-07-08 13:01:55
Last modified by	Richard Brooksby
Last modified on	2016-03-15 06:34:37
History	2013-07-08 RB Created to resolve job003534. 2013-11-22 DL Added to analysis. Can't repeat. 2013-12-10 DL Further test, can't repeat, but can get slow down. 2013-12-19 GDR Moved to MPS project and set to optional.

Fixes

Change	Effect	Date	User	Description
190053	closed	2016-03-15 06:31:08	Richard Brooksby	Merging branch/2016-03-12/pause into the master sources.