THE DESIGN OF A NEW MANUAL-VARIABLE MEMORY POOL CLASS design.mps.poolmvt draft design P T Withington 1998-02-13 INTRODUCTION: This is a second-generation design for a pool that manually manages variable-sized objects. It is intended as a replacement for poolmv (except in its control pool role) and poolepdl, and it is intended to satisfy the requirements of the Dylan "misc" pool and the product malloc/new drop-in replacement. [This form should include these fields, rather than me having to create them "by hand"] .readership: MM developers .source: req.dylan(6), req.epcore(16), req.product(2) .background: design.mps.poolmv(0), design.mps.poolepdl(0), design.product.soft.drop(0), paper.wil95(1), paper.vo96(0), paper.grun92(1), paper.beck82(0), mail.ptw.1998-02-25.22-18(0) .hist.-1: Initial email discussion mail.ptw.1998-02-04.21-27(0), ff. .hist.0: Draft created 1998-02-13 by P. T. Withington from email RFC mail.ptw.1998-02-12.03-36, ff. .hist.1: Revised 1998-04-01 in response to email RFC mail.ptw.1998-03-23.20-43(0), ff. .hist.2: Revised 1998-04-15 in response to email RFC mail.ptw.1998-04-13.21-40(0), ff. .hist.3: Erroneously incremented version number .hist.4: Revised 1998-05-06 in response to review review.design.mps.poolmv2.2 (0) DEFINITIONS .def.alignment: Alignment is a constraint on an object's address, typically to be a power of 2 (see also, glossary.alignment ) .def.bit-map: A bitmap is a boolean-valued vector (see also, glossary.bitmap ). .def.block: A block is a contiguous extent of memory. In this document, block is used to mean a contiguous extent of memory managed by the pool for the pool client, typically a subset of a segment (compare with .def.segment). .def.cartesian-tree: A cartesian tree is a binary tree ordered by two keys (paper.stephenson83(0)). .def.crossing-map: A mechanism that supports finding the start of an object from any address within the object, typically only required on untagged architectures (see also, glossary.crossing.map ). .def.footer: A block of descriptive information describing and immediately following another block of memory (see also .def.header). .def.fragmentation: Fragmented memory is memory reserved to the program but not usable by the program because of the arrangement of memory already in use (see also, glossary.fragmentation ). .def.header: A block of descriptive information describing and immediately preceding another block of memory (see also, glossary.in-band.header ). .def.in-band: From "in band signalling", when descriptive information about a data structure is stored in the data structure itself (see also, glossary.in-band.header ). .def.out-of-band: When descriptive information about a data structure is stored separately from the structure itself (see also, glossary.out-of-band.header ). .def.refcount: A refcount is a count of the number of users of an object (see also, glossary.reference.count ). .def.segment: A segment is a contiguous extent of memory. In this document, segment is used to mean a contiguous extent of memory managed by the MPS arena (design.mps.arena(1)) and subdivided by the pool to provide blocks (see .def.block) to its clients. .def.splay-tree: A splay tree is a self-adjusting binary tree (paper.st85(0), paper.sleator96(0)). .def.splinter: A splinter is a fragment of memory that is too small to be useful (see also, glossary.splinter ) .def.subblock: A subblock is a contiguous extent of memory. In this document, subblock is used to mean a contiguous extent of memory manage by the client for its own use, typically a subset of a block (compare with .def.block). ABBREVIATIONS .abbr.abq: ABQ = Available Block Queue .abbr.ap: AP = Allocation Point .abbr.cbs: CBS = Coalescing Block Structure .abbr.mps: MPS = Memory Pool System .abbr.mv: MV = Manual-Variable .abbr.ps: PS = PostScript OVERVIEW: mvt is intended to satisfy the requirements of the clients that need manual-variable pools, improving on the performance of the existing manual-variable pool implementations, and reducing the duplication of code that currently exists. The expected clients of mvt are: Dylan (currently for its misc pool), EP (particularly the dl pool, but all pools other than the PS object pool), and Product (initially the malloc/new pool, but also other manual pool classes). REQUIREMENTS: .req.cat: Requirements are categorized per guide.req(2). .req.risk: req.epcore(16) is known to be obsolete, but the revised document has not yet been accepted. Critical Requirements .req.fun.man-var: The pool class must support manual allocation and freeing of variable-sized blocks (source: req.dylan.fun.misc.alloc, req.epcore.fun.{dl,gen,tmp,stat,cache,trap}.{alloc,free}, req.product.fun.{malloc,new,man.man}). .non-req.fun.gc: There is not a requirement that the pool class support formatted objects, scanning, or collection objects; but it should not be arbitrarily precluded. .req.fun.align: The pool class must support aligned allocations to client-specified alignments. An individual instance need only support a single alignment; multiple instances may be used to support more than one alignment (source: req.epcore.attr.align). .req.fun.reallocate: The pool class must support resizing of allocated blocks (source req.epcore.fun.dl.promise.free, req.product.dc.env.{ansi-c,cpp}). .non-req.fun.reallocate.in-place: There is not a requirement blocks must be resized in place (where possible); but it seems like a good idea. .req.fun.thread: Each instance of the pool class must support multiple threads of allocation (source req.epcore.fun.dl.multi, req.product.dc.env.{ansi-c,cpp}). .req.attr.performance: The pool class must meet or exceed performance of "competitive" allocators (source: rec.epcore.attr.{run-time,tp}, req.product.attr.{mkt.eval, perform}). [Dylan does not seem to have any requirement that storage be allocated with a particular response time or throughput, just so long as we don't block for too long. Clearly there is a missing requirement.] .req.attr.performance.time: By inference, the time overhead must be competetive. .req.attr.performance.space: By inference, the space overhead must be competetive. .req.attr.reliability: The pool class must have "rock-solid reliability" (source: req.dylan.attr.rel.mtbf, req.epcore.attr.rel, req.product.attr.rel). .req.fun.range: The pool class must be able to manage blocks ranging in size from 1 byte to all of addressable memory (req.epcore.attr.{dl,gen,tmp,stat,cache,trap}.obj.{min,max}. The range requirement may be satisfied by multiple instances each managing a particular client-specified subrange of sizes. [Dylan has requirements req.dylan.attr.{capacity,obj.max}, but no requirement that such objects reside in a manual pool.] .req.fun.debug: The pool class must support debugging erroneous usage by client programs (source: req.epcore.fun.{dc.variety, debug.support}, req.product.attr.{mkt.eval,perform}). Debugging is permitted to incur additional overhead. .req.fun.debug.boundaries: The pool class must support checking for accesses outside the boundaries of live objects. .req.fun.debug.log: The pool class must support logging of all allocations and deallocations. .req.fun.debug.enumerate: The pool class must support examining all allocated objects. .req.fun.debug.free: The pool class must support detecting incorrect, overlapping, and double frees. .req.fun.tolerant: The pool class must support tolerance of erroneous usage (source req.product.attr.use.level.1). Essential Requirements .req.fun.profile: The pool class should support memory usage profiling (source: req.product.attr.{mkt.eval, perform}). .req.attr.flex: The pool class should be flexible so that it can be tuned to specific allocation and freeing patterns (source: req.product.attr.flex,req.epcore.attr.{dl,cache,trap}.typ). The flexibility requirement may be satisfied by multiple instances each optimizing a specific pattern. .req.attr.adapt: The pool class should be adaptive so that it can accommodate changing allocation and freeing patterns (source: req.epcore.fun.{tmp,stat}.policy, req.product.attr.{mkt.eval,perform}). Nice Requirements .req.fun.suballocate: The pool class may support freeing of any aligned, contiguous subset of an allocated block (source req.epcore.fun.dl.free.any, req.product.attr.{mkt.eval,perform}). ARCHITECTURE: .arch.overview: The pool has several layers: client allocation is by Allocation Points (APs). .arch.overview.ap: APs acquire storage from the pool available-block queue (ABQ). .arch.overview.abq: The ABQ holds blocks of a minimum configurable size: "reuse size". .arch.overview.storage: The ABQ acquires storage from the arena or from the coalescing-block structure (CBS). .arch.overview.storage.contiguous: The arena storage is requested to be contiguous to maximize opportunities for coalescing (Loci will be used when available). .arch.overview.cbs: The CBS holds blocks freed by the client until, through coalescing, they have reached the reuse size, at which point they are made available on the ABQ. .arch.ap: The pool will use allocation points as the allocation interface to the client. .arch.ap.two-phase: Allocation points will request blocks from the pool and suballocate those blocks (using the existing AP, compare and increment, 2-phase mechanism) to satisfy client requests. .arch.ap.fill: The pool will have a configurable "fill size" that will be the preferred size block used to fill the allocation point. .arch.ap.fill.size: The fill size should be chosen to amortize the cost of refill over a number of typical reserve/commit operations, but not so large as to exceed the typical object population of the pool. .arch.ap.no-fit: When an allocation does not fit in the remaining space of the allocation point, there may be a remaining fragment. .arch.ap.no-fit.sawdust: If the fragment is below a configurable threshold (minimum size), it will be left unused (but returned to the CBS so it will be reclaimed when adjacent objects are freed); .arch.ap.no-fit.splinter: otherwise, the remaining fragment will be (effectively) returned to the head of the available-block queue, so that it will be used as soon as possible (i.e., by objects of similar birthdate). .arch.ap.no-fit.oversize: If the requested allocation exceeds the fill size it is treated exceptionally (this may indicate the client has either misconfigured or misused the pool and should either change the pool configuration or create a separate pool for these exceptional objects for best performance). .arch.ap.no-fit.oversize.policy: Oversize blocks are assumed to have exceptional lifetimes, hence are allocated to one side and do not participate in the normal storage recycling of the pool. .arch.ap.refill.overhead: If reuse size is small, or becomes small due to .arch.adapt, all allocations will effectively be treated exceptionally (the AP will trip and a oldest-fit block will be chosen on each allocation). This mode will be within a constant factor in overhead of an unbuffered pool. .arch.abq: The available block queue holds blocks that have coalesced sufficiently to reach reuse size. arch.abq.reuse.size: A multiple of the quantum of virtual memory is used as the reuse size (.anal.policy.size). .arch.abq.fifo: It is a FIFO queue (recently coalesced blocks go to the tail of the queue, blocks are taken from the head of the queue for reuse). .arch.abq.delay-reuse: By thus delaying reuse, coalescing opportunities are greater. .arch.abq.high-water: It has a configurable high water mark, which when reached will cause blocks at the head of the queue to be returned to the arena, rather than reused. .arch.abq.return: When the MPS supports it, the pool will be able to return free blocks from the ABQ to the arena on demand. .arch.return.segment: arch.abq.return can be guaranteed to be able to return a segment by setting reuse size to twice the size of the segments the pool requests from the arena. .arch.cbs: The coalescing block structure holds blocks that have been freed by the client. .arch.cbs.optimize: The data structure is optimized for coalescing. .arch.cbs.abq: When a block reaches reuse size, it is added to the ABQ. .arch.cbs.data-structure: The data structures are organized so that a block can be on both the CBS and ABQ simultaneously to permit additional coalescing, up until the time the block is removed from the ABQ and assigned to an AP. .arch.fragmentation.internal: Internal fragmentation results from The pool will request large segments from the arena to minimize the internal fragmentation due to objects not crossing segment boundaries. .arch.modular: The architecture will be modular, to allow building variations on the pool by assembling different parts. .arch.modular.example: For example, it should be possible to build pools with any of the freelist mechanisms, with in-band or out-of-band storage (where applicable), that do or do not support derived object descriptions, etc. .arch.modular.initial: The initial architecture will use .mech.freelist.splay-tree for the CBS, .sol.mech.storage.out-of-band, .sol.mech.desc.derived, and .sol.mech.allocate.buffer. .arch.segregate: The architecture will support segregated allocation through the use of multiple allocation points. The client will choose the appropriate allocation point either at run time, or when possible, at compile time. .arch.segregate.initial: The initial architecture will segregate allocations into two classes: large and small. This will be implemented by creating two pools with different parameters. .arch.segregate.initial.choice: The initial architecture will provide glue code to choose which pool to allocate from at run time. If possible this glue code will be written in a way that a good compiler can optimize the selection of pool at compile time. Eventually this glue code should be subsumed by the client or generated automatically by a tool. .arch.debug: Debugging features such as tags, fenceposts, types, creators will be implemented in a layer above the pool and APs. A generic pool debugging interface will be developed to support debugging in this outer layer. .arch.debug.initial: The initial architecture will have counters for objects/bytes allocated/freed and support for detecting overlapping frees. .arch.dependency.loci: The architecture depends on the arena being able to efficiently provide segments of varying sizes without excessive fragmentation. The locus mechanism should satisfy this dependency. (See .anal.strategy.risk) .arch.dependency.mfs: The architecture internal data structures depend on efficient manual management of small, fixed-sized objects (2 different sizes). The MFS pool should satisfy this dependency. .arch.contingency: Since the strategy we propose is new, it may not work. .arch.contingency.pathalogical: In particular, pathological allocation patterns could result in fragmentation such that no blocks recycle from the CBS to ABQ. .arch.contingency.fallback: As a fallback, there will be a pool creation parameter for a high water mark for the CBS. .arch.contingency.fragmentation-limit: When the free space in the CBS as a percentage of all the memory managed by the pool (a measure of fragmentation) reaches that high water mark, the CBS will be searched oldest-fit before requesting additional segments from the arena. .arch.contingency.alternative: We also plan to implement .mech.freelist.cartesian-tree as an alternative CBS, which would permit more efficient searching of the CBS. .arch.parameters: The architecture supports several parameters so that multiple pools may be instantiated and tuned to support different object cohorts. The important parameters are: reuse size, minimum size, fill size, ABQ high water mark, CBS fragmentation limit (see .arch.contingency.fragmentation-limit). .arch.parameters.client-visible: The client-visible parameters of the pool are the minimum object size, the mean object size, the maximum object size, the reserve depth and fragmentation limit. The minimum object size determines when a splinter is kept on the head of the ABQ (.arch.ap.no-fit.splinter). The maximum object size determines the fill size (.arch.ap.fill-size) and hence when a block is allocated exceptionally (.arch.ap.no-fit.oversize). The mean object size is the most likely object size. The reserve depth is a measure of the hysteresis of the object population. The mean object size, reserve depth and, maximum object size are used to determine the size of the ABQ (.arch.abq.high-water). The fragmentation limit is used to determine when contingency mode is used to satisfy an allocation request (.arch.contingency). .arch.adapt: We believe that an important adaptation to explore is tying the reuse size inversely to the fragmentation (as measured in .arch.contingency.fragmentation-limit). .arch.adapt.reuse: By setting reuse size low when fragmentation is high, smaller blocks will be available for reuse, so fragmentation should diminish. .arch.adapt.overhead: This will result in higher overhead as the AP will need to be refilled more often, so reuse size should be raised again as fragmentation diminishes. .arch.adapt.oldest-fit: In the limit, if reuse size goes to zero, the pool will implement a "oldest-fit" policy: the oldest free block of sufficient size will be used for each allocation. .arch.adapt.risk: This adaptation is an experimental policy and should not be delivered to clients until thoroughly tested. ANALYSIS: .anal.discard: We have discarded many traditional solutions based on experience and analysis in paper.wil95(1). In particular, managing the free list as a linear list arranged by address or size and basing policy on searching such a linear list in a particular direction, from a particular starting point, using fit and/or immediacy as criteria. We believe that none of these solutions is derived from considering the root of the problem to be solved (as described in .strategy), although their behavior as analyzed by Wilson gives several insights. .anal.strategy: For any program to run in the minimum required memory (with minimal overhead -- we discard solutions such as compression for now), fragmentation must be eliminated. To eliminate fragmentation, simply place blocks in memory so that they die "in order" and can be immediately coalesced. This ideal is not achievable, but we believe we can find object attributes that correlate with deathtime and exploit them to approximate the ideal. Initially we believe birth time and type (as approximated by size) will be useful attributes to explore. .anal.strategy.perform: To meet .req.attr.performance, the implementation of .sol.strategy must be competitive in both time and space. .anal.strategy.risk: The current MPS segment substrate can cause internal fragmentation which an individual pool can do nothing about. We expect that request.epcore.170193.sugg.loci will be implemented to remove this risk. .anal.policy: Deferred coalescing, when taken to the extreme will not minimize the memory consumption of a program, as no memory would ever be reused. Eager reuse appears to lead to more fragmentation, whereas delayed reuse appears to reduce fragmentation (paper.wil95(1)). The systems studied by Wilson did not directly address deferring reuse. Our proposed policy is to reuse blocks when they reach a (configurable) size. We believe that this policy along with the policy of segregating allocations by death time, will greatly reduce fragmentation. .anal.policy.risk: This policy could lead to pathologic behavior if allocations cannot be successfully segregated. .anal.policy.allocate.segregate: This policy has some similarities to CustomAlloc (paper.grun92(1)). CustomAlloc segregates objects by size classes, and then within those classes chooses a different allocator depending on whether that size class has a stable or unstable population. Classes with stable population recycle storage within the class, whereas classes with unstable populations return their storage to the general allocation pool for possible reuse by another class. CustomAlloc, however, requires profiling the application and tuning the allocator according to those profiles. Although we intend to support such tuning, we do not want to require it. .anal.policy.reallocate: For reallocation, .fun.suballocate can be used to free the remainder if a block is made smaller. Doing so will cause the freed block to obey .sol.policy.allocate [i.e., the freed block will not be treated specially, it will be subject to the normal policy on reuse]. Copying can be used if a block is made larger. paper.vo96(0) reports success in over-allocating a block the first time it is resized larger, presumably because blocks that are resized once tend to be resized again and over-allocating may avoid a subsequent copy. If each object that will be reallocated can be given its own allocation point until its final reallocation, the allocation point can be used to hold released or spare storage. .anal.policy.size: We believe that this will take advantage of the underlying virtual memory system's ability to compact the physical memory footprint of the program by discarding free fragments that align with the virtual memory quantum. (In a VM system one can approximate compaction by sparse mapping. If every other page of a segment is unused, the unused pages can be unmapped, freeing up physical memory that can be mapped to a new contiguous vm range.) .anal.mech.freelist: The literature (paper.grun92(1), paper.vo96(0)) indicate that .sol.freelist.cartesian-tree provides a space-efficient implementation at some cost in speed. .sol.freelist.splay-tree is faster but less space-efficient. .sol.freelist.bitmap is unstudied. Many of the faster allocators maintain caches of free blocks by size to speed allocation of "popular" sizes. We intend to initially explore not doing so, as we believe that policy ultimately leads to fragmentation by mixing objects of varying death times. Instead we intend to use a free list mechanism to support fast coalescing, deferring reuse of blocks until a minimum size has been reached. anal.mech.allocate.optimize-small: Wilson (paper.wil95(1)) notes that small blocks typically have short lifetimes and that overall performance is improved if you optimize the management of small blocks, e.g., sol.mech.allocate.lookup-table for all small blocks. We believe that .sol.mech.allocate.buffer does exactly that. .anal.mech.allocate.optimize-new: Wilson (paper.wil95(1)) reports some benefit from "preserving wilderness", that is, when a block of memory must be requested from the system to satisfy an allocation, only the minimum amount of that block is used, the remainder is preserved (effectively by putting it at the tail of the free list). This mechanism may or may not implement .sol.policy.allocate. We believe a better mechanism is to choose to preserve or not, based on .sol.policy.allocate. IDEAS: .sol: Many solution ideas for manual management of variable-sized memory blocks are enumerated by paper.wil95(1). Here we list the most promising, and some of our own. Strategy .sol.strategy: To run a program in the minimal required memory, with minimal overhead, utilize memory efficiently. Memory becomes unusable when fragmented. Strategy is to minimize fragmentation. So place blocks where they won't cause fragmentation later. .sol.strategy.death: objects that will die together (in time) should be allocated together (in space); thus they will coalesce, reducing fragmentation. .sol.strategy.death.birth: assume objects allocated near each other in time will have similar deathtimes (paper.beck82(0)) .sol.strategy.death.type: assume objects of different type may have different deathtimes, even if born together .sol.strategy.death.predict: find and use program features to predict deathtimes .sol.strategy.reallocate: reallocation implies rebirth, or at least a change in lifetime .sol.strategy.debug: as much of the debugging functionality as possible should be implemented as a generally available MPS utility; the pool will provide support for debugging that would be expensive or impossible to allocate outside the pool Policy [Policy is an implementable decision procedure, hopefully approximating the strategy.] .sol.policy.reuse: defer reusing blocks, to encourage coalescing .sol.policy.split: when a block is split to satisfy an allocation, use the remainder as soon as possible .sol.policy.size: prevent .policy.reuse from consuming all of memory by choosing a (coalesced) block for reuse when it reaches a minimum size .sol.policy.size.fixed: use the quantum of virtual memory (e.g., one page) as minimum size .sol.policy.size.tune: allow tuning minimum size .sol.policy.size.adapt: adaptively change minimum size .sol.policy.allocate: allocate objects with similar birthdate and lifetime together .sol.policy.allocate.segregate: segregate allocations by type .sol.policy.allocate.segregate.size: use size as a substitute for type .sol.policy.allocate.segregate.tune: permit tuning of segregation .sol.policy.allocate.segregate.adapt: adaptively segregate allocations .sol.policy.reallocate: implement reallocation in a central mechanism outside of the pool, create a generic pool interface in support of same. .sol.policy.debug: implement a pool debugging interface .sol.policy.debug.counters: implement debugging counters in the pool that are queried with a generic interface .sol.policy.debug.verify: implement debugging error returns on overlapping frees Mechanism [Mechanisms are algorithms or data structures used to implement policy.] .sol.mech.free-list: mechanisms that can be used to describe the free list .sol.mech.free-list.cartesian-tree: Using address and size as keys supports fast coalescing of adjacent blocks and fast searching for optimal-sized blocks. Unfortunately, because the shape of the tree is constrained by the second key, it can become unbalanced. This data structure is used in the SunOS 4.1 malloc (paper.grun92(1)). .sol.mech.free-list.splay-tree: The amortized cost of a splay tree is competitive with balanced binary trees in the worst case, but can be significantly better for regular patterns of access because recently-accessed keys are moved to the root of the tree and hence can be re-accessed quickly. This data structure is used in the System Vr4 malloc (paper.vo96(0)). (For a complete analysis of the splay tree algorithm time bounds see paper.st85(0).) .sol.mech.free-list.bit-map: Using address as an index and fix-sized blocks, the booleans can represent whether a block is free or not. Adjacent blocks can be used to construct larger blocks. Efficient algorithms for searching for runs in a vector are known. This data structure is used in many file system disk block managers. .sol.mech.free-list.refcount: A count of the number of allocated but not freed subblocks of a block can be used to determine when a block is available for reuse. This is an extremely compact data structure, but does not support subblock reuse. .sol.mech.free-list.hybrid: Bitmaps appear suited particularly to managing small, contiguous blocks. The tree structures appear suited particularly to managing varying-sized, discontiguous blocks. A refcount can be very efficient if objects can be placed accurately according to death time. A hybrid mechanism may offer better performance for a wider range of situations. .sol.mech.storage: methods that can be used to store the free list description .sol.mech.storage.in-band: The tree data structures are amenable to being stored in the free blocks themselves, minimizing the space overhead of management. To do so imposes a minimum size on free blocks and reduces the locality of the data structure. .sol.mech.storage.out-of-band: The bit-map data structure must be stored separately. .sol.mech.desc: for an allocated block to be freed, its base and bound must be known .sol.mech.desc.derived: Most clients can supply the base of the block. Some clients can supply the bound. .sol.mech.desc.in-band: When the bound cannot be supplied, it can be stored as an in-band "header". If neither the base nor bound can be supplied (e.g., the client may only have an interior pointer to the block), a header and footer may be required. .sol.mech.desc.out-of-band: In un-tagged architectures, it may be necessary to store the header and footer out-of-band to distinguish them from client data. Out-of-band storage can improve locality and reliability. Any of the free-list structures can also be used to describe allocated blocks out-of-band. .sol.mech.desc.crossing-map: An alternative for untagged architectures is to store a "crossing map" which records an encoding of the start of objects and then store the descriptive information in-band. .sol.mech.allocate: mechanisms that can be used to allocate blocks (these typically sit on top of a more general free-list manager) .sol.mech.allocate.lookup-table: Use a table of popular sizes to cache free blocks of those sizes. .sol.mech.allocate.buffer: Allocate from contiguous blocks using compare and increment. .sol.mech.allocate.optimize-small: Use a combination of techniques to ensure the time spent managing a block is small relative to the block's lifetime; assume small blocks typically have short lifetimes. .sol.mech.allocate.optimize-new: When "virgin" memory is acquired from the operating system to satisfy a request, try to preserve it (i.e., use only what is necessary) .sol.mech.allocate.segregate.size: use size as a substitute for type .sol.mech.reallocate: use .req.fun.suballocate to return unused memory when a block shrinks, but differentiate this from an erroneous overlapping free by using separate interfaces. IMPLEMENTATION: The implementation consists of the following separable modules: Coalescing Block Structure .impl.c.cbs: The initial implementation will use .sol.mech.free-list.splay-tree and sol.mech.storage.out-of-band. For locality, this storage should be managed as a linked free list of splay nodes suballocated from blocks acquired from a pool shared by all CBS's. Must support creation and destruction of an empty tree. Must support search, insert and delete by key of type Addr. Must support finding left and right neighbors of a failed search for a key. Must support iterating over the elements of the tree with reasonable efficiency. Must support storing and retrieving a value of type Size associated with the key. Standard checking and description should be provided. See design.mps.splay(0) and design.mps.cbs(0). Available Block Queue .impl.c.abq: The initial implementation will be a queue of fixed size (determined at pool creation time from the high water mark). Must support creation and destruction of an empty queue. Must support insertion at the head or tail of the queue (failing if full), peeking at the head of the queue, and removal of the head (failing if empty) or any element of the queue (found by a search). Standard checking and description should be provided. Pool Implementation .impl.c: The initial implementation will use the above modules to implement a buffered pool. Must support creation and destruction of the pool. Creation takes parameters: minimum size, mean size, maximum size, reserve depth and fragmentation limit. Minimum, mean, and maximum size are used to calculate the internal fill and reuse sizes. Reserve depth and mean size are used to calculate the ABQ high water mark. Fragmentation limit is used to set the CBS contingency mode. Must support buffer initialization, filling and emptying. Must support freeing. Standard checking and description should be provided. [Eventually, it should support scanning, so it can be used with collected pools, but no manual pool currently does.] .impl.c.future: The implementation should not preclude "buffered free" (mail.ptw.1997-12-05.19-07(0), ff.) being added in the future. .impl.c.parameters: The pool parameters are calculated as follows from the input parameters: minimum, mean, and maximum size are taked directly from the parameters. .impl.c.parameter.fill-size: The fill size is set to the maximum size times the reciprocal of the fragmentation limit, aligned to the arena alignment. .imple.c.parameter.reuse-size: The reuse size is set to twice the fill size (see .arch.abq.return.segment, .impl.c.free.merge.segment). .impl.c.parameter.abq-limit: The ABQ high-water limit is set to the reserve depth times the mean size (that is, the queue should hold as many reuse blocks as would take to cover the population hysteresis if the population consisted solely of mean-sized blocks, see .arch.abq.high-water). .impl.c.parameter.avail-limit: The CBS high-water limit is implemented by comparing the available free space to an "available limit". The available limit is updated each time a segment is allocated from or returned to the arena by setting it to the total size of the pool times the fragmentation limit divide vy 100 (see .arch.contingency.fallback). .impl.c.ap.fill: An AP fill request will be handled as follows: o If the request is larger than fill size, attempt to request a segment from the arena sufficient to satisfy the request o Use any previously returned splinter (from .impl.c.ap.empty), if large enough o Attempt to retrieve a free block from the head of the ABQ (removing it from ABQ and CBS if found). o If above fragmentation limit, attempt to find a block on the CBS, using oldest-fit search o Attempt to request a segment of fill size from the arena o Attempt to find a block on the CBS, using oldest-fit search o Otherwise, fail .impl.c.ap.empty: An AP empty request will be handled as follows: o If remaining free is less than min size, return it to the CBS o If the remaining free is larger than any previous splinter, return that splinter to the CBS and save this one for use by a subsequent fill o Otherwise return the remaining block to the CBS .impl.c.free: When blocks are returned to the CBS a search is made for adjacent blocks that can be merged. If not, the block is simply inserted in the CBS. If a merge occurs between two blocks on the ABQ, the ABQ must be adjusted to reflect the merge. .impl.c.free.exception: Exceptional blocks are returned directly to the arena. .impl.c.free.merge: If a merge occurs and the merged block is larger than reuse size: o If the ABQ is full, remove the block at the head of the ABQ from the ABQ and CBS and return it to the arena(*) o Insert the newly merged block at the tail of the ABQ, leaving it on the CBS for further merging .impl.c.free.merge.segment: (*) Merged blocks may not align with arena segments. If necessary, return the interior segments of a block to the arena and return the splinters to the CBS. .impl.c.free.merge.segment.reuse: If the reuse size (the size at which blocks recycle from the CBS to the ABQ) is at least twice the fill size (the size of segments the pool allocates from the arena), we can guarantee that there will always be a returnable segment in every ABQ block. .impl.c.free.merge.segment.overflow: If the reuse size is set smaller (see .arch.adapt), there may not be a returnable segment in an ABQ block, in which case the ABQ has "overflowed". Whenever this occurs, the ABQ will be refilled by searching the CBS for dropped reusable blocks when needed. .impl.c.free.merge.segment.risk: The current segment structure does not really support what we would like to do. Loci should do better: support reserving contiguous address space and mapping/unmapping any portion of that address space. .impl.c.free.merge.alternative: Alternatively, if the MPS segment substrate permitted mapping/unmapping of pages, the pool could use very large segments and map/unmap pages as needed. AP Dispatch .impl.c.multiap: The initial implementation will be a glue layer that selects among several AP's for allocation according to the predicted deathtime (as approximated by size) of the requested allocation. Each AP will be filled from a pool instance tuned to the range of object sizes expected to be allocated from that AP. [For bonus points provide an interface that creates a batch of pools and AP's according to some set of expected object sizes. Eventually expand to understand object lifetimes and general lifetime prediction keys.] impl.c.multiap.sample-code: This glue code is not properly part of the pool or MPS interface. It is a layer on top of the MPS interface, intended as sample code for unsophisticated clients. Sophisticated clients will likely want to choose among multiple AP's more directly. TESTING: .test.component: Components .impl.c.splay, .impl.c.cbs, and .impl.c.abq will be subjected to individual component tests to verify their functionality. .test.regression: All tests applied to poolmv (design.mps.poolmv(0)) and poolepdl (design.mps.poolepdl(0)) will be applied to poolmvt to ensure that mvt is at least as functional as the pools it is replacing. .test.qa: Once poolmvt is integrated into the MPS, the standard MPS QA tests will be applied to poolmvt prior to each release. .test.customer: Customer acceptance tests will be performed on a per-customer basis before release to that customer (cf., proc.release.epcore(2).test) TEXT: Possible tweaks (from mail.pekka.1998-04-15.13-10(0)): 1. Try to coalesce splinters returned from AP's with the front (or any) block on the ABQ. 2. Sort ABQ in some other way to minimize splitting/splinters. E.g., proximity to recently allocated blocks.
2002-06-07 | RB | Converted from MMInfo database design document. |
This document is copyright © 1995-2002 Ravenbrook Limited. All rights reserved. This is an open source license. Contact Ravenbrook for commercial licensing options.
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
This software is provided by the copyright holders and contributors "as is" and any express or implied warranties, including, but not limited to, the implied warranties of merchantability, fitness for a particular purpose, or non-infringement, are disclaimed. In no event shall the copyright holders and contributors be liable for any direct, indirect, incidental, special, exemplary, or consequential damages (including, but not limited to, procurement of substitute goods or services; loss of use, data, or profits; or business interruption) however caused and on any theory of liability, whether in contract, strict liability, or tort (including negligence or otherwise) arising in any way out of the use of this software, even if advised of the possibility of such damage.
$Id: //info.ravenbrook.com/project/mps/version/1.105/design/poolmvt/index.html#1 $
Ravenbrook / Projects / Memory Pool System / Master Product Sources / Design Documents