src/backend/utils/mmgr/README

   1 src/backend/utils/mmgr/README
   2
   3 Memory Context System Design Overview
   4 =====================================
   5
   6 Background
   7 ----------
   8
   9 We do most of our memory allocation in "memory contexts", which are usually
  10 AllocSets as implemented by src/backend/utils/mmgr/aset.c.  The key to
  11 successful memory management without lots of overhead is to define a useful
  12 set of contexts with appropriate lifespans.
  13
  14 The basic operations on a memory context are:
  15
  16 * create a context
  17
  18 * allocate a chunk of memory within a context (equivalent of standard
  19   C library's malloc())
  20
  21 * delete a context (including freeing all the memory allocated therein)
  22
  23 * reset a context (free all memory allocated in the context, but not the
  24   context object itself)
  25
  26 * inquire about the total amount of memory allocated to the context
  27   (the raw memory from which the context allocates chunks; not the
  28   chunks themselves)
  29
  30 Given a chunk of memory previously allocated from a context, one can
  31 free it or reallocate it larger or smaller (corresponding to standard C
  32 library's free() and realloc() routines).  These operations return memory
  33 to or get more memory from the same context the chunk was originally
  34 allocated in.
  35
  36 At all times there is a "current" context denoted by the
  37 CurrentMemoryContext global variable.  palloc() implicitly allocates space
  38 in that context.  The MemoryContextSwitchTo() operation selects a new current
  39 context (and returns the previous context, so that the caller can restore the
  40 previous context before exiting).
  41
  42 The main advantage of memory contexts over plain use of malloc/free is
  43 that the entire contents of a memory context can be freed easily, without
  44 having to request freeing of each individual chunk within it.  This is
  45 both faster and more reliable than per-chunk bookkeeping.  We use this
  46 fact to clean up at transaction end: by resetting all the active contexts
  47 of transaction or shorter lifespan, we can reclaim all transient memory.
  48 Similarly, we can clean up at the end of each query, or after each tuple
  49 is processed during a query.
  50
  51
  52 Some Notes About the palloc API Versus Standard C Library
  53 ---------------------------------------------------------
  54
  55 The behavior of palloc and friends is similar to the standard C library's
  56 malloc and friends, but there are some deliberate differences too.  Here
  57 are some notes to clarify the behavior.
  58
  59 * If out of memory, palloc and repalloc exit via elog(ERROR).  They
  60 never return NULL, and it is not necessary or useful to test for such
  61 a result.  With palloc_extended() that behavior can be overridden
  62 using the MCXT_ALLOC_NO_OOM flag.
  63
  64 * palloc(0) is explicitly a valid operation.  It does not return a NULL
  65 pointer, but a valid chunk of which no bytes may be used.  However, the
  66 chunk might later be repalloc'd larger; it can also be pfree'd without
  67 error.  Similarly, repalloc allows realloc'ing to zero size.
  68
  69 * pfree and repalloc do not accept a NULL pointer.  This is intentional.
  70 (For repalloc, this is necessary: As mentioned above, repalloc does
  71 not depend on the current memory context.  But then it needs to know
  72 which memory context to do the allocation in.  So the first allocation
  73 has to be done outside of repalloc.  For pfree, this behavior is
  74 mostly historical and partially because the extra check would impact
  75 performance.)
  76
  77
  78 The Current Memory Context
  79 --------------------------
  80
  81 Because it would be too much notational overhead to always pass an
  82 appropriate memory context to called routines, there always exists the
  83 notion of the current memory context CurrentMemoryContext.  Without it,
  84 for example, the copyObject routines would need to be passed a context, as
  85 would function execution routines that return a pass-by-reference
  86 datatype.  Similarly for routines that temporarily allocate space
  87 internally, but don't return it to their caller?  We certainly don't
  88 want to clutter every call in the system with "here is a context to
  89 use for any temporary memory allocation you might want to do".
  90
  91 The upshot of that reasoning, though, is that CurrentMemoryContext should
  92 generally point at a short-lifespan context if at all possible.  During
  93 query execution it usually points to a context that gets reset after each
  94 tuple.  Only in *very* circumscribed code should it ever point at a
  95 context having greater than transaction lifespan, since doing so risks
  96 permanent memory leaks.
  97
  98
  99 pfree/repalloc Do Not Depend On CurrentMemoryContext
 100 ----------------------------------------------------
 101
 102 pfree() and repalloc() can be applied to any chunk whether it belongs
 103 to CurrentMemoryContext or not --- the chunk's owning context will be
 104 invoked to handle the operation, regardless.
 105
 106
 107 "Parent" and "Child" Contexts
 108 -----------------------------
 109
 110 If all contexts were independent, it'd be hard to keep track of them,
 111 especially in error cases.  That is solved by creating a tree of
 112 "parent" and "child" contexts.  When creating a memory context, the
 113 new context can be specified to be a child of some existing context.
 114 A context can have many children, but only one parent.  In this way
 115 the contexts form a forest (not necessarily a single tree, since there
 116 could be more than one top-level context; although in current practice
 117 there is only one top context, TopMemoryContext).
 118
 119 Deleting a context deletes all its direct and indirect children as
 120 well.  When resetting a context it's almost always more useful to
 121 delete child contexts, thus MemoryContextReset() means that, and if
 122 you really do want a tree of empty contexts you need to call
 123 MemoryContextResetOnly() plus MemoryContextResetChildren().
 124
 125 These features allow us to manage a lot of contexts without fear that
 126 some will be leaked; we only need to keep track of one top-level
 127 context that we are going to delete at transaction end, and make sure
 128 that any shorter-lived contexts we create are descendants of that
 129 context.  Since the tree can have multiple levels, we can deal easily
 130 with nested lifetimes of storage, such as per-transaction,
 131 per-statement, per-scan, per-tuple.  Storage lifetimes that only
 132 partially overlap can be handled by allocating from different trees of
 133 the context forest (there are some examples in the next section).
 134
 135 For convenience we also provide operations like "reset/delete all children
 136 of a given context, but don't reset or delete that context itself".
 137
 138
 139 Memory Context Reset/Delete Callbacks
 140 -------------------------------------
 141
 142 A feature introduced in Postgres 9.5 allows memory contexts to be used
 143 for managing more resources than just plain palloc'd memory.  This is
 144 done by registering a "reset callback function" for a memory context.
 145 Such a function will be called, once, just before the context is next
 146 reset or deleted.  It can be used to give up resources that are in some
 147 sense associated with an object allocated within the context.  Possible
 148 use-cases include
 149 * closing open files associated with a tuplesort object;
 150 * releasing reference counts on long-lived cache objects that are held
 151   by some object within the context being reset;
 152 * freeing malloc-managed memory associated with some palloc'd object.
 153 That last case would just represent bad programming practice for pure
 154 Postgres code; better to have made all the allocations using palloc,
 155 in the target context or some child context.  However, it could well
 156 come in handy for code that interfaces to non-Postgres libraries.
 157
 158 Any number of reset callbacks can be established for a memory context;
 159 they are called in reverse order of registration.  Also, callbacks
 160 attached to child contexts are called before callbacks attached to
 161 parent contexts, if a tree of contexts is being reset or deleted.
 162
 163 The API for this requires the caller to provide a MemoryContextCallback
 164 memory chunk to hold the state for a callback.  Typically this should be
 165 allocated in the same context it is logically attached to, so that it
 166 will be released automatically after use.  The reason for asking the
 167 caller to provide this memory is that in most usage scenarios, the caller
 168 will be creating some larger struct within the target context, and the
 169 MemoryContextCallback struct can be made "for free" without a separate
 170 palloc() call by including it in this larger struct.
 171
 172
 173 Memory Contexts in Practice
 174 ===========================
 175
 176 Globally Known Contexts
 177 -----------------------
 178
 179 There are a few widely-known contexts that are typically referenced
 180 through global variables.  At any instant the system may contain many
 181 additional contexts, but all other contexts should be direct or indirect
 182 children of one of these contexts to ensure they are not leaked in event
 183 of an error.
 184
 185 TopMemoryContext --- this is the actual top level of the context tree;
 186 every other context is a direct or indirect child of this one.  Allocating
 187 here is essentially the same as "malloc", because this context will never
 188 be reset or deleted.  This is for stuff that should live forever, or for
 189 stuff that the controlling module will take care of deleting at the
 190 appropriate time.  An example is fd.c's tables of open files.  Avoid
 191 allocating stuff here unless really necessary, and especially avoid
 192 running with CurrentMemoryContext pointing here.
 193
 194 PostmasterContext --- this is the postmaster's normal working context.
 195 After a backend is spawned, it can delete PostmasterContext to free its
 196 copy of memory the postmaster was using that it doesn't need.
 197 Note that in non-EXEC_BACKEND builds, the postmaster's copy of pg_hba.conf
 198 and pg_ident.conf data is used directly during authentication in backend
 199 processes; so backends can't delete PostmasterContext until that's done.
 200 (The postmaster has only TopMemoryContext, PostmasterContext, and
 201 ErrorContext --- the remaining top-level contexts are set up in each
 202 backend during startup.)
 203
 204 CacheMemoryContext --- permanent storage for relcache, catcache, and
 205 related modules.  This will never be reset or deleted, either, so it's
 206 not truly necessary to distinguish it from TopMemoryContext.  But it
 207 seems worthwhile to maintain the distinction for debugging purposes.
 208 (Note: CacheMemoryContext has child contexts with shorter lifespans.
 209 For example, a child context is the best place to keep the subsidiary
 210 storage associated with a relcache entry; that way we can free rule
 211 parsetrees and so forth easily, without having to depend on constructing
 212 a reliable version of freeObject().)
 213
 214 MessageContext --- this context holds the current command message from the
 215 frontend, as well as any derived storage that need only live as long as
 216 the current message (for example, in simple-Query mode the parse and plan
 217 trees can live here).  This context will be reset, and any children
 218 deleted, at the top of each cycle of the outer loop of PostgresMain.  This
 219 is kept separate from per-transaction and per-portal contexts because a
 220 query string might need to live either a longer or shorter time than any
 221 single transaction or portal.
 222
 223 TopTransactionContext --- this holds everything that lives until end of the
 224 top-level transaction.  This context will be reset, and all its children
 225 deleted, at conclusion of each top-level transaction cycle.  In most cases
 226 you don't want to allocate stuff directly here, but in CurTransactionContext;
 227 what does belong here is control information that exists explicitly to manage
 228 status across multiple subtransactions.  Note: this context is NOT cleared
 229 immediately upon error; its contents will survive until the transaction block
 230 is exited by COMMIT/ROLLBACK.
 231
 232 CurTransactionContext --- this holds data that has to survive until the end
 233 of the current transaction, and in particular will be needed at top-level
 234 transaction commit.  When we are in a top-level transaction this is the same
 235 as TopTransactionContext, but in subtransactions it points to a child context.
 236 It is important to understand that if a subtransaction aborts, its
 237 CurTransactionContext is thrown away after finishing the abort processing;
 238 but a committed subtransaction's CurTransactionContext is kept until top-level
 239 commit (unless of course one of the intermediate levels of subtransaction
 240 aborts).  This ensures that we do not keep data from a failed subtransaction
 241 longer than necessary.  Because of this behavior, you must be careful to clean
 242 up properly during subtransaction abort --- the subtransaction's state must be
 243 delinked from any pointers or lists kept in upper transactions, or you will
 244 have dangling pointers leading to a crash at top-level commit.  An example of
 245 data kept here is pending NOTIFY messages, which are sent at top-level commit,
 246 but only if the generating subtransaction did not abort.
 247
 248 PortalContext --- this is not actually a separate context, but a
 249 global variable pointing to the per-portal context of the currently active
 250 execution portal.  This can be used if it's necessary to allocate storage
 251 that will live just as long as the execution of the current portal requires.
 252
 253 ErrorContext --- this permanent context is switched into for error
 254 recovery processing, and then reset on completion of recovery.  We arrange
 255 to have a few KB of memory available in it at all times.  In this way, we
 256 can ensure that some memory is available for error recovery even if the
 257 backend has run out of memory otherwise.  This allows out-of-memory to be
 258 treated as a normal ERROR condition, not a FATAL error.
 259
 260
 261 Contexts For Prepared Statements And Portals
 262 --------------------------------------------
 263
 264 A prepared-statement object has an associated private context, in which
 265 the parse and plan trees for its query are stored.  Because these trees
 266 are read-only to the executor, the prepared statement can be re-used many
 267 times without further copying of these trees.
 268
 269 An execution-portal object has a private context that is referenced by
 270 PortalContext when the portal is active.  In the case of a portal created
 271 by DECLARE CURSOR, this private context contains the query parse and plan
 272 trees (there being no other object that can hold them).  Portals created
 273 from prepared statements simply reference the prepared statements' trees,
 274 and don't actually need any storage allocated in their private contexts.
 275
 276
 277 Logical Replication Worker Contexts
 278 -----------------------------------
 279
 280 ApplyContext --- permanent during whole lifetime of apply worker.  It
 281 is possible to use TopMemoryContext here as well, but for simplicity
 282 of memory usage analysis we spin up different context.
 283
 284 ApplyMessageContext --- short-lived context that is reset after each
 285 logical replication protocol message is processed.
 286
 287
 288 Transient Contexts During Execution
 289 -----------------------------------
 290
 291 When creating a prepared statement, the parse and plan trees will be built
 292 in a temporary context that's a child of MessageContext (so that it will
 293 go away automatically upon error).  On success, the finished plan is
 294 copied to the prepared statement's private context, and the temp context
 295 is released; this allows planner temporary space to be recovered before
 296 execution begins.  (In simple-Query mode we don't bother with the extra
 297 copy step, so the planner temp space stays around till end of query.)
 298
 299 The top-level executor routines, as well as most of the "plan node"
 300 execution code, will normally run in a context that is created by
 301 ExecutorStart and destroyed by ExecutorEnd; this context also holds the
 302 "plan state" tree built during ExecutorStart.  Most of the memory
 303 allocated in these routines is intended to live until end of query,
 304 so this is appropriate for those purposes.  The executor's top context
 305 is a child of PortalContext, that is, the per-portal context of the
 306 portal that represents the query's execution.
 307
 308 The main memory-management consideration in the executor is that
 309 expression evaluation --- both for qual testing and for computation of
 310 targetlist entries --- needs to not leak memory.  To do this, each
 311 ExprContext (expression-eval context) created in the executor has a
 312 private memory context associated with it, and we switch into that context
 313 when evaluating expressions in that ExprContext.  The plan node that owns
 314 the ExprContext is responsible for resetting the private context to empty
 315 when it no longer needs the results of expression evaluations.  Typically
 316 the reset is done at the start of each tuple-fetch cycle in the plan node.
 317
 318 Note that this design gives each plan node its own expression-eval memory
 319 context.  This appears necessary to handle nested joins properly, since
 320 an outer plan node might need to retain expression results it has computed
 321 while obtaining the next tuple from an inner node --- but the inner node
 322 might execute many tuple cycles and many expressions before returning a
 323 tuple.  The inner node must be able to reset its own expression context
 324 more often than once per outer tuple cycle.  Fortunately, memory contexts
 325 are cheap enough that giving one to each plan node doesn't seem like a
 326 problem.
 327
 328 A problem with running index accesses and sorts in a query-lifespan context
 329 is that these operations invoke datatype-specific comparison functions,
 330 and if the comparators leak any memory then that memory won't be recovered
 331 till end of query.  The comparator functions all return bool or int32,
 332 so there's no problem with their result data, but there can be a problem
 333 with leakage of internal temporary data.  In particular, comparator
 334 functions that operate on TOAST-able data types need to be careful
 335 not to leak detoasted versions of their inputs.  This is annoying, but
 336 it appeared a lot easier to make the comparators conform than to fix the
 337 index and sort routines, so that's what was done for 7.1.  This remains
 338 the state of affairs in btree and hash indexes, so btree and hash support
 339 functions still need to not leak memory.  Most of the other index AMs
 340 have been modified to run opclass support functions in short-lived
 341 contexts, so that leakage is not a problem; this is necessary in view
 342 of the fact that their support functions tend to be far more complex.
 343
 344 There are some special cases, such as aggregate functions.  nodeAgg.c
 345 needs to remember the results of evaluation of aggregate transition
 346 functions from one tuple cycle to the next, so it can't just discard
 347 all per-tuple state in each cycle.  The easiest way to handle this seems
 348 to be to have two per-tuple contexts in an aggregate node, and to
 349 ping-pong between them, so that at each tuple one is the active allocation
 350 context and the other holds any results allocated by the prior cycle's
 351 transition function.
 352
 353 Executor routines that switch the active CurrentMemoryContext may need
 354 to copy data into their caller's current memory context before returning.
 355 However, we have minimized the need for that, because of the convention
 356 of resetting the per-tuple context at the *start* of an execution cycle
 357 rather than at its end.  With that rule, an execution node can return a
 358 tuple that is palloc'd in its per-tuple context, and the tuple will remain
 359 good until the node is called for another tuple or told to end execution.
 360 This parallels the situation with pass-by-reference values at the table
 361 scan level, since a scan node can return a direct pointer to a tuple in a
 362 disk buffer that is only guaranteed to remain good that long.
 363
 364 A more common reason for copying data is to transfer a result from
 365 per-tuple context to per-query context; for example, a Unique node will
 366 save the last distinct tuple value in its per-query context, requiring a
 367 copy step.
 368
 369
 370 Mechanisms to Allow Multiple Types of Contexts
 371 ----------------------------------------------
 372
 373 To efficiently allow for different allocation patterns, and for
 374 experimentation, we allow for different types of memory contexts with
 375 different allocation policies but similar external behavior.  To
 376 handle this, memory allocation functions are accessed via function
 377 pointers, and we require all context types to obey the conventions
 378 given here.
 379
 380 A memory context is represented by struct MemoryContextData (see
 381 memnodes.h). This struct identifies the exact type of the context, and
 382 contains information common between the different types of
 383 MemoryContext like the parent and child contexts, and the name of the
 384 context.
 385
 386 This is essentially an abstract superclass, and the behavior is
 387 determined by the "methods" pointer which references which set of
 388 MemoryContextMethods are to be used.  Specific memory context types will
 389 use derived structs having these fields as their first fields.  All the
 390 contexts of a specific type will have methods pointers that point to
 391 the corresponding element in the mcxt_methods[] array as defined in mcxt.c.
 392
 393 While operations like allocating from and resetting a context take the
 394 relevant MemoryContext as a parameter, operations like free and
 395 realloc are trickier.  To make those work, we require all memory
 396 context types to produce allocated chunks that are immediately,
 397 without any padding, preceded by a uint64 value of which the least
 398 significant 3 bits are set to the owning context's MemoryContextMethodID.
 399 This allows the code to determine the correct MemoryContextMethods to
 400 use by looking up the mcxt_methods[] array using the 3 bits as an index
 401 into that array.
 402
 403 If a type of allocator needs additional information about its chunks,
 404 like e.g. the size of the allocation, that information can in turn
 405 either be encoded into the remaining 61 bits of the preceding uint64 value
 406 or if more space is required, additional values may be stored directly prior
 407 to the uint64 value.  It is up to the context implementation to manage this.
 408
 409 Given this, routines like pfree can determine which set of
 410 MemoryContextMethods to call the free_p function for by calling
 411 GetMemoryChunkMethodID() and finding the corresponding MemoryContextMethods
 412 in the mcxt_methods[] array.  For convenience, the MCXT_METHOD() macro is
 413 provided, making the code as simple as:
 414
 415 void
 416 pfree(void *pointer)
 417 {
 418         MCXT_METHOD(pointer, free_p)(pointer);
 419 }
 420
 421 All of the current memory contexts make use of the MemoryChunk header type
 422 which is defined in memutils_memorychunk.h.  This suits all of the existing
 423 context types well as it makes use of the remaining 61-bits of the uint64
 424 header to efficiently encode the size of the chunk of memory (or freelist
 425 index, in the case of aset.c) and the number of bytes which must be subtracted
 426 from the chunk in order to obtain a reference to the block that the chunk
 427 belongs to.  30 bits are used for each of these.  If more than 30 bits are
 428 required then the memory context must manage that itself.  This can be done by
 429 calling the MemoryChunkSetHdrMaskExternal() function on the given chunk.
 430 Currently, each memory context type stores large allocations on dedicated
 431 blocks (which always contain only a single chunk).  For these, finding the
 432 block is simple as we know that the chunk must be the first on the given
 433 block, so the block is always at a fixed offset to the chunk.  For these,
 434 finding the size of the chunk is also simple as the block always stores an
 435 endptr which we can use to calculate the size of the chunk.
 436
 437 More Control Over aset.c Behavior
 438 ---------------------------------
 439
 440 By default aset.c always allocates an 8K block upon the first
 441 allocation in a context, and doubles that size for each successive
 442 block request.  That's good behavior for a context that might hold
 443 *lots* of data.  But if there are dozens if not hundreds of smaller
 444 contexts in the system, we need to be able to fine-tune things a
 445 little better.
 446
 447 The creator of a context is able to specify an initial block size and
 448 a maximum block size.  Selecting smaller values can prevent wastage of
 449 space in contexts that aren't expected to hold very much (an example
 450 is the relcache's per-relation contexts).
 451
 452 Also, it is possible to specify a minimum context size, in case for some
 453 reason that should be different from the initial size for additional
 454 blocks.  An aset.c context will always contain at least one block,
 455 of size minContextSize if that is specified, otherwise initBlockSize.
 456
 457 We expect that per-tuple contexts will be reset frequently and typically
 458 will not allocate very much space per tuple cycle.  To make this usage
 459 pattern cheap, the first block allocated in a context is not given
 460 back to malloc() during reset, but just cleared.  This avoids malloc
 461 thrashing.
 462
 463
 464 Alternative Memory Context Implementations
 465 ------------------------------------------
 466
 467 aset.c is our default general-purpose implementation, working fine
 468 in most situations. We also have two implementations optimized for
 469 special use cases, providing either better performance or lower memory
 470 usage compared to aset.c (or both).
 471
 472 * slab.c (SlabContext) is designed for allocations of fixed-length
 473   chunks, and does not allow allocations of chunks with different size.
 474
 475 * generation.c (GenerationContext) is designed for cases when chunks
 476   are allocated in groups with similar lifespan (generations), or
 477   roughly in FIFO order.
 478
 479 Both memory contexts aim to free memory back to the operating system
 480 (unlike aset.c, which keeps the freed chunks in a freelist, and only
 481 returns the memory when reset/deleted).
 482
 483 These memory contexts were initially developed for ReorderBuffer, but
 484 may be useful elsewhere as long as the allocation patterns match.
 485
 486
 487 Memory Accounting
 488 -----------------
 489
 490 One of the basic memory context operations is determining the amount of
 491 memory used in the context (and its children). We have multiple places
 492 that implement their own ad hoc memory accounting, and this is meant to
 493 provide a unified approach. Ad hoc accounting solutions work for places
 494 with tight control over the allocations or when it's easy to determine
 495 sizes of allocated chunks (e.g. places that only work with tuples).
 496
 497 The accounting built into the memory contexts is transparent and works
 498 transparently for all allocations as long as they end up in the right
 499 memory context subtree.
 500
 501 Consider for example aggregate functions - the aggregate state is often
 502 represented by an arbitrary structure, allocated from the transition
 503 function, so the ad hoc accounting is unlikely to work. The built-in
 504 accounting will however handle such cases just fine.
 505
 506 To minimize overhead, the accounting is done at the block level, not for
 507 individual allocation chunks.
 508
 509 The accounting is lazy - after a block is allocated (or freed), only the
 510 context owning that block is updated. This means that when inquiring
 511 about the memory usage in a given context, we have to walk all children
 512 contexts recursively. This means the memory accounting is not intended
 513 for cases with too many memory contexts (in the relevant subtree).