doc/src/sgml/storage.sgml

   1 <!-- doc/src/sgml/storage.sgml -->
   2
   3 <chapter id="storage">
   4
   5 <title>Database Physical Storage</title>
   6
   7 <para>
   8 This chapter provides an overview of the physical storage format used by
   9 <productname>PostgreSQL</productname> databases.
  10 </para>
  11
  12 <sect1 id="storage-file-layout">
  13
  14 <title>Database File Layout</title>
  15
  16 <para>
  17 This section describes the storage format at the level of files and
  18 directories.
  19 </para>
  20
  21 <para>
  22 Traditionally, the configuration and data files used by a database
  23 cluster are stored together within the cluster's data
  24 directory, commonly referred to as <varname>PGDATA</varname> (after the name of the
  25 environment variable that can be used to define it).  A common location for
  26 <varname>PGDATA</varname> is <filename>/var/lib/pgsql/data</filename>.  Multiple clusters,
  27 managed by different server instances, can exist on the same machine.
  28 </para>
  29
  30 <para>
  31 The <varname>PGDATA</varname> directory contains several subdirectories and control
  32 files, as shown in <xref linkend="pgdata-contents-table"/>.  In addition to
  33 these required items, the cluster configuration files
  34 <filename>postgresql.conf</filename>, <filename>pg_hba.conf</filename>, and
  35 <filename>pg_ident.conf</filename> are traditionally stored in
  36 <varname>PGDATA</varname>, although it is possible to place them elsewhere.
  37 </para>
  38
  39 <table tocentry="1" id="pgdata-contents-table">
  40 <title>Contents of <varname>PGDATA</varname></title>
  41 <tgroup cols="2">
  42 <thead>
  43 <row>
  44 <entry>
  45 Item
  46 </entry>
  47 <entry>Description</entry>
  48 </row>
  49 </thead>
  50
  51 <tbody>
  52
  53 <row>
  54  <entry><filename>PG_VERSION</filename></entry>
  55  <entry>A file containing the major version number of <productname>PostgreSQL</productname></entry>
  56 </row>
  57
  58 <row>
  59  <entry><filename>base</filename></entry>
  60  <entry>Subdirectory containing per-database subdirectories</entry>
  61 </row>
  62
  63 <row>
  64  <entry><filename>current_logfiles</filename></entry>
  65  <entry>File recording the log file(s) currently written to by the logging
  66   collector</entry>
  67 </row>
  68
  69 <row>
  70  <entry><filename>global</filename></entry>
  71  <entry>Subdirectory containing cluster-wide tables, such as
  72  <structname>pg_database</structname></entry>
  73 </row>
  74
  75 <row>
  76  <entry><filename>pg_commit_ts</filename></entry>
  77  <entry>Subdirectory containing transaction commit timestamp data</entry>
  78 </row>
  79
  80 <row>
  81  <entry><filename>pg_dynshmem</filename></entry>
  82  <entry>Subdirectory containing files used by the dynamic shared memory
  83   subsystem</entry>
  84 </row>
  85
  86 <row>
  87  <entry><filename>pg_logical</filename></entry>
  88  <entry>Subdirectory containing status data for logical decoding</entry>
  89 </row>
  90
  91 <row>
  92  <entry><filename>pg_multixact</filename></entry>
  93  <entry>Subdirectory containing multitransaction status data
  94   (used for shared row locks)</entry>
  95 </row>
  96
  97 <row>
  98  <entry><filename>pg_notify</filename></entry>
  99  <entry>Subdirectory containing LISTEN/NOTIFY status data</entry>
 100 </row>
 101
 102 <row>
 103  <entry><filename>pg_replslot</filename></entry>
 104  <entry>Subdirectory containing replication slot data</entry>
 105 </row>
 106
 107 <row>
 108  <entry><filename>pg_serial</filename></entry>
 109  <entry>Subdirectory containing information about committed serializable transactions</entry>
 110 </row>
 111
 112 <row>
 113  <entry><filename>pg_snapshots</filename></entry>
 114  <entry>Subdirectory containing exported snapshots</entry>
 115 </row>
 116
 117 <row>
 118  <entry><filename>pg_stat</filename></entry>
 119  <entry>Subdirectory containing permanent files for the statistics
 120   subsystem</entry>
 121 </row>
 122
 123 <row>
 124  <entry><filename>pg_stat_tmp</filename></entry>
 125  <entry>Subdirectory containing temporary files for the statistics
 126   subsystem</entry>
 127 </row>
 128
 129 <row>
 130  <entry><filename>pg_subtrans</filename></entry>
 131  <entry>Subdirectory containing subtransaction status data</entry>
 132 </row>
 133
 134 <row>
 135  <entry><filename>pg_tblspc</filename></entry>
 136  <entry>Subdirectory containing symbolic links to tablespaces</entry>
 137 </row>
 138
 139 <row>
 140  <entry><filename>pg_twophase</filename></entry>
 141  <entry>Subdirectory containing state files for prepared transactions</entry>
 142 </row>
 143
 144 <row>
 145  <entry><filename>pg_wal</filename></entry>
 146  <entry>Subdirectory containing WAL (Write Ahead Log) files</entry>
 147 </row>
 148
 149 <row>
 150  <entry><filename>pg_xact</filename></entry>
 151  <entry>Subdirectory containing transaction commit status data</entry>
 152 </row>
 153
 154 <row>
 155  <entry><filename>postgresql.auto.conf</filename></entry>
 156  <entry>A file used for storing configuration parameters that are set by
 157 <command>ALTER SYSTEM</command></entry>
 158 </row>
 159
 160 <row>
 161  <entry><filename>postmaster.opts</filename></entry>
 162  <entry>A file recording the command-line options the server was
 163 last started with</entry>
 164 </row>
 165
 166 <row>
 167  <entry><filename>postmaster.pid</filename></entry>
 168  <entry>A lock file recording the current postmaster process ID (PID),
 169   cluster data directory path,
 170   postmaster start timestamp,
 171   port number,
 172   Unix-domain socket directory path (could be empty),
 173   first valid listen_address (IP address or <literal>*</literal>, or empty if
 174   not listening on TCP),
 175   and shared memory segment ID
 176   (this file is not present after server shutdown)</entry>
 177 </row>
 178
 179 </tbody>
 180 </tgroup>
 181 </table>
 182
 183 <para>
 184 For each database in the cluster there is a subdirectory within
 185 <varname>PGDATA</varname><filename>/base</filename>, named after the database's OID in
 186 <structname>pg_database</structname>.  This subdirectory is the default location
 187 for the database's files; in particular, its system catalogs are stored
 188 there.
 189 </para>
 190
 191 <para>
 192  Note that the following sections describe the behavior of the builtin
 193  <literal>heap</literal> <link linkend="tableam">table access method</link>,
 194  and the builtin <link linkend="indexam">index access methods</link>. Due
 195  to the extensible nature of <productname>PostgreSQL</productname>, other
 196  access methods might work differently.
 197 </para>
 198
 199 <para>
 200 Each table and index is stored in a separate file.  For ordinary relations,
 201 these files are named after the table or index's <firstterm>filenode</firstterm> number,
 202 which can be found in <structname>pg_class</structname>.<structfield>relfilenode</structfield>. But
 203 for temporary relations, the file name is of the form
 204 <literal>t<replaceable>BBB</replaceable>_<replaceable>FFF</replaceable></literal>, where <replaceable>BBB</replaceable>
 205 is the backend ID of the backend which created the file, and <replaceable>FFF</replaceable>
 206 is the filenode number.  In either case, in addition to the main file (a/k/a
 207 main fork), each table and index has a <firstterm>free space map</firstterm> (see <xref
 208 linkend="storage-fsm"/>), which stores information about free space available in
 209 the relation.  The free space map is stored in a file named with the filenode
 210 number plus the suffix <literal>_fsm</literal>.  Tables also have a
 211 <firstterm>visibility map</firstterm>, stored in a fork with the suffix <literal>_vm</literal>,
 212 to track which pages are known to have no dead tuples.  The visibility map is
 213 described further in <xref linkend="storage-vm"/>.  Unlogged tables and indexes
 214 have a third fork, known as the initialization fork, which is stored in a fork
 215 with the suffix <literal>_init</literal> (see <xref linkend="storage-init"/>).
 216 </para>
 217
 218 <caution>
 219 <para>
 220 Note that while a table's filenode often matches its OID, this is
 221 <emphasis>not</emphasis> necessarily the case; some operations, like
 222 <command>TRUNCATE</command>, <command>REINDEX</command>, <command>CLUSTER</command> and some forms
 223 of <command>ALTER TABLE</command>, can change the filenode while preserving the OID.
 224 Avoid assuming that filenode and table OID are the same.
 225 Also, for certain system catalogs including <structname>pg_class</structname> itself,
 226 <structname>pg_class</structname>.<structfield>relfilenode</structfield> contains zero.  The
 227 actual filenode number of these catalogs is stored in a lower-level data
 228 structure, and can be obtained using the <function>pg_relation_filenode()</function>
 229 function.
 230 </para>
 231 </caution>
 232
 233 <para>
 234 When a table or index exceeds 1 GB, it is divided into gigabyte-sized
 235 <firstterm>segments</firstterm>.  The first segment's file name is the same as the
 236 filenode; subsequent segments are named filenode.1, filenode.2, etc.
 237 This arrangement avoids problems on platforms that have file size limitations.
 238 (Actually, 1 GB is just the default segment size.  The segment size can be
 239 adjusted using the configuration option <option>--with-segsize</option>
 240 when building <productname>PostgreSQL</productname>.)
 241 In principle, free space map and visibility map forks could require multiple
 242 segments as well, though this is unlikely to happen in practice.
 243 </para>
 244
 245 <para>
 246 A table that has columns with potentially large entries will have an
 247 associated <firstterm>TOAST</firstterm> table, which is used for out-of-line storage of
 248 field values that are too large to keep in the table rows proper.
 249 <structname>pg_class</structname>.<structfield>reltoastrelid</structfield> links from a table to
 250 its <acronym>TOAST</acronym> table, if any.
 251 See <xref linkend="storage-toast"/> for more information.
 252 </para>
 253
 254 <para>
 255 The contents of tables and indexes are discussed further in
 256 <xref linkend="storage-page-layout"/>.
 257 </para>
 258
 259 <para>
 260 Tablespaces make the scenario more complicated.  Each user-defined tablespace
 261 has a symbolic link inside the <varname>PGDATA</varname><filename>/pg_tblspc</filename>
 262 directory, which points to the physical tablespace directory (i.e., the
 263 location specified in the tablespace's <command>CREATE TABLESPACE</command> command).
 264 This symbolic link is named after
 265 the tablespace's OID.  Inside the physical tablespace directory there is
 266 a subdirectory with a name that depends on the <productname>PostgreSQL</productname>
 267 server version, such as <literal>PG_9.0_201008051</literal>.  (The reason for using
 268 this subdirectory is so that successive versions of the database can use
 269 the same <command>CREATE TABLESPACE</command> location value without conflicts.)
 270 Within the version-specific subdirectory, there is
 271 a subdirectory for each database that has elements in the tablespace, named
 272 after the database's OID.  Tables and indexes are stored within that
 273 directory, using the filenode naming scheme.
 274 The <literal>pg_default</literal> tablespace is not accessed through
 275 <filename>pg_tblspc</filename>, but corresponds to
 276 <varname>PGDATA</varname><filename>/base</filename>.  Similarly, the <literal>pg_global</literal>
 277 tablespace is not accessed through <filename>pg_tblspc</filename>, but corresponds to
 278 <varname>PGDATA</varname><filename>/global</filename>.
 279 </para>
 280
 281 <para>
 282 The <function>pg_relation_filepath()</function> function shows the entire path
 283 (relative to <varname>PGDATA</varname>) of any relation.  It is often useful
 284 as a substitute for remembering many of the above rules.  But keep in
 285 mind that this function just gives the name of the first segment of the
 286 main fork of the relation &mdash; you may need to append a segment number
 287 and/or <literal>_fsm</literal>, <literal>_vm</literal>, or <literal>_init</literal> to find all
 288 the files associated with the relation.
 289 </para>
 290
 291 <para>
 292 Temporary files (for operations such as sorting more data than can fit in
 293 memory) are created within <varname>PGDATA</varname><filename>/base/pgsql_tmp</filename>,
 294 or within a <filename>pgsql_tmp</filename> subdirectory of a tablespace directory
 295 if a tablespace other than <literal>pg_default</literal> is specified for them.
 296 The name of a temporary file has the form
 297 <filename>pgsql_tmp<replaceable>PPP</replaceable>.<replaceable>NNN</replaceable></filename>,
 298 where <replaceable>PPP</replaceable> is the PID of the owning backend and
 299 <replaceable>NNN</replaceable> distinguishes different temporary files of that backend.
 300 </para>
 301
 302 </sect1>
 303
 304 <sect1 id="storage-toast">
 305
 306 <title>TOAST</title>
 307
 308     <indexterm>
 309      <primary>TOAST</primary>
 310     </indexterm>
 311     <indexterm><primary>sliced bread</primary><see>TOAST</see></indexterm>
 312
 313 <para>
 314 This section provides an overview of <acronym>TOAST</acronym> (The
 315 Oversized-Attribute Storage Technique).
 316 </para>
 317
 318 <para>
 319 <productname>PostgreSQL</productname> uses a fixed page size (commonly
 320 8 kB), and does not allow tuples to span multiple pages.  Therefore, it is
 321 not possible to store very large field values directly.  To overcome
 322 this limitation, large field values are compressed and/or broken up into
 323 multiple physical rows.  This happens transparently to the user, with only
 324 small impact on most of the backend code.  The technique is affectionately
 325 known as <acronym>TOAST</acronym> (or <quote>the best thing since sliced bread</quote>).
 326 The <acronym>TOAST</acronym> infrastructure is also used to improve handling of
 327 large data values in-memory.
 328 </para>
 329
 330 <para>
 331 Only certain data types support <acronym>TOAST</acronym> &mdash; there is no need to
 332 impose the overhead on data types that cannot produce large field values.
 333 To support <acronym>TOAST</acronym>, a data type must have a variable-length
 334 (<firstterm>varlena</firstterm>) representation, in which, ordinarily, the first
 335 four-byte word of any stored value contains the total length of the value in
 336 bytes (including itself).  <acronym>TOAST</acronym> does not constrain the rest
 337 of the data type's representation.  The special representations collectively
 338 called <firstterm><acronym>TOAST</acronym>ed values</firstterm> work by modifying or
 339 reinterpreting this initial length word.  Therefore, the C-level functions
 340 supporting a <acronym>TOAST</acronym>-able data type must be careful about how they
 341 handle potentially <acronym>TOAST</acronym>ed input values: an input might not
 342 actually consist of a four-byte length word and contents until after it's
 343 been <firstterm>detoasted</firstterm>.  (This is normally done by invoking
 344 <function>PG_DETOAST_DATUM</function> before doing anything with an input value,
 345 but in some cases more efficient approaches are possible.
 346 See <xref linkend="xtypes-toast"/> for more detail.)
 347 </para>
 348
 349 <para>
 350 <acronym>TOAST</acronym> usurps two bits of the varlena length word (the high-order
 351 bits on big-endian machines, the low-order bits on little-endian machines),
 352 thereby limiting the logical size of any value of a <acronym>TOAST</acronym>-able
 353 data type to 1 GB (2<superscript>30</superscript> - 1 bytes).  When both bits are zero,
 354 the value is an ordinary un-<acronym>TOAST</acronym>ed value of the data type, and
 355 the remaining bits of the length word give the total datum size (including
 356 length word) in bytes.  When the highest-order or lowest-order bit is set,
 357 the value has only a single-byte header instead of the normal four-byte
 358 header, and the remaining bits of that byte give the total datum size
 359 (including length byte) in bytes.  This alternative supports space-efficient
 360 storage of values shorter than 127 bytes, while still allowing the data type
 361 to grow to 1 GB at need.  Values with single-byte headers aren't aligned on
 362 any particular boundary, whereas values with four-byte headers are aligned on
 363 at least a four-byte boundary; this omission of alignment padding provides
 364 additional space savings that is significant compared to short values.
 365 As a special case, if the remaining bits of a single-byte header are all
 366 zero (which would be impossible for a self-inclusive length), the value is
 367 a pointer to out-of-line data, with several possible alternatives as
 368 described below.  The type and size of such a <firstterm>TOAST pointer</firstterm>
 369 are determined by a code stored in the second byte of the datum.
 370 Lastly, when the highest-order or lowest-order bit is clear but the adjacent
 371 bit is set, the content of the datum has been compressed and must be
 372 decompressed before use.  In this case the remaining bits of the four-byte
 373 length word give the total size of the compressed datum, not the
 374 original data.  Note that compression is also possible for out-of-line data
 375 but the varlena header does not tell whether it has occurred &mdash;
 376 the content of the <acronym>TOAST</acronym> pointer tells that, instead.
 377 </para>
 378
 379 <para>
 380 The compression technique used for either in-line or out-of-line compressed
 381 data can be selected for each column by setting
 382 the <literal>COMPRESSION</literal> column option in <command>CREATE
 383 TABLE</command> or <command>ALTER TABLE</command>.  The default for columns
 384 with no explicit setting is to consult the
 385 <xref linkend="guc-default-toast-compression"/> parameter at the time data is
 386 inserted.
 387 </para>
 388
 389 <para>
 390 As mentioned, there are multiple types of <acronym>TOAST</acronym> pointer datums.
 391 The oldest and most common type is a pointer to out-of-line data stored in
 392 a <firstterm><acronym>TOAST</acronym> table</firstterm> that is separate from, but
 393 associated with, the table containing the <acronym>TOAST</acronym> pointer datum
 394 itself.  These <firstterm>on-disk</firstterm> pointer datums are created by the
 395 <acronym>TOAST</acronym> management code (in <filename>access/common/toast_internals.c</filename>)
 396 when a tuple to be stored on disk is too large to be stored as-is.
 397 Further details appear in <xref linkend="storage-toast-ondisk"/>.
 398 Alternatively, a <acronym>TOAST</acronym> pointer datum can contain a pointer to
 399 out-of-line data that appears elsewhere in memory.  Such datums are
 400 necessarily short-lived, and will never appear on-disk, but they are very
 401 useful for avoiding copying and redundant processing of large data values.
 402 Further details appear in <xref linkend="storage-toast-inmemory"/>.
 403 </para>
 404
 405 <sect2 id="storage-toast-ondisk">
 406  <title>Out-of-Line, On-Disk TOAST Storage</title>
 407
 408 <para>
 409 If any of the columns of a table are <acronym>TOAST</acronym>-able, the table will
 410 have an associated <acronym>TOAST</acronym> table, whose OID is stored in the table's
 411 <structname>pg_class</structname>.<structfield>reltoastrelid</structfield> entry.  On-disk
 412 <acronym>TOAST</acronym>ed values are kept in the <acronym>TOAST</acronym> table, as
 413 described in more detail below.
 414 </para>
 415
 416 <para>
 417 Out-of-line values are divided (after compression if used) into chunks of at
 418 most <symbol>TOAST_MAX_CHUNK_SIZE</symbol> bytes (by default this value is chosen
 419 so that four chunk rows will fit on a page, making it about 2000 bytes).
 420 Each chunk is stored as a separate row in the <acronym>TOAST</acronym> table
 421 belonging to the owning table.  Every
 422 <acronym>TOAST</acronym> table has the columns <structfield>chunk_id</structfield> (an OID
 423 identifying the particular <acronym>TOAST</acronym>ed value),
 424 <structfield>chunk_seq</structfield> (a sequence number for the chunk within its value),
 425 and <structfield>chunk_data</structfield> (the actual data of the chunk).  A unique index
 426 on <structfield>chunk_id</structfield> and <structfield>chunk_seq</structfield> provides fast
 427 retrieval of the values.  A pointer datum representing an out-of-line on-disk
 428 <acronym>TOAST</acronym>ed value therefore needs to store the OID of the
 429 <acronym>TOAST</acronym> table in which to look and the OID of the specific value
 430 (its <structfield>chunk_id</structfield>).  For convenience, pointer datums also store the
 431 logical datum size (original uncompressed data length), physical stored size
 432 (different if compression was applied), and the compression method used, if
 433 any.  Allowing for the varlena header bytes,
 434 the total size of an on-disk <acronym>TOAST</acronym> pointer datum is therefore 18
 435 bytes regardless of the actual size of the represented value.
 436 </para>
 437
 438 <para>
 439 The <acronym>TOAST</acronym> management code is triggered only
 440 when a row value to be stored in a table is wider than
 441 <symbol>TOAST_TUPLE_THRESHOLD</symbol> bytes (normally 2 kB).
 442 The <acronym>TOAST</acronym> code will compress and/or move
 443 field values out-of-line until the row value is shorter than
 444 <symbol>TOAST_TUPLE_TARGET</symbol> bytes (also normally 2 kB, adjustable)
 445 or no more gains can be had.  During an UPDATE
 446 operation, values of unchanged fields are normally preserved as-is; so an
 447 UPDATE of a row with out-of-line values incurs no <acronym>TOAST</acronym> costs if
 448 none of the out-of-line values change.
 449 </para>
 450
 451 <para>
 452 The <acronym>TOAST</acronym> management code recognizes four different strategies
 453 for storing <acronym>TOAST</acronym>-able columns on disk:
 454
 455    <itemizedlist>
 456     <listitem>
 457      <para>
 458       <literal>PLAIN</literal> prevents either compression or
 459       out-of-line storage.  This is the only possible strategy for
 460       columns of non-<acronym>TOAST</acronym>-able data types.
 461      </para>
 462     </listitem>
 463     <listitem>
 464      <para>
 465       <literal>EXTENDED</literal> allows both compression and out-of-line
 466       storage.  This is the default for most <acronym>TOAST</acronym>-able data types.
 467       Compression will be attempted first, then out-of-line storage if
 468       the row is still too big.
 469      </para>
 470     </listitem>
 471     <listitem>
 472      <para>
 473       <literal>EXTERNAL</literal> allows out-of-line storage but not
 474       compression.  Use of <literal>EXTERNAL</literal> will
 475       make substring operations on wide <type>text</type> and
 476       <type>bytea</type> columns faster (at the penalty of increased storage
 477       space) because these operations are optimized to fetch only the
 478       required parts of the out-of-line value when it is not compressed.
 479      </para>
 480     </listitem>
 481     <listitem>
 482      <para>
 483       <literal>MAIN</literal> allows compression but not out-of-line
 484       storage.  (Actually, out-of-line storage will still be performed
 485       for such columns, but only as a last resort when there is no other
 486       way to make the row small enough to fit on a page.)
 487      </para>
 488     </listitem>
 489    </itemizedlist>
 490
 491 Each <acronym>TOAST</acronym>-able data type specifies a default strategy for columns
 492 of that data type, but the strategy for a given table column can be altered
 493 with <link linkend="sql-altertable"><command>ALTER TABLE ... SET STORAGE</command></link>.
 494 </para>
 495
 496 <para>
 497 <symbol>TOAST_TUPLE_TARGET</symbol> can be adjusted for each table using
 498 <link linkend="sql-altertable"><command>ALTER TABLE ... SET (toast_tuple_target = N)</command></link>
 499 </para>
 500
 501 <para>
 502 This scheme has a number of advantages compared to a more straightforward
 503 approach such as allowing row values to span pages.  Assuming that queries are
 504 usually qualified by comparisons against relatively small key values, most of
 505 the work of the executor will be done using the main row entry. The big values
 506 of <acronym>TOAST</acronym>ed attributes will only be pulled out (if selected at all)
 507 at the time the result set is sent to the client. Thus, the main table is much
 508 smaller and more of its rows fit in the shared buffer cache than would be the
 509 case without any out-of-line storage. Sort sets shrink also, and sorts will
 510 more often be done entirely in memory. A little test showed that a table
 511 containing typical HTML pages and their URLs was stored in about half of the
 512 raw data size including the <acronym>TOAST</acronym> table, and that the main table
 513 contained only about 10% of the entire data (the URLs and some small HTML
 514 pages). There was no run time difference compared to an un-<acronym>TOAST</acronym>ed
 515 comparison table, in which all the HTML pages were cut down to 7 kB to fit.
 516 </para>
 517
 518 </sect2>
 519
 520 <sect2 id="storage-toast-inmemory">
 521  <title>Out-of-Line, In-Memory TOAST Storage</title>
 522
 523 <para>
 524 <acronym>TOAST</acronym> pointers can point to data that is not on disk, but is
 525 elsewhere in the memory of the current server process.  Such pointers
 526 obviously cannot be long-lived, but they are nonetheless useful.  There
 527 are currently two sub-cases:
 528 pointers to <firstterm>indirect</firstterm> data and
 529 pointers to <firstterm>expanded</firstterm> data.
 530 </para>
 531
 532 <para>
 533 Indirect <acronym>TOAST</acronym> pointers simply point at a non-indirect varlena
 534 value stored somewhere in memory.  This case was originally created merely
 535 as a proof of concept, but it is currently used during logical decoding to
 536 avoid possibly having to create physical tuples exceeding 1 GB (as pulling
 537 all out-of-line field values into the tuple might do).  The case is of
 538 limited use since the creator of the pointer datum is entirely responsible
 539 that the referenced data survives for as long as the pointer could exist,
 540 and there is no infrastructure to help with this.
 541 </para>
 542
 543 <para>
 544 Expanded <acronym>TOAST</acronym> pointers are useful for complex data types
 545 whose on-disk representation is not especially suited for computational
 546 purposes.  As an example, the standard varlena representation of a
 547 <productname>PostgreSQL</productname> array includes dimensionality information, a
 548 nulls bitmap if there are any null elements, then the values of all the
 549 elements in order.  When the element type itself is variable-length, the
 550 only way to find the <replaceable>N</replaceable>'th element is to scan through all the
 551 preceding elements.  This representation is appropriate for on-disk storage
 552 because of its compactness, but for computations with the array it's much
 553 nicer to have an <quote>expanded</quote> or <quote>deconstructed</quote>
 554 representation in which all the element starting locations have been
 555 identified.  The <acronym>TOAST</acronym> pointer mechanism supports this need by
 556 allowing a pass-by-reference Datum to point to either a standard varlena
 557 value (the on-disk representation) or a <acronym>TOAST</acronym> pointer that
 558 points to an expanded representation somewhere in memory.  The details of
 559 this expanded representation are up to the data type, though it must have
 560 a standard header and meet the other API requirements given
 561 in <filename>src/include/utils/expandeddatum.h</filename>.  C-level functions
 562 working with the data type can choose to handle either representation.
 563 Functions that do not know about the expanded representation, but simply
 564 apply <function>PG_DETOAST_DATUM</function> to their inputs, will automatically
 565 receive the traditional varlena representation; so support for an expanded
 566 representation can be introduced incrementally, one function at a time.
 567 </para>
 568
 569 <para>
 570 <acronym>TOAST</acronym> pointers to expanded values are further broken down
 571 into <firstterm>read-write</firstterm> and <firstterm>read-only</firstterm> pointers.
 572 The pointed-to representation is the same either way, but a function that
 573 receives a read-write pointer is allowed to modify the referenced value
 574 in-place, whereas one that receives a read-only pointer must not; it must
 575 first create a copy if it wants to make a modified version of the value.
 576 This distinction and some associated conventions make it possible to avoid
 577 unnecessary copying of expanded values during query execution.
 578 </para>
 579
 580 <para>
 581 For all types of in-memory <acronym>TOAST</acronym> pointer, the <acronym>TOAST</acronym>
 582 management code ensures that no such pointer datum can accidentally get
 583 stored on disk.  In-memory <acronym>TOAST</acronym> pointers are automatically
 584 expanded to normal in-line varlena values before storage &mdash; and then
 585 possibly converted to on-disk <acronym>TOAST</acronym> pointers, if the containing
 586 tuple would otherwise be too big.
 587 </para>
 588
 589 </sect2>
 590
 591 </sect1>
 592
 593 <sect1 id="storage-fsm">
 594
 595 <title>Free Space Map</title>
 596
 597 <indexterm>
 598  <primary>Free Space Map</primary>
 599 </indexterm>
 600 <indexterm><primary>FSM</primary><see>Free Space Map</see></indexterm>
 601
 602 <para>
 603 Each heap and index relation, except for hash indexes, has a Free Space Map
 604 (<acronym>FSM</acronym>) to keep track of available space in the relation.
 605 It's stored alongside the main relation data in a separate relation fork,
 606 named after the filenode number of the relation, plus a <literal>_fsm</literal>
 607 suffix. For example, if the filenode of a relation is 12345, the
 608 <acronym>FSM</acronym> is stored in a file called
 609 <filename>12345_fsm</filename>, in the same directory as the main relation file.
 610 </para>
 611
 612 <para>
 613 The Free Space Map is organized as a tree of <acronym>FSM</acronym> pages. The
 614 bottom level <acronym>FSM</acronym> pages store the free space available on each
 615 heap (or index) page, using one byte to represent each such page. The upper
 616 levels aggregate information from the lower levels.
 617 </para>
 618
 619 <para>
 620 Within each <acronym>FSM</acronym> page is a binary tree, stored in an array with
 621 one byte per node. Each leaf node represents a heap page, or a lower level
 622 <acronym>FSM</acronym> page. In each non-leaf node, the higher of its children's
 623 values is stored. The maximum value in the leaf nodes is therefore stored
 624 at the root.
 625 </para>
 626
 627 <para>
 628 See <filename>src/backend/storage/freespace/README</filename> for more details on
 629 how the <acronym>FSM</acronym> is structured, and how it's updated and searched.
 630 The <xref linkend="pgfreespacemap"/> module
 631 can be used to examine the information stored in free space maps.
 632 </para>
 633
 634 </sect1>
 635
 636 <sect1 id="storage-vm">
 637
 638 <title>Visibility Map</title>
 639
 640 <indexterm>
 641  <primary>Visibility Map</primary>
 642 </indexterm>
 643 <indexterm><primary>VM</primary><see>Visibility Map</see></indexterm>
 644
 645 <para>
 646 Each heap relation has a Visibility Map
 647 (VM) to keep track of which pages contain only tuples that are known to be
 648 visible to all active transactions; it also keeps track of which pages contain
 649 only frozen tuples.  It's stored
 650 alongside the main relation data in a separate relation fork, named after the
 651 filenode number of the relation, plus a <literal>_vm</literal> suffix. For example,
 652 if the filenode of a relation is 12345, the VM is stored in a file called
 653 <filename>12345_vm</filename>, in the same directory as the main relation file.
 654 Note that indexes do not have VMs.
 655 </para>
 656
 657 <para>
 658 The visibility map stores two bits per heap page.  The first bit, if set,
 659 indicates that the page is all-visible, or in other words that the page does
 660 not contain any tuples that need to be vacuumed.
 661 This information can also be used
 662 by <link linkend="indexes-index-only-scans"><firstterm>index-only
 663 scans</firstterm></link> to answer queries using only the index tuple.
 664 The second bit, if set, means that all tuples on the page have been frozen.
 665 That means that even an anti-wraparound vacuum need not revisit the page.
 666 </para>
 667
 668 <para>
 669 The map is conservative in the sense that we make sure that whenever a bit is
 670 set, we know the condition is true, but if a bit is not set, it might or
 671 might not be true. Visibility map bits are only set by vacuum, but are
 672 cleared by any data-modifying operations on a page.
 673 </para>
 674
 675 <para>
 676 The <xref linkend="pgvisibility"/> module can be used to examine the
 677 information stored in the visibility map.
 678 </para>
 679
 680 </sect1>
 681
 682 <sect1 id="storage-init">
 683
 684 <title>The Initialization Fork</title>
 685
 686 <indexterm>
 687  <primary>Initialization Fork</primary>
 688 </indexterm>
 689
 690 <para>
 691 Each unlogged table, and each index on an unlogged table, has an initialization
 692 fork.  The initialization fork is an empty table or index of the appropriate
 693 type.  When an unlogged table must be reset to empty due to a crash, the
 694 initialization fork is copied over the main fork, and any other forks are
 695 erased (they will be recreated automatically as needed).
 696 </para>
 697
 698 </sect1>
 699
 700 <sect1 id="storage-page-layout">
 701
 702 <title>Database Page Layout</title>
 703
 704 <para>
 705 This section provides an overview of the page format used within
 706 <productname>PostgreSQL</productname> tables and indexes.<footnote>
 707   <para>
 708     Actually, use of this page format is not required for either table or
 709     index access methods. The <literal>heap</literal> table access method
 710     always uses this format.  All the existing index methods also use the
 711     basic format, but the data kept on index metapages usually doesn't follow
 712     the item layout rules.
 713   </para>
 714 </footnote>
 715 Sequences and <acronym>TOAST</acronym> tables are formatted just like a regular table.
 716 </para>
 717
 718 <para>
 719 In the following explanation, a
 720 <firstterm>byte</firstterm>
 721 is assumed to contain 8 bits.  In addition, the term
 722 <firstterm>item</firstterm>
 723 refers to an individual data value that is stored on a page.  In a table,
 724 an item is a row; in an index, an item is an index entry.
 725 </para>
 726
 727 <para>
 728 Every table and index is stored as an array of <firstterm>pages</firstterm> of a
 729 fixed size (usually 8 kB, although a different page size can be selected
 730 when compiling the server).  In a table, all the pages are logically
 731 equivalent, so a particular item (row) can be stored in any page.  In
 732 indexes, the first page is generally reserved as a <firstterm>metapage</firstterm>
 733 holding control information, and there can be different types of pages
 734 within the index, depending on the index access method.
 735 </para>
 736
 737 <para>
 738 <xref linkend="page-table"/> shows the overall layout of a page.
 739 There are five parts to each page.
 740 </para>
 741
 742 <table tocentry="1" id="page-table">
 743 <title>Overall Page Layout</title>
 744 <titleabbrev>Page Layout</titleabbrev>
 745 <tgroup cols="2">
 746 <thead>
 747 <row>
 748 <entry>
 749 Item
 750 </entry>
 751 <entry>Description</entry>
 752 </row>
 753 </thead>
 754
 755 <tbody>
 756
 757 <row>
 758  <entry>PageHeaderData</entry>
 759  <entry>24 bytes long. Contains general information about the page, including
 760 free space pointers.</entry>
 761 </row>
 762
 763 <row>
 764 <entry>ItemIdData</entry>
 765 <entry>Array of item identifiers pointing to the actual items. Each
 766 entry is an (offset,length) pair. 4 bytes per item.</entry>
 767 </row>
 768
 769 <row>
 770 <entry>Free space</entry>
 771 <entry>The unallocated space. New item identifiers are allocated from
 772 the start of this area, new items from the end.</entry>
 773 </row>
 774
 775 <row>
 776 <entry>Items</entry>
 777 <entry>The actual items themselves.</entry>
 778 </row>
 779
 780 <row>
 781 <entry>Special space</entry>
 782 <entry>Index access method specific data. Different methods store different
 783 data. Empty in ordinary tables.</entry>
 784 </row>
 785
 786 </tbody>
 787 </tgroup>
 788 </table>
 789
 790  <para>
 791
 792   The first 24 bytes of each page consists of a page header
 793   (<structname>PageHeaderData</structname>). Its format is detailed in <xref
 794   linkend="pageheaderdata-table"/>. The first field tracks the most
 795   recent WAL entry related to this page. The second field contains
 796   the page checksum if <xref linkend="app-initdb-data-checksums"/> are
 797   enabled.  Next is a 2-byte field containing flag bits. This is followed
 798   by three 2-byte integer fields (<structfield>pd_lower</structfield>,
 799   <structfield>pd_upper</structfield>, and
 800   <structfield>pd_special</structfield>).  These contain byte offsets
 801   from the page start to the start of unallocated space, to the end of
 802   unallocated space, and to the start of the special space.  The next 2
 803   bytes of the page header, <structfield>pd_pagesize_version</structfield>,
 804   store both the page size and a version indicator.  Beginning with
 805   <productname>PostgreSQL</productname> 8.3 the version number is 4;
 806   <productname>PostgreSQL</productname> 8.1 and 8.2 used version number 3;
 807   <productname>PostgreSQL</productname> 8.0 used version number 2;
 808   <productname>PostgreSQL</productname> 7.3 and 7.4 used version number 1;
 809   prior releases used version number 0.
 810   (The basic page layout and header format has not changed in most of these
 811   versions, but the layout of heap row headers has.)  The page size
 812   is basically only present as a cross-check; there is no support for having
 813   more than one page size in an installation.
 814   The last field is a hint that shows whether pruning the page is likely
 815   to be profitable: it tracks the oldest un-pruned XMAX on the page.
 816
 817  </para>
 818
 819  <table tocentry="1" id="pageheaderdata-table">
 820  <title>PageHeaderData Layout</title>
 821  <titleabbrev>PageHeaderData Layout</titleabbrev>
 822  <tgroup cols="4">
 823  <thead>
 824   <row>
 825    <entry>Field</entry>
 826    <entry>Type</entry>
 827    <entry>Length</entry>
 828    <entry>Description</entry>
 829   </row>
 830  </thead>
 831  <tbody>
 832   <row>
 833    <entry>pd_lsn</entry>
 834    <entry>PageXLogRecPtr</entry>
 835    <entry>8 bytes</entry>
 836    <entry>LSN: next byte after last byte of WAL record for last change
 837    to this page</entry>
 838   </row>
 839   <row>
 840    <entry>pd_checksum</entry>
 841    <entry>uint16</entry>
 842    <entry>2 bytes</entry>
 843    <entry>Page checksum</entry>
 844   </row>
 845   <row>
 846    <entry>pd_flags</entry>
 847    <entry>uint16</entry>
 848    <entry>2 bytes</entry>
 849    <entry>Flag bits</entry>
 850   </row>
 851   <row>
 852    <entry>pd_lower</entry>
 853    <entry>LocationIndex</entry>
 854    <entry>2 bytes</entry>
 855    <entry>Offset to start of free space</entry>
 856   </row>
 857   <row>
 858    <entry>pd_upper</entry>
 859    <entry>LocationIndex</entry>
 860    <entry>2 bytes</entry>
 861    <entry>Offset to end of free space</entry>
 862   </row>
 863   <row>
 864    <entry>pd_special</entry>
 865    <entry>LocationIndex</entry>
 866    <entry>2 bytes</entry>
 867    <entry>Offset to start of special space</entry>
 868   </row>
 869   <row>
 870    <entry>pd_pagesize_version</entry>
 871    <entry>uint16</entry>
 872    <entry>2 bytes</entry>
 873    <entry>Page size and layout version number information</entry>
 874   </row>
 875   <row>
 876    <entry>pd_prune_xid</entry>
 877    <entry>TransactionId</entry>
 878    <entry>4 bytes</entry>
 879    <entry>Oldest unpruned XMAX on page, or zero if none</entry>
 880   </row>
 881  </tbody>
 882  </tgroup>
 883  </table>
 884
 885  <para>
 886   All the details can be found in
 887   <filename>src/include/storage/bufpage.h</filename>.
 888  </para>
 889
 890  <para>
 891   Following the page header are item identifiers
 892   (<type>ItemIdData</type>), each requiring four bytes.
 893   An item identifier contains a byte-offset to
 894   the start of an item, its length in bytes, and a few attribute bits
 895   which affect its interpretation.
 896   New item identifiers are allocated
 897   as needed from the beginning of the unallocated space.
 898   The number of item identifiers present can be determined by looking at
 899   <structfield>pd_lower</structfield>, which is increased to allocate a new identifier.
 900   Because an item
 901   identifier is never moved until it is freed, its index can be used on a
 902   long-term basis to reference an item, even when the item itself is moved
 903   around on the page to compact free space.  In fact, every pointer to an
 904   item (<type>ItemPointer</type>, also known as
 905   <type>CTID</type>) created by
 906   <productname>PostgreSQL</productname> consists of a page number and the
 907   index of an item identifier.
 908
 909  </para>
 910
 911  <para>
 912
 913   The items themselves are stored in space allocated backwards from the end
 914   of unallocated space.  The exact structure varies depending on what the
 915   table is to contain. Tables and sequences both use a structure named
 916   <type>HeapTupleHeaderData</type>, described below.
 917
 918  </para>
 919
 920  <para>
 921
 922   The final section is the <quote>special section</quote> which can
 923   contain anything the access method wishes to store.  For example,
 924   b-tree indexes store links to the page's left and right siblings,
 925   as well as some other data relevant to the index structure.
 926   Ordinary tables do not use a special section at all (indicated by setting
 927   <structfield>pd_special</structfield> to equal the page size).
 928
 929  </para>
 930
 931  <para>
 932   <xref linkend="storage-page-layout-figure"/> illustrates how these parts are
 933   laid out in a page.
 934  </para>
 935
 936  <figure id="storage-page-layout-figure">
 937   <title>Page Layout</title>
 938   <mediaobject>
 939    <imageobject>
 940     <imagedata fileref="images/pagelayout.svg" format="SVG" width="100%"/>
 941    </imageobject>
 942   </mediaobject>
 943  </figure>
 944
 945  <sect2 id="storage-tuple-layout">
 946
 947  <title>Table Row Layout</title>
 948
 949  <para>
 950
 951   All table rows are structured in the same way. There is a fixed-size
 952   header (occupying 23 bytes on most machines), followed by an optional null
 953   bitmap, an optional object ID field, and the user data. The header is
 954   detailed
 955   in <xref linkend="heaptupleheaderdata-table"/>.  The actual user data
 956   (columns of the row) begins at the offset indicated by
 957   <structfield>t_hoff</structfield>, which must always be a multiple of the MAXALIGN
 958   distance for the platform.
 959   The null bitmap is
 960   only present if the <firstterm>HEAP_HASNULL</firstterm> bit is set in
 961   <structfield>t_infomask</structfield>. If it is present it begins just after
 962   the fixed header and occupies enough bytes to have one bit per data column
 963   (that is, the number of bits that equals the attribute count in
 964   <structfield>t_infomask2</structfield>). In this list of bits, a
 965   1 bit indicates not-null, a 0 bit is a null.  When the bitmap is not
 966   present, all columns are assumed not-null.
 967   The object ID is only present if the <firstterm>HEAP_HASOID_OLD</firstterm> bit
 968   is set in <structfield>t_infomask</structfield>.  If present, it appears just
 969   before the <structfield>t_hoff</structfield> boundary.  Any padding needed to make
 970   <structfield>t_hoff</structfield> a MAXALIGN multiple will appear between the null
 971   bitmap and the object ID.  (This in turn ensures that the object ID is
 972   suitably aligned.)
 973
 974  </para>
 975
 976  <table tocentry="1" id="heaptupleheaderdata-table">
 977  <title>HeapTupleHeaderData Layout</title>
 978  <titleabbrev>HeapTupleHeaderData Layout</titleabbrev>
 979  <tgroup cols="4">
 980  <thead>
 981   <row>
 982    <entry>Field</entry>
 983    <entry>Type</entry>
 984    <entry>Length</entry>
 985    <entry>Description</entry>
 986   </row>
 987  </thead>
 988  <tbody>
 989   <row>
 990    <entry>t_xmin</entry>
 991    <entry>TransactionId</entry>
 992    <entry>4 bytes</entry>
 993    <entry>insert XID stamp</entry>
 994   </row>
 995   <row>
 996    <entry>t_xmax</entry>
 997    <entry>TransactionId</entry>
 998    <entry>4 bytes</entry>
 999    <entry>delete XID stamp</entry>
1000   </row>
1001   <row>
1002    <entry>t_cid</entry>
1003    <entry>CommandId</entry>
1004    <entry>4 bytes</entry>
1005    <entry>insert and/or delete CID stamp (overlays with t_xvac)</entry>
1006   </row>
1007   <row>
1008    <entry>t_xvac</entry>
1009    <entry>TransactionId</entry>
1010    <entry>4 bytes</entry>
1011    <entry>XID for VACUUM operation moving a row version</entry>
1012   </row>
1013   <row>
1014    <entry>t_ctid</entry>
1015    <entry>ItemPointerData</entry>
1016    <entry>6 bytes</entry>
1017    <entry>current TID of this or newer row version</entry>
1018   </row>
1019   <row>
1020    <entry>t_infomask2</entry>
1021    <entry>uint16</entry>
1022    <entry>2 bytes</entry>
1023    <entry>number of attributes, plus various flag bits</entry>
1024   </row>
1025   <row>
1026    <entry>t_infomask</entry>
1027    <entry>uint16</entry>
1028    <entry>2 bytes</entry>
1029    <entry>various flag bits</entry>
1030   </row>
1031   <row>
1032    <entry>t_hoff</entry>
1033    <entry>uint8</entry>
1034    <entry>1 byte</entry>
1035    <entry>offset to user data</entry>
1036   </row>
1037  </tbody>
1038  </tgroup>
1039  </table>
1040
1041  <para>
1042    All the details can be found in
1043    <filename>src/include/access/htup_details.h</filename>.
1044  </para>
1045
1046  <para>
1047
1048   Interpreting the actual data can only be done with information obtained
1049   from other tables, mostly <structname>pg_attribute</structname>. The
1050   key values needed to identify field locations are
1051   <structfield>attlen</structfield> and <structfield>attalign</structfield>.
1052   There is no way to directly get a
1053   particular attribute, except when there are only fixed width fields and no
1054   null values. All this trickery is wrapped up in the functions
1055   <firstterm>heap_getattr</firstterm>, <firstterm>fastgetattr</firstterm>
1056   and <firstterm>heap_getsysattr</firstterm>.
1057
1058  </para>
1059  <para>
1060
1061   To read the data you need to examine each attribute in turn. First check
1062   whether the field is NULL according to the null bitmap. If it is, go to
1063   the next. Then make sure you have the right alignment.  If the field is a
1064   fixed width field, then all the bytes are simply placed. If it's a
1065   variable length field (attlen = -1) then it's a bit more complicated.
1066   All variable-length data types share the common header structure
1067   <type>struct varlena</type>, which includes the total length of the stored
1068   value and some flag bits.  Depending on the flags, the data can be either
1069   inline or in a <acronym>TOAST</acronym> table;
1070   it might be compressed, too (see <xref linkend="storage-toast"/>).
1071
1072  </para>
1073  </sect2>
1074 </sect1>
1075
1076 <sect1 id="storage-hot">
1077
1078  <title>Heap-Only Tuples (<acronym>HOT</acronym>)</title>
1079
1080  <para>
1081   To allow for high concurrency, <productname>PostgreSQL</productname>
1082   uses <link linkend="mvcc-intro">multiversion concurrency
1083   control</link> (<acronym>MVCC</acronym>) to store rows.  However,
1084   <acronym>MVCC</acronym> has some downsides for update queries.
1085   Specifically, updates require new versions of rows to be added to
1086   tables.  This can also require new index entries for each updated row,
1087   and removal of old versions of rows and their index entries can be
1088   expensive.
1089  </para>
1090
1091  <para>
1092   To help reduce the overhead of updates,
1093   <productname>PostgreSQL</productname> has an optimization called
1094   heap-only tuples (<acronym>HOT</acronym>).  This optimization is
1095   possible when:
1096
1097   <itemizedlist>
1098    <listitem>
1099     <para>
1100      The update does not modify any columns referenced by the table's
1101      indexes, including expression and partial indexes.
1102      </para>
1103    </listitem>
1104    <listitem>
1105     <para>
1106      There is sufficient free space on the page containing the old row
1107      for the updated row.
1108     </para>
1109    </listitem>
1110   </itemizedlist>
1111
1112   In such cases, heap-only tuples provide two optimizations:
1113
1114   <itemizedlist>
1115    <listitem>
1116     <para>
1117      New index entries are not needed to represent updated rows.
1118     </para>
1119    </listitem>
1120    <listitem>
1121     <para>
1122      Old versions of updated rows can be completely removed during normal
1123      operation, including <command>SELECT</command>s, instead of requiring
1124      periodic vacuum operations.  (This is possible because indexes
1125      do not reference their <link linkend="storage-page-layout">page
1126      item identifiers</link>.)
1127     </para>
1128    </listitem>
1129   </itemizedlist>
1130  </para>
1131
1132  <para>
1133   In summary, heap-only tuple updates can only be created
1134   if columns used by indexes are not updated.  You can
1135   increase the likelihood of sufficient page space for
1136   <acronym>HOT</acronym> updates by decreasing a table's <link
1137   linkend="reloption-fillfactor"><literal>fillfactor</literal></link>.
1138   If you don't, <acronym>HOT</acronym> updates will still happen because
1139   new rows will naturally migrate to new pages and existing pages with
1140   sufficient free space for new row versions.  The system view <link
1141   linkend="monitoring-pg-stat-all-tables-view">pg_stat_all_tables</link>
1142   allows monitoring of the occurrence of HOT and non-HOT updates.
1143  </para>
1144 </sect1>
1145
1146 </chapter>