doc/src/sgml/storage.sgml

   1 <!-- $PostgreSQL$ -->
   2
   3 <chapter id="storage">
   4
   5 <title>Database Physical Storage</title>
   6
   7 <para>
   8 This chapter provides an overview of the physical storage format used by
   9 <productname>PostgreSQL</productname> databases.
  10 </para>
  11
  12 <sect1 id="storage-file-layout">
  13
  14 <title>Database File Layout</title>
  15
  16 <para>
  17 This section describes the storage format at the level of files and
  18 directories.
  19 </para>
  20
  21 <para>
  22 All the data needed for a database cluster is stored within the cluster's data
  23 directory, commonly referred to as <varname>PGDATA</> (after the name of the
  24 environment variable that can be used to define it).  A common location for
  25 <varname>PGDATA</> is <filename>/var/lib/pgsql/data</>.  Multiple clusters,
  26 managed by different server instances, can exist on the same machine.
  27 </para>
  28
  29 <para>
  30 The <varname>PGDATA</> directory contains several subdirectories and control
  31 files, as shown in <xref linkend="pgdata-contents-table">.  In addition to
  32 these required items, the cluster configuration files
  33 <filename>postgresql.conf</filename>, <filename>pg_hba.conf</filename>, and
  34 <filename>pg_ident.conf</filename> are traditionally stored in
  35 <varname>PGDATA</> (although in <productname>PostgreSQL</productname> 8.0 and
  36 later, it is possible to keep them elsewhere).
  37 </para>
  38
  39 <table tocentry="1" id="pgdata-contents-table">
  40 <title>Contents of <varname>PGDATA</></title>
  41 <tgroup cols="2">
  42 <thead>
  43 <row>
  44 <entry>
  45 Item
  46 </entry>
  47 <entry>Description</entry>
  48 </row>
  49 </thead>
  50
  51 <tbody>
  52
  53 <row>
  54  <entry><filename>PG_VERSION</></entry>
  55  <entry>A file containing the major version number of <productname>PostgreSQL</productname></entry>
  56 </row>
  57
  58 <row>
  59  <entry><filename>base</></entry>
  60  <entry>Subdirectory containing per-database subdirectories</entry>
  61 </row>
  62
  63 <row>
  64  <entry><filename>global</></entry>
  65  <entry>Subdirectory containing cluster-wide tables, such as
  66  <structname>pg_database</></entry>
  67 </row>
  68
  69 <row>
  70  <entry><filename>pg_clog</></entry>
  71  <entry>Subdirectory containing transaction commit status data</entry>
  72 </row>
  73
  74 <row>
  75  <entry><filename>pg_multixact</></entry>
  76  <entry>Subdirectory containing multitransaction status data
  77   (used for shared row locks)</entry>
  78 </row>
  79
  80 <row>
  81  <entry><filename>pg_stat_tmp</></entry>
  82  <entry>Subdirectory containing temporary files for the statistics
  83   subsystem</entry>
  84 </row>
  85
  86 <row>
  87  <entry><filename>pg_subtrans</></entry>
  88  <entry>Subdirectory containing subtransaction status data</entry>
  89 </row>
  90
  91 <row>
  92  <entry><filename>pg_tblspc</></entry>
  93  <entry>Subdirectory containing symbolic links to tablespaces</entry>
  94 </row>
  95
  96 <row>
  97  <entry><filename>pg_twophase</></entry>
  98  <entry>Subdirectory containing state files for prepared transactions</entry>
  99 </row>
 100
 101 <row>
 102  <entry><filename>pg_xlog</></entry>
 103  <entry>Subdirectory containing WAL (Write Ahead Log) files</entry>
 104 </row>
 105
 106 <row>
 107  <entry><filename>postmaster.opts</></entry>
 108  <entry>A file recording the command-line options the server was
 109 last started with</entry>
 110 </row>
 111
 112 <row>
 113  <entry><filename>postmaster.pid</></entry>
 114  <entry>A lock file recording the current server PID and shared memory
 115 segment ID (not present after server shutdown)</entry>
 116 </row>
 117
 118 </tbody>
 119 </tgroup>
 120 </table>
 121
 122 <para>
 123 For each database in the cluster there is a subdirectory within
 124 <varname>PGDATA</><filename>/base</>, named after the database's OID in
 125 <structname>pg_database</>.  This subdirectory is the default location
 126 for the database's files; in particular, its system catalogs are stored
 127 there.
 128 </para>
 129
 130 <para>
 131 Each table and index is stored in a separate file, named after the table
 132 or index's <firstterm>filenode</> number, which can be found in
 133 <structname>pg_class</>.<structfield>relfilenode</>. In addition to the
 134 main file (also known as the main fork), a <firstterm>free space map</> (see
 135 <xref linkend="storage-fsm">), which stores information about free space
 136 available in the relation, is stored in a file named after the filenode
 137 number, with the <literal>_fsm</> suffix.
 138 </para>
 139
 140 <caution>
 141 <para>
 142 Note that while a table's filenode often matches its OID, this is
 143 <emphasis>not</> necessarily the case; some operations, like
 144 <command>TRUNCATE</>, <command>REINDEX</>, <command>CLUSTER</> and some forms
 145 of <command>ALTER TABLE</>, can change the filenode while preserving the OID.
 146 Avoid assuming that filenode and table OID are the same.
 147 </para>
 148 </caution>
 149
 150 <para>
 151 When a table or index exceeds 1 GB, it is divided into gigabyte-sized
 152 <firstterm>segments</>.  The first segment's file name is the same as the
 153 filenode; subsequent segments are named filenode.1, filenode.2, etc.
 154 This arrangement avoids problems on platforms that have file size limitations.
 155 (Actually, 1 GB is just the default segment size.  The segment size can be
 156 adjusted using the configuration option <option>--with-segsize</option>
 157 when building <productname>PostgreSQL</>.)
 158 The contents of tables and indexes are discussed further in
 159 <xref linkend="storage-page-layout">.
 160 </para>
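<para>
The segment naming rule above can be sketched as follows. This is only an
illustration, not code from the server: the function name
<literal>segment_file_name</literal> is hypothetical, and the sketch assumes
the default 1 GB segment size and an 8 kB page size.
</para>

```python
SEGMENT_SIZE = 1024 ** 3   # default segment size of 1 GB, as stated above
BLOCK_SIZE = 8192          # assumed page size of 8 kB


def segment_file_name(filenode, block_number,
                      block_size=BLOCK_SIZE, segment_size=SEGMENT_SIZE):
    """Return the name of the segment file that holds the given block.

    Hypothetical helper: the first segment is named after the filenode,
    later segments get a ".N" suffix.
    """
    blocks_per_segment = segment_size // block_size
    segment = block_number // blocks_per_segment
    if segment == 0:
        return str(filenode)
    return "%d.%d" % (filenode, segment)
```

<para>
With these assumptions a segment holds 131072 pages, so page 131072 of
filenode 16384 is the first page of the file <filename>16384.1</filename>.
</para>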
 161
 162 <para>
 163 A table that has columns with potentially large entries will have an
 164 associated <firstterm>TOAST</> table, which is used for out-of-line storage of
 165 field values that are too large to keep in the table rows proper.
 166 <structname>pg_class</>.<structfield>reltoastrelid</> links from a table to
 167 its <acronym>TOAST</> table, if any.
 168 See <xref linkend="storage-toast"> for more information.
 169 </para>
 170
 171 <para>
 172 Tablespaces make the scenario more complicated.  Each user-defined tablespace
 173 has a symbolic link inside the <varname>PGDATA</><filename>/pg_tblspc</>
 174 directory, which points to the physical tablespace directory (as specified in
 175 its <command>CREATE TABLESPACE</> command).  The symbolic link is named after
 176 the tablespace's OID.  Inside the physical tablespace directory there is
 177 a subdirectory for each database that has elements in the tablespace, named
 178 after the database's OID.  Tables within that directory follow the filenode
 179 naming scheme.  The <literal>pg_default</> tablespace is not accessed through
 180 <filename>pg_tblspc</>, but corresponds to
 181 <varname>PGDATA</><filename>/base</>.  Similarly, the <literal>pg_global</>
 182 tablespace is not accessed through <filename>pg_tblspc</>, but corresponds to
 183 <varname>PGDATA</><filename>/global</>.
 184 </para>
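<para>
As a rough sketch of the rules just described, the path of a relation file
could be derived as shown below. The function
<literal>relation_path</literal> is hypothetical, and the OIDs used for
<literal>pg_default</literal> and <literal>pg_global</literal> are
assumptions of this example; they should be checked against
<structname>pg_tablespace</structname> rather than taken from here.
</para>

```python
import posixpath

# Assumed OIDs of the two built-in tablespaces; verify in pg_tablespace.
PG_DEFAULT_OID = 1663
PG_GLOBAL_OID = 1664


def relation_path(pgdata, tablespace_oid, database_oid, filenode):
    """Build the path of a relation's main file (hypothetical helper)."""
    if tablespace_oid == PG_GLOBAL_OID:
        # Cluster-wide tables live directly in PGDATA/global.
        return posixpath.join(pgdata, "global", str(filenode))
    if tablespace_oid == PG_DEFAULT_OID:
        # pg_default corresponds to PGDATA/base/<database OID>.
        return posixpath.join(pgdata, "base", str(database_oid),
                              str(filenode))
    # User-defined tablespace: reached through the symbolic link
    # PGDATA/pg_tblspc/<tablespace OID>.
    return posixpath.join(pgdata, "pg_tblspc", str(tablespace_oid),
                          str(database_oid), str(filenode))
```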
 185
 186 <para>
 187 Temporary files (for operations such as sorting more data than can fit in
 188 memory) are created within <varname>PGDATA</><filename>/base/pgsql_tmp</>,
 189 or within a <filename>pgsql_tmp</> subdirectory of a tablespace directory
 190 if a tablespace other than <literal>pg_default</> is specified for them.
 191 The name of a temporary file has the form
 192 <filename>pgsql_tmp<replaceable>PPP</>.<replaceable>NNN</></filename>,
 193 where <replaceable>PPP</> is the PID of the owning backend and
 194 <replaceable>NNN</> distinguishes different files of that backend.
 195 </para>
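<para>
For illustration, a temporary file name of the form described above can be
taken apart as in the following sketch. The helper
<literal>parse_temp_file_name</literal> is hypothetical and assumes that both
components are plain decimal numbers.
</para>

```python
import re

# pgsql_tmpPPP.NNN, where PPP is the backend PID and NNN a per-backend number
# (both assumed to be decimal digits).
_TEMP_FILE_RE = re.compile(r"^pgsql_tmp(\d+)\.(\d+)$")


def parse_temp_file_name(name):
    """Return (backend_pid, file_number), or None if the name does not match."""
    match = _TEMP_FILE_RE.match(name)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))
```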
 196
 197 </sect1>
 198
 199 <sect1 id="storage-toast">
 200
 201 <title>TOAST</title>
 202
 203     <indexterm>
 204      <primary>TOAST</primary>
 205     </indexterm>
 206     <indexterm><primary>sliced bread</><see>TOAST</></indexterm>
 207
 208 <para>
 209 This section provides an overview of <acronym>TOAST</> (The
 210 Oversized-Attribute Storage Technique).
 211 </para>
 212
 213 <para>
 214 <productname>PostgreSQL</productname> uses a fixed page size (commonly
 215 8 kB), and does not allow tuples to span multiple pages.  Therefore, it is
 216 not possible to store very large field values directly.  To overcome
 217 this limitation, large field values are compressed and/or broken up into
 218 multiple physical rows.  This happens transparently to the user, with only
 219 small impact on most of the backend code.  The technique is affectionately
 220 known as <acronym>TOAST</> (or <quote>the best thing since sliced bread</>).
 221 </para>
 222
 223 <para>
 224 Only certain data types support <acronym>TOAST</> &mdash; there is no need to
 225 impose the overhead on data types that cannot produce large field values.
 226 To support <acronym>TOAST</>, a data type must have a variable-length
 227 (<firstterm>varlena</>) representation, in which the first 32-bit word of any
 228 stored value contains the total length of the value in bytes (including
 229 itself).  <acronym>TOAST</> does not constrain the rest of the representation.
 230 All the C-level functions supporting a <acronym>TOAST</>-able data type must
 231 be careful to handle <acronym>TOAST</>ed input values.  (This is normally done
 232 by invoking <function>PG_DETOAST_DATUM</> before doing anything with an input
 233 value, but in some cases more efficient approaches are possible.)
 234 </para>
 235
 236 <para>
 237 <acronym>TOAST</> usurps two bits of the varlena length word (the high-order
 238 bits on big-endian machines, the low-order bits on little-endian machines),
 239 thereby limiting the logical size of any value of a <acronym>TOAST</>-able
 240 data type to 1 GB (2<superscript>30</> - 1 bytes).  When both bits are zero,
 241 the value is an ordinary un-<acronym>TOAST</>ed value of the data type, and
 242 the remaining bits of the length word give the total datum size (including
 243 length word) in bytes.  When the highest-order or lowest-order bit is set,
 244 the value has only a single-byte header instead of the normal four-byte
 245 header, and the remaining bits give the total datum size (including length
 246 byte) in bytes.  As a special case, if the remaining bits are all zero
 247 (which would be impossible for a self-inclusive length), the value is a
 248 pointer to out-of-line data stored in a separate TOAST table.  (The size of
 249 a TOAST pointer is given in the second byte of the datum.)
 250 Values with single-byte headers aren't aligned on any particular
 251 boundary, either.  Lastly, when the highest-order or lowest-order bit is
 252 clear but the adjacent bit is set, the content of the datum has been
 253 compressed and must be decompressed before use.  In this case the remaining
 254 bits of the length word give the total size of the compressed datum, not the
 255 original data.  Note that compression is also possible for out-of-line data
 256 but the varlena header does not tell whether it has occurred &mdash;
 257 the content of the TOAST pointer tells that, instead.
 258 </para>
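<para>
The header rules above can be illustrated with a small decoder. This is a
sketch only: it assumes a little-endian machine (so the two flag bits are the
low-order bits of the first byte), and the function name
<literal>classify_varlena</literal> and the labels it returns are invented for
this example.
</para>

```python
import struct


def classify_varlena(datum):
    """Classify a varlena datum from its header (little-endian sketch)."""
    first = datum[0]
    if first & 0x01:
        # Lowest-order bit set: single-byte header.
        if first == 0x01:
            # Remaining bits all zero: pointer to out-of-line data; its
            # size is given in the second byte of the datum.
            return ("toast_pointer", datum[1])
        # Remaining seven bits give the total size, including the header byte.
        return ("short_header", first >> 1)
    # Four-byte header: the upper 30 bits give the total datum size.
    (word,) = struct.unpack_from("<I", datum)
    size = word >> 2
    if word & 0x02:
        # Lowest bit clear, adjacent bit set: compressed in line.
        return ("compressed", size)
    return ("plain", size)
```

<para>
For example, a four-byte header word of 48 (binary 110000) has both flag bits
clear and describes an ordinary datum of 12 bytes in total.
</para>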
 259
 260 <para>
 261 If any of the columns of a table are <acronym>TOAST</>-able, the table will
 262 have an associated <acronym>TOAST</> table, whose OID is stored in the table's
 263 <structname>pg_class</>.<structfield>reltoastrelid</> entry.  Out-of-line
 264 <acronym>TOAST</>ed values are kept in the <acronym>TOAST</> table, as
 265 described in more detail below.
 266 </para>
 267
 268 <para>
 269 The compression technique used is a fairly simple and very fast member
 270 of the LZ family of compression techniques.  See
 271 <filename>src/backend/utils/adt/pg_lzcompress.c</> for the details.
 272 </para>
 273
 274 <para>
 275 Out-of-line values are divided (after compression if used) into chunks of at
 276 most <symbol>TOAST_MAX_CHUNK_SIZE</> bytes (by default this value is chosen
 277 so that four chunk rows will fit on a page, making it about 2000 bytes).
 278 Each chunk is stored
 279 as a separate row in the <acronym>TOAST</> table for the owning table.  Every
 280 <acronym>TOAST</> table has the columns <structfield>chunk_id</> (an OID
 281 identifying the particular <acronym>TOAST</>ed value),
 282 <structfield>chunk_seq</> (a sequence number for the chunk within its value),
 283 and <structfield>chunk_data</> (the actual data of the chunk).  A unique index
 284 on <structfield>chunk_id</> and <structfield>chunk_seq</> provides fast
 285 retrieval of the values.  A pointer datum representing an out-of-line
 286 <acronym>TOAST</>ed value therefore needs to store the OID of the
 287 <acronym>TOAST</> table in which to look and the OID of the specific value
 288 (its <structfield>chunk_id</>).  For convenience, pointer datums also store the
 289 logical datum size (original uncompressed data length) and actual stored size
 290 (different if compression was applied).  Allowing for the varlena header bytes,
 291 the total size of a <acronym>TOAST</> pointer datum is therefore 18 bytes
 292 regardless of the actual size of the represented value.
 293 </para>
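<para>
The chunking scheme can be sketched as follows. The code is illustrative
only: the chunk size of 2000 bytes stands in for
<symbol>TOAST_MAX_CHUNK_SIZE</symbol>, numbering chunks from zero is an
assumption of the example, and the function names are hypothetical.
</para>

```python
def toast_chunks(chunk_id, value, max_chunk_size=2000):
    """Split a value into (chunk_id, chunk_seq, chunk_data) rows."""
    rows = []
    for seq, offset in enumerate(range(0, len(value), max_chunk_size)):
        rows.append((chunk_id, seq, value[offset:offset + max_chunk_size]))
    return rows


def reassemble(rows):
    """Rebuild the value by concatenating chunks in chunk_seq order."""
    ordered = sorted(rows, key=lambda row: row[1])
    return b"".join(data for _, _, data in ordered)
```

<para>
A 5000-byte value would thus be stored as three rows holding 2000, 2000, and
1000 bytes, all sharing one <structfield>chunk_id</structfield>.
</para>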
 294
 295 <para>
 296 The <acronym>TOAST</> code is triggered only
 297 when a row value to be stored in a table is wider than
 298 <symbol>TOAST_TUPLE_THRESHOLD</> bytes (normally 2 kB).
 299 The <acronym>TOAST</> code will compress and/or move
 300 field values out-of-line until the row value is shorter than
 301 <symbol>TOAST_TUPLE_TARGET</> bytes (also normally 2 kB)
 302 or no more gains can be had.  During an UPDATE
 303 operation, values of unchanged fields are normally preserved as-is; so an
 304 UPDATE of a row with out-of-line values incurs no <acronym>TOAST</> costs if
 305 none of the out-of-line values change.
 306 </para>
 307
 308 <para>
 309 The <acronym>TOAST</> code recognizes four different strategies for storing
 310 <acronym>TOAST</>-able columns:
 311
 312    <itemizedlist>
 313     <listitem>
 314      <para>
 315       <literal>PLAIN</literal> prevents either compression or
 316       out-of-line storage; furthermore it disables use of single-byte headers
 317       for varlena types.
 318       This is the only possible strategy for
 319       columns of non-<acronym>TOAST</>-able data types.
 320      </para>
 321     </listitem>
 322     <listitem>
 323      <para>
 324       <literal>EXTENDED</literal> allows both compression and out-of-line
 325       storage.  This is the default for most <acronym>TOAST</>-able data types.
 326       Compression will be attempted first, then out-of-line storage if
 327       the row is still too big.
 328      </para>
 329     </listitem>
 330     <listitem>
 331      <para>
 332       <literal>EXTERNAL</literal> allows out-of-line storage but not
 333       compression.  Use of <literal>EXTERNAL</literal> will
 334       make substring operations on wide <type>text</type> and
 335       <type>bytea</type> columns faster (at the penalty of increased storage
 336       space) because these operations are optimized to fetch only the
 337       required parts of the out-of-line value when it is not compressed.
 338      </para>
 339     </listitem>
 340     <listitem>
 341      <para>
 342       <literal>MAIN</literal> allows compression but not out-of-line
 343       storage.  (Actually, out-of-line storage will still be performed
 344       for such columns, but only as a last resort when there is no other
 345       way to make the row small enough.)
 346      </para>
 347     </listitem>
 348    </itemizedlist>
 349
 350 Each <acronym>TOAST</>-able data type specifies a default strategy for columns
 351 of that data type, but the strategy for a given table column can be altered
 352 with <command>ALTER TABLE SET STORAGE</>.
 353 </para>
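<para>
The four strategies can be summarized as a small lookup table. This is merely
a restatement of the list above in code form; the names
<literal>StoragePolicy</literal>, <literal>may_compress</literal>, and
<literal>may_move_out_of_line</literal> are invented for the example and do
not correspond to anything in the server.
</para>

```python
from collections import namedtuple

# out_of_line is one of "never", "normal", or "last resort".
StoragePolicy = namedtuple("StoragePolicy", "compression out_of_line")

POLICIES = {
    "PLAIN": StoragePolicy(False, "never"),
    "EXTENDED": StoragePolicy(True, "normal"),
    "EXTERNAL": StoragePolicy(False, "normal"),
    "MAIN": StoragePolicy(True, "last resort"),
}


def may_compress(strategy):
    """True if the strategy permits compression."""
    return POLICIES[strategy].compression


def may_move_out_of_line(strategy, last_resort=False):
    """True if the strategy permits out-of-line storage at this stage."""
    mode = POLICIES[strategy].out_of_line
    return mode == "normal" or (mode == "last resort" and last_resort)
```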
 354
 355 <para>
 356 This scheme has a number of advantages compared to a more straightforward
 357 approach such as allowing row values to span pages.  Assuming that queries are
 358 usually qualified by comparisons against relatively small key values, most of
 359 the work of the executor will be done using the main row entry. The big values
 360 of <acronym>TOAST</>ed attributes will only be pulled out (if selected at all)
 361 at the time the result set is sent to the client. Thus, the main table is much
 362 smaller and more of its rows fit in the shared buffer cache than would be the
 363 case without any out-of-line storage. Sort sets shrink also, and sorts will
 364 more often be done entirely in memory. A small test showed that a table
 365 containing typical HTML pages and their URLs was stored in about half of the
 366 raw data size including the <acronym>TOAST</> table, and that the main table
 367 contained only about 10% of the entire data (the URLs and some small HTML
 368 pages). There was no run time difference compared to an un-<acronym>TOAST</>ed
 369 comparison table, in which all the HTML pages were cut down to 7 kB to fit.
 370 </para>
 371
 372 </sect1>
 373
 374 <sect1 id="storage-fsm">
 375
 376 <title>Free Space Map</title>
 377
 378     <indexterm>
 379      <primary>Free Space Map</primary>
 380     </indexterm>
 381     <indexterm><primary>FSM</><see>Free Space Map</></indexterm>
 382
 383 <para>
 384 A Free Space Map is stored with every heap and index relation, except for
 385 hash indexes, to keep track of available space in the relation. It's stored
 386 alongside the main relation data, in a separate FSM relation fork, named after
 387 the relfilenode of the relation, but with a <literal>_fsm</> suffix. For example,
 388 if the relfilenode of a relation is 12345, the FSM is stored in a file called
 389 <filename>12345_fsm</>, in the same directory as the main relation file.
 390 </para>
 391
 392 <para>
 393 The Free Space Map is organized as a tree of <acronym>FSM</> pages. The
 394 bottom-level <acronym>FSM</> pages store the free space available on every
 395 heap (or index) page, using one byte to represent each heap page. The upper
 396 levels aggregate information from the lower levels.
 397 </para>
 398
 399 <para>
 400 Within each <acronym>FSM</> page is a binary tree, stored in an array with
 401 one byte per node. Each leaf node represents a heap page, or a lower level
 402 <acronym>FSM</> page. In each non-leaf node, the higher of its children's
 403 values is stored. The maximum value in the leaf nodes is therefore stored
 404 at the root.
 405 </para>
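<para>
A binary tree stored in an array, with each non-leaf node holding the higher
of its children's values, might look like the following sketch. It is not the
server's implementation: the array layout (children of node
<literal>i</literal> at <literal>2i+1</literal> and <literal>2i+2</literal>),
the padding to a power of two, and the leftmost-first search are choices made
for this example; see the README mentioned below for the real details.
</para>

```python
def build_fsm_tree(leaf_values):
    """Build an array-backed max tree with one byte per node."""
    n = 1
    while n < len(leaf_values):
        n *= 2                      # pad the leaf level to a power of two
    tree = bytearray(2 * n - 1)     # n - 1 inner nodes followed by n leaves
    tree[n - 1:n - 1 + len(leaf_values)] = bytes(leaf_values)
    for i in range(n - 2, -1, -1):
        # Each non-leaf node stores the higher of its children's values.
        tree[i] = max(tree[2 * i + 1], tree[2 * i + 2])
    return tree


def find_leaf(tree, needed):
    """Return the index of a leaf with value >= needed, or None."""
    if tree[0] < needed:
        return None                 # the root holds the maximum leaf value
    n = (len(tree) + 1) // 2
    i = 0
    while i < n - 1:                # descend until a leaf is reached
        left = 2 * i + 1
        i = left if tree[left] >= needed else left + 1
    return i - (n - 1)
```

<para>
Because the root holds the maximum, a request that cannot be satisfied is
rejected by looking at a single byte, and a successful search visits only one
node per tree level.
</para>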
 406
 407 <para>
 408 See <filename>src/backend/storage/freespace/README</> for more details on
 409 how the <acronym>FSM</> is structured, and how it's updated and searched.
 410 The <xref linkend="pgfreespacemap"> contrib module can be used to view the
 411 information stored in free space maps.
 412 </para>
 413
 414 </sect1>
 415
 416 <sect1 id="storage-page-layout">
 417
 418 <title>Database Page Layout</title>
 419
 420 <para>
 421 This section provides an overview of the page format used within
 422 <productname>PostgreSQL</productname> tables and indexes.<footnote>
 423   <para>
 424     Actually, index access methods need not use this page format.
 425     All the existing index methods do use this basic format,
 426     but the data kept on index metapages usually doesn't follow
 427     the item layout rules.
 428   </para>
 429 </footnote>
 430 Sequences and <acronym>TOAST</> tables are formatted just like a regular table.
 431 </para>
 432
 433 <para>
 434 In the following explanation, a
 435 <firstterm>byte</firstterm>
 436 is assumed to contain 8 bits.  In addition, the term
 437 <firstterm>item</firstterm>
 438 refers to an individual data value that is stored on a page.  In a table,
 439 an item is a row; in an index, an item is an index entry.
 440 </para>
 441
 442 <para>
 443 Every table and index is stored as an array of <firstterm>pages</> of a
 444 fixed size (usually 8 kB, although a different page size can be selected
 445 when compiling the server).  In a table, all the pages are logically
 446 equivalent, so a particular item (row) can be stored in any page.  In
 447 indexes, the first page is generally reserved as a <firstterm>metapage</>
 448 holding control information, and there can be different types of pages
 449 within the index, depending on the index access method.
 450 </para>
 451
 452 <para>
 453 <xref linkend="page-table"> shows the overall layout of a page.
 454 There are five parts to each page.
 455 </para>
 456
 457 <table tocentry="1" id="page-table">
 458 <title>Overall Page Layout</title>
 459 <titleabbrev>Page Layout</titleabbrev>
 460 <tgroup cols="2">
 461 <thead>
 462 <row>
 463 <entry>
 464 Item
 465 </entry>
 466 <entry>Description</entry>
 467 </row>
 468 </thead>
 469
 470 <tbody>
 471
 472 <row>
 473  <entry>PageHeaderData</entry>
 474  <entry>24 bytes long. Contains general information about the page, including
 475 free space pointers.</entry>
 476 </row>
 477
 478 <row>
 479 <entry>ItemIdData</entry>
 480 <entry>Array of (offset,length) pairs pointing to the actual items.
 481 4 bytes per item.</entry>
 482 </row>
 483
 484 <row>
 485 <entry>Free space</entry>
 486 <entry>The unallocated space. New item pointers are allocated from the start
 487 of this area, new items from the end.</entry>
 488 </row>
 489
 490 <row>
 491 <entry>Items</entry>
 492 <entry>The actual items themselves.</entry>
 493 </row>
 494
 495 <row>
 496 <entry>Special space</entry>
 497 <entry>Index access method specific data. Different methods store different
 498 data. Empty in ordinary tables.</entry>
 499 </row>
 500
 501 </tbody>
 502 </tgroup>
 503 </table>
 504
 505  <para>
 506
 507   The first 24 bytes of each page consist of a page header
 508   (PageHeaderData). Its format is detailed in <xref
 509   linkend="pageheaderdata-table">. The first two fields track the most
 510   recent WAL entry related to this page. Next is a 2-byte field
 511   containing flag bits. This is followed by three 2-byte integer fields
 512   (<structfield>pd_lower</structfield>, <structfield>pd_upper</structfield>,
 513   and <structfield>pd_special</structfield>). These contain byte offsets
 514   from the page start to the start
 515   of unallocated space, to the end of unallocated space, and to the start of
 516   the special space.
 517   The next 2 bytes of the page header,
 518   <structfield>pd_pagesize_version</structfield>, store both the page size
 519   and a version indicator.  Beginning with
 520   <productname>PostgreSQL</productname> 8.3 the version number is 4;
 521   <productname>PostgreSQL</productname> 8.1 and 8.2 used version number 3;
 522   <productname>PostgreSQL</productname> 8.0 used version number 2;
 523   <productname>PostgreSQL</productname> 7.3 and 7.4 used version number 1;
 524   prior releases used version number 0.
 525   (The basic page layout and header format have not changed in most of these
 526   versions, but the layout of heap row headers has.)  The page size
 527   is basically only present as a cross-check; there is no support for having
 528   more than one page size in an installation.
 529   The last field is a hint that shows whether pruning the page is likely
 530   to be profitable: it tracks the oldest un-pruned XMAX on the page.
 531
 532  </para>
 533
 534  <table tocentry="1" id="pageheaderdata-table">
 535  <title>PageHeaderData Layout</title>
 536  <titleabbrev>PageHeaderData Layout</titleabbrev>
 537  <tgroup cols="4">
 538  <thead>
 539   <row>
 540    <entry>Field</entry>
 541    <entry>Type</entry>
 542    <entry>Length</entry>
 543    <entry>Description</entry>
 544   </row>
 545  </thead>
 546  <tbody>
 547   <row>
 548    <entry>pd_lsn</entry>
 549    <entry>XLogRecPtr</entry>
 550    <entry>8 bytes</entry>
 551    <entry>LSN: next byte after last byte of xlog record for last change
 552    to this page</entry>
 553   </row>
 554   <row>
 555    <entry>pd_tli</entry>
 556    <entry>uint16</entry>
 557    <entry>2 bytes</entry>
 558    <entry>TimeLineID of last change (only its lowest 16 bits)</entry>
 559   </row>
 560   <row>
 561    <entry>pd_flags</entry>
 562    <entry>uint16</entry>
 563    <entry>2 bytes</entry>
 564    <entry>Flag bits</entry>
 565   </row>
 566   <row>
 567    <entry>pd_lower</entry>
 568    <entry>LocationIndex</entry>
 569    <entry>2 bytes</entry>
 570    <entry>Offset to start of free space</entry>
 571   </row>
 572   <row>
 573    <entry>pd_upper</entry>
 574    <entry>LocationIndex</entry>
 575    <entry>2 bytes</entry>
 576    <entry>Offset to end of free space</entry>
 577   </row>
 578   <row>
 579    <entry>pd_special</entry>
 580    <entry>LocationIndex</entry>
 581    <entry>2 bytes</entry>
 582    <entry>Offset to start of special space</entry>
 583   </row>
 584   <row>
 585    <entry>pd_pagesize_version</entry>
 586    <entry>uint16</entry>
 587    <entry>2 bytes</entry>
 588    <entry>Page size and layout version number information</entry>
 589   </row>
 590   <row>
 591    <entry>pd_prune_xid</entry>
 592    <entry>TransactionId</entry>
 593    <entry>4 bytes</entry>
 594    <entry>Oldest unpruned XMAX on page, or zero if none</entry>
 595   </row>
 596  </tbody>
 597  </tgroup>
 598  </table>
 599
 600  <para>
 601   All the details can be found in
 602   <filename>src/include/storage/bufpage.h</filename>.
 603  </para>
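 <para>
  As an illustration of the layout in the table above, the 24 header bytes
  of a raw page image could be unpacked as in the sketch below.  It assumes
  a little-endian machine and no padding between fields, and it treats
  <structfield>pd_lsn</structfield> as eight opaque bytes.  The split of
  <structfield>pd_pagesize_version</structfield> into a high part (page size)
  and a low byte (version) is likewise an assumption of this example; the
  header file named above is the authority.  The names
  <literal>PageHeader</literal> and <literal>parse_page_header</literal> are
  hypothetical.
 </para>

```python
import struct
from collections import namedtuple

# 8-byte pd_lsn, six 2-byte fields, one 4-byte field: 24 bytes in total.
PAGE_HEADER_FORMAT = "<8sHHHHHHI"

PageHeader = namedtuple(
    "PageHeader",
    "pd_lsn pd_tli pd_flags pd_lower pd_upper pd_special "
    "pd_pagesize_version pd_prune_xid")


def parse_page_header(page):
    """Unpack the first 24 bytes of a page image (little-endian sketch)."""
    return PageHeader(*struct.unpack_from(PAGE_HEADER_FORMAT, page))


def page_size_and_version(header):
    """Split pd_pagesize_version (assumed: size in high bits, version low)."""
    value = header.pd_pagesize_version
    return value & 0xFF00, value & 0x00FF
```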
 604
 605  <para>
 606
 607   Following the page header are item identifiers
 608   (<type>ItemIdData</type>), each requiring four bytes.
 609   An item identifier contains a byte-offset to
 610   the start of an item, its length in bytes, and a few attribute bits
 611   which affect its interpretation.
 612   New item identifiers are allocated
 613   as needed from the beginning of the unallocated space.
 614   The number of item identifiers present can be determined by looking at
 615   <structfield>pd_lower</>, which is increased to allocate a new identifier.
 616   Because an item
 617   identifier is never moved until it is freed, its index can be used on a
 618   long-term basis to reference an item, even when the item itself is moved
 619   around on the page to compact free space.  In fact, every pointer to an
 620   item (<type>ItemPointer</type>, also known as
 621   <type>CTID</type>) created by
 622   <productname>PostgreSQL</productname> consists of a page number and the
 623   index of an item identifier.
 624
 625  </para>
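 <para>
  Since the identifier array starts right after the 24-byte header and each
  identifier takes four bytes, the number of identifiers follows directly
  from <structfield>pd_lower</structfield>, as the sketch below shows.  The
  bit layout used in <literal>decode_item_id</literal> (15 bits of offset,
  then 2 attribute bits, then 15 bits of length, little-endian) is an
  assumption of this example that depends on the platform and compiler; all
  function names are hypothetical.
 </para>

```python
import struct

PAGE_HEADER_SIZE = 24   # size of PageHeaderData
ITEM_ID_SIZE = 4        # size of one ItemIdData


def item_count(pd_lower):
    """Number of item identifiers currently allocated on the page."""
    return (pd_lower - PAGE_HEADER_SIZE) // ITEM_ID_SIZE


def free_space(pd_lower, pd_upper):
    """Bytes of unallocated space between identifiers and items."""
    return pd_upper - pd_lower


def decode_item_id(raw):
    """Decode one 4-byte identifier into (offset, flags, length).

    Assumed layout: offset in bits 0-14, flags in bits 15-16,
    length in bits 17-31 of a little-endian 32-bit word.
    """
    (word,) = struct.unpack("<I", raw)
    return word & 0x7FFF, (word >> 15) & 0x3, (word >> 17) & 0x7FFF
```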
 626
 627  <para>
 628
 629   The items themselves are stored in space allocated backwards from the end
 630   of unallocated space.  The exact structure varies depending on what the
 631   table is to contain. Tables and sequences both use a structure named
 632   <type>HeapTupleHeaderData</type>, described below.
 633
 634  </para>
 635
 636  <para>
 637
 638   The final section is the <quote>special section</quote> which can
 639  contain anything the access method wishes to store.  For example,
 640   b-tree indexes store links to the page's left and right siblings,
 641   as well as some other data relevant to the index structure.
 642   Ordinary tables do not use a special section at all (indicated by setting
 643   <structfield>pd_special</> to equal the page size).
 644
 645  </para>
 646
 647  <para>
 648
 649   All table rows are structured in the same way. There is a fixed-size
 650   header (occupying 23 bytes on most machines), followed by an optional null
 651   bitmap, an optional object ID field, and the user data. The header is
 652   detailed
 653   in <xref linkend="heaptupleheaderdata-table">.  The actual user data
 654   (columns of the row) begins at the offset indicated by
 655   <structfield>t_hoff</>, which must always be a multiple of the MAXALIGN
 656   distance for the platform.
 657   The null bitmap is
 658   only present if the <firstterm>HEAP_HASNULL</firstterm> bit is set in
 659   <structfield>t_infomask</structfield>. If it is present it begins just after
 660   the fixed header and occupies enough bytes to have one bit per data column
 661   (that is, <structfield>t_natts</> bits altogether). In this list of bits, a
 662   1 bit indicates not-null, a 0 bit indicates null.  When the bitmap is not
 663   present, all columns are assumed not-null.
 664   The object ID is only present if the <firstterm>HEAP_HASOID</firstterm> bit
 665   is set in <structfield>t_infomask</structfield>.  If present, it appears just
 666   before the <structfield>t_hoff</> boundary.  Any padding needed to make
 667   <structfield>t_hoff</> a MAXALIGN multiple will appear between the null
 668   bitmap and the object ID.  (This in turn ensures that the object ID is
 669   suitably aligned.)
 670
 671  </para>
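 <para>
  The arithmetic behind <structfield>t_hoff</structfield> can be sketched as
  follows.  The example assumes a 23-byte fixed header, a 4-byte object ID,
  and a MAXALIGN distance of 8 bytes unless stated otherwise; the bit order
  within each bitmap byte (lowest bit first) is also an assumption.  The
  function names are hypothetical.
 </para>

```python
FIXED_HEADER_SIZE = 23   # fixed-size part of the row header
OID_SIZE = 4             # assumed size of the optional object ID


def maxalign(size, alignment=8):
    """Round size up to the next multiple of the MAXALIGN distance."""
    return (size + alignment - 1) & ~(alignment - 1)


def null_bitmap_length(natts):
    """Bytes needed for one bit per data column."""
    return (natts + 7) // 8


def header_offset(natts, has_nulls, has_oid, alignment=8):
    """Compute the offset at which user data begins (t_hoff)."""
    size = FIXED_HEADER_SIZE
    if has_nulls:
        size += null_bitmap_length(natts)
    if has_oid:
        size += OID_SIZE
    return maxalign(size, alignment)


def is_null(bitmap, attnum):
    """True if the zero-based column attnum is null (0 bit means null)."""
    return not (bitmap[attnum >> 3] & (1 << (attnum & 0x07)))
```

 <para>
  Under these assumptions a row with up to eight columns and a null bitmap
  still has its user data at offset 24, whereas a ninth column pushes the
  offset to 32.
 </para>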
 672
 673  <table tocentry="1" id="heaptupleheaderdata-table">
 674  <title>HeapTupleHeaderData Layout</title>
 675  <titleabbrev>HeapTupleHeaderData Layout</titleabbrev>
 676  <tgroup cols="4">
 677  <thead>
 678   <row>
 679    <entry>Field</entry>
 680    <entry>Type</entry>
 681    <entry>Length</entry>
 682    <entry>Description</entry>
 683   </row>
 684  </thead>
 685  <tbody>
 686   <row>
 687    <entry>t_xmin</entry>
 688    <entry>TransactionId</entry>
 689    <entry>4 bytes</entry>
 690    <entry>insert XID stamp</entry>
 691   </row>
 692   <row>
 693    <entry>t_xmax</entry>
 694    <entry>TransactionId</entry>
 695    <entry>4 bytes</entry>
 696    <entry>delete XID stamp</entry>
 697   </row>
 698   <row>
 699    <entry>t_cid</entry>
 700    <entry>CommandId</entry>
 701    <entry>4 bytes</entry>
 702    <entry>insert and/or delete CID stamp (overlays with t_xvac)</entry>
 703   </row>
 704   <row>
 705    <entry>t_xvac</entry>
 706    <entry>TransactionId</entry>
 707    <entry>4 bytes</entry>
 708    <entry>XID for VACUUM operation moving a row version</entry>
 709   </row>
 710   <row>
 711    <entry>t_ctid</entry>
 712    <entry>ItemPointerData</entry>
 713    <entry>6 bytes</entry>
 714    <entry>current TID of this or newer row version</entry>
 715   </row>
 716   <row>
 717    <entry>t_infomask2</entry>
 718    <entry>int16</entry>
 719    <entry>2 bytes</entry>
 720    <entry>number of attributes, plus various flag bits</entry>
 721   </row>
 722   <row>
 723    <entry>t_infomask</entry>
 724    <entry>uint16</entry>
 725    <entry>2 bytes</entry>
 726    <entry>various flag bits</entry>
 727   </row>
 728   <row>
 729    <entry>t_hoff</entry>
 730    <entry>uint8</entry>
 731    <entry>1 byte</entry>
 732    <entry>offset to user data</entry>
 733   </row>
 734  </tbody>
 735  </tgroup>
 736  </table>
 737
 738  <para>
 739    All the details can be found in
 740    <filename>src/include/access/htup.h</filename>.
 741  </para>
 742
 743  <para>
 744
 745   Interpreting the actual data can only be done with information obtained
 746   from other tables, mostly <structname>pg_attribute</structname>. The
 747   key values needed to identify field locations are
 748   <structfield>attlen</structfield> and <structfield>attalign</structfield>.
 749   There is no way to directly get a
 750   particular attribute, except when there are only fixed-width fields and no
 751   null values. All this trickery is wrapped up in the functions
 752   <firstterm>heap_getattr</firstterm>, <firstterm>fastgetattr</firstterm>
 753   and <firstterm>heap_getsysattr</firstterm>.
 754
 755  </para>
 756  <para>
 757
 758   To read the data you need to examine each attribute in turn. First check
 759   whether the field is NULL according to the null bitmap. If it is, go to
 760   the next. Then make sure you have the right alignment.  If the field is a
 761   fixed-width field, its bytes are simply stored in place. If it's a
 762   variable-length field (attlen = -1) then it's a bit more complicated.
 763   All variable-length data types share the common header structure
 764   <type>struct varlena</type>, which includes the total length of the stored
 765   value and some flag bits.  Depending on the flags, the data can be either
 766   inline or in a <acronym>TOAST</> table;
 767   it might be compressed, too (see <xref linkend="storage-toast">).
 768
 769  </para>
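 <para>
  For rows that contain only fixed-width columns, the walk described above
  can be sketched as follows.  The mapping from
  <structfield>attalign</structfield> codes to byte counts is an assumption
  of this example (in particular, 8 bytes for <literal>d</literal> is
  platform dependent), and <literal>fixed_width_offsets</literal> is a
  hypothetical helper, not one of the server functions named earlier.
  Variable-length columns are deliberately left out.
 </para>

```python
# Assumed alignment, in bytes, for each attalign code.
ALIGNMENTS = {"c": 1, "s": 2, "i": 4, "d": 8}


def align_up(offset, alignment):
    """Round offset up to the next multiple of alignment."""
    return (offset + alignment - 1) // alignment * alignment


def fixed_width_offsets(attributes, null_flags):
    """Compute the data offset of each column, relative to t_hoff.

    attributes is a list of (attlen, attalign) pairs and null_flags a
    parallel list of booleans.  Null columns occupy no space and are
    reported as None.
    """
    offset = 0
    offsets = []
    for (attlen, attalign), is_null in zip(attributes, null_flags):
        if is_null:
            offsets.append(None)
            continue
        if attlen <= 0:
            raise ValueError("variable-length columns are not handled")
        offset = align_up(offset, ALIGNMENTS[attalign])
        offsets.append(offset)
        offset += attlen
    return offsets
```

 <para>
  Note how a null value changes the position of every later column, which
  is why a particular attribute cannot be located directly once nulls are
  present.
 </para>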
 770 </sect1>
 771
 772 </chapter>