1 <!-- $PostgreSQL$ -->
3 <chapter id="storage">
5 <title>Database Physical Storage</title>
7 <para>
8 This chapter provides an overview of the physical storage format used by
9 <productname>PostgreSQL</productname> databases.
10 </para>
12 <sect1 id="storage-file-layout">
14 <title>Database File Layout</title>
16 <para>
17 This section describes the storage format at the level of files and
18 directories.
19 </para>
21 <para>
22 All the data needed for a database cluster is stored within the cluster's data
23 directory, commonly referred to as <varname>PGDATA</> (after the name of the
24 environment variable that can be used to define it). A common location for
25 <varname>PGDATA</> is <filename>/var/lib/pgsql/data</>. Multiple clusters,
26 managed by different server instances, can exist on the same machine.
27 </para>
29 <para>
30 The <varname>PGDATA</> directory contains several subdirectories and control
31 files, as shown in <xref linkend="pgdata-contents-table">. In addition to
32 these required items, the cluster configuration files
33 <filename>postgresql.conf</filename>, <filename>pg_hba.conf</filename>, and
34 <filename>pg_ident.conf</filename> are traditionally stored in
35 <varname>PGDATA</> (although in <productname>PostgreSQL</productname> 8.0 and
36 later, it is possible to keep them elsewhere).
37 </para>
39 <table tocentry="1" id="pgdata-contents-table">
40 <title>Contents of <varname>PGDATA</></title>
41 <tgroup cols="2">
42 <thead>
43 <row>
44 <entry>
45 Item
46 </entry>
47 <entry>Description</entry>
48 </row>
49 </thead>
51 <tbody>
53 <row>
54 <entry><filename>PG_VERSION</></entry>
55 <entry>A file containing the major version number of <productname>PostgreSQL</productname></entry>
56 </row>
58 <row>
59 <entry><filename>base</></entry>
60 <entry>Subdirectory containing per-database subdirectories</entry>
61 </row>
63 <row>
64 <entry><filename>global</></entry>
65 <entry>Subdirectory containing cluster-wide tables, such as
66 <structname>pg_database</></entry>
67 </row>
69 <row>
70 <entry><filename>pg_clog</></entry>
71 <entry>Subdirectory containing transaction commit status data</entry>
72 </row>
74 <row>
75 <entry><filename>pg_multixact</></entry>
76 <entry>Subdirectory containing multitransaction status data
77 (used for shared row locks)</entry>
78 </row>
80 <row>
81 <entry><filename>pg_stat_tmp</></entry>
82 <entry>Subdirectory containing temporary files for the statistics
83 subsystem</entry>
84 </row>
86 <row>
87 <entry><filename>pg_subtrans</></entry>
88 <entry>Subdirectory containing subtransaction status data</entry>
89 </row>
91 <row>
92 <entry><filename>pg_tblspc</></entry>
93 <entry>Subdirectory containing symbolic links to tablespaces</entry>
94 </row>
96 <row>
97 <entry><filename>pg_twophase</></entry>
98 <entry>Subdirectory containing state files for prepared transactions</entry>
99 </row>
101 <row>
102 <entry><filename>pg_xlog</></entry>
103 <entry>Subdirectory containing WAL (Write Ahead Log) files</entry>
104 </row>
106 <row>
107 <entry><filename>postmaster.opts</></entry>
108 <entry>A file recording the command-line options the server was
109 last started with</entry>
110 </row>
112 <row>
113 <entry><filename>postmaster.pid</></entry>
114 <entry>A lock file recording the current server PID and shared memory
115 segment ID (not present after server shutdown)</entry>
116 </row>
118 </tbody>
119 </tgroup>
120 </table>
122 <para>
123 For each database in the cluster there is a subdirectory within
124 <varname>PGDATA</><filename>/base</>, named after the database's OID in
125 <structname>pg_database</>. This subdirectory is the default location
126 for the database's files; in particular, its system catalogs are stored
127 there.
128 </para>
<para>
Each table and index is stored in a separate file, named after the table
or index's <firstterm>filenode</> number, which can be found in
<structname>pg_class</>.<structfield>relfilenode</>. In addition to the
main file (also known as the main fork), each relation has a
<firstterm>free space map</> (see <xref linkend="storage-fsm">), which
stores information about the free space available in the relation. The
free space map is stored in a file named after the filenode number, with
the <literal>_fsm</> suffix.
</para>
140 <caution>
141 <para>
142 Note that while a table's filenode often matches its OID, this is
143 <emphasis>not</> necessarily the case; some operations, like
144 <command>TRUNCATE</>, <command>REINDEX</>, <command>CLUSTER</> and some forms
145 of <command>ALTER TABLE</>, can change the filenode while preserving the OID.
146 Avoid assuming that filenode and table OID are the same.
147 </para>
148 </caution>
150 <para>
151 When a table or index exceeds 1 GB, it is divided into gigabyte-sized
152 <firstterm>segments</>. The first segment's file name is the same as the
153 filenode; subsequent segments are named filenode.1, filenode.2, etc.
154 This arrangement avoids problems on platforms that have file size limitations.
155 (Actually, 1 GB is just the default segment size. The segment size can be
156 adjusted using the configuration option <option>--with-segsize</option>
157 when building <productname>PostgreSQL</>.)
158 The contents of tables and indexes are discussed further in
159 <xref linkend="storage-page-layout">.
160 </para>
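<para>
The segment naming rule can be illustrated with a short sketch. It assumes
the default 1 GB segment size; the function names
<function>segment_file_name</> and <function>locate_byte</> are
hypothetical helpers written for this illustration and are not part of
<productname>PostgreSQL</>.
</para>

```python
SEGMENT_SIZE = 1 << 30  # 1 GB, the default segment size named in the text


def segment_file_name(filenode: int, segment: int) -> str:
    """Return the file name of one segment of a relation's main fork."""
    if segment < 0:
        raise ValueError("segment number must be non-negative")
    # The first segment has no suffix; later ones are filenode.1, filenode.2, ...
    return str(filenode) if segment == 0 else f"{filenode}.{segment}"


def locate_byte(filenode: int, offset: int, segment_size: int = SEGMENT_SIZE):
    """Map a byte offset in the relation to (segment file name, offset in that file)."""
    segment, offset_in_file = divmod(offset, segment_size)
    return segment_file_name(filenode, segment), offset_in_file
```

<para>
Passing a different <literal>segment_size</> models a server built with a
non-default <option>--with-segsize</option> setting.
</para>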
162 <para>
163 A table that has columns with potentially large entries will have an
164 associated <firstterm>TOAST</> table, which is used for out-of-line storage of
165 field values that are too large to keep in the table rows proper.
166 <structname>pg_class</>.<structfield>reltoastrelid</> links from a table to
167 its <acronym>TOAST</> table, if any.
168 See <xref linkend="storage-toast"> for more information.
169 </para>
171 <para>
172 Tablespaces make the scenario more complicated. Each user-defined tablespace
173 has a symbolic link inside the <varname>PGDATA</><filename>/pg_tblspc</>
174 directory, which points to the physical tablespace directory (as specified in
175 its <command>CREATE TABLESPACE</> command). The symbolic link is named after
176 the tablespace's OID. Inside the physical tablespace directory there is
177 a subdirectory for each database that has elements in the tablespace, named
178 after the database's OID. Tables within that directory follow the filenode
179 naming scheme. The <literal>pg_default</> tablespace is not accessed through
180 <filename>pg_tblspc</>, but corresponds to
181 <varname>PGDATA</><filename>/base</>. Similarly, the <literal>pg_global</>
182 tablespace is not accessed through <filename>pg_tblspc</>, but corresponds to
183 <varname>PGDATA</><filename>/global</>.
184 </para>
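<para>
As a rough sketch of the directory rules just described, the following
helper builds the path of a relation's first segment file. The function
<function>relation_path</> and its parameters are hypothetical and exist
only for illustration; the OIDs in any call are made-up values.
</para>

```python
import posixpath

# Names used in the text for the two built-in tablespaces.
PG_DEFAULT = "pg_default"
PG_GLOBAL = "pg_global"


def relation_path(pgdata, tablespace, database_oid, filenode, tablespace_oid=None):
    """Return the path of a relation's first segment file.

    `tablespace` is PG_DEFAULT, PG_GLOBAL, or any other name for a
    user-defined tablespace, in which case `tablespace_oid` is required.
    """
    if tablespace == PG_GLOBAL:
        # pg_global corresponds to PGDATA/global; there is no per-database level.
        return posixpath.join(pgdata, "global", str(filenode))
    if tablespace == PG_DEFAULT:
        # pg_default corresponds to PGDATA/base/<database OID>.
        return posixpath.join(pgdata, "base", str(database_oid), str(filenode))
    if tablespace_oid is None:
        raise ValueError("a user-defined tablespace needs its OID")
    # PGDATA/pg_tblspc/<tablespace OID> is a symbolic link to the physical
    # tablespace directory, which holds one subdirectory per database OID.
    return posixpath.join(
        pgdata, "pg_tblspc", str(tablespace_oid), str(database_oid), str(filenode)
    )
```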
186 <para>
187 Temporary files (for operations such as sorting more data than can fit in
188 memory) are created within <varname>PGDATA</><filename>/base/pgsql_tmp</>,
189 or within a <filename>pgsql_tmp</> subdirectory of a tablespace directory
190 if a tablespace other than <literal>pg_default</> is specified for them.
191 The name of a temporary file has the form
192 <filename>pgsql_tmp<replaceable>PPP</>.<replaceable>NNN</></filename>,
193 where <replaceable>PPP</> is the PID of the owning backend and
194 <replaceable>NNN</> distinguishes different files of that backend.
195 </para>
197 </sect1>
199 <sect1 id="storage-toast">
201 <title>TOAST</title>
203 <indexterm>
204 <primary>TOAST</primary>
205 </indexterm>
206 <indexterm><primary>sliced bread</><see>TOAST</></indexterm>
208 <para>
209 This section provides an overview of <acronym>TOAST</> (The
210 Oversized-Attribute Storage Technique).
211 </para>
213 <para>
214 <productname>PostgreSQL</productname> uses a fixed page size (commonly
215 8 kB), and does not allow tuples to span multiple pages. Therefore, it is
216 not possible to store very large field values directly. To overcome
217 this limitation, large field values are compressed and/or broken up into
218 multiple physical rows. This happens transparently to the user, with only
219 small impact on most of the backend code. The technique is affectionately
220 known as <acronym>TOAST</> (or <quote>the best thing since sliced bread</>).
221 </para>
223 <para>
224 Only certain data types support <acronym>TOAST</> &mdash; there is no need to
225 impose the overhead on data types that cannot produce large field values.
226 To support <acronym>TOAST</>, a data type must have a variable-length
227 (<firstterm>varlena</>) representation, in which the first 32-bit word of any
228 stored value contains the total length of the value in bytes (including
229 itself). <acronym>TOAST</> does not constrain the rest of the representation.
230 All the C-level functions supporting a <acronym>TOAST</>-able data type must
231 be careful to handle <acronym>TOAST</>ed input values. (This is normally done
232 by invoking <function>PG_DETOAST_DATUM</> before doing anything with an input
233 value, but in some cases more efficient approaches are possible.)
234 </para>
236 <para>
237 <acronym>TOAST</> usurps two bits of the varlena length word (the high-order
238 bits on big-endian machines, the low-order bits on little-endian machines),
239 thereby limiting the logical size of any value of a <acronym>TOAST</>-able
240 data type to 1 GB (2<superscript>30</> - 1 bytes). When both bits are zero,
241 the value is an ordinary un-<acronym>TOAST</>ed value of the data type, and
242 the remaining bits of the length word give the total datum size (including
243 length word) in bytes. When the highest-order or lowest-order bit is set,
244 the value has only a single-byte header instead of the normal four-byte
245 header, and the remaining bits give the total datum size (including length
246 byte) in bytes. As a special case, if the remaining bits are all zero
247 (which would be impossible for a self-inclusive length), the value is a
248 pointer to out-of-line data stored in a separate TOAST table. (The size of
249 a TOAST pointer is given in the second byte of the datum.)
250 Values with single-byte headers aren't aligned on any particular
251 boundary, either. Lastly, when the highest-order or lowest-order bit is
252 clear but the adjacent bit is set, the content of the datum has been
253 compressed and must be decompressed before use. In this case the remaining
254 bits of the length word give the total size of the compressed datum, not the
255 original data. Note that compression is also possible for out-of-line data
256 but the varlena header does not tell whether it has occurred &mdash;
257 the content of the TOAST pointer tells that, instead.
258 </para>
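<para>
The header rules above can be sketched as a small classifier. This is only
an illustration under stated assumptions: it covers the little-endian case
(where the two flag bits are the low-order bits of the length word), and
the function <function>describe_varlena</> and its result labels are
hypothetical names, not identifiers from the server source.
</para>

```python
import struct


def describe_varlena(datum: bytes):
    """Classify a varlena datum as laid out on a little-endian machine.

    Returns (kind, size). For inline values, size is the total datum size
    including the header; for a TOAST pointer it is the value recorded in
    the second byte of the datum.
    """
    first = datum[0]
    if first & 0x01:
        # Lowest-order bit set: single-byte header, remaining 7 bits are the size.
        size = first >> 1
        if size == 0:
            # A self-inclusive length cannot be zero, so this marks a TOAST pointer.
            return ("toast_pointer", datum[1])
        return ("short_inline", size)
    # Four-byte header: the two low-order bits are flags, the other 30 the size.
    (word,) = struct.unpack_from("<I", datum, 0)
    size = word >> 2
    if word & 0x02:
        # Lowest bit clear, adjacent bit set: compressed inline data.
        return ("compressed_inline", size)
    return ("plain_inline", size)
```

<para>
Because only 30 bits remain for the size, the largest representable value
is 2<superscript>30</> - 1 bytes, which is the 1 GB limit mentioned above.
</para>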
260 <para>
261 If any of the columns of a table are <acronym>TOAST</>-able, the table will
262 have an associated <acronym>TOAST</> table, whose OID is stored in the table's
263 <structname>pg_class</>.<structfield>reltoastrelid</> entry. Out-of-line
264 <acronym>TOAST</>ed values are kept in the <acronym>TOAST</> table, as
265 described in more detail below.
266 </para>
268 <para>
269 The compression technique used is a fairly simple and very fast member
270 of the LZ family of compression techniques. See
271 <filename>src/backend/utils/adt/pg_lzcompress.c</> for the details.
272 </para>
274 <para>
275 Out-of-line values are divided (after compression if used) into chunks of at
276 most <symbol>TOAST_MAX_CHUNK_SIZE</> bytes (by default this value is chosen
277 so that four chunk rows will fit on a page, making it about 2000 bytes).
278 Each chunk is stored
279 as a separate row in the <acronym>TOAST</> table for the owning table. Every
280 <acronym>TOAST</> table has the columns <structfield>chunk_id</> (an OID
281 identifying the particular <acronym>TOAST</>ed value),
282 <structfield>chunk_seq</> (a sequence number for the chunk within its value),
283 and <structfield>chunk_data</> (the actual data of the chunk). A unique index
284 on <structfield>chunk_id</> and <structfield>chunk_seq</> provides fast
285 retrieval of the values. A pointer datum representing an out-of-line
286 <acronym>TOAST</>ed value therefore needs to store the OID of the
287 <acronym>TOAST</> table in which to look and the OID of the specific value
288 (its <structfield>chunk_id</>). For convenience, pointer datums also store the
289 logical datum size (original uncompressed data length) and actual stored size
290 (different if compression was applied). Allowing for the varlena header bytes,
291 the total size of a <acronym>TOAST</> pointer datum is therefore 18 bytes
292 regardless of the actual size of the represented value.
293 </para>
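<para>
The chunking scheme can be pictured with the following sketch. It is a
simplified model, not the server's code: the chunk size of 2000 bytes is a
round stand-in for <symbol>TOAST_MAX_CHUNK_SIZE</>, numbering
<structfield>chunk_seq</> from zero is an assumption of this example, and
<function>toast_chunks</> and <function>reassemble</> are hypothetical
helper names.
</para>

```python
TOAST_MAX_CHUNK_SIZE = 2000  # assumed round figure; the text says "about 2000 bytes"


def toast_chunks(chunk_id: int, data: bytes, chunk_size: int = TOAST_MAX_CHUNK_SIZE):
    """Split an out-of-line value into (chunk_id, chunk_seq, chunk_data) rows."""
    return [
        (chunk_id, seq, data[start:start + chunk_size])
        for seq, start in enumerate(range(0, len(data), chunk_size))
    ]


def reassemble(rows):
    """Rebuild the value by concatenating its chunks in chunk_seq order."""
    return b"".join(chunk for _, _, chunk in sorted(rows, key=lambda row: row[1]))
```

<para>
Sorting by <structfield>chunk_seq</> within one <structfield>chunk_id</>
mirrors what the unique index on those two columns provides.
</para>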
295 <para>
296 The <acronym>TOAST</> code is triggered only
297 when a row value to be stored in a table is wider than
298 <symbol>TOAST_TUPLE_THRESHOLD</> bytes (normally 2 kB).
299 The <acronym>TOAST</> code will compress and/or move
300 field values out-of-line until the row value is shorter than
301 <symbol>TOAST_TUPLE_TARGET</> bytes (also normally 2 kB)
302 or no more gains can be had. During an UPDATE
303 operation, values of unchanged fields are normally preserved as-is; so an
304 UPDATE of a row with out-of-line values incurs no <acronym>TOAST</> costs if
305 none of the out-of-line values change.
306 </para>
308 <para>
309 The <acronym>TOAST</> code recognizes four different strategies for storing
310 <acronym>TOAST</>-able columns:
312 <itemizedlist>
313 <listitem>
314 <para>
315 <literal>PLAIN</literal> prevents either compression or
316 out-of-line storage; furthermore it disables use of single-byte headers
317 for varlena types.
318 This is the only possible strategy for
319 columns of non-<acronym>TOAST</>-able data types.
320 </para>
321 </listitem>
322 <listitem>
323 <para>
324 <literal>EXTENDED</literal> allows both compression and out-of-line
325 storage. This is the default for most <acronym>TOAST</>-able data types.
326 Compression will be attempted first, then out-of-line storage if
327 the row is still too big.
328 </para>
329 </listitem>
330 <listitem>
331 <para>
332 <literal>EXTERNAL</literal> allows out-of-line storage but not
333 compression. Use of <literal>EXTERNAL</literal> will
334 make substring operations on wide <type>text</type> and
335 <type>bytea</type> columns faster (at the penalty of increased storage
336 space) because these operations are optimized to fetch only the
337 required parts of the out-of-line value when it is not compressed.
338 </para>
339 </listitem>
340 <listitem>
341 <para>
342 <literal>MAIN</literal> allows compression but not out-of-line
343 storage. (Actually, out-of-line storage will still be performed
344 for such columns, but only as a last resort when there is no other
345 way to make the row small enough.)
346 </para>
347 </listitem>
348 </itemizedlist>
350 Each <acronym>TOAST</>-able data type specifies a default strategy for columns
351 of that data type, but the strategy for a given table column can be altered
352 with <command>ALTER TABLE SET STORAGE</>.
353 </para>
355 <para>
356 This scheme has a number of advantages compared to a more straightforward
357 approach such as allowing row values to span pages. Assuming that queries are
358 usually qualified by comparisons against relatively small key values, most of
359 the work of the executor will be done using the main row entry. The big values
360 of <acronym>TOAST</>ed attributes will only be pulled out (if selected at all)
361 at the time the result set is sent to the client. Thus, the main table is much
362 smaller and more of its rows fit in the shared buffer cache than would be the
363 case without any out-of-line storage. Sort sets shrink also, and sorts will
364 more often be done entirely in memory. A little test showed that a table
365 containing typical HTML pages and their URLs was stored in about half of the
366 raw data size including the <acronym>TOAST</> table, and that the main table
367 contained only about 10% of the entire data (the URLs and some small HTML
368 pages). There was no run time difference compared to an un-<acronym>TOAST</>ed
369 comparison table, in which all the HTML pages were cut down to 7 kB to fit.
370 </para>
372 </sect1>
374 <sect1 id="storage-fsm">
376 <title>Free Space Map</title>
378 <indexterm>
379 <primary>Free Space Map</primary>
380 </indexterm>
381 <indexterm><primary>FSM</><see>Free Space Map</></indexterm>
<para>
A Free Space Map is stored with every heap and index relation, except for
hash indexes, to keep track of available space in the relation. It is stored
alongside the main relation data in a separate relation fork, named after the
filenode number of the relation with a <literal>_fsm</> suffix. For example,
if the filenode of a relation is 12345, the FSM is stored in a file called
<filename>12345_fsm</>, in the same directory as the main relation file.
</para>
<para>
The Free Space Map is organized as a tree of <acronym>FSM</> pages. The
bottom-level <acronym>FSM</> pages store the free space available on each
heap (or index) page, using one byte to represent each such page. The upper
levels aggregate information from the lower levels.
</para>
399 <para>
400 Within each <acronym>FSM</> page is a binary tree, stored in an array with
401 one byte per node. Each leaf node represents a heap page, or a lower level
402 <acronym>FSM</> page. In each non-leaf node, the higher of its children's
403 values is stored. The maximum value in the leaf nodes is therefore stored
404 at the root.
405 </para>
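<para>
The array-stored binary tree inside one <acronym>FSM</> page can be
modelled as follows. This is a simplified sketch: it assumes the number of
leaves is a power of two and places the children of node
<replaceable>i</> at positions 2<replaceable>i</>+1 and
2<replaceable>i</>+2; the search shown is a plain top-down descent, which
may differ from the strategy used by the server. The function names are
hypothetical.
</para>

```python
def build_fsm_tree(leaves):
    """Build the array form of a complete binary tree over the leaf values.

    Node i has children 2*i + 1 and 2*i + 2; the leaves fill the last
    len(leaves) slots, and every non-leaf node holds the larger child value.
    """
    n = len(leaves)
    if n == 0 or n & (n - 1):
        raise ValueError("number of leaves must be a power of two")
    tree = [0] * (n - 1) + list(leaves)
    for i in range(n - 2, -1, -1):
        tree[i] = max(tree[2 * i + 1], tree[2 * i + 2])
    return tree


def find_leaf(tree, needed):
    """Return the leaf number of some leaf with value >= needed, or None."""
    if tree[0] < needed:
        return None  # the root holds the maximum, so no leaf can qualify
    n_leaves = (len(tree) + 1) // 2
    i = 0
    while i < n_leaves - 1:  # indexes below n_leaves - 1 are non-leaf nodes
        left = 2 * i + 1
        i = left if tree[left] >= needed else left + 1
    return i - (n_leaves - 1)
```

<para>
Checking the root first is what makes the structure useful: a single byte
comparison tells whether any page covered by this <acronym>FSM</> page has
enough room.
</para>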
<para>
See <filename>src/backend/storage/freespace/README</> for more details on
how the <acronym>FSM</> is structured, and how it's updated and searched.
The <xref linkend="pgfreespacemap"> contrib module can be used to view the
information stored in free space maps.
</para>
414 </sect1>
416 <sect1 id="storage-page-layout">
418 <title>Database Page Layout</title>
420 <para>
421 This section provides an overview of the page format used within
422 <productname>PostgreSQL</productname> tables and indexes.<footnote>
423 <para>
424 Actually, index access methods need not use this page format.
425 All the existing index methods do use this basic format,
426 but the data kept on index metapages usually doesn't follow
427 the item layout rules.
428 </para>
429 </footnote>
430 Sequences and <acronym>TOAST</> tables are formatted just like a regular table.
431 </para>
433 <para>
434 In the following explanation, a
435 <firstterm>byte</firstterm>
436 is assumed to contain 8 bits. In addition, the term
437 <firstterm>item</firstterm>
438 refers to an individual data value that is stored on a page. In a table,
439 an item is a row; in an index, an item is an index entry.
440 </para>
442 <para>
443 Every table and index is stored as an array of <firstterm>pages</> of a
444 fixed size (usually 8 kB, although a different page size can be selected
445 when compiling the server). In a table, all the pages are logically
446 equivalent, so a particular item (row) can be stored in any page. In
447 indexes, the first page is generally reserved as a <firstterm>metapage</>
448 holding control information, and there can be different types of pages
449 within the index, depending on the index access method.
450 </para>
452 <para>
453 <xref linkend="page-table"> shows the overall layout of a page.
454 There are five parts to each page.
455 </para>
457 <table tocentry="1" id="page-table">
458 <title>Overall Page Layout</title>
459 <titleabbrev>Page Layout</titleabbrev>
460 <tgroup cols="2">
461 <thead>
462 <row>
463 <entry>
464 Item
465 </entry>
466 <entry>Description</entry>
467 </row>
468 </thead>
470 <tbody>
472 <row>
473 <entry>PageHeaderData</entry>
474 <entry>24 bytes long. Contains general information about the page, including
475 free space pointers.</entry>
476 </row>
478 <row>
479 <entry>ItemIdData</entry>
480 <entry>Array of (offset,length) pairs pointing to the actual items.
481 4 bytes per item.</entry>
482 </row>
484 <row>
485 <entry>Free space</entry>
486 <entry>The unallocated space. New item pointers are allocated from the start
487 of this area, new items from the end.</entry>
488 </row>
490 <row>
491 <entry>Items</entry>
492 <entry>The actual items themselves.</entry>
493 </row>
495 <row>
496 <entry>Special space</entry>
497 <entry>Index access method specific data. Different methods store different
498 data. Empty in ordinary tables.</entry>
499 </row>
501 </tbody>
502 </tgroup>
503 </table>
<para>
The first 24 bytes of each page consist of a page header
(PageHeaderData). Its format is detailed in <xref
linkend="pageheaderdata-table">. The first two fields track the most
recent WAL entry related to this page. Next is a 2-byte field
containing flag bits. This is followed by three 2-byte integer fields
(<structfield>pd_lower</structfield>, <structfield>pd_upper</structfield>,
and <structfield>pd_special</structfield>). These contain byte offsets
from the page start to the start
of unallocated space, to the end of unallocated space, and to the start of
the special space.
The next 2 bytes of the page header,
<structfield>pd_pagesize_version</structfield>, store both the page size
and a version indicator. Beginning with
<productname>PostgreSQL</productname> 8.3 the version number is 4;
<productname>PostgreSQL</productname> 8.1 and 8.2 used version number 3;
<productname>PostgreSQL</productname> 8.0 used version number 2;
<productname>PostgreSQL</productname> 7.3 and 7.4 used version number 1;
prior releases used version number 0.
(The basic page layout and header format have not changed in most of these
versions, but the layout of heap row headers has.) The page size
is basically only present as a cross-check; there is no support for having
more than one page size in an installation.
The last field is a hint that shows whether pruning the page is likely
to be profitable: it tracks the oldest un-pruned XMAX on the page.
</para>
534 <table tocentry="1" id="pageheaderdata-table">
535 <title>PageHeaderData Layout</title>
536 <titleabbrev>PageHeaderData Layout</titleabbrev>
537 <tgroup cols="4">
538 <thead>
539 <row>
540 <entry>Field</entry>
541 <entry>Type</entry>
542 <entry>Length</entry>
543 <entry>Description</entry>
544 </row>
545 </thead>
546 <tbody>
547 <row>
548 <entry>pd_lsn</entry>
549 <entry>XLogRecPtr</entry>
550 <entry>8 bytes</entry>
551 <entry>LSN: next byte after last byte of xlog record for last change
552 to this page</entry>
553 </row>
554 <row>
555 <entry>pd_tli</entry>
556 <entry>uint16</entry>
557 <entry>2 bytes</entry>
558 <entry>TimeLineID of last change (only its lowest 16 bits)</entry>
559 </row>
560 <row>
561 <entry>pd_flags</entry>
562 <entry>uint16</entry>
563 <entry>2 bytes</entry>
564 <entry>Flag bits</entry>
565 </row>
566 <row>
567 <entry>pd_lower</entry>
568 <entry>LocationIndex</entry>
569 <entry>2 bytes</entry>
570 <entry>Offset to start of free space</entry>
571 </row>
572 <row>
573 <entry>pd_upper</entry>
574 <entry>LocationIndex</entry>
575 <entry>2 bytes</entry>
576 <entry>Offset to end of free space</entry>
577 </row>
578 <row>
579 <entry>pd_special</entry>
580 <entry>LocationIndex</entry>
581 <entry>2 bytes</entry>
582 <entry>Offset to start of special space</entry>
583 </row>
584 <row>
585 <entry>pd_pagesize_version</entry>
586 <entry>uint16</entry>
587 <entry>2 bytes</entry>
588 <entry>Page size and layout version number information</entry>
589 </row>
590 <row>
591 <entry>pd_prune_xid</entry>
592 <entry>TransactionId</entry>
593 <entry>4 bytes</entry>
594 <entry>Oldest unpruned XMAX on page, or zero if none</entry>
595 </row>
596 </tbody>
597 </tgroup>
598 </table>
600 <para>
601 All the details can be found in
602 <filename>src/include/storage/bufpage.h</filename>.
603 </para>
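<para>
For illustration, the header layout in the table above can be decoded from
a raw page image with a few lines of code. The sketch makes several
assumptions that should be checked against
<filename>src/include/storage/bufpage.h</filename>: little-endian byte
order, <structfield>pd_lsn</> read as two unsigned 32-bit halves, and
<structfield>pd_pagesize_version</> holding the page size in its high byte
and the version in its low byte. The name
<function>parse_page_header</> is hypothetical.
</para>

```python
import struct

# Field order and widths follow the PageHeaderData table above.
# Assumptions of this sketch: little-endian byte order, and pd_lsn read
# as two unsigned 32-bit halves.
PAGE_HEADER = struct.Struct("<IIHHHHHHI")


def parse_page_header(page: bytes) -> dict:
    """Decode the first 24 bytes of a page into a dictionary of header fields."""
    (lsn_first, lsn_second, pd_tli, pd_flags, pd_lower, pd_upper,
     pd_special, pd_pagesize_version, pd_prune_xid) = PAGE_HEADER.unpack_from(page, 0)
    return {
        "pd_lsn": (lsn_first, lsn_second),
        "pd_tli": pd_tli,
        "pd_flags": pd_flags,
        "pd_lower": pd_lower,
        "pd_upper": pd_upper,
        "pd_special": pd_special,
        # Assumed split: page size in the high byte, version in the low byte.
        "page_size": pd_pagesize_version & 0xFF00,
        "layout_version": pd_pagesize_version & 0x00FF,
        "pd_prune_xid": pd_prune_xid,
        # Unallocated space lies between the two offsets.
        "free_space": pd_upper - pd_lower,
    }
```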
605 <para>
607 Following the page header are item identifiers
608 (<type>ItemIdData</type>), each requiring four bytes.
609 An item identifier contains a byte-offset to
610 the start of an item, its length in bytes, and a few attribute bits
611 which affect its interpretation.
612 New item identifiers are allocated
613 as needed from the beginning of the unallocated space.
614 The number of item identifiers present can be determined by looking at
615 <structfield>pd_lower</>, which is increased to allocate a new identifier.
616 Because an item
617 identifier is never moved until it is freed, its index can be used on a
618 long-term basis to reference an item, even when the item itself is moved
619 around on the page to compact free space. In fact, every pointer to an
620 item (<type>ItemPointer</type>, also known as
621 <type>CTID</type>) created by
622 <productname>PostgreSQL</productname> consists of a page number and the
623 index of an item identifier.
625 </para>
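<para>
The relationship between <structfield>pd_lower</> and the item identifier
array can be sketched as below. The 24-byte header and 4-byte identifier
sizes come from the text; the bit layout inside an identifier (15 bits of
offset, 2 flag bits, 15 bits of length, packed from the low-order end of a
little-endian word) and the 1-based item index are assumptions of this
example and depend on the compiler and platform. Both function names are
hypothetical.
</para>

```python
import struct

PAGE_HEADER_SIZE = 24  # size of PageHeaderData, from the text
ITEM_ID_SIZE = 4       # size of one ItemIdData, from the text


def item_count(pd_lower: int) -> int:
    """Number of item identifiers, derived from where free space begins."""
    return (pd_lower - PAGE_HEADER_SIZE) // ITEM_ID_SIZE


def decode_item_id(page: bytes, index: int) -> dict:
    """Decode the item identifier with the given 1-based index.

    Assumed bit layout (little-endian word): bits 0-14 offset,
    bits 15-16 flags, bits 17-31 length.
    """
    position = PAGE_HEADER_SIZE + (index - 1) * ITEM_ID_SIZE
    (raw,) = struct.unpack_from("<I", page, position)
    return {
        "offset": raw & 0x7FFF,
        "flags": (raw >> 15) & 0x3,
        "length": (raw >> 17) & 0x7FFF,
    }
```

<para>
Since items may move within the page while their identifiers stay put, a
reader always goes through the identifier to find the item's current
offset.
</para>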
627 <para>
629 The items themselves are stored in space allocated backwards from the end
630 of unallocated space. The exact structure varies depending on what the
631 table is to contain. Tables and sequences both use a structure named
632 <type>HeapTupleHeaderData</type>, described below.
634 </para>
636 <para>
638 The final section is the <quote>special section</quote> which can
639 contain anything the access method wishes to store. For example,
640 b-tree indexes store links to the page's left and right siblings,
641 as well as some other data relevant to the index structure.
642 Ordinary tables do not use a special section at all (indicated by setting
643 <structfield>pd_special</> to equal the page size).
645 </para>
647 <para>
649 All table rows are structured in the same way. There is a fixed-size
650 header (occupying 23 bytes on most machines), followed by an optional null
651 bitmap, an optional object ID field, and the user data. The header is
652 detailed
653 in <xref linkend="heaptupleheaderdata-table">. The actual user data
654 (columns of the row) begins at the offset indicated by
655 <structfield>t_hoff</>, which must always be a multiple of the MAXALIGN
656 distance for the platform.
657 The null bitmap is
658 only present if the <firstterm>HEAP_HASNULL</firstterm> bit is set in
659 <structfield>t_infomask</structfield>. If it is present it begins just after
660 the fixed header and occupies enough bytes to have one bit per data column
661 (that is, <structfield>t_natts</> bits altogether). In this list of bits, a
662 1 bit indicates not-null, a 0 bit is a null. When the bitmap is not
663 present, all columns are assumed not-null.
664 The object ID is only present if the <firstterm>HEAP_HASOID</firstterm> bit
665 is set in <structfield>t_infomask</structfield>. If present, it appears just
666 before the <structfield>t_hoff</> boundary. Any padding needed to make
667 <structfield>t_hoff</> a MAXALIGN multiple will appear between the null
668 bitmap and the object ID. (This in turn ensures that the object ID is
669 suitably aligned.)
671 </para>
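<para>
The arithmetic behind <structfield>t_hoff</> and the null bitmap can be
sketched as follows. The 23-byte fixed header and the meaning of a 1 bit
come from the text; the MAXALIGN distance of 8, the 4-byte object ID, and
the bit order within each bitmap byte (least significant bit first) are
assumptions of this example. The function names are hypothetical.
</para>

```python
FIXED_HEADER_SIZE = 23  # fixed-size row header, from the text


def max_align(size: int, alignment: int = 8) -> int:
    """Round size up to the next multiple of the alignment distance."""
    return (size + alignment - 1) // alignment * alignment


def header_size(natts: int, has_nulls: bool, has_oid: bool, alignment: int = 8) -> int:
    """Compute t_hoff: fixed header, optional bitmap and object ID, then padding."""
    size = FIXED_HEADER_SIZE
    if has_nulls:
        size += (natts + 7) // 8  # one bit per column, rounded up to whole bytes
    if has_oid:
        size += 4                 # assumed 4-byte object ID
    return max_align(size, alignment)


def is_null(bitmap: bytes, column: int) -> bool:
    """Test a 0-based column in the null bitmap; a 1 bit means not-null."""
    bit = (bitmap[column >> 3] >> (column & 7)) & 1
    return bit == 0
```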
673 <table tocentry="1" id="heaptupleheaderdata-table">
674 <title>HeapTupleHeaderData Layout</title>
675 <titleabbrev>HeapTupleHeaderData Layout</titleabbrev>
676 <tgroup cols="4">
677 <thead>
678 <row>
679 <entry>Field</entry>
680 <entry>Type</entry>
681 <entry>Length</entry>
682 <entry>Description</entry>
683 </row>
684 </thead>
685 <tbody>
686 <row>
687 <entry>t_xmin</entry>
688 <entry>TransactionId</entry>
689 <entry>4 bytes</entry>
690 <entry>insert XID stamp</entry>
691 </row>
692 <row>
693 <entry>t_xmax</entry>
694 <entry>TransactionId</entry>
695 <entry>4 bytes</entry>
696 <entry>delete XID stamp</entry>
697 </row>
698 <row>
699 <entry>t_cid</entry>
700 <entry>CommandId</entry>
701 <entry>4 bytes</entry>
702 <entry>insert and/or delete CID stamp (overlays with t_xvac)</entry>
703 </row>
704 <row>
705 <entry>t_xvac</entry>
706 <entry>TransactionId</entry>
707 <entry>4 bytes</entry>
708 <entry>XID for VACUUM operation moving a row version</entry>
709 </row>
710 <row>
711 <entry>t_ctid</entry>
712 <entry>ItemPointerData</entry>
713 <entry>6 bytes</entry>
714 <entry>current TID of this or newer row version</entry>
715 </row>
716 <row>
717 <entry>t_infomask2</entry>
718 <entry>int16</entry>
719 <entry>2 bytes</entry>
720 <entry>number of attributes, plus various flag bits</entry>
721 </row>
722 <row>
723 <entry>t_infomask</entry>
724 <entry>uint16</entry>
725 <entry>2 bytes</entry>
726 <entry>various flag bits</entry>
727 </row>
728 <row>
729 <entry>t_hoff</entry>
730 <entry>uint8</entry>
731 <entry>1 byte</entry>
732 <entry>offset to user data</entry>
733 </row>
734 </tbody>
735 </tgroup>
736 </table>
738 <para>
739 All the details can be found in
740 <filename>src/include/access/htup.h</filename>.
741 </para>
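<para>
As an illustration of the table above, the fixed part of a row header can
be unpacked like this. The sketch assumes little-endian byte order, treats
<structfield>t_ctid</> as a block number stored in two 16-bit halves
followed by a 16-bit item index, and uses mask values
(<literal>0x07FF</> for the attribute count, <literal>0x0001</> for
<symbol>HEAP_HASNULL</>) that should be verified against
<filename>src/include/access/htup.h</filename>. The function name is
hypothetical.
</para>

```python
import struct

# Field order and widths follow the HeapTupleHeaderData table above;
# t_cid and t_xvac share the same four bytes. Little-endian is assumed.
TUPLE_HEADER = struct.Struct("<IIIHHHHHB")

NATTS_MASK = 0x07FF    # assumed: low bits of t_infomask2 hold the attribute count
HEAP_HASNULL = 0x0001  # assumed flag value in t_infomask


def parse_tuple_header(item: bytes) -> dict:
    """Decode the 23-byte fixed header at the start of a table row."""
    (t_xmin, t_xmax, t_cid, block_hi, block_lo, item_index,
     t_infomask2, t_infomask, t_hoff) = TUPLE_HEADER.unpack_from(item, 0)
    return {
        "t_xmin": t_xmin,
        "t_xmax": t_xmax,
        "t_cid_or_xvac": t_cid,
        # (page number, item identifier index), as in a CTID
        "t_ctid": ((block_hi << 16) | block_lo, item_index),
        "natts": t_infomask2 & NATTS_MASK,
        "has_nulls": bool(t_infomask & HEAP_HASNULL),
        "t_hoff": t_hoff,
    }
```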
743 <para>
745 Interpreting the actual data can only be done with information obtained
746 from other tables, mostly <structname>pg_attribute</structname>. The
747 key values needed to identify field locations are
748 <structfield>attlen</structfield> and <structfield>attalign</structfield>.
749 There is no way to directly get a
750 particular attribute, except when there are only fixed width fields and no
751 null values. All this trickery is wrapped up in the functions
752 <firstterm>heap_getattr</firstterm>, <firstterm>fastgetattr</firstterm>
753 and <firstterm>heap_getsysattr</firstterm>.
755 </para>
<para>
To read the data you need to examine each attribute in turn. First check
whether the field is NULL according to the null bitmap. If it is, go to
the next field. Then make sure you have the right alignment. If the field is a
fixed-width field, its bytes are simply stored in place. If it's a
variable-length field (attlen = -1) then it's a bit more complicated.
All variable-length data types share the common header structure
<type>struct varlena</type>, which includes the total length of the stored
value and some flag bits. Depending on the flags, the data can be either
inline or in a <acronym>TOAST</> table;
it might be compressed, too (see <xref linkend="storage-toast">).
</para>
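<para>
The walk over attributes can be sketched for the simple case of
fixed-width columns. The example takes
(<structfield>attlen</>, <structfield>attalign</>) pairs as they would
appear in <structname>pg_attribute</structname> and returns each column's
offset from the start of the user data. The alignment widths assigned to
the <structfield>attalign</> codes are typical values assumed for this
illustration and can differ by platform; variable-length columns are
deliberately not handled. The function name is hypothetical.
</para>

```python
# Assumed alignment, in bytes, for each attalign code (platform-dependent).
ALIGNMENT = {"c": 1, "s": 2, "i": 4, "d": 8}


def fixed_width_offsets(columns, nulls=None):
    """Return the offset of each column within the user data area.

    columns is a list of (attlen, attalign) pairs; nulls, if given, is a
    list of booleans marking null columns, which get an offset of None.
    """
    offsets = []
    position = 0
    for number, (attlen, attalign) in enumerate(columns):
        if nulls is not None and nulls[number]:
            offsets.append(None)  # a null column stores no bytes
            continue
        if attlen <= 0:
            raise ValueError("variable-length columns are not handled by this sketch")
        align = ALIGNMENT[attalign]
        # Round the running position up to this column's alignment boundary.
        position = (position + align - 1) // align * align
        offsets.append(position)
        position += attlen
    return offsets
```

<para>
The second call pattern, with a <literal>nulls</> list, shows why a null in
an earlier column prevents jumping straight to a later one: every following
offset can shift.
</para>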
770 </sect1>
772 </chapter>