sqlite3/www/fileformat.tcl

   1 #
   2 # Run this script to generated a fileformat.html output file
   3 #
   4 set rcsid {$Id: fileformat.tcl,v 1.13 2004/10/10 17:24:55 drh Exp $}
   5 source common.tcl
   6 header {SQLite Database File Format (Version 2)}
   7 puts {
   8 <h2>SQLite 2.X Database File Format</h2>
   9
  10 <p>
  11 This document describes the disk file format for SQLite versions 2.1
  12 through 2.8.  SQLite version 3.0 and following uses a very different
  13 format which is described separately.
  14 </p>
  15
  16 <h3>1.0 &nbsp; Layers</h3>
  17
  18 <p>
  19 SQLite is implemented in layers.
  20 (See the <a href="arch.html">architecture description</a>.)
  21 The format of database files is determined by three different
  22 layers in the architecture.
  23 </p>
  24
  25 <ul>
  26 <li>The <b>schema</b> layer implemented by the VDBE.</li>
  27 <li>The <b>b-tree</b> layer implemented by btree.c</li>
  28 <li>The <b>pager</b> layer implemented by pager.c</li>
  29 </ul>
  30
  31 <p>
  32 We will describe each layer beginning with the bottom (pager)
  33 layer and working upwards.
  34 </p>
  35
  36 <h3>2.0 &nbsp; The Pager Layer</h3>
  37
  38 <p>
  39 An SQLite database consists of
  40 "pages" of data.  Each page is 1024 bytes in size.
  41 Pages are numbered beginning with 1.
  42 A page number of 0 is used to indicate "no such page" in the
  43 B-Tree and Schema layers.
  44 </p>
  45
  46 <p>
  47 The pager layer is responsible for implementing transactions
  48 with atomic commit and rollback.  It does this using a separate
  49 journal file.  Whenever a new transaction is started, a journal
  50 file is created that records the original state of the database.
  51 If the program terminates before completing the transaction, the next
  52 process to open the database can use the journal file to restore
  53 the database to its original state.
  54 </p>
  55
  56 <p>
  57 The journal file is located in the same directory as the database
  58 file and has the same name as the database file but with the
  59 characters "<tt>-journal</tt>" appended.
  60 </p>
  61
  62 <p>
  63 The pager layer does not impose any content restrictions on the
  64 main database file.  As far as the pager is concerned, each page
  65 contains 1024 bytes of arbitrary data.  But there is structure to
  66 the journal file.
  67 </p>
  68
  69 <p>
  70 A journal file begins with 8 bytes as follows:
  71 0xd9, 0xd5, 0x05, 0xf9, 0x20, 0xa1, 0x63, and 0xd6.
  72 Processes that are attempting to rollback a journal use these 8 bytes
  73 as a sanity check to make sure the file they think is a journal really
  74 is a valid journal.  Prior version of SQLite used different journal
  75 file formats.  The magic numbers for these prior formats are different
  76 so that if a new version of the library attempts to rollback a journal
  77 created by an earlier version, it can detect that the journal uses
  78 an obsolete format and make the necessary adjustments.  This article
  79 describes only the newest journal format - supported as of version
  80 2.8.0.
  81 </p>
  82
  83 <p>
  84 Following the 8 byte prefix is a three 4-byte integers that tell us
  85 the number of pages that have been committed to the journal,
  86 a magic number used for
  87 sanity checking each page, and the
  88 original size of the main database file before the transaction was
  89 started.  The number of committed pages is used to limit how far
  90 into the journal to read.  The use of the checksum magic number is
  91 described below.
  92 The original size of the database is used to restore the database
  93 file back to its original size.
  94 The size is expressed in pages (1024 bytes per page).
  95 </p>
  96
  97 <p>
  98 All three integers in the journal header and all other multi-byte
  99 numbers used in the journal file are big-endian.
 100 That means that the most significant byte
 101 occurs first.  That way, a journal file that is
 102 originally created on one machine can be rolled back by another
 103 machine that uses a different byte order.  So, for example, a
 104 transaction that failed to complete on your big-endian SparcStation
 105 can still be rolled back on your little-endian Linux box.
 106 </p>
 107
 108 <p>
 109 After the 8-byte prefix and the three 4-byte integers, the
 110 journal file consists of zero or more page records.  Each page
 111 record is a 4-byte (big-endian) page number followed by 1024 bytes
 112 of data and a 4-byte checksum.
 113 The data is the original content of the database page
 114 before the transaction was started.  So to roll back the transaction,
 115 the data is simply written into the corresponding page of the
 116 main database file.  Pages can appear in the journal in any order,
 117 but they are guaranteed to appear only once. All page numbers will be
 118 between 1 and the maximum specified by the page size integer that
 119 appeared at the beginning of the journal.
 120 </p>
 121
 122 <p>
 123 The so-called checksum at the end of each record is not really a
 124 checksum - it is the sum of the page number and the magic number which
 125 was the second integer in the journal header.  The purpose of this
 126 value is to try to detect journal corruption that might have occurred
 127 because of a power loss or OS crash that occurred which the journal
 128 file was being written to disk.  It could have been the case that the
 129 meta-data for the journal file, specifically the size of the file, had
 130 been written to the disk so that when the machine reboots it appears that
 131 file is large enough to hold the current record.  But even though the
 132 file size has changed, the data for the file might not have made it to
 133 the disk surface at the time of the OS crash or power loss.  This means
 134 that after reboot, the end of the journal file will contain quasi-random
 135 garbage data.  The checksum is an attempt to detect such corruption.  If
 136 the checksum does not match, that page of the journal is not rolled back.
 137 </p>
 138
 139 <p>
 140 Here is a summary of the journal file format:
 141 </p>
 142
 143 <ul>
 144 <li>8 byte prefix: 0xd9, 0xd5, 0x05, 0xf9, 0x20, 0xa1, 0x63, 0xd6</li>
 145 <li>4 byte number of records in journal</li>
 146 <li>4 byte magic number used for page checksums</li>
 147 <li>4 byte initial database page count</li>
 148 <li>Zero or more instances of the following:
 149    <ul>
 150    <li>4 byte page number</li>
 151    <li>1024 bytes of original data for the page</li>
 152    <li>4 byte checksum</li>
 153    </ul>
 154 </li>
 155 </ul>
 156
 157 <h3>3.0 &nbsp; The B-Tree Layer</h3>
 158
 159 <p>
 160 The B-Tree layer builds on top of the pager layer to implement
 161 one or more separate b-trees all in the same disk file.  The
 162 algorithms used are taken from Knuth's <i>The Art Of Computer
 163 Programming.</i></p>
 164
 165 <p>
 166 Page 1 of a database contains a header string used for sanity
 167 checking, a few 32-bit words of configuration data, and a pointer
 168 to the beginning of a list of unused pages in the database.
 169 All other pages in the
 170 database are either pages of a b-tree, overflow pages, or unused
 171 pages on the freelist.
 172 </p>
 173
 174 <p>
 175 Each b-tree page contains zero or more database entries.
 176 Each entry has an unique key of one or more bytes and data of
 177 zero or more bytes.
 178 Both the key and data are arbitrary byte sequences.  The combination
 179 of key and data are collectively known as "payload".  The current
 180 implementation limits the amount of payload in a single entry to
 181 1048576 bytes.  This limit can be raised to 16777216 by adjusting
 182 a single #define in the source code and recompiling.  But most entries
 183 contain less than a hundred bytes of payload so a megabyte limit seems
 184 more than enough.
 185 </p>
 186
 187 <p>
 188 Up to 238 bytes of payload for an entry can be held directly on
 189 a b-tree page.  Any additional payload is contained on a linked list
 190 of overflow pages.  This limit on the amount of payload held directly
 191 on b-tree pages guarantees that each b-tree page can hold at least
 192 4 entries.  In practice, most entries are smaller than 238 bytes and
 193 thus most pages can hold more than 4 entries.
 194 </p>
 195
 196 <p>
 197 A single database file can hold any number of separate, independent b-trees.
 198 Each b-tree is identified by its root page, which never changes.
 199 Child pages of the b-tree may change as entries are added and removed
 200 and pages split and combine.  But the root page always stays the same.
 201 The b-tree itself does not record which pages are root pages and which
 202 are not.  That information is handled entirely at the schema layer.
 203 </p>
 204
 205 <h4>3.1 &nbsp; B-Tree Page 1 Details</h4>
 206
 207 <p>
 208 Page 1 begins with the following 48-byte string:
 209 </p>
 210
 211 <blockquote><pre>
 212 ** This file contains an SQLite 2.1 database **
 213 </pre></blockquote>
 214
 215 <p>
 216 If you count the number of characters in the string above, you will
 217 see that there are only 47.  A '\000' terminator byte is added to
 218 bring the total to 48.
 219 </p>
 220
 221 <p>
 222 A frequent question is why the string says version 2.1 when (as
 223 of this writing) we are up to version 2.7.0 of SQLite and any
 224 change to the second digit of the version is suppose to represent
 225 a database format change.  The answer to this is that the B-tree
 226 layer has not changed any since version 2.1.  There have been
 227 database format changes since version 2.1 but those changes have
 228 all been in the schema layer.  Because the format of the b-tree
 229 layer is unchanged since version 2.1.0, the header string still
 230 says version 2.1.
 231 </p>
 232
 233 <p>
 234 After the format string is a 4-byte integer used to determine the
 235 byte-order of the database.  The integer has a value of
 236 0xdae37528.  If this number is expressed as 0xda, 0xe3, 0x75, 0x28, then
 237 the database is in a big-endian format and all 16 and 32-bit integers
 238 elsewhere in the b-tree layer are also big-endian.  If the number is
 239 expressed as 0x28, 0x75, 0xe3, and 0xda, then the database is in a
 240 little-endian format and all other multi-byte numbers in the b-tree
 241 layer are also little-endian.
 242 Prior to version 2.6.3, the SQLite engine was only able to read databases
 243 that used the same byte order as the processor they were running on.
 244 But beginning with 2.6.3, SQLite can read or write databases in any
 245 byte order.
 246 </p>
 247
 248 <p>
 249 After the byte-order code are six 4-byte integers.  Each integer is in the
 250 byte order determined by the byte-order code.  The first integer is the
 251 page number for the first page of the freelist.  If there are no unused
 252 pages in the database, then this integer is 0.  The second integer is
 253 the number of unused pages in the database.  The last 4 integers are
 254 not used by the b-tree layer.  These are the so-called "meta" values that
 255 are passed up to the schema layer
 256 and used there for configuration and format version information.
 257 All bytes of page 1 past beyond the meta-value integers are unused
 258 and are initialized to zero.
 259 </p>
 260
 261 <p>
 262 Here is a summary of the information contained on page 1 in the b-tree layer:
 263 </p>
 264
 265 <ul>
 266 <li>48 byte header string</li>
 267 <li>4 byte integer used to determine the byte-order</li>
 268 <li>4 byte integer which is the first page of the freelist</li>
 269 <li>4 byte integer which is the number of pages on the freelist</li>
 270 <li>36 bytes of meta-data arranged as nine 4-byte integers</li>
 271 <li>928 bytes of unused space</li>
 272 </ul>
 273
 274 <h4>3.2 &nbsp; Structure Of A Single B-Tree Page</h4>
 275
 276 <p>
 277 Conceptually, a b-tree page contains N database entries and N+1 pointers
 278 to other b-tree pages.
 279 </p>
 280
 281 <blockquote>
 282 <table border=1 cellspacing=0 cellpadding=5>
 283 <tr>
 284 <td align="center">Ptr<br>0</td>
 285 <td align="center">Entry<br>0</td>
 286 <td align="center">Ptr<br>1</td>
 287 <td align="center">Entry<br>1</td>
 288 <td align="center"><b>...</b></td>
 289 <td align="center">Ptr<br>N-1</td>
 290 <td align="center">Entry<br>N-1</td>
 291 <td align="center">Ptr<br>N</td>
 292 </tr>
 293 </table>
 294 </blockquote>
 295
 296 <p>
 297 The entries are arranged in increasing order.  That is, the key to
 298 Entry 0 is less than the key to Entry 1, and the key to Entry 1 is
 299 less than the key of Entry 2, and so forth.  The pointers point to
 300 pages containing additional entries that have keys in between the
 301 entries on either side.  So Ptr 0 points to another b-tree page that
 302 contains entries that all have keys less than Key 0, and Ptr 1
 303 points to a b-tree pages where all entries have keys greater than Key 0
 304 but less than Key 1, and so forth.
 305 </p>
 306
 307 <p>
 308 Each b-tree page in SQLite consists of a header, zero or more "cells"
 309 each holding a single entry and pointer, and zero or more "free blocks"
 310 that represent unused space on the page.
 311 </p>
 312
 313 <p>
 314 The header on a b-tree page is the first 8 bytes of the page.
 315 The header contains the value
 316 of the right-most pointer (Ptr N) and the byte offset into the page
 317 of the first cell and the first free block.  The pointer is a 32-bit
 318 value and the offsets are each 16-bit values.  We have:
 319 </p>
 320
 321 <blockquote>
 322 <table border=1 cellspacing=0 cellpadding=5>
 323 <tr>
 324 <td align="center" width=30>0</td>
 325 <td align="center" width=30>1</td>
 326 <td align="center" width=30>2</td>
 327 <td align="center" width=30>3</td>
 328 <td align="center" width=30>4</td>
 329 <td align="center" width=30>5</td>
 330 <td align="center" width=30>6</td>
 331 <td align="center" width=30>7</td>
 332 </tr>
 333 <tr>
 334 <td align="center" colspan=4>Ptr N</td>
 335 <td align="center" colspan=2>Cell 0</td>
 336 <td align="center" colspan=2>Freeblock 0</td>
 337 </tr>
 338 </table>
 339 </blockquote>
 340
 341 <p>
 342 The 1016 bytes of a b-tree page that come after the header contain
 343 cells and freeblocks.  All 1016 bytes are covered by either a cell
 344 or a freeblock.
 345 </p>
 346
 347 <p>
 348 The cells are connected in a linked list.  Cell 0 contains Ptr 0 and
 349 Entry 0.  Bytes 4 and 5 of the header point to Cell 0.  Cell 0 then
 350 points to Cell 1 which contains Ptr 1 and Entry 1.  And so forth.
 351 Cells vary in size.  Every cell has a 12-byte header and at least 4
 352 bytes of payload space.  Space is allocated to payload in increments
 353 of 4 bytes.  Thus the minimum size of a cell is 16 bytes and up to
 354 63 cells can fit on a single page.  The size of a cell is always a multiple
 355 of 4 bytes.
 356 A cell can have up to 238 bytes of payload space.  If
 357 the payload is more than 238 bytes, then an additional 4 byte page
 358 number is appended to the cell which is the page number of the first
 359 overflow page containing the additional payload.  The maximum size
 360 of a cell is thus 254 bytes, meaning that a least 4 cells can fit into
 361 the 1016 bytes of space available on a b-tree page.
 362 An average cell is usually around 52 to 100 bytes in size with about
 363 10 or 20 cells to a page.
 364 </p>
 365
 366 <p>
 367 The data layout of a cell looks like this:
 368 </p>
 369
 370 <blockquote>
 371 <table border=1 cellspacing=0 cellpadding=5>
 372 <tr>
 373 <td align="center" width=20>0</td>
 374 <td align="center" width=20>1</td>
 375 <td align="center" width=20>2</td>
 376 <td align="center" width=20>3</td>
 377 <td align="center" width=20>4</td>
 378 <td align="center" width=20>5</td>
 379 <td align="center" width=20>6</td>
 380 <td align="center" width=20>7</td>
 381 <td align="center" width=20>8</td>
 382 <td align="center" width=20>9</td>
 383 <td align="center" width=20>10</td>
 384 <td align="center" width=20>11</td>
 385 <td align="center" width=100>12 ... 249</td>
 386 <td align="center" width=20>250</td>
 387 <td align="center" width=20>251</td>
 388 <td align="center" width=20>252</td>
 389 <td align="center" width=20>253</td>
 390 </tr>
 391 <tr>
 392 <td align="center" colspan=4>Ptr</td>
 393 <td align="center" colspan=2>Keysize<br>(low)</td>
 394 <td align="center" colspan=2>Next</td>
 395 <td align="center" colspan=1>Ksz<br>(hi)</td>
 396 <td align="center" colspan=1>Dsz<br>(hi)</td>
 397 <td align="center" colspan=2>Datasize<br>(low)</td>
 398 <td align="center" colspan=1>Payload</td>
 399 <td align="center" colspan=4>Overflow<br>Pointer</td>
 400 </tr>
 401 </table>
 402 </blockquote>
 403
 404 <p>
 405 The first four bytes are the pointer.  The size of the key is a 24-bit
 406 where the upper 8 bits are taken from byte 8 and the lower 16 bits are
 407 taken from bytes 4 and 5 (or bytes 5 and 4 on little-endian machines.)
 408 The size of the data is another 24-bit value where the upper 8 bits
 409 are taken from byte 9 and the lower 16 bits are taken from bytes 10 and
 410 11 or 11 and 10, depending on the byte order.  Bytes 6 and 7 are the
 411 offset to the next cell in the linked list of all cells on the current
 412 page.  This offset is 0 for the last cell on the page.
 413 </p>
 414
 415 <p>
 416 The payload itself can be any number of bytes between 1 and 1048576.
 417 But space to hold the payload is allocated in 4-byte chunks up to
 418 238 bytes.  If the entry contains more than 238 bytes of payload, then
 419 additional payload data is stored on a linked list of overflow pages.
 420 A 4 byte page number is appended to the cell that contains the first
 421 page of this linked list.
 422 </p>
 423
 424 <p>
 425 Each overflow page begins with a 4-byte value which is the
 426 page number of the next overflow page in the list.   This value is
 427 0 for the last page in the list.  The remaining
 428 1020 bytes of the overflow page are available for storing payload.
 429 Note that a full page is allocated regardless of the number of overflow
 430 bytes stored.  Thus, if the total payload for an entry is 239 bytes,
 431 the first 238 are stored in the cell and the overflow page stores just
 432 one byte.
 433 </p>
 434
 435 <p>
 436 The structure of an overflow page looks like this:
 437 </p>
 438
 439 <blockquote>
 440 <table border=1 cellspacing=0 cellpadding=5>
 441 <tr>
 442 <td align="center" width=20>0</td>
 443 <td align="center" width=20>1</td>
 444 <td align="center" width=20>2</td>
 445 <td align="center" width=20>3</td>
 446 <td align="center" width=200>4 ... 1023</td>
 447 </tr>
 448 <tr>
 449 <td align="center" colspan=4>Next Page</td>
 450 <td align="center" colspan=1>Overflow Data</td>
 451 </tr>
 452 </table>
 453 </blockquote>
 454
 455 <p>
 456 All space on a b-tree page which is not used by the header or by cells
 457 is filled by freeblocks.  Freeblocks, like cells, are variable in size.
 458 The size of a freeblock is at least 4 bytes and is always a multiple of
 459 4 bytes.
 460 The first 4 bytes contain a header and the remaining bytes
 461 are unused.  The structure of the freeblock is as follows:
 462 </p>
 463
 464 <blockquote>
 465 <table border=1 cellspacing=0 cellpadding=5>
 466 <tr>
 467 <td align="center" width=20>0</td>
 468 <td align="center" width=20>1</td>
 469 <td align="center" width=20>2</td>
 470 <td align="center" width=20>3</td>
 471 <td align="center" width=200>4 ... 1015</td>
 472 </tr>
 473 <tr>
 474 <td align="center" colspan=2>Size</td>
 475 <td align="center" colspan=2>Next</td>
 476 <td align="center" colspan=1>Unused</td>
 477 </tr>
 478 </table>
 479 </blockquote>
 480
 481 <p>
 482 Freeblocks are stored in a linked list in increasing order.  That is
 483 to say, the first freeblock occurs at a lower index into the page than
 484 the second free block, and so forth.  The first 2 bytes of the header
 485 are an integer which is the total number of bytes in the freeblock.
 486 The second 2 bytes are the index into the page of the next freeblock
 487 in the list.  The last freeblock has a Next value of 0.
 488 </p>
 489
 490 <p>
 491 When a new b-tree is created in a database, the root page of the b-tree
 492 consist of a header and a single 1016 byte freeblock.  As entries are
 493 added, space is carved off of that freeblock and used to make cells.
 494 When b-tree entries are deleted, the space used by their cells is converted
 495 into freeblocks.  Adjacent freeblocks are merged, but the page can still
 496 become fragmented.  The b-tree code will occasionally try to defragment
 497 the page by moving all cells to the beginning and constructing a single
 498 freeblock at the end to take up all remaining space.
 499 </p>
 500
 501 <h4>3.3 &nbsp; The B-Tree Free Page List</h4>
 502
 503 <p>
 504 When information is removed from an SQLite database such that one or
 505 more pages are no longer needed, those pages are added to a list of
 506 free pages so that they can be reused later when new information is
 507 added.  This subsection describes the structure of this freelist.
 508 </p>
 509
 510 <p>
 511 The 32-bit integer beginning at byte-offset 52 in page 1 of the database
 512 contains the address of the first page in a linked list of free pages.
 513 If there are no free pages available, this integer has a value of 0.
 514 The 32-bit integer at byte-offset 56 in page 1 contains the number of
 515 free pages on the freelist.
 516 </p>
 517
 518 <p>
 519 The freelist contains a trunk and many branches.  The trunk of
 520 the freelist is composed of overflow pages.  That is to say, each page
 521 contains a single 32-bit integer at byte offset 0 which
 522 is the page number of the next page on the freelist trunk.
 523 The payload area
 524 of each trunk page is used to record pointers to branch pages.
 525 The first 32-bit integer in the payload area of a trunk page
 526 is the number of branch pages to follow (between 0 and 254)
 527 and each subsequent 32-bit integer is a page number for a branch page.
 528 The following diagram shows the structure of a trunk freelist page:
 529 </p>
 530
 531 <blockquote>
 532 <table border=1 cellspacing=0 cellpadding=5>
 533 <tr>
 534 <td align="center" width=20>0</td>
 535 <td align="center" width=20>1</td>
 536 <td align="center" width=20>2</td>
 537 <td align="center" width=20>3</td>
 538 <td align="center" width=20>4</td>
 539 <td align="center" width=20>5</td>
 540 <td align="center" width=20>6</td>
 541 <td align="center" width=20>7</td>
 542 <td align="center" width=200>8 ... 1023</td>
 543 </tr>
 544 <tr>
 545 <td align="center" colspan=4>Next trunk page</td>
 546 <td align="center" colspan=4># of branch pages</td>
 547 <td align="center" colspan=1>Page numbers for branch pages</td>
 548 </tr>
 549 </table>
 550 </blockquote>
 551
 552 <p>
 553 It is important to note that only the pages on the trunk of the freelist
 554 contain pointers to other pages.  The branch pages contain no
 555 data whatsoever.  The fact that the branch pages are completely
 556 blank allows for an important optimization in the paging layer.  When
 557 a branch page is removed from the freelist to be reused, it is not
 558 necessary to write the original content of that page into the rollback
 559 journal.  The branch page contained no data to begin with, so there is
 560 no need to restore the page in the event of a rollback.  Similarly,
 561 when a page is not longer needed and is added to the freelist as a branch
 562 page, it is not necessary to write the content of that page
 563 into the database file.
 564 Again, the page contains no real data so it is not necessary to record the
 565 content of that page.  By reducing the amount of disk I/O required,
 566 these two optimizations allow some database operations
 567 to go four to six times faster than they would otherwise.
 568 </p>
 569
 570 <h3>4.0 &nbsp; The Schema Layer</h3>
 571
 572 <p>
 573 The schema layer implements an SQL database on top of one or more
 574 b-trees and keeps track of the root page numbers for all b-trees.
 575 Where the b-tree layer provides only unformatted data storage with
 576 a unique key, the schema layer allows each entry to contain multiple
 577 columns.  The schema layer also allows indices and non-unique key values.
 578 </p>
 579
 580 <p>
 581 The schema layer implements two separate data storage abstractions:
 582 tables and indices.  Each table and each index uses its own b-tree
 583 but they use the b-tree capabilities in different ways.  For a table,
 584 the b-tree key is a unique 4-byte integer and the b-tree data is the
 585 content of the table row, encoded so that columns can be separately
 586 extracted.  For indices, the b-tree key varies in size depending on the
 587 size of the fields being indexed and the b-tree data is empty.
 588 </p>
 589
 590 <h4>4.1 &nbsp; SQL Table Implementation Details</h4>
 591
 592 <p>Each row of an SQL table is stored in a single b-tree entry.
 593 The b-tree key is a 4-byte big-endian integer that is the ROWID
 594 or INTEGER PRIMARY KEY for that table row.
 595 The key is stored in a big-endian format so
 596 that keys will sort in numerical order using memcmp() function.</p>
 597
 598 <p>The content of a table row is stored in the data portion of
 599 the corresponding b-tree table.  The content is encoded to allow
 600 individual columns of the row to be extracted as necessary.  Assuming
 601 that the table has N columns, the content is encoded as N+1 offsets
 602 followed by N column values, as follows:
 603 </p>
 604
 605 <blockquote>
 606 <table border=1 cellspacing=0 cellpadding=5>
 607 <tr>
 608 <td>offset 0</td>
 609 <td>offset 1</td>
 610 <td><b>...</b></td>
 611 <td>offset N-1</td>
 612 <td>offset N</td>
 613 <td>value 0</td>
 614 <td>value 1</td>
 615 <td><b>...</b></td>
 616 <td>value N-1</td>
 617 </tr>
 618 </table>
 619 </blockquote>
 620
 621 <p>
 622 The offsets can be either 8-bit, 16-bit, or 24-bit integers depending
 623 on how much data is to be stored.  If the total size of the content
 624 is less than 256 bytes then 8-bit offsets are used.  If the total size
 625 of the b-tree data is less than 65536 then 16-bit offsets are used.
 626 24-bit offsets are used otherwise.  Offsets are always little-endian,
 627 which means that the least significant byte occurs first.
 628 </p>
 629
 630 <p>
 631 Data is stored as a nul-terminated string.  Any empty string consists
 632 of just the nul terminator.  A NULL value is an empty string with no
 633 nul-terminator.  Thus a NULL value occupies zero bytes and an empty string
 634 occupies 1 byte.
 635 </p>
 636
 637 <p>
 638 Column values are stored in the order that they appear in the CREATE TABLE
 639 statement.  The offsets at the beginning of the record contain the
 640 byte index of the corresponding column value.  Thus, Offset 0 contains
 641 the byte index for Value 0, Offset 1 contains the byte offset
 642 of Value 1, and so forth.  The number of bytes in a column value can
 643 always be found by subtracting offsets.  This allows NULLs to be
 644 recovered from the record unambiguously.
 645 </p>
 646
 647 <p>
 648 Most columns are stored in the b-tree data as described above.
 649 The one exception is column that has type INTEGER PRIMARY KEY.
 650 INTEGER PRIMARY KEY columns correspond to the 4-byte b-tree key.
 651 When an SQL statement attempts to read the INTEGER PRIMARY KEY,
 652 the 4-byte b-tree key is read rather than information out of the
 653 b-tree data.  But there is still an Offset associated with the
 654 INTEGER PRIMARY KEY, just like any other column.  But the Value
 655 associated with that offset is always NULL.
 656 </p>
 657
 658 <h4>4.2 &nbsp; SQL Index Implementation Details</h4>
 659
 660 <p>
 661 SQL indices are implement using a b-tree in which the key is used
 662 but the data is always empty.  The purpose of an index is to map
 663 one or more column values into the ROWID for the table entry that
 664 contains those column values.
 665 </p>
 666
 667 <p>
 668 Each b-tree in an index consists of one or more column values followed
 669 by a 4-byte ROWID.  Each column value is nul-terminated (even NULL values)
 670 and begins with a single character that indicates the datatype for that
 671 column value.  Only three datatypes are supported: NULL, Number, and
 672 Text.  NULL values are encoded as the character 'a' followed by the
 673 nul terminator.  Numbers are encoded as the character 'b' followed by
 674 a string that has been crafted so that sorting the string using memcmp()
 675 will sort the corresponding numbers in numerical order.  (See the
 676 sqliteRealToSortable() function in util.c of the SQLite sources for
 677 additional information on this encoding.)  Numbers are also nul-terminated.
 678 Text values consists of the character 'c' followed by a copy of the
 679 text string and a nul-terminator.  These encoding rules result in
 680 NULLs being sorted first, followed by numerical values in numerical
 681 order, followed by text values in lexicographical order.
 682 </p>
 683
 684 <h4>4.4 &nbsp; SQL Schema Storage And Root B-Tree Page Numbers</h4>
 685
 686 <p>
 687 The database schema is stored in the database in a special tabled named
 688 "sqlite_master" and which always has a root b-tree page number of 2.
 689 This table contains the original CREATE TABLE,
 690 CREATE INDEX, CREATE VIEW, and CREATE TRIGGER statements used to define
 691 the database to begin with.  Whenever an SQLite database is opened,
 692 the sqlite_master table is scanned from beginning to end and
 693 all the original CREATE statements are played back through the parser
 694 in order to reconstruct an in-memory representation of the database
 695 schema for use in subsequent command parsing.  For each CREATE TABLE
 696 and CREATE INDEX statement, the root page number for the corresponding
 697 b-tree is also recorded in the sqlite_master table so that SQLite will
 698 know where to look for the appropriate b-tree.
 699 </p>
 700
 701 <p>
 702 SQLite users can query the sqlite_master table just like any other table
 703 in the database.  But the sqlite_master table cannot be directly written.
 704 The sqlite_master table is automatically updated in response to CREATE
 705 and DROP statements but it cannot be changed using INSERT, UPDATE, or
 706 DELETE statements as that would risk corrupting the database.
 707 </p>
 708
 709 <p>
 710 SQLite stores temporary tables and indices in a separate
 711 file from the main database file.  The temporary table database file
 712 is the same structure as the main database file.  The schema table
 713 for the temporary tables is stored on page 2 just as in the main
 714 database.  But the schema table for the temporary database named
 715 "sqlite_temp_master" instead of "sqlite_master".  Other than the
 716 name change, it works exactly the same.
 717 </p>
 718
 719 <h4>4.4 &nbsp; Schema Version Numbering And Other Meta-Information</h4>
 720
 721 <p>
 722 The nine 32-bit integers that are stored beginning at byte offset
 723 60 of Page 1 in the b-tree layer are passed up into the schema layer
 724 and used for versioning and configuration information.  The meaning
 725 of the first four integers is shown below.  The other five are currently
 726 unused.
 727 </p>
 728
 729 <ol>
 730 <li>The schema version number</li>
 731 <li>The format version number</li>
 732 <li>The recommended pager cache size</li>
 733 <li>The safety level</li>
 734 </ol>
 735
 736 <p>
 737 The first meta-value, the schema version number, is used to detect when
 738 the schema of the database is changed by a CREATE or DROP statement.
 739 Recall that when a database is first opened the sqlite_master table is
 740 scanned and an internal representation of the tables, indices, views,
 741 and triggers for the database is built in memory.  This internal
 742 representation is used for all subsequent SQL command parsing and
 743 execution.  But what if another process were to change the schema
 744 by adding or removing a table, index, view, or trigger?  If the original
 745 process were to continue using the old schema, it could potentially
 746 corrupt the database by writing to a table that no longer exists.
 747 To avoid this problem, the schema version number is changed whenever
 748 a CREATE or DROP statement is executed.  Before each command is
 749 executed, the current schema version number for the database file
 750 is compared against the schema version number from when the sqlite_master
 751 table was last read.  If those numbers are different, the internal
 752 schema representation is erased and the sqlite_master table is reread
 753 to reconstruct the internal schema representation.
 754 (Calls to sqlite_exec() generally return SQLITE_SCHEMA when this happens.)
 755 </p>
 756
 757 <p>
 758 The second meta-value is the schema format version number.  This
 759 number tells what version of the schema layer should be used to
 760 interpret the file.  There have been changes to the schema layer
 761 over time and this number is used to detect when an older database
 762 file is being processed by a newer version of the library.
 763 As of this writing (SQLite version 2.7.0) the current format version
 764 is "4".
 765 </p>
 766
 767 <p>
 768 The third meta-value is the recommended pager cache size as set
 769 by the DEFAULT_CACHE_SIZE pragma.  If the value is positive it
 770 means that synchronous behavior is enable (via the DEFAULT_SYNCHRONOUS
 771 pragma) and if negative it means that synchronous behavior is
 772 disabled.
 773 </p>
 774
 775 <p>
 776 The fourth meta-value is safety level added in version 2.8.0.
 777 A value of 1 corresponds to a SYNCHRONOUS setting of OFF.  In other
 778 words, SQLite does not pause to wait for journal data to reach the disk
 779 surface before overwriting pages of the database.  A value of 2 corresponds
 780 to a SYNCHRONOUS setting of NORMAL.  A value of 3 corresponds to a
 781 SYNCHRONOUS setting of FULL. If the value is 0, that means it has not
 782 been initialized so the default synchronous setting of NORMAL is used.
 783 </p>
 784 }
 785 footer $rcsid