<!-- $PostgreSQL$ -->

<chapter id="wal">
 <title>Reliability and the Write-Ahead Log</title>

 <para>
  This chapter explains how the Write-Ahead Log is used to obtain
  efficient, reliable operation.
 </para>

 <sect1 id="wal-reliability">
  <title>Reliability</title>

  <para>
   Reliability is an important property of any serious database
   system, and <productname>PostgreSQL</> does everything possible to
   guarantee reliable operation. One aspect of reliable operation is
   that all data recorded by a committed transaction should be stored
   in a nonvolatile area that is safe from power loss, operating
   system failure, and hardware failure (except failure of the
   nonvolatile area itself, of course). Successfully writing the data
   to the computer's permanent storage (disk drive or equivalent)
   ordinarily meets this requirement. In fact, even if a computer is
   fatally damaged, if the disk drives survive they can be moved to
   another computer with similar hardware and all committed
   transactions will remain intact.
  </para>

  <para>
   While forcing data periodically to the disk platters might seem like
   a simple operation, it is not. Because disk drives are dramatically
   slower than main memory and CPUs, several layers of caching exist
   between the computer's main memory and the disk platters.
   First, there is the operating system's buffer cache, which caches
   frequently requested disk blocks and combines disk writes. Fortunately,
   all operating systems give applications a way to force writes from
   the buffer cache to disk, and <productname>PostgreSQL</> uses those
   features. (See the <xref linkend="guc-wal-sync-method"> parameter
   to adjust how this is done.)
  </para>

  <para>
   Next, there might be a cache in the disk drive controller; this is
   particularly common on <acronym>RAID</> controller cards. Some of
   these caches are <firstterm>write-through</>, meaning writes are passed
   along to the drive as soon as they arrive. Others are
   <firstterm>write-back</>, meaning data is passed on to the drive at
   some later time. Such caches can be a reliability hazard because the
   memory in the disk controller cache is volatile, and will lose its
   contents in a power failure. Better controller cards have
   <firstterm>battery-backed</> caches, meaning the card has a battery that
   maintains power to the cache in case of system power loss. After power
   is restored the data will be written to the disk drives.
  </para>
  <para>
   And finally, most disk drives have caches. Some are write-through
   while some are write-back, and the
   same concerns about data loss exist for write-back drive caches as
   exist for disk controller caches. Consumer-grade IDE and SATA drives are
   particularly likely to have write-back caches that will not survive a
   power failure. To check write caching on <productname>Linux</> use
   <command>hdparm -I</>; it is enabled if there is a <literal>*</> next
   to <literal>Write cache</>. Use <command>hdparm -W</> to turn off
   write caching. On <productname>FreeBSD</> use
   <application>atacontrol</>. (For SCSI disks use <ulink
   url="http://sg.torque.net/sg/sdparm.html"><application>sdparm</></ulink>
   to turn off <literal>WCE</>.) On <productname>Solaris</> the disk
   write cache is controlled by <ulink
   url="http://www.sun.com/bigadmin/content/submitted/format_utility.jsp"><literal>format
   -e</></ulink>. (The Solaris <acronym>ZFS</> file system is safe with
   disk write-cache enabled because it issues its own disk cache flush
   commands.) On <productname>Windows</> if <varname>wal_sync_method</>
   is <literal>open_datasync</> (the default), write caching is disabled
   by unchecking <literal>My Computer\Open\{select disk
   drive}\Properties\Hardware\Properties\Policies\Enable write caching on
   the disk</>. Also on Windows, <literal>fsync</> and
   <literal>fsync_writethrough</> never do write caching.
  </para>
  <para>
   When the operating system sends a write request to the disk hardware,
   there is little it can do to make sure the data has arrived at a truly
   non-volatile storage area. Rather, it is the
   administrator's responsibility to be sure that all storage components
   ensure data integrity. Avoid disk controllers that have non-battery-backed
   write caches. At the drive level, disable write-back caching if the
   drive cannot guarantee the data will be written before shutdown.
  </para>

  <para>
   Another risk of data loss is posed by the disk platter write
   operations themselves. Disk platters are divided into sectors,
   commonly 512 bytes each. Every physical read or write operation
   processes a whole sector.
   When a write request arrives at the drive, it might be for 512 bytes,
   1024 bytes, or 8192 bytes, and the process of writing could fail due
   to power loss at any time, meaning some of the 512-byte sectors were
   written, and others were not. To guard against such failures,
   <productname>PostgreSQL</> periodically writes full page images to
   permanent storage <emphasis>before</> modifying the actual page on
   disk. By doing this, during crash recovery <productname>PostgreSQL</> can
   restore partially-written pages. If you have a battery-backed disk
   controller or file-system software that prevents partial page writes
   (e.g., ReiserFS 4), you can turn off this page imaging by using the
   <xref linkend="guc-full-page-writes"> parameter.
  </para>
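  <para>
   For example, before deciding whether page imaging can be disabled, the
   current setting can be checked from <application>psql</> (the parameter
   itself is normally changed in <filename>postgresql.conf</>):
  </para>

<programlisting>
-- show whether full page images are currently being written to WAL
SHOW full_page_writes;
</programlisting>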
 </sect1>

 <sect1 id="wal-intro">
  <title>Write-Ahead Logging (<acronym>WAL</acronym>)</title>

  <indexterm zone="wal">
   <primary>WAL</primary>
  </indexterm>

  <indexterm>
   <primary>transaction log</primary>
   <see>WAL</see>
  </indexterm>

  <para>
   <firstterm>Write-Ahead Logging</firstterm> (<acronym>WAL</acronym>)
   is a standard method for ensuring data integrity. A detailed
   description can be found in most (if not all) books about
   transaction processing. Briefly, <acronym>WAL</acronym>'s central
   concept is that changes to data files (where tables and indexes
   reside) must be written only after those changes have been logged,
   that is, after log records describing the changes have been flushed
   to permanent storage. If we follow this procedure, we do not need
   to flush data pages to disk on every transaction commit, because we
   know that in the event of a crash we will be able to recover the
   database using the log: any changes that have not been applied to
   the data pages can be redone from the log records. (This is
   roll-forward recovery, also known as REDO.)
  </para>

  <tip>
   <para>
    Because <acronym>WAL</acronym> restores database file
    contents after a crash, journaled filesystems are not necessary for
    reliable storage of the data files or WAL files. In fact, journaling
    overhead can reduce performance, especially if journaling
    causes file system <emphasis>data</emphasis> to be flushed
    to disk. Fortunately, data flushing during journaling can
    often be disabled with a filesystem mount option, e.g.
    <literal>data=writeback</> on a Linux ext3 file system.
    Journaled file systems do improve boot speed after a crash.
   </para>
  </tip>
  <para>
   Using <acronym>WAL</acronym> results in a
   significantly reduced number of disk writes, because only the log
   file needs to be flushed to disk to guarantee that a transaction is
   committed, rather than every data file changed by the transaction.
   The log file is written sequentially,
   and so the cost of syncing the log is much less than the cost of
   flushing the data pages. This is especially true for servers
   handling many small transactions touching different parts of the data
   store. Furthermore, when the server is processing many small concurrent
   transactions, one <function>fsync</function> of the log file may
   suffice to commit many transactions.
  </para>

  <para>
   <acronym>WAL</acronym> also makes it possible to support on-line
   backup and point-in-time recovery, as described in <xref
   linkend="continuous-archiving">. By archiving the WAL data we can support
   reverting to any time instant covered by the available WAL data:
   we simply install a prior physical backup of the database, and
   replay the WAL log just as far as the desired time. What's more,
   the physical backup doesn't have to be an instantaneous snapshot
   of the database state &mdash; if it is made over some period of time,
   then replaying the WAL log for that period will fix any internal
   inconsistencies.
  </para>
 </sect1>
 <sect1 id="wal-async-commit">
  <title>Asynchronous Commit</title>

  <indexterm>
   <primary>synchronous commit</primary>
  </indexterm>

  <indexterm>
   <primary>asynchronous commit</primary>
  </indexterm>

  <para>
   <firstterm>Asynchronous commit</> is an option that allows transactions
   to complete more quickly, at the cost that the most recent transactions may
   be lost if the database should crash. In many applications this is an
   acceptable trade-off.
  </para>

  <para>
   As described in the previous section, transaction commit is normally
   <firstterm>synchronous</>: the server waits for the transaction's
   <acronym>WAL</acronym> records to be flushed to permanent storage
   before returning a success indication to the client. The client is
   therefore guaranteed that a transaction reported to be committed will
   be preserved, even in the event of a server crash immediately after.
   However, for short transactions this delay is a major component of the
   total transaction time. Selecting asynchronous commit mode means that
   the server returns success as soon as the transaction is logically
   completed, before the <acronym>WAL</acronym> records it generated have
   actually made their way to disk. This can provide a significant boost
   in throughput for small transactions.
  </para>
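  <para>
   For example, a session that is willing to trade the durability of its
   most recent commits for speed can simply turn the parameter off before
   issuing its transactions:
  </para>

<programlisting>
-- commits in this session no longer wait for the WAL flush
SET synchronous_commit TO off;
</programlisting>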
  <para>
   Asynchronous commit introduces the risk of data loss. There is a short
   time window between the report of transaction completion to the client
   and the time that the transaction is truly committed (that is, it is
   guaranteed not to be lost if the server crashes). Thus asynchronous
   commit should not be used if the client will take external actions
   relying on the assumption that the transaction will be remembered.
   As an example, a bank would certainly not use asynchronous commit for
   a transaction recording an ATM's dispensing of cash. But in many
   scenarios, such as event logging, there is no need for a strong
   guarantee of this kind.
  </para>

  <para>
   The risk that is taken by using asynchronous commit is of data loss,
   not data corruption. If the database should crash, it will recover
   by replaying <acronym>WAL</acronym> up to the last record that was
   flushed. The database will therefore be restored to a self-consistent
   state, but any transactions that were not yet flushed to disk will
   not be reflected in that state. The net effect is therefore loss of
   the last few transactions. Because the transactions are replayed in
   commit order, no inconsistency can be introduced &mdash; for example,
   if transaction B made changes relying on the effects of a previous
   transaction A, it is not possible for A's effects to be lost while B's
   effects are preserved.
  </para>

  <para>
   The user can select the commit mode of each transaction, so that
   it is possible to have both synchronous and asynchronous commit
   transactions running concurrently. This allows flexible trade-offs
   between performance and certainty of transaction durability.
   The commit mode is controlled by the user-settable parameter
   <xref linkend="guc-synchronous-commit">, which can be changed in any of
   the ways that a configuration parameter can be set. The mode used for
   any one transaction depends on the value of
   <varname>synchronous_commit</varname> when transaction commit begins.
  </para>
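  <para>
   One convenient way of mixing the two modes is <command>SET LOCAL</>,
   which changes the setting only for the current transaction; an
   unimportant transaction, such as an insert into a hypothetical
   <structname>event_log</> table, can then commit asynchronously while
   other transactions in the same session remain fully durable:
  </para>

<programlisting>
BEGIN;
SET LOCAL synchronous_commit TO off;           -- applies only to this transaction
INSERT INTO event_log VALUES ('page viewed');  -- hypothetical low-value logging table
COMMIT;
</programlisting>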
  <para>
   Certain utility commands, for instance <command>DROP TABLE</>, are
   forced to commit synchronously regardless of the setting of
   <varname>synchronous_commit</varname>. This is to ensure consistency
   between the server's file system and the logical state of the database.
   The commands supporting two-phase commit, such as <command>PREPARE
   TRANSACTION</>, are also always synchronous.
  </para>

  <para>
   If the database crashes during the risk window between an
   asynchronous commit and the writing of the transaction's
   <acronym>WAL</acronym> records,
   then changes made during that transaction <emphasis>will</> be lost.
   The duration of the
   risk window is limited because a background process (the <quote>WAL
   writer</>) flushes unwritten <acronym>WAL</acronym> records to disk
   every <xref linkend="guc-wal-writer-delay"> milliseconds.
   The actual maximum duration of the risk window is three times
   <varname>wal_writer_delay</varname> because the WAL writer is
   designed to favor writing whole pages at a time during busy periods.
  </para>

  <caution>
   <para>
    An immediate-mode shutdown is equivalent to a server crash, and will
    therefore cause loss of any unflushed asynchronous commits.
   </para>
  </caution>
  <para>
   Asynchronous commit provides behavior different from setting
   <xref linkend="guc-fsync"> = off.
   <varname>fsync</varname> is a server-wide
   setting that will alter the behavior of all transactions. It disables
   all logic within <productname>PostgreSQL</> that attempts to synchronize
   writes to different portions of the database, and therefore a system
   crash (that is, a hardware or operating system crash, not a failure of
   <productname>PostgreSQL</> itself) could result in arbitrarily bad
   corruption of the database state. In many scenarios, asynchronous
   commit provides most of the performance improvement that could be
   obtained by turning off <varname>fsync</varname>, but without the risk
   of data corruption.
  </para>

  <para>
   <xref linkend="guc-commit-delay"> also sounds very similar to
   asynchronous commit, but it is actually a synchronous commit method
   (in fact, <varname>commit_delay</varname> is ignored during an
   asynchronous commit). <varname>commit_delay</varname> causes a delay
   just before a synchronous commit attempts to flush
   <acronym>WAL</acronym> to disk, in the hope that a single flush
   executed by one such transaction can also serve other transactions
   committing at about the same time. Setting <varname>commit_delay</varname>
   can only help when there are many concurrently committing transactions,
   and it is difficult to tune it to a value that actually helps rather
   than hurting throughput.
  </para>
 </sect1>
 <sect1 id="wal-configuration">
  <title><acronym>WAL</acronym> Configuration</title>

  <para>
   There are several <acronym>WAL</>-related configuration parameters that
   affect database performance. This section explains their use.
   Consult <xref linkend="runtime-config"> for general information about
   setting server configuration parameters.
  </para>

  <para>
   <firstterm>Checkpoints</firstterm><indexterm><primary>checkpoint</></>
   are points in the sequence of transactions at which it is guaranteed
   that the data files have been updated with all information written before
   the checkpoint. At checkpoint time, all dirty data pages are flushed to
   disk and a special checkpoint record is written to the log file.
   In the event of a crash, the crash recovery procedure looks at the latest
   checkpoint record to determine the point in the log (known as the redo
   record) from which it should start the REDO operation. Any changes made to
   data files before that point are known to be already on disk. Hence, after
   a checkpoint has been made, any log segments preceding the one containing
   the redo record are no longer needed and can be recycled or removed. (When
   <acronym>WAL</acronym> archiving is being done, the log segments must be
   archived before being recycled or removed.)
  </para>

  <para>
   The server's background writer process will automatically perform
   a checkpoint every so often. A checkpoint is created every <xref
   linkend="guc-checkpoint-segments"> log segments, or every <xref
   linkend="guc-checkpoint-timeout"> seconds, whichever comes first.
   The default settings are 3 segments and 300 seconds respectively.
   It is also possible to force a checkpoint by using the SQL command
   <command>CHECKPOINT</command>.
  </para>
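  <para>
   For instance, the current checkpoint settings can be inspected, and a
   superuser can request an immediate checkpoint (for example, just before
   taking a file-system-level backup):
  </para>

<programlisting>
SHOW checkpoint_segments;
SHOW checkpoint_timeout;
CHECKPOINT;   -- force an immediate checkpoint (superuser only)
</programlisting>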
  <para>
   Reducing <varname>checkpoint_segments</varname> and/or
   <varname>checkpoint_timeout</varname> causes checkpoints to be done
   more often. This allows faster after-crash recovery (since less work
   will need to be redone). However, one must balance this against the
   increased cost of flushing dirty data pages more often. If
   <xref linkend="guc-full-page-writes"> is set (as is the default), there is
   another factor to consider. To ensure data page consistency,
   the first modification of a data page after each checkpoint results in
   logging the entire page content. In that case,
   a smaller checkpoint interval increases the volume of output to the WAL log,
   partially negating the goal of using a smaller interval,
   and in any case causing more disk I/O.
  </para>

  <para>
   Checkpoints are fairly expensive, first because they require writing
   out all currently dirty buffers, and second because they result in
   extra subsequent WAL traffic as discussed above. It is therefore
   wise to set the checkpointing parameters high enough that checkpoints
   don't happen too often. As a simple sanity check on your checkpointing
   parameters, you can set the <xref linkend="guc-checkpoint-warning">
   parameter. If checkpoints happen closer together than
   <varname>checkpoint_warning</> seconds,
   a message will be output to the server log recommending increasing
   <varname>checkpoint_segments</varname>. Occasional appearance of such
   a message is not cause for alarm, but if it appears often then the
   checkpoint control parameters should be increased. Bulk operations such
   as large <command>COPY</> transfers might cause a number of such warnings
   to appear if you have not set <varname>checkpoint_segments</> high
   enough.
  </para>
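  <para>
   As a purely illustrative starting point for a write-heavy server, the
   relevant <filename>postgresql.conf</> entries might look like this
   (the values shown are examples, not recommendations):
  </para>

<programlisting>
checkpoint_segments = 16       # default is 3; raise to space checkpoints out
checkpoint_timeout = 5min      # default
checkpoint_warning = 30s       # default; warn if checkpoints come faster than this
</programlisting>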
  <para>
   To avoid flooding the I/O system with a burst of page writes,
   writing dirty buffers during a checkpoint is spread over a period of time.
   That period is controlled by
   <xref linkend="guc-checkpoint-completion-target">, which is
   given as a fraction of the checkpoint interval.
   The I/O rate is adjusted so that the checkpoint finishes when the
   given fraction of <varname>checkpoint_segments</varname> WAL segments
   have been consumed since checkpoint start, or the given fraction of
   <varname>checkpoint_timeout</varname> seconds have elapsed,
   whichever is sooner. With the default value of 0.5,
   <productname>PostgreSQL</> can be expected to complete each checkpoint
   in about half the time before the next checkpoint starts. On a system
   that's very close to maximum I/O throughput during normal operation,
   you might want to increase <varname>checkpoint_completion_target</varname>
   to reduce the I/O load from checkpoints. The disadvantage of this is that
   prolonging checkpoints affects recovery time, because more WAL segments
   will need to be kept around for possible use in recovery. Although
   <varname>checkpoint_completion_target</varname> can be set as high as 1.0,
   it is best to keep it less than that (perhaps 0.9 at most) since
   checkpoints include some other activities besides writing dirty buffers.
   A setting of 1.0 is quite likely to result in checkpoints not being
   completed on time, which would result in performance loss due to
   unexpected variation in the number of WAL segments needed.
  </para>
  <para>
   There will always be at least one WAL segment file, and there will
   normally not be more than (2 + <varname>checkpoint_completion_target</varname>) * <varname>checkpoint_segments</varname> + 1
   files. Each segment file is normally 16 MB (though this size can be
   altered when building the server). You can use this to estimate space
   requirements for <acronym>WAL</acronym>.
   Ordinarily, when old log segment files are no longer needed, they
   are recycled (renamed to become the next segments in the numbered
   sequence). If, due to a short-term peak of log output rate, there
   are more than 3 * <varname>checkpoint_segments</varname> + 1
   segment files, the unneeded segment files will be deleted instead
   of recycled until the system gets back under this limit.
  </para>
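  <para>
   With the default settings of <varname>checkpoint_segments</varname> = 3,
   <varname>checkpoint_completion_target</varname> = 0.5, and 16 MB segments,
   this works out to at most (2 + 0.5) * 3 + 1 = 8.5, that is about 8 segment
   files or roughly 128 MB of disk space for WAL in normal operation, and at
   most 3 * 3 + 1 = 10 files (160 MB) during a short-term peak.
  </para>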
  <para>
   There are two commonly used internal <acronym>WAL</acronym> functions:
   <function>LogInsert</function> and <function>LogFlush</function>.
   <function>LogInsert</function> is used to place a new record into
   the <acronym>WAL</acronym> buffers in shared memory. If there is no
   space for the new record, <function>LogInsert</function> will have
   to write (move to kernel cache) a few filled <acronym>WAL</acronym>
   buffers. This is undesirable because <function>LogInsert</function>
   is used on every database low level modification (for example, row
   insertion) at a time when an exclusive lock is held on affected
   data pages, so the operation needs to be as fast as possible. What
   is worse, writing <acronym>WAL</acronym> buffers might also force the
   creation of a new log segment, which takes even more
   time. Normally, <acronym>WAL</acronym> buffers should be written
   and flushed by a <function>LogFlush</function> request, which is
   made, for the most part, at transaction commit time to ensure that
   transaction records are flushed to permanent storage. On systems
   with high log output, <function>LogFlush</function> requests might
   not occur often enough to prevent <function>LogInsert</function>
   from having to do writes. On such systems
   one should increase the number of <acronym>WAL</acronym> buffers by
   modifying the configuration parameter <xref
   linkend="guc-wal-buffers">. The default number of <acronym>WAL</acronym>
   buffers is 8. Increasing this value will
   correspondingly increase shared memory usage. When
   <xref linkend="guc-full-page-writes"> is set and the system is very busy,
   setting this value higher will help smooth response times during the
   period immediately following each checkpoint.
  </para>
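  <para>
   For example, a busy server might raise the buffer count in
   <filename>postgresql.conf</> (this takes effect only at server start;
   the value shown is illustrative only):
  </para>

<programlisting>
wal_buffers = 32          # number of 8 kB WAL buffers; the default is 8
</programlisting>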
  <para>
   The <xref linkend="guc-commit-delay"> parameter defines for how many
   microseconds the server process will sleep after writing a commit
   record to the log with <function>LogInsert</function> but before
   performing a <function>LogFlush</function>. This delay allows other
   server processes to add their commit records to the log so as to have all
   of them flushed with a single log sync. No sleep will occur if
   <xref linkend="guc-fsync">
   is not enabled, nor if fewer than <xref linkend="guc-commit-siblings">
   other sessions are currently in active transactions; this avoids
   sleeping when it's unlikely that any other session will commit soon.
   Note that on most platforms, the resolution of a sleep request is
   ten milliseconds, so that any nonzero <varname>commit_delay</varname>
   setting between 1 and 10000 microseconds would have the same effect.
   Good values for these parameters are not yet clear; experimentation
   is encouraged.
  </para>
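  <para>
   Both parameters can be experimented with on a per-session basis, for
   example (illustrative values only):
  </para>

<programlisting>
SET commit_delay = 10000;    -- sleep up to 10 ms before flushing the commit record
SET commit_siblings = 5;     -- but only if at least 5 other transactions are active
</programlisting>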
  <para>
   The <xref linkend="guc-wal-sync-method"> parameter determines how
   <productname>PostgreSQL</productname> will ask the kernel to force
   <acronym>WAL</acronym> updates out to disk.
   All the options should be the same as far as reliability goes,
   but it's quite platform-specific which one will be the fastest.
   Note that this parameter is irrelevant if <varname>fsync</varname>
   has been turned off.
  </para>
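  <para>
   The method currently in use on a given platform can be checked with:
  </para>

<programlisting>
SHOW wal_sync_method;
</programlisting>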
  <para>
   Enabling the <xref linkend="guc-wal-debug"> configuration parameter
   (provided that <productname>PostgreSQL</productname> has been
   compiled with support for it) will result in each
   <function>LogInsert</function> and <function>LogFlush</function>
   <acronym>WAL</acronym> call being logged to the server log. This
   option might be replaced by a more general mechanism in the future.
  </para>
 </sect1>
 <sect1 id="wal-internals">
  <title>WAL Internals</title>

  <para>
   <acronym>WAL</acronym> is automatically enabled; no action is
   required from the administrator except ensuring that the
   disk-space requirements for the <acronym>WAL</acronym> logs are met,
   and that any necessary tuning is done (see <xref
   linkend="wal-configuration">).
  </para>

  <para>
   <acronym>WAL</acronym> logs are stored in the directory
   <filename>pg_xlog</filename> under the data directory, as a set of
   segment files, normally each 16 MB in size (but the size can be changed
   by altering the <option>--with-wal-segsize</> configure option when
   building the server). Each segment is divided into pages, normally
   8 kB each (this size can be changed via the <option>--with-wal-blocksize</>
   configure option). The log record headers are described in
   <filename>access/xlog.h</filename>; the record content is dependent
   on the type of event that is being logged. Segment files are given
   ever-increasing numbers as names, starting at
   <filename>000000010000000000000000</filename>. The numbers do not wrap, at
   present, but it should take a very long time to exhaust the
   available stock of numbers.
  </para>
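  <para>
   On reasonably recent server versions, the segment file currently being
   written can be identified from SQL (both functions used here are assumed
   to exist in the version being run):
  </para>

<programlisting>
SELECT pg_current_xlog_location();                    -- current WAL write location
SELECT pg_xlogfile_name(pg_current_xlog_location());  -- corresponding segment file name
</programlisting>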
  <para>
   It is advantageous if the log is located on a different disk from the
   main database files. This can be achieved by moving the directory
   <filename>pg_xlog</filename> to another location (while the server
   is shut down, of course) and creating a symbolic link from the
   original location in the main data directory to the new location.
  </para>
  <para>
   The aim of <acronym>WAL</acronym>, to ensure that the log is
   written before database records are altered, can be subverted by
   disk drives<indexterm><primary>disk drive</></> that falsely report a
   successful write to the kernel,
   when in fact they have only cached the data and not yet stored it
   on the disk. A power failure in such a situation might still lead to
   irrecoverable data corruption. Administrators should try to ensure
   that disks holding <productname>PostgreSQL</productname>'s
   <acronym>WAL</acronym> log files do not make such false reports.
  </para>

  <para>
   After a checkpoint has been made and the log flushed, the
   checkpoint's position is saved in the file
   <filename>pg_control</filename>. Therefore, when recovery is to be
   done, the server first reads <filename>pg_control</filename> and
   then the checkpoint record; then it performs the REDO operation by
   scanning forward from the log position indicated in the checkpoint
   record. Because the entire content of data pages is saved in the
   log on the first page modification after a checkpoint (assuming
   <xref linkend="guc-full-page-writes"> is not disabled), all pages
   changed since the checkpoint will be restored to a consistent
   state.
  </para>

  <para>
   To deal with the case where <filename>pg_control</filename> is
   corrupted, we should support the possibility of scanning existing log
   segments in reverse order &mdash; newest to oldest &mdash; in order to find the
   latest checkpoint. This has not been implemented yet.
   <filename>pg_control</filename> is small enough (less than one disk page)
   that it is not subject to partial-write problems, and as of this writing
   there have been no reports of database failures due solely to inability
   to read <filename>pg_control</filename> itself. So while it is
   theoretically a weak spot, <filename>pg_control</filename> does not
   seem to be a problem in practice.
  </para>
 </sect1>
</chapter>