Remove tab from SGML.
[pgsql-fdw.git] / doc / src / sgml / high-availability.sgml
blobc3e700a7899ccdde4d3b28085ffc397587cf95b8
1 <!-- $PostgreSQL$ -->
3 <chapter id="high-availability">
4 <title>High Availability, Load Balancing, and Replication</title>
6 <indexterm><primary>high availability</></>
7 <indexterm><primary>failover</></>
8 <indexterm><primary>replication</></>
9 <indexterm><primary>load balancing</></>
10 <indexterm><primary>clustering</></>
11 <indexterm><primary>data partitioning</></>
13 <para>
14 Database servers can work together to allow a second server to
15 take over quickly if the primary server fails (high
16 availability), or to allow several computers to serve the same
17 data (load balancing). Ideally, database servers could work
18 together seamlessly. Web servers serving static web pages can
19 be combined quite easily by merely load-balancing web requests
20 to multiple machines. In fact, read-only database servers can
21 be combined relatively easily too. Unfortunately, most database
22 servers have a read/write mix of requests, and read/write servers
23 are much harder to combine. This is because though read-only
24 data needs to be placed on each server only once, a write to any
25 server has to be propagated to all servers so that future read
26 requests to those servers return consistent results.
27 </para>
29 <para>
30 This synchronization problem is the fundamental difficulty for
31 servers working together. Because there is no single solution
32 that eliminates the impact of the sync problem for all use cases,
33 there are multiple solutions. Each solution addresses this
34 problem in a different way, and minimizes its impact for a specific
35 workload.
36 </para>
38 <para>
39 Some solutions deal with synchronization by allowing only one
40 server to modify the data. Servers that can modify data are
41 called read/write, <firstterm>master</> or <firstterm>primary</> servers.
42 Servers that track changes in the master are called <firstterm>standby</>
43 or <firstterm>slave</> servers. A standby server that cannot be connected
44 to until it is promoted to a master server is called a <firstterm>warm
45 standby</> server, and one that can accept connections and serves read-only
46 queries is called a <firstterm>hot standby</> server.
47 </para>
49 <para>
50 Some solutions are synchronous,
51 meaning that a data-modifying transaction is not considered
52 committed until all servers have committed the transaction. This
53 guarantees that a failover will not lose any data and that all
54 load-balanced servers will return consistent results no matter
55 which server is queried. In contrast, asynchronous solutions allow some
56 delay between the time of a commit and its propagation to the other servers,
57 opening the possibility that some transactions might be lost in
58 the switch to a backup server, and that load balanced servers
59 might return slightly stale results. Asynchronous communication
60 is used when synchronous would be too slow.
61 </para>
63 <para>
64 Solutions can also be categorized by their granularity. Some solutions
65 can deal only with an entire database server, while others allow control
66 at the per-table or per-database level.
67 </para>
69 <para>
70 Performance must be considered in any choice. There is usually a
71 trade-off between functionality and
72 performance. For example, a fully synchronous solution over a slow
73 network might cut performance by more than half, while an asynchronous
74 one might have a minimal performance impact.
75 </para>
77 <para>
78 The remainder of this section outlines various failover, replication,
79 and load balancing solutions. A <ulink
80 url="http://www.postgres-r.org/documentation/terms">glossary</ulink> is
81 also available.
82 </para>
84 <sect1 id="different-replication-solutions">
85 <title>Comparison of different solutions</title>
87 <variablelist>
89 <varlistentry>
90 <term>Shared Disk Failover</term>
91 <listitem>
93 <para>
94 Shared disk failover avoids synchronization overhead by having only one
95 copy of the database. It uses a single disk array that is shared by
96 multiple servers. If the main database server fails, the standby server
97 is able to mount and start the database as though it were recovering from
98 a database crash. This allows rapid failover with no data loss.
99 </para>
101 <para>
102 Shared hardware functionality is common in network storage devices.
103 Using a network file system is also possible, though care must be
104 taken that the file system has full <acronym>POSIX</> behavior (see <xref
105 linkend="creating-cluster-nfs">). One significant limitation of this
106 method is that if the shared disk array fails or becomes corrupt, the
107 primary and standby servers are both nonfunctional. Another issue is
108 that the standby server should never access the shared storage while
109 the primary server is running.
110 </para>
112 </listitem>
113 </varlistentry>
115 <varlistentry>
116 <term>File System (Block-Device) Replication</term>
117 <listitem>
119 <para>
120 A modified version of shared hardware functionality is file system
121 replication, where all changes to a file system are mirrored to a file
122 system residing on another computer. The only restriction is that
123 the mirroring must be done in a way that ensures the standby server
124 has a consistent copy of the file system &mdash; specifically, writes
125 to the standby must be done in the same order as those on the master.
126 <productname>DRBD</> is a popular file system replication solution
127 for Linux.
128 </para>
130 <!--
131 https://forge.continuent.org/pipermail/sequoia/2006-November/004070.html
133 Oracle RAC is a shared disk approach and just send cache invalidations
134 to other nodes but not actual data. As the disk is shared, data is
135 only committed once to disk and there is a distributed locking
136 protocol to make nodes agree on a serializable transactional order.
139 </listitem>
140 </varlistentry>
142 <varlistentry>
143 <term>Warm and Hot Standby Using Point-In-Time Recovery (<acronym>PITR</>)</term>
144 <listitem>
146 <para>
147 Warm and hot standby servers can be kept current by reading a
148 stream of write-ahead log (<acronym>WAL</>)
149 records. If the main server fails, the standby contains
150 almost all of the data of the main server, and can be quickly
151 made the new master database server. This is asynchronous and
152 can only be done for the entire database server.
153 </para>
154 <para>
155 A PITR standby server can be implemented using file-based log shipping
156 (<xref linkend="warm-standby">) or streaming replication (see
157 <xref linkend="streaming-replication">), or a combination of both. For
158 information on hot standby, see <xref linkend="hot-standby">.
159 </para>
160 </listitem>
161 </varlistentry>
163 <varlistentry>
164 <term>Trigger-Based Master-Standby Replication</term>
165 <listitem>
167 <para>
168 A master-standby replication setup sends all data modification
169 queries to the master server. The master server asynchronously
170 sends data changes to the standby server. The standby can answer
171 read-only queries while the master server is running. The
172 standby server is ideal for data warehouse queries.
173 </para>
175 <para>
176 <productname>Slony-I</> is an example of this type of replication, with per-table
177 granularity, and support for multiple standby servers. Because it
178 updates the standby server asynchronously (in batches), there is
179 possible data loss during fail over.
180 </para>
181 </listitem>
182 </varlistentry>
184 <varlistentry>
185 <term>Statement-Based Replication Middleware</term>
186 <listitem>
188 <para>
189 With statement-based replication middleware, a program intercepts
190 every SQL query and sends it to one or all servers. Each server
191 operates independently. Read-write queries are sent to all servers,
192 while read-only queries can be sent to just one server, allowing
193 the read workload to be distributed.
194 </para>
196 <para>
197 If queries are simply broadcast unmodified, functions like
198 <function>random()</>, <function>CURRENT_TIMESTAMP</>, and
199 sequences can have different values on different servers.
200 This is because each server operates independently, and because
201 SQL queries are broadcast (and not actual modified rows). If
202 this is unacceptable, either the middleware or the application
203 must query such values from a single server and then use those
204 values in write queries. Another option is to use this replication
205 option with a traditional master-standby setup, i.e. data modification
206 queries are sent only to the master and are propagated to the
207 standby servers via master-standby replication, not by the replication
208 middleware. Care must also be taken that all
209 transactions either commit or abort on all servers, perhaps
210 using two-phase commit (<xref linkend="sql-prepare-transaction">
211 and <xref linkend="sql-commit-prepared">.
212 <productname>Pgpool-II</> and <productname>Sequoia</> are examples of
213 this type of replication.
214 </para>
215 </listitem>
216 </varlistentry>
218 <varlistentry>
219 <term>Asynchronous Multimaster Replication</term>
220 <listitem>
222 <para>
223 For servers that are not regularly connected, like laptops or
224 remote servers, keeping data consistent among servers is a
225 challenge. Using asynchronous multimaster replication, each
226 server works independently, and periodically communicates with
227 the other servers to identify conflicting transactions. The
228 conflicts can be resolved by users or conflict resolution rules.
229 Bucardo is an example of this type of replication.
230 </para>
231 </listitem>
232 </varlistentry>
234 <varlistentry>
235 <term>Synchronous Multimaster Replication</term>
236 <listitem>
238 <para>
239 In synchronous multimaster replication, each server can accept
240 write requests, and modified data is transmitted from the
241 original server to every other server before each transaction
242 commits. Heavy write activity can cause excessive locking,
243 leading to poor performance. In fact, write performance is
244 often worse than that of a single server. Read requests can
245 be sent to any server. Some implementations use shared disk
246 to reduce the communication overhead. Synchronous multimaster
247 replication is best for mostly read workloads, though its big
248 advantage is that any server can accept write requests &mdash;
249 there is no need to partition workloads between master and
250 standby servers, and because the data changes are sent from one
251 server to another, there is no problem with non-deterministic
252 functions like <function>random()</>.
253 </para>
255 <para>
256 <productname>PostgreSQL</> does not offer this type of replication,
257 though <productname>PostgreSQL</> two-phase commit (<xref
258 linkend="sql-prepare-transaction"> and <xref
259 linkend="sql-commit-prepared">)
260 can be used to implement this in application code or middleware.
261 </para>
262 </listitem>
263 </varlistentry>
265 <varlistentry>
266 <term>Commercial Solutions</term>
267 <listitem>
269 <para>
270 Because <productname>PostgreSQL</> is open source and easily
271 extended, a number of companies have taken <productname>PostgreSQL</>
272 and created commercial closed-source solutions with unique
273 failover, replication, and load balancing capabilities.
274 </para>
275 </listitem>
276 </varlistentry>
278 </variablelist>
280 <para>
281 <xref linkend="high-availability-matrix"> summarizes
282 the capabilities of the various solutions listed above.
283 </para>
285 <table id="high-availability-matrix">
286 <title>High Availability, Load Balancing, and Replication Feature Matrix</title>
287 <tgroup cols="8">
288 <thead>
289 <row>
290 <entry>Feature</entry>
291 <entry>Shared Disk Failover</entry>
292 <entry>File System Replication</entry>
293 <entry>Hot/Warm Standby Using PITR</entry>
294 <entry>Trigger-Based Master-Standby Replication</entry>
295 <entry>Statement-Based Replication Middleware</entry>
296 <entry>Asynchronous Multimaster Replication</entry>
297 <entry>Synchronous Multimaster Replication</entry>
298 </row>
299 </thead>
301 <tbody>
303 <row>
304 <entry>Most Common Implementation</entry>
305 <entry align="center">NAS</entry>
306 <entry align="center">DRBD</entry>
307 <entry align="center">PITR</entry>
308 <entry align="center">Slony</entry>
309 <entry align="center">pgpool-II</entry>
310 <entry align="center">Bucardo</entry>
311 <entry align="center"></entry>
312 </row>
314 <row>
315 <entry>Communication Method</entry>
316 <entry align="center">shared disk</entry>
317 <entry align="center">disk blocks</entry>
318 <entry align="center">WAL</entry>
319 <entry align="center">table rows</entry>
320 <entry align="center">SQL</entry>
321 <entry align="center">table rows</entry>
322 <entry align="center">table rows and row locks</entry>
323 </row>
325 <row>
326 <entry>No special hardware required</entry>
327 <entry align="center"></entry>
328 <entry align="center">&bull;</entry>
329 <entry align="center">&bull;</entry>
330 <entry align="center">&bull;</entry>
331 <entry align="center">&bull;</entry>
332 <entry align="center">&bull;</entry>
333 <entry align="center">&bull;</entry>
334 </row>
336 <row>
337 <entry>Allows multiple master servers</entry>
338 <entry align="center"></entry>
339 <entry align="center"></entry>
340 <entry align="center"></entry>
341 <entry align="center"></entry>
342 <entry align="center">&bull;</entry>
343 <entry align="center">&bull;</entry>
344 <entry align="center">&bull;</entry>
345 </row>
347 <row>
348 <entry>No master server overhead</entry>
349 <entry align="center">&bull;</entry>
350 <entry align="center"></entry>
351 <entry align="center">&bull;</entry>
352 <entry align="center"></entry>
353 <entry align="center">&bull;</entry>
354 <entry align="center"></entry>
355 <entry align="center"></entry>
356 </row>
358 <row>
359 <entry>No waiting for multiple servers</entry>
360 <entry align="center">&bull;</entry>
361 <entry align="center"></entry>
362 <entry align="center">&bull;</entry>
363 <entry align="center">&bull;</entry>
364 <entry align="center"></entry>
365 <entry align="center">&bull;</entry>
366 <entry align="center"></entry>
367 </row>
369 <row>
370 <entry>Master failure will never lose data</entry>
371 <entry align="center">&bull;</entry>
372 <entry align="center">&bull;</entry>
373 <entry align="center"></entry>
374 <entry align="center"></entry>
375 <entry align="center">&bull;</entry>
376 <entry align="center"></entry>
377 <entry align="center">&bull;</entry>
378 </row>
380 <row>
381 <entry>Standby accept read-only queries</entry>
382 <entry align="center"></entry>
383 <entry align="center"></entry>
384 <entry align="center">Hot only</entry>
385 <entry align="center">&bull;</entry>
386 <entry align="center">&bull;</entry>
387 <entry align="center">&bull;</entry>
388 <entry align="center">&bull;</entry>
389 </row>
391 <row>
392 <entry>Per-table granularity</entry>
393 <entry align="center"></entry>
394 <entry align="center"></entry>
395 <entry align="center"></entry>
396 <entry align="center">&bull;</entry>
397 <entry align="center"></entry>
398 <entry align="center">&bull;</entry>
399 <entry align="center">&bull;</entry>
400 </row>
402 <row>
403 <entry>No conflict resolution necessary</entry>
404 <entry align="center">&bull;</entry>
405 <entry align="center">&bull;</entry>
406 <entry align="center">&bull;</entry>
407 <entry align="center">&bull;</entry>
408 <entry align="center"></entry>
409 <entry align="center"></entry>
410 <entry align="center">&bull;</entry>
411 </row>
413 </tbody>
414 </tgroup>
415 </table>
417 <para>
418 There are a few solutions that do not fit into the above categories:
419 </para>
421 <variablelist>
423 <varlistentry>
424 <term>Data Partitioning</term>
425 <listitem>
427 <para>
428 Data partitioning splits tables into data sets. Each set can
429 be modified by only one server. For example, data can be
430 partitioned by offices, e.g., London and Paris, with a server
431 in each office. If queries combining London and Paris data
432 are necessary, an application can query both servers, or
433 master/standby replication can be used to keep a read-only copy
434 of the other office's data on each server.
435 </para>
436 </listitem>
437 </varlistentry>
439 <varlistentry>
440 <term>Multiple-Server Parallel Query Execution</term>
441 <listitem>
443 <para>
444 Many of the above solutions allow multiple servers to handle multiple
445 queries, but none allow a single query to use multiple servers to
446 complete faster. This solution allows multiple servers to work
447 concurrently on a single query. It is usually accomplished by
448 splitting the data among servers and having each server execute its
449 part of the query and return results to a central server where they
450 are combined and returned to the user. <productname>Pgpool-II</>
451 has this capability. Also, this can be implemented using the
452 <productname>PL/Proxy</> toolset.
453 </para>
455 </listitem>
456 </varlistentry>
458 </variablelist>
460 </sect1>
463 <sect1 id="warm-standby">
464 <title>Log-Shipping Standby Servers</title>
467 <para>
468 Continuous archiving can be used to create a <firstterm>high
469 availability</> (HA) cluster configuration with one or more
470 <firstterm>standby servers</> ready to take over operations if the
471 primary server fails. This capability is widely referred to as
472 <firstterm>warm standby</> or <firstterm>log shipping</>.
473 </para>
475 <para>
476 The primary and standby server work together to provide this capability,
477 though the servers are only loosely coupled. The primary server operates
478 in continuous archiving mode, while each standby server operates in
479 continuous recovery mode, reading the WAL files from the primary. No
480 changes to the database tables are required to enable this capability,
481 so it offers low administration overhead compared to some other
482 replication solutions. This configuration also has relatively low
483 performance impact on the primary server.
484 </para>
486 <para>
487 Directly moving WAL records from one database server to another
488 is typically described as log shipping. <productname>PostgreSQL</>
489 implements file-based log shipping, which means that WAL records are
490 transferred one file (WAL segment) at a time. WAL files (16MB) can be
491 shipped easily and cheaply over any distance, whether it be to an
492 adjacent system, another system at the same site, or another system on
493 the far side of the globe. The bandwidth required for this technique
494 varies according to the transaction rate of the primary server.
495 Record-based log shipping is also possible with streaming replication
496 (see <xref linkend="streaming-replication">).
497 </para>
499 <para>
500 It should be noted that the log shipping is asynchronous, i.e., the WAL
501 records are shipped after transaction commit. As a result, there is a
502 window for data loss should the primary server suffer a catastrophic
503 failure; transactions not yet shipped will be lost. The size of the
504 data loss window in file-based log shipping can be limited by use of the
505 <varname>archive_timeout</varname> parameter, which can be set as low
506 as a few seconds. However such a low setting will
507 substantially increase the bandwidth required for file shipping.
508 If you need a window of less than a minute or so, consider using
509 streaming replication (see <xref linkend="streaming-replication">).
510 </para>
512 <para>
513 Recovery performance is sufficiently good that the standby will
514 typically be only moments away from full
515 availability once it has been activated. As a result, this is called
516 a warm standby configuration which offers high
517 availability. Restoring a server from an archived base backup and
518 rollforward will take considerably longer, so that technique only
519 offers a solution for disaster recovery, not high availability.
520 A standby server can also be used for read-only queries, in which case
521 it is called a Hot Standby server. See <xref linkend="hot-standby"> for
522 more information.
523 </para>
525 <indexterm zone="high-availability">
526 <primary>warm standby</primary>
527 </indexterm>
529 <indexterm zone="high-availability">
530 <primary>PITR standby</primary>
531 </indexterm>
533 <indexterm zone="high-availability">
534 <primary>standby server</primary>
535 </indexterm>
537 <indexterm zone="high-availability">
538 <primary>log shipping</primary>
539 </indexterm>
541 <indexterm zone="high-availability">
542 <primary>witness server</primary>
543 </indexterm>
545 <indexterm zone="high-availability">
546 <primary>STONITH</primary>
547 </indexterm>
549 <sect2 id="standby-planning">
550 <title>Planning</title>
552 <para>
553 It is usually wise to create the primary and standby servers
554 so that they are as similar as possible, at least from the
555 perspective of the database server. In particular, the path names
556 associated with tablespaces will be passed across unmodified, so both
557 primary and standby servers must have the same mount paths for
558 tablespaces if that feature is used. Keep in mind that if
559 <xref linkend="sql-createtablespace">
560 is executed on the primary, any new mount point needed for it must
561 be created on the primary and all standby servers before the command
562 is executed. Hardware need not be exactly the same, but experience shows
563 that maintaining two identical systems is easier than maintaining two
564 dissimilar ones over the lifetime of the application and system.
565 In any case the hardware architecture must be the same &mdash; shipping
566 from, say, a 32-bit to a 64-bit system will not work.
567 </para>
569 <para>
570 In general, log shipping between servers running different major
571 <productname>PostgreSQL</> release
572 levels is not possible. It is the policy of the PostgreSQL Global
573 Development Group not to make changes to disk formats during minor release
574 upgrades, so it is likely that running different minor release levels
575 on primary and standby servers will work successfully. However, no
576 formal support for that is offered and you are advised to keep primary
577 and standby servers at the same release level as much as possible.
578 When updating to a new minor release, the safest policy is to update
579 the standby servers first &mdash; a new minor release is more likely
580 to be able to read WAL files from a previous minor release than vice
581 versa.
582 </para>
584 </sect2>
586 <sect2 id="standby-server-operation">
587 <title>Standby Server Operation</title>
589 <para>
590 In standby mode, the server continuously applies WAL received from the
591 master server. The standby server can read WAL from a WAL archive
592 (see <varname>restore_command</>) or directly from the master
593 over a TCP connection (streaming replication). The standby server will
594 also attempt to restore any WAL found in the standby cluster's
595 <filename>pg_xlog</> directory. That typically happens after a server
596 restart, when the standby replays again WAL that was streamed from the
597 master before the restart, but you can also manually copy files to
598 <filename>pg_xlog</> at any time to have them replayed.
599 </para>
601 <para>
602 At startup, the standby begins by restoring all WAL available in the
603 archive location, calling <varname>restore_command</>. Once it
604 reaches the end of WAL available there and <varname>restore_command</>
605 fails, it tries to restore any WAL available in the pg_xlog directory.
606 If that fails, and streaming replication has been configured, the
607 standby tries to connect to the primary server and start streaming WAL
608 from the last valid record found in archive or pg_xlog. If that fails
609 or streaming replication is not configured, or if the connection is
610 later disconnected, the standby goes back to step 1 and tries to
611 restore the file from the archive again. This loop of retries from the
612 archive, pg_xlog, and via streaming replication goes on until the server
613 is stopped or failover is triggered by a trigger file.
614 </para>
616 <para>
617 Standby mode is exited and the server switches to normal operation,
618 when a trigger file is found (<varname>trigger_file</>). Before failover,
619 any WAL immediately available in the archive or in pg_xlog will be
620 restored, but no attempt is made to connect to the master.
621 </para>
622 </sect2>
624 <sect2 id="preparing-master-for-standby">
625 <title>Preparing the Master for Standby Servers</title>
627 <para>
628 Set up continuous archiving on the primary to an archive directory
629 accessible from the standby, as described
630 in <xref linkend="continuous-archiving">. The archive location should be
631 accessible from the standby even when the master is down, i.e. it should
632 reside on the standby server itself or another trusted server, not on
633 the master server.
634 </para>
636 <para>
637 If you want to use streaming replication, set up authentication on the
638 primary server to allow replication connections from the standby
639 server(s); that is, provide a suitable entry or entries in
640 <filename>pg_hba.conf</> with the database field set to
641 <literal>replication</>. Also ensure <varname>max_wal_senders</> is set
642 to a sufficiently large value in the configuration file of the primary
643 server.
644 </para>
646 <para>
647 Take a base backup as described in <xref linkend="backup-base-backup">
648 to bootstrap the standby server.
649 </para>
650 </sect2>
652 <sect2 id="standby-server-setup">
653 <title>Setting Up a Standby Server</title>
655 <para>
656 To set up the standby server, restore the base backup taken from primary
657 server (see <xref linkend="backup-pitr-recovery">). Create a recovery
658 command file <filename>recovery.conf</> in the standby's cluster data
659 directory, and turn on <varname>standby_mode</>. Set
660 <varname>restore_command</> to a simple command to copy files from
661 the WAL archive.
662 </para>
664 <note>
665 <para>
666 Do not use pg_standby or similar tools with the built-in standby mode
667 described here. <varname>restore_command</> should return immediately
668 if the file does not exist; the server will retry the command again if
669 necessary. See <xref linkend="log-shipping-alternative">
670 for using tools like pg_standby.
671 </para>
672 </note>
674 <para>
675 If you want to use streaming replication, fill in
676 <varname>primary_conninfo</> with a libpq connection string, including
677 the host name (or IP address) and any additional details needed to
678 connect to the primary server. If the primary needs a password for
679 authentication, the password needs to be specified in
680 <varname>primary_conninfo</> as well.
681 </para>
683 <para>
684 You can use <varname>archive_cleanup_command</> to prune the archive of
685 files no longer needed by the standby.
686 </para>
688 <para>
689 If you're setting up the standby server for high availability purposes,
690 set up WAL archiving, connections and authentication like the primary
691 server, because the standby server will work as a primary server after
692 failover. You will also need to set <varname>trigger_file</> to make
693 it possible to fail over.
694 If you're setting up the standby server for reporting
695 purposes, with no plans to fail over to it, <varname>trigger_file</>
696 is not required.
697 </para>
699 <para>
700 A simple example of a <filename>recovery.conf</> is:
701 <programlisting>
702 standby_mode = 'on'
703 primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass'
704 restore_command = 'cp /path/to/archive/%f %p'
705 trigger_file = '/path/to/trigger_file'
706 </programlisting>
707 </para>
709 <para>
710 You can have any number of standby servers, but if you use streaming
711 replication, make sure you set <varname>max_wal_senders</> high enough in
712 the primary to allow them to be connected simultaneously.
713 </para>
715 <para>
716 If you're using a WAL archive, its size can be minimized using
717 the <varname>archive_cleanup_command</> option to remove files that are
718 no longer required by the standby server. Note however, that if you're
719 using the archive for backup purposes, you need to retain files needed
720 to recover from at least the latest base backup, even if they're no
721 longer needed by the standby.
722 </para>
723 </sect2>
725 <sect2 id="streaming-replication">
726 <title>Streaming Replication</title>
728 <indexterm zone="high-availability">
729 <primary>Streaming Replication</primary>
730 </indexterm>
732 <para>
733 Streaming replication allows a standby server to stay more up-to-date
734 than is possible with file-based log shipping. The standby connects
735 to the primary, which streams WAL records to the standby as they're
736 generated, without waiting for the WAL file to be filled.
737 </para>
739 <para>
740 Streaming replication is asynchronous, so there is still a small delay
741 between committing a transaction in the primary and for the changes to
742 become visible in the standby. The delay is however much smaller than with
743 file-based log shipping, typically under one second assuming the standby
744 is powerful enough to keep up with the load. With streaming replication,
745 <varname>archive_timeout</> is not required to reduce the data loss
746 window.
747 </para>
749 <para>
750 If you use streaming replication without file-based continuous
751 archiving, you have to set <varname>wal_keep_segments</> in the master
752 to a value high enough to ensure that old WAL segments are not recycled
753 too early, while the standby might still need them to catch up. If the
754 standby falls behind too much, it needs to be reinitialized from a new
755 base backup. If you set up a WAL archive that's accessible from the
756 standby, wal_keep_segments is not required as the standby can always
757 use the archive to catch up.
758 </para>
760 <para>
761 To use streaming replication, set up a file-based log-shipping standby
762 server as described in <xref linkend="warm-standby">. The step that
763 turns a file-based log-shipping standby into streaming replication
764 standby is setting <varname>primary_conninfo</> setting in the
765 <filename>recovery.conf</> file to point to the primary server. Set
766 <xref linkend="guc-listen-addresses"> and authentication options
767 (see <filename>pg_hba.conf</>) on the primary so that the standby server
768 can connect to the <literal>replication</> pseudo-database on the primary
769 server (see <xref linkend="streaming-replication-authentication">).
770 </para>
772 <para>
773 On systems that support the keepalive socket option, setting
774 <xref linkend="guc-tcp-keepalives-idle">,
775 <xref linkend="guc-tcp-keepalives-interval"> and
776 <xref linkend="guc-tcp-keepalives-count"> helps the primary promptly
777 notice a broken connection.
778 </para>
780 <para>
781 Set the maximum number of concurrent connections from the standby servers
782 (see <xref linkend="guc-max-wal-senders"> for details).
783 </para>
785 <para>
786 When the standby is started and <varname>primary_conninfo</> is set
787 correctly, the standby will connect to the primary after replaying all
788 WAL files available in the archive. If the connection is established
789 successfully, you will see a walreceiver process in the standby, and
790 a corresponding walsender process in the primary.
791 </para>
793 <sect3 id="streaming-replication-authentication">
794 <title>Authentication</title>
795 <para>
796 It is very important that the access privileges for replication be set up
797 so that only trusted users can read the WAL stream, because it is
798 easy to extract privileged information from it. Standby servers must
799 authenticate to the primary as a superuser account.
800 So a role with the <literal>SUPERUSER</> and <literal>LOGIN</>
801 privileges needs to be created on the primary.
802 </para>
803 <para>
804 Client authentication for replication is controlled by a
805 <filename>pg_hba.conf</> record specifying <literal>replication</> in the
806 <replaceable>database</> field. For example, if the standby is running on
807 host IP <literal>192.168.1.100</> and the superuser's name for replication
808 is <literal>foo</>, the administrator can add the following line to the
809 <filename>pg_hba.conf</> file on the primary:
811 <programlisting>
812 # Allow the user "foo" from host 192.168.1.100 to connect to the primary
813 # as a replication standby if the user's password is correctly supplied.
815 # TYPE DATABASE USER CIDR-ADDRESS METHOD
816 host replication foo 192.168.1.100/32 md5
817 </programlisting>
818 </para>
819 <para>
820 The host name and port number of the primary, connection user name,
821 and password are specified in the <filename>recovery.conf</> file.
822 The password can also be set in the <filename>~/.pgpass</> file on the
823 standby (specify <literal>replication</> in the <replaceable>database</>
824 field).
825 For example, if the primary is running on host IP <literal>192.168.1.50</>,
826 port <literal>5432</literal>, the superuser's name for replication is
827 <literal>foo</>, and the password is <literal>foopass</>, the administrator
828 can add the following line to the <filename>recovery.conf</> file on the
829 standby:
831 <programlisting>
832 # The standby connects to the primary that is running on host 192.168.1.50
833 # and port 5432 as the user "foo" whose password is "foopass".
834 primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass'
835 </programlisting>
836 </para>
837 </sect3>
839 <sect3 id="streaming-replication-monitoring">
840 <title>Monitoring</title>
841 <para>
842 An important health indicator of streaming replication is the amount
843 of WAL records generated in the primary, but not yet applied in the
844 standby. You can calculate this lag by comparing the current WAL write
845 location on the primary with the last WAL location received by the
846 standby. They can be retrieved using
847 <function>pg_current_xlog_location</> on the primary and the
848 <function>pg_last_xlog_receive_location</> on the standby,
849 respectively (see <xref linkend="functions-admin-backup-table"> and
850 <xref linkend="functions-recovery-info-table"> for details).
851 The last WAL receive location in the standby is also displayed in the
852 process status of the WAL receiver process, displayed using the
853 <command>ps</> command (see <xref linkend="monitoring-ps"> for details).
854 </para>
855 </sect3>
857 </sect2>
858 </sect1>
860 <sect1 id="warm-standby-failover">
861 <title>Failover</title>
863 <para>
864 If the primary server fails then the standby server should begin
865 failover procedures.
866 </para>
868 <para>
869 If the standby server fails then no failover need take place. If the
870 standby server can be restarted, even some time later, then the recovery
871 process can also be restarted immediately, taking advantage of
872 restartable recovery. If the standby server cannot be restarted, then a
873 full new standby server instance should be created.
874 </para>
876 <para>
877 If the primary server fails and the standby server becomes the
878 new primary, and then the old primary restarts, you must have
879 a mechanism for informing the old primary that it is no longer the primary. This is
880 sometimes known as <acronym>STONITH</> (Shoot The Other Node In The Head), which is
881 necessary to avoid situations where both systems think they are the
882 primary, which will lead to confusion and ultimately data loss.
883 </para>
885 <para>
886 Many failover systems use just two systems, the primary and the standby,
887 connected by some kind of heartbeat mechanism to continually verify the
888 connectivity between the two and the viability of the primary. It is
889 also possible to use a third system (called a witness server) to prevent
890 some cases of inappropriate failover, but the additional complexity
891 might not be worthwhile unless it is set up with sufficient care and
892 rigorous testing.
893 </para>
895 <para>
896 <productname>PostgreSQL</productname> does not provide the system
897 software required to identify a failure on the primary and notify
898 the standby database server. Many such tools exist and are well
899 integrated with the operating system facilities required for
900 successful failover, such as IP address migration.
901 </para>
903 <para>
904 Once failover to the standby occurs, there is only a
905 single server in operation. This is known as a degenerate state.
906 The former standby is now the primary, but the former primary is down
907 and might stay down. To return to normal operation, a standby server
908 must be recreated,
909 either on the former primary system when it comes up, or on a third,
910 possibly new, system. Once complete the primary and standby can be
911 considered to have switched roles. Some people choose to use a third
912 server to provide backup for the new primary until the new standby
913 server is recreated,
914 though clearly this complicates the system configuration and
915 operational processes.
916 </para>
918 <para>
919 So, switching from primary to standby server can be fast but requires
920 some time to re-prepare the failover cluster. Regular switching from
921 primary to standby is useful, since it allows regular downtime on
922 each system for maintenance. This also serves as a test of the
923 failover mechanism to ensure that it will really work when you need it.
924 Written administration procedures are advised.
925 </para>
927 <para>
928 To trigger failover of a log-shipping standby server, create a trigger
929 file with the filename and path specified by the <varname>trigger_file</>
930 setting in <filename>recovery.conf</>. If <varname>trigger_file</> is
931 not given, there is no way to exit recovery in the standby and promote
932 it to a master. That can be useful for e.g reporting servers that are
933 only used to offload read-only queries from the primary, not for high
934 availability purposes.
935 </para>
936 </sect1>
938 <sect1 id="log-shipping-alternative">
939 <title>Alternative method for log shipping</title>
941 <para>
942 An alternative to the built-in standby mode described in the previous
943 sections is to use a <varname>restore_command</> that polls the archive location.
944 This was the only option available in versions 8.4 and below. In this
945 setup, set <varname>standby_mode</> off, because you are implementing
946 the polling required for standby operation yourself. See
947 contrib/pg_standby (<xref linkend="pgstandby">) for a reference
948 implementation of this.
949 </para>
951 <para>
952 Note that in this mode, the server will apply WAL one file at a
953 time, so if you use the standby server for queries (see Hot Standby),
954 there is a delay between an action in the master and when the
955 action becomes visible in the standby, corresponding the time it takes
956 to fill up the WAL file. <varname>archive_timeout</> can be used to make that delay
957 shorter. Also note that you can't combine streaming replication with
958 this method.
959 </para>
961 <para>
962 The operations that occur on both primary and standby servers are
963 normal continuous archiving and recovery tasks. The only point of
964 contact between the two database servers is the archive of WAL files
965 that both share: primary writing to the archive, standby reading from
966 the archive. Care must be taken to ensure that WAL archives from separate
967 primary servers do not become mixed together or confused. The archive
968 need not be large if it is only required for standby operation.
969 </para>
971 <para>
972 The magic that makes the two loosely coupled servers work together is
973 simply a <varname>restore_command</> used on the standby that,
974 when asked for the next WAL file, waits for it to become available from
975 the primary. The <varname>restore_command</> is specified in the
976 <filename>recovery.conf</> file on the standby server. Normal recovery
977 processing would request a file from the WAL archive, reporting failure
978 if the file was unavailable. For standby processing it is normal for
979 the next WAL file to be unavailable, so the standby must wait for
980 it to appear. For files ending in <literal>.backup</> or
981 <literal>.history</> there is no need to wait, and a non-zero return
982 code must be returned. A waiting <varname>restore_command</> can be
983 written as a custom script that loops after polling for the existence of
984 the next WAL file. There must also be some way to trigger failover, which
985 should interrupt the <varname>restore_command</>, break the loop and
986 return a file-not-found error to the standby server. This ends recovery
987 and the standby will then come up as a normal server.
988 </para>
990 <para>
991 Pseudocode for a suitable <varname>restore_command</> is:
992 <programlisting>
993 triggered = false;
994 while (!NextWALFileReady() &amp;&amp; !triggered)
996 sleep(100000L); /* wait for ~0.1 sec */
997 if (CheckForExternalTrigger())
998 triggered = true;
1000 if (!triggered)
1001 CopyWALFileForRecovery();
1002 </programlisting>
1003 </para>
1005 <para>
1006 A working example of a waiting <varname>restore_command</> is provided
1007 as a <filename>contrib</> module named <application>pg_standby</>. It
1008 should be used as a reference on how to correctly implement the logic
1009 described above. It can also be extended as needed to support specific
1010 configurations and environments.
1011 </para>
1013 <para>
1014 The method for triggering failover is an important part of planning
1015 and design. One potential option is the <varname>restore_command</>
1016 command. It is executed once for each WAL file, but the process
1017 running the <varname>restore_command</> is created and dies for
1018 each file, so there is no daemon or server process, and
1019 signals or a signal handler cannot be used. Therefore, the
1020 <varname>restore_command</> is not suitable to trigger failover.
1021 It is possible to use a simple timeout facility, especially if
1022 used in conjunction with a known <varname>archive_timeout</>
1023 setting on the primary. However, this is somewhat error prone
1024 since a network problem or busy primary server might be sufficient
1025 to initiate failover. A notification mechanism such as the explicit
1026 creation of a trigger file is ideal, if this can be arranged.
1027 </para>
1029 <sect2 id="warm-standby-config">
1030 <title>Implementation</title>
1032 <para>
1033 The short procedure for configuring a standby server using this alternative
1034 method is as follows. For
1035 full details of each step, refer to previous sections as noted.
1036 <orderedlist>
1037 <listitem>
1038 <para>
1039 Set up primary and standby systems as nearly identical as
1040 possible, including two identical copies of
1041 <productname>PostgreSQL</> at the same release level.
1042 </para>
1043 </listitem>
1044 <listitem>
1045 <para>
1046 Set up continuous archiving from the primary to a WAL archive
1047 directory on the standby server. Ensure that
1048 <xref linkend="guc-archive-mode">,
1049 <xref linkend="guc-archive-command"> and
1050 <xref linkend="guc-archive-timeout">
1051 are set appropriately on the primary
1052 (see <xref linkend="backup-archiving-wal">).
1053 </para>
1054 </listitem>
1055 <listitem>
1056 <para>
1057 Make a base backup of the primary server (see <xref
1058 linkend="backup-base-backup">), and load this data onto the standby.
1059 </para>
1060 </listitem>
1061 <listitem>
1062 <para>
1063 Begin recovery on the standby server from the local WAL
1064 archive, using a <filename>recovery.conf</> that specifies a
1065 <varname>restore_command</> that waits as described
1066 previously (see <xref linkend="backup-pitr-recovery">).
1067 </para>
1068 </listitem>
1069 </orderedlist>
1070 </para>
1072 <para>
1073 Recovery treats the WAL archive as read-only, so once a WAL file has
1074 been copied to the standby system it can be copied to tape at the same
1075 time as it is being read by the standby database server.
1076 Thus, running a standby server for high availability can be performed at
1077 the same time as files are stored for longer term disaster recovery
1078 purposes.
1079 </para>
1081 <para>
1082 For testing purposes, it is possible to run both primary and standby
1083 servers on the same system. This does not provide any worthwhile
1084 improvement in server robustness, nor would it be described as HA.
1085 </para>
1086 </sect2>
1088 <sect2 id="warm-standby-record">
1089 <title>Record-based Log Shipping</title>
1091 <para>
1092 It is also possible to implement record-based log shipping using this
1093 alternative method, though this requires custom development, and changes
1094 will still only become visible to hot standby queries after a full WAL
1095 file has been shipped.
1096 </para>
1098 <para>
1099 An external program can call the <function>pg_xlogfile_name_offset()</>
1100 function (see <xref linkend="functions-admin">)
1101 to find out the file name and the exact byte offset within it of
1102 the current end of WAL. It can then access the WAL file directly
1103 and copy the data from the last known end of WAL through the current end
1104 over to the standby servers. With this approach, the window for data
1105 loss is the polling cycle time of the copying program, which can be very
1106 small, and there is no wasted bandwidth from forcing partially-used
1107 segment files to be archived. Note that the standby servers'
1108 <varname>restore_command</> scripts can only deal with whole WAL files,
1109 so the incrementally copied data is not ordinarily made available to
1110 the standby servers. It is of use only when the primary dies &mdash;
1111 then the last partial WAL file is fed to the standby before allowing
1112 it to come up. The correct implementation of this process requires
1113 cooperation of the <varname>restore_command</> script with the data
1114 copying program.
1115 </para>
1117 <para>
1118 Starting with <productname>PostgreSQL</> version 9.0, you can use
1119 streaming replication (see <xref linkend="streaming-replication">) to
1120 achieve the same benefits with less effort.
1121 </para>
1122 </sect2>
1123 </sect1>
1125 <sect1 id="hot-standby">
1126 <title>Hot Standby</title>
1128 <indexterm zone="high-availability">
1129 <primary>Hot Standby</primary>
1130 </indexterm>
1132 <para>
1133 Hot Standby is the term used to describe the ability to connect to
1134 the server and run read-only queries while the server is in archive
1135 recovery. This
1136 is useful for both log shipping replication and for restoring a backup
1137 to an exact state with great precision.
1138 The term Hot Standby also refers to the ability of the server to move
1139 from recovery through to normal operation while users continue running
1140 queries and/or keep their connections open.
1141 </para>
1143 <para>
1144 Running queries in recovery mode is similar to normal query operation,
1145 though there are several usage and administrative differences
1146 noted below.
1147 </para>
1149 <sect2 id="hot-standby-users">
1150 <title>User's Overview</title>
1152 <para>
1153 When the <xref linkend="guc-hot-standby"> parameter is set to true on a
1154 standby server, it will begin accepting connections once the recovery has
1155 brought the system to a consistent state. All such connections are
1156 strictly read-only; not even temporary tables may be written.
1157 </para>
1159 <para>
1160 The data on the standby takes some time to arrive from the primary server
1161 so there will be a measurable delay between primary and standby. Running the
1162 same query nearly simultaneously on both primary and standby might therefore
1163 return differing results. We say that data on the standby is
1164 <firstterm>eventually consistent</firstterm> with the primary. Once the
1165 commit record for a transaction is replayed on the standby, the changes
1166 made by that transaction will be visible to any new snapshots taken on
1167 the standby. Snapshots may be taken at the start of each query or at the
1168 start of each transaction, depending on the current transaction isolation
1169 level. For more details, see <xref linkend="transaction-iso">.
1170 </para>
1172 <para>
1173 Transactions started during recovery may issue the following commands:
1175 <itemizedlist>
1176 <listitem>
1177 <para>
1178 Query access - <command>SELECT</>, <command>COPY TO</>
1179 </para>
1180 </listitem>
1181 <listitem>
1182 <para>
1183 Cursor commands - <command>DECLARE</>, <command>FETCH</>, <command>CLOSE</>
1184 </para>
1185 </listitem>
1186 <listitem>
1187 <para>
1188 Parameters - <command>SHOW</>, <command>SET</>, <command>RESET</>
1189 </para>
1190 </listitem>
1191 <listitem>
1192 <para>
1193 Transaction management commands
1194 <itemizedlist>
1195 <listitem>
1196 <para>
1197 <command>BEGIN</>, <command>END</>, <command>ABORT</>, <command>START TRANSACTION</>
1198 </para>
1199 </listitem>
1200 <listitem>
1201 <para>
1202 <command>SAVEPOINT</>, <command>RELEASE</>, <command>ROLLBACK TO SAVEPOINT</>
1203 </para>
1204 </listitem>
1205 <listitem>
1206 <para>
1207 <command>EXCEPTION</> blocks and other internal subtransactions
1208 </para>
1209 </listitem>
1210 </itemizedlist>
1211 </para>
1212 </listitem>
1213 <listitem>
1214 <para>
1215 <command>LOCK TABLE</>, though only when explicitly in one of these modes:
1216 <literal>ACCESS SHARE</>, <literal>ROW SHARE</> or <literal>ROW EXCLUSIVE</>.
1217 </para>
1218 </listitem>
1219 <listitem>
1220 <para>
1221 Plans and resources - <command>PREPARE</>, <command>EXECUTE</>,
1222 <command>DEALLOCATE</>, <command>DISCARD</>
1223 </para>
1224 </listitem>
1225 <listitem>
1226 <para>
1227 Plugins and extensions - <command>LOAD</>
1228 </para>
1229 </listitem>
1230 </itemizedlist>
1231 </para>
1233 <para>
1234 Transactions started during recovery may never be assigned a transaction ID
1235 and may not write to the system write-ahead log. Therefore, the following
1236 actions will produce error messages:
1238 <itemizedlist>
1239 <listitem>
1240 <para>
1241 Data Manipulation Language (DML) - <command>INSERT</>,
1242 <command>UPDATE</>, <command>DELETE</>, <command>COPY FROM</>,
1243 <command>TRUNCATE</>.
1244 Note that there are no allowed actions that result in a trigger
1245 being executed during recovery. This restriction applies even to
1246 temporary tables, because table rows cannot be read or written without
1247 assigning a transaction ID, which is currently not possible in a
1248 Hot Standby environment.
1249 </para>
1250 </listitem>
1251 <listitem>
1252 <para>
1253 Data Definition Language (DDL) - <command>CREATE</>,
1254 <command>DROP</>, <command>ALTER</>, <command>COMMENT</>.
1255 This restriction applies even to temporary tables, because carrying
1256 out these operations would require updating the system catalog tables.
1257 </para>
1258 </listitem>
1259 <listitem>
1260 <para>
1261 <command>SELECT ... FOR SHARE | UPDATE</>, because row locks cannot be
1262 taken without updating the underlying data files.
1263 </para>
1264 </listitem>
1265 <listitem>
1266 <para>
1267 Rules on <command>SELECT</> statements that generate DML commands.
1268 </para>
1269 </listitem>
1270 <listitem>
1271 <para>
1272 <command>LOCK</> that explicitly requests a mode higher than <literal>ROW EXCLUSIVE MODE</>.
1273 </para>
1274 </listitem>
1275 <listitem>
1276 <para>
1277 <command>LOCK</> in short default form, since it requests <literal>ACCESS EXCLUSIVE MODE</>.
1278 </para>
1279 </listitem>
1280 <listitem>
1281 <para>
1282 Transaction management commands that explicitly set non-read-only state:
1283 <itemizedlist>
1284 <listitem>
1285 <para>
1286 <command>BEGIN READ WRITE</>,
1287 <command>START TRANSACTION READ WRITE</>
1288 </para>
1289 </listitem>
1290 <listitem>
1291 <para>
1292 <command>SET TRANSACTION READ WRITE</>,
1293 <command>SET SESSION CHARACTERISTICS AS TRANSACTION READ WRITE</>
1294 </para>
1295 </listitem>
1296 <listitem>
1297 <para>
1298 <command>SET transaction_read_only = off</>
1299 </para>
1300 </listitem>
1301 </itemizedlist>
1302 </para>
1303 </listitem>
1304 <listitem>
1305 <para>
1306 Two-phase commit commands - <command>PREPARE TRANSACTION</>,
1307 <command>COMMIT PREPARED</>, <command>ROLLBACK PREPARED</>
1308 because even read-only transactions need to write WAL in the
1309 prepare phase (the first phase of two phase commit).
1310 </para>
1311 </listitem>
1312 <listitem>
1313 <para>
1314 Sequence updates - <function>nextval()</>, <function>setval()</>
1315 </para>
1316 </listitem>
1317 <listitem>
1318 <para>
1319 <command>LISTEN</>, <command>UNLISTEN</>, <command>NOTIFY</>
1320 </para>
1321 </listitem>
1322 </itemizedlist>
1323 </para>
1325 <para>
1326 Outside of recovery, read-only transactions are allowed to update sequences
1327 and to use <command>LISTEN</>, <command>UNLISTEN</>, and
1328 <command>NOTIFY</>, so Hot Standby sessions operate under slightly tighter
1329 restrictions than ordinary read-only sessions. It is possible that some
1330 of these restrictions might be loosened in a future release.
1331 </para>
1333 <para>
1334 During recovery, the parameter <varname>transaction_read_only</> is always
1335 true and may not be changed. But as long as no attempt is made to modify
1336 the database, connections during recovery will act much like any other
1337 database connection. If failover or switchover occurs, the database will
1338 switch to normal processing mode. Sessions will remain connected while the
1339 server changes mode. Once recovery finishes, it will be possible to
1340 initiate read-write transactions (even from a session begun during
1341 recovery).
1342 </para>
1344 <para>
1345 Users will be able to tell whether their session is read-only by
1346 issuing <command>SHOW transaction_read_only</>. In addition, a set of
1347 functions (<xref linkend="functions-recovery-info-table">) allow users to
1348 access information about the standby server. These allow you to write
1349 programs that are aware of the current state of the database. These
1350 can be used to monitor the progress of recovery, or to allow you to
1351 write complex programs that restore the database to particular states.
1352 </para>
1354 <para>
1355 In general, queries will not experience lock conflicts from the database
1356 changes made by recovery. This is because recovery follows normal
1357 concurrency control mechanisms, known as <acronym>MVCC</>. There are
1358 some types of change that will cause conflicts, covered in the following
1359 section.
1360 </para>
1361 </sect2>
1363 <sect2 id="hot-standby-conflict">
1364 <title>Handling query conflicts</title>
1366 <para>
1367 The primary and standby nodes are in many ways loosely connected. Actions
1368 on the primary will have an effect on the standby. As a result, there is
1369 potential for negative interactions or conflicts between them. The easiest
1370 conflict to understand is performance: if a huge data load is taking place
1371 on the primary then this will generate a similar stream of WAL records on the
1372 standby, so standby queries may contend for system resources, such as I/O.
1373 </para>
1375 <para>
1376 There are also additional types of conflict that can occur with Hot Standby.
1377 These conflicts are <emphasis>hard conflicts</> in the sense that queries
1378 might need to be cancelled and, in some cases, sessions disconnected to resolve them.
1379 The user is provided with several ways to handle these
1380 conflicts. Conflicts can be caused by:
1382 <itemizedlist>
1383 <listitem>
1384 <para>
1385 Access Exclusive Locks from primary node, including both explicit
1386 <command>LOCK</> commands and various <acronym>DDL</> actions
1387 </para>
1388 </listitem>
1389 <listitem>
1390 <para>
1391 Dropping tablespaces on the primary while standby queries are using
1392 those tablespaces for temporary work files (<varname>work_mem</> overflow)
1393 </para>
1394 </listitem>
1395 <listitem>
1396 <para>
1397 Dropping databases on the primary while users are connected to that
1398 database on the standby.
1399 </para>
1400 </listitem>
1401 <listitem>
1402 <para>
1403 The standby waiting longer than <varname>max_standby_delay</>
1404 to acquire a buffer cleanup lock.
1405 </para>
1406 </listitem>
1407 <listitem>
1408 <para>
1409 Early cleanup of data still visible to the current query's snapshot.
1410 </para>
1411 </listitem>
1412 </itemizedlist>
1413 </para>
1415 <para>
1416 Some WAL redo actions will be for <acronym>DDL</> execution. These DDL
1417 actions are replaying changes that have already committed on the primary
1418 node, so they must not fail on the standby node. These DDL locks take
1419 priority and will automatically <emphasis>cancel</> any read-only
1420 transactions that get in their way, after a grace period. This is similar
1421 to the possibility of being canceled by the deadlock detector. But in this
1422 case, the standby recovery process always wins, since the replayed actions
1423 must not fail. This also ensures that replication does not fall behind
1424 while waiting for a query to complete. This prioritization presumes that
1425 the standby exists primarily for high availability, and that adjusting the
1426 grace period will allow a sufficient guard against unexpected cancellation.
1427 </para>
1429 <para>
1430 An example of the above would be an administrator on the primary server
1431 running <command>DROP TABLE</> on a table that is currently being queried
1432 on the standby server.
1433 Clearly the query cannot continue if <command>DROP TABLE</>
1434 proceeds. If this situation occurred on the primary, the <command>DROP TABLE</>
1435 would wait until the query had finished. When <command>DROP TABLE</> is
1436 run on the primary, the primary doesn't have
1437 information about which queries are running on the standby, so it
1438 cannot wait for any of the standby queries. The WAL change records come through to the
1439 standby while the standby query is still running, causing a conflict.
1440 </para>
1442 <para>
1443 The most common reason for conflict between standby queries and WAL redo is
1444 "early cleanup". Normally, <productname>PostgreSQL</> allows cleanup of old
1445 row versions when there are no users who need to see them to ensure correct
1446 visibility of data (the heart of MVCC). If there is a standby query that has
1447 been running for longer than any query on the primary then it is possible
1448 for old row versions to be removed by either a vacuum or HOT. This will
1449 then generate WAL records that, if applied, would remove data on the
1450 standby that might <emphasis>potentially</> be required by the standby query.
1451 In more technical language, the primary's xmin horizon is later than
1452 the standby's xmin horizon, allowing dead rows to be removed.
1453 </para>
1455 <para>
1456 Experienced users should note that both row version cleanup and row version
1457 freezing will potentially conflict with recovery queries. Running a
1458 manual <command>VACUUM FREEZE</> is likely to cause conflicts even on tables
1459 with no updated or deleted rows.
1460 </para>
1462 <para>
1463 There are a number of choices for resolving query conflicts. The default
1464 is to wait and hope the query finishes. The server will wait
1465 automatically until the lag between primary and standby is at most
1466 <xref linkend="guc-max-standby-delay"> (30 seconds by default).
1467 Once that grace period expires,
1468 one of the following actions is taken:
1470 <itemizedlist>
1471 <listitem>
1472 <para>
1473 If the conflict is caused by a lock, the conflicting standby
1474 transaction is cancelled immediately. If the transaction is
1475 idle-in-transaction, then the session is aborted instead.
1476 This behavior might change in the future.
1477 </para>
1478 </listitem>
1480 <listitem>
1481 <para>
1482 If the conflict is caused by cleanup records, the standby query is informed
1483 a conflict has occurred and that it must cancel itself to avoid the
1484 risk that it silently fails to read relevant data because
1485 that data has been removed. Some cleanup
1486 records only conflict with older queries, while others
1487 can affect all queries.
1488 </para>
1490 <para>
1491 Cancelled queries may be retried immediately (after beginning a new
1492 transaction, of course). Since query cancellation depends on
1493 the nature of the WAL records being replayed, a query that was
1494 cancelled may succeed if it is executed again.
1495 </para>
1496 </listitem>
1497 </itemizedlist>
1498 </para>
1500 <para>
1501 Keep in mind that <varname>max_standby_delay</> is compared to the
1502 difference between the standby server's clock and the transaction
1503 commit timestamps read from the WAL log. Thus, the grace period
1504 allowed to any one query on the standby is never more than
1505 <varname>max_standby_delay</>, and could be considerably less if the
1506 standby has already fallen behind as a result of waiting for previous
1507 queries to complete, or as a result of being unable to keep up with a
1508 heavy update load.
1509 </para>
1511 <caution>
1512 <para>
1513 Be sure that the primary and standby servers' clocks are kept in sync;
1514 otherwise the values compared to <varname>max_standby_delay</> will be
1515 erroneous, possibly leading to additional query cancellations.
1516 If the clocks are intentionally not in sync, or if there is a large
1517 propagation delay from primary to standby, it is advisable to set
1518 <varname>max_standby_delay</> to -1. In any case the value should be
1519 larger than the largest expected clock skew between primary and standby.
1520 </para>
1521 </caution>
1523 <para>
1524 Users should be clear that tables that are regularly and heavily updated on the
1525 primary server will quickly cause cancellation of longer running queries on
1526 the standby. In those cases <varname>max_standby_delay</> can be
1527 considered similar to setting
1528 <varname>statement_timeout</>.
1529 </para>
1531 <para>
1532 Other remedial actions exist if the number of cancellations is unacceptable.
1533 The first option is to connect to the primary server and keep a query active
1534 for as long as needed to run queries on the standby. This guarantees that
1535 a WAL cleanup record is never generated and query conflicts do not occur,
1536 as described above. This could be done using <filename>contrib/dblink</>
1537 and <function>pg_sleep()</>, or via other mechanisms. If you do this, you
1538 should note that this will delay cleanup of dead rows on the primary by
1539 vacuum or HOT, which may be undesirable. However, remember
1540 that the primary and standby nodes are linked via the WAL, so the cleanup
1541 situation is no different from the case where the query ran on the primary
1542 node itself, and you are still getting the benefit of off-loading the
1543 execution onto the standby. <varname>max_standby_delay</> should
1544 not be used in this case because delayed WAL files might already
1545 contain entries that invalidate the current snapshot.
1546 </para>
1548 <para>
1549 It is also possible to set <varname>vacuum_defer_cleanup_age</> on the primary
1550 to defer the cleanup of records by autovacuum, <command>VACUUM</>
1551 and HOT. This might allow
1552 more time for queries to execute before they are cancelled on the standby,
1553 without the need for setting a high <varname>max_standby_delay</>.
1554 </para>
1556 <para>
1557 Three-way deadlocks are possible between <literal>AccessExclusiveLocks</> arriving from
1558 the primary, cleanup WAL records that require buffer cleanup locks, and
1559 user requests that are waiting behind replayed <literal>AccessExclusiveLocks</>.
1560 Deadlocks are resolved automatically after <varname>deadlock_timeout</>
1561 seconds, though they are thought to be rare in practice.
1562 </para>
1564 <para>
1565 Dropping tablespaces or databases is discussed in the administrator's
1566 section since they are not typical user situations.
1567 </para>
1568 </sect2>
1570 <sect2 id="hot-standby-admin">
1571 <title>Administrator's Overview</title>
1573 <para>
1574 If <varname>hot_standby</> is turned <literal>on</> in
1575 <filename>postgresql.conf</> and there is a <filename>recovery.conf</>
1576 file present, the server will run in Hot Standby mode.
1577 However, it may take some time for Hot Standby connections to be allowed,
1578 because the server will not accept connections until it has completed
1579 sufficient recovery to provide a consistent state against which queries
1580 can run. During this period,
1581 clients that attempt to connect will be refused with an error message.
1582 To confirm the server has come up, either loop trying to connect from
1583 the application, or look for these messages in the server logs:
1585 <programlisting>
1586 LOG: entering standby mode
1588 ... then some time later ...
1590 LOG: consistent recovery state reached
1591 LOG: database system is ready to accept read only connections
1592 </programlisting>
1594 Consistency information is recorded once per checkpoint on the primary.
1595 It is not possible to enable hot standby when reading WAL
1596 written during a period when <varname>wal_level</> was not set to
1597 <literal>hot_standby</> on the primary. Reaching a consistent state can
1598 also be delayed in the presence of both of these conditions:
1600 <itemizedlist>
1601 <listitem>
1602 <para>
1603 A write transaction has more than 64 subtransactions
1604 </para>
1605 </listitem>
1606 <listitem>
1607 <para>
1608 Very long-lived write transactions
1609 </para>
1610 </listitem>
1611 </itemizedlist>
1613 If you are running file-based log shipping ("warm standby"), you might need
1614 to wait until the next WAL file arrives, which could be as long as the
1615 <varname>archive_timeout</> setting on the primary.
1616 </para>
1618 <para>
1619 The setting of some parameters on the standby will need reconfiguration
1620 if they have been changed on the primary. For these parameters,
1621 the value on the standby must
1622 be equal to or greater than the value on the primary. If these parameters
1623 are not set high enough then the standby will refuse to start.
1624 Higher values can then be supplied and the server
1625 restarted to begin recovery again. These parameters are:
1627 <itemizedlist>
1628 <listitem>
1629 <para>
1630 <varname>max_connections</>
1631 </para>
1632 </listitem>
1633 <listitem>
1634 <para>
1635 <varname>max_prepared_transactions</>
1636 </para>
1637 </listitem>
1638 <listitem>
1639 <para>
1640 <varname>max_locks_per_transaction</>
1641 </para>
1642 </listitem>
1643 </itemizedlist>
1644 </para>
1646 <para>
1647 It is important that the administrator consider the appropriate setting
1648 of <varname>max_standby_delay</>,
1649 which can be set in <filename>postgresql.conf</>.
1650 There is no optimal setting, so it should be set according to business
1651 priorities. For example if the server is primarily tasked as a High
1652 Availability server, then you may wish to lower
1653 <varname>max_standby_delay</> or even set it to zero, though that is a
1654 very aggressive setting. If the standby server is tasked as an additional
1655 server for decision support queries then it might be acceptable to set this
1656 to a value of many hours. It is also possible to set
1657 <varname>max_standby_delay</> to -1 which means wait forever for queries
1658 to complete; this will be useful when performing
1659 an archive recovery from a backup.
1660 </para>
1662 <para>
1663 Transaction status "hint bits" written on the primary are not WAL-logged,
1664 so data on the standby will likely re-write the hints again on the standby.
1665 Thus, the standby server will still perform disk writes even though
1666 all users are read-only; no changes occur to the data values
1667 themselves. Users will still write large sort temporary files and
1668 re-generate relcache info files, so no part of the database
1669 is truly read-only during hot standby mode.
1670 Note also that writes to remote databases using
1671 <application>dblink</application> module, and other operations outside the
1672 database using PL functions will still be possible, even though the
1673 transaction is read-only locally.
1674 </para>
1676 <para>
1677 The following types of administration commands are not accepted
1678 during recovery mode:
1680 <itemizedlist>
1681 <listitem>
1682 <para>
1683 Data Definition Language (DDL) - e.g. <command>CREATE INDEX</>
1684 </para>
1685 </listitem>
1686 <listitem>
1687 <para>
1688 Privilege and Ownership - <command>GRANT</>, <command>REVOKE</>,
1689 <command>REASSIGN</>
1690 </para>
1691 </listitem>
1692 <listitem>
1693 <para>
1694 Maintenance commands - <command>ANALYZE</>, <command>VACUUM</>,
1695 <command>CLUSTER</>, <command>REINDEX</>
1696 </para>
1697 </listitem>
1698 </itemizedlist>
1699 </para>
1701 <para>
1702 Again, note that some of these commands are actually allowed during
1703 "read only" mode transactions on the primary.
1704 </para>
1706 <para>
1707 As a result, you cannot create additional indexes that exist solely
1708 on the standby, nor statistics that exist solely on the standby.
1709 If these administration commands are needed, they should be executed
1710 on the primary, and eventually those changes will propagate to the
1711 standby.
1712 </para>
1714 <para>
1715 <function>pg_cancel_backend()</> will work on user backends, but not the
1716 Startup process, which performs recovery. <structname>pg_stat_activity</structname> does not
1717 show an entry for the Startup process, nor do recovering transactions
1718 show as active. As a result, <structname>pg_prepared_xacts</structname> is always empty during
1719 recovery. If you wish to resolve in-doubt prepared transactions,
1720 view <literal>pg_prepared_xacts</> on the primary and issue commands to
1721 resolve transactions there.
1722 </para>
1724 <para>
1725 <structname>pg_locks</structname> will show locks held by backends,
1726 as normal. <structname>pg_locks</structname> also shows
1727 a virtual transaction managed by the Startup process that owns all
1728 <literal>AccessExclusiveLocks</> held by transactions being replayed by recovery.
1729 Note that the Startup process does not acquire locks to
1730 make database changes, and thus locks other than <literal>AccessExclusiveLocks</>
1731 do not show in <structname>pg_locks</structname> for the Startup
1732 process; they are just presumed to exist.
1733 </para>
1735 <para>
1736 The <productname>Nagios</> plugin <productname>check_pgsql</> will
1737 work, because the simple information it checks for exists.
1738 The <productname>check_postgres</> monitoring script will also work,
1739 though some reported values could give different or confusing results.
1740 For example, last vacuum time will not be maintained, since no
1741 vacuum occurs on the standby. Vacuums running on the primary
1742 do still send their changes to the standby.
1743 </para>
1745 <para>
1746 WAL file control commands will not work during recovery,
1747 e.g. <function>pg_start_backup</>, <function>pg_switch_xlog</> etc.
1748 </para>
1750 <para>
1751 Dynamically loadable modules work, including <structname>pg_stat_statements</>.
1752 </para>
1754 <para>
1755 Advisory locks work normally in recovery, including deadlock detection.
1756 Note that advisory locks are never WAL logged, so it is impossible for
1757 an advisory lock on either the primary or the standby to conflict with WAL
1758 replay. Nor is it possible to acquire an advisory lock on the primary
1759 and have it initiate a similar advisory lock on the standby. Advisory
1760 locks relate only to the server on which they are acquired.
1761 </para>
1763 <para>
1764 Trigger-based replication systems such as <productname>Slony</>,
1765 <productname>Londiste</> and <productname>Bucardo</> won't run on the
1766 standby at all, though they will run happily on the primary server as
1767 long as the changes are not sent to standby servers to be applied.
1768 WAL replay is not trigger-based so you cannot relay from the
1769 standby to any system that requires additional database writes or
1770 relies on the use of triggers.
1771 </para>
1773 <para>
1774 New oids cannot be assigned, though some <acronym>UUID</> generators may still
1775 work as long as they do not rely on writing new status to the database.
1776 </para>
1778 <para>
1779 Currently, temporary table creation is not allowed during read only
1780 transactions, so in some cases existing scripts will not run correctly.
1781 This restriction might be relaxed in a later release. This is
1782 both a SQL Standard compliance issue and a technical issue.
1783 </para>
1785 <para>
1786 <command>DROP TABLESPACE</> can only succeed if the tablespace is empty.
1787 Some standby users may be actively using the tablespace via their
1788 <varname>temp_tablespaces</> parameter. If there are temporary files in the
1789 tablespace, all active queries are cancelled to ensure that temporary
1790 files are removed, so the tablespace can be removed and WAL replay
1791 can continue.
1792 </para>
1794 <para>
1795 Running <command>DROP DATABASE</>, <command>ALTER DATABASE ... SET TABLESPACE</>,
1796 or <command>ALTER DATABASE ... RENAME</> on primary will generate a log message
1797 that will cause all users connected to that database on the standby to be
1798 forcibly disconnected. This action occurs immediately, whatever the setting of
1799 <varname>max_standby_delay</>.
1800 </para>
1802 <para>
1803 In normal (non-recovery) mode, if you issue <command>DROP USER</> or <command>DROP ROLE</>
1804 for a role with login capability while that user is still connected then
1805 nothing happens to the connected user - they remain connected. The user cannot
1806 reconnect however. This behavior applies in recovery also, so a
1807 <command>DROP USER</> on the primary does not disconnect that user on the standby.
1808 </para>
1810 <para>
1811 The statistics collector is active during recovery. All scans, reads, blocks,
1812 index usage, etc., will be recorded normally on the standby. Replayed
1813 actions will not duplicate their effects on primary, so replaying an
1814 insert will not increment the Inserts column of pg_stat_user_tables.
1815 The stats file is deleted at the start of recovery, so stats from primary
1816 and standby will differ; this is considered a feature, not a bug.
1817 </para>
1819 <para>
1820 Autovacuum is not active during recovery, it will start normally at the
1821 end of recovery.
1822 </para>
1824 <para>
1825 The background writer is active during recovery and will perform
1826 restartpoints (similar to checkpoints on the primary) and normal block
1827 cleaning activities. This can include updates of the hint bit
1828 information stored on the standby server.
1829 The <command>CHECKPOINT</> command is accepted during recovery,
1830 though it performs a restartpoint rather than a new checkpoint.
1831 </para>
1832 </sect2>
1834 <sect2 id="hot-standby-parameters">
1835 <title>Hot Standby Parameter Reference</title>
1837 <para>
1838 Various parameters have been mentioned above in
1839 <xref linkend="hot-standby-admin">
1840 and <xref linkend="hot-standby-conflict">.
1841 </para>
1843 <para>
1844 On the primary, parameters <xref linkend="guc-wal-level"> and
1845 <xref linkend="guc-vacuum-defer-cleanup-age"> can be used.
1846 <xref linkend="guc-max-standby-delay"> has no effect if set on the primary.
1847 </para>
1849 <para>
1850 On the standby, parameters <xref linkend="guc-hot-standby"> and
1851 <xref linkend="guc-max-standby-delay"> can be used.
1852 <xref linkend="guc-vacuum-defer-cleanup-age"> has no effect during
1853 recovery.
1854 </para>
1855 </sect2>
1857 <sect2 id="hot-standby-caveats">
1858 <title>Caveats</title>
1860 <para>
1861 There are several limitations of Hot Standby.
1862 These can and probably will be fixed in future releases:
1864 <itemizedlist>
1865 <listitem>
1866 <para>
1867 Operations on hash indexes are not presently WAL-logged, so
1868 replay will not update these indexes. Hash indexes will not be
1869 used for query plans during recovery.
1870 </para>
1871 </listitem>
1872 <listitem>
1873 <para>
1874 Full knowledge of running transactions is required before snapshots
1875 can be taken. Transactions that use large numbers of subtransactions
1876 (currently greater than 64) will delay the start of read only
1877 connections until the completion of the longest running write transaction.
1878 If this situation occurs, explanatory messages will be sent to the server log.
1879 </para>
1880 </listitem>
1881 <listitem>
1882 <para>
1883 Valid starting points for standby queries are generated at each
1884 checkpoint on the master. If the standby is shut down while the master
1885 is in a shutdown state, it might not be possible to re-enter Hot Standby
1886 until the primary is started up, so that it generates further starting
1887 points in the WAL logs. This situation isn't a problem in the most
1888 common situations where it might happen. Generally, if the primary is
1889 shut down and not available anymore, that's likely due to a serious
1890 failure that requires the standby being converted to operate as
1891 the new primary anyway. And in situations where the primary is
1892 being intentionally taken down, coordinating to make sure the standby
1893 becomes the new primary smoothly is also standard procedure.
1894 </para>
1895 </listitem>
1896 <listitem>
1897 <para>
1898 At the end of recovery, <literal>AccessExclusiveLocks</> held by prepared transactions
1899 will require twice the normal number of lock table entries. If you plan
1900 on running either a large number of concurrent prepared transactions
1901 that normally take <literal>AccessExclusiveLocks</>, or you plan on having one
1902 large transaction that takes many <literal>AccessExclusiveLocks</>, you are
1903 advised to select a larger value of <varname>max_locks_per_transaction</>,
1904 perhaps as much as twice the value of the parameter on
1905 the primary server. You need not consider this at all if
1906 your setting of <varname>max_prepared_transactions</> is <literal>0</>.
1907 </para>
1908 </listitem>
1909 </itemizedlist>
1911 </para>
1912 </sect2>
1914 </sect1>
1916 <sect1 id="backup-incremental-updated">
1917 <title>Incrementally Updated Backups</title>
1919 <indexterm zone="high-availability">
1920 <primary>incrementally updated backups</primary>
1921 </indexterm>
1923 <indexterm zone="high-availability">
1924 <primary>change accumulation</primary>
1925 </indexterm>
1927 <para>
1928 In a standby configuration, it is possible to offload the expense of
1929 taking periodic base backups from the primary server; instead base backups
1930 can be made by backing
1931 up a standby server's files. This concept is generally known as
1932 incrementally updated backups, log change accumulation, or more simply,
1933 change accumulation.
1934 </para>
1936 <para>
1937 If we take a file system backup of the standby server's data
1938 directory while it is processing
1939 logs shipped from the primary, we will be able to reload that backup and
1940 restart the standby's recovery process from the last restart point.
1941 We no longer need to keep WAL files from before the standby's restart point.
1942 If recovery is needed, it will be faster to recover from the incrementally
1943 updated backup than from the original base backup.
1944 </para>
1946 <para>
1947 The procedure for taking a file system backup of the standby server's
1948 data directory while it's processing logs shipped from the primary is:
1949 <orderedlist>
1950 <listitem>
1951 <para>
1952 Perform the backup, without using <function>pg_start_backup</> and
1953 <function>pg_stop_backup</>. Note that the <filename>pg_control</>
1954 file must be backed up <emphasis>first</>, as in:
1955 <programlisting>
1956 cp /var/lib/pgsql/data/global/pg_control /tmp
1957 cp -r /var/lib/pgsql/data /path/to/backup
1958 mv /tmp/pg_control /path/to/backup/data/global
1959 </programlisting>
1960 <filename>pg_control</> contains the location where WAL replay will
1961 begin after restoring from the backup; backing it up first ensures
1962 that it points to the last restartpoint when the backup started, not
1963 some later restartpoint that happened while files were copied to the
1964 backup.
1965 </para>
1966 </listitem>
1967 <listitem>
1968 <para>
1969 Make note of the backup ending WAL location by calling the <function>
1970 pg_last_xlog_replay_location</> function at the end of the backup,
1971 and keep it with the backup.
1972 <programlisting>
1973 psql -c "select pg_last_xlog_replay_location();" > /path/to/backup/end_location
1974 </programlisting>
1975 When recovering from the incrementally updated backup, the server
1976 can begin accepting connections and complete the recovery successfully
1977 before the database has become consistent. To avoid that, you must
1978 ensure the database is consistent before users try to connect to the
1979 server and when the recovery ends. You can do that by comparing the
1980 progress of the recovery with the stored backup ending WAL location:
1981 the server is not consistent until recovery has reached the backup end
1982 location. The progress of the recovery can also be observed with the
1983 <function>pg_last_xlog_replay_location</> function, but that required
1984 connecting to the server while it might not be consistent yet, so
1985 care should be taken with that method.
1986 </para>
1987 <para>
1988 </para>
1989 </listitem>
1990 </orderedlist>
1991 </para>
1993 <para>
1994 Since the standby server is not <quote>live</>, it is not possible to
1995 use <function>pg_start_backup()</> and <function>pg_stop_backup()</>
1996 to manage the backup process; it will be up to you to determine how
1997 far back you need to keep WAL segment files to have a recoverable
1998 backup. That is determined by the last restartpoint when the backup
1999 was taken, any WAL older than that can be deleted from the archive
2000 once the backup is complete. You can determine the last restartpoint
2001 by running <application>pg_controldata</> on the standby server before
2002 taking the backup, or by using the <varname>log_checkpoints</> option
2003 to print values to the standby's server log.
2004 </para>
2005 </sect1>
2007 </chapter>