doc/src/sgml/high-availability.sgml

   1 <!-- $PostgreSQL$ -->
   2
   3 <chapter id="high-availability">
   4  <title>High Availability, Load Balancing, and Replication</title>
   5
   6  <indexterm><primary>high availability</></>
   7  <indexterm><primary>failover</></>
   8  <indexterm><primary>replication</></>
   9  <indexterm><primary>load balancing</></>
  10  <indexterm><primary>clustering</></>
  11  <indexterm><primary>data partitioning</></>
  12
  13  <para>
  14   Database servers can work together to allow a second server to
  15   take over quickly if the primary server fails (high
  16   availability), or to allow several computers to serve the same
  17   data (load balancing).  Ideally, database servers could work
  18   together seamlessly.  Web servers serving static web pages can
  19   be combined quite easily by merely load-balancing web requests
  20   to multiple machines.  In fact, read-only database servers can
  21   be combined relatively easily too.  Unfortunately, most database
  22   servers have a read/write mix of requests, and read/write servers
  23   are much harder to combine.  This is because though read-only
  24   data needs to be placed on each server only once, a write to any
  25   server has to be propagated to all servers so that future read
  26   requests to those servers return consistent results.
  27  </para>
  28
  29  <para>
  30   This synchronization problem is the fundamental difficulty for
  31   servers working together.  Because there is no single solution
  32   that eliminates the impact of the sync problem for all use cases,
  33   there are multiple solutions.  Each solution addresses this
  34   problem in a different way, and minimizes its impact for a specific
  35   workload.
  36  </para>
  37
  38  <para>
  39   Some solutions deal with synchronization by allowing only one
  40   server to modify the data.  Servers that can modify data are
  41   called read/write, <firstterm>master</> or <firstterm>primary</> servers.
  42   Servers that track changes in the master are called <firstterm>standby</>
  43   or <firstterm>slave</> servers. A standby server that cannot be connected
  44   to until it is promoted to a master server is called a <firstterm>warm
  45   standby</> server, and one that can accept connections and serves read-only
  46   queries is called a <firstterm>hot standby</> server.
  47  </para>
  48
  49  <para>
  50   Some solutions are synchronous,
  51   meaning that a data-modifying transaction is not considered
  52   committed until all servers have committed the transaction.  This
  53   guarantees that a failover will not lose any data and that all
  54   load-balanced servers will return consistent results no matter
  55   which server is queried. In contrast, asynchronous solutions allow some
  56   delay between the time of a commit and its propagation to the other servers,
  57   opening the possibility that some transactions might be lost in
  58   the switch to a backup server, and that load balanced servers
  59   might return slightly stale results.  Asynchronous communication
  60   is used when synchronous would be too slow.
  61  </para>
  62
  63  <para>
  64   Solutions can also be categorized by their granularity.  Some solutions
  65   can deal only with an entire database server, while others allow control
  66   at the per-table or per-database level.
  67  </para>
  68
  69  <para>
  70   Performance must be considered in any choice.  There is usually a
  71   trade-off between functionality and
  72   performance.  For example, a fully synchronous solution over a slow
  73   network might cut performance by more than half, while an asynchronous
  74   one might have a minimal performance impact.
  75  </para>
  76
  77  <para>
  78   The remainder of this section outlines various failover, replication,
  79   and load balancing solutions.  A <ulink
  80   url="http://www.postgres-r.org/documentation/terms">glossary</ulink> is
  81   also available.
  82  </para>
  83
  84  <sect1 id="different-replication-solutions">
  85  <title>Comparison of different solutions</title>
  86
  87  <variablelist>
  88
  89   <varlistentry>
  90    <term>Shared Disk Failover</term>
  91    <listitem>
  92
  93     <para>
  94      Shared disk failover avoids synchronization overhead by having only one
  95      copy of the database.  It uses a single disk array that is shared by
  96      multiple servers.  If the main database server fails, the standby server
  97      is able to mount and start the database as though it were recovering from
  98      a database crash.  This allows rapid failover with no data loss.
  99     </para>
 100
 101     <para>
 102      Shared hardware functionality is common in network storage devices.
 103      Using a network file system is also possible, though care must be
 104      taken that the file system has full <acronym>POSIX</> behavior (see <xref
 105      linkend="creating-cluster-nfs">).  One significant limitation of this
 106      method is that if the shared disk array fails or becomes corrupt, the
 107      primary and standby servers are both nonfunctional.  Another issue is
 108      that the standby server should never access the shared storage while
 109      the primary server is running.
 110     </para>
 111
 112    </listitem>
 113   </varlistentry>
 114
 115   <varlistentry>
 116    <term>File System (Block-Device) Replication</term>
 117    <listitem>
 118
 119     <para>
 120      A modified version of shared hardware functionality is file system
 121      replication, where all changes to a file system are mirrored to a file
 122      system residing on another computer.  The only restriction is that
 123      the mirroring must be done in a way that ensures the standby server
 124      has a consistent copy of the file system &mdash; specifically, writes
 125      to the standby must be done in the same order as those on the master.
 126      <productname>DRBD</> is a popular file system replication solution
 127      for Linux.
 128     </para>
 129
 130 <!--
 131 https://forge.continuent.org/pipermail/sequoia/2006-November/004070.html
 132
 133 Oracle RAC is a shared disk approach and just send cache invalidations
 134 to other nodes but not actual data. As the disk is shared, data is
 135 only committed once to disk and there is a distributed locking
 136 protocol to make nodes agree on a serializable transactional order.
 137 -->
 138
 139    </listitem>
 140   </varlistentry>
 141
 142   <varlistentry>
 143    <term>Warm and Hot Standby Using Point-In-Time Recovery (<acronym>PITR</>)</term>
 144    <listitem>
 145
 146     <para>
 147      Warm and hot standby servers can be kept current by reading a
 148      stream of write-ahead log (<acronym>WAL</>)
 149      records.  If the main server fails, the standby contains
 150      almost all of the data of the main server, and can be quickly
 151      made the new master database server.  This is asynchronous and
 152      can only be done for the entire database server.
 153     </para>
 154     <para>
 155      A PITR standby server can be implemented using file-based log shipping
 156      (<xref linkend="warm-standby">) or streaming replication (see
 157      <xref linkend="streaming-replication">), or a combination of both. For
 158      information on hot standby, see <xref linkend="hot-standby">.
 159     </para>
 160    </listitem>
 161   </varlistentry>
 162
 163   <varlistentry>
 164    <term>Trigger-Based Master-Standby Replication</term>
 165    <listitem>
 166
 167     <para>
 168      A master-standby replication setup sends all data modification
 169      queries to the master server.  The master server asynchronously
 170      sends data changes to the standby server.  The standby can answer
 171      read-only queries while the master server is running.  The
 172      standby server is ideal for data warehouse queries.
 173     </para>
 174
 175     <para>
 176      <productname>Slony-I</> is an example of this type of replication, with per-table
 177      granularity, and support for multiple standby servers.  Because it
 178      updates the standby server asynchronously (in batches), there is
 179      possible data loss during fail over.
 180     </para>
 181    </listitem>
 182   </varlistentry>
 183
 184   <varlistentry>
 185    <term>Statement-Based Replication Middleware</term>
 186    <listitem>
 187
 188     <para>
 189      With statement-based replication middleware, a program intercepts
 190      every SQL query and sends it to one or all servers.  Each server
 191      operates independently.  Read-write queries are sent to all servers,
 192      while read-only queries can be sent to just one server, allowing
 193      the read workload to be distributed.
 194     </para>
 195
 196     <para>
 197      If queries are simply broadcast unmodified, functions like
 198      <function>random()</>, <function>CURRENT_TIMESTAMP</>, and
 199      sequences can have different values on different servers.
 200      This is because each server operates independently, and because
 201      SQL queries are broadcast (and not actual modified rows).  If
 202      this is unacceptable, either the middleware or the application
 203      must query such values from a single server and then use those
 204      values in write queries.  Another option is to use this replication
 205      option with a traditional master-standby setup, i.e. data modification
 206      queries are sent only to the master and are propagated to the
 207      standby servers via master-standby replication, not by the replication
 208      middleware.  Care must also be taken that all
 209      transactions either commit or abort on all servers, perhaps
 210      using two-phase commit (<xref linkend="sql-prepare-transaction">
 211      and <xref linkend="sql-commit-prepared">.
 212      <productname>Pgpool-II</> and <productname>Sequoia</> are examples of
 213      this type of replication.
 214     </para>
 215    </listitem>
 216   </varlistentry>
 217
 218   <varlistentry>
 219    <term>Asynchronous Multimaster Replication</term>
 220    <listitem>
 221
 222     <para>
 223      For servers that are not regularly connected, like laptops or
 224      remote servers, keeping data consistent among servers is a
 225      challenge.  Using asynchronous multimaster replication, each
 226      server works independently, and periodically communicates with
 227      the other servers to identify conflicting transactions.  The
 228      conflicts can be resolved by users or conflict resolution rules.
 229      Bucardo is an example of this type of replication.
 230     </para>
 231    </listitem>
 232   </varlistentry>
 233
 234   <varlistentry>
 235    <term>Synchronous Multimaster Replication</term>
 236    <listitem>
 237
 238     <para>
 239      In synchronous multimaster replication, each server can accept
 240      write requests, and modified data is transmitted from the
 241      original server to every other server before each transaction
 242      commits.  Heavy write activity can cause excessive locking,
 243      leading to poor performance.  In fact, write performance is
 244      often worse than that of a single server.  Read requests can
 245      be sent to any server.  Some implementations use shared disk
 246      to reduce the communication overhead.  Synchronous multimaster
 247      replication is best for mostly read workloads, though its big
 248      advantage is that any server can accept write requests &mdash;
 249      there is no need to partition workloads between master and
 250      standby servers, and because the data changes are sent from one
 251      server to another, there is no problem with non-deterministic
 252      functions like <function>random()</>.
 253     </para>
 254
 255     <para>
 256      <productname>PostgreSQL</> does not offer this type of replication,
 257      though <productname>PostgreSQL</> two-phase commit (<xref
 258      linkend="sql-prepare-transaction"> and <xref
 259      linkend="sql-commit-prepared">)
 260      can be used to implement this in application code or middleware.
 261     </para>
 262    </listitem>
 263   </varlistentry>
 264
 265   <varlistentry>
 266    <term>Commercial Solutions</term>
 267    <listitem>
 268
 269     <para>
 270      Because <productname>PostgreSQL</> is open source and easily
 271      extended, a number of companies have taken <productname>PostgreSQL</>
 272      and created commercial closed-source solutions with unique
 273      failover, replication, and load balancing capabilities.
 274     </para>
 275    </listitem>
 276   </varlistentry>
 277
 278  </variablelist>
 279
 280  <para>
 281   <xref linkend="high-availability-matrix"> summarizes
 282   the capabilities of the various solutions listed above.
 283  </para>
 284
 285  <table id="high-availability-matrix">
 286   <title>High Availability, Load Balancing, and Replication Feature Matrix</title>
 287   <tgroup cols="8">
 288    <thead>
 289     <row>
 290      <entry>Feature</entry>
 291      <entry>Shared Disk Failover</entry>
 292      <entry>File System Replication</entry>
 293      <entry>Hot/Warm Standby Using PITR</entry>
 294      <entry>Trigger-Based Master-Standby Replication</entry>
 295      <entry>Statement-Based Replication Middleware</entry>
 296      <entry>Asynchronous Multimaster Replication</entry>
 297      <entry>Synchronous Multimaster Replication</entry>
 298     </row>
 299    </thead>
 300
 301    <tbody>
 302
 303     <row>
 304      <entry>Most Common Implementation</entry>
 305      <entry align="center">NAS</entry>
 306      <entry align="center">DRBD</entry>
 307      <entry align="center">PITR</entry>
 308      <entry align="center">Slony</entry>
 309      <entry align="center">pgpool-II</entry>
 310      <entry align="center">Bucardo</entry>
 311      <entry align="center"></entry>
 312     </row>
 313
 314     <row>
 315      <entry>Communication Method</entry>
 316      <entry align="center">shared disk</entry>
 317      <entry align="center">disk blocks</entry>
 318      <entry align="center">WAL</entry>
 319      <entry align="center">table rows</entry>
 320      <entry align="center">SQL</entry>
 321      <entry align="center">table rows</entry>
 322      <entry align="center">table rows and row locks</entry>
 323     </row>
 324
 325     <row>
 326      <entry>No special hardware required</entry>
 327      <entry align="center"></entry>
 328      <entry align="center">&bull;</entry>
 329      <entry align="center">&bull;</entry>
 330      <entry align="center">&bull;</entry>
 331      <entry align="center">&bull;</entry>
 332      <entry align="center">&bull;</entry>
 333      <entry align="center">&bull;</entry>
 334     </row>
 335
 336     <row>
 337      <entry>Allows multiple master servers</entry>
 338      <entry align="center"></entry>
 339      <entry align="center"></entry>
 340      <entry align="center"></entry>
 341      <entry align="center"></entry>
 342      <entry align="center">&bull;</entry>
 343      <entry align="center">&bull;</entry>
 344      <entry align="center">&bull;</entry>
 345     </row>
 346
 347     <row>
 348      <entry>No master server overhead</entry>
 349      <entry align="center">&bull;</entry>
 350      <entry align="center"></entry>
 351      <entry align="center">&bull;</entry>
 352      <entry align="center"></entry>
 353      <entry align="center">&bull;</entry>
 354      <entry align="center"></entry>
 355      <entry align="center"></entry>
 356     </row>
 357
 358     <row>
 359      <entry>No waiting for multiple servers</entry>
 360      <entry align="center">&bull;</entry>
 361      <entry align="center"></entry>
 362      <entry align="center">&bull;</entry>
 363      <entry align="center">&bull;</entry>
 364      <entry align="center"></entry>
 365      <entry align="center">&bull;</entry>
 366      <entry align="center"></entry>
 367     </row>
 368
 369     <row>
 370      <entry>Master failure will never lose data</entry>
 371      <entry align="center">&bull;</entry>
 372      <entry align="center">&bull;</entry>
 373      <entry align="center"></entry>
 374      <entry align="center"></entry>
 375      <entry align="center">&bull;</entry>
 376      <entry align="center"></entry>
 377      <entry align="center">&bull;</entry>
 378     </row>
 379
 380     <row>
 381      <entry>Standby accept read-only queries</entry>
 382      <entry align="center"></entry>
 383      <entry align="center"></entry>
 384      <entry align="center">Hot only</entry>
 385      <entry align="center">&bull;</entry>
 386      <entry align="center">&bull;</entry>
 387      <entry align="center">&bull;</entry>
 388      <entry align="center">&bull;</entry>
 389     </row>
 390
 391     <row>
 392      <entry>Per-table granularity</entry>
 393      <entry align="center"></entry>
 394      <entry align="center"></entry>
 395      <entry align="center"></entry>
 396      <entry align="center">&bull;</entry>
 397      <entry align="center"></entry>
 398      <entry align="center">&bull;</entry>
 399      <entry align="center">&bull;</entry>
 400     </row>
 401
 402     <row>
 403      <entry>No conflict resolution necessary</entry>
 404      <entry align="center">&bull;</entry>
 405      <entry align="center">&bull;</entry>
 406      <entry align="center">&bull;</entry>
 407      <entry align="center">&bull;</entry>
 408      <entry align="center"></entry>
 409      <entry align="center"></entry>
 410      <entry align="center">&bull;</entry>
 411     </row>
 412
 413    </tbody>
 414   </tgroup>
 415  </table>
 416
 417  <para>
 418   There are a few solutions that do not fit into the above categories:
 419  </para>
 420
 421  <variablelist>
 422
 423   <varlistentry>
 424    <term>Data Partitioning</term>
 425    <listitem>
 426
 427     <para>
 428      Data partitioning splits tables into data sets.  Each set can
 429      be modified by only one server.  For example, data can be
 430      partitioned by offices, e.g., London and Paris, with a server
 431      in each office.  If queries combining London and Paris data
 432      are necessary, an application can query both servers, or
 433      master/standby replication can be used to keep a read-only copy
 434      of the other office's data on each server.
 435     </para>
 436    </listitem>
 437   </varlistentry>
 438
 439   <varlistentry>
 440    <term>Multiple-Server Parallel Query Execution</term>
 441    <listitem>
 442
 443     <para>
 444      Many of the above solutions allow multiple servers to handle multiple
 445      queries, but none allow a single query to use multiple servers to
 446      complete faster.  This solution allows multiple servers to work
 447      concurrently on a single query.  It is usually accomplished by
 448      splitting the data among servers and having each server execute its
 449      part of the query and return results to a central server where they
 450      are combined and returned to the user.  <productname>Pgpool-II</>
 451      has this capability.  Also, this can be implemented using the
 452      <productname>PL/Proxy</> toolset.
 453     </para>
 454
 455    </listitem>
 456   </varlistentry>
 457
 458  </variablelist>
 459
 460  </sect1>
 461
 462
 463  <sect1 id="warm-standby">
 464  <title>Log-Shipping Standby Servers</title>
 465
 466
 467   <para>
 468    Continuous archiving can be used to create a <firstterm>high
 469    availability</> (HA) cluster configuration with one or more
 470    <firstterm>standby servers</> ready to take over operations if the
 471    primary server fails. This capability is widely referred to as
 472    <firstterm>warm standby</> or <firstterm>log shipping</>.
 473   </para>
 474
 475   <para>
 476    The primary and standby server work together to provide this capability,
 477    though the servers are only loosely coupled. The primary server operates
 478    in continuous archiving mode, while each standby server operates in
 479    continuous recovery mode, reading the WAL files from the primary. No
 480    changes to the database tables are required to enable this capability,
 481    so it offers low administration overhead compared to some other
 482    replication solutions. This configuration also has relatively low
 483    performance impact on the primary server.
 484   </para>
 485
 486   <para>
 487    Directly moving WAL records from one database server to another
 488    is typically described as log shipping. <productname>PostgreSQL</>
 489    implements file-based log shipping, which means that WAL records are
 490    transferred one file (WAL segment) at a time. WAL files (16MB) can be
 491    shipped easily and cheaply over any distance, whether it be to an
 492    adjacent system, another system at the same site, or another system on
 493    the far side of the globe. The bandwidth required for this technique
 494    varies according to the transaction rate of the primary server.
 495    Record-based log shipping is also possible with streaming replication
 496    (see <xref linkend="streaming-replication">).
 497   </para>
 498
 499   <para>
 500    It should be noted that the log shipping is asynchronous, i.e., the WAL
 501    records are shipped after transaction commit. As a result, there is a
 502    window for data loss should the primary server suffer a catastrophic
 503    failure; transactions not yet shipped will be lost.  The size of the
 504    data loss window in file-based log shipping can be limited by use of the
 505    <varname>archive_timeout</varname> parameter, which can be set as low
 506    as a few seconds.  However such a low setting will
 507    substantially increase the bandwidth required for file shipping.
 508    If you need a window of less than a minute or so, consider using
 509    streaming replication (see <xref linkend="streaming-replication">).
 510   </para>
 511
 512   <para>
 513    Recovery performance is sufficiently good that the standby will
 514    typically be only moments away from full
 515    availability once it has been activated. As a result, this is called
 516    a warm standby configuration which offers high
 517    availability. Restoring a server from an archived base backup and
 518    rollforward will take considerably longer, so that technique only
 519    offers a solution for disaster recovery, not high availability.
 520    A standby server can also be used for read-only queries, in which case
 521    it is called a Hot Standby server. See <xref linkend="hot-standby"> for
 522    more information.
 523   </para>
 524
 525   <indexterm zone="high-availability">
 526    <primary>warm standby</primary>
 527   </indexterm>
 528
 529   <indexterm zone="high-availability">
 530    <primary>PITR standby</primary>
 531   </indexterm>
 532
 533   <indexterm zone="high-availability">
 534    <primary>standby server</primary>
 535   </indexterm>
 536
 537   <indexterm zone="high-availability">
 538    <primary>log shipping</primary>
 539   </indexterm>
 540
 541   <indexterm zone="high-availability">
 542    <primary>witness server</primary>
 543   </indexterm>
 544
 545   <indexterm zone="high-availability">
 546    <primary>STONITH</primary>
 547   </indexterm>
 548
 549   <sect2 id="standby-planning">
 550    <title>Planning</title>
 551
 552    <para>
 553     It is usually wise to create the primary and standby servers
 554     so that they are as similar as possible, at least from the
 555     perspective of the database server.  In particular, the path names
 556     associated with tablespaces will be passed across unmodified, so both
 557     primary and standby servers must have the same mount paths for
 558     tablespaces if that feature is used.  Keep in mind that if
 559     <xref linkend="sql-createtablespace">
 560     is executed on the primary, any new mount point needed for it must
 561     be created on the primary and all standby servers before the command
 562     is executed. Hardware need not be exactly the same, but experience shows
 563     that maintaining two identical systems is easier than maintaining two
 564     dissimilar ones over the lifetime of the application and system.
 565     In any case the hardware architecture must be the same &mdash; shipping
 566     from, say, a 32-bit to a 64-bit system will not work.
 567    </para>
 568
 569    <para>
 570     In general, log shipping between servers running different major
 571     <productname>PostgreSQL</> release
 572     levels is not possible. It is the policy of the PostgreSQL Global
 573     Development Group not to make changes to disk formats during minor release
 574     upgrades, so it is likely that running different minor release levels
 575     on primary and standby servers will work successfully. However, no
 576     formal support for that is offered and you are advised to keep primary
 577     and standby servers at the same release level as much as possible.
 578     When updating to a new minor release, the safest policy is to update
 579     the standby servers first &mdash; a new minor release is more likely
 580     to be able to read WAL files from a previous minor release than vice
 581     versa.
 582    </para>
 583
 584   </sect2>
 585
 586   <sect2 id="standby-server-operation">
 587    <title>Standby Server Operation</title>
 588
 589    <para>
 590     In standby mode, the server continuously applies WAL received from the
 591     master server. The standby server can read WAL from a WAL archive
 592     (see <varname>restore_command</>) or directly from the master
 593     over a TCP connection (streaming replication). The standby server will
 594     also attempt to restore any WAL found in the standby cluster's
 595     <filename>pg_xlog</> directory. That typically happens after a server
 596     restart, when the standby replays again WAL that was streamed from the
 597     master before the restart, but you can also manually copy files to
 598     <filename>pg_xlog</> at any time to have them replayed.
 599    </para>
 600
 601    <para>
 602     At startup, the standby begins by restoring all WAL available in the
 603     archive location, calling <varname>restore_command</>. Once it
 604     reaches the end of WAL available there and <varname>restore_command</>
 605     fails, it tries to restore any WAL available in the pg_xlog directory.
 606     If that fails, and streaming replication has been configured, the
 607     standby tries to connect to the primary server and start streaming WAL
 608     from the last valid record found in archive or pg_xlog. If that fails
 609     or streaming replication is not configured, or if the connection is
 610     later disconnected, the standby goes back to step 1 and tries to
 611     restore the file from the archive again. This loop of retries from the
 612     archive, pg_xlog, and via streaming replication goes on until the server
 613     is stopped or failover is triggered by a trigger file.
 614    </para>
 615
 616    <para>
 617     Standby mode is exited and the server switches to normal operation,
 618     when a trigger file is found (<varname>trigger_file</>). Before failover,
 619     any WAL immediately available in the archive or in pg_xlog will be
 620     restored, but no attempt is made to connect to the master.
 621    </para>
 622   </sect2>
 623
 624   <sect2 id="preparing-master-for-standby">
 625    <title>Preparing the Master for Standby Servers</title>
 626
 627    <para>
 628     Set up continuous archiving on the primary to an archive directory
 629     accessible from the standby, as described
 630     in <xref linkend="continuous-archiving">. The archive location should be
 631     accessible from the standby even when the master is down, i.e. it should
 632     reside on the standby server itself or another trusted server, not on
 633     the master server.
 634    </para>
 635
 636    <para>
 637     If you want to use streaming replication, set up authentication on the
 638     primary server to allow replication connections from the standby
 639     server(s); that is, provide a suitable entry or entries in
 640     <filename>pg_hba.conf</> with the database field set to
 641     <literal>replication</>.  Also ensure <varname>max_wal_senders</> is set
 642     to a sufficiently large value in the configuration file of the primary
 643     server.
 644    </para>
 645
 646    <para>
 647     Take a base backup as described in <xref linkend="backup-base-backup">
 648     to bootstrap the standby server.
 649    </para>
 650   </sect2>
 651
 652   <sect2 id="standby-server-setup">
 653    <title>Setting Up a Standby Server</title>
 654
 655    <para>
 656     To set up the standby server, restore the base backup taken from primary
 657     server (see <xref linkend="backup-pitr-recovery">). Create a recovery
 658     command file <filename>recovery.conf</> in the standby's cluster data
 659     directory, and turn on <varname>standby_mode</>. Set
 660     <varname>restore_command</> to a simple command to copy files from
 661     the WAL archive.
 662    </para>
 663
 664    <note>
 665      <para>
 666      Do not use pg_standby or similar tools with the built-in standby mode
 667      described here. <varname>restore_command</> should return immediately
 668      if the file does not exist; the server will retry the command again if
 669      necessary. See <xref linkend="log-shipping-alternative">
 670      for using tools like pg_standby.
 671     </para>
 672    </note>
 673
 674    <para>
 675      If you want to use streaming replication, fill in
 676      <varname>primary_conninfo</> with a libpq connection string, including
 677      the host name (or IP address) and any additional details needed to
 678      connect to the primary server. If the primary needs a password for
 679      authentication, the password needs to be specified in
 680      <varname>primary_conninfo</> as well.
 681    </para>
 682
 683    <para>
 684     You can use <varname>archive_cleanup_command</> to prune the archive of
 685     files no longer needed by the standby.
 686    </para>
 687
 688    <para>
 689     If you're setting up the standby server for high availability purposes,
 690     set up WAL archiving, connections and authentication like the primary
 691     server, because the standby server will work as a primary server after
 692     failover. You will also need to set <varname>trigger_file</> to make
 693     it possible to fail over.
 694     If you're setting up the standby server for reporting
 695     purposes, with no plans to fail over to it, <varname>trigger_file</>
 696     is not required.
 697    </para>
 698
 699    <para>
 700     A simple example of a <filename>recovery.conf</> is:
 701 <programlisting>
 702 standby_mode = 'on'
 703 primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass'
 704 restore_command = 'cp /path/to/archive/%f %p'
 705 trigger_file = '/path/to/trigger_file'
 706 </programlisting>
 707    </para>
 708
 709    <para>
 710     You can have any number of standby servers, but if you use streaming
 711     replication, make sure you set <varname>max_wal_senders</> high enough in
 712     the primary to allow them to be connected simultaneously.
 713    </para>
 714
 715    <para>
 716     If you're using a WAL archive, its size can be minimized using
 717     the <varname>archive_cleanup_command</> option to remove files that are
 718     no longer required by the standby server. Note however, that if you're
 719     using the archive for backup purposes, you need to retain files needed
 720     to recover from at least the latest base backup, even if they're no
 721     longer needed by the standby.
 722    </para>
 723   </sect2>
 724
 725   <sect2 id="streaming-replication">
 726    <title>Streaming Replication</title>
 727
 728    <indexterm zone="high-availability">
 729     <primary>Streaming Replication</primary>
 730    </indexterm>
 731
 732    <para>
 733     Streaming replication allows a standby server to stay more up-to-date
 734     than is possible with file-based log shipping. The standby connects
 735     to the primary, which streams WAL records to the standby as they're
 736     generated, without waiting for the WAL file to be filled.
 737    </para>
 738
 739    <para>
 740     Streaming replication is asynchronous, so there is still a small delay
 741     between committing a transaction in the primary and for the changes to
 742     become visible in the standby. The delay is however much smaller than with
 743     file-based log shipping, typically under one second assuming the standby
 744     is powerful enough to keep up with the load. With streaming replication,
 745     <varname>archive_timeout</> is not required to reduce the data loss
 746     window.
 747    </para>
 748
 749    <para>
 750     If you use streaming replication without file-based continuous
 751     archiving, you have to set <varname>wal_keep_segments</> in the master
 752     to a value high enough to ensure that old WAL segments are not recycled
 753     too early, while the standby might still need them to catch up. If the
 754     standby falls behind too much, it needs to be reinitialized from a new
 755     base backup. If you set up a WAL archive that's accessible from the
 756     standby, wal_keep_segments is not required as the standby can always
 757     use the archive to catch up.
 758    </para>
 759
 760    <para>
 761     To use streaming replication, set up a file-based log-shipping standby
 762     server as described in <xref linkend="warm-standby">. The step that
 763     turns a file-based log-shipping standby into streaming replication
 764     standby is setting <varname>primary_conninfo</> setting in the
 765     <filename>recovery.conf</> file to point to the primary server. Set
 766     <xref linkend="guc-listen-addresses"> and authentication options
 767     (see <filename>pg_hba.conf</>) on the primary so that the standby server
 768     can connect to the <literal>replication</> pseudo-database on the primary
 769     server (see <xref linkend="streaming-replication-authentication">).
 770    </para>
 771
 772    <para>
 773     On systems that support the keepalive socket option, setting
 774     <xref linkend="guc-tcp-keepalives-idle">,
 775     <xref linkend="guc-tcp-keepalives-interval"> and
 776     <xref linkend="guc-tcp-keepalives-count"> helps the primary promptly
 777     notice a broken connection.
 778    </para>
 779
 780    <para>
 781     Set the maximum number of concurrent connections from the standby servers
 782     (see <xref linkend="guc-max-wal-senders"> for details).
 783    </para>
 784
 785    <para>
 786     When the standby is started and <varname>primary_conninfo</> is set
 787     correctly, the standby will connect to the primary after replaying all
 788     WAL files available in the archive. If the connection is established
 789     successfully, you will see a walreceiver process in the standby, and
 790     a corresponding walsender process in the primary.
 791    </para>
 792
 793    <sect3 id="streaming-replication-authentication">
 794     <title>Authentication</title>
 795     <para>
 796      It is very important that the access privileges for replication be set up
 797      so that only trusted users can read the WAL stream, because it is
 798      easy to extract privileged information from it.  Standby servers must
 799      authenticate to the primary as a superuser account.
 800      So a role with the <literal>SUPERUSER</> and <literal>LOGIN</>
 801      privileges needs to be created on the primary.
 802     </para>
 803     <para>
 804      Client authentication for replication is controlled by a
 805      <filename>pg_hba.conf</> record specifying <literal>replication</> in the
 806      <replaceable>database</> field. For example, if the standby is running on
 807      host IP <literal>192.168.1.100</> and the superuser's name for replication
 808      is <literal>foo</>, the administrator can add the following line to the
 809      <filename>pg_hba.conf</> file on the primary:
 810
 811 <programlisting>
 812 # Allow the user "foo" from host 192.168.1.100 to connect to the primary
 813 # as a replication standby if the user's password is correctly supplied.
 814 #
 815 # TYPE  DATABASE        USER            CIDR-ADDRESS            METHOD
 816 host    replication     foo             192.168.1.100/32        md5
 817 </programlisting>
 818     </para>
 819     <para>
 820      The host name and port number of the primary, connection user name,
 821      and password are specified in the <filename>recovery.conf</> file.
 822      The password can also be set in the <filename>~/.pgpass</> file on the
 823      standby (specify <literal>replication</> in the <replaceable>database</>
 824      field).
 825      For example, if the primary is running on host IP <literal>192.168.1.50</>,
 826      port <literal>5432</literal>, the superuser's name for replication is
 827      <literal>foo</>, and the password is <literal>foopass</>, the administrator
 828      can add the following line to the <filename>recovery.conf</> file on the
 829      standby:
 830
 831 <programlisting>
 832 # The standby connects to the primary that is running on host 192.168.1.50
 833 # and port 5432 as the user "foo" whose password is "foopass".
 834 primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass'
 835 </programlisting>
 836     </para>
 837    </sect3>
 838
 839    <sect3 id="streaming-replication-monitoring">
 840     <title>Monitoring</title>
 841     <para>
 842      An important health indicator of streaming replication is the amount
 843      of WAL records generated in the primary, but not yet applied in the
 844      standby. You can calculate this lag by comparing the current WAL write
 845      location on the primary with the last WAL location received by the
 846      standby. They can be retrieved using
 847      <function>pg_current_xlog_location</> on the primary and the
 848      <function>pg_last_xlog_receive_location</> on the standby,
 849      respectively (see <xref linkend="functions-admin-backup-table"> and
 850      <xref linkend="functions-recovery-info-table"> for details).
 851      The last WAL receive location in the standby is also displayed in the
 852      process status of the WAL receiver process, displayed using the
 853      <command>ps</> command (see <xref linkend="monitoring-ps"> for details).
 854     </para>
 855    </sect3>
 856
 857   </sect2>
 858   </sect1>
 859
 860   <sect1 id="warm-standby-failover">
 861    <title>Failover</title>
 862
 863    <para>
 864     If the primary server fails then the standby server should begin
 865     failover procedures.
 866    </para>
 867
 868    <para>
 869     If the standby server fails then no failover need take place. If the
 870     standby server can be restarted, even some time later, then the recovery
 871     process can also be restarted immediately, taking advantage of
 872     restartable recovery. If the standby server cannot be restarted, then a
 873     full new standby server instance should be created.
 874    </para>
 875
 876    <para>
 877     If the primary server fails and the standby server becomes the
 878     new primary, and then the old primary restarts, you must have
 879     a mechanism for informing the old primary that it is no longer the primary. This is
 880     sometimes known as <acronym>STONITH</> (Shoot The Other Node In The Head), which is
 881     necessary to avoid situations where both systems think they are the
 882     primary, which will lead to confusion and ultimately data loss.
 883    </para>
 884
 885    <para>
 886     Many failover systems use just two systems, the primary and the standby,
 887     connected by some kind of heartbeat mechanism to continually verify the
 888     connectivity between the two and the viability of the primary. It is
 889     also possible to use a third system (called a witness server) to prevent
 890     some cases of inappropriate failover, but the additional complexity
 891     might not be worthwhile unless it is set up with sufficient care and
 892     rigorous testing.
 893    </para>
 894
 895    <para>
 896     <productname>PostgreSQL</productname> does not provide the system
 897     software required to identify a failure on the primary and notify
 898     the standby database server.  Many such tools exist and are well
 899     integrated with the operating system facilities required for
 900     successful failover, such as IP address migration.
 901    </para>
 902
 903    <para>
 904     Once failover to the standby occurs, there is only a
 905     single server in operation. This is known as a degenerate state.
 906     The former standby is now the primary, but the former primary is down
 907     and might stay down.  To return to normal operation, a standby server
 908     must be recreated,
 909     either on the former primary system when it comes up, or on a third,
 910     possibly new, system. Once complete the primary and standby can be
 911     considered to have switched roles. Some people choose to use a third
 912     server to provide backup for the new primary until the new standby
 913     server is recreated,
 914     though clearly this complicates the system configuration and
 915     operational processes.
 916    </para>
 917
 918    <para>
 919     So, switching from primary to standby server can be fast but requires
 920     some time to re-prepare the failover cluster. Regular switching from
 921     primary to standby is useful, since it allows regular downtime on
 922     each system for maintenance. This also serves as a test of the
 923     failover mechanism to ensure that it will really work when you need it.
 924     Written administration procedures are advised.
 925    </para>
 926
 927    <para>
 928     To trigger failover of a log-shipping standby server, create a trigger
 929     file with the filename and path specified by the <varname>trigger_file</>
 930     setting in <filename>recovery.conf</>. If <varname>trigger_file</> is
 931     not given, there is no way to exit recovery in the standby and promote
 932     it to a master. That can be useful for e.g reporting servers that are
 933     only used to offload read-only queries from the primary, not for high
 934     availability purposes.
 935    </para>
 936   </sect1>
 937
 938   <sect1 id="log-shipping-alternative">
 939    <title>Alternative method for log shipping</title>
 940
 941    <para>
 942     An alternative to the built-in standby mode described in the previous
 943     sections is to use a <varname>restore_command</> that polls the archive location.
 944     This was the only option available in versions 8.4 and below. In this
 945     setup, set <varname>standby_mode</> off, because you are implementing
 946     the polling required for standby operation yourself. See
 947     contrib/pg_standby (<xref linkend="pgstandby">) for a reference
 948     implementation of this.
 949    </para>
 950
 951    <para>
 952     Note that in this mode, the server will apply WAL one file at a
 953     time, so if you use the standby server for queries (see Hot Standby),
 954     there is a delay between an action in the master and when the
 955     action becomes visible in the standby, corresponding the time it takes
 956     to fill up the WAL file. <varname>archive_timeout</> can be used to make that delay
 957     shorter. Also note that you can't combine streaming replication with
 958     this method.
 959    </para>
 960
 961    <para>
 962     The operations that occur on both primary and standby servers are
 963     normal continuous archiving and recovery tasks. The only point of
 964     contact between the two database servers is the archive of WAL files
 965     that both share: primary writing to the archive, standby reading from
 966     the archive. Care must be taken to ensure that WAL archives from separate
 967     primary servers do not become mixed together or confused. The archive
 968     need not be large if it is only required for standby operation.
 969    </para>
 970
 971    <para>
 972     The magic that makes the two loosely coupled servers work together is
 973     simply a <varname>restore_command</> used on the standby that,
 974     when asked for the next WAL file, waits for it to become available from
 975     the primary. The <varname>restore_command</> is specified in the
 976     <filename>recovery.conf</> file on the standby server. Normal recovery
 977     processing would request a file from the WAL archive, reporting failure
 978     if the file was unavailable.  For standby processing it is normal for
 979     the next WAL file to be unavailable, so the standby must wait for
 980     it to appear. For files ending in <literal>.backup</> or
 981     <literal>.history</> there is no need to wait, and a non-zero return
 982     code must be returned. A waiting <varname>restore_command</> can be
 983     written as a custom script that loops after polling for the existence of
 984     the next WAL file. There must also be some way to trigger failover, which
 985     should interrupt the <varname>restore_command</>, break the loop and
 986     return a file-not-found error to the standby server. This ends recovery
 987     and the standby will then come up as a normal server.
 988    </para>
 989
 990    <para>
 991     Pseudocode for a suitable <varname>restore_command</> is:
 992 <programlisting>
 993 triggered = false;
 994 while (!NextWALFileReady() &amp;&amp; !triggered)
 995 {
 996     sleep(100000L);         /* wait for ~0.1 sec */
 997     if (CheckForExternalTrigger())
 998         triggered = true;
 999 }
1000 if (!triggered)
1001         CopyWALFileForRecovery();
1002 </programlisting>
1003    </para>
1004
1005    <para>
1006     A working example of a waiting <varname>restore_command</> is provided
1007     as a <filename>contrib</> module named <application>pg_standby</>. It
1008     should be used as a reference on how to correctly implement the logic
1009     described above. It can also be extended as needed to support specific
1010     configurations and environments.
1011    </para>
1012
1013    <para>
1014     The method for triggering failover is an important part of planning
1015     and design. One potential option is the <varname>restore_command</>
1016     command.  It is executed once for each WAL file, but the process
1017     running the <varname>restore_command</> is created and dies for
1018     each file, so there is no daemon or server process, and
1019     signals or a signal handler cannot be used. Therefore, the
1020     <varname>restore_command</> is not suitable to trigger failover.
1021     It is possible to use a simple timeout facility, especially if
1022     used in conjunction with a known <varname>archive_timeout</>
1023     setting on the primary. However, this is somewhat error prone
1024     since a network problem or busy primary server might be sufficient
1025     to initiate failover. A notification mechanism such as the explicit
1026     creation of a trigger file is ideal, if this can be arranged.
1027    </para>
1028
1029   <sect2 id="warm-standby-config">
1030    <title>Implementation</title>
1031
1032    <para>
1033     The short procedure for configuring a standby server using this alternative
1034     method is as follows. For
1035     full details of each step, refer to previous sections as noted.
1036     <orderedlist>
1037      <listitem>
1038       <para>
1039        Set up primary and standby systems as nearly identical as
1040        possible, including two identical copies of
1041        <productname>PostgreSQL</> at the same release level.
1042       </para>
1043      </listitem>
1044      <listitem>
1045       <para>
1046        Set up continuous archiving from the primary to a WAL archive
1047        directory on the standby server. Ensure that
1048        <xref linkend="guc-archive-mode">,
1049        <xref linkend="guc-archive-command"> and
1050        <xref linkend="guc-archive-timeout">
1051        are set appropriately on the primary
1052        (see <xref linkend="backup-archiving-wal">).
1053       </para>
1054      </listitem>
1055      <listitem>
1056       <para>
1057        Make a base backup of the primary server (see <xref
1058        linkend="backup-base-backup">), and load this data onto the standby.
1059       </para>
1060      </listitem>
1061      <listitem>
1062       <para>
1063        Begin recovery on the standby server from the local WAL
1064        archive, using a <filename>recovery.conf</> that specifies a
1065        <varname>restore_command</> that waits as described
1066        previously (see <xref linkend="backup-pitr-recovery">).
1067       </para>
1068      </listitem>
1069     </orderedlist>
1070    </para>
1071
1072    <para>
1073     Recovery treats the WAL archive as read-only, so once a WAL file has
1074     been copied to the standby system it can be copied to tape at the same
1075     time as it is being read by the standby database server.
1076     Thus, running a standby server for high availability can be performed at
1077     the same time as files are stored for longer term disaster recovery
1078     purposes.
1079    </para>
1080
1081    <para>
1082     For testing purposes, it is possible to run both primary and standby
1083     servers on the same system. This does not provide any worthwhile
1084     improvement in server robustness, nor would it be described as HA.
1085    </para>
1086   </sect2>
1087
1088   <sect2 id="warm-standby-record">
1089    <title>Record-based Log Shipping</title>
1090
1091    <para>
1092     It is also possible to implement record-based log shipping using this
1093     alternative method, though this requires custom development, and changes
1094     will still only become visible to hot standby queries after a full WAL
1095     file has been shipped.
1096    </para>
1097
1098    <para>
1099     An external program can call the <function>pg_xlogfile_name_offset()</>
1100     function (see <xref linkend="functions-admin">)
1101     to find out the file name and the exact byte offset within it of
1102     the current end of WAL.  It can then access the WAL file directly
1103     and copy the data from the last known end of WAL through the current end
1104     over to the standby servers.  With this approach, the window for data
1105     loss is the polling cycle time of the copying program, which can be very
1106     small, and there is no wasted bandwidth from forcing partially-used
1107     segment files to be archived.  Note that the standby servers'
1108     <varname>restore_command</> scripts can only deal with whole WAL files,
1109     so the incrementally copied data is not ordinarily made available to
1110     the standby servers.  It is of use only when the primary dies &mdash;
1111     then the last partial WAL file is fed to the standby before allowing
1112     it to come up.  The correct implementation of this process requires
1113     cooperation of the <varname>restore_command</> script with the data
1114     copying program.
1115    </para>
1116
1117    <para>
1118     Starting with <productname>PostgreSQL</> version 9.0, you can use
1119     streaming replication (see <xref linkend="streaming-replication">) to
1120     achieve the same benefits with less effort.
1121    </para>
1122   </sect2>
1123  </sect1>
1124
1125  <sect1 id="hot-standby">
1126   <title>Hot Standby</title>
1127
1128   <indexterm zone="high-availability">
1129    <primary>Hot Standby</primary>
1130   </indexterm>
1131
1132    <para>
1133     Hot Standby is the term used to describe the ability to connect to
1134     the server and run read-only queries while the server is in archive
1135     recovery. This
1136     is useful for both log shipping replication and for restoring a backup
1137     to an exact state with great precision.
1138     The term Hot Standby also refers to the ability of the server to move
1139     from recovery through to normal operation while users continue running
1140     queries and/or keep their connections open.
1141    </para>
1142
1143    <para>
1144     Running queries in recovery mode is similar to normal query operation,
1145     though there are several usage and administrative differences
1146     noted below.
1147    </para>
1148
1149   <sect2 id="hot-standby-users">
1150    <title>User's Overview</title>
1151
1152    <para>
1153     When the <xref linkend="guc-hot-standby"> parameter is set to true on a
1154     standby server, it will begin accepting connections once the recovery has
1155     brought the system to a consistent state.  All such connections are
1156     strictly read-only; not even temporary tables may be written.
1157    </para>
1158
1159    <para>
1160     The data on the standby takes some time to arrive from the primary server
1161     so there will be a measurable delay between primary and standby. Running the
1162     same query nearly simultaneously on both primary and standby might therefore
1163     return differing results. We say that data on the standby is
1164     <firstterm>eventually consistent</firstterm> with the primary.  Once the
1165     commit record for a transaction is replayed on the standby, the changes
1166     made by that transaction will be visible to any new snapshots taken on
1167     the standby.  Snapshots may be taken at the start of each query or at the
1168     start of each transaction, depending on the current transaction isolation
1169     level.  For more details, see <xref linkend="transaction-iso">.
1170    </para>
1171
1172    <para>
1173     Transactions started during recovery may issue the following commands:
1174
1175     <itemizedlist>
1176      <listitem>
1177       <para>
1178        Query access - <command>SELECT</>, <command>COPY TO</>
1179       </para>
1180      </listitem>
1181      <listitem>
1182       <para>
1183        Cursor commands - <command>DECLARE</>, <command>FETCH</>, <command>CLOSE</>
1184       </para>
1185      </listitem>
1186      <listitem>
1187       <para>
1188        Parameters - <command>SHOW</>, <command>SET</>, <command>RESET</>
1189       </para>
1190      </listitem>
1191      <listitem>
1192       <para>
1193        Transaction management commands
1194         <itemizedlist>
1195          <listitem>
1196           <para>
1197            <command>BEGIN</>, <command>END</>, <command>ABORT</>, <command>START TRANSACTION</>
1198           </para>
1199          </listitem>
1200          <listitem>
1201           <para>
1202            <command>SAVEPOINT</>, <command>RELEASE</>, <command>ROLLBACK TO SAVEPOINT</>
1203           </para>
1204          </listitem>
1205          <listitem>
1206           <para>
1207            <command>EXCEPTION</> blocks and other internal subtransactions
1208           </para>
1209          </listitem>
1210         </itemizedlist>
1211       </para>
1212      </listitem>
1213      <listitem>
1214       <para>
1215        <command>LOCK TABLE</>, though only when explicitly in one of these modes:
1216        <literal>ACCESS SHARE</>, <literal>ROW SHARE</> or <literal>ROW EXCLUSIVE</>.
1217       </para>
1218      </listitem>
1219      <listitem>
1220       <para>
1221        Plans and resources - <command>PREPARE</>, <command>EXECUTE</>,
1222        <command>DEALLOCATE</>, <command>DISCARD</>
1223       </para>
1224      </listitem>
1225      <listitem>
1226       <para>
1227        Plugins and extensions - <command>LOAD</>
1228       </para>
1229      </listitem>
1230     </itemizedlist>
1231    </para>
1232
1233    <para>
1234     Transactions started during recovery may never be assigned a transaction ID
1235     and may not write to the system write-ahead log.  Therefore, the following
1236     actions will produce error messages:
1237
1238     <itemizedlist>
1239      <listitem>
1240       <para>
1241        Data Manipulation Language (DML) - <command>INSERT</>,
1242        <command>UPDATE</>, <command>DELETE</>, <command>COPY FROM</>,
1243        <command>TRUNCATE</>.
1244        Note that there are no allowed actions that result in a trigger
1245        being executed during recovery.  This restriction applies even to
1246        temporary tables, because table rows cannot be read or written without
1247        assigning a transaction ID, which is currently not possible in a
1248        Hot Standby environment.
1249       </para>
1250      </listitem>
1251      <listitem>
1252       <para>
1253        Data Definition Language (DDL) - <command>CREATE</>,
1254        <command>DROP</>, <command>ALTER</>, <command>COMMENT</>.
1255        This restriction applies even to temporary tables, because carrying
1256        out these operations would require updating the system catalog tables.
1257       </para>
1258      </listitem>
1259      <listitem>
1260       <para>
1261        <command>SELECT ... FOR SHARE | UPDATE</>, because row locks cannot be
1262        taken without updating the underlying data files.
1263       </para>
1264      </listitem>
1265      <listitem>
1266       <para>
1267        Rules on <command>SELECT</> statements that generate DML commands.
1268       </para>
1269      </listitem>
1270      <listitem>
1271       <para>
1272        <command>LOCK</> that explicitly requests a mode higher than <literal>ROW EXCLUSIVE MODE</>.
1273       </para>
1274      </listitem>
1275      <listitem>
1276       <para>
1277        <command>LOCK</> in short default form, since it requests <literal>ACCESS EXCLUSIVE MODE</>.
1278       </para>
1279      </listitem>
1280      <listitem>
1281       <para>
1282        Transaction management commands that explicitly set non-read-only state:
1283         <itemizedlist>
1284          <listitem>
1285           <para>
1286             <command>BEGIN READ WRITE</>,
1287             <command>START TRANSACTION READ WRITE</>
1288           </para>
1289          </listitem>
1290          <listitem>
1291           <para>
1292             <command>SET TRANSACTION READ WRITE</>,
1293             <command>SET SESSION CHARACTERISTICS AS TRANSACTION READ WRITE</>
1294           </para>
1295          </listitem>
1296          <listitem>
1297           <para>
1298            <command>SET transaction_read_only = off</>
1299           </para>
1300          </listitem>
1301         </itemizedlist>
1302       </para>
1303      </listitem>
1304      <listitem>
1305       <para>
1306        Two-phase commit commands - <command>PREPARE TRANSACTION</>,
1307        <command>COMMIT PREPARED</>, <command>ROLLBACK PREPARED</>
1308        because even read-only transactions need to write WAL in the
1309        prepare phase (the first phase of two phase commit).
1310       </para>
1311      </listitem>
1312      <listitem>
1313       <para>
1314        Sequence updates - <function>nextval()</>, <function>setval()</>
1315       </para>
1316      </listitem>
1317      <listitem>
1318       <para>
1319        <command>LISTEN</>, <command>UNLISTEN</>, <command>NOTIFY</>
1320       </para>
1321      </listitem>
1322     </itemizedlist>
1323    </para>
1324
1325    <para>
1326     Outside of recovery, read-only transactions are allowed to update sequences
1327     and to use <command>LISTEN</>, <command>UNLISTEN</>, and
1328     <command>NOTIFY</>, so Hot Standby sessions operate under slightly tighter
1329     restrictions than ordinary read-only sessions.  It is possible that some
1330     of these restrictions might be loosened in a future release.
1331    </para>
1332
1333    <para>
1334     During recovery, the parameter <varname>transaction_read_only</> is always
1335     true and may not be changed.  But as long as no attempt is made to modify
1336     the database, connections during recovery will act much like any other
1337     database connection.  If failover or switchover occurs, the database will
1338     switch to normal processing mode.  Sessions will remain connected while the
1339     server changes mode.  Once recovery finishes, it will be possible to
1340     initiate read-write transactions (even from a session begun during
1341     recovery).
1342    </para>
1343
1344    <para>
1345     Users will be able to tell whether their session is read-only by
1346     issuing <command>SHOW transaction_read_only</>.  In addition, a set of
1347     functions (<xref linkend="functions-recovery-info-table">) allow users to
1348     access information about the standby server. These allow you to write
1349     programs that are aware of the current state of the database. These
1350     can be used to monitor the progress of recovery, or to allow you to
1351     write complex programs that restore the database to particular states.
1352    </para>
1353
1354    <para>
1355     In general, queries will not experience lock conflicts from the database
1356     changes made by recovery. This is because recovery follows normal
1357     concurrency control mechanisms, known as <acronym>MVCC</>. There are
1358     some types of change that will cause conflicts, covered in the following
1359     section.
1360    </para>
1361   </sect2>
1362
1363   <sect2 id="hot-standby-conflict">
1364    <title>Handling query conflicts</title>
1365
1366    <para>
1367     The primary and standby nodes are in many ways loosely connected. Actions
1368     on the primary will have an effect on the standby. As a result, there is
1369     potential for negative interactions or conflicts between them. The easiest
1370     conflict to understand is performance: if a huge data load is taking place
1371     on the primary then this will generate a similar stream of WAL records on the
1372     standby, so standby queries may contend for system resources, such as I/O.
1373    </para>
1374
1375    <para>
1376     There are also additional types of conflict that can occur with Hot Standby.
1377     These conflicts are <emphasis>hard conflicts</> in the sense that queries
1378     might need to be cancelled and, in some cases, sessions disconnected to resolve them.
1379     The user is provided with several ways to handle these
1380     conflicts. Conflicts can be caused by:
1381
1382       <itemizedlist>
1383        <listitem>
1384         <para>
1385          Access Exclusive Locks from primary node, including both explicit
1386          <command>LOCK</> commands and various <acronym>DDL</> actions
1387         </para>
1388        </listitem>
1389        <listitem>
1390         <para>
1391          Dropping tablespaces on the primary while standby queries are using
1392          those tablespaces for temporary work files (<varname>work_mem</> overflow)
1393         </para>
1394        </listitem>
1395        <listitem>
1396         <para>
1397          Dropping databases on the primary while users are connected to that
1398          database on the standby.
1399         </para>
1400        </listitem>
1401        <listitem>
1402         <para>
1403          The standby waiting longer than <varname>max_standby_delay</>
1404          to acquire a buffer cleanup lock.
1405         </para>
1406        </listitem>
1407        <listitem>
1408         <para>
1409          Early cleanup of data still visible to the current query's snapshot.
1410         </para>
1411        </listitem>
1412       </itemizedlist>
1413    </para>
1414
1415    <para>
1416     Some WAL redo actions will be for <acronym>DDL</> execution. These DDL
1417     actions are replaying changes that have already committed on the primary
1418     node, so they must not fail on the standby node. These DDL locks take
1419     priority and will automatically <emphasis>cancel</> any read-only
1420     transactions that get in their way, after a grace period. This is similar
1421     to the possibility of being canceled by the deadlock detector.  But in this
1422     case, the standby recovery process always wins, since the replayed actions
1423     must not fail.  This also ensures that replication does not fall behind
1424     while waiting for a query to complete. This prioritization presumes that
1425     the standby exists primarily for high availability, and that adjusting the
1426     grace period will allow a sufficient guard against unexpected cancellation.
1427    </para>
1428
1429    <para>
1430     An example of the above would be an administrator on the primary server
1431     running <command>DROP TABLE</> on a table that is currently being queried
1432     on the standby server.
1433     Clearly the query cannot continue if <command>DROP TABLE</>
1434     proceeds. If this situation occurred on the primary, the <command>DROP TABLE</>
1435     would wait until the query had finished. When <command>DROP TABLE</> is
1436     run on the primary, the primary doesn't have
1437     information about which queries are running on the standby, so it
1438     cannot wait for any of the standby queries. The WAL change records come through to the
1439     standby while the standby query is still running, causing a conflict.
1440    </para>
1441
1442    <para>
1443     The most common reason for conflict between standby queries and WAL redo is
1444     "early cleanup". Normally, <productname>PostgreSQL</> allows cleanup of old
1445     row versions when there are no users who need to see them to ensure correct
1446     visibility of data (the heart of MVCC). If there is a standby query that has
1447     been running for longer than any query on the primary then it is possible
1448     for old row versions to be removed by either a vacuum or HOT. This will
1449     then generate WAL records that, if applied, would remove data on the
1450     standby that might <emphasis>potentially</> be required by the standby query.
1451     In more technical language, the primary's xmin horizon is later than
1452     the standby's xmin horizon, allowing dead rows to be removed.
1453    </para>
1454
1455    <para>
1456     Experienced users should note that both row version cleanup and row version
1457     freezing will potentially conflict with recovery queries. Running a
1458     manual <command>VACUUM FREEZE</> is likely to cause conflicts even on tables
1459     with no updated or deleted rows.
1460    </para>
1461
1462    <para>
1463     There are a number of choices for resolving query conflicts.  The default
1464     is to wait and hope the query finishes. The server will wait
1465     automatically until the lag between primary and standby is at most
1466     <xref linkend="guc-max-standby-delay"> (30 seconds by default).
1467     Once that grace period expires,
1468     one of the following actions is taken:
1469
1470       <itemizedlist>
1471        <listitem>
1472         <para>
1473          If the conflict is caused by a lock, the conflicting standby
1474          transaction is cancelled immediately. If the transaction is
1475          idle-in-transaction, then the session is aborted instead.
1476          This behavior might change in the future.
1477         </para>
1478        </listitem>
1479
1480        <listitem>
1481         <para>
1482          If the conflict is caused by cleanup records, the standby query is informed
1483          a conflict has occurred and that it must cancel itself to avoid the
1484          risk that it silently fails to read relevant data because
1485          that data has been removed.  Some cleanup
1486          records only conflict with older queries, while others
1487          can affect all queries.
1488         </para>
1489
1490         <para>
1491          Cancelled queries may be retried immediately (after beginning a new
1492          transaction, of course).  Since query cancellation depends on
1493          the nature of the WAL records being replayed, a query that was
1494          cancelled may succeed if it is executed again.
1495         </para>
1496        </listitem>
1497       </itemizedlist>
1498    </para>
1499
1500    <para>
1501     Keep in mind that <varname>max_standby_delay</> is compared to the
1502     difference between the standby server's clock and the transaction
1503     commit timestamps read from the WAL log.  Thus, the grace period
1504     allowed to any one query on the standby is never more than
1505     <varname>max_standby_delay</>, and could be considerably less if the
1506     standby has already fallen behind as a result of waiting for previous
1507     queries to complete, or as a result of being unable to keep up with a
1508     heavy update load.
1509    </para>
1510
1511    <caution>
1512     <para>
1513      Be sure that the primary and standby servers' clocks are kept in sync;
1514      otherwise the values compared to <varname>max_standby_delay</> will be
1515      erroneous, possibly leading to additional query cancellations.
1516      If the clocks are intentionally not in sync, or if there is a large
1517      propagation delay from primary to standby, it is advisable to set
1518      <varname>max_standby_delay</> to -1.  In any case the value should be
1519      larger than the largest expected clock skew between primary and standby.
1520     </para>
1521    </caution>
1522
1523    <para>
1524     Users should be clear that tables that are regularly and heavily updated on the
1525     primary server will quickly cause cancellation of longer running queries on
1526     the standby. In those cases <varname>max_standby_delay</> can be
1527     considered similar to setting
1528     <varname>statement_timeout</>.
1529     </para>
1530
1531    <para>
1532     Other remedial actions exist if the number of cancellations is unacceptable.
1533     The first option is to connect to the primary server and keep a query active
1534     for as long as needed to run queries on the standby. This guarantees that
1535     a WAL cleanup record is never generated and query conflicts do not occur,
1536     as described above. This could be done using <filename>contrib/dblink</>
1537     and <function>pg_sleep()</>, or via other mechanisms. If you do this, you
1538     should note that this will delay cleanup of dead rows on the primary by
1539     vacuum or HOT, which may be undesirable. However, remember
1540     that the primary and standby nodes are linked via the WAL, so the cleanup
1541     situation is no different from the case where the query ran on the primary
1542     node itself, and you are still getting the benefit of off-loading the
1543     execution onto the standby. <varname>max_standby_delay</> should
1544     not be used in this case because delayed WAL files might already
1545     contain entries that invalidate the current snapshot.
1546    </para>
1547
1548    <para>
1549     It is also possible to set <varname>vacuum_defer_cleanup_age</> on the primary
1550     to defer the cleanup of records by autovacuum, <command>VACUUM</>
1551     and HOT. This might allow
1552     more time for queries to execute before they are cancelled on the standby,
1553     without the need for setting a high <varname>max_standby_delay</>.
1554    </para>
1555
1556    <para>
1557     Three-way deadlocks are possible between <literal>AccessExclusiveLocks</> arriving from
1558     the primary, cleanup WAL records that require buffer cleanup locks, and
1559     user requests that are waiting behind replayed <literal>AccessExclusiveLocks</>.
1560     Deadlocks are resolved automatically after <varname>deadlock_timeout</>
1561     seconds, though they are thought to be rare in practice.
1562    </para>
1563
1564    <para>
1565     Dropping tablespaces or databases is discussed in the administrator's
1566     section since they are not typical user situations.
1567    </para>
1568   </sect2>
1569
1570   <sect2 id="hot-standby-admin">
1571    <title>Administrator's Overview</title>
1572
1573    <para>
1574     If <varname>hot_standby</> is turned <literal>on</> in
1575     <filename>postgresql.conf</> and there is a <filename>recovery.conf</>
1576     file present, the server will run in Hot Standby mode.
1577     However, it may take some time for Hot Standby connections to be allowed,
1578     because the server will not accept connections until it has completed
1579     sufficient recovery to provide a consistent state against which queries
1580     can run.  During this period,
1581     clients that attempt to connect will be refused with an error message.
1582     To confirm the server has come up, either loop trying to connect from
1583     the application, or look for these messages in the server logs:
1584
1585 <programlisting>
1586 LOG:  entering standby mode
1587
1588 ... then some time later ...
1589
1590 LOG:  consistent recovery state reached
1591 LOG:  database system is ready to accept read only connections
1592 </programlisting>
1593
1594     Consistency information is recorded once per checkpoint on the primary.
1595     It is not possible to enable hot standby when reading WAL
1596     written during a period when <varname>wal_level</> was not set to
1597     <literal>hot_standby</> on the primary.  Reaching a consistent state can
1598     also be delayed in the presence of both of these conditions:
1599
1600       <itemizedlist>
1601        <listitem>
1602         <para>
1603          A write transaction has more than 64 subtransactions
1604         </para>
1605        </listitem>
1606        <listitem>
1607         <para>
1608          Very long-lived write transactions
1609         </para>
1610        </listitem>
1611       </itemizedlist>
1612
1613     If you are running file-based log shipping ("warm standby"), you might need
1614     to wait until the next WAL file arrives, which could be as long as the
1615     <varname>archive_timeout</> setting on the primary.
1616    </para>
1617
1618    <para>
1619     The setting of some parameters on the standby will need reconfiguration
1620     if they have been changed on the primary. For these parameters,
1621     the value on the standby must
1622     be equal to or greater than the value on the primary. If these parameters
1623     are not set high enough then the standby will refuse to start.
1624     Higher values can then be supplied and the server
1625     restarted to begin recovery again.  These parameters are:
1626
1627       <itemizedlist>
1628        <listitem>
1629         <para>
1630          <varname>max_connections</>
1631         </para>
1632        </listitem>
1633        <listitem>
1634         <para>
1635          <varname>max_prepared_transactions</>
1636         </para>
1637        </listitem>
1638        <listitem>
1639         <para>
1640          <varname>max_locks_per_transaction</>
1641         </para>
1642        </listitem>
1643       </itemizedlist>
1644    </para>
1645
1646    <para>
1647     It is important that the administrator consider the appropriate setting
1648     of <varname>max_standby_delay</>,
1649     which can be set in <filename>postgresql.conf</>.
1650     There is no optimal setting, so it should be set according to business
1651     priorities. For example if the server is primarily tasked as a High
1652     Availability server, then you may wish to lower
1653     <varname>max_standby_delay</> or even set it to zero, though that is a
1654     very aggressive setting. If the standby server is tasked as an additional
1655     server for decision support queries then it might be acceptable to set this
1656     to a value of many hours.  It is also possible to set
1657     <varname>max_standby_delay</> to -1 which means wait forever for queries
1658     to complete; this will be useful when performing
1659     an archive recovery from a backup.
1660    </para>
1661
1662    <para>
1663     Transaction status "hint bits" written on the primary are not WAL-logged,
1664     so data on the standby will likely re-write the hints again on the standby.
1665     Thus, the standby server will still perform disk writes even though
1666     all users are read-only; no changes occur to the data values
1667     themselves.  Users will still write large sort temporary files and
1668     re-generate relcache info files, so no part of the database
1669     is truly read-only during hot standby mode.
1670     Note also that writes to remote databases using
1671     <application>dblink</application> module, and other operations outside the
1672     database using PL functions will still be possible, even though the
1673     transaction is read-only locally.
1674    </para>
1675
1676    <para>
1677     The following types of administration commands are not accepted
1678     during recovery mode:
1679
1680       <itemizedlist>
1681        <listitem>
1682         <para>
1683          Data Definition Language (DDL) - e.g. <command>CREATE INDEX</>
1684         </para>
1685        </listitem>
1686        <listitem>
1687         <para>
1688          Privilege and Ownership - <command>GRANT</>, <command>REVOKE</>,
1689          <command>REASSIGN</>
1690         </para>
1691        </listitem>
1692        <listitem>
1693         <para>
1694          Maintenance commands - <command>ANALYZE</>, <command>VACUUM</>,
1695          <command>CLUSTER</>, <command>REINDEX</>
1696         </para>
1697        </listitem>
1698       </itemizedlist>
1699    </para>
1700
1701    <para>
1702     Again, note that some of these commands are actually allowed during
1703     "read only" mode transactions on the primary.
1704    </para>
1705
1706    <para>
1707     As a result, you cannot create additional indexes that exist solely
1708     on the standby, nor statistics that exist solely on the standby.
1709     If these administration commands are needed, they should be executed
1710     on the primary, and eventually those changes will propagate to the
1711     standby.
1712    </para>
1713
1714    <para>
1715     <function>pg_cancel_backend()</> will work on user backends, but not the
1716     Startup process, which performs recovery. <structname>pg_stat_activity</structname> does not
1717     show an entry for the Startup process, nor do recovering transactions
1718     show as active. As a result, <structname>pg_prepared_xacts</structname> is always empty during
1719     recovery. If you wish to resolve in-doubt prepared transactions,
1720     view <literal>pg_prepared_xacts</> on the primary and issue commands to
1721     resolve transactions there.
1722    </para>
1723
1724    <para>
1725     <structname>pg_locks</structname> will show locks held by backends,
1726     as normal. <structname>pg_locks</structname> also shows
1727     a virtual transaction managed by the Startup process that owns all
1728     <literal>AccessExclusiveLocks</> held by transactions being replayed by recovery.
1729     Note that the Startup process does not acquire locks to
1730     make database changes, and thus locks other than <literal>AccessExclusiveLocks</>
1731     do not show in <structname>pg_locks</structname> for the Startup
1732     process; they are just presumed to exist.
1733    </para>
1734
1735    <para>
1736     The <productname>Nagios</> plugin <productname>check_pgsql</> will
1737     work, because the simple information it checks for exists.
1738     The <productname>check_postgres</> monitoring script will also work,
1739     though some reported values could give different or confusing results.
1740     For example, last vacuum time will not be maintained, since no
1741     vacuum occurs on the standby.  Vacuums running on the primary
1742     do still send their changes to the standby.
1743    </para>
1744
1745    <para>
1746     WAL file control commands will not work during recovery,
1747     e.g. <function>pg_start_backup</>, <function>pg_switch_xlog</> etc.
1748    </para>
1749
1750    <para>
1751     Dynamically loadable modules work, including <structname>pg_stat_statements</>.
1752    </para>
1753
1754    <para>
1755     Advisory locks work normally in recovery, including deadlock detection.
1756     Note that advisory locks are never WAL logged, so it is impossible for
1757     an advisory lock on either the primary or the standby to conflict with WAL
1758     replay. Nor is it possible to acquire an advisory lock on the primary
1759     and have it initiate a similar advisory lock on the standby. Advisory
1760     locks relate only to the server on which they are acquired.
1761    </para>
1762
1763    <para>
1764     Trigger-based replication systems such as <productname>Slony</>,
1765     <productname>Londiste</> and <productname>Bucardo</> won't run on the
1766     standby at all, though they will run happily on the primary server as
1767     long as the changes are not sent to standby servers to be applied.
1768     WAL replay is not trigger-based so you cannot relay from the
1769     standby to any system that requires additional database writes or
1770     relies on the use of triggers.
1771    </para>
1772
1773    <para>
1774     New oids cannot be assigned, though some <acronym>UUID</> generators may still
1775     work as long as they do not rely on writing new status to the database.
1776    </para>
1777
1778    <para>
1779     Currently, temporary table creation is not allowed during read only
1780     transactions, so in some cases existing scripts will not run correctly.
1781     This restriction might be relaxed in a later release. This is
1782     both a SQL Standard compliance issue and a technical issue.
1783    </para>
1784
1785    <para>
1786     <command>DROP TABLESPACE</> can only succeed if the tablespace is empty.
1787     Some standby users may be actively using the tablespace via their
1788     <varname>temp_tablespaces</> parameter. If there are temporary files in the
1789     tablespace, all active queries are cancelled to ensure that temporary
1790     files are removed, so the tablespace can be removed and WAL replay
1791     can continue.
1792    </para>
1793
1794    <para>
1795     Running <command>DROP DATABASE</>, <command>ALTER DATABASE ... SET TABLESPACE</>,
1796     or <command>ALTER DATABASE ... RENAME</> on primary will generate a log message
1797     that will cause all users connected to that database on the standby to be
1798     forcibly disconnected. This action occurs immediately, whatever the setting of
1799     <varname>max_standby_delay</>.
1800    </para>
1801
1802    <para>
1803     In normal (non-recovery) mode, if you issue <command>DROP USER</> or <command>DROP ROLE</>
1804     for a role with login capability while that user is still connected then
1805     nothing happens to the connected user - they remain connected. The user cannot
1806     reconnect however. This behavior applies in recovery also, so a
1807     <command>DROP USER</> on the primary does not disconnect that user on the standby.
1808    </para>
1809
1810    <para>
1811     The statistics collector is active during recovery. All scans, reads, blocks,
1812     index usage, etc., will be recorded normally on the standby. Replayed
1813     actions will not duplicate their effects on primary, so replaying an
1814     insert will not increment the Inserts column of pg_stat_user_tables.
1815     The stats file is deleted at the start of recovery, so stats from primary
1816     and standby will differ; this is considered a feature, not a bug.
1817    </para>
1818
1819    <para>
1820     Autovacuum is not active during recovery, it will start normally at the
1821     end of recovery.
1822    </para>
1823
1824    <para>
1825     The background writer is active during recovery and will perform
1826     restartpoints (similar to checkpoints on the primary) and normal block
1827     cleaning activities. This can include updates of the hint bit
1828     information stored on the standby server.
1829     The <command>CHECKPOINT</> command is accepted during recovery,
1830     though it performs a restartpoint rather than a new checkpoint.
1831    </para>
1832   </sect2>
1833
1834   <sect2 id="hot-standby-parameters">
1835    <title>Hot Standby Parameter Reference</title>
1836
1837    <para>
1838     Various parameters have been mentioned above in
1839     <xref linkend="hot-standby-admin">
1840     and <xref linkend="hot-standby-conflict">.
1841    </para>
1842
1843    <para>
1844     On the primary, parameters <xref linkend="guc-wal-level"> and
1845     <xref linkend="guc-vacuum-defer-cleanup-age"> can be used.
1846     <xref linkend="guc-max-standby-delay"> has no effect if set on the primary.
1847    </para>
1848
1849    <para>
1850     On the standby, parameters <xref linkend="guc-hot-standby"> and
1851     <xref linkend="guc-max-standby-delay"> can be used.
1852     <xref linkend="guc-vacuum-defer-cleanup-age"> has no effect during
1853     recovery.
1854    </para>
1855   </sect2>
1856
1857   <sect2 id="hot-standby-caveats">
1858    <title>Caveats</title>
1859
1860    <para>
1861     There are several limitations of Hot Standby.
1862     These can and probably will be fixed in future releases:
1863
1864   <itemizedlist>
1865    <listitem>
1866     <para>
1867      Operations on hash indexes are not presently WAL-logged, so
1868      replay will not update these indexes.  Hash indexes will not be
1869      used for query plans during recovery.
1870     </para>
1871    </listitem>
1872    <listitem>
1873     <para>
1874      Full knowledge of running transactions is required before snapshots
1875      can be taken. Transactions that use large numbers of subtransactions
1876      (currently greater than 64) will delay the start of read only
1877      connections until the completion of the longest running write transaction.
1878      If this situation occurs, explanatory messages will be sent to the server log.
1879     </para>
1880    </listitem>
1881    <listitem>
1882     <para>
1883      Valid starting points for standby queries are generated at each
1884      checkpoint on the master. If the standby is shut down while the master
1885      is in a shutdown state, it might not be possible to re-enter Hot Standby
1886      until the primary is started up, so that it generates further starting
1887      points in the WAL logs.  This situation isn't a problem in the most
1888      common situations where it might happen. Generally, if the primary is
1889      shut down and not available anymore, that's likely due to a serious
1890      failure that requires the standby being converted to operate as
1891      the new primary anyway.  And in situations where the primary is
1892      being intentionally taken down, coordinating to make sure the standby
1893      becomes the new primary smoothly is also standard procedure.
1894     </para>
1895    </listitem>
1896    <listitem>
1897     <para>
1898      At the end of recovery, <literal>AccessExclusiveLocks</> held by prepared transactions
1899      will require twice the normal number of lock table entries. If you plan
1900      on running either a large number of concurrent prepared transactions
1901      that normally take <literal>AccessExclusiveLocks</>, or you plan on having one
1902      large transaction that takes many <literal>AccessExclusiveLocks</>, you are
1903      advised to select a larger value of <varname>max_locks_per_transaction</>,
1904      perhaps as much as twice the value of the parameter on
1905      the primary server. You need not consider this at all if
1906      your setting of <varname>max_prepared_transactions</> is <literal>0</>.
1907     </para>
1908    </listitem>
1909   </itemizedlist>
1910
1911    </para>
1912   </sect2>
1913
1914  </sect1>
1915
1916   <sect1 id="backup-incremental-updated">
1917    <title>Incrementally Updated Backups</title>
1918
1919   <indexterm zone="high-availability">
1920    <primary>incrementally updated backups</primary>
1921   </indexterm>
1922
1923   <indexterm zone="high-availability">
1924    <primary>change accumulation</primary>
1925   </indexterm>
1926
1927    <para>
1928     In a standby configuration, it is possible to offload the expense of
1929     taking periodic base backups from the primary server; instead base backups
1930     can be made by backing
1931     up a standby server's files.  This concept is generally known as
1932     incrementally updated backups, log change accumulation, or more simply,
1933     change accumulation.
1934    </para>
1935
1936    <para>
1937     If we take a file system backup of the standby server's data
1938     directory while it is processing
1939     logs shipped from the primary, we will be able to reload that backup and
1940     restart the standby's recovery process from the last restart point.
1941     We no longer need to keep WAL files from before the standby's restart point.
1942     If recovery is needed, it will be faster to recover from the incrementally
1943     updated backup than from the original base backup.
1944    </para>
1945
1946    <para>
1947     The procedure for taking a file system backup of the standby server's
1948     data directory while it's processing logs shipped from the primary is:
1949    <orderedlist>
1950     <listitem>
1951      <para>
1952       Perform the backup, without using <function>pg_start_backup</> and
1953       <function>pg_stop_backup</>. Note that the <filename>pg_control</>
1954       file must be backed up <emphasis>first</>, as in:
1955 <programlisting>
1956 cp /var/lib/pgsql/data/global/pg_control /tmp
1957 cp -r /var/lib/pgsql/data /path/to/backup
1958 mv /tmp/pg_control /path/to/backup/data/global
1959 </programlisting>
1960       <filename>pg_control</> contains the location where WAL replay will
1961       begin after restoring from the backup; backing it up first ensures
1962       that it points to the last restartpoint when the backup started, not
1963       some later restartpoint that happened while files were copied to the
1964       backup.
1965      </para>
1966     </listitem>
1967     <listitem>
1968      <para>
1969       Make note of the backup ending WAL location by calling the <function>
1970       pg_last_xlog_replay_location</> function at the end of the backup,
1971       and keep it with the backup.
1972 <programlisting>
1973 psql -c "select pg_last_xlog_replay_location();" > /path/to/backup/end_location
1974 </programlisting>
1975       When recovering from the incrementally updated backup, the server
1976       can begin accepting connections and complete the recovery successfully
1977       before the database has become consistent. To avoid that, you must
1978       ensure the database is consistent before users try to connect to the
1979       server and when the recovery ends. You can do that by comparing the
1980       progress of the recovery with the stored backup ending WAL location:
1981       the server is not consistent until recovery has reached the backup end
1982       location. The progress of the recovery can also be observed with the
1983       <function>pg_last_xlog_replay_location</> function, but that required
1984       connecting to the server while it might not be consistent yet, so
1985       care should be taken with that method.
1986      </para>
1987      <para>
1988      </para>
1989     </listitem>
1990    </orderedlist>
1991    </para>
1992
1993    <para>
1994     Since the standby server is not <quote>live</>, it is not possible to
1995     use <function>pg_start_backup()</> and <function>pg_stop_backup()</>
1996     to manage the backup process; it will be up to you to determine how
1997     far back you need to keep WAL segment files to have a recoverable
1998     backup. That is determined by the last restartpoint when the backup
1999     was taken, any WAL older than that can be deleted from the archive
2000     once the backup is complete. You can determine the last restartpoint
2001     by running <application>pg_controldata</> on the standby server before
2002     taking the backup, or by using the <varname>log_checkpoints</> option
2003     to print values to the standby's server log.
2004    </para>
2005   </sect1>
2006
2007 </chapter>