xapian-core/docs/replication.rst

   1 .. Copyright (C) 2008 Lemur Consulting Ltd
   2 .. Copyright (C) 2008,2010,2011,2012 Olly Betts
   3
   4 =======================================
   5 Xapian Database Replication Users Guide
   6 =======================================
   7
   8 .. contents:: Table of contents
   9
  10 Introduction
  11 ============
  12
  13 It is often desirable to maintain multiple copies of a Xapian database, having
  14 a "master" database which modifications are made on, and a set of secondary
  15 (read-only, "slave") databases which these modifications propagate to.  For
  16 example, to support a high query load there may be many search servers, each
  17 with a local copy of the database, and a single indexing server.  In order to
  18 allow scaling to a large number of search servers, with large databases and
  19 frequent updates, we need an database replication implementation to have the
  20 following characteristics:
  21
  22  - Data transfer is (at most) proportional to the size of the updates, rather
  23    than the size of the database, to allow frequent small updates to large
  24    databases to be replicated efficiently.
  25
  26  - Searching (on the slave databases) and indexing (on the master database) can
  27    continue during synchronisation.
  28
  29  - Data cached (in memory) on the slave databases is not discarded (unless it's
  30    actually out of date) as updates arrive, to ensure that searches continue to
  31    be performed quickly during and after updates.
  32
  33  - Synchronising each slave database involves low overhead (both IO and CPU) on
  34    the server holding the master database, so that many slaves can be updated
  35    from a single master without overloading it.
  36
  37  - Database synchronisation can be recovered after network outages or server
  38    failures without manual intervention and without excessive data transfer.
  39
  40 The database replication protocol is intended to support replicating a single
  41 writable database to multiple (read-only) search servers, while satisfying all
  42 of the above properties.  It is not intended to support replication of multiple
  43 writable databases - there must always be a single master database to which all
  44 modifications are made.
  45
  46 This document gives an overview of how and why to use the replication protocol.
  47 For technical details of the implementation of the replication protocol, see
  48 the separate document `net/replication_protocol.rst` in the xapian-core
  49 source tree.
  50
  51 Backend Support
  52 ===============
  53
  54 Replication is supported by the glass and chert database backends,
  55 and can cleanly handle the
  56 master switching database type (a full copy is sent in this situation).  It
  57 doesn't make a lot of sense to support replication for the remote backend.
  58 Replication of inmemory databases isn't currently available.  We have a longer
  59 term aim to replace the current inmemory backend with the current disk based
  60 backend (e.g. glass) but storing its data in memory.  Once this is done, it
  61 would probably be easy to support replication of inmemory databases.
  62
  63 Setting up replicated databases
  64 ===============================
  65
  66 .. FIXME - expand this section.
  67
  68 To replicate a database efficiently from one master machine to other machines,
  69 there is one configuration step to be performed on the master machine, and two
  70 servers to run.
  71
  72 Firstly, on the master machine, the indexer must be run with the environment
  73 variable `XAPIAN_MAX_CHANGESETS` set to a non-zero value, which will cause
  74 changeset files to be created whenever a transaction is committed.  A
  75 changeset file allows the transaction to be replayed efficiently on a replica
  76 of the database.
  77
  78 The value which `XAPIAN_MAX_CHANGESETS` is set to determines the maximum number
  79 of changeset files which will be kept.  The best number to keep depends on how
  80 frequently you run replication and how big your transactions are - if all
  81 the changeset files needed to update a replica aren't present, a full copy of
  82 the database will be sent, but at some point that becomes more efficient
  83 anyway.  `10` is probably a good value to start with.
  84
  85 Secondly, also on the master machine, run the `xapian-replicate-server` server
  86 to serve the databases which are to be replicated.  This takes various
  87 parameters to control the directory that databases are found in, and the
  88 network interface to serve on.  The `--help` option will cause usage
  89 information to be displayed.  For example, if `/var/search/dbs`` contains a
  90 set of Xapian databases to be replicated::
  91
  92   xapian-replicate-server /var/search/dbs -p 7010
  93
  94 would run a server allowing access to these databases, on port 7010.
  95
  96 Finally, on the client machine, run the `xapian-replicate` server to keep an
  97 individual database up-to-date.  This will contact the server on the specified
  98 host and port, and copy the database with the name (on the master) specified in
  99 the `-m` option to the client.  One non-option argument is required - this is
 100 the name that the database should be stored in on the slave machine.  For
 101 example, contacting the above server from the same machine::
 102
 103   xapian-replicate -h 127.0.0.1 -p 7010 -m foo foo2
 104
 105 would produce a database "foo2" containing a replica of the database
 106 "/var/search/dbs/foo".  Note that the first time you run this, this command
 107 will create the foo2 directory and populate it with appropriate files; you
 108 should not create this directory yourself.
 109
 110 As of 1.2.5, if you don't specify the master name, the same name is used
 111 remotely and locally, so this will replicate remote database "foo2" to
 112 local database "foo2"::
 113
 114   xapian-replicate -h 127.0.0.1 -p 7010 foo2
 115
 116 Both the server and client can be run in "one-shot" mode, by passing `-o`.
 117 This may be particularly useful for the client, to allow a shell script to be
 118 used to cycle through a set of databases, updating each in turn (and then
 119 probably sleeping for a period).
 120
 121 Limitations
 122 ===========
 123
 124 It is possible to confuse the replication system in some cases, such that an
 125 invalid database will be produced on the client.  However, this is easy to
 126 avoid in practice.
 127
 128 To confuse the replication system, the following needs to happen:
 129
 130  - Start with two databases, A and B.
 131  - Start a replication of database A.
 132  - While the replication is in progress, swap B in place of A (ie, by moving
 133    the files around, such that B is now at the path of A).
 134  - While the replication is still in progress, swap A back in place of B.
 135
 136 If this happens, the replication process will not detect the change in
 137 database, and you are likely to end up with a database on the client which
 138 contains parts of A and B mixed together.  You will need to delete the damaged
 139 database on the client, and re-run the replication.
 140
 141 To avoid this, simply avoid swapping a database back in place of another one.
 142 Or at least, if you must do this, wait until any replications in progress when
 143 you were using the original database have finished.
 144
 145 Calling reopen
 146 --------------
 147
 148 `Database::reopen()` is usually an efficient way to ensure that a database is
 149 up-to-date with the latest changes.  Unfortunately, it does not currently work
 150 as you might expect with databases which are being updated by the replication
 151 client.  The workaround is simple; don't use the reopen() method on such
 152 databases: instead, you should close the database and open it
 153 again from scratch.
 154
 155 Briefly, the issue is that the databases created by the replication client are
 156 created in a subdirectory of the target path supplied to the client, rather
 157 than at that path.  A "stub database" file is then created in that directory,
 158 pointing to the database.  This allows the database which readers open to be
 159 switched atomically after a database copy has occurred.  The reopen() method
 160 doesn't re-read the stub database file in this situation, so ends up
 161 attempting to read the old database which has been deleted.
 162
 163 We intend to fix this issue in a future backend.
 164
 165 Alternative approaches
 166 ======================
 167
 168 Without using the database replication protocol, there are various ways in
 169 which the "single master, multiple slaves" setup could be implemented.
 170
 171  - Copy database from master to all slaves after each update, then swap the new
 172    database for the old.
 173
 174  - Synchronise databases from the master to the slaves using rsync.
 175
 176  - Keep copy of database on master from before each update, and use a binary
 177    diff algorithm (e.g., xdelta) to calculate the changes, and then apply these
 178    same changes to the databases on each slave.
 179
 180  - Serve database from master to slaves over NFS (or other remote file system).
 181
 182  - Use the "remote database backend" facility of Xapian to allow slave servers
 183    to search the database directly on the master.
 184
 185 All of these could be made to work but have various drawbacks, and fail to
 186 satisfy all the desired characteristics.  Let's examine them in detail:
 187
 188 Copying database after each update
 189 ----------------------------------
 190
 191 Databases could be pushed to the slaves after each update simply by copying the
 192 entire database from the master (using scp, ftp, http or one of the many other
 193 transfer options).  After the copy is completed, the new database would be made
 194 live by indirecting access through a stub database and switching what it points to.
 195
 196 After a sufficient interval to allow searches in progress on the old database to
 197 complete, the old database would be removed.  (On UNIX filesystems, the old
 198 database could be unlinked immediately, and the resources used by it would be
 199 automatically freed as soon as the current searches using it complete.)
 200
 201 This approach has the advantage of simplicity, and also ensures that the
 202 databases can be correctly re-synchronised after network outages or hardware
 203 failure.
 204
 205 However, this approach would involve copying a large amount of data for each
 206 update, however small the update was.  Also, because the search server would
 207 have to switch to access new files each time an update was pushed, the search
 208 server will be likely to experience poor performance due to commonly accessed
 209 pages falling out of the disk cache during the update.  In particular, although
 210 some of the newly pushed data would be likely to be in the cache immediately
 211 after the update, if the combination of the old and new database sizes exceeds
 212 the size of the memory available on the search servers for caching, either some
 213 of the live database will be dropped from the cache resulting in poor
 214 performance during the update, or some of the new database will not initially
 215 be present in the cache after update.
 216
 217 Synchronise database using rsync
 218 --------------------------------
 219
 220 Rsync works by calculating hashes for the content on the client and the server,
 221 sending the hashes from the client to the server, and then calculating (on the
 222 server) which pieces of the file need to be sent to update the client.  This
 223 results in a fairly low amount of network traffic, but puts a fairly high CPU
 224 load on the server.  This would result in a large load being placed on the
 225 master server if a large number of slaves tried to synchronise with it.
 226
 227 Also, rsync will not reliably update the database in a manner which allows the
 228 database on a slave to be searched while being updated - therefore, a copy or
 229 snapshot of the database would need to be taken first to allow searches to
 230 continue (accessing the copy) while the database is being synchronised.
 231
 232 If a copy is used, the caching problems discussed in the previous section would
 233 apply again.  If a snapshotting filesystem is used, it may be possible to take
 234 a read-only snapshot copy cheaply (and without encountering poor caching
 235 behaviour), but filesystems with support for this are not always available, and
 236 may require considerable effort to set up even if they are available.
 237
 238 Use a binary diff algorithm
 239 ---------------------------
 240
 241 If a copy of the database on the master before the update was kept, a binary
 242 diff algorithm (such as "xdelta") could be used to compare the old and new
 243 versions of the database.  This would produce a patch file which could be
 244 transferred to the slaves, and then applied - avoiding the need for specific
 245 calculations to be performed for each slave.
 246
 247 However, this requires a copy or snapshot to be taken on the master - which has
 248 the same problems as previously discussed.  A copy or snapshot would also need
 249 to be taken on the slave, since a patch from xdelta couldn't safely be applied
 250 to a live database.
 251
 252 Serve database from master to slaves over NFS
 253 ---------------------------------------------
 254
 255 NFS allows a section of a filesystem to be exported to a remote host.  Xapian
 256 is quite capable of searching a database which is exported in such a manner,
 257 and thus NFS can be used to quickly and easily share a database from the master
 258 to multiple slaves.
 259
 260 A reasonable setup might be to use a powerful machine with a fast disk as the
 261 master, and use that same machine as an NFS server.  Then, multiple slaves can
 262 connect to that NFS server for searching the database. This setup is quite
 263 convenient, because it separates the indexing workload from the search workload
 264 to a reasonable extent, but may lead to performance problems.
 265
 266 There are two main problems which are likely to be encountered.  Firstly, in
 267 order to work efficiently, NFS clients (or the OS filesystem layer above NFS)
 268 cache information read from the remote file system in memory.  If there is
 269 insufficient memory available to cache the whole database in memory, searches
 270 will occasionally need to access parts of the database which are held only on
 271 the master server.  Such searches will take a long time to complete, because
 272 the round-trip time for an access to a disk block on the master is typically a
 273 lot slower than the round-trip time for access to a local disk.  Additionally,
 274 if the local network experiences problems, or the master server fails (or gets
 275 overloaded due to all the search requests), the searches will be unable to be
 276 completed.
 277
 278 Also, when a file is modified, the NFS protocol has no way of indicating that
 279 only a small set of blocks in the file have been modified.  The caching is all
 280 implemented by NFS clients, which can do little other than check the file
 281 modification time periodically, and invalidate all cached blocks for the file
 282 if the modification time has changed. For the Linux client, the time between
 283 checks can be configured by setting the acregmin and acregmax mount options,
 284 but whatever these are set to, the whole file will be dropped from the cache
 285 when any modification is found.
 286
 287 This means that, after every update to the database on the master, searches on
 288 the slaves will have to fetch all the blocks required for their search across
 289 the network, which will likely result in extremely slow search times until the
 290 cache on the slaves gets populated properly again.
 291
 292 Use the "remote database backend" facility
 293 ------------------------------------------
 294
 295 Xapian has supported a "remote" database backend since the very early days of
 296 the project.  This allows a search to be run against a database on a remote
 297 machine, which may seem to be exactly what we want.  However, the "remote"
 298 database backend works by performing most of the work for a search on the
 299 remote end - in the situation we're concerned with, this would mean that most
 300 of the work was performed on the master, while slaves remain largely idle.
 301
 302 The "remote" database backend is intended to allow a large database to be
 303 split, at the document level, between multiple hosts.  This allows systems to
 304 be built which search a very large database with some degree of parallelism
 305 (and thus provide faster individual searches than a system searching a single
 306 database locally).  In contrast, the database replication protocol is intended
 307 to allow a database to be copied to multiple machines to support a high
 308 concurrent search load (and thus to allow a higher throughput of searches).
 309
 310 In some cases (i.e., a very large database and a high concurrent search load)
 311 it may be perfectly reasonable to use both the database replication protocol in
 312 conjunction with the "remote" database backend to get both of these advantages
 313 - the two systems solve different problems.