Tdb is a hashtable database with support for multiple concurrent writers and
external record locks. For speed, tdb uses a shared memory mapped area for
data access wherever possible. In its currently released form, it uses fcntl
byte-range locks to coordinate access to the data itself.

The tdb data is organized as a hashtable. Hash collisions are dealt with by
forming a linked list of records that share a hash value. The individual
linked lists are protected across processes with 1-byte fcntl locks on the
starting pointer of the linked list representing a hash value.

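As an illustration, such a 1-byte chain lock can be taken with fcntl roughly
as sketched below; the function name and the way the chain head offset is
obtained are placeholders, not tdb's actual file layout.

  /* Illustrative sketch only: take the 1-byte fcntl lock that stands for
   * one hash chain. */
  #include <sys/types.h>
  #include <fcntl.h>
  #include <unistd.h>

  static int chain_byte_lock(int fd, off_t chain_head_offset)
  {
      struct flock fl = {
          .l_type   = F_WRLCK,            /* exclusive lock on this chain */
          .l_whence = SEEK_SET,
          .l_start  = chain_head_offset,  /* offset of the chain's start pointer */
          .l_len    = 1,                  /* 1-byte byte-range lock */
      };

      return fcntl(fd, F_SETLKW, &fl);    /* block until the byte is ours */
  }
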
The external locking API of tdb allows one to lock individual records. Instead
of really locking individual records, the tdb API locks the complete linked
list with a fcntl lock.

The external locking API of tdb also allows one to lock the complete database,
and ctdb uses this facility to freeze databases during a recovery. While the
so-called allrecord lock is held, all linked lists and all individual records
are frozen altogether. Tdb achieves this by locking the complete file range
with a single fcntl lock. Individual 1-byte locks for the linked lists
conflict with this. Access to records is prevented by the one large fcntl
byte-range lock.

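The allrecord lock then corresponds to a single fcntl lock over the whole
file range. A sketch, with the same caveat that offsets are illustrative
only:

  #include <fcntl.h>
  #include <unistd.h>

  static int allrecord_byte_lock(int fd)
  {
      struct flock fl = {
          .l_type   = F_WRLCK,
          .l_whence = SEEK_SET,
          .l_start  = 0,
          .l_len    = 0,    /* 0 = to end of file, i.e. the whole range */
      };

      /* conflicts with every 1-byte chain lock in the file */
      return fcntl(fd, F_SETLKW, &fl);
  }
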
Fcntl locks have been chosen for tdb for two reasons: First, they are portable
across all current Unixes. Second, they provide auto-cleanup: if a process
dies while holding a fcntl lock, the lock is given up as if it had been
explicitly unlocked. Thus fcntl locks provide a very robust locking scheme;
if a process dies for any reason, the database will not stay blocked until
reboot. This robustness is very important for long-running services, as a
reboot is not an option for most users of tdb.

Unfortunately, during stress testing, fcntl locks have turned out to be a major
problem for performance. The particular problem that was seen happens when
ctdb on a busy server does a recovery. A recovery means that ctdb has to
freeze all tdb databases for some time, usually a few seconds. This is done
with the allrecord lock. During the recovery phase on a busy server, many smbd
processes try to access the tdb file with blocking fcntl calls. The specific
test in question easily reproduces 7,000 processes piling up waiting for
1-byte fcntl locks. When ctdb is done with the recovery, it gives up the
allrecord lock, covering the whole file range. All 7,000 processes waiting for
1-byte fcntl locks are woken up, trying to acquire their lock. The
implementation of fcntl locks in Linux (up to 2013-02-12 at least) protects
all fcntl lock operations with a single system-wide spinlock. When 7,000
processes are waiting for the allrecord lock to be released, this leads to a
thundering herd condition: all CPUs are spinning on that single spinlock.

Functionally the kernel is fine: eventually the thundering herd slows down and
every process correctly gets its share and locking range, but the performance
of the system while the herd is active is worse than expected.

The thundering herd is only the worst-case scenario for fcntl lock use. The
single spinlock for fcntl operations is also a performance penalty for normal
operations. In the cluster case, every read and write SMB request has to do
two fcntl calls to provide correct SMB mandatory locks. The single spinlock
is one source of serialization for the SMB read/write requests, limiting the
parallelism that can be achieved in a multi-core system.

While trying to tune his servers, Ira Cooper, Samba Team member, found fcntl
locks to be a problem on Solaris as well. Ira pointed out that there is a
potential alternative locking mechanism that might be more scalable:
process-shared robust mutexes, as defined by Posix 2008, for example via

http://pubs.opengroup.org/onlinepubs/9699919799/functions/pthread_mutexattr_setpshared.html
http://pubs.opengroup.org/onlinepubs/9699919799/functions/pthread_mutexattr_setrobust.html

Pthread mutexes provide one of the core mechanisms in Posix threads to protect
in-process data structures from concurrent access by multiple threads. In the
Linux implementation, a pthread_mutex_t is represented by a data structure in
user space that requires no kernel calls in the uncontended case for locking
and unlocking. Locking and unlocking in the uncontended case are implemented
purely in user space with atomic CPU instructions and are thus very fast.

The setpshared functions indicate to the kernel that the mutex is about to be
shared between processes in a common shared memory area.

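A minimal sketch of initializing such a process-shared robust mutex inside a
shared memory area, using the two attribute calls referenced above. The
function name is illustrative and error checking is abbreviated.

  #include <pthread.h>

  static int shared_robust_mutex_init(pthread_mutex_t *m)
  {
      pthread_mutexattr_t attr;
      int ret;

      ret = pthread_mutexattr_init(&attr);
      if (ret != 0) {
          return ret;
      }
      /* the mutex lives in memory mapped by several processes */
      pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
      /* a dead owner must not block the mutex forever */
      pthread_mutexattr_setrobust(&attr, PTHREAD_MUTEX_ROBUST);

      ret = pthread_mutex_init(m, &attr);
      pthread_mutexattr_destroy(&attr);
      return ret;
  }

The pointer passed in would point into the shared mmapped area, so that all
processes mapping the file operate on the same mutex.
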
Process-shared Posix mutexes have the potential to replace fcntl locking to
coordinate mmap access for tdbs. However, they are missing the critical
auto-cleanup property that fcntl provides when a process dies. A process that
dies hard while holding a shared mutex has no chance to clean up the protected
data structures and unlock the shared mutex. Thus, with a pure process-shared
mutex, the mutex will remain locked forever until the data structures are
re-initialized from scratch.

With the robust mutexes defined by Posix, the process-shared mutexes have been
extended with a limited auto-cleanup property. If a mutex has been declared
robust, when a process exits while holding that mutex, the next process trying
to lock the mutex will get the special error code EOWNERDEAD. This informs
the caller that the data structures the mutex protects are potentially corrupt
and need to be cleaned up.

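A sketch of the caller's side: on EOWNERDEAD the data is repaired and the
mutex is marked usable again with pthread_mutex_consistent().
recover_protected_data() is a placeholder for whatever repair the
application needs.

  #include <errno.h>
  #include <pthread.h>

  void recover_protected_data(void);    /* placeholder for application repair */

  static int robust_lock(pthread_mutex_t *m)
  {
      int ret = pthread_mutex_lock(m);

      if (ret == EOWNERDEAD) {
          /* The previous owner died while holding the mutex. We now own
           * it, but the protected data may be inconsistent. */
          recover_protected_data();
          ret = pthread_mutex_consistent(m);
      }
      return ret;
  }
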
The EOWNERDEAD error code when trying to lock a mutex is an extension over
the fcntl functionality. A process that does a blocking fcntl lock call is not
informed about whether the lock was explicitly freed by a process still alive
or released due to an unplanned process exit. At the time of this writing
(February 2013), at least Linux and OpenSolaris also implement the robustness
feature of process-shared mutexes.

Converting the tdb locking mechanism from fcntl to mutexes has to take care of
both types of locks that are used on tdb files.

The easy part is to use mutexes to replace the 1-byte linked list locks
covering the individual hashes. Those can be represented by a mutex each.

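One conceivable layout, purely illustrative and not tdb's actual format, is
an array of process-shared robust mutexes kept in the shared memory area,
one slot per hash chain:

  #include <pthread.h>

  #define NUM_HASHES 131    /* illustrative hash table size */

  struct tdb_hash_mutexes {
      /* one mutex per hash chain, replacing the 1-byte fcntl locks */
      pthread_mutex_t chain_mutex[NUM_HASHES];
  };

Each slot would be initialized with the process-shared and robust attributes
shown earlier; locking hash chain h then becomes a pthread_mutex_lock() on
chain_mutex[h].
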
Covering the allrecord lock is more difficult. The allrecord lock uses a fcntl
lock spanning all hash list locks simultaneously. This basic functionality is
not easily possible with mutexes: a mutex carries 1 bit of information, while
a fcntl lock can carry an arbitrary amount of information.

In order to support the allrecord lock, we have an allrecord_lock variable
protected by an allrecord_mutex. The coordination between the allrecord lock
and the chain locks works like this:

- Getting a chain lock works like this:

  1. get the chain mutex
  2. return success if allrecord_lock is F_UNLCK (not locked)
  3. return success if allrecord_lock is F_RDLCK (locked readonly)
     and we only need a read lock.
  4. release chain mutex
  5. wait for allrecord_mutex
  6. unlock allrecord_mutex
  7. goto 1.

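Expressed in C, the procedure could look roughly like the sketch below. The
struct layout, the allrecord_state enum and NUM_HASHES are assumptions made
for illustration only; EOWNERDEAD handling is left out (see the robust_lock
sketch above). Step 5 only makes sense if a holder of the allrecord lock
keeps the allrecord_mutex locked until it drops the allrecord lock, and the
sketches here assume exactly that.

  #include <pthread.h>

  #define NUM_HASHES 131    /* illustrative */

  enum allrecord_state { ALLRECORD_UNLOCKED, ALLRECORD_RDLCK, ALLRECORD_WRLCK };

  struct tdb_mutexes {
      pthread_mutex_t allrecord_mutex;        /* protects allrecord_lock */
      enum allrecord_state allrecord_lock;    /* F_UNLCK/F_RDLCK/F_WRLCK analogue */
      pthread_mutex_t chain_mutex[NUM_HASHES];
  };

  static int chain_lock(struct tdb_mutexes *m, unsigned int chain, int read_only)
  {
      for (;;) {
          int ret = pthread_mutex_lock(&m->chain_mutex[chain]);    /* 1. */
          if (ret != 0) {
              return ret;
          }
          if (m->allrecord_lock == ALLRECORD_UNLOCKED) {           /* 2. */
              return 0;
          }
          if (m->allrecord_lock == ALLRECORD_RDLCK && read_only) { /* 3. */
              return 0;
          }
          pthread_mutex_unlock(&m->chain_mutex[chain]);            /* 4. */
          ret = pthread_mutex_lock(&m->allrecord_mutex);           /* 5. */
          if (ret != 0) {
              return ret;
          }
          pthread_mutex_unlock(&m->allrecord_mutex);               /* 6. */
          /* 7. retry from the top */
      }
  }
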
- Getting the allrecord lock:

  1. get the allrecord mutex
  2. return error if allrecord_lock is not F_UNLCK (it's locked)
  3. set allrecord_lock to the desired value.
  4. in a loop: lock(blocking) / unlock each chain mutex.

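Continuing the same sketch (types as above, plus <errno.h> for EBUSY). As
noted, the allrecord_mutex is assumed to stay locked for as long as the
allrecord lock itself is held.

  #include <errno.h>

  static int allrecord_lock_take(struct tdb_mutexes *m, enum allrecord_state how)
  {
      unsigned int i;
      int ret;

      ret = pthread_mutex_lock(&m->allrecord_mutex);    /* 1. */
      if (ret != 0) {
          return ret;
      }
      if (m->allrecord_lock != ALLRECORD_UNLOCKED) {    /* 2. */
          pthread_mutex_unlock(&m->allrecord_mutex);
          return EBUSY;
      }
      m->allrecord_lock = how;                          /* 3. */

      /* 4. cycle every chain mutex: this waits for every process that
       * still holds a chain lock taken before allrecord_lock was set */
      for (i = 0; i < NUM_HASHES; i++) {
          pthread_mutex_lock(&m->chain_mutex[i]);
          pthread_mutex_unlock(&m->chain_mutex[i]);
      }

      /* the allrecord_mutex stays locked until the allrecord lock is
       * released again (reset allrecord_lock, unlock allrecord_mutex) */
      return 0;
  }
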
- allrecord lock upgrade:

  1. check we already have the allrecord lock with F_RDLCK.
  2. set allrecord_lock to F_WRLCK
  3. in a loop: lock(blocking) / unlock each chain mutex.

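And the upgrade, again continuing the sketch; the caller already holds the
allrecord lock, and with it the allrecord_mutex, in read mode.

  static int allrecord_lock_upgrade(struct tdb_mutexes *m)
  {
      unsigned int i;

      if (m->allrecord_lock != ALLRECORD_RDLCK) {    /* 1. */
          return EINVAL;
      }
      m->allrecord_lock = ALLRECORD_WRLCK;           /* 2. */

      /* 3. wait for all readers that still hold a chain lock */
      for (i = 0; i < NUM_HASHES; i++) {
          pthread_mutex_lock(&m->chain_mutex[i]);
          pthread_mutex_unlock(&m->chain_mutex[i]);
      }
      return 0;
  }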