9 date 2011.03.01.11.46.54; author rusty; state Exp;
14 date 2010.12.01.12.20.49; author rusty; state Exp;
19 date 2010.12.01.11.55.20; author rusty; state Exp;
24 date 2010.09.14.00.33.57; author rusty; state Exp;
29 date 2010.09.09.07.25.12; author rusty; state Exp;
34 date 2010.09.02.02.29.05; author rusty; state Exp;
39 date 2010.09.01.10.58.12; author rusty; state Exp;
44 date 2010.08.02.00.21.43; author rusty; state Exp;
49 date 2010.08.02.00.21.16; author rusty; state Exp;
54 date 2010.05.10.13.09.11; author rusty; state Exp;
59 date 2010.05.10.11.58.37; author rusty; state Exp;
64 date 2010.05.10.05.35.13; author rusty; state Exp;
69 date 2010.05.04.02.29.16; author rusty; state Exp;
84 @#LyX 1.6.7 created this file. For more info see http://www.lyx.org/
89 \use_default_options true
94 \font_typewriter default
95 \font_default_family default
102 \paperfontsize default
110 \paperorientation portrait
113 \paragraph_separation indent
115 \quotes_language english
118 \paperpagestyle default
119 \tracking_changes true
121 \author "Rusty Russell,,,"
128 TDB2: A Redesigning The Trivial DataBase
132 Rusty Russell, IBM Corporation
139 \begin_layout Abstract
140 The Trivial DataBase on-disk format is 32 bits; with usage cases heading
141 towards the 4G limit, that must change.
142 This required breakage provides an opportunity to revisit TDB's other design
143 decisions and reassess them.
146 \begin_layout Section
150 \begin_layout Standard
151 The Trivial DataBase was originally written by Andrew Tridgell as a simple
152 key/data pair storage system with the same API as dbm, but allowing multiple
153 readers and writers while being small enough (< 1000 lines of C) to include
155 The simple design created in 1999 has proven surprisingly robust and performant
156 , used in Samba versions 3 and 4 as well as numerous other projects.
157 Its useful life was greatly increased by the (backwards-compatible!) addition
158 of transaction support in 2005.
161 \begin_layout Standard
162 The wider variety and greater demands of TDB-using code has lead to some
163 organic growth of the API, as well as some compromises on the implementation.
164 None of these, by themselves, are seen as show-stoppers, but the cumulative
165 effect is to a loss of elegance over the initial, simple TDB implementation.
166 Here is a table of the approximate number of lines of implementation code
167 and number of API functions at the end of each year:
170 \begin_layout Standard
172 <lyxtabular version="3" rows="12" columns="3">
174 <column alignment="center" valignment="top" width="0">
175 <column alignment="center" valignment="top" width="0">
176 <column alignment="center" valignment="top" width="0">
178 <cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
181 \begin_layout Plain Layout
187 <cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
190 \begin_layout Plain Layout
196 <cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
199 \begin_layout Plain Layout
200 Lines of C Code Implementation
207 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
210 \begin_layout Plain Layout
216 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
219 \begin_layout Plain Layout
225 <cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
228 \begin_layout Plain Layout
236 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
239 \begin_layout Plain Layout
245 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
248 \begin_layout Plain Layout
254 <cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
257 \begin_layout Plain Layout
265 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
268 \begin_layout Plain Layout
274 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
277 \begin_layout Plain Layout
283 <cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
286 \begin_layout Plain Layout
294 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
297 \begin_layout Plain Layout
303 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
306 \begin_layout Plain Layout
312 <cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
315 \begin_layout Plain Layout
323 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
326 \begin_layout Plain Layout
332 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
335 \begin_layout Plain Layout
341 <cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
344 \begin_layout Plain Layout
352 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
355 \begin_layout Plain Layout
361 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
364 \begin_layout Plain Layout
370 <cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
373 \begin_layout Plain Layout
381 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
384 \begin_layout Plain Layout
390 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
393 \begin_layout Plain Layout
399 <cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
402 \begin_layout Plain Layout
410 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
413 \begin_layout Plain Layout
419 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
422 \begin_layout Plain Layout
428 <cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
431 \begin_layout Plain Layout
439 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
442 \begin_layout Plain Layout
448 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
451 \begin_layout Plain Layout
457 <cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
460 \begin_layout Plain Layout
468 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
471 \begin_layout Plain Layout
477 <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
480 \begin_layout Plain Layout
486 <cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
489 \begin_layout Plain Layout
497 <cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
500 \begin_layout Plain Layout
506 <cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
509 \begin_layout Plain Layout
515 <cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
518 \begin_layout Plain Layout
532 \begin_layout Standard
533 This review is an attempt to catalog and address all the known issues with
534 TDB and create solutions which address the problems without significantly
535 increasing complexity; all involved are far too aware of the dangers of
536 second system syndrome in rewriting a successful project like this.
539 \begin_layout Section
543 \begin_layout Subsection
544 tdb_open_ex Is Not Expandable
547 \begin_layout Standard
548 The tdb_open() call was expanded to tdb_open_ex(), which added an optional
549 hashing function and an optional logging function argument.
550 Additional arguments to open would require the introduction of a tdb_open_ex2
554 \begin_layout Subsubsection
556 \begin_inset CommandInset label
565 \begin_layout Standard
566 tdb_open() will take a linked-list of attributes:
569 \begin_layout LyX-Code
573 \begin_layout LyX-Code
574 TDB_ATTRIBUTE_LOG = 0,
577 \begin_layout LyX-Code
578 TDB_ATTRIBUTE_HASH = 1
581 \begin_layout LyX-Code
585 \begin_layout LyX-Code
586 struct tdb_attribute_base {
589 \begin_layout LyX-Code
590 enum tdb_attribute attr;
593 \begin_layout LyX-Code
594 union tdb_attribute *next;
597 \begin_layout LyX-Code
601 \begin_layout LyX-Code
602 struct tdb_attribute_log {
605 \begin_layout LyX-Code
606 struct tdb_attribute_base base; /* .attr = TDB_ATTRIBUTE_LOG */
609 \begin_layout LyX-Code
613 \begin_layout LyX-Code
617 \begin_layout LyX-Code
621 \begin_layout LyX-Code
622 struct tdb_attribute_hash {
625 \begin_layout LyX-Code
626 struct tdb_attribute_base base; /* .attr = TDB_ATTRIBUTE_HASH */
629 \begin_layout LyX-Code
630 tdb_hash_func hash_fn;
633 \begin_layout LyX-Code
637 \begin_layout LyX-Code
641 \begin_layout LyX-Code
642 union tdb_attribute {
645 \begin_layout LyX-Code
646 struct tdb_attribute_base base;
649 \begin_layout LyX-Code
650 struct tdb_attribute_log log;
653 \begin_layout LyX-Code
654 struct tdb_attribute_hash hash;
657 \begin_layout LyX-Code
661 \begin_layout Standard
662 This allows future attributes to be added, even if this expands the size
666 \begin_layout Subsubsection
670 \begin_layout Standard
674 \begin_layout Subsection
675 tdb_traverse Makes Impossible Guarantees
678 \begin_layout Standard
679 tdb_traverse (and tdb_firstkey/tdb_nextkey) predate transactions, and it
680 was thought that it was important to guarantee that all records which exist
681 at the start and end of the traversal would be included, and no record
682 would be included twice.
685 \begin_layout Standard
686 This adds complexity (see
687 \begin_inset CommandInset ref
689 reference "Reliable-Traversal-Adds"
693 ) and does not work anyway for records which are altered (in particular,
694 those which are expanded may be effectively deleted and re-added behind
698 \begin_layout Subsubsection
699 \begin_inset CommandInset label
701 name "traverse-Proposed-Solution"
708 \begin_layout Standard
709 Abandon the guarantee.
710 You will see every record if no changes occur during your traversal, otherwise
711 you will see some subset.
712 You can prevent changes by using a transaction or the locking API.
715 \begin_layout Subsubsection
719 \begin_layout Standard
721 Delete-during-traverse will still delete every record, too (assuming no
725 \begin_layout Subsection
726 Nesting of Transactions Is Fraught
729 \begin_layout Standard
730 TDB has alternated between allowing nested transactions and not allowing
732 Various paths in the Samba codebase assume that transactions will nest,
733 and in a sense they can: the operation is only committed to disk when the
734 outer transaction is committed.
735 There are two problems, however:
738 \begin_layout Enumerate
739 Canceling the inner transaction will cause the outer transaction commit
740 to fail, and will not undo any operations since the inner transaction began.
741 This problem is soluble with some additional internal code.
744 \begin_layout Enumerate
745 An inner transaction commit can be cancelled by the outer transaction.
746 This is desirable in the way which Samba's database initialization code
747 uses transactions, but could be a surprise to any users expecting a successful
748 transaction commit to expose changes to others.
751 \begin_layout Standard
752 The current solution is to specify the behavior at tdb_open(), with the
753 default currently that nested transactions are allowed.
754 This flag can also be changed at runtime.
757 \begin_layout Subsubsection
761 \begin_layout Standard
762 Given the usage patterns, it seems that the
763 \begin_inset Quotes eld
767 \begin_inset Quotes erd
770 behavior of disallowing nested transactions should become the default.
771 Additionally, it seems the outer transaction is the only code which knows
772 whether inner transactions should be allowed, so a flag to indicate this
773 could be added to tdb_transaction_start.
774 However, this behavior can be simulated with a wrapper which uses tdb_add_flags
775 () and tdb_remove_flags(), so the API should not be expanded for this relatively
779 \begin_layout Subsubsection
783 \begin_layout Standard
785 \change_deleted 0 1298979572
786 Incomplete; nesting flag is still defined as per tdb1.
787 \change_inserted 0 1298979584
788 Complete; the nesting flag has been removed.
793 \begin_layout Subsection
794 Incorrect Hash Function is Not Detected
797 \begin_layout Standard
798 tdb_open_ex() allows the calling code to specify a different hash function
799 to use, but does not check that all other processes accessing this tdb
800 are using the same hash function.
801 The result is that records are missing from tdb_fetch().
804 \begin_layout Subsubsection
808 \begin_layout Standard
809 The header should contain an example hash result (eg.
810 the hash of 0xdeadbeef), and tdb_open_ex() should check that the given
811 hash function produces the same answer, or fail the tdb_open call.
814 \begin_layout Subsubsection
818 \begin_layout Standard
822 \begin_layout Subsection
823 tdb_set_max_dead/TDB_VOLATILE Expose Implementation
826 \begin_layout Standard
827 In response to scalability issues with the free list (
828 \begin_inset CommandInset ref
830 reference "TDB-Freelist-Is"
834 ) two API workarounds have been incorporated in TDB: tdb_set_max_dead()
835 and the TDB_VOLATILE flag to tdb_open.
836 The latter actually calls the former with an argument of
837 \begin_inset Quotes eld
841 \begin_inset Quotes erd
847 \begin_layout Standard
848 This code allows deleted records to accumulate without putting them in the
850 On delete we iterate through each chain and free them in a batch if there
851 are more than max_dead entries.
852 These are never otherwise recycled except as a side-effect of a tdb_repack.
855 \begin_layout Subsubsection
859 \begin_layout Standard
860 With the scalability problems of the freelist solved, this API can be removed.
861 The TDB_VOLATILE flag may still be useful as a hint that store and delete
862 of records will be at least as common as fetch in order to allow some internal
863 tuning, but initially will become a no-op.
866 \begin_layout Subsubsection
870 \begin_layout Standard
872 TDB_VOLATILE still defined, but implementation should fail on unknown flags
876 \begin_layout Subsection
877 \begin_inset CommandInset label
879 name "TDB-Files-Cannot"
883 TDB Files Cannot Be Opened Multiple Times In The Same Process
886 \begin_layout Standard
887 No process can open the same TDB twice; we check and disallow it.
888 This is an unfortunate side-effect of fcntl locks, which operate on a per-file
889 rather than per-file-descriptor basis, and do not nest.
890 Thus, closing any file descriptor on a file clears all the locks obtained
891 by this process, even if they were placed using a different file descriptor!
894 \begin_layout Standard
895 Note that even if this were solved, deadlock could occur if operations were
896 nested: this is a more manageable programming error in most cases.
899 \begin_layout Subsubsection
903 \begin_layout Standard
904 We could lobby POSIX to fix the perverse rules, or at least lobby Linux
905 to violate them so that the most common implementation does not have this
907 This would be a generally good idea for other fcntl lock users.
910 \begin_layout Standard
911 Samba uses a wrapper which hands out the same tdb_context to multiple callers
912 if this happens, and does simple reference counting.
913 We should do this inside the tdb library, which already emulates lock nesting
914 internally; it would need to recognize when deadlock occurs within a single
916 This would create a new failure mode for tdb operations (while we currently
917 handle locking failures, they are impossible in normal use and a process
918 encountering them can do little but give up).
921 \begin_layout Standard
922 I do not see benefit in an additional tdb_open flag to indicate whether
923 re-opening is allowed, as though there may be some benefit to adding a
924 call to detect when a tdb_context is shared, to allow other to create such
928 \begin_layout Subsubsection
932 \begin_layout Standard
936 \begin_layout Subsection
937 TDB API Is Not POSIX Thread-safe
940 \begin_layout Standard
941 The TDB API uses an error code which can be queried after an operation to
942 determine what went wrong.
943 This programming model does not work with threads, unless specific additional
944 guarantees are given by the implementation.
945 In addition, even otherwise-independent threads cannot open the same TDB
947 \begin_inset CommandInset ref
949 reference "TDB-Files-Cannot"
956 \begin_layout Subsubsection
960 \begin_layout Standard
961 Reachitecting the API to include a tdb_errcode pointer would be a great
963 \change_inserted 0 1298979557
964 , but fortunately most functions return 0 on success and -1 on error: we
965 can change these to return 0 on success and a negative error code on error,
966 and the API remains similar to previous.
967 The tdb_fetch, tdb_firstkey and tdb_nextkey functions need to take a TDB_DATA
968 pointer and return an error code.
969 It is also simpler to have tdb_nextkey replace its key argument in place,
970 freeing up any old .dptr.
973 \begin_layout Standard
975 \change_deleted 0 1298979438
976 ; we are better to guarantee that the tdb_errcode is per-thread so the current
977 programming model can be maintained.
980 \begin_layout Standard
982 \change_deleted 0 1298979438
983 This requires dynamic per-thread allocations, which is awkward with POSIX
984 threads (pthread_key_create space is limited and we cannot simply allocate
985 a key for every TDB).
990 \begin_layout Standard
991 Internal locking is required to make sure that fcntl locks do not overlap
992 between threads, and also that the global list of tdbs is maintained.
995 \begin_layout Standard
996 The aim is that building tdb with -DTDB_PTHREAD will result in a pthread-safe
997 version of the library, and otherwise no overhead will exist.
998 Alternatively, a hooking mechanism similar to that proposed for
999 \begin_inset CommandInset ref
1001 reference "Proposed-Solution-locking-hook"
1005 could be used to enable pthread locking at runtime.
1008 \begin_layout Subsubsection
1012 \begin_layout Standard
1014 \change_inserted 0 1298979681
1015 ; API has been changed but thread safety has not been implemented.
1016 \change_deleted 0 1298979669
1022 \begin_layout Subsection
1023 *_nonblock Functions And *_mark Functions Expose Implementation
1026 \begin_layout Standard
1031 \begin_layout Plain Layout
1032 Clustered TDB, see http://ctdb.samba.org
1037 wishes to operate on TDB in a non-blocking manner.
1038 This is currently done as follows:
1041 \begin_layout Enumerate
1042 Call the _nonblock variant of an API function (eg.
1043 tdb_lockall_nonblock).
1047 \begin_layout Enumerate
1048 Fork a child process, and wait for it to call the normal variant (eg.
1052 \begin_layout Enumerate
1053 If the child succeeds, call the _mark variant to indicate we already have
1058 \begin_layout Enumerate
1059 Upon completion, tell the child to release the locks (eg.
1063 \begin_layout Enumerate
1064 Indicate to tdb that it should consider the locks removed (eg.
1065 tdb_unlockall_mark).
1068 \begin_layout Standard
1069 There are several issues with this approach.
1070 Firstly, adding two new variants of each function clutters the API for
1071 an obscure use, and so not all functions have three variants.
1072 Secondly, it assumes that all paths of the functions ask for the same locks,
1073 otherwise the parent process will have to get a lock which the child doesn't
1074 have under some circumstances.
1075 I don't believe this is currently the case, but it constrains the implementatio
1080 \begin_layout Subsubsection
1081 \begin_inset CommandInset label
1083 name "Proposed-Solution-locking-hook"
1090 \begin_layout Standard
1091 Implement a hook for locking methods, so that the caller can control the
1092 calls to create and remove fcntl locks.
1093 In this scenario, ctdbd would operate as follows:
1096 \begin_layout Enumerate
1097 Call the normal API function, eg tdb_lockall().
1100 \begin_layout Enumerate
1101 When the lock callback comes in, check if the child has the lock.
1102 Initially, this is always false.
1104 Otherwise, try to obtain it in non-blocking mode.
1105 If that fails, return EWOULDBLOCK.
1108 \begin_layout Enumerate
1109 Release locks in the unlock callback as normal.
1112 \begin_layout Enumerate
1113 If tdb_lockall() fails, see if we recorded a lock failure; if so, call the
1114 child to repeat the operation.
1117 \begin_layout Enumerate
1118 The child records what locks it obtains, and returns that information to
1122 \begin_layout Enumerate
1123 When the child has succeeded, goto 1.
1126 \begin_layout Standard
1127 This is flexible enough to handle any potential locking scenario, even when
1128 lock requirements change.
1129 It can be optimized so that the parent does not release locks, just tells
1130 the child which locks it doesn't need to obtain.
1133 \begin_layout Standard
1134 It also keeps the complexity out of the API, and in ctdbd where it is needed.
1137 \begin_layout Subsubsection
1141 \begin_layout Standard
1145 \begin_layout Subsection
1146 tdb_chainlock Functions Expose Implementation
1149 \begin_layout Standard
1150 tdb_chainlock locks some number of records, including the record indicated
1152 This gave atomicity guarantees; no-one can start a transaction, alter,
1153 read or delete that key while the lock is held.
1156 \begin_layout Standard
1157 It also makes the same guarantee for any other key in the chain, which is
1158 an internal implementation detail and potentially a cause for deadlock.
1161 \begin_layout Subsubsection
1165 \begin_layout Standard
1167 It would be nice to have an explicit single entry lock which effected no
1169 Unfortunately, this won't work for an entry which doesn't exist.
1170 Thus while chainlock may be implemented more efficiently for the existing
1171 case, it will still have overlap issues with the non-existing case.
1172 So it is best to keep the current (lack of) guarantee about which records
1173 will be effected to avoid constraining our implementation.
1176 \begin_layout Subsection
1177 Signal Handling is Not Race-Free
1180 \begin_layout Standard
1181 The tdb_setalarm_sigptr() call allows the caller's signal handler to indicate
1182 that the tdb locking code should return with a failure, rather than trying
1183 again when a signal is received (and errno == EAGAIN).
1184 This is usually used to implement timeouts.
1187 \begin_layout Standard
1188 Unfortunately, this does not work in the case where the signal is received
1189 before the tdb code enters the fcntl() call to place the lock: the code
1190 will sleep within the fcntl() code, unaware that the signal wants it to
1192 In the case of long timeouts, this does not happen in practice.
1195 \begin_layout Subsubsection
1199 \begin_layout Standard
1200 The locking hooks proposed in
1201 \begin_inset CommandInset ref
1203 reference "Proposed-Solution-locking-hook"
1207 would allow the user to decide on whether to fail the lock acquisition
1209 This allows the caller to choose their own compromise: they could narrow
1210 the race by checking immediately before the fcntl call.
1214 \begin_layout Plain Layout
1215 It may be possible to make this race-free in some implementations by having
1216 the signal handler alter the struct flock to make it invalid.
1217 This will cause the fcntl() lock call to fail with EINVAL if the signal
1218 occurs before the kernel is entered, otherwise EAGAIN.
1226 \begin_layout Subsubsection
1230 \begin_layout Standard
1234 \begin_layout Subsection
1235 The API Uses Gratuitous Typedefs, Capitals
1238 \begin_layout Standard
1239 typedefs are useful for providing source compatibility when types can differ
1240 across implementations, or arguably in the case of function pointer definitions
1241 which are hard for humans to parse.
1242 Otherwise it is simply obfuscation and pollutes the namespace.
1245 \begin_layout Standard
1246 Capitalization is usually reserved for compile-time constants and macros.
1249 \begin_layout Description
1250 TDB_CONTEXT There is no reason to use this over 'struct tdb_context'; the
1251 definition isn't visible to the API user anyway.
1254 \begin_layout Description
1255 TDB_DATA There is no reason to use this over struct TDB_DATA; the struct
1256 needs to be understood by the API user.
1259 \begin_layout Description
1261 \begin_inset space ~
1264 TDB_DATA This would normally be called 'struct tdb_data'.
1267 \begin_layout Description
1269 \begin_inset space ~
1272 TDB_ERROR Similarly, this would normally be enum tdb_error.
1275 \begin_layout Subsubsection
1279 \begin_layout Standard
1281 Introducing lower case variants would please pedants like myself, but if
1282 it were done the existing ones should be kept.
1283 There is little point forcing a purely cosmetic change upon tdb users.
1286 \begin_layout Subsection
1287 \begin_inset CommandInset label
1289 name "tdb_log_func-Doesnt-Take"
1293 tdb_log_func Doesn't Take The Private Pointer
1296 \begin_layout Standard
1297 For API compatibility reasons, the logging function needs to call tdb_get_loggin
1298 g_private() to retrieve the pointer registered by the tdb_open_ex for logging.
1301 \begin_layout Subsubsection
1305 \begin_layout Standard
1306 It should simply take an extra argument, since we are prepared to break
1310 \begin_layout Subsubsection
1314 \begin_layout Standard
1318 \begin_layout Subsection
1319 Various Callback Functions Are Not Typesafe
1322 \begin_layout Standard
1323 The callback functions in tdb_set_logging_function (after
1324 \begin_inset CommandInset ref
1326 reference "tdb_log_func-Doesnt-Take"
1330 is resolved), tdb_parse_record, tdb_traverse, tdb_traverse_read and tdb_check
1331 all take void * and must internally convert it to the argument type they
1335 \begin_layout Standard
1336 If this type changes, the compiler will not produce warnings on the callers,
1337 since it only sees void *.
1340 \begin_layout Subsubsection
1344 \begin_layout Standard
1345 With careful use of macros, we can create callback functions which give
1346 a warning when used on gcc and the types of the callback and its private
1348 Unsupported compilers will not give a warning, which is no worse than now.
1349 In addition, the callbacks become clearer, as they need not use void *
1350 for their parameter.
1353 \begin_layout Standard
1354 See CCAN's typesafe_cb module at http://ccan.ozlabs.org/info/typesafe_cb.html
1357 \begin_layout Subsubsection
1361 \begin_layout Standard
1365 \begin_layout Subsection
1366 TDB_CLEAR_IF_FIRST Must Be Specified On All Opens, tdb_reopen_all Problematic
1369 \begin_layout Standard
1370 The TDB_CLEAR_IF_FIRST flag to tdb_open indicates that the TDB file should
1371 be cleared if the caller discovers it is the only process with the TDB
1373 However, if any caller does not specify TDB_CLEAR_IF_FIRST it will not
1374 be detected, so will have the TDB erased underneath them (usually resulting
1378 \begin_layout Standard
1379 There is a similar issue on fork(); if the parent exits (or otherwise closes
1380 the tdb) before the child calls tdb_reopen_all() to establish the lock
1381 used to indicate the TDB is opened by someone, a TDB_CLEAR_IF_FIRST opener
1382 at that moment will believe it alone has opened the TDB and will erase
1386 \begin_layout Subsubsection
1390 \begin_layout Standard
1391 Remove TDB_CLEAR_IF_FIRST.
1392 Other workarounds are possible, but see
1393 \begin_inset CommandInset ref
1395 reference "TDB_CLEAR_IF_FIRST-Imposes-Performance"
1402 \begin_layout Subsubsection
1406 \begin_layout Standard
1408 \change_deleted 0 1298979699
1409 Incomplete, TDB_CLEAR_IF_FIRST still defined, but not implemented.
1410 \change_inserted 0 1298979700
1416 \begin_layout Subsection
1417 Extending The Header Is Difficult
1420 \begin_layout Standard
1421 We have reserved (zeroed) words in the TDB header, which can be used for
1423 If the future features are compulsory, the version number must be updated
1424 to prevent old code from accessing the database.
1425 But if the future feature is optional, we have no way of telling if older
1426 code is accessing the database or not.
1429 \begin_layout Subsubsection
1433 \begin_layout Standard
1434 The header should contain a
1435 \begin_inset Quotes eld
1439 \begin_inset Quotes erd
1443 This is divided into two 32-bit parts:
1446 \begin_layout Enumerate
1447 The lower part reflects the format variant understood by code accessing
1451 \begin_layout Enumerate
1452 The upper part reflects the format variant you must understand to write
1453 to the database (otherwise you can only open for reading).
1456 \begin_layout Standard
1457 The latter field can only be written at creation time, the former should
1458 be written under the OPEN_LOCK when opening the database for writing, if
1459 the variant of the code is lower than the current lowest variant.
1462 \begin_layout Standard
1463 This should allow backwards-compatible features to be added, and detection
1464 if older code (which doesn't understand the feature) writes to the database.
1467 \begin_layout Subsubsection
1471 \begin_layout Standard
1475 \begin_layout Subsection
1476 Record Headers Are Not Expandible
1479 \begin_layout Standard
1480 If we later want to add (say) checksums on keys and data, it would require
1481 another format change, which we'd like to avoid.
1484 \begin_layout Subsubsection
1488 \begin_layout Standard
1489 We often have extra padding at the tail of a record.
1490 If we ensure that the first byte (if any) of this padding is zero, we will
1491 have a way for future changes to detect code which doesn't understand a
1492 new format: the new code would write (say) a 1 at the tail, and thus if
1493 there is no tail or the first byte is 0, we would know the extension is
1494 not present on that record.
1497 \begin_layout Subsubsection
1501 \begin_layout Standard
1505 \begin_layout Subsection
1506 TDB Does Not Use Talloc
1509 \begin_layout Standard
1510 Many users of TDB (particularly Samba) use the talloc allocator, and thus
1511 have to wrap TDB in a talloc context to use it conveniently.
1514 \begin_layout Subsubsection
1518 \begin_layout Standard
1519 The allocation within TDB is not complicated enough to justify the use of
1520 talloc, and I am reluctant to force another (excellent) library on TDB
1522 Nonetheless a compromise is possible.
1524 \begin_inset CommandInset ref
1526 reference "attributes"
1530 ) can be added later to tdb_open() to provide an alternate allocation mechanism,
1531 specifically for talloc but usable by any other allocator (which would
1533 \begin_inset Quotes eld
1537 \begin_inset Quotes erd
1543 \begin_layout Standard
1544 This would form a talloc heirarchy as expected, but the caller would still
1545 have to attach a destructor to the tdb context returned from tdb_open to
1547 All TDB_DATA fields would be children of the tdb_context, and the caller
1548 would still have to manage them (using talloc_free() or talloc_steal()).
1551 \begin_layout Subsubsection
1555 \begin_layout Standard
1559 \begin_layout Section
1560 Performance And Scalability Issues
1563 \begin_layout Subsection
1564 \begin_inset CommandInset label
1566 name "TDB_CLEAR_IF_FIRST-Imposes-Performance"
1570 TDB_CLEAR_IF_FIRST Imposes Performance Penalty
1573 \begin_layout Standard
1574 When TDB_CLEAR_IF_FIRST is specified, a 1-byte read lock is placed at offset
1577 While these locks never conflict in normal tdb usage, they do add substantial
1578 overhead for most fcntl lock implementations when the kernel scans to detect
1579 if a lock conflict exists.
1580 This is often a single linked list, making the time to acquire and release
1581 a fcntl lock O(N) where N is the number of processes with the TDB open,
1582 not the number actually doing work.
1585 \begin_layout Standard
1586 In a Samba server it is common to have huge numbers of clients sitting idle,
1587 and thus they have weaned themselves off the TDB_CLEAR_IF_FIRST flag.
1591 \begin_layout Plain Layout
1592 There is a flag to tdb_reopen_all() which is used for this optimization:
1593 if the parent process will outlive the child, the child does not need the
1595 This is a workaround for this very performance issue.
1603 \begin_layout Subsubsection
1607 \begin_layout Standard
1609 It was a neat idea, but even trivial servers tend to know when they are
1610 initializing for the first time and can simply unlink the old tdb at that
1614 \begin_layout Subsubsection
1618 \begin_layout Standard
1620 \change_deleted 0 1298979837
1621 Incomplete; TDB_CLEAR_IF_FIRST still defined, but does nothing.
1622 \change_inserted 0 1298979837
1628 \begin_layout Subsection
1629 TDB Files Have a 4G Limit
1632 \begin_layout Standard
1633 This seems to be becoming an issue (so much for
1634 \begin_inset Quotes eld
1638 \begin_inset Quotes erd
1641 !), particularly for ldb.
1644 \begin_layout Subsubsection
1648 \begin_layout Standard
1649 A new, incompatible TDB format which uses 64 bit offsets internally rather
1651 For simplicity of endian conversion (which TDB does on the fly if required),
1652 all values will be 64 bit on disk.
1653 In practice, some upper bits may be used for other purposes, but at least
1654 56 bits will be available for file offsets.
1657 \begin_layout Standard
1658 tdb_open() will automatically detect the old version, and even create them
1659 if TDB_VERSION6 is specified to tdb_open.
1662 \begin_layout Standard
1663 32 bit processes will still be able to access TDBs larger than 4G (assuming
1664 that their off_t allows them to seek to 64 bits), they will gracefully
1665 fall back as they fail to mmap.
1666 This can happen already with large TDBs.
1669 \begin_layout Standard
1670 Old versions of tdb will fail to open the new TDB files (since 28 August
1671 2009, commit 398d0c29290: prior to that any unrecognized file format would
1672 be erased and initialized as a fresh tdb!)
1675 \begin_layout Subsubsection
1679 \begin_layout Standard
1683 \begin_layout Subsection
1684 TDB Records Have a 4G Limit
1687 \begin_layout Standard
1688 This has not been a reported problem, and the API uses size_t which can
1689 be 64 bit on 64 bit platforms.
1690 However, other limits may have made such an issue moot.
1693 \begin_layout Subsubsection
1697 \begin_layout Standard
1698 Record sizes will be 64 bit, with an error returned on 32 bit platforms
1699 which try to access such records (the current implementation would return
1700 TDB_ERR_OOM in a similar case).
1701 It seems unlikely that 32 bit keys will be a limitation, so the implementation
1702 may not support this (see
1703 \begin_inset CommandInset ref
1705 reference "sub:Records-Incur-A"
1712 \begin_layout Subsubsection
1716 \begin_layout Standard
1720 \begin_layout Subsection
1721 Hash Size Is Determined At TDB Creation Time
1724 \begin_layout Standard
1725 TDB contains a number of hash chains in the header; the number is specified
1726 at creation time, and defaults to 131.
1727 This is such a bottleneck on large databases (as each hash chain gets quite
1728 long), that LDB uses 10,000 for this hash.
1729 In general it is impossible to know what the 'right' answer is at database
1733 \begin_layout Subsubsection
1734 \begin_inset CommandInset label
1736 name "sub:Hash-Size-Solution"
1743 \begin_layout Standard
1744 After comprehensive performance testing on various scalable hash variants
1748 \begin_layout Plain Layout
1749 http://rusty.ozlabs.org/?p=89 and http://rusty.ozlabs.org/?p=94 This was annoying
1750 because I was previously convinced that an expanding tree of hashes would
1751 be very close to optimal.
1756 , it became clear that it is hard to beat a straight linear hash table which
1757 doubles in size when it reaches saturation.
1758 Unfortunately, altering the hash table introduces serious locking complications
1759 : the entire hash table needs to be locked to enlarge the hash table, and
1760 others might be holding locks.
1761 Particularly insidious are insertions done under tdb_chainlock.
1764 \begin_layout Standard
1765 Thus an expanding layered hash will be used: an array of hash groups, with
1766 each hash group exploding into pointers to lower hash groups once it fills,
1767 turning into a hash tree.
1768 This has implications for locking: we must lock the entire group in case
1769 we need to expand it, yet we don't know how deep the tree is at that point.
1772 \begin_layout Standard
1773 Note that bits from the hash table entries should be stolen to hold more
1774 hash bits to reduce the penalty of collisions.
1775 We can use the otherwise-unused lower 3 bits.
1776 If we limit the size of the database to 64 exabytes, we can use the top
1777 8 bits of the hash entry as well.
1778 These 11 bits would reduce false positives down to 1 in 2000 which is more
1779 than we need: we can use one of the bits to indicate that the extra hash
1781 This means we can choose not to re-hash all entries when we expand a hash
1782 group; simply use the next bits we need and mark them invalid.
1785 \begin_layout Subsubsection
1789 \begin_layout Standard
1793 \begin_layout Subsection
1794 \begin_inset CommandInset label
1796 name "TDB-Freelist-Is"
1800 TDB Freelist Is Highly Contended
1803 \begin_layout Standard
1804 TDB uses a single linked list for the free list.
1805 Allocation occurs as follows, using heuristics which have evolved over
1809 \begin_layout Enumerate
1810 Get the free list lock for this whole operation.
1813 \begin_layout Enumerate
1814 Multiply length by 1.25, so we always over-allocate by 25%.
1817 \begin_layout Enumerate
1818 Set the slack multiplier to 1.
1821 \begin_layout Enumerate
1822 Examine the current freelist entry: if it is > length but < the current
1823 best case, remember it as the best case.
1826 \begin_layout Enumerate
1827 Multiply the slack multiplier by 1.05.
1830 \begin_layout Enumerate
1831 If our best fit so far is less than length * slack multiplier, return it.
1832 The slack will be turned into a new free record if it's large enough.
1835 \begin_layout Enumerate
1836 Otherwise, go onto the next freelist entry.
1839 \begin_layout Standard
1840 Deleting a record occurs as follows:
1843 \begin_layout Enumerate
1844 Lock the hash chain for this whole operation.
1847 \begin_layout Enumerate
1848 Walk the chain to find the record, keeping the prev pointer offset.
1851 \begin_layout Enumerate
1852 If max_dead is non-zero:
1856 \begin_layout Enumerate
1857 Walk the hash chain again and count the dead records.
1860 \begin_layout Enumerate
1861 If it's more than max_dead, bulk free all the dead ones (similar to steps
1862 4 and below, but the lock is only obtained once).
1865 \begin_layout Enumerate
1866 Simply mark this record as dead and return.
1871 \begin_layout Enumerate
1872 Get the free list lock for the remainder of this operation.
1875 \begin_layout Enumerate
1876 \begin_inset CommandInset label
1878 name "right-merging"
1882 Examine the following block to see if it is free; if so, enlarge the current
1883 block and remove that block from the free list.
1884 This was disabled, as removal from the free list was O(entries-in-free-list).
1887 \begin_layout Enumerate
1888 Examine the preceeding block to see if it is free: for this reason, each
1889 block has a 32-bit tailer which indicates its length.
1890 If it is free, expand it to cover our new block and return.
1893 \begin_layout Enumerate
1894 Otherwise, prepend ourselves to the free list.
1897 \begin_layout Standard
1898 Disabling right-merging (step
1899 \begin_inset CommandInset ref
1901 reference "right-merging"
1905 ) causes fragmentation; the other heuristics proved insufficient to address
1906 this, so the final answer to this was that when we expand the TDB file
1907 inside a transaction commit, we repack the entire tdb.
1910 \begin_layout Standard
1911 The single list lock limits our allocation rate; due to the other issues
1912 this is not currently seen as a bottleneck.
1915 \begin_layout Subsubsection
1919 \begin_layout Standard
1920 The first step is to remove all the current heuristics, as they obviously
1921 interact, then examine them once the lock contention is addressed.
1924 \begin_layout Standard
1925 The free list must be split to reduce contention.
1926 Assuming perfect free merging, we can at most have 1 free list entry for
1928 This implies that the number of free lists is related to the size of the
1929 hash table, but as it is rare to walk a large number of free list entries
1930 we can use far fewer, say 1/32 of the number of hash buckets.
1933 \begin_layout Standard
1934 It seems tempting to try to reuse the hash implementation which we use for
1935 records here, but we have two ways of searching for free entries: for allocatio
1936 n we search by size (and possibly zone) which produces too many clashes
1937 for our hash table to handle well, and for coalescing we search by address.
1938 Thus an array of doubly-linked free lists seems preferable.
1941 \begin_layout Standard
1942 There are various benefits in using per-size free lists (see
1943 \begin_inset CommandInset ref
1945 reference "sub:TDB-Becomes-Fragmented"
1949 ) but it's not clear this would reduce contention in the common case where
1950 all processes are allocating/freeing the same size.
1951 Thus we almost certainly need to divide in other ways: the most obvious
1952 is to divide the file into zones, and using a free list (or table of free
1954 This approximates address ordering.
1957 \begin_layout Standard
1958 Unfortunately it is difficult to know what heuristics should be used to
1959 determine zone sizes, and our transaction code relies on being able to
1961 \begin_inset Quotes eld
1965 \begin_inset Quotes erd
1968 by simply appending to the file (difficult if it would need to create a
1970 Thus we use a linked-list of free tables; currently we only ever create
1971 one, but if there is more than one we choose one at random to use.
1972 In future we may use heuristics to add new free tables on contention.
1973 We only expand the file when all free tables are exhausted.
1976 \begin_layout Standard
1977 The basic algorithm is as follows.
1981 \begin_layout Enumerate
1982 Identify the correct free list.
1985 \begin_layout Enumerate
1986 Lock the corresponding list.
1989 \begin_layout Enumerate
1990 Re-check the list (we didn't have a lock, sizes could have changed): relock
1994 \begin_layout Enumerate
1995 Place the freed entry in the list.
1998 \begin_layout Standard
1999 Allocation is a little more complicated, as we perform delayed coalescing
2003 \begin_layout Enumerate
2004 Pick a free table; usually the previous one.
2007 \begin_layout Enumerate
2008 Lock the corresponding list.
2011 \begin_layout Enumerate
2012 If the top entry is -large enough, remove it from the list and return it.
2015 \begin_layout Enumerate
2016 Otherwise, coalesce entries in the list.If there was no entry large enough,
2017 unlock the list and try the next largest list
2020 \begin_layout Enumerate
2021 If no list has an entry which meets our needs, try the next free table.
2024 \begin_layout Enumerate
2025 If no zone satisfies, expand the file.
2028 \begin_layout Standard
2029 This optimizes rapid insert/delete of free list entries by not coalescing
2031 First-fit address ordering ordering seems to be fairly good for keeping
2032 fragmentation low (see
2033 \begin_inset CommandInset ref
2035 reference "sub:TDB-Becomes-Fragmented"
2040 Note that address ordering does not need a tailer to coalesce, though if
2041 we needed one we could have one cheaply: see
2042 \begin_inset CommandInset ref
2044 reference "sub:Records-Incur-A"
2052 \begin_layout Standard
2053 Each free entry has the free table number in the header: less than 255.
2054 It also contains a doubly-linked list for easy deletion.
2057 \begin_layout Subsection
2058 \begin_inset CommandInset label
2060 name "sub:TDB-Becomes-Fragmented"
2064 TDB Becomes Fragmented
2067 \begin_layout Standard
2068 Much of this is a result of allocation strategy
2072 \begin_layout Plain Layout
2073 The Memory Fragmentation Problem: Solved? Johnstone & Wilson 1995 ftp://ftp.cs.ute
2074 xas.edu/pub/garbage/malloc/ismm98.ps
2079 and deliberate hobbling of coalescing; internal fragmentation (aka overallocati
2080 on) is deliberately set at 25%, and external fragmentation is only cured
2081 by the decision to repack the entire db when a transaction commit needs
2082 to enlarge the file.
2085 \begin_layout Subsubsection
2089 \begin_layout Standard
2090 The 25% overhead on allocation works in practice for ldb because indexes
2091 tend to expand by one record at a time.
2092 This internal fragmentation can be resolved by having an
2093 \begin_inset Quotes eld
2097 \begin_inset Quotes erd
2100 bit in the header to note entries that have previously expanded, and allocating
2101 more space for them.
2104 \begin_layout Standard
2105 There are is a spectrum of possible solutions for external fragmentation:
2106 one is to use a fragmentation-avoiding allocation strategy such as best-fit
2107 address-order allocator.
2108 The other end of the spectrum would be to use a bump allocator (very fast
2109 and simple) and simply repack the file when we reach the end.
2112 \begin_layout Standard
2113 There are three problems with efficient fragmentation-avoiding allocators:
2114 they are non-trivial, they tend to use a single free list for each size,
2115 and there's no evidence that tdb allocation patterns will match those recorded
2116 for general allocators (though it seems likely).
2119 \begin_layout Standard
2120 Thus we don't spend too much effort on external fragmentation; we will be
2121 no worse than the current code if we need to repack on occasion.
2122 More effort is spent on reducing freelist contention, and reducing overhead.
2125 \begin_layout Subsection
2126 \begin_inset CommandInset label
2128 name "sub:Records-Incur-A"
2132 Records Incur A 28-Byte Overhead
2135 \begin_layout Standard
2136 Each TDB record has a header as follows:
2139 \begin_layout LyX-Code
2143 \begin_layout LyX-Code
2144 tdb_off_t next; /* offset of the next record in the list */
2147 \begin_layout LyX-Code
2148 tdb_len_t rec_len; /* total byte length of record */
2151 \begin_layout LyX-Code
2152 tdb_len_t key_len; /* byte length of key */
2155 \begin_layout LyX-Code
2156 tdb_len_t data_len; /* byte length of data */
2159 \begin_layout LyX-Code
2160 uint32_t full_hash; /* the full 32 bit hash of the key */
2163 \begin_layout LyX-Code
2164 uint32_t magic; /* try to catch errors */
2167 \begin_layout LyX-Code
2168 /* the following union is implied:
2171 \begin_layout LyX-Code
2175 \begin_layout LyX-Code
2176 char record[rec_len];
2179 \begin_layout LyX-Code
2183 \begin_layout LyX-Code
2187 \begin_layout LyX-Code
2188 char data[data_len];
2191 \begin_layout LyX-Code
2195 \begin_layout LyX-Code
2196 uint32_t totalsize; (tailer)
2199 \begin_layout LyX-Code
2203 \begin_layout LyX-Code
2207 \begin_layout LyX-Code
2211 \begin_layout Standard
2212 Naively, this would double to a 56-byte overhead on a 64 bit implementation.
2215 \begin_layout Subsubsection
2219 \begin_layout Standard
2220 We can use various techniques to reduce this for an allocated block:
2223 \begin_layout Enumerate
2224 The 'next' pointer is not required, as we are using a flat hash table.
2227 \begin_layout Enumerate
2228 'rec_len' can instead be expressed as an addition to key_len and data_len
2229 (it accounts for wasted or overallocated length in the record).
2230 Since the record length is always a multiple of 8, we can conveniently
2231 fit it in 32 bits (representing up to 35 bits).
2234 \begin_layout Enumerate
2235 'key_len' and 'data_len' can be reduced.
2236 I'm unwilling to restrict 'data_len' to 32 bits, but instead we can combine
2237 the two into one 64-bit field and using a 5 bit value which indicates at
2238 what bit to divide the two.
2239 Keys are unlikely to scale as fast as data, so I'm assuming a maximum key
2243 \begin_layout Enumerate
2244 'full_hash' is used to avoid a memcmp on the
2245 \begin_inset Quotes eld
2249 \begin_inset Quotes erd
2252 case, but this is diminishing returns after a handful of bits (at 10 bits,
2253 it reduces 99.9% of false memcmp).
2254 As an aside, as the lower bits are already incorporated in the hash table
2255 resolution, the upper bits should be used here.
2256 Note that it's not clear that these bits will be a win, given the extra
2257 bits in the hash table itself (see
2258 \begin_inset CommandInset ref
2260 reference "sub:Hash-Size-Solution"
2267 \begin_layout Enumerate
2268 'magic' does not need to be enlarged: it currently reflects one of 5 values
2269 (used, free, dead, recovery, and unused_recovery).
2270 It is useful for quick sanity checking however, and should not be eliminated.
2273 \begin_layout Enumerate
2274 'tailer' is only used to coalesce free blocks (so a block to the right can
2275 find the header to check if this block is free).
2276 This can be replaced by a single 'free' bit in the header of the following
2277 block (and the tailer only exists in free blocks).
2281 \begin_layout Plain Layout
2282 This technique from Thomas Standish.
2283 Data Structure Techniques.
2284 Addison-Wesley, Reading, Massachusetts, 1980.
2289 The current proposed coalescing algorithm doesn't need this, however.
2292 \begin_layout Standard
2293 This produces a 16 byte used header like this:
2296 \begin_layout LyX-Code
2297 struct tdb_used_record {
2300 \begin_layout LyX-Code
2301 uint32_t used_magic : 16,
2304 \begin_layout LyX-Code
2308 \begin_layout LyX-Code
2312 \begin_layout LyX-Code
2316 \begin_layout LyX-Code
2317 uint32_t extra_octets;
2320 \begin_layout LyX-Code
2321 uint64_t key_and_data_len;
2324 \begin_layout LyX-Code
2328 \begin_layout Standard
2329 And a free record like this:
2332 \begin_layout LyX-Code
2333 struct tdb_free_record {
2336 \begin_layout LyX-Code
2337 uint64_t free_magic: 8,
2340 \begin_layout LyX-Code
2344 \begin_layout LyX-Code
2348 \begin_layout LyX-Code
2349 uint64_t free_table: 8,
2352 \begin_layout LyX-Code
2356 \begin_layout LyX-Code
2360 \begin_layout LyX-Code
2364 \begin_layout Standard
2366 \change_deleted 0 1291206079
2369 Note that by limiting valid offsets to 56 bits, we can pack everything we
2370 need into 3 64-byte words, meaning our minimum record size is 8 bytes.
2373 \begin_layout Subsubsection
2377 \begin_layout Standard
2381 \begin_layout Subsection
2382 Transaction Commit Requires 4 fdatasync
2385 \begin_layout Standard
2386 The current transaction algorithm is:
2389 \begin_layout Enumerate
2390 write_recovery_data();
2393 \begin_layout Enumerate
2397 \begin_layout Enumerate
2398 write_recovery_header();
2401 \begin_layout Enumerate
2405 \begin_layout Enumerate
2406 overwrite_with_new_data();
2409 \begin_layout Enumerate
2413 \begin_layout Enumerate
2414 remove_recovery_header();
2417 \begin_layout Enumerate
2421 \begin_layout Standard
2422 On current ext3, each sync flushes all data to disk, so the next 3 syncs
2423 are relatively expensive.
2424 But this could become a performance bottleneck on other filesystems such
2428 \begin_layout Subsubsection
2432 \begin_layout Standard
2433 Neil Brown points out that this is overzealous, and only one sync is needed:
2436 \begin_layout Enumerate
2437 Bundle the recovery data, a transaction counter and a strong checksum of
2441 \begin_layout Enumerate
2442 Strong checksum that whole bundle.
2445 \begin_layout Enumerate
2446 Store the bundle in the database.
2449 \begin_layout Enumerate
2450 Overwrite the oldest of the two recovery pointers in the header (identified
2451 using the transaction counter) with the offset of this bundle.
2454 \begin_layout Enumerate
2458 \begin_layout Enumerate
2459 Write the new data to the file.
2462 \begin_layout Standard
2463 Checking for recovery means identifying the latest bundle with a valid checksum
2464 and using the new data checksum to ensure that it has been applied.
2465 This is more expensive than the current check, but need only be done at
2467 For running databases, a separate header field can be used to indicate
2468 a transaction in progress; we need only check for recovery if this is set.
2471 \begin_layout Subsubsection
2475 \begin_layout Standard
2479 \begin_layout Subsection
2480 \begin_inset CommandInset label
2482 name "sub:TDB-Does-Not"
2486 TDB Does Not Have Snapshot Support
2489 \begin_layout Subsubsection
2490 Proposed SolutionNone.
2491 At some point you say
2492 \begin_inset Quotes eld
2496 \begin_inset Quotes erd
2500 \begin_inset CommandInset ref
2502 reference "replay-attribute"
2509 \begin_layout Standard
2510 But as a thought experiment, if we implemented transactions to only overwrite
2511 free entries (this is tricky: there must not be a header in each entry
2512 which indicates whether it is free, but use of presence in metadata elsewhere),
2513 and a pointer to the hash table, we could create an entirely new commit
2514 without destroying existing data.
2515 Then it would be easy to implement snapshots in a similar way.
2518 \begin_layout Standard
2519 This would not allow arbitrary changes to the database, such as tdb_repack
2520 does, and would require more space (since we have to preserve the current
2521 and future entries at once).
2522 If we used hash trees rather than one big hash table, we might only have
2523 to rewrite some sections of the hash, too.
2526 \begin_layout Standard
2527 We could then implement snapshots using a similar method, using multiple
2528 different hash tables/free tables.
2531 \begin_layout Subsubsection
2535 \begin_layout Standard
2539 \begin_layout Subsection
2540 Transactions Cannot Operate in Parallel
2543 \begin_layout Standard
2544 This would be useless for ldb, as it hits the index records with just about
2546 It would add significant complexity in resolving clashes, and cause the
2547 all transaction callers to write their code to loop in the case where the
2548 transactions spuriously failed.
2551 \begin_layout Subsubsection
2555 \begin_layout Standard
2557 \begin_inset CommandInset ref
2559 reference "replay-attribute"
2564 We could solve a small part of the problem by providing read-only transactions.
2565 These would allow one write transaction to begin, but it could not commit
2566 until all r/o transactions are done.
2567 This would require a new RO_TRANSACTION_LOCK, which would be upgraded on
2571 \begin_layout Subsubsection
2575 \begin_layout Standard
2579 \begin_layout Subsection
2580 Default Hash Function Is Suboptimal
2583 \begin_layout Standard
2584 The Knuth-inspired multiplicative hash used by tdb is fairly slow (especially
2585 if we expand it to 64 bits), and works best when the hash bucket size is
2586 a prime number (which also means a slow modulus).
2587 In addition, it is highly predictable which could potentially lead to a
2588 Denial of Service attack in some TDB uses.
2591 \begin_layout Subsubsection
2595 \begin_layout Standard
2596 The Jenkins lookup3 hash
2600 \begin_layout Plain Layout
2601 http://burtleburtle.net/bob/c/lookup3.c
2606 is a fast and superbly-mixing hash.
2607 It's used by the Linux kernel and almost everything else.
2608 This has the particular properties that it takes an initial seed, and produces
2609 two 32 bit hash numbers, which we can combine into a 64-bit hash.
2612 \begin_layout Standard
2613 The seed should be created at tdb-creation time from some random source,
2614 and placed in the header.
2615 This is far from foolproof, but adds a little bit of protection against
2619 \begin_layout Subsubsection
2623 \begin_layout Standard
2627 \begin_layout Subsection
2628 \begin_inset CommandInset label
2630 name "Reliable-Traversal-Adds"
2634 Reliable Traversal Adds Complexity
2637 \begin_layout Standard
2638 We lock a record during traversal iteration, and try to grab that lock in
2640 If that grab on delete fails, we simply mark it deleted and continue onwards;
2641 traversal checks for this condition and does the delete when it moves off
2645 \begin_layout Standard
2646 If traversal terminates, the dead record may be left indefinitely.
2649 \begin_layout Subsubsection
2653 \begin_layout Standard
2654 Remove reliability guarantees; see
2655 \begin_inset CommandInset ref
2657 reference "traverse-Proposed-Solution"
2664 \begin_layout Subsubsection
2668 \begin_layout Standard
2672 \begin_layout Subsection
2673 Fcntl Locking Adds Overhead
2676 \begin_layout Standard
2677 Placing a fcntl lock means a system call, as does removing one.
2678 This is actually one reason why transactions can be faster (everything
2679 is locked once at transaction start).
2680 In the uncontended case, this overhead can theoretically be eliminated.
2683 \begin_layout Subsubsection
2687 \begin_layout Standard
2691 \begin_layout Standard
2692 We tried this before with spinlock support, in the early days of TDB, and
2693 it didn't make much difference except in manufactured benchmarks.
2696 \begin_layout Standard
2697 We could use spinlocks (with futex kernel support under Linux), but it means
2698 that we lose automatic cleanup when a process dies with a lock.
2699 There is a method of auto-cleanup under Linux, but it's not supported by
2700 other operating systems.
2701 We could reintroduce a clear-if-first-style lock and sweep for dead futexes
2702 on open, but that wouldn't help the normal case of one concurrent opener
2704 Increasingly elaborate repair schemes could be considered, but they require
2705 an ABI change (everyone must use them) anyway, so there's no need to do
2706 this at the same time as everything else.
2709 \begin_layout Subsection
2710 Some Transactions Don't Require Durability
2713 \begin_layout Standard
2714 Volker points out that gencache uses a CLEAR_IF_FIRST tdb for normal (fast)
2715 usage, and occasionally empties the results into a transactional TDB.
2716 This kind of usage prioritizes performance over durability: as long as
2717 we are consistent, data can be lost.
2720 \begin_layout Standard
2721 This would be more neatly implemented inside tdb: a
2722 \begin_inset Quotes eld
2726 \begin_inset Quotes erd
2729 transaction commit (ie.
2730 syncless) which meant that data may be reverted on a crash.
2733 \begin_layout Subsubsection
2737 \begin_layout Standard
2741 \begin_layout Standard
2742 Unfortunately any transaction scheme which overwrites old data requires
2743 a sync before that overwrite to avoid the possibility of corruption.
2746 \begin_layout Standard
2747 It seems possible to use a scheme similar to that described in
2748 \begin_inset CommandInset ref
2750 reference "sub:TDB-Does-Not"
2754 ,where transactions are committed without overwriting existing data, and
2755 an array of top-level pointers were available in the header.
2756 If the transaction is
2757 \begin_inset Quotes eld
2761 \begin_inset Quotes erd
2764 then we would not need a sync at all: existing processes would pick up
2765 the new hash table and free list and work with that.
2768 \begin_layout Standard
2769 At some later point, a sync would allow recovery of the old data into the
2770 free lists (perhaps when the array of top-level pointers filled).
2771 On crash, tdb_open() would examine the array of top levels, and apply the
2772 transactions until it encountered an invalid checksum.
2775 \begin_layout Subsection
2776 Tracing Is Fragile, Replay Is External
2779 \begin_layout Standard
2780 The current TDB has compile-time-enabled tracing code, but it often breaks
2781 as it is not enabled by default.
2782 In a similar way, the ctdb code has an external wrapper which does replay
2783 tracing so it can coordinate cluster-wide transactions.
2786 \begin_layout Subsubsection
2788 \begin_inset CommandInset label
2790 name "replay-attribute"
2797 \begin_layout Standard
2798 Tridge points out that an attribute can be later added to tdb_open (see
2800 \begin_inset CommandInset ref
2802 reference "attributes"
2806 ) to provide replay/trace hooks, which could become the basis for this and
2807 future parallel transactions and snapshot support.
2810 \begin_layout Subsubsection
2814 \begin_layout Standard
2825 @Add status, some fixes, linked freelists.
2831 \change_deleted 0 1291204535
2833 \change_inserted 0 1291204533
2838 \change_inserted 0 1291204563
2842 \change_inserted 0 1291204572
2845 \change_inserted 0 1291204573
2850 \change_inserted 0 1291204588
2854 \change_inserted 0 1291204588
2857 \change_inserted 0 1291204631
2862 \change_inserted 0 1291204639
2866 \change_inserted 0 1291204640
2869 \change_inserted 0 1291204665
2872 \change_inserted 0 1291204671
2876 \change_inserted 0 1291204671
2879 \change_inserted 0 1291204673
2884 \change_inserted 0 1291204731
2888 \change_inserted 0 1291204732
2891 \change_inserted 0 1291204779
2896 \change_inserted 0 1291204830
2900 \change_inserted 0 1291204831
2903 \change_inserted 0 1291204834
2909 deal of churn; we are better to guarantee that the tdb_errcode is per-thread
2910 so the current programming model can be maintained.
2914 \change_inserted 0 1291204847
2918 \change_inserted 0 1291204847
2922 \change_inserted 0 1291204852
2925 \change_inserted 0 1291204881
2929 \change_inserted 0 1291204881
2932 \change_inserted 0 1291204885
2937 \change_inserted 0 1291204898
2941 \change_inserted 0 1291204898
2944 \change_inserted 0 1291204901
2949 \change_inserted 0 1291204908
2953 \change_inserted 0 1291204908
2956 \change_inserted 0 1291204908
2961 \change_inserted 0 1291204917
2965 \change_inserted 0 1291204917
2968 \change_inserted 0 1291204920
2973 \change_inserted 0 1291204927
2977 \change_inserted 0 1291204928
2980 \change_inserted 0 1291204942
2983 \change_inserted 0 1291205003
2987 \change_inserted 0 1291205004
2990 \change_inserted 0 1291205007
2992 \change_inserted 0 1291205019
2996 \change_inserted 0 1291205019
2999 \change_inserted 0 1291205023
3004 \change_inserted 0 1291205029
3008 \change_inserted 0 1291205029
3011 \change_inserted 0 1291206020
3016 \change_inserted 0 1291205043
3020 \change_inserted 0 1291205043
3023 \change_inserted 0 1291205057
3026 \change_inserted 0 1291205062
3030 \change_inserted 0 1291205062
3033 \change_inserted 0 1291205062
3038 \change_inserted 0 1291205072
3042 \change_inserted 0 1291205073
3045 \change_inserted 0 1291205073
3051 \change_deleted 0 1291204504
3055 \change_inserted 0 1291205079
3059 \change_inserted 0 1291205080
3062 \change_inserted 0 1291205080
3067 \change_inserted 0 1291205090
3071 is to divide the file into zones, and using a free list (or
3072 \change_inserted 0 1291205498
3074 \change_deleted 0 1291205497
3077 of free lists) for each.
3079 \change_inserted 0 1291205203
3083 \change_inserted 0 1291205358
3089 \begin_layout Standard
3091 \change_deleted 0 1291205198
3092 Note that this means we need to split the free lists when we expand the
3093 file; this is probably acceptable when we double the hash table size, since
3094 that is such an expensive operation already.
3095 In the case of increasing the file size, there is an optimization we can
3096 use: if we use M in the formula above as the file size rounded up to the
3097 next power of 2, we only need reshuffle free lists when the file size crosses
3098 a power of 2 boundary,
3102 reshuffling the free lists is trivial: we simply merge every consecutive
3108 Identify the correct
3109 \change_inserted 0 1291205366
3111 \change_deleted 0 1291205364
3118 \change_inserted 0 1291205372
3120 \change_deleted 0 1291205371
3123 (we didn't have a lock, sizes could have changed): relock if necessary.
3126 Place the freed entry in the list
3127 \change_deleted 0 1291205382
3134 \change_deleted 0 1291205403
3135 zone either the zone we last freed into, or based on a
3136 \begin_inset Quotes eld
3140 \begin_inset Quotes erd
3144 \change_inserted 0 1291205411
3145 free table; usually the previous one.
3149 \change_deleted 0 1291205432
3153 \begin_layout Enumerate
3155 \change_deleted 0 1291205428
3156 Re-check the zone: relock if necessary.
3161 unlock the list and try the next
3162 \change_inserted 0 1291205455
3164 \change_deleted 0 1291205452
3166 \change_inserted 0 1291205457
3170 \change_inserted 0 1291205476
3175 \change_inserted 0 1291205542
3179 \change_inserted 0 1291205591
3185 \begin_layout Standard
3187 \change_deleted 0 1291205539
3188 I anticipate that the number of entries in each free zone would be small,
3189 but it might be worth using one free entry to hold pointers to the others
3190 for cache efficiency.
3195 \begin_layout Standard
3197 \change_deleted 0 1291205534
3198 \begin_inset CommandInset label
3200 name "freelist-in-zone"
3204 If we want to avoid locking complexity (enlarging the free lists when we
3205 enlarge the file) we could place the array of free lists at the beginning
3207 This means existing array lists never move, but means that a record cannot
3208 be larger than a zone.
3209 That in turn implies that zones should be variable sized (say, power of
3210 2), which makes the question
3211 \begin_inset Quotes eld
3214 what zone is this record in?
3215 \begin_inset Quotes erd
3219 \begin_inset Quotes eld
3223 \begin_inset Quotes erd
3226 , but that's less common).
3227 It could be done with as few as 4 bits from the record header.
3231 \begin_layout Plain Layout
3233 \begin_inset Formula $2^{16+N*3}$
3236 means 0 gives a minimal 65536-byte zone, 15 gives the maximal
3237 \begin_inset Formula $2^{61}$
3241 Zones range in factor of 8 steps.
3242 Given the zone size for the zone the current record is in, we can determine
3243 the start of the zone.
3249 \change_inserted 0 1291205139
3254 \change_inserted 0 1291205758
3259 \change_deleted 0 1291205693
3266 \change_inserted 0 1291205704
3268 \change_deleted 0 1291205704
3275 \change_inserted 0 1291205725
3277 \change_deleted 0 1291205723
3281 \change_inserted 0 1291205753
3285 \change_inserted 0 1291205746
3287 \change_deleted 0 1291205749
3293 \begin_layout LyX-Code
3295 \change_inserted 0 1291205786
3299 \begin_layout LyX-Code
3301 \change_inserted 0 1291205788
3305 \change_inserted 0 1291205792
3307 \change_deleted 0 1291205790
3314 \change_deleted 0 1291205801
3318 \change_deleted 0 1291205811
3323 \change_deleted 0 1291205811
3328 \change_deleted 0 1291205808
3334 \change_deleted 0 1291205827
3335 We might want to take some bits from the used record's top_hash (and the
3336 free record which has 32 bits of padding to spare anyway) if we use variable
3339 \begin_inset CommandInset ref
3341 reference "freelist-in-zone"
3347 \change_inserted 0 1291205885
3348 Note that by limiting valid offsets to 56 bits, we can pack everything
3349 we need into 3 64-byte words, meaning our minimum record size is 8 bytes.
3352 \change_inserted 0 1291205886
3355 \change_inserted 0 1291205886
3360 \change_inserted 0 1291205894
3364 \change_inserted 0 1291205894
3367 \change_inserted 0 1291205902
3373 \change_deleted 0 1291204504
3377 \change_inserted 0 1291205910
3381 \change_inserted 0 1291205910
3384 \change_inserted 0 1291205914
3389 \change_inserted 0 1291205919
3393 \change_inserted 0 1291205919
3396 \change_inserted 0 1291205922
3401 \change_inserted 0 1291205929
3405 \change_inserted 0 1291205929
3408 \change_inserted 0 1291205929
3413 \change_inserted 0 1291205932
3417 \change_inserted 0 1291205933
3420 \change_inserted 0 1291205933
3425 \change_inserted 0 1291205944
3429 \change_inserted 0 1291205945
3432 \change_inserted 0 1291205948
3461 if older code (which doesn't understand the feature) writes to the database.Reco
3462 rd Headers Are Not Expandible
3471 Proposed SolutionThe first step is to remove all the current heuristics,
3472 as they obviously interact, then examine them once the lock contention
3476 is to divide the file into zones, and using a free list (or set of free
3483 Identify the correct zone.
3486 Re-check the zone (we didn't have a lock, sizes could have changed): relock
3490 Place the freed entry in the list for that zone.
3493 Pick a zone either the zone we last freed into, or based on a
3500 unlock the list and try the next zone.
3511 uint32_t magic : 16,
3519 uint32_t free_magic;
3522 uint64_t total_length;
3525 uint64_t prev, next;
3543 @Tracing attribute, talloc support.
3548 #LyX 1.6.5 created this file. For more info see http://www.lyx.org/
3552 \change_deleted 0 1283307542
3554 \change_inserted 0 1284423485
3559 \change_inserted 0 1284422789
3566 \change_inserted 0 1284016998
3571 \change_inserted 0 1284015637
3575 \change_inserted 0 1284015716
3578 \change_inserted 0 1284015906
3581 \change_inserted 0 1284015637
3584 \change_inserted 0 1284016114
3587 \change_inserted 0 1284016149
3590 \change_inserted 0 1284016639
3593 \change_inserted 0 1284016821
3596 \change_inserted 0 1284016803
3599 if older code (which doesn't understand the feature) writes to the database.
3600 \change_deleted 0 1284016101
3604 \begin_layout Subsection
3606 \change_inserted 0 1284015634
3607 Record Headers Are Not Expandible
3610 \change_inserted 0 1284015634
3613 \change_inserted 0 1284015634
3616 \change_inserted 0 1284422552
3619 \change_inserted 0 1284422568
3622 \change_inserted 0 1284422646
3625 \change_inserted 0 1284422656
3628 \change_inserted 0 1284423065
3631 \change_inserted 0 1284423042
3637 \change_inserted 0 1283336713
3644 \change_deleted 0 1283307675
3645 There are three details which become important:
3648 \begin_layout Enumerate
3650 \change_deleted 0 1283307675
3651 On encountering a full bucket, we use the next bucket.
3654 \begin_layout Enumerate
3656 \change_deleted 0 1283307675
3657 Extra hash bits are stored with the offset, to reduce comparisons.
3660 \begin_layout Enumerate
3662 \change_deleted 0 1283307675
3663 A marker entry is used on deleting an entry.
3666 \begin_layout Standard
3668 \change_deleted 0 1283307675
3669 The doubling of the table must be done under a transaction; we will not
3670 reduce it on deletion, so it will be an unusual case.
3671 It will either be placed at the head (other entries will be moved out the
3672 way so we can expand).
3673 We could have a pointer in the header to the current hashtable location,
3674 but that pointer would have to be read frequently to check for hashtable
3678 \begin_layout Standard
3680 \change_deleted 0 1283307675
3681 The locking for this is slightly more complex than the chained case; we
3682 currently have one lock per bucket, and that means we would need to expand
3683 the lock if we overflow to the next bucket.
3684 The frequency of such collisions will effect our locking heuristics: we
3685 can always lock more buckets than we need.
3688 \begin_layout Standard
3690 \change_deleted 0 1283307675
3691 One possible optimization is to only re-check the hash size on an insert
3694 \change_inserted 0 1283307770
3697 \change_inserted 0 1283336187
3700 \change_inserted 0 1283336586
3707 \change_deleted 0 1283336858
3711 \begin_layout Standard
3712 The first step is to remove all the current heuristics, as they obviously
3713 interact, then examine them once the lock contention is addressed.
3715 \change_inserted 0 1283336910
3719 \change_inserted 0 1283337052
3724 \change_inserted 0 1283309850
3728 \change_inserted 0 1283337216
3731 \change_inserted 0 1284424151
3740 \change_inserted 0 1283336739
3745 \change_inserted 0 1283337133
3749 \change_inserted 0 1283337139
3755 \change_inserted 0 1283337235
3762 \change_deleted 0 1284423472
3766 \begin_layout Standard
3770 \change_inserted 0 1284423891
3773 \change_deleted 0 1284423891
3776 \change_inserted 0 1284423901
3781 \change_inserted 0 1284423495
3785 \change_inserted 0 1284424201
3790 We could solve a small part of the problem by providing read-only transactions.
3792 \change_inserted 0 1284423555
3796 \change_inserted 0 1284423617
3799 \change_inserted 0 1284423719
3802 \change_inserted 0 1284423864
3805 \change_inserted 0 1284423850
3814 @Extension mechanism.
3819 \change_inserted 0 1284016854
3824 \change_inserted 0 1284016847
3828 \change_inserted 0 1283310945
3841 @Remove bogus footnote
3846 \change_inserted 0 1283307544
3855 @Moving hash table does not work.
3862 \begin_layout Plain Layout
3864 \change_inserted 0 1283336450
3865 If we make the hash offsets zone-relative, then this only restricts the
3866 zone size, not the overall database size.
3888 There are three details which become important:
3903 \begin_layout LyX-Code
3909 @Soft transaction commit
3914 \author "Rusty Russell,,,"
3917 \change_deleted 0 1280141199
3919 \change_inserted 0 1280141202
3925 \change_inserted 0 1280140902
3930 \change_inserted 0 1280140661
3934 \change_inserted 0 1280140703
3937 \change_inserted 0 1280708312
3940 \change_inserted 0 1280708400
3943 \change_inserted 0 1280140836
3946 \change_inserted 0 1280708255
3949 \change_inserted 0 1280708374
3952 \change_inserted 0 1280141181
3955 \change_inserted 0 1280141345
3976 @Transaction and freelist rethink.
3981 \author "Rusty Russell,,,"
3987 behavior of disallowing
3988 \change_inserted 0 1272940179
3991 transactions should become the default.
3993 \change_inserted 0 1272944650
3997 \change_inserted 0 1272944763
4006 \change_inserted 0 1273478114
4013 \change_deleted 0 1273469807
4015 \change_inserted 0 1273469810
4019 \change_deleted 0 1273469815
4022 to reduce contention.
4024 \change_inserted 0 1273470006
4028 \change_inserted 0 1273492055
4031 \change_inserted 0 1273483888
4037 \change_deleted 0 1272942055
4038 There are various ways to organize these lisys, but because we want to be
4039 able to quickly identify which free list an entry is in, and reduce the
4040 number of locks required for merging, we will use zoning (eg.
4041 each free list covers some fixed fraction of the file).
4043 \change_inserted 0 1273484187
4047 \change_deleted 0 1273484194
4049 \change_inserted 0 1273484194
4055 Identify the correct
4056 \change_deleted 0 1273482856
4058 \change_inserted 0 1273482857
4065 \change_inserted 0 1273482895
4069 \change_inserted 0 1273482863
4073 \change_inserted 0 1273482909
4077 \change_deleted 0 1273482885
4079 \change_inserted 0 1273482888
4082 lace the freed entry
4083 \change_deleted 0 1273492415
4085 \change_inserted 0 1273492415
4086 in the list for that zone
4091 Allocation is a little more complicated, as we
4092 \change_deleted 0 1273483240
4093 merge entries as we walk the list:
4094 \change_inserted 0 1273484250
4095 perform delayed coalescing at this point:
4101 \change_deleted 0 1273482955
4103 \change_inserted 0 1273482957
4107 \change_deleted 0 1273482962
4109 \change_inserted 0 1273482962
4113 \change_deleted 0 1273482966
4115 \change_inserted 0 1273482966
4122 \change_inserted 0 1273482980
4124 \change_deleted 0 1273482973
4128 \change_inserted 0 1273482982
4132 \change_inserted 0 1273483084
4138 \begin_layout Enumerate
4140 \change_deleted 0 1273492155
4142 \change_inserted 0 1273492159
4145 remove it from the list and return it.
4148 \begin_layout Enumerate
4150 \change_inserted 0 1273492206
4151 coalesce entries in the list.
4152 \change_deleted 0 1273492200
4153 examine the entry to the right of it in the file.
4158 \begin_layout Enumerate
4160 \change_deleted 0 1273492200
4161 If that entry is in a different list, lock that list too.
4164 \begin_layout Enumerate
4166 \change_deleted 0 1273492200
4167 If we had to place a new lock, re-check that the entry is free.
4170 \begin_layout Enumerate
4172 \change_deleted 0 1273492200
4173 Remove that entry from its free list and expand this entry to cover it.
4176 \begin_layout Enumerate
4178 \change_deleted 0 1273485554
4183 \begin_layout Enumerate
4185 \change_inserted 0 1273485311
4186 If there was no entry large enough, unlock the list and try the next zone.
4190 \change_deleted 0 1273483646
4191 Repeat step 3 with each entry in the list.
4197 \change_deleted 0 1273483668
4198 Unlock the list and repeat step 2 with the next list.
4204 \change_deleted 0 1273483671
4206 \change_inserted 0 1273483671
4209 satisfies, expand the file.
4212 This optimizes rapid insert/delete of free list entries
4213 \change_inserted 0 1273485794
4214 by not coalescing them all the time.
4215 \change_deleted 0 1273483685
4216 , and allows us to get rid of the tailer altogether
4220 \change_inserted 0 1273492299
4223 \change_deleted 0 1273476840
4225 \begin_inset Quotes eld
4229 \begin_inset Quotes erd
4232 free entries is more difficult: the 25% overhead works in practice for
4233 ldb because indexes tend to expand by one record at a time.
4234 This can be resolved by having an
4235 \begin_inset Quotes eld
4239 \begin_inset Quotes erd
4242 bit in the header to note entries that have previously expanded, and allocating
4243 more space for them.
4245 \begin_inset Quotes eld
4249 \begin_inset Quotes erd
4252 algorithm should be implemented or first-fit used is still unknown: we
4253 will determine this once these other ideas are implemented.
4254 \change_inserted 0 1273483750
4258 \begin_layout Standard
4260 \change_inserted 0 1273492450
4263 \change_inserted 0 1273470441
4266 \change_inserted 0 1273476556
4269 \change_inserted 0 1273470423
4275 \change_inserted 0 1273476847
4278 \change_inserted 0 1273476886
4281 \change_inserted 0 1273477233
4284 \change_inserted 0 1273477534
4287 \change_inserted 0 1273482700
4290 \change_inserted 0 1273478079
4293 \change_inserted 0 1273477839
4296 \change_inserted 0 1273477925
4299 \change_inserted 0 1273477925
4302 \change_inserted 0 1273477925
4305 \change_inserted 0 1273477925
4308 \change_inserted 0 1273477925
4311 \change_inserted 0 1273477925
4314 \change_inserted 0 1273477925
4317 \change_inserted 0 1273477925
4320 \change_inserted 0 1273477925
4323 \change_inserted 0 1273477925
4326 \change_inserted 0 1273477925
4329 \change_inserted 0 1273477925
4332 \change_inserted 0 1273477925
4335 \change_inserted 0 1273477925
4338 \change_inserted 0 1273477925
4341 \change_inserted 0 1273477925
4344 \change_inserted 0 1273477925
4347 \change_inserted 0 1273477925
4350 \change_inserted 0 1273492522
4353 \change_inserted 0 1273492530
4356 \change_inserted 0 1273492546
4359 \change_inserted 0 1273478239
4362 \change_inserted 0 1273479960
4365 \change_inserted 0 1273480265
4368 \change_inserted 0 1273480354
4371 \change_inserted 0 1273478968
4374 \change_inserted 0 1273492604
4377 \change_inserted 0 1273479572
4383 \change_inserted 0 1273480282
4386 \change_inserted 0 1273478931
4389 \change_inserted 0 1273481549
4392 \change_inserted 0 1273481557
4395 \change_inserted 0 1273480307
4398 \change_inserted 0 1273480335
4401 \change_inserted 0 1273479897
4404 \change_inserted 0 1273479653
4407 \change_inserted 0 1273480371
4410 \change_inserted 0 1273480464
4413 \change_inserted 0 1273480399
4416 \change_inserted 0 1273480425
4419 \change_inserted 0 1273480453
4422 \change_inserted 0 1273480455
4425 \change_inserted 0 1273480450
4428 \change_inserted 0 1273480452
4430 \change_inserted 0 1273478830
4434 \change_deleted 0 1273481604
4435 In theory, we could get away with 2: one after we write the new data, and
4436 one to somehow atomically change over to it.
4437 \change_inserted 0 1273481632
4440 \change_inserted 0 1273481724
4443 \change_inserted 0 1273481713
4446 \change_inserted 0 1273481717
4449 \change_inserted 0 1273481730
4452 \change_inserted 0 1273481736
4455 \change_inserted 0 1273481744
4458 \change_inserted 0 1273481748
4461 \change_inserted 0 1273482185
4464 \change_inserted 0 1273482259
4467 \change_deleted 0 1273481848
4469 Trying to rewrite the transaction code is a separate experiment, which
4470 I encourage someone else to do.
4471 At some point you say
4472 \begin_inset Quotes eld
4476 \begin_inset Quotes erd
4482 \begin_layout Standard
4484 \change_deleted 0 1273481848
4485 But as a thought experiment:
4490 \begin_layout Standard
4492 \change_deleted 0 1273481788
4493 Say there was a pointer in the header which said where the hash table and
4494 free list tables were, and that no blocks were labeled with whether they
4495 were free or not (it had to be derived from what list they were in).
4496 We could create new hash table and free list in some free space, and populate
4497 it as we want the post-committed state to look.
4498 Then we sync, then we switch the offset in the header, then we sync again.
4501 \begin_layout Standard
4503 \change_deleted 0 1273481788
4504 This would not allow arbitrary changes to the database, such as tdb_repack
4505 does, and would require more space (since we have to preserve the current
4506 and future entries at once).
4507 If we used hash trees rather than one big hash table, we might only have
4508 to rewrite some sections of the hash, too.
4509 \change_inserted 0 1273481854
4513 \begin_layout Standard
4515 \change_inserted 0 1273482102
4518 \change_inserted 0 1273482061
4521 \change_inserted 0 1273482063
4524 \change_inserted 0 1273482072
4527 \change_inserted 0 1273482139
4530 \change_inserted 0 1273482364
4533 \change_inserted 0 1273482163
4536 \change_inserted 0 1273482493
4539 \change_inserted 0 1273482536
4545 \change_inserted 0 1273482641
4548 \change_inserted 0 1273481827
4552 \change_inserted 0 1273481829
4555 implement snapshots using a similar method
4556 \change_deleted 0 1273481838
4558 \change_inserted 0 1273481840
4561 using multiple different hash tables/free tables.
4567 @After first feedback (Ronnie & Volker)
4573 The free list should be split into multiple lists to reduce contention.
4578 The algorithm for freeing is simple:
4581 Identify the correct free list.
4584 Lock the list, and place the freed entry at the head.
4587 Allocation is a little more complicated, as we merge entries as we walk
4591 Pick a free list; either the list we last freed onto, or based on a
4597 If the top entry is well-sized, remove it from the list and return it.
4600 Otherwise, examine the entry to the right of it in the file.
4611 If no list satisfies, expand the file.
4614 This optimizes rapid insert/delete of free list entries, and allows us to
4615 get rid of the tailer altogether.
4619 \change_inserted 0 1272941474
4622 \change_inserted 0 1272942759
4623 There are various ways to organize these lists, but because we want to be
4624 able to quickly identify which free list an entry is in, and reduce the
4625 number of locks required for merging, we will use zoning (eg.
4626 each of the N free lists in a tdb file of size M covers a fixed fraction
4628 Note that this means we need to reshuffle the free lists when we expand
4629 the file; this is probably acceptable when we double the hash table size,
4630 since that is such an expensive operation already.
4631 In the case of increasing the file size, there is an optimization we can
4632 use: if we use M in the formula above as the file size rounded up to the
4633 next power of 2, we only need reshuffle free lists when the file size crosses
4634 a power of 2 boundary,
4638 reshuffling the free lists is trivial: we simply merge every consecutive
4652 We could implement snapshots using a similar method to the above, only using
4653 multiple different hash tables/free tables.
4664 #LyX 1.6.4 created this file. For more info see http://www.lyx.org/
4667 \tracking_changes false
4668 \output_changes false
4672 behavior of disallowing transactions should become the default.
4677 The algorithm for freeing is simple: