libio/dbz/dbz.3z

   1 .TH DBZ 3Z "3 Feb 1991"
   2 .BY "C News"
   3 .SH NAME
   4 dbminit, fetch, store, dbmclose \- somewhat dbm-compatible database routines
   5 .br
   6 dbzfresh, dbzagain, dbzfetch, dbzstore \- database routines
   7 .br
   8 dbzsync, dbzsize, dbzincore, dbzcancel, dbzdebug \- database routines
   9 .SH SYNOPSIS
  10 .nf
  11 .B #include <dbz.h>
  12 .PP
  13 .B dbminit(base)
  14 .B char *base;
  15 .PP
  16 .B datum
  17 .B fetch(key)
  18 .B datum key;
  19 .PP
  20 .B store(key, value)
  21 .B datum key;
  22 .B datum value;
  23 .PP
  24 .B dbmclose()
  25 .PP
  26 .B dbzfresh(base, size, fieldsep, cmap, tagmask)
  27 .B char *base;
  28 .B long size;
  29 .B int fieldsep;
  30 .B int cmap;
  31 .B long tagmask;
  32 .PP
  33 .B dbzagain(base, oldbase)
  34 .B char *base;
  35 .B char *oldbase;
  36 .PP
  37 .B datum
  38 .B dbzfetch(key)
  39 .B datum key;
  40 .PP
  41 .B dbzstore(key, value)
  42 .B datum key;
  43 .B datum value;
  44 .PP
  45 .B dbzsync()
  46 .PP
  47 .B long
  48 .B dbzsize(nentries)
  49 .B long nentries;
  50 .PP
  51 .B dbzincore(newvalue)
  52 .PP
  53 .B dbzcancel()
  54 .PP
  55 .B dbzdebug(newvalue)
  56 .SH DESCRIPTION
  57 These functions provide an indexing system for rapid random access to a
  58 text file (the
  59 .I base
  60 .IR file ).
  61 Subject to certain constraints, they are call-compatible with
  62 .IR dbm (3),
  63 although they also provide some extensions.
  64 (Note that they are
  65 .I not
  66 file-compatible with
  67 .I dbm
  68 or any variant thereof.)
  69 .PP
  70 In principle,
  71 .I dbz
  72 stores key-value pairs, where both key and value are arbitrary sequences
  73 of bytes, specified to the functions by
  74 values of type
  75 .IR datum ,
  76 typedefed in the header file to be a structure with members
  77 .I dptr
  78 (a value of type
  79 .I char *
  80 pointing to the bytes)
  81 and
  82 .I dsize
  83 (a value of type
  84 .I int
  85 indicating how long the byte sequence is).
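.PP
As a minimal sketch of how a key might be packaged for these functions,
the fragment below declares a stand-in for the
.I datum
typedef, with member names and types as described above, and a
hypothetical helper,
.IR datum_from_string ,
which is not part of
.IR dbz .
Counting the terminating NUL in the key is an assumption made here.
.nf
```c
#include <string.h>

/* Stand-in for the typedef that <dbz.h> supplies; the member names and
 * types follow the description above.  Drop it when using the real header. */
typedef struct {
    char *dptr;     /* points to the bytes */
    int dsize;      /* number of bytes */
} datum;

/* Hypothetical helper: wrap a C string as a datum.  The terminating NUL
 * is counted as part of the key, which is a convention assumed here. */
datum datum_from_string(char *s)
{
    datum d;

    d.dptr = s;
    d.dsize = (int)strlen(s) + 1;
    return d;
}
```
.fi
.PP
The helper copies nothing, so the string must outlive the datum.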
  86 .PP
  87 In practice,
  88 .I dbz
  89 is more restricted than
  90 .IR dbm .
  91 A
  92 .I dbz
  93 database
  94 must be an index into a base file,
  95 with the database
  96 .IR value s
  97 being
  98 .IR fseek (3)
  99 offsets into the base file.
 100 Each such
 101 .I value
 102 must ``point to'' a place in the base file where the corresponding
 103 .I key
 104 sequence is found.
 105 A key can be no longer than
 106 .SM DBZMAXKEY
 107 (a constant defined in the header file) bytes.
 108 No key can be an initial subsequence of another,
 109 which in most applications requires that keys be
 110 either bracketed or terminated in some way (see the
 111 discussion of the
 112 .I fieldsep
 113 parameter of
 114 .IR dbzfresh ,
 115 below,
 116 for a fine point on terminators).
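.PP
The initial-subsequence rule can be illustrated with a small check.
The function below is a hypothetical helper, not part of
.IR dbz ;
it only shows why terminating each key removes the problem.
.nf
```c
#include <string.h>

/* Hypothetical check, not part of dbz: returns 1 if the alen bytes at a
 * are an initial subsequence of the blen bytes at b, else 0. */
int is_initial_subsequence(const char *a, int alen, const char *b, int blen)
{
    if (alen > blen)
        return 0;
    return memcmp(a, b, (size_t)alen) == 0;
}
```
.fi
.PP
With unterminated keys, "news" (4 bytes) is an initial subsequence of
"newsgroup" (9 bytes).
Once each key's terminating NUL is counted in its length, the two differ
at the fifth byte and the conflict disappears.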
 117 .PP
 118 .I Dbminit
 119 opens a database,
 120 an index into the base file
 121 .IR base ,
 122 consisting of files
 123 .IB base .dir
 124 and
 125 .IB base .pag
 126 which must already exist.
 127 (If the database is new, they should be zero-length files.)
 128 Subsequent accesses go to that database until
 129 .I dbmclose
 130 is called to close the database.
 131 The base file need not exist at the time of the
 132 .IR dbminit ,
 133 but it must exist before accesses are attempted.
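.PP
One way to satisfy the requirement that the database files already exist
is sketched below.
The helper name
.I ensure_db_files
and the fixed-size path buffer are assumptions made for illustration;
only standard
.I stdio
calls are used.
.nf
```c
#include <stdio.h>

/* Hypothetical helper, not part of dbz: make sure base.dir and base.pag
 * exist.  Append mode creates a missing file and leaves an existing one
 * untouched.  Returns 0 on success, -1 on failure. */
int ensure_db_files(const char *base)
{
    static const char *const suffix[] = { ".dir", ".pag" };
    char path[1024];
    int i;

    for (i = 0; i < 2; i++) {
        FILE *f;
        int n = snprintf(path, sizeof path, "%s%s", base, suffix[i]);

        if (n < 0 || (size_t)n >= sizeof path)
            return -1;          /* name too long for the buffer */
        f = fopen(path, "a");
        if (f == NULL)
            return -1;
        if (fclose(f) != 0)
            return -1;
    }
    return 0;
}
```
.fi
.PP
Append mode is chosen so that calling the helper on an existing database
does not truncate it.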
 134 .PP
 135 .I Fetch
 136 searches the database for the specified
 137 .IR key ,
 138 returning the corresponding
 139 .IR value
 140 if any.
 141 .I Store
 142 stores the
 143 .IR key - value
 144 pair in the database.
 145 .I Store
 146 will fail unless the database files are writeable.
 147 See below for a complication arising from case mapping.
 148 .PP
 149 .I Dbzfresh
 150 is a variant of
 151 .I dbminit
 152 for creating a new database with more control over details.
 153 Unlike for
 154 .IR dbminit ,
 155 the database files need not exist:
 156 they will be created if necessary,
 157 and truncated in any case.
 158 .PP
 159 .IR Dbzfresh 's
 160 .I size
 161 parameter specifies the size of the first hash table within the database,
 162 in key-value pairs.
 163 Performance will be best if
 164 .I size
 165 is a prime number and
 166 the number of key-value pairs stored in the database does not exceed
 167 about 2/3 of
 168 .IR size .
 169 (The
 170 .I dbzsize
 171 function, given the expected number of key-value pairs,
 172 will suggest a database size that meets these criteria.)
 173 Assuming that an
 174 .I fseek
 175 offset is 4 bytes,
 176 the
 177 .B .pag
 178 file will be
 179 .RI 4* size
 180 bytes
 181 (the
 182 .B .dir
 183 file is tiny and roughly constant in size)
 184 until
 185 the number of key-value pairs exceeds about 80% of
 186 .IR size .
 187 (Nothing awful will happen if the database grows beyond 100% of
 188 .IR size ,
 189 but accesses will slow down somewhat and the
 190 .B .pag
 191 file will grow somewhat.)
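.PP
The sizing guidance above can be worked through in code.
The functions below are hypothetical and are not the algorithm used by
.IR dbzsize ;
they simply pick the smallest prime that keeps the expected number of
entries at or below 2/3 of the table, and compute the initial
.B .pag
size under the stated assumption of a 4-byte offset.
.nf
```c
/* Hypothetical helpers, not the library's own code. */

/* Trial-division primality test; adequate for table sizes. */
int is_prime(long n)
{
    long d;

    if (n < 2)
        return 0;
    for (d = 2; d <= n / d; d++)
        if (n % d == 0)
            return 0;
    return 1;
}

/* Smallest prime p such that nentries is at most 2/3 of p. */
long suggest_size(long nentries)
{
    long p = (nentries * 3 + 1) / 2;    /* ceiling of 1.5 * nentries */

    while (!is_prime(p))
        p++;
    return p;
}

/* Initial .pag size in bytes, assuming a 4-byte fseek offset. */
long pag_bytes(long size)
{
    return 4 * size;
}
```
.fi
.PP
For 10 expected entries this sketch yields a size of 17 and an initial
.B .pag
file of 68 bytes.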
 192 .PP
 193 .IR Dbzfresh 's
 194 .I fieldsep
 195 parameter specifies the field separator in the base file.
 196 If this is not
 197 NUL (0), and the last character of a
 198 .I key
 199 argument is NUL, that NUL compares equal to either a NUL or a
 200 .I fieldsep
 201 in the base file.
 202 This permits use of NUL to terminate key strings without requiring that
 203 NULs appear in the base file.
 204 The
 205 .I fieldsep
 206 of a database created with
 207 .I dbminit
 208 is the horizontal-tab character.
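.PP
The comparison rule for a trailing NUL can be sketched as follows.
This is a hypothetical illustration, not the code
.I dbz
uses; the function name and argument order are assumptions.
.nf
```c
/* Hypothetical illustration of the comparison rule, not dbz's own code.
 * key/keylen is the caller's key; text points at the candidate position
 * in the base file and must have at least keylen readable bytes.
 * Returns 1 on a match, 0 otherwise. */
int key_matches(const char *key, int keylen, const char *text, int fieldsep)
{
    int i;

    for (i = 0; i < keylen; i++) {
        if (key[i] == text[i])
            continue;
        /* A trailing NUL in the key also matches the field separator. */
        if (i == keylen - 1 && key[i] == 0 && fieldsep != 0
            && text[i] == (char)fieldsep)
            continue;
        return 0;
    }
    return 1;
}
```
.fi
.PP
With a tab as the field separator, a NUL-terminated key matches a base
file line in which the key is followed by a tab;
with a
.I fieldsep
of NUL the same comparison fails, since only a literal NUL would match.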
 209 .PP
 210 For use in news systems, various forms of case mapping (e.g. uppercase to
 211 lowercase) in keys are available.
 212 The
 213 .I cmap
 214 parameter to
 215 .I dbzfresh
 216 is a single character specifying which of several mapping algorithms to use.
 217 Available algorithms are:
 218 .RS
 219 .TP
 220 .B 0
 221 case-sensitive:  no case mapping
 222 .TP
 223 .B B
 224 same as
 225 .B 0
 226 .TP
 227 .B NUL
 228 same as
 229 .B 0
 230 .TP
 231 .B =
 232 case-insensitive:  uppercase and lowercase equivalent
 233 .TP
 234 .B b
 235 same as
 236 .B =
 237 .TP
 238 .B C
 239 RFC822 message-ID rules, case-sensitive before `@' (with certain exceptions)
 240 and case-insensitive after
 241 .TP
 242 .B ?
 243 whatever the local default is, normally
 244 .B C
 245 .RE
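.PP
A simplified reading of the table above is sketched below.
The function
.I map_key
is hypothetical and is not the mapper built into
.IR dbz :
it ignores the exceptions mentioned for
.BR C ,
splits at the first `@' (an assumption), and treats
.B ?
as
.BR C .
.nf
```c
#include <ctype.h>
#include <string.h>

/* Hypothetical, simplified mapper; not dbz's own code.  Copies len bytes
 * from src to dst, mapped according to cmap.  The exceptions to the
 * RFC822 rule are ignored, and '?' is assumed to mean 'C'.
 * Returns 0 on success, -1 for an unknown cmap. */
int map_key(char *dst, const char *src, int len, int cmap)
{
    int i, start;       /* bytes from index start onward are lowercased */

    switch (cmap) {
    case 0: case '0': case 'B':
        start = len;                    /* no mapping at all */
        break;
    case '=': case 'b':
        start = 0;                      /* map the whole key */
        break;
    case '?': case 'C': {
        const char *at = memchr(src, '@', (size_t)len);

        start = (at == NULL) ? len : (int)(at - src) + 1;
        break;
    }
    default:
        return -1;
    }
    for (i = 0; i < len; i++)
        dst[i] = (i < start) ? src[i]
                             : (char)tolower((unsigned char)src[i]);
    return 0;
}
```
.fi
.PP
Under this sketch the key <Foo@Bar.COM> maps to <Foo@bar.com> with
.BR C ,
to <foo@bar.com> with
.BR = ,
and is left unchanged with
.BR 0 .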
 246 .PP
 247 Mapping algorithm
 248 .B 0
 249 (no mapping) is faster than the others and is overwhelmingly the correct
 250 choice for most applications.
 251 Unless compatibility constraints interfere, it is more efficient to pre-map
 252 the keys, storing mapped keys in the base file, than to have
 253 .I dbz
 254 do the mapping on every search.
 255 .PP
 256 For historical reasons,
 257 .I fetch
 258 and
 259 .I store
 260 expect their
 261 .I key
 262 arguments to be pre-mapped, but expect unmapped keys in the base file.
 263 .I Dbzfetch
 264 and
 265 .I dbzstore
 266 do the same jobs but handle all case mapping internally,
  267 so the caller need not worry about it.
 268 .PP
 269 .I Dbz
 270 stores only the database
 271 .IR value s
 272 in its files, relying on reference to the base file to confirm a hit on a key.
 273 References to the base file can be minimized, greatly speeding up searches,
 274 if a little bit of information about the keys can be stored in the
 275 .I dbz
 276 files.
 277 This is ``free'' if there are some unused bits in an
 278 .I fseek
 279 offset,
 280 so that the offset can be
 281 .I tagged
 282 with some information about the key.
 283 The
 284 .I tagmask
 285 parameter of
 286 .I dbzfresh
 287 allows specifying the location of unused bits.
 288 .I Tagmask
 289 should be a mask with
 290 one group of
 291 contiguous
 292 .B 1
 293 bits.
 294 The bits in the mask should
 295 be unused (0) in
 296 .I most
 297 offsets.
 298 The bit immediately above the mask (the
 299 .I flag
 300 bit) should be unused (0) in
 301 .I all
 302 offsets;
 303 .I (dbz)store
 304 will reject attempts to store a key-value pair in which the
 305 .I value
 306 has the flag bit on.
 307 Apart from this restriction, tagging is invisible to the user.
 308 As a special case, a
 309 .I tagmask
 310 of 1 means ``no tagging'', for use with enormous base files or
 311 on systems with unusual offset representations.
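.PP
The bit layout described above can be made concrete with a few helpers.
These are hypothetical and only illustrate the arithmetic;
how
.I dbz
derives the tag itself is not shown, and the special
.I tagmask
values are not handled.
.nf
```c
/* Hypothetical helpers illustrating the bit layout; not dbz's own code.
 * mask must be a single group of contiguous 1 bits; the special values
 * 0 (local default) and 1 (no tagging) are not handled here. */

/* The flag bit is the bit immediately above the mask. */
unsigned long flag_bit(unsigned long mask)
{
    unsigned long lowest = mask & (~mask + 1UL);    /* lowest 1 bit */

    return mask + lowest;
}

/* Nonzero if the offset leaves the mask bits free, so it can carry a tag. */
int can_tag(unsigned long offset, unsigned long mask)
{
    return (offset & (mask | flag_bit(mask))) == 0;
}

/* Nonzero if the offset has the flag bit off, so a store would accept it. */
int storable(unsigned long offset, unsigned long mask)
{
    return (offset & flag_bit(mask)) == 0;
}
```
.fi
.PP
For a mask of 0x7f000000 the flag bit is 0x80000000, and every offset
below 0x01000000 (16MB) leaves the mask bits free.
Larger offsets with the flag bit off are still storable;
they just cannot be tagged.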
 312 .PP
 313 A
 314 .I size
 315 of 0
 316 given to
 317 .I dbzfresh
 318 is synonymous with the local default;
  319 the normal default is suitable for tables of 90,000-100,000
 320 key-value pairs.
 321 A
 322 .I cmap
 323 of 0 (NUL) is synonymous with the character
 324 .BR 0 ,
 325 signifying no case mapping
 326 (note that the character
 327 .B ?
 328 specifies the local default mapping,
 329 normally
 330 .BR C ).
 331 A
 332 .I tagmask
 333 of 0 is synonymous with the local default tag mask,
 334 normally 0x7f000000 (specifying the top bit in a 32-bit offset
 335 as the flag bit, and the next 7 bits as the mask,
 336 which is suitable for base files up to circa 24MB).
 337 Calling
 338 .I dbminit(name)
 339 with the database files empty is equivalent to calling
 340 .IR dbzfresh(name,0,'\et','?',0) .
 341 .PP
 342 When databases are regenerated periodically, as in news,
 343 it is simplest to pick the parameters for a new database based on the old one.
 344 This also permits some memory of past sizes of the old database, so that
 345 a new database size can be chosen to cover expected fluctuations.
 346 .I Dbzagain
 347 is a variant of
 348 .I dbminit
 349 for creating a new database as a new generation of an old database.
 350 The database files for
 351 .I oldbase
 352 must exist.
 353 .I Dbzagain
 354 is equivalent to calling
 355 .I dbzfresh
 356 with the same field separator, case mapping, and tag mask as the old database,
 357 and a
 358 .I size
 359 equal to the result of applying
 360 .I dbzsize
 361 to the largest number of entries in the
 362 .I oldbase
 363 database and its previous 10 generations.
 364 .PP
 365 When many accesses are being done by the same program,
 366 .I dbz
 367 is massively faster if its first hash table is in memory.
 368 If an internal flag is 1,
 369 an attempt is made to read the table in when
 370 the database is opened, and
 371 .I dbmclose
 372 writes it out to disk again (if it was read successfully and
 373 has been modified).
 374 .I Dbzincore
 375 sets the flag to
 376 .I newvalue
 377 (which should be 0 or 1)
 378 and returns the previous value;
 379 this does not affect the status of a database that has already been opened.
 380 The default is 0.
 381 The attempt to read the table in may fail due to memory shortage;
 382 in this case
 383 .I dbz
 384 quietly falls back on its default behavior.
 385 .IR Store s
 386 to an in-memory database are not (in general) written out to the file
 387 until
 388 .IR dbmclose
 389 or
 390 .IR dbzsync ,
 391 so if robustness in the presence of crashes
 392 or concurrent accesses
 393 is crucial, in-memory databases
 394 should probably be avoided.
 395 .PP
 396 .I Dbzsync
 397 causes all buffers etc. to be flushed out to the files.
 398 It is typically used as a precaution against crashes or concurrent accesses
 399 when a
 400 .IR dbz -using
 401 process will be running for a long time.
 402 It is a somewhat expensive operation,
 403 especially
 404 for an in-memory database.
 405 .PP
 406 .I Dbzcancel
 407 cancels any pending writes from buffers.
  408 This is typically useful only for in-memory databases, since writes are
 409 otherwise done immediately.
 410 Its main purpose is to let a child process, in the wake of a
 411 .IR fork ,
 412 do a
 413 .I dbmclose
 414 without writing its parent's data to disk.
 415 .PP
 416 If
 417 .I dbz
 418 has been compiled with debugging facilities available (which makes it
 419 bigger and a bit slower),
 420 .I dbzdebug
 421 alters the value (and returns the previous value) of an internal flag
 422 which (when 1; default is 0) causes
 423 verbose and cryptic debugging output on standard output.
 424 .PP
 425 Concurrent reading of databases is fairly safe,
 426 but there is no (inter)locking,
 427 so concurrent updating is not.
 428 .PP
 429 The database files include a record of the byte order of the processor
 430 creating the database, and accesses by processors with different byte
 431 order will work, although they will be slightly slower.
 432 Byte order is preserved by
 433 .IR dbzagain .
 434 However,
 435 agreement on the size and internal structure of an
 436 .I fseek
 437 offset is necessary, as is consensus on
 438 the character set.
 439 .PP
 440 An open database occupies three
 441 .I stdio
 442 streams and their corresponding file descriptors;
 443 a fourth is needed for an in-memory database.
 444 Memory consumption is negligible (except for
 445 .I stdio
 446 buffers) except for in-memory databases.
 447 .SH SEE ALSO
 448 dbz(1), dbm(3)
 449 .SH DIAGNOSTICS
 450 Functions returning
 451 .I int
 452 values return 0 for success, \-1 for failure.
 453 Functions returning
 454 .I datum
 455 values return a value with
 456 .I dptr
 457 set to NULL for failure.
 458 .I Dbminit
 459 attempts to have
 460 .I errno
 461 set plausibly on return, but otherwise this is not guaranteed.
 462 An
 463 .I errno
 464 of
 465 .B EDOM
 466 from
 467 .I dbminit
 468 indicates that the database did not appear to be in
 469 .I dbz
 470 format.
 471 .SH HISTORY
 472 The original
 473 .I dbz
 474 was written by
 475 Jon Zeeff (zeeff@b-tech.ann-arbor.mi.us).
 476 Later contributions by David Butler and Mark Moraes.
 477 Extensive reworking,
 478 including this documentation,
 479 by Henry Spencer (henry@zoo.toronto.edu) as
 480 part of the C News project.
 481 Hashing function by Peter Honeyman.
 482 .SH BUGS
 483 The
 484 .I dptr
 485 members of returned
 486 .I datum
 487 values point to static storage which is overwritten by later calls.
 488 .PP
 489 Unlike
 490 .IR dbm ,
 491 .I dbz
 492 will misbehave if an existing key-value pair is `overwritten' by
 493 a new
 494 .I (dbz)store
 495 with the same key.
 496 The user is responsible for avoiding this by using
 497 .I (dbz)fetch
 498 first to check for duplicates;
 499 an internal optimization remembers the result of the
 500 first search so there is minimal overhead in this.
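.PP
The fetch-before-store discipline can be wrapped in one function, as in
the sketch below.
The wrapper
.I store_if_absent
is hypothetical; it takes the fetch and store functions as parameters so
that the fragment is self-contained.
The
.I datum
typedef is a stand-in for the one in the header file, and the
.I toy_
functions are an in-memory substitute used only for demonstration.
In real use one would presumably pass
.I dbzfetch
and
.IR dbzstore .
.nf
```c
#include <stddef.h>
#include <string.h>

/* Stand-in for the typedef from <dbz.h>. */
typedef struct {
    char *dptr;
    int dsize;
} datum;

typedef datum (*fetch_fn)(datum key);
typedef int (*store_fn)(datum key, datum value);

/* Hypothetical wrapper: store only if the key is not already present.
 * Returns 1 if stored, 0 if the key exists, -1 if the store failed. */
int store_if_absent(datum key, datum value, fetch_fn fetch, store_fn store)
{
    datum found = fetch(key);

    if (found.dptr != NULL)
        return 0;               /* duplicate: do not overwrite */
    return store(key, value) == 0 ? 1 : -1;
}

/* Toy in-memory table, for demonstration only. */
#define TOY_MAX 8
static datum toy_keys[TOY_MAX], toy_vals[TOY_MAX];
static int toy_n = 0;

datum toy_fetch(datum key)
{
    datum none;
    int i;

    none.dptr = NULL;
    none.dsize = 0;
    for (i = 0; i < toy_n; i++)
        if (toy_keys[i].dsize == key.dsize
            && memcmp(toy_keys[i].dptr, key.dptr, (size_t)key.dsize) == 0)
            return toy_vals[i];
    return none;                /* dptr NULL signals failure */
}

int toy_store(datum key, datum value)
{
    if (toy_n >= TOY_MAX)
        return -1;
    toy_keys[toy_n] = key;
    toy_vals[toy_n] = value;
    toy_n++;
    return 0;
}
```
.fi
.PP
A second call with the same key returns 0 without calling the store
function, which is the behavior the caller is asked to ensure.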
 501 .PP
 502 Waiting until after
 503 .I dbminit
 504 to bring the base file into existence
 505 will fail if
 506 .IR chdir (2)
 507 has been used meanwhile.
 508 .PP
 509 The RFC822 case mapper implements only a first approximation to the
 510 hideously-complex RFC822 case rules.
 511 .PP
 512 The prime finder in
 513 .I dbzsize
 514 is not particularly quick.
 515 .PP
 516 Should implement the
 517 .I dbm
 518 functions
 519 .IR delete ,
 520 .IR firstkey ,
 521 and
 522 .IR nextkey .
 523 .PP
 524 On C implementations which trap integer overflow,
 525 .I dbz
 526 will refuse to
 527 .I (dbz)store
 528 an
 529 .I fseek
 530 offset equal to the greatest
 531 representable
 532 positive number,
 533 as this would cause overflow in the biased representation used.
 534 .PP
 535 .I Dbzagain
 536 perhaps ought to notice when many offsets
 537 in the old database were
 538 too big for
 539 tagging, and shrink the tag mask to match.
 540 .PP
 541 Marking
 542 .IR dbz 's
 543 file descriptors
 544 .RI close-on- exec
 545 would be a better approach to the problem
 546 .I dbzcancel
 547 tries to address, but that's harder to do portably.