usr/src/cmd/filesync/README

   1 #
   2 # CDDL HEADER START
   3 #
   4 # The contents of this file are subject to the terms of the
   5 # Common Development and Distribution License, Version 1.0 only
   6 # (the "License").  You may not use this file except in compliance
   7 # with the License.
   8 #
   9 # You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
  10 # or http://www.opensolaris.org/os/licensing.
  11 # See the License for the specific language governing permissions
  12 # and limitations under the License.
  13 #
  14 # When distributing Covered Code, include this CDDL HEADER in each
  15 # file and include the License file at usr/src/OPENSOLARIS.LICENSE.
  16 # If applicable, add the following below this CDDL HEADER, with the
  17 # fields enclosed by brackets "[]" replaced with your own identifying
  18 # information: Portions Copyright [yyyy] [name of copyright owner]
  19 #
  20 # CDDL HEADER END
  21 #
  22 # Copyright (c) 1995 Sun Microsystems, Inc.  All Rights Reserved
  23 #
  24 #ident  "%W%    %E% SMI"
  25 #
  26 #       design notes that are likely to be of general (rather than
  27 #       merely historical) interest.
  28
  29 Table of Contents
  30
  31         Overview                        what filesync does
  32
  33         Primary Data Structures
  34                 general principles      why they exist
  35                 key concepts            what they represent
  36                 data structures         major structures and their contents
  37
  38         Overview of Passes              main phases of program execution
  39
  40         Modules                         list and descriptions of files
  41
  42         Studying the Code
  43                 active ingredients      a reading list of high points
  44                 the whole thing         a suggested order for everything
  45
  46         Gross calling structure         who calls whom
  47
  48         Helpful hints                   good things to know
  49
  50 Overview
  51
  52         The purpose of this program is to compare pairs of directory
  53         trees with a baseline snapshot, to determine which files have
  54         changed, and to propagate the changes in order to bring the
  55         trees back into congruency.  The baseline snapshot describes
  56         size, ownership, ... for all files that filesync is managing
  57         WHEN THEY WERE LAST IN SYNC.
  58
  59         The files and directory trees to be compared are determined
  60         by a relatively flexible (user editable) rules file, whose
   61         format (packingrules.4) permits files and/or trees to be
   62         specified explicitly, implicitly, or with wild cards.
  63         There are also provisions for filtering out unwanted files
  64         and for running programs to generate lists of files and
  65         directories to be included or excluded.
  66
  67         The comparisons begin by comparing the structured name
  68         spaces.  For names that appear in both trees, the files
  69         are then compared on the basis of type, size, contents,
  70         ownership and protections.  For files that are already
  71         in the baseline snapshot, if the sizes and modification
  72         times have not changed, we do not bother to recheck the
  73         contents.
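
        The size/modification-time shortcut described above can be
        sketched as follows.  This is only an illustration: the
        structure and function names (snap_info, need_content_check)
        are hypothetical and are not taken from the filesync sources.

```c
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>
#include <time.h>

/* Hypothetical subset of what the baseline remembers about a file. */
struct snap_info {
    off_t   size;       /* file size in bytes */
    time_t  modtime;    /* last modification time */
};

/*
 * Return true if the contents must be re-read and compared.
 * With no baseline entry there is nothing to trust, so we
 * always check.  Otherwise we check only when the size or the
 * modification time differs from what the baseline recorded.
 */
bool
need_content_check(const struct snap_info *baseline,
    const struct snap_info *current)
{
    if (baseline == NULL)
        return (true);
    return (baseline->size != current->size ||
        baseline->modtime != current->modtime);
}
```

        A caller would only open and compare file contents when this
        returns true, which is what keeps repeated runs over a large,
        mostly unchanged tree cheap.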
  74
  75         The reconciliation process (resolving the differences)
  76         will only propagate a change if it is obvious what should
  77         be done (one side has changed relative to the snapshot,
  78         while the other has not).  If there are conflicting changes,
  79         the file is flagged and the user is asked to reconcile the
   80         differences manually.  There are, however, a few switches
  81         that can be used to constrain the analysis or reconciliation,
  82         or to force one particular side to win in case of a conflict.
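
        The propagation rule above (act only when exactly one side
        has changed relative to the snapshot) can be sketched as a
        small decision function.  The names below (verdict, force,
        decide) are made up for this illustration, and the sketch
        deliberately ignores the case where both sides changed in the
        same way.

```c
#include <stdbool.h>

/* Possible outcomes for one file (hypothetical names). */
enum verdict { V_NOTHING, V_SRC_TO_DST, V_DST_TO_SRC, V_CONFLICT };

/* Stand-in for a switch that forces one side to win a conflict. */
enum force { FORCE_NONE, FORCE_SRC, FORCE_DST };

/*
 * Decide what to do with a file, given whether each side differs
 * from the baseline snapshot.
 */
enum verdict
decide(bool src_changed, bool dst_changed, enum force f)
{
    if (!src_changed && !dst_changed)
        return (V_NOTHING);         /* still in sync */
    if (src_changed && !dst_changed)
        return (V_SRC_TO_DST);      /* obvious: source wins */
    if (!src_changed && dst_changed)
        return (V_DST_TO_SRC);      /* obvious: destination wins */

    /* both sides changed: only a forcing switch can settle it */
    if (f == FORCE_SRC)
        return (V_SRC_TO_DST);
    if (f == FORCE_DST)
        return (V_DST_TO_SRC);
    return (V_CONFLICT);            /* flag it for the user */
}
```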
  83
  84
  85 Primary Data Structures
  86
  87         general principles:
  88                 we will build up an in-memory tree that represents
  89                 the union of the name spaces found in the baseline
  90                 and on the source and destination sides.
  91
  92                 keep in mind that the baseline recalls the state of
  93                 files THE LAST TIME THEY WERE IN AGREEMENT.  If files
  94                 have disagreed for a long time, the baseline still
  95                 remembers what they were like when they agreed.  If
   96                 files have never agreed, the baseline has no notion
  97                 of how they "used to be".
  98
  99         key concepts:
 100                 a "base pair" is a pair of directories whose
 101                 contents (or a subset of whose contents) are to
  102                 be synchronized.  The "base pairs" to be managed
 103                 are specified in the packing rules file.
 104
 105                 associated with each "base pair" is a set of rules
 106                 that describe which files (under those directories)
 107                 are to be kept in sync.  Each rule is a list of:
  108                         files and/or directories to be included
 109                         wild cards for files or directories to be included
 110                         programs to generate lists of names for inclusion
 111                         file names to be ignored
 112                         wild cards for file names to be ignored
 113                         programs to generate lists of names for ignoring
 114
 115                 as a result of the "evaluation" process we build up
 116                 (under each base pair) a tree that represents all of
 117                 the files that we are supposed to keep in sync, and
 118                 contains everything we need to know about each one
 119                 of those files.  The structure of the tree mirrors
 120                 the directory hierarchy ... actually the union of the
  121                 three hierarchies (baseline, source and destination).
 122
 123                 for each file, we record interesting information (type,
 124                 size, owner, protection, mod time) and keep separate
 125                 note of what these values were:
  126                         in the baseline, the last time the two sides agreed
 127                         on the source side, as we just examined it
 128                         on the destination side, as we just examined it
 129
 130         data structures:
 131
 132                 there is an ordered list of "base" structures
 133                 for each base, we maintain
 134                         three lists of associated "rule" descriptions:
 135                                 inclusion rules
 136                                 exclusion rules
 137                                 restriction rules (from the command line)
 138                         a "file" tree, representing all files below the bases
 139                         a list of statistics to be printed as a summary
 140
 141                 for each "rule", we maintain
 142                         some flags describing the type of rule
 143                         the character string that is the rule
 144
 145                 for each "file", we maintain
 146                         sibling and child pointers to give them tree structure
 147                         flags to describe what we have done/should do
 148                         "fileinfo" information from the src, dest, and baseline
 149
 150                         in addition there are some fields that are used
 151                         to add the file to a list of files requiring
 152                         reconciliation and record what happened to it.
 153
 154                 a "fileinfo" structure contains a subset of the information
 155                 that we obtain from a stat call:
 156                         major/minor/inum
 157                         type
 158                         link count
 159                         ownership, protection, and acls
 160                         size
 161                         modification time
 162
 163                 there is also, built up during analysis, a reconciliation
 164                 list.  This is an ordered list of "file" structures which
  165                 are believed to describe files that have changed and require
 166                 reconciliation.  The ordering is important both for correctness
 167                 and to preserve relative modification times.
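
        To make the relationships above concrete, here is a rough
        sketch of how the structures might hang together.  It is not
        a copy of database.h: every type and field name below is
        invented for illustration, and ACLs and the per-base
        statistics are left out to keep it short.

```c
#include <stddef.h>
#include <sys/types.h>
#include <time.h>

/* Subset of stat information kept for one side of one file. */
struct fileinfo_sketch {
    dev_t   dev;        /* device (major/minor) */
    ino_t   inum;       /* inode number */
    mode_t  mode;       /* type and protection bits */
    nlink_t nlink;      /* link count */
    uid_t   uid;        /* owner */
    gid_t   gid;        /* group */
    off_t   size;       /* size in bytes */
    time_t  modtime;    /* modification time */
};

/* The three views we keep of every file. */
enum side { SIDE_BASE, SIDE_SRC, SIDE_DST, SIDE_COUNT };

/* One node in the per-base file tree. */
struct file_sketch {
    const char          *name;
    unsigned            flags;          /* done/should-do flags */
    struct fileinfo_sketch info[SIDE_COUNT];
    struct file_sketch  *sibling;       /* next entry in same directory */
    struct file_sketch  *child;         /* first entry below this one */
    struct file_sketch  *recon_next;    /* link in reconciliation list */
};

/* One rule: some flags and the text of the rule. */
struct rule_sketch {
    unsigned            flags;
    const char          *text;
    struct rule_sketch  *next;
};

/* One base pair, with its rule lists and its file tree. */
struct base_sketch {
    struct rule_sketch  *includes;
    struct rule_sketch  *excludes;
    struct rule_sketch  *restrictions;
    struct file_sketch  *files;
    struct base_sketch  *next;
};

/* Count every node reachable through sibling and child pointers. */
size_t
count_files(const struct file_sketch *node)
{
    size_t n = 0;

    for (; node != NULL; node = node->sibling)
        n += 1 + count_files(node->child);
    return (n);
}
```

        count_files is only there to show how the sibling and child
        pointers give the list of "file" structures its tree shape.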
 168
  169 Overview of Passes
 170
 171         pass I (evaluate)
 172
 173                 stat every file that we might be interested in
 174                 (on both src/dest sides).  This includes walking
 175                 the trees under all directories in order to
  176                 find out what files exist, and stat-ing all of
 177                 them.
 178
 179                 the main trick in this pass is that there may be
 180                 files we don't want to evaluate (because we are
 181                 limiting our attention to specific files and trees).
  182                 There is a LISTED flag kept in the database that
  183                 tells us whether or not we need to stat/descend any
  184                 given node.
 185
 186                 all restrictions and ignores take effect during this pass.
 187
 188         pass II (analyze)
 189
 190                 given the baseline and all of the current stat information
 191                 gained during pass I, figure out what might conceivably
 192                 have changed and queue it for pass III.  This pass doesn't
 193                 try to figure out what happened or who should win ... it
 194                 merely identifies candidates for pass III.  This pass
 195                 ignores any nodes that were not evaluated during pass I.
 196
 197                 the queueing process, however, determines the order in
 198                 which the files will be processed in pass III, and the
 199                 order is very important.
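
                One way to keep such a queue ordered is an insertion
                that walks the list until it finds the right slot.
                The sketch below orders by modification time only,
                which is an assumption made for illustration; the
                real ordering in anal.c also has to satisfy
                correctness constraints that are not modelled here.
                The names (qnode, queue_by_modtime) are hypothetical.

```c
#include <stddef.h>
#include <time.h>

/* A minimal stand-in for a queued "file" structure. */
struct qnode {
    const char      *name;
    time_t          modtime;
    struct qnode    *next;
};

/*
 * Insert np into the list at *head, keeping the list in ascending
 * modtime order.  Entries with equal times keep their arrival
 * order, because the new node goes after existing equal entries.
 */
void
queue_by_modtime(struct qnode **head, struct qnode *np)
{
    struct qnode **pp = head;

    while (*pp != NULL && (*pp)->modtime <= np->modtime)
        pp = &(*pp)->next;
    np->next = *pp;
    *pp = np;
}
```

                Processing the list from the head then handles older
                changes before newer ones.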
 200
 201         pass III (reconcile)
 202
 203                 process the list of candidates, figuring out what has
 204                 actually changed and which versions deserve to win.  If
  205                 it is clear what needs doing, we actually do it in this
 206                 pass.
 207
 208 Modules
 209
 210         filesync.h
 211                 defines for limits, sizes and return codes
 212                 declarations for global variables (mostly cmd-line parms)
 213                 defines for default file names
 214                 declarations for routines of general interest
 215
 216         database.h
 217                 data-structures for recording rules
 218                 data-structures for recording information about files
 219                 declarations for routines that operate on/with those structures
 220
 221         messages.h
 222                 the text of all localizable messages
 223
 224         debug.h
 225                 definitions and declarations for routines for error
 226                 simulation and bit-map display.
 227
 228         acls.c
 229                 routines to get, set, compare, and display Access Control Lists
 230         action.c
 231                 routines to do the real work of copying, deleting, or
 232                 changing ownership in order to make one side agree
 233                 with the other.
 234         anal.c
 235                 routines to examine the in-core list of files and
  236                 determine what has changed (and therefore which
  237                 files are candidates for reconciliation).  This
 238                 analysis includes figuring out which files should
 239                 be links rather than copies.
 240         base.c
 241                 routines to read and write the baseline file
 242                 routines to search and manipulate the in-core base list
 243         debug.c
  244                 data structures and routines, used to simulate errors
  245                 and produce debug output, that map between bits (as found
  246                 in various flag words) and character string names for their
 247                 meanings.
 248
 249         eval.c
 250                 routines to build up the internal tree that describes
 251                 the status of all of the files that are described
 252                 by the current rules.
 253         files.c
 254                 routines to manipulate file name arguments, including
 255                 wild cards and embedded environment variables.
 256         ignore.c
 257                 routines to maintain a list of names or patterns for
 258                 files to be ignored, and to check file names against
 259                 that list.
 260         main.c
 261                 global variables, cmd-line parameter processing,
 262                 parameter validation, error reporting, and the
 263                 main loop.
 264         recon.c
 265                 routines to examine a list of files that appear to
 266                 have changed, and figure out what the appropriate
 267                 reconciliation course of action is.
 268         rename.c
 269                 routines to search the tree to determine whether
 270                 or not any creates/deletes are actually renames.
 271         rules.c
 272                 routines to read and write the rules file
 273                 routines to add rules and enumerate in-core rules
 274
 275         filecheck.c
 276                 not really a part of filesync, but rather a utility
 277                 program that is used in the test suite.  It extracts
 278                 information about files that is not readily available
 279                 from other unix commands.
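
        As an illustration of the kind of check ignore.c performs,
        the sketch below tests a file name against a list of shell
        style patterns using the POSIX fnmatch routine.  Whether
        filesync itself uses fnmatch is an assumption, and the
        function name is_ignored is hypothetical.

```c
#include <fnmatch.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Return true if name matches any of the npatterns shell-style
 * patterns.  A plain name with no wild cards matches only itself.
 */
bool
is_ignored(const char *name, const char *const patterns[],
    size_t npatterns)
{
    size_t i;

    for (i = 0; i < npatterns; i++) {
        /* fnmatch returns 0 when the name matches the pattern */
        if (fnmatch(patterns[i], name, 0) == 0)
            return (true);
    }
    return (false);
}
```

        With the patterns "*.o" and "core", the names "main.o" and
        "core" would be ignored, while "main.c" would not.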
 280
 281 Comments on studying the code
 282
 283         if you are only interested in the "active ingredients":
 284
 285                 read the above notes on data structures and then
 286
 287                 read the structure declarations in database.h
 288
 289                 read the above notes overviewing the passes
 290
 291                 in recon.c: read reconcile
 292
 293                         this routine almost makes sense on its own,
 294                         and it is unquestionably the most important
 295                         routine in the entire program.  Everything
 296                         else just gathers data for reconcile to use,
 297                         or updates the books to reflect the changes.
 298
 299                 in eval.c: read evaluate, eval_file, walker, and note_info
 300
 301                         this is the main guts of pass I
 302
 303                 in anal.c: read analyze, check_file, check_changes & queue_file
 304
 305                         this is the main guts of pass II
 306
 307         if you want to read the whole thing:
 308
 309                 the following routines do fundamentally simple things
 310                 in simple ways, and can (for the most part) be understood
  311                 in vacuo.  The things they do are sufficiently
  312                 obvious that you can probably understand the more interesting
 313                 code without having read them at all.
 314
 315                         base.c
 316                         rules.c
 317                         files.c
 318                         debug.c
 319                         ignore.c
 320                         acls.c
 321
 322                 the following routines constitute the real meat of the
 323                 program, and while they are broken into specialized
 324                 modules, they probably need to be understood as an
 325                 organic whole:
 326
 327                         main.c          setup and control
 328                         eval.c          pass I
 329                         anal.c          pass II
 330                         recon.c         pass III
 331                         action.c        execution and book-keeping
 332                         rename.c        a special case for a common situation
 333
 334
 335 Gross calling structure / flow of control
 336
 337         main.c:main
 338                 findfiles
 339                 read_baseline
 340                 read_rules
 341                 if new rules
 342                         add_base
 343                         add_include
 344                 evaluate
 345                 analyze
 346                 write_baseline
 347                 write_summary
 348
 349         eval.c:evaluate
 350                 add_file_to_base
 351                 add_glob
 352                 add_run
 353                 ignore_pgm
 354                 ignore_file
 355                 ignore_expr
 356                 eval_file
 357
 358         eval.c:eval_file
 359                 note_info
 360                 nftw
 361                         walker
 362                                 note_info
 363
 364         anal.c:analyze
 365                 check_file
 366                 reconcile
 367
 368         anal.c:check_file
 369                 check_changes
 370                 queue_file
 371
 372
 373         recon.c:reconcile
 374                 samedata
 375                 samestuff
 376                 do_copy
 377                         copy
 378                         do_like
 379                         update_info
 380                 do_like
 381                 do_remove
 382
 383 Helpful Hints
 384
 385         the "file" structure contains a bunch of flags.  Many of them
 386         just summarize what we know about the file (e.g. where it was
 387         found).  Others are more subtle and control the evaluation
 388         process or the writing out of the baseline file.  You can't
 389         really understand the processing unless you understand what
 390         these flags mean.
 391
 392                 F_NEW           added by a new rule
 393
 394                 F_LISTED        this name was generated by a rule
 395
 396                 F_SPARSE        this directory is an intermediate on
 397                                 the way to a name generated by a rule
 398                                 and should not be recursively walked.
 399
 400                 F_EVALUATE      this node was found in evaluation and
 401                                 has up-to-date stat information
 402
 403                 F_CONFLICT      there is a conflict on this node so
 404                                 baseline should remain unchanged
 405
 406                 F_REMOVE        this node should be purged from the baseline
 407
 408                 F_STAT_ERROR    it was impossible to stat this file
 409                                 (and anything below it)
 410
 411         the implications of these flags on processing are
 412
 413                 F_NEW, F_LISTED, F_SPARSE
 414
 415                         affect whether or not a particular node should
 416                         be included in the evaluation pass.
 417
 418                         in some situations, only new rules are interpreted.
 419
 420                         listed files and directories should be evaluated
 421                         and analyzed.  sparse directories should not be
 422                         recursively enumerated.
 423
 424                 F_EVALUATE
 425
 426                         determines whether or not a node is included
 427                         in the analysis pass.  Only nodes that have
 428                         been evaluated will be analyzed.
 429
 430                 F_CONFLICT, F_REMOVE, F_EVALUATE
 431
  432                         affect how a node should be written back into the baseline file.
 433
 434                         if there is a conflict or we haven't evaluated
 435                         a node, we won't update the baseline.
 436
 437                         if a node is marked for removal, it will be
 438                         excluded from the baseline when it is written out.
 439
 440                 F_STAT_ERROR
 441
 442                         if we could not get proper status information
 443                         about a file (or the tree under it) we cannot,
 444                         with any confidence, determine what its state
 445                         is or do anything about it.  Such files are
 446                         flagged as "in conflict".
 447
 448                         it is somewhat kinky that we put error flagged
 449                         files on the reconciliation list.  We do this
 450                         because this is the easiest way to pull them
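
        The effect of these flags on writing the baseline can be
        summarized in a few lines of code.  This is a sketch under
        stated assumptions: the bit values are placeholders (the real
        definitions live in database.h), the enum and function names
        are made up, and giving F_REMOVE priority over the other
        flags is a guess rather than something these notes specify.

```c
/* Placeholder bit values, for illustration only. */
#define F_EVALUATE      0x01
#define F_CONFLICT      0x02
#define F_REMOVE        0x04
#define F_STAT_ERROR    0x08

/* What to do with a node's baseline entry (hypothetical names). */
enum base_action {
    BASE_KEEP_OLD,      /* write back the old baseline information */
    BASE_WRITE_NEW,     /* record the newly agreed state */
    BASE_DROP           /* leave the node out of the baseline */
};

enum base_action
baseline_action(unsigned flags)
{
    /* nodes marked for removal are excluded when writing */
    if (flags & F_REMOVE)
        return (BASE_DROP);

    /* stat errors are treated like conflicts */
    if (flags & (F_CONFLICT | F_STAT_ERROR))
        return (BASE_KEEP_OLD);

    /* nodes we never evaluated tell us nothing new */
    if (!(flags & F_EVALUATE))
        return (BASE_KEEP_OLD);

    return (BASE_WRITE_NEW);
}
```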
 451                         out for reporting as conflicts.
 452
 453