doc/developer/Dependencies.txt

   1 A discussion of the new dependency code, version 0.3
   2 by Michael Meeks <mmeeks@gnu.org> and Jody Goldberg <jody@gnome.org>
   3
   4 1.     Overview of the Dependencies
   5
   6         The dependency code is designed to determine which objects depend on
   7 which cells, or regions.  The main use of this is triggering recomputation for
   8 things that depend on something that has changed.  The majority of the code
   9 related to dependencies can be found in module dependent.c.
  10
  11 1.1    Data structures and their meaning
  12
  13        Dependency data is anchored on a per-sheet basis using GnmDepContainer.
  14 This stores all the dependencies for the sheet in two hash tables.  In the
  15 future other objects may also become containers, such as graphs in their own
  16 tab.
  17
  18        The two main types of dependencies, are single and range. Loosely
  19 these describe single cells (ie. = A1) vs. regions (ie. = SUM (A1:Z500))
  20 The range_hash and single_hash tables use DependencyRange and DependencySingle
  21 to store lists of Dependents (not necessarily in this container) that depend
  22 on things in the container.  The range_hash table is hashed on the normalized
  23 range, and the other hashes on the GnmCellPos of the singleton.  Sheet does not
  24 come into play because it is implicit in the container.
  25
  26        DependencySingle stores stores the dependents that rely on individual
  27 cells.  This makes lookup easy.  Only 1 DependencySingle will exist for a given
  28 cell.  We can find that quickly via the hash.
  29
  30         DependencyRange cannot be searched via the hash.  A target cell may be
  31 contained by several regions.  Finding which regions requires traversing the
  32 list of regions looking for intersections.  See below.
  33
  34 2.     GnmDependent Evaluation
  35
  36        The routine dependent_eval will evaluate a dependent and if its value
  37 changes it will queue cells that depend on it, these are obtained via
  38 cell_foreach_dep, which looks up in the single_hash and traverses the
  39 range_hash to obtain its result. Often cell recalculation happens through
  40 workbook_recalc which works through each sheets dependents,  re-evaluating
  41 any that are marked for calculation.
  42
  43 2.1    Evaluation queue vs. recursion
  44
  45        There are two ways in which a cell's recalculation can occur. Firstly
  46 simply traversing the GnmExpr (expr.h) will result in many cells being
  47 evaluated. Essentially each dereference in the GnmExpr (eg. =A1) will cause
  48 the re-calculation of A1's GnmExpr before we can continue (recursively).
  49 This is actually fairly expensive when the expression contains a reference like
  50     =sum(a1:a65535)
  51
  52 3.     Dependencies the bottleneck
  53
  54        Since dependencies tend to fan out, and massively increase the area that
  55 needs to be recomputed, it is clearly advantageous to spend time culling the
  56 dependency tree as intelligently as possible. Furthermore since for each cell
  57 that changes it is necessary to determine its dependencies the lookup of
  58 a cell's dependencies needs to be fast.
  59
  60 3.1    Why two methods
  61
  62        First, consider the case where only range dependencies are used, this
  63 is a fairly simple first implementation. If we have N cells that have
  64 random dependencies on one other cell, then we will have approx N ranges in
  65 the range hash. For each cell we re-calculate we need to iterate over the
  66 entire range hash to determine its dependencies. Hence we have a fundamentally
  67 O(N^2) algorithm, this is very bad news. This scheme spends almost all of its
  68 time in search_cell_deps.
  69
  70        To overcome this problem we partition dependencies into single cell
  71 dependencies and range dependencies. This way for the common =A1 case, we don't
  72 add an entry in the range_hash, we simply add an entry in the simple_hash.
  73 Hence for the cell_foreach_dep we have one less entry in the range hash to
  74 iterate over, which saves N iterations of search_cell_deps.
  75
  76         Another common case is having a lot of formulae sharing the same range
  77 in the sheet as a dependency (eg. cutting and pasting = SUM($A$1:$A$100)). In
  78 this case there is a single dependency range entry with many cells listed as its
  79 dependencies.
  80
  81 3.1.1 bucketing range depends
  82
  83         Searching the list of range depends is still expensive if there are
  84 a lot of them.  eg in the situation where an expression like
  85     =sum(b1:d1)
  86 is duplicated for the entire 1st column.  We end up with a huge number of
  87 non-overlapping regions to search through.  That can be ameliorated by
  88 bucketing the range depends.  Giving only a few columns to each bucket.
  89
  90 FUTURE : it would be useful to notice that a larger range contains a smaller
  91 one, and to build up a tree as necessary.  This would also decrease the size of
  92 the list.
  93
  94 3.2    Inter-sheet dependencies
  95
  96        Inter sheet dependencies are managed simply by inserting the dependency
  97 information into the sheet in which the cells that are dependent on reside.
  98 This is essentially exactly what is expected, given that cell's are linked to
  99 the cell they depend on. Removing inter-sheet dependencies is also identical
 100 to normal dependencies, excepting that it is more likely to throw up formulae
 101 that have not been correctly unlinked in a sheet destroy.  As an optimization
 102 we flag inter-sheet and inter-book dependencies.  If we are destroying an entire
 103 sheet or workbook there is no need to be neat and tidy unlinking individual
 104 depends when the whole structure is going to be deleted.
 105
 106 3.3    What is hashed
 107
 108        While the two hashes (range_hash, simple_hash) are both GHashTables
 109 what is hashed is quite different.
 110
 111 3.3.1  What does the range hash do ?
 112
 113        The hashing on the range_hash is merely used to determine if there is
 114 already a range in the hash with the same dimensions as a new dependency range
 115 being added. This accelerates insertion of dependencies, the hash is traversed
 116 as a simple un-ordered list at all other times.
 117
 118 3.3.2  Why not a direct Cell * -> set mapping for DependencySingle ?
 119
 120        This is not done since there is no guarantee that cells that have
 121 dependencies are in existence yet. Hence it is quite valid for A1 to be '=A2'
 122 before A2 exists. If A2 does not exist then A2 has no Cell structure yet. This
 123 could be obviated by creating depended cells, but this would be inelegant.
 124
 125 3.3.3 How are depends stored
 126         As of 1.1.x we enabled switched from storing lists of dependents in the
 127 DependentSingle and DependentRange.  That was a bottleneck when ensuring
 128 uniqueness if a lot of things depended on the same region directly.  The list
 129 was replaced with a 'microHash'.  A simple hash table that looks like a list
 130 for small sizes.  It is derived from GHashTable in glib, but simplified to make
 131 it lighter weight.
 132
 133 3.4    3D dependencies
 134
 135         Some expressions depend on the order of the sheets in a workbook.
 136 When linking the expressions we notice this and flag the dependent before
 137 adding it to a collection in the workbook.  The workbook is responsible of
 138 linking and unlinking those dependents when the sheet order changes.
 139
 140 3.5     Named expressions
 141
 142         Named expressions are special.  In the current implementation we do not
 143 differentiate between names that use relative references and those that don't.
 144 Names cannot be dependents because relative references change value depending
 145 on where they are evaluated.  However, if a sheet is removed we need to know to
 146 invalidate a name that references it.  The 'referencing_names' member of the
 147 collection tracks that.
 148
 149 3.6     Dynamic depends
 150
 151         Functions like INDEX, INDIRECT and some of the range operators produce
 152 new ranges at run time.  As a result these are not registered when an dependent
 153 is linked.  The DynamicDep structure is a special form of GnmDependent that is
 154 used to collect and remove the dependencies as they get generated.  When a
 155 dependent with dynamic depends is evaluated or unlinked the list is cleared.
 156 The process of evaluation then repopulates the set.
 157
 158 4.     The life cycle of dependency records
 159
 160         A GnmDependent is 'linked' if its dependencies on other things have been
 161 registered with its container,  and 'unlinked' if not.  Linking takes place by
 162 traversing the expression tree and registering the single, range, and 3d
 163 references.
 164
 165 4.3.1  Implicit intersection
 166
 167        This is as yet unimplemented, but will further reduce the number of
 168 ranges to clip against. Essentially an implicit intersection reduces a range
 169 to an adjacent single reference under certain circumstances.
 170
 171 4.3.2  Array Formula
 172
 173        These luckily have a simple dependency structure, since the formula
 174 stored is identical in each cell, the cells may all depend on the corner cell
 175 using a fast single mapping.
 176
 177 5     Future Expansion
 178
 179        There are several avenues open for future expansion.
 180
 181 5.1     DependencyRange trees or quad-trees
 182
 183         Accelerating the range search will give big speedups on large sheets.
 184 The current row based bucketing approach seems to work nicely, but could be
 185 augmented by using trees or quad trees to handle containment better.
 186
 187 5.2    Multi-threading,
 188
 189        With the current structure, it might well be possible to add multi-
 190 threading support to the evaluation engine. The structure of this would take
 191 advantage of the partitioning already provided by the sheet boundary. To do
 192 this it would be necessary to move the eval_queue to a per-sheet entity, and
 193 putting a locking / signaling mechanism on the queue such that inter-sheet
 194 dependencies could be prepended to the queue (thus ensuring rapid
 195 evaluation), and waited on safely. Since each cell is evaluated but once
 196 per re-calc, it would then be safe to read the Cell's value when it dissapeared
 197 from the eval_queue.
 198
 199 5.3    GnmExpr recursion
 200
 201        Whether it is always entirely necessary to re-evaluate a cell solely
 202 on the basis that it is in the GnmExpr is non-obvious to me. Clearly if this
 203 cell is in the dependency queue it would make perfect sense, however if there
 204 is as yet no chance that this cell has been changed, it makes little sense
 205 to re-calculate it (and its tree'd dependencies). The only problem here is
 206 determining whether any of the currently queued dependencies would alter this
 207 cell's dependencies.