docs/proposal-errors.txt

   1 Considerations for ErrorCollection
   2
   3 Presently, HTML Purifier takes a code-execution centric approach to handling
   4 errors. Errors are organized and grouped according to which segment of the
   5 code triggers them, not necessarily the portion of the input document that
   6 triggered the error. This means that errors are pseudo-sorted by category,
   7 rather than location in the document.
   8
   9 One easy way to "fix" this problem would be to re-sort according to line number.
  10 However, the "category" style information we derive from naively following
  11 program execution is still useful. After all, each of the strategies which
  12 can report errors still process the document mostly linearly. Furthermore,
  13 not only do they process linearly, but the way they pass off operations to
  14 sub-systems mirrors that of the document. For example, AttrValidator will
  15 linearly proceed through elements, and on each element will use AttrDef to
  16 validate those contents. From there, the attribute might have more
  17 sub-components, which have execution passed off accordingly.
  18
  19 In fact, each strategy handles a very specific class of "error."
  20
  21 RemoveForeignElements   - element tokens
  22 MakeWellFormed          - element token ordering
  23 FixNesting              - element token ordering
  24 ValidateAttributes      - attributes of elements
  25
  26 The crucial point is that while we care about the hierarchy governing these
  27 different errors, we *don't* care about any other information about what actually
  28 happens to the elements. This brings up another point: if HTML Purifier fixes
  29 something, this is not really a notice/warning/error; it's really a suggestion
  30 of a way to fix the aforementioned defects.
  31
  32 In short, the refactoring to take this into account kinda sucks.
  33
  34 Errors should not be recorded in order that they are reported. Instead, they
  35 should be bound to the line (and preferably element) in which they were found.
  36 This means we need some way to uniquely identify every element in the document,
  37 which doesn't presently exist. An easy way of adding this would be to track
  38 line columns. An important ramification of this is that we *must* use the
  39 DirectLex implementation.
  40
  41     1. Implement column numbers for DirectLex [DONE!]
  42     2. Disable error collection when not using DirectLex [DONE!]
  43
  44 Next, we need to re-orient all of the error declarations to place CurrentToken
  45 at utmost important. Since this is passed via Context, it's not always clear
  46 if that's available. ErrorCollector should complain HARD if it isn't available.
  47 There are some locations when we don't have a token available. These include:
  48
  49     * Lexing - this can actually have a row and column, but NOT correspond to
  50       a token
  51     * End of document errors - bump this to the end
  52
  53 Actually, we *don't* have to complain if CurrentToken isn't available; we just
  54 set it as a document-wide error. And actually, nothing needs to be done here.
  55
  56 Something interesting to consider is whether or not we care about the locations
  57 of attributes and CSS properties, i.e. the sub-objects that compose these things.
  58 In terms of consistency, at the very least attributes should have column/line
  59 numbers attached to them. However, this may be overkill, as attributes are
  60 uniquely identifiable. You could go even further, with CSS, but they are also
  61 uniquely identifiable.
  62
  63 Bottom-line is, however, this information must be available, in form of the
  64 CurrentAttribute and CurrentCssProperty (theoretical) context variables, and
  65 it must be used to organize the errors that the sub-processes may throw.
  66 There is also a hierarchy of sorts that may make merging this into one context
  67 variable more sense, if it hadn't been for HTML's reasonably rigid structure.
  68 A CSS property will never contain an HTML attribute. So we won't ever get
  69 recursive relations, and having multiple depths won't ever make sense. Leave
  70 this be.
  71
  72 We already have this information, and consequently, using start and end is
  73 *unnecessary*, so long as the context variables are set appropriately. We don't
  74 care if an error was thrown by an attribute transform or an attribute definition;
  75 to the end user these are the same (for a developer, they are different, but
  76 they're better off with a stack trace (which we should add support for) in such
  77 cases).
  78
  79     3. Remove start()/end() code. Don't get rid of recursion, though [DONE]
  80     4. Setup ErrorCollector to use context information to setup hierarchies.
  81        This may require a different internal format. Use objects if it gets
  82        complex. [DONE]
  83
  84        ASIDE
  85             More on this topic: since we are now binding errors to lines
  86             and columns, a particular error can have three relationships to that
  87             specific location:
  88
  89             1. The token at that location directly
  90                 RemoveForeignElements
  91                 AttrValidator (transforms)
  92                 MakeWellFormed
  93             2. A "component" of that token (i.e. attribute)
  94                 AttrValidator (removals)
  95             3. A modification to that node (i.e. contents from start to end
  96                token) as a whole
  97                 FixNesting
  98
  99             This needs to be marked accordingly. In the presentation, it might
 100             make sense keep (3) separate, have (2) a sublist of (1). (1) can
 101             be a closing tag, in which case (3) makes no sense at all, OR it
 102             should be related with its opening tag (this may not necessarily
 103             be possible before MakeWellFormed is run).
 104
 105             So, the line and column counts as our identifier, so:
 106
 107             $errors[$line][$col] = ...
 108
 109             Then, we need to identify case 1, 2 or 3. They are identified as
 110             such:
 111
 112             1. Need some sort of semaphore in RemoveForeignElements, etc.
 113             2. If CurrentAttr/CurrentCssProperty is non-null
 114             3. Default (FixNesting, MakeWellFormed)
 115
 116             One consideration about (1) is that it usually is actually a
 117             (3) modification, but we have no way of knowing about that because
 118             of various optimizations. However, they can probably be treated
 119             the same. The other difficulty is that (3) is never a line and
 120             column; rather, it is a range (i.e. a duple) and telling the user
 121             the very start of the range may confuse them. For example,
 122
 123             <b>Foo<div>bar</div></b>
 124             ^     ^
 125
 126             The node being operated on is <b>, so the error would be assigned
 127             to the first caret, with a "node reorganized" error. Then, the
 128             ChildDef would have submitted its own suggestions and errors with
 129             regard to what's going in the internals.  So I suppose this is
 130             ok. :-)
 131
 132             Now, the structure of the earlier mentioned ... would be something
 133             like this:
 134
 135             object {
 136                 type = (token|attr|property),
 137                 value, // appropriate for type
 138                 errors => array(),
 139                 sub-errors = [recursive],
 140             }
 141
 142             This helps us keep things agnostic. It is also sufficiently complex
 143             enough to warrant an object.
 144
 145 So, more wanking about the object format is in order. The way HTML Purifier is
 146 currently setup, the only possible hierarchy is:
 147
 148     token -> attr -> css property
 149
 150 These relations do not exist all of the time; a comment or end token would not
 151 ever have any attributes, and non-style attributes would never have CSS properties
 152 associated with them.
 153
 154 I believe that it is worth supporting multiple paths. At some point, we might
 155 have a hierarchy like:
 156
 157     * -> syntax
 158       -> token -> attr -> css property
 159                        -> url
 160                -> css stylesheet <style>
 161
 162 et cetera. Now, one of the practical implications of this is that every "node"
 163 on our tree is well-defined, so in theory it should be possible to either 1.
 164 create a separate class for each error struct, or 2. embed this information
 165 directly into HTML Purifier's token stream.  Embedding the information in the
 166 token stream is not a terribly good idea, since tokens can be removed, etc.
 167 So that leaves us with 1... and if we use a generic interface we can cut down
 168 on a lot of code we might need. So let's leave it like this.
 169
 170 ~~~~
 171
 172 Then we setup suggestions.
 173
 174     5. Setup a separate error class which tells the user any modifications
 175        HTML Purifier made.
 176
 177 Some information about this:
 178
 179 Our current paradigm is to tell the user what HTML Purifier did to the HTML.
 180 This is the most natural mode of operation, since that's what HTML Purifier
 181 is all about; it was not meant to be a validator.
 182
 183 However, most other people have experience dealing with a validator. In cases
 184 where HTML Purifier unambiguously does the right thing, simply giving the user
 185 the correct version isn't a bad idea, but problems arise when:
 186
 187 - The user has such bad HTML we do something odd, when we should have just
 188   flagged the HTML as an error. Such examples are when we do things like
 189   remove text from directly inside a <table> tag. It was probably meant to
 190   be in a <td> tag or be outside the table, but we're not smart enough to
 191   realize this so we just remove it. In such a case, we should tell the user
 192   that there was foreign data in the table, but then we shouldn't "demand"
 193   the user remove the data; it's more of a "here's a possible way of
 194   rectifying the problem"
 195
 196 - Giving line context for input is hard enough, but feasible; giving output
 197   line context will be extremely difficult due to shifting lines; we'd probably
 198   have to track what the tokens are and then find the appropriate out context
 199   and it's not guaranteed to work etc etc etc.
 200
 201 ````````````
 202
 203 Don't forget to spruce up output.
 204
 205     6. Output needs to automatically give line and column numbers, basically
 206        "at line" on steroids. Look at W3C's output; it's ok. [PARTIALLY DONE]
 207
 208        - We need a standard CSS to apply (check demo.css for some starting
 209          styling; some buttons would also be hip)
 210
 211     vim: et sw=4 sts=4