comparison.xhtml

   1 <?xml version="1.0" encoding="UTF-8"?>
   2 <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
   3     "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd" [
   4   <!ENTITY % htmlpurifier.current SYSTEM "current.ent"> %htmlpurifier.current;
   5 ]>
   6 <html xmlns="http://www.w3.org/1999/xhtml"
   7   xmlns:xi="http://www.w3.org/2001/XInclude"
   8   xmlns:xc="urn:xhtml-compiler"
   9   xml:lang="en">
  10 <head>
  11   <title>Comparison - HTML Purifier</title>
  12   <xi:include href="common-meta.xml" xpointer="xpointer(/*/node())" />
  13   <meta name="keywords" content="HTMLPurifier, HTML Purifier, HTML, filter, filtering, HTML_Safe, PEAR, comparison, kses, striptags, SafeHTMLChecker" />
  14 </head>
  15 <body>
  16
  17 <xi:include href="common-header.xml" xpointer="xpointer(/*/node())" />
  18
  19 <div id="main">
  20 <h1 id="title">Comparison</h1>
  21
  22 <div id="content">
  23
  24 <p>
  25   With the advent of <a href="http://en.wikipedia.org/wiki/Web_2.0">Web 2.0</a>,
  26   the end user has gone from passive consumer to active producer of content
  27   on the World Wide Web.  <a href="http://en.wikipedia.org/wiki/Wiki">Wikis</a>,
  28   <a href="http://en.wikipedia.org/wiki/Social_software">Social Software</a> and
  29   <a href="http://en.wikipedia.org/wiki/Blog">Blogs</a> all put the user in control.
  30 </p>
  31
  32 <p>
  33   Give the user too much control, however, and you set yourself up for <a
  34   href="http://en.wikipedia.org/wiki/Cross-site_scripting"><abbr>XSS</abbr
  35   ></a> attacks. For this reason, <abbr>HTML</abbr>'s flexibility has
  36   proven to be both a blessing and a curse, and the software that
  37   processes it must strike a fine balance between security and usability.
  38   How do we prevent users from injecting JavaScript or inserting malformed
  39   <abbr>HTML</abbr> while allowing a rich syntax of tags, attributes and
  40   <abbr>CSS</abbr>? How do we put <abbr>HTML</abbr> inside an
  41   <abbr>RSS</abbr> feed without worrying about sloppy coding messing up
  42   <abbr>XML</abbr> parsing? Almost every <abbr>PHP</abbr> developer has
  43   come across this problem before, and many have tried (albeit
  44   unsuccessfully) to solve it. We will analyze existing
  45   libraries to demonstrate how they are ineffective and, of course, how
  46   <strong>HTML Purifier</strong> solves all our problems and achieves
  47   standards-compliance.
  48 </p>
  49
  50 <p>
  51   I will give no quarter and pull no punches: as of the time of writing,
  52   no other library comes even <em>close</em> to solving the problem effectively
  53   for richly formatted documents.  But, nonetheless, there is a necessary
  54   disclaimer:
  55 </p>
  56
  57 <div class="disclaimer">
  58   <p>
  59     This comparison document was written by the author of HTML Purifier,
  60     and clearly is <strong>in favor</strong> of HTML Purifier. However, that doesn't
  61     mean that it is biased: I have made every attempt to be <strong>factual and
  62     fair</strong>, and I hope that you will agree, by the time you finish reading
  63     this document, that HTML Purifier is the only satisfactory <abbr>HTML</abbr>
  64     filter out there today.
  65   </p>
  66 </div>
  67
  68 <div id="toc" />
  69
  70 <h2 id="Summary">Summary</h2>
  71
  72 <p>A table summarizing the differences for the impatient.</p>
  73
  74 <div class="wide-table">
  75 <table cellspacing="0">
  76
  77 <thead>
  78   <tr>
  79     <th>Library</th>
  80     <th>Version</th>
  81     <th>Date</th>
  82     <th>License</th>
  83     <th>Whitelist</th>
  84     <th>Removal</th>
  85     <th>Well-formed</th>
  86     <th>Nesting</th>
  87     <th>Attributes</th>
  88     <th>XSS&nbsp;safe</th>
  89     <th>Standards&nbsp;safe</th>
  90   </tr>
  91 </thead>
  92
  93 <tbody>
  94
  95 <tr>
  96   <td>striptags</td>
  97   <td>n/a</td>
  98   <td>n/a</td>
  99   <td>n/a</td>
 100   <td class="impl-almostyes">Yes (user)</td>
 101   <td class="impl-partial">Buggy</td>
 102   <td class="impl-no">No</td>
 103   <td class="impl-no">No</td>
 104   <td class="impl-no">No</td>
 105   <td class="impl-no">No</td>
 106   <td class="impl-no">No</td>
 107 </tr>
 108
 109 <tr>
 110   <td>PHP Input Filter</td>
 111   <td>1.2.2</td>
 112   <td>2005-10-05</td>
 113   <td>GPL</td>
 114   <td class="impl-almostyes">Yes (user)</td>
 115   <td class="impl-yes">Yes</td>
 116   <td class="impl-no">No</td>
 117   <td class="impl-no">No</td>
 118   <td class="impl-partial">Partial</td>
 119   <td class="impl-almostyes">Probably</td>
 120   <td class="impl-no">No</td>
 121 </tr>
 122
 123 <tr>
 124   <td>HTML_Safe</td>
 125   <td>0.9.9beta</td>
 126   <td>2005-12-21</td>
 127   <td>BSD (3)</td>
 128   <td class="impl-no">Mostly No</td>
 129   <td class="impl-yes">Yes</td>
 130   <td class="impl-yes">Yes</td>
 131   <td class="impl-no">No</td>
 132   <td class="impl-partial">Partial</td>
 133   <td class="impl-almostyes">Probably</td>
 134   <td class="impl-no">No</td>
 135 </tr>
 136
 137 <tr>
 138   <td>kses</td>
 139   <td>0.2.2</td>
 140   <td>2005-02-06</td>
 141   <td>GPL</td>
 142   <td class="impl-almostyes">Yes (user)</td>
 143   <td class="impl-yes">Yes</td>
 144   <td class="impl-no">No</td>
 145   <td class="impl-no">No</td>
 146   <td class="impl-partial">Partial</td>
 147   <td class="impl-almostyes">Probably</td>
 148   <td class="impl-no">No</td>
 149 </tr>
 150
 151 <tr>
 152   <td>htmLawed</td>
 153   <td>1.1.9.1</td>
 154   <td>2009-02-26</td>
 155   <td>GPL</td>
 156   <td class="impl-partial">Yes (not default)</td>
 157   <td class="impl-almostyes">Yes (user)</td>
 158   <td class="impl-almostyes">Yes (user)</td>
 159   <td class="impl-partial">Partial</td>
 160   <td class="impl-partial">Partial</td>
 161   <td class="impl-almostyes">Probably</td>
 162   <td class="impl-no">No</td>
 163 </tr>
 164
 165 <tr>
 166   <td>Safe HTML Checker</td>
 167   <td>n/a</td>
 168   <td>2003-09-15</td>
 169   <td>n/a</td>
 170   <td class="impl-partial">Yes (bare)</td>
 171   <td class="impl-yes">Yes</td>
 172   <td class="impl-yes">Yes</td>
 173   <td class="impl-almostyes">Almost</td>
 174   <td class="impl-partial">Partial</td>
 175   <td class="impl-yes">Yes</td>
 176   <td class="impl-almostyes">Almost</td>
 177 </tr>
 178
 179 <tr>
 180   <td>HTML Purifier</td>
 181   <td>&htmlpurifier.current.version;</td>
 182   <td>&htmlpurifier.current.release-date;</td>
 183   <td>LGPL</td>
 184   <td class="impl-yes">Yes</td>
 185   <td class="impl-yes">Yes</td>
 186   <td class="impl-yes">Yes</td>
 187   <td class="impl-yes">Yes</td>
 188   <td class="impl-yes">Yes</td>
 189   <td class="impl-yes">Yes</td>
 190   <td class="impl-yes">Yes</td>
 191 </tr>
 192
 193 </tbody>
 194
 195 </table>
 196 </div>
 197
 198 <p>
 199   <a href="#Tidy">HTML Tidy</a> is omitted from this list because it is not
 200   an <abbr>HTML</abbr> filter.
 201 </p>
 202
 203 <h2 id="AltMarkup">Look Ma, No <abbr>HTML</abbr>!</h2>
 204
 205 <blockquote class="fancy">
 206   <div class="quote" style="text-align:center;">
 207     A clever person solves a problem.
 208     A wise person avoids it.
 209   </div>
 210   <div class="origin">&mdash; Albert Einstein</div>
 211 </blockquote>
 212
 213 <p>
 214   Before we jump into the weird and not-so-wonderful world of
 215   <abbr>HTML</abbr> filters, we must first consider another domain:
 216   non-<abbr>HTML</abbr> markup libraries. While libraries of this type
 217   really shouldn't be considered <abbr>HTML</abbr> filters, they are the
 218   number one method of taking user input and processing it into something
 219   more than plain old text. These libraries forgo <abbr>HTML</abbr> and
 220   define their own markup syntax. <a
 221   href="http://en.wikipedia.org/wiki/BBCode">BBCode</a>, <a
 222   href="http://en.wikipedia.org/wiki/Wikitext">Wikitext</a>, <a
 223   href="http://daringfireball.net/projects/markdown/">Markdown</a> and <a
 224   href="http://textism.com/tools/textile/">Textile</a> are all examples of
 225   such markup languages (although it should be noted that Wikitext and
 226   Markdown can allow <abbr>HTML</abbr> within them). The benefits (to
 227   those who use them, anyway) are clear: simplicity and security.
 228 </p>
 229
 230 <table cellspacing="0">
 231   <thead>
 232     <tr>
 233       <th>Markup language</th>
 234       <th>Sample</th>
 235     </tr>
 236   </thead>
 237   <tbody>
 238     <tr>
 239       <th>BBCode</th>
 240       <td><tt>[b]B[/b] [i]i[/i] [url = http://www.example.com/]link[/url].</tt></td>
 241     </tr>
 242     <tr>
 243       <th>Wikitext<sup>1</sup></th>
 244       <td><tt>'''B''' ''i'' [http://www.example.com/ link]</tt></td>
 245     </tr>
 246     <tr>
 247       <th>Markdown<sup>2</sup></th>
 248       <td><tt>**B** *i* [link](http://www.example.com/)</tt></td>
 249     </tr>
 250     <tr>
 251       <th>Textile</th>
 252       <td><tt>*B* _i_ &quot;link&quot;:http://www.example.com/</tt></td>
 253     </tr>
 254     <tr>
 255       <th><abbr>HTML</abbr></th>
 256       <td><tt>&lt;b&gt;B&lt;/b&gt; &lt;i&gt;i&lt;/i&gt; &lt;a href=&quot;http://www.example.com/&quot;&gt;link&lt;/a&gt;</tt></td>
 257   </tr>
 258     <tr>
 259       <th><acronym>WYSIWYG</acronym></th>
 260       <td><b>B</b> <i>i</i> <a href="http://www.example.com/">link</a></td>
 261     </tr>
 262   </tbody>
 263 </table>
 264
 265 <ol class="notes">
 266   <li>
 267     Wikitext shown is modeled after <a
 268     href="http://www.mediawiki.org/wiki/MediaWiki">MediaWiki</a> style.
 269     There are many variants of Wikitext currently extant.
 270   </li>
 271   <li>
 272     Strictly speaking, the Markdown syntax is not equivalent: bold text
 273     is expressed as <code>&lt;strong&gt;</code> and italicized text is
 274     expressed as <code>&lt;em&gt;</code>. Most browser default stylesheets,
 275     however, map those two semantic tags to the associated styling, so
 276     many users assume that it really is italics (and use it improperly for,
 277     say, book titles).
 278   </li>
 279 </ol>
 280
 281 <h3 id="AltMarkup:Simplicity">Simplicity</h3>
 282
 283 <p>
 284   <abbr>HTML</abbr> source code is often criticized for being difficult to
 285   read. For example, compare:
 286 </p>
 287
 288 <pre>
 289 * Item 1
 290 * Item 2
 291 </pre>
 292
 293 <p>...with:</p>
 294
 295 <pre>
 296 &lt;ul&gt;
 297     &lt;li&gt;Item 1&lt;/li&gt;
 298     &lt;li&gt;Item 2&lt;/li&gt;
 299 &lt;/ul&gt;
 300 </pre>
 301
 302 <p>
 303   Which would you prefer to edit? The answer seems obvious, but be careful
 304   not to fall into the fallacy of <a
 305   href="http://en.wikipedia.org/wiki/False_dilemma">false dilemma</a>.
 306   There <em>is</em> a third choice: the <acronym>WYSIWYG</acronym> (rich
 307   text) editor, which blows earlier choices out of the water in terms of
 308   usability.
 309 </p>
 310
 311 <p>
 312   Note that rich text editors and alternate markup syntaxes are not
 313   mutually exclusive, but, when push comes to shove, it's easier to
 314   implement this sort of editor on top of <abbr>HTML</abbr> than some obscure
 315   markup language.  And in the cases when it is done, you usually end up with
 316   a live preview, not a true rich text editor.
 317 </p>
 318
 319 <blockquote class="digression">
 320   <p>
 321     <q>Now just wait a second,</q> you may be saying, <q><acronym>WYSIWYG</acronym>
 322     editors aren't all that great.</q> There are many good arguments against
 323     these editors, and <a
 324     href="http://www.ideography.co.uk/library/seybold/WYSIWYG.html">intelligent
 325     people have written essays</a> devoted to criticizing
 326     <acronym>WYSIWYG</acronym>. In addition to the usual arguments against
 327     said editors, the web poses another limitation: no JavaScript means no
 328     editor, and no editor means... (gasp) manually typing in code.
 329   </p>
 330   <p>
 331     Even the most dogmatic purist, however, should recognize that for all
 332     its faults, prospective clients <em>really</em> want rich text editors.
 333     There are steps you can take to mitigate the associated drawbacks of
 334     these editors.
 335   </p>
 336   <p>
 337     It is often asserted that <acronym>WYSIWYG</acronym> editors
 338     <em>encourage excessive presentational markup</em>. As it turns out,
 339     this is the case with any markup language that allows the smallest
 340     iota of presentational tags, be it <tt>&lt;font&gt;</tt> or
 341     <tt>[color=red]</tt>. A good way to reduce this trouble is to simply
 342     eliminate the dialogue boxes that allow users to change colors or fonts
 343     (which usually have no legitimate use) and adopt a <acronym>WYSIWYM</acronym>
 344     scheme, allowing users to select contextually correct formatting styles
 345     for segments of text.
 346   </p>
 347 </blockquote>
 348
 349 <p>
 350   Simplicity is also a double-edged sword.  The moment any remotely
 351   complex markup is needed, these lightweight markup languages fail to
 352   produce.  Sure, you can make '''this text bold''' with Wikitext, but that
 353   infobox all <q>rendered nicely in aqua blue</q> will require a gaggle of
 354   &lt;div&gt;s and <abbr>CSS</abbr>. These languages face the same troubles
 355   as regular <abbr>HTML</abbr> filters in that their whitelist is too
 356   restrictive (besides the fact that their table markup is extraordinarily
 357   complex).
 358 </p>
 359
 360 <h3 id="AltMarkup:Security">Security</h3>
 361
 362 <p>
 363   BBCode can be boiled down to a <q>wanna-be</q> version of
 364   <abbr>HTML</abbr>. I mean, replacing
 365   the angle brackets with square brackets and omitting the occasional parameter
 366   name? How much more unoriginal can you get? Somehow, I don't think BBCode
 367   was meant to be readable. <a
 368   href="http://en.wikipedia.org/wiki/BBCode">Wikipedia</a> agrees:
 369 </p>
 370
 371 <blockquote>
 372   BBCode was devised and put to use in order to provide a safer, easier
 373   and more limited way of allowing users to format their messages.
 374   Previously, many message boards allowed the users to include <abbr>HTML</abbr>,
 375   which could be used to break/imitate parts of the layout, or run
 376   JavaScript. Some implementations of BBCode have suffered problems related
 377   to the way they translate the BBCode into <abbr>HTML</abbr>, which could negate the
 378   security that was intended to be given by BBCode.
 379 </blockquote>
 380
 381 <p>Or, put more simply:</p>
 382
 383 <blockquote>
 384   BBCode came to life when developers were too
 385   lazy to parse <abbr>HTML</abbr> correctly
 386   and decided to invent their own markup language. As with all products of
 387   laziness, the result is completely inconsistent, unstandardized, and
 388   widely adopted.
 389 </blockquote>
 390
 391 <p>
 392   Well, developers, the whole point of HTML Purifier is that I do the
 393   work so you can just execute the ridiculously simple
 394   <tt>$purifier->purify($html)</tt> call and go on to do, well, whatever
 395   you developers do. <tt>:-P</tt>
 396 </p>
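
<p>
  For reference, a minimal sketch of that call. It assumes the library has
  been unpacked into a <tt>library/</tt> directory next to your script; the
  path and the sample input are placeholders chosen for illustration.
</p>

<pre>
&lt;?php
// Assumed install location; adjust the path to wherever the library lives.
require_once 'library/HTMLPurifier.auto.php';

// Start from the library's default configuration.
$config = HTMLPurifier_Config::createDefault();
$purifier = new HTMLPurifier($config);

// Hypothetical untrusted input.
$html = '&lt;b onmouseover="xss();"&gt;Hello&lt;/b&gt;&lt;script&gt;alert(1)&lt;/script&gt;';

// purify() returns the filtered markup as a string.
echo $purifier->purify($html);
</pre>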
 397
 398 <h3 id="AltMarkup:Conclusion">Conclusion</h3>
 399
 400 <p>
 401   These alternative markup languages have their shiny points, and HTML
 402   Purifier is not meant to replace them.  However, a major reason for
 403   their existence has been called into question.  Why are <em>you</em>
 404   using these languages?
 405 </p>
 406
 407 <h2 id="Tidy">HTML Tidy</h2>
 408
 409 <p>
 410   Dave Raggett's
 411   <a href="http://www.w3.org/People/Raggett/tidy/">HTML Tidy</a> is a program;
 412   neat enough, at least, to make it into <abbr>PHP</abbr> as a
 413   <a href="http://us2.php.net/manual/en/ref.tidy.php"><abbr>PECL</abbr> extension.</a>
 414   The premise is simple, the execution effective. Tidy is, in short, a great
 415   <em>tool</em>.
 416 </p>
 417
 418 <p>
 419   It is not, however, a filter.  I am often surprised when people ask
 420   me, <q>What about Tidy?</q>  I have nothing against Tidy: Tidy tackles
 421   a different problem set.  Let's see what <tt>man tidy</tt> has to say:
 422 </p>
 423
 424 <blockquote cite="http://tidy.sourceforge.net/docs/tidy_man.html">
 425   Tidy reads <abbr>HTML</abbr>, <abbr>XHTML</abbr> and
 426   <abbr>XML</abbr> files and writes cleaned up markup. For
 427   <abbr>HTML</abbr> variants, it detects and corrects many common coding errors and
 428   strives to produce visually equivalent markup that is both <abbr>W3C</abbr> compliant
 429   and works on most browsers. A common use of Tidy is to convert plain <abbr>HTML</abbr>
 430   to <abbr>XHTML</abbr>.
 431 </blockquote>
 432
 433 <p>
 434   Hmm... why do I not see the words <q>filter</q> or
 435   <q><abbr>XSS</abbr></q> in here? Perhaps it's
 436   because Tidy accepts <em>any</em> valid
 437   <abbr>HTML</abbr>.  Including
 438   <tt>script</tt> tags.  Which leads us to our second point: Tidy parses
 439   <em>documents</em>, not document <em>fragments</em>.
 440 </p>
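
<p>
  If you have <abbr>PHP</abbr>'s tidy extension installed, you can check
  both points yourself. The sketch below is only an illustration: the
  fragment is made up, and the configuration is one plausible choice rather
  than a recommended setup. Look at whether the <tt>script</tt> element
  survives and at how much document scaffolding Tidy adds around the
  fragment.
</p>

<pre>
&lt;?php
// Requires the tidy extension.

// Hypothetical user-submitted fragment (not a complete document).
$fragment = '&lt;p&gt;Hello &lt;script&gt;alert(1);&lt;/script&gt;';

// Ask Tidy for XHTML output; no filtering options are involved.
$config = array('output-xhtml' => true);

// Tidy repairs the markup and returns the result as a string.
echo tidy_repair_string($fragment, $config, 'utf8');
</pre>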
 441
 442 <p>
 443   This is not to say that I haven't seen Tidy be used in this sort of
 444   fashion.  MediaWiki, for instance, uses Tidy to clean up the final <abbr>HTML</abbr>
 445   output before shuttling it off to the browser.  The developers, nevertheless,
 446   agree that this is only a band-aid solution, and that the real way
 447   to fix it is to fix the parser. Tidy's great, but in terms of security,
 448   it's not suitable for untrusted sources.
 449 </p>
 450
 451 <h2 id="AntiSamy">OWASP AntiSamy</h2>
 452
 453 <p>
 454   Although <a href="http://www.owasp.org/index.php/Category:OWASP_AntiSamy_Project">OWASP AntiSamy</a> is implemented in Java and .NET, it is
 455   worth a quick mention here because it purports to do the same thing
 456   as HTML Purifier. The bottom line? It gets pretty close, but
 457   it just doesn't have the same depth as HTML Purifier.
 458 </p>
 459
 460 <p>
 461   Architecturally speaking, OWASP AntiSamy is highly dependent on
 462   what are called <q>policy files</q>, which are a highly extended form
 463   of <abbr>XML</abbr> Schema with information on what attributes and elements to allow. As such,
 464   the actual code for filtering is relatively light-weight. AntiSamy
 465   gets lots of points for using legitimate <abbr>HTML</abbr> and <abbr>CSS</abbr> parsers (extra
 466   props for the <abbr>CSS</abbr> parser; HTML Purifier doesn't use one, but we should!)
 467 </p>
 468
 469 <p>
 470   Unfortunately, while <abbr>XML</abbr> Schema files can give a high level of
 471   control over validation, the regular-expression-heavy approach
 472   begins showing signs of stress when data-types are complex (e.g.
 473   <abbr>URI</abbr>s), and <abbr>XML</abbr> Schema is ill-suited for large-scale <acronym>DOM</acronym> manipulation,
 474   which is necessary when transforming <abbr>HTML</abbr> for standards compliance.
 475   Nonetheless, I would be fairly confident in its <abbr>XSS</abbr> cleaning
 476   abilities, so long as it removes things it doesn't recognize by default
 477   (something I find slightly perplexing in its policy files, since some
 478   rules indicate things to be removed.)
 479 </p>
 480
 481 <h2 id="Preface">Preface</h2>
 482
 483 <p>
 484   I've ordered my analyses according to how bad a library is.  The worst
 485   is first, and then we move up the spectrum.  I will point out the most
 486   flagrant problems with the libraries, but note that I will omit more
 487   advanced vulnerabilities: if you can't catch an <tt>onmouseover</tt>
 488   attribute, I really shouldn't reprimand you for letting non-<abbr>SGML</abbr> code
 489   points through.  The ideal solution, however, must do all these things.
 490 </p>
 491
 492 <p>
 493   Note that besides striptags,
 494   most of the libraries are moderately effective against the most common <abbr>XSS</abbr>
 495   attacks.  None of them (save Safe HTML Checker) fare very well
 496   in the standards-compliance department though.
 497 </p>
 498
 499 <h2 id="striptags">strip_tags()</h2>
 500
 501 <table class="summary">
 502   <tr><th>Whitelist</th>              <td class="impl-yes">Yes, user-specified</td></tr>
 503   <tr><th>Removes foreign tags</th>   <td class="impl-partial">Buggy</td></tr>
 504   <tr><th>Makes well-formed</th>      <td class="impl-no">No</td></tr>
 505   <tr><th>Fixes nesting</th>          <td class="impl-no">No</td></tr>
 506   <tr><th>Validates attributes</th>   <td class="impl-no">No</td></tr>
 507 </table>
 508
 509 <p>
 510   The <abbr>PHP</abbr> function
 511   <a href="http://php.net/manual/en/function.strip-tags.php">strip_tags()</a> is
 512   the classic solution for attempting to clean up
 513   <abbr>HTML</abbr>.  It
 514   is also the <em>worst</em> solution, and should be avoided like the plague.
 515   The fact that it doesn't validate attributes at all means that anyone can
 516   insert an <tt>onmouseover='xss();'</tt> and exploit your application.
 517 </p>
 518
 519 <p>
 520   While this can be patched over with a series of regular expressions that strip out
 521   on[event] attributes (you're still vulnerable to <abbr>XSS</abbr> and at the mercy of
 522   quirky browser behavior), strip_tags() is fundamentally flawed and should not be
 523   used.
 524 </p>
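
<p>
  The attribute problem is easy to reproduce. A minimal sketch, with an
  input string invented for illustration:
</p>

<pre>
&lt;?php
// Hypothetical user input: an allowed tag carrying an event handler.
$input = '&lt;b onmouseover="xss();"&gt;hover me&lt;/b&gt; &lt;i&gt;dropped&lt;/i&gt;';

// Allow only &lt;b&gt;; the second argument is the list of permitted tags.
$output = strip_tags($input, '&lt;b&gt;');

// The &lt;i&gt; tags are removed, but the onmouseover attribute on &lt;b&gt;
// passes through untouched.
var_dump(strpos($output, 'onmouseover') !== false); // bool(true)
</pre>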
 525
 526 <h2 id="Input_Filter">PHP Input Filter</h2>
 527
 528 <p>
 529   Though its title may not imply it,
 530   <a href="http://www.phpclasses.org/browse/package/2189.html">PHP Input Filter</a>
 531   is a souped-up version of strip_tags() with the ability to inspect
 532   attributes.  (Don't mind the hastily tacked on query escaping function.)
 533 </p>
 534
 535 <table class="summary">
 536   <tr><th>Version</th>                <td class="impl-yes">1.2.2</td></tr>
 537   <tr><th>Last update</th>            <td class="impl-irrelevant">2005-10-05</td></tr>
 538   <tr><th>License</th>                <td class="impl-irrelevant">GPL</td></tr>
 539   <tr><th>Whitelist</th>              <td class="impl-yes">Yes, user defined</td></tr>
 540   <tr><th>Removes foreign tags</th>   <td class="impl-yes">Yes</td></tr>
 541   <tr><th>Makes well-formed</th>      <td class="impl-no">No</td></tr>
 542   <tr><th>Fixes nesting</th>          <td class="impl-no">No</td></tr>
 543   <tr><th>Validates attributes</th>   <td class="impl-partial">Partial</td></tr>
 544   <tr><th>XSS safe</th>               <td class="impl-almostyes">Probably</td></tr>
 545   <tr><th>Standards safe</th>         <td class="impl-no">No</td></tr>
 546 </table>
 547
 548 <p>
 549   PHP Input Filter implements an
 550   <abbr>HTML</abbr> parser, and
 551   performs very basic checks on whether or not tags and attributes have
 552   been defined in the whitelist as well as some
 553   smarter <abbr>XSS</abbr> checks.  It is left up to
 554   the user to define what they'll permit.
 555 </p>
 556
 557 <p>
 558   With absolutely no checking of well-formedness, it is trivially easy
 559   to trick the filter into leaving unclosed tags lying around. While
 560   standards-compliance may be viewed by some as a <q>nice feature</q>,
 561   basic sanity checks like this must be implemented, otherwise a user
 562   can mangle a website's layout.
 563 </p>
 564
 565 <p>
 566   More troubles: Woe to
 567   any user that allows the <tt>style</tt> attribute: you can't simply
 568   let <abbr>CSS</abbr> through and expect your
 569   layout not to be badly mutilated. To top things off,
 570   the filter doesn't even preserve data properly: attributes have all
 571   spaces stripped out of them.  Stay away, stay away!
 572 </p>
 573
 574 <h2 id="HTML_Safe">HTML_Safe/SafeHTML</h2>
 575
 576 <p>
 577   <a href="http://pear.php.net/package/HTML_Safe">HTML_Safe</a> is
 578   <acronym>PEAR</acronym>'s <abbr>HTML</abbr> filtering library.
 579   It should be noted that this is the same library as
 580   <a href="http://pixel-apes.com/safehtml/">SafeHTML</a>, though with different
 581   branding (and a different version number).
 582 </p>
 583
 584 <table class="summary">
 585   <tr><th>Version</th>                <td class="impl-almostyes">0.9.9beta</td></tr>
 586   <tr><th>Last update</th>            <td class="impl-irrelevant">2005-12-21</td></tr>
 587   <tr><th>License</th>                <td class="impl-irrelevant">BSD (3 clause)</td></tr>
 588   <tr><th>Whitelist</th>              <td class="impl-no">Mostly No</td></tr>
 589   <tr><th>Removes foreign tags</th>   <td class="impl-yes">Yes</td></tr>
 590   <tr><th>Makes well-formed</th>      <td class="impl-yes">Yes</td></tr>
 591   <tr><th>Fixes nesting</th>          <td class="impl-no">No</td></tr>
 592   <tr><th>Validates attributes</th>   <td class="impl-partial">Partial</td></tr>
 593   <tr><th>XSS safe</th>               <td class="impl-almostyes">Probably</td></tr>
 594   <tr><th>Standards safe</th>         <td class="impl-no">No</td></tr>
 595 </table>
 596
 597 <p>
 598   HTML_Safe's mechanism of action involves parsing
 599   <abbr>HTML</abbr> with a
 600   <acronym>SAX</acronym> parser and performing
 601   validation and filtering as the handlers are called.  HTML_Safe does a lot
 602   of things right, which is why I say it <em>probably</em> isn't vulnerable
 603   to <abbr>XSS</abbr>, but its approach
 604   is fundamentally flawed: blacklists.
 605 </p>
 606
 607 <p>
 608   This library maintains arrays of dangerous tags, attributes and
 609   <abbr>CSS</abbr> properties.  (It also
 610   has a blacklist of dangerous <abbr>URI</abbr> protocols, but this is
 611   intelligently disabled by default in favor of a protocol whitelist.)
 612   What this means is that HTML_Safe has no qualms about accepting input
 613   like <tt>&lt;foobar&gt; Bang &lt;/foobar&gt;</tt>.  Anything goes except
 614   the tags in those arrays.  Scratch standards-compliance (and that was
 615   without even considering proper nesting).
 616 </p>
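
<p>
  The difference between the two approaches can be sketched in a few lines.
  The lists and function names below are hypothetical and deliberately tiny;
  they are not HTML_Safe's actual arrays or code.
</p>

<pre>
&lt;?php
// Hypothetical lists, for illustration only.
$blacklist = array('script', 'iframe', 'object');
$whitelist = array('b', 'i', 'a', 'p');

// Blacklist check: anything not explicitly forbidden is accepted.
function allowed_by_blacklist($tag, $blacklist) {
    return !in_array(strtolower($tag), $blacklist, true);
}

// Whitelist check: anything not explicitly permitted is rejected.
function allowed_by_whitelist($tag, $whitelist) {
    return in_array(strtolower($tag), $whitelist, true);
}

var_dump(allowed_by_blacklist('foobar', $blacklist)); // bool(true)
var_dump(allowed_by_whitelist('foobar', $whitelist)); // bool(false)
</pre>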
 617
 618 <p>
 619   For now, HTML_Safe might be safe from <abbr>XSS</abbr>.
 620   In the future, however, one of the infinitely many tags that HTML_Safe lets
 621   through might just possibly be given special functionality by browser vendors.
 622   And it might just turn out that this can be exploited.  <em>Any</em> blacklist
 623   solution puts you at a perpetual arms race against crackers who are constantly
 624   discovering new and inventive ways to abuse tags and attributes that you
 625   didn't blacklist.
 626 </p>
 627
 628 <h2 id="kses">kses</h2>
 629
 630 <p>
 631   <a href="http://sourceforge.net/projects/kses/">kses</a> appears to
 632   be the de facto solution for cleaning <abbr>HTML</abbr>, having found
 633   its way into applications such as <a href="http://wordpress.org/">WordPress</a>
 634   and being the number one search result for <q>php html filter</q>.
 635 </p>
 636
 637 <table class="summary">
 638   <tr><th>Version</th>                <td class="impl-partial">0.2.2</td></tr>
 639   <tr><th>Last update</th>            <td class="impl-irrelevant">2005-02-06</td></tr>
 640   <tr><th>License</th>                <td class="impl-irrelevant">GPL</td></tr>
 641   <tr><th>Whitelist</th>              <td class="impl-yes">Yes, user defined</td></tr>
 642   <tr><th>Removes foreign tags</th>   <td class="impl-yes">Yes</td></tr>
 643   <tr><th>Makes well-formed</th>      <td class="impl-no">No</td></tr>
 644   <tr><th>Fixes nesting</th>          <td class="impl-no">No</td></tr>
 645   <tr><th>Validates attributes</th>   <td class="impl-partial">Partial</td></tr>
 646   <tr><th>XSS safe</th>               <td class="impl-almostyes">Probably</td></tr>
 647   <tr><th>Standards safe</th>         <td class="impl-no">No</td></tr>
 648 </table>
 649
 650 <p>
 651   To be truthful, I didn't do as comprehensive a code survey for kses
 652   as I did for some of the other libraries.  Out of
 653   all the classes I've reviewed so far, kses was definitely the hardest to
 654   understand.
 655 </p>
 656
 657 <p>
 658   kses's modus operandi is splitting up <abbr>HTML</abbr> with a monster regexp
 659   and then validating each section with <tt>kses_split2()</tt>.  It
 660   suffers from the same problems as Input Filter: no well-formedness
 661   checks leading to rampant runaway tags (and no standards-compliance).
 662   WordPress, the primary user of kses today, had to implement their
 663   own custom tag-balancing code to fix this problem: don't use this
 664   library without some equivalent!
 665 </p>
 666
 667 <p>
 668   Its whitelist syntax, however, is the most complex of all these libraries,
 669   so I'm going to take some time to argue why this particular implementation
 670   is bad.  The author of this library was thoughtful enough to provide some
 671   basic constraint checks on attributes like maxlen and maxval.  Now, barring
 672   the fact that there simply aren't enough checks, and the fact that they are
 673   all lumped together in one function, we now must wonder whether or not
 674   the user will go through the trouble of specifying the maximum length
 675   of a title attribute.
 676 </p>
 677
 678 <p>
 679   I have my opinions about inherent human laziness, but perhaps WordPress's
 680   default filterset is the most telling example:
 681 </p>
 682
 683 <pre>
 684 $allowedposttags = array (
 685     /* formatted and trimmed */
 686     'hr' => array (
 687         'align' => array (),
 688         'noshade' => array (),
 689         'size' => array (),
 690         'width' => array ()
 691      )
 692 );
 693 </pre>
 694
 695 <p>
 696   Hmm... do I see a blatant lack of attribute constraints?  Conclusion:
 697   if the user can get away with not doing work, they will!  The biggest
 698   problem in all these whitelist filters is that they forgot to <em>supply</em>
 699   the whitelist.  The whitelist is just as important as the code that uses
 700   the whitelist to filter <abbr>HTML</abbr>.
 701 </p>
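
<p>
  For contrast, here is a sketch of what a constrained whitelist could look
  like, using the <tt>maxlen</tt> and <tt>maxval</tt> checks mentioned
  earlier. The limits are arbitrary values picked for illustration, and the
  include path and sample input are assumptions.
</p>

<pre>
&lt;?php
// Assumes kses.php from the kses distribution is on the include path.
require_once 'kses.php';

$allowed = array (
    'a' => array (
        'href'  => array ('maxlen' => 200), /* arbitrary limit */
        'title' => array ('maxlen' => 80)   /* arbitrary limit */
    ),
    'hr' => array (
        'size' => array ('maxval' => 10)    /* arbitrary limit */
    )
);

// Hypothetical untrusted input.
$dirty = '&lt;a href="http://www.example.com/" title="Example"&gt;link&lt;/a&gt;&lt;hr size="500" /&gt;';

echo kses($dirty, $allowed);
</pre>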
 702
 703 <h2 id="htmLawed">htmLawed</h2>
 704
 705 <p>
 706   <a href="http://www.bioinformatics.org/phplabware/internal_utilities/htmLawed/index.php">htmLawed</a>
 707   is kses on steroids. After looking at HTML Purifier and deciding that it was
 708   too slow for him, Santosh Patnaik went ahead and rewrote the kses engine
 709   with more features.
 710 </p>
 711
 712 <table class="summary">
 713   <tr><th>Version</th>                <td class="impl-yes">1.1.9.1</td></tr>
 714   <tr><th>Last update</th>            <td class="impl-irrelevant">2009-02-26</td></tr>
 715   <tr><th>License</th>                <td class="impl-irrelevant">GPL</td></tr>
 716   <tr><th>Whitelist</th>              <td class="impl-partial">Yes, but blacklist is default</td></tr>
 717   <tr><th>Removes foreign tags</th>   <td class="impl-almostyes">Yes, user defined</td></tr>
 718   <tr><th>Makes well-formed</th>      <td class="impl-almostyes">Yes, user defined</td></tr>
 719   <tr><th>Fixes nesting</th>          <td class="impl-no">Partial</td></tr>
 720   <tr><th>Validates attributes</th>   <td class="impl-partial">Partial</td></tr>
 721   <tr><th>XSS safe</th>               <td class="impl-no">Probably</td></tr>
 722   <tr><th>Standards safe</th>         <td class="impl-no">No</td></tr>
 723 </table>
 724
 725 <p>
 726   htmLawed improves standards-compliance, but it is not fully
 727   standards-compliant; there are a number of cases which the author has
 728   explicitly stated he will not fix. There are issues with content
 729   models in <code>table</code> and <code>ruby</code> and tags that
 730   <em>must</em> have content in them.
 731 </p>
 732
 733 <p>
 734   Let's, for a moment, imagine that htmLawed is <abbr>XSS</abbr>-safe when
 735   <code>safe</code> is on.
 736   Even then, it still is not <abbr>XSS</abbr>-safe out of the tin: you have
 737   to turn on htmLawed's security features! This is
 738   <a href="http://www.bioinformatics.org/phplabware/forum/viewtopic.php?id=28">by
 739   design</a>. Sane defaults are important, because for every person who
 740   does read the documentation, there is
 741   <a href="http://www.bioinformatics.org/phplabware/forum/viewtopic.php?id=28">another</a>
 742   one who doesn't (and is misled by claims that <q>htmLawed is a single-file PHP
 743   software that makes input text secure</q>), and is
 744   surprised at some behavior.
 745   Software must be <strong>safe by default</strong>; the user can then relax
 746   any security restrictions.
 747 </p>
 748
 749 <p>
 750   I also disagree with some of the choices with regard to which elements are
 751   <q>safe</q>. <code>form</code>
 752   is <abbr>XSS</abbr>-safe,
 753   but it is certainly not phishing-safe. Forms can be
 754   used to spoof system dialogs <em>on the host site's own domain</em>. These should
 755   <em>not</em> be allowed in <code>safe</code> mode.
 756 </p>
 757
 758 <h2 id="Safe_HTML_Checker">Safe HTML Checker</h2>
 759
 760 <p>
 761   <a href="http://simon.incutio.com/archive/2003/02/23/safeHtmlChecker">Safe
 762   HTML Checker</a> is (to my knowledge) the first attempt to make a filter
 763   that also outputs standards-compliant <abbr>XHTML</abbr>.  It wasn't even released or
 764   licensed officially, but we'll let that slide: a 4<sup>th</sup> place
 765   search result must have done something right.
 766 </p>
 767
 768 <table class="summary">
 769     <tr><th>Version</th>                <td class="impl-partial">in-house</td></tr>
 770     <tr><th>Last update</th>            <td class="impl-almostyes">2003-09-15</td></tr>
 771     <tr><th>License</th>                <td class="impl-no">undefined</td></tr>
 772     <tr><th>Whitelist</th>              <td class="impl-partial">Yes (bare-bones)</td></tr>
 773     <tr><th>Removes foreign tags</th>   <td class="impl-yes">Yes</td></tr>
 774     <tr><th>Makes well-formed</th>      <td class="impl-yes">Yes</td></tr>
 775     <tr><th>Fixes nesting</th>          <td class="impl-almostyes">Almost</td></tr>
 776     <tr><th>Validates attributes</th>   <td class="impl-partial">Partial</td></tr>
 777     <tr><th>XSS safe</th>               <td class="impl-yes">Yes</td></tr>
 778     <tr><th>Standards safe</th>         <td class="impl-almostyes">Almost</td></tr>
 779 </table>
 780
 781 <p>
 782   Indeed, it is quite a well-written piece of code.  It demonstrates
 783   knowledge of inline versus block elements, thus very nearly getting
 784   nesting correct (the only exception is the unimplemented SGML
 785   exclusion for <tt>&lt;a&gt;</tt> tags, which is easy to fix).
 786 </p>
 787
 788 <p>
 789   Unfortunately, part of the reason why it works so well is that it's
 790   extremely restrictive.  No styling, no tables, very few attributes.
 791   Perfectly appropriate for blog comments, but then again, there's always
 792   BBCode.  This probably means that Safe HTML Checker has a different
 793   goal than HTML Purifier.
 794 </p>
 795
 796 <p>
 797   The <abbr>XML</abbr> parser is also quite strict. Accidentally missed a
 798   &lt; sign? The parser will complain with the cryptic message:
 799   <q><abbr>XHTML</abbr> is not well-formed</q>. The solution is not as
 800   simple as just switching to a more permissive parser: Safe HTML Checker
 801   relies on the fact that the parser will have matched up the tags for
 802   it.
 803 </p>
 804
 805 <h2 id="HTMLPurifier">HTML Purifier</h2>
 806
 807 <table class="summary">
 808   <tr><th>Version</th>                <td class="impl-yes">&htmlpurifier.current.version;</td></tr>
 809   <tr><th>Last update</th>            <td class="impl-yes">&htmlpurifier.current.release-date;</td></tr>
 810   <tr><th>License</th>                <td class="impl-irrelevant">LGPL</td></tr>
 811   <tr><th>Whitelist</th>              <td class="impl-yes">Yes</td></tr>
 812   <tr><th>Removes foreign tags</th>   <td class="impl-yes">Yes</td></tr>
 813   <tr><th>Makes well-formed</th>      <td class="impl-yes">Yes</td></tr>
 814   <tr><th>Fixes nesting</th>          <td class="impl-yes">Yes</td></tr>
 815   <tr><th>Validates attributes</th>   <td class="impl-yes">Yes</td></tr>
 816   <tr><th>XSS safe</th>               <td class="impl-yes">Yes</td></tr>
 817   <tr><th>Standards safe</th>         <td class="impl-yes">Yes</td></tr>
 818 </table>
 819
 820 <p>
 821   That table should say it all, but I'll add a few more features:
 822 </p>
 823
 824 <table class="summary">
 825   <tr><th>UTF-8 aware</th><td class="impl-yes">Yes</td></tr>
 826   <tr><th>Object-Oriented</th><td class="impl-yes">Yes</td></tr>
 827   <tr><th>Validates CSS</th><td class="impl-yes">Yes</td></tr>
 828   <tr><th>Tables</th><td class="impl-yes">Yes</td></tr>
 829   <tr><th>PHP 5 only</th><td class="impl-yes">Yes</td></tr>
 830   <tr><th>E_STRICT compliant</th><td class="impl-yes">Yes</td></tr>
 831   <tr><th>Can auto-paragraph</th><td class="impl-yes">Yes</td></tr>
 832   <tr><th>Extensible</th><td class="impl-yes">Yes</td></tr>
 833   <tr><th>Unit tested</th><td class="impl-yes">Yes</td></tr>
 834 </table>
 835
 836 <p>
 837   This is not to say that HTML Purifier doesn't have problems of its own.
 838   It's big (while the others usually fit in one file, this one requires a huge
 839   include list), and it's <a href="http://htmlpurifier.org/live/TODO">missing
 840   features</a>. But even with these deficiencies,
 841   HTML Purifier is far better than the other libraries.
 842 </p>
 843
 844 <p>
 845   So... <a href="download">what are you waiting for?</a>
 846 </p>
 847
 848 </div>
 849 </div>
 850 </body>
 851 </html>