comparison.xhtml

   1 <?xml version="1.0" encoding="UTF-8"?>
   2 <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
   3     "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
   4 <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
   5 <head>
   6 <title>Comparison - HTML Purifier</title>
   7 <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
   8 <meta name="keywords" content="HTMLPurifier, HTML Purifier, HTML, filter, filtering, HTML_Safe, PEAR, comparison, kses, striptags, SafeHTMLChecker" />
   9 <meta name="author" content="Edward Z. Yang" />
  10 <link rel="icon" href="./favicon.ico" type="image/x-icon" />
  11 <link rel="shortcut icon" href="./favicon.ico" type="image/x-icon" />
  12 <link rel="stylesheet" href="./style.css" type="text/css" />
  13 <!--[if lt IE 7.]><script defer="defer" type="text/javascript" src="./pngfix.js"></script><![endif]-->
  14 </head>
  15 <body>
  16
  17 <img src="./logo.png" id="logo" alt="HTML Purifier" />
  18
  19 <h1 id="title">Comparison</h1>
  20 <div id="header"><a href="./"><span class="html">HTML</span> <span class="purifier">Purifier</span></a></div>
  21
  22 <div id="content">
  23 <p class="lead">With the advent of
  24 <a href="http://en.wikipedia.org/wiki/Web_2.0">Web 2.0</a>, the end user has
  25 gone from passive consumer to active producer of content on the World Wide
  26 Web.  <a href="http://en.wikipedia.org/wiki/Wiki">Wikis</a>,
  27 <a href="http://en.wikipedia.org/wiki/Social_software">Social Software</a> and
  28 <a href="http://en.wikipedia.org/wiki/Blog">Blogs</a> all
  29 put the user in control.</p>
  30
  31 <p>Give the user too much control, however, and you set yourself up
  32 for <a href="http://en.wikipedia.org/wiki/Cross-site_scripting"><acronym>XSS</acronym></a> attacks.  For this reason,
  33 <acronym>HTML</acronym>'s flexibility
  34 has proven to be both a blessing and a curse, and the software that processes
  35 it must strike a fine balance between security and usability.  How do
  36 we prevent users from injecting JavaScript or inserting malformed
  37 <acronym>HTML</acronym> while allowing
  38 a rich syntax of tags, attributes and <acronym>CSS</acronym>? How do we put
  39 <acronym>HTML</acronym> inside
  40 <acronym>RSS</acronym> feed without worrying
  41 about sloppy coding messing up <acronym>XML</acronym> parsing?
  42 Almost every <acronym>PHP</acronym>
  43 developer has come across this problem before, and many have tried
  44 (albeit unsuccessfully) to solve this problem.  We will analyze existing
  45 libraries to demonstrate how they are ineffective and, of course,
  46 how <strong>HTML Purifier</strong> solves all our problems and achieves
  47 standards-compliance.</p>
  48
  49 <p>I will take no quarter and pull no punches: as of the time of writing,
  50 no other library comes even <em>close</em> to solving the problem effectively
  51 for richly formatted documents.  But, nonetheless, there is a necessary
  52 disclaimer:</p>
  53
  54 <p class="disclaimer">
  55     This comparison document was written by the author of HTML Purifier,
  56     and clearly is <strong>in favor</strong> of HTML Purifier. However, that doesn't
  57     mean that it is biased: I have made every attempt to be <strong>factual and
  58     fair</strong>, and I hope that you will agree, by the time you finish reading
  59     this document, that HTML Purifier is the only satisfactory HTML
  60     filter out there today.
  61 </p>
  62
  63 <div id="toc" />
  64
  65 <h2 id="Summary">Summary</h2>
  66
  67 <p class="lead">A table summarizing the differences for the impatient.</p>
  68
  69 <div class="wide-table">
  70 <table cellspacing="0">
  71
  72 <thead>
  73     <tr>
  74         <th>Library</th>
  75         <th>Version</th>
  76         <th>Date</th>
  77         <th>License</th>
  78         <th>Whitelist</th>
  79         <th>Removal</th>
  80         <th>Well-formed</th>
  81         <th>Nesting</th>
  82         <th>Attributes</th>
  83         <th>XSS&nbsp;safe</th>
  84         <th>Standards&nbsp;safe</th>
  85     </tr>
  86 </thead>
  87
  88 <tbody>
  89
  90 <tr>
  91     <td>striptags</td>
  92     <td>n/a</td>
  93     <td>n/a</td>
  94     <td>n/a</td>
  95     <td class="impl-almostyes">Yes (user)</td>
  96     <td class="impl-partial">Buggy</td>
  97     <td class="impl-no">No</td>
  98     <td class="impl-no">No</td>
  99     <td class="impl-no">No</td>
 100     <td class="impl-no">No</td>
 101     <td class="impl-no">No</td>
 102 </tr>
 103
 104 <tr>
 105     <td>PHP Input Filter</td>
 106     <td>1.2.2</td>
 107     <td>2005-10-05</td>
 108     <td>GPL</td>
 109     <td class="impl-almostyes">Yes (user)</td>
 110     <td class="impl-yes">Yes</td>
 111     <td class="impl-no">No</td>
 112     <td class="impl-no">No</td>
 113     <td class="impl-partial">Partial</td>
 114     <td class="impl-almostyes">Probably</td>
 115     <td class="impl-no">No</td>
 116 </tr>
 117
 118 <tr>
 119     <td>HTML_Safe</td>
 120     <td>0.9.9beta</td>
 121     <td>2005-12-21</td>
 122     <td>BSD (3)</td>
 123     <td class="impl-no">Mostly No</td>
 124     <td class="impl-yes">Yes</td>
 125     <td class="impl-yes">Yes</td>
 126     <td class="impl-no">No</td>
 127     <td class="impl-partial">Partial</td>
 128     <td class="impl-almostyes">Probably</td>
 129     <td class="impl-no">No</td>
 130 </tr>
 131
 132 <tr>
 133     <td>kses</td>
 134     <td>0.2.2</td>
 135     <td>2005-02-06</td>
 136     <td>GPL</td>
 137     <td class="impl-almostyes">Yes (user)</td>
 138     <td class="impl-yes">Yes</td>
 139     <td class="impl-no">No</td>
 140     <td class="impl-no">No</td>
 141     <td class="impl-partial">Partial</td>
 142     <td class="impl-almostyes">Probably</td>
 143     <td class="impl-no">No</td>
 144 </tr>
 145
 146 <tr>
 147     <td>Safe HTML Checker</td>
 148     <td>n/a</td>
 149     <td>2003-09-15</td>
 150     <td>n/a</td>
 151     <td class="impl-almostyes">Yes (bare)</td>
 152     <td class="impl-yes">Yes</td>
 153     <td class="impl-yes">Yes</td>
 154     <td class="impl-almostyes">Almost</td>
 155     <td class="impl-partial">Partial</td>
 156     <td class="impl-yes">Yes</td>
 157     <td class="impl-almostyes">Almost</td>
 158 </tr>
 159
 160 <tr>
 161     <td>HTML Purifier</td>
 162     <td>1.4.1</td>
 163     <td>2007-01-21</td>
 164     <td>LGPL</td>
 165     <td class="impl-yes">Yes</td>
 166     <td class="impl-yes">Yes</td>
 167     <td class="impl-yes">Yes</td>
 168     <td class="impl-yes">Yes</td>
 169     <td class="impl-yes">Yes</td>
 170     <td class="impl-yes">Yes</td>
 171     <td class="impl-yes">Yes</td>
 172 </tr>
 173
 174 </tbody>
 175
 176 </table>
 177 </div>
 178
 179 <p class="lead"><a href="#Tidy">HTML Tidy</a> is omitted from this list because it is not an <acronym>HTML</acronym>
 180 filter.</p>
 181
 182 <h2 id="AltMarkup">Look Ma, No <acronym>HTML</acronym>!</h2>
 183
 184 <blockquote class="fancy">
 185     <div class="quote" style="text-align:center;">
 186         A clever person solves a problem.
 187         A wise person avoids it.
 188     </div>
 189     <div class="origin">&mdash; Albert Einstein</div>
 190 </blockquote>
 191
 192 <p class="lead">Before we jump into the weird and not-so-wonderful world
 193 of <acronym>HTML</acronym> filters, we must first consider another domain: alternate
 194 markup libraries. While libraries of this type really shouldn't be
 195 considered <acronym>HTML</acronym> filters,
 196 they are the number one method of taking user input and processing it into
 197 something more than plain old text.  These libraries forgo
 198 <acronym>HTML</acronym> and define their
 199 own markup syntax. <a href="http://en.wikipedia.org/wiki/BBCode">BBCode</a>,
 200 <a href="http://en.wikipedia.org/wiki/Wikitext">Wikitext</a>,
 201 <a href="http://daringfireball.net/projects/markdown/">Markdown</a> and
 202 <a href="http://textism.com/tools/textile/">Textile</a> are all examples of
 203 such markup languages (although it should be noted that
 204 Wikitext and Markdown can allow
 205 <acronym>HTML</acronym> within them).
 206 The benefits (to those who use it, anyway) are clear: simplicity and
 207 security.
 208 </p>
 209
 210 <table cellspacing="0">
 211     <thead>
 212         <tr>
 213             <th>Markup language</th>
 214             <th>Sample</th>
 215         </tr>
 216     </thead>
 217     <tbody>
 218         <tr>
 219             <th>BBCode</th>
 220             <td><tt>[b]B[/b] [i]i[/i] [url = http://www.example.com/]link[/url].</tt></td>
 221         </tr>
 222         <tr>
 223             <th>Wikitext<sup>1</sup></th>
 224             <td><tt>'''B''' ''i'' [http://www.example.com/ link]</tt></td>
 225         </tr>
 226         <tr>
 227             <th>Markdown<sup>2</sup></th>
 228             <td><tt>**B** *i* [link](http://www.example.com/)</tt></td>
 229         </tr>
 230         <tr>
 231             <th>Textile</th>
 232             <td><tt>*B* _i_ &quot;link&quot;:http://www.example.com/</tt></td>
 233         </tr>
 234         <tr>
 235             <th><acronym>HTML</acronym></th>
 236             <td><tt>&lt;b&gt;B&lt;/b&gt; &lt;i&gt;i&lt;/i&gt; &lt;a href=&quot;http://www.example.com/&quot;&gt;link&lt;/a&gt;</tt></td>
 237         </tr>
 238         <tr>
 239             <th><acronym>WYSIWYG</acronym></th>
 240             <td><b>B</b> <i>i</i> <a href="http://www.example.com/">link</a></td>
 241         </tr>
 242     </tbody>
 243 </table>
 244
 245 <ol class="notes">
 246     <li>Wikitext shown is modeled after <a
 247         href="http://www.mediawiki.org/wiki/MediaWiki">MediaWiki</a> style.
 248         There are many variants of Wikitext currently extant.</li>
 249     <li>Strictly speaking, the Markdown syntax is not equivalent: bold text
 250         is expressed as <code>&lt;strong&gt;</code> and italicized text is
 251         expressed as <code>&lt;em&gt;</code>. Most browser default stylesheets,
 252         however, map those two semantic tags to the associated styling, so
 253         many users assume that it really is italics (and use it improperly for,
 254         say, book titles.)</li>
 255 </ol>
 256
 257 <h3 id="AltMarkup:Simplicity">Simplicity</h3>
 258
 259 <p class="lead"><acronym>HTML</acronym>
 260 source code is often criticized for being difficult to read. For example,
 261 compare:</p>
 262
 263 <pre>
 264 * Item 1
 265 * Item 2
 266 </pre>
 267
 268 <p>...versus:</p>
 269
 270 <pre>
 271 &lt;ul&gt;
 272     &lt;li&gt;Item 1&lt;/li&gt;
 273     &lt;li&gt;Item 2&lt;/li&gt;
 274 &lt;/ul&gt;
 275 </pre>
 276
 277 <p>Which would you prefer to edit? The answer seems obvious, but be careful
 278 not to fall into the fallacy of <a
 279 href="http://en.wikipedia.org/wiki/False_dilemma">false dilemma</a>.
 280 There <em>is</em> a third choice: the
 281 <acronym>WYSIWYG</acronym> (rich text)
 282 editor, which blows earlier choices out of the water in terms
 283 of usability.</p>
 284
 285 <p>Note that rich text editors and alternate markup syntaxes are not
 286 mutually exclusive, but, when push comes to shove, it's easier
 287 implement this sort of editor on top of <acronym>HTML</acronym> than some obscure
 288 markup language.  And in the cases when it is done, you usually end up with
 289 a live preview, not a true rich text editor.</p>
 290
 291 <blockquote class="digression">
 292     <p><q>Now just wait a second,</q> you may be saying,
 293     <q><acronym>WYSIWYG</acronym>
 294     editors aren't all that great.</q>  There are many good arguments
 295     against these editors, and <a
 296     href="http://www.ideography.co.uk/library/seybold/WYSIWYG.html">intelligent
 297     people have written essays</a> devoted to
 298     criticizing <acronym>WYSIWYG</acronym>.
 299     In addition to the usual arguments against said editors, the web poses
 300     another limitation: no JavaScript means no
 301     editor, and no editor means... (gasp) manually typing in code.</p>
 302
 303     <p>Even the most dogmatic purist, however, should recognize that for all
 304     its faults, prospective clients <em>really</em> want rich text editors.
 305     There are steps you can take to mitigate the associated drawbacks of
 306     these editors.</p>
 307
 308     <p>It is often asserted that
 309     <acronym>WYSIWYG</acronym> editors
 310     <em>encourage excessive presentational markup</em>. As it turns out,
 311     this is the case with any markup language that allows the smallest
 312     iota of presentational tags, be it <tt>&lt;font&gt;</tt> or
 313     <tt>[color=red]</tt>.
 314     A good way to reduce this trouble is to simply eliminate the
 315     dialogue boxes that allow users to change colors or fonts (which
 316     usually have no legitimate use) and adopt a
 317     <acronym>WYSIWYM</acronym> scheme,
 318     allowing users to select contextually correct formatting styles
 319     for segments of text.</p>
 320 </blockquote>
 321
 322 <p>Simplicity is also a double-edged sword.  The moment any remotely
 323 complex markup is needed, these lightweight markup languages fail to
 324 produce.  Sure you can make '''this text bold''' with Wikitext, but that
 325 infobox all <q>rendered nicely in aqua blue</q> will require a gaggle of
 326 &lt;div&gt;s and <acronym>CSS</acronym>.
 327 These languages face the same troubles as regular <acronym>HTML</acronym>
 328 filters in that their whitelist is too restrictive (besides the fact that
 329 their table markup is extraordinarily complex).</p>
 330
 331 <h3 id="AltMarkup:Security">Security</h3>
 332
 333 <p class="lead">BBCode can be boiled down to a <q>wanna-be</q> version of
 334 <acronym>HTML</acronym>. I mean, replacing
 335 the angled brackets with square brackets and omitting the occasional parameter
 336 name? How much more un-original can you get? Somehow, I don't think BBCode
 337 was meant to readable. <a
 338 href="http://en.wikipedia.org/wiki/BBCode">Wikipedia</a> agrees:</p>
 339
 340 <blockquote>
 341     BBCode was devised and put to use in order to provide a safer, easier
 342     and more limited way of allowing users to format their messages.
 343     Previously, many message boards allowed the users to include <acronym>HTML</acronym>,
 344     which could be used to break/imitate parts of the layout, or run
 345     JavaScript. Some implementations of BBCode have suffered problems related
 346     to the way they translate the BBCode into <acronym>HTML</acronym>, which could negate the
 347     security that was intended to be given by BBCode.
 348 </blockquote>
 349
 350 <p>Or, put more simply:</p>
 351
 352 <blockquote>
 353     BBCode came to life when developers where too
 354     lazy to parse <acronym>HTML</acronym> correctly
 355     and decided to invent their own markup language. As with all products of
 356     laziness, the result is completely inconsistent, unstandardized, and
 357     widely adopted.
 358 </blockquote>
 359
 360 <p>Well, developers, the whole point of HTML Purifier is that I do the
 361 work so you can just execute the ridiculously simple
 362 <tt>$purifier->purify($html)</tt> call and go on to do, well, whatever
 363 you developers do. <tt>:-P</tt></p>
 364
 365 <h3 id="AltMarkup:Conclusion">Conclusion</h3>
 366
 367 <p>These alternative markup languages have their shiny points, and HTML
 368 Purifier is not meant to replace them.  However, a major reason for
 369 their existence has been called into question.  Why are <em>you</em>
 370 using these languages?</p>
 371
 372 <h2 id="Tidy">HTML Tidy</h2>
 373
 374 <p class="lead">Dave Raggett's
 375 <a href="http://www.w3.org/People/Raggett/tidy/">HTML Tidy</a> is a program;
 376 neat enough, at least, to make it into <acronym>PHP</acronym> as a
 377 <a href="http://us2.php.net/manual/en/ref.tidy.php"><acronym>PECL</acronym> extension.</a>
 378 The premise is simple, the execution effective. Tidy is, in short, a great
 379 <em>tool</em>.</p>
 380
 381 <p>It is not, however, a filter.  I am often surprised when people ask
 382 me, <q>What about Tidy?</q>  There's nothing against Tidy: Tidy tackles
 383 a different problem set.  Let's see what <tt>man tidy</tt> has to say:</p>
 384
 385 <blockquote cite="http://tidy.sourceforge.net/docs/tidy_man.html">
 386     Tidy reads <acronym>HTML</acronym>, <acronym>XHTML</acronym> and
 387     <acronym>XML</acronym> files and writes cleaned up markup. For
 388     <acronym>HTML</acronym> variants, it detects and corrects many common coding errors and
 389     strives to produce visually equivalent markup that is both <acronym>W3C</acronym> compliant
 390     and works on most browsers. A common use of Tidy is to convert plain <acronym>HTML</acronym>
 391     to <acronym>XHTML</acronym>.
 392 </blockquote>
 393
 394 <p>Hmm... why do I not see the words <q>filter</q> or
 395 <q><acronym>XSS</acronym></q> in here? Perhaps it's
 396 because Tidy accepts <em>any</em> valid
 397 <acronym>HTML</acronym>.  Including
 398 <tt>script</tt> tags.  Which leads us to our second part: Tidy parses
 399 <em>documents</em>, not document <em>fragments</em>.</p>
 400
 401 <p>This is not to say that I haven't seen Tidy be used in this sort of
 402 fashion.  MediaWiki, for instance, uses Tidy to cleanup the final <acronym>HTML</acronym>
 403 output before shuttling it off to the browser.  The developers, nevertheless,
 404 agree that this is only a band-aid solution, and that the real way
 405 to fix it is to fix the parser. Tidy's great, but in terms of security,
 406 it's not suitable for untrusted sources.</p>
 407
 408 <h2 id="Preface">Preface</h2>
 409
 410 <p>I've ordered my analyses according to how bad a library is.  The worst
 411 is first, and then we move up the spectrum.  I will point out the most
 412 flagrant problems with the libraries, but note that I will omit more
 413 advanced vulnerabilities: if you can't catch an <tt>onmouseover</tt>
 414 attribute, I really shouldn't reprimand you for letting non-<acronym>SGML</acronym> code
 415 points through.  The ideal solution, however, must do all these things.</p>
 416
 417 <p>Note that besides striptags,
 418 most of the libraries are moderately effective against the most common <acronym>XSS</acronym>
 419 attacks.  None of them (save Safe HTML Checker) fare very well
 420 in the standards-compliance department though.</p>
 421
 422 <h2 id="striptags">striptags()</h2>
 423
 424 <table class="summary">
 425     <tr><th>Whitelist</th>              <td class="impl-yes">Yes, user-specified</td></tr>
 426     <tr><th>Removes foreign tags</th>   <td class="impl-partial">Buggy</td></tr>
 427     <tr><th>Makes well-formed</th>      <td class="impl-no">No</td></tr>
 428     <tr><th>Fixes nesting</th>          <td class="impl-no">No</td></tr>
 429     <tr><th>Validates attributes</th>   <td class="impl-no">No</td></tr>
 430 </table>
 431
 432 <p class="lead">The <acronym>PHP</acronym> function
 433 <a href="http://php.net/manual/en/function.strip-tags.php">striptags()</a> is
 434 the classic solution for attempting to clean up
 435 <acronym>HTML</acronym>.  It
 436 is also the <em>worst</em> solution, and should be avoided like the plague.
 437 The fact that it doesn't validate attributes at all means that anyone can
 438 insert an <tt>onmouseover='xss();'</tt> and exploit your application.</p>
 439
 440 <p>While
 441 this can be bandaided with a series of regular expressions that strip out
 442 on[event] (you're still vulnerable to <acronym>XSS</acronym> and at the mercy of
 443 quirky browser behavior), striptags() is fundamentally flawed and should not be
 444 used.
 445 </p>
 446
 447 <h2 id="Input_Filter">PHP Input Filter</h2>
 448
 449 <p class="lead">Though its title may not imply it,
 450 <a href="http://www.phpclasses.org/browse/package/2189.html">PHP Input Filter</a>
 451 is a souped up version of striptags() with the ability to inspect
 452 attributes.  (Don't mind the hastily tacked on query escaping function).</p>
 453
 454 <table class="summary">
 455     <tr><th>Version</th>                <td class="impl-yes">1.2.2</td></tr>
 456     <tr><th>Last update</th>            <td class="impl-irrelevant">2005-10-05</td></tr>
 457     <tr><th>License</th>                <td class="impl-irrelevant">GPL</td></tr>
 458     <tr><th>Whitelist</th>              <td class="impl-yes">Yes, user defined</td></tr>
 459     <tr><th>Removes foreign tags</th>   <td class="impl-yes">Yes</td></tr>
 460     <tr><th>Makes well-formed</th>      <td class="impl-no">No</td></tr>
 461     <tr><th>Fixes nesting</th>          <td class="impl-no">No</td></tr>
 462     <tr><th>Validates attributes</th>   <td class="impl-partial">Partial</td></tr>
 463     <tr><th>XSS safe</th>               <td class="impl-almostyes">Probably</td></tr>
 464     <tr><th>Standards safe</th>         <td class="impl-no">No</td></tr>
 465 </table>
 466
 467 <p>PHP Input Filter implements an
 468 <acronym>HTML</acronym> parser, and
 469 performs very basic checks on whether or not tags and attributes have
 470 been defined in the whitelist as well as some
 471 smarter <acronym>XSS</acronym> checks.  It is left up to
 472 the user to define what they'll permit.</p>
 473
 474 <p>With absolutely no checking of well-formedness, it is trivially easy
 475 to trick the filter into leaving unclosed tags lying around. While to some
 476 standards-compliance may be viewed by some as a <q>nice feature</q>,
 477 basic sanity checks like this must be implemented, otherwise a user
 478 can mangle a website's layout.</p>
 479
 480 <p>More troubles: Woe to
 481 any user that allows the <tt>style</tt> attribute: you can't simply
 482 just let <acronym>CSS</acronym> through and expect your
 483 layout not to be badly mutilated. To top things off,
 484 the filter doesn't even preserve data properly: attributes have all
 485 spaces stripped out of them.  Stay away, stay away!</p>
 486
 487 <h2 id="HTML_Safe">HTML_Safe/SafeHTML</h2>
 488
 489 <p class="lead"><a href="http://pear.php.net/package/HTML_Safe">HTML_Safe</a> is
 490 <acronym>PEAR</acronym>'s <acronym>HTML</acronym> filtering library.
 491 It should be noted that this is the same library as
 492 <a href="http://pixel-apes.com/safehtml/">SafeHTML</a>, though with different
 493 branding (and a different version number).</p>
 494
 495 <table class="summary">
 496     <tr><th>Version</th>                <td class="impl-almostyes">0.9.9beta</td></tr>
 497     <tr><th>Last update</th>            <td class="impl-irrelevant">2005-12-21</td></tr>
 498     <tr><th>License</th>                <td class="impl-irrelevant">BSD (3 clause)</td></tr>
 499     <tr><th>Whitelist</th>              <td class="impl-no">Mostly No</td></tr>
 500     <tr><th>Removes foreign tags</th>   <td class="impl-yes">Yes</td></tr>
 501     <tr><th>Makes well-formed</th>      <td class="impl-yes">Yes</td></tr>
 502     <tr><th>Fixes nesting</th>          <td class="impl-no">No</td></tr>
 503     <tr><th>Validates attributes</th>   <td class="impl-partial">Partial</td></tr>
 504     <tr><th>XSS safe</th>               <td class="impl-almostyes">Probably</td></tr>
 505     <tr><th>Standards safe</th>         <td class="impl-no">No</td></tr>
 506 </table>
 507
 508 <p>HTML_Safe's mechanism of action involves parsing
 509 <acronym>HTML</acronym> with a
 510 <acronym>SAX</acronym> parser and performing
 511 validation and filtering as the handlers are called.  HTML_Safe does a lot
 512 of things right, which is why I say it <em>probably</em> isn't vulnerable
 513 to <acronym>XSS</acronym>, but its approach
 514 is fundamentally flawed: blacklists.</p>
 515
 516 <p>This library maintains arrays of dangerous tags, attributes and
 517 <acronym>CSS</acronym> properties.  (It also
 518 has a blacklist of dangerous <acronym>URI</acronym> protocols, but this is
 519 intelligently disabled by default in favor of a protocol whitelist.)
 520 What this means is that HTML_Safe has no qualms of accepting input
 521 like <tt>&lt;foobar&gt; Bang &lt;/foobar&gt;</tt>.  Anything goes except
 522 the tags in those arrays.  Scratch standards-compliance (and that was
 523 without even considering proper nesting).</p>
 524
 525 <p>For now, HTML_Safe might be safe from
 526 <acronym>XSS</acronym>.
 527 In the future, however, one of the infinitely many tags that HTML_Safe lets
 528 through might just possibly be given special functionality by browser vendors.
 529 And it might just turn out that this can be exploited.  <em>Any</em> blacklist
 530 solution puts you at a perpetual arms race against crackers who are constantly
 531 discovering new and inventive ways to abuse tags and attributes that you
 532 didn't blacklist.</p>
 533
 534 <h2 id="kses">kses</h2>
 535
 536 <p class="lead"><a href="http://sourceforge.net/projects/kses/">kses</a> appears to
 537 be the de-facto solution for cleaning  <acronym>HTML</acronym>, having found
 538 its way into applications such as <a href="http://wordpress.org/">WordPress</a>
 539 and being the number one search result for <q>php html filter</q>.</p>
 540
 541 <table class="summary">
 542     <tr><th>Version</th>                <td class="impl-partial">0.2.2</td></tr>
 543     <tr><th>Last update</th>            <td class="impl-irrelevant">2005-02-06</td></tr>
 544     <tr><th>License</th>                <td class="impl-irrelevant">GPL</td></tr>
 545     <tr><th>Whitelist</th>              <td class="impl-yes">Yes, user defined</td></tr>
 546     <tr><th>Removes foreign tags</th>   <td class="impl-yes">Yes</td></tr>
 547     <tr><th>Makes well-formed</th>      <td class="impl-no">No</td></tr>
 548     <tr><th>Fixes nesting</th>          <td class="impl-no">No</td></tr>
 549     <tr><th>Validates attributes</th>   <td class="impl-partial">Partial</td></tr>
 550     <tr><th>XSS safe</th>               <td class="impl-almostyes">Probably</td></tr>
 551     <tr><th>Standards safe</th>         <td class="impl-no">No</td></tr>
 552 </table>
 553
 554 <p>To be truthful, I didn't do as comprehensive a code survey for kses
 555 as I did for some of the other libraries.  Out of
 556 all the classes I've reviewed so far, kses was definitely the hardest to
 557 understand.</p>
 558
 559 <p>kses's modus operandi is splitting up html with a monster regexp
 560 and then validating each section with <tt>kses_split2()</tt>.  It
 561 suffers from the same problems as Input Filter: no well-formedness
 562 checks leading to rampant runaway tags (and no standards-compliance).
 563 WordPress, the primary user of kses today, had to implement their
 564 own custom tag-balancing code to fix this problem: don't use this
 565 library without some equivalent!</p>
 566
 567 <p>Its whitelist syntax, however, is the most complex of all these libraries,
 568 so I'm going to take some time to argue why this particular implementation
 569 is bad.  The author of this library was thoughtful enough to provide some
 570 basic constraint checks on attributes like maxlen and maxval.  Now, barring
 571 the fact that there simply aren't enough checks, and the fact that they are
 572 all lumped together in one function, we now must wonder whether or not
 573 the user will go through the trouble of specifying the maximum length
 574 of a title attribute.</p>
 575
 576 <p>I have my opinions about inherent human laziness, but perhaps WordPress's
 577 default filterset is the most telling example:</p>
 578
 579 <pre>
 580 $allowedposttags = array (
 581     /* formatted and trimmed */
 582     'hr' => array (
 583         'align' => array (),
 584         'noshade' => array (),
 585         'size' => array (),
 586         'width' => array ()
 587      )
 588 );
 589 </pre>
 590
 591 <p>Hmm... do I see a blatant lack of attribute constraints?  Conclusion:
 592 if the user can get away with not doing work, they will!  The biggest
 593 problem in all these whitelists filters is that they forgot to <em>supply</em>
 594 the whitelist.  The whitelist is just as important as the code that uses
 595 the whitelist to filter <acronym>HTML</acronym>.</p>
 596
 597 <h2 id="Safe_HTML_Checker">Safe HTML Checker</h2>
 598
 599 <p class="lead">
 600 <a href="http://simon.incutio.com/archive/2003/02/23/safeHtmlChecker">Safe
 601 HTML Checker</a> is (to my knowledge) the first attempt to make a filter
 602 that also outputs standards-compliant <acronym>XHTML</acronym>.  It wasn't even released or
 603 licensed officially, but we'll let that slide: a 4<sup>th</sup> place
 604 search result must have done something right.</p>
 605
 606 <table class="summary">
 607     <tr><th>Version</th>                <td class="impl-partial">in-house</td></tr>
 608     <tr><th>Last update</th>            <td class="impl-almostyes">2003-09-15</td></tr>
 609     <tr><th>License</th>                <td class="impl-no">undefined</td></tr>
 610     <tr><th>Whitelist</th>              <td class="impl-almostyes">Yes (bare-bones)</td></tr>
 611     <tr><th>Removes foreign tags</th>   <td class="impl-yes">Yes</td></tr>
 612     <tr><th>Makes well-formed</th>      <td class="impl-yes">Yes</td></tr>
 613     <tr><th>Fixes nesting</th>          <td class="impl-almostyes">Almost</td></tr>
 614     <tr><th>Validates attributes</th>   <td class="impl-partial">Partial</td></tr>
 615     <tr><th>XSS safe</th>               <td class="impl-yes">Yes</td></tr>
 616     <tr><th>Standards safe</th>         <td class="impl-almostyes">Almost</td></tr>
 617 </table>
 618
 619 <p>Indeed, it is quite a well-written piece of code.  It demonstrates
 620 knowledge of inline versus block elements, thus almost nearly getting
 621 nesting correct (the only exception is an unimplemented omitted SGML
 622 exclusion for <tt>&lt;a&gt;</tt> tags, and that's easy to fix).</p>
 623
 624 <p>Unfortunately, part of the reason why it works so well is that it's
 625 extremely restrictive.  No styling, no tables, very few attributes.
 626 Perfectly appropriate for blog comments, but then again, there's always
 627 BBCode.  This probably means that Safe HTML Checker has a different
 628 goal than HTML Purifier.</p>
 629
 630 <p>The <acronym>XML</acronym> parser
 631 is also quite strict.  Accidentally missed a &lt; sign? The parser will
 632 complain with the cryptic message:
 633 <q><acronym>XHTML</acronym>
 634 is not well-formed</q>.
 635 The solution is not as simple as just switching to a more permissive
 636 parser: Safe HTML Checker relies on the fact that the parser will have
 637 matched up the tags for them.</p>
 638
 639 <h2 id="HTMLPurifier">HTML Purifier</h2>
 640
 641 <table class="summary">
 642     <tr><th>Version</th>                <td class="impl-yes">1.4.1</td></tr>
 643     <tr><th>Last update</th>            <td class="impl-yes">2007-01-21</td></tr>
 644     <tr><th>License</th>                <td class="impl-irrelevant">LGPL</td></tr>
 645     <tr><th>Whitelist</th>              <td class="impl-yes">Yes</td></tr>
 646     <tr><th>Removes foreign tags</th>   <td class="impl-yes">Yes</td></tr>
 647     <tr><th>Makes well-formed</th>      <td class="impl-yes">Yes</td></tr>
 648     <tr><th>Fixes nesting</th>          <td class="impl-yes">Yes</td></tr>
 649     <tr><th>Validates attributes</th>   <td class="impl-yes">Yes</td></tr>
 650     <tr><th>XSS safe</th>               <td class="impl-yes">Yes</td></tr>
 651     <tr><th>Standards safe</th>         <td class="impl-yes">Yes</td></tr>
 652 </table>
 653
 654 <p class="lead">That table should say it all, but I'll add a few more features:</p>
 655
 656 <table class="summary">
 657     <tr><th>UTF-8 aware</th><td class="impl-yes">Yes</td></tr>
 658     <tr><th>Object-Oriented</th><td class="impl-yes">Yes</td></tr>
 659     <tr><th>Validates CSS</th><td class="impl-yes">Yes</td></tr>
 660     <tr><th>Tables</th><td class="impl-yes">Yes</td></tr>
 661     <tr><th>PHP 5 aware</th><td class="impl-yes">Yes</td></tr>
 662     <tr><th>E_STRICT compliant</th><td class="impl-yes">Yes (use -strict)</td></tr>
 663 </table>
 664
 665 <p>This is not to say that HTML Purifier doesn't have problems of its own.
 666 It's a fairly nascent library (that doesn't mean its buggy though), it's big
 667 (while the others usually fit in one file, this one requires a huge
 668 include list), and it's <a href="http://hp.jpsband.org/live/TODO">missing
 669 features.</a> But even in its current state,
 670 HTML Purifier is far better than the other libraries.</p>
 671
 672 <p>So... <a href="./#Download">what are you waiting for?</a></p>
 673
 674 </div>
 675 </body>
 676 </html>