comparison.xhtml

   1 <?xml version="1.0" encoding="UTF-8"?>
   2 <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
   3     "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
   4 <html xmlns="http://www.w3.org/1999/xhtml"
   5       xmlns:xi="http://www.w3.org/2001/XInclude"
   6       xmlns:xc="urn:xhtml-compiler"
   7       xmlns:svn="urn:xhtml-compiler:Subversion"
   8       svn:head-url="$HeadURL$"
   9       svn:revision="$Revision$"
  10       xc:rss-from-svn="yes"
  11       xml:lang="en" lang="en">
  12 <head>
  13 <title>Comparison - HTML Purifier</title>
  14 <xi:include href="common-meta.xml" xpointer="xpointer(/*/node())" />
  15 <link rel="stylesheet" href="comparison.css" type="text/css" />
  16 <meta name="keywords" content="HTMLPurifier, HTML Purifier, HTML, filter, filtering, HTML_Safe, PEAR, comparison, kses, striptags, SafeHTMLChecker" />
  17 </head>
  18 <body>
  19
  20 <xi:include href="common-header.xml" xpointer="xpointer(/*/node())" />
  21 <h1 id="title">Comparison</h1>
  22
  23 <div id="content">
  24
  25 <p>With the advent of
  26 <a href="http://en.wikipedia.org/wiki/Web_2.0">Web 2.0</a>, the end user has
  27 gone from passive consumer to active producer of content on the World Wide
  28 Web.  <a href="http://en.wikipedia.org/wiki/Wiki">Wikis</a>,
  29 <a href="http://en.wikipedia.org/wiki/Social_software">Social Software</a> and
  30 <a href="http://en.wikipedia.org/wiki/Blog">Blogs</a> all
  31 put the user in control.</p>
  32
  33 <p>Give the user too much control, however, and you set yourself up
  34 for <a href="http://en.wikipedia.org/wiki/Cross-site_scripting"><abbr>XSS</abbr></a> attacks.  For this reason,
  35 <abbr>HTML</abbr>'s flexibility
  36 has proven to be both a blessing and a curse, and the software that processes
  37 it must strike a fine balance between security and usability.  How do
  38 we prevent users from injecting JavaScript or inserting malformed
  39 <abbr>HTML</abbr> while allowing
  40 a rich syntax of tags, attributes and <abbr>CSS</abbr>? How do we put
   41 <abbr>HTML</abbr> inside an
  42 <abbr>RSS</abbr> feed without worrying
  43 about sloppy coding messing up <abbr>XML</abbr> parsing?
  44 Almost every <abbr>PHP</abbr>
  45 developer has come across this problem before, and many have tried
   46 (albeit unsuccessfully) to solve it.  We will analyze existing
  47 libraries to demonstrate how they are ineffective and, of course,
  48 how <strong>HTML Purifier</strong> solves all our problems and achieves
  49 standards-compliance.</p>
  50
   51 <p>I will give no quarter and pull no punches: as of the time of writing,
  52 no other library comes even <em>close</em> to solving the problem effectively
  53 for richly formatted documents.  But, nonetheless, there is a necessary
  54 disclaimer:</p>
  55
  56 <p class="disclaimer">
  57     This comparison document was written by the author of HTML Purifier,
  58     and clearly is <strong>in favor</strong> of HTML Purifier. However, that doesn't
  59     mean that it is biased: I have made every attempt to be <strong>factual and
  60     fair</strong>, and I hope that you will agree, by the time you finish reading
  61     this document, that HTML Purifier is the only satisfactory <abbr>HTML</abbr>
  62     filter out there today.
  63 </p>
  64
  65 <div id="toc" />
  66
  67 <h2 id="Summary">Summary</h2>
  68
  69 <p>A table summarizing the differences for the impatient.</p>
  70
  71 <div class="wide-table">
  72 <table cellspacing="0">
  73
  74 <thead>
  75     <tr>
  76         <th>Library</th>
  77         <th>Version</th>
  78         <th>Date</th>
  79         <th>License</th>
  80         <th>Whitelist</th>
  81         <th>Removal</th>
  82         <th>Well-formed</th>
  83         <th>Nesting</th>
  84         <th>Attributes</th>
  85         <th>XSS&nbsp;safe</th>
  86         <th>Standards&nbsp;safe</th>
  87     </tr>
  88 </thead>
  89
  90 <tbody>
  91
  92 <tr>
  93     <td>strip_tags()</td>
  94     <td>n/a</td>
  95     <td>n/a</td>
  96     <td>n/a</td>
  97     <td class="impl-almostyes">Yes (user)</td>
  98     <td class="impl-partial">Buggy</td>
  99     <td class="impl-no">No</td>
 100     <td class="impl-no">No</td>
 101     <td class="impl-no">No</td>
 102     <td class="impl-no">No</td>
 103     <td class="impl-no">No</td>
 104 </tr>
 105
 106 <tr>
 107     <td>PHP Input Filter</td>
 108     <td>1.2.2</td>
 109     <td>2005-10-05</td>
 110     <td>GPL</td>
 111     <td class="impl-almostyes">Yes (user)</td>
 112     <td class="impl-yes">Yes</td>
 113     <td class="impl-no">No</td>
 114     <td class="impl-no">No</td>
 115     <td class="impl-partial">Partial</td>
 116     <td class="impl-almostyes">Probably</td>
 117     <td class="impl-no">No</td>
 118 </tr>
 119
 120 <tr>
 121     <td>HTML_Safe</td>
 122     <td>0.9.9beta</td>
 123     <td>2005-12-21</td>
 124     <td>BSD (3)</td>
 125     <td class="impl-no">Mostly No</td>
 126     <td class="impl-yes">Yes</td>
 127     <td class="impl-yes">Yes</td>
 128     <td class="impl-no">No</td>
 129     <td class="impl-partial">Partial</td>
 130     <td class="impl-almostyes">Probably</td>
 131     <td class="impl-no">No</td>
 132 </tr>
 133
 134 <tr>
 135     <td>kses</td>
 136     <td>0.2.2</td>
 137     <td>2005-02-06</td>
 138     <td>GPL</td>
 139     <td class="impl-almostyes">Yes (user)</td>
 140     <td class="impl-yes">Yes</td>
 141     <td class="impl-no">No</td>
 142     <td class="impl-no">No</td>
 143     <td class="impl-partial">Partial</td>
 144     <td class="impl-almostyes">Probably</td>
 145     <td class="impl-no">No</td>
 146 </tr>
 147
 148 <tr>
 149     <td>Safe HTML Checker</td>
 150     <td>n/a</td>
 151     <td>2003-09-15</td>
 152     <td>n/a</td>
 153     <td class="impl-almostyes">Yes (bare)</td>
 154     <td class="impl-yes">Yes</td>
 155     <td class="impl-yes">Yes</td>
 156     <td class="impl-almostyes">Almost</td>
 157     <td class="impl-partial">Partial</td>
 158     <td class="impl-yes">Yes</td>
 159     <td class="impl-almostyes">Almost</td>
 160 </tr>
 161
 162 <tr>
 163     <td>HTML Purifier</td>
 164     <td>1.6.0</td>
 165     <td>2007-04-01</td>
 166     <td>LGPL</td>
 167     <td class="impl-yes">Yes</td>
 168     <td class="impl-yes">Yes</td>
 169     <td class="impl-yes">Yes</td>
 170     <td class="impl-yes">Yes</td>
 171     <td class="impl-yes">Yes</td>
 172     <td class="impl-yes">Yes</td>
 173     <td class="impl-yes">Yes</td>
 174 </tr>
 175
 176 </tbody>
 177
 178 </table>
 179 </div>
 180
 181 <p><a href="#Tidy">HTML Tidy</a> is omitted from this list because it is not an <abbr>HTML</abbr>
 182 filter.</p>
 183
 184 <h2 id="AltMarkup">Look Ma, No <abbr>HTML</abbr>!</h2>
 185
 186 <blockquote class="fancy">
 187     <div class="quote" style="text-align:center;">
 188         A clever person solves a problem.
 189         A wise person avoids it.
 190     </div>
 191     <div class="origin">&mdash; Albert Einstein</div>
 192 </blockquote>
 193
 194 <p>Before we jump into the weird and not-so-wonderful world
 195 of <abbr>HTML</abbr> filters, we must first consider another domain: non-<abbr>HTML</abbr>
 196 markup libraries. While libraries of this type really shouldn't be
 197 considered <abbr>HTML</abbr> filters,
 198 they are the number one method of taking user input and processing it into
 199 something more than plain old text.  These libraries forgo
 200 <abbr>HTML</abbr> and define their
 201 own markup syntax. <a href="http://en.wikipedia.org/wiki/BBCode">BBCode</a>,
 202 <a href="http://en.wikipedia.org/wiki/Wikitext">Wikitext</a>,
 203 <a href="http://daringfireball.net/projects/markdown/">Markdown</a> and
 204 <a href="http://textism.com/tools/textile/">Textile</a> are all examples of
 205 such markup languages (although it should be noted that
 206 Wikitext and Markdown can allow
 207 <abbr>HTML</abbr> within them).
 208 The benefits (to those who use them, anyway) are clear: simplicity and
 209 security.
 210 </p>
 211
 212 <table cellspacing="0">
 213     <thead>
 214         <tr>
 215             <th>Markup language</th>
 216             <th>Sample</th>
 217         </tr>
 218     </thead>
 219     <tbody>
 220         <tr>
 221             <th>BBCode</th>
 222             <td><tt>[b]B[/b] [i]i[/i] [url = http://www.example.com/]link[/url].</tt></td>
 223         </tr>
 224         <tr>
 225             <th>Wikitext<sup>1</sup></th>
 226             <td><tt>'''B''' ''i'' [http://www.example.com/ link]</tt></td>
 227         </tr>
 228         <tr>
 229             <th>Markdown<sup>2</sup></th>
 230             <td><tt>**B** *i* [link](http://www.example.com/)</tt></td>
 231         </tr>
 232         <tr>
 233             <th>Textile</th>
 234             <td><tt>*B* _i_ &quot;link&quot;:http://www.example.com/</tt></td>
 235         </tr>
 236         <tr>
 237             <th><abbr>HTML</abbr></th>
 238             <td><tt>&lt;b&gt;B&lt;/b&gt; &lt;i&gt;i&lt;/i&gt; &lt;a href=&quot;http://www.example.com/&quot;&gt;link&lt;/a&gt;</tt></td>
 239         </tr>
 240         <tr>
 241             <th><acronym>WYSIWYG</acronym></th>
 242             <td><b>B</b> <i>i</i> <a href="http://www.example.com/">link</a></td>
 243         </tr>
 244     </tbody>
 245 </table>
 246
 247 <ol class="notes">
 248     <li>Wikitext shown is modeled after <a
 249         href="http://www.mediawiki.org/wiki/MediaWiki">MediaWiki</a> style.
 250         There are many variants of Wikitext currently extant.</li>
 251     <li>Strictly speaking, the Markdown syntax is not equivalent: bold text
 252         is expressed as <code>&lt;strong&gt;</code> and italicized text is
 253         expressed as <code>&lt;em&gt;</code>. Most browser default stylesheets,
 254         however, map those two semantic tags to the associated styling, so
 255         many users assume that it really is italics (and use it improperly for,
 256         say, book titles.)</li>
 257 </ol>
 258
 259 <h3 id="AltMarkup:Simplicity">Simplicity</h3>
 260
 261 <p><abbr>HTML</abbr>
 262 source code is often criticized for being difficult to read. For example,
 263 compare:</p>
 264
 265 <pre>
 266 * Item 1
 267 * Item 2
 268 </pre>
 269
 270 <p>...versus:</p>
 271
 272 <pre>
 273 &lt;ul&gt;
 274     &lt;li&gt;Item 1&lt;/li&gt;
 275     &lt;li&gt;Item 2&lt;/li&gt;
 276 &lt;/ul&gt;
 277 </pre>
 278
 279 <p>Which would you prefer to edit? The answer seems obvious, but be careful
 280 not to fall into the fallacy of <a
 281 href="http://en.wikipedia.org/wiki/False_dilemma">false dilemma</a>.
 282 There <em>is</em> a third choice: the
 283 <acronym>WYSIWYG</acronym> (rich text)
 284 editor, which blows earlier choices out of the water in terms
 285 of usability.</p>
 286
 287 <p>Note that rich text editors and alternate markup syntaxes are not
 288 mutually exclusive, but, when push comes to shove, it's easier
 289 to implement this sort of editor on top of <abbr>HTML</abbr> than some obscure
 290 markup language.  And in the cases when it is done, you usually end up with
 291 a live preview, not a true rich text editor.</p>
 292
 293 <blockquote class="digression">
 294     <p><q>Now just wait a second,</q> you may be saying,
 295     <q><acronym>WYSIWYG</acronym>
 296     editors aren't all that great.</q>  There are many good arguments
 297     against these editors, and <a
 298     href="http://www.ideography.co.uk/library/seybold/WYSIWYG.html">intelligent
 299     people have written essays</a> devoted to
 300     criticizing <acronym>WYSIWYG</acronym>.
 301     In addition to the usual arguments against said editors, the web poses
 302     another limitation: no JavaScript means no
 303     editor, and no editor means... (gasp) manually typing in code.</p>
 304
 305     <p>Even the most dogmatic purist, however, should recognize that, for all
 306     their faults, rich text editors are what prospective clients <em>really</em> want.
 307     There are steps you can take to mitigate the associated drawbacks of
 308     these editors.</p>
 309
 310     <p>It is often asserted that
 311     <acronym>WYSIWYG</acronym> editors
 312     <em>encourage excessive presentational markup</em>. As it turns out,
 313     this is the case with any markup language that allows the smallest
 314     iota of presentational tags, be it <tt>&lt;font&gt;</tt> or
 315     <tt>[color=red]</tt>.
 316     A good way to reduce this trouble is to simply eliminate the
 317     dialogue boxes that allow users to change colors or fonts (which
 318     usually have no legitimate use) and adopt a
 319     <acronym>WYSIWYM</acronym> scheme,
 320     allowing users to select contextually correct formatting styles
 321     for segments of text.</p>
 322 </blockquote>
 323
 324 <p>Simplicity is also a double-edged sword.  The moment any remotely
 325 complex markup is needed, these lightweight markup languages fail to
 326 produce.  Sure, you can make '''this text bold''' with Wikitext, but that
 327 infobox all <q>rendered nicely in aqua blue</q> will require a gaggle of
 328 &lt;div&gt;s and <abbr>CSS</abbr>.
 329 These languages face the same troubles as regular <abbr>HTML</abbr>
 330 filters in that their whitelist is too restrictive (besides the fact that
 331 their table markup is extraordinarily complex).</p>
 332
 333 <h3 id="AltMarkup:Security">Security</h3>
 334
 335 <p>BBCode can be boiled down to a <q>wanna-be</q> version of
 336 <abbr>HTML</abbr>. I mean, replacing
 337 the angled brackets with square brackets and omitting the occasional parameter
 338 name? How much more unoriginal can you get? Somehow, I don't think BBCode
 339 was meant to be readable. <a
 340 href="http://en.wikipedia.org/wiki/BBCode">Wikipedia</a> agrees:</p>
 341
 342 <blockquote>
 343     BBCode was devised and put to use in order to provide a safer, easier
 344     and more limited way of allowing users to format their messages.
 345     Previously, many message boards allowed the users to include <abbr>HTML</abbr>,
 346     which could be used to break/imitate parts of the layout, or run
 347     JavaScript. Some implementations of BBCode have suffered problems related
 348     to the way they translate the BBCode into <abbr>HTML</abbr>, which could negate the
 349     security that was intended to be given by BBCode.
 350 </blockquote>
 351
 352 <p>Or, put more simply:</p>
 353
 354 <blockquote>
 355     BBCode came to life when developers were too
 356     lazy to parse <abbr>HTML</abbr> correctly
 357     and decided to invent their own markup language. As with all products of
 358     laziness, the result is completely inconsistent, unstandardized, and
 359     widely adopted.
 360 </blockquote>
 361
 362 <p>Well, developers, the whole point of HTML Purifier is that I do the
 363 work so you can just execute the ridiculously simple
 364 <tt>$purifier->purify($html)</tt> call and go on to do, well, whatever
 365 you developers do. <tt>:-P</tt></p>
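
<p>For the curious, here is roughly what that call looks like with its
surroundings.  Treat it as a minimal sketch: it assumes HTML Purifier is
installed and on your include path, and the name of the bootstrap file
(<tt>HTMLPurifier.auto.php</tt> here) is an assumption that depends on the
release you have.  The sample input is made up.</p>

<pre>
&lt;?php
// Assumption: the library directory is on the include path; the bootstrap
// file name may differ between releases.
require_once 'HTMLPurifier.auto.php';

// Start from the library's default configuration.
$config   = HTMLPurifier_Config::createDefault();
$purifier = new HTMLPurifier($config);

// Untrusted input carrying an event handler and a script tag.
$dirty = '&lt;b onmouseover="xss();"&gt;Bold&lt;/b&gt;&lt;script&gt;alert(1)&lt;/script&gt;';

// One call; the result is what you store or display.
echo $purifier->purify($dirty);
</pre>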
 366
 367 <h3 id="AltMarkup:Conclusion">Conclusion</h3>
 368
 369 <p>These alternative markup languages have their shiny points, and HTML
 370 Purifier is not meant to replace them.  However, a major reason for
 371 their existence has been called into question.  Why are <em>you</em>
 372 using these languages?</p>
 373
 374 <h2 id="Tidy">HTML Tidy</h2>
 375
 376 <p>Dave Raggett's
 377 <a href="http://www.w3.org/People/Raggett/tidy/">HTML Tidy</a> is a program
 378 neat enough to have made it into <abbr>PHP</abbr> as a
 379 <a href="http://us2.php.net/manual/en/ref.tidy.php"><abbr>PECL</abbr> extension</a>.
 380 The premise is simple, the execution effective. Tidy is, in short, a great
 381 <em>tool</em>.</p>
 382
 383 <p>It is not, however, a filter.  I am often surprised when people ask
 384 me, <q>What about Tidy?</q>  I have nothing against Tidy: it simply tackles
 385 a different problem set.  Let's see what <tt>man tidy</tt> has to say:</p>
 386
 387 <blockquote cite="http://tidy.sourceforge.net/docs/tidy_man.html">
 388     Tidy reads <abbr>HTML</abbr>, <abbr>XHTML</abbr> and
 389     <abbr>XML</abbr> files and writes cleaned up markup. For
 390     <abbr>HTML</abbr> variants, it detects and corrects many common coding errors and
 391     strives to produce visually equivalent markup that is both <abbr>W3C</abbr> compliant
 392     and works on most browsers. A common use of Tidy is to convert plain <abbr>HTML</abbr>
 393     to <abbr>XHTML</abbr>.
 394 </blockquote>
 395
 396 <p>Hmm... why do I not see the words <q>filter</q> or
 397 <q><abbr>XSS</abbr></q> in here? Perhaps it's
 398 because Tidy accepts <em>any</em> valid
 399 <abbr>HTML</abbr>.  Including
 400 <tt>script</tt> tags.  Which leads us to our second point: Tidy parses
 401 <em>documents</em>, not document <em>fragments</em>.</p>
 402
 403 <p>This is not to say that I haven't seen Tidy used in this sort of
 404 fashion.  MediaWiki, for instance, uses Tidy to clean up the final <abbr>HTML</abbr>
 405 output before shuttling it off to the browser.  The developers, nevertheless,
 406 agree that this is only a band-aid solution, and that the real way
 407 to fix it is to fix the parser. Tidy's great, but in terms of security,
 408 it's not suitable for untrusted sources.</p>
 409
 410 <h2 id="Preface">Preface</h2>
 411
 412 <p>I've ordered my analyses according to how bad a library is.  The worst
 413 is first, and then we move up the spectrum.  I will point out the most
 414 flagrant problems with the libraries, but note that I will omit more
 415 advanced vulnerabilities: if you can't catch an <tt>onmouseover</tt>
 416 attribute, I really shouldn't reprimand you for letting non-<abbr>SGML</abbr> code
 417 points through.  The ideal solution, however, must do all these things.</p>
 418
 419 <p>Note that, apart from <tt>strip_tags()</tt>,
 420 most of the libraries are moderately effective against the most common <abbr>XSS</abbr>
 421 attacks.  None of them (save Safe HTML Checker) fares very well
 422 in the standards-compliance department, though.</p>
 423
 424 <h2 id="striptags">strip_tags()</h2>
 425
 426 <table class="summary">
 427     <tr><th>Whitelist</th>              <td class="impl-yes">Yes, user-specified</td></tr>
 428     <tr><th>Removes foreign tags</th>   <td class="impl-partial">Buggy</td></tr>
 429     <tr><th>Makes well-formed</th>      <td class="impl-no">No</td></tr>
 430     <tr><th>Fixes nesting</th>          <td class="impl-no">No</td></tr>
 431     <tr><th>Validates attributes</th>   <td class="impl-no">No</td></tr>
 432 </table>
 433
 434 <p>The <abbr>PHP</abbr> function
 435 <a href="http://php.net/manual/en/function.strip-tags.php">strip_tags()</a> is
 436 the classic solution for attempting to clean up
 437 <abbr>HTML</abbr>.  It
 438 is also the <em>worst</em> solution, and should be avoided like the plague.
 439 The fact that it doesn't validate attributes at all means that anyone can
 440 insert an <tt>onmouseover='xss();'</tt> and exploit your application.</p>
 441
 442 <p>While
 443 this can be bandaged over with a series of regular expressions that strip out
 444 <tt>on[event]</tt> attributes, you remain vulnerable to <abbr>XSS</abbr> and at the mercy of
 445 quirky browser behavior: <tt>strip_tags()</tt> is fundamentally flawed and should not be
 446 used.
 447 </p>
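
<p>The attribute problem is easy to reproduce with nothing but the built-in
function.  A short sketch, with a made-up input string and a whitelist that
allows only <tt>&lt;b&gt;</tt>:</p>

<pre>
&lt;?php
// Untrusted input: the tag is on the whitelist, the event handler is not.
$dirty = '&lt;b onmouseover="xss();"&gt;Hello&lt;/b&gt;';

// The second argument is the user-specified whitelist of allowed tags.
$clean = strip_tags($dirty, '&lt;b&gt;');

// Allowed tags are kept as written, attributes and all.
var_dump($clean === $dirty); // bool(true)
</pre>

<p>The whitelist only ever looks at tag names, so anything riding along
inside an allowed tag goes straight through.</p>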
 448
 449 <h2 id="Input_Filter">PHP Input Filter</h2>
 450
 451 <p>Though its title may not imply it,
 452 <a href="http://www.phpclasses.org/browse/package/2189.html">PHP Input Filter</a>
 453 is a souped-up version of <tt>strip_tags()</tt> with the ability to inspect
 454 attributes.  (Don't mind the hastily tacked-on query escaping function.)</p>
 455
 456 <table class="summary">
 457     <tr><th>Version</th>                <td class="impl-yes">1.2.2</td></tr>
 458     <tr><th>Last update</th>            <td class="impl-irrelevant">2005-10-05</td></tr>
 459     <tr><th>License</th>                <td class="impl-irrelevant">GPL</td></tr>
 460     <tr><th>Whitelist</th>              <td class="impl-yes">Yes, user defined</td></tr>
 461     <tr><th>Removes foreign tags</th>   <td class="impl-yes">Yes</td></tr>
 462     <tr><th>Makes well-formed</th>      <td class="impl-no">No</td></tr>
 463     <tr><th>Fixes nesting</th>          <td class="impl-no">No</td></tr>
 464     <tr><th>Validates attributes</th>   <td class="impl-partial">Partial</td></tr>
 465     <tr><th>XSS safe</th>               <td class="impl-almostyes">Probably</td></tr>
 466     <tr><th>Standards safe</th>         <td class="impl-no">No</td></tr>
 467 </table>
 468
 469 <p>PHP Input Filter implements an
 470 <abbr>HTML</abbr> parser, and
 471 performs very basic checks on whether or not tags and attributes have
 472 been defined in the whitelist as well as some
 473 smarter <abbr>XSS</abbr> checks.  It is left up to
 474 the user to define what they'll permit.</p>
 475
 476 <p>With absolutely no checking of well-formedness, it is trivially easy
 477 to trick the filter into leaving unclosed tags lying around. While
 478 standards-compliance may be viewed by some as a <q>nice feature</q>,
 479 basic sanity checks like this must be implemented, otherwise a user
 480 can mangle a website's layout.</p>
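
<p>To give an idea of how little such a sanity check asks for, here is a
rough sketch of one.  The function <tt>unclosed_tags()</tt> is a hypothetical
helper of my own for illustration, not part of PHP Input Filter or any other
library discussed here, and it only detects runaway tags rather than fixing
them:</p>

<pre>
&lt;?php
// Hypothetical helper for illustration; not taken from any library.
// Returns the names of tags that were opened but never closed, outermost first.
function unclosed_tags($html)
{
    $stack = array();
    // Capture: optional leading slash, tag name, optional trailing slash.
    preg_match_all('#&lt;(/?)([a-zA-Z][a-zA-Z0-9]*)[^&gt;]*?(/?)&gt;#', $html, $tags, PREG_SET_ORDER);
    foreach ($tags as $tag) {
        $name = strtolower($tag[2]);
        if ($tag[3] === '/') {
            continue; // self-closed tag such as &lt;br /&gt;, nothing to balance
        }
        if ($tag[1] === '') {
            $stack[] = $name; // opening tag
            continue;
        }
        // Closing tag: unwind to the most recent matching opener, if any.
        $pos = array_search($name, array_reverse($stack, true), true);
        if ($pos !== false) {
            $stack = array_slice($stack, 0, $pos);
        }
    }
    return $stack;
}

var_dump(unclosed_tags('&lt;p&gt;All closed&lt;/p&gt;'));    // empty array
var_dump(unclosed_tags('&lt;div&gt;&lt;b&gt;Runaway bold')); // "div" and "b" are still open
</pre>

<p>A real filter would go one step further and close (or drop) whatever is
left on the stack instead of merely reporting it.</p>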
 481
 482 <p>More troubles: woe to
 483 any user who allows the <tt>style</tt> attribute: you can't simply
 484 let <abbr>CSS</abbr> through and expect your
 485 layout not to be badly mutilated. To top things off,
 486 the filter doesn't even preserve data properly: attributes have all
 487 spaces stripped out of them.  Stay away, stay away!</p>
 488
 489 <h2 id="HTML_Safe">HTML_Safe/SafeHTML</h2>
 490
 491 <p><a href="http://pear.php.net/package/HTML_Safe">HTML_Safe</a> is
 492 <acronym>PEAR</acronym>'s <abbr>HTML</abbr> filtering library.
 493 It should be noted that this is the same library as
 494 <a href="http://pixel-apes.com/safehtml/">SafeHTML</a>, though with different
 495 branding (and a different version number).</p>
 496
 497 <table class="summary">
 498     <tr><th>Version</th>                <td class="impl-almostyes">0.9.9beta</td></tr>
 499     <tr><th>Last update</th>            <td class="impl-irrelevant">2005-12-21</td></tr>
 500     <tr><th>License</th>                <td class="impl-irrelevant">BSD (3 clause)</td></tr>
 501     <tr><th>Whitelist</th>              <td class="impl-no">Mostly No</td></tr>
 502     <tr><th>Removes foreign tags</th>   <td class="impl-yes">Yes</td></tr>
 503     <tr><th>Makes well-formed</th>      <td class="impl-yes">Yes</td></tr>
 504     <tr><th>Fixes nesting</th>          <td class="impl-no">No</td></tr>
 505     <tr><th>Validates attributes</th>   <td class="impl-partial">Partial</td></tr>
 506     <tr><th>XSS safe</th>               <td class="impl-almostyes">Probably</td></tr>
 507     <tr><th>Standards safe</th>         <td class="impl-no">No</td></tr>
 508 </table>
 509
 510 <p>HTML_Safe's mechanism of action involves parsing
 511 <abbr>HTML</abbr> with a
 512 <acronym>SAX</acronym> parser and performing
 513 validation and filtering as the handlers are called.  HTML_Safe does a lot
 514 of things right, which is why I say it <em>probably</em> isn't vulnerable
 515 to <abbr>XSS</abbr>, but its approach
 516 is fundamentally flawed: blacklists.</p>
 517
 518 <p>This library maintains arrays of dangerous tags, attributes and
 519 <abbr>CSS</abbr> properties.  (It also
 520 has a blacklist of dangerous <abbr>URI</abbr> protocols, but this is
 521 intelligently disabled by default in favor of a protocol whitelist.)
 522 What this means is that HTML_Safe has no qualms about accepting input
 523 like <tt>&lt;foobar&gt; Bang &lt;/foobar&gt;</tt>.  Anything goes except
 524 the tags in those arrays.  Scratch standards-compliance (and that was
 525 without even considering proper nesting).</p>
 526
 527 <p>For now, HTML_Safe might be safe from
 528 <abbr>XSS</abbr>.
 529 In the future, however, one of the infinitely many tags that HTML_Safe lets
 530 through might just possibly be given special functionality by browser vendors.
 531 And it might just turn out that this can be exploited.  <em>Any</em> blacklist
 532 solution puts you at a perpetual arms race against crackers who are constantly
 533 discovering new and inventive ways to abuse tags and attributes that you
 534 didn't blacklist.</p>
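
<p>The difference between the two approaches fits in a few lines.  This is
only a sketch: the two lists and both function names are made up for
illustration and are not HTML_Safe's actual arrays or API.</p>

<pre>
&lt;?php
// Illustrative lists only; these are not HTML_Safe's actual arrays.
$blacklist = array('script', 'iframe', 'object', 'embed');
$whitelist = array('a', 'b', 'i', 'em', 'strong', 'p');

// Blacklist: everything is allowed unless it is known to be dangerous.
function allowed_by_blacklist($tag, $blacklist)
{
    return !in_array(strtolower($tag), $blacklist, true);
}

// Whitelist: everything is rejected unless it is known to be safe.
function allowed_by_whitelist($tag, $whitelist)
{
    return in_array(strtolower($tag), $whitelist, true);
}

var_dump(allowed_by_blacklist('foobar', $blacklist)); // bool(true): the unknown tag slips through
var_dump(allowed_by_whitelist('foobar', $whitelist)); // bool(false): the unknown tag is rejected
</pre>

<p>When a new dangerous tag turns up, the blacklist needs a patch before it
is safe again; the whitelist was never exposed in the first place.</p>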
 535
 536 <h2 id="kses">kses</h2>
 537
 538 <p><a href="http://sourceforge.net/projects/kses/">kses</a> appears to
 539 be the de facto solution for cleaning <abbr>HTML</abbr>, having found
 540 its way into applications such as <a href="http://wordpress.org/">WordPress</a>
 541 and being the number one search result for <q>php html filter</q>.</p>
 542
 543 <table class="summary">
 544     <tr><th>Version</th>                <td class="impl-partial">0.2.2</td></tr>
 545     <tr><th>Last update</th>            <td class="impl-irrelevant">2005-02-06</td></tr>
 546     <tr><th>License</th>                <td class="impl-irrelevant">GPL</td></tr>
 547     <tr><th>Whitelist</th>              <td class="impl-yes">Yes, user defined</td></tr>
 548     <tr><th>Removes foreign tags</th>   <td class="impl-yes">Yes</td></tr>
 549     <tr><th>Makes well-formed</th>      <td class="impl-no">No</td></tr>
 550     <tr><th>Fixes nesting</th>          <td class="impl-no">No</td></tr>
 551     <tr><th>Validates attributes</th>   <td class="impl-partial">Partial</td></tr>
 552     <tr><th>XSS safe</th>               <td class="impl-almostyes">Probably</td></tr>
 553     <tr><th>Standards safe</th>         <td class="impl-no">No</td></tr>
 554 </table>
 555
 556 <p>To be truthful, I didn't do as comprehensive a code survey for kses
 557 as I did for some of the other libraries.  Out of
 558 all the classes I've reviewed so far, kses was definitely the hardest to
 559 understand.</p>
 560
 561 <p>kses's modus operandi is splitting up <abbr>HTML</abbr> with a monster regexp
 562 and then validating each section with <tt>kses_split2()</tt>.  It
 563 suffers from the same problems as Input Filter: no well-formedness
 564 checks leading to rampant runaway tags (and no standards-compliance).
 565 WordPress, the primary user of kses today, had to implement its
 566 own custom tag-balancing code to fix this problem: don't use this
 567 library without some equivalent!</p>
 568
 569 <p>Its whitelist syntax, however, is the most complex of all these libraries,
 570 so I'm going to take some time to argue why this particular implementation
 571 is bad.  The author of this library was thoughtful enough to provide some
 572 basic constraint checks on attributes like maxlen and maxval.  Now, barring
 573 the fact that there simply aren't enough checks, and the fact that they are
 574 all lumped together in one function, we now must wonder whether or not
 575 the user will go through the trouble of specifying the maximum length
 576 of a title attribute.</p>
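
<p>For reference, a whitelist that actually bothers with those checks might
look something like the sketch below.  It is a hypothetical kses-style array:
<tt>maxlen</tt> and <tt>maxval</tt> are the checks named above, while the
choice of tags and every limit are numbers I made up.</p>

<pre>
&lt;?php
// Hypothetical kses-style whitelist; the limits are made-up values.
$allowed = array(
    'b' => array(),                           // tag allowed, no attributes
    'a' => array(
        'href'  => array('maxlen' => 100),    // cap the length of the URL
        'title' => array('maxlen' => 50),     // cap the length of the tooltip
    ),
    'font' => array(
        'size' => array('maxval' => 7),       // cap the numeric value
    ),
);
</pre>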
 577
 578 <p>I have my opinions about inherent human laziness, but perhaps WordPress's
 579 default filterset is the most telling example:</p>
 580
 581 <pre>
 582 $allowedposttags = array (
 583     /* formatted and trimmed */
 584     'hr' => array (
 585         'align' => array (),
 586         'noshade' => array (),
 587         'size' => array (),
 588         'width' => array ()
 589      )
 590 );
 591 </pre>
 592
 593 <p>Hmm... do I see a blatant lack of attribute constraints?  Conclusion:
 594 if the user can get away with not doing work, they will!  The biggest
 595 problem in all these whitelist filters is that they forget to <em>supply</em>
 596 the whitelist.  The whitelist is just as important as the code that uses
 597 the whitelist to filter <abbr>HTML</abbr>.</p>
 598
 599 <h2 id="Safe_HTML_Checker">Safe HTML Checker</h2>
 600
 601 <p>
 602 <a href="http://simon.incutio.com/archive/2003/02/23/safeHtmlChecker">Safe
 603 HTML Checker</a> is (to my knowledge) the first attempt to make a filter
 604 that also outputs standards-compliant <abbr>XHTML</abbr>.  It wasn't even released or
 605 licensed officially, but we'll let that slide: a 4<sup>th</sup> place
 606 search result must have done something right.</p>
 607
 608 <table class="summary">
 609     <tr><th>Version</th>                <td class="impl-partial">in-house</td></tr>
 610     <tr><th>Last update</th>            <td class="impl-almostyes">2003-09-15</td></tr>
 611     <tr><th>License</th>                <td class="impl-no">undefined</td></tr>
 612     <tr><th>Whitelist</th>              <td class="impl-almostyes">Yes (bare-bones)</td></tr>
 613     <tr><th>Removes foreign tags</th>   <td class="impl-yes">Yes</td></tr>
 614     <tr><th>Makes well-formed</th>      <td class="impl-yes">Yes</td></tr>
 615     <tr><th>Fixes nesting</th>          <td class="impl-almostyes">Almost</td></tr>
 616     <tr><th>Validates attributes</th>   <td class="impl-partial">Partial</td></tr>
 617     <tr><th>XSS safe</th>               <td class="impl-yes">Yes</td></tr>
 618     <tr><th>Standards safe</th>         <td class="impl-almostyes">Almost</td></tr>
 619 </table>
 620
 621 <p>Indeed, it is quite a well-written piece of code.  It demonstrates
 622 knowledge of inline versus block elements, thus very nearly getting
 623 nesting correct (the only exception is an unimplemented <abbr>SGML</abbr>
 624 exclusion for <tt>&lt;a&gt;</tt> tags, and that's easy to fix).</p>
 625
 626 <p>Unfortunately, part of the reason why it works so well is that it's
 627 extremely restrictive.  No styling, no tables, very few attributes.
 628 Perfectly appropriate for blog comments, but then again, there's always
 629 BBCode.  This probably means that Safe HTML Checker has a different
 630 goal than HTML Purifier.</p>
 631
 632 <p>The <abbr>XML</abbr> parser
 633 is also quite strict.  Accidentally missed a &lt; sign? The parser will
 634 complain with the cryptic message:
 635 <q><abbr>XHTML</abbr>
 636 is not well-formed</q>.
 637 The solution is not as simple as just switching to a more permissive
 638 parser: Safe HTML Checker relies on the fact that the parser will have
 639 matched up the tags for it.</p>
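
<p>You can get a feel for this strictness without the library.  The sketch
below uses <abbr>PHP</abbr>'s bundled <abbr>XML</abbr> extension as a
stand-in; I am assuming it behaves like the parser Safe HTML Checker relies
on, and <tt>is_well_formed()</tt> is a hypothetical helper, not part of that
library.</p>

<pre>
&lt;?php
// Hypothetical helper: returns true when $fragment parses as well-formed
// XML once wrapped in a single root element.
function is_well_formed($fragment)
{
    $parser = xml_parser_create('UTF-8');
    // xml_parse() returns 1 on success and 0 on the first well-formedness error.
    $ok = xml_parse($parser, '&lt;div&gt;' . $fragment . '&lt;/div&gt;', true);
    return $ok === 1;
}

var_dump(is_well_formed('&lt;p&gt;All tags closed&lt;/p&gt;')); // bool(true)
var_dump(is_well_formed('&lt;p&gt;Missing bracket/p&gt;'));   // bool(false)
</pre>

<p>One stray character and the whole fragment is refused, which is exactly
the all-or-nothing behavior described above.</p>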
 640
 641 <h2 id="HTMLPurifier">HTML Purifier</h2>
 642
 643 <table class="summary">
 644     <tr><th>Version</th>                <td class="impl-yes">1.6.0</td></tr>
 645     <tr><th>Last update</th>            <td class="impl-yes">2007-04-01</td></tr>
 646     <tr><th>License</th>                <td class="impl-irrelevant">LGPL</td></tr>
 647     <tr><th>Whitelist</th>              <td class="impl-yes">Yes</td></tr>
 648     <tr><th>Removes foreign tags</th>   <td class="impl-yes">Yes</td></tr>
 649     <tr><th>Makes well-formed</th>      <td class="impl-yes">Yes</td></tr>
 650     <tr><th>Fixes nesting</th>          <td class="impl-yes">Yes</td></tr>
 651     <tr><th>Validates attributes</th>   <td class="impl-yes">Yes</td></tr>
 652     <tr><th>XSS safe</th>               <td class="impl-yes">Yes</td></tr>
 653     <tr><th>Standards safe</th>         <td class="impl-yes">Yes</td></tr>
 654 </table>
 655
 656 <p>That table should say it all, but I'll add a few more features:</p>
 657
 658 <table class="summary">
 659     <tr><th>UTF-8 aware</th><td class="impl-yes">Yes</td></tr>
 660     <tr><th>Object-Oriented</th><td class="impl-yes">Yes</td></tr>
 661     <tr><th>Validates CSS</th><td class="impl-yes">Yes</td></tr>
 662     <tr><th>Tables</th><td class="impl-yes">Yes</td></tr>
 663     <tr><th>PHP 5 aware</th><td class="impl-yes">Yes</td></tr>
 664     <tr><th>E_STRICT compliant</th><td class="impl-yes">Yes (use -strict)</td></tr>
 665 </table>
 666
 667 <p>This is not to say that HTML Purifier doesn't have problems of its own.
 668 It's a fairly nascent library (that doesn't mean it's buggy, though), it's big
 669 (while the others usually fit in one file, this one requires a huge
 670 include list), and it's <a href="http://htmlpurifier.org/live/TODO">missing
 671 features.</a> But even in its current state,
 672 HTML Purifier is far better than the other libraries.</p>
 673
 674 <p>So... <a href="./#Download">what are you waiting for?</a></p>
 675
 676 </div>
 677 </body>
 678 </html>