1 <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
2 "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
3 <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
4 <head>
5 <title>Comparison - HTML Purifier</title>
6 <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
7 <meta name="keywords" content="HTMLPurifier, HTML Purifier, HTML, filter, filtering, HTML_Safe, PEAR, comparison, kses, striptags, SafeHTMLChecker" />
8 <meta name="author" content="Edward Z. Yang" />
9 <link rel="icon" href="./favicon.ico" type="image/x-icon" />
10 <link rel="shortcut icon" href="./favicon.ico" type="image/x-icon" />
11 <link rel="stylesheet" href="./style.css" type="text/css" />
12 <!--[if lt IE 7.]><script defer="defer" type="text/javascript" src="./pngfix.js"></script><![endif]-->
13 <script defer="defer" type="text/javascript" src="./toc-gen.js"></script>
14 </head>
15 <body>
17 <img src="./logo.png" id="logo" alt="HTML Purifier" />
19 <h1 id="title">Comparison</h1>
20 <div id="header"><a href="./"><span class="html">HTML</span> <span class="purifier">Purifier</span></a></div>
22 <div id="content">
24 <p class="lead">With the advent of
25 <a href="http://en.wikipedia.org/wiki/Web_2.0">Web 2.0</a>, the end user has
26 gone from passive consumer to active producer of content on the World Wide
27 Web. <a href="http://en.wikipedia.org/wiki/Wiki">Wikis</a>,
28 <a href="http://en.wikipedia.org/wiki/Social_software">Social Software</a> and
29 <a href="http://en.wikipedia.org/wiki/Blog">Blogs</a> all
30 put the user in control.</p>
32 <p>Give the user too much control, however, and you set yourself up
33 for <a href="http://en.wikipedia.org/wiki/Cross-site_scripting"><acronym
34 title="Cross Site Scripting">XSS</acronym></a> attacks. For this reason,
35 <acronym title="HyperText Markup Language">HTML</acronym>'s flexibility
36 has proven to be both a blessing and a curse, and the software that processes
37 it must strike a fine balance between security and usability. How do
38 we prevent users from injecting JavaScript or inserting malformed
39 <acronym title="HyperText Markup Language">HTML</acronym> while allowing
40 a rich syntax of tags, attributes and <acronym
41 title="Cascading Style Sheets">CSS</acronym>? How do we put
42 <acronym title="HyperText Markup Language">HTML</acronym> inside
an <acronym title="Really Simple Syndication">RSS</acronym> feed without worrying
44 about sloppy coding messing up <acronym
45 title="eXtensible Markup Language">XML</acronym> parsing?
46 Almost every <acronym title="PHP: Hypertext Preprocessor">PHP</acronym>
developer has come across this problem before, and many have tried
(albeit unsuccessfully) to solve it. We will analyze existing
49 libraries to demonstrate how they are ineffective and, of course,
50 how <strong>HTML Purifier</strong> solves all our problems and achieves
51 standards-compliance.</p>
<p>I will give no quarter and pull no punches: as of the time of writing,
54 no other library comes even <em>close</em> to solving the problem effectively
55 for richly formatted documents. But, nonetheless, there is a necessary
56 disclaimer:</p>
58 <p class="disclaimer">
59 This comparison document was written by the author of HTML Purifier,
60 and clearly is <strong>in favor</strong> of HTML Purifier. However, that doesn't
61 mean that it is biased: I have made every attempt to be <strong>factual and
62 fair</strong>, and I hope that you will agree, by the time you finish reading
63 this document, that HTML Purifier is the only satisfactory HTML
64 filter out there today.
65 </p>
67 <div id="toc"><noscript>
68 <p><strong>Notice:</strong> There is a Table of Contents, but it is dynamically
69 generated. Please enable JavaScript to see it.</p>
70 </noscript></div>
72 <h2 id="Summary">Summary</h2>
74 <p class="lead">A table summarizing the differences for the impatient.</p>
76 <div class="wide-table">
77 <table cellspacing="0">
79 <thead>
80 <tr>
81 <th>Library</th>
82 <th>Version</th>
83 <th>Date</th>
84 <th>License</th>
85 <th>Whitelist</th>
86 <th>Removal</th>
87 <th>Well-formed</th>
88 <th>Nesting</th>
89 <th>Attributes</th>
90 <th>XSS&nbsp;safe</th>
91 <th>Standards&nbsp;safe</th>
92 </tr>
93 </thead>
95 <tbody>
97 <tr>
<td>strip_tags()</td>
99 <td>n/a</td>
100 <td>n/a</td>
101 <td>n/a</td>
102 <td class="impl-almostyes">Yes (user)</td>
103 <td class="impl-partial">Buggy</td>
104 <td class="impl-no">No</td>
105 <td class="impl-no">No</td>
106 <td class="impl-no">No</td>
107 <td class="impl-no">No</td>
108 <td class="impl-no">No</td>
109 </tr>
111 <tr>
112 <td>PHP Input Filter</td>
113 <td>1.2.2</td>
114 <td>2005-10-05</td>
115 <td>GPL</td>
116 <td class="impl-almostyes">Yes (user)</td>
117 <td class="impl-yes">Yes</td>
118 <td class="impl-no">No</td>
119 <td class="impl-no">No</td>
120 <td class="impl-partial">Partial</td>
121 <td class="impl-almostyes">Probably</td>
122 <td class="impl-no">No</td>
123 </tr>
125 <tr>
126 <td>HTML_Safe</td>
127 <td>0.9.9beta</td>
128 <td>2005-12-21</td>
129 <td>BSD (3)</td>
130 <td class="impl-no">Mostly No</td>
131 <td class="impl-yes">Yes</td>
132 <td class="impl-yes">Yes</td>
133 <td class="impl-no">No</td>
134 <td class="impl-partial">Partial</td>
135 <td class="impl-almostyes">Probably</td>
136 <td class="impl-no">No</td>
137 </tr>
139 <tr>
140 <td>kses</td>
141 <td>0.2.2</td>
142 <td>2005-02-06</td>
143 <td>GPL</td>
144 <td class="impl-almostyes">Yes (user)</td>
145 <td class="impl-yes">Yes</td>
146 <td class="impl-no">No</td>
147 <td class="impl-no">No</td>
148 <td class="impl-partial">Partial</td>
149 <td class="impl-almostyes">Probably</td>
150 <td class="impl-no">No</td>
151 </tr>
153 <tr>
154 <td>Safe HTML Checker</td>
155 <td>n/a</td>
156 <td>2003-09-15</td>
157 <td>n/a</td>
158 <td class="impl-almostyes">Yes (bare)</td>
159 <td class="impl-yes">Yes</td>
160 <td class="impl-yes">Yes</td>
161 <td class="impl-almostyes">Almost</td>
162 <td class="impl-partial">Partial</td>
163 <td class="impl-yes">Yes</td>
164 <td class="impl-almostyes">Almost</td>
165 </tr>
167 <tr>
168 <td>HTML Purifier</td>
169 <td>1.4.1</td>
170 <td>2007-01-21</td>
171 <td>LGPL</td>
172 <td class="impl-yes">Yes</td>
173 <td class="impl-yes">Yes</td>
174 <td class="impl-yes">Yes</td>
175 <td class="impl-yes">Yes</td>
176 <td class="impl-yes">Yes</td>
177 <td class="impl-yes">Yes</td>
178 <td class="impl-yes">Yes</td>
179 </tr>
181 </tbody>
183 </table>
184 </div>
186 <p class="lead"><a href="#Tidy">HTML Tidy</a> is omitted from this list because it is not an HTML
187 filter.</p>
189 <h2 id="AltMarkup">Look Ma, No HTML!</h2>
191 <blockquote class="fancy">
192 <div class="quote" style="text-align:center;">
193 A clever person solves a problem.
194 A wise person avoids it.
195 </div>
196 <div class="origin">&mdash; Albert Einstein</div>
197 </blockquote>
199 <p class="lead">Before we jump into the weird and not-so-wonderful world
200 of HTML filters, we must first consider another domain: alternate
201 markup libraries. While libraries of this type really shouldn't be
202 considered <acronym title="HyperText Markup Language">HTML</acronym> filters,
203 they are the number one method of taking user input and processing it into
204 something more than plain old text. These libraries forgo
205 <acronym title="HyperText Markup Language">HTML</acronym> and define their
206 own markup syntax. <a href="http://en.wikipedia.org/wiki/BBCode">BBCode</a>,
207 <a href="http://en.wikipedia.org/wiki/Wikitext">Wikitext</a>,
208 <a href="http://daringfireball.net/projects/markdown/">Markdown</a> and
209 <a href="http://textism.com/tools/textile/">Textile</a> are all examples of
210 such markup languages (although it should be noted that
211 Wikitext and Markdown can allow
212 <acronym title="HyperText Markup Language">HTML</acronym> within them).
213 The benefits (to those who use it, anyway) are clear: simplicity and
214 security.
215 </p>
217 <table cellspacing="0">
218 <thead>
219 <tr>
220 <th>Markup language</th>
221 <th>Sample</th>
222 </tr>
223 </thead>
224 <tbody>
225 <tr>
226 <th>BBCode</th>
227 <td><tt>[b]B[/b] [i]i[/i] [url = http://www.example.com/]link[/url].</tt></td>
228 </tr>
229 <tr>
230 <th>Wikitext<sup>1</sup></th>
231 <td><tt>'''B''' ''i'' [http://www.example.com/ link]</tt></td>
232 </tr>
233 <tr>
234 <th>Markdown<sup>2</sup></th>
235 <td><tt>**B** *i* [link](http://www.example.com/)</tt></td>
236 </tr>
237 <tr>
238 <th>Textile</th>
239 <td><tt>*B* _i_ &quot;link&quot;:http://www.example.com/</tt></td>
240 </tr>
241 <tr>
242 <th>HTML</th>
243 <td><tt>&lt;b&gt;B&lt;/b&gt; &lt;i&gt;i&lt;/i&gt; &lt;a href=&quot;http://www.example.com/&quot;&gt;link&lt;/a&gt;</tt></td>
244 </tr>
245 <tr>
246 <th>WYSIWYG</th>
247 <td><b>B</b> <i>i</i> <a href="http://www.example.com/">link</a></td>
248 </tr>
249 </tbody>
250 </table>
252 <ol class="notes">
253 <li>Wikitext shown is modeled after <a
254 href="http://www.mediawiki.org/wiki/MediaWiki">MediaWiki</a> style.
255 There are many variants of Wikitext currently extant.</li>
256 <li>Strictly speaking, the Markdown syntax is not equivalent: bold text
257 is expressed as <code>&lt;strong&gt;</code> and italicized text is
258 expressed as <code>&lt;em&gt;</code>. Most browser default stylesheets,
259 however, map those two semantic tags to the associated styling, so
260 many users assume that it really is italics (and use it improperly for,
say, book titles).</li>
262 </ol>
264 <h3 id="AltMarkup:Simplicity">Simplicity</h3>
266 <p class="lead"><acronym title="HyperText Markup Language">HTML</acronym>
267 source code is often criticized for being difficult to read. For example,
268 compare:</p>
270 <pre>
271 * Item 1
272 * Item 2
273 </pre>
275 <p>...versus:</p>
277 <pre>
278 &lt;ul&gt;
279 &lt;li&gt;Item 1&lt;/li&gt;
280 &lt;li&gt;Item 2&lt;/li&gt;
281 &lt;/ul&gt;
282 </pre>
284 <p>Which would you prefer to edit? The answer seems obvious, but be careful
285 not to fall into the fallacy of <a
286 href="http://en.wikipedia.org/wiki/False_dilemma">false dilemma</a>.
287 There <em>is</em> a third choice: the
288 <acronym title="What You See Is What You Get">WYSIWYG</acronym> (rich text)
289 editor, which blows earlier choices out of the water in terms
290 of usability.</p>
292 <p>Note that rich text editors and alternate markup syntaxes are not
293 mutually exclusive, but, when push comes to shove, it's easier
294 implement this sort of editor on top of <acronym
295 title="HyperText Markup Language">HTML</acronym> than some obscure
296 markup language. And in the cases when it is done, you usually end up with
297 a live preview, not a true rich text editor.</p>
299 <blockquote class="digression">
300 <p>&quot;Now just wait a second,&quot; you may be saying,
301 &quot;<acronym title="What You See Is What You Get">WYSIWYG</acronym>
302 editors aren't all that great.&quot; There are many good arguments
303 against these editors, and <a
304 href="http://www.ideography.co.uk/library/seybold/WYSIWYG.html">intelligent
305 people have written essays</a> devoted to
306 criticizing <acronym title="What You See Is What You Get">WYSIWYG</acronym>.
307 In addition to the usual arguments against said editors, the web poses
308 another limitation: no JavaScript means no
309 editor, and no editor means... (gasp) manually typing in code.</p>
<p>Even the most dogmatic purist, however, should recognize that for all
their faults, prospective clients <em>really</em> want rich text editors.
313 There are steps you can take to mitigate the associated drawbacks of
314 these editors.</p>
316 <p>It is often asserted that
317 <acronym title="What You See Is What You Get">WYSIWYG</acronym> editors
318 <em>encourage excessive presentational markup</em>. As it turns out,
319 this is the case with any markup language that allows the smallest
320 iota of presentational tags, be it <tt>&lt;font&gt;</tt> or
321 <tt>[color=red]</tt>.
322 A good way to reduce this trouble is to simply eliminate the
323 dialogue boxes that allow users to change colors or fonts (which
324 usually have no legitimate use) and adopt a
325 <acronym title="What You See Is What You Mean">WYSIWYM</acronym> scheme,
326 allowing users to select contextually correct formatting styles
327 for segments of text.</p>
328 </blockquote>
<p>Simplicity is also a double-edged sword. The moment any remotely
complex markup is needed, these lightweight markup languages fail to
deliver. Sure, you can make '''this text bold''' with Wikitext, but that
333 infobox all &quot;rendered nicely in aqua blue&quot; will require a gaggle of
334 &lt;div&gt;s and <acronym title="Cascading Style Sheets">CSS</acronym>.
335 These languages face the same troubles as regular <acronym
336 title="HyperText Markup Language">HTML</acronym> filters in that their
337 whitelist is too restrictive (besides the fact that their table markup
338 is extraordinarily complex).</p>
340 <h3 id="AltMarkup:Security">Security</h3>
342 <p class="lead">BBCode can be boiled down to a &quot;wanna-be&quot; version of
343 <acronym title="HyperText Markup Language">HTML</acronym>. I mean, replacing
344 the angled brackets with square brackets and omitting the occasional parameter
name? How much more unoriginal can you get? Somehow, I don't think BBCode
was meant to be readable. <a
347 href="http://en.wikipedia.org/wiki/BBCode">Wikipedia</a> agrees:</p>
349 <blockquote>
350 BBCode was devised and put to use in order to provide a safer, easier
351 and more limited way of allowing users to format their messages.
352 Previously, many message boards allowed the users to include HTML,
353 which could be used to break/imitate parts of the layout, or run
354 JavaScript. Some implementations of BBCode have suffered problems related
355 to the way they translate the BBCode into HTML, which could negate the
356 security that was intended to be given by BBCode.
357 </blockquote>
359 <p>Or, put more simply:</p>
361 <blockquote>
BBCode came to life when developers were too lazy to parse HTML correctly
363 and decided to invent their own markup language. As with all products of
364 laziness, the result is completely inconsistent, unstandardized, and
365 widely adopted.
366 </blockquote>
368 <p>Well, developers, the whole point of HTML Purifier is that I do the
369 work so you can just execute the ridiculously simple
370 <tt>$purifier->purify($html)</tt> call and go on to do, well, whatever
371 you developers do. <tt>:-P</tt></p>
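<p>For reference, a minimal sketch of that call follows. The include path
is an assumption about where the library was unpacked, and the setup lines
use the <tt>HTMLPurifier_Config::createDefault()</tt> pattern from later
releases of the library, so adjust them to whatever version you have
installed. The sample input is made up.</p>

<pre>
&lt;?php
// Assumption: HTML Purifier is unpacked under ./library
require_once './library/HTMLPurifier.auto.php';

$config   = HTMLPurifier_Config::createDefault();
$purifier = new HTMLPurifier($config);

// Untrusted input with an event handler and an unclosed tag
$dirty = '&lt;b onmouseover="xss();"&gt;Hello';
$clean = $purifier->purify($dirty);

echo $clean;
</pre>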
373 <h3 id="AltMarkup:Conclusion">Conclusion</h3>
375 <p>These alternative markup languages have their shiny points, and HTML
376 Purifier is not meant to replace them. However, a major reason for
377 their existence has been called into question. Why are <em>you</em>
378 using these languages?</p>
380 <h2 id="Tidy">HTML Tidy</h2>
382 <p class="lead">Dave Raggett's
<a href="http://www.w3.org/People/Raggett/tidy/">HTML Tidy</a> is a program
neat enough to make it into PHP as a
385 <a href="http://us2.php.net/manual/en/ref.tidy.php">PECL extension.</a>
386 The premise is simple, the execution effective. Tidy is, in short, a great
387 <em>tool</em>.</p>
389 <p>It is not, however, a filter. I am often surprised when people ask
me, &quot;What about Tidy?&quot; I have nothing against Tidy: Tidy tackles
391 a different problem set. Let's see what <tt>man tidy</tt> has to say:</p>
393 <blockquote cite="http://tidy.sourceforge.net/docs/tidy_man.html">
394 Tidy reads HTML, XHTML and XML files and writes cleaned up markup. For
395 HTML variants, it detects and corrects many common coding errors and
396 strives to produce visually equivalent markup that is both W3C compliant
397 and works on most browsers. A common use of Tidy is to convert plain HTML
398 to XHTML.
399 </blockquote>
401 <p>Hmm... why do I not see the words &quot;filter&quot; or &quot;<acronym
402 title="Cross Site Scripting">XSS</acronym>&quot; in here? Perhaps it's
403 because Tidy accepts <em>any</em> valid
404 <acronym title="HyperText Markup Language">HTML</acronym>. Including
<tt>script</tt> tags. Which leads us to our second point: Tidy parses
406 <em>documents</em>, not document <em>fragments</em>.</p>
<p>This is not to say that I haven't seen Tidy used in this sort of
fashion. MediaWiki, for instance, uses Tidy to clean up the final HTML
410 output before shuttling it off to the browser. The developers, nevertheless,
411 agree that this is only a band-aid solution, and that the real way
412 to fix it is to fix the parser. Tidy's great, but in terms of security,
413 it's not suitable for untrusted sources.</p>
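<p>A small sketch of the distinction, assuming the PHP tidy extension is
loaded (the option names are Tidy's own, the sample input is made up):
Tidy repairs the markup, but it is not in the business of deciding what
is safe, so it is worth checking for yourself whether the script element
is still there afterwards.</p>

<pre>
&lt;?php
// Assumption: ext/tidy is available in this PHP build
$dirty = '&lt;p&gt;Hello &lt;b&gt;world&lt;script&gt;alert(1)&lt;/script&gt;';

$repaired = tidy_repair_string(
    $dirty,
    array('show-body-only' => true, 'output-xhtml' => true),
    'utf8'
);

// Repaired markup is not the same thing as filtered markup:
// report whether a script element survived the clean-up.
var_dump(strpos($repaired, '&lt;script') !== false);
</pre>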
415 <h2 id="Preface">Preface</h2>
417 <p>I've ordered my analyses according to how bad a library is. The worst
418 is first, and then we move up the spectrum. I will point out the most
419 flagrant problems with the libraries, but note that I will omit more
420 advanced vulnerabilities: if you can't catch an <tt>onmouseover</tt>
421 attribute, I really shouldn't reprimand you for letting non-SGML code
422 points through. The ideal solution, however, must do all these things.</p>
<p>Note that besides strip_tags(),
425 most of the libraries are moderately effective against the most common XSS
426 attacks. None of them (save Safe HTML Checker) fare very well
427 in the standards-compliance department though.</p>
<h2 id="striptags">strip_tags()</h2>
431 <table class="summary">
432 <tr><th>Whitelist</th> <td class="impl-yes">Yes, user-specified</td></tr>
433 <tr><th>Removes foreign tags</th> <td class="impl-partial">Buggy</td></tr>
434 <tr><th>Makes well-formed</th> <td class="impl-no">No</td></tr>
435 <tr><th>Fixes nesting</th> <td class="impl-no">No</td></tr>
436 <tr><th>Validates attributes</th> <td class="impl-no">No</td></tr>
437 </table>
439 <p class="lead">The PHP function
<a href="http://php.net/manual/en/function.strip-tags.php">strip_tags()</a> is
441 the classic solution for attempting to clean up
442 <acronym title="HyperText Markup Language">HTML</acronym>. It
443 is also the <em>worst</em> solution, and should be avoided like the plague.
444 The fact that it doesn't validate attributes at all means that anyone can
445 insert an <tt>onmouseover='xss();'</tt> and exploit your application.</p>
<p>While
this can be band-aided with a series of regular expressions that strip out
on[event] attributes, you would still be vulnerable to other <acronym
title="Cross Site Scripting">XSS</acronym> vectors and at the mercy of
quirky browser behavior. strip_tags() is fundamentally flawed and should not be
used.
453 </p>
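<p>To see the attribute problem first-hand, here is a short sketch that
uses nothing but the built-in function (the sample input is made up):</p>

<pre>
&lt;?php
// Allow only &lt;b&gt;, the way a cautious developer might
$dirty  = '&lt;b onmouseover="xss();"&gt;Hover me&lt;/b&gt; &lt;i&gt;gone&lt;/i&gt;';
$result = strip_tags($dirty, '&lt;b&gt;');

// The &lt;i&gt; tags are stripped, but the allowed &lt;b&gt; keeps every attribute
echo $result; // &lt;b onmouseover="xss();"&gt;Hover me&lt;/b&gt; gone
</pre>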
455 <h2 id="Input_Filter">PHP Input Filter</h2>
457 <p class="lead">Though its title may not imply it,
458 <a href="http://www.phpclasses.org/browse/package/2189.html">PHP Input Filter</a>
is a souped-up version of strip_tags() with the ability to inspect
attributes. (Don't mind the hastily tacked-on query escaping function.)</p>
462 <table class="summary">
463 <tr><th>Version</th> <td class="impl-yes">1.2.2</td></tr>
464 <tr><th>Last update</th> <td class="impl-irrelevant">2005-10-05</td></tr>
465 <tr><th>License</th> <td class="impl-irrelevant">GPL</td></tr>
466 <tr><th>Whitelist</th> <td class="impl-yes">Yes, user defined</td></tr>
467 <tr><th>Removes foreign tags</th> <td class="impl-yes">Yes</td></tr>
468 <tr><th>Makes well-formed</th> <td class="impl-no">No</td></tr>
469 <tr><th>Fixes nesting</th> <td class="impl-no">No</td></tr>
470 <tr><th>Validates attributes</th> <td class="impl-partial">Partial</td></tr>
471 <tr><th>XSS safe</th> <td class="impl-almostyes">Probably</td></tr>
472 <tr><th>Standards safe</th> <td class="impl-no">No</td></tr>
473 </table>
475 <p>PHP Input Filter implements an
476 <acronym title="HyperText Markup Language">HTML</acronym> parser, and
477 performs very basic checks on whether or not tags and attributes have
478 been defined in the whitelist as well as some smarter <acronym
479 title="Cross Site Scripting">XSS</acronym> checks. It is left up to
480 the user to define what they'll permit.</p>
482 <p>With absolutely no checking of well-formedness, it is trivially easy
to trick the filter into leaving unclosed tags lying around. While
standards-compliance may be viewed by some as a &quot;nice feature&quot;,
basic sanity checks like this must be implemented; otherwise, a user
can mangle a website's layout.</p>
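<p>As an illustration of the kind of sanity check that is missing, the
sketch below lists tags that were opened but never closed. The function
name <tt>unclosed_tags()</tt> and its regular expression are hypothetical
and far too naive for real use (they know nothing about comments, void
elements written without a trailing slash, or quirky attribute values);
a real filter needs a proper parser.</p>

<pre>
&lt;?php
// Hypothetical helper: returns the names of tags left open, in order.
function unclosed_tags(string $html): array
{
    $stack = array();
    // Captures: 1 = "/" for a closing tag, 2 = tag name, 3 = "/" if self-closed
    preg_match_all(
        '#&lt;(/?)([a-z][a-z0-9]*)\b[^&gt;]*?(/?)&gt;#i',
        $html,
        $matches,
        PREG_SET_ORDER
    );
    foreach ($matches as $match) {
        $name = strtolower($match[2]);
        if ($match[3] === '/') {
            continue;                 // self-closed, e.g. &lt;br /&gt;
        }
        if ($match[1] === '') {
            $stack[] = $name;         // opening tag
        } elseif (end($stack) === $name) {
            array_pop($stack);        // closing tag matching the innermost open tag
        }
    }
    return $stack;
}

print_r(unclosed_tags('&lt;b&gt;bold forever'));        // lists "b"
print_r(unclosed_tags('&lt;p&gt;&lt;b&gt;ok&lt;/b&gt;&lt;/p&gt;'));     // empty
</pre>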
<p>More troubles: woe to
any user that allows the <tt>style</tt> attribute: you can't simply
let <acronym
491 title="Cascading Style Sheets">CSS</acronym> through and expect your
492 layout not to be badly mutilated. To top things off,
493 the filter doesn't even preserve data properly: attributes have all
494 spaces stripped out of them. Stay away, stay away!</p>
496 <h2 id="HTML_Safe">HTML_Safe/SafeHTML</h2>
498 <p class="lead"><a href="http://pear.php.net/package/HTML_Safe">HTML_Safe</a> is
499 <acronym title="PHP Application and Extension Repository">PEAR</acronym>'s
500 <acronym title="HyperText Markup Language">HTML</acronym>
501 filtering library.
502 It should be noted that this is the same library as
503 <a href="http://pixel-apes.com/safehtml/">SafeHTML</a>, though with different
504 branding (and a different version number).</p>
506 <table class="summary">
507 <tr><th>Version</th> <td class="impl-almostyes">0.9.9beta</td></tr>
508 <tr><th>Last update</th> <td class="impl-irrelevant">2005-12-21</td></tr>
509 <tr><th>License</th> <td class="impl-irrelevant">BSD (3 clause)</td></tr>
510 <tr><th>Whitelist</th> <td class="impl-no">Mostly No</td></tr>
511 <tr><th>Removes foreign tags</th> <td class="impl-yes">Yes</td></tr>
512 <tr><th>Makes well-formed</th> <td class="impl-yes">Yes</td></tr>
513 <tr><th>Fixes nesting</th> <td class="impl-no">No</td></tr>
514 <tr><th>Validates attributes</th> <td class="impl-partial">Partial</td></tr>
515 <tr><th>XSS safe</th> <td class="impl-almostyes">Probably</td></tr>
516 <tr><th>Standards safe</th> <td class="impl-no">No</td></tr>
517 </table>
519 <p>HTML_Safe's mechanism of action involves parsing
520 <acronym title="HyperText Markup Language">HTML</acronym> with a
521 <acronym title="Simple API for XML">SAX</acronym> parser and performing
522 validation and filtering as the handlers are called. HTML_Safe does a lot
523 of things right, which is why I say it <em>probably</em> isn't vulnerable
524 to <acronym title="Cross Site Scripting">XSS</acronym>, but its approach
525 is fundamentally flawed: blacklists.</p>
527 <p>This library maintains arrays of dangerous tags, attributes and
528 <acronym title="Cascading Style Sheets">CSS</acronym> properties. (It also
529 has a blacklist of dangerous <acronym
530 title="Uniform Resource Identifier">URI</acronym> protocols, but this is
531 intelligently disabled by default in favor of a protocol whitelist.)
What this means is that HTML_Safe has no qualms about accepting input
533 like <tt>&lt;foobar&gt; Bang &lt;/foobar&gt;</tt>. Anything goes except
534 the tags in those arrays. Scratch standards-compliance (and that was
535 without even considering proper nesting).</p>
537 <p>For now, HTML_Safe might be safe from
538 <acronym title="Cross Site Scripting">XSS</acronym>.
539 In the future, however, one of the infinitely many tags that HTML_Safe lets
540 through might just possibly be given special functionality by browser vendors.
541 And it might just turn out that this can be exploited. <em>Any</em> blacklist
solution puts you in a perpetual arms race against crackers who are constantly
543 discovering new and inventive ways to abuse tags and attributes that you
544 didn't blacklist.</p>
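<p>The difference between the two approaches fits in a few lines. The
lists and function names below are hypothetical and deliberately tiny;
this is not HTML_Safe's code, just a sketch of why an unknown tag like
<tt>&lt;foobar&gt;</tt> sails through one check and not the other.</p>

<pre>
&lt;?php
// Hypothetical lists, for illustration only
$blacklist = array('script', 'iframe', 'object');
$whitelist = array('b', 'i', 'a', 'p');

function passes_blacklist(string $tag, array $blacklist): bool
{
    // Anything not explicitly forbidden gets through
    return !in_array(strtolower($tag), $blacklist, true);
}

function passes_whitelist(string $tag, array $whitelist): bool
{
    // Only explicitly permitted tags get through
    return in_array(strtolower($tag), $whitelist, true);
}

var_dump(passes_blacklist('foobar', $blacklist)); // bool(true)
var_dump(passes_whitelist('foobar', $whitelist)); // bool(false)
</pre>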
546 <h2 id="kses">kses</h2>
548 <p class="lead"><a href="http://sourceforge.net/projects/kses/">kses</a> appears to
be the de facto solution for cleaning
550 <acronym title="HyperText Markup Language">HTML</acronym>, having found
551 its way into applications such as <a href="http://wordpress.org/">WordPress</a>
552 and being the number one search result for &quot;php html filter&quot;.</p>
554 <table class="summary">
555 <tr><th>Version</th> <td class="impl-partial">0.2.2</td></tr>
556 <tr><th>Last update</th> <td class="impl-irrelevant">2005-02-06</td></tr>
557 <tr><th>License</th> <td class="impl-irrelevant">GPL</td></tr>
558 <tr><th>Whitelist</th> <td class="impl-yes">Yes, user defined</td></tr>
559 <tr><th>Removes foreign tags</th> <td class="impl-yes">Yes</td></tr>
560 <tr><th>Makes well-formed</th> <td class="impl-no">No</td></tr>
561 <tr><th>Fixes nesting</th> <td class="impl-no">No</td></tr>
562 <tr><th>Validates attributes</th> <td class="impl-partial">Partial</td></tr>
563 <tr><th>XSS safe</th> <td class="impl-almostyes">Probably</td></tr>
564 <tr><th>Standards safe</th> <td class="impl-no">No</td></tr>
565 </table>
567 <p>To be truthful, I didn't do as comprehensive a code survey for kses
568 as I did for some of the other libraries. Out of
569 all the classes I've reviewed so far, kses was definitely the hardest to
570 understand.</p>
<p>kses's modus operandi is splitting up HTML with a monster regexp
573 and then validating each section with <tt>kses_split2()</tt>. It
574 suffers from the same problems as Input Filter: no well-formedness
575 checks leading to rampant runaway tags (and no standards-compliance).
576 WordPress, the primary user of kses today, had to implement their
577 own custom tag-balancing code to fix this problem: don't use this
578 library without some equivalent!</p>
580 <p>Its whitelist syntax, however, is the most complex of all these libraries,
581 so I'm going to take some time to argue why this particular implementation
582 is bad. The author of this library was thoughtful enough to provide some
basic constraint checks on attributes like maxlen and maxval. Barring
the fact that there simply aren't enough checks, and the fact that they are
all lumped together in one function, we must still wonder whether or not
586 the user will go through the trouble of specifying the maximum length
587 of a title attribute.</p>
589 <p>I have my opinions about inherent human laziness, but perhaps WordPress's
590 default filterset is the most telling example:</p>
<pre>
$allowedposttags = array (
    /* formatted and trimmed */
    'hr' => array (
        'align' => array (),
        'noshade' => array (),
        'size' => array (),
        'width' => array ()
    )
);
</pre>
604 <p>Hmm... do I see a blatant lack of attribute constraints? Conclusion:
605 if the user can get away with not doing work, they will! The biggest
problem in all these whitelist filters is that they forgot to <em>supply</em>
607 the whitelist. The whitelist is just as important as the code that uses
608 the whitelist to filter
609 <acronym title="HyperText Markup Language">HTML</acronym>.</p>
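<p>For contrast, this is roughly what whitelist entries with constraints
look like in kses's array format. The tags and limits are made-up values
chosen for illustration; only the <tt>maxlen</tt> and <tt>maxval</tt>
keys come from kses itself.</p>

<pre>
&lt;?php
// Hypothetical whitelist: the numbers are arbitrary examples
$allowed = array(
    'a' => array(
        'href'  => array('maxlen' => 200),
        'title' => array('maxlen' => 80)
    ),
    'font' => array(
        'size' => array('maxval' => 7)
    )
);
</pre>

<p>Someone has to sit down and pick every one of those numbers, which is
exactly the work that tends not to get done.</p>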
611 <h2 id="Safe_HTML_Checker">Safe HTML Checker</h2>
613 <p class="lead">
614 <a href="http://simon.incutio.com/archive/2003/02/23/safeHtmlChecker">Safe
615 HTML Checker</a> is (to my knowledge) the first attempt to make a filter
616 that also outputs standards-compliant XHTML. It wasn't even released or
617 licensed officially, but we'll let that slide: a 4<sup>th</sup> place
618 search result must have done something right.</p>
620 <table class="summary">
621 <tr><th>Version</th> <td class="impl-partial">in-house</td></tr>
622 <tr><th>Last update</th> <td class="impl-almostyes">2003-09-15</td></tr>
623 <tr><th>License</th> <td class="impl-no">undefined</td></tr>
624 <tr><th>Whitelist</th> <td class="impl-almostyes">Yes (bare-bones)</td></tr>
625 <tr><th>Removes foreign tags</th> <td class="impl-yes">Yes</td></tr>
626 <tr><th>Makes well-formed</th> <td class="impl-yes">Yes</td></tr>
627 <tr><th>Fixes nesting</th> <td class="impl-almostyes">Almost</td></tr>
628 <tr><th>Validates attributes</th> <td class="impl-partial">Partial</td></tr>
629 <tr><th>XSS safe</th> <td class="impl-yes">Yes</td></tr>
630 <tr><th>Standards safe</th> <td class="impl-almostyes">Almost</td></tr>
631 </table>
633 <p>Indeed, it is quite a well-written piece of code. It demonstrates
knowledge of inline versus block elements, thus almost getting
nesting correct (the only exception is an unimplemented SGML
636 exclusion for <tt>&lt;a&gt;</tt> tags, and that's easy to fix).</p>
638 <p>Unfortunately, part of the reason why it works so well is that it's
639 extremely restrictive. No styling, no tables, very few attributes.
640 Perfectly appropriate for blog comments, but then again, there's always
641 BBCode. This probably means that Safe HTML Checker has a different
642 goal than HTML Purifier.</p>
644 <p>The <acronym title="eXtensible Markup Language">XML</acronym> parser
645 is also quite strict. Accidentally missed a &lt; sign? The parser will
646 complain with the cryptic message:
647 &quot;<acronym title="eXtensible HyperText Markup Language">XHTML</acronym>
648 is not well-formed&quot;.
649 The solution is not as simple as just switching to a more permissive
650 parser: Safe HTML Checker relies on the fact that the parser will have
matched up the tags for it.</p>
653 <h2 id="HTMLPurifier">HTML Purifier</h2>
655 <table class="summary">
656 <tr><th>Version</th> <td class="impl-yes">1.4.1</td></tr>
657 <tr><th>Last update</th> <td class="impl-yes">2007-01-21</td></tr>
658 <tr><th>License</th> <td class="impl-irrelevant">LGPL</td></tr>
659 <tr><th>Whitelist</th> <td class="impl-yes">Yes</td></tr>
660 <tr><th>Removes foreign tags</th> <td class="impl-yes">Yes</td></tr>
661 <tr><th>Makes well-formed</th> <td class="impl-yes">Yes</td></tr>
662 <tr><th>Fixes nesting</th> <td class="impl-yes">Yes</td></tr>
663 <tr><th>Validates attributes</th> <td class="impl-yes">Yes</td></tr>
664 <tr><th>XSS safe</th> <td class="impl-yes">Yes</td></tr>
665 <tr><th>Standards safe</th> <td class="impl-yes">Yes</td></tr>
666 </table>
668 <p class="lead">That table should say it all, but I'll add a few more features:</p>
670 <table class="summary">
671 <tr><th>UTF-8 aware</th><td class="impl-yes">Yes</td></tr>
672 <tr><th>Object-Oriented</th><td class="impl-yes">Yes</td></tr>
673 <tr><th>Validates CSS</th><td class="impl-yes">Yes</td></tr>
674 <tr><th>Tables</th><td class="impl-yes">Yes</td></tr>
675 <tr><th>PHP 5 aware</th><td class="impl-yes">Yes</td></tr>
676 <tr><th>E_STRICT compliant</th><td class="impl-yes">Yes (use -strict)</td></tr>
677 </table>
679 <p>This is not to say that HTML Purifier doesn't have problems of its own.
It's a fairly nascent library (that doesn't mean it's buggy though), it's big
681 (while the others usually fit in one file, this one requires a huge
682 include list), and it's <a href="http://hp.jpsband.org/live/TODO">missing
683 features.</a> But even in its current state,
684 HTML Purifier is far better than the other libraries.</p>
686 <p>So... <a href="./#Download">what are you waiting for?</a></p>
688 </div>
689 </body>
690 </html>