1 <?xml version="1.0" encoding="UTF-8"?>
2 <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
3 "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd" [
4 <!ENTITY % htmlpurifier.current SYSTEM "current.ent"> %htmlpurifier.current;
5 ]>
6 <html xmlns="http://www.w3.org/1999/xhtml"
7 xmlns:xi="http://www.w3.org/2001/XInclude"
8 xmlns:xc="urn:xhtml-compiler"
9 xmlns:svn="urn:xhtml-compiler:Subversion"
10 svn:head-url="$HeadURL$"
11 svn:revision="$Revision$"
12 xc:rss-from-svn="yes"
13 xml:lang="en">
14 <head>
15 <title>Comparison - HTML Purifier</title>
16 <xi:include href="common-meta.xml" xpointer="xpointer(/*/node())" />
17 <meta name="keywords" content="HTMLPurifier, HTML Purifier, HTML, filter, filtering, HTML_Safe, PEAR, comparison, kses, striptags, SafeHTMLChecker" />
18 </head>
19 <body>
21 <xi:include href="common-header.xml" xpointer="xpointer(/*/node())" />
22 <h1 id="title">Comparison</h1>
24 <div id="content">
26 <p>With the advent of
27 <a href="http://en.wikipedia.org/wiki/Web_2.0">Web 2.0</a>, the end user has
28 gone from passive consumer to active producer of content on the World Wide
29 Web. <a href="http://en.wikipedia.org/wiki/Wiki">Wikis</a>,
30 <a href="http://en.wikipedia.org/wiki/Social_software">Social Software</a> and
31 <a href="http://en.wikipedia.org/wiki/Blog">Blogs</a> all
32 put the user in control.</p>
34 <p>Give the user too much control, however, and you set yourself up
35 for <a href="http://en.wikipedia.org/wiki/Cross-site_scripting"><abbr>XSS</abbr></a> attacks. For this reason,
36 <abbr>HTML</abbr>'s flexibility
37 has proven to be both a blessing and a curse, and the software that processes
38 it must strike a fine balance between security and usability. How do
39 we prevent users from injecting JavaScript or inserting malformed
40 <abbr>HTML</abbr> while allowing
41 a rich syntax of tags, attributes and <abbr>CSS</abbr>? How do we put
42 <abbr>HTML</abbr> inside
an <abbr>RSS</abbr> feed without worrying
44 about sloppy coding messing up <abbr>XML</abbr> parsing?
45 Almost every <abbr>PHP</abbr>
developer has come across this problem before, and many have tried
(albeit unsuccessfully) to solve it. We will analyze existing
48 libraries to demonstrate how they are ineffective and, of course,
49 how <strong>HTML Purifier</strong> solves all our problems and achieves
50 standards-compliance.</p>
<p>I will give no quarter and pull no punches: as of the time of writing,
53 no other library comes even <em>close</em> to solving the problem effectively
54 for richly formatted documents. But, nonetheless, there is a necessary
55 disclaimer:</p>
57 <div class="disclaimer">
58 <p>
59 This comparison document was written by the author of HTML Purifier,
60 and clearly is <strong>in favor</strong> of HTML Purifier. However, that doesn't
61 mean that it is biased: I have made every attempt to be <strong>factual and
62 fair</strong>, and I hope that you will agree, by the time you finish reading
63 this document, that HTML Purifier is the only satisfactory <abbr>HTML</abbr>
64 filter out there today.
65 </p>
66 </div>
68 <div id="toc" />
70 <h2 id="Summary">Summary</h2>
72 <p>A table summarizing the differences for the impatient.</p>
74 <div class="wide-table">
75 <table cellspacing="0">
77 <thead>
78 <tr>
79 <th>Library</th>
80 <th>Version</th>
81 <th>Date</th>
82 <th>License</th>
83 <th>Whitelist</th>
84 <th>Removal</th>
85 <th>Well-formed</th>
86 <th>Nesting</th>
87 <th>Attributes</th>
88 <th>XSS&nbsp;safe</th>
89 <th>Standards&nbsp;safe</th>
90 </tr>
91 </thead>
93 <tbody>
95 <tr>
96 <td>striptags</td>
97 <td>n/a</td>
98 <td>n/a</td>
99 <td>n/a</td>
100 <td class="impl-almostyes">Yes (user)</td>
101 <td class="impl-partial">Buggy</td>
102 <td class="impl-no">No</td>
103 <td class="impl-no">No</td>
104 <td class="impl-no">No</td>
105 <td class="impl-no">No</td>
106 <td class="impl-no">No</td>
107 </tr>
109 <tr>
110 <td>PHP Input Filter</td>
111 <td>1.2.2</td>
112 <td>2005-10-05</td>
113 <td>GPL</td>
114 <td class="impl-almostyes">Yes (user)</td>
115 <td class="impl-yes">Yes</td>
116 <td class="impl-no">No</td>
117 <td class="impl-no">No</td>
118 <td class="impl-partial">Partial</td>
119 <td class="impl-almostyes">Probably</td>
120 <td class="impl-no">No</td>
121 </tr>
123 <tr>
124 <td>HTML_Safe</td>
125 <td>0.9.9beta</td>
126 <td>2005-12-21</td>
127 <td>BSD (3)</td>
128 <td class="impl-no">Mostly No</td>
129 <td class="impl-yes">Yes</td>
130 <td class="impl-yes">Yes</td>
131 <td class="impl-no">No</td>
132 <td class="impl-partial">Partial</td>
133 <td class="impl-almostyes">Probably</td>
134 <td class="impl-no">No</td>
135 </tr>
137 <tr>
138 <td>kses</td>
139 <td>0.2.2</td>
140 <td>2005-02-06</td>
141 <td>GPL</td>
142 <td class="impl-almostyes">Yes (user)</td>
143 <td class="impl-yes">Yes</td>
144 <td class="impl-no">No</td>
145 <td class="impl-no">No</td>
146 <td class="impl-partial">Partial</td>
147 <td class="impl-almostyes">Probably</td>
148 <td class="impl-no">No</td>
149 </tr>
151 <tr>
152 <td>Safe HTML Checker</td>
153 <td>n/a</td>
154 <td>2003-09-15</td>
155 <td>n/a</td>
156 <td class="impl-almostyes">Yes (bare)</td>
157 <td class="impl-yes">Yes</td>
158 <td class="impl-yes">Yes</td>
159 <td class="impl-almostyes">Almost</td>
160 <td class="impl-partial">Partial</td>
161 <td class="impl-yes">Yes</td>
162 <td class="impl-almostyes">Almost</td>
163 </tr>
165 <tr>
166 <td>HTML Purifier</td>
167 <td>&htmlpurifier.current.version;</td>
168 <td>&htmlpurifier.current.release-date;</td>
169 <td>LGPL</td>
170 <td class="impl-yes">Yes</td>
171 <td class="impl-yes">Yes</td>
172 <td class="impl-yes">Yes</td>
173 <td class="impl-yes">Yes</td>
174 <td class="impl-yes">Yes</td>
175 <td class="impl-yes">Yes</td>
176 <td class="impl-yes">Yes</td>
177 </tr>
179 </tbody>
181 </table>
182 </div>
184 <p><a href="#Tidy">HTML Tidy</a> is omitted from this list because it is not an <abbr>HTML</abbr>
185 filter.</p>
187 <h2 id="AltMarkup">Look Ma, No <abbr>HTML</abbr>!</h2>
189 <blockquote class="fancy">
190 <div class="quote" style="text-align:center;">
191 A clever person solves a problem.
192 A wise person avoids it.
193 </div>
194 <div class="origin">&mdash; Albert Einstein</div>
195 </blockquote>
197 <p>Before we jump into the weird and not-so-wonderful world
198 of <abbr>HTML</abbr> filters, we must first consider another domain: non-<abbr>HTML</abbr>
199 markup libraries. While libraries of this type really shouldn't be
200 considered <abbr>HTML</abbr> filters,
201 they are the number one method of taking user input and processing it into
202 something more than plain old text. These libraries forgo
203 <abbr>HTML</abbr> and define their
204 own markup syntax. <a href="http://en.wikipedia.org/wiki/BBCode">BBCode</a>,
205 <a href="http://en.wikipedia.org/wiki/Wikitext">Wikitext</a>,
206 <a href="http://daringfireball.net/projects/markdown/">Markdown</a> and
207 <a href="http://textism.com/tools/textile/">Textile</a> are all examples of
208 such markup languages (although it should be noted that
209 Wikitext and Markdown can allow
210 <abbr>HTML</abbr> within them).
211 The benefits (to those who use it, anyway) are clear: simplicity and
212 security.
213 </p>
215 <table cellspacing="0">
216 <thead>
217 <tr>
218 <th>Markup language</th>
219 <th>Sample</th>
220 </tr>
221 </thead>
222 <tbody>
223 <tr>
224 <th>BBCode</th>
225 <td><tt>[b]B[/b] [i]i[/i] [url = http://www.example.com/]link[/url].</tt></td>
226 </tr>
227 <tr>
228 <th>Wikitext<sup>1</sup></th>
229 <td><tt>'''B''' ''i'' [http://www.example.com/ link]</tt></td>
230 </tr>
231 <tr>
232 <th>Markdown<sup>2</sup></th>
233 <td><tt>**B** *i* [link](http://www.example.com/)</tt></td>
234 </tr>
235 <tr>
236 <th>Textile</th>
237 <td><tt>*B* _i_ &quot;link&quot;:http://www.example.com/</tt></td>
238 </tr>
239 <tr>
240 <th><abbr>HTML</abbr></th>
241 <td><tt>&lt;b&gt;B&lt;/b&gt; &lt;i&gt;i&lt;/i&gt; &lt;a href=&quot;http://www.example.com/&quot;&gt;link&lt;/a&gt;</tt></td>
242 </tr>
243 <tr>
244 <th><acronym>WYSIWYG</acronym></th>
245 <td><b>B</b> <i>i</i> <a href="http://www.example.com/">link</a></td>
246 </tr>
247 </tbody>
248 </table>
250 <ol class="notes">
251 <li>Wikitext shown is modeled after <a
252 href="http://www.mediawiki.org/wiki/MediaWiki">MediaWiki</a> style.
253 There are many variants of Wikitext currently extant.</li>
254 <li>Strictly speaking, the Markdown syntax is not equivalent: bold text
255 is expressed as <code>&lt;strong&gt;</code> and italicized text is
256 expressed as <code>&lt;em&gt;</code>. Most browser default stylesheets,
257 however, map those two semantic tags to the associated styling, so
many users assume that it really is italics (and use it improperly for,
say, book titles).</li>
260 </ol>
262 <h3 id="AltMarkup:Simplicity">Simplicity</h3>
264 <p><abbr>HTML</abbr>
265 source code is often criticized for being difficult to read. For example,
266 compare:</p>
268 <pre>
269 * Item 1
270 * Item 2
271 </pre>
273 <p>...versus:</p>
275 <pre>
276 &lt;ul&gt;
277 &lt;li&gt;Item 1&lt;/li&gt;
278 &lt;li&gt;Item 2&lt;/li&gt;
279 &lt;/ul&gt;
280 </pre>
282 <p>Which would you prefer to edit? The answer seems obvious, but be careful
283 not to fall into the fallacy of <a
284 href="http://en.wikipedia.org/wiki/False_dilemma">false dilemma</a>.
285 There <em>is</em> a third choice: the
286 <acronym>WYSIWYG</acronym> (rich text)
287 editor, which blows earlier choices out of the water in terms
288 of usability.</p>
290 <p>Note that rich text editors and alternate markup syntaxes are not
mutually exclusive, but, when push comes to shove, it's easier to
implement this sort of editor on top of <abbr>HTML</abbr> than some obscure
293 markup language. And in the cases when it is done, you usually end up with
294 a live preview, not a true rich text editor.</p>
296 <blockquote class="digression">
297 <p><q>Now just wait a second,</q> you may be saying,
298 <q><acronym>WYSIWYG</acronym>
299 editors aren't all that great.</q> There are many good arguments
300 against these editors, and <a
301 href="http://www.ideography.co.uk/library/seybold/WYSIWYG.html">intelligent
302 people have written essays</a> devoted to
303 criticizing <acronym>WYSIWYG</acronym>.
304 In addition to the usual arguments against said editors, the web poses
305 another limitation: no JavaScript means no
306 editor, and no editor means... (gasp) manually typing in code.</p>
308 <p>Even the most dogmatic purist, however, should recognize that for all
309 its faults, prospective clients <em>really</em> want rich text editors.
310 There are steps you can take to mitigate the associated drawbacks of
311 these editors.</p>
313 <p>It is often asserted that
314 <acronym>WYSIWYG</acronym> editors
315 <em>encourage excessive presentational markup</em>. As it turns out,
316 this is the case with any markup language that allows the smallest
317 iota of presentational tags, be it <tt>&lt;font&gt;</tt> or
318 <tt>[color=red]</tt>.
319 A good way to reduce this trouble is to simply eliminate the
320 dialogue boxes that allow users to change colors or fonts (which
321 usually have no legitimate use) and adopt a
322 <acronym>WYSIWYM</acronym> scheme,
323 allowing users to select contextually correct formatting styles
324 for segments of text.</p>
325 </blockquote>
327 <p>Simplicity is also a double-edged sword. The moment any remotely
328 complex markup is needed, these lightweight markup languages fail to
produce. Sure, you can make '''this text bold''' with Wikitext, but that
330 infobox all <q>rendered nicely in aqua blue</q> will require a gaggle of
331 &lt;div&gt;s and <abbr>CSS</abbr>.
332 These languages face the same troubles as regular <abbr>HTML</abbr>
333 filters in that their whitelist is too restrictive (besides the fact that
334 their table markup is extraordinarily complex).</p>
336 <h3 id="AltMarkup:Security">Security</h3>
338 <p>BBCode can be boiled down to a <q>wanna-be</q> version of
339 <abbr>HTML</abbr>. I mean, replacing
340 the angled brackets with square brackets and omitting the occasional parameter
name? How much more unoriginal can you get? Somehow, I don't think BBCode
was meant to be readable. <a
343 href="http://en.wikipedia.org/wiki/BBCode">Wikipedia</a> agrees:</p>
345 <blockquote>
346 BBCode was devised and put to use in order to provide a safer, easier
347 and more limited way of allowing users to format their messages.
348 Previously, many message boards allowed the users to include <abbr>HTML</abbr>,
349 which could be used to break/imitate parts of the layout, or run
350 JavaScript. Some implementations of BBCode have suffered problems related
351 to the way they translate the BBCode into <abbr>HTML</abbr>, which could negate the
352 security that was intended to be given by BBCode.
353 </blockquote>
355 <p>Or, put more simply:</p>
357 <blockquote>
BBCode came to life when developers were too
359 lazy to parse <abbr>HTML</abbr> correctly
360 and decided to invent their own markup language. As with all products of
361 laziness, the result is completely inconsistent, unstandardized, and
362 widely adopted.
363 </blockquote>
365 <p>Well, developers, the whole point of HTML Purifier is that I do the
366 work so you can just execute the ridiculously simple
367 <tt>$purifier->purify($html)</tt> call and go on to do, well, whatever
368 you developers do. <tt>:-P</tt></p>
370 <h3 id="AltMarkup:Conclusion">Conclusion</h3>
372 <p>These alternative markup languages have their shiny points, and HTML
373 Purifier is not meant to replace them. However, a major reason for
374 their existence has been called into question. Why are <em>you</em>
375 using these languages?</p>
377 <h2 id="Tidy">HTML Tidy</h2>
379 <p>Dave Raggett's
380 <a href="http://www.w3.org/People/Raggett/tidy/">HTML Tidy</a> is a program;
381 neat enough, at least, to make it into <abbr>PHP</abbr> as a
382 <a href="http://us2.php.net/manual/en/ref.tidy.php"><abbr>PECL</abbr> extension.</a>
383 The premise is simple, the execution effective. Tidy is, in short, a great
384 <em>tool</em>.</p>
386 <p>It is not, however, a filter. I am often surprised when people ask
me, <q>What about Tidy?</q> I have nothing against Tidy: Tidy tackles
388 a different problem set. Let's see what <tt>man tidy</tt> has to say:</p>
390 <blockquote cite="http://tidy.sourceforge.net/docs/tidy_man.html">
391 Tidy reads <abbr>HTML</abbr>, <abbr>XHTML</abbr> and
392 <abbr>XML</abbr> files and writes cleaned up markup. For
393 <abbr>HTML</abbr> variants, it detects and corrects many common coding errors and
394 strives to produce visually equivalent markup that is both <abbr>W3C</abbr> compliant
395 and works on most browsers. A common use of Tidy is to convert plain <abbr>HTML</abbr>
396 to <abbr>XHTML</abbr>.
397 </blockquote>
399 <p>Hmm... why do I not see the words <q>filter</q> or
400 <q><abbr>XSS</abbr></q> in here? Perhaps it's
401 because Tidy accepts <em>any</em> valid
402 <abbr>HTML</abbr>. Including
<tt>script</tt> tags. Which leads us to our second point: Tidy parses
404 <em>documents</em>, not document <em>fragments</em>.</p>
406 <p>This is not to say that I haven't seen Tidy be used in this sort of
fashion. MediaWiki, for instance, uses Tidy to clean up the final <abbr>HTML</abbr>
408 output before shuttling it off to the browser. The developers, nevertheless,
409 agree that this is only a band-aid solution, and that the real way
410 to fix it is to fix the parser. Tidy's great, but in terms of security,
411 it's not suitable for untrusted sources.</p>
413 <h2 id="Preface">Preface</h2>
415 <p>I've ordered my analyses according to how bad a library is. The worst
416 is first, and then we move up the spectrum. I will point out the most
417 flagrant problems with the libraries, but note that I will omit more
418 advanced vulnerabilities: if you can't catch an <tt>onmouseover</tt>
419 attribute, I really shouldn't reprimand you for letting non-<abbr>SGML</abbr> code
420 points through. The ideal solution, however, must do all these things.</p>
<p>Note that besides strip_tags(),
423 most of the libraries are moderately effective against the most common <abbr>XSS</abbr>
424 attacks. None of them (save Safe HTML Checker) fare very well
425 in the standards-compliance department though.</p>
<h2 id="striptags">strip_tags()</h2>
429 <table class="summary">
430 <tr><th>Whitelist</th> <td class="impl-yes">Yes, user-specified</td></tr>
431 <tr><th>Removes foreign tags</th> <td class="impl-partial">Buggy</td></tr>
432 <tr><th>Makes well-formed</th> <td class="impl-no">No</td></tr>
433 <tr><th>Fixes nesting</th> <td class="impl-no">No</td></tr>
434 <tr><th>Validates attributes</th> <td class="impl-no">No</td></tr>
435 </table>
437 <p>The <abbr>PHP</abbr> function
<a href="http://php.net/manual/en/function.strip-tags.php">strip_tags()</a> is
439 the classic solution for attempting to clean up
440 <abbr>HTML</abbr>. It
441 is also the <em>worst</em> solution, and should be avoided like the plague.
442 The fact that it doesn't validate attributes at all means that anyone can
443 insert an <tt>onmouseover='xss();'</tt> and exploit your application.</p>
445 <p>While
this can be patched over with a series of regular expressions that strip out
<tt>on[event]</tt> attributes (though you're still vulnerable to <abbr>XSS</abbr> and at the mercy of
quirky browser behavior), strip_tags() is fundamentally flawed and should not be
449 used.
450 </p>
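<p>To make the attribute problem concrete, here is a minimal sketch that uses
only <abbr>PHP</abbr>'s built-in <tt>strip_tags()</tt>. The input string is made up for
illustration. The whitelisted <tt>&lt;b&gt;</tt> tag survives with its event
handler intact, while the non-whitelisted <tt>&lt;i&gt;</tt> tag is removed:</p>

<pre>
&lt;?php
// Untrusted input (illustrative): an allowed tag carrying an event handler.
$dirty = '&lt;b onmouseover="xss();"&gt;Hover me&lt;/b&gt; &lt;i&gt;gone&lt;/i&gt;';

// Allow only &lt;b&gt;. strip_tags() checks tag names and never looks at attributes.
$clean = strip_tags($dirty, '&lt;b&gt;');

echo $clean;
// &lt;b onmouseover="xss();"&gt;Hover me&lt;/b&gt; gone
</pre>

<p>The whitelist did its job on tag names, yet the output is still
exploitable, which is the whole problem.</p>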
452 <h2 id="Input_Filter">PHP Input Filter</h2>
454 <p>Though its title may not imply it,
455 <a href="http://www.phpclasses.org/browse/package/2189.html">PHP Input Filter</a>
is a souped-up version of strip_tags() with the ability to inspect
attributes. (Don't mind the hastily tacked-on query escaping function.)</p>
459 <table class="summary">
460 <tr><th>Version</th> <td class="impl-yes">1.2.2</td></tr>
461 <tr><th>Last update</th> <td class="impl-irrelevant">2005-10-05</td></tr>
462 <tr><th>License</th> <td class="impl-irrelevant">GPL</td></tr>
463 <tr><th>Whitelist</th> <td class="impl-yes">Yes, user defined</td></tr>
464 <tr><th>Removes foreign tags</th> <td class="impl-yes">Yes</td></tr>
465 <tr><th>Makes well-formed</th> <td class="impl-no">No</td></tr>
466 <tr><th>Fixes nesting</th> <td class="impl-no">No</td></tr>
467 <tr><th>Validates attributes</th> <td class="impl-partial">Partial</td></tr>
468 <tr><th>XSS safe</th> <td class="impl-almostyes">Probably</td></tr>
469 <tr><th>Standards safe</th> <td class="impl-no">No</td></tr>
470 </table>
472 <p>PHP Input Filter implements an
473 <abbr>HTML</abbr> parser, and
474 performs very basic checks on whether or not tags and attributes have
475 been defined in the whitelist as well as some
476 smarter <abbr>XSS</abbr> checks. It is left up to
477 the user to define what they'll permit.</p>
479 <p>With absolutely no checking of well-formedness, it is trivially easy
to trick the filter into leaving unclosed tags lying around. While
standards-compliance may be viewed by some as a <q>nice feature</q>,
482 basic sanity checks like this must be implemented, otherwise a user
483 can mangle a website's layout.</p>
<p>More troubles: woe to
any user that allows the <tt>style</tt> attribute: you can't
just let <abbr>CSS</abbr> through and expect your
488 layout not to be badly mutilated. To top things off,
489 the filter doesn't even preserve data properly: attributes have all
490 spaces stripped out of them. Stay away, stay away!</p>
492 <h2 id="HTML_Safe">HTML_Safe/SafeHTML</h2>
494 <p><a href="http://pear.php.net/package/HTML_Safe">HTML_Safe</a> is
495 <acronym>PEAR</acronym>'s <abbr>HTML</abbr> filtering library.
496 It should be noted that this is the same library as
497 <a href="http://pixel-apes.com/safehtml/">SafeHTML</a>, though with different
498 branding (and a different version number).</p>
500 <table class="summary">
501 <tr><th>Version</th> <td class="impl-almostyes">0.9.9beta</td></tr>
502 <tr><th>Last update</th> <td class="impl-irrelevant">2005-12-21</td></tr>
503 <tr><th>License</th> <td class="impl-irrelevant">BSD (3 clause)</td></tr>
504 <tr><th>Whitelist</th> <td class="impl-no">Mostly No</td></tr>
505 <tr><th>Removes foreign tags</th> <td class="impl-yes">Yes</td></tr>
506 <tr><th>Makes well-formed</th> <td class="impl-yes">Yes</td></tr>
507 <tr><th>Fixes nesting</th> <td class="impl-no">No</td></tr>
508 <tr><th>Validates attributes</th> <td class="impl-partial">Partial</td></tr>
509 <tr><th>XSS safe</th> <td class="impl-almostyes">Probably</td></tr>
510 <tr><th>Standards safe</th> <td class="impl-no">No</td></tr>
511 </table>
513 <p>HTML_Safe's mechanism of action involves parsing
514 <abbr>HTML</abbr> with a
515 <acronym>SAX</acronym> parser and performing
516 validation and filtering as the handlers are called. HTML_Safe does a lot
517 of things right, which is why I say it <em>probably</em> isn't vulnerable
518 to <abbr>XSS</abbr>, but its approach
519 is fundamentally flawed: blacklists.</p>
521 <p>This library maintains arrays of dangerous tags, attributes and
522 <abbr>CSS</abbr> properties. (It also
523 has a blacklist of dangerous <abbr>URI</abbr> protocols, but this is
524 intelligently disabled by default in favor of a protocol whitelist.)
What this means is that HTML_Safe has no qualms about accepting input
526 like <tt>&lt;foobar&gt; Bang &lt;/foobar&gt;</tt>. Anything goes except
527 the tags in those arrays. Scratch standards-compliance (and that was
528 without even considering proper nesting).</p>
530 <p>For now, HTML_Safe might be safe from
531 <abbr>XSS</abbr>.
532 In the future, however, one of the infinitely many tags that HTML_Safe lets
533 through might just possibly be given special functionality by browser vendors.
534 And it might just turn out that this can be exploited. <em>Any</em> blacklist
535 solution puts you at a perpetual arms race against crackers who are constantly
536 discovering new and inventive ways to abuse tags and attributes that you
537 didn't blacklist.</p>
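<p>The difference between the two approaches can be sketched in a few lines
of plain <abbr>PHP</abbr>. The function names and the tag lists below are hypothetical,
chosen only for illustration; they are not taken from HTML_Safe or any other
library discussed here:</p>

<pre>
&lt;?php
// Hypothetical lists, for illustration only.
$blacklist = array('script', 'iframe', 'object', 'embed');
$whitelist = array('b', 'i', 'a', 'p');

// Blacklist: everything is allowed unless it is known to be dangerous.
function blacklist_allows($tag, $blacklist) {
    return !in_array(strtolower($tag), $blacklist, true);
}

// Whitelist: everything is rejected unless it is known to be safe.
function whitelist_allows($tag, $whitelist) {
    return in_array(strtolower($tag), $whitelist, true);
}

var_dump(blacklist_allows('foobar', $blacklist)); // bool(true)
var_dump(whitelist_allows('foobar', $whitelist)); // bool(false)
</pre>

<p>An unknown tag slips straight through the blacklist, and the only way to
stop it is to learn about it first. The whitelist rejects it without anyone
having to anticipate it.</p>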
539 <h2 id="kses">kses</h2>
541 <p><a href="http://sourceforge.net/projects/kses/">kses</a> appears to
542 be the de-facto solution for cleaning <abbr>HTML</abbr>, having found
543 its way into applications such as <a href="http://wordpress.org/">WordPress</a>
544 and being the number one search result for <q>php html filter</q>.</p>
546 <table class="summary">
547 <tr><th>Version</th> <td class="impl-partial">0.2.2</td></tr>
548 <tr><th>Last update</th> <td class="impl-irrelevant">2005-02-06</td></tr>
549 <tr><th>License</th> <td class="impl-irrelevant">GPL</td></tr>
550 <tr><th>Whitelist</th> <td class="impl-yes">Yes, user defined</td></tr>
551 <tr><th>Removes foreign tags</th> <td class="impl-yes">Yes</td></tr>
552 <tr><th>Makes well-formed</th> <td class="impl-no">No</td></tr>
553 <tr><th>Fixes nesting</th> <td class="impl-no">No</td></tr>
554 <tr><th>Validates attributes</th> <td class="impl-partial">Partial</td></tr>
555 <tr><th>XSS safe</th> <td class="impl-almostyes">Probably</td></tr>
556 <tr><th>Standards safe</th> <td class="impl-no">No</td></tr>
557 </table>
559 <p>To be truthful, I didn't do as comprehensive a code survey for kses
560 as I did for some of the other libraries. Out of
561 all the classes I've reviewed so far, kses was definitely the hardest to
562 understand.</p>
<p>kses's modus operandi is splitting up <abbr>HTML</abbr> with a monster regexp
565 and then validating each section with <tt>kses_split2()</tt>. It
566 suffers from the same problems as Input Filter: no well-formedness
567 checks leading to rampant runaway tags (and no standards-compliance).
568 WordPress, the primary user of kses today, had to implement their
569 own custom tag-balancing code to fix this problem: don't use this
570 library without some equivalent!</p>
572 <p>Its whitelist syntax, however, is the most complex of all these libraries,
573 so I'm going to take some time to argue why this particular implementation
574 is bad. The author of this library was thoughtful enough to provide some
575 basic constraint checks on attributes like maxlen and maxval. Now, barring
576 the fact that there simply aren't enough checks, and the fact that they are
577 all lumped together in one function, we now must wonder whether or not
578 the user will go through the trouble of specifying the maximum length
579 of a title attribute.</p>
581 <p>I have my opinions about inherent human laziness, but perhaps WordPress's
582 default filterset is the most telling example:</p>
584 <pre>
$allowedposttags = array (
    /* formatted and trimmed */
    'hr' => array (
        'align' => array (),
        'noshade' => array (),
        'size' => array (),
        'width' => array ()
    )
);
594 </pre>
596 <p>Hmm... do I see a blatant lack of attribute constraints? Conclusion:
597 if the user can get away with not doing work, they will! The biggest
problem in all these whitelist filters is that they forgot to <em>supply</em>
599 the whitelist. The whitelist is just as important as the code that uses
600 the whitelist to filter <abbr>HTML</abbr>.</p>
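<p>For contrast, here is a sketch of what a constrained whitelist entry might
look like. It assumes that kses reads checks such as <tt>maxlen</tt> and
<tt>maxval</tt> from the per-attribute array, the same slot left empty in the
WordPress example. The variable name and the limits are made up:</p>

<pre>
&lt;?php
// Hypothetical whitelist; the limits are arbitrary illustrative values.
$allowed = array(
    'a' => array(
        'href'  => array('maxlen' => 200), // cap the URL length
        'title' => array('maxlen' => 100)  // cap the tooltip length
    ),
    'hr' => array(
        'size'  => array('maxval' => 10),  // cap the numeric value
        'width' => array()                 // no constraint at all
    )
);
</pre>

<p>Every one of those limits is a decision the user has to make and type in
by hand, which is exactly why they so rarely get written.</p>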
602 <h2 id="Safe_HTML_Checker">Safe HTML Checker</h2>
<p><a href="http://simon.incutio.com/archive/2003/02/23/safeHtmlChecker">Safe
606 HTML Checker</a> is (to my knowledge) the first attempt to make a filter
607 that also outputs standards-compliant <abbr>XHTML</abbr>. It wasn't even released or
608 licensed officially, but we'll let that slide: a 4<sup>th</sup> place
609 search result must have done something right.</p>
611 <table class="summary">
612 <tr><th>Version</th> <td class="impl-partial">in-house</td></tr>
613 <tr><th>Last update</th> <td class="impl-almostyes">2003-09-15</td></tr>
614 <tr><th>License</th> <td class="impl-no">undefined</td></tr>
615 <tr><th>Whitelist</th> <td class="impl-almostyes">Yes (bare-bones)</td></tr>
616 <tr><th>Removes foreign tags</th> <td class="impl-yes">Yes</td></tr>
617 <tr><th>Makes well-formed</th> <td class="impl-yes">Yes</td></tr>
618 <tr><th>Fixes nesting</th> <td class="impl-almostyes">Almost</td></tr>
619 <tr><th>Validates attributes</th> <td class="impl-partial">Partial</td></tr>
620 <tr><th>XSS safe</th> <td class="impl-yes">Yes</td></tr>
621 <tr><th>Standards safe</th> <td class="impl-almostyes">Almost</td></tr>
622 </table>
624 <p>Indeed, it is quite a well-written piece of code. It demonstrates
knowledge of inline versus block elements, thus very nearly getting
nesting correct (the only exception is an unimplemented SGML
627 exclusion for <tt>&lt;a&gt;</tt> tags, and that's easy to fix).</p>
629 <p>Unfortunately, part of the reason why it works so well is that it's
630 extremely restrictive. No styling, no tables, very few attributes.
631 Perfectly appropriate for blog comments, but then again, there's always
632 BBCode. This probably means that Safe HTML Checker has a different
633 goal than HTML Purifier.</p>
635 <p>The <abbr>XML</abbr> parser
636 is also quite strict. Accidentally missed a &lt; sign? The parser will
637 complain with the cryptic message:
638 <q><abbr>XHTML</abbr>
639 is not well-formed</q>.
640 The solution is not as simple as just switching to a more permissive
641 parser: Safe HTML Checker relies on the fact that the parser will have
matched up the tags for it.</p>
644 <h2 id="HTMLPurifier">HTML Purifier</h2>
646 <table class="summary">
647 <tr><th>Version</th> <td class="impl-yes">&htmlpurifier.current.version;</td></tr>
648 <tr><th>Last update</th> <td class="impl-yes">&htmlpurifier.current.release-date;</td></tr>
649 <tr><th>License</th> <td class="impl-irrelevant">LGPL</td></tr>
650 <tr><th>Whitelist</th> <td class="impl-yes">Yes</td></tr>
651 <tr><th>Removes foreign tags</th> <td class="impl-yes">Yes</td></tr>
652 <tr><th>Makes well-formed</th> <td class="impl-yes">Yes</td></tr>
653 <tr><th>Fixes nesting</th> <td class="impl-yes">Yes</td></tr>
654 <tr><th>Validates attributes</th> <td class="impl-yes">Yes</td></tr>
655 <tr><th>XSS safe</th> <td class="impl-yes">Yes</td></tr>
656 <tr><th>Standards safe</th> <td class="impl-yes">Yes</td></tr>
657 </table>
659 <p>That table should say it all, but I'll add a few more features:</p>
661 <table class="summary">
662 <tr><th>UTF-8 aware</th><td class="impl-yes">Yes</td></tr>
663 <tr><th>Object-Oriented</th><td class="impl-yes">Yes</td></tr>
664 <tr><th>Validates CSS</th><td class="impl-yes">Yes</td></tr>
665 <tr><th>Tables</th><td class="impl-yes">Yes</td></tr>
666 <tr><th>PHP 5 aware</th><td class="impl-yes">Yes</td></tr>
667 <tr><th>E_STRICT compliant</th><td class="impl-yes">Yes (use -strict)</td></tr>
668 </table>
670 <p>This is not to say that HTML Purifier doesn't have problems of its own.
671 It's big
672 (while the others usually fit in one file, this one requires a huge
673 include list), and it's <a href="http://htmlpurifier.org/live/TODO">missing
674 features.</a> But even in its current state,
675 HTML Purifier is far better than the other libraries.</p>
677 <p>So... <a href="download.html">what are you waiting for?</a></p>
679 </div>
680 </body>
681 </html>