1 <?xml version="1.0" encoding="UTF-8"?>
2 <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
3 "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd" [
4 <!ENTITY % htmlpurifier.current SYSTEM "current.ent"> %htmlpurifier.current;
5 ]>
6 <html xmlns="http://www.w3.org/1999/xhtml"
7 xmlns:xi="http://www.w3.org/2001/XInclude"
8 xmlns:xc="urn:xhtml-compiler"
9 xmlns:svn="urn:xhtml-compiler:Subversion"
10 svn:head-url="$HeadURL$"
11 svn:revision="$Revision$"
12 xc:rss-from-svn="yes"
13 xml:lang="en">
14 <head>
15 <title>Comparison - HTML Purifier</title>
16 <xi:include href="common-meta.xml" xpointer="xpointer(/*/node())" />
17 <meta name="keywords" content="HTMLPurifier, HTML Purifier, HTML, filter, filtering, HTML_Safe, PEAR, comparison, kses, striptags, SafeHTMLChecker" />
18 </head>
19 <body>
21 <xi:include href="common-header.xml" xpointer="xpointer(/*/node())" />
22 <h1 id="title">Comparison</h1>
24 <div id="content">
26 <p>
27 With the advent of <a href="http://en.wikipedia.org/wiki/Web_2.0">Web 2.0</a>,
28 the end user has gone from passive consumer to active producer of content
29 on the World Wide Web. <a href="http://en.wikipedia.org/wiki/Wiki">Wikis</a>,
30 <a href="http://en.wikipedia.org/wiki/Social_software">Social Software</a> and
31 <a href="http://en.wikipedia.org/wiki/Blog">Blogs</a> all put the user in control.
32 </p>
34 <p>
35 Give the user too much control, however, and you set yourself up for <a
36 href="http://en.wikipedia.org/wiki/Cross-site_scripting"><abbr>XSS</abbr
37 ></a> attacks. For this reason, <abbr>HTML</abbr>'s flexibility has
38 proven to be both a blessing and a curse, and the software that
39 processes it must strike a fine balance between security and usability.
40 How do we prevent users from injecting JavaScript or inserting malformed
41 <abbr>HTML</abbr> while allowing a rich syntax of tags, attributes and
42 <abbr>CSS</abbr>? How do we put <abbr>HTML</abbr> inside
43 <abbr>RSS</abbr> feeds without worrying about sloppy coding messing up
44 <abbr>XML</abbr> parsing? Almost every <abbr>PHP</abbr> developer has
45 come across this problem before, and many have tried (albeit
46 unsuccessfully) to solve it. We will analyze existing
47 libraries to demonstrate how they are ineffective and, of course, how
48 <strong>HTML Purifier</strong> solves all our problems and achieves
49 standards-compliance.
50 </p>
52 <p>
53 I will give no quarter and pull no punches: as of the time of writing,
54 no other library comes even <em>close</em> to solving the problem effectively
55 for richly formatted documents. But, nonetheless, there is a necessary
56 disclaimer:
57 </p>
59 <div class="disclaimer">
60 <p>
61 This comparison document was written by the author of HTML Purifier,
62 and clearly is <strong>in favor</strong> of HTML Purifier. However, that doesn't
63 mean that it is biased: I have made every attempt to be <strong>factual and
64 fair</strong>, and I hope that you will agree, by the time you finish reading
65 this document, that HTML Purifier is the only satisfactory <abbr>HTML</abbr>
66 filter out there today.
67 </p>
68 </div>
70 <div id="toc" />
72 <h2 id="Summary">Summary</h2>
74 <p>A table summarizing the differences for the impatient.</p>
76 <div class="wide-table">
77 <table cellspacing="0">
79 <thead>
80 <tr>
81 <th>Library</th>
82 <th>Version</th>
83 <th>Date</th>
84 <th>License</th>
85 <th>Whitelist</th>
86 <th>Removal</th>
87 <th>Well-formed</th>
88 <th>Nesting</th>
89 <th>Attributes</th>
90 <th>XSS&nbsp;safe</th>
91 <th>Standards&nbsp;safe</th>
92 </tr>
93 </thead>
95 <tbody>
97 <tr>
98 <td>striptags</td>
99 <td>n/a</td>
100 <td>n/a</td>
101 <td>n/a</td>
102 <td class="impl-almostyes">Yes (user)</td>
103 <td class="impl-partial">Buggy</td>
104 <td class="impl-no">No</td>
105 <td class="impl-no">No</td>
106 <td class="impl-no">No</td>
107 <td class="impl-no">No</td>
108 <td class="impl-no">No</td>
109 </tr>
111 <tr>
112 <td>PHP Input Filter</td>
113 <td>1.2.2</td>
114 <td>2005-10-05</td>
115 <td>GPL</td>
116 <td class="impl-almostyes">Yes (user)</td>
117 <td class="impl-yes">Yes</td>
118 <td class="impl-no">No</td>
119 <td class="impl-no">No</td>
120 <td class="impl-partial">Partial</td>
121 <td class="impl-almostyes">Probably</td>
122 <td class="impl-no">No</td>
123 </tr>
125 <tr>
126 <td>HTML_Safe</td>
127 <td>0.9.9beta</td>
128 <td>2005-12-21</td>
129 <td>BSD (3)</td>
130 <td class="impl-no">Mostly No</td>
131 <td class="impl-yes">Yes</td>
132 <td class="impl-yes">Yes</td>
133 <td class="impl-no">No</td>
134 <td class="impl-partial">Partial</td>
135 <td class="impl-almostyes">Probably</td>
136 <td class="impl-no">No</td>
137 </tr>
139 <tr>
140 <td>kses</td>
141 <td>0.2.2</td>
142 <td>2005-02-06</td>
143 <td>GPL</td>
144 <td class="impl-almostyes">Yes (user)</td>
145 <td class="impl-yes">Yes</td>
146 <td class="impl-no">No</td>
147 <td class="impl-no">No</td>
148 <td class="impl-partial">Partial</td>
149 <td class="impl-almostyes">Probably</td>
150 <td class="impl-no">No</td>
151 </tr>
153 <tr>
154 <td>htmLawed</td>
155 <td>1.0.3</td>
156 <td>2008-03-03</td>
157 <td>GPL</td>
158 <td class="impl-partial">Yes (not default)</td>
159 <td class="impl-almostyes">Yes (user)</td>
160 <td class="impl-almostyes">Yes (user)</td>
161 <td class="impl-no">No</td>
162 <td class="impl-partial">Partial</td>
163 <td class="impl-no">No</td>
164 <td class="impl-no">No</td>
165 </tr>
167 <tr>
168 <td>Safe HTML Checker</td>
169 <td>n/a</td>
170 <td>2003-09-15</td>
171 <td>n/a</td>
172 <td class="impl-partial">Yes (bare)</td>
173 <td class="impl-yes">Yes</td>
174 <td class="impl-yes">Yes</td>
175 <td class="impl-almostyes">Almost</td>
176 <td class="impl-partial">Partial</td>
177 <td class="impl-yes">Yes</td>
178 <td class="impl-almostyes">Almost</td>
179 </tr>
181 <tr>
182 <td>HTML Purifier</td>
183 <td>&htmlpurifier.current.version;</td>
184 <td>&htmlpurifier.current.release-date;</td>
185 <td>LGPL</td>
186 <td class="impl-yes">Yes</td>
187 <td class="impl-yes">Yes</td>
188 <td class="impl-yes">Yes</td>
189 <td class="impl-yes">Yes</td>
190 <td class="impl-yes">Yes</td>
191 <td class="impl-yes">Yes</td>
192 <td class="impl-yes">Yes</td>
193 </tr>
195 </tbody>
197 </table>
198 </div>
200 <p>
201 <a href="#Tidy">HTML Tidy</a> is omitted from this list because it is not
202 an <abbr>HTML</abbr> filter.
203 </p>
205 <h2 id="AltMarkup">Look Ma, No <abbr>HTML</abbr>!</h2>
207 <blockquote class="fancy">
208 <div class="quote" style="text-align:center;">
209 A clever person solves a problem.
210 A wise person avoids it.
211 </div>
212 <div class="origin">&mdash; Albert Einstein</div>
213 </blockquote>
215 <p>
216 Before we jump into the weird and not-so-wonderful world of
217 <abbr>HTML</abbr> filters, we must first consider another domain:
218 non-<abbr>HTML</abbr> markup libraries. While libraries of this type
219 really shouldn't be considered <abbr>HTML</abbr> filters, they are the
220 number one method of taking user input and processing it into something
221 more than plain old text. These libraries forgo <abbr>HTML</abbr> and
222 define their own markup syntax. <a
223 href="http://en.wikipedia.org/wiki/BBCode">BBCode</a>, <a
224 href="http://en.wikipedia.org/wiki/Wikitext">Wikitext</a>, <a
225 href="http://daringfireball.net/projects/markdown/">Markdown</a> and <a
226 href="http://textism.com/tools/textile/">Textile</a> are all examples of
227 such markup languages (although it should be noted that Wikitext and
228 Markdown can allow <abbr>HTML</abbr> within them). The benefits (to
229 those who use it, anyway) are clear: simplicity and security.
230 </p>
232 <table cellspacing="0">
233 <thead>
234 <tr>
235 <th>Markup language</th>
236 <th>Sample</th>
237 </tr>
238 </thead>
239 <tbody>
240 <tr>
241 <th>BBCode</th>
242 <td><tt>[b]B[/b] [i]i[/i] [url = http://www.example.com/]link[/url].</tt></td>
243 </tr>
244 <tr>
245 <th>Wikitext<sup>1</sup></th>
246 <td><tt>'''B''' ''i'' [http://www.example.com/ link]</tt></td>
247 </tr>
248 <tr>
249 <th>Markdown<sup>2</sup></th>
250 <td><tt>**B** *i* [link](http://www.example.com/)</tt></td>
251 </tr>
252 <tr>
253 <th>Textile</th>
254 <td><tt>*B* _i_ &quot;link&quot;:http://www.example.com/</tt></td>
255 </tr>
256 <tr>
257 <th><abbr>HTML</abbr></th>
258 <td><tt>&lt;b&gt;B&lt;/b&gt; &lt;i&gt;i&lt;/i&gt; &lt;a href=&quot;http://www.example.com/&quot;&gt;link&lt;/a&gt;</tt></td>
259 </tr>
260 <tr>
261 <th><acronym>WYSIWYG</acronym></th>
262 <td><b>B</b> <i>i</i> <a href="http://www.example.com/">link</a></td>
263 </tr>
264 </tbody>
265 </table>
267 <ol class="notes">
268 <li>
269 Wikitext shown is modeled after <a
270 href="http://www.mediawiki.org/wiki/MediaWiki">MediaWiki</a> style.
271 There are many variants of Wikitext currently extant.
272 </li>
273 <li>
274 Strictly speaking, the Markdown syntax is not equivalent: bold text
275 is expressed as <code>&lt;strong&gt;</code> and italicized text is
276 expressed as <code>&lt;em&gt;</code>. Most browser default stylesheets,
277 however, map those two semantic tags to the associated styling, so
278 many users assume that it really is italics (and use it improperly for,
279 say, book titles.)
280 </li>
281 </ol>
283 <h3 id="AltMarkup:Simplicity">Simplicity</h3>
285 <p>
286 <abbr>HTML</abbr> source code is often criticized for being difficult to
287 read. For example, compare:
288 </p>
290 <pre>
291 * Item 1
292 * Item 2
293 </pre>
295 <p>...with:</p>
297 <pre>
298 &lt;ul&gt;
299 &lt;li&gt;Item 1&lt;/li&gt;
300 &lt;li&gt;Item 2&lt;/li&gt;
301 &lt;/ul&gt;
302 </pre>
304 <p>
305 Which would you prefer to edit? The answer seems obvious, but be careful
306 not to fall into the fallacy of <a
307 href="http://en.wikipedia.org/wiki/False_dilemma">false dilemma</a>.
308 There <em>is</em> a third choice: the <acronym>WYSIWYG</acronym> (rich
309 text) editor, which blows earlier choices out of the water in terms of
310 usability.
311 </p>
313 <p>
314 Note that rich text editors and alternate markup syntaxes are not
315 mutually exclusive, but, when push comes to shove, it's easier
316 to implement this sort of editor on top of <abbr>HTML</abbr> than some obscure
317 markup language. And in the cases when it is done, you usually end up with
318 a live preview, not a true rich text editor.
319 </p>
321 <blockquote class="digression">
322 <p>
323 <q>Now just wait a second,</q> you may be saying, <q><acronym>WYSIWYG</acronym>
324 editors aren't all that great.</q> There are many good arguments against
325 these editors, and <a
326 href="http://www.ideography.co.uk/library/seybold/WYSIWYG.html">intelligent
327 people have written essays</a> devoted to criticizing
328 <acronym>WYSIWYG</acronym>. In addition to the usual arguments against
329 said editors, the web poses another limitation: no JavaScript means no
330 editor, and no editor means... (gasp) manually typing in code.
331 </p>
332 <p>
333 Even the most dogmatic purist, however, should recognize that for all
334 its faults, prospective clients <em>really</em> want rich text editors.
335 There are steps you can take to mitigate the associated drawbacks of
336 these editors.
337 </p>
338 <p>
339 It is often asserted that <acronym>WYSIWYG</acronym> editors
340 <em>encourage excessive presentational markup</em>. As it turns out,
341 this is the case with any markup language that allows the smallest
342 iota of presentational tags, be it <tt>&lt;font&gt;</tt> or
343 <tt>[color=red]</tt>. A good way to reduce this trouble is to simply
344 eliminate the dialogue boxes that allow users to change colors or fonts
345 (which usually have no legitimate use) and adopt a <acronym>WYSIWYM</acronym>
346 scheme, allowing users to select contextually correct formatting styles
347 for segments of text.
348 </p>
349 </blockquote>
351 <p>
352 Simplicity is also a double-edged sword. The moment any remotely
353 complex markup is needed, these lightweight markup languages fail to
354 produce. Sure you can make '''this text bold''' with Wikitext, but that
355 infobox all <q>rendered nicely in aqua blue</q> will require a gaggle of
356 &lt;div&gt;s and <abbr>CSS</abbr>. These languages face the same troubles
357 as regular <abbr>HTML</abbr> filters in that their whitelist is too
358 restrictive (besides the fact that their table markup is extraordinarily
359 complex).
360 </p>
362 <h3 id="AltMarkup:Security">Security</h3>
364 <p>
365 BBCode can be boiled down to a <q>wanna-be</q> version of
366 <abbr>HTML</abbr>. I mean, replacing
367 the angled brackets with square brackets and omitting the occasional parameter
368 name? How much more unoriginal can you get? Somehow, I don't think BBCode
369 was meant to be readable. <a
370 href="http://en.wikipedia.org/wiki/BBCode">Wikipedia</a> agrees:
371 </p>
373 <blockquote>
374 BBCode was devised and put to use in order to provide a safer, easier
375 and more limited way of allowing users to format their messages.
376 Previously, many message boards allowed the users to include <abbr>HTML</abbr>,
377 which could be used to break/imitate parts of the layout, or run
378 JavaScript. Some implementations of BBCode have suffered problems related
379 to the way they translate the BBCode into <abbr>HTML</abbr>, which could negate the
380 security that was intended to be given by BBCode.
381 </blockquote>
383 <p>Or, put more simply:</p>
385 <blockquote>
386 BBCode came to life when developers were too
387 lazy to parse <abbr>HTML</abbr> correctly
388 and decided to invent their own markup language. As with all products of
389 laziness, the result is completely inconsistent, unstandardized, and
390 widely adopted.
391 </blockquote>
393 <p>
394 Well, developers, the whole point of HTML Purifier is that I do the
395 work so you can just execute the ridiculously simple
396 <tt>$purifier->purify($html)</tt> call and go on to do, well, whatever
397 you developers do. <tt>:-P</tt>
398 </p>
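<p>
For reference, a minimal sketch of that call. It assumes HTML Purifier is
installed and that its bundled <tt>HTMLPurifier.auto.php</tt> loader can be
found on the include path; the path and the sample input are illustrative
assumptions, not part of the original comparison.
</p>

```php
<?php
// Assumption: HTML Purifier's autoloader file is reachable at this path.
require_once 'HTMLPurifier.auto.php';

// Start from the library's default configuration.
$config = HTMLPurifier_Config::createDefault();
$purifier = new HTMLPurifier($config);

// Made-up untrusted input: an event-handler attribute and an unclosed <b>.
$dirty = '<p onmouseover="xss();">Hello <b>world</p>';

// The single call referred to in the text above.
$clean = $purifier->purify($dirty);

echo $clean;
```

<p>
With the default configuration the event-handler attribute should be dropped
and the unclosed tag balanced; the exact output depends on the installed
version and its configuration.
</p>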
400 <h3 id="AltMarkup:Conclusion">Conclusion</h3>
402 <p>
403 These alternative markup languages have their shiny points, and HTML
404 Purifier is not meant to replace them. However, a major reason for
405 their existence has been called into question. Why are <em>you</em>
406 using these languages?
407 </p>
409 <h2 id="Tidy">HTML Tidy</h2>
411 <p>
412 Dave Raggett's
413 <a href="http://www.w3.org/People/Raggett/tidy/">HTML Tidy</a> is a program;
414 neat enough, at least, to make it into <abbr>PHP</abbr> as a
415 <a href="http://us2.php.net/manual/en/ref.tidy.php"><abbr>PECL</abbr> extension.</a>
416 The premise is simple, the execution effective. Tidy is, in short, a great
417 <em>tool</em>.
418 </p>
420 <p>
421 It is not, however, a filter. I am often surprised when people ask
422 me, <q>What about Tidy?</q> There's nothing against Tidy: Tidy tackles
423 a different problem set. Let's see what <tt>man tidy</tt> has to say:
424 </p>
426 <blockquote cite="http://tidy.sourceforge.net/docs/tidy_man.html">
427 Tidy reads <abbr>HTML</abbr>, <abbr>XHTML</abbr> and
428 <abbr>XML</abbr> files and writes cleaned up markup. For
429 <abbr>HTML</abbr> variants, it detects and corrects many common coding errors and
430 strives to produce visually equivalent markup that is both <abbr>W3C</abbr> compliant
431 and works on most browsers. A common use of Tidy is to convert plain <abbr>HTML</abbr>
432 to <abbr>XHTML</abbr>.
433 </blockquote>
435 <p>
436 Hmm... why do I not see the words <q>filter</q> or
437 <q><abbr>XSS</abbr></q> in here? Perhaps it's
438 because Tidy accepts <em>any</em> valid
439 <abbr>HTML</abbr>. Including
440 <tt>script</tt> tags. Which leads us to our second part: Tidy parses
441 <em>documents</em>, not document <em>fragments</em>.
442 </p>
444 <p>
445 This is not to say that I haven't seen Tidy be used in this sort of
446 fashion. MediaWiki, for instance, uses Tidy to clean up the final <abbr>HTML</abbr>
447 output before shuttling it off to the browser. The developers, nevertheless,
448 agree that this is only a band-aid solution, and that the real way
449 to fix it is to fix the parser. Tidy's great, but in terms of security,
450 it's not suitable for untrusted sources.
451 </p>
453 <h2 id="Preface">Preface</h2>
455 <p>
456 I've ordered my analyses according to how bad a library is. The worst
457 is first, and then we move up the spectrum. I will point out the most
458 flagrant problems with the libraries, but note that I will omit more
459 advanced vulnerabilities: if you can't catch an <tt>onmouseover</tt>
460 attribute, I really shouldn't reprimand you for letting non-<abbr>SGML</abbr> code
461 points through. The ideal solution, however, must do all these things.
462 </p>
464 <p>
465 Note that besides striptags,
466 most of the libraries are moderately effective against the most common <abbr>XSS</abbr>
467 attacks. None of them (save Safe HTML Checker) fare very well
468 in the standards-compliance department though.
469 </p>
471 <h2 id="striptags">striptags()</h2>
473 <table class="summary">
474 <tr><th>Whitelist</th> <td class="impl-yes">Yes, user-specified</td></tr>
475 <tr><th>Removes foreign tags</th> <td class="impl-partial">Buggy</td></tr>
476 <tr><th>Makes well-formed</th> <td class="impl-no">No</td></tr>
477 <tr><th>Fixes nesting</th> <td class="impl-no">No</td></tr>
478 <tr><th>Validates attributes</th> <td class="impl-no">No</td></tr>
479 </table>
481 <p>
482 The <abbr>PHP</abbr> function
483 <a href="http://php.net/manual/en/function.strip-tags.php">striptags()</a> is
484 the classic solution for attempting to clean up
485 <abbr>HTML</abbr>. It
486 is also the <em>worst</em> solution, and should be avoided like the plague.
487 The fact that it doesn't validate attributes at all means that anyone can
488 insert an <tt>onmouseover='xss();'</tt> and exploit your application.
489 </p>
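<p>
A small sketch of the problem, using PHP's built-in <tt>strip_tags()</tt>
(the function this section calls striptags()). The input string is made up
for illustration.
</p>

```php
<?php
// Illustrative input: an allowed tag carrying an event-handler attribute,
// followed by a tag that is not on the allowed list.
$input = "<b onmouseover='xss();'>hover me</b><script>alert(1)</script>";

// Allow only <b>; every other tag is stripped.
$output = strip_tags($input, '<b>');

// The <script> tags are removed, but strip_tags() never inspects attributes,
// so the onmouseover handler on the allowed <b> tag passes straight through.
var_dump(strpos($output, 'onmouseover') !== false); // bool(true)
var_dump(strpos($output, '<script>') === false);    // bool(true)
```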
491 <p>
492 While this can be band-aided with a series of regular expressions that strip out
493 on[event] attributes (you're still vulnerable to <abbr>XSS</abbr> and at the mercy of
494 quirky browser behavior), striptags() is fundamentally flawed and should not be
495 used.
496 </p>
498 <h2 id="Input_Filter">PHP Input Filter</h2>
500 <p>
501 Though its title may not imply it,
502 <a href="http://www.phpclasses.org/browse/package/2189.html">PHP Input Filter</a>
503 is a souped up version of striptags() with the ability to inspect
504 attributes. (Don't mind the hastily tacked on query escaping function).
505 </p>
507 <table class="summary">
508 <tr><th>Version</th> <td class="impl-yes">1.2.2</td></tr>
509 <tr><th>Last update</th> <td class="impl-irrelevant">2005-10-05</td></tr>
510 <tr><th>License</th> <td class="impl-irrelevant">GPL</td></tr>
511 <tr><th>Whitelist</th> <td class="impl-yes">Yes, user defined</td></tr>
512 <tr><th>Removes foreign tags</th> <td class="impl-yes">Yes</td></tr>
513 <tr><th>Makes well-formed</th> <td class="impl-no">No</td></tr>
514 <tr><th>Fixes nesting</th> <td class="impl-no">No</td></tr>
515 <tr><th>Validates attributes</th> <td class="impl-partial">Partial</td></tr>
516 <tr><th>XSS safe</th> <td class="impl-almostyes">Probably</td></tr>
517 <tr><th>Standards safe</th> <td class="impl-no">No</td></tr>
518 </table>
520 <p>
521 PHP Input Filter implements an
522 <abbr>HTML</abbr> parser, and
523 performs very basic checks on whether or not tags and attributes have
524 been defined in the whitelist as well as some
525 smarter <abbr>XSS</abbr> checks. It is left up to
526 the user to define what they'll permit.
527 </p>
529 <p>
530 With absolutely no checking of well-formedness, it is trivially easy
531 to trick the filter into leaving unclosed tags lying around. While
532 standards-compliance may be viewed by some as a <q>nice feature</q>,
533 basic sanity checks like this must be implemented, otherwise a user
534 can mangle a website's layout.
535 </p>
537 <p>
538 More troubles: woe to
539 any user that allows the <tt>style</tt> attribute: you can't
540 just let <abbr>CSS</abbr> through and expect your
541 layout not to be badly mutilated. To top things off,
542 the filter doesn't even preserve data properly: attributes have all
543 spaces stripped out of them. Stay away, stay away!
544 </p>
546 <h2 id="HTML_Safe">HTML_Safe/SafeHTML</h2>
548 <p>
549 <a href="http://pear.php.net/package/HTML_Safe">HTML_Safe</a> is
550 <acronym>PEAR</acronym>'s <abbr>HTML</abbr> filtering library.
551 It should be noted that this is the same library as
552 <a href="http://pixel-apes.com/safehtml/">SafeHTML</a>, though with different
553 branding (and a different version number).
554 </p>
556 <table class="summary">
557 <tr><th>Version</th> <td class="impl-almostyes">0.9.9beta</td></tr>
558 <tr><th>Last update</th> <td class="impl-irrelevant">2005-12-21</td></tr>
559 <tr><th>License</th> <td class="impl-irrelevant">BSD (3 clause)</td></tr>
560 <tr><th>Whitelist</th> <td class="impl-no">Mostly No</td></tr>
561 <tr><th>Removes foreign tags</th> <td class="impl-yes">Yes</td></tr>
562 <tr><th>Makes well-formed</th> <td class="impl-yes">Yes</td></tr>
563 <tr><th>Fixes nesting</th> <td class="impl-no">No</td></tr>
564 <tr><th>Validates attributes</th> <td class="impl-partial">Partial</td></tr>
565 <tr><th>XSS safe</th> <td class="impl-almostyes">Probably</td></tr>
566 <tr><th>Standards safe</th> <td class="impl-no">No</td></tr>
567 </table>
569 <p>
570 HTML_Safe's mechanism of action involves parsing
571 <abbr>HTML</abbr> with a
572 <acronym>SAX</acronym> parser and performing
573 validation and filtering as the handlers are called. HTML_Safe does a lot
574 of things right, which is why I say it <em>probably</em> isn't vulnerable
575 to <abbr>XSS</abbr>, but its approach
576 is fundamentally flawed: blacklists.
577 </p>
579 <p>
580 This library maintains arrays of dangerous tags, attributes and
581 <abbr>CSS</abbr> properties. (It also
582 has a blacklist of dangerous <abbr>URI</abbr> protocols, but this is
583 intelligently disabled by default in favor of a protocol whitelist.)
584 What this means is that HTML_Safe has no qualms about accepting input
585 like <tt>&lt;foobar&gt; Bang &lt;/foobar&gt;</tt>. Anything goes except
586 the tags in those arrays. Scratch standards-compliance (and that was
587 without even considering proper nesting).
588 </p>
590 <p>
591 For now, HTML_Safe might be safe from <abbr>XSS</abbr>.
592 In the future, however, one of the infinitely many tags that HTML_Safe lets
593 through might just possibly be given special functionality by browser vendors.
594 And it might just turn out that this can be exploited. <em>Any</em> blacklist
595 solution puts you at a perpetual arms race against crackers who are constantly
596 discovering new and inventive ways to abuse tags and attributes that you
597 didn't blacklist.
598 </p>
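<p>
The difference between the two approaches can be sketched in a few lines.
The tag lists and function names below are hypothetical and chosen only for
illustration; they are not taken from HTML_Safe or any other library.
</p>

```php
<?php
// Hypothetical lists, for illustration only.
$blacklist = array('script', 'iframe', 'object');
$whitelist = array('b', 'i', 'a', 'p');

// Blacklist: everything passes unless it is explicitly named.
function allowed_by_blacklist($tag, array $blacklist)
{
    return !in_array(strtolower($tag), $blacklist, true);
}

// Whitelist: nothing passes unless it is explicitly named.
function allowed_by_whitelist($tag, array $whitelist)
{
    return in_array(strtolower($tag), $whitelist, true);
}

// Compare the verdicts for a known-good, a known-bad and an unknown tag.
foreach (array('b', 'script', 'foobar') as $tag) {
    printf(
        "%-7s blacklist: %s  whitelist: %s\n",
        $tag,
        allowed_by_blacklist($tag, $blacklist) ? 'pass' : 'drop',
        allowed_by_whitelist($tag, $whitelist) ? 'pass' : 'drop'
    );
}
```

<p>
The unknown <tt>foobar</tt> tag passes the blacklist check and is dropped by
the whitelist check, which is the failure mode described above.
</p>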
600 <h2 id="kses">kses</h2>
602 <p>
603 <a href="http://sourceforge.net/projects/kses/">kses</a> appears to
604 be the de-facto solution for cleaning <abbr>HTML</abbr>, having found
605 its way into applications such as <a href="http://wordpress.org/">WordPress</a>
606 and being the number one search result for <q>php html filter</q>.
607 </p>
609 <table class="summary">
610 <tr><th>Version</th> <td class="impl-partial">0.2.2</td></tr>
611 <tr><th>Last update</th> <td class="impl-irrelevant">2005-02-06</td></tr>
612 <tr><th>License</th> <td class="impl-irrelevant">GPL</td></tr>
613 <tr><th>Whitelist</th> <td class="impl-yes">Yes, user defined</td></tr>
614 <tr><th>Removes foreign tags</th> <td class="impl-yes">Yes</td></tr>
615 <tr><th>Makes well-formed</th> <td class="impl-no">No</td></tr>
616 <tr><th>Fixes nesting</th> <td class="impl-no">No</td></tr>
617 <tr><th>Validates attributes</th> <td class="impl-partial">Partial</td></tr>
618 <tr><th>XSS safe</th> <td class="impl-almostyes">Probably</td></tr>
619 <tr><th>Standards safe</th> <td class="impl-no">No</td></tr>
620 </table>
622 <p>
623 To be truthful, I didn't do as comprehensive a code survey for kses
624 as I did for some of the other libraries. Out of
625 all the classes I've reviewed so far, kses was definitely the hardest to
626 understand.
627 </p>
629 <p>
630 kses's modus operandi is splitting up <abbr>HTML</abbr> with a monster regexp
631 and then validating each section with <tt>kses_split2()</tt>. It
632 suffers from the same problems as Input Filter: no well-formedness
633 checks leading to rampant runaway tags (and no standards-compliance).
634 WordPress, the primary user of kses today, had to implement their
635 own custom tag-balancing code to fix this problem: don't use this
636 library without some equivalent!
637 </p>
639 <p>
640 Its whitelist syntax, however, is the most complex of all these libraries,
641 so I'm going to take some time to argue why this particular implementation
642 is bad. The author of this library was thoughtful enough to provide some
643 basic constraint checks on attributes like maxlen and maxval. Now, barring
644 the fact that there simply aren't enough checks, and the fact that they are
645 all lumped together in one function, we now must wonder whether or not
646 the user will go through the trouble of specifying the maximum length
647 of a title attribute.
648 </p>
650 <p>
651 I have my opinions about inherent human laziness, but perhaps WordPress's
652 default filterset is the most telling example:
653 </p>
655 <pre>
656 $allowedposttags = array (
657 /* formatted and trimmed */
658 'hr' => array (
659 'align' => array (),
660 'noshade' => array (),
661 'size' => array (),
662 'width' => array ()
663 )
664 );
665 </pre>
667 <p>
668 Hmm... do I see a blatant lack of attribute constraints? Conclusion:
669 if the user can get away with not doing work, they will! The biggest
670 problem in all these whitelist filters is that they forgot to <em>supply</em>
671 the whitelist. The whitelist is just as important as the code that uses
672 the whitelist to filter <abbr>HTML</abbr>.
673 </p>
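<p>
For contrast, a sketch of what a constrained whitelist entry could look like,
using the maxlen and maxval checks mentioned above. The variable name and the
specific limits are arbitrary values chosen for illustration, not
recommendations from kses or WordPress.
</p>

```php
<?php
// Hypothetical kses-style whitelist fragment with attribute constraints.
$allowed = array(
    'a' => array(
        'href'  => array('maxlen' => 200), // cap the URL length
        'title' => array('maxlen' => 80),  // cap the tooltip length
    ),
    'hr' => array(
        'size'  => array('maxval' => 10),  // cap the numeric value
        'width' => array('maxval' => 800),
    ),
);
```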
675 <h2 id="htmLawed">htmLawed</h2>
677 <p>
678 <a href="http://www.bioinformatics.org/phplabware/internal_utilities/htmLawed/index.php">htmLawed</a>
679 is kses on steroids. After looking at HTML Purifier and deciding that it was
680 too slow for him, Santosh Patnaik went ahead and rewrote the kses engine
681 with more features. It is the only other filtering library currently available
682 that is being actively maintained.
683 </p>
685 <table class="summary">
686 <tr><th>Version</th> <td class="impl-yes">1.0.3</td></tr>
687 <tr><th>Last update</th> <td class="impl-irrelevant">2008-03-03</td></tr>
688 <tr><th>License</th> <td class="impl-irrelevant">GPL</td></tr>
689 <tr><th>Whitelist</th> <td class="impl-partial">Yes, but blacklist is default</td></tr>
690 <tr><th>Removes foreign tags</th> <td class="impl-almostyes">Yes, user defined</td></tr>
691 <tr><th>Makes well-formed</th> <td class="impl-almostyes">Yes, user defined</td></tr>
692 <tr><th>Fixes nesting</th> <td class="impl-no">No</td></tr>
693 <tr><th>Validates attributes</th> <td class="impl-partial">Partial</td></tr>
694 <tr><th>XSS safe</th> <td class="impl-no">No</td></tr>
695 <tr><th>Standards safe</th> <td class="impl-no">No</td></tr>
696 </table>
698 <p>
699 With the 1.0.3 release, htmLawed added support for a magic <code>safe</code>
700 parameter which, when set to 1, makes htmLawed produce <abbr>XSS</abbr>-safe
701 output. Unfortunately, it is vulnerable to a few of the vectors in
702 the infamous <a href="http://ha.ckers.org/xss.html">ha.ckers.org <abbr>XSS</abbr>
703 cheat-sheet.</a> I have not listed them here to give the vendor a chance
704 to fix these issues.
705 </p>
707 <p>
708 htmLawed improves standards-compliance, but it is not fully standards-compliant;
709 there are a number of cases which the author has explicitly stated he will not
710 fix. There are issues with content
711 models in <code>table</code> and <code>ruby</code>, tags that <em>must</em>
712 have content in them, and the <code>blockquote</code> tag in strict doctypes.
713 </p>
715 <p>
716 Let's, for a moment, imagine that htmLawed is <abbr>XSS</abbr>-safe when
717 <code>safe</code> is on (it isn't, as noted above).
718 Even then, it still is not <abbr>XSS</abbr>-safe out of the tin: you have
719 to turn on htmLawed's security features! This is
720 <a href="http://www.bioinformatics.org/phplabware/forum/viewtopic.php?id=28">by
721 design</a>. Sane defaults are important, because for every person who
722 does read the documentation, there is
723 <a href="http://www.bioinformatics.org/phplabware/forum/viewtopic.php?id=28">another</a>
724 one who doesn't (and is misled by claims that <q>htmLawed is a single-file PHP
725 software that makes input text secure</q>), and is
726 surprised at some behavior.
727 Software must be <strong>safe by default</strong>; the user can then relax
728 any security restrictions.
729 </p>
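<p>
A sketch of what opting in looks like, assuming <tt>htmLawed.php</tt> is on
the include path and that its entry point takes the input text followed by a
configuration array. The include path and the sample input are illustrative
assumptions.
</p>

```php
<?php
// Assumption: the single-file library is reachable at this path.
require_once 'htmLawed.php';

// Made-up untrusted input.
$dirty = '<a href="#" onclick="xss();">click</a>';

// Default call: the security restrictions discussed above are not turned on.
$default = htmLawed($dirty);

// Opt-in call: the safe parameter described in this section.
$safer = htmLawed($dirty, array('safe' => 1));
```

<p>
The point of the sketch is the second argument: a caller who never reads the
documentation will only ever write the first call.
</p>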
731 <p>
732 I also disagree with some of the choices with regard to what elements are
733 <q>safe</q>. <code>form</code> and <code>iframe</code>,
734 indeed, are <abbr>XSS</abbr>-safe,
735 but they are certainly not phishing safe. An attacker can set an iframe
736 to 100% width and height and effectively take over a website; forms can be
737 used to spoof system dialogs <em>on that person's domain</em>. These should
738 <em>not</em> be allowed in <code>safe</code> mode.
739 </p>
741 <p>
742 Users, you may be smarting for some better performance, but avoid this
743 library for now, at the very least until the vulnerabilities are fixed.
744 </p>
746 <h2 id="Safe_HTML_Checker">Safe HTML Checker</h2>
748 <p>
749 <a href="http://simon.incutio.com/archive/2003/02/23/safeHtmlChecker">Safe
750 HTML Checker</a> is (to my knowledge) the first attempt to make a filter
751 that also outputs standards-compliant <abbr>XHTML</abbr>. It wasn't even released or
752 licensed officially, but we'll let that slide: a 4<sup>th</sup> place
753 search result must have done something right.
754 </p>
756 <table class="summary">
757 <tr><th>Version</th> <td class="impl-partial">in-house</td></tr>
758 <tr><th>Last update</th> <td class="impl-almostyes">2003-09-15</td></tr>
759 <tr><th>License</th> <td class="impl-no">undefined</td></tr>
760 <tr><th>Whitelist</th> <td class="impl-partial">Yes (bare-bones)</td></tr>
761 <tr><th>Removes foreign tags</th> <td class="impl-yes">Yes</td></tr>
762 <tr><th>Makes well-formed</th> <td class="impl-yes">Yes</td></tr>
763 <tr><th>Fixes nesting</th> <td class="impl-almostyes">Almost</td></tr>
764 <tr><th>Validates attributes</th> <td class="impl-partial">Partial</td></tr>
765 <tr><th>XSS safe</th> <td class="impl-yes">Yes</td></tr>
766 <tr><th>Standards safe</th> <td class="impl-almostyes">Almost</td></tr>
767 </table>
769 <p>
770 Indeed, it is quite a well-written piece of code. It demonstrates
771 knowledge of inline versus block elements, thus almost getting
772 nesting correct (the only exception is an unimplemented <abbr>SGML</abbr>
773 exclusion for <tt>&lt;a&gt;</tt> tags, and that's easy to fix).
774 </p>
776 <p>
777 Unfortunately, part of the reason why it works so well is that it's
778 extremely restrictive. No styling, no tables, very few attributes.
779 Perfectly appropriate for blog comments, but then again, there's always
780 BBCode. This probably means that Safe HTML Checker has a different
781 goal than HTML Purifier.
782 </p>
784 <p>
785 The <abbr>XML</abbr> parser is also quite strict. Accidentally missed a
786 &lt; sign? The parser will complain with the cryptic message:
787 <q><abbr>XHTML</abbr> is not well-formed</q>. The solution is not as
788 simple as just switching to a more permissive parser: Safe HTML Checker
789 relies on the fact that the parser will have matched up the tags for
790 it.
791 </p>
793 <h2 id="HTMLPurifier">HTML Purifier</h2>
795 <table class="summary">
796 <tr><th>Version</th> <td class="impl-yes">&htmlpurifier.current.version;</td></tr>
797 <tr><th>Last update</th> <td class="impl-yes">&htmlpurifier.current.release-date;</td></tr>
798 <tr><th>License</th> <td class="impl-irrelevant">LGPL</td></tr>
799 <tr><th>Whitelist</th> <td class="impl-yes">Yes</td></tr>
800 <tr><th>Removes foreign tags</th> <td class="impl-yes">Yes</td></tr>
801 <tr><th>Makes well-formed</th> <td class="impl-yes">Yes</td></tr>
802 <tr><th>Fixes nesting</th> <td class="impl-yes">Yes</td></tr>
803 <tr><th>Validates attributes</th> <td class="impl-yes">Yes</td></tr>
804 <tr><th>XSS safe</th> <td class="impl-yes">Yes</td></tr>
805 <tr><th>Standards safe</th> <td class="impl-yes">Yes</td></tr>
806 </table>
808 <p>
809 That table should say it all, but I'll add a few more features:
810 </p>
812 <table class="summary">
813 <tr><th>UTF-8 aware</th><td class="impl-yes">Yes</td></tr>
814 <tr><th>Object-Oriented</th><td class="impl-yes">Yes</td></tr>
815 <tr><th>Validates CSS</th><td class="impl-yes">Yes</td></tr>
816 <tr><th>Tables</th><td class="impl-yes">Yes</td></tr>
817 <tr><th>PHP 5 only</th><td class="impl-yes">Yes</td></tr>
818 <tr><th>E_STRICT compliant</th><td class="impl-yes">Yes</td></tr>
819 <tr><th>Can auto-paragraph</th><td class="impl-yes">Yes</td></tr>
820 <tr><th>Extensible</th><td class="impl-yes">Yes</td></tr>
821 <tr><th>Unit tested</th><td class="impl-yes">Yes</td></tr>
822 </table>
824 <p>
825 This is not to say that HTML Purifier doesn't have problems of its own.
826 It's big (while the others usually fit in one file, this one requires a huge
827 include list), and it's <a href="http://htmlpurifier.org/live/TODO">missing
828 features.</a> But even with these deficiencies,
829 HTML Purifier is far better than the other libraries.
830 </p>
832 <p>
833 So... <a href="download.html">what are you waiting for?</a>
834 </p>
836 </div>
837 </body>
838 </html>