1 <?xml version="1.0" encoding="UTF-8"?>
2 <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
3 "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd" [
4 <!ENTITY % htmlpurifier.current SYSTEM "current.ent"> %htmlpurifier.current;
5 ]>
6 <html xmlns="http://www.w3.org/1999/xhtml"
7 xmlns:xi="http://www.w3.org/2001/XInclude"
8 xmlns:xc="urn:xhtml-compiler"
9 xml:lang="en">
10 <head>
11 <title>Comparison - HTML Purifier</title>
12 <xi:include href="common-meta.xml" xpointer="xpointer(/*/node())" />
13 <meta name="keywords" content="HTMLPurifier, HTML Purifier, HTML, filter, filtering, HTML_Safe, PEAR, comparison, kses, striptags, SafeHTMLChecker" />
14 </head>
15 <body>
17 <xi:include href="common-header.xml" xpointer="xpointer(/*/node())" />
18 <h1 id="title">Comparison</h1>
20 <div id="content">
22 <p>
23 With the advent of <a href="http://en.wikipedia.org/wiki/Web_2.0">Web 2.0</a>,
24 the end user has gone from passive consumer to active producer of content
25 on the World Wide Web. <a href="http://en.wikipedia.org/wiki/Wiki">Wikis</a>,
26 <a href="http://en.wikipedia.org/wiki/Social_software">Social Software</a> and
27 <a href="http://en.wikipedia.org/wiki/Blog">Blogs</a> all put the user in control.
28 </p>
<p>
Give the user too much control, however, and you set yourself up for <a
href="http://en.wikipedia.org/wiki/Cross-site_scripting"><abbr>XSS</abbr
></a> attacks. For this reason, <abbr>HTML</abbr>'s flexibility has
proven to be both a blessing and a curse, and the software that
processes it must strike a fine balance between security and usability.
How do we prevent users from injecting JavaScript or inserting malformed
<abbr>HTML</abbr> while allowing a rich syntax of tags, attributes and
<abbr>CSS</abbr>? How do we put <abbr>HTML</abbr> inside an
<abbr>RSS</abbr> feed without worrying about sloppy coding messing up
<abbr>XML</abbr> parsing? Almost every <abbr>PHP</abbr> developer has
come across this problem before, and many have tried (albeit
unsuccessfully) to solve it. We will analyze existing
libraries to demonstrate how they are ineffective and, of course, how
<strong>HTML Purifier</strong> solves all our problems and achieves
standards-compliance.
</p>
<p>
I will give no quarter and pull no punches: as of the time of writing,
no other library comes even <em>close</em> to solving the problem effectively
for richly formatted documents. Nonetheless, there is a necessary
disclaimer:
</p>
55 <div class="disclaimer">
56 <p>
57 This comparison document was written by the author of HTML Purifier,
58 and clearly is <strong>in favor</strong> of HTML Purifier. However, that doesn't
59 mean that it is biased: I have made every attempt to be <strong>factual and
60 fair</strong>, and I hope that you will agree, by the time you finish reading
61 this document, that HTML Purifier is the only satisfactory <abbr>HTML</abbr>
62 filter out there today.
63 </p>
64 </div>
66 <div id="toc" />
68 <h2 id="Summary">Summary</h2>
70 <p>A table summarizing the differences for the impatient.</p>
72 <div class="wide-table">
73 <table cellspacing="0">
75 <thead>
76 <tr>
77 <th>Library</th>
78 <th>Version</th>
79 <th>Date</th>
80 <th>License</th>
81 <th>Whitelist</th>
82 <th>Removal</th>
83 <th>Well-formed</th>
84 <th>Nesting</th>
85 <th>Attributes</th>
86 <th>XSS&nbsp;safe</th>
87 <th>Standards&nbsp;safe</th>
88 </tr>
89 </thead>
91 <tbody>
93 <tr>
94 <td>striptags</td>
95 <td>n/a</td>
96 <td>n/a</td>
97 <td>n/a</td>
98 <td class="impl-almostyes">Yes (user)</td>
99 <td class="impl-partial">Buggy</td>
100 <td class="impl-no">No</td>
101 <td class="impl-no">No</td>
102 <td class="impl-no">No</td>
103 <td class="impl-no">No</td>
104 <td class="impl-no">No</td>
105 </tr>
107 <tr>
108 <td>PHP Input Filter</td>
109 <td>1.2.2</td>
110 <td>2005-10-05</td>
111 <td>GPL</td>
112 <td class="impl-almostyes">Yes (user)</td>
113 <td class="impl-yes">Yes</td>
114 <td class="impl-no">No</td>
115 <td class="impl-no">No</td>
116 <td class="impl-partial">Partial</td>
117 <td class="impl-almostyes">Probably</td>
118 <td class="impl-no">No</td>
119 </tr>
121 <tr>
122 <td>HTML_Safe</td>
123 <td>0.9.9beta</td>
124 <td>2005-12-21</td>
125 <td>BSD (3)</td>
126 <td class="impl-no">Mostly No</td>
127 <td class="impl-yes">Yes</td>
128 <td class="impl-yes">Yes</td>
129 <td class="impl-no">No</td>
130 <td class="impl-partial">Partial</td>
131 <td class="impl-almostyes">Probably</td>
132 <td class="impl-no">No</td>
133 </tr>
135 <tr>
136 <td>kses</td>
137 <td>0.2.2</td>
138 <td>2005-02-06</td>
139 <td>GPL</td>
140 <td class="impl-almostyes">Yes (user)</td>
141 <td class="impl-yes">Yes</td>
142 <td class="impl-no">No</td>
143 <td class="impl-no">No</td>
144 <td class="impl-partial">Partial</td>
145 <td class="impl-almostyes">Probably</td>
146 <td class="impl-no">No</td>
147 </tr>
149 <tr>
150 <td>htmLawed</td>
151 <td>1.0.3</td>
152 <td>2008-03-03</td>
153 <td>GPL</td>
154 <td class="impl-partial">Yes (not default)</td>
155 <td class="impl-almostyes">Yes (user)</td>
156 <td class="impl-almostyes">Yes (user)</td>
157 <td class="impl-no">No</td>
158 <td class="impl-partial">Partial</td>
159 <td class="impl-no">No</td>
160 <td class="impl-no">No</td>
161 </tr>
163 <tr>
164 <td>Safe HTML Checker</td>
165 <td>n/a</td>
166 <td>2003-09-15</td>
167 <td>n/a</td>
168 <td class="impl-partial">Yes (bare)</td>
169 <td class="impl-yes">Yes</td>
170 <td class="impl-yes">Yes</td>
171 <td class="impl-almostyes">Almost</td>
172 <td class="impl-partial">Partial</td>
173 <td class="impl-yes">Yes</td>
174 <td class="impl-almostyes">Almost</td>
175 </tr>
177 <tr>
178 <td>HTML Purifier</td>
179 <td>&htmlpurifier.current.version;</td>
180 <td>&htmlpurifier.current.release-date;</td>
181 <td>LGPL</td>
182 <td class="impl-yes">Yes</td>
183 <td class="impl-yes">Yes</td>
184 <td class="impl-yes">Yes</td>
185 <td class="impl-yes">Yes</td>
186 <td class="impl-yes">Yes</td>
187 <td class="impl-yes">Yes</td>
188 <td class="impl-yes">Yes</td>
189 </tr>
191 </tbody>
193 </table>
194 </div>
<p>
<a href="#Tidy">HTML Tidy</a> is omitted from this list because it is not
an <abbr>HTML</abbr> filter.
</p>
201 <h2 id="AltMarkup">Look Ma, No <abbr>HTML</abbr>!</h2>
203 <blockquote class="fancy">
204 <div class="quote" style="text-align:center;">
205 A clever person solves a problem.
206 A wise person avoids it.
207 </div>
208 <div class="origin">&mdash; Albert Einstein</div>
209 </blockquote>
<p>
Before we jump into the weird and not-so-wonderful world of
213 <abbr>HTML</abbr> filters, we must first consider another domain:
214 non-<abbr>HTML</abbr> markup libraries. While libraries of this type
215 really shouldn't be considered <abbr>HTML</abbr> filters, they are the
216 number one method of taking user input and processing it into something
217 more than plain old text. These libraries forgo <abbr>HTML</abbr> and
218 define their own markup syntax. <a
219 href="http://en.wikipedia.org/wiki/BBCode">BBCode</a>, <a
220 href="http://en.wikipedia.org/wiki/Wikitext">Wikitext</a>, <a
221 href="http://daringfireball.net/projects/markdown/">Markdown</a> and <a
222 href="http://textism.com/tools/textile/">Textile</a> are all examples of
223 such markup languages (although it should be noted that Wikitext and
224 Markdown can allow <abbr>HTML</abbr> within them). The benefits (to
225 those who use it, anyway) are clear: simplicity and security.
226 </p>
228 <table cellspacing="0">
229 <thead>
230 <tr>
231 <th>Markup language</th>
232 <th>Sample</th>
233 </tr>
234 </thead>
235 <tbody>
236 <tr>
237 <th>BBCode</th>
238 <td><tt>[b]B[/b] [i]i[/i] [url = http://www.example.com/]link[/url].</tt></td>
239 </tr>
240 <tr>
241 <th>Wikitext<sup>1</sup></th>
242 <td><tt>'''B''' ''i'' [http://www.example.com/ link]</tt></td>
243 </tr>
244 <tr>
245 <th>Markdown<sup>2</sup></th>
246 <td><tt>**B** *i* [link](http://www.example.com/)</tt></td>
247 </tr>
248 <tr>
249 <th>Textile</th>
250 <td><tt>*B* _i_ &quot;link&quot;:http://www.example.com/</tt></td>
251 </tr>
252 <tr>
253 <th><abbr>HTML</abbr></th>
254 <td><tt>&lt;b&gt;B&lt;/b&gt; &lt;i&gt;i&lt;/i&gt; &lt;a href=&quot;http://www.example.com/&quot;&gt;link&lt;/a&gt;</tt></td>
255 </tr>
256 <tr>
257 <th><acronym>WYSIWYG</acronym></th>
258 <td><b>B</b> <i>i</i> <a href="http://www.example.com/">link</a></td>
259 </tr>
260 </tbody>
261 </table>
263 <ol class="notes">
264 <li>
265 Wikitext shown is modeled after <a
266 href="http://www.mediawiki.org/wiki/MediaWiki">MediaWiki</a> style.
267 There are many variants of Wikitext currently extant.
268 </li>
269 <li>
270 Strictly speaking, the Markdown syntax is not equivalent: bold text
271 is expressed as <code>&lt;strong&gt;</code> and italicized text is
272 expressed as <code>&lt;em&gt;</code>. Most browser default stylesheets,
273 however, map those two semantic tags to the associated styling, so
274 many users assume that it really is italics (and use it improperly for,
275 say, book titles.)
276 </li>
277 </ol>
279 <h3 id="AltMarkup:Simplicity">Simplicity</h3>
<p>
<abbr>HTML</abbr> source code is often criticized for being difficult to
283 read. For example, compare:
284 </p>
286 <pre>
287 * Item 1
288 * Item 2
289 </pre>
291 <p>...with:</p>
293 <pre>
294 &lt;ul&gt;
295 &lt;li&gt;Item 1&lt;/li&gt;
296 &lt;li&gt;Item 2&lt;/li&gt;
297 &lt;/ul&gt;
298 </pre>
<p>
Which would you prefer to edit? The answer seems obvious, but be careful
302 not to fall into the fallacy of <a
303 href="http://en.wikipedia.org/wiki/False_dilemma">false dilemma</a>.
304 There <em>is</em> a third choice: the <acronym>WYSIWYG</acronym> (rich
305 text) editor, which blows earlier choices out of the water in terms of
306 usability.
307 </p>
<p>
Note that rich text editors and alternate markup syntaxes are not
mutually exclusive, but, when push comes to shove, it's easier to
implement this sort of editor on top of <abbr>HTML</abbr> than some obscure
markup language. And in the cases when it is done, you usually end up with
a live preview, not a true rich text editor.
</p>
317 <blockquote class="digression">
<p>
<q>Now just wait a second,</q> you may be saying, <q><acronym>WYSIWYG</acronym>
320 editors aren't all that great.</q> There are many good arguments against
321 these editors, and <a
322 href="http://www.ideography.co.uk/library/seybold/WYSIWYG.html">intelligent
323 people have written essays</a> devoted to criticizing
324 <acronym>WYSIWYG</acronym>. In addition to the usual arguments against
325 said editors, the web poses another limitation: no JavaScript means no
326 editor, and no editor means... (gasp) manually typing in code.
327 </p>
<p>
Even the most dogmatic purist, however, should recognize that for all
330 its faults, prospective clients <em>really</em> want rich text editors.
331 There are steps you can take to mitigate the associated drawbacks of
332 these editors.
333 </p>
<p>
It is often asserted that <acronym>WYSIWYG</acronym> editors
336 <em>encourage excessive presentational markup</em>. As it turns out,
337 this is the case with any markup language that allows the smallest
338 iota of presentational tags, be it <tt>&lt;font&gt;</tt> or
339 <tt>[color=red]</tt>. A good way to reduce this trouble is to simply
340 eliminate the dialogue boxes that allow users to change colors or fonts
341 (which usually have no legitimate use) and adopt a <acronym>WYSIWYM</acronym>
342 scheme, allowing users to select contextually correct formatting styles
343 for segments of text.
344 </p>
345 </blockquote>
<p>
Simplicity is also a double-edged sword. The moment any remotely
complex markup is needed, these lightweight markup languages fail to
produce. Sure, you can make '''this text bold''' with Wikitext, but that
351 infobox all <q>rendered nicely in aqua blue</q> will require a gaggle of
352 &lt;div&gt;s and <abbr>CSS</abbr>. These languages face the same troubles
353 as regular <abbr>HTML</abbr> filters in that their whitelist is too
354 restrictive (besides the fact that their table markup is extraordinarily
355 complex).
356 </p>
358 <h3 id="AltMarkup:Security">Security</h3>
<p>
BBCode can be boiled down to a <q>wanna-be</q> version of
<abbr>HTML</abbr>. I mean, replacing
the angled brackets with square brackets and omitting the occasional parameter
name? How much more unoriginal can you get? Somehow, I don't think BBCode
was meant to be readable. <a
href="http://en.wikipedia.org/wiki/BBCode">Wikipedia</a> agrees:
</p>
369 <blockquote>
370 BBCode was devised and put to use in order to provide a safer, easier
371 and more limited way of allowing users to format their messages.
372 Previously, many message boards allowed the users to include <abbr>HTML</abbr>,
373 which could be used to break/imitate parts of the layout, or run
374 JavaScript. Some implementations of BBCode have suffered problems related
375 to the way they translate the BBCode into <abbr>HTML</abbr>, which could negate the
376 security that was intended to be given by BBCode.
377 </blockquote>
379 <p>Or, put more simply:</p>
381 <blockquote>
BBCode came to life when developers were too
383 lazy to parse <abbr>HTML</abbr> correctly
384 and decided to invent their own markup language. As with all products of
385 laziness, the result is completely inconsistent, unstandardized, and
386 widely adopted.
387 </blockquote>
<p>
Well, developers, the whole point of HTML Purifier is that I do the
391 work so you can just execute the ridiculously simple
392 <tt>$purifier->purify($html)</tt> call and go on to do, well, whatever
393 you developers do. <tt>:-P</tt>
394 </p>
396 <h3 id="AltMarkup:Conclusion">Conclusion</h3>
<p>
These alternative markup languages have their shiny points, and HTML
400 Purifier is not meant to replace them. However, a major reason for
401 their existence has been called into question. Why are <em>you</em>
402 using these languages?
403 </p>
405 <h2 id="Tidy">HTML Tidy</h2>
<p>
Dave Raggett's
409 <a href="http://www.w3.org/People/Raggett/tidy/">HTML Tidy</a> is a program;
410 neat enough, at least, to make it into <abbr>PHP</abbr> as a
411 <a href="http://us2.php.net/manual/en/ref.tidy.php"><abbr>PECL</abbr> extension.</a>
412 The premise is simple, the execution effective. Tidy is, in short, a great
413 <em>tool</em>.
414 </p>
<p>
It is not, however, a filter. I am often surprised when people ask
418 me, <q>What about Tidy?</q> There's nothing against Tidy: Tidy tackles
419 a different problem set. Let's see what <tt>man tidy</tt> has to say:
420 </p>
422 <blockquote cite="http://tidy.sourceforge.net/docs/tidy_man.html">
423 Tidy reads <abbr>HTML</abbr>, <abbr>XHTML</abbr> and
424 <abbr>XML</abbr> files and writes cleaned up markup. For
425 <abbr>HTML</abbr> variants, it detects and corrects many common coding errors and
426 strives to produce visually equivalent markup that is both <abbr>W3C</abbr> compliant
427 and works on most browsers. A common use of Tidy is to convert plain <abbr>HTML</abbr>
428 to <abbr>XHTML</abbr>.
429 </blockquote>
<p>
Hmm... why do I not see the words <q>filter</q> or
433 <q><abbr>XSS</abbr></q> in here? Perhaps it's
434 because Tidy accepts <em>any</em> valid
435 <abbr>HTML</abbr>. Including
436 <tt>script</tt> tags. Which leads us to our second part: Tidy parses
437 <em>documents</em>, not document <em>fragments</em>.
438 </p>
<p>
This is not to say that I haven't seen Tidy used in this sort of
fashion. MediaWiki, for instance, uses Tidy to clean up the final <abbr>HTML</abbr>
output before shuttling it off to the browser. The developers, nevertheless,
agree that this is only a band-aid solution, and that the real way
to fix it is to fix the parser. Tidy's great, but in terms of security,
it's not suitable for untrusted sources.
</p>
449 <h2 id="Preface">Preface</h2>
<p>
I've ordered my analyses according to how bad a library is. The worst
453 is first, and then we move up the spectrum. I will point out the most
454 flagrant problems with the libraries, but note that I will omit more
455 advanced vulnerabilities: if you can't catch an <tt>onmouseover</tt>
456 attribute, I really shouldn't reprimand you for letting non-<abbr>SGML</abbr> code
457 points through. The ideal solution, however, must do all these things.
458 </p>
<p>
Note that besides striptags,
462 most of the libraries are moderately effective against the most common <abbr>XSS</abbr>
463 attacks. None of them (save Safe HTML Checker) fare very well
464 in the standards-compliance department though.
465 </p>
467 <h2 id="striptags">striptags()</h2>
469 <table class="summary">
470 <tr><th>Whitelist</th> <td class="impl-yes">Yes, user-specified</td></tr>
471 <tr><th>Removes foreign tags</th> <td class="impl-partial">Buggy</td></tr>
472 <tr><th>Makes well-formed</th> <td class="impl-no">No</td></tr>
473 <tr><th>Fixes nesting</th> <td class="impl-no">No</td></tr>
474 <tr><th>Validates attributes</th> <td class="impl-no">No</td></tr>
475 </table>
<p>
The <abbr>PHP</abbr> function
<a href="http://php.net/manual/en/function.strip-tags.php">strip_tags()</a>
(called striptags throughout this document) is
the classic solution for attempting to clean up
<abbr>HTML</abbr>. It
is also the <em>worst</em> solution, and should be avoided like the plague.
The fact that it doesn't validate attributes at all means that anyone can
attach an <tt>onmouseover='xss();'</tt> to a tag you allow and exploit your
application.
</p>
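<p>
A minimal sketch of the attribute problem, using only <abbr>PHP</abbr>'s
built-in <tt>strip_tags()</tt>. The input string is invented for illustration;
the point is that the allowed &lt;b&gt; tag keeps its event handler while only
the foreign tag is removed.
</p>

```php
<?php
// Made-up untrusted input: <b> is on the allow-list but carries an event handler.
$input = '<b onmouseover="xss();">Hover me</b><script>alert(1)</script>';

// The second argument lists the tags to keep; everything else is stripped.
$output = strip_tags($input, '<b>');

// Attributes are never inspected, so the handler survives on the kept tag.
echo $output, "\n";
```

<p>
The &lt;script&gt; tags disappear, but the <tt>onmouseover</tt> handler on the
allowed tag passes through unchanged.
</p>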
<p>
While this can be patched over with a series of regular expressions that strip out
on[event] attributes, you're still vulnerable to <abbr>XSS</abbr> and at the mercy of
quirky browser behavior. striptags() is fundamentally flawed and should not be
used.
</p>
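<p>
To see why the regular-expression patch falls short, here is a rough sketch.
The pattern is one plausible way to strip on[event] attributes, not a
recommended one, and the input is made up. The event handler goes, but a
<tt>javascript:</tt> <abbr>URI</abbr> in the same tag is untouched.
</p>

```php
<?php
// Made-up input: one event handler plus a javascript: URI on an allowed tag.
$input = '<a href="javascript:xss()" onclick="xss()">link</a>';

// Step 1: keep only <a> tags.
$stripped = strip_tags($input, '<a>');

// Step 2: illustrative band-aid that removes on* attributes
// (double-quoted, single-quoted or unquoted values).
$pattern = '/\s+on\w+\s*=\s*("[^"]*"|\'[^\']*\'|[^\s>]+)/i';
$patched = preg_replace($pattern, '', $stripped);

echo $patched, "\n";
```

<p>
The result still contains a scriptable link, which is only one of many vectors
a pattern like this does not cover.
</p>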
494 <h2 id="Input_Filter">PHP Input Filter</h2>
<p>
Though its title may not imply it,
<a href="http://www.phpclasses.org/browse/package/2189.html">PHP Input Filter</a>
is a souped-up version of striptags() with the ability to inspect
attributes. (Don't mind the hastily tacked-on query escaping function.)
</p>
503 <table class="summary">
504 <tr><th>Version</th> <td class="impl-yes">1.2.2</td></tr>
505 <tr><th>Last update</th> <td class="impl-irrelevant">2005-10-05</td></tr>
506 <tr><th>License</th> <td class="impl-irrelevant">GPL</td></tr>
507 <tr><th>Whitelist</th> <td class="impl-yes">Yes, user defined</td></tr>
508 <tr><th>Removes foreign tags</th> <td class="impl-yes">Yes</td></tr>
509 <tr><th>Makes well-formed</th> <td class="impl-no">No</td></tr>
510 <tr><th>Fixes nesting</th> <td class="impl-no">No</td></tr>
511 <tr><th>Validates attributes</th> <td class="impl-partial">Partial</td></tr>
512 <tr><th>XSS safe</th> <td class="impl-almostyes">Probably</td></tr>
513 <tr><th>Standards safe</th> <td class="impl-no">No</td></tr>
514 </table>
<p>
PHP Input Filter implements an
518 <abbr>HTML</abbr> parser, and
519 performs very basic checks on whether or not tags and attributes have
520 been defined in the whitelist as well as some
521 smarter <abbr>XSS</abbr> checks. It is left up to
522 the user to define what they'll permit.
523 </p>
<p>
With absolutely no checking of well-formedness, it is trivially easy
to trick the filter into leaving unclosed tags lying around. While
standards-compliance may be viewed by some as a <q>nice feature</q>,
basic sanity checks like this must be implemented, otherwise a user
can mangle a website's layout.
</p>
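<p>
As a rough idea of what such a sanity check involves, the sketch below reports
tags that are opened but never closed. <tt>unclosedTags()</tt> is a
hypothetical helper written for this page, not part of PHP Input Filter or any
other library discussed here, and it is deliberately naive: it ignores
misnested tags and optional end tags.
</p>

```php
<?php
// Hypothetical helper: returns the names of tags that were opened but never closed.
function unclosedTags(string $html): array
{
    $stack = [];
    // Capture "/" for closing tags, the tag name, and a trailing "/" as in <br />.
    $pattern = '#<(/?)([a-zA-Z][a-zA-Z0-9]*)\b[^>]*?(/?)>#';
    preg_match_all($pattern, $html, $tags, PREG_SET_ORDER);

    foreach ($tags as $tag) {
        [, $closing, $name, $selfClosing] = $tag;
        $name = strtolower($name);
        if ($selfClosing === '/') {
            continue;                 // self-closing: nothing to balance
        }
        if ($closing === '') {
            $stack[] = $name;         // opening tag: remember it
        } elseif (end($stack) === $name) {
            array_pop($stack);        // matching close for the innermost tag
        }
    }
    return $stack;
}

var_dump(unclosedTags('<b>ok</b><br />'));   // balanced input
var_dump(unclosedTags('<b>runaway bold'));   // <b> is never closed
```

<p>
A filter that skips even this level of checking lets a single unclosed tag
restyle everything that follows it on the page.
</p>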
<p>
More troubles: woe to
any user that allows the <tt>style</tt> attribute: you can't
just let <abbr>CSS</abbr> through and expect your
layout not to be badly mutilated. To top things off,
the filter doesn't even preserve data properly: attributes have all
spaces stripped out of them. Stay away, stay away!
</p>
542 <h2 id="HTML_Safe">HTML_Safe/SafeHTML</h2>
<p>
<a href="http://pear.php.net/package/HTML_Safe">HTML_Safe</a> is
546 <acronym>PEAR</acronym>'s <abbr>HTML</abbr> filtering library.
547 It should be noted that this is the same library as
548 <a href="http://pixel-apes.com/safehtml/">SafeHTML</a>, though with different
549 branding (and a different version number).
550 </p>
552 <table class="summary">
553 <tr><th>Version</th> <td class="impl-almostyes">0.9.9beta</td></tr>
554 <tr><th>Last update</th> <td class="impl-irrelevant">2005-12-21</td></tr>
555 <tr><th>License</th> <td class="impl-irrelevant">BSD (3 clause)</td></tr>
556 <tr><th>Whitelist</th> <td class="impl-no">Mostly No</td></tr>
557 <tr><th>Removes foreign tags</th> <td class="impl-yes">Yes</td></tr>
558 <tr><th>Makes well-formed</th> <td class="impl-yes">Yes</td></tr>
559 <tr><th>Fixes nesting</th> <td class="impl-no">No</td></tr>
560 <tr><th>Validates attributes</th> <td class="impl-partial">Partial</td></tr>
561 <tr><th>XSS safe</th> <td class="impl-almostyes">Probably</td></tr>
562 <tr><th>Standards safe</th> <td class="impl-no">No</td></tr>
563 </table>
<p>
HTML_Safe's mechanism of action involves parsing
567 <abbr>HTML</abbr> with a
568 <acronym>SAX</acronym> parser and performing
569 validation and filtering as the handlers are called. HTML_Safe does a lot
570 of things right, which is why I say it <em>probably</em> isn't vulnerable
571 to <abbr>XSS</abbr>, but its approach
572 is fundamentally flawed: blacklists.
573 </p>
<p>
This library maintains arrays of dangerous tags, attributes and
<abbr>CSS</abbr> properties. (It also
has a blacklist of dangerous <abbr>URI</abbr> protocols, but this is
intelligently disabled by default in favor of a protocol whitelist.)
What this means is that HTML_Safe has no qualms about accepting input
like <tt>&lt;foobar&gt; Bang &lt;/foobar&gt;</tt>. Anything goes except
the tags in those arrays. Scratch standards-compliance (and that was
without even considering proper nesting).
</p>
<p>
For now, HTML_Safe might be safe from <abbr>XSS</abbr>.
588 In the future, however, one of the infinitely many tags that HTML_Safe lets
589 through might just possibly be given special functionality by browser vendors.
590 And it might just turn out that this can be exploited. <em>Any</em> blacklist
591 solution puts you at a perpetual arms race against crackers who are constantly
592 discovering new and inventive ways to abuse tags and attributes that you
593 didn't blacklist.
594 </p>
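<p>
The difference between the two policies can be sketched in a few lines. The
lists and function names below are illustrative assumptions, not taken from
HTML_Safe or any other library: a blacklist accepts whatever it has not heard
of, while a whitelist rejects it.
</p>

```php
<?php
// Illustrative lists only; real filters need far more entries.
$blacklist = ['script', 'iframe', 'object'];
$whitelist = ['b', 'i', 'a', 'p'];

// Blacklist policy: accept anything not known to be dangerous.
function allowedByBlacklist(string $tag, array $blacklist): bool
{
    return !in_array(strtolower($tag), $blacklist, true);
}

// Whitelist policy: accept only what is known to be safe.
function allowedByWhitelist(string $tag, array $whitelist): bool
{
    return in_array(strtolower($tag), $whitelist, true);
}

foreach (['b', 'script', 'foobar'] as $tag) {
    printf(
        "%-7s blacklist: %-5s whitelist: %s\n",
        $tag,
        var_export(allowedByBlacklist($tag, $blacklist), true),
        var_export(allowedByWhitelist($tag, $whitelist), true)
    );
}
```

<p>
Both policies agree on &lt;b&gt; and &lt;script&gt;; they differ on the
unknown &lt;foobar&gt;, which is exactly the case a future browser feature
could turn into an exploit.
</p>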
596 <h2 id="kses">kses</h2>
<p>
<a href="http://sourceforge.net/projects/kses/">kses</a> appears to
600 be the de-facto solution for cleaning <abbr>HTML</abbr>, having found
601 its way into applications such as <a href="http://wordpress.org/">WordPress</a>
602 and being the number one search result for <q>php html filter</q>.
603 </p>
605 <table class="summary">
606 <tr><th>Version</th> <td class="impl-partial">0.2.2</td></tr>
607 <tr><th>Last update</th> <td class="impl-irrelevant">2005-02-06</td></tr>
608 <tr><th>License</th> <td class="impl-irrelevant">GPL</td></tr>
609 <tr><th>Whitelist</th> <td class="impl-yes">Yes, user defined</td></tr>
610 <tr><th>Removes foreign tags</th> <td class="impl-yes">Yes</td></tr>
611 <tr><th>Makes well-formed</th> <td class="impl-no">No</td></tr>
612 <tr><th>Fixes nesting</th> <td class="impl-no">No</td></tr>
613 <tr><th>Validates attributes</th> <td class="impl-partial">Partial</td></tr>
614 <tr><th>XSS safe</th> <td class="impl-almostyes">Probably</td></tr>
615 <tr><th>Standards safe</th> <td class="impl-no">No</td></tr>
616 </table>
<p>
To be truthful, I didn't do as comprehensive a code survey for kses
620 as I did for some of the other libraries. Out of
621 all the classes I've reviewed so far, kses was definitely the hardest to
622 understand.
623 </p>
<p>
kses's modus operandi is splitting up <abbr>HTML</abbr> with a monster regexp
and then validating each section with <tt>kses_split2()</tt>. It
628 suffers from the same problems as Input Filter: no well-formedness
629 checks leading to rampant runaway tags (and no standards-compliance).
630 WordPress, the primary user of kses today, had to implement their
631 own custom tag-balancing code to fix this problem: don't use this
632 library without some equivalent!
633 </p>
<p>
Its whitelist syntax, however, is the most complex of all these libraries,
637 so I'm going to take some time to argue why this particular implementation
638 is bad. The author of this library was thoughtful enough to provide some
639 basic constraint checks on attributes like maxlen and maxval. Now, barring
640 the fact that there simply aren't enough checks, and the fact that they are
641 all lumped together in one function, we now must wonder whether or not
642 the user will go through the trouble of specifying the maximum length
643 of a title attribute.
644 </p>
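<p>
For readers who have not seen this style of whitelist, the sketch below shows
the general idea of per-attribute constraints. It borrows the
<tt>maxlen</tt> and <tt>maxval</tt> names mentioned above, but the table
layout, the limits and the <tt>attributePasses()</tt> function are assumptions
made for illustration; this is not kses's actual code.
</p>

```php
<?php
// Hypothetical constraint table in the spirit of maxlen/maxval checks.
$constraints = [
    'title' => ['maxlen' => 50],   // at most 50 characters
    'width' => ['maxval' => 600],  // numeric, at most 600
];

// Hypothetical checker: true only if the attribute is listed and obeys its rules.
function attributePasses(string $name, string $value, array $constraints): bool
{
    if (!isset($constraints[$name])) {
        return false; // attribute is not whitelisted at all
    }
    $rules = $constraints[$name];

    if (isset($rules['maxlen']) && strlen($value) > $rules['maxlen']) {
        return false;
    }
    if (isset($rules['maxval'])) {
        // Require a plain non-negative integer before comparing.
        if (preg_match('/^\d+$/', $value) !== 1 || (int) $value > $rules['maxval']) {
            return false;
        }
    }
    return true;
}

var_dump(attributePasses('width', '500', $constraints));
var_dump(attributePasses('width', '9999', $constraints));
var_dump(attributePasses('onclick', 'xss()', $constraints));
```

<p>
The checks only help if someone fills in the table, which is the weak point
of leaving the whitelist entirely to the user.
</p>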
<p>
I have my opinions about inherent human laziness, but perhaps WordPress's
648 default filterset is the most telling example:
649 </p>
651 <pre>
$allowedposttags = array (
    /* formatted and trimmed */
    'hr' => array (
        'align' => array (),
        'noshade' => array (),
        'size' => array (),
        'width' => array ()
    )
);
661 </pre>
<p>
Hmm... do I see a blatant lack of attribute constraints? Conclusion:
if the user can get away with not doing work, they will! The biggest
problem in all these whitelist filters is that they forget to <em>supply</em>
the whitelist. The whitelist is just as important as the code that uses
the whitelist to filter <abbr>HTML</abbr>.
</p>
671 <h2 id="htmLawed">htmLawed</h2>
<p>
<a href="http://www.bioinformatics.org/phplabware/internal_utilities/htmLawed/index.php">htmLawed</a>
675 is kses on steroids. After looking at HTML Purifier and deciding that it was
676 too slow for him, Santosh Patnaik went ahead and rewrote the kses engine
677 with more features. It is the only other filtering library currently available
678 that is being actively maintained.
679 </p>
681 <table class="summary">
682 <tr><th>Version</th> <td class="impl-yes">1.0.3</td></tr>
683 <tr><th>Last update</th> <td class="impl-irrelevant">2008-03-03</td></tr>
684 <tr><th>License</th> <td class="impl-irrelevant">GPL</td></tr>
685 <tr><th>Whitelist</th> <td class="impl-partial">Yes, but blacklist is default</td></tr>
686 <tr><th>Removes foreign tags</th> <td class="impl-almostyes">Yes, user defined</td></tr>
687 <tr><th>Makes well-formed</th> <td class="impl-almostyes">Yes, user defined</td></tr>
688 <tr><th>Fixes nesting</th> <td class="impl-no">No</td></tr>
689 <tr><th>Validates attributes</th> <td class="impl-partial">Partial</td></tr>
690 <tr><th>XSS safe</th> <td class="impl-no">No</td></tr>
691 <tr><th>Standards safe</th> <td class="impl-no">No</td></tr>
692 </table>
<p>
With the 1.0.3 release, htmLawed added support for a magic <code>safe</code>
parameter which, when set to 1, is meant to make htmLawed produce
<abbr>XSS</abbr>-safe output. Unfortunately, it is vulnerable to a few of the vectors in
the infamous <a href="http://ha.ckers.org/xss.html">ha.ckers.org <abbr>XSS</abbr>
cheat-sheet</a>. I have not listed them here to give the vendor a chance
to fix these issues.
</p>
<p>
htmLawed improves standards-compliance, but it is not fully standards-compliant;
705 there are a number of cases which the author has explicitly stated he will not
706 fix. There are issues with content
707 models in <code>table</code> and <code>ruby</code>, tags that <em>must</em>
708 have content in them, and the <code>blockquote</code> tag in strict doctypes.
709 </p>
<p>
Let's, for a moment, imagine that htmLawed is <abbr>XSS</abbr>-safe when
<code>safe</code> is on (it isn't, as noted above).
Even then, it still is not <abbr>XSS</abbr>-safe out of the tin: you have
to turn on htmLawed's security features! This is
<a href="http://www.bioinformatics.org/phplabware/forum/viewtopic.php?id=28">by
design</a>. Sane defaults are important, because for every person who
does read the documentation, there is
<a href="http://www.bioinformatics.org/phplabware/forum/viewtopic.php?id=28">another</a>
one who doesn't (and is misled by claims that <q>htmLawed is a single-file PHP
software that makes input text secure</q>), and is
surprised at some behavior.
Software must be <strong>safe by default</strong>; the user can then relax
any security restrictions.
</p>
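<p>
One way to read <q>safe by default</q> in code is sketched below. The option
names and the <tt>buildFilterOptions()</tt> function are hypothetical and do
not correspond to htmLawed's or HTML Purifier's configuration: every risky
feature starts switched off, and the caller has to opt in explicitly.
</p>

```php
<?php
// Hypothetical option names, for illustration only. Everything risky is off.
const SECURE_DEFAULTS = [
    'allow_scripts' => false,
    'allow_iframes' => false,
    'allow_forms'   => false,
];

// Start from the secure defaults and apply only recognised overrides.
function buildFilterOptions(array $overrides = []): array
{
    $known = array_intersect_key($overrides, SECURE_DEFAULTS);
    return array_merge(SECURE_DEFAULTS, $known);
}

// A caller who never read the documentation still gets the locked-down set.
var_dump(buildFilterOptions());

// A caller who knows what they are doing can relax a single restriction.
var_dump(buildFilterOptions(['allow_forms' => true]));
```

<p>
With this arrangement, forgetting to configure the filter costs functionality
rather than security.
</p>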
<p>
I also disagree with some of the choices with regard to what elements are
729 <q>safe</q>. <code>form</code> and <code>iframe</code>,
730 indeed, are <abbr>XSS</abbr>-safe,
731 but they are certainly not phishing safe. An attacker can set an iframe
732 to 100% width and height and effectively take over a website; forms can be
733 used to spoof system dialogs <em>on that person's domain</em>. These should
734 <em>not</em> be allowed in <code>safe</code> mode.
735 </p>
<p>
Users, you may be smarting for some better performance, but avoid this
739 library for now, at the very least until the vulnerabilities are fixed.
740 </p>
742 <h2 id="Safe_HTML_Checker">Safe HTML Checker</h2>
<p>
<a href="http://simon.incutio.com/archive/2003/02/23/safeHtmlChecker">Safe
746 HTML Checker</a> is (to my knowledge) the first attempt to make a filter
747 that also outputs standards-compliant <abbr>XHTML</abbr>. It wasn't even released or
748 licensed officially, but we'll let that slide: a 4<sup>th</sup> place
749 search result must have done something right.
750 </p>
752 <table class="summary">
753 <tr><th>Version</th> <td class="impl-partial">in-house</td></tr>
754 <tr><th>Last update</th> <td class="impl-almostyes">2003-09-15</td></tr>
755 <tr><th>License</th> <td class="impl-no">undefined</td></tr>
756 <tr><th>Whitelist</th> <td class="impl-partial">Yes (bare-bones)</td></tr>
757 <tr><th>Removes foreign tags</th> <td class="impl-yes">Yes</td></tr>
758 <tr><th>Makes well-formed</th> <td class="impl-yes">Yes</td></tr>
759 <tr><th>Fixes nesting</th> <td class="impl-almostyes">Almost</td></tr>
760 <tr><th>Validates attributes</th> <td class="impl-partial">Partial</td></tr>
761 <tr><th>XSS safe</th> <td class="impl-yes">Yes</td></tr>
762 <tr><th>Standards safe</th> <td class="impl-almostyes">Almost</td></tr>
763 </table>
<p>
Indeed, it is quite a well-written piece of code. It demonstrates
knowledge of inline versus block elements, thus very nearly getting
nesting correct (the only exception is the unimplemented <abbr>SGML</abbr>
exclusion for <tt>&lt;a&gt;</tt> tags, and that's easy to fix).
</p>
<p>
Unfortunately, part of the reason why it works so well is that it's
774 extremely restrictive. No styling, no tables, very few attributes.
775 Perfectly appropriate for blog comments, but then again, there's always
776 BBCode. This probably means that Safe HTML Checker has a different
777 goal than HTML Purifier.
778 </p>
<p>
The <abbr>XML</abbr> parser is also quite strict. Accidentally missed a
&lt; sign? The parser will complain with the cryptic message:
<q><abbr>XHTML</abbr> is not well-formed</q>. The solution is not as
simple as just switching to a more permissive parser: Safe HTML Checker
relies on the fact that the parser will have matched up the tags for
it.
</p>
789 <h2 id="HTMLPurifier">HTML Purifier</h2>
791 <table class="summary">
792 <tr><th>Version</th> <td class="impl-yes">&htmlpurifier.current.version;</td></tr>
793 <tr><th>Last update</th> <td class="impl-yes">&htmlpurifier.current.release-date;</td></tr>
794 <tr><th>License</th> <td class="impl-irrelevant">LGPL</td></tr>
795 <tr><th>Whitelist</th> <td class="impl-yes">Yes</td></tr>
796 <tr><th>Removes foreign tags</th> <td class="impl-yes">Yes</td></tr>
797 <tr><th>Makes well-formed</th> <td class="impl-yes">Yes</td></tr>
798 <tr><th>Fixes nesting</th> <td class="impl-yes">Yes</td></tr>
799 <tr><th>Validates attributes</th> <td class="impl-yes">Yes</td></tr>
800 <tr><th>XSS safe</th> <td class="impl-yes">Yes</td></tr>
801 <tr><th>Standards safe</th> <td class="impl-yes">Yes</td></tr>
802 </table>
<p>
That table should say it all, but I'll add a few more features:
806 </p>
808 <table class="summary">
809 <tr><th>UTF-8 aware</th><td class="impl-yes">Yes</td></tr>
810 <tr><th>Object-Oriented</th><td class="impl-yes">Yes</td></tr>
811 <tr><th>Validates CSS</th><td class="impl-yes">Yes</td></tr>
812 <tr><th>Tables</th><td class="impl-yes">Yes</td></tr>
813 <tr><th>PHP 5 only</th><td class="impl-yes">Yes</td></tr>
814 <tr><th>E_STRICT compliant</th><td class="impl-yes">Yes</td></tr>
815 <tr><th>Can auto-paragraph</th><td class="impl-yes">Yes</td></tr>
816 <tr><th>Extensible</th><td class="impl-yes">Yes</td></tr>
817 <tr><th>Unit tested</th><td class="impl-yes">Yes</td></tr>
818 </table>
<p>
This is not to say that HTML Purifier doesn't have problems of its own.
822 It's big (while the others usually fit in one file, this one requires a huge
823 include list), and it's <a href="http://htmlpurifier.org/live/TODO">missing
824 features.</a> But even with these deficiencies,
825 HTML Purifier is far better than the other libraries.
826 </p>
<p>
So... <a href="download.html">what are you waiting for?</a>
830 </p>
832 </div>
833 </body>
834 </html>