1 <?xml version="1.0" encoding="UTF-8"?>
2 <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
3 "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd" [
4 <!ENTITY % htmlpurifier.current SYSTEM "current.ent"> %htmlpurifier.current;
5 ]>
6 <html xmlns="http://www.w3.org/1999/xhtml"
7 xmlns:xi="http://www.w3.org/2001/XInclude"
8 xmlns:xc="urn:xhtml-compiler"
9 xml:lang="en">
10 <head>
11 <title>Comparison - HTML Purifier</title>
12 <xi:include href="common-meta.xml" xpointer="xpointer(/*/node())" />
13 <meta name="keywords" content="HTMLPurifier, HTML Purifier, HTML, filter, filtering, HTML_Safe, PEAR, comparison, kses, striptags, SafeHTMLChecker" />
14 </head>
15 <body>
17 <xi:include href="common-header.xml" xpointer="xpointer(/*/node())" />
19 <div id="main">
20 <h1 id="title">Comparison</h1>
22 <div id="content">
24 <p>
25 With the advent of <a href="http://en.wikipedia.org/wiki/Web_2.0">Web 2.0</a>,
26 the end user has gone from passive consumer to active producer of content
27 on the World Wide Web. <a href="http://en.wikipedia.org/wiki/Wiki">Wikis</a>,
28 <a href="http://en.wikipedia.org/wiki/Social_software">Social Software</a> and
29 <a href="http://en.wikipedia.org/wiki/Blog">Blogs</a> all put the user in control.
30 </p>
32 <p>
33 Give the user too much control, however, and you set yourself up for <a
34 href="http://en.wikipedia.org/wiki/Cross-site_scripting"><abbr>XSS</abbr
35 ></a> attacks. For this reason, <abbr>HTML</abbr>'s flexibility has
36 proven to be both a blessing and a curse, and the software that
37 processes it must strike a fine balance between security and usability.
38 How do we prevent users from injecting JavaScript or inserting malformed
39 <abbr>HTML</abbr> while allowing a rich syntax of tags, attributes and
40 <abbr>CSS</abbr>? How do we put <abbr>HTML</abbr> inside
41 an <abbr>RSS</abbr> feed without worrying about sloppy coding messing up
42 <abbr>XML</abbr> parsing? Almost every <abbr>PHP</abbr> developer has
43 come across this problem before, and many have tried (albeit
44 unsuccessfully) to solve it. We will analyze existing
45 libraries to demonstrate how they are ineffective and, of course, how
46 <strong>HTML Purifier</strong> solves all our problems and achieves
47 standards-compliance.
48 </p>
50 <p>
51 I will give no quarter and pull no punches: as of the time of writing,
52 no other library comes even <em>close</em> to solving the problem effectively
53 for richly formatted documents. But, nonetheless, there is a necessary
54 disclaimer:
55 </p>
57 <div class="disclaimer">
58 <p>
59 This comparison document was written by the author of HTML Purifier,
60 and clearly is <strong>in favor</strong> of HTML Purifier. However, that doesn't
61 mean that it is biased: I have made every attempt to be <strong>factual and
62 fair</strong>, and I hope that you will agree, by the time you finish reading
63 this document, that HTML Purifier is the only satisfactory <abbr>HTML</abbr>
64 filter out there today.
65 </p>
66 </div>
68 <div id="toc" />
70 <h2 id="Summary">Summary</h2>
72 <p>A table summarizing the differences for the impatient.</p>
74 <div class="wide-table">
75 <table cellspacing="0">
77 <thead>
78 <tr>
79 <th>Library</th>
80 <th>Version</th>
81 <th>Date</th>
82 <th>License</th>
83 <th>Whitelist</th>
84 <th>Removal</th>
85 <th>Well-formed</th>
86 <th>Nesting</th>
87 <th>Attributes</th>
88 <th>XSS&nbsp;safe</th>
89 <th>Standards&nbsp;safe</th>
90 </tr>
91 </thead>
93 <tbody>
95 <tr>
96 <td>striptags</td>
97 <td>n/a</td>
98 <td>n/a</td>
99 <td>n/a</td>
100 <td class="impl-almostyes">Yes (user)</td>
101 <td class="impl-partial">Buggy</td>
102 <td class="impl-no">No</td>
103 <td class="impl-no">No</td>
104 <td class="impl-no">No</td>
105 <td class="impl-no">No</td>
106 <td class="impl-no">No</td>
107 </tr>
109 <tr>
110 <td>PHP Input Filter</td>
111 <td>1.2.2</td>
112 <td>2005-10-05</td>
113 <td>GPL</td>
114 <td class="impl-almostyes">Yes (user)</td>
115 <td class="impl-yes">Yes</td>
116 <td class="impl-no">No</td>
117 <td class="impl-no">No</td>
118 <td class="impl-partial">Partial</td>
119 <td class="impl-almostyes">Probably</td>
120 <td class="impl-no">No</td>
121 </tr>
123 <tr>
124 <td>HTML_Safe</td>
125 <td>0.9.9beta</td>
126 <td>2005-12-21</td>
127 <td>BSD (3)</td>
128 <td class="impl-no">Mostly No</td>
129 <td class="impl-yes">Yes</td>
130 <td class="impl-yes">Yes</td>
131 <td class="impl-no">No</td>
132 <td class="impl-partial">Partial</td>
133 <td class="impl-almostyes">Probably</td>
134 <td class="impl-no">No</td>
135 </tr>
137 <tr>
138 <td>kses</td>
139 <td>0.2.2</td>
140 <td>2005-02-06</td>
141 <td>GPL</td>
142 <td class="impl-almostyes">Yes (user)</td>
143 <td class="impl-yes">Yes</td>
144 <td class="impl-no">No</td>
145 <td class="impl-no">No</td>
146 <td class="impl-partial">Partial</td>
147 <td class="impl-almostyes">Probably</td>
148 <td class="impl-no">No</td>
149 </tr>
151 <tr>
152 <td>htmLawed</td>
153 <td>1.1.9.1</td>
154 <td>2009-02-26</td>
155 <td>GPL</td>
156 <td class="impl-partial">Yes (not default)</td>
157 <td class="impl-almostyes">Yes (user)</td>
158 <td class="impl-almostyes">Yes (user)</td>
159 <td class="impl-partial">Partial</td>
160 <td class="impl-partial">Partial</td>
161 <td class="impl-almostyes">Probably</td>
162 <td class="impl-no">No</td>
163 </tr>
165 <tr>
166 <td>Safe HTML Checker</td>
167 <td>n/a</td>
168 <td>2003-09-15</td>
169 <td>n/a</td>
170 <td class="impl-partial">Yes (bare)</td>
171 <td class="impl-yes">Yes</td>
172 <td class="impl-yes">Yes</td>
173 <td class="impl-almostyes">Almost</td>
174 <td class="impl-partial">Partial</td>
175 <td class="impl-yes">Yes</td>
176 <td class="impl-almostyes">Almost</td>
177 </tr>
179 <tr>
180 <td>HTML Purifier</td>
181 <td>&htmlpurifier.current.version;</td>
182 <td>&htmlpurifier.current.release-date;</td>
183 <td>LGPL</td>
184 <td class="impl-yes">Yes</td>
185 <td class="impl-yes">Yes</td>
186 <td class="impl-yes">Yes</td>
187 <td class="impl-yes">Yes</td>
188 <td class="impl-yes">Yes</td>
189 <td class="impl-yes">Yes</td>
190 <td class="impl-yes">Yes</td>
191 </tr>
193 </tbody>
195 </table>
196 </div>
198 <p>
199 <a href="#Tidy">HTML Tidy</a> is omitted from this list because it is not
200 an <abbr>HTML</abbr> filter.
201 </p>
203 <h2 id="AltMarkup">Look Ma, No <abbr>HTML</abbr>!</h2>
205 <blockquote class="fancy">
206 <div class="quote" style="text-align:center;">
207 A clever person solves a problem.
208 A wise person avoids it.
209 </div>
210 <div class="origin">&mdash; Albert Einstein</div>
211 </blockquote>
213 <p>
214 Before we jump into the weird and not-so-wonderful world of
215 <abbr>HTML</abbr> filters, we must first consider another domain:
216 non-<abbr>HTML</abbr> markup libraries. While libraries of this type
217 really shouldn't be considered <abbr>HTML</abbr> filters, they are the
218 number one method of taking user input and processing it into something
219 more than plain old text. These libraries forgo <abbr>HTML</abbr> and
220 define their own markup syntax. <a
221 href="http://en.wikipedia.org/wiki/BBCode">BBCode</a>, <a
222 href="http://en.wikipedia.org/wiki/Wikitext">Wikitext</a>, <a
223 href="http://daringfireball.net/projects/markdown/">Markdown</a> and <a
224 href="http://textism.com/tools/textile/">Textile</a> are all examples of
225 such markup languages (although it should be noted that Wikitext and
226 Markdown can allow <abbr>HTML</abbr> within them). The benefits (to
227 those who use it, anyway) are clear: simplicity and security.
228 </p>
230 <table cellspacing="0">
231 <thead>
232 <tr>
233 <th>Markup language</th>
234 <th>Sample</th>
235 </tr>
236 </thead>
237 <tbody>
238 <tr>
239 <th>BBCode</th>
240 <td><tt>[b]B[/b] [i]i[/i] [url = http://www.example.com/]link[/url].</tt></td>
241 </tr>
242 <tr>
243 <th>Wikitext<sup>1</sup></th>
244 <td><tt>'''B''' ''i'' [http://www.example.com/ link]</tt></td>
245 </tr>
246 <tr>
247 <th>Markdown<sup>2</sup></th>
248 <td><tt>**B** *i* [link](http://www.example.com/)</tt></td>
249 </tr>
250 <tr>
251 <th>Textile</th>
252 <td><tt>*B* _i_ &quot;link&quot;:http://www.example.com/</tt></td>
253 </tr>
254 <tr>
255 <th><abbr>HTML</abbr></th>
256 <td><tt>&lt;b&gt;B&lt;/b&gt; &lt;i&gt;i&lt;/i&gt; &lt;a href=&quot;http://www.example.com/&quot;&gt;link&lt;/a&gt;</tt></td>
257 </tr>
258 <tr>
259 <th><acronym>WYSIWYG</acronym></th>
260 <td><b>B</b> <i>i</i> <a href="http://www.example.com/">link</a></td>
261 </tr>
262 </tbody>
263 </table>
265 <ol class="notes">
266 <li>
267 Wikitext shown is modeled after <a
268 href="http://www.mediawiki.org/wiki/MediaWiki">MediaWiki</a> style.
269 There are many variants of Wikitext currently extant.
270 </li>
271 <li>
272 Strictly speaking, the Markdown syntax is not equivalent: bold text
273 is expressed as <code>&lt;strong&gt;</code> and italicized text is
274 expressed as <code>&lt;em&gt;</code>. Most browser default stylesheets,
275 however, map those two semantic tags to the associated styling, so
276 many users assume that it really is italics (and use it improperly for,
277 say, book titles).
278 </li>
279 </ol>
281 <h3 id="AltMarkup:Simplicity">Simplicity</h3>
283 <p>
284 <abbr>HTML</abbr> source code is often criticized for being difficult to
285 read. For example, compare:
286 </p>
288 <pre>
289 * Item 1
290 * Item 2
291 </pre>
293 <p>...with:</p>
295 <pre>
296 &lt;ul&gt;
297 &lt;li&gt;Item 1&lt;/li&gt;
298 &lt;li&gt;Item 2&lt;/li&gt;
299 &lt;/ul&gt;
300 </pre>
302 <p>
303 Which would you prefer to edit? The answer seems obvious, but be careful
304 not to fall into the fallacy of <a
305 href="http://en.wikipedia.org/wiki/False_dilemma">false dilemma</a>.
306 There <em>is</em> a third choice: the <acronym>WYSIWYG</acronym> (rich
307 text) editor, which blows earlier choices out of the water in terms of
308 usability.
309 </p>
311 <p>
312 Note that rich text editors and alternate markup syntaxes are not
313 mutually exclusive, but, when push comes to shove, it's easier to
314 implement this sort of editor on top of <abbr>HTML</abbr> than some obscure
315 markup language. And in the cases when it is done, you usually end up with
316 a live preview, not a true rich text editor.
317 </p>
319 <blockquote class="digression">
320 <p>
321 <q>Now just wait a second,</q> you may be saying, <q><acronym>WYSIWYG</acronym>
322 editors aren't all that great.</q> There are many good arguments against
323 these editors, and <a
324 href="http://www.ideography.co.uk/library/seybold/WYSIWYG.html">intelligent
325 people have written essays</a> devoted to criticizing
326 <acronym>WYSIWYG</acronym>. In addition to the usual arguments against
327 said editors, the web poses another limitation: no JavaScript means no
328 editor, and no editor means... (gasp) manually typing in code.
329 </p>
330 <p>
331 Even the most dogmatic purist, however, should recognize that for all
332 its faults, prospective clients <em>really</em> want rich text editors.
333 There are steps you can take to mitigate the associated drawbacks of
334 these editors.
335 </p>
336 <p>
337 It is often asserted that <acronym>WYSIWYG</acronym> editors
338 <em>encourage excessive presentational markup</em>. As it turns out,
339 this is the case with any markup language that allows the smallest
340 iota of presentational tags, be it <tt>&lt;font&gt;</tt> or
341 <tt>[color=red]</tt>. A good way to reduce this trouble is to simply
342 eliminate the dialogue boxes that allow users to change colors or fonts
343 (which usually have no legitimate use) and adopt a <acronym>WYSIWYM</acronym>
344 scheme, allowing users to select contextually correct formatting styles
345 for segments of text.
346 </p>
347 </blockquote>
349 <p>
350 Simplicity is also a double-edged sword. The moment any remotely
351 complex markup is needed, these lightweight markup languages fail to
352 produce. Sure, you can make '''this text bold''' with Wikitext, but that
353 infobox all <q>rendered nicely in aqua blue</q> will require a gaggle of
354 &lt;div&gt;s and <abbr>CSS</abbr>. These languages face the same troubles
355 as regular <abbr>HTML</abbr> filters in that their whitelist is too
356 restrictive (besides the fact that their table markup is extraordinarily
357 complex).
358 </p>
360 <h3 id="AltMarkup:Security">Security</h3>
362 <p>
363 BBCode can be boiled down to a <q>wanna-be</q> version of
364 <abbr>HTML</abbr>. I mean, replacing
365 the angled brackets with square brackets and omitting the occasional parameter
366 name? How much more unoriginal can you get? Somehow, I don't think BBCode
367 was meant to be readable. <a
368 href="http://en.wikipedia.org/wiki/BBCode">Wikipedia</a> agrees:
369 </p>
371 <blockquote>
372 BBCode was devised and put to use in order to provide a safer, easier
373 and more limited way of allowing users to format their messages.
374 Previously, many message boards allowed the users to include <abbr>HTML</abbr>,
375 which could be used to break/imitate parts of the layout, or run
376 JavaScript. Some implementations of BBCode have suffered problems related
377 to the way they translate the BBCode into <abbr>HTML</abbr>, which could negate the
378 security that was intended to be given by BBCode.
379 </blockquote>
381 <p>Or, put more simply:</p>
383 <blockquote>
384 BBCode came to life when developers were too
385 lazy to parse <abbr>HTML</abbr> correctly
386 and decided to invent their own markup language. As with all products of
387 laziness, the result is completely inconsistent, unstandardized, and
388 widely adopted.
389 </blockquote>
391 <p>
392 Well, developers, the whole point of HTML Purifier is that I do the
393 work so you can just execute the ridiculously simple
394 <tt>$purifier->purify($html)</tt> call and go on to do, well, whatever
395 you developers do. <tt>:-P</tt>
396 </p>
398 <h3 id="AltMarkup:Conclusion">Conclusion</h3>
400 <p>
401 These alternative markup languages have their shiny points, and HTML
402 Purifier is not meant to replace them. However, a major reason for
403 their existence has been called into question. Why are <em>you</em>
404 using these languages?
405 </p>
407 <h2 id="Tidy">HTML Tidy</h2>
409 <p>
410 Dave Raggett's
411 <a href="http://www.w3.org/People/Raggett/tidy/">HTML Tidy</a> is a program;
412 neat enough, at least, to make it into <abbr>PHP</abbr> as a
413 <a href="http://us2.php.net/manual/en/ref.tidy.php"><abbr>PECL</abbr> extension.</a>
414 The premise is simple, the execution effective. Tidy is, in short, a great
415 <em>tool</em>.
416 </p>
418 <p>
419 It is not, however, a filter. I am often surprised when people ask
420 me, <q>What about Tidy?</q> I have nothing against Tidy: Tidy tackles
421 a different problem set. Let's see what <tt>man tidy</tt> has to say:
422 </p>
424 <blockquote cite="http://tidy.sourceforge.net/docs/tidy_man.html">
425 Tidy reads <abbr>HTML</abbr>, <abbr>XHTML</abbr> and
426 <abbr>XML</abbr> files and writes cleaned up markup. For
427 <abbr>HTML</abbr> variants, it detects and corrects many common coding errors and
428 strives to produce visually equivalent markup that is both <abbr>W3C</abbr> compliant
429 and works on most browsers. A common use of Tidy is to convert plain <abbr>HTML</abbr>
430 to <abbr>XHTML</abbr>.
431 </blockquote>
433 <p>
434 Hmm... why do I not see the words <q>filter</q> or
435 <q><abbr>XSS</abbr></q> in here? Perhaps it's
436 because Tidy accepts <em>any</em> valid
437 <abbr>HTML</abbr>. Including
438 <tt>script</tt> tags. Which leads us to our second part: Tidy parses
439 <em>documents</em>, not document <em>fragments</em>.
440 </p>
442 <p>
443 This is not to say that I haven't seen Tidy used in this sort of
444 fashion. MediaWiki, for instance, uses Tidy to clean up the final <abbr>HTML</abbr>
445 output before shuttling it off to the browser. The developers, nevertheless,
446 agree that this is only a band-aid solution, and that the real way
447 to fix it is to fix the parser. Tidy's great, but in terms of security,
448 it's not suitable for untrusted sources.
449 </p>
451 <h2 id="AntiSamy">OWASP AntiSamy</h2>
453 <p>
454 Although <a href="http://www.owasp.org/index.php/Category:OWASP_AntiSamy_Project">OWASP AntiSamy</a> is implemented in Java and .NET, it is
455 worth a quick mention here because it purports to do the same thing
456 as HTML Purifier. The bottom line? It gets pretty close, but
457 it just doesn't have the same depth as HTML Purifier.
458 </p>
460 <p>
461 Architecturally speaking, OWASP AntiSamy is highly dependent on
462 what are called <q>policy files</q>, which are a highly extended form
463 of <abbr>XML</abbr> Schema with information on what attributes and elements to allow. As such,
464 the actual code for filtering is relatively light-weight. AntiSamy
465 gets lots of points for using legitimate <abbr>HTML</abbr> and <abbr>CSS</abbr> parsers (extra
466 props for the <abbr>CSS</abbr> parser; HTML Purifier doesn't use one, but we should!)
467 </p>
469 <p>
470 Unfortunately, while <abbr>XML</abbr> Schema files offer a high level of
471 control over validation, the regular-expression-heavy approach
472 begins showing signs of stress when data-types are complex (e.g.
473 <abbr>URI</abbr>s), and <abbr>XML</abbr> Schema is ill-suited for large-scale <acronym>DOM</acronym> manipulation,
474 which is necessary when transforming <abbr>HTML</abbr> for standards compliance.
475 Nonetheless, I would be fairly confident in its <abbr>XSS</abbr> cleaning
476 abilities, so long as it removes things it doesn't recognize by default
477 (something I find slightly perplexing in its policy files, since some
478 rules indicate things to be removed.)
479 </p>
481 <h2 id="Preface">Preface</h2>
483 <p>
484 I've ordered my analyses according to how bad a library is. The worst
485 is first, and then we move up the spectrum. I will point out the most
486 flagrant problems with the libraries, but note that I will omit more
487 advanced vulnerabilities: if you can't catch an <tt>onmouseover</tt>
488 attribute, I really shouldn't reprimand you for letting non-<abbr>SGML</abbr> code
489 points through. The ideal solution, however, must do all these things.
490 </p>
492 <p>
493 Note that besides striptags,
494 most of the libraries are moderately effective against the most common <abbr>XSS</abbr>
495 attacks. None of them (save Safe HTML Checker) fare very well
496 in the standards-compliance department though.
497 </p>
499 <h2 id="striptags">strip_tags()</h2>
501 <table class="summary">
502 <tr><th>Whitelist</th> <td class="impl-yes">Yes, user-specified</td></tr>
503 <tr><th>Removes foreign tags</th> <td class="impl-partial">Buggy</td></tr>
504 <tr><th>Makes well-formed</th> <td class="impl-no">No</td></tr>
505 <tr><th>Fixes nesting</th> <td class="impl-no">No</td></tr>
506 <tr><th>Validates attributes</th> <td class="impl-no">No</td></tr>
507 </table>
509 <p>
510 The <abbr>PHP</abbr> function
511 <a href="http://php.net/manual/en/function.strip-tags.php">strip_tags()</a> is
512 the classic solution for attempting to clean up
513 <abbr>HTML</abbr>. It
514 is also the <em>worst</em> solution, and should be avoided like the plague.
515 The fact that it doesn't validate attributes at all means that anyone can
516 insert an <tt>onmouseover='xss();'</tt> and exploit your application.
517 </p>
519 <p>
520 This can be bandaged over with a series of regular expressions that strip out
521 <tt>on[event]</tt> attributes, but you are still vulnerable to <abbr>XSS</abbr> and at the mercy of
522 quirky browser behavior. <tt>strip_tags()</tt> is fundamentally flawed and should not be
523 used.
524 </p>
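<p>
The attribute problem is easy to reproduce. The following is a minimal
sketch using only the built-in <tt>strip_tags()</tt> function; the input
string and the variable names are made up for illustration.
</p>

```php
<?php
// Untrusted input: <b> is on the allow-list, but it carries an event handler.
$dirty = '<b onmouseover="xss();">Hover me</b><script>alert(1)</script>';

// Allow only <b>. Other tags are removed, but their text content is kept,
// and attributes on allowed tags are passed through untouched.
$clean = strip_tags($dirty, '<b>');

echo $clean, "\n";
```

<p>
The <tt>script</tt> tags are gone, but the <tt>onmouseover</tt> handler on
the allowed <tt>b</tt> tag survives, which is exactly the hole described above.
</p>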
526 <h2 id="Input_Filter">PHP Input Filter</h2>
528 <p>
529 Though its title may not imply it,
530 <a href="http://www.phpclasses.org/browse/package/2189.html">PHP Input Filter</a>
531 is a souped-up version of strip_tags() with the ability to inspect
532 attributes. (Don't mind the hastily tacked-on query escaping function.)
533 </p>
535 <table class="summary">
536 <tr><th>Version</th> <td class="impl-yes">1.2.2</td></tr>
537 <tr><th>Last update</th> <td class="impl-irrelevant">2005-10-05</td></tr>
538 <tr><th>License</th> <td class="impl-irrelevant">GPL</td></tr>
539 <tr><th>Whitelist</th> <td class="impl-yes">Yes, user defined</td></tr>
540 <tr><th>Removes foreign tags</th> <td class="impl-yes">Yes</td></tr>
541 <tr><th>Makes well-formed</th> <td class="impl-no">No</td></tr>
542 <tr><th>Fixes nesting</th> <td class="impl-no">No</td></tr>
543 <tr><th>Validates attributes</th> <td class="impl-partial">Partial</td></tr>
544 <tr><th>XSS safe</th> <td class="impl-almostyes">Probably</td></tr>
545 <tr><th>Standards safe</th> <td class="impl-no">No</td></tr>
546 </table>
548 <p>
549 PHP Input Filter implements an
550 <abbr>HTML</abbr> parser, and
551 performs very basic checks on whether or not tags and attributes have
552 been defined in the whitelist as well as some
553 smarter <abbr>XSS</abbr> checks. It is left up to
554 the user to define what they'll permit.
555 </p>
557 <p>
558 With absolutely no checking of well-formedness, it is trivially easy
559 to trick the filter into leaving unclosed tags lying around. While
560 standards-compliance may be viewed by some as a <q>nice feature</q>,
561 basic sanity checks like this must be implemented, otherwise a user
562 can mangle a website's layout.
563 </p>
565 <p>
566 More troubles: woe to
567 any user who allows the <tt>style</tt> attribute: you can't
568 simply let <abbr>CSS</abbr> through and expect your
569 layout not to be badly mutilated. To top things off,
570 the filter doesn't even preserve data properly: attributes have all
571 spaces stripped out of them. Stay away, stay away!
572 </p>
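<p>
To make the runaway-tag problem concrete, here is a rough sketch of the kind
of sanity check that is missing. It is not code from PHP Input Filter or any
other library: the function name <tt>unclosed_tags()</tt> and the short list
of void elements are assumptions for illustration, and the regular expression
deliberately ignores comments, CDATA and attribute values containing
<tt>&gt;</tt>.
</p>

```php
<?php
// Naive well-formedness check: returns the names of tags that are still
// open when the input ends. Stray closing tags are simply ignored.
function unclosed_tags(string $html): array
{
    $void  = ['br', 'hr', 'img'];   // elements that never take a closing tag
    $stack = [];

    preg_match_all('#<(/?)([a-z][a-z0-9]*)\b[^>]*>#i', $html, $matches, PREG_SET_ORDER);

    foreach ($matches as $m) {
        $name = strtolower($m[2]);
        if (in_array($name, $void, true)) {
            continue;
        }
        if ($m[1] === '') {
            $stack[] = $name;            // opening tag: remember it
        } elseif (end($stack) === $name) {
            array_pop($stack);           // closing tag matching the innermost open tag
        }
    }

    return $stack;
}

print_r(unclosed_tags('<div><b>A comment with runaway tags'));
```

<p>
A filter that only checks tag names against a whitelist would accept that
input as is, and the open <tt>div</tt> and <tt>b</tt> would swallow whatever
markup follows the comment on the page.
</p>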
574 <h2 id="HTML_Safe">HTML_Safe/SafeHTML</h2>
576 <p>
577 <a href="http://pear.php.net/package/HTML_Safe">HTML_Safe</a> is
578 <acronym>PEAR</acronym>'s <abbr>HTML</abbr> filtering library.
579 It should be noted that this is the same library as
580 <a href="http://pixel-apes.com/safehtml/">SafeHTML</a>, though with different
581 branding (and a different version number).
582 </p>
584 <table class="summary">
585 <tr><th>Version</th> <td class="impl-almostyes">0.9.9beta</td></tr>
586 <tr><th>Last update</th> <td class="impl-irrelevant">2005-12-21</td></tr>
587 <tr><th>License</th> <td class="impl-irrelevant">BSD (3 clause)</td></tr>
588 <tr><th>Whitelist</th> <td class="impl-no">Mostly No</td></tr>
589 <tr><th>Removes foreign tags</th> <td class="impl-yes">Yes</td></tr>
590 <tr><th>Makes well-formed</th> <td class="impl-yes">Yes</td></tr>
591 <tr><th>Fixes nesting</th> <td class="impl-no">No</td></tr>
592 <tr><th>Validates attributes</th> <td class="impl-partial">Partial</td></tr>
593 <tr><th>XSS safe</th> <td class="impl-almostyes">Probably</td></tr>
594 <tr><th>Standards safe</th> <td class="impl-no">No</td></tr>
595 </table>
597 <p>
598 HTML_Safe's mechanism of action involves parsing
599 <abbr>HTML</abbr> with a
600 <acronym>SAX</acronym> parser and performing
601 validation and filtering as the handlers are called. HTML_Safe does a lot
602 of things right, which is why I say it <em>probably</em> isn't vulnerable
603 to <abbr>XSS</abbr>, but its approach
604 is fundamentally flawed: blacklists.
605 </p>
607 <p>
608 This library maintains arrays of dangerous tags, attributes and
609 <abbr>CSS</abbr> properties. (It also
610 has a blacklist of dangerous <abbr>URI</abbr> protocols, but this is
611 intelligently disabled by default in favor of a protocol whitelist.)
612 What this means is that HTML_Safe has no qualms about accepting input
613 like <tt>&lt;foobar&gt; Bang &lt;/foobar&gt;</tt>. Anything goes except
614 the tags in those arrays. Scratch standards-compliance (and that was
615 without even considering proper nesting).
616 </p>
618 <p>
619 For now, HTML_Safe might be safe from <abbr>XSS</abbr>.
620 In the future, however, one of the infinitely many tags that HTML_Safe lets
621 through might just possibly be given special functionality by browser vendors.
622 And it might just turn out that this can be exploited. <em>Any</em> blacklist
623 solution puts you at a perpetual arms race against crackers who are constantly
624 discovering new and inventive ways to abuse tags and attributes that you
625 didn't blacklist.
626 </p>
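<p>
The difference between the two approaches fits in a few lines. This is a
sketch only: the tag lists and function names below are hypothetical and are
not taken from HTML_Safe or HTML Purifier.
</p>

```php
<?php
// Hypothetical lists, for illustration only.
$blacklist = ['script', 'iframe', 'object', 'embed'];
$whitelist = ['b', 'i', 'a', 'p', 'ul', 'li'];

// Blacklist: everything passes unless it is known to be dangerous.
function allowed_by_blacklist(string $tag, array $blacklist): bool
{
    return !in_array(strtolower($tag), $blacklist, true);
}

// Whitelist: nothing passes unless it is known to be safe.
function allowed_by_whitelist(string $tag, array $whitelist): bool
{
    return in_array(strtolower($tag), $whitelist, true);
}

foreach (['b', 'script', 'foobar'] as $tag) {
    printf(
        "%-7s blacklist: %-5s whitelist: %s\n",
        $tag,
        var_export(allowed_by_blacklist($tag, $blacklist), true),
        var_export(allowed_by_whitelist($tag, $whitelist), true)
    );
}
```

<p>
The unknown <tt>foobar</tt> tag passes the blacklist and fails the whitelist.
If a browser later gives such a tag special behavior, only the whitelist is
already protected.
</p>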
628 <h2 id="kses">kses</h2>
630 <p>
631 <a href="http://sourceforge.net/projects/kses/">kses</a> appears to
632 be the de-facto solution for cleaning <abbr>HTML</abbr>, having found
633 its way into applications such as <a href="http://wordpress.org/">WordPress</a>
634 and being the number one search result for <q>php html filter</q>.
635 </p>
637 <table class="summary">
638 <tr><th>Version</th> <td class="impl-partial">0.2.2</td></tr>
639 <tr><th>Last update</th> <td class="impl-irrelevant">2005-02-06</td></tr>
640 <tr><th>License</th> <td class="impl-irrelevant">GPL</td></tr>
641 <tr><th>Whitelist</th> <td class="impl-yes">Yes, user defined</td></tr>
642 <tr><th>Removes foreign tags</th> <td class="impl-yes">Yes</td></tr>
643 <tr><th>Makes well-formed</th> <td class="impl-no">No</td></tr>
644 <tr><th>Fixes nesting</th> <td class="impl-no">No</td></tr>
645 <tr><th>Validates attributes</th> <td class="impl-partial">Partial</td></tr>
646 <tr><th>XSS safe</th> <td class="impl-almostyes">Probably</td></tr>
647 <tr><th>Standards safe</th> <td class="impl-no">No</td></tr>
648 </table>
650 <p>
651 To be truthful, I didn't do as comprehensive a code survey for kses
652 as I did for some of the other libraries. Out of
653 all the classes I've reviewed so far, kses was definitely the hardest to
654 understand.
655 </p>
657 <p>
658 kses's modus operandi is splitting up <abbr>HTML</abbr> with a monster regexp
659 and then validating each section with <tt>kses_split2()</tt>. It
660 suffers from the same problems as Input Filter: no well-formedness
661 checks leading to rampant runaway tags (and no standards-compliance).
662 WordPress, the primary user of kses today, had to implement their
663 own custom tag-balancing code to fix this problem: don't use this
664 library without some equivalent!
665 </p>
667 <p>
668 Its whitelist syntax, however, is the most complex of all these libraries,
669 so I'm going to take some time to argue why this particular implementation
670 is bad. The author of this library was thoughtful enough to provide some
671 basic constraint checks on attributes like maxlen and maxval. Setting aside
672 the fact that there simply aren't enough checks, and the fact that they are
673 all lumped together in one function, we must wonder whether or not
674 the user will go through the trouble of specifying the maximum length
675 of a title attribute.
676 </p>
678 <p>
679 I have my opinions about inherent human laziness, but perhaps WordPress's
680 default filterset is the most telling example:
681 </p>
683 <pre>
684 $allowedposttags = array (
685 /* formatted and trimmed */
686 'hr' => array (
687 'align' => array (),
688 'noshade' => array (),
689 'size' => array (),
690 'width' => array ()
691 )
692 );
693 </pre>
695 <p>
696 Hmm... do I see a blatant lack of attribute constraints? Conclusion:
697 if the user can get away with not doing work, they will! The biggest
698 problem in all these whitelist filters is that they forgot to <em>supply</em>
699 the whitelist. The whitelist is just as important as the code that uses
700 the whitelist to filter <abbr>HTML</abbr>.
701 </p>
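<p>
For contrast, here is a sketch of what a whitelist with real attribute
constraints could look like. The <tt>maxlen</tt> and <tt>maxval</tt> keys
borrow the names mentioned above, but the <tt>values</tt> key, the
<tt>attribute_ok()</tt> function and the limits themselves are assumptions
made for this example; this is not kses code.
</p>

```php
<?php
// Hypothetical whitelist: tag => attribute => constraints.
$allowed = [
    'hr' => [
        'align' => ['values' => ['left', 'center', 'right']],
        'size'  => ['maxval' => 10],
        'width' => ['maxlen' => 5],
    ],
];

// Returns true only if the attribute is whitelisted for the tag
// and its value satisfies every constraint listed for it.
function attribute_ok(array $allowed, string $tag, string $attr, string $value): bool
{
    if (!isset($allowed[$tag][$attr])) {
        return false; // tag or attribute not whitelisted at all
    }
    $rules = $allowed[$tag][$attr];

    if (isset($rules['values']) && !in_array(strtolower($value), $rules['values'], true)) {
        return false; // not one of the enumerated values
    }
    if (isset($rules['maxlen']) && strlen($value) > $rules['maxlen']) {
        return false; // value too long
    }
    if (isset($rules['maxval'])
        && (!preg_match('/^\d+$/', $value) || (int) $value > $rules['maxval'])) {
        return false; // not a non-negative integer, or too large
    }

    return true;
}

var_dump(attribute_ok($allowed, 'hr', 'align', 'center'));
var_dump(attribute_ok($allowed, 'hr', 'size', '9999'));
```

<p>
Writing constraints like these for every attribute of every tag is tedious,
which is the point: if the library does not ship them, most users will leave
the arrays empty.
</p>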
703 <h2 id="htmLawed">htmLawed</h2>
705 <p>
706 <a href="http://www.bioinformatics.org/phplabware/internal_utilities/htmLawed/index.php">htmLawed</a>
707 is kses on steroids. After looking at HTML Purifier and deciding that it was
708 too slow for him, Santosh Patnaik went ahead and rewrote the kses engine
709 with more features.
710 </p>
712 <table class="summary">
713 <tr><th>Version</th> <td class="impl-yes">1.1.9.1</td></tr>
714 <tr><th>Last update</th> <td class="impl-irrelevant">2009-02-26</td></tr>
715 <tr><th>License</th> <td class="impl-irrelevant">GPL</td></tr>
716 <tr><th>Whitelist</th> <td class="impl-partial">Yes, but blacklist is default</td></tr>
717 <tr><th>Removes foreign tags</th> <td class="impl-almostyes">Yes, user defined</td></tr>
718 <tr><th>Makes well-formed</th> <td class="impl-almostyes">Yes, user defined</td></tr>
719 <tr><th>Fixes nesting</th> <td class="impl-partial">Partial</td></tr>
720 <tr><th>Validates attributes</th> <td class="impl-partial">Partial</td></tr>
721 <tr><th>XSS safe</th> <td class="impl-almostyes">Probably</td></tr>
722 <tr><th>Standards safe</th> <td class="impl-no">No</td></tr>
723 </table>
725 <p>
726 htmLawed improves standards-compliance, but it is not fully
727 standards-compliant; there are a number of cases which the author has
728 explicitly stated he will not fix. There are issues with content
729 models in <code>table</code> and <code>ruby</code> and tags that
730 <em>must</em> have content in them.
731 </p>
733 <p>
734 Let's, for a moment, imagine that htmLawed is <abbr>XSS</abbr>-safe when
735 <code>safe</code> is on.
736 Even then, it still is not <abbr>XSS</abbr>-safe out of the tin: you have
737 to turn on htmLawed's security features! This is
738 <a href="http://www.bioinformatics.org/phplabware/forum/viewtopic.php?id=28">by
739 design</a>. Sane defaults are important, because for every person who
740 does read the documentation, there is
741 <a href="http://www.bioinformatics.org/phplabware/forum/viewtopic.php?id=28">another</a>
742 one who doesn't (and is misled by claims that <q>htmLawed is a single-file PHP
743 software that makes input text secure</q>), and is
744 surprised at some behavior.
745 Software must be <strong>safe by default</strong>; the user can then relax
746 any security restrictions.
747 </p>
749 <p>
750 I also disagree with some of the choices with regard to what elements are
751 <q>safe</q>. <code>form</code>
752 is <abbr>XSS</abbr>-safe,
753 but it is certainly not phishing safe. Forms can be
754 used to spoof system dialogs <em>on that person's domain</em>. These should
755 <em>not</em> be allowed in <code>safe</code> mode.
756 </p>
758 <h2 id="Safe_HTML_Checker">Safe HTML Checker</h2>
760 <p>
761 <a href="http://simon.incutio.com/archive/2003/02/23/safeHtmlChecker">Safe
762 HTML Checker</a> is (to my knowledge) the first attempt to make a filter
763 that also outputs standards-compliant <abbr>XHTML</abbr>. It wasn't even released or
764 licensed officially, but we'll let that slide: a 4<sup>th</sup> place
765 search result must have done something right.
766 </p>
768 <table class="summary">
769 <tr><th>Version</th> <td class="impl-partial">in-house</td></tr>
770 <tr><th>Last update</th> <td class="impl-almostyes">2003-09-15</td></tr>
771 <tr><th>License</th> <td class="impl-no">undefined</td></tr>
772 <tr><th>Whitelist</th> <td class="impl-partial">Yes (bare-bones)</td></tr>
773 <tr><th>Removes foreign tags</th> <td class="impl-yes">Yes</td></tr>
774 <tr><th>Makes well-formed</th> <td class="impl-yes">Yes</td></tr>
775 <tr><th>Fixes nesting</th> <td class="impl-almostyes">Almost</td></tr>
776 <tr><th>Validates attributes</th> <td class="impl-partial">Partial</td></tr>
777 <tr><th>XSS safe</th> <td class="impl-yes">Yes</td></tr>
778 <tr><th>Standards safe</th> <td class="impl-almostyes">Almost</td></tr>
779 </table>
781 <p>
782 Indeed, it is quite a well-written piece of code. It demonstrates
783 knowledge of inline versus block elements, thus very nearly getting
784 nesting correct (the only exception is an unimplemented SGML
785 exclusion for <tt>&lt;a&gt;</tt> tags, and that's easy to fix).
786 </p>
788 <p>
789 Unfortunately, part of the reason why it works so well is that it's
790 extremely restrictive. No styling, no tables, very few attributes.
791 Perfectly appropriate for blog comments, but then again, there's always
792 BBCode. This probably means that Safe HTML Checker has a different
793 goal than HTML Purifier.
794 </p>
796 <p>
797 The <abbr>XML</abbr> parser is also quite strict. Accidentally missed a
798 &lt; sign? The parser will complain with the cryptic message:
799 <q><abbr>XHTML</abbr> is not well-formed</q>. The solution is not as
800 simple as just switching to a more permissive parser: Safe HTML Checker
801 relies on the fact that the parser will have matched up the tags for
802 it.
803 </p>
805 <h2 id="HTMLPurifier">HTML Purifier</h2>
807 <table class="summary">
808 <tr><th>Version</th> <td class="impl-yes">&htmlpurifier.current.version;</td></tr>
809 <tr><th>Last update</th> <td class="impl-yes">&htmlpurifier.current.release-date;</td></tr>
810 <tr><th>License</th> <td class="impl-irrelevant">LGPL</td></tr>
811 <tr><th>Whitelist</th> <td class="impl-yes">Yes</td></tr>
812 <tr><th>Removes foreign tags</th> <td class="impl-yes">Yes</td></tr>
813 <tr><th>Makes well-formed</th> <td class="impl-yes">Yes</td></tr>
814 <tr><th>Fixes nesting</th> <td class="impl-yes">Yes</td></tr>
815 <tr><th>Validates attributes</th> <td class="impl-yes">Yes</td></tr>
816 <tr><th>XSS safe</th> <td class="impl-yes">Yes</td></tr>
817 <tr><th>Standards safe</th> <td class="impl-yes">Yes</td></tr>
818 </table>
820 <p>
821 That table should say it all, but I'll add a few more features:
822 </p>
824 <table class="summary">
825 <tr><th>UTF-8 aware</th><td class="impl-yes">Yes</td></tr>
826 <tr><th>Object-Oriented</th><td class="impl-yes">Yes</td></tr>
827 <tr><th>Validates CSS</th><td class="impl-yes">Yes</td></tr>
828 <tr><th>Tables</th><td class="impl-yes">Yes</td></tr>
829 <tr><th>PHP 5 only</th><td class="impl-yes">Yes</td></tr>
830 <tr><th>E_STRICT compliant</th><td class="impl-yes">Yes</td></tr>
831 <tr><th>Can auto-paragraph</th><td class="impl-yes">Yes</td></tr>
832 <tr><th>Extensible</th><td class="impl-yes">Yes</td></tr>
833 <tr><th>Unit tested</th><td class="impl-yes">Yes</td></tr>
834 </table>
836 <p>
837 This is not to say that HTML Purifier doesn't have problems of its own.
838 It's big (while the others usually fit in one file, this one requires a huge
839 include list), and it's <a href="http://htmlpurifier.org/live/TODO">missing
840 features.</a> But even with these deficiencies,
841 HTML Purifier is far better than the other libraries.
842 </p>
844 <p>
845 So... <a href="download">what are you waiting for?</a>
846 </p>
848 </div>
849 </div>
850 </body>
851 </html>