- -

With the advent of -Web 2.0, the end user has -gone from passive consumer to active producer of content on the World Wide -Web. Wikis, -Social Software and -Blogs all -put the user in control.

- -

Give the user too much control, however, and you set yourself up -for XSS attacks. For this reason, -HTML's flexibility -has proven to be both a blessing and a curse, and the software that processes -it must strike a fine balance between security and usability. How do -we prevent users from injecting JavaScript or inserting malformed -HTML while allowing -a rich syntax of tags, attributes and CSS? How do we put -HTML inside -an RSS feed without worrying -about sloppy coding messing up XML parsing? -Almost every PHP -developer has come across this problem before, and many have tried -(albeit unsuccessfully) to solve it. We will analyze existing -libraries to demonstrate how they are ineffective and, of course, -how HTML Purifier solves all our problems and achieves -standards-compliance.

- -

I will give no quarter and pull no punches: as of the time of writing, -no other library comes even close to solving the problem effectively -for richly formatted documents. But, nonetheless, there is a necessary -disclaimer:

- -

- This comparison document was written by the author of HTML Purifier, - and clearly is in favor of HTML Purifier. However, that doesn't - mean that it is biased: I have made every attempt to be factual and - fair, and I hope that you will agree, by the time you finish reading - this document, that HTML Purifier is the only satisfactory HTML - filter out there today. -

- -

Summary

- -

A table summarizing the differences for the impatient.

- -

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

Library	Version	Date	License	Whitelist	Removal	Well-formed	Nesting	Attributes	XSS safe	Standards safe
strip_tags	n/a	n/a	n/a	Yes (user)	Buggy	No	No	No	No	No
PHP Input Filter	1.2.2	2005-10-05	GPL	Yes (user)	Yes	No	No	Partial	Probably	No
HTML_Safe	0.9.9beta	2005-12-21	BSD (3)	Mostly No	Yes	Yes	No	Partial	Probably	No
kses	0.2.2	2005-02-06	GPL	Yes (user)	Yes	No	No	Partial	Probably	No
Safe HTML Checker	n/a	2003-09-15	n/a	Yes (bare)	Yes	Yes	Almost	Partial	Yes	Almost
HTML Purifier	1.6.0	2007-04-01	LGPL	Yes	Yes	Yes	Yes	Yes	Yes	Yes

- -

HTML Tidy is omitted from this list because it is not an HTML -filter.

- -

Look Ma, No HTML!

- -

-
- A clever person solves a problem. - A wise person avoids it. -
-
— Albert Einstein
-

- -

Before we jump into the weird and not-so-wonderful world -of HTML filters, we must first consider another domain: non-HTML -markup libraries. While libraries of this type really shouldn't be -considered HTML filters, -they are the number one method of taking user input and processing it into -something more than plain old text. These libraries forgo -HTML and define their -own markup syntax. BBCode, -Wikitext, -Markdown and -Textile are all examples of -such markup languages (although it should be noted that -Wikitext and Markdown can allow -HTML within them). -The benefits (to those who use them, anyway) are clear: simplicity and -security. -

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

Markup language	Sample
BBCode	`[b]B[/b] [i]i[/i] [url = http://www.example.com/]link[/url].`
Wikitext¹	`'''B''' ''i'' [http://www.example.com/ link]`
Markdown²	`B i [link](http://www.example.com/)`
Textile	`B _i_ "link":http://www.example.com/`
HTML	`<b>B</b> <i>i</i> <a href="http://www.example.com/">link</a>`
WYSIWYG	B i link

- -

Wikitext shown is modeled after MediaWiki style. - There are many variants of Wikitext currently extant.
Strictly speaking, the Markdown syntax is not equivalent: bold text - is expressed as <strong> and italicized text is - expressed as <em>. Most browser default stylesheets, - however, map those two semantic tags to the associated styling, so - many users assume that it really is italics (and use it improperly for, - say, book titles.)

- -

Simplicity

- -

HTML -source code is often criticized for being difficult to read. For example, -compare:

- -

-* Item 1
-* Item 2
-

- -

...versus:

- -

-<ul>
-    <li>Item 1</li>
-    <li>Item 2</li>
-</ul>
-

- -

Which would you prefer to edit? The answer seems obvious, but be careful -not to fall into the fallacy of false dilemma. -There is a third choice: the -WYSIWYG (rich text) -editor, which blows earlier choices out of the water in terms -of usability.

- -

Note that rich text editors and alternate markup syntaxes are not -mutually exclusive, but, when push comes to shove, it's easier -to implement this sort of editor on top of HTML than some obscure -markup language. And in the cases when it is done, you usually end up with -a live preview, not a true rich text editor.

- -

-
Now just wait a second, you may be saying, - WYSIWYG - editors aren't all that great. There are many good arguments - against these editors, and intelligent - people have written essays devoted to - criticizing WYSIWYG. - In addition to the usual arguments against said editors, the web poses - another limitation: no JavaScript means no - editor, and no editor means... (gasp) manually typing in code.
- -
Even the most dogmatic purist, however, should recognize that for all - its faults, prospective clients really want rich text editors. - There are steps you can take to mitigate the associated drawbacks of - these editors.
- -
It is often asserted that - WYSIWYG editors - encourage excessive presentational markup. As it turns out, - this is the case with any markup language that allows the smallest - iota of presentational tags, be it <font> or - [color=red]. - A good way to reduce this trouble is to simply eliminate the - dialogue boxes that allow users to change colors or fonts (which - usually have no legitimate use) and adopt a - WYSIWYM scheme, - allowing users to select contextually correct formatting styles - for segments of text.
-

- -

Simplicity is also a double-edged sword. The moment any remotely -complex markup is needed, these lightweight markup languages fail to -produce. Sure you can make '''this text bold''' with Wikitext, but that -infobox all rendered nicely in aqua blue will require a gaggle of -<div>s and CSS. -These languages face the same troubles as regular HTML -filters in that their whitelist is too restrictive (besides the fact that -their table markup is extraordinarily complex).

- -

Security

- -

BBCode can be boiled down to a wanna-be version of -HTML. I mean, replacing -the angled brackets with square brackets and omitting the occasional parameter -name? How much more unoriginal can you get? Somehow, I don't think BBCode -was meant to be readable. Wikipedia agrees:

- -

- BBCode was devised and put to use in order to provide a safer, easier - and more limited way of allowing users to format their messages. - Previously, many message boards allowed the users to include HTML, - which could be used to break/imitate parts of the layout, or run - JavaScript. Some implementations of BBCode have suffered problems related - to the way they translate the BBCode into HTML, which could negate the - security that was intended to be given by BBCode. -

- -

Or, put more simply:

- -

- BBCode came to life when developers where too - lazy to parse HTML correctly - and decided to invent their own markup language. As with all products of - laziness, the result is completely inconsistent, unstandardized, and - widely adopted. -

- -

Well, developers, the whole point of HTML Purifier is that I do the -work so you can just execute the ridiculously simple -$purifier->purify($html) call and go on to do, well, whatever -you developers do. :-P

- -

Conclusion

- -

These alternative markup languages have their shiny points, and HTML -Purifier is not meant to replace them. However, a major reason for -their existence has been called into question. Why are you -using these languages?

- -

HTML Tidy

- -

Dave Raggett's -HTML Tidy is a program; -neat enough, at least, to make it into PHP as a -PECL extension. -The premise is simple, the execution effective. Tidy is, in short, a great -tool.

- -

It is not, however, a filter. I am often surprised when people ask -me, What about Tidy? There's nothing against Tidy: Tidy tackles -a different problem set. Let's see what man tidy has to say:

- -

- Tidy reads HTML, XHTML and - XML files and writes cleaned up markup. For - HTML variants, it detects and corrects many common coding errors and - strives to produce visually equivalent markup that is both W3C compliant - and works on most browsers. A common use of Tidy is to convert plain HTML - to XHTML. -

- -

Hmm... why do I not see the words filter or -XSS in here? Perhaps it's -because Tidy accepts any valid -HTML. Including -script tags. Which leads us to our second part: Tidy parses -documents, not document fragments.

- -

This is not to say that I haven't seen Tidy be used in this sort of -fashion. MediaWiki, for instance, uses Tidy to clean up the final HTML -output before shuttling it off to the browser. The developers, nevertheless, -agree that this is only a band-aid solution, and that the real way -to fix it is to fix the parser. Tidy's great, but in terms of security, -it's not suitable for untrusted sources.

- -

Preface

- -

I've ordered my analyses according to how bad a library is. The worst -is first, and then we move up the spectrum. I will point out the most -flagrant problems with the libraries, but note that I will omit more -advanced vulnerabilities: if you can't catch an onmouseover -attribute, I really shouldn't reprimand you for letting non-SGML code -points through. The ideal solution, however, must do all these things.

- -

Note that besides strip_tags(), -most of the libraries are moderately effective against the most common XSS -attacks. None of them (save Safe HTML Checker) fare very well -in the standards-compliance department though.

- -

strip_tags()

- - - - - - - -

Whitelist	Yes, user-specified
Removes foreign tags	Buggy
Makes well-formed	No
Fixes nesting	No
Validates attributes	No

- -

The PHP function -strip_tags() is -the classic solution for attempting to clean up -HTML. It -is also the worst solution, and should be avoided like the plague. -The fact that it doesn't validate attributes at all means that anyone can -insert an onmouseover='xss();' and exploit your application.

- -

While -this can be band-aided with a series of regular expressions that strip out -on[event] attributes (though you would still be vulnerable to XSS and at the mercy of -quirky browser behavior), strip_tags() is fundamentally flawed and should not be -used. -

- -

PHP Input Filter

- -

Though its title may not imply it, -PHP Input Filter -is a souped-up version of strip_tags() with the ability to inspect -attributes. (Don't mind the hastily tacked-on query escaping function.)

- - - - - - - - - - - - -

Version	1.2.2
Last update	2005-10-05
License	GPL
Whitelist	Yes, user defined
Removes foreign tags	Yes
Makes well-formed	No
Fixes nesting	No
Validates attributes	Partial
XSS safe	Probably
Standards safe	No

- -

PHP Input Filter implements an -HTML parser, and -performs very basic checks on whether or not tags and attributes have -been defined in the whitelist as well as some -smarter XSS checks. It is left up to -the user to define what they'll permit.

- -

With absolutely no checking of well-formedness, it is trivially easy -to trick the filter into leaving unclosed tags lying around. While -standards-compliance may be viewed by some as merely a nice feature, -basic sanity checks like this must be implemented; otherwise a user -can mangle a website's layout.

- -

More troubles: Woe to -any user that allows the style attribute: you can't simply -let CSS through and expect your -layout not to be badly mutilated. To top things off, -the filter doesn't even preserve data properly: attributes have all -spaces stripped out of them. Stay away, stay away!

- -

HTML_Safe/SafeHTML

- -

HTML_Safe is -PEAR's HTML filtering library. -It should be noted that this is the same library as -SafeHTML, though with different -branding (and a different version number).

- - - - - - - - - - - - -

Version	0.9.9beta
Last update	2005-12-21
License	BSD (3 clause)
Whitelist	Mostly No
Removes foreign tags	Yes
Makes well-formed	Yes
Fixes nesting	No
Validates attributes	Partial
XSS safe	Probably
Standards safe	No

- -

HTML_Safe's mechanism of action involves parsing -HTML with a -SAX parser and performing -validation and filtering as the handlers are called. HTML_Safe does a lot -of things right, which is why I say it probably isn't vulnerable -to XSS, but its approach -is fundamentally flawed: blacklists.

- -

This library maintains arrays of dangerous tags, attributes and -CSS properties. (It also -has a blacklist of dangerous URI protocols, but this is -intelligently disabled by default in favor of a protocol whitelist.) -What this means is that HTML_Safe has no qualms about accepting input -like <foobar> Bang </foobar>. Anything goes except -the tags in those arrays. Scratch standards-compliance (and that was -without even considering proper nesting).

- -

For now, HTML_Safe might be safe from -XSS. -In the future, however, one of the infinitely many tags that HTML_Safe lets -through might just possibly be given special functionality by browser vendors. -And it might just turn out that this can be exploited. Any blacklist -solution puts you in a perpetual arms race against crackers who are constantly -discovering new and inventive ways to abuse tags and attributes that you -didn't blacklist.

- -

kses

- -

kses appears to -be the de facto solution for cleaning HTML, having found -its way into applications such as WordPress -and being the number one search result for php html filter.

- - - - - - - - - - - - -

Version	0.2.2
Last update	2005-02-06
License	GPL
Whitelist	Yes, user defined
Removes foreign tags	Yes
Makes well-formed	No
Fixes nesting	No
Validates attributes	Partial
XSS safe	Probably
Standards safe	No

- -

To be truthful, I didn't do as comprehensive a code survey for kses -as I did for some of the other libraries. Out of -all the classes I've reviewed so far, kses was definitely the hardest to -understand.

- -

kses's modus operandi is splitting up HTML with a monster regexp -and then validating each section with kses_split2(). It -suffers from the same problems as Input Filter: no well-formedness -checks leading to rampant runaway tags (and no standards-compliance). -WordPress, the primary user of kses today, had to implement its -own custom tag-balancing code to fix this problem: don't use this -library without some equivalent!

- -

Its whitelist syntax, however, is the most complex of all these libraries, -so I'm going to take some time to argue why this particular implementation -is bad. The author of this library was thoughtful enough to provide some -basic constraint checks on attributes like maxlen and maxval. Now, barring -the fact that there simply aren't enough checks, and the fact that they are -all lumped together in one function, we now must wonder whether or not -the user will go through the trouble of specifying the maximum length -of a title attribute.

- -

I have my opinions about inherent human laziness, but perhaps WordPress's -default filterset is the most telling example:

- -

-$allowedposttags = array (
-    /* formatted and trimmed */
-    'hr' => array (
-        'align' => array (),
-        'noshade' => array (),
-        'size' => array (),
-        'width' => array ()
-     )
-);
-

- -

Hmm... do I see a blatant lack of attribute constraints? Conclusion: -if the user can get away with not doing work, they will! The biggest -problem in all these whitelist filters is that they forgot to supply -the whitelist. The whitelist is just as important as the code that uses -the whitelist to filter HTML.

- -

Safe HTML Checker

- -

-Safe -HTML Checker is (to my knowledge) the first attempt to make a filter -that also outputs standards-compliant XHTML. It wasn't even released or -licensed officially, but we'll let that slide: a fourth-place -search result must have done something right.

- - - - - - - - - - - - -

Version	in-house
Last update	2003-09-15
License	undefined
Whitelist	Yes (bare-bones)
Removes foreign tags	Yes
Makes well-formed	Yes
Fixes nesting	Almost
Validates attributes	Partial
XSS safe	Yes
Standards safe	Almost

- -

Indeed, it is quite a well-written piece of code. It demonstrates -knowledge of inline versus block elements, thus very nearly getting -nesting correct (the only exception is an unimplemented SGML -exclusion for <a> tags, and that's easy to fix).

- -

Unfortunately, part of the reason why it works so well is that it's -extremely restrictive. No styling, no tables, very few attributes. -Perfectly appropriate for blog comments, but then again, there's always -BBCode. This probably means that Safe HTML Checker has a different -goal than HTML Purifier.

- -

The XML parser -is also quite strict. Accidentally missed a < sign? The parser will -complain with the cryptic message: -XHTML -is not well-formed. -The solution is not as simple as just switching to a more permissive -parser: Safe HTML Checker relies on the fact that the parser will have -matched up the tags for it.

- -

HTML Purifier

- - - - - - - - - - - - -

Version	1.6.0
Last update	2007-04-01
License	LGPL
Whitelist	Yes
Removes foreign tags	Yes
Makes well-formed	Yes
Fixes nesting	Yes
Validates attributes	Yes
XSS safe	Yes
Standards safe	Yes

- -

That table should say it all, but I'll add a few more features:

- - - - - - - - -

UTF-8 aware	Yes
Object-Oriented	Yes
Validates CSS	Yes
Tables	Yes
PHP 5 aware	Yes
E_STRICT compliant	Yes (use -strict)

- -

This is not to say that HTML Purifier doesn't have problems of its own. -It's a fairly nascent library (that doesn't mean it's buggy, though), it's big -(while the others usually fit in one file, this one requires a huge -include list), and it's missing -features. But even in its current state, -HTML Purifier is far better than the other libraries.

- -

So... what are you waiting for?

- -

- + + + + + Comparison - HTML Purifier + + + + + + + +

Comparison

+ +

With the advent of +Web 2.0, the end user has +gone from passive consumer to active producer of content on the World Wide +Web. Wikis, +Social Software and +Blogs all +put the user in control.

+ +

Give the user too much control, however, and you set yourself up +for XSS attacks. For this reason, +HTML's flexibility +has proven to be both a blessing and a curse, and the software that processes +it must strike a fine balance between security and usability. How do +we prevent users from injecting JavaScript or inserting malformed +HTML while allowing +a rich syntax of tags, attributes and CSS? How do we put +HTML inside +an RSS feed without worrying +about sloppy coding messing up XML parsing? +Almost every PHP +developer has come across this problem before, and many have tried +(albeit unsuccessfully) to solve it. We will analyze existing +libraries to demonstrate how they are ineffective and, of course, +how HTML Purifier solves all our problems and achieves +standards-compliance.

+ +

I will give no quarter and pull no punches: as of the time of writing, +no other library comes even close to solving the problem effectively +for richly formatted documents. But, nonetheless, there is a necessary +disclaimer:

+ +

+ This comparison document was written by the author of HTML Purifier, + and clearly is in favor of HTML Purifier. However, that doesn't + mean that it is biased: I have made every attempt to be factual and + fair, and I hope that you will agree, by the time you finish reading + this document, that HTML Purifier is the only satisfactory HTML + filter out there today. +

+ +

Summary

+ +

A table summarizing the differences for the impatient.

+ +

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +

Library	Version	Date	License	Whitelist	Removal	Well-formed	Nesting	Attributes	XSS safe	Standards safe
strip_tags	n/a	n/a	n/a	Yes (user)	Buggy	No	No	No	No	No
PHP Input Filter	1.2.2	2005-10-05	GPL	Yes (user)	Yes	No	No	Partial	Probably	No
HTML_Safe	0.9.9beta	2005-12-21	BSD (3)	Mostly No	Yes	Yes	No	Partial	Probably	No
kses	0.2.2	2005-02-06	GPL	Yes (user)	Yes	No	No	Partial	Probably	No
Safe HTML Checker	n/a	2003-09-15	n/a	Yes (bare)	Yes	Yes	Almost	Partial	Yes	Almost
HTML Purifier	1.6.0	2007-04-01	LGPL	Yes	Yes	Yes	Yes	Yes	Yes	Yes

+ +

HTML Tidy is omitted from this list because it is not an HTML +filter.

+ +

Look Ma, No HTML!

+ +

+
+ A clever person solves a problem. + A wise person avoids it. +
+
— Albert Einstein
+

+ +

Before we jump into the weird and not-so-wonderful world +of HTML filters, we must first consider another domain: non-HTML +markup libraries. While libraries of this type really shouldn't be +considered HTML filters, +they are the number one method of taking user input and processing it into +something more than plain old text. These libraries forgo +HTML and define their +own markup syntax. BBCode, +Wikitext, +Markdown and +Textile are all examples of +such markup languages (although it should be noted that +Wikitext and Markdown can allow +HTML within them). +The benefits (to those who use them, anyway) are clear: simplicity and +security. +

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +

Markup language	Sample
BBCode	`[b]B[/b] [i]i[/i] [url = http://www.example.com/]link[/url].`
Wikitext¹	`'''B''' ''i'' [http://www.example.com/ link]`
Markdown²	`B i [link](http://www.example.com/)`
Textile	`B _i_ "link":http://www.example.com/`
HTML	`<b>B</b> <i>i</i> <a href="http://www.example.com/">link</a>`
WYSIWYG	B i link

+ +

Wikitext shown is modeled after MediaWiki style. + There are many variants of Wikitext currently extant.
Strictly speaking, the Markdown syntax is not equivalent: bold text + is expressed as <strong> and italicized text is + expressed as <em>. Most browser default stylesheets, + however, map those two semantic tags to the associated styling, so + many users assume that it really is italics (and use it improperly for, + say, book titles.)

+ +

Simplicity

+ +

HTML +source code is often criticized for being difficult to read. For example, +compare:

+ +

+* Item 1
+* Item 2
+

+ +

...versus:

+ +

+<ul>
+    <li>Item 1</li>
+    <li>Item 2</li>
+</ul>
+

+ +

Which would you prefer to edit? The answer seems obvious, but be careful +not to fall into the fallacy of false dilemma. +There is a third choice: the +WYSIWYG (rich text) +editor, which blows earlier choices out of the water in terms +of usability.

+ +

Note that rich text editors and alternate markup syntaxes are not +mutually exclusive, but, when push comes to shove, it's easier +to implement this sort of editor on top of HTML than some obscure +markup language. And in the cases when it is done, you usually end up with +a live preview, not a true rich text editor.

+ +

+
Now just wait a second, you may be saying, + WYSIWYG + editors aren't all that great. There are many good arguments + against these editors, and intelligent + people have written essays devoted to + criticizing WYSIWYG. + In addition to the usual arguments against said editors, the web poses + another limitation: no JavaScript means no + editor, and no editor means... (gasp) manually typing in code.
+ +
Even the most dogmatic purist, however, should recognize that for all + its faults, prospective clients really want rich text editors. + There are steps you can take to mitigate the associated drawbacks of + these editors.
+ +
It is often asserted that + WYSIWYG editors + encourage excessive presentational markup. As it turns out, + this is the case with any markup language that allows the smallest + iota of presentational tags, be it <font> or + [color=red]. + A good way to reduce this trouble is to simply eliminate the + dialogue boxes that allow users to change colors or fonts (which + usually have no legitimate use) and adopt a + WYSIWYM scheme, + allowing users to select contextually correct formatting styles + for segments of text.
+

+ +

Simplicity is also a double-edged sword. The moment any remotely +complex markup is needed, these lightweight markup languages fail to +produce. Sure you can make '''this text bold''' with Wikitext, but that +infobox all rendered nicely in aqua blue will require a gaggle of +<div>s and CSS. +These languages face the same troubles as regular HTML +filters in that their whitelist is too restrictive (besides the fact that +their table markup is extraordinarily complex).

+ +

Security

+ +

BBCode can be boiled down to a wanna-be version of +HTML. I mean, replacing +the angled brackets with square brackets and omitting the occasional parameter +name? How much more unoriginal can you get? Somehow, I don't think BBCode +was meant to be readable. Wikipedia agrees:

+ +

+ BBCode was devised and put to use in order to provide a safer, easier + and more limited way of allowing users to format their messages. + Previously, many message boards allowed the users to include HTML, + which could be used to break/imitate parts of the layout, or run + JavaScript. Some implementations of BBCode have suffered problems related + to the way they translate the BBCode into HTML, which could negate the + security that was intended to be given by BBCode. +

+ +

Or, put more simply:

+ +

+ BBCode came to life when developers where too + lazy to parse HTML correctly + and decided to invent their own markup language. As with all products of + laziness, the result is completely inconsistent, unstandardized, and + widely adopted. +

+ +

Well, developers, the whole point of HTML Purifier is that I do the +work so you can just execute the ridiculously simple +$purifier->purify($html) call and go on to do, well, whatever +you developers do. :-P
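For readers who want to see that call in context, here is a minimal sketch. It assumes a reasonably recent HTML Purifier release whose classes are already loadable (for example through an autoloader); the wrapper name `purify_html()` is hypothetical and exists only so the snippet still runs, returning null, when the library is absent.

```php
<?php
// Hypothetical wrapper around the purify() call mentioned above.
// Assumption: HTML Purifier's classes can be autoloaded in this environment.
function purify_html($dirty)
{
    if (!class_exists('HTMLPurifier')) {
        return null; // library not available here
    }
    $config = HTMLPurifier_Config::createDefault();
    $purifier = new HTMLPurifier($config);
    return $purifier->purify($dirty);
}

$clean = purify_html('<b onmouseover="xss();">Bold</b>');
echo $clean === null ? "HTML Purifier is not installed\n" : $clean . "\n";
```

How the library gets loaded (Composer, an include file, or something else) depends on the installation and is deliberately left out of the sketch.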

+ +

Conclusion

+ +

These alternative markup languages have their shiny points, and HTML +Purifier is not meant to replace them. However, a major reason for +their existence has been called into question. Why are you +using these languages?

+ +

HTML Tidy

+ +

Dave Raggett's +HTML Tidy is a program; +neat enough, at least, to make it into PHP as a +PECL extension. +The premise is simple, the execution effective. Tidy is, in short, a great +tool.

+ +

It is not, however, a filter. I am often surprised when people ask +me, What about Tidy? There's nothing against Tidy: Tidy tackles +a different problem set. Let's see what man tidy has to say:

+ +

+ Tidy reads HTML, XHTML and + XML files and writes cleaned up markup. For + HTML variants, it detects and corrects many common coding errors and + strives to produce visually equivalent markup that is both W3C compliant + and works on most browsers. A common use of Tidy is to convert plain HTML + to XHTML. +

+ +

Hmm... why do I not see the words filter or +XSS in here? Perhaps it's +because Tidy accepts any valid +HTML. Including +script tags. Which leads us to our second part: Tidy parses +documents, not document fragments.

+ +

This is not to say that I haven't seen Tidy be used in this sort of +fashion. MediaWiki, for instance, uses Tidy to clean up the final HTML +output before shuttling it off to the browser. The developers, nevertheless, +agree that this is only a band-aid solution, and that the real way +to fix it is to fix the parser. Tidy's great, but in terms of security, +it's not suitable for untrusted sources.

+ +

Preface

+ +

I've ordered my analyses according to how bad a library is. The worst +is first, and then we move up the spectrum. I will point out the most +flagrant problems with the libraries, but note that I will omit more +advanced vulnerabilities: if you can't catch an onmouseover +attribute, I really shouldn't reprimand you for letting non-SGML code +points through. The ideal solution, however, must do all these things.

+ +

Note that besides strip_tags(), +most of the libraries are moderately effective against the most common XSS +attacks. None of them (save Safe HTML Checker) fare very well +in the standards-compliance department though.

+ +

strip_tags()

+ + + + + + + +

Whitelist	Yes, user-specified
Removes foreign tags	Buggy
Makes well-formed	No
Fixes nesting	No
Validates attributes	No

+ +

The PHP function +strip_tags() is +the classic solution for attempting to clean up +HTML. It +is also the worst solution, and should be avoided like the plague. +The fact that it doesn't validate attributes at all means that anyone can +insert an onmouseover='xss();' and exploit your application.

+ +

While +this can be band-aided with a series of regular expressions that strip out +on[event] attributes (though you would still be vulnerable to XSS and at the mercy of +quirky browser behavior), strip_tags() is fundamentally flawed and should not be +used. +
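To make the attribute problem concrete, here is a minimal sketch using PHP's built-in strip_tags() with a one-tag whitelist. The sample input and variable names are illustrative only.

```php
<?php
// Untrusted input: an allowed <b> tag carrying an event handler, plus a script.
$dirty = '<b onmouseover="xss();">bold</b><script>alert(1)</script>';

// The second argument is the whitelist of allowed tags.
$clean = strip_tags($dirty, '<b>');

// The <script> tags are removed (their text content stays behind), but the
// onmouseover attribute on the allowed <b> tag passes through untouched.
echo $clean, "\n"; // <b onmouseover="xss();">bold</b>alert(1)
```

The whitelist only ever looks at tag names, so no choice of allowed tags closes this hole.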

+ +

PHP Input Filter

+ +

Though its title may not imply it, +PHP Input Filter +is a souped-up version of strip_tags() with the ability to inspect +attributes. (Don't mind the hastily tacked-on query escaping function.)

+ + + + + + + + + + + + +

Version	1.2.2
Last update	2005-10-05
License	GPL
Whitelist	Yes, user defined
Removes foreign tags	Yes
Makes well-formed	No
Fixes nesting	No
Validates attributes	Partial
XSS safe	Probably
Standards safe	No

+ +

PHP Input Filter implements an +HTML parser, and +performs very basic checks on whether or not tags and attributes have +been defined in the whitelist as well as some +smarter XSS checks. It is left up to +the user to define what they'll permit.

+ +

With absolutely no checking of well-formedness, it is trivially easy +to trick the filter into leaving unclosed tags lying around. While +standards-compliance may be viewed by some as merely a nice feature, +basic sanity checks like this must be implemented; otherwise a user +can mangle a website's layout.
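As a rough illustration of the kind of sanity check meant here, the following sketch reports tags that are opened but never closed. The function `find_unclosed_tags()` is hypothetical and is not part of any library discussed in this document; it is deliberately naive (no comments, CDATA, or attribute edge cases) and only shows the stack-based idea.

```php
<?php
// Hypothetical helper: returns the names of tags left open in a fragment.
function find_unclosed_tags($html)
{
    $void = array('br', 'hr', 'img', 'input'); // elements with no closing tag
    $stack = array();
    preg_match_all('#<(/?)([a-z][a-z0-9]*)[^>]*>#i', $html, $matches, PREG_SET_ORDER);
    foreach ($matches as $m) {
        $name = strtolower($m[2]);
        if (in_array($name, $void, true)) {
            continue; // void elements never need closing
        }
        if ($m[1] === '') {
            $stack[] = $name;       // opening tag: remember it
        } elseif (end($stack) === $name) {
            array_pop($stack);      // closing tag matching the innermost open tag
        }
        // stray or mismatched closing tags are ignored in this sketch
    }
    return $stack;
}

print_r(find_unclosed_tags('<b>runaway bold<i>text</i>'));
```

A filter that actually repairs the markup has to go further and insert the missing closing tags, which is the tag-balancing step this library lacks.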

+ +

More troubles: Woe to +any user that allows the style attribute: you can't simply +let CSS through and expect your +layout not to be badly mutilated. To top things off, +the filter doesn't even preserve data properly: attributes have all +spaces stripped out of them. Stay away, stay away!

+ +

HTML_Safe/SafeHTML

+ +

HTML_Safe is +PEAR's HTML filtering library. +It should be noted that this is the same library as +SafeHTML, though with different +branding (and a different version number).

+ + + + + + + + + + + + +

Version	0.9.9beta
Last update	2005-12-21
License	BSD (3 clause)
Whitelist	Mostly No
Removes foreign tags	Yes
Makes well-formed	Yes
Fixes nesting	No
Validates attributes	Partial
XSS safe	Probably
Standards safe	No

+ +

HTML_Safe's mechanism of action involves parsing +HTML with a +SAX parser and performing +validation and filtering as the handlers are called. HTML_Safe does a lot +of things right, which is why I say it probably isn't vulnerable +to XSS, but its approach +is fundamentally flawed: blacklists.

+ +

This library maintains arrays of dangerous tags, attributes and +CSS properties. (It also +has a blacklist of dangerous URI protocols, but this is +intelligently disabled by default in favor of a protocol whitelist.) +What this means is that HTML_Safe has no qualms about accepting input +like <foobar> Bang </foobar>. Anything goes except +the tags in those arrays. Scratch standards-compliance (and that was +without even considering proper nesting).

+ +

For now, HTML_Safe might be safe from +XSS. +In the future, however, one of the infinitely many tags that HTML_Safe lets +through might just possibly be given special functionality by browser vendors. +And it might just turn out that this can be exploited. Any blacklist +solution puts you in a perpetual arms race against crackers who are constantly +discovering new and inventive ways to abuse tags and attributes that you +didn't blacklist.
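The difference between the two approaches fits in a few lines. In this sketch the tag lists and both function names are hypothetical and are not taken from HTML_Safe; the point is only how each policy treats a tag it has never heard of.

```php
<?php
// Hypothetical tag lists, for illustration only.
$blacklist = array('script', 'iframe', 'object', 'embed');
$whitelist = array('b', 'i', 'a', 'p');

// A blacklist filter accepts anything it has not been told to reject.
function allowed_by_blacklist($tag, array $blacklist)
{
    return !in_array(strtolower($tag), $blacklist, true);
}

// A whitelist filter rejects anything it has not been told to accept.
function allowed_by_whitelist($tag, array $whitelist)
{
    return in_array(strtolower($tag), $whitelist, true);
}

var_dump(allowed_by_blacklist('foobar', $blacklist)); // bool(true): slips through
var_dump(allowed_by_whitelist('foobar', $whitelist)); // bool(false): rejected
```

If a browser later gives an unknown tag special behavior, only the whitelist version is already protected.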

+ +

kses

kses appears to be the de facto solution for cleaning HTML, having found its way into applications such as WordPress and being the number one search result for php html filter.

Version	0.2.2
Last update	2005-02-06
License	GPL
Whitelist	Yes, user defined
Removes foreign tags	Yes
Makes well-formed	No
Fixes nesting	No
Validates attributes	Partial
XSS safe	Probably
Standards safe	No

To be truthful, I didn't do as comprehensive a code survey for kses as I did for some of the other libraries. Out of all the classes I've reviewed so far, kses was definitely the hardest to understand.

kses's modus operandi is splitting up HTML with a monster regexp and then validating each section with kses_split2(). It suffers from the same problems as Input Filter: no well-formedness checks, leading to rampant runaway tags (and no standards-compliance). WordPress, the primary user of kses today, had to implement its own custom tag-balancing code to fix this problem: don't use this library without some equivalent!
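For a rough idea of what such tag balancing involves, here is a stack-based sketch under a simplified tag grammar. It is not WordPress's implementation: balance_tags_sketch and its short list of void elements are hypothetical, and the sketch ignores comments and attribute values that contain a > character.

```php
<?php
// Illustrative sketch only; not WordPress's or kses's code.

/**
 * Close unclosed tags and drop stray closing tags, using a stack of open elements.
 */
function balance_tags_sketch($html)
{
    $void  = array('br', 'hr', 'img', 'input'); // elements that never take a closing tag
    $stack = array();                            // names of currently open elements
    $out   = '';

    // Split into tags and text, keeping the tags in the result.
    $parts = preg_split(
        '#(</?[a-zA-Z][^>]*>)#',
        $html,
        -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
    );

    foreach ($parts as $part) {
        if (!preg_match('#^<(/?)([a-zA-Z][a-zA-Z0-9]*)#', $part, $m)) {
            $out .= $part; // plain text
            continue;
        }
        $name = strtolower($m[2]);
        if ($m[1] === '') {
            // Opening tag: remember it unless it is a void element.
            if (!in_array($name, $void, true)) {
                $stack[] = $name;
            }
            $out .= $part;
        } elseif (in_array($name, $stack, true)) {
            // Closing tag with a matching open element:
            // close everything opened after it, then the element itself.
            do {
                $open = array_pop($stack);
                $out .= '</' . $open . '>';
            } while ($open !== $name);
        }
        // A closing tag with no matching open element is silently dropped.
    }

    // Close whatever is still open at the end of the input.
    while ($stack) {
        $out .= '</' . array_pop($stack) . '>';
    }
    return $out;
}

echo balance_tags_sketch('<b>bold <i>both</b> stray</i> text'), "\n";
```

Note that this only makes the output well-formed; it knows nothing about which elements may legally contain which, so it does not fix nesting in the standards sense.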

Its whitelist syntax, however, is the most complex of all these libraries, so I'm going to take some time to argue why this particular implementation is bad. The author of this library was thoughtful enough to provide some basic constraint checks on attributes, like maxlen and maxval. Now, setting aside the fact that there simply aren't enough checks, and the fact that they are all lumped together in one function, we must wonder whether or not the user will go through the trouble of specifying the maximum length of a title attribute.

I have my opinions about inherent human laziness, but perhaps WordPress's default filterset is the most telling example:

$allowedposttags = array (
    /* formatted and trimmed */
    'hr' => array (
        'align' => array (),
        'noshade' => array (),
        'size' => array (),
        'width' => array ()
    )
);

Hmm... do I see a blatant lack of attribute constraints? Conclusion: if the user can get away with not doing work, they will! The biggest problem in all these whitelist filters is that they forgot to supply the whitelist. The whitelist is just as important as the code that uses the whitelist to filter HTML.
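One way to read this conclusion is that the library, not its user, should ship the constraints. The sketch below assumes a hypothetical rule table in which every allowed attribute carries its own validation pattern; $rules, validate_attribute and the regular expressions are illustrative only and do not come from kses, WordPress or HTML Purifier.

```php
<?php
// Illustrative sketch only; the rule table and function name are hypothetical.

// Each allowed attribute of <hr> maps to a pattern that its value must match.
$rules = array(
    'hr' => array(
        'align' => '/^(left|center|right)\z/i',
        'size'  => '/^[0-9]+\z/',   // pixels
        'width' => '/^[0-9]+%?\z/', // pixels or a percentage
    ),
);

/**
 * Return true only if the attribute is whitelisted for the tag
 * and its value satisfies the shipped constraint.
 */
function validate_attribute($rules, $tag, $attr, $value)
{
    if (!isset($rules[$tag][$attr])) {
        return false; // unknown tag or attribute: reject by default
    }
    return preg_match($rules[$tag][$attr], $value) === 1;
}

var_dump(validate_attribute($rules, 'hr', 'width', '50%'));           // bool(true)
var_dump(validate_attribute($rules, 'hr', 'width', 'expression(1)')); // bool(false)
var_dump(validate_attribute($rules, 'hr', 'onclick', 'alert(1)'));    // bool(false)
```

With a table like this supplied by default, an empty array () is no longer an option, so the lazy path and the safe path are the same.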

+ +

Safe HTML Checker

Safe HTML Checker is (to my knowledge) the first attempt to make a filter that also outputs standards-compliant XHTML. It wasn't even released or licensed officially, but we'll let that slide: a 4th-place search result must have done something right.

Version	in-house
Last update	2003-09-15
License	undefined
Whitelist	Yes (bare-bones)
Removes foreign tags	Yes
Makes well-formed	Yes
Fixes nesting	Almost
Validates attributes	Partial
XSS safe	Yes
Standards safe	Almost

Indeed, it is quite a well-written piece of code. It demonstrates knowledge of inline versus block elements, thus almost getting nesting correct (the only exception is an unimplemented SGML exclusion for <a> tags, and that's easy to fix).

Unfortunately, part of the reason why it works so well is that it's extremely restrictive. No styling, no tables, very few attributes. Perfectly appropriate for blog comments, but then again, there's always BBCode. This probably means that Safe HTML Checker has a different goal than HTML Purifier.

The XML parser is also quite strict. Accidentally missed a < sign? The parser will complain with the cryptic message: XHTML is not well-formed. The solution is not as simple as just switching to a more permissive parser: Safe HTML Checker relies on the fact that the parser will have matched up the tags for it.

HTML Purifier

+ + + + + + + + + + + + +

Version	1.6.0
Last update	2007-04-01
License	LGPL
Whitelist	Yes
Removes foreign tags	Yes
Makes well-formed	Yes
Fixes nesting	Yes
Validates attributes	Yes
XSS safe	Yes
Standards safe	Yes

+ +

That table should say it all, but I'll add a few more features:

+ + + + + + + + +

UTF-8 aware	Yes
Object-Oriented	Yes
Validates CSS	Yes
Tables	Yes
PHP 5 aware	Yes
E_STRICT compliant	Yes (use -strict)

This is not to say that HTML Purifier doesn't have problems of its own. It's a fairly nascent library (that doesn't mean it's buggy, though), it's big (while the others usually fit in one file, this one requires a huge include list), and it's missing features. But even in its current state, HTML Purifier is far better than the other libraries.

So... what are you waiting for?

\ No newline at end of file
diff --git a/index.xhtml b/index.xhtml
index ca2b448..9e98892 100644
--- a/index.xhtml
+++ b/index.xhtml
@@ -1,451 +1,451 @@
HTML Purifier - Filter your HTML the standards-compliant way!

HTML Purifier is a standards-compliant HTML filter library written in PHP. HTML Purifier will not only remove all malicious code (better known as XSS) with a thoroughly audited, secure yet permissive whitelist, it will also make sure your documents are standards compliant, something only achievable with a comprehensive knowledge of W3C's specifications. Tired of using BBCode due to the current landscape of deficient or insecure HTML filters? Have a WYSIWYG editor but never been able to use it? Looking for high-quality, standards-compliant, open-source components for that application you're building? HTML Purifier is for you!

- -

-
- I'd just like to say we use HTML Purifier in IRIS for - filtering emails against XSS attacks and we've been more than impressed. -
-
— Chris Corbyn, Senior IRIS Developer
-

- -

Background

There are a number of open-source HTML filtering solutions out there on the web already (e.g. PEAR's HTML_Safe, kses and SafeHtmlChecker.class.php). What sets HTML Purifier apart from them? Aren't all of these choices secure?

When it comes to HTML, attention to detail is key. Does the library demonstrate an in-depth knowledge of the DTD that defines HTML? Does it perform its filtering off a robust whitelist rather than a usually outdated blacklist? Does it take the care to check every single attribute in the document for validity? Does it actually understand tag markup, or pay lip service with a series of deficient regexes and str_replace calls?

Somewhere along the way, all of HTML Purifier's predecessors fall flat. HTML_Safe dooms itself to attacks of the future by using a blacklist. Configurable filters like kses and PHP Input Filter still cannot validate the contents inside attributes. With all these gaps in coverage, none of the usual libraries come close to achieving standards-compliance. There is a user-unfriendly, draconian XML-based filter called Safe HTML Checker, but even it forgets that <a> tags cannot be nested within each other!

Know thy enemy. Wily hackers have a huge arsenal of XSS hidden within the depths of the HTML specification. HTML Purifier takes its effectiveness from the fact that it will decompose the whole document into tokens, and rigorously process the tokens by removing non-whitelisted elements, transforming bad practice tags like font into span, properly checking the nesting of tags and their children and validating all attributes according to their RFCs. HTML Purifier's comprehensive algorithms are complemented by a breadth of knowledge, ensuring that richly formatted documents pass through unstripped.

To my knowledge, there is nothing else in the wild that offers protection from XSS, standards-compliance, and the corrective processing of poorly formed HTML simultaneously. Don't take my word for it though: do your research. Investigate the other libraries, and decide for yourself who you would prefer to be the gatekeeper to your system.

To find out more, you can read the Comparison for a play-by-play analysis of the major filter libraries currently out there.

-
- [Y]ou save my day by allowing me not to write another damned HTML parser. -
-
- — Joseph Halter, Technical Director at Akira Web -
-

- - -

News

- -

HTML Purifier 1.6.0 released

Sun, 01 April 2007 23:40:59 EDT

- -

Sorry, no April Fool's joke this year. To compensate, we have - the 1.6.0 Long Overdue release. This version contains support - for a number of deprecated attributes HTML Purifier should have - had from the very beginning, including the name, bgcolor, border, - width and height attributes. The CSS property 'height', - rel and rev attributes and ID blacklist regexps are also available. - In addition, HTML Purifier will give a friendly error message - when you try to enable an element or attribute that doesn't exist.

- -

All in all, this is a fairly compact release, but it does - address some common requests brought up in the Forums, so I suggest - you upgrade anyway. You can check News - for a complete changelog, but there's not much else.

- -

A note to you distributors

Wed, 28 March 2007 21:05:12 EDT

- -

Yes, TikiWiki and PHProjekt, - I'm looking at you. I am absolutely delighted that these two fairly - popular and robust open-source projects are using my library. - However, I am not at all pleased at the fact that you have not - been keeping up to date with HTML Purifier releases.

TikiWiki: 1.3.0
PHProjekt: 1.3.2

I entreat ye, please sign up for the announcement list and keep my library up-to-date! It's not difficult, I keep backwards compatibility, and it makes your users happy! Especially that DOM XML bug, which it seems was far more serious than I originally thought it was. That is all.

- -

PEAR channel available

Sat, 24 March 2007 20:27:42 EDT

- -

At the prompting of Lars Olesen, HTML Purifier now - has its very own PEAR channel. This means that - installing HTML Purifier is as simple as:

pear channel-discover hp.jpsband.org
pear install hp/HTMLPurifier

- -

HTML Purifier 1.5.0 released

Fri, 23 March 2007 22:42:12 EDT

- -

The 1.5.0 major bugfix - release is available today. There have been some major internal - refactoring efforts, but these changes are invisible to you.

- -

Intrepid souls wanting to test out the new HTMLModuleManager class can check out the HTMLModules. Also, I will personally assist anyone who has modified HTMLDefinition.php. If you have patched any files, please consult the Support forums before upgrading.

- -

And now, the goodies:

- -

XHTML 1.1-style modularization of - HTMLDefinition. Instead of one monster, - huge HTMLDefinition class, the file has been - partitioned into modular bits organized into categories - like Hypertext, Lists and Tables. The - design of these modules makes it possible to arbitrarily - add your own elements without ever having to patch a core - file. However, the interface is unintuitive, not - documented, and definitely going to change. Keep your eyes - on this one.
Rudimentary internationalization system implemented. It's - not used yet, but will become the foundation of a projected - error reporting feature HTML Purifier will be getting soon.
x subtag now allowed in language codes.
Buggy chameleon support for ins and del - fixed.
Element by element AllowedAttribute declaration now possible - for global attributes. Instead of *.class, you can write - span.class (the old syntax still works, and enables - the attribute for all elements).
Fatal error when PHP4 DOM XML extension was loaded now fixed. Update: It seems that a lot of users run into this problem, as I know of at least five cases. Upgrade to 1.5.0 and it will be fixed, I promise!
YouTube filter regexp now multiline.

...as well as an assortment of code refactoring (all bugfixes are covered above). See News for a complete changelog.

- - -

RSS feed!

Sat, 17 March 2007 5:42:12 EDT

- -

We have a shiny new RSS feed - at news.rss, which is hooked up to this - news feed. Subscribe for release notifications as well as random - news about HTML Purifier.

- -

Plugins

- -

HTML Purifier is a great library to integrate with existing -CMSes and other applications or WYSIWYG -editors. Currently, we have plugins for:

- -

Drupal HTML Purifier Module (beta) by Bart Jansens
MODx Content Management System

- -

-
- This plugin is on top of my favorite list[.] I am going to heavily - depend on it since my clients insist on having WYSIWYG and I insist on - having pages that validate and are semantically sound. -
-
- — David Molliere, MODx Marketing & Design Team -
-

- -

Plugins for other major applications gladly accepted!

- - -

Demo

- -

Enter your HTML and see how it will be filtered!

- - -

...or try these sample inputs:

- -

Download

The current version is 1.6.0. Pick your distribution:

The PHP5-strict version is exactly the same as the regular version with a few tweaks to prevent it from complaining with E_STRICT warnings. This library is open-source, licensed under the LGPL v2.1+.

HTML Purifier is also available as a PEAR package. You can install it by executing:

pear channel-discover hp.jpsband.org
pear install hp/HTMLPurifier

You can also grab the latest development code from our Subversion repository. Simply execute this command:

svn co http://hp.jpsband.org/svnroot/htmlpurifier/trunk ./

...or browse anonymously at that address. Previous releases can be obtained by browsing the release directory or checking code out of the tags/ directory.

SHA-1 checksums:

088569ae55d99bdbbee6031215ecc26f60489b70 htmlpurifier-1.6.0-strict.tar.gz
3deb033d6b20c22e7883cf2f7f719605fe6dd161 htmlpurifier-1.6.0-strict.zip
b4eed7787b84b7a86b24beaa5394616600780ceb htmlpurifier-1.6.0.tar.gz
3e375e83bc782e031362ce49c559e0d4f2511b6f htmlpurifier-1.6.0.zip
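A published checksum only helps if it is actually compared against the downloaded file. The sketch below does the comparison in PHP with the built-in sha1_file() function. The helper name checksum_matches and the local file path are assumptions; the expected hash is the one listed above for htmlpurifier-1.6.0.tar.gz.

```php
<?php
// Illustrative sketch; checksum_matches() is a hypothetical helper name.

/**
 * Compare a file's SHA-1 digest with an expected hexadecimal checksum.
 */
function checksum_matches($path, $expected)
{
    if (!is_readable($path)) {
        return false; // a missing or unreadable file never matches
    }
    // sha1_file() returns lowercase hexadecimal, so normalise the expected value.
    return sha1_file($path) === strtolower($expected);
}

$file     = 'htmlpurifier-1.6.0.tar.gz'; // assumed to sit in the current directory
$expected = 'b4eed7787b84b7a86b24beaa5394616600780ceb';

if (is_readable($file)) {
    echo checksum_matches($file, $expected) ? "OK\n" : "MISMATCH\n";
} else {
    echo "File not found: $file\n";
}
```

A matching checksum shows the download was not corrupted in transit; it says nothing about who produced the file, which is what the signature files are for.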

There are also .sig files which you can use to cryptographically verify that the release is from me, Edward Z. Yang. You can find my public key here (0x869C48DA). My key's fingerprint is: 3FA8 E9A9 7385 B691 A6FC B3CB A933 BE7D 869C 48DA.

Verify with this command:

gpg --verify $filename.sig

You can be notified of new releases by a low-traffic announce list. Subscribe here:

Resources

End-User - Documentation — In-depth documents on how to get - the most out of HTML Purifier.
Mantis Bugtracker — Found a bug? Report - it here!
Support Forum — Talk about all things - HTML Purifier.
Print - Definition — If you want to actually see what HTML Purifier's - filtering rules are, look no further than to this page. You can even - experiment with the configuration to see how things respond to different - directives.
XSS - Attacks Smoketest — Tests how well HTML Purifier fares - against RSnake's famous cheatsheet of XSS attacks.
Roadmap - — Subject to lots of delays, but it's a glimpse of the future
Artwork - — Extra media goodies.
Configuration - documentation — See the INSTALL document on how to - configure your HTML Purifier installation.
Doxygen-generated - Documentation — No class left undocumented! Cross-referenced - code! A must-read for any prospective HTML Purifier hacker. - (close by, PHPDoc-generated - Documentation.)

- -

Spread the Word!

- -

Help spread awareness about HTML Purifier by:

- -

Bookmarking this website on your del.icio.us account, and/or

Including this little label on your website: -

, with this code: -

<a href="http://hp.jpsband.org/"><img
src="http://hp.jpsband.org/live/art/powered.png"
alt="Powered by HTML Purifier" border="0" /></a>

- -

Contact

- -

You can send me an email at htmlpurifier@jpsband.org. However, I prefer that you use the forums for asking general support questions (response time will be the same, I promise!). Any emails I receive will be considered public: if I think a solution I thought up to help you would be particularly useful to others, expect it to show up on the website.

- -

-- 2.11.4.GIT

Comparison

Summary

Look Ma, No HTML!

Simplicity

Security

Conclusion

HTML Tidy

Preface

striptags()

PHP Input Filter

HTML_Safe/SafeHTML

kses

Safe HTML Checker

HTML Purifier

HTML Purifier

Navigation

Background

News

HTML Purifier 1.6.0 released

A note to you distributors

PEAR channel available

HTML Purifier 1.5.0 released

RSS feed!

Plugins

Demo

Download

Resources

Spread the Word!

Contact
