From fbb8fa2a15f565ca97e6d15371ce1234e716e2ab Mon Sep 17 00:00:00 2001
From: "Edward Z. Yang"
Date: Sun, 2 Mar 2008 06:14:08 +0000
Subject: [PATCH] Update Comparison with htmLawed.

git-svn-id: http://htmlpurifier.org/svnroot@1590 48356398-32a2-884e-a903-53898d9a118a
---
 comparison.xhtml                      | 131 +++++++++++++++++++++++++++++++++-
 xhtml-compiler/XHTMLCompiler/Page.php |   1 +
 2 files changed, 129 insertions(+), 3 deletions(-)

diff --git a/comparison.xhtml b/comparison.xhtml
index b475dfd..c916f0d 100644
--- a/comparison.xhtml
+++ b/comparison.xhtml
@@ -151,6 +151,20 @@
+ htmLawed
+ 1.0.2
+ 2008-02-13
+ GPL
+ Elements
+ Yes (user)
+ Yes (user)
+ No
+ Partial
+ Unlikely
+ No
+
+
+
 Safe HTML Checker n/a 2003-09-15
@@ -658,6 +672,114 @@ $allowedposttags = array (
 the whitelist to filter HTML.

+

htmLawed

+ +

+ htmLawed
+ is kses on steroids. After looking at HTML Purifier and deciding that it was
+ too slow for him, Santosh Patnaik went ahead and rewrote the kses engine
+ with more features. It is the only other filtering library currently available
+ that is being actively maintained. Very promising at first, but as I reviewed
+ it I became more and more dismayed.
+

+ + + + + + + + + + + + +
Version 1.0.2
Last update 2008-02-13
License GPL
Whitelist Only elements
Removes foreign tags Yes, user defined
Makes well-formed Yes, user defined
Fixes nesting No
Validates attributes Partial
XSS safe Unlikely, user defined
Standards safe No
+ +

+ The fact that htmLawed comes from kses makes for a fairly unintelligible
+ codebase and heavily procedural code, as well as a dramatically lower memory
+ footprint and faster execution. However, htmLawed also attempts to reach feature
+ parity with HTML Purifier, although it falls short in several areas.
+

+ +

+ Namely, standards compliance! htmLawed gets really close (dealing appropriately
+ with some of the higher profile cases touted by HTML Purifier), but still falls
+ flat in several areas. For example, it doesn't properly check the contents
+ of table tags:
+

+ +
Cell]]>
+ +

+ This is passed through unharmed, even though the content model of tables does
+ not permit table cells. As far as I can tell, fixing this completely (which
+ means also accounting for thead and ordering) is
+ not a trivial task, as kses was never built to inspect the children of a node.
+ The architecture of the library itself is deficient.
+

+ +

+ There are other cases in which dubious design decisions are visible. For example,
+ htmLawed is not XSS-safe out of the tin. This is
+ by
+ design (which, incidentally, I think is very bad). I believe a user
+ has to define a safe tag-set in elements, define all
+ bad attributes in deny_attribute (blacklist rears its ugly head
+ for attributes),
+ disallow comments (otherwise Internet Explorer users are vulnerable
+ to conditional comments), and, in all likelihood, do a few other things,
+ before they can open it up to Joe Random User. This is all
+ documented in the htmLawed documentation. But it's terribly fragmented
+ and misrepresented with gems like "the following increase security risks"
+ and listing script afterwards. Of course it increases security risks:
+ it makes your application insecure.
+

+ +

+ All this is enough to make any person ask, "So, is htmLawed a filter or a
+ tidy library?"
+

+ +

+ The answer is it is neither! Even in its most permissive mode, which allows a whole
+ host of JavaScript vectors, it "denies" the javascript: protocol.
+ You can't turn that off. And as we've mentioned before, it doesn't fix nesting
+ perfectly, and to add insult to injury, these features can be turned off.
+ It feels like the development attitude
+ has not been, "Deny everything, and then check each element carefully
+ for safety," but rather, "Hit the major points and hope no-one notices the
+ deficiencies." I mean, even the comparison
+ with HTML Purifier contains inaccuracies: no, we don't support HTML 5 (we
+ only experimentally support an HTML 5 parser, which is quite different
+ from the HTML 5 tag-set), and yes, we do have anti-spam
+ measures.
+

+ +

+ htmLawed is a flawed library. But not fatally, I don't think.
+ Here are some suggestions:
+

+ + + +

+ And probably more. But the above will suffice for now. Users, you may
+ be smarting for some better performance, but avoid this library for now.
+

+

Safe HTML Checker

@@ -729,15 +851,18 @@ $allowedposttags = array (
 Object-Oriented Yes
 Validates CSS Yes
 Tables Yes
- PHP 5 aware Yes
- E_STRICT compliant Yes (use -strict)
+ PHP 5 only Yes
+ E_STRICT compliant Yes
+ Can auto-paragraph Yes
+ Extensible Yes
+ Unit tested Yes

 This is not to say that HTML Purifier doesn't have problems of its own. It's big (while the others usually fit in one file, this one requires a huge include list), and it's missing
- features. But even in its current state,
+ features. But even with these deficiencies,
 HTML Purifier is far better than the other libraries.

diff --git a/xhtml-compiler/XHTMLCompiler/Page.php b/xhtml-compiler/XHTMLCompiler/Page.php
index 52b8783..a88363e 100644
--- a/xhtml-compiler/XHTMLCompiler/Page.php
+++ b/xhtml-compiler/XHTMLCompiler/Page.php
@@ -66,6 +66,7 @@ class XHTMLCompiler_Page
                 'Requested directory cannot be found; check your file path and try again.'
             );
         }
+        if ($dir[strlen($dir)-1] == '/') $dir = substr($dir, 0, -1);
         $allowed_dirs = $xc->getConf('allowed_dirs');
         $ok = false;
-- 
2.11.4.GIT