From 38ceb5859917ee64625cc6a97aaca0a3a882377e Mon Sep 17 00:00:00 2001
From: "Edward Z. Yang"
Date: Sat, 17 Mar 2007 02:24:36 +0000
Subject: [PATCH] Add Acronymizer DOMFilter. index.xhtml edited accordingly.

diff --git a/index.xhtml b/index.xhtml
index 8050ade..a4f79e2 100644
--- a/index.xhtml
+++ b/index.xhtml
@@ -1,3 +1,4 @@
+
@@ -26,7 +27,7 @@
  • News
  • Plugins
  • Demo
  • -
  • Download
  • +
  • Download
  • Resources
  • Forum
  • Contact
  • @@ -37,24 +38,20 @@ Download HTML Purifier -

    HTML Purifier is a standards-compliant -HTML filter library -written in PHP. -HTML Purifier will not only remove all malicious code (better known as -XSS) with a -thoroughly audited, secure yet permissive -whitelist, it will also -make sure your documents are standards compliant, something -only achievable with a comprehensive knowledge of -W3C's specifications. -Tired of using BBCode due to the -current landscape of deficient or insecure HTML filters? Have a -WYSIWYG -editor but never been able to use it? Looking for high-quality, -standards-compliant, open-source components for that application you're -building? -HTML Purifier is for you!

    +

    HTML Purifier is a standards-compliant +HTML filter library written in +PHP. HTML Purifier will not only remove all malicious +code (better known as XSS) with a thoroughly audited, +secure yet permissive whitelist, +it will also make sure your documents are +standards compliant, something only achievable with a +comprehensive knowledge of W3C's specifications. +Tired of using BBCode due to the current landscape of deficient or +insecure HTML filters? Have a +WYSIWYG editor but never been able to use it? Looking +for high-quality, standards-compliant, open-source components for that +application you're building? HTML Purifier is for you!

    @@ -66,11 +63,9 @@ HTML Purifier is for you!

    Background

    -

    There are a number of open-source HTML filtering solutions out +

    There are a number of open-source HTML filtering solutions out there on the web already -(i.e. PEAR's +(i.e. PEAR's HTML_Safe, kses and @@ -78,47 +73,45 @@ and SafeHtmlChecker.class.php). What sets HTML Purifier apart from them? Aren't all of these choices "secure"?

    -

    When it comes to HTML, -attention to detail is key. Does the library demonstrate -an in-depth knowledge of the DTD that defines HTML? Does it perform its -filtering off a robust whitelist rather than a usually out-dated blacklist? -Does it go through the care to check every single attribute in the document -for validity? Does it actually understand tag markup, or pay lip-service -with a series of deficient regexes and str_replace's?

    - -

    Somewhere along the way, all of HTML Purifier's predecessors fall -flat. HTML_Safe dooms itself to attacks of the future by using a blacklist. -Configurable filters like kses and PHP Input Filter still cannot validate the -contents inside attributes. With all these gaps in coverage, none of the -usual libraries come close to achieving standards-compliance. -There is a user-unfriendly, draconic -XML-based filter -called Safe HTML Checker, but even it forgets that <a> tags -cannot be nested within each other!

    +

    When it comes to HTML, attention to +detail is key. Does the library demonstrate an in-depth +knowledge of the DTD that defines +HTML? Does it perform its filtering off a robust +whitelist rather than a usually out-dated blacklist? Does it go through +the care to check every single attribute in the document for validity? +Does it actually understand tag markup, or pay lip-service with a series +of deficient regexes and str_replace's?

    + +

    Somewhere along the way, all of HTML Purifier's predecessors fall +flat. HTML_Safe dooms itself to attacks of the future by using a +blacklist. Configurable filters like kses and PHP Input Filter still +cannot validate the contents inside attributes. With all these gaps in +coverage, none of the usual libraries come close to achieving +standards-compliance. There is a user-unfriendly, +draconic XML-based filter called Safe HTML Checker, +but even it forgets that <a> tags cannot be nested +within each other!

    Know thy enemy. Wily hackers have a huge arsenal of -XSS hidden within the depths -of the HTML -specification. HTML Purifier takes its effectiveness from the fact that it will -decompose the whole document into tokens, and rigorously process the tokens by -removing non-whitelisted elements, transforming bad practice tags like font -into span, properly checking the nesting of tags and their children and -validating all attributes according to their RFCs. HTML Purifier's comprehensive -algorithms are complemented by a breadth of knowledge, -ensuring that richly formatted documents pass through unstripped.

    +XSS hidden within the depths of the +HTML specification. HTML Purifier takes its +effectiveness from the fact that it will decompose the whole document +into tokens, and rigorously process the tokens by removing +non-whitelisted elements, transforming bad practice tags like font into +span, properly checking the nesting of tags and their children and +validating all attributes according to their RFCs. +HTML Purifier's comprehensive algorithms are complemented by a +breadth of knowledge, ensuring that richly formatted +documents pass through unstripped.

    Compare HTML Purifier with other filters -

    To my knowledge, there is nothing else in the wild that offers -protection from XSS, standards-compliance, and the corrective -processing of poorly formed HTML simultaneously. Don't take my word -for it though: -do your research. Investigate the -other libraries, and decide for yourself who you would prefer to be the -gatekeeper to your system.

    +

    To my knowledge, there is nothing else in the wild that offers +protection from XSS, standards-compliance, and the corrective processing +of poorly formed HTML simultaneously. Don't take my word for it though: +do your research. Investigate the other libraries, and decide for +yourself who you would prefer to be the gatekeeper to +your system.

    To find out more, you can read the Comparison
diff --git a/xhtml-compiler/XHTMLCompiler/DOMFilter.php b/xhtml-compiler/XHTMLCompiler/DOMFilter.php
index 3950d7e..fa4ee0c 100644
--- a/xhtml-compiler/XHTMLCompiler/DOMFilter.php
+++ b/xhtml-compiler/XHTMLCompiler/DOMFilter.php
@@ -15,6 +15,36 @@ abstract class XHTMLCompiler_DOMFilter extends XHTMLCompiler_Filter
      */
     abstract public function process(DOMDocument $dom, $page);
 
+    /**
+     * Performs common initialization of DOM and XPath
+     */
+    protected function setup($dom) {
+        $this->dom = $dom;
+        $this->xpath = new DOMXPath($dom);
+        $this->xpath->registerNamespace('html', "http://www.w3.org/1999/xhtml");
+    }
+
+    /**
+     * XPath object for the current DOM
+     */
+    protected $xpath;
+
+    /**
+     * Current DOMDocument
+     */
+    protected $dom;
+
+    /**
+     * Querys a DOM with an XPath expression
+     * @param $expr XPath expression to evaluate
+     * @param $context Context node
+     */
+    protected function query($expr, $context = false) {
+        if (!$this->dom) throw new Exception('Filter must be setup before using convenience functions');
+        if (!$context) return $this->xpath->query($expr);
+        return $this->xpath->query($expr, $context);
+    }
+
 }
 
 ?>
\ No newline at end of file
diff --git a/xhtml-compiler/XHTMLCompiler/DOMFilter/Acronymizer.php b/xhtml-compiler/XHTMLCompiler/DOMFilter/Acronymizer.php
new file mode 100644
index 0000000..5a7ab79
--- /dev/null
+++ b/xhtml-compiler/XHTMLCompiler/DOMFilter/Acronymizer.php
@@ -0,0 +1,46 @@
+        'PHP: HyperText Preprocessor',
+        'HTML' => 'HyperText Markup Language',
+        'XHTML' => 'eXtensible HyperText Markup Language',
+        'XSS' => 'Cross-Site Scripting',
+        'W3C' => 'World Wide Web Consortium',
+        'WYSIWYG' => 'What You See Is What You Get',
+        'WYSIWYM' => 'What You See Is What You Mean',
+        'PEAR' => 'PHP Extension and Application Repository',
+        'DTD' => 'Document Type Definition',
+        'XML' => 'eXtensible Markup Language',
+        'RFC' => 'Request for Comment',
+    );
+
+    public function process(DOMDocument $dom, $page) {
+        $this->setup($dom);
+        $nodes = $this->query("//html:acronym[not(@title)]");
+        foreach ($nodes as $node) {
+            $acronym = $node->textContent;
+            if (!isset($this->acronyms[$acronym])) {
+                trigger_error(htmlspecialchars($acronym) . ' is not a recognized acronym (missing title attribute)');
+                continue;
+            }
+            $node->setAttribute('title', $this->acronyms[$acronym]);
+        }
+    }
+
+}
+
+?>
\ No newline at end of file
diff --git a/xhtml-compiler/XHTMLCompiler/DOMFilter/GenerateTableOfContents.php b/xhtml-compiler/XHTMLCompiler/DOMFilter/GenerateTableOfContents.php
index 8d7d4cb..9682355 100644
--- a/xhtml-compiler/XHTMLCompiler/DOMFilter/GenerateTableOfContents.php
+++ b/xhtml-compiler/XHTMLCompiler/DOMFilter/GenerateTableOfContents.php
@@ -11,12 +11,10 @@ class XHTMLCompiler_DOMFilter_GenerateTableOfContents extends XHTMLCompiler_DOMF
 
     public function process(DOMDocument $dom, $page) {
-        // setup xpath, this can be factored out
-        $xpath = new DOMXPath($dom);
-        $xpath->registerNamespace('html', "http://www.w3.org/1999/xhtml");
+        $this->setup($dom);
 
         // test for ToC container, if not present don't bother
-        $container = $xpath->query("//html:div[@id='toc']")->item(0);
+        $container = $this->query("//html:div[@id='toc']")->item(0);
         if (!$container) return;
 
         // grab all headings h2 and down from the document
@@ -24,7 +22,7 @@ class XHTMLCompiler_DOMFilter_GenerateTableOfContents extends XHTMLCompiler_DOMF
         foreach ($headings as $k => $v) $headings[$k] = "self::html:$v";
         $query_headings = implode(' or ', $headings);
         $query = "//*[$query_headings]"; // looks like "//*[self::html:h2 or ...]"
-        $headings = $xpath->query($query);
+        $headings = $this->query($query);
 
         // setup the table of contents element
         $toc = $dom->createElement('ul');
diff --git a/xhtml-compiler/XHTMLCompiler/FilterManager.php b/xhtml-compiler/XHTMLCompiler/FilterManager.php
index 1b8ba76..cfe687d 100644
--- a/xhtml-compiler/XHTMLCompiler/FilterManager.php
+++ b/xhtml-compiler/XHTMLCompiler/FilterManager.php
@@ -94,6 +94,7 @@ class XHTMLCompiler_FilterManager
         $dom->preserveWhiteSpace = false;
         $dom->formatOutput = true;
         $dom->loadXML($text);
+        $dom->encoding = 'UTF-8';
         foreach ($this->DOMFilters as $filter) {
             $filter->process($dom, $page);
         }
@@ -101,6 +102,7 @@ class XHTMLCompiler_FilterManager
         foreach ($this->postTextFilters as $filter) {
             $text = $filter->process($text, $page);
         }
+        $text = str_replace(''."\n", '', $text);
         return $text;
     }
 
--
2.11.4.GIT