How to install HTML Purifier

HTML Purifier is designed to run out of the box, so actually using the
library is extremely easy.  (Although... if you were looking for a
step-by-step installation GUI, you've downloaded the wrong software!)

While the impatient can get going immediately with some of the sample
code at the bottom of this document, it's well worth reading this entire
document--most of the other documentation assumes that you are familiar
with its contents.

---------------------------------------------------------------------------
1.  Compatibility

HTML Purifier is PHP 5 only, and is actively tested from PHP 5.0.5 and
up. It has no core dependencies on other libraries. PHP 4 support was
deprecated on December 31, 2007 with HTML Purifier 3.0.0. HTML Purifier
is not compatible with zend.ze1_compatibility_mode.

These optional extensions can enhance the capabilities of HTML Purifier:

    * iconv  : Converts text to and from non-UTF-8 encodings
    * bcmath : Used for unit conversion and imagecrash protection
    * tidy   : Used for pretty-printing HTML

These optional libraries can enhance the capabilities of HTML Purifier:

    * CSSTidy : Clean CSS stylesheets using %Core.ExtractStyleBlocks
        Note: You should use the modernized fork of CSSTidy available
        at https://github.com/Cerdic/CSSTidy
    * Net_IDNA2 (PEAR) : IRI support using %Core.EnableIDNA
        Note: This is not necessary for PHP 5.3 or later

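To see which of the optional extensions listed above are present in a
given PHP build, a quick check along these lines may help. This is only a
sketch: missing_optional_extensions() is a hypothetical helper written for
this document, not something HTML Purifier provides.

```php
<?php
// Optional PHP extensions named in the list above, with the feature each unlocks.
$optionalExtensions = array(
    'iconv'  => 'non-UTF-8 character encodings',
    'bcmath' => 'unit conversion and imagecrash protection',
    'tidy'   => 'pretty-printing HTML',
);

// Hypothetical helper (not part of HTML Purifier): returns the names of
// the given extensions that are not loaded in the current PHP build.
function missing_optional_extensions(array $extensions)
{
    $missing = array();
    foreach (array_keys($extensions) as $name) {
        if (!extension_loaded($name)) {
            $missing[] = $name;
        }
    }
    return $missing;
}

foreach (missing_optional_extensions($optionalExtensions) as $name) {
    echo "Optional extension '$name' is not loaded; ",
         $optionalExtensions[$name], " will be unavailable.\n";
}
```

Nothing is printed when all three extensions are available.
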
---------------------------------------------------------------------------
2.  Reconnaissance

A big plus of HTML Purifier is its inerrant support of standards, so
your web-pages should be standards-compliant.  (They should also use
semantic markup, but that's another issue altogether, one HTML Purifier
cannot fix without reading your mind.)

HTML Purifier can process these doctypes:

* XHTML 1.0 Transitional (default)
* XHTML 1.0 Strict
* HTML 4.01 Transitional
* HTML 4.01 Strict
* XHTML 1.1

...and these character encodings:

* UTF-8 (default)
* Any encoding iconv supports (with crippled internationalization support)

These defaults reflect what my choices would be if I were authoring an
HTML document; however, what you choose depends on the nature of your
codebase.  If you don't know what doctype you are using, you can determine
it from this identifier at the top of your source code:

    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
        "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

...and the character encoding from this code:

    <meta http-equiv="Content-type" content="text/html;charset=ENCODING">

If the character encoding declaration is missing, STOP NOW, and
read 'docs/enduser-utf8.html' (web accessible at
http://htmlpurifier.org/docs/enduser-utf8.html).  In fact, even if it is
present, read this document anyway, as many websites specify their
document's character encoding incorrectly.

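If you would rather not eyeball the page source, the two declarations
described above can be pulled out with a couple of regular expressions.
The following is a rough sketch: detect_doctype_and_charset() is a
hypothetical helper, it only recognizes the PUBLIC doctype form and the
meta tag form shown above, and it tells you what the page *claims*, which
(as noted) may not be what the page actually uses.

```php
<?php
// Hypothetical helper: pulls the doctype's public identifier and the declared
// charset out of a page's HTML source. Either value is null when absent.
function detect_doctype_and_charset($html)
{
    $doctype = null;
    $charset = null;

    // Matches e.g. <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" ...>
    if (preg_match('/<!DOCTYPE\s+html\s+PUBLIC\s+"([^"]+)"/i', $html, $m)) {
        $doctype = $m[1];
    }

    // Matches charset=... inside a <meta ...> tag.
    if (preg_match('/<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9._:-]+)/i', $html, $m)) {
        $charset = strtoupper($m[1]);
    }

    return array('doctype' => $doctype, 'charset' => $charset);
}

// Sample page used only for illustration.
$page = '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"' . "\n"
      . ' "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">' . "\n"
      . '<html><head>'
      . '<meta http-equiv="Content-type" content="text/html;charset=iso-8859-1">'
      . '</head><body></body></html>';

$info = detect_doctype_and_charset($page);
echo $info['doctype'], "\n";
echo $info['charset'], "\n";
```
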
---------------------------------------------------------------------------
3.  Including the library

The procedure is quite simple:

    require_once '/path/to/library/HTMLPurifier.auto.php';

This will set up an autoloader, so the library's files are only included
when they are needed.

Only the contents of the library/ folder are necessary, so you can remove
everything else when using HTML Purifier in a production environment.

If you installed HTML Purifier via PEAR, all you need to do is:

    require_once 'HTMLPurifier.auto.php';

Please note that the usual PEAR practice of including just the classes you
want will not work with HTML Purifier's autoloading scheme.

Advanced users, read on; other users can skip to section 4.

Autoload compatibility
----------------------

    HTML Purifier attempts to be as smart as possible when registering an
    autoloader, but there are some cases where you will need to change
    your own code to accommodate HTML Purifier. These are those cases:

    PHP VERSION IS LESS THAN 5.1.2, AND YOU'VE DEFINED __autoload
        Because spl_autoload_register() doesn't exist in early versions
        of PHP 5, HTML Purifier has no way of adding itself to the autoload
        stack. Modify your __autoload function to test
        HTMLPurifier_Bootstrap::autoload($class)

        For example, suppose your autoload function looks like this:

            function __autoload($class) {
                require str_replace('_', '/', $class) . '.php';
            }

        A modified version with HTML Purifier would look like this:

            function __autoload($class) {
                // Let HTML Purifier try first; fall through for other classes.
                if (HTMLPurifier_Bootstrap::autoload($class)) return true;
                require str_replace('_', '/', $class) . '.php';
            }

        Note that there *is* some custom behavior in our autoloader; the
        original autoloader in our example would work 99% of the time,
        but would fail when including language files.

    AN __autoload FUNCTION IS DECLARED AFTER OUR AUTOLOADER IS REGISTERED
        spl_autoload_register() has the curious behavior of disabling
        the existing __autoload() handler. Users need to explicitly
        call spl_autoload_register('__autoload'). Because we use SPL when
        it is available, __autoload() will ALWAYS be disabled. If
        __autoload() is declared before HTML Purifier is loaded, this is
        not a problem: HTML Purifier will register the function for you.
        But if it is declared afterwards, it will mysteriously not work.
        This snippet of code (after your autoloader is defined) will fix it:

            spl_autoload_register('__autoload');

    Users should also be on guard if they use a version of PHP previous
    to 5.1.2 without an autoloader--HTML Purifier will define __autoload()
    for you, which can collide with an autoloader that was added by *you*
    later.

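One way to sidestep the ordering problem described above is to give your
autoloader its own name and register it on the SPL stack yourself. The
sketch below assumes PHP 5.3.2 or later (for stream_resolve_include_path());
my_app_autoload() is a hypothetical stand-in for a legacy __autoload(). A
named function is also the safer choice on current PHP, since __autoload()
was deprecated in PHP 7.2 and removed in PHP 8.0.

```php
<?php
// Hypothetical application autoloader, standing in for a legacy __autoload().
// It maps Foo_Bar to Foo/Bar.php and gives up quietly if the file is absent,
// so that other autoloaders on the stack still get their turn.
function my_app_autoload($class)
{
    $file = str_replace('_', '/', $class) . '.php';
    $resolved = stream_resolve_include_path($file);
    if ($resolved !== false) {
        require $resolved;
    }
}

// ... HTMLPurifier.auto.php would be included somewhere before this point ...

// Register the application autoloader explicitly, unless it is already on
// the SPL stack. spl_autoload_functions() may return false on older PHP
// versions when no stack exists yet, hence the is_array() guard.
$registered = spl_autoload_functions();
if (!is_array($registered) || !in_array('my_app_autoload', $registered, true)) {
    spl_autoload_register('my_app_autoload');
}
```

Because the registration is explicit, it does not matter whether the
function is declared before or after HTML Purifier is loaded.
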
For better performance
----------------------

    Opcode caches, which greatly speed up PHP initialization for scripts
    with large amounts of code (HTML Purifier included), don't like
    autoloaders. We offer an include file that includes all of HTML Purifier's
    files in one go in an opcode cache friendly manner:

        // If /path/to/library isn't already in your include path, uncomment
        // the line below:
        // require '/path/to/library/HTMLPurifier.path.php';

        require 'HTMLPurifier.includes.php';

    Optional components still need to be included--you'll know if you try to
    use a feature and you get a "class doesn't exist" error! The autoloader
    can be used in conjunction with this approach to catch classes that are
    missing. Simply add this afterwards:

        require 'HTMLPurifier.autoload.php';

Standalone version
------------------

    HTML Purifier has a standalone distribution; you can also generate
    a standalone file from the full version by running the script
    maintenance/generate-standalone.php . The standalone version has the
    benefit of having most of its code in one file, so parsing is much
    faster and the library is easier to manage.

    If HTMLPurifier.standalone.php exists in the library directory, you
    can use it like this:

        require '/path/to/HTMLPurifier.standalone.php';

    This is equivalent to including HTMLPurifier.includes.php, except that
    the contents of standalone/ will be added to your path. To override this
    behavior, specify a new HTMLPURIFIER_PREFIX where standalone files can
    be found (usually, this will be one directory up, the "true" library
    directory in full distributions). Don't forget to set your path too!

    The autoloader can be added to the end to ensure the classes are
    loaded when necessary; otherwise you can manually include them.
    To use the autoloader, use this:

        require 'HTMLPurifier.autoload.php';

Explanation
-----------

    HTMLPurifier.auto.php performs a number of operations that can be done
    individually. These are:

        HTMLPurifier.path.php
            Puts /path/to/library in the include path. For high performance,
            this should be done in php.ini.

        HTMLPurifier.autoload.php
            Registers our autoload handler HTMLPurifier_Bootstrap::autoload($class).

    You can do these operations by yourself--in fact, you must modify your own
    autoload handler if you are using a version of PHP earlier than PHP 5.1.2
    (See "Autoload compatibility" above).

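If you prefer to do the include path step by hand at runtime instead of
in php.ini, something like the following should have the same effect. This
is a sketch of the idea rather than the actual contents of
HTMLPurifier.path.php, and '/path/to/library' is a placeholder.

```php
<?php
// Placeholder path; replace with the real location of the library/ folder.
$libraryDir = '/path/to/library';

// Prepend the directory unless it is already on the include path.
$paths = explode(PATH_SEPARATOR, get_include_path());
if (!in_array($libraryDir, $paths, true)) {
    set_include_path($libraryDir . PATH_SEPARATOR . get_include_path());
}

// Once the path is set, the remaining pieces can be included by bare name:
// require 'HTMLPurifier.autoload.php';
```
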
---------------------------------------------------------------------------
4.  Configuration

HTML Purifier is designed to run out-of-the-box, but occasionally HTML
Purifier needs to be told what to do.  If you answer no to any of these
questions, read on; otherwise, you can skip to the next section (or, if you're
into configuring things just for the heck of it, skip to 4.3).

* Am I using UTF-8?
* Am I using XHTML 1.0 Transitional?

If you answered no to any of these questions, instantiate a configuration
object:

    $config = HTMLPurifier_Config::createDefault();

4.1. Setting a different character encoding

You really shouldn't use any other encoding except UTF-8, especially if you
plan to support multilingual websites (read section 2 for more details).
However, switching to UTF-8 is not always immediately feasible, so other
encodings can be accommodated.

HTML Purifier uses iconv to support other character encodings, so any
encoding that iconv supports <http://www.gnu.org/software/libiconv/> can
be selected with this code:

    $config->set('Core.Encoding', /* put your encoding here */);

An example usage for Latin-1 websites (the most common encoding for English
websites):

    $config->set('Core.Encoding', 'ISO-8859-1');

Note that HTML Purifier's support for non-Unicode encodings is crippled by the
fact that any character not supported by that encoding will be silently
dropped, EVEN if it is ampersand escaped.  If you want to work around
this, you are welcome to read docs/enduser-utf8.html for a fix,
but please be cognizant of the issues the "solution" creates (for this
reason, I do not include the solution in this document).

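To get a feel for which input would be affected by this limitation, you
can test whether a piece of UTF-8 text survives a round trip through your
target encoding. The sketch below uses PHP's iconv() directly and assumes
the iconv extension is loaded; survives_encoding() is a hypothetical helper
and does not reproduce HTML Purifier's own conversion logic.

```php
<?php
// Hypothetical helper: true if $utf8Text can be converted to $encoding and
// back without losing or altering any character. Requires the iconv extension.
function survives_encoding($utf8Text, $encoding)
{
    // Without //IGNORE, iconv() normally fails on characters the target
    // encoding cannot represent; @ hides the resulting notice.
    $converted = @iconv('UTF-8', $encoding, $utf8Text);
    if ($converted === false) {
        return false;
    }
    // Some iconv implementations substitute a placeholder instead of
    // failing, so compare the round trip as well.
    $roundTrip = @iconv($encoding, 'UTF-8', $converted);
    return $roundTrip === $utf8Text;
}

$latin1Safe   = "caf\xC3\xA9";     // "cafe" with an acute accent, in UTF-8
$needsUnicode = "5 \xE2\x82\xAC";  // "5" plus a euro sign; ISO-8859-1 has no euro sign

var_dump(survives_encoding($latin1Safe, 'ISO-8859-1'));
var_dump(survives_encoding($needsUnicode, 'ISO-8859-1'));
```

Text that fails this check is the kind of input that will lose characters
when a non-Unicode Core.Encoding is configured.
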
4.2. Setting a different doctype

For those of you using HTML 4.01 Transitional, you can disable
XHTML output like this:

    $config->set('HTML.Doctype', 'HTML 4.01 Transitional');

Other supported doctypes include:

    * HTML 4.01 Strict
    * HTML 4.01 Transitional
    * XHTML 1.0 Strict
    * XHTML 1.0 Transitional
    * XHTML 1.1

4.3. Other settings

There are more configuration directives, which are documented
here: <http://htmlpurifier.org/live/configdoc/plain.html>  They're a bit boring,
but they can help out for those of you who like to exert maximum control over
your code.  Some of the more interesting ones are configurable at the
demo <http://htmlpurifier.org/demo.php> and are well worth looking into.

For example, you can fine tune allowed elements and attributes, convert
relative URLs to absolute ones, and even autoparagraph input text!  These
are, respectively, %HTML.Allowed, %URI.MakeAbsolute and %URI.Base, and
%AutoFormat.AutoParagraph.  The %Namespace.Directive naming convention
translates to:

    $config->set('Namespace.Directive', $value);

E.g.

    $config->set('HTML.Allowed', 'p,b,a[href],i');
    $config->set('URI.Base', 'http://www.example.com');
    $config->set('URI.MakeAbsolute', true);
    $config->set('AutoFormat.AutoParagraph', true);

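When the list of directives grows, it can be tidier to keep them in one
array and apply them in a loop. The following sketch relies only on the
$config->set() call shown above; apply_directives() is a hypothetical
helper, and the name check is a convenience of this example, not something
HTML Purifier requires.

```php
<?php
// Hypothetical helper: applies a map of 'Namespace.Directive' => value pairs
// to any object exposing set($key, $value), such as HTMLPurifier_Config.
function apply_directives($config, array $directives)
{
    foreach ($directives as $name => $value) {
        // Accept dotted names such as Namespace.Directive; reject anything else
        // early so that typos are caught before they reach the library.
        if (!preg_match('/^[A-Za-z0-9]+(\.[A-Za-z0-9]+)+$/', $name)) {
            throw new InvalidArgumentException("Malformed directive name: $name");
        }
        $config->set($name, $value);
    }
    return $config;
}

$directives = array(
    'HTML.Allowed'             => 'p,b,a[href],i',
    'URI.Base'                 => 'http://www.example.com',
    'URI.MakeAbsolute'         => true,
    'AutoFormat.AutoParagraph' => true,
);

// Only runs when the library has been loaded (see section 3).
if (class_exists('HTMLPurifier_Config')) {
    $config = apply_directives(HTMLPurifier_Config::createDefault(), $directives);
}
```
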
---------------------------------------------------------------------------
5.  Caching

HTML Purifier generates some cache files (generally one or two) to speed up
its execution. For maximum performance, make sure that
library/HTMLPurifier/DefinitionCache/Serializer is writable by the webserver.

If you are in the library/ folder of HTML Purifier, you can set the
appropriate permissions using:

    chmod -R 0755 HTMLPurifier/DefinitionCache/Serializer

If the above command doesn't work, you may need to assign write permissions
to all. This may be necessary if your webserver runs as nobody, but is
not recommended since it means any other user can write files in the
directory:

    chmod -R 0777 HTMLPurifier/DefinitionCache/Serializer

You can also chmod files via your FTP client; this option
is usually accessible by right clicking the corresponding directory and
then selecting "chmod" or "file permissions".

Starting with 2.0.1, HTML Purifier will generate friendly error messages
that will tell you exactly what you have to chmod the directory to; if in
doubt, follow their advice.

If you are unable or unwilling to give write permissions to the cache
directory, you can either disable the cache (and suffer a performance
hit):

    $config->set('Core.DefinitionCache', null);

Or move the cache directory somewhere else (no trailing slash):

    $config->set('Cache.SerializerPath', '/home/user/absolute/path');

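The choice between these options can also be made at runtime. The sketch
below picks the first writable directory from a list and falls back to
disabling the cache; first_writable_dir() is a hypothetical helper, the
paths are placeholders, and using the system temporary directory as a
fallback is an assumption of this example (on shared hosts it carries the
same "other users can write here" caveat as 0777).

```php
<?php
// Hypothetical helper: returns the first directory from $candidates that
// exists and is writable (without a trailing slash), or null if none is.
function first_writable_dir(array $candidates)
{
    foreach ($candidates as $dir) {
        $dir = rtrim($dir, '/\\');
        if ($dir !== '' && is_dir($dir) && is_writable($dir)) {
            return $dir;
        }
    }
    return null;
}

// Placeholder path; adjust to your installation.
$defaultCache = '/path/to/library/HTMLPurifier/DefinitionCache/Serializer';
$cachePath    = first_writable_dir(array($defaultCache, sys_get_temp_dir()));

// Only runs when the library has been loaded (see section 3).
if (class_exists('HTMLPurifier_Config')) {
    $config = HTMLPurifier_Config::createDefault();
    if ($cachePath === null) {
        // Nowhere to write: run without the cache and accept the slowdown.
        $config->set('Core.DefinitionCache', null);
    } elseif ($cachePath !== $defaultCache) {
        // Default location is unusable: point the serializer cache elsewhere.
        $config->set('Cache.SerializerPath', $cachePath);
    }
}
```
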
---------------------------------------------------------------------------
6.  Using the code

The interface is mind-numbingly simple:

    $purifier = new HTMLPurifier($config);
    $clean_html = $purifier->purify($dirty_html);

That's it!  For more examples, check out docs/examples/ (they aren't very
different though).  Also, docs/enduser-slow.html gives advice on what to
do if HTML Purifier is slowing down your application.

---------------------------------------------------------------------------
7.  Quick install

First, make sure library/HTMLPurifier/DefinitionCache/Serializer is
writable by the webserver (see Section 5: Caching above for details).
If your website is in UTF-8 and XHTML Transitional, use this code:

    <?php
    require_once '/path/to/htmlpurifier/library/HTMLPurifier.auto.php';

    $config = HTMLPurifier_Config::createDefault();
    $purifier = new HTMLPurifier($config);
    $clean_html = $purifier->purify($dirty_html);

If your website is in a different encoding or doctype, use this code:

    <?php
    require_once '/path/to/htmlpurifier/library/HTMLPurifier.auto.php';

    $config = HTMLPurifier_Config::createDefault();
    $config->set('Core.Encoding', 'ISO-8859-1'); // replace with your encoding
    $config->set('HTML.Doctype', 'HTML 4.01 Transitional'); // replace with your doctype
    $purifier = new HTMLPurifier($config);

    $clean_html = $purifier->purify($dirty_html);