<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en"><head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<meta name="description" content="Tutorial for tweaking HTML Purifier's Tidy-like behavior." />
<link rel="stylesheet" type="text/css" href="style.css" />

<title>Tidy - HTML Purifier</title>

</head><body>

<h1>Tidy</h1>

<div id="filing">Filed under Development</div>
<div id="index">Return to the <a href="index.html">index</a>.</div>
<div id="home"><a href="http://htmlpurifier.org/">HTML Purifier</a> End-User Documentation</div>
<p>You've probably heard of HTML Tidy, Dave Raggett's little piece
of software that cleans up poorly written HTML. Let me say it straight
out:</p>

<p class="emphasis">This ain't HTML Tidy!</p>

<p>Rather, Tidy stands for a cool set of Tidy-inspired features in HTML Purifier
that allow users to submit deprecated elements and attributes and get
valid strict markup back. For example:</p>

<pre>&lt;center&gt;Centered&lt;/center&gt;</pre>

<p>...becomes:</p>

<pre>&lt;div style=&quot;text-align:center;&quot;&gt;Centered&lt;/div&gt;</pre>

<p>...when this particular fix is run on the HTML. This tutorial will give
you the lowdown on exactly what HTML Purifier will do when Tidy
is on, and how to fine-tune this behavior. Once again, <strong>you do
not need Tidy installed on your PHP to use these features!</strong></p>
<h2>What does it do?</h2>

<p>Tidy will do several things to your HTML:</p>

<ul>
    <li>Convert deprecated elements and attributes to standards-compliant
    alternatives</li>
    <li>Enforce XHTML compatibility guidelines and other best practices</li>
    <li>Preserve data that would normally be removed to comply with the
    W3C specifications</li>
</ul>
<h2>What are levels?</h2>

<p>Levels describe how aggressive the Tidy module should be when
cleaning up HTML. There are four levels to pick from: none, light, medium
and heavy. Each of these levels has a well-defined set of behaviors
associated with it, although the exact set may change depending on your doctype.</p>

<dl>
    <dt>none</dt>
    <dd>No fixes are applied: all Tidy transformations are shut off.</dd>
    <dt>light</dt>
    <dd>This is the <strong>lenient</strong> level. If a tag or attribute
    is about to be removed because it isn't supported by the
    doctype, Tidy will step in and change it into an alternative that
    is supported.</dd>
    <dt>medium</dt>
    <dd>This is the <strong>correctional</strong> level. At this level,
    all the functions of light are performed, as well as some extra,
    non-essential best practices enforcement. Changes made on this
    level are very benign and are unlikely to cause problems.</dd>
    <dt>heavy</dt>
    <dd>This is the <strong>aggressive</strong> level. If a tag or
    attribute is deprecated, it will be converted into a non-deprecated
    version, no ifs, ands or buts.</dd>
</dl>

<p>By default, Tidy operates at the <strong>medium</strong> level. You can
change the level of cleaning by setting the %HTML.TidyLevel configuration
directive:</p>

<pre>$config-&gt;set('HTML.TidyLevel', 'heavy'); // burn baby burn!</pre>
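<p>For context, here is a minimal sketch of where that line sits in a
complete script. The include path and the sample input are assumptions
made for illustration; adjust them to match your installation:</p>

<pre>require_once 'HTMLPurifier.auto.php'; // assumed to be on your include path

// Start from the default configuration and raise the Tidy level
$config = HTMLPurifier_Config::createDefault();
$config-&gt;set('HTML.TidyLevel', 'heavy');

// Purify a sample snippet that uses a deprecated element
$purifier = new HTMLPurifier($config);
echo $purifier-&gt;purify('&lt;center&gt;Centered&lt;/center&gt;');</pre>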
<h2>Is the light level really light?</h2>

<p>It depends on what doctype you're using. If your documents are HTML
4.01 <em>Transitional</em>, HTML Purifier will be lazy
and won't clean up your <code>center</code>
or <code>font</code> tags. But if you're using HTML 4.01 <em>Strict</em>,
HTML Purifier has no choice: it has to convert them, or they will
be nuked out of existence. So while light on Transitional will result
in little to no changes, light on Strict will still result in quite
a lot of fixes.</p>
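<p>One way to see the difference for yourself is to run the same input
through the light level under both doctypes. This is only a sketch: the
helper name <code>purify_light</code> and the sample input are made up
for illustration, and the include path is an assumption:</p>

<pre>require_once 'HTMLPurifier.auto.php'; // assumed to be on your include path

// Hypothetical helper: purify $html at the light level for a given doctype
function purify_light($html, $doctype) {
    $config = HTMLPurifier_Config::createDefault();
    $config-&gt;set('HTML.Doctype', $doctype);
    $config-&gt;set('HTML.TidyLevel', 'light');
    $purifier = new HTMLPurifier($config);
    return $purifier-&gt;purify($html);
}

// Same input, two doctypes: compare the two lines of output
$html = '&lt;center&gt;Centered&lt;/center&gt;';
echo purify_light($html, 'HTML 4.01 Transitional'), "\n";
echo purify_light($html, 'HTML 4.01 Strict'), "\n";</pre>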
<p>This is different from the behavior in version 1.6 and earlier, where
deprecated tags in transitional documents would
always be cleaned up regardless. This is also better behavior.</p>
<h2>My pages look different!</h2>

<p>HTML Purifier is tasked with converting deprecated tags and
attributes to standards-compliant alternatives, which usually
need copious amounts of CSS. It's also not foolproof: sometimes
things do get lost in translation. This is why HTML Purifier won't
do any cleaning when it can get away with not doing it, and why
the default value is <strong>medium</strong> and not heavy.</p>

<p>Fortunately, only a few attributes have problems with the
switchover. They are described below:</p>
<table class="table">
    <thead><tr>
        <th>Element@Attr</th>
        <th>Changes</th>
    </tr></thead>
    <tbody>
        <tr>
            <td>caption@align</td>
            <td>Firefox supports stuffing the caption on the
            left or right side of the table, a feature that
            Internet Explorer, understandably, does not have.
            When align equals left or right, the text will simply
            be aligned to the left or right side.</td>
        </tr>
        <tr>
            <td>img@align</td>
            <td>The implementation for align bottom is good, but not
            perfect. There are a few pixel differences.</td>
        </tr>
        <tr>
            <td>br@clear</td>
            <td>Clear both gets a little wonky in Internet Explorer. We
            haven't really been able to figure out why.</td>
        </tr>
        <tr>
            <td>hr@noshade</td>
            <td>All browsers implement this slightly differently: we've
            chosen to make noshade horizontal rules gray.</td>
        </tr>
    </tbody>
</table>
<p>There are a few more minor, although irritating, bugs.
Some older browsers support deprecated attributes,
but not CSS. Transformed elements and attributes will look unstyled
in said browsers. Also, CSS precedence is slightly different for
inline styles versus presentational markup. In increasing precedence:</p>

<ol>
    <li>Presentational attributes</li>
    <li>External style sheets</li>
    <li>Inline styling</li>
</ol>

<p>This means that styling that may have been masked by external CSS
declarations will start showing up (a good thing, perhaps). Finally,
if you've turned off the style attribute, almost all of
these transformations will not work. Sorry mates.</p>

<p>You can review the rendering before and after these transformations
by consulting the <a
href="http://htmlpurifier.org/live/smoketests/attrTransform.php">attrTransform.php
smoketest</a>.</p>
<h2>I like the general idea, but the specifics bug me!</h2>

<p>So you want HTML Purifier to clean up your HTML, but you're not
so happy about the br@clear implementation. That's perfectly fine!
HTML Purifier will make accommodations:</p>

<pre>$config-&gt;set('HTML.Doctype', 'XHTML 1.0 Transitional');
$config-&gt;set('HTML.TidyLevel', 'heavy'); // all changes, minus...
<strong>$config-&gt;set('HTML.TidyRemove', 'br@clear');</strong></pre>

<p>That third line does the magic, removing the br@clear fix
from the module, ensuring that <code>&lt;br clear="all" /&gt;</code>
will pass through unharmed. The reverse is possible too:</p>

<pre>$config-&gt;set('HTML.Doctype', 'XHTML 1.0 Transitional');
$config-&gt;set('HTML.TidyLevel', 'none'); // no changes, plus...
<strong>$config-&gt;set('HTML.TidyAdd', 'p@align');</strong></pre>

<p>In this case, all transformations are shut off, except for the p@align
one, which you found handy.</p>
<p>To find out the names of the fixes you want to turn on or off,
you'll have to consult the source code, specifically the files in
<code>HTMLPurifier/HTMLModule/Tidy/</code>. There is, however, a
general syntax:</p>

<table class="table">
    <thead>
        <tr>
            <th>Name</th>
            <th>Example</th>
            <th>Interpretation</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td>element</td>
            <td>font</td>
            <td>Tag transform for <em>element</em></td>
        </tr>
        <tr>
            <td>element@attr</td>
            <td>br@clear</td>
            <td>Attribute transform for <em>attr</em> on <em>element</em></td>
        </tr>
        <tr>
            <td>@attr</td>
            <td>@lang</td>
            <td>Global attribute transform for <em>attr</em></td>
        </tr>
        <tr>
            <td>element#content_model_type</td>
            <td>blockquote#content_model_type</td>
            <td>Change of child processing implementation for <em>element</em></td>
        </tr>
    </tbody>
</table>
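<p>Names that follow this syntax can be combined in a single directive.
The sketch below assumes that %HTML.TidyRemove accepts a comma-separated
list of fix names; the two fixes picked here are only examples taken from
the table of problem attributes earlier in this tutorial:</p>

<pre>$config-&gt;set('HTML.TidyLevel', 'heavy'); // all changes, minus...
// Assumption: several fix names may be given as a comma-separated list
$config-&gt;set('HTML.TidyRemove', 'br@clear,img@align');</pre>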
<h2>So... what's the lowdown?</h2>

<p>The lowdown is, quite frankly, that HTML Purifier's default settings are
probably good enough. If they aren't, the next step is to bump the level up
to heavy, and if that still doesn't satisfy your appetite, do some fine-tuning.
Other than that, don't worry about it: this all works silently and
effectively in the background.</p>

</body></html>