Release 1.6.0, merged in r875-930.
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
 "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en"><head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<meta name="description" content="Functional specification for HTML Purifier's advanced API for defining custom filtering behavior." />
<link rel="stylesheet" type="text/css" href="style.css" />

<title>Advanced API - HTML Purifier</title>

</head><body>

<h1>Advanced API</h1>

<div id="filing">Filed under Development</div>
<div id="index">Return to the <a href="index.html">index</a>.</div>
<div id="home"><a href="http://hp.jpsband.org/">HTML Purifier</a> End-User Documentation</div>
<p>HTML Purifier currently natively supports only a subset of HTML's
allowed elements, attributes, and behavior. This is by design,
but as the user is always right, they'll need some method to override
these behaviors.</p>
<p>Our goals are to let the user:</p>

<dl>
    <dt>Select</dt>
    <dd><ul>
        <li>Doctype</li>
        <li>Mode: Lenient / Correctional</li>
        <li>Elements / Attributes / Modules</li>
        <li>Filterset</li>
    </ul></dd>
    <dt>Customize</dt>
    <dd><ul>
        <li>Attributes</li>
        <li>Elements</li>
    </ul></dd>
    <dt>Internals</dt>
    <dd><ul>
        <li>Modules / Elements / Attributes / Attribute Types</li>
        <li>Filtersets</li>
        <li>Doctype</li>
    </ul></dd>
</dl>
<h2>Select</h2>

<p>For basic use, the user will have to specify some basic parameters. This
is not strictly necessary, as HTML Purifier's default setting will always
output safe code, but is required for standards-compliant output.</p>

<h3>Selecting a Doctype</h3>

<p>The first thing to select is the <strong>doctype</strong>. This
is essential for standards-compliant output.</p>

<p class="technical">The doctype identifier is based
on the name the W3C has given to the document type and <em>not</em>
the DTD identifier.</p>

<p>This parameter is set via the configuration object:</p>

<pre>$config->set('HTML', 'Doctype', 'XHTML 1.0 Transitional');</pre>

<p>For historical reasons, the default doctype is XHTML 1.0
Transitional. However, we really shouldn't be guessing what the user's
doctype is. Fortunately, people who can't be bothered to set this won't
be bothered when their pages stop validating.</p>
<h3>Selecting Mode</h3>

<p>Within doctypes, there are various <strong>modes</strong> of operation.
These indicate variant behaviors that, while not strictly changing the
allowed set of elements and attributes, definitely affect the output.
Currently, we have two modes, which may be used together:</p>

<dl>
    <dt>Lenient</dt>
    <dd>
        <p>Deprecated elements and attributes will be transformed into
        standards-compliant alternatives when explicitly disallowed.</p>
        <p>For example, in the XHTML 1.0 Strict doctype, a <code>center</code>
        element would be turned into a <code>div</code> with the CSS property
        <code>text-align:center;</code>, but in XHTML 1.0 Transitional
        the element would be preserved.</p>
        <p>This mode is on by default.</p>
    </dd>
    <dt>Correctional[items to correct]</dt>
    <dd>
        <p>Deprecated elements and attributes will be transformed into
        standards-compliant alternatives whenever possible.
        It may have various levels of operation.</p>
        <p>Referring back to the previous example, the <code>center</code> element would
        be transformed in both cases. However, elements without a
        reasonable standards-compliant alternative will be preserved
        in their original form.</p>
        <p>A user may want to correct certain deprecated attributes, but
        not others. For example, the <code>bgcolor</code> attribute may be
        acceptable, but the <code>center</code> element not. It is also
        possible that an HTML Purifier transformation is buggy, so the user
        wants to forgo it. Thus, correctional accepts an array defining which
        elements and attributes to clean up, or no parameter at all, which
        means everything gets corrected. This also means that each
        correction needs to be given a unique ID that can be referenced
        in this manner. (We may also allow globbing, like *.name or a.*
        for mass-enabling correction, and a subtractive mode, where the
        things specified are excluded from correction.) This array gets
        passed into the constructor of the mode's module.</p>
        <p>This mode is on by default.</p>
    </dd>
</dl>
<p>A possible call to select modes would be:</p>

<pre>$config->set('HTML', 'Mode', array('correctional', 'lenient'));</pre>

<p>If modes have extra parameters, a hash is necessary:</p>

<pre>$config->set('HTML', 'Mode', array(
    'correctional' => 'center,a.name',
    'lenient' => true // this one's just boolean
));</pre>

<p>Modes may be specified along with the doctype declaration (we may want
to get a better set of separator characters):</p>

<pre>$config->setDoctype('XHTML 1.0 Transitional', '+correctional[center,a.name] -lenient');</pre>
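<p>To make the relationship between the string form and the hash form
concrete, here is a minimal sketch of how such a mode string might be
expanded. The function name <code>parseModeString</code> is hypothetical and
not part of HTML Purifier, and the rule that a <code>-</code> prefix maps to
<code>false</code> is an assumption about the still-unsettled separator
syntax.</p>

```php
<?php
// Hypothetical helper, not part of HTML Purifier: expands a mode string
// such as '+correctional[center,a.name] -lenient' into the hash form.
function parseModeString($modes) {
    $result = array();
    // Tokens are separated by whitespace, so parameters inside the
    // brackets must not contain spaces in this sketch.
    $tokens = preg_split('/\s+/', trim($modes), -1, PREG_SPLIT_NO_EMPTY);
    foreach ($tokens as $token) {
        if (!preg_match('/^([+-])([A-Za-z]+)(?:\[([^\]]*)\])?$/', $token, $m)) {
            throw new InvalidArgumentException("Malformed mode token: $token");
        }
        $name   = $m[2];
        $params = isset($m[3]) ? $m[3] : '';
        if ($m[1] === '-') {
            $result[$name] = false;      // assumption: '-' disables the mode
        } elseif ($params !== '') {
            $result[$name] = $params;    // mode with extra parameters
        } else {
            $result[$name] = true;       // plain boolean mode
        }
    }
    return $result;
}

var_dump(parseModeString('+correctional[center,a.name] -lenient'));
```

<p>Under these assumptions the example string yields
<code>correctional</code> mapped to <code>'center,a.name'</code> and
<code>lenient</code> mapped to <code>false</code>.</p>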
<p>With regard to the various levels of operation conjectured in the
Correctional mode, this is prompted by the fact that a user may want to
correct certain problems but not others, for example, fix the <code>center</code>
element but not the <code>u</code> element, both of which are deprecated.
Having an integer <q>level</q> will not work very well for such
fine-grained tweaking, but an array of specific settings might.</p>
<h3>Selecting Elements / Attributes / Modules</h3>

<p></p>

<p>If this cookie cutter approach doesn't appeal to a user, they may
decide to roll their own filterset by selecting modules, elements and
attributes to allow.</p>

<p class="technical">This would make use of the same facilities
as a filterset author would use, except that it would go under an
<q>anonymous</q> filterset that would be auto-selected if any of the
relevant module/elements/attribute selection configuration directives were
non-null.</p>

<p>In practice, this is the most commonly demanded feature. Most users are
perfectly happy defining a filterset that looks like:</p>

<pre>$config->setAllowedHTML('a[href,title];em;p;blockquote');</pre>

<p class="technical">The directive %HTML.Allowed is a convenience function
that may be fully expressed with the legacy interface, and thus is
given its own setter.</p>

<p>We currently support a separated interface, which also must be preserved:</p>

<pre>$config->set('HTML', 'AllowedElements', 'a,em,p,blockquote');
$config->set('HTML', 'AllowedAttributes', 'a.href,a.title');</pre>
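<p>Since the compact string is meant to be fully expressible through the
separated interface, the translation between the two can be sketched as
below. The function name <code>splitAllowedHTML</code> and the returned array
keys are illustrative assumptions, not existing HTML Purifier API.</p>

```php
<?php
// Hypothetical helper, not part of HTML Purifier: converts the compact
// 'a[href,title];em;p;blockquote' form into the two comma-separated
// lists used by the separated interface.
function splitAllowedHTML($allowed) {
    $elements   = array();
    $attributes = array();
    foreach (explode(';', $allowed) as $chunk) {
        $chunk = trim($chunk);
        if ($chunk === '') {
            continue; // tolerate a trailing semicolon
        }
        if (!preg_match('/^([A-Za-z][A-Za-z0-9]*)(?:\[([^\]]*)\])?$/', $chunk, $m)) {
            throw new InvalidArgumentException("Malformed element spec: $chunk");
        }
        $elements[] = $m[1];
        if (isset($m[2]) && $m[2] !== '') {
            foreach (explode(',', $m[2]) as $attr) {
                // Attributes are qualified as element.attribute
                $attributes[] = $m[1] . '.' . trim($attr);
            }
        }
    }
    return array(
        'elements'   => implode(',', $elements),
        'attributes' => implode(',', $attributes),
    );
}

var_dump(splitAllowedHTML('a[href,title];em;p;blockquote'));
```

<p>For the example string this produces the same two values that are passed
to <code>AllowedElements</code> and <code>AllowedAttributes</code> above.</p>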
<p>A user may also choose to allow modules:</p>

<pre>$config->set('HTML', 'AllowedModules', 'Hypertext,Text,Lists'); // or
$config->setAllowedHTML('Hypertext,Text,Lists');</pre>

<p>But it is not expected that this feature will be widely used.</p>

<p class="fixme">The granularity of these modules is too coarse for
the average user (for example, the core module loads everything from
the essential <code>p</code> element to the not-so-safe <code>h1</code>
element). How do we make this still a viable solution? Possible answers
may be sub-modules or module parameters. This may not even be a problem,
considering that most people won't be selecting modules.</p>

<p class="technical">Modules are distinguished from regular elements by the
case of their first letter. While XML distinguishes between and allows
lower and uppercase letters in element names, most well-known XML
languages use only lower-case
element names for the sake of consistency.</p>
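<p>The first-letter rule can be illustrated with a short sketch that sorts
a mixed list into modules and elements. The function name
<code>partitionAllowed</code> is hypothetical, and treating only ASCII
<code>A</code> to <code>Z</code> as uppercase is an assumption.</p>

```php
<?php
// Hypothetical helper, not part of HTML Purifier: names starting with an
// uppercase ASCII letter are treated as modules, all others as elements.
function partitionAllowed($list) {
    $result = array('modules' => array(), 'elements' => array());
    foreach (explode(',', $list) as $name) {
        $name = trim($name);
        if ($name === '') {
            continue;
        }
        if (preg_match('/^[A-Z]/', $name)) {
            $result['modules'][] = $name;
        } else {
            $result['elements'][] = $name;
        }
    }
    return $result;
}

var_dump(partitionAllowed('Hypertext,Text,a,em'));
```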
<p class="technical">Considering that, internally speaking, as mandated by
the XHTML 1.1 Modularization specification, we have organized our
elements around modules, considerable gymnastics will be needed to
get this sort of functionality working.</p>

<h3>Unified selector</h3>

<p>Because selecting each and every one of these configuration options
is a chore, we may wish to offer a specialized configuration method
for selecting a filterset. Possibility:</p>

<pre>function selectFilter($doctype, $filterset, $mode)</pre>

<p>...which is simply a light wrapper over the individual configuration
calls. A custom config file format or text format could also be adopted.</p>
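<p>As a rough illustration of what <q>light wrapper</q> could mean here,
the sketch below uses a stand-in class, <code>SketchConfig</code>, that only
records the values it is given. The class, the <code>Filterset</code>
directive name and the filterset value are all assumptions made for the
example; none of them is existing HTML Purifier API.</p>

```php
<?php
// Stand-in for the real configuration object; it simply records values.
class SketchConfig {
    public $values = array();

    public function set($namespace, $key, $value) {
        $this->values[$namespace . '.' . $key] = $value;
    }

    // The wrapper does nothing beyond forwarding to the individual calls.
    public function selectFilter($doctype, $filterset, $mode) {
        $this->set('HTML', 'Doctype', $doctype);
        $this->set('HTML', 'Filterset', $filterset); // hypothetical directive
        $this->set('HTML', 'Mode', $mode);
    }
}

$config = new SketchConfig();
// 'ExampleFilterset' is a placeholder name, not a real filterset.
$config->selectFilter('XHTML 1.0 Strict', 'ExampleFilterset', array('lenient'));
var_dump($config->values);
```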
<h2>Customize</h2>

<p>By reviewing topic posts in the support forum, we determined that
the two most requested customization features were
adding an attribute to an existing element and adding a new element.
Thus, we'll want to create convenience functions for these common
use-cases.</p>

<p>Note that the functions described here are only available if
a raw copy of <code>HTMLPurifier_HTMLDefinition</code> was retrieved.
<code>addAttribute</code> may work on a processed copy, but for
consistency's sake we will mandate this for everything.</p>

<h3>Attributes</h3>

<p>An attribute is bound to an element by a name and has a specific
<code>AttrDef</code> that validates it. Thus, the interface should
be:</p>

<pre>function addAttribute($element, $attribute, $attribute_def);</pre>

<p>With a use-case that looks like:</p>

<pre>$def->addAttribute('a', 'rel', new HTMLPurifier_AttrDef_Enum(array('nofollow')));</pre>

<p>The <code>$attribute_def</code> value can be a little flexible,
to make things simpler. We'll let it also be:</p>

<ul>
    <li>Class name: We'll instantiate it for you</li>
    <li>Function name: We'll create an <code>HTMLPurifier_AttrDef_Anonymous</code>
        class with that function registered as a callback.</li>
    <li>String attribute type: We'll use <code>HTMLPurifier_AttrTypes</code>
        </li>
    <li>String starting with <code>enum(</code>: We'll explode it and stuff it in an
        <code>HTMLPurifier_AttrDef_Enum</code> for you.</li>
</ul>

<p>This lets the previous example be written as:</p>

<pre>$def->addAttribute('a', 'rel', 'enum(nofollow)');</pre>
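<p>The <code>enum(</code> shorthand is the simplest of these forms to pin
down. A minimal sketch of the <q>explode it</q> step follows; the function
name <code>parseEnumShorthand</code> is hypothetical, and the use of a comma
as the value separator is an assumption, since the text does not name
one.</p>

```php
<?php
// Hypothetical helper, not part of HTML Purifier: extracts the list of
// allowed values from a string such as 'enum(nofollow)'. Returns null
// when the string does not use the enum( shorthand.
function parseEnumShorthand($def) {
    if (!preg_match('/^enum\((.*)\)$/', $def, $m)) {
        return null;
    }
    $values = array();
    foreach (explode(',', $m[1]) as $value) {
        $value = trim($value);
        if ($value !== '') {
            $values[] = $value;
        }
    }
    return $values;
}

var_dump(parseEnumShorthand('enum(nofollow)'));
var_dump(parseEnumShorthand('Color'));
```

<p>The resulting array is what would be handed to the
<code>HTMLPurifier_AttrDef_Enum</code> constructor, as in the longer form of
the example above.</p>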
<h3>Elements</h3>

<p>An element requires certain information as specified by
<code>HTMLPurifier_ElementDef</code>. However, not all of it is necessary.
The things usually required are:</p>

<ul>
    <li>Attributes</li>
    <li>Content model/type</li>
    <li>Registration in a content set</li>
</ul>

<p>This suggests an API like this:</p>

<pre>function addElement($element, $type, $content_model, $attributes = array());</pre>

<p>Each parameter explained in depth:</p>

<dl>
    <dt><code>$element</code></dt>
    <dd>Element name, e.g. 'label'</dd>
    <dt><code>$type</code></dt>
    <dd>Content set to register in, e.g. 'Inline' or 'Flow'</dd>
    <dt><code>$content_model</code></dt>
    <dd>Description of allowed children. This is a merged form of
        <code>HTMLPurifier_ElementDef</code>'s member variables
        <code>$content_model</code> and <code>$content_model_type</code>,
        where the form is <q>Type: Model</q>, e.g. 'Optional: Inline'.</dd>
    <dt><code>$attributes</code></dt>
    <dd>Array of attribute names to attribute definitions, much like
        the above-described attribute customization.</dd>
</dl>
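<p>To show how the merged <q>Type: Model</q> string relates to the two
member variables, here is a small sketch of the split. The function name
<code>splitContentModel</code> is hypothetical, and whether the real
implementation normalizes case or whitespace beyond trimming is left
open.</p>

```php
<?php
// Hypothetical helper, not part of HTML Purifier: splits 'Type: Model'
// into the two values named in the parameter description.
function splitContentModel($spec) {
    // Limit of 2 so that any further colons stay inside the model part.
    $parts = explode(':', $spec, 2);
    if (count($parts) !== 2) {
        throw new InvalidArgumentException("Expected 'Type: Model', got: $spec");
    }
    return array(
        'content_model_type' => trim($parts[0]),
        'content_model'      => trim($parts[1]),
    );
}

var_dump(splitContentModel('Optional: Inline'));
```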
<p>A possible usage:</p>

<pre>$def->addElement('font', 'Inline', 'Optional: Inline',
    array(0 => array('Common'), 'color' => 'Color'));</pre>

<p>We may want the Common attribute collection to be included
by default.</p>

<div id="version">$Id$</div>

</body></html>