==========================
 Docutils_ Hacker's Guide
==========================

:Author: Lea Wiemann
:Contact: LeWiemann@gmail.com
:Revision: $Revision$
:Date: $Date$
:Copyright: This document has been placed in the public domain.

:Abstract: This is the introduction to Docutils for all persons who
    want to extend Docutils in some way.
:Prerequisites: You have used reStructuredText_ and played around with
    the `Docutils front-end tools`_ before.  Some (basic) Python
    knowledge is certainly helpful (though not necessary, strictly
    speaking).

.. _Docutils: http://docutils.sourceforge.net/
.. _reStructuredText: http://docutils.sourceforge.net/rst.html
.. _Docutils front-end tools: ../user/tools.html

.. contents::


Overview of the Docutils Architecture
=====================================

To give you an understanding of the Docutils architecture, we'll dive
right into the internals using a practical example.

Consider the following reStructuredText file::

    My *favorite* language is Python_.

    .. _Python: http://www.python.org/

Using the ``rst2html.py`` front-end tool, you would get an HTML output
which looks like this::

    [uninteresting HTML code removed]
    <body>
    <div class="document">
    <p>My <em>favorite</em> language is <a class="reference" href="http://www.python.org/">Python</a>.</p>
    </div>
    </body>
    </html>

While this looks very simple, it's enough to illustrate all internal
processing stages of Docutils.  Let's see how this document is
processed from the reStructuredText source to the final HTML output:


Reading the Document
--------------------

The **Reader** reads the document from the source file and passes it
to the parser (see below).  The default reader is the standalone
reader (``docutils/readers/standalone.py``) which just reads the input
data from a single text file.  Unless you want to do really fancy
things, there is no need to change that.

Since you probably won't need to touch readers, we will just move on
to the next stage:


Parsing the Document
--------------------

The **Parser** analyzes the input document and creates a **node
tree** representation.  In this case we are using the
**reStructuredText parser** (``docutils/parsers/rst/__init__.py``).
To see what that node tree looks like, we call ``quicktest.py`` (which
can be found in the ``tools/`` directory of the Docutils distribution)
with our example file (``test.txt``) as first parameter (Windows users
might need to type ``python quicktest.py test.txt``)::

    $ quicktest.py test.txt
    <document source="test.txt">
        <paragraph>
            My
            <emphasis>
                favorite
             language is
            <reference name="Python" refname="python">
                Python
            .
        <target ids="python" names="python" refuri="http://www.python.org/">

Let us now examine the node tree:

The top-level node is ``document``.  It has a ``source`` attribute
whose value is ``test.txt``.  There are two children: A ``paragraph``
node and a ``target`` node.  The ``paragraph`` in turn has children: A
``Text`` node ("My "), an ``emphasis`` node, a ``Text`` node
(" language is "), a ``reference`` node, and again a ``Text`` node
(".").

These node types (``document``, ``paragraph``, ``emphasis``, etc.) are
all defined in ``docutils/nodes.py``.  The node types are internally
arranged as a class hierarchy (for example, both ``emphasis`` and
``reference`` have the common superclass ``Inline``).  To get an
overview of the node class hierarchy, use epydoc (type ``epydoc
nodes.py``) and look at the class hierarchy tree.
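
If epydoc is not at hand, the class hierarchy can also be inspected
from Python directly.  The following is a minimal sketch, assuming
Docutils is importable as ``docutils``; the ``ancestors`` helper is
made up for this illustration and is not part of Docutils.

```python
from docutils import nodes


def ancestors(node_class):
    """Return the names of all base classes of a node class (illustrative helper)."""
    return [cls.__name__ for cls in node_class.__mro__[1:]]


# Both inline node types from the example share the ``Inline`` category class.
print(issubclass(nodes.emphasis, nodes.Inline))    # True
print(issubclass(nodes.reference, nodes.Inline))   # True

# ``paragraph`` is a body-level element, so it is not an ``Inline`` node.
print(issubclass(nodes.paragraph, nodes.Inline))   # False

# The full list of base classes, nearest first.
print(ancestors(nodes.emphasis))
```

The exact list printed by the last line depends on the Docutils
version, so it is best treated as a quick orientation aid rather than
as a stable interface.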


Transforming the Document
-------------------------

In the node tree above, the ``reference`` node does not contain the
target URI (``http://www.python.org/``) yet.

Assigning the target URI (from the ``target`` node) to the
``reference`` node is *not* done by the parser (the parser only
translates the input document into a node tree).

Instead, it's done by a **Transform**.  In this case (resolving a
reference), it's done by the ``ExternalTargets`` transform in
``docutils/transforms/references.py``.

In fact, there are quite a lot of Transforms, which do various useful
things like creating the table of contents, applying substitution
references or resolving auto-numbered footnotes.

The Transforms are applied after parsing.  To see how the node tree
has changed after applying the Transforms, we use the
``rst2pseudoxml.py`` tool:

.. parsed-literal::

    $ rst2pseudoxml.py test.txt
    <document source="test.txt">
        <paragraph>
            My
            <emphasis>
                favorite
             language is
            <reference name="Python" **refuri="http://www.python.org/"**>
                Python
            .
        <target ids="python" names="python" ``refuri="http://www.python.org/"``>

For our small test document, the only change is that the ``refname``
attribute of the reference has been replaced by a ``refuri``
attribute |---| the reference has been resolved.

While this does not look very exciting, transforms are a powerful tool
for applying any kind of transformation to the node tree.
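
The idea behind resolving a reference can be illustrated without
Docutils itself.  The following toy model uses plain dictionaries, not
Docutils node classes, and ``resolve_references`` is a made-up name;
it is *not* how ``ExternalTargets`` is implemented.  It only mirrors
the difference between the two pseudo-XML dumps above.

```python
def walk(node):
    """Yield a node and all of its descendants, depth first."""
    yield node
    for child in node.get("children", []):
        if isinstance(child, dict):   # plain strings stand in for Text nodes
            yield from walk(child)


def resolve_references(document):
    """Replace each reference's ``refname`` with the matching target's ``refuri``."""
    # First pass: collect the URI of every named target.
    targets = {}
    for node in walk(document):
        if node["type"] == "target":
            for name in node["names"]:
                targets[name] = node["refuri"]
    # Second pass: resolve references; unknown names are left untouched.
    for node in walk(document):
        if node["type"] == "reference" and node.get("refname") in targets:
            node["refuri"] = targets[node.pop("refname")]
    return document


document = {"type": "document", "children": [
    {"type": "paragraph", "children": [
        "My ",
        {"type": "emphasis", "children": ["favorite"]},
        " language is ",
        {"type": "reference", "name": "Python", "refname": "python",
         "children": ["Python"]},
        ".",
    ]},
    {"type": "target", "ids": ["python"], "names": ["python"],
     "refuri": "http://www.python.org/"},
]}

resolve_references(document)
reference = document["children"][0]["children"][3]
print(reference["refuri"])      # http://www.python.org/
print("refname" in reference)   # False
```

Two passes are used so that a reference can be resolved even when its
target appears later in the document, as it does in our example.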

By the way, you can also get a "real" XML representation of the node
tree by using ``rst2xml.py`` instead of ``rst2pseudoxml.py``.


Writing the Document
--------------------

To get an HTML document out of the node tree, we use a **Writer**, the
HTML writer in this case (``docutils/writers/html4css1.py``).

The writer receives the node tree and returns the output document.
For HTML output, we can test this using the ``rst2html.py`` tool::

    $ rst2html.py --link-stylesheet test.txt
    <?xml version="1.0" encoding="utf-8" ?>
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
    <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    <meta name="generator" content="Docutils 0.3.10: http://docutils.sourceforge.net/" />
    <title></title>
    <link rel="stylesheet" href="../docutils/writers/html4css1/html4css1.css" type="text/css" />
    </head>
    <body>
    <div class="document">
    <p>My <em>favorite</em> language is <a class="reference" href="http://www.python.org/">Python</a>.</p>
    </div>
    </body>
    </html>

So here we finally have our HTML output.  The actual document contents
are in the fourth-last line.  Note, by the way, that the HTML writer
did not render the (invisible) ``target`` node |---| only the
``paragraph`` node and its children appear in the HTML output.
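
The same conversion can be driven from Python instead of the command
line.  This is a minimal sketch, assuming a Docutils version that
provides ``docutils.core.publish_parts`` and accepts the
``writer_name`` argument; the exact markup (for example the ``class``
attribute of the link) may differ between Docutils versions.

```python
from docutils.core import publish_parts

source = """\
My *favorite* language is Python_.

.. _Python: http://www.python.org/
"""

# Run the whole Reader -> Parser -> Transforms -> Writer chain on a string.
parts = publish_parts(source, writer_name="html")

# The "body" part holds the rendered contents without the HTML boilerplate.
body = parts["body"]
print(body)
```

Asking for the ``body`` part is convenient when only the document
contents are needed, for example to embed them in another page.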


Extending Docutils
==================

Now you'll ask, "how do I actually extend Docutils?"

First of all, once you are clear about *what* you want to achieve, you
have to decide *where* to implement it |---| in the Parser (e.g. by
adding a directive or role to the reStructuredText parser), as a
Transform, or in the Writer.  There is often one obvious choice among
those three (Parser, Transform, Writer).  If you are unsure, ask on
the Docutils-develop_ mailing list.

In order to find out how to start, it is often helpful to look at
similar features which are already implemented.  For example, if you
want to add a new directive to the reStructuredText parser, look at
the implementation of a similar directive in
``docutils/parsers/rst/directives/``.
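
To give a rough idea of what a Parser extension looks like, here is a
sketch of a tiny directive.  It assumes a Docutils release that
provides the class-based ``Directive`` API in ``docutils.parsers.rst``
(older releases use a different, function-based interface).  The
``shout`` directive and the ``Shout`` class are invented for this
example.

```python
from docutils import nodes
from docutils.core import publish_parts
from docutils.parsers.rst import Directive, directives


class Shout(Directive):
    """Hypothetical directive that upper-cases its content."""

    has_content = True   # the directive accepts an indented content block

    def run(self):
        # self.content holds the content lines; run() returns a list of nodes
        # that the parser inserts into the node tree.
        text = " ".join(self.content).upper()
        return [nodes.paragraph(text=text)]


# Make the directive available under the (invented) name "shout".
directives.register_directive("shout", Shout)

source = """\
.. shout::

   hello world
"""
body = publish_parts(source, writer_name="html")["body"]
print(body)
```

Because the directive only creates a standard ``paragraph`` node, no
Writer changes are needed; every existing Writer already knows how to
render it.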


Modifying the Document Tree Before It Is Written
------------------------------------------------

You can modify the document tree right before the writer is called.
One possibility is to use the publish_doctree_ and
publish_from_doctree_ functions.

To retrieve the document tree, call::

    document = docutils.core.publish_doctree(...)

Please see the docstring of ``publish_doctree`` for a list of
parameters.

.. XXX Need to write a well-readable list of (commonly used) options
   of the publish_* functions.  Probably in api/publisher.txt.

``document`` is the root node of the document tree.  You can now
change the document by accessing the ``document`` node and its
children |---| see `The Node Interface`_ below.

When you're done with modifying the document tree, you can write it
out by calling::

    output = docutils.core.publish_from_doctree(document, ...)

.. _publish_doctree: ../api/publisher.html#publish_doctree
.. _publish_from_doctree: ../api/publisher.html#publish_from_doctree
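
Put together, the two calls might be used as follows.  This is a
sketch under a few assumptions: the ``walk`` helper is made up for the
example, the change itself (turning emphasis into strong emphasis) is
arbitrary, and the ``output_encoding`` override is used to request a
text string rather than encoded bytes.

```python
from docutils import nodes
from docutils.core import publish_doctree, publish_from_doctree

source = """\
My *favorite* language is Python_.

.. _Python: http://www.python.org/
"""


def walk(node):
    """Yield a node and all of its descendants (illustrative helper)."""
    yield node
    for child in node.children:
        yield from walk(child)


# Stage 1: read, parse and transform; no writer is involved yet.
document = publish_doctree(source)

# Modify the tree: turn every ``emphasis`` node into a ``strong`` node.
# The nodes are collected first so the tree is not changed while walking it.
for old in [n for n in walk(document) if isinstance(n, nodes.emphasis)]:
    new = nodes.strong(old.rawsource, "", *old.children)
    old.parent.replace(old, new)

# Stage 2: hand the modified tree to the HTML writer.
output = publish_from_doctree(
    document,
    writer_name="html",
    settings_overrides={"output_encoding": "unicode"},
)
print("<strong>favorite</strong>" in output)
```

The reference is still resolved in the output, because the Transforms
had already run when ``publish_doctree`` returned.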


The Node Interface
------------------

As described in the overview above, Docutils' internal representation
of a document is a tree of nodes.  We'll now have a look at the
interface of these nodes.

(To be completed.)
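
As a starting point, a few basic node operations can be tried on the
example document from the overview.  This is only a sketch of some
commonly used accessors (child access by index, attribute access by
key, ``astext()``, ``parent``, ``tagname``), not a complete
description of the interface.

```python
from docutils.core import publish_doctree

source = """\
My *favorite* language is Python_.

.. _Python: http://www.python.org/
"""

document = publish_doctree(source)

# Child nodes are reached by integer index, attributes by string key.
paragraph = document[0]
reference = paragraph[3]

print(paragraph.tagname)               # paragraph
print(len(paragraph.children))         # 5
print(reference["refuri"])             # http://www.python.org/
print(reference.astext())              # Python
print(reference.parent is paragraph)   # True
print(paragraph.astext())              # My favorite language is Python.
```

The indices used here match the node tree shown in the overview: the
``reference`` node is the fourth child of the ``paragraph``.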


What Now?
=========

This document is not complete.  Many topics could (and should) be
covered here.  To find out which topics we should write about first,
we are awaiting *your* feedback.  So please ask your questions on the
Docutils-develop_ mailing list.


.. _Docutils-develop: ../user/mailing-lists.html#docutils-develop


.. |---| unicode:: 8212 .. em-dash
   :trim:

..
   Local Variables:
   mode: indented-text
   indent-tabs-mode: nil
   sentence-end-double-space: t
   fill-column: 70
   End:
 263    fill-column: 70
 264    End: