================================
 Docutils: Multiple Input Files
================================

:Author: Lea Wiemann <LeWiemann@gmail.com>
:Date: $Date$
:Revision: $Revision$
:Copyright: This document has been placed in the public domain.

.. sectnum::
.. contents::


Introduction
============

We would like to support documents whose source text comes from
multiple files.  For instance, the Docutils documentation could be
considered a single large document; parsing all files into one single
document tree would enable us to do cross-linking between parts of the
documentation (our current way to cross-link between files is to link
to HTML files and fragments, which is limited to HTML).  Another
example is a reference manual for a customized software system.  The
manual is built from a set of sub-documents that may differ from
installation to installation.

Note that this issue is separate from that of output to multiple
files; after implementing support for multiple input files, all we
will be able to do is to generate a single huge output file.

This is a collection of notes and semi-random thoughts (many of which
are credited to David, from IM conversations).  Feel free to add yours
and/or make changes as you see fit!

You can also discuss this proposal on the Docutils-develop_ mailing
list, or reach us individually via email or Jabber/Google Talk at
LeWiemann@gmail.com and dgoodger@gmail.com, respectively.

.. _Docutils-develop:
   http://docutils.sf.net/docs/user/mailing-lists.html#docutils-develop


Terminology
===========

Right now, we are using the following terminology: A document which
includes other documents (using the ``subdocument`` directive
described below) is called a *super-document*.  The included documents
are called *sub-documents*.  Sub-documents can in turn include other
documents and can thus be super-documents themselves.  Any top-most
super-document in the hierarchy, which is not included by any other
document, is called a *compound document*.

The set of all documents that can be part of a compound document is
the *document set* (or *docset*).  The directory that is ancestor to
all documents in the document set is the *docset root*.

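As a small illustration of the *docset root* definition, the following
Python sketch picks the deepest directory that is an ancestor of every
file in a document set.  This is only one possible choice (any common
ancestor satisfies the definition), and the function name
``docset_root`` is made up for this example; it is not part of
Docutils::

    import os.path

    def docset_root(paths):
        """Return the deepest directory containing every file in `paths`.

        Hypothetical helper: assumes `paths` is a non-empty list of
        file names that all live on the same drive.
        """
        directories = [os.path.dirname(os.path.abspath(path))
                       for path in paths]
        return os.path.commonpath(directories)

On a POSIX system, ``/proj/docs/dev/todo.txt`` and
``/proj/docs/ref/rst/spec.txt`` yield ``/proj/docs``.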

The ``subdocument`` Directive
=============================

* The "include" directive is not usable for this because we want to
  have independent parsing contexts (for instance, section title
  adornment should not have to be consistent across input files).

* So create a "subdocument" directive (syntax: ".. subdocument::
  file.txt").  This directive causes the referenced file to be parsed
  and its document tree to be inserted in place.

  - The sub-document must only consist of top-level sections and
    transitions, or it must have a document title.  (The document
    title will become a section title when the sub-document is
    inserted using the ``subdocument`` directive.)  We only support
    per-document docinfos -- if a sub-document contains multiple
    top-level sections, don't touch field lists at all.

    This restriction may be lifted later; there is no theoretical
    reason that prevents sub-documents from containing arbitrary text
    (within the limits of the DTD, e.g. no body elements may end up
    after section elements) -- it is merely somewhat harder to
    implement.

  - As long as the above restrictions apply, the subdocument directive
    can be treated like a section at parse-time; that is, no elements
    except for more sections or transitions may follow.

  - In order to facilitate assembling a large number of hierarchical
    files into a large document, the subdocument directive should
    allow specifying any number of files, like this::

        .. subdocument::

           chapter1.txt
               chapter1-section1.txt
               chapter1-section2.txt
           chapter2.txt
               chapter2-section1.txt
               ...

    Specifying an indented file (like chapter1-section1.txt) is
    equivalent to inserting ".. subdocument:: chapter1-section1.txt"
    at the end of chapter1.txt.

    Lists of files should be required to be directive content, not
    parameters, because file lists as parameters would be prone to
    uncaught user errors.  In this example, the indentation of
    "chapter1-section1.txt" would be stripped by the directive parser,
    which is contrary to what the user expects::

        .. subdocument:: chapter1.txt
               chapter1-section1.txt

* The sub-document files should each be processable stand-alone
  (without the other files), each forming a document on its own.

* What to do with docinfos in subdocuments:

  - Allow for section infos by generalizing the existing docinfo node.

  - Add an option to either strip or leave docinfos, specifiable as an
    option to the "subdocument" directive:

    Uniform handling::

        .. subdocument:: :leave-docinfo:

           chapter1.txt
           chapter2.txt
           chapter3.txt

    Non-uniform handling::

        .. subdocument:: chapter1.txt
        .. subdocument:: chapter2.txt
           :leave-docinfo:
        .. subdocument:: chapter3.txt

  - There is currently no way to have per-section "section infos" in
    reStructuredText files, as opposed to only file-wide docinfos.  So
    the only way to get section infos in a document is to use
    sub-documents.  This might be just fine though.

  - For a first implementation, just go the easy route and strip all
    docinfos in sub-documents.

* In order to facilitate multi-file output that parallels the input
  file structure, add "source" attributes to section nodes for
  sections that come from different input files.

* Restriction: Do not allow sub-documents without a top-level section,
  or with body elements in front of the first section.  IOW, the
  sub-document may only contain PreBibliographic elements, sections,
  and transitions.  The PreBibliographic elements in front of the
  first section get moved into the section.

  David says we shouldn't have this restriction -- for instance you
  might want to have an introductory paragraph in front of the first
  section of the first sub-document.  Lea doesn't mind the restriction
  and says you could use the "include" directive.  Since having the
  restriction makes the implementation somewhat easier, we agreed on
  having this restriction, waiting until the first user presents a
  good use case to remove it, and calling it a YAGNI until then.

* Silently drop header and footer in sub-documents.  (Document this in
  directives.txt though.)

* To do: Explore alternatives besides "subdocument" for the directive
  name.

  [DG] "subdoc" is a good alias.  "Subdocument" implies a single item;
  perhaps "manifest" instead, for multiple items?

* ``docutils.conf`` files in the sub-documents' respective directories
  are honored.

* You may want to read some insightful remarks by Joaquim Baptista on
  how `files should be expected to be part of different compound
  documents`__.

  .. _different compound documents:
  __ http://article.gmane.org/gmane.text.docutils.devel/4043

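The indented file list proposed above as directive content could be
turned into a tree roughly as follows.  This is a minimal Python
sketch, not an existing Docutils API: the name ``parse_manifest`` and
the ``(file_name, children)`` tuple layout are assumptions made for the
example, and the input is assumed to be the directive content as a list
of lines with their indentation intact::

    def parse_manifest(lines):
        """Turn an indented list of file names into nested
        (file_name, children) tuples.

        A file indented deeper than the previous shallower file
        becomes a child of that file.
        """
        root = []
        # Each stack entry is (indentation, list of children); the
        # sentinel entry with indentation -1 is never popped.
        stack = [(-1, root)]
        for line in lines:
            name = line.strip()
            if not name:
                continue                  # skip blank lines
            indent = len(line) - len(line.lstrip())
            # Close all entries that are not shallower than this line.
            while stack[-1][0] >= indent:
                stack.pop()
            children = []
            stack[-1][1].append((name, children))
            stack.append((indent, children))
        return root

Each child entry corresponds to a ``subdocument`` directive placed at
the end of its parent file.  A line that dedents to a level never used
before is attached to the nearest shallower file; a real implementation
might prefer to report that as an error.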

.. _xrefs:

Cross-References
================

.. note:: You may need to read the `reST spec`_ in order to understand
   the terminology (targets, references).  In this section, an
   "*external named reference*" means a reference whose target is
   outside of the current file, but within the current docset.

   .. _reST spec: http://docutils.sf.net/docs/ref/rst/restructuredtext.html

A major issue to think about is how to do **cross-references**
(colloquially known as **xrefs**) between files.  Things like
local references or substitutions should not be shared between files
(their definitions can simply be loaded using the "include"
directive).  However, sharing *external* targets and thereby allowing
cross-references between files is one of the major features of
an architecture that supports multiple input files.

(Implementation note: For this to work, we need to apply transforms to
sub-documents; basically, all transforms but the one resolving
external references should be applied.  This means that a new reader
instance must be created.  r5266 makes applying transforms to
sub-documents possible by pulling the responsibility for applying
transforms out of the Publisher.)

Issues arise once we think about how to group target names into
namespaces.  Unfortunately, simply putting all targets into a global,
document-wide namespace is bound to cause collisions; files that were
processable stand-alone are no longer processable when used in
conjunction with other files because they share common target names.

Since unqualified links to targets outside the scope of the current
sub-document therefore have significant disadvantages, we will need
some form of qualifiers.


Namespace Identifiers
---------------------

This makes it necessary to add a notion of *namespace identifiers*.

.. sidebar:: Why headers are a bad idea

   One of the appealing features of reStructuredText, compared to
   LaTeX, is that creating a new file does not require writing a
   header.  Just type the title, some text, run rst2html, and you're
   done.  Writing a stand-alone LaTeX document on the other hand
   typically begins with declaring the \\documentclass, loading all
   the packages you need for your document, setting some options, and
   finally \\begin{document}.

   While it may not be possible to go *entirely* without any explicit
   markup, it is certainly a worthwhile goal to keep the amount of
   such markup to a minimum.

It is possible to always name the namespace of the current file (as it
is done in C++).  For instance, "``.. namespace:: frob``" at the
beginning of a file could declare that the namespace of the current
file is called "frob".  However, this is a little verbose as it adds a
line at the top of each file (see the sidebar).  Also, it removes the
reader's ability to easily look up the referenced files (you might not
know which file(s) declare the "frob" namespace).

On the other hand, namespace names could also be derived from paths
and file names.  (Note though that these two options need not be
mutually exclusive.)  Since using only the file name would cause
ambiguity, it is necessary to include its path in the namespace name.
For instance, the file ``docs/dev/todo.txt`` could be referenced by
the implicit namespace identifier ``docs/dev/todo.txt``; a reference
would look like ```<docs/dev/todo.txt> large documents`_``.  Using
paths relative to the current file makes it hard to move files or
document parts.  Therefore, we need to establish the notion of a
*docset root* which path names are relative to.

The docset root could be specified using a "docset-root" directive at
the top of each sub-document that uses external named references.  On
the other hand, perhaps we do not need to know the docset root until
we process the compound document (in which case it can be implicitly
derived from the location of the master file).  So let's wait with
implementing a "docset-root" directive until the need arises.


Namespace Aliases
-----------------

A general disadvantage of using paths as namespace identifiers is that
changes in the directory structure cause a massive amount of changes
in the reStructuredText files, because all the paths need to be
updated.  This is not any worse than the current situation.  However,
to improve maintainability it would be desirable to make the namespace
of an often-referenced file known under a shorter name.  (The shorter
namespace identifier should only be valid within the file where it is
declared, and possibly sub-documents.)  For instance, one could make
"docs/ref/restructuredtext.txt" known as "spec" using the following
syntax::

    .. namespaces::

       :spec: docs/ref/restructuredtext.txt

Namespace aliases can also be used to make one namespace refer to
different physical files depending on the super-document.  Namespace
definitions should therefore be inherited from super-documents to
sub-documents.  The "namespaces" directive overrides namespace
definitions inherited from super-documents, unless the *:inherit:*
option is specified.  (The :inherit: option thus makes it possible to
provide default paths for namespace aliases, which can still be
overridden in super-documents.)

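The override rule for inherited aliases can be written down in a few
lines.  The following Python sketch is only our reading of the
paragraph above; the function name ``effective_namespaces`` and the use
of plain dictionaries mapping alias names to paths are assumptions for
illustration::

    def effective_namespaces(inherited, declared, inherit=False):
        """Combine aliases inherited from the super-document with
        those declared by a "namespaces" directive in this file.

        inherited -- aliases passed down by the super-document
        declared  -- aliases declared in the current file
        inherit   -- True if the :inherit: option was given
        """
        if inherit:
            # Declared aliases are only defaults; the super-document
            # wins wherever it defines the same alias.
            merged = dict(declared)
            merged.update(inherited)
        else:
            # Declared aliases override whatever was inherited.
            merged = dict(inherited)
            merged.update(declared)
        return merged

In both cases, aliases that exist on only one side are kept; the
``inherit`` flag merely decides which side wins a conflict.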

Qualifier Syntax
----------------

Angled brackets::

    `<namespace> target`_

This is similar to the syntax for embedded URIs (```target
<URI>`_``).  It fits well into the existing syntax.

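To show that the qualified form can be told apart from an embedded URI,
here is a small Python sketch that splits the text between the
backquotes into a namespace and a target name.  The regular expression
and the function name ``split_reference`` are illustrative assumptions,
not the parser's actual implementation::

    import re

    # A qualifier is an angle-bracketed namespace at the *start* of the
    # reference text, followed by whitespace and the target name.
    QUALIFIED_REFERENCE = re.compile(
        r'^<(?P<namespace>[^<>]+)>\s+(?P<target>.+)$')

    def split_reference(text):
        """Return (namespace, target); namespace is None when the
        reference text carries no qualifier."""
        match = QUALIFIED_REFERENCE.match(text)
        if match:
            return match.group('namespace'), match.group('target')
        return None, text

An embedded URI puts the angle brackets at the end of the text, so
``target <URI>`` does not match the pattern and is returned unchanged
with a namespace of ``None``.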

Implementation
==============

In order to parse sub-documents, we need to create new parser
instances.

For now, we'll instantiate them by calling parser.__class__(); in the
long run the reader, parser, and writer parameters of the Publisher
should be turned into classes (or callbacks) instead of instances.

The Parser must know about the Reader (or about the
Publisher) and calls Reader.parse_[sub]document in order to parse
sub-documents.

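The ``parser.__class__()`` idiom can be illustrated with a stand-in
class.  The ``Parser`` class below is a toy written for this example
and is not the Docutils parser; the sketch also assumes that the
constructor can be called without arguments::

    class Parser:
        """Toy stand-in for a parser with per-instance state."""

        def __init__(self):
            self.parsed = []              # state of one parsing context

        def parse(self, text):
            self.parsed.append(text)
            return len(self.parsed)

    def new_parser_like(parser):
        """Create a fresh instance of the same class as `parser`."""
        return parser.__class__()

The new instance has the same class but shares no state with the
original, which is what an independent parsing context needs.  Note
that any constructor arguments used for the original instance are not
carried over, which is one reason to pass classes or callbacks in the
long run.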

Dumpster
========

You can stop reading now.  This section is only to archive sections
we're no longer interested in.


Rejected Proposal: Local and Global Namespace, no Qualifiers
------------------------------------------------------------

An obvious solution would be to add a notion of a file-local and a
global namespace.  When trying to resolve a reference, first the
target name is looked up in the local namespace of the current file;
if no suitable target is found there, the target name is searched for
document-wide, in the global namespace; if the target name exists and
is unique within the compound document, the reference can be resolved.

.. sidebar:: Why independent references are a good idea

   While the requirement that the compound document be processed in
   order to resolve external named references makes implementation
   easier, it is certainly worthwhile to provide for a means to
   resolve external named references without a re-run of the compound
   document for speed reasons.

   Since authoring can involve an edit-process-edit-process cycle, it
   should be possible to process files individually, rather than the
   compound document (which can be very slow).  Of course, as long as
   external named references are marked in the source file as such,
   they can, in a stand-alone pass, always be marked as "unresolvable"
   (e.g. in red) in the output, and only be resolved when the compound
   document is processed.  However, it would be even better to be able
   to actually resolve the references.

If references to the global namespace are not marked up as such,
however, the individual files are no longer processable stand-alone
because they contain unresolvable references.  While it may be
acceptable that external named cross-references do not (fully) work
any longer when a file is processed stand-alone, it would be nice to
be able to handle unresolved external references somehow (at least by
marking them as "unresolvable" in the output), rather than simply
throwing an error (see the sidebar).

This can be solved by marking external references as such, like this::

    `local target name`_
    `-> global target name`_

where "local target name" must be a unique target name within the
current file, and "global target name" must be a unique target name
within the current compound document.

(We would need to explicitly establish a notion of "stand-alone" vs.
"full document" processing in this case.  But since this proposal is
being rejected, I'm not going to explore this further.)

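For the record, the local-then-global lookup described at the start of
this section could be sketched like this in Python.  The data layout (a
dictionary mapping each file name to its own name-to-ID dictionary) and
the function name ``resolve`` are assumptions made for the example::

    def resolve(name, current_file, targets_by_file):
        """Resolve `name` in the current file first, then docset-wide.

        Returns a (file_name, target_id) pair, or None if the name is
        unknown or ambiguous within the compound document.
        """
        local_targets = targets_by_file.get(current_file, {})
        if name in local_targets:
            return current_file, local_targets[name]
        # Global lookup: the name must exist in exactly one file.
        matches = [(file_name, targets[name])
                   for file_name, targets in targets_by_file.items()
                   if name in targets]
        if len(matches) == 1:
            return matches[0]
        return None

A result of ``None`` is the case that would be rendered as
"unresolvable" in a stand-alone pass.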

Drawbacks
~~~~~~~~~

This approach turns out to have a major drawback though: External
named references depend on the context of the containing
super-document.  However, as Joaquim `pointed out`__, files should be
expected to be part of several super-documents.  This means that once
a file is put into the context of a new document, its external named
references might point to non-existing or duplicate targets.  This
seems like a maintenance problem for complex (large) collections of
documentation.

__ `different compound documents`_

Another peculiarity of this system is that, as long as a file is
processed stand-alone, external named references are not associated
with the file that defines the target.  This brings the advantage that
renaming and moving files won't invalidate reference names.  On the
downside, it lacks clarity for the reader because the file containing
a target is often not inferable from the target name (try to guess
which file ```-> html4css1`_`` links to) -- this may be significant
since reStructuredText should be readable in its source form.


Importing Namespaces
--------------------

While namespaces should generally be available without explicitly
importing them (in order to avoid lengthy headers), it would probably
be handy to have a means of inserting all targets of another namespace
into the current one (similar to Python's "from module import \*").
The disadvantage is that it may cause confusion.

Contenders for the syntax::

    .. import:: namespace   (Pythonic)

    .. import-targets:: namespace   (more verbose)

    .. using:: namespace    (like C++)

Or, provided that we use "``.. namespace:: short-name <- namespace``",
and "```namespace -> target`_``" as reference syntax, this would be a
logical fit::

    .. namespace:: <- namespace
    `-> target`_                   (instead of `namespace -> target`_)

The advantage of this syntax is that we can prohibit importing more
than one namespace, which might cause confusion.  Importing only one
namespace might be a handy shortcut though.


Caching
-------

In order to be able to regenerate the whole compound document in a
timely manner after changing a single file, it is necessary to
implement a caching system.

Processing a document is done in the following steps:

1. For each file in the docset, parse it and turn the target names
   into file-local IDs (this includes error handling for duplicate
   target names).  Cache the parse tree, the name-to-ID mapping, and
   the list of all files included using the "include" directive.  Skip
   this step for files whose cache entry date-stamp is newer than the
   file's mtime and ctime, and all included files' mtimes and ctimes.

   This means that the "subdocument" directive must be resolved at
   transform time (and not at parse time), because otherwise we cannot
   store the doctree before the sub-document has been inserted.

2. For each file, run transforms, resolving external named references
   using the cached name-to-ID mappings of other files.

3. Write out the resulting document (currently a single file).  (The
   writer needs to turn namespace/ID pairs into output-file-local
   IDs.)

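The freshness test in step 1 could look roughly like the following
Python sketch.  The function name ``cache_entry_is_fresh`` and its
parameters are hypothetical; the cache entry's date-stamp is assumed to
be a number of seconds since the epoch, as returned by ``os.stat``::

    import os

    def cache_entry_is_fresh(cache_timestamp, source_path,
                             included_paths=()):
        """Return True if the cache entry is newer than the source
        file and every file it pulled in via "include"."""
        for path in (source_path, *included_paths):
            try:
                status = os.stat(path)
            except OSError:
                return False          # missing file: re-parse to report it
            # The entry must be strictly newer than both time stamps.
            if cache_timestamp <= max(status.st_mtime, status.st_ctime):
                return False
        return True

Treating a missing file as "not fresh" is a design choice of this
sketch: it forces a re-parse, which is where the error would be
reported.
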
Processing a file stand-alone is done in the same way, except that
steps 1 and 2 are only performed for the file being processed, not for
each file in the docset.  If other files' cached name-to-ID mappings
are not up-to-date (when being accessed in step 2), they should be
automatically updated.

All cache entries should be stored in a docset cache file, in order to
avoid LaTeX-like creation of many junk files.  Possible names include
docutils.cache, docutils.aux, or either of them with a leading dot.
The file is stored in the docset root and contains a header and a large
pickle string (reading and writing even large strings of pickled data
is reasonably fast).  In the header of the cache file, store
sys.version and docutils.__version__; discard cache files that have
the wrong version.

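One possible layout for such a cache file is a single header line
followed by the pickled data.  The Python sketch below assumes a JSON
header line, which is our choice and not something decided in these
notes; ``dump_cache`` and ``load_cache`` are hypothetical names, and
the Docutils version is passed in as a plain string so that the example
runs without Docutils installed::

    import json
    import pickle
    import sys

    def _header(docutils_version):
        return {"python": sys.version, "docutils": docutils_version}

    def dump_cache(entries, stream, docutils_version):
        """Write a one-line header followed by the pickled entries."""
        line = json.dumps(_header(docutils_version))
        stream.write(line.encode("utf-8") + b"\n")
        stream.write(pickle.dumps(entries))

    def load_cache(stream, docutils_version):
        """Return the cached entries, or None if the file was written
        by a different Python or Docutils version."""
        header = json.loads(stream.readline().decode("utf-8"))
        if header != _header(docutils_version):
            return None               # wrong version: discard the cache
        return pickle.loads(stream.read())

Because the header is checked before anything is unpickled, a cache
written by another version is discarded without touching the pickle
data.
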
Potential security issue: Since unpickling is unsafe, an attacker
could provide a carefully crafted cache file, which is then
automatically picked up by Docutils.  Remedies: Insert some
unguessable system-specific key (generate randomly and store in
~/.docutils.cache.key), and automatically discard cache files that
have the wrong key.  Or simply place a big warning in the
documentation not to accept cache files from strangers.

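A variation on the key idea is to sign the pickled data with the key
instead of storing the key in the cache file itself.  Note that this
swaps in a keyed hash (HMAC), which the notes above do not mention; the
function names are hypothetical, and the key is assumed to be random
bytes (for instance from ``os.urandom(32)``) kept in the per-user key
file::

    import hashlib
    import hmac
    import pickle

    DIGEST_SIZE = hashlib.sha256().digest_size

    def _sign(payload, key):
        return hmac.new(key, payload, hashlib.sha256).digest()

    def pack(entries, key):
        """Return the signature followed by the pickled entries."""
        payload = pickle.dumps(entries)
        return _sign(payload, key) + payload

    def unpack(blob, key):
        """Return the entries, or None if the signature is wrong.

        The payload is only unpickled after the check has passed.
        """
        signature, payload = blob[:DIGEST_SIZE], blob[DIGEST_SIZE:]
        if not hmac.compare_digest(signature, _sign(payload, key)):
            return None
        return pickle.loads(payload)

A cache file crafted without knowledge of the key fails the check and
is discarded before ``pickle.loads`` is ever called.
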
No caching is done if no docset-root is defined (which means that the
file being processed is independent and not part of any compound
document).


Implementation
--------------

As described in section Caching_, when processing files stand-alone
and resolving their external named references, it may be necessary to
process (or re-process) referenced files.  Since this is during
transform-time, the parser instance is no longer available; it is
therefore necessary to create a new instance.

All requests for doctree and name-to-ID mappings should go through the
caching system.  In case of a miss, the caching system instantiates a
parser and (re-)parses the requested file.

In fact, all calls by the standalone reader to the reStructuredText
parser should go through the cache.  In the case of independent files
which are not part of a larger docset, the system always assumes a
cache miss.
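
The "everything goes through the cache" rule might be sketched as
follows in Python.  The class name ``DoctreeCache`` and the ``parse``
callback are hypothetical, and the sketch leaves out the freshness
check and on-disk storage discussed in section Caching_::

    class DoctreeCache:
        """Hand out parse results, parsing only on a cache miss."""

        def __init__(self, parse, use_cache=True):
            self._parse = parse           # callback: path -> parse result
            self._use_cache = use_cache   # False for independent files
            self._entries = {}

        def get(self, path):
            if self._use_cache and path in self._entries:
                return self._entries[path]
            # Cache miss (or caching disabled): (re-)parse the file.
            result = self._parse(path)
            if self._use_cache:
                self._entries[path] = result
            return result

With ``use_cache=False`` every request is treated as a miss, which
corresponds to the case of an independent file that is not part of a
docset.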