   1
   2   What is setext
   3 ----------------
   4
   5   The following is extracted from text written by Ian Feldman.
   6
   7   As originally explained in TidBITS#100 and mentioned there from
   8   now on, that publication now comes "wrapped as a setext." The noun
   9   itself stands for both a method to wrap (format) texts according
  10   to specific layout rules and for a single _structure_enhanced_
  11   text.  The latter is a text which has been formatted in such a
  12   fashion that it contains clues as to the typographical and logical
  13   structure of its source (word-processed) document(s), if any.
  14   Those clues, which are called "typotags," facilitate later automatic
  15   detection of that structure so it can be validated, extracted,
  16   processed, transformed, enhanced as needed, if needed.
  17
  18   It follows that setexts, being nothing but pure text (albeit with a
  19   special layout), are eminently readable using ANY editor or word
  20   processor in existence today or tomorrow, on any computer with a
  21   computer program that is capable of opening and reading text files.
  22   By default all properly setext-ized files will have an ".etx" or
  23   ".ETX" suffix.  This stands for an "emailable/enhanced text", the
  24   ExtraTerrestrial overtones notwithstanding ;-))
  25
  26   Unlike other forms of text encoding that use explicit, visible tag
  27   elements such as <this> and <\that>, the setext format relies
  28   solely on the presence of _implicit_ typotags, carefully chosen
  29   to be as visually unobtrusive as possible.  The underlined word
  30   above is one such instance of the de facto "invisible" coding.
  31   Inserted typotags will at worst appear as mere "typos" in the text.
  32
  33   [Extensions made to the original set of typotags have muddied this
  34   clarity a little bit, but they were necessary for NEdit development.]
  35
  36   Similarly, just to give an example, here is a short description
  37   of the four types of word emphasis typotags that setexts MAY
  38   contain, limited to one emphasis type ONLY per word or word group:
  39
  40  -------------------  ----------------------------  --------------
  41 !      **aBoldWord**  **multiple bold words**       ; bold-tt
  42 !_anUnderlinedWord_    _multiple_underlined_words_  ; underline-tt
  43 !    ~anItalicWord~    ~multiple italicized words~  ; italic-tt
  44 !         aHotWord_     multiple_hot_words_         ; hot-tt
  45  -----------------------------------------------------------------
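
  The four emphasis forms in the table above can be told apart by
  their delimiters alone. The following is a minimal Python sketch,
  not the setext utility's own code; the name classify_emphasis and
  the exact regular expressions are assumptions made for illustration.

```python
import re

# Patterns are tried in order; each captures the marked text without its
# typotag. They are illustrative guesses at the four forms in the table.
EMPHASIS_PATTERNS = [
    ("bold-tt", re.compile(r"^\*\*(\S(?:.*\S)?)\*\*$")),
    ("underline-tt", re.compile(r"^_(\S+)_$")),
    ("italic-tt", re.compile(r"^~(\S(?:.*\S)?)~$")),
    ("hot-tt", re.compile(r"^([^_\s]\S*)_$")),
]


def classify_emphasis(group):
    """Return (typotag name, plain text) for one marked word group, else None."""
    for name, pattern in EMPHASIS_PATTERNS:
        match = pattern.match(group)
        if match:
            text = match.group(1)
            # Underscores inside underline-tt and hot-tt groups stand for spaces.
            if name in ("underline-tt", "hot-tt"):
                text = text.replace("_", " ")
            return name, text
    return None
```

  The underline pattern is tried before the hot pattern so that a
  group with both a leading and a trailing underscore is never
  mistaken for a hot-tt.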
  46
  47  What makes a setext?
  48 ---------------------
  49
  50   Before any decoding can take place, a text must first be
  51   verified to be a setext and not some arbitrarily-wrapped
  52   stream of characters.  Although there are more ways than one to
  53   achieve that goal there is one _primary_ test that has to be
  54   passed with flying colors or else the text being tested cannot be a
  55   setext.
  56
  57   Chief among the typotags are two that signal presence of setext
  58   titles and subheads inside the text.  A setext document can be
  59   formatted more or less properly, may contain or lack any other of
  60   its "native" elements but it has to have at least one proper
  61   subhead or a title in order to be declared as "a certified
  62   setext."
  63
  64     Column 1 of text line
  65     |
  66     V
  67     Here are a few demo setext subheads:
  68     ------------------------------------
  69
  70     _ _ _ _ Which Share Just One _ _ _ _
  71     ------------------------------------
  72
  73             ----------> UnifyinG FeaturE
  74     ------------------------------------
  75
  76     of EQUAL RIGHTMOST VISIBLE character
  77     ------------------------------------
  78
  79       length as that of its subhead-tt's
  80     ------------------------------------
  81
  82     [this line is called subhead-string]
  83     ------------------------------------
  84
  85     [the one below is called subhead-tt]
  86     ------------------------------------
  87
  88     [together they make a valid subhead]
  89     ------------------------------------
  90
  91        (!) and of course, subheads do not have to be of the same length ;-)
  92     -----------------------------------------------------------------------
  93
  94       (nor have to begin in column 1)
  95     ---------------------------------
  96
  97      although it is recommended that they stay below 40 characters
  98     --------------------------------------------------------------
  99
 100         Second Setext In This File
 101     ==============================
 102
 103      ((end of examples))
 104      -------------------
 105      ((_not_ a subhead))
 106     ^
 107     |
 108     Column 1 of text line
 109
 110   Note, the last 3 lines of the examples do not constitute a valid
 111   subhead because they do not start in column 1.
 112
 113   Chief among the reasons why one should first look for presence of
 114   subheads rather than titles is that it is fully conceivable that a
 115   setext might have been created without an explicit title-tt in
 116   order to allow decoder programs to distinguish between part one
 117   and any subsequent ones in a possible multi-part mailing. This
 118   absence of a title-tt could be enough of a signal to start looking
 119   for possible "part x of y" message in either the subject line,
 120   filename or anywhere "above" the first detected subhead of the
 121   current text.
 122
 123   Therefore, here's a formal definition of what makes a setext:
 124
 125  +-------------------------------------------------------------+
 126  |  a text that contains at least one verified setext subhead  |
 127  |  or setext title                                            |
 128  +-------------------------------------------------------------+
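
  The definition above can be checked mechanically. The sketch below
  is one possible reading of the rules given so far (underline starts
  in column 1, consists only of '-' or '=', and is as long as the
  heading text up to its rightmost visible character). The function
  names find_headings and is_setext are hypothetical.

```python
def find_headings(lines):
    """Return (line index, kind, text) for each title or subhead found."""
    found = []
    for i in range(1, len(lines)):
        tag = lines[i].rstrip()
        if not tag:
            continue
        first = tag[0]
        # The typotag line must start in column 1 and repeat one character.
        if first not in "-=" or set(tag) != {first}:
            continue
        string = lines[i - 1].rstrip()
        if not string.strip():
            continue
        # Rightmost visible character of the string must line up with the tag.
        if len(string) != len(tag):
            continue
        kind = "title" if first == "=" else "subhead"
        found.append((i - 1, kind, string.strip()))
    return found


def is_setext(text):
    """A text is a setext if it has at least one verified subhead or title."""
    return bool(find_headings(text.splitlines()))
```

  Because only trailing whitespace is stripped from the typotag line,
  an indented row of dashes fails the column 1 test, as in the last
  demo example above.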
 129
 130  Other considerations
 131 ---------------------
 132
 133   A possibility arises to keep the paragraph text unwrapped, rather
 134   than folded uniformly at say the 66th character mark.  After all,
 135   if the setext is primarily to be displayed inside an editor,
 136   rather than on an 80 character terminal screen, then there is not
 137   much sense in prior folding of the lines to a specific
 138   guaranteed-to-fit-on-a-TTY-screen length.  The editor/word
 139   processor program will fit the unwrapped text to the available
 140   display area, and might actually prefer to have to deal with
 141   whole unwrapped paragraphs rather than with otherwise relatively
 142   short lines.
 143
 144   Most text-processing programs with native word-wrap capabilities
 145   actually consider return-terminated lines to be paragraphs in
 146   their own right.  Thus, if a setext is not to travel via email
 147   anyway (because of it being distributed differently or making use
 148   of accented characters) then it might as well arrive in unfolded
 149   state so that no extra time need be spent on making the
 150   paragraphs "whole again." [This is not the choice that is taken
 151   with NEdit help because it is easier to visualize the final text
 152   for those who do not use text wrapping.]
 153
 154   Observe that it is not the state of the paragraph text that makes
 155   or breaks a setext.  No, the sole criterion of whether a text is
 156   a setext is the presence of at least one verified subhead, as
 157   described above. Thus even texts with unfolded paragraphs are
 158   setexts if they contain at least one subhead-tt.
 159
 160   The sole mechanism used in setext to encode which of such lines
 161   are in reality paragraphs (as opposed to those that shouldn't be
 162   folded mechanically) is the character indent.  In fact, after the
 163   subhead-tt the second most important typotag is the indent-tt,
 164   made up of exactly two space characters, which denotes any such
 165   indented lines as ready-candidates for reflowing by so inclined
 166   front-ends (either on their own or as part of like-indented lines
 167   above and below it).  So any potentially long line of a setext
 168   that has been indent-tted will be understood (by any validated
 169   setext front-end) as to be ready for wrapping-to-length if so
 170   required.
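
  A front-end could use the indent-tt to rebuild whole paragraphs as
  sketched below. This is an illustration under the assumption that
  only lines with exactly two leading spaces qualify; the function
  name unfold is hypothetical.

```python
def unfold(lines):
    """Join runs of indent-tt lines into single undented paragraph lines."""
    out = []
    para = []

    def flush():
        # Emit the paragraph collected so far, if any.
        if para:
            out.append(" ".join(para))
            para.clear()

    for line in lines:
        # Exactly two spaces followed by a visible character marks an indent-tt.
        if line.startswith("  ") and len(line) > 2 and not line[2].isspace():
            para.append(line[2:].strip())
        else:
            flush()
            out.append(line)
    flush()
    return out
```

  Lines indented more deeply, empty lines, and lines starting in
  column 1 are passed through untouched and end the current paragraph.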
 171
 172 .. All the following document by Steven Haehn
 173
 174 Typotags Available
 175 ------------------
 176
 177   The following table contains typotags recognized by the setext
 178   utility. The "setext form" column in the table is formatted such
 179   that the left most character of the column represents the first
 180   character in a line of setext. The circumflex character (^) means
 181   that the characters of the typotag are significant only when they
 182   are anchored to the front of the setext line. Typotags marked
 183   with an asterisk (*) are extensions added for NEdit help
 184   generation.
 185
 186 !! ============   ===================  ==================
 187 !!      name of   setext form          acted upon or
 188 !!  the typotag   of typotag           displayed as
 189 !! ============   ===================  ==================
 190 !!     title-tt  "Title                a title
 191 !!                ====="               in chosen style
 192 !! ------------   -------------------  ------------------
 193 !!   subhead-tt  "Subhead              a subhead
 194 !!                -------"             in chosen style
 195 !! ------------   -------------------  ------------------
 196 !!   section-tt  ^#> section-text      a section heading
 197 !!                                     with '#' from 1..9
 198 !!                                     in chosen style
 199 !! ------------   -------------------  ------------------
 200 !!    indent-tt  ^  lines indented     lines undented
 201 !!               ^  by 2 spaces        and unfolded
 202 !! ------------   -------------------  ------------------
 203 !!      bold-tt       **[multi]word**  1+ bold word(s)
 204 !!    italic-tt          ~multi word~  1+ italic word(s)
 205 !! underline-tt        [_multi]_word_  underlined text
 206 !!       hot-tt         [multi_]word_  1+ hot word(s)
 207 !!     quote-tt  ^>[space][text]       > [mono-spaced]
 208 !!    bullet-tt  ^*[space][text]       [bullet] [text]
 209 !!   untouch-tt   `_quoted typotag!_`  `_left alone!_`
 210 !!   notouch-tt* ^!followed by text    text-left-alone
 211 !!     field-tt*     |>name[=value]<|  value of name
 212 !!      line-tt* ^   ---               horizontal rule
 213 !! ------------   -------------------  ------------------
 214 !!      href-tt* ^.. _word URL         jump to address
 215 !!      note-tt  ^.. _word Note:("*")  ("cause error")
 216 !!    target-tt*     _[multi_]word     [multi ]word
 217 !! ------------   -------------------  ------------------
 218 !!   twobuck-tt   $$ [last on a line]  [parse another]
 219 !!  suppress-tt  ^..[space][not dot]   [line hidden]
 220 !!    twodot-tt  ^..[alone on a line]  [taken note of]
 221 !! ------------   -------------------  ------------------
 222 !!     maybe-tt* ^.. ? name[~] text    show text when
 223 !!                                     name defined
 224 !!  maybenot-tt* ^.. ! name[~] text    show text when
 225 !!                                     name NOT defined
 226 !!  endmaybe-tt* ^.. ~ name            end of a multi-
 227 !!                                     line maybe[not]-tt
 228 !! ------------   -------------------  ------------------
 229 !!  passthru-tt* ^!![text]             text emitted
 230 !!                                     without processing
 231 !! ------------   -------------------  ------------------
 232 !!    escape-tt*  @x where 'x'  is     x is what remains
 233 !!                escaped character    @@ needed for 1 @
 234 !! ============   ===================  ==================
 235 !!
 236
 237   The title-tt, subhead-tt and indent-tt have already been
 238   discussed at length in the previous sections. All typotag
 239   elements, but the subhead-tt, are optional, that is, not
 240   necessary for a setext to be declared as such. The simple
 241   character marking typotags, bold-tt, italic-tt, and underline-tt
 242   have been used throughout the document and are used to mark text
 243   with their obvious meanings.
 244
 245 3>Section-tt (document divisions)
 246
 247   The section-tt allows subdividing of the setext into further
 248   subsections for greater nesting capability. Typical usage starts
 249   the numbering level at 3 because the title-tt and subhead-tt
 250   basically represent sections 1 and 2, respectively.
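
  Recognizing a section-tt amounts to matching a single digit from 1
  to 9 followed by '>' at the front of the line. A minimal sketch,
  with the hypothetical helper name parse_section:

```python
import re

# One digit 1..9 anchored at column 1, then '>', then the section text.
SECTION = re.compile(r"^([1-9])>\s*(.*)$")


def parse_section(line):
    """Return (level, section text) for a section-tt line, else None."""
    match = SECTION.match(line)
    if match is None:
        return None
    return int(match.group(1)), match.group(2).rstrip()
```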
 251
 252 3>Bullet-tt (list marker)
 253
 254   The bullet-tt typotag is used to create a list of items. Note that
 255   it can only be used to create single line entries, like the
 256   following:
 257
 258     Column 1 of text line
 259     |
 260     V
 261     * This is the first bullet.
 262     * This is the second bullet.
 263
 264   Remember that you have to insert empty lines immediately before
 265   and after the bullet list.
 266
 267 3>Untouch-tt, Notouch-tt, Passthru-tt, Escape-tt (quoting text)
 268
 269   Each one of these leave-my-text-alone typotags offers varying
 270   degrees of operation. The untouch-tt surrounds the text that
 271   is not to be interpreted. The accent grave (`) character is
 272   used to start and finish the untouchable text. (An extension
 273   to this has been allowed in the setext utility. An untouch-tt
 274   may be terminated by an apostrophe (').) The following are
 275   all valid untouch-tt typotags.
 276
 277     `this is the _original_ version of the untouch-tt`
 278     `this is the _extended_ form of the untouch-tt'
 279     `This couldn't _be_ a problem could it?'
 280
 281   Note that the third example has used the contraction "couldn't"
 282   which did not terminate the untouch-tt because the apostrophe was
 283   not followed by whitespace or punctuation.
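
  The termination rule can be expressed with a lookahead. The sketch
  below is an approximation only: the set of punctuation characters
  accepted after the closing apostrophe is an assumption, and the
  name untouched_spans is hypothetical.

```python
import re

# Opening accent grave, lazily matched body, then either a closing accent
# grave or an apostrophe followed by whitespace, punctuation, or the end
# of the line. The punctuation set is an assumption for illustration.
UNTOUCH = re.compile(r"`(.+?)(?:`|'(?=[\s.,;:!?)]|$))")


def untouched_spans(line):
    """Return the text protected by each untouch-tt found in the line."""
    return [match.group(1) for match in UNTOUCH.finditer(line)]
```

  The apostrophe in "couldn't" is followed by a letter, so the
  lookahead fails there and the match continues to the real end.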
 284
 285   The notouch-tt typotag is used to take care of entire lines of
 286   text. The difference between this and the untouch-tt is that there
 287   is no visual residual typotag mark left in the output. It is
 288   replaced by a space. For example,
 289
 290     Column 1 of text line
 291     |
 292     V
 293     ! This line of text will look like this sans the ! in column 1.
 294
 295   becomes,
 296
 297       This line of text will look like this sans the ! in column 1.
 298
 299   The passthru-tt differs subtly from the notouch-tt:
 300   its markers are not replaced with a space but removed
 301   entirely. (The original use was to emit special 'C'
 302   compiler directives directly into the help code
 303   product.) Thus,
 304
 305     Column 1 of text line
 306     |
 307     V
 308     !!#ifdef VMS
 309
 310   would turn into
 311
 312     #ifdef VMS
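
  The two whole-line forms differ only in what replaces the marker.
  A minimal sketch of that difference, using the hypothetical name
  strip_line_marker:

```python
def strip_line_marker(line):
    """Apply the passthru-tt or notouch-tt rule to one line."""
    if line.startswith("!!"):
        # passthru-tt: both markers are removed entirely.
        return line[2:]
    if line.startswith("!"):
        # notouch-tt: the marker is replaced by a single space.
        return " " + line[1:]
    return line
```

  The passthru-tt test has to come first, since every passthru-tt
  line also begins with the notouch-tt marker.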
 313
 314   The escape-tt (@) is used to escape the special markers of
 315   the other typotags and itself. Here is an example of escaping
 316   itself.
 317
 318     develop@@nedit.org
 319
 320   This will become "develop@nedit.org" in resulting documents.
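
  In isolation, removing the escape-tt is a one-line substitution.
  The sketch below shows only that step; a real translator would have
  to apply it together with the other typotags, which is not shown
  here. The name unescape is hypothetical.

```python
import re


def unescape(text):
    """Drop each '@' and keep the character it escapes."""
    return re.sub(r"@(.)", r"\1", text)
```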
 321
 322
 323 3>Suppress-tt, Twodot-tt (author annotations or comments)
 324
 325   The suppress-tt typotag allows an author to place annotations in a
 326   setext document which will not appear in a generated product. Most
 327   of the extensions to the original setext definition were placed
 328   inside this form of typotag.
 329
 330     Column 1 of text line
 331     |
 332     V
 333     .. This is a document comment that would normally disappear
 334     .. from generated text, html, or the like. These lines are
 335     .. what constitute a suppress-tt. The following line is the
 336     .. twodot-tt.
 337     ..
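
  Following the forms given in the typotag table, the two can be told
  apart as sketched below. The helper name classify_dot_line is
  hypothetical, and extension typotags that live inside a suppress-tt
  (such as the href-tt) are reported here simply as suppress-tt.

```python
def classify_dot_line(line):
    """Return 'twodot-tt', 'suppress-tt', or None for one line."""
    stripped = line.rstrip()
    if stripped == "..":
        # Two dots alone on a line.
        return "twodot-tt"
    # Two dots, a space, then any character that is not a dot.
    if len(stripped) >= 4 and stripped.startswith(".. ") and stripped[3] != ".":
        return "suppress-tt"
    return None
```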
 338
 339 3>Hot-tt, Href-tt, Target-tt (hyperlinking text)
 340
 341   These three typotags are used in conjunction to create the
 342   hypertext reference mechanism used in HTML and NEdit
 343   help code generation. The hot-tt is an original typotag which
 344   needed the additional two tags to be able to create actual hyperlinks
 345   to other sections of the document, or to external references that
 346   could be exploited. These tags are ignored (stripped) when
 347   generating simple text documents.
 348
 349   The hot-tt typotag is used to mark the text which would be used as
 350   the doorway to accessing other parts of the document. It either
 351   references a title or subhead string directly, or an href-tt. An
 352   href-tt (hypertext reference typotag) is used as an intermediary
 353   for the hyperlink destination. Its value either specifies an
 354   external document reference, or an internal document reference.
 355   The target-tt is used to mark the internal document references
 356   mentioned in a href-tt.
 357
 358   Now for some examples. All the marked text will be inside
 359   parentheses so that it is clear exactly what is being
 360   marked.
 361
 362   This hot-tt directly references the (Typotags_Available_)
 363   subheading above. Whereas, the following hot-tt (references_)
 364   the href-tt marked by this target-tt (_typotag).
 365
 366   Here is what the href-tt would look like:
 367
 368     Column 1 of text line
 369     |
 370     V
 371 !   .. _references #typotag
 372
 373 .. The following line is the actual hypertext reference in this
 374 .. document. This annotation is an example of suppress-tt usage.
 375 .. _references #typotag
 376
 377 3>Maybe[not]-tt, Endmaybe-tt (conditional text regions)
 378
 379   Multiple line maybe-tt or maybenot-tt (conditional text regions)
 380   are introduced as follows:
 381
 382     Column 1 of text line
 383     |
 384     V
 385     .. ? name~   (this is the maybe-tt)
 386     .. ! name~   (this is the maybenot-tt)
 387
 388   Both are terminated with an endmaybe-tt on a separate line.
 389
 390     Column 1 of text line
 391     |
 392     V
 393     .. ~ name
 394
 395   The name* of the conditional region is left up to the text
 396   author.  Single line maybe[not]-tt typotags do not use the '~'
 397   character at the end of the name and are terminated at the end
 398   of the line.
 399
 400     Column 1 of text line
 401     |
 402     V
 403     .. ? oneLine (This is a one line maybe-tt)
 404     .. ! oneLine (This is a one line maybenot-tt)
 405
 406   * There are some predefined conditional region names that are
 407   already known to the setext parser: html, text, and (NEdit) help.
 408   The special conditional text region named "html" allows a mixture
 409   of setext and HTML tags.
 410
 411   Nesting of conditional text regions is allowed. For instance, if
 412   there are three conditional regions, A, B, and C, C can be nested
 413   inside B, which can be nested inside A. For example,
 414   A-B-C...C-B-A.
 415
 416     Column 1 of text line
 417     |
 418     V
 419     .. ? A~    Example of legally nested conditional text regions
 420     .. ? B~
 421     .. ? C~
 422     .. ~ C
 423     .. ~ B
 424     .. ~ A
 425
 426   Note that a surrounding region cannot end before one of its inner
 427   regions is terminated (e.g., the nesting A-B-C...C-A-B is illegal
 428   because A is terminated before B).
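
  The nesting rule is the usual last-opened, first-closed discipline
  and can be checked with a stack. The sketch below is an assumption
  about how such a check might look, not the setext parser's code; the
  name check_nesting is hypothetical, and treating a region left open
  at the end of the input as an error is also an assumption.

```python
import re

# Multi-line maybe-tt / maybenot-tt: name ends with '~'.
OPEN = re.compile(r"^\.\. [?!] (\w+)~")
# endmaybe-tt: '~' comes before the name.
CLOSE = re.compile(r"^\.\. ~ (\w+)\s*$")


def check_nesting(lines):
    """Return True if all multi-line conditional regions nest legally."""
    stack = []
    for line in lines:
        opened = OPEN.match(line)
        if opened:
            stack.append(opened.group(1))
            continue
        closed = CLOSE.match(line)
        if closed:
            # Only the innermost open region may be terminated.
            if not stack or stack[-1] != closed.group(1):
                return False
            stack.pop()
    return not stack
```

  Single line maybe[not]-tt typotags carry no '~' after the name, so
  they match neither pattern and do not affect the stack.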
 429
 430 3>Field-tt (variable definition and substitution)
 431
 432   Field-tt typotags are used to define variables and reference
 433   their values. Field definitions can only occur within a
 434   suppress-tt.
 435
 436   For example, to define the variable 'author' and fill it
 437   with the value "Steven Haehn":
 438
 439     Column 1 of text line
 440     |
 441     V
 442     .. |>author=Steven Haehn<|
 443
 444   To use the value of the defined variable, place the field-tt,
 445   |>author<|, in any printable text region. If there is no known
 446   value for the field, it will remain unchanged and appear as
 447   written in the setext.
 448
 449   The following are predefined for use in a field-tt
 450   for any setext document translated by the setext utility.
 451
 452     Date = <MonthName day, year>         (eg. December 6, 2001)
 453     date = <MonthAbbreviation day, year> (eg. Dec 6, 2001)
 454     year = <year>                        (eg. 2001)
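
  Definition and substitution can be sketched in a few lines. This is
  an illustration only: the function name expand_fields and its
  "fields" parameter are hypothetical, and the predefined Date, date
  and year values are not computed here but could be supplied through
  that parameter.

```python
import re

# A definition sits inside a suppress-tt: ".. |>name=value<|".
DEFINE = re.compile(r"^\.\. \|>(\w+)=(.*?)<\|\s*$")
# A reference may appear anywhere in printable text: "|>name<|".
USE = re.compile(r"\|>(\w+)<\|")


def expand_fields(lines, fields=None):
    """Collect field definitions and substitute known field references."""
    known = dict(fields or {})
    out = []
    for line in lines:
        definition = DEFINE.match(line)
        if definition:
            known[definition.group(1)] = definition.group(2)
            continue  # the defining suppress-tt line is not emitted
        # Unknown fields are left exactly as written.
        out.append(USE.sub(lambda ref: known.get(ref.group(1), ref.group(0)), line))
    return out
```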
 455
 456 3>Line-tt (horizontal rule demarcation)
 457
 458   This typotag is used to place horizontal markers into generated
 459   text documents, like the following:
 460
 461    Column 4 of text line
 462    |
 463    V
 464    -------------------------------------------------------------
 465
 466 3>Twobuck-tt (setext termination marker)
 467
 468   This typotag is used to mark the end of document parsing.
 469
 470  $$
 471
 472 $Id: setext-info.txt,v 1.3 2002/09/26 12:37:38 ajhood Exp $