From f02f9795dfffb9e6394cd117bdb0847f38b9508f Mon Sep 17 00:00:00 2001 From: milde Date: Mon, 5 Dec 2011 19:35:32 +0000 Subject: [PATCH] Fix [ 3402314 ] non-ASCII whitespace and punctuation around inline markup. Revision of the rules for allowed characters around inline markup start-string and end-string: Keep the carefully crafted ASCII-character set but add Unicode categories to the sets of allowed characters. This keeps the number of "false positives" requiring escaping low while making the rules simpler and international. This is a feature change. git-svn-id: https://docutils.svn.sourceforge.net/svnroot/docutils/trunk/docutils@7243 929543f6-e4f2-0310-98a6-ba3bd3dd1d04 --- HISTORY.txt | 4 +- docs/dev/todo.txt | 188 ++---------------- docs/ref/rst/restructuredtext.txt | 134 ++++++------- docutils/parsers/rst/punctuation_chars.py | 211 ++++++++++++++++++++ docutils/parsers/rst/states.py | 45 ++--- test/test_parsers/test_rst/test_inline_markup.py | 237 ++++++++++++++++++----- 6 files changed, 508 insertions(+), 311 deletions(-) create mode 100644 docutils/parsers/rst/punctuation_chars.py diff --git a/HISTORY.txt b/HISTORY.txt index f36241a17..18e705ef4 100644 --- a/HISTORY.txt +++ b/HISTORY.txt @@ -37,8 +37,8 @@ Changes Since 0.8.1 * docutils/parsers/rst/states.py - - Allow also non-ASCII whitespace characters around inline markup. - (first part of fix for [ 3402314 ]). + - Fix [ 3402314 ] allow non-ASCII whitespace, punctuation + characters and "international" quotes around inline markup. * docutils/parsers/rst/tableparser.py diff --git a/docs/dev/todo.txt b/docs/dev/todo.txt index 6611ea854..bb9a6637e 100644 --- a/docs/dev/todo.txt +++ b/docs/dev/todo.txt @@ -825,10 +825,6 @@ Misc See . -* Change the specification so that more punctuation is allowed - before/after inline markup start/end string - (http://article.gmane.org/gmane.text.docutils.cvs/3824). 
- * Complain about bad URI characters (http://article.gmane.org/gmane.text.docutils.user/2046) and disallow internal whitespace @@ -1129,150 +1125,32 @@ Misc Inline markup recognition rules ------------------------------- -Allow unicode whitespace and punctuation around `inline markup`_. See bug -http://sourceforge.net/tracker/?func=detail&aid=3402314&group_id=38414&atid=422030 -and the older discussion -. - -The rules are currently *complicated* (rules, exceptions, -explicite character lists, exceptions of exceptions) and *incomplete*: Many -non-ASCII characters are missing in the inline markup start-string and -end-string recognition rules. Use cases like »German ›angular‹ quotes« are -not recognized. +The `inline markup`_ recognition rules were devised intentionally to allow +90% of non-markup uses of "*", "`", "_", and "|" *without* resorting to +backslashes. For 9% of the remaining 10%, use inline literals or literal +blocks. Only those who understand the escaping and inline markup rules +should attempt the remaining 1%. ;-) .. _inline markup: ../ref/rst/restructuredtext.html#inline-markup -Proposal -```````` - -Define character classes based on `Unicode categories`_, possibly with some -exceptions (for backwards compatibility or based on use cases) and use them -in the inline markup start-string and end-string recognition rules. - -The following sub-section is intended to replace the 5 inline markup rules in -the reStructuredText Markup Specification's section on `inline markup`_. -The composition of the character classes is open for discussion_. - -The actual change needs to be done in `parsers.rst.states.Inliner`. 
- -Inline markup syntax rules -~~~~~~~~~~~~~~~~~~~~~~~~~~ - -The inline markup start-string and end-string recognition rules distinguish -the following character classes based on `Unicode categories`_: - -_`Whitespace`: - :Zs: Separator, Space - :Zl: Separator, Line - - :Zp: Separator, Paragraph - - Exception: Non-breaking spaces count as Delimiters_, they may - immediately follow a start-string or precede an end-string. - - :[ ]: U+00A0, NO-BREAK SPACE - :[ ]: U+202F, NARROW NO-BREAK SPACE - -_`Open`: - :Ps: Punctuation, Open - :Pi: Punctuation, Initial quote - :Pf: Punctuation, Final quote [#PiPf]_ - :<: U+003C, LESS-THAN SIGN [#ltgt]_ - -_`Close`: - :Pe: Punctuation, Close - :Pf: Punctuation, Final quote - :Pi: Punctuation, Initial quote [#PiPf]_ - :>: U+003E, GREATER-THAN SIGN - -_`Delimiters`: - :Pd: Punctuation, Dash - :Po: Punctuation, Other [#Po]_ - :[ ]: U+00A0, NO-BREAK SPACE - :[ ]: U+202F, NARROW NO-BREAK SPACE - -If any of the following conditions are not met, the start-string or -end-string will not be recognized or processed: - -1. Inline markup start-strings must start a text block or be immediately - preceded by a character of the classes Whitespace_, Open_, or - Delimiters_. - -2. Inline markup start-strings must not be followed by Whitespace_. - -3. Inline markup end-strings must not be preceded by Whitespace_. - -4. Inline markup end-strings must end a text block or be immediately - followed by a character of the classes Whitespace_, Close_, or - Delimiters_. - -5. If an inline markup start-string is immediately preceded by a - single or double quote or a character from Open_, it must not be - immediately followed by a corresponding single or double quote or - character from Close. - -6. An inline markup end-string must be separated by at least one - character from the start-string. - -7. An unescaped backslash preceding a start-string or end-string will - disable markup recognition, except for the end-string of `inline - literals`_. 
See `Escaping Mechanism`_ above for details. - - -Discussion -`````````` - -The current markup recognition rules deviate from the above proposal in some -cases "to allow 90% of non-markup uses of "*", "`", "_", and "|" without -resorting to backslashes". - -The above proposal aims to catch 85% of non-markup uses with simpler -rules and enable additional markup uses (e.g. »German ›angular‹ quotes«) -without escaping. It breaks backwards compatibility in some cases. -However, if this is "the right thing", it should be done **now**, as long -as the project is still "beta". - -Character classifications in need of discussion: - -.. [#PiPf] Pi (Punctuation, Initial quote) characters are "usually - closing, sometimes opening". Pf (Punctuation, Final quote) characters - are "usually closing, sometimes opening". I.e., both Pi and Pf may - behave like Ps or Pe depending on usage. The current implementation - sorts them into Open_ and Close_. - Adding Pf to Close_ and Pi to Open_ solves e.g. the problem with - »German ›angular‹ quotes«. - -.. [#ltgt] ``<`` and ``>`` belong to the Unicode category Ms (Symbols, Math). - The current implementation sorts them into Open_ and Close_ because of - their use as angular brackets in ASCII markup. - -.. [#Po] The ``Po`` characters ``.,;!?`` are usually followed by - whitespace. The backslash ``\`` is rarely used in front of marked-up - text. The current implementation sorts these characters into Close_. - - The Po characters ``¡¿`` open a sentence. The current - implementation sorts them into Open_. +Changes need to be done in `parsers.rst.states.Inliner`. 
Alternatives -```````````` -a) The proposal_ above: +a) Use `Unicode categories`_ for all chars (ASCII or not) - +1 truly international (considering characters of all writing systems - recorded in Unicode) - +2 simpler specification of the rules - -1 more complicated implementation + +1 comprehensible, standards based, + -1 many "false positives" need escaping, + -1 not backwards compatible. -b) Backwards compatibility +b) full backwards compatibility - :Pi: into Open_ - :Pf: into Close_ + :Pi: only before start-string + :Pf: only behind end-string :Po: "conservative" sorting of other punctuation: - :``.,;!?\``: Close_ - :````¡¿``: Open_ - - Are there more? + :``.,;!?\\``: Close + :``¡¿``: Open +1 backwards compatible, +1 logical extension of the existing rules, @@ -1280,41 +1158,9 @@ b) Backwards compatibility -1 rules even more complicated, -1 not clear how to sort "other" punctuation that is currently not recognized, - -2 use cases like »German ›angular‹ quotes« not recognized. + -2 international quoting convention like + »German ›angular‹ quotes« not recognized. -c) Simple rule: merge Open_, Close, and Delimiters_ - - Whitespace_, Open_, Close_, and Delimiters_ may all precede or follow - inline markup. - - +3 very comprehensible, - -1 false positives need escaping, - -2 not backwards compatible. - -Implementation -`````````````` - -Some ideas for implementing the above rules: - -David's regexp to match whitespace but keep NO-BREAK spaces as "invisible -escape":: - - u'(?![\xa0\u202f])\\s', re.UNICODE - -For punctuation, check `Unicode categories`_ with -``unicodedata.category(ch)`` -(http://bytes.com/topic/python/answers/854011-identifying-unicode-punctuation-characters-python-regex) -and generate a pattern string, e.g. :: - - chars_open = u''.join(unichr(x) for x in range(74868) - if unicodedata.category(unichr(x)) in ('Ps', 'Pi', 'Pf') - -Do this in the setup script and use the resulting string literal? -(Avoids re-calculation with every parsing run.) - -.. 
_inline markup: ../ref/rst/restructuredtext.html#inline-markup -.. _inline literals: ../ref/rst/restructuredtext.html#inline-literals -.. _escaping mechanism: ../ref/rst/restructuredtext.html#escaping-mechanism .. _Unicode categories: http://www.unicode.org/Public/5.1.0/ucd/UCD.html#General_Category_Values diff --git a/docs/ref/rst/restructuredtext.txt b/docs/ref/rst/restructuredtext.txt index 71fab2ba7..59679e548 100644 --- a/docs/ref/rst/restructuredtext.txt +++ b/docs/ref/rst/restructuredtext.txt @@ -2368,39 +2368,23 @@ Three constructs use different start-strings and end-strings: `Standalone hyperlinks`_ are recognized implicitly, and use no extra markup. -The inline markup start-string and end-string recognition rules are as -follows. If any of the conditions are not met, the start-string or -end-string will not be recognized or processed. +Inline markup recognition rules +------------------------------- -1. Inline markup start-strings must start a text block or be - immediately preceded by whitespace, one of the ASCII - characters ``' " ( [ { <``, or the Unicode characters: - - .. class:: borderless - - === ========================================================== - ‘ (U+2018, left single-quote) - “ (U+201C, left double-quote) - ’ (U+2019, right single-quote, or apostrophe) - « (U+00AB, left guillemet, or double angle quotation mark) - ¡ (U+00A1, inverted exclamation mark) - ¿ (U+00BF, inverted question mark) - === ========================================================== - - The ASCII characters ``- / :`` and the Unicode characters - - .. 
class:: borderless +Inline markup start-strings and end-strings are only recognized if all of +the following conditions are met: - === ========================================================== - ‐ (U+2010, hyphen) - ‑ (U+2011, non-breaking hyphen) - ‒ (U+2012, figure dash) - – (U+2013, en dash) - — (U+2014, em dash) - [ ] (U+00A0, non-breaking space [between the brackets]) - === ========================================================== +1. Inline markup start-strings must start a text block or be + immediately preceded by - are _`delimiters`. They may precede or follow inline markup. + * whitespace, + * one of the ASCII characters ``- : / ' " < ( [ {`` or + * a non-ASCII punctuation character with `Unicode category`_ + `Pd` (Dash), + `Po` (Other), + `Ps` (Open), + `Pi` (Initial quote), or + `Pf` (Final quote) [#PiPf]_. 2. Inline markup start-strings must be immediately followed by non-whitespace. @@ -2409,26 +2393,22 @@ end-string will not be recognized or processed. non-whitespace. 4. Inline markup end-strings must end a text block or be immediately - followed by whitespace, the ASCII characters - ``' " ) ] } > . , ; ! ? \``, the Unicode characters: - - .. class:: borderless - - === ========================================================== - ’ (U+2019, right single-quote, or apostrophe) - ” (U+201D, right double-quote) - » (U+00BB, right guillemet, or double angle quotation mark) - === ========================================================== - - or the `delimiters`_ listed in (1) above. - -5. If an inline markup start-string is immediately preceded by a - single or double quote, "(", "[", "{", or "<", it must not be - immediately followed by the corresponding single or double quote, - ")", "]", "}", or ">". - - .. this also holds for the opening/closing Unicode character pairs - (since at least 05. Sep 2008). + followed by + + * whitespace, + * one of the ASCII characters ``- . , : ; ! ? 
\ / ' " ) ] } >`` or + * a non-ASCII punctuation character with `Unicode category`_ + `Pd` (Dash), + `Po` (Other), + `Pe` (Close), + `Pf` (Final quote), or + `Pi` (Initial quote) [#PiPf]_. + +5. If an inline markup start-string is immediately preceded by one of the + ASCII characters ``' " < ( [ {``, or a character with Unicode character + category `Ps`, `Pi`, or `Pf`, it must not be followed by the + corresponding [#corresponding-quotes]_ closing character from + ``' " ) ] } >`` or the categories `Pe`, `Pf`, or `Pi`. 6. An inline markup end-string must be separated by at least one character from the start-string. @@ -2437,32 +2417,52 @@ end-string will not be recognized or processed. disable markup recognition, except for the end-string of `inline literals`_. See `Escaping Mechanism`_ above for details. -For example, none of the following are recognized as containing inline -markup start-strings: +.. [#PiPf] `Pi` (Punctuation, Initial quote) characters are "usually + opening, sometimes closing". `Pf` (Punctuation, Final quote) + characters are "usually closing, sometimes opening". + +.. [#corresponding-quotes] For quotes, corresponding characters can be + any of the `quotation marks in international usage`_. + +.. _Unicode category: + http://www.unicode.org/Public/5.1.0/ucd/UCD.html#General_Category_Values + +.. _quotation marks in international usage: + http://en.wikipedia.org/wiki/Quotation_mark,_non-English_usage + +The inline markup recognition rules were devised to allow 90% of non-markup +uses of "*", "`", "_", and "|" without escaping. For example, none of the +following terms are recognized as containing inline markup strings: -- asterisks: * "*" '*' (*) (* [*] {*} 1*x BOM32_* -- double asterisks: ** a**b O(N**2) etc. -- backquotes: ` `` etc. -- underscores: _ __ __init__ __init__() etc. -- vertical bars: | || etc. 
+- 2*x a**b O(N**2) e**(x*y) f(x)*f(y) a|b file*.* (breaks 1) +- 2 * x a ** b (* BOM32_* ` `` _ __ | (breaks 2) +- "*" '|' (*) [*] {*} <*> + ‘*’ ‚*‘ ‘*‚ ’*’ ‚*’ + “*” „*“ “*„ ”*” „*” + »*« ›*‹ «*» »*» ›*› (breaks 5) +- || (breaks 6) +- __init__ __init__() -It may be desirable to use inline literals for some of these anyhow, +No escaping is required inside the following inline markup examples: + +- *2 * x *a **b *.txt* (breaks 3) +- *2*x a**b O(N**2) e**(x*y) f(x)*f(y) a*(1+2)* (breaks 4) + +It may be desirable to use `inline literals`_ for some of these anyhow, especially if they represent code snippets. It's a judgment call. These cases *do* require either literal-quoting or escaping to avoid -misinterpretation:: +misinterpretation: - *4, class_, *args, **kwargs, `TeX-quoted', *ML, *.txt + \*4, class\_, \*args, \**kwargs, \`TeX-quoted', \*ML, \*.txt -The inline markup recognition rules were devised intentionally to -allow 90% of non-markup uses of "*", "`", "_", and "|" *without* -resorting to backslashes. For 9 of the remaining 10%, use inline -literals or literal blocks:: +In most use cases, `inline literals`_ or `literal blocks`_ are the best +choice (by default, this also selects a monospaced font):: - "``\*``" -> "\*" (possibly in another font or quoted) + *4, class_, *args, **kwargs, `TeX-quoted', *ML, *.txt -Only those who understand the escaping and inline markup rules should -attempt the remaining 1%. ;-) +Recognition order +----------------- Inline markup delimiter characters are used for multiple constructs, so to avoid ambiguity there must be a specific recognition order for diff --git a/docutils/parsers/rst/punctuation_chars.py b/docutils/parsers/rst/punctuation_chars.py new file mode 100644 index 000000000..b8dbe2b43 --- /dev/null +++ b/docutils/parsers/rst/punctuation_chars.py @@ -0,0 +1,211 @@ +#!/usr/bin/env python +# -*- coding: utf8 -*- +# :Copyright: © 2011 Günter Milde. 
+# :License: Released under the terms of the `2-Clause BSD license`_, in short: +# +# Copying and distribution of this file, with or without modification, +# are permitted in any medium without royalty provided the copyright +# notice and this notice are preserved. +# This file is offered as-is, without any warranty. +# +# .. _2-Clause BSD license: http://www.spdx.org/licenses/BSD-2-Clause + +# :Id: $Id$ + +import sys, re +import unicodedata + +# punctuation characters around inline markup +# =========================================== +# +# This module provides the lists of characters for the implementation of +# the `inline markup recognition rules`_ in the reStructuredText parser +# (states.py) +# +# .. _inline markup recognition rules: +# ../../../docs/ref/rst/restructuredtext.html#inline-markup + +# Docutils punctuation category sample strings +# -------------------------------------------- +# +# The sample strings are generated by punctuation_samples() and put here +# literal to avoid the time-consuming generation with every Docutils +# run. Running this file as a standalone module checks the definitions below +# against a re-calculation. + +openers = ur"""\"\'\(\<\[\{༺༼᚛⁅⁽₍〈❨❪❬❮❰❲❴⟅⟦⟨⟪⟬⟮⦃⦅⦇⦉⦋⦍⦏⦑⦓⦕⦗⧘⧚⧼⸢⸤⸦⸨〈《「『【〔〖〘〚〝〝﴾︗︵︷︹︻︽︿﹁﹃﹇﹙﹛﹝([{⦅「«‘“‹⸂⸄⸉⸌⸜⸠‚„»’”›⸃⸅⸊⸍⸝⸡‛‟""" +closers = ur"""\"\'\)\>\]\}༻༽᚜⁆⁾₎〉❩❫❭❯❱❳❵⟆⟧⟩⟫⟭⟯⦄⦆⦈⦊⦌⦎⦐⦒⦔⦖⦘⧙⧛⧽⸣⸥⸧⸩〉》」』】〕〗〙〛〞〟﴿︘︶︸︺︼︾﹀﹂﹄﹈﹚﹜﹞)]}⦆」»’”›⸃⸅⸊⸍⸝⸡‛‟«‘“‹⸂⸄⸉⸌⸜⸠‚„""" +delimiters = ur"\-\/\:֊־᐀᠆‐‑‒–—―⸗⸚〜〰゠︱︲﹘﹣-¡·¿;·՚՛՜՝՞՟։׀׃׆׳״؉؊،؍؛؞؟٪٫٬٭۔܀܁܂܃܄܅܆܇܈܉܊܋܌܍߷߸߹࠰࠱࠲࠳࠴࠵࠶࠷࠸࠹࠺࠻࠼࠽࠾।॥॰෴๏๚๛༄༅༆༇༈༉༊་༌།༎༏༐༑༒྅࿐࿑࿒࿓࿔၊။၌၍၎၏჻፡።፣፤፥፦፧፨᙭᙮᛫᛬᛭᜵᜶។៕៖៘៙៚᠀᠁᠂᠃᠄᠅᠇᠈᠉᠊᥄᥅᧞᧟᨞᨟᪠᪡᪢᪣᪤᪥᪦᪨᪩᪪᪫᪬᪭᭚᭛᭜᭝᭞᭟᭠᰻᰼᰽᰾᰿᱾᱿᳓‖‗†‡•‣․‥…‧‰‱′″‴‵‶‷‸※‼‽‾⁁⁂⁃⁇⁈⁉⁊⁋⁌⁍⁎⁏⁐⁑⁓⁕⁖⁗⁘⁙⁚⁛⁜⁝⁞⳹⳺⳻⳼⳾⳿⸀⸁⸆⸇⸈⸋⸎⸏⸐⸑⸒⸓⸔⸕⸖⸘⸙⸛⸞⸟⸪⸫⸬⸭⸮⸰⸱、。〃〽・꓾꓿꘍꘎꘏꙳꙾꛲꛳꛴꛵꛶꛷꡴꡵꡶꡷꣎꣏꣸꣹꣺꤮꤯꥟꧁꧂꧃꧄꧅꧆꧇꧈꧉꧊꧋꧌꧍꧞꧟꩜꩝꩞꩟꫞꫟꯫︐︑︒︓︔︕︖︙︰﹅﹆﹉﹊﹋﹌﹐﹑﹒﹔﹕﹖﹗﹟﹠﹡﹨﹪﹫!"#%&'*,./:;?@\。、・𐄀𐄁𐎟𐏐𐡗𐤟𐤿𐩐𐩑𐩒𐩓𐩔𐩕𐩖𐩗𐩘𐩿𐬹𐬺𐬻𐬼𐬽𐬾𐬿𑂻𑂼𑂾𑂿𑃀𑃁𒑰𒑱𒑲𒑳" +closing_delimiters = ur"\.\,\;\!\?" 
+ + +# Unicode punctuation character categories +# ---------------------------------------- + +unicode_punctuation_categories = { + # 'Pc': 'Connector', # not used in Docutils inline markup recognition + 'Pd': 'Dash', + 'Ps': 'Open', + 'Pe': 'Close', + 'Pi': 'Initial quote', # may behave like Ps or Pe depending on usage + 'Pf': 'Final quote', # may behave like Ps or Pe depending on usage + 'Po': 'Other' + } +"""Unicode character categories for punctuation""" + + +# generate character pattern strings +# ================================== + +def unicode_charlists(categories, cp_min=0, cp_max=None): + """Return dictionary of Unicode character lists. + + For each of the `categories`, an item contains a list with all Unicode + characters with `cp_min` <= code-point <= `cp_max` that belong to the + category. (The default values check every code-point supported by Python.) + """ + # Determine highest code point with one of the given categories + # (may shorten the search time considerably if there are many + # categories with not too high characters): + if cp_max is None: + cp_max = max(x for x in xrange(sys.maxunicode + 1) + if unicodedata.category(unichr(x)) in categories) + # print cp_max # => 74867 for unicode_punctuation_categories + charlists = {} + for cat in categories: + charlists[cat] = [unichr(x) for x in xrange(cp_min, cp_max+1) + if unicodedata.category(unichr(x)) == cat] + return charlists + + +# Character categories in Docutils +# -------------------------------- + +def punctuation_samples(): + + """Docutils punctuation category sample strings. + + Return list of sample strings for the categories "Open", "Close", + "Delimiters" and "Closing-Delimiters" used in the `inline markup + recognition rules`_. 
+ """ + + # Lists with characters in Unicode punctuation character categories + cp_min = 160 # ASCII chars have special rules for backwards compatibility + ucharlists = unicode_charlists(unicode_punctuation_categories, cp_min) + + # match opening/closing characters + # -------------------------------- + # Rearrange the lists to ensure matching characters at the same + # index position. + + # low quotation marks are also used as closers (e.g. in Greek) + # move them to category Pi: + ucharlists['Ps'].remove(u'‚') # 201A SINGLE LOW-9 QUOTATION MARK + ucharlists['Ps'].remove(u'„') # 201E DOUBLE LOW-9 QUOTATION MARK + ucharlists['Pi'] += [u'‚', u'„'] + + ucharlists['Pi'].remove(u'‛') # 201B SINGLE HIGH-REVERSED-9 QUOTATION MARK + ucharlists['Pi'].remove(u'‟') # 201F DOUBLE HIGH-REVERSED-9 QUOTATION MARK + ucharlists['Pf'] += [u'‛', u'‟'] + + # 301F LOW DOUBLE PRIME QUOTATION MARK lacks an opening counterpart: + ucharlists['Ps'].insert(ucharlists['Pe'].index(u'\u301f'), u'\u301d') + + # print u''.join(ucharlists['Ps']).encode('utf8') + # print u''.join(ucharlists['Pe']).encode('utf8') + # print u''.join(ucharlists['Pi']).encode('utf8') + # print u''.join(ucharlists['Pf']).encode('utf8') + + # The Docutils character categories + # --------------------------------- + # + # The categorization of ASCII chars is non-standard to reduce both + # false positives and need for escaping. 
(see `inline markup recognition + # rules`_) + + # matching, allowed before markup + openers = [re.escape('"\'(<[{')] + for cat in ('Ps', 'Pi', 'Pf'): + openers.extend(ucharlists[cat]) + + # matching, allowed after markup + closers = [re.escape('"\')>]}')] + for cat in ('Pe', 'Pf', 'Pi'): + closers.extend(ucharlists[cat]) + + # non-matching, allowed on both sides + delimiters = [re.escape('-/:')] + for cat in ('Pd', 'Po'): + delimiters.extend(ucharlists[cat]) + + # non-matching, after markup + closing_delimiters = [re.escape('.,;!?')] + + # # Test open/close matching: + # for i in range(min(len(openers),len(closers))): + # print '%4d %s %s' % (i, openers[i].encode('utf8'), + # closers[i].encode('utf8')) + + return [u''.join(chars) + for chars in (openers, closers, delimiters, closing_delimiters)] + + +# Matching open/close quotes +# -------------------------- + +# Rule (5) requires determination of matching open/close pairs. However, +# the pairing of open/close quotes is ambiguous due to different typographic +# conventions in different languages. 
+ +quote_pairs = {u'\xbb': u'\xbb', # Swedish + u'\u2018': u'\u201a', # Greek + u'\u2019': u'\u2019', # Swedish + u'\u201a': u'\u2018\u2019', # German, Polish + u'\u201c': u'\u201e', # German + u'\u201e': u'\u201c\u201d', + u'\u201d': u'\u201d', # Swedish + u'\u203a': u'\u203a', # Swedish + } + +def match_chars(c1, c2): + try: + i = openers.index(c1) + except ValueError: # c1 not in openers + return False + return c2 == closers[i] or c2 in quote_pairs.get(c1, '') + + + + +# print results +# ============= + +if __name__ == '__main__': + + # (re) create and compare the samples: + (o, c, d, cd) = punctuation_samples() + if o != openers: + print '- openers = ur"""%s"""' % openers.encode('utf8') + print '+ openers = ur"""%s"""' % o.encode('utf8') + if c != closers: + print '- closers = ur"""%s"""' % closers.encode('utf8') + print '+ closers = ur"""%s"""' % c.encode('utf8') + if d != delimiters: + print '- delimiters = ur"%s"' % delimiters.encode('utf8') + print '+ delimiters = ur"%s"' % d.encode('utf8') + if cd != closing_delimiters: + print '- closing_delimiters = ur"%s"' % closing_delimiters.encode('utf8') + print '+ closing_delimiters = ur"%s"' % cd.encode('utf8') + + # # test prints + # print 'openers = ', repr(openers) + # print 'closers = ', repr(closers) + # print 'delimiters = ', repr(delimiters) + # print 'closing_delimiters = ', repr(closing_delimiters) + + # ucharlists = unicode_charlists(unicode_punctuation_categories) + # for cat, chars in ucharlists.items(): + # # print cat, chars + # # compact output (visible with a comprehensive font): + # print (u":%s: %s" % (cat, u''.join(chars))).encode('utf8') diff --git a/docutils/parsers/rst/states.py b/docutils/parsers/rst/states.py index 882b3f762..556fac783 100644 --- a/docutils/parsers/rst/states.py +++ b/docutils/parsers/rst/states.py @@ -116,7 +116,7 @@ from docutils.utils import escape2null, unescape, column_width import docutils.parsers.rst from docutils.parsers.rst import directives, languages, 
tableparser, roles from docutils.parsers.rst.languages import en as _fallback_language_module - +from docutils.parsers.rst import punctuation_chars class MarkupError(DataError): pass class UnknownInterpretedRoleError(DataError): pass @@ -530,18 +530,16 @@ class Inliner: # Inline object recognition # ------------------------- - # character categories: - openers = u'\'"([{<\u2018\u201c\xab\u00a1\u00bf' # see quoted_start below - closers = u'\'")]}>\u2019\u201d\xbb!?' - delimiters = u'-/:\u2010\u2011\u2012\u2013\u2014\u00a0' # lookahead and look-behind expressions for inline markup rules - # (see todo.html#inline-markup-syntax-rules) - start_string_prefix = (u'((?<=^)|(?<=\\s|[\u2019%s%s]))' - % (re.escape(delimiters), - re.escape(openers))) - end_string_suffix = (u'((?=$)|(?=\\s|[.,; \x00%s%s]))' - % (re.escape(delimiters), - re.escape(closers))) + start_string_prefix = (u'(^|(?<=\\s|[%s%s]))' % + (punctuation_chars.openers, + punctuation_chars.delimiters)) + end_string_suffix = (u'($|(?=\\s|[\x00%s%s%s]))' % + (punctuation_chars.closing_delimiters, + punctuation_chars.delimiters, + punctuation_chars.closers)) + # print start_string_prefix.encode('utf8') + # TODO: support non-ASCII whitespace in the following 4 patterns? non_whitespace_before = r'(? @@ -29,18 +30,26 @@ totest['emphasis'] = [ emphasis """], [u"""\ -l'*emphasis* and l\u2019*emphasis* with apostrophe +l'*emphasis* with the *emphasis*' apostrophe. +l\u2019*emphasis* with the *emphasis*\u2019 apostrophe. """, u"""\ - l' + l\' emphasis - and l\u2019 + with the \n\ emphasis - with apostrophe + \' apostrophe. + l\u2019 + + emphasis + with the \n\ + + emphasis + \u2019 apostrophe. """], ["""\ *emphasized sentence @@ -66,41 +75,64 @@ across lines* Inline emphasis start-string without end-string. 
"""], -[r""" -'*emphasis*' and 1/*emphasis*/2 and 3-*emphasis*-4 and 5:*emphasis*:6 -but not '*' or '"*"' or x*2* or 2*x* or \*args or * -or *the\* *stars\\\* *inside* +[r"""some punctuation is allowed around inline markup, e.g. +/*emphasis*/, -*emphasis*-, and :*emphasis*: (delimiters), +(*emphasis*), [*emphasis*], <*emphasis*>, {*emphasis*} (open/close pairs) + +but not +)*emphasis*(, ]*emphasis*[, >*emphasis*>, }*emphasis*{ (close/open pairs) +(*), [*], '*' or '"*"' ("quoted" start-string), +x*2* or 2*x* (alphanumeric char before), +\*args or * (escaped, whitespace behind start-string) +or *the\* *stars\* *inside* (escaped, whitespace before end-string). -(however, '*args' will trigger a warning and may be problematic) +However, '*args' will trigger a warning and may be problematic. what about *this**? """, """\ - ' + some punctuation is allowed around inline markup, e.g. + / emphasis - ' and 1/ + /, - emphasis - /2 and 3- + -, and : emphasis - -4 and 5: + : (delimiters), + ( + + emphasis + ), [ + + emphasis + ], < emphasis - :6 - but not '*' or '"*"' or x*2* or 2*x* or *args or * + >, { + + emphasis + } (open/close pairs) + + but not + )*emphasis*(, ]*emphasis*[, >*emphasis*>, }*emphasis*{ (close/open pairs) + (*), [*], '*' or '"*"' ("quoted" start-string), + x*2* or 2*x* (alphanumeric char before), + *args or * (escaped, whitespace behind start-string) or \n\ - the* *stars\* *inside + the* *stars* *inside + (escaped, whitespace before end-string). - (however, ' + However, ' * - args' will trigger a warning and may be problematic) - + args' will trigger a warning and may be problematic. + Inline emphasis start-string without end-string. @@ -110,31 +142,123 @@ what about *this**? ? 
"""], [u"""\ -quoted '*emphasis*', quoted "*emphasis*", -quoted \u2018*emphasis*\u2019, quoted \u201c*emphasis*\u201d, -quoted \xab*emphasis*\xbb +Quotes around inline markup: + +'*emphasis*' "*emphasis*" Straight, +‘*emphasis*’ “*emphasis*” English, ..., +« *emphasis* » ‹ *emphasis* › « *emphasis* » ‹ *emphasis* › +« *emphasis* » ‹ *emphasis* › French, +„*emphasis*“ ‚*emphasis*‘ »*emphasis*« ›*emphasis*‹ German, Czech, ..., +„*emphasis*” «*emphasis*» Romanian, +“*emphasis*„ ‘*emphasis*‚ Greek, +「*emphasis*」 『*emphasis*』traditional Chinese, +”*emphasis*” ’*emphasis*’ »*emphasis*» ›*emphasis*› Swedish, Finnish, +„*emphasis*” ‚*emphasis*’ Polish, +„*emphasis*” »*emphasis*« ’*emphasis*’ Hungarian, """, u"""\ - quoted ' + Quotes around inline markup: + + \' emphasis - ', quoted " + \' " emphasis - ", - quoted \u2018 + " Straight, + \u2018 emphasis - \u2019, quoted \u201c + \u2019 \u201c emphasis - \u201d, - quoted \xab + \u201d English, ..., + \xab\u202f emphasis - \xbb + \u202f\xbb \u2039\u202f + + emphasis + \u202f\u203a \xab\xa0 + + emphasis + \xa0\xbb \u2039\xa0 + + emphasis + \xa0\u203a + \xab\u2005 + + emphasis + \u2005\xbb \u2039\u2005 + + emphasis + \u2005\u203a French, + \u201e + + emphasis + \u201c \u201a + + emphasis + \u2018 \xbb + + emphasis + \xab \u203a + + emphasis + \u2039 German, Czech, ..., + \u201e + + emphasis + \u201d \xab + + emphasis + \xbb Romanian, + \u201c + + emphasis + \u201e \u2018 + + emphasis + \u201a Greek, + \u300c + + emphasis + \u300d \u300e + + emphasis + \u300ftraditional Chinese, + \u201d + + emphasis + \u201d \u2019 + + emphasis + \u2019 \xbb + + emphasis + \xbb \u203a + + emphasis + \u203a Swedish, Finnish, + \u201e + + emphasis + \u201d \u201a + + emphasis + \u2019 Polish, + \u201e + + emphasis + \u201d \xbb + + emphasis + \xab \u2019 + + emphasis + \u2019 Hungarian, """], [r""" Emphasized asterisk: *\** @@ -345,13 +469,13 @@ u"""\ 'literal' - with quotes, + with quotes, \n\ "literal" with quotes, \u2018literal\u2019 - with 
quotes, + with quotes, \n\ \u201cliteral\u201d with quotes, @@ -617,7 +741,7 @@ u"""\ 'phrase reference' - with quotes, + with quotes, \n\ "phrase reference" with quotes, @@ -694,7 +818,7 @@ u"""\ 'anonymous reference' - with quotes, + with quotes, \n\ "anonymous reference" with quotes, @@ -994,13 +1118,13 @@ u"""\ 'target1' - with quotes, + with quotes, \n\ "target2" with quotes, \u2018target3\u2019 - with quotes, + with quotes, \n\ \u201ctarget4\u201d with quotes, @@ -1405,7 +1529,7 @@ u"""\ \u00a0no-break-space\u00a0 . """], -# Whitespace characters: +# Whitespace characters: # \u180e*MONGOLIAN VOWEL SEPARATOR*\u180e, fails in Python 2.4 [u"""\ text separated by @@ -1508,28 +1632,47 @@ u"""\ LINE SEPARATOR """], +# « * » ‹ * › « * » ‹ * › « * » ‹ * › French, [u"""\ -None of these should be markup (matched openers & closers): +"Quoted" markup start-string (matched openers & closers) -> no markup: -\u2018*\u2019 \u201c*\u201d \xab*\xbb \u00bf*? \u00a1*! +'*' "*" (*) <*> [*] {*} +⁅*⁆ -But this should: +Some international quoting styles: +‘*’ “*” English, ..., +„*“ ‚*‘ »*« ›*‹ German, Czech, ..., +„*” «*» Romanian, +“*„ ‘*‚ Greek, +「*」 『*』traditional Chinese, +”*” ’*’ »*» ›*› Swedish, Finnish, +„*” ‚*’ Polish, +„*” »*« ’*’ Hungarian, -l\u2019*exception*. +But this is „*’ emphasized »*‹. """, u"""\ - None of these should be markup (matched openers & closers): + "Quoted" markup start-string (matched openers & closers) -> no markup: - \u2018*\u2019 \u201c*\u201d \xab*\xbb \xbf*? \xa1*! + '*' "*" (*) <*> [*] {*} + ⁅*⁆ - But this should: + Some international quoting styles: + ‘*’ “*” English, ..., + „*“ ‚*‘ »*« ›*‹ German, Czech, ..., + „*” «*» Romanian, + “*„ ‘*‚ Greek, + 「*」 『*』traditional Chinese, + ”*” ’*’ »*» ›*› Swedish, Finnish, + „*” ‚*’ Polish, + „*” »*« ’*’ Hungarian, - l\u2019 + But this is „ - exception - . + ’ emphasized » + ‹. """], ] -- 2.11.4.GIT
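
Reviewer note: rules 1 and 4 of the revised specification can be sketched as two small predicates. This is only an illustration under stated assumptions, not the code in this patch: it is written for Python 3 (the patch targets Python 2), it queries ``unicodedata`` at call time instead of using the precomputed strings in ``punctuation_chars.py``, and the names ``may_precede_start_string`` and ``may_follow_end_string`` are hypothetical.

```python
import unicodedata

# ASCII characters keep their hand-picked sets (backwards compatibility);
# these are the lists given in rules 1 and 4 of the revised specification.
ASCII_BEFORE = set("-:/'\"<([{")
ASCII_AFTER = set("-.,:;!?\\/'\")]}>")

# Non-ASCII punctuation is classified by Unicode general category.
CATEGORIES_BEFORE = {"Pd", "Po", "Ps", "Pi", "Pf"}
CATEGORIES_AFTER = {"Pd", "Po", "Pe", "Pf", "Pi"}


def may_precede_start_string(ch):
    """Rule 1 (hypothetical helper): may `ch` come right before a start-string?"""
    if ch.isspace():
        return True
    if ord(ch) < 128:
        return ch in ASCII_BEFORE
    return unicodedata.category(ch) in CATEGORIES_BEFORE


def may_follow_end_string(ch):
    """Rule 4 (hypothetical helper): may `ch` come right after an end-string?"""
    if ch.isspace():
        return True
    if ord(ch) < 128:
        return ch in ASCII_AFTER
    return unicodedata.category(ch) in CATEGORIES_AFTER
```

With these predicates, the German-style ``»*emphasis*«`` passes both checks: ``»`` (category Pf) may precede the start-string and ``«`` (category Pi) may follow the end-string. Rule 5 (matching quote pairs) is deliberately left out of the sketch.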