Really use the full NBSP inside French quotes.
#!/usr/bin/python
# -*- coding: utf-8 -*-

# :Id: $Id$
# :Copyright: © 2010 Günter Milde,
#             original `SmartyPants`_: © 2003 John Gruber
#             smartypants.py:          © 2004, 2007 Chad Miller
# :Maintainer: docutils-develop@lists.sourceforge.net
# :License: Released under the terms of the `2-Clause BSD license`_, in short:
#
#    Copying and distribution of this file, with or without modification,
#    are permitted in any medium without royalty provided the copyright
#    notices and this notice are preserved.
#    This file is offered as-is, without any warranty.
#
# .. _2-Clause BSD license: http://www.spdx.org/licenses/BSD-2-Clause

r"""
========================
SmartyPants for Docutils
========================

Synopsis
========

Smart-quotes for Docutils.

The original "SmartyPants" is a free web publishing plug-in for Movable Type,
Blosxom, and BBEdit that easily translates plain ASCII punctuation characters
into "smart" typographic punctuation characters.

`smartypants.py` endeavours to be a functional port of
SmartyPants to Python, for use with Pyblosxom_.

`smartquotes.py` is an adaptation of SmartyPants to Docutils_. By using Unicode
characters instead of HTML entities for typographic quotes, it works for any
output format that supports Unicode.

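To illustrate the difference between HTML entities and Unicode characters, here is a small sketch. It is not part of the module: it assumes Python 3, uses only the standard library's ``html.unescape``, and the names ``ENTITY_TO_CHAR`` and ``entities_match`` are made up for the example.

```python
import html

# Numeric entities as used by the original SmartyPants, paired with the
# Unicode characters that a Unicode-based port would emit instead.
ENTITY_TO_CHAR = {
    "&#8220;": "\u201c",  # LEFT DOUBLE QUOTATION MARK
    "&#8221;": "\u201d",  # RIGHT DOUBLE QUOTATION MARK
    "&#8211;": "\u2013",  # EN DASH
    "&#8212;": "\u2014",  # EM DASH
    "&#8230;": "\u2026",  # HORIZONTAL ELLIPSIS
}


def entities_match():
    # html.unescape() decodes a numeric character reference into the
    # corresponding Unicode character.
    return all(html.unescape(entity) == char
               for entity, char in ENTITY_TO_CHAR.items())
```

The entity form is only meaningful to HTML consumers, while the character form can be passed unchanged to any writer that handles Unicode.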
Authors
=======

`John Gruber`_ did all of the hard work of writing this software in Perl for
`Movable Type`_ and almost all of this useful documentation.  `Chad Miller`_
ported it to Python to use with Pyblosxom_.
Adapted to Docutils_ by Günter Milde.

Additional Credits
==================

Portions of the SmartyPants original work are based on Brad Choate's nifty
MTRegex plug-in.  `Brad Choate`_ also contributed a few bits of source code to
this plug-in.  Brad Choate is a fine hacker indeed.

`Jeremy Hedley`_ and `Charles Wiltgen`_ deserve mention for exemplary beta
testing of the original SmartyPants.

`Rael Dornfest`_ ported SmartyPants to Blosxom.

.. _Brad Choate: http://bradchoate.com/
.. _Jeremy Hedley: http://antipixel.com/
.. _Charles Wiltgen: http://playbacktime.com/
.. _Rael Dornfest: http://raelity.org/

Copyright and License
=====================

SmartyPants_ license (3-Clause BSD license):

  Copyright (c) 2003 John Gruber (http://daringfireball.net/)
  All rights reserved.

  Redistribution and use in source and binary forms, with or without
  modification, are permitted provided that the following conditions are
  met:

  * Redistributions of source code must retain the above copyright
    notice, this list of conditions and the following disclaimer.

  * Redistributions in binary form must reproduce the above copyright
    notice, this list of conditions and the following disclaimer in
    the documentation and/or other materials provided with the
    distribution.

  * Neither the name "SmartyPants" nor the names of its contributors
    may be used to endorse or promote products derived from this
    software without specific prior written permission.

  This software is provided by the copyright holders and contributors
  "as is" and any express or implied warranties, including, but not
  limited to, the implied warranties of merchantability and fitness for
  a particular purpose are disclaimed. In no event shall the copyright
  owner or contributors be liable for any direct, indirect, incidental,
  special, exemplary, or consequential damages (including, but not
  limited to, procurement of substitute goods or services; loss of use,
  data, or profits; or business interruption) however caused and on any
  theory of liability, whether in contract, strict liability, or tort
  (including negligence or otherwise) arising in any way out of the use
  of this software, even if advised of the possibility of such damage.

smartypants.py license (2-Clause BSD license):

  smartypants.py is a derivative work of SmartyPants.

  Redistribution and use in source and binary forms, with or without
  modification, are permitted provided that the following conditions are
  met:

  * Redistributions of source code must retain the above copyright
    notice, this list of conditions and the following disclaimer.

  * Redistributions in binary form must reproduce the above copyright
    notice, this list of conditions and the following disclaimer in
    the documentation and/or other materials provided with the
    distribution.

  This software is provided by the copyright holders and contributors
  "as is" and any express or implied warranties, including, but not
  limited to, the implied warranties of merchantability and fitness for
  a particular purpose are disclaimed. In no event shall the copyright
  owner or contributors be liable for any direct, indirect, incidental,
  special, exemplary, or consequential damages (including, but not
  limited to, procurement of substitute goods or services; loss of use,
  data, or profits; or business interruption) however caused and on any
  theory of liability, whether in contract, strict liability, or tort
  (including negligence or otherwise) arising in any way out of the use
  of this software, even if advised of the possibility of such damage.

.. _John Gruber: http://daringfireball.net/
.. _Chad Miller: http://web.chad.org/

.. _Pyblosxom: http://pyblosxom.bluesock.org/
.. _SmartyPants: http://daringfireball.net/projects/smartypants/
.. _Movable Type: http://www.movabletype.org/
.. _2-Clause BSD license: http://www.spdx.org/licenses/BSD-2-Clause
.. _Docutils: http://docutils.sf.net/

Description
===========

SmartyPants can perform the following transformations:

- Straight quotes ( " and ' ) into "curly" quote characters
- Backticks-style quotes (\`\`like this'') into "curly" quote characters
- Dashes (``--`` and ``---``) into en- and em-dash characters
- Three consecutive dots (``...`` or ``. . .``) into an ellipsis character

This means you can write, edit, and save your posts using plain old
ASCII straight quotes, plain dashes, and plain dots, but your published
posts (and final HTML output) will appear with smart quotes, em-dashes,
and proper ellipses.

SmartyPants does not modify characters within ``<pre>``, ``<code>``, ``<kbd>``,
``<math>`` or ``<script>`` tag blocks. Typically, these tags are used to
display text where smart quotes and other "smart punctuation" would not be
appropriate, such as source code or example markup.

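The dash and ellipsis transformations listed above can be sketched in a few lines. This is a simplified illustration, not the module's own code: it assumes the "old-school" dash convention (``--`` for an en-dash, ``---`` for an em-dash), and the function names are chosen for the example only.

```python
EN_DASH = "\u2013"
EM_DASH = "\u2014"
ELLIPSIS = "\u2026"


def educate_dashes_old_school(text):
    # Replace the longer run first; otherwise the "--" rule would consume
    # the first two hyphens of every "---".
    text = text.replace("---", EM_DASH)
    return text.replace("--", EN_DASH)


def educate_ellipses(text):
    # Both the compact and the spaced form become one ellipsis character.
    text = text.replace("...", ELLIPSIS)
    return text.replace(". . .", ELLIPSIS)
```

The order of the two dash replacements matters; swapping them would turn ``---`` into an en-dash followed by a hyphen.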
Backslash Escapes
=================

If you need to use literal straight quotes (or plain hyphens and periods),
`smartquotes` accepts the following backslash escape sequences to force
ASCII punctuation. Note that you need two backslashes, because Docutils
expands backslash escapes, too.

========  =========
Escape    Character
========  =========
``\\``    \\
``\\"``   \\"
``\\'``   \\'
``\\.``   \\.
``\\-``   \\-
``\\```   \\`
========  =========

This is useful, for example, when you want to use straight quotes as
foot and inch marks: 6\\'2\\" tall; a 17\\" iMac.

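The escape handling can be pictured as a protect-and-restore pass around the other transformations. The sketch below is a minimal illustration under that assumption; the names ``ESCAPES``, ``protect`` and ``restore`` are hypothetical.

```python
# Each escape sequence is swapped for a numeric placeholder before the
# text is "educated" and swapped back to the plain character afterwards.
ESCAPES = (
    (r'\\', '&#92;'),
    (r'\"', '&#34;'),
    (r"\'", '&#39;'),
    (r'\.', '&#46;'),
    (r'\-', '&#45;'),
    (r'\`', '&#96;'),
)


def protect(text):
    # The escaped backslash is handled first so that it cannot be
    # misread as the start of another escape sequence.
    for sequence, placeholder in ESCAPES:
        text = text.replace(sequence, placeholder)
    return text


def restore(text):
    # sequence[1] is the character that follows the backslash.
    for sequence, placeholder in ESCAPES:
        text = text.replace(placeholder, sequence[1])
    return text
```

Because the placeholders contain no quotes, dashes or dots, the quote, dash and ellipsis rules leave them alone between the two passes.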
Options
=======

For Pyblosxom users, the ``smartypants_attributes`` attribute is where you
specify configuration options.

Numeric values are the easiest way to configure SmartyPants' behavior:

"0"
    Suppress all transformations. (Do nothing.)
"1"
    Performs default SmartyPants transformations: quotes (including
    \`\`backticks'' -style), em-dashes, and ellipses. "``--``" (dash dash)
    is used to signify an em-dash; there is no support for en-dashes.

"2"
    Same as smarty_pants="1", except that it uses the old-school typewriter
    shorthand for dashes:  "``--``" (dash dash) for en-dashes, "``---``"
    (dash dash dash)
    for em-dashes.

"3"
    Same as smarty_pants="2", but inverts the shorthand for dashes:
    "``--``" (dash dash) for em-dashes, and "``---``" (dash dash dash) for
    en-dashes.

"-1"
    Stupefy mode. Reverses the SmartyPants transformation process, turning
    the characters produced by SmartyPants into their ASCII equivalents.
    E.g.  "“" is turned into a simple double-quote (\"), "—" is
    turned into two dashes, etc.

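Stupefy mode amounts to a fixed character mapping. The following sketch shows the idea for the English quote characters only; it assumes Python 3 (``str.maketrans``), and ``STUPEFY_TABLE`` and ``stupefy`` are illustrative names, not the module's API.

```python
# Map each "smart" character to its ASCII equivalent.
STUPEFY_TABLE = str.maketrans({
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark / apostrophe
    "\u201c": '"',    # left double quotation mark
    "\u201d": '"',    # right double quotation mark
    "\u2013": "-",    # en dash
    "\u2014": "--",   # em dash
    "\u2026": "...",  # horizontal ellipsis
})


def stupefy(text):
    return text.translate(STUPEFY_TABLE)
```

A language-aware version would build the table from the quote characters configured for the language in use.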
The following single-character attribute values can be combined to toggle
individual transformations from within the smarty_pants attribute. For
example, to educate normal quotes and em-dashes, but not ellipses or
\`\`backticks'' -style quotes:

E.g. ``py['smartypants_attributes'] = "1"`` is equivalent to
``py['smartypants_attributes'] = "qBde"``.

"q"
    Educates normal quote characters: (") and (').

"b"
    Educates \`\`backticks'' -style double quotes.

"B"
    Educates \`\`backticks'' -style double quotes and \`single' quotes.

"d"
    Educates em-dashes.

"D"
    Educates em-dashes and en-dashes, using old-school typewriter shorthand:
    (dash dash) for en-dashes, (dash dash dash) for em-dashes.

"i"
    Educates em-dashes and en-dashes, using inverted old-school typewriter
    shorthand: (dash dash) for em-dashes, (dash dash dash) for en-dashes.

"e"
    Educates ellipses.

"w"
    Translates any instance of ``&quot;`` into a normal double-quote character.
    This should be of no interest to most people, but of particular interest
    to anyone who writes their posts using Dreamweaver, as Dreamweaver
    inexplicably uses this entity to represent a literal double-quote
    character. SmartyPants only educates normal quotes, not entities (because
    ordinarily, entities are used for the explicit purpose of representing the
    specific character they represent). The "w" option must be used in
    conjunction with one (or both) of the other quote options ("q" or "b").
    Thus, if you wish to apply all SmartyPants transformations (quotes, en-
    and em-dashes, and ellipses) and also translate ``&quot;`` entities into
    regular quotes so SmartyPants can educate them, you should pass the
    following to the smarty_pants attribute:

Caveats
=======

Why You Might Not Want to Use Smart Quotes in Your Weblog
---------------------------------------------------------

For one thing, you might not care.

Most normal, mentally stable individuals do not take notice of proper
typographic punctuation. Many design and typography nerds, however, break
out in a nasty rash when they encounter, say, a restaurant sign that uses
a straight apostrophe to spell "Joe's".

If you're the sort of person who just doesn't care, you might well want to
continue not caring. Using straight quotes -- and sticking to the 7-bit
ASCII character set in general -- is certainly a simpler way to live.

Even if you *do* care about accurate typography, you still might want to
think twice before educating the quote characters in your weblog. One side
effect of publishing curly quote characters is that it makes your
weblog a bit harder for others to quote from using copy-and-paste. What
happens is that when someone copies text from your blog, the copied text
contains the 8-bit curly quote characters (as well as the 8-bit characters
for em-dashes and ellipses, if you use these options). These characters
are not standard across different text encoding methods, which is why they
need to be encoded as characters.

People copying text from your weblog, however, may not notice that you're
using curly quotes, and they'll go ahead and paste the unencoded 8-bit
characters copied from their browser into an email message or their own
weblog. When pasted as raw "smart quotes", these characters are likely to
get mangled beyond recognition.

That said, my own opinion is that any decent text editor or email client
makes it easy to stupefy smart quote characters into their 7-bit
equivalents, and I don't consider it my problem if you're using an
indecent text editor or email client.

Algorithmic Shortcomings
------------------------

One situation in which quotes will get curled the wrong way is when
apostrophes are used at the start of a word, as in leading contractions.
For example:

``'Twas the night before Christmas.``

In the case above, SmartyPants will turn the apostrophe into an opening
single-quote, when in fact it should be the `right single quotation mark`
character which is also "the preferred character to use for apostrophe"
(Unicode). I don't think this problem can be solved in the general case --
every word processor I've tried gets this wrong as well. In such cases, it's
best to use the proper character for closing single-quotes (’) by hand.

In English, the same character is used for apostrophe and closing single
quote (both plain and "smart" ones). For other locales (French, Italian,
Swiss, ...) "smart" single closing quotes differ from the curly apostrophe.

.. class:: language-fr

Il dit : "C'est 'super' !"

If the apostrophe is used at the end of a word, it cannot be distinguished
from a single quote by the algorithm. Therefore, a text like::

   .. class:: language-de-CH

   "Er sagt: 'Ich fass' es nicht.'"

will get a single closing guillemet instead of an apostrophe.

This can be prevented by use of the curly apostrophe character (’) in
the source::

   -  "Er sagt: 'Ich fass' es nicht.'"
   +  "Er sagt: 'Ich fass’ es nicht.'"

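Why the word-final apostrophe is out of reach can be seen in a small sketch. It assumes, as a simplification, that an apostrophe is recognised only when it sits between two word characters; ``mark_inner_apostrophes`` is a hypothetical helper written for this illustration.

```python
import re

APOSTROPHE = "\u2019"  # RIGHT SINGLE QUOTATION MARK


def mark_inner_apostrophes(text):
    # A straight quote with a word character on both sides can only be
    # an apostrophe, so it is safe to convert it.
    return re.sub(r"(?<=\w)'(?=\w)", APOSTROPHE, text)
```

In ``C'est 'super'`` only the quote inside ``C'est`` qualifies. In ``Ich fass' es nicht`` the quote is followed by a space, so it looks exactly like a closing quote and is left for the quote rules to handle.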
Version History
===============

1.7.1: 2017-03-19
    - Update and extend language-dependent quotes.
    - Differentiate apostrophe from single quote.

1.7:   2012-11-19
    - Internationalization: language-dependent quotes.

1.6.1: 2012-11-06
    - Refactor code, code cleanup,
    - `educate_tokens()` generator as interface for Docutils.

1.6:   2010-08-26
    - Adaptation to Docutils:
      - Use Unicode instead of HTML entities,
      - Remove code special to pyblosxom.

1.5_1.6: Fri, 27 Jul 2007 07:06:40 -0400
    - Fixed bug where blocks of precious unalterable text were instead
      interpreted.  Thanks to Le Roux and Dirk van Oosterbosch.

1.5_1.5: Sat, 13 Aug 2005 15:50:24 -0400
    - Fix bogus magical quotation when there is no hint that the
      user wants it, e.g., in "21st century".  Thanks to Nathan Hamblen.
    - Be smarter about quotes before terminating numbers in an en-dash'ed
      range.

1.5_1.4: Thu, 10 Feb 2005 20:24:36 -0500
    - Fix a date-processing bug, as reported by jacob childress.
    - Begin a test-suite for ensuring correct output.
    - Removed import of "string", since I didn't really need it.
      (This was my first ever Python program.  Sue me!)

1.5_1.3: Wed, 15 Sep 2004 18:25:58 -0400
    - Abort processing if the flavour is in forbidden-list.  Default of
      [ "rss" ]   (Idea of Wolfgang SCHNERRING.)
    - Remove stray virgules from en-dashes.  Patch by Wolfgang SCHNERRING.

1.5_1.2: Mon, 24 May 2004 08:14:54 -0400
    - Some single quotes weren't replaced properly.  Diff-tesuji played
      by Benjamin GEIGER.

1.5_1.1: Sun, 14 Mar 2004 14:38:28 -0500
    - Support upcoming pyblosxom 0.9 plugin verification feature.

1.5_1.0: Tue, 09 Mar 2004 08:08:35 -0500
    - Initial release
"""

default_smartypants_attr = "1"


import re

class smartchars(object):
    """Smart quotes and dashes
    """

    endash = u'–'      # "&#8211;" EN DASH
    emdash = u'—'      # "&#8212;" EM DASH
    ellipsis = u'…'    # "&#8230;" HORIZONTAL ELLIPSIS
    apostrophe = u'’'  # "&#8217;" RIGHT SINGLE QUOTATION MARK

    # quote characters (language-specific, set in __init__())
    # http://en.wikipedia.org/wiki/Non-English_usage_of_quotation_marks
    # http://de.wikipedia.org/wiki/Anf%C3%BChrungszeichen#Andere_Sprachen
    # https://fr.wikipedia.org/wiki/Guillemet
    # http://typographisme.net/post/Les-espaces-typographiques-et-le-web
    # http://www.btb.termiumplus.gc.ca/tpv2guides/guides/redac/index-fra.html
    # https://en.wikipedia.org/wiki/Hebrew_punctuation#Quotation_marks
    # http://www.tustep.uni-tuebingen.de/bi/bi00/bi001t1-anfuehrung.pdf
    #
    # The no-break spaces in the French entries are written as escapes
    # (\u00a0 NO-BREAK SPACE, \u202f NARROW NO-BREAK SPACE) so that they
    # stay visible and survive copy-and-paste.
    quotes = {
        'af':              u'“”‘’',
        'af-x-altquot':    u'„”‚’',
        'ca':              u'«»“”',
        'ca-x-altquot':    u'“”‘’',
        'cs':              u'„“‚‘',
        'cs-x-altquot':    u'»«›‹',
        'da':              u'»«‘’',
        'da-x-altquot':    u'„“‚‘',
        'de':              u'„“‚‘',
        'de-x-altquot':    u'»«›‹',
        'de-ch':           u'«»‹›',
        'el':              u'«»“”',
        'en':              u'“”‘’',
        'en-uk-x-altquot': u'‘’“”',  # Attention: " → ‘ and ' → “ !
        'eo':              u'“”‘’',
        'es':              u'«»“”',
        'es-x-altquot':    u'“”‘’',
        'et':              u'„“‚‘',  # no secondary quote listed in
        'et-x-altquot':    u'«»‹›',  # the sources above (wikipedia.org)
        'eu':              u'«»‹›',
        'fi':              u'””’’',
        'fi-x-altquot':    u'»»››',
        'fr':              (u'«\u00a0', u'\u00a0»', u'“', u'”'),  # full no-break space
        'fr-x-altquot':    (u'«\u202f', u'\u202f»', u'“', u'”'),  # narrow no-break space
        'fr-ch':           u'«»‹›',
        'fr-ch-x-altquot': (u'«\u202f', u'\u202f»', u'‹\u202f', u'\u202f›'),  # narrow no-break space, http://typoguide.ch/
        'gl':              u'«»“”',
        'he':              u'”“»«',
        'he-x-altquot':    u'„”‚’',
        'hr':              u'„”‘’',
        'hr-x-altquot':    u'»«›‹',
        'hsb':             u'„“‚‘',
        'hsb-x-altquot':   u'»«›‹',
        'hu':              u'„”«»',
        'it':              u'«»“”',
        'it-ch':           u'«»‹›',
        'it-x-altquot':    u'“”‘’',
        # 'it-x-altquot2': u'“„‘‚',  # antiquated?
        'ja':              u'「」『』',
        'lt':              u'„“‚‘',
        'lv':              u'„“‚‘',
        'nl':              u'“”‘’',
        'nl-x-altquot':    u'„”‚’',
        # 'nl-x-altquot2': u'””’’',
        'pl':              u'„”«»',
        'pl-x-altquot':    u'«»“”',
        'pt':              u'«»“”',
        'pt-br':           u'“”‘’',
        'ro':              u'„”«»',
        'ru':              u'«»„“',
        'sh':              u'„”‚’',
        'sh-x-altquot':    u'»«›‹',
        'sk':              u'„“‚‘',
        'sk-x-altquot':    u'»«›‹',
        'sr':              u'„”’’',
        'sl':              u'„“‚‘',
        'sl-x-altquot':    u'»«›‹',
        'sv':              u'””’’',
        'sv-x-altquot':    u'»»››',
        'tr':              u'“”‘’',
        'tr-x-altquot':    u'«»‹›',
        # 'tr-x-altquot2': u'“„‘‚',  # antiquated?
        'uk':              u'«»„“',
        'uk-x-altquot':    u'„“‚‘',
        'zh-cn':           u'“”‘’',
        'zh-tw':           u'「」『』',
        }

    def __init__(self, language='en'):
        self.language = language
        try:
            (self.opquote, self.cpquote,
             self.osquote, self.csquote) = self.quotes[language.lower()]
        except KeyError:
            # Unknown language: fall back to plain ASCII quotes.
            self.opquote, self.cpquote, self.osquote, self.csquote = u'""\'\''


def smartyPants(text, attr=default_smartypants_attr, language='en'):
    """Main function for "traditional" use."""

    return "".join(educate_tokens(tokenize(text), attr, language))


def educate_tokens(text_tokens, attr=default_smartypants_attr, language='en'):
    """Return iterator that "educates" the items of `text_tokens`.
    """

    # Parse attributes:
    # 0 : do nothing
    # 1 : set all
    # 2 : set all, using old school en- and em- dash shortcuts
    # 3 : set all, using inverted old school en and em- dash shortcuts
    #
    # q : quotes
    # b : backtick quotes (``double'' only)
    # B : backtick quotes (``double'' and `single')
    # d : dashes
    # D : old school dashes
    # i : inverted old school dashes
    # e : ellipses
    # w : convert &quot; entities to " for Dreamweaver users

    convert_quot = False  # translate &quot; entities into normal quotes?
    do_dashes = False
    do_backticks = False
    do_quotes = False
    do_ellipses = False
    do_stupefy = False

    if attr == "0":  # Do nothing: pass all tokens through unchanged.
        # (`text` is not defined before the loop, so the tokens have to
        # be yielded one by one here.)
        for (ttype, text) in text_tokens:
            yield text
        return
    elif attr == "1":  # Do everything, turn all options on.
        do_quotes = True
        do_backticks = True
        do_dashes = 1
        do_ellipses = True
    elif attr == "2":
        # Do everything, turn all options on, use old school dash shorthand.
        do_quotes = True
        do_backticks = True
        do_dashes = 2
        do_ellipses = True
    elif attr == "3":
        # Do everything, use inverted old school dash shorthand.
        do_quotes = True
        do_backticks = True
        do_dashes = 3
        do_ellipses = True
    elif attr == "-1":  # Special "stupefy" mode.
        do_stupefy = True
    else:
        if "q" in attr: do_quotes = True
        if "b" in attr: do_backticks = True
        if "B" in attr: do_backticks = 2
        if "d" in attr: do_dashes = 1
        if "D" in attr: do_dashes = 2
        if "i" in attr: do_dashes = 3
        if "e" in attr: do_ellipses = True
        if "w" in attr: convert_quot = True

    prev_token_last_char = " "
    # Last character of the previous text token. Used as
    # context to curl leading quote characters correctly.

    for (ttype, text) in text_tokens:

        # skip HTML and/or XML tags as well as empty text tokens
        # without updating the last character
        if ttype == 'tag' or not text:
            yield text
            continue

        # skip literal text (math, literal, raw, ...)
        if ttype == 'literal':
            prev_token_last_char = text[-1:]
            yield text
            continue

        last_char = text[-1:]  # Remember last char before processing.

        text = processEscapes(text)

        if convert_quot:
            text = re.sub('&quot;', '"', text)

        if do_dashes == 1:
            text = educateDashes(text)
        elif do_dashes == 2:
            text = educateDashesOldSchool(text)
        elif do_dashes == 3:
            text = educateDashesOldSchoolInverted(text)

        if do_ellipses:
            text = educateEllipses(text)

        # Note: backticks need to be processed before quotes.
        if do_backticks:
            text = educateBackticks(text, language)

        if do_backticks == 2:
            text = educateSingleBackticks(text, language)

        if do_quotes:
            text = educateQuotes(prev_token_last_char+text, language)[1:]

        if do_stupefy:
            text = stupefyEntities(text, language)

        # Remember last char as context for the next token
        prev_token_last_char = last_char

        text = processEscapes(text, restore=True)

        yield text



def educateQuotes(text, language='en'):
    """
    Parameter:  - text string (unicode or bytes).
                - language (`BCP 47` language tag.)
    Returns:    The `text`, with "educated" curly quote characters.

    Example input:  "Isn't this fun?"
    Example output: “Isn’t this fun?”
    """

    smart = smartchars(language)

    # oldtext = text
    punct_class = r"""[!"#\$\%'()*+,-.\/:;<=>?\@\[\\\]\^_`{|}~]"""

    # Special case if the very first character is a quote
    # followed by punctuation at a non-word-break.
    # Close the quotes by brute force:
    # (In a raw string the non-word-boundary assertion is a single
    # backslash followed by "B"; a doubled backslash would match a literal
    # backslash instead.)
    text = re.sub(r"""^'(?=%s\B)""" % (punct_class,), smart.csquote, text)
    text = re.sub(r"""^"(?=%s\B)""" % (punct_class,), smart.cpquote, text)

    # Special case for double sets of quotes, e.g.:
    #   <p>He said, "'Quoted' words in a larger quote."</p>
    text = re.sub(r""""'(?=\w)""", smart.opquote+smart.osquote, text)
    text = re.sub(r"""'"(?=\w)""", smart.osquote+smart.opquote, text)

    # Special case for decade abbreviations (the '80s):
    if language.startswith('en'):  # TODO similar cases in other languages?
        # `flags` must be passed by keyword: the fourth positional
        # argument of re.sub() is `count`.
        text = re.sub(r"""'(?=\d{2}s)""", smart.apostrophe, text,
                      flags=re.UNICODE)

    close_class = r"""[^\ \t\r\n\[\{\(\-]"""
    # "#" is escaped because the pattern is used with re.VERBOSE, where an
    # unescaped "#" starts a comment.
    dec_dashes = r"""&\#8211;|&\#8212;"""

    # Get most opening single quotes:
    opening_single_quotes_regex = re.compile(r"""
                    (
                            \s          |   # a whitespace char, or
                            &nbsp;      |   # a non-breaking space entity, or
                            --          |   # dashes, or
                            &[mn]dash;  |   # named dash entities
                            %s          |   # or decimal entities
                            &\#x201[34];    # or hex
                    )
                    '                 # the quote
                    (?=\w)            # followed by a word character
                    """ % (dec_dashes,), re.VERBOSE | re.UNICODE)
    text = opening_single_quotes_regex.sub(r'\1'+smart.osquote, text)

    # In many locales, single closing quotes are different from apostrophe:
    if smart.csquote != smart.apostrophe:
        apostrophe_regex = re.compile(r"(?<=(\w|\d))'(?=\w)", re.UNICODE)
        text = apostrophe_regex.sub(smart.apostrophe, text)
    # TODO: keep track of quoting level to recognize apostrophe in, e.g.,
    # "Ich fass' es nicht."

    closing_single_quotes_regex = re.compile(r"""
                    (%s)
                    '
                    (?!\s  |       # whitespace
                       s\b |
                        \d         # digits   ('80s)
                    )
                    """ % (close_class,), re.VERBOSE | re.UNICODE)
    text = closing_single_quotes_regex.sub(r'\1'+smart.csquote, text)

    closing_single_quotes_regex = re.compile(r"""
                    (%s)
                    '
                    (\s | s\b)
                    """ % (close_class,), re.VERBOSE | re.UNICODE)
    text = closing_single_quotes_regex.sub(r'\1%s\2' % smart.csquote, text)

    # Any remaining single quotes should be opening ones:
    text = re.sub(r"""'""", smart.osquote, text)

    # Get most opening double quotes:
    opening_double_quotes_regex = re.compile(r"""
                    (
                            \s          |   # a whitespace char, or
                            &nbsp;      |   # a non-breaking space entity, or
                            --          |   # dashes, or
                            &[mn]dash;  |   # named dash entities
                            %s          |   # or decimal entities
                            &\#x201[34];    # or hex
                    )
                    "                 # the quote
                    (?=\w)            # followed by a word character
                    """ % (dec_dashes,), re.VERBOSE)
    text = opening_double_quotes_regex.sub(r'\1'+smart.opquote, text)

    # Double closing quotes:
    closing_double_quotes_regex = re.compile(r"""
                    #(%s)?   # character that indicates the quote should be closing
                    "
                    (?=\s)
                    """ % (close_class,), re.VERBOSE)
    text = closing_double_quotes_regex.sub(smart.cpquote, text)

    closing_double_quotes_regex = re.compile(r"""
                    (%s)   # character that indicates the quote should be closing
                    "
                    """ % (close_class,), re.VERBOSE)
    text = closing_double_quotes_regex.sub(r'\1'+smart.cpquote, text)

    # Any remaining quotes should be opening ones.
    text = re.sub(r'"', smart.opquote, text)

    return text


def educateBackticks(text, language='en'):
    """
    Parameter:  String (unicode or bytes).
    Returns:    The `text`, with ``backticks'' -style double quotes
                translated into curly quote characters.
    Example input:  ``Isn't this fun?''
    Example output: “Isn't this fun?”
    """
    smart = smartchars(language)

    text = re.sub(r"""``""", smart.opquote, text)
    text = re.sub(r"""''""", smart.cpquote, text)
    return text


def educateSingleBackticks(text, language='en'):
    """
    Parameter:  String (unicode or bytes).
    Returns:    The `text`, with `backticks' -style single quotes
                translated into curly quote characters.

    Example input:  `Isn't this fun?'
    Example output: ‘Isn’t this fun?’
    """
    smart = smartchars(language)

    text = re.sub(r"""`""", smart.osquote, text)
    text = re.sub(r"""'""", smart.csquote, text)
    return text


def educateDashes(text):
    """
    Parameter:  String (unicode or bytes).
    Returns:    The `text`, with each instance of "--" translated to
                an em-dash character.
    """

    text = re.sub(r"""---""", smartchars.endash, text)  # en (yes, backwards)
    text = re.sub(r"""--""", smartchars.emdash, text)   # em (yes, backwards)
    return text


def educateDashesOldSchool(text):
    """
    Parameter:  String (unicode or bytes).
    Returns:    The `text`, with each instance of "--" translated to
                an en-dash character, and each "---" translated to
                an em-dash character.
    """

    text = re.sub(r"""---""", smartchars.emdash, text)
    text = re.sub(r"""--""", smartchars.endash, text)
    return text


def educateDashesOldSchoolInverted(text):
    """
    Parameter:  String (unicode or bytes).
    Returns:    The `text`, with each instance of "--" translated to
                an em-dash character, and each "---" translated to
                an en-dash character. Two reasons why: First, unlike the
                en- and em-dash syntax supported by
                educateDashesOldSchool(), it's compatible with existing
                entries written before SmartyPants 1.1, back when "--" was
                only used for em-dashes.  Second, em-dashes are more
                common than en-dashes, and so it sort of makes sense that
                the shortcut should be shorter to type. (Thanks to Aaron
                Swartz for the idea.)
    """
    text = re.sub(r"""---""", smartchars.endash, text)  # en
    text = re.sub(r"""--""", smartchars.emdash, text)   # em
    return text



def educateEllipses(text):
    """
    Parameter:  String (unicode or bytes).
    Returns:    The `text`, with each instance of "..." translated to
                an ellipsis character.

    Example input:  Huh...?
    Example output: Huh…?
    """

    text = re.sub(r"""\.\.\.""", smartchars.ellipsis, text)
    text = re.sub(r"""\. \. \.""", smartchars.ellipsis, text)
    return text


def stupefyEntities(text, language='en'):
    """
    Parameter:  String (unicode or bytes).
    Returns:    The `text`, with each SmartyPants character translated to
                its ASCII counterpart.

    Example input:  “Hello — world.”
    Example output: "Hello -- world."
    """
    smart = smartchars(language)

    text = re.sub(smart.endash, "-", text)    # en-dash
    text = re.sub(smart.emdash, "--", text)   # em-dash

    text = re.sub(smart.osquote, "'", text)   # open single quote
    text = re.sub(smart.csquote, "'", text)   # close single quote

    text = re.sub(smart.opquote, '"', text)   # open double quote
    text = re.sub(smart.cpquote, '"', text)   # close double quote

    text = re.sub(smart.ellipsis, '...', text)  # ellipsis

    return text


def processEscapes(text, restore=False):
    r"""
    Parameter:  String (unicode or bytes).
    Returns:    The `text` after processing the following backslash
                escape sequences. This is useful if you want to force a "dumb"
                quote or other character to appear.

                Escape  Value
                ------  -----
                \\      &#92;
                \"      &#34;
                \'      &#39;
                \.      &#46;
                \-      &#45;
                \`      &#96;
    """
    replacements = ((r'\\', r'&#92;'),
                    (r'\"', r'&#34;'),
                    (r"\'", r'&#39;'),
                    (r'\.', r'&#46;'),
                    (r'\-', r'&#45;'),
                    (r'\`', r'&#96;'))
    if restore:
        # Put back the plain character (the one after the backslash).
        for (ch, rep) in replacements:
            text = text.replace(rep, ch[1])
    else:
        for (ch, rep) in replacements:
            text = text.replace(ch, rep)

    return text


def tokenize(text):
    """
    Parameter:  String containing HTML markup.
    Returns:    An iterator that yields the tokens comprising the input
                string. Each token is either a tag (possibly with nested
                tags contained therein, such as <a href="<MTFoo>">), or a
                run of text between tags. Each yielded element is a
                two-element tuple; the first is either 'tag' or 'text';
                the second is the actual value.

    Based on the _tokenize() subroutine from Brad Choate's MTRegex plugin.
        <http://www.bradchoate.com/past/mtregex.php>
    """

    pos = 0
    length = len(text)
    # tokens = []

    depth = 6
    nested_tags = "|".join(['(?:<(?:[^<>]',] * depth) + (')*>)' * depth)
    #match = r"""(?: <! ( -- .*? -- \s* )+ > ) |  # comments
    #        (?: <\? .*? \?> ) |  # directives
    #        %s  # nested tags       """ % (nested_tags,)
    tag_soup = re.compile(r"""([^<]*)(<[^>]*>)""")

    token_match = tag_soup.search(text)

    previous_end = 0
    while token_match is not None:
        if token_match.group(1):
            yield ('text', token_match.group(1))

        yield ('tag', token_match.group(2))

        previous_end = token_match.end()
        token_match = tag_soup.search(text, token_match.end())

    if previous_end < len(text):
        yield ('text', text[previous_end:])



if __name__ == "__main__":

    import locale

    try:
        locale.setlocale(locale.LC_ALL, '')
    except locale.Error:
        pass

    from docutils.core import publish_string
    docstring_html = publish_string(__doc__, writer_name='html5')

    print(docstring_html)


    # Unit test output goes out stderr.
    import unittest
    sp = smartyPants

    class TestSmartypantsAllAttributes(unittest.TestCase):
        # the default attribute is "1", which means "all".

        def test_dates(self):
            # Decade abbreviations ('80s) get an apostrophe: see the
            # special case for English in educateQuotes().
            self.assertEqual(sp("1440-80's"), u"1440-80’s")
            self.assertEqual(sp("1440-'80s"), u"1440-’80s")
            self.assertEqual(sp("1440---'80s"), u"1440–’80s")
            self.assertEqual(sp("1960s"), "1960s")  # no effect.
            self.assertEqual(sp("1960's"), u"1960’s")
            self.assertEqual(sp("one two '60s"), u"one two ’60s")
            self.assertEqual(sp("'60s"), u"’60s")

        def test_ordinal_numbers(self):
            self.assertEqual(sp("21st century"), "21st century")  # no effect.
            self.assertEqual(sp("3rd"), "3rd")  # no effect.

        def test_educated_quotes(self):
            self.assertEqual(sp('''"Isn't this fun?"'''), u'“Isn’t this fun?”')

        def test_html_tags(self):
            text = '<a src="foo">more</a>'
            self.assertEqual(sp(text), text)

    unittest.main()