Fix #338: re.sub() flag argument at wrong position.
[docutils.git] / docutils / docutils / utils / smartquotes.py
1 #!/usr/bin/python
2 # -*- coding: utf-8 -*-
4 # :Id: $Id$
5 # :Copyright: © 2010 Günter Milde,
6 # original `SmartyPants`_: © 2003 John Gruber
7 # smartypants.py: © 2004, 2007 Chad Miller
8 # :Maintainer: docutils-develop@lists.sourceforge.net
9 # :License: Released under the terms of the `2-Clause BSD license`_, in short:
11 # Copying and distribution of this file, with or without modification,
12 # are permitted in any medium without royalty provided the copyright
13 # notices and this notice are preserved.
14 # This file is offered as-is, without any warranty.
16 # .. _2-Clause BSD license: http://www.spdx.org/licenses/BSD-2-Clause
19 r"""
20 =========================
21 Smart Quotes for Docutils
22 =========================
24 Synopsis
25 ========
27 "SmartyPants" is a free web publishing plug-in for Movable Type, Blosxom, and
28 BBEdit that easily translates plain ASCII punctuation characters into "smart"
29 typographic punctuation characters.
31 ``smartquotes.py`` is an adaption of "SmartyPants" to Docutils_.
33 * Using Unicode instead of HTML entities for typographic punctuation
34 characters, it works for any output format that supports Unicode.
35 * Supports `language specific quote characters`__.
37 __ http://en.wikipedia.org/wiki/Non-English_usage_of_quotation_marks
40 Authors
41 =======
43 `John Gruber`_ did all of the hard work of writing this software in Perl for
44 `Movable Type`_ and almost all of this useful documentation. `Chad Miller`_
45 ported it to Python to use with Pyblosxom_.
46 Adapted to Docutils_ by Günter Milde.
48 Additional Credits
49 ==================
51 Portions of the SmartyPants original work are based on Brad Choate's nifty
52 MTRegex plug-in. `Brad Choate`_ also contributed a few bits of source code to
53 this plug-in. Brad Choate is a fine hacker indeed.
55 `Jeremy Hedley`_ and `Charles Wiltgen`_ deserve mention for exemplary beta
56 testing of the original SmartyPants.
58 `Rael Dornfest`_ ported SmartyPants to Blosxom.
60 .. _Brad Choate: http://bradchoate.com/
61 .. _Jeremy Hedley: http://antipixel.com/
62 .. _Charles Wiltgen: http://playbacktime.com/
63 .. _Rael Dornfest: http://raelity.org/
66 Copyright and License
67 =====================
69 SmartyPants_ license (3-Clause BSD license):
71 Copyright (c) 2003 John Gruber (http://daringfireball.net/)
72 All rights reserved.
74 Redistribution and use in source and binary forms, with or without
75 modification, are permitted provided that the following conditions are
76 met:
78 * Redistributions of source code must retain the above copyright
79 notice, this list of conditions and the following disclaimer.
81 * Redistributions in binary form must reproduce the above copyright
82 notice, this list of conditions and the following disclaimer in
83 the documentation and/or other materials provided with the
84 distribution.
86 * Neither the name "SmartyPants" nor the names of its contributors
87 may be used to endorse or promote products derived from this
88 software without specific prior written permission.
90 This software is provided by the copyright holders and contributors
91 "as is" and any express or implied warranties, including, but not
92 limited to, the implied warranties of merchantability and fitness for
93 a particular purpose are disclaimed. In no event shall the copyright
94 owner or contributors be liable for any direct, indirect, incidental,
95 special, exemplary, or consequential damages (including, but not
96 limited to, procurement of substitute goods or services; loss of use,
97 data, or profits; or business interruption) however caused and on any
98 theory of liability, whether in contract, strict liability, or tort
99 (including negligence or otherwise) arising in any way out of the use
100 of this software, even if advised of the possibility of such damage.
102 smartypants.py license (2-Clause BSD license):
104 smartypants.py is a derivative work of SmartyPants.
106 Redistribution and use in source and binary forms, with or without
107 modification, are permitted provided that the following conditions are
108 met:
110 * Redistributions of source code must retain the above copyright
111 notice, this list of conditions and the following disclaimer.
113 * Redistributions in binary form must reproduce the above copyright
114 notice, this list of conditions and the following disclaimer in
115 the documentation and/or other materials provided with the
116 distribution.
118 This software is provided by the copyright holders and contributors
119 "as is" and any express or implied warranties, including, but not
120 limited to, the implied warranties of merchantability and fitness for
121 a particular purpose are disclaimed. In no event shall the copyright
122 owner or contributors be liable for any direct, indirect, incidental,
123 special, exemplary, or consequential damages (including, but not
124 limited to, procurement of substitute goods or services; loss of use,
125 data, or profits; or business interruption) however caused and on any
126 theory of liability, whether in contract, strict liability, or tort
127 (including negligence or otherwise) arising in any way out of the use
128 of this software, even if advised of the possibility of such damage.
130 .. _John Gruber: http://daringfireball.net/
131 .. _Chad Miller: http://web.chad.org/
133 .. _Pyblosxom: http://pyblosxom.bluesock.org/
134 .. _SmartyPants: http://daringfireball.net/projects/smartypants/
135 .. _Movable Type: http://www.movabletype.org/
136 .. _2-Clause BSD license: http://www.spdx.org/licenses/BSD-2-Clause
137 .. _Docutils: http://docutils.sf.net/
139 Description
140 ===========
142 SmartyPants can perform the following transformations:
144 - Straight quotes ( " and ' ) into "curly" quote characters
145 - Backticks-style quotes (\`\`like this'') into "curly" quote characters
146 - Dashes (``--`` and ``---``) into en- and em-dash entities
147 - Three consecutive dots (``...`` or ``. . .``) into an ellipsis entity
149 This means you can write, edit, and save your posts using plain old
150 ASCII straight quotes, plain dashes, and plain dots, but your published
151 posts (and final HTML output) will appear with smart quotes, em-dashes,
152 and proper ellipses.
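As a rough illustration of the dash and ellipsis transformations listed above, the following standalone sketch applies them with plain `re.sub()` calls. It assumes the old-school typewriter convention (``--`` for an en-dash, ``---`` for an em-dash); the name `educate_dashes_and_dots` is made up for this example and is not part of this module.

```python
import re

EN_DASH = "\u2013"
EM_DASH = "\u2014"
ELLIPSIS = "\u2026"


def educate_dashes_and_dots(text):
    # Hypothetical helper for illustration, not part of smartquotes.py.
    # Replace the longer pattern first, so that "---" is not
    # consumed as "--" followed by a stray hyphen.
    text = re.sub(r"---", EM_DASH, text)
    text = re.sub(r"--", EN_DASH, text)
    # Three dots, with or without single spaces in between.
    text = re.sub(r"\.\.\.|\. \. \.", ELLIPSIS, text)
    return text


print(educate_dashes_and_dots("pages 3--5 --- and so on..."))
```

The order of the two dash substitutions is the important design choice here: swapping them would turn every ``---`` into an en-dash plus a leftover hyphen.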
154 SmartyPants does not modify characters within ``<pre>``, ``<code>``, ``<kbd>``,
155 ``<math>`` or ``<script>`` tag blocks. Typically, these tags are used to
156 display text where smart quotes and other "smart punctuation" would not be
157 appropriate, such as source code or example markup.
160 Backslash Escapes
161 =================
163 If you need to use literal straight quotes (or plain hyphens and periods),
164 `smartquotes` accepts the following backslash escape sequences to force
165 ASCII punctuation. Note that you need two backslashes, because Docutils
166 expands backslash escapes, too.
168 ======== =========
169 Escape Character
170 ======== =========
171 ``\\`` \\
172 ``\\"`` \\"
173 ``\\'`` \\'
174 ``\\.`` \\.
175 ``\\-`` \\-
176 ``\\``` \\`
177 ======== =========
179 This is useful, for example, when you want to use straight quotes as
180 foot and inch marks: 6\\'2\\" tall; a 17\\" iMac.
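The escape handling can be pictured as a protect-and-restore round trip: escaped characters are swapped for numeric character references before the text is processed and swapped back afterwards. The sketch below follows the idea of `processEscapes()` in this file, but it is a simplified standalone version; the names `protect_escapes` and `restore_escapes` are assumptions made for this example.

```python
# Escape sequence -> numeric reference that later processing steps ignore.
REPLACEMENTS = (
    ("\\\\", "&#92;"),
    ('\\"', "&#34;"),
    ("\\'", "&#39;"),
    ("\\.", "&#46;"),
    ("\\-", "&#45;"),
    ("\\`", "&#96;"),
)


def protect_escapes(text):
    # Hide escaped punctuation from the quote and dash substitutions.
    for escape, entity in REPLACEMENTS:
        text = text.replace(escape, entity)
    return text


def restore_escapes(text):
    # Put the plain ASCII character back (without the backslash).
    for escape, entity in REPLACEMENTS:
        # escape[1] is the character that follows the backslash.
        text = text.replace(entity, escape[1])
    return text


source = "6\\'2\\\" tall"  # the characters: 6\'2\" tall
protected = protect_escapes(source)
print(protected)
print(restore_escapes(protected))
```

In a real pipeline the quote and dash substitutions would run between the two calls, so the protected characters pass through untouched.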
183 Caveats
184 =======
186 Why You Might Not Want to Use Smart Quotes in Your Weblog
187 ---------------------------------------------------------
189 For one thing, you might not care.
191 Most normal, mentally stable individuals do not take notice of proper
192 typographic punctuation. Many design and typography nerds, however, break
193 out in a nasty rash when they encounter, say, a restaurant sign that uses
194 a straight apostrophe to spell "Joe's".
196 If you're the sort of person who just doesn't care, you might well want to
197 continue not caring. Using straight quotes -- and sticking to the 7-bit
198 ASCII character set in general -- is certainly a simpler way to live.
200 Even if you *do* care about accurate typography, you still might want to
201 think twice before educating the quote characters in your weblog. One side
202 effect of publishing curly quote characters is that it makes your
203 weblog a bit harder for others to quote from using copy-and-paste. What
204 happens is that when someone copies text from your blog, the copied text
205 contains the 8-bit curly quote characters (as well as the 8-bit characters
206 for em-dashes and ellipses, if you use these options). These characters
207 are not standard across different text encoding methods, which is why they
208 need to be encoded as characters.
210 People copying text from your weblog, however, may not notice that you're
211 using curly quotes, and they'll go ahead and paste the unencoded 8-bit
212 characters copied from their browser into an email message or their own
213 weblog. When pasted as raw "smart quotes", these characters are likely to
214 get mangled beyond recognition.
216 That said, my own opinion is that any decent text editor or email client
217 makes it easy to stupefy smart quote characters into their 7-bit
218 equivalents, and I don't consider it my problem if you're using an
219 indecent text editor or email client.
222 Algorithmic Shortcomings
223 ------------------------
225 One situation in which quotes will get curled the wrong way is when
226 apostrophes are used at the start of leading contractions. For example::
228 'Twas the night before Christmas.
230 In the case above, SmartyPants will turn the apostrophe into an opening
231 single-quote, when in fact it should be the `right single quotation mark`
232 character which is also "the preferred character to use for apostrophe"
233 (Unicode). I don't think this problem can be solved in the general case --
234 every word processor I've tried gets this wrong as well. In such cases, it's
235 best to use the proper character for closing single-quotes (’) by hand.
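The leading-contraction problem can be reproduced with a deliberately simplified rule. The sketch below is not the algorithm used in this module; it only assumes that a straight quote after whitespace (or at the start of the text) and before a word character is an opening quote, which is exactly the heuristic that misreads 'Twas. The name `curl_single_quotes` is hypothetical.

```python
import re

LEFT_SINGLE = "\u2018"   # opening single quote
RIGHT_SINGLE = "\u2019"  # closing single quote, also used as apostrophe


def curl_single_quotes(text):
    # Simplified, illustrative rule (not the module's full algorithm):
    # a quote at the start of the text or after whitespace that is
    # followed by a word character is taken to be an opening quote.
    text = re.sub(r"(^|\s)'(?=\w)",
                  lambda m: m.group(1) + LEFT_SINGLE, text)
    # Every remaining straight quote becomes a closing quote/apostrophe.
    return text.replace("'", RIGHT_SINGLE)


# The straight quote is curled the wrong way ...
print(curl_single_quotes("'Twas the night before Christmas."))
# ... while a hand-typed right single quotation mark is left alone.
print(curl_single_quotes("\u2019Twas the night before Christmas."))
```

Because the rule only sees the neighbouring characters, it has no way to tell a leading apostrophe from an opening quote, which is why typing the character by hand is the reliable workaround.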
237 In English, the same character is used for apostrophe and closing single
238 quote (both plain and "smart" ones). For other locales (French, Italian,
239 Swiss, ...) "smart" single closing quotes differ from the curly apostrophe.
241 .. class:: language-fr
243 Il dit : "C'est 'super' !"
245 If the apostrophe is used at the end of a word, it cannot be distinguished
246 from a single quote by the algorithm. Therefore, a text like::
248 .. class:: language-de-CH
250 "Er sagt: 'Ich fass' es nicht.'"
252 will get a single closing guillemet instead of an apostrophe.
254 This can be prevented by use of the curly apostrophe character (’) in
255 the source::
257 - "Er sagt: 'Ich fass' es nicht.'"
258 + "Er sagt: 'Ich fass’ es nicht.'"
261 Version History
262 ===============
264 1.8.1: 2017-10-25
265 - Use open quote after Unicode whitespace, ZWSP, and ZWNJ.
266 - Code cleanup.
268 1.8: 2017-04-24
269 - Command line front-end.
271 1.7.1: 2017-03-19
272 - Update and extend language-dependent quotes.
273 - Differentiate apostrophe from single quote.
275 1.7: 2012-11-19
276 - Internationalization: language-dependent quotes.
278 1.6.1: 2012-11-06
279 - Refactor code, code cleanup,
280 - `educate_tokens()` generator as interface for Docutils.
282 1.6: 2010-08-26
283 - Adaption to Docutils:
284 - Use Unicode instead of HTML entities,
285 - Remove code special to pyblosxom.
287 1.5_1.6: Fri, 27 Jul 2007 07:06:40 -0400
288 - Fixed bug where blocks of precious unalterable text was instead
289 interpreted. Thanks to Le Roux and Dirk van Oosterbosch.
291 1.5_1.5: Sat, 13 Aug 2005 15:50:24 -0400
292 - Fix bogus magical quotation when there is no hint that the
293 user wants it, e.g., in "21st century". Thanks to Nathan Hamblen.
294 - Be smarter about quotes before terminating numbers in an en-dash'ed
295 range.
297 1.5_1.4: Thu, 10 Feb 2005 20:24:36 -0500
298 - Fix a date-processing bug, as reported by jacob childress.
299 - Begin a test-suite for ensuring correct output.
300 - Removed import of "string", since I didn't really need it.
301 (This was my first ever Python program. Sue me!)
303 1.5_1.3: Wed, 15 Sep 2004 18:25:58 -0400
304 - Abort processing if the flavour is in forbidden-list. Default of
305 [ "rss" ] (Idea of Wolfgang SCHNERRING.)
306 - Remove stray virgules from en-dashes. Patch by Wolfgang SCHNERRING.
308 1.5_1.2: Mon, 24 May 2004 08:14:54 -0400
309 - Some single quotes weren't replaced properly. Diff-tesuji played
310 by Benjamin GEIGER.
312 1.5_1.1: Sun, 14 Mar 2004 14:38:28 -0500
313 - Support upcoming pyblosxom 0.9 plugin verification feature.
315 1.5_1.0: Tue, 09 Mar 2004 08:08:35 -0500
316 - Initial release
317 """
319 options = r"""
320 Options
321 =======
323 Numeric values are the easiest way to configure SmartyPants' behavior:
325 :0: Suppress all transformations. (Do nothing.)
327 :1: Performs default SmartyPants transformations: quotes (including
328 \`\`backticks'' -style), em-dashes, and ellipses. "``--``" (dash dash)
329 is used to signify an em-dash; there is no support for en-dashes.
331 :2: Same as smarty_pants="1", except that it uses the old-school typewriter
332 shorthand for dashes: "``--``" (dash dash) for en-dashes, "``---``"
333 (dash dash dash)
334 for em-dashes.
336 :3: Same as smarty_pants="2", but inverts the shorthand for dashes:
337 "``--``" (dash dash) for em-dashes, and "``---``" (dash dash dash) for
338 en-dashes.
340 :-1: Stupefy mode. Reverses the SmartyPants transformation process, turning
341 the characters produced by SmartyPants into their ASCII equivalents.
342 E.g. the LEFT DOUBLE QUOTATION MARK (“) is turned into a simple
343 double-quote (\"), "—" is turned into two dashes, etc.
346 The following single-character attribute values can be combined to toggle
347 individual transformations from within the smarty_pants attribute. For
348 example, ``"1"`` is equivalent to ``"qbde"``.
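A minimal sketch of how such an attribute string could be decoded into individual switches, loosely following the branch logic of `educate_tokens()` further down in this file. The helper name `parse_attr` and the dictionary keys are assumptions made for this example; the module itself uses local variables instead.

```python
def parse_attr(attr):
    # Hypothetical helper: decode an attribute string into switches.
    flags = {"quotes": False,      # educate " and '
             "backticks": 0,       # 1: double only, 2: double and single
             "dashes": 0,          # 1: default, 2: old school, 3: inverted
             "ellipses": False,
             "convert_quot": False,
             "stupefy": False}
    if attr in ("1", "2", "3"):
        # Numeric presets: all transformations on; the number
        # selects the dash convention.
        flags.update(quotes=True, backticks=1,
                     dashes=int(attr), ellipses=True)
    elif attr == "-1":
        flags["stupefy"] = True
    else:
        # Single-character switches; "0" contains none of them.
        if "q" in attr:
            flags["quotes"] = True
        if "b" in attr:
            flags["backticks"] = 1
        if "B" in attr:
            flags["backticks"] = 2
        if "d" in attr:
            flags["dashes"] = 1
        if "D" in attr:
            flags["dashes"] = 2
        if "i" in attr:
            flags["dashes"] = 3
        if "e" in attr:
            flags["ellipses"] = True
        if "w" in attr:
            flags["convert_quot"] = True
    return flags


print(parse_attr("qDe"))
```

When several dash switches are combined, the last check that matches wins, as in the original chain of `if` statements.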
350 :q: Educates normal quote characters: (") and (').
352 :b: Educates \`\`backticks'' -style double quotes.
354 :B: Educates \`\`backticks'' -style double quotes and \`single' quotes.
356 :d: Educates em-dashes.
358 :D: Educates em-dashes and en-dashes, using old-school typewriter shorthand:
359 (dash dash) for en-dashes, (dash dash dash) for em-dashes.
361 :i: Educates em-dashes and en-dashes, using inverted old-school typewriter
362 shorthand: (dash dash) for em-dashes, (dash dash dash) for en-dashes.
364 :e: Educates ellipses.
366 :w: Translates any instance of ``&quot;`` into a normal double-quote character.
367 This should be of no interest to most people, but of particular interest
368 to anyone who writes their posts using Dreamweaver, as Dreamweaver
369 inexplicably uses this entity to represent a literal double-quote
370 character. SmartyPants only educates normal quotes, not entities (because
371 ordinarily, entities are used for the explicit purpose of representing the
372 specific character they represent). The "w" option must be used in
373 conjunction with one (or both) of the other quote options ("q" or "b").
374 Thus, if you wish to apply all SmartyPants transformations (quotes, en-
375 and em-dashes, and ellipses) and also translate ``&quot;`` entities into
376 regular quotes so SmartyPants can educate them, you should pass the
377 following to the smarty_pants attribute:
378 """
381 default_smartypants_attr = "1"
384 import re, sys
386 class smartchars(object):
387 """Smart quotes and dashes
388 """
390 endash = u'–' # "&#8211;" EN DASH
391 emdash = u'—' # "&#8212;" EM DASH
392 ellipsis = u'…' # "&#8230;" HORIZONTAL ELLIPSIS
393 apostrophe = u'’' # "&#8217;" RIGHT SINGLE QUOTATION MARK
395 # quote characters (language-specific, set in __init__())
396 # [1] http://en.wikipedia.org/wiki/Non-English_usage_of_quotation_marks
397 # [2] http://de.wikipedia.org/wiki/Anf%C3%BChrungszeichen#Andere_Sprachen
398 # [3] https://fr.wikipedia.org/wiki/Guillemet
399 # [4] http://typographisme.net/post/Les-espaces-typographiques-et-le-web
400 # [5] http://www.btb.termiumplus.gc.ca/tpv2guides/guides/redac/index-fra.html
401 # [6] https://en.wikipedia.org/wiki/Hebrew_punctuation#Quotation_marks
402 # [7] http://www.tustep.uni-tuebingen.de/bi/bi00/bi001t1-anfuehrung.pdf
403 # [8] http://www.korrekturavdelingen.no/anforselstegn.htm
404 # [9] Typografisk håndbok. Oslo: Spartacus. 2000. s. 67. ISBN 8243001530.
405 # [10] http://www.typografi.org/sitat/sitatart.html
407 # See also configuration option "smartquote-locales".
408 quotes = {'af': u'“”‘’',
409 'af-x-altquot': u'„”‚’',
410 'bg': u'„“‚‘', # Bulgarian, https://bg.wikipedia.org/wiki/Кавички
411 'ca': u'«»“”',
412 'ca-x-altquot': u'“”‘’',
413 'cs': u'„“‚‘',
414 'cs-x-altquot': u'»«›‹',
415 'da': u'»«›‹',
416 'da-x-altquot': u'„“‚‘',
417 # 'da-x-altquot2': u'””’’',
418 'de': u'„“‚‘',
419 'de-x-altquot': u'»«›‹',
420 'de-ch': u'«»‹›',
421 'el': u'«»“”',
422 'en': u'“”‘’',
423 'en-uk-x-altquot': u'‘’“”', # Attention: " → ‘ and ' → “ !
424 'eo': u'“”‘’',
425 'es': u'«»“”',
426 'es-x-altquot': u'“”‘’',
427 'et': u'„“‚‘', # no secondary quote listed in
428 'et-x-altquot': u'«»‹›', # the sources above (wikipedia.org)
429 'eu': u'«»‹›',
430 'fi': u'””’’',
431 'fi-x-altquot': u'»»››',
432 'fr': (u'« ', u' »', u'“', u'”'), # full no-break space
433 'fr-x-altquot': (u'« ', u' »', u'“', u'”'), # narrow no-break space
434 'fr-ch': u'«»‹›',
435 'fr-ch-x-altquot': (u'« ', u' »', u'‹ ', u' ›'), # narrow no-break space, http://typoguide.ch/
436 'gl': u'«»“”',
437 'he': u'”“»«', # Hebrew is RTL, test position:
438 'he-x-altquot': u'„”‚’', # low quotation marks are opening.
439 # 'he-x-altquot': u'“„‘‚', # RTL: low quotation marks opening
440 'hr': u'„”‘’', # http://hrvatska-tipografija.com/polunavodnici/
441 'hr-x-altquot': u'»«›‹',
442 'hsb': u'„“‚‘',
443 'hsb-x-altquot':u'»«›‹',
444 'hu': u'„”«»',
445 'is': u'„“‚‘',
446 'it': u'«»“”',
447 'it-ch': u'«»‹›',
448 'it-x-altquot': u'“”‘’',
449 # 'it-x-altquot2': u'“„‘‚', # [7] in headlines
450 'ja': u'「」『』',
451 'lt': u'„“‚‘',
452 'lv': u'„“‚‘',
453 'mk': u'„“‚‘', # Macedonian, https://mk.wikipedia.org/wiki/Правопис_и_правоговор_на_македонскиот_јазик
454 'nl': u'“”‘’',
455 'nl-x-altquot': u'„”‚’',
456 # 'nl-x-altquot2': u'””’’',
457 'nb': u'«»’’', # Norsk bokmål (canonical form 'no')
458 'nn': u'«»’’', # Nynorsk [10]
459 'nn-x-altquot': u'«»‘’', # [8], [10]
460 # 'nn-x-altquot2': u'«»«»', # [9], [10]
461 # 'nn-x-altquot3': u'„“‚‘', # [10]
462 'no': u'«»’’', # Norsk bokmål [10]
463 'no-x-altquot': u'«»‘’', # [8], [10]
464 # 'no-x-altquot2': u'«»«»', # [9], [10]
465 # 'no-x-altquot3': u'„“‚‘', # [10]
466 'pl': u'„”«»',
467 'pl-x-altquot': u'«»‚’',
468 # 'pl-x-altquot2': u'„”‚’', # https://pl.wikipedia.org/wiki/Cudzys%C5%82%C3%B3w
469 'pt': u'«»“”',
470 'pt-br': u'“”‘’',
471 'ro': u'„”«»',
472 'ru': u'«»„“',
473 'sh': u'„”‚’', # Serbo-Croatian
474 'sh-x-altquot': u'»«›‹',
475 'sk': u'„“‚‘', # Slovak
476 'sk-x-altquot': u'»«›‹',
477 'sl': u'„“‚‘', # Slovenian
478 'sl-x-altquot': u'»«›‹',
479 'sq': u'«»‹›', # Albanian
480 'sq-x-altquot': u'“„‘‚',
481 'sr': u'„”’’',
482 'sr-x-altquot': u'»«›‹',
483 'sv': u'””’’',
484 'sv-x-altquot': u'»»››',
485 'tr': u'“”‘’',
486 'tr-x-altquot': u'«»‹›',
487 # 'tr-x-altquot2': u'“„‘‚', # [7] antiquated?
488 'uk': u'«»„“',
489 'uk-x-altquot': u'„“‚‘',
490 'zh-cn': u'“”‘’',
491 'zh-tw': u'「」『』',
492 }
494 def __init__(self, language='en'):
495 self.language = language
496 try:
497 (self.opquote, self.cpquote,
498 self.osquote, self.csquote) = self.quotes[language.lower()]
499 except KeyError:
500 self.opquote, self.cpquote, self.osquote, self.csquote = u'""\'\''
503 def smartyPants(text, attr=default_smartypants_attr, language='en'):
504 """Main function for "traditional" use."""
506 return "".join([t for t in educate_tokens(tokenize(text),
507 attr, language)])
510 def educate_tokens(text_tokens, attr=default_smartypants_attr, language='en'):
511 """Return iterator that "educates" the items of `text_tokens`.
512 """
514 # Parse attributes:
515 # 0 : do nothing
516 # 1 : set all
517 # 2 : set all, using old school en- and em- dash shortcuts
518 # 3 : set all, using inverted old school en and em- dash shortcuts
520 # q : quotes
521 # b : backtick quotes (``double'' only)
522 # B : backtick quotes (``double'' and `single')
523 # d : dashes
524 # D : old school dashes
525 # i : inverted old school dashes
526 # e : ellipses
527 # w : convert &quot; entities to " for Dreamweaver users
529 convert_quot = False # translate &quot; entities into normal quotes?
530 do_dashes = False
531 do_backticks = False
532 do_quotes = False
533 do_ellipses = False
534 do_stupefy = False
536 # if attr == "0": # pass tokens unchanged (see below).
537 if attr == "1": # Do everything, turn all options on.
538 do_quotes = True
539 do_backticks = True
540 do_dashes = 1
541 do_ellipses = True
542 elif attr == "2":
543 # Do everything, turn all options on, use old school dash shorthand.
544 do_quotes = True
545 do_backticks = True
546 do_dashes = 2
547 do_ellipses = True
548 elif attr == "3":
549 # Do everything, use inverted old school dash shorthand.
550 do_quotes = True
551 do_backticks = True
552 do_dashes = 3
553 do_ellipses = True
554 elif attr == "-1": # Special "stupefy" mode.
555 do_stupefy = True
556 else:
557 if "q" in attr: do_quotes = True
558 if "b" in attr: do_backticks = True
559 if "B" in attr: do_backticks = 2
560 if "d" in attr: do_dashes = 1
561 if "D" in attr: do_dashes = 2
562 if "i" in attr: do_dashes = 3
563 if "e" in attr: do_ellipses = True
564 if "w" in attr: convert_quot = True
566 prev_token_last_char = " "
567 # Last character of the previous text token. Used as
568 # context to curl leading quote characters correctly.
570 for (ttype, text) in text_tokens:
572 # skip HTML and/or XML tags as well as empty text tokens
573 # without updating the last character
574 if ttype == 'tag' or not text:
575 yield text
576 continue
578 # skip literal text (math, literal, raw, ...)
579 if ttype == 'literal':
580 prev_token_last_char = text[-1:]
581 yield text
582 continue
584 last_char = text[-1:] # Remember last char before processing.
586 text = processEscapes(text)
588 if convert_quot:
589 text = re.sub('&quot;', '"', text)
591 if do_dashes == 1:
592 text = educateDashes(text)
593 elif do_dashes == 2:
594 text = educateDashesOldSchool(text)
595 elif do_dashes == 3:
596 text = educateDashesOldSchoolInverted(text)
598 if do_ellipses:
599 text = educateEllipses(text)
601 # Note: backticks need to be processed before quotes.
602 if do_backticks:
603 text = educateBackticks(text, language)
605 if do_backticks == 2:
606 text = educateSingleBackticks(text, language)
608 if do_quotes:
609 # Replace plain quotes in context to prevent conversion to
610 # 2-character sequence in French.
611 context = prev_token_last_char.replace('"',';').replace("'",';')
612 text = educateQuotes(context+text, language)[1:]
614 if do_stupefy:
615 text = stupefyEntities(text, language)
617 # Remember last char as context for the next token
618 prev_token_last_char = last_char
620 text = processEscapes(text, restore=True)
622 yield text
626 def educateQuotes(text, language='en'):
627 """
628 Parameter: - text string (unicode or bytes).
629 - language (`BCP 47` language tag.)
630 Returns: The `text`, with "educated" curly quote characters.
632 Example input: "Isn't this fun?"
633 Example output: “Isn’t this fun?”
634 """
636 smart = smartchars(language)
638 punct_class = r"""[!"#\$\%'()*+,-.\/:;<=>?\@\[\\\]\^_`{|}~]"""
639 close_class = r"""[^\ \t\r\n\[\{\(\-]"""
640 open_class = u'[\u200B\u200C]' # ZWSP, ZWNJ
641 dec_dashes = r"""&#8211;|&#8212;"""
643 # Special case if the very first character is a quote
644 # followed by punctuation at a non-word-break.
645 # Close the quotes by brute force:
646 text = re.sub(r"""^'(?=%s\\B)""" % (punct_class,), smart.csquote, text)
647 text = re.sub(r"""^"(?=%s\\B)""" % (punct_class,), smart.cpquote, text)
649 # Special case for double sets of quotes, e.g.:
650 # <p>He said, "'Quoted' words in a larger quote."</p>
651 text = re.sub(r""""'(?=\w)""", smart.opquote+smart.osquote, text)
652 text = re.sub(r"""'"(?=\w)""", smart.osquote+smart.opquote, text)
654 # Special case for decade abbreviations (the '80s):
655 if language.startswith('en'): # TODO similar cases in other languages?
656 text = re.sub(r"'(?=\d{2}s)", smart.apostrophe, text, flags=re.UNICODE)
658 # Get most opening single quotes:
659 opening_single_quotes_regex = re.compile(ur"""
660 (# ?<= # look behind fails: requires fixed-width pattern
661 \s | # a whitespace char, or
662 %s | # another separating char, or
663 &nbsp; | # a non-breaking space entity, or
664 [–—] | # literal dashes, or
665 -- | # dumb dashes, or
666 &[mn]dash; | # dash entities (named or
667 %s | # decimal or
668 &\#x201[34]; # hex)
669 )
670 ' # the quote
671 (?=\w) # followed by a word character
672 """ % (open_class,dec_dashes), re.VERBOSE | re.UNICODE)
673 text = opening_single_quotes_regex.sub(r'\1'+smart.osquote, text)
675 # In many locales, single closing quotes are different from apostrophe:
676 if smart.csquote != smart.apostrophe:
677 apostrophe_regex = re.compile(r"(?<=(\w|\d))'(?=\w)", re.UNICODE)
678 text = apostrophe_regex.sub(smart.apostrophe, text)
679 # TODO: keep track of quoting level to recognize apostrophe in, e.g.,
680 # "Ich fass' es nicht."
682 closing_single_quotes_regex = re.compile(r"""
683 (?<=%s)
684 '
685 """ % close_class, re.VERBOSE)
686 text = closing_single_quotes_regex.sub(smart.csquote, text)
688 # Any remaining single quotes should be opening ones:
689 text = re.sub(r"""'""", smart.osquote, text)
691 # Get most opening double quotes:
692 opening_double_quotes_regex = re.compile(ur"""
693 (
694 \s | # a whitespace char, or
695 %s | # another separating char, or
696 &nbsp; | # a non-breaking space entity, or
697 [–—] | # literal dashes, or
698 -- | # dumb dashes, or
699 &[mn]dash; | # dash entities (named or
700 %s | # decimal or
701 &\#x201[34]; # hex)
702 )
703 " # the quote
704 (?=\w) # followed by a word character
705 """ % (open_class,dec_dashes), re.VERBOSE | re.UNICODE)
706 text = opening_double_quotes_regex.sub(r'\1'+smart.opquote, text)
708 # Double closing quotes:
709 closing_double_quotes_regex = re.compile(r"""
711 (?<=%s)" | # char indicating the quote should be closing
712 "(?=\s) # whitespace behind
714 """ % (close_class,), re.VERBOSE | re.UNICODE)
715 text = closing_double_quotes_regex.sub(smart.cpquote, text)
717 # Any remaining quotes should be opening ones.
718 text = re.sub(r'"', smart.opquote, text)
720 return text
723 def educateBackticks(text, language='en'):
724 """
725 Parameter: String (unicode or bytes).
726 Returns: The `text`, with ``backticks'' -style double quotes
727 translated into curly quote characters.
728 Example input: ``Isn't this fun?''
729 Example output: “Isn't this fun?”
730 """
731 smart = smartchars(language)
733 text = re.sub(r"""``""", smart.opquote, text)
734 text = re.sub(r"""''""", smart.cpquote, text)
735 return text
738 def educateSingleBackticks(text, language='en'):
739 """
740 Parameter: String (unicode or bytes).
741 Returns: The `text`, with `backticks' -style single quotes
742 translated into curly quote characters.
744 Example input: `Isn't this fun?'
745 Example output: ‘Isn’t this fun?’
746 """
747 smart = smartchars(language)
749 text = re.sub(r"""`""", smart.osquote, text)
750 text = re.sub(r"""'""", smart.csquote, text)
751 return text
754 def educateDashes(text):
755 """
756 Parameter: String (unicode or bytes).
757 Returns: The `text`, with each instance of "--" translated to
758 an em-dash character.
759 """
761 text = re.sub(r"""---""", smartchars.endash, text) # en (yes, backwards)
762 text = re.sub(r"""--""", smartchars.emdash, text) # em (yes, backwards)
763 return text
766 def educateDashesOldSchool(text):
767 """
768 Parameter: String (unicode or bytes).
769 Returns: The `text`, with each instance of "--" translated to
770 an en-dash character, and each "---" translated to
771 an em-dash character.
772 """
774 text = re.sub(r"""---""", smartchars.emdash, text)
775 text = re.sub(r"""--""", smartchars.endash, text)
776 return text
779 def educateDashesOldSchoolInverted(text):
780 """
781 Parameter: String (unicode or bytes).
782 Returns: The `text`, with each instance of "--" translated to
783 an em-dash character, and each "---" translated to
784 an en-dash character. Two reasons why: First, unlike the
785 en- and em-dash syntax supported by
786 educateDashesOldSchool(), it's compatible with existing
787 entries written before SmartyPants 1.1, back when "--" was
788 only used for em-dashes. Second, em-dashes are more
789 common than en-dashes, and so it sort of makes sense that
790 the shortcut should be shorter to type. (Thanks to Aaron
791 Swartz for the idea.)
792 """
793 text = re.sub(r"""---""", smartchars.endash, text) # en
794 text = re.sub(r"""--""", smartchars.emdash, text) # em
795 return text
799 def educateEllipses(text):
800 """
801 Parameter: String (unicode or bytes).
802 Returns: The `text`, with each instance of "..." translated to
803 an ellipsis character.
805 Example input: Huh...?
806 Example output: Huh…?
807 """
809 text = re.sub(r"""\.\.\.""", smartchars.ellipsis, text)
810 text = re.sub(r"""\. \. \.""", smartchars.ellipsis, text)
811 return text
814 def stupefyEntities(text, language='en'):
815 """
816 Parameter: String (unicode or bytes).
817 Returns: The `text`, with each SmartyPants character translated to
818 its ASCII counterpart.
820 Example input: “Hello — world.”
821 Example output: "Hello -- world."
822 """
823 smart = smartchars(language)
825 text = re.sub(smart.endash, "-", text) # en-dash
826 text = re.sub(smart.emdash, "--", text) # em-dash
828 text = re.sub(smart.osquote, "'", text) # open single quote
829 text = re.sub(smart.csquote, "'", text) # close single quote
831 text = re.sub(smart.opquote, '"', text) # open double quote
832 text = re.sub(smart.cpquote, '"', text) # close double quote
834 text = re.sub(smart.ellipsis, '...', text)# ellipsis
836 return text
839 def processEscapes(text, restore=False):
840 r"""
841 Parameter: String (unicode or bytes).
842 Returns: The `text`, after processing the following backslash
843 escape sequences. This is useful if you want to force a "dumb"
844 quote or other character to appear.
846 Escape Value
847 ------ -----
848 \\ &#92;
849 \" &#34;
850 \' &#39;
851 \. &#46;
852 \- &#45;
853 \` &#96;
854 """
855 replacements = ((r'\\', r'&#92;'),
856 (r'\"', r'&#34;'),
857 (r"\'", r'&#39;'),
858 (r'\.', r'&#46;'),
859 (r'\-', r'&#45;'),
860 (r'\`', r'&#96;'))
861 if restore:
862 for (ch, rep) in replacements:
863 text = text.replace(rep, ch[1])
864 else:
865 for (ch, rep) in replacements:
866 text = text.replace(ch, rep)
868 return text
871 def tokenize(text):
872 """
873 Parameter: String containing HTML markup.
874 Returns: An iterator that yields the tokens comprising the input
875 string. Each token is either a tag (possibly with nested
876 tags contained therein, such as <a href="<MTFoo>">), or a
877 run of text between tags. Each yielded element is a
878 two-element tuple; the first is either 'tag' or 'text';
879 the second is the actual value.
881 Based on the _tokenize() subroutine from Brad Choate's MTRegex plugin.
882 <http://www.bradchoate.com/past/mtregex.php>
883 """
885 pos = 0
886 length = len(text)
887 # tokens = []
889 depth = 6
890 nested_tags = "|".join(['(?:<(?:[^<>]',] * depth) + (')*>)' * depth)
891 #match = r"""(?: <! ( -- .*? -- \s* )+ > ) | # comments
892 # (?: <\? .*? \?> ) | # directives
893 # %s # nested tags """ % (nested_tags,)
894 tag_soup = re.compile(r"""([^<]*)(<[^>]*>)""")
896 token_match = tag_soup.search(text)
898 previous_end = 0
899 while token_match is not None:
900 if token_match.group(1):
901 yield ('text', token_match.group(1))
903 yield ('tag', token_match.group(2))
905 previous_end = token_match.end()
906 token_match = tag_soup.search(text, token_match.end())
908 if previous_end < len(text):
909 yield ('text', text[previous_end:])
913 if __name__ == "__main__":
915 import itertools
916 try:
917 import locale # module missing in Jython
918 locale.setlocale(locale.LC_ALL, '') # set to user defaults
919 defaultlanguage = locale.getdefaultlocale()[0]
920 except:
921 defaultlanguage = 'en'
923 # Normalize and drop unsupported subtags:
924 defaultlanguage = defaultlanguage.lower().replace('-','_')
925 # split (except singletons, which mark the following tag as non-standard):
926 defaultlanguage = re.sub(r'_([a-zA-Z0-9])_', r'_\1-', defaultlanguage)
927 _subtags = [subtag for subtag in defaultlanguage.split('_')]
928 _basetag = _subtags.pop(0)
929 # find all combinations of subtags
930 for n in range(len(_subtags), 0, -1):
931 for tags in itertools.combinations(_subtags, n):
932 _tag = '-'.join((_basetag,)+tags)
933 if _tag in smartchars.quotes:
934 defaultlanguage = _tag
935 break
936 else:
937 if _basetag in smartchars.quotes:
938 defaultlanguage = _basetag
939 else:
940 defaultlanguage = 'en'
943 import argparse
944 parser = argparse.ArgumentParser(
945 description='Filter stdin making ASCII punctuation "smart".')
946 # parser.add_argument("text", help="text to be acted on")
947 parser.add_argument("-a", "--action", default="1",
948 help="what to do with the input (see --actionhelp)")
949 parser.add_argument("-e", "--encoding", default="utf8",
950 help="text encoding")
951 parser.add_argument("-l", "--language", default=defaultlanguage,
952 help="text language (BCP47 tag), Default: %s"%defaultlanguage)
953 parser.add_argument("-q", "--alternative-quotes", action="store_true",
954 help="use alternative quote style")
955 parser.add_argument("--doc", action="store_true",
956 help="print documentation")
957 parser.add_argument("--actionhelp", action="store_true",
958 help="list available actions")
959 parser.add_argument("--stylehelp", action="store_true",
960 help="list available quote styles")
961 parser.add_argument("--test", action="store_true",
962 help="perform short self-test")
963 args = parser.parse_args()
965 if args.doc:
966 print (__doc__)
967 elif args.actionhelp:
968 print options
969 elif args.stylehelp:
970 print
971 print "Available styles (primary open/close, secondary open/close)"
972 print "language tag quotes"
973 print "============ ======"
974 for key in sorted(smartchars.quotes.keys()):
975 print "%-14s %s" % (key, smartchars.quotes[key])
976 elif args.test:
977 # Unit test output goes to stderr.
978 import unittest
980 class TestSmartypantsAllAttributes(unittest.TestCase):
981 # the default attribute is "1", which means "all".
982 def test_dates(self):
983 self.assertEqual(smartyPants("1440-80's"), u"1440-80’s")
984 self.assertEqual(smartyPants("1440-'80s"), u"1440-’80s")
985 self.assertEqual(smartyPants("1440---'80s"), u"1440–’80s")
986 self.assertEqual(smartyPants("1960's"), u"1960’s")
987 self.assertEqual(smartyPants("one two '60s"), u"one two ’60s")
988 self.assertEqual(smartyPants("'60s"), u"’60s")
990 def test_educated_quotes(self):
991 self.assertEqual(smartyPants('"Isn\'t this fun?"'), u'“Isn’t this fun?”')
993 def test_html_tags(self):
994 text = '<a src="foo">more</a>'
995 self.assertEqual(smartyPants(text), text)
997 suite = unittest.TestLoader().loadTestsFromTestCase(
998 TestSmartypantsAllAttributes)
999 unittest.TextTestRunner().run(suite)
1001 else:
1002 if args.alternative_quotes:
1003 if '-x-altquot' in args.language:
1004 args.language = args.language.replace('-x-altquot', '')
1005 else:
1006 args.language += '-x-altquot'
1007 text = sys.stdin.read().decode(args.encoding)
1008 print smartyPants(text, attr=args.action,
1009 language=args.language).encode(args.encoding)