Fix [ 317 ] Extra space inserted with French smartquotes.
#!/usr/bin/python
# -*- coding: utf-8 -*-

# :Id: $Id$
# :Copyright: © 2010 Günter Milde,
#             original `SmartyPants`_: © 2003 John Gruber
#             smartypants.py:          © 2004, 2007 Chad Miller
# :Maintainer: docutils-develop@lists.sourceforge.net
# :License: Released under the terms of the `2-Clause BSD license`_, in short:
#
#    Copying and distribution of this file, with or without modification,
#    are permitted in any medium without royalty provided the copyright
#    notices and this notice are preserved.
#    This file is offered as-is, without any warranty.
#
# .. _2-Clause BSD license: http://www.spdx.org/licenses/BSD-2-Clause

r"""
=========================
Smart Quotes for Docutils
=========================

Synopsis
========

"SmartyPants" is a free web publishing plug-in for Movable Type, Blosxom, and
BBEdit that easily translates plain ASCII punctuation characters into "smart"
typographic punctuation characters.

``smartquotes.py`` is an adaptation of "SmartyPants" to Docutils_.

* Using Unicode characters instead of HTML entities for typographic quotes, it
  works for any output format that supports Unicode.
* Supports `language specific quote characters`__.

__ http://en.wikipedia.org/wiki/Non-English_usage_of_quotation_marks

Authors
=======

`John Gruber`_ did all of the hard work of writing this software in Perl for
`Movable Type`_ and almost all of this useful documentation.  `Chad Miller`_
ported it to Python to use with Pyblosxom_.
Adapted to Docutils_ by Günter Milde.

Additional Credits
==================

Portions of the SmartyPants original work are based on Brad Choate's nifty
MTRegex plug-in.  `Brad Choate`_ also contributed a few bits of source code to
this plug-in.  Brad Choate is a fine hacker indeed.

`Jeremy Hedley`_ and `Charles Wiltgen`_ deserve mention for exemplary beta
testing of the original SmartyPants.

`Rael Dornfest`_ ported SmartyPants to Blosxom.

.. _Brad Choate: http://bradchoate.com/
.. _Jeremy Hedley: http://antipixel.com/
.. _Charles Wiltgen: http://playbacktime.com/
.. _Rael Dornfest: http://raelity.org/

Copyright and License
=====================

SmartyPants_ license (3-Clause BSD license):

  Copyright (c) 2003 John Gruber (http://daringfireball.net/)
  All rights reserved.

  Redistribution and use in source and binary forms, with or without
  modification, are permitted provided that the following conditions are
  met:

  * Redistributions of source code must retain the above copyright
    notice, this list of conditions and the following disclaimer.

  * Redistributions in binary form must reproduce the above copyright
    notice, this list of conditions and the following disclaimer in
    the documentation and/or other materials provided with the
    distribution.

  * Neither the name "SmartyPants" nor the names of its contributors
    may be used to endorse or promote products derived from this
    software without specific prior written permission.

  This software is provided by the copyright holders and contributors
  "as is" and any express or implied warranties, including, but not
  limited to, the implied warranties of merchantability and fitness for
  a particular purpose are disclaimed. In no event shall the copyright
  owner or contributors be liable for any direct, indirect, incidental,
  special, exemplary, or consequential damages (including, but not
  limited to, procurement of substitute goods or services; loss of use,
  data, or profits; or business interruption) however caused and on any
  theory of liability, whether in contract, strict liability, or tort
  (including negligence or otherwise) arising in any way out of the use
  of this software, even if advised of the possibility of such damage.

smartypants.py license (2-Clause BSD license):

  smartypants.py is a derivative work of SmartyPants.

  Redistribution and use in source and binary forms, with or without
  modification, are permitted provided that the following conditions are
  met:

  * Redistributions of source code must retain the above copyright
    notice, this list of conditions and the following disclaimer.

  * Redistributions in binary form must reproduce the above copyright
    notice, this list of conditions and the following disclaimer in
    the documentation and/or other materials provided with the
    distribution.

  This software is provided by the copyright holders and contributors
  "as is" and any express or implied warranties, including, but not
  limited to, the implied warranties of merchantability and fitness for
  a particular purpose are disclaimed. In no event shall the copyright
  owner or contributors be liable for any direct, indirect, incidental,
  special, exemplary, or consequential damages (including, but not
  limited to, procurement of substitute goods or services; loss of use,
  data, or profits; or business interruption) however caused and on any
  theory of liability, whether in contract, strict liability, or tort
  (including negligence or otherwise) arising in any way out of the use
  of this software, even if advised of the possibility of such damage.

.. _John Gruber: http://daringfireball.net/
.. _Chad Miller: http://web.chad.org/

.. _Pyblosxom: http://pyblosxom.bluesock.org/
.. _SmartyPants: http://daringfireball.net/projects/smartypants/
.. _Movable Type: http://www.movabletype.org/
.. _2-Clause BSD license: http://www.spdx.org/licenses/BSD-2-Clause
.. _Docutils: http://docutils.sf.net/

Description
===========

SmartyPants can perform the following transformations:

- Straight quotes ( " and ' ) into "curly" quote characters
- Backticks-style quotes (\`\`like this'') into "curly" quote characters
- Dashes (``--`` and ``---``) into en- and em-dash entities
- Three consecutive dots (``...`` or ``. . .``) into an ellipsis entity

This means you can write, edit, and save your posts using plain old
ASCII straight quotes, plain dashes, and plain dots, but your published
posts (and final HTML output) will appear with smart quotes, em-dashes,
and proper ellipses.

SmartyPants does not modify characters within ``<pre>``, ``<code>``, ``<kbd>``,
``<math>`` or ``<script>`` tag blocks. Typically, these tags are used to
display text where smart quotes and other "smart punctuation" would not be
appropriate, such as source code or example markup.

Backslash Escapes
=================

If you need to use literal straight quotes (or plain hyphens and periods),
`smartquotes` accepts the following backslash escape sequences to force
ASCII punctuation. Note that you need two backslashes, because Docutils
expands backslash escapes, too.

========  =========
Escape    Character
========  =========
``\\``    \\
``\\"``   \\"
``\\'``   \\'
``\\.``   \\.
``\\-``   \\-
``\\```   \\`
========  =========

This is useful, for example, when you want to use straight quotes as
foot and inch marks: 6\\'2\\" tall; a 17\\" iMac.

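The escape handling can be pictured as a round trip: each escape sequence is
first swapped for a placeholder that the quote and dash rules leave alone, and
the placeholder is turned back into the plain character afterwards. The
following is a minimal, self-contained sketch of that idea for Python 3. The
names ``ESCAPES``, ``protect`` and ``restore`` are hypothetical and only serve
this illustration; the sketch works on single-backslash sequences, that is, on
the text as it looks after Docutils has expanded the doubled backslashes.

```python
# Hypothetical names; a sketch of the placeholder round trip, not the
# module's interface.  The backslash pair is listed first so that it is
# protected before the shorter sequences are examined.
ESCAPES = (
    ("\\\\", "&#92;"),
    ('\\"', "&#34;"),
    ("\\'", "&#39;"),
    ("\\.", "&#46;"),
    ("\\-", "&#45;"),
    ("\\`", "&#96;"),
)


def protect(text):
    # Replace every escape sequence by its placeholder.
    for sequence, placeholder in ESCAPES:
        text = text.replace(sequence, placeholder)
    return text


def restore(text):
    # Replace every placeholder by the escaped character itself,
    # i.e. the second character of the two-character sequence.
    for sequence, placeholder in ESCAPES:
        text = text.replace(placeholder, sequence[1])
    return text


sample = "6\\'2\\\" tall"          # the characters: 6\'2\" tall
hidden = protect(sample)           # no straight quotes left to "educate"
plain = restore(hidden)            # straight quotes are back, backslashes gone
```

While the text is in its protected form it contains no straight quotes, so a
quote-curling step has nothing to act on.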
Options
=======

Numeric values are the easiest way to configure SmartyPants' behavior:

:0:     Suppress all transformations. (Do nothing.)

:1:     Performs default SmartyPants transformations: quotes (including
        \`\`backticks'' -style), em-dashes, and ellipses. "``--``" (dash dash)
        is used to signify an em-dash; there is no support for en-dashes.

:2:     Same as smarty_pants="1", except that it uses the old-school typewriter
        shorthand for dashes: "``--``" (dash dash) for en-dashes, "``---``"
        (dash dash dash) for em-dashes.

:3:     Same as smarty_pants="2", but inverts the shorthand for dashes:
        "``--``" (dash dash) for em-dashes, and "``---``" (dash dash dash) for
        en-dashes.

:-1:    Stupefy mode. Reverses the SmartyPants transformation process, turning
        the characters produced by SmartyPants into their ASCII equivalents.
        E.g. the LEFT DOUBLE QUOTATION MARK (“) is turned into a simple
        double-quote (\"), "—" is turned into two dashes, etc.

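The three dash conventions selected by the values 1, 2 and 3 can be sketched
as follows. This is a simplified illustration for Python 3 that follows the
option descriptions above; the function name ``educate_dashes_sketch`` and its
``mode`` parameter are assumptions made for this example and are not part of
the module. In the modes that know both shorthands, the longer one is replaced
first so that it is not consumed by the shorter one.

```python
EN_DASH = "\u2013"
EM_DASH = "\u2014"


def educate_dashes_sketch(text, mode):
    # mode 1: "--" becomes an em-dash
    # mode 2: "---" becomes an em-dash, "--" an en-dash (old school)
    # mode 3: "---" becomes an en-dash, "--" an em-dash (inverted)
    # any other mode: text is returned unchanged
    if mode == 1:
        text = text.replace("--", EM_DASH)
    elif mode == 2:
        text = text.replace("---", EM_DASH).replace("--", EN_DASH)
    elif mode == 3:
        text = text.replace("---", EN_DASH).replace("--", EM_DASH)
    return text


source = "pages 3--5 --- see below"
old_school = educate_dashes_sketch(source, 2)
inverted = educate_dashes_sketch(source, 3)
```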
The following single-character attribute values can be combined to toggle
individual transformations from within the smarty_pants attribute. For
example, ``py['smartypants_attributes'] = "1"`` is equivalent to
``py['smartypants_attributes'] = "qBde"``.

:q:     Educates normal quote characters: (") and (').

:b:     Educates \`\`backticks'' -style double quotes.

:B:     Educates \`\`backticks'' -style double quotes and \`single' quotes.

:d:     Educates em-dashes.

:D:     Educates em-dashes and en-dashes, using old-school typewriter shorthand:
        (dash dash) for en-dashes, (dash dash dash) for em-dashes.

:i:     Educates em-dashes and en-dashes, using inverted old-school typewriter
        shorthand: (dash dash) for em-dashes, (dash dash dash) for en-dashes.

:e:     Educates ellipses.

:w:     Translates any instance of ``&quot;`` into a normal double-quote character.
        This should be of no interest to most people, but of particular interest
        to anyone who writes their posts using Dreamweaver, as Dreamweaver
        inexplicably uses this entity to represent a literal double-quote
        character. SmartyPants only educates normal quotes, not entities (because
        ordinarily, entities are used for the explicit purpose of representing the
        specific character they represent). The "w" option must be used in
        conjunction with one (or both) of the other quote options ("q" or "b").
        Thus, if you wish to apply all SmartyPants transformations (quotes, en-
        and em-dashes, and ellipses) and also translate ``&quot;`` entities into
        regular quotes so SmartyPants can educate them, you should pass the
        flags ``"qDew"`` to the smarty_pants attribute.

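As a rough illustration of how a flag string maps onto individual switches,
here is a small sketch. The helper ``parse_flags`` and the dictionary keys are
hypothetical names chosen for this example; the numeric presets are left out.
Later flags override earlier ones of the same group, so ``B`` wins over ``b``.

```python
def parse_flags(attr):
    # Hypothetical helper: translate single-character flags into switches.
    opts = {
        "quotes": False,
        "backticks": 0,      # 0 = off, 1 = double only, 2 = double and single
        "dashes": 0,         # 0 = off, 1 = em only, 2 = old school, 3 = inverted
        "ellipses": False,
        "convert_quot": False,
    }
    if "q" in attr:
        opts["quotes"] = True
    if "b" in attr:
        opts["backticks"] = 1
    if "B" in attr:
        opts["backticks"] = 2
    if "d" in attr:
        opts["dashes"] = 1
    if "D" in attr:
        opts["dashes"] = 2
    if "i" in attr:
        opts["dashes"] = 3
    if "e" in attr:
        opts["ellipses"] = True
    if "w" in attr:
        opts["convert_quot"] = True
    return opts


everything = parse_flags("qDew")
```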
Caveats
=======

Why You Might Not Want to Use Smart Quotes in Your Weblog
---------------------------------------------------------

For one thing, you might not care.

Most normal, mentally stable individuals do not take notice of proper
typographic punctuation. Many design and typography nerds, however, break
out in a nasty rash when they encounter, say, a restaurant sign that uses
a straight apostrophe to spell "Joe's".

If you're the sort of person who just doesn't care, you might well want to
continue not caring. Using straight quotes -- and sticking to the 7-bit
ASCII character set in general -- is certainly a simpler way to live.

Even if you *do* care about accurate typography, you still might want to
think twice before educating the quote characters in your weblog. One side
effect of publishing curly quote characters is that it makes your
weblog a bit harder for others to quote from using copy-and-paste. What
happens is that when someone copies text from your blog, the copied text
contains the 8-bit curly quote characters (as well as the 8-bit characters
for em-dashes and ellipses, if you use these options). These characters
are not standard across different text encoding methods, which is why they
need to be encoded as characters.

People copying text from your weblog, however, may not notice that you're
using curly quotes, and they'll go ahead and paste the unencoded 8-bit
characters copied from their browser into an email message or their own
weblog. When pasted as raw "smart quotes", these characters are likely to
get mangled beyond recognition.

That said, my own opinion is that any decent text editor or email client
makes it easy to stupefy smart quote characters into their 7-bit
equivalents, and I don't consider it my problem if you're using an
indecent text editor or email client.

Algorithmic Shortcomings
------------------------

One situation in which quotes will get curled the wrong way is when
apostrophes are used at the start of leading contractions. For example::

  'Twas the night before Christmas.

In the case above, SmartyPants will turn the apostrophe into an opening
single-quote, when in fact it should be the `right single quotation mark`
character which is also "the preferred character to use for apostrophe"
(Unicode). I don't think this problem can be solved in the general case --
every word processor I've tried gets this wrong as well. In such cases, it's
best to use the proper character for closing single-quotes (’) by hand.

In English, the same character is used for apostrophe and closing single
quote (both plain and "smart" ones). For other locales (French, Italian,
Swiss, ...) "smart" single closing quotes differ from the curly apostrophe.

  .. class:: language-fr

  Il dit : "C'est 'super' !"

If the apostrophe is used at the end of a word, it cannot be distinguished
from a single quote by the algorithm. Therefore, a text like::

  .. class:: language-de-CH

  "Er sagt: 'Ich fass' es nicht.'"

will get a single closing guillemet instead of an apostrophe.

This can be prevented by use of the curly apostrophe character (’) in
the source::

  -  "Er sagt: 'Ich fass' es nicht.'"
  +  "Er sagt: 'Ich fass’ es nicht.'"

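Why the word-final case is ambiguous can be seen from a small sketch. A
straight quote that sits between two word characters can only be an
apostrophe, so it is safe to replace; a quote at the edge of a word could be
either an apostrophe or a quotation mark. The function name
``curl_inner_apostrophes`` is an assumption for this illustration (Python 3),
not a function of this module.

```python
import re

APOSTROPHE = "\u2019"  # RIGHT SINGLE QUOTATION MARK


def curl_inner_apostrophes(text):
    # Only quotes with a word character on both sides are replaced;
    # quotes at the start or end of a word are left for the quote rules.
    return re.sub(r"(?<=\w)'(?=\w)", APOSTROPHE, text)


french = curl_inner_apostrophes("Il dit : \"C'est 'super' !\"")
german = curl_inner_apostrophes("Ich fass' es nicht.")
```

In the French sample only the apostrophe in "C'est" is replaced, while the
German sample comes back unchanged because the quote after "fass" is followed
by a space.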
Version History
===============

1.7.1: 2017-03-19
    - Update and extend language-dependent quotes.
    - Differentiate apostrophe from single quote.

1.7:   2012-11-19
    - Internationalization: language-dependent quotes.

1.6.1: 2012-11-06
    - Refactor code, code cleanup,
    - `educate_tokens()` generator as interface for Docutils.

1.6:   2010-08-26
    - Adaptation to Docutils:
      - Use Unicode instead of HTML entities,
      - Remove code special to pyblosxom.

1.5_1.6: Fri, 27 Jul 2007 07:06:40 -0400
    - Fixed bug where blocks of precious unalterable text were instead
      interpreted.  Thanks to Le Roux and Dirk van Oosterbosch.

1.5_1.5: Sat, 13 Aug 2005 15:50:24 -0400
    - Fix bogus magical quotation when there is no hint that the
      user wants it, e.g., in "21st century".  Thanks to Nathan Hamblen.
    - Be smarter about quotes before terminating numbers in an en-dash'ed
      range.

1.5_1.4: Thu, 10 Feb 2005 20:24:36 -0500
    - Fix a date-processing bug, as reported by jacob childress.
    - Begin a test-suite for ensuring correct output.
    - Removed import of "string", since I didn't really need it.
      (This was my first ever Python program.  Sue me!)

1.5_1.3: Wed, 15 Sep 2004 18:25:58 -0400
    - Abort processing if the flavour is in forbidden-list.  Default of
      [ "rss" ]   (Idea of Wolfgang SCHNERRING.)
    - Remove stray virgules from en-dashes.  Patch by Wolfgang SCHNERRING.

1.5_1.2: Mon, 24 May 2004 08:14:54 -0400
    - Some single quotes weren't replaced properly.  Diff-tesuji played
      by Benjamin GEIGER.

1.5_1.1: Sun, 14 Mar 2004 14:38:28 -0500
    - Support upcoming pyblosxom 0.9 plugin verification feature.

1.5_1.0: Tue, 09 Mar 2004 08:08:35 -0500
    - Initial release
"""

default_smartypants_attr = "1"


import re

class smartchars(object):
    """Smart quotes and dashes
    """

    endash = u'–' # "&#8211;" EN DASH
    emdash = u'—' # "&#8212;" EM DASH
    ellipsis = u'…' # "&#8230;" HORIZONTAL ELLIPSIS
    apostrophe = u'’' # "&#8217;" RIGHT SINGLE QUOTATION MARK

    # quote characters (language-specific, set in __init__())
    # [1] http://en.wikipedia.org/wiki/Non-English_usage_of_quotation_marks
    # [2] http://de.wikipedia.org/wiki/Anf%C3%BChrungszeichen#Andere_Sprachen
    # [3] https://fr.wikipedia.org/wiki/Guillemet
    # [4] http://typographisme.net/post/Les-espaces-typographiques-et-le-web
    # [5] http://www.btb.termiumplus.gc.ca/tpv2guides/guides/redac/index-fra.html
    # [6] https://en.wikipedia.org/wiki/Hebrew_punctuation#Quotation_marks
    # [7] http://www.tustep.uni-tuebingen.de/bi/bi00/bi001t1-anfuehrung.pdf
    #
    # TODO: configuration option, e.g.::
    #
    #     smartquote-locales: nl: „“’’,  # apostrophe for ``'s Gravenhage``
    #                         nr: se,    # alias
    #                         fr: « : »:‹ : ›,  # :-separated list with NBSPs
    #
    # The no-break spaces in the French entries are written as escapes
    # (\u00a0 NO-BREAK SPACE, \u202f NARROW NO-BREAK SPACE), following the
    # comments on these entries, so that they stay visible in the source.
    quotes = {'af':           u'“”‘’',
              'af-x-altquot': u'„”‚’',
              'ca':           u'«»“”',
              'ca-x-altquot': u'“”‘’',
              'cs':           u'„“‚‘',
              'cs-x-altquot': u'»«›‹',
              'da':           u'»«›‹',
              'da-x-altquot': u'„“‚‘',
              # 'da-x-altquot2': u'””’’',
              'de':           u'„“‚‘',
              'de-x-altquot': u'»«›‹',
              'de-ch':        u'«»‹›',
              'el':           u'«»“”',
              'en':           u'“”‘’',
              'en-uk-x-altquot': u'‘’“”', # Attention: " → ‘ and ' → “ !
              'eo':           u'“”‘’',
              'es':           u'«»“”',
              'es-x-altquot': u'“”‘’',
              'et':           u'„“‚‘', # no secondary quote listed in
              'et-x-altquot': u'«»‹›', # the sources above (wikipedia.org)
              'eu':           u'«»‹›',
              'fi':           u'””’’',
              'fi-x-altquot': u'»»››',
              'fr':           (u'«\u00a0', u'\u00a0»', u'“', u'”'), # full no-break space
              'fr-x-altquot': (u'«\u202f', u'\u202f»', u'“', u'”'), # narrow no-break space
              'fr-ch':        u'«»‹›',
              'fr-ch-x-altquot': (u'«\u202f', u'\u202f»', u'‹\u202f', u'\u202f›'), # narrow no-break space, http://typoguide.ch/
              'gl':           u'«»“”',
              'he':           u'”“»«', # Hebrew is RTL, test position:
              'he-x-altquot': u'„”‚’', # low quotation marks are opening.
              # 'he-x-altquot': u'“„‘‚', # RTL: low quotation marks opening
              'hr':           u'„”‘’', # http://hrvatska-tipografija.com/polunavodnici/
              'hr-x-altquot': u'»«›‹',
              'hsb':          u'„“‚‘',
              'hsb-x-altquot': u'»«›‹',
              'hu':           u'„”«»',
              'is':           u'„“‚‘',
              'it':           u'«»“”',
              'it-ch':        u'«»‹›',
              'it-x-altquot': u'“”‘’',
              # 'it-x-altquot2': u'“„‘‚', # [7] antiquated?
              'ja':           u'「」『』',
              'lt':           u'„“‚‘',
              'lv':           u'„“‚‘',
              'nl':           u'“”‘’',
              'nl-x-altquot': u'„”‚’',
              # 'nl-x-altquot2': u'””’’',
              'pl':           u'„”«»',
              'pl-x-altquot': u'«»‚’',
              # 'pl-x-altquot2': u'„”‚’', # https://pl.wikipedia.org/wiki/Cudzys%C5%82%C3%B3w
              'pt':           u'«»“”',
              'pt-br':        u'“”‘’',
              'ro':           u'„”«»',
              'ru':           u'«»„“',
              'sh':           u'„”‚’', # Serbo-Croatian
              'sh-x-altquot': u'»«›‹',
              'sk':           u'„“‚‘', # Slovak
              'sk-x-altquot': u'»«›‹',
              'sl':           u'„“‚‘', # Slovenian
              'sl-x-altquot': u'»«›‹',
              'sq':           u'«»‹›', # Albanian
              'sq-x-altquot': u'“„‘‚',
              'sr':           u'„”’’',
              'sr-x-altquot': u'»«›‹',
              'sv':           u'””’’',
              'sv-x-altquot': u'»»››',
              'tr':           u'“”‘’',
              'tr-x-altquot': u'«»‹›',
              # 'tr-x-altquot2': u'“„‘‚', # [7] antiquated?
              'uk':           u'«»„“',
              'uk-x-altquot': u'„“‚‘',
              'zh-cn':        u'“”‘’',
              'zh-tw':        u'「」『』',
             }

    def __init__(self, language='en'):
        self.language = language
        try:
            (self.opquote, self.cpquote,
             self.osquote, self.csquote) = self.quotes[language.lower()]
        except KeyError:
            # Unknown language: fall back to plain ASCII quotes.
            self.opquote, self.cpquote, self.osquote, self.csquote = u'""\'\''


def smartyPants(text, attr=default_smartypants_attr, language='en'):
    """Main function for "traditional" use."""

    return "".join([t for t in educate_tokens(tokenize(text),
                                              attr, language)])


def educate_tokens(text_tokens, attr=default_smartypants_attr, language='en'):
    """Return iterator that "educates" the items of `text_tokens`.
    """

    # Parse attributes:
    # 0 : do nothing
    # 1 : set all
    # 2 : set all, using old school en- and em- dash shortcuts
    # 3 : set all, using inverted old school en and em- dash shortcuts
    #
    # q : quotes
    # b : backtick quotes (``double'' only)
    # B : backtick quotes (``double'' and `single')
    # d : dashes
    # D : old school dashes
    # i : inverted old school dashes
    # e : ellipses
    # w : convert &quot; entities to " for Dreamweaver users

    convert_quot = False  # translate &quot; entities into normal quotes?
    do_dashes = False
    do_backticks = False
    do_quotes = False
    do_ellipses = False
    do_stupefy = False

    if attr == "0": # Do nothing: pass the token texts through unchanged.
        # (`text` is not defined at this point, so it cannot be yielded here.)
        for (ttype, text) in text_tokens:
            yield text
        return
    elif attr == "1": # Do everything, turn all options on.
        do_quotes = True
        do_backticks = True
        do_dashes = 1
        do_ellipses = True
    elif attr == "2":
        # Do everything, turn all options on, use old school dash shorthand.
        do_quotes = True
        do_backticks = True
        do_dashes = 2
        do_ellipses = True
    elif attr == "3":
        # Do everything, use inverted old school dash shorthand.
        do_quotes = True
        do_backticks = True
        do_dashes = 3
        do_ellipses = True
    elif attr == "-1": # Special "stupefy" mode.
        do_stupefy = True
    else:
        if "q" in attr: do_quotes = True
        if "b" in attr: do_backticks = True
        if "B" in attr: do_backticks = 2
        if "d" in attr: do_dashes = 1
        if "D" in attr: do_dashes = 2
        if "i" in attr: do_dashes = 3
        if "e" in attr: do_ellipses = True
        if "w" in attr: convert_quot = True

    prev_token_last_char = " "
    # Last character of the previous text token. Used as
    # context to curl leading quote characters correctly.

    for (ttype, text) in text_tokens:

        # skip HTML and/or XML tags as well as empty text tokens
        # without updating the last character
        if ttype == 'tag' or not text:
            yield text
            continue

        # skip literal text (math, literal, raw, ...)
        if ttype == 'literal':
            prev_token_last_char = text[-1:]
            yield text
            continue

        last_char = text[-1:] # Remember last char before processing.

        text = processEscapes(text)

        if convert_quot:
            text = re.sub('&quot;', '"', text)

        if do_dashes == 1:
            text = educateDashes(text)
        elif do_dashes == 2:
            text = educateDashesOldSchool(text)
        elif do_dashes == 3:
            text = educateDashesOldSchoolInverted(text)

        if do_ellipses:
            text = educateEllipses(text)

        # Note: backticks need to be processed before quotes.
        if do_backticks:
            text = educateBackticks(text, language)

        if do_backticks == 2:
            text = educateSingleBackticks(text, language)

        if do_quotes:
            # Replace plain quotes in the context character to prevent
            # conversion to a 2-character sequence in French.
            context = prev_token_last_char.replace('"',';').replace("'",';')
            text = educateQuotes(context+text, language)[1:]

        if do_stupefy:
            text = stupefyEntities(text, language)

        # Remember last char as context for the next token
        prev_token_last_char = last_char

        text = processEscapes(text, restore=True)

        yield text



def educateQuotes(text, language='en'):
    """
    Parameter:  - text string (unicode or bytes).
                - language (`BCP 47` language tag.)
    Returns:    The `text`, with "educated" curly quote characters.

    Example input:  "Isn't this fun?"
    Example output: “Isn’t this fun?”
    """

    smart = smartchars(language)

    # oldtext = text
    punct_class = r"""[!"#\$\%'()*+,-.\/:;<=>?\@\[\\\]\^_`{|}~]"""

    # Special case if the very first character is a quote
    # followed by punctuation at a non-word-break.
    # Close the quotes by brute force:
    text = re.sub(r"""^'(?=%s\\B)""" % (punct_class,), smart.csquote, text)
    text = re.sub(r"""^"(?=%s\\B)""" % (punct_class,), smart.cpquote, text)

    # Special case for double sets of quotes, e.g.:
    #   <p>He said, "'Quoted' words in a larger quote."</p>
    text = re.sub(r""""'(?=\w)""", smart.opquote+smart.osquote, text)
    text = re.sub(r"""'"(?=\w)""", smart.osquote+smart.opquote, text)

    # Special case for decade abbreviations (the '80s):
    if language.startswith('en'): # TODO similar cases in other languages?
        # Pass the flag by keyword: the fourth positional argument
        # of re.sub() is `count`, not `flags`.
        text = re.sub(r"""'(?=\d{2}s)""", smart.apostrophe, text,
                      flags=re.UNICODE)

    close_class = r"""[^\ \t\r\n\[\{\(\-]"""
    dec_dashes = r"""&#8211;|&#8212;"""

    # Get most opening single quotes:
    opening_single_quotes_regex = re.compile(r"""
                    (
                            \s          |   # a whitespace char, or
                            &nbsp;      |   # a non-breaking space entity, or
                            --          |   # dashes, or
                            &[mn]dash;  |   # named dash entities
                            %s          |   # or decimal entities
                            &\#x201[34];    # or hex
                    )
                    '                 # the quote
                    (?=\w)            # followed by a word character
                    """ % (dec_dashes,), re.VERBOSE | re.UNICODE)
    text = opening_single_quotes_regex.sub(r'\1'+smart.osquote, text)

    # In many locales, single closing quotes are different from apostrophe:
    if smart.csquote != smart.apostrophe:
        apostrophe_regex = re.compile(r"(?<=(\w|\d))'(?=\w)", re.UNICODE)
        text = apostrophe_regex.sub(smart.apostrophe, text)
    # TODO: keep track of quoting level to recognize apostrophe in, e.g.,
    # "Ich fass' es nicht."

    closing_single_quotes_regex = re.compile(r"""
                    (%s)
                    '
                    (?!\s  |       # whitespace
                       s\b |
                        \d         # digits   ('80s)
                    )
                    """ % (close_class,), re.VERBOSE | re.UNICODE)
    text = closing_single_quotes_regex.sub(r'\1'+smart.csquote, text)

    closing_single_quotes_regex = re.compile(r"""
                    (%s)
                    '
                    (\s | s\b)
                    """ % (close_class,), re.VERBOSE | re.UNICODE)
    text = closing_single_quotes_regex.sub(r'\1%s\2' % smart.csquote, text)

    # Any remaining single quotes should be opening ones:
    text = re.sub(r"""'""", smart.osquote, text)

    # Get most opening double quotes:
    opening_double_quotes_regex = re.compile(r"""
                    (
                            \s          |   # a whitespace char, or
                            &nbsp;      |   # a non-breaking space entity, or
                            --          |   # dashes, or
                            &[mn]dash;  |   # named dash entities
                            %s          |   # or decimal entities
                            &\#x201[34];    # or hex
                    )
                    "                 # the quote
                    (?=\w)            # followed by a word character
                    """ % (dec_dashes,), re.VERBOSE)
    text = opening_double_quotes_regex.sub(r'\1'+smart.opquote, text)

    # Double closing quotes:
    closing_double_quotes_regex = re.compile(r"""
                    #(%s)?   # character that indicates the quote should be closing
                    "
                    (?=\s)
                    """ % (close_class,), re.VERBOSE)
    text = closing_double_quotes_regex.sub(smart.cpquote, text)

    closing_double_quotes_regex = re.compile(r"""
                    (%s)   # character that indicates the quote should be closing
                    "
                    """ % (close_class,), re.VERBOSE)
    text = closing_double_quotes_regex.sub(r'\1'+smart.cpquote, text)

    # Any remaining quotes should be opening ones.
    text = re.sub(r'"', smart.opquote, text)

    return text


def educateBackticks(text, language='en'):
    """
    Parameter:  String (unicode or bytes).
    Returns:    The `text`, with ``backticks'' -style double quotes
                translated into curly quote characters.
    Example input:  ``Isn't this fun?''
    Example output: “Isn't this fun?”
    """
    smart = smartchars(language)

    text = re.sub(r"""``""", smart.opquote, text)
    text = re.sub(r"""''""", smart.cpquote, text)
    return text


def educateSingleBackticks(text, language='en'):
    """
    Parameter:  String (unicode or bytes).
    Returns:    The `text`, with `backticks' -style single quotes
                translated into curly quote characters.

    Example input:  `Isn't this fun?'
    Example output: ‘Isn’t this fun?’
    """
    smart = smartchars(language)

    text = re.sub(r"""`""", smart.osquote, text)
    text = re.sub(r"""'""", smart.csquote, text)
    return text


def educateDashes(text):
    """
    Parameter:  String (unicode or bytes).
    Returns:    The `text`, with each instance of "--" translated to
                an em-dash character (and each "---" to an en-dash
                character).
    """

    text = re.sub(r"""---""", smartchars.endash, text) # en (yes, backwards)
    text = re.sub(r"""--""", smartchars.emdash, text) # em (yes, backwards)
    return text


def educateDashesOldSchool(text):
    """
    Parameter:  String (unicode or bytes).
    Returns:    The `text`, with each instance of "--" translated to
                an en-dash character, and each "---" translated to
                an em-dash character.
    """

    text = re.sub(r"""---""", smartchars.emdash, text)
    text = re.sub(r"""--""", smartchars.endash, text)
    return text


def educateDashesOldSchoolInverted(text):
    """
    Parameter:  String (unicode or bytes).
    Returns:    The `text`, with each instance of "--" translated to
                an em-dash character, and each "---" translated to
                an en-dash character. Two reasons why: First, unlike the
                en- and em-dash syntax supported by
                EducateDashesOldSchool(), it's compatible with existing
                entries written before SmartyPants 1.1, back when "--" was
                only used for em-dashes.  Second, em-dashes are more
                common than en-dashes, and so it sort of makes sense that
                the shortcut should be shorter to type. (Thanks to Aaron
                Swartz for the idea.)
    """
    text = re.sub(r"""---""", smartchars.endash, text) # en
    text = re.sub(r"""--""", smartchars.emdash, text) # em
    return text



def educateEllipses(text):
    """
    Parameter:  String (unicode or bytes).
    Returns:    The `text`, with each instance of "..." translated to
                an ellipsis character.

    Example input:  Huh...?
    Example output: Huh…?
    """

    text = re.sub(r"""\.\.\.""", smartchars.ellipsis, text)
    text = re.sub(r"""\. \. \.""", smartchars.ellipsis, text)
    return text


def stupefyEntities(text, language='en'):
    """
    Parameter:  String (unicode or bytes).
    Returns:    The `text`, with each SmartyPants character translated to
                its ASCII counterpart.

    Example input:  “Hello — world.”
    Example output: "Hello -- world."
    """
    smart = smartchars(language)

    text = re.sub(smart.endash, "-", text)  # en-dash
    text = re.sub(smart.emdash, "--", text) # em-dash

    text = re.sub(smart.osquote, "'", text) # open single quote
    text = re.sub(smart.csquote, "'", text) # close single quote

    text = re.sub(smart.opquote, '"', text) # open double quote
    text = re.sub(smart.cpquote, '"', text) # close double quote

    text = re.sub(smart.ellipsis, '...', text) # ellipsis

    return text


def processEscapes(text, restore=False):
    r"""
    Parameter:  String (unicode or bytes).
    Returns:    The `text`, after processing the following backslash
                escape sequences. This is useful if you want to force a "dumb"
                quote or other character to appear.

                Escape  Value
                ------  -----
                \\      &#92;
                \"      &#34;
                \'      &#39;
                \.      &#46;
                \-      &#45;
                \`      &#96;
    """
    replacements = ((r'\\', r'&#92;'),
                    (r'\"', r'&#34;'),
                    (r"\'", r'&#39;'),
                    (r'\.', r'&#46;'),
                    (r'\-', r'&#45;'),
                    (r'\`', r'&#96;'))
    if restore:
        # Turn each entity back into the escaped character itself.
        for (ch, rep) in replacements:
            text = text.replace(rep, ch[1])
    else:
        for (ch, rep) in replacements:
            text = text.replace(ch, rep)

    return text


def tokenize(text):
    """
    Parameter:  String containing HTML markup.
    Returns:    An iterator that yields the tokens comprising the input
                string. Each token is either a tag (possibly with nested
                tags contained therein, such as <a href="<MTFoo>">), or a
                run of text between tags. Each yielded element is a
                two-element tuple; the first is either 'tag' or 'text';
                the second is the actual value.

    Based on the _tokenize() subroutine from Brad Choate's MTRegex plugin.
        <http://www.bradchoate.com/past/mtregex.php>
    """

    pos = 0
    length = len(text)
    # tokens = []

    depth = 6
    nested_tags = "|".join(['(?:<(?:[^<>]',] * depth) + (')*>)' * depth)
    #match = r"""(?: <! ( -- .*? -- \s* )+ > ) |  # comments
    #        (?: <\? .*? \?> ) |  # directives
    #        %s  # nested tags       """ % (nested_tags,)
    tag_soup = re.compile(r"""([^<]*)(<[^>]*>)""")

    token_match = tag_soup.search(text)

    previous_end = 0
    while token_match is not None:
        if token_match.group(1):
            yield ('text', token_match.group(1))

        yield ('tag', token_match.group(2))

        previous_end = token_match.end()
        token_match = tag_soup.search(text, token_match.end())

    if previous_end < len(text):
        yield ('text', text[previous_end:])



if __name__ == "__main__":

    import locale

    try:
        locale.setlocale(locale.LC_ALL, '')
    except locale.Error:
        pass

    from docutils.core import publish_string
    docstring_html = publish_string(__doc__, writer_name='html5')

    # The parenthesized form works as a statement and as a function call.
    print(docstring_html)


    # Unit test output goes to stderr.
    import unittest
    sp = smartyPants

    class TestSmartypantsAllAttributes(unittest.TestCase):
        # the default attribute is "1", which means "all".

        def test_dates(self):
            self.assertEqual(sp("1440-80's"), u"1440-80’s")
            self.assertEqual(sp("1440-'80s"), u"1440-’80s")
            self.assertEqual(sp("1440---'80s"), u"1440–’80s")
            self.assertEqual(sp("1960's"), u"1960’s")
            self.assertEqual(sp("one two '60s"), u"one two ’60s")
            self.assertEqual(sp("'60s"), u"’60s")

        def test_educated_quotes(self):
            self.assertEqual(sp('''"Isn't this fun?"'''), u'“Isn’t this fun?”')

        def test_html_tags(self):
            text = '<a src="foo">more</a>'
            self.assertEqual(sp(text), text)

    unittest.main()