Fix smart quotes definition for Swedish (sv).
1 #!/usr/bin/python
2 # -*- coding: utf-8 -*-
4 # :Id: $Id$
5 # :Copyright: © 2010 Günter Milde,
6 # original `SmartyPants`_: © 2003 John Gruber
7 # smartypants.py: © 2004, 2007 Chad Miller
8 # :Maintainer: docutils-develop@lists.sourceforge.net
9 # :License: Released under the terms of the `2-Clause BSD license`_, in short:
11 # Copying and distribution of this file, with or without modification,
12 # are permitted in any medium without royalty provided the copyright
13 # notices and this notice are preserved.
14 # This file is offered as-is, without any warranty.
16 # .. _2-Clause BSD license: http://www.spdx.org/licenses/BSD-2-Clause
19 r"""
20 ========================
21 SmartyPants for Docutils
22 ========================
24 Synopsis
25 ========
27 Smart-quotes for Docutils.
29 The original "SmartyPants" is a free web publishing plug-in for Movable Type,
30 Blosxom, and BBEdit that easily translates plain ASCII punctuation characters
31 into "smart" typographic punctuation characters.
`smartypants.py` endeavours to be a functional port of
SmartyPants to Python, for use with Pyblosxom_.

`smartquotes.py` is an adaptation of SmartyPants to Docutils_. By using Unicode
characters instead of HTML entities for typographic quotes, it works for any
output format that supports Unicode.
40 Authors
41 =======
43 `John Gruber`_ did all of the hard work of writing this software in Perl for
44 `Movable Type`_ and almost all of this useful documentation. `Chad Miller`_
45 ported it to Python to use with Pyblosxom_.
46 Adapted to Docutils_ by Günter Milde.
48 Additional Credits
49 ==================
51 Portions of the SmartyPants original work are based on Brad Choate's nifty
52 MTRegex plug-in. `Brad Choate`_ also contributed a few bits of source code to
53 this plug-in. Brad Choate is a fine hacker indeed.
55 `Jeremy Hedley`_ and `Charles Wiltgen`_ deserve mention for exemplary beta
56 testing of the original SmartyPants.
58 `Rael Dornfest`_ ported SmartyPants to Blosxom.
60 .. _Brad Choate: http://bradchoate.com/
61 .. _Jeremy Hedley: http://antipixel.com/
62 .. _Charles Wiltgen: http://playbacktime.com/
63 .. _Rael Dornfest: http://raelity.org/
66 Copyright and License
67 =====================
69 SmartyPants_ license (3-Clause BSD license):
71 Copyright (c) 2003 John Gruber (http://daringfireball.net/)
72 All rights reserved.
74 Redistribution and use in source and binary forms, with or without
75 modification, are permitted provided that the following conditions are
76 met:
78 * Redistributions of source code must retain the above copyright
79 notice, this list of conditions and the following disclaimer.
81 * Redistributions in binary form must reproduce the above copyright
82 notice, this list of conditions and the following disclaimer in
83 the documentation and/or other materials provided with the
84 distribution.
86 * Neither the name "SmartyPants" nor the names of its contributors
87 may be used to endorse or promote products derived from this
88 software without specific prior written permission.
90 This software is provided by the copyright holders and contributors
91 "as is" and any express or implied warranties, including, but not
92 limited to, the implied warranties of merchantability and fitness for
93 a particular purpose are disclaimed. In no event shall the copyright
94 owner or contributors be liable for any direct, indirect, incidental,
95 special, exemplary, or consequential damages (including, but not
96 limited to, procurement of substitute goods or services; loss of use,
97 data, or profits; or business interruption) however caused and on any
98 theory of liability, whether in contract, strict liability, or tort
99 (including negligence or otherwise) arising in any way out of the use
100 of this software, even if advised of the possibility of such damage.
102 smartypants.py license (2-Clause BSD license):
104 smartypants.py is a derivative work of SmartyPants.
106 Redistribution and use in source and binary forms, with or without
107 modification, are permitted provided that the following conditions are
108 met:
110 * Redistributions of source code must retain the above copyright
111 notice, this list of conditions and the following disclaimer.
113 * Redistributions in binary form must reproduce the above copyright
114 notice, this list of conditions and the following disclaimer in
115 the documentation and/or other materials provided with the
116 distribution.
118 This software is provided by the copyright holders and contributors
119 "as is" and any express or implied warranties, including, but not
120 limited to, the implied warranties of merchantability and fitness for
121 a particular purpose are disclaimed. In no event shall the copyright
122 owner or contributors be liable for any direct, indirect, incidental,
123 special, exemplary, or consequential damages (including, but not
124 limited to, procurement of substitute goods or services; loss of use,
125 data, or profits; or business interruption) however caused and on any
126 theory of liability, whether in contract, strict liability, or tort
127 (including negligence or otherwise) arising in any way out of the use
128 of this software, even if advised of the possibility of such damage.
130 .. _John Gruber: http://daringfireball.net/
131 .. _Chad Miller: http://web.chad.org/
133 .. _Pyblosxom: http://pyblosxom.bluesock.org/
134 .. _SmartyPants: http://daringfireball.net/projects/smartypants/
135 .. _Movable Type: http://www.movabletype.org/
136 .. _2-Clause BSD license: http://www.spdx.org/licenses/BSD-2-Clause
137 .. _Docutils: http://docutils.sf.net/
139 Description
140 ===========
142 SmartyPants can perform the following transformations:
144 - Straight quotes ( " and ' ) into "curly" quote characters
145 - Backticks-style quotes (\`\`like this'') into "curly" quote characters
146 - Dashes (``--`` and ``---``) into en- and em-dash entities
147 - Three consecutive dots (``...`` or ``. . .``) into an ellipsis entity
149 This means you can write, edit, and save your posts using plain old
150 ASCII straight quotes, plain dashes, and plain dots, but your published
151 posts (and final HTML output) will appear with smart quotes, em-dashes,
152 and proper ellipses.
154 SmartyPants does not modify characters within ``<pre>``, ``<code>``, ``<kbd>``,
155 ``<math>`` or ``<script>`` tag blocks. Typically, these tags are used to
156 display text where smart quotes and other "smart punctuation" would not be
157 appropriate, such as source code or example markup.
160 Backslash Escapes
161 =================
163 If you need to use literal straight quotes (or plain hyphens and
164 periods), SmartyPants accepts the following backslash escape sequences
165 to force non-smart punctuation. It does so by transforming the escape
166 sequence into a character:
========  =====  =========
Escape    Value  Character
========  =====  =========
``\\\\``  &#92;  \\
\\"       &#34;  "
\\'       &#39;  '
\\.       &#46;  .
\\-       &#45;  \-
\\`       &#96;  \`
========  =====  =========
179 This is useful, for example, when you want to use straight quotes as
180 foot and inch marks: 6\\'2\\" tall; a 17\\" iMac.
182 Options
183 =======
185 For Pyblosxom users, the ``smartypants_attributes`` attribute is where you
186 specify configuration options.
Numeric values are the easiest way to configure SmartyPants' behavior:

"0"
    Suppress all transformations. (Do nothing.)

"1"
    Performs default SmartyPants transformations: quotes (including
    \`\`backticks'' -style), em-dashes, and ellipses. "``--``" (dash dash)
    is used to signify an em-dash; there is no support for en-dashes.

"2"
    Same as smarty_pants="1", except that it uses the old-school typewriter
    shorthand for dashes: "``--``" (dash dash) for en-dashes, "``---``"
    (dash dash dash) for em-dashes.

"3"
    Same as smarty_pants="2", but inverts the shorthand for dashes:
    "``--``" (dash dash) for em-dashes, and "``---``" (dash dash dash) for
    en-dashes.

"-1"
    Stupefy mode. Reverses the SmartyPants transformation process, turning
    the characters produced by SmartyPants into their ASCII equivalents.
    E.g. "“" is turned into a simple double-quote (\"), "—" is
    turned into two dashes, etc.
The following single-character attribute values can be combined to toggle
individual transformations from within the smarty_pants attribute. For
example, to educate normal quotes and em-dashes, but not ellipses or
\`\`backticks'' -style quotes:

``py['smartypants_attributes'] = "qd"``

"q"
    Educates normal quote characters: (") and (').

"b"
    Educates \`\`backticks'' -style double quotes.

"B"
    Educates \`\`backticks'' -style double quotes and \`single' quotes.

"d"
    Educates em-dashes.

"D"
    Educates em-dashes and en-dashes, using old-school typewriter shorthand:
    (dash dash) for en-dashes, (dash dash dash) for em-dashes.

"i"
    Educates em-dashes and en-dashes, using inverted old-school typewriter
    shorthand: (dash dash) for em-dashes, (dash dash dash) for en-dashes.

"e"
    Educates ellipses.
"w"
    Translates any instance of ``&quot;`` into a normal double-quote character.
    This should be of no interest to most people, but of particular interest
    to anyone who writes their posts using Dreamweaver, as Dreamweaver
    inexplicably uses this entity to represent a literal double-quote
    character. SmartyPants only educates normal quotes, not entities (because
    ordinarily, entities are used for the explicit purpose of representing the
    specific character they represent). The "w" option must be used in
    conjunction with one (or both) of the other quote options ("q" or "b").
    Thus, if you wish to apply all SmartyPants transformations (quotes, en-
    and em-dashes, and ellipses) and also translate ``&quot;`` entities into
    regular quotes so SmartyPants can educate them, you should pass the
    following to the smarty_pants attribute:

    ``py['smartypants_attributes'] = "qDew"``
260 Caveats
261 =======
263 Why You Might Not Want to Use Smart Quotes in Your Weblog
264 ---------------------------------------------------------
266 For one thing, you might not care.
268 Most normal, mentally stable individuals do not take notice of proper
269 typographic punctuation. Many design and typography nerds, however, break
270 out in a nasty rash when they encounter, say, a restaurant sign that uses
271 a straight apostrophe to spell "Joe's".
273 If you're the sort of person who just doesn't care, you might well want to
274 continue not caring. Using straight quotes -- and sticking to the 7-bit
275 ASCII character set in general -- is certainly a simpler way to live.
Even if you *do* care about accurate typography, you still might want to
278 think twice before educating the quote characters in your weblog. One side
279 effect of publishing curly quote characters is that it makes your
280 weblog a bit harder for others to quote from using copy-and-paste. What
281 happens is that when someone copies text from your blog, the copied text
282 contains the 8-bit curly quote characters (as well as the 8-bit characters
283 for em-dashes and ellipses, if you use these options). These characters
284 are not standard across different text encoding methods, which is why they
285 need to be encoded as characters.
287 People copying text from your weblog, however, may not notice that you're
288 using curly quotes, and they'll go ahead and paste the unencoded 8-bit
289 characters copied from their browser into an email message or their own
290 weblog. When pasted as raw "smart quotes", these characters are likely to
291 get mangled beyond recognition.
293 That said, my own opinion is that any decent text editor or email client
294 makes it easy to stupefy smart quote characters into their 7-bit
295 equivalents, and I don't consider it my problem if you're using an
296 indecent text editor or email client.
299 Algorithmic Shortcomings
300 ------------------------
302 One situation in which quotes will get curled the wrong way is when
303 apostrophes are used at the start of leading contractions. For example:
305 ``'Twas the night before Christmas.``
307 In the case above, SmartyPants will turn the apostrophe into an opening
308 single-quote, when in fact it should be a closing one. I don't think
309 this problem can be solved in the general case -- every word processor
310 I've tried gets this wrong as well. In such cases, it's best to use the
311 proper character for closing single-quotes (``’``) by hand.
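
The limitation can be reproduced with a deliberately simplified rule. The
sketch below is not the module's code: ``naive_single_quotes`` is a
hypothetical helper which assumes that a quote at the start of the text or
after whitespace opens, and that any other quote closes.

```python
import re

OPEN_SINGLE = u"\u2018"   # LEFT SINGLE QUOTATION MARK
CLOSE_SINGLE = u"\u2019"  # RIGHT SINGLE QUOTATION MARK

def naive_single_quotes(text):
    # Hypothetical, simplified rule: a quote at the start of the text or
    # after whitespace opens, any other quote closes.
    text = re.sub(r"(^|\s)'(?=\w)",
                  lambda m: m.group(1) + OPEN_SINGLE, text)
    return text.replace("'", CLOSE_SINGLE)

# The leading apostrophe is (wrongly) curled into an opening quote.
curled = naive_single_quotes("'Twas the night before Christmas.")
assert curled.startswith(OPEN_SINGLE)

# Typing the closing quote by hand sidesteps the heuristic entirely.
by_hand = naive_single_quotes(u"\u2019Twas the night before Christmas.")
assert by_hand.startswith(CLOSE_SINGLE)
```

Because the hand-typed character is not an ASCII quote, no rule touches it.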
314 Version History
315 ===============
1.7: 2012-11-19
318 - Internationalization: language-dependent quotes.
320 1.6.1: 2012-11-06
321 - Refactor code, code cleanup,
322 - `educate_tokens()` generator as interface for Docutils.
324 1.6: 2010-08-26
325 - Adaption to Docutils:
326 - Use Unicode instead of HTML entities,
327 - Remove code special to pyblosxom.
329 1.5_1.6: Fri, 27 Jul 2007 07:06:40 -0400
330 - Fixed bug where blocks of precious unalterable text was instead
331 interpreted. Thanks to Le Roux and Dirk van Oosterbosch.
333 1.5_1.5: Sat, 13 Aug 2005 15:50:24 -0400
334 - Fix bogus magical quotation when there is no hint that the
335 user wants it, e.g., in "21st century". Thanks to Nathan Hamblen.
336 - Be smarter about quotes before terminating numbers in an en-dash'ed
337 range.
339 1.5_1.4: Thu, 10 Feb 2005 20:24:36 -0500
340 - Fix a date-processing bug, as reported by jacob childress.
341 - Begin a test-suite for ensuring correct output.
342 - Removed import of "string", since I didn't really need it.
      (This was my first ever Python program. Sue me!)
345 1.5_1.3: Wed, 15 Sep 2004 18:25:58 -0400
346 - Abort processing if the flavour is in forbidden-list. Default of
347 [ "rss" ] (Idea of Wolfgang SCHNERRING.)
348 - Remove stray virgules from en-dashes. Patch by Wolfgang SCHNERRING.
350 1.5_1.2: Mon, 24 May 2004 08:14:54 -0400
351 - Some single quotes weren't replaced properly. Diff-tesuji played
352 by Benjamin GEIGER.
354 1.5_1.1: Sun, 14 Mar 2004 14:38:28 -0500
355 - Support upcoming pyblosxom 0.9 plugin verification feature.
357 1.5_1.0: Tue, 09 Mar 2004 08:08:35 -0500
    - Initial release
"""

default_smartypants_attr = "1"
364 import re
class smartchars(object):
    """Smart quotes and dashes"""
370 endash = u'–' # "&#8211;" EN DASH
371 emdash = u'—' # "&#8212;" EM DASH
372 ellipsis = u'…' # "&#8230;" HORIZONTAL ELLIPSIS
374 # quote characters (language-specific, set in __init__())
376 # English smart quotes (open primary, close primary, open secondary, close
377 # secondary) are:
378 # opquote = u'“' # "&#8220;" LEFT DOUBLE QUOTATION MARK
379 # cpquote = u'”' # "&#8221;" RIGHT DOUBLE QUOTATION MARK
380 # osquote = u'‘' # "&#8216;" LEFT SINGLE QUOTATION MARK
381 # csquote = u'’' # "&#8217;" RIGHT SINGLE QUOTATION MARK
382 # For other languages see:
383 # http://en.wikipedia.org/wiki/Non-English_usage_of_quotation_marks
384 # http://de.wikipedia.org/wiki/Anf%C3%BChrungszeichen#Andere_Sprachen
385 quotes = {'af': u'“”‘’',
386 'af-x-altquot': u'„”‚’',
387 'ca': u'«»“”',
388 'ca-x-altquot': u'“”‘’',
389 'cs': u'„“‚‘',
390 'cs-x-altquot': u'»«›‹',
391 'da': u'»«‘’',
392 'da-x-altquot': u'„“‚‘',
393 'de': u'„“‚‘',
394 'de-x-altquot': u'»«›‹',
395 'de-CH': u'«»‹›',
396 'el': u'«»“”',
397 'en': u'“”‘’',
398 'en-UK': u'‘’“”',
399 'eo': u'“”‘’',
400 'es': u'«»“”',
401 'et': u'„“‚‘', # no secondary quote listed in
402 'et-x-altquot': u'»«›‹', # the sources above (wikipedia.org)
403 'eu': u'«»‹›',
404 'es-x-altquot': u'“”‘’',
405 'fi': u'””’’',
406 'fi-x-altquot': u'»»’’',
407 'fr': (u'« ', u' »', u'‹ ', u' ›'), # with narrow no-break space
408 'fr-x-altquot': u'«»‹›', # for use with manually set spaces
409 # 'fr-x-altquot': (u'“ ', u' ”', u'‘ ', u' ’'), # rarely used
410 'fr-CH': u'«»‹›',
411 'gl': u'«»“”',
412 'he': u'”“»«',
413 'he-x-altquot': u'„”‚’',
414 'it': u'«»“”',
415 'it-CH': u'«»‹›',
416 'it-x-altquot': u'“”‘’',
417 'ja': u'「」『』',
418 'lt': u'„“‚‘',
419 'nl': u'“”‘’',
420 'nl-x-altquot': u'„”‚’',
421 'pl': u'„”«»',
422 'pl-x-altquot': u'«»“”',
423 'pt': u'«»“”',
424 'pt-BR': u'“”‘’',
425 'ro': u'„”«»',
426 'ro-x-altquot': u'«»„”',
427 'ru': u'«»„“',
428 'sk': u'„“‚‘',
429 'sk-x-altquot': u'»«›‹',
              'sv': u'””’’',
              'sv-x-altquot': u'»»››',
              # 'sv-x-altquot': u'»«›‹',
              'zh-CN': u'“”‘’',
              'zh-TW': u'「」『』',
             }
438 def __init__(self, language='en'):
439 self.language = language
440 try:
441 (self.opquote, self.cpquote,
442 self.osquote, self.csquote) = self.quotes[language]
443 except KeyError:
444 self.opquote, self.cpquote, self.osquote, self.csquote = u'""\'\''
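
The lookup with its ASCII fallback can be sketched in isolation. This is an illustration under assumptions, not the class above: `QUOTES` is a three-entry excerpt, `lookup_quotes` is a hypothetical helper, and the French entry assumes U+202F (narrow no-break space), following the comment in the table.

```python
# Excerpt of the table. A plain string is unpacked character by character;
# the tuple form allows quotes that consist of more than one character.
QUOTES = {
    'en': u'\u201c\u201d\u2018\u2019',
    'de': u'\u201e\u201c\u201a\u2018',
    'fr': (u'\u00ab\u202f', u'\u202f\u00bb', u'\u2039\u202f', u'\u202f\u203a'),
}

def lookup_quotes(language):
    """Return (open primary, close primary, open secondary, close secondary)."""
    try:
        return tuple(QUOTES[language])
    except KeyError:
        # Unknown language tag: fall back to plain ASCII quotes.
        return ('"', '"', "'", "'")

opquote, cpquote, osquote, csquote = lookup_quotes('de')
```

Both forms unpack to four items, which is why strings and tuples can share one table.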
447 def smartyPants(text, attr=default_smartypants_attr, language='en'):
448 """Main function for "traditional" use."""
450 return "".join([t for t in educate_tokens(tokenize(text),
451 attr, language)])
def educate_tokens(text_tokens, attr=default_smartypants_attr, language='en'):
    """Return iterator that "educates" the items of `text_tokens`."""
458 # Parse attributes:
459 # 0 : do nothing
460 # 1 : set all
461 # 2 : set all, using old school en- and em- dash shortcuts
462 # 3 : set all, using inverted old school en and em- dash shortcuts
464 # q : quotes
465 # b : backtick quotes (``double'' only)
466 # B : backtick quotes (``double'' and `single')
467 # d : dashes
468 # D : old school dashes
469 # i : inverted old school dashes
470 # e : ellipses
471 # w : convert &quot; entities to " for Dreamweaver users
473 convert_quot = False # translate &quot; entities into normal quotes?
474 do_dashes = False
475 do_backticks = False
476 do_quotes = False
477 do_ellipses = False
478 do_stupefy = False
    if attr == "0": # Do nothing.
        pass        # `text` is not defined yet, so just leave all options off
482 elif attr == "1": # Do everything, turn all options on.
483 do_quotes = True
484 do_backticks = True
485 do_dashes = 1
486 do_ellipses = True
487 elif attr == "2":
488 # Do everything, turn all options on, use old school dash shorthand.
489 do_quotes = True
490 do_backticks = True
491 do_dashes = 2
492 do_ellipses = True
493 elif attr == "3":
494 # Do everything, use inverted old school dash shorthand.
495 do_quotes = True
496 do_backticks = True
497 do_dashes = 3
498 do_ellipses = True
499 elif attr == "-1": # Special "stupefy" mode.
500 do_stupefy = True
501 else:
502 if "q" in attr: do_quotes = True
503 if "b" in attr: do_backticks = True
504 if "B" in attr: do_backticks = 2
505 if "d" in attr: do_dashes = 1
506 if "D" in attr: do_dashes = 2
507 if "i" in attr: do_dashes = 3
508 if "e" in attr: do_ellipses = True
509 if "w" in attr: convert_quot = True
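
The attribute parsing above can be sketched as a standalone function. `parse_attr` and the dictionary keys are hypothetical names chosen for illustration; the mapping itself follows the comments and branches shown above.

```python
def parse_attr(attr):
    """Translate an attribute string into a dict of option flags."""
    opts = dict(quotes=False, backticks=False, dashes=False,
                ellipses=False, convert_quot=False, stupefy=False)
    if attr == "0":                      # do nothing
        return opts
    if attr == "-1":                     # stupefy mode
        opts["stupefy"] = True
    elif attr in ("1", "2", "3"):        # presets differ only in dash style
        opts.update(quotes=True, backticks=True,
                    dashes=int(attr), ellipses=True)
    else:                                # single-letter switches
        opts["quotes"] = "q" in attr
        if "b" in attr:
            opts["backticks"] = True
        if "B" in attr:
            opts["backticks"] = 2        # double and single backticks
        for letter, style in (("d", 1), ("D", 2), ("i", 3)):
            if letter in attr:
                opts["dashes"] = style
        opts["ellipses"] = "e" in attr
        opts["convert_quot"] = "w" in attr
    return opts

# Quotes and em-dashes only, no ellipses and no backticks.
options = parse_attr("qd")
```

As in the code above, later switches win when several dash letters are combined.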
511 prev_token_last_char = " "
512 # Last character of the previous text token. Used as
513 # context to curl leading quote characters correctly.
515 for (ttype, text) in text_tokens:
        # skip HTML and/or XML tags as well as empty text tokens
518 # without updating the last character
519 if ttype == 'tag' or not text:
520 yield text
521 continue
523 # skip literal text (math, literal, raw, ...)
524 if ttype == 'literal':
525 prev_token_last_char = text[-1:]
526 yield text
527 continue
529 last_char = text[-1:] # Remember last char before processing.
531 text = processEscapes(text)
533 if convert_quot:
534 text = re.sub('&quot;', '"', text)
536 if do_dashes == 1:
537 text = educateDashes(text)
538 elif do_dashes == 2:
539 text = educateDashesOldSchool(text)
540 elif do_dashes == 3:
541 text = educateDashesOldSchoolInverted(text)
543 if do_ellipses:
544 text = educateEllipses(text)
546 # Note: backticks need to be processed before quotes.
547 if do_backticks:
548 text = educateBackticks(text, language)
550 if do_backticks == 2:
551 text = educateSingleBackticks(text, language)
553 if do_quotes:
554 text = educateQuotes(prev_token_last_char+text, language)[1:]
556 if do_stupefy:
557 text = stupefyEntities(text, language)
559 # Remember last char as context for the next token
560 prev_token_last_char = last_char
562 text = processEscapes(text, restore=True)
564 yield text
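
Why the last character of the previous token is carried along can be shown with a reduced version of the loop. This is a sketch under assumptions: `educate_stream` mirrors only the context handling, and `toy_educate` is a hypothetical stand-in for `educateQuotes()` that knows a single rule.

```python
def toy_educate(text):
    # Hypothetical rule: a double quote at the start or after whitespace
    # opens, any other double quote closes.
    out = []
    for i, ch in enumerate(text):
        if ch == '"':
            opening = i == 0 or text[i - 1].isspace()
            ch = u'\u201c' if opening else u'\u201d'
        out.append(ch)
    return u''.join(out)

def educate_stream(tokens, educate):
    """Yield token texts; apply `educate` to 'text' tokens only."""
    prev_last_char = " "            # context before the very first token
    for ttype, text in tokens:
        if ttype == 'tag' or not text:
            yield text              # tags do not change the context
            continue
        if ttype == 'literal':
            prev_last_char = text[-1:]
            yield text              # literal text passes through verbatim
            continue
        last_char = text[-1:]
        # Prepend the context character, educate, then cut it off again.
        yield educate(prev_last_char + text)[1:]
        prev_last_char = last_char

tokens = [('text', '"Quoted'), ('tag', '<b>'), ('text', 'bold'),
          ('tag', '</b>'), ('text', '" text')]
result = u''.join(educate_stream(tokens, toy_educate))
assert result == u'\u201cQuoted<b>bold</b>\u201d text'

# Without the context, the quote that starts the last token looks like an
# opening quote.
assert toy_educate('" text')[0] == u'\u201c'
```

The second assertion shows the failure the context character prevents: seen in isolation, the final token starts with a quote that would be curled the wrong way.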
def educateQuotes(text, language='en'):
    """
    Parameter:  - text string (unicode or bytes).
                - language (`BCP 47` language tag.)
    Returns:    The `text`, with "educated" curly quote characters.

    Example input:  "Isn't this fun?"
    Example output: “Isn’t this fun?”
    """
578 smart = smartchars(language)
580 # oldtext = text
581 punct_class = r"""[!"#\$\%'()*+,-.\/:;<=>?\@\[\\\]\^_`{|}~]"""
583 # Special case if the very first character is a quote
584 # followed by punctuation at a non-word-break.
585 # Close the quotes by brute force:
586 text = re.sub(r"""^'(?=%s\\B)""" % (punct_class,), smart.csquote, text)
587 text = re.sub(r"""^"(?=%s\\B)""" % (punct_class,), smart.cpquote, text)
589 # Special case for double sets of quotes, e.g.:
590 # <p>He said, "'Quoted' words in a larger quote."</p>
591 text = re.sub(r""""'(?=\w)""", smart.opquote+smart.osquote, text)
592 text = re.sub(r"""'"(?=\w)""", smart.osquote+smart.opquote, text)
594 # Special case for decade abbreviations (the '80s):
595 text = re.sub(r"""\b'(?=\d{2}s)""", smart.csquote, text)
597 close_class = r"""[^\ \t\r\n\[\{\(\-]"""
598 dec_dashes = r"""&#8211;|&#8212;"""
600 # Get most opening single quotes:
    opening_single_quotes_regex = re.compile(r"""
                    (
                            \s          |   # a whitespace char, or
                            &nbsp;      |   # a non-breaking space entity, or
                            --          |   # dashes, or
                            &[mn]dash;  |   # named dash entities
                            %s          |   # or decimal entities
                            &\#x201[34];    # or hex
                    )
                    '                 # the quote
                    (?=\w)            # followed by a word character
                    """ % (dec_dashes,), re.VERBOSE)
    text = opening_single_quotes_regex.sub(r'\1'+smart.osquote, text)
    closing_single_quotes_regex = re.compile(r"""
                    (%s)
                    '
                    (?!\s | s\b | \d)
                    """ % (close_class,), re.VERBOSE)
    text = closing_single_quotes_regex.sub(r'\1'+smart.csquote, text)

    closing_single_quotes_regex = re.compile(r"""
                    (%s)
                    '
                    (\s | s\b)
                    """ % (close_class,), re.VERBOSE)
    text = closing_single_quotes_regex.sub(r'\1%s\2' % smart.csquote, text)
629 # Any remaining single quotes should be opening ones:
630 text = re.sub(r"""'""", smart.osquote, text)
632 # Get most opening double quotes:
    opening_double_quotes_regex = re.compile(r"""
                    (
                            \s          |   # a whitespace char, or
                            &nbsp;      |   # a non-breaking space entity, or
                            --          |   # dashes, or
                            &[mn]dash;  |   # named dash entities
                            %s          |   # or decimal entities
                            &\#x201[34];    # or hex
                    )
                    "                 # the quote
                    (?=\w)            # followed by a word character
                    """ % (dec_dashes,), re.VERBOSE)
    text = opening_double_quotes_regex.sub(r'\1'+smart.opquote, text)
647 # Double closing quotes:
    closing_double_quotes_regex = re.compile(r"""
                    #(%s)?   # character that indicates the quote should be closing
                    "
                    (?=\s)
                    """ % (close_class,), re.VERBOSE)
    text = closing_double_quotes_regex.sub(smart.cpquote, text)

    closing_double_quotes_regex = re.compile(r"""
                    (%s)   # character that indicates the quote should be closing
                    "
                    """ % (close_class,), re.VERBOSE)
    text = closing_double_quotes_regex.sub(r'\1'+smart.cpquote, text)
661 # Any remaining quotes should be opening ones.
662 text = re.sub(r'"', smart.opquote, text)
664 return text
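
The order of the rules matters: specific patterns run first and the catch-all runs last. The sketch below shows that cascade for double quotes only. `curl_double_quotes` is a hypothetical, reduced version that assumes English quotes and leaves out the entity alternatives.

```python
import re

OPEN_DOUBLE, CLOSE_DOUBLE = u'\u201c', u'\u201d'
# Characters after which a quote is taken to be a closing one:
# anything except whitespace, opening brackets and the hyphen.
CLOSE_CLASS = r"[^ \t\r\n\[{(\-]"

def curl_double_quotes(text):
    # 1. After whitespace (or "--") and before a word character: opening.
    text = re.sub(r'(\s|--)"(?=\w)',
                  lambda m: m.group(1) + OPEN_DOUBLE, text)
    # 2. Followed by whitespace: closing.
    text = re.sub(r'"(?=\s)', CLOSE_DOUBLE, text)
    # 3. Directly after a "closing" character: closing as well.
    text = re.sub(r'(%s)"' % CLOSE_CLASS,
                  lambda m: m.group(1) + CLOSE_DOUBLE, text)
    # 4. Whatever is left is treated as an opening quote.
    return text.replace('"', OPEN_DOUBLE)

curled = curl_double_quotes('She said "hello" to me.')
assert curled == u'She said \u201chello\u201d to me.'
```

Running the catch-all first would turn every quote into an opening one, so the more specific rules would never get a chance to match.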
def educateBackticks(text, language='en'):
    """
    Parameter:  String (unicode or bytes).
    Returns:    The `text`, with ``backticks'' -style double quotes
                translated into curly quote characters.
    Example input:  ``Isn't this fun?''
    Example output: “Isn't this fun?”
    """
675 smart = smartchars(language)
677 text = re.sub(r"""``""", smart.opquote, text)
678 text = re.sub(r"""''""", smart.cpquote, text)
679 return text
def educateSingleBackticks(text, language='en'):
    """
    Parameter:  String (unicode or bytes).
    Returns:    The `text`, with `backticks' -style single quotes
                translated into curly quote characters.

    Example input:  `Isn't this fun?'
    Example output: ‘Isn’t this fun?’
    """
691 smart = smartchars(language)
693 text = re.sub(r"""`""", smart.osquote, text)
694 text = re.sub(r"""'""", smart.csquote, text)
695 return text
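
Both backtick passes can be combined in a short sketch. `curl_backticks` is a hypothetical helper that assumes English quotes by default and uses `str.replace` where the functions above use `re.sub`.

```python
def curl_backticks(text, quotes=(u'\u201c', u'\u201d', u'\u2018', u'\u2019')):
    opquote, cpquote, osquote, csquote = quotes
    # Doubled marks first, otherwise `` would be consumed as two single marks.
    text = text.replace("``", opquote).replace("''", cpquote)
    # Then the single marks. Every remaining ' becomes a closing quote,
    # which also turns the apostrophe in "Isn't" into U+2019.
    return text.replace("`", osquote).replace("'", csquote)

curled = curl_backticks("``Isn't this fun?''")
assert curled == u'\u201cIsn\u2019t this fun?\u201d'
```

Because the second pass rewrites every remaining straight single quote, it has to run before the general quote education; no straight single quotes are left afterwards.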
def educateDashes(text):
    """
    Parameter:  String (unicode or bytes).
    Returns:    The `text`, with each instance of "--" translated to
                an em-dash character and each "---" translated to
                an en-dash character.
    """
705 text = re.sub(r"""---""", smartchars.endash, text) # en (yes, backwards)
706 text = re.sub(r"""--""", smartchars.emdash, text) # em (yes, backwards)
707 return text
def educateDashesOldSchool(text):
    """
    Parameter:  String (unicode or bytes).
    Returns:    The `text`, with each instance of "--" translated to
                an en-dash character, and each "---" translated to
                an em-dash character.
    """
718 text = re.sub(r"""---""", smartchars.emdash, text)
719 text = re.sub(r"""--""", smartchars.endash, text)
720 return text
def educateDashesOldSchoolInverted(text):
    """
    Parameter:  String (unicode or bytes).
    Returns:    The `text`, with each instance of "--" translated to
                an em-dash character, and each "---" translated to
                an en-dash character. Two reasons why: First, unlike the
                en- and em-dash syntax supported by
                educateDashesOldSchool(), it's compatible with existing
                entries written before SmartyPants 1.1, back when "--" was
                only used for em-dashes. Second, em-dashes are more
                common than en-dashes, and so it sort of makes sense that
                the shortcut should be shorter to type. (Thanks to Aaron
                Swartz for the idea.)
    """
    text = re.sub(r"""---""", smartchars.endash, text)    # en
    text = re.sub(r"""--""", smartchars.emdash, text)     # em
739 return text
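
The three dash functions differ only in which dash each hyphen run maps to, and all of them replace the longer run first. A compact sketch, with `educate_dashes` and its `style` parameter as hypothetical names and `str.replace` instead of `re.sub`:

```python
EN_DASH, EM_DASH = u'\u2013', u'\u2014'

# (replacement for "---", replacement for "--") per style, as coded above.
DASH_STYLES = {
    1: (EN_DASH, EM_DASH),   # educateDashes
    2: (EM_DASH, EN_DASH),   # educateDashesOldSchool
    3: (EN_DASH, EM_DASH),   # educateDashesOldSchoolInverted
}

def educate_dashes(text, style=1):
    triple, double = DASH_STYLES[style]
    # The longer run must go first, or "---" would be read as "--" plus "-".
    return text.replace("---", triple).replace("--", double)

old_school = educate_dashes("pages 3--5 --- roughly", style=2)
assert old_school == u'pages 3\u20135 \u2014 roughly'
```

As the table shows, styles 1 and 3 produce the same output in the code as written.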
def educateEllipses(text):
    """
    Parameter:  String (unicode or bytes).
    Returns:    The `text`, with each instance of "..." translated to
                an ellipsis character.

    Example input:  Huh...?
    Example output: Huh…?
    """
753 text = re.sub(r"""\.\.\.""", smartchars.ellipsis, text)
754 text = re.sub(r"""\. \. \.""", smartchars.ellipsis, text)
755 return text
def stupefyEntities(text, language='en'):
    """
    Parameter:  String (unicode or bytes).
    Returns:    The `text`, with each SmartyPants character translated to
                its ASCII counterpart.

    Example input:  “Hello — world.”
    Example output: "Hello -- world."
    """
767 smart = smartchars(language)
769 text = re.sub(smart.endash, "-", text) # en-dash
770 text = re.sub(smart.emdash, "--", text) # em-dash
772 text = re.sub(smart.osquote, "'", text) # open single quote
773 text = re.sub(smart.csquote, "'", text) # close single quote
775 text = re.sub(smart.opquote, '"', text) # open double quote
776 text = re.sub(smart.cpquote, '"', text) # close double quote
778 text = re.sub(smart.ellipsis, '...', text)# ellipsis
780 return text
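
The reverse mapping can be written as a lookup table. This is a sketch that assumes English quotes only; `ASCII_EQUIVALENTS` and `stupefy` are hypothetical names, and plain string replacement stands in for the `re.sub` calls above.

```python
ASCII_EQUIVALENTS = {
    u'\u2013': '-',     # en dash
    u'\u2014': '--',    # em dash
    u'\u2018': "'",     # left single quotation mark
    u'\u2019': "'",     # right single quotation mark
    u'\u201c': '"',     # left double quotation mark
    u'\u201d': '"',     # right double quotation mark
    u'\u2026': '...',   # horizontal ellipsis
}

def stupefy(text):
    for smart, plain in ASCII_EQUIVALENTS.items():
        text = text.replace(smart, plain)
    return text

plain = stupefy(u'\u201cHello \u2014 world.\u201d')
assert plain == '"Hello -- world."'
```

None of the replacements contains a character that is itself a key, so the order of the table entries does not matter.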
def processEscapes(text, restore=False):
    r"""
    Parameter:  String (unicode or bytes).
    Returns:    The `text`, after processing the following backslash
                escape sequences. This is useful if you want to force a "dumb"
                quote or other character to appear. With `restore=True`,
                the values are turned back into the plain characters.

                Escape  Value
                ------  -----
                \\      &#92;
                \"      &#34;
                \'      &#39;
                \.      &#46;
                \-      &#45;
                \`      &#96;
    """
799 replacements = ((r'\\', r'&#92;'),
800 (r'\"', r'&#34;'),
801 (r"\'", r'&#39;'),
802 (r'\.', r'&#46;'),
803 (r'\-', r'&#45;'),
804 (r'\`', r'&#96;'))
805 if restore:
806 for (ch, rep) in replacements:
807 text = text.replace(rep, ch[1])
808 else:
809 for (ch, rep) in replacements:
810 text = text.replace(ch, rep)
812 return text
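
The two directions form a protect/restore pair around the education step: escaped characters are hidden as numeric entities, then put back without their backslash. A minimal sketch, with `protect` and `restore` as hypothetical names for the two modes of the function above:

```python
ESCAPES = (('\\\\', '&#92;'), ('\\"', '&#34;'), ("\\'", '&#39;'),
           ('\\.', '&#46;'), ('\\-', '&#45;'), ('\\`', '&#96;'))

def protect(text):
    # Hide escaped characters behind numeric entities.
    for escape, entity in ESCAPES:
        text = text.replace(escape, entity)
    return text

def restore(text):
    # Put the bare character (without the backslash) back.
    for escape, entity in ESCAPES:
        text = text.replace(entity, escape[1])
    return text

source = r'''6\'2\" tall'''
protected = protect(source)
# No quote characters are left for the quote education to curl.
assert '"' not in protected and "'" not in protected
assert restore(protected) == '6\'2" tall'
```

One consequence of this design is that a numeric entity such as `&#34;` that was already in the input is also turned into a plain character by the restore step.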
def tokenize(text):
    """
    Parameter:  String containing HTML markup.
    Returns:    An iterator that yields the tokens comprising the input
                string. Each token is either a tag (possibly with nested
                tags contained therein, such as <a href="<MTFoo>">), or a
                run of text between tags. Each yielded element is a
                two-element tuple; the first is either 'tag' or 'text';
                the second is the actual value.

    Based on the _tokenize() subroutine from Brad Choate's MTRegex plugin.
        <http://www.bradchoate.com/past/mtregex.php>
    """
829 pos = 0
830 length = len(text)
831 # tokens = []
833 depth = 6
834 nested_tags = "|".join(['(?:<(?:[^<>]',] * depth) + (')*>)' * depth)
835 #match = r"""(?: <! ( -- .*? -- \s* )+ > ) | # comments
836 # (?: <\? .*? \?> ) | # directives
837 # %s # nested tags """ % (nested_tags,)
838 tag_soup = re.compile(r"""([^<]*)(<[^>]*>)""")
840 token_match = tag_soup.search(text)
842 previous_end = 0
843 while token_match is not None:
844 if token_match.group(1):
845 yield ('text', token_match.group(1))
847 yield ('tag', token_match.group(2))
849 previous_end = token_match.end()
850 token_match = tag_soup.search(text, token_match.end())
852 if previous_end < len(text):
853 yield ('text', text[previous_end:])
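
The tokenizer can be tried on its own. The sketch below uses the same `tag_soup` pattern but iterates with `re.finditer` instead of the explicit `search` loop; `split_tokens` is a hypothetical name.

```python
import re

TAG_SOUP = re.compile(r"([^<]*)(<[^>]*>)")

def split_tokens(text):
    """Yield ('text', ...) and ('tag', ...) tuples for an HTML-ish string."""
    previous_end = 0
    for match in TAG_SOUP.finditer(text):
        if match.group(1):
            yield ('text', match.group(1))
        yield ('tag', match.group(2))
        previous_end = match.end()
    # Anything after the last tag is plain text.
    if previous_end < len(text):
        yield ('text', text[previous_end:])

tokens = list(split_tokens('<a src="foo">more</a> text'))
assert tokens == [('tag', '<a src="foo">'), ('text', 'more'),
                  ('tag', '</a>'), ('text', ' text')]
```

The pattern stops a tag at the first `>`, so it does not handle the nested case mentioned in the docstring.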
if __name__ == "__main__":

    import locale

    try:
        locale.setlocale(locale.LC_ALL, '')
    except locale.Error:
        pass

    from docutils.core import publish_string
    docstring_html = publish_string(__doc__, writer_name='html')

    print(docstring_html)
872 # Unit test output goes out stderr.
873 import unittest
874 sp = smartyPants
876 class TestSmartypantsAllAttributes(unittest.TestCase):
877 # the default attribute is "1", which means "all".
        def test_dates(self):
            self.assertEqual(sp("1440-80's"), u"1440-80’s")
            self.assertEqual(sp("1440-'80s"), u"1440-‘80s")
            self.assertEqual(sp("1440---'80s"), u"1440–‘80s")
            self.assertEqual(sp("1960s"), "1960s")  # no effect.
            self.assertEqual(sp("1960's"), u"1960’s")
            self.assertEqual(sp("one two '60s"), u"one two ‘60s")
            self.assertEqual(sp("'60s"), u"‘60s")
888 def test_ordinal_numbers(self):
889 self.assertEqual(sp("21st century"), "21st century") # no effect.
890 self.assertEqual(sp("3rd"), "3rd") # no effect.
892 def test_educated_quotes(self):
893 self.assertEqual(sp('''"Isn't this fun?"'''), u'“Isn’t this fun?”')
895 def test_html_tags(self):
896 text = '<a src="foo">more</a>'
897 self.assertEqual(sp(text), text)
899 unittest.main()
904 __author__ = "Chad Miller <smartypantspy@chad.org>"
905 __version__ = "1.5_1.6: Fri, 27 Jul 2007 07:06:40 -0400"
906 __url__ = "http://wiki.chad.org/SmartyPantsPy"
907 __description__ = "Smart-quotes, smart-ellipses, and smart-dashes for weblog entries in pyblosxom"