Smartquotes: correct "educating" of quotes around inline markup.
#!/usr/bin/python
# -*- coding: utf8 -*-

# :Id: $Id$
# :Copyright: © 2010 Günter Milde,
#             original `SmartyPants`_: © 2003 John Gruber
#             smartypants.py:          © 2004, 2007 Chad Miller
# :License: Released under the terms of the `2-Clause BSD license`_, in short:
#
#    Copying and distribution of this file, with or without modification,
#    are permitted in any medium without royalty provided the copyright
#    notices and this notice are preserved.
#    This file is offered as-is, without any warranty.
#
# .. _2-Clause BSD license: http://www.spdx.org/licenses/BSD-2-Clause


r"""
========================
SmartyPants for Docutils
========================

Synopsis
========

Smart-quotes for Docutils.

The original "SmartyPants" is a free web publishing plug-in for Movable Type,
Blosxom, and BBEdit that easily translates plain ASCII punctuation characters
into "smart" typographic punctuation characters.

`smartypants.py` endeavours to be a functional port of
SmartyPants to Python, for use with Pyblosxom_.

`smartquotes.py` is an adaptation of SmartyPants to Docutils_. By using Unicode
characters instead of HTML entities for typographic quotes, it works for any
output format that supports Unicode.

Authors
=======

`John Gruber`_ did all of the hard work of writing this software in Perl for
`Movable Type`_ and almost all of this useful documentation. `Chad Miller`_
ported it to Python to use with Pyblosxom_.
Adapted to Docutils_ by Günter Milde

Additional Credits
==================

Portions of the SmartyPants original work are based on Brad Choate's nifty
MTRegex plug-in. `Brad Choate`_ also contributed a few bits of source code to
this plug-in. Brad Choate is a fine hacker indeed.

`Jeremy Hedley`_ and `Charles Wiltgen`_ deserve mention for exemplary beta
testing of the original SmartyPants.

`Rael Dornfest`_ ported SmartyPants to Blosxom.

.. _Brad Choate: http://bradchoate.com/
.. _Jeremy Hedley: http://antipixel.com/
.. _Charles Wiltgen: http://playbacktime.com/
.. _Rael Dornfest: http://raelity.org/


Copyright and License
=====================

SmartyPants_ license (3-Clause BSD license):

  Copyright (c) 2003 John Gruber (http://daringfireball.net/)
  All rights reserved.

  Redistribution and use in source and binary forms, with or without
  modification, are permitted provided that the following conditions are
  met:

  * Redistributions of source code must retain the above copyright
    notice, this list of conditions and the following disclaimer.

  * Redistributions in binary form must reproduce the above copyright
    notice, this list of conditions and the following disclaimer in
    the documentation and/or other materials provided with the
    distribution.

  * Neither the name "SmartyPants" nor the names of its contributors
    may be used to endorse or promote products derived from this
    software without specific prior written permission.

  This software is provided by the copyright holders and contributors
  "as is" and any express or implied warranties, including, but not
  limited to, the implied warranties of merchantability and fitness for
  a particular purpose are disclaimed. In no event shall the copyright
  owner or contributors be liable for any direct, indirect, incidental,
  special, exemplary, or consequential damages (including, but not
  limited to, procurement of substitute goods or services; loss of use,
  data, or profits; or business interruption) however caused and on any
  theory of liability, whether in contract, strict liability, or tort
  (including negligence or otherwise) arising in any way out of the use
  of this software, even if advised of the possibility of such damage.

smartypants.py license (2-Clause BSD license):

  smartypants.py is a derivative work of SmartyPants.

  Redistribution and use in source and binary forms, with or without
  modification, are permitted provided that the following conditions are
  met:

  * Redistributions of source code must retain the above copyright
    notice, this list of conditions and the following disclaimer.

  * Redistributions in binary form must reproduce the above copyright
    notice, this list of conditions and the following disclaimer in
    the documentation and/or other materials provided with the
    distribution.

  This software is provided by the copyright holders and contributors
  "as is" and any express or implied warranties, including, but not
  limited to, the implied warranties of merchantability and fitness for
  a particular purpose are disclaimed. In no event shall the copyright
  owner or contributors be liable for any direct, indirect, incidental,
  special, exemplary, or consequential damages (including, but not
  limited to, procurement of substitute goods or services; loss of use,
  data, or profits; or business interruption) however caused and on any
  theory of liability, whether in contract, strict liability, or tort
  (including negligence or otherwise) arising in any way out of the use
  of this software, even if advised of the possibility of such damage.

.. _John Gruber: http://daringfireball.net/
.. _Chad Miller: http://web.chad.org/

.. _Pyblosxom: http://pyblosxom.bluesock.org/
.. _SmartyPants: http://daringfireball.net/projects/smartypants/
.. _Movable Type: http://www.movabletype.org/
.. _2-Clause BSD license: http://www.spdx.org/licenses/BSD-2-Clause
.. _Docutils: http://docutils.sf.net/

Description
===========

SmartyPants can perform the following transformations:

- Straight quotes ( " and ' ) into "curly" quote characters
- Backticks-style quotes (\`\`like this'') into "curly" quote characters
- Dashes (``--`` and ``---``) into en- and em-dash entities
- Three consecutive dots (``...`` or ``. . .``) into an ellipsis entity

This means you can write, edit, and save your posts using plain old
ASCII straight quotes, plain dashes, and plain dots, but your published
posts (and final HTML output) will appear with smart quotes, em-dashes,
and proper ellipses.

SmartyPants does not modify characters within ``<pre>``, ``<code>``, ``<kbd>``,
``<math>`` or ``<script>`` tag blocks. Typically, these tags are used to
display text where smart quotes and other "smart punctuation" would not be
appropriate, such as source code or example markup.
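
As a rough illustration of the dash and ellipsis transformations listed
above, here is a minimal standalone sketch. The helper name
``educate_punctuation`` is hypothetical and not part of this module, and the
sketch assumes the "old-school" dash convention (``--`` for an en-dash,
``---`` for an em-dash); it does not handle quotes or tag skipping.

```python
import re

EN_DASH = u"\u2013"   # EN DASH
EM_DASH = u"\u2014"   # EM DASH
ELLIPSIS = u"\u2026"  # HORIZONTAL ELLIPSIS


def educate_punctuation(text):
    # Hypothetical helper: replace the longest pattern first, so that
    # "---" is not consumed as "--" followed by a stray "-".
    text = text.replace("---", EM_DASH)
    text = text.replace("--", EN_DASH)
    # Both "..." and ". . ." count as an ellipsis.
    return re.sub(r"\.\.\.|\. \. \.", ELLIPSIS, text)


print(educate_punctuation("pages 3--5 --- wait..."))
```

The order of the two dash replacements matters: swapping them would turn
every ``---`` into an en-dash followed by a hyphen.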


Backslash Escapes
=================

If you need to use literal straight quotes (or plain hyphens and
periods), SmartyPants accepts the following backslash escape sequences
to force non-smart punctuation. It does so by transforming the escape
sequence into a character:

========  =====  =========
Escape    Value  Character
========  =====  =========
``\\\\``  &#92;  \\
\\"       &#34;  "
\\'       &#39;  '
\\.       &#46;  .
\\-       &#45;  \-
\\`       &#96;  \`
========  =====  =========

This is useful, for example, when you want to use straight quotes as
foot and inch marks: 6'2" tall; a 17" iMac.
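
The escape table above can be sketched as a single regular-expression pass.
This is an illustrative standalone version, not the module's own
implementation (which applies one substitution per escape); the names
``ENTITY`` and ``process_escapes`` are assumptions made for this example.

```python
import re

# Escaped character -> numeric character reference, as in the table above.
ENTITY = {"\\": "&#92;", '"': "&#34;", "'": "&#39;",
          ".": "&#46;", "-": "&#45;", "`": "&#96;"}


def process_escapes(text):
    # Hypothetical helper: a backslash followed by one of the six
    # escapable characters is replaced by the matching reference.
    return re.sub(r"\\([\\\"'.\-`])", lambda m: ENTITY[m.group(1)], text)


print(process_escapes('6\\\'2\\" tall'))
```

Because the quote characters are replaced by references before any quote
"education" runs, they are no longer candidates for curling.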

Options
=======

For Pyblosxom users, the ``smartypants_attributes`` attribute is where you
specify configuration options.

Numeric values are the easiest way to configure SmartyPants' behavior:

"0"
    Suppress all transformations. (Do nothing.)

"1"
    Performs default SmartyPants transformations: quotes (including
    \`\`backticks'' -style), em-dashes, and ellipses. "``--``" (dash dash)
    is used to signify an em-dash; there is no support for en-dashes.

"2"
    Same as smarty_pants="1", except that it uses the old-school typewriter
    shorthand for dashes: "``--``" (dash dash) for en-dashes, "``---``"
    (dash dash dash) for em-dashes.

"3"
    Same as smarty_pants="2", but inverts the shorthand for dashes:
    "``--``" (dash dash) for em-dashes, and "``---``" (dash dash dash) for
    en-dashes.

"-1"
    Stupefy mode. Reverses the SmartyPants transformation process, turning
    the characters produced by SmartyPants into their ASCII equivalents.
    E.g. "“" is turned into a simple double-quote ("), "—" is
    turned into two dashes, etc.


The following single-character attribute values can be combined to toggle
individual transformations from within the smarty_pants attribute. For
example, to educate normal quotes and em-dashes, but not ellipses or
\`\`backticks'' -style quotes:

``py['smartypants_attributes'] = "qd"``

"q"
    Educates normal quote characters: (") and (').

"b"
    Educates \`\`backticks'' -style double quotes.

"B"
    Educates \`\`backticks'' -style double quotes and \`single' quotes.

"d"
    Educates em-dashes.

"D"
    Educates em-dashes and en-dashes, using old-school typewriter shorthand:
    (dash dash) for en-dashes, (dash dash dash) for em-dashes.

"i"
    Educates em-dashes and en-dashes, using inverted old-school typewriter
    shorthand: (dash dash) for em-dashes, (dash dash dash) for en-dashes.

"e"
    Educates ellipses.

"w"
    Translates any instance of ``&quot;`` into a normal double-quote character.
    This should be of no interest to most people, but of particular interest
    to anyone who writes their posts using Dreamweaver, as Dreamweaver
    inexplicably uses this entity to represent a literal double-quote
    character. SmartyPants only educates normal quotes, not entities (because
    ordinarily, entities are used for the explicit purpose of representing the
    specific character they represent). The "w" option must be used in
    conjunction with one (or both) of the other quote options ("q" or "b").
    Thus, if you wish to apply all SmartyPants transformations (quotes, en-
    and em-dashes, and ellipses) and also translate ``&quot;`` entities into
    regular quotes so SmartyPants can educate them, you should pass the
    following to the smarty_pants attribute:
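
How an attribute string maps onto individual switches can be sketched as
follows. This is a simplified illustration: the function name ``parse_attr``
and the dictionary keys are hypothetical, and the sketch only mirrors the
numeric values ("0", "1", "2", "3", "-1") and option letters (q, b, B, d, D,
i, e, w) described in this section. The string "qDew" in the usage line is
one possible combination chosen for the example, not a value taken from the
text.

```python
def parse_attr(attr):
    # Hypothetical helper: decode an attribute string into option flags.
    flags = {"quotes": False, "backticks": 0, "dashes": 0,
             "ellipses": False, "convert_quot": False, "stupefy": False}
    if attr == "0":                      # do nothing
        return flags
    if attr == "-1":                     # stupefy mode
        flags["stupefy"] = True
        return flags
    if attr in ("1", "2", "3"):          # everything on; digit picks dash style
        flags.update(quotes=True, backticks=1, dashes=int(attr), ellipses=True)
        return flags
    flags["quotes"] = "q" in attr
    if "b" in attr:
        flags["backticks"] = 1           # double backtick quotes only
    if "B" in attr:
        flags["backticks"] = 2           # double and single backtick quotes
    for level, letter in enumerate("dDi", 1):
        if letter in attr:
            flags["dashes"] = level      # 1 plain, 2 old-school, 3 inverted
    flags["ellipses"] = "e" in attr
    flags["convert_quot"] = "w" in attr
    return flags


print(parse_attr("qDew"))
```

Keeping the dash style as a small integer rather than three booleans makes it
impossible to request two dash conventions at once; if several dash letters
are given, the last one checked wins.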


Caveats
=======

Why You Might Not Want to Use Smart Quotes in Your Weblog
---------------------------------------------------------

For one thing, you might not care.

Most normal, mentally stable individuals do not take notice of proper
typographic punctuation. Many design and typography nerds, however, break
out in a nasty rash when they encounter, say, a restaurant sign that uses
a straight apostrophe to spell "Joe's".

If you're the sort of person who just doesn't care, you might well want to
continue not caring. Using straight quotes -- and sticking to the 7-bit
ASCII character set in general -- is certainly a simpler way to live.

Even if you *do* care about accurate typography, you still might want to
think twice before educating the quote characters in your weblog. One side
effect of publishing curly quote characters is that it makes your
weblog a bit harder for others to quote from using copy-and-paste. What
happens is that when someone copies text from your blog, the copied text
contains the 8-bit curly quote characters (as well as the 8-bit characters
for em-dashes and ellipses, if you use these options). These characters
are not standard across different text encoding methods, which is why they
need to be encoded as characters.

People copying text from your weblog, however, may not notice that you're
using curly quotes, and they'll go ahead and paste the unencoded 8-bit
characters copied from their browser into an email message or their own
weblog. When pasted as raw "smart quotes", these characters are likely to
get mangled beyond recognition.

That said, my own opinion is that any decent text editor or email client
makes it easy to stupefy smart quote characters into their 7-bit
equivalents, and I don't consider it my problem if you're using an
indecent text editor or email client.


Algorithmic Shortcomings
------------------------

One situation in which quotes will get curled the wrong way is when
apostrophes are used at the start of leading contractions. For example:

``'Twas the night before Christmas.``

In the case above, SmartyPants will turn the apostrophe into an opening
single-quote, when in fact it should be a closing one. I don't think
this problem can be solved in the general case -- every word processor
I've tried gets this wrong as well. In such cases, it's best to use the
proper character for closing single-quotes (``’``) by hand.
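
The shortcoming can be reproduced with a deliberately simplified heuristic.
The sketch below is not the regular expression used by this module; the
function name ``curl_single_quotes`` is hypothetical, and the rule (a quote
at the start of the text or after whitespace, followed by a word character,
is an opening quote) is an assumption chosen to show why a leading
contraction is indistinguishable from an opening quote.

```python
import re

LSQUO = u"\u2018"  # LEFT SINGLE QUOTATION MARK
RSQUO = u"\u2019"  # RIGHT SINGLE QUOTATION MARK


def curl_single_quotes(text):
    # Hypothetical helper: a quote at the start or after whitespace that
    # is followed by a word character is treated as an opening quote ...
    text = re.sub(r"(^|\s)'(?=\w)", lambda m: m.group(1) + LSQUO, text)
    # ... and every remaining straight quote as a closing one.
    return text.replace("'", RSQUO)


print(curl_single_quotes("'Twas the night before Christmas."))
print(curl_single_quotes(u"\u2019Twas the night before Christmas."))
```

The first call curls the apostrophe the wrong way; the second shows the
workaround of typing the closing single-quote character by hand, which the
heuristic leaves untouched.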


Version History
===============

1.6.1: 2012-11-06
    - Refactor code, code cleanup,
    - `educate_tokens()` generator as interface for Docutils.

1.6: 2010-08-26
    - Adaptation to Docutils:

      - Use Unicode instead of HTML entities,
      - Remove code special to pyblosxom.

1.5_1.6: Fri, 27 Jul 2007 07:06:40 -0400
    - Fixed bug where blocks of precious unalterable text were instead
      interpreted. Thanks to Le Roux and Dirk van Oosterbosch.

1.5_1.5: Sat, 13 Aug 2005 15:50:24 -0400
    - Fix bogus magical quotation when there is no hint that the
      user wants it, e.g., in "21st century". Thanks to Nathan Hamblen.
    - Be smarter about quotes before terminating numbers in an en-dash'ed
      range.

1.5_1.4: Thu, 10 Feb 2005 20:24:36 -0500
    - Fix a date-processing bug, as reported by jacob childress.
    - Begin a test-suite for ensuring correct output.
    - Removed import of "string", since I didn't really need it.
      (This was my first ever Python program. Sue me!)

1.5_1.3: Wed, 15 Sep 2004 18:25:58 -0400
    - Abort processing if the flavour is in forbidden-list. Default of
      [ "rss" ] (Idea of Wolfgang SCHNERRING.)
    - Remove stray virgules from en-dashes. Patch by Wolfgang SCHNERRING.

1.5_1.2: Mon, 24 May 2004 08:14:54 -0400
    - Some single quotes weren't replaced properly. Diff-tesuji played
      by Benjamin GEIGER.

1.5_1.1: Sun, 14 Mar 2004 14:38:28 -0500
    - Support upcoming pyblosxom 0.9 plugin verification feature.

1.5_1.0: Tue, 09 Mar 2004 08:08:35 -0500
    - Initial release
"""

default_smartypants_attr = "1"


import re

class smart(object):
    """Smart quotes and dashes

    TODO: internationalization, see e.g.
    http://de.wikipedia.org/wiki/Anf%C3%BChrungszeichen#Andere_Sprachen
    """
    endash   = u'–' # "&#8211;" EN DASH
    emdash   = u'—' # "&#8212;" EM DASH
    lquote   = u'‘' # "&#8216;" LEFT SINGLE QUOTATION MARK
    rquote   = u'’' # "&#8217;" RIGHT SINGLE QUOTATION MARK
    #lquote  = u'‚' # "&#8218;" SINGLE LOW-9 QUOTATION MARK (German)
    ldquote  = u'“' # "&#8220;" LEFT DOUBLE QUOTATION MARK
    rdquote  = u'”' # "&#8221;" RIGHT DOUBLE QUOTATION MARK
    #ldquote = u'„' # "&#8222;" DOUBLE LOW-9 QUOTATION MARK (German)
    ellipsis = u'…' # "&#8230;" HORIZONTAL ELLIPSIS

def smartyPants(text, attr=default_smartypants_attr):
    """Main function for "traditional" use."""

    return "".join(educate_tokens(tokenize(text), attr))


def educate_tokens(text_tokens, attr=default_smartypants_attr):
    """Return iterator that "educates" `text_tokens`.
    """

    # Parse attributes:
    # 0 : do nothing
    # 1 : set all
    # 2 : set all, using old school en- and em- dash shortcuts
    # 3 : set all, using inverted old school en and em- dash shortcuts
    #
    # q : quotes
    # b : backtick quotes (``double'' only)
    # B : backtick quotes (``double'' and `single')
    # d : dashes
    # D : old school dashes
    # i : inverted old school dashes
    # e : ellipses
    # w : convert &quot; entities to " for Dreamweaver users

    convert_quot = False  # translate &quot; entities into normal quotes?
    do_dashes = False
    do_backticks = False
    do_quotes = False
    do_ellipses = False
    do_stupefy = False

    if attr == "0":  # Do nothing: pass every token through unchanged.
        for cur_token in text_tokens:
            yield cur_token[1]
        return
    elif attr == "1":  # Do everything, turn all options on.
        do_quotes = True
        do_backticks = True
        do_dashes = 1
        do_ellipses = True
    elif attr == "2":
        # Do everything, turn all options on, use old school dash shorthand.
        do_quotes = True
        do_backticks = True
        do_dashes = 2
        do_ellipses = True
    elif attr == "3":
        # Do everything, use inverted old school dash shorthand.
        do_quotes = True
        do_backticks = True
        do_dashes = 3
        do_ellipses = True
    elif attr == "-1":  # Special "stupefy" mode.
        do_stupefy = True
    else:
        if "q" in attr: do_quotes = True
        if "b" in attr: do_backticks = True
        if "B" in attr: do_backticks = 2
        if "d" in attr: do_dashes = 1
        if "D" in attr: do_dashes = 2
        if "i" in attr: do_dashes = 3
        if "e" in attr: do_ellipses = True
        if "w" in attr: convert_quot = True

    prev_token_last_char = " "
    # Get context around inline mark-up. (Remember the last character of the
    # previous text token, to use as context to curl single-character quote
    # tokens correctly.)

    for cur_token in text_tokens:
        t = cur_token[1]

        # skip HTML and/or XML tags (do not update last character)
        if cur_token[0] == 'tag':
            yield t
            continue

        last_char = t[-1:]  # Remember last char of this token before processing.

        # skip literal text (math, literal, raw, ...)
        if cur_token[0] == 'literal':
            yield t
            continue

        t = processEscapes(t)

        if convert_quot:
            t = re.sub('&quot;', '"', t)

        if do_dashes == 1:
            t = educateDashes(t)
        elif do_dashes == 2:
            t = educateDashesOldSchool(t)
        elif do_dashes == 3:
            t = educateDashesOldSchoolInverted(t)

        if do_ellipses:
            t = educateEllipses(t)

        # Note: backticks need to be processed before quotes.
        if do_backticks:
            t = educateBackticks(t)

        if do_backticks == 2:
            t = educateSingleBackticks(t)

        if do_quotes:
            # Prepend the previous token's last character as context,
            # then drop it again from the result.
            t = educateQuotes(prev_token_last_char+t)[1:]

        if do_stupefy:
            t = stupefyEntities(t)

        # print prev_token_last_char, t.encode('utf8')
        prev_token_last_char = last_char

        yield t



def educateQuotes(text):
    """
    Parameter:  String (unicode or bytes).
    Returns:    The `text`, with "educated" curly quote characters.

    Example input:  "Isn't this fun?"
    Example output: “Isn’t this fun?”
    """

    # oldtext = text
    punct_class = r"""[!"#\$\%'()*+,-.\/:;<=>?\@\[\\\]\^_`{|}~]"""

    # Special case if the very first character is a quote
    # followed by punctuation at a non-word-break. Close the quotes by brute force:
    # (``\B`` is the non-word-boundary assertion; ``\\B`` would match a
    # literal backslash followed by "B".)
    text = re.sub(r"""^'(?=%s\B)""" % (punct_class,), smart.rquote, text)
    text = re.sub(r"""^"(?=%s\B)""" % (punct_class,), smart.rdquote, text)

    # Special case for double sets of quotes, e.g.:
    #   <p>He said, "'Quoted' words in a larger quote."</p>
    text = re.sub(r""""'(?=\w)""", smart.ldquote+smart.lquote, text)
    text = re.sub(r"""'"(?=\w)""", smart.lquote+smart.ldquote, text)

    # Special case for decade abbreviations (the '80s):
    text = re.sub(r"""\b'(?=\d{2}s)""", smart.rquote, text)

    close_class = r"""[^\ \t\r\n\[\{\(\-]"""
    # The "#" is escaped because these alternatives are interpolated into
    # re.VERBOSE patterns, where a bare "#" starts a comment.
    dec_dashes = r"""&\#8211;|&\#8212;"""

    # Get most opening single quotes:
    opening_single_quotes_regex = re.compile(r"""
                    (
                            \s          |   # a whitespace char, or
                            &nbsp;      |   # a non-breaking space entity, or
                            --          |   # dashes, or
                            &[mn]dash;  |   # named dash entities
                            %s          |   # or decimal entities
                            &\#x201[34];    # or hex
                    )
                    '                 # the quote
                    (?=\w)            # followed by a word character
                    """ % (dec_dashes,), re.VERBOSE)
    text = opening_single_quotes_regex.sub(r'\1'+smart.lquote, text)

    closing_single_quotes_regex = re.compile(r"""
                    (%s)
                    '
                    (?!\s | s\b | \d)
                    """ % (close_class,), re.VERBOSE)
    text = closing_single_quotes_regex.sub(r'\1'+smart.rquote, text)

    closing_single_quotes_regex = re.compile(r"""
                    (%s)
                    '
                    (\s | s\b)
                    """ % (close_class,), re.VERBOSE)
    text = closing_single_quotes_regex.sub(r'\1%s\2' % smart.rquote, text)

    # Any remaining single quotes should be opening ones:
    text = re.sub(r"""'""", smart.lquote, text)

    # Get most opening double quotes:
    opening_double_quotes_regex = re.compile(r"""
                    (
                            \s          |   # a whitespace char, or
                            &nbsp;      |   # a non-breaking space entity, or
                            --          |   # dashes, or
                            &[mn]dash;  |   # named dash entities
                            %s          |   # or decimal entities
                            &\#x201[34];    # or hex
                    )
                    "                 # the quote
                    (?=\w)            # followed by a word character
                    """ % (dec_dashes,), re.VERBOSE)
    text = opening_double_quotes_regex.sub(r'\1'+smart.ldquote, text)

    # Double closing quotes:
    closing_double_quotes_regex = re.compile(r"""
                    #(%s)?   # character that indicates the quote should be closing
                    "
                    (?=\s)
                    """ % (close_class,), re.VERBOSE)
    text = closing_double_quotes_regex.sub(smart.rdquote, text)

    closing_double_quotes_regex = re.compile(r"""
                    (%s)   # character that indicates the quote should be closing
                    "
                    """ % (close_class,), re.VERBOSE)
    text = closing_double_quotes_regex.sub(r'\1'+smart.rdquote, text)

    # Any remaining quotes should be opening ones.
    text = re.sub(r'"', smart.ldquote, text)

    return text


def educateBackticks(text):
    """
    Parameter:  String (unicode or bytes).
    Returns:    The `text`, with ``backticks'' -style double quotes
                translated into curly quote characters.
    Example input:  ``Isn't this fun?''
    Example output: “Isn't this fun?”
    """

    text = re.sub(r"""``""", smart.ldquote, text)
    text = re.sub(r"""''""", smart.rdquote, text)
    return text


def educateSingleBackticks(text):
    """
    Parameter:  String (unicode or bytes).
    Returns:    The `text`, with `backticks' -style single quotes
                translated into curly quote characters.

    Example input:  `Isn't this fun?'
    Example output: ‘Isn’t this fun?’
    """

    text = re.sub(r"""`""", smart.lquote, text)
    text = re.sub(r"""'""", smart.rquote, text)
    return text


def educateDashes(text):
    """
    Parameter:  String (unicode or bytes).
    Returns:    The `text`, with each instance of "--" translated to
                an em-dash character.
    """

    text = re.sub(r"""---""", smart.endash, text) # en (yes, backwards)
    text = re.sub(r"""--""", smart.emdash, text) # em (yes, backwards)
    return text


def educateDashesOldSchool(text):
    """
    Parameter:  String (unicode or bytes).
    Returns:    The `text`, with each instance of "--" translated to
                an en-dash character, and each "---" translated to
                an em-dash character.
    """

    text = re.sub(r"""---""", smart.emdash, text) # em (yes, backwards)
    text = re.sub(r"""--""", smart.endash, text) # en (yes, backwards)
    return text


def educateDashesOldSchoolInverted(text):
    """
    Parameter:  String (unicode or bytes).
    Returns:    The `text`, with each instance of "--" translated to
                an em-dash character, and each "---" translated to
                an en-dash character. Two reasons why: First, unlike the
                en- and em-dash syntax supported by
                EducateDashesOldSchool(), it's compatible with existing
                entries written before SmartyPants 1.1, back when "--" was
                only used for em-dashes. Second, em-dashes are more
                common than en-dashes, and so it sort of makes sense that
                the shortcut should be shorter to type. (Thanks to Aaron
                Swartz for the idea.)
    """
    text = re.sub(r"""---""", smart.endash, text) # en
    text = re.sub(r"""--""", smart.emdash, text) # em
    return text



def educateEllipses(text):
    """
    Parameter:  String (unicode or bytes).
    Returns:    The `text`, with each instance of "..." translated to
                an ellipsis character.

    Example input:  Huh...?
    Example output: Huh…?
    """

    text = re.sub(r"""\.\.\.""", smart.ellipsis, text)
    text = re.sub(r"""\. \. \.""", smart.ellipsis, text)
    return text


def stupefyEntities(text):
    """
    Parameter:  String (unicode or bytes).
    Returns:    The `text`, with each SmartyPants character translated to
                its ASCII counterpart.

    Example input:  “Hello — world.”
    Example output: "Hello -- world."
    """

    text = re.sub(smart.endash, "-", text)  # en-dash
    text = re.sub(smart.emdash, "--", text) # em-dash

    text = re.sub(smart.lquote, "'", text)  # open single quote
    text = re.sub(smart.rquote, "'", text)  # close single quote

    text = re.sub(smart.ldquote, '"', text) # open double quote
    text = re.sub(smart.rdquote, '"', text) # close double quote

    text = re.sub(smart.ellipsis, '...', text) # ellipsis

    return text


def processEscapes(text):
    r"""
    Parameter:  String (unicode or bytes).
    Returns:    The `text`, after processing the following backslash
                escape sequences. This is useful if you want to force a "dumb"
                quote or other character to appear.

                Escape  Value
                ------  -----
                \\      &#92;
                \"      &#34;
                \'      &#39;
                \.      &#46;
                \-      &#45;
                \`      &#96;
    """
    text = re.sub(r"""\\\\""", r"""&#92;""", text)
    text = re.sub(r'''\\"''', r"""&#34;""", text)
    text = re.sub(r"""\\'""", r"""&#39;""", text)
    text = re.sub(r"""\\\.""", r"""&#46;""", text)
    text = re.sub(r"""\\-""", r"""&#45;""", text)
    text = re.sub(r"""\\`""", r"""&#96;""", text)

    return text


def tokenize(text):
    """
    Parameter:  String containing HTML markup.
    Returns:    An iterator that yields the tokens comprising the input
                string. Each token is either a tag (possibly with nested
                tags contained therein, such as <a href="<MTFoo>">), or a
                run of text between tags. Each yielded element is a
                two-element tuple; the first is either 'tag' or 'text';
                the second is the actual value.

    Based on the _tokenize() subroutine from Brad Choate's MTRegex plugin.
        <http://www.bradchoate.com/past/mtregex.php>
    """

    pos = 0
    length = len(text)
    # tokens = []

    depth = 6
    nested_tags = "|".join(['(?:<(?:[^<>]',] * depth) + (')*>)' * depth)
    #match = r"""(?: <! ( -- .*? -- \s* )+ > ) |  # comments
    #          (?: <\? .*? \?> ) |  # directives
    #          %s  # nested tags       """ % (nested_tags,)
    tag_soup = re.compile(r"""([^<]*)(<[^>]*>)""")

    token_match = tag_soup.search(text)

    previous_end = 0
    while token_match is not None:
        # Text preceding the tag (may be empty).
        if token_match.group(1):
            yield ('text', token_match.group(1))

        yield ('tag', token_match.group(2))

        previous_end = token_match.end()
        token_match = tag_soup.search(text, token_match.end())

    # Trailing text after the last tag.
    if previous_end < len(text):
        yield ('text', text[previous_end:])



if __name__ == "__main__":

    import locale

    try:
        locale.setlocale(locale.LC_ALL, '')
    except locale.Error:
        pass

    from docutils.core import publish_string
    docstring_html = publish_string(__doc__, writer_name='html')

    print(docstring_html)


    # Unit test output goes out stderr.
    import unittest
    sp = smartyPants

    class TestSmartypantsAllAttributes(unittest.TestCase):
        # the default attribute is "1", which means "all".

        def test_dates(self):
            self.assertEqual(sp("1440-80's"), u"1440-80’s")
            self.assertEqual(sp("1440-'80s"), u"1440-‘80s")
            self.assertEqual(sp("1440---'80s"), u"1440–‘80s")
            self.assertEqual(sp("1960s"), "1960s")  # no effect.
            self.assertEqual(sp("1960's"), u"1960’s")
            self.assertEqual(sp("one two '60s"), u"one two ‘60s")
            self.assertEqual(sp("'60s"), u"‘60s")

        def test_ordinal_numbers(self):
            self.assertEqual(sp("21st century"), "21st century")  # no effect.
            self.assertEqual(sp("3rd"), "3rd")  # no effect.

        def test_educated_quotes(self):
            self.assertEqual(sp('''"Isn't this fun?"'''), u'“Isn’t this fun?”')

        def test_html_tags(self):
            text = '<a src="foo">more</a>'
            self.assertEqual(sp(text), text)

    unittest.main()




__author__ = "Chad Miller <smartypantspy@chad.org>"
__version__ = "1.5_1.6: Fri, 27 Jul 2007 07:06:40 -0400"
__url__ = "http://wiki.chad.org/SmartyPantsPy"
__description__ = "Smart-quotes, smart-ellipses, and smart-dashes for weblog entries in pyblosxom"