Doc/library/difflib.rst

   1 :mod:`difflib` --- Helpers for computing deltas
   2 ===============================================
   3
   4 .. module:: difflib
   5    :synopsis: Helpers for computing differences between objects.
   6 .. moduleauthor:: Tim Peters <tim_one@users.sourceforge.net>
   7 .. sectionauthor:: Tim Peters <tim_one@users.sourceforge.net>
   8 .. Markup by Fred L. Drake, Jr. <fdrake@acm.org>
   9
  10 .. testsetup::
  11
  12    import sys
  13    from difflib import *
  14
  15 .. versionadded:: 2.1
  16
  17 This module provides classes and functions for comparing sequences. It
  18 can be used, for example, to compare files, and can produce difference
  19 information in various formats, including HTML and context and unified
  20 diffs. For comparing directories and files, see also the :mod:`filecmp` module.
  21
  22 .. class:: SequenceMatcher
  23
  24    This is a flexible class for comparing pairs of sequences of any type, so long
  25    as the sequence elements are :term:`hashable`.  The basic algorithm predates, and is a
  26    little fancier than, an algorithm published in the late 1980's by Ratcliff and
  27    Obershelp under the hyperbolic name "gestalt pattern matching."  The idea is to
  28    find the longest contiguous matching subsequence that contains no "junk"
  29    elements (the Ratcliff and Obershelp algorithm doesn't address junk).  The same
  30    idea is then applied recursively to the pieces of the sequences to the left and
  31    to the right of the matching subsequence.  This does not yield minimal edit
  32    sequences, but does tend to yield matches that "look right" to people.
  33
  34    **Timing:** The basic Ratcliff-Obershelp algorithm is cubic time in the worst
  35    case and quadratic time in the expected case. :class:`SequenceMatcher` is
  36    quadratic time for the worst case and has expected-case behavior dependent in a
  37    complicated way on how many elements the sequences have in common; best case
  38    time is linear.
  39
  40
  41 .. class:: Differ
  42
  43    This is a class for comparing sequences of lines of text, and producing
  44    human-readable differences or deltas.  Differ uses :class:`SequenceMatcher`
  45    both to compare sequences of lines, and to compare sequences of characters
  46    within similar (near-matching) lines.
  47
  48    Each line of a :class:`Differ` delta begins with a two-character code:
  49
  50    +----------+-------------------------------------------+
  51    | Code     | Meaning                                   |
  52    +==========+===========================================+
  53    | ``'- '`` | line unique to sequence 1                 |
  54    +----------+-------------------------------------------+
  55    | ``'+ '`` | line unique to sequence 2                 |
  56    +----------+-------------------------------------------+
  57    | ``'  '`` | line common to both sequences             |
  58    +----------+-------------------------------------------+
  59    | ``'? '`` | line not present in either input sequence |
  60    +----------+-------------------------------------------+
  61
  62    Lines beginning with '``?``' attempt to guide the eye to intraline differences,
  63    and were not present in either input sequence. These lines can be confusing if
  64    the sequences contain tab characters.
  65
  66
  67 .. class:: HtmlDiff
  68
  69    This class can be used to create an HTML table (or a complete HTML file
  70    containing the table) showing a side by side, line by line comparison of text
  71    with inter-line and intra-line change highlights.  The table can be generated in
  72    either full or contextual difference mode.
  73
  74    The constructor for this class is:
  75
  76
  77    .. function:: __init__([tabsize][, wrapcolumn][, linejunk][, charjunk])
  78
  79       Initializes an instance of :class:`HtmlDiff`.
  80
  81       *tabsize* is an optional keyword argument to specify tab stop spacing and
  82       defaults to ``8``.
  83
  84       *wrapcolumn* is an optional keyword argument specifying the column at which
  85       lines are broken and wrapped; it defaults to ``None``, meaning no wrapping.
  86
  87       *linejunk* and *charjunk* are optional keyword arguments passed into ``ndiff()``
  88       (used by :class:`HtmlDiff` to generate the side by side HTML differences).  See
  89       ``ndiff()`` documentation for argument default values and descriptions.
  90
  91    The following methods are public:
  92
  93
  94    .. function:: make_file(fromlines, tolines [, fromdesc][, todesc][, context][, numlines])
  95
  96       Compares *fromlines* and *tolines* (lists of strings) and returns a string which
  97       is a complete HTML file containing a table showing line by line differences with
  98       inter-line and intra-line changes highlighted.
  99
 100       *fromdesc* and *todesc* are optional keyword arguments to specify from/to file
 101       column header strings (both default to an empty string).
 102
 103       *context* and *numlines* are both optional keyword arguments. Set *context* to
 104       ``True`` when contextual differences are to be shown, else the default is
 105       ``False`` to show the full files. *numlines* defaults to ``5``.  When *context*
 106       is ``True`` *numlines* controls the number of context lines which surround the
 107       difference highlights.  When *context* is ``False`` *numlines* controls the
 108       number of lines which are shown before a difference highlight when using the
 109       "next" hyperlinks (setting to zero would cause the "next" hyperlinks to place
 110       the next difference highlight at the top of the browser without any leading
 111       context).
 112
 113
 114    .. function:: make_table(fromlines, tolines [, fromdesc][, todesc][, context][, numlines])
 115
 116       Compares *fromlines* and *tolines* (lists of strings) and returns a string which
 117       is a complete HTML table showing line by line differences with inter-line and
 118       intra-line changes highlighted.
 119
 120       The arguments for this method are the same as those for the :meth:`make_file`
 121       method.
 122
 123    :file:`Tools/scripts/diff.py` is a command-line front-end to this class and
 124    contains a good example of its use.
 125
 126    .. versionadded:: 2.4
 127
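The constructor and method arguments described above might be combined as in
the following minimal sketch. The sample lines, the description strings and
the chosen ``tabsize``, ``wrapcolumn`` and ``numlines`` values are assumptions
made for illustration only.

```python
from difflib import HtmlDiff

# Hypothetical sample input: two short versions of the same text.
fromlines = ['bacon\n', 'eggs\n', 'ham\n', 'guido\n']
tolines = ['python\n', 'eggy\n', 'hamster\n', 'guido\n']

# tabsize and wrapcolumn are the optional keyword arguments of the constructor.
html_diff = HtmlDiff(tabsize=4, wrapcolumn=60)

# Only the HTML table, in contextual mode with one line of context.
table = html_diff.make_table(fromlines, tolines,
                             fromdesc='before', todesc='after',
                             context=True, numlines=1)

# A complete HTML document containing the table, showing the full files.
page = html_diff.make_file(fromlines, tolines,
                           fromdesc='before', todesc='after')
```

Both methods return a single string, so the result can be written to a file
or embedded in a larger page as needed.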
 128
 129 .. function:: context_diff(a, b[, fromfile][, tofile][, fromfiledate][, tofiledate][, n][, lineterm])
 130
 131    Compare *a* and *b* (lists of strings); return a delta (a :term:`generator`
 132    generating the delta lines) in context diff format.
 133
 134    Context diffs are a compact way of showing just the lines that have changed plus
 135    a few lines of context.  The changes are shown in a before/after style.  The
 136    number of context lines is set by *n* which defaults to three.
 137
 138    By default, the diff control lines (those with ``***`` or ``---``) are created
 139    with a trailing newline.  This is helpful so that inputs created from
 140    :func:`file.readlines` result in diffs that are suitable for use with
 141    :func:`file.writelines` since both the inputs and outputs have trailing
 142    newlines.
 143
 144    For inputs that do not have trailing newlines, set the *lineterm* argument to
 145    ``""`` so that the output will be uniformly newline free.
 146
 147    The context diff format normally has a header for filenames and modification
 148    times.  Any or all of these may be specified using strings for *fromfile*,
 149    *tofile*, *fromfiledate*, and *tofiledate*. The modification times are normally
 150    expressed in the format returned by :func:`time.ctime`.  If not specified, the
 151    strings default to blanks.
 152
 153       >>> s1 = ['bacon\n', 'eggs\n', 'ham\n', 'guido\n']
 154       >>> s2 = ['python\n', 'eggy\n', 'hamster\n', 'guido\n']
 155       >>> for line in context_diff(s1, s2, fromfile='before.py', tofile='after.py'):
 156       ...     sys.stdout.write(line)  # doctest: +NORMALIZE_WHITESPACE
 157       *** before.py
 158       --- after.py
 159       ***************
 160       *** 1,4 ****
 161       ! bacon
 162       ! eggs
 163       ! ham
 164         guido
 165       --- 1,4 ----
 166       ! python
 167       ! eggy
 168       ! hamster
 169         guido
 170
 171    See :ref:`difflib-interface` for a more detailed example.
 172
 173    .. versionadded:: 2.3
 174
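The effect of the *n* argument can be seen by comparing the default output
with a delta generated without any context. This is only a sketch that reuses
the sample data from the example above.

```python
from difflib import context_diff

s1 = ['bacon\n', 'eggs\n', 'ham\n', 'guido\n']
s2 = ['python\n', 'eggy\n', 'hamster\n', 'guido\n']

# Default: three lines of context, so the unchanged 'guido' line is included.
with_context = list(context_diff(s1, s2,
                                 fromfile='before.py', tofile='after.py'))

# n=0: only the changed lines are reported.
without_context = list(context_diff(s1, s2,
                                    fromfile='before.py', tofile='after.py',
                                    n=0))
```

Since the function returns a generator, wrapping the call in :func:`list` is
one way to keep the delta around for more than a single pass.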
 175
 176 .. function:: get_close_matches(word, possibilities[, n][, cutoff])
 177
 178    Return a list of the best "good enough" matches.  *word* is a sequence for which
 179    close matches are desired (typically a string), and *possibilities* is a list of
 180    sequences against which to match *word* (typically a list of strings).
 181
 182    Optional argument *n* (default ``3``) is the maximum number of close matches to
 183    return; *n* must be greater than ``0``.
 184
 185    Optional argument *cutoff* (default ``0.6``) is a float in the range [0, 1].
 186    Possibilities that don't score at least that similar to *word* are ignored.
 187
 188    The best (no more than *n*) matches among the possibilities are returned in a
 189    list, sorted by similarity score, most similar first.
 190
 191       >>> get_close_matches('appel', ['ape', 'apple', 'peach', 'puppy'])
 192       ['apple', 'ape']
 193       >>> import keyword
 194       >>> get_close_matches('wheel', keyword.kwlist)
 195       ['while']
 196       >>> get_close_matches('apple', keyword.kwlist)
 197       []
 198       >>> get_close_matches('accept', keyword.kwlist)
 199       ['except']
 200
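The optional *n* and *cutoff* arguments can be sketched as follows. The word
list is the one used above; the cutoff value of ``0.78`` is an arbitrary
choice made for this illustration.

```python
from difflib import get_close_matches

words = ['ape', 'apple', 'peach', 'puppy']

# n=1 keeps only the single best-scoring candidate.
best = get_close_matches('appel', words, n=1)

# A higher cutoff demands closer matches than the default of 0.6.
strict = get_close_matches('appel', words, cutoff=0.78)

# cutoff=0.0 disables the similarity filter, so n alone limits the result
# and every candidate is returned, most similar first.
ranked = get_close_matches('appel', words, n=len(words), cutoff=0.0)
```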
 201
 202 .. function:: ndiff(a, b[, linejunk][, charjunk])
 203
 204    Compare *a* and *b* (lists of strings); return a :class:`Differ`\ -style
 205    delta (a :term:`generator` generating the delta lines).
 206
 207    Optional keyword parameters *linejunk* and *charjunk* are for filter functions
 208    (or ``None``):
 209
 210    *linejunk*: A function that accepts a single string argument, and returns true
 211    if the string is junk, or false if not. The default is ``None``, starting with
 212    Python 2.3.  Before then, the default was the module-level function
 213    :func:`IS_LINE_JUNK`, which filters out lines without visible characters, except
 214    for at most one pound character (``'#'``). As of Python 2.3, the underlying
 215    :class:`SequenceMatcher` class does a dynamic analysis of which lines are so
 216    frequent as to constitute noise, and this usually works better than the pre-2.3
 217    default.
 218
 219    *charjunk*: A function that accepts a character (a string of length 1), and
 220    returns true if the character is junk, or false if not. The default is the module-level
 221    function :func:`IS_CHARACTER_JUNK`, which filters out whitespace characters (a
 222    blank or tab; note: bad idea to include newline in this!).
 223
 224    :file:`Tools/scripts/ndiff.py` is a command-line front-end to this function.
 225
 226       >>> diff = ndiff('one\ntwo\nthree\n'.splitlines(1),
 227       ...              'ore\ntree\nemu\n'.splitlines(1))
 228       >>> print ''.join(diff),
 229       - one
 230       ?  ^
 231       + ore
 232       ?  ^
 233       - two
 234       - three
 235       ?  -
 236       + tree
 237       + emu
 238
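Because every line of the delta starts with one of the :class:`Differ` codes,
a delta can be summarized by tallying those prefixes. The following is a
sketch of that idea using the inputs from the example above; the variable
names are chosen for illustration.

```python
from difflib import ndiff

a = 'one\ntwo\nthree\n'.splitlines(True)
b = 'ore\ntree\nemu\n'.splitlines(True)

# Tally the delta lines by their two-character prefix.
counts = {}
for line in ndiff(a, b):
    code = line[:2]
    counts[code] = counts.get(code, 0) + 1

removed = counts.get('- ', 0)   # lines unique to a
added = counts.get('+ ', 0)     # lines unique to b
common = counts.get('  ', 0)    # lines common to both
hints = counts.get('? ', 0)     # intraline guide lines, in neither input
```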
 239
 240 .. function:: restore(sequence, which)
 241
 242    Return one of the two sequences that generated a delta.
 243
 244    Given a *sequence* produced by :meth:`Differ.compare` or :func:`ndiff`, extract
 245    lines originating from file 1 or 2 (parameter *which*), stripping off line
 246    prefixes.
 247
 248    Example:
 249
 250       >>> diff = ndiff('one\ntwo\nthree\n'.splitlines(1),
 251       ...              'ore\ntree\nemu\n'.splitlines(1))
 252       >>> diff = list(diff) # materialize the generated delta into a list
 253       >>> print ''.join(restore(diff, 1)),
 254       one
 255       two
 256       three
 257       >>> print ''.join(restore(diff, 2)),
 258       ore
 259       tree
 260       emu
 261
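A delta carries enough information to recover both inputs, which the
following sketch checks as a round trip. It assumes the delta has been stored
in a list, since a generator can only be consumed once.

```python
from difflib import ndiff, restore

a = 'one\ntwo\nthree\n'.splitlines(True)
b = 'ore\ntree\nemu\n'.splitlines(True)

# Materialize the generator so that it can be passed to restore() twice.
delta = list(ndiff(a, b))

first = list(restore(delta, 1))    # lines originating from sequence 1
second = list(restore(delta, 2))   # lines originating from sequence 2

round_trip_ok = (first == a and second == b)
```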
 262
 263 .. function:: unified_diff(a, b[, fromfile][, tofile][, fromfiledate][, tofiledate][, n][, lineterm])
 264
 265    Compare *a* and *b* (lists of strings); return a delta (a :term:`generator`
 266    generating the delta lines) in unified diff format.
 267
 268    Unified diffs are a compact way of showing just the lines that have changed plus
 269    a few lines of context.  The changes are shown in an inline style (instead of
 270    separate before/after blocks).  The number of context lines is set by *n* which
 271    defaults to three.
 272
 273    By default, the diff control lines (those with ``---``, ``+++``, or ``@@``) are
 274    created with a trailing newline.  This is helpful so that inputs created from
 275    :func:`file.readlines` result in diffs that are suitable for use with
 276    :func:`file.writelines` since both the inputs and outputs have trailing
 277    newlines.
 278
 279    For inputs that do not have trailing newlines, set the *lineterm* argument to
 280    ``""`` so that the output will be uniformly newline free.
 281
 282    The unified diff format normally has a header for filenames and modification
 283    times.  Any or all of these may be specified using strings for *fromfile*,
 284    *tofile*, *fromfiledate*, and *tofiledate*. The modification times are normally
 285    expressed in the format returned by :func:`time.ctime`.  If not specified, the
 286    strings default to blanks.
 287
 288       >>> s1 = ['bacon\n', 'eggs\n', 'ham\n', 'guido\n']
 289       >>> s2 = ['python\n', 'eggy\n', 'hamster\n', 'guido\n']
 290       >>> for line in unified_diff(s1, s2, fromfile='before.py', tofile='after.py'):
 291       ...     sys.stdout.write(line)   # doctest: +NORMALIZE_WHITESPACE
 292       --- before.py
 293       +++ after.py
 294       @@ -1,4 +1,4 @@
 295       -bacon
 296       -eggs
 297       -ham
 298       +python
 299       +eggy
 300       +hamster
 301        guido
 302
 303    See :ref:`difflib-interface` for a more detailed example.
 304
 305    .. versionadded:: 2.3
 306
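For inputs without trailing newlines, the *lineterm* argument might be used
as in the following sketch. The input lists are hypothetical and simply drop
the newlines from the example above.

```python
from difflib import unified_diff

# Hypothetical input lines without trailing newlines.
old = ['bacon', 'eggs', 'ham', 'guido']
new = ['python', 'eggy', 'hamster', 'guido']

# lineterm='' keeps the control lines newline free as well.
lines = list(unified_diff(old, new, fromfile='before.py', tofile='after.py',
                          lineterm=''))

# Join with explicit newlines to obtain printable text.
text = '\n'.join(lines)
```

Leaving *lineterm* at its default here would mix newline-terminated control
lines with unterminated content lines.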
 307
 308 .. function:: IS_LINE_JUNK(line)
 309
 310    Return true for ignorable lines.  The line *line* is ignorable if *line* is
 311    blank or contains a single ``'#'``, otherwise it is not ignorable.  Used as a
 312    default for parameter *linejunk* in :func:`ndiff` before Python 2.3.
 313
 314
 315 .. function:: IS_CHARACTER_JUNK(ch)
 316
 317    Return true for ignorable characters.  The character *ch* is ignorable if *ch*
 318    is a space or tab, otherwise it is not ignorable.  Used as a default for
 319    parameter *charjunk* in :func:`ndiff`.
 320
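The two predicates can be tried directly, as in this small sketch. The sample
strings are made up for illustration, and passing both functions to
:func:`ndiff` explicitly is shown only to demonstrate the keyword arguments.

```python
from difflib import IS_LINE_JUNK, IS_CHARACTER_JUNK, ndiff

blank_is_junk = IS_LINE_JUNK('   \n')            # blank line
lone_hash_is_junk = IS_LINE_JUNK('  #  \n')      # nothing but a single '#'
code_is_junk = IS_LINE_JUNK('x = 1  # note\n')   # has other visible content

space_is_junk = IS_CHARACTER_JUNK(' ')
tab_is_junk = IS_CHARACTER_JUNK('\t')
newline_is_junk = IS_CHARACTER_JUNK('\n')        # newline is not in the set

# Both predicates passed explicitly as the filter functions of ndiff().
delta = list(ndiff(['a\n', '\n', 'b\n'], ['a\n', 'b\n'],
                   linejunk=IS_LINE_JUNK, charjunk=IS_CHARACTER_JUNK))
```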
 321
 322 .. seealso::
 323
 324    `Pattern Matching: The Gestalt Approach <http://www.ddj.com/184407970?pgno=5>`_
 325       Discussion of a similar algorithm by John W. Ratcliff and D. E. Metzener. This
 326       was published in `Dr. Dobb's Journal <http://www.ddj.com/>`_ in July, 1988.
 327
 328
 329 .. _sequence-matcher:
 330
 331 SequenceMatcher Objects
 332 -----------------------
 333
 334 The :class:`SequenceMatcher` class has this constructor:
 335
 336
 337 .. class:: SequenceMatcher([isjunk[, a[, b]]])
 338
 339    Optional argument *isjunk* must be ``None`` (the default) or a one-argument
 340    function that takes a sequence element and returns true if and only if the
 341    element is "junk" and should be ignored. Passing ``None`` for *isjunk* is
 342    equivalent to passing ``lambda x: 0``; in other words, no elements are ignored.
 343    For example, pass::
 344
 345       lambda x: x in " \t"
 346
 347    if you're comparing lines as sequences of characters, and don't want to synch up
 348    on blanks or hard tabs.
 349
 350    The optional arguments *a* and *b* are sequences to be compared; both default to
 351    empty strings.  The elements of both sequences must be :term:`hashable`.
 352
 353    :class:`SequenceMatcher` objects have the following methods:
 354
 355
 356    .. method:: set_seqs(a, b)
 357
 358       Set the two sequences to be compared.
 359
 360    :class:`SequenceMatcher` computes and caches detailed information about the
 361    second sequence, so if you want to compare one sequence against many
 362    sequences, use :meth:`set_seq2` to set the commonly used sequence once and
 363    call :meth:`set_seq1` repeatedly, once for each of the other sequences.
 364
 365
 366    .. method:: set_seq1(a)
 367
 368       Set the first sequence to be compared.  The second sequence to be compared
 369       is not changed.
 370
 371
 372    .. method:: set_seq2(b)
 373
 374       Set the second sequence to be compared.  The first sequence to be compared
 375       is not changed.
 376
 377
 378    .. method:: find_longest_match(alo, ahi, blo, bhi)
 379
 380       Find longest matching block in ``a[alo:ahi]`` and ``b[blo:bhi]``.
 381
 382       If *isjunk* was omitted or ``None``, :meth:`find_longest_match` returns
 383       ``(i, j, k)`` such that ``a[i:i+k]`` is equal to ``b[j:j+k]``, where ``alo
 384       <= i <= i+k <= ahi`` and ``blo <= j <= j+k <= bhi``. For all ``(i', j',
 385       k')`` meeting those conditions, the additional conditions ``k >= k'``, ``i
 386       <= i'``, and if ``i == i'``, ``j <= j'`` are also met. In other words, of
 387       all maximal matching blocks, return one that starts earliest in *a*, and
 388       of all those maximal matching blocks that start earliest in *a*, return
 389       the one that starts earliest in *b*.
 390
 391          >>> s = SequenceMatcher(None, " abcd", "abcd abcd")
 392          >>> s.find_longest_match(0, 5, 0, 9)
 393          Match(a=0, b=4, size=5)
 394
 395       If *isjunk* was provided, first the longest matching block is determined
 396       as above, but with the additional restriction that no junk element appears
 397       in the block.  Then that block is extended as far as possible by matching
 398       (only) junk elements on both sides. So the resulting block never matches
 399       on junk except as identical junk happens to be adjacent to an interesting
 400       match.
 401
 402       Here's the same example as before, but considering blanks to be junk. That
 403       prevents ``' abcd'`` from matching the ``' abcd'`` at the tail end of the
 404       second sequence directly.  Instead only the ``'abcd'`` can match, and
 405       matches the leftmost ``'abcd'`` in the second sequence:
 406
 407          >>> s = SequenceMatcher(lambda x: x==" ", " abcd", "abcd abcd")
 408          >>> s.find_longest_match(0, 5, 0, 9)
 409          Match(a=1, b=0, size=4)
 410
 411       If no blocks match, this returns ``(alo, blo, 0)``.
 412
 413       .. versionchanged:: 2.6
 414          This method returns a :term:`named tuple` ``Match(a, b, size)``.
 415
 416
 417    .. method:: get_matching_blocks()
 418
 419       Return list of triples describing matching subsequences. Each triple is of
 420       the form ``(i, j, n)``, and means that ``a[i:i+n] == b[j:j+n]``.  The
 421       triples are monotonically increasing in *i* and *j*.
 422
 423       The last triple is a dummy, and has the value ``(len(a), len(b), 0)``.  It
 424       is the only triple with ``n == 0``.  If ``(i, j, n)`` and ``(i', j', n')``
 425       are adjacent triples in the list, and the second is not the last triple in
 426       the list, then ``i+n != i'`` or ``j+n != j'``; in other words, adjacent
 427       triples always describe non-adjacent equal blocks.
 428
 429       .. XXX Explain why a dummy is used!
 430
 431       .. versionchanged:: 2.5
 432          The guarantee that adjacent triples always describe non-adjacent blocks
 433          was implemented.
 434
 435       .. doctest::
 436
 437          >>> s = SequenceMatcher(None, "abxcd", "abcd")
 438          >>> s.get_matching_blocks()
 439          [Match(a=0, b=0, size=2), Match(a=3, b=2, size=2), Match(a=5, b=4, size=0)]
 440
 441
 442    .. method:: get_opcodes()
 443
 444       Return list of 5-tuples describing how to turn *a* into *b*. Each tuple is
 445       of the form ``(tag, i1, i2, j1, j2)``.  The first tuple has ``i1 == j1 ==
 446       0``, and remaining tuples have *i1* equal to the *i2* from the preceding
 447       tuple, and, likewise, *j1* equal to the previous *j2*.
 448
 449       The *tag* values are strings, with these meanings:
 450
 451       +---------------+---------------------------------------------+
 452       | Value         | Meaning                                     |
 453       +===============+=============================================+
 454       | ``'replace'`` | ``a[i1:i2]`` should be replaced by          |
 455       |               | ``b[j1:j2]``.                               |
 456       +---------------+---------------------------------------------+
 457       | ``'delete'``  | ``a[i1:i2]`` should be deleted.  Note that  |
 458       |               | ``j1 == j2`` in this case.                  |
 459       +---------------+---------------------------------------------+
 460       | ``'insert'``  | ``b[j1:j2]`` should be inserted at          |
 461       |               | ``a[i1:i1]``. Note that ``i1 == i2`` in     |
 462       |               | this case.                                  |
 463       +---------------+---------------------------------------------+
 464       | ``'equal'``   | ``a[i1:i2] == b[j1:j2]`` (the sub-sequences |
 465       |               | are equal).                                 |
 466       +---------------+---------------------------------------------+
 467
 468       For example:
 469
 470          >>> a = "qabxcd"
 471          >>> b = "abycdf"
 472          >>> s = SequenceMatcher(None, a, b)
 473          >>> for tag, i1, i2, j1, j2 in s.get_opcodes():
 474          ...    print ("%7s a[%d:%d] (%s) b[%d:%d] (%s)" %
 475          ...           (tag, i1, i2, a[i1:i2], j1, j2, b[j1:j2]))
 476           delete a[0:1] (q) b[0:0] ()
 477            equal a[1:3] (ab) b[0:2] (ab)
 478          replace a[3:4] (x) b[2:3] (y)
 479            equal a[4:6] (cd) b[3:5] (cd)
 480           insert a[6:6] () b[5:6] (f)
 481
 482
 483    .. method:: get_grouped_opcodes([n])
 484
 485       Return a :term:`generator` of groups with up to *n* lines of context; *n* defaults to ``3``.
 486
 487       Starting with the groups returned by :meth:`get_opcodes`, this method
 488       splits out smaller change clusters and eliminates intervening ranges which
 489       have no changes.
 490
 491       The groups are returned in the same format as :meth:`get_opcodes`.
 492
 493       .. versionadded:: 2.3
 494
 495
 496    .. method:: ratio()
 497
 498       Return a measure of the sequences' similarity as a float in the range [0,
 499       1].
 500
 501       Where T is the total number of elements in both sequences, and M is the
 502       number of matches, this is 2.0\*M / T. Note that this is ``1.0`` if the
 503       sequences are identical, and ``0.0`` if they have nothing in common.
 504
 505       This is expensive to compute if :meth:`get_matching_blocks` or
 506       :meth:`get_opcodes` hasn't already been called, in which case you may want
 507       to try :meth:`quick_ratio` or :meth:`real_quick_ratio` first to get an
 508       upper bound.
 509
 510
 511    .. method:: quick_ratio()
 512
 513       Return an upper bound on :meth:`ratio` relatively quickly.
 514
 515       This isn't defined beyond that it is an upper bound on :meth:`ratio`, and
 516       is faster to compute.
 517
 518
 519    .. method:: real_quick_ratio()
 520
 521       Return an upper bound on :meth:`ratio` very quickly.
 522
 523       This isn't defined beyond that it is an upper bound on :meth:`ratio`, and
 524       is faster to compute than either :meth:`ratio` or :meth:`quick_ratio`.
 525
 526 The three methods that return the ratio of matching to total characters can give
 527 different results due to differing levels of approximation, although
 528 :meth:`quick_ratio` and :meth:`real_quick_ratio` are always at least as large as
 529 :meth:`ratio`:
 530
 531    >>> s = SequenceMatcher(None, "abcd", "bcde")
 532    >>> s.ratio()
 533    0.75
 534    >>> s.quick_ratio()
 535    0.75
 536    >>> s.real_quick_ratio()
 537    1.0
 538
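Since :meth:`quick_ratio` and :meth:`real_quick_ratio` are upper bounds, they
can be used to reject poor candidates before the more expensive :meth:`ratio`
is computed. The sketch below also sets the fixed sequence once with
:meth:`set_seq2`, because that is the sequence for which information is
cached. ``best_match`` is a hypothetical helper written for this
illustration; it is not part of the module.

```python
from difflib import SequenceMatcher

def best_match(word, candidates, cutoff=0.6):
    """Return (score, candidate) for the closest candidate, or None.

    Hypothetical helper for this sketch; not part of difflib.
    """
    matcher = SequenceMatcher()
    # The second sequence is the cached one, so set the fixed word once.
    matcher.set_seq2(word)
    best = None
    for candidate in candidates:
        matcher.set_seq1(candidate)
        # Cheapest bound first: if an upper bound is already below the
        # cutoff, ratio() cannot reach the cutoff either.
        if (matcher.real_quick_ratio() >= cutoff and
                matcher.quick_ratio() >= cutoff):
            score = matcher.ratio()
            if score >= cutoff and (best is None or score > best[0]):
                best = (score, candidate)
    return best

result = best_match('abcd', ['wxyz', 'bcde', 'abcd'])
```

The order of the checks matters only for speed; the final decision is always
made on the value of :meth:`ratio`.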
 539
 540 .. _sequencematcher-examples:
 541
 542 SequenceMatcher Examples
 543 ------------------------
 544
 545 This example compares two strings, considering blanks to be "junk":
 546
 547    >>> s = SequenceMatcher(lambda x: x == " ",
 548    ...                     "private Thread currentThread;",
 549    ...                     "private volatile Thread currentThread;")
 550
 551 :meth:`ratio` returns a float in [0, 1], measuring the similarity of the
 552 sequences.  As a rule of thumb, a :meth:`ratio` value over 0.6 means the
 553 sequences are close matches:
 554
 555    >>> print round(s.ratio(), 3)
 556    0.866
 557
 558 If you're only interested in where the sequences match,
 559 :meth:`get_matching_blocks` is handy:
 560
 561    >>> for block in s.get_matching_blocks():
 562    ...     print "a[%d] and b[%d] match for %d elements" % block
 563    a[0] and b[0] match for 8 elements
 564    a[8] and b[17] match for 21 elements
 565    a[29] and b[38] match for 0 elements
 566
 567 Note that the last tuple returned by :meth:`get_matching_blocks` is always a
 568 dummy, ``(len(a), len(b), 0)``, and this is the only case in which the last
 569 tuple element (number of elements matched) is ``0``.
 570
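The dummy block is easy to skip by testing the size, as this sketch shows
while extracting the matched text itself. The two short strings are chosen
for illustration.

```python
from difflib import SequenceMatcher

a = "abxcd"
b = "abcd"
s = SequenceMatcher(None, a, b)

# Skip the final dummy block, the only one with a size of zero.
common = [a[i:i + n] for i, j, n in s.get_matching_blocks() if n]

# Each block describes the same run of elements in both sequences.
same_in_b = [b[j:j + n] for i, j, n in s.get_matching_blocks() if n]
```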
 571 If you want to know how to change the first sequence into the second, use
 572 :meth:`get_opcodes`:
 573
 574    >>> for opcode in s.get_opcodes():
 575    ...     print "%6s a[%d:%d] b[%d:%d]" % opcode
 576     equal a[0:8] b[0:8]
 577    insert a[8:8] b[8:17]
 578     equal a[8:29] b[17:38]
 579
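One way to check an understanding of the opcodes is to follow them and
rebuild the second sequence from the first. ``apply_opcodes`` below is a
hypothetical helper written for this sketch, and it assumes both sequences
are strings.

```python
from difflib import SequenceMatcher

def apply_opcodes(a, b, opcodes):
    """Rebuild b from a by following the opcodes (hypothetical helper)."""
    pieces = []
    for tag, i1, i2, j1, j2 in opcodes:
        if tag == 'equal':
            pieces.append(a[i1:i2])      # copy the unchanged run from a
        elif tag in ('replace', 'insert'):
            pieces.append(b[j1:j2])      # take the new elements from b
        # 'delete': a[i1:i2] is dropped, so nothing is appended
    return ''.join(pieces)

a = "private Thread currentThread;"
b = "private volatile Thread currentThread;"
s = SequenceMatcher(lambda x: x == " ", a, b)
rebuilt = apply_opcodes(a, b, s.get_opcodes())
```

Because consecutive opcodes cover both sequences without gaps, the rebuilt
string should equal the second sequence.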
 580 .. seealso::
 581
 582    * The :func:`get_close_matches` function in this module which shows how
 583      simple code building on :class:`SequenceMatcher` can be used to do useful
 584      work.
 585
 586    * `Simple version control recipe
 587      <http://code.activestate.com/recipes/576729/>`_ for a small application
 588      built with :class:`SequenceMatcher`.
 589
 590
 591 .. _differ-objects:
 592
 593 Differ Objects
 594 --------------
 595
 596 Note that :class:`Differ`\ -generated deltas make no claim to be **minimal**
 597 diffs. To the contrary, minimal diffs are often counter-intuitive, because they
 598 synch up anywhere possible, sometimes on accidental matches 100 pages apart.
 599 Restricting synch points to contiguous matches preserves some notion of
 600 locality, at the occasional cost of producing a longer diff.
 601
 602 The :class:`Differ` class has this constructor:
 603
 604
 605 .. class:: Differ([linejunk[, charjunk]])
 606
 607    Optional keyword parameters *linejunk* and *charjunk* are for filter functions
 608    (or ``None``):
 609
 610    *linejunk*: A function that accepts a single string argument, and returns true
 611    if the string is junk.  The default is ``None``, meaning that no line is
 612    considered junk.
 613
 614    *charjunk*: A function that accepts a single character argument (a string of
 615    length 1), and returns true if the character is junk. The default is ``None``,
 616    meaning that no character is considered junk.
 617
 618    :class:`Differ` objects are used (deltas generated) via a single method:
 619
 620
 621    .. method:: Differ.compare(a, b)
 622
 623       Compare two sequences of lines, and generate the delta (a sequence of lines).
 624
 625       Each sequence must contain individual single-line strings ending with newlines.
 626       Such sequences can be obtained from the :meth:`readlines` method of file-like
 627       objects.  The delta generated also consists of newline-terminated strings, ready
 628       to be printed as-is via the :meth:`writelines` method of a file-like object.
 629
 630
 631 .. _differ-examples:
 632
 633 Differ Example
 634 --------------
 635
 636 This example compares two texts. First we set up the texts, sequences of
 637 individual single-line strings ending with newlines (such sequences can also be
 638 obtained from the :meth:`readlines` method of file-like objects):
 639
 640    >>> text1 = '''  1. Beautiful is better than ugly.
 641    ...   2. Explicit is better than implicit.
 642    ...   3. Simple is better than complex.
 643    ...   4. Complex is better than complicated.
 644    ... '''.splitlines(1)
 645    >>> len(text1)
 646    4
 647    >>> text1[0][-1]
 648    '\n'
 649    >>> text2 = '''  1. Beautiful is better than ugly.
 650    ...   3.   Simple is better than complex.
 651    ...   4. Complicated is better than complex.
 652    ...   5. Flat is better than nested.
 653    ... '''.splitlines(1)
 654
 655 Next we instantiate a Differ object:
 656
 657    >>> d = Differ()
 658
 659 Note that when instantiating a :class:`Differ` object we may pass functions to
 660 filter out line and character "junk."  See the :class:`Differ` constructor for
 661 details.
 662
 663 Finally, we compare the two:
 664
 665    >>> result = list(d.compare(text1, text2))
 666
 667 ``result`` is a list of strings, so let's pretty-print it:
 668
 669    >>> from pprint import pprint
 670    >>> pprint(result)
 671    ['    1. Beautiful is better than ugly.\n',
 672     '-   2. Explicit is better than implicit.\n',
 673     '-   3. Simple is better than complex.\n',
 674     '+   3.   Simple is better than complex.\n',
 675     '?     ++\n',
 676     '-   4. Complex is better than complicated.\n',
 677     '?            ^                     ---- ^\n',
 678     '+   4. Complicated is better than complex.\n',
 679     '?           ++++ ^                      ^\n',
 680     '+   5. Flat is better than nested.\n']
 681
 682 As a single multi-line string it looks like this:
 683
 684    >>> import sys
 685    >>> sys.stdout.writelines(result)
 686        1. Beautiful is better than ugly.
 687    -   2. Explicit is better than implicit.
 688    -   3. Simple is better than complex.
 689    +   3.   Simple is better than complex.
 690    ?     ++
 691    -   4. Complex is better than complicated.
 692    ?            ^                     ---- ^
 693    +   4. Complicated is better than complex.
 694    ?           ++++ ^                      ^
 695    +   5. Flat is better than nested.
 696
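When only the changed lines are of interest, the two-character codes can be
used to filter a delta. This is a sketch with made-up input lines, not a
replacement for the context or unified formats.

```python
from difflib import Differ

text1 = ['one\n', 'two\n', 'three\n']
text2 = ['one\n', 'too\n', 'three\n', 'four\n']

delta = list(Differ().compare(text1, text2))

# Keep only additions and removals; drop common lines and '? ' guide lines.
changes = [line for line in delta if line[:2] in ('- ', '+ ')]

# Strip the two-character code to recover the affected lines themselves.
removed = [line[2:] for line in delta if line.startswith('- ')]
added = [line[2:] for line in delta if line.startswith('+ ')]
```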
 697
 698 .. _difflib-interface:
 699
 700 A command-line interface to difflib
 701 -----------------------------------
 702
 703 This example shows how to use difflib to create a ``diff``-like utility.
 704 It is also contained in the Python source distribution, as
 705 :file:`Tools/scripts/diff.py`.
 706
 707 .. testcode::
 708
 709    """ Command line interface to difflib.py providing diffs in four formats:
 710
 711    * ndiff:    lists every line and highlights interline changes.
 712    * context:  highlights clusters of changes in a before/after format.
 713    * unified:  highlights clusters of changes in an inline format.
 714    * html:     generates side by side comparison with change highlights.
 715
 716    """
 717
 718    import sys, os, time, difflib, optparse
 719
 720    def main():
 721        # Configure the option parser
 722        usage = "usage: %prog [options] fromfile tofile"
 723        parser = optparse.OptionParser(usage)
 724        parser.add_option("-c", action="store_true", default=False,
 725                          help='Produce a context format diff (default)')
 726        parser.add_option("-u", action="store_true", default=False,
 727                          help='Produce a unified format diff')
 728        hlp = 'Produce HTML side by side diff (can use -c and -l in conjunction)'
 729        parser.add_option("-m", action="store_true", default=False, help=hlp)
 730        parser.add_option("-n", action="store_true", default=False,
 731                          help='Produce a ndiff format diff')
 732        parser.add_option("-l", "--lines", type="int", default=3,
 733                          help='Set number of context lines (default 3)')
 734        (options, args) = parser.parse_args()
 735
 736        if len(args) == 0:
 737            parser.print_help()
 738            sys.exit(1)
 739        if len(args) != 2:
 740            parser.error("need to specify both a fromfile and tofile")
 741
 742        n = options.lines
 743        fromfile, tofile = args # as specified in the usage string
 744
 745        # we're passing these as arguments to the diff function
 746        fromdate = time.ctime(os.stat(fromfile).st_mtime)
 747        todate = time.ctime(os.stat(tofile).st_mtime)
 748        fromlines = open(fromfile, 'U').readlines()
 749        tolines = open(tofile, 'U').readlines()
 750
 751        if options.u:
 752            diff = difflib.unified_diff(fromlines, tolines, fromfile, tofile,
 753                                        fromdate, todate, n=n)
 754        elif options.n:
 755            diff = difflib.ndiff(fromlines, tolines)
 756        elif options.m:
 757            diff = difflib.HtmlDiff().make_file(fromlines, tolines, fromfile,
 758                                                tofile, context=options.c,
 759                                                numlines=n)
 760        else:
 761            diff = difflib.context_diff(fromlines, tolines, fromfile, tofile,
 762                                        fromdate, todate, n=n)
 763
 764        # we're using writelines because diff is a generator
 765        sys.stdout.writelines(diff)
 766
 767    if __name__ == '__main__':
 768        main()