Lib/pickletools.py

'''"Executable documentation" for the pickle module.

Extensive comments about the pickle protocols and pickle-machine opcodes
can be found here.  Some functions meant for external use:

genops(pickle)
   Generate all the opcodes in a pickle, as (opcode, arg, position) triples.

dis(pickle, out=None, memo=None, indentlevel=4)
   Print a symbolic disassembly of a pickle.
'''

__all__ = ['dis',
           'genops',
          ]

# Other ideas:
#
# - A pickle verifier:  read a pickle and check it exhaustively for
#   well-formedness.  dis() does a lot of this already.
#
# - A protocol identifier:  examine a pickle and return its protocol number
#   (== the highest .proto attr value among all the opcodes in the pickle).
#   dis() already prints this info at the end.
#
# - A pickle optimizer:  for example, tuple-building code is sometimes more
#   elaborate than necessary, catering for the possibility that the tuple
#   is recursive.  Or lots of times a PUT is generated that's never accessed
#   by a later GET.


"""
"A pickle" is a program for a virtual pickle machine (PM, but more accurately
called an unpickling machine).  It's a sequence of opcodes, interpreted by the
PM, building an arbitrarily complex Python object.

For the most part, the PM is very simple:  there are no looping, testing, or
conditional instructions, no arithmetic and no function calls.  Opcodes are
executed once each, from first to last, until a STOP opcode is reached.

The PM has two data areas, "the stack" and "the memo".

Many opcodes push Python objects onto the stack; e.g., INT pushes a Python
integer object on the stack, whose value is gotten from a decimal string
literal immediately following the INT opcode in the pickle bytestream.  Other
opcodes take Python objects off the stack.  The result of unpickling is
whatever object is left on the stack when the final STOP opcode is executed.

The memo is simply an array of objects, or it can be implemented as a dict
mapping little integers to objects.  The memo serves as the PM's "long term
memory", and the little integers indexing the memo are akin to variable
names.  Some opcodes store the object on top of the stack into the memo at a
given index (without popping it), and others push a memo object at a given
index onto the stack again.
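
As a rough illustration of the stack and the memo (a sketch only, assuming the
standard pickle and pickletools modules are importable; the names inner, outer
and opcode_names are made up for the example), the following pickles a list
that refers to one inner list twice and collects the opcode names.  Under
protocol 0 the shared inner list is expected to be saved with a PUT and pushed
again later with a GET.

```python
import pickle
import pickletools

# Hypothetical example data: `inner` appears twice in `outer`.
inner = []
outer = [inner, inner]

data = pickle.dumps(outer, protocol=0)

# genops() yields (opcode, arg, position) triples; keep only the names.
opcode_names = [op.name for op, arg, pos in pickletools.genops(data)]
print(opcode_names)

# The pickle produced by dumps() ends with the STOP opcode.
assert opcode_names[-1] == 'STOP'
```

Passing the same data to dis() prints the full symbolic disassembly instead.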

At heart, that's all the PM has.  Subtleties arise for these reasons:

+ Object identity.  Objects can be arbitrarily complex, and subobjects
  may be shared (for example, the list [a, a] refers to the same object a
  twice).  It can be vital that unpickling recreate an isomorphic object
  graph, faithfully reproducing sharing.

+ Recursive objects.  For example, after "L = []; L.append(L)", L is a
  list, and L[0] is the same list.  This is related to the object identity
  point, and some sequences of pickle opcodes are subtle in order to
  get the right result in all cases.

+ Things pickle doesn't know everything about.  Examples of things pickle
  does know everything about are Python's builtin scalar and container
  types, like ints and tuples.  They generally have opcodes dedicated to
  them.  For things like module references and instances of user-defined
  classes, pickle's knowledge is limited.  Historically, many enhancements
  have been made to the pickle protocol in order to do a better (faster,
  and/or more compact) job on those.

+ Backward compatibility and micro-optimization.  As explained below,
  pickle opcodes never go away, not even when better ways to do a thing
  get invented.  The repertoire of the PM just keeps growing over time.
  For example, protocol 0 had two opcodes for building Python integers (INT
  and LONG), protocol 1 added three more for more-efficient pickling of short
  integers, and protocol 2 added two more for more-efficient pickling of
  long integers (before protocol 2, the only ways to pickle a Python long
  took time quadratic in the number of digits, for both pickling and
  unpickling).  "Opcode bloat" isn't so much a subtlety as a source of
  wearying complication.
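
The object identity and recursion points above can be checked with a small
round trip.  This is only a sketch using pickle.dumps() and pickle.loads();
the variable names are invented for the example.

```python
import pickle

# Shared subobject: both slots of `pair` refer to the same list `a`.
a = []
pair = [a, a]
pair_copy = pickle.loads(pickle.dumps(pair))
# The copy is a new object graph, but the sharing is reproduced.
assert pair_copy[0] is pair_copy[1]
assert pair_copy[0] is not a

# Recursive object: the list contains itself.
L = []
L.append(L)
L_copy = pickle.loads(pickle.dumps(L))
assert L_copy[0] is L_copy
```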


Pickle protocols:

For compatibility, the meaning of a pickle opcode never changes.  Instead new
pickle opcodes get added, and each version's unpickler can handle all the
pickle opcodes in all protocol versions to date.  So old pickles continue to
be readable forever.  The pickler can generally be told to restrict itself to
the subset of opcodes available under previous protocol versions too, so that
users can create pickles under the current version readable by older
versions.  However, a pickle does not contain its version number embedded
within it.  If an older unpickler tries to read a pickle using a later
protocol, the result is most likely an exception due to seeing an unknown (in
the older unpickler) opcode.
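
One way to estimate which protocol a pickle needs is to take the highest
.proto attribute among its opcodes, as yielded by genops().  The helper name
highest_protocol below is hypothetical (it is not provided by this module),
and the sample data is made up; treat this as a sketch.

```python
import pickle
import pickletools

def highest_protocol(pickled):
    # Hypothetical helper (not part of pickletools): the largest .proto
    # value among the opcodes that appear in the pickle.
    return max(op.proto for op, arg, pos in pickletools.genops(pickled))

# The same object pickled while restricting the pickler to older opcodes,
# and again with protocol 2.
old_style = pickle.dumps([1, 2], protocol=0)
new_style = pickle.dumps([1, 2], protocol=2)

print(highest_protocol(old_style))
print(highest_protocol(new_style))
```

The second call should report 2, because a protocol 2 pickle contains the
PROTO opcode, which was introduced in protocol 2.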

The original pickle used what's now called "protocol 0", and what was called
"text mode" before Python 2.3.  The entire pickle bytestream is made up of
printable 7-bit ASCII characters, plus the newline character, in protocol 0.
That's why it was called text mode.  Protocol 0 is small and elegant, but
sometimes painfully inefficient.

The second major set of additions is now called "protocol 1", and was called
"binary mode" before Python 2.3.  This added many opcodes with arguments
consisting of arbitrary bytes, including NUL bytes and unprintable "high bit"
bytes.  Binary mode pickles can be substantially smaller than equivalent
text mode pickles, and sometimes faster too; e.g., BININT represents a 4-byte
int as 4 bytes following the opcode, which is cheaper to unpickle than the
(perhaps) 11-character decimal string attached to INT.  Protocol 1 also added
a number of opcodes that operate on many stack elements at once (like APPENDS
and SETITEMS), and "shortcut" opcodes (like EMPTY_DICT and EMPTY_TUPLE).
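
The size difference between text mode and binary mode can be observed
directly.  The sketch below pickles an arbitrary sample list (chosen only for
the example) under protocols 0 and 1 and compares the lengths; the exact byte
counts depend on the data and the Python version.

```python
import pickle

values = list(range(100))  # hypothetical sample data

text_pickle = pickle.dumps(values, protocol=0)
binary_pickle = pickle.dumps(values, protocol=1)

# Both pickles rebuild an equal list; only the encoding differs.
assert pickle.loads(text_pickle) == values
assert pickle.loads(binary_pickle) == values

print("protocol 0: %d bytes" % len(text_pickle))
print("protocol 1: %d bytes" % len(binary_pickle))
```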

The third major set of additions came in Python 2.3, and is called "protocol
2".  This added:

- A better way to pickle instances of new-style classes (NEWOBJ).

- A way for a pickle to identify its protocol (PROTO).

- Time- and space-efficient pickling of long ints (LONG{1,4}).

- Shortcuts for small tuples (TUPLE{1,2,3}).

- Dedicated opcodes for bools (NEWTRUE, NEWFALSE).

- The "extension registry", a vector of popular objects that can be pushed
  efficiently by index (EXT{1,2,4}).  This is akin to the memo and GET, but
  the registry contents are predefined (there's nothing akin to the memo's
  PUT).
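
Several of the protocol 2 additions listed above can be seen in one small
pickle.  In this sketch the sample value is invented for the example: a
2-tuple holding a bool and an integer too large for a 4-byte int, which is
expected to produce PROTO, NEWTRUE, LONG1 and TUPLE2 among its opcodes.

```python
import pickle
import pickletools

# Hypothetical sample: a bool plus an integer that needs more than 4 bytes.
sample = (True, 2 ** 70)
data = pickle.dumps(sample, protocol=2)

# Collect the opcode names in the order the PM would execute them.
opcode_names = [op.name for op, arg, pos in pickletools.genops(data)]
print(opcode_names)

assert pickle.loads(data) == sample
```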

Another independent change with Python 2.3 is the abandonment of any
pretense that it might be safe to load pickles received from untrusted
parties -- no sufficient security analysis has been done to guarantee
this and there isn't a use case that warrants the expense of such an
analysis.

To this end, all tests for __safe_for_unpickling__ or for
copy_reg.safe_constructors are removed from the unpickling code.
References to these variables in the descriptions below are to be seen
as describing unpickling in Python 2.2 and before.
"""

# Meta-rule:  Descriptions are stored in instances of descriptor objects,
# with plain constructors.  No meta-language is defined from which
# descriptors could be constructed.  If you want, e.g., XML, write a little
# program to generate XML from the objects.

##############################################################################
# Some pickle opcodes have an argument, following the opcode in the
# bytestream.  An argument is of a specific type, described by an instance
# of ArgumentDescriptor.  These are not to be confused with arguments taken
# off the stack -- ArgumentDescriptor applies only to arguments embedded in
# the opcode stream, immediately following an opcode.

# Represents the number of bytes consumed by an argument delimited by the
# next newline character.
UP_TO_NEWLINE = -1

# Represents the number of bytes consumed by a two-argument opcode where
# the first argument gives the number of bytes in the second argument.
TAKEN_FROM_ARGUMENT1 = -2   # num bytes is 1-byte unsigned int
TAKEN_FROM_ARGUMENT4 = -3   # num bytes is 4-byte signed little-endian int

class ArgumentDescriptor(object):
    __slots__ = (
        # name of descriptor record, also a module global name; a string
        'name',

        # length of argument, in bytes; an int; UP_TO_NEWLINE and
        # TAKEN_FROM_ARGUMENT{1,4} are negative values for variable-length
        # cases
        'n',

        # a function taking a file-like object, reading this kind of argument
        # from the object at the current position, advancing the current
        # position by n bytes, and returning the value of the argument
        'reader',

        # human-readable docs for this arg descriptor; a string
        'doc',
    )

    def __init__(self, name, n, reader, doc):
        assert isinstance(name, str)
        self.name = name

        assert isinstance(n, int) and (n >= 0 or
                                       n in (UP_TO_NEWLINE,
                                             TAKEN_FROM_ARGUMENT1,
                                             TAKEN_FROM_ARGUMENT4))
        self.n = n

        self.reader = reader

        assert isinstance(doc, str)
        self.doc = doc

from struct import unpack as _unpack

def read_uint1(f):
    r"""
    >>> import StringIO
    >>> read_uint1(StringIO.StringIO('\xff'))
    255
    """

    data = f.read(1)
    if data:
        return ord(data)
    raise ValueError("not enough data in stream to read uint1")

uint1 = ArgumentDescriptor(
            name='uint1',
            n=1,
            reader=read_uint1,
            doc="One-byte unsigned integer.")


def read_uint2(f):
    r"""
    >>> import StringIO
    >>> read_uint2(StringIO.StringIO('\xff\x00'))
    255
    >>> read_uint2(StringIO.StringIO('\xff\xff'))
    65535
    """

    data = f.read(2)
    if len(data) == 2:
        return _unpack("<H", data)[0]
    raise ValueError("not enough data in stream to read uint2")

uint2 = ArgumentDescriptor(
            name='uint2',
            n=2,
            reader=read_uint2,
            doc="Two-byte unsigned integer, little-endian.")


def read_int4(f):
    r"""
    >>> import StringIO
    >>> read_int4(StringIO.StringIO('\xff\x00\x00\x00'))
    255
    >>> read_int4(StringIO.StringIO('\x00\x00\x00\x80')) == -(2**31)
    True
    """

    data = f.read(4)
    if len(data) == 4:
        return _unpack("<i", data)[0]
    raise ValueError("not enough data in stream to read int4")

int4 = ArgumentDescriptor(
           name='int4',
           n=4,
           reader=read_int4,
           doc="Four-byte signed integer, little-endian, 2's complement.")


def read_stringnl(f, decode=True, stripquotes=True):
    r"""
    >>> import StringIO
    >>> read_stringnl(StringIO.StringIO("'abcd'\nefg\n"))
    'abcd'

    >>> read_stringnl(StringIO.StringIO("\n"))
    Traceback (most recent call last):
    ...
    ValueError: no string quotes around ''

    >>> read_stringnl(StringIO.StringIO("\n"), stripquotes=False)
    ''

    >>> read_stringnl(StringIO.StringIO("''\n"))
    ''

    >>> read_stringnl(StringIO.StringIO('"abcd"'))
    Traceback (most recent call last):
    ...
    ValueError: no newline found when trying to read stringnl

    Embedded escapes are undone in the result.
    >>> read_stringnl(StringIO.StringIO(r"'a\n\\b\x00c\td'" + "\n'e'"))
    'a\n\\b\x00c\td'
    """

    data = f.readline()
    if not data.endswith('\n'):
        raise ValueError("no newline found when trying to read stringnl")
    data = data[:-1]    # lose the newline

    if stripquotes:
        for q in "'\"":
            if data.startswith(q):
                if not data.endswith(q):
                    raise ValueError("string quote %r not found at both "
                                     "ends of %r" % (q, data))
                data = data[1:-1]
                break
        else:
            raise ValueError("no string quotes around %r" % data)

    # I'm not sure when 'string_escape' was added to the std codecs; it's
    # crazy not to use it if it's there.
    if decode:
        data = data.decode('string_escape')
    return data

stringnl = ArgumentDescriptor(
               name='stringnl',
               n=UP_TO_NEWLINE,
               reader=read_stringnl,
               doc="""A newline-terminated string.

                   This is a repr-style string, with embedded escapes, and
                   bracketing quotes.
                   """)

def read_stringnl_noescape(f):
    return read_stringnl(f, decode=False, stripquotes=False)

stringnl_noescape = ArgumentDescriptor(
                        name='stringnl_noescape',
                        n=UP_TO_NEWLINE,
                        reader=read_stringnl_noescape,
                        doc="""A newline-terminated string.

                        This is a str-style string, without embedded escapes,
                        or bracketing quotes.  It should consist solely of
                        printable ASCII characters.
                        """)

def read_stringnl_noescape_pair(f):
    r"""
    >>> import StringIO
    >>> read_stringnl_noescape_pair(StringIO.StringIO("Queue\nEmpty\njunk"))
    'Queue Empty'
    """

    return "%s %s" % (read_stringnl_noescape(f), read_stringnl_noescape(f))

stringnl_noescape_pair = ArgumentDescriptor(
                             name='stringnl_noescape_pair',
                             n=UP_TO_NEWLINE,
                             reader=read_stringnl_noescape_pair,
                             doc="""A pair of newline-terminated strings.

                             These are str-style strings, without embedded
                             escapes, or bracketing quotes.  They should
                             consist solely of printable ASCII characters.
                             The pair is returned as a single string, with
                             a single blank separating the two strings.
                             """)

def read_string4(f):
    r"""
    >>> import StringIO
    >>> read_string4(StringIO.StringIO("\x00\x00\x00\x00abc"))
    ''
    >>> read_string4(StringIO.StringIO("\x03\x00\x00\x00abcdef"))
    'abc'
    >>> read_string4(StringIO.StringIO("\x00\x00\x00\x03abcdef"))
    Traceback (most recent call last):
    ...
    ValueError: expected 50331648 bytes in a string4, but only 6 remain
    """

    n = read_int4(f)
    if n < 0:
        raise ValueError("string4 byte count < 0: %d" % n)
    data = f.read(n)
    if len(data) == n:
        return data
    raise ValueError("expected %d bytes in a string4, but only %d remain" %
                     (n, len(data)))

string4 = ArgumentDescriptor(
              name="string4",
              n=TAKEN_FROM_ARGUMENT4,
              reader=read_string4,
              doc="""A counted string.

              The first argument is a 4-byte little-endian signed int giving
              the number of bytes in the string, and the second argument is
              that many bytes.
              """)


def read_string1(f):
    r"""
    >>> import StringIO
    >>> read_string1(StringIO.StringIO("\x00"))
    ''
    >>> read_string1(StringIO.StringIO("\x03abcdef"))
    'abc'
    """

    n = read_uint1(f)
    assert n >= 0
    data = f.read(n)
    if len(data) == n:
        return data
    raise ValueError("expected %d bytes in a string1, but only %d remain" %
                     (n, len(data)))

string1 = ArgumentDescriptor(
              name="string1",
              n=TAKEN_FROM_ARGUMENT1,
              reader=read_string1,
              doc="""A counted string.

              The first argument is a 1-byte unsigned int giving the number
              of bytes in the string, and the second argument is that many
              bytes.
              """)


def read_unicodestringnl(f):
    r"""
    >>> import StringIO
    >>> read_unicodestringnl(StringIO.StringIO("abc\uabcd\njunk"))
    u'abc\uabcd'
    """

    data = f.readline()
    if not data.endswith('\n'):
        raise ValueError("no newline found when trying to read "
                         "unicodestringnl")
    data = data[:-1]    # lose the newline
    return unicode(data, 'raw-unicode-escape')

unicodestringnl = ArgumentDescriptor(
                      name='unicodestringnl',
                      n=UP_TO_NEWLINE,
                      reader=read_unicodestringnl,
                      doc="""A newline-terminated Unicode string.

                      This is raw-unicode-escape encoded, so consists of
                      printable ASCII characters, and may contain embedded
                      escape sequences.
                      """)

def read_unicodestring4(f):
    r"""
    >>> import StringIO
    >>> s = u'abcd\uabcd'
    >>> enc = s.encode('utf-8')
    >>> enc
    'abcd\xea\xaf\x8d'
    >>> n = chr(len(enc)) + chr(0) * 3  # little-endian 4-byte length
    >>> t = read_unicodestring4(StringIO.StringIO(n + enc + 'junk'))
    >>> s == t
    True

    >>> read_unicodestring4(StringIO.StringIO(n + enc[:-1]))
    Traceback (most recent call last):
    ...
    ValueError: expected 7 bytes in a unicodestring4, but only 6 remain
    """

    n = read_int4(f)
    if n < 0:
        raise ValueError("unicodestring4 byte count < 0: %d" % n)
    data = f.read(n)
    if len(data) == n:
        return unicode(data, 'utf-8')
    raise ValueError("expected %d bytes in a unicodestring4, but only %d "
                     "remain" % (n, len(data)))

unicodestring4 = ArgumentDescriptor(
                    name="unicodestring4",
                    n=TAKEN_FROM_ARGUMENT4,
                    reader=read_unicodestring4,
                    doc="""A counted Unicode string.

                    The first argument is a 4-byte little-endian signed int
                    giving the number of bytes in the string, and the second
                    argument -- the UTF-8 encoding of the Unicode string --
                    contains that many bytes.
                    """)


def read_decimalnl_short(f):
    r"""
    >>> import StringIO
    >>> read_decimalnl_short(StringIO.StringIO("1234\n56"))
    1234

    >>> read_decimalnl_short(StringIO.StringIO("1234L\n56"))
    Traceback (most recent call last):
    ...
    ValueError: trailing 'L' not allowed in '1234L'
    """

    s = read_stringnl(f, decode=False, stripquotes=False)
    if s.endswith("L"):
        raise ValueError("trailing 'L' not allowed in %r" % s)

    # It's not necessarily true that the result fits in a Python short int:
    # the pickle may have been written on a 64-bit box.  There's also a hack
    # for True and False here.
    if s == "00":
        return False
    elif s == "01":
        return True

    try:
        return int(s)
    except OverflowError:
        return long(s)

def read_decimalnl_long(f):
    r"""
    >>> import StringIO

    >>> read_decimalnl_long(StringIO.StringIO("1234\n56"))
    Traceback (most recent call last):
    ...
    ValueError: trailing 'L' required in '1234'

    Someday the trailing 'L' will probably go away from this output.

    >>> read_decimalnl_long(StringIO.StringIO("1234L\n56"))
    1234L

    >>> read_decimalnl_long(StringIO.StringIO("123456789012345678901234L\n6"))
    123456789012345678901234L
    """

    s = read_stringnl(f, decode=False, stripquotes=False)
    if not s.endswith("L"):
        raise ValueError("trailing 'L' required in %r" % s)
    return long(s)


decimalnl_short = ArgumentDescriptor(
                      name='decimalnl_short',
                      n=UP_TO_NEWLINE,
                      reader=read_decimalnl_short,
                      doc="""A newline-terminated decimal integer literal.

                          This never has a trailing 'L', and the integer fit
                          in a short Python int on the box where the pickle
                          was written -- but there's no guarantee it will fit
                          in a short Python int on the box where the pickle
                          is read.
                          """)

decimalnl_long = ArgumentDescriptor(
                     name='decimalnl_long',
                     n=UP_TO_NEWLINE,
                     reader=read_decimalnl_long,
                     doc="""A newline-terminated decimal integer literal.

                         This has a trailing 'L', and can represent integers
                         of any size.
                         """)


def read_floatnl(f):
    r"""
    >>> import StringIO
    >>> read_floatnl(StringIO.StringIO("-1.25\n6"))
    -1.25
    """
    s = read_stringnl(f, decode=False, stripquotes=False)
    return float(s)

floatnl = ArgumentDescriptor(
              name='floatnl',
              n=UP_TO_NEWLINE,
              reader=read_floatnl,
              doc="""A newline-terminated decimal floating literal.

              In general this requires 17 significant digits for roundtrip
              identity, and pickling then unpickling infinities, NaNs, and
              minus zero doesn't work across boxes, or on some boxes even
              on itself (e.g., Windows can't read the strings it produces
              for infinities or NaNs).
              """)

def read_float8(f):
    r"""
    >>> import StringIO, struct
    >>> raw = struct.pack(">d", -1.25)
    >>> raw
    '\xbf\xf4\x00\x00\x00\x00\x00\x00'
    >>> read_float8(StringIO.StringIO(raw + "\n"))
    -1.25
    """

    data = f.read(8)
    if len(data) == 8:
        return _unpack(">d", data)[0]
    raise ValueError("not enough data in stream to read float8")


float8 = ArgumentDescriptor(
             name='float8',
             n=8,
             reader=read_float8,
             doc="""An 8-byte binary representation of a float, big-endian.

             The format is unique to Python, and shared with the struct
             module (format string '>d') "in theory" (the struct and cPickle
             implementations don't share the code -- they should).  It's
             strongly related to the IEEE-754 double format, and, in normal
             cases, is in fact identical to the big-endian 754 double format.
             On other boxes the dynamic range is limited to that of a 754
             double, and "add a half and chop" rounding is used to reduce
             the precision to 53 bits.  However, even on a 754 box,
             infinities, NaNs, and minus zero may not be handled correctly
             (may not survive roundtrip pickling intact).
             """)

# Protocol 2 formats

from pickle import decode_long

def read_long1(f):
    r"""
    >>> import StringIO
    >>> read_long1(StringIO.StringIO("\x00"))
    0L
    >>> read_long1(StringIO.StringIO("\x02\xff\x00"))
    255L
    >>> read_long1(StringIO.StringIO("\x02\xff\x7f"))
    32767L
    >>> read_long1(StringIO.StringIO("\x02\x00\xff"))
    -256L
    >>> read_long1(StringIO.StringIO("\x02\x00\x80"))
    -32768L
    """

    n = read_uint1(f)
    data = f.read(n)
    if len(data) != n:
        raise ValueError("not enough data in stream to read long1")
    return decode_long(data)

long1 = ArgumentDescriptor(
    name="long1",
    n=TAKEN_FROM_ARGUMENT1,
    reader=read_long1,
    doc="""A binary long, little-endian, using 1-byte size.

    This first reads one byte as an unsigned size, then reads that
    many bytes and interprets them as a little-endian 2's-complement long.
    If the size is 0, that's taken as a shortcut for the long 0L.
    """)

def read_long4(f):
    r"""
    >>> import StringIO
    >>> read_long4(StringIO.StringIO("\x02\x00\x00\x00\xff\x00"))
    255L
    >>> read_long4(StringIO.StringIO("\x02\x00\x00\x00\xff\x7f"))
    32767L
    >>> read_long4(StringIO.StringIO("\x02\x00\x00\x00\x00\xff"))
    -256L
    >>> read_long4(StringIO.StringIO("\x02\x00\x00\x00\x00\x80"))
    -32768L
    >>> read_long4(StringIO.StringIO("\x00\x00\x00\x00"))
    0L
    """

    n = read_int4(f)
    if n < 0:
        raise ValueError("long4 byte count < 0: %d" % n)
    data = f.read(n)
    if len(data) != n:
        raise ValueError("not enough data in stream to read long4")
    return decode_long(data)

long4 = ArgumentDescriptor(
    name="long4",
    n=TAKEN_FROM_ARGUMENT4,
    reader=read_long4,
    doc="""A binary representation of a long, little-endian.

    This first reads four bytes as a signed size (but requires the
    size to be >= 0), then reads that many bytes and interprets them
    as a little-endian 2's-complement long.  If the size is 0, that's taken
    as a shortcut for the long 0L, although LONG1 should really be used
    then instead (and in any case where # of bytes < 256).
    """)


##############################################################################
# Object descriptors.  The stack used by the pickle machine holds objects,
# and in the stack_before and stack_after attributes of OpcodeInfo
# descriptors we need names to describe the various types of objects that can
# appear on the stack.

class StackObject(object):
    __slots__ = (
        # name of descriptor record, for info only
        'name',

        # type of object, or tuple of type objects (meaning the object can
        # be of any type in the tuple)
        'obtype',

        # human-readable docs for this kind of stack object; a string
        'doc',
    )

    def __init__(self, name, obtype, doc):
        assert isinstance(name, str)
        self.name = name

        assert isinstance(obtype, type) or isinstance(obtype, tuple)
        if isinstance(obtype, tuple):
            for contained in obtype:
                assert isinstance(contained, type)
        self.obtype = obtype

        assert isinstance(doc, str)
        self.doc = doc

    def __repr__(self):
        return self.name
 727
 728
 729 pyint = StackObject(
 730             name='int',
 731             obtype=int,
 732             doc="A short (as opposed to long) Python integer object.")
 733
 734 pylong = StackObject(
 735              name='long',
 736              obtype=long,
 737              doc="A long (as opposed to short) Python integer object.")
 738
 739 pyinteger_or_bool = StackObject(
 740                         name='int_or_bool',
 741                         obtype=(int, long, bool),
 742                         doc="A Python integer object (short or long), or "
 743                             "a Python bool.")
 744
 745 pybool = StackObject(
 746              name='bool',
 747              obtype=(bool,),
 748              doc="A Python bool object.")
 749
 750 pyfloat = StackObject(
 751               name='float',
 752               obtype=float,
 753               doc="A Python float object.")
 754
 755 pystring = StackObject(
 756                name='str',
 757                obtype=str,
 758                doc="A Python string object.")
 759
 760 pyunicode = StackObject(
 761                 name='unicode',
 762                 obtype=unicode,
 763                 doc="A Python Unicode string object.")
 764
 765 pynone = StackObject(
 766              name="None",
 767              obtype=type(None),
 768              doc="The Python None object.")
 769
 770 pytuple = StackObject(
 771               name="tuple",
 772               obtype=tuple,
 773               doc="A Python tuple object.")
 774
 775 pylist = StackObject(
 776              name="list",
 777              obtype=list,
 778              doc="A Python list object.")
 779
 780 pydict = StackObject(
 781              name="dict",
 782              obtype=dict,
 783              doc="A Python dict object.")
 784
 785 anyobject = StackObject(
 786                 name='any',
 787                 obtype=object,
 788                 doc="Any kind of object whatsoever.")
 789
 790 markobject = StackObject(
 791                  name="mark",
 792                  obtype=StackObject,
 793                  doc="""'The mark' is a unique object.
 794
 795                  Opcodes that operate on a variable number of objects
 796                  generally don't embed the count of objects in the opcode,
 797                  or pull it off the stack.  Instead the MARK opcode is used
 798                  to push a special marker object on the stack, and then
 799                  some other opcodes grab all the objects from the top of
 800                  the stack down to (but not including) the topmost marker
 801                  object.
 802                  """)
 803
 804 stackslice = StackObject(
 805                  name="stackslice",
 806                  obtype=StackObject,
 807                  doc="""An object representing a contiguous slice of the stack.
 808
 809                  This is used in conjunction with markobject, to represent all
 810                  of the stack following the topmost markobject.  For example,
 811                  the POP_MARK opcode changes the stack from
 812
 813                      [..., markobject, stackslice]
 814                  to
 815                      [...]
 816
 817                  No matter how many objects are on the stack after the topmost
 818                  markobject, POP_MARK gets rid of all of them (including the
 819                  topmost markobject too).
 820                  """)
 821
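# Illustration only:  a minimal sketch of the markobject / stackslice
# convention documented above, modelling the PM stack as a plain Python
# list.  The names _sketch_mark, _sketch_topmost_mark, _sketch_pop_mark and
# _sketch_list are hypothetical; they are not part of pickle or pickletools,
# and the real unpickler may differ in detail.

_sketch_mark = object()   # stand-in for the unique mark object

def _sketch_topmost_mark(stack):
    # Return the index of the topmost mark, scanning down from the stack top.
    for k in range(len(stack) - 1, -1, -1):
        if stack[k] is _sketch_mark:
            return k
    raise ValueError("no mark on the stack")

def _sketch_pop_mark(stack):
    # POP_MARK:  [..., mark, stackslice]  ->  [...]
    del stack[_sketch_topmost_mark(stack):]

def _sketch_list(stack):
    # LIST:  [..., mark, stackslice]  ->  [..., list_of_slice_items]
    k = _sketch_topmost_mark(stack)
    stack[k:] = [stack[k+1:]]

# For example, _sketch_list turns ['x', _sketch_mark, 1, 2, 3] into
# ['x', [1, 2, 3]], and _sketch_pop_mark turns the same input into ['x'].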
 822 ##############################################################################
 823 # Descriptors for pickle opcodes.
 824
 825 class OpcodeInfo(object):
 826
 827     __slots__ = (
 828         # symbolic name of opcode; a string
 829         'name',
 830
 831         # the code used in a bytestream to represent the opcode; a
 832         # one-character string
 833         'code',
 834
 835         # If the opcode has an argument embedded in the byte string, an
 836         # instance of ArgumentDescriptor specifying its type.  Note that
 837         # arg.reader(s) can be used to read and decode the argument from
 838         # the bytestream s, and arg.doc documents the format of the raw
 839         # argument bytes.  If the opcode doesn't have an argument embedded
 840         # in the bytestream, arg should be None.
 841         'arg',
 842
 843         # what the stack looks like before this opcode runs; a list
 844         'stack_before',
 845
 846         # what the stack looks like after this opcode runs; a list
 847         'stack_after',
 848
 849         # the protocol number in which this opcode was introduced; an int
 850         'proto',
 851
 852         # human-readable docs for this opcode; a string
 853         'doc',
 854     )
 855
 856     def __init__(self, name, code, arg,
 857                  stack_before, stack_after, proto, doc):
 858         assert isinstance(name, str)
 859         self.name = name
 860
 861         assert isinstance(code, str)
 862         assert len(code) == 1
 863         self.code = code
 864
 865         assert arg is None or isinstance(arg, ArgumentDescriptor)
 866         self.arg = arg
 867
 868         assert isinstance(stack_before, list)
 869         for x in stack_before:
 870             assert isinstance(x, StackObject)
 871         self.stack_before = stack_before
 872
 873         assert isinstance(stack_after, list)
 874         for x in stack_after:
 875             assert isinstance(x, StackObject)
 876         self.stack_after = stack_after
 877
 878         assert isinstance(proto, int) and 0 <= proto <= 2
 879         self.proto = proto
 880
 881         assert isinstance(doc, str)
 882         self.doc = doc
 883
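# Illustration only:  a hedged sketch of how the stack_before and stack_after
# fields might be applied to a symbolic stack (a list of descriptors), for
# opcodes that work on a fixed number of objects.  The function name is
# hypothetical, and opcodes involving markobject or stackslice would need
# extra handling that this sketch omits.

def _sketch_apply_stack_effect(symbolic_stack, stack_before, stack_after):
    n = len(stack_before)
    if len(symbolic_stack) < n:
        raise ValueError("opcode needs %d stack entries, found %d" %
                         (n, len(symbolic_stack)))
    if n:
        # Guard against n == 0:  del symbolic_stack[-0:] would clear the list.
        del symbolic_stack[-n:]
    symbolic_stack.extend(stack_after)

# For example, with APPEND's stack_before=[pylist, anyobject] and
# stack_after=[pylist], a symbolic stack ending in a list and one other
# object ends in just the list afterwards.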
 884 I = OpcodeInfo
 885 opcodes = [
 886
 887     # Ways to spell integers.
 888
 889     I(name='INT',
 890       code='I',
 891       arg=decimalnl_short,
 892       stack_before=[],
 893       stack_after=[pyinteger_or_bool],
 894       proto=0,
 895       doc="""Push an integer or bool.
 896
 897       The argument is a newline-terminated decimal literal string.
 898
 899       The intent may have been that this always fit in a short Python int,
 900       but INT can be generated in pickles written on a 64-bit box that
 901       require a Python long on a 32-bit box.  The difference between this
 902       and LONG then is that INT skips a trailing 'L', and produces a short
 903       int whenever possible.
 904
 905       Another difference arises because, when bool was introduced as a
 906       distinct type in 2.3, builtin names True and False were also added to
 907       2.2.2, mapping to ints 1 and 0.  For compatibility in both directions,
 908       True gets pickled as INT + "01\\n", and False as INT + "00\\n".
 909       Leading zeroes are never produced for a genuine integer.  The 2.3
 910       (and later) unpicklers special-case these and return bool instead;
 911       earlier unpicklers ignore the leading "0" and return the int.
 912       """),
 913
 914     I(name='BININT',
 915       code='J',
 916       arg=int4,
 917       stack_before=[],
 918       stack_after=[pyint],
 919       proto=1,
 920       doc="""Push a four-byte signed integer.
 921
 922       This handles the full range of Python (short) integers on a 32-bit
 923       box, directly as binary bytes (1 for the opcode and 4 for the integer).
 924       If the integer is non-negative and fits in 1 or 2 bytes, pickling via
 925       BININT1 or BININT2 saves space.
 926       """),
 927
 928     I(name='BININT1',
 929       code='K',
 930       arg=uint1,
 931       stack_before=[],
 932       stack_after=[pyint],
 933       proto=1,
 934       doc="""Push a one-byte unsigned integer.
 935
 936       This is a space optimization for pickling very small non-negative ints,
 937       in range(256).
 938       """),
 939
 940     I(name='BININT2',
 941       code='M',
 942       arg=uint2,
 943       stack_before=[],
 944       stack_after=[pyint],
 945       proto=1,
 946       doc="""Push a two-byte unsigned integer.
 947
 948       This is a space optimization for pickling small positive ints, in
 949       range(256, 2**16).  Integers in range(256) can also be pickled via
 950       BININT2, but BININT1 instead saves a byte.
 951       """),
 952
 953     I(name='LONG',
 954       code='L',
 955       arg=decimalnl_long,
 956       stack_before=[],
 957       stack_after=[pylong],
 958       proto=0,
 959       doc="""Push a long integer.
 960
 961       The same as INT, except that the literal ends with 'L', and always
 962       unpickles to a Python long.  There doesn't seem to be a real purpose
 963       to the trailing 'L'.
 964
 965       Note that LONG takes time quadratic in the number of digits when
 966       unpickling (this is simply due to the nature of decimal->binary
 967       conversion).  Proto 2 added linear-time (in C; still quadratic-time
 968       in Python) LONG1 and LONG4 opcodes.
 969       """),
 970
 971     I(name="LONG1",
 972       code='\x8a',
 973       arg=long1,
 974       stack_before=[],
 975       stack_after=[pylong],
 976       proto=2,
 977       doc="""Long integer using one-byte length.
 978
 979       A more efficient encoding of a Python long; the long1 encoding
 980       says it all."""),
 981
 982     I(name="LONG4",
 983       code='\x8b',
 984       arg=long4,
 985       stack_before=[],
 986       stack_after=[pylong],
 987       proto=2,
 988       doc="""Long integer using four-byte length.
 989
 990       A more efficient encoding of a Python long; the long4 encoding
 991       says it all."""),
 992
 993     # Ways to spell strings (8-bit, not Unicode).
 994
 995     I(name='STRING',
 996       code='S',
 997       arg=stringnl,
 998       stack_before=[],
 999       stack_after=[pystring],
1000       proto=0,
1001       doc="""Push a Python string object.
1002
1003       The argument is a repr-style string, with bracketing quote characters,
1004       and perhaps embedded escapes.  The argument extends until the next
1005       newline character.
1006       """),
1007
1008     I(name='BINSTRING',
1009       code='T',
1010       arg=string4,
1011       stack_before=[],
1012       stack_after=[pystring],
1013       proto=1,
1014       doc="""Push a Python string object.
1015
1016       There are two arguments:  the first is a 4-byte little-endian signed int
1017       giving the number of bytes in the string, and the second is that many
1018       bytes, which are taken literally as the string content.
1019       """),
1020
1021     I(name='SHORT_BINSTRING',
1022       code='U',
1023       arg=string1,
1024       stack_before=[],
1025       stack_after=[pystring],
1026       proto=1,
1027       doc="""Push a Python string object.
1028
1029       There are two arguments:  the first is a 1-byte unsigned int giving
1030       the number of bytes in the string, and the second is that many bytes,
1031       which are taken literally as the string content.
1032       """),
1033
1034     # Ways to spell None.
1035
1036     I(name='NONE',
1037       code='N',
1038       arg=None,
1039       stack_before=[],
1040       stack_after=[pynone],
1041       proto=0,
1042       doc="Push None on the stack."),
1043
1044     # Ways to spell bools, starting with proto 2.  See INT for how this was
1045     # done before proto 2.
1046
1047     I(name='NEWTRUE',
1048       code='\x88',
1049       arg=None,
1050       stack_before=[],
1051       stack_after=[pybool],
1052       proto=2,
1053       doc="""True.
1054
1055       Push True onto the stack."""),
1056
1057     I(name='NEWFALSE',
1058       code='\x89',
1059       arg=None,
1060       stack_before=[],
1061       stack_after=[pybool],
1062       proto=2,
1063       doc="""False.
1064
1065       Push False onto the stack."""),
1066
1067     # Ways to spell Unicode strings.
1068
1069     I(name='UNICODE',
1070       code='V',
1071       arg=unicodestringnl,
1072       stack_before=[],
1073       stack_after=[pyunicode],
1074       proto=0,  # this may be pure-text, but it's a later addition
1075       doc="""Push a Python Unicode string object.
1076
1077       The argument is a raw-unicode-escape encoding of a Unicode string,
1078       and so may contain embedded escape sequences.  The argument extends
1079       until the next newline character.
1080       """),
1081
1082     I(name='BINUNICODE',
1083       code='X',
1084       arg=unicodestring4,
1085       stack_before=[],
1086       stack_after=[pyunicode],
1087       proto=1,
1088       doc="""Push a Python Unicode string object.
1089
1090       There are two arguments:  the first is a 4-byte little-endian signed int
1091       giving the number of bytes in the string.  The second is that many
1092       bytes, and is the UTF-8 encoding of the Unicode string.
1093       """),
1094
1095     # Ways to spell floats.
1096
1097     I(name='FLOAT',
1098       code='F',
1099       arg=floatnl,
1100       stack_before=[],
1101       stack_after=[pyfloat],
1102       proto=0,
1103       doc="""Newline-terminated decimal float literal.
1104
1105       The argument is repr(a_float), and in general requires 17 significant
1106       digits for roundtrip conversion to be an identity (this is so for
1107       IEEE-754 double precision values, which is what Python float maps to
1108       on most boxes).
1109
1110       In general, FLOAT cannot be used to transport infinities, NaNs, or
1111       minus zero across boxes (or even on a single box, if the platform C
1112       library can't read the strings it produces for such things -- Windows
1113       is like that), but may do less damage than BINFLOAT on boxes with
1114       greater precision or dynamic range than IEEE-754 double.
1115       """),
1116
1117     I(name='BINFLOAT',
1118       code='G',
1119       arg=float8,
1120       stack_before=[],
1121       stack_after=[pyfloat],
1122       proto=1,
1123       doc="""Float stored in binary form, with 8 bytes of data.
1124
1125       This generally requires less than half the space of FLOAT encoding.
1126       In general, BINFLOAT cannot be used to transport infinities, NaNs, or
1127       minus zero, raises an exception if the exponent exceeds the range of
1128       an IEEE-754 double, and retains no more than 53 bits of precision (if
1129       there are more than that, "add a half and chop" rounding is used to
1130       cut it back to 53 significant bits).
1131       """),
1132
1133     # Ways to build lists.
1134
1135     I(name='EMPTY_LIST',
1136       code=']',
1137       arg=None,
1138       stack_before=[],
1139       stack_after=[pylist],
1140       proto=1,
1141       doc="Push an empty list."),
1142
1143     I(name='APPEND',
1144       code='a',
1145       arg=None,
1146       stack_before=[pylist, anyobject],
1147       stack_after=[pylist],
1148       proto=0,
1149       doc="""Append an object to a list.
1150
1151       Stack before:  ... pylist anyobject
1152       Stack after:   ... pylist+[anyobject]
1153
1154       although pylist is really extended in-place.
1155       """),
1156
1157     I(name='APPENDS',
1158       code='e',
1159       arg=None,
1160       stack_before=[pylist, markobject, stackslice],
1161       stack_after=[pylist],
1162       proto=1,
1163       doc="""Extend a list by a slice of stack objects.
1164
1165       Stack before:  ... pylist markobject stackslice
1166       Stack after:   ... pylist+stackslice
1167
1168       although pylist is really extended in-place.
1169       """),
1170
1171     I(name='LIST',
1172       code='l',
1173       arg=None,
1174       stack_before=[markobject, stackslice],
1175       stack_after=[pylist],
1176       proto=0,
1177       doc="""Build a list out of the topmost stack slice, after markobject.
1178
1179       All the stack entries following the topmost markobject are placed into
1180       a single Python list, which single list object replaces all of the
1181       stack from the topmost markobject onward.  For example,
1182
1183       Stack before: ... markobject 1 2 3 'abc'
1184       Stack after:  ... [1, 2, 3, 'abc']
1185       """),
1186
1187     # Ways to build tuples.
1188
1189     I(name='EMPTY_TUPLE',
1190       code=')',
1191       arg=None,
1192       stack_before=[],
1193       stack_after=[pytuple],
1194       proto=1,
1195       doc="Push an empty tuple."),
1196
1197     I(name='TUPLE',
1198       code='t',
1199       arg=None,
1200       stack_before=[markobject, stackslice],
1201       stack_after=[pytuple],
1202       proto=0,
1203       doc="""Build a tuple out of the topmost stack slice, after markobject.
1204
1205       All the stack entries following the topmost markobject are placed into
1206       a single Python tuple, which single tuple object replaces all of the
1207       stack from the topmost markobject onward.  For example,
1208
1209       Stack before: ... markobject 1 2 3 'abc'
1210       Stack after:  ... (1, 2, 3, 'abc')
1211       """),
1212
1213     I(name='TUPLE1',
1214       code='\x85',
1215       arg=None,
1216       stack_before=[anyobject],
1217       stack_after=[pytuple],
1218       proto=2,
1219       doc="""One-tuple.
1220
1221       This code pops one value off the stack and pushes a tuple of
1222       length 1 whose one item is that value back onto it.  IOW:
1223
1224           stack[-1] = tuple(stack[-1:])
1225       """),
1226
1227     I(name='TUPLE2',
1228       code='\x86',
1229       arg=None,
1230       stack_before=[anyobject, anyobject],
1231       stack_after=[pytuple],
1232       proto=2,
1233       doc="""Two-tuple.
1234
1235       This code pops two values off the stack and pushes a tuple
1236       of length 2 whose items are those values back onto it.  IOW:
1237
1238           stack[-2:] = [tuple(stack[-2:])]
1239       """),
1240
1241     I(name='TUPLE3',
1242       code='\x87',
1243       arg=None,
1244       stack_before=[anyobject, anyobject, anyobject],
1245       stack_after=[pytuple],
1246       proto=2,
1247       doc="""Three-tuple.
1248
1249       This code pops three values off the stack and pushes a tuple
1250       of length 3 whose items are those values back onto it.  IOW:
1251
1252           stack[-3:] = [tuple(stack[-3:])]
1253       """),
1254
1255     # Ways to build dicts.
1256
1257     I(name='EMPTY_DICT',
1258       code='}',
1259       arg=None,
1260       stack_before=[],
1261       stack_after=[pydict],
1262       proto=1,
1263       doc="Push an empty dict."),
1264
1265     I(name='DICT',
1266       code='d',
1267       arg=None,
1268       stack_before=[markobject, stackslice],
1269       stack_after=[pydict],
1270       proto=0,
1271       doc="""Build a dict out of the topmost stack slice, after markobject.
1272
1273       All the stack entries following the topmost markobject are placed into
1274       a single Python dict, which single dict object replaces all of the
1275       stack from the topmost markobject onward.  The stack slice alternates
1276       key, value, key, value, ....  For example,
1277
1278       Stack before: ... markobject 1 2 3 'abc'
1279       Stack after:  ... {1: 2, 3: 'abc'}
1280       """),
1281
1282     I(name='SETITEM',
1283       code='s',
1284       arg=None,
1285       stack_before=[pydict, anyobject, anyobject],
1286       stack_after=[pydict],
1287       proto=0,
1288       doc="""Add a key+value pair to an existing dict.
1289
1290       Stack before:  ... pydict key value
1291       Stack after:   ... pydict
1292
1293       where pydict has been modified via pydict[key] = value.
1294       """),
1295
1296     I(name='SETITEMS',
1297       code='u',
1298       arg=None,
1299       stack_before=[pydict, markobject, stackslice],
1300       stack_after=[pydict],
1301       proto=1,
1302       doc="""Add an arbitrary number of key+value pairs to an existing dict.
1303
1304       The slice of the stack following the topmost markobject is taken as
1305       an alternating sequence of keys and values, added to the dict
1306       immediately under the topmost markobject.  Everything at and after the
1307       topmost markobject is popped, leaving the mutated dict at the top
1308       of the stack.
1309
1310       Stack before:  ... pydict markobject key_1 value_1 ... key_n value_n
1311       Stack after:   ... pydict
1312
1313       where pydict has been modified via pydict[key_i] = value_i for i in
1314       1, 2, ..., n, and in that order.
1315       """),
1316
1317     # Stack manipulation.
1318
1319     I(name='POP',
1320       code='0',
1321       arg=None,
1322       stack_before=[anyobject],
1323       stack_after=[],
1324       proto=0,
1325       doc="Discard the top stack item, shrinking the stack by one item."),
1326
1327     I(name='DUP',
1328       code='2',
1329       arg=None,
1330       stack_before=[anyobject],
1331       stack_after=[anyobject, anyobject],
1332       proto=0,
1333       doc="Push the top stack item onto the stack again, duplicating it."),
1334
1335     I(name='MARK',
1336       code='(',
1337       arg=None,
1338       stack_before=[],
1339       stack_after=[markobject],
1340       proto=0,
1341       doc="""Push markobject onto the stack.
1342
1343       markobject is a unique object, used by other opcodes to identify a
1344       region of the stack containing a variable number of objects for them
1345       to work on.  See markobject.doc for more detail.
1346       """),
1347
1348     I(name='POP_MARK',
1349       code='1',
1350       arg=None,
1351       stack_before=[markobject, stackslice],
1352       stack_after=[],
1353       proto=0,
1354       doc="""Pop all the stack objects at and above the topmost markobject.
1355
1356       When an opcode using a variable number of stack objects is done,
1357       POP_MARK is used to remove those objects, and to remove the markobject
1358       that delimited their starting position on the stack.
1359       """),
1360
1361     # Memo manipulation.  There are really only two operations (get and put),
1362     # each in all-text, "short binary", and "long binary" flavors.
1363
1364     I(name='GET',
1365       code='g',
1366       arg=decimalnl_short,
1367       stack_before=[],
1368       stack_after=[anyobject],
1369       proto=0,
1370       doc="""Read an object from the memo and push it on the stack.
1371
1372       The index of the memo object to push is given by the newline-terminated
1373       decimal string following.  BINGET and LONG_BINGET are space-optimized
1374       versions.
1375       """),
1376
1377     I(name='BINGET',
1378       code='h',
1379       arg=uint1,
1380       stack_before=[],
1381       stack_after=[anyobject],
1382       proto=1,
1383       doc="""Read an object from the memo and push it on the stack.
1384
1385       The index of the memo object to push is given by the 1-byte unsigned
1386       integer following.
1387       """),
1388
1389     I(name='LONG_BINGET',
1390       code='j',
1391       arg=int4,
1392       stack_before=[],
1393       stack_after=[anyobject],
1394       proto=1,
1395       doc="""Read an object from the memo and push it on the stack.
1396
1397       The index of the memo object to push is given by the 4-byte signed
1398       little-endian integer following.
1399       """),
1400
1401     I(name='PUT',
1402       code='p',
1403       arg=decimalnl_short,
1404       stack_before=[],
1405       stack_after=[],
1406       proto=0,
1407       doc="""Store the stack top into the memo.  The stack is not popped.
1408
1409       The index of the memo location to write into is given by the newline-
1410       terminated decimal string following.  BINPUT and LONG_BINPUT are
1411       space-optimized versions.
1412       """),
1413
1414     I(name='BINPUT',
1415       code='q',
1416       arg=uint1,
1417       stack_before=[],
1418       stack_after=[],
1419       proto=1,
1420       doc="""Store the stack top into the memo.  The stack is not popped.
1421
1422       The index of the memo location to write into is given by the 1-byte
1423       unsigned integer following.
1424       """),
1425
1426     I(name='LONG_BINPUT',
1427       code='r',
1428       arg=int4,
1429       stack_before=[],
1430       stack_after=[],
1431       proto=1,
1432       doc="""Store the stack top into the memo.  The stack is not popped.
1433
1434       The index of the memo location to write into is given by the 4-byte
1435       signed little-endian integer following.
1436       """),
1437
1438     # Access the extension registry (predefined objects).  Akin to the GET
1439     # family.
1440
1441     I(name='EXT1',
1442       code='\x82',
1443       arg=uint1,
1444       stack_before=[],
1445       stack_after=[anyobject],
1446       proto=2,
1447       doc="""Extension code.
1448
1449       This code and the similar EXT2 and EXT4 allow using a registry
1450       of popular objects that are pickled by name, typically classes.
1451       It is envisioned that through a global negotiation and
1452       registration process, third parties can set up a mapping between
1453       ints and object names.
1454
1455       In order to guarantee pickle interchangeability, the extension
1456       code registry ought to be global, although a range of codes may
1457       be reserved for private use.
1458
1459       EXT1 has a 1-byte integer argument.  This is used to index into the
1460       extension registry, and the object at that index is pushed on the stack.
1461       """),
1462
1463     I(name='EXT2',
1464       code='\x83',
1465       arg=uint2,
1466       stack_before=[],
1467       stack_after=[anyobject],
1468       proto=2,
1469       doc="""Extension code.
1470
1471       See EXT1.  EXT2 has a two-byte integer argument.
1472       """),
1473
1474     I(name='EXT4',
1475       code='\x84',
1476       arg=int4,
1477       stack_before=[],
1478       stack_after=[anyobject],
1479       proto=2,
1480       doc="""Extension code.
1481
1482       See EXT1.  EXT4 has a four-byte integer argument.
1483       """),
1484
1485     # Push a class object, or module function, on the stack, via its module
1486     # and name.
1487
1488     I(name='GLOBAL',
1489       code='c',
1490       arg=stringnl_noescape_pair,
1491       stack_before=[],
1492       stack_after=[anyobject],
1493       proto=0,
1494       doc="""Push a global object (module.attr) on the stack.
1495
1496       Two newline-terminated strings follow the GLOBAL opcode.  The first is
1497       taken as a module name, and the second as a class name.  The class
1498       object module.class is pushed on the stack.  More accurately, the
1499       object returned by self.find_class(module, class) is pushed on the
1500       stack, so unpickling subclasses can override this form of lookup.
1501       """),
1502
1503     # Ways to build objects of classes pickle doesn't know about directly
1504     # (user-defined classes).  I despair of documenting this accurately
1505     # and comprehensibly -- you really have to read the pickle code to
1506     # find all the special cases.
1507
1508     I(name='REDUCE',
1509       code='R',
1510       arg=None,
1511       stack_before=[anyobject, anyobject],
1512       stack_after=[anyobject],
1513       proto=0,
1514       doc="""Push an object built from a callable and an argument tuple.
1515
1516       The opcode is named to remind one of the __reduce__() method.
1517
1518       Stack before: ... callable pytuple
1519       Stack after:  ... callable(*pytuple)
1520
1521       The callable and the argument tuple are the first two items returned
1522       by a __reduce__ method.  Applying the callable to the argtuple is
1523       supposed to reproduce the original object, or at least get it started.
1524       If the __reduce__ method returns a 3-tuple, the last component is an
1525       argument to be passed to the object's __setstate__, and then the REDUCE
1526       opcode is followed by code to create setstate's argument, and then a
1527       BUILD opcode to apply __setstate__ to that argument.
1528
1529       If type(callable) is not ClassType, REDUCE complains unless the
1530       callable has been registered with the copy_reg module's
1531       safe_constructors dict, or the callable has a magic
1532       '__safe_for_unpickling__' attribute with a true value.  I'm not sure
1533       why it does this, but I've sure seen this complaint often enough when
1534       I didn't want to <wink>.
1535       """),
1536
1537     I(name='BUILD',
1538       code='b',
1539       arg=None,
1540       stack_before=[anyobject, anyobject],
1541       stack_after=[anyobject],
1542       proto=0,
1543       doc="""Finish building an object, via __setstate__ or dict update.
1544
1545       Stack before: ... anyobject argument
1546       Stack after:  ... anyobject
1547
1548       where anyobject may have been mutated, as follows:
1549
1550       If the object has a __setstate__ method,
1551
1552           anyobject.__setstate__(argument)
1553
1554       is called.
1555
1556       Else the argument must be a dict, the object must have a __dict__, and
1557       the object is updated via
1558
1559           anyobject.__dict__.update(argument)
1560
1561       This may raise RuntimeError in restricted execution mode (which
1562       disallows access to __dict__ directly); in that case, the object
1563       is updated instead via
1564
1565           for k, v in argument.items():
1566               setattr(anyobject, k, v)
1567       """),
1568
1569     I(name='INST',
1570       code='i',
1571       arg=stringnl_noescape_pair,
1572       stack_before=[markobject, stackslice],
1573       stack_after=[anyobject],
1574       proto=0,
1575       doc="""Build a class instance.
1576
1577       This is the protocol 0 version of protocol 1's OBJ opcode.
1578       INST is followed by two newline-terminated strings, giving a
1579       module and class name, just as for the GLOBAL opcode (and see
1580       GLOBAL for more details about that).  self.find_class(module, name)
1581       is used to get a class object.
1582
1583       In addition, all the objects on the stack following the topmost
1584       markobject are gathered into a tuple and popped (along with the
1585       topmost markobject), just as for the TUPLE opcode.
1586
1587       Now it gets complicated.  If all of these are true:
1588
1589         + The argtuple is empty (markobject was at the top of the stack
1590           at the start).
1591
1592         + It's an old-style class object (the type of the class object is
1593           ClassType).
1594
1595         + The class object does not have a __getinitargs__ attribute.
1596
1597       then we want to create an old-style class instance without invoking
1598       its __init__() method (pickle has waffled on this over the years; not
1599       calling __init__() is current wisdom).  In this case, an instance of
1600       an old-style dummy class is created, and then we try to rebind its
1601       __class__ attribute to the desired class object.  If this succeeds,
1602       the new instance object is pushed on the stack, and we're done.  In
1603       restricted execution mode it can fail (assignment to __class__ is
1604       disallowed), and I'm not really sure what happens then -- it looks
1605       like the code ends up calling the class object's __init__ anyway,
1606       via falling into the next case.
1607
1608       Else (the argtuple is not empty, it's not an old-style class object,
1609       or the class object does have a __getinitargs__ attribute), the code
1610       first insists that the class object have a __safe_for_unpickling__
1611       attribute.  Unlike the __safe_for_unpickling__ check in REDUCE,
1612       it doesn't matter whether this attribute has a true or false value, it
1613       only matters whether it exists (XXX this is a bug; cPickle
1614       requires the attribute to be true).  If __safe_for_unpickling__
1615       doesn't exist, UnpicklingError is raised.
1616
1617       Else (the class object does have a __safe_for_unpickling__ attr),
1618       the class object obtained from INST's arguments is applied to the
1619       argtuple obtained from the stack, and the resulting instance object
1620       is pushed on the stack.
1621
1622       NOTE:  checks for __safe_for_unpickling__ went away in Python 2.3.
1623       """),
1624
1625     I(name='OBJ',
1626       code='o',
1627       arg=None,
1628       stack_before=[markobject, anyobject, stackslice],
1629       stack_after=[anyobject],
1630       proto=1,
1631       doc="""Build a class instance.
1632
1633       This is the protocol 1 version of protocol 0's INST opcode, and is
1634       very much like it.  The major difference is that the class object
1635       is taken off the stack, allowing it to be retrieved from the memo
1636       repeatedly if several instances of the same class are created.  This
1637       can be much more efficient (in both time and space) than repeatedly
1638       embedding the module and class names in INST opcodes.
1639
1640       Unlike INST, OBJ takes no arguments from the opcode stream.  Instead
1641       the class object is taken off the stack, immediately above the
1642       topmost markobject:
1643
1644       Stack before: ... markobject classobject stackslice
1645       Stack after:  ... new_instance_object
1646
1647       As for INST, the remainder of the stack above the markobject is
1648       gathered into an argument tuple, and then the logic seems identical,
1649       except that no __safe_for_unpickling__ check is done (XXX this is
1650       a bug; cPickle does test __safe_for_unpickling__).  See INST for
1651       the gory details.
1652
1653       NOTE:  In Python 2.3, INST and OBJ are identical except for how they
1654       get the class object.  That was always the intent; the implementations
1655       had diverged for accidental reasons.
1656       """),
1657
1658     I(name='NEWOBJ',
1659       code='\x81',
1660       arg=None,
1661       stack_before=[anyobject, anyobject],
1662       stack_after=[anyobject],
1663       proto=2,
1664       doc="""Build an object instance.
1665
1666       The stack before should be thought of as containing a class
1667       object followed by an argument tuple (the tuple being the stack
1668       top).  Call these cls and args.  They are popped off the stack,
1669       and the value returned by cls.__new__(cls, *args) is pushed back
1670       onto the stack.
1671       """),
1672
1673     # Machine control.
1674
1675     I(name='PROTO',
1676       code='\x80',
1677       arg=uint1,
1678       stack_before=[],
1679       stack_after=[],
1680       proto=2,
1681       doc="""Protocol version indicator.
1682
1683       For protocol 2 and above, a pickle must start with this opcode.
1684       The argument is the protocol version, an int in range(2, 256).
1685       """),
1686
1687     I(name='STOP',
1688       code='.',
1689       arg=None,
1690       stack_before=[anyobject],
1691       stack_after=[],
1692       proto=0,
1693       doc="""Stop the unpickling machine.
1694
1695       Every pickle ends with this opcode.  The object at the top of the stack
1696       is popped, and that's the result of unpickling.  The stack should be
1697       empty after that pop.
1698       """),
1699
1700     # Ways to deal with persistent IDs.
1701
1702     I(name='PERSID',
1703       code='P',
1704       arg=stringnl_noescape,
1705       stack_before=[],
1706       stack_after=[anyobject],
1707       proto=0,
1708       doc="""Push an object identified by a persistent ID.
1709
1710       The pickle module doesn't define what a persistent ID means.  PERSID's
1711       argument is a newline-terminated str-style (no embedded escapes, no
1712       bracketing quote characters) string, which *is* "the persistent ID".
1713       The unpickler passes this string to self.persistent_load().  Whatever
1714       object that returns is pushed on the stack.  There is no implementation
1715       of persistent_load() in Python's unpickler:  it must be supplied by an
1716       unpickler subclass.
1717       """),
1718
1719     I(name='BINPERSID',
1720       code='Q',
1721       arg=None,
1722       stack_before=[anyobject],
1723       stack_after=[anyobject],
1724       proto=1,
1725       doc="""Push an object identified by a persistent ID.
1726
1727       Like PERSID, except the persistent ID is popped off the stack (instead
1728       of being a string embedded in the opcode bytestream).  The persistent
1729       ID is passed to self.persistent_load(), and whatever object that
1730       returns is pushed on the stack.  See PERSID for more detail.
1731       """),
1732 ]
1733 del I
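
# Illustration (not part of the original module): a minimal sketch of the
# persistent-ID mechanism described for PERSID and BINPERSID above, written
# for Python 3's pickle module.  Record, REGISTRY, RegistryPickler and
# RegistryUnpickler are hypothetical names invented for this example; only
# persistent_id() and persistent_load() are real pickle hooks.

```python
import io
import pickle
import pickletools


class Record:
    """Hypothetical object that lives outside the pickle."""
    def __init__(self, key):
        self.key = key


# Hypothetical external store, keyed by persistent ID.
REGISTRY = {"rec-1": Record("rec-1")}


class RegistryPickler(pickle.Pickler):
    def persistent_id(self, obj):
        # Returning a non-None value makes the pickler emit a persistent ID
        # instead of pickling the object itself.
        if isinstance(obj, Record):
            return obj.key
        return None


class RegistryUnpickler(pickle.Unpickler):
    def persistent_load(self, pid):
        # The unpickler calls this with the persistent ID; whatever is
        # returned is pushed on the stack.
        return REGISTRY[pid]


buf = io.BytesIO()
RegistryPickler(buf, protocol=2).dump(["payload", REGISTRY["rec-1"]])

# A binary protocol pushes the ID on the stack and then uses BINPERSID.
opcode_names = [op.name for op, arg, pos in pickletools.genops(buf.getvalue())]

buf.seek(0)
restored = RegistryUnpickler(buf).load()
```

# After this runs, restored[1] is the very object stored in REGISTRY, since
# it was looked up by ID rather than rebuilt from pickled state.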
1734
1735 # Verify uniqueness of .name and .code members.
1736 name2i = {}
1737 code2i = {}
1738
1739 for i, d in enumerate(opcodes):
1740     if d.name in name2i:
1741         raise ValueError("repeated name %r at indices %d and %d" %
1742                          (d.name, name2i[d.name], i))
1743     if d.code in code2i:
1744         raise ValueError("repeated code %r at indices %d and %d" %
1745                          (d.code, code2i[d.code], i))
1746
1747     name2i[d.name] = i
1748     code2i[d.code] = i
1749
1750 del name2i, code2i, i, d
1751
1752 ##############################################################################
1753 # Build a code2op dict, mapping opcode characters to OpcodeInfo records.
1754 # Also ensure we've got the same stuff as pickle.py, although the
1755 # introspection here is dicey.
1756
1757 code2op = {}
1758 for d in opcodes:
1759     code2op[d.code] = d
1760 del d
1761
1762 def assure_pickle_consistency(verbose=False):
1763     import pickle, re
1764
1765     copy = code2op.copy()
1766     for name in pickle.__all__:
1767         if not re.match("[A-Z][A-Z0-9_]+$", name):
1768             if verbose:
1769                 print "skipping %r: it doesn't look like an opcode name" % name
1770             continue
1771         picklecode = getattr(pickle, name)
1772         if not isinstance(picklecode, str) or len(picklecode) != 1:
1773             if verbose:
1774                 print ("skipping %r: value %r doesn't look like a pickle "
1775                        "code" % (name, picklecode))
1776             continue
1777         if picklecode in copy:
1778             if verbose:
1779                 print "checking name %r w/ code %r for consistency" % (
1780                       name, picklecode)
1781             d = copy[picklecode]
1782             if d.name != name:
1783                 raise ValueError("for pickle code %r, pickle.py uses name %r "
1784                                  "but we're using name %r" % (picklecode,
1785                                                               name,
1786                                                               d.name))
1787             # Forget this one.  Any left over in copy at the end are a problem
1788             # of a different kind.
1789             del copy[picklecode]
1790         else:
1791             raise ValueError("pickle.py appears to have a pickle opcode with "
1792                              "name %r and code %r, but we don't" %
1793                              (name, picklecode))
1794     if copy:
1795         msg = ["we appear to have pickle opcodes that pickle.py doesn't have:"]
1796         for code, d in copy.items():
1797             msg.append("    name %r with code %r" % (d.name, code))
1798         raise ValueError("\n".join(msg))
1799
1800 assure_pickle_consistency()
1801 del assure_pickle_consistency
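
# Illustration (not part of the original module): a small sketch of looking
# up opcode records through the code2op table built above.  It assumes the
# pickletools shipped with Python 3, where code2op is still a module-level
# (undocumented) dict keyed by one-character strings.

```python
import pickletools

# Map an opcode character to its OpcodeInfo record.
stop = pickletools.code2op['.']
proto = pickletools.code2op['\x80']

# Each record carries the fields passed to I(...) in the table above.
stop_summary = (stop.name, stop.proto, len(stop.stack_before), len(stop.stack_after))
proto_summary = (proto.name, proto.proto)

# The uniqueness checks above imply one record per name and per code.
names = [op.name for op in pickletools.opcodes]
codes = [op.code for op in pickletools.opcodes]
all_unique = len(set(names)) == len(names) and len(set(codes)) == len(codes)
```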
1802
1803 ##############################################################################
1804 # A pickle opcode generator.
1805
1806 def genops(pickle):
1807     """Generate all the opcodes in a pickle.
1808
1809     'pickle' is a file-like object, or string, containing the pickle.
1810
1811     Each opcode in the pickle is generated, from the current pickle position,
1812     stopping after a STOP opcode is delivered.  A triple is generated for
1813     each opcode:
1814
1815         opcode, arg, pos
1816
1817     opcode is an OpcodeInfo record, describing the current opcode.
1818
1819     If the opcode has an argument embedded in the pickle, arg is its decoded
1820     value, as a Python object.  If the opcode doesn't have an argument, arg
1821     is None.
1822
1823     If the pickle has a tell() method, pos was the value of pickle.tell()
1824     before reading the current opcode.  If the pickle is a string object,
1825     it's wrapped in a StringIO object, and the latter's tell() result is
1826     used.  Else (the pickle doesn't have a tell(), and it's not obvious how
1827     to query its current position) pos is None.
1828     """
1829
1830     import cStringIO as StringIO
1831
1832     if isinstance(pickle, str):
1833         pickle = StringIO.StringIO(pickle)
1834
1835     if hasattr(pickle, "tell"):
1836         getpos = pickle.tell
1837     else:
1838         getpos = lambda: None
1839
1840     while True:
1841         pos = getpos()
1842         code = pickle.read(1)
1843         opcode = code2op.get(code)
1844         if opcode is None:
1845             if code == "":
1846                 raise ValueError("pickle exhausted before seeing STOP")
1847             else:
1848                 raise ValueError("at position %s, opcode %r unknown" % (
1849                                  pos is None and "<unknown>" or pos,
1850                                  code))
1851         if opcode.arg is None:
1852             arg = None
1853         else:
1854             arg = opcode.arg.reader(pickle)
1855         yield opcode, arg, pos
1856         if code == '.':
1857             assert opcode.name == 'STOP'
1858             break
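
# Illustration (not part of the original module): a minimal usage sketch for
# genops(), assuming Python 3's pickletools, which accepts a bytes object in
# place of the str handled by the code above.

```python
import pickle
import pickletools

data = pickle.dumps([1, 2, (3, 4)], protocol=2)

# Each item is an (OpcodeInfo, arg, pos) triple; arg is None when the
# opcode has no argument embedded in the pickle.
ops = [(op.name, arg, pos) for op, arg, pos in pickletools.genops(data)]

first_op = ops[0]       # the PROTO opcode, its argument, and its offset
last_name = ops[-1][0]  # generation stops after STOP is delivered

# The "protocol identifier" idea from the module comments: the highest
# .proto value among the opcodes actually used.
highest_proto = max(op.proto for op, arg, pos in pickletools.genops(data))
```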
1859
1860 ##############################################################################
1861 # A symbolic pickle disassembler.
1862
1863 def dis(pickle, out=None, memo=None, indentlevel=4):
1864     """Produce a symbolic disassembly of a pickle.
1865
1866     'pickle' is a file-like object, or string, containing at least one
1867     pickle.  The pickle is disassembled from the current position, through
1868     the first STOP opcode encountered.
1869
1870     Optional arg 'out' is a file-like object to which the disassembly is
1871     printed.  It defaults to sys.stdout.
1872
1873     Optional arg 'memo' is a Python dict, used as the pickle's memo.  It
1874     may be mutated by dis() if the pickle contains PUT, BINPUT, or LONG_BINPUT
1875     opcodes.  Passing the same memo object to another dis() call then allows
1876     disassembly to proceed across multiple pickles that were all created by
1877     the same pickler with the same memo.  Ordinarily you can ignore this.
1878
1879     Optional arg indentlevel is the number of blanks by which to indent
1880     a new MARK level.  It defaults to 4.
1881
1882     In addition to printing the disassembly, some sanity checks are made:
1883
1884     + All embedded opcode arguments "make sense".
1885
1886     + Explicit and implicit pop operations have enough items on the stack.
1887
1888     + When an opcode implicitly refers to a markobject, a markobject is
1889       actually on the stack.
1890
1891     + A memo entry isn't referenced before it's defined.
1892
1893     + The markobject isn't stored in the memo.
1894
1895     + A memo entry isn't redefined.
1896     """
1897
1898     # Most of the hair here is for sanity checks, but most of it is needed
1899     # anyway to detect when a protocol 0 POP takes a MARK off the stack
1900     # (which in turn is needed to indent MARK blocks correctly).
1901
1902     stack = []          # crude emulation of unpickler stack
1903     if memo is None:
1904         memo = {}       # crude emulation of unpickler memo
1905     maxproto = -1       # max protocol number seen
1906     markstack = []      # bytecode positions of MARK opcodes
1907     indentchunk = ' ' * indentlevel
1908     errormsg = None
1909     for opcode, arg, pos in genops(pickle):
1910         if pos is not None:
1911             print >> out, "%5d:" % pos,
1912
1913         line = "%-4s %s%s" % (repr(opcode.code)[1:-1],
1914                               indentchunk * len(markstack),
1915                               opcode.name)
1916
1917         maxproto = max(maxproto, opcode.proto)
1918         before = opcode.stack_before    # don't mutate
1919         after = opcode.stack_after      # don't mutate
1920         numtopop = len(before)
1921
1922         # See whether a MARK should be popped.
1923         markmsg = None
1924         if markobject in before or (opcode.name == "POP" and
1925                                     stack and
1926                                     stack[-1] is markobject):
1927             assert markobject not in after
1928             if __debug__:
1929                 if markobject in before:
1930                     assert before[-1] is stackslice
1931             if markstack:
1932                 markpos = markstack.pop()
1933                 if markpos is None:
1934                     markmsg = "(MARK at unknown opcode offset)"
1935                 else:
1936                     markmsg = "(MARK at %d)" % markpos
1937                 # Pop everything at and after the topmost markobject.
1938                 while stack[-1] is not markobject:
1939                     stack.pop()
1940                 stack.pop()
1941                 # Stop later code from popping too much.
1942                 try:
1943                     numtopop = before.index(markobject)
1944                 except ValueError:
1945                     assert opcode.name == "POP"
1946                     numtopop = 0
1947             else:
1948                 errormsg = markmsg = "no MARK exists on stack"
1949
1950         # Check for correct memo usage.
1951         if opcode.name in ("PUT", "BINPUT", "LONG_BINPUT"):
1952             assert arg is not None
1953             if arg in memo:
1954                 errormsg = "memo key %r already defined" % arg
1955             elif not stack:
1956                 errormsg = "stack is empty -- can't store into memo"
1957             elif stack[-1] is markobject:
1958                 errormsg = "can't store markobject in the memo"
1959             else:
1960                 memo[arg] = stack[-1]
1961
1962         elif opcode.name in ("GET", "BINGET", "LONG_BINGET"):
1963             if arg in memo:
1964                 assert len(after) == 1
1965                 after = [memo[arg]]     # for better stack emulation
1966             else:
1967                 errormsg = "memo key %r has never been stored into" % arg
1968
1969         if arg is not None or markmsg:
1970             # make a mild effort to align arguments
1971             line += ' ' * (10 - len(opcode.name))
1972             if arg is not None:
1973                 line += ' ' + repr(arg)
1974             if markmsg:
1975                 line += ' ' + markmsg
1976         print >> out, line
1977
1978         if errormsg:
1979             # Note that we delayed complaining until the offending opcode
1980             # was printed.
1981             raise ValueError(errormsg)
1982
1983         # Emulate the stack effects.
1984         if len(stack) < numtopop:
1985             raise ValueError("tries to pop %d items from stack with "
1986                              "only %d items" % (numtopop, len(stack)))
1987         if numtopop:
1988             del stack[-numtopop:]
1989         if markobject in after:
1990             assert markobject not in before
1991             markstack.append(pos)
1992
1993         stack.extend(after)
1994
1995     print >> out, "highest protocol among opcodes =", maxproto
1996     if stack:
1997         raise ValueError("stack not empty after STOP: %r" % stack)
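
# Illustration (not part of the original module): a sketch of the 'memo'
# argument and of one sanity check, assuming Python 3's pickletools.  The
# output is captured in StringIO objects instead of going to sys.stdout.

```python
import io
import pickle
import pickletools

# Two pickles written by the same pickler share its memo, so the second
# dump refers back to the list stored by the first.
stream = io.BytesIO()
pickler = pickle.Pickler(stream, protocol=2)
shared = [1, 2, 3]
pickler.dump(shared)
pickler.dump(shared)
stream.seek(0)

memo = {}
first, second = io.StringIO(), io.StringIO()
pickletools.dis(stream, out=first, memo=memo)   # fills memo via BINPUT
pickletools.dis(stream, out=second, memo=memo)  # resolves BINGET from memo

# A hand-written pickle that reads memo key 0 before storing it: dis()
# prints the offending opcode and then raises ValueError.
problem = None
try:
    pickletools.dis(b"g0\n.", out=io.StringIO())
except ValueError as exc:
    problem = str(exc)
```

# Passing a fresh dict to the second dis() call would make it fail the same
# way, since its BINGET would refer to a memo key that dict never received.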
1998
1999 # For use in the doctest, simply as an example of a class to pickle.
2000 class _Example:
2001     def __init__(self, value):
2002         self.value = value
2003
2004 _dis_test = r"""
2005 >>> import pickle
2006 >>> x = [1, 2, (3, 4), {'abc': u"def"}]
2007 >>> pkl = pickle.dumps(x, 0)
2008 >>> dis(pkl)
2009     0: (    MARK
2010     1: l        LIST       (MARK at 0)
2011     2: p    PUT        0
2012     5: I    INT        1
2013     8: a    APPEND
2014     9: I    INT        2
2015    12: a    APPEND
2016    13: (    MARK
2017    14: I        INT        3
2018    17: I        INT        4
2019    20: t        TUPLE      (MARK at 13)
2020    21: p    PUT        1
2021    24: a    APPEND
2022    25: (    MARK
2023    26: d        DICT       (MARK at 25)
2024    27: p    PUT        2
2025    30: S    STRING     'abc'
2026    37: p    PUT        3
2027    40: V    UNICODE    u'def'
2028    45: p    PUT        4
2029    48: s    SETITEM
2030    49: a    APPEND
2031    50: .    STOP
2032 highest protocol among opcodes = 0
2033
2034 Try again with a "binary" pickle.
2035
2036 >>> pkl = pickle.dumps(x, 1)
2037 >>> dis(pkl)
2038     0: ]    EMPTY_LIST
2039     1: q    BINPUT     0
2040     3: (    MARK
2041     4: K        BININT1    1
2042     6: K        BININT1    2
2043     8: (        MARK
2044     9: K            BININT1    3
2045    11: K            BININT1    4
2046    13: t            TUPLE      (MARK at 8)
2047    14: q        BINPUT     1
2048    16: }        EMPTY_DICT
2049    17: q        BINPUT     2
2050    19: U        SHORT_BINSTRING 'abc'
2051    24: q        BINPUT     3
2052    26: X        BINUNICODE u'def'
2053    34: q        BINPUT     4
2054    36: s        SETITEM
2055    37: e        APPENDS    (MARK at 3)
2056    38: .    STOP
2057 highest protocol among opcodes = 1
2058
2059 Exercise the INST/OBJ/BUILD family.
2060
2061 >>> import random
2062 >>> dis(pickle.dumps(random.random, 0))
2063     0: c    GLOBAL     'random random'
2064    15: p    PUT        0
2065    18: .    STOP
2066 highest protocol among opcodes = 0
2067
2068 >>> from pickletools import _Example
2069 >>> x = [_Example(42)] * 2
2070 >>> dis(pickle.dumps(x, 0))
2071     0: (    MARK
2072     1: l        LIST       (MARK at 0)
2073     2: p    PUT        0
2074     5: (    MARK
2075     6: i        INST       'pickletools _Example' (MARK at 5)
2076    28: p    PUT        1
2077    31: (    MARK
2078    32: d        DICT       (MARK at 31)
2079    33: p    PUT        2
2080    36: S    STRING     'value'
2081    45: p    PUT        3
2082    48: I    INT        42
2083    52: s    SETITEM
2084    53: b    BUILD
2085    54: a    APPEND
2086    55: g    GET        1
2087    58: a    APPEND
2088    59: .    STOP
2089 highest protocol among opcodes = 0
2090
2091 >>> dis(pickle.dumps(x, 1))
2092     0: ]    EMPTY_LIST
2093     1: q    BINPUT     0
2094     3: (    MARK
2095     4: (        MARK
2096     5: c            GLOBAL     'pickletools _Example'
2097    27: q            BINPUT     1
2098    29: o            OBJ        (MARK at 4)
2099    30: q        BINPUT     2
2100    32: }        EMPTY_DICT
2101    33: q        BINPUT     3
2102    35: U        SHORT_BINSTRING 'value'
2103    42: q        BINPUT     4
2104    44: K        BININT1    42
2105    46: s        SETITEM
2106    47: b        BUILD
2107    48: h        BINGET     2
2108    50: e        APPENDS    (MARK at 3)
2109    51: .    STOP
2110 highest protocol among opcodes = 1
2111
2112 Try "the canonical" recursive-object test.
2113
2114 >>> L = []
2115 >>> T = L,
2116 >>> L.append(T)
2117 >>> L[0] is T
2118 True
2119 >>> T[0] is L
2120 True
2121 >>> L[0][0] is L
2122 True
2123 >>> T[0][0] is T
2124 True
2125 >>> dis(pickle.dumps(L, 0))
2126     0: (    MARK
2127     1: l        LIST       (MARK at 0)
2128     2: p    PUT        0
2129     5: (    MARK
2130     6: g        GET        0
2131     9: t        TUPLE      (MARK at 5)
2132    10: p    PUT        1
2133    13: a    APPEND
2134    14: .    STOP
2135 highest protocol among opcodes = 0
2136
2137 >>> dis(pickle.dumps(L, 1))
2138     0: ]    EMPTY_LIST
2139     1: q    BINPUT     0
2140     3: (    MARK
2141     4: h        BINGET     0
2142     6: t        TUPLE      (MARK at 3)
2143     7: q    BINPUT     1
2144     9: a    APPEND
2145    10: .    STOP
2146 highest protocol among opcodes = 1
2147
2148 Note that, in the protocol 0 pickle of the recursive tuple, the disassembler
2149 has to emulate the stack in order to realize that the POP opcode at 16 gets
2150 rid of the MARK at 0.
2151
2152 >>> dis(pickle.dumps(T, 0))
2153     0: (    MARK
2154     1: (        MARK
2155     2: l            LIST       (MARK at 1)
2156     3: p        PUT        0
2157     6: (        MARK
2158     7: g            GET        0
2159    10: t            TUPLE      (MARK at 6)
2160    11: p        PUT        1
2161    14: a        APPEND
2162    15: 0        POP
2163    16: 0        POP        (MARK at 0)
2164    17: g    GET        1
2165    20: .    STOP
2166 highest protocol among opcodes = 0
2167
2168 >>> dis(pickle.dumps(T, 1))
2169     0: (    MARK
2170     1: ]        EMPTY_LIST
2171     2: q        BINPUT     0
2172     4: (        MARK
2173     5: h            BINGET     0
2174     7: t            TUPLE      (MARK at 4)
2175     8: q        BINPUT     1
2176    10: a        APPEND
2177    11: 1        POP_MARK   (MARK at 0)
2178    12: h    BINGET     1
2179    14: .    STOP
2180 highest protocol among opcodes = 1
2181
2182 Try protocol 2.
2183
2184 >>> dis(pickle.dumps(L, 2))
2185     0: \x80 PROTO      2
2186     2: ]    EMPTY_LIST
2187     3: q    BINPUT     0
2188     5: h    BINGET     0
2189     7: \x85 TUPLE1
2190     8: q    BINPUT     1
2191    10: a    APPEND
2192    11: .    STOP
2193 highest protocol among opcodes = 2
2194
2195 >>> dis(pickle.dumps(T, 2))
2196     0: \x80 PROTO      2
2197     2: ]    EMPTY_LIST
2198     3: q    BINPUT     0
2199     5: h    BINGET     0
2200     7: \x85 TUPLE1
2201     8: q    BINPUT     1
2202    10: a    APPEND
2203    11: 0    POP
2204    12: h    BINGET     1
2205    14: .    STOP
2206 highest protocol among opcodes = 2
2207 """
2208
2209 _memo_test = r"""
2210 >>> import pickle
2211 >>> from StringIO import StringIO
2212 >>> f = StringIO()
2213 >>> p = pickle.Pickler(f, 2)
2214 >>> x = [1, 2, 3]
2215 >>> p.dump(x)
2216 >>> p.dump(x)
2217 >>> f.seek(0)
2218 >>> memo = {}
2219 >>> dis(f, memo=memo)
2220     0: \x80 PROTO      2
2221     2: ]    EMPTY_LIST
2222     3: q    BINPUT     0
2223     5: (    MARK
2224     6: K        BININT1    1
2225     8: K        BININT1    2
2226    10: K        BININT1    3
2227    12: e        APPENDS    (MARK at 5)
2228    13: .    STOP
2229 highest protocol among opcodes = 2
2230 >>> dis(f, memo=memo)
2231    14: \x80 PROTO      2
2232    16: h    BINGET     0
2233    18: .    STOP
2234 highest protocol among opcodes = 2
2235 """
2236
2237 __test__ = {'disassembler_test': _dis_test,
2238             'disassembler_memo_test': _memo_test,
2239            }
2240
2241 def _test():
2242     import doctest
2243     return doctest.testmod()
2244
2245 if __name__ == "__main__":
2246     _test()