Lib/pickletools.py

   1 '''"Executable documentation" for the pickle module.
   2
   3 Extensive comments about the pickle protocols and pickle-machine opcodes
   4 can be found here.  Some functions meant for external use:
   5
   6 genops(pickle)
   7    Generate all the opcodes in a pickle, as (opcode, arg, position) triples.
   8
   9 dis(pickle, out=None, memo=None, indentlevel=4)
  10    Print a symbolic disassembly of a pickle.
  11 '''
  12
  13 import codecs
  14 import pickle
  15 import re
  16
  17 __all__ = ['dis', 'genops', 'optimize']
  18
  19 bytes_types = pickle.bytes_types
  20
  21 # Other ideas:
  22 #
  23 # - A pickle verifier:  read a pickle and check it exhaustively for
  24 #   well-formedness.  dis() does a lot of this already.
  25 #
  26 # - A protocol identifier:  examine a pickle and return its protocol number
  27 #   (== the highest .proto attr value among all the opcodes in the pickle).
  28 #   dis() already prints this info at the end.
  29 #
  30 # - A pickle optimizer:  for example, tuple-building code is sometimes more
  31 #   elaborate than necessary, catering for the possibility that the tuple
  32 #   is recursive.  Or lots of times a PUT is generated that's never accessed
  33 #   by a later GET.
  34
  35
  36 """
  37 "A pickle" is a program for a virtual pickle machine (PM, but more accurately
  38 called an unpickling machine).  It's a sequence of opcodes, interpreted by the
  39 PM, building an arbitrarily complex Python object.
  40
  41 For the most part, the PM is very simple:  there are no looping, testing, or
  42 conditional instructions, no arithmetic and no function calls.  Opcodes are
  43 executed once each, from first to last, until a STOP opcode is reached.
  44
  45 The PM has two data areas, "the stack" and "the memo".
  46
  47 Many opcodes push Python objects onto the stack; e.g., INT pushes a Python
  48 integer object on the stack, whose value is gotten from a decimal string
  49 literal immediately following the INT opcode in the pickle bytestream.  Other
  50 opcodes take Python objects off the stack.  The result of unpickling is
  51 whatever object is left on the stack when the final STOP opcode is executed.
  52
  53 The memo is simply an array of objects, or it can be implemented as a dict
  54 mapping little integers to objects.  The memo serves as the PM's "long term
  55 memory", and the little integers indexing the memo are akin to variable
  56 names.  Some opcodes pop a stack object into the memo at a given index,
  57 and others push a memo object at a given index onto the stack again.
  58
  59 At heart, that's all the PM has.  Subtleties arise for these reasons:
  60
  61 + Object identity.  Objects can be arbitrarily complex, and subobjects
  62   may be shared (for example, the list [a, a] refers to the same object a
  63   twice).  It can be vital that unpickling recreate an isomorphic object
  64   graph, faithfully reproducing sharing.
  65
  66 + Recursive objects.  For example, after "L = []; L.append(L)", L is a
  67   list, and L[0] is the same list.  This is related to the object identity
  68   point, and some sequences of pickle opcodes are subtle in order to
  69   get the right result in all cases.
  70
  71 + Things pickle doesn't know everything about.  Examples of things pickle
  72   does know everything about are Python's builtin scalar and container
  73   types, like ints and tuples.  They generally have opcodes dedicated to
  74   them.  For things like module references and instances of user-defined
  75   classes, pickle's knowledge is limited.  Historically, many enhancements
  76   have been made to the pickle protocol in order to do a better (faster,
  77   and/or more compact) job on those.
  78
  79 + Backward compatibility and micro-optimization.  As explained below,
  80   pickle opcodes never go away, not even when better ways to do a thing
  81   get invented.  The repertoire of the PM just keeps growing over time.
  82   For example, protocol 0 had two opcodes for building Python integers (INT
  83   and LONG), protocol 1 added three more for more-efficient pickling of short
  84   integers, and protocol 2 added two more for more-efficient pickling of
  85   long integers (before protocol 2, the only ways to pickle a Python long
  86   took time quadratic in the number of digits, for both pickling and
  87   unpickling).  "Opcode bloat" isn't so much a subtlety as a source of
  88   wearying complication.
  89
  90
  91 Pickle protocols:
  92
  93 For compatibility, the meaning of a pickle opcode never changes.  Instead new
  94 pickle opcodes get added, and each version's unpickler can handle all the
  95 pickle opcodes in all protocol versions to date.  So old pickles continue to
  96 be readable forever.  The pickler can generally be told to restrict itself to
  97 the subset of opcodes available under previous protocol versions too, so that
  98 users can create pickles under the current version readable by older
  99 versions.  However, a pickle does not contain its version number embedded
 100 within it.  If an older unpickler tries to read a pickle using a later
 101 protocol, the result is most likely an exception due to seeing an unknown (in
 102 the older unpickler) opcode.
 103
 104 The original pickle used what's now called "protocol 0", and what was called
 105 "text mode" before Python 2.3.  The entire pickle bytestream is made up of
 106 printable 7-bit ASCII characters, plus the newline character, in protocol 0.
 107 That's why it was called text mode.  Protocol 0 is small and elegant, but
 108 sometimes painfully inefficient.
 109
 110 The second major set of additions is now called "protocol 1", and was called
 111 "binary mode" before Python 2.3.  This added many opcodes with arguments
 112 consisting of arbitrary bytes, including NUL bytes and unprintable "high bit"
 113 bytes.  Binary mode pickles can be substantially smaller than equivalent
 114 text mode pickles, and sometimes faster too; e.g., BININT represents a 4-byte
 115 int as 4 bytes following the opcode, which is cheaper to unpickle than the
 116 (perhaps) 11-character decimal string attached to INT.  Protocol 1 also added
 117 a number of opcodes that operate on many stack elements at once (like APPENDS
 118 and SETITEMS), and "shortcut" opcodes (like EMPTY_DICT and EMPTY_TUPLE).
 119
 120 The third major set of additions came in Python 2.3, and is called "protocol
 121 2".  This added:
 122
 123 - A better way to pickle instances of new-style classes (NEWOBJ).
 124
 125 - A way for a pickle to identify its protocol (PROTO).
 126
 127 - Time- and space- efficient pickling of long ints (LONG{1,4}).
 128
 129 - Shortcuts for small tuples (TUPLE{1,2,3}}.
 130
 131 - Dedicated opcodes for bools (NEWTRUE, NEWFALSE).
 132
 133 - The "extension registry", a vector of popular objects that can be pushed
 134   efficiently by index (EXT{1,2,4}).  This is akin to the memo and GET, but
 135   the registry contents are predefined (there's nothing akin to the memo's
 136   PUT).
 137
 138 Another independent change with Python 2.3 is the abandonment of any
 139 pretense that it might be safe to load pickles received from untrusted
 140 parties -- no sufficient security analysis has been done to guarantee
 141 this and there isn't a use case that warrants the expense of such an
 142 analysis.
 143
 144 To this end, all tests for __safe_for_unpickling__ or for
 145 copyreg.safe_constructors are removed from the unpickling code.
 146 References to these variables in the descriptions below are to be seen
 147 as describing unpickling in Python 2.2 and before.
 148 """
 149
 150 # Meta-rule:  Descriptions are stored in instances of descriptor objects,
 151 # with plain constructors.  No meta-language is defined from which
 152 # descriptors could be constructed.  If you want, e.g., XML, write a little
 153 # program to generate XML from the objects.
 154
 155 ##############################################################################
 156 # Some pickle opcodes have an argument, following the opcode in the
 157 # bytestream.  An argument is of a specific type, described by an instance
 158 # of ArgumentDescriptor.  These are not to be confused with arguments taken
 159 # off the stack -- ArgumentDescriptor applies only to arguments embedded in
 160 # the opcode stream, immediately following an opcode.
 161
 162 # Represents the number of bytes consumed by an argument delimited by the
 163 # next newline character.
 164 UP_TO_NEWLINE = -1
 165
 166 # Represents the number of bytes consumed by a two-argument opcode where
 167 # the first argument gives the number of bytes in the second argument.
 168 TAKEN_FROM_ARGUMENT1 = -2   # num bytes is 1-byte unsigned int
 169 TAKEN_FROM_ARGUMENT4 = -3   # num bytes is 4-byte signed little-endian int
 170
 171 class ArgumentDescriptor(object):
 172     __slots__ = (
 173         # name of descriptor record, also a module global name; a string
 174         'name',
 175
 176         # length of argument, in bytes; an int; UP_TO_NEWLINE and
 177         # TAKEN_FROM_ARGUMENT{1,4} are negative values for variable-length
 178         # cases
 179         'n',
 180
 181         # a function taking a file-like object, reading this kind of argument
 182         # from the object at the current position, advancing the current
 183         # position by n bytes, and returning the value of the argument
 184         'reader',
 185
 186         # human-readable docs for this arg descriptor; a string
 187         'doc',
 188     )
 189
 190     def __init__(self, name, n, reader, doc):
 191         assert isinstance(name, str)
 192         self.name = name
 193
 194         assert isinstance(n, int) and (n >= 0 or
 195                                        n in (UP_TO_NEWLINE,
 196                                              TAKEN_FROM_ARGUMENT1,
 197                                              TAKEN_FROM_ARGUMENT4))
 198         self.n = n
 199
 200         self.reader = reader
 201
 202         assert isinstance(doc, str)
 203         self.doc = doc
 204
 205 from struct import unpack as _unpack
 206
 207 def read_uint1(f):
 208     r"""
 209     >>> import io
 210     >>> read_uint1(io.BytesIO(b'\xff'))
 211     255
 212     """
 213
 214     data = f.read(1)
 215     if data:
 216         return data[0]
 217     raise ValueError("not enough data in stream to read uint1")
 218
 219 uint1 = ArgumentDescriptor(
 220             name='uint1',
 221             n=1,
 222             reader=read_uint1,
 223             doc="One-byte unsigned integer.")
 224
 225
 226 def read_uint2(f):
 227     r"""
 228     >>> import io
 229     >>> read_uint2(io.BytesIO(b'\xff\x00'))
 230     255
 231     >>> read_uint2(io.BytesIO(b'\xff\xff'))
 232     65535
 233     """
 234
 235     data = f.read(2)
 236     if len(data) == 2:
 237         return _unpack("<H", data)[0]
 238     raise ValueError("not enough data in stream to read uint2")
 239
 240 uint2 = ArgumentDescriptor(
 241             name='uint2',
 242             n=2,
 243             reader=read_uint2,
 244             doc="Two-byte unsigned integer, little-endian.")
 245
 246
 247 def read_int4(f):
 248     r"""
 249     >>> import io
 250     >>> read_int4(io.BytesIO(b'\xff\x00\x00\x00'))
 251     255
 252     >>> read_int4(io.BytesIO(b'\x00\x00\x00\x80')) == -(2**31)
 253     True
 254     """
 255
 256     data = f.read(4)
 257     if len(data) == 4:
 258         return _unpack("<i", data)[0]
 259     raise ValueError("not enough data in stream to read int4")
 260
 261 int4 = ArgumentDescriptor(
 262            name='int4',
 263            n=4,
 264            reader=read_int4,
 265            doc="Four-byte signed integer, little-endian, 2's complement.")
 266
 267
 268 def read_stringnl(f, decode=True, stripquotes=True):
 269     r"""
 270     >>> import io
 271     >>> read_stringnl(io.BytesIO(b"'abcd'\nefg\n"))
 272     'abcd'
 273
 274     >>> read_stringnl(io.BytesIO(b"\n"))
 275     Traceback (most recent call last):
 276     ...
 277     ValueError: no string quotes around b''
 278
 279     >>> read_stringnl(io.BytesIO(b"\n"), stripquotes=False)
 280     ''
 281
 282     >>> read_stringnl(io.BytesIO(b"''\n"))
 283     ''
 284
 285     >>> read_stringnl(io.BytesIO(b'"abcd"'))
 286     Traceback (most recent call last):
 287     ...
 288     ValueError: no newline found when trying to read stringnl
 289
 290     Embedded escapes are undone in the result.
 291     >>> read_stringnl(io.BytesIO(br"'a\n\\b\x00c\td'" + b"\n'e'"))
 292     'a\n\\b\x00c\td'
 293     """
 294
 295     data = f.readline()
 296     if not data.endswith(b'\n'):
 297         raise ValueError("no newline found when trying to read stringnl")
 298     data = data[:-1]    # lose the newline
 299
 300     if stripquotes:
 301         for q in (b'"', b"'"):
 302             if data.startswith(q):
 303                 if not data.endswith(q):
 304                     raise ValueError("strinq quote %r not found at both "
 305                                      "ends of %r" % (q, data))
 306                 data = data[1:-1]
 307                 break
 308         else:
 309             raise ValueError("no string quotes around %r" % data)
 310
 311     if decode:
 312         data = codecs.escape_decode(data)[0].decode("ascii")
 313     return data
 314
 315 stringnl = ArgumentDescriptor(
 316                name='stringnl',
 317                n=UP_TO_NEWLINE,
 318                reader=read_stringnl,
 319                doc="""A newline-terminated string.
 320
 321                    This is a repr-style string, with embedded escapes, and
 322                    bracketing quotes.
 323                    """)
 324
 325 def read_stringnl_noescape(f):
 326     return read_stringnl(f, stripquotes=False)
 327
 328 stringnl_noescape = ArgumentDescriptor(
 329                         name='stringnl_noescape',
 330                         n=UP_TO_NEWLINE,
 331                         reader=read_stringnl_noescape,
 332                         doc="""A newline-terminated string.
 333
 334                         This is a str-style string, without embedded escapes,
 335                         or bracketing quotes.  It should consist solely of
 336                         printable ASCII characters.
 337                         """)
 338
 339 def read_stringnl_noescape_pair(f):
 340     r"""
 341     >>> import io
 342     >>> read_stringnl_noescape_pair(io.BytesIO(b"Queue\nEmpty\njunk"))
 343     'Queue Empty'
 344     """
 345
 346     return "%s %s" % (read_stringnl_noescape(f), read_stringnl_noescape(f))
 347
 348 stringnl_noescape_pair = ArgumentDescriptor(
 349                              name='stringnl_noescape_pair',
 350                              n=UP_TO_NEWLINE,
 351                              reader=read_stringnl_noescape_pair,
 352                              doc="""A pair of newline-terminated strings.
 353
 354                              These are str-style strings, without embedded
 355                              escapes, or bracketing quotes.  They should
 356                              consist solely of printable ASCII characters.
 357                              The pair is returned as a single string, with
 358                              a single blank separating the two strings.
 359                              """)
 360
 361 def read_string4(f):
 362     r"""
 363     >>> import io
 364     >>> read_string4(io.BytesIO(b"\x00\x00\x00\x00abc"))
 365     ''
 366     >>> read_string4(io.BytesIO(b"\x03\x00\x00\x00abcdef"))
 367     'abc'
 368     >>> read_string4(io.BytesIO(b"\x00\x00\x00\x03abcdef"))
 369     Traceback (most recent call last):
 370     ...
 371     ValueError: expected 50331648 bytes in a string4, but only 6 remain
 372     """
 373
 374     n = read_int4(f)
 375     if n < 0:
 376         raise ValueError("string4 byte count < 0: %d" % n)
 377     data = f.read(n)
 378     if len(data) == n:
 379         return data.decode("latin-1")
 380     raise ValueError("expected %d bytes in a string4, but only %d remain" %
 381                      (n, len(data)))
 382
 383 string4 = ArgumentDescriptor(
 384               name="string4",
 385               n=TAKEN_FROM_ARGUMENT4,
 386               reader=read_string4,
 387               doc="""A counted string.
 388
 389               The first argument is a 4-byte little-endian signed int giving
 390               the number of bytes in the string, and the second argument is
 391               that many bytes.
 392               """)
 393
 394
 395 def read_string1(f):
 396     r"""
 397     >>> import io
 398     >>> read_string1(io.BytesIO(b"\x00"))
 399     ''
 400     >>> read_string1(io.BytesIO(b"\x03abcdef"))
 401     'abc'
 402     """
 403
 404     n = read_uint1(f)
 405     assert n >= 0
 406     data = f.read(n)
 407     if len(data) == n:
 408         return data.decode("latin-1")
 409     raise ValueError("expected %d bytes in a string1, but only %d remain" %
 410                      (n, len(data)))
 411
 412 string1 = ArgumentDescriptor(
 413               name="string1",
 414               n=TAKEN_FROM_ARGUMENT1,
 415               reader=read_string1,
 416               doc="""A counted string.
 417
 418               The first argument is a 1-byte unsigned int giving the number
 419               of bytes in the string, and the second argument is that many
 420               bytes.
 421               """)
 422
 423
 424 def read_unicodestringnl(f):
 425     r"""
 426     >>> import io
 427     >>> read_unicodestringnl(io.BytesIO(b"abc\\uabcd\njunk")) == 'abc\uabcd'
 428     True
 429     """
 430
 431     data = f.readline()
 432     if not data.endswith(b'\n'):
 433         raise ValueError("no newline found when trying to read "
 434                          "unicodestringnl")
 435     data = data[:-1]    # lose the newline
 436     return str(data, 'raw-unicode-escape')
 437
 438 unicodestringnl = ArgumentDescriptor(
 439                       name='unicodestringnl',
 440                       n=UP_TO_NEWLINE,
 441                       reader=read_unicodestringnl,
 442                       doc="""A newline-terminated Unicode string.
 443
 444                       This is raw-unicode-escape encoded, so consists of
 445                       printable ASCII characters, and may contain embedded
 446                       escape sequences.
 447                       """)
 448
 449 def read_unicodestring4(f):
 450     r"""
 451     >>> import io
 452     >>> s = 'abcd\uabcd'
 453     >>> enc = s.encode('utf-8')
 454     >>> enc
 455     b'abcd\xea\xaf\x8d'
 456     >>> n = bytes([len(enc), 0, 0, 0])  # little-endian 4-byte length
 457     >>> t = read_unicodestring4(io.BytesIO(n + enc + b'junk'))
 458     >>> s == t
 459     True
 460
 461     >>> read_unicodestring4(io.BytesIO(n + enc[:-1]))
 462     Traceback (most recent call last):
 463     ...
 464     ValueError: expected 7 bytes in a unicodestring4, but only 6 remain
 465     """
 466
 467     n = read_int4(f)
 468     if n < 0:
 469         raise ValueError("unicodestring4 byte count < 0: %d" % n)
 470     data = f.read(n)
 471     if len(data) == n:
 472         return str(data, 'utf-8')
 473     raise ValueError("expected %d bytes in a unicodestring4, but only %d "
 474                      "remain" % (n, len(data)))
 475
 476 unicodestring4 = ArgumentDescriptor(
 477                     name="unicodestring4",
 478                     n=TAKEN_FROM_ARGUMENT4,
 479                     reader=read_unicodestring4,
 480                     doc="""A counted Unicode string.
 481
 482                     The first argument is a 4-byte little-endian signed int
 483                     giving the number of bytes in the string, and the second
 484                     argument-- the UTF-8 encoding of the Unicode string --
 485                     contains that many bytes.
 486                     """)
 487
 488
 489 def read_decimalnl_short(f):
 490     r"""
 491     >>> import io
 492     >>> read_decimalnl_short(io.BytesIO(b"1234\n56"))
 493     1234
 494
 495     >>> read_decimalnl_short(io.BytesIO(b"1234L\n56"))
 496     Traceback (most recent call last):
 497     ...
 498     ValueError: trailing 'L' not allowed in b'1234L'
 499     """
 500
 501     s = read_stringnl(f, decode=False, stripquotes=False)
 502     if s.endswith(b"L"):
 503         raise ValueError("trailing 'L' not allowed in %r" % s)
 504
 505     # It's not necessarily true that the result fits in a Python short int:
 506     # the pickle may have been written on a 64-bit box.  There's also a hack
 507     # for True and False here.
 508     if s == b"00":
 509         return False
 510     elif s == b"01":
 511         return True
 512
 513     try:
 514         return int(s)
 515     except OverflowError:
 516         return int(s)
 517
 518 def read_decimalnl_long(f):
 519     r"""
 520     >>> import io
 521
 522     >>> read_decimalnl_long(io.BytesIO(b"1234L\n56"))
 523     1234
 524
 525     >>> read_decimalnl_long(io.BytesIO(b"123456789012345678901234L\n6"))
 526     123456789012345678901234
 527     """
 528
 529     s = read_stringnl(f, decode=False, stripquotes=False)
 530     if s[-1:] == b'L':
 531         s = s[:-1]
 532     return int(s)
 533
 534
 535 decimalnl_short = ArgumentDescriptor(
 536                       name='decimalnl_short',
 537                       n=UP_TO_NEWLINE,
 538                       reader=read_decimalnl_short,
 539                       doc="""A newline-terminated decimal integer literal.
 540
 541                           This never has a trailing 'L', and the integer fit
 542                           in a short Python int on the box where the pickle
 543                           was written -- but there's no guarantee it will fit
 544                           in a short Python int on the box where the pickle
 545                           is read.
 546                           """)
 547
 548 decimalnl_long = ArgumentDescriptor(
 549                      name='decimalnl_long',
 550                      n=UP_TO_NEWLINE,
 551                      reader=read_decimalnl_long,
 552                      doc="""A newline-terminated decimal integer literal.
 553
 554                          This has a trailing 'L', and can represent integers
 555                          of any size.
 556                          """)
 557
 558
 559 def read_floatnl(f):
 560     r"""
 561     >>> import io
 562     >>> read_floatnl(io.BytesIO(b"-1.25\n6"))
 563     -1.25
 564     """
 565     s = read_stringnl(f, decode=False, stripquotes=False)
 566     return float(s)
 567
 568 floatnl = ArgumentDescriptor(
 569               name='floatnl',
 570               n=UP_TO_NEWLINE,
 571               reader=read_floatnl,
 572               doc="""A newline-terminated decimal floating literal.
 573
 574               In general this requires 17 significant digits for roundtrip
 575               identity, and pickling then unpickling infinities, NaNs, and
 576               minus zero doesn't work across boxes, or on some boxes even
 577               on itself (e.g., Windows can't read the strings it produces
 578               for infinities or NaNs).
 579               """)
 580
 581 def read_float8(f):
 582     r"""
 583     >>> import io, struct
 584     >>> raw = struct.pack(">d", -1.25)
 585     >>> raw
 586     b'\xbf\xf4\x00\x00\x00\x00\x00\x00'
 587     >>> read_float8(io.BytesIO(raw + b"\n"))
 588     -1.25
 589     """
 590
 591     data = f.read(8)
 592     if len(data) == 8:
 593         return _unpack(">d", data)[0]
 594     raise ValueError("not enough data in stream to read float8")
 595
 596
 597 float8 = ArgumentDescriptor(
 598              name='float8',
 599              n=8,
 600              reader=read_float8,
 601              doc="""An 8-byte binary representation of a float, big-endian.
 602
 603              The format is unique to Python, and shared with the struct
 604              module (format string '>d') "in theory" (the struct and pickle
 605              implementations don't share the code -- they should).  It's
 606              strongly related to the IEEE-754 double format, and, in normal
 607              cases, is in fact identical to the big-endian 754 double format.
 608              On other boxes the dynamic range is limited to that of a 754
 609              double, and "add a half and chop" rounding is used to reduce
 610              the precision to 53 bits.  However, even on a 754 box,
 611              infinities, NaNs, and minus zero may not be handled correctly
 612              (may not survive roundtrip pickling intact).
 613              """)
 614
 615 # Protocol 2 formats
 616
 617 from pickle import decode_long
 618
 619 def read_long1(f):
 620     r"""
 621     >>> import io
 622     >>> read_long1(io.BytesIO(b"\x00"))
 623     0
 624     >>> read_long1(io.BytesIO(b"\x02\xff\x00"))
 625     255
 626     >>> read_long1(io.BytesIO(b"\x02\xff\x7f"))
 627     32767
 628     >>> read_long1(io.BytesIO(b"\x02\x00\xff"))
 629     -256
 630     >>> read_long1(io.BytesIO(b"\x02\x00\x80"))
 631     -32768
 632     """
 633
 634     n = read_uint1(f)
 635     data = f.read(n)
 636     if len(data) != n:
 637         raise ValueError("not enough data in stream to read long1")
 638     return decode_long(data)
 639
 640 long1 = ArgumentDescriptor(
 641     name="long1",
 642     n=TAKEN_FROM_ARGUMENT1,
 643     reader=read_long1,
 644     doc="""A binary long, little-endian, using 1-byte size.
 645
 646     This first reads one byte as an unsigned size, then reads that
 647     many bytes and interprets them as a little-endian 2's-complement long.
 648     If the size is 0, that's taken as a shortcut for the long 0L.
 649     """)
 650
 651 def read_long4(f):
 652     r"""
 653     >>> import io
 654     >>> read_long4(io.BytesIO(b"\x02\x00\x00\x00\xff\x00"))
 655     255
 656     >>> read_long4(io.BytesIO(b"\x02\x00\x00\x00\xff\x7f"))
 657     32767
 658     >>> read_long4(io.BytesIO(b"\x02\x00\x00\x00\x00\xff"))
 659     -256
 660     >>> read_long4(io.BytesIO(b"\x02\x00\x00\x00\x00\x80"))
 661     -32768
 662     >>> read_long1(io.BytesIO(b"\x00\x00\x00\x00"))
 663     0
 664     """
 665
 666     n = read_int4(f)
 667     if n < 0:
 668         raise ValueError("long4 byte count < 0: %d" % n)
 669     data = f.read(n)
 670     if len(data) != n:
 671         raise ValueError("not enough data in stream to read long4")
 672     return decode_long(data)
 673
 674 long4 = ArgumentDescriptor(
 675     name="long4",
 676     n=TAKEN_FROM_ARGUMENT4,
 677     reader=read_long4,
 678     doc="""A binary representation of a long, little-endian.
 679
 680     This first reads four bytes as a signed size (but requires the
 681     size to be >= 0), then reads that many bytes and interprets them
 682     as a little-endian 2's-complement long.  If the size is 0, that's taken
 683     as a shortcut for the int 0, although LONG1 should really be used
 684     then instead (and in any case where # of bytes < 256).
 685     """)
 686
 687
 688 ##############################################################################
 689 # Object descriptors.  The stack used by the pickle machine holds objects,
 690 # and in the stack_before and stack_after attributes of OpcodeInfo
 691 # descriptors we need names to describe the various types of objects that can
 692 # appear on the stack.
 693
 694 class StackObject(object):
 695     __slots__ = (
 696         # name of descriptor record, for info only
 697         'name',
 698
 699         # type of object, or tuple of type objects (meaning the object can
 700         # be of any type in the tuple)
 701         'obtype',
 702
 703         # human-readable docs for this kind of stack object; a string
 704         'doc',
 705     )
 706
 707     def __init__(self, name, obtype, doc):
 708         assert isinstance(name, str)
 709         self.name = name
 710
 711         assert isinstance(obtype, type) or isinstance(obtype, tuple)
 712         if isinstance(obtype, tuple):
 713             for contained in obtype:
 714                 assert isinstance(contained, type)
 715         self.obtype = obtype
 716
 717         assert isinstance(doc, str)
 718         self.doc = doc
 719
 720     def __repr__(self):
 721         return self.name
 722
 723
 724 pyint = StackObject(
 725             name='int',
 726             obtype=int,
 727             doc="A short (as opposed to long) Python integer object.")
 728
 729 pylong = StackObject(
 730              name='long',
 731              obtype=int,
 732              doc="A long (as opposed to short) Python integer object.")
 733
 734 pyinteger_or_bool = StackObject(
 735                         name='int_or_bool',
 736                         obtype=(int, int, bool),
 737                         doc="A Python integer object (short or long), or "
 738                             "a Python bool.")
 739
 740 pybool = StackObject(
 741              name='bool',
 742              obtype=(bool,),
 743              doc="A Python bool object.")
 744
 745 pyfloat = StackObject(
 746               name='float',
 747               obtype=float,
 748               doc="A Python float object.")
 749
 750 pystring = StackObject(
 751                name='string',
 752                obtype=bytes,
 753                doc="A Python (8-bit) string object.")
 754
 755 pybytes = StackObject(
 756                name='bytes',
 757                obtype=bytes,
 758                doc="A Python bytes object.")
 759
 760 pyunicode = StackObject(
 761                 name='str',
 762                 obtype=str,
 763                 doc="A Python (Unicode) string object.")
 764
 765 pynone = StackObject(
 766              name="None",
 767              obtype=type(None),
 768              doc="The Python None object.")
 769
 770 pytuple = StackObject(
 771               name="tuple",
 772               obtype=tuple,
 773               doc="A Python tuple object.")
 774
 775 pylist = StackObject(
 776              name="list",
 777              obtype=list,
 778              doc="A Python list object.")
 779
 780 pydict = StackObject(
 781              name="dict",
 782              obtype=dict,
 783              doc="A Python dict object.")
 784
 785 anyobject = StackObject(
 786                 name='any',
 787                 obtype=object,
 788                 doc="Any kind of object whatsoever.")
 789
 790 markobject = StackObject(
 791                  name="mark",
 792                  obtype=StackObject,
 793                  doc="""'The mark' is a unique object.
 794
 795                  Opcodes that operate on a variable number of objects
 796                  generally don't embed the count of objects in the opcode,
 797                  or pull it off the stack.  Instead the MARK opcode is used
 798                  to push a special marker object on the stack, and then
 799                  some other opcodes grab all the objects from the top of
 800                  the stack down to (but not including) the topmost marker
 801                  object.
 802                  """)
 803
 804 stackslice = StackObject(
 805                  name="stackslice",
 806                  obtype=StackObject,
 807                  doc="""An object representing a contiguous slice of the stack.
 808
 809                  This is used in conjuction with markobject, to represent all
 810                  of the stack following the topmost markobject.  For example,
 811                  the POP_MARK opcode changes the stack from
 812
 813                      [..., markobject, stackslice]
 814                  to
 815                      [...]
 816
 817                  No matter how many object are on the stack after the topmost
 818                  markobject, POP_MARK gets rid of all of them (including the
 819                  topmost markobject too).
 820                  """)
 821
 822 ##############################################################################
 823 # Descriptors for pickle opcodes.
 824
 825 class OpcodeInfo(object):
 826
 827     __slots__ = (
 828         # symbolic name of opcode; a string
 829         'name',
 830
 831         # the code used in a bytestream to represent the opcode; a
 832         # one-character string
 833         'code',
 834
 835         # If the opcode has an argument embedded in the byte string, an
 836         # instance of ArgumentDescriptor specifying its type.  Note that
 837         # arg.reader(s) can be used to read and decode the argument from
 838         # the bytestream s, and arg.doc documents the format of the raw
 839         # argument bytes.  If the opcode doesn't have an argument embedded
 840         # in the bytestream, arg should be None.
 841         'arg',
 842
 843         # what the stack looks like before this opcode runs; a list
 844         'stack_before',
 845
 846         # what the stack looks like after this opcode runs; a list
 847         'stack_after',
 848
 849         # the protocol number in which this opcode was introduced; an int
 850         'proto',
 851
 852         # human-readable docs for this opcode; a string
 853         'doc',
 854     )
 855
 856     def __init__(self, name, code, arg,
 857                  stack_before, stack_after, proto, doc):
 858         assert isinstance(name, str)
 859         self.name = name
 860
 861         assert isinstance(code, str)
 862         assert len(code) == 1
 863         self.code = code
 864
 865         assert arg is None or isinstance(arg, ArgumentDescriptor)
 866         self.arg = arg
 867
 868         assert isinstance(stack_before, list)
 869         for x in stack_before:
 870             assert isinstance(x, StackObject)
 871         self.stack_before = stack_before
 872
 873         assert isinstance(stack_after, list)
 874         for x in stack_after:
 875             assert isinstance(x, StackObject)
 876         self.stack_after = stack_after
 877
 878         assert isinstance(proto, int) and 0 <= proto <= 3
 879         self.proto = proto
 880
 881         assert isinstance(doc, str)
 882         self.doc = doc
 883
 884 I = OpcodeInfo
 885 opcodes = [
 886
 887     # Ways to spell integers.
 888
 889     I(name='INT',
 890       code='I',
 891       arg=decimalnl_short,
 892       stack_before=[],
 893       stack_after=[pyinteger_or_bool],
 894       proto=0,
 895       doc="""Push an integer or bool.
 896
 897       The argument is a newline-terminated decimal literal string.
 898
 899       The intent may have been that this always fit in a short Python int,
 900       but INT can be generated in pickles written on a 64-bit box that
 901       require a Python long on a 32-bit box.  The difference between this
 902       and LONG then is that INT skips a trailing 'L', and produces a short
 903       int whenever possible.
 904
 905       Another difference is due to that, when bool was introduced as a
 906       distinct type in 2.3, builtin names True and False were also added to
 907       2.2.2, mapping to ints 1 and 0.  For compatibility in both directions,
 908       True gets pickled as INT + "I01\\n", and False as INT + "I00\\n".
 909       Leading zeroes are never produced for a genuine integer.  The 2.3
 910       (and later) unpicklers special-case these and return bool instead;
 911       earlier unpicklers ignore the leading "0" and return the int.
 912       """),
 913
 914     I(name='BININT',
 915       code='J',
 916       arg=int4,
 917       stack_before=[],
 918       stack_after=[pyint],
 919       proto=1,
 920       doc="""Push a four-byte signed integer.
 921
 922       This handles the full range of Python (short) integers on a 32-bit
 923       box, directly as binary bytes (1 for the opcode and 4 for the integer).
 924       If the integer is non-negative and fits in 1 or 2 bytes, pickling via
 925       BININT1 or BININT2 saves space.
 926       """),
 927
 928     I(name='BININT1',
 929       code='K',
 930       arg=uint1,
 931       stack_before=[],
 932       stack_after=[pyint],
 933       proto=1,
 934       doc="""Push a one-byte unsigned integer.
 935
 936       This is a space optimization for pickling very small non-negative ints,
 937       in range(256).
 938       """),
 939
 940     I(name='BININT2',
 941       code='M',
 942       arg=uint2,
 943       stack_before=[],
 944       stack_after=[pyint],
 945       proto=1,
 946       doc="""Push a two-byte unsigned integer.
 947
 948       This is a space optimization for pickling small positive ints, in
 949       range(256, 2**16).  Integers in range(256) can also be pickled via
 950       BININT2, but BININT1 instead saves a byte.
 951       """),
 952
 953     I(name='LONG',
 954       code='L',
 955       arg=decimalnl_long,
 956       stack_before=[],
 957       stack_after=[pylong],
 958       proto=0,
 959       doc="""Push a long integer.
 960
 961       The same as INT, except that the literal ends with 'L', and always
 962       unpickles to a Python long.  There doesn't seem a real purpose to the
 963       trailing 'L'.
 964
 965       Note that LONG takes time quadratic in the number of digits when
 966       unpickling (this is simply due to the nature of decimal->binary
 967       conversion).  Proto 2 added linear-time (in C; still quadratic-time
 968       in Python) LONG1 and LONG4 opcodes.
 969       """),
 970
 971     I(name="LONG1",
 972       code='\x8a',
 973       arg=long1,
 974       stack_before=[],
 975       stack_after=[pylong],
 976       proto=2,
 977       doc="""Long integer using one-byte length.
 978
 979       A more efficient encoding of a Python long; the long1 encoding
 980       says it all."""),
 981
 982     I(name="LONG4",
 983       code='\x8b',
 984       arg=long4,
 985       stack_before=[],
 986       stack_after=[pylong],
 987       proto=2,
 988       doc="""Long integer using found-byte length.
 989
 990       A more efficient encoding of a Python long; the long4 encoding
 991       says it all."""),
 992
 993     # Ways to spell strings (8-bit, not Unicode).
 994
 995     I(name='STRING',
 996       code='S',
 997       arg=stringnl,
 998       stack_before=[],
 999       stack_after=[pystring],
1000       proto=0,
1001       doc="""Push a Python string object.
1002
1003       The argument is a repr-style string, with bracketing quote characters,
1004       and perhaps embedded escapes.  The argument extends until the next
1005       newline character.  (Actually, they are decoded into a str instance
1006       using the encoding given to the Unpickler constructor. or the default,
1007       'ASCII'.)
1008       """),
1009
1010     I(name='BINSTRING',
1011       code='T',
1012       arg=string4,
1013       stack_before=[],
1014       stack_after=[pystring],
1015       proto=1,
1016       doc="""Push a Python string object.
1017
1018       There are two arguments:  the first is a 4-byte little-endian signed int
1019       giving the number of bytes in the string, and the second is that many
1020       bytes, which are taken literally as the string content.  (Actually,
1021       they are decoded into a str instance using the encoding given to the
1022       Unpickler constructor. or the default, 'ASCII'.)
1023       """),
1024
1025     I(name='SHORT_BINSTRING',
1026       code='U',
1027       arg=string1,
1028       stack_before=[],
1029       stack_after=[pystring],
1030       proto=1,
1031       doc="""Push a Python string object.
1032
1033       There are two arguments:  the first is a 1-byte unsigned int giving
1034       the number of bytes in the string, and the second is that many bytes,
1035       which are taken literally as the string content.  (Actually, they
1036       are decoded into a str instance using the encoding given to the
1037       Unpickler constructor. or the default, 'ASCII'.)
1038       """),
1039
1040     # Bytes (protocol 3 only; older protocols don't support bytes at all)
1041
1042     I(name='BINBYTES',
1043       code='B',
1044       arg=string4,
1045       stack_before=[],
1046       stack_after=[pybytes],
1047       proto=3,
1048       doc="""Push a Python bytes object.
1049
1050       There are two arguments:  the first is a 4-byte little-endian signed int
1051       giving the number of bytes in the string, and the second is that many
1052       bytes, which are taken literally as the bytes content.
1053       """),
1054
1055     I(name='SHORT_BINBYTES',
1056       code='C',
1057       arg=string1,
1058       stack_before=[],
1059       stack_after=[pybytes],
1060       proto=3,
1061       doc="""Push a Python string object.
1062
1063       There are two arguments:  the first is a 1-byte unsigned int giving
1064       the number of bytes in the string, and the second is that many bytes,
1065       which are taken literally as the string content.
1066       """),
1067
1068     # Ways to spell None.
1069
1070     I(name='NONE',
1071       code='N',
1072       arg=None,
1073       stack_before=[],
1074       stack_after=[pynone],
1075       proto=0,
1076       doc="Push None on the stack."),
1077
1078     # Ways to spell bools, starting with proto 2.  See INT for how this was
1079     # done before proto 2.
1080
1081     I(name='NEWTRUE',
1082       code='\x88',
1083       arg=None,
1084       stack_before=[],
1085       stack_after=[pybool],
1086       proto=2,
1087       doc="""True.
1088
1089       Push True onto the stack."""),
1090
1091     I(name='NEWFALSE',
1092       code='\x89',
1093       arg=None,
1094       stack_before=[],
1095       stack_after=[pybool],
1096       proto=2,
1097       doc="""True.
1098
1099       Push False onto the stack."""),
1100
1101     # Ways to spell Unicode strings.
1102
1103     I(name='UNICODE',
1104       code='V',
1105       arg=unicodestringnl,
1106       stack_before=[],
1107       stack_after=[pyunicode],
1108       proto=0,  # this may be pure-text, but it's a later addition
1109       doc="""Push a Python Unicode string object.
1110
1111       The argument is a raw-unicode-escape encoding of a Unicode string,
1112       and so may contain embedded escape sequences.  The argument extends
1113       until the next newline character.
1114       """),
1115
1116     I(name='BINUNICODE',
1117       code='X',
1118       arg=unicodestring4,
1119       stack_before=[],
1120       stack_after=[pyunicode],
1121       proto=1,
1122       doc="""Push a Python Unicode string object.
1123
1124       There are two arguments:  the first is a 4-byte little-endian signed int
1125       giving the number of bytes in the string.  The second is that many
1126       bytes, and is the UTF-8 encoding of the Unicode string.
1127       """),
1128
1129     # Ways to spell floats.
1130
1131     I(name='FLOAT',
1132       code='F',
1133       arg=floatnl,
1134       stack_before=[],
1135       stack_after=[pyfloat],
1136       proto=0,
1137       doc="""Newline-terminated decimal float literal.
1138
1139       The argument is repr(a_float), and in general requires 17 significant
1140       digits for roundtrip conversion to be an identity (this is so for
1141       IEEE-754 double precision values, which is what Python float maps to
1142       on most boxes).
1143
1144       In general, FLOAT cannot be used to transport infinities, NaNs, or
1145       minus zero across boxes (or even on a single box, if the platform C
1146       library can't read the strings it produces for such things -- Windows
1147       is like that), but may do less damage than BINFLOAT on boxes with
1148       greater precision or dynamic range than IEEE-754 double.
1149       """),
1150
1151     I(name='BINFLOAT',
1152       code='G',
1153       arg=float8,
1154       stack_before=[],
1155       stack_after=[pyfloat],
1156       proto=1,
1157       doc="""Float stored in binary form, with 8 bytes of data.
1158
1159       This generally requires less than half the space of FLOAT encoding.
1160       In general, BINFLOAT cannot be used to transport infinities, NaNs, or
1161       minus zero, raises an exception if the exponent exceeds the range of
1162       an IEEE-754 double, and retains no more than 53 bits of precision (if
1163       there are more than that, "add a half and chop" rounding is used to
1164       cut it back to 53 significant bits).
1165       """),
1166
1167     # Ways to build lists.
1168
1169     I(name='EMPTY_LIST',
1170       code=']',
1171       arg=None,
1172       stack_before=[],
1173       stack_after=[pylist],
1174       proto=1,
1175       doc="Push an empty list."),
1176
1177     I(name='APPEND',
1178       code='a',
1179       arg=None,
1180       stack_before=[pylist, anyobject],
1181       stack_after=[pylist],
1182       proto=0,
1183       doc="""Append an object to a list.
1184
1185       Stack before:  ... pylist anyobject
1186       Stack after:   ... pylist+[anyobject]
1187
1188       although pylist is really extended in-place.
1189       """),
1190
1191     I(name='APPENDS',
1192       code='e',
1193       arg=None,
1194       stack_before=[pylist, markobject, stackslice],
1195       stack_after=[pylist],
1196       proto=1,
1197       doc="""Extend a list by a slice of stack objects.
1198
1199       Stack before:  ... pylist markobject stackslice
1200       Stack after:   ... pylist+stackslice
1201
1202       although pylist is really extended in-place.
1203       """),
1204
1205     I(name='LIST',
1206       code='l',
1207       arg=None,
1208       stack_before=[markobject, stackslice],
1209       stack_after=[pylist],
1210       proto=0,
1211       doc="""Build a list out of the topmost stack slice, after markobject.
1212
1213       All the stack entries following the topmost markobject are placed into
1214       a single Python list, which single list object replaces all of the
1215       stack from the topmost markobject onward.  For example,
1216
1217       Stack before: ... markobject 1 2 3 'abc'
1218       Stack after:  ... [1, 2, 3, 'abc']
1219       """),
1220
1221     # Ways to build tuples.
1222
1223     I(name='EMPTY_TUPLE',
1224       code=')',
1225       arg=None,
1226       stack_before=[],
1227       stack_after=[pytuple],
1228       proto=1,
1229       doc="Push an empty tuple."),
1230
1231     I(name='TUPLE',
1232       code='t',
1233       arg=None,
1234       stack_before=[markobject, stackslice],
1235       stack_after=[pytuple],
1236       proto=0,
1237       doc="""Build a tuple out of the topmost stack slice, after markobject.
1238
1239       All the stack entries following the topmost markobject are placed into
1240       a single Python tuple, which single tuple object replaces all of the
1241       stack from the topmost markobject onward.  For example,
1242
1243       Stack before: ... markobject 1 2 3 'abc'
1244       Stack after:  ... (1, 2, 3, 'abc')
1245       """),
1246
1247     I(name='TUPLE1',
1248       code='\x85',
1249       arg=None,
1250       stack_before=[anyobject],
1251       stack_after=[pytuple],
1252       proto=2,
1253       doc="""One-tuple.
1254
1255       This code pops one value off the stack and pushes a tuple of
1256       length 1 whose one item is that value back onto it.  IOW:
1257
1258           stack[-1] = tuple(stack[-1:])
1259       """),
1260
1261     I(name='TUPLE2',
1262       code='\x86',
1263       arg=None,
1264       stack_before=[anyobject, anyobject],
1265       stack_after=[pytuple],
1266       proto=2,
1267       doc="""One-tuple.
1268
1269       This code pops two values off the stack and pushes a tuple
1270       of length 2 whose items are those values back onto it.  IOW:
1271
1272           stack[-2:] = [tuple(stack[-2:])]
1273       """),
1274
1275     I(name='TUPLE3',
1276       code='\x87',
1277       arg=None,
1278       stack_before=[anyobject, anyobject, anyobject],
1279       stack_after=[pytuple],
1280       proto=2,
1281       doc="""One-tuple.
1282
1283       This code pops three values off the stack and pushes a tuple
1284       of length 3 whose items are those values back onto it.  IOW:
1285
1286           stack[-3:] = [tuple(stack[-3:])]
1287       """),
1288
1289     # Ways to build dicts.
1290
1291     I(name='EMPTY_DICT',
1292       code='}',
1293       arg=None,
1294       stack_before=[],
1295       stack_after=[pydict],
1296       proto=1,
1297       doc="Push an empty dict."),
1298
1299     I(name='DICT',
1300       code='d',
1301       arg=None,
1302       stack_before=[markobject, stackslice],
1303       stack_after=[pydict],
1304       proto=0,
1305       doc="""Build a dict out of the topmost stack slice, after markobject.
1306
1307       All the stack entries following the topmost markobject are placed into
1308       a single Python dict, which single dict object replaces all of the
1309       stack from the topmost markobject onward.  The stack slice alternates
1310       key, value, key, value, ....  For example,
1311
1312       Stack before: ... markobject 1 2 3 'abc'
1313       Stack after:  ... {1: 2, 3: 'abc'}
1314       """),
1315
1316     I(name='SETITEM',
1317       code='s',
1318       arg=None,
1319       stack_before=[pydict, anyobject, anyobject],
1320       stack_after=[pydict],
1321       proto=0,
1322       doc="""Add a key+value pair to an existing dict.
1323
1324       Stack before:  ... pydict key value
1325       Stack after:   ... pydict
1326
1327       where pydict has been modified via pydict[key] = value.
1328       """),
1329
1330     I(name='SETITEMS',
1331       code='u',
1332       arg=None,
1333       stack_before=[pydict, markobject, stackslice],
1334       stack_after=[pydict],
1335       proto=1,
1336       doc="""Add an arbitrary number of key+value pairs to an existing dict.
1337
1338       The slice of the stack following the topmost markobject is taken as
1339       an alternating sequence of keys and values, added to the dict
1340       immediately under the topmost markobject.  Everything at and after the
1341       topmost markobject is popped, leaving the mutated dict at the top
1342       of the stack.
1343
1344       Stack before:  ... pydict markobject key_1 value_1 ... key_n value_n
1345       Stack after:   ... pydict
1346
1347       where pydict has been modified via pydict[key_i] = value_i for i in
1348       1, 2, ..., n, and in that order.
1349       """),
1350
1351     # Stack manipulation.
1352
1353     I(name='POP',
1354       code='0',
1355       arg=None,
1356       stack_before=[anyobject],
1357       stack_after=[],
1358       proto=0,
1359       doc="Discard the top stack item, shrinking the stack by one item."),
1360
1361     I(name='DUP',
1362       code='2',
1363       arg=None,
1364       stack_before=[anyobject],
1365       stack_after=[anyobject, anyobject],
1366       proto=0,
1367       doc="Push the top stack item onto the stack again, duplicating it."),
1368
1369     I(name='MARK',
1370       code='(',
1371       arg=None,
1372       stack_before=[],
1373       stack_after=[markobject],
1374       proto=0,
1375       doc="""Push markobject onto the stack.
1376
1377       markobject is a unique object, used by other opcodes to identify a
1378       region of the stack containing a variable number of objects for them
1379       to work on.  See markobject.doc for more detail.
1380       """),
1381
1382     I(name='POP_MARK',
1383       code='1',
1384       arg=None,
1385       stack_before=[markobject, stackslice],
1386       stack_after=[],
1387       proto=1,
1388       doc="""Pop all the stack objects at and above the topmost markobject.
1389
1390       When an opcode using a variable number of stack objects is done,
1391       POP_MARK is used to remove those objects, and to remove the markobject
1392       that delimited their starting position on the stack.
1393       """),
1394
1395     # Memo manipulation.  There are really only two operations (get and put),
1396     # each in all-text, "short binary", and "long binary" flavors.
1397
1398     I(name='GET',
1399       code='g',
1400       arg=decimalnl_short,
1401       stack_before=[],
1402       stack_after=[anyobject],
1403       proto=0,
1404       doc="""Read an object from the memo and push it on the stack.
1405
1406       The index of the memo object to push is given by the newline-teriminated
1407       decimal string following.  BINGET and LONG_BINGET are space-optimized
1408       versions.
1409       """),
1410
1411     I(name='BINGET',
1412       code='h',
1413       arg=uint1,
1414       stack_before=[],
1415       stack_after=[anyobject],
1416       proto=1,
1417       doc="""Read an object from the memo and push it on the stack.
1418
1419       The index of the memo object to push is given by the 1-byte unsigned
1420       integer following.
1421       """),
1422
1423     I(name='LONG_BINGET',
1424       code='j',
1425       arg=int4,
1426       stack_before=[],
1427       stack_after=[anyobject],
1428       proto=1,
1429       doc="""Read an object from the memo and push it on the stack.
1430
1431       The index of the memo object to push is given by the 4-byte signed
1432       little-endian integer following.
1433       """),
1434
1435     I(name='PUT',
1436       code='p',
1437       arg=decimalnl_short,
1438       stack_before=[],
1439       stack_after=[],
1440       proto=0,
1441       doc="""Store the stack top into the memo.  The stack is not popped.
1442
1443       The index of the memo location to write into is given by the newline-
1444       terminated decimal string following.  BINPUT and LONG_BINPUT are
1445       space-optimized versions.
1446       """),
1447
1448     I(name='BINPUT',
1449       code='q',
1450       arg=uint1,
1451       stack_before=[],
1452       stack_after=[],
1453       proto=1,
1454       doc="""Store the stack top into the memo.  The stack is not popped.
1455
1456       The index of the memo location to write into is given by the 1-byte
1457       unsigned integer following.
1458       """),
1459
1460     I(name='LONG_BINPUT',
1461       code='r',
1462       arg=int4,
1463       stack_before=[],
1464       stack_after=[],
1465       proto=1,
1466       doc="""Store the stack top into the memo.  The stack is not popped.
1467
1468       The index of the memo location to write into is given by the 4-byte
1469       signed little-endian integer following.
1470       """),
1471
1472     # Access the extension registry (predefined objects).  Akin to the GET
1473     # family.
1474
1475     I(name='EXT1',
1476       code='\x82',
1477       arg=uint1,
1478       stack_before=[],
1479       stack_after=[anyobject],
1480       proto=2,
1481       doc="""Extension code.
1482
1483       This code and the similar EXT2 and EXT4 allow using a registry
1484       of popular objects that are pickled by name, typically classes.
1485       It is envisioned that through a global negotiation and
1486       registration process, third parties can set up a mapping between
1487       ints and object names.
1488
1489       In order to guarantee pickle interchangeability, the extension
1490       code registry ought to be global, although a range of codes may
1491       be reserved for private use.
1492
1493       EXT1 has a 1-byte integer argument.  This is used to index into the
1494       extension registry, and the object at that index is pushed on the stack.
1495       """),
1496
1497     I(name='EXT2',
1498       code='\x83',
1499       arg=uint2,
1500       stack_before=[],
1501       stack_after=[anyobject],
1502       proto=2,
1503       doc="""Extension code.
1504
1505       See EXT1.  EXT2 has a two-byte integer argument.
1506       """),
1507
1508     I(name='EXT4',
1509       code='\x84',
1510       arg=int4,
1511       stack_before=[],
1512       stack_after=[anyobject],
1513       proto=2,
1514       doc="""Extension code.
1515
1516       See EXT1.  EXT4 has a four-byte integer argument.
1517       """),
1518
1519     # Push a class object, or module function, on the stack, via its module
1520     # and name.
1521
1522     I(name='GLOBAL',
1523       code='c',
1524       arg=stringnl_noescape_pair,
1525       stack_before=[],
1526       stack_after=[anyobject],
1527       proto=0,
1528       doc="""Push a global object (module.attr) on the stack.
1529
1530       Two newline-terminated strings follow the GLOBAL opcode.  The first is
1531       taken as a module name, and the second as a class name.  The class
1532       object module.class is pushed on the stack.  More accurately, the
1533       object returned by self.find_class(module, class) is pushed on the
1534       stack, so unpickling subclasses can override this form of lookup.
1535       """),
1536
1537     # Ways to build objects of classes pickle doesn't know about directly
1538     # (user-defined classes).  I despair of documenting this accurately
1539     # and comprehensibly -- you really have to read the pickle code to
1540     # find all the special cases.
1541
1542     I(name='REDUCE',
1543       code='R',
1544       arg=None,
1545       stack_before=[anyobject, anyobject],
1546       stack_after=[anyobject],
1547       proto=0,
1548       doc="""Push an object built from a callable and an argument tuple.
1549
1550       The opcode is named to remind of the __reduce__() method.
1551
1552       Stack before: ... callable pytuple
1553       Stack after:  ... callable(*pytuple)
1554
1555       The callable and the argument tuple are the first two items returned
1556       by a __reduce__ method.  Applying the callable to the argtuple is
1557       supposed to reproduce the original object, or at least get it started.
1558       If the __reduce__ method returns a 3-tuple, the last component is an
1559       argument to be passed to the object's __setstate__, and then the REDUCE
1560       opcode is followed by code to create setstate's argument, and then a
1561       BUILD opcode to apply  __setstate__ to that argument.
1562
1563       If not isinstance(callable, type), REDUCE complains unless the
1564       callable has been registered with the copyreg module's
1565       safe_constructors dict, or the callable has a magic
1566       '__safe_for_unpickling__' attribute with a true value.  I'm not sure
1567       why it does this, but I've sure seen this complaint often enough when
1568       I didn't want to <wink>.
1569       """),
1570
1571     I(name='BUILD',
1572       code='b',
1573       arg=None,
1574       stack_before=[anyobject, anyobject],
1575       stack_after=[anyobject],
1576       proto=0,
1577       doc="""Finish building an object, via __setstate__ or dict update.
1578
1579       Stack before: ... anyobject argument
1580       Stack after:  ... anyobject
1581
1582       where anyobject may have been mutated, as follows:
1583
1584       If the object has a __setstate__ method,
1585
1586           anyobject.__setstate__(argument)
1587
1588       is called.
1589
1590       Else the argument must be a dict, the object must have a __dict__, and
1591       the object is updated via
1592
1593           anyobject.__dict__.update(argument)
1594       """),
1595
1596     I(name='INST',
1597       code='i',
1598       arg=stringnl_noescape_pair,
1599       stack_before=[markobject, stackslice],
1600       stack_after=[anyobject],
1601       proto=0,
1602       doc="""Build a class instance.
1603
1604       This is the protocol 0 version of protocol 1's OBJ opcode.
1605       INST is followed by two newline-terminated strings, giving a
1606       module and class name, just as for the GLOBAL opcode (and see
1607       GLOBAL for more details about that).  self.find_class(module, name)
1608       is used to get a class object.
1609
1610       In addition, all the objects on the stack following the topmost
1611       markobject are gathered into a tuple and popped (along with the
1612       topmost markobject), just as for the TUPLE opcode.
1613
1614       Now it gets complicated.  If all of these are true:
1615
1616         + The argtuple is empty (markobject was at the top of the stack
1617           at the start).
1618
1619         + The class object does not have a __getinitargs__ attribute.
1620
1621       then we want to create an old-style class instance without invoking
1622       its __init__() method (pickle has waffled on this over the years; not
1623       calling __init__() is current wisdom).  In this case, an instance of
1624       an old-style dummy class is created, and then we try to rebind its
1625       __class__ attribute to the desired class object.  If this succeeds,
1626       the new instance object is pushed on the stack, and we're done.
1627
1628       Else (the argtuple is not empty, it's not an old-style class object,
1629       or the class object does have a __getinitargs__ attribute), the code
1630       first insists that the class object have a __safe_for_unpickling__
1631       attribute.  Unlike as for the __safe_for_unpickling__ check in REDUCE,
1632       it doesn't matter whether this attribute has a true or false value, it
1633       only matters whether it exists (XXX this is a bug).  If
1634       __safe_for_unpickling__ doesn't exist, UnpicklingError is raised.
1635
1636       Else (the class object does have a __safe_for_unpickling__ attr),
1637       the class object obtained from INST's arguments is applied to the
1638       argtuple obtained from the stack, and the resulting instance object
1639       is pushed on the stack.
1640
1641       NOTE:  checks for __safe_for_unpickling__ went away in Python 2.3.
1642       """),
1643
1644     I(name='OBJ',
1645       code='o',
1646       arg=None,
1647       stack_before=[markobject, anyobject, stackslice],
1648       stack_after=[anyobject],
1649       proto=1,
1650       doc="""Build a class instance.
1651
1652       This is the protocol 1 version of protocol 0's INST opcode, and is
1653       very much like it.  The major difference is that the class object
1654       is taken off the stack, allowing it to be retrieved from the memo
1655       repeatedly if several instances of the same class are created.  This
1656       can be much more efficient (in both time and space) than repeatedly
1657       embedding the module and class names in INST opcodes.
1658
1659       Unlike INST, OBJ takes no arguments from the opcode stream.  Instead
1660       the class object is taken off the stack, immediately above the
1661       topmost markobject:
1662
1663       Stack before: ... markobject classobject stackslice
1664       Stack after:  ... new_instance_object
1665
1666       As for INST, the remainder of the stack above the markobject is
1667       gathered into an argument tuple, and then the logic seems identical,
1668       except that no __safe_for_unpickling__ check is done (XXX this is
1669       a bug).  See INST for the gory details.
1670
1671       NOTE:  In Python 2.3, INST and OBJ are identical except for how they
1672       get the class object.  That was always the intent; the implementations
1673       had diverged for accidental reasons.
1674       """),
1675
1676     I(name='NEWOBJ',
1677       code='\x81',
1678       arg=None,
1679       stack_before=[anyobject, anyobject],
1680       stack_after=[anyobject],
1681       proto=2,
1682       doc="""Build an object instance.
1683
1684       The stack before should be thought of as containing a class
1685       object followed by an argument tuple (the tuple being the stack
1686       top).  Call these cls and args.  They are popped off the stack,
1687       and the value returned by cls.__new__(cls, *args) is pushed back
1688       onto the stack.
1689       """),
1690
1691     # Machine control.
1692
1693     I(name='PROTO',
1694       code='\x80',
1695       arg=uint1,
1696       stack_before=[],
1697       stack_after=[],
1698       proto=2,
1699       doc="""Protocol version indicator.
1700
1701       For protocol 2 and above, a pickle must start with this opcode.
1702       The argument is the protocol version, an int in range(2, 256).
1703       """),
1704
1705     I(name='STOP',
1706       code='.',
1707       arg=None,
1708       stack_before=[anyobject],
1709       stack_after=[],
1710       proto=0,
1711       doc="""Stop the unpickling machine.
1712
1713       Every pickle ends with this opcode.  The object at the top of the stack
1714       is popped, and that's the result of unpickling.  The stack should be
1715       empty then.
1716       """),
1717
1718     # Ways to deal with persistent IDs.
1719
1720     I(name='PERSID',
1721       code='P',
1722       arg=stringnl_noescape,
1723       stack_before=[],
1724       stack_after=[anyobject],
1725       proto=0,
1726       doc="""Push an object identified by a persistent ID.
1727
1728       The pickle module doesn't define what a persistent ID means.  PERSID's
1729       argument is a newline-terminated str-style (no embedded escapes, no
1730       bracketing quote characters) string, which *is* "the persistent ID".
1731       The unpickler passes this string to self.persistent_load().  Whatever
1732       object that returns is pushed on the stack.  There is no implementation
1733       of persistent_load() in Python's unpickler:  it must be supplied by an
1734       unpickler subclass.
1735       """),
1736
1737     I(name='BINPERSID',
1738       code='Q',
1739       arg=None,
1740       stack_before=[anyobject],
1741       stack_after=[anyobject],
1742       proto=1,
1743       doc="""Push an object identified by a persistent ID.
1744
1745       Like PERSID, except the persistent ID is popped off the stack (instead
1746       of being a string embedded in the opcode bytestream).  The persistent
1747       ID is passed to self.persistent_load(), and whatever object that
1748       returns is pushed on the stack.  See PERSID for more detail.
1749       """),
1750 ]
1751 del I
1752
1753 # Verify uniqueness of .name and .code members.
1754 name2i = {}
1755 code2i = {}
1756
1757 for i, d in enumerate(opcodes):
1758     if d.name in name2i:
1759         raise ValueError("repeated name %r at indices %d and %d" %
1760                          (d.name, name2i[d.name], i))
1761     if d.code in code2i:
1762         raise ValueError("repeated code %r at indices %d and %d" %
1763                          (d.code, code2i[d.code], i))
1764
1765     name2i[d.name] = i
1766     code2i[d.code] = i
1767
1768 del name2i, code2i, i, d
1769
1770 ##############################################################################
1771 # Build a code2op dict, mapping opcode characters to OpcodeInfo records.
1772 # Also ensure we've got the same stuff as pickle.py, although the
1773 # introspection here is dicey.
1774
1775 code2op = {}
1776 for d in opcodes:
1777     code2op[d.code] = d
1778 del d
1779
1780 def assure_pickle_consistency(verbose=False):
1781
1782     copy = code2op.copy()
1783     for name in pickle.__all__:
1784         if not re.match("[A-Z][A-Z0-9_]+$", name):
1785             if verbose:
1786                 print("skipping %r: it doesn't look like an opcode name" % name)
1787             continue
1788         picklecode = getattr(pickle, name)
1789         if not isinstance(picklecode, bytes) or len(picklecode) != 1:
1790             if verbose:
1791                 print(("skipping %r: value %r doesn't look like a pickle "
1792                        "code" % (name, picklecode)))
1793             continue
1794         picklecode = picklecode.decode("latin-1")
1795         if picklecode in copy:
1796             if verbose:
1797                 print("checking name %r w/ code %r for consistency" % (
1798                       name, picklecode))
1799             d = copy[picklecode]
1800             if d.name != name:
1801                 raise ValueError("for pickle code %r, pickle.py uses name %r "
1802                                  "but we're using name %r" % (picklecode,
1803                                                               name,
1804                                                               d.name))
1805             # Forget this one.  Any left over in copy at the end are a problem
1806             # of a different kind.
1807             del copy[picklecode]
1808         else:
1809             raise ValueError("pickle.py appears to have a pickle opcode with "
1810                              "name %r and code %r, but we don't" %
1811                              (name, picklecode))
1812     if copy:
1813         msg = ["we appear to have pickle opcodes that pickle.py doesn't have:"]
1814         for code, d in copy.items():
1815             msg.append("    name %r with code %r" % (d.name, code))
1816         raise ValueError("\n".join(msg))
1817
1818 assure_pickle_consistency()
1819 del assure_pickle_consistency
1820
1821 ##############################################################################
1822 # A pickle opcode generator.
1823
1824 def genops(pickle):
1825     """Generate all the opcodes in a pickle.
1826
1827     'pickle' is a file-like object, or string, containing the pickle.
1828
1829     Each opcode in the pickle is generated, from the current pickle position,
1830     stopping after a STOP opcode is delivered.  A triple is generated for
1831     each opcode:
1832
1833         opcode, arg, pos
1834
1835     opcode is an OpcodeInfo record, describing the current opcode.
1836
1837     If the opcode has an argument embedded in the pickle, arg is its decoded
1838     value, as a Python object.  If the opcode doesn't have an argument, arg
1839     is None.
1840
1841     If the pickle has a tell() method, pos was the value of pickle.tell()
1842     before reading the current opcode.  If the pickle is a bytes object,
1843     it's wrapped in a BytesIO object, and the latter's tell() result is
1844     used.  Else (the pickle doesn't have a tell(), and it's not obvious how
1845     to query its current position) pos is None.
1846     """
1847
1848     if isinstance(pickle, bytes_types):
1849         import io
1850         pickle = io.BytesIO(pickle)
1851
1852     if hasattr(pickle, "tell"):
1853         getpos = pickle.tell
1854     else:
1855         getpos = lambda: None
1856
1857     while True:
1858         pos = getpos()
1859         code = pickle.read(1)
1860         opcode = code2op.get(code.decode("latin-1"))
1861         if opcode is None:
1862             if code == b"":
1863                 raise ValueError("pickle exhausted before seeing STOP")
1864             else:
1865                 raise ValueError("at position %s, opcode %r unknown" % (
1866                                  pos is None and "<unknown>" or pos,
1867                                  code))
1868         if opcode.arg is None:
1869             arg = None
1870         else:
1871             arg = opcode.arg.reader(pickle)
1872         yield opcode, arg, pos
1873         if code == b'.':
1874             assert opcode.name == 'STOP'
1875             break
1876
1877 ##############################################################################
1878 # A pickle optimizer.
1879
1880 def optimize(p):
1881     'Optimize a pickle string by removing unused PUT opcodes'
1882     gets = set()            # set of args used by a GET opcode
1883     puts = []               # (arg, startpos, stoppos) for the PUT opcodes
1884     prevpos = None          # set to pos if previous opcode was a PUT
1885     for opcode, arg, pos in genops(p):
1886         if prevpos is not None:
1887             puts.append((prevarg, prevpos, pos))
1888             prevpos = None
1889         if 'PUT' in opcode.name:
1890             prevarg, prevpos = arg, pos
1891         elif 'GET' in opcode.name:
1892             gets.add(arg)
1893
1894     # Copy the pickle string except for PUTS without a corresponding GET
1895     s = []
1896     i = 0
1897     for arg, start, stop in puts:
1898         j = stop if (arg in gets) else start
1899         s.append(p[i:j])
1900         i = stop
1901     s.append(p[i:])
1902     return b''.join(s)
1903
1904 ##############################################################################
1905 # A symbolic pickle disassembler.
1906
1907 def dis(pickle, out=None, memo=None, indentlevel=4):
1908     """Produce a symbolic disassembly of a pickle.
1909
1910     'pickle' is a file-like object, or string, containing a (at least one)
1911     pickle.  The pickle is disassembled from the current position, through
1912     the first STOP opcode encountered.
1913
1914     Optional arg 'out' is a file-like object to which the disassembly is
1915     printed.  It defaults to sys.stdout.
1916
1917     Optional arg 'memo' is a Python dict, used as the pickle's memo.  It
1918     may be mutated by dis(), if the pickle contains PUT or BINPUT opcodes.
1919     Passing the same memo object to another dis() call then allows disassembly
1920     to proceed across multiple pickles that were all created by the same
1921     pickler with the same memo.  Ordinarily you don't need to worry about this.
1922
1923     Optional arg indentlevel is the number of blanks by which to indent
1924     a new MARK level.  It defaults to 4.
1925
1926     In addition to printing the disassembly, some sanity checks are made:
1927
1928     + All embedded opcode arguments "make sense".
1929
1930     + Explicit and implicit pop operations have enough items on the stack.
1931
1932     + When an opcode implicitly refers to a markobject, a markobject is
1933       actually on the stack.
1934
1935     + A memo entry isn't referenced before it's defined.
1936
1937     + The markobject isn't stored in the memo.
1938
1939     + A memo entry isn't redefined.
1940     """
1941
1942     # Most of the hair here is for sanity checks, but most of it is needed
1943     # anyway to detect when a protocol 0 POP takes a MARK off the stack
1944     # (which in turn is needed to indent MARK blocks correctly).
1945
1946     stack = []          # crude emulation of unpickler stack
1947     if memo is None:
1948         memo = {}       # crude emulation of unpicker memo
1949     maxproto = -1       # max protocol number seen
1950     markstack = []      # bytecode positions of MARK opcodes
1951     indentchunk = ' ' * indentlevel
1952     errormsg = None
1953     for opcode, arg, pos in genops(pickle):
1954         if pos is not None:
1955             print("%5d:" % pos, end=' ', file=out)
1956
1957         line = "%-4s %s%s" % (repr(opcode.code)[1:-1],
1958                               indentchunk * len(markstack),
1959                               opcode.name)
1960
1961         maxproto = max(maxproto, opcode.proto)
1962         before = opcode.stack_before    # don't mutate
1963         after = opcode.stack_after      # don't mutate
1964         numtopop = len(before)
1965
1966         # See whether a MARK should be popped.
1967         markmsg = None
1968         if markobject in before or (opcode.name == "POP" and
1969                                     stack and
1970                                     stack[-1] is markobject):
1971             assert markobject not in after
1972             if __debug__:
1973                 if markobject in before:
1974                     assert before[-1] is stackslice
1975             if markstack:
1976                 markpos = markstack.pop()
1977                 if markpos is None:
1978                     markmsg = "(MARK at unknown opcode offset)"
1979                 else:
1980                     markmsg = "(MARK at %d)" % markpos
1981                 # Pop everything at and after the topmost markobject.
1982                 while stack[-1] is not markobject:
1983                     stack.pop()
1984                 stack.pop()
1985                 # Stop later code from popping too much.
1986                 try:
1987                     numtopop = before.index(markobject)
1988                 except ValueError:
1989                     assert opcode.name == "POP"
1990                     numtopop = 0
1991             else:
1992                 errormsg = markmsg = "no MARK exists on stack"
1993
1994         # Check for correct memo usage.
1995         if opcode.name in ("PUT", "BINPUT", "LONG_BINPUT"):
1996             assert arg is not None
1997             if arg in memo:
1998                 errormsg = "memo key %r already defined" % arg
1999             elif not stack:
2000                 errormsg = "stack is empty -- can't store into memo"
2001             elif stack[-1] is markobject:
2002                 errormsg = "can't store markobject in the memo"
2003             else:
2004                 memo[arg] = stack[-1]
2005
2006         elif opcode.name in ("GET", "BINGET", "LONG_BINGET"):
2007             if arg in memo:
2008                 assert len(after) == 1
2009                 after = [memo[arg]]     # for better stack emulation
2010             else:
2011                 errormsg = "memo key %r has never been stored into" % arg
2012
2013         if arg is not None or markmsg:
2014             # make a mild effort to align arguments
2015             line += ' ' * (10 - len(opcode.name))
2016             if arg is not None:
2017                 line += ' ' + repr(arg)
2018             if markmsg:
2019                 line += ' ' + markmsg
2020         print(line, file=out)
2021
2022         if errormsg:
2023             # Note that we delayed complaining until the offending opcode
2024             # was printed.
2025             raise ValueError(errormsg)
2026
2027         # Emulate the stack effects.
2028         if len(stack) < numtopop:
2029             raise ValueError("tries to pop %d items from stack with "
2030                              "only %d items" % (numtopop, len(stack)))
2031         if numtopop:
2032             del stack[-numtopop:]
2033         if markobject in after:
2034             assert markobject not in before
2035             markstack.append(pos)
2036
2037         stack.extend(after)
2038
2039     print("highest protocol among opcodes =", maxproto, file=out)
2040     if stack:
2041         raise ValueError("stack not empty after STOP: %r" % stack)
2042
2043 # For use in the doctest, simply as an example of a class to pickle.
2044 class _Example:
2045     def __init__(self, value):
2046         self.value = value
2047
2048 _dis_test = r"""
2049 >>> import pickle
2050 >>> x = [1, 2, (3, 4), {b'abc': "def"}]
2051 >>> pkl0 = pickle.dumps(x, 0)
2052 >>> dis(pkl0)
2053     0: (    MARK
2054     1: l        LIST       (MARK at 0)
2055     2: p    PUT        0
2056     5: L    LONG       1
2057     9: a    APPEND
2058    10: L    LONG       2
2059    14: a    APPEND
2060    15: (    MARK
2061    16: L        LONG       3
2062    20: L        LONG       4
2063    24: t        TUPLE      (MARK at 15)
2064    25: p    PUT        1
2065    28: a    APPEND
2066    29: (    MARK
2067    30: d        DICT       (MARK at 29)
2068    31: p    PUT        2
2069    34: c    GLOBAL     '__builtin__ bytes'
2070    53: p    PUT        3
2071    56: (    MARK
2072    57: (        MARK
2073    58: l            LIST       (MARK at 57)
2074    59: p        PUT        4
2075    62: L        LONG       97
2076    67: a        APPEND
2077    68: L        LONG       98
2078    73: a        APPEND
2079    74: L        LONG       99
2080    79: a        APPEND
2081    80: t        TUPLE      (MARK at 56)
2082    81: p    PUT        5
2083    84: R    REDUCE
2084    85: p    PUT        6
2085    88: V    UNICODE    'def'
2086    93: p    PUT        7
2087    96: s    SETITEM
2088    97: a    APPEND
2089    98: .    STOP
2090 highest protocol among opcodes = 0
2091
2092 Try again with a "binary" pickle.
2093
2094 >>> pkl1 = pickle.dumps(x, 1)
2095 >>> dis(pkl1)
2096     0: ]    EMPTY_LIST
2097     1: q    BINPUT     0
2098     3: (    MARK
2099     4: K        BININT1    1
2100     6: K        BININT1    2
2101     8: (        MARK
2102     9: K            BININT1    3
2103    11: K            BININT1    4
2104    13: t            TUPLE      (MARK at 8)
2105    14: q        BINPUT     1
2106    16: }        EMPTY_DICT
2107    17: q        BINPUT     2
2108    19: c        GLOBAL     '__builtin__ bytes'
2109    38: q        BINPUT     3
2110    40: (        MARK
2111    41: ]            EMPTY_LIST
2112    42: q            BINPUT     4
2113    44: (            MARK
2114    45: K                BININT1    97
2115    47: K                BININT1    98
2116    49: K                BININT1    99
2117    51: e                APPENDS    (MARK at 44)
2118    52: t            TUPLE      (MARK at 40)
2119    53: q        BINPUT     5
2120    55: R        REDUCE
2121    56: q        BINPUT     6
2122    58: X        BINUNICODE 'def'
2123    66: q        BINPUT     7
2124    68: s        SETITEM
2125    69: e        APPENDS    (MARK at 3)
2126    70: .    STOP
2127 highest protocol among opcodes = 1
2128
2129 Exercise the INST/OBJ/BUILD family.
2130
2131 >>> import pickletools
2132 >>> dis(pickle.dumps(pickletools.dis, 0))
2133     0: c    GLOBAL     'pickletools dis'
2134    17: p    PUT        0
2135    20: .    STOP
2136 highest protocol among opcodes = 0
2137
2138 >>> from pickletools import _Example
2139 >>> x = [_Example(42)] * 2
2140 >>> dis(pickle.dumps(x, 0))
2141     0: (    MARK
2142     1: l        LIST       (MARK at 0)
2143     2: p    PUT        0
2144     5: c    GLOBAL     'copy_reg _reconstructor'
2145    30: p    PUT        1
2146    33: (    MARK
2147    34: c        GLOBAL     'pickletools _Example'
2148    56: p        PUT        2
2149    59: c        GLOBAL     '__builtin__ object'
2150    79: p        PUT        3
2151    82: N        NONE
2152    83: t        TUPLE      (MARK at 33)
2153    84: p    PUT        4
2154    87: R    REDUCE
2155    88: p    PUT        5
2156    91: (    MARK
2157    92: d        DICT       (MARK at 91)
2158    93: p    PUT        6
2159    96: V    UNICODE    'value'
2160   103: p    PUT        7
2161   106: L    LONG       42
2162   111: s    SETITEM
2163   112: b    BUILD
2164   113: a    APPEND
2165   114: g    GET        5
2166   117: a    APPEND
2167   118: .    STOP
2168 highest protocol among opcodes = 0
2169
2170 >>> dis(pickle.dumps(x, 1))
2171     0: ]    EMPTY_LIST
2172     1: q    BINPUT     0
2173     3: (    MARK
2174     4: c        GLOBAL     'copy_reg _reconstructor'
2175    29: q        BINPUT     1
2176    31: (        MARK
2177    32: c            GLOBAL     'pickletools _Example'
2178    54: q            BINPUT     2
2179    56: c            GLOBAL     '__builtin__ object'
2180    76: q            BINPUT     3
2181    78: N            NONE
2182    79: t            TUPLE      (MARK at 31)
2183    80: q        BINPUT     4
2184    82: R        REDUCE
2185    83: q        BINPUT     5
2186    85: }        EMPTY_DICT
2187    86: q        BINPUT     6
2188    88: X        BINUNICODE 'value'
2189    98: q        BINPUT     7
2190   100: K        BININT1    42
2191   102: s        SETITEM
2192   103: b        BUILD
2193   104: h        BINGET     5
2194   106: e        APPENDS    (MARK at 3)
2195   107: .    STOP
2196 highest protocol among opcodes = 1
2197
2198 Try "the canonical" recursive-object test.
2199
2200 >>> L = []
2201 >>> T = L,
2202 >>> L.append(T)
2203 >>> L[0] is T
2204 True
2205 >>> T[0] is L
2206 True
2207 >>> L[0][0] is L
2208 True
2209 >>> T[0][0] is T
2210 True
2211 >>> dis(pickle.dumps(L, 0))
2212     0: (    MARK
2213     1: l        LIST       (MARK at 0)
2214     2: p    PUT        0
2215     5: (    MARK
2216     6: g        GET        0
2217     9: t        TUPLE      (MARK at 5)
2218    10: p    PUT        1
2219    13: a    APPEND
2220    14: .    STOP
2221 highest protocol among opcodes = 0
2222
2223 >>> dis(pickle.dumps(L, 1))
2224     0: ]    EMPTY_LIST
2225     1: q    BINPUT     0
2226     3: (    MARK
2227     4: h        BINGET     0
2228     6: t        TUPLE      (MARK at 3)
2229     7: q    BINPUT     1
2230     9: a    APPEND
2231    10: .    STOP
2232 highest protocol among opcodes = 1
2233
2234 Note that, in the protocol 0 pickle of the recursive tuple, the disassembler
2235 has to emulate the stack in order to realize that the POP opcode at 16 gets
2236 rid of the MARK at 0.
2237
2238 >>> dis(pickle.dumps(T, 0))
2239     0: (    MARK
2240     1: (        MARK
2241     2: l            LIST       (MARK at 1)
2242     3: p        PUT        0
2243     6: (        MARK
2244     7: g            GET        0
2245    10: t            TUPLE      (MARK at 6)
2246    11: p        PUT        1
2247    14: a        APPEND
2248    15: 0        POP
2249    16: 0        POP        (MARK at 0)
2250    17: g    GET        1
2251    20: .    STOP
2252 highest protocol among opcodes = 0
2253
2254 >>> dis(pickle.dumps(T, 1))
2255     0: (    MARK
2256     1: ]        EMPTY_LIST
2257     2: q        BINPUT     0
2258     4: (        MARK
2259     5: h            BINGET     0
2260     7: t            TUPLE      (MARK at 4)
2261     8: q        BINPUT     1
2262    10: a        APPEND
2263    11: 1        POP_MARK   (MARK at 0)
2264    12: h    BINGET     1
2265    14: .    STOP
2266 highest protocol among opcodes = 1
2267
2268 Try protocol 2.
2269
2270 >>> dis(pickle.dumps(L, 2))
2271     0: \x80 PROTO      2
2272     2: ]    EMPTY_LIST
2273     3: q    BINPUT     0
2274     5: h    BINGET     0
2275     7: \x85 TUPLE1
2276     8: q    BINPUT     1
2277    10: a    APPEND
2278    11: .    STOP
2279 highest protocol among opcodes = 2
2280
2281 >>> dis(pickle.dumps(T, 2))
2282     0: \x80 PROTO      2
2283     2: ]    EMPTY_LIST
2284     3: q    BINPUT     0
2285     5: h    BINGET     0
2286     7: \x85 TUPLE1
2287     8: q    BINPUT     1
2288    10: a    APPEND
2289    11: 0    POP
2290    12: h    BINGET     1
2291    14: .    STOP
2292 highest protocol among opcodes = 2
2293 """
2294
2295 _memo_test = r"""
2296 >>> import pickle
2297 >>> import io
2298 >>> f = io.BytesIO()
2299 >>> p = pickle.Pickler(f, 2)
2300 >>> x = [1, 2, 3]
2301 >>> p.dump(x)
2302 >>> p.dump(x)
2303 >>> f.seek(0)
2304 0
2305 >>> memo = {}
2306 >>> dis(f, memo=memo)
2307     0: \x80 PROTO      2
2308     2: ]    EMPTY_LIST
2309     3: q    BINPUT     0
2310     5: (    MARK
2311     6: K        BININT1    1
2312     8: K        BININT1    2
2313    10: K        BININT1    3
2314    12: e        APPENDS    (MARK at 5)
2315    13: .    STOP
2316 highest protocol among opcodes = 2
2317 >>> dis(f, memo=memo)
2318    14: \x80 PROTO      2
2319    16: h    BINGET     0
2320    18: .    STOP
2321 highest protocol among opcodes = 2
2322 """
2323
2324 __test__ = {'disassembler_test': _dis_test,
2325             'disassembler_memo_test': _memo_test,
2326            }
2327
2328 def _test():
2329     import doctest
2330     return doctest.testmod()
2331
2332 if __name__ == "__main__":
2333     _test()