You are here: Home ‣ Dive Into Python 3 ‣

Difficulty level: ♦♦♦♦♢

Serializing Python Objects

❝ FIXME ❞
— FIXME

Diving In

FIXME

Open the Python Shell and define the following variable:

>>> shell = 1

Keep that window open. Now open another Python Shell and define the following variable:

>>> shell = 2

Throughout this chapter, I will use the shell variable to indicate which Python Shell is being used in each example.

⁂

Serializing Simple Python Objects

The concept of serialization is simple. You have a data structure in memory that you want to save, reuse, or send to someone else. How would you do that? Well, that depends on how you want to save it, how you want to reuse it, and to whom you want to send it. Many games allow you to save your progress when you quit the game and pick up where you left off when you relaunch the game. (Actually, many non-gaming applications do this as well.) In this case, a data structure that captures “your progress so far” needs to be stored on disk when you quit, then loaded from disk when you relaunch. The data is only meant to be used by the same program that created it, never sent over a network, and never read by anything other than the program that created it. Therefore, the interoperability issues are limited to ensuring that later versions of the program can read data written by earlier versions.

For cases like this, the pickle module is ideal. It’s part of the Python standard library, so it’s always available. It’s fast; the bulk of it is written in C, like the Python interpreter itself. It can store arbitrarily complex Python data structures.

What can the pickle module store?

All the native datatypes that Python supports: booleans, integers, floating point numbers, complex numbers, strings, bytes objects, byte arrays, and None.
Lists, tuples, dictionaries, and sets containing any combination of native datatypes.
Lists, tuples, dictionaries, and sets containing any combination of lists, tuples, dictionaries, and sets containing any combination of native datatypes (and so on, to the maximum nesting level that Python supports).
Functions, classes, and instances of classes (with caveats that I’ll explain shortly).

If this isn’t enough for you, the pickle module is also extensible, as you’ll see later in this chapter.

Saving to a File

The pickle module works with data structures. Let’s build one.

>>> shell                                                                                              ①
1
>>> entry = {}                                                                                         ②
>>> entry['title'] = 'Dive into history, 2009 edition'
>>> entry['article_link'] = 'http://diveintomark.org/archives/2009/03/27/dive-into-history-2009-edition'
>>> entry['comments_link'] = None
>>> entry['internal_id'] = b'\xde\xd5\xb4\xf8'
>>> entry['tags'] = ('diveintopython', 'docbook', 'html')
>>> entry['published'] = True
>>> import time
>>> entry['published_date'] = time.strptime('Fri Mar 27 22:20:42 2009')                                ③
>>> entry['published_date']
time.struct_time(tm_year=2009, tm_mon=3, tm_mday=27, tm_hour=22, tm_min=20, tm_sec=42, tm_wday=4, tm_yday=86, tm_isdst=-1)

Follow along in Python Shell #1.
The idea here is to build a Python dictionary that could represent something useful, like an entry in an Atom feed. But I also want to ensure that it contains several different types of data, to show off the pickle module. Don’t read too much into these values.
The time module contains a data structure (time_struct) to represent a point in time (accurate to one millisecond) and functions to manipulate time structs. The strptime() function takes a formatted string an converts it to a time_struct. This string is in the default format, but you can control that with format codes. See the time module for more details.

That’s a handsome-looking Python dictionary. Let’s save it to a file.

>>> shell                                    ①
1
>>> import pickle
>>> with open('entry.pickle', 'wb') as f:    ②
...     pickle.dump(entry, f)                ③
...

This is still in Python Shell #1.
Use the open() function to open a file. Set the file mode to 'wb' to open the file for writing in binary mode. Wrap it in a with statement to ensure the file is closed automatically when you’re done with it.
The dump() function in the pickle module takes a serializable Python data structure, serializes it into a binary, Python-specific format using the latest version of the pickle protocol, and saves it to an open file.

That last sentence was pretty important.

The pickle module takes a Python data structure and saves it to a file.
To do this, it serializes the data structure using a data format called “the pickle protocol.”
The pickle protocol is Python-specific; there is no guarantee of cross-language compatibility. You probably couldn’t take the entry.pickle file you just created and do anything useful with it in Perl, PHP, Java, or any other language.
Not every Python data structure can be serialized by the pickle module. The pickle protocol has changed several times as new data types have been added to the Python language, but there are still limitations.
As a result of these changes, there is no guarantee of compatibility between different versions of Python itself. Newer versions of Python support the older serialization formats, but older versions of Python do not support newer formats (since they don’t support the newer data types).
Unless you specify otherwise, the functions in the pickle module will use the latest version of the pickle protocol. This ensures that you have maximum flexibility in the types of data you can serialize, but it also means that the resulting file will not be readable by older versions of Python that do not support the latest version of the pickle protocol.
The latest version of the pickle protocol is a binary format. Be sure to open your pickle files in binary mode, or the data will get corrupted during writing.

Loading from a File

Now switch to your second Python Shell — i.e. not the one where you created the entry dictionary.

>>> shell                                    ①
2
>>> entry                                    ②
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'entry' is not defined
>>> import pickle
>>> with open('entry.pickle', 'rb') as f:    ③
...     entry = pickle.load(f)               ④
... 
>>> entry                                    ⑤
{'comments_link': None,
 'internal_id': b'\xde\xd5\xb4\xf8',
 'title': 'Dive into history, 2009 edition',
 'tags': ('diveintopython', 'docbook', 'html'),
 'article_link':
 'http://diveintomark.org/archives/2009/03/27/dive-into-history-2009-edition',
 'published_date': time.struct_time(tm_year=2009, tm_mon=3, tm_mday=27, tm_hour=22, tm_min=20, tm_sec=42, tm_wday=4, tm_yday=86, tm_isdst=-1),
 'published': True}

This is Python Shell #2.
There is no entry variable defined here. You defined an entry variable in Python Shell #1, but that’s a completely different environment with its own state.
Open the entry.pickle file you created in Python Shell #1. The pickle module uses a binary data format, so you should always open pickle files in binary mode.
The pickle.load() function takes a stream object, reads the serialized data from the stream, creates a new Python object, recreates the serialized data in the new Python object, and returns the new Python object.
Now the entry variable is a dictionary with familiar-looking keys and values.

The pickle.dump() / pickle.load() cycle results in an identical copy of the original data structure.

>>> shell                                    ①
1
>>> with open('entry.pickle', 'rb') as f:    ②
...     entry2 = pickle.load(f)              ③
... 
>>> entry2 == entry                          ④
True
>>> entry2['tags']                           ⑤
('diveintopython', 'docbook', 'html')
>>> entry2['internal_id']
b'\xde\xd5\xb4\xf8'

Switch back to Python Shell #1.
Open the entry.pickle file.
Load the serialized data into a new variable, entry2.
Python confirms that the two dictionaries, entry and entry2, are identical. In this shell, you built entry from the ground up, starting with an empty dictionary and manually assigning values to specific keys. You serialized this dictionary and stored it in the entry.pickle file. Now you’ve read the serialized data from that file and created a perfect replica of the original data structure.
For reasons that will become clear later in this chapter, I want to point out that the value of the 'tags' key is a tuple, and the value of the 'internal_id' key is a bytes object.

Pickling Without a File

The examples in the previous section showed how to serialize a Python object directly to a file on disk. But what if you don’t want or need a file? You can also serialize to a bytes object in memory.

>>> shell
1
>>> b = pickle.dumps(entry)     ①
>>> type(b)                     ②
<class 'bytes'>
>>> entry3 = pickle.loads(b)    ③
>>> entry3 == entry             ④
True

The pickle.dumps() function (note the 's' at the end of the function name) performs the same serialization as the pickle.dump() function. Instead of taking a stream object and writing the serialized data to a file on disk, it simply returns the serialized data.
Since the pickle protocol uses a binary data format, the pickle.dumps() function returns a bytes object.
The pickle.loads() function (again, note the 's' at the end of the function name) performs the same deserialization as the pickle.load() function. Instead of taking a stream object and reading the serialized data from a file, it takes a bytes object containing serialized data, such as the one returned by the pickle.dumps() function.
The end result is the same: a perfect replica of the original dictionary.

Bytes and Strings Rear Their Ugly Heads Again

The pickle protocol has been around for many years, and it has matured as Python itself has matured. There are now four different versions of the pickle protocol.

Python 1.x had two pickle protocols, a text-based format (“version 0”) and a binary format (“version 1”).
Python 2.3 introduced a new pickle protocol (“version 2”) to handle new functionality in Python class objects. It is a binary format.
Python 3.0 introduced another pickle protocol (“version 3”) with explicit support for bytes objects and byte arrays. It is a binary format.

Oh look, the difference between bytes and strings rears its ugly head again. (If you’re surprised, you haven’t been paying attention.) What this means in practice is that, while Python 3 can read data pickled with protocol version 2, Python 2 can not read data pickled with protocol version 3.

Debugging Pickle Files

What does the pickle protocol look like? Let’s jump out of the Python Shell for a moment and take a look at that entry.pickle file we created.

you@localhost:~/diveintopython3/examples$ ls -l entry.pickle
-rw-r--r-- 1 you  you  324 Aug  3 13:34 entry.pickle
you@localhost:~/diveintopython3/examples$ cat entry.pickle
comments_linkqNXtagsqXdiveintopythonqXdocbookqXhtmlq?qX publishedq?
XlinkXJhttp://diveintomark.org/archives/2009/03/27/dive-into-history-2009-edition
q   Xpublished_dateq
ctime
struct_time
?qRqXtitleqXDive into history, 2009 editionqu.

That wasn’t terribly helpful. You can see the strings, but other datatypes end up as unprintable (or at least unreadable) characters. Fields are not obviously delimited by tabs or spaces. This is not a format you would want to debug by yourself.

>>> shell
1
>>> import pickletools
>>> with open('entry.pickle', 'rb') as f:
...     pickletools.dis(f)
    0: \x80 PROTO      3
    2: }    EMPTY_DICT
    3: q    BINPUT     0
    5: (    MARK
    6: X        BINUNICODE 'published_date'
   25: q        BINPUT     1
   27: c        GLOBAL     'time struct_time'
   45: q        BINPUT     2
   47: (        MARK
   48: M            BININT2    2009
   51: K            BININT1    3
   53: K            BININT1    27
   55: K            BININT1    22
   57: K            BININT1    20
   59: K            BININT1    42
   61: K            BININT1    4
   63: K            BININT1    86
   65: J            BININT     -1
   70: t            TUPLE      (MARK at 47)
   71: q        BINPUT     3
   73: }        EMPTY_DICT
   74: q        BINPUT     4
   76: \x86     TUPLE2
   77: q        BINPUT     5
   79: R        REDUCE
   80: q        BINPUT     6
   82: X        BINUNICODE 'comments_link'
  100: q        BINPUT     7
  102: N        NONE
  103: X        BINUNICODE 'internal_id'
  119: q        BINPUT     8
  121: C        SHORT_BINBYTES 'ÞÕ´ø'
  127: q        BINPUT     9
  129: X        BINUNICODE 'tags'
  138: q        BINPUT     10
  140: X        BINUNICODE 'diveintopython'
  159: q        BINPUT     11
  161: X        BINUNICODE 'docbook'
  173: q        BINPUT     12
  175: X        BINUNICODE 'html'
  184: q        BINPUT     13
  186: \x87     TUPLE3
  187: q        BINPUT     14
  189: X        BINUNICODE 'title'
  199: q        BINPUT     15
  201: X        BINUNICODE 'Dive into history, 2009 edition'
  237: q        BINPUT     16
  239: X        BINUNICODE 'article_link'
  256: q        BINPUT     17
  258: X        BINUNICODE 'http://diveintomark.org/archives/2009/03/27/dive-into-history-2009-edition'
  337: q        BINPUT     18
  339: X        BINUNICODE 'published'
  353: q        BINPUT     19
  355: \x88     NEWTRUE
  356: u        SETITEMS   (MARK at 5)
  357: .    STOP
highest protocol among opcodes = 3

The most interesting piece of information in that disassembly is on the last line, because it includes the version of the pickle protocol with which this file was saved. There is no explicit version marker in the pickle protocol. To determine which protocol version was used to store a pickle file, you need to look at the markers (“opcodes”) within the pickled data and use hard-coded knowledge of which opcodes were introduced with each version of the pickle protocol. The pickle.dis() function does exactly that, and it prints the result in the last line of the disassembly output. Here is a function that returns just the version number, without printing anything:

[download pickleversion.py]

import pickletools

def protocol_version(file_object):
    maxproto = -1
    for opcode, arg, pos in pickletools.genops(file_object):
        maxproto = max(maxproto, opcode.proto)
    return maxproto

And here it is in action:

>>> import pickleversion
>>> with open('entry.pickle', 'rb') as f:
...     v = pickleversion.protocol_version(f)
>>> v
3

⁂

Serializing Complex Python Objects

FIXME - discussion of pickling class instances, stateful objects, __getstate__ and __setstate__, links to http://docs.python.org/3.1/library/pickle.html#pickle-inst and http://docs.python.org/3.1/library/pickle.html#pickle-state

Serializing Python Objects to be Read by Other Languages

The data format used by the pickle module is Python-specific. It makes no attempt to be compatible with other programming languages. If cross-language compatibility is one of your requirements, you need to look at other serialization formats. One such format is JSON. “JSON” stands for “JavaScript Object Notation,” but don’t let the name fool you — JSON is explicitly designed to be usable across multiple programming languages.

Python 3 includes a json module in the standard library. Like the pickle module, the json module has functions for serializing data structures, storing the serialized data on disk, loading serialized data from disk, and unserializing the data back into a new Python object. But there are some important differences, too. First of all, the JSON data format is text-based, not binary. RFC 4627 defines the JSON format and how different types of data must be encoded as text. For example, a boolean value is stored as either the five-character string 'false' or the four-character string 'true'. All JSON values are case-sensitive.

Second, as with any text-based format, there is the issue of whitespace. JSON allows arbitrary amounts of whitespace (spaces, tabs, carriage returns, and line feeds) between values. This whitespace is “insignificant,” which means that JSON encoders can add as much or as little whitespace as they like, and JSON decoders are required to ignore the whitespace between values. This allows you to “pretty-print” your JSON data, nicely nesting values within values at different indentation levels so you can read it in a standard browser or text editor. Python’s json module has options for pretty-printing during encoding.

Third, there’s the perennial problem of character encoding. JSON encodes values as plain text, but as you know, there ain’t no such thing as “plain text.” JSON must be stored in a Unicode encoding (UTF-32, UTF-16, or the default, UTF-8), and section 3 of RFC 4627 defines how to tell which encoding is being used.

Mapping of Python Datatypes to JSON

Since JSON is not Python-specific, there are some mismatches in its coverage of Python datatypes. Some of them are simply naming differences, but there is one important datatype that is completely missing. See if you can spot it:

Notes	JSON	Python 3
	object	dictionary
	array	list
	string	string
	integer	integer
	real number	float
	`true`	`True`
	`false`	`False`
	`null`	`None`

Did you notice what was missing? Bytes! JSON has no support for bytes objects or byte arrays.

Serializing Datatypes Unsupported by JSON

Even if JSON has no built-in support for bytes, that doesn’t mean you can’t serialize bytes objects. The json module provides extensibility hooks for encoding and decoding unknown datatypes. (By “unknown,” I mean “not defined in JSON.” Obviously the json module knows about byte arrays, but it’s constrained by the limitations of the JSON specification.) If you want to encode bytes or other datatypes that JSON doesn’t support natively, you need to provide custom encoders and decoders for those types.

>>> shell                                                 ①
1
>>> entry
FIXME
>>> import json
>>> with open('entry.json', 'w', encoding='utf-8') as f:  ②
...     json.dump(entry, f)
... 
Traceback (most recent call last):
  File "<stdin>", line 5, in <module>
  File "C:\Python31\lib\json\__init__.py", line 178, in dump
    for chunk in iterable:
  File "C:\Python31\lib\json\encoder.py", line 408, in _iterencode
    for chunk in _iterencode_dict(o, _current_indent_level):
  File "C:\Python31\lib\json\encoder.py", line 382, in _iterencode_dict
    for chunk in chunks:
  File "C:\Python31\lib\json\encoder.py", line 416, in _iterencode
    o = _default(o)
  File "C:\Python31\lib\json\encoder.py", line 170, in default
    raise TypeError(repr(o) + " is not JSON serializable")
TypeError: b'\xde\xd5\xb4\xf8' is not JSON serializable

FIXME
FIXME

FIXME

# customserializer.py
def to_json(python_object):
    if isinstance(python_object, bytes):
        return {'__class__': 'bytes',
                '__value__': list(python_object)}
    raise TypeError(repr(python_object) + ' is not JSON serializable')

FIXME

FIXME

>>> shell
1
>>> import customserializer
>>> with open('entry.json', 'w', encoding='utf-8') as f:
...     json.dump(entry, default = customserializer.to_json)
... 
Traceback (most recent call last):
  File "<stdin>", line 9, in <module>
    json.dump(entry, f, default=customserializer.to_json)
  File "C:\Python31\lib\json\__init__.py", line 178, in dump
    for chunk in iterable:
  File "C:\Python31\lib\json\encoder.py", line 408, in _iterencode
    for chunk in _iterencode_dict(o, _current_indent_level):
  File "C:\Python31\lib\json\encoder.py", line 382, in _iterencode_dict
    for chunk in chunks:
  File "C:\Python31\lib\json\encoder.py", line 416, in _iterencode
    o = _default(o)
  File "/Users/pilgrim/diveintopython3/examples/customserializer.py", line 12, in to_json
    raise TypeError(repr(python_object) + ' is not JSON serializable')
TypeError: time.struct_time(tm_year=2009, tm_mon=3, tm_mday=27, tm_hour=22, tm_min=20, tm_sec=42, tm_wday=4, tm_yday=86, tm_isdst=-1) is not JSON serializable

FIXME

FIXME

# customserializer.py
def to_json(python_object):
    if isinstance(python_object, time.struct_time):
        return {'__class__': 'time.asctime',
                '__value__': time.asctime(python_object)}
    if isinstance(python_object, bytes):
        return {'__class__': 'bytes',
                '__value__': list(python_object)}
    raise TypeError(repr(python_object) + ' is not JSON serializable')

FIXME

FIXME

>>> shell
1
>>> with open('entry.json', 'w', encoding='utf-8') as f:
...     json.dump(entry, default = customserializer.to_json)
...

FIXME

FIXME

you@localhost:~/diveintopython3/examples$ ls -l example.json
-rw-r--r-- 1 you  you  391 Aug  3 13:34 entry.json
you@localhost:~/diveintopython3/examples$ cat example.json
{"published_date": {"__class__": "time.asctime", "__value__": "Fri Mar 27 22:20:42 2009"},
"comments_link": null, "internal_id": {"__class__": "bytes", "__value__": [222, 213, 180, 248]},
"tags": ["diveintopython", "docbook", "html"], "title": "Dive into history, 2009 edition",
"article_link": "http://diveintomark.org/archives/2009/03/27/dive-into-history-2009-edition",
"published": true}

FIXME

FIXME

>>> shell
2
>>> del entry
>>> entry
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'entry' is not defined
>>> import json
>>> with open('entry.json', 'r', encoding='utf-8') as f:
...     entry = json.load(f)
... 
>>> entry
{'comments_link': None,
 'internal_id': {'__class__': 'bytes', '__value__': [222, 213, 180, 248]},
 'title': 'Dive into history, 2009 edition',
 'tags': ['diveintopython', 'docbook', 'html'],
 'article_link': 'http://diveintomark.org/archives/2009/03/27/dive-into-history-2009-edition',
 'published_date': {'__class__': 'time.asctime', '__value__': 'Fri Mar 27 22:20:42 2009'},
 'published': True}

FIXME

FIXME

# customserializer.py
def from_json(json_object):
    if '__class__' in json_object:
        if json_object['__class__'] == 'time.asctime':
            return time.strptime(json_object['__value__'])
        if json_object['__class__'] == 'bytes':
            return bytes(json_object['__value__'])
    return json_object

>>> shell
2
>>> import customserializer
>>> with open('entry.json', 'r', encoding='utf-8') as f:
...     entry = json.load(f, object_hook = customserializer.from_json)
... 
>>> entry
{'comments_link': None,
 'internal_id': b'\xde\xd5\xb4\xf8',
 'title': 'Dive into history, 2009 edition',
 'tags': ['diveintopython', 'docbook', 'html'],
 'article_link': 'http://diveintomark.org/archives/2009/03/27/dive-into-history-2009-edition',
 'published_date': time.struct_time(tm_year=2009, tm_mon=3, tm_mday=27, tm_hour=22, tm_min=20, tm_sec=42, tm_wday=4, tm_yday=86, tm_isdst=-1),
 'published': True}

FIXME

FIXME

>>> shell
1
>>> import customserializer
>>> with open('entry.json', 'r', encoding='utf-8') as f:
...     entry2 = json.load(f, object_hook = customserializer.from_json)
... 
>>> entry2 == entry
False
>>> entry['tags']
('diveintopython', 'docbook', 'html')
>>> entry2['tags']
['diveintopython', 'docbook', 'html']

FIXME

FIXME

Serializing Python Objects

Diving In

Serializing Simple Python Objects

Saving to a File

Loading from a File

Pickling Without a File

Bytes and Strings Rear Their Ugly Heads Again

Debugging Pickle Files

Serializing Complex Python Objects

Serializing Python Objects to be Read by Other Languages

Mapping of Python Datatypes to JSON

Serializing Datatypes Unsupported by JSON

Further Reading