Files
dive-into-python3/serializing.html
T
2009-08-09 20:45:09 -07:00

453 lines
22 KiB
HTML

<!DOCTYPE html>
<head>
<meta charset=utf-8>
<title>Serializing Python Objects - Dive into Python 3</title>
<!--[if IE]><script src=j/html5.js></script><![endif]-->
<link rel=stylesheet href=dip3.css>
<style>
body{counter-reset:h1 13}
</style>
<link rel=stylesheet media='only screen and (max-device-width: 480px)' href=mobile.css>
<link rel=stylesheet media=print href=print.css>
<meta name=viewport content='initial-scale=1.0'>
</head>
<form action=http://www.google.com/cse><div><input type=hidden name=cx value=014021643941856155761:l5eihuescdw><input type=hidden name=ie value=UTF-8>&nbsp;<input name=q size=25>&nbsp;<input type=submit name=root value=Search></div></form>
<p>You are here: <a href=index.html>Home</a> <span class=u>&#8227;</span> <a href=table-of-contents.html#serializing>Dive Into Python 3</a> <span class=u>&#8227;</span>
<p id=level>Difficulty level: <span class=u title=advanced>&#x2666;&#x2666;&#x2666;&#x2666;&#x2662;</span>
<h1>Serializing Python Objects</h1>
<blockquote class=q>
<p><span class=u>&#x275D;</span> FIXME <span class=u>&#x275E;</span><br>&mdash; FIXME
</blockquote>
<p id=toc>&nbsp;
<h2 id=divingin>Diving In</h2>
<p class=f>FIXME
<p>Open the Python Shell and define the following variable:
<pre class='nd screen'>
<samp class=p>>>> </samp><kbd class=pp>shell = 1</kbd></pre>
<p>Keep that window open. Now open another Python Shell and define the following variable:
<pre class='nd screen'>
<samp class=p>>>> </samp><kbd class=pp>shell = 2</kbd></pre>
<p>Throughout this chapter, I will use the <code>shell</code> variable to indicate which Python Shell is being used in each example.
<p class=a>&#x2042;
<h2 id=pickle-simple>Serializing Simple Python Objects</h2>
<p>FIXME - introduction to pickle module, concepts, what datatypes can be pickled w/o additional work
<h3 id=dump>Saving to a File</h3>
<p>The <code>pickle</code> module works with data structures. Let&#8217;s build one.
<pre class=screen>
<a><samp class=p>>>> </samp><kbd class=pp>shell</kbd> <span class=u>&#x2460;</span></a>
<samp class=pp>1</samp>
<a><samp class=p>>>> </samp><kbd class=pp>entry = {}</kbd> <span class=u>&#x2461;</span></a>
<samp class=p>>>> </samp><kbd class=pp>entry['title'] = 'Dive into history, 2009 edition'</kbd>
<samp class=p>>>> </samp><kbd class=pp>entry['article_link'] = 'http://diveintomark.org/archives/2009/03/27/dive-into-history-2009-edition'</kbd>
<samp class=p>>>> </samp><kbd class=pp>entry['comments_link'] = None</kbd>
<samp class=p>>>> </samp><kbd class=pp>entry['internal_id'] = b'\xde\xd5\xb4\xf8'</kbd>
<samp class=p>>>> </samp><kbd class=pp>entry['tags'] = ('diveintopython', 'docbook', 'html')</kbd>
<samp class=p>>>> </samp><kbd class=pp>entry['published'] = True</kbd>
<samp class=p>>>> </samp><kbd class=pp>import time</kbd>
<a><samp class=p>>>> </samp><kbd class=pp>entry['published_date'] = time.strptime('Fri Mar 27 22:20:42 2009')</kbd> <span class=u>&#x2462;</span></a>
<samp class=p>>>> </samp><kbd class=pp>entry['published_date']</kbd>
<samp class=pp>time.struct_time(tm_year=2009, tm_mon=3, tm_mday=27, tm_hour=22, tm_min=20, tm_sec=42, tm_wday=4, tm_yday=86, tm_isdst=-1)</samp></pre>
<ol>
<li>Follow along in Python Shell #1.
<li>The idea here is to build a Python dictionary that could represent something useful, like an <a href=xml.html#xml-structure>entry in an Atom feed</a>. But I also want to ensure that it contains several different types of data, to show off the <code>pickle</code> module. Don&#8217;t read too much into these values.
<li>The <code>time</code> module contains a data structure (<code>time_struct</code>) to represent a point in time (accurate to one millisecond) and functions to manipulate time structs. The <code>strptime()</code> function takes a formatted string an converts it to a <code>time_struct</code>. This string is in the default format, but you can control that with format codes. See the <a href=http://docs.python.org/3.1/library/time.html><code>time</code> module</a> for more details.
</ol>
<p>That&#8217;s a handsome-looking Python dictionary. Let&#8217;s save it to a file.
<pre class=screen>
<a><samp class=p>>>> </samp><kbd class=pp>shell</kbd> <span class=u>&#x2460;</span></a>
<samp class=pp>1</samp>
<samp class=p>>>> </samp><kbd class=pp>import pickle</kbd>
<a><samp class=p>>>> </samp><kbd class=pp>with open('entry.pickle', 'wb') as f:</kbd> <span class=u>&#x2461;</span></a>
<a><samp class=p>... </samp><kbd class=pp> pickle.dump(entry, f)</kbd> <span class=u>&#x2462;</span></a>
<samp class=p>... </samp></pre>
<ol>
<li>This is still in Python Shell #1.
<li>Use the <code>open()</code> function to open a file. Set the file mode to <code>'wb'</code> to open the file for writing <a href=files.html#binary>in binary mode</a>. Wrap it in a <a href=files.html#with><code>with</code> statement</a> to ensure the file is closed automatically when you&#8217;re done with it.
<li>The <code>dump()</code> function in the <code>pickle</code> module takes a serializable Python data structure, serializes it into a binary, Python-specific format using the latest version of the <code>pickle</code> protocol, and saves it to an open file.
</ol>
<p>That last sentence was pretty important.
<ul>
<li>The <code>pickle</code> module takes a Python data structure and saves it to a file.
<li>To do this, it <i>serializes</i> the data structure using a data format called &#8220;the <code>pickle</code> protocol.&#8221;
<li>The <code>pickle</code> protocol is Python-specific; there is no guarantee of cross-language compatibility. You probably couldn&#8217;t take the <code>entry.pickle</code> file you just created and do anything useful with it in Perl, <abbr>PHP</abbr>, Java, or any other language.
<li>Not every Python data structure can be serialized by the <code>pickle</code> module. The <code>pickle</code> protocol has changed several times as new data types have been added to the Python language, but there are still limitations.
<li>As a result of these changes, there is no guarantee of compatibility between different versions of Python itself. Newer versions of Python support the older serialization formats, but older versions of Python do not support newer formats (since they don&#8217;t support the newer data types).
<li>Unless you specify otherwise, the functions in the <code>pickle</code> module will use the latest version of the <code>pickle</code> protocol. This ensures that you have maximum flexibility in the types of data you can serialize, but it also means that the resulting file will not be readable by older versions of Python that do not support the latest version of the <code>pickle</code> protocol.
<li>The latest version of the <code>pickle</code> protocol is a binary protocol. Be sure to open your pickle files <a href=files.html#binary>in binary mode</a>, or the data will get corrupted during writing.
</ul>
<h3 id=load>Loading from a File</h3>
<p>Now switch to your second Python Shell&nbsp;&mdash;&nbsp;<i>i.e.</i> not the one where you created the <code>entry</code> dictionary.
<pre class=screen>
<a><samp class=p>>>> </samp><kbd class=pp>shell</kbd> <span class=u>&#x2460;</span></a>
<samp class=pp>2</samp>
<a><samp class=p>>>> </samp><kbd class=pp>entry</kbd> <span class=u>&#x2461;</span></a>
<samp class=traceback>Traceback (most recent call last):
File "&lt;stdin>", line 1, in &lt;module>
NameError: name 'entry' is not defined</samp>
<samp class=p>>>> </samp><kbd class=pp>import pickle</kbd>
<a><samp class=p>>>> </samp><kbd class=pp>with open('entry.pickle', 'rb') as f:</kbd> <span class=u>&#x2462;</span></a>
<a><samp class=p>... </samp><kbd class=pp> entry = pickle.load(f)</kbd> <span class=u>&#x2463;</span></a>
<samp class=p>... </samp>
<a><samp class=p>>>> </samp><kbd class=pp>entry</kbd> <span class=u>&#x2464;</span></a>
<samp class=pp>{'comments_link': None,
'internal_id': b'\xde\xd5\xb4\xf8',
'title': 'Dive into history, 2009 edition',
'tags': ('diveintopython', 'docbook', 'html'),
'article_link':
'http://diveintomark.org/archives/2009/03/27/dive-into-history-2009-edition',
'published_date': time.struct_time(tm_year=2009, tm_mon=3, tm_mday=27, tm_hour=22, tm_min=20, tm_sec=42, tm_wday=4, tm_yday=86, tm_isdst=-1),
'published': True}</samp></pre>
<ol>
<li>FIXME
<li>FIXME
<li>FIXME
<li>FIXME
<li>FIXME
</ol>
<p>FIXME
<pre class=screen>
<samp class=p>>>> </samp><kbd class=pp>shell</kbd>
<samp class=pp>1</samp>
<a><samp class=p>>>> </samp><kbd class=pp>with open('entry.pickle', 'rb') as f:</kbd> <span class=u>&#x2460;</span></a>
<a><samp class=p>... </samp><kbd class=pp> entry2 = pickle.load(f)</kbd> <span class=u>&#x2461;</span></a>
<samp class=p>... </samp>
<a><samp class=p>>>> </samp><kbd class=pp>entry2 == entry</kbd> <span class=u>&#x2462;</span></a>
<samp class=pp>True</samp>
<a><samp class=p>>>> </samp><kbd class=pp>entry2['tags']</kbd> <span class=u>&#x2463;</span></a>
<samp class=pp>('diveintopython', 'docbook', 'html')</samp>
<a><samp class=p>>>> </samp><kbd class=pp>entry2['internal_id']</kbd> <span class=u>&#x2464;</span></a>
<samp class=pp>b'\xde\xd5\xb4\xf8'</samp></pre>
<ol>
<li>FIXME
<li>
<li>
<li>
<li>
</ol>
<h3 id=dumps>Saving to (and Loading from) an Object in Memory</h3>
<p>FIXME
<h3 id=protocol-versions>Bytes and Strings Rear Their Ugly Heads (Again!)</h3>
<p>FIXME - discussion of pickle protocol versions, backward incompatibility of protocol version 3 due to bytes/strings separation in Python 3, link to http://docs.python.org/3.1/library/pickle.html#data-stream-format
<h3 id=debugging>Debugging Pickle Files</h3>
<p>What does the pickle protocol look like? Let&#8217;s jump out of the Python Shell for a moment and take a look at that <code>entry.pickle</code> file we created.
<pre class=screen>
<samp class=p>you@localhost:~/diveintopython3/examples$ </samp><kbd>ls -l entry.pickle</kbd>
<samp>-rw-r--r-- 1 you you 324 Aug 3 13:34 entry.pickle</samp>
<samp class=p>you@localhost:~/diveintopython3/examples$ </samp><kbd>cat entry.pickle</kbd>
<samp>comments_linkqNXtagsqXdiveintopythonqXdocbookqXhtmlq?qX publishedq?
XlinkXJhttp://diveintomark.org/archives/2009/03/27/dive-into-history-2009-edition
q Xpublished_dateq
ctime
struct_time
?qRqXtitleqXDive into history, 2009 editionqu.</samp></pre>
<p>That wasn&#8217;t terribly helpful. You can see the strings, but other datatypes end up as unprintable (or at least unreadable) characters. Fields are not obviously delimited by tabs or spaces. This is not a format you would want to debug by yourself.
<pre class=screen>
<samp class=p>>>> </samp><kbd class=pp>shell</kbd>
<samp class=pp>1</samp>
<samp class=p>>>> </samp><kbd class=pp>import pickletools</kbd>
<samp class=p>>>> </samp><kbd class=pp>with open('entry.pickle', 'rb') as f:</kbd>
<samp class=p>... </samp><kbd class=pp> pickletools.dis(f)</kbd>
<samp> 0: \x80 PROTO 3
2: } EMPTY_DICT
3: q BINPUT 0
5: ( MARK
6: X BINUNICODE 'published_date'
25: q BINPUT 1
27: c GLOBAL 'time struct_time'
45: q BINPUT 2
47: ( MARK
48: M BININT2 2009
51: K BININT1 3
53: K BININT1 27
55: K BININT1 22
57: K BININT1 20
59: K BININT1 42
61: K BININT1 4
63: K BININT1 86
65: J BININT -1
70: t TUPLE (MARK at 47)
71: q BINPUT 3
73: } EMPTY_DICT
74: q BINPUT 4
76: \x86 TUPLE2
77: q BINPUT 5
79: R REDUCE
80: q BINPUT 6
82: X BINUNICODE 'comments_link'
100: q BINPUT 7
102: N NONE
103: X BINUNICODE 'internal_id'
119: q BINPUT 8
121: C SHORT_BINBYTES 'ÞÕ´ø'
127: q BINPUT 9
129: X BINUNICODE 'tags'
138: q BINPUT 10
140: X BINUNICODE 'diveintopython'
159: q BINPUT 11
161: X BINUNICODE 'docbook'
173: q BINPUT 12
175: X BINUNICODE 'html'
184: q BINPUT 13
186: \x87 TUPLE3
187: q BINPUT 14
189: X BINUNICODE 'title'
199: q BINPUT 15
201: X BINUNICODE 'Dive into history, 2009 edition'
237: q BINPUT 16
239: X BINUNICODE 'article_link'
256: q BINPUT 17
258: X BINUNICODE 'http://diveintomark.org/archives/2009/03/27/dive-into-history-2009-edition'
337: q BINPUT 18
339: X BINUNICODE 'published'
353: q BINPUT 19
355: \x88 NEWTRUE
356: u SETITEMS (MARK at 5)
357: . STOP
highest protocol among opcodes = 3</samp></pre>
<p class=a>&#x2042;
<h2 id=pickle-advanced>Serializing Complex Python Objects</h2>
<p>FIXME - discussion of pickling class instances, stateful objects, __getstate__ and __setstate__, links to http://docs.python.org/3.1/library/pickle.html#pickle-inst and http://docs.python.org/3.1/library/pickle.html#pickle-state
<h2 id=pickle-security>Security Concerns with Pickled Objects</h2>
<p>FIXME - pickled objects can be modified in memory, in transit, or on disk; no checksums; no built-in guarantee that the pickle you're loading is the pickle you dumped; never unpickle untrusted input; xref to "eval() is evil" discussion in advanced-iterators chapter
<h2 id=json>Serializing Python Objects to be Read by Other Languages</h2>
<p>The data format used by the <code>pickle</code> module is Python-specific. It makes no attempt to be compatible with other programming languages. If cross-language compatibility is one of your requirements, you need to look at other serialization formats.
<p>One format that <em>is</em> designed to be used by multiple programming languages is <a href=http://json.org/><abbr>JSON</abbr></a>.
<p>FIXME - pickle format is python-specific; JSON format is designed to be cross-language (in fact, it was originally designed for JavaScript, hence the name); differences with pickle format (table or list); json module implements dumping and loading JSON-formatted data structures; JSON format is string-based (and always encoded as UTF-8 where bytes are required); compact vs. pretty-printing; JSONEncoder; JSONDecoder; iterencode
<h3 id=json-types>Mapping of Python Datatypes to <abbr>JSON</abbr></h3>
<p>FIXME
<table>
<tr><th>Notes
<th>JSON
<th>Python 3
<tr><th>
<td>object
<td><a href=native-datatypes.html#dictionaries>dictionary</a>
<tr><th>
<td>array
<td><a href=native-datatypes.html#lists>list</a>
<tr><th>
<td>string
<td><a href=strings.html#divingin>string</a>
<tr><th>
<td>integer
<td><a href=native-datatypes.html#numbers>integer</a>
<tr><th>
<td>real number
<td><a href=native-datatypes.html#numbers>float</a>
<tr><th>
<td><code>true</code>
<td><code><a href=native-datatypes.html#booleans>True</a>
<tr><th>
<td><code>false</code>
<td><code><a href=native-datatypes.html#booleans>False</a></code>
<tr><th>
<td><code>null</code>
<td><code><a href=native-datatypes.html#none>None</a></code>
</table>
<p>FIXME
<h3 id=json-unknown-types>Serializing Datatypes Unsupported by <abbr>JSON</abbr></h3>
<pre class=screen>
<samp class=p>>>> </samp><kbd class=pp>shell</kbd>
<samp class=pp>1</samp>
<samp class=p>>>> </samp><kbd class=pp>entry</kbd>
<samp class=pp>FIXME</samp>
<samp class=p>>>> </samp><kbd class=pp>import json</kbd>
<samp class=p>>>> </samp><kbd class=pp>with open('entry.json', 'w', encoding='utf-8') as f:</kbd>
<samp class=p>... </samp><kbd class=pp> json.dump(entry)</kbd>
<samp class=p>... </samp>
<samp class=traceback>FIXME</samp></pre>
<ol>
<li>FIXME
</ol>
<p>FIXME
<pre class=pp><code># customserializer.py
def to_json(python_object):
if isinstance(python_object, bytes):
return {'__class__': 'bytes',
'__value__': list(python_object)}
raise TypeError(repr(python_object) + ' is not JSON serializable')</code></pre>
<ol>
<li>FIXME
</ol>
<p>FIXME
<pre class=screen>
<samp class=p>>>> </samp><kbd class=pp>shell</kbd>
<samp class=pp>1</samp>
<samp class=p>>>> </samp><kbd class=pp>import customserializer</kbd>
<samp class=p>>>> </samp><kbd class=pp>with open('entry.json', 'w', encoding='utf-8') as f:</kbd>
<samp class=p>... </samp><kbd class=pp> json.dump(entry, default = customserializer.to_json)</kbd>
<samp class=p>... </samp>
<samp class=traceback>FIXME</samp></pre>
<ol>
<li>FIXME
</ol>
<pre class=pp><code># customserializer.py
def to_json(python_object):
if isinstance(python_object, time.struct_time):
return {'__class__': 'time.asctime',
'__value__': time.asctime(python_object)}
if isinstance(python_object, bytes):
return {'__class__': 'bytes',
'__value__': list(python_object)}
raise TypeError(repr(python_object) + ' is not JSON serializable')</code></pre>
<ol>
<li>FIXME
</ol>
<p>FIXME
<pre class=screen>
<samp class=p>>>> </samp><kbd class=pp>shell</kbd>
<samp class=pp>1</samp>
<samp class=p>>>> </samp><kbd class=pp>with open('entry.json', 'w', encoding='utf-8') as f:</kbd>
<samp class=p>... </samp><kbd class=pp> json.dump(entry, default = customserializer.to_json)</kbd>
<samp class=p>... </samp></pre>
<ol>
<li>FIXME
</ol>
<p>FIXME
<pre class=screen>
<samp class=p>you@localhost:~/diveintopython3/examples$ </samp><kbd>ls -l example.json</kbd>
<samp class=pp>FIXME</samp>
<samp class=p>you@localhost:~/diveintopython3/examples$ </samp><kbd>cat example.json</kbd>
<samp class=pp>FIXME</samp></pre>
<ol>
<li>FIXME
</ol>
<p>FIXME
<pre class=screen>
<samp class=p>>>> </samp><kbd class=pp>shell</kbd>
<samp class=pp>2</samp>
<samp class=p>>>> </samp><kbd class=pp>del entry</kbd>
<samp class=p>>>> </samp><kbd class=pp>entry</kbd>
<samp class=traceback>Traceback (most recent call last):
File "&lt;stdin>", line 1, in &lt;module>
NameError: name 'entry' is not defined</samp>
<samp class=p>>>> </samp><kbd class=pp>import json</kbd>
<samp class=p>>>> </samp><kbd class=pp>with open('entry.json', 'r', encoding='utf-8') as f:</kbd>
<samp class=p>... </samp><kbd class=pp> entry = json.load(f)</kbd>
<samp class=p>... </samp>
<samp class=traceback>FIXME</samp></pre>
<ol>
<li>FIXME
</ol>
<p>FIXME
<pre class=pp><code># customserializer.py
def from_json(json_object):
if '__class__' in json_object:
if json_object['__class__'] == 'time.asctime':
return time.strptime(json_object['__value__'])
if json_object['__class__'] == 'bytes':
return bytes(json_object['__value__'])
return json_object</code></pre>
<pre class=screen>
<samp class=p>>>> </samp><kbd class=pp>shell</kbd>
<samp class=pp>2</samp>
<samp class=p>>>> </samp><kbd class=pp>import customserializer</kbd>
<samp class=p>>>> </samp><kbd class=pp>with open('entry.json', 'r', encoding='utf-8') as f:</kbd>
<samp class=p>... </samp><kbd class=pp> entry = json.load(f, object_hook = customserializer.from_json)</kbd>
<samp class=p>... </samp>
<samp class=p>>>> </samp><kbd class=pp>entry</kbd>
<samp class=pp>FIXME</samp></pre>
<ol>
<li>FIXME
</ol>
<p>FIXME
<pre class=screen>
<samp class=p>>>> </samp><kbd class=pp>shell</kbd>
<samp class=pp>1</samp>
<samp class=p>>>> </samp><kbd class=pp>import customserializer</kbd>
<samp class=p>>>> </samp><kbd class=pp>with open('entry.json', 'r', encoding='utf-8') as f:</kbd>
<samp class=p>... </samp><kbd class=pp> entry2 = json.load(f, object_hook = customserializer.from_json)</kbd>
<samp class=p>... </samp>
<samp class=p>>>> </samp><kbd class=pp>entry2 == entry</kbd>
<samp class=pp>False</samp>
<samp class=p>>>> </samp><kbd class=pp>entry['tags']</kbd>
<samp class=pp>('diveintopython', 'docbook', 'html')</samp>
<samp class=p>>>> </samp><kbd class=pp>entry2['tags']</kbd>
<samp class=pp>['diveintopython', 'docbook', 'html']</samp></pre>
<ol>
<li>FIXME
</ol>
<p>FIXME
<h2 id=furtherreading>Further Reading</h2>
<blockquote class=note>
<p><span class=u>&#x261E;</span>Many articles about the <code>pickle</code> module make references to <code>cPickle</code>. In Python 2, there were two implementations of the <code>pickle</code> module, one written in pure Python and another written in C (but still callable from Python). In Python 3, <a href=porting-code-to-python-3-with-2to3.html#othermodules>these two modules have been consolidated</a>, so you should always just <code>import pickle</code>. You may find these articles useful, but you should ignore the now-obsolete information about <code>cPickle</code>.
</blockquote>
<ul>
<li><a href=http://docs.python.org/3.1/library/pickle.html><code>pickle</code> module</a>
<li><a href=http://www.doughellmann.com/PyMOTW/pickle/><code>pickle</code> and <code>cPickle</code>&nbsp;&mdash;&nbsp;Python object serialization</a>
<li><a href=http://wiki.python.org/moin/UsingPickle>Using <code>pickle</code></a>
<li><a href=http://www.ibm.com/developerworks/library/l-pypers.html>Python persistence management</a>
<li><a href=http://www.doughellmann.com/PyMOTW/json/><code>json</code>&nbsp;&mdash;&nbsp;JavaScript Object Notation Serializer</a>
<li><a href=http://blog.quaternio.net/2009/07/16/json-encoding-and-decoding-with-custom-objects-in-python/>JSON encoding and ecoding with custom objects in Python</a>
</ul>
<p class=v><a rel=prev href=xml.html title='back to &#8220;XML&#8221;'><span class=u>&#x261C;</span></a> <a rel=next href=http-web-services.html title='onward to &#8220;HTTP Web Services&#8221;'><span class=u>&#x261E;</span></a>
<p class=c>&copy; 2001&ndash;9 <a href=about.html>Mark Pilgrim</a>
<script src=j/jquery.js></script>
<script src=j/prettify.js></script>
<script src=j/dip3.js></script>