diff --git a/examples/customserializer.py b/examples/customserializer.py
index dfb8c5d..cc7c75a 100644
--- a/examples/customserializer.py
+++ b/examples/customserializer.py
@@ -38,11 +38,11 @@ if __name__ == '__main__':
     print(type(entry['tags']))
     print(type(entry2['tags']))

-    with open('entry.json', 'w', encoding = 'utf-8') as f:
-        json.dump(entry, f, default = to_json)
+    with open('entry.json', 'w', encoding='utf-8') as f:
+        json.dump(entry, f, default=to_json)

-    with open('entry.json', 'r', encoding = 'utf-8') as f:
-        entry2 = json.load(f, object_hook = from_json)
+    with open('entry.json', 'r', encoding='utf-8') as f:
+        entry2 = json.load(f, object_hook=from_json)

     print(entry == entry2)
     print(type(entry['tags']))
diff --git a/index.html b/index.html
index d269a3e..0e66fdf 100644
--- a/index.html
+++ b/index.html
@@ -40,7 +40,7 @@ h1:before,h2:before{content:''}
chardet to Python 3
diff --git a/serializing.html b/serializing.html
index 2dd3fe6..51449e6 100644
--- a/serializing.html
+++ b/serializing.html
@@ -20,9 +20,26 @@ body{counter-reset:h1 13}
FIXME +
The concept of serialization is simple. You have a data structure in memory that you want to save, reuse, or send to someone else. How would you do that? Well, that depends on how you want to save it, how you want to reuse it, and to whom you want to send it. Many games allow you to save your progress when you quit the game and pick up where you left off when you relaunch the game. (Actually, many non-gaming applications do this as well.) In this case, a data structure that captures “your progress so far” needs to be stored on disk when you quit, then loaded from disk when you relaunch. The data is only meant to be used by the same program that created it, never sent over a network, and never read by anything other than the program that created it. Therefore, the interoperability issues are limited to ensuring that later versions of the program can read data written by earlier versions. -
Open the Python Shell and define the following variable: +
For cases like this, the pickle module is ideal. It’s part of the Python standard library, so it’s always available. It’s fast; the bulk of it is written in C, like the Python interpreter itself. It can store arbitrarily complex Python data structures.
+
+
What can the pickle module store?
+
+
bytes objects, byte arrays, and None.
+If this isn’t enough for you, the pickle module is also extensible. If you’re interested in extensibility, check out the links in the Further Reading section at the end of the chapter.
+
+
This chapter tells a tale with two Python Shells. All of the examples in this chapter are part of a single story arc. You will be asked to switch back and forth between the two Python Shells as I demonstrate the pickle and json modules.
+
+
To help keep things straight, open the Python Shell and define the following variable:
>>> shell = 1
@@ -36,24 +53,7 @@ body{counter-reset:h1 13}
⁂ -
The concept of serialization is simple. You have a data structure in memory that you want to save, reuse, or send to someone else. How would you do that? Well, that depends on how you want to save it, how you want to reuse it, and to whom you want to send it. Many games allow you to save your progress when you quit the game and pick up where you left off when you relaunch the game. (Actually, many non-gaming applications do this as well.) In this case, a data structure that captures “your progress so far” needs to be stored on disk when you quit, then loaded from disk when you relaunch. The data is only meant to be used by the same program that created it, never sent over a network, and never read by anything other than the program that created it. Therefore, the interoperability issues are limited to ensuring that later versions of the program can read data written by earlier versions. - -
For cases like this, the pickle module is ideal. It’s part of the Python standard library, so it’s always available. It’s fast; the bulk of it is written in C, like the Python interpreter itself. It can store arbitrarily complex Python data structures.
-
-
What can the pickle module store?
-
-
bytes objects, byte arrays, and None.
-If this isn’t enough for you, the pickle module is also extensible, as you’ll see later in this chapter.
-
-
The pickle module works with data structures. Let’s build one.
@@ -104,7 +104,9 @@ body{counter-reset:h1 13}
⁂ + +
Now switch to your second Python Shell — i.e. not the one where you created the entry dictionary.
@@ -158,7 +160,9 @@ NameError: name 'entry' is not defined
'tags' key is a tuple, and the value of the 'internal_id' key is a bytes object.
-⁂ + +
The examples in the previous section showed how to serialize a Python object directly to a file on disk. But what if you don’t want or need a file? You can also serialize to a bytes object in memory.
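As a minimal sketch of pickling without a file, assuming a small stand-in dictionary rather than the chapter's full entry structure, pickle.dumps() returns a bytes object and pickle.loads() rebuilds an equal object from it. The names b and entry3 are chosen here only for illustration.

```python
import pickle

# Hypothetical stand-in for the chapter's entry dictionary.
entry = {
    'title': 'Dive into history, 2009 edition',
    'tags': ('diveintopython', 'docbook', 'html'),
    'internal_id': b'\xDE\xD5\xB4\xF8',
    'published': True,
}

# dumps() serializes to a bytes object in memory instead of writing to a file.
b = pickle.dumps(entry)

# loads() takes that bytes object and returns a new, equal Python object.
entry3 = pickle.loads(b)
```

Unlike the JSON round trip, the tuple and the bytes object should come back with their original types.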
@@ -178,7 +182,9 @@ NameError: name 'entry' is not defined
⁂ + +
The pickle protocol has been around for many years, and it has matured as Python itself has matured. There are now four different versions of the pickle protocol.
@@ -190,7 +196,9 @@ NameError: name 'entry' is not defined
Oh look, the difference between bytes and strings rears its ugly head again. (If you’re surprised, you haven’t been paying attention.) What this means in practice is that, while Python 3 can read data pickled with protocol version 2, Python 2 cannot read data pickled with protocol version 3. -
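If the pickled data might be read by Python 2, one option is to request an older protocol explicitly. The following is a sketch under that assumption, using a hypothetical data dictionary; it relies on the fact that pickles written with protocol 2 or later begin with a marker byte (0x80) followed by one byte holding the protocol version.

```python
import pickle

# Hypothetical data, chosen only for illustration.
data = {'tags': ('diveintopython', 'docbook', 'html')}

# Ask for protocol 2 explicitly so that Python 2 could read the result.
pickled_v2 = pickle.dumps(data, protocol=2)

# Protocol 2 and later start with 0x80, then one byte giving the version.
marker = pickled_v2[0]
version = pickled_v2[1]

# Python 3 reads the older protocol without being told which one it is.
restored = pickle.loads(pickled_v2)
```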
⁂ + +
What does the pickle protocol look like? Let’s jump out of the Python Shell for a moment and take a look at that entry.pickle file we created.
@@ -293,16 +301,6 @@ def protocol_version(file_object):
⁂ -
FIXME - discussion of pickling class instances, stateful objects, __getstate__ and __setstate__, links to http://docs.python.org/3.1/library/pickle.html#pickle-inst and http://docs.python.org/3.1/library/pickle.html#pickle-state - - -
The data format used by the pickle module is Python-specific. It makes no attempt to be compatible with other programming languages. If cross-language compatibility is one of your requirements, you need to look at other serialization formats. One such format is JSON. “JSON” stands for “JavaScript Object Notation,” but don’t let the name fool you — JSON is explicitly designed to be usable across multiple programming languages.
@@ -313,7 +311,9 @@ def protocol_version(file_object):
Third, there’s the perennial problem of character encoding. JSON encodes values as plain text, but as you know, there ain’t no such thing as “plain text.” JSON must be stored in a Unicode encoding (UTF-32, UTF-16, or the default, UTF-8), and section 3 of RFC 4627 defines how to tell which encoding is being used. -
⁂ + +
JSON looks remarkably like a data structure you might define manually in JavaScript. This is no accident; you can actually use the JavaScript eval() function to “decode” JSON-serialized data. (The usual caveats about untrusted input apply, but the point is that JSON is valid JavaScript.) As such, JSON may already look familiar to you.
@@ -369,7 +369,9 @@ def protocol_version(file_object):
"title": "Dive into history, 2009 edition"
}
-
⁂ + +
Since JSON is not Python-specific, there are some mismatches in its coverage of Python datatypes. Some of them are simply naming differences, but there are two important Python datatypes that are completely missing. See if you can spot them:
@@ -406,7 +408,9 @@ def protocol_version(file_object):
Did you notice what was missing? Tuples & bytes! JSON has an array type, which the json module maps to a Python list, but it does not have a separate type for “frozen arrays” (tuples). And while JSON supports strings quite nicely, it has no support for bytes objects or byte arrays.
-
⁂ + +
Even if JSON has no built-in support for bytes, that doesn’t mean you can’t serialize bytes objects. The json module provides extensibility hooks for encoding and decoding unknown datatypes. (By “unknown,” I mean “not defined in JSON.” Obviously the json module knows about byte arrays, but it’s constrained by the limitations of the JSON specification.) If you want to encode bytes or other datatypes that JSON doesn’t support natively, you need to provide custom encoders and decoders for those types.
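To make the idea of a custom encoder concrete, here is a minimal sketch that handles only bytes objects. It is modeled on the '__class__' and '__value__' dictionaries that appear elsewhere in this chapter, but it is an illustration, not the chapter's full to_json() function, and the entry dictionary is a hypothetical stand-in.

```python
import json

def to_json(python_object):
    # Called by the json module only for objects it cannot serialize itself.
    if isinstance(python_object, bytes):
        # Represent bytes as a tagged dictionary holding a list of integers.
        return {'__class__': 'bytes',
                '__value__': list(python_object)}
    # Anything else is still unsupported, so say so.
    raise TypeError(repr(python_object) + ' is not JSON serializable')

# Hypothetical stand-in for the chapter's entry dictionary.
entry = {'internal_id': b'\xDE\xD5\xB4\xF8', 'published': True}

# Pass the conversion function as the default parameter.
json_text = json.dumps(entry, default=to_json)
```

Raising TypeError for anything the function does not recognize keeps unsupported types from being silently dropped.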
@@ -524,23 +528,25 @@ def to_json(python_object):
"article_link": "http://diveintomark.org/archives/2009/03/27/dive-into-history-2009-edition",
"published": true}
-
⁂ -
FIXME +
Like the pickle module, the json module has a load() function which takes a stream object, reads JSON-encoded data from it, and creates a new Python object that mirrors the JSON data structure.
>>> shell
2
->>> del entry
+>>> del entry ①
>>> entry
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'entry' is not defined
>>> import json
>>> with open('entry.json', 'r', encoding='utf-8') as f:
-...     entry = json.load(f)
+...     entry = json.load(f) ②
...
->>> entry
+>>> entry ③
{'comments_link': None,
 'internal_id': {'__class__': 'bytes', '__value__': [222, 213, 180, 248]},
 'title': 'Dive into history, 2009 edition',
@@ -549,28 +555,38 @@ NameError: name 'entry' is not defined
 'published_date': {'__class__': 'time.asctime', '__value__': 'Fri Mar 27 22:20:42 2009'},
 'published': True}
pickle module.
+json.load() function works the same as the pickle.load() function. You pass in a stream object and it returns a new Python object.
+json.load() function successfully read the entry.json file you created in Python Shell #1 and created a new Python object that contained the data. Now the bad news: it didn’t recreate the original entry data structure. The two values 'internal_id' and 'published_date' were recreated as dictionaries — specifically, the dictionaries with JSON-compatible values that you created in the to_json() conversion function.
FIXME +
json.load() doesn’t know anything about any conversion function you may have passed to json.dump(). What you need is the opposite of the to_json() function — a function that will take a custom-converted JSON object and convert it back to the original Python datatype.
-
# customserializer.py
-def from_json(json_object):
- if '__class__' in json_object:
+
+def from_json(json_object): ①
+ if '__class__' in json_object: ②
if json_object['__class__'] == 'time.asctime':
- return time.strptime(json_object['__value__'])
+ return time.strptime(json_object['__value__']) ③
if json_object['__class__'] == 'bytes':
- return bytes(json_object['__value__'])
+ return bytes(json_object['__value__']) ④
return json_object
+
+- This conversion function also takes one parameter and returns one value. But the parameter it takes is not a string, it’s a Python object — the result of deserializing a JSON-encoded string into Python.
+
- All you need to do is check whether this object contains the
'__class__' key that the to_json() function created. If so, the value of the '__class__' key will tell you how to decode the value back into the original Python datatype.
+ - To decode the time string returned by the
time.asctime() function, you use the time.strptime() function. This function takes a formatted datetime string (in a customizable format, but it defaults to the same format that time.asctime() defaults to) and returns a time.struct_time.
+ - To convert a list of integers back into a
bytes object, you can use the bytes() function.
+
+
+That was it; there were only two datatypes handled in the to_json() function, and now those two datatypes are handled in the from_json() function. This is the result:
>>> shell
2
>>> import customserializer
>>> with open('entry.json', 'r', encoding='utf-8') as f:
-... entry = json.load(f, object_hook = customserializer.from_json)
+... entry = json.load(f, object_hook=customserializer.from_json) ①
...
->>> entry
+>>> entry ②
{'comments_link': None,
'internal_id': b'\xDE\xD5\xB4\xF8',
'title': 'Dive into history, 2009 edition',
@@ -579,45 +595,61 @@ def from_json(json_object):
'published_date': time.struct_time(tm_year=2009, tm_mon=3, tm_mday=27, tm_hour=22, tm_min=20, tm_sec=42, tm_wday=4, tm_yday=86, tm_isdst=-1),
'published': True}
-- FIXME
+
- To hook the
from_json() function into the deserialization process, pass it as the object_hook parameter to the json.load() function. Functions that take functions; it’s so handy!
+ - The entry data structure now contains an
'internal_id' key whose value is a bytes object. It also contains a 'published_date' key whose value is a time.struct_time object.
-FIXME
+
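The same hook can be tried without any file at all. This sketch assumes a hand-written JSON string (json_text, hypothetical) in the tagged format shown above, and passes from_json() as the object_hook parameter to json.loads(), the in-memory counterpart of json.load().

```python
import json
import time

def from_json(json_object):
    # Called once for every JSON object (dictionary) that gets decoded.
    if '__class__' in json_object:
        if json_object['__class__'] == 'time.asctime':
            return time.strptime(json_object['__value__'])
        if json_object['__class__'] == 'bytes':
            return bytes(json_object['__value__'])
    # Ordinary dictionaries pass through unchanged.
    return json_object

# Hypothetical JSON text in the tagged format used in this chapter.
json_text = '''{"internal_id": {"__class__": "bytes",
                                "__value__": [222, 213, 180, 248]},
                "published_date": {"__class__": "time.asctime",
                                   "__value__": "Fri Mar 27 22:20:42 2009"},
                "published": true}'''

entry = json.loads(json_text, object_hook=from_json)
```

The two tagged dictionaries come back as a bytes object and a time.struct_time, while the untagged outer dictionary is returned as is.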
There is one final glitch, though.
>>> shell
1
>>> import customserializer
>>> with open('entry.json', 'r', encoding='utf-8') as f:
-... entry2 = json.load(f, object_hook = customserializer.from_json)
+... entry2 = json.load(f, object_hook=customserializer.from_json)
...
->>> entry2 == entry
+>>> entry2 == entry ①
False
->>> entry['tags']
+>>> entry['tags'] ②
('diveintopython', 'docbook', 'html')
->>> entry2['tags']
+>>> entry2['tags'] ③
['diveintopython', 'docbook', 'html']
-- FIXME
+
- Even after hooking the
to_json() function into the serialization, and hooking the from_json() function into the deserialization, we still haven’t recreated a perfect replica of the original data structure. Why not?
+ - In the original entry data structure, the value of the
'tags' key was a tuple of three strings.
+ - But in the round-tripped entry2 data structure, the value of the
'tags' key is a list of three strings. JSON doesn’t distinguish between tuples and lists; it only has a single list-like datatype, the array, and the json module silently converts both tuples and lists into JSON arrays during serialization. For most uses, you can ignore the difference between tuples and lists, but it’s something to keep in mind as you work with the json module.
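The tuple-to-list conversion is easy to reproduce in isolation. A minimal sketch, using only the tags value and hypothetical variable names:

```python
import json

tags = ('diveintopython', 'docbook', 'html')

# Serialization writes the tuple as a JSON array.
json_text = json.dumps(tags)

# Deserialization turns every JSON array into a list.
round_tripped = json.loads(json_text)

# Same items in the same order, but a different type, so they compare unequal.
same_items = list(tags) == round_tripped
same_object_value = tags == round_tripped
```

If the distinction matters to your program, you would need to convert the value back yourself, for example with tuple(round_tripped).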
-FIXME
-
Further Reading
☞Many articles about the pickle module make references to cPickle. In Python 2, there were two implementations of the pickle module, one written in pure Python and another written in C (but still callable from Python). In Python 3, these two modules have been consolidated, so you should always just import pickle. You may find these articles useful, but you should ignore the now-obsolete information about cPickle.
+On pickling with the pickle module:
+
pickle module
pickle and cPickle — Python object serialization
- Using
pickle
- Python persistence management
+
+
+On JSON and the json module:
+
+
json — JavaScript Object Notation Serializer
- JSON encoding and decoding with custom objects in Python
+On pickle extensibility:
+
+
+
© 2001–9 Mark Pilgrim
diff --git a/table-of-contents.html b/table-of-contents.html
index 9338c60..99f68d1 100755
--- a/table-of-contents.html
+++ b/table-of-contents.html
@@ -230,26 +230,40 @@ ul li ol{margin:0;padding:0 0 0 2.5em}
XML
- - Diving In
-
- A 5-Minute Crash Course in XML
-
- The Structure Of An Atom Feed
-
- Parsing XML
-
-
- Searching For Nodes Within An XML Document
-
- Going Further With lxml
-
-
- Generating XML
-
- Further Reading
+
- Diving In
+
- A 5-Minute Crash Course in XML
+
- The Structure Of An Atom Feed
+
- Parsing XML
+
+
- Searching For Nodes Within An XML Document
+
- Going Further With lxml
+
+
- Generating XML
+
- Further Reading
Serializing Python Objects
- - ...diving in...
+
- Diving In
+
+
- Saving Data to a Pickle File
+
- Loading Data from a Pickle File
+
- Pickling Without a File
+
- Bytes and Strings Rear Their Ugly Heads Again
+
- Debugging Pickle Files
+
- Serializing Python Objects to be Read by Other Languages
+
- Saving Data to a JSON File
+
- Mapping of Python Datatypes to JSON
+
- Serializing Datatypes Unsupported by JSON
+
- Loading Data from a JSON File
+
- Further Reading
HTTP Web Services