From 24020a5f8c382667befa56b1c20bc8718719ed34 Mon Sep 17 00:00:00 2001
From: Mark Pilgrim
Date: Wed, 19 Aug 2009 02:23:15 -0400
Subject: [PATCH] finished serializing.html
---
 examples/customserializer.py |   8 +-
 index.html                   |   2 +-
 serializing.html             | 152 +++++++++++++++++++++--------------
 table-of-contents.html       |  48 +++++++----
 4 files changed, 128 insertions(+), 82 deletions(-)
diff --git a/examples/customserializer.py b/examples/customserializer.py
index dfb8c5d..cc7c75a 100644
--- a/examples/customserializer.py
+++ b/examples/customserializer.py
@@ -38,11 +38,11 @@ if __name__ == '__main__':
     print(type(entry['tags']))
     print(type(entry2['tags']))
-    with open('entry.json', 'w', encoding = 'utf-8') as f:
-        json.dump(entry, f, default = to_json)
+    with open('entry.json', 'w', encoding='utf-8') as f:
+        json.dump(entry, f, default=to_json)
-    with open('entry.json', 'r', encoding = 'utf-8') as f:
-        entry2 = json.load(f, object_hook = from_json)
+    with open('entry.json', 'r', encoding='utf-8') as f:
+        entry2 = json.load(f, object_hook=from_json)
     print(entry == entry2)
     print(type(entry['tags']))
diff --git a/index.html b/index.html
index d269a3e..0e66fdf 100644
--- a/index.html
+++ b/index.html
@@ -40,7 +40,7 @@ h1:before,h2:before{content:''}
  • Refactoring
  • Files
  • XML -
  • Serializing Python Objects (in progress) +
  • Serializing Python Objects
  • HTTP Web Services
  • Threading & Multiprocessing (in progress)
  • Case Study: Porting chardet to Python 3 diff --git a/serializing.html b/serializing.html index 2dd3fe6..51449e6 100644 --- a/serializing.html +++ b/serializing.html @@ -20,9 +20,26 @@ body{counter-reset:h1 13}

     

    Diving In

    -

    FIXME +

    The concept of serialization is simple. You have a data structure in memory that you want to save, reuse, or send to someone else. How would you do that? Well, that depends on how you want to save it, how you want to reuse it, and to whom you want to send it. Many games allow you to save your progress when you quit the game and pick up where you left off when you relaunch the game. (Actually, many non-gaming applications do this as well.) In this case, a data structure that captures “your progress so far” needs to be stored on disk when you quit, then loaded from disk when you relaunch. The data is only meant to be used by the same program that created it, never sent over a network, and never read by anything other than the program that created it. Therefore, the interoperability issues are limited to ensuring that later versions of the program can read data written by earlier versions. -

    Open the Python Shell and define the following variable: +

    For cases like this, the pickle module is ideal. It’s part of the Python standard library, so it’s always available. It’s fast; the bulk of it is written in C, like the Python interpreter itself. It can store arbitrarily complex Python data structures. + +

    What can the pickle module store? + +

    + +

    If this isn’t enough for you, the pickle module is also extensible. If you’re interested in extensibility, check out the links in the Further Reading section at the end of the chapter. + +

    A Quick Note About The Examples in This Chapter

    + +

    This chapter tells a tale with two Python Shells. All of the examples in this chapter are part of a single story arc. You will be asked to switch back and forth between the two Python Shells as I demonstrate the pickle and json modules. + +

    To help keep things straight, open the Python Shell and define the following variable:

     >>> shell = 1
    @@ -36,24 +53,7 @@ body{counter-reset:h1 13}

    ⁂ -

    Serializing Simple Python Objects

    - -

    The concept of serialization is simple. You have a data structure in memory that you want to save, reuse, or send to someone else. How would you do that? Well, that depends on how you want to save it, how you want to reuse it, and to whom you want to send it. Many games allow you to save your progress when you quit the game and pick up where you left off when you relaunch the game. (Actually, many non-gaming applications do this as well.) In this case, a data structure that captures “your progress so far” needs to be stored on disk when you quit, then loaded from disk when you relaunch. The data is only meant to be used by the same program that created it, never sent over a network, and never read by anything other than the program that created it. Therefore, the interoperability issues are limited to ensuring that later versions of the program can read data written by earlier versions. - -

    For cases like this, the pickle module is ideal. It’s part of the Python standard library, so it’s always available. It’s fast; the bulk of it is written in C, like the Python interpreter itself. It can store arbitrarily complex Python data structures. - -

    What can the pickle module store? - -

    - -

    If this isn’t enough for you, the pickle module is also extensible, as you’ll see later in this chapter. - -

    Saving Data to a Pickle File

    +

    Saving Data to a Pickle File

    The pickle module works with data structures. Let’s build one. @@ -104,7 +104,9 @@ body{counter-reset:h1 13}

  • The latest version of the pickle protocol is a binary format. Be sure to open your pickle files in binary mode, or the data will get corrupted during writing. -

    Loading Data from a Pickle File

    +

    ⁂ + +

    Loading Data from a Pickle File

    Now switch to your second Python Shell — i.e. not the one where you created the entry dictionary. @@ -158,7 +160,9 @@ NameError: name 'entry' is not defined

  • For reasons that will become clear later in this chapter, I want to point out that the value of the 'tags' key is a tuple, and the value of the 'internal_id' key is a bytes object. -

    Pickling Without a File

    +

    ⁂ + +

    Pickling Without a File

    The examples in the previous section showed how to serialize a Python object directly to a file on disk. But what if you don’t want or need a file? You can also serialize to a bytes object in memory. @@ -178,7 +182,9 @@ NameError: name 'entry' is not defined

  • The end result is the same: a perfect replica of the original dictionary. -

    Bytes and Strings Rear Their Ugly Heads Again

    +

    ⁂ + +

    Bytes and Strings Rear Their Ugly Heads Again

    The pickle protocol has been around for many years, and it has matured as Python itself has matured. There are now four different versions of the pickle protocol. @@ -190,7 +196,9 @@ NameError: name 'entry' is not defined

    Oh look, the difference between bytes and strings rears its ugly head again. (If you’re surprised, you haven’t been paying attention.) What this means in practice is that, while Python 3 can read data pickled with protocol version 2, Python 2 cannot read data pickled with protocol version 3. -

    Debugging Pickle Files

    +

    ⁂ + +

    Debugging Pickle Files

    What does the pickle protocol look like? Let’s jump out of the Python Shell for a moment and take a look at that entry.pickle file we created. @@ -293,16 +301,6 @@ def protocol_version(file_object):

    ⁂ -

    Serializing Complex Python Objects

    - -

    FIXME - discussion of pickling class instances, stateful objects, __getstate__ and __setstate__, links to http://docs.python.org/3.1/library/pickle.html#pickle-inst and http://docs.python.org/3.1/library/pickle.html#pickle-state - - -

    Serializing Python Objects to be Read by Other Languages

    The data format used by the pickle module is Python-specific. It makes no attempt to be compatible with other programming languages. If cross-language compatibility is one of your requirements, you need to look at other serialization formats. One such format is JSON. “JSON” stands for “JavaScript Object Notation,” but don’t let the name fool you — JSON is explicitly designed to be usable across multiple programming languages. @@ -313,7 +311,9 @@ def protocol_version(file_object):

    Third, there’s the perennial problem of character encoding. JSON encodes values as plain text, but as you know, there ain’t no such thing as “plain text.” JSON must be stored in a Unicode encoding (UTF-32, UTF-16, or the default, UTF-8), and section 3 of RFC 4627 defines how to tell which encoding is being used. -

    Saving Data to a JSON File

    +

    ⁂ + +

    Saving Data to a JSON File

    JSON looks remarkably like a data structure you might define manually in JavaScript. This is no accident; you can actually use the JavaScript eval() function to “decode” JSON-serialized data. (The usual caveats about untrusted input apply, but the point is that JSON is valid JavaScript.) As such, JSON may already look familiar to you. @@ -369,7 +369,9 @@ def protocol_version(file_object): "title": "Dive into history, 2009 edition" } -

    Mapping of Python Datatypes to JSON

    +

    ⁂ + +

    Mapping of Python Datatypes to JSON

    Since JSON is not Python-specific, there are some mismatches in its coverage of Python datatypes. Some of them are simply naming differences, but there are two important Python datatypes that are completely missing. See if you can spot them: @@ -406,7 +408,9 @@ def protocol_version(file_object):

    Did you notice what was missing? Tuples & bytes! JSON has an array type, which the json module maps to a Python list, but it does not have a separate type for “frozen arrays” (tuples). And while JSON supports strings quite nicely, it has no support for bytes objects or byte arrays. -

    Serializing Datatypes Unsupported by JSON

    +

    ⁂ + +

    Serializing Datatypes Unsupported by JSON

    Even if JSON has no built-in support for bytes, that doesn’t mean you can’t serialize bytes objects. The json module provides extensibility hooks for encoding and decoding unknown datatypes. (By “unknown,” I mean “not defined in JSON.” Obviously the json module knows about byte arrays, but it’s constrained by the limitations of the JSON specification.) If you want to encode bytes or other datatypes that JSON doesn’t support natively, you need to provide custom encoders and decoders for those types. @@ -524,23 +528,25 @@ def to_json(python_object): "article_link": "http://diveintomark.org/archives/2009/03/27/dive-into-history-2009-edition", "published": true} -

    Loading Data from a JSON File

    +

    ⁂ -

    FIXME +

    Loading Data from a JSON File

    + +

    Like the pickle module, the json module has a load() function which takes a stream object, reads JSON-encoded data from it, and creates a new Python object that mirrors the JSON data structure.

     >>> shell
     2
    ->>> del entry
    +>>> del entry                                             
     >>> entry
     Traceback (most recent call last):
       File "<stdin>", line 1, in <module>
     NameError: name 'entry' is not defined
     >>> import json
     >>> with open('entry.json', 'r', encoding='utf-8') as f:
    -...     entry = json.load(f)
    +...     entry = json.load(f)                              
     ... 
    ->>> entry
    +>>> entry                                                 
     {'comments_link': None,
      'internal_id': {'__class__': 'bytes', '__value__': [222, 213, 180, 248]},
      'title': 'Dive into history, 2009 edition',
    @@ -549,28 +555,38 @@ NameError: name 'entry' is not defined
      'published_date': {'__class__': 'time.asctime', '__value__': 'Fri Mar 27 22:20:42 2009'},
      'published': True}
      -
    1. FIXME +
    2. For demonstration purposes, switch to Python Shell #2 and delete the entry data structure that you created earlier in this chapter with the pickle module. +
    3. In the simplest case, the json.load() function works the same as the pickle.load() function. You pass in a stream object and it returns a new Python object. +
    4. I have good news and bad news. Good news first: the json.load() function successfully read the entry.json file you created in Python Shell #1 and created a new Python object that contained the data. Now the bad news: it didn’t recreate the original entry data structure. The two values 'internal_id' and 'published_date' were recreated as dictionaries — specifically, the dictionaries with JSON-compatible values that you created in the to_json() conversion function.
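    If you want to see the “bad news” without touching any files, here is a minimal in-memory sketch. It uses json.dumps() and json.loads() instead of the file-based functions, and the to_json() shown here is a cut-down stand-in written for this illustration (it handles only bytes), not the full function from customserializer.py.

```python
import json

def to_json(python_object):
    # Illustrative stand-in for the chapter's conversion function;
    # only bytes objects are handled in this sketch.
    if isinstance(python_object, bytes):
        return {'__class__': 'bytes', '__value__': list(python_object)}
    raise TypeError(repr(python_object) + ' is not JSON serializable')

original = {'internal_id': b'\xDE\xD5\xB4\xF8', 'published': True}

encoded = json.dumps(original, default=to_json)
decoded = json.loads(encoded)    # no object_hook, so nothing is converted back

print(decoded['internal_id'])    # the placeholder dictionary, not a bytes object
print(decoded == original)       # the round trip is not lossless yet
```

    The placeholder dictionary survives the trip intact, which is exactly what a decoding function needs in order to undo the conversion.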
    -

    FIXME +

    json.load() doesn’t know anything about any conversion function you may have passed to json.dump(). What you need is the opposite of the to_json() function — a function that will take a custom-converted JSON object and convert it back to the original Python datatype. -

    # customserializer.py
    -def from_json(json_object):
    -    if '__class__' in json_object:
    +
    
    +def from_json(json_object):                                   
    +    if '__class__' in json_object:                            
             if json_object['__class__'] == 'time.asctime':
    -            return time.strptime(json_object['__value__'])
    +            return time.strptime(json_object['__value__'])    
             if json_object['__class__'] == 'bytes':
    -            return bytes(json_object['__value__'])
    +            return bytes(json_object['__value__'])            
         return json_object
    +
      +
    1. This conversion function also takes one parameter and returns one value. But the parameter it takes is not a string, it’s a Python object — the result of deserializing a JSON-encoded string into Python. +
    2. All you need to do is check whether this object contains the '__class__' key that the to_json() function created. If so, the value of the '__class__' key will tell you how to decode the value back into the original Python datatype. +
    3. To decode the time string returned by the time.asctime() function, you use the time.strptime() function. This function takes a formatted datetime string (in a customizable format, but it defaults to the same format that time.asctime() defaults to) and returns a time.struct_time. +
    4. To convert a list of integers back into a bytes object, you can use the bytes() function. +
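    It can help to watch when the decoder actually calls your function. The sketch below wraps from_json() in a hypothetical tracing_hook() (a name invented for this illustration) and decodes a small JSON string in memory with json.loads(). The hook is handed every JSON object the decoder finishes, innermost first, so by the time it sees the outer dictionary the inner one has already been converted.

```python
import json
import time

def from_json(json_object):
    # Same logic as the chapter's from_json(): undo the two custom encodings.
    if '__class__' in json_object:
        if json_object['__class__'] == 'time.asctime':
            return time.strptime(json_object['__value__'])
        if json_object['__class__'] == 'bytes':
            return bytes(json_object['__value__'])
    return json_object

calls = []

def tracing_hook(json_object):
    # Hypothetical wrapper: remember each dictionary the decoder hands over,
    # then defer to from_json() for the real work.
    calls.append(dict(json_object))
    return from_json(json_object)

text = ('{"internal_id": {"__class__": "bytes", "__value__": [222, 213, 180, 248]},'
        ' "published": true}')
entry = json.loads(text, object_hook=tracing_hook)

print(len(calls))             # one call per JSON object in the text
print(calls[0])               # the inner placeholder dictionary comes first
print(entry['internal_id'])   # already converted back to bytes
```

    Dictionaries without a '__class__' key pass through from_json() untouched, which is why the outer dictionary comes back as an ordinary dict.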
    + +

    That was it; there were only two datatypes handled in the to_json() function, and now those two datatypes are handled in the from_json() function. This is the result:

     >>> shell
     2
     >>> import customserializer
     >>> with open('entry.json', 'r', encoding='utf-8') as f:
    -...     entry = json.load(f, object_hook = customserializer.from_json)
    +...     entry = json.load(f, object_hook=customserializer.from_json)  
     ... 
    ->>> entry
    +>>> entry                                                             
     {'comments_link': None,
      'internal_id': b'\xDE\xD5\xB4\xF8',
      'title': 'Dive into history, 2009 edition',
    @@ -579,45 +595,61 @@ def from_json(json_object):
      'published_date': time.struct_time(tm_year=2009, tm_mon=3, tm_mday=27, tm_hour=22, tm_min=20, tm_sec=42, tm_wday=4, tm_yday=86, tm_isdst=-1),
      'published': True}
      -
    1. FIXME +
    2. To hook the from_json() function into the deserialization process, pass it as the object_hook parameter to the json.load() function. Functions that take functions; it’s so handy! +
    3. The entry data structure now contains an 'internal_id' key whose value is a bytes object. It also contains a 'published_date' key whose value is a time.struct_time object.
    -

    FIXME +

    There is one final glitch, though.

     >>> shell
     1
     >>> import customserializer
     >>> with open('entry.json', 'r', encoding='utf-8') as f:
    -...     entry2 = json.load(f, object_hook = customserializer.from_json)
    +...     entry2 = json.load(f, object_hook=customserializer.from_json)
     ... 
    ->>> entry2 == entry
    +>>> entry2 == entry                                                    
     False
    ->>> entry['tags']
    +>>> entry['tags']                                                      
     ('diveintopython', 'docbook', 'html')
    ->>> entry2['tags']
    +>>> entry2['tags']                                                     
     ['diveintopython', 'docbook', 'html']
      -
    1. FIXME +
    2. Even after hooking the to_json() function into the serialization, and hooking the from_json() function into the deserialization, we still haven’t recreated a perfect replica of the original data structure. Why not? +
    3. In the original entry data structure, the value of the 'tags' key was a tuple of three strings. +
    4. But in the round-tripped entry2 data structure, the value of the 'tags' key is a list of three strings. JSON doesn’t distinguish between tuples and lists; it only has a single list-like datatype, the array, and the json module silently converts both tuples and lists into JSON arrays during serialization. For most uses, you can ignore the difference between tuples and lists, but it’s something to keep in mind as you work with the json module.
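    The tuple-to-list conversion is easy to reproduce in isolation. This sketch round-trips just the 'tags' value through json.dumps() and json.loads() in memory. The last step, converting the list back by hand with tuple(), is one possible workaround suggested here as an assumption; it is not something the json module does for you.

```python
import json

tags = ('diveintopython', 'docbook', 'html')

# Serialize and immediately deserialize, all in memory.
round_tripped = json.loads(json.dumps({'tags': tags}))['tags']

print(type(round_tripped))           # a list, because JSON only has arrays
print(round_tripped == list(tags))   # the contents are unchanged

# Manual workaround (an assumption for illustration, not part of the json module):
restored = tuple(round_tripped)
print(restored == tags)
```

    This only works because you already know which key is supposed to hold a tuple; the JSON data itself carries no hint.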
    -

    FIXME -

    Further Reading

    Many articles about the pickle module make references to cPickle. In Python 2, there were two implementations of the pickle module, one written in pure Python and another written in C (but still callable from Python). In Python 3, these two modules have been consolidated, so you should always just import pickle. You may find these articles useful, but you should ignore the now-obsolete information about cPickle.

    +

    On pickling with the pickle module: +

    + +

    On JSON and the json module: + +

    +

    On pickle extensibility: + +

    +

    © 2001–9 Mark Pilgrim diff --git a/table-of-contents.html b/table-of-contents.html index 9338c60..99f68d1 100755 --- a/table-of-contents.html +++ b/table-of-contents.html @@ -230,26 +230,40 @@ ul li ol{margin:0;padding:0 0 0 2.5em}

  • XML
      -
    1. Diving In -
    2. A 5-Minute Crash Course in XML -
    3. The Structure Of An Atom Feed -
    4. Parsing XML -
        -
      1. Elements Are Lists -
      2. Attributes Are Dictonaries -
      -
    5. Searching For Nodes Within An XML Document -
    6. Going Further With lxml -
        -
      1. Customizing Your XML Parser -
      2. Incremental Parsing -
      -
    7. Generating XML -
    8. Further Reading +
    9. Diving In +
    10. A 5-Minute Crash Course in XML +
    11. The Structure Of An Atom Feed +
    12. Parsing XML +
        +
      1. Elements Are Lists +
      2. Attributes Are Dictonaries +
      +
    13. Searching For Nodes Within An XML Document +
    14. Going Further With lxml +
        +
      1. Customizing Your XML Parser +
      2. Incremental Parsing +
      +
    15. Generating XML +
    16. Further Reading
  • Serializing Python Objects
      -
    1. ...diving in... +
    2. Diving In +
        +
      1. A Quick Note About The Examples in This Chapter +
      +
    3. Saving Data to a Pickle File +
    4. Loading Data from a Pickle File +
    5. Pickling Without a File +
    6. Bytes and Strings Rear Their Ugly Heads Again +
    7. Debugging Pickle Files +
    8. Serializing Python Objects to be Read by Other Languages +
    9. Saving Data to a JSON File +
    10. Mapping of Python Datatypes to JSON +
    11. Serializing Datatypes Unsupported by JSON +
    12. Loading Data from a JSON File +
    13. Further Reading
  • HTTP Web Services