Python persistence management

Use serialization to store Python objects

Persistence is all about keeping objects around, even between executions of a program. In this article you'll get a general understanding of various persistence mechanisms for Python objects, from relational databases to Python pickles and beyond. You'll also take an in-depth look at Python's object serialization capabilities.

Patrick O’Brien (pobrien@orbtech.com), Python programmer, Orbtech

Patrick O'Brien is a Python programmer, consultant, and trainer. He is the author of PyCrust and a developer on the PythonCard project. He most recently led the PyPerSyst team that ported Prevayler to Python, and continues to lead that project into interesting new territory. Learn more about Patrick and his work at the Orbtech Web site, or contact him at pobrien@orbtech.com.



01 November 2002


What is persistence?

The basic idea of persistence is fairly simple. Let's say you've got a Python program, perhaps to manage your daily to-do list, and you want to save your application objects (your to-do items) between uses of the program. In other words, you want to store your objects to disk and retrieve them later. That's persistence. To accomplish that goal you've got several options, each with advantages and disadvantages.

For example, you could store your object's data in some kind of formatted text file, such as a CSV file. Or you could use a relational database, such as Gadfly, MySQL, PostgreSQL, or DB2. These file formats and databases are well established, and Python has robust interfaces for all of these storage mechanisms.

One thing these storage mechanisms all have in common is that data is stored independent of the objects and programs that operate on the data. The benefit is that the data then becomes available as a shared resource for other applications. The drawback is that allowing access to an object's data in this way violates the object-oriented principle of encapsulation, in which an object's data should only be accessible through its own, public interface.

For some applications, then, the relational database approach may not be ideal, largely because relational databases do not understand objects. Instead, they impose their own type system and their own data model of relations (tables), each containing a set of tuples (rows) made up of a fixed number of statically typed fields (columns). If the object model for your application doesn't translate easily into the relational model, you'll have quite a challenge mapping your objects to tuples and back again. This challenge is often referred to as the impedance mismatch problem.


Object persistence

If you want to transparently store Python objects without losing their identity, type, etc., then you need some form of object serialization: a process that turns arbitrarily complex objects into textual or binary representations of those objects. Likewise, you must be able to restore the serialized form of an object back into an object that is the same as the original. In Python the serialization process is called pickling, and you can pickle/unpickle your objects to/from a string, a file on disk, or any file-like object. We'll look at pickling in detail later in this article.

Let's say you like the idea of keeping everything as an object and avoiding the overhead of translating objects into some kind of non-object based storage. Pickle files provide those benefits, but sometimes you need something more robust and scalable than simple pickle files. For example, pickling alone doesn't solve the problem of naming and locating the pickle files, nor does it support concurrent access to persistent objects. For those features you need to turn to something like ZODB, the Z object database for Python. ZODB is a robust, multi-user, object-oriented database system capable of storing and managing arbitrarily complex Python objects with transaction support and concurrency control. (See Resources to download ZODB.) Interestingly enough, even ZODB relies upon Python's native serialization capability, and to use ZODB effectively you must have a solid understanding of pickling.

Another interesting approach to the persistence problem, originally implemented in Java, is called Prevayler. (See Resources for a developerWorks article on Prevayler.) A group of Python programmers recently ported Prevayler to Python and the result, called PyPerSyst, is hosted on SourceForge. (See Resources for a link to the PyPerSyst project.) The Prevayler/PyPerSyst concept also builds upon the native serialization capabilities of the Java and Python languages. PyPerSyst keeps an entire object system in memory, and provides disaster recovery by occasionally pickling a snapshot of the system to disk and by maintaining a log of commands that can be reapplied to the latest snapshot. While applications that use PyPerSyst are therefore limited by available RAM, the advantages are that a native object system completely loaded in memory is extremely fast and is much simpler to implement than one, such as ZODB, that allows for more objects than can be held in memory at once.

Now that we've briefly touched upon the various ways to store our persistent objects, it's time to examine the pickling process in detail. While our main interest is in exploring ways to persist Python objects without having to translate them into some other format, we are still left with various concerns, such as: how to effectively pickle and unpickle both simple and complex objects, including instances of custom classes; how to maintain object references, including circular and recursive references; and how to handle changes to class definitions without running into problems with previously pickled instances. We'll cover all of these issues in the following examination of Python's pickling capabilities.


A peck of pickled Python

Python pickling support comes from the pickle module, and its cousin, the cPickle module. The latter was coded in C to provide better performance and is the recommended choice for most applications. We'll continue to talk about pickle, but our examples will actually make use of cPickle. Since most of our examples will be shown from the Python shell, let's start by showing how to import cPickle while being able to refer to it as pickle:

>>> import cPickle as pickle

Now that we've imported the module, let's take a look at the pickle interface. The pickle module provides the following function pairs: dumps(object) returns a string containing an object in pickle format; loads(string) returns the object contained in the pickle string; dump(object, file) writes the object to the file, which may be an actual physical file, but could also be any file-like object having a write() method that accepts a single string argument; load(file) returns the object contained in the pickle file.

By default, dumps() and dump() create pickles using a printable ASCII representation. Both functions have a final, optional argument that, if True, specifies that pickles will be created using a faster and smaller binary representation. The loads() and load() functions automatically detect whether a pickle is in the binary or text format.

Listing 1 shows an interactive session using the dumps() and loads() functions just described:

Listing 1. Illustration of dumps() and loads()
Welcome To PyCrust 0.7.2 - The Flakiest Python Shell
Sponsored by Orbtech - Your source for Python programming expertise.
Python 2.2.1 (#1, Aug 27 2002, 10:22:32)
[GCC 3.2 (Mandrake Linux 9.0 3.2-1mdk)] on linux-i386
Type "copyright", "credits" or "license" for more information.
>>> import cPickle as pickle
>>> t1 = ('this is a string', 42, [1, 2, 3], None)
>>> t1
('this is a string', 42, [1, 2, 3], None)
>>> p1 = pickle.dumps(t1)
>>> p1
"(S'this is a string'\nI42\n(lp1\nI1\naI2\naI3\naNtp2\n."
>>> print p1
(S'this is a string'
I42
(lp1
I1
aI2
aI3
aNtp2
.
>>> t2 = pickle.loads(p1)
>>> t2
('this is a string', 42, [1, 2, 3], None)
>>> p2 = pickle.dumps(t1, True)
>>> p2
'(U\x10this is a stringK*]q\x01(K\x01K\x02K\x03eNtq\x02.'
>>> t3 = pickle.loads(p2)
>>> t3
('this is a string', 42, [1, 2, 3], None)

Notice that the text pickle format isn't too difficult to decipher. In fact, the conventions used are all documented in the pickle module. We should also point out that with the simple objects used in our example, there wasn't much space efficiency gained by using the binary pickle format. However, in a real system with complex objects, you will see a noticeable size and speed improvement with the binary format.
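As a rough check of the size claim, the following sketch compares the two formats on a larger structure. It assumes Python 3, where the pickle module uses its C implementation automatically and the optional argument is spelled protocol (0 selects the printable text format, pickle.HIGHEST_PROTOCOL the most compact binary one). The sample records are made up for illustration.

```python
import pickle

# Made-up sample data: 1,000 small records.
records = [{'id': i, 'name': 'item %d' % i, 'done': i % 2 == 0}
           for i in range(1000)]

text_pickle = pickle.dumps(records, protocol=0)
binary_pickle = pickle.dumps(records, protocol=pickle.HIGHEST_PROTOCOL)

# Print both sizes so they can be compared.
print(len(text_pickle), len(binary_pickle))

# Both formats restore an equal object; loads() detects the format itself.
assert pickle.loads(text_pickle) == records
assert pickle.loads(binary_pickle) == records
```

The exact sizes depend on the Python version and on the data, so treat the numbers as indicative only.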

Next we'll look at some examples using dump() and load(), which work with files and file-like objects. These functions operate much like the dumps() and loads() that we just looked at, with one additional capability -- the dump() function allows you to dump several objects to the same file, one after the other. Subsequent calls to load() will retrieve the objects in the same order. Listing 2 shows this capability in action:

Listing 2. Example of dump() and load()
>>> a1 = 'apple'
>>> b1 = {1: 'One', 2: 'Two', 3: 'Three'}
>>> c1 = ['fee', 'fie', 'foe', 'fum']
>>> f1 = file('temp.pkl', 'wb')
>>> pickle.dump(a1, f1, True)
>>> pickle.dump(b1, f1, True)
>>> pickle.dump(c1, f1, True)
>>> f1.close()
>>> f2 = file('temp.pkl', 'rb')
>>> a2 = pickle.load(f2)
>>> a2
'apple'
>>> b2 = pickle.load(f2)
>>> b2
{1: 'One', 2: 'Two', 3: 'Three'}
>>> c2 = pickle.load(f2)
>>> c2
['fee', 'fie', 'foe', 'fum']
>>> f2.close()
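The same sequence can be sketched in Python 3 without touching the disk. This version assumes an in-memory io.BytesIO buffer in place of temp.pkl and reads pickles back until load() raises EOFError, which is handy when the number of stored objects isn't known in advance.

```python
import io
import pickle

items = ['apple',
         {1: 'One', 2: 'Two', 3: 'Three'},
         ['fee', 'fie', 'foe', 'fum']]

buffer = io.BytesIO()          # stands in for a file opened in 'wb' mode
for item in items:
    pickle.dump(item, buffer)  # one pickle after another, same stream

buffer.seek(0)                 # rewind, as if reopening the file for reading
restored = []
while True:
    try:
        restored.append(pickle.load(buffer))
    except EOFError:           # raised once no pickles remain
        break
```

The objects come back in the order in which they were dumped.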

Pickle power

So far we've covered the basics of pickling. In this section, we'll cover some advanced issues that arise when you start to pickle complex objects, including instances of custom classes. Fortunately, you'll see that Python handles these situations quite readily.

Portability

Pickles are portable over space and time. In other words, the pickle file format is independent of machine architecture, which means you can create a pickle under Linux, for example, and send it to a Python program running under Windows or the Mac OS. And when you upgrade to a newer version of Python, you don't have to worry that you might be abandoning existing pickles. The Python developers have guaranteed that the pickle format will be backwards compatible across Python versions. In fact, details about current and supported formats are provided with the pickle module:

Listing 3. Retrieving supported formats
>>> pickle.format_version
'1.3'
>>> pickle.compatible_formats
['1.0', '1.1', '1.2']
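In Python 3 the formats are usually identified by protocol number rather than by the version strings shown in Listing 3. As a minimal sketch of the backward-compatibility promise, assuming Python 3, the loop below writes the same tuple with every protocol the running interpreter supports and reads each one back.

```python
import pickle

data = ('this is a string', 42, [1, 2, 3], None)

round_trips = {}
for protocol in range(pickle.HIGHEST_PROTOCOL + 1):
    pickled = pickle.dumps(data, protocol=protocol)
    # loads() needs no protocol argument; it recognizes the format itself.
    round_trips[protocol] = pickle.loads(pickled)

# The protocol used by default, and the newest one this interpreter knows.
print(pickle.DEFAULT_PROTOCOL, pickle.HIGHEST_PROTOCOL)
```

An older interpreter cannot read a protocol newer than its own HIGHEST_PROTOCOL, so compatibility runs in one direction only.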

Multiple references, same object

In Python, a variable is a reference to an object. And you can have multiple variables referencing the same object. It turns out that Python has no trouble at all maintaining this behavior with pickled objects, as Listing 4 demonstrates:

Listing 4. Maintenance of object references
>>> a = [1, 2, 3]
>>> b = a
>>> a
[1, 2, 3]
>>> b
[1, 2, 3]
>>> a.append(4)
>>> a
[1, 2, 3, 4]
>>> b
[1, 2, 3, 4]
>>> c = pickle.dumps((a, b))
>>> d, e = pickle.loads(c)
>>> d
[1, 2, 3, 4]
>>> e
[1, 2, 3, 4]
>>> d.append(5)
>>> d
[1, 2, 3, 4, 5]
>>> e
[1, 2, 3, 4, 5]

Circular and recursive references

The support for object references that we just demonstrated extends to circular references, where two objects contain references to each other, and recursive references, where an object contains a reference to itself. The following two listings highlight this capability. Let's look at a recursive reference first:

Listing 5. Recursive reference
>>> l = [1, 2, 3]
>>> l.append(l)
>>> l
[1, 2, 3, [...]]
>>> l[3]
[1, 2, 3, [...]]
>>> l[3][3]
[1, 2, 3, [...]]
>>> p = pickle.dumps(l)
>>> l2 = pickle.loads(p)
>>> l2
[1, 2, 3, [...]]
>>> l2[3]
[1, 2, 3, [...]]
>>> l2[3][3]
[1, 2, 3, [...]]

Now let's look at an example of a circular reference:

Listing 6. Circular reference
>>> a = [1, 2]
>>> b = [3, 4]
>>> a.append(b)
>>> a
[1, 2, [3, 4]]
>>> b.append(a)
>>> a
[1, 2, [3, 4, [...]]]
>>> b
[3, 4, [1, 2, [...]]]
>>> a[2]
[3, 4, [1, 2, [...]]]
>>> b[2]
[1, 2, [3, 4, [...]]]
>>> a[2] is b
1
>>> b[2] is a
1
>>> f = file('temp.pkl', 'w')
>>> pickle.dump((a, b), f)
>>> f.close()
>>> f = file('temp.pkl', 'r')
>>> c, d = pickle.load(f)
>>> f.close()
>>> c
[1, 2, [3, 4, [...]]]
>>> d
[3, 4, [1, 2, [...]]]
>>> c[2]
[3, 4, [1, 2, [...]]]
>>> d[2]
[1, 2, [3, 4, [...]]]
>>> c[2] is d
1
>>> d[2] is c
1

Notice how we get slightly, but significantly, different results when we pickle each object separately, rather than pickling them together inside a tuple as shown in Listing 7:

Listing 7. Pickling separately versus together inside a tuple
>>> f = file('temp.pkl', 'w')
>>> pickle.dump(a, f)
>>> pickle.dump(b, f)
>>> f.close()
>>> f = file('temp.pkl', 'r')
>>> c = pickle.load(f)
>>> d = pickle.load(f)
>>> f.close()
>>> c
[1, 2, [3, 4, [...]]]
>>> d
[3, 4, [1, 2, [...]]]
>>> c[2]
[3, 4, [1, 2, [...]]]
>>> d[2]
[1, 2, [3, 4, [...]]]
>>> c[2] is d
0
>>> d[2] is c
0

Equal, but not always identical

As we hinted in our last example, two references are identical only if they refer to the same object in memory. A pickle, by contrast, is restored to an object that is equal to its original but not identical to it. In other words, each unpickled object is a copy of the original object:

Listing 8. Restored objects as copies of originals
>>> j = [1, 2, 3]
>>> k = j
>>> k is j
1
>>> x = pickle.dumps(k)
>>> y = pickle.loads(x)
>>> y
[1, 2, 3]
>>> y == k
1
>>> y is k
0
>>> y is j
0
>>> k is j
1

At the same time, we saw that Python is able to maintain references between objects that are pickled as a unit. However, we also saw that separate calls to dump() take away Python's ability to maintain references to objects outside of the unit being pickled. Instead, Python makes a copy of the referenced object and stores it with the item being pickled. This isn't a problem for an application that pickles and restores a single object hierarchy. But it is something to be aware of for other situations.
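The difference between pickling as a unit and pickling separately can be condensed into a few lines. This is a sketch in Python 3 that uses dumps() and loads() on in-memory bytes instead of a file, with the same two mutually referencing lists as Listing 6.

```python
import pickle

a = [1, 2]
b = [3, 4]
a.append(b)
b.append(a)        # a and b now refer to each other

# Pickled as one unit: the mutual references survive.
c, d = pickle.loads(pickle.dumps((a, b)))
together = (c[2] is d, d[2] is c)

# Pickled separately: each pickle carries its own copy of the other list.
c = pickle.loads(pickle.dumps(a))
d = pickle.loads(pickle.dumps(b))
separately = (c[2] is d, d[2] is c)

# The copies still hold equal data (compare slices; comparing the whole
# circular lists with == would recurse without end).
equal_anyway = c[2][:2] == d[:2]
```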

It's also worth pointing out that there is an option that does allow separately pickled objects to maintain references to each other as long as they are all pickled to the same file. The pickle and cPickle modules provide a Pickler (and corresponding Unpickler) that is able to keep track of objects that have already been pickled. By using this Pickler, shared and circular references will be pickled by reference, rather than by value:

Listing 9. Maintenance of references among separately pickled objects
>>> f = file('temp.pkl', 'w')
>>> pickler = pickle.Pickler(f)
>>> pickler.dump(a)
<cPickle.Pickler object at 0x89b0bb8>
>>> pickler.dump(b)
<cPickle.Pickler object at 0x89b0bb8>
>>> f.close()
>>> f = file('temp.pkl', 'r')
>>> unpickler = pickle.Unpickler(f)
>>> c = unpickler.load()
>>> d = unpickler.load()
>>> c[2]
[3, 4, [1, 2, [...]]]
>>> d[2]
[1, 2, [3, 4, [...]]]
>>> c[2] is d
1
>>> d[2] is c
1
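A Python 3 sketch of the same idea follows, using an in-memory buffer rather than temp.pkl. It relies on the Pickler and the Unpickler each keeping their memo between calls, which is how current CPython behaves. In Python 3, dump() returns None rather than the pickler object echoed in Listing 9.

```python
import io
import pickle

a = [1, 2]
b = [3, 4]
a.append(b)
b.append(a)

buffer = io.BytesIO()
pickler = pickle.Pickler(buffer)
pickler.dump(a)    # records both a and b in the pickler's memo
pickler.dump(b)    # b is already known, so only a memo reference is written

buffer.seek(0)
unpickler = pickle.Unpickler(buffer)
c = unpickler.load()
d = unpickler.load()   # resolved against the memo built by the first load()
```

Because the second pickle is only a reference into the memo, it cannot be loaded on its own; both pickles must be read by the same Unpickler, in the same order.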

Nonpicklable objects

A few object types cannot be pickled. For example, Python cannot pickle a file object (or any object with a reference to a file object), because Python cannot guarantee that it can recreate the state of the file upon unpickling. (The other examples are so obscure that they aren't worth mentioning in an article of this nature.) Attempting to pickle a file object results in the following error:

Listing 10. Result of trying to pickle a file object
>>> f = file('temp.pkl', 'w')
>>> p = pickle.dumps(f)
Traceback (most recent call last):
  File "<input>", line 1, in ?
  File "/usr/lib/python2.2/copy_reg.py", line 57, in _reduce
    raise TypeError, "can't pickle %s objects" % base.__name__
TypeError: can't pickle file objects
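The traceback above comes from Python 2.2. In Python 3 the attempt still fails with a TypeError, although the wording of the message differs between versions. A small sketch, assuming a throwaway file in a temporary directory:

```python
import os
import pickle
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    f = open(os.path.join(tmp, 'temp.log'), 'w')
    try:
        pickle.dumps(f)
    except TypeError as exc:
        error = exc        # the exact message varies by Python version
    else:
        error = None
    finally:
        f.close()          # close before the directory is cleaned up
```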

Class instances

The pickling of class instances requires a bit more attention than the pickling of simple object types. This is primarily due to the fact that Python pickles the instance data (usually the __dict__ attribute) and the name of the class, but not the code for the class. When Python unpickles a class instance, it attempts to import the module containing the class definition using the exact class name and module name (including any package path prefixes) as they were at the time the instance was pickled. Also note that class definitions must appear at the top level of a module, meaning they cannot be nested classes (classes defined inside other classes or functions).

When class instances are unpickled, their __init__() method isn't normally called again. Instead, Python creates a generic class instance, applies the instance attributes that were pickled, and sets the instance's __class__ attribute to point to the original class.
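One way to observe this is to count constructor calls. The sketch below assumes Python 3, where every class is a new-style class. The init_calls counter is a hypothetical name added for the demonstration, and because it is a class attribute it is not part of the pickled instance state.

```python
import pickle

class Foo:
    init_calls = 0                 # class attribute: never pickled

    def __init__(self, value):
        Foo.init_calls += 1        # count every real construction
        self.value = value

foo = Foo('What is a Foo?')
foo2 = pickle.loads(pickle.dumps(foo))   # restored without calling __init__()
```

After the round trip, foo2 is a distinct Foo instance with the same value, and the counter still reflects only the one explicit construction.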

New-style classes, introduced in Python 2.2, rely on a slightly different unpickling mechanism. While the result of the process is essentially the same as with old-style classes, Python uses the copy_reg module's _reconstructor() function to restore new-style class instances.

If you want to modify the default pickling behavior for either new-style or old-style class instances, you can define special class methods, named __getstate__() and __setstate__(), that will be called by Python during the saving and restoring of state information for instances of the class. We'll see some examples that make use of these special methods in the following sections.

For now, let's take a look at a simple class instance. To begin, we created a Python module named persist.py, containing the following new-style class definition:

Listing 11. New-style class definition
class Foo(object):

    def __init__(self, value):
        self.value = value

Now we can pickle a Foo instance and take a look at its representation:

Listing 12. Pickling a Foo instance
>>> import cPickle as pickle
>>> from Orbtech.examples.persist import Foo
>>> foo = Foo('What is a Foo?')
>>> p = pickle.dumps(foo)
>>> print p
ccopy_reg
_reconstructor
p1
(cOrbtech.examples.persist
Foo
p2
c__builtin__
object
p3
NtRp4
(dp5
S'value'
p6
S'What is a Foo?'
sb.
>>>

You can see that the class name, Foo, and the fully qualified module name, Orbtech.examples.persist, are both stored in the pickle. If we had pickled this instance to a file, and unpickled it later or on another machine, Python would attempt to import the Orbtech.examples.persist module and would raise an exception if it could not. Similar errors would occur if we renamed the class, renamed the module, or moved the module to another directory.
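Instead of reading the text pickle by eye, the stored names can be pulled out with the standard pickletools module. This sketch assumes Python 3 and defines Foo in whatever module runs the code, so the module name it reports will be that module's name rather than Orbtech.examples.persist.

```python
import pickle
import pickletools

class Foo:
    def __init__(self, value):
        self.value = value

# Protocol 0 is the text format, which stores class references as
# GLOBAL opcodes carrying a module name and a class name.
p = pickle.dumps(Foo('What is a Foo?'), protocol=0)

# Collect the "module name" argument of every GLOBAL opcode in the pickle.
global_refs = [arg for opcode, arg, pos in pickletools.genops(p)
               if opcode.name == 'GLOBAL']
print(global_refs)
```

The list contains one entry for Foo, alongside entries for the helper function and base class that the text format uses to rebuild the instance.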

Here is the error Python gives when we rename the Foo class and then try to load a previously pickled Foo instance:

Listing 13. Trying to load a pickled instance of a renamed Foo class
>>> import cPickle as pickle
>>> f = file('temp.pkl', 'r')
>>> foo = pickle.load(f)
Traceback (most recent call last):
  File "<input>", line 1, in ?
AttributeError: 'module' object has no attribute 'Foo'

A similar error occurs when we rename the persist.py module:

Listing 14. Trying to load a pickled instance of a renamed persist.py module
>>> import cPickle as pickle
>>> f = file('temp.pkl', 'r')
>>> foo = pickle.load(f)
Traceback (most recent call last):
  File "<input>", line 1, in ?
ImportError: No module named persist
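Both failures can be reproduced in a single Python 3 script without editing files on disk. The sketch below builds a stand-in for persist.py in memory; the module name persist_demo is hypothetical, and deleting names from the module and from sys.modules is only a way to simulate the renames.

```python
import pickle
import sys
import types

# Build a stand-in for persist.py in memory (hypothetical module name).
source = '''
class Foo(object):
    def __init__(self, value):
        self.value = value
'''
module = types.ModuleType('persist_demo')
exec(source, module.__dict__)
sys.modules['persist_demo'] = module

p = pickle.dumps(module.Foo('What is a Foo?'))

# Simulate renaming the class: the old name no longer exists in the module.
module.Bar = module.Foo
del module.Foo
try:
    pickle.loads(p)
except AttributeError as exc:
    class_error = exc

# Simulate renaming the module: the old module can no longer be imported.
del sys.modules['persist_demo']
try:
    pickle.loads(p)
except ImportError as exc:
    module_error = exc
```

The exception types match those in Listings 13 and 14, though the message text in Python 3 is worded differently.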

We'll provide techniques for managing these kinds of changes, without breaking existing pickles, in the Schema evolution section below.

Special state methods

Earlier we mentioned that a few object types, such as file objects, cannot be pickled. One way to handle instance attributes that are not picklable objects is to use the special methods available for modifying a class instance's state: __getstate__() and __setstate__(). Here is an example of our Foo class, which we've modified to handle a file object attribute:

Listing 15. Handling unpicklable instance attributes
class Foo(object):

    def __init__(self, value, filename):
        self.value = value
        self.logfile = file(filename, 'w')

    def __getstate__(self):
        """Return state values to be pickled."""
        f = self.logfile
        return (self.value, f.name, f.tell())

    def __setstate__(self, state):
        """Restore state from the unpickled state values."""
        self.value, name, position = state
        f = file(name, 'r+')  # 'r+' keeps the existing log; 'w' would truncate it
        f.seek(position)
        self.logfile = f

When an instance of Foo is pickled, Python will pickle only the values returned to it when it calls the instance's __getstate__() method. Likewise, during unpickling, Python will supply the unpickled values as an argument to the instance's __setstate__() method. Inside the __setstate__() method we are able to recreate the file object based on the name and position information we pickled, and assign the file object to the instance's logfile attribute.
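Here is a runnable Python 3 sketch of the same technique, with a short round trip to show the restored instance continuing the log where the original left off. It uses open() in place of file(), works in binary mode, and reopens the log with 'r+b' so that existing entries are preserved; these are choices made for this sketch, as is the temporary directory used for the log file.

```python
import os
import pickle
import tempfile

class Foo:
    def __init__(self, value, filename):
        self.value = value
        self.logfile = open(filename, 'wb')

    def __getstate__(self):
        """Return state values to be pickled."""
        f = self.logfile
        return (self.value, f.name, f.tell())

    def __setstate__(self, state):
        """Restore state from the unpickled state values."""
        self.value, name, position = state
        f = open(name, 'r+b')      # 'r+b' keeps contents; 'wb' would truncate
        f.seek(position)
        self.logfile = f

with tempfile.TemporaryDirectory() as tmp:
    foo = Foo('What is a Foo?', os.path.join(tmp, 'foo.log'))
    foo.logfile.write(b'first entry\n')
    p = pickle.dumps(foo)          # stores value, file name, and position
    foo.logfile.close()            # flushes the first entry to disk

    foo2 = pickle.loads(p)         # reopens the log at the saved position
    foo2.logfile.write(b'second entry\n')
    foo2.logfile.close()

    with open(foo2.logfile.name, 'rb') as f:
        log_contents = f.read()
```

Note that the pickle stores only the file's name and position, so unpickling succeeds only where a file of that name still exists.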


Schema evolution

Over time you'll find yourself having to make changes to your class definitions. If you've already pickled instances of a class that needs changing, you'll likely want to retrieve and update those instances so that they continue to function properly with the new class definition. We already saw some of the errors that can occur when changes are made to classes or modules. Fortunately, the pickling and unpickling processes provide hooks that we can use to support this need for schema evolution.

In this section, we'll look at ways to anticipate common problems and work around them. Because a class instance's code is not pickled, you can add, change, and remove methods without impacting existing pickled instances. For the same reason, you don't have to worry about class attributes. You do have to ensure that the code module containing the class definition is available in the unpickling environment. And you must plan for the changes that can cause unpickling problems: changing the name of a class, adding or removing instance attributes, and changing the name or location of the class definition module.

Class name change

To change the name of a class without breaking previously pickled instances, follow these steps. First, leave the original class definition intact so that it can be found when existing instances are unpickled. Instead of changing the original name, create a copy of the class definition, in the same module as the original class definition, giving it the new class name. Then add the following method to the original class definition, using the actual new class name in place of NewClassName:

Listing 16. Changing a class name: Method to add to the original class definition
def __setstate__(self, state):
    self.__dict__.update(state)
    self.__class__ = NewClassName

When existing instances are unpickled, Python will locate the original class definition, the instance's __setstate__() method will be called, and the instance's __class__ attribute will be reassigned to the new class definition. Once you are sure that all the existing instances have been unpickled, updated, and re-pickled, you can remove the old class definition from the source code module.
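The whole migration can be sketched end to end in Python 3. The names OldClassName and NewClassName are placeholders, and the pickle is created in the same script only to stand in for one that was written before the rename.

```python
import pickle

class NewClassName:
    def __init__(self, value):
        self.value = value

class OldClassName:                 # kept only so that old pickles still load
    def __init__(self, value):
        self.value = value

    def __setstate__(self, state):
        self.__dict__.update(state)
        self.__class__ = NewClassName   # hand the instance to the new class

# Stands in for a pickle written before the rename.
old_pickle = pickle.dumps(OldClassName('legacy'))

migrated = pickle.loads(old_pickle)     # found via the old name, then converted
repickled = pickle.dumps(migrated)      # this pickle refers to NewClassName
```

Once every stored pickle has been rewritten this way, nothing refers to OldClassName any longer.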

Attribute addition and subtraction

Once again, the special state methods, __getstate__() and __setstate__(), give us control over each instance's state and the opportunity to handle changes in an instance's attributes. Let's take a look at a simple class definition to which we will add and remove attributes. Here is the initial definition:

Listing 17. Original class definition
class Person(object):

    def __init__(self, firstname, lastname):
        self.firstname = firstname
        self.lastname = lastname

Let's assume we've created and pickled instances of Person, and now we've decided that we really just want to store one name attribute, rather than separate first and last names. Here is one way to change the class definition that will migrate previously pickled instances to the new definition:

Listing 18. New class definition
class Person(object):

    def __init__(self, fullname):
        self.fullname = fullname

    def __setstate__(self, state):
        if 'fullname' not in state:
            first = ''
            last = ''
            if 'firstname' in state:
                first = state['firstname']
                del state['firstname']
            if 'lastname' in state:
                last = state['lastname']
                del state['lastname']
            self.fullname = " ".join([first, last]).strip()
        self.__dict__.update(state)

In this example we added a new attribute, fullname, and removed two existing attributes, firstname and lastname. When a previously pickled instance is unpickled, its previously pickled state will be passed to __setstate__() as a dictionary, which will include values for the firstname and lastname attributes. We then combine those two values and assign them to the new fullname attribute. Along the way, we eliminate the old attributes from the state dictionary. After all the previously pickled instances have been updated and re-pickled, we can remove the __setstate__() method from the class definition.
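To try the migration without keeping two versions of the module around, an old-format pickle can be simulated. The Python 3 sketch below bypasses __init__() to give an instance the old attributes and then pickles it, which is an assumption made purely for the demonstration. It also uses dict.pop() as a shorter equivalent of the test-and-delete logic in Listing 18.

```python
import pickle

class Person:
    def __init__(self, fullname):
        self.fullname = fullname

    def __setstate__(self, state):
        if 'fullname' not in state:
            # Old-format state: remove the old keys and merge their values.
            first = state.pop('firstname', '')
            last = state.pop('lastname', '')
            self.fullname = ' '.join([first, last]).strip()
        self.__dict__.update(state)

# Simulate an instance pickled by the old class definition by bypassing
# __init__() and setting the old attributes directly.
old = Person.__new__(Person)
old.__dict__.update({'firstname': 'Patrick', 'lastname': "O'Brien"})
old_pickle = pickle.dumps(old)

person = pickle.loads(old_pickle)   # __setstate__() migrates the state
```

Instances pickled in the new format already carry fullname, so for them __setstate__() simply applies the state unchanged.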

Module modifications

A module name or location change is conceptually similar to a class name change but must be handled quite differently. That's because the module information is stored in the pickle but is not an attribute that can be modified through the standard pickle interface. In fact, the only way to change the module information is to perform a search and replace operation on the actual pickle file itself. Exactly how you would do this depends on your operating system and the tools you have at your disposal. And obviously this is a situation where you will want to back up your files in case you make a mistake. But the change should be fairly straightforward and will work equally well with the binary pickle format as with the text pickle format.
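As an illustration of the search and replace approach, the Python 3 sketch below patches a text-format pickle held in memory. The module names old_persist and new_persist are hypothetical, and both modules are built in memory so that the example is self-contained. It deliberately uses protocol 0, where a class reference is stored as the letter c followed by the module name and the class name, each on its own line. Newer binary protocols in Python 3 store names with a length prefix, so a plain replacement there should be assumed safe only when the old and new names have the same length.

```python
import pickle
import sys
import types

source = '''
class Foo(object):
    def __init__(self, value):
        self.value = value
'''

# Hypothetical old module, built in memory for the demonstration.
old_module = types.ModuleType('old_persist')
exec(source, old_module.__dict__)
sys.modules['old_persist'] = old_module
p = pickle.dumps(old_module.Foo('What is a Foo?'), protocol=0)

# "Rename" the module: the same code is now importable under a new name only.
new_module = types.ModuleType('new_persist')
exec(source, new_module.__dict__)
sys.modules['new_persist'] = new_module
del sys.modules['old_persist']

# In the text format, a class reference is stored as c<module>\n<name>\n.
patched = p.replace(b'cold_persist\nFoo\n', b'cnew_persist\nFoo\n')
foo = pickle.loads(patched)
```

Matching the whole reference, including the leading c and both newlines, reduces the risk of also altering ordinary string data that happens to contain the module name.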


Conclusion

Object persistence depends on the object serialization capabilities of the underlying programming language. For Python objects that means pickling. Python pickles provide a robust and reliable foundation for effective persistence management of Python objects. In the Resources below, you'll find information about systems that build on Python's pickling capability.

Resources
