In Django, most interactions with a database are through its Object-Relational Mapper (ORM), a feature Django shares with other recent Web frameworks, such as Rails. ORMs are increasingly popular with developers, as they automate many common interactions with a database and use familiar object-oriented approaches rather than SQL statements.
Django programmers may choose to bypass the native ORM in favor of the popular
SQLAlchemy package, but although
SQLAlchemy is quite powerful, it is also more difficult to
use and requires more lines of code. Django applications can and have been developed
using SQLAlchemy instead of the native ORM, but some of
the most attractive features of Django, such as its automatically generated administrative
interface, require use of the ORM.
This article sheds light on some of the lesser-known features of the Django ORM in
particular, but SQLAlchemy users may find some of the
cautions about inefficient query generation applicable to their own code.
Software versions used in this article include:
- Django V1.0.2 (parts 1 and 2)
- Django V1.1 alpha (part 3)
- sqlite3
- Python V2.4-2.6 (Django does not yet support Python V3.)
- IPython (for the sample output)
The Django ORM supports many database back ends, but sqlite3 is the easiest to install and comes bundled with many operating systems. These examples should work with any back end. For a list of databases that Django supports, see Resources.
Avoiding common traps in ORM query generation
Django's design encourages an agile-development style that rewards rapid prototyping and experimentation. In early stages, it's best not to worry about performance and instead favor readability and ease of implementation.
Sometimes, it doesn't take long to hit performance problems. Often, it happens the first time you try your application with real data. It may become obvious when the test suite execution creeps over the five-minute mark, but contains just a few tests. Other times, the application is just noticeably slow. Luckily, there are some easily identifiable patterns that are correspondingly easy to fix. A common case is shown in Listing 1 (the application's models.py file) and Listing 2.
Listing 1. Basic models for the examples application: models.py
from django.db import models
# Some kind of document, like a blog post or a wiki page
class Document(models.Model):
name = models.CharField(max_length=255)
# A user-generated comment, such as found on a site like
# Digg or Reddit
class Comment(models.Model):
document = models.ForeignKey(Document, related_name='comments')
content = models.TextField()
|
Listing 2 shows how to access the models set up in Listing 1 in an inefficient way.
Listing 2. Accessing those models very slowly
from examples.model import *
import uuid
# First create a lot of documents and assign them random names
for i in range(0, 10000):
Document.objects.create(name=str(uuid.uuid4()))
# Get a selection of names back to be looked up later
names = Document.objects.values_list('name', flat=True)[0:5000]
# The really slow way to get back a list of Documents that
# match these names
documents = []
for name in names:
documents.append(Document.objects.get(name=name))
|
This is a contrived example, but it illustrates a fairly common use case: Given a list of identifiers, get all the items that correspond to those identifiers from the database.
The naively coded example above took 65 seconds to run when using sqlite3 in-memory.
With a file system-dependent database, it would be even longer. However,
Listing 3 has a fix for this slow-running query. Instead of
issuing multiple database queries for each name value, use the
fieldname__in operator to generate an SQL query similar
to:
SELECT * FROM model WHERE fieldname IN ('1', '2', ...)
|
(The actual query syntax generated will vary by database engine.)
Listing 3. A fast query for getting a list of items
from examples import models
import uuid
for i in range(0, 10000):
Document.objects.create(name=str(uuid.uuid4()))
names = Document.objects.values_list('name', flat=True)[0:5000]
documents = list(Document.objects.filter(name__in=names))
|
This code executes in only 3 seconds. Note that this code casts the result of the query to a list in order to force the query to be evaluated. Because Django queries are evaluated lazily, simply assigning the result of the query would not cause any database access to occur, invalidating the comparison.
Database gurus may find this example obvious, especially if they're accustomed to writing raw SQL, but many Python programmers don't have a database background. Sometimes, the best instincts of programmers can actually work against efficiency. Listing 4 demonstrates one way you might choose to refactor the code in Listing 2, not realizing it will turn out to be a trap.
Listing 4. A common pattern resulting in slow database usage
for name in names:
documents.append(get_document_by_name(name))
def get_document_by_name(name):
return Document.objects.get(name=name))
|
On the surface, it might seem like a good idea to create a separate method for retrieving documents from the database. Maybe there's other work that should be done here, such as adding data to the model before returning it. Watch out for this pattern because refactoring into discrete methods can appear to be an improvement on the code. Writing unit tests from the very beginning of development and including at least some tests that operate on large datasets can help identify when refactoring results in sudden performance drops.
Encapsulating common queries with manager models
All Django developers use the built-in Manager class: It's
what gets called for all methods of the form Model.objects.*.
The base Manager class is automatically available and
provides commonly used methods that return QuerySets
(for example, all()), those that return values (for
example, count()), and those that return
Model instances (for example,
get_or_create()).
Django developers are encouraged to override the base Manager
class. To illustrate why this feature is useful, extend the examples application to
include a new model, Format, which describes the file
format of the documents in the system. An example is shown below.
Listing 5. Adding a model to examples
from django.db import models
class Document(models.Model):
name = models.CharField(max_length=255)
format = models.ForeignKey('Format')
class Comment(models.Model):
document = models.ForeignKey(Document, related_name='comments')
content = models.TextField()
class Format(models.Model):
type = models.CharField(choices=( ('Text file', 'text'),
('ePub ebook', 'epub'),
('HTML file', 'html')),
max_length=10)
|
Next, use the changed models to create some
sample documents that have Format instances assigned.
Listing 6. Create some documents with assigned formats
# First create a series of Format objects \
and save # them to the database
them to the database
format_text = Format.objects.create(type='text')
format_epub = Format.objects.create(type='epub')
format_html = Format.objects.create(type='html')
# Create a few documents in various formats
for i in range(0, 10):
Document.objects.create(name='My text document',
format=format_text)
Document.objects.create(name='My epub document',
format=format_epub)
Document.objects.create(name='My HTML document',
format=format_html)
|
Imagine that this application will provide a way to first filter the documents by
format, then to filter that QuerySet by other fields
such as title. Following is a simple query to return just the text documents: Document.objects.filter(format=format_text).
In this example, the meaning of the query is quite clear, but in a mature application, you may need to apply many more restrictions to the result set. You may want to restrict the listing to only those documents marked public or documents no older than 30 days. If you have to call this query from multiple places in your application, keeping all of the filtering clauses in sync can become a real maintenance headache and a source of bugs.
This is where a custom manager can help. Custom managers offer the ability to define
unlimited numbers of "canned" queries — similar to the built-in manager
methods, such as latest() (which returns only the most
recent instance of a given model) or distinct() (which
issues a SELECT DISTINCT clause in the generated query).
Not only do these queries reduce the amount of code that may need to be duplicated
across your application but managers improve readability. Over time, would you
prefer to read this:
Documents.objects.filter(format=format_text,publish_on__week_day=todays_week_day, is_public=True).distinct().order_by(date_added).reverse() |
Or will it be easier for you or a new developer to understand:
Documents.home_page.all() |
Creating a custom manager is extremely simple. Listing 7 shows
the get_by_format example.
Listing 7. Custom Manager class that provides methods for each format type
from django.db import models
class DocumentManager(models.Manager):
# The model class for this manager is always available as
# self.model, but in this example we are only relying on the
# filter() method inherited from models.Manager.
def text_format(self):
return self.filter(format__type='text')
def epub_format(self):
return self.filter(format__type='epub')
def html_format(self):
return self.filter(format__type='html')
class Document(models.Model):
name = models.CharField(max_length=255)
format = models.ForeignKey('Format')
# The new model manager
get_by_format = DocumentManager()
# The default model manager now needs to be explicitly defined
objects = models.Manager()
class Comment(models.Model):
document = models.ForeignKey(Document, related_name='comments')
content = models.TextField()
class Format(models.Model):
type = models.CharField(choices=( ('Text file', 'text'),
('ePub ebook', 'epub'),
('HTML file', 'html')),
max_length=10)
def __unicode__(self):
return self.type
|
A few comments about this code:
- If you define a custom manager, Django automatically removes the default
manager. I prefer to leave the default manager and the custom
manager so other developers (or myself, when I forget) can still use
objectsand it will behave exactly as expected. However, because my newget_by_formatmanager is just a subclass of Django'smodels.Manager, all the default methods, such asall(), are available to it. Whether to include the default manager as well as your custom manager is a personal preference. - It's also possible to assign your new manager to
objectsdirectly. The only disadvantage is if you then want to override the initialQuerySetitself. Then your newobjectswill have an unexpected behavior that other developers might not anticipate. - You need to define the manager class in models.py prior to defining your model
class, or the class won't be available to Django. This is similar to the restrictions
around
ForeignKeyclass references. - I could have simply implemented
DocumentManagerwith a single method that takes an argument, likewith_format(format_name). Generally, I prefer manager methods that have verbose method names but take no arguments. - There's no technical limit to the number of custom managers you can assign to a class, but you won't likely need more than one or two.
Using the new manager methods is simple enough.
In [1]: [d.format for d in Document.get_by_format.text_format()][0] Out[1]: <Format: text> In [2]: [d.format for d in Document.get_by_format.epub_format()][0] Out[2]: <Format: epub> In [3]: [d.format for d in Document.get_by_format.html_format()][0] Out[3]: <Format: html> |
Now there is a convenient place to hang any functionality related to these queries, and you can apply additional restrictions without cluttering up the code. It's also in keeping with the spirit of Django's brand of model-view-controller (MVC) to put this kind of functionality in models.py rather than littering it in views or template tags.
Overriding the initial QuerySet that a custom manager returns
Another coding pattern that can apply to manager classes is one that may have no custom methods at all. Instead of defining a new method that returns only HTML-format documents, for example, you could define a custom manager that operates entirely on that restricted set, such as the one shown below.
Listing 8. Custom manager for HTML documents
class HTMLManager(models.Manager):
def get_query_set(self):
return super(HTMLManager, self).get_query_set().filter(format__type='html')
class Document(models.Model):
name = models.CharField(max_length=255)
format = models.ForeignKey('Format')
html = HTMLManager()
get_by_format = DocumentManager()
objects = models.Manager()
|
The get_query_set() method is inherited from
models.Manager and is overridden in this example
to take the basic query (the same one that all()
generates) and apply the additional filter to it. Any subsequent methods you
add to this manager would call your get_query_set()
method first, then apply additional query methods on that result, as shown below.
Listing 9. Using the custom format manager
# Our HTML query returns the same number of results as the manager
# which explicitly filters the result set.
In [1]: Document.html.all().count()
Out[1]: 10
In [2]: Document.get_by_format.html_format().count()
Out[2]: 10
# In fact we can prove that they return exactly the same results
In [3]: [d.id for d in Document.get_by_format.html_format()] ==
[d.id for d in Document.html.all()]
Out[3]: True
# It is not longer possible to operate on the unfiltered
# query in HTMLManager()
In [4]: Document.html.filter(format__type='epub')
Out[4]: []
|
Use this class-based method for filtering queries when you expect to have many operations on a subset of your data and want to reduce the amount of code and the complexity of the queries you need to generate.
Using class and static methods with models
There are no restrictions on the types of methods you can add to a manager. Methods
can return QuerySets, as shown above, or they can
return instances of the related model class (available as
self.model).
TThere may be cases where you want to do operations that are related to a model, but don't return instances or QuerySets. The
Django documentation states that all methods that aren't on instances of
a model class should go on the manager, but another possibility is to use
Python's class and static methods.
Here's a simple example of a utility method that's related to the
Format class but not to an individual instance:
# Return the canonical name for a format extension based on some
# common values that might be seen "in the wild"
def check_extension(extension):
if extension == 'text' or extension == 'txt' or extension == '.csv':
return 'text'
if extension.lower() == 'epub' or extension == 'zip':
return 'epub'
if 'htm' in extension:
return 'html'
raise Exception('Did not get known extension')
|
This code doesn't take or return an instance of the Format
class, so it isn't appropriate as an instance method. You could add it to a
FormatManager, but because it doesn't access the
database at all, it doesn't feel right to put it there.
One solution is to add it to the Format class and declare it
as a static method using the @staticmethod decorator,
as shown below.
Listing 10. Adding a utility function as a static method on a model class
class Format(models.Model):
type = models.CharField(choices=( ('Text file', 'text'),
('ePub ebook', 'epub'),
('HTML file', 'html')),
max_length=10)
@staticmethod
def check_extension(extension):
if extension == 'text' or extension == 'txt' or extension == '.csv':
return 'text'
if extension.lower() == 'epub' or extension == 'zip':
return 'epub'
if 'htm' in extension:
return 'html'
raise Exception('Did not get known extension')
def __unicode__(self):
return self.type
|
This method would be called as Format.check_extension(extension)
without requiring an instance of Format or the
creation of a manager.
Python also offers the @classmethod decorator, which
generates methods on classes that take the class itself as the first argument.
This might be useful in cases where you want to perform some introspection
on the class object itself without instantiating it.
Aggregation queries in Django V1.1
In V1.1, Django's ORM includes powerful query methods that provide functionality previously only available via raw SQL. For Python developers leery of SQL — and for anyone wanting to keep his or her Django applications usable across multiple database engines — this is a real boon.
In today's socially mediated applications, it's extremely common to order items not just
by rigid fields like alphabetical order or creation date but by
dynamic data. For example, in the examples application, you may want to list
documents by popularity, based on the number of comments made on each one.
Prior to Django V1.1, you could do such a thing only by writing some custom
SQL code, creating a nonportable stored procedure, or — worst of all — writing
some inefficient object-oriented queries. Another approach was to define a
dummy database field that would contain the desired value by which to count (for
example, the number of Comment rows) and update that manually by overriding
the document's save() method.
Django aggregation removes the need for all of this. You can now order documents by
the number of comments they receive with just one QuerySet
method: annotate(). Listing 11 provides an example.
Listing 11. Using aggregation to order results by number of comments
from django.db.models import Count
# Create some sample Documents
unpopular = Document.objects.create(name='Unpopular document', format=format_html)
popular = Document.objects.create(name='Popular document', format=format_html)
# Assign more comments to "popular" than to "unpopular"
for i in range(0,10):
Comment.objects.create(document=popular)
for i in range(0,5):
Comment.objects.create(document=unpopular)
# If we return results in the order they were created (id order, by default), we get
# the "unpopular" document first.
In [1]: Document.objects.all()
Out[1]: [<Document: Unpopular document>, <Document: Popular document>]
# If we instead annotate the result set with the total number of
# comments on each Document and then order by that computed value, we
# get the "popular" document first.
In [2]: Document.objects.annotate(Count('comments')).order_by('-comments__count')
Out[2]: [<Document: Popular document>, <Document: Unpopular document>]
|
The annotate() QuerySet method does not perform any
aggregation by itself. Instead, it directs Django to assign the value of the passed
expression to a pseudo-column in the result set. By default, that column name
will be the supplied field name (here, the value of Comment.document.related_name())
prepended to the name of the aggregation method. This code is calling
django.db.models.Count, which is one of the simple
mathematical functions available in the aggregation library. (For the
complete list, see Resources.)
The result of Document.objects.annotate(Count('comments'))
is a QuerySet to which a new
property — comments__count — has been
added. If you want to override that default name, you can pass the name as a
keyword argument.
Document.objects.annotate(popularity=Count('comments'))
|
Now that the intermediate QuerySet contains a count
of all the comments associated with each document, you can order by that field.
Because you want the documents with the most comments to appear first, use
descending order, as .order_by('-comments__count').
Using aggregation not only reduces the amount of code you need to write but it also guarantees that these operations will be fast, as they're relying on the database engine to do the mathematical calculations. This process is much more efficient than extracting all the relevant data through the ORM and calculating over the result set manually.
Other types of aggregation in Django V1.1
The new aggregation library doesn't just allow for returning more complex result
sets. You can also return non-QuerySet results
extracted directly out of the database. For example, to get the average
number of comments for all the documents in the database, use this code:
In [1]: from django.db.models import Avg
In [2]: Document.objects.aggregate(Avg('comments'))
Out[2]: {'comments__avg': 8.0}
|
You can apply aggregation to filtered or unfiltered queries, and columns generated
with annotate can themselves be filtered on like
normal fields. You can also apply aggregation methods across joins. For
example, you could aggregate documents based on ratings assigned to
comments, as in a Slashdot-style site. For more information about aggregation, see
Resources.
One charge leveled at object-relational mappers is that they abstract away so much of the database engine that writing efficient, scalable applications with them is impossible. For some kinds of applications — those with millions of accesses and highly interrelated models — this assertion is often true.
The vast majority of Web applications never reach such huge audiences and don't achieve that level of complexity. ORMs are designed to get projects off the ground quickly and to help developers jump into database-driven projects without requiring a deep knowledge of SQL. As your Web site gets bigger and more popular, you will certainly need to audit performance as described in the first section of this article. Eventually, you may need to start replacing ORM-driven code with raw SQL or stored procedures.
Happily, the capabilities of easy-to-use ORMs like Django's continue to evolve. Django V1.1's aggregation library is a major step forward, allowing efficient query generation while still providing a familiar object-oriented syntax. For even greater flexibility, Python developers should also look at SQLAlchemy, especially for Python Web applications that don't rely on Django.
Learn
-
Django Query API: See the complete guide to the Django Query API.
-
Learn more about what's new in Django V1.1.
-
See the complete list of all database engines Django supports. Since Django V1.0, it has been possible to define new external engines, as well.
-
Visit the Django V1.1 aggregation
library for the complete reference.
-
To listen to interesting interviews and discussions for software developers, check out developerWorks podcasts.
-
Stay current with developerWorks' Technical events and webcasts.
-
Follow developerWorks on Twitter.
-
Check out upcoming conferences, trade shows, webcasts, and other Events around the world that are of interest to IBM open source developers.
-
Visit the developerWorks Open source zone for extensive how-to information, tools, and project updates to help you develop with open source technologies and use them with IBM's products.
-
Watch and learn about IBM and open source technologies and product functions with the no-cost developerWorks On demand demos.
Get products and technologies
-
Discover a powerful alternative to Django's ORM: SQLAlchemy. This tool may be the right choice for very large applications.
-
SQLite V3: Since V2.5,
Python has come bundled with support for SQLite V3 and requires no other driver.
Earlier Python versions require that you download
pysqlite directly.
-
Innovate your next open source development project with IBM trial software, available for download or on DVD.
- Download
IBM product evaluation versions
or explore
the online trials in the IBM SOA Sandbox and get your hands on application development tools and middleware products from DB2®, Lotus®, Rational®, Tivoli®, and WebSphere®.
Discuss
-
Participate in developerWorks blogs and get involved in the developerWorks community.

Liza Daly is a software engineer who specializes in applications for the publishing industry. She has been the lead developer on major online products for Oxford University Press, O'Reilly Media, and other publishers. Currently she is an independent consultant and the founder of Threepress, an open source project developing ebook applications.
Comments (Undergoing maintenance)





