Writing clean, testable, high quality code in Python

Catastrophically bad code can be written in any language, including the elegant and powerful Python language. In this article, we explore how thinking about testing actually produces dramatically different Python code. Lastly, we learn how to measure scientifically the difference.


Noah Gift , Associate Director Engineering, AT&T Interactive

Photo of Noah Gift

Noah Gift is the co-author of Python For UNIX and Linux System Administration by O'Reilly, and is also working on Google App Engine In Action for Manning. He is an author, speaker, consultant, and community leader, writing for publications such as Red Hat Magazine, O'Reilly, and MacTech. His consulting company's website is http://www.giftcs.com, and much of his writing can be found at http://noahgift.com. You can also follow Noah on Twitter.

He has a Master's degree in CIS from Cal State Los Angeles and a B.S. in Nutritional Science from Cal Poly San Luis Obispo. He is an Apple- and LPI-certified sysadmin, and has worked at companies such as Caltech, Disney Feature Animation, Sony Imageworks, Turner Studios, and Weta Digital. In his free time, he enjoys spending time with his wife, Leah, and their son, Liam, composing for the piano, running marathons, and exercising religiously.

developerWorks Contributing author

28 September 2010

Also available in Chinese


Writing software is among the most complicated endeavors a human can undertake. Brian Kernigan, co-author of the AWK programming language and "K and R C", sumed up the true nature of software development in the book, Software Tools, when he stated, "Controlling complexity is the essence of software development." The harsh reality of real world software development is that software is often created with intentional, or unintentional, complexity and a disregard for maintainability, testability, and quality. The end result of this unfortunate reality is software that can become increasingly difficult and expensive to maintain and that fails sporadically and even spectacularly.

The first step in the process of writing high quality code is to re-examine the entire thought process of how an individual or team develops software. Often in failed, or troubled, software development projects, the software was developed in a reactionary stream of consciousness where the focus of the software development was on getting a problem solved in any manner possible. In a successful software project, the developer is thinking not only about how to solve the problem at hand, but additionally about the process involved in solving the problem.

A succesful software developer will devise a way to run the tests in an easily automated fashion, so they can continuously prove the software works. They are aware of the dangers of needless complexity. They are humble in their approach, seek critical review, and expect refactoring at every step of the way. They continuously think about how they can ensure their software is testable, readable, and maintainable. Although Python the language, and Python the community, are heavily influenced by desire to write clean, maintainable code that works, it is still quite easy to do the exact opposite. In this article, we will tackle this problem head on and explore how to write clean, testable, high quality code in Python.

A clean code hypothetical problem

The best way to demonstrate this style of development is to solve a hypothetical problem. Let's suppose you are a back-end web developer at a company that allows users to generate reviews, and you need to come up with a way to show and highlight small snippets of those reviews. One way to approach the problem would be to write a large function that takes a snippet of text, and query parameters, and returns back a character limited snippet with the query parameters highlighted. All of the logic needed to solve the problem would be included in the one "mega" function, and you would simply need to keep rerunning your script, until you got the result you wanted. The format would probably look like the code example below and would often be developed with a combination of print statements, or logging statements, and an interactive shell.

Messy code
def my_mega_function(snippet, query)
    """This takes a snippet of text, and a query parameter and returns """

    #Logic goes here, and often runs on for several hundred lines
    #There are often deeply nested conditional statements and loops
    #Function could reach several hundred, if not thousands of lines
    return result

With a dynamic language like Python, Perl, or Ruby, it is easy to develop software by simply banging away at the problem, often interactively, until you get what seems to be the correct result and calling it a day. Unfortunately, this approach, while tempting, often leads to a false sense of accomplishment that is fraught with danger. Much of the danger lies in not designing a solution to be testable, and part lies in not properly controlling the complexity of the software written.

How can you say this function even works? You can have faith that it works because it worked the last time you ran it during development, but are you sure it doesn't contain subtle errors of logic or syntax? What happens if you need to change the code? Would it still work, and how would you know it still worked? What if that code needed to be maintained by another developer, and he needed to make changes to it? How would he know his changes didn't cause something subtle to break? How hard would it be for him to understand what the code does?

The short answer is: if you don't have tests, you don't know if your software works. If you stack together enough guesses, you may eventually build something that appears to function, but that no human could ever say with certainty ever worked properly. This is a bad place to be, and I have both written this software and helped debug software written this way. Fortunately, this condition is easily avoidable. Writing tests before, such as the case of Test Driven Development, or while you write your logic actually shapes the way code is written. It leads to modular, extensible code that is easy to test, understand, and maintain. It is immediately apparent to the experienced developer when software was developed with testing in mind, and when it was not. The software itself looks dramatically different to the trained eye.

Without simply taking my word for it, or visually inspecting code, there are ways to measure scientifically the difference between these two different styles. The first way is to actually measure the lines of code that are tested. Nose is a popular extension of Python's unit test framework that includes an easy way to run automatically a batch of tests and plug-ins, such as code coverage. By measuring code coverage during development, it becomes quickly apparent that it is almost impossible to get 100 percent test coverage for code that is composed of large functions, with highly nested logic, that are built in an ad hoc manner.

The second way to measure the difference is to use static analysis tools. There are several popular Python tools that measure various metrics for Python developers, ranging from general code quality to specific metrics, like duplicate code or complexity. You can measure the cyclomatic complexity of your code with either pygenie or pymetrics (see Resources).

Here is an example of what it looks like when we run pygenie on "clean" code that is relatively simple:

Pygenie output of cyclomatic complexity
% python pygenie.py complexity --verbose highlight spy
File: /Users/ngift/Documents/src/highlight.py
Type Name                                                                   Complexity 
M    HighlightDocumentOperations._create_snippit                                  3
M    HighlightDocumentOperations._reconstruct_document_string                     3
M    HighlightDocumentOperations._doc_to_sentences                                2
M    HighlightDocumentOperations._querystring_to_dict                             2
M    HighlightDocumentOperations._word_frequency_sort                             2
M    HighlightDocumentOperations.highlight_doc                                    2
X    /Users/ngift/Documents/src/highlight.py 1          
C    HighlightDocumentOperations                                                  1
M    HighlightDocumentOperations.__init__                                         1
M    HighlightDocumentOperations._custom_highlight_tag                            1
M    HighlightDocumentOperations._score_sentences                                 1
M    HighlightDocumentOperations._multiple_string_replace                         1

What is cyclomatic complexity?

Cyclomatic complexity is a software metric, developed by Thomas J. McCabe in 1976, to determine a program's complexity. The metric measures the number of linearly independent paths, or branches, through source code. According to McCabe, it is best to keep the complexity of a method below 10. This is important because research into human memory has determined that 7 (plus or minus 2) is the magical number of items that a human can hold in short term memory.

If a developer is working on code that has 50 linearly independent paths, then they are roughly exceeding fives times the capacity of short term memory in keeping track of what is occurring in that method. Simpler methods that don't tax all of a human's short term memory are easier to work with and have been proven to be less error prone. A 2008 study by Enerjy found a strong correlation between cyclomatic complexity and faultiness. Classes that had a complexity of 11 had a probability of being fault-prone of 0.28 but rose to 0.98 with classes of a complexity of 74.

As you can tell from the example, every method is extremely simple and contains a complexity rating under 10, which is desirable according to McCabe's research. In my experiences, I have seen "mega" functions written without testing that had complexity ratings over 140 and have stretched over 1200 lines. Suffice to say, it is literally impossible to test code like this. There is actually no way to ever know it works and refactoring it is impossible. If the author of the code kept testing in mind, and wrote the same logic with 100 percent test coverage, it is highly unlikely it would have such a high complexity rating.

A clean code hypothetical solution

Let's now take a look at a complete source code example with accompanying unit tests and functional tests and see what it actually does, and why this code is considered clean. One reasonable definition of clean, using strictly metrics, is that it fulfills the following requirements: it has close to 100 percent test coverage; it has a cyclomatic complexity rating of under 10 for all classes and methods; and it scores close to a 10.0 rating with pylint. Here is an example of using nose to test unit test and doctest coverage on the highlight module:

Running nosetests with coverage reporting: 100 percent coverage
% nosetests -v --with-coverage --cover-package=highlight --with-doctest\
     --cover-erase --exe

Doctest: highlight.HighlightDocumentOperations._custom_highlight_tag ... ok
test_functional.test_snippit_algorithm ... ok
test_custom_highlight_tag (test_highlight.TestHighlight) ... ok
Consumes the generator, and then verifies the result[0] ... ok
Verifies highlighted text is what we expect ... ok
test_multi_string_replace (test_highlight.TestHighlight) ... ok
Verifies the yielded results are what is expected ... ok

Name        Stmts   Exec  Cover   Missing
highlight      71     71   100%   
Ran 7 tests in 4.223s


As you can see from the above snippet, the nosetests command was run with several options, and there was 100 percent test coverage for the highlight spy script. The only thing of real note to point out is that --cover-package=highlight is a way of telling nose to show only the coverage report on a specified module. This is very useful to isolate the output of a coverage report to the module or packages you want to observe coverage reporting on. One thing you may want to try is to download the source code from this article and to comment out some of the tests to see how the coverage reporting mechanism really works.

highlight spy
# -*- coding: utf-8 -*-

:mod:`highlight` -- Highlight Methods

.. module:: highlight
   :platform: Unix, Windows
   :synopsis: highlight document snippets that match a query.
.. moduleauthor:: Noah Gift

    1.  You will need to install the ntlk library to run this code.
    2.  You will need to download the data for the ntlk:
        See http://www.nltk.org/data::
        import nltk


import re
import logging

import nltk

LOG = logging.getLogger("highlight")

class HighlightDocumentOperations(object):

    """Highlight Operations for a Document"""
    def __init__(self, document=None, query=None):
            document (str):
            query (str):
        self._document = document
        self._query = query
    def _custom_highlight_tag(phrase,
        """Injects an open and close highlight tag after a word

            phrase (str) - A word or phrase.
        start (str) - An opening tag.  Defaults to <strong>
        end (str) - A closing tag.  Defaults to </strong>
            (str) word or phrase with custom opening and closing tags
        >>> h = HighlightDocumentOperations()
        >>> h._custom_highlight_tag("foo")
        tagged_phrase = "{0}{1}{2}".format(start, phrase, end)
        return tagged_phrase
    def _doc_to_sentences(self):
        """Takes a string document and converts it into a list of sentences
        Unfortunately, this approach might be a tad naive for production
        because some segments that are split on a period are really an
        abbreviation, and to make things even more complicated, an
        abbreviation can also be the end of a sentence::
            (generator) A generator object of a tokenized sentence tuple,
            with the list position of sentence as the first portion of
            the tuple, such as:  (0, "This was the first sentence")
        tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
        sentences = tokenizer.tokenize(self._document)
        for sentence in enumerate(sentences):
            yield sentence

    def _score_sentences(sentence, querydict):
        """Creates a scoring system for each sentence by substitution analysis
        Tokenizes each sentence, counts characters
        in sentence, and pass it back as nested tuple
            (tuple) - (score (int), (count (int), position (int),
                   raw sentence (str))
        position, sentence = sentence
        count = len(sentence)
        regex = re.compile('|'.join(map(re.escape, querydict)))
        score = len(re.findall(regex, sentence))
        processed_score = (score, (count, position, sentence))
        return processed_score
    def _querystring_to_dict(self, split_token="+"):
        """Converts query parameters into a dictionary
            (dict)- dparams, a dictionary of query parameters
        params = self._query.split(split_token)
        dparams = dict([(key, self._custom_highlight_tag(key)) for\
                    key in params])
        return dparams
    def _word_frequency_sort(sentences):
        """Sorts sentences by score frequency, yields sorted result
        This will yield the highest score count items first.
            sentences (list) - a nested tuple inside of list
            [(0, (90, 3, "The crust/dough was just way too effin' dry for me.
            Yes, I know what 'cornmeal' is, thanks."))]


        while sentences:
            yield sentences.pop()

    def _create_snippit(self, sentences, max_characters=175):
        """Creates a snippet from a sentence while keeping it under max_chars 
        Returns a sorted list with max characters.  The sort is an attempt
        to rebuild the original document structure as close as possible,
        with the new sorting by scoring and the limitation of max_chars.
            sentences (generator) - sorted object to turn into a snippit
            max_characters (int) - optional max characters of snippit
            snippit (list) - returns a sorted list with a nested tuple that
            has the first index holding the original position of the list::
            [(0, (90, 3, "The crust/dough was just way too effin' dry for me.
            Yes, I know what 'cornmeal' is, thanks."))]
        snippit = []
        total = 0
        for sentence in self._word_frequency_sort(sentences):
            LOG.debug("Creating snippit", sentence)
            score, (count, position, raw_sentence) = sentence
            total += count
            if total < max_characters:
                #position now gets converted to index 0 for sorting later
                snippit.append(((position), score, count, raw_sentence))
        #try to reassemble document by original order by doing a simple sort
        return snippit
    def _multiple_string_replace(string_to_replace, dict_patterns):
        """Performs a multiple replace in a string with dict pattern.
        Borrowed from Python Cookbook.
            string_to_replace (str) - String to be multi-replaced
            dict_patterns (dict) - A dict full of patterns
            (str) - Multiple replaced string.
        regex = re.compile('|'.join(map(re.escape, dict_patterns)))
        def one_xlat(match):
            """Closure that is called repeatedly during multi-substitution.
                match (SRE_Match object)
                partial string substitution (str)
            return dict_patterns[match.group(0)]
        return regex.sub(one_xlat, string_to_replace)
    def _reconstruct_document_string(self, snippit, querydict):
        """Reconstructs string snippit, build tags, and return string
        A helper function for highlight_doc.
            string_to_replace (list) - A list of nested tuples, containing
            this pattern::
            [(0, (90, 3, "The crust/dough was just way too effin' dry for me.
            Yes, I know what 'cornmeal' is, thanks."))]
            dict_patterns (dict) - A dict full of patterns
            (str) The most relevant snippet with the query terms highlighted.
        snip = []
        for entry in snippit:
            score = entry[1]
            sent = entry[3]
            #if we have matches, now do the multi-replace
            if score:
                sent = self._multiple_string_replace(sent,
        highlighted_snip = " ".join(snip)
        return highlighted_snip
    def highlight_doc(self):
        """Finds the most relevant snippit with the query terms highlighted
            (str) The most relevant snippet with the query terms highlighted.
        #tokenize to sentences, and convert query to a dict
        sentences = self._doc_to_sentences()
        querydict = self._querystring_to_dict()
        #process and score sentences
         scored_sentences = []
        for sentence in sentences:
            scored = self._score_sentences(sentence, querydict)
        #fit into max characters, and sort by original position
        snippit = self._create_snippit(scored_sentences)
        #assemble back into string
        highlighted_snip = self._reconstruct_document_string(snippit,

        return highlighted_snip
# -*- coding: utf-8 -*-
Tests this query searches a document, highlights a snippit and returns it

Contains both unit and functional tests.


import unittest
from highlight import HighlightDocumentOperations

class TestHighlight(unittest.TestCase):
    def setUp(self):
        self.document = """
Review for their take-out only.
Tried their large Classic (sausage, mushroom, peppers and onions) deep dish;\
and their large Pesto Chicken thin crust pizzas.
Pizza = I've had better.  The crust/dough was just way too effin' dry for me.\
Yes, I know what 'cornmeal' is, thanks.  But it's way too dry.\
I'm not talking about the bottom of the pizza...I'm talking about the dough \
that's in between the sauce and bottom of the pie...it was like cardboard, sorry!
Wings = spicy and good.   Bleu cheese dressing only...hmmm, but no alternative\
of ranch dressing, at all.  Service = friendly enough at the counters.  
Decor = freakin' dark.  I'm not sure how people can see their food.  
Parking = a real pain.  Good luck.        
        self.query = "deep+dish+pizza"
        self.hdo = HighlightDocumentOperations(self.document, self.query)
    def test_custom_highlight_tag(self):
        actual = self.hdo._custom_highlight_tag("foo",
        expected = "[BAR]foo[ENDBAR]"
    def test_query_string_to_dict(self):
        """Verifies the yielded results are what is expected"""
        result = self.hdo._querystring_to_dict()
        expected = {"deep": "deep",
                    "dish": "dish",
    def test_multi_string_replace(self):
        query = """pizza = I've had better"""
        expected = """pizza = I've had better"""
        query_dict = self.hdo._querystring_to_dict()
        result = self.hdo._multiple_string_replace(query, query_dict)
        self.assertEqual(expected, result)
    def test_doc_to_sentences(self):
        """Consumes the generator, and then verifies the result[0]"""
        results = []
        expected = (0,'\nReview for their take-out only.')
        for sentence in self.hdo._doc_to_sentences():
        self.assertEqual(results[0], expected)
    def test_highlight(self):
        """Verifies highlighted text is what we expect"""
        expected = """Tried their large Classic (sausage, mushroom, peppers and onions)\
deepdish;and their large Pesto Chicken thin crust \
        actual = self.hdo.highlight_doc()
        self.assertEqual(expected, actual)
    def tearDown(self):
        del self.query
        del self.hdo
        del self.document

if __name__ == '__main__':
"""Functional Test That Performs Some Basic Sanity Checks"""

from highlight import HighlightDocumentOperations

def test_snippit_algorithm():
    document1 = """
        This place has awesome deep dish pizza.
        I have been getting delivery through Waiters on wheels for years.
        It is classic, deep dish  Chicago style pizza.
        Now I found out they also have half-baked to pick-up and cook at home.
        This is a great benefit. I am having it tonight. Yum.
    document2 = """Review for their take-out only.
Tried their large Classic (sausage, mushroom, peppers and onions) deep dish;\
and their large Pesto Chicken thin crust pizzas.
Pizza = I've had better.  The crust/dough was just way too effin' dry for me.\
Yes, I know what 'cornmeal' is, thanks.  But it's way too dry.\
I'm not talking about the bottom of the pizza...I'm talking about the dough \
that's in between the sauce and bottom of the pie...it was like cardboard, sorry!
Wings = spicy and good.   Bleu cheese dressing only...hmmm, but no alternative\
of ranch dressing, at all.  Service = friendly enough at the counters.  
Decor = freakin' dark.  I'm not sure how people can see their food.  
Parking = a real pain.  Good luck."""
    h1 = HighlightDocumentOperations(document1, "deep+dish+pizza")
    actual = h1.highlight_doc()
    print "Raw Document1: %s" % document1
    print " Formatted Document1: %s" % actual
    assert  len(actual) < 500
    assert "<strong>" in actual

    h2 = HighlightDocumentOperations(document2, "deep+dish+pizza")
    actual = h2.highlight_doc()
    print "Raw Document2: %s" % document2
    print " Formatted Document2: %s" % actual
    assert  len(actual) < 500
    assert "<strong>" in actual

if __name__ == "__main__":

Concerning the above code sample, if you would like to run it, you will need to download the Natural Language Toolkit source and download the nltk data according to the instructions. Since this article is not about the code sample shown but about how it was created, and how to test it, I won't go into any detail explaining what the code actually does. Instead, let's finish up by running the static code analysis tool pylint on our source code:

% pylint highlight spy 
No config file found, using default configuration
************* Module highlight
E: 89:HighlightDocumentOperations._doc_to_sentences: Instance of 'unicode' has no 
    'tokenize' member (but some types could not be inferred)
E: 89:HighlightDocumentOperations._doc_to_sentences: Instance of 'ContextFreeGrammar' 
    has no 'tokenize' member (but some types could not be inferred)
W:108:HighlightDocumentOperations._score_sentences: Used builtin function 'map'
W:192:HighlightDocumentOperations._multiple_string_replace: Used builtin function 'map'
R: 34:HighlightDocumentOperations: Too few public methods (1/2)

69 statements analysed.

Global evaluation
Your code has been rated at 8.12/10 (previous run: 8.12/10)

The code scored an 8.12 out of 10 and was nicked down for a few items. Pylint is configurable, so it is very likely that you may need to configure it to meet your needs on your project. You can refer to the official pylint document (see Resources). For this specific example, there are two errors on line 89 that can be attributed to the external library nltk, and there are two warnings that could be changed by a configuration change to pylint. In general, you will never want to allow pylint errors in your source code, but there are some times, such as in the example above, that you may need to make an executive decision. It isn't a perfect tool, but I have found it to be very useful in the real world.


In this article, we explored how merely thinking about testing influences the structure of software, and how a lack of thought toward testing can prove fatally harmful to a project. We showed a complete code example, that included both functional and unit tests, and ran it against both code coverage analysis with nose and two static analysis tools, pylint, and pygenie. One thing we didn't have time to cover was how to automate this with some form of continuous integration testing. Fortunately, this is quite simple with the open source Java™ Continuous Integration System, Hudson. I would encourage you to consult the Hudson documentation (see Resources) and experiment with setting up an automated tests for your project that runs all of your tests, including static code analysis.

Finally, testing isn't a panacea, nor are static analysis tools. Software development is hard work. To get the chance even to be successful, we have to always be mindful of the real goal. It is not only to solve a problem, but also to create something we can prove works. If you agree with this premise, then this means that overly complex code, arrogance in design, and lack of respect for the power of Python, directly interfere with this goal.

Thanks to Kennedy Behrman, of Imagemovers Digital, for the technical review of this article.


Zip fileclean_code_sample.zip5.4KB





developerWorks: Sign in

Required fields are indicated with an asterisk (*).

Need an IBM ID?
Forgot your IBM ID?

Forgot your password?
Change your password

By clicking Submit, you agree to the developerWorks terms of use.


The first time you sign into developerWorks, a profile is created for you. Information in your profile (your name, country/region, and company name) is displayed to the public and will accompany any content you post, unless you opt to hide your company name. You may update your IBM account at any time.

All information submitted is secure.

Choose your display name

The first time you sign in to developerWorks, a profile is created for you, so you need to choose a display name. Your display name accompanies the content you post on developerWorks.

Please choose a display name between 3-31 characters. Your display name must be unique in the developerWorks community and should not be your email address for privacy reasons.

Required fields are indicated with an asterisk (*).

(Must be between 3 – 31 characters.)

By clicking Submit, you agree to the developerWorks terms of use.


All information submitted is secure.

Dig deeper into AIX and Unix on developerWorks

Zone=AIX and UNIX
ArticleTitle=Writing clean, testable, high quality code in Python