Introducing PyPy

An emerging implementation that combines science with practicality

Improve the performance of your Python programs and add flexibility with PyPy and its just-in-time compiler. Learn about PyPy, its benefits, and how it can accelerate development of high-performance applications.

Uche Ogbuji, Partner, Zepheira, LLC

Uche Ogbuji is partner at Zepheira, where he oversees creation of sophisticated web catalogs and other richly contextual databases. He has a long history of pioneering work in advanced web technologies such as XML, semantic web, and web services, and in open source projects such as Akara, an open source platform for web data applications. He is a computer engineer and writer born in Nigeria, living and working near Boulder, Colorado, USA. You can find more about Mr. Ogbuji at his weblog, Copia.



14 February 2012

Overview

The Python programming language, which reached version 1.0 in 1994, has really taken hold since the turn of the new millennium. One measure of a language's success is the number of implementations. The best-known and most used implementation of Python is called CPython. There have also been successful projects such as Jython (Python language working on the Java™ runtime) and IronPython (Python language working on the .NET platform). These have all been open source, and Python has always had a large presence in the open source software world.

A long-standing goal for Python implementation is to support pure language design: to "bootstrap" the definition of Python by specifying the language in its own terms, rather than in terms of other languages such as C and Java. The PyPy project is a Python implementation serving this need. PyPy means "Python implemented in Python," though it's actually implemented in a subset of Python called RPython. More precisely, PyPy is a runtime of its own into which you can plug in any language.

The clean language design PyPy allows makes it feasible to build in low-level optimizers with enormous benefit to performance. In particular, PyPy integrates a just-in-time (JIT) compiler. This is the same kind of technology that famously revolutionized Java performance in the form of HotSpot, which Sun Microsystems acquired from Animorphic in the late 1990s and incorporated into its Java implementation, making the language practical for most uses. Python is already practical for many uses, but performance is the most frequent complaint. PyPy's tracing JIT compiler is already showing how it might revolutionize the performance of Python programs. Even though the project is in what I would characterize as a late beta phase, it is already an essential tool for the Python programmer, and a very useful addition to any developer's toolbox.

In this article, I introduce PyPy without presuming you have an extensive background in Python.

Getting started

First, don't confuse PyPy with PyPI. These are very different projects. The latter is the Python Package Index, a site and system for obtaining third-party Python packages to supplement the standard library. Once you arrive at the correct PyPy site (see Resources) you'll find that the developers have made things easy to try out for most users. If you have Linux®, Mac, or Windows® (except for Windows 64, which isn't yet supported) on recent hardware, you should be able to just download and execute one of the binary packages.

The current version of PyPy is 1.8, which fully implements Python 2.7.2, meaning it should be compatible in language features and behavior with that CPython version. However, it is already much faster than CPython 2.7.2 in many benchmarked uses, which is what really attracts our interest. The following session shows how I installed PyPy on my Ubuntu 11.04 box. It was captured from an earlier release of PyPy, but PyPy 1.8 gives similar results.

$ cd Downloads/
$ wget https://bitbucket.org/pypy/pypy/downloads/pypy-1.6-linux.tar.bz2
$ cd ../.local
$ tar jxvf ~/Downloads/pypy-1.6-linux.tar.bz2
$ ln -s ~/.local/pypy-1.6/bin/pypy ~/.local/bin/

Now you need to update $PATH to include ~/.local/bin/. After installing PyPy, I recommend installing Distribute and Pip as well, to make it easy to install additional packages. (Although I don't cover it in this article, you might also want to use Virtualenv, which is a way to keep separate, clean Python environments.) The following session demonstrates the Distribute and Pip set-up.

$ wget http://python-distribute.org/distribute_setup.py
$ wget https://raw.github.com/pypa/pip/master/contrib/get-pip.py
$ pypy distribute_setup.py
$ pypy get-pip.py

You should find library files installed in ~/.local/pypy-1.8/site-packages/, and executables in ~/.local/pypy-1.8/bin (substitute the version you actually installed, such as pypy-1.6 in the earlier session), so you might want to add the latter to your $PATH. Also, make sure you are using the pip that was just installed, rather than the system-wide pip. After that, you can install the third-party packages used later in this article.

$ pip install html5lib
$ pip install pyparsing
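
If you are unsure whether the packages landed in PyPy's site-packages or in the system Python's, a quick check from inside the interpreter can help. The following is only a minimal sketch: the function name report_environment is made up for illustration, and it relies on nothing beyond the standard library modules importlib, platform, and sys. Run it with the pypy executable you just installed.

```python
import importlib
import platform
import sys


def report_environment(module_names):
    """Report which interpreter is running and which modules fail to import.

    report_environment is a hypothetical helper, not part of PyPy or pip.
    """
    missing = []
    for name in module_names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return {
        # platform.python_implementation() gives 'PyPy' under PyPy
        # and 'CPython' under the reference implementation
        'implementation': platform.python_implementation(),
        'executable': sys.executable,
        'missing': missing,
    }


if __name__ == '__main__':
    info = report_environment(['html5lib', 'pyparsing'])
    print('Running on %s at %s' % (info['implementation'], info['executable']))
    print('Missing packages: %s' % (', '.join(info['missing']) or 'none'))
```

If the implementation is not reported as PyPy, or either package shows up as missing, the likely cause is that a different pip or a different interpreter is first on your $PATH.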

Listing 1 shows output from the PyPy interpreter after invocation of the Python "easter egg" import this.

Listing 1. Sample PyPy output
uche@malatesta:~$ pypy
Python 2.7.1 (d8ac7d23d3ec, Aug 17 2011, 11:51:18)
[PyPy 1.6.0 with GCC 4.4.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
And now for something completely different: ``__xxx__ and __rxxx__ vs operation
slots: particle quantum superposition kind of fun''
>>>> import this
The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!
>>>>
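
Because speed is the main attraction, you may want a rough feel for the difference on your own machine. The sketch below is an assumption-laden toy, not a proper benchmark: the names sum_of_squares and time_call are invented for this example, the loop size is arbitrary, and a single timed call says little about real programs. It simply runs a loop-heavy function, the kind of code a tracing JIT is designed to speed up, and reports which implementation ran it. Try it once with python and once with pypy.

```python
import platform
import time


def sum_of_squares(n):
    """A deliberately loop-heavy function (hypothetical example workload)."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def time_call(func, arg):
    """Return (result, elapsed_seconds) for one call of func(arg)."""
    start = time.time()
    result = func(arg)
    return result, time.time() - start


if __name__ == '__main__':
    # The loop size is arbitrary; raise it if the run finishes too quickly
    result, elapsed = time_call(sum_of_squares, 1000000)
    print('%s computed %d in %.3f seconds'
          % (platform.python_implementation(), result, elapsed))
```

Treat the numbers as a hint only. A JIT needs some warm-up time, so very short runs can understate what PyPy does for longer-lived programs.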

As a simple illustration of PyPy in action, I present a program to parse a web page and print a list of links expressed on the page. This is the basic idea behind spidering software, which follows the web of links from page to page for some purpose.

For the parsing I chose html5lib, which is a pure-Python parsing library designed to implement the parsing algorithms of WHATWG, the group that is defining the HTML5 specification. HTML5 is designed to be backwards compatible, even with badly broken web pages. Thus html5lib doubles as a good, general-purpose HTML parsing toolkit. It has also been benchmarked on CPython and PyPy, and is significantly faster on the latter.

Listing 2 parses a specified web page and prints the links from that page line by line. You specify the target page URL on the command line, for example: pypy listing2.py http://www.ibm.com/developerworks/opensource/.

Listing 2. Listing the links on a page
#!/usr/bin/env pypy

#Import the needed libraries for use
import sys
import urllib2

import html5lib

#List of tuples, each an element/attribute pair to check for links
link_attrs = [
    ('a', 'href'),
    ('link', 'href'),
]

#This function is a generator, a Python construct that can be used as a sequence.
def list_links(url):
    '''
    Given a URL parse the HTML and yield a sequence of link strings
    as they are found on the page.
    '''
    #Open the URL and get back a stream of the content
    stream = urllib2.urlopen(url)
    #Parse the HTML content according to html5lib conventions
    tree_builder = html5lib.treebuilders.getTreeBuilder('dom')
    parser = html5lib.html5parser.HTMLParser(tree=tree_builder)
    doc = parser.parse(stream)

    #In the outer loop, go over each element/attribute set
    for elemname, attr in link_attrs:
        #In the inner loop, go over the matches of the current element name
        for elem in doc.getElementsByTagName(elemname):
            #If the corresponding attribute is found, yield it in sequence
            attrvalue = elem.getAttribute(attr)
            if attrvalue:
                yield attrvalue

    return

#Read the URL to parse from the first command line argument
#Note: Python lists start at index 0, but as in UNIX convention the 0th
#Command line argument is the program name itself
input_url = sys.argv[1]

#Set up the generator by calling it with the URL argument, then iterate
#Over the yielded link strings, printing each
for link in list_links(input_url):
    print link

I commented the code quite liberally, and I don't expect the reader to have deep knowledge of Python, though you should know the basics, such as how indentation is used to express flow of control. Please see Resources for relevant Python tutorials.

For simplicity, I avoided some conventions for such programs, but I did use one advanced feature that I think is very useful, even for the beginning programmer. The function list_links is called a generator. It is a function that acts like a sequence in that it computes and offers up the items one by one. The yield statements are key here, providing the sequence of values.
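
If generators are new to you, a smaller example may make the behavior easier to see. This is a minimal sketch unrelated to HTML parsing, and the function name countdown is invented for the illustration. It shows that calling a generator function runs none of its body, and that each value is computed only when the caller asks for it.

```python
def countdown(start):
    """Yield start, start - 1, ..., 1, computing each value only on request."""
    current = start
    while current > 0:
        yield current
        current -= 1


if __name__ == '__main__':
    gen = countdown(3)   # Nothing in the function body has run yet
    print(next(gen))     # Runs up to the first yield: prints 3
    for value in gen:    # Resumes after that yield: prints 2, then 1
        print(value)
```

The for loop in Listing 2 drives list_links in the same way, pulling one link at a time until the function body finishes.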


Even more complex screen scraping

Most web page parsing tasks are more complex than just finding and displaying links, and there are several libraries that can help with typical "web-scraping" tasks. Pyparsing is a general-purpose, pure-Python parsing toolkit that includes some facilities to support HTML parsing.

For the next example I'll demonstrate how to scrape a list of articles from an IBM developerWorks index page. See Figure 1 for a screenshot of the target page. Listing 3 is a sample record in the HTML.

Figure 1. IBM developerWorks web page to be processed
[Screenshot: a search page from the developerWorks Open Source technical library, with a list of articles and their abstracts]

Listing 3. Sample record to be processed from HTML
<tbody>
 <tr>
  <td>
   <a href="http://www.ibm.com/developerworks/opensource/library/os-wc3jam/index.html">
   <strong>Join the social business revolution</strong></a>
   <div>
    Social media has become social business and everyone from
    business leadership to software developers need to understand
    the tools and techniques that will be required.
    The World Wide Web Consortium (W3C) will be conducting a
    social media event to discuss relevant standards and requirement
    for the near and far future.
   </div>
  </td>
  <td>Articles</td>
  <td class="dw-nowrap">03 Nov 2011</td>
 </tr>
</tbody>

Listing 4 is code to parse this page. Again, I try to comment it liberally, but there are a few key new concepts I'll be discussing after the listing.

Listing 4. Extracting a listing of articles from a web page
#!/usr/bin/env pypy

#Import the needed built-in libraries for use
import sys
import urllib2
from greenlet import greenlet

#Import what we need from pyparsing
from pyparsing import makeHTMLTags, SkipTo

def collapse_space(s):
    '''
    Strip leading and trailing space from a string, and replace any run of whitespace
    within with a single space
    '''
    #Split the string according to whitespace and then join back with single spaces
    #Then strip leading and trailing spaces. These are all standard Python library tools
    return ' '.join(s.split()).strip()

def handler():
    '''
    Simple coroutine to print the result of a matched portion from the page
    '''
    #This will be run the first time the code switches to this greenlet function
    print 'A list of recent IBM developerWorks Open Source Zone articles:'
    #Then we get into the main loop
    while True:
        #Switch back to the parent; the value it later passes to switch() is returned here
        data = green_handler.parent.switch()
        print ' *', collapse_space(data.title), '(', data.date, ')', data.link.href

#Turn a regular function into a greenlet by wrapping it
green_handler = greenlet(handler)

#Switch to the handler greenlet the first time to prime it
green_handler.switch()

#Read the search starting page
START_URL = "http://www.ibm.com/developerworks/opensource/library/"
stream = urllib2.urlopen(START_URL)
html = stream.read()
stream.close()

#Set up some tokens for HTML start and end tags
div_start, div_end = makeHTMLTags("div")
tbody_start, tbody_end = makeHTMLTags("tbody")
strong_start, strong_end = makeHTMLTags("strong")
article_tr, tr_end = makeHTMLTags("tr")
td_start, td_end = makeHTMLTags("td")
a_start, a_end = makeHTMLTags("a")

#Put together enough tokens to narrow down the data desired from the page
article_row = ( div_start + SkipTo(tbody_start)
            + SkipTo(a_start) + a_start('link')
            + SkipTo(strong_start) + strong_start + SkipTo(strong_end)("title")
            + SkipTo(div_start) + div_start + SkipTo(div_end)("summary") + div_end
            + SkipTo(td_start) + td_start + SkipTo(td_end)("type") + td_end
            + SkipTo(td_start) + td_start + SkipTo(td_end)("date") + td_end
            + SkipTo(tbody_end)
          )

#Run the parser over the page. scanString is a generator of matched snippets
for data, startloc, endloc in article_row.scanString(html):
    #For each match, hand it over to the greenlet for processing
    green_handler.switch(data)

I set up Listing 4 deliberately to introduce PyPy's Stackless Python features. In short, Stackless Python is a long-standing, alternative implementation of Python that experiments with advanced flow-of-control features. Most Stackless features have not made it into other Python implementations because of limitations in their runtimes, limitations that are relaxed in PyPy. Greenlets are one example. Greenlets are like very lightweight threads that are multitasked cooperatively, by explicit calls to switch the context from one greenlet to another. Greenlets allow you to do some of the neat things that generators allow, and much more.

In Listing 4 I use greenlets to define a co-routine, a function whose operation is neatly interleaved with another in a way that makes the flow easy to construct and follow. You would often use greenlets in situations where more mainstream programming uses callbacks, such as event-driven systems. Rather than invoking a callback, you switch context to a co-routine. The key benefit of such facilities is that they allow you to structure programs for high efficiency without difficult state management.
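
To see the hand-off pattern in isolation, without pyparsing or a network connection, here is a rough sketch that swaps in a different technique: a generator-based co-routine driven by send(), which is available in any Python 2.5 or later, rather than a greenlet. Generators are more limited than greenlets (they can only switch back to their immediate caller), so treat this as an analogy for the flow in Listing 4, not as a replacement for it. The (title, date) record shape is an assumption made for the example, and the sample titles are taken from the output of Listing 4.

```python
def collapse_space(s):
    """Replace each run of whitespace with one space and trim the ends."""
    return ' '.join(s.split())


def handler(output):
    """Co-routine that receives (title, date) records and formats them.

    The record shape is an assumption for this sketch, not Listing 4's data.
    """
    # This runs when the co-routine is primed with next()
    output.append('Articles received:')
    while True:
        # Pause here until the caller passes in a record with send()
        title, date = yield
        output.append(' * %s ( %s )' % (collapse_space(title), date))


if __name__ == '__main__':
    lines = []
    co = handler(lines)
    next(co)  # Prime the co-routine, much like the first switch() in Listing 4
    records = [
        ('Join the social\n   business revolution', '03 Nov 2011'),
        ('Spark, an alternative for fast data analytics', '01 Nov 2011'),
    ]
    for record in records:
        co.send(record)  # Hand each record over, like green_handler.switch(data)
    print('\n'.join(lines))
```

As in Listing 4, the producer loop and the handler each keep their own local state between hand-offs, which is the state management that callbacks would otherwise force you to do by hand.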

Listing 4 provides only a taste of co-routines and greenlets, but they are useful concepts to ease into in the context of PyPy, which comes bundled with greenlets and other Stackless features. In Listing 4, every time pyparsing matches a record, the code switches to the greenlet to process that record.

The following is a sample of the output from Listing 4.

A list of recent IBM developerWorks Open Source Zone articles:
 * Join the social business revolution ( 03 Nov 2011 )
 http://www.ibm.com/developerworks/opensource/library/os-wc3jam/index.html
 * Spark, an alternative for fast data analytics ( 01 Nov 2011 )
 http://www.ibm.com/developerworks/opensource/library/os-spark/index.html
 * Automate development and management of cloud virtual machines ( 29 Oct 2011 )
 http://www.ibm.com/developerworks/cloud/library/cl-automatecloud/index.html

Whys and why nots

I took the approach in Listing 4 deliberately, but I'll start with one warning about it: It is always dangerous to try processing HTML without very specialized tools. The situation is not as bad as with XML, where not using a conforming parser is an anti-pattern, but HTML is very complex and tricky, even in pages that conform to the standard, which most don't. If you need general-purpose HTML parsing, html5lib is a better option. That said, web scraping is usually a specialized operation where you are just extracting information according to the specific situation. For such limited use, pyparsing is fine, and provides some neat facilities to help.

The reasons I introduced greenlets, which are not strictly necessary in Listing 4, become more apparent as you expand such code to real-world scenarios. In situations where you are multiplexing the parsing and processing with other operations, the greenlets approach makes it possible to structure processing much like UNIX command-line pipes. In cases where the problem is complicated by working with multiple source pages there is one more problem: urllib2 operations are not asynchronous, so the whole program blocks any time it accesses a web page. Addressing this problem is beyond the scope of this article, but the use of advanced flow of control in Listing 4 should get you into the habit of thinking carefully about how to stitch together such sophisticated applications with an eye to performance.
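
The pipe-like structure mentioned above can be sketched with ordinary chained generators, again as a simpler stand-in for greenlets rather than the greenlet approach itself. Every name here (read_records, only_kind, format_lines) is hypothetical, and the records are hand-written sample data in an assumed (title, kind, date) shape. The second record's "Tutorials" kind is invented purely to give the filter something to drop.

```python
def read_records(rows):
    """Source stage: a stand-in for a parser that yields one record at a time."""
    for row in rows:
        yield row


def only_kind(records, kind):
    """Filter stage: pass along only the records of the requested kind."""
    for title, rec_kind, date in records:
        if rec_kind == kind:
            yield title, rec_kind, date


def format_lines(records):
    """Sink stage: turn each record into a printable line."""
    for title, rec_kind, date in records:
        yield ' * %s ( %s )' % (title, date)


if __name__ == '__main__':
    # Hand-written sample data; the "Tutorials" entry is made up
    rows = [
        ('Join the social business revolution', 'Articles', '03 Nov 2011'),
        ('A made-up tutorial entry', 'Tutorials', '02 Nov 2011'),
    ]
    # Compose the stages, roughly like: read | filter | format
    for line in format_lines(only_kind(read_records(rows), 'Articles')):
        print(line)
```

Each stage pulls one record at a time from the stage before it, so no stage has to hold the whole page's worth of data. Greenlets allow the same kind of composition in cases where control needs to move in more directions than a simple pull chain permits.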


Wrap up

PyPy is an actively maintained project, and certainly a moving target, but there is already much that can be done with it, and the high level of CPython compatibility means that you probably have an established backup platform for your work if you begin to experiment. In this article, you learned enough to get started, and you had a taste of PyPy's fascinating Stackless features. I think you'll be pleasantly surprised by PyPy's performance, and more importantly, it opens up new ways of thinking about programming for elegance without sacrificing speed.

Resources

Learn

Get products and technologies

Discuss

  • Get involved in the developerWorks community. Connect with other developerWorks users while exploring the developer-driven blogs, forums, groups, and wikis.
