The Python programming language, which first appeared in the early 1990s, has really taken hold since the turn of the millennium. One measure of a language's success is the number of implementations. The best-known and most used implementation of Python is called CPython. There have also been successful projects such as Jython (the Python language running on the Java™ runtime) and IronPython (the Python language running on the .NET platform). These have all been open source, and Python has always had a large presence in the open source software world.
A long-standing goal for Python implementations is to support pure language design: to "bootstrap" the definition of Python by specifying the language in its own terms, rather than in terms of other languages such as C and Java. The PyPy project is a Python implementation serving this need. PyPy means "Python implemented in Python," although it is actually implemented in a restricted subset of Python called RPython. More precisely, PyPy is a runtime of its own into which you can plug any language.
The clean design PyPy allows makes it feasible to build in low-level optimizations with enormous benefit. In particular, PyPy integrates a just-in-time (JIT) compiler. This is the same technology that famously revolutionized Java performance in the form of HotSpot, which Sun Microsystems acquired with Animorphic in the late 1990s and incorporated into its Java implementation, making the language practical for most uses. Python is already practical for many uses, but performance is the most frequent complaint. PyPy's tracing JIT compiler is already showing how it might revolutionize the performance of Python programs. Even though the project is in what I would characterize as a late beta phase, it is already an essential tool for the Python programmer, and a very useful addition to any developer's toolbox.
In this article, I introduce PyPy without presuming you have an extensive background in Python.
First, don't confuse PyPy with PyPI. These are very different projects. The latter is the Python Package Index, a site and system for obtaining third-party Python packages to supplement the standard library. Once you arrive at the correct PyPy site (see Resources) you'll find that the developers have made things easy to try out for most users. If you have Linux®, Mac, or Windows® (except for Windows 64, which isn't yet supported) on recent hardware, you should be able to just download and execute one of the binary packages.
The current version of PyPy is 1.8, which fully implements Python 2.7.2, meaning it should be compatible in language features and behavior with that CPython version. However, it is already much faster than CPython 2.7.2 in many benchmarked uses, which is what really attracts our interest. The following session shows how I installed PyPy on my Ubuntu 11.04 box. It was captured from an earlier release of PyPy, but PyPy 1.8 gives similar results.
$ cd Downloads/
$ wget https://bitbucket.org/pypy/pypy/downloads/pypy-1.6-linux.tar.bz2
$ cd ../.local
$ tar jxvf ~/Downloads/pypy-1.6-linux.tar.bz2
$ ln -s ~/.local/pypy-1.6/bin/pypy ~/.local/bin/
Now you need to update
$PATH to include
~/.local/bin/. After installing PyPy, I recommend installing Distribute
and Pip as well, to make it easy to install additional packages. (Although
I don't cover it in this article, you might also want to use Virtualenv,
which is a way to keep separate, clean Python environments.) The following
session demonstrates the Distribute and Pip set-up.
$ wget http://python-distribute.org/distribute_setup.py
$ wget https://raw.github.com/pypa/pip/master/contrib/get-pip.py
$ pypy distribute_setup.py
$ pypy get-pip.py
You should find library files installed in
~/.local/pypy-1.8/site-packages/, and executables in
~/.local/pypy-1.8/bin, so you might want to add the latter to your
$PATH. Also, make sure you are using the pip
that was just installed, rather than the system-wide pip. You can then
install the third-party packages used later in this article.
$ pip install html5lib
$ pip install pyparsing
Listing 1 shows output from the PyPy interpreter
after invoking the Python "easter egg."
Listing 1. Sample PyPy output
uche@malatesta:~$ pypy
Python 2.7.1 (d8ac7d23d3ec, Aug 17 2011, 11:51:18)
[PyPy 1.6.0 with GCC 4.4.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
And now for something completely different: ``__xxx__ and __rxxx__ vs operation
slots: particle quantum superposition kind of fun''
>>>> import this
The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!
>>>>
As a simple illustration of PyPy in action, I present a program to parse a web page and print a list of links expressed on the page. This is the basic idea behind spidering software, which follows the web of links from page to page for some purpose.
For the parsing I chose html5lib, a pure-Python parsing library designed to implement the parsing algorithm of the WHATWG group that is defining the HTML5 specification. HTML5 is designed to be backward compatible, even with badly broken web pages, so html5lib doubles as a good, general-purpose HTML parsing toolkit. It has also been benchmarked on CPython and PyPy, and is significantly faster on the latter.
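As a quick illustration of that error recovery, and assuming html5lib has been installed with pip as shown earlier, the following small sketch (my own, not from the listings) parses a snippet with unclosed tags:

```python
import html5lib

#Parse a deliberately sloppy snippet (note the unclosed <p> tags); html5lib
#applies the HTML5 error-recovery rules rather than rejecting the input
doc = html5lib.parse('<p>one <a href="/x">link</a><p>two', treebuilder="dom")
links = doc.getElementsByTagName("a")
assert len(links) == 1
assert links[0].getAttribute("href") == "/x"
```

The `treebuilder="dom"` argument asks for a standard DOM document, the same tree flavor the upcoming listing uses.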
Listing 2 parses a specified web page and prints
the links from that page line by line. You specify the target page URL on
the command line, for example:
pypy listing2.py http://www.ibm.com/developerworks/opensource/.
Listing 2. Listing the links on a page
#!/usr/bin/env pypy
#Import the needed libraries for use
import sys
import urllib2
import html5lib

#List of tuples, each an element/attribute pair to check for links
link_attrs = [
    ('a', 'href'),
    ('link', 'href'),
]

#This function is a generator, a Python construct that can be used as a sequence.
def list_links(url):
    '''
    Given a URL parse the HTML and yield a sequence of link strings
    as they are found on the page.
    '''
    #Open the URL and get back a stream of the content
    stream = urllib2.urlopen(url)
    #Parse the HTML content according to html5lib conventions
    tree_builder = html5lib.treebuilders.getTreeBuilder('dom')
    parser = html5lib.html5parser.HTMLParser(tree=tree_builder)
    doc = parser.parse(stream)

    #In the outer loop, go over each element/attribute pair
    for elemname, attr in link_attrs:
        #In the inner loop, go over the matches of the current element name
        for elem in doc.getElementsByTagName(elemname):
            #If the corresponding attribute is found, yield it in sequence
            attrvalue = elem.getAttribute(attr)
            if attrvalue:
                yield attrvalue
    return

#Read the URL to parse from the first command line argument
#Note: Python lists start at index 0, but as in UNIX convention the 0th
#command line argument is the program name itself
input_url = sys.argv[1]

#Set up the generator by calling it with the URL argument, then iterate
#over the yielded link strings, printing each
for link in list_links(input_url):
    print link
I commented the code quite liberally, and I don't expect the reader to have deep knowledge of Python, though you should know the basics, such as how indentation is used to express flow of control. Please see Resources for relevant Python tutorials.
For simplicity, I avoided some conventions for such programs, but I did use
one advanced feature that I think is very useful, even for the beginning
programmer. The function
list_links is called a
generator. It is a function that acts like a sequence in that it computes
and offers up the items one by one. The
yield statements are key here, providing the sequence of values.
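A generator in miniature may make the idea concrete; this toy countdown function (my own illustration, not part of Listing 2) behaves like a sequence you can loop over:

```python
def countdown(n):
    '''Yield n, n-1, ..., 1, computing each value only when asked for it'''
    while n > 0:
        yield n   #each yield hands one value back to the consuming loop
        n -= 1

#The generator can be consumed anywhere a sequence is expected
assert list(countdown(3)) == [3, 2, 1]
assert list(countdown(0)) == []
```

Nothing is computed until the caller asks for the next item, which is why `list_links` can yield links as it finds them rather than building the whole list first.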
Even more complex screen scraping
Most web page parsing tasks are more complex than just finding and displaying links, and there are several libraries that can help with typical "web-scraping" tasks. Pyparsing is a general-purpose, pure-Python parsing toolkit that includes some facilities to support HTML parsing.
For the next example, I'll demonstrate how to scrape a list of articles from an IBM developerWorks index page. See Figure 1 for a screenshot of the target page. Listing 3 shows a sample record from the HTML.
Figure 1. IBM developerWorks web page to be processed
Listing 3. Sample record to be processed from HTML
<tbody>
  <tr>
    <td>
      <a href="http://www.ibm.com/developerworks/opensource/library/os-wc3jam/index.html">
      <strong>Join the social business revolution</strong></a>
      <div>
      Social media has become social business and everyone from business
      leadership to software developers need to understand the tools and
      techniques that will be required. The World Wide Web Consortium (W3C)
      will be conducting a social media event to discuss relevant standards
      and requirement for the near and far future.
      </div>
    </td>
    <td>Articles</td>
    <td class="dw-nowrap">03 Nov 2011</td>
  </tr>
</tbody>
Listing 4 is code to parse this page. Again, I try to comment it liberally, but there are a few key new concepts I'll be discussing after the listing.
Listing 4. Extracting a listing of articles from a web page
#!/usr/bin/env pypy
#Import the needed built-in libraries for use
import sys
import urllib2
from greenlet import greenlet

#Import what we need from pyparsing
from pyparsing import makeHTMLTags, SkipTo

def collapse_space(s):
    '''
    Strip leading and trailing space from a string, and replace
    any run of whitespace within with a single space
    '''
    #Split the string according to whitespace and then join back with single
    #spaces, which also strips leading and trailing space. These are all
    #standard Python library tools
    return ' '.join(s.split())

def handler():
    '''
    Simple coroutine to print the result of a matched portion from the page
    '''
    #This will be run the first time the code switches to this greenlet function
    print 'A list of recent IBM developerWorks Open Source Zone articles:'
    #Then we get into the main loop, receiving one match at a time
    while True:
        data = green_handler.parent.switch()
        print ' *', collapse_space(data.title), '(', data.date, ')', data.link.href

#Turn a regular function into a greenlet by wrapping it
green_handler = greenlet(handler)
#Switch to the handler greenlet the first time to prime it
green_handler.switch()

#Read the search starting page
START_URL = "http://www.ibm.com/developerworks/opensource/library/"
stream = urllib2.urlopen(START_URL)
html = stream.read()
stream.close()

#Set up some tokens for HTML start and end tags
div_start, div_end = makeHTMLTags("div")
tbody_start, tbody_end = makeHTMLTags("tbody")
strong_start, strong_end = makeHTMLTags("strong")
article_tr, tr_end = makeHTMLTags("tr")
td_start, td_end = makeHTMLTags("td")
a_start, a_end = makeHTMLTags("a")

#Put together enough tokens to narrow down the data desired from the page
article_row = ( div_start + SkipTo(tbody_start)
                + SkipTo(a_start) + a_start('link')
                + SkipTo(strong_start) + strong_start
                + SkipTo(strong_end)("title")
                + SkipTo(div_start) + div_start
                + SkipTo(div_end)("summary") + div_end
                + SkipTo(td_start) + td_start + SkipTo(td_end)("type") + td_end
                + SkipTo(td_start) + td_start + SkipTo(td_end)("date") + td_end
                + SkipTo(tbody_end) )

#Run the parser over the page. scanString is a generator of matched snippets
for data, startloc, endloc in article_row.scanString(html):
    #For each match, hand it over to the greenlet for processing
    green_handler.switch(data)
I set up Listing 4 deliberately to introduce PyPy's Stackless Python features. In short, Stackless Python is a long-standing alternative implementation of Python for experimenting with advanced flow-of-control features. Most Stackless features have not made it into other Python implementations because of limitations in their runtimes, limitations that are relaxed in PyPy. Greenlets are one example. Greenlets are like very lightweight threads that are multitasked cooperatively, by explicit calls to switch the context from one greenlet to another. Greenlets allow you to do some of the neat things that generators allow, and much more.
In Listing 4 I use greenlets to define a co-routine, a function whose operation is neatly interleaved with another in a way that makes the flow easy to construct and follow. You would often use greenlets in situations where more mainstream programming uses callbacks, such as event-driven systems. Rather than invoking a callback, you switch context to a co-routine. The key benefit of such facilities is that they allow you to structure programs for high efficiency without difficult state management.
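The flow is easier to see in miniature. Listing 4 uses greenlet, but the generator-based co-routines built into standard Python give a similar flavor with no third-party dependency; this small accumulator sketch is my own illustration, not part of the listing:

```python
def accumulator(out):
    '''A co-routine: it pauses at each yield until the caller sends a value'''
    while True:
        item = (yield)        #execution suspends here, as at a greenlet switch
        out.append(item.upper())

log = []
co = accumulator(log)
next(co)           #prime the co-routine, like the first green_handler.switch()
co.send('spark')   #each send() resumes the co-routine with a new value
co.send('pypy')
assert log == ['SPARK', 'PYPY']
```

A greenlet generalizes this pattern: any function can switch to any other, not just a single caller and one generator.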
Listing 4 provides a taste of co-routines and greenlets in general, but it's a useful concept to ease into in the context of PyPy, which comes bundled with greenlets and other Stackless features. In Listing 4, every time pyparsing matches a record, the greenlet is invoked to process that record.
The following is a sample of the output from Listing 4.
A list of recent IBM developerWorks Open Source Zone articles:
 * Join the social business revolution ( 03 Nov 2011 ) http://www.ibm.com/developerworks/opensource/library/os-wc3jam/index.html
 * Spark, an alternative for fast data analytics ( 01 Nov 2011 ) http://www.ibm.com/developerworks/opensource/library/os-spark/index.html
 * Automate development and management of cloud virtual machines ( 29 Oct 2011 ) http://www.ibm.com/developerworks/cloud/library/cl-automatecloud/index.html
Whys and why nots
I took the approach in Listing 4 deliberately, but I'll start with one warning about it: it is always dangerous to try processing HTML without very specialized tools. The situation is not as bad as with XML, where not using a conforming parser is an anti-pattern, but HTML is complex and tricky, even in pages that conform to the standard, which most don't. If you need general-purpose HTML parsing, html5lib is a better option. That said, web scraping is usually a specialized operation where you are just extracting information according to the specific situation. For such limited use, pyparsing is fine, and provides some neat facilities to help.
The reasons I introduced greenlets, which are not strictly necessary in Listing 4, become more apparent as you expand such code to real-world scenarios. In situations where you are multiplexing the parsing and processing with other operations, the greenlets approach makes it possible to structure processing not unlike UNIX command-line pipes. In cases where you are working with multiple source pages there is one more complication: urllib2 operations are not asynchronous, so the whole program blocks any time it accesses a web page. Addressing this problem is beyond the scope of this article, but the use of advanced flow of control in Listing 4 should get you into the habit of thinking carefully about how to stitch together such sophisticated applications with an eye to performance.
PyPy is an actively maintained project, and certainly a moving target, but there is already much you can do with it, and the high level of CPython compatibility means that you probably have an established backup platform for your work if you begin to experiment. In this article, you learned enough to get started, and you had a glimpse of PyPy's fascinating Stackless features. I think you'll be pleasantly surprised by PyPy's performance, and more importantly, by how it opens up new ways of thinking about programming for elegance without sacrificing speed.
- Some other important resources for PyPy are the performance page and the compatibility wiki, which tracks the compatibility of popular Python libraries with PyPy.
- Enjoy a gentle introduction to Python in the "Discover python" series (developerWorks, 2005-2006), by Robert Brunner.
- Learn more about HTML5 in the series "HTML5 fundamentals" (developerWorks, 2011), by Grace Walker.