Manipulate CSV data with Python and pureXML

Import and convert US census data to XML using Python in IBM DB2 Express-C

IBM® DB2® pureXML® allows you to store XML data natively in a relational database management system, giving you the power and flexibility to report on this data without disturbing the advantages that its XML format offers. In this tutorial, you will learn how to connect to a DB2 database from the Python programming language, importing data about population from the United States Census Bureau. You will use Python to convert this CSV file into XML, before inserting this XML data natively into DB2. Finally, you will use Python to create a command-line application that produces some informative tables that you can access through a menu system.

Share:

Joe Lennon, Software developer, Freelance

Photo of Joe LennonJoe Lennon is a 24-year-old software developer from Cork, Ireland. Joe is author of the forthcoming Apress book Beginning CouchDB, and has contributed several technical articles and tutorials to IBM developerWorks. In his spare time, Joe likes to play football (soccer), tinker with gadgets and work on his Xbox 360 gamer score.



22 December 2009

Also available in Chinese Portuguese

Before you start

This tutorial is geared towards developers who want to learn how to store data in XML format in a database, connect to DB2 from a Python application, and learn how to convert data from CSV files into XML documents. No prior knowledge of Python is assumed (you will learn how to install it in this tutorial), but it would be advantageous. This tutorial assumes that you use a Microsoft® Windows® operating system, but the code should work on other platforms without modification. When you complete this tutorial, you will have the skills to create powerful Python applications that can communicate and interact with an IBM DB2 database and harness the power that pureXML offers.

About this tutorial

Frequently used acronyms

  • API: Application programming interface
  • CSV: Comma separated value
  • DBA: Database administrator
  • SQL: Structured Query Language
  • UI: User interface
  • URL: Uniform Resource Locator
  • XML: Extensible Markup Language

The IBM DB2 database management system has long been a leading player in the area of relational data management. In recent years, however, there has been a significant rise in the requirement for data structures that are more flexible and document-oriented in nature. One of the more prominent examples of such data structure is XML.

While many relational database systems have rushed to incorporate some form of XML support in their database, IBM DB2 is the only such offering that allows XML to be stored natively in the database, unchanged and true to its original form. This is referred to as pureXML—a technology that allows DB2 developers and DBAs to manipulate and report on XML data alongside relational data, without negatively affecting the purity of the XML itself.

In this tutorial, you will develop a Python script that connects to the United States Census Bureau Web site and downloads a CSV file containing data about the population at a national, regional, and state-wide level—including the results of the 2000 Census and fluctuations based on estimates in each year since then. You will then learn how to process this data, converting it into an XML document. Rather than import this large document and rely on DB2 functions to slice and dice it into individual rows, you will then use Python to insert this data into DB2, with an XML document stored per each relevant row in the CSV file. Finally, you will create a command-line application that produces some useful reports on this data, showing a list of states, regions, or countries in the order of highest to lowest population.

Prerequisites

To follow the steps in this tutorial, you will need to have the following software installed:

  • IBM DB2 Express-C 9.5 or later
  • Python Version 2.6 or any pre-3.0 version

See Resources for the links to download these prerequisites. This tutorial assumes that you are using a Microsoft Windows operating system, preferably XP or later. In order to install Python and the IBM DB2 extension for Python, you will need administrative privileges on your computer.


Setting up the database

In this section, you will create a new IBM DB2 database using the DB2 Command Editor utility, before you create a series of tables that will store census population data in XML format. You will create three tables: country, region, and state. Each of these tables will store a unique ID for each row in the table, as well as an XML document that will house the census data that you will import from the U.S. Census Bureau's CSV files later in this tutorial.

Creating the database

Let's get started. Open the DB2 Command Editor (Start>Programs>IBM DB2>[DB2 Instance Name]>Command Line Tools) and enter the following command: create database census using codeset UTF-8 territory US.

This process can take a minute or two to complete so be patient. When it is finished, you will receive a response message like: DB20000I The CREATE DATABASE command completed successfully.

Tip: You can quickly execute commands in Command Editor by pressing Ctrl+Enter.

Now, connect to the newly-created census database using the following: connect to census.

Once again, you should receive a response from the DB2 server, this time something along the lines of: A JDBC connection to the target has succeeded.

The database is now created and you are ready to create the tables that will store the application's data.

Creating the database tables

You will load the population data into the database and store it in three separate tables: country, region, and state. Let's create these tables now in Listing 1.

Listing 1. DDL SQL statements to create tables
create table country (
    id int not null generated by default as identity,
    data xml not null,
    primary key(id)
);

create table region (
    id int not null generated by default as identity,
    data xml not null,
    primary key(id)
);

create table state (
    id int not null generated by default as identity,
    data xml not null,
    primary key(id)
);

Each of these tables store the same type of data—a unique identifier that will be automatically generated by DB2 each time a row is inserted, and an XML data column that will store an XML document for each row. Strictly speaking, you can use a single table here and create a type column on it to determine whether a row is a country, region, or state, but if you separate them into tables it you have more flexibility for future manipulation.

When you execute the above SQL statements, DB2 should return the following response for each table: DB20000I. The SQL command completed successfully.

With the database configured, you are now ready to install and configure Python and the ibm_db extension for Python.


Installing and configuring Python

Python is a high-level programming language that places a strong focus on the readability of code. Unlike many other programming languages, where code indentation and style is at the discretion of the developer, in Python you must use indentation to denote blocks of code (such as classes, if statements, and loops). Python is easy to learn, produces elegant and clean code, and is widely supported on a host of different platforms, making it an excellent choice for any number of different application development projects.

About Python

Although Python is generally pre-installed on Mac OS X and Linux® operating systems, the same cannot be said for Microsoft Windows. Fortunately, you can download Python from the Web and install it on Windows—and you will learn how to do so in this section. Before you start, however, it's worth mentioning that you have a number of options when it comes to downloading Python for Windows.

The first option is to use the open-source official binary installer, available for download from the official Python Web site. This option offers the most up-to-date version of Python and is provided on an open-source license. In this tutorial, you will work with this distribution of Python.

Alternatively, the commercial ActiveState Python offers some additional resources such as complete documentation, and additional Python extensions including Windows-specific extensions that facilitate the development of Win32 API-based applications using Python.

Installing Python

The first step in installing Python is to download it from the official Python Web site (see Resources for a link). At the time of writing, the current production versions of Python are 2.6.4 and 3.1.1. This tutorial assumes you are using the 2.6.* version of Python. As version 3.0 and above is not backward-compatible, I highly recommend that you download the latest pre-3.0 version (version 2.x.x) offered for download. Save this file to your hard drive and when it has finished downloading, open the .msi file to launch the setup program.

When the installer launches, it will ask you if you want to install it for all users or just for you (this option is not available on Windows Vista®). Leave the default selection, Install for all users, and press Next to continue. You will now be asked to select a destination directory. The default should be C:\Python26\ or similar; again, accept this default and press Next to move forward. You will now be offered the opportunity to customize your Python installation by selecting which features you want to be installed. By default, everything is selected, so leave this as is and press Next to start the installation. The process will take a couple of minutes to complete, and when it is finished, you will see a window like the one in Figure 1.

Figure 1. Completing the Python 2.6.4 Installer window
Screen capture of Completing the Python 2.6.4 Installer window

Press Finish to exit the setup application. Before you move on, it's worth verifying that Python is installed and working correctly. You can use the shortcuts that were added to the Windows Start Menu if you wish, but I recommend that you launch Python from the command prompt as this is how you will run the scripts you create later in this tutorial.

First, open the Windows command prompt window through the Run dialog (Start>Run, then enter cmd) or navigate to Start>Programs>Accessories>Command Prompt). At the prompt, enter the command: python.

You should now be at the Python prompt, indicated by >>>, as in Figure 2. (See a text-only view of Figure 2.)

Figure 2. Python prompt
Screen capture of Python prompt in the commandprompt window

Note: If you see a message such as python is not recognized as an internal or external command, operable program or batch file, the Python directory was not placed on your Windows Path. See Resources for information on how to set this up. To quit the Python prompt, enter the following command: quit().

You should return to the Windows command prompt after entering this command at the Python prompt. In the next section, you will learn how to install the ibm_db Python extension, which will allow you to connect to a DB2 database from Python.

Installing the ibm_db Python extension

The ibm_db extension for Python allows you to connect to and interact with an IBM DB2 database using Python code. To install this extension, you will first need to install the easy_install utility (setuptools). Navigate to the setuptools package page (see Resources and find the file for your version of Python, which is 2.6 in my case). Download this file to your hard drive, and when it has completed, open it to install the easy_install.exe application into your Python Scripts directory (usually C:\Python26\Scripts).

Installing the ibm_db extension itself is very simple. Open a Windows command prompt window (Start>Run>cmd) and enter the following command, and if you installed Python to a different directory, change the reference accordingly: C:\Python26\Scripts\easy_install ibm_db.

This will search for, download, extract, and install the ibm_db extension automatically. When it is finished, you will be returned to the Windows command prompt, as in Figure 3. (See a text-only view of Figure 3.)

Figure 3. Successfully installed ibm_db extension
Screen capture of successfully installed ibm_db extension with 'Finished processing dependencies message

Next, you will verify that the ibm_db extension is working correctly by testing the connection to the DB2 database you created earlier in this tutorial.

Connecting to DB2 from Python

With the DB2 database created and Python and the ibm_db extension installed and configured, you are now ready to check that you can connect to DB2 from Python. Open a Windows command prompt window and issue the python command to launch the Python interpreter.

At the prompt, enter the following commands to connect to DB2 and count the number of rows in the country table. Please note that the Python prompt (>>> and ...) has been included here for illustrative purposes only and you should not type those in to the interpreter. Also, be sure to replace the credentials in the code in Listing 2 with your actual DB2 credentials.

Listing 2. Python code to connect to DB2
>>> import ibm_db
>>> conn =
 ibm_db.connect("DATABASE=census;HOSTNAME=localhost;PORT=50000;
PROTOCOL=TCPIP;UID=username;PWD=password;", "", "")
>>> sql = "SELECT COUNT(*) FROM country"
>>> stmt = ibm_db.exec_immediate(conn, sql)
>>> result = ibm_db.fetch_both(stmt)
>>> while result != False:
...     print "Count: ", result[0]
...     result = ibm_db.fetch_both(stmt)
...

After you enter the final line above, press enter and the code will execute. You should see a result (Count: 0) as in Figure 4.

Figure 4. Result of DB2 connection test
Screen capture of the result of DB2 connection test

If you have problems connecting to DB2 from Python, check that the ibm_db extension was installed correctly, and that you have already created the database as described earlier in this tutorial. Also verify that your credentials for connecting to DB2 are correct.

With the database set up and Python ready to get to work, you are now ready to start developing the main subject of this tutorial. In the next section, you will download, parse, and convert CSV data from the U.S. Census Bureau and save it as XML data in the DB2 database. You will then learn how to read this data from the database and display it to the user.


Downloading and converting the CSV data

In this section of the tutorial, you will learn how to create a Python script to pull a CSV file down from the United States Census Bureau Web site. Next, you will process this CSV data and convert it to XML that can be stored in the DB2 pureXML database you created earlier in the tutorial.

Before you start, you should create a folder somewhere on your hard disk where you will store the project files. I stored my data in a folder C:\pycensus, and I suggest that you do the same.

Downloading CSV files from the US Census Bureau Web site

The United States Census Bureau has a plethora of data available for download, in a variety of different formats. Unfortunately, the population data from Census 2000 and estimates for each year since then is only available in CSV format and not XML. That's not a problem, however, as you can use Python to pull this CSV file down from the Census Bureau Web site and convert it into XML that you can store in your DB2 pureXML database.

If you wanted, you can point your Web browser to the URL for the CSV file project folder. Instead, however, you will create a Python script to do this task. In your favorite text editor, create a new file and save it as download.py in your project folder (for example, C:\pycensus). Add the code from Listing 3 to this file.

Listing 3. download.py
import httplib
conn = httplib.HTTPConnection("www.census.gov")
conn.request("GET", "/popest/national/files/NST-EST2008-alldata.csv")
response = conn.getresponse()
f = open('data.csv', 'w')
f.write(response.read())
f.close()
conn.close()

In this script, you use the httplib module to connect to the census.gov Web site and issue a GET request for the CSV file required. Then you fetch the response and write it to a file named data.csv. To run this script, open up the Windows Command Prompt and change to the project directory as follows: cd \pycensus.

Next, run the following command to run the Python script: python download.py.

When the script has completed you will return to the prompt. You might wonder why there were no messages produced—don't worry, this is a good thing as it means no errors occurred. Open your project folder in Windows Explorer and you will now notice an extra file in the folder named data.csv. If you have Microsoft Excel®, this will be the default handler for this type of file, and opening it will produce a result like the one in Figure 5.

Figure 5. data.csv in Microsoft Excel
Screen capture of data.csv in Microsoft Excel

Warning: Do NOT save this file in Excel, as it may change the CSV file format to suit its own interpretation, and this might not be readable by Python. If Excel asks you to save the file, choose No. If you accidentally save the file, simply delete it and re-run the download.py Python script. In the next section, you will learn how to take this CSV file and convert it into XML.

Converting CSV data into XML documents

To convert the CSV data into XML, you must first be clear on how exactly you wish to store the data, whether different records should be stored differently, and check if some records should be discarded. In the example of the CSV file you just downloaded, you will notice that this contains three types of data: a single row of information for the entire country; four rows of data for the regions Northeast, Midwest, South, and West; fifty one rows of data for the fifty states of the USA plus District of Columbia; and a row for the Puerto Rico Commonwealth. The first row of the file is a header row that is to be used for column names.

The script you create in this section will take the header row and use this data to form the tag names for each element that a record should have in the XML document. The script will determine, based on the first four columns, whether the particular row refers to a country, region, or state, and will set the tag name accordingly to indicate what the XML document refers to. Finally, the script will choose to exclude the Puerto Rico Commonwealth record as it has some incomplete data.

In your text editor, create a new file and save it as convert.py. Add the code from Listing 4 to this file.

Listing 4. convert.py
import csv

reader = csv.reader(open('data.csv'), delimiter=',', quoting=csv.QUOTE_NONE)

print "<data>"
for record in reader:
    if reader.line_num == 1:
        header = record
    else:
        innerXml = ""
        dontShow = False
        type = ""
        for i, field in enumerate(record):
            innerXml += "<%s>" % header[i].lower() + field + "</%s>" 
% header[i].lower()
            if i == 1 and field == "0":
                type = "country"
            elif type == "" and i == 3 and field == "0":
                type = "region"
            elif type == "" and i == 3 and field != "0":
                type = "state"

            if i == 1 and field == "X":
                dontShow = True

        if dontShow == False:
            xml = "<%s>" % type
            xml += innerXml
            xml += "</%s>" % type
            print xml
print "</data>"

In this file, you use the csv library to read the data.csv file. You wrap the output in an opening <data> and closing </data> XML tag, as it is producing a single document output. Then you loop through each line of the CSV file. If the current line is the first line of the file, you set that record as the header. This will be used later in the script as the element name for each field in a country, region, or state record. If the current line is not the header record, you loop through each column in the record, and create an inner XML element string whose name is driven from the heading record. You then check whether the row in question is referring to a country, region, or state, and wrap the inner XML element in an outer tag <country>, <region>, or <state> accordingly. Finally, you check if the record contains an X in a specific field, and if so, set a Boolean indicator to True that will stop that particular row from being added to the XML document. The first way you can run this script is the same as before, by issuing: python convert.py.

Running the script this way will produce a result like the one in Figure 6.

Figure 6. convert.py output
Screen capture of convert.py output, showing data information as concatenated XML document

As you can see, the script has put the data directly to the screen. It would be far more useful if this data were saved to a file. Rather than creating more Python code to do this, you can simply change the command you issue as follows to tell the command prompt to save the output to a file named data.xml: python convert.py > data.xml.

This will create a new file in the project directory named data.xml. If you open this file in an application that reads and formats XML, such as Firefox, you might see a result like the one in Figure 7.

Figure 7. XML output in Mozilla Firefox
Screen capture of data viewed as XML output in Mozilla Firefox

With the data stored in a file like this, you can import the XML into a DB2 database using a .del file and the IMPORT command. However, this results in the entire XML data being stored in a single row in a DB2 table. Now, it is possible to use XQuery to split up this data and store it into separate rows. But, as you are already using Python to create the XML document, it is much easier to simply perform a series of INSERT statements directly in the convert.py script itself. In the next section, you will modify the convert.py script to do just that.


Saving XML into DB2 with Python

Previously, you learned how to format the CSV data you downloaded from the U.S. Census Bureau into a large XML document. Now you will learn how to take the rows for country, regions, and states and insert them into a DB2 database. Make the changes listed in this section to the convert.py file you created in the previous section.

Including the ibm_db library

The first thing that you need to do is include the ibm_db library in your code. To do this, change the first line of the covert.py file so that it now reads: import csv, ibm_db.

With a script like this, running it multiple times causes every row to be inserted repeatedly, resulting in a mass of duplicate data. To prevent that, clear the database tables at the start of the script so that each time it runs it will start from fresh. Add Listing 5 just below the import statement from Listing 4, which you just modified (in other words add it before the reader = csv.reader... line in Listing 4).

Listing 5. Excerpt from convert.py—clearing down tables
connString = "DATABASE=census;HOSTNAME=localhost;PORT=50000;PROTOCOL=TCPIP;
UID=username;PWD=password;"
try:
    conn = ibm_db.connect(connString, "", "")
except:
    print "Could not connect to DB2: ", ibm_db.conn_errormsg()
else:
    print "Connected to DB2."

sql = "DELETE FROM country"
try:
    stmt = ibm_db.exec_immediate(conn, sql)
except:
    print "Error deleting from country table: ", ibm_db.stmt_errormsg()
else:
    print "Country table emptied."

sql = "DELETE FROM region"
try:
    stmt = ibm_db.exec_immediate(conn, sql)
except:
    print "Error deleting from region table: ", ibm_db.stmt_errormsg()
else:
    print "Region table emptied."

sql = "DELETE FROM state"
try:
    stmt = ibm_db.exec_immediate(conn, sql)
except:
    print "Error deleting from state table: ", ibm_db.stmt_errormsg()
else:
    print "State table emptied."

You might remember the code to connect to the DB2 database from an earlier section of this tutorial, where you tested that the Python connection to DB2 was working. This time, you are performing three SQL statements—deleting all data from the country, region, and state tables, respectively. In each case, Python will output a message either confirming that the statement executed successfully or that an error occurred. If an error does occur, the DB2 error message is relayed to the user, making it easier to debug what went wrong.

Next, you need to delete a couple of print statements that output the outer XML declaration for the single large document you created in the previous section. These lines are: print "<data>" and print "</data>".

The former should be just beneath the reader = csv.reader... line, and the latter should be the last line of the file.

Finally, you need to change the convert.py file so that it doesn't print the XML code for each row, but rather that it saves this as an XML document in the appropriate DB2 table. You have already created code to determine if a particular line is a country, region, or state, and to generate the XML for the row; so all you need to do now is create the relevant INSERT statement and execute it.

Find the line that currently reads print xml. You need to replace this line with the code from Listing 6. Keep in mind that Python is very sensitive to code indentation, so be sure to line up your code correctly in your text editor.

Listing 6. Excerpt from convert.py—saving rows to the DB2 database
if type == "country":
    sql = "INSERT INTO country(data) VALUES('%s')" % xml
elif type == "region":
    sql = "INSERT INTO region(data) VALUES('%s')" % xml
elif type == "state":
    sql = "INSERT INTO state(data) VALUES('%s')" % xml

try:
    stmt = ibm_db.exec_immediate(conn, sql)
except:
    print "Error adding row: ", ibm_db.stmt_errormsg()
else:
    print "Row added to %s table" % type

The final code for convert.py will look like Listing 7. Again, indentation is hugely significant in Python, so ensure that it is correct or you might experience unexpected results.

Listing 7. convert.py
import csv, ibm_db

connString = "DATABASE=census;HOSTNAME=localhost;PORT=50000;PROTOCOL=TCPIP;UID=jjlennon;
PWD=DopGX240;"
try:
    conn = ibm_db.connect(connString, "", "")
except:
    print "Could not connect to DB2: ", ibm_db.conn_errormsg()
else:
    print "Connected to DB2."

sql = "DELETE FROM country"
try:
    stmt = ibm_db.exec_immediate(conn, sql)
except:
    print "Error deleting from country table: ", ibm_db.stmt_errormsg()
else:
    print "Country table emptied."

sql = "DELETE FROM region"
try:
    stmt = ibm_db.exec_immediate(conn, sql)
except:
    print "Error deleting from region table: ", ibm_db.stmt_errormsg()
else:
    print "Region table emptied."

sql = "DELETE FROM state"
try:
    stmt = ibm_db.exec_immediate(conn, sql)
except:
    print "Error deleting from state table: ", ibm_db.stmt_errormsg()
else:
    print "State table emptied."

reader = csv.reader(open('data.csv'), delimiter=',', quoting=csv.QUOTE_NONE)
for record in reader:
    if reader.line_num == 1:
        header = record
    else:
        innerXml = ""
        dontShow = False
        type = ""
        for i, field in enumerate(record):
            innerXml += "<%s>" % header[i].lower() + field + "</%s>" 
% header[i].lower()
            if i == 1 and field == "0":
                type = "country"
            elif type == "" and i == 3 and field == "0":
                type = "region"
            elif type == "" and i == 3 and field != "0":
                type = "state"

            if i == 1 and field == "X":
                dontShow = True
        if dontShow == False:
            xml = "<%s>" % type
            xml += innerXml
            xml += "</%s>" % type
            if type == "country":
                sql = "INSERT INTO country(data) VALUES('%s')" % xml
            elif type == "region":
                sql = "INSERT INTO region(data) VALUES('%s')" % xml
            elif type == "state":
                sql = "INSERT INTO state(data) VALUES('%s')" % xml
            try:
                stmt = ibm_db.exec_immediate(conn, sql)
            except:
                print "Error adding row: ", ibm_db.stmt_errormsg()
            else:
                print "Row added to %s table" % type

Make sure you have saved this file and open a Windows command prompt. Change to the project directory and run the convert.py script again, this time using the following command (don't pipe the output to a file): python convert.py.

You should see a number of "Row added to state table" messages appearing one after the other, as in Figure 8.

Figure 8. Output from revised convert.py
Screen capture of output from revised convert.py

Before you read this data from DB2 using Python, open the DB2 Command Editor and check how this data looks in the database. Make sure you are connected to the census database (issue the command connect to census if required) and enter the following SQL statement: select * from state. This query should produce 51 results, as in Figure 9.

Figure 9. Query Results view
Screen capture of Query Results view with numbered XML data entries

Click on the more (...) button next to any one of the rows in the Query Results tab. This will open the XML Document Viewer, showing that particular row's associated XML document. This should look similar to the screen capture in Figure 10.

Figure 10. XML document viewer
Screen capture of an XML document in the Tree View tab in the XML document viewer

Feel free to execute similar SQL statements to retrieve records from the country and region tables; you should get a single row result for the country table and four rows for the region table.

Next, you will learn how to read this data from DB2 into Python and present it to the user.

Reading XML from DB with Python

In this section, you will learn how to build a command-line Python application that will request user input to select one of three menu options. These options will allow the user to view a list of states, regions, or countries ordered by the population driven from the 2000 census.

To start, you'll connect to the DB2 database, print the list of menu options, and request the user's input. Create a new file called read.py and add the code from Listing 8 to it.

Listing 8. Excerpt from read.py—getting started
import ibm_db, locale, sys

locale.setlocale(locale.LC_ALL, '')

connString = "DATABASE=census;HOSTNAME=localhost;PORT=50000;
PROTOCOL=TCPIP;UID=username;PWD=password;"
try:
    conn = ibm_db.connect(connString, "", "")
except:
    print "Could not connect to DB2: ", ibm_db.conn_errormsg()
else:
    print "Connected to DB2."

print "To view population information, please select one of the following options:"
print "1.) List of states by population"
print "2.) List of regions by population"
print "3.) List of countries by population"
print "4.) Exit the application"

input = False
while input == False: 
    try:
        option = int(raw_input("Please enter a number from the options above to 
view that information: "))
        if option not in [1,2,3,4]:
            raise IOError('That is not a valid option!')
    except:
        print "That is an invalid option."
    else:
        input = True

In Listing 8, you are first importing the ibm_db and locale libraries. The locale library is required to format the population number so that it is more readable (using thousand separators). You begin the application by setting the locale to the default setting on your machine. Next, you connect to the DB2 database, before printing information about the different menu options that will be available to the user.

The final section of code in Listing 8 requests that the user enter a value, and verifies that this is an integer, and is one of the four available options—1, 2, 3 or 4. If the value provided is not one of these values, it will keep asking for a value until a valid one is entered. The user can exit the program at any time by selecting option 4.

Now that the application has determined what data the user wants to see, it must build an appropriate SQL statement to retrieve this data. The code in Listing 9 does just that.

Listing 9. Excerpt from read.py—building the SQL
selected = ""
if option == 1:
    sql = "select x.* from state s, xmltable('$d/state' passing s.data as \"d\"\
    columns \
    name varchar(50) path 'name', \
    pop int path 'census2000pop') as x \
    order by x.pop desc"
    selected = "state"
elif option == 2:
    sql = "select x.* from region r, xmltable('$d/region' passing r.data as \"d\"\
    columns \
    name varchar(50) path 'name', \
    pop int path 'census2000pop') as x \
    order by x.pop desc"
    selected = "region"
elif option == 3:
    sql = "select x.* from country c, xmltable('$d/country' passing c.data as \"d\"\
    columns \
    name varchar(50) path 'name', \
    pop int path 'census2000pop') as x \
    order by x.pop desc"
    selected = "country"
elif option == 4:
    sys.exit()

In Listing 9, the if block checks whether the user's input selection was the value 1, 2, 3 or 4. If it detects a value between 1 and 3, it will create an SQL statement for viewing population data for states, regions, or countries. If it detects that 4 was entered, it will exit the program.

The SQL statement for each option is virtually the same, except that it looks at a different table in each instance. It basically uses the XMLTABLE function to map XML elements in the data column of the table to different relational-style columns. It then orders the data by the population value, from highest number to lowest number.

The final part of the application is executing the SQL statement and looping through the result set to produce a table of results. Listing 10 shows this code.

Listing 10. Excerpt from read.py—Formatting results
try:
    stmt = ibm_db.exec_immediate(conn, sql)
except:
    print "Error retrieving data: ", ibm_db.stmt_errormsg()
else:
    res = ibm_db.fetch_both(stmt)
    print ".----------------------------------------------,"
    print "|                                              |"
    print "|", ("%s LIST BY POPULATION" % selected.upper()).center(44), "|"
    print "|                                              |"
    print "|----------------------------------------------|"
    print "|", ("%s" % selected.upper()).center(21), " | ", "POPULATION".center(18), "|"
    print "|----------------------------------------------|"
    while res != False:
        print "|", res[0].ljust(21), " | ", locale.format("%d", res[1], grouping=True)
.rjust(18), "|"
        res = ibm_db.fetch_both(stmt)
    print "'----------------------------------------------'"

In this code, you execute the SQL statement that was generated by the code in Listing 9, and print out a table that nicely formats the results. You are using a series of Python functions in this section that perform string manipulation such as left-justifying, centering, and right-justifying text and formatting the population value with thousand separators to make it easy to read.

With the code for reading from the database complete, you are now ready to execute the script. From the Windows Command Prompt, make sure that you are in the project directory and use the following command to start the program: python read.py.

When the program executes, it will connect to DB2 and present you with the following list of menu options, which you can enter in the application:

  1. List of states by population
  2. List of regions by population
  3. List of countries by population
  4. Exit the application

Figure 11 shows the menu options. (See a text-only view of Figure 11.)

Figure 11. Application menu
Screen capture of the application menu with four menu options

Try entering an invalid menu option, such as the string hello. You will receive an error as in Figure 12, before you are asked to enter an option again. (See a text-only view of Figure 12.)

Figure 12. Invalid menu option error
Screen capture of invalid menu option error

This time, enter a valid option. I selected option 2 (List of regions by population). This should produce a result as in Figure 13. (See a text-only view of Figure 13.)

Figure 13. Regional population data
Screen capture of regional population data

As you can see, the application presents a table with a list of regions, with the region with the largest population displayed first. You should see a similar result for the other two menu options, except that option 1 will display 51 states and option 3 will display just the one country.

Be sure to try out the different menu options for yourself, and try to enhance the application by adding some more options and different views of the data.


Summary

In this tutorial, you learned how to create a DB2 database that features tables with native XML data columns. You then learned how to install Python and the ibm_db extensions for Python through the easy_install utility. Next, you verified that you can communicate with the DB2 database from the Python interpreter. You then developed a Python script that pulled down population data from the U.S. Census Bureau Web site, before you converted this CSV data into XML format and saved it in DB2 tables. Finally, you created a basic command-line application that provides tabular reports about national, regional, and state-wide population data.

With the information provided in this tutorial, you should have the knowledge required to further your Python and DB2 development skills.


Download

DescriptionNameSize
Python CSV source codepython.csv.source.zip3KB

Resources

Learn

Get products and technologies

Discuss

Comments

developerWorks: Sign in

Required fields are indicated with an asterisk (*).


Need an IBM ID?
Forgot your IBM ID?


Forgot your password?
Change your password

By clicking Submit, you agree to the developerWorks terms of use.

 


The first time you sign into developerWorks, a profile is created for you. Information in your profile (your name, country/region, and company name) is displayed to the public and will accompany any content you post, unless you opt to hide your company name. You may update your IBM account at any time.

All information submitted is secure.

Choose your display name



The first time you sign in to developerWorks, a profile is created for you, so you need to choose a display name. Your display name accompanies the content you post on developerWorks.

Please choose a display name between 3-31 characters. Your display name must be unique in the developerWorks community and should not be your email address for privacy reasons.

Required fields are indicated with an asterisk (*).

(Must be between 3 – 31 characters.)

By clicking Submit, you agree to the developerWorks terms of use.

 


All information submitted is secure.

Dig deeper into XML on developerWorks


static.content.url=http://www.ibm.com/developerworks/js/artrating/
SITE_ID=1
Zone=XML, Information Management, Open source, Web development
ArticleID=457285
ArticleTitle=Manipulate CSV data with Python and pureXML
publish-date=12222009