Managing the Protein Data Bank with DB2 pureXML

The Protein Data Bank (PDB) is a unique worldwide repository of structural data about proteins. The PDB data is available in XML format in order to provide flexibility, extensibility, and ease of data exchange in the biological research community. Analyzing data in the PDB can help explain diseases, develop new drugs, or understand the interactions between different proteins. However, one of the key challenges is to efficiently store and query this information to find and extract information and correlations of interest. This article describes how to use the hybrid capabilities of DB2®— relational and pureXML® features — to manage and analyze PDB data.

Share:

Gerd Anders (gerd.anders@ars.de), Consultant and Bioinformatics Researcher, ARS Computer & Consulting

Gerd AndersGerd Anders holds a degree in computer science from the Humboldt University Berlin. He focuses on bio-informatics and the use of database systems to efficiently manage and process very large amounts of data in and for the life sciences. The DB2-based PDB is an essential project of his Ph.D. thesis, providing a valuable basis for several DB2-driven applications and exploiting the entire potential of the PDB. In addition, he was also a tester of the closed-beta version of DB2 9.5 for Linux, UNIX, and Windows.



Matthias Nicola (mnicola@us.ibm.com ), Senior Software Engineer, IBM Silicon Valley Lab

Author photo: Matthias NicolaMatthias Nicola is a Senior Technical Staff Member at IBM's Silicon Valley Lab in San Jose, CA. He focuses on DB2 performance and benchmarking, XML, temporal data management, in-database analytics, and other emerging technologies. He also works closely with customers and business partners to help them design, optimize, and implement DB2 solutions. Previously, Matthias worked on data warehouse performance at Informix Software. He received his Ph.D. in computer science from the Technical University of Aachen, Germany.


developerWorks Contributing author
        level

08 September 2011

Also available in Chinese Russian Vietnamese Spanish

Introduction

The Protein Data Bank (PDB.org) is a worldwide archive of structural data about biological molecules, mostly proteins. The Protein Data Bank (PDB) is managed by several member organizations responsible for depositing, maintaining, processing, and freely providing this biological data to the scientific community. To provide flexibility, extensibility, and ease of data exchange, the PDB data is available in XML format. This XML format is defined by an XML Schema known as the Protein Data Bank Markup Language (PDBML).

The structural information includes the 3-D coordinates of the atoms of the molecule(s) a protein consists of. These atomic coordinates are also referred to as the 3-D structure or tertiary structure. The tertiary structure of a protein is closely coupled to its function. Thus, knowing the tertiary structure often helps in understanding the protein's intrinsic function. For example, the tertiary structure may be useful to explain diseases or develop new drugs. The tertiary structure can also be exploited to search the PDB for interactions between proteins.


The challenge

As of December 2010, the Protein Data Bank repository held 70,000 entries (XML documents) that contain more than 500 million atom coordinates. The total uncompressed size is more than 750 GB. Individual XML documents in the PDB range from a few MB to more than 1 GB in size. Based on the rapid growth of the PDB repository in recent years (Figure 1), the size of the PDB is expected to continue to increase significantly. Consequently, searching and analyzing this information is becoming ever more challenging.

Figure 1. Growth of the PDB over the past 20 years
Bar graph shows upward growth of PDB data year over year

A typical approach to analyze PDB data is to write a custom application or a set of scripts that search the PDBML documents for the purpose of a very specific research question. The disadvantages of this approach include the facts that:

  • Developing custom code each time new research is being conducted is very labor-intensive and time-consuming.
  • The performance is often poor because all documents need to be parsed and searched, even if only a subset of them contain relevant information.
  • It's often difficult to reuse or combine existing custom code to compose new or different queries against the PDB data.

DB2 V9.7.3 with pureXML was chosen to address these challenges, primarily because DB2 has the scalability and the XML capabilities required to process the expected volumes of PDBML documents. Additionally, DB2 is freely available for non-commercial usage via the IBM Academic Initiative. The goal was to store the PDB information in an efficient database schema, exploit relational and XML indexes for efficient search, and use XQuery and SQL/XML to express even complex queries against the PDB information.


The content of the Protein Data Bank

Before we discuss the DB2 database design for the PDB, it is helpful to understand the PDB data a little bit better.

The tertiary structure of a protein is experimentally determined (solved), predominantly by a method called X-ray Diffraction or X-ray Crystallography. Another less frequently used method is called Solution NMR (Nuclear Magnetic Resonance) or NMR Spectroscopy. The methods for determining (solving) the protein structure lead to differences in how a protein structure is described in the generated XML documents, which is particularly reflected in the XML file sizes.

Proteins are dynamic molecules, which means that their tertiary structures may vary slightly, for example depending on their environment. Due to these variations, NMR methodically determines multiple instances (models) that represent slightly shifted tertiary structures for the same protein. Consequently, XML files with protein data produced by NMR can be very large in size, such as 100 MB to 1 GB, or more. Also, you will see later in this article how and why we use DB2 range partitioning to separate the first (default) model of a protein from its variations.

Listing 1 shows an extract from one PDBML document. You can see four of 177 categories of information that can appear in such a document, including the authors of the study and the experimental method (<PDBx:exptlCategory>) used. The attribute entry_id represents the unique PDB identifier for this document.

Listing 1. Extract of a sample PDBML document (1BBZ.xml)
...
<PDBx:audit_authorCategory>
  <PDBx:audit_author pdbx_ordinal="1">
    <PDBx:name>Pisabarro, M.T.</PDBx:name>
  </PDBx:audit_author>
  ...
</PDBx:audit_authorCategory>
...
<PDBx:structCategory>
  <PDBx:struct entry_id="1BBZ">
    <PDBx:pdbx_descriptor>ABL TYROSINE KINASE, PEPTIDE P41
    </PDBx:pdbx_descriptor>
    <PDBx:title>CRYSTAL STRUCTURE OF THE ABL-SH3 DOMAIN COMPLEXED WITH
                A DESIGNED HIGH-AFFINITY PEPTIDE LIGAND: IMPLICATIONS FOR
                SH3-LIGAND INTERACTIONS
    </PDBx:title>
  </PDBx:struct>
</PDBx:structCategory>
...
<PDBx:struct_keywordsCategory>
  <PDBx:struct_keywords entry_id="1BBZ">
    <PDBx:pdbx_keywords>COMPLEX(TRANSFERASE/PEPTIDE)
    </PDBx:pdbx_keywords>
    <PDBx:text>COMPLEX (TRANSFERASE-PEPTIDE), SIGNAL TRANSDUCTION,SH3 DOMAIN, 
               COMPLEX (TRANSFERASE-PEPTIDE) complex
    </PDBx:text>
  </PDBx:struct_keywords>
</PDBx:struct_keywordsCategory>
...
<PDBx:exptlCategory>
  <PDBx:exptl entry_id="1BBZ" method="X-RAY DIFFRACTION">
    <PDBx:crystals_number>1</PDBx:crystals_number>
  </PDBx:exptl>
</PDBx:exptlCategory>
...

The test database

Due to time and resource constraints, we decided to use only a subset of the total available PDB data volume to prototype and evaluate the storage, indexing, and querying of PDBML documents in a DB2 database. Therefore, a representative sample of 6,029 documents was selected, which amounts to 83 GB and roughly 10 percent of the total volume of the PDBML archive as of December 2010. This set of documents contains approximately 1.7 billion XML elements, out of which approx. 1.54 billion elements describe tertiary protein structures through atom coordinates and other information.

A representative sample of PDBML documents must accurately reflect the ratio of documents with molecule information produced by X-ray Diffraction (smaller documents, 83 percent of all documents) vs. Solution NMR (larger documents, 16 percent of all documents). This ensures that the database configuration and queries are tested with a realistic mix of small and large documents.

The database server available for this study was a Sun X4600 M2 with eight dual-core processors (AMD Opteron 8220) and 256GB main memory. The operating system was Ubuntu 64-bit Linux®. The storage consisted of 10 hard drives (698 GB each; 7,200 rpm), organized as a single logical volume (RAID 5) using a hardware controller.


Database design recommendations for PDB

This section describes a set of database design recommendations that lead to simple and efficient database support for storing and analyzing PDB data. These recommendations address the database schema, the choice between XML and relational storage, definition of indexes, and physical data organization with partitioning and clustering options.

Hybrid XML/Relational storage

PDBML documents currently contain up to 177 categories of information, most of them optional. The large number of optional PDBML elements allow the documents to be very flexible and highly variable. A fully relational database schema would require hundreds of tables to represent PDBML. Such a relational database schema for the PDB was developed in 2005 and is shown in Figure 2. With more than 400 tables and more than 3,000 columns, the complexity of this schema is overwhelming. It is extremely difficult to understand and query such a database schema because a single PDB entry is broken up and scattered over hundreds of tables, making it hard for users to know which information resides in which table. Therefore, keeping most of the PDBML information in its original XML format and storing it in a single XML column results in a much simpler database design and retains the data in a format that users naturally understand.

Figure 2. Diagram of a fully relational database schema for PDBML
Diagram shows many tables arranged in a schema

One notable exception to the high variability of the PDBML data is the atom coordinates and their related labels, which follow a flat and regular structure repeated for every atom in a molecule, as illustrated in Listing 2. Since proteins commonly consist of thousands or tens of thousands of atoms, atom coordinates often represent 90 percent or more of a PDBML document.

Listing 2. Atom coordinates in a PDBML document
<PDBx:atom_siteCategory>
  <PDBx:atom_site id="1">
    <PDBx:B_iso_or_equiv>37.41</PDBx:B_iso_or_equiv>
    <PDBx:Cartn_x>1.039</PDBx:Cartn_x>
    <PDBx:Cartn_y>16.834</PDBx:Cartn_y>
    <PDBx:Cartn_z>18.876</PDBx:Cartn_z>
    <PDBx:auth_asym_id>A</PDBx:auth_asym_id>
    <PDBx:auth_atom_id>N</PDBx:auth_atom_id>
    <PDBx:auth_comp_id>ASN</PDBx:auth_comp_id>
    <PDBx:auth_seq_id>1</PDBx:auth_seq_id>
    <PDBx:group_PDB>ATOM</PDBx:group_PDB>
    <PDBx:label_alt_id xsi:nil="true" />
    <PDBx:label_asym_id>A</PDBx:label_asym_id>
    <PDBx:label_atom_id>N</PDBx:label_atom_id>
    <PDBx:label_comp_id>ASN</PDBx:label_comp_id>
    <PDBx:label_entity_id>1</PDBx:label_entity_id>
    <PDBx:label_seq_id>1</PDBx:label_seq_id>
    <PDBx:occupancy>1.00</PDBx:occupancy>
    <PDBx:pdbx_PDB_model_num>1</PDBx:pdbx_PDB_model_num>
    <PDBx:type_symbol>N</PDBx:type_symbol>
  </PDBx:atom_site>
  <PDBx:atom_site id="2">
    <PDBx:B_iso_or_equiv>36.15</PDBx:B_iso_or_equiv>
    <PDBx:Cartn_x>-0.213</PDBx:Cartn_x>
    <PDBx:Cartn_y>16.205</PDBx:Cartn_y>
    <PDBx:Cartn_z>18.364</PDBx:Cartn_z>
      ...
  </PDBx:atom_site>
  <PDBx:atom_site id="3">
    <PDBx:B_iso_or_equiv>33.97</PDBx:B_iso_or_equiv>
    <PDBx:Cartn_x>-0.549</PDBx:Cartn_x>
    <PDBx:Cartn_y>16.779</PDBx:Cartn_y>
    <PDBx:Cartn_z>16.986</PDBx:Cartn_z>
      ...
  </PDBx:atom_site>
  ...
</PDBx:atom_siteCategory>

The flat and regular structure of the atom information makes a perfect fit for traditional relational tables. In fact, the atom coordinates and labels are non-hierarchical data for which XML is not the best choice. Therefore, we decide on a hybrid database schema that stores the atom_site information in a relational table and the remainder of each PDBML document in an XML column, but with the <atom_siteCategory> removed from the document. This has several advantages:

  • The reduced PDBML documents are much smaller, which improves insert and load performance, as well as XML query performance. The XML parsing effort upon insert or load is reduced by approximately 90 percent.
  • The atom information takes less space in relational columns than in their verbose XML representation.
  • The atom data can be queried with traditional relational methods, which for non-hierarchical data is more efficient than XML navigation.
  • Since each atom is represented in a separated row, indices can help speed up the search for specific atoms within a given PDBML entry.

The chosen database schema consists of two tables, shown in Listing 3. The first (xmlrpdb.pdbxml) has one row for each PDB entry. This table has only two columns:

  • The primary key pdb_id holds the four-character PDB entry identifier from the XML attribute entry_id.
  • The XML column pdbxml_file holds the entire PDBML document except the <atom_siteCategory>.

The second table (xmlrpdb.atom_site) contains one relational row for each atom coordinate (i.e., for each <atom_site> element in a PDBML document). The column pdb_id is the foreign key that links atom coordinates to the corresponding PDBML document in the pdbxml table.

Both tables are stored in table spaces with a 32-KB page size to maximize the performance analytical queries that read large numbers of rows.

Listing 3. Hybrid XML/relational database schema for PDB in DB2
CREATE TABLE xmlrpdb.pdbxml (
  pdb_id             CHAR(4) NOT NULL,
  pdbxml_file        XML NOT NULL,
  PRIMARY KEY (PDB_ID))
IN ts_data32k INDEX IN ts_index32k;

CREATE TABLE xmlrpdb.atom_site (
  pdb_id                CHAR(4) NOT NULL,
  atom_site_id          INTEGER NOT NULL,
  auth_asym_id          VARCHAR(10) WITH DEFAULT NULL,
  auth_atom_id          VARCHAR(20) NOT NULL,
  auth_comp_id          VARCHAR(3) NOT NULL,
  auth_seq_id           VARCHAR(20) NOT NULL,
  b_iso_or_equiv        DECIMAL(7,3) NOT NULL,
  b_iso_or_equiv_esd    DECIMAL(7,3) WITH DEFAULT NULL,
  cartn_x               DECIMAL(7,3) NOT NULL,
  cartn_x_esd           DECIMAL(7,3) WITH DEFAULT NULL,
  cartn_y               DECIMAL(7,3) NOT NULL,
  cartn_y_esd           DECIMAL(7,3) WITH DEFAULT NULL,
  cartn_z               DECIMAL(7,3) NOT NULL,
  cartn_z_esd           DECIMAL(7,3) WITH DEFAULT NULL,
  group_pdb             VARCHAR(10) NOT NULL,
  label_alt_id          VARCHAR(10) WITH DEFAULT NULL,
  label_asym_id         VARCHAR(10) WITH DEFAULT NULL,
  label_atom_id         VARCHAR(20) WITH DEFAULT NULL,
  label_comp_id      VARCHAR(10) NOT NULL,
  label_entity_id       SMALLINT NOT NULL,
  label_seq_id          SMALLINT WITH DEFAULT NULL,
  occupancy             DECIMAL(7,3) NOT NULL,
  occupancy_esd         DECIMAL(7,3) WITH DEFAULT NULL,
  pdbx_pdb_atom_name    VARCHAR(10) WITH DEFAULT NULL,
  pdbx_pdb_ins_code     VARCHAR(10) WITH DEFAULT NULL,
  pdbx_PDB_model_num SMALLINT NOT NULL,
  type_symbol           VARCHAR(10) WITH DEFAULT NULL,
  PRIMARY KEY (pdb_id, atom_site_id),
  FOREIGN KEY (pdb_id) REFERENCES xmlrpdb.pdbxml(pdb_id),
  CONSTRAINT group_chk CHECK (group_PDB in ('ATOM', 'HETATM'))
) IN ts_atom_data_32k INDEX IN ts_atom_index32k;

Optionally, CHECK constraints can be defined on the pdbxml table to ensure that the four-character PDB identifier conforms to the PDB standard. The first character must be a number between 1 and 9, and the next three characters must be a number between 0 and 9 or an uppercase character between A and Z (see Listing 4).

Listing 4. CHECK constraints to enforce proper pdb_id values
ALTER TABLE xmlrpdb.pdbxml
  ADD CHECK (SUBSTR(pdb_id, 1, 1) BETWEEN '1' AND '9')
  ADD CHECK ((SUBSTR(pdb_id, 2, 1) BETWEEN '0' AND '9') OR
             (SUBSTR(pdb_id, 2, 1) BETWEEN 'A' AND 'Z'))
  ADD CHECK ((SUBSTR(pdb_id, 3, 1) BETWEEN '0' AND '9') OR
             (SUBSTR(pdb_id, 3, 1) BETWEEN 'A' AND 'Z'))
  ADD CHECK ((SUBSTR(pdb_id, 4, 1) BETWEEN '0' AND '9') OR
             (SUBSTR(pdb_id, 4, 1) BETWEEN 'A' AND 'Z'));

Populating the hybrid database schema

The conceptual process of inserting a PDBML document into our hybrid database schema is illustrated in Figure 3. The <atom_siteCategory> data needs to be extracted and removed from the XML document, and inserted into the relational atom_site table (blue). The reduced document itself is inserted into the pdbxml table. We call this process atom site separation.

Figure 3. Hybrid storage of a PDBML document with separation of atom_site data
Image shows section of XML document extracted and inserted into relational table

Due to high data volume, the atom site separation (the population of the hybrid database schema) needs to have high performance. Hence, costly XML parsing should be reduced as much as possible. Revisiting the atom coordinates in XML format in Listing 2, we find that 94.5 percent of the characters are markup, and only 5.5 percent of the characters are actual values. Hence, the ratio of markup to value is extremely high, which means that a lot of XML parsing may be required to extract a comparatively small amount of usable data. You will understand shortly how this consideration has affected our decision of how to populate the two tables.

One option to populate the relational atom_site table is to use INSERT statements with an XMLTABLE function. Such a statement can parse the entire PDBML document and extract the atom information to insert as relational rows. Additionally, an XQuery Update expression can delete the <atom_siteCategory> subtree from the each PDBML document inserted into the pdbxml table. Such an XQuery Update expression can also be part of an INSERT statement so that the <atom_siteCategory> is removed before writing it into the XML column, rather than performing a separate step after the insert.

Another option is to use a special-purpose preprocessor outside the database to extract the atom data into a relational flat file and remove it from each PDBML document. Such a preprocessor was implemented in C++, and it has the following benefits:

  • It can add desirable annotations to the data, such as information from sequence and structure alignments or application-dependent geometric transformations like rotations or translations of atomic coordinates.
  • It can be implemented without using a general-purpose XML parser. Instead, it is designed and optimized for the specific file structure of PDBML documents. It exploits special knowledge about the flat structure of the atom data, the existence of newline characters between elements, and other characteristics. As a result, the specialized preprocessor is at least 10 times faster than any solution with XML parsing.

Preprocessing the data set of 6,029 gzipped PDBML documents (i.e., 83 GB unzipped) and loading the prepared data into the pdbxml table and atom_site table took only 1 hour and 44 minutes. The preprocessor is available for download (see Download).

Compression

Considering the data volume in the PDB archive as well as its rapid growth, it is useful to compress the data in DB2. This reduces the storage consumption and improves performance. Although compression and decompression in DB2 consumes some additional CPU cycles, compression also reduces the number of physical I/O operations necessary to read a certain amount of data from disk. Furthermore, compressed pages of a DB2 table space remain compressed in the DB2 buffer pool in main memory. As a result, compression allows more data to be in memory than without compression, which increases the buffer-pool hit ratio and makes higher utilization of the available memory. We found that the I/O and memory benefits of compression outweigh the additional CPU cost and lead to overall higher performance.

The following commands in Listing 5 were used to compress both tables.

Listing 5. Enabling compression and REORG tables
ALTER TABLE xmlrpdb.pdbxml COMPRESS YES;
REORG TABLE xmlrpdb.pdbxml LONGLOBDATA RESETDICTIONARY;

ALTER TABLE xmlrpdb.atom_site COMPRESS YES;
REORG TABLE xmlrpdb.atom_site LONGLOBDATA RESETDICTIONARY;

The reduction in space consumption is summarized in Table 1. After compression, the information contained in the 6,029 PDBML documents can be stored in 67.4 percent fewer pages (i.e., three times less space than without compression).

Table 1. Space savings achieved by compression
Before compressionAfter compressionSavings
xmlrpdb.pdbxml176,256 pages44,736 pages74.6 percent
xmlrpdb.atom_site264,960 pages99,264 pages62.5 percent
Total441,216 pages144,000 pages67.4 percent

With a page size of 32 KB, the final storage of 144,000 pages is equivalent to 4.4 GB, which is only 5.3 percent of the original raw data volume of 83 GB. If we extrapolate this ratio to the total current size of the PDB archive, we find that the 0.75 TB of PDB information would be stored in DB2 using only approximately 40.7 GB of space, plus some space for indices.

This tremendous saving of storage stems from two factors. First, the high ratio of markup to value in the atom information is eliminated by converting the atom coordinates to relation format in the preprocessing step. Second, DB2 compression shrinks the remaining XML and relational data by a factor of 3.

Database partitioning

Despite the significant reduction in space consumption, the PDB data volume continues to grow fast. Also, the response time of complex analytical queries can be reduced by spreading the data across multiple database partitions, such that all partitions work on their assigned data in parallel. These database partitions can reside on the same machine to exploit all the CPU power of a multi-core system, or they can be spread across multiple machines in a shared-nothing configuration. The DB2 Database Partitioning Feature (DPF) is available through IBM InfoSphere® Warehouse, a software package that contains DB2 with advanced features, as well as additional design, reporting, and database management tools.

Using the DPF, we recommend to distribute the data in the pdbxml table and atom_site table across the database partitions by hashing on the values of the pdb_id column. This is achieved by adding the clause DISTRIBUTE BY HASH(pdb_id) to the respective CREATE TABLE statement. The large number of distinct values in the pdb_id column ensures a relatively even distribution of rows over the database partitions. Distributing both tables by hashing on their join key (pdb_id) also ensures that all atom rows for a given PDBML document are stored in the same database partition as the PDBML document itself. This collocation implies that joins between the two tables can always be evaluated within each of the database partitions and never require data to be shipped across partitions.

Range partitioning

Range partitioning (also known as table partitioning) enables you to partition the data in a table according to the value in a specified column, such that rows with the same value will reside in the same partition. The concept of range partitioning is orthogonal to the database partitioning. If database partitioning and range partitioning are used together, the rows in a table are first hashed across the database partitions and then range-partitioned within each database partition.

Range partitioning can serve multiple purposes. One purpose is easier roll-in and roll-out of new and old data, respectively. Another purpose is to improve performance based on partition elimination when the DB2 query optimizer determines that only a subset of the partitions need to be examined to answer as particular query. For the PDB, range partitioning was deployed to benefit from partition elimination, rather than to simplify roll-in and roll-out of data.

We decided to range-partition the atom_site table by the pdbx_PDB_model_num column for the following reason: Remember that the tertiary structure of a protein can be experimentally determined with a method called NMR, which produces multiple tertiary structures for the same protein. These variations are called models and are numbered by the field pdbx_PDB_model_num. A value of pdbx_PDB_model_num = 1 identifies the first (default) model of a protein. The additional variations are the non-default models of the same protein and have pdbx_PDB_model_num >= 2. Proteins that have been structurally determined by X-ray Diffraction have only one model with pdbx_PDB_model_num = 1.

Listing 6 shows the extended definition of the atom_site table with range partitioning. All atom coordinates that belong to the first model (pdbx_PDB_model_num = 1) are stored in one partition, whereas any variations (pdbx_PDB_model_num >= 2) are stored in another. Although only about 16 percent of all proteins currently in the PDB have variations produced by NMR, the number of their variations is so large that both partitions have roughly the same number of records.

Listing 6. Table definition with range partitioning
CREATE TABLE xmlrpdb.atom_site (
  pdb_id             CHAR(4) NOT NULL,
  ...
  pdbx_PDB_model_num SMALLINT NOT NULL,
  type_symbol           VARCHAR(10) WITH DEFAULT NULL,
  PRIMARY KEY (pdb_id, atom_site_id),
  FOREIGN KEY (pdb_id) REFERENCES xmlrpdb.pdbxml(pdb_id),
  CONSTRAINT group_chk CHECK (group_PDB in ('ATOM', 'HETATM'))
)
-- IN ts_atom_data_32k
INDEX IN ts_atom_index32k
PARTITION BY RANGE (pdbx_PDB_model_num)
(
  PARTITION DEF_MODELS      STARTING (1) ENDING (1) IN TS_ATOM_DATA1_32K,
  PARTITION NON_DEF_MODELS  STARTING (2) ENDING (MAXVALUE) IN TS_ATOM_DATA2_32K
);

We have chosen this range-partitioning scheme because many PDB queries typically differentiate between default and non-default protein models, and can therefore benefit from the partitioning. For example, a query that analyzes all or most of the default models only needs to scan the partition DEF_MODELS, which reduces the required I/O by half.

Multi-dimensional clustering

In addition to range-partitioning, multi-dimensional clustering (MDC) can be used to cluster the rows in a table based on one or more columns. Rows that have the same value in the clustering columns are physically stored together in the same storage cell. This can greatly improve the performance of queries that constrain and select data along one or multiple clustering dimensions. Like the DPF, MDC is also available through IBM InfoSphere Warehouse.

The choice of clustering columns needs to be based on the expected query workload so the clustering supports the most common and most critical queries. For example, many PDB queries might search the atom data based on the amino acid involved. Therefore, it can be beneficial to cluster the atom_site table based on the column label_comp_id, which, in most documents, contains the three-letter code for the amino acid. To achieve this clustering, add the following clause to the second CREATE TABLE statement in Listing 3: ORGANIZE BY DIMENSIONS(label_comp_id).

Another example is to cluster the atom_site table by the group_PDB column. We have evaluated this clustering for several sample queries that restrict their search to a single group_PDB value (i.e., "HETATOM") and found that it can improve query performance fourfold.

PDB queries and performance

In this section, we discuss two sample queries to demonstrate:

  • The ease with which even complex analysis of PDB data can be carried out.
  • The benefit of the database design decisions described in the previous sections.

The query in Listing 7 selects the PDB identifier, the resolution, and the description from all PDB entries where the experimental method is "X-RAY DIFFRACTION" and the resolution (ls_d_res_high) is less than 2.5. The resolution is expressed in Ångstrøm (1Å = 0.1 nanometer) and serves as a quality metric for the analysis of the atom structures. Structures with a resolution less than 2Å are high-resolution structures (i.e., the locations of their atoms could be determined very accurately). Structures with a resolution greater than 3Å are less accurate and usually ignored.

Listing 7. Query the top 10 records with the best X-ray resolution
SELECT pdb_id, x.resolution, x.pdbx_descriptor
FROM xmlrpdb.pdbxml,
  XMLTABLE
  ('$PDBXML_FILE/*:datablock/*:refineCategory/*:refine[
   @pdbx_refine_id = "X-RAY DIFFRACTION" and *:ls_d_res_high <= 2.5 ]'
  COLUMNS
    resolution      DEC(9,5)      PATH '*:ls_d_res_high',
    pdbx_descriptor VARCHAR(2000) PATH '../../*:structCategory/*:struct/*:pdbx_descriptor'
  ) AS x
-- WHERE
--   upper(x.pdbx_descriptor) LIKE '%UNKNOWN%' or
--   upper(x.pdbx_descriptor) LIKE '%UNCHARACTERIZED%'
ORDER BY x.resolution
FETCH FIRST 10 ROWS ONLY;

The result of this query is shown in Listing 8. One of the benefits of using DB2 pureXML as opposed to custom code is that it's easy to modify SQL/XML queries to refine the search. For example, Listing 7 contains three comment lines with an additional WHERE clause. They can be used to further filter the descriptor to find those structures that are not or could not be characterized yet.

Listing 8. Result produced by the query in Listing 7
PDB_ID RESOLUTION  PDBX_DESCRIPTOR 
------ ----------- -------------------------------------------------
2VB1       0.65000 LYSOZYME C (E.C.3.2.1.17)
2B97       0.75000 Hydrophobin II
2OV0       0.75000 PROTEIN
2I16       0.81000 Aldose reductase (E.C.1.1.1.21)
2I17       0.81000 Aldose reductase (E.C.1.1.1.21)
2HS1       0.84000 HIV-1 Protease V32I mutant (EC 3.4.23.16)
2F01       0.85000 Streptavidin               
2OL9       0.85000 SNQNNF peptide from human prion residues 170-175
2PF8       0.85000 Aldose reductase (E.C.1.1.1.21)
2P74       0.88000 Beta-lactamase CTX-M-9a (E.C.3.5.2.6)

  10 record(s) selected.

The predicates in the query in Listing 7 are weakly selective so that a full table scan of the pdbxml table is required. Table 2 summarizes how the performance of this table scan query has benefited from two of our design decisions: atom site separation and compression. In our environment, this table scan was I/O-bound. DB2 compression mitigated the I/O bottleneck and reduced the query elapsed time by more than 40 percent (from 244 to 128 seconds). Extracting the atom site data into a separate relational table greatly reduced the size of the pdbxml table, improving the query performance by almost 4 1/2 times, from 138 to 31 seconds.

Table 2. Response times (without indices) of the query in Listing 7
Atom site separationCompressionResponse time
NoNo244 seconds
NoYes138 seconds
YesYes31 seconds

Listing 9 shows another sample query, which determines how often different atoms — or ions — occur in different compounds. The WHERE clause restricts the search to so-called hetero atoms and only considers the first model of each protein.

Listing 9. Analysis of hetero atom occurrences
SELECT label_atom_id                   AS "Atom", 
       COALESCE(label_comp_id, 'none') AS "Compound", 
       COUNT(*)                        AS "Occurrences"
FROM xmlrpdb.atom_site
WHERE group_PDB = 'HETATM'
  AND pdbx_PDB_model_num = 1
GROUP BY label_atom_id, label_comp_id
ORDER BY COUNT(*),label_comp_id DESC;

A subset of the result rows is shown in Listing 10. The most frequently detected chemical compound is water (HOH) with oxygen (O) as one of its atoms. The reported number of hydrogens, denoted by H1 and H2 for HOH, is low because detecting hydrogens requires a very high resolution that is not always achieved.

(Human) hemoglobin is a protein consisting of multiple molecules, and such a molecule can interact with a non-protein compound called heme. A heme (HEM) is a multi-atom, non-proteinaceous organic structure capable of positioning an iron (FE) ion in its center. This iron ion, on its hand, is critical for oxygen binding. The result in Listing 10 shows that iron occurs frequently together with heme compounds. Although this is a simple example, it demonstrates how efficient it has become to detect meaningful correlations in the PDB data and to gain better understanding of how proteins function and interact on a molecular level.

Listing 10. Subset of the result produced by the query in Listing 9
Atom     Compound        Occurrences
-------- --------------  -------------------------
O        HOH             1571965
MG       MG              7159
...
H1       HOH             1858
H2       HOH             1858
ZN       ZN              1664
...
CL       CL              1318
CA       CA              1295
...
FE      HEM           379
NA       HEM             379

Table 3 shows how our database design choices for atom site separation, compression, range partitioning, and multi-dimensional clustering can provide excellent performance, even when no query-specific indexes exist.

Table 3. Response times (without indices) of the query in Listing 9
Atom site separationCompressionRange partitioningMDCResponse time
YesYesNoNo38.7 seconds
YesYesYesNo25.8 seconds
YesYesYesYes5.5 seconds

Summary

This article has described how to use pureXML and relational data management features in DB2 to efficiently store and query the Protein Data Bank (PDB). Based on the intrinsic characteristics of the protein data, we have designed an optimized hybrid database schema. For best performance and minimal space consumption, we recommend using database partitioning, range partitioning, compression, and multi-dimensional clustering. Additionally, a combination of XML indices and relational indexes can further improve query performance. The DB2-based PDB continues to be used for investigations, such as searching the entire PDB for certain protein interactions and to help explain unusual interactions on the structural level.

Acknowledgements

The development of the DB2-based PDB was done in the Structural Bioinformatics research group of Maria Teresa Pisabarro, Biotechnology Center, Technical University Dresden, Germany. The project was financed by a scholarship from the foundation of SAP co-founder Klaus Tschira. Also, thanks to Henrik Loeser for his help with the work described in this article and the Berlin Institute of Medical Systems Biology (BIMSB) of the Max Delbrück Center (MDC) for Molecular Medicine Berlin-Buch, Germany, for providing the production server.


Download

DescriptionNameSize
C++-preprocessor and a few DB2 sample queriesdb2pdb_download.zip965KB

Resources

Learn

Get products and technologies

Discuss

Comments

developerWorks: Sign in

Required fields are indicated with an asterisk (*).


Need an IBM ID?
Forgot your IBM ID?


Forgot your password?
Change your password

By clicking Submit, you agree to the developerWorks terms of use.

 


The first time you sign into developerWorks, a profile is created for you. Information in your profile (your name, country/region, and company name) is displayed to the public and will accompany any content you post, unless you opt to hide your company name. You may update your IBM account at any time.

All information submitted is secure.

Choose your display name



The first time you sign in to developerWorks, a profile is created for you, so you need to choose a display name. Your display name accompanies the content you post on developerWorks.

Please choose a display name between 3-31 characters. Your display name must be unique in the developerWorks community and should not be your email address for privacy reasons.

Required fields are indicated with an asterisk (*).

(Must be between 3 – 31 characters.)

By clicking Submit, you agree to the developerWorks terms of use.

 


All information submitted is secure.

Dig deeper into Information management on developerWorks


static.content.url=http://www.ibm.com/developerworks/js/artrating/
SITE_ID=1
Zone=Information Management, XML, Industries
ArticleID=755466
ArticleTitle=Managing the Protein Data Bank with DB2 pureXML
publish-date=09082011