The Protein Data Bank (PDB.org) is a worldwide archive of structural data about biological molecules, mostly proteins. The Protein Data Bank (PDB) is managed by several member organizations responsible for depositing, maintaining, processing, and freely providing this biological data to the scientific community. To provide flexibility, extensibility, and ease of data exchange, the PDB data is available in XML format. This XML format is defined by an XML Schema known as the Protein Data Bank Markup Language (PDBML).
The structural information includes the 3-D coordinates of the atoms of the molecule(s) a protein consists of. These atomic coordinates are also referred to as the 3-D structure or tertiary structure. The tertiary structure of a protein is closely coupled to its function. Thus, knowing the tertiary structure often helps in understanding the protein's intrinsic function. For example, the tertiary structure may be useful to explain diseases or develop new drugs. The tertiary structure can also be exploited to search the PDB for interactions between proteins.
As of December 2010, the Protein Data Bank repository held 70,000 entries (XML documents) that contain more than 500 million atom coordinates. The total uncompressed size is more than 750 GB. Individual XML documents in the PDB range from a few MB to more than 1 GB in size. Based on the rapid growth of the PDB repository in recent years (Figure 1), the size of the PDB is expected to continue to increase significantly. Consequently, searching and analyzing this information is becoming ever more challenging.
Figure 1. Growth of the PDB over the past 20 years
A typical approach to analyze PDB data is to write a custom application or a set of scripts that search the PDBML documents for the purpose of a very specific research question. The disadvantages of this approach include the facts that:
- Developing custom code each time new research is being conducted is very labor-intensive and time-consuming.
- The performance is often poor because all documents need to be parsed and searched, even if only a subset of them contain relevant information.
- It's often difficult to reuse or combine existing custom code to compose new or different queries against the PDB data.
DB2 V9.7.3 with pureXML was chosen to address these challenges, primarily because DB2 has the scalability and the XML capabilities required to process the expected volumes of PDBML documents. Additionally, DB2 is freely available for non-commercial usage via the IBM Academic Initiative. The goal was to store the PDB information in an efficient database schema, exploit relational and XML indexes for efficient search, and use XQuery and SQL/XML to express even complex queries against the PDB information.
The content of the Protein Data Bank
Before we discuss the DB2 database design for the PDB, it is helpful to understand the PDB data a little bit better.
The tertiary structure of a protein is experimentally determined (solved), predominantly by a method called X-ray Diffraction or X-ray Crystallography. Another less frequently used method is called Solution NMR (Nuclear Magnetic Resonance) or NMR Spectroscopy. The methods for determining (solving) the protein structure lead to differences in how a protein structure is described in the generated XML documents, which is particularly reflected in the XML file sizes.
Proteins are dynamic molecules, which means that their tertiary structures may vary slightly, for example depending on their environment. Due to these variations, NMR methodically determines multiple instances (models) that represent slightly shifted tertiary structures for the same protein. Consequently, XML files with protein data produced by NMR can be very large in size, such as 100 MB to 1 GB, or more. Also, you will see later in this article how and why we use DB2 range partitioning to separate the first (default) model of a protein from its variations.
Listing 1 shows an extract from one PDBML document.
You can see four of 177 categories of information that can appear in such a
document, including the authors of the study and the experimental method
(<PDBx:exptlCategory>) used. The attribute entry_id
represents the unique PDB identifier for this document.
Listing 1. Extract of a sample PDBML document (1BBZ.xml)
...
<PDBx:audit_authorCategory>
<PDBx:audit_author pdbx_ordinal="1">
<PDBx:name>Pisabarro, M.T.</PDBx:name>
</PDBx:audit_author>
...
</PDBx:audit_authorCategory>
...
<PDBx:structCategory>
<PDBx:struct entry_id="1BBZ">
<PDBx:pdbx_descriptor>ABL TYROSINE KINASE, PEPTIDE P41
</PDBx:pdbx_descriptor>
<PDBx:title>CRYSTAL STRUCTURE OF THE ABL-SH3 DOMAIN COMPLEXED WITH
A DESIGNED HIGH-AFFINITY PEPTIDE LIGAND: IMPLICATIONS FOR
SH3-LIGAND INTERACTIONS
</PDBx:title>
</PDBx:struct>
</PDBx:structCategory>
...
<PDBx:struct_keywordsCategory>
<PDBx:struct_keywords entry_id="1BBZ">
<PDBx:pdbx_keywords>COMPLEX(TRANSFERASE/PEPTIDE)
</PDBx:pdbx_keywords>
<PDBx:text>COMPLEX (TRANSFERASE-PEPTIDE), SIGNAL TRANSDUCTION,SH3 DOMAIN,
COMPLEX (TRANSFERASE-PEPTIDE) complex
</PDBx:text>
</PDBx:struct_keywords>
</PDBx:struct_keywordsCategory>
...
<PDBx:exptlCategory>
<PDBx:exptl entry_id="1BBZ" method="X-RAY DIFFRACTION">
<PDBx:crystals_number>1</PDBx:crystals_number>
</PDBx:exptl>
</PDBx:exptlCategory>
...
|
Due to time and resource constraints, we decided to use only a subset of the total available PDB data volume to prototype and evaluate the storage, indexing, and querying of PDBML documents in a DB2 database. Therefore, a representative sample of 6,029 documents was selected, which amounts to 83 GB and roughly 10 percent of the total volume of the PDBML archive as of December 2010. This set of documents contains approximately 1.7 billion XML elements, out of which approx. 1.54 billion elements describe tertiary protein structures through atom coordinates and other information.
A representative sample of PDBML documents must accurately reflect the ratio of documents with molecule information produced by X-ray Diffraction (smaller documents, 83 percent of all documents) vs. Solution NMR (larger documents, 16 percent of all documents). This ensures that the database configuration and queries are tested with a realistic mix of small and large documents.
The database server available for this study was a Sun X4600 M2 with eight dual-core processors (AMD Opteron 8220) and 256GB main memory. The operating system was Ubuntu 64-bit Linux®. The storage consisted of 10 hard drives (698 GB each; 7,200 rpm), organized as a single logical volume (RAID 5) using a hardware controller.
Database design recommendations for PDB
This section describes a set of database design recommendations that lead to simple and efficient database support for storing and analyzing PDB data. These recommendations address the database schema, the choice between XML and relational storage, definition of indexes, and physical data organization with partitioning and clustering options.
PDBML documents currently contain up to 177 categories of information, most of them optional. The large number of optional PDBML elements allow the documents to be very flexible and highly variable. A fully relational database schema would require hundreds of tables to represent PDBML. Such a relational database schema for the PDB was developed in 2005 and is shown in Figure 2. With more than 400 tables and more than 3,000 columns, the complexity of this schema is overwhelming. It is extremely difficult to understand and query such a database schema because a single PDB entry is broken up and scattered over hundreds of tables, making it hard for users to know which information resides in which table. Therefore, keeping most of the PDBML information in its original XML format and storing it in a single XML column results in a much simpler database design and retains the data in a format that users naturally understand.
Figure 2. Diagram of a fully relational database schema for PDBML
One notable exception to the high variability of the PDBML data is the atom coordinates and their related labels, which follow a flat and regular structure repeated for every atom in a molecule, as illustrated in Listing 2. Since proteins commonly consist of thousands or tens of thousands of atoms, atom coordinates often represent 90 percent or more of a PDBML document.
Listing 2. Atom coordinates in a PDBML document
<PDBx:atom_siteCategory>
<PDBx:atom_site id="1">
<PDBx:B_iso_or_equiv>37.41</PDBx:B_iso_or_equiv>
<PDBx:Cartn_x>1.039</PDBx:Cartn_x>
<PDBx:Cartn_y>16.834</PDBx:Cartn_y>
<PDBx:Cartn_z>18.876</PDBx:Cartn_z>
<PDBx:auth_asym_id>A</PDBx:auth_asym_id>
<PDBx:auth_atom_id>N</PDBx:auth_atom_id>
<PDBx:auth_comp_id>ASN</PDBx:auth_comp_id>
<PDBx:auth_seq_id>1</PDBx:auth_seq_id>
<PDBx:group_PDB>ATOM</PDBx:group_PDB>
<PDBx:label_alt_id xsi:nil="true" />
<PDBx:label_asym_id>A</PDBx:label_asym_id>
<PDBx:label_atom_id>N</PDBx:label_atom_id>
<PDBx:label_comp_id>ASN</PDBx:label_comp_id>
<PDBx:label_entity_id>1</PDBx:label_entity_id>
<PDBx:label_seq_id>1</PDBx:label_seq_id>
<PDBx:occupancy>1.00</PDBx:occupancy>
<PDBx:pdbx_PDB_model_num>1</PDBx:pdbx_PDB_model_num>
<PDBx:type_symbol>N</PDBx:type_symbol>
</PDBx:atom_site>
<PDBx:atom_site id="2">
<PDBx:B_iso_or_equiv>36.15</PDBx:B_iso_or_equiv>
<PDBx:Cartn_x>-0.213</PDBx:Cartn_x>
<PDBx:Cartn_y>16.205</PDBx:Cartn_y>
<PDBx:Cartn_z>18.364</PDBx:Cartn_z>
...
</PDBx:atom_site>
<PDBx:atom_site id="3">
<PDBx:B_iso_or_equiv>33.97</PDBx:B_iso_or_equiv>
<PDBx:Cartn_x>-0.549</PDBx:Cartn_x>
<PDBx:Cartn_y>16.779</PDBx:Cartn_y>
<PDBx:Cartn_z>16.986</PDBx:Cartn_z>
...
</PDBx:atom_site>
...
</PDBx:atom_siteCategory>
|
The flat and regular structure of the atom information makes a perfect fit
for traditional relational tables. In fact, the atom coordinates and labels are
non-hierarchical data for which XML is not the best choice. Therefore, we decide
on a hybrid database schema that stores the atom_site information in
a relational table and the remainder of each PDBML document in an XML column, but
with the <atom_siteCategory> removed from the
document. This has several advantages:
- The reduced PDBML documents are much smaller, which improves insert and load performance, as well as XML query performance. The XML parsing effort upon insert or load is reduced by approximately 90 percent.
- The atom information takes less space in relational columns than in their verbose XML representation.
- The atom data can be queried with traditional relational methods, which for non-hierarchical data is more efficient than XML navigation.
- Since each atom is represented in a separated row, indices can help speed up the search for specific atoms within a given PDBML entry.
The chosen database schema consists of two tables, shown in
Listing 3. The first (xmlrpdb.pdbxml) has one row for
each PDB entry. This table has only two columns:
- The primary key
pdb_idholds the four-character PDB entry identifier from the XML attributeentry_id. - The XML column
pdbxml_fileholds the entire PDBML document except the<atom_siteCategory>.
The second table (xmlrpdb.atom_site) contains one
relational row for each atom coordinate (i.e., for each
<atom_site> element in a PDBML document).
The column pdb_id is the foreign key that links
atom coordinates to the corresponding PDBML document in the
pdbxml table.
Both tables are stored in table spaces with a 32-KB page size to maximize the performance analytical queries that read large numbers of rows.
Listing 3. Hybrid XML/relational database schema for PDB in DB2
CREATE TABLE xmlrpdb.pdbxml (
pdb_id CHAR(4) NOT NULL,
pdbxml_file XML NOT NULL,
PRIMARY KEY (PDB_ID))
IN ts_data32k INDEX IN ts_index32k;
CREATE TABLE xmlrpdb.atom_site (
pdb_id CHAR(4) NOT NULL,
atom_site_id INTEGER NOT NULL,
auth_asym_id VARCHAR(10) WITH DEFAULT NULL,
auth_atom_id VARCHAR(20) NOT NULL,
auth_comp_id VARCHAR(3) NOT NULL,
auth_seq_id VARCHAR(20) NOT NULL,
b_iso_or_equiv DECIMAL(7,3) NOT NULL,
b_iso_or_equiv_esd DECIMAL(7,3) WITH DEFAULT NULL,
cartn_x DECIMAL(7,3) NOT NULL,
cartn_x_esd DECIMAL(7,3) WITH DEFAULT NULL,
cartn_y DECIMAL(7,3) NOT NULL,
cartn_y_esd DECIMAL(7,3) WITH DEFAULT NULL,
cartn_z DECIMAL(7,3) NOT NULL,
cartn_z_esd DECIMAL(7,3) WITH DEFAULT NULL,
group_pdb VARCHAR(10) NOT NULL,
label_alt_id VARCHAR(10) WITH DEFAULT NULL,
label_asym_id VARCHAR(10) WITH DEFAULT NULL,
label_atom_id VARCHAR(20) WITH DEFAULT NULL,
label_comp_id VARCHAR(10) NOT NULL,
label_entity_id SMALLINT NOT NULL,
label_seq_id SMALLINT WITH DEFAULT NULL,
occupancy DECIMAL(7,3) NOT NULL,
occupancy_esd DECIMAL(7,3) WITH DEFAULT NULL,
pdbx_pdb_atom_name VARCHAR(10) WITH DEFAULT NULL,
pdbx_pdb_ins_code VARCHAR(10) WITH DEFAULT NULL,
pdbx_PDB_model_num SMALLINT NOT NULL,
type_symbol VARCHAR(10) WITH DEFAULT NULL,
PRIMARY KEY (pdb_id, atom_site_id),
FOREIGN KEY (pdb_id) REFERENCES xmlrpdb.pdbxml(pdb_id),
CONSTRAINT group_chk CHECK (group_PDB in ('ATOM', 'HETATM'))
) IN ts_atom_data_32k INDEX IN ts_atom_index32k;
|
Optionally, CHECK constraints can be defined on the
pdbxml table to ensure that the
four-character PDB identifier conforms to the PDB standard. The
first character must be a number between 1 and 9, and the next three
characters must be a number between 0 and 9 or an uppercase character
between A and Z (see Listing 4).
Listing 4.
CHECK constraints to
enforce proper pdb_id values
ALTER TABLE xmlrpdb.pdbxml
ADD CHECK (SUBSTR(pdb_id, 1, 1) BETWEEN '1' AND '9')
ADD CHECK ((SUBSTR(pdb_id, 2, 1) BETWEEN '0' AND '9') OR
(SUBSTR(pdb_id, 2, 1) BETWEEN 'A' AND 'Z'))
ADD CHECK ((SUBSTR(pdb_id, 3, 1) BETWEEN '0' AND '9') OR
(SUBSTR(pdb_id, 3, 1) BETWEEN 'A' AND 'Z'))
ADD CHECK ((SUBSTR(pdb_id, 4, 1) BETWEEN '0' AND '9') OR
(SUBSTR(pdb_id, 4, 1) BETWEEN 'A' AND 'Z'));
|
Populating the hybrid database schema
The conceptual process of inserting a PDBML document into our hybrid database
schema is illustrated in Figure 3. The
<atom_siteCategory>
data needs to be extracted and removed from the XML document, and inserted into
the relational atom_site table (blue). The reduced
document itself is inserted into the pdbxml table.
We call this process atom site separation.
Figure 3. Hybrid storage of a PDBML document with separation of
atom_site data
Due to high data volume, the atom site separation (the population of the hybrid database schema) needs to have high performance. Hence, costly XML parsing should be reduced as much as possible. Revisiting the atom coordinates in XML format in Listing 2, we find that 94.5 percent of the characters are markup, and only 5.5 percent of the characters are actual values. Hence, the ratio of markup to value is extremely high, which means that a lot of XML parsing may be required to extract a comparatively small amount of usable data. You will understand shortly how this consideration has affected our decision of how to populate the two tables.
One option to populate the relational atom_site table
is to use INSERT statements with an XMLTABLE function. Such a statement can parse the
entire PDBML document and extract the atom information to insert
as relational rows. Additionally, an XQuery Update expression can delete the
<atom_siteCategory> subtree from the each PDBML
document inserted into the pdbxml table. Such
an XQuery Update expression can also be part of an INSERT statement so that the
<atom_siteCategory> is removed before
writing it into the XML column, rather than performing a separate step after the
insert.
Another option is to use a special-purpose preprocessor outside the database to extract the atom data into a relational flat file and remove it from each PDBML document. Such a preprocessor was implemented in C++, and it has the following benefits:
- It can add desirable annotations to the data, such as information from sequence and structure alignments or application-dependent geometric transformations like rotations or translations of atomic coordinates.
- It can be implemented without using a general-purpose XML parser. Instead, it is designed and optimized for the specific file structure of PDBML documents. It exploits special knowledge about the flat structure of the atom data, the existence of newline characters between elements, and other characteristics. As a result, the specialized preprocessor is at least 10 times faster than any solution with XML parsing.
Preprocessing the data set of 6,029 gzipped PDBML documents (i.e., 83 GB
unzipped) and loading the prepared data into the pdbxml
table and atom_site table took only 1 hour and 44 minutes.
The preprocessor is available for download (see Download).
Considering the data volume in the PDB archive as well as its rapid growth, it is useful to compress the data in DB2. This reduces the storage consumption and improves performance. Although compression and decompression in DB2 consumes some additional CPU cycles, compression also reduces the number of physical I/O operations necessary to read a certain amount of data from disk. Furthermore, compressed pages of a DB2 table space remain compressed in the DB2 buffer pool in main memory. As a result, compression allows more data to be in memory than without compression, which increases the buffer-pool hit ratio and makes higher utilization of the available memory. We found that the I/O and memory benefits of compression outweigh the additional CPU cost and lead to overall higher performance.
The following commands in Listing 5 were used to compress both tables.
Listing 5. Enabling compression and REORG tables
ALTER TABLE xmlrpdb.pdbxml COMPRESS YES; REORG TABLE xmlrpdb.pdbxml LONGLOBDATA RESETDICTIONARY; ALTER TABLE xmlrpdb.atom_site COMPRESS YES; REORG TABLE xmlrpdb.atom_site LONGLOBDATA RESETDICTIONARY; |
The reduction in space consumption is summarized in Table 1. After compression, the information contained in the 6,029 PDBML documents can be stored in 67.4 percent fewer pages (i.e., three times less space than without compression).
Table 1. Space savings achieved by compression
| Before compression | After compression | Savings | |
|---|---|---|---|
| xmlrpdb.pdbxml | 176,256 pages | 44,736 pages | 74.6 percent |
| xmlrpdb.atom_site | 264,960 pages | 99,264 pages | 62.5 percent |
| Total | 441,216 pages | 144,000 pages | 67.4 percent |
With a page size of 32 KB, the final storage of 144,000 pages is equivalent to 4.4 GB, which is only 5.3 percent of the original raw data volume of 83 GB. If we extrapolate this ratio to the total current size of the PDB archive, we find that the 0.75 TB of PDB information would be stored in DB2 using only approximately 40.7 GB of space, plus some space for indices.
This tremendous saving of storage stems from two factors. First, the high ratio of markup to value in the atom information is eliminated by converting the atom coordinates to relation format in the preprocessing step. Second, DB2 compression shrinks the remaining XML and relational data by a factor of 3.
Despite the significant reduction in space consumption, the PDB data volume continues to grow fast. Also, the response time of complex analytical queries can be reduced by spreading the data across multiple database partitions, such that all partitions work on their assigned data in parallel. These database partitions can reside on the same machine to exploit all the CPU power of a multi-core system, or they can be spread across multiple machines in a shared-nothing configuration. The DB2 Database Partitioning Feature (DPF) is available through IBM InfoSphere® Warehouse, a software package that contains DB2 with advanced features, as well as additional design, reporting, and database management tools.
Using the DPF, we recommend to distribute the data in the
pdbxml table and
atom_site table across the database partitions
by hashing on the values of the pdb_id column.
This is achieved by adding the clause DISTRIBUTE BY
HASH(pdb_id) to the respective CREATE TABLE statement. The large number of distinct
values in the pdb_id column ensures a relatively
even distribution of rows over the database partitions. Distributing both
tables by hashing on their join key (pdb_id) also
ensures that all atom rows for a given PDBML document are stored in the same
database partition as the PDBML document itself. This collocation implies that
joins between the two tables can always be evaluated within each of the database
partitions and never require data to be shipped across partitions.
Range partitioning (also known as table partitioning) enables you to partition the data in a table according to the value in a specified column, such that rows with the same value will reside in the same partition. The concept of range partitioning is orthogonal to the database partitioning. If database partitioning and range partitioning are used together, the rows in a table are first hashed across the database partitions and then range-partitioned within each database partition.
Range partitioning can serve multiple purposes. One purpose is easier roll-in and roll-out of new and old data, respectively. Another purpose is to improve performance based on partition elimination when the DB2 query optimizer determines that only a subset of the partitions need to be examined to answer as particular query. For the PDB, range partitioning was deployed to benefit from partition elimination, rather than to simplify roll-in and roll-out of data.
We decided to range-partition the atom_site
table by the pdbx_PDB_model_num column for the
following reason: Remember that the tertiary structure of a protein can be
experimentally determined with a method called NMR, which produces
multiple tertiary structures for the same protein. These variations are
called models and are numbered by the field
pdbx_PDB_model_num. A value of
pdbx_PDB_model_num = 1 identifies the first
(default) model of a protein. The additional variations are
the non-default models of the same protein and have
pdbx_PDB_model_num >= 2. Proteins that have
been structurally determined by X-ray Diffraction have only one model with
pdbx_PDB_model_num = 1.
Listing 6 shows the extended definition of the
atom_site table with range partitioning. All
atom coordinates that belong to the first model
(pdbx_PDB_model_num = 1) are stored in one partition,
whereas any variations (pdbx_PDB_model_num >= 2)
are stored in another. Although only about 16 percent of all proteins currently in the
PDB have variations produced by NMR, the number of their variations is so
large that both partitions have roughly the same number of records.
Listing 6. Table definition with range partitioning
CREATE TABLE xmlrpdb.atom_site (
pdb_id CHAR(4) NOT NULL,
...
pdbx_PDB_model_num SMALLINT NOT NULL,
type_symbol VARCHAR(10) WITH DEFAULT NULL,
PRIMARY KEY (pdb_id, atom_site_id),
FOREIGN KEY (pdb_id) REFERENCES xmlrpdb.pdbxml(pdb_id),
CONSTRAINT group_chk CHECK (group_PDB in ('ATOM', 'HETATM'))
)
-- IN ts_atom_data_32k
INDEX IN ts_atom_index32k
PARTITION BY RANGE (pdbx_PDB_model_num)
(
PARTITION DEF_MODELS STARTING (1) ENDING (1) IN TS_ATOM_DATA1_32K,
PARTITION NON_DEF_MODELS STARTING (2) ENDING (MAXVALUE) IN TS_ATOM_DATA2_32K
);
|
We have chosen this range-partitioning scheme because many PDB queries
typically differentiate between default and non-default protein models, and
can therefore benefit from the partitioning. For example, a query that
analyzes all or most of the default models only needs to scan the partition
DEF_MODELS, which reduces the required I/O by half.
In addition to range-partitioning, multi-dimensional clustering (MDC) can be used to cluster the rows in a table based on one or more columns. Rows that have the same value in the clustering columns are physically stored together in the same storage cell. This can greatly improve the performance of queries that constrain and select data along one or multiple clustering dimensions. Like the DPF, MDC is also available through IBM InfoSphere Warehouse.
The choice of clustering columns needs to be based on the expected query
workload so the clustering supports the most common and most critical
queries. For example, many PDB queries might search the atom data based on
the amino acid involved. Therefore, it can be beneficial to cluster the
atom_site table based on the column
label_comp_id, which, in most documents, contains
the three-letter code for the amino acid. To achieve this clustering, add
the following clause to the second CREATE TABLE statement in
Listing 3: ORGANIZE BY
DIMENSIONS(label_comp_id).
Another example is to cluster the atom_site table
by the group_PDB column. We have evaluated this
clustering for several sample queries that restrict their search to a single
group_PDB value (i.e.,
"HETATOM") and found that it can
improve query performance fourfold.
In this section, we discuss two sample queries to demonstrate:
- The ease with which even complex analysis of PDB data can be carried out.
- The benefit of the database design decisions described in the previous sections.
The query in Listing 7 selects the PDB identifier, the
resolution, and the description from all PDB entries where the experimental method
is "X-RAY DIFFRACTION" and the resolution
(ls_d_res_high) is less than 2.5. The resolution
is expressed in Ångstrøm (1Å = 0.1 nanometer) and serves
as a quality metric for the analysis of the atom structures. Structures with
a resolution less than 2Å are high-resolution structures (i.e., the
locations of their atoms could be determined very accurately). Structures with
a resolution greater than 3Å are less accurate and usually ignored.
Listing 7. Query the top 10 records with the best X-ray resolution
SELECT pdb_id, x.resolution, x.pdbx_descriptor
FROM xmlrpdb.pdbxml,
XMLTABLE
('$PDBXML_FILE/*:datablock/*:refineCategory/*:refine[
@pdbx_refine_id = "X-RAY DIFFRACTION" and *:ls_d_res_high <= 2.5 ]'
COLUMNS
resolution DEC(9,5) PATH '*:ls_d_res_high',
pdbx_descriptor VARCHAR(2000) PATH '../../*:structCategory/*:struct/*:pdbx_descriptor'
) AS x
-- WHERE
-- upper(x.pdbx_descriptor) LIKE '%UNKNOWN%' or
-- upper(x.pdbx_descriptor) LIKE '%UNCHARACTERIZED%'
ORDER BY x.resolution
FETCH FIRST 10 ROWS ONLY;
|
The result of this query is shown in Listing 8.
One of the benefits of using DB2 pureXML as opposed to custom code is that
it's easy to modify SQL/XML queries to refine the search. For example,
Listing 7 contains three comment lines with an additional WHERE
clause. They can be used to further filter the descriptor to find those
structures that are not or could not be characterized yet.
Listing 8. Result produced by the query in Listing 7
PDB_ID RESOLUTION PDBX_DESCRIPTOR ------ ----------- ------------------------------------------------- 2VB1 0.65000 LYSOZYME C (E.C.3.2.1.17) 2B97 0.75000 Hydrophobin II 2OV0 0.75000 PROTEIN 2I16 0.81000 Aldose reductase (E.C.1.1.1.21) 2I17 0.81000 Aldose reductase (E.C.1.1.1.21) 2HS1 0.84000 HIV-1 Protease V32I mutant (EC 3.4.23.16) 2F01 0.85000 Streptavidin 2OL9 0.85000 SNQNNF peptide from human prion residues 170-175 2PF8 0.85000 Aldose reductase (E.C.1.1.1.21) 2P74 0.88000 Beta-lactamase CTX-M-9a (E.C.3.5.2.6) 10 record(s) selected. |
The predicates in the query in Listing 7 are weakly
selective so that a full table scan of the pdbxml
table is required. Table 2
summarizes how the performance of this table scan query has benefited from two
of our design decisions: atom site separation and compression. In our environment,
this table scan was I/O-bound. DB2 compression mitigated the I/O bottleneck and
reduced the query elapsed time by more than 40 percent (from 244 to 128 seconds). Extracting
the atom site data into a separate relational table greatly reduced the size of
the pdbxml table, improving the query performance by
almost 4 1/2 times, from 138 to 31 seconds.
Table 2. Response times (without indices) of the query in Listing 7
| Atom site separation | Compression | Response time |
|---|---|---|
| No | No | 244 seconds |
| No | Yes | 138 seconds |
| Yes | Yes | 31 seconds |
Listing 9 shows another sample query, which determines how often different
atoms — or ions — occur in different compounds. The
WHERE clause restricts
the search to so-called hetero atoms and only considers
the first model of each protein.
Listing 9. Analysis of hetero atom occurrences
SELECT label_atom_id AS "Atom",
COALESCE(label_comp_id, 'none') AS "Compound",
COUNT(*) AS "Occurrences"
FROM xmlrpdb.atom_site
WHERE group_PDB = 'HETATM'
AND pdbx_PDB_model_num = 1
GROUP BY label_atom_id, label_comp_id
ORDER BY COUNT(*),label_comp_id DESC;
|
A subset of the result rows is shown in Listing 10. The most frequently detected chemical compound is water (HOH) with oxygen (O) as one of its atoms. The reported number of hydrogens, denoted by H1 and H2 for HOH, is low because detecting hydrogens requires a very high resolution that is not always achieved.
(Human) hemoglobin is a protein consisting of multiple molecules, and such a molecule can interact with a non-protein compound called heme. A heme (HEM) is a multi-atom, non-proteinaceous organic structure capable of positioning an iron (FE) ion in its center. This iron ion, on its hand, is critical for oxygen binding. The result in Listing 10 shows that iron occurs frequently together with heme compounds. Although this is a simple example, it demonstrates how efficient it has become to detect meaningful correlations in the PDB data and to gain better understanding of how proteins function and interact on a molecular level.
Listing 10. Subset of the result produced by the query in Listing 9
Atom Compound Occurrences -------- -------------- ------------------------- O HOH 1571965 MG MG 7159 ... H1 HOH 1858 H2 HOH 1858 ZN ZN 1664 ... CL CL 1318 CA CA 1295 ... FE HEM 379 NA HEM 379 |
Table 3 shows how our database design choices for atom site separation, compression, range partitioning, and multi-dimensional clustering can provide excellent performance, even when no query-specific indexes exist.
Table 3. Response times (without indices) of the query in Listing 9
| Atom site separation | Compression | Range partitioning | MDC | Response time |
|---|---|---|---|---|
| Yes | Yes | No | No | 38.7 seconds |
| Yes | Yes | Yes | No | 25.8 seconds |
| Yes | Yes | Yes | Yes | 5.5 seconds |
This article has described how to use pureXML and relational data management features in DB2 to efficiently store and query the Protein Data Bank (PDB). Based on the intrinsic characteristics of the protein data, we have designed an optimized hybrid database schema. For best performance and minimal space consumption, we recommend using database partitioning, range partitioning, compression, and multi-dimensional clustering. Additionally, a combination of XML indices and relational indexes can further improve query performance. The DB2-based PDB continues to be used for investigations, such as searching the entire PDB for certain protein interactions and to help explain unusual interactions on the structural level.
The development of the DB2-based PDB was done in the Structural Bioinformatics research group of Maria Teresa Pisabarro, Biotechnology Center, Technical University Dresden, Germany. The project was financed by a scholarship from the foundation of SAP co-founder Klaus Tschira. Also, thanks to Henrik Loeser for his help with the work described in this article and the Berlin Institute of Medical Systems Biology (BIMSB) of the Max Delbrück Center (MDC) for Molecular Medicine Berlin-Buch, Germany, for providing the production server.
| Description | Name | Size | Download method |
|---|---|---|---|
| C++-preprocessor and a few DB2 sample queries | db2pdb_download.zip | 965KB | HTTP |
Information about download methods
Learn
- Visit PDB.org
for more information about the Protein Data Bank and the XML format PDBML.
- Learn more about the
XML format PDBML.
- For an introduction to PDB data, read
"Understanding
PDB Data: Looking at Structures."
- Get the full XML document from which extracts are shown in Listings 1 and 2.
- Gain comprehensive knowledge of DB2 pureXML with the
DB2 pureXML Cookbook.
- Read more about compression, partitioning, and clustering of XML data in the article
"Enhance business insight and scalability of XML data with new DB2 9.7 pureXML
features."
- For more DB2 pureXML resources, explore the
DB2 pureXML
enablement wiki.
-
Read the Native XML Database blog to stay
up to date on latest news and XML tips and tricks.
-
Learn about the IBM
Academic Initiative.
- Learn more about Information Management at the developerWorks Information Management
zone. Find technical documentation,
how-to articles, education, downloads, product information, and
more.
- Stay current with
developerWorks technical events and webcasts.
- Follow developerWorks on
Twitter.
Get products and technologies
-
Download DB2
Express-C for free.
- Build your next
development project with
IBM trial software,
available for download directly from developerWorks.
Discuss
- Participate in the discussion forum.
- Check out the
developerWorks
blogs and get involved in the
developerWorks community.

Gerd Anders holds a degree in computer science from the Humboldt University Berlin. He focuses on bio-informatics and the use of database systems to efficiently manage and process very large amounts of data in and for the life sciences. The DB2-based PDB is an essential project of his Ph.D. thesis, providing a valuable basis for several DB2-driven applications and exploiting the entire potential of the PDB. In addition, he was also a tester of the closed-beta version of DB2 9.5 for Linux, UNIX, and Windows.
Matthias Nicola is a Senior Technical Staff Member at IBM's Silicon Valley Lab in San Jose, CA. He focuses on DB2 performance and benchmarking, XML, temporal data management, in-database analytics, and other emerging technologies. He also works closely with customers and business partners to help them design, optimize, and implement DB2 solutions. Previously, Matthias worked on data warehouse performance at Informix Software. He received his Ph.D. in computer science from the Technical University of Aachen, Germany.




