Characteristics of RDF data
The RDF data model is schemaless. Unlike the relational model, in which each table has a fixed number of columns, an RDF data set does not have a fixed number of predicates. A particular RDF subject can have any number of predicates. Furthermore, an RDF data set can store data across any number of domains, which further increases the "schemalessness" of the model. Therefore, when you map RDF data to a relational schema, you must use a mechanism that supports the schemaless nature of RDF data.
The most common mechanism to handle the schemaless nature of RDF data when mapping it to a relational schema is a table with three columns, one each for subject, predicate, and object. With this method, each triple occupies its own row in the table, so that a variable number of predicates can be handled. However, this mapping does not scale well and has performance problems, because querying the data requires a large number of self-joins that cannot make effective use of relational indexes. For example, a simple query to retrieve two predicates of a single subject involves a self-join and a fetch of two rows. In comparison, a traditional relational model of the same data, where both predicates exist in a single row, does not require any join, and the data can be retrieved with a single fetch.
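The difference between the two layouts can be illustrated with a small sketch. It uses Python's built-in sqlite3 module rather than DB2, and the table names (triples, person), column names, and sample data are assumptions made only for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Layout 1: generic triple table, one row per (subject, predicate, object).
cur.execute("CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT)")
cur.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        ("ex:alice", "foaf:name", "Alice"),
        ("ex:alice", "foaf:mbox", "alice@example.org"),
        ("ex:bob", "foaf:name", "Bob"),
    ],
)

# Retrieving two predicates of one subject needs a self-join on subject.
triple_row = cur.execute(
    """
    SELECT n.object, m.object
    FROM triples AS n
    JOIN triples AS m ON m.subject = n.subject
    WHERE n.subject = ?
      AND n.predicate = 'foaf:name'
      AND m.predicate = 'foaf:mbox'
    """,
    ("ex:alice",),
).fetchone()

# Layout 2: conventional relational table, both values in a single row.
cur.execute("CREATE TABLE person (subject TEXT PRIMARY KEY, name TEXT, mbox TEXT)")
cur.execute("INSERT INTO person VALUES ('ex:alice', 'Alice', 'alice@example.org')")

# The same answer comes back from one row, with no join.
wide_row = cur.execute(
    "SELECT name, mbox FROM person WHERE subject = ?", ("ex:alice",)
).fetchone()

print(triple_row)
print(wide_row)
```

Each additional predicate requested from the triple table adds another self-join, whereas the single-row layout only adds another column to the select list.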
DB2 software removes the need for a large number of self-joins when you query RDF data. It does so by storing all predicates and objects about a subject in a single row, or in a minimal number of rows, in a table. Because a relational table must have a fixed number of columns (governed by page size and column length), the mechanism for handling a variable number of predicates depends on how predicates are assigned to columns in the table.
DB2 software uses two mechanisms to assign predicates to columns in a table:
- Hashing — To reduce hash collisions, a set of hash functions, rather than a single hash function, is used. Hashing is random, and, in spite of the use of multiple hash functions, collisions can still occur relatively easily. If collisions occur, a new row is created in the table. The hashing mechanism is used in the default store.
- Predicate correlation — If a representative sample of the RDF data is available, DB2 software calculates the correlation among the predicates of the various resource types in the RDF data set. The software uses this correlation to assign predicates to columns in the table. This process leads to better space utilization in the table with diminished chances of collision. Multiple correlation functions are used to further diminish chances of collision. The predicate correlation mechanism is used in the optimized store.
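Both mechanisms can be sketched in a few lines of Python. This is only an illustration of the ideas described above, not the functions DB2 software actually uses. The names NUM_COLUMNS, NUM_HASHES, candidate_columns, store_subject, and assign_by_cooccurrence are hypothetical. Seeded SHA-256 stands in for the unspecified set of hash functions. For predicate correlation, a greedy graph coloring over predicates that occur together on the same subject stands in for the unspecified correlation functions, and it ignores any limit on the number of columns.

```python
import hashlib
from itertools import combinations

NUM_COLUMNS = 4  # hypothetical number of predicate/object slots per row
NUM_HASHES = 2   # hypothetical number of hash functions tried per predicate


def candidate_columns(predicate):
    """Return the slots proposed for a predicate, one per hash function."""
    slots = []
    for seed in range(NUM_HASHES):
        digest = hashlib.sha256(f"{seed}:{predicate}".encode("utf-8")).digest()
        slots.append(int.from_bytes(digest[:4], "big") % NUM_COLUMNS)
    return slots


def store_subject(pairs):
    """Place the (predicate, object) pairs of one subject into rows.

    A pair goes into the first free candidate slot of an existing row.
    If all of its candidate slots are taken in every row (a collision),
    a new row is started for the same subject.
    """
    rows = []
    for predicate, obj in pairs:
        slots = candidate_columns(predicate)
        for row in rows:
            free = next((s for s in slots if row[s] is None), None)
            if free is not None:
                row[free] = (predicate, obj)
                break
        else:  # no existing row had a free candidate slot
            new_row = [None] * NUM_COLUMNS
            new_row[slots[0]] = (predicate, obj)
            rows.append(new_row)
    return rows


def assign_by_cooccurrence(sample):
    """Assign columns from a sample mapping each subject to its predicates.

    Predicates seen together on one subject must get different columns;
    predicates that never occur together may share a column.
    """
    conflicts = {}
    for predicates in sample.values():
        for p in predicates:
            conflicts.setdefault(p, set())
        for a, b in combinations(sorted(predicates), 2):
            conflicts[a].add(b)
            conflicts[b].add(a)
    columns = {}
    # Greedy coloring: handle the most constrained predicates first.
    for p in sorted(conflicts, key=lambda q: (-len(conflicts[q]), q)):
        used = {columns[q] for q in conflicts[p] if q in columns}
        col = 0
        while col in used:
            col += 1
        columns[p] = col
    return columns


if __name__ == "__main__":
    print(store_subject([("foaf:name", "Alice"), ("foaf:mbox", "alice@example.org")]))
    print(assign_by_cooccurrence({
        "ex:alice": {"foaf:name", "foaf:mbox"},
        "ex:doc1": {"dc:title", "dc:creator"},
    }))
```

In the hashing sketch, a subject with more predicates than NUM_COLUMNS always spills into a second row, and collisions can cause a spill even earlier. In the co-occurrence sketch, the person predicates and the document predicates never appear on the same subject, so they can share columns, which is the space saving that the sample data makes possible.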