The first part of this 2-part series discussed record keeping prior to the 21st century, and the effects of relational databases and the Web. Prior to the introduction of computing systems, business records were created and stored as-is in their original non-normalized format. Examples include stone tablets, tally sticks, or paper forms. With the introduction of computing systems, data normalization was invented to convert business records into a different representation that can conserve space and avoid update anomalies in databases. However, the normalized representation is very different from the original and difficult to understand.
The second part of this 2-part series examines the trends in data formats and business records in the 21st century. It starts with a case study to highlight the benefits of digitized business records without normalization, such as better performance and lower cost for application development and maintenance. Other emerging data representations such as JSON and RDF are also discussed, with focus on their pros and cons for relevant usage scenarios.
Today, many systems and applications create and represent business records as XML documents. For example, each order, each financial trade, each tax return, each insurance claim, and so on, is often a separate XML document. The main reasons are that XML is extensible, flexible, self-describing, and suitable for combining structured, unstructured, and semi-structured information. These properties make it easy to represent even complex records, to extend the data representation if needed, and to exchange business records between applications (A2A) or organizations (B2B). Hence, XML has also become the data format of choice for web services and Service-Oriented Architectures (SOA).
In response to this trend, the major database vendors have added XML capabilities to their products    and several XML-only databases have emerged . Such products allow XML to be stored, indexed, queried, and updated with many of the same properties as relational data, including ACID transactions, recoverability, scalability, high availability, and so on. These advances of database technology make it possible to store and manage business records in their original format – XML. This section compares XML to relational data in terms of (de)normalization, redundancy, and performance.
Since XML is a hierarchical data format, a single XML document can represent an arbitrary number of one-to-many relationships – without normalizing these relationships into a set of disjoint objects.
Consider a business record that describes a customer, and each customer can have multiple addresses, multiple phone numbers, multiple email addresses, multiple middle names, and multiple investment accounts. Furthermore, each of a customer's accounts can have multiple positions (holdings), each address can have multiple "street" entries, and so on. In a normalized relational schema, each of these one-to-many relationships requires an additional table, as shown in Figure 1. As result, storing a single business record requires having multiple SQL INSERT statements in several tables, in this example up to 12 INSERTs into 12 tables.
Figure 1. Example of a normalized relational schema
Similarly, when an application retrieves a single customer record, it needs to issue SELECT statements against all 12 tables. This is an inefficient process because it requires 12 or more separate API calls to the database. Each of these calls incurs network traffic, access to a database table, and possibly physical I/O to the table and its indexes. Note that a single query that joins all 12 tables is typically much worse. For example, if a customer has 3 accounts with a total of 7 positions (as shown in Tables 1 through 3), a join across these three tables would return 7 rows, as shown in Table 4. In this result set, the selected values from the customer table are repeated 7 times, and the values for every account are repeated for each of their positions. This redundancy quickly increases the size of the result set so that the latency of the client-server communication deteriorates performance. When more tables are added to the join, the result set grows quickly and amplifies the problem.
Tables 1 through 3 show normalized relational schema for customer, account, and position data.
Table 1. Customer table
Table 2. Account table
Table 3. Positions table
Table 4. Denormalized data, for example, in the result set of join query
Table 4 also illustrates how denormalization of a database schema can introduce data redundancy. For comparison, Listing 1 shows a possible XML representation of the same logical business record. While the normalized relational schema scatters the customer information over numerous tables, the tree structure of XML can represent the same one-to-many relationships in a single document. At the same time the document does not contain any redundancy, because a parent node (such as "Customer") is not repeated for each of its child nodes (such as "Account"). Thus, XML can represent a business record in a single non-normalized document without incurring redundancy. As a result, business records can be inserted or retrieved in a single SQL statement, which reduces the number of database API calls, application complexity, CPU cost, and network traffic. These advantages have been observed and measured in multiple commercial database systems, as we discuss in the next section.
Listing 1. XML Representation of the Customer, Account, and Position data
<Customer Cid="12345"> <FirstName>John</FirstName> <LastName>Doe</LastName> <DateOfBirth>1965-09-27</DateOfBirth> <Nationality>German</Nationality> <Accounts> <Account AccNo="985739476"> <Currency>Euro</Currency> <Balance>120000</Balance> <Positions> <Stock Sym="IBM" Shares="1,200"/> ... </Positions> </Account> <Account AccNo="985710938"> <Currency>Euro</Currency> <Balance>2786.23</Balance> <Positions> ... </Positions> </Account> <Account AccNo="985808142"> <Currency>USD</Currency> <Balance>523891</Balance> <Positions> ... </Positions> </Account> </Accounts> </Customer>
The benefits of inserting and retrieving a business record as a single non-normalized XML document (rather than normalized relational rows) is gaining attention among database practitioners. A major world-wide logistics corporation has put this value proposition to the test. One of their OLTP systems represents certain business records in a normalized relational schema that consists of 12 tables. To retrieve one business record, their application needs to issue 15 SQL statements. Most business records are represented by 15 rows, but some consist of hundreds of rows. For a comparison they chose 250,000 business records and converted them into an XML format, one document per business record. Most documents are 1KB to 2KB in size, but some up to 500KB.
The test scenario used DB2 9.5 Fix pack 4 with Linux RHEL 4.6 on an 8-core AMD Opteron server with 32GB of memory. Up to seven client machines were used to simulate an increasing number of concurrent users accessing the database. Each client was an IBM xSeries 345 (2-core) machine. Each of the 12 relational tables had appropriate indexes. For comparison, the XML documents were stored in a single table, in a single XML column, with a single XML index on the record ID attribute. In the relational test, each simulated concurrent user picked random record IDs and for each ID issued a transaction with 15 SQL queries to retrieve all information for that business record. No think time was used. In the XML test, each concurrent user picked random record IDs and issued a single SQL/XML query with an XPath expression in an XMLEXISTS predicate to retrieve an XML document.
The results are shown in Figure 2. On average the XML solution achieved 55% higher throughput then the normalized relational schema. In the relational test, the CPU capacity of the system under test was exhausted with 84 concurrent users performing about 3800 transactions per second. With the XML solution, the same system supported up to 150 concurrent users and almost 6000 transactions per second. At the same time the XML solution showed drastically lower CPU utilization on the client machines than the relational solution.
Figure 2. Retrieval performance: Non-normalized XML vs. normalized relational data
For the logistics company, the benefits of the non-normalized XML solution go further than higher performance and lower CPU consumption. For example, each XML document is immediately ready to be consumed by client applications, whereas the retrieved relational rows required construction of a business record – an additional cost that was not included in the measurements in Fig. 2. Also, the XML solution has a simpler database schema and is easier to maintain when the record format evolves over time. The business records have the same representation in the database and in the application (XML), which simplifies application development. This benefit has also been reported for XML-based forms processing, event logging, and order processing .
Accessing XML uses less of the CPU than the equivalent accesses to normalized relational tables representing the same business record. More data is transmitted between database and client as a result of a relational join to reconstruct the business record than transmitting the business record itself in XML. Moreover, the XML is easier to understand because it is closer, if not identical, to the real-world business record. A separate logical model for data is no longer needed. Mappings between real-world business records and what is stored inside the computer are not required.
Another trend is the emergence of Linked Data and the Semantic Web to connect related data. More specifically, Linked Data has been defined as "a recommended best practice for exposing, sharing, and connecting pieces of data, information, and knowledge on the Semantic Web using URIs and RDF" 61]. With RDF, resources are represented by statements in the form of subject-predicate-object expressions. These expressions are known as triples  and are shown in Listing 2 and Tables 5, 6, and 7.
Listing 2. XML
<DEPARTMENT deptid="15" deptname="Sales"> <EMPLOYEE> <EMPNO>10</EMPNO> <FIRSTNAME>CHRISTINE>/FIRSTNAME> <LASTNAME>SMITH</LASTNAME> <PHONE>408-463-4963</PHONE> <SALARY>52750.00</SALARY> </EMPLOYEE> <EMPLOYEE> <EMPNO>27</EMPNO> <FIRSTNAME>MICHAEL>/FIRSTNAME> <LASTNAME>THOMPSON</LASTNAME> <PHONE>408-463-1234</PHONE> <SALARY>41250.00</SALARY> </EMPLOYEE> </DEPARTMENT>
Table 5. Linked data (RDF triples)
Table 6. Normalized relational data (a)
Table 7. Normalized relational data (b)
Data normalization into triples is an essential part of the Semantic Web to build RDF stores from which powerful integrations can be derived. By breaking information into its smallest elements (triples), data can be represented as graphs with the subject and object as nodes, and the predicate as the linking arc. Inference software (reasoners) can analyze the graphs and apply rules to deduce new statements. For example, since departments have employees and human resources is a department, a new statement can be deduced that human resources has employees.
Linked data links typically connects the latest information only, as is the case with primary and foreign keys in relational tables. For example, an audit of an order that is constructed by linking or by navigating relational keys would pick up the current customer address and not the address of the customer at the time the order was placed. In business systems, it is often necessary for audit and compliance reasons to make a snapshot of the pertinent business information at the time an important event occurs, such as receiving an order or issuing a bank statement. The purpose of the snapshot is not to rely on navigation at a later date to re-construct the business record when an audit or query takes place. Sometimes the snapshot is signed to provide evidence that the business record has not been tampered with since its creation.
RDF, as an extreme form of normalization, is similar to normalized relational data in its difficulty for handling versioning and managing history and audit. RDF differs from normalized relational data in that it is not tied to a fixed schema. When a resource has new properties, additional statements (triples) are added to easily represent the properties. However, triples can be harder to query than traditional relational schemas.
The mechanisms of Linked Data and Semantic Web are powerful and can integrate seemingly unrelated systems. Examples include integrating hotel and airline reservations, order processing, or insurance renewal with social networking information. Augmenting business records (in XML) with semantic annotations (triples) provides a way of blending traditional business records with Linked Data and Semantic Web constructs. A real-world example is the US retailer Best Buy who annotates its XHTML pages that describe products, with RDFa (RDF annotations) that can be processed more programmatically, for example by search engines and classifiers .
It is more difficult to blend semantic annotations with relational databases because there are no mechanisms in place to augment relational data without changing the relational database schema, in contrast with business records expressed in XML where annotations are more straightforward to apply.
Table 8. JSON versus XML
Table 9. Comparison of relational data, XML, JSON, and XML
|Metadata||Data Definition Language (ISO)||XML Schema XSD (W3C),|
|JSON Schema (IETF)||RDFS (RDF Schema),|
Ontology (W3C and elsewhere)
|Constraints||Integrity constraints in table definitions (ISO)||Schematron (ISO),|
|Triggers||Relational triggers||-||-||RIF (W3C)|
|Data exchange serializations||SQL standard (ISO) defines an XML serialization but it is not widely used – There is no agreed JSON serialization. There are RDF serializations||XML is a syntax widely used for data exchange (W3C)||JSON is a serialized format, there are XML representations of JSON too||XML, Turtle|
|Annotations||Not part of the relational model||Many kinds of annotations are defined for XML schemas and for XML data||-||RDFa (W3C) can be used to annotate XML|
|Query & CRUD Languages||Data Manipulation Language (DML), SQL, SQL/XML (ISO)||XPath, XQuery (W3C) and others for CRUD||JAQL, JSONiq||SPARQL (W3C)|
|Query & CRUD APIs||Many for various programming languages, for example JDBC, ODBC||Various, includes some of the relational APIs and specific APIs, for example XQJ||-||SPARQL Graph Store HTTP Protocol – for CRUD (W3C)|
|Collection||Table, View, Database (ISO)||XML Collection Function (W3C)||-||RDF Graphs (W3C)|
|Transformation & other languages||SQL (Tables to Tables)||XSLT (XML to text, includes XML); XForms|
SQL XMLTABLE (XML to relational)
Data normalization, a process that starts early in most commercial projects, almost always precludes the storage of business records in their original form, such as XML. Moreover, it is often assumed that working with XML business records is inefficient. The performance comparison in this article has shown that this is not necessarily the case. In fact, the costs of retrieving and reconstructing business records from normalized tables is often higher than the cost of natively storing and retrieving non-normalized XML documents. Hence, where XML is used to represent business records, XML should also be considered as the corresponding storage format. Object-centric data access often performs better with XML than normalized relational tables, and allows for high XML transaction rates . Additionally, XML allows applications to be prototyped, developed, and evolved with agility as there is no need to design and maintain mappings between relational tables and XML.
Emerging storage systems such as Google Big Table and HBase also depart from normalization. They advocate strong denormalization and the storage of data according to how it is accessed. XML databases provide the same recommendation that business records are best stored in the granularity that matches the typical access patterns .
Linked Data, Semantic Web, and JSON are emerging data representations but are not yet pervasive in business systems for inter-system data-exchange. Data normalization into triples is an essential part of the Semantic Web to build RDF stores from which new facts can be derived through inferencing. Augmenting business records represented in XML, with semantic annotations such as RDFa, provides a way of blending traditional business records with Linked Data and the Semantic Web.
Despite these developments, normalization is not obsolete. Normalization is still useful, for example, for traditional relational transaction processing systems that support a high rate of updates, rarely have to reconstruct the original business records, and do not perform inserts of new record versions instead of traditional updates, as shown in Table 10.
Table 10. Suitability of normalized versus non-normalized storage
|Suitable for non-normalized data representation, for example, XML||Suitable for normalized or semi-denormalized data representation|
|1||"Object-centric" data access: all or most pieces of a business record are accessed together||Data access is set-oriented or column-oriented, for example for analytics|
|2||Intact business records are exchanged via web services and SOA||Original business records do not need to be reassembled|
|3||Versioning is required: data updates are replaced by inserts of immutable versions||Only the latest state of each business record needs to be retained|
|4||Schema evolution needs to be supported||The schema is mature, stable, and unlikely to evolve|
|5||Auditing and compliance of business records are critical||Audit/compliance requirements are short-term, weak, or absent|
The huge success of relational database systems, data normalization, and later the representation of business records in XML, all happened for compelling reasons at the time. In the 21st century it often makes sense to store and manipulate business records "as-is". To draw an analogy with the real world: humans acquire, manipulate, and store real things, such as clothes in closets, and cars in garages. Things are stored in the most appropriate form for the predominant type of accesses, that is, cars and clothes are stored without first dismantling them into pieces, as shown in Figure 3.
Figure 3. Intact vs "Normalized" storage of a car
This 2-part series has identified data normalization as a key barrier to storing business records intact and delivering agile data management solutions. Efficient XML databases are readily available , but are often overlooked because an early step in system design incorporates normalization, removing all traces of the original business records. Data Normalization should be reconsidered carefully.
1. Murthy, R. et al.: "Towards an enterprise XML architecture", SIGMOD 2005.
3. Nicola, M., van-der-Linden, B.: "Native Support XML in DB2 Universal Database", 31st International Conference on Very Large Databases, VLDB 2005.
4. Nicola, M.: "Lessons Learned from DB2 pureXML Applications: A Practitioner’s Perspective", 7th International XML Database Symposium, XSYM 2010.
5. Rys, M.: "XML and Relational Database Management Systems: Inside Microsoft SQL Server", SIGMOD 2005.
7. Carey, M. J. et al.: "EXRT: Towards a Simple Benchmark for XML Readiness Testing", 2nd Conference of the Transaction Processing Council, TPCTC 2010.
- Stay current with developerWorks technical events and webcasts focused on a variety
of IBM products and IT industry topics.
- Attend a free
developerWorks Live! briefing to get up-to-speed quickly on IBM
products and tools as well as IT industry trends.
- Follow developerWorks on
- Watch developerWorks on-demand demos ranging from product installation
and setup demos for beginners, to advanced functionality for experienced
- Learn more about using XML with IBM
databases at the Information
Management and XML technology page on developerWorks.
- Get the resources you need in the Information Management
area on developerWorks, to advance your skills on a wide variety
of IBM Information Management products.
Get products and technologies
products in the way that suits you best: Download a product trial,
try a product online, use a product in a cloud environment, or spend a few
hours in the SOA Sandbox learning how to implement Service Oriented
- Download a free
trial version of IBM DB2 for Linux, UNIX, and Windows.
- Now you can use
DB2 for free. Download DB2 Express-C, a no-charge version of DB2 Express Edition for the
community that offers the same core data features as DB2 Express Edition
and provides a solid base to build and deploy applications.
- Build your next
development project with IBM trial
software, available for download directly from developerWorks, or
spend a few hours in the SOA Sandbox learning how to
implement Service Oriented Architecture efficiently.
- Get involved in the My developerWorks
community. Connect with other developerWorks users while exploring
the developer-driven blogs, forums, groups, and wikis.
Susan Malaika works in IBM's Information Management Group. She specializes in XML and Web technologies, including Grid computing. She has published articles and co-edited a book on the Web. She is a member of the IBM Academy of Technology.
Matthias Nicola is a Senior Technical Staff Member at IBM's Silicon Valley Lab in San Jose, CA. He focuses on DB2 performance and benchmarking, XML, temporal data management, in-database analytics, and other emerging technologies. He also works closely with customers and business partners to help them design, optimize, and implement DB2 solutions. Previously, Matthias worked on data warehouse performance at Informix Software. He received his Ph.D. in computer science from the Technical University of Aachen, Germany.