Data normalization reconsidered, Part 2: Business records in the 21st century

An examination of record keeping in computer systems

Relational databases have been fundamental to business systems for more than 25 years. Data normalization is a methodology that minimizes data duplication to safeguard databases against logical and structural problems, such as data anomalies. Relational database normalization continues to be taught in universities and practiced widely. Normalization was devised in the 1970s when the assumptions about computer systems were different from what they are today.

The first part of this 2-part series provided a historical review of record keeping and examined the problems associated with data normalization, such as the difficulty of mapping evolving business records to a normalized format. Since the Internet has lead to a widespread creation of business records in digital format, such as XML, it has become possible to store records in computer systems in their original format.

The second part of the series discusses alternative data representations like XML, JSON, and RDF to overcome normalization issues or to introduce schema flexibility. In the 21st century digitized business records are often created in XML to begin with. This paper compares XML to normalized relational structures and explains when and why XML enables easier and faster data access. After a discussion of JSON and RDF it concludes with a summary and suggestions for reconsidering normalization.

Share:

Susan Malaika (malaika@us.ibm.com), Senior Technical Staff Member, IBM

Susan Malaika photoSusan Malaika works in IBM's Information Management Group. She specializes in XML and Web technologies, including Grid computing. She has published articles and co-edited a book on the Web. She is a member of the IBM Academy of Technology.



Matthias Nicola (mnicola@us.ibm.com ), Senior Technical Staff Member, IBM

Author photo: Matthias NicolaMatthias Nicola is a Senior Technical Staff Member at IBM's Silicon Valley Lab in San Jose, CA. He focuses on DB2 performance and benchmarking, XML, temporal data management, in-database analytics, and other emerging technologies. He also works closely with customers and business partners to help them design, optimize, and implement DB2 solutions. Previously, Matthias worked on data warehouse performance at Informix Software. He received his Ph.D. in computer science from the Technical University of Aachen, Germany.



12 January 2012

Also available in Chinese Russian Vietnamese

Introduction

The first part of this 2-part series discussed record keeping prior to the 21st century, and the effects of relational databases and the Web. Prior to the introduction of computing systems, business records were created and stored as-is in their original non-normalized format. Examples include stone tablets, tally sticks, or paper forms. With the introduction of computing systems, data normalization was invented to convert business records into a different representation that can conserve space and avoid update anomalies in databases. However, the normalized representation is very different from the original and difficult to understand.

The second part of this 2-part series examines the trends in data formats and business records in the 21st century. It starts with a case study to highlight the benefits of digitized business records without normalization, such as better performance and lower cost for application development and maintenance. Other emerging data representations such as JSON and RDF are also discussed, with focus on their pros and cons for relevant usage scenarios.

Relational and XML data representation: A comparison

Today, many systems and applications create and represent business records as XML documents. For example, each order, each financial trade, each tax return, each insurance claim, and so on, is often a separate XML document. The main reasons are that XML is extensible, flexible, self-describing, and suitable for combining structured, unstructured, and semi-structured information. These properties make it easy to represent even complex records, to extend the data representation if needed, and to exchange business records between applications (A2A) or organizations (B2B). Hence, XML has also become the data format of choice for web services and Service-Oriented Architectures (SOA).

In response to this trend, the major database vendors have added XML capabilities to their products [1] [3] [5] and several XML-only databases have emerged [6]. Such products allow XML to be stored, indexed, queried, and updated with many of the same properties as relational data, including ACID transactions, recoverability, scalability, high availability, and so on. These advances of database technology make it possible to store and manage business records in their original format – XML. This section compares XML to relational data in terms of (de)normalization, redundancy, and performance.

An example

Since XML is a hierarchical data format, a single XML document can represent an arbitrary number of one-to-many relationships – without normalizing these relationships into a set of disjoint objects.

Consider a business record that describes a customer, and each customer can have multiple addresses, multiple phone numbers, multiple email addresses, multiple middle names, and multiple investment accounts. Furthermore, each of a customer's accounts can have multiple positions (holdings), each address can have multiple "street" entries, and so on. In a normalized relational schema, each of these one-to-many relationships requires an additional table, as shown in Figure 1. As result, storing a single business record requires having multiple SQL INSERT statements in several tables, in this example up to 12 INSERTs into 12 tables.

Figure 1. Example of a normalized relational schema
This diagram shows a normalized customer record

Similarly, when an application retrieves a single customer record, it needs to issue SELECT statements against all 12 tables. This is an inefficient process because it requires 12 or more separate API calls to the database. Each of these calls incurs network traffic, access to a database table, and possibly physical I/O to the table and its indexes. Note that a single query that joins all 12 tables is typically much worse. For example, if a customer has 3 accounts with a total of 7 positions (as shown in Tables 1 through 3), a join across these three tables would return 7 rows, as shown in Table 4. In this result set, the selected values from the customer table are repeated 7 times, and the values for every account are repeated for each of their positions. This redundancy quickly increases the size of the result set so that the latency of the client-server communication deteriorates performance. When more tables are added to the join, the result set grows quickly and amplifies the problem.

Tables 1 through 3 show normalized relational schema for customer, account, and position data.

Table 1. Customer table
CidFirstNameLastNameDateOfBirthNationality
12345JohnDoe1965-09-27German
Table 2. Account table
CidAccNoCurrencyBalance
12345985739476Euro120,000
12345985710938Euro2786.23
12345985808142USD523,891
Table 3. Positions table
AccNoSymbolShares
985739476IBM1,200
985739476ORCL2,500
985739476VFINX550
985710938SBTYA12,000
985710938VIVAX75
985808142DWCT1,128
985808142PMCOA8,461
Table 4. Denormalized data, for example, in the result set of join query
CidFirstNameLastNameDateOfBirthNationalityAccNoCurrencyAccBalanceSymbolShares
12345JohnDoe1965-09-27German985739476Euro120,000IBM1,200
12345JohnDoe1965-09-27German985739476Euro120,000ORCL2,500
12345JohnDoe1965-09-27German985739476Euro120,000VFINX550
12345JohnDoe1965-09-27German985710938Euro2786.23SBTYA12,000
12345JohnDoe1965-09-27German985710938Euro2786.23VIVAX75
12345JohnDoe1965-09-27German985808142USD523,891DWCT1,128
12345JohnDoe1965-09-27German985808142USD523,891PMCOA8,461

Table 4 also illustrates how denormalization of a database schema can introduce data redundancy. For comparison, Listing 1 shows a possible XML representation of the same logical business record. While the normalized relational schema scatters the customer information over numerous tables, the tree structure of XML can represent the same one-to-many relationships in a single document. At the same time the document does not contain any redundancy, because a parent node (such as "Customer") is not repeated for each of its child nodes (such as "Account"). Thus, XML can represent a business record in a single non-normalized document without incurring redundancy. As a result, business records can be inserted or retrieved in a single SQL statement, which reduces the number of database API calls, application complexity, CPU cost, and network traffic. These advantages have been observed and measured in multiple commercial database systems, as we discuss in the next section.

Listing 1. XML Representation of the Customer, Account, and Position data
                <Customer Cid="12345">
                    <FirstName>John</FirstName>
                    <LastName>Doe</LastName>
                    <DateOfBirth>1965-09-27</DateOfBirth>
                    <Nationality>German</Nationality>
                    <Accounts>
                        <Account AccNo="985739476">
                            <Currency>Euro</Currency>
                            <Balance>120000</Balance>
                            <Positions>
                                <Stock Sym="IBM" Shares="1,200"/>
                                ...
                            </Positions>
                        </Account>
                        <Account AccNo="985710938">
                            <Currency>Euro</Currency>
                            <Balance>2786.23</Balance>
                            <Positions> ... </Positions>
                        </Account>
                        <Account AccNo="985808142">
                            <Currency>USD</Currency>
                            <Balance>523891</Balance>
                            <Positions> ... </Positions>
                        </Account>
                    </Accounts>
                </Customer>

Comparison of XML and relational performance in an OLTP system

The benefits of inserting and retrieving a business record as a single non-normalized XML document (rather than normalized relational rows) is gaining attention among database practitioners. A major world-wide logistics corporation has put this value proposition to the test. One of their OLTP systems represents certain business records in a normalized relational schema that consists of 12 tables. To retrieve one business record, their application needs to issue 15 SQL statements. Most business records are represented by 15 rows, but some consist of hundreds of rows. For a comparison they chose 250,000 business records and converted them into an XML format, one document per business record. Most documents are 1KB to 2KB in size, but some up to 500KB.

The test scenario used DB2 9.5 Fix pack 4 with Linux RHEL 4.6 on an 8-core AMD Opteron server with 32GB of memory. Up to seven client machines were used to simulate an increasing number of concurrent users accessing the database. Each client was an IBM xSeries 345 (2-core) machine. Each of the 12 relational tables had appropriate indexes. For comparison, the XML documents were stored in a single table, in a single XML column, with a single XML index on the record ID attribute. In the relational test, each simulated concurrent user picked random record IDs and for each ID issued a transaction with 15 SQL queries to retrieve all information for that business record. No think time was used. In the XML test, each concurrent user picked random record IDs and issued a single SQL/XML query with an XPath expression in an XMLEXISTS predicate to retrieve an XML document.

The results are shown in Figure 2. On average the XML solution achieved 55% higher throughput then the normalized relational schema. In the relational test, the CPU capacity of the system under test was exhausted with 84 concurrent users performing about 3800 transactions per second. With the XML solution, the same system supported up to 150 concurrent users and almost 6000 transactions per second. At the same time the XML solution showed drastically lower CPU utilization on the client machines than the relational solution.

Figure 2. Retrieval performance: Non-normalized XML vs. normalized relational data
This figure shows a graph that shows higher performance for XML column

For the logistics company, the benefits of the non-normalized XML solution go further than higher performance and lower CPU consumption. For example, each XML document is immediately ready to be consumed by client applications, whereas the retrieved relational rows required construction of a business record – an additional cost that was not included in the measurements in Fig. 2. Also, the XML solution has a simpler database schema and is easier to maintain when the record format evolves over time. The business records have the same representation in the database and in the application (XML), which simplifies application development. This benefit has also been reported for XML-based forms processing, event logging, and order processing [‎‎4].

Accessing XML uses less of the CPU than the equivalent accesses to normalized relational tables representing the same business record. More data is transmitted between database and client as a result of a relational join to reconstruct the business record than transmitting the business record itself in XML. Moreover, the XML is easier to understand because it is closer, if not identical, to the real-world business record. A separate logical model for data is no longer needed. Mappings between real-world business records and what is stored inside the computer are not required.

The semantic web and RDF (Resource Definition Framework)

Another trend is the emergence of Linked Data and the Semantic Web to connect related data. More specifically, Linked Data has been defined as "a recommended best practice for exposing, sharing, and connecting pieces of data, information, and knowledge on the Semantic Web using URIs and RDF" ‎61]. With RDF, resources are represented by statements in the form of subject-predicate-object expressions. These expressions are known as triples [‎8] and are shown in Listing 2 and Tables 5, 6, and 7.

Listing 2. XML
<DEPARTMENT deptid="15" deptname="Sales">
   <EMPLOYEE>
     <EMPNO>10</EMPNO>
     <FIRSTNAME>CHRISTINE>/FIRSTNAME>
     <LASTNAME>SMITH</LASTNAME>
     <PHONE>408-463-4963</PHONE>
     <SALARY>52750.00</SALARY>
   </EMPLOYEE>
   <EMPLOYEE>
   <EMPNO>27</EMPNO>
   <FIRSTNAME>MICHAEL>/FIRSTNAME>
   <LASTNAME>THOMPSON</LASTNAME>
   <PHONE>408-463-1234</PHONE>
   <SALARY>41250.00</SALARY>
   </EMPLOYEE>
</DEPARTMENT>
Table 5. Linked data (RDF triples)
SubjectPredicateObject
deptidhasdepname
deptidhasdep15
Salesisadeptname
Saleshasdepid15
emphasdeptid
emphasempno27
emphasempno10
27isanempnofordepid15
10isanempnofordepid15
Table 6. Normalized relational data (a)
DEPTIDDEPTNAME
15Sales
Table 7. Normalized relational data (b)
DEPTIDEMPNOFIRSTNAMELASTNAMEPHONESALARY
1527MICHAELJONES408-461-123441250
1510CHRISTINESMITH408-040-805152780

Data normalization into triples is an essential part of the Semantic Web to build RDF stores from which powerful integrations can be derived. By breaking information into its smallest elements (triples), data can be represented as graphs with the subject and object as nodes, and the predicate as the linking arc. Inference software (reasoners) can analyze the graphs and apply rules to deduce new statements. For example, since departments have employees and human resources is a department, a new statement can be deduced that human resources has employees.

Linked data links typically connects the latest information only, as is the case with primary and foreign keys in relational tables. For example, an audit of an order that is constructed by linking or by navigating relational keys would pick up the current customer address and not the address of the customer at the time the order was placed. In business systems, it is often necessary for audit and compliance reasons to make a snapshot of the pertinent business information at the time an important event occurs, such as receiving an order or issuing a bank statement. The purpose of the snapshot is not to rely on navigation at a later date to re-construct the business record when an audit or query takes place. Sometimes the snapshot is signed to provide evidence that the business record has not been tampered with since its creation.

RDF, as an extreme form of normalization, is similar to normalized relational data in its difficulty for handling versioning and managing history and audit. RDF differs from normalized relational data in that it is not tied to a fixed schema. When a resource has new properties, additional statements (triples) are added to easily represent the properties. However, triples can be harder to query than traditional relational schemas.

The mechanisms of Linked Data and Semantic Web are powerful and can integrate seemingly unrelated systems. Examples include integrating hotel and airline reservations, order processing, or insurance renewal with social networking information. Augmenting business records (in XML) with semantic annotations (triples) provides a way of blending traditional business records with Linked Data and Semantic Web constructs. A real-world example is the US retailer Best Buy who annotates its XHTML pages that describe products, with RDFa (RDF annotations) that can be processed more programmatically, for example by search engines and classifiers [‎‎9].

It is more difficult to blend semantic annotations with relational databases because there are no mechanisms in place to augment relational data without changing the relational database schema, in contrast with business records expressed in XML where annotations are more straightforward to apply.

JSON

JSON (JavaScript Object Notation) is based on a subset of the JavaScript Programming Language and has become popular for presenting information to JavaScript clients [‎‎‎10]. Indeed, many REST and Web APIs, offer both XML and JSON variants. JSON can be viewed as a lightweight data-interchange format suited for building human user interfaces with JavaScript. It is easy for humans to read and write, and easy for JavaScript to parse and generate. Table 8 shows a data sample and compares JSON with XML notation.

Table 8. JSON versus XML
JSONXML
 {
    "city": "Armonk",
    "state": "NY",
    "population": 4080
 }
<root type="object">
   <city type="string">Armonk</city>
   <state type="string">NY</state>
   <population type="number">4080</population>
</root>

JSON is well suited for formatting the result of a query on the fly to make processing simpler for a JavaScript client. Hence, JSON is very popular for web applications. However, JSON does not have namespaces, so even if constraints were available, they cannot be shared, extended, or inter-mingled easily. Supporting specifications for JSON, such as schema or transformation languages, do not exist or have not achieved the same level of maturity as those for XML. Although JSON has some of the same advantages of XML in terms of data normalization, it has not reached the level of maturity of XML (nor of relational and linked data) as shown in Table 9.

Table 9. Comparison of relational data, XML, JSON, and XML
RelationalXMLJSONLinked data
MetadataData Definition Language (ISO)XML Schema XSD (W3C),
Namespaces (W3C)
JSON Schema (IETF)RDFS (RDF Schema),
Ontology (W3C and elsewhere)
ConstraintsIntegrity constraints in table definitions (ISO)Schematron (ISO),
Relax-NG (OASIS)
-
TriggersRelational triggers--RIF (W3C)
Data exchange serializations SQL standard (ISO) defines an XML serialization but it is not widely used – There is no agreed JSON serialization. There are RDF serializationsXML is a syntax widely used for data exchange (W3C)JSON is a serialized format, there are XML representations of JSON tooXML, Turtle
AnnotationsNot part of the relational modelMany kinds of annotations are defined for XML schemas and for XML data-RDFa (W3C) can be used to annotate XML
Query & CRUD LanguagesData Manipulation Language (DML), SQL, SQL/XML (ISO)XPath, XQuery (W3C) and others for CRUD JAQL, JSONiqSPARQL (W3C)
Query & CRUD APIsMany for various programming languages, for example JDBC, ODBC Various, includes some of the relational APIs and specific APIs, for example XQJ-SPARQL Graph Store HTTP Protocol – for CRUD (W3C)
CollectionTable, View, Database (ISO)XML Collection Function (W3C)-RDF Graphs (W3C)
Transformation & other languagesSQL (Tables to Tables)XSLT (XML to text, includes XML); XForms
SQL XMLTABLE (XML to relational)
JavaScript-

Summary

Data normalization, a process that starts early in most commercial projects, almost always precludes the storage of business records in their original form, such as XML. Moreover, it is often assumed that working with XML business records is inefficient. The performance comparison in this article has shown that this is not necessarily the case. In fact, the costs of retrieving and reconstructing business records from normalized tables is often higher than the cost of natively storing and retrieving non-normalized XML documents. Hence, where XML is used to represent business records, XML should also be considered as the corresponding storage format. Object-centric data access often performs better with XML than normalized relational tables, and allows for high XML transaction rates [2]. Additionally, XML allows applications to be prototyped, developed, and evolved with agility as there is no need to design and maintain mappings between relational tables and XML.

Emerging storage systems such as Google Big Table and HBase also depart from normalization. They advocate strong denormalization and the storage of data according to how it is accessed. XML databases provide the same recommendation that business records are best stored in the granularity that matches the typical access patterns [‎4].

Linked Data, Semantic Web, and JSON are emerging data representations but are not yet pervasive in business systems for inter-system data-exchange. Data normalization into triples is an essential part of the Semantic Web to build RDF stores from which new facts can be derived through inferencing. Augmenting business records represented in XML, with semantic annotations such as RDFa, provides a way of blending traditional business records with Linked Data and the Semantic Web.

Despite these developments, normalization is not obsolete. Normalization is still useful, for example, for traditional relational transaction processing systems that support a high rate of updates, rarely have to reconstruct the original business records, and do not perform inserts of new record versions instead of traditional updates, as shown in Table 10.

Table 10. Suitability of normalized versus non-normalized storage
Suitable for non-normalized data representation, for example, XMLSuitable for normalized or semi-denormalized data representation
1"Object-centric" data access: all or most pieces of a business record are accessed togetherData access is set-oriented or column-oriented, for example for analytics
2Intact business records are exchanged via web services and SOAOriginal business records do not need to be reassembled
3Versioning is required: data updates are replaced by inserts of immutable versions Only the latest state of each business record needs to be retained
4Schema evolution needs to be supportedThe schema is mature, stable, and unlikely to evolve
5Auditing and compliance of business records are criticalAudit/compliance requirements are short-term, weak, or absent

The huge success of relational database systems, data normalization, and later the representation of business records in XML, all happened for compelling reasons at the time. In the 21st century it often makes sense to store and manipulate business records "as-is". To draw an analogy with the real world: humans acquire, manipulate, and store real things, such as clothes in closets, and cars in garages. Things are stored in the most appropriate form for the predominant type of accesses, that is, cars and clothes are stored without first dismantling them into pieces, as shown in Figure 3.

Figure 3. Intact vs "Normalized" storage of a car
This image shows a car on the left and a disassembled car on right.

Conclusion

This 2-part series has identified data normalization as a key barrier to storing business records intact and delivering agile data management solutions. Efficient XML databases are readily available [‎7], but are often overlooked because an early step in system design incorporates normalization, removing all traces of the original business records. Data Normalization should be reconsidered carefully.


References

1. Murthy, R. et al.: "Towards an enterprise XML architecture", SIGMOD 2005.

2. Nicola, M., Gonzalez, A.: "Taming a Terabyte of XML Data", IBM Data Management magazine, Vol. 14, Issue 1, 2009.

3. Nicola, M., van-der-Linden, B.: "Native Support XML in DB2 Universal Database", 31st International Conference on Very Large Databases, VLDB 2005.

4. Nicola, M.: "Lessons Learned from DB2 pureXML Applications: A Practitioner’s Perspective", 7th International XML Database Symposium, XSYM 2010.

5. Rys, M.: "XML and Relational Database Management Systems: Inside Microsoft SQL Server", SIGMOD 2005.

6. Holstege, M.: "Big, Fast, XQuery: Enabling Content Applications", IEEE Data Engineering Bulletin, Vol. 31 No. 4, 2008.

7. Carey, M. J. et al.: "EXRT: Towards a Simple Benchmark for XML Readiness Testing", 2nd Conference of the Transaction Processing Council, TPCTC 2010.

8. RDF Reference - RDF Primer.

9. How Best Buy is using the Semantic Web.

10. Introducing JSON.

Resources

Learn

Get products and technologies

  • Evaluate IBM products in the way that suits you best: Download a product trial, try a product online, use a product in a cloud environment, or spend a few hours in the SOA Sandbox learning how to implement Service Oriented Architecture efficiently.
  • Download a free trial version of IBM DB2 for Linux, UNIX, and Windows.
  • Now you can use DB2 for free. Download DB2 Express-C, a no-charge version of DB2 Express Edition for the community that offers the same core data features as DB2 Express Edition and provides a solid base to build and deploy applications.
  • Build your next development project with IBM trial software, available for download directly from developerWorks, or spend a few hours in the SOA Sandbox learning how to implement Service Oriented Architecture efficiently.

Discuss

  • Get involved in the My developerWorks community. Connect with other developerWorks users while exploring the developer-driven blogs, forums, groups, and wikis.

Comments

developerWorks: Sign in

Required fields are indicated with an asterisk (*).


Need an IBM ID?
Forgot your IBM ID?


Forgot your password?
Change your password

By clicking Submit, you agree to the developerWorks terms of use.

 


The first time you sign into developerWorks, a profile is created for you. Information in your profile (your name, country/region, and company name) is displayed to the public and will accompany any content you post, unless you opt to hide your company name. You may update your IBM account at any time.

All information submitted is secure.

Choose your display name



The first time you sign in to developerWorks, a profile is created for you, so you need to choose a display name. Your display name accompanies the content you post on developerWorks.

Please choose a display name between 3-31 characters. Your display name must be unique in the developerWorks community and should not be your email address for privacy reasons.

Required fields are indicated with an asterisk (*).

(Must be between 3 – 31 characters.)

By clicking Submit, you agree to the developerWorks terms of use.

 


All information submitted is secure.

Dig deeper into Information management on developerWorks


static.content.url=http://www.ibm.com/developerworks/js/artrating/
SITE_ID=1
Zone=Information Management
ArticleID=784352
ArticleTitle=Data normalization reconsidered, Part 2: Business records in the 21st century
publish-date=01122012