© 2002 International Business Machines Corporation. All rights reserved.
Because data integrity is important to many business applications, specific techniques to guarantee integrity at all levels of computational architecture have been devised: from microprogramming, through the operational system and the database management systems, and even to the application level. This article discusses data integrity at the database level, by making use of the support for data types that DB2 offers, and how to map DB2 data integrity constraints to the XML using the XML Schema Standard [W3C2001]. This mapping is interesting because the integrity rules defined in DB2 could be applied during the parser of the XML file; that is, at a time prior to the inclusion of data in DB2. The SQL commands I present in this article are part of a subgroup of commands common to the following DB2 platforms: UNIX®, Windows®, AIX®, iSeriesTM, z/OSTM and OS/390® [IBM2002]. The proposed mapping is aligned with the working draft "SQL/XML mapping" [ISO2002], which is being discussed at the ANSI-NCITS H2 and ISO/IEC-SC32 committee.
Suppose that your client requested the development of a new accounting system. Ask him to choose one of the following development options:
- Option 1 - Development of the new system with 80% of requested functions, and 100% guarantee of data integrity.
- Option 2 - Development of the new system with 100% of requested functions, but with 99% guarantee of data integrity.
It is very likely that your client would choose option 1, as in certain systems total data reliability is needed. In this case, it is better not to implement some functionalities than it is to allow possibly unreliable data. It is even better to be penalized on the system's performance, than on the data reliability.
According to C.J. Date [Date2000], the restrictions of data integrity is another name for "business rules." He supports that the declarative approach is better than the procedural approach to define business rules. Among other points, Date emphasizes the following advantage:
"We can avoid redundant rule enforcement. We don't have to have "the same" enforcement code embedded in lots of different chunks of procedural code or lots of different routines (or "methods," or whatever we call them). We just state the rule once, and the system does the work."
In avoiding redundancy, you also avoid the problem of inconsistency; that is, one of the "copies" of a certain rule might not have been updated, as to the change of a business rule. According to Date, another advantage is that the rule will be documented in only one place -- the database catalog -- and can be queried there.
DB2 and the XML Schema Standard use the declarative form to define integrity restrictions. The classification of integrity restrictions used in this article is an adaptation of the classification proposed by Date in [Date2001].
The article is structured in the following manner:
- First is the support of data types by DB2, where some considerations are made on data integrity using a simple example.
- Following this, I introduce some basic concepts of XML Schema Standard.
- The main part of the article shows how to map the integrity rules defined in DB2 to the XML Schema Standard.
- Finally, I present some limitations of the proposed mapping.
According to Date [Date2000b]:
"type is a named set of values (i.e., all possible values of the type in question), along with an associated set of operators that can be applied to values and variables of the type in question."
DB2 has pre-defined data types that represent numbers, date, hour, strings, etc. For example, the DB2 INTEGER type refers to the group of integer numeric values in the interval between -2.147.483.648 and +2.147.483.647, and contains various associated operations (addition, subtraction, multiplication, etc). It is important to remember that, depending on its definition, a column of INTEGER type can take on NULL value which is a special value present on all data types. The NULL value is used in various situations, as for instance, when representing a non-informed or unknown value, or to represent a value that does not apply in a certain situation. The columns that are part of the primary key cannot take on NULL value. The rule that defines the domain of type values is called "type constraint."
sums up the main DB2 built-in data types.
Figure 1. DB2 built-in data types
When creating a table, each column is associated with a certain data type. Observe in the example below the creation of DEPARTMENT and PROJECT tables:
CREATE TABLE DEPARTMENT ( DEPTNOINTEGER NOT NULL, DEPTNAME CHAR(50), PRIMARY KEY (DEPTNO) ); CREATE TABLE PROJECT ( PROJNOINTEGER NOT NULL, PROJNAME CHAR(50)NOT NULL, DEPTNO SMALLINTNOT NULL, START_DATE DATENOT NULL, END_DATE DATE, PRIMARY KEY (PROJNO), FOREIGN KEY (DEPTNO) REFERENCES DEPARTMENT ON DELETE RESTRICT); |
DB2 guarantees that only valid dates will be attributed to the START_DATE column. This integrity rule is called "attribute constraint."
Because the table's primary key is the PROJNO column, we are forced to restrict NULL or duplicated values. In this way, we manage to identify each line of the table. We can classify the definition of a primary key in a table as being a restriction of the "entity constraint" type.
Note also that there is the definition of a foreign key (DEPTNO). This means that DB2 will only allow values that are in the DEPARTMENT table. We can classify this restriction as "referential integrity restriction". This is a specific case of "database constraint," defined by Date.
Even with all these integrity types, nothing prevents the inclusion of a line with invalid data for the business, as shown here:
INSERT INTO PROJECT |
Since the data types of the PROJNO and DEPTNO columns are compatible (INTEGER and SMALLINT), the system cannot prevent the execution of the following query:
SELECT PROJNO + DEPTNO AS SIC FROM PROJECT |
With this definition, the PROJECT table accepts values that are invalid to the business and allows for meaningless operations: there is no sense adding the project code to the department code.
To solve such problems, new business rules can be defined:
- (a) The project code is an integer number that can vary between 1 and 50.000, inclusive, and the department code is an integer number that can vary between 1 and 400, inclusive.
- (b) The initial date must be greater than "01-JAN-2000."
- (c) The final date must be greater than the initial date.
- (d) Operations between incompatible columns must not be allowed.
Such rules could have been validated at the application level. Nevertheless, for reasons already stated, we prefer to guarantee them at database level. For this we will use UDT (user-defined data type) and check constraints.
Listing 1 shows the new definition of objects that satisfy the new rules.
Business rules (a) and (d) will be implemented with the creation of UDTs DEPTNO and PROJNO (Listing 1 - lines 01-05), and with the definition of check constraints in the new tables (Listing 1 - lines 12-13, 20-21). Note that it was not necessary to create the check constraint in the NEW_PROJECT table, DEPTNO column, since there is already a definition of referential integrity in this column.
In the event of DB2 implementing the CREATE DOMAIN statement foreseen in the SQL ANSI (Advanced) Standard, it is necessary to create only the new domains, without the need of defining check constraints in tables.
Business rules (b) and (c) will be implemented with the definition of check constraints in the NEW_PROJECT table (Listing 1 - lines 24-28). The distinct type NAME was also created, to be used in places where a 50-character name is needed.
See below the definition of UDTs and the NEW_DEPARTMENT and NEW PROJECT tables.
Listing 1 - Creation of new tables and distinct types
01.CREATE DISTINCT TYPE DEPTNO AS INTEGER
02.WITH COMPARISONS;
03.
04.CREATE DISTINCT TYPE PROJNO AS INTEGER
05.WITH COMPARISONS;
06.
07.CREATE DISTINCT TYPE NAME AS CHAR(50)
08.WITH COMPARISONS;
09.
10.CREATE TABLE NEW_DEPARTMENT
11.( DEPTNODEPTNO NOT NULL
12.CHECK (DEPTNO >= DEPTNO(1)
13.AND DEPTNO <= DEPTNO(400)),
14.DEPTNAMENAME,
15.PRIMARY KEY (DEPTNO)
16.);
17.
18.CREATE TABLE NEW_PROJECT
19.( PROJNO PROJNO NOT NULL
20.CHECK (PROJNO >= PROJNO(1)
21.AND PROJNO <= PROJNO(50000)),
22.PROJNAMENAME NOT NULL,
23.DEPTNO DEPTNO NOT NULL,
24.START_DATEDATE NOT NULL
25.CHECK (START_DATE >= DATE('01-01-2000')),
26.END_DATEDATE,
27.CHECK (END_DATE > START_DATE),
28.PRIMARY KEY (PROJNO),
29.FOREIGN KEY (DEPTNO) REFERENCES
30.NEW_DEPARTMENT ON DELETE RESTRICT
31.);
|
The definition of these new rules helps to guarantee even greater data integrity. Because it was done in a declarative form, DB2 guarantees that these rules will always be respected.
Next, we can see how to map these rules to the XML world. To simplify, we shall only work with the types and rules regarding the NEW_PROJECT table.
With the increasing use of XML for data-centric applications, the need for data validation arose using strong type. From January 1998 onwards, the W3C Schema Working Group has received several schema language proposals (XML-DATA, XDR, SOX, DDML, DCD) which lead to the XML Schema Standard, published in May of 2001. XML Schema is not the only existing schema language. There are others such as: RELAX, Schematron, Examplotron, etc.
The XML Schema language holds many advantages compared with DTD (Document Type Definition). Among them we can emphasize:
- Richer group of data types
- Possibility of new data type definition
- Allows defining uniqueness using more than one element/attribute
- Uses the XML format in the schema definition
- Permits the usage of namespaces (allows the combination of different grammar avoiding name conflict)
Part 2 of XML Schema Standard [W3C2001] defines the built-in data types. Figure 2 presents the pre-defined types of XML Schema Standard (the primitive and the derived by restriction only).
Figure 2. XML Schema pre-defined data types
From a pre-defined data type (primitive or derived), it is possible to derive other data types. We will use the derivation by restriction to define correspondent data types to those that DB2 offer.
One type of XML Schema Standard can be classified as:
- Simple type - defines restrictions in relation to the text that appears in attributes or in elements that only contain text (cannot contain attributes).
- Complex type - defines restrictions for elements that contain attributes or other nested elements. A complex type can be anonymous (when defined inside the element in which it will be used), or can be named (when the use of it in the definition of other elements can be foreseen - must be defined outside the element in which it will be used).
We must choose one type for each element or attribute. In the below example we define an element with the name PROJECTS. This element was associated with an anonymous complex type that uses the sequence composer (indicates that the elements must appear in the order they were stated). The PROJECT element was associated to a named type. Note that the attributes minOccurs and maxOccurs define the element cardinality (the cardinality definition is very limited when DTD is used). In this case, it is being determined that the PROJECTS element can have 0 or more occurrences of the nested PROJECT element.
<xsd:element name="PROJECTS"> <xsd:complexType> <xsd:sequence> <xsd:element name="PROJECT" type="PROJECT" minOccurs="0" maxOccurs="unbounded"/> </xsd:sequence> </xsd:complexType> </xsd:element> |
Mapping DB2 data types to XML Schema
An important characteristic of XML Schema is the possibility of creating new schemas from others already existing. Using pre-defined data types of the XML Schema Standard in the first place, we create the mapping for DB2 data types (schema db2xml.xsd). Secondly, using the db2xml.xsd, we create a new schema with the data types that interest the business (schema myTypes.xsd). Finally, we create a schema that will validate the data (schema new-project.xsd), having as a base db2xml and myTypes schemas.
To validate an XML document that uses XML schema, it is necessary to use a parser that accepts this kind of validation, as for example xerces, xsv etc. The parser receives the XML file as an entry (data to be validated) and the file with the schema (grammar for validation) and produces a result that informs if the structure and content are following the rules defined in the grammar.
Figure 3 sums up the schemas that were created to reach the validator schema. The lower part of the figure represents the parser program, reading the two input files and producing the validation result.
Figure 3. Schema layers and parser
Table 1, in the appendix, presents the first mapping level between the DB2 types and that of XML Schema (DB2xml). This mapping level does not depend on the business objects. We can see in detail how the mapping of CHAR type is done in the DB2xml.xsd file.
Listing 2. Mapping of DB2 CHAR type for XML Schema xsd:string type
01.<xsd:simpleType name="CHAR"> 02.<xsd:annotation> 03. <xsd:appinfo> 04. <db2xml:sqltype kind="PREDEFINED" 05. name="CHAR" maxLength="254"/> 06. </xsd:appinfo> 07.</xsd:annotation> 08.<xsd:restriction base="xsd:string"> 09.<xsd:maxLength value="254"/> 10.</xsd:restriction> 11.</xsd:simpleType> |
The type chosen to map the CHAR was xsd:string, with the restriction of 254 characters, maximum (Listing 2 - line 09). The size restriction is defined with the use of facets, which are properties derived from the base data type.
The XML Schema language allows the definition of two types of annotation for a schema: the documentation element (to include comments) and the appinfo element (information which could be used by other applications). In this mapping, we code the appinfo to inform which origin type is in DB2.
To map the DATE, TIME, TIMESTAMP types, it was necessary to use a pattern facet that defines the format of each type. For example, the regular expression "\p{Nd}{4}-\p{Nd}{2}-\p{Nd}{2}" defines the format of a date.
The next step is the definition of DB2 distinct types (UDTs). PROJNO, DEPTNO and NAME types were created. We can see in detail how the definition was made to the PROJNO type in myTypes.xsd file, as in Listing 3.
Listing 3. Mapping of PROJNO DISTINCT TYPE.
01.<xsd:simpleType name="PROJNO"> 02.<xsd:annotation> 03.<xsd:appinfo> 04.<db2xml:sqltype kind="DISTINCT" typeName="PROJNO"/> 05.</xsd:appinfo> 06.</xsd:annotation> 07.<xsd:restriction base="db2xml:INTEGER"> 08.<xsd:minInclusive value="1"/> 09.<xsd:maxInclusive value="50000"/> 10.</xsd:restriction> 11.</xsd:simpleType> |
Observe that the distinct type PROJNO was mapped from type db2xml:INTEGER (Listing 3 - line 7), restricting the values for the interval between 1 and 50000, inclusive. As well as the distinct types, we define the START_DATE in this same schema, which implements the mapping of the check constraint from the initial project date (<xsd:minInclusive value="2001-01-01"/>)
Finally, we can define the schema that will validate the data. For this, we will make use of the schemas db2xml.xsd and myTypes.xsd that were defined beforehand. Listing 4 shows the validator schema.
01.<?xml version="1.0" encoding="ISO-8859-1"?> 02.<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" 03. xmlns:db2xml="http://www.acme.org/DB2xml" 04. xmlns:mT="http://www.acme.org/myTypes"> 05.<xs:import namespace="http://www.acme.org/myTypes" 06. schemaLocation="myTypes.xsd"/> 07.<xs:import namespace="http://www.acme.org/DB2xml" 08. schemaLocation="DB2xml.xsd"/> 09.<xs:complexType name="PROJECT"> 10.<xs:sequence> 11.<xs:element name="PROJNO" type="mT:PROJNO" minOccurs="1"/> 12.<xs:element name="PROJNAME" type="mT:NAME" minOccurs="1"/> 13.<xs:element name="DEPTNO" type="mT:DEPTNO" minOccurs="1"/> 14.<xs:element name="START_DATE" type="mT:START_DATE" 15.minOccurs="0"/> 16.<xs:element name="END_DATE" type="db2xml:DATE" 17.nillable="true"/> 18.</xs:sequence> 19.</xs:complexType> 20. 21.<xs:element name="PROJECTS"> 22.<xs:complexType> 23.<xs:sequence> 24.<xs:element name="PROJECT" type="PROJECT" 25.minOccurs="0" maxOccurs="unbounded"/> 26.</xs:sequence> 27.</xs:complexType> 28.<xs:unique name="PROJNO_primary_key"> 29.<xs:selector xpath="PROJECT"/> 30.<xs:field xpath="PROJNO"/> 31.</xs:unique> 32.</xs:element> 33.</xs:schema> |
Lines 2-8 define and import the base schemas. Lines 9-19 define a complex type by the name of PROJECT. This type is formed by elements corresponding to the columns of the example table. The element END_DATE was defined based on the type coming from the db2xml schema. Note the presence of the nillable="true" attribute to indicate that the element can assume NULL values. Lines 28 to 31 define the restriction uniqueness (equivalent to the primary key).
In "Section 1.1 Purpose" of XML Schema Standard (Part 1) [W3C2001b], a warning is made in relation to the language purpose:
"Some applications may require constraint capabilities not expressible in this language, and so may need to perform their own additional validations."
In the NEW_PROJECT table, there is a type of check constraint ("END_DATE > START_DATE") that cannot be validated using only resources of the XML Schema. In order to solve such a problem, the best alternative is to use the Schematron rule based language. Besides the normal validation, already shown on Figure 3, another validation of Schematron rules will be done. The link between XML Schema and the Schematron is done using the annotation/appinfo element.
Below, in Listing 5, we can see the redefinition of the END_DATE element with the integrity rule of schematron notation.
Listing 5. Check constraint (END_DATE > START_DATE).
01.<xsd:element name="END_DATE" type="db2xml:DATE" nillable="true"> 02.<xsd:annotation> 03.<xsd:appinfo> 04.<sch:pattern name="Check constraint end_date > start_date"> 05.<sch:rule context="END_DATE"> 06.<sch:assert test="( ./@xsi:nil='true' or 07.(number(translate(./text(),'-','')) > 08.number(translate(../START_DATE/text(),'-',''))))"> 09.END_DATE must be greater than START_DATE. 10.</sch:assert> 11.</sch:rule> 12.</sch:pattern> 13.</xsd:appinfo> 14.</xsd:annotation> 15.</xsd:element> |
The business rule is revealed with the use of an XPATH expression (lines 6-8) in the selected context (END_DATE - line 5). XPATH is a W3C Standard that forms the family of XML Standards. Its aim is to define paths that select fragments of an XML document.
To validate the schematron rules it is necessary to use an XSLT processor (for example, the saxon). XSLT is another W3C Standard that aims to perform transformations in XML files.
The XSLT processor receives as input an XML and an XSL file (transformation rules), and produces an output according to the established transformation rules.
The list below shows the necessary steps in validating the Schematron rules using the saxon.
-
saxon new_project.sch XSD2SCHTRN.XSL > proj_sch.xsd -
saxon proj_sch.xsd schematron-basic.xsl > validator.xsl -
saxon project.xml validator.xsl > result2.txt
Step 1 is a pre-process that is done to extract the schematron rules from the original file to a temporary file. This file is input to another transformation (Step 2) that produces the validation file of Schematron rules. In Step 3, validation rules are finally done, producing a result.
The final result that must be considered is the validation result of XML Schema parser together with the result of XSLT processor that validated the Schematron rules. Figure 4 sums up the validation process, taking into account the XML parser as well as the XSLT processor.
Figure 4. XML validation with XML Schema + Schematron
Some limitations of the proposed mapping are shown below:
- The article did not deal with mapping in the "XML Schema" sense for DB2, but with DB2 for the "XML Schema."
- The UDTs "structured type" and "reference type" were not mapped because these are not present in the mainframe platform. The article proposes to cover only what was common to all platforms.
- The XML Schema Standard does not allow for checking of referential integrity, in case the data are in another XML document. This limitation could be by-passed with the use of Schematron, as was done in the check constraint.
- Some important mapping characteristics, such characters sets, collating sequences, identifiers, were not discussed.
- The GRAPHIC, VARGRAPHIC and LONG VARGRAPHIC types, which exist on all DB2 platforms were not mapped.
- We mentioned beforehand that the redundancy can create inconsistency problems. In mapping the DB2 integrity rules for the XML Schema there will be redundancy that should be controlled. One alternative is to create a program that automatically extracts the integrity rules from DB2 catalog, creating automatic mapping.
Data integrity must be continually improved. DB2 has several features that guarantee this integrity. In this article we saw how to map DB2 integrity restrictions to the XML world, using XML Schema Standard. This validation can be done remotely, without the need of connecting to DB2. The main advantage is that, when the data are inserted in the DB2, they would be already validated. The load of validated data in the DB2 can be done in several ways. One option is to do a new XSLT transformation creating a file for LOAD. Another option is to use the data stage approach presented in the DB2 Developer Domain article "Yet another way to map <emphasis>XML Data<emphasis> to DB2" [Lima2002].
In the final part of the article we saw that the XML Schema cannot map all the integrity rules defined in our example. To be able to cover all the rules, it was necessary to use a schema language based on rules (Schematron) in association with the XML Schema, and resources of XSLT and XPATH Standards.
Beta 8.1 version of DB2 UDB recently announced that it has a UDF (XML Extender) that achieves the validation with XML Schema.
Table 1. Mapping built-in DB2 data type to XML Schema
| DB2 | XML Schema | Mapping |
|---|---|---|
| CHAR | xsd:string |
<xsd:simpleType name="CHAR">
|
| VARCHAR | xsd:string |
<xsd:simpleType name="VARCHAR">
|
| CLOB | xsd:string |
<xsd:simpleType name="CLOB">
|
| BLOB | xsd:hexBinary |
<xsd:simpleType name="BLOB">
|
| SMALLINT | xsd:short |
<xsd:simpleType name="SMALLINT">
|
| INTEGER | xsd:int |
<xsd:simpleType name="INTEGER">
|
| DECIMAL | xsd:decimal |
<xsd:simpleType name="DECIMAL">
|
| NUMERIC | xsd:decimal |
<xsd:simpleType name="NUMERIC">
|
| REAL | xsd:float |
<xsd:simpleType name="REAL">
|
| DOUBLE | xsd:double |
<xsd:simpleType name="DOUBLE">
|
| DATE | xsd:date |
<xsd:simpleType name="DATE">
|
| TIME | xsd:time |
<xsd:simpleType name="TIME">
|
| TIMESTAMP | xsd:dateTime |
<xsd:simpleType name="TIMESTAMP">
|
The author would like to gratefully acknowledge the comments and suggestions of Marco Antonio Motta de Souza.
| Name | Size | Download method |
|---|---|---|
| 0209limasample.zip | 13KB | HTTP |
Information about download methods
-
[IBM2002] IBM Corporation, "SQL Reference - for Cross-Platform Development", 2002.
-
[ISO2002] ANSI/ISO, "Working Draft - XML-Related Specifications (SQL/XML)", May/2002.
-
[Date2000] C. J. Date, "What not How - The Business Rule Approach to Application Development", Addison-Wesley, 2000.
-
[Date2001] C. J. Date, "Constraints and Predicates: A Brief Tutorial", Part I, Part II, Part III.
-
[Lima2002] Joao A. O. Lima, "Yet another way to map <emphasis>XML
Data<emphasis> to DB2", March, 2002.
-
[Date2000b] C. J. Date, "An Introduction to Database Systems", Addison-Wesley, 2000.
-
[W3C2001] W3C, "Schema, Part 2: Data Types", May, 2001.
-
[W3C2001b] W3C, "Schema, Part 1: Structures", May, 2001.
João Alberto de Oliveira Lima began his career in software industry as a programmer/analyst in 1991. Since 1993 he has been involved with DB2 as developer, DBA and SQL performance analyst. João has a BSc in Computer Science and a MSc in Software Engineering. He has the following certifications: IBM Certified Developer in "XML and Related Technologies", IBM Certified Solution Designer in "Business Intelligence V8.1" and IBM Certified Solutions Expert in "DB2 UDB V7.1 DBA for OS/390", "DB2 UDB V7.1 DBA for Unix, Windows and OS/2" and "DB2 UDB V7.1 Family Application Development". He is the author of reference guide "DB2 UDB [Mainframe v. 7]" published by Novatec Editora (only in Portuguese). He lives in Brasília, Brazil.
Comments (Undergoing maintenance)





