Learn how to utilize pre-built packages of IBM® InfoSphere® Information Analyzer data quality rules. We will show you how to understand the available content, how to use that information to address common data quality conditions, and how to then import it into your Information Analyzer environment to accelerate rule development and assessment.
With IBM InfoSphere Information Analyzer, you can create data quality rules to automatically monitor for potential data quality problems against defined business requirements or based on issues identified during data profiling. These rules can take time to develop and test for the broad range of data in a given table, system, or environment.
The objective of this article is to show you ways to accelerate this development through the import and use of the pre-built Information Analyzer rule definitions, which accompany this article. By using these pre-built data quality rule definitions, you should be able to accelerate development of data quality validation in your enterprise.
This article focuses on the following tasks:
- Understanding rule definitions available in pre-built packages
- Using pre-built rule definitions to address common data quality conditions
- Reviewing the structure and content of the Information Analyzer rule definition XML file
- Importing the pre-built data rule definitions using the HTTP/CLI API — a feature introduced in InfoSphere Information Analyzer V8.5 and enhanced in V8.7
The pre-built data quality rules accompanying this article are intended to:
- Reduce the effort identifying data quality issues in many common information domains (keys, national identifiers, dates, country codes, email addresses, etc.) and conditions (completeness checks, valid values, range checks, aggregated totals, equations, etc.)
- Serve as models, templates, and examples for your own additional rule design
- Be utilized in Information Analyzer jobs (either V8.5 or V8.7), or via the Rule Stage available in Information Server V8.7.
Address common data domains and quality conditions
Almost any piece of data stored in a database or file or being processed through a job or web service has some associated condition indicating whether the data meets established validation rules. These conditions may be as simple as indicating that there must be data in the field (i.e., that it is complete) or when there is data that it conforms to some specified format or set of values (i.e., the values are valid). Or the conditions may indicate that the data must match records in a specified reference source, such as postal codes, or that a particular equation is correctly calculated.
The potential range of data that can be evaluated and the potential number of quality conditions that can be defined is substantial, and this article (and accompanying pre-built rule definitions) cannot address all possible circumstances. Instead, the focus is on providing a jumpstart for a set of data domains and conditions commonly found in many data sources.
Information Analyzer provides a capability to define the rule logic for such data domains and conditions separately from any physical data source so the same logic is consistently applied from data source to data source (i.e., one data rule definition can be applied and used with many data sources). Coupled with a capability to import a set of rule definitions in a defined XML format, you can take the pre-built rule definitions and load them into Information Analyzer and start applying them to your own data sources.
Rule definitions follow a basic syntax where a variable, which could simply
be a word or term, is evaluated based on a specified condition or type of
check. The specified condition or check may or may not require some
additional reference value, such as another variable, a list of values, a
specified format, etc.. Further, several conditions may be connected
together with IF, THEN, AND, or OR clauses. For example, a very simple rule definition could be as
follows: DateOfBirth IS_DATE.
This condition indicates that a variable called DateOfBirth must be of a recognized date format.
In a slightly more complex case, you could have a rule definition such as Listing 1.
Listing 1. Sample rule definition
IF DateOfBirth EXISTS
AND DateOfBirth > datevalue('1900-01-01')
AND DateOfBirth < date()
THEN CustomerType = 'P'
|
Here, there is a conditional statement that checks whether the variable DateOfBirth exists and is within a set range, and only if those conditions are met, another variable called CustomerType is tested to see if it is equal to a specified value.
More information about creating and using rule definitions are available in the Information Analyzer user guide (see Data Rule definitions).
The most basic rule definitions test for the completeness of a field or a standard alphabetic or numeric format. The pre-built rules include examples of these conditions.
Figure 1. General completeness and data type rules
For instance, the rule definition AlphanumFieldExists
evaluates the following condition: Field1 EXISTS AND
len(trim(Field1)) <> 0.
This example includes several of Information Analyzer's basic capabilities:
- The use of a general variable name — in this instance, simply called
Field1
- NOTE: The variable can be connected to (bound) to any column or field of data. This is the flexibility that allows one rule definition to provide the basis for many actual executable data rules.
- Tests for multiple conditions — The existence of data and a not equal
(
<>) condition- NOTE: There is no specific limit to the number of conditions that can be included in one rule definition, though practically it is useful to keep rule definitions understandable. In creating rule definitions, look for basic building blocks and take advantage of Information Analyzer's rule-set capability to combine conditions instead of building all into one rule (see Data Rule Analysis techniques in the IBM Information Center].
- The inclusion of functions — In this case, len and trim
- NOTE: Refer to the Information Analyzer User Guide for the range of functions available. Functions can often be used to make conditions easier to solve. In this case, the functions are used to check for spaces in a field. The trim function first removes any number of blank (space) values from the left or right of any actual text. The len function identifies the length of all remaining alphanumeric characters with an expectation that the field will have at least one character value (that is, a length not equal to 0).
Data domains by data classification
At a basic level, beyond the generic examples above, you can roughly classify data into a set of common data domains as seen in the Column Analysis details of Information Analyzer:
- Identifier — A field generally unique and can identify related data (e.g., Customer ID, National identifier)
- Indicator — A field, often called a Flag, that has a binary condition (e.g., True or False, Yes or No, Female or Male)
- Code — A field that has a distinct and defined set of values, often abbreviated (e.g., State Code, Customer Status)
- Date — A field that contains some date value
- Quantity — A field that contains a numeric value and is not classified as an Identifier or Code (e.g., Price, Amount, Asset Value)
- Text — A field that contains alphanumeric values, possibly long text, and is not classified as an Identifier or Code (e.g., Name, Address, Description)
Additionally, there is a basic rule classification: Valid Value Combination, in which one field is of a certain value, a second field must be some specified value.
A subset of the pre-built rule definitions follows these general groupings and typical validation requirements. Figure 2, for example, highlights the pre-built rule definitions for Code fields.
Figure 2. Common classification-based rule definitions for codes
These rule definitions based on common data classifications typically assess structural formats or basic validation requirements (e.g., an Identifier must be in a range bounded by a low value and a high value, but it does not specify any exact values).
If, for instance, you have a Code field that allows the alphanumeric values
0-9, you might wish to apply the rule definition
Code1DigitNumeric (see in Figure 2) to
check that the field contains one single-digit numeric value. This rule
definition is as follows: Code MATCHES_FORMAT
'9'.
This example shows a straightforward condition:
- The use of a general variable called Code
- A test for a single format condition:
MATCHES_FORMAT- NOTE: Information Analyzer has two distinct checks for data
formatting: the
MATCHES_FORMAT, shown here, andMATCHES_REGEX, which evaluates against a broad range of regular expression conditions (many examples can be found through a simple Google search on the term "regular expression").
- NOTE: Information Analyzer has two distinct checks for data
formatting: the
- The
MATCHES_FORMATcondition requires a reference value; in this case, it expects one and only one numeric value (all numeric digits are represented by a 9).
As noted, there are many potential domains of data that could be incorporated into a package of pre-built data rule definitions. Examples of common domains included with the available packages include:
- Demographic information
- Age
- Date of birth
- Date of death
- National identifier (e.g., U.S. Social Security number, Canadian SIN number, passport number, Italy Fiscal Code, etc.)
- Internet address information
- Email address
- IP address
- URL
- Order/sales/policy information
- Order amount and quantity
- Sales amount (e.g., with or without sales tax, with or without discount)
- Maturity date
- Product code (e.g., ISBN code, UPC code)
- Employment information
- Start date
- Paid amount
- Phone information (North America only)
- Phone format
- Area code
These common domains span the range of data classes, providing more
specific cases for your use, as well as rule conditions that can be more
complex. Consider the following pre-built rule definition
SalesamtWDiscountPlusTaxValid, which evaluates a
sales amount field based on several variables, including a discount and a
tax:
(qtyValue1 * price) - (qtyValue1 * discount) + (((qtyValue1 * price) - (qtyValue1 * discount)) * salesTax) = totalAmount |
This example highlights that the source information (in this case) or the reference information used in the validation can incorporate a number of criteria:
- There are five variables in use in this logic:
- qtyValue1 — The quantity of an item in an order or sale
- price — The price of an item in an order or sale
- discount — A discount applied on an item in an order or sale
- salesTax — The sales tax applied on an order or sale
- totalAmount — The total amount of an order or sale
- NOTE: There is no specification in the rule definition itself as to where the data is actually stored; these variables might all be contained in one database or file or might come from several sources. That information is only required when the variables are bound when an executable data rule is generated.
- A test for a single condition
=(is equal to)- NOTE: This rule could also be written in
reverse, where
totalAmountis the source variable (on the left), which is equal to the reference data (the equation put on the right).
- NOTE: This rule could also be written in
reverse, where
- An equation using a series of functions (the standard numerical
operators
+,-,*, and/) and associated parentheses.
Standardized data domains (U.S.)
An accompanying pre-built rule package is targeted to validating the output of standardization processes for U.S. names, street addresses, and postal areas from IBM InfoSphere QualityStage®. The Standardize stage of QualityStage takes incoming data such as U.S. names and addresses that are not well-defined, parses that data, and creates a standardized form. For instance, consider the two addresses:
One hundred West Main Street apt 10 |
100 W Main St #10 |
Quite possibly, these two addresses represent the same location. But the differences in formatting and description would prevent such information from being connected. The output from the Standardize stage using a set of rules for U.S. addresses for both would produce:
Street# Pre-direction Street St. Type Unit Unit# 100 W Main St Apt 10 100 W Main St Apt 10 |
Generally, these standardization rule sets produce a fairly consistent
output, but there can be exceptions, such as new data, unexpected conditions,
default or test data, and unusual formats. The pre-built rule
definitions target these outputs, though they can be applied to any
well-parsed name, address, or postal area information. As an example, the
rule definition RuralRouteTypeIfExistsThenValidValues tests
whether a rural route type is valid.
IF RuralRouteType EXISTS
AND len(trim(RuralRouteType)) <> 0
THEN rtrim(RuralRouteType) IN_REFERENCE_LIST {'RR','RTE','HC','CONTRACT'}
|
This example highlights several criteria put into an
IF…THEN condition:
- The
IF…AND…is the same as the completeness exampleAlphanumFieldExistsshown above. When expressed within anIFcondition, only records where the field has a value will be evaluated with the subsequentTHENcondition. Records where there is no value present will not be evaluated and will not generate any exceptions. - The
THENcondition is the basis for meeting or not meeting the rule definition. Within the condition, thertrimfunction removes any spaces on the right ofRuralRouteType, and the resulting value is evaluated against a set of four valid values given in the list. - NOTE: These type of
IF…THENrule definitions work well together as part of a larger rule set. In essence, they describe a series of cases, each with distinct criteria. By defining the rule definitions separately and grouping into the rule set, it allows more insight into which records have issues, as well as how many records violate specific rules.
Using the pre-built rule definitions
The accompanying pre-built rule definitions can be considered from a design perspective and a deployment perspective.
Design-time accelerators, templates, and models
From a design perspective, you can use the pre-built rule definitions as-is, copy/modify to meet your needs, or use them as design models. The subsequent section on "Importing the pre-built rule definitions" describes the basic steps to bring the pre-built packages into your project(s).
The file IARuleDefs-BaseSet1-General-v8x.xml includes more than 130 definitions for the general rules and common domains described above. The file IARuleDefs-BaseSet1-USStan-v8x.xml includes nearly 60 definitions for the validation of U.S. name, address, and postal area standardized information described above.
First, once imported into your project, you may immediately use these rule definitions to test or assess your data sources, generating data rules as described in the Information Analyzer User Guide (see Generating a data rule from a rule definition). In this capacity, the rule definitions accelerate your ability to start detailed data quality assessment.
Second, you can use these rule definitions as templates to customize for your own specific data conditions. Consider the example case where you have a field named Region that represents a specific segment of the world. Region is defined as a text field that is five characters long, and the first two characters are alphabetic characters that must be in the following list: AM (Africa and Middle East), AP (Asia-Pacific), EU (Europe), NA (North America), and SA (South America).
The pre-built rule definitions do not include such an exact rule
definition. However, the rule definition
TextSubstrInRefList is described as "Substring Text
value starting in position 3 for length 3 is in reference list." This is
similar to the rule definition you need: evaluate a substring for
inclusion in a reference list.
In this instance, you could do the following:
- Log in to Information Analyzer.
- Open your project and navigate to the Develop menu and the Data Quality menu item.
- Select the desired rule definition in your project
(
TextSubstrInRefList, in this case). - Select the Create a Copy task, as in Figure 3.
Figure 3. Create a Copy task
- In the dialog box to Create a Copy, select OK.
- This will create a replica of the original rule called (Copy_of_TextSubstrInRefList, in this case).
- Open this new rule definition to edit as needed:
- Change the rule-definition name:
Region_SubstrInRefList. - Change the substring function from:
- Before: substring(TextField, 3, 3)
- After: substring(Region, 1, 2)
- NOTE: In this case, you want to start the substring function at the first character for a length of 2.
- Change the reference list data from:
- Before: {'AAA','AAB','BAA','CCC'}
- After: {'AM','AP','EU','NA','SA'}
- Change the rule-definition name:
- Save your updated rule definition.
Third, you can use the rule definitions as reference models — examples of specific functions or conditions in use that can guide you as you design and develop unique rules for your environment.
Deployment approaches for quality validation and monitoring
As with all rule definitions, the pre-built packages can be:
- Used to generate executable data rules for quality validation.
- Included in rule-set definitions and executable rule sets to validate
multiple conditions together.
- While discussed in greater detail in the Information Analyzer
Best Practices and Methodology Guide (see Resources), rule sets have several distinct
deployment advantages:
- They provide support for evaluating data based on many
data rule conditions. With the pre-built rule
definitions, you can combine as many of these
as needed to evaluate all the fields in a given
record, including multiple instances of the same rule
definition, such as
FieldExists. - They score all tested rules for each record in the set so results can be viewed in multiple dimensions. (For example, you can see all records that fail each specific rule, see all rules that a given record failed, or look at intersections of specific sets of rules).
- They optimize the rule evaluation for execution and processing.
- They provide support for evaluating data based on many
data rule conditions. With the pre-built rule
definitions, you can combine as many of these
as needed to evaluate all the fields in a given
record, including multiple instances of the same rule
definition, such as
- NOTE: Any rule-set definition you create can contain these pre-built rule definitions and/or your own rule definitions in any combination.
- While discussed in greater detail in the Information Analyzer
Best Practices and Methodology Guide (see Resources), rule sets have several distinct
deployment advantages:
- Published for users in other projects to take advantage of — When you import the pre-built rule definitions, they are imported to your project. For other users who are not part of your project to use, the rule definitions must be published or imported into their projects.
- Exported for deployment in other Information Analyzer environments — For example, if you are working in a development environment with test data to ensure your data rules work correctly, you may then need to export those data rules to a production environment for ongoing quality monitoring.
With the introduction of Information Analyzer V8.7, rule definitions built in Information Analyzer can be added to a new Rule Stage in an IBM InfoSphere DataStage or QualityStage job. This capability allows the use of any published rule definition to validate data that is part of data integration or data-cleansing processes, including those added through the accompanying pre-built rule-definition packages.
For example, you receive a file daily from an external third-party source. The quality of the data source is often low, resulting in problems in other information systems, including your business reporting. This daily file currently runs through a QualityStage job to standardize the file and load to existing data sources. You want to test incoming data for completeness using a set of rule definitions and validate the results of the QualityStage standardization output.
Figure 4 shows the addition of the new Rule Stage, CustomerValidityCheck, in this sample job. The Rule Stage could have one rule definition or many, depending on the number of data fields that need to be validated. Outputs from this stage include valid data, invalid data, and specific violation details.
Figure 4. Validate data in-process in DataStage or QualityStage
See Using the Data Rule Stage for more details on this capability.
By taking advantage of pre-built rule definitions, you can:
- Reduce the effort to address many common information domains and conditions.
- Provide models and publish rule definitions for other users to work from.
- Accelerate the process of assessing, testing, and deploying data rules in Information Analyzer.
- Deploy rule definitions for ongoing quality monitoring and in-flight data validation.
Understanding the pre-built rule definition packages
The Information Analyzer pre-built rule definitions that accompany this article are imported through the Information Analyzer API.
The pre-built definitions are structured using a defined XML schema. For full details on the structure, refer to Schema File elements for Rule Definitions.
At a brief level, the definition files look like Listing 2.
Listing 2. Rule definition XML schema
<?xml version="1.0" encoding="UTF-8"?>
<iaapi:Project xmlns:iaapi=\
"http://www.ibm.com/investigate/api/iaapi" name="your-project">
"<!--IBM InfoSphere Information Analyzer Base Set rule definitions v1.0
For use with IBM InfoSphere Information Server v8.7
Copyright IBM 2011. All rights reserved.
Disclaimer:INTERNATIONAL BUSINESS MACHINES CORPORATION PROVIDES THIS
PUBLICATION 'AS IS' WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS
OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
NON-INFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
Some states do not allow disclaimer of express or implied warranties
in certain transactions, therefore, this statement may not apply to you.-->"
<DataRuleDefinitions>
<DataRuleDefinition name='FieldExists' folder='General Rules / General'>
<description>Field Exists; null check only</description>
<expression>Field1 EXISTS</expression>
</DataRuleDefinition>
<DataRuleDefinition name='AlphanumFieldExists' folder='General Rules / Alphanumeric'>
<description>Alphanumeric Field Exists; null & \
blank value check</description>
<expression>Field1 EXISTS AND len(trim(Field1)) <> 0</expression>
</DataRuleDefinition>
<DataRuleDefinition name='AlphaFormatValid' folder='General Rules / Alphabetic'>
<description>Example Alphabetic Format; excludes null values</description>
<expression>IF Field1 EXISTS THEN Field1 MATCHES_FORMAT 'AAAA'</expression>
</DataRuleDefinition>
<DataRuleDefinition name='AlphanumFormatValid' folder='General Rules / Alphanumeric'>
<description>Example Alphanumeric Format 1; as with \
Vehicle plate #; excludes null values</description>
<expression>IF Field1 EXISTS THEN Field1 MATCHES_FORMAT '999AAA'</expression>
</DataRuleDefinition>
|
The content includes:
- A general XML header:
<?xml version="1.0" encoding="UTF-8"?>, which should not be changed. - A specific XML header for Information Analyzer:
<iaapi:Project xmlns:iaapi="http://www.ibm.com/investigate/api/iaapi" name="your-project">. You will need to modify your-project to whatever Information Analyzer project name you use. - XML comments enclosed between
<!-- and --> - The start of the rule definitions section:
<DataRuleDefinitions> - The block of content for each rule definition, including:
- The rule-definition name
- For the V8.7 files only, a folder
- The rule-definition description
- The expression (the rule logic)
<DataRuleDefinition name='FieldExists' folder='General Rules / General'> <description>Field Exists; null check only</description> <expression>Field1 EXISTS</expression> </DataRuleDefinition>
- After all rule-definition blocks, the end of the rule-definitions
section and the end of the XML content:
</DataRuleDefinitions> </iaapi:Project>
Pre-built rule-definition packages available
There are two rule-definition packages available (see Downloads). One is for use with Information Analyzer V8.5, the other for use with V8.7. The rule-definition content is the same for both packages. The only difference is that the V8.7 includes a folder reference that will add the rule definition to a specific folder during import. That option is not available during import in V8.5.
For IBM InfoSphere Information Analyzer V8.5:
- IARuleDefs-BaseSet1-v85.zip
- General domain and condition rule definitions:
IARuleDefs-BaseSet1-General-v85.xml - U.S. standardization validation condition rule definitions:
IARuleDefs-BaseSet1-USStan-v85.xml
- General domain and condition rule definitions:
- NOTE: This package can be used in Information Analyzer V8.5 or V8.7.
For IBM InfoSphere Information Analyzer V8.7:
- IARuleDefs-BaseSet1-v87.zip
- General domain and condition rule definitions:
IARuleDefs-BaseSet1-General-v87.xml - U.S. standardization validation condition rule definitions:
IARuleDefs-BaseSet1-USStan-v87.xml
- General domain and condition rule definitions:
- NOTE: This package can be only be used in Information Analyzer V8.7 as it includes Folder options not previously available in the API and CLI.
You will need to:
- Choose the package that is right for your Information Analyzer version and download from the attachments to this article.
- Save the file to a location on your computer.
- Unzip the package and extract the two XML files to a location on your computer.
Importing the pre-built rule definitions
There are two options to import or load the pre-built definitions: a Command-Line Interface (CLI) and a REST (HTTP) API, which is accessed via a browser.
Command-Line Interface (CLI)
- The command is
IAAdmin. - It is available on the Information Analyzer client and server.
- Advantages:
- No
POSTadd-ons are required for your browser. - Many technical people like command-line utilities.
- No
- Disadvantages
- Though relatively short, some people find command-line syntax confusing.
- Some environments disable command-line utilities.
REST API interface (browser)
- It uses an
HTTP POSTutility. - Advantages:
- Web-based GUI
- Disadvantages:
- It requires a
POSTadd-on (some environments do not allow browser add-ons).
- It requires a
This article describes both methods of import. The subsequent import steps assume that you:
- Download, unzip (extract), and save the Rule
Definition XML files to wherever the Information Analyzer
Client or server reside (this can be to a remote server, a remote
Client Image, or your own machine).
- NOTE: These instructions assume you have downloaded on your own machine and will be imported from your local environment.
- Open each XML file you wish to use with Notepad (or any other basic file editor).
- Change the project name (listed as
"your-project" as noted above) to one of your
choosing, which already exists in your Information Analyzer environment
you have access to.
- Note: if there is not yet an Information Analyzer project you (or your IA Project Administrator) will need to create one
- NOTE: If you do not change the project name in the XML file, a project named your-project will be created and all the rule definitions will go there.
- Save the XML file(s).
To perform the import of the Information Analyzer rule definitions via the CLI:
- Open a command (DOS) prompt on the client. For example, on
Windows® XP, you can use
Start > All Programs > Accessories > Command Prompt. - Navigate to C:\IBM\InformationServer\ASBNode\bin.
- Run the following command:
IAAdmin -user xxxxx -password xxxxx -host your-server -port 9080 -create -projectContent C:\Temp\IARuleDefs-BaseSet1-General -v87.xml
- Use the Information Analyzer configuration information in the above
command relevant for your environment:
–user(your Information Analyzer user ID)–password(your Information Analyzer password)–host(your Information Analyzer server — match your login information)–port(usually 9080 — match your login information)- After
–projectContent, include the location where you saved the XML file. The example shows the file saved at C:\Temp\, but the location and name of your file may be different from the one shown here. Usev85orv87in the filename as appropriate for your installed version.
- If you are importing the BaseSet1-General and the BaseSet1-USStan rule-definition packages, you need to change the filename and run the command a second time for the second import.
- You will receive an error if you run the above command twice
with the same filename. The error will tell you that the
rules already exist. If this happens, then re-issue the above
command with
-updateinstead of-create. - If you attempt to import the v87 files into
an Information Analyzer 8.5 version, you will receive an error
message like the following:
>The XML request passed as parameter could not be parsed for the following reason: Feature 'folder' not found. - The above command will run for approximately 4-5 minutes.
If the import was successful, you can log in to Information Analyzer, open your project (same as specified in the XML file in the import), and review the imported rule definitions. You should see a list of rule definitions similar to Figure 5.
Figure 5. Imported rule definitions
The following browser import example uses Firefox and the associated Poster add-on. If you have a different browser, you will need to find a utility that allows you to perform a POST function in order to import the pre-built Information Analyzer rule definitions with this method.
Before performing an import via this method, you need a Firefox utility to allow the POST function. Take the following steps (these may vary depending on the version of your browser):
- Open Firefox and navigate to https://addons.mozilla.org/en-US/firefox/addon/poster/.
- Select Add to Firefox.
- Select Install Now when prompted.
- Close and restart Firefox.
- After Firefox restarts, enable the Add-on bar with the following
(again depending on version)
- Firefox > Options > Add-on Bar or
- Firefox > Tools > Add-ons
- Depending on the version, you may see the add-on as:
- A gold-colored P in the lower right of your browser
- A menu option under Firefox > Tools > Poster
To perform the import of the Information Analyzer rule definitions via the browser:
- Open Firefox and navigate to the add-on.
- Select the Poster add-on. You should see a browser form as shown in Figure
6.
Figure 6. Firefox Poster form
- Fill in the form:
URL:http://<yourInformationAnalyzerhost>\ :9080/InformationAnalyzer/create?projectContent
User: Your Information Analyzer user ID
Password: your Information Analyzer password
Timeout: Change to 300
File: enter the filename or select the Browse button to select the XML file from the appropriate directory. Note that the name of your file will be different from the one shown here. Use v85 or v87 as appropriate for your install.
POST: Select this to run the load process. - The POST will take a bit (3-5 minutes), and you
should see a message indicating status of
OK.
NOTE: It is possible that you will receive a timeout response if the Timeout is not changed. This simply affects the response to the client on status, but should not affect the post function on the server.
If the import was successful, you can log in to Information Analyzer, open your project (same as specified in the XML file in the import), and review the imported rule definitions. You should see a list of rule definitions similar to Figure 5 above.
You are now ready to begin using your pre-built Information Analyzer rule definitions.
You should now be able to use the pre-built set of rule definitions accompanying this article in your IBM InfoSphere Information Analyzer projects.
This article has specifically looked at how to do the following tasks:
- Understand rule definitions available in pre-built packages accompanying this article.
- Use the pre-built rule definitions to address common data-quality conditions.
- Review the structure and content of the rule-definition XML file.
- Import the pre-built data rule definitions using the HTTP/CLI API for either Information Analyzer V8.5 or V8.7.
Once imported, you can use the pre-built rule definitions to establish data quality rules, then test and monitor for potential data-quality problems. And, by using pre-built rule definitions, you can shorten the time needed to put in data quality checks in place for a broad range of data in any given table, system, or environment.
Appendix: Pre-built rule-definition names and descriptions
The pre-built rule definitions are provided for use with IBM InfoSphere Information Server.
The following rule definitions are included in the IARuleDefs-BaseSet1-General-v8x.xml file.
Table 1. Rule definitions in IARuleDefs-BaseSet1-General-v8x.xml
| Rule name | Rule short description |
| FieldExists | Field Exists |
| AlphanumFieldExists | Alphanumeric Field Exists |
| AlphaFormatValid | Example Alphabetic Format |
| AlphanumFormatValid | Example Alphanumeric Format 1 |
| AlphanumFormatValid_2 | Example Alphanumeric Format 2 |
| FieldIsNumeric | Field is numeric |
| FieldExistsAndNumeric | Field exists and is numeric |
| FieldExistsAndNumeric_2 | Field exists and is numeric |
| FieldIsDate | Field is date value |
| FieldExistsAndDate | Field exists and is date value |
| FieldExistsAndDate_2 | Field exists and is date value |
| IndicatorY_NValid | Indicator Field is 'Y' or 'N' |
| IndicatorUpperCaseY_NValid | Indicator Field is any case of 'Y' or 'N' |
| IndicatorT_FValid | Indicator Field is 'T' or 'F' |
| IndicatorUpperCaseT_FValid | Indicator Field is any case of 'T' or 'F' |
| IndicatorString0_1Valid | Indicator Field is '1' or '0' |
| IndicatorNum0_1Valid | Indicator Field is numeric 1 or 0 |
| Code1DigitUpperCase | Code is single-digit uppercase alphabetic |
| Code1DigitLowerCase | Code is single-digit lowercase alphabetic |
| Code1DigitNumeric | Code is single-digit numeric |
| Code1DigitAlphanum | Code is single-digit alphanumeric, any case |
| Code1DigitAlphabetic | Code is single-digit alphabetic |
| -Code1DigitNumeric_2 | -Code is single-digit numeric |
| Code1DigitAlphanum_2 | Code is single-digit alphanumeric |
| CodeInRefMaster | Code is in reference source |
| CodeNotInDefaultValueList | Code is not all 9s |
| TextSubstrIsValueX | Substring Text value starting in position 3 for length 1 is 'X' |
| TextSubstrInRefList | Substring Text value starting in position 3 for length 3 is in reference list |
| Date1NumLessThanDate2 | Date <= 2nd date |
| Date1StringLessThanDate2 | String Date <= 2nd string date |
| Date1NumLessThanSysdate | Date <= System Date; not a future date |
| Date1StringLessThanSysdate | String Date <= System Date; not a future date |
| DateNumWithinLastYear | Date within one year of current date |
| DateStringWithinLastYear | Date within one year of current date |
| Date1NumWithin60DaysDate2 | Date within 60 days of 2nd date |
| Date1StringWithin60DaysDate2 | Date within 60 days of 2nd date |
| YearNumNotFuture | Year <= System Year; not a future year |
| YearStringNotFuture | Year <= System Year; not a future year |
| YearNumIsCurrentYear | Year = System Year; is same as year of system date |
| YearStringIsCurrentYear | Year = System Year; is same as year of system date |
| IdentifierUnique | Identifier is unique |
| CompoundIdentifierUnique | Compound Identifier is unique |
| Id1StXCharactersAlphabetic | 1st x characters of Identifier are alpha |
| IdLast4To5CharactersMatchId2 | Last 4-5 characters of one Identifier match a 2nd Identifier |
| IdOccursLtXTimes | Identifier occurs <x times |
| IdInValidRange | Identifier in valid range (single) |
| IdInValidMulti_Range | Identifier in valid range (multiple/segmented) |
| IdInRefMaster | Identifier in reference column |
| QtyInValidRange | Quantity in valid range |
| QtyAdditionValid | Quantity computation is valid (addition) |
| QtyArithmeticValid | Quantity computation is valid (arithmetic) |
| QtyMultiplyValid | Quantity computation is valid (multiplication) |
| QtyDivisionValid | Quantity computation is valid (division) |
| QtyStringInValidRange | Quantity in valid range — string value |
| QtyStringAdditionValid | Quantity computation is valid (addition) — string values |
| QtyStringArithmeticValid | Quantity computation is valid (arithmetic) — string values |
| QtyStringMultiplyValid | Quantity computation is valid (multiplication) — string values |
| QtyStringDivisionValid | Quantity computation is valid (division) — string values |
| IfFieldaIsXThenFieldbIsY | If fieldA = 'X' then fieldB = 'Y' |
| IfFieldaIsXThenFieldbDoesNotExist | If fieldA = 'X' then fieldB should not exist |
| IfFieldaIsXThenFieldbInValidValueSet | If fieldA = 'X' then fieldB is in valid set of values |
| IfFieldaInValidValueSetThenFieldbExists | If fieldA is in valid set of values then fieldB = 'X' |
| IfFieldaGtXThenFieldbIsY | If fieldA > qtyValue then fieldB = 'Y' |
| IfFieldaIsXThenFieldbGtQty | If fieldA ='X' then fieldB > qtyValue |
| FieldaPlusConcatenatedKeyInRefMaster | fieldA exists and a combination of literals + fieldA in a reference column |
| AlphanumValidNoNonprintCharacters | Alphanumeric with no non-print characters |
| FieldFreqLessThanPercentLimitForAllRecords | Field Frequency less than Pct Limit for all Records |
| AgeInRangeNumeric | Age: Age >= 0 and < 125 |
| AdultInRangeNumeric | Adult: Age >= 18 and < 125 |
| ChildInRangeNumeric | Child: Age >= 0 and < 18 |
| AgeInRangeString | String data Age: Age >= 0 and < 125 |
| AdultinrangeString | String data Adult: Age >= 18 and < 125 |
| ChildinrangeString | String data Child: Age >= 0 and < 18 |
| AgeinrangeCalc | Derived Age: Age >= 0 and < 125 |
| AdultinrangeCalc | Derived Age Adult: Age >= 18 and < 125 |
| ChildinrangeCalc | Derived Age Child: Age >= 0 and < 18 |
| ChildnotmarriedNumeric | If Child (numeric) then Marital Status = 'N' |
| ChildnotmarriedString | If Child (string) then Marital Status = 'N' |
| ChildnotmarriedCalc | If Child (derived) then Marital Status = 'N' |
| DobReasonableRangeNumeric | Date of Birth >= 1900-01-01 and <= System Date; not a future date |
| DobReasonableRangeString | Date of Birth >= 1900-01-01 and <= System Date; not a future date |
| IfDodExistsThenGtDobNumeric | Date of Death can be blank, but if it exists it must be >= Date of Birth and not a future date |
| IfDodExistsThenGtDobString | Date of Death can be blank, but if it exists it must be >= Date of Birth and not a future date |
| StartDtWeekday_Numeric | Start Date is a weekday |
| StartDtWeekday_String | Start Date is a weekday |
| MaturityDtExistsValidCondition | Maturity Date exists where condition is valid |
| Ssn9DigitNumeric | SSN is 9-digit numeric |
| SsnMatchesNumericFormat | SSN matches only numeric format |
| SsnMatchesHyphenFormat | SSN matches numeric format with hyphens |
| TinMatchesHyphenFormat | TIN matches numeric format with hyphens |
| SsnMatchesRegex | SSN matches regex format |
| CanadaSinMatchesRegex | SIN matches regex format |
| FranceInseeMatchesRegex | INSEE matches regex format |
| ItalyFiscalCdMatchesRegex | FiscalCode matches regex format |
| SpainNifMatchesRegex | NIF matches regex format |
| UkNinoMatchesRegex | NINO matches regex format |
| PassportNumMatchesRegex | PassportNumber matches regex format |
| CreditCardMatchesRegex | CreditCard matches regex format |
| PaidAmtWeeklyValid | Paid Amount computation is valid (multiplication) |
| PaidAmtAnnualValid | Paid Amount computation is valid (multiplication) |
| SalesamtBaseValid | Quantity * price computation is valid (multiplication) |
| SalesamtWTaxValid | Quantity * price computation is valid (with salesTax) |
| SalesamtWDiscountValid | Quantity * price computation is valid (with discount) |
| SalesamtWDiscountPlusTaxValid | Quantity * price computation is valid (with discount before salesTax) |
| SalesamtPriceLookupValid | Quantity * price computation is valid (multiplication) - w/ lookup |
| SalesamtWTaxLookupValid | Quantity * price computation is valid (with salesTax) — with lookup |
| SalesamtWDiscountLookupValid | Quantity * price computation is valid (with discount) — with lookup |
| SalesamtWDiscount_TaxLookupValid | Quantity * price computation is valid (with discount before salesTax) — with lookup |
| ProductDescriptionValidCharacters | Product Description has valid characters |
| EmailAddrValidFormat | Email Address has valid format |
| NotesEmailAddrValidFormat | Notes-style Email Address has valid format |
| InternalEmailValidDomain | Internal Email has valid domain |
| HostNameValidFormat | Host Name has valid format |
| UrlValidFormat | URL has valid format |
| IpaddressValidFormat | IPAddress has valid format |
| ValidNAmerAreaCode | Valid Area Code |
| ValidNAmerAreaCode_2 | Valid Area Code |
| ValidUsPhoneFormat | Valid US Phone format |
| ValidIsbnFormat | Valid ISBN format |
| ValidUpc_AFormat | Valid UPC-A format |
| ValidCountryOfOperation | Valid Country of Operation |
| ValidCountryOfIssue | Valid Country of Issue |
| SameTaxid_Name_1 | Same Taxid, Same Name — v1 |
| SamePolicy_CodeValue | all records which have the same policy number should have a specific code set to the same value |
| SumLineItemsEqTotalQty | sum of all line items = total qty for an order |
| SumLineItemsEqTotalValue | sum of all line item values = total value for an order |
| MrktValueChgCurr_Prior | compare market value of portfolio for current business day and previous business day for swings in value of more than 10 percent |
| CountryCodeExists | Country Code Exists |
| IsoCountryCode2DigitAlphaIsValid | ISO Country Code 2 digit alpha is valid |
| IsoCountryCode3DigitAlphaIsValid | ISO Country Code 3 digit alpha is valid |
| IsoCountryCode3DigitNumericIsValid | ISO Country Code 3 digit numeric is valid or user assigned (900 to 999) |
| IsoCountryCodeDidNotHaveStandardizationError | ISO Country Code did not have Standardization error |
The following rule definitions are included in the IARuleDefs-BaseSet1-USStan-v8x.xml file.
Table 2. Rule definitions in IARuleDefs-BaseSet1-USStan-v8x.xml
| Rule name | Rule short description |
| NameExists | Name Exists |
| NameTypeIsIndividualOrOrganization | Name Type is Individual or Organization |
| GenderCodeDerivedValid | Gender Code Derived Valid |
| NamePrefixIfExistsThenValid | Name Prefix If Exists Then Valid |
| FirstNameValid | First Name Valid |
| MiddleNameValid | Middle Name Valid |
| PrimaryNameValid | Primary Name Valid |
| NameGenerationInRefList | Name Generational In Ref List |
| NameSuffixIfExistsThenValid | Name Suffix If Exists Then Valid |
| AdditionalNameIfExistsThenValid | Additional Name If Exists Then Valid |
| UnhandledNamePatternExpectEmpty | Unhandled Name Pattern |
| UnhandledNameDataExpectEmpty | Unhandled Name Data |
| ExceptionNameDataExpectEmpty | Exception Name Data |
| FirstNameNoTestData | First Name — No Test Data |
| MiddleNameNoTestData | Middle Name — No Test Data |
| PrimaryNameNoTestData | Primary Name — No Test Data |
| AdditionalNameNoTestData | Additional Name - No Test Data |
| FirstNameNysiis | First Name NYSIIS Exists if First Name exists |
| PrimaryName1Nysiis | Primary Name 1 NYSIIS Exists if Primary Name 1 exists |
| PrimaryName2Nysiis | Primary Name 2 NYSIIS Exists if Primary Name 2 exists |
| MatchPrimaryWordsIs1 | If Match Primary Words is 1 then only 1 Primary Match Word |
| MatchPrimaryWordsIs2 | If Match Primary Words is 2 then only 2 Primary Match Words |
| MatchPrimaryWordsIs3 | If Match Primary Words is 3 then only 3 Primary Match Words |
| MatchPrimaryWordsIs4 | If Match Primary Words is 4 then only 4 Primary Match Words |
| MatchPrimaryWordsIsGe5 | If Match Primary Words is greater or equal to 5 then all 5 Primary Match Words |
| AddressExists | Address Exists |
| AddressTypeIfExistsThenIsInReferenceList | Address Type if exists then is in reference list |
| HouseNumberOrSuffixIfExistsThenValidCharacter | House Number or Suffix if exists then valid characters |
| StreetDirectionalIfExistsThenValidValues | Street Directional If Exists Then Valid |
| StreetNameIfExistsThenValidCharacters | Street Name If Exists Then Valid |
| BuildingNameIfExistsThenValidCharacters | Building Name If Exists Then Valid |
| BoxTypeIfExistsThenValidValues | Box Type If Exists Then Valid |
| BoxTypeAndBoxValueCombo | Box Type and Box Value Combo |
| FloorTypeIfExistsThenValidValues | Floor Type If Exists Then Valid |
| FloorTypeAndFloorValueCombo | Floor Type and Floor Value Combo |
| RuralRouteTypeIfExistsThenValidValues | Rural Route Type If Exists Then Valid |
| RuralRouteTypeAndRuralRouteValueCombo | Rural Route Type and Rural Route Value Combo |
| StreetTypeIfExistsThenValidCharacters | Street Type If Exists Then Valid |
| UnitAndMultiunitTypeIfExistsThenValidValue | Unit and Multi-Unit Type If Exists Then Valid |
| UnitTypeAndUnitValueOrMultiunitType-Multi-Uni | Unit Type - Unit Value or Multi-Unit Type - Multi-Unit Value Combo |
| AdditionalAddressIfExistsThenValid | Additional Address If Exists Then Valid |
| UnhandledAddressPattern | Unhandled Address Pattern |
| UnhandledAddressData | Unhandled Address Data |
| ExceptionAddressData | Exception Address Data |
| StreetNameNoTestData | Street Name No Test Data |
| AdditionalAddressNoTestData | Additional Address No Test Data |
| CityExists | City Exists |
| StateCodeIsInReferenceSource | State Code is in reference source |
| ZipCodeIsInReferenceSource | ZIP Code is in reference source |
| ZipPlus4CodeIsInReferenceSource | ZIP+4 code is in reference source |
| ZipPlus4CodeRemovingHyphensIsInReferenceSourc | Zip+4 Code removing hyphens is in reference source |
| CityStateAndZipCodeComboIsValid | City State and ZIP Code combo is valid |
| ValidStateLengthForCountryCode | Valid State Length for Country Code US |
| UnhandledAreaPattern | Unhandled Area Pattern |
| UnhandledAreaData | Unhandled Area Data |
| ExceptionAreaData | Exception Area Data |
| CityNoTestData | City No Test Data |
| Description | Name | Size | Download method |
|---|---|---|---|
| Rule-definition pkg 8.5 | IARuleDefs-BaseSet1-v85.zip | 11KB | HTTP |
| Rule-definition pkg 8.7 | IARuleDefs-BaseSet1-v87.zip | 10KB | HTTP |
Information about download methods
Learn
- For basic information on creating and
using rule definitions and data rules in Information Analyzer, see Creating Rules for Information Analysis in the Information
Analyzer User Guide.
- For best practices with data rules
functionality, see Data Quality Methodology in the Information Analyzer Best
Practices and Methodology Guide.
- For information on data quality rules in a
broader context, see the IBM Redbooks® publication "Metadata Management with IBM InfoSphere Information
Server."
- From a data rule deployment standpoint,
see the developerWorks article "InfoSphere best practices: Using IBM InfoSphere Information Analyzer
data quality rules in a productive environment."
- For additional information on using the
API and CLI functionality in Information Analyzer, see Developing Applications with the HTTP API in the Information
Analyzer User Guide.
- For information on rules performance, see
the developerWorks article "InfoSphere best practices: Understand the resource usage of data
quality rules in IBM InfoSphere Information
Analyzer."
- In the InfoSphere area on developerWorks, get the resources you need to
advance skills on a variety of IBM InfoSphere products.
- Learn more about Information Management at the developerWorks Information Management
zone. Find technical documentation,
how-to articles, education, downloads, product information, and
more.
- Stay current with
developerWorks technical events and webcasts.
- Follow developerWorks on
Twitter.
Get products and technologies
- Build your next
development project with
IBM trial software,
available for download directly from developerWorks.
- Now you can use
DB2 for free. Download DB2 Express-C, a no-charge
version of DB2 Express Edition for the community that offers the same core
data features as DB2 Express Edition and provides a solid base to build
and deploy applications.
Discuss
- Participate in the discussion forum.
- Check out the
developerWorks
blogs and get involved in the
developerWorks community.





