Using pre-built rule definitions with IBM InfoSphere Information Analyzer

Importing and using pre-defined data quality validations

A key challenge in assessing and monitoring information quality is starting the process to validate key business requirements. Rather than start off with a blank slate, this article includes and shows how to use pre-built rule definitions to get under way.

Share:

Harald Smith (smithha@us.ibm.com), Software Architect, IBM

Author Photo: Harald SmithWith nearly 30 years in IT and software development, Harald Smith has focused on the design and delivery of information integration and information quality solutions and products, including methods and best practices.


developerWorks Contributing author
        level

15 December 2011

Also available in Vietnamese

Learn how to utilize pre-built packages of IBM® InfoSphere® Information Analyzer data quality rules. We will show you how to understand the available content, how to use that information to address common data quality conditions, and how to then import it into your Information Analyzer environment to accelerate rule development and assessment.

Overview

With IBM InfoSphere Information Analyzer, you can create data quality rules to automatically monitor for potential data quality problems against defined business requirements or based on issues identified during data profiling. These rules can take time to develop and test for the broad range of data in a given table, system, or environment.

The objective of this article is to show you ways to accelerate this development through the import and use of the pre-built Information Analyzer rule definitions, which accompany this article. By using these pre-built data quality rule definitions, you should be able to accelerate development of data quality validation in your enterprise.

This article focuses on the following tasks:

  • Understanding rule definitions available in pre-built packages
  • Using pre-built rule definitions to address common data quality conditions
  • Reviewing the structure and content of the Information Analyzer rule definition XML file
  • Importing the pre-built data rule definitions using the HTTP/CLI API — a feature introduced in InfoSphere Information Analyzer V8.5 and enhanced in V8.7

The pre-built data quality rules accompanying this article are intended to:

  • Reduce the effort identifying data quality issues in many common information domains (keys, national identifiers, dates, country codes, email addresses, etc.) and conditions (completeness checks, valid values, range checks, aggregated totals, equations, etc.)
  • Serve as models, templates, and examples for your own additional rule design
  • Be utilized in Information Analyzer jobs (either V8.5 or V8.7), or via the Rule Stage available in Information Server V8.7.

Address common data domains and quality conditions

Almost any piece of data stored in a database or file or being processed through a job or web service has some associated condition indicating whether the data meets established validation rules. These conditions may be as simple as indicating that there must be data in the field (i.e., that it is complete) or when there is data that it conforms to some specified format or set of values (i.e., the values are valid). Or the conditions may indicate that the data must match records in a specified reference source, such as postal codes, or that a particular equation is correctly calculated.

The potential range of data that can be evaluated and the potential number of quality conditions that can be defined is substantial, and this article (and accompanying pre-built rule definitions) cannot address all possible circumstances. Instead, the focus is on providing a jumpstart for a set of data domains and conditions commonly found in many data sources.

Information Analyzer provides a capability to define the rule logic for such data domains and conditions separately from any physical data source so the same logic is consistently applied from data source to data source (i.e., one data rule definition can be applied and used with many data sources). Coupled with a capability to import a set of rule definitions in a defined XML format, you can take the pre-built rule definitions and load them into Information Analyzer and start applying them to your own data sources.

Rule definitions

Rule definitions follow a basic syntax where a variable, which could simply be a word or term, is evaluated based on a specified condition or type of check. The specified condition or check may or may not require some additional reference value, such as another variable, a list of values, a specified format, etc.. Further, several conditions may be connected together with IF, THEN, AND, or OR clauses. For example, a very simple rule definition could be as follows: DateOfBirth IS_DATE.

This condition indicates that a variable called DateOfBirth must be of a recognized date format.

In a slightly more complex case, you could have a rule definition such as Listing 1.

Listing 1. Sample rule definition
IF DateOfBirth EXISTS
AND DateOfBirth > datevalue('1900-01-01')
AND DateOfBirth < date()
THEN CustomerType = 'P'

Here, there is a conditional statement that checks whether the variable DateOfBirth exists and is within a set range, and only if those conditions are met, another variable called CustomerType is tested to see if it is equal to a specified value.

More information about creating and using rule definitions are available in the Information Analyzer user guide (see Data Rule definitions).

Basic data domain examples

The most basic rule definitions test for the completeness of a field or a standard alphabetic or numeric format. The pre-built rules include examples of these conditions.

Figure 1. General completeness and data type rules
Image shows rules and their descriptions

For instance, the rule definition AlphanumFieldExists evaluates the following condition: Field1 EXISTS AND len(trim(Field1)) <> 0.

This example includes several of Information Analyzer's basic capabilities:

  • The use of a general variable name — in this instance, simply called Field1
    • NOTE: The variable can be connected to (bound) to any column or field of data. This is the flexibility that allows one rule definition to provide the basis for many actual executable data rules.
  • Tests for multiple conditions — The existence of data and a not equal (<>) condition
    • NOTE: There is no specific limit to the number of conditions that can be included in one rule definition, though practically it is useful to keep rule definitions understandable. In creating rule definitions, look for basic building blocks and take advantage of Information Analyzer's rule-set capability to combine conditions instead of building all into one rule (see Data Rule Analysis techniques in the IBM Information Center].
  • The inclusion of functions — In this case, len and trim
    • NOTE: Refer to the Information Analyzer User Guide for the range of functions available. Functions can often be used to make conditions easier to solve. In this case, the functions are used to check for spaces in a field. The trim function first removes any number of blank (space) values from the left or right of any actual text. The len function identifies the length of all remaining alphanumeric characters with an expectation that the field will have at least one character value (that is, a length not equal to 0).

Data domains by data classification

At a basic level, beyond the generic examples above, you can roughly classify data into a set of common data domains as seen in the Column Analysis details of Information Analyzer:

  • Identifier — A field generally unique and can identify related data (e.g., Customer ID, National identifier)
  • Indicator — A field, often called a Flag, that has a binary condition (e.g., True or False, Yes or No, Female or Male)
  • Code — A field that has a distinct and defined set of values, often abbreviated (e.g., State Code, Customer Status)
  • Date — A field that contains some date value
  • Quantity — A field that contains a numeric value and is not classified as an Identifier or Code (e.g., Price, Amount, Asset Value)
  • Text — A field that contains alphanumeric values, possibly long text, and is not classified as an Identifier or Code (e.g., Name, Address, Description)

Additionally, there is a basic rule classification: Valid Value Combination, in which one field is of a certain value, a second field must be some specified value.

A subset of the pre-built rule definitions follows these general groupings and typical validation requirements. Figure 2, for example, highlights the pre-built rule definitions for Code fields.

Figure 2. Common classification-based rule definitions for codes
Image shows list of codes and their definitions

These rule definitions based on common data classifications typically assess structural formats or basic validation requirements (e.g., an Identifier must be in a range bounded by a low value and a high value, but it does not specify any exact values).

If, for instance, you have a Code field that allows the alphanumeric values 0-9, you might wish to apply the rule definition Code1DigitNumeric (see in Figure 2) to check that the field contains one single-digit numeric value. This rule definition is as follows: Code MATCHES_FORMAT '9'.

This example shows a straightforward condition:

  • The use of a general variable called Code
  • A test for a single format condition: MATCHES_FORMAT
    • NOTE: Information Analyzer has two distinct checks for data formatting: the MATCHES_FORMAT, shown here, and MATCHES_REGEX, which evaluates against a broad range of regular expression conditions (many examples can be found through a simple Google search on the term "regular expression").
  • The MATCHES_FORMAT condition requires a reference value; in this case, it expects one and only one numeric value (all numeric digits are represented by a 9).

Common data domains

As noted, there are many potential domains of data that could be incorporated into a package of pre-built data rule definitions. Examples of common domains included with the available packages include:

  • Demographic information
    • Age
    • Date of birth
    • Date of death
    • National identifier (e.g., U.S. Social Security number, Canadian SIN number, passport number, Italy Fiscal Code, etc.)
  • Internet address information
    • Email address
    • IP address
    • URL
  • Order/sales/policy information
    • Order amount and quantity
    • Sales amount (e.g., with or without sales tax, with or without discount)
    • Maturity date
    • Product code (e.g., ISBN code, UPC code)
  • Employment information
    • Start date
    • Paid amount
  • Phone information (North America only)
    • Phone format
    • Area code

These common domains span the range of data classes, providing more specific cases for your use, as well as rule conditions that can be more complex. Consider the following pre-built rule definition SalesamtWDiscountPlusTaxValid, which evaluates a sales amount field based on several variables, including a discount and a tax:

(qtyValue1 * price) - (qtyValue1 * discount) + (((qtyValue1 * price) - 
  (qtyValue1 * discount)) * salesTax) = totalAmount

This example highlights that the source information (in this case) or the reference information used in the validation can incorporate a number of criteria:

  • There are five variables in use in this logic:
    • qtyValue1 — The quantity of an item in an order or sale
    • price — The price of an item in an order or sale
    • discount — A discount applied on an item in an order or sale
    • salesTax — The sales tax applied on an order or sale
    • totalAmount — The total amount of an order or sale
    • NOTE: There is no specification in the rule definition itself as to where the data is actually stored; these variables might all be contained in one database or file or might come from several sources. That information is only required when the variables are bound when an executable data rule is generated.
  • A test for a single condition = (is equal to)
    • NOTE: This rule could also be written in reverse, where totalAmount is the source variable (on the left), which is equal to the reference data (the equation put on the right).
  • An equation using a series of functions (the standard numerical operators +, -, *, and /) and associated parentheses.

Standardized data domains (U.S.)

An accompanying pre-built rule package is targeted to validating the output of standardization processes for U.S. names, street addresses, and postal areas from IBM InfoSphere QualityStage®. The Standardize stage of QualityStage takes incoming data such as U.S. names and addresses that are not well-defined, parses that data, and creates a standardized form. For instance, consider the two addresses:

One hundred West Main Street apt 10
100 W Main St #10

Quite possibly, these two addresses represent the same location. But the differences in formatting and description would prevent such information from being connected. The output from the Standardize stage using a set of rules for U.S. addresses for both would produce:

Street#  Pre-direction	Street	St. Type   Unit    Unit#
100         W            Main     St        Apt     10
100         W            Main     St        Apt     10

Generally, these standardization rule sets produce a fairly consistent output, but there can be exceptions, such as new data, unexpected conditions, default or test data, and unusual formats. The pre-built rule definitions target these outputs, though they can be applied to any well-parsed name, address, or postal area information. As an example, the rule definition RuralRouteTypeIfExistsThenValidValues tests whether a rural route type is valid.

IF RuralRouteType EXISTS
AND len(trim(RuralRouteType)) <> 0
THEN rtrim(RuralRouteType) IN_REFERENCE_LIST {'RR','RTE','HC','CONTRACT'}

This example highlights several criteria put into an IF…THEN condition:

  • The IF…AND… is the same as the completeness example AlphanumFieldExists shown above. When expressed within an IF condition, only records where the field has a value will be evaluated with the subsequent THEN condition. Records where there is no value present will not be evaluated and will not generate any exceptions.
  • The THEN condition is the basis for meeting or not meeting the rule definition. Within the condition, the rtrim function removes any spaces on the right of RuralRouteType, and the resulting value is evaluated against a set of four valid values given in the list.
  • NOTE: These type of IF…THEN rule definitions work well together as part of a larger rule set. In essence, they describe a series of cases, each with distinct criteria. By defining the rule definitions separately and grouping into the rule set, it allows more insight into which records have issues, as well as how many records violate specific rules.

Using the pre-built rule definitions

The accompanying pre-built rule definitions can be considered from a design perspective and a deployment perspective.

Design-time accelerators, templates, and models

From a design perspective, you can use the pre-built rule definitions as-is, copy/modify to meet your needs, or use them as design models. The subsequent section on "Importing the pre-built rule definitions" describes the basic steps to bring the pre-built packages into your project(s).

The file IARuleDefs-BaseSet1-General-v8x.xml includes more than 130 definitions for the general rules and common domains described above. The file IARuleDefs-BaseSet1-USStan-v8x.xml includes nearly 60 definitions for the validation of U.S. name, address, and postal area standardized information described above.

First, once imported into your project, you may immediately use these rule definitions to test or assess your data sources, generating data rules as described in the Information Analyzer User Guide (see Generating a data rule from a rule definition). In this capacity, the rule definitions accelerate your ability to start detailed data quality assessment.

Second, you can use these rule definitions as templates to customize for your own specific data conditions. Consider the example case where you have a field named Region that represents a specific segment of the world. Region is defined as a text field that is five characters long, and the first two characters are alphabetic characters that must be in the following list: AM (Africa and Middle East), AP (Asia-Pacific), EU (Europe), NA (North America), and SA (South America).

The pre-built rule definitions do not include such an exact rule definition. However, the rule definition TextSubstrInRefList is described as "Substring Text value starting in position 3 for length 3 is in reference list." This is similar to the rule definition you need: evaluate a substring for inclusion in a reference list.

In this instance, you could do the following:

  1. Log in to Information Analyzer.
  2. Open your project and navigate to the Develop menu and the Data Quality menu item.
  3. Select the desired rule definition in your project (TextSubstrInRefList, in this case).
  4. Select the Create a Copy task, as in Figure 3.
    Figure 3. Create a Copy task
    Image shows that Create a Copy is highlighted
  5. In the dialog box to Create a Copy, select OK.
  6. This will create a replica of the original rule called (Copy_of_TextSubstrInRefList, in this case).
  7. Open this new rule definition to edit as needed:
    • Change the rule-definition name: Region_SubstrInRefList.
    • Change the substring function from:
      • Before: substring(TextField, 3, 3)
      • After: substring(Region, 1, 2)
      • NOTE: In this case, you want to start the substring function at the first character for a length of 2.
    • Change the reference list data from:
      • Before: {'AAA','AAB','BAA','CCC'}
      • After: {'AM','AP','EU','NA','SA'}
  8. Save your updated rule definition.

Third, you can use the rule definitions as reference models — examples of specific functions or conditions in use that can guide you as you design and develop unique rules for your environment.

Deployment approaches for quality validation and monitoring

As with all rule definitions, the pre-built packages can be:

  • Used to generate executable data rules for quality validation.
  • Included in rule-set definitions and executable rule sets to validate multiple conditions together.
    • While discussed in greater detail in the Information Analyzer Best Practices and Methodology Guide (see Resources), rule sets have several distinct deployment advantages:
      • They provide support for evaluating data based on many data rule conditions. With the pre-built rule definitions, you can combine as many of these as needed to evaluate all the fields in a given record, including multiple instances of the same rule definition, such as FieldExists.
      • They score all tested rules for each record in the set so results can be viewed in multiple dimensions. (For example, you can see all records that fail each specific rule, see all rules that a given record failed, or look at intersections of specific sets of rules).
      • They optimize the rule evaluation for execution and processing.
    • NOTE: Any rule-set definition you create can contain these pre-built rule definitions and/or your own rule definitions in any combination.
  • Published for users in other projects to take advantage of — When you import the pre-built rule definitions, they are imported to your project. For other users who are not part of your project to use, the rule definitions must be published or imported into their projects.
  • Exported for deployment in other Information Analyzer environments — For example, if you are working in a development environment with test data to ensure your data rules work correctly, you may then need to export those data rules to a production environment for ongoing quality monitoring.

With the introduction of Information Analyzer V8.7, rule definitions built in Information Analyzer can be added to a new Rule Stage in an IBM InfoSphere DataStage or QualityStage job. This capability allows the use of any published rule definition to validate data that is part of data integration or data-cleansing processes, including those added through the accompanying pre-built rule-definition packages.

For example, you receive a file daily from an external third-party source. The quality of the data source is often low, resulting in problems in other information systems, including your business reporting. This daily file currently runs through a QualityStage job to standardize the file and load to existing data sources. You want to test incoming data for completeness using a set of rule definitions and validate the results of the QualityStage standardization output.

Figure 4 shows the addition of the new Rule Stage, CustomerValidityCheck, in this sample job. The Rule Stage could have one rule definition or many, depending on the number of data fields that need to be validated. Outputs from this stage include valid data, invalid data, and specific violation details.

Figure 4. Validate data in-process in DataStage or QualityStage
Image shows data validation process flow diagram from DataStage

See Using the Data Rule Stage for more details on this capability.

By taking advantage of pre-built rule definitions, you can:

  • Reduce the effort to address many common information domains and conditions.
  • Provide models and publish rule definitions for other users to work from.
  • Accelerate the process of assessing, testing, and deploying data rules in Information Analyzer.
  • Deploy rule definitions for ongoing quality monitoring and in-flight data validation.

Understanding the pre-built rule definition packages

The Information Analyzer pre-built rule definitions that accompany this article are imported through the Information Analyzer API.

Content structure

The pre-built definitions are structured using a defined XML schema. For full details on the structure, refer to Schema File elements for Rule Definitions.

At a brief level, the definition files look like Listing 2.

Listing 2. Rule definition XML schema
<?xml version="1.0" encoding="UTF-8"?>
<iaapi:Project xmlns:iaapi=\
"http://www.ibm.com/investigate/api/iaapi" name="your-project">
  "<!--IBM InfoSphere Information Analyzer Base Set rule definitions v1.0 
    For use with IBM InfoSphere Information Server v8.7              
    Copyright IBM 2011.  All rights reserved.                        
    Disclaimer:INTERNATIONAL BUSINESS MACHINES CORPORATION PROVIDES THIS
    PUBLICATION 'AS IS' WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS 
    OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF 
    NON-INFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. 
    Some states do not allow disclaimer of express or implied warranties 
    in certain transactions, therefore, this statement may not apply to you.-->"
  <DataRuleDefinitions>
    <DataRuleDefinition name='FieldExists'  folder='General Rules / General'>
      <description>Field Exists; null check only</description>
      <expression>Field1 EXISTS</expression>
    </DataRuleDefinition>
    <DataRuleDefinition name='AlphanumFieldExists'  folder='General Rules / Alphanumeric'>
      <description>Alphanumeric Field Exists; null & \
blank value check</description>
      <expression>Field1 EXISTS AND len(trim(Field1)) <> 0</expression>
    </DataRuleDefinition>
    <DataRuleDefinition name='AlphaFormatValid'  folder='General Rules / Alphabetic'>
      <description>Example Alphabetic Format; excludes null values</description>
      <expression>IF  Field1  EXISTS THEN Field1 MATCHES_FORMAT 'AAAA'</expression>
    </DataRuleDefinition>
    <DataRuleDefinition name='AlphanumFormatValid'  folder='General Rules / Alphanumeric'>
      <description>Example Alphanumeric Format 1; as with \
Vehicle plate #; excludes null values</description>
      <expression>IF  Field1  EXISTS THEN Field1 MATCHES_FORMAT '999AAA'</expression>
    </DataRuleDefinition>

The content includes:

  • A general XML header: <?xml version="1.0" encoding="UTF-8"?>, which should not be changed.
  • A specific XML header for Information Analyzer: <iaapi:Project xmlns:iaapi="http://www.ibm.com/investigate/api/iaapi" name="your-project"> . You will need to modify your-project to whatever Information Analyzer project name you use.
  • XML comments enclosed between <!-- and -->
  • The start of the rule definitions section: <DataRuleDefinitions>
  • The block of content for each rule definition, including:
    • The rule-definition name
    • For the V8.7 files only, a folder
    • The rule-definition description
    • The expression (the rule logic)
    Example:
    <DataRuleDefinition name='FieldExists' folder='General Rules / General'>
      <description>Field Exists; null check only</description>
      <expression>Field1 EXISTS</expression>
    </DataRuleDefinition>
  • After all rule-definition blocks, the end of the rule-definitions section and the end of the XML content:
     </DataRuleDefinitions>
     </iaapi:Project>

Pre-built rule-definition packages available

There are two rule-definition packages available (see Downloads). One is for use with Information Analyzer V8.5, the other for use with V8.7. The rule-definition content is the same for both packages. The only difference is that the V8.7 includes a folder reference that will add the rule definition to a specific folder during import. That option is not available during import in V8.5.

For IBM InfoSphere Information Analyzer V8.5:

  • IARuleDefs-BaseSet1-v85.zip
    • General domain and condition rule definitions:
      IARuleDefs-BaseSet1-General-v85.xml
    • U.S. standardization validation condition rule definitions:
      IARuleDefs-BaseSet1-USStan-v85.xml
  • NOTE: This package can be used in Information Analyzer V8.5 or V8.7.

For IBM InfoSphere Information Analyzer V8.7:

  • IARuleDefs-BaseSet1-v87.zip
    • General domain and condition rule definitions:
      IARuleDefs-BaseSet1-General-v87.xml
    • U.S. standardization validation condition rule definitions:
      IARuleDefs-BaseSet1-USStan-v87.xml
  • NOTE: This package can be only be used in Information Analyzer V8.7 as it includes Folder options not previously available in the API and CLI.

You will need to:

  1. Choose the package that is right for your Information Analyzer version and download from the attachments to this article.
  2. Save the file to a location on your computer.
  3. Unzip the package and extract the two XML files to a location on your computer.

Importing the pre-built rule definitions

There are two options to import or load the pre-built definitions: a Command-Line Interface (CLI) and a REST (HTTP) API, which is accessed via a browser.

Command-Line Interface (CLI)

  • The command is IAAdmin.
  • It is available on the Information Analyzer client and server.
  • Advantages:
    • No POST add-ons are required for your browser.
    • Many technical people like command-line utilities.
  • Disadvantages
    • Though relatively short, some people find command-line syntax confusing.
    • Some environments disable command-line utilities.

REST API interface (browser)

  • It uses an HTTP POST utility.
  • Advantages:
    • Web-based GUI
  • Disadvantages:
    • It requires a POST add-on (some environments do not allow browser add-ons).

This article describes both methods of import. The subsequent import steps assume that you:

  1. Download, unzip (extract), and save the Rule Definition XML files to wherever the Information Analyzer Client or server reside (this can be to a remote server, a remote Client Image, or your own machine).
    • NOTE: These instructions assume you have downloaded on your own machine and will be imported from your local environment.
  2. Open each XML file you wish to use with Notepad (or any other basic file editor).
  3. Change the project name (listed as "your-project" as noted above) to one of your choosing, which already exists in your Information Analyzer environment you have access to.
    • Note: if there is not yet an Information Analyzer project you (or your IA Project Administrator) will need to create one
    • NOTE: If you do not change the project name in the XML file, a project named your-project will be created and all the rule definitions will go there.
  4. Save the XML file(s).

Command-Line Import (CLI)

To perform the import of the Information Analyzer rule definitions via the CLI:

  1. Open a command (DOS) prompt on the client. For example, on Windows® XP, you can use Start > All Programs > Accessories > Command Prompt.
  2. Navigate to C:\IBM\InformationServer\ASBNode\bin.
  3. Run the following command:
    IAAdmin -user xxxxx -password xxxxx -host your-server 
    -port 9080 -create -projectContent C:\Temp\IARuleDefs-BaseSet1-General
    -v87.xml
  4. Use the Information Analyzer configuration information in the above command relevant for your environment:
    • –user (your Information Analyzer user ID)
    • –password (your Information Analyzer password)
    • –host (your Information Analyzer server — match your login information)
    • –port (usually 9080 — match your login information)
    • After –projectContent, include the location where you saved the XML file. The example shows the file saved at C:\Temp\, but the location and name of your file may be different from the one shown here. Use v85 or v87 in the filename as appropriate for your installed version.
    NOTES:
    • If you are importing the BaseSet1-General and the BaseSet1-USStan rule-definition packages, you need to change the filename and run the command a second time for the second import.
    • You will receive an error if you run the above command twice with the same filename. The error will tell you that the rules already exist. If this happens, then re-issue the above command with -update instead of -create.
    • If you attempt to import the v87 files into an Information Analyzer 8.5 version, you will receive an error message like the following:
      >The XML request passed as parameter could not be parsed for the following reason: Feature 'folder' not found.
    • The above command will run for approximately 4-5 minutes.

If the import was successful, you can log in to Information Analyzer, open your project (same as specified in the XML file in the import), and review the imported rule definitions. You should see a list of rule definitions similar to Figure 5.

Figure 5. Imported rule definitions
screen shows rule names, types, and descriptions

Web browser import (REST API)

The following browser import example uses Firefox and the associated Poster add-on. If you have a different browser, you will need to find a utility that allows you to perform a POST function in order to import the pre-built Information Analyzer rule definitions with this method.

Before performing an import via this method, you need a Firefox utility to allow the POST function. Take the following steps (these may vary depending on the version of your browser):

  1. Open Firefox and navigate to https://addons.mozilla.org/en-US/firefox/addon/poster/.
  2. Select Add to Firefox.
  3. Select Install Now when prompted.
  4. Close and restart Firefox.
  5. After Firefox restarts, enable the Add-on bar with the following (again depending on version)
    • Firefox > Options > Add-on Bar or
    • Firefox > Tools > Add-ons
  6. Depending on the version, you may see the add-on as:
    • A gold-colored P in the lower right of your browser
    • A menu option under Firefox > Tools > Poster

To perform the import of the Information Analyzer rule definitions via the browser:

  1. Open Firefox and navigate to the add-on.
  2. Select the Poster add-on. You should see a browser form as shown in Figure 6.
    Figure 6. Firefox Poster form
    Image shows input form with fields for url, user auth, timeout, actions, and file
  3. Fill in the form:
    URL:http://<yourInformationAnalyzerhost>\ :9080/InformationAnalyzer/create?projectContent
    User: Your Information Analyzer user ID
    Password: your Information Analyzer password
    Timeout: Change to 300
    File: enter the filename or select the Browse button to select the XML file from the appropriate directory. Note that the name of your file will be different from the one shown here. Use v85 or v87 as appropriate for your install.
    POST: Select this to run the load process.
  4. The POST will take a bit (3-5 minutes), and you should see a message indicating status of OK.
    NOTE: It is possible that you will receive a timeout response if the Timeout is not changed. This simply affects the response to the client on status, but should not affect the post function on the server.

If the import was successful, you can log in to Information Analyzer, open your project (same as specified in the XML file in the import), and review the imported rule definitions. You should see a list of rule definitions similar to Figure 5 above.

You are now ready to begin using your pre-built Information Analyzer rule definitions.


Conclusion

You should now be able to use the pre-built set of rule definitions accompanying this article in your IBM InfoSphere Information Analyzer projects.

This article has specifically looked at how to do the following tasks:

  • Understand rule definitions available in pre-built packages accompanying this article.
  • Use the pre-built rule definitions to address common data-quality conditions.
  • Review the structure and content of the rule-definition XML file.
  • Import the pre-built data rule definitions using the HTTP/CLI API for either Information Analyzer V8.5 or V8.7.

Once imported, you can use the pre-built rule definitions to establish data quality rules, then test and monitor for potential data-quality problems. And, by using pre-built rule definitions, you can shorten the time needed to put in data quality checks in place for a broad range of data in any given table, system, or environment.


Appendix: Pre-built rule-definition names and descriptions

The pre-built rule definitions are provided for use with IBM InfoSphere Information Server.

The following rule definitions are included in the IARuleDefs-BaseSet1-General-v8x.xml file.

Table 1. Rule definitions in IARuleDefs-BaseSet1-General-v8x.xml
Rule name Rule short description
FieldExists Field Exists
AlphanumFieldExists Alphanumeric Field Exists
AlphaFormatValid Example Alphabetic Format
AlphanumFormatValid Example Alphanumeric Format 1
AlphanumFormatValid_2 Example Alphanumeric Format 2
FieldIsNumeric Field is numeric
FieldExistsAndNumeric Field exists and is numeric
FieldExistsAndNumeric_2 Field exists and is numeric
FieldIsDate Field is date value
FieldExistsAndDate Field exists and is date value
FieldExistsAndDate_2 Field exists and is date value
IndicatorY_NValid Indicator Field is 'Y' or 'N'
IndicatorUpperCaseY_NValid Indicator Field is any case of 'Y' or 'N'
IndicatorT_FValid Indicator Field is 'T' or 'F'
IndicatorUpperCaseT_FValid Indicator Field is any case of 'T' or 'F'
IndicatorString0_1Valid Indicator Field is '1' or '0'
IndicatorNum0_1Valid Indicator Field is numeric 1 or 0
Code1DigitUpperCase Code is single-digit uppercase alphabetic
Code1DigitLowerCase Code is single-digit lowercase alphabetic
Code1DigitNumeric Code is single-digit numeric
Code1DigitAlphanum Code is single-digit alphanumeric, any case
Code1DigitAlphabetic Code is single-digit alphabetic
-Code1DigitNumeric_2 -Code is single-digit numeric
Code1DigitAlphanum_2 Code is single-digit alphanumeric
CodeInRefMaster Code is in reference source
CodeNotInDefaultValueList Code is not all 9s
TextSubstrIsValueX Substring Text value starting in position 3 for length 1 is 'X'
TextSubstrInRefList Substring Text value starting in position 3 for length 3 is in reference list
Date1NumLessThanDate2 Date <= 2nd date
Date1StringLessThanDate2 String Date <= 2nd string date
Date1NumLessThanSysdate Date <= System Date; not a future date
Date1StringLessThanSysdate String Date <= System Date; not a future date
DateNumWithinLastYear Date within one year of current date
DateStringWithinLastYear Date within one year of current date
Date1NumWithin60DaysDate2 Date within 60 days of 2nd date
Date1StringWithin60DaysDate2 Date within 60 days of 2nd date
YearNumNotFuture Year <= System Year; not a future year
YearStringNotFuture Year <= System Year; not a future year
YearNumIsCurrentYear Year = System Year; is same as year of system date
YearStringIsCurrentYear Year = System Year; is same as year of system date
IdentifierUnique Identifier is unique
CompoundIdentifierUnique Compound Identifier is unique
Id1StXCharactersAlphabetic 1st x characters of Identifier are alpha
IdLast4To5CharactersMatchId2 Last 4-5 characters of one Identifier match a 2nd Identifier
IdOccursLtXTimes Identifier occurs <x times
IdInValidRange Identifier in valid range (single)
IdInValidMulti_Range Identifier in valid range (multiple/segmented)
IdInRefMaster Identifier in reference column
QtyInValidRange Quantity in valid range
QtyAdditionValid Quantity computation is valid (addition)
QtyArithmeticValid Quantity computation is valid (arithmetic)
QtyMultiplyValid Quantity computation is valid (multiplication)
QtyDivisionValid Quantity computation is valid (division)
QtyStringInValidRange Quantity in valid range — string value
QtyStringAdditionValid Quantity computation is valid (addition) — string values
QtyStringArithmeticValid Quantity computation is valid (arithmetic) — string values
QtyStringMultiplyValid Quantity computation is valid (multiplication) — string values
QtyStringDivisionValid Quantity computation is valid (division) — string values
IfFieldaIsXThenFieldbIsY If fieldA = 'X' then fieldB = 'Y'
IfFieldaIsXThenFieldbDoesNotExist If fieldA = 'X' then fieldB should not exist
IfFieldaIsXThenFieldbInValidValueSet If fieldA = 'X' then fieldB is in valid set of values
IfFieldaInValidValueSetThenFieldbExists If fieldA is in valid set of values then fieldB = 'X'
IfFieldaGtXThenFieldbIsY If fieldA > qtyValue then fieldB = 'Y'
IfFieldaIsXThenFieldbGtQty If fieldA ='X' then fieldB > qtyValue
FieldaPlusConcatenatedKeyInRefMaster fieldA exists and a combination of literals + fieldA in a reference column
AlphanumValidNoNonprintCharacters Alphanumeric with no non-print characters
FieldFreqLessThanPercentLimitForAllRecords Field Frequency less than Pct Limit for all Records
AgeInRangeNumeric Age: Age >= 0 and < 125
AdultInRangeNumeric Adult: Age >= 18 and < 125
ChildInRangeNumeric Child: Age >= 0 and < 18
AgeInRangeString String data Age: Age >= 0 and < 125
AdultinrangeString String data Adult: Age >= 18 and < 125
ChildinrangeString String data Child: Age >= 0 and < 18
AgeinrangeCalc Derived Age: Age >= 0 and < 125
AdultinrangeCalc Derived Age Adult: Age >= 18 and < 125
ChildinrangeCalc Derived Age Child: Age >= 0 and < 18
ChildnotmarriedNumeric If Child (numeric) then Marital Status = 'N'
ChildnotmarriedString If Child (string) then Marital Status = 'N'
ChildnotmarriedCalc If Child (derived) then Marital Status = 'N'
DobReasonableRangeNumeric Date of Birth >= 1900-01-01 and <= System Date; not a future date
DobReasonableRangeString Date of Birth >= 1900-01-01 and <= System Date; not a future date
IfDodExistsThenGtDobNumeric Date of Death can be blank, but if it exists it must be >= Date of Birth and not a future date
IfDodExistsThenGtDobString Date of Death can be blank, but if it exists it must be >= Date of Birth and not a future date
StartDtWeekday_Numeric Start Date is a weekday
StartDtWeekday_String Start Date is a weekday
MaturityDtExistsValidCondition Maturity Date exists where condition is valid
Ssn9DigitNumeric SSN is 9-digit numeric
SsnMatchesNumericFormat SSN matches only numeric format
SsnMatchesHyphenFormat SSN matches numeric format with hyphens
TinMatchesHyphenFormat TIN matches numeric format with hyphens
SsnMatchesRegex SSN matches regex format
CanadaSinMatchesRegex SIN matches regex format
FranceInseeMatchesRegex INSEE matches regex format
ItalyFiscalCdMatchesRegex FiscalCode matches regex format
SpainNifMatchesRegex NIF matches regex format
UkNinoMatchesRegex NINO matches regex format
PassportNumMatchesRegex PassportNumber matches regex format
CreditCardMatchesRegex CreditCard matches regex format
PaidAmtWeeklyValid Paid Amount computation is valid (multiplication)
PaidAmtAnnualValid Paid Amount computation is valid (multiplication)
SalesamtBaseValid Quantity * price computation is valid (multiplication)
SalesamtWTaxValid Quantity * price computation is valid (with salesTax)
SalesamtWDiscountValid Quantity * price computation is valid (with discount)
SalesamtWDiscountPlusTaxValid Quantity * price computation is valid (with discount before salesTax)
SalesamtPriceLookupValid Quantity * price computation is valid (multiplication) - w/ lookup
SalesamtWTaxLookupValid Quantity * price computation is valid (with salesTax) — with lookup
SalesamtWDiscountLookupValid Quantity * price computation is valid (with discount) — with lookup
SalesamtWDiscount_TaxLookupValid Quantity * price computation is valid (with discount before salesTax) — with lookup
ProductDescriptionValidCharacters Product Description has valid characters
EmailAddrValidFormat Email Address has valid format
NotesEmailAddrValidFormat Notes-style Email Address has valid format
InternalEmailValidDomain Internal Email has valid domain
HostNameValidFormat Host Name has valid format
UrlValidFormat URL has valid format
IpaddressValidFormat IPAddress has valid format
ValidNAmerAreaCode Valid Area Code
ValidNAmerAreaCode_2 Valid Area Code
ValidUsPhoneFormat Valid US Phone format
ValidIsbnFormat Valid ISBN format
ValidUpc_AFormat Valid UPC-A format
ValidCountryOfOperation Valid Country of Operation
ValidCountryOfIssue Valid Country of Issue
SameTaxid_Name_1 Same Taxid, Same Name — v1
SamePolicy_CodeValue all records which have the same policy number should have a specific code set to the same value
SumLineItemsEqTotalQty sum of all line items = total qty for an order
SumLineItemsEqTotalValue sum of all line item values = total value for an order
MrktValueChgCurr_Prior compare market value of portfolio for current business day and previous business day for swings in value of more than 10 percent
CountryCodeExists Country Code Exists
IsoCountryCode2DigitAlphaIsValid ISO Country Code 2 digit alpha is valid
IsoCountryCode3DigitAlphaIsValid ISO Country Code 3 digit alpha is valid
IsoCountryCode3DigitNumericIsValid ISO Country Code 3 digit numeric is valid or user assigned (900 to 999)
IsoCountryCodeDidNotHaveStandardizationError ISO Country Code did not have Standardization error

The following rule definitions are included in the IARuleDefs-BaseSet1-USStan-v8x.xml file.

Table 2. Rule definitions in IARuleDefs-BaseSet1-USStan-v8x.xml
Rule name Rule short description
NameExists Name Exists
NameTypeIsIndividualOrOrganization Name Type is Individual or Organization
GenderCodeDerivedValid Gender Code Derived Valid
NamePrefixIfExistsThenValid Name Prefix If Exists Then Valid
FirstNameValid First Name Valid
MiddleNameValid Middle Name Valid
PrimaryNameValid Primary Name Valid
NameGenerationInRefList Name Generational In Ref List
NameSuffixIfExistsThenValid Name Suffix If Exists Then Valid
AdditionalNameIfExistsThenValid Additional Name If Exists Then Valid
UnhandledNamePatternExpectEmpty Unhandled Name Pattern
UnhandledNameDataExpectEmpty Unhandled Name Data
ExceptionNameDataExpectEmpty Exception Name Data
FirstNameNoTestData First Name — No Test Data
MiddleNameNoTestData Middle Name — No Test Data
PrimaryNameNoTestData Primary Name — No Test Data
AdditionalNameNoTestData Additional Name - No Test Data
FirstNameNysiis First Name NYSIIS Exists if First Name exists
PrimaryName1Nysiis Primary Name 1 NYSIIS Exists if Primary Name 1 exists
PrimaryName2Nysiis Primary Name 2 NYSIIS Exists if Primary Name 2 exists
MatchPrimaryWordsIs1 If Match Primary Words is 1 then only 1 Primary Match Word
MatchPrimaryWordsIs2 If Match Primary Words is 2 then only 2 Primary Match Words
MatchPrimaryWordsIs3 If Match Primary Words is 3 then only 3 Primary Match Words
MatchPrimaryWordsIs4 If Match Primary Words is 4 then only 4 Primary Match Words
MatchPrimaryWordsIsGe5 If Match Primary Words is greater or equal to 5 then all 5 Primary Match Words
AddressExists Address Exists
AddressTypeIfExistsThenIsInReferenceList Address Type if exists then is in reference list
HouseNumberOrSuffixIfExistsThenValidCharacter House Number or Suffix if exists then valid characters
StreetDirectionalIfExistsThenValidValues Street Directional If Exists Then Valid
StreetNameIfExistsThenValidCharacters Street Name If Exists Then Valid
BuildingNameIfExistsThenValidCharacters Building Name If Exists Then Valid
BoxTypeIfExistsThenValidValues Box Type If Exists Then Valid
BoxTypeAndBoxValueCombo Box Type and Box Value Combo
FloorTypeIfExistsThenValidValues Floor Type If Exists Then Valid
FloorTypeAndFloorValueCombo Floor Type and Floor Value Combo
RuralRouteTypeIfExistsThenValidValues Rural Route Type If Exists Then Valid
RuralRouteTypeAndRuralRouteValueCombo Rural Route Type and Rural Route Value Combo
StreetTypeIfExistsThenValidCharacters Street Type If Exists Then Valid
UnitAndMultiunitTypeIfExistsThenValidValue Unit and Multi-Unit Type If Exists Then Valid
UnitTypeAndUnitValueOrMultiunitType-Multi-Uni Unit Type - Unit Value or Multi-Unit Type - Multi-Unit Value Combo
AdditionalAddressIfExistsThenValid Additional Address If Exists Then Valid
UnhandledAddressPattern Unhandled Address Pattern
UnhandledAddressData Unhandled Address Data
ExceptionAddressData Exception Address Data
StreetNameNoTestData Street Name No Test Data
AdditionalAddressNoTestData Additional Address No Test Data
CityExists City Exists
StateCodeIsInReferenceSource State Code is in reference source
ZipCodeIsInReferenceSource ZIP Code is in reference source
ZipPlus4CodeIsInReferenceSource ZIP+4 code is in reference source
ZipPlus4CodeRemovingHyphensIsInReferenceSourc Zip+4 Code removing hyphens is in reference source
CityStateAndZipCodeComboIsValid City State and ZIP Code combo is valid
ValidStateLengthForCountryCode Valid State Length for Country Code US
UnhandledAreaPattern Unhandled Area Pattern
UnhandledAreaData Unhandled Area Data
ExceptionAreaData Exception Area Data
CityNoTestData City No Test Data

Downloads

DescriptionNameSize
Rule-definition pkg 8.5IARuleDefs-BaseSet1-v85.zip11KB
Rule-definition pkg 8.7IARuleDefs-BaseSet1-v87.zip10KB

Resources

Learn

Get products and technologies

  • Build your next development project with IBM trial software, available for download directly from developerWorks.
  • Now you can use DB2 for free. Download DB2 Express-C, a no-charge version of DB2 Express Edition for the community that offers the same core data features as DB2 Express Edition and provides a solid base to build and deploy applications.

Discuss

Comments

developerWorks: Sign in

Required fields are indicated with an asterisk (*).


Need an IBM ID?
Forgot your IBM ID?


Forgot your password?
Change your password

By clicking Submit, you agree to the developerWorks terms of use.

 


The first time you sign into developerWorks, a profile is created for you. Information in your profile (your name, country/region, and company name) is displayed to the public and will accompany any content you post, unless you opt to hide your company name. You may update your IBM account at any time.

All information submitted is secure.

Choose your display name



The first time you sign in to developerWorks, a profile is created for you, so you need to choose a display name. Your display name accompanies the content you post on developerWorks.

Please choose a display name between 3-31 characters. Your display name must be unique in the developerWorks community and should not be your email address for privacy reasons.

Required fields are indicated with an asterisk (*).

(Must be between 3 – 31 characters.)

By clicking Submit, you agree to the developerWorks terms of use.

 


All information submitted is secure.

Dig deeper into Information management on developerWorks


static.content.url=http://www.ibm.com/developerworks/js/artrating/
SITE_ID=1
Zone=Information Management
ArticleID=781217
ArticleTitle=Using pre-built rule definitions with IBM InfoSphere Information Analyzer
publish-date=12152011