Data Rules stage

Use the Data Rules stage to check data quality anywhere in the flow of a job.

By using the Data Rules stage, you can include rules that you create in IBM® InfoSphere® Information Analyzer directly in the flow of a job. These rules check the quality of in-stream data. The Data Rules stage can be added anywhere in a job, and you can add it multiple times to check for data anomalies and validate the quality of the data. By including additional downstream stages in the job, you can analyze or transform the invalid records, and send valid records downstream for further processing.

The Data Rules stage can use all of the published data rule definitions that you created in InfoSphere Information Analyzer. You can also create ad hoc rules that can be used only within that stage. If you have the Rule Author role in InfoSphere Information Analyzer, you can create and publish rule definitions and rule set definitions directly from the stage itself. All rule definitions that you publish are saved to the metadata repository so that they are also available for use within InfoSphere Information Analyzer.

In a job that uses the Data Rules stage, you can design both the preprocessing of the data (bringing the data into the right form before the Data Rules stage processes it) and the postprocessing of the results. You can chain several Data Rules stages or store good and bad records in different target tables. You can also manage exceptions within the job flow so that, for example, low-quality data does not pass through to a data warehouse.

The Data Rules stage supports three output types: one for records that meet all rules, one for records that fail one or more rules, and one that contains detailed information about each record that failed. The details describe each condition that the record did not meet. You can use this information for additional analysis to determine the cause of the failures, and then correct the data either at the source or in-stream within the job.
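The three output types can be pictured with a small, product-independent sketch. The Python below is not the Data Rules stage API and does not use InfoSphere rule syntax. The rule names, the record fields, and the route function are all hypothetical, chosen only to show how one pass over the input can produce a valid output, an invalid output, and a per-condition violation detail output.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, object]
Rule = Callable[[Record], bool]

# Hypothetical rules for illustration only; real rules are defined in
# InfoSphere Information Analyzer or in the stage itself.
RULES: Dict[str, Rule] = {
    "age_in_range": lambda r: isinstance(r.get("age"), int) and 0 <= r["age"] <= 120,
    "email_exists": lambda r: bool(r.get("email")),
}


def route(
    records: List[Record], rules: Dict[str, Rule]
) -> Tuple[List[Record], List[Record], List[Dict[str, object]]]:
    """Split records into valid, invalid, and violation-detail outputs."""
    valid: List[Record] = []
    invalid: List[Record] = []
    violations: List[Dict[str, object]] = []
    for rec in records:
        # Collect the name of every rule that this record fails.
        failed = [name for name, rule in rules.items() if not rule(rec)]
        if failed:
            invalid.append(rec)
            # One detail row per failed condition, keyed by the record id.
            violations.extend({"record_id": rec.get("id"), "rule": name} for name in failed)
        else:
            valid.append(rec)
    return valid, invalid, violations


if __name__ == "__main__":
    sample = [
        {"id": 1, "age": 30, "email": "a@example.com"},
        {"id": 2, "age": 200, "email": ""},
    ]
    good, bad, details = route(sample, RULES)
    print(len(good), len(bad), len(details))
```

In this sketch a record that fails two rules appears once in the invalid output and twice in the detail output, which mirrors the description of the detail output as listing each condition that the record did not meet.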

Although you can create and publish new rules from the Data Rules stage, the recommended approach is to create rules in InfoSphere Information Analyzer and then implement them in a job. You can use InfoSphere Information Analyzer to design, test, debug, and validate the rules on non-operational data. When a rule is ready for production, publish it and implement it in the Data Rules stage. This separation of responsibilities helps keep the rules, the data, and the design, test, and production systems secure.

When a job that uses the Data Rules stage runs, the output of the stage is passed to the downstream stages. Unlike a rule that runs in InfoSphere Information Analyzer, a job does not store results or history in the Analysis Results database (IADB).

If you reuse rules that were created in InfoSphere Information Analyzer, or if you create new rules in InfoSphere Information Analyzer and plan to use them in the Data Rules stage, be aware of the following requirements:

You perform most of the configuration in the Data Rules Editor, which is available from the Stage > General tab of the Data Rules stage. The Data Rules Editor presents an Input tab and an Output tab. On the Input tab, you select the rules to include in the stage and then map, or bind, each rule variable to a column from an input link or to a literal value. On the Output tab, you select the columns to pass to the output links. You can also add statistics and attributes, such as the date or time, to the output, and create a custom expression to perform a calculation.
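Binding can be illustrated conceptually outside the product. The following Python sketch is not the Data Rules Editor or any InfoSphere API. The Column and Literal classes, the resolve function, the SALARY column, and the sample rule are hypothetical names introduced only to show the idea that each rule variable is supplied either by an input column or by a fixed literal value.

```python
from dataclasses import dataclass
from typing import Dict, Union


@dataclass(frozen=True)
class Column:
    """Binding to a column on the input link (hypothetical type)."""
    name: str


@dataclass(frozen=True)
class Literal:
    """Binding to a fixed literal value (hypothetical type)."""
    value: object


Binding = Union[Column, Literal]


def resolve(bindings: Dict[str, Binding], record: Dict[str, object]) -> Dict[str, object]:
    """Return the value of each rule variable for one input record."""
    values: Dict[str, object] = {}
    for variable, binding in bindings.items():
        if isinstance(binding, Literal):
            values[variable] = binding.value
        else:
            values[variable] = record[binding.name]
    return values


def salary_rule(variables: Dict[str, object]) -> bool:
    """Hypothetical rule written against variables, not column names."""
    return variables["amount"] >= variables["minimum"]


if __name__ == "__main__":
    # 'amount' is bound to an input column; 'minimum' is bound to a literal.
    bindings = {"amount": Column("SALARY"), "minimum": Literal(1000)}
    print(salary_rule(resolve(bindings, {"SALARY": 1500})))
```

Because the rule refers only to its variables, the same rule logic can be reused against a different input by changing the bindings and leaving the rule unchanged.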