Rule sets that are used by the Standardize stage

You can apply rule sets in the Standardize stage to create consistent output columns that meet industry standards and that you can use in a variety of ways for data matching.

Rule sets check and normalize input data. The following categories of rule sets are available:
  • Country or region identifier rule sets read area information and attempt to identify the associated country or region.
  • Domain preprocessor rule sets evaluate mixed-domain input, such as free-form name and address information, and categorize the data into domain-specific column sets.
  • Domain-specific rule sets process free-form data from a single domain such as name, address, or area information.
  • Validation rule sets generate business intelligence and reporting fields, and are applied to common business data such as dates, email addresses, and phone numbers.

The provided rule sets are designed for optimal results. However, if the results are not satisfactory, or if you want to create rule sets for other data domains, you can create a new rule set, copy an existing rule set, or modify an existing rule set. You can modify rule set behavior by enhancing the rule set in DataStage®, adding user overrides, or editing the rule set files directly.

Standardize processing flow for records for the USA

The following diagram illustrates the Standardize stage processing flow using domain preprocessor and domain-specific rule sets to standardize the records that are commonly found in the United States.

Because input files are rarely domain-specific, domain preprocessor (PREP) rule sets are critical when preparing a file for standardization.

The same workflow is representative of the processing for records from other countries that you use with the Standardize stage.

[Figure: Standardize stage processing flow chart for records]

Using literals for required values

If the input records do not include critical entries, you can insert the required values as a literal, which appears in the output. You insert the literal when adding columns.

For example, suppose that the input records lack a state entry because all records are for the state of Vermont. To include the state in the standardized records, you insert the literal VT between the city name and the postal code.

If input records have an apartment number column containing only an apartment number, you could insert a # (pound sign) literal between the unit type and the unit value.
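The effect of a literal can be pictured with a small Python sketch. This is an illustration only, not how the Standardize stage is implemented: the `assemble` function, the `("column", ...)` and `("literal", ...)` pairs, the column names, and the sample values are all hypothetical, and the sketch assumes that columns and literals are joined with a single space.

```python
from typing import Mapping, Sequence, Tuple

# Hypothetical representation: ("column", column_name) or ("literal", text).
Part = Tuple[str, str]


def assemble(record: Mapping[str, str], parts: Sequence[Part]) -> str:
    """Join column values and literals in order (assumed: space-separated)."""
    tokens = []
    for kind, value in parts:
        if kind == "column":
            text = record.get(value, "").strip()
            if text:  # skip empty or missing columns
                tokens.append(text)
        elif kind == "literal":
            tokens.append(value)  # a literal is emitted as-is for every record
        else:
            raise ValueError(f"unknown part kind: {kind!r}")
    return " ".join(tokens)


# Made-up sample record with no state column; VT is supplied as a literal.
area_record = {"City": "Burlington", "PostalCode": "05401"}
area_parts = [("column", "City"), ("literal", "VT"), ("column", "PostalCode")]
print(assemble(area_record, area_parts))  # Burlington VT 05401

# Made-up unit columns with a # literal between unit type and unit value.
unit_record = {"UnitType": "APT", "UnitValue": "12"}
unit_parts = [("column", "UnitType"), ("literal", "#"), ("column", "UnitValue")]
print(assemble(unit_record, unit_parts))  # APT # 12
```

Because the literal is the same for every record, it is only appropriate when the missing value really is constant across the input, as in the Vermont example.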

Literals cannot contain any spaces and must be inserted between columns. You cannot include two contiguous literals for a rule set.

The only special characters that you can use in a literal are:

  • # (pound sign)
  • % (percentage)
  • ^ (caret)
  • & (ampersand)
  • <> (angle brackets)
  • / (slash)
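The restrictions on literals can be checked before you configure a stage. The following Python sketch is a hypothetical pre-check, not part of the product: the function names and the `("column", ...)` and `("literal", ...)` pairs are invented for illustration. It assumes that letters and digits are allowed in a literal (the VT example suggests this) in addition to the special characters listed above.

```python
import re
from typing import Sequence, Tuple

# Assumption: letters and digits are allowed, plus the listed special characters.
# Spaces are not in the character class, so a literal with a space is rejected.
_LITERAL_RE = re.compile(r"[A-Za-z0-9#%^&<>/]+")


def is_valid_literal(text: str) -> bool:
    """Return True if text is non-empty and uses only the permitted characters."""
    return _LITERAL_RE.fullmatch(text) is not None


def has_contiguous_literals(parts: Sequence[Tuple[str, str]]) -> bool:
    """Return True if two literals appear next to each other in the sequence."""
    kinds = [kind for kind, _ in parts]
    return any(a == "literal" and b == "literal" for a, b in zip(kinds, kinds[1:]))


print(is_valid_literal("VT"))        # True
print(is_valid_literal("New York"))  # False: contains a space
print(is_valid_literal("$"))         # False: not a permitted special character

bad = [("column", "City"), ("literal", "VT"), ("literal", "#"), ("column", "Zip")]
print(has_contiguous_literals(bad))  # True
```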

For domain preprocessor rule sets, you must insert column delimiters using literals.