Rules

Rules are processes that standardize groups of related records. Rules can apply to records that match the same pattern or to exact strings of text.

When you create or modify a rule, you map values in the input records to output columns, specify actions that manipulate the data, and identify conditions to ensure that rules apply only to the correct records.

Rules are added and modified by editing the pattern-action specification (previously called pattern-action file), enhancing a rule set in DataStage®, or using override objects.

In the pattern-action specification, patterns are executed in the order that they appear. A pattern either matches the input record or does not match. If it matches, the actions that are associated with the pattern are executed. If it does not match, the actions are skipped. In either case, processing continues with the next pattern in the file.
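
The evaluation order described above can be pictured with a small Python sketch. This is not how the product is implemented; the `run_rules` helper, the record layout, and the three sample rules are hypothetical and exist only to show that every pattern is tried in order, whether or not an earlier pattern matched.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, str]
# A rule is modeled as a (pattern test, action) pair. Both are plain callables.
Rule = Tuple[Callable[[Record], bool], Callable[[Record], None]]


def run_rules(record: Record, rules: List[Rule]) -> List[int]:
    """Try every rule in order and return the indexes of the rules that fired."""
    fired = []
    for index, (matches, action) in enumerate(rules):
        if matches(record):
            action(record)  # the pattern matched, so its actions run
            fired.append(index)
        # Matched or not, evaluation continues with the next rule.
    return fired


# Hypothetical rules, used only for illustration.
rules: List[Rule] = [
    (lambda r: r["input"].isdigit(), lambda r: r.update(kind="number")),
    (lambda r: r["input"].isalpha(), lambda r: r.update(kind="word")),
    (lambda r: len(r["input"]) == 3, lambda r: r.update(length="three")),
]

record = {"input": "123"}
fired = run_rules(record, rules)
print(fired)   # [0, 2]
print(record)  # {'input': '123', 'kind': 'number', 'length': 'three'}
```

The second rule is skipped because its test fails, but the third rule is still evaluated. In a real specification, an action such as EXIT is what stops this sequence early.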

Actions

An action is a part of a rule that specifies how the rule processes a record. You can add one or more actions for each value in a record.

You specify actions in the Standardization Rules Designer, pattern-action specification, or user overrides. Regardless of where the action is specified, every action that manipulates a particular record includes the following parts.

Table 1. Parts of an action

Part: An object that is acted upon
Examples:
  • Value
  • Standard value
  • The first three characters in a value
  • Literal

Part: One or more manipulations of the object
Examples:
  • Copy the object from one field to a different field
  • Look up the object in a lookup table and convert it to a returned value
  • Concatenate the object with a different object

Part: A target for the resulting character string
Examples:
  • Output column
  • User variable (pattern-action specification only)

In the pattern-action specification, you can also specify actions that affect how records are processed by the rule set and the order in which they are processed. For example, you can use the CALL action to call subroutines that process particular types of information, such as unit types.

Conditions

A condition specifies requirements that records must meet before the actions in a rule are applied to that record. You can add one or more conditions for each value in a record.

You specify conditions in the Standardization Rules Designer or pattern-action specification. Regardless of where the condition is specified, all conditions have the following parts.

Table 2. Parts of a condition

Part: An object that the condition applies to
Examples:
  • Value
  • Standard value
  • The first three characters in a value

Part: Requirements that the object must meet for the rule to apply to that object
Examples:
  • The object equals a particular value
  • The object is in a lookup table
  • The length of an object is greater than or equal to a particular value

For example, your data might contain distinct values that represent calendar dates in the ccyy mm dd format. A sample record in this format is 1986 03 16. Suppose that your data quality requirements call for a rule that concatenates these values into one value in the ccyymmdd format.

To ensure that the rule applies only to calendar dates in the correct format, you add a series of conditions that require the length of the first value to be equal to four characters and the length of the second and third values to be equal to two characters. When you add these conditions, the rule applies to records such as 1986 03 16, but not to records such as 03 16 86.
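
As a rough illustration of the date example, the following Python sketch applies the same length conditions before concatenating the values. The function name `concatenate_date` is hypothetical; in practice the conditions and the action would be defined in the Standardization Rules Designer or the pattern-action specification, not in Python.

```python
from typing import List, Optional


def concatenate_date(values: List[str]) -> Optional[str]:
    """Return the ccyymmdd value when the length conditions hold, else None."""
    if len(values) != 3:
        return None
    century_year, month, day = values
    # Conditions: the first value has four characters,
    # the second and third values have two characters each.
    conditions_met = (
        len(century_year) == 4 and len(month) == 2 and len(day) == 2
    )
    if not conditions_met:
        return None  # the rule does not apply to this record
    # Action: concatenate the three values into one value.
    return century_year + month + day


print(concatenate_date("1986 03 16".split()))  # 19860316
print(concatenate_date("03 16 86".split()))    # None
```

The sketch checks only lengths, as the conditions in the example do. It does not verify that the values are numeric or that they form a valid date.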

Pattern matching principles

To obtain correct standardization, you need to understand the concepts of pattern matching and the reasons for matching.

If all elements of an address are uniquely identified by keywords, address standardization is easy. The following example is not subject to any ambiguity. The first value is numeric (the house number), the next is a direction that is uniquely identified by the value (previously called token) N, the next is an unknown word, MAPLE, and the last is a street type, AVE:


123 N MAPLE AVE

Most addresses fall into this pattern with minor variations:


123 E MAINE AV
3456 NO CHERRY HILL ROAD
123 SOUTH ELM PLACE

The first numeric value is interpreted as the house number and must be moved to the house number field {HouseNumber}. The direction is moved to the pre-direction field {StreetPrefixDirectional}, the street names to the street name field {StreetName}, and the street type to the {StreetSuffixType} field.

The braces indicate that the reference is to a dictionary field that defines a field in the output data. For example:

Table 3. Pattern matching

Pattern: Numeric value. The class is ^.
Dictionary field: {HouseNumber}
Examples:
  • 123
  • 3456
  • 123

Pattern: Direction. The class is D.
Dictionary field: {StreetPrefixDirectional}
Examples:
  • E (an abbreviation for East)
  • NO (an abbreviation for North)
  • SOUTH

Pattern: Unclassified words. The class is ?.
Dictionary field: {StreetName}
Examples:
  • MAINE
  • CHERRY
  • HILL
  • ELM

Pattern: Street type. The class is T.
Dictionary field: {StreetSuffixType}
Examples:
  • AV (an abbreviation for Avenue)
  • ROAD
  • PLACE
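
To make the class assignments in the table concrete, the following Python sketch assigns a class to each value in the sample addresses and prints the resulting pattern. It is a simplified model, not the product's classification logic. The `CLASSIFICATIONS` entries and their standard values (for example, RD for ROAD) are assumptions made for illustration, and runs of consecutive unclassified words are collapsed into a single ? because that class matches one or more unknown words.

```python
from typing import List

# Hypothetical classification entries: value -> (standard value, class).
CLASSIFICATIONS = {
    "N": ("N", "D"), "NO": ("N", "D"), "E": ("E", "D"), "SOUTH": ("S", "D"),
    "AV": ("AVE", "T"), "AVE": ("AVE", "T"),
    "ROAD": ("RD", "T"), "PLACE": ("PL", "T"),
}


def classify(value: str) -> str:
    """Return ^ for numeric values, the listed class, or ? for unknown words."""
    if value.isdigit():
        return "^"
    entry = CLASSIFICATIONS.get(value.upper())
    return entry[1] if entry else "?"


def class_pattern(address: str) -> str:
    """Build the pattern for an address, collapsing runs of unknown words."""
    collapsed: List[str] = []
    for value in address.split():
        value_class = classify(value)
        if value_class == "?" and collapsed and collapsed[-1] == "?":
            continue  # a single ? already covers the preceding unknown word
        collapsed.append(value_class)
    return " | ".join(collapsed)


for address in ("123 E MAINE AV", "3456 NO CHERRY HILL ROAD", "123 SOUTH ELM PLACE"):
    print(address, "->", class_pattern(address))
# Each of the three addresses yields: ^ | D | ? | T
```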

Rule groups

A rule group is a collection of rules that are applied to records at the same point in the standardization process. To ensure that rules are applied in a particular order, you can organize the rules into rule groups.

Rule groups can contain rules that are applied to records before or after the other actions in the standardization process. A rule group is invoked by a separate action in the pattern-action specification.

For example, the following action invokes the Hardware_Retail rule group. The example includes comments that specify the rule group that is invoked.
; Rules for hardware retail products
; ----------------------------------
; ----------------------------

& ; CALL Hardware_Retail SUBROUTINE
CALL Hardware_Retail

The Hardware_Retail rule group might include rules like the following rule:

B | + | S | C | P ; Common Pattern Found: CALL Post_Process SUBROUTINE then EXIT
COPY_A [1] {ProductBrand}
COPY_S [2] {ProductName}
COPY_A [3] {ProductSize}
COPY_A [4] {ProductCode}
COPY_A [5] {ProductUnitPrice}
CALL Post_Process
EXIT

For most rule sets, you might need only the following rule groups:
  • A rule group for rules that you want to apply to records before all other actions
  • A rule group for rules that you want to apply to records after all other actions

The pattern-action specification for predefined rule sets contains the following rule groups:
  • The Input_Overrides rule group contains rules that are applied to records before all other actions in the pattern-action specification.
  • The Unhandled_Overrides rule group contains rules that are applied to records after all other actions in the pattern-action specification.

You cannot add rules that are based on pattern-action language to these rule groups.

You can modify the rules in a rule group, add a rule group, or change the name of a rule group. For a rule set to work correctly, the references to the rule groups in the pattern-action specification must match the information about the rule groups. Before you provision a rule set and apply it in a job, ensure that the pattern-action specification is updated to match.

Pattern-action specification (.PAT file)

The pattern-action specification (previously called pattern-action file) is an ASCII file that can be created or updated using any standard text editor.

The pattern-action specification has the following general format:


\POST_START
post-execution actions
\POST_END
\PRAGMA_START
specification statements
\PRAGMA_END

pattern
actions

pattern
actions

pattern
actions

...

There are two special sections in the pattern-action specification. The first section consists of post-execution actions within the \POST_START and \POST_END lines. The post-execution actions are executed after the pattern matching process is finished for the input record.

Post-execution actions include computing Soundex codes, NYSIIS codes, reverse Soundex codes, and reverse NYSIIS codes, and copying, concatenating, and prefixing dictionary field value initials.
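
For readers unfamiliar with phonetic codes, the following Python sketch shows the widely used American Soundex algorithm. It is an assumption that the product's Soundex action follows this variant in every detail. The reverse code is modeled here as the Soundex of the reversed string, which is also an assumption. NYSIIS is omitted because its rules are considerably longer.

```python
# Map each consonant to its Soundex digit. Vowels, H, W, and Y have no digit.
CODES = {}
for letters, digit in (
    ("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
    ("L", "4"), ("MN", "5"), ("R", "6"),
):
    for letter in letters:
        CODES[letter] = digit


def soundex(name: str) -> str:
    """Return the four-character American Soundex code for a name."""
    letters = [ch for ch in name.upper() if ch.isalpha()]
    if not letters:
        return ""
    code = [letters[0]]  # the first letter is kept as-is
    previous = CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = CODES.get(ch, "")
        if digit and digit != previous:
            code.append(digit)
        if ch not in "HW":
            # H and W do not separate letters that share the same digit.
            previous = digit
    return ("".join(code) + "000")[:4]  # pad with zeros, keep four characters


def reverse_soundex(name: str) -> str:
    """Soundex computed from the last character toward the first (assumed)."""
    return soundex(name[::-1])


print(soundex("MAPLE"))          # M140
print(soundex("Robert"))         # R163
print(reverse_soundex("MAPLE"))  # E415
```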

The second special section consists of specification statements within the \PRAGMA_START and \PRAGMA_END lines. The only specification statements currently allowed are SEPLIST, STRIPLIST, and TOK. Both special sections are optional. If you omit a special section, you must also omit its header and trailer lines.

Other than the special sections, the pattern-action specification consists of standardization rules. Standardization rules include one or more conditions, such as a pattern, and the associated actions. The pattern requires one line. The actions are coded one action per line. The next pattern can start on the following line.

Use blank lines to improve readability. For example, separate one rule from another with blank lines or comments.

Comments follow a semicolon. An entire line can be a comment line by specifying a semicolon as the first non-blank character; for example:


;
;  This is a standard address pattern
;
^ | ? | T  ; 123 Maple Ave

Consider the following input entries:

123 N MAPLE AVE
123 MAPLE AVE

The following sample specification uses a post-execution action to compute a NYSIIS code for the street name, and two patterns to handle these input entries:

\POST_START
NYSIIS {StreetName} {StreetNameNYSIIS}
\POST_END
^ | D | ? | T ; 123 N Maple Ave
COPY [1] {HouseNumber} ; Copy House number (123)
COPY_A [2] {StreetPrefixDirectional} ; Copy direction (N)
COPY_S [3] {StreetName} ; Copy street name (Maple)
COPY_A [4] {StreetSuffixType} ; Copy street type (Ave)
EXIT

^ | ? | T
COPY [1] {HouseNumber}
COPY_S [2] {StreetName}
COPY_A [3] {StreetSuffixType}
EXIT

This example pattern-action specification has a post section that computes the NYSIIS code of the street name (in field {StreetName}) and moves the result to the {StreetNameNYSIIS} field.

The first pattern matches a numeric value followed by a direction followed by one or more unknown words followed by a street type (as in 123 N MAPLE AVE). The associated actions are to:

  1. Copy operand [1] (numeric value) to the {HouseNumber} house number field.
  2. Copy the standard abbreviation of operand [2] to the {StreetPrefixDirectional} prefix direction field.
  3. Copy operand [3] to the {StreetName} field, retaining the spaces between the words.
  4. Copy the standard abbreviation of operand [4] to the {StreetSuffixType} street type field.
  5. Exit the pattern program. A blank line indicates the end of the actions for the pattern.

The second rule is similar, except that it handles records that have no direction, such as 123 MAPLE AVE. If the first pattern does not match, the next pattern in the sequence is attempted.
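
The behavior of the two rules can be approximated with the following Python sketch. It is a simplified model under several assumptions: the `CLASSIFICATIONS` table and its standard abbreviations are hypothetical, a pattern is required to cover the whole record, and the post-execution NYSIIS step is left out. COPY is used here only on single-value operands, so it is handled the same way as COPY_S.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical classification entries: value -> (standard abbreviation, class).
CLASSIFICATIONS = {
    "N": ("N", "D"), "NORTH": ("N", "D"),
    "AVE": ("AVE", "T"), "AVENUE": ("AVE", "T"),
}
Token = Tuple[str, str, str]  # (raw value, standard value, class)

# Each rule pairs a pattern with (action, operand number, dictionary field).
RULES = [
    (["^", "D", "?", "T"], [("COPY", 1, "HouseNumber"),
                            ("COPY_A", 2, "StreetPrefixDirectional"),
                            ("COPY_S", 3, "StreetName"),
                            ("COPY_A", 4, "StreetSuffixType")]),
    (["^", "?", "T"], [("COPY", 1, "HouseNumber"),
                       ("COPY_S", 2, "StreetName"),
                       ("COPY_A", 3, "StreetSuffixType")]),
]


def tokenize(address: str) -> List[Token]:
    tokens = []
    for value in address.upper().split():
        if value.isdigit():
            tokens.append((value, value, "^"))
        elif value in CLASSIFICATIONS:
            standard, value_class = CLASSIFICATIONS[value]
            tokens.append((value, standard, value_class))
        else:
            tokens.append((value, value, "?"))
    return tokens


def match(pattern: List[str], tokens: List[Token]) -> Optional[List[List[Token]]]:
    """Return one token group per operand, or None if the pattern fails."""
    groups, position = [], 0
    for operand_class in pattern:
        start = position
        if operand_class == "?":  # one or more consecutive unknown words
            while position < len(tokens) and tokens[position][2] == "?":
                position += 1
        elif position < len(tokens) and tokens[position][2] == operand_class:
            position += 1
        if position == start:
            return None
        groups.append(tokens[start:position])
    return groups if position == len(tokens) else None


def standardize(address: str) -> Dict[str, str]:
    tokens = tokenize(address)
    for pattern, actions in RULES:
        groups = match(pattern, tokens)
        if groups is None:
            continue  # no match: try the next pattern in the sequence
        output = {}
        for action, operand, field in actions:
            group = groups[operand - 1]
            if action == "COPY_A":  # copy the standard abbreviation
                output[field] = " ".join(std for _, std, _ in group)
            else:  # COPY and COPY_S: copy the values, keeping spaces
                output[field] = " ".join(raw for raw, _, _ in group)
        return output  # EXIT: stop after the first matching rule
    return {}


print(standardize("123 N MAPLE AVE"))
print(standardize("123 MAPLE AVE"))
```

The first call is handled by the first rule and fills all four fields. The second call fails the first pattern at the direction operand and is handled by the second rule, which leaves {StreetPrefixDirectional} unset.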

Special characters

At the beginning of the standardization process, input data is parsed into meaningful values. Special characters are used to identify distinct values and distinguish between values and characters that do not contain useful information.

Rule sets use the following types of special characters:
  • Separation characters indicate where one value in a record ends and the next value begins. If a character is in the separation list but is not in the strip list, the character is identified as a distinct value.
  • Strip characters are removed from the record. For example, if a period (.) is in the strip list but is not in the separation list, the characters N.W. in the raw data are parsed into the following value: NW

In the pattern-action specification, separation characters are specified in the separation list, which is specified by using the SEPLIST statement. Strip characters are specified in the strip list, which is specified by using the STRIPLIST statement. When input data is parsed, the separation list is applied first.

Any character in both lists separates values but is not identified as a distinct value itself. For example, if a space is in both lists, the characters 123 456 in the raw data are parsed into the following two values: 123 and 456. Because the space is in both lists, it separates the two values but is not a value itself. A pattern in the pattern-action specification cannot include any character that is in both lists.

Examples

In this example, the space is in both lists and the hyphen is in the strip list but not the separation list. Hyphens are stripped so that STRATFORD-ON-AVON is considered to be STRATFORDONAVON.

SEPLIST: " !?%$,.;:()/#&"
STRIPLIST: " !?*@$,.;:-\\''"

In this example, the hyphen is in both lists. Because the separation list is applied before the strip list, STRATFORD-ON-AVON in the incoming data is parsed into three values: STRATFORD, ON, and AVON.


SEPLIST: " !?%$,.;:-()/#&"
STRIPLIST: " !?*@$,.;:-\\''"
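
The interaction between the two lists can be checked with a short Python sketch. The `parse_values` function is a hypothetical model of the parsing rules described in this section, not the product's parser: a separation character ends the current value and becomes a distinct value only if it is not also a strip character, and a character that is only in the strip list is removed.

```python
from typing import List


def parse_values(text: str, seplist: str, striplist: str) -> List[str]:
    """Split text into values by applying the separation list, then the strip list."""
    values: List[str] = []
    current: List[str] = []

    def flush() -> None:
        # Close the value that is being built, if there is one.
        if current:
            values.append("".join(current))
            current.clear()

    for ch in text:
        if ch in seplist:
            flush()
            if ch not in striplist:
                values.append(ch)  # separator that is kept as a distinct value
        elif ch in striplist:
            continue  # strip-only character: removed without splitting
        else:
            current.append(ch)
    flush()
    return values


STRIPLIST = " !?*@$,.;:-\\''"
SEPLIST_WITHOUT_HYPHEN = " !?%$,.;:()/#&"
SEPLIST_WITH_HYPHEN = " !?%$,.;:-()/#&"

print(parse_values("STRATFORD-ON-AVON", SEPLIST_WITHOUT_HYPHEN, STRIPLIST))
# ['STRATFORDONAVON']
print(parse_values("STRATFORD-ON-AVON", SEPLIST_WITH_HYPHEN, STRIPLIST))
# ['STRATFORD', 'ON', 'AVON']
print(parse_values("APT#5", SEPLIST_WITHOUT_HYPHEN, STRIPLIST))
# ['APT', '#', '5']
```

The last call shows the remaining case: the number sign is in the separation list but not in the strip list, so it is returned as a value of its own.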