Rules
Rules are processes that standardize groups of related records. Rules can apply to records that match the same pattern or to exact strings of text.
When you create or modify a rule, you map values in the input records to output columns, specify actions that manipulate the data, and identify conditions to ensure that rules apply only to the correct records.
Rules are added and modified by editing the pattern-action specification (previously called pattern-action file), enhancing a rule set in DataStage®, or using override objects.
In the pattern-action specification, patterns are executed in the order that they appear. A pattern either matches the input record or does not match. If it matches, the actions that are associated with the pattern are executed. If it does not match, the actions are skipped. In either case, processing continues with the next pattern in the file.
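The evaluation order described above can be modeled in a few lines of Python. This is only a sketch of the control flow, not the product's implementation: the function `apply_rules`, the rule names, and the record layout are all hypothetical, and patterns are represented as plain Python predicates.

```python
def apply_rules(record, rules):
    """Try each (name, pattern, actions) rule in the order it appears."""
    fired = []
    for name, matches, actions in rules:
        if matches(record):
            # The pattern matched: run every action associated with it.
            for action in actions:
                action(record)
            fired.append(name)
        # Matched or not, processing continues with the next pattern.
    return fired


# Hypothetical rules used only to illustrate the ordering.
rules = [
    ("has_digits", lambda r: any(c.isdigit() for c in r["input"]),
     [lambda r: r.update(numeric=True)]),
    ("all_alpha", lambda r: r["input"].isalpha(),
     [lambda r: r.update(alpha=True)]),
    ("non_empty", lambda r: bool(r["input"]),
     [lambda r: r.update(seen=True)]),
]

record = {"input": "123 MAPLE"}
fired = apply_rules(record, rules)
```

Here the first and third rules match and run their actions, while the second is skipped without stopping the pass over the remaining patterns.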
Actions
An action is a part of a rule that specifies how the rule processes a record. You can add one or more actions for each value in a record.
An action consists of the following parts:
- An object that is acted upon
- One or more manipulations of the object
- A target for the resulting character string
In the pattern-action specification, you can also specify actions that affect how records are processed by the rule set and the order in which they are processed. For example, you can use the CALL action to call subroutines that process particular types of information, such as unit types.
Conditions
A condition specifies requirements that records must meet before the actions in a rule are applied to that record. You can add one or more conditions for each value in a record.
A condition consists of the following parts:
- An object that the condition applies to
- Requirements that the object must meet for the rule to apply to that object
For example, your data might contain distinct values that represent calendar dates in the ccyy mm dd format. One such record is 1986 03 16. Suppose that your data quality requirements call for a rule that concatenates these values into one value in the ccyymmdd format.
To ensure that the rule applies only to calendar dates in the correct format, you add a series of conditions that require the first value to be four characters long and the second and third values to be two characters long each. With these conditions in place, the rule applies to records such as 1986 03 16, but not to records such as 03 16 86.
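The effect of these conditions can be sketched in Python. The function name `concat_date` is hypothetical, and the sketch checks only the value lengths named above; it does not validate that the values are real dates.

```python
def concat_date(record):
    """Return the ccyymmdd value, or None when the conditions are not met."""
    values = record.split()
    if len(values) != 3:
        return None
    year, month, day = values
    # Conditions: first value has 4 characters, second and third have 2 each.
    if len(year) == 4 and len(month) == 2 and len(day) == 2:
        # Action: concatenate the three values into one.
        return year + month + day
    return None
```

With this sketch, `concat_date("1986 03 16")` yields `"19860316"`, while `concat_date("03 16 86")` yields `None` because the first value is only two characters long.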
Pattern matching principles
To obtain correct standardization, you need to understand the concepts of pattern matching and the reasons for matching.
If all elements of an address are uniquely identified by keywords, address standardization is easy. The following example is not subject to any ambiguity. The first field is numeric (house number), the next is a direction, which is uniquely identified by the value (previously called token) N, the next is an unknown word MAPLE, and the last is a street type, AVE:
123 N MAPLE AVE
Most addresses fall into this pattern with minor variations.
123 E MAINE AV
3456 NO CHERRY HILL ROAD
123 SOUTH ELM PLACE
The first numeric value is interpreted as the house number and must be moved to the house number field {HouseNumber}. The direction is moved to the pre-direction field {StreetPrefixDirectional}, the street names to the street name field {StreetName}, and the street type to the {StreetSuffixType} field.
The braces indicate that the reference is to a dictionary field that defines a field in the output data. For example:
Pattern | Dictionary field | Examples |
---|---|---|
Numeric value. The class is ^. | {HouseNumber} | 123 |
Direction. The class is D. | {StreetPrefixDirectional} | N |
Unclassified words. The class is ?. | {StreetName} | MAPLE |
Street type. The class is T. | {StreetSuffixType} | AVE |
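To make the class assignment concrete, the following Python sketch labels each value of a record with one of the classes from the table. The lookup sets `DIRECTIONS` and `STREET_TYPES` are hypothetical stand-ins for the classifications that a real rule set defines, and they contain only the words used in the sample addresses.

```python
# Hypothetical classification entries; a real rule set defines its own.
DIRECTIONS = {"N", "E", "S", "W", "NO", "SOUTH"}
STREET_TYPES = {"AVE", "AV", "ROAD", "PLACE"}


def classify(value):
    """Return the class of a single value: ^, D, T, or ? (unclassified)."""
    if value.isdigit():
        return "^"
    if value in DIRECTIONS:
        return "D"
    if value in STREET_TYPES:
        return "T"
    return "?"


def classes_of(record):
    """Return the class of every value in a space-separated record."""
    return [classify(value) for value in record.split()]
```

For 123 N MAPLE AVE this gives the classes ^, D, ?, and T. For 3456 NO CHERRY HILL ROAD it gives two consecutive ? classes, one for each unclassified word of the street name.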
Rule groups
A rule group is a collection of rules that are applied to records at the same point in the standardization process. To ensure that rules are applied in a particular order, you can organize the rules into rule groups.
Rule groups can contain rules that are applied to records before or after the other actions in the standardization process. A rule group is invoked by a separate action in the pattern-action specification, as the CALL actions in the following example show:
; Rules for hardware retail products
; ----------------------------------
; ----------------------------
& ; CALL Hardware_Retail SUBROUTINE
CALL Hardware_Retail
B | + | S | C | P ; Common Pattern Found: CALL Post_Process SUBROUTINE then EXIT
COPY_A [1] {ProductBrand}
COPY_S [2] {ProductName}
COPY_A [3] {ProductSize}
COPY_A [4] {ProductCode}
COPY_A [5] {ProductUnitPrice}
CALL Post_Process
EXIT
- A rule group for rules that you want to apply to records before all other actions
- A rule group for rules that you want to apply to records after all other actions
- The Input_Overrides rule group contains rules that are applied to records before all other actions in the pattern-action specification.
- The Unhandled_Overrides rule group contains rules that are applied to records after all other actions in the pattern-action specification.
You can modify the rules in a rule group, add a rule group, or change the name of a rule group. For a rule set to work correctly, the references to the rule groups in the pattern-action specification must match the information about the rule groups. Before you provision a rule set and apply it in a job, ensure that the pattern-action specification is updated to match.
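One way to check that the references match is to list the names that CALL actions refer to and compare them with the rule groups that you know exist. The Python sketch below is an assumption-laden illustration, not a product utility: the helper names are hypothetical, and comments are removed by cutting each line at the first semicolon, which would not be safe for lines that contain a semicolon inside a quoted string.

```python
import re

SPEC = """\
; Rules for hardware retail products
& ; CALL Hardware_Retail SUBROUTINE
CALL Hardware_Retail
B | + | S | C | P ; Common Pattern Found: CALL Post_Process SUBROUTINE then EXIT
COPY_A [1] {ProductBrand}
CALL Post_Process
EXIT
"""


def called_rule_groups(spec_text):
    """Return the names referenced by CALL actions, in order of first use."""
    names = []
    for line in spec_text.splitlines():
        code = line.split(";", 1)[0]  # drop the comment part of the line
        match = re.match(r"\s*CALL\s+(\w+)", code)
        if match and match.group(1) not in names:
            names.append(match.group(1))
    return names


def missing_rule_groups(spec_text, defined_groups):
    """Return called names that are not among the defined rule groups."""
    return [n for n in called_rule_groups(spec_text) if n not in defined_groups]
```

Calling `missing_rule_groups(SPEC, {"Hardware_Retail"})` reports `Post_Process`, which signals that the specification and the rule group information are out of step.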
Pattern-action specification (.PAT file)
The pattern-action specification (previously called pattern-action file) is an ASCII file that can be created or updated using any standard text editor.
The pattern-action specification has the following general format:
\POST_START
post-execution actions
\POST_END
\PRAGMA_START
specification statements
\PRAGMA_END
pattern
actions
pattern
actions
pattern
actions
...
There are two special sections in the pattern-action specification. The first section consists of post-execution actions within the \POST_START and \POST_END lines. The post-execution actions are executed after the pattern matching process is finished for the input record.
Post-execution actions include computing Soundex codes, NYSIIS codes, reverse Soundex codes, and reverse NYSIIS codes, and copying, concatenating, and prefixing dictionary field value initials.
The second special section consists of specification statements within the \PRAGMA_START and \PRAGMA_END lines. The only specification statements currently allowed are SEPLIST, STRIPLIST, and TOK. Both special sections are optional. If a section is omitted, its header and trailer lines must also be omitted.
Other than the special sections, the pattern-action specification consists of standardization rules. Standardization rules include one or more conditions, such as a pattern, and the associated actions. The pattern requires one line. The actions are coded one action per line. The next pattern can start on the following line.
Use blank lines to improve readability. For example, separate one rule from another with blank lines or comments.
A comment starts with a semicolon and runs to the end of the line. An entire line is a comment if a semicolon is its first non-blank character; for example:
;
; This is a standard address pattern
;
^ | ? | T ; 123 Maple Ave
The following example pattern-action specification standardizes addresses such as 123 N MAPLE AVE and 123 MAPLE AVE:
\POST_START
NYSIIS {StreetName} {StreetNameNYSIIS}
\POST_END
^ | D | ? | T ; 123 N Maple Ave
COPY [1] {HouseNumber} ; Copy House number (123)
COPY_A [2] {StreetPrefixDirectional} ; Copy direction (N)
COPY_S [3] {StreetName} ; Copy street name (Maple)
COPY_A [4] {StreetSuffixType} ; Copy street type (Ave)
EXIT
^ | ? | T
COPY [1] {HouseNumber}
COPY_S [2] {StreetName}
COPY_A [3] {StreetSuffixType}
EXIT
This example pattern-action specification has a post section that computes the NYSIIS code of the street name (in field {StreetName}) and moves the result to the {StreetNameNYSIIS} field.
The first pattern matches a numeric value followed by a direction followed by one or more unknown words followed by a street type (as in 123 N MAPLE AVE). The associated actions are to:
- Copy operand [1] (numeric value) to the {HouseNumber} house number field.
- Copy the standard abbreviation of operand [2] to the {StreetPrefixDirectional} prefix direction field.
- Copy and retain spaces between the words in operand [3] to the {StreetName} field.
- Copy the standard abbreviation of the fourth operand to the {StreetSuffixType} street type field.
- Exit the pattern program. A blank line indicates the end of the actions for the pattern.
The second rule is similar except that the rule handles cases like 123 MAPLE AVE. If there is no match on the first pattern, the next pattern in the sequence is attempted.
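The behavior of the two rules can be imitated with a short Python sketch. Everything here is an illustrative model rather than the product's engine: the classification tables and their standard abbreviations are hypothetical, patterns are compared as whole lists of classes, and only the COPY, COPY_A, COPY_S, and EXIT behavior described above is modeled.

```python
# Hypothetical classifications mapped to hypothetical standard abbreviations.
DIRECTIONS = {"N": "N", "NO": "N", "E": "E", "SOUTH": "S"}
STREET_TYPES = {"AVE": "AVE", "AV": "AVE", "ROAD": "RD", "PLACE": "PL"}

# Each rule: (pattern classes, [(action, operand number, dictionary field)]).
RULES = [
    (["^", "D", "?", "T"], [("COPY", 1, "HouseNumber"),
                            ("COPY_A", 2, "StreetPrefixDirectional"),
                            ("COPY_S", 3, "StreetName"),
                            ("COPY_A", 4, "StreetSuffixType")]),
    (["^", "?", "T"], [("COPY", 1, "HouseNumber"),
                       ("COPY_S", 2, "StreetName"),
                       ("COPY_A", 3, "StreetSuffixType")]),
]


def classify(value):
    if value.isdigit():
        return "^"
    if value in DIRECTIONS:
        return "D"
    if value in STREET_TYPES:
        return "T"
    return "?"


def to_operands(record):
    """Group values into operands; consecutive unknown words share one ? operand."""
    operands = []
    for value in record.split():
        cls = classify(value)
        if cls == "?" and operands and operands[-1][0] == "?":
            operands[-1][1].append(value)
        else:
            operands.append((cls, [value]))
    return operands


def standardize(record):
    operands = to_operands(record)
    classes = [cls for cls, _ in operands]
    for pattern, actions in RULES:
        if pattern != classes:
            continue  # no match: try the next pattern in the sequence
        output = {}
        for action, number, field in actions:
            cls, words = operands[number - 1]
            if action == "COPY_A":  # copy the standard abbreviation
                table = DIRECTIONS if cls == "D" else STREET_TYPES
                output[field] = table[words[0]]
            elif action == "COPY_S":  # copy and keep spaces between words
                output[field] = " ".join(words)
            else:  # COPY: copy the value as it is
                output[field] = words[0]
        return output  # EXIT: stop after the first matching rule
    return {}
```

In this model, 123 N MAPLE AVE is handled by the first rule, and 123 MAPLE AVE falls through to the second rule because its classes do not include a direction.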
Special characters
At the beginning of the standardization process, input data is parsed into meaningful values. Special characters are used to identify distinct values and distinguish between values and characters that do not contain useful information.
- Separation characters indicate where one value in a record ends and the next value begins. If a character is in the separation list but is not in the strip list, the character is identified as a distinct value.
- Strip characters are removed from the record. For example, if a period (.) is in the strip list but is not in the separation list, the characters N.W. in the raw data are parsed into the following value: NW
In the pattern-action specification, separation characters are specified in the separation list, which is specified by using the SEPLIST statement. Strip characters are specified in the strip list, which is specified by using the STRIPLIST statement. When input data is parsed, the separation list is applied first.
Any character in both lists separates values but is not identified as a distinct value itself. For example, if a space is in both lists, the characters 123 456 in the raw data are parsed into the following two values: 123 and 456. Because the space is in both lists, it separates the two values but is not a value itself. When you specify patterns in the pattern-action specification, you cannot include any character that is in both lists in a pattern.
Examples
SEPLIST: " !?%$,.;:()/#&"
STRIPLIST: " !?*@$,.;:-\\''"
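In the preceding example, the hyphen is in the strip list but not in the separation list, so hyphens are removed and STRATFORD-ON-AVON in the incoming data is parsed into one value: STRATFORDONAVON. In the following example, the hyphen is in both lists. Because the separation list is applied before the strip list, STRATFORD-ON-AVON is parsed into three values: STRATFORD, ON, and AVON.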
SEPLIST: " !?%$,.;:-()/#&"
STRIPLIST: " !?*@$,.;:-\\''"
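The parsing rules in this section can be modeled with a small Python function. This is a sketch of the described behavior only, with a hypothetical function name; it also assumes that the escaped characters at the end of the STRIPLIST value stand for one backslash and one apostrophe.

```python
def parse_values(text, seplist, striplist):
    """Split text into values according to the separation and strip lists."""
    values = []
    current = []

    def flush():
        if current:
            values.append("".join(current))
            current.clear()

    for ch in text:
        if ch in seplist:
            flush()  # a separation character ends the current value
            if ch not in striplist:
                values.append(ch)  # separator only: it is a distinct value
        elif ch in striplist:
            continue  # strip only: the character is removed
        else:
            current.append(ch)
    flush()
    return values


# Assumption: the trailing escapes denote a backslash and an apostrophe.
STRIPLIST = " !?*@$,.;:-\\'"
SEPLIST_WITHOUT_HYPHEN = " !?%$,.;:()/#&"
SEPLIST_WITH_HYPHEN = " !?%$,.;:-()/#&"

one_value = parse_values("STRATFORD-ON-AVON", SEPLIST_WITHOUT_HYPHEN, STRIPLIST)
three_values = parse_values("STRATFORD-ON-AVON", SEPLIST_WITH_HYPHEN, STRIPLIST)
```

With the hyphen only in the strip list, the sketch produces the single value STRATFORDONAVON. With the hyphen in both lists, it produces STRATFORD, ON, and AVON. A character such as # that is only in the separation list is kept as a value of its own.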