Classifications

Classifications strengthen the contextual information that patterns provide by identifying that the underlying values belong to particular categories. Each rule set contains its own set of categories, which are called classes.

In DataStage®, records are represented as patterns. In the same way that a record consists of one or more values, patterns consist of one or more abstract characters, each of which represents a class. For example, a set of address data might include the record 123 N CHERRY HILL ROAD, which is represented by the pattern ^D++T. The following table shows the contextual information that each class in the pattern ^D++T provides.
Table 1. Example of a standard address pattern with the contextual information that each class provides
Input value  Class label  Contextual information that the class provides
123          ^            Value that includes only numbers
N            D            Street direction
CHERRY       +            Value that includes only letters
HILL         +            Value that includes only letters
ROAD         T            Street type
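
The mapping from a record to a pattern can be sketched in Python. This is a minimal illustration, not the product's implementation: the CUSTOM_CLASSES lookup, the class_of and pattern_for helpers, and the use of whitespace to separate values are all assumptions made for the example.

```python
# Hypothetical lookup of value -> one-character custom class label.
# A real rule set would read these from its classification definitions.
CUSTOM_CLASSES = {
    "N": "D", "NW": "D", "S": "D",                    # street directions
    "AVE": "T", "ST": "T", "RD": "T", "ROAD": "T",   # street types
}


def class_of(value: str) -> str:
    """Return the class label for one value (custom class first, then default)."""
    upper = value.upper()
    if upper in CUSTOM_CLASSES:
        return CUSTOM_CLASSES[upper]
    if upper.isdigit():
        return "^"   # default class: digits only
    if upper.isalpha():
        return "+"   # default class: letters only
    raise ValueError(f"no class in this sketch for {value!r}")


def pattern_for(record: str) -> str:
    """Build the pattern by concatenating the class label of each value."""
    # Assumption: values are separated by whitespace only.
    return "".join(class_of(value) for value in record.split())


print(pattern_for("123 N CHERRY HILL ROAD"))  # ^D++T
```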
Patterns contain the following types of classes:
  • Default classes provide basic information about the type of the value, such as whether the value consists of alphabetic characters, numeric characters, or some combination of both.
  • Custom classes provide stronger contextual information about the type of the value. In a data set that contains retail product information, custom classes might be used to indicate whether an alphabetic value is the name of a product or the name of a brand. The one-character label for custom classes can be any letter in the Latin alphabet or 0, which indicates a null class.

Rule sets use classifications to identify and classify key values. For example, a rule set for address data might use classifications to categorize values that are street types (AVE, ST, RD) or directions (N, NW, S) by providing the following information:

  • Standard abbreviations for each word; for example, HWY for Highway
  • A list of one-character labels that represent classes and that are assigned to individual data elements during processing

Classifications are added and modified by editing the classifications table (previously called the .CLS file), enhancing a rule set in DataStage, or using the user classification override.

Class types

Classes provide contextual information about values. Default classes provide basic information about the type of the value, and custom classes provide stronger contextual information about the type of the value.

Default classes

If a value is not assigned to a custom class, the value has one of the default classes, or basic pattern classes, that are shown in the following table.

Table 2. Default classes
Class label  Description
^            Digits only. The caret (^) class represents a single number, for example, the number 123. However, the string 1,230 uses three values (previously called tokens): the number 1, a comma, and the number 230.
?            One or more consecutive words that are not assigned to a custom class. The Standardization Rules Designer does not use the ? class.
+            Letters only.
&            A single value of any type.
>            Leading digits, followed by letters.
<            Leading letters, followed by digits.
@            Mixed letters and digits. For example: A123B, 345BCD789.
~            Special characters that are not in the SEPLIST, which is the list of characters that indicate where one value in a record ends and the next value begins.
k            One or more Chinese numeric characters.
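
As a rough illustration of Table 2, the following Python sketch assigns a default class to a single value by using regular expressions. It is not the product's algorithm. The default_class helper and the rule order are assumptions, the sketch handles ASCII letters and digits only, and it treats every other non-space character as a special character without consulting a SEPLIST. The ?, &, and k classes are left out.

```python
import re

# Ordered rules: the first pattern that matches the whole value wins.
# The narrower classes must come before the catch-all "@" class.
DEFAULT_CLASS_RULES = [
    ("^", re.compile(r"[0-9]+")),              # digits only
    ("+", re.compile(r"[A-Za-z]+")),           # letters only
    (">", re.compile(r"[0-9]+[A-Za-z]+")),     # leading digits, then letters
    ("<", re.compile(r"[A-Za-z]+[0-9]+")),     # leading letters, then digits
    ("@", re.compile(r"[A-Za-z0-9]+")),        # any other letter/digit mix
    ("~", re.compile(r"[^A-Za-z0-9\s]+")),     # special characters (assumption)
]


def default_class(value: str):
    """Return the default class label for one value, or None if no rule fits."""
    for label, rule in DEFAULT_CLASS_RULES:
        if rule.fullmatch(value):
            return label
    return None


for sample in ["123", "ROAD", "123A", "A123", "A123B", "345BCD789"]:
    print(sample, default_class(sample))
```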

Custom classes

Custom classes are defined by users. The label for a custom class can be an uppercase alphabetic character or the number 0, which indicates a null class.

A custom class provides stronger contextual information about a value than a default class. For example, if a classification definition does not assign the value ROAD to a custom class, the value is assigned to the + default class. This default class indicates that the value is a single alphabetic word. If a classification definition assigns the value to a custom class that represents street types, which might be represented by the character T, the value provides more contextual information. When this information is provided, you can write rules that address a specific subset of the data and therefore handle that data more effectively.

The null class

The null class, which has the label 0, is used in a classification definition or in a RETYPE action to make a value NULL. Because a value with the null class never matches anything, the value is never used in a pattern and is not processed.

If you assign a value to the null class, the value is skipped in the pattern matching process.

You can find more information about the null class and the RETYPE action in the Pattern Action Reference.
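
The effect of the null class on a pattern can be pictured with a short Python sketch. The build_pattern helper and the sample values are hypothetical, and the sketch assumes that each value has already been assigned a class label.

```python
NULL_CLASS = "0"  # label of the null class


def build_pattern(classified):
    """Drop null-class values, then return (pattern, remaining values).

    `classified` is a list of (value, class_label) pairs.
    """
    kept = [(value, label) for value, label in classified if label != NULL_CLASS]
    pattern = "".join(label for _, label in kept)
    values = [value for value, _ in kept]
    return pattern, values


# Hypothetical input in which THE was assigned to the null class.
pattern, values = build_pattern(
    [("THE", "0"), ("CHERRY", "+"), ("HILL", "+"), ("ROAD", "T")]
)
print(pattern, values)  # ++T ['CHERRY', 'HILL', 'ROAD']
```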

Classification definitions

A classification definition assigns a value to a class. The definition can include additional information about the value and affect other similar values.

A classification definition has the following parts:
Value
The string of one or more characters that you want to add a definition for.
Standard value
A standardized spelling or representation of the value that can be used as part of an action or condition in a rule. If you do not specify a standard value, it is the same as the value.

The standard value might be an abbreviation or expanded variation of the word. For example, the standard value for WEST might be W, and the standard value for POB might be "PO BOX".

In the classifications table (previously called the .CLS file), the maximum length for a standard value is 25 characters.

In the classification definition for a value in the null class, the standard value is not required.

Class
The class that the value is assigned to. The class is represented by a one-character class label. For more information about class types, see Class types.
Similarity threshold (previously called threshold weight)

The degree of variation that can exist in the spelling or representation of the value. If you want the classification definition to affect values that are different from the value in the definition, you can set the similarity threshold lower than the default of 900.

The similarity threshold must be an integer in the range 700 - 900. The integers represent the following degrees of variation:
900  Strings must match exactly.
800  Strings are almost certainly the same.
750  Strings are probably the same.
700  Strings are probably different.

When the rule set that contains a classification definition is applied to data, each value in the data is compared with the value in the definition and a score is assigned. This score indicates the degree of similarity between the two values. The string comparison method that is used can take into account phonetic errors; the random insertion, deletion, and replacement of characters; and the transposition of characters.

The score is weighted by the length of the value because small errors in long values are less serious than errors in short values. Because errors in short values cannot generally be tolerated, do not specify a similarity threshold for short values.
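
The following Python sketch shows how a threshold of this kind can gate a fuzzy comparison. The product's comparison method is not reproduced here: difflib.SequenceMatcher from the Python standard library is a stand-in, and scaling its ratio so that an exact match scores 900 is an assumption made to line up with the values listed above. The similarity_score and definition_applies helpers are hypothetical names.

```python
from difflib import SequenceMatcher

EXACT_MATCH = 900  # default similarity threshold described above


def similarity_score(a: str, b: str) -> int:
    """Stand-in score: 900 for identical strings, lower as they diverge."""
    # Assumption: comparison is case-insensitive.
    ratio = SequenceMatcher(None, a.upper(), b.upper()).ratio()
    return round(EXACT_MATCH * ratio)


def definition_applies(input_value: str, defined_value: str,
                       threshold: int = EXACT_MATCH) -> bool:
    """Return True if the input value is similar enough to the defined value."""
    if not 700 <= threshold <= 900:
        raise ValueError("similarity threshold must be in the range 700 - 900")
    return similarity_score(input_value, defined_value) >= threshold


print(definition_applies("BOULEVRAD", "BOULEVARD"))       # default threshold
print(definition_applies("BOULEVRAD", "BOULEVARD", 750))  # relaxed threshold
```

With this stand-in, a single transposition costs a four-letter value such as RAOD proportionally more than it costs a nine-letter value such as BOULEVRAD, which mirrors the length weighting described above.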

Classifications table (.CLS file)

In a rule set, the classifications table (previously called the .CLS file) contains a list of classification definitions. A classification definition assigns a value to a class.

The header of the classifications table includes the name of the rule set and the legend for the classifications. The legend indicates the classes that are used in the classification definitions and descriptions for those classes. All of the lines in the header are specified as comments by preceding any text with semicolons. For example, the header in a classifications table for a rule set that handles retail product data might include the following lines:
;--------------------------------------------------------
; Retail Product Classification Table
;--------------------------------------------------------
; Classification Legend
;--------------------------------------------------------
; B - Product Brand
; C - Product Color
; N - Product Name
; S - Product Size
; T - Product Type

After the header, the file contains the following strings:

;;ProductName vn.n
\FORMAT\ SORT=N

Do not include any other comments before these lines.

After the header and introductory strings, each line in the classifications table includes one classification definition. In the classifications table, classification definitions use the following format:


value standard-value class [similarity-threshold] [; comments]

In the classifications table, each value must be a single word. The words in multiple-word or compound names, such as New York, North Dakota, or Rhode Island, are treated as separate values.
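
To make the line format concrete, here is a minimal Python sketch of a parser for one line of a classifications table. It is an illustration under several assumptions, not the product's parser: the Definition type and parse_definition helper are hypothetical, a semicolon is assumed to always start a comment, a standard value that contains spaces is assumed to be enclosed in double quotation marks, and null-class definitions that omit the standard value are not handled.

```python
import re
from typing import NamedTuple, Optional


class Definition(NamedTuple):
    value: str
    standard_value: str
    class_label: str
    threshold: Optional[int]  # None means the default applies


# A field is either a double-quoted string or a run of non-space characters.
_FIELD = re.compile(r'"([^"]*)"|(\S+)')


def parse_definition(line: str) -> Optional[Definition]:
    """Parse one line; return None for comments, blanks, and \\FORMAT\\ lines."""
    body = line.split(";", 1)[0].strip()   # drop any trailing comment
    if not body or body.startswith("\\FORMAT\\"):
        return None
    fields = [m.group(1) if m.group(1) is not None else m.group(2)
              for m in _FIELD.finditer(body)]
    if len(fields) not in (3, 4):
        raise ValueError(f"expected 3 or 4 fields, got {len(fields)}: {line!r}")
    value, standard_value, class_label = fields[:3]
    if len(class_label) != 1:
        raise ValueError(f"class label must be one character: {class_label!r}")
    threshold = int(fields[3]) if len(fields) == 4 else None
    if threshold is not None and not 700 <= threshold <= 900:
        raise ValueError(f"similarity threshold out of range: {threshold}")
    return Definition(value, standard_value, class_label, threshold)


# Hypothetical sample lines; the class labels are illustrative only.
print(parse_definition('POB "PO BOX" U 800 ; post office box'))
print(parse_definition("WEST W D"))
```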

Literals in the classifications table

Literals are characters that are entered instead of a string in one of the parts of a classification definition.

Some characters that function as literals are also used as labels for default classes. To specify one of these characters as a literal, you must enter an escape character before the character that you want to use as a literal.

When you enter a classification definition in the classifications table, you can use the literals and escape characters that are shown in the following table.

Table 3. Literals and escape characters in the classifications table
Character  Description
\&         The ampersand (&) is a class that indicates a single value of any type. However, you can type the backslash (\) escape character before the ampersand to use the ampersand as a literal.
/          Literal.
\/         You can use the backslash (\) escape character with the forward slash (/) in the same manner that you use the forward slash (/) character.
-          Literal.
\-         You can use the backslash (\) escape character with the hyphen in the same manner that you use the hyphen (-) character.
\#         Literal. You must use this character with the backslash (\) escape character, for example: \#.
()         Literal. The parentheses are used to enclose operands or user variables in pattern syntax.
\( and \)  Use the backslash (\) escape character with the opening parenthesis or closing parenthesis to filter out parenthetical remarks.