Classifications
Classifications strengthen the contextual information that patterns provide by identifying that the underlying values belong to particular categories. Each rule set contains its own set of categories, which are called classes.
Input record | Class label | Contextual information that the class provides |
---|---|---|
123 | ^ | Value that includes only numbers |
N | D | Street direction |
Cherry | + | Value that includes only letters |
Hill | + | Value that includes only letters |
Road | T | Street type |
- Default classes provide basic information about the type of the value, such as whether the value consists of alphabetic characters, numeric characters, or some combination of both.
- Custom classes provide stronger contextual information about the type of the value. In a data set that contains retail product information, custom classes might be used to indicate whether an alphabetic value is the name of a product or the name of a brand. The one-character label for custom classes can be any letter in the Latin alphabet or 0, which indicates a null class.
Rule sets use classifications to identify and classify key values. For example, a rule set for address data might use classifications to categorize values that are street types (AVE, ST, RD) or directions (N, NW, S) by providing the following information:
- Standard abbreviations for each word; for example, HWY for Highway
- A list of one-character labels that represent classes and that are assigned to individual data elements during processing
Classifications are added and modified by editing the classifications table (previously called the .CLS file), enhancing a rule set in DataStage, or using the user classification override.
Class types
Classes provide contextual information about values. Default classes provide basic information about the type of the value, and custom classes provide stronger contextual information about the type of the value.
Default classes
If a value is not assigned to a custom class, the value has one of the default classes, or basic pattern classes, that are shown in the following table.
Class label | Description |
---|---|
^ | Digits only. The caret (^) class represents a single number, for example, the number 123. However, the string 1,230 uses three values (previously called tokens): the number 1, a comma, and the number 230. |
? | One or more consecutive words that are not assigned to a custom class. The Standardization Rules Designer does not use the ? class. |
+ | Letters only. |
& | A single value of any type. |
> | Leading digits, followed by letters. |
< | Leading letters, followed by digits. |
@ | Mixed letters and digits. For example: A123B, 345BCD789. |
~ | Special characters that are not in the SEPLIST, which is the list of characters that indicate where one value in a record ends and the next value begins. |
k | One or more Chinese numeric characters. |
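
The letter and digit classes in the table above can be illustrated with a short Python sketch. The function `default_class` is a hypothetical helper written for this illustration, not part of the product. It assumes ASCII input that has already been split into single values, it treats any value without letters or digits as the ~ class without consulting a SEPLIST, and it does not model the ?, &, or k classes.

```python
import re

def default_class(value: str) -> str:
    """Return the default class label for one value (illustrative only)."""
    if not value:
        raise ValueError("empty value")
    if re.fullmatch(r"[0-9]+", value):
        return "^"  # digits only
    if re.fullmatch(r"[A-Za-z]+", value):
        return "+"  # letters only
    if re.fullmatch(r"[0-9]+[A-Za-z]+", value):
        return ">"  # leading digits, followed by letters
    if re.fullmatch(r"[A-Za-z]+[0-9]+", value):
        return "<"  # leading letters, followed by digits
    if re.fullmatch(r"[A-Za-z0-9]+", value):
        return "@"  # any other mix of letters and digits
    if not any(ch.isalnum() for ch in value):
        return "~"  # assumption: special characters that are not separators
    raise ValueError(f"value not covered by this sketch: {value!r}")

for sample in ["123", "Cherry", "123A", "A123", "A123B", "345BCD789"]:
    print(sample, default_class(sample))
```

The order of the checks matters: the more specific shapes (digits only, letters only, leading digits, leading letters) are tested before the general mixed class @.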
Custom classes
Custom classes are defined by users. The label for a custom class can be an uppercase alphabetic character or the number 0, which indicates a null class.
A custom class provides stronger contextual information about a value than a default class. For example, if a classification definition does not assign the value ROAD to a custom class, the value is assigned to the + default class. This default class indicates that the value is a single alphabetic word. If a classification definition assigns the value to a custom class that represents street types, which might be represented by the character T, the value provides more contextual information. When this information is provided, you can write rules that address a specific subset of the data and therefore handle that data more effectively.
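
The effect of a custom class on the ROAD example can be sketched in Python as follows. The dictionary layout, the function names, the case-insensitive lookup, and the standard value RD are assumptions made for this illustration. Only the ^ and + default classes are covered.

```python
from typing import Dict, Tuple

# Hypothetical classification definitions: value -> (standard value, class label).
ADDRESS_DEFINITIONS: Dict[str, Tuple[str, str]] = {
    "N": ("N", "D"),      # street direction
    "ROAD": ("RD", "T"),  # street type
}

def classify(value: str, definitions: Dict[str, Tuple[str, str]]) -> str:
    """Return the custom class if one is defined, else a default class."""
    entry = definitions.get(value.upper())  # assumption: lookup ignores case
    if entry is not None:
        return entry[1]
    if value.isdigit():
        return "^"  # digits only
    if value.isalpha():
        return "+"  # letters only
    raise ValueError(f"value not covered by this sketch: {value!r}")

def pattern(record: str, definitions: Dict[str, Tuple[str, str]]) -> str:
    """Concatenate the class labels of the whitespace-separated values."""
    return "".join(classify(v, definitions) for v in record.split())

print(pattern("123 N Cherry Hill Road", ADDRESS_DEFINITIONS))  # ^D++T
print(pattern("123 N Cherry Hill Road", {}))                   # ^++++
```

With the definitions in place, a rule can target the pattern ^D++T specifically. Without them, every word falls into the + class and the street type is indistinguishable from the street name.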
The null class
The null class, which has the label 0, is used in a classification definition or in a RETYPE action to make a value NULL. Because a value with the null class never matches anything, the value is never used in a pattern and is not processed.
If you assign a value to the null class, the value is skipped in the pattern matching process.
You can find more information about the null class and the RETYPE action in the Pattern Action Reference.
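
As a rough Python illustration of how null-class values drop out of pattern matching, the sketch below filters them before the pattern is built. The function name, the input layout, and the sample values are hypothetical and do not reflect the product's internal processing.

```python
from typing import List, Tuple

NULL_CLASS = "0"

def build_pattern(classified: List[Tuple[str, str]]) -> Tuple[str, List[str]]:
    """Drop null-class values, then return the pattern and the values kept.

    classified is a list of (value, class label) pairs.
    """
    kept = [(value, label) for value, label in classified if label != NULL_CLASS]
    pattern = "".join(label for _, label in kept)
    values = [value for value, _ in kept]
    return pattern, values

# Assumption for the example: THE has been assigned to the null class.
record = [("123", "^"), ("THE", "0"), ("CHERRY", "+"), ("ROAD", "T")]
print(build_pattern(record))  # ('^+T', ['123', 'CHERRY', 'ROAD'])
```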
Classification definitions
A classification definition assigns a value to a class. The definition can include additional information about the value and affect other similar values.
- Value
- The string of one or more characters that you want to add a definition for.
- Standard value
- A standardized spelling or representation of the value that can be used as part of an action or condition in a rule. If you do not specify a standard value, it is the same as the value.
The standard value might be an abbreviation or expanded variation of the word. For example, the standard value for WEST might be W, and the standard value for POB might be "PO BOX".
In the classifications table (previously called the .CLS file), the maximum length for a standard value is 25 characters.
In the classification definition for a value in the null class, the standard value is not required.
- Class
- The class that the value is assigned to. The class is represented by a one-character class label. For more information about class types, see Class types.
- Similarity threshold (previously called threshold weight)
- The degree of variation that can exist in the spelling or representation of the value. If you want the classification definition to affect values that are different from the value in the definition, you can set the similarity threshold lower than the default of 900.
The similarity threshold must be an integer in the range 700 to 900. The integers represent the following degrees of variation:
- 900
- Strings must match exactly.
- 800
- Strings are almost certainly the same.
- 750
- Strings are probably the same.
- 700
- Strings are probably different.
When the rule set that contains a classification definition is applied to data, values in the data are compared and a score is assigned. This score indicates the degree of similarity between two values. The string comparison method that is used can take into account phonetic errors, random insertion, deletion and replacement of characters, and transposing of characters.
The score is weighted by the length of the value because small errors in long values are less serious than errors in short values. Because errors in short values cannot generally be tolerated, do not specify a similarity threshold for short values.
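
The reason that short values tolerate errors poorly can be illustrated with a generic string comparison. The sketch below uses `difflib.SequenceMatcher` from the Python standard library as a stand-in. It is not the comparison method that the product uses, and its 0.0 to 1.0 ratio is not mapped onto the 700 to 900 threshold scale.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Return a ratio from 0.0 (nothing in common) to 1.0 (identical)."""
    return SequenceMatcher(None, a.upper(), b.upper()).ratio()

# One wrong character in a two-character value.
short_score = similarity("ST", "SR")
# One wrong character in a nine-character value.
long_score = similarity("BOULEVARD", "BOULEVARR")

print(f"short value: {short_score:.2f}")
print(f"long value:  {long_score:.2f}")
```

The same single-character error lowers the ratio for the short value far more than for the long value. A threshold loose enough to accept the misspelled short value would also accept many unrelated short strings.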
Classifications table (.CLS file)
In a rule set, the classifications table (previously called the .CLS file) contains a list of classification definitions. A classification definition assigns a value to a class. The file begins with a header of comment lines, such as the following example:
;--------------------------------------------------------
; Retail Product Classification Table
;--------------------------------------------------------
; Classification Legend
;--------------------------------------------------------
; B - Product Brand
; C - Product Color
; N - Product Name
; S - Product Size
; T - Product Type
After the header, the file contains the following strings:
;;ProductName vn.n
\FORMAT\ SORT=N
Do not include any other comments before these lines.
After the header and introductory strings, each line in the classifications table includes one classification definition. In the classifications table, classification definitions use the following format:
value standard value class [similarity-threshold] [; comments]
In the classifications table, each value must be a single word. Multiple or compound words, such as New York, North Dakota, or Rhode Island, are considered separate values.
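
To make the line format concrete, here is a minimal Python sketch that parses one line of a classifications table. The parser, the `Definition` type, and the sample lines are hypothetical. The sketch assumes that a multiword standard value is enclosed in double quotation marks, that a semicolon always starts a comment, and that every definition supplies all three required fields. The optional standard value for null-class definitions is not handled.

```python
import re
from typing import NamedTuple, Optional

class Definition(NamedTuple):
    value: str
    standard_value: str
    class_label: str
    threshold: Optional[int]  # None when the line omits the threshold

# A field is either a double-quoted string or a run of non-space characters.
_FIELD = re.compile(r'"[^"]*"|\S+')

def parse_definition(line: str) -> Optional[Definition]:
    """Parse one table line; return None for comment and format lines."""
    body = line.split(";", 1)[0].strip()  # drop the trailing comment
    if not body or body.startswith("\\FORMAT\\"):
        return None
    fields = [field.strip('"') for field in _FIELD.findall(body)]
    if len(fields) not in (3, 4):
        raise ValueError(f"expected 3 or 4 fields: {line!r}")
    value, standard_value, class_label = fields[:3]
    if len(standard_value) > 25:
        raise ValueError("standard value is longer than 25 characters")
    if len(class_label) != 1:
        raise ValueError("class label must be one character")
    threshold = None
    if len(fields) == 4:
        threshold = int(fields[3])
        if not 700 <= threshold <= 900:
            raise ValueError("similarity threshold must be in the range 700 to 900")
    return Definition(value, standard_value, class_label, threshold)

# Hypothetical sample lines; the class labels are illustrative.
print(parse_definition("WEST W D"))
print(parse_definition('POB "PO BOX" U 800 ; post office box'))
```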
Literals in the classifications table
Literals are characters that are entered instead of a string in one of the parts of a classification definition.
Some characters that function as literals are also used as labels for default classes. To specify one of these characters as a literal, you must enter an escape character before the character that you want to use as a literal.
When you enter a classification definition in the classifications table, you can use the literals and escape characters that are shown in the following table.
Character | Description |
---|---|
\& | The ampersand (&) is a class that indicates a single value of any type. However, you can type the backslash (\) escape character before the ampersand to use the ampersand as a literal. |
/ | Literal. |
\/ | You can use the backslash (\) escape character with the forward slash (/) in the same manner that you use the forward slash (/) character. |
- | Literal. |
\- | You can use the backslash (\) escape character with the hyphen in the same manner that you use the hyphen (-) character. |
\# | Literal. You must use this character with the backslash (\) escape character, for example: \#. |
() | Literal. The parentheses are used to enclose operands or user variables in a pattern syntax. |
\( and \) | Use the backslash (\) escape character with the opening parenthesis or closing parenthesis to filter out parenthetical remarks. |