Classification definitions

A classification definition assigns a value to a class. The definition can also specify a standard form for the value and, through a similarity threshold, apply to other values that are spelled similarly.

A classification definition has the following parts:

Value

The string of one or more characters that you want to define.

Standard value

A standardized spelling or representation of the value that can be used as part of an action or condition in a rule. If you do not specify a standard value, it is the same as the value.

The standard value might be an abbreviation or expanded variation of the word. For example, the standard value for WEST might be W, and the standard value for POB might be "PO BOX".

In the classifications table (previously called the .CLS file), the maximum length for a standard value is 25 characters.

In the classification definition for a value in the null class, the standard value is not required.

Class

The class that the value is assigned to. The class is represented by a one-character class label. For more information about class types, see Class types.
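
The parts described so far (value, standard value, and class) can be pictured as a small record. The following Python sketch is only an illustration of the rules stated above, not the product's storage format or API. The name ClassificationDefinition, its field names, and the class labels D and B in the usage note are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional

# Maximum standard value length stated for the classifications table.
MAX_STANDARD_VALUE_LENGTH = 25


@dataclass
class ClassificationDefinition:
    """Hypothetical in-memory model of one classification definition."""

    value: str                            # the string being defined
    class_label: str                      # one-character class label
    standard_value: Optional[str] = None  # optional standardized form

    def __post_init__(self) -> None:
        if not self.value:
            raise ValueError("value must contain one or more characters")
        if len(self.class_label) != 1:
            raise ValueError("class label must be exactly one character")
        # If no standard value is specified, it is the same as the value.
        if self.standard_value is None:
            self.standard_value = self.value
        # Assumption: the 25-character limit is also enforced on a defaulted value.
        if len(self.standard_value) > MAX_STANDARD_VALUE_LENGTH:
            raise ValueError("standard value exceeds 25 characters")
```

For example, ClassificationDefinition("WEST", "D", "W") keeps W as the standard value, while ClassificationDefinition("POB", "B") falls back to POB because no standard value was given. The labels D and B are placeholders chosen for this sketch.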

Similarity threshold (previously called threshold weight)

The degree of variation that can exist in the spelling or representation of the value. To make the classification definition apply to values that differ slightly from the value in the definition, set the similarity threshold lower than the default of 900.

The similarity threshold must be an integer in the range 700 to 900. The following values illustrate the degrees of variation:

900: Strings must match exactly.
800: Strings are almost certainly the same.
750: Strings are probably the same.
700: Strings are probably different.
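
The range and default described above can be checked with a few lines of Python. This is a minimal sketch, assuming a threshold is handled as a plain integer. The function names validate_threshold and describe_threshold are hypothetical, and the sketch returns a description only for the four documented values because the text does not describe intermediate values.

```python
DEFAULT_THRESHOLD = 900
MIN_THRESHOLD, MAX_THRESHOLD = 700, 900

# Only the four levels listed in the documentation are described here.
DOCUMENTED_LEVELS = {
    900: "Strings must match exactly.",
    800: "Strings are almost certainly the same.",
    750: "Strings are probably the same.",
    700: "Strings are probably different.",
}


def validate_threshold(threshold=DEFAULT_THRESHOLD):
    """Return the threshold if it is an integer from 700 to 900, else raise."""
    if isinstance(threshold, bool) or not isinstance(threshold, int):
        raise TypeError("similarity threshold must be an integer")
    if not MIN_THRESHOLD <= threshold <= MAX_THRESHOLD:
        raise ValueError("similarity threshold must be in the range 700 to 900")
    return threshold


def describe_threshold(threshold):
    """Return the documented description, or None for undocumented values."""
    return DOCUMENTED_LEVELS.get(validate_threshold(threshold))
```

Calling validate_threshold() with no argument returns the default of 900, and a value such as 650 is rejected with a ValueError.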

When the rule set that contains a classification definition is applied to data, each value in the data is compared with the value in the definition and a score is assigned. This score indicates the degree of similarity between the two values. The string comparison method that is used can take into account phonetic errors; random insertion, deletion, and replacement of characters; and transposition of characters.

The score is weighted by the length of the value because small errors in long values are less serious than errors in short values. Because errors in short values cannot generally be tolerated, do not specify a similarity threshold for short values.
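
The effect of length weighting can be illustrated with a toy score. The Python sketch below is not the product's comparison method, which is not specified here and also considers phonetic errors and transpositions. As a stand-in, it assumes a plain edit distance scaled to a maximum of 900 by the length of the longer string. The function names edit_distance, similarity_score, and matches are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, and replacements."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # replace, or keep if equal
            ))
        prev = curr
    return prev[-1]


def similarity_score(data_value, definition_value):
    """Toy score from 0 to 900, weighted by the longer string's length."""
    longest = max(len(data_value), len(definition_value))
    if longest == 0:
        return 900
    distance = edit_distance(data_value, definition_value)
    # Integer arithmetic keeps the score an exact integer.
    return 900 * (longest - distance) // longest


def matches(data_value, definition_value, threshold=900):
    """True if the toy score reaches the similarity threshold."""
    return similarity_score(data_value, definition_value) >= threshold
```

Under these assumptions, one missing character in a nine-character value (BOULEVRD compared with BOULEVARD) scores 800, while one missing character in a four-character value (WST compared with WEST) scores 675, which is below even the lowest permitted threshold of 700. This mirrors the guidance that errors in short values cannot generally be tolerated.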