CSV file format for classification definitions
When you import classification definitions in the Standardization Rules Designer, the definitions must be in a text file of comma-separated values (CSV). Each classification definition in the file must be formatted correctly.
CSV file requirements
The entire CSV file must meet the following requirements:
- Files must have only one definition per line.
- Files must use UTF-8 character encoding.
- If leading or trailing white space must be preserved for an individual value, the entire value must be enclosed in double quotation marks.
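The file-level requirements above can be sketched in Python as follows. This is only an illustration: the helper names `format_field`, `format_definition`, and `write_definitions` are hypothetical, and doubling embedded quotation marks is an assumption borrowed from common CSV practice, not something this topic specifies. The sketch quotes any value that contains white space, which covers the leading and trailing case.

```python
def format_field(text):
    """Quote a value when it contains white space, a comma, or a quotation mark."""
    needs_quotes = any(ch.isspace() for ch in text) or "," in text or '"' in text
    if needs_quotes:
        # Assumption: embedded quotation marks are escaped by doubling them.
        return '"' + text.replace('"', '""') + '"'
    return text


def format_definition(fields):
    """Join the columns of one definition into a single comma-separated line."""
    return ",".join(format_field(field) for field in fields)


def write_definitions(path, definitions):
    """Write one definition per line, using UTF-8 character encoding."""
    with open(path, "w", encoding="utf-8", newline="\n") as handle:
        for fields in definitions:
            handle.write(format_definition(fields) + "\n")
```

For example, `format_definition([" PO BOX", "PO BOX", "C", "850"])` produces a line in which both the value and the standard value are enclosed in double quotation marks, so the leading space in the value is preserved.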
Definition requirements
Each classification definition can include a maximum of four columns. The following table shows the four columns in the order that they must be specified and lists requirements for each column.
Column | Required column | Requirements |
---|---|---|
Value | Yes | The maximum length is 600 characters. |
Standard value | No. If a standard value is not specified, the standard value is the same as the value. | The maximum length is 600 characters. |
Class label | Yes | The class label must be one character. |
Similarity threshold | No. If a similarity threshold is not specified, a default of 900 is assigned. | The value must be an integer in the range 700 - 900. |
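The column requirements in the table can be checked before import with a small script. The following is a minimal sketch, not part of the product: the function name `validate_definition` and the wording of the messages are hypothetical, and it assumes that each definition has already been split into a list of strings.

```python
MAX_LENGTH = 600                  # maximum length of value and standard value
THRESHOLD_RANGE = range(700, 901) # integers 700 through 900, inclusive


def validate_definition(fields):
    """Return a list of problems found in one definition; empty if none."""
    problems = []
    if len(fields) > 4:
        problems.append("more than four columns")
    # Pad missing trailing columns so that optional columns can be omitted.
    value, standard, label, threshold = (list(fields) + [""] * 4)[:4]

    if not value:
        problems.append("value is required")
    elif len(value) > MAX_LENGTH:
        problems.append("value is longer than 600 characters")
    if len(standard) > MAX_LENGTH:
        problems.append("standard value is longer than 600 characters")
    if len(label) != 1:
        problems.append("class label must be one character")
    if threshold:
        try:
            number = int(threshold)
        except ValueError:
            problems.append("similarity threshold is not an integer")
        else:
            if number not in THRESHOLD_RANGE:
                problems.append("similarity threshold is outside 700 - 900")
    return problems
```

An empty list means that the definition satisfies every requirement in the table. A definition such as `["BOX", "", "CC", "950"]` yields two problems, one for the class label and one for the similarity threshold.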
Examples
In this example, values are specified only for the required columns.
BOX,,C
As a result, default values are assigned for columns that are not specified. The following table shows the classification definition that is shown in the Standardization Rules Designer.
Value | Standard value | Class label | Similarity threshold |
---|---|---|---|
BOX | BOX | C | 900 |
In this example, values are specified for all of the columns. Because the standard value contains white space, it is enclosed in double quotation marks.
NC,"NORTH CAROLINA",S,800
The following table shows the definition that results.
Value | Standard value | Class label | Similarity threshold |
---|---|---|---|
NC | NORTH CAROLINA | S | 800 |
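The defaulting behavior in both examples can be reproduced outside the Standardization Rules Designer with Python's standard `csv` module. This is a rough sketch only: the function name `load_definitions` is hypothetical, and the `csv` module's handling of quotation marks and white space is assumed to be close enough to the importer's for illustration.

```python
import csv
import io

DEFAULT_THRESHOLD = 900  # default that is assigned when no threshold is specified


def load_definitions(text):
    """Parse definitions and fill in defaults for columns that are not specified."""
    definitions = []
    for fields in csv.reader(io.StringIO(text)):
        if not fields:
            continue  # skip blank lines
        value, standard, label, threshold = (fields + [""] * 4)[:4]
        definitions.append((
            value,
            standard or value,  # standard value defaults to the value
            label,
            int(threshold) if threshold else DEFAULT_THRESHOLD,
        ))
    return definitions


sample = 'BOX,,C\nNC,"NORTH CAROLINA",S,800\n'
for definition in load_definitions(sample):
    print(definition)
# ('BOX', 'BOX', 'C', 900)
# ('NC', 'NORTH CAROLINA', 'S', 800)
```

The two printed tuples match the rows of the two result tables in this section.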