CSV file for term and classification assignment based on rules
Create a CSV file with the name ikc-term-assignment-rules.csv that defines the rules for term or classification assignment and upload it to the project. The CSV file must conform to formatting rules.
General formatting rules
The CSV file must comply with the Common Format and MIME Type for comma-separated values (CSV) Files and must be encoded in UTF-8.
File size
The maximum recommended size of the CSV import file is 50 MB.
Header row
The header row of the CSV file represents the properties that make up the rule and the action to take.
Follow these guidelines for the header row:
- The header row must be the first row in the file and must not be repeated.
- Separate column names with a comma. If you create the file in a spreadsheet editor, the commas are added automatically when you save the file in CSV format.
- The header row must include the mandatory columns for the rule.
- You can omit any optional columns.
- You can add arbitrary other columns, which will be ignored.
- Use the exact column names in the header row. Column names are case-sensitive.
- Make sure the column names do not include extra white space characters. A spreadsheet or text editor might add white space characters that are not visible. If you receive an import error that the column names are incorrect even though your columns are spelled and capitalized correctly, check for white space.
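The header guidelines above can be checked before upload. The following is a minimal sketch in Python, not part of the product: the function name check_header is hypothetical, and the set of mandatory columns is taken from the rule condition columns that this topic marks as mandatory.

```python
import csv
import io

# Rule-condition columns that this topic marks as mandatory.
MANDATORY_COLUMNS = {"OBJECT_TYPE", "PROPERTY", "MATCH_STRING", "MATCH_TYPE"}


def check_header(csv_text: str) -> list[str]:
    """Return a list of problems found in the header row (empty if none)."""
    header = next(csv.reader(io.StringIO(csv_text)), [])
    problems = []
    for name in header:
        # Invisible leading or trailing white space makes a name unrecognizable.
        if name != name.strip():
            problems.append(f"column name {name!r} has leading or trailing white space")
    # Column names are case-sensitive, so compare them exactly.
    for name in sorted(MANDATORY_COLUMNS - set(header)):
        problems.append(f"mandatory column {name} is missing")
    return problems
```

A header such as OBJECT_TYPE, PROPERTY,MATCH_STRING,match_type would be reported three times here: once for the white space before PROPERTY, and once each for the missing PROPERTY and MATCH_TYPE columns.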
Column specification
To delimit values for different columns, use a comma. If you create the file in a spreadsheet editor, the commas are added automatically when you save the file in CSV format.
To omit a value for a column, use a comma directly after the previous comma and without any other characters. For example, two consecutive commas indicate that the second column is empty.
To enclose values, use double quotation marks (").
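As an illustration of these delimiter rules, the following sketch reads a small sample with Python's standard csv module. The sample content is made up for this example. The match string contains a comma and is therefore enclosed in double quotation marks, and the two consecutive commas leave the CONFIDENCE column empty.

```python
import csv
import io

# Hypothetical sample: a quoted value that contains a comma, and an empty CONFIDENCE value.
sample = (
    "OBJECT_TYPE,PROPERTY,MATCH_TYPE,MATCH_STRING,TERM_NAME,CONFIDENCE,COMMENT\n"
    'column,name,contains,"street, city",GDPR >> personal data,,Quoted match string\n'
)

rows = list(csv.DictReader(io.StringIO(sample)))
first = rows[0]

# The quoted value is read as a single field; the omitted value is an empty string.
match_string = first["MATCH_STRING"]
confidence = first["CONFIDENCE"]
```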
Category paths
You must specify the full category path for a term or a classification. To delimit the category path, use two greater-than (>>) symbols between each level of the category hierarchy and between the category path and the artifact name.
If you start the path with >>, the root category is [uncategorized].
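To make the path convention concrete, here is a small sketch that splits a path into its category levels and the artifact name. The function name split_artifact_path is hypothetical, and trimming the white space around each level is an assumption of this sketch, not documented product behavior.

```python
def split_artifact_path(path: str) -> tuple[list[str], str]:
    """Split 'Cat 1 >> Cat 2 >> Name' into (['Cat 1', 'Cat 2'], 'Name')."""
    # Assumption: white space around each level is not significant.
    parts = [part.strip() for part in path.split(">>")]
    *categories, name = parts
    # A path that starts with >> has an empty first level,
    # which stands for the [uncategorized] root category.
    if categories and categories[0] == "":
        categories[0] = "[uncategorized]"
    return categories, name
```

For example, the path >> Confidential yields the category list ['[uncategorized]'] and the artifact name Confidential.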
Rule columns
The CSV file can contain mandatory and optional columns.
To define the rule condition, include these columns:
OBJECT_TYPE
The type of object where terms should be assigned. Valid values:
- asset
- column
This column is mandatory and must not be empty.
PROPERTY
The property to match. Valid values:
- name: The name of the data asset or column.
- description: The description of the data asset or column.
- mostfreqvalues: Any of the most frequent values of the data profile. Rules with this property require data profiling before the rule can be properly applied. OBJECT_TYPE must be column.
- dataclassname: The name of the data class that is assigned to a column. OBJECT_TYPE must be column.
- assetid: The ID of the data asset.
- parentassetname: The name of the data asset that contains the column. OBJECT_TYPE must be column.
- parentassetdescription: The description of the data asset that contains the column. OBJECT_TYPE must be column.
This column is mandatory and must not be empty.
MATCH_STRING
The string to match against the property. You can set any value. This column is mandatory and must not be empty.
MATCH_TYPE
Describes how the match string should be matched against the property. This column is mandatory and must not be empty. Valid values:
- equals: Case-insensitive exact match.
- equalscs: Case-sensitive exact match.
- contains: Match if the property contains the match string. Matching is case-insensitive.
- containscs: Match if the property contains the match string. Matching is case-sensitive.
- regex: Match if the match string is a regular expression and matches the property. For security reasons, this is disabled by default. An instance administrator can enable the use of regular expressions. For more information, see Enable the use of regular expressions for rule-based term assignment in the IBM Software Hub documentation.
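The rule condition columns can be read as a small matching procedure. The following sketch illustrates that reading and is not the product's implementation: the function names are hypothetical, the regex branch uses Python's re.search (the regular expression dialect and anchoring that the service applies are not specified in this topic), and the allow_regex flag merely imitates the administrator setting.

```python
import re

# Properties that, according to this topic, require OBJECT_TYPE to be column.
COLUMN_ONLY_PROPERTIES = {
    "mostfreqvalues", "dataclassname", "parentassetname", "parentassetdescription",
}
VALID_PROPERTIES = {"name", "description", "assetid"} | COLUMN_ONLY_PROPERTIES


def check_condition(object_type: str, prop: str) -> None:
    """Raise ValueError if OBJECT_TYPE and PROPERTY are not a valid combination."""
    if object_type not in {"asset", "column"}:
        raise ValueError(f"invalid OBJECT_TYPE: {object_type!r}")
    if prop not in VALID_PROPERTIES:
        raise ValueError(f"invalid PROPERTY: {prop!r}")
    if prop in COLUMN_ONLY_PROPERTIES and object_type != "column":
        raise ValueError(f"PROPERTY {prop} requires OBJECT_TYPE column")


def matches(match_type: str, match_string: str, value: str,
            allow_regex: bool = False) -> bool:
    """Apply one MATCH_TYPE to a property value."""
    if match_type == "equals":
        return value.casefold() == match_string.casefold()
    if match_type == "equalscs":
        return value == match_string
    if match_type == "contains":
        return match_string.casefold() in value.casefold()
    if match_type == "containscs":
        return match_string in value
    if match_type == "regex":
        # Disabled by default, mirroring the behavior described above.
        if not allow_regex:
            raise ValueError("regular expressions are not enabled")
        # Assumption: an unanchored search; the service may behave differently.
        return re.search(match_string, value) is not None
    raise ValueError(f"invalid MATCH_TYPE: {match_type!r}")
```

With these definitions, matches("contains", "address", "HOME_ADDRESS") is true, while the case-sensitive containscs variant of the same call is false.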
To define which terms and classifications to assign with which confidence, include these columns:
TERM_NAME
The name of the term including the category path as described in Category paths. For example, Category 1 >> Category2 >> MyTerm.
To assign a term, either TERM_NAME or TERM_ID must be present. You can specify both. In that case, TERM_ID takes precedence. If you plan to use the rules file in different systems with similar terms and category hierarchies, use term names instead of term IDs.
TERM_ID
The ID of the term. You can use the artifact ID or the global ID.
To assign a term, either TERM_NAME or TERM_ID must be present. You can specify both. In that case, TERM_ID takes precedence. If you plan to use the rules file in different systems with similar terms and category hierarchies, use term names instead of term IDs.
CLASSIFICATION_NAME
The name of the classification including the category path as described in Category paths. For example, Category 1 >> Category2 >> MyClassification.
To assign a classification, either CLASSIFICATION_NAME or CLASSIFICATION_ID must be present. You can specify both. In that case, CLASSIFICATION_ID takes precedence. If you plan to use the rules file in different systems with similar classifications and category hierarchies, use classification names instead of classification IDs.
CLASSIFICATION_ID
The ID of the classification. You can use the artifact ID or the global ID.
To assign a classification, either CLASSIFICATION_NAME or CLASSIFICATION_ID must be present. You can specify both. In that case, CLASSIFICATION_ID takes precedence. If you plan to use the rules file in different systems with similar classifications and category hierarchies, use classification names instead of classification IDs.
CONFIDENCE
A float value between 0 and 1 that indicates the confidence for the term or classification assignment. The default value is 1.0 (=100%). Independent of the locale, the decimal point is a period (.).
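The precedence and confidence rules for these columns can be sketched as follows. The helper names pick_reference and parse_confidence are hypothetical and only mirror the description in this topic. Python's float() always expects a period as the decimal point, which matches the locale-independent format required for CONFIDENCE.

```python
def pick_reference(row: dict, kind: str):
    """Return ('id', value) or ('name', value) for kind 'TERM' or 'CLASSIFICATION'.

    The ID column takes precedence when both columns are filled.
    Returns None if neither column has a value.
    """
    by_id = (row.get(f"{kind}_ID") or "").strip()
    by_name = (row.get(f"{kind}_NAME") or "").strip()
    if by_id:
        return ("id", by_id)
    if by_name:
        return ("name", by_name)
    return None


def parse_confidence(raw, default: float = 1.0) -> float:
    """Parse a CONFIDENCE value; an empty value falls back to the default."""
    text = (raw or "").strip()
    if not text:
        return default
    value = float(text)  # raises ValueError for values such as "0,9"
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"CONFIDENCE must be between 0 and 1: {text}")
    return value
```

For a row that specifies both TERM_NAME and TERM_ID, pick_reference(row, "TERM") returns the ID, in line with the precedence rule above.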
Additional columns that you can include:
ACTIVE
If the value is set to no, the rule is not considered during assignment. During development, you might want to disable certain rules without removing them from the CSV file.
GROUP
A group of rules that allows you to set up more complex assignment rules, such as: If a column name contains X and its description contains Y, then assign term T1 and T2. At least one condition and one action must be defined per rule group.
Rule file options
In the description field of the uploaded rule file, you can supply additional options that influence how rules are applied. Add lines in the format <option-name>=<option-value>. The description field can contain any other text as well.
default_confidence_if_missing
A float value between 0 and 1 that sets a default confidence other than 1.0 for rules where the CONFIDENCE column is empty.
use_expanded_names
Defines when a generated name should also be considered when rules are evaluated. This option is valid only if gen AI based enrichment capabilities are enabled.
Possible values:
- NEVER: Do not consider generated names.
- SUGGESTED: Consider a suggested generated name.
- ACCEPTED: Consider an assigned generated name.
Default value is ACCEPTED.
use_generated_descriptions
Defines when a generated description should also be considered as a description when rules are evaluated. This option is valid only if gen AI based enrichment capabilities are enabled.
Possible values:
- NEVER: Do not consider generated descriptions.
- SUGGESTED: Consider a suggested generated description.
- ACCEPTED: Consider an assigned generated description.
Default value is ACCEPTED.
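To show how option lines can coexist with free text in the description field, the following sketch extracts the three options described above and ignores every other line. The function name is hypothetical, and tolerating white space around the equal sign is an assumption of this sketch. The defaults are the ones stated in this topic.

```python
# Defaults as stated in this topic.
DEFAULT_OPTIONS = {
    "default_confidence_if_missing": "1.0",
    "use_expanded_names": "ACCEPTED",
    "use_generated_descriptions": "ACCEPTED",
}


def parse_rule_file_options(description: str) -> dict:
    """Return the effective options found in a rule file description."""
    options = dict(DEFAULT_OPTIONS)
    for line in description.splitlines():
        name, separator, value = line.partition("=")
        name = name.strip()
        # Lines without "=" or with an unknown option name are treated as free text.
        if separator and name in DEFAULT_OPTIONS:
            options[name] = value.strip()
    return options
```

A description that contains only the line use_expanded_names=NEVER would leave the other two options at their defaults.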
Examples
See some rule and rule group examples.
Rule examples
The following example describes four rules:
- If a column has a name that contains the string address, assign the term personal data with 100% confidence. 100% is the default if the CONFIDENCE column is empty.
- If a column has a name that contains the string customer, assign the term data subject with 90% confidence.
- If an asset has a description that contains the string client, assign the term data subject, but with 100% confidence.
- If an asset has a description that contains the string FYEO, assign the classification Confidential with 100% confidence.
The term names are written as a path in the category tree: GDPR is a root category that contains the terms personal data and data subject.
The COMMENT column contains additional information about the rule but does not affect term assignment.
| OBJECT_TYPE | PROPERTY | MATCH_TYPE | MATCH_STRING | TERM_NAME | CLASSIFICATION_NAME | CONFIDENCE | COMMENT |
|---|---|---|---|---|---|---|---|
| column | name | contains | address | GDPR >> personal data | | | Address is personal data |
| column | name | contains | customer | GDPR >> data subject | | 0.9 | Customers are data subjects |
| asset | description | contains | client | GDPR >> data subject | | | Clients are data subjects |
| asset | description | contains | FYEO | | >> Confidential | | Description contains "FYEO" if it is confidential |
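For reference, the rules in this table can be turned into CSV text with Python's standard csv module. This is one possible way to produce the file, not a required tool. Columns that are omitted from a rule are written as empty values, and the comment that contains double quotation marks is quoted and escaped automatically.

```python
import csv
import io

FIELDS = ["OBJECT_TYPE", "PROPERTY", "MATCH_TYPE", "MATCH_STRING",
          "TERM_NAME", "CLASSIFICATION_NAME", "CONFIDENCE", "COMMENT"]

RULES = [
    {"OBJECT_TYPE": "column", "PROPERTY": "name", "MATCH_TYPE": "contains",
     "MATCH_STRING": "address", "TERM_NAME": "GDPR >> personal data",
     "COMMENT": "Address is personal data"},
    {"OBJECT_TYPE": "column", "PROPERTY": "name", "MATCH_TYPE": "contains",
     "MATCH_STRING": "customer", "TERM_NAME": "GDPR >> data subject",
     "CONFIDENCE": "0.9", "COMMENT": "Customers are data subjects"},
    {"OBJECT_TYPE": "asset", "PROPERTY": "description", "MATCH_TYPE": "contains",
     "MATCH_STRING": "client", "TERM_NAME": "GDPR >> data subject",
     "COMMENT": "Clients are data subjects"},
    {"OBJECT_TYPE": "asset", "PROPERTY": "description", "MATCH_TYPE": "contains",
     "MATCH_STRING": "FYEO", "CLASSIFICATION_NAME": ">> Confidential",
     "COMMENT": 'Description contains "FYEO" if it is confidential'},
]


def rules_to_csv(rules) -> str:
    """Serialize rule dictionaries to CSV text with a single header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rules)  # missing keys become empty values
    return buffer.getvalue()


csv_text = rules_to_csv(RULES)
```

To create the upload file, the resulting text can be written to ikc-term-assignment-rules.csv with UTF-8 encoding.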
Rule group examples
The following example shows a rule group G1 that joins two conditions and a rule group G2 that defines two terms to be assigned for one condition:
- G1: If a column's name contains address and its description contains identifier, assign the term online identifier with 92% confidence.
- G2: If a column has postfach ("P.O. Box" in German) as one of its most frequent values, assign the term European Union with 90% confidence and the term data subject with 95% confidence.
| OBJECT_TYPE | PROPERTY | MATCH_TYPE | MATCH_STRING | TERM_NAME | CONFIDENCE | GROUP |
|---|---|---|---|---|---|---|
| column | name | contains | address | | | G1 |
| column | description | contains | identifier | GDPR >> online identifier | 0.92 | G1 |
| column | mostfreqvalues | contains | postfach | GDPR >> European Union | 0.9 | G2 |
| | | | | GDPR >> data subject | 0.95 | G2 |
This example shows how you can use the parentassetname and parentassetdescription properties.
- G3: If the column with the name CUSTOMER_ID is contained in a data asset with the name CUSTOMER_ADDRESS, assign the term Person:A10 with 100% confidence.
- G4: If the column with the name ACCT_ID is contained in a data asset whose description contains customer account information, assign the term Account number with 95% confidence.
| OBJECT_TYPE | PROPERTY | MATCH_TYPE | MATCH_STRING | TERM_NAME | CONFIDENCE | GROUP |
|---|---|---|---|---|---|---|
| column | parentassetname | equalscs | CUSTOMER_ADDRESS | | | G3 |
| column | name | equalscs | CUSTOMER_ID | Company >> Department >> Person: A10 | 1.0 | G3 |
| column | parentassetdescription | containscs | customer account information | | | G4 |
| column | name | equalscs | ACCT_ID | Finance >> Accounts >> Account number | 0.95 | G4 |
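One way to read the rule group examples is that all conditions of a group must hold before all actions of that group are applied. The sketch below implements this reading for groups G2 and G3. It is an interpretation for illustration only, not the service's algorithm: the function names are hypothetical, OBJECT_TYPE is left out because every object here is assumed to be a column, and regex matching is omitted.

```python
from collections import defaultdict

MATCHERS = {
    "equals": lambda s, v: v.casefold() == s.casefold(),
    "equalscs": lambda s, v: v == s,
    "contains": lambda s, v: s.casefold() in v.casefold(),
    "containscs": lambda s, v: s in v,
}

RULES = [
    {"PROPERTY": "mostfreqvalues", "MATCH_TYPE": "contains", "MATCH_STRING": "postfach",
     "TERM_NAME": "GDPR >> European Union", "CONFIDENCE": "0.9", "GROUP": "G2"},
    {"TERM_NAME": "GDPR >> data subject", "CONFIDENCE": "0.95", "GROUP": "G2"},
    {"PROPERTY": "parentassetname", "MATCH_TYPE": "equalscs",
     "MATCH_STRING": "CUSTOMER_ADDRESS", "GROUP": "G3"},
    {"PROPERTY": "name", "MATCH_TYPE": "equalscs", "MATCH_STRING": "CUSTOMER_ID",
     "TERM_NAME": "Company >> Department >> Person: A10", "CONFIDENCE": "1.0",
     "GROUP": "G3"},
]


def condition_holds(rule: dict, column: dict) -> bool:
    """Check one condition; list-valued properties match if any entry matches."""
    value = column.get(rule["PROPERTY"], "")
    values = value if isinstance(value, list) else [value]
    test = MATCHERS[rule["MATCH_TYPE"]]
    return any(test(rule["MATCH_STRING"], v) for v in values)


def evaluate_groups(rules: list, column: dict) -> list:
    """Return (term, confidence) pairs for every group whose conditions all hold."""
    conditions, actions = defaultdict(list), defaultdict(list)
    for rule in rules:
        if rule.get("PROPERTY"):
            conditions[rule["GROUP"]].append(rule)
        if rule.get("TERM_NAME"):
            confidence = float(rule.get("CONFIDENCE") or 1.0)
            actions[rule["GROUP"]].append((rule["TERM_NAME"], confidence))
    assigned = []
    for group, group_conditions in conditions.items():
        if all(condition_holds(c, column) for c in group_conditions):
            assigned.extend(actions[group])
    return assigned
```

Under this reading, a column named CUSTOMER_ID in the data asset CUSTOMER_ADDRESS receives the G3 term, and it additionally receives both G2 terms only if one of its most frequent values contains postfach.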
Sample rule file description
The following example is a valid rule file description:
This is the best rule file in the world.
default_confidence_if_missing = 0.95
use_expanded_names = ACCEPTED
use_generated_descriptions = SUGGESTED
Closing remarks.