Automatic term assignment (Watson Knowledge Catalog)
Automatic term assignment is the process of automatically mapping business terms to data assets and asset columns. Terms can automatically be assigned to assets and columns as a part of metadata enrichment, column analysis, and automated discovery.
For information about automatic term assignment in column analysis and automated discovery, see Automatic term assignment in column analysis and automated discovery.
You can assign business terms manually by editing the data asset properties in a project or a catalog, or when you work with enrichment results.
If automatic term assignment is configured as part of metadata enrichment, such assignments are generated by several methods. These methods also generate suggestions for terms to assign.
The terms are assigned based on the confidence level. Initially, these associations are represented as candidates, which domain experts and stewards can review and assign manually. The confidence level at which a term is suggested or automatically assigned is determined by the project's enrichment settings. By default, a term's confidence must exceed 75% for the term to be suggested, and 90% for it to be automatically assigned.
Only published business terms can be assigned.
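As a minimal sketch of the default threshold logic described above (the function and constant names are illustrative, not a Watson Knowledge Catalog API):

```python
# Hypothetical sketch of the default threshold behavior; names are
# illustrative and not part of any Watson Knowledge Catalog API.

SUGGESTION_THRESHOLD = 0.75  # default: suggest terms whose confidence exceeds 75%
ASSIGNMENT_THRESHOLD = 0.90  # default: auto-assign terms whose confidence exceeds 90%

def classify_candidate(confidence: float) -> str:
    """Decide what happens to a candidate term based on its confidence."""
    if confidence > ASSIGNMENT_THRESHOLD:
        return "assigned"    # assigned automatically
    if confidence > SUGGESTION_THRESHOLD:
        return "suggested"   # shown to domain experts and stewards for review
    return "discarded"       # below the suggestion threshold

print(classify_candidate(0.95))  # assigned
print(classify_candidate(0.80))  # suggested
print(classify_candidate(0.60))  # discarded
```

Note that both thresholds must be exceeded, not merely met, for the corresponding action to happen.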
Methods used to generate term assignments
The following methods are used to generate term assignments:
- The linguistic name matching method bases its result on the similarity between the term and the name of the data asset or column. For example, a column CREDNUM might be associated with a term Credit Card Number because of the similarity between the two names. Linguistic name matching matches only data asset and column names with term names and abbreviations. Descriptions are not taken into account. ML-based term assignment handles names and descriptions.
- The class-based assignment method generates assignments based on data classification. If a data class was selected for an asset column, either as the result of column analysis or manually, and if this data class is linked to one or more business terms, these terms are suggested or assigned if they exceed the respective thresholds. The term confidence level is the same as the confidence of the data class the term is linked with. For example, a column COL1 classified as an email address with 90% confidence is likely to be assigned the term E-mail Address if the data class and term are linked. Because there is no linguistic similarity between the name of the column and the term, the linguistic name matching method is not capable of making this association.
Before you run term assignment, review the data class to term linkage: appropriate linkage is a prerequisite for high-quality results from the class-based assignment method.
Note that business terms linked to the predefined data classes Code, Identifier, Date, Text, Indicator, Quantity, and Boolean are not considered for term assignment.
- The machine learning (ML) method uses supervised machine learning models to assign terms. The ML method works differently depending on which release of IBM Cloud Pak for Data you are on.
- The machine learning (ML) method uses one supervised machine learning model per project to assign terms. ML model training is triggered when a metadata enrichment job is started if one of these conditions is true:
- No model is available yet for the current project.
- A term was added or modified since the model was last trained.
- At least 20 columns were marked as reviewed since the model was last trained.
- The last training did not complete successfully or within a reasonable period of time.
- The machine learning (ML) method for generating term assignments uses two supervised machine learning models: one for term assignments and one for term removals. Depending on your setup, you can have two models per project or two models for global use in all projects. Starting in IBM Cloud Pak for Data 4.5.3, the default setting is to use global models. You can return to using project-specific models, which was the default behavior in earlier versions. See Changing the scope of the built-in ML models used for term assignment.
Project-specific models are trained with any published business terms and any available term assignments or removals on columns that were marked as reviewed in the project.
Global models are trained with any published business terms and any term assignments available in the default catalog.
ML model training is triggered when a metadata enrichment job is started if one of these conditions is true:
- No model is available yet.
- A term was added or modified since the model was last trained.
- Global models: Assignments on at least 20 columns in the default catalog changed because terms were added or removed since the model was last trained.
Project-specific models: At least 20 columns were marked as reviewed since the model was last trained.
- The last training did not complete successfully or within a reasonable period of time.
If no term removal information is available on the first use of the ML method, the initial training of the term removal model is deferred: the model is trained on a subsequent training cycle, when term removal information becomes available.
By removing assigned terms, you provide negative feedback to the automatic term assignment methods. When you rerun automatic term assignment, the ML-based term assignment method then also returns a negative confidence value for such terms. The individual confidence values returned by each term assignment method are adjusted by this negative confidence value for calculating the overall confidence score of a term.
The confidence for an assigned or suggested term is shown as a percentage value. This value represents the overall confidence, which is the maximum of these values:
- The confidence value returned by linguistic name matching, adjusted by any negative confidence value returned by ML-based assignment
- The confidence value returned by class-based assignment, adjusted by any negative confidence value returned by ML-based assignment
- The confidence value returned by ML-based assignment, adjusted by any negative confidence value returned by this method
A project administrator can customize some settings for the term assignment methods. See Default enrichment settings.
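To make the interplay of the methods above concrete, the following is an illustrative sketch only: linguistic name matching is approximated with a standard-library string-similarity ratio, class-based assignment reuses the detected data class confidence, the ML model is stubbed out, and the overall confidence takes the maximum. None of the function names or internals mirror the product's actual implementation.

```python
# Illustrative stand-ins for the three term assignment methods.
# Not product code; real Watson Knowledge Catalog internals differ.
from difflib import SequenceMatcher

def linguistic_confidence(column_name: str, term_name: str) -> float:
    """Approximate linguistic name matching with a character-level similarity ratio."""
    return SequenceMatcher(None, column_name.lower(), term_name.lower()).ratio()

def class_based_confidence(data_class_confidence: float, linked: bool) -> float:
    """A term linked to the detected data class inherits that class's confidence."""
    return data_class_confidence if linked else 0.0

def ml_confidence(column_name: str, term_name: str) -> float:
    """Stub: a trained supervised model would return a learned score here."""
    return 0.0

def overall_confidence(column: str, term: str,
                       data_class_conf: float, linked: bool) -> float:
    """The overall confidence is the maximum over the individual methods."""
    return max(linguistic_confidence(column, term),
               class_based_confidence(data_class_conf, linked),
               ml_confidence(column, term))

# A column COL1 classified as an email address with 90% confidence,
# with the data class linked to the term E-mail Address:
print(overall_confidence("COL1", "E-mail Address", 0.90, True))  # 0.9
```

This illustrates why class-based assignment can succeed where linguistic matching cannot: COL1 and E-mail Address share no name similarity, yet the linked data class carries the 90% confidence through.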
Removing assignments
Removed terms are considered in automatic term assignment in IBM Cloud Pak for Data 4.5.3 or later.
When you review the assignments, you might find terms that are not accurate for a given data asset. You can remove such terms, thus providing negative feedback to the automatic term assignment methods. When you rerun automatic term assignment, the ML-based term assignment method then returns a negative confidence value for such terms. The individual confidence values returned by each term assignment method are adjusted by this negative confidence value when the overall confidence score of a term is calculated. See How the overall confidence score is calculated.
How the overall confidence is computed
A method that associates a term with a data asset computes a confidence, which is a numeric value between a configurable minimum and 1. The minimum is the suggestion threshold for term assignment, which is configured as a percentage that the term's confidence must exceed.
The confidence is computed differently depending on which release of IBM Cloud Pak for Data you are on.
- The confidence for an assigned or suggested term is shown as a percentage value. This value represents the overall confidence, which is the maximum of these values:
- The confidence value returned by linguistic name matching
- The confidence value returned by class-based assignment
- The confidence value returned by ML-based assignment
Assume that the methods return the following confidence values for a column ADDRESS and term Home Address:
- Linguistic name matching: 0.5
- Class-based assignment: 0.4
- ML-based assignment: 0.3
The overall confidence is 0.5 because it's the highest value returned by a method.
- The confidence for an assigned or suggested term is shown as a percentage value. This value represents the overall confidence, which is the maximum of these values:
- The confidence value returned by linguistic name matching, adjusted by any negative confidence value returned by ML-based assignment
- The confidence value returned by class-based assignment, adjusted by any negative confidence value returned by ML-based assignment
- The confidence value returned by ML-based assignment, adjusted by any negative confidence value returned by this method
Assume that the methods return the following confidence values for a column ADDRESS and term Home Address:
- Linguistic name matching: 0.5
- Class-based assignment: 0.4
- ML-based assignment: 0.3
- ML-based assignment for removed term: -0.4
The actual confidence value for each method is calculated by subtracting the confidence value returned for removed terms:
- Linguistic name matching: 0.5 - 0.4 = 0.1
- Class-based assignment: 0.4 - 0.4 = 0
- ML-based assignment: 0.3 - 0.4 = -0.1
The overall confidence is 0.1 because it’s the highest value calculated for a method.
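The worked example above can be reproduced in a few lines. This is a sketch of the calculation only; the function and variable names are illustrative, not product code.

```python
# Sketch of the overall-confidence calculation; names are illustrative.

def overall_confidence(method_scores, removal_score=0.0):
    """Adjust each method's score by the (negative) removal score,
    then take the maximum of the adjusted values."""
    adjusted = [score + removal_score for score in method_scores]
    return max(adjusted)

scores = [0.5, 0.4, 0.3]   # linguistic, class-based, ML-based
removal = -0.4             # ML-based score for the removed term

print(round(overall_confidence(scores, removal), 2))  # 0.1
print(overall_confidence(scores))                     # 0.5 (no removal feedback)
```

With no removal feedback the highest raw method score (0.5) wins; with the removal feedback applied, every method's score is pulled down and the overall confidence drops to 0.1.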
How new analysis results update existing term assignments
When you rerun an enrichment, a new analysis result updates term assignments as follows:
- Existing suggested terms are deleted and replaced with the new suggested terms.
- Existing automatic assignments are deleted and replaced with new automatic assignments.
- Existing removed terms and manual assignments are left untouched.
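The update rules above can be sketched as a merge of the old and new assignment states. The state labels and data structure here are hypothetical, chosen only to illustrate which assignments survive a rerun.

```python
# Hypothetical sketch of how a rerun merges term assignments.
# State labels ("manual", "removed", "suggested", "auto") are illustrative.

def merge_assignments(existing: dict, new: dict) -> dict:
    """existing and new map term name -> assignment state."""
    # Manual assignments and removed terms are left untouched.
    merged = {term: state for term, state in existing.items()
              if state in ("manual", "removed")}
    # Old suggestions and automatic assignments are replaced by the new ones,
    # but never override a manual assignment or a removal.
    for term, state in new.items():
        if state in ("suggested", "auto") and term not in merged:
            merged[term] = state
    return merged

existing = {"Home Address": "manual", "ZIP": "suggested",
            "City": "auto", "Street": "removed"}
new = {"Postal Code": "suggested", "City": "auto"}
print(merge_assignments(existing, new))
# {'Home Address': 'manual', 'Street': 'removed', 'Postal Code': 'suggested', 'City': 'auto'}
```

The old suggestion ZIP disappears because it is not in the new analysis result, while the manual and removed entries pass through unchanged.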
Publishing term assignments
When you publish the enrichment results, term assignments, whether manual or automatic, are available in the catalog and in all projects that contain a given data asset. Term suggestions are not published.
When you remove a published term assignment, all projects that contain the data asset are affected. While you work within the enrichment results, the changes are internal to the project. However, when you publish the changes, the term is removed from the asset in all projects it is contained in. Before you remove a published assignment, make sure that it wasn't added on purpose by other users.
Parent topic: Metadata enrichment results