Important:

IBM Cloud Pak® for Data Version 4.6 will reach end of support (EOS) on 31 July, 2025. For more information, see the Discontinuance of service announcement for IBM Cloud Pak for Data Version 4.X.

Upgrade to IBM Software Hub Version 5.1 before IBM Cloud Pak for Data Version 4.6 reaches end of support. For more information, see Upgrading IBM Software Hub in the IBM Software Hub Version 5.1 documentation.

Automatic term assignment (Watson Knowledge Catalog)

Automatic term assignment is the process of automatically mapping business terms to data assets and asset columns. Terms can automatically be assigned to assets and columns as a part of metadata enrichment, column analysis, and automated discovery.

For information about automatic term assignment in column analysis and automated discovery, see Automatic term assignment in column analysis and automated discovery.

You can also assign business terms manually by editing the data asset properties in a project or a catalog, or when you work with enrichment results.

If automatic term assignment is configured as part of metadata enrichment, such assignments are generated by several methods. These methods also generate suggestions for terms to assign.

The terms are assigned based on the confidence level. Initially, these associations are represented as candidates which domain experts and stewards can review and assign manually. The confidence for an assigned or suggested term is shown as a percentage value. This value represents the overall confidence. See How the overall confidence is computed. The confidence level for when a term is suggested or automatically assigned is determined by the project's enrichment settings. The default confidence level to be exceeded is 75% for term suggestions, and 90% for automatic assignment of candidate terms. See Default enrichment settings. A project administrator can customize these settings.
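
To illustrate how the two thresholds interact, the following Python sketch classifies a confidence value as ignored, suggested, or assigned. The function name and constants are hypothetical. They mirror the default values quoted above and are not part of any product API.

```python
# Hypothetical constants mirroring the default enrichment settings described above.
SUGGESTION_THRESHOLD = 0.75
ASSIGNMENT_THRESHOLD = 0.90


def classify_term(confidence, suggest_at=SUGGESTION_THRESHOLD, assign_at=ASSIGNMENT_THRESHOLD):
    """Return 'assigned', 'suggested', or 'ignored' for a confidence in [0, 1]."""
    # The thresholds must be exceeded, so both comparisons are strict.
    if confidence > assign_at:
        return "assigned"
    if confidence > suggest_at:
        return "suggested"
    return "ignored"
```

With the defaults, a term with a confidence of exactly 0.90 is only suggested, because the assignment threshold must be exceeded rather than met.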

Only published business terms can be assigned. Assigned terms do not affect data class assignment.

Term assignment methods

You can use all or a subset of the available term assignment methods.

Linguistic name matching

The linguistic name matching method bases its result on the similarity between the term name or abbreviations and the name of the data asset or column. For example, a column CREDNUM might be associated with a term Credit Card Number because of the similarity between the two names. Linguistic name matching matches only data asset and column names with term names and abbreviations. Descriptions are not considered. ML-based term assignment handles names and descriptions.

Based on data class assignment

The class-based assignment method generates assignments based on data classification. If a data class was selected for an asset column either as the result of column analysis or manually, and if this data class is linked to one or more business terms, these terms are suggested or assigned if they exceed the respective thresholds. The term confidence level is the same as the confidence of the data class that the term is linked with. For example, a column COL1 classified as an email address with 90% confidence is likely to be assigned to the term E-mail Address if the data class and term are linked. Because there is no linguistic similarity between the name of the column and the term, the linguistic name matching method is not capable of making this association.

The class-based assignment method depends on the links between data classes and business terms. Review these links before you run term assignment, because appropriate linkage is a prerequisite for high-quality results.

Business terms that are linked to the predefined data classes Code, Identifier, Date, Text, Indicator, Quantity, and Boolean are not considered for term assignment.
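
The class-based rules can be sketched in a few lines of Python. This is only an illustration of the behavior described above: the function name and the class_to_terms mapping are hypothetical and do not correspond to a product API.

```python
# Predefined data classes whose linked terms are not considered (from the text above).
EXCLUDED_DATA_CLASSES = {"Code", "Identifier", "Date", "Text", "Indicator", "Quantity", "Boolean"}


def class_based_candidates(data_class, class_confidence, class_to_terms):
    """Return (term, confidence) pairs derived from a column's data class.

    class_to_terms is a hypothetical mapping of data class name -> linked term names.
    """
    if data_class in EXCLUDED_DATA_CLASSES:
        return []
    # Each linked term inherits the confidence of the data class.
    return [(term, class_confidence) for term in class_to_terms.get(data_class, [])]
```

For example, class_based_candidates("Email Address", 0.9, {"Email Address": ["E-mail Address"]}) yields one candidate, the term E-mail Address with a confidence of 0.9. Whether that candidate is suggested or assigned still depends on the project's thresholds.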

Machine learning

The machine learning (ML) method for generating term assignments can use the built-in supervised machine learning models or work with a custom service that you build and maintain.

ML models are trained based on published terms and on term assignments present in the training data in a project or a catalog. See Training data for machine learning models. If no term assignments are available, the training for the term assignment model focuses on linguistic similarity of words in names and descriptions of terms and data assets or columns. Terms can be assigned based on that similarity. With an increasing number of reviewed assignments, terms can be assigned independent of linguistic similarity because term assignments on columns with similar characteristics become available.

Built-in models

Built-in models comprise a model for term assignments and one for term removals.

Custom service

The option to use a custom service is available only if Watson Machine Learning is deployed in your Cloud Pak for Data environment. Instructions and a sample notebook for building a custom model are provided in the https://github.com/IBM/wkc-term-assignment-samples GitHub repository.

Rejected terms

When you review term assignments, you might find terms that you think are not accurate for a data asset. You can remove such terms, which provides negative feedback. Such terms are considered rejected. Depending on the project settings, a built-in ML model can learn from these rejections. The model can then adjust the confidence scores of term assignments based on these rejected terms when you rerun automatic term assignment. The individual confidence values returned by each selected term assignment method are adjusted by this negative confidence value to calculate the overall confidence score of a term. See How the overall confidence is computed.

In IBM Cloud Pak for Data 4.6.0, 4.6.1, or 4.6.2, the option to consider term rejections when working with a custom service for ML-based term assignment must explicitly be enabled for your Cloud Pak for Data cluster. See Enable the use of the built-in ML model for negative term assignment with a custom ML model.

Training data for machine learning models

For each project, you can define whether the built-in ML models used for automatic term assignment are trained with assets from the project or with assets from a catalog of your choice.

For custom models, the model owner is responsible for training the model.

Built-in models

When models are trained with assets from a catalog, they are trained with any published business terms and any term assignments available in the selected catalog. When you decide to train the models within the project, the models are trained with any published business terms and any available term assignments or removals on columns that were marked as reviewed in the project.

In Cloud Pak for Data 4.6.0, the default setting is to train the model within the project.

In IBM Cloud Pak for Data 4.6.1 and later, the default setting is to train the models from the default catalog, but you can select any catalog to which you have access. If the default catalog doesn't exist, the training scope defaults to the project.

When are the models trained?

ML model training is triggered when a metadata enrichment job is started and one of these conditions is true:

No model is available yet.
A new business term was created or an existing term was updated since the model was last trained. The term does not have to be assigned to any assets or columns.
Models trained with project assets: At least 21 columns were marked as reviewed since the model was last trained.
Models trained with catalog assets: Assignments on at least 21 columns in the selected catalog changed because terms were assigned or rejected since the model was last trained.
The last training did not complete successfully or within a reasonable period of time.
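
The trigger conditions in this list can be read as a single boolean check that is evaluated when an enrichment job starts. The following Python sketch expresses that reading. The TrainingState fields and the function name are hypothetical, and how the product tracks these values internally is not documented here.

```python
from dataclasses import dataclass

# Number of reviewed or changed columns that triggers retraining, per the list above.
MIN_CHANGED_COLUMNS = 21


@dataclass
class TrainingState:
    """Hypothetical snapshot of what is known when an enrichment job starts."""
    model_exists: bool
    terms_changed_since_training: bool
    columns_changed_since_training: int  # reviewed (project) or reassigned (catalog) columns
    last_training_succeeded: bool


def training_needed(state: TrainingState) -> bool:
    """Return True if any of the documented trigger conditions holds."""
    return (
        not state.model_exists
        or state.terms_changed_since_training
        or state.columns_changed_since_training >= MIN_CHANGED_COLUMNS
        or not state.last_training_succeeded
    )
```

In this sketch, a state with an existing model, no term changes, 20 changed columns, and a successful last training does not trigger retraining. Changing the column count to 21 does.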

If no information about term rejections is available when the term rejection model is first used, the initial training of that model is deferred. The model is first trained in a later training cycle, when information about rejected terms is available.

How the overall confidence is computed

A method that associates a term with a data asset computes a confidence, which is a numeric value between a configurable minimum and 1. The minimum value is defined by the suggestion threshold for term assignment that can be configured in the default enrichment settings.

The confidence for an assigned or suggested term is shown as a percentage value. This value represents the overall confidence. The overall confidence is the maximum of the confidence values returned by the selected term assignment methods and might be adjusted by any negative confidence value returned by the ML model for term removals.

In IBM Cloud Pak for Data 4.6.0, 4.6.1, and 4.6.2, the confidence values returned by the built-in model for negative term assignment adjust the results of all term assignment methods. By default, this option is enabled only if the selected methods for automatic term assignment include the built-in ML models. A system administrator can change this setting so that confidence values are also adjusted when a custom model is used. See Enable the use of the built-in ML model for negative term assignment with a custom ML model.

In IBM Cloud Pak for Data 4.6.3 and later, you can choose whether the confidence values are adjusted. If this option is enabled, the confidence values returned by the built-in model for negative term assignment adjust the results of all selected term assignment methods. This includes ML-based term assignment with a custom model.

Example:

Assume that all methods are enabled and return the following confidence values for a column ADDRESS and the term Home Address:

Linguistic name matching: 0.5
Class-based assignment: 0.4
ML-based assignment: 0.3
ML model for rejections: -0.4

The adjusted confidence value for each method is calculated by adding the negative confidence value returned for rejected terms, which is the same as subtracting its magnitude:

Linguistic name matching: 0.5 - 0.4 = 0.1
Class-based assignment: 0.4 - 0.4 = 0
ML-based assignment: 0.3 - 0.4 = -0.1

The overall confidence is 0.1 because it’s the highest value calculated for a method.
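
The same arithmetic can be written as a minimal Python sketch. The function and the method names used as dictionary keys are hypothetical. The rounding step is added only to hide floating-point noise and is not something the source describes.

```python
def overall_confidence(method_scores, rejection_score=0.0):
    """Return the highest method confidence after applying the rejection adjustment.

    method_scores: non-empty mapping of method name -> confidence returned by that method.
    rejection_score: negative value returned by the model for term removals (0 if none).
    """
    # Adding the negative rejection score lowers every method's confidence equally.
    adjusted = {name: round(score + rejection_score, 4) for name, score in method_scores.items()}
    return max(adjusted.values())


scores = {"linguistic": 0.5, "class_based": 0.4, "ml": 0.3}
print(overall_confidence(scores, rejection_score=-0.4))  # prints 0.1
```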

If the same confidence value for a term is calculated for several methods, only one is automatically assigned. The order in which such a term is selected is as follows:

Term found by the class-based assignment method
Term found by the ML method
Term found by the linguistic name matching method
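
One way to read this ordering is as a tie-breaker that applies only when confidence values are equal. The following Python sketch implements that interpretation. The method keys and the function name are hypothetical, and the product's actual selection logic might differ in detail.

```python
# Lower number wins when confidence values are equal (order taken from the list above).
METHOD_PRIORITY = {"class_based": 0, "ml": 1, "name_matching": 2}


def pick_assignment(candidates):
    """Pick one (method, confidence) pair: highest confidence first, then method priority."""
    return min(candidates, key=lambda c: (-c[1], METHOD_PRIORITY[c[0]]))
```

Given three candidates that all have a confidence of 0.92, the sketch returns the class-based one. A name-matching candidate with a confidence of 0.95 still wins over a class-based candidate with 0.92, because the priority is consulted only on ties.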

How new analysis results update existing term assignments

When you rerun an enrichment, a new analysis result updates term assignments as follows:

In IBM Cloud Pak for Data 4.6.0, 4.6.1, and 4.6.2:

Existing suggested terms are deleted and replaced with the new suggested terms.
Existing automatic assignments are deleted and replaced with new automatic assignments.
Existing rejected terms and manual assignments are left untouched.

In IBM Cloud Pak for Data 4.6.3 and later:

Existing suggested terms are deleted and replaced with the new suggested terms.
Existing unreviewed automatic assignments are deleted.
Existing rejected terms, reviewed automatic assignments, and manual assignments are left untouched.
New automatic assignments are added.
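
The rules for 4.6.3 and later can be sketched as a filter followed by an append. The record layout in the following Python example (dictionaries with the keys term, kind, and reviewed) is a hypothetical simplification. The sketch does not address how a new assignment that duplicates a kept or rejected term is handled, because the text above does not specify it.

```python
def merge_assignments(existing, new_suggestions, new_auto_assignments):
    """Apply the 4.6.3-and-later update rules to a column's term records.

    Each record is a hypothetical dict with keys 'term', 'kind', and 'reviewed',
    where kind is one of 'suggested', 'auto', 'manual', or 'rejected'.
    """
    # Keep manual assignments, rejected terms, and reviewed automatic assignments.
    # Old suggestions and unreviewed automatic assignments are dropped.
    kept = [
        rec for rec in existing
        if rec["kind"] in ("manual", "rejected")
        or (rec["kind"] == "auto" and rec["reviewed"])
    ]
    # New suggestions replace the old ones, and new automatic assignments are added.
    return kept + list(new_suggestions) + list(new_auto_assignments)
```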
