Best practices for term assignment

You can follow these best practices to enhance the accuracy and efficiency of term assignment results in IBM Knowledge Catalog.

You can apply all or a subset of these term assignment methods in metadata enrichment:

Machine Learning-based term assignment
Data class-based term assignment
Name-matching term assignment
Gen AI-based term assignment (if the gen AI enrichment capabilities are enabled)
Rule-based term assignment
Custom Machine Learning-based term assignment

Machine Learning-based term assignment

Machine Learning-based (ML-based) term assignment uses a built-in model that is trained from published business terms and reviewed term assignments. The model learns from positive and negative examples and continuously improves based on user feedback.

When to use

Use ML-based term assignment when you want the system to learn patterns automatically from reviewed assignments and rejections. The model automatically adapts to changes in business vocabulary and asset metadata.

Best practices

Consider the following information.

Types of assignments in training data

The model for ML-based term assignment is trained on two types of assignments:

Positive examples

Assigning a business term to an asset or column helps the ML-based term assignment model learn how to best assign business terms in the future. Terms are assigned in these cases:
- The term assignment service automatically assigns a term if the confidence exceeds the confidence threshold that is defined in the metadata enrichment settings.
- You assign a term that was suggested by the term assignment service.
- You manually select and assign a correct term.

Negative examples

It is equally important to reject or remove a term assignment when it is wrong. Rejections and removals help the ML-based term assignment model avoid assigning similar incorrect terms in future runs. Remove a term by using the minus (-) symbol next to the term on the Governance tab in the side panel.
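As a rough illustration of how review actions translate into training signals, the following Python sketch maps actions to positive and negative examples. The action names, the tuple layout, and the function `to_training_examples` are hypothetical and do not reflect the internal data model of IBM Knowledge Catalog. The sketch also ignores the question of which assets are in the training scope.

```python
from typing import List, Tuple

# Hypothetical action labels for the cases described above.
POSITIVE_ACTIONS = {"auto_assigned", "suggestion_accepted", "manually_assigned"}
NEGATIVE_ACTIONS = {"rejected", "removed"}

def to_training_examples(
    events: List[Tuple[str, str, str]]
) -> List[Tuple[str, str, int]]:
    """Convert (column, term, action) events into (column, term, label) examples.

    Label 1 marks a positive example, label 0 a negative example.
    """
    examples = []
    for column, term, action in events:
        if action in POSITIVE_ACTIONS:
            examples.append((column, term, 1))
        elif action in NEGATIVE_ACTIONS:
            examples.append((column, term, 0))
        # Any other action carries no training signal in this sketch.
    return examples

events = [
    ("CUST_EMAIL", "Contact Email", "suggestion_accepted"),
    ("CUST_EMAIL", "Postal Address", "rejected"),
    ("ORDER_DT", "Order Date", "manually_assigned"),
]
for example in to_training_examples(events):
    print(example)
```

The point of the sketch is that a rejection is recorded as data in the same way as an acceptance, which is why leaving wrong assignments in place deprives the model of useful negative examples.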

Training scope – catalog (recommended)

The model for ML-based term assignment is trained on the assignments that are available in a catalog.

Create a dedicated training catalog to ensure consistent model learning and reduce noise.
Publish carefully reviewed assets to the training catalog for ML training. Avoid publishing assets from incomplete or experimental projects to the training catalog.

Training scope – project

The ML model can also be trained on the assignments in a project that is used for metadata enrichment. Only reviewed assets and columns contribute to training. Ensure that assets are marked as reviewed after their term assignments have been validated.

Continuous model improvement

ML models are retrained automatically during metadata enrichment when new terms or reviewed assignments are available.

A model that is trained on the assignments and rejections in a catalog is shared between all metadata enrichments that reference this catalog in the metadata enrichment settings under Select assets used for training built-in model and for adjustment. Training of that model can be triggered from any metadata enrichment that uses the training catalog’s model. As a result, if a user reruns a metadata enrichment in an older project that uses the training catalog’s model, ML-based term assignment might generate new suggestions or assignments based on the latest reviewed assignments and rejections in the training catalog.

To improve the model, regularly publish reviewed assets to the training catalog.

Troubleshooting

Retrieve model details from the metadata enrichment job logs:

Screen capture of a job log for metadata enrichment with details on the used term assignment model

If ML-based term assignment does not assign a term as expected, check whether the model that was used can already reflect the accepted or rejected assignment. In the job log, compare the time when the training finished with the time of your review, and check the values for Number of term assignments and Number of term rejects.

Data class-based term assignment

This method generates term assignments based on the results of data class assignment. If a data class is assigned to a column and that data class is linked to one or more business terms, those terms can be suggested or automatically assigned. The confidence level of the term assignment mirrors the confidence level of the data class assignment.

To work with this method:

Create and manage data classes
- Create data classes that accurately represent data types (for example, Email Address, Customer ID). For more information, see Data classes.
- Review and update data classes periodically to ensure that they reflect current business standards.
Link data classes to business terms
- Edit each data class and link it to one or more relevant business terms by adding the business terms in the Related artifacts section. Only published business terms can be linked and assigned.
- Review these linkages before you run metadata enrichment to ensure high-quality results.
Assign terms based on data classes

When metadata enrichment runs, term assignments are generated automatically if data classes and terms are properly linked. The confidence level of the term assignment mirrors the confidence of the data class.
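The way terms follow from data class assignments can be pictured with a minimal Python sketch. The mapping `DATA_CLASS_TERMS`, the function `terms_for_column`, and the sample links are hypothetical and only illustrate the idea; they are not part of any IBM Knowledge Catalog API.

```python
from typing import Dict, List, Tuple

# Hypothetical links between data classes and published business terms.
DATA_CLASS_TERMS: Dict[str, List[str]] = {
    "Email Address": ["Contact Email"],
    "Customer ID": ["Customer Identifier", "Account Holder"],
}

def terms_for_column(data_class: str, confidence: float) -> List[Tuple[str, float]]:
    """Return (term, confidence) pairs derived from a data class assignment.

    Each linked term inherits the confidence of the data class assignment.
    """
    return [(term, confidence) for term in DATA_CLASS_TERMS.get(data_class, [])]

# A data class with two linked terms yields two candidates at the same confidence.
print(terms_for_column("Customer ID", 0.9))

# A data class without linked terms yields no term candidates.
print(terms_for_column("Postal Code", 0.8))
```

The second call shows why reviewing the links before a run matters: a correctly detected data class contributes nothing to term assignment if no business term is linked to it.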

When to use

Use this method when consistent data classes exist and are properly linked to business terms. Because data classes can be assigned based on the actual data in columns, term assignments from this method can (but do not have to) reflect the column content.

Best practices

Keep data classes specific and domain-relevant.

Periodically audit linkages between data classes and business terms to minimize false positives.

Name-matching term assignment

The name-matching method bases its results on similarity between term names (or common abbreviations) and data asset or column names.

When to use

Use name matching when column or asset names follow consistent naming conventions and contain keywords similar to business terms. Name matching provides a baseline for term suggestions when other methods such as ML-based, data class-based, or gen AI-based term assignment lack training data, properly configured data classes, or context such as abbreviation files, display names, or descriptions.

Best practices

If name-matching term assignment assigns or suggests too few terms, lower the values for Assignment threshold and Suggestion threshold in the metadata enrichment settings. If it assigns or suggests too many terms, raise these values.
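The effect of the two thresholds can be pictured with a small Python sketch. The function `classify`, the sample confidence values, and the use of `>=` as the comparison are assumptions for illustration; the actual scoring and comparison are internal to the service.

```python
def classify(confidence: float,
             assignment_threshold: float,
             suggestion_threshold: float) -> str:
    """Decide what happens to a candidate term with the given confidence."""
    if confidence >= assignment_threshold:
        return "assigned"
    if confidence >= suggestion_threshold:
        return "suggested"
    return "none"

# Hypothetical candidates for one column with their confidence scores.
candidates = {"Invoice Date": 0.85, "Invoice Number": 0.62, "Due Date": 0.40}

# Compare a stricter and a more lenient pair of thresholds.
for assignment, suggestion in [(0.9, 0.7), (0.8, 0.6)]:
    print(f"Assignment threshold {assignment}, Suggestion threshold {suggestion}")
    for term, confidence in candidates.items():
        print(f"  {term}: {classify(confidence, assignment, suggestion)}")
```

With the stricter pair, only one term is suggested and none is assigned. With the more lenient pair, the same candidates produce one assignment and one suggestion.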

Troubleshooting

Common issues and limitations:

Limited context understanding: Descriptions or data values are not analyzed.
False positives: Terms might be assigned to assets or columns if they have similar names, even if the term obviously does not fit. For example, the term Invoice Date might be assigned to a column named Invoicing_Data.
No semantic matching: Terms are missed when the name similarity is low even though the meaning is related.
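Both the false-positive and the missed-match cases can be reproduced with a generic string similarity measure. The following Python sketch uses `difflib.SequenceMatcher` from the standard library as a stand-in; the similarity algorithm that the service actually uses is not described here, and the function `name_similarity` and the column name `customer_name` are hypothetical.

```python
from difflib import SequenceMatcher

def normalize(name: str) -> str:
    """Lowercase a name and treat underscores as spaces."""
    return name.lower().replace("_", " ").strip()

def name_similarity(term: str, column: str) -> float:
    """Return a character-based similarity between 0.0 and 1.0."""
    return SequenceMatcher(None, normalize(term), normalize(column)).ratio()

# Similar spelling, different meaning: a likely false positive.
print(f"{name_similarity('Invoice Date', 'Invoicing_Data'):.2f}")

# Related meaning, different spelling: a likely miss.
print(f"{name_similarity('Client', 'customer_name'):.2f}")
```

The first pair scores high although the term does not fit, and the second pair scores low although the meaning is related, which is the behavior that the limitations above describe.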

Gen AI-based term assignment

Gen AI-based term assignment is available if the gen AI enrichment capabilities are enabled for IBM Knowledge Catalog Standard or IBM Knowledge Catalog Premium in your Cloud Pak for Data deployment.

Gen AI-based term assignment uses a fine-tuned IBM Slate foundation model to semantically match business terms to assets and columns based on both names and descriptions.

When to use

Use this method to capture semantic relationships if metadata lacks direct keyword overlap or if you want to use AI for domain-specific term assignment.

Best practices

Ensure that the gen AI capabilities are properly enabled and configured in the environment. For more information, see Preparing to install IBM Knowledge Catalog in the IBM Software Hub documentation.

Rule-based term assignment

This method uses simple rules that are defined in a CSV file that is uploaded to the project.

When to use

This method is suitable if simple matching rules can be formulated based on metadata properties such as name or description. For example: If the column name contains ‘address’, then assign the term ‘Personal data’.
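Conceptually, such a rule is a simple condition on a metadata property. The following Python sketch illustrates the example rule; the `Rule` class and the `apply_rules` function are hypothetical, and the sketch does not reproduce the required CSV format, which is described in CSV file for term assignment based on rules.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass(frozen=True)
class Rule:
    field: str     # metadata property to inspect, for example "name" or "description"
    contains: str  # substring to look for (compared case-insensitively)
    term: str      # business term to assign when the rule matches

def apply_rules(column: Dict[str, str], rules: List[Rule]) -> List[str]:
    """Return the terms of all rules that match the column metadata."""
    terms: List[str] = []
    for rule in rules:
        value = (column.get(rule.field) or "").lower()
        if rule.contains.lower() in value and rule.term not in terms:
            terms.append(rule.term)
    return terms

# If the column name contains 'address', then assign the term 'Personal data'.
rules = [Rule(field="name", contains="address", term="Personal data")]

print(apply_rules({"name": "HOME_ADDRESS", "description": ""}, rules))
print(apply_rules({"name": "ORDER_TOTAL", "description": ""}, rules))
```

Only the first column matches the rule, so only that column would receive the term.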

Best practices

Create a CSV file with the required name and format and upload it to the project. For more information, see CSV file for term assignment based on rules.

The rule-based method serves as a lightweight alternative to custom machine learning models.

Custom Machine Learning-based term assignment

The custom Machine Learning (ML) term assignment method enables organizations to build and maintain their own term assignment model as a separate service. This option is available only if Watson Machine Learning is deployed in the Cloud Pak for Data environment.

A custom model can be developed to assign business terms to data assets and columns by using organization-specific logic or domain knowledge. It can operate independently or in combination with other term assignment methods.

When to use

Use this method if built-in models do not meet specialized domain or data requirements or if you need full control over training data, retraining frequency, or model architecture.

Best practices

Use the sample notebooks provided in the IBM Knowledge Catalog samples repository to understand how to configure and deploy a custom term assignment model.

Example notebooks: https://github.com/IBM/knowledge-catalog-samples/tree/main/metadata-enrichment/term-assignment/custom-term-assignment

Troubleshooting

Verify that Watson Machine Learning is properly deployed and accessible within the Cloud Pak for Data environment.

If a custom ML model is used and issues occur that extend beyond its integration with IBM Knowledge Catalog, responsibility for resolving those issues lies with the model owner.

Metadata expansion

If gen AI capabilities are enabled in your deployment, you can run metadata enrichment with the option to generate a display name and a description for assets and columns. For more information, see Designing metadata enrichment and the blog Enrich your metadata using generative AI technology powered by foundation models from IBM watsonx.

This additional metadata is used as follows:

Name-matching term assignment uses the generated display name for the name-similarity matching.
Gen AI-based term assignment uses the generated display name and any suggested description for term assignment.

Thus, using the Expand metadata option in metadata enrichment can further improve term assignment results when the name-matching and gen AI-based term assignment methods are used, individually or in combination.

Combining term assignment methods and fine-tuning results

In most cases, it’s beneficial to use multiple term assignment methods together because each method has its own strengths and weaknesses. For example, name matching can fail if metadata doesn’t contain direct keyword overlaps, but gen AI-based term assignment can handle such cases effectively. However, gen AI-based term assignment might not be able to make an assignment if it cannot establish a connection because it doesn’t know what a column name or abbreviation stands for. In such situations, a trained ML model might be able to make the connection based on learned patterns from previous assignments.
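The benefit of complementary methods can be pictured with a simplified Python sketch that merges the candidates of several methods and keeps the highest confidence per term. This merge strategy, the function `merge_candidates`, and all sample values are assumptions for illustration only; how the service actually combines and weights the results of the individual methods is not reproduced here.

```python
from typing import Dict

def merge_candidates(*method_results: Dict[str, float]) -> Dict[str, float]:
    """Merge term candidates from several methods, keeping the highest confidence."""
    merged: Dict[str, float] = {}
    for result in method_results:
        for term, confidence in result.items():
            merged[term] = max(confidence, merged.get(term, 0.0))
    return merged

# Hypothetical results for one column from three methods.
name_matching = {"Customer Name": 0.55}                # keyword overlap only
gen_ai = {"Date of Birth": 0.70}                       # semantic match
ml_based = {"Date of Birth": 0.82, "Customer": 0.40}   # learned from past reviews

print(merge_candidates(name_matching, gen_ai, ml_based))
```

In this sketch, a term that a single method would have missed still appears in the combined result, and a term that several methods agree on keeps its strongest score.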

Additionally, you can apply fine-tuning options for term assignment that work best when multiple methods are used together. For more information and best practices on adjusting these parameters, see Tuning options for term assignment.