AI augmentations in curation tools

IBM Knowledge Catalog uses large language models (LLMs) to provide AI-enhanced capabilities that help you with your data governance, data quality, and data product tasks.

Where gen AI capabilities are available

Depending on how IBM Knowledge Catalog is configured, generative AI capabilities are available in one or more product areas. See Setup requirements.

Metadata enrichment

In metadata enrichment, you can automatically generate metadata for data assets:

Descriptive table and column names based on content and structure
Comprehensive explanations for tables and columns
Business terminology from data assets
Relationships between business terms and data assets

Text-to-SQL

With the Text-to-SQL service, you can generate SQL queries from natural language:

Create query-based data assets and data products. For more information, see Creating data assets by using SQL queries.
Define data quality rules. For more information, see Creating SQL-based data quality rules.
Search data by using natural language, for example, in document libraries.

Important:

Starting in IBM Software Hub 5.4, a new multilingual embedding model is used for Text-to-SQL functionality. To continue using Text-to-SQL in deployments that are upgraded to IBM Software Hub 5.4, project administrators must update existing projects, metadata enrichments, and unstructured data curation flows by reprocessing the data with the new model.

For more information, see Updating projects and metadata with a new embedding model.

Multilingual support

You can create queries against data assets and document libraries in natural languages other than English. Supported natural languages: English, German, Spanish, French, Japanese, Portuguese, Arabic, Czech, Italian, Korean, Dutch, and Chinese.

If the asset metadata is in the same natural language as the query that you submit, results will be more accurate.

AI-powered search

LLM-based semantic search finds assets and artifacts across workspaces by using natural language. For more information, see Working with the AI search.

Semantic embeddings

Vector embeddings for metadata can be computed for use in other services, such as the Text-to-SQL service or watsonx BI. The embeddings are stored in a vector database, such as Elasticsearch or OpenSearch, to enable similarity search and schema matching at query time. They support the following capabilities:

Conversion of natural language to SQL, for example, to create SQL-based data quality rules or data products
Improved search relevance
Improved context understanding
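
The similarity search described above can be illustrated with a minimal sketch. It is not the product's implementation: the three-dimensional vectors, the column names, and the rank_columns helper are hypothetical and chosen only to show how a query vector can be matched against stored schema vectors by cosine similarity. Real embeddings come from an embedding model and have far more dimensions.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction, so treat it as unrelated
    return dot / (norm_a * norm_b)

def rank_columns(query_vector, column_vectors, top_k=2):
    """Hypothetical helper: return the top_k (name, score) pairs, best match first."""
    scored = [
        (name, cosine_similarity(query_vector, vector))
        for name, vector in column_vectors.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Hand-made toy embeddings (assumption): real vectors are produced by a model.
COLUMN_VECTORS = {
    "CUSTOMER.EMAIL": [0.9, 0.1, 0.0],
    "CUSTOMER.CITY": [0.1, 0.9, 0.1],
    "ORDERS.TOTAL_AMOUNT": [0.0, 0.2, 0.9],
}

# Stands in for the embedding of a question about customer contact details.
QUERY_VECTOR = [0.8, 0.2, 0.1]

matches = rank_columns(QUERY_VECTOR, COLUMN_VECTORS)
for name, score in matches:
    print(f"{name}: {score:.3f}")
```

In this toy data set, CUSTOMER.EMAIL ranks first because its vector points in nearly the same direction as the query vector. A vector database performs the same kind of comparison at scale, typically with approximate nearest-neighbor indexes.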

Setup requirements

The following settings determine the scope of generative AI capabilities that are available in IBM Knowledge Catalog.

Service

The IBM Knowledge Catalog Premium Cartridge must be installed.

Generative AI capabilities in general must be enabled for IBM Knowledge Catalog Premium. For some of the capabilities, extra deployment requirements apply. For more information, see Preparing to install IBM Knowledge Catalog in the IBM Software Hub documentation.
If the service is configured to work with models on a local or remote watsonx.ai instance, the connection to that instance must be configured and the appropriate models must be defined. For more information, see Connecting to a watsonx.ai instance.
Generative AI features must be enabled in a project for Text-to-SQL capabilities and for AI-based metadata enrichment:

Data intelligence tools settings

By default, projects are enabled for use of generative AI capabilities.

Natural language queries are disabled by default and must be enabled for the Text-to-SQL service to work. An onboarding job creates initial schema vectors for the data in your project. These vectors are extended as you add data assets and metadata to your project.

See Data intelligence tools settings.

Data quality settings

For generating rule explanations, the Explain data quality rules with AI option must be on.

See Data quality settings.

Metadata enrichment settings

AI-based name generation
Description generation
AI-based business-term assignment

See Metadata enrichment settings.

Certified foundation models

The following models are certified for use with the generative AI capabilities in IBM Knowledge Catalog.

Metadata enrichment

meta-llama/llama-3-3-70b-instruct

openai/gpt-oss-120b

ibm/granite-4-h-small

ibm/granite-3-3-8b-instruct

ibm/granite-8b-code-instruct

meta-llama/llama-3-3-70b-instruct

openai/gpt-oss-120b

meta-llama/llama-4-maverick-17b-128e-instruct-fp8

Models that are identified as certified models have undergone evaluation with the generative AI capabilities in IBM Knowledge Catalog. Models that are not certified are not guaranteed to work as expected, and their accuracy and performance can vary.

Consider your requirements regarding accuracy, cost, and availability, and any other factors such as supported languages when you choose the models that you want to use.

For more information, see:

Supported foundation models in watsonx.ai (watsonx.ai on IBM Software Hub)
Supported foundation models in watsonx.ai (watsonx.ai on IBM Cloud and AWS)
Billing details for generative AI assets in Watson Machine Learning (watsonx.ai on IBM Cloud and AWS)

Accuracy considerations

Due to the nature of generative models, you might see variations in inferred content such as AI-generated names, descriptions, and terms in metadata enrichment, AI-generated SQL statements, or chat responses from AI agents.

Such variations can have several causes:

Use of different models, for example, at different times
Natural variations for the same input to the same model in repetitive runs
Intentional variation, for example, for creative brainstorming

Variations are expected and should not be considered as defects or problems.
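
Why the same input can yield different output is easier to see in a toy example. The following sketch is an illustration only and does not reflect how the certified models are configured: the candidate names, their scores, and the pick_name helper are hypothetical. It contrasts always taking the highest-scoring candidate with sampling from a softmax distribution, which is one common source of run-to-run variation in generative models.

```python
import math
import random

def softmax(scores, temperature):
    """Convert raw scores to probabilities; a higher temperature flattens them."""
    scaled = [score / temperature for score in scores]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]

def pick_name(candidates, scores, temperature, rng):
    """Hypothetical helper: pick one candidate name.

    A temperature of 0 means greedy selection (always the top score).
    Any positive temperature samples according to the softmax probabilities.
    """
    if temperature == 0:
        return candidates[scores.index(max(scores))]
    weights = softmax(scores, temperature)
    return rng.choices(candidates, weights=weights, k=1)[0]

# Made-up candidates and scores for a column that holds email addresses.
CANDIDATES = ["Customer email", "Email address", "Contact email"]
SCORES = [2.0, 1.8, 1.5]

rng = random.Random()  # unseeded on purpose, so repeated runs can differ
greedy_runs = {pick_name(CANDIDATES, SCORES, 0, rng) for _ in range(20)}
sampled_runs = [pick_name(CANDIDATES, SCORES, 1.0, rng) for _ in range(20)]

print("Greedy results:", greedy_runs)
print("Sampled results:", sampled_runs)
```

Greedy selection returns the same name in every run, while sampling usually returns a mix of the candidates. All three names are reasonable for the column, which is why such variation is expected rather than a defect.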

How is data handled if models run on a cloud platform?

Prompt engineering is used to interact with the models. For inferencing, only metadata is sent to the models in the prompts. In general, the models evaluate that metadata and follow the instructions in the prompt to generate the results. The only case where additional data is used is when users choose to send sample values to the enrichment model. These sample values are also sent as part of the prompt. The models do not retain any data, and none of that data is used to train the models.
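
The distinction between a metadata-only prompt and a prompt that also carries sample values can be sketched as follows. This is a hypothetical illustration, not the prompt format that IBM Knowledge Catalog uses: the build_enrichment_prompt function, the instruction text, and the table and column names are all assumptions.

```python
def build_enrichment_prompt(table, columns, sample_values=None):
    """Hypothetical helper: assemble a prompt from table and column metadata.

    sample_values is an optional mapping of column name to a list of values.
    When it is omitted, no data values appear in the prompt at all.
    """
    lines = [
        "Suggest a descriptive business name for each column.",
        f"Table: {table}",
    ]
    for column in columns:
        line = f"- {column['name']} ({column['type']})"
        if sample_values and column["name"] in sample_values:
            samples = ", ".join(str(v) for v in sample_values[column["name"]])
            line += f" samples: {samples}"
        lines.append(line)
    return "\n".join(lines)

# Made-up metadata for illustration.
COLUMNS = [
    {"name": "CUST_EML", "type": "VARCHAR"},
    {"name": "CUST_CTY", "type": "VARCHAR"},
]

# Default case: only metadata (names and types) is placed in the prompt.
metadata_only = build_enrichment_prompt("CUSTOMER", COLUMNS)

# Opt-in case: the user chose to include sample values for one column.
with_samples = build_enrichment_prompt(
    "CUSTOMER", COLUMNS, sample_values={"CUST_CTY": ["Berlin", "Lyon"]}
)

print(metadata_only)
print()
print(with_samples)
```

In the first prompt, the model sees only names and data types. In the second, the values Berlin and Lyon are included because sample values were explicitly provided for that column.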

To use models on a cloud platform, the deployment must be set up to work with models on a remote watsonx.ai instance and a connection to the remote instance must be configured. If you have GPUs in your deployment, you can configure IBM Knowledge Catalog to work with the models in a local inference foundation models component.