Glossary

This glossary provides terms and definitions for watsonx.data.

The following cross-references are used in this glossary:

  • See refers you from a nonpreferred term to the preferred term or from an abbreviation to the spelled-out form.
  • See also refers you to a related or contrasting term.

A B C D E F G H I J L M N O P Q R S T U V

A

active metadata

Metadata that is automatically updated based on analysis by machine learning processes. For example, profiling and data quality analysis automatically update metadata for data assets.

active runtime

An instance of an environment that is running to provide compute resources to assets that run code.

agent

An algorithm or a program that interacts with an environment to learn optimal actions or decisions, typically using reinforcement learning, to achieve a specific goal.

agentic AI

A generative AI flow that can decompose a prompt into multiple tasks, assign tasks to appropriate gen AI agents, and synthesize an answer without human intervention.

AI

See artificial intelligence.

artificial intelligence (AI)

The capability to acquire, process, create, and apply knowledge in the form of a model to make predictions, recommendations, or decisions.

asset

  • An item that contains information about data, other valuable information, or code that works with data. See also data asset.
  • An item in a project or catalog that contains metadata about data or data analysis.

B

business term

A word or phrase that defines a business concept in a standard way for an enterprise. Terms can be used to enrich the metadata of data assets and to define the criteria of data protection rules.

business vocabulary

The set of governance artifacts, such as business terms and data classes, that describe and enrich data assets.

C

catalog

Watsonx.data Platform UI: A repository of assets for an organization to share. Assets in catalogs can be governed by data protection rules and enriched by other governance artifacts, such as classifications, data classes, and business terms. Catalogs can store structured and unstructured data, references to data in external data sources, and other analytical assets, like machine learning models.

Watsonx.data Console UI: A metadata management system that organizes and tracks information about databases, tables, schemas, and partitions in a data lakehouse. Catalogs enable query engines to access and share data consistently. Supported catalog types include Apache Iceberg, Apache Hive, Apache Hudi, and Delta Lake.

category

For data governance, a collaborative workspace for organizing and managing governance artifacts.

classification

For data governance, a governance artifact that describes the sensitivity level of the data in a data asset.

collaborator

A member of a group of people who are working together toward a common goal.

Common Policy Gateway (CPG)

A service that makes or delegates governance decisions on a per-request basis. CPG provides a unified interface for applications to obtain access control and governance approvals, either through built-in policies or by delegating to external policy engines.

connected data

A data set that is accessed through a connection to an external data source.

connection

  • A set of properties that define how to connect to and access a remote system (typically a data source).
  • The information required to connect to a database. The actual information that is required varies according to the DBMS and connection method.

coordinator

A server type in a Presto installation that parses SQL statements, plans queries, manages worker nodes, and returns query results to clients. The coordinator is the primary interface for client connections and orchestrates distributed query execution across worker nodes.

curate

To create a data asset and prepare it to be published in a catalog. Curation can include enriching the data asset by assigning governance artifacts such as business terms, classifications, and data classes, and analyzing the quality of the data in the data asset.

D

Data Access Service (DAS)

A component that provides unified access to object storage while governing external engines and auditing data access. DAS generates authentication signatures without exposing credentials to external systems.

data asset

An asset that points to data, for example, to an uploaded file. Connections and connected data assets are also considered data assets. See also asset.

data class

A governance artifact that categorizes columns in relational data sets according to the type of the data and how the data is used.

data governance

The process of tracking and controlling data to maintain data quality, data security, and compliance.

data integration

The combination of technical and business processes that are used to combine data from disparate sources into meaningful and valuable information.

data lake

A large-scale data storage repository that stores raw data in any format in a flat architecture. Data lakes hold structured and unstructured data as well as binary data for the purpose of processing and analysis.

data lakehouse

A unified data storage and processing architecture that combines the flexibility of a data lake with the structured querying and performance optimizations of a data warehouse, enabling scalable and efficient data analysis for AI and analytics applications.

data model

A visualization of data elements, their relationships, and their attributes.

data pipeline

A series of data processing and transformation steps.

data privacy

The protection of data from unauthorized access and inappropriate use.

data product

A collection of optimized data or data-related assets that are packaged for reuse and distribution with controlled access. Data products contain data as well as models, dashboards, and other computational asset types. Unlike data assets in governance catalogs, data products are managed as products with multiple purposes to provide business value.

data protection rule

A governance artifact that specifies what data to control and how to control it. A data protection rule contains criteria and an action. See also rule.

data quality analysis

The analysis of data against the quality dimensions accuracy, completeness, consistency, timeliness, uniqueness, and validity.

data quality definition

A description of a rule evaluation or condition that can be used in data quality rules.

data quality rule

A rule that, during data quality analysis, assesses whether data meets specific conditions and identifies records that do not meet the conditions as rule violations. See also rule.
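
As a minimal sketch of how a data quality rule separates passing records from rule violations, the following Python example checks each record against one condition. The record fields, the sample data, and the `find_violations` and `email_is_valid` helpers are hypothetical illustrations and are not part of any watsonx.data API.

```python
from typing import Callable

Record = dict[str, object]


def find_violations(
    records: list[Record], condition: Callable[[Record], bool]
) -> list[Record]:
    """Return the records that do not meet the rule condition."""
    return [record for record in records if not condition(record)]


# Hypothetical rule: "email" must be a non-empty string that contains "@".
def email_is_valid(record: Record) -> bool:
    email = record.get("email")
    return isinstance(email, str) and "@" in email


records = [
    {"id": 1, "email": "ana@example.com"},
    {"id": 2, "email": ""},
    {"id": 3, "email": None},
]

violations = find_violations(records, email_is_valid)
violation_ids = [record["id"] for record in violations]
print(violation_ids)
```

Records 2 and 3 fail the condition, so they are reported as rule violations.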

data science

The analysis and visualization of structured and unstructured data to discover insights and knowledge.

dataset

A collection of data that can be unstructured, or structured in the form of rows (records) and columns (fields), and contained in a file or database table.

data source

A repository, queue, or feed for reading data, such as a database.

data table

A collection of data, usually in the form of rows (records) and columns (fields) and contained in a table.

data warehouse

A large, centralized repository of data collected from various sources that is used for reporting and data analysis. It primarily stores structured and semi-structured data, enabling businesses to make informed decisions.

Delta Lake

An open-source storage layer that provides ACID transactions, schema enforcement, and time travel capabilities for data lakes. In watsonx.data, Delta Lake is supported as a catalog type for managing metadata and enabling reliable data processing on cloud storage.

E

embedding

A numerical representation of a unit of information, such as a word or a sentence, as a vector of real-valued numbers. Embeddings are learned, low-dimensional representations of higher-dimensional data.

endpoint URL

A network destination address that identifies resources, such as services and objects. For example, an endpoint URL is used to identify the location of a model or function deployment when a user sends payload data to the deployment.

engine

A computational component that processes queries and workloads. Engines are optimized for different use cases such as SQL analytics, large-scale data processing, or vector similarity search.

environment

The compute resources for running jobs.

environment runtime

An instantiation of the environment template to run assets.

environment template

A definition that specifies hardware and software resources to instantiate environment runtimes.

F

flow

A collection of nodes that define a set of steps for processing data or training a model.

foundation model

An AI model that can be adapted to a wide range of downstream tasks. Foundation models are typically large-scale generative models that are trained on unlabeled data using self-supervision. As large-scale models, foundation models can include billions of parameters.

G

gen AI

See generative AI.

generative AI (gen AI)

A class of AI algorithms that can produce various types of content including text, source code, imagery, audio, and synthetic data.

governance artifact

A governance item that enriches or controls data assets. Governance artifacts include business terms, classifications, data classes, policies, rules, and reference data sets.

governance rule

A governance artifact that provides a natural-language description of the criteria that are used to determine whether data assets are compliant with business objectives. See also rule.

governance workflow

A task-based process to control the creating, modifying, and deleting of governance artifacts.

governed catalog

A catalog that has enforcement of data protection rules enabled.

graphical builder

A tool for creating flow assets by coding visually. A canvas is an area on which to place objects or nodes that can be connected to create a flow.

H

hallucination

A response from a foundation model that includes off-topic, repetitive, incorrect, or fabricated content. Hallucinations that involve fabricated details can happen when a model is prompted to generate text but doesn't have enough related text to draw upon to generate a result that contains the correct details.

Hive Metastore (HMS)

A service that stores metadata related to databases, tables, schemas, and partitions in a backend relational database. HMS provides a Thrift interface for engines such as Presto and Spark to access metadata, enabling them to query data stored in object storage as if it were organized in traditional database tables.

I

image

A software package that contains a set of libraries.

inferencing

The process of running live data through a trained AI model to make a prediction or solve a task.

ingest

To import and load data from an external data source into the data lakehouse.

insight

An accurate or deep understanding of something. Insights are derived using cognitive analytics to provide current snapshots and predictions of customer behaviors and attitudes.

intelligent AI

Artificial intelligence systems that can understand, learn, adapt, and implement knowledge, demonstrating abilities like decision-making, problem-solving, and understanding complex concepts, much like human intelligence.

intent

A purpose or goal expressed by customer input to a chatbot, such as answering a question or processing a bill payment.

J

job

A separately executable unit of work.

L

lineage

  • The history of the flow of data through assets.
  • The history of the events performed on an asset.

logical model

A logical representation of data objects that are related to a business domain.

M

mask

To replace sensitive data values in a column of a data set. Masking methods vary in data utility and privacy from providing similarly formatted replacement values that retain referential integrity to providing the same replacement value for the entire column.

metadata import

A method of importing metadata that is associated with data assets, including process metadata that describes the lineage of data assets and technical metadata that describes the structure of data assets.

Metadata Service (MDS)

A centralized metadata repository that manages and stores metadata for tables, databases, partitions, and other objects.

metastore

A centralized repository that stores metadata about data assets that are ingested into the lakehouse.

Model Context Protocol (MCP)

A standardization layer for AI applications to communicate effectively with external services such as tools, databases, and predefined templates.

multimodal model

A generative AI model that can process multiple types of data, such as text, images, and audio, and convert between them. For example, a multimodal model can take text input and generate image output.

N

natural language

A modeling syntax that resembles natural human language (in English) to formulate models.

natural language processing library

A library that provides basic natural language processing functions for syntax analysis and out-of-the-box pre-trained models for a wide variety of text processing tasks.

node

The graphical representation of a data operation in a stream or flow. Different types of nodes have different shapes to indicate the type of operation that they perform.

notebook

An interactive document that contains executable code, descriptive text for that code, and the results of any code that is run.

notebook kernel

The part of the notebook editor that executes code and returns the computational results.

O

object storage

A method of storing data, typically used in the cloud, in which data is stored as discrete units, or objects, in a storage pool or repository that does not use a file hierarchy but that stores all objects at the same level.

ontology

An explicit formal specification of the representation of the objects, concepts, and other entities that can exist in some area of interest and the relationships among them.

orchestration

The process of creating an end-to-end flow that can train, run, deploy, test, and evaluate a machine learning model, and uses automation to coordinate the system, often using microservices.

P

pair review

A process during which a data steward user compares records to determine whether they are a match. Pair review results train a matching algorithm to decide which records get matched into master data entities.

physical model

A definition of the physical structures and relationships of data.

placeholder

A field or variable to be replaced with a value.

policy

  • A set of rules that protect data by controlling access to data assets or anonymizing sensitive data within data assets.
  • A governance artifact that consists of one or more data protection and governance rules.

primary category

For data governance, the category that contains the governance artifact. A category is similar to a folder or directory that organizes a user's governance artifacts.

privacy

Assurance that information about an individual is protected from unauthorized access and inappropriate use.

profile

The generated metadata and statistics about the textual content of data.

project

A collaborative workspace for working with data and other assets.

prompt

  • Data, such as text or an image, that prepares, instructs, or conditions a foundation model's output.
  • A component of an action that indicates that user input is required for a field before making a transition to an output screen.

prompt engineering

The process of designing natural language prompts for a language model to perform a specific task.

prompting

The process of providing input to a foundation model to induce it to produce output.

publish

To copy an asset into a catalog.

Python

A programming language that is used in data science and AI.

Python function

A function that contains Python code to support a model in production.

Q

quality rule

One or more conditions required for a data record to meet quality standards. During data quality analysis, data records are checked against these conditions. See also rule.

R

R

An extensible scripting language, used in data science and AI, that offers a wide variety of analytic, statistical, and graphical functions and techniques.

read

To copy data into an application to manipulate or analyze it.

redact

To replace all data values in a column with the same string to hide sensitive values, data format, and any relationships between values. A form of masking.
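
To contrast redaction with a masking method that keeps relationships between values, here is a minimal Python sketch. The `redact` and `pseudonymize` helpers, the replacement string, the salt, and the sample column are all hypothetical; the hash-based approach is only one possible masking technique and is not necessarily how watsonx.data masks data.

```python
import hashlib


def redact(values: list[str], replacement: str = "XXXXXXXXXX") -> list[str]:
    """Replace every value in the column with the same string."""
    return [replacement for _ in values]


def pseudonymize(values: list[str], salt: str = "demo-salt") -> list[str]:
    """Replace each value with a deterministic token.

    Equal inputs map to equal tokens, so joins on the column still work,
    but the original format of the value is not preserved.
    """
    return [
        hashlib.sha256((salt + value).encode("utf-8")).hexdigest()[:10]
        for value in values
    ]


column = ["111-22-3333", "444-55-6666", "111-22-3333"]

redacted = redact(column)      # every value becomes the same string
masked = pseudonymize(column)  # first and third tokens are identical
```

Redaction hides the values, their format, and the fact that the first and third rows match. The deterministic token hides the values but still reveals that those two rows match.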

reference data set

A governance artifact that defines values for specific types of columns.

resource group

A configuration mechanism to manage query execution and resource allocation on Presto clusters. Resource groups enable administrators to set limits on CPU time, memory usage, and concurrent queries, and support hierarchical allocation through subgroups for multi-tenant environments.

retrieval augmented generation (RAG)

A technique in which a large language model is augmented with knowledge from external sources to generate text. In the retrieval step, relevant documents from an external source are identified from the user’s query. In the generation step, portions of those documents are included in the LLM prompt to generate a response grounded in the retrieved documents.
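
The retrieval and generation steps can be sketched in Python as follows. This is a toy illustration under stated assumptions: retrieval ranks documents by word overlap instead of using embeddings, the `retrieve` and `build_prompt` helpers and the sample documents are hypothetical, and the call to a large language model is omitted, so the example stops at the grounded prompt.

```python
import re


def words(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, documents: list[str], top_k: int = 1) -> list[str]:
    """Retrieval step: rank documents by the number of words shared with the query."""
    query_words = words(query)
    ranked = sorted(
        documents, key=lambda doc: len(query_words & words(doc)), reverse=True
    )
    return ranked[:top_k]


def build_prompt(query: str, passages: list[str]) -> str:
    """Generation step input: include the retrieved passages in the prompt."""
    context = "\n".join(f"- {passage}" for passage in passages)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"


documents = [
    "A data lakehouse combines a data lake with a data warehouse.",
    "A notebook kernel executes code and returns results.",
    "Object storage stores data as discrete objects.",
]

query = "What does a notebook kernel do?"
passages = retrieve(query, documents)
prompt = build_prompt(query, passages)
print(prompt)
```

In a real system, the prompt would then be sent to the model, and retrieval would typically search a vector store instead of counting shared words.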

rule

An artifact that contains information, criteria, or logic to analyze or protect data. See also data protection rule, data quality rule, governance rule, quality rule.

runtime environment

The predefined or custom hardware and software configuration that is used to run tools or jobs, such as notebooks.

S

schema evolution

The capability to modify table schemas over time without rewriting existing data.

script

A file that contains Python or R code to support a model in production.

secondary category

An optional category that references the governance artifact.

semantic data model

A model that describes the structure of data through its relationships and calculations, and provides meaning and context for the data. Metrics are defined in semantic data models.

semantic search

A keyword search that incorporates linguistic and contextual analysis. In a semantic search, the intent of the query is specified using one or more specifiers. For example, it is possible to specify a person named "Bush" and such a query would then not return results about the kind of bushes that grow in a garden but rather just persons named Bush.

sensitive data

Data that contains information that should be protected from unauthorized access or disclosure. Categories of sensitive data can be protected health information, personally identifiable information, trade secrets, or financial results.

service instance

A runtime occurrence of an installed service.

Spark engine

A native Apache Spark engine for processing large-scale data, transforming datasets, and running analytical workloads.

structured data

Data that resides in fixed fields within a record or file. Relational databases and spreadsheets are examples of structured data. See also unstructured data.

T

text classification

A model that automatically identifies and classifies text into specified categories.

time series

A set of values of a variable at periodic points in time.

time travel

A feature that enables querying historical versions of data at specific points in time for auditing, debugging, or data recovery.

token

A discrete unit of meaning or analysis in a text, such as a word or subword.

tokenization

The process used in natural language processing to split a string of text into smaller units, such as words or subwords.
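
As a minimal sketch, the following Python example splits text into word and punctuation tokens with a regular expression. The `tokenize` helper is hypothetical and word-level only; tokenizers for foundation models typically use learned subword vocabularies instead.

```python
import re


def tokenize(text: str) -> list[str]:
    """Split text into lowercase word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


tokens = tokenize("Presto parses SQL statements, then plans queries.")
print(tokens)
# ['presto', 'parses', 'sql', 'statements', ',', 'then', 'plans', 'queries', '.']
```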

U

unstructured data

Any data that is stored in an unstructured format rather than in fixed fields. Data in a word processing document is an example of unstructured data. See also structured data.

V

vector

A one-dimensional, ordered list of numbers, such as [1, 2, 5] or [0.7, 0.2, -1.0].

vector database

See vector store.

vector index

An index that retrieves the vectorized embeddings of documents from a vector store.

vector store

A repository that stores vectorized embeddings of documents.
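
To make the relationship between vectors, a vector store, and similarity lookup concrete, here is a minimal in-memory sketch in Python. The `InMemoryVectorStore` class, the document IDs, and the sample vectors are hypothetical, and the lookup is a linear scan by cosine similarity that assumes non-zero vectors of equal length. It is an illustration of the concept, not the API of any vector database.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


class InMemoryVectorStore:
    """Stores (document ID, embedding) pairs and finds the most similar ones."""

    def __init__(self) -> None:
        self._items: list[tuple[str, list[float]]] = []

    def add(self, doc_id: str, embedding: list[float]) -> None:
        self._items.append((doc_id, embedding))

    def search(self, query: list[float], top_k: int = 1) -> list[str]:
        # Linear scan: score every stored embedding against the query.
        ranked = sorted(
            self._items,
            key=lambda item: cosine_similarity(query, item[1]),
            reverse=True,
        )
        return [doc_id for doc_id, _ in ranked[:top_k]]


store = InMemoryVectorStore()
store.add("doc-a", [0.7, 0.2, -1.0])
store.add("doc-b", [1.0, 2.0, 5.0])

nearest = store.search([0.9, 1.8, 5.2])
print(nearest)  # ['doc-b']
```

A linear scan is adequate for a handful of vectors. Large stores generally rely on a vector index so that a query does not have to be compared with every stored embedding.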

visualization

A graph, chart, plot, table, map, or any other visual representation of data.