Glossary

This glossary provides terms and definitions for Cloud Pak for Data as a Service.

A | B | C | D | E | F | G | H | I | J | K | L | M | N | O | P | Q | R | S | T | U | V | W

A

accelerator

In high-performance computing, a specialized circuit that is used to take some of the computational load from the CPU, increasing the efficiency of the system. For example, in deep learning, GPU-accelerated computing is often employed to offload part of the compute workload to a GPU while the main application runs off the CPU. See also graphics processing unit.

accountability

The expectation that organizations or individuals will ensure the proper functioning, throughout their lifecycle, of the AI systems that they design, develop, operate or deploy, in accordance with their roles and applicable regulatory frameworks. This includes determining who is responsible for an AI mistake, which may require legal experts to determine liability on a case-by-case basis.

activation function

A function defining a neural unit's output given a set of incoming activations from other neurons.
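
As a minimal sketch of this definition, the following Python computes one unit's output from incoming activations. The two functions shown (sigmoid and ReLU) are common choices picked here for illustration; the input values, weights, and bias are made up, and the helper name `unit_output` is hypothetical.

```python
import math

def sigmoid(z: float) -> float:
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def relu(z: float) -> float:
    """Pass positive values through unchanged and clamp negatives to 0."""
    return max(0.0, z)

def unit_output(inputs, weights, bias, activation):
    """Weighted sum of the incoming activations, passed through the activation function."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return activation(z)

# Illustrative values only: activations arriving from three upstream neurons.
incoming = [0.5, -1.0, 2.0]
weights = [0.4, 0.3, -0.2]
bias = 0.1

print(unit_output(incoming, weights, bias, relu))
print(unit_output(incoming, weights, bias, sigmoid))
```

Swapping the `activation` argument changes how the same weighted sum is mapped to an output, which is the role the definition describes.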

active learning

A model for machine learning in which the system requests more labeled data only when it needs it.

active metadata

Metadata that is automatically updated based on analysis by machine learning processes. For example, profiling and data quality analysis automatically update metadata for data assets.

active runtime

An instance of an environment that is running to provide compute resources to assets that run code.

agent

An algorithm or a program that interacts with an environment to learn optimal actions or decisions, typically using reinforcement learning, to achieve a specific goal.

AI

See artificial intelligence.

AI accelerator

Specialized silicon hardware designed to efficiently execute AI-related tasks like deep learning, machine learning, and neural networks for faster, energy-efficient computing. It can be a dedicated unit in a core, a separate chiplet on a multi-module chip or a separate card.

AI ethics

A multidisciplinary field that studies how to optimize AI's beneficial impact while reducing risks and adverse outcomes. Examples of AI ethics issues are data responsibility and privacy, fairness, explainability, robustness, transparency, environmental sustainability, inclusion, moral agency, value alignment, accountability, trust, and technology misuse.

AI governance

An organization's act of governing, through its corporate instructions, staff, processes and systems to direct, evaluate, monitor, and take corrective action throughout the AI lifecycle, to provide assurance that the AI system is operating as the organization intends, as its stakeholders expect, and as required by relevant regulation.

AI safety

The field of research aiming to ensure artificial intelligence systems operate in a manner that is beneficial to humanity and don't inadvertently cause harm, addressing issues like reliability, fairness, transparency, and alignment of AI systems with human values.

AI system

See artificial intelligence system.

algorithm

A formula applied to data to determine optimal ways to solve analytical problems.

analytics

The science of studying data in order to find meaningful patterns in the data and draw conclusions based on those patterns.

artificial intelligence (AI)

The capability to acquire, process, create and apply knowledge in the form of a model to make predictions, recommendations or decisions.

artificial intelligence system (AI system)

A system that can make predictions, recommendations or decisions that influence physical or virtual environments, and whose outputs or behaviors are not necessarily pre-determined by its developer or user. AI systems are typically trained with large quantities of structured or unstructured data, and might be designed to operate with varying levels of autonomy or none, to achieve human-defined objectives.

asset

An item in a project or catalog that contains metadata about data or data analysis.

attribute composition rule

One of a set of rules that determine how a master data entity's attribute values get selected from its member records. See also rule.

AutoAI experiment

An automated training process that considers a series of training definitions and parameters to create a set of ranked pipelines as model candidates.

B

batch deployment

A method to deploy models that processes input data from a file, data connection, or connected data in a storage bucket, then writes the output to a selected destination.

bias

Systematic error in an AI system that has been designed, intentionally or not, in a way that may generate unfair decisions. Bias can be present both in the AI system and in the data used to train and test it. AI bias can emerge in an AI system as a result of cultural expectations, technical limitations, or unanticipated deployment contexts. See also fairness.

bias detection

The process of calculating fairness metrics to detect when AI models are delivering unfair outcomes based on certain attributes.

bias mitigation

Reducing biases in AI models by curating training data and applying fairness techniques.

binary classification

A classification model with two classes. Predictions are a binary choice of one of the two classes.

business term

A word or phrase that defines a business concept in a standard way for an enterprise. Terms can be used to enrich the metadata of data assets and to define the criteria of data protection rules.

business vocabulary

The set of governance artifacts, such as business terms and data classes, that describe and enrich data assets.

C

catalog

A repository of assets for an organization to share. Assets in catalogs can be governed by data protection rules and enriched by other governance artifacts, such as classifications, data classes, and business terms. Catalogs can store structured and unstructured data, references to data in external data sources, and other analytical assets, like machine learning models.

classification

For data governance, a governance artifact that describes the sensitivity level of the data in a data asset.

cleanse

To ensure that all values in a data set are consistent and correctly recorded.

collaborator

A member of a group of people who are working together toward a common goal.

combinatorial problem

A problem that is difficult to solve because it requires multiple decisions to be made involving too many combinations of possible choices. Examples include finding a grouping, an ordering, or an assignment of objects.

compute resources

The hardware and software resources defined by an environment definition to run analytical assets.

confusion matrix

A performance measurement that determines the accuracy between a model's positive and negative predicted outcomes compared to positive and negative actual outcomes.
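
The four cells of a confusion matrix for a binary classifier can be counted directly. The sketch below is illustrative only: the function name `confusion_matrix`, the label encoding (1 as the positive class), and the sample labels are assumptions, not part of any product API.

```python
def confusion_matrix(actual, predicted, positive=1):
    """Count true/false positives and negatives for a binary classifier."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1          # predicted positive, actually positive
        elif p == positive:
            fp += 1          # predicted positive, actually negative
        elif a == positive:
            fn += 1          # predicted negative, actually positive
        else:
            tn += 1          # predicted negative, actually negative
    return {"tp": tp, "fp": fp, "tn": tn, "fn": fn}

# Made-up labels for eight records.
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

cm = confusion_matrix(actual, predicted)
accuracy = (cm["tp"] + cm["tn"]) / len(actual)
print(cm, accuracy)
```

Accuracy is only one measure that can be derived from the four counts; precision and recall use the same cells in different ratios.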

connected data

A data set that is accessed through a connection to an external data source.

connection

The information required to connect to a database. The actual information that is required varies according to the DBMS and connection method.

constraint

In Decision Optimization, a condition that must be satisfied by the solution of a problem.

continuous learning

Automating the tasks of monitoring model performance, retraining with new data, and redeploying to ensure prediction quality.

Core ML deployment

The process of downloading a deployment in Core ML format for use in iOS apps.

corpus

A collection of source documents that are used to train a machine learning model.

CPLEX model

A Decision Optimization model that is formulated to be solved by the CPLEX engine.

CPO model

A constraint programming model that is formulated to be solved by the Decision Optimization CP Optimizer (CPO) engine.

curate

To select, collect, preserve, and maintain content relevant to a specific topic. Curation establishes, maintains, and adds value to data; it transforms data into trusted information and knowledge.
To create a data asset and prepare it to be published in a catalog. Curation can include enriching the data asset by assigning governance artifacts such as business terms, classification, and data classes, and analyzing the quality of the data in the data asset.

D

data asset

An asset that points to data, for example, to an uploaded file. Connections and connected data assets are also considered data assets.

data class

A governance artifact that categorizes columns in relational data sets according to the type of the data and how the data is used.

data governance

The process of tracking and controlling data to maintain data quality, data security, and compliance.

data integration

The combination of technical and business processes that are used to combine data from disparate sources into meaningful and valuable information.

data lake

A large-scale data storage repository that stores raw data in any format in a flat architecture. Data lakes hold structured and unstructured data as well as binary data for the purpose of processing and analysis.

data lakehouse

A unified data storage and processing architecture that combines the flexibility of a data lake with the structured querying and performance optimizations of a data warehouse, enabling scalable and efficient data analysis for AI and analytics applications.

data mining

The process of collecting critical business information from a data source, correlating the information, and uncovering associations, patterns, and trends. See also predictive analytics.

data model

A visualization of data elements, their relationships, and their attributes.

data pipeline

A series of data processing and transformation steps.

data privacy

The protection of data from unauthorized access and inappropriate use.

data product

A collection of optimized data or data-related assets that are packaged for reuse and distribution with controlled access. Data products contain data as well as models, dashboards, and other computational asset types. Unlike data assets in governance catalogs, data products are managed as products with multiple purposes to provide business value.

data protection rule

A governance artifact that specifies what data to control and how to control it. A data protection rule contains criteria and an action. See also rule.

data quality analysis

The analysis of data against the quality dimensions accuracy, completeness, consistency, timeliness, uniqueness, and validity.

data quality definition

An artifact that describes a rule evaluation or condition for data quality rules.

data quality rule

During data quality analysis, a rule that assesses whether data meets specific conditions and identifies records that do not meet the conditions as rule violations. See also rule.

Data Refinery flow

A data source, a chain of one or more operations that refine and shape that data source, and a target that the data moves to.

data science

The analysis and visualization of structured and unstructured data to discover insights and knowledge.

data set

A collection of data, usually in the form of rows (records) and columns (fields) and contained in a file or database table.

data source

A repository, queue, or feed for reading data, such as a database.

DataStage flow

An asset that is based on an ordered set of steps to extract, transform, and load data.

data table

A collection of data, usually in the form of rows (records) and columns (fields) and contained in a table.

data warehouse

A large, centralized repository of data collected from various sources that is used for reporting and data analysis. It primarily stores structured and semi-structured data, enabling businesses to make informed decisions.

Decision Optimization experiment

An asset that contains a group of scenarios that represent different model formulations or data sets related to the same problem that is being solved.

Decision Optimization model

A prescriptive model that can be solved with optimization to provide the best solution to a Decision Optimization problem.

decision variable

One of a set of variables representing decisions to be made, whose values are determined by the optimization engine while ensuring that all constraints are satisfied and the objective is optimized.

deep learning

A computational model that uses multiple layers of interconnected nodes, which are organized into hierarchical layers, to transform input data (first layer) through a series of computations to produce an output (final layer). Deep learning is inspired by the structure and function of the human brain.

deep learning experiment

A model training process that is based on a logical grouping of one or more model training definitions that are connected in a neural network.

deep neural network

A neural network with multiple hidden layers, allowing for more complex representations of the data.

deployment

A model or application package that is available for use.

deployment space

A workspace where models are deployed and deployments are managed.

DOcplex

A Python API for modeling and solving Decision Optimization problems.

E

endpoint URL

A network destination address that identifies resources, such as services and objects. For example, an endpoint URL is used to identify the location of a model or function deployment when a user sends payload data to the deployment.

environment

The compute resources for running jobs.

environment runtime

An instantiation of the environment template to run assets.

environment template

A definition that specifies hardware and software resources to instantiate environment runtimes.

explainability

The ability of human users to trace, audit, and understand predictions that are made in applications that use AI systems.
The ability of an AI system to provide insights that humans can use to understand the causes of the system's predictions.

F

fairness

In an AI system, the equitable treatment of individuals or groups of individuals. The choice of a specific notion of equity for an AI system depends on the context in which it is used. See also bias.

feature

A property or characteristic of an item within a data set, for example, a column in a spreadsheet. In some cases, features are engineered as combinations of other features in the data set.

feature engineering

The process of selecting, transforming, and creating new features from raw data to improve the performance and predictive power of machine learning models.

feature selection

Identifying the columns of data that best support an accurate prediction or score in a machine learning model.

feature store

A centralized repository or system that manages and organizes features, providing a scalable and efficient way to store, retrieve, and share feature data across machine learning pipelines and applications.

feature transformation

In AutoAI, a phase of pipeline creation that applies algorithms to transform and optimize the training data to achieve the best outcome for the model type.

flow

A collection of nodes that define a set of steps for processing data or training a model.

foundation model

An AI model that can be adapted to a wide range of downstream tasks. Foundation models are typically large-scale generative models that are trained on unlabeled data using self-supervision. As large-scale models, foundation models can include billions of parameters.

G

Gantt chart

A graphical representation of a project timeline and duration in which schedule data is displayed as horizontal bars along a time scale.

gen AI

See generative AI.

generative AI (gen AI)

A class of AI algorithms that can produce various types of content including text, source code, imagery, audio, and synthetic data.

governance artifact

Governance items that enrich or control data assets. Governance artifacts include business terms, classifications, data classes, policies, rules, and reference data sets.

governance rule

A governance artifact that provides a natural-language description of the criteria that are used to determine whether data assets are compliant with business objectives. See also rule.

governance workflow

A task-based process to control the creating, modifying, and deleting of governance artifacts.

governed catalog

A catalog that has enforcement of data protection rules enabled.

GPU

See graphics processing unit.

graphical builder

A tool for creating flow assets by visually coding. A canvas is an area on which to place objects or nodes that can be connected to create a flow.

graphics processing unit (GPU)

A specialized processor designed to rapidly manipulate and alter memory to accelerate the creation of images in a frame buffer intended for output to a display. GPUs are heavily utilized in machine learning due to their parallel processing capabilities. See also accelerator.

grounding

Providing a large language model with information to improve the accuracy of results.

H

HAP detection

The ability to detect and filter hate, abuse, and profanity in both prompts submitted by users and in responses generated by an AI model.

HAP detector

A sentence classifier that removes potentially harmful content, such as hate speech, abuse, and profanity, from foundation model output and input.

hold-out set

A set of labeled data that is intentionally withheld from both the training and validation sets, serving as an unbiased assessment of the final model's performance on unseen data.

human oversight

Human involvement in reviewing decisions rendered by an AI system, enabling human autonomy and accountability of decision.

hyperparameter

In machine learning, a parameter whose value is set before training as a way to increase model accuracy.

I

image

A software package that contains a set of libraries.

inferencing

The process of running live data through a trained AI model to make a prediction or solve a task.

ingest

To feed data into a system for the purpose of creating a base of knowledge.
To continuously add a high volume of real-time data to a database.

insight

An accurate or deep understanding of something. Insights are derived using cognitive analytics to provide current snapshots and predictions of customer behaviors and attitudes.

intent

A purpose or goal expressed by customer input to a chatbot, such as answering a question or processing a bill payment.

J

job

A separately executable unit of work.

K

knowledge base

See corpus.

L

labeled data

Raw data that is assigned labels to add context or meaning so that it can be used to train machine learning models. For example, numeric values might be labeled as zip codes or ages to provide context for model inputs and outputs.

large language model (LLM)

A language model with a large number of parameters, trained on a large quantity of text.

lineage

The history of the flow of data through assets.
The history of the events performed on an asset.

LLM

See large language model.

logical model

A logical representation of data objects that are related to a business domain.

M

machine learning (ML)

A branch of artificial intelligence (AI) and computer science that focuses on the use of data and algorithms to imitate the way that humans learn, gradually improving the accuracy of AI models.

machine learning framework

The libraries and runtime for training and deploying a model.

machine learning model

An AI model that is trained on a set of data to develop algorithms that it can use to analyze and learn from new data.

mask

To replace sensitive data values in a column of a data set. Masking methods vary in data utility and privacy from providing similarly formatted replacement values that retain referential integrity to providing the same replacement value for the entire column.

masking flow

A flow that produces permanently masked copies of data.

master data

For model training, reference data that remains the same for several jobs on the same model but that can be changed, if necessary.
In Match 360, a consolidated view of data from the disparate sources.

master data entity

A composition of records that a matching algorithm has determined to represent the same real-world entity, such as a person or organization. Each entity includes one or many member records that the matching algorithm has linked together.

mathematical programming (MP)

A field of mathematics, or operational research, used to model and solve Decision Optimization problems. This encompasses linear, integer, mixed integer and non-linear programming.

metadata import

A method of importing metadata that is associated with data assets, including process metadata that describes the lineage of data assets and technical metadata that describes the structure of data assets.

misalignment

A discrepancy between the goals or behaviors that an AI system is optimized to achieve and the true, often complex, objectives of its human users or designers.

ML

See machine learning.

MLOps

The practice of collaboration between data scientists and operations professionals to help manage the production machine learning (or deep learning) lifecycle. MLOps looks to increase automation and improve the quality of production ML while also focusing on business and regulatory requirements. It involves model development, training, validation, deployment, monitoring, and management and uses methods like CI/CD.
A methodology that takes a machine learning model from development to production.

model

In a machine learning context, a set of functions and algorithms that have been trained and tested on a data set to provide predictions or decisions.
In Decision Optimization, a mathematical formulation of a problem that can be solved with CPLEX optimization engines using different data sets.

model formulation

In Decision Optimization, the mathematical formulation of a model expressed as a list of decision variables, one or more objective functions to be maximized or minimized, and some constraints to be satisfied.
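
The three parts of a model formulation (decision variables, an objective function, and constraints) can be sketched in plain Python. This is a toy example with made-up coefficients, and it finds the best solution by exhaustive enumeration, not with the CPLEX engine or the DOcplex API.

```python
from itertools import product

# Decision variables: integer quantities x and y, each from 0 to 4 (illustrative bounds).
domain = range(0, 5)

def objective(x, y):
    """Objective function to maximize (made-up coefficients)."""
    return 2 * x + 3 * y

# Constraints that every solution must satisfy.
constraints = [
    lambda x, y: x + y <= 4,
    lambda x, y: x + 3 * y <= 6,
]

# Keep only the assignments that satisfy all constraints, then pick the best one.
feasible = [
    (x, y)
    for x, y in product(domain, domain)
    if all(check(x, y) for check in constraints)
]
best = max(feasible, key=lambda solution: objective(*solution))
print(best, objective(*best))
```

Enumeration works only because this search space has 25 candidates. Realistic problems need an optimization engine, which is why formulations are normally written for a solver.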

ModelOps

A methodology for managing the full lifecycle of an AI model, including training, deployment, scoring, evaluation, retraining, and updating.

MP

See mathematical programming.

N

natural language

A modeling syntax that resembles natural human language (in English) to formulate models.

natural language processing (NLP)

A field of artificial intelligence and linguistics that studies the problems inherent in the processing and manipulation of natural language, with an aim to increase the ability of computers to understand human languages.

natural language processing library

A library that provides basic natural language processing functions for syntax analysis and out-of-the-box pre-trained models for a wide variety of text processing tasks.

neural network

A mathematical model for predicting or classifying cases by using a complex mathematical scheme that simulates an abstract version of brain cells. A neural network is trained by presenting it with a large number of observed cases, one at a time, and allowing it to update itself repeatedly until it learns the task.

NLP

See natural language processing.

node

The graphical representation of a data operation in a stream or flow. Different types of nodes have different shapes to indicate the type of operation that they perform.

notebook

An interactive document that contains executable code, descriptive text for that code, and the results of any code that is run.

notebook kernel

The part of the notebook editor that executes code and returns the computational results.

O

obfuscate

To replace data in a column with similarly formatted values that match the original format. A form of masking.

objective function

In Decision Optimization and operations research, an expression to optimize (that is, either to minimize or to maximize) while satisfying other constraints of the problem.

object storage

A method of storing data, typically used in the cloud, in which data is stored as discrete units, or objects, in a storage pool or repository that does not use a file hierarchy but that stores all objects at the same level.

online deployment

A method of accessing a model or Python code deployment through an API endpoint as a web service to generate predictions online, in real time.

ontology

An explicit formal specification of the representation of the objects, concepts, and other entities that can exist in some area of interest and the relationships among them.

operational asset

An asset that runs code in a tool or a job.

OPL model

A model formulation expressed in OPL modeling language.

optimal solution

In operations research, a solution to a problem that optimizes the objective function (whether linear or quadratic) and satisfies all the other constraints of the problem.

optimization

The process of finding the most appropriate solution to a precisely defined problem while respecting the imposed constraints and limitations. For example, determining how to allocate resources or how to find the best elements or combinations from a large set of alternatives.

orchestration

The process of creating an end-to-end flow that can train, run, deploy, test, and evaluate a machine learning model, using automation to coordinate the system, often with microservices.

P

pair review

A process during which a data steward user compares records to determine whether they are a match. Pair review results train a matching algorithm how to decide which records get matched into master data entities.

parameter

A configurable part of the model that is internal to a model and whose values are estimated or learned from data. Parameters are aspects of the model that are adjusted during the training process to help the model accurately predict the output. The model's performance and predictive power largely depend on the values of these parameters.

payload

The data that is passed to a deployment to get back a score, prediction, or solution.

payload logging

The capture of payload data and deployment output to monitor ongoing health of AI in business applications.

physical model

A definition of the physical structures and relationships of data.

pipeline

In Watson Pipelines, an end-to-end flow of assets from creation through deployment.
In AutoAI, a candidate model.

pipeline leaderboard

In AutoAI, a table that shows the list of automatically generated candidate models, as pipelines, ranked according to the specified criteria.

placeholder

A field or variable to be replaced with a value.

policy

A strategy or rule that an agent follows to determine the next action based on the current state.
A set of rules that protect data by controlling access to data assets or anonymizing sensitive data within data assets.
A governance artifact that consists of one or more data protection and governance rules.

predictive analytics

A business process and a set of related technologies that are concerned with the prediction of future possibilities and trends. Predictive analytics applies such diverse disciplines as probability, statistics, machine learning, and artificial intelligence to business problems to find the best action for a specific situation. See also data mining.

pretrained model

An AI model that was previously trained on a large data set to accomplish a specific task. Pretrained models are used instead of building a model from scratch.

primary category

For data governance, the category that contains the governance artifact. A category is similar to a folder or directory that organizes a user's governance artifacts.

privacy

Assurance that information about an individual is protected from unauthorized access and inappropriate use.

profile

The generated metadata and statistics about the textual content of data.

project

A collaborative workspace for working with data and other assets.

pruning

The process of simplifying, shrinking, or trimming a decision tree or neural network. This is done by removing less important nodes or layers, reducing complexity to prevent overfitting and improve model generalization while maintaining its predictive power.

publish

To copy an asset into a catalog.

Python

A programming language that is used in data science and AI.

Python DOcplex model

A model formulation expressed in Python.

Python function

A function that contains Python code to support a model in production.

Q

quality rule

One or more conditions required for a data record to meet quality standards. During data quality analysis, data records are checked against these conditions. See also rule.

quantization

A method of compressing foundation model weights to speed up inferencing and reduce GPU memory needs.
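
One simple way to compress weights is symmetric 8-bit quantization: each floating-point weight is stored as a small integer plus one shared scale factor. The sketch below assumes that scheme for illustration; it is not necessarily the method used for any particular foundation model, and the weight values and function names are made up.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    largest = max(abs(w) for w in weights)
    scale = largest / 127 if largest > 0 else 1.0
    quantized = [round(w / scale) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate float weights from the stored integers."""
    return [q * scale for q in quantized]

# Illustrative weights only.
weights = [0.82, -1.27, 0.05, 0.0, 0.64]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)

print(quantized)
print(restored)
```

Each integer fits in one byte instead of the four bytes of a 32-bit float, at the cost of a rounding error of at most half the scale factor per weight.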

R

R

An extensible scripting language that is used in data science and AI that offers a wide variety of analytic, statistical, and graphical functions and techniques.

read

To copy data into an application to manipulate or analyze it.

redact

To replace all data values in a column with the same string to hide sensitive values, data format, and any relationships between values. A form of masking.

reference data set

A governance artifact that defines values for specific types of columns.

refine

To cleanse and shape data.

reinforcement learning

A machine learning technique in which an agent learns to make sequential decisions in an environment to maximize a reward signal. Inspired by trial and error learning, agents interact with the environment, receive feedback, and adjust their actions to achieve optimal policies.

reward

A signal used to guide an agent, typically a reinforcement learning agent, that provides feedback on the goodness of a decision.

rule

An artifact that contains information, criteria, or logic to analyze or protect data. See also data protection rule, data quality rule, governance rule, quality rule, attribute composition rule.

runtime environment

The predefined or custom hardware and software configuration that is used to run tools or jobs, such as notebooks.

S

scoring

In machine learning, the process of measuring the confidence of a predicted outcome.
The process of computing how closely the attributes for an incoming identity match the attributes of an existing entity.

script

A file that contains Python or R scripts to support a model in production.

secondary category

An optional category that references the governance artifact.

self-attention

An attention mechanism that uses information from the input data itself to determine what parts of the input to focus on when generating output.
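
The idea can be sketched with scaled dot-product attention in plain Python. This is a deliberate simplification: real transformer layers apply learned query, key, and value projections, whereas here each token vector serves as its own query, key, and value. The token vectors are made up.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """For each token, blend all token vectors, weighted by similarity to that token."""
    dim = len(tokens[0])
    outputs = []
    for query in tokens:
        # Similarity of this token to every token in the same input (scaled dot product).
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in tokens
        ]
        weights = softmax(scores)
        # Weighted average of the token vectors.
        outputs.append([
            sum(w * value[i] for w, value in zip(weights, tokens))
            for i in range(dim)
        ])
    return outputs

# Three made-up two-dimensional token vectors.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = self_attention(tokens)
for row in attended:
    print(row)
```

The attention weights are computed only from the input itself, which is what distinguishes self-attention from attention over a separate source sequence.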

self-supervised learning

A machine learning training method in which a model learns from unlabeled data by masking tokens in an input sequence and then trying to predict them. An example is "I like ________ sprouts".

semantic search

A keyword search that incorporates linguistic and contextual analysis. In a semantic search, the intent of the query is specified using one or more specifiers. For example, it is possible to specify a person named "Bush" and such a query would then not return results about the kind of bushes that grow in a garden but rather just persons named Bush.

sensitive data

Data that contains information that should be protected from unauthorized access or disclosure. Categories of sensitive data can be protected health information, personally identifiable information, trade secrets, or financial results.

sentiment analysis

Examination of the sentiment or emotion expressed in text, such as determining if a movie review is positive or negative.

shape

To customize data by filtering, sorting, and removing columns; joining tables; and performing operations that include calculations, data groupings, hierarchies, and more.

small data

Data that is accessible and comprehensible by humans. See also structured data.

SQL pushback

In SPSS Modeler, the process of performing many data preparation and mining operations directly in the database through SQL code.

structured data

Data that resides in fixed fields within a record or file. Relational databases and spreadsheets are examples of structured data. See also unstructured data, small data.

structured information

Items stored in structured resources, such as search engine indices, databases, or knowledge bases.

substitute

To replace data in a column with values that don't match the original format but retain referential integrity.

supernode

An SPSS Modeler node that shrinks a data stream by encapsulating several nodes into one.

supervised learning

A machine learning training method in which a model is trained on a labeled dataset to make predictions on new data.

T

text classification

A model that automatically identifies and classifies text into specified categories.

time series

A set of values of a variable at periodic points in time.

trained model

A model that is trained with actual data and is ready to be deployed to predict outcomes when presented with new data.

training

The initial stage of model building, involving a subset of the source data. The model learns by example from the known data. The model can then be tested against a further, different subset for which the outcome is already known.

training data

A collection of data that is used to train machine learning models.

training set

A set of labeled data that is used to train a machine learning model by exposing it to examples and their corresponding labels, enabling the model to learn patterns and make predictions.

transfer learning

A machine learning strategy in which a trained model is applied to a completely new problem.

transformer

A neural network architecture that uses positional encodings and the self-attention mechanism to predict the next token in a sequence of tokens.

transparency

Sharing appropriate information with stakeholders on how an AI system has been designed and developed. Examples of this information are what data is collected, how it will be used and stored, and who has access to it; and test results for accuracy, robustness and bias.

Turing test

Proposed by Alan Turing in 1950, a test of a machine's ability to exhibit intelligent behavior equivalent to, or indistinguishable from, that of a human.

U

unbounded problem

A Decision Optimization problem where an infinite number of solutions exists and the objective can take values up to infinity. Unbounded problems are often caused by missing constraints in the model formulation.

unstructured data

Any data that is stored in an unstructured format rather than in fixed fields. Data in a word processing document is an example of unstructured data. See also structured data.

unstructured information

Data that is not contained in a fixed location, such as a natural language text document.

unsupervised learning

A machine learning training method in which a model is not provided with labeled data and must find patterns or structure in the data on its own.

V

validation set

A separate set of labeled data that is used to evaluate the performance and generalization ability of a machine learning model during the training process, assisting in hyperparameter tuning and model selection.
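
The relationship between the training set, the validation set, and the hold-out set can be sketched as a three-way split of labeled records. The 70/15/15 proportions, the fixed seed, and the function name `three_way_split` are assumptions chosen for illustration.

```python
import random

def three_way_split(records, train=0.7, validation=0.15, seed=0):
    """Shuffle the records, then cut them into training, validation, and hold-out sets."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)   # seeded so the split is repeatable
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * validation)
    training_set = shuffled[:n_train]
    validation_set = shuffled[n_train:n_train + n_val]
    holdout_set = shuffled[n_train + n_val:]   # whatever remains is held out
    return training_set, validation_set, holdout_set

# Stand-in for 100 labeled records.
records = list(range(100))
training_set, validation_set, holdout_set = three_way_split(records)
print(len(training_set), len(validation_set), len(holdout_set))
```

The validation set is consulted repeatedly during training, for example when tuning hyperparameters, while the hold-out set is used once, for the final assessment.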

visualization

A graph, chart, plot, table, map, or any other visual representation of data.

W

weight

A coefficient for a node that transforms input data within the network's layer. Weight is a parameter that an AI model learns through training, adjusting its value to reduce errors in the model's predictions.