SSAO5N - Documentation Index
Table of Contents
Welcome
Overview
What's new in watsonx.data
watsonx.data deployment options and plans
Platform UI and Console UI comparison
Platform architecture
Asset types and properties
Searching for assets
Previews
Profiles
Relationships
Activities
Object storage for workspaces
AI assistants and agents
Services
Regional availability
IBM Cloud Services
Creating services
watsonx.data plans
Billing details for watsonx.data as a Service
watsonx.ai Studio service plans
Billing details for watsonx.ai Studio
watsonx.ai Runtime plans
Billing details for generative AI assets in watsonx.ai Runtime
watsonx.data intelligence plans and billing
watsonx.data integration plans and billing
Cloud Object Storage plans
AWS services
AWS GovCloud services
Astra DB pricing
Astra Managed Clusters pricing plans
Azure services
FAQs
Language support
Browser support
Preview releases
Notices
Accessibility
Known issues for the platform UI
Known issues for the console UI on IBM Cloud
Known issues on AWS
Getting help
Getting started and tutorials
Signing up for the Lite plan
Joining your organization's watsonx.data account
Switching between experiences
Creating task credentials
Generating an API key and bearer token
watsonx APIs and SDKs
Tutorials
Getting started with the watsonx.data Lite plan
Gen AI lakehouse tutorial
Quick start watsonx.data console
Getting started with your query engine
Adding storage and querying data
Integrating watsonx.ai with watsonx.data for retrieval-augmented generation (RAG)
Retrieving parameter values
Data ingestion, Time travel, and Table rollback
Connecting and Querying across multiple data sources
Spark application REST API
AI solution accelerators
Q&A RAG accelerator
Medallion accelerator
Projects
Shared projects across experiences
Creating a project
Importing a project
Importing project assets
Administering projects
Managing collaborators
Project collaborator roles
Marking a project as sensitive
Managing task credentials
Adding associated services
Exporting project assets
Enabling folders
Defining default settings for tools
Unstructured Data Integration
Managing assets in projects
Downloading data assets
Choosing compute resources for tools
Working with pipelines
Compute options for the notebook editor
Compute options for Pipelines
Managing compute resources
Creating non-standard environment templates
Customizing environment templates
Runtime usage
Creating and managing jobs
Creating jobs in the Notebook editor
Creating Jobs for Spark Application Submission
Creating jobs for Pipelines
Viewing jobs across projects
Adding catalog assets to a project
Publishing assets to a catalog
Leaving a project
Markdown cheatsheet
Preparing data
Adding data to a project
Adding very large files to a project
Adding connections to projects
Adding integrated service connectors
Connecting to data behind a firewall
Adding data from a connection
Adding a connected folder asset from a connection
Connectors
Amazon S3 connection
Setting up temporary credentials or a Role ARN for Amazon S3
Box connection
ClickHouse connection
Collibra connection
Confluence connection
Google Cloud Storage connection
Google Drive connection
IBM Cloud Object Storage connection
Controlling access to Cloud Object Storage buckets
IBM Cloud Object Storage (infrastructure) connection
IBM Db2 connection
IBM FileNet P8 connection
IBM Netezza Performance Server connection
IBM watsonx.data Milvus connection
IBM watsonx.data Presto connection
IBM watsonx.data SharePoint connection
Microsoft Azure Data Lake Storage connection
Microsoft SharePoint Files connection
Microsoft SharePoint connection
Microsoft SQL Server connection
Oracle connection
PostgreSQL connection
Snowflake connection
Splunk connection
Adding platform connections
Managing collaborators on platform connections
Building custom connectors with Connector forge
Creating and deploying custom connectors
Troubleshooting custom connectors
Parametrized connections
Data protection with data source definitions
Protection solutions for data source definition
Connectors that support data source definitions
Connectors with hard-coded data source identity properties
Roles and asset privacy settings for data source definitions
Creating a data source definition
Creating a data source definition from the Data source definition list
Adding endpoints to a new or existing data source definition
Creating a connection from the Data source definitions list
Setting connection limits for data source definitions
Managing data source definitions
Orchestrating tasks with Orchestration Pipelines
Getting started with Pipelines
Planning a pipeline
Creating a pipeline
Configuring pipeline nodes
Managing default settings
Configuring global objects
Adding conditions to a pipeline
Functions used in pipelines Expression Builder
Handling pipeline errors
Programming a pipeline
Creating custom components
Running and saving pipelines
Configuration management for Orchestration Pipelines
Managing watsonx.data infrastructure
Presto (Java)
Presto (C++)
watsonx.data Spark engine
Apache Gluten accelerated Spark engine
Catalogs
Metadata Service
Data Access Service (DAS)
Milvus
Query Optimizer
Presto (Java) mixed-case support
API customization
Data Gate
Accessing data in external data platforms
Metadata Service
Resource groups
Access management and governance
Gathering diagnostics
OpenTelemetry
IBM Manta Data Lineage
Customizing max pool size in Metadata Service
Provisioning a Presto (Java) engine
Provisioning a Presto (C++) engine
Provisioning a serverless Spark engine for Lite plan
Provisioning a Spark engine
Provisioning Apache Gluten accelerated Spark engine
Managing watsonx.data Spark
Customization overview
Managing Spark engine capacity
Managing native Spark engine details
Managing the Spark engine details
View and edit native Spark engine details
Associating or dissociating catalogs
View and manage applications
Spark user interface
Scaling native Spark engine
Registering an engine
Managing engines
Associating a catalog with an engine
Exploring the catalog objects
Dissociating a catalog from an engine
Configuring Presto resource groups
Adding storage
IBM Cloud Object Storage
Amazon S3
IBM Storage Ceph
MinIO
Hadoop Distributed File System (HDFS)
Google Cloud Storage
Azure Data Lake Storage
Apache Ozone
Custom S3 storage
Adding multiple Apache Iceberg catalogs to a single storage
Exploring the storage details and objects
Editing storage details
Deleting a storage-catalog pair
Setting up GlusterFS replicated storage with MinIO
Disabling or enabling ACL on an ACL-enabled storage
Registering external data into watsonx.data
Adding data source
Apache Druid
Apache Kafka
Apache Phoenix
Apache Pinot
Amazon Redshift
BigQuery
Apache Cassandra
ClickHouse
HANA
IBM Db2 for i
Elasticsearch
IBM Data Virtualization Manager
IBM Db2
IBM Netezza
IBM Db2 for z/OS
IBM Informix
MongoDB
MySQL
Oracle
PostgreSQL
Prometheus
Redis
SingleStore
Snowflake
SQL Server
Teradata
Custom
Arrow Flight service
Apache Derby
Greenplum
MariaDB
Salesforce
Updating data source credentials
Editing data source details
Deleting a data source-catalog pair
Managing IAM access for watsonx.data
Managing user access
Managing roles and privileges
Managing data policy rules
Common Policy Gateway (CPG) connector
Enabling or disabling common policy gateway engines
Protecting your lakehouse with context-based restrictions
Introduction to OpenRAG
Quick start: Provision OpenRAG and OpenSearch
Adding an OpenRAG service
Astra DB in watsonx.data
Adding an Astra DB service
Terminating an Astra DB service
Viewing Astra DB database details
Creating an application token
Creating a custom role in Astra DB
Managing access control for Astra DB service
Semantic automation for data enrichment
Registering and activating semantic layer
Enriching data with semantic automation layer
Performing semantic searches
Driver manager
Billing and usage
Connecting to Presto server
Account-scoped metadata model
Engineering data
Engineering structured data
Exploring Data manager
About Data manager
Creating schemas using the web console
Creating tables using the web console
Ingesting data using web console
Overview of data ingestion
Ingesting data by using the Spark ingestion UI
Ingesting data from a local system
Ingesting data from remote storage
Ingesting data from databases
Migrating data from Delta Lake to Iceberg tables
Ingesting streaming data by using Spark Stream (Experimental)
Accessing Spark logs for ingestion jobs
Ingesting data from object storage bucket
Querying data
Running SQL queries
About Visual Explain
Query Optimizer
Activating Query optimizer
Managing statistical updates
Syncing Query optimizer manager with metastore
Verifying table sync
Enhancing statistics for synced Iceberg tables
Updating query rewrite timeout
Upgrading Query Optimizer
Deactivating Query optimizer manager
Query history
Exporting and importing the query history
Overview
Configuring Query monitoring
Analyzing diagnostic data
Managing diagnostic data from user interface
Managing diagnostic data by manual method
Retrieving QHMM logs by using ibm-lh utility
QHMM Shell Script usage
Working with Spark
Introduction to watsonx.data Spark
Spark application runtime
Spark application submission methods
Submitting Spark application by using native Spark engine
Console submission
Spark scenarios and use-cases
Enabling application autoscaling
Data Processing with Spark Streaming
Enhancing Spark application submission using Spark access control extension
Query Server runtime
Connecting to Spark query server by using Spark JDBC Driver
Jupyter Notebook runtime
Visual Studio Code runtime
VS Code development environment - Spark labs
Monitoring and debugging Spark applications from Spark labs
Monitoring and debugging
Debug the Spark application
Accessing the Spark history server
Monitoring Spark application runs by using Databand
Track Spark applications
Data management
Spark table maintenance by using IBM cpdctl
Submitting Spark jobs for MoR to CoW conversion
Working with different table formats
Submitting Spark runtimes to migrate Delta Lake tables to Apache Iceberg
Querying data with Data workbench
Creating a data product
Table Optimizer
Table Optimizer configuration options
Interacting with data through an MCP server
Setting up the remote MCP server
Setting up a local MCP server
Finding and querying data in metastores
Creating new data source
Creating new Schema
Running SQL queries in SQL worksheets
Getting connection information
SQL statements, data types and mixed-case behavior supported by Presto
Presto SQL statements
Presto data types
Mixed-case behavior based on connectors
IBM Cloud Pak for Data Command Line Interface (IBM cpdctl)
Downloading and installing IBM Cloud Pak for Data Command Line Interface (IBM cpdctl)
Supporting commands and usage for watsonx.data in IBM cpdctl
config commands and usage
wx-data commands and usage
Additional information about cpdctl wx-data command usage and examples
Additional information about ingestion command usage and special cases
Analyzing data
Notebooks and scripts
Planning your notebooks and scripts experience
Jupyter Notebook editor
Creating and managing notebooks
Parts of a notebook
Jupyter kernels and notebook environments
Coding and running notebooks
Libraries and scripts
Installing custom libraries
Importing scripts into a notebook
Watson Natural Language Processing
Working with pre-trained models
Library task catalog
Language detection
Syntax analysis
Noun phrase extraction
Keyword extraction and ranking
Entity extraction
Embeddings
HAP detection
Sentiment extraction
Tone classification
Emotion classification
Relations extraction
Hierarchical categorization
Category types
Creating your own models
Detecting entities with a custom dictionary
Detecting entities with regular expressions
Detecting entities with a custom transformer model
Classifying text with a custom classification model
Extracting sentiment with a custom transformer model
Extracting targets sentiment with a custom transformer model
Usage samples
Key Point Summarization
Geospatial data analysis
Data skipping for Spark SQL
Parquet encryption
Key management by application
Key management by KMS
Time series analysis
Using the time series library
Time series key functionality
Time series functions
Time series lazy evaluation
Time reference system
SPSS predictive analytics algorithms
Data preparation
Classification and regression
Clustering
Forecasting
Survival analysis
Score
Loading and accessing data in a notebook
Loading data through generated code snippets
Accessing data in an AWS S3 bucket
Manually adding the project access token
Accessing project assets with ibm-watson-studio-lib
ibm-watson-studio-lib for Python
ibm-watson-studio-lib for R
Using Python functions to work with IBM Cloud Object Storage
Managing the notebooks and scripts lifecycle
Sharing notebooks
Hiding code in a notebook
Publishing notebooks on GitHub
Publishing notebooks as a gist
Analyzing and processing data with Spark
Manage your Spark jobs
Building a RAG solution
Terms of use
Tokens
Supported foundation models
IBM foundation models
Third-party foundation models
Supported encoder models
IBM Slate 125m v2 embedding model card
Choosing a model
Foundation model benchmarks
Foundation model lifecycle
Curating and integrating unstructured data
Supported connectors for curation of unstructured data
Setting up curation flows for unstructured data
Designing unstructured data curation flows
Document classes
Schema requirements
Managing document classes
Integrating unstructured data documents
Creating data integration flows
Working with parameters
Running a flow with Spark
Debugging a flow
Data preparation nodes
Ingest data
Extract data
Quality
Transform data
Generate output
Custom nodes
Building prompts
Prompt tips
Avoiding undesirable output
Generating accurate output
Model parameters for prompting
Filtering model content with AI guardrails
Sample prompts
Prompt Lab
Adding prompts
Saving prompts
Building reusable prompts
Chatting with documents and media files
Adding Milvus service
Connecting to Milvus service
Working with Milvus
Pause and resume Milvus service
Connecting watsonx Assistant to Milvus for custom search
Using the Milvus backup tool
Using the Vector Transport Service
Optimizing your RAG knowledge base
Retrieval service
Selecting the retrieval service model
Understanding reliability scores in Prompt Lab
Understanding online reliability scores
Integrating your RAG pipeline with AI agents
IBM watsonx.data local Model Context Protocol (MCP) server
Integrating with watsonx Orchestrate
Integrating with other agentic frameworks
IBM watsonx.data remote Model Context Protocol (MCP) server
Integrating with watsonx Orchestrate
Integrating with LangChain agentic framework
Data governance
Catalogs
Administering a catalog
Creating a catalog
Duplicate asset handling
Managing access to a catalog
Catalog collaborator roles
Changing catalog settings
Deleting a catalog
Saving searches for catalog assets
Catalog assets
Finding and viewing an asset in a catalog
Adding assets to a catalog
Adding a data file
Adding a connection
Adding data from a connection
Adding a connected folder asset from a connection
Downloading data assets
Editing asset properties
Relationships in a catalog
Asset relationships
Managing relationships in a catalog
Exploring relationships
Controlling access to an asset
Profiling an asset
Removing an asset
Categories
Predefined categories
Designing categories
Managing categories
Managing category collaborators
Category collaborator roles
Creating custom category collaborator roles
Importing and exporting categories
Business Terms
Designing business terms
Predefined business terms
Managing business terms
Authoring business terms
Classifications
Designing classifications
Predefined classifications
Data Classes
Designing data classes
Adding matching methods to data classes
Predefined data assets
Predefined data classes details
Reference Data
Designing reference data sets
Creating reference data sets with composite keys
Importing files for reference data sets
Relationships between reference data sets
Predefined reference data sets
Storing reference data sets in an external database
Policies
Designing policies
Governance rules
Designing governance rules
Data protection rules
Governance through Access Control Lists
Configuring ACL flow
Designing data protection rules
Filtering rows
Mask data
Advanced masking options
Redacting data method
Obfuscating data method
Preserve format method
Identifier masking method
Data protection rules enforcement
Managing data protection rules
Data quality SLAs
Designing data quality SLAs
Managing data quality SLAs
Data lineage
Lineage for unstructured data
Viewing data lineage
Managing data lineage graph
Configuring alias assignments
Managing IBM watsonx.data intelligence
Assigning roles and permissions for users
Custom properties, relationships, and asset types
Creating custom asset types
Creating custom properties
Creating custom relationships
Importing custom properties or relationships from a file
Managing custom properties, relationships, and asset types
Managing rule settings
Migrating data protection rules
Setting up reporting for IBM watsonx.data intelligence
Database requirements
Data model
Managing reporting
Sample reporting queries
Reporting tables
Workspaces
Asset relationships
Categories
Governance artifacts
Artifact relationships
Data quality rules
Customizations
User Profiles
Tags
Rules
Metadata imports and enrichments
Administration
Administration on IBM Cloud
Setting up the platform on IBM Cloud
Setting up watsonx.data
Managing users and access
Adding users to the account
Levels of user access roles
User roles for watsonx.data intelligence on IBM Cloud
IAM access groups
Setting up IAM access groups
Example IAM access groups
Setting up watsonx.data
Setting up watsonx.data by bringing your own license
Setting up Cloud Object Storage
Setting up watsonx.ai Studio and watsonx.ai Runtime
Setting up watsonx.data intelligence
Creating the Platform assets catalog
Managing the platform on IBM Cloud
Monitoring account resource usage
Setting up trusted profiles
Managing account settings
Managing all projects in the account
Upgrading services on the platform
Managing Cloud Object Storage resources
Removing users
Stop using services or the platform
Security on IBM Cloud
Network security
Enterprise security
Account security
Data security
Collaborator security
Security policies and responsibilities in IBM Cloud
Securing connections to services with private service endpoints
Configuring firewall access
Firewall access for the platform
Firewall access for IBM Cloud Object Storage
Firewall access for Redshift
Firewall access for Spark
Firewall access for watsonx.ai Runtime
Firewall access for watsonx.ai Studio
Firewall access for watsonx.data intelligence
Firewall access for watsonx.data integration
Firewall access for watsonx.data
Deleting watsonx.data instance
Learning about watsonx.data architecture and workload isolation
Securing your data in watsonx.data
Securing metadata in watsonx.data
Setting up virtual private endpoints
Administration on AWS
Provision watsonx.data subscription on AWS
Using the IBM SaaS Console with accounts
Getting started with the IBM SaaS Console with accounts
Granting access through service IDs and API keys from the IBM SaaS Console
Email notifications
Using the IBM SaaS Console
Getting started with the IBM SaaS Console
Granting access through service IDs and API keys
Setting up watsonx.data for GovCloud
Accessing the IBM SaaS Console
Accessing the watsonx.data instance
Configuring network endpoints
Setting up virtual private endpoints
Managing access to virtual private endpoints
Configuring Egress firewall policies
Administration on Azure
Setting up watsonx.data on Azure with a user-managed data plane
Creating a compute plane
Configuring persistent storage for logging
Troubleshooting FluentBit pods
Downloading watsonx.data system logs
Script for downloading logs
Troubleshooting log download
Egress firewall
Integrations
OpenTelemetry
Adding telemetry diagnostic tools through the user interface
Supporting dashboards
Supporting dashboard metrics for Presto (Java)
Supporting dashboard metrics for Presto (C++)
Supporting dashboard metrics for Spark and Gluten accelerated Spark
Customizing Instana dashboards to monitor engine performance
Customizing Grafana dashboards to monitor engine performance
Integrating Confluent Tableflow in watsonx.data
Querying Confluent Tableflow using Presto engine
Querying Confluent Tableflow using Spark engine
Integrating Databricks Unity Catalog
Querying Databricks Unity Catalog using Spark engine
Integrating Cloudera
Setting up Cloudera integration with Presto engine
Querying Cloudera tables using Presto engine
Integrating Snowflake Open Catalog
Querying Snowflake Open Catalog using Spark engine
Querying Snowflake Open Catalog using Presto engine
Salesforce
Data Build Tool (dbt) integration
dbt-watsonx-presto (data build tool adapter for Presto)
dbt Configuration (setting up your dbt profile)
Installing and using dbt-watsonx-presto
Installing and using dbt-watsonx-spark
dbt Configuration (setting up your dbt profile)
Connecting to IBM Knowledge Catalog (IKC)
Service to service authorization
Masking your data in watsonx.data on IBM Cloud with IBM Knowledge Catalog on software
Enabling Apache Ranger policy for resources
Adding row-level filtering policy
Adding column masking policy
Integrating with DataStage
Integrating with Data Product Hub
Data visualization in watsonx.data with BI tools
Connecting Tableau to Presto in watsonx.data
Connecting Looker to Presto in watsonx.data
Connecting Domo to Presto in watsonx.data
Connecting Qlik to Presto in watsonx.data
Connecting Power BI to Presto in watsonx.data
Connecting IBM Cognos Analytics to Presto in watsonx.data
Integrating with IBM Manta Data Lineage
Connecting to watsonx BI
Troubleshooting
IBM Cloud Status
IBM Cloud Object Storage for projects
watsonx.ai Studio
Access Control List
Re-enrichment not reflecting glossary updates in semantic automation
Updating the configuration settings for Query Optimizer
Query Optimizer internal error updating statistics
Case-sensitive search configuration with Presto (Java)
IBM Cloud IP address restriction
Why do I receive an error while integrating with Ranger?
Optimizing JDBC metadata queries for Presto (Java and C++) engines
Private endpoint not working in CBR rule
Why do I receive a certificate error while connecting to Spark labs?
Why do I receive permission denied error while connecting to Spark labs?
Why do I receive a timeout error while connecting to Spark labs?
Why is Milvus unresponsive?
Why is Milvus data missing or corrupted?
Resolving bulk insert issues in Milvus
Managing the user API key
Managing your settings on IBM Cloud
Activity Tracker Event Routing
Managing your cloud account
Auditing events for watsonx.data
Logging for watsonx.data
Monitoring Presto engine JMX metrics with Sysdig on IBM Cloud
Milvus metrics
Metering and usage experience
Default limits and quotas for Spark engine
Default instance limits for engines and services
IBM watsonx.data pricing plans
Architecture and concepts in serverless instances
Best practices
Getting connection information
Presto exposed JMX metrics
Metrics exposed by Milvus
Mixed-case behavior
Understanding your responsibilities when using watsonx.data
High availability and disaster recovery
Disaster scenarios in watsonx.data
Presto update process for watsonx.data
Configuration properties for Presto (Java) - coordinator and worker nodes
JVM properties for Presto (Java) - coordinator and worker nodes
Catalog properties for Presto (Java)
Event listener properties for Presto (Java)
Configuration properties for Presto (C++) - worker nodes
Configuration properties for Presto (C++) - coordinator nodes
Catalog properties for Presto (C++)
Velox properties for Presto (C++)
JVM properties for Presto (C++) - coordinator nodes
Global properties for Presto (C++)
LogConfig worker properties for Presto (C++)
LogConfig coordinator properties for Presto (C++)
Resource group properties
Glossary