Accessing data in external data platforms

IBM® watsonx.data enables you to query data from multiple external data platforms without copying or moving the data. Queries run against the data where it already resides, so you can work across your data landscape while the data stays in its original location.

Overview

With watsonx.data, you can access and query data from external platforms through direct connections. Because the data is not replicated, this approach provides the following benefits:

Real-time access - Query the most current data without synchronization delays
Cost efficiency - Eliminate storage duplication and data transfer costs
Simplified architecture - Reduce data pipeline complexity and maintenance overhead
Unified governance - Apply consistent security and access policies across data sources

This integration method is commonly known as zero-copy data federation: queries run directly against the remote data over secure connections.

Supported external data platforms

watsonx.data supports querying data from the following external platforms:

Cloudera: Access Hive tables stored in Cloudera HDFS for enterprise data warehousing and Hadoop ecosystem integration.
Learn more: Integrating Cloudera in watsonx.data
Databricks Unity Catalog: Access Delta Lake and Iceberg tables stored in Databricks Unity Catalog for multi-cloud data analytics and unified data governance.
Learn more: Integrating Databricks Unity Catalog in watsonx.data
Snowflake Open Catalog: Access Apache Iceberg tables managed by Snowflake Open Catalog for cloud-native data analytics and cross-platform data access.
Learn more: Integrating Snowflake Open Catalog in watsonx.data
Confluent Tableflow: Access streaming data tables managed by Confluent Tableflow for real-time analytics and event-driven architectures.
Learn more: Integrating Confluent Tableflow in watsonx.data
Salesforce: Connect to Salesforce data through the Arrow Flight service for CRM analytics and customer data integration.
Learn more: Salesforce

How it works

Access to external data in watsonx.data relies on the following components:

External data platform - The remote system where data resides
Metadata layer - Catalog or metastore that provides table definitions and schema information
watsonx.data engines - Presto or Spark engines that execute queries
Storage layer - External storage systems that hold the data files
Authentication layer - Security mechanisms for secure access
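
One way to keep these five components straight is to record them together for each external source. The following is a minimal sketch, not a watsonx.data API: the ExternalSource class, its field names, and the sample values are all hypothetical, and the only rule it enforces is that the engine is one of the two engines named on this page.

```python
from dataclasses import dataclass

# The two engine types named in this topic.
SUPPORTED_ENGINES = {"presto", "spark"}


@dataclass(frozen=True)
class ExternalSource:
    """Hypothetical record of the components involved in one external connection."""

    platform: str      # external data platform where the data resides
    catalog: str       # metadata layer: catalog or metastore name
    engine: str        # watsonx.data engine type: "presto" or "spark"
    storage: str       # storage layer: location of the data files
    auth_method: str   # authentication layer: how access is secured

    def __post_init__(self) -> None:
        # Fail early if the engine is not one of the supported engine types.
        if self.engine not in SUPPORTED_ENGINES:
            raise ValueError(f"unsupported engine: {self.engine!r}")


# Sample values are placeholders for illustration only.
source = ExternalSource(
    platform="Databricks Unity Catalog",
    catalog="example_catalog",
    engine="presto",
    storage="s3://example-bucket/warehouse",
    auth_method="token",
)
print(source.catalog, source.engine)
```

A record like this is only a planning aid; the actual components are created in watsonx.data as described in the platform-specific integration guides.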

Query engines

watsonx.data supports querying external data through two query engines:

Presto engine - Optimized for interactive analytics with SQL-based querying
Spark engine - Optimized for batch processing and complex transformations with PySpark and Scala support

For details on which platforms support which engines, see the platform-specific integration guides.
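
As a rough illustration of the SQL-based path, Presto addresses tables with three-part names in the form catalog.schema.table, so a single statement can reference tables from more than one catalog. The sketch below only builds such a statement as a string. The catalog, schema, table, and column names are hypothetical placeholders, and whether two particular catalogs can be queried together depends on the engine associations described in the platform guides.

```python
def qualified_name(catalog: str, schema: str, table: str) -> str:
    """Return a double-quoted three-part table name: "catalog"."schema"."table"."""
    parts = (catalog, schema, table)
    for part in parts:
        # Reject empty parts and embedded quotes rather than trying to escape them.
        if not part or '"' in part:
            raise ValueError(f"invalid identifier part: {part!r}")
    return ".".join(f'"{part}"' for part in parts)


def build_cross_catalog_join(left: str, right: str, key: str) -> str:
    """Build a SELECT that joins two fully qualified tables on a shared key column."""
    return (
        f"SELECT l.*, r.* FROM {left} AS l "
        f'JOIN {right} AS r ON l."{key}" = r."{key}"'
    )


# Hypothetical names for illustration only.
orders = qualified_name("catalog_a", "sales", "orders")
accounts = qualified_name("catalog_b", "crm", "accounts")
query = build_cross_catalog_join(orders, accounts, "account_id")
print(query)
```

Quoting each part of the name keeps identifiers that contain uppercase letters or special characters intact; the sketch rejects embedded quotes outright instead of escaping them.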

Getting started

To access data from external platforms:

1. Identify data platforms - Determine which external platforms you need to access
2. Review prerequisites - Ensure you meet the requirements for each platform
3. Gather credentials - Obtain necessary authentication credentials and connection details
4. Configure connections - Set up storage components and catalogs in watsonx.data
5. Associate engines - Connect catalogs to appropriate query engines
6. Test queries - Validate connectivity and query functionality

For detailed setup instructions, see the integration guide for your specific platform.
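
The "Test queries" step can be as simple as reading one row from a table in the newly configured catalog. The following is a minimal sketch under stated assumptions: it expects a cursor object that follows the Python DB-API 2.0 convention (execute and fetchall), obtained from whichever client library you use to connect to your engine. The smoke_test function and the names passed to it are hypothetical and are not part of watsonx.data.

```python
from typing import Any, Protocol, Sequence


class Cursor(Protocol):
    """The subset of a DB-API 2.0 cursor that the smoke test relies on."""

    def execute(self, operation: str) -> Any: ...

    def fetchall(self) -> Sequence[Any]: ...


def smoke_test(cursor: Cursor, catalog: str, schema: str, table: str) -> int:
    """Read at most one row from the table and return how many rows came back.

    A result of 0 or 1 means the catalog, metadata, storage, and credentials
    all resolved; an exception from the cursor points to a configuration problem.
    """
    for part in (catalog, schema, table):
        # Keep the sketch simple: refuse names that would need escaping.
        if not part or '"' in part:
            raise ValueError(f"invalid identifier part: {part!r}")
    target = f'"{catalog}"."{schema}"."{table}"'
    cursor.execute(f"SELECT * FROM {target} LIMIT 1")
    return len(cursor.fetchall())
```

Limiting the query to one row keeps the check cheap even on large external tables, while still exercising the full path from engine to remote storage.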