Accessing data in external data platforms

IBM® watsonx.data enables you to query data from multiple external data platforms without copying or moving the data. The data remains in its original location while you query it from watsonx.data.

Overview

With watsonx.data, you can query data on external platforms through direct connections. Because the data is not replicated, this approach provides:

  • Real-time access - Query the most current data without synchronization delays
  • Cost efficiency - Eliminate storage duplication and data transfer costs
  • Simplified architecture - Reduce data pipeline complexity and maintenance overhead
  • Unified governance - Apply consistent security and access policies across data sources

This integration method is commonly known as zero-copy data federation: queries run directly against the remote data over secure connections.

Supported external data platforms

watsonx.data supports querying data from the following external platforms:

Cloudera
Access Hive tables stored in Cloudera HDFS for enterprise data warehousing and Hadoop ecosystem integration.

Learn more: Integrating Cloudera in watsonx.data

Databricks Unity Catalog
Access Delta Lake and Apache Iceberg tables registered in Databricks Unity Catalog for multi-cloud data analytics and unified data governance.

Learn more: Integrating Databricks Unity Catalog in watsonx.data

Snowflake Open Catalog
Access Apache Iceberg tables managed by Snowflake Open Catalog for cloud-native data analytics and cross-platform data access.

Learn more: Integrating Snowflake Open Catalog in watsonx.data

Confluent Tableflow
Access streaming data tables managed by Confluent Tableflow for real-time analytics and event-driven architectures.

Learn more: Integrating Confluent Tableflow in watsonx.data

Salesforce
Connect to Salesforce data through the Arrow Flight service for CRM analytics and customer data integration.

Learn more: Salesforce

How it works

Access to external data in watsonx.data involves the following components:

  1. External data platform - The remote system where data resides
  2. Metadata layer - Catalog or metastore that provides table definitions and schema information
  3. watsonx.data engines - Presto or Spark engines that execute queries
  4. Storage layer - External storage systems where data files are stored
  5. Authentication layer - Security mechanisms for secure access
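
As a rough illustration of how these five components fit together, the following Python sketch models one external connection as a plain data structure and reports which components are still unset. The class, field, and value names are hypothetical and are not part of any watsonx.data API.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class ExternalConnection:
    """Hypothetical checklist of the five components listed above."""
    platform: Optional[str] = None          # 1. external data platform
    metadata_catalog: Optional[str] = None  # 2. catalog or metastore
    engine: Optional[str] = None            # 3. Presto or Spark
    storage: Optional[str] = None           # 4. external storage location
    auth_method: Optional[str] = None       # 5. authentication mechanism


def missing_components(conn: ExternalConnection) -> List[str]:
    """Return the names of components that have not been filled in yet."""
    return [f.name for f in fields(conn) if getattr(conn, f.name) is None]


# Example values are placeholders, not real endpoints or credentials.
conn = ExternalConnection(
    platform="Snowflake Open Catalog",
    metadata_catalog="example_catalog",
    engine="presto",
)
print(missing_components(conn))
```

A checklist like this can help confirm that every component is accounted for before you start the platform-specific setup.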

Query engines

watsonx.data supports querying external data through two query engines:

  • Presto engine - Optimized for interactive analytics with SQL-based querying
  • Spark engine - Optimized for batch processing and complex transformations with PySpark and Scala support

Engine support varies by platform. For details on which engines you can use with each platform, see the platform-specific integration guides.
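
In Presto, a table is addressed as catalog.schema.table, so one SQL statement can reference tables from more than one catalog. The Python sketch below builds such names and a cross-catalog join as plain strings. The catalog, schema, table, and column names are invented for illustration, and the helper functions are not part of watsonx.data.

```python
def quote_ident(name: str) -> str:
    """Double-quote an SQL identifier, doubling any embedded quotes."""
    return '"' + name.replace('"', '""') + '"'


def qualified_table(catalog: str, schema: str, table: str) -> str:
    """Build a catalog.schema.table reference with quoted parts."""
    return ".".join(quote_ident(part) for part in (catalog, schema, table))


# Hypothetical catalogs, each backed by a different external platform.
orders = qualified_table("cloudera_hive", "sales", "orders")
accounts = qualified_table("open_catalog", "crm", "accounts")

# The query text only; submitting it to an engine is out of scope here.
query = (
    "SELECT a.account_name, SUM(o.amount) AS total\n"
    f"FROM {orders} AS o\n"
    f"JOIN {accounts} AS a ON o.account_id = a.account_id\n"
    "GROUP BY a.account_name"
)
print(query)
```

Whether a particular combination of catalogs can be joined in this way depends on the engine and platform support described in the integration guides.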

Getting started

To access data from external platforms:

  1. Identify data platforms - Determine which external platforms you need to access
  2. Review prerequisites - Ensure you meet the requirements for each platform
  3. Gather credentials - Obtain necessary authentication credentials and connection details
  4. Configure connections - Set up storage components and catalogs in watsonx.data
  5. Associate engines - Connect catalogs to appropriate query engines
  6. Test queries - Validate connectivity and query functionality

For detailed setup instructions, see the integration guide for your specific platform.
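
The final step, testing queries, can be sketched as a small smoke test. The Python example below assumes you already have some way to submit SQL to an engine and represents it as a caller-supplied function, run_query, which is hypothetical. The stand-in executor and the catalog names exist only so that the sketch runs on its own.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Row = Tuple
RunQuery = Callable[[str], Sequence[Row]]


def smoke_test(run_query: RunQuery, catalog: str) -> Dict[str, bool]:
    """Run two lightweight checks and report which ones succeeded."""
    checks = {
        "engine_reachable": "SELECT 1",
        "catalog_visible": f"SHOW SCHEMAS FROM {catalog}",
    }
    results: Dict[str, bool] = {}
    for name, sql in checks.items():
        try:
            run_query(sql)
            results[name] = True
        except Exception:
            # Any failure counts as a failed check in this sketch.
            results[name] = False
    return results


def fake_run_query(sql: str) -> List[Row]:
    """Stand-in executor: knows one catalog and fails for any other."""
    if sql == "SELECT 1":
        return [(1,)]
    if sql == "SHOW SCHEMAS FROM example_catalog":
        return [("default",)]
    raise RuntimeError(f"unknown statement: {sql}")


print(smoke_test(fake_run_query, "example_catalog"))
print(smoke_test(fake_run_query, "missing_catalog"))
```

In practice you would replace fake_run_query with whatever client you use to reach the engine, and follow the checks with a small SELECT against a known table.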