Introduction to watsonx.data Spark
This section provides an overview of watsonx.data Spark engines, explains how to get started with your Spark application, and outlines the complete workflow for developing and managing Spark applications.
IBM watsonx.data Spark is a comprehensive solution for efficient data processing tasks. It builds on Apache Spark, which provides the speed, flexibility, and efficiency needed to handle large datasets. It scales by distributing processing workflows across large clusters, with built-in parallelism and fault tolerance.
The IBM watsonx.data Spark engine can be used for the following use cases:
- Ingesting large volumes of data into watsonx.data tables.
- Cleansing and transforming data before ingestion.
- Running table maintenance operations to enhance performance.
- Running complex analytics workloads that are difficult to represent as queries.
- Developing, running, and debugging applications written in Python and Scala.

Types of Spark engines
- Spark engine: A powerful data processing engine capable of running Spark applications that involve complex analytical operations. For more information, see spk_overview.html.
- Apache Gluten accelerated Spark engine: A performance-optimized data processing engine capable of running Spark applications. It uses Apache Gluten, which relies on Velox, a generic database acceleration library written in C++, to optimize queries. It is an effective way to speed up and simplify processing when you work with very large data sets. For more information, see glutn_overview.html.
Access policies
To learn about the supported access policies, see Access management and governance in watsonx.data.
Supported Storage
- AWS S3
Spark engines can access Amazon S3 storage by using IAM roles. For more information, see Amazon S3.
- IBM Cloud Object Storage (IBM COS)
- Azure Blob File System
- Google Cloud Storage (GCS)
- Hadoop Distributed File System (HDFS)
- Azure Data Lake Storage (ADLS Gen2)
- Hive
- Hudi
- Delta
- Iceberg
Spark labs - Development experience
Spark labs is a Visual Studio Code based development environment in which you can interactively program, debug, submit, and test Spark applications on a Spark cluster running on the Spark engine. For more information, see Development Environments – Spark lab experience.
Remote Dataplane Support with WXD Spark
Jupyter Notebook
A Jupyter notebook is a web-based environment for interactive computing. You can use notebooks to run small pieces of code that process your data, and you can immediately view the results of your computation. Notebooks include all of the building blocks that you need to work with data: the data itself, the code that processes it, visualizations of the results, and text and rich media to enhance understanding. You can work with Jupyter Notebooks from Spark labs. watsonx.data Spark also integrates with watsonx.ai to provide a web-based working experience with Jupyter Notebook. For more information, see Development Environments – Spark lab experience and Working with watsonx.ai Jupyter Notebook.