Introduction to watsonx.data Spark

This section provides an overview of watsonx.data Spark engines, explains how to get started with your Spark application, and outlines the complete workflow for developing and managing Spark applications.

IBM watsonx.data Spark is a data processing solution built on Apache Spark. It draws on the speed, flexibility, and efficiency of Apache Spark to handle large datasets, and it scales by distributing processing workloads across large clusters with built-in parallelism and fault tolerance.

You can use the IBM watsonx.data Spark engine for the following use cases:

Ingesting large volumes of data into watsonx.data tables.
Cleansing and transforming data before ingestion.
Running table maintenance operations to enhance performance.
Running complex analytics workloads that are difficult to represent as queries.
Developing, running, and debugging applications written in Python and Scala.
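
As a rough illustration of the ingestion and cleansing use cases above, the following PySpark sketch reads raw files, removes duplicate and incomplete rows, and appends the result to a table. The function name, the source path, the table identifier, and the key columns are hypothetical. The sketch assumes that the source data is in Parquet format and that the target table already exists in a catalog that the Spark engine is configured to use.

```python
def cleanse_and_ingest(spark, source_uri, target_table, key_columns):
    """Read raw Parquet data, cleanse it, and append it to an existing table.

    `spark` is an active SparkSession. All other arguments are
    illustrative placeholders, not values defined by watsonx.data.
    """
    # Read the raw files from the storage location.
    raw = spark.read.parquet(source_uri)

    # Cleanse: drop rows that repeat the same key, then rows with a missing key.
    cleaned = raw.dropDuplicates(key_columns).na.drop(subset=key_columns)

    # Ingest: append the cleansed rows to the target table (DataFrameWriterV2).
    cleaned.writeTo(target_table).append()
    return cleaned
```

In a Spark application, this might be called as cleanse_and_ingest(spark, "s3a://example-bucket/raw/orders/", "example_catalog.sales.orders", ["order_id"]), where every argument is a placeholder to replace with your own values.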

Spark workflow diagram showing the data processing flow in IBM watsonx.data Spark

Types of Spark engines

IBM watsonx.data supports the following types of native Spark engines that reside within watsonx.data:

Spark engine: A data processing engine that runs Spark applications involving complex analytical operations. For more information, see spk_overview.html.
Apache Gluten accelerated Spark engine: A performance-optimized data processing engine for Spark applications. It uses Apache Gluten, which relies on Velox, a generic C++ database acceleration library, to optimize queries. This engine can speed up and simplify processing when you work with very large datasets. For more information, see glutn_overview.html.

Access policies

To learn about the supported access policies, see Access management and governance in watsonx.data.

Supported storage

The following storage types are supported:

AWS S3
Spark engines can access Amazon S3 storage by using IAM roles. For more information, see Amazon S3.
IBM Cloud Object Storage (IBM COS)
Azure Blob File System
Google Cloud Storage (GCS)
Hadoop Distributed File System (HDFS)
Azure Data Lake Storage (ADLS Gen2)

The following table formats are supported:

Hive
Hudi
Delta
Iceberg
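
Table maintenance depends on the table format. As a sketch for Iceberg tables only, the helper below builds Spark SQL CALL statements for two stored procedures that Apache Iceberg provides: rewrite_data_files, which compacts small data files, and expire_snapshots, which removes old snapshots. The function names and the catalog and table identifiers are hypothetical. The sketch assumes that the Iceberg SQL extensions are enabled in the Spark session. Whether a given watsonx.data Spark engine exposes these procedures is also an assumption to verify.

```python
def iceberg_maintenance_statements(catalog, table, older_than):
    """Build Spark SQL CALL statements for two Apache Iceberg procedures.

    catalog    -- name of the Iceberg catalog (hypothetical placeholder)
    table      -- table identifier inside the catalog, for example "sales.orders"
    older_than -- timestamp string such as "2024-01-01 00:00:00"; snapshots
                  older than this are expired
    """
    return [
        # Compact small data files into larger ones.
        f"CALL {catalog}.system.rewrite_data_files(table => '{table}')",
        # Remove snapshots older than the given timestamp.
        f"CALL {catalog}.system.expire_snapshots("
        f"table => '{table}', older_than => TIMESTAMP '{older_than}')",
    ]


def run_maintenance(spark, catalog, table, older_than):
    """Run each maintenance statement on an active SparkSession."""
    for statement in iceberg_maintenance_statements(catalog, table, older_than):
        spark.sql(statement)
```

Building the statements separately from running them makes it possible to review or log the SQL before it runs on a production table.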

Spark labs - Development experience

Spark labs is a Visual Studio Code based development environment in which you can interactively program, debug, submit, and test Spark applications on a Spark cluster running on the Spark engine. For more information, see Development Environments – Spark lab experience.

Remote dataplane support with watsonx.data Spark

watsonx.data supports the remote dataplane feature, which allows watsonx.data Spark engines to be installed and run at remote physical locations through hub cluster configuration and remote operator installation. For more information, see Running Spark application in a remote dataplane.

Note: On Power (ppc64le) clusters, RDP-based dataplanes are not supported.

Jupyter Notebook

A Jupyter notebook is a web-based environment for interactive computing. You can use notebooks to run small pieces of code that process your data, and you can immediately view the results of your computation. Notebooks include all of the building blocks that you need to work with data: the data, the code computations that process the data, the visualizations of the results, and text and rich media to enhance understanding. You can work with Jupyter Notebooks from Spark labs. watsonx.data also integrates with watsonx.ai to provide a web-based working experience with Jupyter Notebook. For more information, see Development Environments – Spark lab experience and Working with watsonx.ai Jupyter Notebook.