Provisioning watsonx.data Spark engine

You can provision the following native Spark engines in watsonx.data or on remote physical locations.

Spark engine
Apache Gluten accelerated Spark engine

Required permissions: To provision a Spark engine, you must have the Admin role.

Applies to:

Spark engine

Apache Gluten accelerated Spark engine

Before you begin

If you plan to use a remote physical location to host watsonx.data Spark, ensure that the required remote physical location and dataplane are set up in IBM Software Hub. See Setting up a remote physical location for IBM Software Hub.

About this task

To provision a native Spark engine, complete the following steps.

Procedure

Log in to the watsonx.data console.
From the navigation menu, select Infrastructure Manager.
To provision an engine, click Add component and select IBM Spark.
Click Next.
In the Add component - IBM Spark window, choose the engine type from the Type list. You can select Spark or Apache Gluten accelerated Spark. For more information about the two engines, see Spark overview.
Enter the Display name for your Spark engine.

Select the Data plane for your Spark engine if you plan to use the remote physical location to host watsonx.data Spark.

Creating a Spark engine on a remote dataplane by using the API

When using a remote dataplane, Spark engines are created using the watsonx.data REST APIs. This provides greater flexibility and control when configuring engines for remote execution.

Note: RDP-based dataplanes are not supported on Power (ppc64le) clusters. When you add a Gluten or native Spark engine on such a cluster, the UI still shows the Dataplane dropdown, but it is disabled when no dataplane is configured.

Use the following sample payloads for creating a Spark engine:

Sample Engine creation payload with new volume:


curl -k -X POST "https://<cpd_host_route>/lakehouse/api/v3/spark_engines" \
  -H "Accept: application/json" \
  -H "Content-Type: application/json" \
  -H "AuthInstanceId: 1770216873926301" \
  -H "Authorization: Bearer $TOKEN" \
  -d '{
    "description": "sample engine new feb 5th",
    "display_name": "rdp35engine",
    "dataplane_name": "dataplane11",
    "type": "spark",
    "origin": "native",
    "configuration": {
      "default_version": "3.5",
      "engine_home": {
        "volume_name": "rdp35newvol",
        "volume_storage_class": "managed-nfs-shared-storage",
        "volume_storage_size": "5Gi"
      }
    }
  }'

Sample Engine creation payload with existing volume (the volume must already exist in the dataplane):

curl -k -X POST "https://<cpd_host_route>/lakehouse/api/v3/spark_engines" \
  -H "Accept: application/json" \
  -H "Content-Type: application/json" \
  -H "AuthInstanceId: 1770216873926301" \
  -H "Authorization: Bearer $TOKEN" \
  -d '{
    "description": "sample engine",
    "display_name": "rdpexitvol35",
    "dataplane_name": "dataplane11",
    "type": "spark",
    "origin": "native",
    "configuration": {
      "default_version": "3.5",
      "engine_home": {
        "volume_id": "1770241329230613"
      }
    }
  }'

Sample Engine creation payload with registered object storage:

curl -k --request POST \
  --url "https://<cpd_host_route>/lakehouse/api/v3/spark_engines" \
  --header 'Accept: application/json' \
  --header 'AuthInstanceId: 1770216873926301' \
  --header "Authorization: Bearer $TOKEN" \
  --header 'Content-Type: application/json' \
  --data '{
    "description": "sample engine",
    "display_name": "cosrdpengine",
    "dataplane_name": "dataplane11",
    "type": "spark",
    "origin": "native",
    "configuration": {
      "default_version": "3.5",
      "engine_home": {
        "storage_name": "rdptestbucket"
      }
    }
  }'
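
The three sample payloads differ only in the engine_home object. The following Python sketch shows one way to build them from a shared template and to prepare the same POST request without sending it. It uses only the standard library. The helper names (build_engine_payload, new_volume_home, existing_volume_home, object_storage_home, build_request) and the host cpd.example.com are hypothetical and used for illustration only. The field names, endpoint path, and headers are copied from the curl samples above.

```python
import json
import urllib.request

# Endpoint path taken from the curl samples above.
API_PATH = "/lakehouse/api/v3/spark_engines"


def build_engine_payload(display_name, dataplane_name, engine_home,
                         description="", default_version="3.5"):
    """Return the request body shared by all three sample payloads."""
    return {
        "description": description,
        "display_name": display_name,
        "dataplane_name": dataplane_name,
        "type": "spark",
        "origin": "native",
        "configuration": {
            "default_version": default_version,
            "engine_home": engine_home,
        },
    }


def new_volume_home(name, storage_class, size):
    """engine_home for a volume that is created together with the engine."""
    return {
        "volume_name": name,
        "volume_storage_class": storage_class,
        "volume_storage_size": size,
    }


def existing_volume_home(volume_id):
    """engine_home for a volume that already exists in the dataplane."""
    return {"volume_id": volume_id}


def object_storage_home(storage_name):
    """engine_home for an already registered object storage bucket."""
    return {"storage_name": storage_name}


def build_request(host, token, instance_id, payload):
    """Prepare the POST request with the headers from the curl samples (not sent)."""
    return urllib.request.Request(
        url=f"https://{host}{API_PATH}",
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={
            "Accept": "application/json",
            "Content-Type": "application/json",
            "AuthInstanceId": instance_id,
            "Authorization": f"Bearer {token}",
        },
    )


# Values below mirror the first sample; the host name and token are placeholders.
payload = build_engine_payload(
    display_name="rdp35engine",
    dataplane_name="dataplane11",
    engine_home=new_volume_home("rdp35newvol", "managed-nfs-shared-storage", "5Gi"),
    description="sample engine",
)
request = build_request("cpd.example.com", "TOKEN", "1770216873926301", payload)
print(json.dumps(payload, indent=2))
```

To submit the request, pass the prepared object to urllib.request.urlopen. Unlike curl -k, urlopen verifies the server certificate by default, so a cluster with a self-signed certificate needs its CA to be trusted first.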

Select Create a native Spark engine, and do the following:
1. Specify the storage that serves as the Engine home, which stores the Spark events and logs that are generated while running Spark applications. You can select an existing storage volume, create a new storage volume, or select a registered object storage. Choose one of the following options:
  Note: To store Spark applications, create a separate storage volume. To create a storage volume, see Creating a storage volume.
  - Option1: Select an existing volume. To do that, specify the following fields:
    - Existing volume: Select the option to associate a storage volume that is already available in the cluster. To create a storage volume, see Creating a storage volume.
    - Select volume: To use an existing volume, select the storage volume from the list.
  - Option2: Create a new storage volume. To do that, specify the following fields:
    - New Volume: Select the option to create a new storage volume and use it.
    - Volume name: Enter a name for the new storage volume.
    - Storage Class: Select the class to which the storage volume belongs.
    - Size of the new storage volume: Slide to select the volume size in GB. You can select values between 5 GB and 1024 GB.
    Restriction: Use storage classes that provision file storage rather than block storage. If you try to use a storage class that provisions block storage, you might encounter an error when you try to create storage volumes.
    
    Note: You must have a user role with the Create service instances permission in IBM Software Hub to create storage volumes. If you do not have the permission, the Administrator must create a storage volume and grant you write access to it. To create a storage volume and grant access, see Creating a storage volume.
  - Option3: Registered Object storage. You can select an already registered IBM COS, Amazon S3, or ADLS Gen2 storage. For information about registering IBM COS and Amazon S3, see IBM Cloud Object Storage and Amazon S3. Select the registered storage from the Engine home bucket list.
2. Select the Spark runtime version to use for processing the applications.
3. Select the catalogs to associate with the engine from the Associated catalogs (optional) field.
Important: If you plan to use a remote physical location, you can create Spark engines through the API by using either an Engine creation payload with a new volume or an Engine creation payload with an existing volume. For the existing volume payload, the volume must already exist in the dataplane.
Click Create. An acknowledgment message is displayed.

Related API: For information about the related APIs, see:

v2 API: Create Spark engine.

v3 API: Create Spark engine.