Analyze and process data with Spark

Data engineers can use watsonx.data as a Service to analyze and process large volumes of data quickly and efficiently with Apache Spark. You can run your workloads as Python, Scala, or Java jobs, and schedule and monitor the job runs. The generative AI capabilities help you analyze the data and generate insights from it.

Before you begin

IBM watsonx.data Premium instance: Ensure you have an IBM watsonx.data Premium instance. Use the 9 dot switcher to select the watsonx.data context.
Access to the Spark engine: Ensure you have access to a Spark engine to run Spark jobs.
Prepare your Spark application: Have a Spark application file written in Python, Scala, or Java ready. The file can reside on your local machine, in your project, or in a Cloud Object Storage or Amazon S3 connection.
Associate the engine with the catalog: If your Spark application accesses any catalogs, ensure that your engine is associated with each catalog that the application accesses.
You must have Editor privilege for the project.

Analyze and process data with Spark

To run a Spark job on your watsonx.data Spark engine, do the following:

Create a project

You need a project to begin with. If you don't have one, create one; see Creating a project. If you already have a project, you can use it.

Tip: You can create a new project or create a project from a local file. Creating a project from Git is currently not supported.

Upload the Spark application

The Spark application can reside as an asset inside your project, on your computer, or in a connected source such as a Cloud Object Storage or Amazon S3 connection.

Spark application in your Project

You can place your Spark application in your project in either of the following two ways:

Option 1: Add your Spark application to your project through the Upload data files option. To do that, see Adding data to a project.

Option 2: If your Spark application resides in a connected source, import the files to your project. To do that, see Convert files in project storage to assets.

Spark application in a connected source

To place your Spark application in a connected source such as Cloud Object Storage or Amazon S3 connection, see Manage connections.

You can choose the Spark application directly from the following supported connected sources:
- IBM Cloud Object Storage
  Unsupported configurations: Verify SSL Certificate, Trust SSL certificate, Personal Credentials, Sensitive Credentials (masking), Use Secrets from vault, Resource instance ID and API Key, Resource instance ID, API Key, Access Key and Secret Key, Service Credentials.
- Amazon S3
  Unsupported features: Server Proxy, Personal Credentials, Sensitive Credentials (masking), Use Secrets from vault, Temporary credentials, Trusted role credentials.

Spark application in your local system

You can choose and upload a local file directly in the tool to use as the Spark application.

Create a Spark job asset

a. From the Assets tab, click New asset +.

b. Search for the data engineering category and select the Analyse and process data tool.

c. In the tear sheet that opens, configure the following details to analyze your data:

Table 1. Analyse and process data
Field	Description
Name	Specify a name for the Spark job.
Description (Optional)	Enter a description for the Spark job.
Spark engine	Select a Spark engine for processing. The list shows the Spark engines (Spark and Apache Gluten accelerated Spark) that are available in your watsonx.data instance and to which you have been granted access. If you do not have any Spark engines in your watsonx.data instance, an information message opens. Click View instance to access the Infrastructure Manager page of the watsonx.data instance and create a Spark engine. Tip: If you do not have permission to create or access a Spark engine, contact your instance administrator.
Type	Select one of the following options: * Python: Select this option if your Spark application is written in Python. * Scala or Java: Select this option if your Spark application is written in Java or Scala.
Main class	If you selected Scala or Java in the Type field, you must provide the main class for the application to run.
Application executable	To run a Spark application written in Python, Java, or Scala, you must upload the application from your computer or select it from your project or from connected storage. a. Click Select. b. The Select file page opens. c. Upload the Spark application. You can either upload it from your computer or select it from your project or from connected storage. d. To upload the file from your computer, select Local system and drag and drop your file. e. To select the file from the project, click Connections tab > Data assets category and select the Spark application. The Spark application file is listed here only if you have already defined the connection in your project. To access your application from any connected source, see Adding data to a project. f. To select the file from connected storage, click Connections tab > Connection category and select the Spark application. The Spark application file is listed here only if you have already defined the connection in your project. To access your application from any connected source, see Manage connections. g. Click Save. The Spark application is added to the project.
Arguments (Optional)	Use the Add argument button to specify all arguments required by the application.
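
For reference, a Python application selected in the Application executable field might look like the following minimal sketch. It assumes that each value added with the Add argument button reaches the application as a separate command-line argument, as it would with a regular spark-submit. The --table and --limit arguments and the table they point to are hypothetical names chosen for illustration.

```python
import argparse
import sys


def parse_args(argv):
    """Parse the job arguments (hypothetical names, for illustration only)."""
    parser = argparse.ArgumentParser(description="Row-count sketch for a Spark job")
    parser.add_argument("--table", required=True,
                        help="fully qualified table name, e.g. catalog.schema.table")
    parser.add_argument("--limit", type=int, default=10,
                        help="number of rows to print")
    return parser.parse_args(argv)


def main(argv):
    args = parse_args(argv)

    # Imported here so that argument parsing can be checked without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("row-count-sketch").getOrCreate()
    try:
        # Assumes the engine is associated with the catalog that holds this table.
        df = spark.table(args.table)
        print(f"{args.table}: {df.count()} rows")
        df.show(args.limit)
    finally:
        spark.stop()


if __name__ == "__main__":
    if len(sys.argv) > 1:
        main(sys.argv[1:])
    else:
        print("usage: app.py --table <catalog.schema.table> [--limit N]")
```

With this sketch, the arguments would be added as four separate entries: --table, the table name, --limit, and the row count.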

d. Click Next. Specify the following Spark configuration:

Table 2. Spark configuration
Field	Description
Spark version	Enter the Spark version for running your application. Spark 3.4 (deprecated), 3.5, and 4.0 are available.
Spark config.	Specify the Spark configuration properties as key-value pairs (<property_name>=<property_value>) separated by new lines. For more information about the different properties, see Properties.
Spark env.	Specify the Spark environment properties as key=value pairs separated by new lines. For more information about the different properties, see Environment properties.
Hardware	Customize the resources that you require. Specify the number of CPU cores and the memory (GB) that the driver and executor require for the workload. Note: If you use a watsonx.data Lite plan instance with a serverless Spark engine to submit jobs, the Spark engine allows a maximum resource quota of 8 vCPU × 32 GB, and allows only predefined Spark driver and executor vCPU and memory combinations in the hardware configuration section. For information about the memory combinations, see T-shirt sizes for Serverless Spark.
Packages	If you have any Java dependencies, specify them in the Packages field, separated by new lines. These dependencies are added to the class path of the Spark driver and executor.
Repositories	If your Maven dependencies reside in specific repositories, specify the repository URLs in the Repositories field, separated by new lines.
Py files	If you have Python file dependencies that are in .zip, .py or .egg formats, click Select to upload the files either from Connection or Data asset. Important: You can upload .zip and .egg files only from Connection.
Jars	If you have JAR file dependencies, click Select to upload the files from either Connection or Data asset.
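
Before pasting properties into the Spark config. or Spark env. field, it can help to check locally that every line follows the <property_name>=<property_value> form. The sketch below is only a local pre-check under that assumption; it is not how the service itself parses the fields, and the function name is hypothetical.

```python
def parse_properties(text):
    """Parse newline-separated <property_name>=<property_value> lines into a dict."""
    props = {}
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue  # ignore blank lines in this local check
        # Split on the first "=" only, so values may themselves contain "=".
        key, sep, value = line.partition("=")
        key = key.strip()
        if not sep or not key:
            raise ValueError(
                f"line {lineno}: expected <property_name>=<property_value>, got {raw!r}"
            )
        props[key] = value.strip()
    return props


if __name__ == "__main__":
    spark_config = """
spark.sql.shuffle.partitions=64
spark.driver.extraJavaOptions=-Dlog.level=INFO
"""
    for name, value in parse_properties(spark_config).items():
        print(f"{name} -> {value}")
```

Splitting on the first equals sign keeps values such as JVM options intact, and a malformed line is reported with its line number.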

Note: You can specify jar and py-files dependencies of your Spark application in the dependencies section while creating or editing a Spark application job. You must select the files from your project's connections and data assets.
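
If your Python dependencies live in a local package directory, one way to produce a .zip for the Py files field is sketched below. It relies only on the Python standard library and keeps the package folder at the root of the archive so that an import of the package can resolve from the .zip. The function name and the directory and file names in the usage note are hypothetical.

```python
import zipfile
from pathlib import Path


def bundle_package(package_dir, zip_path):
    """Zip a Python package directory, keeping the package folder at the archive root."""
    package_dir = Path(package_dir).resolve()
    if not (package_dir / "__init__.py").is_file():
        raise ValueError(f"{package_dir} is not a Python package (no __init__.py)")
    root = package_dir.parent
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(package_dir.rglob("*.py")):
            # Store each file relative to the parent directory, e.g. helpers/util.py.
            archive.write(path, path.relative_to(root).as_posix())
    return zip_path
```

For example, bundle_package("helpers", "helpers.zip") would archive a local helpers package. As the Py files row states, a .zip file must then be selected from a connection rather than from a data asset.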

e. Click Next. Specify the job schedule details:

Run after job creation: Use the toggle switch to run the Spark job immediately after job creation.
Run on a schedule: Use the toggle switch to enable selecting a schedule to run the Spark job.

You can select the check box to define a schedule for the job run or to repeat the run on an hourly, daily, weekly or monthly basis.

f. Click Next. Configure the notifications:

Receive notification: Use the toggle switch to receive notifications for the job notification types that you select below.
Success: Select this check box to receive a notification when a Spark job completes successfully.
Warning: Select this check box to receive a notification on any warnings.
Failure: Select this check box to receive a notification when a Spark job fails.

g. Click Next. Preview the details and click Create or Create and run, depending on the job schedule selection that you made. An information message displays the progress of job creation. If you enabled Run after job creation, you are redirected to the Job run details page.

h. From the Job run details page, you can do the following:

View the status of the Spark job run.
Click the View more details link to view additional details of the job run.
When the Spark job is in Running state, you can access the Spark UI. Click the Go to Spark UI link to view the Spark UI.

Note: The Spark UI is available only for jobs in RUNNING status.

When the Spark job run is completed, you can view the Spark history UI with the details of the completed application. Click the Go to Spark history link to view the history UI.
When the Spark job run starts, the Spark driver logs appear on the right side of the page, which displays up to 200 lines of log information. To see the complete log information, download the log files by clicking the Download log link in the Log field.
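
Because the page shows only part of the driver log, a downloaded log file can be scanned locally for problems. The following sketch counts log levels and collects warning and error lines. It assumes log4j-style lines in which the level (ERROR, WARN, INFO, or DEBUG) appears as a separate word near the start of each entry; the actual format of your downloaded file may differ, and the function names are hypothetical.

```python
import re
from collections import Counter

# Assumption: each log entry carries its level as a standalone word.
LEVEL_RE = re.compile(r"\b(ERROR|WARN|INFO|DEBUG)\b")


def summarize_log(lines):
    """Return (level counts, list of WARN and ERROR lines) for an iterable of lines."""
    counts = Counter()
    problems = []
    for line in lines:
        match = LEVEL_RE.search(line)
        if match is None:
            continue  # continuation lines such as stack traces carry no level
        level = match.group(1)
        counts[level] += 1
        if level in ("ERROR", "WARN"):
            problems.append(line.rstrip())
    return counts, problems


def summarize_log_file(path):
    """Run summarize_log over a downloaded log file."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return summarize_log(handle)
```

For example, counts, problems = summarize_log_file("spark-driver.log") would summarize a downloaded file, where the file name is a placeholder for the log that you saved.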

Tip: You can also view the details of older job runs from the Job Details page, which is available in the Platform page.

Watch the quick video for a visual walkthrough: Structured data ingestion.