Examples of Spark applications

Db2® Warehouse provides examples of application code that illustrate how to develop Apache Spark applications.

Load the examples into your home directory as described in Loading the sample Spark application code.

Examples for Python, Scala, and R

Each of the following examples is available in a Python version, a Scala version, and an R version:

ReadWriteExampleKMeansJson and ReadExampleJson
These examples read data from JSON files (and, in the case of ReadWriteExampleKMeansJson, write data to one) rather than from Db2 Warehouse tables. A common way to develop applications is to start with code like this, test it against JSON files, and then modify it to use database tables instead (as in ReadWriteExampleKMeans and ReadExample).
ReadWriteExampleKMeans and ReadExample
These examples are based on ReadWriteExampleKMeansJson and ReadExampleJson, but have been modified in the following ways so that they use Db2 Warehouse tables instead of JSON files:
Change #1
The master was removed, because the application is assigned to your cluster automatically when it is submitted. If you prefer to set the application name explicitly, use the setAppName method.
Change #2
The data source was replaced so that the application reads data from a Db2 Warehouse table instead of from a JSON file.
Change #3 (applies to ReadWriteExampleKMeans only)
The data sink was replaced so that the application writes data to a Db2 Warehouse table instead of to a JSON file.
ExceptionExample
This example shows how an application can throw an exception so that the corresponding error is recorded in the file $HOME/spark/log/submission_id/submission.info.
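The pattern is plain exception handling: an uncaught exception terminates the application, and the submission machinery records the error text. In this minimal sketch the failing condition and message are invented for illustration; the try/except at the bottom exists only to capture the message here, whereas a real application would let the exception propagate.

```python
def main():
    input_rows = []  # hypothetical input; empty so the failure path is taken
    if not input_rows:
        # An uncaught exception ends the application, and Db2 Warehouse
        # records the error in $HOME/spark/log/submission_id/submission.info.
        raise RuntimeError("ExceptionExample: no input rows to process")


# Demonstration only: catch the exception to show the message that would
# be recorded. A real application would simply let it propagate.
try:
    main()
    recorded_error = None
except RuntimeError as err:
    recorded_error = str(err)
```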
SqlPredicateExample
This example shows how an application can push an SQL predicate down into the database. This improves performance, because only a subset of the data needs to be fetched.

Examples for R only

The following examples are available only in an R version:

idaxTApply.R
This example shows how to write an R function that applies a user-defined function to each subset (group of rows) of a distributed data frame (ida.data.frame).
The subsets are specified by an index column.
The results of applying the function are stored in a data frame or a database table.
idaxTApplyExample.R
This example shows how to connect to the database, read tables, open a SparkR session, and call the idaxTApply function.
Note: Both examples contain performance hints regarding the configuration of SparkR and the idax data source.