GPU RDD sample and API examples

This sample Spark application uses the Python RDD API to create a new RDD whose tasks run on GPU slots in your cluster. When GPU scheduling is enabled, GPU resources in your cluster are assigned to applications that can use them. Equivalent examples for the Scala and R APIs are also provided.

Python word count GPU RDD sample

This sample is a Spark application for use with GPU resources. Copy this sample and save it as wordcount_gpu.py.

This sample uses the newStage parameter to control whether the GPU RDD is created in a new stage.

When newStage=true:
  1. A Shuffled RDD is created with the previous RDD as parent,
  2. A GPU RDD is created with the Shuffled RDD as parent, and
  3. GPU tasks run in a new stage.
Note: It is recommended that you use this sample when your cluster is installed on a shared file system, such as IBM Spectrum Scale.
Take note of the following considerations:
  • Spark generates tasks by stage. If a stage contains at least one GPU RDD, all tasks in that stage run on GPU hosts. If your cluster has fewer GPU slots than CPU slots, and a stage contains many CPU RDDs and only a few GPU RDDs, use gpu(newStage=True) so that the GPU work runs in its own stage (a minimal sketch follows this list). Otherwise, all tasks in the stage must run on GPU slots and remain in Pending state while they wait for GPU slots to become available.
  • CPU tasks and GPU tasks run on different slots, so CPU and GPU work is processed in parallel. In this sample, stage0 and stage1 run concurrently.
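
For contrast with wordcount_gpu.py below, which calls gpu(False), the following is a minimal sketch of the newStage=True form. The application name and the input and output paths are illustrative only.

from operator import add
from pyspark import SparkContext

sc = SparkContext(appName="PythonWordCount_gpu_newstage")
tokens = sc.textFile("/path/to/input").flatMap(lambda x: x.split(' '))

# newStage=True inserts a Shuffled RDD ahead of the GPU RDD, so the map and
# reduce below run as GPU tasks in their own stage instead of extending the
# CPU stage that produced tokens.
counts = tokens.gpu(True).map(lambda x: (x, 1)).reduceByKey(add)
counts.saveAsTextFile("/path/to/output")

sc.stop()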

wordcount_gpu.py:

from __future__ import print_function
import sys
from operator import add
from pyspark import SparkContext


if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: wordcount_gpu.py <infile> <outfile>", infile=sys.stderr, outfile=sys.stderr)
        exit(-1)
    sc = SparkContext(appName="PythonWordCount_gpu")
    lines = sc.textFile(sys.argv[1])
    tokens = lines.flatMap(lambda x: x.split(' '))
    # split the tokens into two parts: one processed on CPU slots, the other on GPU slots
    cpu_tokens, gpu_tokens = tokens.randomSplit([2, 3], 17)
    cpu_counts = cpu_tokens.map(lambda x: (x, 1)).reduceByKey(add)
    gpu_counts = gpu_tokens.gpu(False).map(lambda x: (x, 1)).reduceByKey(add)

    # union the CPU and GPU results, then combine the counts and sort by key
    counts = cpu_counts.union(gpu_counts).reduceByKey(add).sortByKey(True, 1)
    counts.saveAsTextFile(sys.argv[2])

    sc.stop()
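
To run the sample, submit it to your cluster with spark-submit, passing the input and output paths as arguments. The exact submission options for a GPU-enabled cluster depend on your configuration; the paths shown here are placeholders.

spark-submit wordcount_gpu.py /path/to/input /path/to/output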

Python GPU RDD API example

gpu(newStage=True)
Returns a new Resilient Distributed Dataset (RDD) for use with GPU resources. Tasks associated with this RDD run on GPU slots in your cluster. 
Use parameter newStage to specify whether the GPU RDD must be created in a new stage.
newStage=True:
· Create a Shuffled RDD with the previous RDD as parent.
· Create a GPU RDD with the Shuffled RDD as parent.
· GPU tasks run in a new stage.
newStage=False:
· Create a GPU RDD with the previous RDD as parent.
Returns: A GPU RDD.
>>> gpu_tokens.gpu(False).map(lambda x: (x, 1)).reduceByKey(add)
......
>>> gpu_tokens.gpu(True).map(lambda x: (x, 1)).reduceByKey(add)
......
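
To see the difference between the two forms, you can print the RDD lineage with the standard toDebugString() method: with newStage=True, the lineage contains the extra Shuffled RDD that marks the new stage. This is a minimal sketch; the data and variable names are illustrative, and toDebugString() may return bytes rather than str in some PySpark versions.

from operator import add
from pyspark import SparkContext

sc = SparkContext(appName="gpu_rdd_lineage_check")
tokens = sc.parallelize(["a b", "b c"]).flatMap(lambda x: x.split(' '))

# newStage=False: the GPU RDD extends the current stage.
same_stage = tokens.gpu(False).map(lambda x: (x, 1)).reduceByKey(add)

# newStage=True: a Shuffled RDD is inserted, so the GPU work starts a new stage.
new_stage = tokens.gpu(True).map(lambda x: (x, 1)).reduceByKey(add)

# The lineage for new_stage shows the additional shuffle dependency.
print(same_stage.toDebugString())
print(new_stage.toDebugString())

sc.stop()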

Scala GPU RDD API example

gpu(newStage: Boolean = true): RDD[T]
Returns a new Resilient Distributed Dataset (RDD) for use with GPU resources. Tasks associated with this RDD run on GPU slots in your cluster. 
Use parameter newStage to specify whether the GPU RDD must be created in a new stage.
newStage=True:
· Create a Shuffled RDD with the previous RDD as parent.
· Create a GPU RDD with the Shuffled RDD as parent.
· GPU tasks run in a new stage.
newStage=False:
· Create a GPU RDD with the previous RDD as parent.
Returns: A GPU RDD.
gpu_tokens.gpu(false).map( x => (x, 1) ).reduceByKey(_ + _)
......
gpu_tokens.gpu(true).map( x => (x, 1) ).reduceByKey(_ + _)
......

R GPU RDD API example

gpu(newStage=TRUE)
Returns a new Resilient Distributed Dataset (RDD) for use with GPU resources. Tasks associated with this RDD run on GPU slots in your cluster.
Use parameter newStage to specify whether the GPU RDD must be created in a new stage.
newStage=True:
· Create a Shuffled RDD with the previous RDD as parent.
· Create a GPU RDD with the Shuffled RDD as parent.
· GPU tasks run in a new stage.
newStage=False:
· Create a GPU RDD with the previous RDD as parent.
Returns: A GPU RDD.
> gpu_token_trans <- SparkR:::gpu(gpu_tokens, TRUE)
......
> gpu_token_trans <- SparkR:::gpu(gpu_tokens, FALSE)
......