spark.local.dir configuration parameter
The spark.local.dir parameter is a configuration option in Apache Spark that specifies the local directory used for temporary files and shuffle data during job execution. By default, the value is set to /tmp/spark/scratch.
Purpose and Usage
While Spark jobs are running, temporary files, shuffle data, and intermediate results are written to the local disk for efficient processing. The spark.local.dir parameter defines the directory path where Spark writes these temporary files.
Default Value
The default value of spark.local.dir is /tmp/spark/scratch. This directory is used for temporary storage on the local disk of each Spark executor node.
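If you want to confirm which directory is in effect, you can read the setting back from a running context. The following is a minimal sketch, assuming a SparkContext named sc is already available (as it is in a notebook environment):
# Minimal sketch: read the effective spark.local.dir from a running context.
# Assumes a SparkContext named `sc` already exists, for example in a notebook.
local_dir = sc.getConf().get("spark.local.dir", "/tmp/spark/scratch")
print("Temporary and shuffle files are written to:", local_dir)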
Overriding the Default Value
In the Analytics Engine service, you can override the default value of spark.local.dir when submitting your Spark applications:
Example
Application Payload:
{
    "application_details": {
        "application": "/opt/ibm/spark/examples/src/main/python/wordcount.py",
        "application_arguments": ["cos://mycos-bucket.object-storage/people.csv"],
        "conf": {
            "spark.app.name": "MyJob",
            "spark.local.dir": "/myTempSpace",
            "spark.hadoop.fs.cos.object-storage.endpoint": "s3.direct.us-south.cloud-object-storage.appdomain.cloud",
            "spark.hadoop.fs.cos.object-storage.secret.key": "xxxx",
            "spark.hadoop.fs.cos.object-storage.access.key": "xxxx"
        }
    },
    "volumes": [{
        "name": "temp-vol",
        "mount_path": "/myTempSpace",
        "source_sub_path": ""
    }]
}
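As a reference for how such a payload is used, the sketch below posts it to the Spark applications endpoint of an Analytics Engine instance. The endpoint URL and access token are placeholders, not values taken from this documentation; substitute the ones for your instance. The payload is trimmed to the parts relevant to spark.local.dir.
# Minimal sketch: submit the payload above to the Analytics Engine applications API.
# ENDPOINT and TOKEN are placeholders (assumptions), not literal values.
import json
import requests

ENDPOINT = "https://<analytics-engine-host>/<spark-applications-endpoint>"  # placeholder
TOKEN = "<access token>"  # placeholder

payload = {
    "application_details": {
        "application": "/opt/ibm/spark/examples/src/main/python/wordcount.py",
        "application_arguments": ["cos://mycos-bucket.object-storage/people.csv"],
        "conf": {
            "spark.app.name": "MyJob",
            "spark.local.dir": "/myTempSpace"
        }
    },
    "volumes": [{"name": "temp-vol", "mount_path": "/myTempSpace", "source_sub_path": ""}]
}

response = requests.post(
    ENDPOINT,
    headers={"Authorization": "Bearer " + TOKEN, "Content-Type": "application/json"},
    data=json.dumps(payload)
)
print(response.status_code, response.text)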
By overriding the default value of spark.local.dir, you can tailor the temporary file storage location to your requirements, for example by using a high-performance disk or by directing temporary files to a shared network storage location.
Remember to consider the available disk space, permissions, and accessibility of the chosen directory path when selecting a custom value for spark.local.dir.
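Before pointing spark.local.dir at a custom directory, a quick check of free space and write access can prevent jobs from failing at runtime. The sketch below uses the /myTempSpace mount path from the payload example; the directory must already exist (that is, the volume must be mounted) for the check to run.
# Minimal sketch: check free space and write access for the chosen spark.local.dir.
# "/myTempSpace" is the volume mount path from the payload example above.
import os
import shutil

path = "/myTempSpace"
usage = shutil.disk_usage(path)
print("Free space (GB): %.1f" % (usage.free / (1024 ** 3)))
print("Writable:", os.access(path, os.W_OK))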
In notebooks, you can override the Spark conf spark.local.dir while initializing the Spark context. For example:
from pyspark import SparkContext, SparkConf
# Create a SparkConf with the custom temporary directory
spark_conf = SparkConf().setAppName("DataProcessing").setMaster("local")
spark_conf.set("spark.local.dir", "/home/spark/shared/spark-events/")
# Stop the SparkContext that the notebook environment created automatically
sc.stop()
# Create a new SparkContext with the custom configuration
sc = SparkContext(conf=spark_conf)
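The call to sc.stop() is needed because spark.local.dir is read when the context starts; it cannot be changed on a context that is already running, so the context created by the notebook environment has to be stopped and recreated with the new configuration.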
Parent topic: Submitting Spark jobs via API