spark.local.dir configuration parameter
Applies to:
Spark engine
Apache Gluten accelerated Spark engine
The spark.local.dir parameter is a configuration option in Apache Spark
that specifies the local directory used for temporary file storage and shuffle data at run time.
While a Spark application runs, temporary files, shuffle data, and intermediate data are written
to the local disk for efficient processing. The spark.local.dir parameter defines the directory
path where Spark writes these files.
The default value of spark.local.dir is /tmp/spark/scratch. This directory
is used for temporary storage on the local disk of each Spark executor node.
You can override the default
value of spark.local.dir when submitting your Spark
applications:
Example
Application Payload:
{
  "application_details": {
    "application": "/opt/ibm/spark/examples/src/main/python/wordcount.py",
    "application_arguments": ["cos://mycos-bucket.object-storage/people.csv"],
    "conf": {
      "spark.app.name": "MyJob",
      "spark.local.dir": "/myTempSpace",
      "spark.hadoop.fs.cos.object-storage.endpoint": "s3.direct.us-south.cloud-object-storage.appdomain.cloud",
      "spark.hadoop.fs.cos.object-storage.secret.key": "xxxx",
      "spark.hadoop.fs.cos.object-storage.access.key": "xxxx"
    }
  },
  "volumes": [{
    "name": "temp-vol",
    "mount_path": "/myTempSpace",
    "source_sub_path": ""
  }]
}
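In the example payload, the value of spark.local.dir is the same path as the mount_path of the volume. Assuming that this pairing is intended, the following Python sketch checks a payload for it before submission. The function name local_dir_is_mounted is hypothetical and is not part of any Spark or platform API, and the payload is trimmed to the fields that the check reads.

```python
import posixpath


def local_dir_is_mounted(payload):
    """Return True if spark.local.dir equals, or lies under, a volume mount_path.

    Hypothetical helper: it only inspects the payload dictionary and does not
    contact any service.
    """
    conf = payload.get("application_details", {}).get("conf", {})
    local_dir = conf.get("spark.local.dir")
    if not local_dir:
        return False
    target = posixpath.normpath(local_dir)
    for volume in payload.get("volumes", []):
        mount_path = volume.get("mount_path")
        if not mount_path:
            continue
        mount = posixpath.normpath(mount_path)
        # Match the mount point itself or any directory below it.
        if target == mount or target.startswith(mount.rstrip("/") + "/"):
            return True
    return False


# Trimmed copy of the example payload (credentials and arguments omitted).
payload = {
    "application_details": {
        "application": "/opt/ibm/spark/examples/src/main/python/wordcount.py",
        "conf": {"spark.app.name": "MyJob", "spark.local.dir": "/myTempSpace"},
    },
    "volumes": [
        {"name": "temp-vol", "mount_path": "/myTempSpace", "source_sub_path": ""}
    ],
}

mounted = local_dir_is_mounted(payload)
```

A check like this can catch a typo in either path before the application is submitted. It does not confirm that the volume itself exists.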
By overriding the default value of spark.local.dir, you can tailor
the temporary file storage location to your specific requirements, such as utilizing a
high-performance disk or directing temporary files to a shared network storage
location.
Remember to consider the available disk space, permissions, and accessibility of the
chosen directory path when selecting a custom value for spark.local.dir.
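The considerations above (disk space, permissions, and accessibility) can be checked with a short script. The following is a minimal sketch that uses only the Python standard library. The function name check_local_dir and the 1 GiB default threshold are assumptions for illustration, not recommended values. The script inspects only the machine that it runs on, so executor nodes would need the same check separately.

```python
import os
import shutil


def check_local_dir(path, min_free_bytes=1024 ** 3):
    """Return a list of problems found with a candidate spark.local.dir path.

    Hypothetical helper; an empty list means that no problem was detected.
    """
    if not os.path.isdir(path):
        return [f"{path} does not exist or is not a directory"]
    problems = []
    # Creating files inside a directory needs write and execute permission.
    if not os.access(path, os.W_OK | os.X_OK):
        problems.append(f"{path} is not writable by the current user")
    free = shutil.disk_usage(path).free
    if free < min_free_bytes:
        problems.append(
            f"{path} has {free} bytes free, below the required {min_free_bytes}"
        )
    return problems


# Example call with the directory from the payload example:
# check_local_dir("/myTempSpace")
```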
In notebooks, you can override the Spark configuration spark.local.dir when you
initialize the Spark context. For example:
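from pyspark.sql import SparkSession

# assumes that the notebook kernel already provides the `spark` session
sc = spark.sparkContext
conf = sc.getConf()
conf.set("spark.local.dir", "/home/spark/shared/spark-events/")

# stop the running context so that a new one is created with the updated configuration
sc.stop()
spark = SparkSession.builder.config(conf=conf).getOrCreate()

# refresh the handles to the new context and its configuration
sc = spark.sparkContext
conf = sc.getConf()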
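The following example updates the Spark configuration and restarts the Spark session. Some settings, such as spark.executor.instances in this example, might be ignored: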
from pyspark.sql import SparkSession

# update the Spark configuration
newconf = spark.sparkContext.getConf()
newconf.set("spark.local.dir", "/home/spark/shared/spark-events/")
newconf.set("spark.executor.instances", "2")  # might be ignored
# restart the Spark session with the new configuration
spark.stop()
spark = SparkSession.builder.config(conf=newconf).getOrCreate()
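After the restart, it can be useful to compare the settings that you requested with the settings that the new session reports. The following sketch assumes that the pairs come from spark.sparkContext.getConf().getAll(), which returns a list of (key, value) tuples. The function name find_unapplied is hypothetical. The comparison detects only values that are missing or different in the reported configuration. It cannot detect a setting that is recorded but not acted upon by the platform.

```python
def find_unapplied(requested, effective_pairs):
    """Map each requested key whose reported value differs to (requested, reported).

    Hypothetical helper. `requested` is a dict of setting names to string values;
    `effective_pairs` is an iterable of (key, value) tuples. A reported value of
    None means that the key is absent from the reported configuration.
    """
    effective = dict(effective_pairs)
    return {
        key: (value, effective.get(key))
        for key, value in requested.items()
        if effective.get(key) != value
    }


requested = {
    "spark.local.dir": "/home/spark/shared/spark-events/",
    "spark.executor.instances": "2",
}

# In a notebook, after the session restart:
# unapplied = find_unapplied(requested, spark.sparkContext.getConf().getAll())
```

An empty result means that every requested value appears unchanged in the reported configuration.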
To remove libraries, Spark event directories and log files after notebook execution, see Removing libraries.