Monitoring Apache Spark

The Apache Spark sensor is automatically deployed and installed after you install the Instana agent.

Support information

To make sure that the Apache Spark sensor is compatible with your current setup, check the following support information sections:

Supported versions and support policy

The sensor supports Spark versions from 1.4.x to 3.5.x.

The following table shows the latest supported version and support policy:


Technology	Support policy	Latest technology version	Latest supported version
Apache Spark Application	On demand	3.5.4	3.5.4
Apache Spark Standalone	On demand	3.5.4	3.5.4

For more information about the support policy, see Support strategy for sensors.

Sensor (Data Collection)

Spark Application

The two main components of a Spark application are the driver process and the executor processes. Executor processes contain only data that is relevant to task execution. The driver is the main process and is responsible for coordinating the execution of the Spark application. Therefore, it contains all data about the performance and execution of the Spark application, including data about each executor.

Instana collects all Spark application data (including executor data) from the driver JVM. To monitor Spark applications, the Instana agent must be installed on the host on which the Spark driver JVM is running.

You can submit Spark applications to the cluster manager in two deploy modes. The deploy mode determines where the driver JVM runs.

Deploy mode cluster: When you submit with the option --deploy-mode cluster, for example ./spark-submit --class org.apache.spark.examples.JavaWordCount --master yarn --deploy-mode cluster /path/to/app.jar, the Spark driver JVM runs on one of the worker nodes of your cluster manager. If the Instana agent is installed on the worker nodes, the Spark application (driver) is discovered automatically.
Deploy mode client: When you submit with the option --deploy-mode client, or without the option --deploy-mode (the default value is client), for example ./spark-submit --class org.apache.spark.examples.JavaWordCount --master yarn --deploy-mode client /path/to/app.jar or ./spark-submit --class org.apache.spark.examples.JavaWordCount --master yarn /path/to/app.jar, the Spark driver JVM runs on the host on which the command is run. For Instana to monitor this Spark application, the Instana agent must be installed on the host where spark-submit is run.
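
The deploy mode rules above can be sketched as a small helper that inspects a spark-submit argument list and reports where the driver JVM, and therefore the Instana agent, needs to be. This is only an illustrative sketch: the function name driver_location is hypothetical, it is not part of Spark or Instana, and it handles only the space-separated --deploy-mode form shown in the examples above.

```python
from typing import List


def driver_location(submit_args: List[str]) -> str:
    """Return where the driver JVM runs for a spark-submit argument list.

    Hypothetical helper for illustration only. It scans all arguments and
    does not distinguish spark-submit options from application arguments.
    """
    mode = "client"  # spark-submit uses client mode when --deploy-mode is absent
    for i, arg in enumerate(submit_args):
        if arg == "--deploy-mode" and i + 1 < len(submit_args):
            mode = submit_args[i + 1]
    if mode == "cluster":
        # The agent must be installed on the worker nodes.
        return "a worker node of the cluster manager"
    # The agent must be installed on the submitting host.
    return "the host where spark-submit is run"


if __name__ == "__main__":
    cluster_args = ["--class", "org.apache.spark.examples.JavaWordCount",
                    "--master", "yarn", "--deploy-mode", "cluster",
                    "/path/to/app.jar"]
    default_args = ["--class", "org.apache.spark.examples.JavaWordCount",
                    "--master", "yarn", "/path/to/app.jar"]
    print(driver_location(cluster_args))
    print(driver_location(default_args))
```

A check like this might be useful in a deployment script to decide which hosts need the agent before an application is submitted.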

Depending on the type of Spark application that Instana monitors, different data is collected:

Batch Applications

Jobs
Stages
Longest completed stages
Executors

Streaming Applications

Batching
Scheduling delay
Total delay
Processing time
Output operations
Input records
Receivers
Executors
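
As a rough illustration of how the three delay metrics in the list above relate to each other, the following sketch derives them from the timestamps of a single batch. In Spark Streaming, the total delay of a batch is the sum of its scheduling delay and its processing time. The BatchTimes class and its field names are hypothetical and only mirror that relationship. They are not the Instana sensor's data model.

```python
from dataclasses import dataclass


@dataclass
class BatchTimes:
    """Timestamps of one streaming batch, in milliseconds (hypothetical model)."""
    submitted_ms: int
    processing_start_ms: int
    processing_end_ms: int

    @property
    def scheduling_delay_ms(self) -> int:
        # Time the batch waited before processing started.
        return self.processing_start_ms - self.submitted_ms

    @property
    def processing_time_ms(self) -> int:
        # Time spent processing the batch.
        return self.processing_end_ms - self.processing_start_ms

    @property
    def total_delay_ms(self) -> int:
        # Scheduling delay plus processing time.
        return self.scheduling_delay_ms + self.processing_time_ms


if __name__ == "__main__":
    batch = BatchTimes(submitted_ms=1000, processing_start_ms=1200,
                       processing_end_ms=1700)
    print(batch.scheduling_delay_ms, batch.processing_time_ms,
          batch.total_delay_ms)
```

A scheduling delay that keeps growing from batch to batch usually means that batches take longer to process than the batch interval allows.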

Spark Application on AWS EMR

Instana detects and monitors Spark applications through the Spark driver. Therefore, to get visibility into Spark applications, install the agent on the EC2 instances in your EMR cluster. When you deploy Spark applications from the primary node in client deploy mode, it is sufficient to install the agent only on the primary node of the EMR cluster.

If you don't want to copy the Spark application JAR to the primary node and instead want to deploy the application in cluster mode from somewhere else, for example from an S3 bucket, you must install the agent on all nodes in the EMR cluster, because the driver is scheduled on a worker node.

The best method is to create the EMR cluster and, in the advanced configuration, select a custom AMI that has the Instana agent installed. For more information about how to start the EMR cluster with a custom AMI, see AWS documentation. To build the AMI with the Instana agent installed, see AWS documentation. When prompted to SSH into the EC2 instance to install the software, use the one-liner from your Instana Settings page, which you can open by clicking Settings in the sidebar of the Instana user interface. For more information, see Installing the host agent on Amazon Elastic Compute Cloud (Amazon EC2).

With this approach, you gain insights into all of your EMR cluster nodes, you can monitor Spark applications regardless of the deploy mode, and you gain insights into the underlying components of EMR, such as Hadoop YARN. If you want to measure only the Hadoop YARN metrics, see the Monitoring Amazon ElasticMapReduce (EMR) documentation.

Spark Standalone Cluster Manager

In addition to running on the Mesos or YARN cluster managers, Spark also provides a simple stand-alone deployment mode. Spark stand-alone is a cluster manager that consists of primary and worker nodes. Instana monitors the whole Spark stand-alone cluster through the primary node of the cluster. It collects cluster-wide data and data for each worker node of the cluster.

Tracked Configuration

Host
Port
Rest URI
Version
Status

Metrics

Alive workers
Dead workers
Decommissioned workers
Workers In Unknown State
Used Memory
Total Memory
Used Cores
Total Cores
Data and metrics per worker
Most recent apps
Most recent drivers
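
To make the worker-related metrics above more concrete, the following sketch aggregates them from a list of per-worker records. The record shape (the keys state, cores, coresused, memory, and memoryused) and the function name summarize_cluster are assumptions made for illustration. The sketch does not describe how the Instana sensor actually collects or computes these values, and it assumes that core and memory totals count only workers in the ALIVE state.

```python
from collections import Counter
from typing import Any, Dict, List


def summarize_cluster(workers: List[Dict[str, Any]]) -> Dict[str, int]:
    """Aggregate cluster-wide metrics from per-worker records (assumed shape)."""
    # Count workers per state; a missing state is treated as UNKNOWN.
    by_state = Counter(w.get("state", "UNKNOWN") for w in workers)
    # Assumption: resource totals include only workers that are alive.
    alive = [w for w in workers if w.get("state") == "ALIVE"]
    return {
        "alive_workers": by_state["ALIVE"],
        "dead_workers": by_state["DEAD"],
        "decommissioned_workers": by_state["DECOMMISSIONED"],
        "unknown_workers": by_state["UNKNOWN"],
        "used_cores": sum(w.get("coresused", 0) for w in alive),
        "total_cores": sum(w.get("cores", 0) for w in alive),
        "used_memory": sum(w.get("memoryused", 0) for w in alive),
        "total_memory": sum(w.get("memory", 0) for w in alive),
    }


if __name__ == "__main__":
    # Illustrative sample data, not output captured from a real cluster.
    sample = [
        {"state": "ALIVE", "cores": 8, "coresused": 4,
         "memory": 16384, "memoryused": 8192},
        {"state": "DEAD", "cores": 8, "coresused": 0,
         "memory": 16384, "memoryused": 0},
    ]
    print(summarize_cluster(sample))
```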

Configuration

Custom poll rate for Spark application

The agent monitors Spark applications out of the box, so configuring the Spark application sensor is optional. If you want a custom poll rate, you can set it in the sensor configuration:

com.instana.plugin.sparkapplication:
  poll_rate: 1 # values are in seconds. Default value is 1 second.

Custom poll rate for Spark Standalone

The agent monitors Spark Standalone out of the box, so configuring the Spark Standalone sensor is optional. If you want a custom poll rate, you can set it in the sensor configuration:

com.instana.plugin.sparkstandalone:
  poll_rate: 1 # values are in seconds. Default value is 1 second.