Monitoring Apache Spark
The Apache Spark sensor is automatically deployed and installed after you install the Instana agent.
- Supported versions
- Sensor (Data Collection)
Supported versions
Instana currently supports Spark versions 1.4.x to 2.4.x.
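As a quick illustration of the supported range, the following sketch checks a version string against 1.4.x to 2.4.x. The function name is hypothetical and is not part of Instana or Spark; it only encodes the range stated above.

```python
def is_supported_spark_version(version):
    """Return True if `version` falls in the 1.4.x to 2.4.x range stated above.

    Hypothetical helper for illustration; only major and minor are compared.
    """
    parts = version.split(".")
    try:
        major, minor = int(parts[0]), int(parts[1])
    except (IndexError, ValueError):
        return False  # not a recognizable "major.minor[.patch]" string
    # Tuple comparison: (1, 4) <= (major, minor) <= (2, 4)
    return (1, 4) <= (major, minor) <= (2, 4)
```

For example, `is_supported_spark_version("2.4.8")` is true, while `is_supported_spark_version("3.0.1")` is false.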
Sensor (Data Collection)
The two main components of a Spark application are the driver process and the executor processes. Executor processes contain only data that is relevant to task execution. The driver is the main process and is responsible for coordinating the execution of a Spark application. Therefore, it contains all data about the performance and execution of the Spark application, including data about each executor of the application.
Instana collects all Spark application data (including executor data) from the driver JVM. To monitor Spark applications, the Instana agent must be installed on the host on which the Spark driver JVM is running.
Note that there are two ways of submitting Spark applications to the cluster manager. Depending on which deploy mode is used, the location where the driver JVM runs can change.
- Deploy mode cluster: When you submit with the option --deploy-mode cluster, for example ./spark-submit --class org.apache.spark.examples.JavaWordCount --master yarn --deploy-mode cluster /path/to/app.jar, the Spark driver JVM runs on one of the worker nodes of your cluster manager. If the Instana agent is installed on the worker nodes, the Spark application (driver) is discovered automatically.
- Deploy mode client: When you submit with the option --deploy-mode client, or without the option --deploy-mode (the default value is client), for example ./spark-submit --class org.apache.spark.examples.JavaWordCount --master yarn --deploy-mode client /path/to/app.jar or ./spark-submit --class org.apache.spark.examples.JavaWordCount --master yarn /path/to/app.jar, the Spark driver JVM runs on the host on which the command is run. For Instana to monitor this Spark application, the Instana agent must be installed on the host where spark-submit is run.
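The rule in the two bullets above can be sketched as a small helper that inspects a spark-submit command line and reports where the driver, and therefore the Instana agent, needs to be. The function names are hypothetical, and the parsing is a simplification: it does not consider a deploy mode set through Spark configuration properties, and it does not distinguish spark-submit options from arguments passed to the application itself.

```python
import shlex


def deploy_mode(command):
    """Return the deploy mode of a spark-submit command line ("client" if unset)."""
    tokens = shlex.split(command)
    for i, tok in enumerate(tokens):
        # Form: --deploy-mode cluster
        if tok == "--deploy-mode" and i + 1 < len(tokens):
            return tokens[i + 1]
        # Form: --deploy-mode=cluster
        if tok.startswith("--deploy-mode="):
            return tok.split("=", 1)[1]
    return "client"  # default when the option is omitted


def agent_location(command):
    """Say where the Instana agent must run for the driver to be discovered."""
    if deploy_mode(command) == "cluster":
        return "worker nodes"      # driver is scheduled on a worker node
    return "submitting host"       # driver runs where spark-submit is run
```

With the example commands from the list, the first one yields "worker nodes" and the other two yield "submitting host".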
Depending on the type of Spark application that Instana monitors, different data is collected:
- Longest completed stages
- Scheduling delay
- Total delay
- Processing time
- Output operations
- Input records
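To make the delay metrics in the list more concrete, here is a minimal sketch of how they relate to each other, assuming they follow the per-batch definitions used by Spark Streaming: scheduling delay is the time a batch waits before processing starts, processing time is how long processing takes, and total delay is the sum of the two. The class and field names are hypothetical, and this is not a description of how Instana computes the values.

```python
from dataclasses import dataclass


@dataclass
class BatchTimes:
    """Timestamps of one streaming batch, in milliseconds (hypothetical type)."""
    submission_ms: int
    processing_start_ms: int
    processing_end_ms: int

    @property
    def scheduling_delay_ms(self):
        # Time the batch waited in the queue before processing began.
        return self.processing_start_ms - self.submission_ms

    @property
    def processing_time_ms(self):
        # Time spent actually processing the batch.
        return self.processing_end_ms - self.processing_start_ms

    @property
    def total_delay_ms(self):
        # Scheduling delay plus processing time.
        return self.processing_end_ms - self.submission_ms
```

A scheduling delay that keeps growing from batch to batch usually means that batches take longer to process than the interval at which they arrive.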
Spark Application on AWS EMR
Instana detects and monitors Spark applications through the Spark driver. Therefore, to get visibility into your Spark applications, install the agent on the EC2 instances in your EMR cluster. When you deploy Spark applications from the master node with the deployment mode client, it is sufficient to install the agent only on the master node of the EMR cluster.
If you don't want to copy the Spark application JAR to the master node, and want to deploy the Spark application in cluster mode from somewhere else, for example from an S3 bucket, you must install the agent on all the nodes in the EMR cluster. This is because the driver is scheduled on a worker node.
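The placement rule described above can be sketched as follows. The function name and the (host, role) tuples are hypothetical, and the client branch assumes that the application is submitted from the EMR master node.

```python
def nodes_needing_agent(nodes, deploy_mode):
    """Return the hosts that need the Instana agent for the driver to be found.

    `nodes` is a list of (host, role) tuples, where role is "master" or "worker".
    Assumes client-mode applications are submitted from the master node.
    """
    if deploy_mode == "client":
        # The driver runs on the master node, where spark-submit is run.
        return [host for host, role in nodes if role == "master"]
    # Cluster mode: the driver may be scheduled on any worker node, so every
    # node in the cluster gets the agent.
    return [host for host, _ in nodes]
```

For a cluster with one master and two workers, client mode returns only the master host, and cluster mode returns all three hosts.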
The recommended approach is to create the EMR cluster and, in the advanced configuration, select a custom AMI that has the Instana agent installed. For more information on how to start the EMR cluster with the custom AMI, see AWS documentation. To build the AMI with the Instana agent installed, see AWS documentation. When prompted to SSH into the EC2 instance to install the software, use the one-liner from your Instana Settings page, which you can open by clicking Settings in the sidebar of the Instana user interface. For more information, see here. This way you gain insights into all of your EMR cluster nodes, you can monitor Spark applications regardless of the deployment mode, and you gain insights into all the underlying components of EMR, such as Hadoop YARN. If you want to measure only the Hadoop YARN metrics, refer to the documentation.
Spark Standalone Cluster Manager
In addition to running on the Mesos or YARN cluster managers, Spark also provides a simple stand-alone deployment mode. Spark stand-alone is a cluster manager that consists of master and worker nodes. Instana monitors the whole Spark stand-alone cluster through the master node of the cluster. It collects cluster-wide data and data for each worker node of the cluster.
- REST URI
- Alive workers
- Dead workers
- Decommissioned workers
- Workers In Unknown State
- Used Memory
- Total Memory
- Used Cores
- Total Cores
- Data and metrics per worker
- Most recent apps
- Most recent drivers
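As a rough illustration of how cluster-wide figures like the ones listed above can be derived from per-worker data, the following sketch summarizes a dictionary shaped like the JSON view of a Spark stand-alone master. The field names ("workers", "state", "cores", "coresused", "memory", "memoryused") and the state values are assumptions based on that JSON view and may differ between Spark versions. The function name is hypothetical, and this is not a description of how the Instana sensor collects its data.

```python
from collections import Counter


def summarize_master_state(state):
    """Summarize worker counts and resources from a master-state dictionary.

    Field names are assumed, not guaranteed; missing fields count as zero.
    Resource totals cover ALIVE workers only, which is a choice of this sketch.
    """
    workers = state.get("workers", [])
    by_state = Counter(w.get("state", "UNKNOWN") for w in workers)
    alive = [w for w in workers if w.get("state") == "ALIVE"]
    return {
        "alive_workers": by_state.get("ALIVE", 0),
        "dead_workers": by_state.get("DEAD", 0),
        "decommissioned_workers": by_state.get("DECOMMISSIONED", 0),
        "unknown_workers": by_state.get("UNKNOWN", 0),
        "used_cores": sum(w.get("coresused", 0) for w in alive),
        "total_cores": sum(w.get("cores", 0) for w in alive),
        "used_memory": sum(w.get("memoryused", 0) for w in alive),
        "total_memory": sum(w.get("memory", 0) for w in alive),
    }
```

The input can come from any source that produces such a dictionary, for example a JSON document that was saved from the master and parsed with `json.loads`.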