Performance Tuning/Debugging Ambari Metrics in IOP 4.1

Motivation:

Ambari Metrics 0.1.0 (AMS) was released with Apache Ambari 2.1.0 (IOP 4.1). When AMS is run with its default configuration, it is prone to resource contention. Under the hood, Ambari Metrics 0.1.0 uses its own instances of HBase 0.98 and Phoenix 4.2 to store metrics and run some basic de-duplication / compaction. As the cluster scales up, the disk read/write requests issued by HBase (the Ambari Metrics Collector) against a single disk can drive that node to 100% utilization on every CPU (I’ve seen as high as 3000% CPU usage in `top`!).

Starting to debug Ambari Metrics

- Check the Ambari Metrics Collector logs in /var/log/ambari-metrics-collector/
- Verify available disk space: df -h
- Verify available memory: free -m
- See running processes and CPU usage for user ams: top -u ams
- See whether the HBase RegionServer and HBase HMaster are still running: ps -ef | grep ams
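
The non-interactive checks above can be bundled into one function so the output is easy to save or attach to a support ticket. This is only a minimal sketch: the function name `ams_snapshot` and the two variables are made up for illustration, and `top -u ams` is left out because it is interactive.

```shell
# Sketch: collect the basic AMS checks in one place.
# AMS_LOG_DIR and AMS_USER default to the values used in this post.
AMS_LOG_DIR="${AMS_LOG_DIR:-/var/log/ambari-metrics-collector}"
AMS_USER="${AMS_USER:-ams}"

ams_snapshot() {
  echo "== Disk space =="
  df -h

  echo "== Memory (MB) =="
  # free is Linux-specific, so skip it where it is missing
  if command -v free >/dev/null 2>&1; then
    free -m
  else
    echo "free not available on this host"
  fi

  echo "== Processes owned by or mentioning ${AMS_USER} =="
  if command -v ps >/dev/null 2>&1; then
    # grep -v grep drops the grep process itself from the listing
    ps -ef | grep -w "$AMS_USER" | grep -v grep || echo "none found"
  fi

  echo "== Five most recently modified collector logs =="
  if [ -d "$AMS_LOG_DIR" ]; then
    ls -t "$AMS_LOG_DIR" | head -n 5
  else
    echo "$AMS_LOG_DIR not found"
  fi
}

# Usage: ams_snapshot > /tmp/ams-snapshot.txt
```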

The table below summarizes the more common issues caused by Ambari Metrics, with their likely causes. The letters in the Resolution column refer to the sections under Resolutions.

Issue	Possible Cause(s)	Resolution
Ambari Metrics Collector process is using 100% of available CPUs. Any service (including the Ambari Web UI) running on the same host as the Ambari Metrics Collector becomes slow/unresponsive.	Ambari Metrics Collector is running on the same node as Ambari Server; Ambari Metrics is running in embedded mode	C, A
ams-hbase*.log shows multiple ZooKeeper timeouts	CPU contention on the Metrics Collector host when running in embedded mode	A, B
Metrics for CPU and network, among others, ‘go missing’ from the Ambari Web UI	CPU contention caused by disk r/w bottleneck; ams-hbase master heapsize too low	A, D
Metrics Collector fails to start with “port in use” or “Binding to port -1”	Port 61181 is not freed	E
After adding hosts to Ambari for a total of > 100 hosts, the UI throws the error “Validation failed. Config validation failed”	stack-advisor fails to update one property for that range of hosts	F
GC options set for the Ambari Metrics Collector are not applied to the collector process	AMBARI-14945	G

Resolutions

A. Run Ambari Metrics in Distributed Mode rather than embedded

If you are running more than 3 nodes, I strongly suggest running in distributed mode and writing the hbase.rootdir contents to HDFS directly, rather than to the local disk of a single node. This also applies to IOP clusters that are already installed and running.

In the Ambari Web UI, select the Ambari Metrics service and navigate to Configs. Update the following properties:
- General > Metrics Service operation mode=distributed
- Advanced ams-hbase-site > hbase.cluster.distributed=true
- Advanced ams-hbase-site > hbase.rootdir=hdfs://namenode.fqdn.example.org:8020/amshbase
Then restart the Metrics Collector and any affected Metrics Monitors.
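
If you would rather script the change than click through the UI, the same three properties can in principle be set with the configs.sh helper that ships with Ambari Server. Treat the following as a hedged sketch: the script path, the ams-site key name for the operation mode (`timeline.metrics.service.operation.mode`) and the `hbase.rootdir` spelling are assumptions to verify against your install, and the host and cluster names are placeholders. The function only prints the commands so you can review them first.

```shell
# Sketch: print configs.sh commands for the three distributed-mode properties.
# ASSUMPTIONS: script location, property key names, placeholder host names.
CONFIGS_SH="${CONFIGS_SH:-/var/lib/ambari-server/resources/scripts/configs.sh}"
AMBARI_HOST="${AMBARI_HOST:-ambari.server.host}"
CLUSTER="${CLUSTER:-your_cluster_name}"
ROOTDIR="${ROOTDIR:-hdfs://namenode.fqdn.example.org:8020/amshbase}"

# Nothing is executed here; the commands are only echoed for review.
print_distributed_mode_cmds() {
  echo "$CONFIGS_SH set $AMBARI_HOST $CLUSTER ams-site timeline.metrics.service.operation.mode distributed"
  echo "$CONFIGS_SH set $AMBARI_HOST $CLUSTER ams-hbase-site hbase.cluster.distributed true"
  echo "$CONFIGS_SH set $AMBARI_HOST $CLUSTER ams-hbase-site hbase.rootdir $ROOTDIR"
}

print_distributed_mode_cmds
```

A restart of the Metrics Collector and Monitors is still needed afterwards, exactly as with the UI route.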

B. “Cleaning” up a hanging Ambari Metrics Collector (embedded mode only)

If you need to run in embedded mode (small clusters, sandboxes, VMs, etc.), you can use the following steps to restore your node’s performance after it has been affected by the Metrics Collector.

If your host has multiple disks, change hbase.rootdir and hbase.tmp.dir from their default values, preferably to a disk that is less heavily utilized than the one the OS is running on.
Delete the contents of the ZooKeeper tmp snapshot dir. This will delete any unsaved metrics, effectively removing the backlog/bottleneck caused by disk contention:
rm -rf /var/lib/ambari-metrics-collector/hbase-tmp/zookeeper
Increase the metrics aggregation interval. By default, host metrics are aggregated every 2 minutes; raising this to 5 minutes or more significantly reduces the extended lag and CPU spikes, though you will still see short CPU spikes at each new interval. Note that in Ambari 2.2 the default value has been increased to 5 minutes.
In the Ambari Web UI, modify the ams-site config (the value is in seconds):
timeline.metrics.host.aggregator.minute.interval : 300
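
Because the cleanup step is an `rm -rf`, it can be worth wrapping it in a small guard. The sketch below is one way to do that; the function name `clean_ams_zk_tmp` is hypothetical, and the only check it makes is that the path ends in `hbase-tmp/zookeeper`.

```shell
# Sketch: a guarded version of the rm -rf step.
# It refuses to delete anything that does not look like the AMS ZooKeeper tmp dir.
clean_ams_zk_tmp() {
  dir="${1:-/var/lib/ambari-metrics-collector/hbase-tmp/zookeeper}"
  case "$dir" in
    */hbase-tmp/zookeeper) ;;  # expected layout, carry on
    *) echo "refusing to delete unexpected path: $dir" >&2; return 1 ;;
  esac
  if [ ! -d "$dir" ]; then
    echo "nothing to clean: $dir does not exist" >&2
    return 0
  fi
  rm -rf "$dir"
  echo "removed $dir"
}

# Usage (default path from this post): clean_ams_zk_tmp
```

If you moved hbase.tmp.dir to another disk, pass the new `.../hbase-tmp/zookeeper` path as the first argument.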

C. Moving the metrics collector to a new host.

The steps below include some required workarounds for known issues tracked in open/resolved JIRAs.

Stop Ambari Metrics Service

curl -u admin:admin -i -H 'X-Requested-By: ambari' -X PUT -d '{"RequestInfo":{"context":"Stop All Components"},"Body":{"ServiceComponentInfo":{"state":"INSTALLED"}}}' http://ambari.server.host:8080/api/v1/clusters/your_cluster_name/services/AMBARI_METRICS/components/METRICS_COLLECTOR
curl -u admin:admin -i -H 'X-Requested-By: ambari' -X PUT -d '{"RequestInfo":{"context":"Stop All Components"},"Body":{"ServiceComponentInfo":{"state":"INSTALLED"}}}' http://ambari.server.host:8080/api/v1/clusters/your_cluster_name/services/AMBARI_METRICS/components/METRICS_MONITOR

Delete the Ambari Metrics Collector from the old host

curl -u admin:admin -i -H 'X-Requested-By: ambari' -X DELETE http://ambari.server.host:8080/api/v1/clusters/your_cluster_name/hosts/old.metrics.collector.host/host_components/METRICS_COLLECTOR

Add the Ambari Metrics Collector component to the new host

curl -u admin:admin -i -H 'X-Requested-By: ambari' -X POST http://ambari.server.host:8080/api/v1/clusters/your_cluster_name/hosts/new.metrics.collector.host/host_components/METRICS_COLLECTOR

Install the Ambari Metrics Collector component on the new host

curl -u admin:admin -i -H 'X-Requested-By: ambari' -X PUT -d '{"HostRoles": {"state": "INSTALLED"}}' http://ambari.server.host:8080/api/v1/clusters/your_cluster_name/hosts/new.metrics.collector.host/host_components/METRICS_COLLECTOR

Update the Collector hostname used by the Metrics Monitor on all hosts in your Ambari cluster. The collector hostname is stored in the ‘metrics_server’ property in /etc/ambari-metrics-monitor/conf/metric_monitor.ini
```
#Run on every host in the cluster
sed -i 's/old.collector.hostname/new.collector.hostname/' /etc/ambari-metrics-monitor/conf/metric_monitor.ini
```
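
The one-line sed replaces the first match on any line of the file and treats the dots in the hostname as wildcards. A slightly stricter variant, sketched below, touches only the metrics_server line and escapes the dots. The function name is hypothetical, and the `key = value` layout of the ini file is an assumption, so check it against a real metric_monitor.ini before relying on it.

```shell
# Sketch: rewrite only the metrics_server line of an ini file.
# usage: update_metrics_server INI_FILE OLD_HOST NEW_HOST
update_metrics_server() {
  ini="$1"; old="$2"; new="$3"
  # Escape dots so "old.collector.hostname" matches literally
  old_re="$(printf '%s' "$old" | sed 's/\./\\./g')"
  # GNU sed -i, as in the original command; restrict to the metrics_server line
  sed -i "/^metrics_server/ s/${old_re}/${new}/" "$ini"
}

# Hypothetical fan-out, assuming passwordless ssh and one hostname per line
# in hosts.txt (ssh -n keeps ssh from swallowing the rest of the list):
#   while read -r h; do
#     ssh -n "$h" "sed -i '/^metrics_server/ s/old\.collector\.hostname/new.collector.hostname/' /etc/ambari-metrics-monitor/conf/metric_monitor.ini"
#   done < hosts.txt
```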

Start the Ambari Metrics service, either via the UI or with the curl call below (note the target state is STARTED, not INSTALLED)

curl -u admin:admin -i -H 'X-Requested-By: ambari' -X PUT -d '{"RequestInfo":{"context":"Start All Components"},"Body":{"ServiceComponentInfo":{"state":"STARTED"}}}' http://ambari.server.host:8080/api/v1/clusters/your_cluster_name/services/AMBARI_METRICS/components/METRICS_COLLECTOR
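
The REST calls in this section all share the same credentials, header and URL prefix, so the stop, delete, add and install steps can be factored into a small wrapper. The following is a rough sketch, not a tested migration tool: the function names are made up, all defaults are the placeholders used in this post, and it runs in dry-run mode (printing each request) unless DRY_RUN is set to 0.

```shell
# Sketch: wrap the Ambari REST calls used to move the collector.
AMBARI_URL="${AMBARI_URL:-http://ambari.server.host:8080}"
CLUSTER="${CLUSTER:-your_cluster_name}"
AMBARI_CREDS="${AMBARI_CREDS:-admin:admin}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print what would be sent

ambari_call() {  # usage: ambari_call METHOD RELATIVE_PATH [JSON_BODY]
  method="$1"; rel="$2"; body="${3:-}"
  url="$AMBARI_URL/api/v1/clusters/$CLUSTER/$rel"
  if [ "$DRY_RUN" = "1" ]; then
    echo "$method $url $body"
    return 0
  fi
  if [ -n "$body" ]; then
    curl -u "$AMBARI_CREDS" -i -H 'X-Requested-By: ambari' -X "$method" -d "$body" "$url"
  else
    curl -u "$AMBARI_CREDS" -i -H 'X-Requested-By: ambari' -X "$method" "$url"
  fi
}

move_collector() {  # usage: move_collector OLD_HOST NEW_HOST
  old="$1"; new="$2"
  stop='{"RequestInfo":{"context":"Stop All Components"},"Body":{"ServiceComponentInfo":{"state":"INSTALLED"}}}'
  ambari_call PUT "services/AMBARI_METRICS/components/METRICS_COLLECTOR" "$stop"
  ambari_call PUT "services/AMBARI_METRICS/components/METRICS_MONITOR" "$stop"
  ambari_call DELETE "hosts/$old/host_components/METRICS_COLLECTOR"
  ambari_call POST "hosts/$new/host_components/METRICS_COLLECTOR"
  ambari_call PUT "hosts/$new/host_components/METRICS_COLLECTOR" '{"HostRoles": {"state": "INSTALLED"}}'
}

# Usage: move_collector old.metrics.collector.host new.metrics.collector.host
```

Ambari processes the stop and install requests asynchronously, so with DRY_RUN=0 you would still need to wait for each request to finish before sending the next one. The sketch does not do that.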

Possible issues: AMBARI-13758

Fix: In the Ambari Web UI, modify the Ambari Metrics Config:

Under ams-hbase-site, find the hbase.zookeeper.quorum field and update it to ‘localhost’. Note that in Ambari 2.2.2+, AMS supports using the stack-deployed ZooKeeper, thus leveraging a true ZK quorum rather than a single instance.

D. Increase metrics collector heapsize

The default value for ams-hbase-env : hbase_master_heapsize will often lead to metrics periodically going missing from the Ambari Web UI. The stack-advisor recommendations have been updated in later releases.

Use https://cwiki.apache.org/confluence/display/AMBARI/Configurations+-+Tuning as a guideline for tuning the heapsizes for AMS.

For reference, the configs below tend to work rather well in the 1-50 node range:

Property	Recommended Value 1-50 Nodes
hbase_master_heapsize	2048
hbase_regionserver_heapsize	2048
metrics_collector_heapsize	1024

E. Resolve port issues with Collector

Ensure the port used by the embedded AMS ZooKeeper is free on the collector host. The default value of hbase.zookeeper.property.clientPort is 61181:

netstat -nltp | grep 61181

Free up this port, or change clientPort to a free port, and restart the Ambari Metrics Collector.
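
A plain `grep 61181` also matches lines where those digits appear as a PID or a remote port. As a small sketch, the filter below (the function name is hypothetical) checks only the local-address column of `netstat -nltp` output, assuming the usual layout where that is column 4.

```shell
# Sketch: show only sockets whose LOCAL address ends in the given port.
# Reads `netstat -nltp` output on stdin; column 4 is the local address.
listeners_on_port() {
  awk -v port="$1" '$4 ~ (":" port "$") { print }'
}

# Usage on the collector host:
#   netstat -nltp | listeners_on_port 61181
```

The last column of any matching line shows the PID and program name holding the port, which is what you need in order to stop it.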

F. Resolve stack-advisor validation failure for >100 hosts

The message “Validation failed. Config validation failed.” appears when saving modifications to the Ambari Metrics configs.

Fix: Update stack_advisor to add a default heap recommendation for host counts it does not currently account for

Option 1: Edit stack_advisor directly:

#Open the BI 4.0 stack advisor on the Ambari Server node (4.1 inherits 4.0)
vim /var/lib/ambari-server/resources/stacks/BigInsights/4.0/services/stack_advisor.py
#At line 684, find "totalHostsCount = len(hosts["items"])"
#Add the putAmsHbaseEnvProperty line directly below it
totalHostsCount = len(hosts["items"])
putAmsHbaseEnvProperty("hbase_master_heapsize", "512m")
#Restart ambari-server

Option 2: Download tar.gz with patched stack_advisor.py

wget http://developer.ibm.com/hadoop/wp-content/uploads/sites/28/2016/02/stack_advisor_ams_patch.tar_.gz
tar -C /var/lib/ambari-server/resources/stacks/BigInsights/4.0/services/ -xzvf stack_advisor_ams_patch.tar_.gz

G. Ambari Metrics Collector start script patch for java GC options

The Ambari Metrics Collector start script doesn’t properly read the Java options used for the collector process. This causes all GC options to be skipped (AMBARI-14945).

#Update the /usr/sbin/ambari-metrics-collector script to remove extra quotes on AMS_COLLECTOR_OPTS
sed -i 's/"${AMS_COLLECTOR_OPTS}"/${AMS_COLLECTOR_OPTS}/' /usr/sbin/ambari-metrics-collector
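
To see why those quotes matter, here is a tiny illustration of shell word splitting, which is the likely mechanism behind the bug. `count_args` stands in for the java command, and the option string is a made-up example rather than the collector's real defaults.

```shell
# Sketch: quoted vs unquoted expansion of an options variable.
count_args() { echo "$#"; }

# Hypothetical option string, for illustration only
AMS_COLLECTOR_OPTS="-XX:+UseConcMarkSweepGC -verbose:gc -XX:+PrintGCDetails"

count_args "${AMS_COLLECTOR_OPTS}"   # quoted: the whole string is ONE argument
count_args ${AMS_COLLECTOR_OPTS}     # unquoted: split into three arguments
```

With the quotes in place, the JVM would receive the entire string as a single argument instead of three separate flags, which is consistent with the GC options never taking effect.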
