Monitoring host

After you install the Instana host agent, the host sensor is automatically installed and deployed. You can view metrics that are related to the host sensor in the Instana UI.

Supported information

Supported operating systems

  • Linux
  • Windows
  • macOS (OS X)
  • Solaris on SPARC
  • AIX

Supported versions and platform

Configuring

For more information, see agent configuration.

Viewing metrics

To view the metrics, complete the following steps:

  1. From the navigation menu in the Instana UI, select Infrastructure.
  2. Click a specific monitored host. You can see a host dashboard with all the collected metrics and monitored processes.

Configuration data

  • Operating System name and version
  • CPU model and count
  • GPU model and count
  • Memory
  • Max Open Files
  • Hostname
  • Fully Qualified Domain Name
  • Machine ID
  • Boot ID
  • Startup time
  • Installed packages
  • System ID

The System ID is used for correlation with asset management systems. The Instana agent collects the System ID by default on Linux operating systems. For other supported operating systems, such as Windows, macOS, Solaris, and AIX, you must enable System ID collection in the agent configuration YAML file, as shown in the following example:

com.instana.plugin.host:
  collectSystemId: true

Performance metrics

The following performance metrics are collected for a monitored host.

CPU usage

The CPU usage metric displays the total CPU usage as a percentage.

To collect more accurate CPU usage in an AIX LPAR environment, you must set useMpstat to true, as shown in the following example:

com.instana.plugin.host:
  useMpstat: true

Datapoint: Filesystem

Granularity: 1 second

Memory usage

  • On Linux operating systems, the used value is calculated as a percentage by using the formula (total - actualFree) ÷ total. The sensor uses the actualFree value, which is the memory that is really available (free memory plus cached and buffered memory), instead of free, which is typically low because memory is used for caching and buffering. See the worked example at the end of this section.

Datapoint: Filesystem

Granularity: 1 second

  • In an AIX LPAR environment, the used value is calculated as a percentage by using the formula (computational + non-computational) ÷ real total.

Non-computational memory is part of used memory, so the used value is relatively high. A high used value does not necessarily indicate a need for more memory.

Memory overcommitment is determined by comparing computational memory with the real memory in the system. Therefore, the computational percentage is more informative for estimating memory usage on AIX.

Datapoint: AIX perfstat_memory_total interface

Granularity: 1 second
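
As an illustrative example with hypothetical numbers: on a Linux host with 16 GiB of total memory where actualFree is 12 GiB, the used value is (16 - 12) ÷ 16 = 25%.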

CPU load

The CPU load metric displays the average number of processes that are executed over the selected time period.

Datapoint: Filesystem

Granularity: 5 seconds

CPU usage

The CPU usage metric displays the following values as a percentage on a graph for a selected time period:

  • user
  • system
  • wait
  • nice
  • steal

Datapoint: Filesystem

Granularity: 1 second

Context switches

The Context switches metric displays the total number of context switches on a graph for a selected time period. Context switches are supported only on Linux hosts.

Datapoint: Filesystem

Granularity: 1 second

CPU load

The CPU load metric displays the value on a graph for a selected time period.

Datapoint: Filesystem

Granularity: 1 second

Individual CPU usage

The Individual CPU usage metric displays the following values as a percentage on a graph for a selected time period:

  • user
  • system
  • wait
  • nice
  • steal

Datapoint: Filesystem

Granularity: 1 second

Individual GPU usage

The following table outlines the Individual GPU usage values:

Datapoint Collected from Granularity Unit
GPU Usage nvidia-smi 1 second %
Temperature nvidia-smi 1 second °C
Encoder nvidia-smi 1 second %
Decoder nvidia-smi 1 second %
Memory Used nvidia-smi 1 second %
Memory Total nvidia-smi 1 second bytes
Transmitted throughput nvidia-smi 1 second bytes/s
Received throughput nvidia-smi 1 second bytes/s

The following table outlines the supported Nvidia graphics cards:

Brand Model
Tesla S1070, S2050, C1060, C2050/70, M2050/70/90, X2070/90, K10, K20, K20X, K40, K80, M40, P40, P100, V100
Quadro 4000, 5000, 6000, 7000, M2070-Q, K-series, M-series, P-series, RTX-series
GeForce varying levels of support, with fewer metrics available than on the Tesla and Quadro products

Supported operating system: Linux

Prerequisites

You must install the latest official Nvidia drivers.

For more information about starting the Docker container for the Instana agent with GPU support, see Enable GPU monitoring through Instana Agent container.

Data collection of GPU metrics is designed for minimal impact by splitting polling and querying into two processes that use nvidia-smi. The background process is started in loop mode and kept in memory. This approach significantly improves the performance of metrics collection and avoids potential overhead.

The sensor queries GPU metrics based on the configured poll rate (every second by default). This approach enables the sensor to collect accurate and up-to-date metrics every second for multiple GPUs without overhead.
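
For reference, the values in the preceding table correspond to fields that nvidia-smi can report. The following command is only an illustrative sketch of such a query, not the exact invocation that the agent uses:

# Poll GPU usage, temperature, encoder/decoder usage, and memory once per second
nvidia-smi --query-gpu=utilization.gpu,temperature.gpu,utilization.encoder,utilization.decoder,memory.used,memory.total \
  --format=csv,nounits --loop=1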

GPU Memory/Process

The following table outlines the processes that use the GPU:

Datapoint Collected from Granularity
Process Name nvidia-smi 1 second
PID nvidia-smi 1 second
GPU nvidia-smi 1 second
Memory nvidia-smi 1 second
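
The per-process values in this table correspond to what nvidia-smi can report for compute processes. The following command is only an illustrative sketch, not the exact invocation that the agent uses:

# List the PID, process name, and GPU memory that each compute process uses
nvidia-smi --query-compute-apps=pid,process_name,used_gpu_memory --format=csv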


Memory

The following table outlines the units for the Linux memory values:

Value Unit
used Percentage
swap used Percentage
swap total Byte
swap free Byte
cached Byte
available Byte

The values are displayed on a graph for a selected time period.

Datapoint: Filesystem

Granularity: 1 second

The following table outlines the units for the AIX memory values:

Value Unit
used Percentage
swap used Percentage
virtual used Percentage
swap total Byte
swap free Byte
virtual total Byte
virtual free Byte
page-in per second The number of page-in events per second
page-out per second The number of page-out events per second

You can view all values on a graph for a selected time period.

Datapoint: AIX perfstat_memory_total interface

Granularity: 1 second

Open files

The Open files metric displays the open files usage (current versus maximum) when it is available on the operating system. The values are displayed on a graph for a selected time period.

The Solaris operating system has limited support: the global zone supports only the current metric, and non-global zones do not support any metrics.

Datapoint: Filesystem

Granularity: 1 second

Filesystems

The following table outlines the filesystem datapoints per device:

Datapoint Collected from Granularity
Device Filesystem 60 seconds
Mount Filesystem 60 seconds
Options Filesystem 60 seconds
Type Filesystem 60 seconds
Capacity Filesystem 60 seconds
Total Utilization* Filesystem 60 seconds
Read Utilization* Filesystem 60 seconds
Write Utilization* Filesystem 60 seconds
Used Filesystem 1 second
Leaked* Filesystem 1 second
Inode usage Filesystem 1 second
Reads/s, Bytes Read/s** Filesystem 1 second
Writes/s, Bytes Written/s** Filesystem 1 second

* The total, read, and write usage datapoint metrics display the disk I/O utilization as a percentage. This functionality is available only on Linux.

* Leaked refers to deleted files that are still in use and equates to capacity - used - free. On Linux, you can find these files with lsof | grep deleted.

** The Total Utilization, Read Utilization, and Write Utilization datapoints are not supported for Network File Systems (NFS).

By default, Instana only monitors local filesystems. You can list the filesystems that are monitored or excluded in the configuration.yaml file.

The name for the configuration setting is the device name, which you can obtain from the first column of the mtab file or of the df command output.

You must specify temporary filesystems in the following format: tmpfs:/mount/point.
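
For example, on Linux you can list candidate device names (the first column) and mount points before you add them to the configuration. The following command is an illustration only:

# Print the device name and mount point of each mounted filesystem (Linux)
awk '{print $1, $2}' /etc/mtab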

The following example shows the list of filesystems that are monitored:

com.instana.plugin.host:
  filesystems:
    - '/dev/sda1'
    - 'tmpfs:/sys/fs/cgroup'
    - 'server:/usr/local/pub'

The following example shows the filesystems that are included or excluded:

com.instana.plugin.host:
  filesystems:
    include:
      - '/dev/xvdd'
      - 'tmpfs:/tmp'
      - 'server:/usr/local/pub'
    exclude:
      - '/dev/xvda2'

Network File Systems (NFS)

To monitor all NFS, use the nfs_all: true configuration parameter as shown in the following example:

com.instana.plugin.host:
  nfs_all: true

Network interfaces

The following table outlines the network traffic and errors per interface:

Datapoint Collected from Granularity
Interface Filesystem 60 seconds
Mac Filesystem 60 seconds
IPs Filesystem 60 seconds
RX Bytes Filesystem 1 second
RX Errors Filesystem 1 second
TX Bytes Filesystem 1 second
TX Errors Filesystem 1 second

TCP activity

TCP activity values are displayed on a graph for a selected time period. The following table outlines TCP activity values:

Datapoint Collected from Granularity
Established Filesystem 1 second
Open/s Filesystem 1 second
In Segments/s Filesystem 1 second
Out Segments/s Filesystem 1 second
Established Resets Filesystem 1 second
Out Resets Filesystem 1 second
Fail Filesystem 1 second
Error Filesystem 1 second
Retransmission Filesystem 1 second

Instana doesn't support the TCP activity metric for Sun Solaris hosts.

Process top list

The top process list is updated every 30 seconds and contains only processes with significant system usage. For example, processes with more than 10% CPU usage over the last 30 seconds or with more than 512 MB of memory usage (RSS) are displayed in the process top list.

To create a combined list of processes from the top 10 CPU and memory usage lists, set combineTopProcesses to true. The processes are included in the combined list even if their CPU usage is less than 10% or memory usage is less than 512 MB. If the same process is listed in the top 10 CPU and top 10 memory usage lists, it is listed only once in the combined list, which can include up to 20 entries.

com.instana.plugin.host:
  combineTopProcesses: true

Linux top semantics are used: 100% CPU refers to full use of a single CPU core. You can search a history of snapshots from the previous month. The normalized CPU value is calculated by dividing the CPU value by the number of logical processors.
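
As an illustrative example with hypothetical numbers: on a host with 8 logical processors, a process that uses 200% CPU in top semantics is displayed as 25% normalized CPU (200 ÷ 8).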

Datapoint Collected from Granularity
PID Filesystem 30 seconds
Process Name Filesystem 30 seconds
CPU Filesystem 30 seconds
CPU (normalized) Calculated 30 seconds
Memory Filesystem 30 seconds

Extract packages list

You can extract the list of installed packages on an operating system once a day by setting collectInstalledSoftware to true in the configuration.yaml file.

The following Linux distributions are currently supported:

  • Debian-based (dpkg)
  • Red Hat-based (rpm and yum)

com.instana.plugin.host:
  collectInstalledSoftware: true # [true, false]

Windows services list

Windows services are not monitored by default. This feature is enabled only when winServiceRegex is set in the configuration.yaml file. The winServiceRegex value is a regular expression that is used to monitor services whose service name or display name matches the expression. For example, winServiceRegex: '(Sensor|Device)' monitors all services that include Sensor or Device in their service name or display name.
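
The following sketch shows how this setting might look in the configuration.yaml file; it assumes that winServiceRegex is placed under the com.instana.plugin.host section, like the other host settings on this page:

com.instana.plugin.host:
  winServiceRegex: '(Sensor|Device)' # monitor services whose service name or display name matches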

Datapoint Collected from Granularity
Service Name Windows sc queryex 60 seconds
Display Name Windows sc queryex 60 seconds
PID Windows sc queryex 60 seconds
State Windows sc queryex 60 seconds

Health signatures

For each sensor, there is a knowledge base of health signatures that are evaluated continuously against the incoming metrics and are used to raise issues or incidents depending on user impact.

Built-in events trigger issues or incidents based on failing health signatures on entities, and custom events trigger issues or incidents based on the thresholds of an individual metric of any given entity.

For more information about the built-in events for the Host sensor, see Built-in events reference.

Error report events (only AIX operating system)

On the AIX system, the errpt command generates an error report from entries in an error log. The errors in the error report are then captured as events and sent to Instana. The sensor captures permanent and temporary error types, and hardware and software error classes. You need to enable the feature by using the agent configuration.yaml file as shown in the following example:

com.instana.plugin.host:
  aixEventsPollRate: 900 # In seconds

Troubleshooting

eBPF not supported

Monitoring issue type: ebpf_not_supported

The Process Abnormal Termination functionality detects when processes that run on a Linux-based operating system terminate unexpectedly because they crash or are killed by external signals.

This functionality is built on top of the extended Berkeley Packet Filter (eBPF), which might be unavailable on this host.

To take advantage of Instana's eBPF-based features, you need a 4.7+ Linux kernel with debugfs mounted.
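
To check whether a host meets these requirements, you can, for example, inspect the kernel version and the debugfs mount with standard Linux commands (shown as an illustration only):

# Kernel version must be 4.7 or later
uname -r
# debugfs is typically mounted at /sys/kernel/debug
mount -t debugfs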

For more information about the supported operating systems, see Process Abnormal Termination.

SELinux policy blocking eBPF

If SELinux is installed on your host, then you need to create a policy to allow the agent to use eBPF. SELinux might prevent unconfined services, such as the host agent, from issuing the bpf_* syscalls that the eBPF sensor uses to instrument the Linux kernel. To verify, check the log entries of the Audit system, which are stored by default in the /var/log/audit/audit.log file.

The following steps show how to create the policy for a Red Hat Linux machine:

  1. Run the following command:
$ cat /var/log/audit/audit.log | grep ebpf
type=AVC msg=audit(1598891569.452:193): avc:  denied  { map_create } for  pid=1612 comm="ebpf-preflight-" 
scontext=system_u:system_r:unconfined_service_t:s0 tcontext=system_u:system_r:unconfined_service_t:s0 
tclass=bpf permissive=0
type=SYSCALL msg=audit(1598891569.452:193): arch=c000003e syscall=321 success=no exit=-13 
a0=0 a1=7ffc0e1f5020 a2=78 a3=fefefefefefefeff items=0 ppid=1502 pid=1612 auid=4294967295 
uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="ebpf-preflight-" 
exe="/opt/instana/agent/data/repo/com/instana/ebpf-preflight/0.1.6/ebpf-preflight-0.1.6.bin" 
subj=system_u:system_r:unconfined_service_t:s0 key=(null)
type=PROCTITLE msg=audit(1598891569.452:193):
proctitle="/opt/instana/agent/data/repo/com/instana/ebpf-preflight/0.1.6/ebpf-preflight-0.1.6.bin"

Audit log files are usually rotated. Therefore, run this command shortly after you start the host agent.

In the log file, you might see that the map_create syscall is denied. To allow the eBPF sensor to make the syscall, create an SELinux policy by using the audit2allow program.

  2. On Red Hat systems, install the SELinux policy tools as follows:
yum install policycoreutils-python
  3. With audit2allow, create raw policy files based on the log entries, as shown in the following example:
grep ebpf /var/log/audit/audit.log | audit2allow -M instana_ebpf

The processing command creates the following files:

ls -Al | grep instana_ebpf
-rw-r--r--. 1 root                    root                      886 31. Aug 18:31 instana_ebpf.pp
-rw-r--r--. 1 root                    root                      239 31. Aug 18:31 instana_ebpf.te

The raw policy file instana_ebpf.te contains an instruction to allow the denied syscall as shown in the following example:

$ cat instana_ebpf.te
module instana_ebpf 1.0;
require {
	type unconfined_service_t;
	class bpf map_create;
}
#============= unconfined_service_t ==============
#!!!! This avc is allowed in the current policy
allow unconfined_service_t self:bpf map_create;

This policy allows any application of the unconfined_service_t type (very generic) to make the map_create syscall.

  4. In addition, the eBPF sensor needs a few more syscalls. You must edit the instana_ebpf.te file as shown in the following example:
$ cat instana_ebpf.te
module instana_ebpf 1.0;
require {
	type unconfined_service_t;
	class bpf { map_create map_read map_write prog_load prog_run };
}
#============= unconfined_service_t ==============
#!!!! This avc is allowed in the current policy
allow unconfined_service_t self:bpf { map_create map_read map_write prog_load prog_run };
  5. Rewrite the file to a binary format as the instana_ebpf.mod file:
$ checkmodule -M -m -o instana_ebpf.mod instana_ebpf.te
checkmodule:  loading policy configuration from instana_ebpf.te
checkmodule:  policy configuration loaded
checkmodule:  writing binary representation (version 19) to instana_ebpf.mod
  6. Repackage the instana_ebpf.mod file as a loadable module:
semodule_package -o instana_ebpf.pp -m instana_ebpf.mod
  7. Apply the policy package:
semodule -i instana_ebpf.pp

Any unconfined process, such as the host agent, can now make these syscalls.