Monitoring host
After you install the Instana host agent, the host sensor is automatically installed and deployed. You can view metrics that are related to the host sensor in the Instana UI.
- Supported information
- Configuring
- Viewing metrics
- Configuration data
- Performance metrics
- Health signatures
- Error report events (only AIX operating system)
- Troubleshooting
Supported information
Supported operating systems
- Linux
- Windows
- Mac OS/OS X
- Solaris on Sparc
- AIX
Supported versions and platform
Configuring
For more information, see agent configuration.
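Host sensor options are set in the com.instana.plugin.host section of the agent's configuration.yaml file. The following sketch is for orientation only: it combines options that are described later on this page, and the values are illustrative rather than recommendations.
com.instana.plugin.host:
  collectSystemId: true          # collect the System ID on non-Linux operating systems
  useMpstat: true                # more accurate CPU usage in AIX LPAR environments
  nfs_all: true                  # monitor all Network File Systems
  combineTopProcesses: true      # combine the top-10 CPU and memory process lists
  collectInstalledSoftware: true # extract the installed packages list once a day
  aixEventsPollRate: 900         # poll rate for AIX errpt events, in seconds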
Viewing metrics
To view the metrics, complete the following steps:
- From the navigation menu in the Instana UI, select Infrastructure.
- Click a specific monitored host. You can see a host dashboard with all the collected metrics and monitored processes.
Configuration data
- Operating System name and version
- CPU model and count
- GPU model and count
- Memory
- Max Open Files
- Hostname
- Fully Qualified Domain Name
- Machine ID
- Boot ID
- Startup time
- Installed packages
- System ID
System ID
The System ID is used for correlation with asset management systems. The Instana agent collects the System ID by default on Linux operating systems. For other supported operating systems, such as Windows, macOS, Solaris, and AIX, you need to enable System ID collection in the agent configuration YAML file as shown in the following example:
com.instana.plugin.host:
  collectSystemId: true
Performance metrics
The following performance metrics are collected for a monitored host.
CPU usage
The CPU usage metric displays the total CPU usage as a percentage.
To collect more accurate CPU usage in an AIX LPAR environment, set useMpstat to true as shown in the following example:
com.instana.plugin.host:
  useMpstat: true
Datapoint: Filesystem
Granularity: 1 second
Memory usage
- On Linux operating systems, the used value is measured as a percentage by using the formula (total - actualFree) ÷ total; a worked example follows this list. The sensor uses the actualFree value, which is the real constrained memory that includes free and cached memory, instead of free, which is a low value (used for caching or buffering).
Datapoint: Filesystem
Granularity: 1 second
- In an AIX LPAR environment, the used value is computed as a percentage by using the formula (computational + non-computational) ÷ real total.
Non-computational memory is part of used memory, which results in a relatively high used value. A high used value doesn't necessarily indicate a need for more memory.
Memory over-commitment is determined by comparing computational memory with the real memory in the system. Therefore, the percentage of computational memory is more informative for estimating memory usage on AIX.
Datapoint: AIX perfstat_memory_total interface
Granularity: 1 second
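To make the difference between the two formulas concrete, the following worked example uses illustrative numbers that are not taken from any specific system:
- Linux: with total of 16 GiB and actualFree of 6 GiB, used = (16 - 6) ÷ 16 = 62.5%.
- AIX: with computational of 10 GiB, non-computational of 4 GiB, and real total of 16 GiB, used = (10 + 4) ÷ 16 = 87.5%, while the computational share alone is 10 ÷ 16 = 62.5%, which is the more informative figure.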
CPU load
The CPU load metric displays the average number of processes that are executed over the selected time period.
Datapoint: Filesystem
Granularity: 5 seconds
CPU usage
The CPU usage metric displays the following values as a percentage on a graph for a selected time period:
- user
- system
- wait
- nice
- steal
Datapoint: Filesystem
Granularity: 1 second
Context switches
The Context switches metric displays the total number of context switches on a graph for a selected time period. Context switches are supported only on Linux hosts.
Datapoint: Filesystem
Granularity: 1 second
CPU load
The CPU load metric displays the value on a graph for a selected time period.
Datapoint: Filesystem
Granularity: 1 second
Individual CPU usage
The Individual CPU usage metric displays the following values as a percentage on a graph for a selected time period:
- user
- system
- wait
- nice
- steal
Datapoint: Filesystem
Granularity: 1 second
Individual GPU usage
The following table outlines the Individual GPU usage values:
Datapoint | Collected from | Granularity | Unit |
---|---|---|---|
Gpu Usage | nvidia-smi | 1 second | % |
Temperature | nvidia-smi | 1 second | °C |
Encoder | nvidia-smi | 1 second | % |
Decoder | nvidia-smi | 1 second | % |
Memory Used | nvidia-smi | 1 second | % |
Memory Total | nvidia-smi | 1 second | bytes |
Transmitted throughput | nvidia-smi | 1 second | bytes/s |
Received throughput | nvidia-smi | 1 second | bytes/s |
The following table outlines the supported Nvidia graphics cards:
Brand | Model |
---|---|
Tesla | S1070, S2050, C1060, C2050/70, M2050/70/90, X2070/90, K10, K20, K20X, K40, K80, M40, P40, P100, V100 |
Quadro | 4000, 5000, 6000, 7000, M2070-Q, K-series, M-series, P-series, RTX-series |
GeForce | varying levels of support, with fewer metrics available than on the Tesla and Quadro products |
Supported operating system: Linux
Prerequisites
You must install the latest official Nvidia drivers.
For more information about starting the Docker container for the Instana agent with GPU support, see Enable GPU monitoring through Instana Agent container.
To minimize the impact of GPU metrics collection, polling and querying are split into two processes that use nvidia-smi. The background process is started in loop mode and kept in memory, which significantly improves the performance of metrics collection and avoids unnecessary overhead.
The sensor queries GPU metrics based on the configured poll rate (every second by default), so it can collect accurate and up-to-date metrics every second for multiple GPUs without extra overhead.
GPU Memory/Process
The following table outlines the data points that are collected for each process that uses the GPU:
Datapoint | Collected from | Granularity |
---|---|---|
Process Name | nvidia-smi | 1 second |
PID | nvidia-smi | 1 second |
GPU | nvidia-smi | 1 second |
Memory | nvidia-smi | 1 second |
Memory
The following table outlines the unit for Linux values:
Value | Unit |
---|---|
used | Percentage |
swap used | Percentage |
swap total | Byte |
swap free | Byte |
cached | Byte |
available | Byte |
The values are displayed on a graph for a selected time period.
Datapoint: Filesystem
Granularity: 1 second
The following table outlines the unit for AIX values:
Value | Unit |
---|---|
used | Percentage |
swap used | Percentage |
virtual used | Percentage |
swap total | Byte |
swap free | Byte |
virtual total | Byte |
virtual free | Byte |
page-in per second | The number of page-in events per second |
page-out per second | The number of page-out events per second |
All values are displayed on a graph for a selected time period.
Datapoint: AIX perfstat_memory_total interface
Granularity: 1 second
Open files
Open files usage, when available on the operating system: current versus max. The values are displayed on a graph for a selected time period.
The Solaris operating system has limited support: the global zone supports only the current metric, and non-global zones do not support any metrics.
Datapoint: Filesystem
Granularity: 1 second
Filesystems
The following table outlines the data points that are collected per filesystem device:
Datapoint | Collected from | Granularity |
---|---|---|
Device | Filesystem | 60 seconds |
Mount | Filesystem | 60 seconds |
Options | Filesystem | 60 seconds |
Type | Filesystem | 60 seconds |
Capacity | Filesystem | 60 seconds |
Total Utilization * | Filesystem | 60 seconds |
Read Utilization * | Filesystem | 60 seconds |
Write Utilization * | Filesystem | 60 seconds |
Used | Filesystem | 1 second |
Leaked * | Filesystem | 1 second |
Inode usage | Filesystem | 1 second |
Reads/s, Bytes Read/s ** | Filesystem | 1 second |
Writes/s, Bytes Written/s ** | Filesystem | 1 second |
* The Total Utilization, Read Utilization, and Write Utilization datapoints display the disk I/O utilization as a percentage. This functionality is available only on Linux.
* Leaked refers to deleted files that are still in use and equates to capacity - used - free. On Linux, you can find these files with lsof | grep deleted.
** The Total Utilization, Read Utilization, and Write Utilization datapoints are not supported for Network File Systems (NFS).
By default, Instana monitors only local filesystems. You can list the filesystems that are monitored or excluded in the configuration.yaml file.
The name for the configuration setting is the device name, which you can obtain from the first column of the mtab file or of the df command output.
You must specify temporary filesystems in the following format: tmpfs:/mount/point.
The following example shows the list of filesystems that are monitored:
com.instana.plugin.host:
  filesystems:
    - '/dev/sda1'
    - 'tmpfs:/sys/fs/cgroup'
    - 'server:/usr/local/pub'
The following example shows the filesystems that are included or excluded:
com.instana.plugin.host:
  filesystems:
    include:
      - '/dev/xvdd'
      - 'tmpfs:/tmp'
      - 'server:/usr/local/pub'
    exclude:
      - '/dev/xvda2'
Network File Systems (NFS)
To monitor all NFS mounts, set the nfs_all: true configuration parameter as shown in the following example:
com.instana.plugin.host:
  nfs_all: true
Network interfaces
The following table outlines the network traffic and errors per interface:
Datapoint | Collected from | Granularity |
---|---|---|
Interface | Filesystem | 60 seconds |
Mac | Filesystem | 60 seconds |
IPs | Filesystem | 60 seconds |
RX Bytes | Filesystem | 1 second |
RX Errors | Filesystem | 1 second |
TX Bytes | Filesystem | 1 second |
TX Errors | Filesystem | 1 second |
TCP activity
TCP activity values are displayed on a graph for a selected time period. The following table outlines TCP activity values:
Datapoint | Collected from | Granularity |
---|---|---|
Established | Filesystem | 1 second |
Open/s | Filesystem | 1 second |
In Segments/s | Filesystem | 1 second |
Out Segments/s | Filesystem | 1 second |
Established Resets | Filesystem | 1 second |
Out Resets | Filesystem | 1 second |
Fail | Filesystem | 1 second |
Error | Filesystem | 1 second |
Retransmission | Filesystem | 1 second |
Instana doesn't support the TCP activity metric for Sun Solaris hosts.
Process top list
The top process list is updated every 30 seconds and contains only processes with significant system usage: for example, processes with more than 10% CPU usage over the last 30 seconds or with more than 512 MB of memory usage (RSS).
To create a combined list of processes from the top 10 CPU and top 10 memory usage lists, set combineTopProcesses to true. The processes are included in the combined list even if their CPU usage is less than 10% or their memory usage is less than 512 MB. If the same process appears in both the top 10 CPU and top 10 memory usage lists, it is listed only once in the combined list, which can include up to 20 entries.
com.instana.plugin.host:
  combineTopProcesses: true
Linux top semantics are used: 100% CPU refers to full use of a single CPU core. You can search a history of snapshots from the previous month. The normalized CPU value is calculated by dividing the CPU value by the number of logical processors.
Datapoint | Collected from | Granularity |
---|---|---|
PID | Filesystem | 30 seconds |
Process Name | Filesystem | 30 seconds |
CPU | Filesystem | 30 seconds |
CPU (normalized) | Calculated | 30 seconds |
Memory | Filesystem | 30 seconds |
Extract packages list
You can extract the list of installed packages on an operating system once a day by setting collectInstalledSoftware to true in the configuration.yaml file.
The following Linux distributions are currently supported:
- Debian-based (dpkg)
- Red Hat-based (rpm and yum)
com.instana.plugin.host:
  collectInstalledSoftware: true # [true, false]
Windows services list
Windows services are not monitored by default. This feature is enabled only when winServiceRegex is set in the configuration.yaml file.
The winServiceRegex parameter is a regular expression that is used to monitor services whose service name or display name matches the expression. For example, winServiceRegex: '(Sensor|Device)' monitors all services that include Sensor or Device in their service name or display name.
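The following sketch shows one way to set this option in the agent's configuration.yaml file; it assumes that winServiceRegex is nested under the com.instana.plugin.host section, like the other host options on this page:
com.instana.plugin.host:
  # Assumption: winServiceRegex belongs under the host plugin section.
  # Monitors any service whose name or display name contains Sensor or Device.
  winServiceRegex: '(Sensor|Device)'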
Datapoint | Collected from | Granularity |
---|---|---|
Service Name | Windows sc queryex | 60 seconds |
Display Name | Windows sc queryex | 60 seconds |
PID | Windows sc queryex | 60 seconds |
State | Windows sc queryex | 60 seconds |
Health signatures
For each sensor, there is a knowledge base of health signatures that are evaluated continuously against the incoming metrics and are used to raise issues or incidents depending on user impact.
Built-in events trigger issues or incidents based on failing health signatures on entities, and custom events trigger issues or incidents based on the thresholds of an individual metric of any given entity.
For more information about the built-in events for the Host sensor, see Built-in events reference.
Error report events (only AIX operating system)
On the AIX operating system, the errpt command generates an error report from entries in an error log. The errors in the report are then captured as events and sent to Instana. The sensor captures permanent and temporary error types, and hardware and software error classes. You need to enable this feature in the agent configuration.yaml file as shown in the following example:
com.instana.plugin.host:
  aixEventsPollRate: 900 # In seconds
Troubleshooting
eBPF not supported
Monitoring issue type: ebpf_not_supported
The Process Abnormal Termination functionality detects when processes that run on a Linux-based operating system terminate unexpectedly because they crash or are killed by outside signals.
This functionality is built on top of the extended Berkeley Packet Filter (eBPF), which might be unavailable on this host.
To take advantage of Instana's eBPF-based features, you need a Linux kernel at version 4.7 or later with debugfs mounted.
For more information about the supported operating systems, see Process Abnormal Termination.
SELinux policy blocking eBPF
If SELinux is installed on your host, you need to create a policy that allows the agent to use eBPF. SELinux might prevent unconfined services, such as the host agent, from issuing the bpf_* syscalls that the eBPF sensor uses to instrument the Linux kernel. To verify, look at the log entries of the Audit system, which are stored by default in /var/log/audit/audit.log.
The following example shows the steps to create a policy on a Red Hat Linux machine:
- Run the following command:
$ cat /var/log/audit/audit.log | grep ebpf
type=AVC msg=audit(1598891569.452:193): avc: denied { map_create } for pid=1612 comm="ebpf-preflight-"
scontext=system_u:system_r:unconfined_service_t:s0 tcontext=system_u:system_r:unconfined_service_t:s0
tclass=bpf permissive=0
type=SYSCALL msg=audit(1598891569.452:193): arch=c000003e syscall=321 success=no exit=-13
a0=0 a1=7ffc0e1f5020 a2=78 a3=fefefefefefefeff items=0 ppid=1502 pid=1612 auid=4294967295
uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="ebpf-preflight-"
exe="/opt/instana/agent/data/repo/com/instana/ebpf-preflight/0.1.6/ebpf-preflight-0.1.6.bin"
subj=system_u:system_r:unconfined_service_t:s0 key=(null)
type=PROCTITLE msg=audit(1598891569.452:193):
proctitle="/opt/instana/agent/data/repo/com/instana/ebpf-preflight/0.1.6/ebpf-preflight-0.1.6.bin"
Audit log files are usually rotated. Therefore, run this command soon after starting the host agent.
In the log file, you might see that the map_create syscall is denied. To allow the eBPF sensor to make the syscall, create an SELinux policy by using the audit2allow program.
- On Red Hat systems, install the package that provides audit2allow:
yum install policycoreutils-python
- With audit2allow, create raw policy files based on the log entries as shown in the following example:
grep ebpf /var/log/audit/audit.log | audit2allow -M instana_ebpf
The preceding command creates the following files:
ls -Al | grep instana_ebpf
-rw-r--r--. 1 root root 886 31. Aug 18:31 instana_ebpf.pp
-rw-r--r--. 1 root root 239 31. Aug 18:31 instana_ebpf.te
The raw policy file instana_ebpf.te contains an instruction to allow the denied syscall as shown in the following example:
$ cat instana_ebpf.te

module instana_ebpf 1.0;

require {
    type unconfined_service_t;
    class bpf map_create;
}

#============= unconfined_service_t ==============

#!!!! This avc is allowed in the current policy
allow unconfined_service_t self:bpf map_create;
This policy allows any application of type unconfined (very generic) to make the map_create syscall.
- In addition, the eBPF sensor needs a few more syscalls. Edit the instana_ebpf.te file as shown in the following example:
$ cat instana_ebpf.te

module instana_ebpf 1.0;

require {
    type unconfined_service_t;
    class bpf { map_create map_read map_write prog_load prog_run };
}

#============= unconfined_service_t ==============

#!!!! This avc is allowed in the current policy
allow unconfined_service_t self:bpf { map_create map_read map_write prog_load prog_run };
- Rewrite the file to a binary format as the instana_ebpf.mod file:
$ checkmodule -M -m -o instana_ebpf.mod instana_ebpf.te
checkmodule: loading policy configuration from instana_ebpf.te
checkmodule: policy configuration loaded
checkmodule: writing binary representation (version 19) to instana_ebpf.mod
- Repackage the instana_ebpf.mod file as a loadable module:
semodule_package -o instana_ebpf.pp -m instana_ebpf.mod
- Apply the policy package:
semodule -i instana_ebpf.pp
Any unconfined process, such as the host agent, can now make these syscalls.