Monitoring Host
The host sensor is automatically deployed and installed after you install the Instana agent.
- Supported OS
- Configuration
- Metrics collection
- Error report events (only AIX operating system)
- Troubleshooting
Supported OS
- Linux
- Windows
- Mac OS/OS X
- Solaris on Sparc
- AIX
Configuration
For detailed information, see our agent configuration documentation.
Metrics collection
To view the metrics, select Infrastructure in the sidebar of the Instana user interface, and then click a specific monitored host. The host dashboard shows all the collected metrics and monitored processes.
Configuration data
- Operating System name and version
- CPU model and count
- GPU model and count
- Memory
- Max Open Files
- Hostname
- Fully Qualified Domain Name
- Machine ID
- Boot ID
- Startup time
- Installed packages
- System ID
Note:
System ID can be used for correlation with asset management systems. The Instana agent collects the System ID by default on Linux. For other supported operating systems (Windows, macOS, Solaris, AIX), you need to explicitly enable the feature in the agent configuration.yaml file:
com.instana.plugin.host:
  collectSystemId: true
Performance metrics
CPU usage
Overall CPU usage as a percentage.
- In an AIX LPAR environment, set useMpstat to true to collect more accurate CPU usage.
com.instana.plugin.host:
  useMpstat: true
Datapoint: Filesystem
Granularity: 1 second
Memory usage
- On Linux, the used value is computed as a percentage by using the formula (total - actualFree) / total. The sensor uses the actualFree value, the real constrained memory which includes free and cached memory, instead of just free, which is usually quite low because otherwise idle memory is used for caching and buffering.
Datapoint: Filesystem
Granularity: 1 second
- On AIX, the used value is computed as a percentage by using the formula (computational + non-computational) / real total. The non-computational memory is considered part of used memory, potentially resulting in a relatively high used value. However, a high used value doesn't necessarily indicate a need for more memory. The determination of memory over-commitment is based on the comparison between computational memory and the real memory in the system. Therefore, the percentage of computational is more informative for estimating memory usage on AIX.
Datapoint: AIX perfstat_memory_total interface
Granularity: 1 second
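The two memory formulas above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the sensor's implementation: the function names and the sample values are hypothetical, and the inputs are assumed to be in the same unit (for example bytes).

```python
def linux_used_percent(total: int, actual_free: int) -> float:
    """Linux: (total - actualFree) / total, expressed as a percentage."""
    return (total - actual_free) / total * 100.0


def aix_used_percent(computational: int, non_computational: int, real_total: int) -> float:
    """AIX: (computational + non-computational) / real total, as a percentage."""
    return (computational + non_computational) / real_total * 100.0


def aix_computational_percent(computational: int, real_total: int) -> float:
    """AIX: share of real memory that is computational, as a percentage."""
    return computational / real_total * 100.0


GIB = 1024 ** 3  # hypothetical sample values, expressed in GiB below

# Linux host with 16 GiB total and 12 GiB actualFree
print(linux_used_percent(16 * GIB, 12 * GIB))          # 25.0

# AIX host with 16 GiB real memory: 6 GiB computational, 9 GiB non-computational
print(aix_used_percent(6 * GIB, 9 * GIB, 16 * GIB))    # 93.75
print(aix_computational_percent(6 * GIB, 16 * GIB))    # 37.5
```

In the AIX sample, used is high (93.75%) while computational memory is only 37.5% of real memory, which is the situation the text describes as not necessarily indicating a need for more memory.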
CPU load
The average number of processes being executed or waiting to be executed over the selected period of time.
Datapoint: Filesystem
Granularity: 5 seconds
CPU usage
CPU usage values as a percentage: user, system, wait, nice, and steal. The values are displayed on a graph over a selected time period.
Datapoint: Filesystem
Granularity: 1 second
Context switches
The total number of context switches. This is supported only on Linux hosts. The value is displayed on a graph over a selected time period.
Datapoint: Filesystem
Granularity: 1 second
CPU load
CPU load. The value is displayed on a graph over a selected time period.
Datapoint: Filesystem
Granularity: 1 second
Individual CPU usage
Individual CPU usage values as a percentage: user, system, wait, nice, and steal. The values are displayed on a graph over a selected time period.
Datapoint: Filesystem
Granularity: 1 second
Individual GPU usage
Individual GPU usage values.
Datapoint | Collected from | Granularity | Unit |
---|---|---|---|
GPU Usage | nvidia-smi | 1 second | % |
Temperature | nvidia-smi | 1 second | °C |
Encoder | nvidia-smi | 1 second | % |
Decoder | nvidia-smi | 1 second | % |
Memory Used | nvidia-smi | 1 second | % |
Memory Total | nvidia-smi | 1 second | bytes |
Transmitted throughput | nvidia-smi | 1 second | bytes/s |
Received throughput | nvidia-smi | 1 second | bytes/s |
Supported Nvidia Graphic Cards:
Brand | Model |
---|---|
Tesla | S1070, S2050, C1060, C2050/70, M2050/70/90, X2070/90, K10, K20, K20X, K40, K80, M40, P40, P100, V100 |
Quadro | 4000, 5000, 6000, 7000, M2070-Q, K-series, M-series, P-series, RTX-series |
GeForce | varying levels of support, with fewer metrics available than on the Tesla and Quadro products |
Supported OS: Linux
Prerequisites: The latest official Nvidia drivers are installed.
To start the Instana agent Docker container with GPU support, see Enable GPU monitoring through Instana Agent container.
Note:
Data collection of GPU metrics is designed for minimal impact by splitting polling and querying into two processes that use the nvidia-smi command-line utility. The background process is started in loop mode and is kept in memory, which significantly improves the metrics collection performance and avoids potential overhead. The sensor queries GPU metrics based on the configured poll rate (every second by default). This enables the sensor to gather accurate and up-to-date metrics every second for multiple GPUs without the overhead.
GPU Memory/Process
The following datapoints are collected for each process that uses the GPU.
Datapoint | Collected from | Granularity |
---|---|---|
Process Name | nvidia-smi | 1 second |
PID | nvidia-smi | 1 second |
GPU | nvidia-smi | 1 second |
Memory | nvidia-smi | 1 second |
Supported Nvidia Graphic Cards:
Brand | Model |
---|---|
Tesla | S1070, S2050, C1060, C2050/70, M2050/70/90, X2070/90, K10, K20, K20X, K40, K80, M40, P40, P100, V100 |
Quadro | 4000, 5000, 6000, 7000, M2070-Q, K-series, M-series, P-series, RTX-series |
GeForce | varying levels of support, with fewer metrics available than on the Tesla and Quadro products |
Supported OS: Linux
Prerequisites: The latest official Nvidia drivers are installed.
To start the Instana agent Docker container with GPU support, see Enable GPU monitoring through Instana Agent container.
Note:
Data collection of GPU metrics is designed for minimal impact by splitting polling and querying into two processes that use the nvidia-smi command-line utility. The background process is started in loop mode and is kept in memory, which significantly improves the metrics collection performance and avoids potential overhead. The sensor queries GPU metrics based on the configured poll rate (every second by default). This enables the sensor to gather accurate and up-to-date metrics every second for multiple GPUs without the overhead.
Memory
- Linux: memory used and swap used values are displayed as percentages. swap total, swap free, buffers, cached, and available are reported in bytes. The values are displayed on a graph over a selected time period.
  Datapoint: Filesystem
  Granularity: 1 second
- AIX: memory used, swap used, and virtual used values are displayed as percentages. swap total, swap free, virtual total, and virtual free are reported in bytes. computational and non-computational values are displayed both as a percentage and in bytes in two separate charts. page-in per second and page-out per second are displayed as the number of page-in and page-out events per second. You can visualize all these values on a graph over a user-selected time period.
  Datapoint: AIX perfstat_memory_total interface
  Granularity: 1 second
Open files
Open files usage when available on the operating system: current vs max. The values are displayed on a graph over a selected time period.
Note:
Solaris OS has limited support:
- Global zone: only the current metric is supported
- Non-global zone: none of the metrics are supported
Datapoint: Filesystem
Granularity: 1 second
Filesystems
Filesystems per device.
Datapoint | Collected from | Granularity |
---|---|---|
Device | Filesystem | 60 seconds |
Mount | Filesystem | 60 seconds |
Options | Filesystem | 60 seconds |
Type | Filesystem | 60 seconds |
Capacity | Filesystem | 60 seconds |
Total Utilization * | Filesystem | 60 seconds |
Read Utilization * | Filesystem | 60 seconds |
Write Utilization * | Filesystem | 60 seconds |
Used | Filesystem | 1 second |
Leaked * | Filesystem | 1 second |
Inode usage | Filesystem | 1 second |
Reads/s , Bytes Read/s ** | Filesystem | 1 second |
Writes/s , Bytes Writes/s ** | Filesystem | 1 second |
* The Total Utilization, Read Utilization, and Write Utilization datapoints display the disk I/O utilization as a percentage. This functionality is available only on Linux.
* Leaked refers to deleted files that are still in use and equates to capacity - used - free. On Linux, you can find these files with lsof | grep deleted.
** The Total Utilization, Read Utilization, and Write Utilization datapoints are not supported for Network File Systems.
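As a worked example of the Leaked relationship in the footnotes, the following minimal sketch computes capacity - used - free. The function name and the sample sizes are hypothetical and only illustrate the arithmetic.

```python
def leaked_bytes(capacity: int, used: int, free: int) -> int:
    """Space held by deleted files that are still open: capacity - used - free."""
    return capacity - used - free


GIB = 1024 ** 3  # hypothetical sample values

# A 100 GiB filesystem that reports 60 GiB used and 35 GiB free
leaked = leaked_bytes(100 * GIB, 60 * GIB, 35 * GIB)
print(leaked // GIB)  # 5
```

A nonzero result suggests that some processes still hold deleted files open, which is what lsof | grep deleted lists on Linux.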
By default, Instana monitors only local filesystems. You can explicitly list the filesystems to be monitored or excluded in the configuration.yaml file. The name for the configuration setting is the device name, which you can obtain from the first column of the mtab file or the df command output.
Temporary filesystems need to be specified in the following format: tmpfs:/mount/point. For example, to list the filesystems to be monitored:
com.instana.plugin.host:
  filesystems:
    - '/dev/sda1'
    - 'tmpfs:/sys/fs/cgroup'
    - 'server:/usr/local/pub'
Or, to include and exclude specific filesystems:
com.instana.plugin.host:
  filesystems:
    include:
      - '/dev/xvdd'
      - 'tmpfs:/tmp'
      - 'server:/usr/local/pub'
    exclude:
      - '/dev/xvda2'
Network File Systems (NFS)
To monitor all Network File Systems (NFS), use the nfs_all: true configuration parameter:
com.instana.plugin.host:
  nfs_all: true
Network interfaces
Network traffic and errors per interface.
Datapoint | Collected from | Granularity |
---|---|---|
Interface | Filesystem | 60 seconds |
Mac | Filesystem | 60 seconds |
IPs | Filesystem | 60 seconds |
RX Bytes | Filesystem | 1 second |
RX Errors | Filesystem | 1 second |
TX Bytes | Filesystem | 1 second |
TX Errors | Filesystem | 1 second |
TCP activity
TCP activity values are displayed on a graph over a selected time period.
Datapoint | Collected from | Granularity |
---|---|---|
Established | Filesystem | 1 second |
Open/s | Filesystem | 1 second |
In Segments/s | Filesystem | 1 second |
Out Segments/s | Filesystem | 1 second |
Established Resets | Filesystem | 1 second |
Out Resets | Filesystem | 1 second |
Fail | Filesystem | 1 second |
Error | Filesystem | 1 second |
Retransmission | Filesystem | 1 second |
Instana doesn't support the TCP activity metric for Sun Solaris hosts.
Process top list
The top process list is updated every 30 seconds, and it contains only the processes with significant system usage. For example, the processes with more than 10% CPU usage over the last 30 seconds or processes with more than 512 MB memory usage (RSS) are displayed in the process top list.
To create a combined list of processes from the top 10 CPU and memory usage lists, set combineTopProcesses to true. The processes are included in the combined list even if their CPU usage is less than 10% or memory usage is less than 512 MB. If the same process is listed in the top 10 CPU and top 10 memory usage lists, it is listed only once in the combined list, which can include up to 20 entries.
com.instana.plugin.host:
  combineTopProcesses: true
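The combining behavior described above (top 10 by CPU plus top 10 by memory, with each process listed once) can be sketched as follows. This is an illustration under stated assumptions, not the agent's code: the Proc type, the function name, the ordering of the result, and the sample data are all hypothetical.

```python
from typing import Dict, List, NamedTuple


class Proc(NamedTuple):
    pid: int
    name: str
    cpu_percent: float
    rss_bytes: int


def combine_top_processes(procs: List[Proc], n: int = 10) -> List[Proc]:
    """Merge the top-n CPU list and the top-n memory list, deduplicated by PID."""
    top_cpu = sorted(procs, key=lambda p: p.cpu_percent, reverse=True)[:n]
    top_mem = sorted(procs, key=lambda p: p.rss_bytes, reverse=True)[:n]
    combined: Dict[int, Proc] = {}
    for proc in top_cpu + top_mem:
        # A process present in both lists is kept only once.
        combined.setdefault(proc.pid, proc)
    return list(combined.values())  # at most 2 * n entries


# Small demonstration with n=2 instead of 10 (hypothetical data)
sample = [
    Proc(1, "db", 50.0, 100),
    Proc(2, "cache", 5.0, 900),
    Proc(3, "app", 40.0, 800),
    Proc(4, "cron", 1.0, 10),
]
print([p.pid for p in combine_top_processes(sample, n=2)])  # [1, 3, 2]
```

In the demonstration, process 3 appears in both the CPU and the memory list but is listed only once, so the combined list has three entries rather than four.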
Linux top semantics are used: 100% CPU refers to full use of a single CPU core. You can search a history of snapshots from the previous month. The normalized CPU is calculated by dividing the CPU value by the number of logical processors.
Datapoint | Collected from | Granularity |
---|---|---|
PID | Filesystem | 30 seconds |
Process Name | Filesystem | 30 seconds |
CPU | Filesystem | 30 seconds |
CPU (normalized) | Calculated | 30 seconds |
Memory | Filesystem | 30 seconds |
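The CPU (normalized) datapoint follows from the top-style CPU value and the number of logical processors. The following minimal sketch shows the division with hypothetical sample numbers; the function name is an assumption made for illustration.

```python
def normalized_cpu(cpu_percent: float, logical_processors: int) -> float:
    """Divide a top-style CPU value (100% = one full core) by the logical processor count."""
    if logical_processors < 1:
        raise ValueError("logical_processors must be at least 1")
    return cpu_percent / logical_processors


# A process that keeps 2.5 cores busy on a host with 8 logical processors
print(normalized_cpu(250.0, 8))  # 31.25
```

With top semantics a multi-threaded process can exceed 100%, while the normalized value stays within 0 to 100% of the whole host.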
Installed Packages List
When collectInstalledSoftware is set to true in the configuration.yaml file, the installed packages on an operating system can be extracted once a day.
The following Linux distributions are currently supported:
- Debian-based (dpkg)
- Red Hat-based (rpm and yum)
com.instana.plugin.host:
  collectInstalledSoftware: true # [true, false]
Health signatures
For each sensor, there is a curated knowledgebase of health signatures that are evaluated continuously against the incoming metrics and are used to raise issues or incidents depending on user impact.
Built-in events trigger issues or incidents based on failing health signatures on entities, and custom events trigger issues or incidents based on the thresholds of an individual metric of any given entity.
For information about the built-in events for the Host sensor, see the Built-in events reference.
Error report events (only AIX operating system)
On AIX, the errpt command generates an error report from entries in an error log. The errors in the error report are then captured as events and sent to Instana. The sensor captures permanent and temporary error types, and hardware and software error classes. You need to explicitly enable the feature in the agent configuration.yaml file:
com.instana.plugin.host:
  aixEventsPollRate: 900 # In seconds
Troubleshooting
eBPF Not Supported
Monitoring issue type: ebpf_not_supported
The Process Abnormal Termination functionality detects when processes running on a Linux-based operating system terminate unexpectedly because they crash or are killed by outside signals.
This functionality is built on top of the extended Berkeley Packet Filter (eBPF), which seems to be unavailable on this host.
To take advantage of Instana's eBPF-based features, you need a Linux kernel 4.7 or later with debugfs mounted. Refer to the Process Abnormal Termination documentation for more information on the supported operating systems.
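To check the two prerequisites named above on a host, a rough sketch such as the following may help. It is not an Instana tool: the function names are hypothetical, the kernel release is assumed to start with major.minor, and debugfs is detected by looking for a debugfs entry in the standard Linux /proc/mounts file.

```python
import platform
import re
from typing import Tuple


def kernel_at_least(release: str, minimum: Tuple[int, int] = (4, 7)) -> bool:
    """Return True if a release string such as '4.18.0-305.el8.x86_64' is >= minimum."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        return False
    return (int(match.group(1)), int(match.group(2))) >= minimum


def debugfs_mounted(mounts_text: str) -> bool:
    """Return True if any line of /proc/mounts content has filesystem type debugfs."""
    for line in mounts_text.splitlines():
        fields = line.split()  # device, mount point, type, options, dump, pass
        if len(fields) >= 3 and fields[2] == "debugfs":
            return True
    return False


if __name__ == "__main__":
    try:
        with open("/proc/mounts") as handle:
            mounts = handle.read()
    except OSError:
        mounts = ""  # not Linux, or /proc is not readable
    print("kernel 4.7+:", kernel_at_least(platform.release()))
    print("debugfs mounted:", debugfs_mounted(mounts))
```

Both checks passing does not guarantee that eBPF is usable, for example when a security policy blocks the bpf syscalls.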
SELinux policy blocking eBPF
If you have SELinux installed on your host, you usually need to create a policy to allow the agent to leverage eBPF. SELinux may prevent unconfined services like the host agent from issuing the bpf_* syscall that the eBPF sensor uses to instrument the Linux kernel. To verify that this is happening, look in the log entries of the Audit system, which are stored by default in /var/log/audit/audit.log.
The following example is from a Red Hat Linux machine:
$ cat /var/log/audit/audit.log | grep ebpf
type=AVC msg=audit(1598891569.452:193): avc: denied { map_create } for pid=1612 comm="ebpf-preflight-"
scontext=system_u:system_r:unconfined_service_t:s0 tcontext=system_u:system_r:unconfined_service_t:s0
tclass=bpf permissive=0
type=SYSCALL msg=audit(1598891569.452:193): arch=c000003e syscall=321 success=no exit=-13
a0=0 a1=7ffc0e1f5020 a2=78 a3=fefefefefefefeff items=0 ppid=1502 pid=1612 auid=4294967295
uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="ebpf-preflight-"
exe="/opt/instana/agent/data/repo/com/instana/ebpf-preflight/0.1.6/ebpf-preflight-0.1.6.bin"
subj=system_u:system_r:unconfined_service_t:s0 key=(null)
type=PROCTITLE msg=audit(1598891569.452:193):
proctitle="/opt/instana/agent/data/repo/com/instana/ebpf-preflight/0.1.6/ebpf-preflight-0.1.6.bin"
Note that audit log files are usually rotated, so we have to run this command not long after starting the host agent.
In the log file, we see that the map_create syscall was denied. To allow the eBPF sensor to make this syscall, we need to create an SELinux policy. For this, we need the audit2allow program. On Red Hat systems, it can be installed as follows:
yum install policycoreutils-python
With audit2allow, we can then create raw policy files based on the log entries:
grep ebpf /var/log/audit/audit.log | audit2allow -M instana_ebpf
The command above will create the following files:
ls -Al | grep instana_ebpf
-rw-r--r--. 1 root root 886 31. Aug 18:31 instana_ebpf.pp
-rw-r--r--. 1 root root 239 31. Aug 18:31 instana_ebpf.te
The raw policy file, called instana_ebpf.te, now contains an instruction to allow the denied syscall:
$ cat instana_ebpf.te
module instana_ebpf 1.0;

require {
    type unconfined_service_t;
    class bpf map_create;
}

#============= unconfined_service_t ==============

#!!!! This avc is allowed in the current policy
allow unconfined_service_t self:bpf map_create;
This policy will allow any app of type unconfined (very generic) to make the map_create syscall.
Additionally, the eBPF sensor needs a few more syscalls. We have to edit the instana_ebpf.te file so it looks like this:
$ cat instana_ebpf.te
module instana_ebpf 1.0;

require {
    type unconfined_service_t;
    class bpf { map_create map_read map_write prog_load prog_run };
}

#============= unconfined_service_t ==============

#!!!! This avc is allowed in the current policy
allow unconfined_service_t self:bpf { map_create map_read map_write prog_load prog_run };
This file must then be compiled to a binary format as the instana_ebpf.mod file:
$ checkmodule -M -m -o instana_ebpf.mod instana_ebpf.te
checkmodule: loading policy configuration from instana_ebpf.te
checkmodule: policy configuration loaded
checkmodule: writing binary representation (version 19) to instana_ebpf.mod
The instana_ebpf.mod file must be repackaged as a loadable module:
semodule_package -o instana_ebpf.pp -m instana_ebpf.mod
And finally we can apply the policy package:
semodule -i instana_ebpf.pp
Any unconfined process, such as the host agent, can now make those syscalls.