Monitoring a Linux host
You can monitor your Linux host with Instana. Instana provides comprehensive insights into the Linux host's performance, health, and resource utilization, enabling efficient troubleshooting, performance optimization, and proactive issue detection.
- System information
- Interfaces
- Reporting status
- Performance metrics
- CPU usage: Overall
- Memory usage: Overall
- CPU load: Peak
- Process blocked state
- Process waiting runtime
- User sessions
- CPU usage: Total
- Context Switches
- CPU load: Average
- Individual CPU Usage
- Individual GPU usage
- GPU Memory/Process
- Memory
- Paging activity
- Open files
- Process statistics
- RPC client and server activity
- File system
- Disk
- Network interfaces
- TCP activity
- Process top list
- Extract packages list
- File information attributes
System information
Instana automatically collects comprehensive system information from your Linux host. View these details in the System pane of the Instana dashboard:
| Parameter | Description |
|---|---|
| OS | Operating system details, including kernel version and architecture. |
| CPU | CPU model and core count. |
| Memory | Total system memory in GiB (gibibytes). |
| Max Open Files | Maximum number of concurrent file operations that are supported by the system. |
| Hostname | Network hostname of the Linux host. |
| FQDN | Fully qualified domain name, including subdomain and top-level domain. |
| Machine ID | Unique identifier generated during Linux distribution installation. |
| Boot ID | Unique identifier for the current boot session. |
| System ID | Custom identifier that is used by Instana for host management and correlation with asset management systems. Collected automatically by the Instana agent for Linux operating systems. |
| Host ID | MAC address of the primary network interface. |
| Started At | System boot timestamp. |
| BIOS Version | Version number of the system BIOS (Basic Input/Output System) or UEFI (Unified Extensible Firmware Interface) firmware. |
| BIOS Release Date | Release date of the installed BIOS version. |
| OS Vendor Name | Name of the organization or distribution that provided the operating system. |
| OS Vendor ID | Short identifier for programmatic OS vendor identification. |
| Hardware Model | Specific model name or number of the system or system board. |
| Hardware Brand | Hardware manufacturer name. |
BIOS and hardware details are read from the /sys/class/dmi/id/ directory through the Linux sysfs interface. This data is sourced from DMI (Desktop Management Interface) and SMBIOS (System Management BIOS) provided by the system firmware.
Interfaces
You can find the following details:
- Interfaces: The list of network interfaces and IP addresses.
- Instana agent: The Instana agent for the host.
- Process: The count and details of the processes that are running on the host.
Reporting status
The historical availability of a Linux host is shown in the Reporting Status chart in the Linux host dashboard. You can see three color indicators that identify the status of a host reporting to Instana.
| Status | Description | Color indicator |
|---|---|---|
| Reporting | The host reported to Instana without any interruptions. | Green |
| Reporting - monitoring issues | The host reported to Instana with some interruptions (such as network interruptions or agent monitoring issues) and was not fully available. | Orange |
| Not Reporting | The host was not reporting to Instana at all during this time. | Red |
The metric that is used to show this data on the host dashboard is based on the aggregation of messages received from the agent monitoring the host. A host is classified as Reporting if Instana receives at least 98% of the expected messages in a given timeframe.
For example, if the metric aggregation time window is 5 minutes and the poll rate of the host is once per second, Instana expects to receive 300 messages from the host during that timeframe.
- If at least 294 messages are received (98% of 300), the host status is shown as Reporting.
- If fewer than 294 but more than 0 messages are received, the host status is shown as Reporting - monitoring issues.
- If no messages are received, the host status is shown as Not Reporting.
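The classification rules above can be sketched as a small function. This is an illustration only: the function name `classify_reporting_status` and its parameters are hypothetical and are not part of Instana. The 98% threshold and the status names are taken from the text above.

```python
def classify_reporting_status(received: int, expected: int,
                              threshold_percent: int = 98) -> str:
    """Classify one aggregation window (hypothetical helper, not an Instana API)."""
    if expected <= 0:
        raise ValueError("expected must be a positive message count")
    if received <= 0:
        return "Not Reporting"
    # Integer arithmetic avoids floating-point rounding at the threshold.
    if received * 100 >= threshold_percent * expected:
        return "Reporting"
    return "Reporting - monitoring issues"


# A 5-minute window at one message per second gives 300 expected messages.
expected_messages = 5 * 60
for received_messages in (300, 294, 293, 0):
    status = classify_reporting_status(received_messages, expected_messages)
    print(received_messages, status)
```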
Performance metrics
The following performance metrics are displayed for the Linux host.
CPU usage: Overall
The CPU usage values, when combined, provide a detailed view of how the CPU resources are being utilized on a Linux host.
| Metric | Description | Granularity |
|---|---|---|
| CPU Usage | The total CPU usage in percentage for the time range that you set. | 1 second |
Memory usage: Overall
| Metric | Description | Granularity |
|---|---|---|
| Memory Usage | The total memory usage in percentage | 1 second |
The used percentage is calculated with the formula (total - actualFree) ÷ total. The sensor uses the actualFree value, which counts free memory together with memory that is used for caching and buffering, instead of the free value, which is typically low because otherwise idle memory is used for caching or buffering.
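As a worked example of the formula above, the following sketch compares the result for actualFree with the result for the raw free value. The function name and the sample numbers are hypothetical and chosen only for illustration.

```python
def memory_used_percent(total: float, actual_free: float) -> float:
    """Apply (total - actualFree) / total and express the result as a percentage."""
    if total <= 0:
        raise ValueError("total must be positive")
    return (total - actual_free) / total * 100.0


# Assumed sample host: 16 GiB total, 1 GiB free, 3 GiB held in cache and buffers.
total_gib = 16
free_gib = 1
cache_and_buffers_gib = 3
actual_free_gib = free_gib + cache_and_buffers_gib

print(memory_used_percent(total_gib, actual_free_gib))  # based on actualFree
print(memory_used_percent(total_gib, free_gib))         # based on free only
```

With these sample values, the actualFree-based figure is 75%, while the free-based figure would be 93.75%, which overstates memory pressure.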
CPU load: Peak
| Metric | Description | Granularity |
|---|---|---|
| Load | The peak CPU load. The highest number of processes that are run for the time range that you set. | 1 second |
Process blocked state
| Metric | Description | Granularity |
|---|---|---|
| Process blocked state | The number of processes in a blocked state that are waiting for I/O resources to become available. | 1 minute |
Process waiting runtime
| Metric | Description | Granularity |
|---|---|---|
| Process waiting runtime | The number of processes waiting in the run queue for CPU time. | 1 minute |
User sessions
| Metric | Description | Granularity |
|---|---|---|
| User Sessions | The number of concurrent user login sessions on the host. | 1 minute |
CPU usage: Total
| Metric | Description | Granularity |
|---|---|---|
| User | The percentage of CPU time that is spent executing user-space processes, including applications and user-initiated services. | 1 second |
| System | The percentage of CPU time that is spent executing kernel operations, including system calls, device drivers, and core OS functions. | 1 second |
| Wait | The percentage of CPU time that is spent waiting for I/O operations to complete, indicating potential disk or network bottlenecks. | 1 second |
| Nice | The percentage of CPU time that is spent executing processes with reduced priority (positive nice values), allowing higher-priority tasks to run first. | 1 second |
| Steal | The percentage of CPU time that is stolen by the hypervisor to service other virtual machines on the same physical host. | 1 second |
| Idle | The percentage of CPU time when the processor was idle and not waiting for I/O operations, indicating available CPU capacity. | 1 second |
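The source does not describe how the agent derives these percentages. On Linux, such values are commonly computed from the difference between two samples of the aggregate `cpu` line in /proc/stat, and the following sketch shows that general technique. The function names are hypothetical, and the mapping of the `iowait` counter to the Wait metric is an assumption.

```python
from typing import Dict

# Order of the first eight counters on the "cpu" line of /proc/stat.
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(line: str) -> Dict[str, int]:
    """Parse the aggregate 'cpu' line of /proc/stat into named tick counters."""
    parts = line.split()
    values = [int(v) for v in parts[1:1 + len(FIELDS)]]
    return dict(zip(FIELDS, values))


def cpu_percentages(before: Dict[str, int], after: Dict[str, int]) -> Dict[str, float]:
    """Return each counter's share of the ticks elapsed between two samples."""
    deltas = {name: after[name] - before[name] for name in FIELDS}
    total = sum(deltas.values())
    if total <= 0:
        return {name: 0.0 for name in FIELDS}
    return {name: 100.0 * delta / total for name, delta in deltas.items()}


# Two assumed samples taken one polling interval apart.
sample_before = parse_cpu_line("cpu  1000 50 300 8000 100 0 0 50 0 0")
sample_after = parse_cpu_line("cpu  1060 50 320 8100 110 0 0 60 0 0")
usage = cpu_percentages(sample_before, sample_after)
for name in ("user", "system", "iowait", "nice", "steal", "idle"):
    print(f"{name}: {usage[name]:.1f}%")
```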
Context switches
| Metric | Description | Granularity |
|---|---|---|
| Context Switches | The total number of context switches on a graph for a selected time period. | 1 second |
CPU load: Average
The CPU load metric tracks the average number of processes competing for CPU resources, displayed as a time-series graph.
| Metric | Description | Granularity |
|---|---|---|
| CPU Load | The average number of processes in the run queue (either executing on the CPU or waiting for CPU time) over the selected time period, providing insight into system workload and resource demand. | 1 second |
Individual CPU Usage
The CPU usage metric displays the following metrics in percentage on a graph for a selected time period for each CPU:
| Metric | Description | Granularity |
|---|---|---|
| User | The amount of CPU time that is spent running user-space processes (applications and services). | 1 second |
| System | The amount of CPU time that is spent running kernel-space processes (OS core functions). | 1 second |
| Wait | The amount of CPU time that is spent waiting for input or output operations to complete. | 1 second |
| Nice | The amount of CPU time that is spent running processes with a lower priority (nice value). | 1 second |
| Steal | The amount of CPU time lost due to the hypervisor managing other virtual machines or containers on the same physical host. | 1 second |
| Idle | Percentage of CPU time when the processor was idle. | 1 second |
Individual GPU usage
The following table outlines the Individual GPU usage values:
| Metric | Description | Granularity | Unit |
|---|---|---|---|
| GPU Usage | GPU usage percentage | 1 second | % |
| Temperature | GPU temperature in Celsius | 1 second | °C |
| Encoder | Encoder utilization | 1 second | % |
| Decoder | Decoder utilization | 1 second | % |
| Memory Used | Memory usage | 1 second | % |
| Memory Total | Total GPU memory | 1 second | bytes |
| Transmitted throughput | Transmitted data rate | 1 second | bytes/s |
| Received throughput | Received data rate | 1 second | bytes/s |
The metrics are collected from nvidia-smi. The following table outlines the supported Nvidia graphics cards:
| Brand | Model |
|---|---|
| Tesla | S1070, S2050, C1060, C2050/70, M2050/70/90, X2070/90, K10, K20, K20X, K40, K80, M40, P40, P100, V100 |
| Quadro | 4000, 5000, 6000, 7000, M2070-Q, K-series, M-series, P-series, RTX-series |
| GeForce | Varying levels of support, with fewer metrics available than on the Tesla and Quadro products |
Prerequisites
You must install the latest official Nvidia drivers.
For more information about starting a Docker container for Instana Agent with GPU support, see Enable GPU monitoring through Instana Agent container.
Data collection of GPU metrics is carefully designed for minimal impact by splitting polling and querying into two processes by using nvidia-smi. The background process is started in a loop mode and kept in memory. This process significantly improves the performance of metrics collection and prevents any potential overhead.
The sensor queries GPU metrics based on the configured poll rate (every second by default). The solution enables the sensor to collect accurate and up-to-date metrics every second for multiple GPUs without the overhead.
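The loop-mode approach described above can be sketched as follows: one long-running nvidia-smi process emits a CSV line per GPU at each interval, and the caller only reads lines from it. The source does not state which nvidia-smi options the agent uses, so the query fields and flags here are an assumption, and the function names are hypothetical.

```python
import subprocess
from typing import Dict, Iterator, List

# Assumed selection of nvidia-smi query fields; the agent's actual list is not documented here.
QUERY_FIELDS = ["index", "utilization.gpu", "temperature.gpu", "memory.used", "memory.total"]


def build_command(interval_seconds: int = 1) -> List[str]:
    """Build an nvidia-smi command that keeps running and reports every interval."""
    return [
        "nvidia-smi",
        "--query-gpu=" + ",".join(QUERY_FIELDS),
        "--format=csv,noheader,nounits",
        f"--loop={interval_seconds}",
    ]


def parse_line(line: str) -> Dict[str, str]:
    """Map one CSV output line to field names; values stay strings (they can be '[N/A]')."""
    values = [value.strip() for value in line.split(",")]
    return dict(zip(QUERY_FIELDS, values))


def stream_gpu_samples(interval_seconds: int = 1) -> Iterator[Dict[str, str]]:
    """Start the background process once and yield parsed samples as they arrive."""
    process = subprocess.Popen(build_command(interval_seconds),
                               stdout=subprocess.PIPE, text=True)
    try:
        for line in process.stdout:
            if line.strip():
                yield parse_line(line)
    finally:
        process.terminate()


# Parsing an assumed output line for GPU 0.
print(parse_line("0, 37, 54, 1024, 16384"))
```

On a host with the Nvidia drivers installed, iterating over `stream_gpu_samples()` yields one dictionary per GPU per interval without starting a new process for each query.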
GPU Memory/Process
The following datapoints are collected for each process that uses the GPU:
| Datapoint | Collected from | Granularity |
|---|---|---|
| Process Name | nvidia-smi | 1 second |
| PID | nvidia-smi | 1 second |
| GPU | nvidia-smi | 1 second |
| Memory | nvidia-smi | 1 second |
The following table outlines the supported Nvidia graphics cards for GPU memory:
| Brand | Model |
|---|---|
| Tesla | S1070, S2050, C1060, C2050/70, M2050/70/90, X2070/90, K10, K20, K20X, K40, K80, M40, P40, P100, V100 |
| Quadro | 4000, 5000, 6000, 7000, M2070-Q, K-series, M-series, P-series, RTX-series |
| GeForce | Varying levels of support, with fewer metrics available than on the Tesla and Quadro products |
Memory
The following table outlines the unit for memory:
| Metric | Unit | Description | Granularity |
|---|---|---|---|
| Total | Byte | The total amount of memory | 1 second |
| Shared | Byte | Memory used by shared memory segments and tmpfs filesystems on Linux systems | 1 second |
| Used | Percentage | Amount of memory in use | 1 second |
| Buffers | Byte | Memory used for buffers | 1 second |
| Cached | Byte | Memory used for caching | 1 second |
| Available | Byte | Memory available for use | 1 second |
| Swap total | Byte | Total swap space available | 1 second |
| Swap free | Byte | Available swap space | 1 second |
| Swap Used | Percentage | Amount of swap space in use | 1 second |
| Virtual total | Byte | Total capacity of virtual memory (physical memory and swap space). | 1 second |
| Virtual used | Byte | Memory that applications use actively, excluding reclaimable buffers and cache | 1 second |
| Virtual free | Byte | Amount of virtual memory available for allocation. | 1 second |
The values are displayed on a graph for a selected time period.
Paging activity
| Metric | Description | Granularity |
|---|---|---|
| Total faults | The total number of page faults, including both minor and major faults when processes access memory not in RAM. | 1 second |
| Major Faults | The number of major page faults that require loading data from disk into memory. | 1 second |
| Paged-in | The number of memory pages that are transferred from disk to physical RAM. | 1 second |
| Paged-out | The number of memory pages that are transferred from physical RAM to disk. | 1 second |
| Swapped-in | The number of memory pages that are transferred from swap space on disk back into physical RAM. | 1 second |
| Swapped-out | The number of memory pages that are transferred from physical RAM to swap space on disk. | 1 second |
By default, paging activity metrics are not collected. You can enable the collection of paging activity metrics by setting the collectPagingActivity to true in the configuration.yaml file.
com.instana.plugin.host:
  collectPagingActivity: true # [true, false]
Open files
The number of open files, when available on the operating system, compared with the maximum number of open files. The values are displayed on a graph for a selected time period.
| Metric | Unit | Description | Granularity |
|---|---|---|---|
| Current | Number | The number of files that are currently open on the system. | 1 second |
| Used | Percentage | The number of open files as a percentage of the maximum number of open files. | 1 second |
Process statistics
By default, process statistics metrics are not collected. You can enable the collection of process statistics metrics by setting the collectSystemProcess to true in the agent configuration.yaml file.
com.instana.plugin.host:
  collectSystemProcess: true # [true, false]
| Metric | Description | Granularity |
|---|---|---|
| Total processes | The total number of processes currently running on the system, including all active, sleeping, stopped, and zombie processes. | 1 minute |
| Blocked state | The number of processes in a blocked state that are waiting for I/O operations to complete, such as disk reads, network responses, or other resource availability. | 1 minute |
| Waiting runtime | The number of processes in the run queue that are waiting for CPU time allocation, indicating processes ready to execute but not currently running on the CPU. | 1 minute |
| Zombie | The number of zombie processes that have completed execution but still have entries in the process table, waiting for their parent process to read status. | 1 minute |
Zombie processes
Zombie processes are processes that have finished executing but whose exit status has not yet been collected by their parent process. These processes do not consume CPU or memory, but they still occupy an entry in the process table.
The zombie processes are shown as a list in the dashboard with the following details:
| Metric | Description | Granularity |
|---|---|---|
| PID | The process ID of the zombie process. | 1 minute |
| PPID | The process ID of the parent process. | 1 minute |
| User | The user who owns the zombie process. | 1 minute |
| State | The current state of the zombie process. | 1 minute |
| Start time | The time or date when the zombie process started. | 1 minute |
| CPU time | The total CPU time consumed by the zombie process. | 1 minute |
| Priority | The scheduling priority assigned to the process (lower values indicate higher priority). | 1 minute |
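To illustrate where such details come from on Linux, the following sketch lists zombie processes by reading /proc/<pid>/stat, where state `Z` marks a zombie. It is not the agent's implementation; the function names are hypothetical, and only the PID, PPID, and process name are extracted.

```python
import os
from typing import List, Tuple


def parse_stat(stat_text: str) -> Tuple[int, str, str, int]:
    """Return (pid, name, state, ppid) from the contents of /proc/<pid>/stat."""
    # The process name is wrapped in parentheses and can itself contain
    # spaces or parentheses, so locate the outermost pair.
    open_paren = stat_text.index("(")
    close_paren = stat_text.rindex(")")
    pid = int(stat_text[:open_paren])
    name = stat_text[open_paren + 1:close_paren]
    rest = stat_text[close_paren + 1:].split()
    return pid, name, rest[0], int(rest[1])


def list_zombies(proc_root: str = "/proc") -> List[Tuple[int, int, str]]:
    """Return (pid, ppid, name) for every process whose state is 'Z'."""
    zombies = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as stat_file:
                pid, name, state, ppid = parse_stat(stat_file.read())
        except (OSError, ValueError):
            continue  # the process exited or its entry is unreadable
        if state == "Z":
            zombies.append((pid, ppid, name))
    return zombies


# Parsing an assumed stat line for a zombie whose parent is PID 1.
print(parse_stat("4242 (my worker) Z 1 4242 4242 0 -1 4227072"))
```

On a Linux host, calling `list_zombies()` returns the current zombies; the parent PID in each tuple identifies the process that has not yet collected the exit status.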
RPC client and server activity
| Metric | Description | Granularity |
|---|---|---|
| Client calls | The number of RPC calls that are initiated by the client to remote servers. | 1 minute |
| Retransmitted calls | The number of RPC calls that were retransmitted due to timeout or network issues. | 1 minute |
| Authentication refreshed | The number of times the client refreshed authentication credentials during RPC operations. | 1 minute |
| Metric | Description | Granularity |
|---|---|---|
| Server calls | The number of RPC calls received and processed by the server. | 1 minute |
| Rejected calls | The number of RPC calls that were rejected by the server due to various reasons. | 1 minute |
| Authentication failures | The number of RPC calls that failed authentication verification. | 1 minute |
| Packets malformed headers | The number of RPC packets received with malformed or corrupted headers. | 1 minute |
| Invalid requests | The number of RPC requests that were invalid or improperly formatted. | 1 minute |
By default, RPC client and server activity metrics are not collected. You can enable the collection of RPC activity metrics by setting the collectRpcActivity to true in the agent configuration.yaml file.
com.instana.plugin.host:
  collectRpcActivity: true # [true, false]
File system
These metrics provide insights into file system performance, capacity, and usage, allowing administrators to monitor and optimize their storage systems effectively.
| Metric | Description | Granularity |
|---|---|---|
| Free disk space | The amount of free space that is available on the file system. | 1 second |
| Leaked | Space that is allocated but not used, considered leaked or wasted. | 1 second |
| Capacity | The total capacity of the file system. | 1 second |
| Used disk percentage | The percentage of space that is used on the file system. | 1 second |
| Inode Usage | The percentage of inodes (data structures describing files and directories) in use. | 1 second |
| Inode Free | The number of free inodes that are available on the file system. | 1 second |
| Bytes Read/s | The number of bytes that are read from the file system per second. | 1 second |
| Bytes Written/s | The number of bytes that are written to the file system per second. | 1 second |
| Reads/s | The number of read operations per second. | 1 second |
| Writes/s | The number of write operations per second. | 1 second |
| Read utilization | The percentage of time that is spent performing read operations. | 1 second |
| Write utilization | The percentage of time that is spent performing write operations. | 1 second |
| Total utilization | The overall usage of the file system, combining read, write, and inode usage. | 1 second |
| Tag | Description |
|---|---|
| Device | The name of the device. |
| Mount | The mount point where the device is attached in the file system hierarchy. |
| Options | The options or parameters that are used while mounting the file system. |
| Type | The type of file system. |
- The Total utilization, Read utilization, and Write utilization datapoints display the disk I/O utilization as a percentage. These datapoints are not supported for Network File Systems (NFS).
- Leaked refers to space that is held by deleted files that are still in use, and equates to capacity - used - free. You can find these files with lsof | grep deleted.
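As a worked example of the leaked-space formula in the notes above, the following sketch applies capacity - used - free to assumed sample values. The function name and the numbers are hypothetical.

```python
def leaked_bytes(capacity: int, used: int, free: int) -> int:
    """Space held by deleted-but-still-open files: capacity - used - free."""
    return capacity - used - free


GIB = 1024 ** 3
# Assumed sample file system: 100 GiB capacity, 60 GiB used, 30 GiB free.
leaked = leaked_bytes(100 * GIB, 60 * GIB, 30 * GIB)
print(leaked // GIB, "GiB leaked")
```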
By default, Instana only monitors local file systems. You can list the file systems that are monitored or excluded in the configuration.yaml file.
The name for the configuration setting is the device name, which you can obtain from the first column of the /etc/mtab file or of the df command output.
You must specify temporary file systems in the following format: tmpfs:/mount/point.
The following example shows the list of file systems that are monitored:
com.instana.plugin.host:
  filesystems:
    - '/dev/sda1'
    - 'tmpfs:/sys/fs/cgroup'
    - 'server:/usr/local/pub'
The following example shows the file systems that are included or excluded:
com.instana.plugin.host:
  filesystems:
    include:
      - '/dev/xvdd'
      - 'tmpfs:/tmp'
      - 'server:/usr/local/pub'
    exclude:
      - '/dev/xvda2'
Network File Systems (NFS)
To monitor all NFS, use the nfs_all: true configuration parameter as shown in the following example:
com.instana.plugin.host:
  nfs_all: true
Disk
The following table outlines the metrics that are related to disks:
| Metric | Description | Granularity | Unit |
|---|---|---|---|
| Device | The name of the disk or partition. | 1 second | N/A |
| Read Time | Average time for read requests to be completed. | 1 second | Milliseconds |
| Write Time | Average time for write requests to be completed. | 1 second | Milliseconds |
| Discard Requests Time | Average time for discard requests to be completed. | 1 second | Milliseconds |
| Flush Requests Time | Average time for flush requests to be completed. | 1 second | Milliseconds |
| Byte Read Rate | The number of bytes that are read per second. | 1 second | Bytes/second |
| Byte Write Rate | The number of bytes that are written per second. | 1 second | Bytes/second |
| Latency | The average time per I/O operation. | 1 second | Milliseconds |
| Throughput | The total number of read and write operations performed per second. | 1 second | IOPS |
| Transfer Rate | The amount of data read and written per second. | 1 second | Bytes/second |
| Read % | The percentage of total disk I/O operations that are read operations. | 1 second | Percentage |
| Write % | The percentage of total disk I/O operations that are write operations. | 1 second | Percentage |
| Read Requests | The number of read operations completed divided by the length of the time period. | 1 second | Requests per second |
| Write Requests | The number of write operations completed divided by the length of the time period. | 1 second | Requests per second |
| Avg Request Queue Length | The average number of I/O requests that are queued for the device. | 1 second | Number |
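The source does not state how the agent computes these values. As an illustration of how several of them relate to each other, the following sketch derives throughput, transfer rate, latency, and read and write percentages from two samples of a /proc/diskstats line. The function names are hypothetical; the field positions and the 512-byte sector unit follow the Linux /proc/diskstats format.

```python
from typing import Dict, NamedTuple, Tuple


class DiskCounters(NamedTuple):
    reads: int            # reads completed
    sectors_read: int     # 512-byte sectors read
    read_ms: int          # milliseconds spent reading
    writes: int           # writes completed
    sectors_written: int  # 512-byte sectors written
    write_ms: int         # milliseconds spent writing


def parse_diskstats_line(line: str) -> Tuple[str, DiskCounters]:
    """Parse one /proc/diskstats line into (device name, counters)."""
    f = line.split()
    counters = DiskCounters(reads=int(f[3]), sectors_read=int(f[5]), read_ms=int(f[6]),
                            writes=int(f[7]), sectors_written=int(f[9]), write_ms=int(f[10]))
    return f[2], counters


def disk_rates(before: DiskCounters, after: DiskCounters, interval_s: float) -> Dict[str, float]:
    """Derive per-interval rates from the difference between two counter samples."""
    reads = after.reads - before.reads
    writes = after.writes - before.writes
    ops = reads + writes
    io_ms = (after.read_ms - before.read_ms) + (after.write_ms - before.write_ms)
    sectors = ((after.sectors_read - before.sectors_read)
               + (after.sectors_written - before.sectors_written))
    return {
        "throughput_iops": ops / interval_s,
        "transfer_rate_bytes_per_s": sectors * 512 / interval_s,
        "latency_ms": io_ms / ops if ops else 0.0,
        "read_percent": 100.0 * reads / ops if ops else 0.0,
        "write_percent": 100.0 * writes / ops if ops else 0.0,
    }


# Two assumed samples for device sda, taken one second apart.
_, first = parse_diskstats_line("8 0 sda 1000 0 80000 500 2000 0 160000 1500 0 0 0")
device, second = parse_diskstats_line("8 0 sda 1030 0 82400 530 2070 0 165600 1570 0 0 0")
rates = disk_rates(first, second, interval_s=1.0)
print(device, rates)
```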
Network interfaces
The following table outlines the network traffic and errors per interface.
| Metric | Description | Granularity |
|---|---|---|
| Interface | The network interface being used for communication. | 60 seconds |
| Mac | The Media Access Control (MAC) address of the network interface. | 60 seconds |
| IPs | The IP addresses assigned to the network interface. | 60 seconds |
| RX Bytes | The total number of bytes that are received by the network interface per second. | 1 second |
| RX Errors | The number of errors that are encountered while receiving data on the network interface. | 1 second |
| TX Bytes | The total number of bytes that are transmitted by the network interface per second. | 1 second |
| TX Errors | The total number of errors that are encountered while transmitting packets on the network interface. | 1 second |
| Received/s | The number of packets that are received by the network interface per second. | 1 second |
| Transmitted/s | The number of packets that are transmitted by the network interface per second. | 1 second |
TCP activity
These metrics provide insights into TCP connection activity, including established connections, segment transmission rates, and error occurrences.
| Metric | Description | Granularity |
|---|---|---|
| Established | The number of established TCP connections. | 1 second |
| Open/s | The number of new TCP connections opened per second. | 1 second |
| In Segments/s | The number of incoming TCP segments per second. | 1 second |
| Out Segments/s | The number of outgoing TCP segments per second. | 1 second |
| Established Resets | The number of established TCP connections that were reset per second. | 1 second |
| Out Resets | The number of outgoing TCP connections that were reset per second. | 1 second |
| Fail | The number of failed TCP connection attempts per second. | 1 second |
| Error | The number of TCP errors per second. | 1 second |
| Retransmission | The number of TCP retransmissions per second. | 1 second |
Process top list
The top process list provides comprehensive insights into running processes, including process identifiers, names, resource consumption metrics, and ownership information. This list is updated every 30 seconds and displays only processes that meet specific resource utilization thresholds: processes consuming more than 10% CPU over the last 30 seconds or processes with memory usage (RSS) exceeding 512 MB.
To generate a unified view combining the top 10 CPU-intensive and top 10 memory-intensive processes, configure combineTopProcesses to true. This configuration includes processes in the combined list regardless of whether they meet the standard thresholds. When a process appears in both the CPU and memory top 10 lists, it is listed only once, resulting in a combined list of up to 20 unique entries.
com.instana.plugin.host:
  combineTopProcesses: true # [true, false]
Linux top semantics are used: 100% CPU refers to full use of a single CPU core. You can search a history of snapshots from the previous month. The normalized CPU value is calculated by dividing the CPU value by the number of logical processors.
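The selection rules and the normalization described above can be sketched as follows. This is an illustration under stated assumptions, not the agent's implementation: the `Proc` type and the function names are hypothetical, and 512 MB is interpreted here as 512 × 1024 × 1024 bytes.

```python
from typing import Dict, List, NamedTuple


class Proc(NamedTuple):
    pid: int
    name: str
    cpu_percent: float  # top semantics: 100.0 means one fully used core
    rss_bytes: int


RSS_THRESHOLD_BYTES = 512 * 1024 * 1024  # assumption: MB read as MiB


def meets_default_thresholds(proc: Proc) -> bool:
    """Default list: more than 10% CPU or more than 512 MB RSS."""
    return proc.cpu_percent > 10.0 or proc.rss_bytes > RSS_THRESHOLD_BYTES


def combine_top_processes(procs: List[Proc], limit: int = 10) -> List[Proc]:
    """Union of the top CPU and top memory consumers, each process listed once."""
    by_cpu = sorted(procs, key=lambda p: p.cpu_percent, reverse=True)[:limit]
    by_memory = sorted(procs, key=lambda p: p.rss_bytes, reverse=True)[:limit]
    combined: Dict[int, Proc] = {}
    for proc in by_cpu + by_memory:
        combined.setdefault(proc.pid, proc)  # keep the first occurrence only
    return list(combined.values())


def normalized_cpu(cpu_percent: float, logical_processors: int) -> float:
    """Divide the top-style CPU value by the number of logical processors."""
    return cpu_percent / logical_processors


processes = [
    Proc(1, "db", 250.0, 100),
    Proc(2, "cache", 5.0, 900 * 1024 * 1024),
    Proc(3, "idle-helper", 1.0, 1),
]
print([p.name for p in combine_top_processes(processes, limit=1)])
print(normalized_cpu(250.0, 4))
```

In this sample, `db` uses 250% CPU in top terms, which is 62.5% when normalized across four logical processors.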
| Metric | Description | Granularity |
|---|---|---|
| PID | The unique identifier that is assigned to each process by the operating system. | 30 seconds |
| Process Name | The name of the process as defined by the application or service. | 30 seconds |
| PPID | The process ID of the parent process that created this process. | 30 seconds |
| UID | The numeric user identifier of the user account that owns and runs the process. | 30 seconds |
| GID | The numeric group identifier associated with the process owner. | 30 seconds |
| Elapsed time | The total time elapsed since the process was started. | 30 seconds |
| CPU | The amount of CPU resources consumed by the process. | 30 seconds |
| CPU (normalized) | The CPU usage of the process divided by the number of logical processors. | 30 seconds |
| Memory | The amount of memory consumed by the process. | 30 seconds |
Extract packages list
You can extract installed packages on an operating system once a day by setting the collectInstalledSoftware to true in the configuration.yaml file.
The following Linux distributions are currently supported:
- Debian-based (dpkg)
- Red Hat-based (rpm and yum)
com.instana.plugin.host:
  collectInstalledSoftware: true # [true, false]
File information attributes
You can obtain the following file attributes for the top 10 files or directories by size from the root (/) directory, by setting the getFileInfo to true in the configuration.yaml file.
| Metric | Description |
|---|---|
| File name | The name of the file or directory. |
| Last accessed time | The date and time of the last file access. |
| Last changed time | The date and time of the last change to a file. |
| Access | This attribute defines the access rights for a file. |
| Type | The type of file (File or Directory). |
| Size | The size of a file, in bytes. |
| Content changed | Indicates whether the file content changes (Yes or No). |
| Owner | The name of the file owner. |
| Group | The name of the logical group to which a file owner belongs. |
com.instana.plugin.host:
  getFileInfo: true # [true, false]
For each sensor, a knowledge base of health signatures is evaluated continuously against the incoming metrics and raises issues or incidents based on user impact.
Built-in events trigger issues or incidents based on failing health signatures on entities, and custom events trigger issues or incidents based on the thresholds of an individual metric of an entity.
For more information about the built-in events for the Host sensor, see Built-in events reference.