Monitoring an AIX host
You can monitor your host with Instana. Instana provides comprehensive insights into the host's performance, health, and resource utilization, enabling efficient troubleshooting, performance optimization, and proactive issue detection.
System information
Instana retrieves various system details from the host. You can view the following details of the host on the Instana GUI in the System pane:
Parameter | Description |
---|---|
OS | The details of the operating system, the kernel version, and the architecture. |
CPU | The details of the CPU and the count. |
Memory | The amount of system memory in GiB (gibibytes). |
Hostname | The hostname of the AIX machine. |
FQDN | The fully qualified domain name. It is the complete domain name of the host, including the subdomain and top-level domain. |
System ID* | The custom identifier that Instana uses to uniquely represent and manage the monitored host. System ID is used for correlation with asset management systems. |
Host ID | The MAC address of the host's network interface, which is a unique identifier for the network adapter. |
Started At | The time at which the machine started. |
*System ID is used for correlation with asset management systems. You need to enable System ID in the agent configuration YAML file, as shown in the following example:
com.instana.plugin.host:
  collectSystemId: true
Interfaces
You can find the following details:
- Interfaces: The list of network interfaces and IP addresses.
- Instana agent: The Instana agent for the host.
- Process: The count and details of the processes that are running on the host.
Performance metrics
The following performance metrics are displayed for the host.
CPU usage - percentage
The CPU usage values, when combined, provide a detailed view of how the CPU resources are being utilized on a host.
To collect more accurate CPU usage in an AIX LPAR environment, you must set `useMpstat` to `true` as shown in the following example:
com.instana.plugin.host:
  useMpstat: true
Datapoint: Filesystem
Metric | Description | Granularity |
---|---|---|
CPU Usage | The total CPU usage in percentage for the time range that you set. | 1 second |
Memory usage
Metric | Description | Granularity |
---|---|---|
Memory Usage | The total memory usage in percentage. | 1 second |
On the AIX LPAR environment, the `used` value is computed as a percentage by using the formula (computational + non-computational) ÷ real total.
Non-computational memory counts as part of used memory, so the `used` value is often relatively high. A high `used` value doesn't necessarily indicate a need for more memory.
Memory over-commitment is determined by comparing computational memory with the real memory in the system. Therefore, the percentage of `computational` memory is more informative for estimating memory usage on AIX.
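The formula can be illustrated with a minimal Python sketch. The function name, argument names, and sample values are assumptions made for illustration; they are not Instana or perfstat field names.

```python
GIB = 1024 ** 3


def memory_percentages(computational: int, non_computational: int, real_total: int) -> dict:
    """Return used and computational memory as percentages of real memory.

    All arguments are hypothetical byte counts chosen for this example.
    """
    if real_total <= 0:
        raise ValueError("real_total must be positive")
    # used = (computational + non-computational) / real total
    used = (computational + non_computational) / real_total * 100
    # Computational share alone, which is the better over-commitment indicator
    comp = computational / real_total * 100
    return {"used": used, "computational": comp}


# Made-up sample: 16 GiB real memory, 6 GiB computational, 8 GiB non-computational
sample = memory_percentages(6 * GIB, 8 * GIB, 16 * GIB)
print(f"used: {sample['used']:.1f}%, computational: {sample['computational']:.1f}%")
# prints "used: 87.5%, computational: 37.5%"
```

In this sample, the `used` value looks high at 87.5%, but computational memory takes up only 37.5% of real memory, so the host is not over-committed.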
Datapoint: AIX perfstat_memory_total interface
Datapoint: Filesystem
Average Run Queue (1h)
Metric | Description | Granularity |
---|---|---|
Average Run Queue (1h) | The average number of processes in the run queue over the last 60 minutes. If the agent is up for less than 60 minutes, it displays `Not collected`. | 60 minutes |
CPU load - average
The `CPU load` metric displays the value on a graph for a selected time period.
To collect accurate CPU load in an AIX LPAR environment, you must set `useMpstat` to `true` as shown in the following example:
com.instana.plugin.host:
  useMpstat: true
Datapoint: Filesystem
Metric | Description | Granularity |
---|---|---|
CPU Load | The average number of processes that are run for the time range that you set. | 5 seconds |
CPU usage - total
Metric | Description | Granularity |
---|---|---|
User | The amount of CPU time spent running user-space processes (applications and services). | 1 second |
System | The amount of CPU time spent running kernel-space processes (OS core functions). | 1 second |
Wait | The amount of CPU time spent waiting for input/output operations to complete. | 1 second |
Nice | The amount of CPU time spent running processes with a lower priority (nice value). | 1 second |
Steal | The amount of CPU time lost due to the hypervisor managing other virtual machines or containers on the same physical host. | 1 second |
Hypervisor | The amount of CPU time spent in the hypervisor. | 1 second |
Used | The amount of CPU usage. | 1 second |
Idle | The duration the CPU was idle. | 1 second |
CPU Events
Metric | Description | Granularity |
---|---|---|
Context Switches | The total number of context switches on a graph for a selected time period. | 1 second |
Device Interrupts | The total number of device interrupts on a graph for a selected time period. | 1 second |
Datapoint: Filesystem
CPU load - peak
Metric | Description | Granularity |
---|---|---|
Load | The peak CPU load. The highest number of processes that are run for the time range that you set. | 1 second |
Individual CPU Usage
The `CPU usage` metric displays the following metrics in percentage on a graph for a selected time period for each CPU:
Metric | Description | Granularity |
---|---|---|
User | The amount of CPU time spent running user-space processes (applications and services). | 1 second |
System | The amount of CPU time spent running kernel-space processes (OS core functions). | 1 second |
Wait | The amount of CPU time spent waiting for input/output operations to complete. | 1 second |
Nice | The amount of CPU time spent running processes with a lower priority (nice value). | 1 second |
Steal | The amount of CPU time lost due to the hypervisor managing other virtual machines or containers on the same physical host. | 1 second |
Datapoint: Filesystem
Memory
The following table outlines the memory metrics and their units:
Metric | Unit | Description | Granularity |
---|---|---|---|
Used | % | The amount of memory in use. | 1 second |
Computational | % | The computational memory usage. | 1 second |
Non-computational | % | The non-computational memory usage. | 1 second |
Computational | Byte | The computational memory size. | 1 second |
Non-computational | Byte | The non-computational memory size. | 1 second |
Real available | Byte | The real available memory. | 1 second |
Swap used | % | The amount of swap space in use. | 1 second |
Virtual used | % | The amount of virtual memory in use. | 1 second |
Swap total | Byte | The total swap space available. | 1 second |
Swap free | Byte | The available swap space. | 1 second |
Virtual total | Byte | The total virtual memory. | 1 second |
Virtual free | Byte | The available virtual memory. | 1 second |
Virtual active | Byte | The active virtual memory. | 1 second |
Page-in per second | Events | The number of page-in events per second. | 1 second |
Page-out per second | Events | The number of page-out events per second. | 1 second |
Page-scan per second | Events | The number of page-scan events per second. | 1 second |
Page-faults per second | Events | The number of page-fault events per second. | 1 second |
Page-reclaims per second | Events | The number of page-reclaim events per second. | 1 second |
You can view all values on a graph for a time period that you select.
Datapoint: AIX perfstat_memory_total interface
System I/O Events
System I/O Events capture low-level interactions between the operating system and hardware, including system calls and read or write operations at both logical and physical levels. These metrics provide visibility into system I/O behavior and help identify performance bottlenecks.
Metric | Description | Granularity |
---|---|---|
Reads/s | The number of read and readv system calls. | 1 second |
Writes/s | The number of write system calls. | 1 second |
Block reads/s | The number of physical block reads. | 1 second |
Block writes/s | The number of physical block writes. | 1 second |
Non block reads/s | The number of physical block reads (synchronous and asynchronous). | 1 second |
Non block writes/s | The number of raw I/O. | 1 second |
Logical block reads/s | The number of logical block reads from system buffers. | 1 second |
Logical block writes/s | The number of logical block writes to system buffers. | 1 second |
Volume groups
A volume group in AIX is a logical storage pool that consists of one or more physical volumes. It provides a flexible way to manage logical volumes and allocate storage dynamically.
The following table provides key details about a volume group:
Metric | Description | Granularity |
---|---|---|
Volume group name | The unique identifier or name that is assigned to the volume group. | 60 seconds |
Total size | The total storage capacity available in the volume group. | 60 seconds |
Used size | The amount of storage that is allocated to logical volumes. | 60 seconds |
Free size | The remaining unallocated storage within the volume group. | 60 seconds |
Volume group state | The current operational status of the volume group. | 60 seconds |
Physical volumes
Physical volumes are raw storage devices (disks or partitions) that are initialized for use in an AIX volume group.
The following table provides key details about a physical volume:
Metric | Description | Granularity |
---|---|---|
Physical volume name | The name that is assigned to the physical volume. | 60 seconds |
Total size | The total storage capacity of the physical volume. | 60 seconds |
Used size | The storage currently allocated on the physical volume. | 60 seconds |
Free size | The unallocated storage remaining on the physical volume. | 60 seconds |
Disks
The following table covers metrics that are related to disks:
Metric | Description | Granularity |
---|---|---|
Disk Name | The name of the disk. | N/A |
Average Disk Transfers | The average amount of data transferred (read or written) to the disk. | 5 seconds |
Busy | The percentage of time a disk has been busy transferring data. | 5 seconds |
Transfer Rate | The number of data transfers per second during a monitoring interval. | 5 seconds |
Read Transfers | The number of read transfers per second during a monitoring interval. | 5 seconds |
Write Transfers | The number of write transfers per second during a monitoring interval. | 5 seconds |
Service Queue Full | The number of times the service queue became full. | 5 seconds |
Transfers | The amount of data transferred (read or written) to the drive. | 5 seconds |
Disk Reads | The amount of data read from the drive. | 5 seconds |
Disk Writes | The amount of data written to the drive. | 5 seconds |
Type | The type of device. | 5 seconds |
Process statistics
Process statistics metrics provide insights into the state and behavior of processes and threads on the system.
Metric | Description | Granularity |
---|---|---|
system | The total number of processes that are running on the system. | 1 second |
runnable | The number of processes that are waiting to be executed. | 1 second |
threads waiting | The number of threads that are waiting for page operations. | 1 second |
execs executed | The number of exec system calls executed during the sampling interval. | 1 second |
forks executed | The number of fork system calls executed during the sampling interval. | 1 second |
stopped | The number of processes in a stopped state. | 1 second |
sleeping | The number of processes in a sleep state. | 1 second |
idle | The number of processes currently in an idle state. | 1 second |
Filesystems
These metrics provide insights into file system performance, capacity, and usage, allowing administrators to monitor and optimize their storage systems effectively.
Metric | Description | Granularity |
---|---|---|
Device | The name of the device. | 60 seconds |
Mounts | The number of times a file system is mounted. | 60 seconds |
Options | The options or parameters used when mounting the file system. | 60 seconds |
Free | The amount of free space available on the file system. | 1 second |
Leaked | Space that has been allocated but not used, considered "leaked" or wasted. | 1 second |
Reads/s | The number of read operations per second. | 1 second |
Writes/s | The number of write operations per second. | 1 second |
Type | The type of file system. | 60 seconds |
Capacity | The total capacity of the file system. | 60 seconds |
Used | The amount of space used on the file system. | 1 second |
Inode Usage | The percentage of inodes (data structures describing files and directories) in use. | 1 second |
Inode Free | The number of free inodes available on the file system. | 1 second |
Bytes Read/s | The number of bytes read from the file system. | 1 second |
Bytes Written/s | The number of bytes written to the file system. | 1 second |
Datapoint: Filesystem
* The total, read, and write usage datapoint metrics display the disk I/O utilization as a percentage.
* `Leaked` refers to deleted files that are still in use and equates to `capacity - used - free`. You can find these files with `lsof | grep deleted`.
** The `Total Utilization`, `Read Utilization`, and `Write Utilization` datapoints are not supported for Network File Systems (NFS).
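As a worked example of the leaked-space arithmetic, the following Python sketch applies `capacity - used - free` to made-up values. The function name and the sample numbers are assumptions for illustration only.

```python
GIB = 1024 ** 3


def leaked_bytes(capacity: int, used: int, free: int) -> int:
    """Space held by deleted files that are still open: capacity - used - free."""
    return capacity - used - free


# Made-up sample: a 100 GiB file system with 60 GiB used and 35 GiB free
leaked = leaked_bytes(capacity=100 * GIB, used=60 * GIB, free=35 * GIB)
print(f"leaked: {leaked / GIB:.1f} GiB")  # prints "leaked: 5.0 GiB"
```

A nonzero result suggests that some process still holds deleted files open, and the space is typically released only when those file handles are closed.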
By default, Instana monitors only local file systems. You can list the file systems that are monitored or excluded in the `configuration.yaml` file.
The name for the configuration setting is the device name, which you can obtain from the first column of the `mtab` file or the `df` command output.
You must specify temporary file systems in the following format: `tmpfs:/mount/point`.
The following example shows the list of file systems that are monitored:
com.instana.plugin.host:
  filesystems:
    - '/dev/sda1'
    - 'tmpfs:/sys/fs/cgroup'
    - 'server:/usr/local/pub'
The following example shows the file systems that are included or excluded:
com.instana.plugin.host:
  filesystems:
    include:
      - '/dev/xvdd'
      - 'tmpfs:/tmp'
      - 'server:/usr/local/pub'
    exclude:
      - '/dev/xvda2'
Network File Systems (NFS)
To monitor all NFS, use the `nfs_all: true` configuration parameter as shown in the following example:
com.instana.plugin.host:
  nfs_all: true
Network interfaces
The following table outlines the network traffic and errors per interface.
Metric | Description | Granularity |
---|---|---|
Interface | The network interface being used for communication. | 60 seconds |
Mac | The Media Access Control (MAC) address of the network interface. | 60 seconds |
IPs | The IP addresses assigned to the network interface. | 60 seconds |
RX Bytes | The total number of bytes received by the network interface per second. | 1 second |
RX Errors | The number of errors encountered while receiving data on the network interface. | 1 second |
TX Bytes | The total number of bytes transmitted by the network interface per second. | 1 second |
Received/s | The number of packets received by the network interface per second. | 1 second |
Transmitted/s | The number of packets transmitted by the network interface per second. | 1 second |
Datapoint: Filesystem
TCP activity
These metrics provide insights into TCP connection activity, including established connections, segment transmission rates, and error occurrences.
Metric | Description | Granularity |
---|---|---|
Established | The number of established TCP connections. | 1 second |
Open/s | The number of new TCP connections opened per second. | 1 second |
In Segments/s | The number of incoming TCP segments per second. | 1 second |
Out Segments/s | The number of outgoing TCP segments per second. | 1 second |
Established Resets | Percentage of established TCP connections that were reset per second. | 1 second |
Out Resets | Percentage of outgoing TCP connections that were reset per second. | 1 second |
Fail | Percentage of failed TCP connection attempts per second. | 1 second |
Error | Percentage of TCP errors per second. | 1 second |
Retransmission | Percentage of TCP retransmissions per second. | 1 second |
Datapoint: Filesystem
Process top list
These metrics offer insights into running processes, including their process ID, name, CPU usage, normalized CPU usage, and memory consumption. The top process list is updated every 30 seconds and the list contains only the processes with system usage. For example, the processes with more than 10% CPU usage over the last 30 seconds or processes with more than 512 MB memory usage (RSS) are displayed in the process top list.
To create a combined list of processes from the top 10 CPU and memory usage lists, set `combineTopProcesses` to `true`. The processes are included in the combined list even if their CPU usage is less than 10% or memory usage is less than 512 MB. If the same process is listed in the top 10 CPU and top 10 memory usage lists, it is listed only once in the combined list, which can include up to 20 entries.
com.instana.plugin.host:
  combineTopProcesses: true
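The combining behavior can be sketched in Python as follows. This is an illustration of the described semantics only, not the agent's implementation; the `Proc` type, the `combined_top` function, and the sample processes are all hypothetical.

```python
from typing import List, NamedTuple


class Proc(NamedTuple):
    pid: int
    name: str
    cpu: float     # percent, where 100 means one full core
    rss_mb: float  # resident set size in MB


def combined_top(procs: List[Proc], n: int = 10) -> List[Proc]:
    """Union of the top-n by CPU and the top-n by memory, without duplicates."""
    by_cpu = sorted(procs, key=lambda p: p.cpu, reverse=True)[:n]
    by_mem = sorted(procs, key=lambda p: p.rss_mb, reverse=True)[:n]
    combined = {}
    for p in by_cpu + by_mem:
        # A process that appears in both lists is kept only once
        combined.setdefault(p.pid, p)
    return list(combined.values())  # at most 2 * n entries


# Made-up sample data
procs = [
    Proc(101, "db", 85.0, 2048.0),
    Proc(102, "web", 40.0, 300.0),
    Proc(103, "cache", 2.0, 900.0),
    Proc(104, "cron", 0.5, 10.0),
]
print([p.name for p in combined_top(procs, n=2)])  # prints "['db', 'web', 'cache']"
```

With `n=2`, the `db` process ranks in both lists but appears once, so the combined list has three entries instead of four.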
Linux `top` semantics are used. 100% CPU refers to full use of a single CPU core, and you can search a history of snapshots from the previous month. The normalized CPU is calculated by dividing the CPU by the number of logical processors.
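The normalization can be shown with a short Python sketch. The function name and the sample values are assumptions for illustration.

```python
def normalized_cpu(cpu_percent: float, logical_processors: int) -> float:
    """Divide top-style CPU usage (100 = one full core) by the logical processor count."""
    if logical_processors < 1:
        raise ValueError("logical_processors must be at least 1")
    return cpu_percent / logical_processors


# Made-up sample: a process that uses 2.5 cores on a host with 8 logical processors
print(normalized_cpu(250.0, 8))  # prints "31.25"
```

The normalized value expresses the process's share of the whole host, so it stays between 0 and 100 regardless of the number of cores.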
Metric | Description | Granularity |
---|---|---|
PID | The unique identifier that is assigned to each process by the operating system. | 30 seconds |
Process Name | The name of the process as defined by the application or service. | 30 seconds |
CPU | The amount of CPU resources consumed by the process. | 30 seconds |
CPU (normalized) | The CPU usage of the process divided by the number of logical processors. | 30 seconds |
Memory | The amount of memory consumed by the process. | 30 seconds |
Datapoint: Filesystem
Health signatures
For each sensor, a knowledge base of health signatures is evaluated continuously against the incoming metrics. They are used to raise issues or incidents depending on the user impact.
Built-in events trigger issues or incidents based on failing health signatures on entities, and custom events trigger issues or incidents based on the thresholds of an individual metric of an entity.
For more information about the built-in events for the Host sensor, see Built-in events reference.
Error report events
On the AIX system, the `errpt` command generates an error report from entries in an error log. The errors in the error report are then captured as events and sent to Instana. The sensor captures permanent and temporary error types, and hardware and software error classes. You need to enable the feature by using the agent `configuration.yaml` file as shown in the following example:
com.instana.plugin.host:
  aixEventsPollRate: 900 # In seconds