Monitoring an AIX host

You can monitor your host with Instana. Instana provides comprehensive insights into the host's performance, health, and resource utilization, enabling efficient troubleshooting, performance optimization, and proactive issue detection.

System information

Instana retrieves various system details from the host. You can view the following details of the host on the Instana GUI in the System pane:

Table 1. System information
Parameter | Description
OS | The details of the operating system, the kernel version, and the architecture.
CPU | The details of the CPU and the count.
Memory | The amount of system memory in GiB (gibibytes).
Hostname | The hostname of the AIX machine.
FQDN | The fully qualified domain name. It is the complete domain name of the host, including the subdomain and top-level domain.
System ID* | The custom identifier that Instana uses to uniquely represent and manage the monitored host. System ID is used for correlation with asset management systems.
Host ID | The MAC address of the host's network interface, which is a unique identifier for the network adapter.
Started At | The time at which the machine started.

*System ID is used for correlation with asset management systems. You need to enable System ID by using the agent configuration YAML file as shown in the following example:

"com.instana.plugin.host": 
  "collectSystemId": true

Interfaces

You can find the following details:

  • Interfaces: The list of network interfaces and IP addresses.
  • Instana agent: The Instana agent for the host.
  • Process: The count and details of the processes that are running on the host.

Performance metrics

The following performance metrics are displayed for the host.

CPU usage - percentage

The CPU usage values, when combined, provide a detailed view of how the CPU resources are being utilized on a host.

To collect more accurate CPU usage in an AIX LPAR environment, you must set useMpstat to true as shown in the following example:

com.instana.plugin.host:
  useMpstat: true

Datapoint: Filesystem

Table 2. CPU usage
Metric | Description | Granularity
CPU Usage | The total CPU usage in percentage for the time range that you set. | 1 second

Memory usage

Table 3. Memory usage
Metric | Description | Granularity
Memory Usage | The total memory usage in percentage. | 1 second

On the AIX LPAR environment, the used value is computed in percentage by using the formula (computational + non-computational) ÷ real total.

Non-computational memory is counted as part of used memory, so the used value is often relatively high. A high used value doesn't necessarily indicate a need for more memory.

The determination of memory over-commitment is based on the comparison between computational memory and the real memory in the system. Therefore, the percentage of computational is more informative for estimating memory usage on AIX.
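The formula above can be sketched as a small calculation. This is only an illustration of the arithmetic, not the sensor's implementation; the function name and the sample values are assumptions made for this example.

```python
def memory_percentages(computational, non_computational, real_total):
    """Return (used %, computational %) relative to real memory.

    All three arguments must use the same unit (for example, bytes).
    """
    if real_total <= 0:
        raise ValueError("real_total must be positive")
    # used % = (computational + non-computational) / real total
    used_pct = (computational + non_computational) / real_total * 100
    # The computational share alone, which the text recommends for
    # judging memory over-commitment.
    computational_pct = computational / real_total * 100
    return used_pct, computational_pct


# Hypothetical LPAR with 16 GiB of real memory (illustrative numbers only).
GIB = 1024 ** 3
used, comp = memory_percentages(6 * GIB, 8 * GIB, 16 * GIB)
print(f"used: {used:.1f}%  computational: {comp:.1f}%")
```

In this made-up case the used value is 87.5%, while the computational share is only 37.5%, which shows why a high used value alone does not indicate a memory shortage.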

Datapoint: AIX perfstat_memory_total interface

Datapoint: Filesystem

Average Run Queue (1h)

Table 6. Average Run Queue
Metric | Description | Granularity
Average Run Queue (1h) | The average number of processes in the run queue over the last 60 minutes. If the agent is up for less than 60 minutes, it displays `Not collected`. | 60 minutes
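The behavior described in the table can be sketched as a rolling average over a 60-minute window. This is a minimal illustration under assumptions: the class name, the one-sample-per-minute cadence, and the way samples are stored are all hypothetical and are not taken from the agent.

```python
from collections import deque


class RunQueueAverage:
    """Rolling average of run queue length over a fixed time window."""

    def __init__(self, window_seconds=3600):
        self.window = window_seconds
        self.samples = deque()   # (timestamp, run_queue_length) pairs
        self.started_at = None   # timestamp of the first sample
        self.latest = None       # timestamp of the most recent sample

    def add(self, timestamp, run_queue_length):
        if self.started_at is None:
            self.started_at = timestamp
        self.latest = timestamp
        self.samples.append((timestamp, run_queue_length))
        # Drop samples that have fallen out of the window.
        while self.samples[0][0] <= timestamp - self.window:
            self.samples.popleft()

    def value(self):
        # Until a full window has elapsed, report the placeholder text.
        if self.started_at is None or self.latest - self.started_at < self.window:
            return "Not collected"
        return sum(v for _, v in self.samples) / len(self.samples)


avg = RunQueueAverage()
for minute in range(0, 30):          # first 30 minutes, queue length 2
    avg.add(minute * 60, 2)
early = avg.value()
for minute in range(30, 91):         # next 61 minutes, queue length 4
    avg.add(minute * 60, 4)
late = avg.value()
print(early, late)
```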

CPU load - average

The CPU load metric displays the value on a graph for a selected time period.

To collect accurate CPU load, in the AIX LPAR environment, you must set useMpstat to true as shown in the following example:

com.instana.plugin.host:
  useMpstat: true

Datapoint: Filesystem

Table 7. CPU Load
Metric | Description | Granularity
CPU Load | The average number of processes that are run for the time range that you set. | 5 seconds

CPU usage - total

Table 8. CPU usage details
Metric | Description | Granularity
User | The amount of CPU time spent running user-space processes (applications and services). | 1 second
System | The amount of CPU time spent running kernel-space processes (OS core functions). | 1 second
Wait | The amount of CPU time spent waiting for input/output operations to complete. | 1 second
Nice | The amount of CPU time spent running processes with a lower priority (nice value). | 1 second
Steal | The amount of CPU time lost due to the hypervisor managing other virtual machines or containers on the same physical host. | 1 second
Hypervisor | The hypervisor. | 1 second
Used | The amount of CPU usage. | 1 second
Idle | The duration the CPU was idle. | 1 second

CPU Events

Table 8. CPU Events
Metric | Description | Granularity
Context Switches | The total number of context switches on a graph for a selected time period. | 1 second
Device Interrupts | The total number of device interrupts on a graph for a selected time period. | 1 second

Datapoint: Filesystem

CPU load - peak

Table 9. CPU Load details
Metric | Description | Granularity
Load | The peak CPU load. The highest number of processes that are run for the time range that you set. | 1 second

Individual CPU Usage

The CPU usage metric displays the following metrics in percentage on a graph for a selected time period for each CPU:

Table 10. Individual CPU Usage
Metric | Description | Granularity
User | The amount of CPU time spent running user-space processes (applications and services). | 1 second
System | The amount of CPU time spent running kernel-space processes (OS core functions). | 1 second
Wait | The amount of CPU time spent waiting for input/output operations to complete. | 1 second
Nice | The amount of CPU time spent running processes with a lower priority (nice value). | 1 second
Steal | The amount of CPU time lost due to the hypervisor managing other virtual machines or containers on the same physical host. | 1 second

Datapoint: Filesystem

Memory

The following table outlines the memory metrics and their units:

Table 11. Memory
Metric | Unit | Description | Granularity
Used | % | The amount of memory in use. | 1 second
Computational | % | The computational memory usage. | 1 second
Non-computational | % | The non-computational memory usage. | 1 second
Computational | Byte | The computational memory size. | 1 second
Non-computational | Byte | The non-computational memory size. | 1 second
Real available | Byte | The real available memory. | 1 second
Swap used | % | The amount of swap space in use. | 1 second
Virtual used | % | The amount of virtual memory in use. | 1 second
Swap total | Byte | The total swap space available. | 1 second
Swap free | Byte | The available swap space. | 1 second
Virtual total | Byte | The total virtual memory. | 1 second
Virtual free | Byte | The available virtual memory. | 1 second
Virtual active | Byte | The active virtual memory. | 1 second
Page-in per second | Events | The number of page-in events per second. | 1 second
Page-out per second | Events | The number of page-out events per second. | 1 second
Page-scan per second | Events | The number of page-scan events per second. | 1 second
Page-faults per second | Events | The number of page-fault events per second. | 1 second
Page-reclaims per second | Events | The number of page-reclaim events per second. | 1 second

You can view all values on a graph for a time period that you select.

Datapoint: AIX perfstat_memory_total interface

System I/O Events

System I/O Events capture low-level interactions between the operating system and hardware, including system calls and read or write operations at both logical and physical levels. These metrics provide visibility into system I/O behavior and help identify performance bottlenecks.

Table 12. System I/O Events
Metric | Description | Granularity
Reads/s | The number of read and readv system calls. | 1 second
Writes/s | The number of write system calls. | 1 second
Block reads/s | The number of physical block reads. | 1 second
Block writes/s | The number of physical block writes. | 1 second
Non block reads/s | The number of physical block reads (synchronous and asynchronous). | 1 second
Non block writes/s | The number of raw I/O. | 1 second
Logical block reads/s | The number of logical block reads from system buffers. | 1 second
Logical block writes/s | The number of logical block writes to system buffers. | 1 second

Volume groups

A volume group in AIX is a logical storage pool that consists of one or more physical volumes. It provides a flexible way to manage logical volumes and allocate storage dynamically.

The following table provides key details about a volume group:

Table 13. Volume Groups
Metric | Description | Granularity
Volume group name | The unique identifier or name that is assigned to the volume group. | 60 seconds
Total size | The total storage capacity available in the volume group. | 60 seconds
Used size | The amount of storage that is allocated to logical volumes. | 60 seconds
Free size | The remaining unallocated storage within the volume group. | 60 seconds
Volume group state | The current operational status of the volume group. | 60 seconds

Physical volumes

Physical volumes are raw storage devices (disks or partitions) that are initialized for use in an AIX Volume Group.

The following table provides key details about a physical volume:

Table 14. Physical volumes
Metric | Description | Granularity
Physical volume name | The name that is assigned to the physical volume. | 60 seconds
Total size | The total storage capacity of the physical volume. | 60 seconds
Used size | The storage currently allocated on the physical volume. | 60 seconds
Free size | The unallocated storage remaining on the physical volume. | 60 seconds

Disks

The following table covers metrics that are related to disks:

Table 15. Disk metrics
Metric | Description | Granularity
Disk Name | The name of the disk. | N/A
Average Disk Transfers | The average amount of data transferred (read or written) to the disk. | 5 seconds
Busy | The percentage of time a disk has been busy transferring data. | 5 seconds
Transfer Rate | The number of data transfers per second during a monitoring interval. | 5 seconds
Read Transfers | The number of read transfers per second during a monitoring interval. | 5 seconds
Write Transfers | The number of write transfers per second during a monitoring interval. | 5 seconds
Service Queue Full | The number of times the service queue became full. | 5 seconds
Transfers | The amount of data transferred (read or written) to the drive. | 5 seconds
Disk Reads | The amount of data read from the drive. | 5 seconds
Disk Writes | The amount of data written to the drive. | 5 seconds
Type | The type of device. | 5 seconds

Process statistics

Process statistics metrics provide insights into the state and behavior of processes and threads on the system.

Table 16. Process statistics metrics
Metric | Description | Granularity
system | The total number of processes that are running on the system. | 1 second
runnable | The number of processes that are waiting to be executed. | 1 second
threads waiting | The number of threads that are waiting for page operations. | 1 second
execs executed | The number of exec system calls executed during the sampling interval. | 1 second
forks executed | The number of fork system calls executed during the sampling interval. | 1 second
stopped | The number of processes in a stopped state. | 1 second
sleeping | The number of processes in a sleep state. | 1 second
idle | The number of processes currently in an idle state. | 1 second

Filesystems

These metrics provide insights into file system performance, capacity, and usage, allowing administrators to monitor and optimize their storage systems effectively.

Table 17. File systems
Metric | Description | Granularity
Device | The name of the device. | 60 seconds
Mounts | The number of times a file system is mounted. | 60 seconds
Options | The options or parameters used when mounting the file system. | 60 seconds
Free | The amount of free space available on the file system. | 1 second
Leaked | Space that has been allocated but not used, considered "leaked" or wasted. | 1 second
Reads/s | The number of read operations per second. | 1 second
Writes/s | The number of write operations per second. | 1 second
Type | The type of file system. | 60 seconds
Capacity | The total capacity of the file system. | 60 seconds
Used | The amount of space used on the file system. | 1 second
Inode Usage | The percentage of inodes (data structures describing files and directories) in use. | 1 second
Inode Free | The number of free inodes available on the file system. | 1 second
Bytes Read/s | The number of bytes read from the file system. | 1 second
Bytes Written/s | The number of bytes written to the file system. | 1 second

Datapoint: Filesystem

* The total, read, and write usage datapoint metrics display the disk I/O utilization as a percentage.

* Leaked refers to deleted files that are still in use and equates to capacity - used - free. You can find these files with lsof | grep deleted.

** The Total Utilization, Read Utilization, and Write Utilization datapoints are not supported for Network File Systems (NFS).
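The Leaked value in the note above is plain arithmetic on three other file system metrics. The following sketch shows that calculation with made-up numbers; the function name is hypothetical and the code is not taken from the sensor.

```python
def leaked_bytes(capacity, used, free):
    """Space that is neither reported as used nor as free.

    Follows the note above: leaked = capacity - used - free.
    All values must use the same unit (bytes here).
    """
    return capacity - used - free


# Illustrative file system: 100 GiB capacity, 60 GiB used, 35 GiB free.
GIB = 1024 ** 3
leaked = leaked_bytes(100 * GIB, 60 * GIB, 35 * GIB)
print(f"leaked: {leaked / GIB:.0f} GiB")
```

With these sample values, 5 GiB is held by files that were deleted but are still open.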

By default, Instana only monitors local file systems. You can list the file systems that are monitored or excluded in the configuration.yaml file.

The name for the configuration setting is the device name, which you can obtain from the first column of the mtab file or the df command output.

You must specify temporary file systems in the following format: tmpfs:/mount/point.

The following example shows the list of file systems that are monitored:

com.instana.plugin.host:
  filesystems:
    - '/dev/sda1'
    - 'tmpfs:/sys/fs/cgroup'
    - 'server:/usr/local/pub'

The following example shows the file systems that are included or excluded:

com.instana.plugin.host:
  filesystems:
    include:
      - '/dev/xvdd'
      - 'tmpfs:/tmp'
      - 'server:/usr/local/pub'
    exclude:
      - '/dev/xvda2'

Network File Systems (NFS)

To monitor all NFS file systems, use the nfs_all: true configuration parameter as shown in the following example:

com.instana.plugin.host:
  nfs_all: true

Network interfaces

The following table outlines the network traffic and errors per interface.

Table 18. Network traffic and errors per interface
Metric | Description | Granularity
Interface | The network interface being used for communication. | 60 seconds
Mac | The Media Access Control (MAC) address of the network interface. | 60 seconds
IPs | The IP addresses assigned to the network interface. | 60 seconds
RX Bytes | The total number of bytes received by the network interface per second. | 1 second
RX Errors | The number of errors encountered while receiving data on the network interface. | 1 second
TX Bytes | The total number of bytes transmitted by the network interface per second. | 1 second
Received/s | The number of packets received by the network interface per second. | 1 second
Transmitted/s | The number of packets transmitted by the network interface per second. | 1 second

Datapoint: Filesystem

TCP activity

These metrics provide insights into TCP connection activity, including established connections, segment transmission rates, and error occurrences.

Table 19. TCP activity
Metric | Description | Granularity
Established | The number of established TCP connections. | 1 second
Open/s | The number of new TCP connections opened per second. | 1 second
In Segments/s | The number of incoming TCP segments per second. | 1 second
Out Segments/s | The number of outgoing TCP segments per second. | 1 second
Established Resets | Percentage of established TCP connections that were reset per second. | 1 second
Out Resets | Percentage of outgoing TCP connections that were reset per second. | 1 second
Fail | Percentage of failed TCP connection attempts per second. | 1 second
Error | Percentage of TCP errors per second. | 1 second
Retransmission | Percentage of TCP retransmissions per second. | 1 second

Datapoint: Filesystem

Process top list

These metrics offer insights into running processes, including their process ID, name, CPU usage, normalized CPU usage, and memory consumption. The process top list is updated every 30 seconds and contains only the processes with significant system usage. For example, processes with more than 10% CPU usage over the last 30 seconds or more than 512 MB memory usage (RSS) are displayed in the process top list.

To create a combined list of processes from the top 10 CPU and memory usage lists, set combineTopProcesses to true. The processes are included in the combined list even if their CPU usage is less than 10% or memory usage is less than 512 MB. If the same process is listed in the top 10 CPU and top 10 memory usage lists, it is listed only once in the combined list, which can include up to 20 entries.

com.instana.plugin.host:
  combineTopProcesses: true
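The two selection modes described above can be sketched as follows. This is an illustration under assumptions, not the agent's code: the dictionary keys (pid, cpu, rss), the function name, and the reading of 512 MB as 512 * 1024 * 1024 bytes are all hypothetical.

```python
CPU_THRESHOLD_PCT = 10.0
RSS_THRESHOLD_BYTES = 512 * 1024 * 1024  # assumption: 512 MB read as MiB


def top_processes(processes, combine=False, n=10):
    """Select processes for the top list.

    Default mode keeps only processes above the CPU or memory threshold.
    Combined mode merges the top-n CPU and top-n memory lists, ignoring
    the thresholds and listing each process only once.
    """
    if not combine:
        return [
            p for p in processes
            if p["cpu"] > CPU_THRESHOLD_PCT or p["rss"] > RSS_THRESHOLD_BYTES
        ]
    by_cpu = sorted(processes, key=lambda p: p["cpu"], reverse=True)[:n]
    by_mem = sorted(processes, key=lambda p: p["rss"], reverse=True)[:n]
    combined = {}
    for p in by_cpu + by_mem:
        combined.setdefault(p["pid"], p)  # a PID in both lists appears once
    return list(combined.values())        # at most 2 * n entries


# Made-up sample data: three quiet processes and one busy one.
sample = [
    {"pid": 101, "cpu": 0.5, "rss": 20 * 1024 * 1024},
    {"pid": 102, "cpu": 1.2, "rss": 40 * 1024 * 1024},
    {"pid": 103, "cpu": 35.0, "rss": 900 * 1024 * 1024},
    {"pid": 104, "cpu": 0.1, "rss": 5 * 1024 * 1024},
]
print([p["pid"] for p in top_processes(sample)])
print([p["pid"] for p in top_processes(sample, combine=True)])
```

In default mode only the busy process passes the thresholds, while combined mode lists all four sample processes, each once.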

Linux top semantics are used. 100% CPU refers to full use of a single CPU core, and you can search a history of snapshots from the previous month. The normalized CPU is calculated by dividing the CPU by the number of logical processors.
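As a worked example of the normalization described above, the following sketch divides a process's CPU value by the number of logical processors. The function name and the sample numbers are assumptions chosen for illustration.

```python
def normalized_cpu(cpu_pct, logical_processors):
    """Scale a top-style CPU value (100% = one full core) to the whole host."""
    if logical_processors < 1:
        raise ValueError("logical_processors must be at least 1")
    return cpu_pct / logical_processors


# A hypothetical process that keeps 2.5 cores busy on a host
# with 8 logical processors.
normalized = normalized_cpu(250.0, 8)
print(f"CPU: 250.0%  normalized: {normalized:.2f}%")
```

With these sample values, a CPU value of 250% corresponds to a normalized value of 31.25%.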

Table 20. Process top list
Metric | Description | Granularity
PID | The unique identifier that is assigned to each process by the operating system. | 30 seconds
Process Name | The name of the process as defined by the application or service. | 30 seconds
CPU | The amount of CPU resources consumed by the process. | 30 seconds
CPU (normalized) | The CPU usage of the process, normalized to a scale. | 30 seconds
Memory | The amount of memory consumed by the process. | 30 seconds

Datapoint: Filesystem

Health signatures

For each sensor, a knowledge base of health signatures is evaluated continuously against the incoming metrics. They are used to raise issues or incidents depending on the user impact.

Built-in events trigger issues or incidents based on failing health signatures on entities, and custom events trigger issues or incidents based on the thresholds of an individual metric of an entity.

For more information about the built-in events for the Host sensor, see Built-in events reference.

Error report events

On the AIX system, the errpt command generates an error report from entries in an error log. The errors in the error report are then captured as events and sent to Instana. The sensor captures permanent and temporary error types, and hardware and software error classes. You need to enable the feature by using the agent configuration.yaml file as shown in the following example:

com.instana.plugin.host:
  aixEventsPollRate: 900 # In seconds
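To illustrate which error report entries match the types and classes named above, the following sketch filters errpt summary lines. It is not the sensor's implementation. It assumes the usual errpt summary layout (IDENTIFIER, TIMESTAMP, T for type, C for class, RESOURCE_NAME, DESCRIPTION) with the one-letter codes P and T for permanent and temporary types and H and S for hardware and software classes. The sample lines and identifiers are made up.

```python
# Type and class codes that the text says the sensor captures.
CAPTURED_TYPES = {"P": "permanent", "T": "temporary"}
CAPTURED_CLASSES = {"H": "hardware", "S": "software"}


def parse_errpt_summary(text):
    """Return the summary entries whose type and class are captured."""
    events = []
    for line in text.splitlines():
        # Split into six columns; the description may contain spaces.
        fields = line.split(None, 5)
        if len(fields) < 6 or fields[0] == "IDENTIFIER":
            continue  # skip blank lines and the header row
        identifier, timestamp, err_type, err_class, resource, description = fields
        if err_type in CAPTURED_TYPES and err_class in CAPTURED_CLASSES:
            events.append({
                "identifier": identifier,
                "timestamp": timestamp,
                "type": CAPTURED_TYPES[err_type],
                "class": CAPTURED_CLASSES[err_class],
                "resource": resource,
                "description": description.strip(),
            })
    return events


# Made-up sample in the assumed summary layout.
SAMPLE = """\
IDENTIFIER TIMESTAMP  T C RESOURCE_NAME  DESCRIPTION
AAAA0001   0102030424 T H hdisk0         DISK OPERATION ERROR
AAAA0002   0102030424 I O RMCdaemon      The daemon is started.
AAAA0003   0102030324 P S SYSPROC        SOFTWARE PROGRAM ABNORMALLY TERMINATED
"""

events = parse_errpt_summary(SAMPLE)
for event in events:
    print(event["type"], event["class"], event["resource"], event["description"])
```

The informational entry is dropped because its type and class are not among those that the sensor is described as capturing.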