Monitoring a Windows host

You can monitor your Windows host with Instana. Instana provides comprehensive insights into the Windows host's performance, health, and resource utilization, enabling efficient troubleshooting, performance optimization, and proactive issue detection.

System information
Interfaces
Reporting status
Performance metrics
Health signatures

System information

Instana retrieves various system details from the host. You can view the following details of the host on the Instana GUI in the System pane:


Parameter	Description
OS	The details of the operating system, the kernel version, and the architecture.
CPU	The details of the CPU and the count.
Memory	The amount of system memory in GiB (gibibytes).
Hostname	The hostname of the host machine.
FQDN	The fully qualified domain name. It is the complete domain name of the host, including the subdomain and top-level domain.
Machine ID	The unique identifier for the host that is generated during the installation of the host distribution.
System ID*	The custom identifier that Instana uses to uniquely represent and manage the monitored host. `System ID` is used for correlation with asset management systems.
Host ID	The MAC address of the host's network interface, which is a unique identifier for the network adapter.
Started At	The time at which the host machine started.

*For Windows, you need to enable System ID by using the agent configuration YAML file as shown in the following example:

"com.instana.plugin.host": 
  "collectSystemId": true

Interfaces

You can find the following details:

Interfaces: The list of network interfaces and IP addresses.
Instana agent: The Instana agent for the host.
Process: The count and details of the processes that are running on the host.

Reporting status

The historical availability of a Windows host is shown in the Reporting Status chart in the Windows host dashboard. You can see three color indicators that identify the status of a host reporting to Instana.


Status	Description	Color indicator
Reporting	The host reported to Instana without any interruptions.	Green
Reporting - monitoring issues	The host reported to Instana with some interruptions (such as network interruptions or agent monitoring issues) and was not fully available.	Orange
Not Reporting	The host was not reporting to Instana at all during this time.	Red

The metric that is used to show this data on the host dashboard is based on the aggregation of messages received from the agent monitoring the host. A host is classified as Reporting if Instana receives at least 98% of the expected messages in a given timeframe.

For example, if the metric aggregation time window is 5 minutes and the poll rate of the host is once per second, Instana expects to receive 300 messages from the host during that timeframe.

If at least 294 messages are received (98% of 300), the host status is shown as Reporting.
If fewer than 294 but more than 0 messages are received, the host status is shown as Reporting - monitoring issues.
If no messages are received, the host status is shown as Not Reporting.
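
The classification rule above can be sketched in a few lines of Python. This is only an illustration of the arithmetic in the example, not Instana's implementation; the function name `classify_reporting_status` and its parameters are hypothetical.

```python
def classify_reporting_status(
    received: int,
    window_seconds: int = 300,       # 5-minute aggregation window from the example
    poll_interval_seconds: int = 1,  # one message per second from the example
    threshold_percent: int = 98,     # share of expected messages needed for "Reporting"
) -> str:
    """Map a received-message count to one of the three reporting states."""
    expected = window_seconds // poll_interval_seconds
    if received <= 0:
        return "Not Reporting"
    # Integer arithmetic avoids floating-point rounding at the 98% boundary.
    if received * 100 >= threshold_percent * expected:
        return "Reporting"
    return "Reporting - monitoring issues"


if __name__ == "__main__":
    for count in (300, 294, 293, 0):
        print(count, classify_reporting_status(count))
```

With the default values, 294 received messages is the lowest count that is still classified as Reporting.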

Performance metrics

The following performance metrics are displayed for the host.

CPU usage - percentage

The CPU usage values, when combined, provide a detailed view of how the CPU resources are being utilized on a host.


Metric	Description	Granularity
CPU Usage	The total CPU usage in percentage for the time range that you set.	1 second

Memory usage


Metric	Description	Granularity
Memory Usage	The total memory usage as a percentage.	1 second

CPU usage - total


Metric	Description	Granularity
User	The amount of CPU time spent running user-space processes (applications and services).	1 second
System	The amount of CPU time spent running kernel-space processes (OS core functions).	1 second
Wait	The amount of CPU time spent waiting for input/output operations to complete.	1 second
Nice	The amount of CPU time spent running processes with a lower priority (nice value).	1 second
Steal	The amount of CPU time lost due to the hypervisor managing other virtual machines or containers on the same physical host.	1 second

Individual CPU usage

The CPU usage metric displays the following metrics in percentage on a graph for a selected time period for each CPU:


Metric	Description	Granularity
User	The amount of CPU time spent running user-space processes (applications and services).	1 second
System	The amount of CPU time spent running kernel-space processes (OS core functions).	1 second
Wait	The amount of CPU time spent waiting for input/output operations to complete.	1 second
Nice	The amount of CPU time spent running processes with a lower priority (nice value).	1 second
Steal	The amount of CPU time lost due to the hypervisor managing other virtual machines or containers on the same physical host.	1 second

Individual GPU usage

The following table outlines the Individual GPU usage values:


Metric	Description	Granularity	Unit
GPU Usage	GPU usage percentage	1 second	%
Temperature	GPU temperature in Celsius	1 second	°C
Encoder	Encoder utilization	1 second	%
Decoder	Decoder utilization	1 second	%
Memory Used	Memory usage	1 second	%
Memory Total	Total GPU memory	1 second	bytes
Transmitted throughput	Transmitted data rate	1 second	bytes/s
Received throughput	Received data rate	1 second	bytes/s

The metrics are collected from nvidia-smi. The following table lists the supported Nvidia graphics cards:


Brand	Model
Tesla	S1070, S2050, C1060, C2050/70, M2050/70/90, X2070/90, K10, K20, K20X, K40, K80, M40, P40, P100, V100
Quadro	4000, 5000, 6000, 7000, M2070-Q, K-series, M-series, P-series, RTX-series
GeForce	Varying levels of support, with fewer metrics available than on the Tesla and Quadro products

Prerequisites

You must install the latest official Nvidia drivers.

For more information about starting a Docker container for Instana Agent with GPU support, see Enable GPU monitoring through Instana Agent container.

Data collection of GPU metrics is designed for minimal impact: polling and querying are split into two processes that use nvidia-smi. The background process is started in loop mode and kept in memory. This approach significantly improves the performance of metrics collection and avoids the overhead of starting nvidia-smi for every query.

The sensor queries GPU metrics based on the configured poll rate (every second by default). The solution enables the sensor to collect accurate and up-to-date metrics every second for multiple GPUs without the overhead.
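
As a rough illustration of loop-mode collection, the following Python sketch builds an `nvidia-smi` query command that repeats every second and parses one line of its CSV output. It is not the sensor's actual implementation: the selected query fields, the helper names `build_command` and `parse_gpu_line`, and the sample line are assumptions made for this example.

```python
# Query fields chosen for illustration; the sensor may request a different set.
FIELDS = ["index", "utilization.gpu", "temperature.gpu", "memory.used", "memory.total"]


def build_command(loop_seconds: int = 1) -> list:
    """Build an nvidia-smi command that prints one CSV line per GPU in a loop."""
    return [
        "nvidia-smi",
        "--query-gpu=" + ",".join(FIELDS),
        "--format=csv,noheader,nounits",
        "--loop=" + str(loop_seconds),
    ]


def parse_gpu_line(line: str) -> dict:
    """Parse one CSV output line into a field -> number mapping.

    Values that are not plain numbers (for example "[N/A]") become None.
    """
    values = [value.strip() for value in line.split(",")]
    if len(values) != len(FIELDS):
        raise ValueError("unexpected number of columns: " + line)
    return {
        field: float(value) if value.replace(".", "", 1).isdigit() else None
        for field, value in zip(FIELDS, values)
    }


if __name__ == "__main__":
    print(" ".join(build_command()))
    # Hypothetical sample line; real values depend on the installed GPU.
    print(parse_gpu_line("0, 35, 52, 1024, 16384"))
```

A long-running reader could start this command once with `subprocess.Popen` and pass each output line to `parse_gpu_line`, which mirrors the idea of keeping one background process in memory instead of starting a new one for every poll.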

GPU Memory/Process

The following datapoints are collected for each process that uses the GPU:


Datapoint	Collected from	Granularity
`Process Name`	`nvidia-smi`	1 second
`PID`	`nvidia-smi`	1 second
`GPU`	`nvidia-smi`	1 second
`Memory`	`nvidia-smi`	1 second

The following table lists the supported Nvidia graphics cards for GPU memory:


Brand	Model
Tesla	S1070, S2050, C1060, C2050/70, M2050/70/90, X2070/90, K10, K20, K20X, K40, K80, M40, P40, P100, V100
Quadro	4000, 5000, 6000, 7000, M2070-Q, K-series, M-series, P-series, RTX-series
GeForce	Varying levels of support, with fewer metrics available than on the Tesla and Quadro products

Memory

The following table outlines the unit for memory:


Metric	Unit	Description	Granularity
Used	Percentage	Amount of memory in use	1 second

The values are displayed on a graph for a selected time period.

File system

These metrics provide insights into file system performance, capacity, and usage, allowing administrators to monitor and optimize their storage systems effectively.


Metric	Description	Granularity
Device	The name of the device.	60 seconds
Options	The options or parameters that are used when mounting the file system.	60 seconds
Free	The amount of free space available on the file system.	1 second
Leaked	Space that has been allocated but not used, considered "leaked" or wasted.	1 second
Type	The type of file system.	60 seconds
Capacity	The total capacity of the file system.	60 seconds
Used	The amount of space used on the file system.	1 second

Datapoint: Filesystem

* The total, read, and write usage datapoint metrics display the disk I/O utilization as a percentage.

* Leaked refers to deleted files that are still in use and equates to capacity - used - free. You can find these files with lsof | grep deleted.

** The Total Utilization, Read Utilization, and Write Utilization datapoints are not supported for Network File Systems (NFS).
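
The Leaked value described above is a simple difference. The following Python sketch works through it with made-up numbers; the function name `leaked_space` and the sample values are hypothetical, and all three inputs are assumed to use the same unit.

```python
def leaked_space(capacity: int, used: int, free: int) -> int:
    """Return the leaked space as capacity - used - free (same unit as the inputs)."""
    return capacity - used - free


if __name__ == "__main__":
    gib = 1024 ** 3
    # Hypothetical file system: 100 GiB capacity, 60 GiB used, 35 GiB free.
    leaked = leaked_space(100 * gib, 60 * gib, 35 * gib)
    print(leaked // gib, "GiB leaked")
```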

By default, Instana only monitors local file systems. You can list the file systems that are monitored or excluded in the configuration.yaml file.

The name for the configuration setting is the device name, which you can obtain from the first column of the mtab file or the df command output.

You must specify temporary file systems in the following format: tmpfs:/mount/point.

The following example shows the list of file systems that are monitored:

com.instana.plugin.host:
  filesystems:
    - '/dev/sda1'
    - 'tmpfs:/sys/fs/cgroup'
    - 'server:/usr/local/pub'

The following example shows the file systems that are included or excluded:

com.instana.plugin.host:
  filesystems:
    include:
      - '/dev/xvdd'
      - 'tmpfs:/tmp'
      - 'server:/usr/local/pub'
    exclude:
      - '/dev/xvda2'

Network File Systems (NFS)

To monitor all NFS mounts, use the nfs_all: true configuration parameter as shown in the following example:

com.instana.plugin.host:
  nfs_all: true

Network interfaces

The following table outlines the network traffic and errors per interface.


Metric	Description	Granularity
Interface	The network interface being used for communication.	60 seconds
Mac	The Media Access Control (MAC) address of the network interface.	60 seconds
IPs	The IP addresses assigned to the network interface.	60 seconds
RX Bytes	The total number of bytes received by the network interface per second.	1 second
RX Errors	The percentage of errors encountered while receiving data on the network interface.	1 second
TX Bytes	The total number of bytes transmitted by the network interface per second.	1 second
TX Errors	The percentage of errors encountered while transmitting data on the network interface.	1 second
Received/s	The number of packets received by the network interface per second.	1 second
Transmitted/s	The number of packets transmitted by the network interface per second.	1 second

TCP activity

These metrics provide insights into TCP connection activity, including established connections, segment transmission rates, and error occurrences.


Metric	Description	Granularity
Established	The number of established TCP connections.	1 second
Open/s	The number of new TCP connections opened per second.	1 second
In Segments/s	The number of incoming TCP segments per second.	1 second
Out Segments/s	The number of outgoing TCP segments per second.	1 second
Established Resets	Percentage of established TCP connections that were reset per second.	1 second
Out Resets	Percentage of outgoing TCP connections that were reset per second.	1 second
Fail	Percentage of failed TCP connection attempts per second.	1 second
Error	Percentage of TCP errors per second.	1 second
Retransmission	Percentage of TCP retransmissions per second.	1 second

Windows services list

Windows services are not monitored by default. This feature is enabled only when winServiceRegex is entered in the configuration.yaml file. The winServiceRegex is a regular expression that is used to monitor services whose service name or display name matches the regular expression. For example, winServiceRegex: '(Sensor|Device)' monitors all services that include Sensor or Device in their service name or display name.


Metric	Description	Granularity
Service Name	Service name	60 seconds
Display Name	Display name	60 seconds
PID	Process ID	60 seconds
State	Service state	60 seconds

The metrics are collected by using the Windows sc queryex command.
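
To see how a pattern such as '(Sensor|Device)' selects services, the following Python sketch applies it to a small sample list. It uses Python's `re` module purely for illustration; the agent's regular expression engine might differ in details, substring matching is assumed from the word "include" in the description, and the helper name `matching_services` and the sample list are hypothetical.

```python
import re


def matching_services(services: list, pattern: str) -> list:
    """Return services whose service name or display name contains a match."""
    regex = re.compile(pattern)
    return [
        service
        for service in services
        if regex.search(service["name"]) or regex.search(service["display_name"])
    ]


if __name__ == "__main__":
    # Illustrative sample data, not output from a real host.
    sample = [
        {"name": "SensrSvc", "display_name": "Sensor Monitoring Service"},
        {"name": "DeviceInstall", "display_name": "Device Install Service"},
        {"name": "Spooler", "display_name": "Print Spooler"},
    ]
    for service in matching_services(sample, "(Sensor|Device)"):
        print(service["name"], "-", service["display_name"])
```

In this sample, SensrSvc is selected only because its display name contains Sensor, which shows why both names are checked.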

Process top list

These metrics offer insights into running processes, including their process ID, name, CPU usage, normalized CPU usage, and memory consumption. The top process list is updated every 30 seconds and contains only the processes with significant system usage. For example, processes with more than 10% CPU usage over the last 30 seconds or with more than 512 MB memory usage (RSS) are displayed in the process top list.

To create a combined list of processes from the top 10 CPU and memory usage lists, set combineTopProcesses to true. The processes are included in the combined list even if their CPU usage is less than 10% or memory usage is less than 512 MB. If the same process is listed in the top 10 CPU and top 10 memory usage lists, it is listed only once in the combined list, which can include up to 20 entries.

com.instana.plugin.host:
  combineTopProcesses: true
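
The combining behavior described above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the agent's code: the function name `combine_top_processes`, the dictionary keys, and the sample processes are hypothetical.

```python
def combine_top_processes(processes: list, top_n: int = 10) -> list:
    """Merge the top-N CPU and top-N memory lists, listing each PID only once."""
    by_cpu = sorted(processes, key=lambda p: p["cpu"], reverse=True)[:top_n]
    by_memory = sorted(processes, key=lambda p: p["memory"], reverse=True)[:top_n]
    combined = {}
    for process in by_cpu + by_memory:
        # setdefault keeps the first occurrence, so duplicates are dropped.
        combined.setdefault(process["pid"], process)
    return list(combined.values())  # at most 2 * top_n entries


if __name__ == "__main__":
    # Hypothetical processes: CPU in percent, memory in MB.
    sample = [
        {"pid": 1, "cpu": 50.0, "memory": 100},
        {"pid": 2, "cpu": 40.0, "memory": 900},
        {"pid": 3, "cpu": 5.0, "memory": 800},
        {"pid": 4, "cpu": 1.0, "memory": 10},
    ]
    print([p["pid"] for p in combine_top_processes(sample, top_n=2)])
```

No CPU or memory threshold is applied in this sketch, which matches the statement that processes appear in the combined list even when they are below 10% CPU or 512 MB.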

Linux top semantics are used: 100% CPU refers to full use of a single CPU core. You can search a history of snapshots from the previous month. The normalized CPU value is calculated by dividing the CPU value by the number of logical processors.
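
As a worked example of the normalization, the following Python sketch divides a top-style CPU value by the number of logical processors. The function name `normalized_cpu` and the sample values are hypothetical.

```python
def normalized_cpu(cpu_percent: float, logical_processors: int) -> float:
    """Divide a top-style CPU value (100% = one full core) by the processor count."""
    if logical_processors <= 0:
        raise ValueError("logical_processors must be positive")
    return cpu_percent / logical_processors


if __name__ == "__main__":
    # A process that fully uses two cores on a host with 8 logical processors.
    print(normalized_cpu(200.0, 8))
```

A process at 200% CPU on a host with 8 logical processors therefore has a normalized CPU value of 25%.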


Metric	Description	Granularity
PID	The unique identifier that is assigned to each process by the operating system.	30 seconds
Process Name	The name of the process as defined by the application or service.	30 seconds
CPU	The amount of CPU resources consumed by the process.	30 seconds
CPU (normalized)	The CPU usage of the process divided by the number of logical processors.	30 seconds
Memory	The amount of memory consumed by the process.	30 seconds

Health signatures

For each sensor, a knowledge base of health signatures is evaluated continuously against the incoming metrics. They are used to raise issues or incidents depending on the user impact.

Built-in events trigger issues or incidents based on failing health signatures on entities, and custom events trigger issues or incidents based on the thresholds of an individual metric of an entity.

For more information about the built-in events for the Host sensor, see Built-in events reference.