Automatic discovery and monitoring
“And at our scale, humans cannot continuously monitor the status of all of our systems.” – Netflix
This is especially true for traditional APM tools, which are primarily used by performance-tuning experts to manually analyze and correlate information to identify bottlenecks and errors in production. At higher scale and dynamics, this task is like finding a needle in a haystack: there are too many moving parts and metrics to correlate.
If a machine-intelligence approach is applied to system management, the core model and data set must be impeccable. Microservice applications are made of hundreds to thousands of building blocks and are constantly evolving. Therefore, it is necessary to understand all of the blocks and their dependencies, which demands an advanced approach to discovery.
The building blocks that application monitoring needs to cover are:
- Datacenter/Availability zones – Zones can be in different continents and regions. They can fail or have different performance characteristics.
- Hosts/Machines – Either physical, virtual, or “as a service”. Each host has resources like CPU, memory, and IO that can be a bottleneck. Each host runs in one zone.
- Containers – Running on top of a host and can be managed by a scheduler like Kubernetes or Apache Mesos.
- Processes – Run inside a container (usually one per container) or directly on the host. Processes can be runtime environments like Java or PHP, but also middleware like Tomcat, Oracle, or Elasticsearch.
- Clusters – Many services can act as a group or cluster so that they appear as one distributed process to the outside world. The number of instances within a cluster can change and can affect the cluster's performance.
- Services – Logical units of work that can have many instances and different versions running on top of the previously mentioned physical building blocks.
- Endpoints – Public API of a service to expose specific commands to the rest of the system.
- Application perspectives (also called Applications) – A perspective on a set of services and endpoints defined by a common context (declared by using tags).
- Traces – A trace is the sequence of synchronous and asynchronous calls between services that together deliver a result for a user request. Transforming data in a data flow can involve many services.
- Calls – A call describes a request between two services. A trace is composed of one or more calls.
- Business services – Compositions of services and applications that deliver unique business value.
- Business processes – A combination of technical traces that form a process. For example, a “buying” trace in e-commerce, followed by an order trace in the ERP system, followed by a trace in FedEx's logistics system as the goods are delivered to the customer.
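The building blocks above form a dependency graph: a service depends on its processes, which depend on containers, hosts, and zones. A minimal sketch of such a model follows; the class names and traversal logic are illustrative assumptions for this article, not Instana's actual schema:

```python
from dataclasses import dataclass, field

# Hypothetical, simplified model of the building blocks described above.
# All names here are illustrative, not Instana's actual data model.

@dataclass
class Component:
    name: str
    kind: str                      # e.g. "zone", "host", "container", "process", "service"
    depends_on: list = field(default_factory=list)

def impacted_by(component, failed):
    """Return True if `component` transitively depends on the `failed` component."""
    stack = list(component.depends_on)
    while stack:
        dep = stack.pop()
        if dep is failed:
            return True
        stack.extend(dep.depends_on)
    return False

# A tiny dependency chain: service -> process -> container -> host -> zone
zone = Component("eu-west-1a", "zone")
host = Component("host-42", "host", [zone])
container = Component("cart-7f3", "container", [host])
process = Component("jvm-1", "process", [container])
service = Component("cart-service", "service", [process])

print(impacted_by(service, zone))   # → True: the service transitively depends on the zone
```

Walking a graph like this is what lets a monitoring backend reason about whether a failing zone or host can actually affect the quality of a service, rather than alerting on each component in isolation.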
It’s not uncommon for thousands of service instances in different versions to run on hundreds of hosts in different zones on more than one continent to provide an application to its users. This creates a network of dependencies between components that must work perfectly together so that the service quality of the application is ensured and the business value delivered. A traditional monitoring tool would alert when a single component crosses a threshold; however, the failure of one or more of these components does not necessarily mean that the quality of the application is affected. Therefore, a modern monitoring tool must understand the whole network of components and their dependencies to monitor, analyze, and predict the quality of service.
Identifying and cataloging change
As described, the number of services and their dependencies is 10–100x higher than in SOA-based applications, which poses a challenge for monitoring tools. And the situation is getting worse: continuous delivery methodology, automation tools, and container platforms dramatically increase the rate of change in applications, making it impossible for humans to keep up with the changes or to continuously reconfigure monitoring tools for newly deployed blocks (for example, a new container spun up by an orchestration tool). Therefore, a modern monitoring solution must automatically and immediately discover each and every block before analyzing and understanding it.
The changes then need to be linked to the previous snapshot so that history is preserved and a model can be reconstructed at any point in time to investigate incidents.
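The snapshot-linking idea can be sketched as an ordered log of change events that is replayed up to a chosen timestamp. The class and method names below are hypothetical, chosen only to illustrate the technique of reconstructing a model at any point in time:

```python
# Illustrative sketch: reconstruct the component model as it was at any
# timestamp by replaying an ordered log of change events. This assumes
# events are recorded in timestamp order; it is not Instana's storage model.

class ChangeLog:
    def __init__(self):
        self._events = []          # list of (timestamp, component, attribute, value)

    def record(self, ts, component, attribute, value):
        self._events.append((ts, component, attribute, value))

    def model_at(self, ts):
        """Replay all changes up to `ts` to rebuild the model as it was then."""
        model = {}
        for t, component, attribute, value in self._events:
            if t > ts:
                break
            model.setdefault(component, {})[attribute] = value
        return model

log = ChangeLog()
log.record(10, "cart-service", "version", "1.0")
log.record(20, "cart-service", "instances", 3)
log.record(30, "cart-service", "version", "1.1")

# State just before the 1.1 deployment:
print(log.model_at(25))   # → {'cart-service': {'version': '1.0', 'instances': 3}}
```

Replaying the log to an arbitrary timestamp is what makes it possible to investigate an incident against the exact configuration that was live when it happened, rather than against the current state.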
Changes can happen in any of the building blocks at any time. See this graphic for examples of changes in each component:
How Instana discovers each and every piece of the puzzle
A key ingredient of the Instana Dynamic APM solution is our agent architecture and, specifically, our use of sensors. Sensors are mini agents – small programs specifically designed to attach to and monitor one thing. They are automatically managed by our single agent (one per host), which is deployed either as a stand-alone process on the host or as a container via the container scheduler.
The agent first automatically detects the physical components: zones in AWS, Docker containers running on the host or in Kubernetes, processes like HAProxy, Nginx, JVM, Spring Boot, Postgres, Cassandra, or Elasticsearch, and clusters of these processes, like a Cassandra cluster. For each component it detects, the agent collects its configuration data and starts monitoring it for changes. It also starts sending important metrics for each component every second, automatically detecting and using metrics already exposed by the services, for example via JMX or Dropwizard.
As a next step, the agent starts to inject trace functionality into the service code. For example, it intercepts HTTP calls, database calls, and queries to Elasticsearch. It captures the context of each call like stack traces or payload.
The intelligence that combines this data into traces, discovers dependencies and services, and detects changes and issues runs on the server. This keeps the agent lightweight, so it can be deployed to thousands of hosts.
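The agent/sensor pattern described in this section can be sketched roughly as follows; the sensor registry and matching rule are simplified assumptions for illustration, not Instana's actual implementation:

```python
# Minimal sketch of the agent/sensor pattern: one agent per host discovers
# running components and attaches a matching mini sensor to each.
# The registry contents and matching rule are illustrative assumptions.

SENSOR_REGISTRY = {
    "java":          "JVM sensor",
    "nginx":         "Nginx sensor",
    "postgres":      "PostgreSQL sensor",
    "elasticsearch": "Elasticsearch sensor",
}

class Agent:
    def __init__(self, registry):
        self.registry = registry
        self.attached = {}         # process name -> sensor

    def discover(self, running_processes):
        """Attach a sensor to every discovered process we have a sensor for."""
        for proc in running_processes:
            sensor = self.registry.get(proc)
            if sensor and proc not in self.attached:
                self.attached[proc] = sensor
        return self.attached

agent = Agent(SENSOR_REGISTRY)
# Processes without a matching sensor (e.g. "cron") are simply ignored:
print(agent.discover(["java", "nginx", "cron"]))
# → {'java': 'JVM sensor', 'nginx': 'Nginx sensor'}
```

Running the discovery loop continuously, rather than once at install time, is what lets the monitoring adapt automatically when an orchestrator spins new containers up or down.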
Automatic, immediate, and continuous discovery is a requirement for the new generation of monitoring solutions. Instana has been fundamentally designed around this requirement.
How Instana collects data
Instana uses a single agent with multiple sensors; we currently support over one hundred sensors. These sensors are not extensions – they are updated, loaded, and unloaded entirely by the agent. An optional command-line interface provides access to the agent state, individual sensors, and agent logs.
A sensor is designed to automatically discover and monitor a specific technology and pass its data to the agent, which manages all communication with the Instana Service Quality Engine. After discovery, the sensor collects the details and metric data needed to provide an accurate representation of the component's state. Each sensor gathers data specific to its technology. Sensors collect the following:
- Configuration – Catalogs current settings and states in order to keep track of any change.
- Events – Initial discovery, all state changes (online and offline), built-in events that trigger issues or incidents based on failing health rules on entities, and custom events that trigger issues or incidents based on thresholds for an individual metric of any given entity.
- Traces – Traces are captured based on the programming language platform.
- Metrics – Quantitative attributes of the technology that indicate performance.
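As a rough illustration of these data categories, a sensor might look like the following sketch; the interface and payload shapes are assumptions for illustration, not Instana's actual sensor API (traces are omitted, since they are injected at the language-platform level):

```python
import time

# Hypothetical sensor interface illustrating the data categories above:
# configuration (with change tracking), events, and metrics.

class Sensor:
    def __init__(self, technology):
        self.technology = technology
        self.config = {}

    def collect_configuration(self, settings):
        """Catalog current settings; return only what changed since last collection."""
        changed = {k: v for k, v in settings.items() if self.config.get(k) != v}
        self.config.update(settings)
        return changed

    def emit_event(self, kind, detail):
        """State changes such as discovery, online, or offline."""
        return {"technology": self.technology, "event": kind, "detail": detail}

    def emit_metric(self, name, value):
        """A quantitative performance attribute, timestamped at collection."""
        return {"technology": self.technology, "metric": name, "value": value,
                "ts": time.time()}

sensor = Sensor("postgres")
sensor.collect_configuration({"max_connections": 100})
# A later collection reports only the setting that drifted:
print(sensor.collect_configuration({"max_connections": 200}))
# → {'max_connections': 200}
```

Reporting only configuration deltas, rather than full snapshots, is what allows the backend to keep a compact history of every change for later incident investigation.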
In addition, discovery is recursive within a sensor. For example, the Java Virtual Machine sensor continues up the stack and discovers frameworks running on it (like Tomcat or Spring Boot), then assists the agent in loading the appropriate additional sensors.
The Instana backend uses streaming technology capable of processing millions of events per second streamed from the agents. The streaming engine is effectively real-time: it takes only three seconds to process a situation and display it to the user.