Configuration drift is when a network, app, device or other IT system gradually and unintentionally shifts away from its intended baseline settings. Performance issues caused by configuration drift, and subsequent downtime, can cost businesses thousands of dollars per minute.
To some extent, configuration drift is inevitable over the course of a system’s lifecycle. It can be caused by manual changes to a network that affect the way its components interact with each other or automated tools that tweak settings in ways administrators didn’t intend. Without the proper documentation, incompatible or detrimental changes can be made as old administrators leave and new ones join.
A textbook example of configuration drift is the case of an administrator applying a fix to one server in a load-balanced environment but not the others. Even if the system continues operating normally for the time being, problems can occur down the line. The patched server might use a new library that is incompatible with future updates to the network that assume the original conditions, potentially leading to outages and inefficiencies.
Configuration drift doesn’t just pose a threat to performance. Systems that drift away from their intended settings can become more vulnerable to malicious actors and data breaches. For example, if firewall rules are not updated as new resources are added to a network, hackers can sneak right in.
Configuration drift can also affect compliance status. An organization can fail an audit if network documentation describes one set of security settings but the live environment is different.
DevOps professionals and system administrators have tools at their disposal to prevent misconfigurations and fight configuration drift. Infrastructure as code (IaC) tools such as Terraform tether network configuration to a configuration file that serves as a source of truth. Configuration files help ensure that new network resources are automatically provisioned in the proper state, reducing the number of opportunities for drift.
Observability tools give administrators visibility into metrics, logs and traces, helping them spot configuration drift as it happens and apply fixes. Immutable infrastructure limits drift by replacing outdated servers rather than patching them in place. Configuration management tools such as Ansible can also help keep drift in check.
Configuration drift is most commonly caused by manual changes to system configurations, automation going wrong, issues at the organizational level that create inconsistencies or some combination of these factors.
One-off manual fixes are the primary cause of configuration drift.
A “hotfix,” or a remediation applied outside the normal update schedule, can fix immediate, pressing problems such as a server that’s set to the wrong timeout value and therefore keeps crashing. But this type of configuration change can break the system down the road.
The number of ways in which human error, or merely unforeseen consequences, can nudge the system away from the state it “should” be in is almost unlimited. Even small changes can accumulate until the live production environment, riddled with bugs and security risks, bears little resemblance to the configuration stored in the repository.
Without proper testing and oversight, automated updates and processes can cause important network resources to drift away from their intended configurations.
Automation is only as good as its source of truth. For example, if an IaC tool relies on outdated configuration files to spin up new servers, it might end up breaking the environment. Automated software updates to apps and operating systems such as Microsoft Windows might apply at different times across servers, causing potentially harmful divergence. And these updates might not work well with an organization’s unique network architecture, causing further problems.
Even tools meant to manage configuration can cause drift under the right circumstances. For example, connectivity issues might cause Ansible to apply a configuration update unevenly, leaving one server unchanged. That server will gradually diverge from its environment, potentially causing service outages.
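One way to catch a server that missed an update is to compare the servers in a pool against each other. The following is a minimal sketch, not a feature of Ansible or any other tool: it hashes each server's settings and reports the servers that differ from the most common configuration. The server names, setting names and the `find_outliers` helper are all hypothetical.

```python
import hashlib
import json
from collections import Counter


def fingerprint(config: dict) -> str:
    """Return a stable hash of a configuration dictionary."""
    # sort_keys makes the hash independent of key order
    canonical = json.dumps(config, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def find_outliers(pool: dict[str, dict]) -> list[str]:
    """Return names of servers whose config differs from the most common one."""
    prints = {name: fingerprint(cfg) for name, cfg in pool.items()}
    majority, _ = Counter(prints.values()).most_common(1)[0]
    return sorted(name for name, fp in prints.items() if fp != majority)


# Hypothetical pool: an update to timeout_s never reached web-2.
pool = {
    "web-1": {"timeout_s": 60, "max_connections": 200},
    "web-2": {"timeout_s": 30, "max_connections": 200},
    "web-3": {"timeout_s": 60, "max_connections": 200},
}

drifted = find_outliers(pool)
print(drifted)  # ['web-2']
```

Note that this only shows which servers disagree with the majority. The majority is not guaranteed to be the intended state, so a comparison against a declared baseline is still needed.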
At the organizational level, issues with the continuous integration/continuous delivery (CI/CD) pipeline and DevOps practices can lead to configuration problems.
When development, operations and security teams are siloed from each other, confusion, miscommunication and inadequate troubleshooting are inevitable. In addition to divergent technical practices, teams within an organization might have their own practices for change management. Some organizations lack formal change management processes entirely.
Lack of established, clear practices for making and documenting changes can lead to inconsistent change logs, unauthorized changes and unenforced approval workflows. Ultimately, administrators, developers and engineers might circumvent the change management process entirely.
Configuration drift poses significant risks to a system’s security, performance and compliance status.
Configuration drift can significantly increase an organization’s attack surface by creating exceptions to security policies that remain unknown to administrators, and therefore unfixed.
For example, a credential created to apply a hotfix might be left in place, vulnerable to hackers who might use it for malicious purposes. Similarly, an engineer might make an exception to a firewall rule that they never go back and close, significantly weakening the network’s overall security posture. Developers might activate an application with incomplete security controls for testing and never deactivate it, creating another security vulnerability for malicious actors to exploit.
Similarly, every new app, endpoint or other resource added to a system can cause configuration drift if the proper security controls are not applied. For example, adding a new server without properly configuring the endpoint detection and response (EDR) system can create a weak link. A simple mistake in microservices configuration can lead to a large number of unprotected assets entering the network.
Along with cybersecurity, network performance is the most significant—and expensive—risk posed by configuration drift.
Take the example of a server experiencing heavier traffic than its counterparts. This server might have its connection pool size increased with a hotfix to improve performance. Because this server is behind a load balancer, the balancer automatically sets a policy to drive more traffic its way to spread the server load more evenly.
When the server is replaced during a new deployment, the hotfix that increased its pool is no longer in place, and the server crashes due to the additional traffic. The original hotfix applied to speed traffic is the drift. When it is not accounted for, further changes to the network can lead to expensive downtime until the cause is identified.
Configuration drift can cause an organization to fall out of compliance without even being aware of it. When the state of a network diverges from what an organization “thinks” it is doing—or what its documentation says it is doing—the organization runs the risk of noncompliance. Even if the noncompliance is unintentional, the organization might still face fines and fees.
Take the example of the Health Insurance Portability and Accountability Act (HIPAA). HIPAA requires that organizations use certain encryption methods to protect sensitive data in transit and at rest.
Say an administrator needs to integrate a legacy system into their HIPAA-compliant network, and this legacy system uses an outdated encryption method. If this encryption method is not addressed, the integration will render the organization out of compliance with HIPAA.
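A check like the one below can surface this kind of gap before an auditor does. It is only a sketch: the approved list stands in for an organization's own documented policy and is not a statement of what HIPAA itself mandates, and the system names are hypothetical.

```python
# Hypothetical internal policy: protocols the organization's documentation
# says are approved for data in transit.
APPROVED_PROTOCOLS = {"TLS 1.2", "TLS 1.3"}


def unapproved_systems(
    protocols: dict[str, str], approved: set[str] = APPROVED_PROTOCOLS
) -> dict[str, str]:
    """Return the systems whose live protocol is not on the approved list."""
    return {name: proto for name, proto in protocols.items() if proto not in approved}


# Hypothetical inventory of what each system actually uses.
in_use = {
    "patient-portal": "TLS 1.3",
    "records-api": "TLS 1.2",
    "legacy-billing": "SSL 3.0",  # the newly integrated legacy system
}

findings = unapproved_systems(in_use)
print(findings)  # {'legacy-billing': 'SSL 3.0'}
```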
Drift detection—the practice of tracking changes to the network and identifying divergence from its intended state—requires a combination of tools including infrastructure as code, GitOps, immutable infrastructure and observability.
Infrastructure as code, the practice of provisioning and managing IT infrastructure by using scripts rather than manual processes, is one of the most powerful tools for configuration drift management.
IaC helps tackle configuration drift by turning the network’s desired state into a piece of version-controlled code to which every network component can be compared.
For example, in Terraform, when a change is made, the IaC tool compares the state file (the platform’s most up-to-date view of the network) to the declared configuration files—that is, the files that say what the network “should” be. Terraform then resolves discrepancies between the state file and the declared configuration by updating infrastructure to match the configuration file, reducing the chances for drift to sneak in.
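The comparison between declared and recorded state can be illustrated with a small sketch. This is not Terraform's actual algorithm or API. It is a simplified model that assumes both states are plain dictionaries keyed by resource name, and the `plan` function and resource names are hypothetical.

```python
def plan(desired: dict[str, dict], actual: dict[str, dict]) -> dict[str, list[str]]:
    """List the resources to create, update or delete so actual matches desired."""
    shared = desired.keys() & actual.keys()
    return {
        "create": sorted(desired.keys() - actual.keys()),
        "update": sorted(name for name in shared if desired[name] != actual[name]),
        "delete": sorted(actual.keys() - desired.keys()),
    }


# Declared configuration: what the network "should" be.
desired = {
    "web": {"instance_type": "small", "count": 3},
    "db": {"storage_gb": 100},
}

# Recorded state: someone resized "web" by hand and added a cache.
actual = {
    "web": {"instance_type": "large", "count": 3},
    "cache": {"nodes": 1},
}

changes = plan(desired, actual)
print(changes)  # {'create': ['db'], 'update': ['web'], 'delete': ['cache']}
```

An empty plan means the two views agree. Anything else is either an intended change waiting to be applied or drift that needs explaining.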
Imposing strict access control on IaC tools can reduce the opportunities for drift even more. By limiting IaC access to only authorized individuals who need it, organizations limit the ability to change infrastructure configurations in general. And when changes are made, they go through the IaC version control process, further mitigating the risk of drift.
GitOps is a DevOps practice that uses Git, the open-source version control system, as the single source of truth for configuration files. GitOps helps many organizations deploy IaC with maximum efficiency and security.
GitOps practices focus on using automation to validate the state of the network against the desired state stored in Git in real time. GitOps platforms can continuously scan networks, detect misconfigurations and flag them or apply fixes, making any drift that does occur temporary. And because all changes are tied to Git, they are all tracked with an author, timestamp and description.
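The detect-and-correct cycle described above can be sketched as a single reconciliation pass. This is an illustration under simplifying assumptions, not the behavior of any particular GitOps platform: the desired state is a dictionary standing in for what is stored in Git, the live state is another dictionary, and the `reconcile` function and `Correction` record are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Correction:
    """Record of one setting that was reset to its desired value."""
    key: str
    found: object
    restored: object


def reconcile(desired: dict, live: dict) -> list[Correction]:
    """Reset drifted settings in `live` to match `desired`; return what changed."""
    corrections = []
    for key, want in desired.items():
        have = live.get(key)
        if have != want:
            corrections.append(Correction(key, have, want))
            live[key] = want  # settings absent from `desired` are left untouched
    return corrections


desired = {"replicas": 3, "image": "shop:1.8"}  # stand-in for the state in Git
live = {"replicas": 5, "image": "shop:1.8"}     # someone scaled up by hand

first_pass = reconcile(desired, live)
second_pass = reconcile(desired, live)
print(first_pass)   # [Correction(key='replicas', found=5, restored=3)]
print(second_pass)  # []
```

The second pass returns nothing because the first pass already removed the drift. Running the same pass on a schedule is what keeps any drift temporary.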
Immutable infrastructure mitigates configuration drift by dramatically decreasing the overall number of opportunities to change the network’s configuration.
Immutable infrastructure is the practice of replacing, not modifying, servers and other IT resources when changes are needed.
For example, say that a server needs a security update. Instead of applying the update to the existing server, administrators would decommission the server and replace it with a new, updated one.
Immutable infrastructure draws on IaC tools to automatically deploy new systems as described in code when changes are needed. Every new component added to the network automatically matches the desired state.
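The replace-instead-of-modify idea can be modeled in a few lines. In this sketch, which uses hypothetical server and image names, a frozen dataclass stands in for a deployed server: any attempt to edit it in place fails, so the only way to roll out a change is to build a new fleet from the new image.

```python
from dataclasses import FrozenInstanceError, dataclass


@dataclass(frozen=True)
class Server:
    name: str
    image: str  # the versioned image the server was built from


def roll_out(fleet: list[Server], new_image: str) -> list[Server]:
    """Return a replacement fleet built from new_image; old servers are not edited."""
    return [Server(name=s.name, image=new_image) for s in fleet]


old_fleet = [Server("web-1", "base-v1"), Server("web-2", "base-v1")]

# An in-place hotfix is rejected outright.
try:
    old_fleet[0].image = "base-v1-hotfix"
    in_place_allowed = True
except FrozenInstanceError:
    in_place_allowed = False

new_fleet = roll_out(old_fleet, "base-v2")
print(in_place_allowed)               # False
print([s.image for s in new_fleet])   # ['base-v2', 'base-v2']
```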
The three practices of IaC, GitOps and immutable infrastructure are closely intertwined. IaC tools define the images for network components while GitOps facilitates deployment, builds a comprehensive record of the network and prevents discrepancies.
The three pillars of observability (logs, metrics and traces) also have a role to play in preventing configuration drift.
An observability platform, for example, might detect that metrics on one server (such as response times or CPU usage) are significantly diverging from servers that should have identical configurations. This divergence is a potential symptom of drift. Similarly, discrepancies in error rate logs for each server might indicate drift if one server has an abnormally high number of errors of a certain kind. Traces of an application’s call chain might also uncover locations experiencing deviations and drift.
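The metric comparison described above can be approximated with a simple median check. This is a rough sketch, not how any specific observability platform works: the 50% tolerance is an arbitrary assumption, and the server names and latency figures are made up for illustration.

```python
from statistics import median


def divergent_servers(metric: dict[str, float], tolerance: float = 0.5) -> list[str]:
    """Flag servers whose value differs from the fleet median by more than
    `tolerance`, expressed as a fraction of the median."""
    mid = median(metric.values())
    if mid == 0:
        # Avoid dividing by zero: any nonzero value counts as divergent.
        return sorted(name for name, value in metric.items() if value != 0)
    return sorted(
        name for name, value in metric.items()
        if abs(value - mid) / abs(mid) > tolerance
    )


# Hypothetical response times for servers that should be configured identically.
p95_latency_ms = {"web-1": 120.0, "web-2": 118.0, "web-3": 410.0, "web-4": 125.0}

suspects = divergent_servers(p95_latency_ms)
print(suspects)  # ['web-3']
```

A flagged server is only a lead. The divergence might come from uneven traffic or failing hardware rather than drift, so the next step is to compare that server's configuration with its peers.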
Drift detection is the practice of comparing the actual state of a network to the desired state to detect discrepancies. While one can hypothetically carry out this process manually, many cloud and IaC providers offer tools with drift detection functionality, which helps automate and streamline an otherwise time-consuming project.
For example, AWS Config records the configurations of AWS resources, flags anything that deviates from the desired state and helps remediate drift. Terraform’s health assessments verify that actual infrastructure settings match the settings recorded in the workspace’s state file and continuously validate that resources satisfy required checks defined in the system’s configurations.
Terraform Enterprise compares conditions to the state file, or updates the state file to reflect actual conditions, revealing changes. Configuration management tools such as Ansible and Puppet can also be used for drift detection.