date 2018-01-01

Application resiliency



Reliability can't be bolted on to a system after it's built. The system and its components must be designed and implemented with reliability in mind. It's a responsibility shared by everyone who contributes to the software development lifecycle, including the architect, the product owner, the DevOps engineer, and the site reliability engineer.

To build reliability into a software component, you can apply key architectural patterns. The following sections describe the patterns and techniques that you need to understand to build reliable applications. You don't need to use every tool, but you need to know which tools are available and which tool is right for a task.

When you're skilled in these techniques, you can have a meaningful conversation with the product owner about the reliability targets for a service. These techniques can help you gauge whether a certain reliability target is viable and how much effort is required to meet it.



Foundational patterns

First, consider three foundational patterns, each with its own trade-off:

Redundant resources (trade cost): Have redundant resources to avoid single points of failure. Every component can fail, but the system is robust enough that an individual outage can be tolerated.

Degraded results (trade quality): Instead of expecting every transaction to succeed, sometimes it can be tolerable for a business to see some requests fail.

Retry transient failures (trade latency): Trade latency for reliability. Automatic retries form the base of the third technique.
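The second and third patterns can be combined. The following Python sketch is a minimal illustration, not a definitive implementation: it retries a transient failure a few times and then falls back to a degraded result. The names `TransientError`, `call_with_retry`, and the `fallback` parameter are hypothetical and chosen only for this example.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def call_with_retry(
    operation: Callable[[], T],
    fallback: Callable[[], T],
    max_attempts: int = 3,
    delay_seconds: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry a transient failure, then fall back to a degraded result."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt < max_attempts:
                # Each retry adds latency: the trade-off of this pattern.
                sleep(delay_seconds)
    # All attempts failed: return a lower-quality answer instead of an error.
    return fallback()
```

A caller might pass a live lookup as `operation` and a cached or default value as `fallback`. The `sleep` parameter is injectable so that the delay can be replaced in tests.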

Mitigating techniques

Building on the three foundational patterns, additional techniques mitigate some consequences of those trade-offs or help compensate for failures.

Circuit breaker: Detect failures and encapsulate the logic that prevents a failure from constantly recurring during maintenance, temporary external system failure, or unexpected system difficulties.

Sidecar: Enable a service mesh control plane to secure, control, and observe services.

Exponential backoff: Space out repeated requests or retransmissions of the same block of data, often to avoid resource congestion.

Waterfall: Run multiple instances of a transaction and use one result while you discard the other responses.

Partitioning or sharding: Partition a workload into distinct independent parts to improve availability and performance.

Fail static: Limit the number of resources that a service uses so that the service can continue to function, meeting SLAs even during extreme load.

Caching: Store data so that future requests for it can be served faster.

Queuing: Queue requests and process them asynchronously to improve the stability of the system.

Throttling: Limit the number of resources that a service uses.

Load shedding: Deliberately cut off some consumers to protect the system from collapse, as a power grid operator would.

Bulkhead: Isolate application components so that if one fails, it doesn't impact the others.

Waiting room (visitor prioritization): Provide a waiting room experience when your back-end application becomes overloaded.

Compensating transaction: Record all the steps of a workflow and start to undo the operations if a failure occurs.

Event-driven architecture: Integrate services through a publish/subscribe architecture.
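As an illustration of the first technique in this list, here is a minimal, single-threaded circuit breaker sketch in Python. The class name, the `failure_threshold` and `reset_timeout` parameters, and the simplified half-open behavior are assumptions made for this example. A production breaker would also need thread safety and finer-grained failure classification.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying it."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 3,
        reset_timeout: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, operation: Callable[[], T]) -> T:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Open: fail fast instead of hitting the struggling dependency.
                raise CircuitOpenError("circuit is open")
            # Timeout elapsed: half-open, let one trial call through.
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is open, callers get an immediate `CircuitOpenError` that they can translate into a degraded result, which gives the failing dependency time to recover.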

Non-architectural techniques

In addition to the architecture patterns, you can take other approaches to improve the reliability of a system. Typically, these techniques help to manage a system or to understand the behavior of complex systems. While they don't directly improve reliability, these techniques can help you understand the system and infer strategies to improve reliability.

Systems theory: Observe the behavior of a system as a whole, not its individual parts.

Observability: Monitor your service as consumers experience it to quickly detect deviations from the norm, as defined through service level objectives and service level indicators.

Chaos engineering: Inject failure into a system to improve reliability and resilience.

Recoverability: Recover quickly from a disaster in a non-routine way.
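To make the observability item concrete, the following Python sketch computes a simple availability indicator and compares it against an objective. The function names and the notion of an "error budget" (the failures that an objective still permits) are assumptions added for illustration. They are a common convention, not something this article prescribes.

```python
def availability(good_requests: int, total_requests: int) -> float:
    """Service level indicator: fraction of requests that succeeded."""
    if total_requests == 0:
        return 1.0  # no traffic means nothing failed
    return good_requests / total_requests


def error_budget_remaining(
    good_requests: int, total_requests: int, slo_target: float
) -> float:
    """Fraction of the error budget still unspent (negative when breached)."""
    allowed_failures = (1.0 - slo_target) * total_requests
    actual_failures = total_requests - good_requests
    if allowed_failures == 0:
        return 0.0 if actual_failures == 0 else float("-inf")
    return 1.0 - actual_failures / allowed_failures
```

For example, with a 99.9% objective and 10,000 requests, about 10 failures are permitted. If 5 requests failed, roughly half of the budget remains, which is a useful input for a conversation about reliability targets.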



Data resiliency



When a failure occurs with hosted data, data resiliency allows data to remain available to traditional applications and applications that incorporate APIs and other services for analytics.



Choosing the correct set of data resiliency techniques and technologies in the context of an overall business continuity plan is vital, but it can be complex and difficult. How do you reconstruct a steady state? Might the recovery noticeably affect a user or system? You need to be able to establish a sync point, restart, understand anything that was in flight and not recoverable, and back out.

Here are some examples of techniques you might want to consider in your workload resiliency design:

Backups: Traditionally, organizations recover data by using data backups. Backups are usually anchored on single applications. However, when applications are interconnected, backups can face resynchronization challenges. The interval between backups can also create a window of lost data. Depending on the underlying technologies of the data stores, a logging system might not be available to close that gap. Because backups are taken incrementally, it can be cumbersome to recompose an image to restore. Some modern databases are so large that backups aren't even taken.

Snapshots: Snapshots of logical disk units can address a backup that transcends individual applications. However, they might not always work well with various techniques of mirrored data and striped data.

Mirroring: Various techniques of mirroring exist. The mirrored copy can be on the same disk drive or pushed to a remote system. Typically, the operating system handles the mirroring, or replication. Mirroring can be synchronous or asynchronous. With synchronous mirroring, the two copies of the data are identical, but latency limits the physical distance that is permitted between the primary and the secondary locations. Asynchronous mirroring doesn't usually have the distance limitation, but if an unexpected failure occurs, the replication lag might result in data loss. At the hardware level, peer-to-peer remote copy can provide a form of mirroring that enables resource services to provide for a controlled switchover or failover.

Flash copies: A flash copy can provide a fast point-in-time copy of the data. You can use the copy to bring an application online in a separate partition or system. This type of copying can also supplement the ability to complete an offline backup or populate data for non-production systems.

Logical replication: If you use logical replication to build a highly available multi-system environment, be sure to use a transport mechanism that uses synchronous remote journaling. The journal provides a way to replay changes.

Hardware replication: Hardware replication is done at the operating system or disk level instead of at the object level. One advantage over logical replication is that hardware replication works at a lower level. When replication is done synchronously, you're more likely to have identical copies of the data. The disadvantage is that the data is accessible from only one copy, and you can't use the second copy during active replication.

Software replication: Software or database replication is useful when you need to move data to auxiliary systems, such as a data lake or a data warehouse. If you use change data capture (CDC) technology, the data replication software depends on the database providing a logging mechanism.
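Several of the techniques above depend on a journal or log that can be replayed from a known sync point. The following Python sketch is a toy model of that idea, with hypothetical `Primary` and `Replica` classes and in-memory dictionaries standing in for real data stores. It also shows how asynchronous lag translates into changes that would be lost in a failure.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Change = Tuple[int, str, str]  # (sequence number, key, value)


@dataclass
class Primary:
    data: Dict[str, str] = field(default_factory=dict)
    journal: List[Change] = field(default_factory=list)

    def write(self, key: str, value: str) -> None:
        seq = len(self.journal) + 1
        self.journal.append((seq, key, value))  # log the change first
        self.data[key] = value


@dataclass
class Replica:
    data: Dict[str, str] = field(default_factory=dict)
    applied_seq: int = 0  # the sync point: last journal entry applied

    def replay(self, journal: List[Change], up_to: int) -> None:
        """Apply journal entries after the sync point, in order, up to `up_to`."""
        for seq, key, value in journal:
            if self.applied_seq < seq <= up_to:
                self.data[key] = value
                self.applied_seq = seq


def unreplicated_changes(primary: Primary, replica: Replica) -> int:
    """Changes that would be lost if the primary failed right now."""
    return len(primary.journal) - replica.applied_seq
```

In this model, synchronous replication corresponds to calling `replay` after every write so that `unreplicated_changes` is always zero, while asynchronous replication lets the replica fall behind and accepts a nonzero gap.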



Constraint: Network latency & data volume



The two primary enemies of a data resiliency initiative are data volume and network latency. To design for high availability, disaster recovery, or workload reallocation, you must factor in real-world physics.

The time to move data from one location to another depends on distance. The longer the distance, the longer the latency. Sending petabytes of data all at once is likely to clog the available bandwidth. All these constraints are independent of other considerations, such as time to rebuild an index or to create a sync point at the target location.
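A rough back-of-the-envelope calculation makes these constraints tangible. The Python sketch below assumes that signals travel through optical fiber at about 200,000 km per second (roughly two-thirds of the speed of light in a vacuum) and that a link can be driven at a chosen utilization. The function names and parameters are illustrative only.

```python
FIBER_KM_PER_SECOND = 200_000.0  # assumption: light in fiber, about 2/3 of c


def one_way_latency_ms(distance_km: float) -> float:
    """Lower bound on propagation delay; real routes add hops and queuing."""
    return distance_km / FIBER_KM_PER_SECOND * 1000.0


def bulk_transfer_hours(
    data_bytes: float,
    bandwidth_bits_per_second: float,
    utilization: float = 1.0,
) -> float:
    """Time to push a volume of data through a link at a given utilization."""
    effective_bps = bandwidth_bits_per_second * utilization
    return data_bytes * 8 / effective_bps / 3600.0
```

Under these assumptions, 1,000 km of fiber adds at least 5 ms in each direction, and moving 1 petabyte (10^15 bytes) over a fully used 10 Gbit/s link takes about 222 hours, or more than nine days.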

Whether you're moving data in bulk, trickling data via CDC or a message queue, or using a mirroring technology, you need a data strategy, a data topology, and data governance.