Manage transactions in the cloud, Part 2
Making do without distributed transactions
How to ensure transactional qualities in cloud applications
In Part 1 of this series, you learned about transaction processing and distributed transactions. Transactional properties help to reduce the error-handling effort required for an application. However, distributed transactions, which manage transactions across multiple resources, are not necessarily available in cloud-based runtimes. Here in Part 2, we look at the cloud setup, and ways of ensuring transactional qualities across multiple resource managers even without distributed transactions.
In a cloud setup, distributed transactions are frowned upon even more than they are in a classical setup. Resources (resource managers such as databases) can be volatile, as can transaction managers. They can be started or stopped at any time to adjust for different load situations. Restarting a resource manager aborts all transactions currently active in it, but what about transaction managers that are in the middle of the voting phase of a distributed transaction? Without a persistent transaction manager log, transactions are lost in limbo and resources locked for those transactions will never be released by their resource managers.
Also in a cloud setup, as well as in some service-oriented architecture (SOA) setups, the protocols used to call a resource manager, such as REST over HTTP, frequently do not support transactions. With such protocols, a call is equivalent to an autocommitted operation, as described in Part 1.
Therefore, while you could probably set up the application servers in the cloud with a shared, persistent transaction log on a separate storage infrastructure to compensate for volatile operations, many popular communication protocols do not support the use of transactions.
This means that in a cloud setup you have to make do without distributed transactions. The following sections present a number of techniques for coordinating multiple resources in a non-transactional setup.
Ways to handle distributed transactions in the cloud
The underlying assumption of transaction processing is that you want multiple resources to be consistent with each other. A change in one resource should be consistent with a change in another resource. Without distributed transactions, these error-handling patterns can be used for multiple resources:
- A change is performed in only one of the resources. If this situation is acceptable from a functional point of view, transactions can be autocommitted, and the error situation in the other resource can be logged and ignored.
- The change in the first resource is rolled back when an error occurs while changing the second resource. This requires that the transaction on the first resource still be open when the error occurs, so that it can be rolled back. Alternatively, if the transactions on both resources are kept open and the error occurs on the first of the two commits, the other transaction can simply be rolled back. Neither change ever became visible to the outside, so no window of inconsistency appears.
- One resource is changed and, at a later point in time, the second resource is changed to become consistent with the first resource. This could be due to a recoverable error situation that occurs while changing the second resource, but it could also be an architectural pattern that defers the changes in the second resource anyway, such as in batch processing. This pattern creates a (larger) window of inconsistency, but in the end consistency is eventually achieved. Thus, this pattern is referred to as eventual consistency. I call it roll-forward inconsistency as the second change is committed.
- The change in the first resource is compensated when an error occurs while changing the second resource. For this pattern, the first resource needs to provide a compensating transaction to roll back the changes in a separate transaction. This also creates eventual consistency, but the error pattern is different from the roll-forward inconsistency above. I call this a roll-back inconsistency, as the first change is rolled back.
In the next section, I look at the techniques for employing the eventual consistency property. The system will not be immediately consistent, but it should become consistent soon after the transaction is committed.
How can error conditions be reduced so that errors will happen only infrequently, or not at all?
Ignore—and fix later
The easiest approach, of course, is to ignore any error situation and fix it with a batch job later on. However, that may not be as easy as it seems. Consider a shop system with a shopping cart in one database and an order-processing service. If the order service is not available and you ignore it, you need to do the following:
- Maintain a status on the shopping cart that the order has not yet been successfully sent.
- Consider the time needed until the batch runs to fix the situation. If it only runs at night, you may lose a day until order processing happens on the back end, possibly creating dissatisfied customers.
Figure 1 illustrates an example of an eventual consistency batch fixing up inconsistencies.
Figure 1. A batch job repairing inconsistencies
A variation on this theme is to write a "job" to the database, and have a background process continuously attempt to deliver the order until successful. In this case, you would essentially be building your own transaction log and transaction manager.
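The "job" variation can be sketched as follows. This is a minimal illustration, not a production design: it assumes the cart and the job table live in the same database (here an in-memory SQLite database), and the table names, status values, and the `send_order` callable are all hypothetical.

```python
import sqlite3

def create_schema(db):
    db.execute("CREATE TABLE cart (id INTEGER PRIMARY KEY, status TEXT)")
    db.execute("CREATE TABLE job (id INTEGER PRIMARY KEY AUTOINCREMENT, "
               "cart_id INTEGER, done INTEGER DEFAULT 0)")

def check_out(db, cart_id):
    # One local transaction: the cart status and the pending job are
    # written together, so neither exists without the other.
    with db:
        db.execute("INSERT INTO cart (id, status) VALUES (?, 'ORDER_PENDING')",
                   (cart_id,))
        db.execute("INSERT INTO job (cart_id) VALUES (?)", (cart_id,))

def run_worker_once(db, send_order):
    """Try to deliver every open job; return how many were delivered."""
    delivered = 0
    rows = db.execute("SELECT id, cart_id FROM job WHERE done = 0").fetchall()
    for job_id, cart_id in rows:
        try:
            send_order(cart_id)          # remote call, effectively autocommitted
        except ConnectionError:
            continue                     # job stays open; the next run retries
        with db:
            db.execute("UPDATE job SET done = 1 WHERE id = ?", (job_id,))
            db.execute("UPDATE cart SET status = 'ORDER_SENT' WHERE id = ?",
                       (cart_id,))
        delivered += 1
    return delivered
```

A background process would call `run_worker_once` in a loop. If the worker fails after `send_order` succeeds but before the job is marked as done, the order is sent again on the next run, so the order service has to tolerate repeated calls.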
Time-based pessimistic locking
Pessimistic locking is a technique that allows transactions to ensure that the resource they want to change is actually available to them. Pessimistic locking prevents other transactions from modifying a resource (like a database row) and is sometimes even necessary in short-lived technical transactions.
In long-running business transactions like that described in Part 1, an initial short-running transaction changes the state of the resource so that other transactions cannot lock it again. In the credit card example in Part 1, you saw that such techniques can be successfully used at scale in a long-running business transaction.
This technique can also work in a cloud setup, with one caveat: you need to make sure that the lock is released when it is no longer needed. You can do this by successfully committing a change on the object or by rolling back the transaction. In effect, you need to write and process a transaction log in the transaction client or transaction manager to ensure that the lock is released even if other errors have occurred.
If you cannot guarantee that the lock will be released, you can use time-based pessimistic locking—a lock that releases itself after a given time interval—as a last resort. Such a lock can be implemented with a simple timestamp on the resource, so no batch job is needed.
A lock introduces a shared state between the transaction client and the resource manager. The lock needs to be identified and associated with the transaction; therefore, you need to pass either a lock identifier or a transaction identifier between the systems to resolve the lock on commit or rollback.
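A time-based lock with a lock identifier might look like the following sketch. The class name, the `ttl_seconds` parameter, and the injectable `clock` are assumptions made for illustration. In a real resource manager the two fields would typically be columns on the locked row, updated with a conditional write; this in-memory version is not thread-safe.

```python
import time
import uuid

class TimedLock:
    """A lock stored as (owner id, expiry timestamp) on the resource itself."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self.owner = None        # lock identifier handed to the client
        self.expires_at = 0.0

    def acquire(self):
        """Return a new lock id, or None if an unexpired lock exists."""
        now = self.clock()
        if self.owner is not None and now < self.expires_at:
            return None
        # Either the resource was free or the old lock timed out.
        self.owner = uuid.uuid4().hex
        self.expires_at = now + self.ttl
        return self.owner

    def release(self, lock_id):
        """Release on commit or rollback; only the current holder may do so."""
        if self.owner == lock_id and self.clock() < self.expires_at:
            self.owner = None
            return True
        return False             # stale or unknown lock id
```

The expiry check happens when the next client tries to acquire the lock, which is why no cleanup batch is required. A client whose lock has expired gets `False` from `release` and must treat its business transaction as failed.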
In contrast to pessimistic locking, optimistic locking is not an error-avoiding technique; its goal is to detect inconsistencies. While pessimistic locking reserves a resource for a transaction, a more optimistic approach is to assume that there is no inconsistency and to directly perform the local change for a transaction in the first step. However, you then need to be able to roll back the change if the inconsistency does occur. This is a compensation of the optimistically performed operation.
In theory, this is a nice approach, as you only need to call the compensation service operation when the operation is rolled back. However, in a JEE application, for example, a resource manager can decide to roll back when the commit() is attempted, rather than when the change is performed, as some database constraints may only be checked at commit time. The result is that you need to write your own Java resource adapter because only then do you know for sure whether a transaction on another resource has been rolled back or committed. But this once again introduces distributed transactions, which you want to avoid in a cloud setup for the reasons mentioned above. Also, you need to determine what happens when your compensation fails for some reason.
A better approach is to note the failure in some kind of transaction log—for example, a row written to a specific table in a separate (autocommit transaction) operation. You would then let a batch job process the compensation at a later time. However, a problem occurs when the second change is fully processed at the called service, but the caller does not receive a response, as shown in the next section. If the first resource is then compensated, you have generated another inconsistency. If such situations should be avoided, you should check the state of the second resource before trying to compensate the first change.
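The batch step described above, including the check on the second resource, could be sketched like this. The log layout (a list of dictionaries with `tx_id` and `status`), the status values, and the two callables are hypothetical stand-ins for a log table and the service calls of a real system.

```python
def process_compensations(log, second_resource_has, compensate_first):
    """Batch step: work through the logged failures.

    log                 -- list of dicts with 'tx_id' and 'status'
    second_resource_has -- callable(tx_id) -> bool, asks the second
                           resource whether the change arrived after all
    compensate_first    -- callable(tx_id), undoes the first change
    """
    for entry in log:
        if entry["status"] != "FAILED":
            continue
        if second_resource_has(entry["tx_id"]):
            # Only the reply was lost; compensating now would create
            # a new inconsistency, so just confirm the entry.
            entry["status"] = "CONFIRMED"
            continue
        try:
            compensate_first(entry["tx_id"])
            entry["status"] = "COMPENSATED"
        except Exception:
            # Compensation itself failed; keep the entry as FAILED
            # so that the next batch run tries again.
            pass
```

Because a failed compensation leaves the entry untouched, the compensating operation should itself be safe to repeat.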
As you can see here, discussing all the different possible error modes is a tedious but necessary endeavor. Only one more specific situation remains to be discussed: when an operation actually succeeds, but the caller does not receive the reply and assumes an error.
Consider the shopping cart and the order service again. Let's say the order service has been called, but the reply has gone missing on the network. The caller does not know whether the order service call succeeded and tries again. The order should not be processed twice, of course, so what can be done?
Figure 2. Service operations must manage “double submit” problems
Figure 2 illustrates such a situation. This problem is similar to the "double submit" problem in web applications, in which a button is accidentally double-clicked and the same request is sent to the server twice. What if the server (the resource manager) detects this repeated call? In our example, the order service has to identify the second call as a repeated call and return just a copy of the first call's result; the actual order is processed only once, and the order service is repeatable.
Repeatability of a service helps greatly in reducing the complexity of error handling in a setup without distributed transactions. This means that the service is idempotent: You can repeat the service call multiple times and the result will always be as if the service was called only once. But of course, this should only happen to the very same order. A separate order should be processed as a new order, even if it has the same content.
This solution can be implemented by comparing all attributes of a service call. For example, my bank rejects a second transaction with the identical amount, subject, and recipient as the first, just in case I entered a transaction twice into the online banking application by accident.
More robustly, a transaction identifier sent with the request can help detect double submits. In that case, the called service operation registers the transaction ID sent by the caller, and when the same ID is used again, it returns only the original result. However, this introduces shared state (the registered transaction IDs) between the service instances in a cluster: a repeated call that reaches another cluster member must still be processed only once, and keeping that state in sync creates another consistency problem.
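A repeatable service operation keyed by a caller-supplied identifier can be sketched as follows. The class, the `request_id` parameter, and the result layout are hypothetical. The sketch also keeps the registered IDs in one process, so it sidesteps the cluster problem entirely; a clustered service would need a shared store for them.

```python
import itertools

class OrderService:
    """Processes each request ID once and replays the result on repeats."""

    def __init__(self):
        self._results = {}                      # request_id -> first result
        self._order_numbers = itertools.count(1)
        self.processed = []                     # orders actually processed

    def place_order(self, request_id, items):
        if request_id in self._results:
            # Repeated call: return a copy of the first result, do no work.
            return dict(self._results[request_id])
        order_no = next(self._order_numbers)
        self.processed.append((order_no, tuple(items)))
        result = {"order_no": order_no, "status": "ACCEPTED"}
        self._results[request_id] = result
        return dict(result)
```

A caller that lost the reply simply retries with the same `request_id`. A genuinely new order with identical content uses a new identifier and is processed as a separate order.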
Manual transaction and error handling
With the lack of distributed transactions, the system needs to look at each individual operation and decide how to tackle it in a way that ensures consistency. A client that calls multiple resource managers—each possibly in its own transaction—must carefully decide which resource manager to commit first and what to do with the others in case of an error. This is determined by the functional requirements. Non-transactional resources such as web services are even worse because they cannot be rolled back but only compensated. What the resource manager used to do for you in a classical system you now have to do for yourself.
Consider this example from one of my projects: We stored some records in a separate system using a web service, which we needed to reference later. As the local transaction sometimes ran into an error, we had to seriously look at when the error occurred and where to call the compensating operation. In the end, we made the call repeatable and did away with the compensation altogether. If a transaction was rolled back, the next try would write (and overwrite) the same record again, and we had the pointer we needed for referral without any leftovers from previous tries.
In another example, we were forced to keep a reference counter of a master data record's usage in another system, using an increment/decrement web service. Whenever we created an object that referred to the master data record (that is, stored its ID), we had to increment the counter, and we had to decrement it when deleting the object. This caused two problems. First, the call was a web service, so we had to use a Java resource adapter to handle rollbacks. Even then, errors still occurred during rollback (for example, due to network problems). Second (and worse), decrement was not a 100% compensation. If the value reached zero (for example, in a rollback), there was the danger that the master data record would be deleted (which we found out about only afterward, of course). In the end, we only counted up when we created an object, and noted in a separate (transaction) log table whenever we did something with this master data record. A separate nightly batch then fixed the counter value.
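The article does not say how the nightly batch derived the correct value. One plausible reading, shown here purely as an assumption, is that the batch recomputes each counter from the log table. The event names `CREATED` and `DELETED` and the tuple layout are hypothetical.

```python
from collections import Counter

def recompute_reference_counts(usage_log):
    """Derive the reference count per master data record from the log.

    usage_log -- iterable of (record_id, event) tuples, where event is
                 'CREATED' or 'DELETED' (hypothetical event names)
    """
    counts = Counter()
    for record_id, event in usage_log:
        if event == "CREATED":
            counts[record_id] += 1
        elif event == "DELETED":
            counts[record_id] -= 1
    return dict(counts)
```

The batch would then overwrite the counter in the remote system with the recomputed value, so a missed or duplicated increment during the day is corrected overnight.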
Examples like these are common. Even eBay uses a basically transaction-free design. Most of eBay's database statements are auto-commit (which is similar to a web service call); only in certain, well-defined situations are statements combined into a single transaction (on that single database). Distributed transactions are not allowed. (See "Scalability Best Practices: Lessons from eBay.")
In a related scenario for distributed replicated databases, the CAP theorem states that out of the three capabilities—consistency, availability, and partition tolerance—only two can be achieved at any one time. While this is defined for a single, distributed database, it can also be applied to transactions across multiple resources as discussed above.
Relaxing strong consistency and allowing eventual consistency windows is formalized in the "BASE" paradigm:
- Basically Available
- Soft state
- Eventual consistency
Availability is prioritized over strong consistency, which results in a soft state that may not be the latest value, but may eventually be updated. This principle applies to the database area, but is also relevant to web services or microservices that are deployed in the cloud. However, while in a distributed database the database vendor is responsible for keeping the system (eventually) consistent, in an application that uses multiple resources the application must take care of that itself, as shown above.
The new eventual consistency paradigm can even be acceptable in some functional scenarios. See, for example, Google's discussion in "Providing a Consistent User Experience and Leveraging the Eventual Consistency Model to Scale to Large Datasets."
Going further, you can try to find and use algorithms that are safe under eventual consistency. The recently proposed CALM theorem holds that some programs (those that are "monotonic") can be run safely on an eventually consistent basis. For example, if you increment a counter by reading its value, adding 1, and writing the result back, the counter is susceptible to consistency problems; if you instead call an "increment" operation, it is not. See "Eventual Consistency Today: Limitations, Extensions, and Beyond" and "CALM: consistency as logical monotonicity" for more information.
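The counter example can be made concrete with a small simulation. It is only an illustration of the two access styles, not a formal statement of the CALM theorem, and the function names are hypothetical.

```python
from itertools import permutations

def lost_update(start, writers):
    """Read-add-write: every writer reads `start` before any of them writes."""
    value = start
    stale_reads = [value] * writers
    for seen in stale_reads:
        value = seen + 1         # overwrites the previous writer's result
    return value

def replay(start, increments):
    """Increment operations: apply the deltas in the order they arrived."""
    value = start
    for delta in increments:
        value += delta
    return value

def outcomes_over_all_orders(start, increments):
    """Final values a replica can reach, over every delivery order."""
    return {replay(start, order) for order in permutations(increments)}
```

With three concurrent writers, `lost_update(0, 3)` ends at 1 instead of 3 because each write is based on a stale read. By contrast, `outcomes_over_all_orders(0, [1, 2, 5])` contains the single value 8: increments commute, so replicas that receive them in different orders still converge.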
Having to do without the strong consistency models that are guaranteed by distributed transactions can make life harder in some ways for the average programmer. Work that was previously done by the transaction manager must now be done by the programmer, especially in the area of error handling. On the other hand, the programmer's life is also made easier: Programming for eventual consistency from the start makes the program much more scalable and resilient to network delays and problems in a world that is much more distributed than the programmer's own datacenter. Not everyone creates applications for an eBay-sized scale from the start, of course, but distributed transactions can create significant overhead. (However, I confess that I wish I could use at least one database and a message store in a single transaction.)
So how can you get along without distributed transactions? Here are some suggestions:
- In a new system, try to find algorithms that solve your business problem but work under eventual consistency (see the CALM theorem).
- Challenge the business needs on strong consistency requirements.
- Make sure the services you call are adapted to this architecture by being repeatable (idempotent) or by providing proper compensations.
- In an existing system, use the techniques described here to make your algorithms simpler and less prone to failures in the services your system calls.
The cloud makes it easier to quickly deploy new versions or scale your application, but this can also make things more challenging. Using what you have learned here about transaction management should help you create applications that work better in the cloud.
Many thanks to my colleagues Sven Höfgen, Jens Fäustl, and Martin Kunz for reviewing and commenting on this series.
- A Plain English Introduction to CAP Theorem
- ACID vs. BASE: The Shifting pH of Database Transaction Processing
- Scalability Best Practices: Lessons from eBay
- Providing a Consistent User Experience and Leveraging the Eventual Consistency Model to Scale to Large Datasets
- Eventual Consistency Today: Limitations, Extensions, and Beyond
- CALM: consistency as logical monotonicity
- Design Pattern for Eventual Consistency
- Base: An Acid Alternative