Change Data Capture and WebSphere eXtreme Scale
Architectural patterns for dealing with stale data in a distributed caching environment
The widespread use of mobile channels has imposed significant pressures on application hosting and middleware infrastructure. Organizations are coming under increasing scrutiny to ensure application availability, scalability, and a rich user experience, especially during peak load.
Figure 1 illustrates a typical deployment model for a simple web application. In reality, though, most mission-critical enterprise applications are much more complex. There are likely to be several other middleware components in the mix, such as security appliances, API gateways, ESBs, a messaging infrastructure, decision servers, and enterprise back end systems, such as SAP or CICS®.
Figure 1. Web application deployment pattern
Distributed caching has been gaining widespread adoption as an effective mechanism to address the demands around availability, scalability, and rich user experience. Nearly all large sites today employ caching of some sort in their implementations.
Some representative points in a deployment architecture where caching can be leveraged are shown in Figure 2. Examples of the various forms of caching include caching of content on the edge of network via content delivery networks, request/response caching, JSP page fragment caching, and data caching.
Figure 2. Reducing application effective path length via caching
We are particularly focused here on caching at the data tier using in-memory data grid technology: IBM WebSphere eXtreme Scale. Figure 3 shows a common logical architecture of a WebSphere eXtreme Scale-based data caching solution where a distributed grid is fronting a data store. A data grid solution can be hosted in Java™ runtimes that are based either on IBM WebSphere Application Server or stand-alone JVMs.
Figure 3. Representative WebSphere eXtreme Scale-based data caching topology
The basic principle behind data caching in WebSphere eXtreme Scale (and data grids in general) is to move frequently used content closer to the consumer. The relative proximity of the data to the consuming application eliminates the latency that would otherwise be incurred in a distributed application architecture. This duplication, however, does introduce an undesirable side effect, popularly known as the stale cache problem. Let’s look at this basic problem and then outline common architectural patterns for dealing with a stale cache in WebSphere eXtreme Scale.
Most enterprise applications that retrieve data from a database perform some business operations on this data, and then persist it back to the database. The database technology permits concurrent access from multiple users; however, it strictly enforces concurrency controls to preserve data integrity. There is only one version of each data element, and the database maintains the most current version (Figure 4).
Figure 4. Classical database scenario
When you introduce a data grid into the application architecture, there are now multiple, potentially divergent versions of a data element: one in the database, and others in the various grids (Figure 5).
Figure 5. Data grid fronting a persistent store
When third-party applications that are not grid-aware update the data store directly, the cache can end up in an inconsistent state. This is also referred to as the stale cache, or dirty read, problem and is illustrated in Figure 6.
Figure 6. Stale cache
Keeping different caches synchronized with the data store and each other is rarely a trivial task. We’ll look at a set of architectural techniques that specifically address the issues around stale caches and dynamic data.
Determining the system-of-record
When designing a solution that involves a data grid, a fundamental architectural choice must be made upfront: will the data grid be the system-of-record (SoR), or will the data source (in most cases, a database) be the SoR?
If the grid is the SoR then all transactions and CRUD operations should happen via the grid. Applications should not update the data store directly. The updates made to the grid are in turn propagated to the data store either synchronously or asynchronously.
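When the grid is the SoR, the essential discipline is that every write goes to the grid first and is propagated from there. A minimal sketch of that write-through flow, using plain Java maps to stand in for the grid and the backing database (all names here are illustrative, not the WebSphere eXtreme Scale API):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Sketch of a grid-as-SoR write-through cache: all CRUD goes to the
 *  grid first, which then propagates the change to the backing store. */
public class WriteThroughGrid {
    private final Map<String, String> grid = new ConcurrentHashMap<>();
    private final Map<String, String> dataStore = new ConcurrentHashMap<>(); // stands in for the database

    public void put(String key, String value) {
        grid.put(key, value);       // the grid is the system-of-record
        dataStore.put(key, value);  // synchronous propagation to the store
    }

    public String get(String key) {
        return grid.get(key);       // reads never touch the database
    }

    public void delete(String key) {
        grid.remove(key);
        dataStore.remove(key);
    }

    /** Exposed only so the sketch can show the store staying in sync. */
    public String storeValue(String key) { return dataStore.get(key); }
}
```

In a real deployment the propagation step would be a WebSphere eXtreme Scale loader writing to the database, synchronously or asynchronously, rather than a second map.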
If the grid is the SoR, direct updates to the data store by third-party applications should generally be discouraged. If such updates must be accommodated, they need to be shepherded within a transaction context to ensure data accuracy, so that grid clients always see the most current version of the data. A common implementation approach is to build a grid-aware persistence layer and ensure, via governance, that all application updates to the database flow through this layer. Figure 7 presents a schematic of this approach.
Figure 7. Grid as SoR – Common persistence Layer for handling database updates
When the grid is not the SoR and the authoritative source of the data is the data store, then special care needs to be taken when dealing with third party application updates. The updates made to the data store need to be propagated to or synchronized with the data grid. This is often referred to as eventual consistency. The window of time between the update of the SoR and the later update of the grid represents a time window during which the cache is stale.
How long the cache is allowed to remain stale is generally configurable. The frequency and duration of cache synchronization operations typically depend on factors such as specific business requirements, the ability to tolerate staleness, and the quantity of data that is changing.
Stale cache handling architectural patterns
When dealing with any software architecture issue, there is usually a core set of common alternative approaches to designing a solution. Let's distill the various approaches to dealing with a stale cache into five architectural patterns, described along with their relative pros and cons in the remainder of this article.
These architectural patterns are:
- Toleration: Do nothing, be aware of data inaccuracy, and just tolerate
- Expiration: Expiration policy-based or time- and event-based
- Poll-based: Polling-based mechanism
- Push-based: Push-based mechanism built on database triggers
- CDC-based: Leveraging Change Data Capture
Toleration: Do nothing, be aware of data inaccuracy, and just tolerate
Staleness of data is implicit in any caching architecture, so it has to be addressed, even if the answer is simply to tolerate it.
The degree or level of toleration obviously varies by use case. Many business use cases do not require absolute data accuracy and can tolerate some staleness. Many organizations that leverage caching refresh the data grid once every 24 hours, while others expire entries after 20 minutes. Examples include a sales catalog, which can be loaded either daily or weekly, and free stock quotes from financial organizations, which can be as much as 20 minutes delayed. As always, there are cost considerations in any implementation. As a general rule of thumb, establish your data accuracy requirements and the maximum tolerable staleness with the business owners, and then choose the most appropriate method.
Expiration: Expiration policy-based or Time and Event-based
Time-based expiration is perhaps the most often used type of policy, where the cache simply clears an entry after a pre-defined interval of time. WebSphere eXtreme Scale uses evictors to remove data from the grid. A default time-to-live (TTL) evictor is created with every dynamic backing map. The evictor removes entries based on the time-to-live setting, which represents the pre-determined interval after which the entry should be deleted. The objectGrid.xml file shown in Figure 8 provides a sample of a TTL evictor configured to expire entries 30 minutes after the last access or update time; last access time and last update time can be used separately or together with an OR. For more examples, see the Configuring evictors with XML files topic in the Knowledge Center.
Figure 8. TTL WebSphere eXtreme Scale XML Configuration
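In the absence of the original listing, a minimal objectGrid.xml along these lines would configure such a TTL evictor. This is a sketch based on the WebSphere eXtreme Scale configuration schema; the grid and map names are placeholders, and attribute details may vary by release:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<objectGridConfig xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://ibm.com/ws/objectgrid/config ../objectGrid.xsd"
    xmlns="http://ibm.com/ws/objectgrid/config">
    <objectGrids>
        <objectGrid name="myGrid">
            <!-- Entries expire 1800 seconds (30 minutes) after last access -->
            <backingMap name="myMap"
                        ttlEvictorType="LAST_ACCESS_TIME"
                        timeToLive="1800" />
        </objectGrid>
    </objectGrids>
</objectGridConfig>
```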
The default TTL evictor uses an eviction policy that is based on time, and the number of entries in the BackingMap has no effect on the expiration time of an entry. You can also use an optional pluggable evictor to evict entries based on the number of entries that exist rather than based on time.
WebSphere eXtreme Scale has two optional pluggable evictors that provide algorithms for deciding which entries to evict when a BackingMap grows beyond some size limit:
- LRUEvictor uses a least recently used (LRU) algorithm.
- LFUEvictor uses a least frequently used (LFU) algorithm.
If the LRU or LFU eviction algorithms are not adequate for a particular application, WebSphere eXtreme Scale provides a facility that enables you to implement a custom evictor based on your specific eviction strategy. (See the Custom evictors topic in the Knowledge Center.)
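To make the size-based eviction idea concrete, here is a self-contained sketch of LRU semantics using the standard `LinkedHashMap` in access order. This illustrates the algorithm behind an LRU evictor, not the LRUEvictor plug-in itself:

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch of LRU eviction: once the map exceeds a size limit,
 *  the least recently used entry is removed automatically. */
public class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    public LruCache(int maxEntries) {
        super(16, 0.75f, true);   // access-order, so get() refreshes recency
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries;   // evict when the size limit is exceeded
    }
}
```

An LFU variant would instead track an access count per entry and evict the entry with the smallest count.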
Poll-based: Polling-based mechanism
As the name implies, this approach periodically polls, or queries, the database to determine whether any changes have occurred since the last load. These changes are then pulled into the grid, either by invalidating the entries that have changed in the underlying data source, or by updating them with the new data values. A time stamp or versioning scheme is commonly used here. For JPA, WebSphere eXtreme Scale supports polling the database for changes using a TimeBasedDBUpdater. (See the Knowledge Center.)
In situations where JPA is not used, a custom application can be written to poll the database for these changes that are then applied to the grid.
Depending on the size of the database, these periodic queries can impose significant additional load via locks on the database. Moreover, data that is deleted from the database can be hard to detect using a query, unless additional techniques that require performing a soft delete are employed. In summary, this approach could add performance overhead and can be a bit complex to implement enterprise-wide, where little influence or governance might exist over third party applications.
Push-based: Push-based mechanism built on database triggers
A push-based notification approach can be employed to deliver changes from the data store to the grid. The database trigger mechanism is a candidate that can be used here to propagate database changes out to the grid. Although this is a common approach, it can be tedious to accomplish as the database triggers need to be implemented in Java. This technique can have performance impacts on the database, especially under peak load, and is often viewed by administrators as a bit intrusive.
Figure 9 shows a synchronous implementation of this approach.
The database trigger code essentially operates synchronously on the grid, as would any other grid client. A commit on a transaction can cause the database trigger to get activated. The trigger, in turn, functions as a WebSphere eXtreme Scale client and performs the relevant CRUD (Create/Read/Update/Delete) operations on the grid.
Figure 9. Grid synchronization via database trigger - Synchronous
An optimization of the above synchronous model is to perform the execution of the trigger logic asynchronously. The WebSphere eXtreme Scale client that gets executed inserts the changed data in a message queue. The consuming application will de-queue the message and perform the relevant grid operation. The consuming application is typically implemented as a message-driven bean (MDB). The transaction semantics of the message consumption should be carefully addressed to avoid transactional and reliability issues, the details of which are beyond the scope of this article. The basic architecture of this asynchronous model employing database triggers is represented in Figure 10.
Figure 10. Grid synchronization via database trigger - Asynchronous
Although this technique is commonly used and advocated, it is necessary to be aware of its drawbacks. A database trigger approach involves code that needs to be developed and maintained as part of the solution. The various idiosyncrasies around transactional execution semantics need to be well understood to avoid introducing data accuracy side effects. Database administrators often perceive database triggers as intrusive, heavy in terms of performance, and potentially risky. Moreover, if the asynchronous model is used, which is very often the case, then all the relevant configuration and operational considerations involving the reliability and semantics of the messaging infrastructure need to be carefully designed, implemented, and monitored.
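The asynchronous trigger model can be sketched with an in-process queue standing in for the messaging infrastructure: the trigger enqueues a change event, and a consumer (the role the MDB plays) dequeues it and updates the grid. All names are illustrative, and a real implementation would use a durable, transactional queue rather than an in-memory one:

```java
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;

/** Sketch of the asynchronous trigger pattern: the database trigger
 *  enqueues a change event; a consumer applies it to the grid later. */
public class TriggerPipeline {
    public static class ChangeEvent {
        final String op, key, value;   // op: INSERT, UPDATE, or DELETE
        ChangeEvent(String op, String key, String value) {
            this.op = op; this.key = key; this.value = value;
        }
    }

    private final BlockingQueue<ChangeEvent> queue = new LinkedBlockingQueue<>();
    private final Map<String, String> grid = new ConcurrentHashMap<>();

    /** Called from the (hypothetical) database trigger; returns immediately. */
    public void onDatabaseChange(String op, String key, String value) {
        queue.add(new ChangeEvent(op, key, value));
    }

    /** Drains pending events and applies them to the grid (the MDB's role). */
    public void drain() {
        ChangeEvent e;
        while ((e = queue.poll()) != null) {
            if ("DELETE".equals(e.op)) grid.remove(e.key);
            else grid.put(e.key, e.value);   // INSERT and UPDATE both upsert
        }
    }

    public String gridValue(String key) { return grid.get(key); }
}
```

The sketch makes the staleness window visible: between `onDatabaseChange` and `drain`, the grid still serves the old value.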
CDC-based: Leveraging Change Data Capture
This pattern is based on the concepts of Change Data Capture. CDC is a set of software design patterns used to determine (and track) that data has changed so that action can be taken using the changed data. Most database systems manage a transaction log that records changes made to the database contents and metadata.
Figure 11 shows a high-level architecture of an IBM InfoSphere Data Replication CDC-based implementation. Based on the source/target systems and qualities of service expected, there are many ways to design a CDC-based topology. A common deployment topology is shown below, where the source and target systems contain the source and target CDC engines, respectively. An InfoSphere CDC engine is also referred to as an InfoSphere CDC instance. The InfoSphere Management console generally runs on a separate host and manages the CDC instances via the InfoSphere Access Server.
Figure 11. CDC architecture overview
The InfoSphere Management Console is used for all configuration, control, and monitoring activities. The Access Server serves as a validation point and enforces access control for Management Console user logins. The source CDC engine sends data changes, and the target engine receives and applies them, as shown above. Metadata associated with a given replication configuration is stored partly locally and partly distributed across the source and target InfoSphere CDC engines. For replication to occur, only the source and target engines are required; the Management Console and Access Server are optional.
InfoSphere CDC detects changes by scraping database logs for changed data on the source. That data is pushed asynchronously from the source engine to the target engine over TCP/IP. No intermediate tables, files, or queues are used. Most InfoSphere CDC engines can serve as a source engine, capturing database changes from the source database, and as a target engine capable of receiving change data and applying it to the designated target database or other destinations, such as DataStage or a JMS queue. A simple and intuitive user interface enables users to determine what data needs to be integrated and what transformations need to be performed on the data before it is applied to the target. Because it interacts only with the database logs, InfoSphere CDC puts no additional load on the source database and requires no changes to the source application.
The WebSphere eXtreme Scale CDC adapter leverages the replication framework of InfoSphere CDC described above. The adapter is a configurable plug-in that transforms the change capture events from InfoSphere CDC into key-value objects and then routes them to the configured WebSphere eXtreme Scale data grid, as shown in Figure 12. The target grid and data mapping information is supplied via a user-defined configuration file. The details of the XML file format can be found in the Knowledge Center.
Figure 12. WebSphere eXtreme Scale CDC Adapter replication landscape
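The adapter's transform step can be sketched as a mapper that turns a captured row change, plus user-supplied mapping configuration, into a key-value operation for a target map. This is a conceptual illustration; the class, field, and column names are all hypothetical, not the adapter's actual API:

```java
import java.util.Map;

/** Sketch of the CDC adapter's transform step: a captured row change plus
 *  a configured mapping (key column, value column, target map) yields a
 *  key-value operation to route to the grid. */
public class CdcTransform {
    public static class GridOp {
        public final String mapName, key, value;
        GridOp(String mapName, String key, String value) {
            this.mapName = mapName; this.key = key; this.value = value;
        }
    }

    private final String keyColumn, valueColumn, targetMap;

    public CdcTransform(String keyColumn, String valueColumn, String targetMap) {
        this.keyColumn = keyColumn;
        this.valueColumn = valueColumn;
        this.targetMap = targetMap;
    }

    /** rowImage is the after-image of the changed database row. */
    public GridOp transform(Map<String, String> rowImage) {
        return new GridOp(targetMap, rowImage.get(keyColumn), rowImage.get(valueColumn));
    }
}
```

In the real adapter, this mapping is driven by the user-defined XML configuration file mentioned above rather than constructor arguments.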
The WebSphere eXtreme Scale CDC adapter is available only on data grids at WebSphere eXtreme Scale V8.6.0.7 or later, and requires that the eXtremeIO option be enabled for the target data grid.
Benefits of the CDC-based grid synchronization pattern include:
- Near-real time data accuracy is provided with no changes required to the source application.
- A lightweight, small footprint, and low-impact process that puts no additional load on the source database.
- Out-of-the-box support for various caching topologies, both side-cache and in-line cache.
- A highly-available solution that can function in both synchronous and asynchronous manners.
- Easy enablement on InfoSphere Data Replication-supported platforms, such as z/OS®, Oracle®, and Informix.
- Lower total cost of ownership relative to other functionally comparable approaches.
The adapter can function in both a synchronous and asynchronous fashion. The WebSphere eXtreme Scale CDC adapter can be configured in three operational modes (Figure 13):
- Push: In this mode, all database operations are unidirectional, flowing from the database to the grid. Insert-Update-Delete operations flow from the database to the grid. A typical use case would be where a fully-populated cache is used to serve reads or queries, but requires that underlying database changes be applied to the grid in a timely manner for enhanced data accuracy.
- Invalidate: This mode is used when the database operations are bi-directional where, for example, updates to the cache are propagated to the database via a WebSphere eXtreme Scale loader, or database changes originate from third-party applications. Given this bi-directional nature, it is necessary to prevent a circular condition: updates made by the WebSphere eXtreme Scale loader must not generate change events that in turn attempt to re-apply those same updates back to the grid.
- Refresh: This is the same as the Invalidate mode, with the exception being that the update change event drives a get() operation on the grid after the invalidate() operation. The purpose of the get() operation is more of an optimization that attempts to prevent a subsequent cache miss for the relevant key, and is generally used where cache pre-loading is desirable.
Figure 13. WebSphere eXtreme Scale CDC Adapter (V8.6.0.7)
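The difference between the three modes comes down to what an update event does to the cached entry. The following sketch illustrates that behavior against a local map, with a function standing in for the loader fetch that a Refresh-mode get() would trigger (names are illustrative, not the adapter's API):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Sketch of the three CDC adapter modes applied to a cache map.
 *  'loadFromDb' stands in for the loader fetch behind a Refresh get(). */
public class AdapterModes {
    public enum Mode { PUSH, INVALIDATE, REFRESH }

    private final Map<String, String> cache = new HashMap<>();
    private final Function<String, String> loadFromDb;

    public AdapterModes(Function<String, String> loadFromDb) {
        this.loadFromDb = loadFromDb;
    }

    public void onUpdateEvent(Mode mode, String key, String newValue) {
        switch (mode) {
            case PUSH:        // apply the captured change directly to the grid
                cache.put(key, newValue);
                break;
            case INVALIDATE:  // drop the stale entry; the next read misses
                cache.remove(key);
                break;
            case REFRESH:     // invalidate, then re-fetch to avoid a later miss
                cache.remove(key);
                cache.put(key, loadFromDb.apply(key));
                break;
        }
    }

    public String get(String key) { return cache.get(key); }
}
```

Push suits a fully populated read cache; Invalidate and Refresh suit bi-directional topologies where the database value, not the captured event, must win.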
Let's look at a sample caching architectural scenario.
There is a large class of applications that need to query a data set to support the current informational needs of the business. It is very common that these large systems-of-record reside on expensive database and hardware platforms, such as IBM z/OS and Oracle RAC. Reducing software licensing costs and being able to “do more with less” hardware enables organizations to deliver competitive value with shrinking IT budgets.
There are three basic types of data:
- Static data is data that never or rarely changes. Reference information and profile data are typical examples of static data.
- Streaming data is data that is continuously flowing. Streaming data has to be captured as it passes, or else it is missed. Online movie streaming is an example. Streaming data is a good candidate for buffering, but not for caching.
- Dynamic data is data that changes due to transactional events. The frequency of change cannot be predicted and hence caching and refreshing this data is typically difficult. As competitive and economic pressures increase, timely access to dynamic data is needed to make business critical decisions.
Figure 14. Dynamic data caching - Architecture refinement
The various architectural options are:
- Option A: Querying the SoR is the obvious first choice to solve this issue. With increasing volumes, however, this solution can be very expensive in terms of hardware and software licensing costs. More importantly, though, executing several queries against the SoR can introduce performance issues and delay other applications that depend on the SoR.
- Option B: To improve performance and reduce costs, such as licensing or MIPS consumption on the mainframe, it has become increasingly common to employ some sort of caching technique. This option caches the mainframe data in a less costly database that is periodically synchronized with the SoR, via a nightly cron job, for example. While this solution does cut costs and allows the existing infrastructure to scale, there are issues with data latency and data accuracy, as the data being acted upon could potentially be stale.
- Option C: This architectural option does the same thing as option B, however it uses a data grid such as WebSphere eXtreme Scale rather than a database. The in-memory and auto-scalable features enable this solution to deliver better response times and throughput, however it still suffers from the fundamental problem of stale data.
- Option D: This approach leverages the new capability of the WebSphere eXtreme Scale CDC adapter that enables the solution to provide a near-real time view of the data. This approach has all the inherent benefits of WebSphere eXtreme Scale in addition to providing a more accurate real time view of the data that enables a true dynamic data scenario facilitating better decision making.
- Option E: This pattern is just an example of how the dynamic data caching scenario can be further extended by leveraging features of WebSphere eXtreme Scale, such as multi-master-replication, to enable grid synchronization across potentially geographically dispersed domains, bringing the data closer to where it is consumed.
Distributed caching solutions often have to deal with the problem of a stale cache. This article presented several architectural approaches for dealing with this issue, including the relative merits and demerits of each approach. The five approaches covered here provide high-level coverage of the common patterns available to keep the data grid and system-of-record in relatively close synchronization with each other. The article concluded with a description of the WebSphere eXtreme Scale CDC Adapter, a new feature offered in WebSphere eXtreme Scale V8.6.0.7. This feature leverages IBM InfoSphere Data Replication Change Data Capture by capturing transactional changes from the database logs, transforming these change events to key-value objects, and routing these changes to the configured cache. The no-programming approach and out-of-the-box support for near-real time synchronization between the SoR and data grid make this a highly desirable approach.
I would like to thank Imran Sayyid (WebSphere eXtreme Scale Development), the primary developer of the WebSphere eXtreme Scale CDC Adapter, for patiently answering all my questions and serving as a technical reviewer of this article. I would like to acknowledge Heather Duschl (WebSphere eXtreme Scale Test Team) for sharing her learnings and doing a very detailed review of this article. Appreciation to John McGarvey (WebSphere eXtreme Scale – Chief Architect), Michael Cheng (Release Architect for WebSphere Application Server), and Vinod Ralh (Cloud Advisor, Australia and New Zealand) for serving as additional reviewers. Thanks to my wife Marisa and daughter Samuela for allowing me the time-away to work on this.
- Practical Guide to Cloud Computing Version 2.0, Cloud Standards Customer Council (2011). (PDF)
- WebSphere eXtreme Scale V8.6 Knowledge Center
- Redbook: WebSphere eXtreme Scale v8.6 Key Concepts and Usage Scenarios
- Configuring Java clients with an XML configuration
- Starting the JPA time-based updater
- Download WebSphere eXtreme Scale Version 8.6 Fix Pack 7 for distributed
- Redbook: Smarter Business: Dynamic Information with IBM InfoSphere Data Replication CDC