Value proposition and context
The Preferred Data Source Pattern, or Preferred Source Pattern, is a microflow pattern for service aggregation. The pattern allows a client to retrieve information from a group of information sources without the need to understand, at least at a high level, that multiple sources exist.
Consider the following situations where multiple data sources must appear as one:
- A company has multiple sources of information, some of which are more expensive to access than others (for example, a local parts database and a remote parts database).
- A company upgrades its IT systems and, in doing so, introduces new sources of information that it must use in conjunction with old sources (for example, customers).
- Two or more similar businesses merge, and each has somewhat dissimilar data representing the same entities, such as customers.
- Any individual entity has some enterprise-unique identifier that's part of the record (for example, a customer number or SKU).
Assume that the above scenarios are integrated in the context of information management in an SOA Web services environment.
How can a client retrieve information from a set of disparate information sources without the need to understand that multiple sources exist?
The Preferred Data Source Pattern identifies one of the data sources as the preferred source and considers the others alternate sources, used only when the preferred source can't provide the desired information. Figure 1 shows the relationship between the facade and the adapters.
Figure 1. Relationship of facade and adapters
The pattern assumes that information obtained from any source comes in the form of records that describe entities, such as customers or parts. Further, it assumes that any individual entity has some enterprise-unique identifier that's part of the record, such as a customer number or SKU.
The pattern contains a facade that hides the fact that multiple sources exist; the client interacts only with this facade. The facade interface matches that of the preferred source, and the preferred interface contains one or more operations that allow the client to find (read) information matching various criteria. A find operation returns 0..n records that match the criteria.
It's important to understand that no matter which source provides the information, it's possible that none of the returned records is the desired record. Consider a scenario in which a store clerk searches a nationwide company database for customers with the name John Smith. The find operation could return 20 John Smiths, none of whom is the John Smith standing in front of the clerk. The client must depend on additional interactions with the user to determine whether any of the returned records is the desired one.
The Preferred Source Pattern assumes that an information source has one or more find operations that return zero or more instances of the entity record, or perhaps a subset of the entity record. The information source may have one or more write operations that allow a client to create and update entity records.
Figure 2 shows a sequence diagram for a find operation in the pattern. The client invokes the facade, which then invokes the preferred information source. If that source provides no matches, the facade invokes the alternate information sources in a predefined order until matches are found or until it exhausts all the alternate sources. After it finds a match or exhausts all sources, the facade returns the result to the client. Note: For the sake of clarity, we haven't shown the synchronous returns.
Figure 2. Find operations
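The find sequence described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: sources are modeled as plain callables, and `PreferredSourceFacade`, `make_source`, and the sample records are hypothetical names that do not come from the pattern specification.

```python
from typing import Callable, Dict, List

Record = Dict[str, object]                       # an entity record
FindFn = Callable[[Dict[str, object]], List[Record]]


class PreferredSourceFacade:
    """Tries the preferred source first, then each alternate in order."""

    def __init__(self, preferred: FindFn, alternates: List[FindFn]) -> None:
        self.sources = [preferred, *alternates]

    def find(self, criteria: Dict[str, object]) -> List[Record]:
        for source in self.sources:
            matches = source(criteria)
            if matches:                          # stop at the first source with a match
                return matches
        return []                                # every source was exhausted


def make_source(rows: List[Record]) -> FindFn:
    """Hypothetical in-memory source that matches on exact field equality."""
    def find(criteria: Dict[str, object]) -> List[Record]:
        return [r for r in rows
                if all(r.get(k) == v for k, v in criteria.items())]
    return find


local_db = make_source([{"id": "C-1", "name": "Ann Lee"}])
legacy_db = make_source([{"id": "C-7", "name": "John Smith"}])
facade = PreferredSourceFacade(local_db, [legacy_db])
```

The client calls only `facade.find(...)`; whether the records came from the preferred or an alternate source is not visible in the result.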
In its simplest form, the preferred source, and thus the pattern, supports only find (read) operations. A virtual catalog capability might leverage such a read-only pattern, as there's no need (or perhaps no ability) to update the preferred source.
The description for the simplest form must include a Web Services Description Language (WSDL) document that describes the preferred source and all alternate sources. The facade and all alternate sources use the preferred source's interface (port type). If an alternate source doesn't natively expose the same interface, you can apply a transform pattern to the source; however, this pattern is out of this article's scope. The WSDL for the alternate sources must differ from the preferred source, at least in the endpoint address; it may also differ in the binding(s) with a bit more work. The interface uses the schema describing the entity record and any other parameters. Note: The WSDL document will define or import the schema.
As indicated earlier, assume that an entity record includes a unique ID. You must identify the find operations to which the pattern will be applied, and treat all other operations as pass-through operations. Then create a list that shows the order in which the alternate sources are invoked. You can, of course, have a single list of WSDL documents for the services; the first in the list is the preferred source.
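One way to picture this configuration is a facade that receives an ordered list of sources plus the names of the find operations. The sketch below assumes in-process objects rather than WSDL-described services; `PartsSource`, `Facade`, `invoke`, and the operation names are hypothetical.

```python
from typing import List


class PartsSource:
    """Hypothetical source with one find operation and one other operation."""

    def __init__(self, parts: dict) -> None:
        self.parts = dict(parts)                 # SKU -> description

    def find_by_sku(self, sku: str) -> List[dict]:
        if sku in self.parts:
            return [{"sku": sku, "description": self.parts[sku]}]
        return []

    def count(self) -> int:
        return len(self.parts)


class Facade:
    def __init__(self, sources, find_operations) -> None:
        self.sources = list(sources)             # first entry is the preferred source
        self.find_operations = set(find_operations)

    def invoke(self, operation: str, *args):
        if operation not in self.find_operations:
            # Pass-through: only the preferred source is involved.
            return getattr(self.sources[0], operation)(*args)
        for source in self.sources:              # identified find operation
            result = getattr(source, operation)(*args)
            if result:
                return result
        return []


local_parts = PartsSource({"A1": "bolt"})
remote_parts = PartsSource({"B2": "gear", "C3": "shaft"})
facade = Facade([local_parts, remote_parts], find_operations={"find_by_sku"})
```

Here `count` is not an identified find operation, so it reaches only the preferred source, while `find_by_sku` falls back to the remote source when the local one has no match.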
In a more general case, the preferred source interface may contain additional operations that allow the client to create, update, or delete (in some cases delete may take the form of deactivate). Obviously, the pattern facade must also support the additional operations.
When information resulting from a read operation doesn't come from the preferred source, you may need to add the information to the preferred source. The pattern should support efficient updates of the preferred source, but this is somewhat problematic. Consider this customer information scenario: If the desired customer record doesn't exist in the preferred source (the local store database), it may be located in some legacy database at the store, or it may be undefined in the enterprise so that you must use an external information source, such as Acxiom, which finds information based on a phone number.
The facade's actions depend on the IDs in the entity records returned from the source. The real alternate source may provide valid IDs, invalid IDs, or no IDs. A valid ID is acceptable as an ID in the preferred source. Assume the facade finds the information in the legacy database, and four records match the search criteria. Further assume that all records have a valid ID. In this case, the facade should add none, one, or all four records to the preferred source, depending on the circumstances. If none of the records represents the person in front of the clerk, obviously the facade wouldn't save those in the preferred source. If one of the records does represent the person in front of the clerk, then it's highly likely that the facade should save the record in the preferred source. But when should you do this?
Certainly the facade has no way of knowing which record is right, so the client must initiate the create or update (write) operation. However, in this case, if there's no new information about the customer, the client may not perform any write operations at all! At best, the client might do something like write a new time stamp that indicates the last time the identified customer visited. If there is a migration relationship between the preferred source and one or more alternate sources, you might want to automatically add all the records from the alternate source to the preferred source without explicit action on the part of the client. In a migration scenario, legacy sources might contain entity records with IDs that are referenced in other parts of the enterprise. Thus, when you put the legacy record into the preferred source, you should use the same ID to preserve the integrity of references. To do this, you need an operation to create an entity record with an existing ID.
You may require a different set of actions when the matching records from an alternate source don't have a valid ID or have no ID; this would be the case for external sources like Acxiom. If the records have no IDs and are returned to the client, the client can easily determine that a record matching the person in front of the clerk must be created in the preferred source, without assistance from the facade. The client can add the record to the preferred source (through the facade, of course) using an operation to create an entity record without an ID; you can assume that such an operation exists as a pattern requirement. If the returned records have invalid IDs and are returned to the client, the client cannot easily determine that a record matching the person in front of the clerk must be created in the preferred source. In fact, the idea of giving the client invalid IDs seems flawed. Because the client can't detect that the ID is invalid, it can use the ID in another context to link to the entity record. This means that an alternate source used by the facade must return either valid IDs or no IDs. You may have to produce either a valid ID or no ID in the transform pattern wrapper if the real source doesn't do so. Creating a valid ID may require some sort of ID correlation service that allows you to create valid IDs in multiple environments; that is out of this pattern's scope.
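The rule that an alternate source must return either valid IDs or no IDs could be enforced in the wrapper around the real source. The following sketch assumes records are dictionaries with an optional `id` field; `normalize_ids` and the `C-` plus six digits ID format are hypothetical and stand in for whatever validity rule the preferred source actually applies.

```python
import re
from typing import Callable, Dict, List, Optional

Record = Dict[str, Optional[str]]


def normalize_ids(records: List[Record],
                  is_valid_id: Callable[[str], bool]) -> List[Record]:
    """Return copies of the records whose 'id' is either valid or None."""
    normalized = []
    for record in records:
        copy = dict(record)
        record_id = copy.get("id")
        if record_id is None or not is_valid_id(record_id):
            copy["id"] = None                    # never leak an invalid ID to the client
        normalized.append(copy)
    return normalized


# Hypothetical validity rule for the preferred source's customer numbers.
CUSTOMER_ID = re.compile(r"C-\d{6}")


def is_valid_customer_id(value: str) -> bool:
    return CUSTOMER_ID.fullmatch(value) is not None
```

Stripping an invalid ID is the simpler of the two options named above; mapping it to a valid ID would need the ID correlation service that the pattern leaves out of scope.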
It might be interesting to define a service provider interface for an ID creation service or function that the facade can optionally call to create valid IDs. The facade can use that service in the case where an alternate source provides no IDs and no create method returns the ID of newly inserted records. This can be even more useful in the On Write policy described later. This discussion highlights the need to:
- Identify the ID field in the entity record.
- Understand whether an alternate source provides valid IDs or no IDs.
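A service provider interface for ID creation might look like the sketch below. The article does not define such an interface, so `IdService`, `SequentialIdService`, `assign_missing_ids`, and the ID format are all assumptions made for illustration.

```python
import itertools
from typing import Dict, List, Optional, Protocol


class IdService(Protocol):
    """Hypothetical SPI the facade can optionally call to create valid IDs."""

    def new_id(self) -> str: ...


class SequentialIdService:
    """Toy provider; a real one would coordinate with the preferred source."""

    def __init__(self, prefix: str, start: int = 1) -> None:
        self._prefix = prefix
        self._counter = itertools.count(start)

    def new_id(self) -> str:
        return f"{self._prefix}-{next(self._counter):06d}"


def assign_missing_ids(records: List[Dict[str, Optional[str]]],
                       id_service: Optional[IdService]) -> List[Dict[str, Optional[str]]]:
    """Fill in IDs only for records that arrived without one."""
    result = []
    for record in records:
        copy = dict(record)
        if id_service is not None and copy.get("id") is None:
            copy["id"] = id_service.new_id()
        result.append(copy)
    return result
```

Because the service is optional, passing `None` leaves records without IDs untouched, which matches the case where the facade cannot create valid IDs.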
The information helps drive the actions that the facade can take under various circumstances. Another important aspect of the relationship between find and update operations is the nature of the value type(s) returned from the find operation(s) as well as the value type(s) used to drive the create operation(s). The current thinking is that the pattern's initial implementation requires that the value type always equal the entity record. This eliminates the need for additional information on how to identify the entity record's subsets and minimizes mapping code.
The subject of update leads to the need for policies, defined per alternate source, that govern how write operations (creates or updates) on the preferred source relate to the entity records returned from read (find) operations on an alternate source. To drive these per-source policies, you must identify the find, create, and update operations in the preferred source interface. The following subsections describe some of these policies.
Nothing policy
For find operations, the facade only returns the entity records and does nothing special for write operations. The client must check the ID's validity to deduce when it needs a create operation to insert a record into the preferred source, and it must explicitly invoke that create operation on the facade. The facade, in turn, invokes the create operation on the preferred source (see Figure 3).
Figure 3. Nothing policy
This may be the best policy when the alternate source provides no IDs. For this policy, all read operations are simply passed on to the sources, and results are returned. All other operations are passed through to the preferred source.
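A minimal sketch of the Nothing policy follows, assuming in-memory sources. `InMemorySource` and `NothingPolicyFacade` are hypothetical names, and ID assignment by the preferred source is omitted to keep the example short.

```python
class InMemorySource:
    """Hypothetical source holding records as a list of dictionaries."""

    def __init__(self, records=()):
        self.records = [dict(r) for r in records]

    def find(self, **criteria):
        return [dict(r) for r in self.records
                if all(r.get(k) == v for k, v in criteria.items())]

    def create(self, record):
        self.records.append(dict(record))


class NothingPolicyFacade:
    def __init__(self, preferred, alternates):
        self.preferred = preferred
        self.alternates = list(alternates)

    def find(self, **criteria):
        # Reads go to each source in order; nothing is written as a side effect.
        for source in [self.preferred, *self.alternates]:
            matches = source.find(**criteria)
            if matches:
                return matches
        return []

    def create(self, record):
        # Writes are passed through to the preferred source only.
        self.preferred.create(record)


preferred = InMemorySource()
external = InMemorySource([{"id": None, "name": "John Smith", "phone": "555-0100"}])
facade = NothingPolicyFacade(preferred, [external])

found = facade.find(phone="555-0100")    # preferred source is still empty here
facade.create(found[0])                  # the client decides to save the record
```

The explicit `facade.create(...)` call is the point of the policy: the client, not the facade, decides which returned record belongs in the preferred source.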
Add All policy
The facade adds all the entity records to the preferred source as a side effect of the find operation. The client behaves identically to the way it does when the records come from the preferred source. There are two subpolicies:
- Where the entity records have a valid ID.
- Where the entity records don't have a valid ID (have no ID).
For the first subpolicy (valid ID), the facade creates the records in the preferred source using the identified create operation (with ID) and returns the records from the alternate source to the client. For this subpolicy to be valid, the preferred source must support a create operation that allows existing IDs.
For the second subpolicy (no ID), the facade creates the records in the preferred source using the identified create_noID operation. To avoid additional read operations, the facade places the IDs created by the preferred source in the records obtained from the alternate source; these records are subsequently returned to the client. For this subpolicy to be valid, the preferred source must support a create operation that returns the new IDs as its result (see Figure 4).
Figure 4. Add All policy
Alternatively, if no such operation exists, you can perform the create operations on the preferred source, then perform another read on the preferred source using the original criteria. This returns only the entities just added, because nothing matching the criteria was initially found in the preferred source. These entities can be returned to the client. A possibly valuable variant for the second (no ID) subpolicy is to allow a choice between inserting all the records, as described, or inserting none of them; the result would be a policy that behaves as Add All for records with valid IDs and as Nothing for records with no IDs.
For this policy, you must identify the find operations. You don't need to identify write operations, which are simply passed on to the preferred source, and results are returned. You can determine which subpolicy to use (the first or the second) by knowing whether or not the alternate source returns valid IDs. Note: This may be a deployment time problem in that a source could return valid IDs, but not support the appropriate create operation (with IDs); you can use the alternative for the second subpolicy in this case. This policy may be most useful in migration scenarios when coupled with some sort of clean-up mechanism associated with the alternate source. The mechanism (out of scope for this pattern) would run periodically to remove entity records that shouldn't be migrated from an alternate source to the preferred source. For example, customers who haven't visited a store for six months might be removed from the legacy store database. The drawback, of course, is that if the customer returns to the store, the record won't be found anywhere in an existing database. Thus the customer will appear as a new customer, and you will likely have to re-enter information about him or her.
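Both Add All subpolicies can be sketched together, assuming the preferred source offers a create operation that accepts an existing ID and another that returns a newly assigned ID. The method names `create_with_id` and `create_no_id`, the `provides_valid_ids` flag, and the `P-n` ID format are hypothetical.

```python
import itertools


def _matches(record, criteria):
    return all(record.get(k) == v for k, v in criteria.items())


class PreferredSource:
    def __init__(self):
        self.records = {}                        # ID -> record
        self._ids = itertools.count(1)

    def find(self, **criteria):
        return [dict(r) for r in self.records.values() if _matches(r, criteria)]

    def create_with_id(self, record):            # keeps the existing ID
        self.records[record["id"]] = dict(record)

    def create_no_id(self, record):              # assigns and returns a new ID
        new_id = f"P-{next(self._ids)}"
        self.records[new_id] = {**record, "id": new_id}
        return new_id


class AlternateSource:
    def __init__(self, records, provides_valid_ids):
        self.records = [dict(r) for r in records]
        self.provides_valid_ids = provides_valid_ids   # known per source at deployment

    def find(self, **criteria):
        return [dict(r) for r in self.records if _matches(r, criteria)]


class AddAllFacade:
    def __init__(self, preferred, alternates):
        self.preferred = preferred
        self.alternates = list(alternates)

    def find(self, **criteria):
        matches = self.preferred.find(**criteria)
        if matches:
            return matches
        for alternate in self.alternates:
            matches = alternate.find(**criteria)
            if not matches:
                continue
            for record in matches:               # side effect: add every record
                if alternate.provides_valid_ids:
                    self.preferred.create_with_id(record)
                else:
                    record["id"] = self.preferred.create_no_id(record)
            return matches
        return []
```

In the no-ID branch the facade copies the returned ID into the record before handing it to the client, which avoids the second read on the preferred source.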
On Write policy
If the alternate source provides valid IDs, or the facade can create valid IDs for the records, the facade caches the entity records retrieved from the alternate source using the ID as the key. The facade then returns the records to the client. If the source doesn't provide valid IDs, the facade simply returns the records to the client. When the client later invokes an update operation on a record that it believes is in the preferred source (the record has a valid ID and is in the cache), the operation may provide only partial information. The facade matches the ID, which must be part of the operation's parameters, against the cached records. It then merges the information from the client with the cached record and invokes the create operation (with ID) on the preferred source (see Figure 5).
Figure 5. On Write policy
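The On Write flow can be sketched as follows, assuming alternate sources that return valid IDs and a simple field-by-field merge. `Store`, `OnWriteFacade`, and `create_with_id` are hypothetical names, and the dictionary cache has no size limit or timeout.

```python
class Store:
    """Hypothetical source keyed by ID; assumes every record has a valid ID."""

    def __init__(self, records=()):
        self.records = {r["id"]: dict(r) for r in records}

    def find(self, **criteria):
        return [dict(r) for r in self.records.values()
                if all(r.get(k) == v for k, v in criteria.items())]

    def create_with_id(self, record):
        self.records[record["id"]] = dict(record)

    def update(self, record_id, changes):
        self.records[record_id].update(changes)


class OnWriteFacade:
    def __init__(self, preferred, alternates):
        self.preferred = preferred
        self.alternates = list(alternates)
        self.cache = {}                          # ID -> record from an alternate source

    def find(self, **criteria):
        matches = self.preferred.find(**criteria)
        if matches:
            return matches                       # nothing is cached in the normal case
        for alternate in self.alternates:
            matches = alternate.find(**criteria)
            if matches:
                for record in matches:
                    if record.get("id") is not None:
                        self.cache[record["id"]] = dict(record)
                return matches
        return []

    def update(self, record_id, changes):
        cached = self.cache.pop(record_id, None)
        if cached is None:
            self.preferred.update(record_id, changes)      # plain pass-through
        else:
            merged = {**cached, **changes}       # stand-in for operation-specific merge logic
            self.preferred.create_with_id(merged)
```

The record reaches the preferred source only when the client writes to it, so records the client never touches are never migrated.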
This policy suffers from a set of problems:
- To cache the records from the alternate source, you need a cache per value type returned.
- You must specify some cache characteristics, like number of entries or lifetime, which you could derive from the caching pattern. You can consider the Requester Side Caching Pattern in this case.
- The timing between putting an entry in the cache and removing it due to timeout may be problematic in some situations. This might result in the client updating an ID that has been removed from the cache. There seems to be no recourse except to cause an exception or set the timeout so long that you risk a cache blowup.
- Under normal circumstances, a read from the preferred source succeeds, and you don't need to use the cache; but, when an update operation occurs, you need the ID to look in the cache anyway, and a cache miss should occur. The update operation would proceed as if it were a simple pass-through. The caching in this case impacts performance.
- For this policy, you must identify the find and update operations in the interface (you must apply the policy to all update and create operations). Developers must also supply operation-specific merge logic on update operations as custom code.
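The cache characteristics mentioned in the list above (number of entries and lifetime) can be illustrated with a small bounded cache. This is not the Requester Side Caching Pattern itself, only a sketch; `ExpiringCache`, `put`, and `take` are hypothetical names, and the clock is injectable so that expiry can be exercised without waiting.

```python
import time
from collections import OrderedDict


class ExpiringCache:
    """Bounded cache whose entries expire after a fixed lifetime."""

    def __init__(self, max_entries, ttl_seconds, clock=time.monotonic):
        self._max_entries = max_entries
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = OrderedDict()            # ID -> (expires_at, record)

    def put(self, record_id, record):
        self._entries.pop(record_id, None)
        self._entries[record_id] = (self._clock() + self._ttl, dict(record))
        while len(self._entries) > self._max_entries:
            self._entries.popitem(last=False)    # evict the oldest entry

    def take(self, record_id):
        """Remove and return the record, or None on a miss or after expiry."""
        entry = self._entries.pop(record_id, None)
        if entry is None:
            return None
        expires_at, record = entry
        return record if self._clock() < expires_at else None
```

A `None` result from `take` is the situation described above: the facade cannot tell an expired entry from a record that was always in the preferred source, so it can only pass the update through or raise an exception.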
It's unlikely that the three policies above cover all situations. When they don't, you must use a Custom policy. You can start from the Nothing policy when creating a policy that works for your situation.
Forces at work
Consider some of the forces at work when dealing with these patterns:
- Generally, the various data sources provide all the persistence; this is certainly true for read-only situations. When write is allowed, you can employ a temporary cache.
- Two important performance considerations are the order in which you access data sources and the number of those sources. In most situations, to maximize performance it makes sense to make the preferred source the most likely source of requested data. In other situations, such as in a data migration, the new source (which may be the least likely to contain the requested data, at least initially) is the preferred source.
The most common pattern used with the Preferred Data Source Pattern is a wrapper pattern that makes disparate sources of information look the same; it presents the same WSDL port type.
In some situations, you can use the Preferred Data Source Pattern recursively. For example, a service in a store may be implemented with the Preferred Data Source Pattern, and one of the alternate sources is a service at the enterprise. You may implement that enterprise service with the Preferred Data Source Pattern. You can use the Requester Side Caching Pattern to implement the On Write policy.
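Recursive use works because the facade exposes the same find interface as any source, so one facade can serve as an alternate source of another. The sketch below assumes in-memory sources; `ListSource`, `Facade`, and the sample part records are hypothetical.

```python
class ListSource:
    """Hypothetical in-memory source."""

    def __init__(self, records):
        self.records = [dict(r) for r in records]

    def find(self, **criteria):
        return [dict(r) for r in self.records
                if all(r.get(k) == v for k, v in criteria.items())]


class Facade:
    """Same find interface as a source, so facades can be nested."""

    def __init__(self, preferred, alternates):
        self.sources = [preferred, *alternates]

    def find(self, **criteria):
        for source in self.sources:
            matches = source.find(**criteria)
            if matches:
                return matches
        return []


# Enterprise-level service: enterprise database first, then an external source.
enterprise_service = Facade(
    ListSource([{"sku": "E-1", "name": "gear"}]),
    [ListSource([{"sku": "X-9", "name": "rare part"}])],
)

# Store-level service: local database first, then the enterprise service.
store_service = Facade(ListSource([{"sku": "S-1", "name": "bolt"}]),
                       [enterprise_service])
```

A lookup for `X-9` through `store_service` passes through three sources, but the store client still sees a single find call.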
The Preferred Data Source Pattern is considered an Enterprise Application Integration (EAI) pattern in an SOA context; thus the strengths and weaknesses of typical EAI patterns apply here. The Preferred Data Source Pattern works best when the data in multiple sources are relatively consistent, clean, and straightforward, and the returned result sets are small to medium. It provides many advantages, including flexibility, extensibility, implementation simplicity, and cost savings. However, you should be cautious about performance implications, because the Preferred Data Source Pattern queries its sources sequentially and uses neither parallel processing nor query optimization.
- "Web Services Response Template Pattern: A Specification" (developerWorks, February 2006) provides fine-grained access to a coarse-grained interface.
- Examine "The Requester Side Caching Pattern Specification" (developerWorks, October 2005) in detail.
- "Information Service Patterns, Part 1: The Data Federation Pattern" (developerWorks, July 2006) creates an integrated view into distributed information without creating data redundancy and while federating both structured and unstructured information.
- "The Cache Mediation Pattern Specification: An Overview" (developerWorks, May 2006) is an asynchronous version of the Requester Side Caching Pattern.