See the WebSphere eXtreme Scale Wiki for links to eXtreme Scale Version 7.0 documentation.
If you log in
with your developerWorks ID, you can leave comments and feedback for the development team.
The DataGrid API provides a simple programming interface to run business logic over all or a subset of the ObjectGrid in parallel where the data is located.
Overview
Increasingly, customers requires the ability to process large amounts of data extremely quickly. Customers need to search GBs or TBs of data in seconds. Conventional vertical scaling approaches cannot solve this problem as a single machine is not large enough to do the work. DataGrid architectures solve the problem using a radically different approach. First, the data is partitioned in to non-overlapping chunks or partitions. Next, the partitions are stored on a number of machines. Each partition is placed on exactly one machine. For example, if you had 10GB of data and you wanted to use 100 partitions, each partition is 100MB of data. If you had 20 machines, then each machine would have 5 partitions of data assigned to it. If each machine was a dual quad core machine, then there are 8 CPUs available on each machine to process the data on that machine. If you wanted to find all entries with a particular query, then you can run the query in parallel on all 20 machines and use all 160 CPUs to search the data. Each machine would use its CPUs to search the partitions resident on that box. If you need to store 20GB of data or run the query twice as fast then you would use 40 machines.
The data in the grid also needs to be fault tolerant. ObjectGrid allows each partition to be replicated, which means that a partitions data would need to be copied to another machines in case the primary failed. The data is only kept in memory so once the machine fails then the data is lost unless there is another copy.
In the previous scenario, 10GB of data, 100 partitions, 20 machines are used as an example. Each machine has the main or primary copy of 5 partitions and is also a backup for another 5 partitions. Thus, each machine would need 10 * 100MB or 1GB of memory available to meet this requirement. If a machine fails, then ObjectGrid promotes the replica to be the new primary machine, then reassigns a replica on a surviving machine to make the system fault tolerant. The new primary then copies its data to the new backup machine as quickly as is allowed, which means the ObjectGrid repairs itself automatically when a machine fails.
If 10 more machines are added, then ObjectGrid would migrate the primaries and replicas to evenly distribute the data over the 30 machines. This gives us 50% more memory, network bandwidth and CPU. DataGrids can scale linearly with the number of machines.
ObjectGrid is also aware of the need to manage such a grid over multiple data centers. It can make sure that backups and primaries for the same partition are located in different data centers if that is required. It can put all primaries in data center 1 and all replicas in data center 2 or it can round robin primaries and replicas between both data centers. The rules are flexible so that many scenarios are possible.
ObjectGrid can also manage thousands of servers, which together with completely automatic placement with data center awareness makes such large grids affordable from an administrative point of view. Administrators can specify what they want to do and not how to do it.
Application and data colocation
This is a principal feature of DataGrid architectures. Normally, multi-tiered architectures run stateless application servers in front of a database. The application servers use JDBC to pull data from the database, process it and return results to the clients and to push updates back to the database. This process has several drawbacks:
- More application servers put a bigger load on the database and eventually leads to the database being a bottleneck
- The data must be pulled over the network
- If gigabytes of data must be processed, then it is not practical due to the network cost
- The database is a single point of failure and an inhibitor on scaling up the number of application servers
- Sometimes the cost of pulling the data to the application server outweighs the cost of processing the data enormously.
Attempts to solve this problem conventionally typically use stored procedures that allow the data to be processed within the database. This process avoids the need to push the data to the application server, which helps and results in a good pattern. Colocating business logic with data means high performance. The data can be processed by the business logic at memory speeds. But, databases usually only scale vertically so you won't be able to apply 160 CPUs against the data.
DataGrids work best when the business logic is colocated with the partitioned data. Clients invoke the logic using a simple API. The API then invokes the business logic on the appropriate partitions where the logic runs at memory speeds against the data. The results are then returned to the client. This avoids the overhead of pulling the data to where the logic is. It also allows you to run the logic on many CPUs because of the partitioned model.
Process individual entries in parallel with entry specific results
ObjectGrid provides the capability for a client to ask for an application agent to perform operations on specific entries. The entries are specified using keys or can also be specified using a query. The ObjectGrid then returns a Map holding the keys of entries processed and the result for those entries. The ObjectGrid receives the request from the client, determines which partitions have the data for the keys, then runs the application logic in parallel to get the processed entries results. Once all the partitions process their entries, the results are returned to the client.
This works very well for finding all customers matching a very specific query but would not work if a client asked for the salaries of every entry in a grid with a million entries. The client would run out of memory and receiving millions of results. This pattern works best when the partitions return an amount of data thats small enough for a single client to process. If a lot of data gets returned to the client then we recommend using the pattern below to push more intermediate processing in to the grid.
Process groups of entries and aggregate them to a single result.
Rather than pull all the data back to the client which leads to the problem described previously, the data is processed as much as possible on partition nodes. This minimizes the amount of data returning to the client and also means that the processing is performed using the power of the grid rather than on a single client. For example, if a client wanted to know the total of the salaries for all customers held in the grid, then the sum of the customers for each partition is calculated on the partition, then those results are aggregated and returned to the client. This allows the power of the grid to handle the intermediate results also. This approach allows the client to ask the grid to process Tera bytes of data and receive the aggregate result.
Cross partition query support
Typically, a client interacts with a single partition at a time. For example, if you have a grid that stores employees, each employee has an associated department, and you want to partition the grid on an employee, then how do you find all the employees in a certain department? This is a cross partition query and the DataGrid APIs can be used to solve this with a parallel query.
Additional information
© Copyright IBM Corporation 2007,2009. All Rights Reserved.