When data handling comes up in developer forums, the discussion usually centers on applets and servlets and focuses on up-front issues such as look and feel, security, and load time. The actual amount of data that is transferred from one machine to another, or data flow as it is more commonly called, rarely comes up. Yet data flow has a real effect on performance and is seldom evaluated adequately, particularly for large-scale, data-intensive applications.
In this article, I'll explain how data flow affects performance in n-tier applications with multiple servers. I'll use a data-flow model to illustrate the particular junctures where data can slow or block application processing and explain how to work around common validation, security, and data access problems. I'll also look at more high-level design and architecture decisions that can considerably improve the performance of your applications. In addition, I'll weigh the merits of centralization versus decentralization of stored data, a little-considered design factor that is especially relevant in the context of today's n-tier systems.
Data flow can slow down or disrupt an application at any stage, so the trick is to anticipate problems and resolve them before they start. I'll use a data-flow model to describe the most common data-flow bottlenecks, as well as some tricks for avoiding them. Figure 1 shows the data flow through a typical large-scale n-tier application with more than one server.
Figure 1. Data flow in a large-scale application

Now let's look at the specific places where data tends to slow down and what you can do about it.
- 1. From client to Web server
- This is a mandatory step for data flow, but in some cases more data flows through this point than is necessary. For example, an application that does a lot of simple validation on the server side rather than the client side can slow down your system. Ideally, you want to move data to the server only after it's been successfully validated by the client. While in some cases validation actually can't happen on the client side, you can often work around this if you reorganize the application. For example, data validation is often thought of as business logic and so considered a server-side function. In fact, data validation is data specific, so it can usually be performed on the client side.
- 2. Web server to application server
- If you need to process data presentation rules for the client, do it on the Web server. More often than not, the Web server is simply used to pass data on to the application tier; to weed out a lot of performance lag, process the data on your Web server instead. In the case of presentation data, you'll need it processed for the application tier anyway. While data processing on the application tier might be easier from a coding perspective, it also entails transferring a lot more data than you need to. It may not seem like a big deal to transfer an extra 10 bytes, but once you multiply that across a million transactions (which is typical of large applications) the unnecessary data transferred adds up to a whopping 10 MB. And that's without considering the packet headers and footers for the data!
- 3. Application server to database server
- Do all data processing before your data reaches the database server. This ensures that the database server only needs to reorganize the data for fast and easy retrieval. It also ensures that only the necessary data reaches the database server. You should determine the number of servers used for processing in the application tier based on these considerations. Every additional server in the application tier not only increases the hardware price, but also increases the data-transfer overhead.
- 4. Retrieval from the database server
- Store the data in such a way that it requires a minimum number of joins during the retrieval process, which means that data must be normalized and de-normalized at the appropriate points. I discuss data normalization and de-normalization in detail a bit further on.
- 5. Back to the Web server from the application tier
- During the retrieval process only one application server should come into the data flow before it is utilized, even if the whole system is more than three tiers or the application tier has more than one server type. Note that the data path for storage and processing need not be the same as the path for retrieval. It's also worthwhile to put some thought into the kind of data that will be retrieved, based on the application being coded. For example, in a Web-based trading site, you're likely to retrieve only a few orders once they've been placed (those being orders that are changed or canceled). On the other hand, you'll probably access a lot of data to check the order status if a considerable time elapses between the placement and delivery of an order. These types of considerations can lead you to optimize the code in ways you otherwise would not.
- 6. Displaying retrieved data on the client
- All the processing done for data storage must also be done for data retrieval. Therefore, it is worthwhile to write both encoding and decoding routines to make the combination of data retrieval and processing as efficient as possible.
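The arithmetic in step 2 is easy to check. The sketch below is only a back-of-the-envelope calculation: the function name `extra_traffic_bytes` is made up for illustration, it uses decimal megabytes (1 MB = 1,000,000 bytes), and it ignores packet overhead, as the figure in the text does.

```python
def extra_traffic_bytes(extra_bytes_per_transaction: int, transactions: int) -> int:
    """Total avoidable payload, ignoring packet headers and footers."""
    return extra_bytes_per_transaction * transactions


# 10 unnecessary bytes on each of a million transactions.
total = extra_traffic_bytes(10, 1_000_000)
print(f"{total / 1_000_000:.0f} MB")  # 10 MB
```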
Design and architecture considerations
Once you understand the specific points where data flow can disrupt performance, and can code against such bottlenecks, you're off to a great start. Next, take a more high-level approach and avoid design and architecture mistakes that slow down system performance. Consider some of the functions that impact performance in large-scale applications, and how you can design for better performance.
As I mentioned earlier, it's a big mistake to move data from client to server just for the sake of validation. Data sent from client to server and then rejected due to validation failure only increases the amount of avoidable data that is transmitted. Your best option is to reorganize the application so that all the validation occurs on the client side. If you can't do that -- that is, if you absolutely must validate data on the server -- at least use a code transmission mechanism such as ActiveX® or Java™ applets to do it. Code transmission happens only once, whereas data must be transmitted every time you encounter a validation problem at the other end.
If you work with data that depends on already-stored data for validation, try to keep that data available on the client also. For example, if you need to keep a listing of user-submitted values for comparison with newer values, it is economical to keep the data on the client and not the server. The amount of data entered by various users will quickly exceed the amount of data you actually must transmit for validation.
This rule does have exceptions, such as in the case of uniqueness testing in a database column. Since you can't do this on the client side, it is necessary to validate on the server. In most cases, however, it is a best practice to do your data validation on the client.
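As a rough illustration of validating before transmitting, the sketch below checks a form locally and calls the transport only when the checks pass. Everything in it is hypothetical: the field names, the rules, and the `send` callback stand in for whatever your client actually uses, and a check such as uniqueness would still have to run on the server. It assumes Python 3.9 or later.

```python
import re
from typing import Callable


def validate_order(form: dict) -> list[str]:
    """Return a list of problems; an empty list means the form may be sent."""
    errors = []
    if not re.fullmatch(r"[A-Z]{1,5}", form.get("symbol", "")):
        errors.append("symbol must be 1-5 capital letters")
    quantity = form.get("quantity", "")
    if not (quantity.isdecimal() and int(quantity) > 0):
        errors.append("quantity must be a positive whole number")
    return errors


def submit(form: dict, send: Callable[[dict], None]) -> list[str]:
    """Call send() only when local validation passes."""
    errors = validate_order(form)
    if not errors:
        send(form)
    return errors


sent = []  # stands in for the network: each item is one transmitted form
submit({"symbol": "IBM", "quantity": "100"}, sent.append)  # transmitted
submit({"symbol": "ibm!", "quantity": "-5"}, sent.append)  # rejected locally
print(len(sent))  # 1
```

Only the valid form crosses the wire; the rejected one costs no data transfer at all.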
Data that must be kept secure typically consumes more operations than data that isn't secured. While it seems counterintuitive, non-secure data often goes through all the same operations as secure data, simply because the two types of data are stored in the same place. You can improve your overall application performance by storing secure and non-secure data separately in your database.
The only case where it might not make sense to keep these two types of data separate is if the amount of secure data is minuscule compared to the amount of non-secure data, such as in an application where the only security measure is the user password. In this case you might store the two types of data together, but you still want to ensure that operations done on the secure data are never done on the non-secure data.
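Here is a minimal sketch of that separation, assuming a password is the only secure item and a display preference is the non-secure item. The two dictionaries stand in for separate tables or databases, key derivation with `hashlib.pbkdf2_hmac` stands in for whatever security operations the real application performs, and the function names and iteration count are illustrative only.

```python
import hashlib
import hmac
import os

ITERATIONS = 100_000  # illustrative work factor, not a recommendation

secure_store = {}  # credentials: every write pays for key derivation
plain_store = {}   # preferences: written as-is, no security work


def save_password(user_id: str, password: str) -> None:
    salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, ITERATIONS)
    secure_store[user_id] = (salt, digest)


def check_password(user_id: str, password: str) -> bool:
    salt, digest = secure_store[user_id]
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, ITERATIONS)
    return hmac.compare_digest(candidate, digest)


def save_preference(user_id: str, key: str, value: str) -> None:
    # Non-secure path: never touches the key-derivation code above.
    plain_store.setdefault(user_id, {})[key] = value


save_password("u1", "s3cret")
save_preference("u1", "theme", "dark")
print(check_password("u1", "s3cret"))  # True
```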
Frequency of access is an important factor to consider when you determine storage criteria. Consider a hospital information system, for example. In this system, a patient's demographics are accessed much less frequently than her name and ID number. Therefore, it makes sense to store demographic information separately from name and ID data. If you store the two types of data together, the database will have to do a projection operation to cull out the demographics data every time it accesses a patient's ID and name.
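The hospital example might be laid out as follows. This is a sketch using an in-memory SQLite database; the table and column names are assumptions made for illustration, not taken from any real system.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Narrow, frequently read table.
    CREATE TABLE patient (
        patient_id INTEGER PRIMARY KEY,
        name       TEXT NOT NULL
    );
    -- Wider, rarely read table, keyed by the same ID.
    CREATE TABLE patient_demographics (
        patient_id INTEGER PRIMARY KEY REFERENCES patient(patient_id),
        birth_date TEXT,
        address    TEXT
    );
""")
conn.execute("INSERT INTO patient VALUES (1, 'A. Patient')")
conn.execute(
    "INSERT INTO patient_demographics VALUES (1, '1970-01-01', '12 Example Road')"
)

# Frequent path: reads only the narrow table, so nothing has to be culled.
name_row = conn.execute(
    "SELECT patient_id, name FROM patient WHERE patient_id = ?", (1,)
).fetchone()

# Rare path: demographics are fetched only when actually needed.
demo_row = conn.execute(
    "SELECT birth_date, address FROM patient_demographics WHERE patient_id = ?", (1,)
).fetchone()

print(name_row)  # (1, 'A. Patient')
```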
In this example it doesn't matter whether you store all the information in the same table or the same database. In the case of a table, you must do a projection operation to cull the data. In the case of a database, data used for offline processing (such as data warehousing) might slow down the system if stored along with data for online processing. This is especially so when frequently accessed data is also meant for offline processing, in which case, keep the data in a different database.
You might even consider using a different database for some data in cases where all the data is tied to online transactions. This is a good strategy for cases where the application never accesses two pieces of data together and both data pieces might grow very large (in the order of gigabytes or terabytes). In this case, dividing the data into two different databases is useful.
Many times you can predict the subsequent state of an application from its current state with reasonable accuracy; for example, in trading software, a user who queries for an order is very likely to view the order details next. You can observe similar patterns for most application types. Observing these patterns, you might then optimize the application by doing a look-ahead operation on data that is likely to be called immediately after a given action, according to your prediction algorithm.
At first glance, this strategy seems to introduce statefulness in the system, which might give rise to scaling-out problems, but in fact it doesn't.
The look-ahead data is purely an optimization: if it is lost, the system simply runs slower. The only way a look-ahead operation fails to help is if the predicted results aren't available where the next call lands, such as when that call is routed to another server. (This strategy is usable for systems that employ server farming as well. In this case, the information can simply be stored in a central repository shared by all the servers in the farm.)
I can illustrate this further with a trading application example, where one user accesses an order. This request goes to the first of three servers (server1) in the application server farm. The application predicts that the user will next access the same order's detail and fetches the data in advance. You might store this data on server1. But the subsequent request related to order detail goes to server3, because server1 and server2 are busy. Now, if the data related to order detail is not available to server3, then all it has to do is fetch that data. While this will not give rise to the intended optimization, it also doesn't result in an inconsistent state, since the order detail can be fetched in either case. To take advantage of the optimization, keep the data in shared storage that is accessible to all the servers in the farm.
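The scenario above can be sketched in a few lines. The `Backend` and `AppServer` classes and their methods are hypothetical, and a plain dictionary stands in for the shared storage; the point is only that a missing prediction falls back to an ordinary fetch.

```python
class Backend:
    """Stand-in for the database tier; counts how often detail is fetched."""

    def __init__(self):
        self.detail_fetches = 0

    def fetch_order(self, order_id):
        return {"order_id": order_id, "status": "open"}

    def fetch_order_detail(self, order_id):
        self.detail_fetches += 1
        return {"order_id": order_id, "lines": ["100 x IBM"]}


class AppServer:
    def __init__(self, backend, cache):
        self.backend = backend
        self.cache = cache  # a dict shared by the farm, or private to one server

    def get_order(self, order_id):
        order = self.backend.fetch_order(order_id)
        # Look-ahead: detail is the likely next request, so fetch it now.
        self.cache[("detail", order_id)] = self.backend.fetch_order_detail(order_id)
        return order

    def get_order_detail(self, order_id):
        key = ("detail", order_id)
        if key not in self.cache:
            # Prediction not available here: fall back to a normal fetch.
            self.cache[key] = self.backend.fetch_order_detail(order_id)
        return self.cache[key]


backend = Backend()
shared = {}  # stands in for storage reachable by every server in the farm
server1 = AppServer(backend, shared)
server3 = AppServer(backend, shared)

server1.get_order(42)                  # server1 prefetches the detail
detail = server3.get_order_detail(42)  # server3 finds it in shared storage
print(backend.detail_fetches)  # 1
```

If each server is given its own private dictionary instead, `server3` fetches the detail itself: the result is the same and the backend is hit twice, which is the lost optimization described above rather than an inconsistent state.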
Normalization and de-normalization
Data normalization is always recommended in theory, but in practice over-normalization (or normalization without appropriate de-normalization) leads to reliance on joins for data retrieval. For example, consider a case where a patient name is always accessed with the patient demographics and the patient ID is the key. In this case it is advisable to store the patient name (even if normalization demands otherwise) along with the demographics data, to ensure that you don't need to do a join for the patient name every time you want to access demographics. The cost is that the patient name must be updated in two places. But, since you don't update a patient name as often as you access demographics, you've still optimized the system for performance.
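A sketch of this trade-off, again with an in-memory SQLite database and made-up table and column names: the demographics table carries its own copy of the name, reads need no join, and the rarer rename updates both copies in a single transaction.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE patient (
        patient_id INTEGER PRIMARY KEY,
        name       TEXT NOT NULL
    );
    CREATE TABLE patient_demographics (
        patient_id   INTEGER PRIMARY KEY,
        patient_name TEXT NOT NULL,  -- deliberate copy of patient.name
        birth_date   TEXT
    );
    INSERT INTO patient VALUES (1, 'A. Patient');
    INSERT INTO patient_demographics VALUES (1, 'A. Patient', '1970-01-01');
""")


def rename_patient(conn, patient_id, new_name):
    """Update both copies of the name together."""
    with conn:  # commits on success, rolls back if either UPDATE raises
        conn.execute(
            "UPDATE patient SET name = ? WHERE patient_id = ?",
            (new_name, patient_id),
        )
        conn.execute(
            "UPDATE patient_demographics SET patient_name = ? WHERE patient_id = ?",
            (new_name, patient_id),
        )


rename_patient(conn, 1, "B. Patient")

# Frequent path: name and demographics come back from one table, no join.
row = conn.execute(
    "SELECT patient_name, birth_date FROM patient_demographics WHERE patient_id = ?",
    (1,),
).fetchone()
print(row)  # ('B. Patient', '1970-01-01')
```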
Note that normalization is standard practice in any database-oriented application. When you optimize the system for performance, it's important to start from a base of data that is adequately normalized; otherwise, when you update data, you'll have to update every instance of it.
Cohesion is especially important in n-tier applications where more than one server processes data. For example, consider the case where data to be printed must pass through multiple servers before it reaches the printer. If not carefully thought out, this type of setup slows both the servers involved and the printing process, especially if a lot of data is printed. To get around this, ensure that the data usage point is as close to the data storage point as possible. (This tip also applies when moving data into a database.)
Data that is accessed together should always be stored together and data that is accessed separately should be stored separately. Disrupting this basic rule results in the database having to perform a lot more operations to access the data. Separating data that is normally accessed together results in more join operations than necessary; storing together data that is normally accessed separately means additional projections to cull the unwanted data.
Data warehousing versus transaction processing
Online transaction processing (OLTP) is not the same thing as online analytical processing (OLAP), and data access patterns for each technique are quite different. As a result, you very likely need a different database for each process -- even if it means storing exactly the same data twice. Separating OLTP data from OLAP data allows you to optimize for the access patterns of each technique. In Table 1 you can see a breakdown of the differences between OLTP and OLAP databases, with particular regard to access patterns and resource usage for each type.
Table 1. OLTP versus OLAP processing
| Feature | OLTP | OLAP |
|---|---|---|
| Access pattern | Repetitive | Occasional |
| Operations | Read-Write | Full table scans |
| Unit of work | Simple queries based on hashing or keys/indices | Complex queries |
| Number of records accessed | Tens or hundreds | Millions (full table) |
| Size | 10s of MBs to GBs | 10s of GBs to TBs |
| Metric | Transaction throughput | Query throughput |
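The difference in unit of work can be seen even on a toy table. The sketch below asks SQLite for its query plan for a key lookup (OLTP style) and for an aggregate over the whole table (OLAP style). The `orders` table and the `plan` helper are invented for illustration, and the exact wording of the plan text varies between SQLite versions, so no output is shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(i, float(i)) for i in range(1, 1001)],
)


def plan(sql, params=()):
    """Return SQLite's query-plan descriptions for a statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[-1] for row in rows]  # the description is the last column


# OLTP style: one record located through its key.
oltp_plan = plan("SELECT amount FROM orders WHERE order_id = ?", (42,))

# OLAP style: an aggregate that has to read every row.
olap_plan = plan("SELECT SUM(amount) FROM orders")

print(oltp_plan)
print(olap_plan)
```

The first plan reports a keyed search and the second a scan of the table, which is why the two workloads benefit from being tuned, and often stored, separately.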
While often overlooked, data flow consumes a large share of bandwidth in data-intensive applications. The larger and more complex the application, the more data movement there is and the more complex your design considerations must be. It is therefore a good idea to model the path of data flow through your application and consider each point where normal application functions might slow that flow. Next, think about the architecture of your application and decide where you can optimize it for performance. When in doubt, always go back to basics: check the data flow path and ensure that the application performs adequately at each juncture.

Shantanu Bhattacharya has extensively designed and created architectures for application software, networking software, and security software. Most notably, he worked on India's first supercomputer (File System) and real-time software for the Indian Missile Program. Shantanu is currently a chief architect for Siemens Information Systems Limited in Bangalore, India.