Information architecture essentials, Part 7: Data-store design

The role of information architecture in data-store design


Level: Introductory

Benjamin Lieberman, Ph.D., Principal Software Architect, BioLogic Software Consulting

13 May 2008

Valuable business information should never be left sitting around. It should be organized and saved into a permanent data store. In the past, these data stores were represented by large filing rooms filled with cabinets. Today, the same goal is achieved by using relational and other types of electronic databases. Just like those dusty file systems of yore, a legacy database tends to become the final resting place for useful business information—and this information is essentially lost, because it can't be accessed in a meaningful way. Data-store design can help you establish an efficient mechanism to store and retrieve valuable business information.

Consider the design of data stores from the perspective of how the information will be used. Organizations use information for many different reasons, with three general purposes being the most common: monitoring, operations, and control.

Monitoring involves the collection of measurement information on the performance of the business, adherence to internal and external regulations, satisfaction of customer needs, return on investment to stakeholders, and other relevant metrics. This information may be time-constrained in that it has value only during a specific window of risk or opportunity, such as a system outage. Or it may have long-range value, such as the effect of cost-reduction strategies.

Operational data is involved with the day-to-day running of the organization and supplying value to customers. This information covers the vast bulk of data collected by a business, including orders, shipping records, inventory, accounts receivable and payable, and a variety of other transactional data. Operational data is the lifeblood of a business, recording the actions of the business in pursuit of the business's goals.

Control data is used for decision making and is usually compiled from the other two sources. For example, an executive may require information about the rate of order delivery compared to the cost of supporting delivery operations to decide whether to invest in a new order-tracking system. Control data is also important for reporting on the overall capabilities of the organization to the stakeholders and investors. Information from monitoring and operations is combined to illustrate cost centers, revenue-generation potential, and other critical key indicators to the senior management.

Skills and competencies

Each different type of data use calls for a different storage approach and associated data-store design. Design considerations such as storage sizing, performance, level of data detail, indexing, and access are all affected by the different ways data is used.

The following list summarizes each data use, along with its data-capture focus and key business driver:

  • Monitoring
    • Focus: Performance tuning
    • Driver: Timeliness
  • Operations
    • Focus: Transaction management
    • Driver: Data integrity
  • Control
    • Focus: Data manipulation
    • Driver: Accuracy

Because of the many design considerations affected by how the data will be used, various skills are necessary to support a cohesive data-storage strategy. Monitoring data-store design is mostly concerned with data collection. For example, a real-time flight monitor must be capable of capturing the aircraft flight status to the second on multiple flight parameters (airspeed, ground speed, altitude, attitude, flight control surface position, and so on). Storage must be rapid, especially if the system of interest is monitoring for abnormal conditions, such as a sudden drop in altitude or airspeed. Individuals tasked with design of these data-storage systems must be thoroughly familiar with data-capture characteristics of the available devices (which may be embedded systems), as well as the display of monitoring information to the operational support staff. They may also need to consider the amount of collected data, because storage may be at a premium.
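The capture concerns described above can be sketched in a few lines. This is only an illustration under stated assumptions: the `FlightSample` fields, the `FlightMonitor` class, the buffer capacity, and the 100 ft/s drop threshold are all hypothetical and are not taken from any real avionics system. The sketch shows two ideas from the paragraph: bounding storage so that the newest readings are always kept, and checking each new reading for an abnormal condition as it arrives.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class FlightSample:
    """One reading; the field names are illustrative assumptions."""
    t_seconds: int
    altitude_ft: float
    airspeed_kt: float


class FlightMonitor:
    """Keeps only the newest samples and flags sudden altitude drops."""

    def __init__(self, capacity, max_drop_ft_per_s):
        # deque(maxlen=...) silently discards the oldest sample once full,
        # which bounds storage when space is at a premium.
        self.samples = deque(maxlen=capacity)
        self.max_drop_ft_per_s = max_drop_ft_per_s
        self.alerts = []

    def record(self, sample):
        if self.samples:
            prev = self.samples[-1]
            elapsed = sample.t_seconds - prev.t_seconds
            drop = prev.altitude_ft - sample.altitude_ft
            # Flag the sample when altitude falls faster than the threshold.
            if elapsed > 0 and drop / elapsed > self.max_drop_ft_per_s:
                self.alerts.append(sample)
        self.samples.append(sample)


monitor = FlightMonitor(capacity=3, max_drop_ft_per_s=100.0)
for t, altitude in enumerate([30000.0, 29990.0, 29500.0, 29495.0]):
    monitor.record(FlightSample(t, altitude, 450.0))
```

After the loop, the buffer holds only the three most recent samples, and the reading at second 2 (a 490 ft drop in one second) is the only alert. A real device would also persist the samples; the in-memory buffer here only illustrates the sizing trade-off.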

Operational support data design is the most common task in a business system, requiring a balanced set of skills among performance, storage, and ease of access. The skills required include an understanding of basic relational data mechanics, such as normalization, metadata attributes, and indexing. Operational information is used by all aspects of the business, so it's vital to understand the system demands placed on the data. Index strategies, proper table joins, prefetching, caching, and other techniques permit rapid access to stored information.
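As a minimal sketch of the relational mechanics mentioned above, the following example uses Python's built-in `sqlite3` module with an in-memory database. The `customer` and `customer_order` tables, their columns, and the sample rows are hypothetical. The sketch shows a normalized pair of tables, an index on the join column, a join that summarizes orders per customer, and a foreign key that rejects an order for a customer that doesn't exist.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite leaves foreign-key enforcement off unless it is requested.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL
    );
    CREATE TABLE customer_order (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
        total_cents INTEGER NOT NULL
    );
    -- Index the join column so lookups by customer can avoid a full scan.
    CREATE INDEX idx_order_customer ON customer_order(customer_id);
""")

conn.execute("INSERT INTO customer VALUES (1, 'Acme')")
conn.executemany(
    "INSERT INTO customer_order VALUES (?, ?, ?)",
    [(10, 1, 2500), (11, 1, 4000)],
)

# Customer names are stored once; the join brings them back together.
rows = conn.execute("""
    SELECT c.name, COUNT(*), SUM(o.total_cents)
    FROM customer AS c
    JOIN customer_order AS o ON o.customer_id = c.customer_id
    GROUP BY c.customer_id, c.name
""").fetchall()

# An order that points at a missing customer violates referential integrity.
try:
    conn.execute("INSERT INTO customer_order VALUES (12, 99, 100)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True
```

Here `rows` contains one summary row for the single customer, and `orphan_rejected` ends up true because the database refuses the orphaned order.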

Decision-support systems rely on data collected from other data stores as the basis for processing and analysis. Skills in the design of data-warehouse extraction (including data-field mapping), transformation, consolidation, and efficient data loading are all critical to the design of systems providing business performance reports.

Along with the necessary technical skills and experience, it's also important for data designers to have strong familiarity with the business domain. Many critical design decisions are directly influenced by the utilization of the data. Questions about indexing strategies, performance-tuning approaches, referential integrity, and critical design issues require a deep appreciation for the quality of the data source and the intended target audience. For example, in developing a data store for an over-the-air cell phone programming system, the designers considered how much transactional data was required to track the success rate of programming attempts. The primary goal of the system was to transmit cell phone configuration data for newly reprogrammed mobile units; only limited requirements were provided for the tracking of each programming attempt. By investigating the problem domain more deeply, the designers found that the programming could fail for a variety of reasons (the cell user could enter a tunnel or hang up the unit; the unit could impact a solid surface at a high rate of speed; and so on), not all of which were under the system's control. Only by storing a detailed step-wise history of each transaction was it possible to tune the system for a maximum success rate.
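The step-wise history described in the cell phone example might look something like the following sketch. The table name, the step names (`connect`, `transfer`, `confirm`), and the sample data are assumptions made for illustration; the article doesn't describe the actual schema. The point is that recording one row per step, rather than one row per attempt, makes it possible to ask where attempts fail.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE programming_step (
        attempt_id INTEGER NOT NULL,
        step_no    INTEGER NOT NULL,
        step_name  TEXT    NOT NULL,
        succeeded  INTEGER NOT NULL CHECK (succeeded IN (0, 1)),
        PRIMARY KEY (attempt_id, step_no)
    )
""")

# Hypothetical history: each attempt stops at its first failed step.
history = [
    (1, 1, "connect", 1), (1, 2, "transfer", 1), (1, 3, "confirm", 1),
    (2, 1, "connect", 1), (2, 2, "transfer", 0),
    (3, 1, "connect", 0),
    (4, 1, "connect", 1), (4, 2, "transfer", 0),
]
conn.executemany("INSERT INTO programming_step VALUES (?, ?, ?, ?)", history)

# Count failures per step to see which stage is worth tuning first.
failures_by_step = conn.execute("""
    SELECT step_name, COUNT(*) AS failures
    FROM programming_step
    WHERE succeeded = 0
    GROUP BY step_name
    ORDER BY failures DESC, step_name
""").fetchall()
```

With this sample data, the query reports two failures at the `transfer` step and one at `connect`, information that a single pass/fail flag per attempt could not provide.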

Tools and techniques

When you consider the technical aspects of data-store architecture, the focus is on the business architectural drivers and how the data-store implementation is influenced by intended data use. This section focuses on the data architecture drivers and constraints, data modeling, data design, and performance.

As noted earlier, the primary concern for monitoring data systems is the speed of capture. In terms of the drivers listed earlier, the most critical items are data capture and database sizing. When you're considering a data-store design to capture monitoring information, some solutions may not involve a formal relational database. For example, consider the case of monitoring airplane dynamics; one solution would be to implement a B-tree that stores to a flash drive on the monitoring device. A B-tree performs insertions and retrievals in logarithmic time, O(log N) for N stored keys, which is why it is often used in indexing strategies. This approach stores and retrieves information rapidly while providing semipermanent storage in a small space.
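To make the logarithmic cost of a B-tree concrete, the following sketch computes the worst-case height of a B-tree from the standard textbook bound: a B-tree with minimum degree t (every node except the root holds at least t - 1 keys) and height h contains at least 2t^h - 1 keys. The function name and the sample figures are illustrative assumptions; the article doesn't specify the parameters of the B-tree it has in mind.

```python
def btree_max_height(num_keys, min_degree):
    """Return the largest height h that satisfies 2 * t**h - 1 <= n.

    Height counts edges from the root, so a lookup touches at most
    h + 1 nodes. Integer arithmetic avoids floating-point log errors.
    """
    if num_keys < 1 or min_degree < 2:
        raise ValueError("need at least one key and a minimum degree of 2")
    height = 0
    while 2 * min_degree ** (height + 1) - 1 <= num_keys:
        height += 1
    return height


# One million keys with a wide node versus the narrowest possible node.
wide = btree_max_height(1_000_000, 100)
narrow = btree_max_height(1_000_000, 2)
```

For one million keys, a minimum degree of 100 gives a height of at most 2, so a lookup reads at most three nodes. That small number of reads per lookup is what makes the structure attractive on a slow device such as a flash drive.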

For operational data stores, the focus is on both speed of access and accuracy of transactions. For these purposes, a relational database is almost always the technique of choice. Modern relational databases handle concurrency, transaction management, efficient data storage, security, and backup-and-restore operations. The main choice is between cost and capability: an open source database may be acceptable to an entrepreneurial venture due to low cost of ownership, whereas a Fortune 500 company may demand the highest performance and vendor support of a top-of-the-line commercial product. Operational systems are focused on maintaining accuracy via highly normalized data structures, supported by indexing for efficient retrievals and by network capabilities (such as storage area networks) for fast transfer of large amounts of data.


Finally, for decision making, the data store must support accurate, up-to-date information and allow for complex manipulations. The typical choice is a managed data warehouse, hosted on separate hardware from the operational or monitoring data stores and updated using some form of automated extraction, transformation, and loading. Many data warehouse choices exist, including IBM® InfoSphere™. These solutions focus on the transfer of information from operational or monitoring data stores, the transformation of that information based on business needs for reporting, and the display of that information as a set of reports to support business decisions.
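The extraction, transformation, and loading flow can be sketched on a very small scale as follows. This is not how a product such as InfoSphere works; it is a toy illustration that uses Python's built-in `sqlite3` module, with two in-memory databases standing in for the separate operational and warehouse hosts. The `shipment` and `delivery_summary` tables, their columns, and the sample rows are hypothetical.

```python
import sqlite3

operational = sqlite3.connect(":memory:")
warehouse = sqlite3.connect(":memory:")

operational.execute(
    "CREATE TABLE shipment (order_id INTEGER, region TEXT,"
    " days_to_deliver INTEGER, cost_cents INTEGER)"
)
operational.executemany(
    "INSERT INTO shipment VALUES (?, ?, ?, ?)",
    [(1, "east", 2, 500), (2, "east", 4, 700), (3, "west", 3, 900)],
)
warehouse.execute(
    "CREATE TABLE delivery_summary (region TEXT PRIMARY KEY,"
    " shipments INTEGER, avg_days REAL, total_cost_cents INTEGER)"
)


def extract(conn):
    """Pull only the columns that the report needs."""
    return conn.execute(
        "SELECT region, days_to_deliver, cost_cents FROM shipment"
    ).fetchall()


def transform(rows):
    """Consolidate detail rows into one summary row per region."""
    totals = {}
    for region, days, cost in rows:
        count, day_sum, cost_sum = totals.get(region, (0, 0, 0))
        totals[region] = (count + 1, day_sum + days, cost_sum + cost)
    return [
        (region, count, day_sum / count, cost_sum)
        for region, (count, day_sum, cost_sum) in sorted(totals.items())
    ]


def load(conn, summary_rows):
    """Replace the summary in one transaction (rolled back on error)."""
    with conn:
        conn.execute("DELETE FROM delivery_summary")
        conn.executemany(
            "INSERT INTO delivery_summary VALUES (?, ?, ?, ?)", summary_rows
        )


load(warehouse, transform(extract(operational)))
report = warehouse.execute(
    "SELECT * FROM delivery_summary ORDER BY region"
).fetchall()
```

The resulting `report` pairs delivery time with delivery cost for each region, the kind of consolidated view that the earlier executive example calls for. Note that the summary table deliberately duplicates information that could be derived from the operational data, trading redundancy for faster reporting.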

Data-store logical and physical modeling is focused on the capture of the business domain in such a way as to facilitate the development of accurate and efficient data structures. As noted, this doesn't always result in an entity relationship model (the most common form of database modeling) but may instead focus on packaging information for storage. In the monitoring example, the data model should represent the information stored with a minimal transformation of that data, even when there is duplication. This supports the driving concern over rapid storage rather than efficient retrievals or transaction management. In contrast, an operational data model frequently results in a standard normalized entity-relationship model, with minimal duplication and strong referential integrity for the data. Modeling for data warehouses should focus on mapping information from input data stores and transforming that information based on the questions to be answered (which may result in data redundancy to support faster processing).

Data-store implementation is driven by the following practical considerations:

  • Physical storage (relational, hierarchical, tagged, object)
  • Network layout (stand-alone, clustered, grid)
  • Operational support (user and application security, deployments, housekeeping)
  • Data conversion (mapping source to target, data cleansing, configuration data)
  • Data warehouse (extraction, transformation, loading)

Each of these considerations is influenced by the intended use of the data store. The physical layout for an operational database may be highly distributed to support business functions, whereas the decision-making system may be located in a single data-management center to minimize maintenance costs.

When you're considering a data-store architecture, it's important that the system performance be adequate for the task at hand. You should do extensive performance testing to ensure that the system will operate effectively even in unexpected conditions, such as high client concurrency or the loss of one or more data clusters.

The word performance can mean different things to different people, but in general it's measured by the data store's ability to provide the level of service expected for the intended use. Many factors influence a data store's ability to perform effectively, each of which must be considered during design and implementation:

  • Performance testing
  • Access or response time per request
  • Deadlocks
  • Table scans (thrashing)
  • I/O utilization
  • Transaction throughput
  • Referential integrity
  • Indexing, hints, and query optimization

Performance testing should be an integral part of the data-store design. Many open source and commercial products are available to assist you in evaluating potential data-store solutions.
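As one small example of the checks a performance test might include, the following sketch uses SQLite's `EXPLAIN QUERY PLAN` statement (through Python's built-in `sqlite3` module) to see whether a query scans the whole table or uses an index, and it times the query with `time.perf_counter`. The `orders` table, the index name, and the row counts are hypothetical, and the exact wording of the plan text varies between SQLite versions, so the sketch looks only for stable keywords.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY,"
    " customer_id INTEGER, total_cents INTEGER)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, total_cents) VALUES (?, ?)",
    [(i % 1000, i) for i in range(50_000)],
)

QUERY = "SELECT * FROM orders WHERE customer_id = ?"


def query_plan(conn, sql, params):
    """Return the detail text (last column) of EXPLAIN QUERY PLAN."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[-1] for row in rows)


def timed_fetch(conn, sql, params):
    """Run the query once and return (rows, elapsed seconds)."""
    start = time.perf_counter()
    rows = conn.execute(sql, params).fetchall()
    return rows, time.perf_counter() - start


plan_before = query_plan(conn, QUERY, (42,))
rows_before, seconds_before = timed_fetch(conn, QUERY, (42,))

conn.execute("CREATE INDEX idx_orders_customer ON orders(customer_id)")

plan_after = query_plan(conn, QUERY, (42,))
rows_after, seconds_after = timed_fetch(conn, QUERY, (42,))
```

Before the index exists, the plan reports a scan of the table; afterward, it names `idx_orders_customer`. Both runs return the same 50 rows. The timings depend on the machine and should be gathered over many runs before drawing conclusions, so they are collected here but not interpreted.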





Milestones

The development of an organization's set of data stores is based on the same best practices as any other system development, starting with the investigations of data needs and followed by identifying candidate data-store architectures, iteratively developing the data-store hardware and schema, testing, and release into production. The primary difference in the development of a data store is that it is often the target of other software activity, rather than directly interacting with end users. Consequently, the data-store development team must be in close contact with the software development group to ensure that the correct data-store elements (such as tables and columns) are available when needed.

Data stores in production follow a very different management and maintenance schedule from the one used for software or data-warehouse reporting. The most frequent request is to correct data issues that may have occurred due to a system failure or input of invalid data. For database administrators, the second most frequent task is to tune the database for better performance, either for direct report queries or to better support client applications.






About the author

Benjamin A. Lieberman serves as the Principal Architect for BioLogic Software Consulting. Dr. Lieberman provides consulting and training services on a wide variety of software development topics, including requirements analysis, software analysis and design, configuration management, and development process improvement. Dr. Lieberman is also an accomplished professional writer with a book (The Art of Software Modeling, Auerbach Publishing, 2006) and numerous software-related articles to his credit. Dr. Lieberman holds a doctorate degree in Biophysics and Genetics from the University of Colorado, Health Sciences Center, Denver, Colorado.






IBM, the IBM logo, ibm.com, DB2, developerWorks, InfoSphere, Lotus, Rational, and WebSphere are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both. These and other IBM trademarked terms are marked on their first occurrence in this information with the appropriate symbol (® or ™), indicating US registered or common law trademarks owned by IBM at the time this information was published. Such trademarks may also be registered or common law trademarks in other countries. A current list of IBM trademarks is available on the Web at http://www.ibm.com/legal/copytrade.shtml. Adobe and the Adobe logo are either registered trademarks or trademarks of Adobe Systems Incorporated in the United States, and/or other countries. Java and all Java-based trademarks are trademarks of Sun Microsystems, Inc. in the United States, other countries, or both. Other company, product, or service names may be trademarks or service marks of others.