The traditional, serial-based approach to data warehouse (DW) projects appears to be ill-suited in practice. This second article in a two-part series describes the activities which occur during the Elaboration, Construction, and Transition phases of a Rational Unified Process-based approach to DW projects. This approach reduces both your business and technical risk, while delivering a high-quality solution that meets the changing needs of its end users. The Rational Unified Process®, or RUP®, with leading-edge agile database development techniques such as database refactoring and database regression testing tailored into it, proves to be an effective process for DW development.
In Part One, I described some of the challenges with traditional approaches to DW development and how RUP addresses them. I then described the activities which occur during the RUP's Inception phase to get the project started and to address the scope risk associated with the effort. In this second part I describe how a RUP team successfully architects, implements, and then deploys a DW into production. As I did in Part One, I will continue using the Behemoth Retail Company, a fictional organization for which we're developing a data warehouse via true-to-life scenarios.
Throughout this article I describe how to apply several agile development techniques and philosophies, so be forewarned. Many development teams choose to instantiate the RUP in an agile manner, and many others choose to borrow a few agile ideas here and there. I'm a firm believer in being as agile as possible.
Figure 1 depicts the RUP lifecycle, which follows an evolutionary (iterative and incremental) approach to development. The iterative nature is depicted by the various disciplines along the left-hand side of the diagram: you iterate back and forth between various activities pertaining to requirements, analysis and design, testing, programming, and so on. You develop incremental releases over time, running through the lifecycle once for each production release of your system. Furthermore, you produce working software at the end of each Construction iteration, software which could potentially be released internally for testing and/or demo purposes.
Figure 1: The RUP lifecycle
The primary goal of the Elaboration phase is to mitigate the technical risk to your project. You do this by identifying the use cases which reflect the high-priority technical risks which your team faces, and then develop an end-to-end working skeleton of the system to prove your proposed architecture works. For a DW project your technical risks typically focus on data volume, access to the legacy data sources, and data quality.
Let's continue working through the Behemoth Retail Company example. Figure 2 presents the initial list of use cases which we identified during the Inception phase (this list isn't complete; normally we'd have hundreds of use cases for a large organization). This list is likely to change throughout the project, more on this later, but for now this is what we have to work from. To determine which use cases represent technical risk, we need to understand both the general gist of the individual use cases, which is why we initially documented the use cases in point form. We also need to understand our architectural strategy for the DW, captured via the UML deployment diagram in Figure 3, which we also developed during the Inception phase. Sometimes diagrams such as this aren't developed until the Elaboration phase, and that's perfectly fine -- your team needs to do what makes the most sense for the situation that it finds itself in.
Figure 2: An initial list of use cases for the Behemoth Retail Company DW
Figure 3: The initial deployment diagram
After examining the use cases and the current architectural vision, we decide that the primary technical risks to our DW are volume of store data, the quality of store data, and the complexity of supplier logistical data. The use cases Analyze Stock Turnover Rates in a Store, Archive Store Data After Three Months, and Determine Supplier Delivery Effectiveness seem to cover these risks, so we'll implement appropriate portions of those use cases during the Elaboration phase. Why only a portion of a use case? Because our current goal is to address risk, not to fully complete an individual use case. We might decide to reorganize the large use case into several smaller ones, and thereby be in a position to implement a complete, albeit smaller use case, or we might decide simply to finish implementing the use case in a Construction iteration. I prefer to work on smaller use cases and to complete them in a single iteration, but either approach works. One strategy to reorganize use cases is by logic path: if there are four alternate courses of action you might want to organize the larger use case into five smaller use cases, one for the basic path and one each for the alternate paths. After the reorganization we decide that we need to implement the Analyze Stock Turnover Rates for a Store Department, Archive Store Data After Three Months, and Determine Supplier Stock Delivery Accuracy.
To implement the Analyze Stock Turnover Rates for a Store Department use case we need to identify the legacy source systems, in this case the Store Sales database of Figure 3. As we do some deeper analysis of the situation we discover that our deployment diagram isn't completely accurate. Yes, there are store databases but there are different versions of them. Over the past several years Behemoth has merged with several of its competitors, each of whom had their own way of running a store. Although Behemoth has been converting the stores over to their own systems this effort is still underway and will continue for the next few years. As a result we need to be able to show that we can extract data from three different versions of the Store database -- BehemothStoreDB, AlsoRanStoreDB, and GaveItAGoodTryStoreDB -- each of which has its own schema.
Similarly, to implement the Determine Supplier Stock Delivery Accuracy use case we need to extract data from the Store Orders and Logistics databases. Luckily these databases are consistent throughout Behemoth; because of its size in the marketplace Behemoth is in a position to "motivate" its suppliers to work according to Behemoth's wishes. The data is complex, but we understand the structure very well.
The Archive Store Data After Three Months use case is a bit different. First, it requires us to mark the data so that we can later determine how old it is. Clearly we need to add a date timestamp to individual rows. Second, it requires us to remove older data and store it elsewhere. We should consider removing older data as part of the load process for incoming data. Third, it requires us to get access to the archived data to run reports on store data which go for more than three months.
During the Elaboration phase we write just enough code to prove the architecture. In our case we'll write code to extract some data from each of the three types of store database, the orders database, and the logistics database. We don't need to extract all of the data from those databases, nor do we need to get the data fully cleansed, we just need to show that we can do it in a performant manner.
We also need to write some typical reports which would use this data. These reports may not be pretty, and we may not know enough yet to implement complex business logic, but the fundamental data must be output somehow. We'd also need to archive older store data, something that we haven't thought through (there's no archival database shown in Figure 3 yet). Finally, we would need to write at least one report which goes against both the live data in the DW as well as the archived data.
For now, these reports may be written via Java or Visual Basic, even though we may decide at a later date to use a full-fledged reporting tool. At this point in time our team hasn't yet decided which reporting tool we're going to use, Figure 3 doesn't address this yet either. We know that there are a lot of tools out there; therefore not knowing what exact tool we're going to choose isn't a risk to our project and this is a decision that we can safely defer. Yes, eventually you will choose a reporting tool such as Eclipse BIRT or DB2 Query Management Facility. This sort of approach reflects the thinking of the lean software development community and it is inherent in RUP -- we don't need to think everything through at once; instead we'll focus on the critical risks now and defer decisions until the point where we actually need to make them.
We also need to show that we understand fundamental capacity planning issues, in particular both database sizing and performance considerations. Many data warehouses will double in size annually, so we may want to show that our architecture is expandable even though we will choose to expand the actual system on an as-needed basis in practice. In other words, you can develop a robust architecture for your system and prove that it is robust without overbuilding it today.
As you'd expect, our work products will evolve throughout the Elaboration phase. For example, we should update the deployment diagram of Figure 3 to indicate that there are three types of store databases as well as the addition of the archival database and archiving program. As we analyze the source databases we'll discover that our conceptual model needs updating to reflect our improved understanding of the domain. These models will evolve throughout the project, so we will need to be prepared for this.
We may even want to start some sort of source-to-target mapping at this point. During the Elaboration phase I would suggest mapping business entities within the conceptual model to the source databases within the legacy systems. As we implement the data extraction code I would also be tempted to add attributes to the conceptual model and then map them to the columns in the legacy data source, but I would only do this on a just-in-time (JIT) basis and only if it provides value to the overall project. It's quite common for DW project teams to get bogged down in this sort of work, so I prefer to err on the side of caution and minimize documentation until there is a clear need for it.
We'll also continue to evolve our requirements models, identifying new use cases as our stakeholders gain a better understanding of what they want. More importantly, we will want to focus on the types of reports and queries that people are likely to make as they perform the business activity described by the use cases. Figure 4 overviews an example of a simple report specification. Notice how the specification is fairly slim, covering the critical issues without going into details -- you can get the details on a JIT basis during the Construction phase. Many DW teams run into trouble when they try to over-specify what they're trying to achieve, a questionable goal at best when we know that our stakeholders are going to change their minds. Less documentation and more high-quality working software should be the order of the day.
|Report Name: Stock Turnover Rates for Store|
Figure 4: Initial report specification
An important thing that we can do early in a project is to adopt realistic data naming conventions which the development team is willing to follow. Conventions for addressing null values are also important, as is what to do when we receive bad data. Do we accept it as is or do we try to fix it? Do we report the problem to the owners of the source system in the hopes that they will fix the problem eventually or do we do the simplest thing possible and not report it? I generally lean towards fixing the data the best that I can and report any problems back to the owners.
There are several common mistakes which development teams will make during the Elaboration phase:
- Overly-detailed requirements documentation. Yes, we need to understand the requirements but that doesn't mean that we need to write a lot of documentation. I could have written a several page description, including mock ups, for the report specification of Figure 4. What would that work achieve? The details are likely to change anyway, negating the value of the original specification and also forcing me to update it to reflect the stakeholder's new intent. The fact is that Figure 4 provides enough information for the team to understand the basic requirements and it provides a base from which we can analyze the details on a JIT basis during Construction when we actually need to implement the report. Comprehensive documentation not only wastes money, it puts your project at risk by providing you with a false sense of security.
- Overly-detailed design documentation. Elaboration isn't a design phase, it's a "reduce your technical risk" phase. Yes, you need to design your solution but that doesn't mean that you need to do it up front when you know the least about the system. On RUP projects we do a bit of architectural modeling early in the lifecycle, then we prove that the architecture works, then we address the details in a JIT manner during Construction iterations.
- Writing detailed documentation describing the legacy data sources. You may discover that you don't even need to access some of these legacy sources, and you certainly won't access all of the data in them, so why invest all that effort writing documentation until you actually need to? At this point in time you want to understand the constraints on the data sources, in particular overall data quality and the availability/timing of the data, and gain a sense of the volume of data. Once again, the details can be addressed during Construction.
- Overly detailed source-to-target mappings. At this point in time you likely want to map the business entities depicted in your conceptual model to likely data sources, but any more detail than this will slow you down and actually put your project at risk. If you have the ability to capture this information today, certainly you'll still have that ability when and if you actually need the information.
- Not involving operations and support people. Operations and support staff are critical stakeholders on any project -- operations people have a very good understanding of how existing legacy systems work and support people understand what end users are currently struggling with. Furthermore, operations staff will have important quality of service (QoS) requirements and deployment considerations which your data warehouse will need to take into account.
To exit the Elaboration phase the team must pass the Lifecycle Architecture (LCA) milestone review. This review can be as formal or as informal as we choose, and naturally I prefer to keep things as simple and lightweight as possible. At the LCA review we must review the business case for the DW to determine if it still makes sense to build it based on our improved understanding of the architecture and requirements. My philosophy is that it is better to fail three months into a two-year project than to fail three years into it. More importantly we must be able to demonstrate a working, end-to-end skeleton for our DW which shows that we understand and have addressed the major technical risks. It is this skeleton upon which we will build the flesh of the system during the Construction phase. The fundamental goal of the LCA review is to make a "go/no-go" decision -- the DW must make both financial and technical sense before we can proceed.
During the Construction phase we develop the data warehouse in an evolutionary manner, putting the "flesh" onto the architectural skeleton which we developed during the Elaboration phase. The system is built in iterations. I've found that iterations between two and four weeks in length work best although if you're new to the RUP you may need to take longer at first as you get used to the approach. A Construction iteration is a timebox, at the end of which we have working software which implements the highest priority requirements to date. By working in this manner we are able to receive regular feedback from our stakeholders, increasing the chance that we understand their actual needs. Furthermore, by producing updated software on a regular basis we provide concrete evidence to our stakeholders that we're actually making progress on the project. The only true measure of progress on a software development project is the delivery of working software.
During each iteration we work on the highest priority requirements which are still left to be implemented, effectively working our way down the requirements stack. If our iterations are two weeks in length, we pull two weeks worth of work from the top of the priority stack. By working in this manner we are always in a position where we are achieving the maximum benefit for our stakeholder's IT investment, thereby reducing their financial risk.
The system, including the schema of the data warehouse, will evolve throughout the project. The implication is that we need to adopt several evolutionary development techniques:
- Configuration management. The goal of configuration management is to manage the artifacts of your project throughout the lifecycle. This is such an important thing that the RUP includes the Configuration and Change Management discipline which specifically addresses it. There is a wide range of configuration management tools, such as IBM Rational ClearCase, available to you.
- Continuous integration. The goal of continuous integration is to rebuild the system whenever something changes, providing you with clear feedback as to whether the changes you've introduced have broken something. Tools such as IBM Build Forge enable you to automate build and release efforts within your projects.
- Refactoring. A code refactoring is a simple change to source code which improves the design without breaking anything nor adding anything. Similarly a database refactoring is a simple change to a database schema which improves the design without breaking nor adding anything. Refactoring enables us to keep our design of the highest quality at all times, and because it's easier to work with a clean design than a questionable one it helps to keep developers as productive as possible. Better yet, refactoring enables us to safely fix existing legacy assets over time. The good news is that code refactoring tools are built into modern integrated development environments (IDEs) such as Eclipse. The bad news is that because database refactoring is so new that there are only a few initial tools available to you. Luckily, as Pramod Sadalage and I show in Refactoring Databases, 1 it's quite straightforward to implement database refactorings manually.
- Regression testing. The goal of regression testing is to ensure that a change to the system hasn't broken existing functionality. Therefore effective regression testing is a key enabler for evolutionary development because we're adding new functionality to the code base all the time. Modern IDEs have code testing tools, such as the xUnit suite, built right in and products such as IBM Rational Functional Tester can be used for database testing.
Let's assume that we're in Construction iteration number 5 of the Behemoth DW. During each of the first four iterations we implemented the functionality required to support several use cases and now we're about to do the same during this iteration. For each use case your general strategy is the same:
- Identify the required data elements. Because we've taken the time to write up high-level report specifications, such as the one in Figure 3, during Elaboration this will be fairly straightforward. Otherwise, we would have had to work closely with our stakeholders during this iteration to either write up such specifications or simply to sketch out what the report might look like. Either way, we likely won't identify all of the data elements that we will need for the entire iteration, but we'll get most of them. Even if we could develop a complete list our stakeholders will likely change their minds once they see what we've built, so we need want to be prepared to handle additional elements later in the iteration.
- Identify the source(s) of the data elements. For each element we need to identify the best source for it. Sometimes we will already have the element within the DW, and on the surface it may appear that we have all the work done for it, but we still want to ensure that this is true. Earlier in the project we may have captured a subset of the overall data for that data element but that may not be sufficient for this new report. For example, perhaps we've captured stock information for all dry goods but now we have a requirement to also capture stock information regarding food, information that comes from a new data source. Some data elements will be completely new to us and we won't have extracted them yet.
- Evolve DW data schema to accept new data elements. We will need to add new columns, and perhaps even new tables, to store the new data elements. We may even find that we need to refactor some aspects of the existing database schema in order to maintain the quality of the database design.
- Develop extraction code to obtain the missing data elements. For each element we'll need to develop code to extract, transform/cleanse it if necessary, and load it to the DW. Dealing with legacy data problems, from the analysis effort to judge the quality of the data to writing the transformation code to cleanse it, often proves to take the majority of the development time. When a data element occurs in a data source from which we're already extracting data then we merely need to add it to the existing extract code. If the element occurs in a data source which we haven't accessed yet then we have to gain access to the data source first. More on this below.
- Develop required reports. Sometimes we just need to make the data available for querying and sometimes we're developing a full-fledged report. In the case of the latter, or when we're extending an existing report, we're going to need to do some coding.
- Test, test, test. As I indicated above, we regression test thoroughly with an evolutionary approach to development.
- Get feedback from your stakeholders. Once we've developed a new report, updated an existing one, or made some new data elements available for ad-hoc querying, we show it to our stakeholders as soon as possible to gain feedback. We'll often hear comments like "This is pretty good, but it isn't exactly what I wanted. Can you do X?" and that is exactly what we want to hear because it enables us to rework things so that the DW meets their actual needs.
- Iterate. A RUP iteration isn't a mini-waterfall where you do all of the requirements analysis, then all of the design, then all of the programming, and so on for the iteration. Some people try to organize it like that, and if you're starting out with RUP. the concept of "mini-waterfalls" is a good beginning, but in the end you'll discover that you'll iterate back and forth between the steps that I've described above. Don't worry, that's normal.
Initially gaining access to a data source can range from being very easy to virtually impossible depending on the level of politics within your organization. Ideally, all you should have to do is speak with the appropriate operations staff and then work with them to obtain the documentation and access rights that you require. You may also need to do some capacity planning with them because the extraction needs of your DW could overly stress the legacy data source, thereby requiring upgrades. Not so ideally, you'll run into some politician, usually well-placed in your organization, who is desperate to protect the empire they've built over the years and who intends to do everything in his or her power to prevent you from accessing their data.
You will want to identify several critical pieces of information about an existing data source:
- What is its availability? 24/7? During certain periods?
- What is the access media? Network access? Tape? DVD?
- How can you access it? Via SQL calls? Via web services?
- Who is the contact person(s) for the data source?
- What security access rights are required?
- What is the structure of the incoming data?
From a modeling point of view I've always found that the physical data models (PDMs) which describes the DW schema and which describe the legacy data sources are by far the most critical models on the project. If you don't understand the physical structure of the data that you're currently working with then you're in serious trouble. This doesn't mean that you need to understand everything right up front, it just means that you need to understand what you're working on at the present moment. Tools such as IBM Rational Data Architect (RDA) and Computer Associates ER/Win are good options for physical data modeling.
The second most important models are the use cases because they describe how people will actually work with the DW, providing the insight that we require to ensure that our DW meets the needs of its end users. Use cases are typically written using tools such as Microsoft Word or a Wiki and use case diagrams via IBM Rational Software Architect (RSA) or Microsoft Visio.
Third would be the source-to-target mappings because people, including both end users and developers, want to know where the data comes from. If your mappings are simple then a tool such as Microsoft Excel or even a Wiki is likely to be sufficient. For more complex endeavors use a metadata management tool such as Rochade or Datamapper.
A distant fourth would be a detailed logical data model (LDM). The potential value of LDMs is that they provide a technology independent overview of the business entities supported by your DW. However, in practice it seems that logical data modeling proves to be little more than bureaucratic busywork that provides little real value to your organization. If you're going to create an LDM then I highly suggest you only capture information in those models which provide immediate benefit to your team. Far too much bureaucracy within IT departments seems to be justified by claims of future productivity improvements resulting from greater levels of documentation, yet those benefits rarely seem to materialize in the long run.
There are several common mistakes which we want to avoid during the Construction phase of a DW project:
- Taking a data intensive approach. People want reports which meet their business needs, so focus on providing the functionality that does that. This means that we need a usage-centered approach to development, not a data-centered one. Data is clearly an important part of the overall picture, but it's only one of many parts. If we focus on data and not usage we run the risk of building a data warehouse that nobody is interested in using, an all-too-common occurrence on traditional data warehouse efforts.
- Organizing the work by data source. In the RUP we organize work during the Construction phase by requirements, not by technical issues. In other words we don't do all the work required to get the data out of data source X during a given iteration. Instead, we do the work to fulfill specific stakeholder requirements. Then during each iteration we get a little more data from system X, and some more from system Y, and some more from system Z, and so on.
- Writing detailed report specifications. Following short iterations, perhaps a few weeks in length, and providing working software at the end of each iteration, often results in stakeholders who are far more interested in getting more software than they are in getting more specifications. Effective RUP teams focus on high-value activities such as actually developing reports instead of merely documenting what you intend to deliver at some point in the future.
- Inadequate tooling. Building a data warehouse is a complex endeavor, and you'll need a good toolset if you hope to succeed.
- Poor collaboration with stakeholders. In Communication Breakdown 2 Rick Sherman indicates that one of the greatest risks to business intelligence (BI) projects, of which DW projects are a subset, is poor communication. All of the communication challenges which he describes stem from a traditional, serial approach to development which is overly focused on data issues. The RUP, on the other hand, is an evolutionary process which promotes a use-case driven approach involving active stakeholder participation.
- Poor collaboration with the data source owners. To succeed at building a data warehouse we must work closely with the owners of the legacy data sources. They understand the data far better than we do, so a little bit of effective collaboration with them can dramatically reduce overall development costs. Yes, you'll likely have to write code which cleanses the incoming data, but that's the nature of the work.
To exit the Construction phase we must pass the Initial Operational Capability (IOC) review. The goal of this milestone is to ensure that the DW is ready to go into the Transition phase, that the stakeholders still support the deployment of the DW, and that the business case still makes sense.
The goal of the Transition phase is to successfully deploy your data warehouse into production. You will need to communicate your deployment plan to your end users, this is true if you're deploying incrementally or if you're deploying to all end users at once. This communication effort will take the form of emails, overview presentations, and end user training. During the first release of a data warehouse, end user training can be a substantial effort because the reporting and analysis tools will be new to most if not all of your end users. For subsequent releases of the warehouse the training effort should be minimal because you should only need to inform them of the new data available to them.
During the Transition phase you will perform final acceptance and system testing, but if you've tested thoroughly throughout Elaboration and Construction this effort proves to be little more than a formality. In fact, the most common mistake made during this phase is to just think that it's simply a testing phase. Testing at the end of the lifecycle is the worst possible time that you could do it because the cost of fixing a defect rises exponentially the longer it takes you to find it.
A successful data warehouse is a product whose lifecycle hopefully spans several decades. The point is that it is not developed as a single project, but instead as a series of projects over time. The implication is that once you've released into production your team will likely find itself right back at the beginning of the Construction phase again implementing new requirements. There's no rest for the weary.
As I've shown in this two-part article, it is not only possible to take an evolutionary approach to DW development by following the RUP, quite frankly it's one of the most desirable options available to you. By addressing scope and technical risk early in the project, and then developing the DW incrementally so that you can show concrete progress and gain feedback throughout the lifecycle, you dramatically increase both your productivity and the chance that you'll develop a useable DW.
I'd love to hear your feedback about this article and the ideas that I've presented in it. I'm a regular in the RUP forum on developerWorks, so feel free to post your comments there. Hopefully we'll get an interesting conversation going.
I'd like to thank Per Kroll, Christine Posluszny, and Mike Perrow for their feedback upon which I improved this series of articles.
1 See Scott W. Ambler and Pramod J. Sadalage, Refactoring Databases: Evolutionary Database Design, Addison Wesley Professional: 2006. http://www.ambysoft.com/books/refactoringDatabases.html
2 See Rick Sherman, "Communication Breakdown: The Achilles' Heel of BI Projects" in DM Review Magazine, November 2006.
- Read the 3-Part series, Deliver an effective and flexible data warehouse solution:
- Ken Collier's Agile Database Testing article in Agile Data Techniques focuses specifically on testing activities in a data warehousing project.
- Communication Breakdown: The Achilles' Heel of BI Projects by Rick Sherman describes problems on traditional business intelligence projects surrounding communication and collaboration.
- My September 2006 Data Quality survey for Dr. Dobb's Journal reveals how agile techniques such as database regression testing, database refactoring, and willing following data naming conventions improves data quality.
- The article Database Regression Testing describes effective strategies for database testing.
- The article The Joy of Legacy Data overviews many of the common data quality, database design, and data architecture challenges which you are likely to run into.
- The BI Best Practices Chat Program provides an excellent forum to hear leading-edge ideas from the IBM Business Intelligence team.
- See IBM Solution Builder Express for more detailed information on Business Intelligence and other solution areas.
- Check the official IBM DB2 Business Intelligence Website for updated product-specific information.
- Read DB2 Magazine for technical and business-level features on DB2 data mining, programming, system administration, content management, and more.
- Browse IBM Redbooks online for in-depth coverage of a wide array of technical subjects.
Scott W. Ambler is the Practice Leader Agile Development with IBM Rational and works with IBM customers around the world to improve their software processes. He is the founder of the Agile Modeling (AM), Agile Data (AD), Agile Unified Process (AUP), and Enterprise Unified Process (EUP) methodologies. He is the (co-)author of nineteen books, including Refactoring Databases, Agile Modeling, Agile Database Techniques, The Object Primer 3rd Edition, and The Enterprise Unified Process. He is a senior contributing editor at Dr. Dobb’s Journal. His personal home page is www.ibm.com/rational/bios/ambler.html and his Agile at Scale blog is www.ibm.com/developerworks/blogs/page/ambler.