Skip to main content

Key questions from an enterprise data architect

Plan for project success by taking measurements of the relevant data

Uche Ogbuji (uche@ogbuji.net), Consultant, Zepheira, LLC
Uche Ogbuji 的照片
Uche Ogbuji is partner at Zepheira, LLC, a solutions firm specializing in the next generation of Web technologies. Mr. Ogbuji is lead developer of 4Suite, an open source platform for XML, RDF, and knowledge-management applications and lead developer of the Versa RDF query language. He is a computer engineer and writer, born in Nigeria, living and working in Boulder, Colorado, U.S.A. Find out more about Mr. Ogbuji at his blog Copia.

Summary:  Data is the lifeblood of the enterprise, and the best way to prepare for a development and integration project is to document the characteristics of the data that drive the target applications. Learn the key questions that an enterprise data architect should explore in order to effectively document the characteristics of relevant data and take the most important first step towards project success.

Date:  06 May 2008
Level:  Intermediate PDF:  A4 and Letter (27KB | 7 pages)Get Adobe® Reader®
Activity:  923 views
Comments:  

There are many software development methodologies, but all of them emphasize the importance of the analysis portion of the life cycle. During analysis, you look to clearly understand the problem and create use cases. You document requirements and user-acceptance criteria. You document the social and technological environment in which the software is to be used. One part of analysis that's critical to the success of many projects is data analysis. Perhaps you are implementing an enterprise resource planning (ERP) system. Perhaps you are adding sophisticated e-commerce features to an existing corporate Web site. Perhaps you are developing an innovative software as a service (SaaS) product. In any such project, whether you are developing a new application or integrating existing applications, it's essential to understand the nature and flow of the related data.

Developing and documenting the shared understanding of relevant data sets and data characteristics is your primary job as an enterprise data architect. If the project involves integrating existing systems, you have to do a comprehensive analysis of the existing data and governing business rules. Often there is precious little written down clearly, and you have to perform the analysis from scratch. If the project is creating entirely new components, you might have to create new data models. Either way, the key to getting the right answers about data is in asking the right questions. In this article I present a series of questions you can use to make sure you cover all aspects of data during analysis.

Understand the domain

The first group of questions comes earliest. These are key to understanding the business problem that drives the data. Gathering good requirements is an important prerequisite to this step, because the requirements will lead you to the crucial concepts that you must translate into the data model.

What are the principal terms (nouns and verbs) that come from the requirements and description of the problem space? These terms will become the first data elements in the model.

How do you identify different entities in the problem space? Are people identified by user name? Are products identified by SKU? What are the rules governing the uniqueness of these identifiers? Are identifiers subject to geographical, temporal, or other variations? Note that such identifiers should almost never be used as primary keys in databases or data sets. The very fact that they come from the problem space usually taints their value as stable identifiers.

What is the frequency, volume, and granularity of requests for information in the system? Try to flesh out examples and scenarios to get a sense of the sorts of information users will need to put in and pull out of the system. These can be expanded in the creation of use cases or user stories.

What are the costs typically associated with changes to the data and to the data processing systems? These may be costs incurred in determining, validating, or propagating the changes through the business processes.

In what global regions will the data be consumed? Are the representations of the data subject to translation? Are the representations of the data subject to local interpretations? Are any of the source materials or feeder systems currently internationalized or translated? This is one of the more complicated questions, because it's one of the trickiest topics. But it's a very important topic. Explore these questions not just for their near-term answers but also for likely long-term trends. It's much more expensive to retrofit globalization to an existing system than to build it from scratch, and in today's economy you should always consider globalization unless there is cast-iron certainty that it will not be needed. Remember that globalization is not just a matter of language translation but also of dates, names, financial quantities, measurements, regulatory references, geographical references, and more.

What are the current source materials or feeder systems for information in the project? Try to get samples of existing source material to get a sense of data formats and data quality.

What is the scope of use of the information? Is the data used only within an organization? Within one or more selected departments? Is it shared with customers, vendors, or partners? Is it public?


Establish chain of responsibility

The earlier you can establish the rules, conventions, and patterns that govern data creation, flow, and disposal, the better. Surprises in data governance and responsibility can quickly derail a project. These matters are often the last thing the stakeholders think about, because they're inclined to be focused on the value they hope to gain from the system. So it's your job as data architect to instill the needed discipline. The following questions will allow you to understand who is responsible for data, what their responsibility entails, and when it might change hands.

What data elements are the primary responsibility of the system being developed? What data elements are externally referenced from the system? Try to establish as much detail as possible over ownership of the data. If more than one system can modify the data, which version is authoritative? Is there a process for conflict resolution?

Who creates the data? Who modifies it? Who disposes of it? This takes analysis of data ownership to greater granularity. Who are the key actors in the life cycle of the data? These questions are also closely related to full-blown use cases.

How should access to the data be controlled? It may be sufficient that users are organized into groups with differentiated access for creating, updating, reading, and deleting records. You may also require more granular control over actions on the data.

How sensitive are the various classes of information? What are the consequences of exposure or corruption? Determine the risks associated with the data, whether from leaks of data as it flows through the system, or from changes to the data or to the data processing systems themselves. The answers sometimes depend on policies and regulations, but you might also have to press the stakeholders to commit to business analysis of risks associated with information.

What data elements are considered invariants? This means that they can be changed only after extensive business process and review. Again, policy or regulations might mean that certain data elements are specially protected. This might indicate the need for additional testing to ensure integrity of such data.

When is the planned obsolescence of the system for processing the data? When is the planned obsolescence of the current format and storage of the data itself and for the essential information represented in the data? Remember to plan for obsolescence. Data usually outlives the systems that process it, but if you have a sense of the useful lifetime of the system, you can ensure smooth transitions in the future. Also, try to separate the essential content of the data from a particular form in which it's stored, remembering that there may be no foreseeable time at which the essential content is obsolete.


Refine the model

After gathering the base information from the stakeholders, the data architect has to refine the models and rules to prepare for lower-level design. This is generally the step in which you translate the conceptual model into the logical model.

Is the data mostly in the form of highly structured values or of documents and flowing text? These two classes of data are different in many ways, and it's a good idea to respect the differences. Clearly, you'll often have both kinds of data in the same system, but you still want to consider processing each in accordance with its nature.

To what extent are data elements dependent on or derived from other data elements? To what extent are data elements dependent on or derived from data externally referenced from the system? Dependencies between data are an important subtlety that requires technical skill to determine, but whether you're normalizing a database or creating properly extensible markup design, analyzing dependencies is key to getting relationships right.

Do you need to maintain versions and history of information? You can get the general idea for versioning needs through business analysis, but it takes experience to translate into implementation strategy. Version control is an ongoing challenge, so you should bite off only as much of it as you need to.

What rules govern which data must be updated and maintained together? At a high level this determines the key relationships in the data model. At a more granular level it outlines needs for transactions and concurrency.


Write it all down

That's a lot of questions to consider, and you might be wondering how best to record the answers. There are no definitive answers, of course, and the best advice is to follow the recommendations of your chosen software development methodology. Traditional methodologies provide very detailed specifications for meta-information, such as the class, activity, use case and structure diagrams of UML. Some agile methodologies go no further than recommending that you jot down user stories on index cards. I'm not very comfortable with any less than writing down the most important factors of the data architecture. It's enough to do so on a wiki, as long as you do record the notes somewhere accessible and maintainable. Two advantages of a wiki are that it's easy to share and easy to annotate. Sharing helps ensure that all those involved in the project understand its parameters. Annotation helps ensure that the design documentation does not die after it's completed. Change, surprises, and further refinement are inevitable over the lifetime of a project, and the data architect should have every incentive to continue to track the answers to these questions.


Wrap it up

The questions presented in this article are not a complete, ultimate set. Data architecture is as much art as science, and each professional will have a palette of favored techniques, which extends to these driving questions. Also, different organizations and even individual projects have specialized needs that affect what data parameters the architect should record. But having an established set of questions for probing during analysis and design helps provide direction, give participants confidence, and avoid oversight. Give your project a better chance of success by asking the right questions to nail down the shape and dynamics of data as early as possible in the project, and be sure to share the results, make them easy to reference, and easy to update or annotate.


Resources

Learn

Get products and technologies

  • Download IBM product evaluation versions and get your hands on application development tools and middleware products from DB2®, Lotus®, Rational®, Tivoli®, and WebSphere®.

Discuss

About the author

Uche Ogbuji 的照片

Uche Ogbuji is partner at Zepheira, LLC, a solutions firm specializing in the next generation of Web technologies. Mr. Ogbuji is lead developer of 4Suite, an open source platform for XML, RDF, and knowledge-management applications and lead developer of the Versa RDF query language. He is a computer engineer and writer, born in Nigeria, living and working in Boulder, Colorado, U.S.A. Find out more about Mr. Ogbuji at his blog Copia.

Comments



Trademarks

static.content.url=http://www.ibm.com/developerworks/js/artrating/
SITE_ID=1
Zone=SOA and Web services
ArticleID=306340
ArticleTitle=Key questions from an enterprise data architect
publish-date=05062008
author1-email=uche@ogbuji.net
author1-email-cc=