Key questions from an enterprise data architect

Plan for project success by taking measurements of the relevant data

Uche Ogbuji (uche@ogbuji.net), Consultant, Zepheira, LLC
Uche Ogbuji is a partner at Zepheira, LLC, a solutions firm specializing in the next generation of Web technologies. Mr. Ogbuji is lead developer of 4Suite, an open source platform for XML, RDF, and knowledge-management applications and lead developer of the Versa RDF query language. He is a computer engineer and writer, born in Nigeria, living and working in Boulder, Colorado, U.S.A. Find out more about Mr. Ogbuji at his blog Copia.

Summary:  Data is the lifeblood of the enterprise, and the best way to prepare for a development and integration project is to document the characteristics of the data that drive the target applications. Learn the key questions that an enterprise data architect should explore in order to effectively document the characteristics of relevant data and take the most important first step towards project success.

Date:  06 May 2008
Level:  Intermediate

There are many software development methodologies, but all of them emphasize the importance of the analysis portion of the life cycle. During analysis, you look to clearly understand the problem and create use cases. You document requirements and user-acceptance criteria. You document the social and technological environment in which the software is to be used. One part of analysis that's critical to the success of many projects is data analysis. Perhaps you are implementing an enterprise resource planning (ERP) system. Perhaps you are adding sophisticated e-commerce features to an existing corporate Web site. Perhaps you are developing an innovative software as a service (SaaS) product. In any such project, whether you are developing a new application or integrating existing applications, it's essential to understand the nature and flow of the related data.

Developing and documenting the shared understanding of relevant data sets and data characteristics is your primary job as an enterprise data architect. If the project involves integrating existing systems, you have to do a comprehensive analysis of the existing data and governing business rules. Often there is precious little written down clearly, and you have to perform the analysis from scratch. If the project is creating entirely new components, you might have to create new data models. Either way, the key to getting the right answers about data is in asking the right questions. In this article I present a series of questions you can use to make sure you cover all aspects of data during analysis.

Understand the domain

The first group of questions comes earliest. These are key to understanding the business problem that drives the data. Gathering good requirements is an important prerequisite to this step, because the requirements will lead you to the crucial concepts that you must translate into the data model.

What are the principal terms (nouns and verbs) that come from the requirements and description of the problem space? These terms will become the first data elements in the model.

How do you identify different entities in the problem space? Are people identified by user name? Are products identified by SKU? What are the rules governing the uniqueness of these identifiers? Are identifiers subject to geographical, temporal, or other variations? Note that such identifiers should almost never be used as primary keys in databases or data sets. The very fact that they come from the problem space usually taints their value as stable identifiers.
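
As a minimal sketch of the advice about primary keys, the following Python example uses SQLite through the standard sqlite3 module to keep a business identifier next to a surrogate key. The table names (product, order_line), the column names, and the sample SKU values are hypothetical and chosen only for illustration.

```python
import sqlite3


def build_catalog() -> sqlite3.Connection:
    """Create an in-memory schema that separates surrogate keys from SKUs."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(
        """
        CREATE TABLE product (
            product_id INTEGER PRIMARY KEY,   -- surrogate key, no business meaning
            sku        TEXT NOT NULL UNIQUE,  -- identifier from the problem space
            name       TEXT NOT NULL
        );
        CREATE TABLE order_line (
            order_line_id INTEGER PRIMARY KEY,
            product_id    INTEGER NOT NULL REFERENCES product(product_id),
            quantity      INTEGER NOT NULL
        );
        """
    )
    return conn


def sku_for_order_line(conn: sqlite3.Connection, order_line_id: int) -> str:
    """Look up the current SKU of the product referenced by an order line."""
    row = conn.execute(
        "SELECT p.sku FROM order_line ol "
        "JOIN product p ON p.product_id = ol.product_id "
        "WHERE ol.order_line_id = ?",
        (order_line_id,),
    ).fetchone()
    return row[0]


conn = build_catalog()
product_id = conn.execute(
    "INSERT INTO product (sku, name) VALUES (?, ?)", ("WID-001", "Widget")
).lastrowid
line_id = conn.execute(
    "INSERT INTO order_line (product_id, quantity) VALUES (?, ?)", (product_id, 3)
).lastrowid

# The business renumbers its SKUs (a hypothetical regional prefix is added).
# Only one row changes, because nothing else references the SKU directly.
conn.execute(
    "UPDATE product SET sku = ? WHERE product_id = ?", ("EU-WID-001", product_id)
)
```

The uniqueness rule for the SKU is still enforced, but references between records rest on a value that the business has no reason to change.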

What are the frequency, volume, and granularity of requests for information in the system? Try to flesh out examples and scenarios to get a sense of the sorts of information users will need to put into and pull out of the system. These can be expanded in the creation of use cases or user stories.

What are the costs typically associated with changes to the data and to the data processing systems? These may be costs incurred in determining, validating, or propagating the changes through the business processes.

In what global regions will the data be consumed? Are the representations of the data subject to translation? Are the representations of the data subject to local interpretations? Are any of the source materials or feeder systems currently internationalized or translated? Globalization is one of the trickiest topics in data analysis, which makes these some of the more complicated questions, but they are also among the most important. Explore these questions not just for their near-term answers but also for likely long-term trends. It's much more expensive to retrofit globalization to an existing system than to build it in from the start, and in today's economy you should always consider globalization unless there is cast-iron certainty that it will not be needed. Remember that globalization is not just a matter of language translation but also of dates, names, financial quantities, measurements, regulatory references, geographical references, and more.
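
One common way to keep dates and financial quantities open to later localization is to store them in a region-neutral form and leave presentation to the consuming application. The sketch below assumes that approach. The Money and Invoice classes, the make_invoice helper, and the sample values are hypothetical, and the currency field is assumed to hold an ISO 4217 code.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from decimal import Decimal


@dataclass(frozen=True)
class Money:
    amount: Decimal  # exact decimal value, never a binary float
    currency: str    # assumed to be an ISO 4217 code such as "USD"


@dataclass(frozen=True)
class Invoice:
    issued_at: datetime  # timezone-aware and normalized to UTC
    total: Money


def make_invoice(local_time: datetime, amount: str, currency: str) -> Invoice:
    """Normalize a regional timestamp and amount into a neutral record."""
    if local_time.tzinfo is None:
        # A timestamp without an offset means different instants in
        # different regions, so refuse to guess.
        raise ValueError("timestamp must carry a UTC offset")
    return Invoice(
        issued_at=local_time.astimezone(timezone.utc),
        total=Money(Decimal(amount), currency),
    )


# A fixed UTC-6 offset stands in for the local time of a source system.
source_offset = timezone(timedelta(hours=-6))
invoice = make_invoice(
    datetime(2008, 5, 6, 9, 30, tzinfo=source_offset), "1234.50", "USD"
)
print(invoice.issued_at.isoformat())  # 2008-05-06T15:30:00+00:00
```

This covers only two of the concerns listed above. Names, measurements, and regulatory references need their own analysis.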

What are the current source materials or feeder systems for information in the project? Try to get samples of existing source material to get a sense of data formats and data quality.
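
Once you have samples in hand, even a rough profile can reveal quality problems early. The following sketch assumes the sample arrives as CSV text. The profile_columns function and the sample rows are hypothetical, and the sketch counts only missing and distinct values per column.

```python
import csv
import io


def profile_columns(csv_text: str) -> dict:
    """Count rows, missing values, and distinct values for each column."""
    profile = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        for column, value in row.items():
            if column is None:
                # Extra fields beyond the header; ignored in this sketch.
                continue
            stats = profile.setdefault(
                column, {"rows": 0, "missing": 0, "distinct": set()}
            )
            stats["rows"] += 1
            if value is None or value.strip() == "":
                stats["missing"] += 1
            else:
                stats["distinct"].add(value.strip())
    return profile


sample = (
    "sku,name,price\n"
    "WID-001,Widget,9.99\n"
    "WID-002,,12.50\n"
    "WID-001,Widget,9.99\n"
)
profile = profile_columns(sample)
for column, stats in profile.items():
    print(column, stats["rows"], stats["missing"], len(stats["distinct"]))
```

In this sample the repeated SKU and the missing name are exactly the kind of findings to take back to the stakeholders as questions.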

What is the scope of use of the information? Is the data used only within an organization? Within one or more selected departments? Is it shared with customers, vendors, or partners? Is it public?


Establish chain of responsibility

The earlier you can establish the rules, conventions, and patterns that govern data creation, flow, and disposal, the better. Surprises in data governance and responsibility can quickly derail a project. These matters are often the last thing the stakeholders think about, because they're inclined to be focused on the value they hope to gain from the system. So it's your job as data architect to instill the needed discipline. The following questions will allow you to understand who is responsible for data, what their responsibility entails, and when it might change hands.

What data elements are the primary responsibility of the system being developed? What data elements are externally referenced from the system? Try to establish as much detail as possible over ownership of the data. If more than one system can modify the data, which version is authoritative? Is there a process for conflict resolution?

Who creates the data? Who modifies it? Who disposes of it? This takes analysis of data ownership to greater granularity. Who are the key actors in the life cycle of the data? These questions are also closely related to full-blown use cases.

How should access to the data be controlled? It may be sufficient that users are organized into groups with differentiated access for creating, updating, reading, and deleting records. You may also require more granular control over actions on the data.
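
The group-based model described above can be sketched in a few lines. This is an illustration under stated assumptions, not a recommended security design: the group names and their permissions are hypothetical, and a real system would usually delegate such checks to its database or identity platform.

```python
from enum import Flag


class Permission(Flag):
    NONE = 0
    CREATE = 1
    READ = 2
    UPDATE = 4
    DELETE = 8


# Hypothetical groups with differentiated access to one class of records.
GROUP_PERMISSIONS = {
    "clerks": Permission.CREATE | Permission.READ,
    "managers": Permission.CREATE | Permission.READ | Permission.UPDATE,
    "auditors": Permission.READ,
}


def is_allowed(groups, action: Permission) -> bool:
    """Return True if any of the user's groups grants the requested action."""
    granted = Permission.NONE
    for group in groups:
        # Unknown groups grant nothing.
        granted |= GROUP_PERMISSIONS.get(group, Permission.NONE)
    return action in granted


print(is_allowed(["clerks"], Permission.CREATE))    # True
print(is_allowed(["auditors"], Permission.UPDATE))  # False
```

If the stakeholders' answers call for rules that depend on individual records or fields, a flat table like this one is no longer enough, and that is worth recording during analysis.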

How sensitive are the various classes of information? What are the consequences of exposure or corruption? Determine the risks associated with the data, whether from leaks of data as it flows through the system, or from changes to the data or to the data processing systems themselves. The answers sometimes depend on policies and regulations, but you might also have to press the stakeholders to commit to business analysis of risks associated with information.

What data elements are considered invariants? Invariants are elements that can be changed only after an extensive business process and review. Again, policy or regulations might mean that certain data elements are specially protected. This might indicate the need for additional testing to ensure integrity of such data.

When is the planned obsolescence of the system for processing the data? When is the planned obsolescence of the current format and storage of the data, and of the essential information that the data represents? Remember to plan for obsolescence. Data usually outlives the systems that process it, but if you have a sense of the useful lifetime of the system, you can ensure smooth transitions in the future. Also, try to separate the essential content of the data from the particular form in which it's stored, remembering that there may be no foreseeable time at which the essential content is obsolete.


Refine the model

After gathering the base information from the stakeholders, the data architect has to refine the models and rules to prepare for lower-level design. This is generally the step in which you translate the conceptual model into the logical model.

Is the data mostly in the form of highly structured values or of documents and flowing text? These two classes of data are different in many ways, and it's a good idea to respect the differences. Clearly, you'll often have both kinds of data in the same system, but you still want to consider processing each in accordance with its nature.

To what extent are data elements dependent on or derived from other data elements? To what extent are data elements dependent on or derived from data externally referenced from the system? Dependencies between data are an important subtlety that requires technical skill to determine, but whether you're normalizing a database or creating properly extensible markup design, analyzing dependencies is key to getting relationships right.
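
One way to test a suspected dependency is to check it against sample data. The sketch below looks for rows that contradict a proposed functional dependency, in which the determinant columns are supposed to fix the dependent columns. The dependency_violations function, the column names, and the rows are hypothetical. Note that a sample can only refute a dependency; it cannot prove that one holds.

```python
def dependency_violations(rows, determinant, dependent):
    """Return determinant values that map to more than one dependent value."""
    seen = {}
    violations = []
    for row in rows:
        key = tuple(row[column] for column in determinant)
        value = tuple(row[column] for column in dependent)
        if key in seen and seen[key] != value and key not in violations:
            violations.append(key)
        # Remember the first dependent value observed for this key.
        seen.setdefault(key, value)
    return violations


rows = [
    {"sku": "WID-001", "warehouse": "east", "unit_price": "9.99"},
    {"sku": "WID-001", "warehouse": "west", "unit_price": "9.99"},
    {"sku": "WID-002", "warehouse": "east", "unit_price": "12.50"},
    {"sku": "WID-002", "warehouse": "west", "unit_price": "13.00"},
]

# Does the SKU alone determine the price? The sample says no.
print(dependency_violations(rows, ["sku"], ["unit_price"]))
# Do SKU and warehouse together determine it? Nothing here contradicts that.
print(dependency_violations(rows, ["sku", "warehouse"], ["unit_price"]))
```

A result like the first one is a prompt to ask the stakeholders whether the price really varies by warehouse or whether the sample exposes a data quality problem.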

Do you need to maintain versions and history of information? You can get the general idea for versioning needs through business analysis, but it takes experience to translate into implementation strategy. Version control is an ongoing challenge, so you should bite off only as much of it as you need to.
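
At the small end of the scale, versioning can be as simple as never overwriting a value. The following sketch assumes an append-only history for a single data element. The VersionedField class and the sample values are hypothetical, and concerns such as branching, merging, and storage cost are deliberately left out.

```python
from dataclasses import dataclass, field
from typing import Any, List, Tuple


@dataclass
class VersionedField:
    """Append-only history for one data element (hypothetical helper)."""

    _history: List[Tuple[int, str, Any]] = field(default_factory=list)

    def update(self, value: Any, changed_by: str) -> int:
        """Record a new value and return its version number."""
        version = len(self._history) + 1
        self._history.append((version, changed_by, value))
        return version

    def current(self) -> Any:
        if not self._history:
            raise LookupError("no value recorded yet")
        return self._history[-1][2]

    def at_version(self, version: int) -> Any:
        if not 1 <= version <= len(self._history):
            raise LookupError(f"unknown version {version}")
        return self._history[version - 1][2]


price = VersionedField()
price.update("9.99", changed_by="pricing-team")
price.update("10.49", changed_by="pricing-team")
print(price.current())      # 10.49
print(price.at_version(1))  # 9.99
```

If the business analysis shows that nobody ever needs the older values, even this much is more than you should take on.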

What rules govern which data must be updated and maintained together? At a high level this determines the key relationships in the data model. At a more granular level it outlines needs for transactions and concurrency.
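
At the granular level, a rule that two values must change together usually becomes a transaction. The sketch below uses SQLite through Python's standard sqlite3 module, where using the connection as a context manager commits on success and rolls back if an exception is raised. The stock table, the warehouse names, and the transfer function are hypothetical.

```python
import sqlite3


def make_inventory() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE stock ("
        " warehouse TEXT PRIMARY KEY,"
        " quantity INTEGER NOT NULL CHECK (quantity >= 0))"
    )
    conn.executemany("INSERT INTO stock VALUES (?, ?)", [("east", 10), ("west", 0)])
    conn.commit()
    return conn


def transfer(conn: sqlite3.Connection, source: str, target: str, quantity: int) -> bool:
    """Move stock between warehouses; both updates succeed or neither does."""
    try:
        with conn:  # commit on success, roll back on exception
            conn.execute(
                "UPDATE stock SET quantity = quantity + ? WHERE warehouse = ?",
                (quantity, target),
            )
            # If this would drive the quantity below zero, the CHECK
            # constraint fails and the credit above is rolled back too.
            conn.execute(
                "UPDATE stock SET quantity = quantity - ? WHERE warehouse = ?",
                (quantity, source),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def levels(conn: sqlite3.Connection) -> dict:
    return dict(conn.execute("SELECT warehouse, quantity FROM stock"))


conn = make_inventory()
print(transfer(conn, "east", "west", 4), levels(conn))
print(transfer(conn, "east", "west", 100), levels(conn))
```

The second transfer fails, and the stock levels stay as the first transfer left them. Recording the rule during analysis is what tells the designers that these two updates belong in one transaction.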


Write it all down

That's a lot of questions to consider, and you might be wondering how best to record the answers. There is no definitive way to do so, of course, and the best advice is to follow the recommendations of your chosen software development methodology. Traditional methodologies provide very detailed specifications for meta-information, such as the class, activity, use case, and structure diagrams of UML. Some agile methodologies go no further than recommending that you jot down user stories on index cards. I'm not comfortable with anything less than writing down the most important factors of the data architecture. It's enough to do so on a wiki, as long as you do record the notes somewhere accessible and maintainable. Two advantages of a wiki are that it's easy to share and easy to annotate. Sharing helps ensure that all those involved in the project understand its parameters. Annotation helps ensure that the design documentation does not die after it's completed. Change, surprises, and further refinement are inevitable over the lifetime of a project, and the data architect should have every incentive to continue to track the answers to these questions.


Wrap it up

The questions presented in this article are not a complete, ultimate set. Data architecture is as much art as science, and each professional will have a palette of favored techniques, which extends to these driving questions. Also, different organizations and even individual projects have specialized needs that affect what data parameters the architect should record. But having an established set of questions for probing during analysis and design helps provide direction, give participants confidence, and avoid oversights. Give your project a better chance of success by asking the right questions to nail down the shape and dynamics of data as early as possible in the project, and be sure to share the results and make them easy to reference, update, and annotate.



