5 Things to Know about Avoiding a Data Swamp with a Data Reservoir
By LindaMay Patterson
An estimated 70% of the time spent on analytic projects goes to identifying, cleansing, and integrating data. To rectify this situation, many organizations are considering a data lake solution. A data lake contains data from various sources. However, without proper management and governance, a data lake can quickly become a data swamp. A data swamp is unsafe to use because no one is sure where the data came from, how reliable it is, or how it should be protected. IBM proposes an enhanced data lake solution that is built with management, affordability, and governance at its core. This solution is known as a data reservoir.
1. What is a data reservoir?
A data reservoir provides credible information to subject matter experts (such as data analysts, data scientists, and business teams) so they can perform analysis activities such as investigating and understanding a particular situation, event, or activity. A data reservoir has capabilities that ensure the data is properly cataloged and protected, so subject matter experts can confidently access the data they need for their work and analysis.
2. What makes up a data reservoir?
The data reservoir is composed of three main components: services, repositories, and a governance fabric.
The services can locate, access, prepare, transform, process, and move data in and out of the data reservoir repositories.
The repositories provide platforms both for storing data and for running analytics as close to the data as possible.
The fabric provides the engines and libraries to govern and manage the data in the data reservoir.
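To make the division of labor concrete, here is a minimal Python sketch of how the three components might relate. It is an illustration only, not IBM's implementation: every class, field, and method name (CatalogEntry, Repository, Fabric, ReservoirServices, ingest, locate) is a hypothetical stand-in, and the "catalog" is just an in-memory dictionary.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """Governance metadata kept by the fabric (hypothetical fields)."""
    name: str
    source: str          # where the data came from
    classification: str  # how it should be protected, e.g. "public" or "sensitive"


@dataclass
class Repository:
    """A storage platform inside the reservoir (hypothetical)."""
    name: str
    datasets: dict = field(default_factory=dict)


@dataclass
class Fabric:
    """Holds the catalog used to govern and manage reservoir data."""
    catalog: dict = field(default_factory=dict)

    def register(self, entry: CatalogEntry) -> None:
        self.catalog[entry.name] = entry


class ReservoirServices:
    """Moves data into a repository and looks it up via the catalog."""

    def __init__(self, fabric: Fabric, repository: Repository):
        self.fabric = fabric
        self.repository = repository

    def ingest(self, entry: CatalogEntry, rows: list) -> None:
        # Catalog first, so no data lands without provenance and classification.
        self.fabric.register(entry)
        self.repository.datasets[entry.name] = rows

    def locate(self, name: str) -> CatalogEntry:
        # Raises KeyError for anything that was never cataloged.
        return self.fabric.catalog[name]


fabric = Fabric()
services = ReservoirServices(fabric, Repository("warehouse"))
services.ingest(CatalogEntry("web_logs", "web server", "public"), [{"path": "/"}])
print(services.locate("web_logs").source)
```

The point of the sketch is the ordering inside ingest: data is registered with the fabric before it is stored, which is one simple way to keep a repository from drifting into an uncataloged swamp.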
3. Where does the data come from that feeds the data reservoir?
Much of the data in the data reservoir comes from enterprise IT systems, such as business systems and business applications. Solutions that monitor activities might also be a source of data. For example, a source could be the log data on usage of the enterprise's website.
4. How do you roll out a data reservoir?
A data reservoir is a dynamic, agile environment for business teams to control and use in an interactive, self-service manner. There are at least two initial activities necessary to establish the governance and management framework essential to a data reservoir. One activity is to install the information integration and governance platform with at least one data repository. Another activity is the definition of the governance policies and related implementations for managing data for each subject area stored in the data reservoir.
5. What are some of the key roles in the team?
Various roles are important to defining and enhancing the data reservoir. For example, the governance team enables the data reservoir to accept data on new subject areas by defining the governance policies and related data definitions. The IT team enhances the data reservoir by adding new types of repositories, new data refineries, and feeds from non-traditional sources of information. A data refinery provides the ability to move and transform data into, out of, and between the data reservoir repositories. The data refineries use the governance policies to process the data efficiently and to ensure that those policies are enforced. Another role is that of the information curators, who define new sources of information that extend the ability to create insights with the data reservoir. Business teams are critical because they add their knowledge and departmental data to the reservoir, bringing additional perspective on the operational systems data.
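As a rough illustration of how a data refinery might enforce governance policies while moving data, consider the Python sketch below. It assumes a deliberately simple policy model, one policy per subject area listing the fields to mask. The Policy class, the refine function, and the masking rule are hypothetical and are not taken from any IBM product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    """Hypothetical governance policy for one subject area."""
    subject_area: str
    masked_fields: frozenset


def refine(rows, subject_area, policies):
    """Copy rows between repositories, masking fields the policy protects."""
    policy = policies.get(subject_area)
    if policy is None:
        # No policy means the governance team has not yet approved this subject area.
        raise ValueError(f"no governance policy for subject area {subject_area!r}")
    return [
        {key: ("***" if key in policy.masked_fields else value)
         for key, value in row.items()}
        for row in rows
    ]


policies = {"customer": Policy("customer", frozenset({"email"}))}
rows = [{"id": 1, "email": "a@example.com"}]
print(refine(rows, "customer", policies))
```

In this sketch the refinery refuses to process a subject area that has no policy, which mirrors the governance team's role described above: the reservoir accepts data on a new subject area only after its policies are defined.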
To learn more details about data reservoirs, see the IBM Redguide “Go
LindaMay Patterson is an IBM Redbooks Technical Writer. She works with thought leaders from across IBM to create books, papers, blog posts, and videos.