An estimated 70% of the time spent on analytics projects goes to identifying, cleansing, and integrating data. Acquiring data for analytics in an ad hoc manner places a heavy burden on the teams that own the systems supplying the data. Often the same type of data is requested repeatedly, and the original information owner finds it hard to keep track of who has copies of which data. As a result, many organizations are considering implementing a data lake solution. A data lake is a set of one or more data repositories created to support data discovery, analytics, ad hoc investigations, and reporting. It contains data from many different sources, and people in the organization are free to add data to it and to access any updates as necessary.
However, without proper management and governance, such a data lake can quickly become a data swamp. A data swamp is overwhelming and unsafe to use because no one is sure where the data came from, how reliable it is, or how it should be protected. IBM® proposes an enhanced data lake solution that is built with management, affordability, and governance at its core. This solution is known as a data reservoir.
This IBM Redguide™ publication explains the value of a data reservoir, describes how it fits into the existing business IT environment, and identifies sources of data for the data reservoir. It also provides a high-level architecture of a data reservoir and discusses the key components of that architecture. Finally, it identifies the key roles essential to creating, supporting, and maintaining the data reservoir, and explains how information integration and governance play a pivotal role in supporting it.