What is a data lake?

A data lake is a storage repository that holds an enormous amount of raw or refined data in its native format until it is accessed. The term data lake is usually associated with Hadoop-oriented object storage: an organization's data is loaded into the Hadoop platform, and business analytics and data-mining tools are then applied to the data where it resides on the Hadoop cluster. 

However, data lakes can also be used effectively without Hadoop, depending on the needs and goals of the organization. The term data lake is increasingly used to describe any large data pool in which the schema and data requirements are not defined until the data is queried, an approach often called schema-on-read. 
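The schema-on-read idea can be illustrated with a minimal sketch: raw records land in the lake as-is, and a structure is imposed only at query time. The record fields and the query below are hypothetical, not from any particular product.

```python
import json

# Raw event lines as they might land in a data lake: no schema was
# enforced at load time, so records may have missing fields.
raw_records = [
    '{"user": "ana", "clicks": 3, "country": "BR"}',
    '{"user": "bo", "clicks": 7}',  # missing "country" is fine at load time
]

def query_clicks_by_country(lines):
    """Apply a schema at read time: interpret each raw line with
    exactly the fields this query needs, defaulting where absent."""
    totals = {}
    for line in lines:
        rec = json.loads(line)            # schema applied here, when queried
        country = rec.get("country", "unknown")
        totals[country] = totals.get(country, 0) + rec.get("clicks", 0)
    return totals

print(query_clicks_by_country(raw_records))  # {'BR': 3, 'unknown': 7}
```

A different query against the same raw lines could impose a different schema, which is the flexibility the passage describes.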


Easier access to data across the organization

Access structured and unstructured data residing both on premises and in the cloud.

Faster data preparation

Spend less time locating and accessing data, thereby speeding up data preparation and reuse efforts.

Enhanced agility

Components of the data lake can be employed as a sandbox that enables users to build and test analytics models with greater agility.

More accurate insights, stronger decisions

Track data lineage to help ensure data is trustworthy. 


Apache™ Hadoop®

Manage large volumes and many types of data with open source Hadoop. Tap into unmatched performance, simplicity and standards compliance to use all data, regardless of where it resides. Visualize and filter large data sets, and distill them into consumable, business-specific contexts.

Apache™ Spark™

Build algorithms quickly, iterate faster and put analytics into action with Spark. Easily create models that capture insight from complex data, and apply that insight in time to drive outcomes. Access all your data, build analytic models in a unified programming model and deploy those analytics anywhere. 

Stream computing

Stream computing enables organizations to process continuous, always-on data streams, helping them spot opportunities and risks across all their data in time to effect change.
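The core of stream processing is handling each value as it arrives rather than waiting for the stream to end. The sketch below is a minimal, hypothetical illustration using a rolling average over a simulated sensor feed; real stream engines add time windows, partitioning and fault tolerance.

```python
from collections import deque

def rolling_average(stream, window=3):
    """Process an always-on stream incrementally: emit the average of
    the last `window` readings as each new value arrives."""
    buf = deque(maxlen=window)
    for value in stream:
        buf.append(value)
        yield sum(buf) / len(buf)

# Simulate a sensor feed; in production this would be a socket,
# message queue or other unbounded source.
readings = [10, 12, 14, 40, 16]

# Flag windows whose average crosses a threshold, as each reading arrives.
alerts = [avg for avg in rolling_average(readings) if avg > 20]
```

Because `rolling_average` is a generator, it produces results continuously and never needs the full data set in memory, which is what lets such logic "effect change" while the stream is still flowing.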

Governance and metadata tools

Governance and metadata tools enable you to locate and retrieve information about data objects as well as their meaning, physical location, characteristics, and usage.
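As a concrete sketch of what a metadata catalog records, the toy example below stores the meaning, physical location, characteristics and usage of a data object. All entry names and fields are illustrative, not from any specific governance product.

```python
# A toy metadata catalog keyed by data-object name. Each entry captures
# the kinds of facts the passage lists: meaning, location,
# characteristics and usage. (All values are hypothetical.)
catalog = {
    "sales.orders": {
        "meaning": "One row per customer order",
        "location": "s3://lake/raw/sales/orders/",
        "format": "parquet",
        "owner": "sales-eng",
        "used_by": ["revenue_dashboard", "churn_model"],
    },
}

def describe(name):
    """Locate and summarize the metadata for a data object, if known."""
    entry = catalog.get(name)
    if entry is None:
        return None
    return f"{name}: {entry['meaning']} at {entry['location']}"

print(describe("sales.orders"))
```

Even this tiny lookup shows why such tools matter in a lake: without the catalog, a user would have no way to know what `sales.orders` means or where it physically lives.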



IBM Db2 Big SQL

Db2 Big SQL is a SQL engine for Hadoop that concurrently exploits Hive, HBase and Spark using a single database connection, or even a single query. For this reason, Db2 Big SQL is also the ultimate hybrid engine. 

IBM® Big Replicate

IBM Big Replicate provides enterprise-class replication for Apache™ Hadoop® and object stores by delivering continuous availability, performance and guaranteed data consistency. It replicates big data from lab to production, from production to disaster recovery sites, or from ground to cloud object stores governed by the most demanding business and regulatory requirements.

IBM Data Science Experience

A cloud-based, social workspace that helps data scientists consolidate their work and collaborate across multiple open source tools such as R and Python. 


Data Lake: Taming the Data Dragon

Learn the key benefits a data lake offers IT and data scientists as a trusted data asset: agility in responding to project needs, lower costs, resilient provisioning of business-critical data, and protection against ungoverned data environments and usage springing up. 

Next-Generation Predictive Analytics

Ventana Research Benchmark Executive Summary: Gain valuable insight into how nearly 200 organizations of all sizes and industries are taking advantage of predictive analytics.

Making Sense of Big Data

A Day in the Life of an Enterprise Architect