Data Lake

Discovery of facts, patterns in data, and ad hoc reporting

IBM + Hortonworks

IBM and Hortonworks have partnered to bring you the future of data science!


Hortonworks and IBM are coming to a city near you!

Join us to learn about our newly expanded partnership and how it can benefit your data-driven business.


What is a data lake?

A data lake is a storage repository that holds an enormous amount of raw or refined data in its native format until it is accessed. The term is usually associated with Hadoop-oriented object storage, in which an organization's data is loaded into the Hadoop platform and business analytics and data-mining tools are then applied to the data where it resides on the Hadoop cluster. However, data lakes can also be used effectively without Hadoop, depending on the needs and goals of the organization. Increasingly, the term describes any large data pool in which the schema and data requirements are not defined until the data is queried.
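That "schema on read" idea, storing records as produced and imposing structure only at query time, can be sketched in a few lines of Python. The event records and field names below are invented for illustration:

```python
import json

# Hypothetical raw events landed in the lake exactly as produced;
# no schema was imposed at write time, and records differ in shape.
raw_records = [
    '{"user": "alice", "action": "login", "ts": "2023-01-01T09:00:00"}',
    '{"user": "bob", "action": "purchase", "amount": 42.50}',
    '{"user": "alice", "action": "logout"}',
]

def query(records, fields):
    """Apply a schema at read time: project only the requested fields,
    filling None for fields a given record never had."""
    out = []
    for r in records:
        rec = json.loads(r)
        out.append({f: rec.get(f) for f in fields})
    return out

# The "schema" (user, action) is defined only now, at query time.
result = query(raw_records, ["user", "action"])
```

Note that nothing about the stored records had to change to support this query; a different consumer could project a different schema over the same raw data.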


Easier access to a broad range of data across the organization

Access structured and unstructured data residing both on premises and in the cloud.

Faster data preparation

Spend less time locating and accessing data, thereby speeding up data preparation and reuse efforts.

Enhanced agility

Components of the data lake can be employed as a sandbox that enables users to build and test analytics models with greater agility.

More accurate insights, stronger decisions

Track data lineage to help ensure data is trustworthy.



Apache Hadoop

Manage large volumes and different types of data with open source Apache Hadoop systems. Tap into unmatched performance, simplicity and standards compliance to use all data, regardless of where it resides. Visualize, filter and analyze large data sets, turning them into consumable, business-specific insights.
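Hadoop processes those large volumes with the MapReduce model: a map phase emits key-value pairs in parallel, and a reduce phase aggregates them by key. A minimal in-process sketch of that model, with toy documents standing in for files on an HDFS cluster, might look like:

```python
from collections import defaultdict

# Toy documents standing in for files distributed across a cluster.
documents = ["big data on hadoop", "data lake on hadoop"]

def map_phase(doc):
    # Emit (word, 1) pairs, as a Hadoop mapper would for its input split.
    return [(word, 1) for word in doc.split()]

def reduce_phase(pairs):
    # Group pairs by key and sum the counts, as Hadoop reducers do
    # after the shuffle step.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

intermediate = [pair for doc in documents for pair in map_phase(doc)]
word_counts = reduce_phase(intermediate)
# → {'big': 1, 'data': 2, 'on': 2, 'hadoop': 2, 'lake': 1}
```

In a real Hadoop job the mappers and reducers run on many machines and the framework handles the shuffle between them; the shape of the computation, however, is the same.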



Apache Spark

Build algorithms quickly and put analytics into action with Apache Spark. Easily create models that capture insight from complex data, and apply that insight in time to drive outcomes. Access all data, build analytic models quickly, iterate fast in a unified programming model and deploy those analytics anywhere.
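The build-and-iterate loop described above can be illustrated with a tiny in-process model fit; at scale this is the kind of work Spark's MLlib would do across a cluster, but the one-parameter gradient-descent sketch below uses only invented toy data so it runs anywhere:

```python
# Invented toy data following y = 2x, for illustration only.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.0, 2.0, 4.0, 6.0]

def fit(xs, ys, lr=0.05, epochs=500):
    """Fit y ~ w*x by gradient descent on mean squared error.
    Each epoch is one cheap iteration; tweak and rerun to iterate fast."""
    w = 0.0
    n = len(xs)
    for _ in range(epochs):
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n
        w -= lr * grad
    return w

w = fit(xs, ys)  # converges toward the true slope, 2.0
```

The point of the sketch is the workflow: because each fit is fast and self-contained, you can change the data, the learning rate, or the model and immediately rerun, which is the iteration style Spark is designed to support on full-scale data.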



Stream computing

Stream computing enables organizations to process data streams that are always on and never cease. This helps them spot opportunities and risks across all their data in time to effect change.
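The always-on pattern means processing each record as it arrives rather than waiting for a complete data set. A minimal sketch of that incremental style, with an invented sensor feed and threshold standing in for a real, unbounded stream:

```python
def stream_processor(events, threshold):
    """Consume an (in principle unbounded) event stream one record at
    a time, maintaining a running mean and flagging values that exceed
    the threshold as they arrive."""
    count, total = 0, 0.0
    alerts = []
    for value in events:
        count += 1
        total += value
        mean = total / count
        if value > threshold:
            # Record (position in stream, offending value, mean so far).
            alerts.append((count, value, mean))
    return alerts

# A finite list stands in for a never-ending feed of sensor readings.
alerts = stream_processor([10, 12, 95, 11, 88], threshold=50)
```

Because state (the count and total) is updated per record, the processor needs constant memory no matter how long the stream runs, which is what makes acting on the data "in time" feasible.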


Governance and Metadata Tools

These tools enable you to locate and retrieve information about data objects, including their meaning, physical location, characteristics, and usage.
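A metadata lookup of this kind can be sketched with an in-memory catalog; real deployments use dedicated governance tools such as Apache Atlas in the Hadoop ecosystem, and every dataset name, path, and field below is invented for illustration:

```python
# Hypothetical in-memory catalog mapping dataset names to what is
# known about them: meaning, physical location, characteristics, owner.
catalog = {
    "sales_2023": {
        "location": "hdfs:///lake/raw/sales/2023/",   # illustrative path
        "format": "parquet",
        "owner": "finance",
        "description": "Raw point-of-sale transactions",
    },
}

def lookup(dataset):
    """Return the recorded metadata for a data object, or fail loudly
    if the object is ungoverned (has no catalog entry)."""
    entry = catalog.get(dataset)
    if entry is None:
        raise KeyError(f"no metadata recorded for {dataset!r}")
    return entry

meta = lookup("sales_2023")
```

Keeping this information outside the data itself is what lets users discover and trust lake contents without scanning the underlying files.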