What Is a Data Lake? | IBM

What is a data lake?

The term “data lake” was originally coined by the former CTO of Pentaho. It refers to a low-cost storage environment that typically houses petabytes of raw data.

Unlike a data warehouse, a data lake can store both structured and unstructured data, and it does not require a defined schema to store data, a characteristic known as “schema-on-read.” This flexibility in storage requirements is particularly useful for data scientists, data engineers, and developers, allowing them to access data for data discovery exercises and machine learning projects.
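
Schema-on-read can be illustrated with a minimal sketch in plain Python. The raw records, field names, and the `read_with_schema` helper below are hypothetical and only for illustration; a real data lake would typically use a query engine rather than hand-written parsing. The point is that the data is stored as-is, and each consumer decides which fields and types it needs at read time.

```python
import json

# Hypothetical raw records, landed as-is with no schema enforced at write time.
RAW_LINES = [
    '{"user_id": 1, "event": "click", "ts": "2024-01-01T10:00:00"}',
    '{"user_id": "2", "event": "purchase", "amount": "19.99"}',
    '{"device": "sensor-7", "temp_c": 21.5}',
]


def read_with_schema(lines, schema):
    """Apply a schema while reading.

    Keeps only records that contain every field named in `schema`
    and casts each of those fields to the requested type.
    """
    rows = []
    for line in lines:
        record = json.loads(line)
        if not all(name in record for name in schema):
            continue  # this record does not fit this particular read
        rows.append({name: cast(record[name]) for name, cast in schema.items()})
    return rows


# Two consumers read the same raw data with two different schemas.
events = read_with_schema(RAW_LINES, {"user_id": int, "event": str})
sensors = read_with_schema(RAW_LINES, {"device": str, "temp_c": float})
```

Here the same three raw lines serve both an events view and a sensor view, without either schema having been declared when the data was written.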

A recent Voice of the Enterprise report from 451 Research determined that almost “three quarters (71%) of enterprises are currently using or piloting a data lake environment or plan to do so within the next 12 months, and 53% of respondents are already in deployment or POC.” Respondents highlight business agility as a key benefit of their deployments, although the deployments themselves vary. The report also found that data lakes are typically hosted either in the cloud or on premises in an organization's own data centers.

While adopters are finding value in data lakes, some lakes can degrade into data swamps or data pits. A data swamp is the result of a poorly managed data lake, that is, one that lacks the data quality and data governance practices needed to yield useful insights. Without proper oversight, the data in these repositories is rendered useless. Data pits are similar to data swamps in that they provide little business value, but in these instances the source of the data issue is unclear. In both cases, involvement from data governance and data science teams can help to safeguard against these pitfalls.

Data lake vs. data warehouse

While data lakes and data warehouses both store data, each repository has its own storage requirements, which makes each one suited to different scenarios. For instance, data warehouses require a defined schema that fits specific analytics outputs, such as dashboards, data visualizations, and other business intelligence tasks. These requirements are usually specified by business users and other relevant stakeholders, who use the reporting output on a regular basis. The underlying structure of a data warehouse is typically organized as a relational system (i.e. in a structured data format) that sources data from transactional databases. Data lakes, on the other hand, incorporate data from both relational and non-relational systems, allowing data scientists to bring structured and unstructured data into a wider range of data science projects.
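
The warehouse side of this contrast, where the schema is defined before any data is loaded, can be sketched with Python's built-in `sqlite3` module. SQLite stands in for a warehouse here purely for illustration, and the `sales` table, its columns, and the `load` helper are hypothetical. Rows that do not fit the declared schema are rejected at write time.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The schema is declared before any data is loaded (schema-on-write).
conn.execute(
    """
    CREATE TABLE sales (
        order_id INTEGER PRIMARY KEY,
        region   TEXT NOT NULL,
        amount   REAL NOT NULL CHECK (amount >= 0)
    )
    """
)


def load(row):
    """Try to insert one row; return True only if it fits the schema."""
    try:
        conn.execute(
            "INSERT INTO sales (order_id, region, amount) VALUES (?, ?, ?)", row
        )
        return True
    except sqlite3.IntegrityError:
        return False


ok_valid = load((1, "EMEA", 120.0))
ok_missing_region = load((2, None, 50.0))     # violates NOT NULL
ok_negative_amount = load((3, "APAC", -5.0))  # violates the CHECK constraint

# A typical BI-style aggregate over the rows that conformed.
totals = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region"
).fetchall()
```

Only the first row is accepted, so the aggregate reflects conforming data only. A data lake would have stored all three records and left the filtering to whoever reads them.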

Each system also has its own set of advantages and disadvantages. For example, data warehouses tend to be more performant, but that performance comes at a higher cost. Data lakes may be slower in returning query results, but they have lower storage costs. Additionally, the storage capacity of data lakes makes them well suited to enterprise data.

Data lake vs. data lakehouse

While adoption of both data lakes and data warehouses will only increase with the growth of new data sources, the limitations of both repositories are leading to a convergence of these technologies. A data lakehouse couples the cost benefits of a data lake with the data structure and data management capabilities of a data warehouse. According to another survey report from 451 Research, “two-thirds of companies are already using or piloting a data lakehouse environment, or plan to do so within 12 months.” In addition, the report found that 93% of organizations that have embraced data lakes also plan to adopt a data lakehouse in the next 12 months.

Data lake architecture

Data lakes are commonly associated with Apache Hadoop, an open-source software framework that provides low-cost, reliable distributed processing and storage for big data. They were traditionally deployed on premises, but as indicated in 451 Research’s report, adopters are quickly moving to cloud environments because they give end users more flexibility. Unlike on-premises deployments, cloud storage providers allow users to spin up large clusters as needed, requiring payment only for the storage specified. This means that if you need additional compute power to run a job in a few hours rather than a few days, you can do so on a cloud platform by purchasing additional compute nodes. Forrester Research reports that businesses that use cloud data lakes instead of on-premises data lakes see savings of roughly 25%.
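
The trade-off between job duration and node count can be made concrete with a small back-of-the-envelope calculation. This is a sketch under a strong assumption that the job parallelizes perfectly, which real workloads rarely do; the workload size and the price per node-hour are hypothetical numbers, not figures from any provider.

```python
import math


def nodes_needed(total_node_hours, deadline_hours):
    """Smallest node count that finishes within the deadline,
    assuming the job scales linearly with the number of nodes."""
    return math.ceil(total_node_hours / deadline_hours)


def run_cost(nodes, hours, price_per_node_hour):
    """Cost of keeping `nodes` nodes running for `hours` hours."""
    return nodes * hours * price_per_node_hour


JOB_NODE_HOURS = 96         # hypothetical workload size
PRICE_PER_NODE_HOUR = 0.5   # hypothetical price

fast_nodes = nodes_needed(JOB_NODE_HOURS, deadline_hours=4)    # a few hours
slow_nodes = nodes_needed(JOB_NODE_HOURS, deadline_hours=48)   # a few days

fast_cost = run_cost(fast_nodes, 4, PRICE_PER_NODE_HOUR)
slow_cost = run_cost(slow_nodes, 48, PRICE_PER_NODE_HOUR)
```

Under the linear-scaling assumption, both runs consume the same number of node-hours, so the faster run needs more nodes but not necessarily a larger bill. In practice, coordination overhead and pricing tiers would change the picture.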

Within Hadoop, Hadoop Distributed File System (HDFS) stores and replicates data across multiple servers while Yet Another Resource Negotiator (YARN) determines how to allocate resources across those servers. You can then use Apache Spark to create one large memory space for data processing, allowing more advanced users to access data via interfaces using Python, R, and Spark SQL.
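
To see why replicating data across servers matters, consider the toy simulation below. It is not HDFS code and does not reproduce HDFS's rack-aware placement policy; the round-robin `place_blocks` function and the node names are hypothetical. The replication factor of 3 matches the HDFS default.

```python
def place_blocks(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin.

    A toy placement policy for illustration only.
    """
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


def readable_blocks(placement, failed_nodes):
    """Return the blocks that still have a replica on a live node."""
    return [
        block
        for block, replicas in placement.items()
        if any(node not in failed_nodes for node in replicas)
    ]


nodes = ["node-a", "node-b", "node-c", "node-d"]
placement = place_blocks(num_blocks=6, nodes=nodes, replication=3)

# With three replicas per block, losing any two nodes leaves every block readable.
after_two_failures = readable_blocks(placement, {"node-a", "node-b"})

# Losing three nodes can make some blocks unavailable.
after_three_failures = readable_blocks(placement, {"node-a", "node-b", "node-c"})
```

The design choice illustrated here is the usual one for distributed storage: each extra replica costs storage space but raises the number of simultaneous server failures the data can survive.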

As the volume of data grows at an exponential rate, data lakes serve as an essential component of the data pipeline.

Use cases of a data lake

Since data lakes are primarily leveraged for their ability to store vast amounts of raw data, the business purpose of the data does not necessarily need to be defined at the outset. That said, two main use cases for data lakes are:

- Proofs of concept (POCs): Data lake storage is ideal for proof-of-concept projects. A data lake's ability to store different types of data is especially beneficial for machine learning, as it provides the opportunity to incorporate both structured and unstructured data into predictive models. This is useful for tasks like text classification, where data scientists cannot use relational databases directly (at least not without preprocessing the data to fit schema requirements). Data lakes can also act as a sandbox for other big data analytics projects, ranging from large-scale dashboard development to IoT app support, which typically requires real-time streaming data. After the purpose and value of the data have been determined, it can then undergo ETL or ELT processing for storage in a downstream data warehouse.

- Data backup and recovery: High storage capacity and low storage costs allow data lakes to act as a storage alternative for disaster recovery. They can also support data audits and quality assurance, because data is stored in its native format (i.e. without transformations). This can be particularly useful if a data warehouse lacks appropriate documentation of its data processing, as it allows teams to cross-check work from previous data owners.

Finally, since data in a data lake doesn’t necessarily require an immediate purpose for storage, it can also be a way to store cold or inactive data at a cost-effective price, which may be useful at a later date for regulatory inquiries or net new analyses.
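
The hand-off from lake to warehouse, and the audit use case described above, can be sketched together in a few lines. This is a minimal illustration using only the Python standard library: the raw order records, the `orders` table, and the `extract`, `transform`, and `load` helpers are all hypothetical, and SQLite stands in for a downstream warehouse.

```python
import json
import sqlite3

# Hypothetical raw order events as they might sit in a lake, untransformed.
RAW_ORDERS = [
    '{"order_id": 1, "region": "emea", "amount": "120.50"}',
    '{"order_id": 2, "region": "APAC", "amount": "80"}',
    '{"order_id": 3, "region": "emea", "note": "amount missing"}',
]


def extract(lines):
    """Parse raw JSON lines into dictionaries."""
    return [json.loads(line) for line in lines]


def transform(records):
    """Keep records that have an amount, normalize region, cast amount."""
    return [
        (r["order_id"], r["region"].upper(), float(r["amount"]))
        for r in records
        if "amount" in r
    ]


def load(conn, rows):
    """Create the warehouse table and insert the transformed rows."""
    conn.execute(
        "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, region TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)


conn = sqlite3.connect(":memory:")
load(conn, transform(extract(RAW_ORDERS)))

# Audit: recompute the total from the raw data and compare with the warehouse.
warehouse_total = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
raw_total = sum(float(r["amount"]) for r in extract(RAW_ORDERS) if "amount" in r)
loaded_rows = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
```

Because the raw records remain in the lake unchanged, the warehouse figure can be re-derived independently, and the record that was dropped during transformation is still available for inspection.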

Benefits of a data lake

- Flexibility: Data lakes can ingest structured, semi-structured, and unstructured datasets, making them ideal for advanced analytics and machine learning projects.

- Cost: Since data lakes do not require as much upfront planning to ingest data (e.g. schema and transformation definition), less money needs to be invested in staff time. Additionally, the actual storage costs of data lakes are lower than those of other storage repositories, like data warehouses. This allows companies to allocate their budgets and resources more effectively across data management initiatives.

- Scalability: Data lakes can help businesses scale in a couple of ways. The self-service functionality and overall storage capacity make data lakes more scalable than other storage services. Additionally, data lakes provide a sandbox for workers to develop successful POCs. Once a project has demonstrated value at a smaller scale, it is easier to expand that workflow to a larger scale using automation.

- Reduced data silos: From healthcare to supply chain, companies across various industries experience data silos within their organizations. Since data lakes ingest raw data across different functions, dependencies on individual teams diminish, as a given dataset no longer has a single owner.

- Enhanced customer experience: While this benefit will not be seen immediately, successful proofs of concept can improve the overall user experience, enabling teams to better understand and personalize the customer journey through net-new, insightful analyses.

Challenges of a data lake

While data lakes provide a number of benefits, they are not without their challenges. Some of them include:

- Performance: As the volume of data fed into a data lake grows, performance suffers, and data lakes are already slower than alternative data storage systems.

- Governance: While a data lake’s ability to ingest various data sources provides enterprises with an advantage in their data management practices, it also requires strong governance to manage appropriately. Data should be tagged and classified with relevant metadata to avoid data swamps, and this information should be easily accessible through a data catalog, enabling self-service functionality for less technical staff, like business analysts. Finally, guardrails should also be put into place to meet privacy and regulatory standards; this can include access controls, data encryption, and more.
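
The governance practices above (tagging data with metadata, exposing it through a catalog, and applying access controls) can be sketched as a small in-memory catalog. This is an illustrative model only, not the API of any catalog product; the `CatalogEntry` and `DataCatalog` classes, the role names, and the file paths are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    path: str
    owner: str
    tags: set = field(default_factory=set)
    contains_pii: bool = False


class DataCatalog:
    def __init__(self):
        self._entries = {}

    def register(self, entry):
        """Refuse untagged data so nothing lands without metadata."""
        if not entry.tags:
            raise ValueError(f"{entry.path}: at least one tag is required")
        self._entries[entry.path] = entry

    def search(self, tag, role):
        """Return paths carrying `tag` that `role` is allowed to see.

        Entries flagged as containing PII are visible to stewards only.
        """
        return sorted(
            e.path
            for e in self._entries.values()
            if tag in e.tags and (not e.contains_pii or role == "steward")
        )


catalog = DataCatalog()
catalog.register(CatalogEntry("raw/sales/2024.json", "sales-eng", {"sales", "raw"}))
catalog.register(
    CatalogEntry("raw/crm/customers.csv", "crm-team", {"sales", "customers"}, contains_pii=True)
)

analyst_view = catalog.search("sales", role="analyst")
steward_view = catalog.search("sales", role="steward")
```

Requiring tags at registration time is one way to keep a lake from drifting into a swamp, and filtering search results by role gives less technical staff self-service discovery without exposing sensitive datasets.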