For a long time, organizations relied on relational databases (developed in the 1970s) and data warehouses (developed in the 1980s) to manage their data. These solutions are still important parts of many organizations’ IT ecosystems, but they were designed primarily for structured datasets.
With the growth of the internet, and especially the arrival of social media and streaming media, organizations found themselves dealing with vast amounts of unstructured data, such as free-form text and images. Data warehouses and relational databases were ill-equipped to handle this influx of unstructured, often real-time data because of their rigid schemas and comparatively high storage costs.
In 2011, James Dixon, then the chief technology officer at Pentaho, coined the term “data lake.” Dixon saw the lake as an alternative to the data warehouse. Whereas warehouses provide processed data for targeted business use cases, Dixon imagined a data lake as a large body of data housed in its natural format. Users could draw the data they needed from this lake and use it as they pleased.
Many of the first data lakes were built on the Hadoop Distributed File System (HDFS), the distributed storage layer of the open source Apache Hadoop framework. These early data lakes were hosted on-premises, but this approach quickly became an issue as data volumes continued to surge. Cloud computing offered a solution: moving data lakes to more scalable cloud-based object storage services.
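To make the idea concrete, here is a minimal sketch of how raw data might land in a cloud object store backing a data lake. It assumes Amazon S3 accessed through the boto3 library and a hypothetical bucket named `example-data-lake`; other object storage services follow the same pattern of writing files in their native format under a key prefix.

```python
# Minimal sketch: landing raw data in cloud object storage that backs a data lake.
# Assumes AWS credentials are already configured and that a hypothetical bucket
# named "example-data-lake" exists.
import json
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")

# A raw event captured in its natural (JSON) format -- no schema enforced on write.
event = {
    "user_id": 42,
    "action": "page_view",
    "captured_at": datetime.now(timezone.utc).isoformat(),
}

# Organize objects under date-based key prefixes so downstream jobs can find them.
key = f"raw/events/{datetime.now(timezone.utc):%Y/%m/%d}/event-0001.json"

s3.put_object(
    Bucket="example-data-lake",
    Key=key,
    Body=json.dumps(event).encode("utf-8"),
)
```

Because the object store imposes no schema, the same bucket can hold JSON, images, logs and Parquet files side by side; structure is applied later, when the data is read.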
Data lakes are still evolving today. Many data lake solutions now offer features beyond cheap, scalable storage, such as data security and governance tools, data catalogs, and metadata management.
Data lakes are also core components of data lakehouses, a relatively new data management solution that combines the low-cost storage of a data lake with the high-performance analytics capabilities of a warehouse.
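As a rough illustration of the lakehouse idea, the sketch below runs warehouse-style SQL directly over files sitting in object storage, so the lake's cheap storage doubles as the analytics layer. It assumes DuckDB with its httpfs extension and a hypothetical bucket of Parquet files; real lakehouse platforms add table formats, transactions and governance on top of this basic pattern.

```python
# Minimal sketch: querying files in a data lake with SQL, lakehouse-style.
# Assumes DuckDB is installed, cloud credentials are configured separately,
# and the hypothetical bucket "example-data-lake" holds Parquet files with
# user_id and action columns.
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs;")  # extension that lets DuckDB read s3:// paths
con.execute("LOAD httpfs;")

result = con.execute(
    """
    SELECT action, COUNT(*) AS events
    FROM read_parquet('s3://example-data-lake/curated/events/*.parquet')
    GROUP BY action
    ORDER BY events DESC
    """
).fetchall()

print(result)
```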