The first layer in many data platforms is the data storage layer. The type of data storage used depends on the needs of the organization and can include both on-premises and cloud storage. Common data stores include:
Data warehouses
A data warehouse—or enterprise data warehouse (EDW)—aggregates data from different sources into a single, central, consistent data store to support data analysis, data mining, AI and machine learning. Data warehouses are most often used for managing structured data with clearly defined analytics use cases.
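As a rough illustration of aggregating different sources into one consistent store, the sketch below uses Python's built-in sqlite3 module as a small stand-in for a warehouse. The table name, column names and the two sample sources are hypothetical and exist only for this example; a production warehouse would run on a dedicated engine.

```python
import sqlite3

# Two hypothetical source systems with different shapes and units.
crm_orders = [
    {"customer": "acme", "total_usd": 120.0},
    {"customer": "globex", "total_usd": 80.5},
]
web_orders = [("acme", 4550), ("initech", 1999)]  # (customer, total in cents)


def build_warehouse(crm_rows, web_rows):
    """Load both sources into one table with a single, fixed schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders ("
        "source TEXT NOT NULL, customer TEXT NOT NULL, total_usd REAL NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO orders VALUES ('crm', ?, ?)",
        [(r["customer"], r["total_usd"]) for r in crm_rows],
    )
    # Convert cents to dollars so every row uses the same unit.
    conn.executemany(
        "INSERT INTO orders VALUES ('web', ?, ?)",
        [(customer, cents / 100) for customer, cents in web_rows],
    )
    conn.commit()
    return conn


def revenue_by_customer(conn):
    """Run an analytics query across all sources at once."""
    cursor = conn.execute(
        "SELECT customer, ROUND(SUM(total_usd), 2) "
        "FROM orders GROUP BY customer ORDER BY customer"
    )
    return dict(cursor.fetchall())


warehouse = build_warehouse(crm_orders, web_orders)
print(revenue_by_customer(warehouse))
```

The point of the sketch is that the data is cleaned and reshaped before it is stored, so every later query can rely on the same columns and units.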
Data lakes
A data lake is a lower-cost storage environment that typically houses petabytes of raw data. It can store both structured and unstructured data in various formats, which lets researchers work more easily with a broad range of data.
Many early data lakes were built on the Hadoop ecosystem, an open-source framework for distributed storage and processing of large data sets. Starting around 2015, many data lakes began shifting to the cloud. A typical data lake architecture now might store data on an object storage platform, such as Amazon S3 from Amazon Web Services (AWS), and use a tool such as Spark to process the data.
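To make the "store raw data, process it later" pattern concrete, here is a minimal sketch that uses only the Python standard library. A temporary directory stands in for an object store such as Amazon S3, and a plain function stands in for a processing engine such as Spark. The file names, folder layout and field names are hypothetical.

```python
import csv
import json
import tempfile
from pathlib import Path


def land_raw_files(lake_root: Path) -> None:
    """Write raw data as-is, in whatever format each source produced."""
    events_dir = lake_root / "raw" / "events"
    events_dir.mkdir(parents=True, exist_ok=True)
    # JSON Lines from one hypothetical source.
    (events_dir / "app.jsonl").write_text(
        '{"user": "u1", "action": "login"}\n'
        '{"user": "u2", "action": "purchase"}\n',
        encoding="utf-8",
    )
    # CSV from another source, with different column names.
    (events_dir / "legacy.csv").write_text(
        "user_id,event\nu3,login\n", encoding="utf-8"
    )


def read_events(lake_root: Path) -> list:
    """Apply a common structure only when the data is read."""
    records = []
    for path in sorted((lake_root / "raw" / "events").iterdir()):
        if path.suffix == ".jsonl":
            for line in path.read_text(encoding="utf-8").splitlines():
                row = json.loads(line)
                records.append({"user": row["user"], "action": row["action"]})
        elif path.suffix == ".csv":
            with path.open(newline="", encoding="utf-8") as handle:
                for row in csv.DictReader(handle):
                    records.append({"user": row["user_id"], "action": row["event"]})
    return records


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    land_raw_files(root)
    events = read_events(root)
    logins = [e["user"] for e in events if e["action"] == "login"]
    print(logins)
```

Nothing is validated or reshaped when the files land, which keeps storage cheap and flexible. The cost is that each reader has to know how to interpret every format it encounters.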
Data lakehouses
A data lakehouse combines the capabilities of data warehouses and data lakes into a single data management solution.
Data warehouses typically offer better query performance than data lakes, but they are often more expensive and more limited in their ability to scale. Data lakes optimize for storage costs but lack the structure that makes analytics straightforward.
A data lakehouse is designed to address these challenges by using cloud object storage to store a broader range of data types—that is, structured data, unstructured data and semistructured data. A data lakehouse architecture combines this storage with tools to support advanced analytics efforts, such as business intelligence and machine learning.
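One way to picture this combination is a thin metadata layer that sits on top of files in low-cost storage and enforces a schema when data is written. The toy class below is an assumption made purely for illustration and does not reflect how any particular lakehouse product works. A temporary directory stands in for cloud object storage, and the class name, file layout and columns are hypothetical.

```python
import json
import tempfile
from pathlib import Path


class LakehouseTable:
    """Toy table: data files in cheap storage plus a schema stored beside them."""

    def __init__(self, root: Path, schema: dict):
        self.root = root
        self.schema = schema  # column name -> Python type
        root.mkdir(parents=True, exist_ok=True)
        # Persist the schema as metadata next to the data files.
        (root / "_schema.json").write_text(
            json.dumps({name: t.__name__ for name, t in schema.items()}),
            encoding="utf-8",
        )

    def append(self, rows: list) -> None:
        """Validate rows against the schema, then write them as a new data file."""
        for row in rows:
            if set(row) != set(self.schema):
                raise ValueError(f"unexpected columns: {sorted(row)}")
            for name, expected in self.schema.items():
                if not isinstance(row[name], expected):
                    raise TypeError(f"{name} must be {expected.__name__}")
        part = len(list(self.root.glob("part-*.jsonl")))
        lines = "\n".join(json.dumps(row) for row in rows)
        (self.root / f"part-{part:05d}.jsonl").write_text(lines, encoding="utf-8")

    def scan(self) -> list:
        """Read every data file back as structured rows."""
        rows = []
        for path in sorted(self.root.glob("part-*.jsonl")):
            text = path.read_text(encoding="utf-8")
            rows.extend(json.loads(line) for line in text.splitlines())
        return rows


with tempfile.TemporaryDirectory() as tmp:
    table = LakehouseTable(Path(tmp) / "sales", {"region": str, "amount": float})
    table.append([{"region": "emea", "amount": 10.0}, {"region": "apac", "amount": 4.5}])
    table.append([{"region": "emea", "amount": 2.5}])
    try:
        table.append([{"region": "emea", "amount": "n/a"}])  # wrong type
        rejected = False
    except TypeError:
        rejected = True
    total_emea = sum(r["amount"] for r in table.scan() if r["region"] == "emea")
    print(total_emea, rejected)
```

The data still lives in plain files, as in a data lake, but writes that break the schema are refused, so queries can rely on a consistent structure, as in a data warehouse.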