Date: 2023-11-01

The first foundational layer of a modern data platform is storage and processing.

Modern data storage systems focus on using data efficiently, which involves deciding both where data is stored and how it is processed. The two most popular storage architectures are data warehouses and data lakes, although data lakehouses and data mesh are gaining in popularity.



The data warehouse



Data warehouses are designed for managing structured data with clear and defined use cases.



The use of data warehouses can be traced back to the 1990s, when databases were used for storing data. These early data warehouses ran on premises and had very limited storage capacity.



Around 2013, data warehouses began shifting to the cloud, where scalability was suddenly possible. Cloud-based data warehouses have remained a preferred data storage system because compute power and storage can be scaled on demand, which improves processing speeds.



For a data warehouse to function properly, data must be collected, reformatted, cleaned and then loaded into the warehouse. Any data that can't be reformatted may be lost.
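
The collect, reformat, clean and load steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `RAW_ORDERS` records, the `orders` table and the `reformat` and `load` helpers are hypothetical, and an in-memory SQLite database stands in for a real warehouse. It shows how records that cannot be reformatted never reach the warehouse.

```python
import sqlite3
from datetime import datetime

# Hypothetical raw records, as they might arrive from a source system.
RAW_ORDERS = [
    {"id": "1", "amount": "19.99", "ordered": "2023-01-05"},
    {"id": "2", "amount": " 5.00 ", "ordered": "2023-01-06"},
    {"id": "3", "amount": "n/a", "ordered": "2023-01-07"},   # amount cannot be parsed
    {"id": "4", "amount": "7.50", "ordered": "07/01/2023"},  # unexpected date format
]


def reformat(record):
    """Return an (id, amount, ordered) row, or None if the record cannot be reformatted."""
    try:
        order_id = int(record["id"])
        amount = float(record["amount"].strip())
        ordered = datetime.strptime(record["ordered"], "%Y-%m-%d").date().isoformat()
    except (KeyError, ValueError):
        return None
    return (order_id, amount, ordered)


def load(conn, records):
    """Load the records that fit the schema and return how many were rejected."""
    conn.execute(
        "CREATE TABLE orders ("
        "id INTEGER PRIMARY KEY, amount REAL NOT NULL, ordered TEXT NOT NULL)"
    )
    rows = [reformat(r) for r in records]
    good = [row for row in rows if row is not None]
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", good)
    conn.commit()
    return len(rows) - len(good)


conn = sqlite3.connect(":memory:")  # stand-in for a warehouse connection
rejected = load(conn, RAW_ORDERS)
loaded = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(f"loaded={loaded} rejected={rejected}")
```

In this sketch two of the four records are loaded and two are rejected. A production pipeline would more likely route rejected records to a separate location for review than discard them.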



The data lake



In January 2008, Hadoop, which had been developed largely at Yahoo, became a top-level open-source project of the Apache Software Foundation. Hadoop pairs a distributed file system with the MapReduce processing model. Data lakes were originally built on Hadoop; they were scalable and designed for on-premises use. Unfortunately, the Hadoop ecosystem is extremely complex and difficult to use. Data lakes began shifting to the cloud around 2015, which made them much less expensive and more user-friendly.



Data lakes were originally designed to collect raw, unstructured data without enforcing a schema (a predefined format), so that researchers could gain more insights from a broad range of data. Because old, inaccurate or useless information is difficult to sift through, poorly maintained data lakes can become less effective “data swamps”.



A typical data lake architecture might store data in an object store such as Amazon S3, coupled with a processing tool such as Apache Spark.
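
The idea of storing raw data first and imposing structure only when it is read can be illustrated with a small Python sketch. The assumptions are deliberate: a temporary local directory stands in for object storage such as Amazon S3, plain Python stands in for a processing tool such as Spark, and the event records and the `read_clicks` helper are hypothetical.

```python
import json
import tempfile
from pathlib import Path

# A local directory stands in for an object storage bucket (assumption for this sketch).
lake = Path(tempfile.mkdtemp(prefix="lake_"))

# Ingest: write events exactly as received, with no schema enforced.
raw_events = [
    '{"user": "ana", "action": "click", "ms": 120}',
    '{"user": "bo", "action": "view"}',  # a missing field is accepted as-is
    "not json at all",                   # malformed data is accepted too
]
(lake / "events").mkdir()
(lake / "events" / "part-0001.jsonl").write_text(
    "\n".join(raw_events) + "\n", encoding="utf-8"
)


def read_clicks(root):
    """Apply a schema at read time: keep only well-formed click events."""
    clicks = []
    for path in sorted(root.glob("events/*.jsonl")):
        for line in path.read_text(encoding="utf-8").splitlines():
            try:
                event = json.loads(line)
            except json.JSONDecodeError:
                continue  # unusable records stay in the lake but are skipped here
            if isinstance(event, dict) and event.get("action") == "click":
                clicks.append({"user": event.get("user"), "ms": event.get("ms")})
    return clicks


clicks = read_clicks(lake)
print(clicks)
```

Nothing is rejected at write time, so the malformed line remains in storage. Every reader has to decide how to handle it, which is one way a lake can drift toward a “data swamp”.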



The data lakehouse



Data lakehouses merge the flexibility, cost efficiency and scaling abilities of data lakes with the ACID transactions and data management features of data warehouses. (ACID is an acronym for the four key properties that define a transaction: atomicity, consistency, isolation and durability.)
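
As a minimal illustration of what an ACID transaction guarantees, the Python sketch below uses an in-memory SQLite database as a stand-in; it is not how any particular lakehouse implements transactions. The `accounts` table and the `transfer` helper are hypothetical. A transfer that would violate a constraint is rolled back as a whole, so no partial update is left behind.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("a", 100), ("b", 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move amount between accounts as one transaction; return True on success."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        return False


ok = transfer(conn, "a", "b", 30)
failed = transfer(conn, "a", "b", 500)  # would overdraw account "a"
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(ok, failed, balances)
```

In the failed transfer the credit to `b` is applied first and then undone when the debit violates the `CHECK` constraint. That is the atomicity property: either both updates happen or neither does.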

Data lakehouses support both BI and machine learning. A key strength of the data lakehouse is its use of metadata layers, which record, for example, which files make up each table. Data lakehouses also use new query engines designed for high-performance SQL queries.
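
One way to picture a metadata layer is as a log that records which data files belong to each version of a table. The Python sketch below is a deliberately simplified illustration of that idea, not the format of any real lakehouse product; the `Commit` and `TableLog` classes and the file names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Commit:
    added: list
    removed: list = field(default_factory=list)


@dataclass
class TableLog:
    """Ordered list of commits; version N is the table state after commit N."""
    commits: list = field(default_factory=list)

    def commit(self, added, removed=()):
        self.commits.append(Commit(list(added), list(removed)))
        return len(self.commits) - 1  # version number of this commit

    def files_at(self, version=None):
        """Replay the log to find which data files make up a table version."""
        if version is None:
            version = len(self.commits) - 1
        live = set()
        for c in self.commits[: version + 1]:
            live |= set(c.added)
            live -= set(c.removed)
        return sorted(live)


log = TableLog()
v0 = log.commit(["part-000.parquet", "part-001.parquet"])
v1 = log.commit(["part-002.parquet"], removed=["part-000.parquet"])  # rewrite one file
print(log.files_at(v0))  # files in the first version
print(log.files_at())    # files in the latest version
```

Because readers consult the log rather than listing the storage directory, a writer can replace files without readers ever seeing a half-finished state, and earlier versions remain queryable.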



Data mesh



Unlike data warehouses, data lakes and data lakehouses, data mesh decentralizes data ownership. With this architectural model, each domain (e.g. a business unit or department) owns its data and shares it with other domains. For that sharing to work, all data within the data mesh system should follow common, agreed formats.



Data mesh systems can be useful for businesses that support multiple data domains. The data mesh design includes a data governance layer, an observability layer and a universal interoperability layer.
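
A governance layer shared by several domains can be sketched as a catalog that accepts a data product only if it meets a common contract. The Python below is a toy model under stated assumptions: the `REQUIRED_FIELDS` contract, the `DataProduct` and `Catalog` classes and the domain names are all hypothetical, and a real data mesh involves far more than a field check.

```python
from dataclasses import dataclass

# Hypothetical organization-wide contract: fields every data product must expose.
REQUIRED_FIELDS = frozenset({"id", "updated_at"})


@dataclass(frozen=True)
class DataProduct:
    domain: str        # the domain that publishes this product
    name: str
    fields: frozenset


class Catalog:
    """Shared governance layer: domains publish, the catalog checks the contract."""

    def __init__(self):
        self._products = {}

    def publish(self, product):
        missing = REQUIRED_FIELDS - product.fields
        if missing:
            raise ValueError(
                f"{product.domain}.{product.name} is missing {sorted(missing)}"
            )
        self._products[(product.domain, product.name)] = product

    def domains_publishing(self, name):
        """Return the domains that publish a product with the given name."""
        return sorted(d for (d, n) in self._products if n == name)


catalog = Catalog()
catalog.publish(DataProduct("sales", "orders", frozenset({"id", "updated_at", "total"})))

try:
    catalog.publish(DataProduct("marketing", "campaigns", frozenset({"id", "budget"})))
    rejected = False
except ValueError as exc:
    rejected = True
    print(exc)

print(catalog.domains_publishing("orders"))
```

The contract check is what lets one domain consume another's data without knowing how it was produced, which is the role the interoperability layer plays.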



Data mesh can be useful for organizations that are expanding quickly and need their data storage to scale with them.

