Apache Iceberg is a high-performance open source format for massive analytic tables, facilitating the use of SQL tables for big data and the safe integration of those tables with engines like Apache Spark, Trino, Flink, Presto, Hive and Impala.
In addition to its open table format specification, Iceberg also comprises a set of APIs and libraries that enable storage engines, query engines and execution engines to smoothly interact with tables following that format.
The Iceberg table format has become an integral part of the big data ecosystem, largely due to its ability to provide functions typically not available with other table formats. Using a host of metadata kept on each table, Iceberg allows for schema evolution, partition evolution and table version rollback without the need for costly table rewrites or table migration. It is fully storage-system agnostic, with support for multiple data sources and no file system dependencies.
Originally created by data engineers at Netflix and Apple in 2017 to address the shortcomings of Apache Hive, Iceberg was made open source and donated to the Apache Software Foundation the following year. It became a top-level Apache project in 2020.
Apache Iceberg’s speed, efficiency, reliability and overall user-friendliness help simplify and coordinate data processing at any scale. These strengths have made it a table format of choice for a number of leading data warehouses, data lakes and data lakehouses, including IBM watsonx.data®, Netezza® and Db2® Warehouse.
Iceberg is one of the few open source table formats that enable ACID transactions: data exchanges that preserve accuracy by guaranteeing atomicity, consistency, isolation and durability.
Iceberg’s genesis was an effort to address the practical limitations of Apache Hive tables in a large data lake environment. According to Ryan Blue, PMC Chair of the Apache Iceberg project and formerly a senior engineer at Netflix, “many different services and engines were using Hive tables. But the problem was, we didn’t have that correctness guarantee. We didn’t have atomic transactions,” he said at a 2021 conference. “Sometimes changes from one system caused another system to get the wrong data and that sort of issue caused us to just never use these services, not make changes to our tables, just to be safe.”1
Apache Hive itself originated as a means of making Apache Hadoop clusters operate similarly to SQL-accessible relational databases. While it works effectively for static data, it adapts poorly to changing datasets: changes must be manually coordinated across different applications and users, or else they risk corrupting large datasets and producing inaccurate results.
To guarantee accuracy in a dynamic environment, Iceberg was designed to ensure that any data transaction exhibits all four ACID properties:
Atomicity: All changes to data are performed as if they were a single operation; that is, either all of the changes are executed, or none of them are. For example, in a financial data transaction, atomicity ensures that any debit made from one account has a corresponding credit made to the other account.
Consistency: There are no contradictions between the overall data state before the transaction and the data state after the transaction. Continuing the financial transaction example, consistency ensures that the combined funds in the two accounts are the same after the transaction as they were before the transaction.
Isolation: The intermediate state of any transaction is invisible to other transactions; concurrent transactions (transactions operating simultaneously on the same set of data) are treated as if they were serialized. In this financial transaction, isolation ensures that any other transaction may see the transferred funds in the debited account or in the credited account, but not in both (nor in neither).
Durability: After a successful transaction, any data changes will persist even in the event of a system failure. In our financial example, this means that the transaction will remain complete even if there is a systemwide power outage immediately afterward.
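To make the atomicity guarantee concrete, here is a minimal sketch in PySpark, assuming a Spark session with the Iceberg runtime and a configured catalog (a configuration sketch appears later in this article); the catalog name demo and the table demo.bank.accounts are hypothetical. Because Iceberg commits each write as a single atomic snapshot, the debit and credit below become visible together or not at all:

```python
from pyspark.sql import SparkSession

# Assumes a session with the Iceberg Spark runtime and a catalog named
# "demo" already configured; "demo.bank.accounts" is a hypothetical table
# with id and balance columns.
spark = SparkSession.builder.getOrCreate()

# A single MERGE INTO commits as one Iceberg snapshot: either both the
# debit and the credit are applied, or neither is.
spark.sql("""
    MERGE INTO demo.bank.accounts t
    USING (
        SELECT 'acct-1' AS id, -100.00 AS delta
        UNION ALL
        SELECT 'acct-2' AS id, 100.00 AS delta
    ) s
    ON t.id = s.id
    WHEN MATCHED THEN UPDATE SET t.balance = t.balance + s.delta
""")
```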
The Apache Iceberg table format is often compared to two other open source data technologies offering ACID transactions: Delta Lake, an optimized storage layer originally created by Databricks that extends Parquet data files with a file-based transaction log and scalable metadata handling, and Apache Hudi—short for “Hadoop Upserts Deletes and Incrementals”—which was originally developed by Uber in 2016.
A 2022 study conducted by Synvert generated random data, stored it in JSON format in an AWS S3 bucket and used it to benchmark the three technologies. The testing ultimately demonstrated that Iceberg’s optimized table format yielded performance superior to that of both Delta Lake and Apache Hudi across all metrics tested.2
Storage: The resulting file sizes of the Iceberg tables were significantly smaller than those of Delta Lake or Hudi, which provides a major advantage for storage optimization.
Insert operations: For insert operations, Iceberg also had the fastest performance—that is, the shortest runtime. Both Iceberg and Delta Lake were significantly faster than Hudi.
Update operations: For update operations, Iceberg was drastically faster than both Delta Lake and Hudi. Notably, unlike its counterparts, Iceberg’s runtime did not significantly increase with the total number of records: at the maximum workloads tested within the study—500 million records—Iceberg was nearly 10 times faster than Delta Lake.
Removal operations: Likewise, Iceberg was multiple times faster than both alternatives at removal operations.
Iceberg implements a three-layer hierarchy of metadata files to ensure correctness and coordination of table data across diverse file formats and constant changes.
Written in Java and Python, with a Scala API also available, Iceberg supports a variety of big data file formats, including Apache Parquet, Apache Avro and Apache ORC. It offers functionality similar to that of SQL tables in traditional databases in a file-format-agnostic and vendor-agnostic way, allowing multiple engines to operate on the same dataset.
The architecture of an Iceberg table comprises three layers: the Iceberg catalog, the metadata layer and the data layer.
The Iceberg catalog itself sits atop the metadata layer and data layer, much like the tip of an iceberg sits above the water’s surface. It stores the up-to-date (or “current”) metadata pointers that map the name of a given table to the location of its current metadata file(s). In addition to its built-in catalog, Iceberg supports other catalog frameworks like Hive Metastore or AWS Glue.
Operations at the Iceberg catalog level are atomic, as this is essential to ensuring the correctness of subsequent transactions.
A query engine thus begins any SELECT query at the Iceberg catalog, which provides the location of the current metadata file for the table the query engine seeks to read.
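As a minimal sketch of this flow, the PySpark session below registers an Iceberg catalog named demo (the name, the Hadoop-style catalog type and the warehouse path are all illustrative assumptions) and then issues a SELECT that resolves the hypothetical table demo.db.events through that catalog:

```python
from pyspark.sql import SparkSession

# Register an Iceberg catalog named "demo". The Hadoop catalog type and
# local warehouse path are illustrative; Hive Metastore, AWS Glue or a
# REST catalog could be configured here instead.
spark = (
    SparkSession.builder
    .appName("iceberg-catalog-demo")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.demo.type", "hadoop")
    .config("spark.sql.catalog.demo.warehouse", "file:///tmp/iceberg-warehouse")
    .getOrCreate()
)

# The SELECT resolves "demo.db.events" through the catalog, which hands
# the engine the location of the table's current metadata file.
spark.sql("SELECT * FROM demo.db.events").show()
```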
The Iceberg metadata layer comprises—in descending order—metadata files, manifest lists and manifest files.
Metadata files store a table’s metadata, including the table’s schema, partition information, its current snapshot and snapshots of previous states. Having been pointed to the current metadata file from the table’s entry in the Iceberg catalog, the query engine uses the [current-snapshot-id] value to find its entry in the [snapshots] array. From there, it can locate and open the table’s manifest list.
The manifest list is simply a list of manifest files, along with important information about each manifest file in it, such as its location, the snapshot it is associated with and the partitions it belongs to. Certain optimizations and filtering functions are available at this stage.
Manifest files track data files and their associated details, metadata and statistics. This drives one of the fundamental advantages of the Iceberg table format over the Hive table format: its ability to track data at the file level. At this stage, the [file-path] values for each [data-file] object can be used to find and open that file.
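This hierarchy is not opaque: Iceberg exposes it through queryable metadata tables, so each level described above can be inspected directly. A short sketch, reusing the hypothetical demo.db.events table from earlier:

```python
from pyspark.sql import SparkSession

# Reuse the session configured with the "demo" Iceberg catalog above.
spark = SparkSession.builder.getOrCreate()

# snapshots: one row per table snapshot (the [snapshots] array).
spark.sql("SELECT snapshot_id, committed_at FROM demo.db.events.snapshots").show()

# manifests: the manifest files referenced by the current manifest list.
spark.sql("SELECT path, added_data_files_count FROM demo.db.events.manifests").show()

# files: the data files tracked by the manifests, with per-file statistics.
spark.sql("SELECT file_path, record_count FROM demo.db.events.files").show()
```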
The data layer, as its name suggests, resides below the metadata layer and contains the data files themselves.
Apache Iceberg offers a number of helpful features to improve and simplify data management.
Iceberg handles all details of partitioning and querying under the hood. Iceberg’s hidden partitioning saves users the work of supplying partition layout information when querying tables. Users do not need to maintain partition columns themselves, or even understand the physical table layout, to get accurate query results.
This doesn’t just make Iceberg partitioning very user-friendly—it also allows partition layouts to be changed over time without breaking pre-written queries. When partition specs evolve, the data in the table (and its metadata) are unaffected. Only new data, written to the table after evolution, is partitioned with the new spec, and metadata for this new data is kept separately.
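As a sketch of both behaviors in Spark SQL (the table and column names are hypothetical), a partition transform is declared once at table creation, queries filter on the raw column, and the partition spec can later evolve in place:

```python
from pyspark.sql import SparkSession

# Reuse the session configured with the "demo" Iceberg catalog above.
spark = SparkSession.builder.getOrCreate()

# Hidden partitioning: partition by day of event_ts via a transform.
# Readers and writers never reference the derived partition column.
spark.sql("""
    CREATE TABLE demo.db.events (
        id INT,
        event_ts TIMESTAMP,
        payload STRING
    ) USING iceberg
    PARTITIONED BY (days(event_ts))
""")

# Queries filter on the source column; Iceberg prunes partitions itself.
spark.sql("""
    SELECT count(*) FROM demo.db.events
    WHERE event_ts >= TIMESTAMP '2024-01-01 00:00:00'
""").show()

# Partition evolution: switch future writes to hourly granularity.
# Existing data keeps its old layout, and the query above still works.
spark.sql("ALTER TABLE demo.db.events DROP PARTITION FIELD days(event_ts)")
spark.sql("ALTER TABLE demo.db.events ADD PARTITION FIELD hours(event_ts)")
```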
Iceberg provides native support for schema evolution, allowing users to modify table schemas without the need for complex data migration and greatly streamlining adaptation to evolving data structures.
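A brief sketch in Spark SQL, continuing with the hypothetical demo.db.events table; each statement is a metadata-only change, so no existing data files are rewritten:

```python
from pyspark.sql import SparkSession

# Reuse the session configured with the "demo" Iceberg catalog above.
spark = SparkSession.builder.getOrCreate()

# Add, rename and widen columns without rewriting existing data files.
spark.sql("ALTER TABLE demo.db.events ADD COLUMNS (region STRING)")
spark.sql("ALTER TABLE demo.db.events RENAME COLUMN payload TO body")
spark.sql("ALTER TABLE demo.db.events ALTER COLUMN id TYPE BIGINT")  # safe widening: INT to BIGINT
```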
Iceberg allows users to time travel back through snapshots of Iceberg data at different points in time. This is valuable for a variety of use cases, including audits, debugging and compliance checks.
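For example (a sketch; the snapshot ID and timestamp below are placeholders), recent Spark versions can read an Iceberg table as of an earlier snapshot or point in time:

```python
from pyspark.sql import SparkSession

# Reuse the session configured with the "demo" Iceberg catalog above.
spark = SparkSession.builder.getOrCreate()

# List the available snapshots for the table.
spark.sql("SELECT snapshot_id, committed_at FROM demo.db.events.snapshots").show()

# Read the table as it existed at a point in time (placeholder timestamp)...
spark.sql("SELECT * FROM demo.db.events TIMESTAMP AS OF '2024-01-01 00:00:00'").show()

# ...or as of a specific snapshot ID (placeholder value).
spark.sql("SELECT * FROM demo.db.events VERSION AS OF 4358109269898721381").show()
```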
Iceberg offers a number of capabilities that help optimize query performance, like compaction options to merge smaller files into larger ones to reduce metadata overhead, and Bloom filters to reduce unnecessary data reads during query execution.
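Both are exposed through SQL in Spark: compaction runs as a stored procedure, and Parquet Bloom filters are enabled per column through table properties (a sketch; the names remain the hypothetical ones used above):

```python
from pyspark.sql import SparkSession

# Reuse the session configured with the "demo" Iceberg catalog above.
spark = SparkSession.builder.getOrCreate()

# Compaction: merge small data files into larger ones to reduce the
# per-file metadata overhead paid at query planning time.
spark.sql("CALL demo.system.rewrite_data_files(table => 'db.events')").show()

# Manifest files can be compacted the same way.
spark.sql("CALL demo.system.rewrite_manifests(table => 'db.events')").show()

# Enable a Parquet Bloom filter on the id column for newly written files.
spark.sql("""
    ALTER TABLE demo.db.events SET TBLPROPERTIES (
        'write.parquet.bloom-filter-enabled.column.id' = 'true'
    )
""")
```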
1 "Apache Iceberg: The Hub of an Emerging Data Service Ecosystem?", Datanami, 8 February 2021
2 "Clash of ACID Data Lakes: Comparing Data Lakehouse Formats", Data Insights, 9 August 2022