Apache Iceberg is a high-performance open source format for massive analytic tables, facilitating the use of SQL tables for big data and the safe integration of those tables with engines like Apache Spark, Trino, Flink, Presto, Hive and Impala.
In addition to its open table format specification, Iceberg also comprises a set of APIs and libraries that enable storage engines, query engines and execution engines to smoothly interact with tables following that format.
The Iceberg table format has become an integral part of the big data ecosystem, largely due to its ability to provide functions typically not available with other table formats. Using a host of metadata kept on each table, Iceberg allows for schema evolution, partition evolution and table version rollback without the need for costly table rewrites or table migration. It is fully storage-system agnostic, with support for multiple data sources and no file system dependencies.
Originally created by data engineers at Netflix and Apple in 2017 to address the shortcomings of Apache Hive, Iceberg was made open source and donated to the Apache Software Foundation the following year. It became a top-level Apache project in 2020.
Apache Iceberg’s speed, efficiency, reliability and overall user-friendliness help simplify and coordinate data processing at any scale. These strengths have made it a table format of choice for a number of leading data warehouses, data lakes and data lakehouses, including IBM watsonx.data®, Netezza® and Db2® Warehouse.
Iceberg is one of the few open source table formats that enable ACID transactions: data exchanges that preserve accuracy by guaranteeing atomicity, consistency, isolation and durability.
Iceberg’s genesis was an effort to address the practical limitations of Apache Hive tables in a large data lake environment. According to Ryan Blue, PMC Chair of the Apache Iceberg project and formerly a senior engineer at Netflix, “many different services and engines were using Hive tables. But the problem was, we didn’t have that correctness guarantee. We didn’t have atomic transactions,” he said at a 2021 conference. “Sometimes changes from one system caused another system to get the wrong data and that sort of issue caused us to just never use these services, not make changes to our tables, just to be safe.”1
Apache Hive itself originated as a means of making Apache Hadoop clusters operate similarly to SQL-accessible relational databases. While it works effectively for static data, it adapts poorly to changing datasets: changes must be manually coordinated across different applications and users, or else they risk corrupting large datasets and producing inaccurate results.
To guarantee accuracy in a dynamic environment, Iceberg was designed to ensure that any data transaction exhibits all four ACID properties:
Atomicity: All changes to data are performed as if they were a single operation; that is, either all of the changes are executed, or none of them are. For example, in a financial data transaction, atomicity ensures that any debit made from one account has a corresponding credit made to the other account.
Consistency: There are no contradictions between the overall data state before the transaction and the data state after the transaction. Continuing the financial transaction example, consistency ensures that the combined funds in the two accounts are the same after the transaction as they were before the transaction.
Isolation: The intermediate state of any transaction is invisible to other transactions; concurrent transactions (transactions operating simultaneously on the same set of data) are treated as if they were serialized. In this financial transaction, isolation ensures that any other transaction may see the transferred funds in the debited account or in the credited account, but not in both (nor in neither).
Durability: After a successful transaction, any data changes will persist even in the event of a system failure. In our financial example, this means that the transaction will remain complete even if there is a systemwide power outage immediately afterward.
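To make the atomicity guarantee concrete, here is a minimal sketch in PySpark, assuming a Spark session with the Iceberg runtime and a configured catalog (a configuration sketch appears later in this article); the catalog name demo and the table demo.bank.accounts are hypothetical. Because Iceberg commits each write as a single atomic snapshot, the debit and credit below become visible together or not at all:

```python
from pyspark.sql import SparkSession

# Assumes a session with the Iceberg Spark runtime and a catalog named
# "demo" already configured; "demo.bank.accounts" is a hypothetical table
# with id and balance columns.
spark = SparkSession.builder.getOrCreate()

# A single MERGE INTO commits as one Iceberg snapshot: either both the
# debit and the credit are applied, or neither is.
spark.sql("""
    MERGE INTO demo.bank.accounts t
    USING (
        SELECT 'acct-1' AS id, -100.00 AS delta
        UNION ALL
        SELECT 'acct-2' AS id, 100.00 AS delta
    ) s
    ON t.id = s.id
    WHEN MATCHED THEN UPDATE SET t.balance = t.balance + s.delta
""")
```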
The Apache Iceberg table format is often compared to two other open source data technologies offering ACID transactions: Delta Lake, an optimized storage layer originally created by Databricks that extends Parquet data files with a file-based transaction log and scalable metadata handling, and Apache Hudi—short for “Hadoop Upserts Deletes and Incrementals”—which was originally developed by Uber in 2016.
A 2022 study conducted by Synvert generated random data, stored it in JSON format in an AWS S3 bucket and used it to benchmark the three technologies. The testing ultimately demonstrated that Iceberg’s optimized table format yielded performance superior to that of both Delta Lake and Apache Hudi across all metrics tested.2
Storage: The resulting file sizes of the Iceberg tables were significantly smaller than those of Delta Lake or Hudi, which provides a major advantage for storage optimization.
Insert operations: For insert operations, Iceberg also had the fastest performance—that is, the shortest runtime. Both Iceberg and Delta Lake were significantly faster than Hudi.
Update operations: For update operations, Iceberg was drastically faster than both Delta Lake and Hudi. Notably, unlike its counterparts, Iceberg’s runtime did not significantly increase with the total number of records: at the maximum workloads tested within the study—500 million records—Iceberg was nearly 10 times faster than Delta Lake.
Removal operations: Likewise, Iceberg was multiple times faster than both alternatives at removal operations.
Iceberg implements a three-layer hierarchy of metadata files to ensure correctness and coordination of table data across diverse file formats and constant changes.
Written in Java and Python, with a Scala API also available, Iceberg supports a variety of big data file formats, including Apache Parquet, Apache Avro and Apache ORC. It offers functionality similar to that of SQL tables in traditional databases in a file-format-agnostic and vendor-agnostic way, allowing multiple engines to operate on the same dataset.
The architecture of an Iceberg table comprises three layers: the Iceberg catalog, the metadata layer and the data layer.
The Iceberg catalog itself sits atop the metadata layer and data layer, much like the tip of an iceberg sits above the water’s surface. It stores the up-to-date (or “current”) metadata pointers that map the name of a given table to the location of its current metadata file(s). In addition to its built-in catalog, Iceberg supports other catalog frameworks like Hive Metastore or AWS Glue.
Operations at the Iceberg catalog level are atomic, as this is essential to ensuring the correctness of subsequent transactions.
A query engine thus begins any SELECT query at the Iceberg catalog, which provides the location of the current metadata file for the table the query engine seeks to read.
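As a minimal sketch of this flow, the PySpark session below registers an Iceberg catalog named demo (the name, the Hadoop-style catalog type and the warehouse path are all illustrative assumptions) and then issues a SELECT that resolves the hypothetical table demo.db.events through that catalog:

```python
from pyspark.sql import SparkSession

# Register an Iceberg catalog named "demo". The Hadoop catalog type and
# local warehouse path are illustrative; Hive Metastore, AWS Glue or a
# REST catalog could be configured here instead.
spark = (
    SparkSession.builder
    .appName("iceberg-catalog-demo")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.demo.type", "hadoop")
    .config("spark.sql.catalog.demo.warehouse", "file:///tmp/iceberg-warehouse")
    .getOrCreate()
)

# The SELECT resolves "demo.db.events" through the catalog, which hands
# the engine the location of the table's current metadata file.
spark.sql("SELECT * FROM demo.db.events").show()
```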
The Iceberg metadata layer comprises—in descending order—metadata files, manifest lists and manifest files.
Metadata files store a table’s metadata, including the table’s schema, partition information, its current snapshot and snapshots of previous states. Having been pointed to the current metadata file from the table’s entry in the Iceberg catalog, the query engine uses the [current-snapshot-id] value to find its entry in the [snapshots] array. From there, it can locate and open the table’s manifest list.
The manifest list is simply a list of manifest files, along with important information about each manifest file in it, such as its location, the snapshot it is associated with and the partitions it belongs to. Certain optimizations and filtering functions are available at this stage.
Manifest files track data files and their associated details, metadata and statistics. This drives one of the fundamental advantages of the Iceberg table format over the Hive table format: its ability to track data at the file level. At this stage, the [file-path] values for each [data-file] object can be used to find and open that file.
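This hierarchy is not opaque: Iceberg exposes it through queryable metadata tables, so each level described above can be inspected directly. A short sketch, reusing the hypothetical demo.db.events table from earlier:

```python
from pyspark.sql import SparkSession

# Reuse the session configured with the "demo" Iceberg catalog above.
spark = SparkSession.builder.getOrCreate()

# snapshots: one row per table snapshot (the [snapshots] array).
spark.sql("SELECT snapshot_id, committed_at FROM demo.db.events.snapshots").show()

# manifests: the manifest files referenced by the current manifest list.
spark.sql("SELECT path, added_data_files_count FROM demo.db.events.manifests").show()

# files: the data files tracked by the manifests, with per-file statistics.
spark.sql("SELECT file_path, record_count FROM demo.db.events.files").show()
```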
The data layer, as its name suggests, resides below the metadata layer and contains the data files themselves.
Apache Iceberg offers a number of helpful features to improve and simplify data management.
Iceberg handles all details of partitioning and querying under the hood. Iceberg’s hidden partitioning saves users the work of supplying partition layout information when querying tables. Users do not need to maintain partition columns themselves, or even understand the physical table layout, to get accurate query results.
This doesn’t just make Iceberg partitioning very user-friendly—it also allows partition layouts to be changed over time without breaking pre-written queries. When partition specs evolve, the data in the table (and its metadata) are unaffected. Only new data, written to the table after evolution, is partitioned with the new spec, and metadata for this new data is kept separately.
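As a sketch of both behaviors in Spark SQL (the table and column names are hypothetical), a partition transform is declared once at table creation, queries filter on the raw column, and the partition spec can later evolve in place:

```python
from pyspark.sql import SparkSession

# Reuse the session configured with the "demo" Iceberg catalog above.
spark = SparkSession.builder.getOrCreate()

# Hidden partitioning: partition by day of event_ts via a transform.
# Readers and writers never reference the derived partition column.
spark.sql("""
    CREATE TABLE demo.db.events (
        id INT,
        event_ts TIMESTAMP,
        payload STRING
    ) USING iceberg
    PARTITIONED BY (days(event_ts))
""")

# Queries filter on the source column; Iceberg prunes partitions itself.
spark.sql("""
    SELECT count(*) FROM demo.db.events
    WHERE event_ts >= TIMESTAMP '2024-01-01 00:00:00'
""").show()

# Partition evolution: switch future writes to hourly granularity.
# Existing data keeps its old layout, and the query above still works.
spark.sql("ALTER TABLE demo.db.events DROP PARTITION FIELD days(event_ts)")
spark.sql("ALTER TABLE demo.db.events ADD PARTITION FIELD hours(event_ts)")
```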
Iceberg provides native support for schema evolution, allowing users to modify table schemas without the need for complex data migration and greatly streamlining adaptation to evolving data structures.
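A brief sketch in Spark SQL, continuing with the hypothetical demo.db.events table; each statement is a metadata-only change, so no existing data files are rewritten:

```python
from pyspark.sql import SparkSession

# Reuse the session configured with the "demo" Iceberg catalog above.
spark = SparkSession.builder.getOrCreate()

# Add, rename and widen columns without rewriting existing data files.
spark.sql("ALTER TABLE demo.db.events ADD COLUMNS (region STRING)")
spark.sql("ALTER TABLE demo.db.events RENAME COLUMN payload TO body")
spark.sql("ALTER TABLE demo.db.events ALTER COLUMN id TYPE BIGINT")  # safe widening: INT to BIGINT
```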
Iceberg allows users to time travel back through snapshots of Iceberg data at different points in time. This is valuable for a variety of use cases, including audits, debugging and compliance checks.
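For example (a sketch; the snapshot ID and timestamp below are placeholders), recent Spark versions can read an Iceberg table as of an earlier snapshot or point in time:

```python
from pyspark.sql import SparkSession

# Reuse the session configured with the "demo" Iceberg catalog above.
spark = SparkSession.builder.getOrCreate()

# List the available snapshots for the table.
spark.sql("SELECT snapshot_id, committed_at FROM demo.db.events.snapshots").show()

# Read the table as it existed at a point in time (placeholder timestamp)...
spark.sql("SELECT * FROM demo.db.events TIMESTAMP AS OF '2024-01-01 00:00:00'").show()

# ...or as of a specific snapshot ID (placeholder value).
spark.sql("SELECT * FROM demo.db.events VERSION AS OF 4358109269898721381").show()
```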
Iceberg offers a number of capabilities that help optimize query performance, like compaction options to merge smaller files into larger ones to reduce metadata overhead, and Bloom filters to reduce unnecessary data reads during query execution.
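Both are exposed through SQL in Spark: compaction runs as a stored procedure, and Parquet Bloom filters are enabled per column through table properties (a sketch; the names remain the hypothetical ones used above):

```python
from pyspark.sql import SparkSession

# Reuse the session configured with the "demo" Iceberg catalog above.
spark = SparkSession.builder.getOrCreate()

# Compaction: merge small data files into larger ones to reduce the
# per-file metadata overhead paid at query planning time.
spark.sql("CALL demo.system.rewrite_data_files(table => 'db.events')").show()

# Manifest files can be compacted the same way.
spark.sql("CALL demo.system.rewrite_manifests(table => 'db.events')").show()

# Enable a Parquet Bloom filter on the id column for newly written files.
spark.sql("""
    ALTER TABLE demo.db.events SET TBLPROPERTIES (
        'write.parquet.bloom-filter-enabled.column.id' = 'true'
    )
""")
```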
1 "Apache Iceberg: The Hub of an Emerging Data Service Ecosystem?", Datanami, 8 February 2021
2 "Clash of ACID Data Lakes: Comparing Data Lakehouse Formats", Data Insights, 9 August 2022