
Comparison of Open Table Formats: Apache Iceberg, Delta Lake and Apache Hudi

Writer: Ergin Gorishti

The concept of the data lake as a universal repository for structured and unstructured data in all kinds of formats introduced significant flexibility, but it also brought challenges. Analytical systems must be able to rely on the structure of the data they process; irregularities often lead to malfunctions, making such systems very susceptible to errors. And because data lakes are designed to store and process large amounts of data, the question of how to improve read and write efficiency arose quickly, a problem that the processing engines alone cannot solve.


Open File Formats

Various formats grew out of this need, each trying to solve the problems of its time. They are now known as Open File Formats and include Apache Avro, Apache Parquet and ORC. One of them, Apache Parquet, became particularly popular in the analytical field: it enables the serialization of structured data within defined schemas, which brings clear advantages but addresses only a fraction of the issues.
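To make this concrete, the sketch below writes and reads a small Parquet dataset with an explicit schema. It is a minimal example, assuming a local PySpark installation; the column names and path are illustrative.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("parquet-demo").getOrCreate()

# The schema is defined explicitly and travels with the Parquet files,
# so readers do not have to guess column names or types.
schema = StructType([
    StructField("trip_id", StringType(), nullable=False),
    StructField("fare", DoubleType(), nullable=True),
])

df = spark.createDataFrame([("t-001", 12.5), ("t-002", 8.0)], schema)
df.write.mode("overwrite").parquet("/tmp/trips_parquet")

spark.read.parquet("/tmp/trips_parquet").printSchema()
```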


Open Table Formats

Open Table Formats represent the next evolution of this concept. They build on formats like Apache Parquet and wrap them with a metadata layer. This metadata describes the actual data files in a way that allows processing engines such as Apache Spark to read from and write to storage very efficiently. Read and write efficiency is optimized by storing partition- and column-level statistics that help reduce the amount of data read from storage. In addition, Open Table Formats provide storage systems with capabilities such as ACID transactions, time travel, version rollback and schema evolution.
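A quick way to see this metadata layer is to write a small table and look at what ends up next to the data files. The sketch below uses Delta Lake purely as an example and assumes the delta-spark pip package is installed; the path is illustrative.

```python
import os
from pyspark.sql import SparkSession
from delta import configure_spark_with_delta_pip

# configure_spark_with_delta_pip pulls in the matching Delta Lake jars.
builder = (
    SparkSession.builder.appName("table-format-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

spark.range(5).write.format("delta").mode("overwrite").save("/tmp/demo_delta")

# Next to the Parquet data files, the format keeps a transaction log that
# records every commit and enables ACID guarantees, time travel and rollback.
print(os.listdir("/tmp/demo_delta/_delta_log"))
```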


The Holy Trinity

Nowadays, nobody would build an analytical system (whether data lake and/or lakehouse) without using an open table format. For anyone still facing this decision, the question is which open table formats are available and which one to use. There are currently three well-known open table formats: Apache Iceberg, Delta Lake and Apache Hudi.


Apache Hudi

Despite getting the least attention of the three, Hudi was the first of the modern open table formats to appear on the market. It was developed by Uber and released under the Apache License 2.0 in 2018. The initial idea at Uber was to move beyond their snapshot-based data ingestion by inventing a table format that supports update and delete operations [Uber, 2018]. With Hudi, users at Uber were able to simply pass their last checkpoint timestamp and retrieve all records that had been updated since then, regardless of whether these updates are new records added to recent date partitions or updates to older data (e.g., a new trip happening today versus an updated trip from six months ago), without running an expensive query that scans the entire source table.
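The incremental-pull pattern described above looks roughly like the sketch below. It assumes the matching hudi-spark bundle is on the classpath and that a Hudi table already exists at the illustrative path; the option names follow the Hudi Spark datasource.

```python
from pyspark.sql import SparkSession

# Kryo serialization is the configuration Hudi recommends for Spark jobs.
spark = (
    SparkSession.builder.appName("hudi-incremental-demo")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

incremental = (
    spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "incremental")
    # Last checkpoint (commit time) the consumer has already processed.
    .option("hoodie.datasource.read.begin.instanttime", "20240101000000")
    .load("/tmp/trips_hudi")
)

# Only records inserted or updated after the checkpoint are returned, whether
# they land in today's partition or in one from six months ago.
incremental.show()
```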


Apache Iceberg

Iceberg is a high-performance, open table format for managing large-scale datasets, initially developed by Netflix and released under the Apache License 2.0 in 2018. It was created to address challenges in managing petabyte-scale data lakes, particularly around data consistency, schema evolution, and efficient querying. Iceberg introduces features like hidden partitioning, atomic operations, and time travel queries, enabling efficient reads and writes while maintaining the reliability of large datasets. By abstracting the complexities of table management, Iceberg allows users to interact with their data seamlessly, making it a popular choice for modern data lake architectures.
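As a rough illustration of hidden partitioning and time travel, the sketch below configures a local Hadoop catalog named demo and creates a table partitioned by a transform of a timestamp column. The catalog, table and column names are illustrative, and the matching iceberg-spark-runtime jar is assumed to be on the classpath.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("iceberg-demo")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.demo.type", "hadoop")
    .config("spark.sql.catalog.demo.warehouse", "/tmp/iceberg_warehouse")
    .getOrCreate()
)

spark.sql("CREATE NAMESPACE IF NOT EXISTS demo.db")
spark.sql("""
    CREATE TABLE IF NOT EXISTS demo.db.events (
        event_id BIGINT,
        ts TIMESTAMP,
        payload STRING
    )
    USING iceberg
    PARTITIONED BY (days(ts))  -- hidden partitioning derived from ts
""")

# Readers only filter on ts; Iceberg prunes partitions from metadata, so no
# physical partition column has to appear in the query.
spark.sql("SELECT * FROM demo.db.events WHERE ts >= TIMESTAMP '2024-01-01'").show()

# Time travel: read the table as it was at an earlier point in time.
spark.sql("SELECT * FROM demo.db.events TIMESTAMP AS OF '2024-06-01 00:00:00'").show()
```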


Delta Lake

Delta Lake is an open-source storage layer that brings ACID (Atomicity, Consistency, Isolation, Durability) transactions, scalable metadata handling, and data versioning to data lakes. Developed by Databricks and released under the Apache License 2.0 in 2019 (when it was donated to the Linux Foundation), Delta Lake was designed to address common challenges in data lakes, such as data consistency and reliability in concurrent environments. It enables features like schema enforcement, time travel, and efficient upserts and deletes, making it easier to build robust pipelines and perform advanced analytics. By combining the flexibility of a data lake with the reliability of a data warehouse, Delta Lake empowers organizations to build scalable and trustworthy data architectures.
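An upsert against a Delta table typically goes through the MERGE API, as in the hedged sketch below. It assumes the Spark session from the earlier Delta example and an existing Delta table at an illustrative path with illustrative column names.

```python
from delta.tables import DeltaTable

# The target table is assumed to exist with columns trip_id and fare.
target = DeltaTable.forPath(spark, "/tmp/demo_delta_trips")
updates = spark.createDataFrame([("t-001", 15.0)], ["trip_id", "fare"])

(target.alias("t")
 .merge(updates.alias("u"), "t.trip_id = u.trip_id")
 .whenMatchedUpdateAll()      # update rows that already exist
 .whenNotMatchedInsertAll()   # insert rows that are new
 .execute())

# Deletes are equally direct; the transaction log keeps the operation atomic.
target.delete("fare < 0")
```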


Comparison

Key features shared by all three formats:

  • ACID Transactions: Ensures data consistency during concurrent operations.

  • Time Travel: Allows querying data at different points in time.

  • Schema Evolution: The ability to manage and adapt to changes in a dataset's schema.

  • Version Rollback: Reverting a table to a previous version of its data and schema.
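As a quick illustration of the last two features, the sketch below shows time travel and version rollback in Delta Lake and Apache Iceberg. It assumes the sessions, tables and paths from the earlier examples; the version numbers and snapshot id are illustrative.

```python
# Delta Lake: query an older version of the table, then restore it.
spark.read.format("delta").option("versionAsOf", 3).load("/tmp/demo_delta_trips")
spark.sql("RESTORE TABLE delta.`/tmp/demo_delta_trips` TO VERSION AS OF 3")

# Apache Iceberg: query an older snapshot, then roll the table back to it.
spark.sql("SELECT * FROM demo.db.events VERSION AS OF 4183977112348190548")
spark.sql("CALL demo.system.rollback_to_snapshot('db.events', 4183977112348190548)")
```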


The primary differences between these formats lie in their intended use cases and the ecosystems they support.

  • Apache Hudi: Designed for near real-time data ingestion and updates, Hudi excels in scenarios requiring frequent updates and Change Data Capture (CDC). It integrates tightly with Apache Spark, Hive, and Presto/Trino, making it a strong choice for dynamic, update-heavy environments.

  • Apache Iceberg: Optimized for large-scale analytics and table management, Iceberg is tailored to handle big data workloads. It boasts robust integration with engines like Apache Spark, Presto, Trino, and Flink. Originally designed as a standalone table format, it has recently gained significant traction in cloud ecosystems, particularly on platforms like AWS.

  • Delta Lake: Widely favored in the ML/AI community, Delta Lake is known for its seamless integration with Spark and Databricks. It enhances data lakes with features for reliability and performance, making it well-suited for machine learning and streaming applications.

The following table provides a detailed comparison of the features across all three formats:

| Feature | Apache Hudi | Apache Iceberg | Delta Lake |
|---|---|---|---|
| Write Modes | MOR (merge-on-read): write updates to delta log files and merge at read time; COW (copy-on-write): rewrite Parquet files during updates | Always writes new data files for updates | Always writes new data files for updates |
| ACID Transactions | Yes | Yes | Yes |
| Schema Evolution | Yes (with some limitations, see docs) | Yes | Yes |
| Time Travel | Yes | Yes | Yes |
| Partitioning Approach | Explicit folder-based partitioning | Hidden partitioning with metadata-driven management | Explicit folder-based partitioning |
| Partition Column Handling | Partition columns are tied to the directory structure; users need to include partition filters in queries for efficiency | Abstracted from users; columns are logical and don't need to be part of physical paths | Same as Hudi: folder-based partitioning with explicit column management |
| Partition Evolution | Manual | Automatic | Manual |
| Version Rollback | Yes | Yes | Yes |
| Compaction | Built-in for MOR | Supported (automatic strategies) | Supported with Optimized Write & Auto Compaction |
| Incremental Queries | Yes (built-in support for upserts) | Yes (through snapshots) | Yes (through Change Data Feed) |
| Streaming Support | Strong, designed for streaming | Supported (via Spark, Flink) but still emerging, not as optimized as the other two | Supported (Spark-native) |
| Ecosystem Integration | Spark-centric, expanding | Multi-engine (Spark, Flink, Presto) | Spark, Flink, Trino, etc. |
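Partition evolution is where the difference is easiest to see: with Hudi and Delta Lake a new partitioning scheme generally means rewriting the data into a new layout, whereas Iceberg lets you change the partition spec in place. The hedged sketch below reuses the illustrative demo catalog and events table from earlier; only data written after the change follows the new spec.

```python
# Add a new partition field and drop the old one; existing data files are not
# rewritten, and the metadata keeps track of both layouts for query planning.
spark.sql("ALTER TABLE demo.db.events ADD PARTITION FIELD bucket(16, event_id)")
spark.sql("ALTER TABLE demo.db.events DROP PARTITION FIELD days(ts)")
```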


Conclusion

In conclusion, Apache Iceberg, Delta Lake, and Apache Hudi each provide robust solutions for managing data in modern data lakes, but they cater to distinct use cases and priorities. There is no one-size-fits-all solution—the best choice depends on your specific requirements.


  • Apache Iceberg: Excels in advanced query optimization, partition management, and scalability, making it an excellent choice for large-scale analytics. Its growing support within AWS further enhances its appeal for organizations leveraging AWS cloud infrastructure.

  • Delta Lake: With deep integration into Spark and Databricks, Delta Lake is particularly well-suited for machine learning and data engineering workflows, where reliability and performance are critical.

  • Apache Hudi: Shines in real-time use cases, efficiently handling updates, deletes, and change data capture (CDC), making it ideal for dynamic and streaming data scenarios.


If you're unsure which format best suits your needs or want to experiment with multiple options, Apache XTable can help by enabling interoperability between formats without rewriting dataset files. This flexibility allows you to test and transition between table formats with ease.

Choosing the right format requires a clear understanding of each option's features and strengths. For example, if your use case involves constant partition evolution, Iceberg might be the best fit. By aligning the strengths of each solution with your organization’s goals, you can build an optimized and efficient data lake ecosystem tailored to your needs.


 
