Comparing Open Table Formats: Apache Iceberg, Delta Lake, and Apache Hudi
The concept of the data lake as a universal repository for structured and unstructured data introduced significant flexibility but also brought challenges. Analytical systems must be able to rely on the structure of the data they process. Irregularities often lead to malfunctions, and the sheer scale of data lakes quickly surfaced a need to improve read and write efficiency — a problem that processing engines alone cannot solve.
Open File Formats
Several formats emerged to address this need: Apache Avro, Apache Parquet, and ORC, now collectively known as Open File Formats. Apache Parquet became particularly popular for analytics, as it serializes structured data with well-defined schemas. But these formats addressed only a fraction of the issues.
Open Table Formats
Open Table Formats represent the next evolution. They build on formats like Parquet and wrap them with a metadata layer that describes the actual data files in a way that makes it very efficient for processing engines like Apache Spark to read from and write into storage. Read and write efficiency is optimized by storing partition- and column-level statistics that reduce the amount of data scanned. Beyond performance, Open Table Formats also bring storage-level capabilities: ACID Transactions, Time Travel, Version Rollback, and Schema Evolution.
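The statistics-based pruning described above can be illustrated with a small, format-agnostic sketch. The `DataFile` structure and field names here are illustrative stand-ins, not any format's actual metadata schema:

```python
from dataclasses import dataclass

@dataclass
class DataFile:
    """Illustrative stand-in for one entry in a table format's metadata layer."""
    path: str
    min_ts: int  # column-level minimum of an event-time column
    max_ts: int  # column-level maximum of the same column

def prune(files, lower, upper):
    """Keep only files whose [min, max] range can overlap the query predicate.

    Files that provably contain no matching rows are skipped entirely,
    which is what makes reads over large tables efficient.
    """
    return [f.path for f in files if f.max_ts >= lower and f.min_ts <= upper]

files = [
    DataFile("part-000.parquet", 0, 99),
    DataFile("part-001.parquet", 100, 199),
    DataFile("part-002.parquet", 200, 299),
]

# A query for ts BETWEEN 120 AND 180 only needs one of the three files.
print(prune(files, 120, 180))  # ['part-001.parquet']
```

Real table formats store exactly this kind of min/max information per file (and per column), letting the engine answer "which files could possibly match?" from metadata alone, before touching storage.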
The Holy Trinity
Hardly any modern analytical system is built without an open table format today. Three are well-established: Apache Iceberg, Delta Lake, and Apache Hudi.
Apache Hudi
The first of the modern open table formats, developed by Uber and released under the Apache License 2.0 in 2018. The initial goal was to solve snapshot-based data ingestion by supporting updates and deletes. With Hudi, engineers at Uber could pass a last-checkpoint timestamp and retrieve all records updated since — whether recent or from months ago — without scanning the entire source table.
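The checkpoint-based incremental pull can be sketched in a few lines. This is a conceptual illustration with a hypothetical record layout, not Hudi's actual API (Hudi exposes this pattern through its incremental query type on commit timestamps):

```python
# Each record carries the commit timestamp of the write that produced it.
# The layout is hypothetical; Hudi tracks this in its commit timeline.
records = [
    {"id": 1, "commit_time": "20240101090000", "fare": 12.5},
    {"id": 2, "commit_time": "20240301120000", "fare": 30.0},
    {"id": 1, "commit_time": "20240401150000", "fare": 14.0},  # later update to id 1
]

def incremental_pull(records, last_checkpoint):
    """Return every record committed after the given checkpoint timestamp.

    The consumer only processes changes since its last run, instead of
    rescanning the entire source table.
    """
    return [r for r in records if r["commit_time"] > last_checkpoint]

changed = incremental_pull(records, "20240201000000")
print([r["id"] for r in changed])  # [2, 1]
```

The key point is that the source table, not the consumer, tracks when each record last changed, so "give me everything since my checkpoint" becomes a cheap metadata-backed filter.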
Apache Iceberg
A high-performance open table format for managing large-scale datasets, initially developed by Netflix and released in 2018. Created to address petabyte-scale challenges around data consistency, schema evolution, and efficient querying, Iceberg introduces hidden partitioning, atomic operations, and time travel queries. Its growing adoption within AWS cloud infrastructure has made it a popular choice for modern lakehouse architectures.
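Hidden partitioning means the engine derives partition values from a column via a declared transform, so queries filter on the column itself rather than on a physical directory path. A minimal sketch in the spirit of Iceberg's `days()` transform (the function here is a simplified reimplementation, not Iceberg's code):

```python
from datetime import datetime, timezone

def days_transform(ts: datetime) -> int:
    """Days since the Unix epoch, in the spirit of Iceberg's days() transform.

    The table tracks this derived value per file; users never see it and
    simply write predicates on the timestamp column.
    """
    epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    return (ts - epoch).days

ts = datetime(2024, 5, 1, 13, 45, tzinfo=timezone.utc)
print(days_transform(ts))  # 19844
```

Because the transform is recorded in table metadata, a filter like `WHERE ts >= '2024-05-01'` is automatically converted into partition pruning; there is no risk of a full scan just because the user forgot to filter on a separate partition column.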
Delta Lake
An open-source storage layer that brings ACID transactions, scalable metadata handling, and data versioning to data lakes. Developed by Databricks, donated to the Linux Foundation in 2019, and released under the Apache License 2.0. Delta Lake enables schema enforcement, time travel, and efficient upserts and deletes — combining the flexibility of a data lake with the reliability of a data warehouse. It has become the go-to choice in the Databricks and ML/AI ecosystem.
Comparison
Key features shared by all three
- ACID Transactions — ensures data consistency during concurrent operations.
- Time Travel — query data at any previous point in time.
- Schema Evolution — manage and adapt to changes in a dataset's schema without rewriting data.
- Version Rollback — revert a table to a previous version of its data and schema.
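Time Travel and Version Rollback both fall out of the same mechanism: an append-only log of snapshots, each pointing at the set of data files valid at that version. A minimal sketch (class and method names are illustrative, not any format's real API):

```python
class Table:
    """Toy table with snapshot-based versioning."""

    def __init__(self):
        self.snapshots = []  # append-only version history

    def commit(self, files):
        """Each write commits a new snapshot; old ones stay readable."""
        self.snapshots.append(list(files))

    def read(self, version=None):
        """Read the latest version, or time-travel to an earlier one."""
        version = len(self.snapshots) - 1 if version is None else version
        return self.snapshots[version]

    def rollback(self, version):
        """Rollback is just a new commit that points at an old file set."""
        self.commit(self.snapshots[version])

t = Table()
t.commit(["a.parquet"])
t.commit(["a.parquet", "b.parquet"])
t.rollback(0)
print(t.read())   # ['a.parquet']
print(t.read(1))  # ['a.parquet', 'b.parquet']
```

Note that rollback does not delete history: it appends a new version referencing the old state, so the rolled-back-from version remains queryable via time travel.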
Where they differ
- Apache Hudi — designed for near real-time data ingestion and updates. Excels in Change Data Capture (CDC) and update-heavy environments. Tight integration with Apache Spark, Hive, and Presto/Trino.
- Apache Iceberg — optimized for large-scale analytics. Handles big data workloads with robust multi-engine support (Spark, Presto, Trino, Flink). Growing traction on AWS.
- Delta Lake — favored in the ML/AI community for its seamless Databricks integration, reliability, and performance in machine learning and streaming applications.
Feature matrix
| Feature | Apache Hudi | Apache Iceberg | Delta Lake |
|---|---|---|---|
| Write Modes | MOR: write updates to delta log files and merge at read time; COW: rewrite affected Parquet files on update | Copy-on-write by default; format v2 adds merge-on-read via delete files | Copy-on-write by default; rewrites affected data files on update |
| ACID Transactions | Yes | Yes | Yes |
| Schema Evolution | Yes (with some limitations) | Yes | Yes |
| Time Travel | Yes | Yes | Yes |
| Partitioning Approach | Explicit folder-based partitioning | Hidden partitions with metadata-driven management | Explicit folder-based partitioning |
| Partition Column Handling | Tied to directory structure; users need to include partition filters in queries for efficiency | Abstracted from users; columns are logical and don't need to be part of physical paths | Same as Hudi: folder-based partitioning with explicit column management |
| Partition Evolution | Manual | Supported (metadata-only, no data rewrite) | Manual |
| Version Rollback | Yes | Yes | Yes |
| Compaction | Built-in for MOR | Supported (automatic strategies) | Supported with Optimized Write & Auto Compaction |
| Incremental Queries | Yes (built-in support for upserts) | Yes (through snapshots) | Yes (through Change Data Feed) |
| Streaming Support | Strong, designed for streaming | Supported (via Spark, Flink) but emerging, not as optimized as the other two | Supported (Spark-native) |
| Ecosystem Integration | Spark-centric, expanding to Hive, Presto/Trino | Multi-engine (Spark, Flink, Presto, Trino) | Spark-native, plus Flink, Trino, and others |
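The write-mode distinction in the first row of the matrix is the core trade-off between the formats: copy-on-write pays the merge cost at write time, merge-on-read defers it to the reader. A pure-Python illustration (this models the trade-off only, not any format's real storage layout):

```python
def cow_update(base, updates):
    """COW: produce a fully rewritten file at write time; reads stay cheap."""
    return {**base, **updates}

def mor_write(delta_log, updates):
    """MOR: writes are cheap, just an append to the delta log."""
    delta_log.append(updates)

def mor_read(base, delta_log):
    """MOR: the merge cost is paid by the reader, at query time."""
    merged = dict(base)
    for updates in delta_log:
        merged.update(updates)  # later entries win, like newer commits
    return merged

base = {1: "alice", 2: "bob"}
log = []
mor_write(log, {2: "bob-updated"})

# Both strategies converge on the same logical table state.
assert mor_read(base, log) == cow_update(base, {2: "bob-updated"})
```

This is why MOR tables (Hudi's specialty) suit update-heavy ingestion, while COW favors read-heavy analytics; compaction, listed in the matrix above, is the background process that folds the delta log back into base files.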
Conclusion
Apache Iceberg, Delta Lake, and Apache Hudi each provide robust solutions for modern data lakes, but they cater to distinct use cases. There is no one-size-fits-all answer — the best choice depends on your specific requirements:
- Apache Iceberg — best for large-scale analytics, advanced query optimization, and partition management. Strong choice for AWS-centric infrastructure.
- Delta Lake — best for machine learning and data engineering workflows on Spark and Databricks, where reliability and performance are critical.
- Apache Hudi — best for real-time use cases: frequent updates, deletes, and Change Data Capture (CDC) in dynamic streaming scenarios.
If you're unsure which format best suits your needs, Apache XTable enables interoperability between formats without rewriting dataset files — allowing you to test and transition between them with ease.
For any questions, feel free to reach out to us at hello@datamax.ai.

Ergin Gorishti
Data Engineer at DataMax