Why Your Data Platform Bill Is Growing Faster Than Your Data
Data volumes go up 30% per year. Bills go up 70%. At some point the CFO stops accepting "we have more data" as an explanation and starts asking real questions.
We have had this conversation many times. And when we actually trace the costs, the answer is almost never the data itself. The answer is decisions made three or four years ago that seemed fine at the time and are now quietly compounding.
Below are the four things we see driving the gap, based on working with data teams at companies of different sizes.
1. The same data stored three times
Take a typical mid-size company. They have raw files in S3. A cleaned, modeled copy in Snowflake or Redshift. And then a third copy somewhere the ML team created — a Parquet export, a feature store, a Postgres database someone set up and now nobody wants to delete.
Storage is the obvious cost here, but usually not the biggest one. The pipelines that keep these copies in sync are more expensive. CDC jobs, scheduled exports, reverse ETL — each one needs someone to maintain it, monitor it, and wake up when it breaks. That is real engineering time.
The other cost is harder to see. When copies exist, they drift apart. Marketing reports one revenue number, finance reports another. A junior analyst spends a week figuring out which pipeline fell behind and when. This kind of thing erodes trust in the data, and eroded trust means more manual checking, more meetings, more headcount doing work that should not need to exist.
2. One warehouse doing everything
Most companies end up routing everything through one warehouse — Snowflake, BigQuery, Redshift, Databricks SQL. Ad-hoc queries, heavy ETL jobs, dashboard refreshes, ML feature generation, all of it.
These workloads do not have the same needs. Ad-hoc queries want cheap compute that starts fast and stops fast. Heavy ETL wants distributed throughput. ML pipelines often just want direct file access from Python. A warehouse is built for interactive SQL at scale. That is useful for some of these, not for others.
The practical result is that you pay warehouse prices for things that do not need a warehouse. A script that reads 10GB and writes a transformed output should be cheap. Run it through Snowflake and it is not cheap, because the data lives in Snowflake and there is no cheaper way to get to it. As the company grows and more teams build more things, more workloads pile into the warehouse, and the bill grows with use cases rather than with actual data.
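To make that concrete, here is what that kind of job can look like against open files, sketched with DuckDB. The bucket and column names are made up for illustration, and it assumes S3 credentials are already configured in the environment.

```python
# A 10GB-scale transform that needs no warehouse: read Parquet straight
# from S3, aggregate, write the result back. Bucket and column names are
# hypothetical; S3 credentials are assumed to be configured already
# (e.g. via environment variables or a DuckDB secret).
import duckdb

con = duckdb.connect()
con.sql("INSTALL httpfs")  # enables s3:// paths
con.sql("LOAD httpfs")

con.sql("""
    COPY (
        SELECT customer_id,
               SUM(amount) AS total_spend
        FROM read_parquet('s3://example-raw/orders/*.parquet')
        GROUP BY customer_id
    )
    TO 's3://example-derived/customer_spend.parquet' (FORMAT parquet)
""")
```

The compute here is whichever machine runs the script, priced like a machine, not like a warehouse. That is the point.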
3. You are paying for a bundle
Raw object storage on S3 is around $0.02 per GB per month, so 10 TB of Parquet runs about $200 a month. Query engines like Trino, DuckDB, and Spark are open source; managed options like Athena or Spark on EMR charge pay-as-you-go rates close to the underlying infrastructure. The cost of actually storing and querying data has dropped a lot over the last decade.
Warehouse vendors charge for the bundle — storage, compute, and a proprietary format that ties them together. The markup over raw infrastructure is significant, often five to ten times for equivalent work. This is not a complaint about those vendors. The integration they provide is real, and for many workloads the convenience is worth it.
The problem is that you pay for the bundle even when you only need part of it. Most companies use maybe a third of what they are paying for, but because the data sits in the vendor's proprietary format, they cannot unbundle.
4. The glue code nobody counts
Look at your data team's commits from the last quarter. How many of them are pipelines that exist only to move data from one system to another, because those systems cannot read each other's formats?
This cost does not appear on any invoice. It shows up in salaries. Every engineer maintaining a sync pipeline is an engineer not building the things your company actually needs. Glue code also accumulates — once a pipeline exists, deleting it feels risky, so teams keep adding to the pile.
What actually helps
All four of these trace back to the same thing: the data lives inside a vendor's proprietary format, so you have to copy it to use it elsewhere, pay that vendor's prices to query it, and write code when you want to move it.
Switching warehouses does not fix this. You get a different vendor, but the same structure.
What does fix it is putting the data in an open format that multiple engines can read directly. Apache Iceberg is the most mature option right now. The concept is simple enough — your data sits in your own S3 buckets in open Parquet files, with an Iceberg metadata layer on top. Athena, Spark, DuckDB, Trino, Snowflake, Redshift — all of them can query the same files. No copying, no syncing.
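As a small illustration of the idea, here is DuckDB querying an Iceberg table in place. The path is hypothetical; in practice you would resolve the table location through your catalog.

```python
# A minimal sketch: query an Iceberg table where it lives, no copy made.
# The S3 path is a placeholder. Scanning the table root assumes a
# version-hint file is present in the table's metadata directory;
# otherwise point iceberg_scan at a specific metadata JSON file.
import duckdb

con = duckdb.connect()
con.sql("INSTALL iceberg")
con.sql("LOAD iceberg")
con.sql("INSTALL httpfs")
con.sql("LOAD httpfs")

con.sql("""
    SELECT count(*) AS orders
    FROM iceberg_scan('s3://example-lakehouse/warehouse/sales/orders')
""").show()
```

Athena, Spark, or Trino pointed at the same table read the same metadata and the same Parquet files, which is what makes the copies unnecessary.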
When this works, the costs shift in a practical way:
One copy of the data. Queryable from everywhere. The sync pipelines that were keeping copies aligned go away, and so does the engineering time they consumed.
Right tool for each workload. DuckDB on a laptop for quick exploration. Athena for serverless SQL. Spark for heavy ETL. A warehouse for the workloads where a warehouse is genuinely the right choice. Each workload runs on appropriate infrastructure (a sketch of the direct-from-Python path follows this list).
The markup becomes a choice. You can still use Snowflake or Databricks for workloads where their capabilities justify the cost. The difference is you are choosing to pay for what you actually use, not paying because your data is locked there.
Less glue. When engines read the same files, most of the sync pipelines have no reason to exist.
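Here is the sketch referenced above: the direct file access from Python that an ML workload might use, written with PyIceberg against the same hypothetical catalog and table as before.

```python
# A sketch of the direct-from-Python path for ML work. The catalog and
# table names are hypothetical; load_catalog() picks up connection
# details from ~/.pyiceberg.yaml or environment variables.
from pyiceberg.catalog import load_catalog

catalog = load_catalog("lakehouse")
orders = catalog.load_table("sales.orders")

# The filter and column projection are pushed down to the file scan,
# so only the matching Parquet files and columns are read from S3.
features = orders.scan(
    row_filter="amount > 10",
    selected_fields=("order_id", "amount"),
).to_pandas()
```

No export job, no feature-store copy: the scan reads the table's current snapshot directly.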
The practical reality
The setup itself is not complicated — S3, Iceberg, a catalog. That is the foundation. Everything else connects to it.
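As a rough sketch of that foundation, assuming a REST catalog at a made-up endpoint and a placeholder bucket, creating a table with PyIceberg might look like this. Every engine pointed at the same catalog then sees the table.

```python
# A minimal-foundation sketch: S3 for the files, Iceberg as the table
# format, a catalog to find tables. Endpoint and bucket are placeholders.
from datetime import datetime

import pyarrow as pa
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "lakehouse",
    **{
        "type": "rest",
        "uri": "https://catalog.example.internal",        # hypothetical
        "warehouse": "s3://example-lakehouse/warehouse",  # hypothetical
    },
)

# Define the table once; any Iceberg-aware engine can query it after this.
catalog.create_namespace("sales")
schema = pa.schema([
    ("order_id", pa.int64()),
    ("amount", pa.float64()),
    ("created_at", pa.timestamp("us")),
])
orders = catalog.create_table("sales.orders", schema=schema)

# Append rows; PyIceberg writes Parquet files plus a new metadata snapshot.
orders.append(pa.table({
    "order_id": pa.array([1, 2], pa.int64()),
    "amount": pa.array([19.99, 5.00], pa.float64()),
    "created_at": pa.array([datetime(2024, 1, 1)] * 2, pa.timestamp("us")),
}))
```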
Teams we have worked with have cut data platform spending by 40 to 60% by moving in this direction. The savings are real and spread across all four areas above.
Getting there takes some care though. Iceberg has depth — catalog choices, table maintenance, engine compatibility, governance. Moving quickly without understanding these usually results in something cheaper but harder to operate. That is not a good trade.
This is the first post in a series. The next ones will cover what Iceberg actually is in concrete terms, how to choose a catalog, what migration from a warehouse looks like step by step, and the parts that vendor documentation tends to skip. The goal is that by the end you have enough to make a real decision, not just enough to ask for a vendor demo.
Next week: what Iceberg actually is.

Sadik Bakiu
CEO at DataMax