S3 + Parquet + Iceberg + Trino: A Poor Man’s Market Data Platform

Before I start talking about how effective this architecture can be at reducing infrastructure costs, I should first make the old point that there is really no free lunch. Compared with commercial cloud data platforms and warehouses such as Databricks, BigQuery, and Snowflake, an open lakehouse setup requires significantly more engineering effort to build, operate, and tune properly. You trade managed convenience for lower-level control, flexibility, and potentially much lower long-term costs. 

QuantFlow currently supports three types of data engines:

  1. Local engine — DuckDB, mainly for local development, debugging, and lightweight research workflows.
  2. Cloud warehouse engine — commercial data platforms such as Databricks, BigQuery, and Snowflake.
  3. Open lakehouse engine — the QuantFlow embedded data engine built on top of S3-compatible object storage + Parquet + Iceberg + Trino.

Why an Open Lakehouse Engine at All?

I have to admit that I have always believed that self-managed systems built on top of open-source products tend to cost more overall than commercial platforms, especially when considering engineering labour, operational issues, maintenance overhead, and opportunity cost. For most routine data processing and analytics workloads, commercial cloud data platforms are actually quite reasonable when managed properly.

However, I become much more hesitant when dealing with quant research over market data, especially with the current trend toward microstructure-level research using tick and order book data. It is not only the sheer scale of market data required today, but more importantly the highly iterative nature of quantitative research and experimentation, that can make usage-based pricing models much more expensive than expected.

Market data is naturally high-volume, time-sensitive, append-heavy, and repeatedly scanned during research. A single symbol can generate a surprisingly large amount of data when working with tick trades or order book updates. Once you move from one symbol to a cross-sectional strategy, the numbers grow very quickly. For example, one year of QQQ MBP-1 data can already be around 117 GB. That is just one symbol, one schema, and one year. 

The cost problem is not one query. The cost problem is repeated experimentation, such as:

try one feature set
try another feature set
change the sampling method
change the label horizon
change the universe
change the lookback window
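
To see how quickly the list above multiplies, here is a minimal sketch that counts the configurations in a full experiment grid. The dimension names and values are hypothetical, not QuantFlow's actual settings, and it assumes the worst case in which every configuration rescans the raw data.

```python
from itertools import product

# Hypothetical experiment dimensions; names and sizes are illustrative only.
grid = {
    "feature_set": ["basic", "imbalance", "microprice"],
    "sampling": ["time_bars", "volume_bars"],
    "label_horizon_s": [1, 5, 30],
    "universe": ["single_symbol", "top_10"],
    "lookback_days": [20, 60],
}


def count_runs(grid):
    """Count the distinct configurations in a full Cartesian grid."""
    return sum(1 for _ in product(*grid.values()))


runs = count_runs(grid)
print(runs)  # 3 * 2 * 3 * 2 * 2 = 72
```

Even this small grid produces 72 runs, and in practice the grid is rarely explored only once.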

S3 + Parquet + Iceberg + Trino

The open lakehouse architecture is simple in concept: store large market data files in cheap S3-compatible object storage, use Parquet as the physical file format, use Iceberg as the table format, and use Trino as the SQL query engine.

The important point is that the platform is no longer a single product. It becomes a set of replaceable layers.

Parquet matters because market data is naturally columnar. Query engines can read only the required columns instead of scanning entire files. Iceberg matters because Parquet files alone do not make a table — Iceberg adds snapshots, schema evolution, partition management, and atomic commits. Trino sits on top of Iceberg and executes distributed SQL queries across many Parquet files in parallel. For Python-native state and feature engineering, I still prefer Ray + Polars over SQL-based transformations.
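
As a rough illustration of how the layers fit together, the sketch below builds the kind of SQL one might send to Trino for an Iceberg table stored as Parquet. The catalog, schema, table, and column names are hypothetical and are not QuantFlow's actual schema. `format` and `partitioning` are table properties of Trino's Iceberg connector. The query selects only three columns and filters on the partition columns, which is what lets the engine skip most files and most of each file.

```python
def create_table_sql(table="iceberg.market.mbp1"):
    """Build DDL for a hypothetical MBP-1 table, partitioned by symbol and day."""
    return f"""
CREATE TABLE IF NOT EXISTS {table} (
    ts_event  timestamp(6) with time zone,
    symbol    varchar,
    bid_px    double,
    ask_px    double,
    bid_sz    bigint,
    ask_sz    bigint
)
WITH (
    format = 'PARQUET',
    partitioning = ARRAY['symbol', 'day(ts_event)']
)""".strip()


def spread_query_sql(table, symbol, day):
    """Build a query that reads three columns for one symbol and one day.

    String interpolation keeps the sketch short; real code should use the
    client's parameter binding instead of formatting values into SQL.
    """
    return f"""
SELECT ts_event, ask_px - bid_px AS spread
FROM {table}
WHERE symbol = '{symbol}'
  AND ts_event >= TIMESTAMP '{day} 00:00:00 UTC'
  AND ts_event <  TIMESTAMP '{day} 00:00:00 UTC' + INTERVAL '1' DAY""".strip()


print(create_table_sql())
print(spread_query_sql("iceberg.market.mbp1", "QQQ", "2024-01-02"))
```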

Example Architecture Cost Breakdown

Below is a simplified monthly infrastructure breakdown for the open lakehouse setup used in QuantFlow:

This is obviously not a complete production cost model. It does not include engineering labour, monitoring systems, backup infrastructure, or operational overhead. The point is simply to show that the raw infrastructure layer for large-scale market data research can be surprisingly affordable when storage and compute are separated properly.

Cost Comparison (made by ChatGPT)

To make the cost discussion more concrete, I asked ChatGPT to do a comparison on one practical example: one year of QQQ MBP-1 data at around 117 GB. One scan of 117 GB does not sound expensive. The problem is that market data research rarely scans it once. A cross-sectional strategy may scan many symbols, and a research workflow may scan the same data repeatedly while changing features, labels, horizons, and sampling rules.

A simple way to think about it is this: 117 GB is about 0.106 TiB if the figure is in decimal gigabytes, or 0.114 TiB if it is really 117 GiB. If we scan that dataset 1,000 times during research, that is roughly 106 to 114 TiB of scanned data. If we scale from one symbol to a 10-symbol research universe with similar order-book data size, one full scan is already around 1.17 TB, and 100 research iterations become around 117 TB of scanned data. The cost problem is not the single QQQ query; it is repeated experimentation over a growing universe.

Below is an indicative monthly comparison for a QQQ-style workload. To keep the example simple, assume QQQ one-year MBP-1 data is 117 GB, a 10-symbol universe has similar data size per symbol, and the research workflow scans that universe 100 times in a month.

117 GB × 10 symbols × 100 scans
= 117 TB scanned
≈ 106–114 TiB scanned per month (depending on whether GB means 10^9 or 2^30 bytes)
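
The unit conversion is easy to get slightly wrong, so here is a small sketch of the arithmetic. It assumes only the figures stated above (117 GB per symbol, 10 symbols, 100 scans) and shows both readings of "GB", since per-TiB pricing makes the distinction matter by about 7%.

```python
GB, TB = 10**9, 10**12    # decimal units
GIB, TIB = 2**30, 2**40   # binary units


def scanned_bytes(bytes_per_symbol, symbols, scans):
    """Total bytes read if every scan touches every symbol in full."""
    return bytes_per_symbol * symbols * scans


decimal_total = scanned_bytes(117 * GB, 10, 100)   # "117 GB" read as 117e9 bytes
binary_total = scanned_bytes(117 * GIB, 10, 100)   # "117 GB" read as 117 GiB

print(round(decimal_total / TB, 1))   # 117.0 TB
print(round(decimal_total / TIB, 1))  # 106.4 TiB
print(round(binary_total / TIB, 1))   # 114.3 TiB
```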

The exact numbers will vary with region, cloud provider, discounts, compression ratio, pruning efficiency, warehouse size, concurrency, runtime, and research behaviour, but this gives a useful order of magnitude. The important point is not the precise dollar amount, but how the cost scales with repeated scans and experimentation.

A more detailed breakdown:

Open lakehouse:
  R2 storage for ~1.17 TB active dataset: $40/month
  1 Trino coordinator + 4 worker VMs (16 GB each): $250/month
  R2 egress: $0
  Estimated total: $290/month

BigQuery on-demand:
  Capability-matched repeated research scans and larger concurrent workloads
  Effective monthly scanned data: ~232 TiB
  232 × $6.25 ≈ $1,450/month
  Storage for ~1.17 TB: relatively small compared with scan cost

Databricks Jobs:
  Underlying cloud VMs + DBU charges
  1 driver + 4 workers 16 GB cluster, always-on equivalent: about $1,350/month

Databricks All-Purpose:
  Same cluster shape, higher interactive DBU rate
  About $3,200/month if kept running heavily

Snowflake:
  Medium warehouse with sustained research usage
  6 credits/hour × 100 hours × ~$3/credit ≈ $1,800
  Plus storage, usually smaller than compute in this example
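
The three pricing shapes in this breakdown can be written down as a few one-line functions. This is only a sketch that reproduces the arithmetic above. The default values are the indicative figures from the breakdown rather than quoted prices, and the function names are made up for illustration. Databricks is left out because the breakdown gives only totals for it, not a VM and DBU split.

```python
def lakehouse_monthly(storage_usd=40.0, compute_usd=250.0, egress_usd=0.0):
    """Fixed-cost model: the bill does not depend on how much data is scanned."""
    return storage_usd + compute_usd + egress_usd


def bigquery_on_demand_monthly(scanned_tib, usd_per_tib=6.25):
    """Scan-based model: the bill grows linearly with TiB scanned (storage ignored)."""
    return scanned_tib * usd_per_tib


def snowflake_monthly(credits_per_hour, hours, usd_per_credit):
    """Credit-based model: the bill grows with warehouse size and running time."""
    return credits_per_hour * hours * usd_per_credit


print(lakehouse_monthly())              # 290.0
print(bigquery_on_demand_monthly(232))  # 1450.0
print(snowflake_monthly(6, 100, 3.0))   # 1800.0
```

Only the first function ignores usage, and that is the property the rest of the comparison turns on.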

The main observation is not that the open lakehouse is always cheaper for every workload. It is that for repeated market-data scans, its cost grows much more slowly. Once the three VMs are running, scanning the same Parquet/Iceberg data repeatedly does not create a new per-TiB query bill in the same way as BigQuery on-demand, and it does not add a Databricks or Snowflake platform charge on top of every hour of managed compute.

For the open lakehouse version, the cost is more predictable. For example, using Cloudflare R2 as active storage and three low-cost 16 GB VMs for Trino/Ray workers, the monthly cost can be roughly in the low hundreds of dollars rather than scaling directly with every TiB scanned. The storage cost is mostly object storage, and the compute cost is mostly the fixed VM bill. If the workload scans the same market data many times, this fixed-compute model can be attractive.

BigQuery is different. With on-demand pricing, the query cost is linked to the amount of data scanned. That model is very convenient and often perfectly reasonable for normal analytics, but market data research can generate many repeated scans. A single 117 GB QQQ scan is small; hundreds or thousands of scans across many symbols are not.
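
One way to make this concrete is to ask how many full scans it takes before a per-TiB bill exceeds a fixed monthly bill. The sketch below assumes the indicative figures used in this post ($290/month fixed, $6.25 per TiB, about 1.17 TB per scan of the 10-symbol universe). It ignores storage charges, pruning, and any free tier, so treat the result as an order of magnitude only.

```python
import math

TIB = 2**40


def break_even_tib(fixed_monthly_usd, usd_per_tib):
    """TiB scanned per month at which scan-based cost equals the fixed cost."""
    return fixed_monthly_usd / usd_per_tib


def break_even_scans(fixed_monthly_usd, usd_per_tib, bytes_per_scan):
    """Smallest whole number of full scans whose cost reaches the fixed cost."""
    tib_per_scan = bytes_per_scan / TIB
    return math.ceil(break_even_tib(fixed_monthly_usd, usd_per_tib) / tib_per_scan)


print(break_even_tib(290.0, 6.25))             # 46.4
print(break_even_scans(290.0, 6.25, 1.17e12))  # 44
```

Under these assumptions the crossover is at roughly 44 full scans of the universe per month, which an iterative research workflow can reach quickly.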

Databricks has a different shape again. It is not simply “per query”. The cost comes from the underlying cloud infrastructure plus Databricks DBU usage. It gives you Spark, notebooks, managed jobs, collaboration, and a very productive platform, but if the target workload is mainly Ray/Polars-style ingestion and repeated market-data processing, a small self-managed VM cluster can be much cheaper.

Snowflake is also not exactly “per query”. It is mainly warehouse-credit based: you pay for the virtual warehouse size and how long it runs. This is excellent for managed SQL workloads and enterprise analytics, but repeated order-book scans and backtest-style research can keep warehouses running and consuming credits.
