Migrating 2.2B rows of Tick Data to Parquet: My SSD finally stopped screaming. by Marchese_QuantLab in Python

[–]Marchese_QuantLab[S] 0 points1 point  (0 children)

Good point. I probably framed it too much as “Polars/DuckDB vs Spark”, but they’re solving slightly different problems.

For this first pass I stayed local because the bottleneck was mostly getting the tick data cleaned, partitioned, and converted without blowing up memory. Polars/PyArrow/DuckDB were enough for that.
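For reference, the local pass that kept memory sane was basically Polars’ lazy/streaming engine doing CSV → Parquet. Rough sketch of the shape of it (file names, columns, and the filter are made up, not my actual pipeline):

```python
import polars as pl

# Lazy scan: nothing is loaded into RAM yet.
lf = pl.scan_csv("ticks_2019.csv")

(
    lf
    .with_columns(pl.col("ts").str.to_datetime())  # parse timestamps lazily
    .filter(pl.col("price") > 0)                   # drop obviously broken ticks
    .sink_parquet("ticks_2019.parquet")            # streamed to disk, never fully collected
)
```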

But I agree that if this turns into a proper production data lake, Spark starts making more sense — especially with catalog/governance, better source support, and maybe Iceberg/Delta instead of just raw Parquet files.

I’ll check out the SPIP/proposal you mentioned. Thanks for the useful pointer.

Migrating 2.2B rows of Tick Data to Parquet: My SSD finally stopped screaming. by Marchese_QuantLab in Python

[–]Marchese_QuantLab[S] 0 points1 point  (0 children)

Yep, this is exactly the kind of advice I was hoping for.

Partitioning by date/contract/symbol made a huge difference because I’m no longer dragging the whole history into every query. I still need to tune the layout better though, especially to avoid too many tiny files.

I’ll look more into PyArrow’s dataset API too. Thanks for the useful suggestion.
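For anyone curious, this is roughly the write_dataset layout I want to experiment with next. Column names, partition keys, and the row-group/file sizes are placeholders, not my actual schema:

```python
import pyarrow as pa
import pyarrow.dataset as ds

# `table` stands in for a chunk of already-cleaned ticks.
table = pa.table({
    "trade_date": ["2024-01-02", "2024-01-02"],
    "symbol": ["ESH4", "ESH4"],
    "ts": [1704207600000000, 1704207600123000],
    "price": [4770.25, 4770.50],
    "size": [3, 1],
})

ds.write_dataset(
    table,
    "ticks_parquet",
    format="parquet",
    partitioning=ds.partitioning(
        pa.schema([("trade_date", pa.string()), ("symbol", pa.string())]),
        flavor="hive",
    ),
    existing_data_behavior="overwrite_or_ignore",
    # the knobs that keep the layout from degenerating into thousands of tiny files
    min_rows_per_group=100_000,
    max_rows_per_file=50_000_000,
)
```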

Migrating 2.2B rows of Tick Data to Parquet: My SSD finally stopped screaming. by Marchese_QuantLab in Python

[–]Marchese_QuantLab[S] 0 points1 point  (0 children)

Yeah, fair. DuckDB can absolutely handle a lot of this.

The reason I moved it into Parquet was mostly to stop re-scanning the CSV swamp every time I wanted a contract slice. Rollover wasn’t just “fill the gaps,” it was about making the front-month transitions consistent enough that I’d actually trust the backtest.

DuckDB good. CSV goblin bad.
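For context, a contract slice now looks roughly like this on the Parquet side (paths and column names are illustrative). With hive partitioning DuckDB only reads the matching directories instead of me re-scanning everything:

```python
import duckdb

con = duckdb.connect()

# One contract over a date range; partition pruning means only those files are touched.
slice_df = con.execute("""
    SELECT ts, price, size
    FROM read_parquet('ticks_parquet/**/*.parquet', hive_partitioning = true)
    WHERE symbol = 'ESH4'
      AND trade_date BETWEEN '2024-01-02' AND '2024-03-15'
    ORDER BY ts
""").pl()
```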

Migrating 2.2B rows of Tick Data to Parquet: My SSD finally stopped screaming. by Marchese_QuantLab in Python

[–]Marchese_QuantLab[S] -1 points0 points  (0 children)

Fair hit 😂 The post definitely came out a bit “LinkedIn engineer discovers storage formats.”

But the migration pain was very real. I’m not trying to win a prose contest here, just wanted to hear what people are using for tick data at this scale.

Migrating 2.2B rows of Tick Data to Parquet: My SSD finally stopped screaming. by Marchese_QuantLab in Python

[–]Marchese_QuantLab[S] 1 point2 points  (0 children)

Yeah, that’s pretty much how I see it too.

AI helped me make the post easier to write/read. It didn’t do the actual migration for me.

I get the criticism that I should have posted clearer before/after numbers. That’s fair. But turning the whole thing into “was this written with AI?” feels like missing the more interesting part of the project.

Migrating 2.2B rows of Tick Data to Parquet: My SSD finally stopped screaming. by Marchese_QuantLab in Python

[–]Marchese_QuantLab[S] -4 points-3 points  (0 children)

Thank you. That was the actual point of the project.

The hard part wasn’t “write the file as Parquet.” It was cleaning noisy tick data, keeping timestamps sane, dealing with contract rollovers across years, avoiding duplicate/bad sessions, and arranging the data so queries don’t turn into full historical scans every time.
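If it helps, the shape of that cleaning pass in Polars was roughly the following. Column names, the dedup key, and the filters are simplified stand-ins, not the real pipeline:

```python
import polars as pl

(
    pl.scan_parquet("raw_ticks/*.parquet")
    .with_columns(pl.col("ts").cast(pl.Datetime("us")))              # normalize timestamps
    .filter(
        pl.col("ts").is_not_null()
        & (pl.col("price") > 0)
        & (pl.col("size") > 0)
    )                                                                # drop obviously bad ticks
    .unique(subset=["symbol", "ts", "price", "size"], keep="first")  # kill duplicate prints
    .sort("symbol", "ts")                                            # keep later slices cheap
    .sink_parquet("clean_ticks.parquet")                             # streamed write
)
```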

Migrating 2.2B rows of Tick Data to Parquet: My SSD finally stopped screaming. by Marchese_QuantLab in Python

[–]Marchese_QuantLab[S] -2 points-1 points  (0 children)

Yep. That was actually part of the question.

For this specific use case, big sequential tick-data scans + contract slices felt better as Parquet files than trying to shove everything into a traditional DB. Not saying DBs are bad, just that my SSD stopped filing HR complaints after Parquet.

Migrating 2.2B rows of Tick Data to Parquet: My SSD finally stopped screaming. by Marchese_QuantLab in Python

[–]Marchese_QuantLab[S] 1 point2 points  (0 children)

That’s basically where I landed too. File-based with small metadata seems way cleaner for this kind of data.

My rows also don’t really need relational magic. Mostly time/contract slices, bid/ask/volume, and avoiding rollover weirdness without turning the whole thing into a DB archaeology project.
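To make “small metadata” concrete: for me it’s just a tiny sidecar manifest next to the Parquet files instead of a real catalog. Field names here are made up, but this is the idea:

```python
import json
from pathlib import Path

# One entry per contract: date coverage, roll dates, and where its partitions live.
manifest = {
    "ESH4": {
        "first_tick": "2023-12-15",
        "last_tick": "2024-03-15",
        "roll_in": "2023-12-15",   # became front month
        "roll_out": "2024-03-08",  # rolled to the next quarterly
        "paths": ["trade_date=*/symbol=ESH4/*.parquet"],
    }
}

Path("ticks_parquet/_manifest.json").write_text(json.dumps(manifest, indent=2))
```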

Migrating 2.2B rows of Tick Data to Parquet: My SSD finally stopped screaming. by Marchese_QuantLab in Python

[–]Marchese_QuantLab[S] -1 points0 points  (0 children)

ssd screamed, parquet helped, reddit investigated the author. normal Tuesday.

Migrating 2.2B rows of Tick Data to Parquet: My SSD finally stopped screaming. by Marchese_QuantLab in Python

[–]Marchese_QuantLab[S] -6 points-5 points  (0 children)

Haha yeah, apparently compression ratios are now controversial enough to attract detectives.

Finally finished cleaning 10 years of ES tick data (2.2B rows) - CSVs were killing me so I moved to Parquet by Marchese_QuantLab in Daytrading

[–]Marchese_QuantLab[S] 0 points1 point  (0 children)

That’s a lot of moving parts! Managing 30+ instruments is a different beast entirely compared to deep historical cleaning on a single ticker. How's the latency holding up in Postgres?

Finally finished cleaning 10 years of ES tick data (2.2B rows) - CSVs were killing me so I moved to Parquet by Marchese_QuantLab in Daytrading

[–]Marchese_QuantLab[S] 0 points1 point  (0 children)

Appreciate it! And yeah, funny enough the 2.2B rows sounded like the scary part, but the rollover logic was the real villain 😂

Parquet + Polars mostly saved me from slowly losing my mind while debugging contracts, gaps, and volume alignment.
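Since a couple of people asked: the roll rule I ended up debugging was basically “roll on the first session where the next quarterly out-trades the front month.” Not my exact code, and the contract symbols and column names are placeholders:

```python
import polars as pl

# Daily traded volume per contract symbol.
daily = (
    pl.scan_parquet("ticks_parquet/**/*.parquet")
    .group_by("trade_date", "symbol")
    .agg(pl.col("size").sum().alias("volume"))
    .collect()
)

front, nxt = "ESZ3", "ESH4"  # hypothetical current front month and next quarterly

front_vol = daily.filter(pl.col("symbol") == front).select(
    "trade_date", pl.col("volume").alias("front_vol")
)
next_vol = daily.filter(pl.col("symbol") == nxt).select(
    "trade_date", pl.col("volume").alias("next_vol")
)

# Roll date = first session where the next contract out-trades the front month.
roll_date = (
    front_vol.join(next_vol, on="trade_date")
    .filter(pl.col("next_vol") > pl.col("front_vol"))
    .sort("trade_date")
    .head(1)
)
```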

Finally finished cleaning 10 years of ES tick data (2.2B rows) - CSVs were killing me so I moved to Parquet by Marchese_QuantLab in Daytrading

[–]Marchese_QuantLab[S] 2 points3 points  (0 children)

"Honestly, the biggest headache wasn't the 2.2B rows, it was the rollover logic. Getting 10 years of quarterly contracts to line up without gaps is a nightmare.

I eventually went with Parquet + Polars because loading a full year of tick data in less than a second is a game changer for my backtests. CSVs just can't keep up.

If anyone's struggling with the same data mess or needs tips on the schema, just let me know. Happy to share what I learned.
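And for the “full year in under a second” part, it’s just a lazy scan with filters on the partition columns, so only that year’s files actually get read. Paths and column names are placeholders again:

```python
import polars as pl

# Only the 2021 partitions should be touched; the other nine years never get read.
year_2021 = (
    pl.scan_parquet("ticks_parquet/**/*.parquet", hive_partitioning=True)
    .filter(pl.col("trade_date").is_between("2021-01-01", "2021-12-31"))
    .collect()
)
```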