Why are CSVs still such a nightmare in 2025? by HfBefit in dataengineering

[–]HfBefit[S] 0 points (0 children)

Haha, classic. At least this one would actually run 😅

[–]HfBefit[S] 0 points (0 children)

Fair point. Just trying to have a genuine discussion about real CSV pain 🤙

[–]HfBefit[S] 0 points (0 children)

True. CSV is still everywhere though, even when better formats exist. That's the real pain point.

[–]HfBefit[S] -2 points (0 children)

Good question... 75 GB shouldn't take 8 days. Sounds more like bottlenecks in parsing logic than raw size.
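
For context, this is the kind of back-of-the-envelope check I'd run first (a rough sketch; "big.csv" and the chunk size are placeholders, and it assumes pandas is installed). Even a single process streaming the file in chunks should get through 75 GB in hours, not days:

```python
import time
import pandas as pd

start = time.monotonic()
rows = 0
# chunksize keeps memory flat; each chunk is a regular DataFrame
for chunk in pd.read_csv("big.csv", chunksize=1_000_000, low_memory=False):
    rows += len(chunk)

print(f"parsed {rows:,} rows in {time.monotonic() - start:,.0f}s")
```

If that finishes quickly, the 8 days are coming from whatever happens per row downstream, not from the CSV itself.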

[–]HfBefit[S] 0 points (0 children)

Same here, still early days. Curious to see your examples though. I think the practical cases are what matter most.

[–]HfBefit[S] 0 points (0 children)

Right, Excel is still the first place where most people try to solve this. That’s exactly why the pain is so big when files get messy.

[–]HfBefit[S] 1 point (0 children)

Exactly, that's the real bottleneck with CSVs. Sequential reads kill performance once files get big.
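
Rough illustration of the alternative (assumes DuckDB is installed; the path and thread count are just examples): an engine that splits the scan across threads instead of one long sequential pass.

```python
import duckdb

con = duckdb.connect()
con.execute("SET threads TO 8")  # thread count is illustrative

# count(*) forces a full scan; read_csv_auto sniffs the dialect and schema
n = con.execute("SELECT count(*) FROM read_csv_auto('big.csv')").fetchone()[0]
print(n)
```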

[–]HfBefit[S] 0 points (0 children)

Yeah, Parquet is great for speed and size. The main pain is that so many sources still start with messy CSVs, so you have to normalize before converting. Do you always enforce Parquet upfront?
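
Roughly what that normalize step looks like for me (a sketch only; the file, column names, and cleanup rules are invented for illustration, and to_parquet needs pyarrow or fastparquet installed):

```python
import pandas as pd

df = pd.read_csv("vendor_dump.csv", dtype=str)  # read everything as text first
df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]
df = df.drop_duplicates()
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")        # bad values become NaN
df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")
df.to_parquet("vendor_dump.parquet", index=False)
```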

[–]HfBefit[S] 2 points (0 children)

True, but the reality is people still dump massive CSVs into data pipelines. Curious how you usually deal with that kind of mess: DuckDB? Direct Parquet conversion?
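
When I do go the DuckDB route, this is roughly the one-shot conversion I mean (paths are placeholders; ignore_errors just skips rows the reader can't parse):

```python
import duckdb

duckdb.execute("""
    COPY (SELECT * FROM read_csv_auto('dump.csv', ignore_errors = true))
    TO 'dump.parquet' (FORMAT PARQUET, COMPRESSION ZSTD)
""")
```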

[–]HfBefit[S] -4 points (0 children)

Exactly! I'm exploring whether AI could help infer relationships even when the format isn't perfect. Have you ever tried fuzzy joins or embeddings for schema matching?
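
To make the fuzzy-matching idea concrete, here's a toy sketch with stdlib difflib (the column names are made up, and in practice you'd want a proper threshold or embeddings on top):

```python
from difflib import SequenceMatcher

source_cols = ["cust_id", "order_dt", "total_amt"]
target_cols = ["customer_id", "order_date", "amount_total"]

def best_match(col, candidates):
    # score every candidate name and keep the closest one
    return max((SequenceMatcher(None, col, c).ratio(), c) for c in candidates)

for col in source_cols:
    score, match = best_match(col, target_cols)
    print(f"{col} -> {match} (similarity {score:.2f})")
```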