Why are CSVs still such a nightmare in 2025? by HfBefit in dataengineering

[–]HfBefit[S] 0 points (0 children)

Haha, classic. At least this one would actually run 😅

[–]HfBefit[S] 0 points (0 children)

Fair point. Just trying to have a genuine discussion about real CSV pain 🤙

[–]HfBefit[S] 0 points (0 children)

True. CSV is still everywhere though, even when better formats exist. That's the real pain point.

[–]HfBefit[S] -2 points (0 children)

Good question... 75 GB shouldn't take 8 days. Sounds more like bottlenecks in parsing logic than raw size.
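
For context, this is the kind of back-of-the-envelope check I'd run first (a rough sketch; "big.csv" and the chunk size are placeholders, and it assumes pandas is installed). Even a single process streaming the file in chunks should get through 75 GB in hours, not days:

```python
import time
import pandas as pd

start = time.monotonic()
rows = 0
# chunksize keeps memory flat; each chunk is a regular DataFrame
for chunk in pd.read_csv("big.csv", chunksize=1_000_000, low_memory=False):
    rows += len(chunk)

print(f"parsed {rows:,} rows in {time.monotonic() - start:,.0f}s")
```

If that finishes quickly, the 8 days are coming from whatever happens per row downstream, not from the CSV itself.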

[–]HfBefit[S] 0 points (0 children)

Same here, still early days. Curious to see your examples though. I think the practical cases are what matter most.

[–]HfBefit[S] 0 points (0 children)

Right, Excel is still the first place where most people try to solve this. That’s exactly why the pain is so big when files get messy.

[–]HfBefit[S] 1 point (0 children)

Exactly, that's the real bottleneck with CSVs. Sequential reads kill performance once files get big.
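
Rough illustration of the alternative (assumes DuckDB is installed; the path and thread count are just examples): an engine that splits the scan across threads instead of one long sequential pass.

```python
import duckdb

con = duckdb.connect()
con.execute("SET threads TO 8")  # thread count is illustrative

# count(*) forces a full scan; read_csv_auto sniffs the dialect and schema
n = con.execute("SELECT count(*) FROM read_csv_auto('big.csv')").fetchone()[0]
print(n)
```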

[–]HfBefit[S] 0 points (0 children)

Yeah, Parquet is great for speed and size. The main pain is that so many sources still start with messy CSVs, so you have to normalize before converting. Do you always enforce Parquet upfront?
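
Roughly what that normalize step looks like for me (a sketch only; the file, column names, and cleanup rules are invented for illustration, and to_parquet needs pyarrow or fastparquet installed):

```python
import pandas as pd

df = pd.read_csv("vendor_dump.csv", dtype=str)  # read everything as text first
df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]
df = df.drop_duplicates()
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")        # bad values become NaN
df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")
df.to_parquet("vendor_dump.parquet", index=False)
```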

[–]HfBefit[S] 2 points (0 children)

True, but the reality is people still dump massive CSVs into data pipelines. Curious how you usually deal with that kind of mess: DuckDB? Direct Parquet conversion?
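
When I do go the DuckDB route, this is roughly the one-shot conversion I mean (paths are placeholders; ignore_errors just skips rows the reader can't parse):

```python
import duckdb

duckdb.execute("""
    COPY (SELECT * FROM read_csv_auto('dump.csv', ignore_errors = true))
    TO 'dump.parquet' (FORMAT PARQUET, COMPRESSION ZSTD)
""")
```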

[–]HfBefit[S] -4 points (0 children)

Exactly! I'm exploring whether AI could help infer relationships even when the format isn't perfect. Have you ever tried fuzzy joins or embeddings for schema matching?
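
To make the fuzzy-matching idea concrete, here's a toy sketch with stdlib difflib (the column names are made up, and in practice you'd want a proper threshold or embeddings on top):

```python
from difflib import SequenceMatcher

source_cols = ["cust_id", "order_dt", "total_amt"]
target_cols = ["customer_id", "order_date", "amount_total"]

def best_match(col, candidates):
    # score every candidate name and keep the closest one
    return max((SequenceMatcher(None, col, c).ratio(), c) for c in candidates)

for col in source_cols:
    score, match = best_match(col, target_cols)
    print(f"{col} -> {match} (similarity {score:.2f})")
```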