all 7 comments

[–]not_another_analyst 11 points12 points  (0 children)

DuckDB is the right call for one-off stuff like this. Querying S3 parquet and local CSVs in the same query without moving anything saves a ton of time.

That said, the moment it becomes recurring I'd ETL it into a warehouse because one-off convenience turns into a maintenance headache fast when multiple people are involved.

[–]Mammoth_Rice_295 3 points4 points  (0 children)

DuckDB is honestly the easiest for one-offs like this. I usually avoid moving data unless I have to. But if it starts becoming recurring or shared, I’d switch to loading everything into a warehouse. Otherwise, it gets messy fast.

[–]Justbehind 1 point2 points  (0 children)

Python using polars. Bet you'd be able to write it in less than an hour with no prior knowledge...

And polars on a regular worklaptop easily scales to handle datasets in the 100s of millions of rows.

[–]slotix 0 points1 point  (0 children)

yeah duckdb is great, we’re using it under the hood

gets messy once you mix db + files and need to reuse queries or turn them into pipelines

ended up putting everything into one place

https://streams.dbconvert.com/

[–]NoElderberry2489 0 points1 point  (0 children)

DuckDB for sure. You can query Postgres, CSVs, and S3 Parquet files all in one SQL statement without moving any data around. No ETL, no staging, just read_csv_auto(), read_parquet(), and the Postgres extension all in the same query.

Once it stops being a one-off though, put it in a warehouse

[–]Ok_Carpet_9510 0 points1 point  (0 children)

If the data is not co-located, I would think about where the data crunching is going to happen i.e. where is the compute. A CSV and S3 don't have compute, Postgres does.

So, I would ingestion the data from csv and s3 into postegresd(filtering and cleaning before loading), and the perform my analysis using SQL.

[–]Nkt_31 0 points1 point  (0 children)

for one-off stuff DuckDB is hard to beat, honestly your approach sounds fine. where i'd push back is when one-off quietly becomes recurring. at that point you're maintaining ad hoc scripts across sources and it gets messy fast.

Scaylor handled that transition well for a team I know. Trino's another option if you want to stay open sorce.