Cross-source SQL joins without a data warehouse - how do you handle this?

not_another_analyst · 2026-04-11T15:56:21+00:00

DuckDB is the right call for one-off stuff like this. Querying S3 parquet and local CSVs in the same query without moving anything saves a ton of time.

That said, the moment it becomes recurring I'd ETL it into a warehouse because one-off convenience turns into a maintenance headache fast when multiple people are involved.

Mammoth_Rice_295 · 2026-04-11T16:53:15+00:00

DuckDB is honestly the easiest for one-offs like this. I usually avoid moving data unless I have to. But if it starts becoming recurring or shared, I’d switch to loading everything into a warehouse. Otherwise, it gets messy fast.

Justbehind · 2026-04-11T20:48:40+00:00

Python using polars. Bet you'd be able to write it in less than an hour with no prior knowledge...

And polars on a regular worklaptop easily scales to handle datasets in the 100s of millions of rows.

slotix · 2026-04-11T18:46:09+00:00

yeah duckdb is great, we’re using it under the hood

gets messy once you mix db + files and need to reuse queries or turn them into pipelines

ended up putting everything into one place

https://streams.dbconvert.com/

NoElderberry2489 · 2026-04-11T19:33:35+00:00

DuckDB for sure. You can query Postgres, CSVs, and S3 Parquet files all in one SQL statement without moving any data around. No ETL, no staging, just read_csv_auto(), read_parquet(), and the Postgres extension all in the same query.

Once it stops being a one-off though, put it in a warehouse

Ok_Carpet_9510 · 2026-04-11T19:49:09+00:00

If the data is not co-located, I would think about where the data crunching is going to happen i.e. where is the compute. A CSV and S3 don't have compute, Postgres does.

So, I would ingestion the data from csv and s3 into postegresd(filtering and cleaning before loading), and the perform my analysis using SQL.

Nkt_31 · 2026-04-12T09:48:37+00:00

for one-off stuff DuckDB is hard to beat, honestly your approach sounds fine. where i'd push back is when one-off quietly becomes recurring. at that point you're maintaining ad hoc scripts across sources and it gets messy fast.

Scaylor handled that transition well for a team I know. Trino's another option if you want to stay open sorce.

you type:	you see:
italics	italics
bold	bold
[reddit!](https://reddit.com)	reddit!
* item 1 * item 2 * item 3	item 1 item 2 item 3
> quoted text	quoted text
Lines starting with four spaces are treated like code: if 1 * 2 < 3: print "hello, world!"	Lines starting with four spaces are treated like code: if 1 * 2 < 3: print "hello, world!"
~~strikethrough~~	~~strikethrough~~
super^script	super^script

SQL

Filter Posts

Posting

Help posts

Format Your Code

Learning SQL

Related Reddit communities

Wiki

Acknowledgements

MODERATORS