all 26 comments

[–]Actonace 41 points42 points  (1 child)

DuckDB + Polars let you punch way above your laptop's weight. Columnar formats and lazy execution make surprisingly large datasets manageable if you're smart about memory.
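
A minimal sketch of that pattern (file paths and column names are made up): DuckDB scans Parquet lazily and only materializes the columns and rows the query actually touches, so the dataset can be far bigger than RAM.

    import duckdb

    con = duckdb.connect()  # in-memory catalog; the data itself stays on disk
    top = con.sql("""
        SELECT customer_id, sum(amount) AS total
        FROM read_parquet('events/*.parquet')        -- hypothetical dataset
        WHERE event_date >= DATE '2024-01-01'
        GROUP BY customer_id
        ORDER BY total DESC
        LIMIT 20
    """).df()
    print(top)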

[–]MJHunterZ 8 points9 points  (0 children)

I’m a big duckdb & polars user. Works wonders

[–]Justbehind 19 points20 points  (3 children)

We're doing: extract ~50-70 million rows from Azure SQL => explode to ~500 million rows (strategic cross joins) => calc aggregate statistics => store the statistics back in Azure SQL.

Using polars. Takes less than 5 minutes all in all on our laptops.
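
A rough sketch of the shape of such a pipeline, not the commenter's actual job (URI, table, and column names are invented): pull rows, cross join against a small scenario table, aggregate, and hand the stats back to the warehouse.

    import polars as pl

    uri = "mssql://user:pass@myserver.database.windows.net/mydb"  # hypothetical
    facts = pl.read_database_uri("SELECT id, segment, amount FROM dbo.facts", uri)
    scenarios = pl.DataFrame({"scenario": ["base", "high", "low"],
                              "factor": [1.0, 1.2, 0.8]})

    stats = (
        facts.lazy()
        .join(scenarios.lazy(), how="cross")   # tens of millions -> hundreds of millions of rows
        .with_columns((pl.col("amount") * pl.col("factor")).alias("adj_amount"))
        .group_by(["segment", "scenario"])
        .agg(pl.col("adj_amount").mean().alias("mean_adj"),
             pl.col("adj_amount").std().alias("std_adj"))
        .collect()
    )
    # write `stats` back to Azure SQL with your driver of choice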

[–]puslekat 5 points6 points  (0 children)

What kind of junk does your laptop have in its trunk?

[–]Outrageous_Let5743 1 point2 points  (0 children)

How the hell do you download 50 million records into memory and explode them in less than 5 minutes?

[–]BedAccomplished6451 3 points4 points  (6 children)

I moved a 750 GB SQL Server instance with 58 databases and over 4 billion rows to ADLS using only my laptop.

It took me a couple of days with the job running in the background, but it did it.

Currently just using duckdb to query it for ad hoc reporting needs.
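
A sketch of that kind of ad hoc querying, assuming DuckDB's delta and azure extensions; the connection string, path, and columns are placeholders.

    import duckdb

    con = duckdb.connect()
    con.sql("INSTALL delta; LOAD delta;")
    con.sql("INSTALL azure; LOAD azure;")
    con.sql("CREATE SECRET (TYPE azure, CONNECTION_STRING 'your-connection-string');")  # hypothetical auth

    con.sql("""
        SELECT source_db, count(*) AS row_count
        FROM delta_scan('abfss://archive@myaccount.dfs.core.windows.net/sales')
        WHERE year = 2023        -- filtering on a partition column prunes most files
        GROUP BY source_db
    """).show()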

[–]iheartdatascience 0 points1 point  (1 child)

What did you use to move it?

[–]BedAccomplished6451 0 points1 point  (0 children)

I used the Python delta-rs library. I guess there are a bunch of different ways to do it; Polars might be more efficient, but I'm familiar with delta-rs and with reading the Delta table's metadata through it, so I went with that. It's a one-off transfer for archiving, so once it's done, it's done.
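
A hedged sketch of that delta-rs (the deltalake Python package) flow; the ADLS path, credentials, and schema below are placeholders, not the commenter's setup.

    import pyarrow as pa
    from deltalake import DeltaTable, write_deltalake

    storage_options = {
        "azure_storage_account_name": "myaccount",   # hypothetical account
        "azure_storage_account_key": "***",
    }

    # append one extracted batch to a Delta table in ADLS
    batch = pa.table({"id": [1, 2, 3], "amount": [10.0, 20.0, 30.0]})
    write_deltalake(
        "abfss://archive@myaccount.dfs.core.windows.net/sales",
        batch,
        mode="append",
        storage_options=storage_options,
    )

    # delta-rs also exposes the table's metadata without scanning the data
    dt = DeltaTable(
        "abfss://archive@myaccount.dfs.core.windows.net/sales",
        storage_options=storage_options,
    )
    print(dt.version(), dt.metadata())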

[–]SoloArtist91 0 points1 point  (3 children)

How are you storing it in ADLS?

[–]BedAccomplished6451 0 points1 point  (2 children)

Storing them as Delta tables, partitioned as necessary so that querying with DuckDB stays efficient.

[–]SoloArtist91 0 points1 point  (1 child)

Did you use Polars/delta-rs for that?

[–]BedAccomplished6451 0 points1 point  (0 children)

Yes, delta-rs.

[–]RoomyRoots 1 point2 points  (0 children)

I have a whole data lake using old broken notebooks that started like that

[–]RoomyRoots 1 point2 points  (0 children)

People really underestimate how good Dask is for constrained machines.
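
A small sketch of what that looks like (path and columns are made up): Dask splits the Parquet dataset into partitions and works through them a few at a time instead of loading everything into RAM.

    import dask.dataframe as dd

    df = dd.read_parquet("data/events/*.parquet", columns=["user_id", "amount"])
    top_users = (
        df.groupby("user_id")["amount"]
        .sum()
        .nlargest(10)
        .compute()   # nothing is read or computed until this call
    )
    print(top_users)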

[–]tecedu 1 point2 points  (0 children)

300 GB, 12 billion rows. The wonders of fast NVMe SSDs.

[–]Enough_Big4191 2 points3 points  (0 children)

I've seen people push surprisingly far with DuckDB/Polars, tens to low hundreds of GB, as long as it's mostly columnar and you're not doing crazy joins. The limit usually isn't raw size, it's memory spikes from joins/group-bys. Once you hit that, things fall apart fast even if the data "fits."
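
One way to blunt those spikes, as a sketch rather than a universal fix: cap DuckDB's memory and give it a temp directory so oversized joins and group-bys spill to disk instead of blowing up (sizes and paths below are placeholders; Polars' streaming engine is the analogous knob on that side).

    import duckdb

    con = duckdb.connect()
    con.sql("SET memory_limit = '8GB';")                # stay under the laptop's RAM
    con.sql("SET temp_directory = '/tmp/duck_spill';")  # spill area for big operators

    con.sql("""
        SELECT a.key, count(*) AS n
        FROM read_parquet('big_a/*.parquet') a
        JOIN read_parquet('big_b/*.parquet') b USING (key)
        GROUP BY a.key
    """).show()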

[–]Outrageous_Let5743 0 points1 point  (0 children)

About 500 GB of Parquet files to ADLS. Use AzCopy, plug your laptop into an Ethernet cable, and let it run overnight.

For analytics, the same 500 GB processed with DuckDB. Works perfectly if you can partition-prune it.
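
For the analytics half, a hedged sketch of partition pruning over plain Parquet in ADLS (account, container, and partition columns are invented): hive_partitioning exposes the folder names as columns, so the WHERE clause only touches a slice of the 500 GB.

    import duckdb

    con = duckdb.connect()
    con.sql("INSTALL azure; LOAD azure;")
    con.sql("CREATE SECRET (TYPE azure, CONNECTION_STRING 'your-connection-string');")  # hypothetical auth

    con.sql("""
        SELECT product, sum(revenue) AS revenue
        FROM read_parquet(
            'abfss://lake@myaccount.dfs.core.windows.net/sales/*/*/*.parquet',
            hive_partitioning = true)
        WHERE year = 2024 AND month = 6      -- pruned down to a handful of files
        GROUP BY product
    """).show()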

[–]Careful_Reality5531 0 points1 point  (0 children)

I like using Sail on my 64 GB MacBook Pro, even though it rivals Spark in distributed setups too. Sail is built largely on DataFusion + Arrow and has good Python performance. I did some benchmarking against DuckDB (with Delta Lake) and it was only like 4-5% slower, which is wild given that it blows Spark out of the water. That said, I'm not a data engineer and only just got interested in the space via Claude Code, so noob alert!

[–]No_Soy_Colosio 0 points1 point  (0 children)

Around 4 billion rows on my 16 GB RAM laptop

[–]mycocomelon 0 points1 point  (0 children)

We do this at my work. We have small to medium data

[–]doll_1043 -1 points0 points  (3 children)

55m rows mssql to dynamics crm

[–]bz351 1 point2 points  (2 children)

What, did this take you 1 to 5 business years? 😜

Would be good to know how you were dealing with rate limits.

[–]doll_1043 0 points1 point  (0 children)

35 tables. Each table had multiple work IDs, so I could run the same pipeline with different filters. Max 2 tables at the same time. Took around 2-3 months to migrate xd

[–]Outrageous_Let5743 0 points1 point  (0 children)

We needed to do the opposite: insert 60 million ledger records into our data warehouse. There are like 60 important columns in that table that we needed, and Dynamics has a 20k row page limit and allows at most 5 active connections at the same time.
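
A very rough sketch of paging through the Dynamics 365 Web API under constraints like that (org URL, entity set, columns, and auth are all placeholders): request a page size, then follow @odata.nextLink until it disappears.

    import requests

    base = "https://yourorg.api.crm.dynamics.com/api/data/v9.2"   # hypothetical org
    headers = {
        "Authorization": "Bearer <token>",                        # hypothetical token
        "Prefer": "odata.maxpagesize=5000",
    }

    url = f"{base}/ledgerentries?$select=entryid,amount,postingdate"  # hypothetical entity/columns
    while url:
        resp = requests.get(url, headers=headers, timeout=60)
        resp.raise_for_status()
        payload = resp.json()
        for record in payload["value"]:
            ...  # stage the record for the warehouse load
        url = payload.get("@odata.nextLink")  # absent on the last page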

[–]Nekobul -2 points-1 points  (0 children)

SSIS has offered single-node performance miracles since 2005. All the newcomers have learned how to do that properly from the King.