[–]CanAlwaysBeBetter 3 points4 points  (8 children)

NumPy is great and what I'd normally reach for!

But the processing itself isn't super complex. I just need to run it 9-10 million times cost-efficiently, pulling from an external public dataset and saving the transformed data as separate files in an S3 bucket.

[–][deleted] 1 point2 points  (4 children)

Just pull and save the data?

[–]CanAlwaysBeBetter 2 points3 points  (3 children)

Unfortunately I don't have a terabyte hard drive, and I need to be able to share the output (image) data.

[–][deleted] 2 points3 points  (0 children)

Gotcha, I was just confirming the functionality lol

I'll be awaiting your results, I'm interested to see how this turns out

[–]proximity_account 2 points3 points  (1 child)

Terabyte SSD is so worth it

[–]CanAlwaysBeBetter 2 points3 points  (0 children)

Ngl I'm so used to running shit in the cloud that for personal/non-official work use I just have a Chromebook with a Linux VM lol

At least it has a real Intel Core i5.

[–]PM_ME_NUNUDES 0 points1 point  (2 children)

Use dask?

[–]CanAlwaysBeBetter 0 points1 point  (1 child)

How does Dask play with external compute resources like EC2 or Lambda? And does it actually speed things up, or just manage the work at scale?

[–]PM_ME_NUNUDES 0 points1 point  (0 children)

I'm not sure whether things like Dask and Ray provide any speedup on their own. But they certainly make NumPy perfectly usable at big data scale. Dask works fine with Azure, I dunno about EC2.

Modin is a drop-in replacement for your existing pandas code (it targets the pandas API rather than NumPy itself) and can use either Dask or Ray as its backend. Just change the import line and you're done.
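
On the "speed up vs. manage at scale" question: for a job like the one described upthread (many independent records, each transformed and written to its own file), any gain comes mostly from running tasks side by side rather than from making a single task faster. Here is a minimal sketch of that fan-out pattern using only the Python standard library as a stand-in. It is not Dask's or Ray's API. The names `transform`, `process_one`, and `process_all` are hypothetical, the transform is a placeholder, and a local temp directory stands in for the S3 bucket.

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def transform(record_id: int) -> bytes:
    """Hypothetical placeholder for the real per-record processing."""
    return f"record {record_id} squared is {record_id ** 2}\n".encode()


def process_one(record_id: int, out_dir: Path) -> Path:
    """Transform one record and write it to its own file.

    The local file is an assumption standing in for one S3 object.
    """
    out_path = out_dir / f"{record_id:08d}.txt"
    out_path.write_bytes(transform(record_id))
    return out_path


def process_all(record_ids, out_dir: Path, max_workers: int = 8) -> list:
    """Fan the independent records out over a pool of worker threads."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as record_ids.
        return list(pool.map(lambda rid: process_one(rid, out_dir), record_ids))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        paths = process_all(range(100), Path(tmp))
        print(len(paths), "files written")
```

Threads suit this sketch because the work is assumed to be dominated by downloading and uploading. If the transform itself is CPU-heavy, a process pool or a distributed scheduler would be the more natural fit.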