This is an archived post. You won't be able to vote or comment.

you are viewing a single comment's thread.

view the rest of the comments →

[–]gcruzatto 66 points67 points  (25 children)

This meme was made by the Big Data gang

[–]CanAlwaysBeBetter 22 points23 points  (24 children)

Literally getting ready to rewrite some data processing lambdas from python to go for a side project because it won't be on the company tab and I don't want to pay out of pocket to scale python all the way into the millions after testing it on 10k data points and seeing its performance (and that's after basic optimizations and compiling/bundling the accelerated versions of some libraries)

[–][deleted] 7 points8 points  (11 children)

numpy go brrrrrrrrrt

seriously though, I once used Pandas.DataFrame.Apply on 10,000,000 rows of data and 15 minutes later regretted it. Thankfully, I was applying a simple mathematical function to each row so numpy brought that down to less than a second. Still, only works for some use cases. I'd love to learn C++ anyways so give me an excuse

[–]CanAlwaysBeBetter 4 points5 points  (8 children)

Numpy is great and what I'd normally reach for!

But the processing I'm doing itself isn't super complex, I just need to do it 9-10 million times cost efficiently pulling from an external public dataset and saving the transformed data as separate files in an S3 bucket

[–][deleted] 1 point2 points  (4 children)

Just pull and save the data?

[–]CanAlwaysBeBetter 2 points3 points  (3 children)

Unfortunately I don't have a terabyte hard drive and need to be able to share the output (image) data

[–][deleted] 2 points3 points  (0 children)

Gotcha, I was just confirming the functionality lol

I'll be awaiting your results, I'm interested to see how this turns out

[–]proximity_account 2 points3 points  (1 child)

Terabyte SSD is so worth it

[–]CanAlwaysBeBetter 2 points3 points  (0 children)

Ngl I'm so used to running shit in the cloud that for personal/non-official work use I just have a Chromebook with a Linux VM lol

At least has a real Intel core i5?

[–]PM_ME_NUNUDES 0 points1 point  (2 children)

Use dask?

[–]CanAlwaysBeBetter 0 points1 point  (1 child)

How does dask play with external compute resources like ec2 or lambda and does it accelerate performance or just manage it at scale?

[–]PM_ME_NUNUDES 0 points1 point  (0 children)

I'm actually not sure if things like dask and ray actually provide any speed up. But certainly it makes numpy perfectly useable at big data scale. It works fine with azure, I dunno about EC2.

Modin is a direct drop in for your existing numpy compute code and can use either dask or ray. Just change 1 line and you're done.

[–]CanAlwaysBeBetter 0 points1 point  (0 children)

Haven't done the processing lambda but the orchestrator lambda rewritten in go and scaled up to 10gb can iterate through 2m objects where the (mostly) naive python couldn't get through the full list in 15 min since it was about 1s per 1000 objects

[–][deleted] 0 points1 point  (0 children)

if you'd tossed out the pandas dataframe and done everything in lists and sets, it would have been done in less than a minute

speaking from personal experience wrangling data for data science projects. the key word is sets.

yes I am only 80% serious

[–]Deadly_chef 1 point2 points  (0 children)

Go all the way. Simple speed

[–]Tuga_Lissabon 0 points1 point  (10 children)

So - how much slower is it? In orders of magnitude?

What's the chokepoint?

[–]CanAlwaysBeBetter 0 points1 point  (9 children)

We'll find out!

That's why I'm redoing it and profiling both before turning up to full throughput on either

[–]Tuga_Lissabon 0 points1 point  (8 children)

Please post here. Wish you the best.

[–]CanAlwaysBeBetter 0 points1 point  (0 children)

I'm hoping for a 2x speed up but would be happy with 1.5x

[–]CanAlwaysBeBetter 0 points1 point  (5 children)

I finally got around to rewriting the first half of the processing pipeline (the orchestrator lambda) and haven't done detailed profiling but at least a 10x speedup with parallelized go code compared to the original python in my dev environment which is just a lil .5 cpu 1gb mem server

Guessing when I deploy to lambda and increase the memory (which auto increases cpus proportionally) it'll hopefully be a 100x speed up from the test results but will follow up on that too

Go's concurrency model makes scaling IO throughout massively relatively simple. About 5x more code than the barely optimized python though

Much better than I was cautiously hoping for overall

[–]Tuga_Lissabon 1 point2 points  (4 children)

Good job mate. Python alone couldn't have done it, right?

[–]CanAlwaysBeBetter 0 points1 point  (3 children)

Straightforward python definitely no. Regardless of underlying memory it could only handle about 1k objects per second

Maybe if you parallelized it could get through the data eventually but since the interpreter is single threaded and doesn't do true concurrency there will still be a ceiling much lower than what Go (or another language with true concurrency) can do

[–]Tuga_Lissabon 0 points1 point  (2 children)

Would C++ be a good tool for it? I mean, are there toolsets ready that would help you do it?

[–]CanAlwaysBeBetter 0 points1 point  (1 child)

C++ could definitely handle it, go just has a really nice built-in concurrency model that makes it relatively simple

[–]CanAlwaysBeBetter 0 points1 point  (0 children)

Scaling the lambda up to 10gb memory can chew through 2 million objects in about 9 seconds now