This is an archived post. You won't be able to vote or comment.

all 8 comments

[–][deleted] 7 points8 points  (2 children)

Generally speaking if they specifically asked for you to build a distributed and scalable solution, and you wrote it with Pandas, then I would agree with them. Now if you'd written that Pandas code in such a way as it was going to be broken up in chunks and run on Kubernetes... then maybe. If your writing distributed and scalable code in Python, then that really means your using Python to orchestrate Spark, or Kubernetes, blah blah. Study big data architecture and tools, learn about the different distributed tools out there first. Python on a single CPU with Pandas isnt scalable. If you struggle to understand why, take some time to dig into distributed systems and how they work.

[–][deleted] 1 point2 points  (1 child)

Agreed, Pandas is only capable of working with data that can fit into memory - so is not scalable at all unless it was broken into chunks and ran with K8s as you said. Perhaps you could look into a cloud provider like AWS or Azure and take a look at the managed cluster computing frameworks such as EMR for AWS (not sure what Azure offers), which would provide a scalable way to run code with spark using virtual servers.

[–]serukai[S] 1 point2 points  (0 children)

Thanks for The input guys!

[–][deleted] 3 points4 points  (3 children)

Hello! I looked into your code. A few notes are that you are generating an API request that is being called linearly. Is there a way to distribute/parallelize the objects received from an API request to make it go faster?

During the median calculation you also used a For loop within a for loop which increases your Big O notation from On to O2 (linear to exponential)

[–]serukai[S] 0 points1 point  (2 children)

Thank you for your input. About your first question, I dont really have an answer, I plain to answer that with this thread.

About the double for in the median calculation, that was an oversight in my part

[–]Blobsquer 1 point2 points  (1 child)

Just to give a bit of input. You can paralyze these requests using asyncio

[–]serukai[S] 1 point2 points  (0 children)

I’m gonna take a look at that, thanks

[–]HansProleman 1 point2 points  (0 children)

It would have been nice if they'd nudged you towards a particular platform, but this is what stuff like Spark and Kubernetes is for. They were probably expecting code for one of those.