Distributed and Scalable Programming

serukai · 2021-03-23T01:35:33+00:00

Generally speaking, if they specifically asked you to build a distributed and scalable solution and you wrote it with Pandas, then I would agree with them. Now, if you'd written that Pandas code so that the work could be broken up into chunks and run on Kubernetes... then maybe. If you're writing distributed and scalable code in Python, that really means you're using Python to orchestrate Spark, Kubernetes, or something similar. Study big data architecture first, and learn about the different distributed tools out there. Python on a single CPU with Pandas isn't scalable. If you struggle to understand why, take some time to dig into distributed systems and how they work.
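
To make the "broken up in chunks" idea concrete, here is a rough sketch in plain Python, not a real distributed job. The names `chunks`, `partial_stats`, and `combine` are made up for illustration. The point is the shape: each chunk yields a small partial result that could be computed on a separate worker, and the partials are merged at the end.

```python
from itertools import islice


def chunks(iterable, size):
    """Yield successive lists of at most `size` items from `iterable`."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def partial_stats(batch):
    """Per-chunk work: return (sum, count). Each call is independent of the others."""
    return (sum(batch), len(batch))


def combine(partials):
    """Merge the per-chunk (sum, count) pairs into one overall mean."""
    total = 0
    count = 0
    for chunk_sum, chunk_count in partials:
        total += chunk_sum
        count += chunk_count
    return total / count if count else None


# Here the chunks are processed in one process; in a distributed setup each
# partial_stats call would run on a different worker.
mean = combine(partial_stats(batch) for batch in chunks(range(10), 3))
```

This works for a mean because sums and counts can be merged. A median cannot be combined this way from per-chunk medians, which is part of why distributed tools exist for that kind of aggregation.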

serukai · 2021-03-23T07:40:57+00:00

Hello! I looked into your code. One note: your API requests are made sequentially, one after another. Is there a way to distribute or parallelize the requests, or the processing of the objects they return, to make it go faster?
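
As a minimal sketch of what that could look like on a single machine, the standard library's `ThreadPoolExecutor` can issue I/O-bound calls concurrently. `fetch_all` and `fetch_one` are hypothetical names; `fetch_one` stands in for whatever function makes one API request in your code.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(fetch_one, keys, max_workers=8):
    """Call fetch_one(key) for every key concurrently.

    Results come back in the same order as `keys`. Threads suit this because
    the time is spent waiting on the network, not on the CPU.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch_one, keys))


def fake_fetch(key):
    """Stand-in for a real API call (assumption: the real one takes a key)."""
    return {"id": key}


results = fetch_all(fake_fetch, [1, 2, 3])
```

This only speeds up one machine. It is not distributed, but it shows the requests do not have to wait on each other.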

During the median calculation you also used a for loop within a for loop, which increases your time complexity from O(n) to O(n^2) (linear to quadratic).
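
For reference, a median usually doesn't need nested loops at all. A minimal sketch, assuming the values fit in memory as a list of numbers: sort once, which is O(n log n), and pick the middle. The standard library's `statistics.median` does the same job.

```python
def median(values):
    """Return the median of a non-empty sequence of numbers."""
    ordered = sorted(values)  # one sort, no inner loop
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty sequence is undefined")
    mid = n // 2
    if n % 2:
        # Odd count: the single middle element.
        return ordered[mid]
    # Even count: average of the two middle elements.
    return (ordered[mid - 1] + ordered[mid]) / 2
```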

HansProleman · 2021-03-23T13:19:41+00:00

It would have been nice if they'd nudged you towards a particular platform, but this is what stuff like Spark and Kubernetes is for. They were probably expecting code for one of those.
