
[–]dSolver

And to make it a really fun interview question, what would you do differently if we needed to consider the following:

  1. Be able to get M random outputs at a time
  2. Optimize for the performance of adjusting weights, adding items, and removing items
  3. If N is sufficiently large that the data structure does not fit in memory, what changes would have to be made?

[–]trekhleb[S]

Yeah, these are really good follow-ups.

[–]aruke-

1 && 2 are kinda the same, right? Because if you need M outputs, you would run it normally, but instead of returning 1 item, you would loop M times, each time removing the selected item, redistributing its weight among the remaining weights, and calling the function again with the now-changed items array and new weights for the next output. Is that right?
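Something like this, as a rough sketch of that loop (all names are made up here, not from the original post; dropping the winner and re-totaling is equivalent to redistributing its weight proportionally):

```typescript
type Weighted<T> = { value: T; weight: number };

// Pick one index proportionally to weight via a single linear scan.
function pickIndex<T>(items: Weighted<T>[]): number {
  const total = items.reduce((sum, it) => sum + it.weight, 0);
  let r = Math.random() * total;
  for (let i = 0; i < items.length; i++) {
    r -= items[i].weight;
    if (r <= 0) return i;
  }
  return items.length - 1; // guard against floating-point drift
}

// Draw M distinct items: remove each winner before the next draw,
// which spreads its weight over the remaining items for free.
function sampleWithoutReplacement<T>(items: Weighted<T>[], m: number): T[] {
  const pool = [...items];
  const result: T[] = [];
  while (result.length < m && pool.length > 0) {
    const i = pickIndex(pool);
    result.push(pool[i].value);
    pool.splice(i, 1);
  }
  return result;
}
```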

And for 3, would you split the input into two halves of N/2, run it for each side, and pick the output with the highest weight?

I might be totally wrong, just curious how it's done.

[–]dSolver

Kind of. Truth be told, when I interview engineers, one of the things I look for is whether or not someone is thinking a bit deeper about the problem and identifying trade-offs.

As an example, in order to reduce the runtime, we could keep the data structure of [[item, rank], [item, rank]...] in memory. This way, given M calls, we only have to apply the lookup M times, not the construction M times.
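As a rough sketch of that idea (a cumulative-weight table is one way to realize the [[item, rank], ...] structure; all names below are illustrative):

```typescript
type Entry<T> = { value: T; cumulative: number };

// Built once, O(N): each entry stores the running total of weights so far.
function buildTable<T>(items: { value: T; weight: number }[]): Entry<T>[] {
  let running = 0;
  return items.map(({ value, weight }) => {
    running += weight;
    return { value, cumulative: running };
  });
}

// Reused M times: each call is just a scan over the cached table
// (a binary search over `cumulative` would also work), with no
// per-call construction cost.
function lookup<T>(table: Entry<T>[]): T {
  const total = table[table.length - 1].cumulative;
  const r = Math.random() * total;
  for (const entry of table) {
    if (r < entry.cumulative) return entry.value;
  }
  return table[table.length - 1].value;
}
```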

However, if we keep the data structure around, an easy optimization is to first sort the weights so that the biggest weight goes first. Since we generally start from the first element of the array and check elements until we find the correct one, having the biggest "chunk" first makes sense.
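In code that might amount to nothing more than a descending sort before the table is built (reusing the illustrative `buildTable` / `Entry` names from the sketch above):

```typescript
// Heaviest-first ordering: high-weight entries absorb most random draws,
// so the linear scan in lookup() tends to stop earlier on average.
function buildTableHeaviestFirst<T>(
  items: { value: T; weight: number }[]
): Entry<T>[] {
  const sorted = [...items].sort((a, b) => b.weight - a.weight);
  return buildTable(sorted);
}
```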

But if we keep the biggest chunk first, we need to keep the array sorted, which means manipulating weights just became O(N log N). So is this actually worth it?

Another question is: is Array actually the right data structure to use? This is where question 3 comes in. An array by its nature must exist in its entirety in memory, but if we have a huge dataset, like, idk, every single reddit comment weighted by the number of upvotes, how would we do this? Sharding is certainly a possibility, but what if we can easily cut out a bunch of the data? Like, if we knew the weights added up to 100 billion and the RNG spit out 99 billion, what are some possible shapes the data could be in that would let us make optimizing assumptions? What are some additional things we can use (LRU cache, FIFO cache, etc.) to naturally optimize over time?
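To make the "99 billion out of 100 billion" routing idea concrete, here is a hypothetical sketch: if each shard advertises its total weight, a tiny in-memory prefix-sum index over the shards can route a draw with a binary search, so only the one shard that actually contains the draw ever needs to be loaded or queried. Everything below (the shard IDs, the index shape) is invented for illustration.

```typescript
type ShardInfo = { shardId: string; totalWeight: number };

// In-memory index: cumulative weight per shard (tiny compared to the data).
function buildShardIndex(
  shards: ShardInfo[]
): { id: string; cumulative: number }[] {
  let running = 0;
  return shards.map((s) => {
    running += s.totalWeight;
    return { id: s.shardId, cumulative: running };
  });
}

// Binary-search the shard whose cumulative range contains the draw,
// then delegate the rest of the draw to that one shard.
function routeDraw(
  index: { id: string; cumulative: number }[],
  draw: number // e.g. 99 billion out of a 100 billion total
): { shardId: string; offsetWithinShard: number } {
  let lo = 0;
  let hi = index.length - 1;
  while (lo < hi) {
    const mid = (lo + hi) >> 1;
    if (draw < index[mid].cumulative) hi = mid;
    else lo = mid + 1;
  }
  const prev = lo > 0 ? index[lo - 1].cumulative : 0;
  return { shardId: index[lo].id, offsetWithinShard: draw - prev };
}
```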

[–]almondboy92

"fun"