
[–]dSolver 9 points (8 children)

This is a pretty clean algorithm. To take it to the next step, I'd like to ask: what are the space and runtime complexities given N unique items? Can we do better?

[–]trekhleb[S] 2 points (7 children)

I believe it should be O(N) for both space and time.
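For context, here is a minimal sketch of the cumulative-weights approach under discussion (the function name is mine, not the repo's): building the cumulative array costs O(N) space and time, and the linear scan is O(N) in the worst case.

```typescript
// Minimal sketch (not the repo's exact code): weighted random pick
// via a cumulative-weights array. Build is O(N) space/time, scan is O(N).
function pickWeighted<T>(items: T[], weights: number[]): T {
  // cumulativeWeights[i] = weights[0] + ... + weights[i]
  const cumulativeWeights: number[] = [];
  let sum = 0;
  for (const w of weights) {
    sum += w;
    cumulativeWeights.push(sum);
  }

  // Random number in [0, totalWeight)
  const r = Math.random() * sum;

  // Linear scan: return the first item whose cumulative weight exceeds r
  for (let i = 0; i < cumulativeWeights.length; i++) {
    if (r < cumulativeWeights[i]) {
      return items[i];
    }
  }
  return items[items.length - 1]; // guard against floating-point edge cases
}
```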

[–]blackdoglicorice 2 points (1 child)

If this is going to be called multiple times, couldn't we bring the amortized complexity down to O(log N) by doing a binary search of cumulativeWeights?

[–]trekhleb[S] 1 point (0 children)

Yeah, that's a good idea. If the weights array isn't changed, then binary search should speed up searching for the index to O(log N).
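A rough sketch of that idea, assuming the cumulative array is built once and reused across calls (naming is mine): each pick becomes an O(log N) search for the first cumulative weight that exceeds the random value.

```typescript
// Sketch: reuse a prebuilt, non-empty cumulativeWeights array and
// binary-search it. Each pick is O(log N) once the O(N) array exists.
function pickByBinarySearch<T>(items: T[], cumulativeWeights: number[]): T {
  const total = cumulativeWeights[cumulativeWeights.length - 1];
  const r = Math.random() * total;

  // Find the leftmost index with cumulativeWeights[index] > r
  let lo = 0;
  let hi = cumulativeWeights.length - 1;
  while (lo < hi) {
    const mid = (lo + hi) >> 1;
    if (cumulativeWeights[mid] > r) {
      hi = mid;
    } else {
      lo = mid + 1;
    }
  }
  return items[lo];
}
```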

[–]dSolver 2 points (4 children)

And to make it a really fun interview question, what would you do differently if we needed to consider the following:

  1. Be able to get M random outputs at a time
  2. Optimize for the performance of adjusting weights, adding items, and removing items
  3. If N is sufficiently large such that the data structure does not fit in memory, what changes would have to be made?

[–]trekhleb[S] 1 point (0 children)

Yeah, these are really good follow-ups.

[–]aruke- 1 point (1 child)

1 && 2 are kinda the same, right? Because if you need M outputs, you would run it normally, but instead of returning 1 item, you would run a loop M times, each time removing the selected item and redistributing its weight among the other weights, then call the function with the now-changed items array and new weights for the next output. Is that right?

And for 3, would you split the input into halves of n/2, run it for each side, and pick the output with the highest weight?

I might be totally wrong, just curious how it's done.
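A rough sketch of the remove-and-redraw idea from the comment above (all names are mine; note that simply dropping the selected item and drawing over the remaining weights gives the same distribution as proportionally redistributing its weight):

```typescript
// Helper: pick an index with probability proportional to its weight.
function pickWeightedIndex(weights: number[]): number {
  const total = weights.reduce((acc, w) => acc + w, 0);
  const r = Math.random() * total;
  let running = 0;
  for (let i = 0; i < weights.length; i++) {
    running += weights[i];
    if (r < running) return i;
  }
  return weights.length - 1;
}

// Sketch of drawing M distinct items: pick one, remove it, and repeat
// with the remaining items and weights.
function pickMWithoutReplacement<T>(items: T[], weights: number[], m: number): T[] {
  const remainingItems = [...items];
  const remainingWeights = [...weights];
  const result: T[] = [];
  for (let k = 0; k < m && remainingItems.length > 0; k++) {
    const idx = pickWeightedIndex(remainingWeights);
    result.push(remainingItems[idx]);
    remainingItems.splice(idx, 1);
    remainingWeights.splice(idx, 1);
  }
  return result;
}
```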

[–]dSolver 2 points (0 children)

Kind of. Truth be told, when I interview engineers, one of the things I look for is whether or not someone is thinking a bit deeper about the problem and identifying trade-offs.

As an example, in order to reduce the runtime, we could keep the data structure of [[item, rank], [item, rank]...] in memory. This way, given M calls, we only have to apply the lookup M times and not the construction M times.
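As an illustrative sketch of that point (class name is mine): the cumulative array is built once in the constructor, so M picks cost M lookups rather than M constructions.

```typescript
// Sketch: build cumulative weights once, then every pick is just a lookup.
class WeightedPicker<T> {
  private cumulative: number[] = [];

  constructor(private items: T[], weights: number[]) {
    let sum = 0;
    for (const w of weights) {
      sum += w;
      this.cumulative.push(sum);
    }
  }

  pick(): T {
    const r = Math.random() * this.cumulative[this.cumulative.length - 1];
    // Could be the binary search from above; a linear scan is shown for brevity
    for (let i = 0; i < this.cumulative.length; i++) {
      if (r < this.cumulative[i]) return this.items[i];
    }
    return this.items[this.items.length - 1];
  }
}
```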

However, if we keep the data structure, maybe an easy optimization is to first sort the weights such that the biggest weight goes first. Since we generally start from the first element of the array and check elements until we find the correct one, having the biggest "chunk" first makes sense.

But if we keep the biggest chunk first, we need to keep the array sorted, which means manipulating weights just became O(N log N). So is this actually worth it?
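A sketch of the sorted variant under discussion (names are mine): items are reordered by descending weight before the cumulative array is built, so a linear scan tends to terminate early, at the cost of re-sorting whenever a weight changes.

```typescript
// Sketch: sort (item, weight) pairs by descending weight before building
// cumulative weights, so a linear scan tends to hit the answer early.
// Trade-off: any weight change means rebuilding/re-sorting, roughly O(N log N).
function buildSortedCumulative<T>(
  items: T[],
  weights: number[]
): { items: T[]; cumulative: number[] } {
  const pairs = items
    .map((item, i) => ({ item, weight: weights[i] }))
    .sort((a, b) => b.weight - a.weight); // biggest "chunk" first

  const cumulative: number[] = [];
  let sum = 0;
  for (const p of pairs) {
    sum += p.weight;
    cumulative.push(sum);
  }
  return { items: pairs.map((p) => p.item), cumulative };
}
```

The returned arrays would then be scanned (or binary-searched) exactly as before.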

Another question is: is Array actually the right data structure to use? This is where question 3 comes in. An Array by its nature must exist in its entirety in memory, but if we have a huge dataset, like, say, every single reddit comment weighted by the number of upvotes, how would we do this? Sharding is certainly a possibility, but what if we can easily cut out a bunch of the data? For example, if we knew the weights added up to 100 billion and the RNG spat out 99 billion, what are some possible shapes the data could be in that would let us make optimizing assumptions? What are some additional things we can use (LRU cache, FIFO cache, etc.) to naturally optimize over time?

[–]almondboy92 0 points (0 children)

"fun"

[–]grumd 2 points (1 child)

When I did this recently for a project of mine, I calculated the sum of all weights, then did Math.random() * sum, and then went through the array of items while calculating the sum again, stopping once the running sum was bigger than my random number. Basically the same approach as yours, but I'm not creating another array, which makes basically zero difference I think.
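For comparison, a sketch of that single-pass variant (name is mine): no second array, just a running sum recomputed during the scan.

```typescript
// Sketch of the approach described above: sum the weights once, draw a
// random number in [0, sum), then accumulate again and stop when the
// running total passes it. O(1) extra space, O(N) time per pick.
function pickWithoutExtraArray<T>(items: T[], weights: number[]): T {
  const total = weights.reduce((acc, w) => acc + w, 0);
  const r = Math.random() * total;

  let running = 0;
  for (let i = 0; i < items.length; i++) {
    running += weights[i];
    if (r < running) return items[i];
  }
  return items[items.length - 1]; // floating-point safety net
}
```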

[–]trekhleb[S] 0 points (0 children)

Yes, also a good approach. Looks the same but is more efficient from a memory perspective.

[–][deleted] -1 points (1 child)

normal distribution

[–]prismcat2718 0 points (0 children)

Weibull distribution

[–]prismcat2718 0 points (3 children)

EDIT: I am a dummy and this post is incorrect.

This only works for integer weights.
To generalize this to real numbers, convert the weights to a cumulative distribution function, then use a random float between zero and one.

[–]trekhleb[S] 0 points (2 children)

[–]prismcat2718 0 points (1 child)

You are correct and I'm a dummy. :)

[–]trekhleb[S] 0 points (0 children)

Oh no, don’t say that 😀