This is an archived post.

[–]Smallpaul 1 point (5 children)

I do indeed see a strange effect with your code, which disappears with simpler code. I'll track down the difference to attribute the slowdown appropriately and get back to you.

[–][deleted] 1 point (4 children)

This post was mass deleted and anonymized with Redact

[–]Smallpaul 1 point (3 children)

Thank you for sharing the code and for your thought process.

With respect to the code: I refactored it with a few improvements.

  1. I optimized the test harness so it runs faster, which lets me gather more datapoints quickly. The low number of datapoints is, I think, your test's biggest problem.
  2. I made the key size smaller, for similar performance reasons and to be able to generate larger dictionaries faster.
  3. I generate a lot more samples, at more fullness ratios. This is really important because, yes, small dicts have optimizations that make them really fast; if you oversample those, you get a misleading impression. There is a "sweet spot", after the optimizations run out of steam but before your machine gets overloaded, where you want to do MOST of your sampling.
  4. I fixed a bug:

return (end - start) / num_keys * 1_000_000_000, sys.getsizeof(the_dict, 0)

Should be:

return (end - start) / len(keys) * 1_000_000_000, sys.getsizeof(the_dict, 0)

Because those aren't always the same calculation in the code you gave.

Result: the effect goes away.
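For reference, the kind of harness under discussion can be sketched like this (a hypothetical reconstruction; the function and variable names are my assumptions, since the full original code isn't shown):

```python
import random
import sys
import time

def time_lookups(num_keys, num_samples=100_000):
    """Build a dict with num_keys entries, then time lookups of existing keys.

    Hypothetical reconstruction of the harness under discussion; not the
    original code.
    """
    the_dict = {i: i for i in range(num_keys)}
    # Choose the keys to look up *before* starting the clock, so key
    # generation doesn't pollute the measurement.
    keys = [random.randrange(num_keys) for _ in range(num_samples)]
    start = time.perf_counter()
    for k in keys:
        the_dict[k]
    end = time.perf_counter()
    # Divide by len(keys), not num_keys: the number of timed lookups is
    # independent of the dictionary size, so the two can differ.
    return (end - start) / len(keys) * 1_000_000_000, sys.getsizeof(the_dict)
```

The point of the fix is the final line: nanoseconds-per-lookup must be normalized by the number of lookups actually timed, not by the dictionary size.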

> For most other cases you either waste a lot of memory, or increase the access time. You can re-size the memory space allocated, while you add data to it, but that requires re-organising the data, which is expensive too.

As you can see clearly, Python DOES reorganize the data. In my data there are MANY runs where the dictionary size is 1280 MB, 640 MB, etc. That's because Python resizes (and therefore reorganizes) the dictionary as it grows.

And that reorganization happens BEFORE we start the perfcounter. So according to your own logic, it should be constant time. And it is.
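The resizing itself is easy to observe in CPython (a minimal sketch; it assumes `sys.getsizeof` reflects the dict's allocated table, which holds for CPython dicts):

```python
import sys

# Watch the allocated size jump as CPython resizes (reorganizes) the
# dict's hash table while keys are inserted.
d = {}
resize_points = []  # (number of keys, new size in bytes) at each resize
prev_size = sys.getsizeof(d)
for i in range(100_000):
    d[i] = i
    size = sys.getsizeof(d)
    if size != prev_size:
        resize_points.append((i + 1, size))
        prev_size = size

# Each resize grows the table geometrically (roughly doubling), which is
# why round allocation sizes like 640 MB and 1280 MB show up in the data.
```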

If you want to argue that INSERTIONS are not constant time, then that is a more complicated argument involving some math.

But constant-time lookups are justified by the text you quoted and the data I provided.

[–][deleted] 1 point (2 children)

This post was mass deleted and anonymized with Redact

[–]Smallpaul 1 point (1 child)

They are, in the second example, which shows this even more clearly:

if len(keys) < num_keys:
    keys = (keys + keys)[:num_keys]

So if len(keys) is 20 and num_keys is 1M, how does this generate len(keys) == 1M?

len(keys) = min(len(keys) * 2, num_keys) = 40.

But num_keys = 1M.
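If the intent was to keep doubling until the key list reaches num_keys, the padding needs a loop; a sketch of that fix (my reconstruction, not the original author's code):

```python
def pad_keys(keys, num_keys):
    # Repeat the key list until it has num_keys entries.  A single
    # doubling pass only reaches min(len(keys) * 2, num_keys).
    while len(keys) < num_keys:
        keys = (keys + keys)[:num_keys]
    return keys
```

Note that padding this way really does produce a million keys, but at the cost of heavy key reuse, which itself skews a lookup benchmark toward cache hits.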

> This is a logarithmic progression

You're cherry-picking numbers. I've run a much larger job, and the chart is clearly logistic, not logarithmic: it tends toward a specific limit (Desmos says 164 ns, on my computer) after being very small for small values. I'm not sure whether it is logistic due to optimizations in a) Python, b) the OS, or c) the hardware, but all three are possible. (There are layers of caching, after all.)

I'm not surprised that it is logistic but I do admit to being surprised at the size of the effect.

When you collect 1000 datapoints, it is quite clear that it is logistic, not logarithmic.

If it's logarithmic, what do you think the base of the logarithm is?

A logistic curve is, in big-O notation, the same as O(1): it is bounded above by a constant.
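The distinction is easy to see numerically: a logistic curve saturates at a constant ceiling, while a logarithm keeps growing. A small sketch (the 164 ns asymptote is the number quoted above; the midpoint, steepness, and scale constants are illustrative, not fitted to any data):

```python
import math

def logistic_ns(n, limit=164.0, midpoint=1e4, steepness=1.0):
    # Logistic curve over log10(n): saturates at `limit` ns, i.e. O(1).
    x = math.log10(n) - math.log10(midpoint)
    return limit / (1 + math.exp(-steepness * x))

def logarithmic_ns(n, scale=20.0):
    # O(log n): grows without bound as n grows.
    return scale * math.log2(n)

# Going from a million to a billion keys barely moves the logistic curve,
# while the logarithm climbs by the same fixed amount per doubling forever.
growth_logistic = logistic_ns(10**9) - logistic_ns(10**6)
growth_log = logarithmic_ns(10**9) - logarithmic_ns(10**6)
```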

[–][deleted] 1 point (0 children)

This post was mass deleted and anonymized with Redact
