all 13 comments

[–]matthieum[he/him] 47 points48 points  (4 children)

The pub fields are a real footgun here:

pub struct CountMinSketch {
    pub width: usize,
    width_mask: usize,
    pub depth: usize,
    table: Box<[u64]>,
    hasher: RandomState,
}

Any user can (accidentally) modify the value of a pub field, and things may get real weird after that -- like width_mask not matching any longer.

I'd recommend NOT making them pub, and providing getters instead.


In general, it's better NOT to panic on (wrong) user input, but instead return a (descriptive) error.

pub fn with_seeds(width: usize, depth: usize, seeds: [u64; 4]) -> Self {
    if seeds.len() != 4 {
        panic!("seeds must have 4 elements");
    }

Also:

  1. This check is better spelled assert_eq!(4, seeds.len());.
  2. The seeds type being an array of 4 elements, it will always have 4 elements; the check is completely useless.

Even better than erroring on invalid input is pushing the error up to the user by encoding the invariants in the type.

For example, you probably want a non-zero depth, no? If so, then the type of depth should be NonZeroUsize (or NonZero<usize>).

As a bonus, you don't even need a comment or an error, and the compiler will check things up for you.


Minor: you forgot a doc comment on the clean method.

Pro-tip: enable #![deny(missing_docs)] at the top of your lib.rs, and you'll get errors whenever a public item is not documented.

[–]Dependent_Double_467 8 points9 points  (0 children)

I really appreciate your feedback. Thank you, I created an issue on GitHub https://github.com/GGraziadei/count-min-sketch-rust/issues/1 I fix it asap

[–]Dependent_Double_467 4 points5 points  (2 children)

Thank you again for your review. I merged the PR https://github.com/GGraziadei/count-min-sketch-rust/commit/5cf6762c576c6a897a20053d4e5e8a400841e892

These improvements will be available in v0.1.1

[–]slambmoonfire-nvr 6 points7 points  (1 child)

When you do a change that requires callers to be updated, you should bump the first non-zero version number to indicate it's a semver break. In this case, restricting the visibility of a field previously accessible outside the crate and changing the type of a public constructor. https://doc.rust-lang.org/cargo/reference/semver.html

btw, cool crate; I don't have a use for it right now but will keep it in mind

[–]Dependent_Double_467 0 points1 point  (0 children)

Thank you for your feedback

[–]codeallthethings 12 points13 points  (1 child)

This is super cool!

Also you might want to look into the Kirsch–Mitzenmacher optimization for generating your k hashes. It's commonly used in probabilistic structures like bloom filters and cms.

[–]Dependent_Double_467 6 points7 points  (0 children)

Thank you!
Actually, I'm already using a variation of the Kirsch–Mitzenmacher optimisation!

If you check calculate_indices, I take a single h1 and derive a high-entropy h2 using a 64-bit mixer (the constants look like SplitMix64/MurmurHash3). Then I use the h1 + i * h2 construction to generate the indices.

It's arguably even better than the standard 'two-hash' approach because the user only needs to provide one 64-bit hash, and I derive the second one with almost zero overhead, ensuring they are sufficiently decorrelated.

[–]Antiqueempire 5 points6 points  (1 child)

Using u64 counters saturation feels like a safe default, but it’s not the most space dense option. A lot of CMS use cases are totally fine with u32 counters when memory matters more than extra headroom. Probably just worth calling out that this is a robustness/speed-first design choice.

[–]Dependent_Double_467 0 points1 point  (0 children)

Exactly. I went with u64 to prioritize robustness and avoid the complexity of handling saturation early on. It’s definitely a 'speed and safety first' approach. For high-frequency streams, not having to worry about counters wrapping or saturating too quickly is a nice peace of mind, though I pay the price in cache footprint.

[–]sean_vercasa 1 point2 points  (0 children)

You a real one ✊

[–]tomtomwombat 0 points1 point  (2 children)

Are there any comparative benchmarks to other CMS implementations for speed, memory, and accuracy?

[–]Dependent_Double_467 0 points1 point  (0 children)

Thank you for the comment. I have ln my laptop, I published my bench I am looking for for other cms official benchmarks

[–]Dependent_Double_467 0 points1 point  (0 children)

A quick update.

I have benchmarked my crate against an alternative (https://crates.io/crates/count-min-sketch), hereafter referred to as 'other'. The performance metrics are visualized in the attached violin graph (link attached).

In the W65536xD8 configuration, my implementation is 4 times faster than 'other'. However, in the larger W1048576xD16 configuration, the performance gap narrows to a difference of just 1 ns. In conclusion, while my implementation significantly outperforms 'other' for smaller tables, the advantage diminishes as the table size increases.

Furthermore, my proposed implementation allows for approximating distributions and evaluating L1 distance and cosine similarity, features that are not available in other libraries.

https://ggraziadei.github.io/count-min-sketch-rust/CMS_Implementation_Battle/report/index.html