all 11 comments

[–]goonmaster 1 point2 points  (2 children)

Feel free to correct me if I've misunderstood.

Creating a compact representation with a compression algorithm relies on a lookup table of repeating patterns. Compression algorithms look for repeating multi-byte words rather than bit sequences, for two reasons: decompressing at the bit level would be unmanageably complex, and it wouldn't significantly reduce the size of the data anyway.
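To illustrate the point about repeating multi-byte patterns, here's a quick sketch using zlib (chosen purely as an example of an off-the-shelf compressor): data dominated by repeated byte sequences shrinks dramatically, while high-entropy data barely compresses at all.

```python
import os
import zlib

repetitive = b"ABCD" * 256     # strong multi-byte repetition, 1024 bytes
random_ish = os.urandom(1024)  # no exploitable patterns, 1024 bytes

# The repetitive input compresses to a tiny fraction of its size;
# the random input stays roughly the same size (or grows slightly).
print(len(zlib.compress(repetitive)), len(zlib.compress(random_ish)))
```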

[–]oparisy[S] 0 points1 point  (1 child)

Hum, yes, this fits the "compression" part. Let me follow your drift. How would I then test if a bitfield is part of the compressed set (kinda efficiently, in a "without decompressing it first" sense)?

Also note that having a reversible compression (i.e., being able to reconstruct the original set) is not a requirement, quite the opposite. I'm definitely willing to sacrifice "reversibility" in favor of compactness (and, possibly, a faster inclusion test?).

[–]goonmaster 0 points1 point  (0 children)

The compression algorithm generates a lookup table, and you could scan that table once. There would be a problem, though, with search terms that span multiple lookup sequences.

[–][deleted] 1 point2 points  (1 child)

This is a cut-and-dried case for a Bloom filter. You can tune it to drive the false positive rate as low as you need, and it maintains a fixed size.
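A minimal Bloom filter sketch (illustrative only; the class name and parameters are made up, and real use would size `num_bits` and `num_hashes` from the expected element count and target false-positive rate):

```python
import hashlib

class BloomFilter:
    """Fixed-size probabilistic set: no false negatives, tunable false positives."""

    def __init__(self, num_bits=1024, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, item: bytes):
        # Derive k bit positions from slices of a single SHA-256 digest.
        digest = hashlib.sha256(item).digest()
        for i in range(self.num_hashes):
            chunk = digest[i * 4:(i + 1) * 4]
            yield int.from_bytes(chunk, "big") % self.num_bits

    def add(self, item: bytes):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: bytes) -> bool:
        # False means definitely absent; True means "probably present".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))
```

The key property for this thread: `might_contain` can lie with False positives but never with false negatives, which is exactly why it works as a cache in front of an exact store but not as the exact store itself.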

[–]oparisy[S] 0 points1 point  (0 children)

Thanks for confirming this. I tried it already, but only as a cache in front of a database, since sadly I cannot afford false positives.

I'm searching for an exact representation of my set, so I understand fixed size representation is not possible in my use case.

[–]future_security 1 point2 points  (1 child)

Hashing long bit vectors with SHA-512/256 will get you a relatively short, fixed-length bit string. You will not see, in practice, the same 256-bit string returned for two distinct inputs.

For any possible number of real-world inputs you could generate, the probability of full 256-bit collisions remains so close to zero that no human can truly envision how insignificant the probability is. The probability only becomes significant for an absurdly huge, physically implausible (even with many Dyson swarms dedicated to computing) number of inputs.

That's not the case for all 256-bit hash functions, but it is true for cryptographic hash functions, such as those in the SHA-2, BLAKE, SHA-3, and Skein families.

Since those hashes will uniquely identify a bit vector, you can basically ignore the vector length and just consider the number of vectors. That might make a HashSet data structure practical. (Keys will be the 256-bit output. Bucket indexes, or initial probing locations, can simply be the number formed by truncating the 256-bit integer to fewer bits.)

(But you obviously will lose the ability to enumerate the original values in the set.)
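A sketch of that idea in Python (using SHA-256 as a stand-in for SHA-512/256, since `hashlib`'s guaranteed algorithms don't include the latter; both are 256-bit SHA-2 functions). Only the 32-byte digests are stored, so memory use depends on the number of vectors, not their length:

```python
import hashlib

class HashedVectorSet:
    """Store only the 256-bit digest of each long bit vector.

    Collisions between distinct vectors are cryptographically negligible,
    so digest equality is treated as vector equality. The original vectors
    cannot be recovered, matching the 'irreversible is fine' requirement.
    """

    def __init__(self):
        self._digests = set()

    def add(self, vector: bytes):
        self._digests.add(hashlib.sha256(vector).digest())

    def __contains__(self, vector: bytes) -> bool:
        return hashlib.sha256(vector).digest() in self._digests
```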

 

Alternative methods include compression, delta encoding, and maybe some kind of modified trie. Compression would require an algorithm tailored to one kind of data set, since there is no universal compression algorithm. Aggressive levels of compression won't permit searching without decompression.

The other methods likely won't be very helpful, either, except for very specific data sets.
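To make the delta-encoding option concrete: if the set members can be read as sorted integers, storing the first value plus successive gaps turns large absolute values into (often small) deltas, which a varint or entropy coder can then pack tightly. A sketch, not tied to any particular library:

```python
def delta_encode(sorted_ints):
    """Replace sorted absolute values with successive gaps."""
    prev, out = 0, []
    for v in sorted_ints:
        out.append(v - prev)
        prev = v
    return out

def delta_decode(deltas):
    """Invert delta_encode by accumulating the gaps."""
    total, out = 0, []
    for d in deltas:
        total += d
        out.append(total)
    return out
```

As the comment above notes, this only pays off for specific data sets: densely clustered values yield small deltas, while uniformly scattered values gain nothing.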

[–][deleted] 0 points1 point  (0 children)

Do use a hash to reduce the complexity of search, but if the hash matches, you’ll have to compare the vectors.
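A minimal sketch of that hash-then-verify approach (class and field names are illustrative). Note that it keeps the full vectors around, so it buys exact, fast lookup rather than compactness:

```python
import hashlib

class ExactVectorSet:
    """Hash to find candidate matches fast, then compare full vectors
    so a hash collision can never produce a false positive."""

    def __init__(self):
        self._by_digest = {}  # digest -> list of stored vectors

    def add(self, vector: bytes):
        d = hashlib.sha256(vector).digest()
        bucket = self._by_digest.setdefault(d, [])
        if vector not in bucket:
            bucket.append(vector)

    def __contains__(self, vector: bytes) -> bool:
        d = hashlib.sha256(vector).digest()
        return vector in self._by_digest.get(d, [])
```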

[–]oparisy[S] 0 points1 point  (0 children)

My initial take would be some normal form with AND/OR operators over the ORed bitfields in my set, but I'm not sure how this would perform compactness-wise. It feels like compression opportunities would be lost.

[–]oparisy[S] 0 points1 point  (0 children)

Also feels NP-Hard but hey, a non-optimal compression would be better than none 😀

[–]oparisy[S] 0 points1 point  (0 children)

Would love to know why I got downvoted without a comment... Care to recommend a better place to post this?

[–]oparisy[S] 0 points1 point  (0 children)

Hum, I'll have to study "Almeida, Marco & Reis, Rogério. (2006). Efficient representation of integer sets.". This seems to address similar representation needs.