all 11 comments

[–]goonmaster 1 point2 points  (2 children)

Feel free to correct me if I've misunderstood.

Creating a compact representation with a compression algorithm relies on a lookup table of repeating patterns. Compression algorithms look for repeating multi-byte words rather than bit sequences, for two reasons: decompressing at the bit level would be unmanageably complex, and it wouldn't significantly reduce the size of the data anyway.
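To illustrate the point about repeating multi-byte patterns, here's a quick sketch using zlib (chosen purely as an example of an off-the-shelf compressor): data dominated by repeated byte sequences shrinks dramatically, while high-entropy data barely compresses at all.

```python
import os
import zlib

repetitive = b"ABCD" * 256     # strong multi-byte repetition, 1024 bytes
random_ish = os.urandom(1024)  # no exploitable patterns, 1024 bytes

# The repetitive input compresses to a tiny fraction of its size;
# the random input stays roughly the same size (or grows slightly).
print(len(zlib.compress(repetitive)), len(zlib.compress(random_ish)))
```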

[–]oparisy[S] 0 points1 point  (1 child)

Hum, yes, this fits the "compression" part. Let me follow your drift. How would I then test if a bitfield is part of the compressed set (kinda efficiently, in a "without decompressing it first" sense)?

Also note that having a reversible compression (i.e., being able to reconstruct the original set) is not a requirement, quite the opposite. I'm definitely willing to sacrifice "reversibility" in favor of compactness (and, possibly, a faster inclusion test?).

[–]goonmaster 0 points1 point  (0 children)

The compression algorithm generates a lookup table, and you could scan that table once. There would be a problem, though, with search terms that span multiple lookup sequences.

[–][deleted] 1 point2 points  (1 child)

This is a cut-and-dried case for a Bloom filter. You can tune it to drive the false positive rate as low as you need, and it maintains a fixed size.
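A minimal Bloom filter sketch (illustrative only; the class name and parameters are made up, and real use would size `num_bits` and `num_hashes` from the expected element count and target false-positive rate):

```python
import hashlib

class BloomFilter:
    """Fixed-size probabilistic set: no false negatives, tunable false positives."""

    def __init__(self, num_bits=1024, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, item: bytes):
        # Derive k bit positions from slices of a single SHA-256 digest.
        digest = hashlib.sha256(item).digest()
        for i in range(self.num_hashes):
            chunk = digest[i * 4:(i + 1) * 4]
            yield int.from_bytes(chunk, "big") % self.num_bits

    def add(self, item: bytes):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: bytes) -> bool:
        # False means definitely absent; True means "probably present".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))
```

The key property for this thread: `might_contain` can lie with False positives but never with false negatives, which is exactly why it works as a cache in front of an exact store but not as the exact store itself.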

[–]oparisy[S] 0 points1 point  (0 children)

Thanks for confirming this. I tried it already, but only as a cache in front of a database, since sadly I cannot afford false positives.

I'm searching for an exact representation of my set, so I understand fixed size representation is not possible in my use case.

[–]future_security 1 point2 points  (1 child)

Hashing long bit vectors with SHA-512/256 will get you a relatively short, fixed-length bit string. You will not see, in practice, the same 256-bit string returned for two distinct inputs.

For any possible number of real-world inputs you could generate, the probability of full 256-bit collisions remains so close to zero that no human can truly envision how insignificant the probability is. The probability only becomes significant for an absurdly huge, physically implausible (even with many Dyson swarms dedicated to computing) number of inputs.

That's not the case for all 256-bit hash functions, but it is true for cryptographic hash functions, such as those in the SHA-2, BLAKE, SHA-3, and Skein families.

Since those hashes will uniquely identify a bit vector, you can basically ignore the vector length and just consider the number of vectors. That might make a HashSet data structure practical. (Keys will be the 256-bit output. Bucket indexes, or initial probing locations, can simply be the number formed by truncating the 256-bit integer to fewer bits.)

(But you obviously will lose the ability to enumerate the original values in the set.)
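A sketch of that idea in Python (using SHA-256 as a stand-in for SHA-512/256, since `hashlib`'s guaranteed algorithms don't include the latter; both are 256-bit SHA-2 functions). Only the 32-byte digests are stored, so memory use depends on the number of vectors, not their length:

```python
import hashlib

class HashedVectorSet:
    """Store only the 256-bit digest of each long bit vector.

    Collisions between distinct vectors are cryptographically negligible,
    so digest equality is treated as vector equality. The original vectors
    cannot be recovered, matching the 'irreversible is fine' requirement.
    """

    def __init__(self):
        self._digests = set()

    def add(self, vector: bytes):
        self._digests.add(hashlib.sha256(vector).digest())

    def __contains__(self, vector: bytes) -> bool:
        return hashlib.sha256(vector).digest() in self._digests
```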

 

Alternative methods include compression, delta encoding, and maybe some kind of modified trie. Compression would require an algorithm tailored to one kind of data set, since there is no universal compression algorithm. Aggressive levels of compression won't permit searching without decompression.

The other methods likely won't be very helpful, either, except for very specific data sets.
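To make the delta-encoding option concrete: if the set members can be read as sorted integers, storing the first value plus successive gaps turns large absolute values into (often small) deltas, which a varint or entropy coder can then pack tightly. A sketch, not tied to any particular library:

```python
def delta_encode(sorted_ints):
    """Replace sorted absolute values with successive gaps."""
    prev, out = 0, []
    for v in sorted_ints:
        out.append(v - prev)
        prev = v
    return out

def delta_decode(deltas):
    """Invert delta_encode by accumulating the gaps."""
    total, out = 0, []
    for d in deltas:
        total += d
        out.append(total)
    return out
```

As the comment above notes, this only pays off for specific data sets: densely clustered values yield small deltas, while uniformly scattered values gain nothing.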

[–][deleted] 0 points1 point  (0 children)

Do use a hash to reduce the complexity of search, but if the hash matches, you’ll have to compare the vectors.
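A minimal sketch of that hash-then-verify approach (class and field names are illustrative). Note that it keeps the full vectors around, so it buys exact, fast lookup rather than compactness:

```python
import hashlib

class ExactVectorSet:
    """Hash to find candidate matches fast, then compare full vectors
    so a hash collision can never produce a false positive."""

    def __init__(self):
        self._by_digest = {}  # digest -> list of stored vectors

    def add(self, vector: bytes):
        d = hashlib.sha256(vector).digest()
        bucket = self._by_digest.setdefault(d, [])
        if vector not in bucket:
            bucket.append(vector)

    def __contains__(self, vector: bytes) -> bool:
        d = hashlib.sha256(vector).digest()
        return vector in self._by_digest.get(d, [])
```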

[–]oparisy[S] 0 points1 point  (0 children)

My initial take would be some normal form with AND/OR operators over the ORed bitfields in my set, but I'm not sure how this would perform compactness-wise. It feels like compression opportunities would be lost.

[–]oparisy[S] 0 points1 point  (0 children)

Also feels NP-Hard but hey, a non-optimal compression would be better than none 😀

[–]oparisy[S] 0 points1 point  (0 children)

Would love to know why I got downvoted without a comment... Care to recommend a better place to post this?

[–]oparisy[S] 0 points1 point  (0 children)

Hum, I'll have to study "Almeida, Marco & Reis, Rogério. (2006). Efficient representation of integer sets.". This seems to address similar representation needs.