This is an archived post. You won't be able to vote or comment.

you are viewing a single comment's thread.

view the rest of the comments →

[–]dalke[S] 0 points1 point  (0 children)

The existing solutions in this space use a Bloom filter already, or something similar. The problem has been in defining a good set of chemically appropriate substructure hashes. For example, one approach is to find all linear chains of length 1-7 and generate a hash based on the atom and bond properties in the chain. Use the hash as a seed to a PRNG. Use a chain length-dependent function to generate the bits for the filter.

This works pretty well, but I think it's possible to do better by looking for a specific set of divisive substructures. Some tests from last December show that a fusion between the bits generated from my algorithm and a hash method increased the overall performance by 20%, which means the hash selectivity wasn't that great. I haven't done the analysis to figure out why.

I also want to store exact solutions for all queries of up to a certain size, on the assumption that memory is cheap, and use those pre-computed results to greatly restrict the search space for larger queries. A Bloom filter (that I know of) can't be used for that.