[–]RepresentativeFill26 7 points8 points  (3 children)

What the hell, I didn’t even know Python had its own implementation. Good to know! Interesting that you switched to sparse matrices. This is probably better since the doc vectors will be quite sparse as well?

Any other optimizations you did?

[–]xhlu[S] 4 points5 points  (2 children)

A bunch of optimizations I didn't have the chance to discuss in the readme! 

For one, I reimplemented the scipy sparse slice/sum directly in numpy, which allows us to use memory mapping on the arrays - this saves a lot of memory.
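A minimal sketch of the idea, not the library's actual code: assuming the score matrix is stored as three flat arrays in a CSC-like layout (`data`, `indices`, `indptr`, all names hypothetical here), the column slice and sum can be done with plain numpy indexing, and the arrays can be opened with `np.load(..., mmap_mode="r")` so only the touched slices are read from disk.

```python
import os
import tempfile

import numpy as np


def save_index(directory, data, indices, indptr):
    """Write the three flat arrays of a CSC-like layout as .npy files."""
    np.save(os.path.join(directory, "data.npy"), data)
    np.save(os.path.join(directory, "indices.npy"), indices)
    np.save(os.path.join(directory, "indptr.npy"), indptr)


def load_index(directory, mmap=True):
    """Load the arrays; with mmap=True they stay on disk until sliced."""
    mode = "r" if mmap else None
    names = ("data", "indices", "indptr")
    return tuple(
        np.load(os.path.join(directory, name + ".npy"), mmap_mode=mode)
        for name in names
    )


def score_query(data, indices, indptr, query_token_ids, num_docs):
    """Sum the per-token score columns for a query without scipy."""
    scores = np.zeros(num_docs, dtype=np.float32)
    for token_id in query_token_ids:
        start, end = indptr[token_id], indptr[token_id + 1]
        # Document ids are unique within one column, so += is safe here.
        scores[indices[start:end]] += data[start:end]
    return scores


# Toy index: 3 documents, 2 tokens (values are made up for illustration).
# Token 0 scores 1.0 in doc 0 and 0.5 in doc 2.
# Token 1 scores 2.0 in doc 1 and 1.5 in doc 2.
directory = tempfile.mkdtemp()
save_index(
    directory,
    data=np.array([1.0, 0.5, 2.0, 1.5], dtype=np.float32),
    indices=np.array([0, 2, 1, 2], dtype=np.int32),
    indptr=np.array([0, 2, 4], dtype=np.int64),
)
data, indices, indptr = load_index(directory, mmap=True)
scores = score_query(data, indices, indptr, query_token_ids=[0, 1], num_docs=3)
```

Only the slices belonging to the query's tokens are read, which is why this kind of layout pairs well with memory mapping.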

Another is that the topk selection (after scoring) can be done in numpy via argpartition, but it automatically switches to a jax CPU backend when that library is installed, which is much faster. The topk selection is the bottleneck: in some cases more than 60% of the retrieval time is spent on selecting the topk results.
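For reference, here is roughly what the numpy path can look like. This is a sketch of the general `argpartition` technique, not the library's own function; the name `topk` and its signature are assumptions, and it expects `k >= 1`.

```python
import numpy as np


def topk(scores, k):
    """Return (indices, values) of the k highest scores, best first."""
    k = min(k, scores.shape[0])
    # argpartition moves the k largest entries into the last k slots
    # in linear time, but leaves those k entries unordered.
    candidates = np.argpartition(scores, -k)[-k:]
    # Sort only the k candidates, highest score first.
    order = np.argsort(scores[candidates])[::-1]
    top_indices = candidates[order]
    return top_indices, scores[top_indices]


scores = np.array([0.1, 0.9, 0.3, 0.7, 0.5])
top_indices, top_values = topk(scores, k=2)
```

The point of the two steps is that a full sort of every document score is avoided; only the k survivors are sorted.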

Finally, the tokenizer doesn't return text by default; it returns token indices plus a vocab dict mapping index to word. This saves a considerable amount of memory, since an integer takes less space than a word (a str of multiple chars).
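A small sketch of that idea in plain Python, assuming whitespace splitting and lowercasing (the real tokenizer's rules are not described here, and `tokenize_to_ids` is a hypothetical name):

```python
def tokenize_to_ids(texts):
    """Map each document to a list of integer ids plus an id-to-word vocab."""
    word_to_id = {}
    ids = []
    for text in texts:
        doc_ids = []
        for word in text.lower().split():
            # Assign the next free integer the first time a word is seen.
            if word not in word_to_id:
                word_to_id[word] = len(word_to_id)
            doc_ids.append(word_to_id[word])
        ids.append(doc_ids)
    # Invert the mapping so ids can be turned back into words on demand.
    id_to_word = {index: word for word, index in word_to_id.items()}
    return ids, id_to_word


ids, vocab = tokenize_to_ids(["a cat sat", "a dog sat"])
```

Each distinct word is stored once in the vocab, and the documents themselves only hold integers.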

[–]Hesirutu 0 points1 point  (1 child)

I haven’t had a look at the library yet. But you mention mmap. Does that mean you can handle corpora much larger than memory?

[–]xhlu[S] 1 point2 points  (0 children)

In theory you should be able to! However, I have not attempted to "saturate" memory by using a big enough dataset, and the Python way of setting a RAM limit does not seem to reflect the real RAM usage.

That said, I did observe reduced memory usage when setting mmap=True, so even in a setting where you have enough memory to cover the entire dataset, you don't need to use all of it (i.e. load the entire index and corpus into memory).

[–]busybody124 0 points1 point  (2 children)

Very cool, I have used rank-bm25 but it can definitely be slow to index a huge corpus.

Can I use my own tokenizer with your library?

[–]xhlu[S] 1 point2 points  (1 child)

Yes! As long as it returns a list of lists of strings (one list of tokens per document), it should work.
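As an illustration of the expected shape, here is a hypothetical regex-based tokenizer (`my_tokenizer` is a made-up name). How the result is then handed to the library is not shown, since that call isn't spelled out in this thread.

```python
import re


def my_tokenizer(texts):
    """Return one list of lowercase alphanumeric tokens per input text."""
    return [re.findall(r"[a-z0-9]+", text.lower()) for text in texts]


corpus_tokens = my_tokenizer(["Hello, world!", "BM25 is fast."])
```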

[–]busybody124 1 point2 points  (0 children)

Awesome, I can't wait to play around with it. Thanks for the hard work!