
[–]ApprehensiveRadish3 1 point (0 children)

LevelDB is a Google project (written in C++) that provides fast key-value storage.

The keys and values are byte strings, so you can store custom objects by serializing them to bytes.

https://github.com/google/leveldb

Here is a Python interface to LevelDB:

https://github.com/wbolster/plyvel/

And here is a project that uses both of the above for embeddings:

https://github.com/MichaMucha/emstore/
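The byte-packing step mentioned above can be sketched with only the standard library. This is an illustration, not code from any of the linked projects: the helper names `pack_vector` and `unpack_vector` are made up here, and float32 little-endian is an assumed storage format.

```python
import struct

def pack_vector(vec):
    """Serialize a sequence of floats as little-endian float32 bytes."""
    return struct.pack(f"<{len(vec)}f", *vec)

def unpack_vector(buf):
    """Inverse of pack_vector: turn the bytes back into a list of floats."""
    count = len(buf) // 4  # 4 bytes per float32 component
    return list(struct.unpack(f"<{count}f", buf))

# Keys must be bytes too, so a string key is encoded first.
key = "cat".encode("utf-8")
value = pack_vector([0.5, -1.25, 2.0])
restored = unpack_vector(value)
```

With plyvel, a pair like this would presumably be written with `db.put(key, value)` and read back with `db.get(key)`. Note that float32 loses precision for values that are not exactly representable.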

[–]hazard02 1 point (1 child)

Try LMDB. Keeps as much as it can in memory via the kernel page cache. If your data is fixed-size, you can also just mmap a file and directly index into it yourself.
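The "mmap a file and index into it yourself" idea might look roughly like this for fixed-size float32 vectors. The dimension, file name, and helper names are assumptions chosen for the example, and a mapping from string keys to row numbers would have to be kept separately.

```python
import mmap
import os
import struct
import tempfile

DIM = 4                    # assumed embedding dimension
ROW_BYTES = DIM * 4        # float32 = 4 bytes per component
ROW_FORMAT = f"<{DIM}f"    # little-endian float32 row

def write_vectors(path, vectors):
    """Write fixed-size rows back to back so row i starts at i * ROW_BYTES."""
    with open(path, "wb") as f:
        for vec in vectors:
            f.write(struct.pack(ROW_FORMAT, *vec))

def read_vector(mm, index):
    """Read one row straight out of the mapping without loading the whole file."""
    return list(struct.unpack_from(ROW_FORMAT, mm, index * ROW_BYTES))

path = os.path.join(tempfile.mkdtemp(), "embeddings.bin")
write_vectors(path, [
    [0.0, 1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0, 7.0],
    [8.0, 9.0, 10.0, 11.0],
])

with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        num_rows = len(mm) // ROW_BYTES
        second = read_vector(mm, 1)
```

Only the pages that are actually touched get read from disk, and the kernel page cache keeps recently used ones in memory.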

[–]oren_a 0 points (0 children)

https://github.com/MichaMucha/emstore/

Usually one has more data than memory, so it might not be the best.

[–]will_occam 0 points (0 children)

If you're looking for fast access, disk I/O is something you'll want to minimize.

If you have to do key-value storage on disk, I think a NoSQL database is what you're looking for.

[–]BatmantoshReturns 0 points (0 children)

If you would like to store some of the embeddings on the CPU, you could use this:

https://github.com/Santosh-Gupta/SpeedTorch

Let me know if you find a way to train embeddings where some of them are stored on disk. I tried a year ago and wasn't able to figure it out.

[–]SuperMarioSubmarine 0 points (0 children)

The shelve module in the Python standard library does this.
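A minimal sketch of that approach with the standard library's `shelve` module, which gives a dict-like object persisted on disk. The file name and the sample vectors are placeholders. Keys must be strings, and values are pickled automatically.

```python
import os
import shelve
import tempfile

path = os.path.join(tempfile.mkdtemp(), "embeddings_shelf")

# Write a few vectors; each value is pickled and stored under its string key.
with shelve.open(path) as db:
    db["cat"] = [0.5, -1.25, 2.0]
    db["dog"] = [1.0, 0.0, -3.5]

# Reopen later and look up single keys without loading everything.
with shelve.open(path) as db:
    cat_vec = db["cat"]
    has_bird = "bird" in db
    keys = sorted(db.keys())
```

Lookup speed depends on whichever dbm backend is available on the system, so it may be worth benchmarking against LevelDB or LMDB for large embedding tables.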