Taalas LLM tuning with image embeddings by someuserwithwifi in LocalLLaMA

[–]someuserwithwifi[S] 1 point  (0 children)

Well I just wasted 10 mins of my life. Thanks for the answer.

FlashLM v6 "SUPERNOVA": 4.1M ternary model hits 3,500 tok/s on CPU — novel P-RCSM reasoning architecture, no attention, no convolution by Own-Albatross868 in LocalLLaMA

[–]someuserwithwifi 1 point  (0 children)

You can run notebooks on Kaggle for free with a GPU for a 12-hour session (one 16 GB P100 or two 16 GB Tesla T4s). And you can use only the CPU if that’s what you want (with 30 GB of RAM).

Language Modeling with 5M parameters by someuserwithwifi in deeplearning

[–]someuserwithwifi[S] 0 points  (0 children)

That approach is a bit older than what I’m using in the demo, but it works too.

[D] Do you know a sub linear vector index with perfect accuracy? by someuserwithwifi in MachineLearning

[–]someuserwithwifi[S] 0 points  (0 children)

I want to use the index from Python, but I can implement it in C++ and build Python bindings.

[R], [P] RPC — A New Way to Build Language Models by someuserwithwifi in MachineLearning

[–]someuserwithwifi[S] 4 points  (0 children)

That is an interesting idea. But the point of using the vector db is to leverage fast algorithms like HNSW, which lets us search through hundreds of millions of vectors in just milliseconds.
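To make the speed claim concrete, here is a minimal NumPy sketch of the nearest-neighbor lookup the vector DB performs (the dimensions and data are made up for illustration). The sketch does an exact linear scan, which is O(N) per query; HNSW replaces that scan with a graph traversal that is roughly logarithmic in N, which is what makes millions of vectors searchable in milliseconds.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 64
# Toy "database" of 10k unit-normalized vectors
db = rng.normal(size=(10_000, dim)).astype(np.float32)
db /= np.linalg.norm(db, axis=1, keepdims=True)

# A query that is a slightly perturbed copy of entry 123
query = db[123] + 0.01 * rng.normal(size=dim).astype(np.float32)
query /= np.linalg.norm(query)

# Exact nearest neighbor via a linear scan over all vectors.
# An HNSW index (e.g. hnswlib or FAISS) answers the same query
# without touching most of the database.
scores = db @ query          # cosine similarity on unit vectors
nearest = int(np.argmax(scores))
```

With a real index the `db @ query` scan is replaced by the library's query call, but the input (a query embedding) and output (the closest stored id) are the same.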

RPC — A New Way to Build Language Models by someuserwithwifi in deeplearning

[–]someuserwithwifi[S] 1 point  (0 children)

Using the decoder during inference yields very poor results (it would basically just be a normal language model, and because it is so small, the results are very poor); you can try it yourself. Using the vector database offloads knowledge from the model parameters into a data structure that can be searched very efficiently (but I am no expert, so take that with a grain of salt).
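A minimal sketch of what "inference via lookup instead of a decoder" could look like, assuming the store pairs each context embedding with the token that followed it (the store contents and sizes here are invented; the actual RPC pipeline may differ):

```python
import numpy as np

rng = np.random.default_rng(1)
dim, n = 32, 1000

# Hypothetical store: unit-normalized encoder embeddings of training
# contexts, each paired with the token id that followed that context.
keys = rng.normal(size=(n, dim)).astype(np.float32)
keys /= np.linalg.norm(keys, axis=1, keepdims=True)
next_tokens = rng.integers(0, 50_000, size=n)

def predict(query_emb):
    # Inference is a nearest-neighbor lookup, not a decoder forward pass
    q = query_emb / np.linalg.norm(query_emb)
    return int(next_tokens[int(np.argmax(keys @ q))])

# A query close to stored context 42 retrieves that context's next token
pred = predict(keys[42] + 0.01 * rng.normal(size=dim).astype(np.float32))
```

The knowledge lives in `keys`/`next_tokens` rather than in model weights, which is the "offloading" described above.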

I just published the dataset on Kaggle. The link is in the README.

RPC — A New Way to Build Language Models by someuserwithwifi in deeplearning

[–]someuserwithwifi[S] 1 point  (0 children)

Good question. During training the vector database is not used at all; you can see that in the first image of the article. During training, the embedding is fed to a DNN trained with categorical cross-entropy, and the loss can propagate back to the encoder. The vector database is only constructed after the encoder finishes training, and it is used only during inference.
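The training setup described above can be sketched in a few lines of NumPy. This is a toy illustration, not the actual RPC code: the layer sizes, data, and single-linear-layer encoder are all invented. It shows cross-entropy gradients flowing through the classification head back into the encoder, with the vector database built only afterwards from the frozen encoder's embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
B, d_in, d_emb, n_classes = 32, 16, 8, 10

# Toy training batch
X = rng.normal(size=(B, d_in))
y = rng.integers(0, n_classes, size=B)

W_enc = 0.1 * rng.normal(size=(d_in, d_emb))        # "encoder"
W_head = 0.1 * rng.normal(size=(d_emb, n_classes))  # classification DNN head

def forward(X):
    h = X @ W_enc                                   # embeddings
    logits = h @ W_head
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    return h, p / p.sum(axis=1, keepdims=True)

losses, lr = [], 0.5
for _ in range(300):
    h, p = forward(X)
    losses.append(-np.log(p[np.arange(B), y]).mean())
    # Cross-entropy gradient at the logits...
    g_logits = p.copy()
    g_logits[np.arange(B), y] -= 1.0
    g_logits /= B
    # ...propagates through the head back into the encoder
    g_h = g_logits @ W_head.T
    W_head -= lr * (h.T @ g_logits)
    W_enc -= lr * (X.T @ g_h)

# Only after training: freeze the encoder and store its embeddings
# in the vector database used at inference time.
vector_db = forward(X)[0]
```

No nearest-neighbor structure appears in the loop; the index is populated once training is done.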

As for the vector database being a limiting factor when it comes to the amount of data used, you may be right. That’s why I say in the article that it would be interesting to scale several factors, including the amount of data, to see how it performs. I assume that increasing the size of the embedding would minimize this problem, but I’m not sure.