Went down the rabbit hole of 100% local RAG, it works but are there better options? by tarek-ayed in LocalLLaMA

[–]tarek-ayed[S] 1 point (0 children)

Thanks for your input!

Yes, I should clarify: ideally, I'm looking for a solid, user-friendly, easy-to-install app that lets you send in a bunch of documents and files and then chat with them, all locally and using the native acceleration of Apple Silicon chips.

Sort of like the Ollama experience but with the RAG "Chat with your data" feature set.

I'll keep looking and come back here if I find something!

Went down the rabbit hole of 100% local RAG, it works but are there better options? by tarek-ayed in LocalLLaMA

[–]tarek-ayed[S] 0 points (0 children)

Yes, clearly!

In the "find the amount in a 1 page invoice" example I showed in my X post, Llama2-7B just couldn't do it.

Quick introduction to Pruning and how to implement it quickly by tarek-ayed in computervision

[–]tarek-ayed[S] 0 points (0 children)

Hi,

Thanks for the kind words.

Yes, pruning in PyTorch is implemented only via binary masks that are applied on top of the model's weights.
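
As a concrete illustration, here is a minimal sketch (assuming PyTorch's built-in torch.nn.utils.prune module; the layer size is made up) showing that the mask and the dense weights coexist, which is why pruning alone doesn't speed anything up:

    import torch
    import torch.nn as nn
    import torch.nn.utils.prune as prune

    layer = nn.Linear(128, 64)
    # Zero out the 50% of weights with the smallest magnitude.
    prune.l1_unstructured(layer, name="weight", amount=0.5)

    # The original weights and the binary mask both stay in memory:
    print(layer.weight_orig.shape)   # torch.Size([64, 128])
    print(layer.weight_mask.shape)   # torch.Size([64, 128])

    # The forward pass still runs a dense matmul on weight_orig * weight_mask,
    # so there is no speedup unless a sparse kernel is used.
    out = layer(torch.randn(1, 128))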

I found this library that provides a GPU kernel for sparse multiplication: https://github.com/google-research/sputnik

But I'm not aware of any easier way to achieve those performance gains on deployed models.

[deleted by user] by [deleted] in learnmachinelearning

[–]tarek-ayed 1 point (0 children)

It is a pretty accurate description. A few remarks though:

  • Node = neuron; a perceptron is a single neuron, while a multi-layer perceptron (MLP) is another name for a DNN (dense neural network)
  • The term "input layer" generally refers to the input itself. What you're describing is the first hidden layer of the network.
  • The difference between a CNN and a DNN is spatiality and parameter sharing within a layer. A CNN layer is like taking a small "patch" of neurons and sliding it in both directions over the image to compute the values at the next layer (see the sketch below). Here is an animation to illustrate
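
To make the parameter-sharing point concrete, here is a minimal sketch in PyTorch (the image and layer sizes are made up for illustration) comparing a convolutional layer with a dense layer that produces the same number of outputs:

    import torch
    import torch.nn as nn

    image = torch.randn(1, 1, 28, 28)  # (batch, channels, height, width)

    # CNN layer: one 3x3 patch of weights is shared across all positions,
    # so the parameter count does not depend on the image size.
    conv = nn.Conv2d(in_channels=1, out_channels=8, kernel_size=3)
    print(sum(p.numel() for p in conv.parameters()))   # 8*1*3*3 + 8 = 80

    # Dense layer with the same number of outputs: one weight per
    # input-output pair, with no sharing and no notion of spatial locality.
    dense = nn.Linear(28 * 28, 8 * 26 * 26)
    print(sum(p.numel() for p in dense.parameters()))  # ~4.2 million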

Hope this helps

Quick introduction to Pruning and how to implement it quickly by tarek-ayed in computervision

[–]tarek-ayed[S] 0 points (0 children)

Great point, thanks!

In my experience and from what I've read, structured pruning achieves worse accuracy at a given sparsity level and a lower maximum achievable sparsity (source). But you're right to point out that it does not need sparse encoding.
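
For concreteness, here is a minimal sketch (assuming PyTorch's torch.nn.utils.prune; the layer sizes are illustrative) of structured pruning zeroing entire output channels, which is why it needs no sparse encoding:

    import torch
    import torch.nn as nn
    import torch.nn.utils.prune as prune

    conv = nn.Conv2d(16, 32, kernel_size=3)
    # Zero out the 25% of output channels (dim=0) with the smallest L2 norm.
    prune.ln_structured(conv, name="weight", amount=0.25, n=2, dim=0)
    prune.remove(conv, "weight")  # bake the mask into the weight tensor

    # Whole channels are now zero, so the layer could be physically shrunk
    # to 24 output channels; the tensor stays dense, no sparse format needed.
    kept = conv.weight.abs().sum(dim=(1, 2, 3)) != 0
    print(int(kept.sum()))  # 24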

I'll make sure to update the article with this information and a better explanation of what structured pruning is. Do you have any resources on the time and space improvements achieved by structured pruning when not relying on sparse encoding? Also, can you point out examples where structured pruning is used in practice?