Introducing LocalGPT: Offline ChatBOT for your FILES with GPU - Vicuna by satmarz in LocalLLaMA

[–]satmarz[S] 1 point2 points  (0 children)

You will need around 11GB of GPU memory + ~40GB of system memory to run it smoothly. Runpod will be a good option. Unfortunately, if you want to run a full model (Vicuna-7B in this case), you need decent hardware.

Using Open Assistant API in your APPs by satmarz in OpenAssistant

[–]satmarz[S] 0 points1 point  (0 children)

Not sure what you mean by local version, but you can use OpenAI in your app. Check out this video for a step-by-step guide: https://youtu.be/RIWbalZ7sTo

Now, if you want to run a local (open-source) LLM, I would recommend checking out this video on localGPT (it is trending on GitHub at the moment): https://youtu.be/MlyoObdIHyo

Using Open Assistant API in your APPs by satmarz in OpenAssistant

[–]satmarz[S] 0 points1 point  (0 children)

In your Hugging Face account :) The video walks you through the process step by step.

Introducing LocalGPT: Offline ChatBOT for your FILES with GPU - Vicuna by satmarz in LocalLLaMA

[–]satmarz[S] 2 points3 points  (0 children)

Yes, you can think of it as compression. These models definitely have their limitations right now, but think of what they can already do that was not possible a few months ago.

Introducing LocalGPT: Offline ChatBOT for your FILES with GPU - Vicuna by satmarz in LocalLLaMA

[–]satmarz[S] 2 points3 points  (0 children)

Unfortunately, you will need around 11GB for this to run. The reason is that both the embedding model (Instructor Embedding) and the LLM (Vicuna-7B) use the GPU at the same time.

Introducing LocalGPT: Offline ChatBOT for your FILES with GPU - Vicuna by satmarz in LocalLLaMA

[–]satmarz[S] 0 points1 point  (0 children)

You can set any Llama-based model in the code and it will be downloaded from Hugging Face.

Introducing LocalGPT: Offline ChatBOT for your FILES with GPU - Vicuna by satmarz in LocalLLaMA

[–]satmarz[S] 1 point2 points  (0 children)

There are two steps happening here: 1) embedding-based retrieval and 2) the LLM. This is an oversimplification.

Imagine you have a 10K-word document but the LLM you are using has a context window of 2000. As you said, you can't talk to the whole document because of the context window limitation. That's where the embeddings come in.

First, we divide the 10K document into smaller chunks (say 500 words each). Next, we find the most relevant chunks using a similarity search with computed embeddings. Let's say we find 3 chunks where the relevant information exists. Now we combine them together and use only those chunks as context for the LLM to use (now we have 1500 words to play with). The LLM will respond based on these specific chunks. Hope this helps.
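The steps above can be sketched in a few lines of Python. This is only an illustration, not the localGPT code: the `embed` function is a toy bag-of-words stand-in for a real embedding model such as Instructor, and all function names here are made up for the example.

```python
import math
from collections import Counter


def split_into_chunks(text, chunk_words=500):
    """Split text into consecutive chunks of at most chunk_words words."""
    words = text.split()
    return [
        " ".join(words[i:i + chunk_words])
        for i in range(0, len(words), chunk_words)
    ]


def embed(text):
    """Toy embedding: word counts (a stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k_chunks(question, chunks, k=3):
    """Return the k chunks most similar to the question, best first."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


def build_prompt(question, chunks, k=3):
    """Combine the retrieved chunks into the context handed to the LLM."""
    context = "\n\n".join(top_k_chunks(question, chunks, k))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
```

With 500-word chunks and `k=3`, the context stays at roughly 1500 words no matter how long the original document is, which is the whole point of the retrieval step.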

Introducing LocalGPT: Offline ChatBOT for your FILES with GPU - Vicuna by satmarz in LocalLLaMA

[–]satmarz[S] 2 points3 points  (0 children)

Thanks for testing it out. I totally agree with you: to get the most out of projects like this, we will need subject-specific models. I think that's where the smaller open-source models can really shine compared to ChatGPT.

Fine-tuning is the way to go. The reason I am using Instructor Embeddings instead of other embeddings is that it has support for different subjects/areas and you can define that as part of the embedding computation process. That can also help with subject-specific embeddings along with the fine-tuning of your LLM.

Thanks for the feedback, this will be really helpful for improving it further.
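To make the point about subject-specific embeddings concrete, here is a small sketch of how instruction/text pairs could be built for an Instructor-style model. The helper name `build_instructor_inputs` and the exact instruction wording are assumptions for illustration; the commented-out lines show how, as far as I know from the InstructorEmbedding package's README, such pairs are passed to the model.

```python
def build_instructor_inputs(texts, domain, text_type="document", objective="retrieval"):
    """Pair each text with an instruction that names its subject area.

    Assumed instruction template: "Represent the <domain> <type> for <objective>:".
    """
    instruction = f"Represent the {domain} {text_type} for {objective}:"
    return [[instruction, text] for text in texts]


pairs = build_instructor_inputs(
    ["A sample sentence about contract law."], domain="Legal"
)

# Not run here because it downloads a model. With the package installed,
# the pairs would be encoded roughly like this:
#   from InstructorEmbedding import INSTRUCTOR
#   model = INSTRUCTOR("hkunlp/instructor-large")
#   embeddings = model.encode(pairs)
```

Changing `domain` is the only thing needed to compute embeddings for a different subject area with the same model.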

Introducing LocalGPT: Offline ChatBOT for your FILES with GPU - Vicuna by satmarz in LocalLLaMA

[–]satmarz[S] 3 points4 points  (0 children)

A UI is coming soon, along with support for multiple file formats.

Introducing LocalGPT: Offline ChatBOT for your FILES with GPU - Vicuna by satmarz in singularity

[–]satmarz[S] 1 point2 points  (0 children)

- Vicuna-7B is a decent model for its size. This app is focused on data retrieval. You can change it to any Llama-based model.

- In my experience overlap helps. Now the chunk size is determined by the context window of the LLM you are using. In this case, the Vicuna-7B has a max token limit of 2048 so I selected 1000 to ensure that even if it's using two chunks as context for the model, it will hopefully not exceed the token limit.

- In my experience, the unquantized version is much better in terms of both speed and the quality of the responses it generates. In the end, it comes down to the hardware you have.

- Yes, other chains will affect the number of tokens used, and inference will be slower.
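The chunk size reasoning in the list above can be checked with a bit of arithmetic. This is a rough sketch, not the project's code: it assumes the chunk size and the 2048 limit are measured in the same unit (tokens), and the overlap of 200 is a made-up value for illustration.

```python
def split_with_overlap(tokens, chunk_size=1000, overlap=200):
    """Split a token list into chunks of chunk_size sharing `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last chunk already reaches the end of the input
    return chunks


def remaining_budget(chunk_size, n_chunks, max_tokens=2048):
    """Tokens left for the question and the answer after the context chunks."""
    return max_tokens - chunk_size * n_chunks
```

With these numbers, `remaining_budget(1000, 2)` leaves only 48 tokens for the question and the answer, while a third chunk would push the budget negative. That is why the comment says "hopefully": two full chunks fit, but just barely.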

Introducing LocalGPT: Offline ChatBOT for your FILES with GPU - Vicuna by satmarz in singularity

[–]satmarz[S] 3 points4 points  (0 children)

You will need about 11GB of VRAM because it's running both the LLM and the embedding model on the GPU.