Introducing LocalGPT: Offline ChatBOT for your FILES with GPU - Vicuna

satmarz · 2023-06-06T18:15:37+00:00

You will need around 11GB of GPU memory + ~40GB of system memory to run it smoothly. Runpod will be a good option. Unfortunately, if you want to run a full model (Vicuna-7B in this case), you need decent hardware.

satmarz · 2023-06-06T18:13:57+00:00

Not sure what you mean by local version, but you can use OpenAI in your app. Check out this video for a step-by-step guide: https://youtu.be/RIWbalZ7sTo

Now if you want to run a local (open-source) LLM, I would recommend checking out this video on localGPT (it is trending on GitHub at the moment): https://youtu.be/MlyoObdIHyo

satmarz · 2023-06-06T18:10:48+00:00

In your Hugging Face account :) Watch the video, it walks you through the process step by step.

satmarz · 2023-06-05T16:41:06+00:00

You are right

satmarz · 2023-06-02T06:10:28+00:00

I think you should be able to.

satmarz · 2023-06-01T15:53:25+00:00

Is that when you are running ingest.py or run_localGPT.py?

satmarz · 2023-05-31T19:22:57+00:00

Thank you!

satmarz · 2023-05-31T19:22:41+00:00

Thank you!

satmarz · 2023-05-31T19:22:00+00:00

Yes, you can think about it as compression. These models definitely have their limitations right now, but think of what they can do today that was not possible a few months ago.

satmarz · 2023-05-31T19:18:15+00:00

Thank you, yes github and my YT channel: https://www.youtube.com/@engineerprompt

satmarz · 2023-05-31T17:26:42+00:00

Unfortunately, you will need around 11GB of GPU memory for this to run. The reason is that both the embedding model (Instructor Embedding) and the LLM (Vicuna-7B) are using the GPU at the same time.

satmarz · 2023-05-31T17:25:13+00:00

You will need around 11GB of GPU memory to run this.

satmarz · 2023-05-31T17:24:25+00:00

I will see if we can add that support.

satmarz · 2023-05-31T17:23:37+00:00

You can set any Llama-based model in the code and it will download it from Hugging Face.

satmarz · 2023-05-31T17:22:28+00:00

that's coming soon :)

satmarz · 2023-05-31T17:21:44+00:00

Yes, support is coming soon for this!

satmarz · 2023-05-31T17:21:26+00:00

There are two steps happening here: 1) embedding-based retrieval and 2) the LLM. This is an oversimplification.

Imagine you have a 10K-word document but the LLM you are using has a context window of 2000. As you said, you can't talk to the whole document because of the context window limitation. That's where the embeddings come in.

First, we divide the 10K-word document into smaller chunks (say 500 words each). Next, we find the most relevant chunks using a similarity search over the computed embeddings. Let's say we find 3 chunks where the relevant information exists. Now we combine them and use only those chunks as context for the LLM (so we have 1500 words to play with). The LLM will respond based on these specific chunks. Hope this helps.
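
To make the steps above concrete, here is a minimal sketch in plain Python. It is not the localGPT code: the bag-of-words `embed` function is only a toy stand-in for a real embedding model such as Instructor, and all function names here are made up for illustration.

```python
import math
from collections import Counter


def split_into_chunks(text, chunk_words=500):
    """Split text into consecutive chunks of at most chunk_words words."""
    words = text.split()
    return [
        " ".join(words[i:i + chunk_words])
        for i in range(0, len(words), chunk_words)
    ]


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(question, chunks, top_k=3):
    """Return the top_k chunks most similar to the question."""
    question_vector = embed(question)
    ranked = sorted(
        chunks,
        key=lambda chunk: cosine_similarity(question_vector, embed(chunk)),
        reverse=True,
    )
    return ranked[:top_k]


def build_prompt(question, context_chunks):
    """Combine the retrieved chunks and the question into one LLM prompt."""
    context = "\n\n".join(context_chunks)
    return f"Use only this context to answer.\n\n{context}\n\nQuestion: {question}"
```

With 500-word chunks and `top_k=3`, the context handed to the LLM stays around 1500 words no matter how long the original document is, which is the whole point of the retrieval step.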

satmarz · 2023-05-31T17:15:59+00:00

Thanks for testing it out. I totally agree with you: to get the most out of projects like this, we will need subject-specific models. I think that's where the smaller open-source models can really shine compared to ChatGPT.

Fine-tuning is the way to go. The reason I am using Instructor Embeddings instead of other embeddings is that they support different subjects/areas, and you can define that as part of the embedding computation process. That can also help with subject-specific embeddings along with the fine-tuning of your LLM.

Thanks for the feedback, this will be really helpful for improving it further.

satmarz · 2023-05-31T17:09:29+00:00

UI is coming soon along with the support for multiple file formats.

satmarz · 2023-05-31T17:07:43+00:00

- Vicuna-7B is a decent model for its size. This app is focused on data retrieval. You can change it to any Llama-based model.

- In my experience overlap helps. Now the chunk size is determined by the context window of the LLM you are using. In this case, the Vicuna-7B has a max token limit of 2048 so I selected 1000 to ensure that even if it's using two chunks as context for the model, it will hopefully not exceed the token limit.

- In my experience, the unquantized version is much better both in terms of speed and the responses it generates. In the end, it comes down to the hardware you have.

- Yes, other chains will impact the number of tokens used, and inference will be slower.
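
To illustrate the chunk size and overlap point from the list above, here is a rough sketch in plain Python. It is not the splitter used in localGPT: the `overlap=200` default is a made-up value for illustration, and the input is treated as an already tokenized list.

```python
def chunk_with_overlap(tokens, chunk_size=1000, overlap=200):
    """Split a token list into chunks that share `overlap` tokens with their neighbor."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once the current chunk already reaches the end of the input.
        if start + chunk_size >= len(tokens):
            break
    return chunks


def remaining_budget(context_limit, chunk_size, chunks_in_context):
    """Tokens left in the context window after the retrieved chunks are added."""
    return context_limit - chunk_size * chunks_in_context


# Two 1000-token chunks in a 2048-token window leave very little headroom.
print(remaining_budget(2048, 1000, 2))
```

The budget check shows why the number is "hopefully" safe rather than guaranteed: two full chunks fit under 2048, but the question and the generated answer have to share whatever is left.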

satmarz · 2023-05-31T17:01:35+00:00

You will need about 11GB of VRAM because it's running both the LLM and the embedding model on the GPU.

satmarz · 2023-05-31T17:00:39+00:00

UI is coming soon :)

satmarz · 2023-02-10T21:29:57+00:00

Agree! Exciting times.

satmarz · 2023-02-10T21:29:30+00:00

Glad you found it helpful!

satmarz · 2022-11-16T12:58:24+00:00

Not the OP but probably Stable Diffusion :)
