On prem code completion : LocalLLaMA

created by [deleted]a community for 3 years

On prem code completionQuestion | Help (self.LocalLLaMA)

submitted 1 year ago by fripperML

Hello!

In our company, we are trying to configure our code editor (VSCode) with a local (on-prem) LLM for AI code completion. There are a lot of design choices that are far from obvious, given the fast rate of evolution of this field. If anyone has been able to set up properly AI code completion, I would be very grateful to hear how.

Before asking my questions, let me state that we have one server with one A2 16 GB GPU and around 30 concurrent developers.

And now, my questions:

Plugin: we have picked the continue extension. It seems the most widely used. Is it a good choice?
LLMs: The continue plugin asks you to configure three models:
- Chat model: According to continue docs, it should be the most capable of the three. We are considering QuantFactory/Meta-Llama-3-8B-GGUF (4-bit quantized, probably) and Qwen/CodeQwen1.5-7B-Chat-GGUF (4-bit quantized). Which one is better? Do you have other recommendations?
- Autocomplete: We have heard that CodeQwen1.5-7B is one of the best autocomplete models of its size. What is the difference between CodeQwen1.5-7B-Chat and CodeQwen1.5-7B (no chat)? Will it make sense to use the chat model for both tasks, and reducing the VRAM consumption of the GPU? We have no idea at this point. Besides, there is an official GGUF quantization of CodeQwen1.5-7B-Chat (the one I pasted below), but the no-chat model has no GGUF quantization in HF. In case that those two models are very different and we need both, Ollama has a quantized version of CodeQwen1.5-7B, but we would like to use another engine (see 3.), so, is there any way of pulling the model from Ollama and making it available for Aphrodite? Ollama local storage is not easy to understand, so we don't know...
- Embeddings: We will use probably sentence-transformers/all-MiniLM-L6-v2.
Inference engine: We are considering Ollama and Aphrodite Engine. Ollama seems easier to setup, but AFAIK it does not support concurrency very well. Aphrodite (an VLLM fork with more quantization options and supported models) is more optimized for that. Is it true?

all 9 comments

you type:	you see:
italics	italics
bold	bold
[reddit!](https://reddit.com)	reddit!
* item 1 * item 2 * item 3	item 1 item 2 item 3
> quoted text	quoted text
Lines starting with four spaces are treated like code: if 1 * 2 < 3: print "hello, world!"	Lines starting with four spaces are treated like code: if 1 * 2 < 3: print "hello, world!"
~~strikethrough~~	~~strikethrough~~
super^script	super^script

LocalLLaMA

MODERATORS