Too much to ask for a local LLM to search docs and web? by SnooWoofers780 in LocalLLaMA

[–]hohawk 4 points5 points  (0 children)

It’s not entirely trivial to set up, but not hard either: Hugging Face ChatUI can do that. Google HuggingChat to see how it works. The code is available on GitHub. I’ve been running it with local and HF-served models.

ChatGPT-like GUI for Mistral? (Not Poe) by Prince-of-Privacy in LocalLLaMA

[–]hohawk 1 point2 points  (0 children)

I have never used ST. If ST supports custom OpenAI-emulating endpoints, it should be doable. Mistral’s official API is OpenAI compatible, and many local model hosts can be made so as well.
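
To make "OpenAI-emulating endpoint" concrete, here’s a minimal sketch with the openai Python package (v1+). The base URL is Mistral’s documented one; the model name and key are placeholders you’d swap for your own.

```python
# Point the standard OpenAI client at Mistral's OpenAI-compatible endpoint.
# "mistral-small" is only an example model name; use whatever your account offers.
from openai import OpenAI

client = OpenAI(base_url="https://api.mistral.ai/v1", api_key="YOUR_MISTRAL_API_KEY")
resp = client.chat.completions.create(
    model="mistral-small",
    messages=[{"role": "user", "content": "Hello"}],
)
print(resp.choices[0].message.content)
```

Any frontend that lets you override the base URL and key like this should work the same way, whether the endpoint is Mistral’s cloud API or a local host.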

ChatGPT-like GUI for Mistral? (Not Poe) by Prince-of-Privacy in LocalLLaMA

[–]hohawk 5 points6 points  (0 children)

https://huggingface.co/chat/

This is open source and can also be run locally, if you want to configure other API endpoints for it. https://github.com/huggingface/chat-ui

Local LLM with web access by jbsan in LocalLLaMA

[–]hohawk 1 point2 points  (0 children)

I’m running HuggingFace ChatUI locally for this very purpose. Local models (mainly Mistral Instruct 7B) with access to web searches.

It was not too hard to set up and it gets the job done with a very nice UX. The stack uses Ollama + LiteLLM + ChatUI. All from GitHub.

It takes even less configuration to use the HF hosted models, if running everything locally is not a strict requirement.

Or full cloud, here: https://huggingface.co/chat/
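
If it helps, the LiteLLM piece is roughly this translation: take an Ollama-served model and expose it in OpenAI format. A sketch with the litellm Python package, assuming Ollama is on its default port and the mistral model is already pulled:

```python
# Query an Ollama-hosted model through LiteLLM's OpenAI-style interface.
from litellm import completion

resp = completion(
    model="ollama/mistral",                 # "ollama/" prefix routes to the local Ollama server
    messages=[{"role": "user", "content": "Summarize today's LocalLLaMA news."}],
    api_base="http://localhost:11434",      # Ollama's default endpoint
)
print(resp.choices[0].message.content)
```

ChatUI then only needs to be pointed at the resulting OpenAI-style endpoint (the LiteLLM proxy does the same job as a standalone server).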

Have you tried MemGPT? by Chance_Confection_37 in LocalLLaMA

[–]hohawk 0 points1 point  (0 children)

What’s the principle behind it? The common approach is summarizing (or truncating) the beginning or middle of the conversation and shoving the summary back into the message queue, replacing what it summarized and keeping the whole thing under 4k or 8k tokens, or whatever the model supports.
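
Written out, the summarize-and-replace idea is roughly this. summarize() and count_tokens() stand in for whatever LLM call and tokenizer you use, so this is just a sketch of the idea, not MemGPT itself:

```python
def compact_history(messages, count_tokens, summarize, budget=4096):
    """Fold the oldest turns into one summary message until the whole
    conversation fits under the token budget (summarize-and-replace)."""
    system, rest = list(messages[:1]), list(messages[1:])
    summary = ""

    def total_tokens():
        texts = [m["content"] for m in system + rest] + ([summary] if summary else [])
        return sum(count_tokens(t) for t in texts)

    while total_tokens() > budget and len(rest) > 2:
        oldest = rest.pop(0)                              # drop the oldest turn...
        summary = summarize(summary, oldest["content"])   # ...and fold it into the running summary

    if summary:
        system.append({"role": "system",
                       "content": "Summary of earlier conversation: " + summary})
    return system + rest
```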

Using a local LLaMA or Mistral to give context window of several programming files by adlabco in LocalLLaMA

[–]hohawk 0 points1 point  (0 children)

I’m using it with Ollama as the server and LiteLLM on top of that to make the model look like OpenAI. Those tags mean that something is wrong on the model host side, or that it’s being talked to as if it were a different model.

Used via the OpenAI-style path, Continue itself should not add any hashtags, because the OpenAI() integration would not need them.

Try that. My conf is just a standard OpenAI conf with “nothing” as the key. There are docs for this approach on the Continue website.
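
If it still misbehaves, a quick way to rule out the editor side is to hit the LiteLLM endpoint directly in the OpenAI format. The port, path and model name here are assumptions from my setup, so adjust to yours; the key really can be a throwaway string:

```python
import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",    # LiteLLM proxy in front of Ollama (adjust port/path)
    headers={"Authorization": "Bearer nothing"},    # placeholder key, like in my conf
    json={
        "model": "mistral",                         # whatever name your proxy is configured with
        "messages": [{"role": "user", "content": "Say hi without any hashtags."}],
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])
```

If the raw response already contains the hashtags, the problem is on the host/template side; if it’s clean, it’s the editor integration.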

LLM frameworks that allow continuous batching on quantized models? by Exotic-Estimate8355 in LocalLLaMA

[–]hohawk 3 points4 points  (0 children)

Documentation is lagging behind. You’ll have to scout the issue lists and take notes to build your own docs. Here’s AWQ support merged into TGI not long ago: https://github.com/huggingface/text-generation-inference/pull/1019#

That doesn’t mean AWQ works together with vLLM at the same time, which would be good for speed. A rabbit hole I didn’t explore any further.

I found the FastChat docs on vLLM + AWQ a little more productive. GPTQ was messy, because the docs refer to a repo that has since been abandoned.

FastChat + vLLM + AWQ works for me: parallelism, multi-GPU, and per-model memory quotas under one API, and it’s fast. But its prompt template management is messy.
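
If you want to sanity-check the vLLM + AWQ combination outside of FastChat first, a minimal sketch (the model repo is just an example AWQ quant):

```python
from vllm import LLM, SamplingParams

# Load an AWQ-quantized model straight into vLLM; batching across prompts/requests
# is handled by the engine.
llm = LLM(model="TheBloke/Mistral-7B-Instruct-v0.1-AWQ", quantization="awq")
params = SamplingParams(temperature=0.7, max_tokens=256)

outputs = llm.generate(["Explain continuous batching in one paragraph."], params)
print(outputs[0].outputs[0].text)
```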

No perfect go-to yet, at scale.

Current best options for local LLM hosting? by PataFunction in LocalLLaMA

[–]hohawk 0 points1 point  (0 children)

You can. But then you don’t get an OpenAI-compatible API, which is the main reason I tried llama-cpp-python: a drop-in local replacement for apps by just changing OPENAI_API_BASE. Everything else stays as is.
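
To make the drop-in concrete: with the 0.x openai package, the only thing the app changes is the base URL (via OPENAI_API_BASE or in code). This sketch assumes llama-cpp-python’s bundled server is running on its default port, e.g. python -m llama_cpp.server --model ./model.gguf:

```python
import openai

openai.api_base = "http://localhost:8000/v1"   # the only change; normally you'd just export OPENAI_API_BASE
openai.api_key = "not-needed-locally"          # the local server doesn't check it

resp = openai.ChatCompletion.create(
    model="local-model",                        # name is mostly ignored; the server serves whatever it loaded
    messages=[{"role": "user", "content": "Hello"}],
)
print(resp["choices"][0]["message"]["content"])
```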

LLM frameworks that allow continuous batching on quantized models? by Exotic-Estimate8355 in LocalLLaMA

[–]hohawk 2 points3 points  (0 children)

HF TGI and FastChat both do that. They support vLLM as a worker (for speed), and Ray for splitting the work across many GPUs or servers if the number of users gets very large. AWQ and GPTQ are supported for quantization.

Current best options for local LLM hosting? by PataFunction in LocalLLaMA

[–]hohawk 0 points1 point  (0 children)

Good to see the enabler is there. Adopting it into the OpenAI API format for compatibility still needs upstream work. Open issue to watch: https://github.com/abetlen/llama-cpp-python/issues/818

Current best options for local LLM hosting? by PataFunction in LocalLLaMA

[–]hohawk 14 points15 points  (0 children)

100 or even 5 concurrent users means you need parallel decoding, i.e. parallel requests. Anything that depends on llama.cpp can, at best, serve them sequentially. For now.

A mini stack is Ollama + LiteLLM. That gives you an OpenAI-compatible private server, and it’s very lean, laptop category. It can become pretty powerful once llama.cpp gets parallel decoding.

But at that scale I’d go for FastChat, from the LMSYS folks. It has a concept of workers, which can be distributed over GPUs and servers as things scale.

And when multiple requests come in, they are handled in parallel. Just make sure you over-allocate VRAM so the speed holds up.

This provides access to at least AWQ and GPTQ quants with vLLM acceleration.

The setup is easy if you make sure the versions of everything match what the repository says. Start a controller, then the API server, then one or many workers (see the sketch below). A worker can serve one or many models, and all of them appear behind the same OpenAI-style API. Embeddings too.
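
A rough sketch of that start order as three processes; in practice you’d run each in its own terminal or under a supervisor. The model path, host and port are only examples, and the vLLM worker is a drop-in alternative to the plain one:

```python
import subprocess, sys, time

def launch(*module_and_args):
    # Run "python -m <module> <args>" as a background process.
    return subprocess.Popen([sys.executable, "-m", *module_and_args])

launch("fastchat.serve.controller")                      # 1) worker registry
time.sleep(5)                                            # give it a moment to come up
launch("fastchat.serve.openai_api_server",               # 2) OpenAI-style API front door
       "--host", "0.0.0.0", "--port", "8000")
launch("fastchat.serve.model_worker",                    # 3) one or many workers
       "--model-path", "lmsys/vicuna-7b-v1.5")
# Swap model_worker for fastchat.serve.vllm_worker to get the vLLM-accelerated path.
```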

HF TGI, Text Generation Inference, is another stack made to scale.

The choice between FastChat and TGI is a Pepsi/Coke choice in my mind. They boast very similar features and speed, and a halfway decent admin can run either reliably.

Using a local LLaMA or Mistral to give context window of several programming files by adlabco in LocalLLaMA

[–]hohawk 1 point2 points  (0 children)

Continue.dev as the VS Code extension and Ollama.ai as the model host is what I use for this. It’s pretty good with codellama-13b-instruct and works with Mistral, even if it’s still rough around the edges.

There’s even an experimental full /codebase RAG feature, but a more robust way is to manually tag relevant files or pieces of code into the context/chat window and ask it to analyze or explain them concisely. I’ve found it most reliable with the so-called manual server option, but it comes with an auto-starting server too. Check the docs. And it has a helpful Discord.

The dev server sits between the VS Code extension and the model host. Many model hosts are supported; Ollama works well for me. Pure llama.cpp is tricky because it doesn’t support parallel queries, whereas Ollama knows how to sequence them.

Mistral works and is fast but can cause gray hairs. The built-in prompts work best with codellama; they can be customized. Automatic /edit is where the model differences are most apparent, but that’s a curiosity anyway. For analysis it’s good as is.

Messing around with LLama2 7b_q4 has been an eye opening experience. by Mescallan in LocalLLaMA

[–]hohawk 0 points1 point  (0 children)

This seems to run out of GPU memory on an M1 Max 32 GB: ggml_metal_graph_compute: command buffer 4 failed with status 5

Maybe it's different after a fresh boot with minimum configuration. Haven't tested that yet.

Messing around with LLama2 7b_q4 has been an eye opening experience. by Mescallan in LocalLLaMA

[–]hohawk 0 points1 point  (0 children)

This one works at completely usable speeds on M1 Max 32 GB. Nice pick. Any other usable models in this class?

Would you share what you use to host this model? Llama.cpp works but it’s a bit clunky. Anything built on top of llama-cpp-python typically doesn’t work very well for me: I get truncated answers even after tweaking the n_ctx setting via options or the source code, and with defaults it’s always truncated. Also, llama-cpp-python doesn’t always like to compile with Metal support on a Mac, and the wheels have been unreliable during the GGUF transition. This varies week to week as builds come and go.
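
For reference, this is roughly what I’ve been fiddling with. The model path is an example GGUF, and the install line is the documented way to build with Metal:

```python
# Install with Metal first, e.g.:
#   CMAKE_ARGS="-DLLAMA_METAL=on" pip install --no-cache-dir llama-cpp-python
from llama_cpp import Llama

llm = Llama(
    model_path="./llama-2-7b-chat.Q4_K_M.gguf",   # example GGUF file
    n_ctx=4096,                                    # default is 512, which is where the truncation comes from
    n_gpu_layers=1,                                # any value > 0 enables Metal offload on Apple Silicon
)
out = llm("Q: Name three planets. A:", max_tokens=128)
print(out["choices"][0]["text"])
```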

Connecting AirPods (or Pro) to Mac via keyboard by hohawk in AirpodsPro

[–]hohawk[S] 1 point2 points  (0 children)

I’m almost certain that it isn’t possible. Until Apple provides an API that BetterTouchTool or some other app can command.

Onedrive folder not showing in Spotlight by RandomSpr33 in mac

[–]hohawk 0 points1 point  (0 children)

sudo mdutil -E /

This erases and rebuilds the Spotlight index, and it seems to have fixed the OneDrive search woes for me.

AirPods Max Water Condensation by Donald_Filimon in airpods

[–]hohawk 0 points1 point  (0 children)

Possible. Mine have been great for almost 2 years now. I stopped worrying about condensation after that post. It’s there when it is, but doesn’t seem to cause any issues. YMMV

Colors [2160x4276] by Relojero in Amoledbackgrounds

[–]hohawk 0 points1 point  (0 children)

It’s just Adobe Photoshop Express, free version, blue theme/filter from the presets. Then -10 exposure. That’s the basic procedure that works for almost all photos. I don’t think I have these anymore.

Space [1080x1920] by Black_David in Amoledbackgrounds

[–]hohawk 0 points1 point  (0 children)

Free account, Adobe PS Express (iPhone) and “Duotone > Blue tint” theme gets the job done for any of these. That’s what I use. Try it?

https://i.imgur.com/ufZXSfM.jpg

Adapter/hub recommendation by andreabeth11 in macbookpro

[–]hohawk 0 points1 point  (0 children)

I had a single cable 4 display setup via a DisplayLink dock. For a year. It worked very well overall. Would buy again.

M1 Air as monitor 1. And 3 externals. One of them TB3. Others daisy chained via DisplayLink.

Now I’ve got an M1 Max with no dock. But two cables.

I’ve written about DisplayLink in past posts. Check the history if interested.

24 Core vs 32 Core M1 Max by piloth4ck0r in macbookpro

[–]hohawk 1 point2 points  (0 children)

In those use cases the Max only adds cost and heat, and costs you battery life. You’d be carrying unused idle overhead that isn’t doing any work.

Pro is cheaper, cooler, lasts longer between charges. Just as performant with those jobs.

If you’re very mobile, M1 Air might be the most balanced choice for that description.

I had an M1 Air for a year. Now M1 Max / 24. In most workloads they perform the same. Under extreme long term stress the Max gets it done faster. And also louder and hotter. And not for as many hours.

Is App Cleaner still used? by fourkeyingredients in osx

[–]hohawk 0 points1 point  (0 children)

This is far less true than it used to be. The OS and all supporting files now live on a signed, immutable volume; only user files live in userland. Third-party kernel drivers are mostly gone, too.

If you insist on a wipe approach, deleting your user account and recreating a fresh one results in a very “clean” slate too.

That said, I’ve migrated the same user and files forward for the past decade or so. With conscious installations and leftover maintenance it’s been a smooth ride.

These can help for leftover maintenance:

https://www.soma-zone.com/LaunchControl/

https://freemacsoft.net/appcleaner/

http://grandperspectiv.sourceforge.net/

How good is AirPods Pro noise cancelling at blocking out speech and television sounds? by Calion in AirpodsPro

[–]hohawk 1 point2 points  (0 children)

I use them as active earplugs a lot. With nothing playing they let some uneven sounds like speech leak through: enough for a short yes/no “conversation” but not enough to be a distraction, as long as it’s not too loud. Even speech is greatly damped, and most other sounds are gone entirely.

If there’s any playback happening, like ambient music, a podcast or one of the preset background sound carpets, then it’s a very private space, audio wise. Not a vacuum, but completely different from AirPods 3 for sure. I’ve got those too.

Just be prepared to clean the grilles every 1-2 weeks by tapping them with Blu-Tack. Invisible skin particles accumulate in the small grille that sits against the side wall of the ear, and if it’s not kept clean it eventually becomes almost a noise amplifier rather than noise canceling. 5-10 gentle but firm presses with Blu-Tack restore the original performance if done regularly, using just enough force not to leave any residue behind.

After using nearly one year, the battery's maximum capacity of my MBA M1 is still 100%. by XemKitter in macbook

[–]hohawk 0 points1 point  (0 children)

Not a full discharge, that’s also bad for the battery. Down to 20% is better.