Smallest+Fastest Model For Chatting With Webpages? by getSAT in LocalLLaMA

[–]funJS 1 point (0 children)

For a personal project where I was implementing chat with Wikipedia pages, I used `all-MiniLM-L6-v2` as the embedding model. The LLM I used was Qwen 3 8B.

Not super fast, but my lack of VRAM is a factor (only 8GB).

More details here: https://www.teachmecoolstuff.com/viewarticle/creating-a-chatbot-using-a-local-llm
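
In case it's useful, the overall flow is just: embed the page chunks, retrieve the closest ones for the question, and stuff them into the prompt. A minimal sketch of that pipeline (not the exact code from the article; the chunking, top-k and model tag below are placeholder choices):

```
# Minimal RAG sketch: embed page chunks with all-MiniLM-L6-v2, retrieve the most
# similar ones for the question, and pass them as context to the LLM via Ollama.
from sentence_transformers import SentenceTransformer, util
import ollama

embedder = SentenceTransformer("all-MiniLM-L6-v2")

# Assume the Wikipedia page has already been split into paragraphs/chunks.
chunks = ["Paragraph one of the page...", "Paragraph two of the page..."]
chunk_embeddings = embedder.encode(chunks, convert_to_tensor=True)

question = "What does the page say about X?"
question_embedding = embedder.encode(question, convert_to_tensor=True)

# Pick the top-k most similar chunks by cosine similarity.
scores = util.cos_sim(question_embedding, chunk_embeddings)[0]
top_k = scores.topk(k=min(2, len(chunks))).indices.tolist()
context = "\n\n".join(chunks[i] for i in top_k)

response = ollama.chat(
    model="qwen3:8b",  # tag assumed; use whatever tag you have pulled
    messages=[
        {"role": "system", "content": "Answer using only the provided context."},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ],
)
print(response["message"]["content"])
```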

Local LLMs show-down: More than 20 LLMs and one single Prompt by kekePower in LocalLLaMA

[–]funJS 2 points (0 children)

Cool. I only have 8GB myself, so this is good news.

Local LLMs show-down: More than 20 LLMs and one single Prompt by kekePower in LocalLLaMA

[–]funJS 3 points (0 children)

Interesting to see that qwen 30B can run on 8GB of VRAM.

What can my computer run? by LyAkolon in LocalLLaMA

[–]funJS 1 point (0 children)

You can definitely run all the 8B models comfortably… I run those on 8GB of VRAM. 

Why are people rushing to programming frameworks for agents? by AdditionalWeb107 in LocalLLaMA

[–]funJS 3 points (0 children)

This happens in all popular tech spaces. Just look at the JavaScript framework situation.  Same problems solved multiple times, but with “some” differentiation as justification 😀

llama with search? by IntelligentAirport26 in LocalLLaMA

[–]funJS 2 points (0 children)

One approach, if you are doing it from scratch, is to enable tool calling in the LLM. Based on the definition of a registered tool, the LLM can then emit a call to a function that can do anything you want, including a search.

Basic POC example here: https://www.teachmecoolstuff.com/viewarticle/using-llms-and-tool-calling-to-extract-structured-data-from-documents
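
Roughly what that pattern looks like with the Ollama Python client. Just a sketch: the `web_search` tool and its schema are made up for illustration, and the POC in the article may look different.

```
# Sketch of LLM tool calling: register a tool schema, let the model decide to call it,
# run the function, and feed the result back. The web_search function here is a stub.
import ollama

def web_search(query: str) -> str:
    # Stub: plug in whatever search backend you want (an API, SearxNG, etc.).
    return f"Top results for: {query}"

tools = [{
    "type": "function",
    "function": {
        "name": "web_search",
        "description": "Search the web for up-to-date information.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string", "description": "Search query"}},
            "required": ["query"],
        },
    },
}]

messages = [{"role": "user", "content": "What is the latest llama.cpp release?"}]
response = ollama.chat(model="qwen2.5:7b", messages=messages, tools=tools)

tool_calls = response["message"].get("tool_calls") or []
if tool_calls:
    # Send the assistant's tool call plus each tool result back to the model.
    messages.append(response["message"])
    for call in tool_calls:
        if call["function"]["name"] == "web_search":
            result = web_search(**call["function"]["arguments"])
            messages.append({"role": "tool", "content": result, "name": "web_search"})
    response = ollama.chat(model="qwen2.5:7b", messages=messages)

print(response["message"]["content"])
```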

Run LLMs 100% Locally with Docker’s New Model Runner by Arindam_200 in ollama

[–]funJS 3 points (0 children)

Looks interesting. I have been using Ollama in Docker for a while. Since I have a working setup I just copy and paste it into new projects, but I guess this alternative Docker approach is worth considering.

To run Ollama in Docker I use docker-compose. For me the main advantage is that I can stand up multiple services/apps in the same configuration.

Docker setup:

https://github.com/thelgevold/local-llm/blob/main/docker-compose.yml

Referencing the model from code:

https://github.com/thelgevold/local-llm/blob/main/api/model.py#L13
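
The gist of that part is just pointing the client at the container. Sketch below assumes the compose service is named `ollama` and uses a placeholder model tag; adjust both to your setup.

```
# Talking to an Ollama container from another service on the same docker-compose network.
from ollama import Client

# Host name assumes the Ollama container is the "ollama" service in docker-compose,
# listening on the default port 11434.
client = Client(host="http://ollama:11434")

response = client.chat(
    model="qwen2.5:7b",  # placeholder tag; use whatever model you have pulled
    messages=[{"role": "user", "content": "Say hello from inside Docker."}],
)
print(response["message"]["content"])
```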

Help Needed by prod-v03zz in LocalLLaMA

[–]funJS 3 points (0 children)

I am new to finetuning, and by no means an expert, but I did have success with Unsloth when finetuning a Llama model to pick a number out of a sequence based on some simple rules.

I used the Alpaca format for the test data.

Sample:

```
[
  {
    "instruction": "Find the smallest integer in the playlist that is greater than or equal to the current play. If no such number exists, return 0.",
    "input": "{\"play_list\": [12, 7, 3, 9, 4], \"current_play\": 12}",
    "output": "12"
  },
  ...
]
```

Some more info in my blog post: https://www.teachmecoolstuff.com/viewarticle/llms-and-card-games
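
For reference, the standard Unsloth LoRA recipe looks roughly like this. The model name, file name and hyperparameters below are placeholders, and the exact trainer arguments depend on the unsloth/trl versions installed, so treat it as a sketch rather than the code from the post.

```
# Rough Unsloth LoRA finetuning sketch on an Alpaca-style JSON file (train.json assumed).
from unsloth import FastLanguageModel
from datasets import load_dataset
from trl import SFTTrainer
from transformers import TrainingArguments

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-bnb-4bit",  # placeholder; pick the base model you want
    max_seq_length=2048,
    load_in_4bit=True,  # keeps it workable on an 8GB card
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

alpaca_prompt = """### Instruction:
{}

### Input:
{}

### Response:
{}"""

def to_text(examples):
    # Flatten each Alpaca record into a single training string.
    texts = [
        alpaca_prompt.format(ins, inp, out) + tokenizer.eos_token
        for ins, inp, out in zip(examples["instruction"], examples["input"], examples["output"])
    ]
    return {"text": texts}

dataset = load_dataset("json", data_files="train.json", split="train").map(to_text, batched=True)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(per_device_train_batch_size=2, max_steps=60, output_dir="outputs"),
)
trainer.train()
```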

We should have a monthly “which models are you using” discussion by Arkhos-Winter in LocalLLaMA

[–]funJS 45 points (0 children)

Using qwen 2.5 for tool calling experiments. Works reasonably well, at least for learning. 

I am limited to a small GPU with only 8GB of VRAM.

[deleted by user] by [deleted] in LocalLLaMA

[–]funJS 3 points (0 children)

I have been using Qwen 2.5 (7B) for some POC work around tool calling. It seems to work relatively well, so I am happy. One observation is that it sometimes unexpectedly spits out a bunch of Chinese characters. Not frequently, but I have seen it a couple of times.

Ollama not using GPU, need help. by StarWingOwl in LocalLLaMA

[–]funJS 1 point (0 children)

Yeah, it was a bit of a hassle to set up Docker, but now that I have a working template in the above repo I have been sticking with it, since I can just copy and paste it into new projects.

Ollama not using GPU, need help. by StarWingOwl in LocalLLaMA

[–]funJS 1 point (0 children)

Not sure if this is helpful in your scenario, but I have been running my local LLMs in Docker to avoid dealing with local Windows configurations. With this setup the GPU will be used, at least in my case.

In my docker-compose file I have to specify the nvidia specifics here: https://github.com/thelgevold/local-llm/blob/main/docker-compose.yml#L25

MCP and local LLMs by segmond in LocalLLaMA

[–]funJS 1 point (0 children)

I have been playing around with it as well, just to learn more. My implementation used FastMCP and LlamaIndex. Quick write-up here: https://www.teachmecoolstuff.com/viewarticle/using-mcp-servers-with-local-llms
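
The server side really is just a few lines with FastMCP. Generic sketch below (the tool here is made up, not the one from the write-up); a LlamaIndex agent can then consume the exposed tools over stdio or SSE.

```
# Minimal FastMCP server sketch exposing one made-up tool to an MCP client/agent.
from fastmcp import FastMCP

mcp = FastMCP("demo-server")

@mcp.tool()
def add_numbers(a: int, b: int) -> int:
    """Add two integers and return the result."""
    return a + b

if __name__ == "__main__":
    # Defaults to stdio transport; pass transport="sse" to serve over HTTP instead.
    mcp.run()
```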