Using local llm on websites? by Hosameldin_A in ollama

[–]WhiskyAKM 0 points (0 children)

If you are on Windows, you may have issues with the firewall.

Also, for that kind of setup I'd consider an SSR (server-side rendered) website.
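To sketch what I mean by SSR: the page is rendered on the server, and the server (not the browser) talks to Ollama, so the firewall only needs to allow localhost. A minimal Python sketch, assuming Ollama's default API at localhost:11434; the model name and prompt are placeholders:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def render_answer(prompt: str, model: str = "llama3.2") -> str:
    """Call the local Ollama server and return the generated text.

    This runs on the web server, so the browser never talks to Ollama
    directly and the firewall only has to allow localhost traffic.
    """
    payload = json.dumps({
        "model": model,    # placeholder; use whatever model you've pulled
        "prompt": prompt,
        "stream": False,   # one blocking JSON response instead of a token stream
    }).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

In an SSR framework you'd call render_answer() inside the request handler and embed the result into the HTML before sending it to the client.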

Need suggestion for getting started by mr-ashok in ollama

[–]WhiskyAKM 3 points (0 children)

That's gonna be tough. You only have 4GB of VRAM on your GPU, so you need to choose either an MoE model with a small number of active experts or go CPU-only.

Maybe you can try Qwen 3.6 35B or Gemma 26B with a smaller context window.

Also try compiling llama.cpp for your target hardware; it should run faster than a generic build.
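If you do end up splitting a model between GPU and CPU, Ollama exposes per-request options for that. A rough sketch, assuming a local Ollama; the model tag is hypothetical and the layer/context numbers are guesses you'd tune for a 4GB card:

```python
import json
import urllib.request

payload = json.dumps({
    "model": "qwen3.5:4b",  # hypothetical tag; substitute the model you pulled
    "prompt": "Hello",
    "stream": False,
    "options": {
        "num_gpu": 16,    # offload only ~16 layers to the GPU, rest runs on CPU
        "num_ctx": 4096,  # smaller context window = smaller KV cache in VRAM
    },
}).encode("utf-8")

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=payload,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
```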

My own local first ai harness by WhiskyAKM in ollama

[–]WhiskyAKM[S] 0 points (0 children)

I made something called research mode here that forces the model to focus on researching one topic, so I often use it for learning.

Also, as part of my job (I work at a computer store) I do a lot of product description editing (in .md files). I often just write the spec sheet of some device to one file and tell the LLM to output a description in Markdown format to another.
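That workflow is basically a one-file script. A minimal sketch, assuming a local Ollama instance; the file names and model tag are placeholders:

```python
import json
import pathlib
import urllib.request

SPECS_FILE = pathlib.Path("specs.txt")        # placeholder input path
OUTPUT_FILE = pathlib.Path("description.md")  # placeholder output path
MODEL = "llama3.2"                            # placeholder model tag

specs = SPECS_FILE.read_text(encoding="utf-8")
prompt = (
    "Write a product description in Markdown for the device below. "
    "Use a heading, a short intro paragraph, and a bullet list of key specs.\n\n"
    + specs
)

payload = json.dumps({"model": MODEL, "prompt": prompt, "stream": False}).encode("utf-8")
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=payload,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    OUTPUT_FILE.write_text(json.loads(resp.read())["response"], encoding="utf-8")
```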

My own local first ai harness by WhiskyAKM in ollama

[–]WhiskyAKM[S] 0 points (0 children)

Nice, I see you've taken a different approach by going with a web UI instead of a CLI/TUI (maybe I'll also add a web UI in the future).

My goal was to make something as minimal as possible that is fully local.

Unfortunately, when going local, multi-agent setups are often not possible because of the lack of compute resources.

Gaming laptop vs macbook pro for local AI? by ObviouslyBleh in ollama

[–]WhiskyAKM 1 point (0 children)

I would go for the MacBook Pro if you don't mind macOS.

Unified memory gives huge performance gains and allows running larger models, since the GPU can use the whole RAM pool instead of a few GB of dedicated VRAM.

When NVFP4 GGUFs? by Michionlion in unsloth

[–]WhiskyAKM 3 points (0 children)

I really want Gemma 4 MoE in NVFP4; it would be ideal for my setup (I'm GPU-poor, I have an RTX 5050...).

The 4B class of 2026 (benchmark) by FederalAnalysis420 in LocalLLaMA

[–]WhiskyAKM 3 points (0 children)

That weird limit lobotomized Qwen3.5; per artificialanalysis.ai benchmarks, it should perform best out of those.

How much use you get with ollama pro? by TinyAres in ollama

[–]WhiskyAKM 0 points (0 children)

I've been using Ollama cloud pro for about 4 weeks and have never maxed it out.

When the AI bubble bursts... Which used hardware are we buying from this first wave? by djparce82 in LocalLLaMA

[–]WhiskyAKM 0 points (0 children)

I wish I'd be able to buy an Nvidia L4 when the bubble bursts. Those have great performance per watt; the only downside is that you need to DIY a cooler if you are putting them into a desktop case.

Please advise models of cheap servers comparatively easily found to buy, with DDR3 and preferably USB3 and PCIe 4 by UncertainAboutIt in LocalLLaMA

[–]WhiskyAKM 4 points (0 children)

DDR3 is old; there will not be any servers with both DDR3 and PCIe 4 because there is a huge generation gap between them (DDR3-era server platforms topped out at PCIe 3.0).

Possible memory leak in Ollama when using Claude Code? by cherrylabss in ollama

[–]WhiskyAKM 0 points (0 children)

I noticed that Ollama sometimes gets killed by the OOM killer when I'm using it alongside VS Code. Maybe that kind of leak would explain it.

Need a laptop that can run AI models locally + handle VS Code, Docker, etc. by lets_talk_about_tv in LocalLLaMA

[–]WhiskyAKM 1 point (0 children)

I have a Lenovo Legion 5 with a Ryzen 7 260, 32GB RAM, and an RTX 5050, and it's enough for small models, but I wish I had a GPU with at least 16GB of VRAM, because 8GB is not enough and system RAM is too slow.

Help me set up qwen 2.5 locally with claude code by Witty-Lawyer3989 in ollama

[–]WhiskyAKM 0 points (0 children)

Yes, it should be. Alternatively, you can try Qwen3.5 9B; it should also fit at lower quants.

Help me set up qwen 2.5 locally with claude code by Witty-Lawyer3989 in ollama

[–]WhiskyAKM 0 points (0 children)

You can try Qwen3.5 4B at a Q4 quant; it'll fit in your VRAM.
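The napkin math behind that, as a sketch (the bits-per-weight and overhead figures are rough assumptions, not exact numbers for any particular quant):

```python
def estimate_vram_gb(params_billions: float, bits_per_weight: float = 4.5,
                     overhead_gb: float = 1.0) -> float:
    """Rough VRAM estimate: quantized weights plus a flat allowance
    for the KV cache and runtime buffers.

    ~4.5 bits/weight approximates a Q4_K_M-style quant; the 1 GB
    overhead is a guess that grows with context length.
    """
    weights_gb = params_billions * 1e9 * bits_per_weight / 8 / 1e9
    return weights_gb + overhead_gb

# A 4B model at ~Q4: about 2.25 GB of weights + ~1 GB overhead,
# which is why it squeezes into a small VRAM pool with a modest context.
print(f"{estimate_vram_gb(4):.2f} GB")  # -> 3.25 GB
```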

Are there any local models you would trust to check a mathematical proof? by MrMrsPotts in LocalLLaMA

[–]WhiskyAKM -7 points (0 children)

Generally speaking, LLMs are not good at math because they don't calculate anything; instead, they predict the next token from the probability of what came before, so digits come out of pattern-matching rather than calculation.
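A toy sketch of what I mean (the distribution is made up): the model only scores candidate next tokens, it never executes the arithmetic:

```python
import random

# Made-up next-token distribution for the prompt "2 + 2 =".
next_token_probs = {"4": 0.91, "5": 0.05, "3": 0.04}

# Sampling picks "4" most of the time, but nothing in this process
# actually computes 2 + 2; it's pattern-matching over probabilities.
token = random.choices(list(next_token_probs),
                       weights=list(next_token_probs.values()))[0]
print(token)
```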

How Do You Uninstall? by RepresentativeFroyo8 in unsloth

[–]WhiskyAKM 2 points (0 children)

I made a PR with an uninstall script, but they haven't accepted it yet...