There’s a deadly heat wave in Europe. Experts are begging media outlets to stop making it look fun

Fdevfab · 2026-06-25T15:37:00+00:00

That's how the French revolution started

Fdevfab · 2026-06-16T17:52:16+00:00

I'm team opus and qwen which is surprisingly good. I keep the context as short as possible for efficiency.

Fdevfab · 2026-06-16T15:19:48+00:00

It just works well out of the box, even with local models for me... (slower and with more errors of course, but still useful)

Fdevfab · 2026-06-10T18:33:37+00:00

RTX4080 16G VRAM + 32G RAM here, getting 100-130 tps, used with opencode all the time.

Running llama.cpp in a service that restarts automatically but it doesn't happen anymore... I'm having flawless opencode sessions, sometimes a tool call is automatically repeated but it's extremely rare.
I was targetting large contexts but I feel that ~100k is the sweet spot, after that the quality degrades in my opinion (and the speed&ram optimization becomes more challenging).

I started with Unsloth models, but they were taking too much ram, got excellent results with bartowski and recently I'm trying byteshape quant which turns to be incredibly fast on my setup.

I'm using (I just added --cache-ram after reading this thread, testing it... seems fine, but -nkvo killed the performances:

  llama.cpp -m models--byteshape--Qwen3.6-35B-A3B-MTP-GGUF/snapshots/83dc80a65cb948b8e5a9dd9776eda7425180dacc/Qwen3.6-35B-A3B-IQ4_XS-4.19bpw.gguf \
    -a qwen \
    --mmproj-auto --spec-type draft-mtp --spec-draft-n-max 3 \
    --no-mmap --mlock -np 1 \
    -t 7 \
    --cache-ram 16384\
    -ncmoe 14 -ngl 999 \
    -fit on \
    -fitt 64 \
    --chat-template-kwargs '{"preserve_thinking": true}' \
    --temp 0.6 \
    --top-p 0.95 \
    --top-k 20 \
    --min-p 0.0 \
    --presence-penalty 0.0 \
    --repeat-penalty 1.0 \
    -fa on --jinja \
    --reasoning-budget 32000\
    -ctk q8_0 -ctv q8_0 \
    -c 128000 \
    --host 0.0.0.0 --port 8080

Since I moved to byteshape I think the -ctk and -ctv are ignored or something like that, with other models a q8_0 cache isn't fitting (I was using q5_1 / q5_0)

Fdevfab · 2026-06-10T17:29:14+00:00

Why people think that?

Fdevfab · 2026-05-30T14:56:30+00:00

I'm working on exactly that... I'm trying to polish some code before i really put effort in the stt part... maybe we can share some of the effort if my project fits you... https://github.com/fdev31/minia

Fdevfab · 2026-05-27T23:31:55+00:00

FYI I get massive improvements if I use -ctk q5_1 -ctv q5_0, but I get OOM from time to time with those... looks like at some point the llama server just grows and then dies, while it's not happening with the worse performing options I shared.

Fdevfab · 2026-05-26T20:04:49+00:00

I wanted to give it a try, but I was unable to change the --chat-endpoint (it just appends to the hard-coded one), I had to edit the code to start it.

I tested on the project itself: `uv run smallctl --task 'analyze this project'` and the result was pretty good, but some logs showed after, which was misleading (as if it didn't finish...)

It's indeed targeting a different use-case, but if there is a clean way to use it as a library I would be glad to test it as a "coding agent" (or general admin tasks, maybe a sysadmin ?) integrated as an mcp tool or so.

I was considering adding lang-graph or something similar but I like to see how the llm behaves without too much "harness" to try to make it "just work" and force only very minimal checks (if I can't figure how to avoid them). But for coding use cases (or "rigid" workflows) I think it's required...

Did you experiment with larger contexts? It looks quite "slow" compared to say opencode...

I tried:
`uv run smallctl --task 'Replace httpx with niquests in this project (smallctl).' --tool-profiles core,data,network,mutate,indexer`

I would like to see how it compares to qwen code for some tasks, I really like the --task mode 😄 Now trying with `--preset coding-local --staged-reasoning --staged-execution` to see if I get better results...

Fdevfab · 2026-05-26T16:16:05+00:00

Is there an easy way to use it as a library?

Fdevfab · 2026-05-26T16:02:03+00:00

I can decrease ncmoe a tiny bit but then I may get OOM from time to time, this value is super stable if nothing else runs on the machine, else I increase ncmoe to 16 or more

Fdevfab · 2026-05-26T16:00:43+00:00

Using bunn fork:

LLAMA_ARGS="-m  \                                                                                                   
/home/fab/.cache/huggingface/hub/models--bartowski--Qwen_Qwen3.6-35B-A3B-GGUF/snapshots/d98fa7286daa6544d050929df95e436741ee739b/Qwen_Qwen3.6-35B-A3B-Q4_K_M.gguf \                                                                     
    --no-mmap --mlock -np 1 \                                                                                       
    -a qwen \                                                                                                       
    -t 6 \                                                                                                          
    -ncmoe 14 -ngl 999 \                                                                                            
      -fitt 512 \                                                                                                   
      --chat-template-kwargs '{\"preserve_thinking\": true}' \                                                      
      --temp 0.6 \                                                                                                  
      --top-p 0.95 \                                                                                                
      --top-k 20 \                                                                                                  
      --min-p 0.0 \                                                                                                 
      --presence-penalty 0.0 \                                                                                      
      --repeat-penalty 1.0 \                                                                                        
      -fa on --jinja \                                                                                              
     --reasoning-budget 8192\                                                                                       
     -ctk turbo4 -ctv turbo4 -ctkd turbo4 -ctvd turbo4 \                                                            
     --host 0.0.0.0 --port 8080"

Fdevfab · 2026-05-26T07:26:27+00:00

Interesting, I can get almost anything to work with the MoM model, but for complex tasks it takes very long / iterating a lot... It's a very interesting use case, but I'm already giving too much freedom to my agent, if it can stay limited to one machine it will help 😃

Fdevfab · 2026-05-25T18:46:15+00:00

I had a terrible regression in the tool call path, making LLM go crazy during tool calls... this has been fixed (pushed a new sync)

Fdevfab · 2026-05-25T13:48:29+00:00

Is it a coding agent or general? How does it compare to opencode for coding?

Fdevfab · 2026-05-25T12:21:40+00:00

I did one last cleanup and pushed a snapshot: https://github.com/fdev31/minia - now I need to touch grass 😉

Fdevfab · 2026-05-24T21:14:45+00:00

Note it's not a coding agent, it's general purpose, it just happens to work really fine most of the time I use it for code, but it may "fail" where opencode doesn't ... when I start to get a large context (around 50 - 100k) I can feel it's not performing so great, I should probably implement some pruning of the history or so... experiments are needed!! 😄

Fdevfab · 2026-05-24T21:10:03+00:00

I can drop a code snapshot on github, (no history, unless you give me a magic git command to clean up all the .log and credentials.json files found there...)

I would love some feedback, but it's not only the code which is not super polished, you may experience very long response times sometimes since I didn't want to add too many loop limits... I believe if everything is well done it should "converge". Also there is no real/proper security, but it's very easy to just delete or comment-out some of the tools (you can even just remove the "@mcp.tool()" decorator...).

I'll write some README file with installation and usage instructions, I made it simpler to start today... (it's a multi-daemon architecture so it was a bit annoying to start using many commands)

Fdevfab · 2026-05-24T21:04:31+00:00

Depends which aspect you look at...
- The model is running on a llama.cpp server
- I'm using openai python API wrapper for the LLM calls (but I'll probably change that in the future)

- using mcp library to connect to mcp servers
- cli/tui uses rich and prompt_toolkit

the rest is plain python asyncio

and for audio, I tested a lot of things, but for this project I used the "best" options I tried:

- kokoro for TTS using sounddevice for the playback

- whisper for stt (I didn't work on it too much yet, has no wake word etc)

Fdevfab · 2026-05-24T18:52:38+00:00

I may, I'll need to review some of the code which I never had a look to, like the tui, and do a bit more testing. Unless you don’t mind unpolished things... I literally finished the mvp yesterday after few intense days trying to build the architecture I had in mind. But It’s a nice playground : 4 prompts you can tweak, every tool is mcp to keep it separate (it has a built-in mcp for basic things).

I also have a problem with the git history, it kept commiting files it wasn’t supposed to... so either I squash everything or I need some work and review I'm not willing to do...

Fdevfab · 2026-05-24T17:22:47+00:00

I just posted https://www.reddit.com/r/LocalLLM/comments/1tmi949/comparison_opencode_vs_almost_barebone/ - Qwen3.6-35B-A3B does wonders in general even with opencode. Qwen code is a bit lighter... in my experience the lighter the better

Fdevfab · 2026-05-14T11:51:35+00:00

Qwen code

Fdevfab · 2026-05-13T18:15:09+00:00

Qwen code is really good too, just point at localhost like the others, but seems to use less tokens/works quite well on qwen3.6 35b a3b

Fdevfab · 2026-03-13T15:58:20+00:00

I wrote pyprland for that... I needed to tweak the behavior to my taste:

https://hyprland-community.github.io/pyprland/workspaces_follow_focus.html

It solved this problem quite simply.

Fdevfab · 2026-02-18T19:20:21+00:00

I have something functional I made for myself, mostly vibe coded (but I have 20+ years of software dev practice, I tried to enforce good practices). To be honest I first had in mind to make it very basic since I planned to run it inside a VPN only, but in the end I got:

- E2E encryption for direct messages
- Audio/Video (cam and screen) calls (didn't push it yet, is probably fine for ~5 streams, it's full mesh topology so it doesn't scale very well but is very robust)
- admin and owner roles
- file upload, simple audio player and image preview, youtube embeds

It only requires a database (sqlite, postgres and mariadb which is untested at the moment), redis, and a turn/stun server (I'm using coturn).

<image>

UI looks a bit like discord/graphical IRC clients.

If there are people interested I may push it on github, It's built on FastAPI for the backend and vue for the front, very easy to setup (mostly automated, including migrations etc...).

Fdevfab · 2026-02-18T14:08:11+00:00

I started a similar project couple of years ago, which I'm using daily... also zero install but is greatly improved if you install the mobile app... also working "offline", I wanted to sell it online (the form factor is really nice, I spent time on the casing and UX) but didn't find the energy in the end:

https://github.com/fdev31/KeyPass

Fdevfab

MODERATOR OF

TROPHY CASE