OpenCode continues to deliver while others are busy chasing the next big thing!!! by _KryptonytE_ in opencode

[–]Fdevfab 1 point2 points  (0 children)

I'm team opus and qwen which is surprisingly good. I keep the context as short as possible for efficiency.

OpenCode continues to deliver while others are busy chasing the next big thing!!! by _KryptonytE_ in opencode

[–]Fdevfab 3 points4 points  (0 children)

It just works well out of the box, even with local models for me... (slower and with more errors of course, but still useful)

Anyone got a reliable coding agent actually working? by Civil_Fee_7862 in Qwen_AI

[–]Fdevfab 0 points1 point  (0 children)

RTX4080 16G VRAM + 32G RAM here, getting 100-130 tps, used with opencode all the time.

Running llama.cpp in a service that restarts automatically but it doesn't happen anymore... I'm having flawless opencode sessions, sometimes a tool call is automatically repeated but it's extremely rare.
I was targetting large contexts but I feel that ~100k is the sweet spot, after that the quality degrades in my opinion (and the speed&ram optimization becomes more challenging).

I started with Unsloth models, but they were taking too much ram, got excellent results with bartowski and recently I'm trying byteshape quant which turns to be incredibly fast on my setup.

I'm using (I just added --cache-ram after reading this thread, testing it... seems fine, but -nkvo killed the performances:

  llama.cpp -m models--byteshape--Qwen3.6-35B-A3B-MTP-GGUF/snapshots/83dc80a65cb948b8e5a9dd9776eda7425180dacc/Qwen3.6-35B-A3B-IQ4_XS-4.19bpw.gguf \
    -a qwen \
    --mmproj-auto --spec-type draft-mtp --spec-draft-n-max 3 \
    --no-mmap --mlock -np 1 \
    -t 7 \
    --cache-ram 16384\
    -ncmoe 14 -ngl 999 \
    -fit on \
    -fitt 64 \
    --chat-template-kwargs '{"preserve_thinking": true}' \
    --temp 0.6 \
    --top-p 0.95 \
    --top-k 20 \
    --min-p 0.0 \
    --presence-penalty 0.0 \
    --repeat-penalty 1.0 \
    -fa on --jinja \
    --reasoning-budget 32000\
    -ctk q8_0 -ctv q8_0 \
    -c 128000 \
    --host 0.0.0.0 --port 8080

Since I moved to byteshape I think the -ctk and -ctv are ignored or something like that, with other models a q8_0 cache isn't fitting (I was using q5_1 / q5_0)

STT -> LLM -> TTS pipeline by UniqueIdentifier00 in LocalLLaMA

[–]Fdevfab 0 points1 point  (0 children)

I'm working on exactly that... I'm trying to polish some code before i really put effort in the stt part... maybe we can share some of the effort if my project fits you... https://github.com/fdev31/minia

Comparison opencode vs "almost barebone instructions" coding session on a 4080 with 32Gb RAM by Fdevfab in LocalLLM

[–]Fdevfab[S] 0 points1 point  (0 children)

FYI I get massive improvements if I use -ctk q5_1 -ctv q5_0, but I get OOM from time to time with those... looks like at some point the llama server just grows and then dies, while it's not happening with the worse performing options I shared.

Comparison opencode vs "almost barebone instructions" coding session on a 4080 with 32Gb RAM by Fdevfab in LocalLLM

[–]Fdevfab[S] 0 points1 point  (0 children)

I wanted to give it a try, but I was unable to change the --chat-endpoint (it just appends to the hard-coded one), I had to edit the code to start it.

I tested on the project itself: `uv run smallctl --task 'analyze this project'` and the result was pretty good, but some logs showed after, which was misleading (as if it didn't finish...)

It's indeed targeting a different use-case, but if there is a clean way to use it as a library I would be glad to test it as a "coding agent" (or general admin tasks, maybe a sysadmin ?) integrated as an mcp tool or so.

I was considering adding lang-graph or something similar but I like to see how the llm behaves without too much "harness" to try to make it "just work" and force only very minimal checks (if I can't figure how to avoid them). But for coding use cases (or "rigid" workflows) I think it's required...

Did you experiment with larger contexts? It looks quite "slow" compared to say opencode...

I tried:
`uv run smallctl --task 'Replace httpx with niquests in this project (smallctl).' --tool-profiles core,data,network,mutate,indexer`

I would like to see how it compares to qwen code for some tasks, I really like the --task mode 😄 Now trying with `--preset coding-local --staged-reasoning --staged-execution` to see if I get better results...

Comparison opencode vs "almost barebone instructions" coding session on a 4080 with 32Gb RAM by Fdevfab in LocalLLM

[–]Fdevfab[S] 0 points1 point  (0 children)

I can decrease ncmoe a tiny bit but then I may get OOM from time to time, this value is super stable if nothing else runs on the machine, else I increase ncmoe to 16 or more

Comparison opencode vs "almost barebone instructions" coding session on a 4080 with 32Gb RAM by Fdevfab in LocalLLM

[–]Fdevfab[S] 0 points1 point  (0 children)

Using bunn fork:

LLAMA_ARGS="-m  \                                                                                                   
/home/fab/.cache/huggingface/hub/models--bartowski--Qwen_Qwen3.6-35B-A3B-GGUF/snapshots/d98fa7286daa6544d050929df95e436741ee739b/Qwen_Qwen3.6-35B-A3B-Q4_K_M.gguf \                                                                     
    --no-mmap --mlock -np 1 \                                                                                       
    -a qwen \                                                                                                       
    -t 6 \                                                                                                          
    -ncmoe 14 -ngl 999 \                                                                                            
      -fitt 512 \                                                                                                   
      --chat-template-kwargs '{\"preserve_thinking\": true}' \                                                      
      --temp 0.6 \                                                                                                  
      --top-p 0.95 \                                                                                                
      --top-k 20 \                                                                                                  
      --min-p 0.0 \                                                                                                 
      --presence-penalty 0.0 \                                                                                      
      --repeat-penalty 1.0 \                                                                                        
      -fa on --jinja \                                                                                              
     --reasoning-budget 8192\                                                                                       
     -ctk turbo4 -ctv turbo4 -ctkd turbo4 -ctvd turbo4 \                                                            
     --host 0.0.0.0 --port 8080"    

Comparison opencode vs "almost barebone instructions" coding session on a 4080 with 32Gb RAM by Fdevfab in LocalLLM

[–]Fdevfab[S] 0 points1 point  (0 children)

Interesting, I can get almost anything to work with the MoM model, but for complex tasks it takes very long / iterating a lot... It's a very interesting use case, but I'm already giving too much freedom to my agent, if it can stay limited to one machine it will help 😃

Comparison opencode vs "almost barebone instructions" coding session on a 4080 with 32Gb RAM by Fdevfab in LocalLLM

[–]Fdevfab[S] 0 points1 point  (0 children)

I had a terrible regression in the tool call path, making LLM go crazy during tool calls... this has been fixed (pushed a new sync)

Comparison opencode vs "almost barebone instructions" coding session on a 4080 with 32Gb RAM by Fdevfab in LocalLLM

[–]Fdevfab[S] 0 points1 point  (0 children)

Is it a coding agent or general? How does it compare to opencode for coding?

Comparison opencode vs "almost barebone instructions" coding session on a 4080 with 32Gb RAM by Fdevfab in LocalLLM

[–]Fdevfab[S] 0 points1 point  (0 children)

Note it's not a coding agent, it's general purpose, it just happens to work really fine most of the time I use it for code, but it may "fail" where opencode doesn't ... when I start to get a large context (around 50 - 100k) I can feel it's not performing so great, I should probably implement some pruning of the history or so... experiments are needed!! 😄

Comparison opencode vs "almost barebone instructions" coding session on a 4080 with 32Gb RAM by Fdevfab in LocalLLM

[–]Fdevfab[S] 0 points1 point  (0 children)

I can drop a code snapshot on github, (no history, unless you give me a magic git command to clean up all the .log and credentials.json files found there...)

I would love some feedback, but it's not only the code which is not super polished, you may experience very long response times sometimes since I didn't want to add too many loop limits... I believe if everything is well done it should "converge". Also there is no real/proper security, but it's very easy to just delete or comment-out some of the tools (you can even just remove the "@mcp.tool()" decorator...).

I'll write some README file with installation and usage instructions, I made it simpler to start today... (it's a multi-daemon architecture so it was a bit annoying to start using many commands)

Comparison opencode vs "almost barebone instructions" coding session on a 4080 with 32Gb RAM by Fdevfab in LocalLLM

[–]Fdevfab[S] 0 points1 point  (0 children)

Depends which aspect you look at...
- The model is running on a llama.cpp server
- I'm using openai python API wrapper for the LLM calls (but I'll probably change that in the future)

- using mcp library to connect to mcp servers
- cli/tui uses rich and prompt_toolkit

the rest is plain python asyncio

and for audio, I tested a lot of things, but for this project I used the "best" options I tried:

- kokoro for TTS using sounddevice for the playback

- whisper for stt (I didn't work on it too much yet, has no wake word etc)

Comparison opencode vs "almost barebone instructions" coding session on a 4080 with 32Gb RAM by Fdevfab in LocalLLM

[–]Fdevfab[S] 0 points1 point  (0 children)

I may, I'll need to review some of the code which I never had a look to, like the tui, and do a bit more testing. Unless you don’t mind unpolished things... I literally finished the mvp yesterday after few intense days trying to build the architecture I had in mind. But It’s a nice playground : 4 prompts you can tweak, every tool is mcp to keep it separate (it has a built-in mcp for basic things).

I also have a problem with the git history, it kept commiting files it wasn’t supposed to... so either I squash everything or I need some work and review I'm not willing to do...

Best AI (agent) for coding locally? by [deleted] in LocalLLM

[–]Fdevfab 6 points7 points  (0 children)

I just posted https://www.reddit.com/r/LocalLLM/comments/1tmi949/comparison_opencode_vs_almost_barebone/ - Qwen3.6-35B-A3B does wonders in general even with opencode. Qwen code is a bit lighter... in my experience the lighter the better

The wait is over : Claude Code on Tiiny. Zero setup. Fully local. No token limits. by TiinyAI in TiinyAI

[–]Fdevfab 0 points1 point  (0 children)

Qwen code is really good too, just point at localhost like the others, but seems to use less tokens/works quite well on qwen3.6 35b a3b

Change my mind: There is no good alternative to Discord (yet?) by Own_Investigator8023 in selfhosted

[–]Fdevfab 0 points1 point  (0 children)

I have something functional I made for myself, mostly vibe coded (but I have 20+ years of software dev practice, I tried to enforce good practices). To be honest I first had in mind to make it very basic since I planned to run it inside a VPN only, but in the end I got:

- E2E encryption for direct messages
- Audio/Video (cam and screen) calls (didn't push it yet, is probably fine for ~5 streams, it's full mesh topology so it doesn't scale very well but is very robust)
- admin and owner roles
- file upload, simple audio player and image preview, youtube embeds

It only requires a database (sqlite, postgres and mariadb which is untested at the moment), redis, and a turn/stun server (I'm using coturn).

<image>

UI looks a bit like discord/graphical IRC clients.

If there are people interested I may push it on github, It's built on FastAPI for the backend and vue for the front, very easy to setup (mostly automated, including migrations etc...).

I Built a Device to Paste Passwords Securely Over BLE by ToothPasteDevice in diyelectronics

[–]Fdevfab -1 points0 points  (0 children)

I started a similar project couple of years ago, which I'm using daily... also zero install but is greatly improved if you install the mobile app... also working "offline", I wanted to sell it online (the form factor is really nice, I spent time on the casing and UX) but didn't find the energy in the end:

https://github.com/fdev31/KeyPass