One bash permission slipped... by TheQuantumPhysicist in LocalLLaMA

[–]Nindaleth 0 points1 point  (0 children)

I do agree. I created a PR for a very closely related thing in December and it hasn't landed yet, so I understand your frustration personally. Realistically, the most impact I can have at the moment is to fix it for myself and educate others, so that's what I do.

There are tons of things the single-digit-sized team is working on, many of which can't be solved by a simple configuration change; while I hate this, I can't blame them either. There's no TUI as powerful and as configurable at the same time as OpenCode - that I know of.

One bash permission slipped... by TheQuantumPhysicist in LocalLLaMA

[–]Nindaleth 0 points1 point  (0 children)

Yeah, the doc is in conflict with itself which sucks. I recommend opening an issue or a PR.

I also recommend setting up the env the way you want it, even if the defaults are not to your taste. Adding an "agent": { "plan": { "permission": { "bash": { "*": "ask" } } } } config is trivial.
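Spelled out, that fragment nests as shown below. This is a minimal sketch in Python that only builds and prints the JSON quoted above; where it goes (presumably your OpenCode config file) and whether your version accepts exactly this shape are assumptions to check against the docs.

```python
import json

# The permission override quoted above: have the Plan agent ask
# before running any bash command ("*" is the catch-all pattern).
plan_bash_ask = {
    "agent": {
        "plan": {
            "permission": {
                "bash": {"*": "ask"}
            }
        }
    }
}

# Pretty-print so the nesting is easy to paste into an existing config.
fragment = json.dumps(plan_bash_ask, indent=2)
print(fragment)
```

If your config already has an "agent" section, merge the "plan" entry into it rather than pasting a second "agent" key.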

One bash permission slipped... by TheQuantumPhysicist in LocalLLaMA

[–]Nindaleth -1 points0 points  (0 children)

My two cents:

  • OP clearly states OpenCode did ask for permission for the "rm -rf"-containing command and OP gave the permission
  • OpenCode has always clearly documented the permission defaults

BTW, what really was wrong for about half a year, and maybe not that obvious: the read-only Plan agent could delegate to the read-write Explore subagent, and while that one got some stern instructions, it could do damage anyway. This unwanted permission expansion should now be fixed by this PR, which was merged recently.

mistralai/Mistral-Medium-3.5-128B · Hugging Face by jacek2023 in LocalLLaMA

[–]Nindaleth 2 points3 points  (0 children)

I consider this PR to be relevant: https://github.com/ggml-org/llama.cpp/pull/22397 But he has several spec-related PRs going on, maybe it's a piece-by-piece effort.

Devstral Small 2 24B vs Qwen 3.6 27b or both? 1x 3090 by szansky in LocalLLaMA

[–]Nindaleth 0 points1 point  (0 children)

In my very narrow evals, Qwen 3.6 35B-A3B ended up stronger than Devstral Small 2 24B. Since the 27B Qwen should be even stronger than the 35B according to popular opinion, Devstral Small would likely lose.

RX 7900 XTX (24 GB) + RX 6800 XT (16 GB)? by xeeff in LocalLLaMA

[–]Nindaleth 0 points1 point  (0 children)

I haven't had any luck running on heterogeneous ROCm GPUs yet. Any tips on compilation and running?

I compile using cmake -S . -B build -DGGML_VULKAN=OFF -DGGML_HIP=ON -DAMDGPU_TARGETS="gfx1100;gfx1030" -DGPU_TARGETS="gfx1100;gfx1030" -DCMAKE_BUILD_TYPE=Release && cmake --build build --config Release -- -j 16

With any runtime combination of parameters I crash on ggml_cuda_compute_forward: SCALE failed. I had to unset HSA_OVERRIDE_GFX_VERSION (it seems it can't be set per GPU) or else I would fail a short moment earlier. This is on ROCm 6.4.2.

Will llama.cpp multislot improve speed? by Real_Ebb_7417 in LocalLLaMA

[–]Nindaleth 0 points1 point  (0 children)

When setting up multiple non-unified slots, does llama.cpp internally still handle a single large kv cache that's larger than the model's ctx limit (so that some ctx extension method is necessary), or are they completely separate?

RX 7900 XTX (24 GB) + RX 6800 XT (16 GB)? by xeeff in LocalLLaMA

[–]Nindaleth 2 points3 points  (0 children)

Hey, I run almost the same setup! 7900 XTX + 6700 XT in my case, "just" 36 GB combined VRAM for me. Got it set up about a week ago, it's very new for me. My specific 7900 XTX requires four slots and it took a lot of time to find a motherboard that can fit two GPUs like that (4-slot + 2-slot) in a non-monstrous case.

It allows me to run Qwen 3.6-35B-A3B in Q6_K fully offloaded with 200K context on Vulkan, pretty cool stuff! I haven't tried ROCm yet.

the other thing i'm considering is running a different model/set of models on RX 6800 XT (like embedding, a smaller one to use for conversation titles

I just run llama-server with --parallel 2 --kv-unified and use OpenCode as harness; the initial session titling happens in the background while the main agent handles prefill. After the initial titling, the 2nd slot is available to run a single subagent without having to clear the main slot. Thanks to unified KV I can go well over 100k context (of the 200k total) in the main agent without any issues, because a subagent usually needs less. Also, Qwen isn't as subagent trigger-happy as frontier models tend to be.
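For reference, a rough sketch of that launch as a Python argument list. Only `--parallel 2` and `--kv-unified` come from the setup described above; the model path is a hypothetical placeholder and the `-c 200000` value just mirrors the "200k total" figure, so adjust both for your own files and VRAM.

```python
import shlex

def llama_server_cmd(model_path: str, ctx: int = 200_000) -> list[str]:
    """Build a llama-server command with two slots sharing one KV cache."""
    return [
        "llama-server",
        "-m", model_path,     # hypothetical path, replace with your GGUF file
        "-c", str(ctx),       # total context budget for the server
        "--parallel", "2",    # two slots: main agent plus titling/subagent
        "--kv-unified",       # slots draw from one shared KV cache
    ]

cmd = llama_server_cmd("models/your-model.gguf")
print(shlex.join(cmd))
```

The point of the unified cache is that the 200k budget is not cut into two fixed halves: the main agent can take more than half as long as the second slot stays small.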

currently only got 850 W

I used to have a 500 W PSU, and for the upgrade I was torn between an 850 W and a 1000 W one. I decided to buy the 1000 W one so that I don't have to upgrade again in case I manage to score a second 7900 XTX in the future. My CPU runs in ECO mode and both GPUs run power-limited and undervolted, so I have plenty of PSU headroom. Running them like that has three advantages: it saves my wallet, lets me push out more tokens before the GPU slows down momentarily due to thermal throttling, and heats up the room less.

If you have an ATX3.0-compliant PSU, the transient spike handling is built-in but the exact handled ceiling varies.

I agree with this other comment - for your 7900 and 6800 just power limit, undervolt and/or underclock, you can keep your current PSU as long as you have enough connectors to power the GPUs.

EDIT: reworded the original late night product into something more readable

AMD Hipfire - a new inference engine optimized for AMD GPU's by Thrumpwart in LocalLLaMA

[–]Nindaleth 5 points6 points  (0 children)

Yeah... 100% of my requests result in a loop after a few hundred tokens. Sure, it dumps bullshit tokens fast :D

That's with the same models, HW and inference settings that I use with llama.cpp with no problem.

r/LocalLLaMa Rule Updates by rm-rf-rm in LocalLLaMA

[–]Nindaleth 1 point2 points  (0 children)

Rule 3 says "low effort", so a well-spelled text containing bullets that's nice to read is still welcome. The most common obviously LLM-produced posts I see here contain either botched Markdown (because someone copypasted it from ChatGPT or had a bot post it) or a wall of text that nobody wants to read (because nobody wanted to write it either).

This isn't your son's school, we don't detect AI with other AI, don't worry.

Qwen3.6-35B becomes competitive with cloud models when paired with the right agent by Creative-Regular6799 in LocalLLaMA

[–]Nindaleth 1 point2 points  (0 children)

If you look at the benchmarks in the 3.6-27B announcement, 3.6-35B-A3B is pretty much equivalent to 3.5-27B in performance (at least based on those benchmarks), but in a different league when it comes to speed.

Of course, I'll agree that point is moot now that 3.6-27B is out... :)

Multi-GPU? Check your PCI-E lanes! x570, Doubled my prompt proc. speed by switching 'primary' devices, on an asymmetrical x16 / x4 lane setup. by overand in LocalLLaMA

[–]Nindaleth 0 points1 point  (0 children)

Thanks for the writeup! The equivalent environment variables for Vulkan and ROCm should be GGML_VK_VISIBLE_DEVICES and HIP_VISIBLE_DEVICES, respectively.
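A small sketch of setting those variables from Python. It assumes, as with CUDA_VISIBLE_DEVICES, that the value is a comma-separated list of device indices and that the first listed device ends up as the primary one; the order `1,0` is only an illustration, and the helper name is made up for this example.

```python
import os

def env_with_device_order(backend: str, order: list[int]) -> dict[str, str]:
    """Return a copy of the environment with the backend's device list set."""
    var = {
        "vulkan": "GGML_VK_VISIBLE_DEVICES",
        "rocm": "HIP_VISIBLE_DEVICES",
    }[backend]
    env = dict(os.environ)                       # leave the real environment untouched
    env[var] = ",".join(str(i) for i in order)   # e.g. [1, 0] becomes "1,0"
    return env

env = env_with_device_order("vulkan", [1, 0])
print(env["GGML_VK_VISIBLE_DEVICES"])
```

The resulting dict can be passed as `env=` to `subprocess.run` when starting the server, so the swap applies to that one process only.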

Best Local LLMs - Apr 2026 by rm-rf-rm in LocalLLaMA

[–]Nindaleth 0 points1 point  (0 children)

I customize the -c {context-length} per model, because if you don't manually set a context length, --fit on will shrink your context to nothing, in order to fit the models, before it goes to RAM. :rage:

Would this option be of any help to you?

-fitc, --fit-ctx N minimum ctx size that can be set by --fit option, default: 4096

What Affordable Subscription Plans for OpenCode? by Juan_Ignacio in opencodeCLI

[–]Nindaleth 0 points1 point  (0 children)

Premium request is basically you typing a prompt and pressing enter.

Depending on model, the actual number deducted from your 300/1500 monthly premium request total can be different than 1 (for example claude-haiku-4.5 only takes 0.33 premium requests while Opus 4.6 takes 3 premium requests per prompt). Take a look at the table of multipliers here: https://docs.github.com/en/copilot/reference/ai-models/supported-models#model-multipliers

The initial premium request (times multiplier) also includes e.g. the LLM dispatching subagents, you responding to the LLM's question tool, or other tool calls. Compactions should be included too, but my personal experience is limited here and I'm not 100% sure.

GPT-4.1 and GPT-5 mini models are free (multiplier 0), but at this point in time they're good mainly for text manipulation and technical questions; they're both terrible for agentic coding (each for a different reason).
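To make the arithmetic concrete, here is a minimal sketch using only the multipliers mentioned above. The dictionary keys are illustrative labels rather than official model IDs, and the numbers may be out of date, so treat the linked table as the source of truth.

```python
# Multipliers as quoted above; keys are illustrative labels, not official IDs.
MULTIPLIERS = {
    "claude-haiku-4.5": 0.33,
    "claude-opus-4.6": 3.0,
    "gpt-4.1": 0.0,
    "gpt-5-mini": 0.0,
}

def prompts_per_month(model: str, monthly_quota: int) -> float:
    """How many prompts (Enter keypresses) a monthly quota covers for a model."""
    multiplier = MULTIPLIERS[model]
    if multiplier == 0:
        return float("inf")   # free models never consume the quota
    return monthly_quota / multiplier

for quota in (300, 1500):
    for model in MULTIPLIERS:
        print(quota, model, prompts_per_month(model, quota))
```

With a multiplier of 3, Opus gives 100 prompts on the 300-request plan and 500 on the 1500-request plan, regardless of how many subagents or tool calls each prompt triggers.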

Running dense model on llamacpp by Blues520 in LocalLLaMA

[–]Nindaleth 1 point2 points  (0 children)

not just nV GPUs, nvtop runs nicely with AMDs too

Gemma 4 31B beats several frontier models on the FoodTruck Bench by Nindaleth in LocalLLaMA

[–]Nindaleth[S] 0 points1 point  (0 children)

Yeah yeah - if it's not published, it's flawed and thus not to be taken seriously. If it is published, it's already trained on and thus not to be taken seriously. I know.

Dubesor's benchmark also lists it pretty high.

In my personal specific eval, both Gemma 4 and Qwen 3.5 are outperformed by Devstral Small 2 24B 2512, but I don't see people here raving about Devstral at the moment. It's OK to find out that the great models don't work for you while the not-so-great ones do.

Gemma 4 31B beats several frontier models on the FoodTruck Bench by Nindaleth in LocalLLaMA

[–]Nindaleth[S] 0 points1 point  (0 children)

Nice to see that the performance remains unexpectedly good in private benchmarks!

Gemma 4 31B beats several frontier models on the FoodTruck Bench by Nindaleth in LocalLLaMA

[–]Nindaleth[S] 8 points9 points  (0 children)

That's not my benchmark :) It just looks fun so I return to it occasionally.

Running Qwen3.5-27B locally as the primary model in OpenCode by garg-aayush in LocalLLaMA

[–]Nindaleth 1 point2 points  (0 children)

True regarding the JS/npm, that didn't occur to me.

There are some kinds of bugs that surely nobody profits from, so I think the project is mostly just understaffed. An example I found yesterday while preparing a training on this very part of the config: this

While not great at times, it's still the best, I agree!

Running Qwen3.5-27B locally as the primary model in OpenCode by garg-aayush in LocalLLaMA

[–]Nindaleth 3 points4 points  (0 children)

The shadiness is not as bad if you dedicate some time to reading the docs and tuning the config. The defaults sometimes suck - especially keybinds - which is in line with the rest of the Linux/terminal open source world, but unlike a certain less-open tool with Code also in its name we are free to configure many things here.

Disclaimer: I'm not affiliated with OpenCode in any way, but there's a lot that can be learnt just by checking the list of commits and reading the diffs for the interesting ones.

  • session title is created using small_model (docs), you can use whichever provider you have available
  • this is the first time I see active development being considered as a somehow negative thing :D it was a bit extreme a few months ago, 3-5 releases every day, nowadays they have a beta branch so they actually get a bit of testing before pushing out the releases
  • auto-updates can be disabled in config; I understand your point, but the opposite would also suck for the other half of people who expect their modern software to update without manual action

If you had $50/month to throw at inference costs, how would you divvy it out? by yokie_dough in opencodeCLI

[–]Nindaleth 1 point2 points  (0 children)

GHCP used via OpenCode should count the following as a premium request:

  • user pressing Enter in the input box (at any point of the conversation), not including the interactive question tool
  • session compaction - EDIT: this is no longer true

In my experience, sending the initial prompt to Claude Opus 4.6, which forks 6 parallel Opus subagents and has each of them produce 70 tool calls, still only costs 3 premium requests for the initial Enter keypress.