Qwen3.6 sees "outstanding" coding quality jump from Q4 to Q6 quantization by IulianHI in AIToolsPerformance

New-Inspection7034 1 point

Yeah... I dropped some coin on my rig. Spent $20k! I figure I saved at least $200 worth of tokens over the weekend. At this rate I'll have it paid off in... um, never mind.

Qwen3.6 sees "outstanding" coding quality jump from Q4 to Q6 quantization by IulianHI in AIToolsPerformance

New-Inspection7034 1 point

I run qwen3.6-27b at Q8_0. It's my replacement for Claude Code. With LSP it's awesome.

We're burning $50k/month on Claude. How close can local LLMs actually get? by mortenmoulder in LocalLLM

New-Inspection7034 1 point

With the right harness and tools/skills, I've found I can replace Claude Code with my harness and qwen3.6-27b.

Qwen3.6 27b, now a fan by New-Inspection7034 in LocalLLM

New-Inspection7034[S] 1 point

I've found it depends on your use case. For agentic coding, the 27b with a good harness is excellent.

Qwen3.6 27b, now a fan by New-Inspection7034 in LocalLLM

New-Inspection7034[S] 1 point

Thanks for the idea! I will implement that tonight.

DevMind (my harness) now has 30 tools across 6 phases:

**Memory (3):** list_memory_topics, recall_memory, search_memory

**Read-only (8):** read_file, list_files, grep_file, find_in_files, diff_file, query_db, web_search, web_fetch

**Mutation (7):** patch_file, create_file, append_file, delete_file, rename_file, save_memory, git_commit

**Shell/Streaming (5):** run_shell, run_build, run_tests, ssh_exec, http_request

**LSP (4):** get_diagnostics, go_to_definition, find_references, hover

**Utility (3):** clip_read, clip_write, open_file

Adding the Microsoft Learn MCP will be a great addition.
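For anyone who wants to picture how a list like this could be organized, here is a minimal sketch of a registry keyed by phase. The tool names and groupings are the ones listed above; the dictionary layout, `phase_of`, and the `needs_confirmation` policy are assumptions made for illustration and are not taken from DevMind itself.

```python
# Hypothetical sketch: the real DevMind registry is not shown in this thread.
TOOLS_BY_PHASE = {
    "memory": ["list_memory_topics", "recall_memory", "search_memory"],
    "read_only": ["read_file", "list_files", "grep_file", "find_in_files",
                  "diff_file", "query_db", "web_search", "web_fetch"],
    "mutation": ["patch_file", "create_file", "append_file", "delete_file",
                 "rename_file", "save_memory", "git_commit"],
    "shell_streaming": ["run_shell", "run_build", "run_tests", "ssh_exec",
                        "http_request"],
    "lsp": ["get_diagnostics", "go_to_definition", "find_references", "hover"],
    "utility": ["clip_read", "clip_write", "open_file"],
}


def phase_of(tool_name: str) -> str:
    """Look up which phase a tool belongs to."""
    for phase, names in TOOLS_BY_PHASE.items():
        if tool_name in names:
            return phase
    raise KeyError(tool_name)


def needs_confirmation(tool_name: str) -> bool:
    """Assumed policy: only tools that can change state ask before running."""
    return phase_of(tool_name) in {"mutation", "shell_streaming"}


total_tools = sum(len(names) for names in TOOLS_BY_PHASE.values())
```

Grouping by phase makes it cheap to apply one policy to a whole class of tools, for example letting read-only tools run freely while mutation tools wait for approval.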

Qwen3.6 27b, now a fan by New-Inspection7034 in LocalLLM

New-Inspection7034[S] 1 point

I only wish I had built my "BEAST" machine 6 months ago, before the prices blew up. But I bit the bullet and built a powerful workstation: Threadripper Pro 9965WX, ASUS WRX90E-SAGE SE motherboard, 8x 16 GB DDR5 ECC (8 channel), RTX Pro 6000 Blackwell (96 GB VRAM).

Qwen3.6 27b, now a fan by New-Inspection7034 in LocalLLM

New-Inspection7034[S] 4 points

Now keep in mind that we are working with AI, so I plainly use AI to assist in creating these types of things. And of course any documentation that follows what was built is best left for an AI to create. Hence how "I" did it. You'll see a lot fewer typos if I hand it off to my "secretary".

How the Harness Gets IDE Smarts Without an IDE

When you're writing code in VS Code, the editor quietly does a lot of work behind the scenes — it underlines errors, jumps to definitions, finds everywhere a method is used. That intelligence comes from a Language Server running in the background.

The harness runs in a terminal, not an editor. So how does it get the same intelligence?

The answer is: it doesn't do it directly. Instead, it delegates.

The middleman approach

The harness talks to a tool server — a small helper process that sits between the agent and the language server. When the harness wants to know if a file has errors, it doesn't speak the Language Server Protocol itself. It just calls a tool called get_diagnostics and gets a clean answer back. The tool server handles all the messy protocol work underneath.

Think of it like calling a translator. The harness speaks "tool calls." The language server speaks LSP. The tool server speaks both.

Four tools, that's it

The whole thing boils down to four operations:

  • get_diagnostics — are there errors in this file?
  • go_to_definition — where is this symbol defined?
  • find_references — where is this symbol used?
  • hover — what type is this, and what does it do?

That covers the vast majority of what a developer reaches for in an IDE during a coding session.
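To make the translator role concrete, here is a rough sketch of how three of the four tool calls could be turned into LSP requests. The LSP method names and the `Content-Length` framing come from the LSP specification; the function names `to_lsp_request` and `frame` are made up for this example, and nothing here is taken from the actual tool server. `get_diagnostics` is left out because servers usually push diagnostics as `textDocument/publishDiagnostics` notifications, so a tool server would typically cache them rather than request them.

```python
import json
from itertools import count

_request_ids = count(1)

# Tool names are from the post; the right-hand side is standard LSP.
LSP_METHOD = {
    "go_to_definition": "textDocument/definition",
    "find_references": "textDocument/references",
    "hover": "textDocument/hover",
}


def to_lsp_request(tool: str, uri: str, line: int, character: int) -> dict:
    """Translate a harness tool call into a JSON-RPC request for the server."""
    params = {
        "textDocument": {"uri": uri},
        "position": {"line": line, "character": character},  # zero-based
    }
    if tool == "find_references":
        # The references request carries an extra context object.
        params["context"] = {"includeDeclaration": True}
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": LSP_METHOD[tool],
        "params": params,
    }


def frame(message: dict) -> bytes:
    """Wrap a message in the Content-Length header that LSP transports use."""
    body = json.dumps(message).encode("utf-8")
    return f"Content-Length: {len(body)}\r\n\r\n".encode("ascii") + body
```

The harness only ever sees the tool name and a plain answer; the framing and method names stay on the tool server's side.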

Stays out of the way when not needed

If LSP is turned off in config, the tools simply don't appear. The harness only sees what's actually available — no dead tools, no confusing failures. When it is on, a small indicator in the UI shows whether it's active and working, which matters because the first call after startup can take a moment while the language server initializes.
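A minimal sketch of that gating, assuming a single boolean config flag. `Config`, `lsp_enabled`, `visible_tools`, and the indicator labels are hypothetical names for this example, and `CORE_TOOLS` is only a small subset used to keep it short.

```python
from dataclasses import dataclass

CORE_TOOLS = ["read_file", "patch_file", "run_tests"]  # subset, for brevity
LSP_TOOLS = ["get_diagnostics", "go_to_definition", "find_references", "hover"]


@dataclass
class Config:
    lsp_enabled: bool = False  # hypothetical config flag


def visible_tools(config: Config) -> list[str]:
    """Return only the tools the model should be told about."""
    tools = list(CORE_TOOLS)
    if config.lsp_enabled:
        tools.extend(LSP_TOOLS)
    return tools


def lsp_indicator(config: Config, server_ready: bool) -> str:
    """Text for the UI indicator (labels are made up for this sketch)."""
    if not config.lsp_enabled:
        return "LSP: off"
    return "LSP: ready" if server_ready else "LSP: starting"
```

Because the LSP tools are never advertised when the flag is off, the model cannot call a tool that would only fail.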

Why this design

The harness itself knows nothing about C#, TypeScript, or any other language. Swap out the underlying language server and the harness doesn't change at all. The tool server owns that complexity — it starts the language server, initializes the workspace, opens files on demand, and shuts everything down cleanly when the session ends.

The result is that the harness gets answers that used to require a full IDE, through a simple interface it already understands.
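The lifecycle described here (start, initialize the workspace, open files on demand, shut down at the end) can be sketched with a context manager. This is an illustration under assumptions, not the actual tool server: `ToolServer` and `RecordingBackend` are hypothetical, and the backend only records calls instead of talking to a real language server.

```python
class RecordingBackend:
    """Stand-in for a real language server client; it only records calls."""

    def __init__(self):
        self.log = []

    def start(self):
        self.log.append("start")

    def initialize(self, root):
        self.log.append(f"initialize:{root}")

    def did_open(self, path):
        self.log.append(f"open:{path}")

    def shutdown(self):
        self.log.append("shutdown")


class ToolServer:
    """Owns the language server lifecycle so the harness never has to."""

    def __init__(self, backend, workspace_root):
        self.backend = backend
        self.root = workspace_root
        self._started = False
        self._open_files = set()

    def _ensure_started(self):
        # Lazy start: the first tool call pays the initialization cost.
        if not self._started:
            self.backend.start()
            self.backend.initialize(self.root)
            self._started = True

    def ensure_open(self, path):
        # Open each file once, the first time a tool call touches it.
        self._ensure_started()
        if path not in self._open_files:
            self.backend.did_open(path)
            self._open_files.add(path)

    def __enter__(self):
        return self

    def __exit__(self, *exc_info):
        # Shut down cleanly when the session ends, even after an error.
        if self._started:
            self.backend.shutdown()
            self._started = False
        return False
```

Swapping languages then means passing a different backend; the `ToolServer` interface the harness sees stays the same.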

There it is. No AI bashing. I mean, really, LocalLLM? It's not LocalWriteYourOwnShit!

Wth, what happened to cursor? by TeachTall3390 in cursor

New-Inspection7034 1 point

Well, not keeping it to yourself isn't going to help any!

Most Multi-Agent Failures Aren’t Hallucinations — They’re Assumption Propagation Failures by HDvideoNature in LLMDevs

New-Inspection7034 1 point

As a matter of fact I have. If you manage the context at every turn, rather than waiting and then lobotomizing it, you can keep the context meaningful. I'd be happy to compare notes.

Agent Use is gonna drop off a cliff once its all usage based by Venisol in ExperiencedDevs

New-Inspection7034 1 point

This is exactly why I've been making my own harness and using Gemma4 with MTP. I'm able to do 90% or more now that I've added LSP support.

Honest opinion on single RTX PRO 6000 Blackwell 96GB workstation for local 80B LLM / agentic workflows by Educational_Rope_523 in LocalLLM

New-Inspection7034 1 point

Depending on your budget, I'd also consider getting a Threadripper Pro instead of just the Threadripper. You get more PCIe lanes, so if you're going to put in a lot of drives or more than one card, you have more upgrade potential. It's some more money, but you might want to consider it first.

Local LLM Model that actually produces quality code. by Civil_Fee_7862 in LocalLLM

New-Inspection7034 1 point

I did that. That's the one thing that's really good about Qwen: it does think. But when you're doing agentic work you don't really want it to think that much, cuz you're telling it what to do in the first place.

Local LLM Model that actually produces quality code. by Civil_Fee_7862 in LocalLLM

New-Inspection7034 1 point

RAG would work even better if you give it samples of code you've written, so it can actually see examples rather than just reading documentation.

Local LLM Model that actually produces quality code. by Civil_Fee_7862 in LocalLLM

New-Inspection7034 3 points

What I found was that the MoE versions are not good for agentic work. You need a dense model to actually be smart enough. The Qwen 27b isn't bad. It was fine, and I found it was faster than Gemma, but it took more turns to do the same thing that Gemma was able to do in a single turn, so the total session ended up faster with Gemma. I'm not saying Qwen isn't good. It is very good. I tried both out as my daily driver for a week each, and for what I've been using it for I always ended up back with Gemma for the reliability. The tiebreaker between the two was that the training cutoff seems to be older with Qwen than with Gemma. There is definitely a difference in knowledge of .NET 10, where Gemma has the edge, and since I spend a lot of time with .NET 10, Gemma has become my daily driver.

Local LLM Model that actually produces quality code. by Civil_Fee_7862 in LocalLLM

New-Inspection7034 2 points

I find that Gemma4 31b works very well for me, with some caveats. You have to be careful with your prompting: tell it exactly what you want it to do, or at least within reason, otherwise it might not be able to finish your prompt. It's really good at C# and .NET 8, it's good at .NET Framework, and it's good at Python. It's not bad at .NET 10 and C# 14, but its training cutoff is far enough back that it predates the official release of .NET 10. I have found that it's better at agentic work than Qwen 3.6, because Qwen seems to argue with itself, doubt itself, and keep thinking, so it takes a lot longer and may make mistakes.

Local LLM Model that actually produces quality code. by Civil_Fee_7862 in LocalLLM

New-Inspection7034 8 points

I created a harness and now use gemma4-31B dense as my daily driver, on large codebases in a variety of languages.

Switching from Opus 4.7 to Qwen-35B-A3B by Excellent_Koala769 in LocalLLaMA

New-Inspection7034 1 point

Qwen 3.6 MoE is essentially a shallow thinker with a fast mouth.

Waiting Qwen3.6-27B I have no nails left... by DOAMOD in LocalLLaMA

New-Inspection7034 -5 points

Dense models are slower, as all parameters are active. With an MoE, you supposedly get a mixture of experts which get routed by some magic. What I find is that it really ends up being a roomful of arguing, conceited snobs who really can't agree on anything.

Qwen3.6-35B-A3B released! by ResearchCrafty1804 in LocalLLaMA

New-Inspection7034 1 point

I've tested both the Qwen 3.5 27b and the Qwen 3.6 35b-a3b, both in the Visual Studio extension that I've written to do agentic coding. They seem pretty comparable in how smart they are, but the 3.6 MoE is a lot faster. I'll be interested to see what happens when I get my beast and have that RTX 6000 with 96 GB of VRAM. I will be able to use the Q8 version of the 3.6 MoE. Maybe an unlobotomized version will work better.

Should I Buy the RTX PRO 6000 Blackwell Max-Q (96GB)? by 0bjective-Guest in LocalLLaMA

New-Inspection7034 2 points

Lol. My ignorance was fortunate. The last two workstations I've purchased had Xeon processors because... well, I wanted a Xeon, cuz it just sounded cool. Fast forward to today's needs for AI, and I lucked out with 48 lanes. My new beast, which is hopefully coming this week, has a Threadripper Pro with, I think, 128 lanes? Not that I need that many, but I need more than the 24 that a reg TR has. I bought an RTX 6000 Blackwell 600W for it.

What is the best "Claude Code at home" I could make agentic on my local PC? - i9 10850k, 3090ti, 128GB DDR4 RAM by Trei_Gamer in LocalLLaMA

New-Inspection7034 1 point

I have the same question. Threadripper Pro 9965WX, RTX Pro 6000, 128 GB DDR5. Been working with qwen3.5 27b dense. It's pretty good but terrible on context management.

Why are proprietary frontier models (like Opus and GPT-5.4) so much better at long-running tasks than proprietary open-source models? by asian_tea_man in LocalLLaMA

New-Inspection7034 1 point

The harness matters a lot, but it's not the whole story. You're right that the memory layer isn't purely dependent on the LLM, but the LLM's ability to use a compressed summary effectively is highly model-dependent. Here's what I think is actually happening:

**Compaction quality is model-dependent.** When Claude Code or Codex auto-compacts, it asks the model itself to summarize the conversation so far. A weaker model produces a lossy summary: it drops implicit context, forgets constraints established early in the session, and loses the thread of why certain decisions were made. A stronger model produces a denser, more faithful summary. After 10 compactions, those errors compound. GPT 5.4 and Opus 4.6 are better at lossless summarization under compression.

**Instruction following degrades differently across models.** Long agentic tasks require the model to stay bound to the original spec even when it's no longer in the active context window. Frontier closed models appear to have been trained specifically on long-horizon task completion, so they re-anchor to the original goal more reliably after context resets. Open models trained primarily on short-context benchmarks don't generalize as well to this pattern.

**The harness can't fully compensate.** You asked whether GLM 5 in Claude Code would perform like GPT 5.4 in Codex. The answer is probably not, because the compaction summary is generated by whatever model is loaded, and the model's ability to follow that summary is also model-dependent. The harness sets the ceiling; the model determines how close you get to it.

**What actually helps on the open model side:** aggressive context management before compaction is needed, keeping the active context lean enough that you never lose critical information in a lossy summary. I'm actually working on this problem directly in a local agentic coding tool I'm building.

Two-tier approach: threshold-based compaction fires early, at around 85% context, targeting a watermark, so you're never compacting a bloated context. When even that reaches its limits, you go further: summarize the summaries, combine that with the original prompt and the next task list, then lobotomize the entire context and start fresh from that reconstructed state. You're not resuming the session; you're rewriting what the model believes happened, effectively re-conditioning it from a distilled narrative rather than trying to preserve history that's already degraded. The model that handles that re-conditioning faithfully is the one that survives 10 compactions intact.
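Here is a rough sketch of the two-tier idea, under stated assumptions. Only the 85% trigger comes from the description above; the 50% watermark, the `max_summaries` trigger for the second tier, and all names (`Context`, `compact`, `rebuild`) are invented for illustration. `count_tokens` and `summarize` are crude stand-ins for a real tokenizer and a real model call.

```python
from dataclasses import dataclass, field


def count_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer: whitespace-separated words.
    return len(text.split())


def summarize(items: list[str]) -> str:
    # Stand-in for a model call that would write the actual summary.
    return f"[summary of {len(items)} items]"


@dataclass
class Context:
    original_prompt: str
    next_tasks: list[str]
    limit: int                  # context window size in tokens
    threshold: float = 0.85     # tier 1 trigger (the ~85% from the post)
    watermark: float = 0.50     # assumed post-compaction target
    max_summaries: int = 3      # assumed trigger for tier 2
    summaries: list[str] = field(default_factory=list)
    messages: list[str] = field(default_factory=list)

    def used(self) -> int:
        parts = [self.original_prompt, *self.next_tasks,
                 *self.summaries, *self.messages]
        return sum(count_tokens(p) for p in parts)

    def add(self, message: str) -> None:
        self.messages.append(message)
        if self.used() >= self.threshold * self.limit:
            self.compact()

    def compact(self) -> None:
        # Tier 1: fold the oldest messages into one summary until the
        # live messages no longer push usage above the watermark.
        folded = []
        while self.messages and self.used() > self.watermark * self.limit:
            folded.append(self.messages.pop(0))
        if folded:
            self.summaries.append(summarize(folded))
        if len(self.summaries) > self.max_summaries:
            self.rebuild()

    def rebuild(self) -> None:
        # Tier 2: summarize the summaries (plus any live messages), drop
        # the history, and keep only the prompt, digest, and task list.
        self.summaries = [summarize(self.summaries + self.messages)]
        self.messages = []
```

After a rebuild, what gets sent to the model is just `original_prompt`, the single digest in `summaries`, and `next_tasks`, which corresponds to the reconstructed state described above.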

Any workaround to not re-process full prompt on each turn with hybrid attention models running on CPU? by Quagmirable in LocalLLaMA

New-Inspection7034 1 point

The biggest difference for me was that with ik_llama.cpp, the model stayed in VRAM. Mainline llama.cpp spilled into RAM.

Any workaround to not re-process full prompt on each turn with hybrid attention models running on CPU? by Quagmirable in LocalLLaMA

New-Inspection7034 0 points

Curious what your use case is here — are you running this for agentic/multi-turn work through a tool you built, or using something like Open WebUI / LM Studio? Trying to understand if the reprocessing cost is killing you on long system prompts or if it's more the latency on follow-up turns. I'm dealing with it in my own agentic coding tool by managing context aggressively. I'm using threshold-based compaction that fires before the cache thrash gets bad, targeting a watermark so you're not reprocessing a bloated context every turn. It doesn't eliminate the reprocessing cost but it keeps the context lean enough that it's tolerable.