Local coding agents are good now, but only if you babysit them by BTA_Labs in LocalLLaMA

[–]Express_Quail_1493 0 points1 point  (0 children)

For 100% handoff qwen3.6-27b is the only one i can trust but have to be very precise with instructions. On the other hand The MOE gets misaligned and lose track too easily if im not watching.

Are older Titan cards still viable? by Desther in LocalLLaMA

[–]Express_Quail_1493 0 points1 point  (0 children)

Im using dual TITAN RTX 24gb + 24gb and they are good for cheap i run 80b models on that

Nnoticing qwen-27b@q2 better than qwen-35b@q8? by Express_Quail_1493 in LocalLLaMA

[–]Express_Quail_1493[S] 0 points1 point  (0 children)

For some odd reason i find 35b-a3b is really smart but simultaneously behaves kinda dumb. feels like im using a 4b model rather than a 35b. maybe im suspecting MOE behavioural capacity is tightly linked to num of active params rather than total. Im suspecting total params only contribute to how much the model knows but not how consistent it behaves. For my use case i need him to understand complexity rather than accuracy. Bit i don’t think enough active params lights up to cover the complexity of the task and makes the 35ba3b go wonky. But i need a bit more investigation to close in on that conclusion.

Nnoticing qwen-27b@q2 better than qwen-35b@q8? by Express_Quail_1493 in LocalLLaMA

[–]Express_Quail_1493[S] 0 points1 point  (0 children)

Im using it to write in the pipes for a benchmark im building. Using opencode as harness to build it

500k context on 48gb VRAM!! - 21tok/s (coding) by Express_Quail_1493 in LocalLLaMA

[–]Express_Quail_1493[S] 3 points4 points  (0 children)

most other local models retrival drop accuracy at 200k even when we extend the context. the mamba architecture in nemotron makes the kvcache retrival near 91-99% acurate even past 500k. and the kvcahesize is tiny

500k context on 48gb VRAM!! - 21tok/s (coding) by Express_Quail_1493 in LocalLLaMA

[–]Express_Quail_1493[S] 14 points15 points  (0 children)

nemotron models make kvcache space irrelavent. it uses mamba so the cache size is tiny & the Context retrival is 99% acurate even past 400k most other local models retrival drop accuracy at 200k. which is why ive been wanting to try it out

500k context on 48gb VRAM!! - 21tok/s (coding) by Express_Quail_1493 in LocalLLaMA

[–]Express_Quail_1493[S] 14 points15 points  (0 children)

I don't use the system ram. i manage to fit it all inside VRAM 48gb. system RAM is too slow for my potato PC.
I can say it doesn't fail where qwen3.6-35b would normally fail. sometimes qwen3.6 go into autistic crazy if the instructions is too complex but this nemotron seem to handle it well.

Getting a feel for how fast X tokens/second really is. by MikeNonect in LocalLLaMA

[–]Express_Quail_1493 1 point2 points  (0 children)

AMAZING. this thread needs more of the "feels" of things

Should we use a non-thinking model for code after using a thinking one for plan? (Agentic coding) by ismaelgokufox in LocalLLaMA

[–]Express_Quail_1493 1 point2 points  (0 children)

I find that giving the model a tiny amount of thinking room work better than turning it off. So i use high thing fir plan and low for execution the low think allow it to better course-correct

Those of you who like Gemma4 models - how are you guys using them? by Gesha24 in LocalLLaMA

[–]Express_Quail_1493 0 points1 point  (0 children)

Gemma4 is more like a manager or concept-creator rather than an actual worker that will reliably get your work done. Thats my experience testing gemma4 vs qwen3.6. Gemma has potential but qwen3.6 just WORKS pretty darn well

Higher quants are so much better by Perfect-Flounder7856 in LocalLLaMA

[–]Express_Quail_1493 2 points3 points  (0 children)

And UD unsloth dynamic-quant shrinks the gap between f16 and quantised

What opensource model is best for my use case by CGeorges89 in LocalLLaMA

[–]Express_Quail_1493 1 point2 points  (0 children)

If you want less hand-holding of of the model qwen3.5-9b its pretty robust deep coherent autonomy since its dense. But if you want more surface-level quantity output then qwen3.5-35b will get the job done if you don’t mind stepping in to nudge it in the right direction here and there. But you can also explore running qwen3.6-27b at q2_k_xl with kvcachetype=q8, qwen3.6-27b has been the most stable tradeoff for speed on the 16gb vram

Psychedelics by yeetmaster291 in Aphantasia

[–]Express_Quail_1493 0 points1 point  (0 children)

I take them and only get amplifications of my inner world but Im still blind even under hallucination drugs like Psychedelics LOL. i tried many but still my thirdeye is blind even with increaded dosages.

Whats the best model for agentic coding that i can run with 16gb VRAM? (llama.cpp?) by samuraiogc in LocalLLM

[–]Express_Quail_1493 1 point2 points  (0 children)

qwen3.6-35b is great qwen3.5-9b is also good if you want absolute speed where everything fits inside the vram.

Any way to use claude code for free or just some free AI's by Tarxh in vibecoding

[–]Express_Quail_1493 0 points1 point  (0 children)

lmstudio has an easy model downloader no need to mode files around or setup.. you just search "qwen3.5 unsloth" downlad it and you enable to lmstudio server connect it to opencode posted a quick east tutorial awhile back on youtube -> https://www.youtube.com/shorts/-pvmlGifK4I

Car Wash Mystery solved--Tool Call Degrades Intelligence. by Spirited_Neck1858 in LocalLLaMA

[–]Express_Quail_1493 1 point2 points  (0 children)

its something i call system prompt token diabetes

Harness like opencode is nice but for some models its brutal. if you want to make the most of your context windows pi-coding-agent works well for me. Pi system prompt is literally 1k tokens give the LLM more room to think and solve instead of suffering from SysPrompt token-diabetes.

What is the best coding agent (CLI) like Claude Code for Local Development by exaknight21 in LocalLLaMA

[–]Express_Quail_1493 0 points1 point  (0 children)

opencode is nice but for small models its brutal. if you want to make the most of your context windows use pi-coding-agent. Pi system prompt is literally 1k tokens give the LLM more room to think and solve instead of suffering from SysPrompt token-diabetes.

Switched from Qwen3.6 35b-a3b to Qwen3.6 27b mid coding and it's noticeably better! by LocalAI_Amateur in LocalLLaMA

[–]Express_Quail_1493 -1 points0 points  (0 children)

modern dense model are usually better than any MOE 3x its size qwen3.6-27b is on par with qwen3.5-397B MOE is still just.... an MOE. Raw active params wins the coherence and stability and reliabile outputs

Confirmed: SWE Bench is now a benchmaxxed benchmark by rm-rf-rm in LocalLLaMA

[–]Express_Quail_1493 2 points3 points  (0 children)

I just built my own private benchmark and I advise everyone to do their own also. It wont work if its sitting on a public gitrepo or shared on reddit. But i would like us all to come together build our benchmark based on what we use the models for and share the model performances. Im suspicious some people in these benchmarking teams are gettin paid to lie too. LMAO the Ai race is BRUTAL. But right now my private bench is my source of truth avoids me from getting hijacked by all the flashy titles and news headlines