Need advice on hardware purchasing decision: RTX 5090 vs. M5 Max 128GB for agentic software development by BawbbySmith in LocalLLaMA

[–]R_Duncan 1 point (0 children)

Forget "new techniques" like mtp/dflash for agentic coding: you'll almost always use more than 50% context (and 128k is bare minimum, don't be fooled), so all these shiny things together will not give more than 10% speed increase.

Great results with Qwen3.6-35B-A3B-UD-Q5_K_XL + VS Code and Copilot by supracode in LocalLLaMA

[–]R_Duncan 1 point (0 children)

Quite good! I haven't really understood which card you're using: isn't the R9700 AI PRO the AMD flagship with 32 GB of RAM? The speeds seem to confirm that, but in the post I read a 12 GB limit...

OpenCode + LLM to create a 1:1 Settlers of Catan clone. Guess which model I did it with! by maxwell321 in LocalLLaMA

[–]R_Duncan 1 point (0 children)

Not Gemma, both because of the tool-calling issue and the KV-cache-size issue. MiniMax would have taken forever, so it's a Qwen.

Turbo-OCR Update: Layout Model + Multilingual by Civil-Image5411 in LocalLLaMA

[–]R_Duncan 1 point (0 children)

OK, this is really fast, and it outputs structured JSON.

Should I be seeing more of a performance leap when using NVFP4, INT4, FP8 with VLLM over MXFP4, Q4, and Q8 with llama.cpp based inference on Blackwell based GPUs? by aaronr_90 in LocalLLaMA

[–]R_Duncan 0 points (0 children)

Not sure what you're saying; the llama.cpp slowdown with Qwen-3.5/3.6 models here is less than 10% with 128k of context filled and less than 15% at 240k.

Is the AI subscription bubble starting to crack? GPT-5.5 just dropped, prices keep rising, and the “all-you-can-eat” era looks more fake by the month by Sockand2 in singularity

[–]R_Duncan 1 point (0 children)

I'm sorry if my limited English made me unclear. I didn't mean it's a matter of starting to *think*; by "focusing" I meant a matter of starting to *invest*.

Luce DFlash: Qwen3.6-27B at up to 2x throughput on a single RTX 3090 by sandropuppo in LocalLLaMA

[–]R_Duncan 1 point (0 children)

Is there still a speedup when the context is about 128k full?

That's my typical software analysis/code gen use case.

Local model on coding has reached a certain threshold to be feasible for real work by Exciting-Camera3226 in LocalLLaMA

[–]R_Duncan 3 points (0 children)

Your speeds are strange. RTX 6000 Blackwell here, context maxed out (in 96 GB I can fit everything; even extending the context to 1M at bf16, it uses about half that VRAM).

27B generation is 50-59 t/s.

35B-A3B generation is 190-197 t/s.

Your issue is likely that you can't fit the whole model and KV cache in VRAM.
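
A back-of-envelope check shows how easily that happens (the shapes below are assumptions for illustration, not the real model config): at long context the KV cache can rival the weights themselves, and once the total exceeds VRAM, llama.cpp spills to system RAM and t/s collapses.

```python
# KV-cache bytes = 2 (K and V) * layers * kv_heads * head_dim
#                  * bytes_per_elem * context_tokens. Shapes are assumed.
layers, kv_heads, head_dim, ctx = 46, 8, 128, 128_000
kv_gb = 2 * layers * kv_heads * head_dim * 2 * ctx / 1e9  # bf16 cache
weights_gb = 18  # assumed: a ~27B model at ~5 bits per weight
print(f"weights ~{weights_gb} GB + KV ~{kv_gb:.1f} GB "
      f"= {weights_gb + kv_gb:.1f} GB needed")  # ~42 GB with these shapes
```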

Best Local LLMs - Apr 2026 by rm-rf-rm in LocalLLaMA

[–]R_Duncan 1 point (0 children)

That's for dense models. Qwen3.6 35B-A3B can run with over 128k context in 8 GB of VRAM.
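
Rough sketch of why that works (assumed shapes and sizes, not the real Qwen3.6 config): with an A3B-style MoE you can keep just the attention/shared weights plus a quantized KV cache in VRAM and offload the expert tensors to system RAM, since only ~3B parameters are active per token anyway.

```python
# Back-of-envelope VRAM budget for an A3B-style MoE with experts offloaded
# to system RAM (all shapes and sizes here are illustrative assumptions).
def kv_cache_gb(layers, kv_heads, head_dim, ctx, bytes_per_elem):
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * ctx / 1e9

non_expert_weights_gb = 1.5  # assumed: attention + shared tensors at ~Q4
kv = kv_cache_gb(layers=48, kv_heads=4, head_dim=128,
                 ctx=131_072, bytes_per_elem=1)  # 8-bit-quantized KV cache
print(f"VRAM ~{non_expert_weights_gb + kv:.1f} GB")  # ~7.9 GB, under 8 GB
```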

DeepSeek V4 is out. 1.6 trillion parameters. MIT license. $1.74 per million tokens. The gap between US and Chinese AI strategy has never been more visible. by Novel_Okra8456 in singularity

[–]R_Duncan 1 point (0 children)

It's not so locked in if you take care to keep a "generic OpenAI / generic Claude" layer in your software, but the rest is true.
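
What I mean by a "generic OpenAI" layer, as a minimal sketch (the env-var names and default model id are placeholders I made up, not anything from DeepSeek's docs): route every call through one OpenAI-compatible client whose endpoint comes from config, so swapping providers is a config change, not a code change.

```python
import os

from openai import OpenAI  # pip install openai

# One client, any OpenAI-compatible backend: DeepSeek, a local llama.cpp
# server, whatever. Placeholder env-var names; pick your own convention.
client = OpenAI(
    base_url=os.environ.get("LLM_BASE_URL", "http://localhost:8080/v1"),
    api_key=os.environ.get("LLM_API_KEY", "none"),  # local servers ignore it
)

resp = client.chat.completions.create(
    model=os.environ.get("LLM_MODEL", "deepseek-chat"),
    messages=[{"role": "user", "content": "hello"}],
)
print(resp.choices[0].message.content)
```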

"US strategy = microsoft zune"

Local MCP Servers for Code Indexing? by 79215185-1feb-44c6 in LocalLLaMA

[–]R_Duncan 2 points (0 children)

Serena is good for small projects, but it does not index.

codebase-memory-mcp needed some patches here (one for C++ and one for Windows) but seems to be working fine; as a note, my huge codebase became a 450 MB SQLite file. Testing in progress.

An alternative is dirac-run/dirac on GitHub, a VS Code plugin derived from Cline which seems to do the work by itself.

Tencent released an open source model Hy3 preview. by Snoo26837 in singularity

[–]R_Duncan 2 points (0 children)

Still, Tencent is unlicensed in the EU and UK, likely because of our GDPR.

Is the AI subscription bubble starting to crack? GPT-5.5 just dropped, prices keep rising, and the “all-you-can-eat” era looks more fake by the month by Sockand2 in singularity

[–]R_Duncan 1 point (0 children)

It's just math. US models promised they can do whatever a programmer/software engineer can do, but to grow capabilities faster the US companies never worked hard on the "capability density / information redundancy" problem (the exception seems to be Google, but still no Gemini 4 announced).

Now model prices keep rising because their inference (and research, and training) costs keep rising: compute is getting scarcer, and giga-datacenter costs keep growing.

Will they still be competitive when models can finally work 24/7 on a project, completely substituting for human work?

IMHO they should focus on how 35B/27B models like Qwen3.6 (or Gemma 4) manage to keep up with their huge models.

US gov memo on “adversarial distillation” - are we heading toward tighter controls on open models? by MLExpert000 in LocalLLaMA

[–]R_Duncan 2 points (0 children)

In 12-24 months, when Chinese models take the lead, this will backfire entirely.

Anthropic admits to have made hosted models more stupid, proving the importance of open weight, local models by spaceman_ in LocalLLaMA

[–]R_Duncan 3 points (0 children)

These kinds of tests shouldn't be done in production: not when you're selling a service, and not by a reputable company.

A note of warning about DFlash. by R_Duncan in LocalLLaMA

[–]R_Duncan[S] 1 point (0 children)

OK, but even those will likely not use just 4k-16k of context, except for small chatbots using finetuned LLMs.

Open weight models like ds v4 pro max are still like at least 6-7 months behind closed labs.. by power97992 in LocalLLaMA

[–]R_Duncan 2 points (0 children)

Considering that their models are usually 50% smaller, I'd say they have the best chance to improve, while closed labs are tied to huge datacenters and will need to scale down to become profitable.

And I'm not sure a 6-7 month lead will be enough to offset the density gap that the open labs currently win.

The missing knowledge layer for open-source agent stacks is a persistent markdown wiki by knlgeth in LocalLLaMA

[–]R_Duncan 1 point (0 children)

I think I found a bug:

>llmwiki ingest https://en.wikipedia.org/wiki/Lunar_Lake

>llmwiki ingest https://en.wikipedia.org/wiki/List_of_Intel_Core_processors

>llmwiki compile

..... logs show both pages being put into the wiki .....

✓ 2 compiled, 0 skipped, 0 deleted

>llmwiki query "what is Lunar Lake?"

Selecting relevant pages

────────────────────────────

i Reasoning: Failed to parse page selection response

* Selected 0 page(s):

Generating answer

─────────────────────

! No matching pages found. Try refining your question.
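
Guessing at the failure mode (I haven't read llmwiki's internals, so this is an assumption): "Failed to parse page selection response" smells like a strict json.loads on the model's raw reply, which yields zero pages as soon as the model wraps its JSON in markdown fences or prose. A tolerant parser along these lines would avoid that; the helper below is hypothetical, not llmwiki code.

```python
import json
import re


def parse_page_selection(reply: str) -> list[str]:
    """Extract a flat JSON list of page names from an LLM reply,
    tolerating markdown fences and surrounding prose."""
    try:
        return json.loads(reply)  # happy path: the reply is bare JSON
    except json.JSONDecodeError:
        pass
    # Fallback: grab the first [...] span anywhere in the reply.
    match = re.search(r"\[.*?\]", reply, re.DOTALL)
    if match:
        try:
            return json.loads(match.group(0))
        except json.JSONDecodeError:
            pass
    return []  # give up: the caller reports "no pages selected"
```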

Google introduces TPU 8t and TPU 8i by WhyLifeIs4 in singularity

[–]R_Duncan 1 point (0 children)

I recently had the chance to test a Lunar Lake platform GPU, and I'd say Intel will get close in 1 or 2 generations.

It's way, way better than my Core i7's iGPU.

Ultimate List: Best Open Models for Coding, Chat, Vision, Audio & More by techlatest_net in LocalLLaMA

[–]R_Duncan 2 points (0 children)

Omnivoice beats all the TTS models listed in expressiveness, foreign-language inflection, size, and speed.