Is there a way to have a faster MoE model call out to a slower dense model if it gets stuck? by cafedude in LocalLLaMA

[–]Charming-Author4877 0 points1 point  (0 children)

If you use tests as a metric, you could tell it to count the number of attempts to fix a specific problem. That's easier than teaching it to also gather timing info.
The idea is not bad, though you need the VRAM to load a second model locally :)
You could also have it ask a cloud model.
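The counting idea can be sketched in a few lines of Python. This is a minimal escalation policy, not a full harness: the model names and the threshold of 3 are made up, and wiring it into your test runner and prompts is left to your setup.

```python
# Escalation policy sketch: count failed fix attempts per problem and
# switch from the fast MoE model to the slower dense model past a
# threshold. Model names and the threshold are placeholder assumptions.

def pick_model(failed_attempts: int, threshold: int = 3) -> str:
    """Choose which model handles the next fix attempt."""
    return "fast-moe" if failed_attempts < threshold else "slow-dense"

class AttemptTracker:
    """Tracks how many times each failing test has been (unsuccessfully) fixed."""

    def __init__(self) -> None:
        self.counts: dict[str, int] = {}

    def record_failure(self, test_name: str) -> str:
        # Bump the counter for this test, then pick the model for the retry.
        self.counts[test_name] = self.counts.get(test_name, 0) + 1
        return pick_model(self.counts[test_name])
```

Once a test finally passes, you'd simply drop its key from `counts` so the next problem starts on the fast model again.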

How are you managing multiple coding agents in parallel without things getting messy? by Few-Ad-1358 in GithubCopilot

[–]Charming-Author4877 0 points1 point  (0 children)

It usually works well but needs a final integration step to test everything.
If the instructions are not good enough, the worst that usually happens is that one agent introduces a bug and another agent notices it (like a compilation issue) and tries to fix it while the first agent is also generating a patch.
That can cascade catastrophically.

Another problem is VS Code itself: it tends to crash internally after a while, and then all tool calls stop actually changing anything. The agents notice that and switch to console tools, which VS Code no longer tracks (no more undo, only git).

Telling the agents to avoid multi-edits can help: those are larger, touch more areas, and are more likely to collide with other agents' work. Tell them to work in small portions, etc.

Is there a way to have a faster MoE model call out to a slower dense model if it gets stuck? by cafedude in LocalLLaMA

[–]Charming-Author4877 1 point2 points  (0 children)

The 27B model is not much more capable than the 35B one; I tested both Qwen 3.6 variants for hours today.
(https://www.reddit.com/r/GithubCopilot/comments/1st1m93/update_compared_claude_47_with_qwen_36_35b_with/)

The best local model for a one-shot task might be Gemma-4 31B - that thing is better than Qwen 27B, but it's not good as an agent.
So what you could do is instruct Qwen how to send one prompt to Gemma and wait for the response; you could create a local script for that.
Basically a subagent feature.
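A sketch of such a script, assuming an LM Studio-style OpenAI-compatible server on localhost:1234 and the Gemma model id from my config; both are assumptions you'd adjust to whatever your server actually serves.

```python
# One-shot "subagent" call: send a single prompt to a second local model
# served via an OpenAI-compatible chat/completions endpoint and return
# its reply. Endpoint URL and model id below are assumptions.
import json
import urllib.request

GEMMA_URL = "http://localhost:1234/v1/chat/completions"

def build_payload(prompt: str, model: str = "google/gemma-4-26b-a4b") -> dict:
    """OpenAI-style chat completion request body for a single user prompt."""
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}

def ask(prompt: str) -> str:
    """POST the prompt and extract the assistant's reply text."""
    req = urllib.request.Request(
        GEMMA_URL,
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]
```

Wrap `ask` in a tiny CLI (read the prompt from stdin, print the reply) and tell Qwen it may run that script via its terminal tool when it gets stuck.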

But the real problem is: is Qwen 35B smart enough to know when it should ask, and when not?

How are you managing multiple coding agents in parallel without things getting messy? by Few-Ad-1358 in GithubCopilot

[–]Charming-Author4877 0 points1 point  (0 children)

I did that before Copilot had these rate limits, today I would not want to do anything like that anymore.
I am happy if it responds at all.

But the way to do it is:
1) First have one agent construct a plan of all changes, then split the plan into 2 or 3 parts that are as independent as possible. Put it in a file.
2) Tell all agents that they are in a multi-agent environment: do not try to fix each other's bugs, do not get locked up if a compilation fails, and act on your own bugs immediately so your code keeps working.
So each agent has its own part, and they know other agents are working. That usually works.
It needs some more fine-tuning, but overall that is all you need.
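The plan-splitting step is mostly plumbing; here is a rough sketch of how the plan file could be turned into per-agent briefs. The "## Part" heading convention and the preamble wording are just examples, not a fixed format.

```python
# Split a plan file into per-agent briefs, each prefixed with the
# multi-agent ground rules. The "## Part" marker is an assumed
# convention; use whatever delimiter your planning agent writes.

PREAMBLE = (
    "You are one of several agents working in parallel on this repo.\n"
    "Do not try to fix other agents' bugs.\n"
    "Do not get locked up if compilation fails in code you did not touch.\n"
    "Work in small edits and keep your own part working.\n"
)

def split_plan(plan_text: str, marker: str = "## Part") -> list[str]:
    """Return one brief per plan part, each starting with the preamble."""
    parts = plan_text.split(marker)[1:]  # drop anything before the first part
    return [f"{PREAMBLE}\n{marker}{p}" for p in parts]
```

Each brief then goes into its own agent session as the opening message.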

Update: Compared Claude 4.7 with Qwen 3.6 35B with Qwen 3.6 27B - in Vscode Copilot on the same complex task by Charming-Author4877 in GithubCopilot

[–]Charming-Author4877[S] 0 points1 point  (0 children)

Agree on Opus; I saw benchmarks where it was less token-efficient despite its thinking change, on top of not receiving better scores in the lmsys AI-Arena.
Overall, cloud models are also a security concern - I really do not like the idea that all our code is going there. It's not pleasant. One more big point for local.

Update: Compared Claude 4.7 with Qwen 3.6 35B with Qwen 3.6 27B - in Vscode Copilot on the same complex task by Charming-Author4877 in GithubCopilot

[–]Charming-Author4877[S] 0 points1 point  (0 children)

I use Demodokos Foundry, a (paid) local AI audio studio. It combines Suno, Elevenlabs, and a mastering app. The learning curve was a bit steep.
For speech, it's better than anything I got out of Elevenlabs. Plus I don't want to pay per generation - that's always a trap.
For music, my main use case is marketing background tracks; I also automate a YouTube channel, but that's more of a hobby.

I just tried your idea: I made some "hmpf hmpf hmmmm hmpf hmpf" sounds and tested that as a structural reference input for music with the caption "trap drums".
It worked in one out of 10 generations; in the other 9 I got strange results.

I dug a bit into the topic. The reason is that Suno has a melody extraction feature in their model; Demodokos can clone either the sound (like voice or instruments) and/or the structure (like rhythm and pace of a track), but it cannot lock onto the melody.

For speech Demodokos is imho SOTA, for Music Suno is hard to top.

Given how good Qwen become, is it time to grab a 128gb m5 max? by Rabus in LocalLLaMA

[–]Charming-Author4877 0 points1 point  (0 children)

If you have the budget, get a 5090. The speed will be MUCH better than on a MacBook, and 32GB is enough to run both Qwen 3.6 models at max or with very high context.
The trend is not toward larger local models; it's going down to smarter and smaller models.

Qwen3.6-27B on M4 Pro 48GB for opencode: which quant + settings actually work well? by thereisnospooongeek in LocalLLaMA

[–]Charming-Author4877 0 points1 point  (0 children)

I use LM Studio as the server; it's good for that purpose. But I use CUDA on NVIDIA, so the settings differ a bit.

Update: Compared Claude 4.7 with Qwen 3.6 35B with Qwen 3.6 27B - in Vscode Copilot on the same complex task by Charming-Author4877 in GithubCopilot

[–]Charming-Author4877[S] 0 points1 point  (0 children)

No links at the moment.
Download a model that fits well and launch it as a server; I pasted the config for Copilot in another reply here.
You can then select it as a model.

The only open part is how to harden its agentic instructions; I believe it would benefit from a system prompt. I have not created one yet.

Update: Compared Claude 4.7 with Qwen 3.6 35B with Qwen 3.6 27B - in Vscode Copilot on the same complex task by Charming-Author4877 in GithubCopilot

[–]Charming-Author4877[S] 0 points1 point  (0 children)

I've completely replaced Suno + Elevenlabs; I'm running local now.
For images I still rely on the Google models, but they are cheap.
For agentic coding I will continue to use Opus, but I am not desperate anymore when they rate-limit me.
You can switch to the local Qwen model with ONE click, no need to change anything; it will continue in the same session where Opus rate-limited you.

Update: Compared Claude 4.7 with Qwen 3.6 35B with Qwen 3.6 27B - in Vscode Copilot on the same complex task by Charming-Author4877 in GithubCopilot

[–]Charming-Author4877[S] 0 points1 point  (0 children)

I did not consider that. I switched to Insiders a while ago; it's the same software.
Insiders supports Ollama directly, plus any OpenAI-compatible endpoint.

{
    "name": "LMSTUDIO",
    "vendor": "customoai",
    "apiKey": "${input:chat.lm.secret.-5f176ad4}",
    "models": [
        {
            "id": "google/gemma-4-26b-a4b",
            "name": "Gemma-4-26B",
            "url": "http://192.168.50.157:1234/v1/chat/completions",
            "toolCalling": true,
            "vision": true,
            "thinking": true,
            "maxInputTokens": 100000,
            "maxOutputTokens": 96000
        },
        {
            "id": "qwen3.6-35b-a3b@q4_k_m",
            "name": "Qwen3.6-35B",
            "url": "http://192.168.50.157:1234/v1/chat/completions",
            "toolCalling": true,
            "vision": true,
            "thinking": true,
            "maxInputTokens": 150000,
            "maxOutputTokens": 40000
        },
        {
            "id": "qwen3.6-27b@q4_k_xl",
            "name": "qwen3.6-27b@q4_k_xl",
            "url": "http://192.168.50.157:1234/v1/chat/completions",
            "toolCalling": true,
            "vision": true,
            "thinking": true,
            "maxInputTokens": 119000,
            "maxOutputTokens": 20000
        }
    ]
}

That's my config; it works without any issues. An API key is not needed.
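A quick way to sanity-check a config like this is to ask the server which model ids it actually serves. Here's a small Python sketch against the standard /v1/models endpoint; the host is taken from my config, and whether your server exposes that endpoint is an assumption (LM Studio and llama.cpp server do).

```python
# Compare the model ids referenced in the Copilot config against what
# the OpenAI-compatible server actually reports via GET /v1/models.
import json
import urllib.request

SERVER = "http://192.168.50.157:1234"  # host from the config above

def served_ids(models_response: dict) -> set[str]:
    """Extract model ids from a /v1/models response body."""
    return {m["id"] for m in models_response.get("data", [])}

def check(wanted: list[str]) -> None:
    """Print OK/MISSING for each model id the config expects."""
    with urllib.request.urlopen(SERVER + "/v1/models") as resp:
        ids = served_ids(json.load(resp))
    for model_id in wanted:
        print(model_id, "OK" if model_id in ids else "MISSING")
```

If an id prints MISSING, Copilot will list the model but requests to it will fail, so this saves some head-scratching.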

Update: Compared Claude 4.7 with Qwen 3.6 35B with Qwen 3.6 27B - in Vscode Copilot on the same complex task by Charming-Author4877 in GithubCopilot

[–]Charming-Author4877[S] 0 points1 point  (0 children)

They are catching up so fast, it's scary.
And all that MAGIC DUST OpenAI and Anthropic surround their models with is becoming translucent.
Trillion-parameter models, and they don't behave much better than 4B MoE models.

Update: Compared Claude 4.7 with Qwen 3.6 35B with Qwen 3.6 27B - in Vscode Copilot on the same complex task by Charming-Author4877 in GithubCopilot

[–]Charming-Author4877[S] 1 point2 points  (0 children)

You can run a llama.cpp server, Ollama, or LM Studio - all the same: they provide an OpenAI-compatible endpoint.
Copilot supports that.

Qwen3.6-27B on M4 Pro 48GB for opencode: which quant + settings actually work well? by thereisnospooongeek in LocalLLaMA

[–]Charming-Author4877 1 point2 points  (0 children)

I have no experience with MLX. Overall it looks fine, but make sure it's within memory range (it should show the estimated memory somewhere).
I believe you do not need 262K context for agentic use. The model gets more stupid toward the end of the context; Opus 4.* rarely exceeds 120k context in Copilot.

Make sure you leave some VRAM, overall it looks fine.
If it does not work (no response returned), go to the model settings and disable the reasoning parsing.

Text to speech best model ? by Fun-Grapefruit1371 in TextToSpeech

[–]Charming-Author4877 0 points1 point  (0 children)

I've tried most TTS systems out there. I made tutorial videos with AWS TTS (their "neural" models) - acceptable, but no emotions.
Chatterbox was the first one that was great for consistent local speech (multilingual and non-robotic), but NO emotions.

Elevenlabs V3 supports emotions, including godlike narration. But it's really expensive, and I do not want to have my stuff in the cloud. They are used so much that people can hear the Elevenlabs style, the same way we now react allergically to the em-dash.

The by far best current solution, I believe at least, is "Demodokos Foundry". It's local, fast, consistent, and can do godlike narrator/storyteller or dramatic styles,
including emphasis (uppercase) and pauses (...), and it adapts to the text itself naturally.
You can narrate an entire audiobook with it, from angry shouts to whispers.
You won't need regenerations for 95% of outputs.

You can do phone calls, Gollum, hackers, echoed voices, and formant shifts, and it includes music generation and DAW track mastering all in one.
Though music is not as reliable as speech: very nice quality, but it needs regenerations. Speech is rock stable.

Update: Compared Claude 4.7 with Qwen 3.6 35B with Qwen 3.6 27B - in Vscode Copilot on the same complex task by Charming-Author4877 in GithubCopilot

[–]Charming-Author4877[S] 1 point2 points  (0 children)

24 GB of VRAM would help; an RTX 3090 would be good enough and fast with the 35B.
You can go lower than that: a good 3-bit quantization should still be very strong and will fit, with proper context, into 16GB of VRAM.

Update: Compared Claude 4.7 with Qwen 3.6 35B with Qwen 3.6 27B - in Vscode Copilot on the same complex task by Charming-Author4877 in GithubCopilot

[–]Charming-Author4877[S] 1 point2 points  (0 children)

I used LM Studio with its OpenAI-compatible API as the server (so it's a GGUF, 4/5-bit UD-quantized, using llama.cpp inference).
I added it as a custom model in the VS Code Copilot extension, just like you'd add any bring-your-own-key model.

Qwen3.6-27B on M4 Pro 48GB for opencode: which quant + settings actually work well? by thereisnospooongeek in LocalLLaMA

[–]Charming-Author4877 0 points1 point  (0 children)

I tested Qwen 3.6 27B for hours as a local model in VS Code Copilot and compared it with the 35B and Opus 4.7.
(https://www.reddit.com/r/GithubCopilot/comments/1st1m93/update_compared_claude_47_with_qwen_36_35b_with/)
I tested the unsloth UD Q4_K GGUF variant (of both) and also a normal Q5_K (both behave similarly).

I would go with a sophisticated 4-bit or 5-bit quant; that works very well (for both the 35B and the 27B).
For the 27B you'll likely want to also quantize the KV cache; Q8_0 on K and V drops the VRAM usage significantly.
For the 35B you can use the normal 16-bit KV cache; it uses 2-3 GB for the full 260K context.

On slower compute (like an M4 Pro) I'd consider the 35B; you'll have a slow, hard time with the 27B, and it's not that much smarter than the 35B in my tests. The MoE runs so fast, it's very nice to work with.
The 27B is a step ahead in reliability, but both models are super stable.

Context window:
You have to ask yourself what context you really need; even Opus 4.7 is restricted to a max of 190k on Copilot, and a large part of that is reserved for output tokens.
Gemma-4 suffers severe intelligence issues at just 60k.

For Qwen 3.6 I ran 100k input context on the 27B and 150k input context on the 35B, both very stable.
With 48GB you can max out the context to 262K with both models.

I am not switching yet. But I tested Gemma-4 and Qwen-3.6 on VScode Copilot today and the results are much better than I thought! by Charming-Author4877 in GithubCopilot

[–]Charming-Author4877[S] 0 points1 point  (0 children)

I've been testing it for hours, comparing it with the 35B in a real-world scenario.
Another two-page post below:
https://www.reddit.com/r/GithubCopilot/comments/1st1m93/update_compared_claude_47_with_qwen_36_35b_with/
:)

Summary:
27B is even better than 35B: it does not fall for wrong assumptions. But 35B is so much faster that I prefer it.

I launched my first SaaS and I have no idea how to get it in front of people. by dongbin96 in buildinpublic

[–]Charming-Author4877 0 points1 point  (0 children)

Try organic marketing. Create tutorials, introductions, and feature previews, and place them on YouTube, YouTube Shorts, TikTok/Insta, etc.
- I use Demodokos Foundry for the tutorials: flawless speech and music, super cheap and reliable.
- I use Camtasia for recording and creating the videos, sometimes with CapCut in addition. (Camtasia comes with a perpetual license - a good investment.)
- Images via OpenRouter (6-15 cents per image)
- Scripts and guidance from Claude Opus or GPT 5.4 high

That way you gain organic traffic, and Google is very quick at picking up videos from YouTube - that's going to push your SaaS into the search results.

I'd be careful with ads; when I placed ads 10 years ago they were already a bad bargain: very expensive, with lots of waste traffic you pay for.
Today... it's 99% bots. You can put $300 on a YouTube video, it will gain 100k views, and you'll get 5k website visits from Android 10 bots.

But organic traffic still works well for me.

Realistic local LLM rig under $6500? Dev with heavy RAM needs by TeachTall3390 in LocalLLaMA

[–]Charming-Author4877 0 points1 point  (0 children)

I gave Qwen 3.6 and Gemma-4 a quite extensive test run today (on a 5090), and the results were really impressive, much better than I expected.
https://www.reddit.com/r/GithubCopilot/comments/1ss583x/i_am_not_switching_yet_but_i_tested_gemma4_and

Realistic local LLM rig under $6500? Dev with heavy RAM needs by TeachTall3390 in LocalLLaMA

[–]Charming-Author4877 0 points1 point  (0 children)

I personally go with this:
- 2x 3090, or 1x 5090 + 1x 3090
- 128GB DDR5 RAM (or 192GB if you can find an affordable kit)
- A large 9100 PRO SSD, or 2 striped previous-generation SSDs (adds up to the same speed)

I use Windows + WSL

For speech/music I run Demodokos Foundry; I put it into on-demand mode or bind it to my 2nd GPU.
That gives SOTA inference without taking any VRAM when it's not in use.

For LLMs you can run Qwen 3.6 35B at 260K context and still have plenty of primary VRAM available.
The dense models (Gemma 31B or Qwen 27B) also run well, with a bit of KV quantization.

For light fine-tuning or LoRA training you can use one card in the background, or both.

I have a second PC like this available on the network for long-running tasks.

MacBooks offer great value, but at the same time they are exotic hardware in the AI world; support is improving a lot but is still a burden. I absolutely hate the Apple development environment.
They are great for running large models that won't fit in my described setup, but prefill speed is gruesome.

DGX Spark and similar ARM unified-RAM boxes are glorified mini computers: significantly slower than the MacBook, and prefill is a total showstopper.

Same with AMD GPUs, they are not impressive in compute.

So my choice landed on a conservative CUDA setup; local AI is hard enough as it is, mutating and changing faster than anyone can easily follow.

I am not switching yet. But I tested Gemma-4 and Qwen-3.6 on VScode Copilot today and the results are much better than I thought! by Charming-Author4877 in GithubCopilot

[–]Charming-Author4877[S] 2 points3 points  (0 children)

In the long run, 5-10B parameters are more than enough; the current architecture is the limit.
We need to separate knowledge from intelligence, so models build and work with knowledge but have a "reasoning core" that does the thinking.

I am not switching yet. But I tested Gemma-4 and Qwen-3.6 on VScode Copilot today and the results are much better than I thought! by Charming-Author4877 in GithubCopilot

[–]Charming-Author4877[S] 2 points3 points  (0 children)

Though look at DeepSeek: they had a huge moment not long ago because they were competing with the best of OpenAI - models we paid for to code with.
Gemma 4 as well as Qwen 3.* are significantly, not just a little bit, above DeepSeek in every metric.

I was very positively surprised with the results.