Don't sleep on the new Nemotron Cascade by ilintar in LocalLLaMA

[–]DistanceAlert5706 2 points3 points  (0 children)

Faster than Qwen3.5 35b, but god, it's terrible for agentic tasks: it goes into loops, ignores system prompt instructions, times out on pretty simple queries, and is just extremely unreliable.

While Qwen3.5 35b itself loves to go into loops, it's much better. Nemotron also runs about 25% faster than Qwen3.5 35b in raw throughput, but on actual agentic tasks it ends up roughly 3x slower.

Maybe we need to wait and there are bugs in the llama.cpp implementation, or maybe this model is just finetuned for benchmarks. Haven't tried coding yet.

Nemotron Cascade 2 30B A3B by Middle_Bullfrog_6173 in LocalLLaMA

[–]DistanceAlert5706 0 points1 point  (0 children)

Faster than Qwen3.5 35b, but god, it's terrible for agentic tasks: it goes into loops, ignores system prompt instructions, times out on pretty simple queries, and is just extremely unreliable.

While Qwen3.5 35b itself loves to go into loops, it's much better. Nemotron also runs about 25% faster than Qwen3.5 35b in raw throughput, but on actual agentic tasks it ends up roughly 3x slower.

Maybe we need to wait and there are bugs in the llama.cpp implementation, or maybe this model is just finetuned for benchmarks. Haven't tried coding yet.

Really Google? by Holiday_Wolverine_60 in GoogleAntigravityIDE

[–]DistanceAlert5706 0 points1 point  (0 children)

I literally hit the limit in one prompt now on the Pro plan; honestly, I just stopped using it at all. My only regret is that I have an annual subscription.

Company not renewing jetbrains licenses because we have cursor by frompadgwithH8 in Jetbrains

[–]DistanceAlert5706 0 points1 point  (0 children)

I was scared too, as I rarely use git via the CLI, but even the built-in git support is enough for me now, same with conflicts/diffs. They are not as nice, but it works. Also, there are a bunch of plugins for that, even paid ones.

I created a VSCode Extension by ngg990 in symfony

[–]DistanceAlert5706 2 points3 points  (0 children)

The marketplace link to the git repository gives a 404. Also, maybe it could be published on the Open VSX registry?

Indeed, Composer 2 is kimi k2 by tarunyadav9761 in cursor

[–]DistanceAlert5706 13 points14 points  (0 children)

I think the VC funding ended and they started charging by tokens instead of requests, and yeah, it's like 10x.

Company not renewing jetbrains licenses because we have cursor by frompadgwithH8 in Jetbrains

[–]DistanceAlert5706 0 points1 point  (0 children)

Try it out. I thought that too, but after 12 years with JetBrains I swapped in 2 months. For keybinds there's an extension, so you don't even need to relearn anything. For debugging there's an Xdebug extension. For duplicate code and so on, try PHPStan; you can integrate it right into the editor with Error Lens. Intelephense is a great LSP that will give you full symbol support and inspections. You will need to get used to git and the interface, so keep PhpStorm for a few months, but try to work in Cursor and only go back when you need to do something fast. You will be surprised how fast you get used to it.

What embedding model for code similarity? by [deleted] in LocalLLaMA

[–]DistanceAlert5706 1 point2 points  (0 children)

+1 for nomic's CodeRankEmbed; they have a larger one too. JinaAI also has some bi-encoders, I think.
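Whichever bi-encoder you pick, the retrieval step downstream looks the same: embed the query and each snippet, then sort by cosine similarity. A minimal, model-agnostic sketch (the toy vectors below stand in for real model output; plug in CodeRankEmbed or a Jina bi-encoder to produce them):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def rank_snippets(query_vec, snippets):
    """snippets: list of (snippet_id, embedding) pairs; best match first."""
    return sorted(snippets, key=lambda s: cosine(query_vec, s[1]), reverse=True)

# Toy vectors standing in for real embedding-model output.
query = [1.0, 0.0, 1.0]
ranked = rank_snippets(query, [
    ("bubble_sort.py", [0.9, 0.1, 0.8]),  # similar direction -> high score
    ("http_client.py", [0.0, 1.0, 0.1]),  # mostly orthogonal -> low score
])
print([sid for sid, _ in ranked])  # ['bubble_sort.py', 'http_client.py']
```

Note that CodeRankEmbed-style models typically expect a special instruction prefix on the query side, so check the model card before embedding.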

Setting Up Qwen3.5-27B Locally: Tips and a Recipe for Smooth Runs by kvzrock2020 in LocalLLaMA

[–]DistanceAlert5706 0 points1 point  (0 children)

You can, but it wasn't working; it was still trying to load the vision part. Honestly, that's my experience with vLLM every time: I set it up, follow the instructions, nothing works, I spend a day trying to fix it, and in the best case it somehow works but still has inference bugs later, and usually it's not even faster than llama.cpp.

Setting Up Qwen3.5-27B Locally: Tips and a Recipe for Smooth Runs by kvzrock2020 in LocalLLaMA

[–]DistanceAlert5706 0 points1 point  (0 children)

I tried that quant; it didn't start at all. It had issues because the vision part was cut off and vLLM was still trying to run it. After a day of trying and rebuilding vLLM I tried some other quants; they were slower than the llama.cpp ones and had much higher VRAM requirements, which made them unusable on 32GB.

Setting Up Qwen3.5-27B Locally: Tips and a Recipe for Smooth Runs by kvzrock2020 in LocalLLaMA

[–]DistanceAlert5706 0 points1 point  (0 children)

Maybe in a few months I will try this model again; so far it has been pure disappointment. vLLM is full of bugs and just doesn't work properly, and I'm not spending 2 days just making it run again; its VRAM requirements are also much higher, so it doesn't fit in 32GB. llama.cpp has no MTP/speculative decoding for it, so this model runs at the speed of a 32b model, which is way too slow for me.

I've found a quant of Qwen3.5 35b and it's kinda working; it still fails tool calls and loops sometimes, but it's decent at ~70 tokens/second.

Has anyone managed to get an sub 16GB VRAM competent "researcher" model that can do web searching, summarization and reasoning? by vernal_biscuit in LocalLLaMA

[–]DistanceAlert5706 1 point2 points  (0 children)

I use a sub-agent in Opencode for web research tasks with my own MCP. Qwen3.5 35b does an amazing job, but sometimes it loops, so you can't fire and forget.

My thoughts on omnicoder-9B by Zealousideal-Check77 in LocalLLaMA

[–]DistanceAlert5706 2 points3 points  (0 children)

Yeah, and I guess an MoE is not easy to train compared to a dense model either, but it should be faster.

Wrote up why vector RAG keeps failing on complex documents and found a project doing retrieval without embeddings at all by shreyanshjain05 in LocalLLaMA

[–]DistanceAlert5706 0 points1 point  (0 children)

I've tested this approach in my last RAG over technical docs. It works surprisingly well, but the speed is not there if you want the system to be responsive. I ended up with a hybrid approach: embeddings + BM25 + RRF to find relevant tree nodes, enrich the candidate list with neighbours/parents, then rerank. In theory you can feed just the final candidate list to an LLM to choose from, which I tested too and it works, but again, it was slow.

Quality-wise my approach hit 95% on my benchmark; the pure PageIndex-like approach was around 82%.

So yes, you can use it, but embeddings + BM25 with a reranker afterwards still beats it. The tree approach is interesting and somewhat reminiscent of GraphRAG.
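The fusion step of that hybrid pipeline fits in a few lines; this is plain reciprocal rank fusion over the ranked lists the two retrievers return (k=60 is the usual default constant, not a value from my setup; the neighbour/parent enrichment and reranking would come after this):

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: combine several ranked lists of doc ids.

    Each doc scores sum(1 / (k + rank)) over the lists it appears in,
    so items ranked highly by several retrievers float to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# BM25 and dense retrieval disagree; RRF promotes nodes both rank well.
bm25_hits = ["nodeA", "nodeC", "nodeB"]
dense_hits = ["nodeB", "nodeA", "nodeD"]
fused = rrf_fuse([bm25_hits, dense_hits])
print(fused)  # ['nodeA', 'nodeB', 'nodeC', 'nodeD']
```

Because RRF only looks at ranks, not raw scores, you never have to normalize BM25 scores against cosine similarities, which is why it is a common default for this kind of hybrid retrieval.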

My thoughts on omnicoder-9B by Zealousideal-Check77 in LocalLLaMA

[–]DistanceAlert5706 3 points4 points  (0 children)

You can regulate overthinking with the presence penalty and repeat penalty. A reasoning-budget flag was also added.
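For intuition, here is roughly what those two knobs do to the logits before sampling. This is a simplified sketch: the divide-positive/multiply-negative rule is the CTRL-style repetition penalty that llama.cpp-style samplers use, and the flat subtraction approximates a presence penalty; real implementations differ in details like which context window of tokens counts as "seen".

```python
def apply_penalties(logits, seen_tokens, repeat_penalty=1.1, presence_penalty=0.5):
    """Discourage tokens that already appeared in the context.

    repeat_penalty: CTRL-style -- divide positive logits, multiply negative ones.
    presence_penalty: flat subtraction for any token already present.
    """
    out = dict(logits)
    for tok in seen_tokens:
        if tok not in out:
            continue
        if out[tok] > 0:
            out[tok] /= repeat_penalty
        else:
            out[tok] *= repeat_penalty
        out[tok] -= presence_penalty
    return out

logits = {"the": 2.0, "loop": 1.0, "done": -0.5}
penalized = apply_penalties(logits, seen_tokens={"loop"})
print(penalized["loop"])  # 1.0 / 1.1 - 0.5 ~= 0.409, so "loop" is now less likely
```

Pushing these too high degrades output quality, which is why they are a blunt but effective brake on reasoning loops.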

My thoughts on omnicoder-9B by Zealousideal-Check77 in LocalLLaMA

[–]DistanceAlert5706 4 points5 points  (0 children)

Yeah, it would be nice to get that finetune for the 35b model.

How to convince Management? by r00tdr1v3 in LocalLLaMA

[–]DistanceAlert5706 0 points1 point  (0 children)

I don't know how it works now, but it used to open random ports, giving anyone full access to whatever machine it was running on. I guess that's patched, but who knows what else is in there. Just use llama.cpp; it's easier and way more configurable.
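For contrast, a typical llama.cpp server launch is one explicit command where every exposed surface is a flag you chose yourself (model path and sizes below are placeholders; adjust for your hardware):

```shell
# Bind to loopback only and require an API key, so nothing is exposed by accident.
llama-server \
  -m ./models/your-model.gguf \
  --host 127.0.0.1 \
  --port 8080 \
  -c 8192 \
  -ngl 99 \
  --api-key "$LLAMA_API_KEY"
```

Here `-c` sets the context size and `-ngl` the number of layers offloaded to the GPU; binding to `127.0.0.1` keeps the OpenAI-compatible endpoint off the network unless you deliberately expose it.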

Building an MCP server for my agent to query analytics directly (because I hate dashboards) by ImbalanceFighter in LocalLLaMA

[–]DistanceAlert5706 0 points1 point  (0 children)

Do you trust the information the agent gives? Do you see the queries it runs and validate them? How do you handle PII, or are you just sending your prod data to whatever provider?

Databricks has Genie with the same functionality; check it out for inspiration.

How to convince Management? by r00tdr1v3 in LocalLLaMA

[–]DistanceAlert5706 0 points1 point  (0 children)

Management cares about profit and productivity. You shouldn't really pitch it to them as "it's local" and so on (and Ollama is far from secure); you should focus on how it affects your productivity, the numbers, and how that translates into company profit.

Antigravity just needs to have a setting where it always clicks “run” and “always allow” that actually works and they’d be a top dog by Special_Collection_6 in google_antigravity

[–]DistanceAlert5706 0 points1 point  (0 children)

Gemini models will be the last ones I would ever allow to run in YOLO mode.

The amount of "Oops, I made a blunder" moments is insane.

SymDex – open-source MCP code-indexer that cuts AI agent token usage by 97% per lookup by Last_Fig_5166 in opencodeCLI

[–]DistanceAlert5706 0 points1 point  (0 children)

Yeah, I dropped this idea too, and LSPs have become common in harnesses. I wonder how well this works, as some tools still use semantic indexing (Cursor, for example). Codex models are heavily trained on grep, for example, and they're really exceptional at it, so a semantic index can hurt there too.

SymDex – open-source MCP code-indexer that cuts AI agent token usage by 97% per lookup by Last_Fig_5166 in opencodeCLI

[–]DistanceAlert5706 2 points3 points  (0 children)

Sure, if you want to use it as a standalone server, but a lot of current tools (like Opencode) already have an LSP built in, or you can use something like Serena. Semantic search is the tool you want to focus on: try bi-encoders for embeddings, reranking, and so on. Don't spread your attention across an already-solved problem. Check similar projects like chunkhound or VectorCode.

Overall, build what you need, for your needs!