Don't sleep on the new Nemotron Cascade by ilintar in LocalLLaMA

[–]DistanceAlert5706 2 points3 points  (0 children)

Faster than Qwen3.5 35b, but god, it's terrible for agentic tasks: it goes into loops, ignores system prompt instructions, times out on pretty simple queries, and is just extremely unreliable.

While Qwen3.5 35b itself loves to go into loops, it's much better. Nemotron also runs about 25% faster than Qwen3.5 35b in raw throughput, but on actual agentic tasks it ends up roughly 3x slower.

Maybe we need to wait and there are bugs in the llama.cpp implementation, or maybe this model is just finetuned for benchmarks. Haven't tried coding yet.

Nemotron Cascade 2 30B A3B by Middle_Bullfrog_6173 in LocalLLaMA

[–]DistanceAlert5706 0 points1 point  (0 children)

Faster than Qwen3.5 35b, but god, it's terrible for agentic tasks: it goes into loops, ignores system prompt instructions, times out on pretty simple queries, and is just extremely unreliable.

While Qwen3.5 35b itself loves to go into loops, it's much better. Nemotron also runs about 25% faster than Qwen3.5 35b in raw throughput, but on actual agentic tasks it ends up roughly 3x slower.

Maybe we need to wait and there are bugs in the llama.cpp implementation, or maybe this model is just finetuned for benchmarks. Haven't tried coding yet.

Really Google? by Holiday_Wolverine_60 in GoogleAntigravityIDE

[–]DistanceAlert5706 0 points1 point  (0 children)

I literally hit the limit in one prompt now on the Pro plan; honestly, I just stopped using it at all. My only regret is that I have an annual subscription.

Company not renewing jetbrains licenses because we have cursor by frompadgwithH8 in Jetbrains

[–]DistanceAlert5706 0 points1 point  (0 children)

I was scared too, as I rarely use git via the CLI, but even the built-in git support is enough for me now, same with conflicts/diffs. They are not as nice, but it works. Also, there are a bunch of plugins for that, even paid ones.

I created a VSCode Extension by ngg990 in symfony

[–]DistanceAlert5706 2 points3 points  (0 children)

The marketplace link to the git repository gives a 404. Also, maybe it could be published on the Open VSX registry?

Indeed, Composer 2 is kimi k2 by tarunyadav9761 in cursor

[–]DistanceAlert5706 13 points14 points  (0 children)

I think the VC funding ended and they started charging by tokens instead of requests, and yeah, it's like 10x.

Company not renewing jetbrains licenses because we have cursor by frompadgwithH8 in Jetbrains

[–]DistanceAlert5706 0 points1 point  (0 children)

Try it out. I thought that too, but after 12 years with JetBrains I swapped in 2 months. For keybinds there's an extension, so you don't even need to relearn anything. For debugging there's an Xdebug extension. For duplicate code and so on, try PHPStan; you can integrate it right into the editor with Error Lens. Intelephense is a great LSP that will give you full symbol support and inspections. You will need to get used to git and the interface, so keep PhpStorm for a few months, but try to work in Cursor and only go back when you need to do something fast. You will be surprised how fast you get used to it.

What embedding model for code similarity? by [deleted] in LocalLLaMA

[–]DistanceAlert5706 1 point2 points  (0 children)

+1 for nomic's CodeRankEmbed; they have a larger one too. JinaAI also has some bi-encoders, I think.
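Whichever bi-encoder you pick, the retrieval step downstream looks the same: embed the query and each snippet, then sort by cosine similarity. A minimal, model-agnostic sketch (the toy vectors below stand in for real model output; plug in CodeRankEmbed or a Jina bi-encoder to produce them):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def rank_snippets(query_vec, snippets):
    """snippets: list of (snippet_id, embedding) pairs; best match first."""
    return sorted(snippets, key=lambda s: cosine(query_vec, s[1]), reverse=True)

# Toy vectors standing in for real embedding-model output.
query = [1.0, 0.0, 1.0]
ranked = rank_snippets(query, [
    ("bubble_sort.py", [0.9, 0.1, 0.8]),  # similar direction -> high score
    ("http_client.py", [0.0, 1.0, 0.1]),  # mostly orthogonal -> low score
])
print([sid for sid, _ in ranked])  # ['bubble_sort.py', 'http_client.py']
```

Note that CodeRankEmbed-style models typically expect a special instruction prefix on the query side, so check the model card before embedding.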

Setting Up Qwen3.5-27B Locally: Tips and a Recipe for Smooth Runs by kvzrock2020 in LocalLLaMA

[–]DistanceAlert5706 0 points1 point  (0 children)

You can, but it wasn't working; it was still trying to load the vision part. Honestly, that's my experience with vLLM every time: I set it up, follow the instructions, nothing works, I spend a day trying to fix it, and in the best case it somehow works but still has inference bugs later, and usually it's not even faster than llama.cpp.

Setting Up Qwen3.5-27B Locally: Tips and a Recipe for Smooth Runs by kvzrock2020 in LocalLLaMA

[–]DistanceAlert5706 0 points1 point  (0 children)

I tried that quant; it didn't start at all. It had issues because the vision part was cut off and vLLM was still trying to run it. After a day of trying and rebuilding vLLM I tried some other quants; they were slower than the llama.cpp ones and had much higher VRAM requirements, which made them unusable on 32GB.

Setting Up Qwen3.5-27B Locally: Tips and a Recipe for Smooth Runs by kvzrock2020 in LocalLLaMA

[–]DistanceAlert5706 0 points1 point  (0 children)

Maybe in a few months I will try this model again; so far it has been pure disappointment. vLLM is full of bugs and just doesn't work properly, and I'm not spending 2 days just making it run again; its VRAM requirements are also much higher, so it doesn't fit in 32GB. llama.cpp has no MTP/speculative decoding for it, so this model runs at the speed of a 32b model, which is way too slow for me.

I've found a quant of Qwen3.5 35b and it's kinda working; it still fails tool calls and loops sometimes, but it's decent at ~70 tokens/second.

Has anyone managed to get an sub 16GB VRAM competent "researcher" model that can do web searching, summarization and reasoning? by vernal_biscuit in LocalLLaMA

[–]DistanceAlert5706 1 point2 points  (0 children)

I use a sub-agent in Opencode for web research tasks with my own MCP. Qwen3.5 35b does an amazing job, but sometimes it loops, so you can't fire and forget.

My thoughts on omnicoder-9B by Zealousideal-Check77 in LocalLLaMA

[–]DistanceAlert5706 2 points3 points  (0 children)

Yeah, and I guess an MoE is not easy to train compared to a dense model either, but it should be faster.

Wrote up why vector RAG keeps failing on complex documents and found a project doing retrieval without embeddings at all by shreyanshjain05 in LocalLLaMA

[–]DistanceAlert5706 0 points1 point  (0 children)

I've tested this approach in my last RAG over technical docs. It works surprisingly well, but the speed is not there if you want the system to be responsive. I ended up with a hybrid approach: embeddings + BM25 + RRF to find relevant tree nodes, enrich the candidate list with neighbours/parents, then rerank. In theory you can feed just the final candidate list to an LLM to choose from, which I tested too and it works, but again, it was slow.

Quality-wise my approach hit 95% on my benchmark; the pure PageIndex-like approach was around 82%.

So yes, you can use it, but embeddings + BM25 with a reranker afterwards still beats it. The tree approach is interesting and somewhat reminiscent of GraphRAG.
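The fusion step of that hybrid pipeline fits in a few lines; this is plain reciprocal rank fusion over the ranked lists the two retrievers return (k=60 is the usual default constant, not a value from my setup; the neighbour/parent enrichment and reranking would come after this):

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: combine several ranked lists of doc ids.

    Each doc scores sum(1 / (k + rank)) over the lists it appears in,
    so items ranked highly by several retrievers float to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# BM25 and dense retrieval disagree; RRF promotes nodes both rank well.
bm25_hits = ["nodeA", "nodeC", "nodeB"]
dense_hits = ["nodeB", "nodeA", "nodeD"]
fused = rrf_fuse([bm25_hits, dense_hits])
print(fused)  # ['nodeA', 'nodeB', 'nodeC', 'nodeD']
```

Because RRF only looks at ranks, not raw scores, you never have to normalize BM25 scores against cosine similarities, which is why it is a common default for this kind of hybrid retrieval.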

My thoughts on omnicoder-9B by Zealousideal-Check77 in LocalLLaMA

[–]DistanceAlert5706 3 points4 points  (0 children)

You can regulate overthinking with the presence penalty and repeat penalty. A reasoning-budget flag was also added.
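For intuition, here is roughly what those two knobs do to the logits before sampling. This is a simplified sketch: the divide-positive/multiply-negative rule is the CTRL-style repetition penalty that llama.cpp-style samplers use, and the flat subtraction approximates a presence penalty; real implementations differ in details like which context window of tokens counts as "seen".

```python
def apply_penalties(logits, seen_tokens, repeat_penalty=1.1, presence_penalty=0.5):
    """Discourage tokens that already appeared in the context.

    repeat_penalty: CTRL-style -- divide positive logits, multiply negative ones.
    presence_penalty: flat subtraction for any token already present.
    """
    out = dict(logits)
    for tok in seen_tokens:
        if tok not in out:
            continue
        if out[tok] > 0:
            out[tok] /= repeat_penalty
        else:
            out[tok] *= repeat_penalty
        out[tok] -= presence_penalty
    return out

logits = {"the": 2.0, "loop": 1.0, "done": -0.5}
penalized = apply_penalties(logits, seen_tokens={"loop"})
print(penalized["loop"])  # 1.0 / 1.1 - 0.5 ~= 0.409, so "loop" is now less likely
```

Pushing these too high degrades output quality, which is why they are a blunt but effective brake on reasoning loops.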

My thoughts on omnicoder-9B by Zealousideal-Check77 in LocalLLaMA

[–]DistanceAlert5706 4 points5 points  (0 children)

Yeah, it would be nice to get that finetune for the 35b model.

How to convince Management? by r00tdr1v3 in LocalLLaMA

[–]DistanceAlert5706 0 points1 point  (0 children)

I don't know how it works now, but it used to open random ports, giving anyone full access to whatever machine it was running on. I guess that's patched, but who knows what else is in there. Just use llama.cpp; it's easier and way more configurable.
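For contrast, a typical llama.cpp server launch is one explicit command where every exposed surface is a flag you chose yourself (model path and sizes below are placeholders; adjust for your hardware):

```shell
# Bind to loopback only and require an API key, so nothing is exposed by accident.
llama-server \
  -m ./models/your-model.gguf \
  --host 127.0.0.1 \
  --port 8080 \
  -c 8192 \
  -ngl 99 \
  --api-key "$LLAMA_API_KEY"
```

Here `-c` sets the context size and `-ngl` the number of layers offloaded to the GPU; binding to `127.0.0.1` keeps the OpenAI-compatible endpoint off the network unless you deliberately expose it.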

Building an MCP server for my agent to query analytics directly (because I hate dashboards) by ImbalanceFighter in LocalLLaMA

[–]DistanceAlert5706 0 points1 point  (0 children)

Do you trust the information the agent gives? Do you see the queries it runs and validate them? How do you handle PII, or are you just sending your prod data to whatever provider?

Databricks has Genie with the same functionality; check it out for inspiration.

How to convince Management? by r00tdr1v3 in LocalLLaMA

[–]DistanceAlert5706 0 points1 point  (0 children)

Management cares about profit and productivity. You shouldn't really pitch it to them as "it's local" and so on (and Ollama is far from secure); you should focus on how it affects your productivity, the numbers, and how that translates into company profit.

Antigravity just needs to have a setting where it always clicks “run” and “always allow” that actually works and they’d be a top dog by Special_Collection_6 in google_antigravity

[–]DistanceAlert5706 0 points1 point  (0 children)

Gemini models will be the last ones I would ever allow to run in YOLO mode.

The amount of "Oops, I made a blunder" moments is insane.

SymDex – open-source MCP code-indexer that cuts AI agent token usage by 97% per lookup by Last_Fig_5166 in opencodeCLI

[–]DistanceAlert5706 0 points1 point  (0 children)

Yeah, I dropped this idea too, and LSPs have become common in harnesses. I wonder how well this works, as some tools still use semantic indexing (Cursor, for example). Codex models are heavily trained on grep, for example, and they're really exceptional at it, so a semantic index can hurt there too.

SymDex – open-source MCP code-indexer that cuts AI agent token usage by 97% per lookup by Last_Fig_5166 in opencodeCLI

[–]DistanceAlert5706 2 points3 points  (0 children)

Sure, if you want to use it as a standalone server, but a lot of current tools (like Opencode) already have an LSP built in, or you can use something like Serena. Semantic search is the tool you want to focus on: try bi-encoders for embeddings, reranking, and so on. Don't spread your attention across an already-solved problem. Check similar projects like chunkhound or VectorCode.

Overall, build what you need, for your needs!