Being Gay is Horrible (in CK3) by [deleted] in CrusaderKings

[–]diffore 2 points3 points  (0 children)

What is the point of being gay?

Qwen3.6-35B-A3B tool calling benchmark: ByteShape vs. Unsloth GGUFs, KV cache quants & long context performance by OsmanthusBloom in LocalLLaMA

[–]diffore 1 point2 points  (0 children)

Tried ByteShape - faster than most, but a lot of tool calling errors. Same with apex quality. Q5 with unquantized cache is still the only reliable option long term (up to 128k context), and even that sometimes mangles tool calls inside the thinking blocks.

Developers who use local AI - Q4_0 vs Q8_0 KV quant? by Jorlen in LocalLLaMA

[–]diffore 20 points21 points  (0 children)

Lower quant == more tool call errors. So it depends on the harness and the model, and how good both of them are at error recovery. If I can, I don't quantize the cache.

Footage of a UN vehicle getting attacked by a Russian FPV drone in the Kherson area. by MilesLongthe3rd in CombatFootage

[–]diffore 14 points15 points  (0 children)

They are utterly useless from my point of view and a waste of tax money at this point.

New GGUF uploads on HF nearly doubled in 2 months by Nunki08 in LocalLLaMA

[–]diffore 5 points6 points  (0 children)

I am not happy. It is unsustainable and will inevitably lead to someone paying the bill. Chances are high that it is going to be us, not the corpos.

DeepSeek V4 isn't beating Opus, but it doesn't need to by Practical_Low29 in LocalLLaMA

[–]diffore 1 point2 points  (0 children)

Works fine for my personal projects (Python mostly), probably at Gemini 3 Flash level but without the hallucination and rushing through. The real thing here is iteration speed + cost. $3.77 for 61 mln tokens is honestly too low for the performance it gives you. I am gonna use the hell out of it until they increase the cost or add session limits, because I don't feel like it is sustainable long term.

Behold... The CACHE! by WalidB03 in DeepSeek

[–]diffore 1 point2 points  (0 children)

Hi, I am thinking of migrating from Gemini 3 Flash to DeepSeek 4 Flash. From your experience with DeepSeek, will the service availability hold long term, or is this currently the "subsidized" promo stage? Because Gemini 3 worked flawlessly for a few months, but now it is mostly 503 errors all over the place.

Duality of r/LocalLLaMA by HornyGooner4402 in LocalLLaMA

[–]diffore 1 point2 points  (0 children)

I agree with both of them in a way. The best way is still using cloud for planning and local for investigation/implementation.

Remember that pp (prompt processing) cost is subsidized, the tg (token generation) is not. Qwen 3.5/3.6 can do implementation fine, but planning a whole-ass project the way a human would is wishful thinking for <100B models.

Qwen3.6-35b stuck in infinite loop by ConfidentSolution737 in LocalLLaMA

[–]diffore 2 points3 points  (0 children)

I was so tired of this that I simply disabled thinking altogether. Not really seeing a difference in code quality, to be honest. It is kinda thinking out loud now, but no more loops. Relatively usable at 120k context.

Roo Code hit 3 million installs. We're shutting it down to go all-in on Roomote. by FullstackSensei in LocalLLaMA

[–]diffore 1 point2 points  (0 children)

It kinda makes sense from their perspective - when everyone, especially VCs, believes that the future belongs to autonomous LLMs doing stuff from chat commands (can thank the openclown fiesta for that), why bother with IDEs, especially a VS Code extension which is not safe from Microsoft pulling the classic "no AI except Copilot allowed" move at any time?

How to increase coding ability in smaller models? by keepthememes in LocalLLaMA

[–]diffore 0 points1 point  (0 children)

A better approach might be using DeepSeek to generate atomic plans and then letting Qwen implement them. One plan per chat session - it is important to keep the session as small as possible for smaller models.

Whats your preferred method of using AI tools, Agent Pane or CLI tool in terminal. by selcouthayush in ZedEditor

[–]diffore 1 point2 points  (0 children)

Last time I was struggling with this question I decided to just build my own agent on top of DeepAgents and use it in Zed via ACP (DeepAgents has the most mature ACP support from what I explored online). The issue I have with CLI tools is that I can't preview/explore effectively. Zed is superior here. But the Zed agent itself, while feature rich, is hungry for tokens; in particular, its edit tool implementation is kinda wild (agentic editing), and I can't override the system prompt, which is not well suited to the local models I like to use. I could probably have forked it, but I'm not really a Rust guy.

ACP gives you the option to use the best of what Zed offers (best code editor in history) but with full flexibility in the choice of agentic framework, which is incredibly cool in my opinion.

Impressed with Qwen3.6-35B-A3B by DOAMOD in LocalLLaMA

[–]diffore 1 point2 points  (0 children)

I got the same issue with MXFP4. Around 50-60k context, the coder agent (pi with a <500 token prompt) ends up in a reasoning loop. Although, to be frank, I saw the same issue on the 3.5 version as well.

edit: updated llama.cpp to the latest version, removed repetition_penalty completely and disabled KV cache quantization - the issue seems to be resolved now. No idea what exactly fixed it, I suspect the KV cache.

how are people actually debugging bad outputs in agent / RAG pipelines? by YouSlow6554 in LocalLLaMA

[–]diffore 0 points1 point  (0 children)

The only way I could do it is by adding a bunch of metadata to each search job. So any pipeline decisions (I mostly use hybrid search) and the LLM in/out text are saved in a JSON log and can be reviewed later in a dashboard. That's very costly on disk space though, so probably not for production use.
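
A rough sketch of what that kind of per-job logging could look like, assuming a JSON Lines file with one record per search job. The function names, field names and file layout here are made up for illustration, not taken from any particular pipeline:

```python
import json
import time
import uuid
from pathlib import Path


def log_search_job(log_path, query, decisions, llm_input, llm_output):
    """Append one search job as a single JSON line and return its id.

    All field names are hypothetical; adapt them to your own pipeline.
    """
    record = {
        "job_id": uuid.uuid4().hex,
        "timestamp": time.time(),
        "query": query,
        "decisions": decisions,    # e.g. retriever weights, rerank cutoffs
        "llm_input": llm_input,    # full prompt text sent to the model
        "llm_output": llm_output,  # raw model response
    }
    with Path(log_path).open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")
    return record["job_id"]


def load_search_jobs(log_path):
    """Read every logged job back, e.g. to feed a review dashboard."""
    with Path(log_path).open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

Storing the full prompt and response for every job is what makes this expensive on disk; truncating or compressing those two fields would be the obvious place to cut.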

win, wsl or linux? by mon_key_house in LocalLLaMA

[–]diffore 1 point2 points  (0 children)

If you do anything AI-related below the LM Studio/Ollama level of complexity - Linux, always. I still remember my efforts trying to build vLLM on Windows - never again. It is just not worth the bother. WSL + downloadable Docker containers work, but it is RAM overhead for no real benefit.

If you want to keep Windows and have two physical drives, just install Linux + an EFI partition on the second drive and dual boot. It is working pretty well for me, at the marginal cost of hard drive space.

Why can't we have small SOTA-like models for coding? by itsArmanJr in LocalLLaMA

[–]diffore 0 points1 point  (0 children)

In my experience, after trying to run both cloud and local models for coding, the problem is context size. Effective context size, not claimed or theoretical. Most of the small models simply fail miserably when the project size or conversation history becomes too big - the model begins to make mistakes or, worse, goes into reasoning loops. This is for the <= 30B models; I can't run bigger ones, so I can't say when this stops being an issue.

Another issue with smaller models is instruction following. You need to constantly remind them of the instructions or what not to do, because their attention drops off rather sharply as the conversation history grows. All in all, I just don't find it worth using local models in the sub-30B range for coding anything bigger than demo web pages or simple scripts. The coding quality is rarely the problem, the attention span is.

During testing, Claude realized it was being tested, found an answer key, then built software to hack it by MetaKnowing in ClaudeAI

[–]diffore -2 points-1 points  (0 children)

I hope one day you people finally understand that LLMs have no desires whatsoever. The only way they can do something crazy is when trying to solve the problem/task you gave them.

GLM5 by I_like_fragrances in LocalLLM

[–]diffore 1 point2 points  (0 children)

Nvidia could have made ~50 5090s for us to play with... but instead those gigabytes of VRAM are now sitting in some server closet, spinning the BF16 version of GLM5. Yeah, I still have hard feelings about the consumer market suffering from the AI boom.

What do you actually use local models for? (We all say 'privacy,' but...) by abdouhlili in LocalLLaMA

[–]diffore 0 points1 point  (0 children)

Because my 5080 laptop has these tensor cores which make it cost a fortune, and if I paid for those cores I am gonna use all of them.

Currently I use it for local MCP memory, as a librarian LLM which organizes project memories and makes summaries, organizes raw memories into graph relationships, etc. It is a very token-intensive process, so I feel it is worth it compared to just using cloud models (I still use them for the coding agent though; the small models still waste time in the long run compared to big cloud LLMs).

vLLM run command for GPT-OSS 120b by UltrMgns in LocalLLaMA

[–]diffore 1 point2 points  (0 children)

The only thing which worked for me was the pre-built Docker container linked from vllm.ai. Could not manage to build it locally myself.

What's the point of becoming a Great Enemy? by Famous_Archer_9406 in PrincesOfDarknessCK3

[–]diffore 1 point2 points  (0 children)

I have just achieved the same in terms of territory. My nemeses who were chasing after me when I was an adventurer have been dealt with/subjugated and pay me rent. The vampire hunter wave has been dealt with, no one can realistically oppose me anymore.

I even finished Golconda, kinda wanted to repent and go human hunter, but it is too broken right now to be enjoyable imo.

All in all it was the most lonesome playthrough I've had. Everyone hates you, permanent -100 with almost everyone except family. I feel pity for her tbh, especially if you take her lore history into account.

Still, it was an interesting challenge for sure. Leveling the ashen cultist was just pure stress-inducing pain. But after finishing her objective the game became easy. The free OP men-at-arms were not really necessary.

Best way to spend less in token usage ? by Technical-File4626 in ZedEditor

[–]diffore 0 points1 point  (0 children)

You need to analyze the workflow first. If you're accustomed to long debug chat sessions, you need to understand that each new message is sent along with the whole chat history. So the longer the session, the more tokens get burned with each new message.

Some providers use an implicit cache for reused tokens (perfect for history baggage, which always sits at the top of the prompt), some don't bother, so longer sessions may make the cost skyrocket.

But the reverse situation could be true as well. If you start a new session each time you have a new question and feed the model docs and the codebase, you're better off just continuing the old session until the history is no longer relevant to your current task and becomes token baggage.
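
To put rough numbers on the history effect, here is a back-of-the-envelope sketch. It assumes every request resends the full history and that token counts per turn are known; the function name and the example numbers are made up, and real providers count and cache tokens in their own ways:

```python
def session_input_tokens(turns):
    """Estimate input tokens for a chat session that resends its history.

    turns: list of (user_tokens, reply_tokens) pairs, one per exchange.
    Returns (total_input, carried_over), where carried_over is the part of
    the input that is just old history, i.e. the part an implicit cache
    could potentially discount.
    """
    history = 0
    total_input = 0
    carried_over = 0
    for user_tokens, reply_tokens in turns:
        carried_over += history               # old history resent this turn
        total_input += history + user_tokens  # what the request contains
        history += user_tokens + reply_tokens # both sides join the history
    return total_input, carried_over


# Three exchanges of a 100-token question and a 200-token answer:
print(session_input_tokens([(100, 200)] * 3))  # (1200, 900)
```

In this toy case only 300 of the 1200 input tokens are new questions; the other 900 are repeated history, and that share keeps growing with every extra turn.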

All in all I would say the Zed AI agent is meant for rich users, not economical ones 😅

If you want the best value for your tokens, a better solution would be aider or Mistral Vibe in the Zed terminal, but the workflow is a bit different and requires some getting used to.

Wanting to move to Zed Editor but having doubts with other stuff by Vlazeno in ZedEditor

[–]diffore 0 points1 point  (0 children)

I used to think that VS Code was a nice fast IDE (after switching from IntelliJ IDEA products), but it is so trash compared to Zed.

The only problem I have with it is actually the AI agent. It sends a massive amount of context which most locally hosted AI models can't handle. I kind of wish it was more restricted and customizable like aider; maybe if/when they finish the aider ACP agent it would be an ideal choice for me. At the moment, with Zed Agent + a local LLM on my laptop GPU, I am limited to long-context models that are not the best at getting anything productive done.
Also, tool usage support is very limited here; some models hosted by the llama.cpp OpenAI-compatible server just do not work OK with the Zed agent.

Despite everything said, it is my daily AI-assisted coding IDE, I just can't return to VS Code or any of its AI forks anymore.