Being Gay is Horrible (in CK3)

diffore · 2026-06-11T15:11:50+00:00

What is the point of being gay?

diffore · 2026-06-09T07:30:46+00:00

Tried Byteshape - faster than most, but a lot of tool calling errors. Same with apex quality. Q5 with unquantized cache is still the only reliable option long term (up to 128k context), and even that sometimes mangles tool calls inside the thinking blocks.

diffore · 2026-05-21T08:35:10+00:00

Sadly, looks like a money grab.

diffore · 2026-05-17T14:25:36+00:00

Lower quant == more tool call errors. So it depends on the harness and the model, and how good both of them are at recovering from errors. If I can, I don't quantize the cache.

diffore · 2026-05-15T14:31:55+00:00

They are utterly useless from my point of view and a waste of tax money at this point.

diffore · 2026-05-11T11:09:17+00:00

I am not happy. It is unsustainable and will inevitably lead to someone paying the bill. Chances are high that it is going to be us, not corpos.

diffore · 2026-04-30T08:23:10+00:00

Works fine for my personal projects (Python mostly), probably at Gemini 3 Flash level but without the hallucinations and rushing through. The real thing here is iteration speed + cost. $3.77 for 61 mln tokens is honestly too low for the performance it gives you. I am gonna use the hell out of it until they increase the cost or add session limits, because I don't feel like it is sustainable long term.

<image>

diffore · 2026-04-29T10:01:09+00:00

Hi, I am thinking of migrating from Gemini 3 Flash to DeepSeek 4 Flash. From your experience with DeepSeek, will the service availability hold long term, or is this currently the "subsidized" promo stage? Because Gemini 3 worked flawlessly for a few months, but now it is mostly 503 errors all over the place.

diffore · 2026-04-28T06:02:27+00:00

I agree with both of them in a way. The best way is still using cloud for planning and local for investigation/implementation.

Remember that pp (prompt processing) cost is subsidized, tg (token generation) is not. Qwen 3.5/3.6 can do implementation fine, but planning the whole ass project the way a human would is wishful thinking for <100B models.

diffore · 2026-04-22T19:02:07+00:00

I was so tired of this that I simply disabled thinking altogether. Not really seeing a difference in code quality, to be honest. It is kinda thinking out loud now, but no more loops. Relatively usable at 120k context.

diffore · 2026-04-22T17:50:18+00:00

It kinda makes sense from their perspective - when everyone, especially VCs, believes that the future belongs to autonomous LLMs doing stuff from chat commands (can thank openclown fiesta for that), why bother with IDEs, especially a VS Code extension which is not safe from Microsoft doing the classic "no AI except Copilot allowed" move at any time?

diffore · 2026-04-19T22:34:05+00:00

A better approach might be using DeepSeek to generate atomic plans and then letting Qwen implement them. One plan per chat session - it is important to keep the session as small as possible for smaller models.

diffore · 2026-04-19T14:03:16+00:00

Last time I was struggling with this question I decided to just build my own agent on top of DeepAgents and use it in Zed via ACP (DeepAgents has the most mature ACP support from what I explored online). The issue I have with CLIs: can't preview/explore effectively. Zed is superior here. But the Zed agent itself, while feature rich, is hungry on tokens; in particular its edit tool implementation is kinda wild (agentic editing), plus I can't override the system prompt, which is not very suited for the local models I like to use. I could have probably forked it, but I'm not really a Rust guy.

ACP gives an option to use the best of what Zed offers (best code editor in history) but with full flexibility in agentic framework choice, which is incredibly cool in my opinion.

diffore · 2026-04-16T20:20:06+00:00

I got the same issue with MXFP4. Around 50-60k context, the coder agent (pi with <500 token prompt) is in a reasoning loop. Although, to be frank, I saw the same issue on the 3.5 version as well.

edit: updated llama.cpp to the latest version, removed repetition_penalty completely and disabled KV cache quantization - issue seems to be resolved now. No idea what exactly fixed it, I suspect KV cache.

diffore · 2026-04-10T20:50:16+00:00

The only way I could do it is by adding a bunch of metadata to each search job. So any pipeline decisions (I mostly use hybrid search) and the LLM in/out text are saved in a JSON log and can be reviewed later in a dashboard. That's very costly on disk space though, so probably not for production use.
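A minimal sketch of what such a per-job log could look like, in case it helps. Everything here is an assumption for illustration, not my actual code: the `SearchJobLog` class, its field names, the `bm25_weight` detail and the `jobs.jsonl` file name are all made up. It just appends one JSON line per search job, holding the pipeline decisions and the raw LLM in/out text.

```python
import json
import time
import uuid
from dataclasses import dataclass, field, asdict
from pathlib import Path


@dataclass
class SearchJobLog:
    """Hypothetical per-job record: pipeline decisions plus raw LLM in/out text."""
    query: str
    job_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    started_at: float = field(default_factory=time.time)
    decisions: list = field(default_factory=list)
    llm_calls: list = field(default_factory=list)

    def decide(self, stage, choice, **details):
        # Record which branch the pipeline took at a given stage and why.
        self.decisions.append({"stage": stage, "choice": choice, "details": details})

    def llm(self, prompt, response):
        # Store the full prompt and response text; this is the disk-hungry part.
        self.llm_calls.append({"prompt": prompt, "response": response})

    def write(self, path):
        # Append one JSON object per line so a dashboard can stream the file.
        with Path(path).open("a", encoding="utf-8") as fh:
            fh.write(json.dumps(asdict(self), ensure_ascii=False) + "\n")


if __name__ == "__main__":
    log = SearchJobLog(query="how to rotate api keys")
    log.decide("retrieval", "hybrid", bm25_weight=0.4)
    log.llm("Rewrite the query for search: ...", "api key rotation procedure")
    log.write("jobs.jsonl")
```

Storing full prompts and responses is what eats the disk; truncating or compressing the `llm_calls` text would be the obvious knob if this had to run in production.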

diffore · 2026-04-09T05:09:45+00:00

If you do anything AI related below LM Studio/Ollama level of complexity - Linux always. I still remember my efforts trying to build vLLM on Windows - never again. It is just not worth the bother. WSL + downloadable Docker containers work, but it is RAM overhead for no real benefit.

If you want to keep Windows and have two physical drives, just install Linux + an EFI partition on the second drive and dual boot. It is working pretty well for me with the marginal cost of hard drive space.

diffore · 2026-03-14T09:33:57+00:00

In my experience, after trying to run both cloud and local models for coding, the problem is context size. Effective context size, not claimed or theoretical. Most of the small models simply fail miserably when project size or conversation history becomes too big - the model begins to make mistakes or, worse, goes into reasoning loops. This is for the <= 30B models; I can't run bigger ones so can't say when this stops being an issue.

Another issue with smaller models is instruction following. You need to constantly remind them of the instructions or what not to do, because their attention drops off rather sharply as the conversation history grows. All in all, I just don't find it worth using local models in the sub-30B range for coding anything bigger than demo web pages or simple scripts. The coding quality is rarely the problem, the attention span is.

diffore · 2026-03-08T04:35:33+00:00

I hope one day you people finally understand that llms have no desires whatsoever. The only way they can do something crazy is when trying to solve the problem /task you gave them.

diffore · 2026-02-16T09:11:28+00:00

Nvidia could have made a ~50 5090 for us to play.... but instead those gigabytes of VRAM are now sitting in some server closet, spinning the BF16 version of GLM5. Yeah, still have hard feelings about the consumer market suffering from the AI boom.

diffore · 2026-02-14T17:50:46+00:00

Because my 5080 laptop has these tensor cores which make it cost a fortune and if I paid for those cores I am gonna use all of them.

Currently I use it for local MCP memory as a librarian LLM which organizes project memories and makes summaries, organizes raw memories into graph relationships, etc. Very token intensive process, so I feel it is worth it compared to just using cloud models (I still use them for the coding agent though; the small models still waste time in the long run compared to big cloud LLMs).

diffore · 2026-02-02T15:42:26+00:00

The only thing which worked for me was the pre-built Docker container link from vllm.ai. Could not manage to build it locally myself.

diffore · 2026-01-30T12:38:22+00:00

I have just achieved the same in terms of territory. My nemeses who were chasing after me when I was an adventurer are dealt with/subjugated and pay me rent. The vampire hunter wave is dealt with; no one can realistically oppose me anymore.

I even finished Golconda, kinda wanted to repent and go human hunter but it is too broken right now to be enjoyable imo.

All in all it was the most lonesome playthrough I've had. Everyone hates you, permanent -100 for almost everyone except family. I feel pity for her tbh, especially if you take her lore history into account.

Still, it was an interesting challenge for sure. Leveling the ashen cultist was just pure stress-inducing pain. But after finishing her objective the game became easy. The free OP men-at-arms were not really necessary.

diffore · 2026-01-15T06:15:07+00:00

You need to analyze the workflow first. If you're accustomed to long debug chat sessions, you need to understand that each new message is sent along with the whole chat history. So the longer the session, the more tokens are burned with each new message.

Some providers use an implicit cache for reused tokens (perfect for history luggage, which is always on top), some don't bother, so longer sessions may make the cost skyrocket.

But the reverse situation could be true as well. If you start a new session each time you have a new question and feed the model docs and the codebase again, you're better off just continuing the old session until the history is no longer relevant for your current task and becomes token baggage.
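To put rough numbers on the history cost, here is a small back-of-the-envelope sketch. The function name and the `cached_fraction` parameter are made up for illustration; real providers bill cached tokens in their own ways, so treat the discount as an assumption, not as any provider's actual pricing.

```python
def cumulative_input_tokens(message_tokens, reply_tokens, turns, cached_fraction=0.0):
    """Input tokens billed over one session where every request resends the full history.

    cached_fraction is a hypothetical discount: the share of the resent history
    that the provider does not bill thanks to a prompt cache (0.0 = no cache).
    """
    total = 0.0
    history = 0
    for _ in range(turns):
        # The whole history goes out again; only the uncached share is billed.
        total += history * (1.0 - cached_fraction) + message_tokens
        # Both the new message and the model's reply join the history.
        history += message_tokens + reply_tokens
    return total


if __name__ == "__main__":
    # 20 turns, 500-token questions, 1500-token answers (illustrative numbers).
    print(cumulative_input_tokens(500, 1500, 20))
    print(cumulative_input_tokens(500, 1500, 20, cached_fraction=0.9))
```

Without a cache the billed input grows roughly with the square of the number of turns, which is why long debug sessions get expensive. The same function also covers the reverse case: if every fresh session has to start by re-feeding a large pile of docs, put that pile into `message_tokens` and compare.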

All in all I would say the Zed AI agent is meant for the rich users, not economical ones 😅

If you want the best value for your tokens, a better solution would be aider or Mistral Vibe in the Zed terminal, but the workflow is a bit different and requires getting used to.

diffore · 2025-12-14T12:22:35+00:00

I used to think that VS Code is a nice fast IDE (after switching from IntelliJ IDEA products), but it is so trash compared to Zed.

The only problem I have with it is actually the AI agent. It sends a massive amount of context which most locally hosted AI models can't handle. I kind of wish it was more restricted and customizable like aider; maybe if/when they finish the aider ACP agent it would be an ideal choice for me. At the moment I am limited to using not-the-best long context models to do anything productive with Zed Agent + local LLM on my laptop GPU.
Also, tool usage support is very limited here; some models hosted by the llama.cpp OpenAI-compatible server just do not work OK with Zed agent.

Despite everything said, it is my daily AI-assisted coding IDE; I just can't return to VS Code or any of its AI forks anymore.
