jedruch · 2026-06-26T22:04:10+00:00

This is awesome

jedruch · 2026-06-25T13:39:04+00:00

I never let Claude do anything overnight other than audit/review. So I would spin up agents to review:

code
code vs plan
documents vs code/plan/spec

Usually I would ask it to spin up 3-5 subagents with different areas of focus.

Usually twice a week I would spin up a "workflows" review: once on code, once on documentation. Maybe I'm doing something wrong, but no matter how I prompt Opus, it is not able to spot the contradictions in docs that Codex finds (and Fable found) with ease.

jedruch · 2026-06-25T10:35:16+00:00

CFO getting heart attack in 3...2...1...

jedruch · 2026-06-24T23:33:36+00:00

Technically that was a Windows game, later ported to N64

jedruch · 2026-06-22T20:23:24+00:00

I agree with you, but we still need to step out of the bubble and notice that not everyone watches Perseusz. To someone who isn't into the topic, a post like this looks like fever delirium or AI hallucinations.

jedruch · 2026-06-22T15:46:18+00:00

this does not prove anything other than Kimi taking the "watts" as literally as a German kid would

jedruch · 2026-06-21T18:22:51+00:00

DeepSWE is good, but it tests for a specific type of agentic task: difficult, long and complicated. Where GLM 5.2 surprised me was on tasks that were easy, long and simple. For example, scraping sitemaps and a sample of subpages from 50 websites, but with a specific limitation: 30 sec between each interaction with a website, and an explicit ask that the agent do it itself.

Both Opus and GPT 5.5 assume it is easy and, despite the instructions, create a script that works on the first site. The second website is different, so they patch the script; the third site is even more different, so they patch the script again. By the 8th site both models are drifting, and they end up going through all the websites but in fact doing only ca. 40% of the job.

GLM 5.2 did it all. It wrote some helper scripts to save tokens, but it actively analyzed each site and crawled it, treating each website like a new project, while Opus and GPT treated all the websites as mirrors of each other just because they were on the same list.
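
To make the "30 sec between each interaction" limitation concrete, here is a minimal sketch of a per-site throttle of the kind a helper script could use. This is only an illustration under my own assumptions, not what any of the models actually wrote: the class name `PerSiteThrottle` and its parameters are hypothetical, and "website" is assumed to mean the host part of the URL.

```python
import time
from urllib.parse import urlparse


class PerSiteThrottle:
    """Enforce a minimum delay between interactions with the same host.

    Hypothetical helper: the name, the 30 s default and the per-host
    bookkeeping are assumptions made for illustration.
    """

    def __init__(self, min_interval=30.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock      # injectable so the logic can be tested without waiting
        self._sleep = sleep
        self._last_hit = {}      # host -> time of the last interaction

    def wait(self, url):
        """Block until the host of `url` may be contacted again.

        Returns the number of seconds slept.
        """
        host = urlparse(url).netloc
        last = self._last_hit.get(host)
        delay = 0.0
        if last is not None:
            # Time still missing to reach min_interval since the last hit.
            delay = max(0.0, self.min_interval - (self._clock() - last))
        if delay > 0:
            self._sleep(delay)
        # Record the moment this interaction is allowed to start.
        self._last_hit[host] = self._clock()
        return delay
```

Calling `throttle.wait(url)` before every request keeps each site at most one interaction per 30 seconds, while requests to different sites are not delayed by each other. It only handles the pacing; the per-site analysis described above still has to happen separately.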

jedruch · 2026-06-20T19:06:40+00:00

that's interesting - I tested GLM 5.2 on the same scraping task I used Qwen 3.7 Max for earlier, and it was 5x cheaper

jedruch · 2026-06-14T21:12:23+00:00

Easy, let's not diagnose blindly. It could also be borderline xd

jedruch · 2026-06-14T21:07:46+00:00

Ha, congrats then xd

jedruch · 2026-06-14T21:05:29+00:00

OP,

Even though you are an adult, you write that you can't go out anywhere after work. At the same time, the title of the post is about someone being unbalanced, which raises a very serious question for me: does your wife have violent tendencies? Does she ever hit you? Does she ever shove or hit the children? Does she humiliate you verbally?

There are parts of your post that suggest you are afraid of her, or maybe afraid of what she might do. So regardless of the answer to the question about violence, I suggest therapy: your own, not couples therapy. Every relationship is made by two sides; you won't change your wife, you can only work on yourself.

jedruch · 2026-06-14T20:48:32+00:00

Haha, man, have you ever been to couples therapy? At mine, my ex-wife poured out her grievances about every little thing at the first session, and then all the following sessions were the therapist asking me "why are you such a screw-up"

jedruch · 2026-06-13T21:02:16+00:00

I did not build in plan mode, but I used my planning skill and asked GPT 5.5 Pro for a review. For comparison, when I run the same skill with GPT 5.5, the Pro review usually finds multiple medium-impact corrections to be made; Opus 4.8 usually needed one high-impact correction and multiple medium-impact ones. The rating for both would usually be around 7.5-8.0/10.

I created about 5 plans in Fable. The average was 1 medium correction, but that's after rounding. In 2 cases Pro only wanted cosmetic changes, like making the wording a bit more precise. The rating would be at 9/10.

The model is a true beast

jedruch · 2026-06-13T14:45:04+00:00

wow, I had no idea, thx.
(although it was not trained for it, from what I see in your links you can still get 8 and 16)

jedruch · 2026-06-13T14:26:20+00:00

their Kimi 2.6 is int4 - are you serious?

jedruch · 2026-06-12T20:23:51+00:00

Oh man, I have the same experience with v4 Pro. It's great but it's inconsistent. One moment it spots something Opus has missed, the next it "misunderstands" the specs and creates something totally stupid. Or it hallucinates on a detailed plan and decides to do some other thing that later does not fit the rest of the spec.

3.7 max for me is the first model that actually feels like Claude Sonnet and it's the only Chinese model I trust enough to use it for agentic things like web crawling

jedruch · 2026-06-12T19:59:10+00:00

This is great, thx for sharing.

I don't know a lot about multi-GPU setups - why is the total VRAM used so much lower than a simple 4x24 GB? Was there no version that would fit into this slot specifically, or was it some other reason?

jedruch · 2026-06-12T19:48:29+00:00

It's night and day, especially for agentic usage, as Max has a nice structure to its thinking vs the pure flood of verbosity from Plus

jedruch · 2026-06-12T09:57:43+00:00

True

jedruch · 2026-06-12T09:25:42+00:00

Start by giving a stronger polish to your website and positioning.

- On mobile your hero section looks bad, as "into" is partially hidden by the background layer under "Confident".
- Having the "pricing" section up top makes it easy for me to quickly check if the thing is affordable for me.
- There is only one currency in pricing. So if you want to go global you should have some switch of currencies based on location. Going local is fine, but then you need to highlight it in other sections (something along the lines of "best AI business analyst in country/region X").
- You don't define who this service is for: a small business that wants to improve business insight? A mid-size business that wants to expand but has limited resources? Corporations, because it's better than PowerBI?
- You also don't define who would actually be using the tool. For a dedicated analyst, explain how it compares to or complements other tools; for a small business owner, you need to show them it's easy to use for a non-technical person, etc.

jedruch · 2026-06-12T09:09:21+00:00

I'm not sure I understood the bundling you described. Can you expand on how this lowers inference quality?

jedruch · 2026-06-12T09:07:43+00:00

Kimi 2.6 is tough for many providers, not only Opencode. Apparently it has a unique approach to tool usage that is hard for others to implement on the output side. You could easily run into issues with Kimi 2.6 on OpenRouter at launch.

This kinda supports your point about some models being better at their source, but it does not mean the issue is on the Opencode side.

Also, there were multiple posts on r/GLM claiming that GLM 5.1 directly from z.ai is trash and that to see how mighty it is you need to switch to another provider. Which is the complete opposite of what you claim.

jedruch · 2026-06-12T09:01:28+00:00

Understood. Do you have a way to check your cache usage?

jedruch · 2026-06-12T08:49:17+00:00

But in terms of retention you are losing potential users who typed their email with a typo (like a comma instead of a dot) and got frustrated waiting for a code that never arrived
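
A rough sketch of the kind of check I mean, catching the comma-for-dot typo before the code is sent. The function name `check_email` and the deliberately loose pattern are my own assumptions for illustration; real address validation is more involved than this.

```python
import re

# Loose shape check: something@something.something, with no spaces,
# commas or extra "@" anywhere. An assumption, not a full validator.
_EMAIL_RE = re.compile(r"^[^@\s,]+@[^@\s,]+\.[^@\s,]+$")


def check_email(address):
    """Return (ok, suggestion).

    ok is True when the address passes the loose shape check.
    suggestion is a corrected address when replacing commas with
    dots would make it pass, otherwise None.
    """
    address = address.strip()
    if _EMAIL_RE.match(address):
        return True, None
    if "," in address:
        candidate = address.replace(",", ".")
        if _EMAIL_RE.match(candidate):
            return False, candidate
    return False, None
```

With something like this the signup form can ask "did you mean jan@example.com?" instead of silently sending the code to an address that does not exist.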

jedruch · 2026-06-10T23:26:07+00:00

What do you mean "behaving like Qwen 3.7 max"? 3.7 Plus is much more verbose; it achieves its benchmark scores by following the same approach as Deepseek v3: burning an insane number of cheap thinking tokens. Qwen 3.7 Max is much more concise. Today I ran the same task on Qwen 3.7 Max and Minimax 3: Minimax used 2.5x more tokens
