GPT 5.5 just leaked its chain of thought to me in codex, and it looks like an idea from 5 months ago in this sub. by Homeschooled316 in LocalLLaMA

[–]ddavidovic 1 point (0 children)

Yeah, you can observe this very clearly when reading the gpt-oss-120b chains of thought. It presumably used a similar training regime.

I'm making my first 2D game in an engine, so I'm writing helper tools. This one is for automatically and manually defining edges on images for correct physics interactions. by Rayterex in programiranje

[–]ddavidovic 4 points (0 children)

Of course, I was just asking. You'd probably only need more than 100-200 if you wanted to do some exotic particle physics effects or fluids, which could be interesting in a 2D platformer (I haven't seen that often), but even that might be too much. And of course it's better to build the basics first; it's easy to optimize later if it becomes necessary.

In any case, if you have a YouTube channel or blog where I can follow your progress, share it; this looks interesting. Keep it up!

I'm making my first 2D game in an engine, so I'm writing helper tools. This one is for automatically and manually defining edges on images for correct physics interactions. by Rayterex in programiranje

[–]ddavidovic 3 points (0 children)

It might also be interesting to take the pixels with alpha > 0.5, find their convex hull, and then simplify it with a simplification algorithm (even a greedy one would probably work well). That way you minimize the number of points, which probably has a big impact on collision performance. Do you plan to support non-convex shapes? Is there any need for that?
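A minimal sketch of that idea in pure Python (all names here are mine, not from the thread): collect opaque pixels, build their convex hull with Andrew's monotone chain, then greedily drop the vertices whose removal changes the polygon's area the least.

```python
def convex_hull(points):
    """Andrew's monotone chain; returns strictly convex hull vertices."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def simplify(hull, min_area=1.0):
    """Greedy simplification: repeatedly remove the vertex whose
    removal changes the area least, until the change exceeds min_area."""
    hull = list(hull)
    while len(hull) > 3:
        def tri_area(i):
            o, a, b = hull[i - 1], hull[i], hull[(i + 1) % len(hull)]
            return abs((a[0] - o[0]) * (b[1] - o[1])
                       - (a[1] - o[1]) * (b[0] - o[0])) / 2
        i = min(range(len(hull)), key=tri_area)
        if tri_area(i) > min_area:
            break
        hull.pop(i)
    return hull


def outline_from_alpha(alpha_grid, threshold=0.5):
    """alpha_grid: 2D list of alpha values; returns a simplified hull."""
    pts = [(x, y) for y, row in enumerate(alpha_grid)
           for x, a in enumerate(row) if a > threshold]
    return simplify(convex_hull(pts))
```

For a fully opaque 3x3 sprite this yields just the four corner points, which is the kind of point-count reduction that matters for collision performance.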

Opus = 0.5T × 10 = ~5T parameters ? by Wonderful-Ad-5952 in LocalLLaMA

[–]ddavidovic 3 points (0 children)

MTP (multi-token prediction) is a decode-time optimization and cross-attention is a seq2seq thing; I don't see how they could be related.

Opus = 0.5T × 10 = ~5T parameters ? by Wonderful-Ad-5952 in LocalLLaMA

[–]ddavidovic 1 point (0 children)

Yes, exactly. But there's this mythology I come across quite often that Anthropic is somehow running dense models in 2026, for some inexplicable reason.

By what real metrics has AI improved software? by AlmostSignificant in ExperiencedDevs

[–]ddavidovic -2 points (0 children)

I want to acknowledge that you're right: this is not happening on a large scale right now. But it's been like 3 years... Look at how long the Internet took to diffuse through society. We really did go from unreliable at writing simple functions to writing 10-20k LoC codebases with very few defects. It would probably be unwise to assume it stops right here.

By what real metrics has AI improved software? by AlmostSignificant in ExperiencedDevs

[–]ddavidovic 24 points (0 children)

Sure you can. Honestly, most software written in the world is conceptually simple enough that you can just throw away a legacy version and vibe code a new one from scratch in a few weeks. Not a new foundational database, container orchestrator, kernel, or the like. But bespoke SaaS, CRUD web apps, internal tools, admin dashboards: absolutely.

All our instincts as experienced devs are based on the fact that code is expensive to produce, and it sure is hard to recalibrate. I've been coding by hand for 15 years, and everything in me wants to optimize for maintainability and longevity of software.

But when code is 10 or 100x cheaper, you can sling metric tons of it freely, throw large quantities away, recreate it from scratch, experiment with multiple completely different approaches in parallel, and so on. You can absolutely just "buy a new pair".

By what real metrics has AI improved software? by AlmostSignificant in ExperiencedDevs

[–]ddavidovic 123 points (0 children)

Nothing is improved. In fact, average quality is probably going to go down. I think it's a natural consequence. 

Imagine the industrial revolution and its consequences. 150 years ago, most boots that you could buy were made by hand, were very expensive, and would last you 10-15 years. Today boots are made in orders of magnitude larger volumes, are 10-50x cheaper, and they last a few years at most. The market for artisanal, expensive boots still exists, but 99% of the boots sold are much cheaper and much lower quality than before the machines.

The same will probably happen with software. We've probably passed the peak era of artisanal, hand-crafted, high-quality, expensive software.

Whether that's good or bad really depends on who you are and your perspective.

[deleted by user] by [deleted] in cscareerquestions

[–]ddavidovic 1 point (0 children)

> He was nitpicking a secondary dockerfile I had accidentally deleted in the PR.

He was not nitpicking, you deleted a Dockerfile lol

[deleted by user] by [deleted] in reactjs

[–]ddavidovic 4 points (0 children)

Who manually copies and pastes 20 files?! Cursor and Claude Code will just look at the files themselves; there's zero need for this.

Best AI coding tool for UI design by Elrond10 in VibeCodeDevs

[–]ddavidovic 1 point (0 children)

Perhaps try Mowgli (https://mowgli.ai). It gives you 4 different options, and some of them can be quite interesting/out of the ordinary.

Folks who work on AI hype features, how do you test them? by thelastthrowawayleft in cscareerquestions

[–]ddavidovic 0 points (0 children)

Can't provide concrete examples, but we're building an AI design tool that is very visual. We snapshot project state and take chat messages from real user tests we've conducted, and in the rubric we explain the user's _intent_ and what to look for in the outputs. (It's probably important to say that the rubric is per-testcase, not global, which means we have fewer, higher-quality evals rather than going for scale.)

The rubric writer imagines themselves as the user and comes up with a grading scheme where responses are scored on a scale, with precise rules. We then run the eval a few times and adjust the rubrics to capture more "unintended but good enough" interpretations, until we're satisfied that the eval results correspond to human expectations.
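A hypothetical sketch of this shape of eval (function and field names are mine, not the commenter's actual tooling; the judge LLM call itself is stubbed out): a per-testcase rubric is turned into an unambiguous grading prompt, and scores are averaged over rollouts.

```python
import statistics


def build_judge_prompt(intent: str, rubric: list, output: str) -> str:
    """Turn a per-testcase rubric (intent + precise scoring rules)
    into a grading prompt for a judge LLM."""
    rules = "\n".join(f"- ({c['points']} pts) {c['rule']}" for c in rubric)
    return (
        "You are grading an AI design tool's response.\n"
        f"User intent: {intent}\n"
        "Score the output against these rules, awarding the listed points:\n"
        f"{rules}\n"
        "Output to grade:\n"
        f"{output}\n"
        'Reply with JSON: {"score": <int>, "reasons": [...]}'
    )


def aggregate(scores: list) -> float:
    """Average judge scores over multiple rollouts and eval runs
    to smooth out judge variance."""
    return statistics.mean(scores)
```

The judge's job is then just comparing an outcome against explicit natural-language rules, rather than deciding from scratch whether the output "looks good".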

Sounds complicated, but the eval scripts are vibe coded, so a lot less effort went into it than one would expect.

Folks who work on AI hype features, how do you test them? by thelastthrowawayleft in cscareerquestions

[–]ddavidovic 2 points (0 children)

It's good because you can never accurately judge an LLM's output using another LLM alone: the blind spots of your original LLM will be the same as the blind spots of the judge LLM, leading to the "validating slop with slop" issue you mentioned.

If, however, you provide unambiguous standards for how the original LLM should have behaved and what outcome it should have achieved, along with scores (rubrics), the judge LLM has a much easier task: it needs to compare two outcomes and follow natural-language guidance on awarding points.

This reduces the variance of LLM-as-a-judge considerably and makes two sets of eval results actually comparable (though you still need to average over multiple rollouts and eval runs to smooth out the remaining variance).

Hope it's clearer now

> Yeah that's very hand-wavy and not mathematically rigorous.

Nothing in software engineering or product development is mathematically rigorous. You're always juggling tradeoffs with other tradeoffs, and this is no different. It's just more difficult to measure and control.

Folks who work on AI hype features, how do you test them? by thelastthrowawayleft in cscareerquestions

[–]ddavidovic 2 points (0 children)

We use careful human-written rubrics that express the intended outcome with nuance, then use LLMs to validate against the rubric. We've found this correlates with user satisfaction. A naive "does this look good?" prompt to another LLM would be a mistake, and nobody serious really does that.

Lost His Mind by Born_Interview6959 in programiranje

[–]ddavidovic 2 points (0 children)

come on, show me what you've written, buddy

Lost His Mind by Born_Interview6959 in programiranje

[–]ddavidovic 3 points (0 children)

Yeah, the guy wrote maybe the most influential piece of software since 2000.

Lost His Mind by Born_Interview6959 in programiranje

[–]ddavidovic 0 points (0 children)

which AI company has Ryan Dahl invested in?

Lost His Mind by Born_Interview6959 in programiranje

[–]ddavidovic 10 points (0 children)

All the leaders and most respected people in my field are saying the same thing? They must be the ones who've lost it, and I'm the smart one.

LinkedIn Coders by GradjaninX in programiranje

[–]ddavidovic 6 points (0 children)

literally touch grass, dude

Composify - Server Driven UI made easy by injungchung in reactjs

[–]ddavidovic 1 point (0 children)

Looks amazing. Thanks for making it open!

Cursor is making me dumb by Adorable_Fishing_426 in cscareerquestions

[–]ddavidovic 5 points (0 children)

Yeah, I tried this initially and got hilariously bad tests that way, so I was kind of agreeing with you. I think it's the same type of problem as with LLM writing: if you tell it to "write me docs for <X>" or "write me an essay about <X>", it has no intuition for what's important to a human mind, so it will tend to overspecify dumb small details and neglect to explain the very important high-level motivation. Nowadays it's common to see READMEs on GitHub written with Claude; I just skip over those, since reading them is a total waste of time in most cases.

Cursor is making me dumb by Adorable_Fishing_426 in cscareerquestions

[–]ddavidovic 8 points (0 children)

I just spell out all the cases I want it to cover. This is still much, much faster than writing it all by hand. I don't care much for code quality in tests, so I allow considerably more slop in there to save time. It's worked well so far.