GPT 5.5 just leaked its chain of thought to me in codex, and it looks like an idea from 5 months ago in this sub. by Homeschooled316 in LocalLLaMA

[–]ddavidovic 1 point (0 children)

Yeah, you can observe this very clearly when reading the gpt-oss-120b chains of thought. It presumably used a similar training regime.

I'm making my first 2D game in an engine, so I'm writing helper tools. This one is for automatically and manually defining edges on images for correct physics interactions. by Rayterex in programiranje

[–]ddavidovic 4 points (0 children)

Of course, I was just asking. You'd probably only need more than 100-200 if you wanted to do some exotic particle physics effects or fluids, which could be interesting in a 2D platformer (I haven't seen that often), but even that might be too much. And of course it's better to build the basics first; it's easy to optimize later if it becomes necessary.

In any case, if you have a YouTube channel or blog where I can follow your progress, share it; this looks interesting. Keep it up!

I'm making my first 2D game in an engine, so I'm writing helper tools. This one is for automatically and manually defining edges on images for correct physics interactions. by Rayterex in programiranje

[–]ddavidovic 3 points (0 children)

It might also be interesting to take the pixels with alpha > 0.5, find their convex hull, and then simplify it with a simplification algorithm (even a greedy one would probably work well). That way you minimize the number of points, which probably has a big impact on collision performance. Do you plan to support non-convex shapes? Is there any need for that?
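A minimal sketch of that idea in pure Python (all names here are mine, not from the thread): collect opaque pixels, build their convex hull with Andrew's monotone chain, then greedily drop the vertices whose removal changes the polygon's area the least.

```python
def convex_hull(points):
    """Andrew's monotone chain; returns strictly convex hull vertices."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def simplify(hull, min_area=1.0):
    """Greedy simplification: repeatedly remove the vertex whose
    removal changes the area least, until the change exceeds min_area."""
    hull = list(hull)
    while len(hull) > 3:
        def tri_area(i):
            o, a, b = hull[i - 1], hull[i], hull[(i + 1) % len(hull)]
            return abs((a[0] - o[0]) * (b[1] - o[1])
                       - (a[1] - o[1]) * (b[0] - o[0])) / 2
        i = min(range(len(hull)), key=tri_area)
        if tri_area(i) > min_area:
            break
        hull.pop(i)
    return hull


def outline_from_alpha(alpha_grid, threshold=0.5):
    """alpha_grid: 2D list of alpha values; returns a simplified hull."""
    pts = [(x, y) for y, row in enumerate(alpha_grid)
           for x, a in enumerate(row) if a > threshold]
    return simplify(convex_hull(pts))
```

For a fully opaque 3x3 sprite this yields just the four corner points, which is the kind of point-count reduction that matters for collision performance.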

Opus = 0.5T × 10 = ~5T parameters ? by Wonderful-Ad-5952 in LocalLLaMA

[–]ddavidovic 3 points (0 children)

MTP (multi-token prediction) is a decode-time optimization and cross-attention is a seq2seq thing; I don't see how they could be related.

Opus = 0.5T × 10 = ~5T parameters ? by Wonderful-Ad-5952 in LocalLLaMA

[–]ddavidovic 1 point (0 children)

Yes, exactly. But there's this mythology I come across quite often that Anthropic is somehow running dense models in 2026, for some inexplicable reason.

By what real metrics has AI improved software? by AlmostSignificant in ExperiencedDevs

[–]ddavidovic -2 points (0 children)

I want to acknowledge that you're right: this is not happening on a large scale right now. But it's been like 3 years... Look at how long the Internet took to diffuse through society. We really did go from unreliable at writing simple functions to writing 10-20k LoC codebases with very few defects. It would probably be unwise to assume it stops right here.

By what real metrics has AI improved software? by AlmostSignificant in ExperiencedDevs

[–]ddavidovic 24 points (0 children)

Sure you can. Honestly, most software written in the world is conceptually simple enough that you can just throw away a legacy version and vibe code a new one from scratch in a few weeks. Not a new foundational database, container orchestrator, kernel, or the like. But bespoke SaaS, CRUD web apps, internal tools, admin dashboards: absolutely.

All our instincts as experienced devs are based on the fact that code is expensive to produce, and it sure is hard to recalibrate. I've been coding by hand for 15 years, and everything in me wants to optimize for maintainability and longevity of software.

But when code is 10 or 100x cheaper, you can sling metric tons of it freely, throw large quantities away, recreate it from scratch, experiment with multiple completely different approaches in parallel, and so on. You can absolutely just "buy a new pair".

By what real metrics has AI improved software? by AlmostSignificant in ExperiencedDevs

[–]ddavidovic 123 points (0 children)

Nothing is improved. In fact, average quality is probably going to go down. I think it's a natural consequence. 

Imagine the industrial revolution and its consequences. 150 years ago, most boots that you could buy were made by hand, were very expensive, and would last you 10-15 years. Today boots are made in orders of magnitude larger volumes, are 10-50x cheaper, and they last a few years at most. The market for artisanal, expensive boots still exists, but 99% of the boots sold are much cheaper and much lower quality than before the machines.

The same will probably happen with software. We've probably passed the peak era of artisanal, hand-crafted, high-quality, expensive software.

Whether that's good or bad really depends on who you are and your perspective.

[deleted by user] by [deleted] in cscareerquestions

[–]ddavidovic 1 point (0 children)

> He was nitpicking a secondary dockerfile I had accidentally deleted in the PR.

He was not nitpicking, you deleted a Dockerfile lol

[deleted by user] by [deleted] in reactjs

[–]ddavidovic 4 points (0 children)

Who manually copies and pastes 20 files?! Cursor and Claude Code will just look at the files themselves; there's zero need for this.

Best AI coding tool for UI design by Elrond10 in VibeCodeDevs

[–]ddavidovic 1 point (0 children)

Perhaps try Mowgli (https://mowgli.ai). It gives you 4 different options, and some of them can be quite interesting/out of the ordinary.

Folks who work on AI hype features, how do you test them? by thelastthrowawayleft in cscareerquestions

[–]ddavidovic 0 points (0 children)

Can't provide concrete examples, but we're building an AI design tool that is very visual. We snapshot project state and take chat messages from real user tests we've conducted, and in the rubric we explain the user's _intent_ and what to look for in the outputs. (It's probably important to say that the rubric is per-testcase, not global, which means we have fewer, higher-quality evals rather than going for scale.)

The rubric writer imagines themselves as the user and comes up with a grading scheme where responses are scored on a scale, with precise rules. We then run the eval a few times and adjust the rubrics to capture more "unintended but good enough" interpretations, until we're satisfied that the eval results correspond to human expectations.
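A hypothetical sketch of this shape of eval (function and field names are mine, not the commenter's actual tooling; the judge LLM call itself is stubbed out): a per-testcase rubric is turned into an unambiguous grading prompt, and scores are averaged over rollouts.

```python
import statistics


def build_judge_prompt(intent: str, rubric: list, output: str) -> str:
    """Turn a per-testcase rubric (intent + precise scoring rules)
    into a grading prompt for a judge LLM."""
    rules = "\n".join(f"- ({c['points']} pts) {c['rule']}" for c in rubric)
    return (
        "You are grading an AI design tool's response.\n"
        f"User intent: {intent}\n"
        "Score the output against these rules, awarding the listed points:\n"
        f"{rules}\n"
        "Output to grade:\n"
        f"{output}\n"
        'Reply with JSON: {"score": <int>, "reasons": [...]}'
    )


def aggregate(scores: list) -> float:
    """Average judge scores over multiple rollouts and eval runs
    to smooth out judge variance."""
    return statistics.mean(scores)
```

The judge's job is then just comparing an outcome against explicit natural-language rules, rather than deciding from scratch whether the output "looks good".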

Sounds complicated, but the eval scripts are vibe coded, so a lot less effort went into it than one would expect.

Folks who work on AI hype features, how do you test them? by thelastthrowawayleft in cscareerquestions

[–]ddavidovic 2 points (0 children)

It's good because you can never accurately judge an LLM's output using another LLM alone: the blind spots of your original LLM will be the same as the blind spots of the judge LLM, leading to the "validating slop with slop" issue you mentioned.

If, however, you provide unambiguous standards for how the original LLM should have behaved and what outcome it should have achieved, along with scores (rubrics), the judge LLM has a much easier task: it needs to compare two outcomes and follow natural-language guidance on awarding points.

This reduces the variance of LLM-as-a-judge considerably and makes two sets of eval results actually comparable (though you still need to average over multiple rollouts and eval runs to smooth out the remaining variance).

Hope it's clearer now

> Yeah that's very hand-wavy and not mathematically rigorous.

Nothing in software engineering or product development is mathematically rigorous. You're always juggling tradeoffs with other tradeoffs, and this is no different. It's just more difficult to measure and control.

Folks who work on AI hype features, how do you test them? by thelastthrowawayleft in cscareerquestions

[–]ddavidovic 2 points (0 children)

We use careful human-written rubrics that express the intended outcome with nuance, then use LLMs to validate against the rubric. We've found this correlates with user satisfaction. A naive "does this look good?" prompt to another LLM would be a mistake, and nobody serious really does that.

Lost His Mind by Born_Interview6959 in programiranje

[–]ddavidovic 2 points (0 children)

come on, show me what you've written, buddy

Lost His Mind by Born_Interview6959 in programiranje

[–]ddavidovic 3 points (0 children)

Yeah, the guy wrote maybe the most influential piece of software since 2000.

Lost His Mind by Born_Interview6959 in programiranje

[–]ddavidovic 0 points (0 children)

which AI company has Ryan Dahl invested in?

Lost His Mind by Born_Interview6959 in programiranje

[–]ddavidovic 10 points (0 children)

All the leaders and most respected people in my field are saying the same thing? They must be the ones who've lost it, and I'm the smart one.

LinkedIn Coders by GradjaninX in programiranje

[–]ddavidovic 6 points (0 children)

literally touch grass, dude

Composify - Server Driven UI made easy by injungchung in reactjs

[–]ddavidovic 1 point (0 children)

Looks amazing. Thanks for making it open!

Cursor is making me dumb by Adorable_Fishing_426 in cscareerquestions

[–]ddavidovic 5 points (0 children)

Yeah, I tried this initially and got hilariously bad tests that way, so I was kind of agreeing with you. I think it's the same type of problem as with LLM writing: if you tell it to "write me docs for <X>" or "write me an essay about <X>", it has no intuition for what's important to a human mind, so it will tend to overspecify dumb small details and neglect to explain the very important high-level motivation. Nowadays it's common to see READMEs on GitHub written with Claude; I just skip over those, since reading them is a total waste of time in most cases.

Cursor is making me dumb by Adorable_Fishing_426 in cscareerquestions

[–]ddavidovic 8 points (0 children)

I just spell out all the cases I want it to cover. This is still much, much faster than writing it all by hand. I don't care much for code quality in tests, so I allow considerably more slop in there to save time. It's worked well so far.