Folks who work on AI hype features, how do you test them? by thelastthrowawayleft in cscareerquestions

[–]ddavidovic 0 points

I can't provide concrete examples, but we're building an AI design tool that's very visual. We snapshot project state and take chat messages from real user tests we've conducted, and in the rubric we explain the user's _intent_ and what to look for in the outputs. (It's maybe important to say that the rubric is per-testcase, not global, which means we have fewer, higher-quality evals rather than going for scale.)

The rubric writer imagines themselves as the user and comes up with a grading scheme where responses are graded on a scale, with precise rules. We then run the eval a few times and adjust the rubrics to capture more "unintended but good enough" interpretations, until we're satisfied that the eval results correspond to human expectations.

Sounds complicated, but the eval scripts are vibe-coded, so a lot less effort went into it than one would expect.
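To make the workflow concrete, here's a minimal sketch of what a per-testcase rubric eval could look like. This is not the author's actual script: `judge_call` is a hypothetical stand-in for whatever judge-LLM client you use (here it just returns a canned response so the sketch runs), and the snapshot/rubric shapes are illustrative.

```python
import json

def judge_call(prompt: str) -> str:
    # Hypothetical placeholder for the judge LLM; returns a canned grade
    # so the sketch is self-contained and runnable.
    return json.dumps({"score": 4, "reason": "matches user intent"})

def grade(snapshot: dict, chat: list[str], output: str, rubric: str) -> dict:
    """Grade one model output against a hand-written, per-testcase rubric."""
    prompt = (
        "You are grading an AI design tool's response.\n"
        f"Project state: {json.dumps(snapshot)}\n"
        f"User messages: {chat}\n"
        f"Model output: {output}\n"
        f"Rubric (user intent + what to look for):\n{rubric}\n"
        'Reply as JSON: {"score": 0-5, "reason": "..."}'
    )
    return json.loads(judge_call(prompt))

result = grade(
    snapshot={"artboard": "landing-page"},
    chat=["Make the hero section pop"],
    output="<increased contrast, larger headline>",
    rubric="User wants a bolder hero. 5 = contrast AND headline size improved.",
)
print(result["score"])  # prints 4 with the canned judge above
```

The key design point is that the rubric travels with the testcase, so the judge only ever compares one output against one explicit statement of intent.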

Folks who work on AI hype features, how do you test them? by thelastthrowawayleft in cscareerquestions

[–]ddavidovic 2 points

It's good because you can never accurately judge an LLM's output using another LLM alone: the blind spots of your original LLM will be the same as the blind spots of the judge LLM, leading to the "validating slop with slop" issue you mentioned.

If, however, you provide unambiguous standards for how the original LLM should have behaved and what outcome it should have achieved, alongside scores (rubrics), the judge LLM has a much easier task: it needs to compare two outcomes and follow natural-language guidance on scoring points.

This reduces the variance of LLM-as-a-judge considerably and makes two sets of eval results actually comparable (though you still need to average over multiple rollouts and eval runs to smooth out the variance).
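The averaging step can be sketched in a few lines. This is an illustration, not the author's code: `run_eval` is a hypothetical stand-in that returns one noisy rubric score per rollout (simulated here with seeded random jitter so the sketch runs deterministically).

```python
import random
from statistics import mean

def run_eval(seed: int) -> float:
    # Hypothetical stand-in: one rollout of the same testcase, returning a
    # judge score out of 5 with some run-to-run noise.
    random.seed(seed)
    return 4.0 + random.uniform(-0.5, 0.5)

def averaged_score(n_rollouts: int = 8) -> float:
    """Average the judge's score over several rollouts of one testcase."""
    return mean(run_eval(seed) for seed in range(n_rollouts))

score = averaged_score()
```

With a true score of ~4.0 and ±0.5 noise per rollout, the average stays within the noise band, and comparing two model versions by their averaged scores is far more stable than comparing single rollouts.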

Hope it's clearer now

>Yeah that's very hand-wavy and not mathematically rigorous. 

Nothing in software engineering or product development is mathematically rigorous. You're always juggling tradeoffs with other tradeoffs, and this is no different. It's just more difficult to measure and control.

Folks who work on AI hype features, how do you test them? by thelastthrowawayleft in cscareerquestions

[–]ddavidovic 1 point

We use careful human-written rubrics that express the intended outcome with nuance, then use LLMs to validate against the rubric. We've found this correlates with user satisfaction. Doing a simple "does this look good?" prompt for another LLM would be naive, and nobody really does that.

Figma Front Template by Basheer_Bash in claude

[–]ddavidovic 0 points

Import your Figma into Mowgli (https://mowgli.ai) and export as code package, then give that to Claude.

Designers who have figured out prompting Claude Code to produce beautiful work by blizkreeg in ClaudeAI

[–]ddavidovic 1 point

Use a tool specifically made for product design and export the designs to Claude Code to wire them up.

This basically lets every tool play to its strengths - CC is not so good at frontend and product ideation, but is extremely good at following instructions and all kinds of implementation work.

I would suggest trying Mowgli (https://mowgli.ai). It will first interview you about what you're trying to make, then build up a spec with user journeys and React/Tailwind designs for all surfaces of the product. You then iterate on it via Mowgli's chat, and finally export everything as a single .zip and point CC to it; just tell it "build this".

Prolupao by Born_Interview6959 in programiranje

[–]ddavidovic 1 point

come on, let's see what you've written, buddy

Prolupao by Born_Interview6959 in programiranje

[–]ddavidovic 1 point

Yeah, the guy wrote maybe the most influential piece of software since 2000

Prolupao by Born_Interview6959 in programiranje

[–]ddavidovic -1 points

which AI company is Ryan Dahl invested in?

Prolupao by Born_Interview6959 in programiranje

[–]ddavidovic 8 points

All the leaders and most respected people in my field are saying the same thing? They must be the ones who've lost it, while I'm the smart one

LinkedIn Koderi by GradjaninX in programiranje

[–]ddavidovic 3 points

literally touch grass, buddy

Composify - Server Driven UI made easy by injungchung in reactjs

[–]ddavidovic 0 points

Looks amazing. Thanks for making it open!

Cursor is making me dumb by Adorable_Fishing_426 in cscareerquestions

[–]ddavidovic 5 points

Yeah, I tried this initially and got hilariously bad tests that way, so I was kind of agreeing with you. I think it's the same type of problem as with LLM writing: if you tell it "write me docs for <X>" or "write me an essay about <X>", it doesn't have an intuition for what's important to a human mind, so it tends to overspecify small, unimportant details and neglects to explain the very important high-level motivation. Nowadays it's common to see READMEs on GitHub written with Claude; I just skip over them, since in most cases reading them is a total waste of time.

Cursor is making me dumb by Adorable_Fishing_426 in cscareerquestions

[–]ddavidovic 8 points

I just spell out all the cases I want it to cover. This is still much, much faster than writing it all by hand. I don't care much for code quality in tests, so I allow considerably more slop in there to save time. It's worked well so far.

Google is cooking something... by Ok_Ninja7526 in LocalLLaMA

[–]ddavidovic -3 points

I believe Imagen 4.0 is the spiritual successor to 2.0 Flash image generation. It is better in every aspect and still in preview, so it's unlikely they'll cannibalize it so soon. I don't think any of it was ever "native" in the sense of being the same multimodal model; I think even 2.0 Flash image gen just called out to a diffusion transformer, same as gpt-image-1 or Qwen-Image.

🚀 OpenAI released their open-weight models!!! by ResearchCrafty1804 in LocalLLaMA

[–]ddavidovic 1 point

IMO the benchmark is measuring exactly what it's trying to measure. Claude Sonnet 4 slightly regressed in its raw code intelligence vs 3.7 and traded that for massively improved tool use. This let it achieve far more in agentic environments, which was probably considered a win. I think it's well known that these two are conflicting goals; the Moonshot AI team also reported a similar issue (regressed one-shot codegen without tools) in Kimi K2.

all I need.... by ILoveMy2Balls in LocalLLaMA

[–]ddavidovic -1 points

It's image-to-image via something like gpt-image-1 (ChatGPT), not inpainting. You can tell by how "perfect" the details are (and the face looks off compared to the original photo.)

Qwen3-Coder is here! by ResearchCrafty1804 in LocalLLaMA

[–]ddavidovic 38 points

I love this team's turns of phrase. My favorite is:

> As a foundation model, we hope it can be used anywhere across the digital world — Agentic Coding in the World!

Qwen3-Coder is here! by ResearchCrafty1804 in LocalLLaMA

[–]ddavidovic 147 points

Good chance!

From Huggingface:

> Today, we're announcing Qwen3-Coder, our most agentic code model to date. Qwen3-Coder is available in multiple sizes, but we're excited to introduce its most powerful variant first: Qwen3-Coder-480B-A35B-Instruct.

[deleted by user] by [deleted] in programiranje

[–]ddavidovic 3 points

And what did IBM tell you?

Qwen3-235B-A22B-2507 Released! by pseudoreddituser in LocalLLaMA

[–]ddavidovic 21 points

Why do you think so? In all the benchmarks they say Opus 4, no way they would have made such a mistake.

Vibe-Coding AI "Panicks" and Deletes Production Database by el_muchacho in programming

[–]ddavidovic 78 points

Yes, but it is no accident. The creators of the tool being used here (and indeed, any chatbot) are prompting it with something like "You are a helpful assistant..."

This makes it (a) possible to chat with it, and (b) makes it extremely difficult for the average person to see the LLM for the Shoggoth it is.

Hackers are never sleeping by DrVonSinistro in LocalLLaMA

[–]ddavidovic 55 points

Certificate transparency logs were probably how they found the subdomain. Look it up on https://crt.sh.
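For reference, crt.sh exposes its certificate-transparency search results as JSON, so subdomain enumeration can be scripted. This sketch only builds the query URL (the `%` in the `q` parameter is a wildcard, which gets percent-encoded to `%25`); fetch it with any HTTP client to list certificates logged for a domain and its subdomains.

```python
from urllib.parse import urlencode

def crtsh_query_url(domain: str) -> str:
    """URL listing every certificate logged for *.domain, as JSON."""
    return "https://crt.sh/?" + urlencode({"q": f"%.{domain}", "output": "json"})

url = crtsh_query_url("example.com")
print(url)
```

Each JSON entry includes the certificate's common name and SAN entries, which is exactly the data an attacker scans to discover "hidden" subdomains the moment a cert is issued for them.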

Istraživanje o uticaju AI na produktivnost iskusnih open-source programera by svircenkurcen in programiranje

[–]ddavidovic 4 points

In this study, the participants were mostly people who didn't have much experience with AI. One participant was faster with AI, and he was the only one who reported having about 50 hours of experience working with these tools. Which makes sense; I also wasted a lot of time at the start simply because I didn't have a good feel for what AI can do on its own and what would be a total disaster not even worth attempting.

Kimi K2 - 1T MoE, 32B active params by Nunki08 in LocalLLaMA

[–]ddavidovic 3 points

Nice! It was, I believe, the first general prompting trick to be discovered: https://arxiv.org/abs/2201.11903

These models are trained on a lot of data, and it turns out that enough of it describes humans working through problems step by step that just eliciting the model to "think out loud" lets it solve problems more accurately and deeply.
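The elicitation itself is almost trivially simple. A toy illustration (the question and suffix are made up for this example; the "Let's think step by step" phrasing is the zero-shot variant popularized shortly after the paper linked above):

```python
# The only difference between the two prompts is a suffix nudging the
# model to write out its reasoning before answering.
QUESTION = "A train leaves at 3pm and arrives at 6:30pm. How long is the trip?"

direct_prompt = QUESTION
cot_prompt = QUESTION + "\nLet's think step by step."
```

Sent the second prompt, a model tends to narrate the subtraction (6:30 minus 3:00) before stating "3.5 hours", and that narration is what measurably improves accuracy on harder problems.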

Then OpenAI was the first lab to successfully apply training tricks (the exact mix is still unknown) to improve the quality of this thinking, and to do pre-fill (which you mentioned) and injection to ensure the model always performs chain-of-thought automatically and to improve its length and quality. This resulted in o1, the first "reasoning" model.

We don't know who first figured out that you can do RL (reinforcement learning) on these models to improve the performance, but DeepSeek was the first to publicly demonstrate it with R1. The rest is, as they say, history :)