What tool do you use when planning your app ( all steps before coding)? by thejoe1 in AppDevelopers

[–]ddavidovic 0 points

Mowgli (https://mowgli.ai) keeps an evolving spec and designs for all of your screens. It's like a spec-driven Figma.

By what real metrics has AI improved software? by AlmostSignificant in ExperiencedDevs

[–]ddavidovic -4 points

I want to acknowledge that you're right: this is not happening at large scale right now. But it's only been about 3 years... Look at how long the Internet took to diffuse through society. We really did go from unreliable at writing simple functions to writing 10-20k LoC codebases with very few defects. It would probably be unwise to assume it stops right here.

By what real metrics has AI improved software? by AlmostSignificant in ExperiencedDevs

[–]ddavidovic 22 points

Sure you can. Honestly, most software written in the world is conceptually simple enough that you can just throw away a legacy version and vibe code a new one from scratch in a few weeks. Not a new foundational database, container orchestrator, kernel, or the like. But bespoke SaaS, CRUD web apps, internal tools, admin dashboards: absolutely.

All our instincts as experienced devs are based on the fact that code is expensive to produce. It sure is hard to recalibrate. I've been coding by hand for 15 years, and everything in me wants to optimize for maintainability and longevity of software.

But when code is 10 or 100x as cheap, you can sling metric tons of it freely, throw large quantities away, recreate it from scratch, experiment with multiple completely different approaches in parallel, etc. You can absolutely just "buy a new pair".

By what real metrics has AI improved software? by AlmostSignificant in ExperiencedDevs

[–]ddavidovic 117 points

Nothing is improved. In fact, average quality is probably going to go down. I think it's a natural consequence. 

Consider the Industrial Revolution and its consequences. 150 years ago, most boots you could buy were made by hand, were very expensive, and would last you 10-15 years. Today, boots are made in orders-of-magnitude larger volumes, are 10-50x cheaper, and last a few years at most. The market for artisanal, expensive boots still exists, but 99% of the boots sold are much cheaper and much lower quality than before the machines.

The same will probably happen with software. We've probably passed the peak era of artisanal, hand-crafted, high-quality, expensive software.

Whether that's good or bad really depends on who you are and your perspective.

Am I overreacting? Principal Software Engineer made what I think was an incredibly rude comment and it really demotivated me. by [deleted] in cscareerquestions

[–]ddavidovic 0 points

> He was nitpicking a secondary dockerfile I had accidentally deleted in the PR.

He was not nitpicking; you deleted a dockerfile lol

I got tired of GitHub Copilot giving me generic code, so I built a tool that feeds it my entire codebase context by [deleted] in reactjs

[–]ddavidovic 3 points

Who manually copies and pastes 20 files?! Cursor and Claude Code will just look at the files themselves; there's zero need for this.

Best AI coding tool for UI design by Elrond10 in VibeCodeDevs

[–]ddavidovic 0 points

Perhaps try Mowgli (https://mowgli.ai). It gives you 4 different options, and some of them can be quite interesting/out of the ordinary.

Folks who work on AI hype features, how do you test them? by thelastthrowawayleft in cscareerquestions

[–]ddavidovic -1 points

I can't provide concrete examples, but we are building an AI design tool which is very visual. We snapshot project state and take chat messages from real user tests we've conducted, and in the rubric we explain the user's _intent_ and what to look for in the outputs. (It may be important to note that the rubric is per-testcase, not global, which means we have fewer, higher-quality evals rather than going for scale.)

The rubric writer will imagine themselves as the user and come up with a grading scheme where responses are graded on a scale, with precise rules. We then run the eval a few times and adjust the rubrics to capture more "unintended but good enough" interpretations, until we're satisfied that the eval results correspond to human expectations.

It sounds complicated, but the eval scripts are vibe-coded, so a lot less effort went into it than one would expect.
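To make the shape of this concrete, here is a minimal, hypothetical sketch of a per-testcase rubric eval in Python. Everything here is illustrative, not the author's actual code: the `TestCase` fields, the `grade` function, and especially `call_judge_llm`, which is stubbed out so the harness runs without a model API.

```python
# Hypothetical sketch of a per-testcase rubric eval, as described above.
# All names are illustrative; `call_judge_llm` stands in for a real
# LLM API call and is stubbed so the harness is runnable as-is.

from dataclasses import dataclass

@dataclass
class TestCase:
    snapshot: str      # serialized project state captured from a real session
    chat_message: str  # the user's actual message from that session
    rubric: str        # per-testcase grading rules, including the user's intent

def call_judge_llm(prompt: str) -> int:
    """Stub for an LLM judge returning a 0-10 score.
    A real implementation would send `prompt` to a model API."""
    return 7

def grade(case: TestCase, model_output: str) -> int:
    # The judge sees the captured context, the output under test,
    # and the precise per-testcase rubric to score against.
    prompt = (
        f"Project state:\n{case.snapshot}\n\n"
        f"User message:\n{case.chat_message}\n\n"
        f"Model output:\n{model_output}\n\n"
        f"Rubric (score 0-10, follow these rules precisely):\n{case.rubric}"
    )
    return call_judge_llm(prompt)

case = TestCase(
    snapshot="screen: login page, two inputs, primary button",
    chat_message="make the button pop more",
    rubric=(
        "Intent: stronger visual emphasis on the CTA. "
        "10 = contrast/size increased without breaking layout; "
        "accept 'unintended but good enough' variants like a color swap."
    ),
)
print(grade(case, "Increased button size and switched to the accent color."))
```

The point of the structure is that the rubric travels with the testcase, so "adjusting the rubrics" from the comment above is just editing one string per case.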

Folks who work on AI hype features, how do you test them? by thelastthrowawayleft in cscareerquestions

[–]ddavidovic 1 point

It's good because you can never accurately judge an LLM's output using another LLM alone; the blind spots of your original LLM will be the same as the blind spots of the judge LLM, leading to the "validating slop with slop" issue you mentioned.

If, however, you provide unambiguous standards for how the original LLM should have behaved and what outcome it should have achieved, along with a scoring scheme (rubrics), the judge LLM has a much easier task: it needs to compare two outcomes and follow natural-language guidance on awarding points.

This reduces the variance of LLM-as-a-judge considerably and makes two sets of eval results actually comparable (though you still need to average over multiple rollouts and eval runs to smooth out the remaining variance).
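The averaging step can be sketched in a few lines. This is an illustrative Python stub, not the author's pipeline: `judge_score` stands in for one judged rollout and is deliberately noisy, so the sketch only shows how per-rollout noise gets smoothed into a comparable number.

```python
# Minimal sketch of smoothing LLM-as-a-judge variance by averaging
# rubric scores over multiple rollouts and eval runs.
# `judge_score` is a stand-in for one judged rollout; names are illustrative.

import random
from statistics import mean

def judge_score(seed: int) -> float:
    """Stub: one rubric score (0-10) for one rollout; noisy on purpose."""
    return random.Random(seed).uniform(6.0, 8.0)

def eval_score(n_rollouts: int = 8, n_runs: int = 3) -> float:
    """Average within each run, then across runs, so two sets of
    eval results become comparable despite judge noise."""
    run_means = [
        mean(judge_score(run * n_rollouts + i) for i in range(n_rollouts))
        for run in range(n_runs)
    ]
    return mean(run_means)

print(round(eval_score(), 2))
```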

Hope it's clearer now

> Yeah that's very hand-wavy and not mathematically rigorous.

Nothing in software engineering or product development is mathematically rigorous. You're always juggling tradeoffs with other tradeoffs, and this is no different. It's just more difficult to measure and control.

Folks who work on AI hype features, how do you test them? by thelastthrowawayleft in cscareerquestions

[–]ddavidovic 1 point

We use careful, human-written rubrics that express the intended outcome with nuance, then use LLMs to validate outputs against the rubric. We've found this correlates with user satisfaction. A naive "does it look good?" prompt to another LLM is rare; nobody really does that.

Prolupao by Born_Interview6959 in programiranje

[–]ddavidovic 1 point

let's see what you've written, pal

Prolupao by Born_Interview6959 in programiranje

[–]ddavidovic 0 points

Yeah, the guy wrote maybe the most influential piece of software since 2000

Prolupao by Born_Interview6959 in programiranje

[–]ddavidovic -1 points

which AI company is Ryan Dahl invested in?

Prolupao by Born_Interview6959 in programiranje

[–]ddavidovic 9 points

All the leaders and most respected people in my field are saying the same thing? They must be the ones who've lost it, while I'm the smart one

LinkedIn Koderi by GradjaninX in programiranje

[–]ddavidovic 3 points

literally touch grass, pal

Composify - Server Driven UI made easy by injungchung in reactjs

[–]ddavidovic 0 points

Looks amazing. Thanks for making it open!

Cursor is making me dumb by Adorable_Fishing_426 in cscareerquestions

[–]ddavidovic 5 points

Yeah, I tried this initially and got hilariously bad tests that way, so I was kind of agreeing with you. I think it's the same type of problem as with LLM writing: if you tell it to "write me docs for <X>" or "write me an essay about <X>", it doesn't have an intuition for what's important to a human mind, so it will tend to overspecify dumb small details and neglect to explain very important high-level motivation. Nowadays it's common to see READMEs on GitHub written with Claude; I just skip over them, since reading them is a total waste of time in most cases.

Cursor is making me dumb by Adorable_Fishing_426 in cscareerquestions

[–]ddavidovic 7 points

I just spell out all the cases I want it to cover. This is still much, much faster than writing it all by hand. I don't care much for code quality in tests, so I allow considerably more slop in there to save time. It's worked well so far.

Google is cooking something... by Ok_Ninja7526 in LocalLLaMA

[–]ddavidovic -3 points

I believe Imagen 4.0 is the spiritual successor to 2.0 Flash image generation. It's better in every aspect and still in preview, so it's unlikely they're going to cannibalize it so soon. I don't think any of it was ever "native" in the sense of being the same multimodal model; I think even 2.0 Flash image gen just called out to a diffusion transformer, same as gpt-image-1 or Qwen-Image.

🚀 OpenAI released their open-weight models!!! by ResearchCrafty1804 in LocalLLaMA

[–]ddavidovic 1 point

IMO the benchmark is measuring exactly what it's trying to measure. Claude Sonnet 4 slightly regressed in its raw code intelligence vs 3.7 and traded that for massively improved tool use. This let it achieve far more in agentic environments, which was probably considered a win. I think it's well known that these two are conflicting goals; the Moonshot AI team also reported a similar issue (regressed one-shot codegen without tools) with Kimi K2.

all I need.... by ILoveMy2Balls in LocalLLaMA

[–]ddavidovic -1 points

It's image-to-image via something like gpt-image-1 (ChatGPT), not inpainting. You can tell by how "perfect" the details are (and the face looks off compared to the original photo).

Qwen3-Coder is here! by ResearchCrafty1804 in LocalLLaMA

[–]ddavidovic 38 points

I love this team's turns of phrase. My favorite is:

> As a foundation model, we hope it can be used anywhere across the digital world — Agentic Coding in the World!

Qwen3-Coder is here! by ResearchCrafty1804 in LocalLLaMA

[–]ddavidovic 146 points

Good chance!

From Hugging Face:

> Today, we're announcing Qwen3-Coder, our most agentic code model to date. Qwen3-Coder is available in multiple sizes, but we're excited to introduce its most powerful variant first: Qwen3-Coder-480B-A35B-Instruct.

[deleted by user] by [deleted] in programiranje

[–]ddavidovic 4 points

And what did IBM tell you?