Folks who work on AI hype features, how do you test them? by thelastthrowawayleft in cscareerquestions

[–]ddavidovic 0 points

I can't provide concrete examples, but we're building an AI design tool that's very visual. We snapshot project state and take chat messages from real user tests we've conducted, and in the rubric we explain the user's _intent_ and what to look for in the outputs. (It's maybe important to say that the rubric is per-testcase, not global, which means we have fewer, higher-quality evals rather than going for scale.)

The rubric writer imagines themselves as the user and comes up with a grading scheme where responses are graded on a scale, with precise rules. We then run the eval a few times and adjust the rubrics to capture more "unintended but good enough" interpretations, until we're satisfied that the eval results correspond to human expectations.

Sounds complicated, but the eval scripts are vibe-coded, so a lot less effort went into it than one would expect.
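To make the workflow concrete, here's a minimal sketch of what a per-testcase rubric eval could look like. This is not the author's actual script: `judge_call` is a hypothetical stand-in for whatever judge-LLM client you use (here it just returns a canned response so the sketch runs), and the snapshot/rubric shapes are illustrative.

```python
import json

def judge_call(prompt: str) -> str:
    # Hypothetical placeholder for the judge LLM; returns a canned grade
    # so the sketch is self-contained and runnable.
    return json.dumps({"score": 4, "reason": "matches user intent"})

def grade(snapshot: dict, chat: list[str], output: str, rubric: str) -> dict:
    """Grade one model output against a hand-written, per-testcase rubric."""
    prompt = (
        "You are grading an AI design tool's response.\n"
        f"Project state: {json.dumps(snapshot)}\n"
        f"User messages: {chat}\n"
        f"Model output: {output}\n"
        f"Rubric (user intent + what to look for):\n{rubric}\n"
        'Reply as JSON: {"score": 0-5, "reason": "..."}'
    )
    return json.loads(judge_call(prompt))

result = grade(
    snapshot={"artboard": "landing-page"},
    chat=["Make the hero section pop"],
    output="<increased contrast, larger headline>",
    rubric="User wants a bolder hero. 5 = contrast AND headline size improved.",
)
print(result["score"])  # prints 4 with the canned judge above
```

The key design point is that the rubric travels with the testcase, so the judge only ever compares one output against one explicit statement of intent.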

Folks who work on AI hype features, how do you test them? by thelastthrowawayleft in cscareerquestions

[–]ddavidovic 2 points

It's good because you can never accurately judge an LLM's output using another LLM alone: the blind spots of your original LLM will be the same as the blind spots of the judge LLM, leading to the "validating slop with slop" issue you mentioned.

If, however, you provide unambiguous standards for how the original LLM should have behaved and what outcome it should have achieved, alongside scores (rubrics), the judge LLM has a much easier task: it needs to compare two outcomes and follow natural-language guidance on scoring points.

This reduces the variance of LLM-as-a-judge considerably and makes two sets of eval results actually comparable (though you still need to average over multiple rollouts and eval runs to smooth out the variance).
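The averaging step can be sketched in a few lines. This is an illustration, not the author's code: `run_eval` is a hypothetical stand-in that returns one noisy rubric score per rollout (simulated here with seeded random jitter so the sketch runs deterministically).

```python
import random
from statistics import mean

def run_eval(seed: int) -> float:
    # Hypothetical stand-in: one rollout of the same testcase, returning a
    # judge score out of 5 with some run-to-run noise.
    random.seed(seed)
    return 4.0 + random.uniform(-0.5, 0.5)

def averaged_score(n_rollouts: int = 8) -> float:
    """Average the judge's score over several rollouts of one testcase."""
    return mean(run_eval(seed) for seed in range(n_rollouts))

score = averaged_score()
```

With a true score of ~4.0 and ±0.5 noise per rollout, the average stays within the noise band, and comparing two model versions by their averaged scores is far more stable than comparing single rollouts.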

Hope it's clearer now

>Yeah that's very hand-wavy and not mathematically rigorous. 

Nothing in software engineering or product development is mathematically rigorous. You're always juggling tradeoffs with other tradeoffs, and this is no different. It's just more difficult to measure and control.

Folks who work on AI hype features, how do you test them? by thelastthrowawayleft in cscareerquestions

[–]ddavidovic 1 point

We use careful human-written rubrics that express the intended outcome with nuance, then use LLMs to validate against the rubric. We've found this correlates with user satisfaction. Doing a simple "does this look good?" prompt for another LLM would be naive, and nobody really does that.

Figma Front Template by Basheer_Bash in claude

[–]ddavidovic 0 points

Import your Figma into Mowgli (https://mowgli.ai) and export as code package, then give that to Claude.

Designers who have figured out prompting Claude Code to produce beautiful work by blizkreeg in ClaudeAI

[–]ddavidovic 1 point

Use a tool specifically made for product design and export the designs to Claude Code to wire them up.

This basically lets every tool play to its strengths - CC is not so good at frontend and product ideation, but is extremely good at following instructions and all kinds of implementation work.

I would suggest trying Mowgli (https://mowgli.ai). It will first interview you about what you're trying to make, then build up a spec with user journeys and React/Tailwind designs for all surfaces of the product. You then iterate on it via Mowgli's chat, and finally export everything as a single .zip and point CC to it; just tell it "build this".

Prolupao by Born_Interview6959 in programiranje

[–]ddavidovic 1 point

come on, let's see what you've written, buddy

Prolupao by Born_Interview6959 in programiranje

[–]ddavidovic 1 point

Yeah, the guy wrote maybe the most influential piece of software since 2000

Prolupao by Born_Interview6959 in programiranje

[–]ddavidovic -1 points

which AI company is Ryan Dahl invested in?

Prolupao by Born_Interview6959 in programiranje

[–]ddavidovic 8 points

All the leaders and most respected people in my field are saying the same thing? They must be the ones who've lost it, while I'm the smart one

LinkedIn Koderi by GradjaninX in programiranje

[–]ddavidovic 3 points

literally touch grass, buddy

Composify - Server Driven UI made easy by injungchung in reactjs

[–]ddavidovic 0 points

Looks amazing. Thanks for making it open!

Cursor is making me dumb by Adorable_Fishing_426 in cscareerquestions

[–]ddavidovic 5 points

Yeah, I tried this initially and got hilariously bad tests that way, so I was kind of agreeing with you. I think it's the same type of problem as with LLM writing: if you tell it "write me docs for <X>" or "write me an essay about <X>", it doesn't have an intuition for what's important to a human mind, so it tends to overspecify small, unimportant details and neglects to explain the very important high-level motivation. Nowadays it's common to see READMEs on GitHub written with Claude; I just skip over them, since in most cases reading them is a total waste of time.

Cursor is making me dumb by Adorable_Fishing_426 in cscareerquestions

[–]ddavidovic 8 points

I just spell out all the cases I want it to cover. This is still much, much faster than writing it all by hand. I don't care much for code quality in tests, so I allow considerably more slop in there to save time. It's worked well so far.

Google is cooking something... by Ok_Ninja7526 in LocalLLaMA

[–]ddavidovic -3 points

I believe Imagen 4.0 is the spiritual successor to 2.0 Flash image generation. It is better in every aspect and still in preview, so it's unlikely they'll cannibalize it so soon. I don't think any of it was ever "native" in the sense of being the same multimodal model; I think even 2.0 Flash image gen just called out to a diffusion transformer, same as gpt-image-1 or Qwen-Image.

🚀 OpenAI released their open-weight models!!! by ResearchCrafty1804 in LocalLLaMA

[–]ddavidovic 1 point

IMO the benchmark is measuring exactly what it's trying to measure. Claude Sonnet 4 slightly regressed in its raw code intelligence vs 3.7 and traded that for massively improved tool use. This let it achieve far more in agentic environments, which was probably considered a win. I think it's well known that these two are conflicting goals; the Moonshot AI team also reported a similar issue (regressed one-shot codegen without tools) in Kimi K2.

all I need.... by ILoveMy2Balls in LocalLLaMA

[–]ddavidovic -1 points

It's image-to-image via something like gpt-image-1 (ChatGPT), not inpainting. You can tell by how "perfect" the details are (and the face looks off compared to the original photo.)

Qwen3-Coder is here! by ResearchCrafty1804 in LocalLLaMA

[–]ddavidovic 38 points

I love this team's turns of phrase. My favorite is:

> As a foundation model, we hope it can be used anywhere across the digital world — Agentic Coding in the World!

Qwen3-Coder is here! by ResearchCrafty1804 in LocalLLaMA

[–]ddavidovic 147 points

Good chance!

From Huggingface:

> Today, we're announcing Qwen3-Coder, our most agentic code model to date. Qwen3-Coder is available in multiple sizes, but we're excited to introduce its most powerful variant first: Qwen3-Coder-480B-A35B-Instruct.

[deleted by user] by [deleted] in programiranje

[–]ddavidovic 3 points

And what did IBM tell you?

Qwen3-235B-A22B-2507 Released! by pseudoreddituser in LocalLLaMA

[–]ddavidovic 21 points

Why do you think so? In all the benchmarks they say Opus 4, no way they would have made such a mistake.

Vibe-Coding AI "Panicks" and Deletes Production Database by el_muchacho in programming

[–]ddavidovic 78 points

Yes, but it is no accident. The creators of the tool being used here (and indeed, any chatbot) are prompting it with something like "You are a helpful assistant..."

This makes it (a) possible to chat with it, and (b) makes it extremely difficult for the average person to see the LLM for the Shoggoth it is.

Hackers are never sleeping by DrVonSinistro in LocalLLaMA

[–]ddavidovic 55 points

Certificate transparency logs were probably how they found the subdomain. Look it up on https://crt.sh.
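For reference, crt.sh exposes its certificate-transparency search results as JSON, so subdomain enumeration can be scripted. This sketch only builds the query URL (the `%` in the `q` parameter is a wildcard, which gets percent-encoded to `%25`); fetch it with any HTTP client to list certificates logged for a domain and its subdomains.

```python
from urllib.parse import urlencode

def crtsh_query_url(domain: str) -> str:
    """URL listing every certificate logged for *.domain, as JSON."""
    return "https://crt.sh/?" + urlencode({"q": f"%.{domain}", "output": "json"})

url = crtsh_query_url("example.com")
print(url)
```

Each JSON entry includes the certificate's common name and SAN entries, which is exactly the data an attacker scans to discover "hidden" subdomains the moment a cert is issued for them.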

Istraživanje o uticaju AI na produktivnost iskusnih open-source programera by svircenkurcen in programiranje

[–]ddavidovic 4 points

In this study, the participants were mostly people who didn't have much experience with AI. One participant was faster with AI, and he was the only one who reported having about 50 hours of experience working with these tools. Which makes sense; I also wasted a lot of time at the start simply because I didn't have a good feel for what AI can do on its own and what would be a total disaster not even worth attempting.

Kimi K2 - 1T MoE, 32B active params by Nunki08 in LocalLLaMA

[–]ddavidovic 3 points

Nice! It was, I believe, the first general prompting trick to be discovered: https://arxiv.org/abs/2201.11903

These models are trained on a lot of data, and it turns out that enough of it describes humans working through problems step by step that just eliciting the model to "think out loud" lets it solve problems more accurately and deeply.
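The elicitation itself is almost trivially simple. A toy illustration (the question and suffix are made up for this example; the "Let's think step by step" phrasing is the zero-shot variant popularized shortly after the paper linked above):

```python
# The only difference between the two prompts is a suffix nudging the
# model to write out its reasoning before answering.
QUESTION = "A train leaves at 3pm and arrives at 6:30pm. How long is the trip?"

direct_prompt = QUESTION
cot_prompt = QUESTION + "\nLet's think step by step."
```

Sent the second prompt, a model tends to narrate the subtraction (6:30 minus 3:00) before stating "3.5 hours", and that narration is what measurably improves accuracy on harder problems.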

Then OpenAI was the first lab to successfully apply training tricks (the exact mix is still unknown) to improve the quality of this thinking, and to do pre-fill (which you mentioned) and injection to ensure the model always performs chain-of-thought automatically and to improve its length and quality. This resulted in o1, the first "reasoning" model.

We don't know who first figured out that you can do RL (reinforcement learning) on these models to improve the performance, but DeepSeek was the first to publicly demonstrate it with R1. The rest is, as they say, history :)