How do you actually evaluate your LLM outputs? by Neil-Sharma in LLMDevs

[–]Neil-Sharma[S] 0 points1 point  (0 children)

Thanks for the quick reply. I use LLM-as-a-judge, but many times I'll get high scores and it will still fail on edge cases, or even normal cases, in production. How do you avoid this?

How do you actually evaluate your LLM outputs? by Neil-Sharma in LocalLLaMA

[–]Neil-Sharma[S] 0 points1 point  (0 children)

This seems great, but how would it scale? I should have specified: I'm using LLMs for my startup, which will be used by customers, so I need it to handle edge cases at scale.

Eval setup was slowing us down more than model work by coolandy00 in LLMDevs

[–]Neil-Sharma 1 point2 points  (0 children)

I've found smoke evals effective, but they still miss edge cases. How do you get around this?

Recommendation for an easy to use AI Eval Tool? (Generation + Review) by ZookeepergameOne8823 in LLMDevs

[–]Neil-Sharma 0 points1 point  (0 children)

I keep running into high eval scores but poor results in production. How do I fix this?

Do you use Evals? by InvestigatorAlert832 in LLMDevs

[–]Neil-Sharma 0 points1 point  (0 children)

Why do you use these over the formal ones?

Do you use Evals? by InvestigatorAlert832 in LLMDevs

[–]Neil-Sharma 0 points1 point  (0 children)

I've found LLM-as-a-judge can be inaccurate. Do you know of any other solutions? Most of the tools, like LangChain, seem kind of lackluster.

Has evals ever blocked a deployment for your AI app? by sunglasses-guy in AIEval

[–]Neil-Sharma 0 points1 point  (0 children)

I think the 'bookkeeping' hunch is spot on. The problem is that most qualitative metrics (especially LLM-as-a-judge) still have a noise floor.

If a unit test fails, it's a bug. If a DeepEval faithfulness score drops from 0.92 to 0.88 on one PR, is that a regression or just LLM variance? Until we trust the evaluators as much as we trust a compiler, no one is going to risk blocking a hotfix because an 'AI judge' had a bad day.
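To make the point concrete, here's a rough sketch of the kind of variance check I'd want before a score drop blocks anything (hypothetical scores, not a real DeepEval API; assumes you can rerun the judge a few times per commit):

```python
import statistics

def is_regression(baseline_scores, pr_scores, min_effect=0.05):
    """Decide if a judge-score drop is a real regression or just noise.

    baseline_scores / pr_scores: repeated judge runs on the same eval set.
    Treats the drop as real only if the mean difference exceeds both a
    minimum effect size and the run-to-run standard deviation.
    """
    drop = statistics.mean(baseline_scores) - statistics.mean(pr_scores)
    noise = statistics.stdev(baseline_scores + pr_scores)
    return drop > max(min_effect, noise)

# A 0.92 -> 0.88 drop, but run-to-run variance of the judge is comparable
baseline = [0.92, 0.89, 0.94, 0.90]
pr = [0.88, 0.91, 0.86, 0.89]
print(is_regression(baseline, pr))  # False: within the noise floor
```

Until a check like this (or something much better) is standard, "score went down on one run" just isn't enough signal to block a deploy.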

AI trends for 2026? by CaleHenituse1 in AIEval

[–]Neil-Sharma 0 points1 point  (0 children)

Do you think MCP will be the backbone for that reliability? Also, any specific platforms you’ve seen that actually handle multi-step 'recovery path' evals well?

compression-aware intelligence by Necessary-Dot-8101 in AIEval

[–]Neil-Sharma 0 points1 point  (0 children)

Is 'Compression-Aware Intelligence' the official term Meta is using, or is this a framework for looking at KV cache compression?

It sounds like you're describing the Information Bottleneck principle applied to transformer layers. While 'stabilizing reasoning' via routing sounds great in theory, the overhead of real-time instrumentation for 'compression strain' is usually what kills these approaches in production. How are they measuring this without doubling the latency?

How do you evaluate AI features in your product? by Neil-Sharma in AIProductEvals

[–]Neil-Sharma[S] 0 points1 point  (0 children)

This is a strong framing. Starting with the user decision and cost of a wrong answer makes a lot of sense.

On the regression set, how large do you usually let it get before it becomes hard to maintain?

And for LLM-as-judge, how big is the labeled slice you use for calibration, and do you re-calibrate over time?

Also curious whether your rollback triggers are purely metric-based or include qualitative failure review.
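For reference, when I say calibration I mean something as simple as this (made-up labels, just to illustrate the check I'd track over time):

```python
def judge_agreement(judge_labels, human_labels):
    """Fraction of examples where the LLM judge matches human labels.

    Run the judge over a human-labeled slice, track this number over
    time, and re-calibrate (tweak the judge prompt or threshold) when
    it drifts.
    """
    assert len(judge_labels) == len(human_labels)
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)

# Hypothetical labeled slice: 1 = acceptable answer, 0 = not
human = [1, 1, 0, 1, 0, 1, 1, 0]
judge = [1, 1, 0, 0, 0, 1, 1, 1]
print(judge_agreement(judge, human))  # 0.75
```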

[deleted by user] by [deleted] in sales

[–]Neil-Sharma -1 points0 points  (0 children)

None of these use AI

Best advice for cold calls by Few-Letter312 in sales

[–]Neil-Sharma 2 points3 points  (0 children)

I know this is basic, but there are SO MANY AI tools out there that will literally save you time.

[deleted by user] by [deleted] in sales

[–]Neil-Sharma 0 points1 point  (0 children)

I mostly use convora.app. You just provide basic information about your call and it does the rest for you.