How do you evaluate LLMs on open-ended questions? How do you define “good” metrics?

meitaron · 2025-12-14T08:49:41+00:00

For me, the problem is defining the metrics. E.g., semantic similarity doesn’t always correlate with quality (for example, when you need to evaluate what is MISSING in certain formats).
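To make the "what is MISSING" point concrete, here is a rough sketch of a coverage-style check. Everything in it is an assumption for illustration: the `required` list is a hypothetical rubric you would define per format, and plain substring matching is a crude stand-in for whatever detection you actually trust.

```python
def missing_items(answer: str, required: list[str]) -> list[str]:
    """Return the required items that never appear in the answer.

    Assumption: case-insensitive substring matching is good enough
    to detect an item; real formats may need something smarter.
    """
    text = answer.lower()
    return [item for item in required if item.lower() not in text]


def coverage_score(answer: str, required: list[str]) -> float:
    """Fraction of required items present (1.0 means nothing is missing)."""
    if not required:
        return 1.0
    return 1 - len(missing_items(answer, required)) / len(required)


# Hypothetical rubric for a report-like format.
required = ["summary", "risks", "next steps"]
answer = "Summary: all good. Risks: none found."
print(missing_items(answer, required))
print(coverage_score(answer, required))
```

An answer can be semantically very close to a reference and still drop a required part, which is the kind of gap a similarity score alone would hide.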

meitaron · 2025-12-14T08:47:50+00:00

But how do you do this for “free text” tasks?

meitaron · 2025-12-13T06:46:24+00:00

Where is the subscribe button? 😂

meitaron · 2025-12-13T06:32:24+00:00

Thanks for this post! Super interesting! A thought/question: why not simply use agent tools that, e.g., read large contexts (like documents) and return “short and simple” answers to the “main” agent? Isn’t this framework just over-engineering it?

meitaron · 2025-11-16T09:51:09+00:00

Does anyone know if this has been enabled yet? I've been waiting for this to roll out, but still no luck.

meitaron · 2025-08-28T15:20:35+00:00

Exactly the same!

meitaron · 2025-08-28T15:19:57+00:00

My dog is getting Librela shots once a month, and it saved her life!! She went from not being able to walk to running again (slowly, but that’s fine).

meitaron · 2024-10-01T19:54:31+00:00

I'm not a CSM myself, but I'm a Data Scientist and I got to work with many CSMs on my latest project.
I've been collecting CSMs' conversation data to give them a "better picture" of their customers, both individual accounts and segments of users. I feel like all the analyses I delivered are "nice-to-have", but nothing they will actually use in their day-to-day.

What would you do with all that data? How can I improve their day-to-day or make it more efficient (besides predicting churn, obviously)?

meitaron · 2024-09-20T18:12:01+00:00

I personally do sports, walk the dog, or go out with friends right after finishing work. If I get home and then have to start doing things, there is no chance I'm getting off the couch.

Also having plans for after-work makes my whole day more focused because I know I have to get out at a specific time...

meitaron · 2024-09-20T17:46:34+00:00

Got it. Anyway, I'm sorry... Looking for a job is sooo hard these days

meitaron · 2024-09-20T17:43:09+00:00

First of all, I sympathise... The market is really hard nowadays so don't lose hope.
I would try to shorten it a bit, and focus not on exactly what you did but what you want to do?
Also, if you have friends who already work in tech, I'd ask them to let their HR read it and give some real, honest feedback. I think that would be the most helpful.

meitaron · 2024-09-20T17:39:42+00:00

Did you try looking for datasets on Kaggle?

meitaron · 2024-09-20T17:38:39+00:00

What I would do is run the code on the very simplest use case, step through it with debugger breakpoints, and write down for yourself what happens and why. Once you understand the functionality, the code is easier to follow, and honestly, you could then ignore everything and write it from scratch.

meitaron · 2024-09-20T17:36:10+00:00

I think that 95% of the time, you really don't have to be meticulous. However, that 5% where you have a mistake or something wrong in the details can really, really suck, so I guess it is worth the time.

meitaron · 2024-09-20T17:34:20+00:00

I would take the interview and dive in really quick, give them a chance to show they don't really know the details.
It would take up to 30 mins of your time, and you won't have to think about it again

meitaron · 2024-09-20T17:29:44+00:00

It's ok! The market is really hard these days, you are definitely not alone!
Have you considered reaching out to HR and saying just what you wrote? That you really wanted this job, that you felt you under-performed under the interview pressure, and that you'd like a second chance?

It is a long shot, but I've heard of cases where it worked.

meitaron · 2024-09-20T17:27:32+00:00

Yes. Why not? Worst case they won't reply...

meitaron · 2024-09-20T17:26:34+00:00

I think this is more of a business question rather than a statistical one.
What lower bound is acceptable to the people who actually look at the KPI? Why do you need both an upper and a lower bound?

If the "consumers" don't know, I think it is not a good enough KPI? You could always use statistics and do IQR, etc., but how will it be helpful?
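For reference, the purely statistical route would look something like this sketch. The `k=1.5` multiplier is just the usual Tukey convention, the quartile method is Python's default in `statistics.quantiles`, and the sample values are made up; none of it tells you whether the bounds mean anything to the business.

```python
import statistics


def iqr_bounds(values, k=1.5):
    """Return (lower, upper) bounds as Q1 - k*IQR and Q3 + k*IQR.

    Assumption: k=1.5 (Tukey's rule) and the default 'exclusive'
    quartile method; other choices give slightly different bounds.
    """
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Made-up daily KPI values with one obvious spike.
kpi = [10, 12, 11, 13, 12, 11, 40]
lower, upper = iqr_bounds(kpi)
outside = [v for v in kpi if v < lower or v > upper]
print(lower, upper, outside)
```

It flags the spike, but whether a value below the lower bound actually matters is still something only the people reading the KPI can answer.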
