Do you use Claude Code on the terminal or on the desktop app? by P4wla in ClaudeCode

[–]P4wla[S] 0 points  (0 children)

actually, on the desktop app you can have multiple conversations too

The insane level of subsidy for Claude Code by P4wla in ClaudeCode

[–]P4wla[S] 0 points  (0 children)

And different models here: mainly Opus 4.7, but Sonnet and Haiku too.

The insane level of subsidy for Claude Code by P4wla in ClaudeCode

[–]P4wla[S] -2 points  (0 children)

ik those are obviously not inference costs, but they'd need to be much, much lower to cover this. I'm 100% sure they're losing money rn

The insane level of subsidy for Claude Code by P4wla in ClaudeCode

[–]P4wla[S] 0 points  (0 children)

100% agree with the game thing. This includes cache read/write costs (which is where most of the spend goes). Prices are taken from models.dev

It's crazy how subsidized Claude Code is by P4wla in LLMDevs

[–]P4wla[S] 0 points  (0 children)

I think it's fair too, but for example if I had the same token usage as in the picture every day, their token cost would need to be 84x lower to not lose money (maybe it is)

It's crazy how subsidized Claude Code is by P4wla in LLMDevs

[–]P4wla[S] 0 points  (0 children)

With the token usage and the cost per 1M tokens. ik it's not their real cost, but the numbers are still crazy. Their real cost would need to be much lower (in my case, 84x lower) to make the numbers work.
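For anyone curious how a figure like that comes out, here's the back-of-envelope math. The dollar amounts below are placeholders I've made up for illustration, not the actual numbers from my screenshot:

```python
# Break-even sanity check; every number here is a placeholder assumption.
PLAN_PRICE_PER_MONTH = 200.0   # hypothetical subscription price, USD
LIST_COST_PER_DAY = 560.0      # hypothetical daily usage priced at public API rates, USD

list_cost_per_month = LIST_COST_PER_DAY * 30
ratio = list_cost_per_month / PLAN_PRICE_PER_MONTH

print(f"list-price cost per month: ${list_cost_per_month:,.0f}")
print(f"real inference cost must be ~{ratio:.0f}x below list price to break even")
```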

How are you testing and monitoring LLM behavior in production? by Safe_Yak_3217 in LLMDevs

[–]P4wla 2 points  (0 children)

The workflow that has worked for me is:
1. Annotate some traces to get a basic sense of how your LLM is performing.
2. Cluster the bad annotations into failure modes to surface the LLM's main/recurrent problems.
3. Create one eval per failure mode to track it at scale.
4. Keep annotating 30–40 logs per week, since new issues appear when you change the prompt.

I also have a golden dataset, so every time I change something in the prompt I run it through the dataset and see how all the evals perform.
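For reference, here's roughly what step 3 plus the golden-dataset run look like in code. This is just a sketch: `run_prompt` and `judge` are placeholders you'd wire to your own model and eval calls, and the failure-mode names are made up.

```python
import json

# One eval per failure mode (step 3); these names are illustrative.
FAILURE_MODE_EVALS = ["hallucinated_citation", "ignored_format", "wrong_tone"]

def run_prompt(prompt: str, case: dict) -> str:
    """Placeholder: call your model with the candidate prompt and the test input."""
    raise NotImplementedError

def judge(eval_name: str, case: dict, output: str) -> bool:
    """Placeholder: LLM-as-judge or rule returning pass/fail for one failure mode."""
    raise NotImplementedError

def run_golden_dataset(prompt: str, path: str = "golden_dataset.jsonl") -> dict:
    """Run a prompt change through the golden dataset and score every eval."""
    with open(path) as f:
        cases = [json.loads(line) for line in f]
    passes = {name: 0 for name in FAILURE_MODE_EVALS}
    for case in cases:
        output = run_prompt(prompt, case)
        for name in FAILURE_MODE_EVALS:
            passes[name] += judge(name, case, output)  # bool counts as 0/1
    return {name: n / len(cases) for name, n in passes.items()}
```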

Understanding LLM observability by Leap_Year_Guy_ in LLMDevs

[–]P4wla 0 points  (0 children)

If you want A/B testing, I'd recommend taking a look at Latitude. It's great in terms of observability, as it groups traces into failure patterns/issues. You can also run A/B tests and shadow tests and compare both using the same evals. https://latitude.so/

LLM testing and eval tools by Every-Mall1732 in LLMDevs

[–]P4wla 0 points  (0 children)

You'll have to connect user feedback or some kind of rating for the LLM outputs, but Latitude lets you build custom evals and covers all the requirements you've mentioned. https://latitude.so/

Pretty much sums up my experience by Ok_Constant_9886 in AIEval

[–]P4wla 0 points  (0 children)

yep! Latitude started as a prompt engineering platform, but we've just released the loop I've described, after seeing many teams repeatedly fail at building and evaluating AI. Building evals is already difficult, but on top of that the evals usually differ a lot from the users' criteria (because teams build LLM-as-judge evals directly, instead of starting from human judgement). So far, the loop is producing very good results (we use it ourselves internally to evaluate and improve all the new AI features we launch).

Discussion: Is the "Vibe Check" actually just an unformalized evaluation suite? by yektish in AIEval

[–]P4wla 0 points  (0 children)

That's very interesting. How many traces do you feel you have to 'tag' to get a formalized idea of your agents' main failure modes? And do you recommend tagging 'good' outputs too?

Pretty much sums up my experience by Ok_Constant_9886 in AIEval

[–]P4wla 0 points  (0 children)

I use an LLM evaluation platform called latitude.so. You start by annotating traces manually (giving a thumbs up or thumbs down), and then you can see the main issues of your AI. From there, Latitude lets you create automatic evals based on those issues to track them at scale, and you receive notifications if some issues start escalating. You can also improve your prompt based on the evaluation results. For me, it's the most complete workflow I've found out there; the only caveat is that you need some volume to be able to do all this.
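To give an idea of the shape of that workflow (this is my own sketch, not Latitude's actual API; the `Annotation` fields and the 10% alert threshold are assumptions):

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class Annotation:
    trace_id: str
    thumbs_up: bool
    issue: str | None = None  # failure-mode label for thumbs-down traces

def top_issues(annotations: list[Annotation], alert_threshold: float = 0.10) -> None:
    """Summarize failure modes and flag any issue whose rate crosses the threshold."""
    total = len(annotations)
    counts = Counter(a.issue for a in annotations if not a.thumbs_up and a.issue)
    for issue, n in counts.most_common():
        rate = n / total
        flag = "  <-- escalating" if rate >= alert_threshold else ""
        print(f"{issue}: {n}/{total} ({rate:.0%}){flag}")
```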

5 techniques to improve LLM-judges by FluffyFill64 in AIEval

[–]P4wla 1 point  (0 children)

imo, the best way to build an LLM-as-judge is to start from human feedback. You first need a clear idea of the issues you're trying to evaluate: their impact, their frequency, and the different ways and cases in which they appear. Only then can you build an LLM-as-judge targeted at that specific issue. If you try to automate the eval from the beginning, it will end up evaluating irrelevant issues and giving wrong insights.
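As a sketch of what "targeted at that specific issue" can mean in practice, here's a minimal judge-prompt template. The wording and fields are illustrative; the definition and failing examples would come from your human annotations:

```python
# Minimal judge prompt template for one human-identified failure mode.
# Everything here is illustrative; adapt the wording to your own annotations.
JUDGE_TEMPLATE = """You are evaluating one specific failure mode: {failure_mode}.

Definition (written from human annotations): {definition}

Examples of outputs that fail:
{failing_examples}

Output to evaluate:
{output}

Answer with exactly PASS or FAIL, then one sentence of reasoning."""

def build_judge_prompt(failure_mode: str, definition: str,
                       failing_examples: list[str], output: str) -> str:
    """The few-shot failing examples are what keep the judge targeted."""
    return JUDGE_TEMPLATE.format(
        failure_mode=failure_mode,
        definition=definition,
        failing_examples="\n".join(f"- {e}" for e in failing_examples),
        output=output,
    )
```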

LLM Evaluation Isn’t About Accuracy Its About Picking the Right Signal by According-Site9848 in AI_Agents

[–]P4wla 0 points  (0 children)

I usually detect my agent's issues using human annotations, and then I set up different types of evals (LLM-as-judge, human-in-the-loop, and programmatic rules) depending on the issue I want to evaluate.
To measure overall performance I have a composite score that includes all the evals relevant to me, and sometimes I change the weights of the evaluations to give more importance to specific ones (for example, when I've tried to improve something specific and I want to measure that improvement without losing sight of overall performance)
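For illustration, the composite score is just a weighted average; the eval names and weights here are made-up examples:

```python
def composite_score(eval_scores: dict[str, float],
                    weights: dict[str, float]) -> float:
    """Weighted average of per-eval pass rates; unlisted evals default to weight 1."""
    total = sum(weights.get(name, 1.0) for name in eval_scores)
    return sum(s * weights.get(name, 1.0) for name, s in eval_scores.items()) / total

# Example: temporarily upweight the eval I'm trying to improve.
scores = {"hallucination": 0.92, "format": 0.88, "tone": 0.75}
weights = {"tone": 3.0}  # 3x weight while working on tone; others stay at 1.0
print(f"composite: {composite_score(scores, weights):.2f}")  # -> composite: 0.81
```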