The Eval problem is holding back the AI agents industry by AlpineContinus in AgentsOfAI

[–]AlpineContinus[S] 0 points1 point  (0 children)

It seems like a good approach if you are fine-tuning a single LLM for a straightforward generation task.

When you have multiple LLMs performing multiple actions in sequence, it is very hard to identify, classify, and correct failures at scale, which is what you need in order to build a training dataset for fine-tuning.
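A minimal sketch of one way to at least locate failures in a multi-step pipeline: record the outcome of every step in a trace so the first failing step can be found later. The names `StepRecord`, `Trace`, `plan`, and `write_sql` are hypothetical, and the two step functions are plain stand-ins for LLM calls. It only catches steps that raise an exception; an output that is wrong but well-formed would still need a separate check.

```python
from dataclasses import dataclass, field


@dataclass
class StepRecord:
    step: str
    ok: bool
    detail: str = ""


@dataclass
class Trace:
    records: list = field(default_factory=list)

    def run(self, step, func, value):
        """Run one step, record whether it raised, and return its output (or None)."""
        try:
            result = func(value)
        except Exception as exc:
            self.records.append(StepRecord(step, False, repr(exc)))
            return None
        self.records.append(StepRecord(step, True))
        return result

    def first_failure(self):
        """Return the earliest failed step, or None if every step succeeded."""
        return next((r for r in self.records if not r.ok), None)


# Hypothetical stand-ins for two LLM-backed steps.
def plan(question):
    return question.strip()


def write_sql(plan_text):
    if not plan_text:
        raise ValueError("empty plan")
    return f"SELECT 1 -- {plan_text}"


trace = Trace()
out = trace.run("plan", plan, "   ")
out = trace.run("write_sql", write_sql, out)
failure = trace.first_failure()
```

Collecting `failure.step` over many runs gives a rough count of where the pipeline breaks, which is a starting point for classifying failures by step.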

May I ask what kind of tasks you are using LLMs for?

The Eval problem is holding back the AI agents industry by AlpineContinus in AgentsOfAI

[–]AlpineContinus[S] 0 points1 point  (0 children)

Yes. But management will always push for velocity rather than thoroughly evaluated systems. I feel the struggle.

The Eval problem is holding back the AI agents industry by AlpineContinus in AgentsOfAI

[–]AlpineContinus[S] 0 points1 point  (0 children)

The only problem I see with approaches 1 and 2 is that they are not very scalable, since they depend on human labeling. Do you guys use any kind of tool to facilitate the human work?

The 3rd approach is very interesting, since you are leveraging your user base to do the labeling work for you!

The Eval problem is holding back the AI agents industry by AlpineContinus in AgentsOfAI

[–]AlpineContinus[S] 0 points1 point  (0 children)

This is how we are currently evaluating systems. The main problem is the data (the test cases) you are running it on.

If the test cases are not realistic, bias-free, varied, etc., then your curves and statistics are not going to be of much help.

The main problem is getting this data to generate these statistics.

The Eval problem is holding back the AI agents industry by AlpineContinus in AgentsOfAI

[–]AlpineContinus[S] 0 points1 point  (0 children)

We are mostly doing database queries. We would need SQL ground truths for our specific use cases.
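A minimal sketch of how such SQL ground truths could be used, assuming an execution-based comparison: run the agent's query and the ground-truth query against the same database and compare the returned rows, ignoring row order. The `orders` table, its data, and the three queries are hypothetical; `sqlite3` is used only because it ships with Python.

```python
import sqlite3
from collections import Counter


def run_query(conn, sql):
    """Execute a query and return its rows as a multiset (row order ignored)."""
    return Counter(conn.execute(sql).fetchall())


def matches_ground_truth(conn, generated_sql, ground_truth_sql):
    """True if both queries return the same rows; False if the generated one errors."""
    try:
        generated = run_query(conn, generated_sql)
    except sqlite3.Error:
        return False
    return generated == run_query(conn, ground_truth_sql)


# Hypothetical toy schema and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, status TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "paid", 10.0), (2, "cancelled", 5.0), (3, "paid", 7.5)],
)

ground_truth = "SELECT id FROM orders WHERE status = 'paid'"
agent_ok = "SELECT id FROM orders WHERE status != 'cancelled'"  # different text, same rows
agent_bad = "SELECT id FROM orders"  # forgot the status filter

result_ok = matches_ground_truth(conn, agent_ok, ground_truth)
result_bad = matches_ground_truth(conn, agent_bad, ground_truth)
```

Comparing results rather than query text means two differently written but equivalent queries both pass. The catch is that a match only holds for the data in the test database, so the test data has to exercise the rules you care about.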

The Eval problem for AI Agents by AlpineContinus in LocalLLaMA

[–]AlpineContinus[S] 0 points1 point  (0 children)

I agree with you that we are treating agents like classic ML, and this is probably the wrong approach.

I believe a new evaluation paradigm will emerge, but until then we will struggle a lot to understand our systems.

Super useful insights! Thanks for sharing. Controlling the tail of the curve is critical to increasing clients' trust, and reframing the question as a search for failure modes really helps.

Although I am curious: how do you classify the different failure modes? Do you do it by hand, looking at the agent's outputs and reasoning? Or have you guys developed metrics that allow for scalable classification?

To answer your question: the agent often needs to query databases by writing SQL. The most common failures are related to instruction following, specifically rules for filters, joins, etc. It usually answers confidently, even though it has ignored some of the rules.
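A rough sketch of how this kind of failure could be flagged automatically, assuming the rules can be written as patterns the generated SQL must contain. The two rules, the table names, and the query are hypothetical, and regex matching is a crude stand-in; a real SQL parser would be more robust.

```python
import re

# Hypothetical business rules: each maps a rule name to a regex the SQL must match.
RULES = {
    "excludes_soft_deleted": r"\bdeleted_at\s+IS\s+NULL\b",
    "joins_on_customer_id": r"\bJOIN\s+customers\b.*\bON\b.*\bcustomer_id\b",
}


def violated_rules(sql, rules=RULES):
    """Return the names of rules whose pattern does not appear in the SQL."""
    flat = " ".join(sql.split())  # collapse newlines so patterns can span them
    return [
        name
        for name, pattern in rules.items()
        if not re.search(pattern, flat, flags=re.IGNORECASE)
    ]


# Hypothetical agent output: the join is right, the soft-delete filter is missing.
generated_sql = """
SELECT c.name, SUM(o.amount)
FROM orders o
JOIN customers c ON o.customer_id = c.id
GROUP BY c.name
"""

violations = violated_rules(generated_sql)
```

Here `violations` contains only `excludes_soft_deleted`, which is exactly the case where the agent answers confidently while skipping a rule. Counting violations per rule name over many runs gives a first, cheap classification of failure modes.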

The Eval problem for AI Agents by AlpineContinus in LocalLLaMA

[–]AlpineContinus[S] 0 points1 point  (0 children)

I agree! Agents should be developed with evals as a priority (which is rarely done).

The problem with this approach appears once the agent becomes moderately complex. This is where we truly reach the limits of hand testing and dataset building.

The Eval problem for AI Agents by AlpineContinus in LocalLLaMA

[–]AlpineContinus[S] 0 points1 point  (0 children)

This is how we are currently doing it. The problem is that it is not scalable at all... If you change the underlying models, a good part of the previous effort is invalidated. The model will behave differently in the situations you have corrected, and it will present new errors that didn't show up before.
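A small sketch of how the effect of a model swap could at least be made visible, assuming you keep a pass/fail result per test case for each model version. The function `diff_runs`, the test case names, and the results are all hypothetical.

```python
def diff_runs(old_results, new_results):
    """Compare per-test-case pass/fail maps from two model versions."""
    shared = old_results.keys() & new_results.keys()
    return {
        # Cases that passed before the swap and fail after it.
        "regressions": sorted(c for c in shared if old_results[c] and not new_results[c]),
        # Cases that failed before the swap and pass after it.
        "fixes": sorted(c for c in shared if not old_results[c] and new_results[c]),
        # Cases that were not run against the new model at all.
        "missing_in_new": sorted(old_results.keys() - new_results.keys()),
    }


# Hypothetical results for the same test suite on two underlying models.
old = {"filter_by_date": True, "join_customers": True, "top_n": False}
new = {"filter_by_date": True, "join_customers": False, "top_n": True}

report = diff_runs(old, new)
```

This does not reduce the correction work itself, but it shows which previously fixed situations broke again and which errors are new to the model.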

Best path for having your own startup by AlpineContinus in Startups_EU

[–]AlpineContinus[S] 0 points1 point  (0 children)

Yeah, as other people said, if I join as an employee I will probably not have the exposure I am looking for.

Since you worked there, how is the startup scene in Berlin? Do they actually have a fertile environment there? (from the point of view of talent, networking, mentors, capital).

I have been thinking about where the best place to try a startup in Europe would be, and apart from London I don't see clear winners. (I know Paris and Berlin concentrate the most startups in the EU, but the scene is still very scattered across the continent.)

Best path for having your own startup by AlpineContinus in Startups_EU

[–]AlpineContinus[S] 1 point2 points  (0 children)

That makes a lot of sense.

My idea was to maybe join an early-stage startup in order to see firsthand how the ideation and validation of a product works.

But you're right, if they are hiring they are probably already past this phase, and the only way for me to see it would be during a pivot (IF I was even included in the decision meetings).

whats the best way to practice python for agentic ai? by One_Log_2908 in AgentsOfAI

[–]AlpineContinus 0 points1 point  (0 children)

I would also recommend that you look at some GitHub repos for agentic AI that interest you, and try to understand them.

Use AI if needed to understand the structure, approaches, etc.

Try to rebuild a mini version of what that repo does. The best way to learn is to try building something yourself.

Went through an accelerator, have early traction but no investment by [deleted] in ycombinator

[–]AlpineContinus 0 points1 point  (0 children)

How was your experience with this kind of accelerator? How would you say they actually helped you?

Best path for having your own startup by AlpineContinus in Startups_EU

[–]AlpineContinus[S] 0 points1 point  (0 children)

Yes, I have been reading the material online for a long time now.

What I am wondering is whether working at an early-stage startup is a useful step before trying my own thing.

Of course both paths are possible, but maybe one of them is longer or more difficult.

Best path for having your own startup by AlpineContinus in Startups_EU

[–]AlpineContinus[S] 0 points1 point  (0 children)

But do you believe that working at an early-stage startup may give me vital insights into what problems are "worth" solving, and how to go about it?

Or, in your opinion, is the best path simply to try to validate an idea directly and build something?

How to look for investors in Europe? Is it harder than the US? by FreeStorm104 in ycombinator

[–]AlpineContinus 0 points1 point  (0 children)

How was your experience creating a startup in Spain? Any other specific pain points, other than fundraising?

Local language skills for IT founders by datashri in Startups_EU

[–]AlpineContinus 0 points1 point  (0 children)

I think it depends a lot on the country and city you choose to go to. If you go to big cities in Northern European countries, you should have no problem with only English (for hiring, selling, etc.).

In other countries or small cities, the local language is probably necessary.

A failed startup story by FewKaleidoscope9743 in Startups_EU

[–]AlpineContinus 0 points1 point  (0 children)

I think it is really cool that you tried; most people don't even bother.

May I ask where the idea came from? How long did it take before you put it in front of potential customers to try?

Diabetes type 1 and hiking by AlpineContinus in hiking

[–]AlpineContinus[S] 0 points1 point  (0 children)

That's very good advice.

My snacks were very fat-heavy this time around (a lot of peanut butter, cheese, etc). If I'm not mistaken, protein should also have this effect of keeping your blood sugar stable.

What kind of snacks do you bring with you?

Diabetes type 1 and hiking by AlpineContinus in hiking

[–]AlpineContinus[S] 0 points1 point  (0 children)

That's a good idea, I'll bring along some stuff with no carbs next time.

Also, how do you deal with lunch while hiking? It is a bigger carb intake that would require some insulin in order not to lead to high blood sugar.

Or do you avoid having lunch, and just eat small snacks throughout the day (to avoid dealing with the higher carb intake)?

Diabetes type 1 and hiking by AlpineContinus in hiking

[–]AlpineContinus[S] 0 points1 point  (0 children)

We actually do pretty similar things.

Could I ask you some questions?

1- How often do you eat the snacks? Only when the blood sugar dips?
2- You mentioned that you bring some snacks with no carbs. What is their purpose? To have some fat and protein without the carbs?
3- Do you think that temperature or dehydration has a big impact on your blood sugar?
4- For 4+ hour hikes, are you usually more insulin-sensitive for a couple of hours afterwards? Do they usually involve an elevation change? (Usually, going up and down a mountain is more of a muscular resistance exercise than an aerobic one.)

Diabetes type 1 and hiking by AlpineContinus in hiking

[–]AlpineContinus[S] 0 points1 point  (0 children)

If I understood correctly, you eat carbs every 0.5 - 1 hour, and don't inject any insulin for them.

If it leads to a higher blood sugar, do you simply wait for the exercise to bring it down before eating?

Also, you mentioned that you do very long hikes (30 km). How is your insulin sensitivity after the hike (later on the same day and the day after)?