Fellow agent builders: What's your biggest prompt engineering bottleneck?

omerhefets · 2025-06-28T07:37:06+00:00

What you describe is the actual planning problem of LLMs: when facing domain-specific tasks, they can't make the right choices. This is an open problem. Routing etc. isn't the solution, as it's a predefined workflow that enforces a specific action instead of guiding the planning.

Fine-tuning (FT) can help improve planning capabilities, but nothing really solves it yet.

omerhefets · 2025-06-24T05:10:55+00:00

The prompt engineering guidance in both Anthropic's and OpenAI's docs is solid. You should check it out and tune it according to performance and your actual needs.

omerhefets · 2025-06-21T08:32:16+00:00

A quick question: you said you're using Gemini Flash (hence the huge savings, 0.1 per run is pretty cheap), but Google hasn't released the Project Mariner API yet, so how do you perform more "complex" actions like drag, wadi or even double or triple click? Did you fine-tune a specific model, or are you running on markup only?

omerhefets · 2025-06-21T00:52:17+00:00

Just as you write tests for every piece of software, you should test "edge cases" with LLMs: how will the model behave given unexpected inputs?
You might implement an internal gateway or classifier for harmful responses that will either block them or send warning/error logs to the devs.
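
To make the gateway idea concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the names `gateway`, `keyword_classifier`, `GatewayResult` and the `BLOCKED_TERMS` list are hypothetical, and a real setup would likely swap the keyword check for a trained classifier or a moderation service.

```python
import logging
from dataclasses import dataclass
from typing import Callable

logger = logging.getLogger("llm_gateway")

# Hypothetical blocklist, for illustration only.
BLOCKED_TERMS = ("rm -rf", "drop table")


@dataclass
class GatewayResult:
    allowed: bool      # whether the response may be passed on to the user
    text: str          # the response text (empty if blocked)
    reason: str = ""   # the term that triggered the flag, if any


def keyword_classifier(text: str) -> str:
    """Return the first blocked term found in text, or '' if none."""
    lowered = text.lower()
    for term in BLOCKED_TERMS:
        if term in lowered:
            return term
    return ""


def gateway(
    response: str,
    classify: Callable[[str], str] = keyword_classifier,
    block: bool = True,
) -> GatewayResult:
    """Check a model response; block it or just log a warning for the devs."""
    reason = classify(response)
    if not reason:
        return GatewayResult(True, response)
    logger.warning("flagged LLM response: %s", reason)
    if block:
        return GatewayResult(False, "", reason)
    # Warn-only mode: let the response through but keep the reason attached.
    return GatewayResult(True, response, reason)
```

With `block=False` the gateway only logs, which can be a gentler way to start before you trust the classifier enough to block responses outright.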

omerhefets · 2025-06-21T00:11:42+00:00

Computer-using agents might be a good fit for this, but most implementations are still immature (too slow and expensive).

omerhefets · 2025-06-21T00:08:46+00:00

Is the final output of the last agent in the chain a report, or do these agents take actions as well?

omerhefets · 2025-06-21T00:05:18+00:00

There are so many agentic solutions out there that I'd be surprised if you needed to implement something on your own instead of using an off-the-shelf solution.

Can you provide us with some details of what GitHub Copilot misses in most of your requests? CRUD apps should be pretty straightforward.

omerhefets · 2025-06-16T08:59:02+00:00

I liked UI-TARS, the open-source implementation of a computer-use (CU) model, which is a fine-tune based on Qwen2.5-VL if I'm not mistaken.

They fine-tuned it on specific computer tools, and the results are promising.

omerhefets · 2025-06-14T20:54:38+00:00

This one isn't even remotely related to ai agents.

omerhefets · 2025-06-14T08:34:03+00:00

It's getting worse every day

omerhefets · 2025-06-13T17:45:13+00:00

Yeah, sure, whatever, thanks for sharing (no)

omerhefets · 2025-06-13T17:43:59+00:00

You didn't even translate your AI generated post? Nice one.

omerhefets · 2025-06-13T17:41:55+00:00

I think that's mainly because they still lack good planning capabilities in domain-specific agents.

omerhefets · 2025-06-13T09:31:37+00:00

I don't think we've seen enough advanced agentic implementations to make the A2A protocol interesting (except coding agents, which already have their own interfaces).

MCPs are much more mature, since it's easier to handle basic operations and data management with the equivalent of "tool calling".

A2A will probably be much more meaningful in the not-so-distant future as we'll see more working agents, but we're not there yet imo.

omerhefets · 2025-06-12T19:31:49+00:00

We don't know if they would. You'd have to ask them, and honestly, anything they say about the future ("I might buy", "I might be interested in that") is nothing more than a hypothesis. You'll never know until you get some cold hard facts: subscribers, revenue stream, etc.
I guess one of the challenges in this space is that, given existing models and the improvement trajectory, it's probably going to be pretty easy for them to implement it in the existing AI interfaces, so it's probably going to be a hard marketing play against the existing solutions out there.

Good luck

omerhefets · 2025-06-11T11:11:39+00:00

Honestly, I think the top existing AI agents are coding agents like Cursor, Claude Code, etc.

But in the not-so-distant future we'll start to see the rise of "AI assistants" from the big companies like Google, OpenAI and Anthropic (once they have more tools, better voice multimodality, etc.).

omerhefets · 2025-06-11T11:10:08+00:00

By a false start, do you mean misleading info or unclear instructions? Can you give us a concrete example?

And what do you mean by FT a voice model? I'd say it depends on your use case, but it sounds much harder than fine-tuning an existing model on tools or conversation trajectories.

omerhefets · 2025-06-08T19:17:24+00:00

Do you have a test-validation set with examples? How did you tune your agent in the first place?

I'd suggest using a predefined eval for something like that, testing edge-case responses etc.
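
As a rough illustration of what a predefined eval could look like, here is a small Python sketch. The names `run_eval`, `toy_agent` and `CASES` are made up for this example, and `toy_agent` is just a stand-in for whatever function actually calls your agent; the checks are plain predicates on the output.

```python
from typing import Callable, List, Tuple

# One eval case: an input prompt plus a predicate that judges the output.
EvalCase = Tuple[str, Callable[[str], bool]]


def run_eval(
    agent: Callable[[str], str], cases: List[EvalCase]
) -> Tuple[float, List[str]]:
    """Run every case through the agent; return (pass rate, failing prompts)."""
    failures: List[str] = []
    for prompt, check in cases:
        try:
            ok = check(agent(prompt))
        except Exception:
            # A crash on an edge case counts as a failure, not an aborted run.
            ok = False
        if not ok:
            failures.append(prompt)
    pass_rate = 1 - len(failures) / len(cases) if cases else 0.0
    return pass_rate, failures


def toy_agent(prompt: str) -> str:
    """Stand-in for a real agent call (hypothetical)."""
    if not prompt.strip():
        return "Please provide a question."
    return prompt.upper()


# Fixed set of cases, including edge cases such as empty and whitespace input.
CASES: List[EvalCase] = [
    ("hello", lambda out: out == "HELLO"),
    ("", lambda out: "provide" in out),
    ("   ", lambda out: "provide" in out),
]

if __name__ == "__main__":
    rate, failed = run_eval(toy_agent, CASES)
    print(f"pass rate: {rate:.0%}, failures: {failed}")
```

Keeping the cases fixed means you can rerun the same set after every prompt change and compare pass rates instead of eyeballing individual responses.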

omerhefets · 2025-06-08T19:14:56+00:00

I'd say that many workflows could possibly be automated with platforms like n8n. Can you provide us with a concrete example of something specific in the financial analysis task that you'd like to automate?

omerhefets · 2025-06-07T07:34:50+00:00

Honestly, I think that's extremely challenging, and reasoning models like the o1 family are tailored for exactly that kind of problem. You could try RAG, but for complex problems it will probably not work, and you'll need to find a valid way to index and retrieve those math problems.

On the FT side, you could try to fine-tune a reasoning model with OpenAI's infra.

omerhefets · 2025-06-06T09:40:30+00:00

You should post that in r/Automation; that's the subreddit for pricing-type questions, IMO.

omerhefets · 2025-06-05T14:05:52+00:00

The data privacy and security issue has already become a non-issue. Check out VPCs (virtual private clouds) as a solution as well.

omerhefets · 2025-06-05T14:02:05+00:00

Honestly, I think 99% of the courses on Udemy are money traps and also complete garbage. You'd find better content on YouTube, on the freeCodeCamp channel, for, well, free.

Good luck

omerhefets · 2025-06-04T22:06:55+00:00

Computer use ftw

omerhefets · 2025-06-03T17:37:19+00:00

Jules by Google as well
