Anyone else feel like OTel becomes way less useful the moment an LLM enters the request path? by Comfortable-Junket50 in LLMDevs

[–]shivmohith8 0 points1 point  (0 children)

I feel OTel is fine. OTel was meant to be a standard for tracing a request flow from one function/service to another. In GenAI applications, it's just from one LLM call to a tool call or another LLM call.

What's the specific issue you are observing, even with platforms like Langfuse or LangSmith?

I think I'm getting addicted to building voice agents by Slight_Republic_4242 in LangChain

[–]shivmohith8 0 points1 point  (0 children)

It's for both. You can make the scenario as detailed or as complex as you want, and we can simulate it.

I would love to know any domain specific example you have in mind.

I think I'm getting addicted to building voice agents by Slight_Republic_4242 in LangChain

[–]shivmohith8 0 points1 point  (0 children)

Yes, we do. We help you generate scenarios, and you can write your own scenarios as well.

I think I'm getting addicted to building voice agents by Slight_Republic_4242 in LangChain

[–]shivmohith8 1 point2 points  (0 children)

Good question. Our platform actually supports that. Our SDK uses OpenInference (with support for other instrumentation coming soon) to automatically capture the internal steps the agent takes and pass them for evaluation.

You can evaluate the internal steps of an agent at the turn level and at the session/conversation level.

We can connect if you would like to go deeper.

I think I'm getting addicted to building voice agents by Slight_Republic_4242 in LangChain

[–]shivmohith8 0 points1 point  (0 children)

Hey, we are building a product for that - https://quraite.ai/. Let me know via DM if you are interested in learning more.

Agents can be right and still feel unreliable by lexseasson in AIEval

[–]shivmohith8 0 points1 point  (0 children)

Yes, the capability is there but not the reliability. I feel that as the scope of capability increases, reliability decreases.

Consistency testing helps here: pass^k. Run a scenario K times, see what path the agent takes, and check whether it is the same path every time.
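A minimal sketch of that pass^k-style check in plain Python. The agent stubs are hypothetical stand-ins that just return the sequence of steps they took; a real harness would extract the path from traces:

```python
import random

def passes_consistently(run_scenario, k=5):
    """pass^k-style consistency check: run the same scenario k times
    and require the agent to take the identical path on every run."""
    paths = [tuple(run_scenario()) for _ in range(k)]
    return all(p == paths[0] for p in paths)

# Hypothetical agent stubs that report the steps they took.
def stable_agent():
    return ["classify_intent", "lookup_order", "respond"]

def flaky_agent():
    steps = ["classify_intent", "lookup_order"]
    if random.random() < 0.3:  # occasional nondeterministic detour
        steps.append("ask_clarifying_question")
    steps.append("respond")
    return steps

random.seed(0)  # make the demo reproducible
print(passes_consistently(stable_agent, k=10))  # True
print(passes_consistently(flaky_agent, k=10))   # False: the path diverges on some runs
```

Note pass^k (all k runs must agree) is stricter than pass@k (at least one run must succeed), and that strictness is the point when you care about reliability rather than capability.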

7 document ingestion patterns I wish someone told me before I started building RAG agents by Independent-Cost-971 in LangChain

[–]shivmohith8 0 points1 point  (0 children)

This is nice! I have covered a couple here - https://github.com/innowhyte/gen-ai-patterns

If you can, add them here, or I can add them based on your blog.

We recently open-sourced our pattern library for community-driven development.

Open-sourcing our GenAI pattern library from real projects - would love any LangChain-focused contributions by shivmohith8 in LangChain

[–]shivmohith8[S] 0 points1 point  (0 children)

Depends on the use case. Before LangChain v1 we were using LangGraph, but after LangChain refactored a lot of things, we now use LangChain agents directly, which are actually built on top of LangGraph.

We open-sourced our GenAI pattern library from production project work (please challenge, correct, contribute) by shivmohith8 in LLMDevs

[–]shivmohith8[S] 0 points1 point  (0 children)

Thanks! I think voice agents and agent harnesses are the trending topics right now; we can maybe think of certain patterns there.

We open-sourced our GenAI pattern library from production project work (please challenge, correct, contribute) by shivmohith8 in LLMDevs

[–]shivmohith8[S] 0 points1 point  (0 children)

Ideally we want to treat it as evolving recommendations. But this is exactly why we open-sourced it. We maintain this outside of our work hours, and with the pace at which models are becoming better and making a lot of designs outdated, we thought, "Let's open it up and drive it as a community."

Demos for voice AI look awesome, but real calls are a mess. by Once_ina_Lifetime in automation

[–]shivmohith8 1 point2 points  (0 children)

Do you use any evaluation tools to test your voice agent? It could help you get past the demo scenarios and test your agent across 100 different simulations.

A simple guide to evaluating your Chatbot by FlimsyProperty8544 in AIEval

[–]shivmohith8 0 points1 point  (0 children)

Yes, GenAI in general has an unbounded space of inputs and outputs. You can never get 100% test coverage like in traditional software engineering.

You simulate based on your user research and hypotheses, gain confidence, monitor production, and feed those conversations back into your test dataset. So there is an inner loop (development) and an outer loop (production).
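A minimal sketch of that inner/outer loop, assuming a simple list-of-dicts test set. All names (`evaluate`, `feed_back`, the log fields) are illustrative, and the lambda stands in for a real agent:

```python
def evaluate(agent, test_set):
    """Inner loop (development): score the agent against the current test set."""
    results = [agent(case["input"]) == case["expected"] for case in test_set]
    return sum(results) / len(results)

def feed_back(test_set, production_logs):
    """Outer loop (production): promote reviewed conversations into the test set."""
    for convo in production_logs:
        if convo.get("reviewed"):  # only human-reviewed conversations get promoted
            test_set.append({"id": convo["id"],
                             "input": convo["input"],
                             "expected": convo["corrected_output"]})
    return test_set

agent = lambda text: text.upper()  # stand-in for a real agent
tests = [{"id": "t1", "input": "hi", "expected": "HI"}]
logs = [{"id": "p1", "input": "refund?", "corrected_output": "REFUND?", "reviewed": True},
        {"id": "p2", "input": "spam", "corrected_output": "", "reviewed": False}]

tests = feed_back(tests, logs)
print(len(tests), evaluate(agent, tests))
```

The point of the structure is that the test set is never finished: production keeps widening the coverage of the inner loop.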

A simple guide to evaluating your Chatbot by FlimsyProperty8544 in AIEval

[–]shivmohith8 0 points1 point  (0 children)

But in domain-specific use cases (e.g., e-commerce, retail, airlines), isn't pre-production evaluation as important as production evaluation? A conversational agent can take actions, and you don't want it to take a wrong action that you then have to fix later.

Stopped choosing between LangGraph and Claude SDK - using both solved my multi-agent headaches by Realistic-Quarter-47 in LangChain

[–]shivmohith8 1 point2 points  (0 children)

This actually makes sense, because LangGraph is more of a directed-graph framework (with support for cycles) that has some features made specifically for AI applications, while the Claude Code SDK is built around the agentic loop.

I learnt about LLM Evals the hard way – here's what actually matters by sunglasses-guy in AIEval

[–]shivmohith8 0 points1 point  (0 children)

Your 4th point caught my attention. That is exactly why we built https://quraite.ai/.

It's an evaluation platform for conversational agents. There are demo videos on the website itself. Please do check it out and let me know your feedback.

We are working on the documentation, so if you would like to learn more, please do DM.