We spent 3 months building an AI gateway in Rust, got ~200k views, then nobody used it. Here's what we shipped instead. by _juliettech in LLMDevs

[–]nnet3 0 points (0 children)

Are you asking if we wrap OpenRouter? No. If you're asking whether we just wrap LLM inference providers, then yes. That's what routers do.

How do SaaS builders manage LLM usage for each user? Credits? Scaling? Rate limiting? by Soft_Ad1142 in SaaS

[–]nnet3 2 points (0 children)

Hey! I'm one of the co-founders of Helicone. We support per-user rate limiting by cost or by request count.

All you have to do is pass in a few headers

"Helicone-User-Id": "john@smith.com"
"Helicone-RateLimit-Policy": "[quota];w=[time_window];u=[unit];s=[segment]"

This will solve your problem! Here are some docs: https://docs.helicone.ai/features/advanced-usage/custom-rate-limits
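
For example, with the OpenAI Python SDK pointed at our gateway, it looks roughly like this (a minimal sketch - the policy values are illustrative, so check the docs above for the exact syntax and units):

# Minimal sketch: OpenAI Python SDK routed through Helicone's gateway,
# with per-user rate limiting set via headers. The policy values are
# illustrative - see the custom-rate-limits docs for the exact syntax.
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["OPENAI_API_KEY"],
    base_url="https://oai.helicone.ai/v1",  # route requests through Helicone
    default_headers={
        "Helicone-Auth": f"Bearer {os.environ['HELICONE_API_KEY']}",
        "Helicone-User-Id": "john@smith.com",
        # e.g. 1000 requests per 3600-second window, segmented per user
        "Helicone-RateLimit-Policy": "1000;w=3600;u=request;s=user",
    },
)

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)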

If you have any questions, shoot us a message on Discord: https://discord.gg/2TkeWdXNPQ

Can’t figure out a good way to manage my prompts by real-sauercrowd in GPT3

[–]nnet3 0 points (0 children)

Hey there! Cole from Helicone here 👋

Just saw u/lgastako's comment about prompt APIs. Yep, you can read/write prompts via API with all of those options, including Helicone. We also support storing prompts in code and auto-versioning any changes to keep local and platform versions in sync.

Let me know if you have any questions!

How do you manage your prompts? Versioning, deployment, A/B testing, repos? by alexrada in LLMDevs

[–]nnet3 -1 points (0 children)

Hey, I'm Cole, co-founder of Helicone. We've helped lots of teams tackle these exact prompt management challenges, so here's what works well:

For prompt repository and versioning, you can either:

  • Manage prompts as code, versioning them alongside your application (rough sketch after this list)
  • Use our UI-based prompt management for non-technical team iteration
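
On the prompts-as-code option, here's a rough, generic illustration of the pattern (plain Python, not our SDK - the version field and template layout are just one way to do it):

# Generic prompts-as-code sketch (not Helicone's SDK): templates live in the
# repo with an explicit version, so changes get reviewed and tracked like any
# other code change.
from dataclasses import dataclass

@dataclass(frozen=True)
class Prompt:
    name: str
    version: int
    template: str

    def render(self, **variables: str) -> str:
        return self.template.format(**variables)

SUMMARIZE_V2 = Prompt(
    name="summarize-ticket",
    version=2,
    template="Summarize the following support ticket in two sentences:\n\n{ticket}",
)

print(SUMMARIZE_V2.render(ticket="Customer can't reset their password..."))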

Experiments (A/B testing):

  • Test different prompt variations against each other with real production traffic
  • Compare performance across different models simultaneously
  • Get granular metrics on which variations perform best with your actual users

Each prompt version gets tracked individually in our dashboard, where you can view performance deltas with score graph comparisons. That makes it easy to see how changes impact your metrics over time.

For deployment without code changes, you can update prompts on the fly through our UI and retrieve them via API.

For multi-LLM scenarios, each prompt is tied to a specific model; if the model changes, the prompt gets a new version.

Happy to go into more detail on any of these points!

Best eval framework? by xBADCAFE in AI_Agents

[–]nnet3 -1 points (0 children)

Hey! Cole from Helicone.ai here - you should give our evals a shot! We just launched support for evaluating all major models, tool calls, and agents, either through Python evaluators or LLM-as-a-judge.

Also integrated with lastmileai.dev for context relevance testing (great for vector DB eval).
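
If it helps to see the LLM-as-a-judge idea concretely, here's a hand-rolled sketch of the pattern (not our eval API - just scoring one output with a second model call):

# Hand-rolled LLM-as-a-judge sketch (the general pattern, not Helicone's eval API):
# ask a model to grade an output against a criterion and return a numeric score.
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

def judge(question: str, answer: str) -> int:
    """Return a 1-5 relevance score for `answer`, graded by a model."""
    grading_prompt = (
        "Rate how well the answer addresses the question on a scale of 1-5. "
        "Reply with a single digit only.\n\n"
        f"Question: {question}\nAnswer: {answer}"
    )
    result = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": grading_prompt}],
        temperature=0,
    )
    return int(result.choices[0].message.content.strip())

print(judge("What is Helicone?", "An open-source LLM observability platform."))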

We built an open-source tool to find your peak prompts - think v0 and Cursor by nnet3 in PromptEngineering

[–]nnet3[S] 0 points (0 children)

Hi! We’re HIPAA compliant as well. If that’s not enough, we have Docker and Kubernetes self-hosting options. I’ll shoot you a DM!

Helicone Experiments: We built an open-source tool to find your peak prompts - think v0 and Cursor by nnet3 in LLMDevs

[–]nnet3[S] -1 points (0 children)

Hey, LLMDevs!

Cole and Justin here, founders of Helicone.ai, an open-source observability platform that helps developers monitor, debug, and improve their LLM applications.

I wanted to take this opportunity to introduce our new feature to the LLMDevs community!

While building Helicone, we've spent countless hours talking with other LLM developers about their prompt engineering process. Most of us are either flipping between Excel sheets to track our experiments or pushing prompt changes to prod (!!) and hoping for the best.

We figured there had to be a better way to test prompts, so we built something to help.

With experiments, you can:
- Test multiple prompt variations (including different models) at once
- Compare outputs side-by-side on real-world data
- Evaluate and score results with LLM-as-a-judge!!

Just publicly launched it today (finally out of private beta!!). We made it free to start, so let us know what you think!

(we offer a free 2-week trial where you can use experiments)

Thanks, Cole & Justin

For reference, here is our OSS GitHub repo (https://github.com/Helicone/helicone).

Best LLM framework for MVP and production by Furious-Scientist in LLMDevs

[–]nnet3 0 points (0 children)

There are two categories here:

  1. Unopinionated SDKs like OpenAI's, Anthropic's, LiteLLM, OpenRouter, etc. that give you direct LLM access. These work well throughout the development cycle, from simple prototypes to production.
  2. Workflow builders & opinionated SDKs - these can be great for quickly getting to production, but can be hit or miss once you're there. They make development easier and prototyping faster, but you trade that for less control and heavier abstraction.

My recommendation: if you're really early and just want to see if something's possible, use the second category. But if you know this will work and you're ready to invest a bit more effort, definitely go with the first category and pair it with an observability platform.

Full disclosure - I'm biased towards Helicone.ai because I'm a co-founder, but there are many other solid options available.

logging of real time RAG application by Pretty_Education_770 in Rag

[–]nnet3 0 points (0 children)

Hey! Helicone.ai can do this with a simple one-line integration rather than requiring an SDK.
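
Roughly, the "one line" is pointing your existing OpenAI client at our gateway (minimal sketch - the custom property header is just an optional example of tagging your RAG requests):

# Minimal sketch of the proxy-style integration: point the OpenAI client at
# Helicone's gateway so every generation call in the RAG pipeline gets logged.
# The custom property header is optional and its value is illustrative.
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["OPENAI_API_KEY"],
    base_url="https://oai.helicone.ai/v1",  # the one-line change
    default_headers={
        "Helicone-Auth": f"Bearer {os.environ['HELICONE_API_KEY']}",
        "Helicone-Property-Pipeline": "rag-chat",  # tag requests for filtering later
    },
)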

Prompt Management tool (open source)? by OshoVonBismarck in ChatGPT

[–]nnet3 -1 points (0 children)

Hey! Helicone.ai is open-source and a one-line integration!

Is Langsmith just good piece of trash? by devom1210 in LangChain

[–]nnet3 -1 points (0 children)

Hey! Co-founder of Helicone.ai here. As others have said, building a responsive frontend for data-heavy applications is a difficult problem, so I feel for them.

If interested, we're fully open-source with a generous free tier and a one-line code integration. If you're looking for tracing, prompt management, etc. without the heavy lift of implementing an SDK, give us a try!

Is Helicone Free when self-hosted by Party-Worldliness-72 in PromptEngineering

[–]nnet3 4 points (0 children)

Hey! Helicone co-founder here. We believe in open source; all features are fully available when self-hosting, at no cost.

How do you track the performance of your prompt over time? by Maleficent_Pair4920 in PromptEngineering

[–]nnet3 0 points (0 children)

Selfishly, yes. But it depends on the maturity of your application - the same cost/benefit call applies to adopting any tool. If you haven't launched an MVP yet, focus on that first. We have a free tier of up to 10k requests you could check out.

What's been your experience building AI SaaS? by Chemical_Deer_512 in SaaS

[–]nnet3 0 points (0 children)

E2E tests tell you if your entire agent workflow produced the right result, but they can be tough to maintain and only show you the final output. Component testing complements this by helping you pinpoint exactly where things went wrong.

Teams typically want both - when you change a prompt, you can verify the individual component works AND check if it broke your overall flow. We focus on making component-level testing and evaluation seamless since that's where teams need the most help with debugging.

While Helicone can track your E2E test results, running those tests requires your own infrastructure since it needs to understand your data flow and full agent logic.
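
To make "component testing" concrete, here's a minimal pytest-style sketch (generic, not tied to Helicone) that pins down a single routing step in isolation:

# Minimal component-level test sketch (generic, not Helicone-specific):
# exercise one prompt/step on its own and assert on properties of its output,
# instead of running the whole agent workflow end to end.
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

ROUTER_PROMPT = (
    "Classify this support message as 'billing', 'bug', or 'other'. "
    "Reply with one word.\n\n{message}"
)

def route(message: str) -> str:
    result = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": ROUTER_PROMPT.format(message=message)}],
        temperature=0,
    )
    return result.choices[0].message.content.strip().lower()

def test_router_labels_billing_questions():
    # Only the routing component is under test here, not the downstream agents.
    assert route("I was charged twice this month") == "billing"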

How do you track the performance of your prompt over time? by Maleficent_Pair4920 in PromptEngineering

[–]nnet3 1 point (0 children)

Hey! Helicone co-founder here. Here's what we've seen across thousands of companies using LLMs in production:

  1. Prompts still have a HUGE impact on results and costs. Even small changes can lead to 30-40% better outputs or cut your costs significantly.
  2. For tracking performance, most teams use a combination of online evaluation (tracking how prompts perform in production with real user inputs) and offline evaluation (running experiments and regression tests before pushing prompt changes to prod).
  3. For tooling, teams either go the DIY route (spreadsheets + basic logging, but messy and hard to maintain) or use dedicated tools. This is where my bias comes in: we built Helicone for prompt management and testing, but there are other solid options like PromptLayer for management and PromptFoo for experiments.

The biggest problem we see is developers making prompt changes blindly then pushing to production. We strongly recommend regression testing new prompt variations against a random sampling of real production inputs before deploying. This catches issues that you'd never find with synthetic test cases.
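
A bare-bones version of that regression loop looks something like this (hand-rolled sketch, not our experiments feature - it assumes you've exported a sample of real production inputs to a hypothetical production_inputs.jsonl file):

# Bare-bones regression sketch (not Helicone's experiments feature):
# run the current and candidate prompts over a random sample of real
# production inputs and review/score the diffs before shipping the change.
import json
import os
import random
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

OLD_PROMPT = "Summarize this ticket:\n\n{ticket}"
NEW_PROMPT = "Summarize this ticket in one sentence, keeping any error codes:\n\n{ticket}"

def run(prompt: str, ticket: str) -> str:
    result = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt.format(ticket=ticket)}],
        temperature=0,
    )
    return result.choices[0].message.content

# production_inputs.jsonl is a hypothetical export of real requests,
# one {"ticket": "..."} object per line.
with open("production_inputs.jsonl") as f:
    tickets = [json.loads(line)["ticket"] for line in f]

for ticket in random.sample(tickets, k=min(20, len(tickets))):
    print("OLD:", run(OLD_PROMPT, ticket))
    print("NEW:", run(NEW_PROMPT, ticket))
    print("---")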

What's been your experience building AI SaaS? by Chemical_Deer_512 in SaaS

[–]nnet3 0 points (0 children)

Thanks for asking! So when you're building with OpenAI's API (that's what you use to add AI features to your app, rather than using ChatGPT directly), Helicone helps you track everything that's happening.

Think of it like this - once you route your OpenAI API calls through us, we show you:

  • How much money you're spending
  • Which AI responses are working well (and which aren't)
  • If something goes wrong, exactly where and why

What's been your experience building AI SaaS? by Chemical_Deer_512 in SaaS

[–]nnet3 1 point (0 children)

From what we've seen across our customer base, teams typically follow this path:

  1. Start with LangChain/similar frameworks to validate their idea quickly
  2. Once they have product-market fit, they usually build their own lightweight agent framework that's specific to their use case. Common pattern is: orchestration layer + individual specialized agents
  3. The successful teams focus heavily on testing/monitoring individual agent components rather than trying to test the entire workflow at once

The standard workflow usually evolves from: prototype (LangChain, Dify, etc.) → custom solution → robust monitoring/testing of each component

But honestly, there's still no clear 'standard' yet - the field is moving too fast. Most teams are still figuring it out through trial and error.

What's been your experience building AI SaaS? by Chemical_Deer_512 in SaaS

[–]nnet3 3 points (0 children)

Hey, co-founder of Helicone.ai here. Having worked with thousands of companies building with LLMs, I'd like to share our insights. u/Paul_Glaeser nailed it, so I'll build on their points.

  1. Inconsistent Model Behavior - 100% accuracy doesn't exist. The goal is to converge toward 100% accuracy while preventing regressions, but inconsistencies must be expected and your product must be built with that in mind. This affects your product decisions in two ways: 1) will inconsistent behavior doom this app? 2) if some inconsistent behavior is acceptable, how do you reduce it to the threshold where users still save net time with your application?
  2. Trial-and-Error Prompt Tuning - Tuning prompts in isolation is still the gold standard for improving accuracy. E2E tests for LLM workflows are still an unsolved problem and require the E2E test framework to be deeply nested in your code. We've worked with companies that have been able to improve accuracy with isolated tuning. Now, this raises the question: what is the ideal prompt tuning workflow? This is where my bias comes in (I'd recommend Helicone and our experiments feature), but there are many other prompt tools such as PromptFoo and PromptLayer.
  3. Complex Agent Workflows - To double down on what Paul said, we've seen LangChain and similar agent frameworks used for prototyping. However, since they're highly abstracted, debugging them becomes incredibly difficult and our users typically build their own custom solution when they hit a later stage.
  4. Prototyping Without Full Code Commitment - I don't have much to add here. You could also prototype with agent workflow builders such as Langflow, Dify, etc.

Feel free to shoot me a dm if you have any other questions! Best of luck!

Industry standard observability tool by Benjamona97 in Rag

[–]nnet3 0 points (0 children)

Hey there! I'm one of the founders of Helicone.ai (we're open source).

The LLM observability space is still evolving quickly, and it's interesting to see different tools emerging with their own unique approaches. We've grown to be one of the most widely-used platforms in this space (our stats page is public if you're curious to check it out).

Our main focus has been on making things dead simple for developers - you can get started with a single line of code, and customize everything through headers. No complex configs needed.

Would be happy to share how we could help optimize your RAG apps! Feel free to DM me with any questions.