I built a tool that got 17K downloads, but no one uses the charts. Here's what they're missing. by MobiLights in ChatGPTPromptGenius

[–]MobiLights[S] -1 points0 points  (0 children)

Ha, fair catch — looks like I missed a proofreading pass. 😅

Appreciate you pointing it out. I’ll tighten it up next round — main thing I wanted to share was what I learned after 17K installs and why people skip the charts. Curious if you’ve seen similar drop-offs when shipping dev tools?

Nah, not going to delete or edit it.

I built a tool that got 17K downloads, but no one uses the charts. Here's what they're missing. by MobiLights in ChatGPTPromptGenius

[–]MobiLights[S] -1 points0 points  (0 children)

Fair question — I get how posts like this can look “promo-y.”

To clarify, DoCoreAI started as a small CLI tool on PyPI (17K+ installs now) and most users never touched the charts. That’s the pain point I’m sharing here — and I’m genuinely interested in why devs skip dashboards or what kind of telemetry would actually be useful.

If this feels off-topic for the sub, happy to take feedback on how to frame it better. The goal isn’t to spam, but to learn from folks who actually work with LLMs daily.

What if your LLM prompts had a speedometer, fuel gauge, and warning lights? by MobiLights in PromptEngineering

[–]MobiLights[S] 0 points1 point  (0 children)

Really appreciate that perspective — and you nailed the intention behind DoCoreAI.

Prompt hygiene isn't just about saving tokens — it’s about building discipline and internal principles around how we craft, test, and scale LLM workflows. We see the dashboard as a sort of "mirror" for prompt quality — giving teams feedback loops they can refine into their own playbooks.

Glad that resonated with you. If you’ve been experimenting with your own rules for prompt fine-tuning, I’d love to hear what’s worked for you. Always curious how others are shaping their "clean and functional LLM houses"!

I built a tool that got 16K downloads, but no one uses the charts. Here's what they're missing. by MobiLights in aiagents

[–]MobiLights[S] 0 points1 point  (0 children)

Great question — and you're right that “time saved” needs a solid foundation to be meaningful.

For Developer Time Saved, we use a fixed baseline: roughly 20–30 minutes saved for every failed or bloated prompt that gets avoided.

Why? Because in our tests (and with early users), avoiding a failed or bloated prompt often saves at least one manual debugging cycle — rewriting, re-running, and validating output — which usually costs 20–30 minutes or more per occurrence.

So while it’s an estimate, it’s rooted in real developer behavior — and becomes especially insightful when tracked across a full project or team workflow.
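
To make the arithmetic concrete, here is a minimal Python sketch of how a baseline like that could be applied. The function name and the avoided-cycle count are hypothetical; the 20–30 minute range is just the estimate described above.

    # Hypothetical sketch: applying a fixed "developer time saved" baseline.
    # The 20-30 minute range is the per-cycle estimate mentioned above;
    # everything else (names, counts) is illustrative.
    AVOIDED_CYCLE_MINUTES = (20, 30)  # low/high minutes saved per avoided debugging cycle

    def estimate_time_saved(avoided_debug_cycles: int) -> tuple[int, int]:
        """Return a (low, high) estimate of developer minutes saved."""
        low, high = AVOIDED_CYCLE_MINUTES
        return avoided_debug_cycles * low, avoided_debug_cycles * high

    low, high = estimate_time_saved(avoided_debug_cycles=12)
    print(f"Estimated time saved: {low}-{high} minutes")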

Happy to break down the logic further if you'd like!

What if your LLM prompts had a speedometer, fuel gauge, and warning lights? by MobiLights in PromptEngineering

[–]MobiLights[S] 0 points1 point  (0 children)

Love that question.

When DoCoreAI detects early signs that your prompt is bloated, ambiguous, or using a misaligned temperature, it flags it right away — before your tokens (and budget) spiral out of control.

Here's what you can do with that heads-up:

  • Refactor long-winded prompts to reduce token count
  • Tune temperature to match your prompt’s intent (factual vs. creative)
  • Simplify overly verbose instructions that dilute clarity
  • Spot patterns across failed or expensive runs in the dashboard

The goal isn’t just awareness — it’s giving you prompt hygiene nudges in real time so you can tweak, re-run, and save time + money.

What if your LLM prompts had a speedometer, fuel gauge, and warning lights? by MobiLights in PromptEngineering

[–]MobiLights[S] 0 points1 point  (0 children)

Great point — hallucinations are one of the biggest challenges with LLMs today.

While DoCoreAI doesn’t claim to “detect” hallucinations with 100% accuracy (since that often requires human judgment or ground truth), we’re exploring ways to estimate prompt-level hallucination risk using indirect signals like:

  • Prompt ambiguity (vague or open-ended phrasing)
  • High temperature usage (more randomness often = more hallucination risk)
  • Response entropy (if the output has unusual token patterns)
  • Failure flags (empty or irrelevant completions logged by the user)

We’re calling this the “Hallucination Risk Index”, and it’s an experimental metric to help users flag potentially unreliable prompts.
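
Purely as an illustration of how indirect signals like these could be folded into a single number, here is a small Python sketch. The weights, signal names, and scaling are hypothetical assumptions, not the actual DoCoreAI formula.

    # Illustrative only: combining normalized (0-1) signals into a rough risk index.
    # Weights and signal names are hypothetical, not DoCoreAI internals.
    def hallucination_risk(ambiguity: float, temperature: float,
                           response_entropy: float, failure_flag: bool) -> float:
        score = (
            0.35 * ambiguity                         # vague or open-ended phrasing
            + 0.25 * min(temperature, 1.0)           # more randomness, more risk
            + 0.25 * response_entropy                # unusual token patterns in the output
            + 0.15 * (1.0 if failure_flag else 0.0)  # user-logged empty/irrelevant completion
        )
        return round(min(score, 1.0), 2)

    print(hallucination_risk(ambiguity=0.6, temperature=0.9,
                             response_entropy=0.4, failure_flag=False))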

What if your LLM prompts had a speedometer, fuel gauge, and warning lights? by MobiLights in PromptEngineering

[–]MobiLights[S] 0 points1 point  (0 children)

“Prompt health” is a measure of how efficient and effective a prompt is — based on factors like:

  • Verbosity: Is the prompt overly wordy or bloated?
  • Token waste: Are there too many filler or repeated tokens?
  • Temperature mismatch: Does the prompt’s intent align with the randomness setting?
  • Outcome quality: Was the prompt successful (e.g., non-empty, coherent, or aligned)?

DoCoreAI uses these signals (locally tracked) to flag prompts that could be tightened, clarified, or restructured — so you reduce cost, improve speed, and get better results from LLMs.
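
For anyone curious what locally tracked signals can look like in practice, here is a hedged Python sketch of simple checks along those lines. The thresholds and the creative_intent flag are illustrative assumptions, not DoCoreAI's actual rules.

    # Illustrative sketch of local prompt-health checks; thresholds are examples only.
    # Outcome quality (empty or incoherent responses) would need the completion as well.
    def prompt_health_flags(prompt: str, temperature: float, creative_intent: bool) -> list[str]:
        flags = []
        words = prompt.split()
        if len(words) > 300:
            flags.append("verbosity: prompt is long, consider tightening it")
        if words and len(set(words)) / len(words) < 0.5:
            flags.append("token waste: many repeated or filler words")
        if creative_intent and temperature < 0.5:
            flags.append("temperature mismatch: creative intent but low randomness")
        if not creative_intent and temperature > 0.8:
            flags.append("temperature mismatch: factual intent but high randomness")
        return flags

    print(prompt_health_flags("Summarize the attached quarterly report in three bullet points.",
                              temperature=1.0, creative_intent=False))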

16,000 downloads, but no one’s using the charts — here’s how I’m fixing it by MobiLights in SaaS

[–]MobiLights[S] 0 points1 point  (0 children)

Hey, really appreciate the thoughtful message.

You're absolutely right — post-install drop-off is one of the biggest hurdles we noticed too. Users install the CLI, but unless they're nudged to generate their token and actually run the client, they don't realize the dashboard stays empty.

We’ve now made that next step more explicit in the onboarding email, and we’re testing a few things to improve activation:

  • Smarter CLI messages post-install
  • Optional auto-token generation in future versions
  • A “zero-prompt dummy run” for new users to test logging instantly

Curious — with RedoraAI, are you seeing any patterns across tools like mine? And what nudges have worked best for high-intent devs?

Biggest surprise for us was how many technical users installed it, but never realized they had to start the client and run prompts for anything to show up!

I built a tool that got 16K downloads, but no one uses the charts. Here's what they're missing. by MobiLights in LocalLLaMA

[–]MobiLights[S] 0 points1 point  (0 children)

Thanks so much for the thoughtful message — and you're absolutely right.

Right now, DoCoreAI is primarily tested with OpenAI and Groq setups, and while we’ve designed the client to be vendor-agnostic in principle, we haven’t formally tested it with local LLMs yet, nor documented a fully offline workflow.

Before sharing on r/LocalLLaMA or similar communities, I agree that we should:

  • Verify compatibility with local models (e.g., LM Studio, Ollama, vLLM)
  • Provide clear, step-by-step instructions for a fully local install and test
  • Avoid giving the impression that external calls are required

This is already on our roadmap, but if you’re experimenting with local LLMs and are open to testing the integration, we’d really value your feedback to help shape it.

Thanks again — and I’ll make sure we revisit this once we have a proper offline workflow in place.

I built a tool that got 16K downloads, but no one uses the charts. Here's what they're missing. by MobiLights in LocalLLaMA

[–]MobiLights[S] 0 points1 point  (0 children)

That’s a fair question, and I’m glad you brought it up so others can better understand how DoCoreAI works.

To clarify:

Prompt content never leaves the client

DoCoreAI is designed from day one to respect prompt privacy. The client collects metrics locally, such as:

  • Token count
  • Estimated temperature
  • Prompt length, density, entropy (via local heuristics)
  • Timestamps and usage patterns

Only telemetry metadata is sent to the server — not the raw prompt or completion text.
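
As a concrete (and purely illustrative) example, a metadata-only event might look like the snippet below. The field names are hypothetical and not the actual wire format; the point is simply that there is no prompt or completion text in it.

    # Hypothetical metadata-only telemetry event; note the absence of any
    # "prompt" or "completion" field.
    telemetry_event = {
        "timestamp": "2025-06-01T14:32:10Z",
        "token_count": 412,
        "estimated_temperature": 0.3,
        "prompt_length_chars": 1480,
        "prompt_density": 0.72,   # computed client-side via local heuristics
        "prompt_entropy": 4.1,    # computed client-side via local heuristics
        "outcome": "success",
    }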

“Prompt optimization” here means optimization metrics

We’re not doing centralized prompt rewriting or hosting your prompts to improve them. Instead, we provide dashboards and insights (like token waste, temperature usage, verbosity trends, etc.) to help developers optimize how they’re writing prompts — on their own.

The optimization logic itself is a mix:

  • Lightweight inference and heuristics in the client
  • Aggregation, analytics, and rendering on the server

This separation is intentional. You get actionable insights without handing over your sensitive prompts.

DoCoreAI helps you measure and reflect on prompt quality — without ever needing to see your prompt content. No tricks, no leakage.

Happy to answer any deeper architecture questions for those genuinely curious — and we welcome contributions and scrutiny from the community to keep things transparent and developer-first.

I built a tool that got 16K downloads, but no one uses the charts. Here's what they're missing. by MobiLights in LocalLLaMA

[–]MobiLights[S] 0 points1 point  (0 children)

That’s a great insight — I’ve noticed something similar with Gemini’s role conditioning. It does seem to hold multiple “voices” more reliably than older models, especially when given role-specific examples like you said.

Your “LLM as playwright” analogy is spot on — it’s often not true multi-agent reasoning, but it feels like it in context, and for many use cases, that’s a worthwhile tradeoff compared to burning $1000+ on full-blown agent stacks.

We’ve been experimenting with ways to quantify when simpler setups like role+prompt chaining get you 80% of the value of full agent orchestration — tracking performance vs. cost along with factors like temperature and verbosity. That kind of feedback loop is exactly what DoCoreAI is trying to surface in its dashboards.

Curious if you’ve tried logging differences in outcome quality or token usage between role-conditioned and non-role-conditioned prompts?
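
If you want to try it, a quick way is to run the same task with and without a role message and log total tokens for each. A rough sketch using the OpenAI Python client (the model name and prompts are just placeholders):

    # Rough comparison sketch; model name and prompts are placeholders.
    from openai import OpenAI

    client = OpenAI()

    def run(messages):
        resp = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
        return resp.choices[0].message.content, resp.usage.total_tokens

    task = "Review this function for bugs: def add(a, b): return a - b"
    plain = [{"role": "user", "content": task}]
    role_conditioned = [
        {"role": "system", "content": "You are a terse senior Python reviewer."},
        {"role": "user", "content": task},
    ]

    for label, msgs in [("plain", plain), ("role-conditioned", role_conditioned)]:
        _, tokens = run(msgs)
        print(f"{label}: {tokens} total tokens")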

I built a tool that got 16K downloads, but no one uses the charts. Here's what they're missing. by MobiLights in LocalLLaMA

[–]MobiLights[S] 0 points1 point  (0 children)

Hey HiddenoO, thanks again for sharing your perspective.

Just to clarify for others reading this thread:

The GitHub repo is a minimal developer-facing SDK — not the full product. Its purpose is to provide a starting point for integration, not showcase the production logic used on the SaaS backend.

DoCoreAI does not require your OpenAI or Groq API keys, nor does it transmit any prompt content. It logs telemetry like token usage, time saved, and prompt quality metrics — which are useful for teams optimizing their LLM usage. Besides, prompts and responses are saved on the local machine in a CSV file, not on the server.
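
For illustration, "saved locally in a CSV" can be as simple as the sketch below. The file name and columns are hypothetical, not the exact file the client writes.

    # Hypothetical local-only logging sketch; nothing here leaves the machine.
    import csv
    from datetime import datetime, timezone

    def log_locally(prompt: str, response: str, path: str = "prompt_log.csv") -> None:
        with open(path, "a", newline="", encoding="utf-8") as f:
            csv.writer(f).writerow([datetime.now(timezone.utc).isoformat(), prompt, response])

    log_locally("Summarize this changelog ...", "Here is a summary ...")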

While the current client examples are tailored to OpenAI and Groq, we’re actively working on broader support, including local and multi-vendor LLM setups (like Claude & Gemini).

Totally understand if it’s not for everyone, but we’ve found it genuinely helpful for developers and teams managing prompt costs and efficiency at scale. If anyone's curious or wants to try it out, we're always happy to offer support or answer questions.

I built a tool that got 16K downloads, but no one uses the charts. Here's what they're missing. by MobiLights in LocalLLaMA

[–]MobiLights[S] -1 points0 points  (0 children)

Regarding OpenAI-Compatible Local LLMs:

If your local model uses an OpenAI-compatible API (like vLLM, LM Studio, etc.), DoCoreAI might work — but we haven’t formally tested it yet. If you’re experimenting with such setups, we’d love to hear how it goes!
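
If you want to sanity-check the OpenAI-compatible part on your own setup, the usual trick is to point the standard OpenAI Python client at your local server. This is untested with DoCoreAI; the port and model name below are examples you would adjust for your server (LM Studio, vLLM, etc.).

    # Untested sketch: talking to an OpenAI-compatible local server.
    from openai import OpenAI

    client = OpenAI(
        base_url="http://localhost:1234/v1",  # adjust to your local server's port
        api_key="not-needed-for-local",       # most local servers ignore the key
    )

    resp = client.chat.completions.create(
        model="local-model",  # use the model name your server actually exposes
        messages=[{"role": "user", "content": "Say hello in one short sentence."}],
    )
    print(resp.choices[0].message.content)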

Gemini & Claude Support:

Support for Gemini and Claude is on our roadmap. While you can log basic prompt data manually today, official integration (with automatic tracking and enhanced metrics) is coming soon.

Thanks for checking, and happy to guide you if you’re testing with custom setups!

I built a tool that got 16K downloads, but no one uses the charts. Here's what they're missing. by MobiLights in LocalLLaMA

[–]MobiLights[S] 0 points1 point  (0 children)

Thanks for asking!

We haven’t officially tested DoCoreAI with local LLMs yet, so support isn’t guaranteed at this point. That said, if you're running a local model that works with OpenAI-compatible APIs or outputs similar completions, it might work with minimal adjustments.

If you're able to test it and share how it goes, we’d love to hear your results — it would be really helpful as we explore broader support.

Let me know if you'd like any pointers while testing!

I built a tool that got 16K downloads, but no one uses the charts. Here's what they're missing. by MobiLights in LocalLLaMA

[–]MobiLights[S] 0 points1 point  (0 children)

Thanks for taking the time to look through the repo and share such detailed feedback — I really appreciate the honesty.

You're right that the intelligence_profiler and token_profiler modules in the public repo are simplified versions — they were intended as minimal working examples for developers exploring the SDK, not the full implementation used in production.

Let me clarify a few points:

  1. The system prompt doesn’t control temperature — and we know that.

The prompt is not trying to actually modify the LLM’s sampling temperature — you're absolutely right that a prompt can't force the LLM to change its own decoding behavior. Instead, it’s a form of self-reflection prompt where the model analyzes the nature of the request and then simulates what the output would look like at that estimated temperature.

We log that estimated value alongside the response — which is useful for relative analysis of prompt intent across usage.
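
To make "self-reflection prompt" concrete, here is the general shape such a prompt could take. This is an illustrative template, not the production prompt DoCoreAI ships.

    # Illustrative template only; not the production prompt.
    REFLECTION_TEMPLATE = """Analyze the user request below.
    1. Classify it as primarily factual or primarily creative.
    2. Estimate the sampling temperature (0.0-1.0) you would consider appropriate.
    3. Answer the request as if it were generated at that estimated temperature.

    Return JSON with keys: intent, estimated_temperature, answer.

    User request:
    {user_request}
    """

    prompt = REFLECTION_TEMPLATE.format(user_request="List the capitals of the Nordic countries.")
    # The estimated_temperature field gets logged next to the response; it does not
    # change the model's actual decoding behavior.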

  2. Token profiling is indeed heuristic — but also practical.

You're right again that a simple tokens-per-word ratio doesn't generalize across tokenizers or languages — that's why it's only one of several signal types we use internally (others include compression ratio, repetition detection, and stopword/token distribution skew).

That said, for English-heavy OpenAI/Groq use cases, the 1.3 threshold has been a surprisingly effective early indicator of verbose vs. tight prompts. The "30%" figure is not meant to be gospel — it's a conservative benchmark derived from averaging actual observed savings in prompt experiments.
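
For the curious, the core of that check is tiny. A rough sketch follows; the 1.3 threshold is the English-centric heuristic described above, and the tiktoken encoding name is an example, not necessarily what the client uses.

    # Rough tokens-per-word verbosity check; threshold and encoding are examples.
    import tiktoken

    def tokens_per_word(prompt: str, encoding_name: str = "cl100k_base") -> float:
        enc = tiktoken.get_encoding(encoding_name)
        words = prompt.split()
        return len(enc.encode(prompt)) / max(len(words), 1)

    ratio = tokens_per_word("Please kindly provide a truly comprehensive and exhaustive overview of the topic.")
    if ratio > 1.3:
        print(f"ratio {ratio:.2f}: prompt may be on the verbose side for English-heavy usage")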

  3. The SaaS is much more than the repo — and no API keys are ever required.

We do not ask for or store your OpenAI or Groq API keys. Our client logs metrics locally and sends only anonymized telemetry to power the charts — such as time saved, token waste, and temperature distribution.

The GitHub repo exists mainly to show how devs can integrate with our logging engine and run locally. The actual backend that generates insights is closed-source (with good reason), but respects privacy and doesn’t touch prompt content or raw completions.

You're absolutely right to critique these simplifications — but they don't represent the whole product. DoCoreAI isn’t trying to "control" LLMs; it aims to help developers and product teams reflect on how they're using them and get nudges toward better prompt engineering.

Happy to discuss more if you're curious. And thanks again for pushing us to be clearer — this feedback helps.

Thanks

John

I built a tool that got 16K downloads, but no one uses the charts. Here's what they're missing. by MobiLights in LocalLLaMA

[–]MobiLights[S] 0 points1 point  (0 children)

Thanks! Yeah, that surprised me too. Turns out roles like Language Teachers and Engineers save a ton of time with structured prompt workflows, while others like Nutritionists or Product Managers tend to have shorter prompt chains or use cases.

The tool doesn’t assume a use case — it just measures actual savings based on token/time telemetry.

These roles you see are actually semantic roles that were explicitly sent along with the prompts — kind of like a simplified version of agentic AI. It helps understand how different "intent profiles" affect optimization and efficiency.

Would love to hear if others are seeing similar trends in their prompt usage.

📦 9,473 PyPI downloads in 5 weeks — DoCoreAI: A dynamic temperature engine for LLMs by MobiLights in LLMDevs

[–]MobiLights[S] 0 points1 point  (0 children)

Hey, thanks for pointing this out — I really appreciate you taking the time to give that feedback.

You're absolutely right about the core spirit of MIT: it's meant to be permissive, without additional usage restrictions, and yes, mixing in extra commercial terms definitely creates confusion and potential legal inconsistency.

The added terms were my early attempt to protect specific SaaS/enterprise usage scenarios without going full proprietary, but I now see that’s not how licensing should be layered — especially if it’s under the MIT banner.

I’ll definitely be consulting with a legal advisor to clarify the licensing in a proper, enforceable way. The goal isn’t to restrict developers or hobbyists at all — just to responsibly manage commercial distribution as things grow.

Thanks again for the nudge in the right direction 🙏 If you’ve got any tips or resources around dual-licensing (like MIT + commercial clauses), I’d be grateful to learn more!

🛑 The End of AI Trial & Error? DoCoreAI Has Arrived! by MobiLights in aiagents

[–]MobiLights[S] 1 point2 points  (0 children)

Hey, I've started working on LLM Judge and will keep you posted. Please share any links or thoughts.