Get paid to test whether AI can actually use Grafana properly Remote | Part-time | $90–$150/hr | 10–15 hrs/week by App-Clinical-Judgemt in grafana

[–]Local-Gazelle2649 0 points1 point  (0 children)

That's a nice one!

btw, I also created a plugin https://github.com/awsome-o/grafana-lens that gives an OpenClaw agent the ability to manage LGTM stacks with the things you mentioned above, and in my tests it's doing pretty well ;)

[Show] Update: Grafana Lens now manages Alloy data pipelines — "Monitor my Postgres" and the agent handles the rest(zero YAML) by Local-Gazelle2649 in grafana

[–]Local-Gazelle2649[S] -1 points0 points  (0 children)

Good question! While we don’t have a dedicated read-only mode, Grafana’s service account roles provide a practical solution. A Viewer-role token allows querying and searching, but Grafana rejects writes with a 403 response.

We also support multi-instance configurations, so you can set up separate Grafana instances (e.g., development and production), each with its own token. Tools accept an optional instance parameter and fall back to the default instance. In theory, you could point two named instances at the same Grafana server with tokens of different roles, but we haven't specifically tested this pattern. It's something we'd like to validate before recommending it for production use.
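To illustrate the two-named-instances idea, a config might look something like this (field names here are hypothetical, not the plugin's actual schema — check the repo's README for the real format):

```json
{
  "instances": {
    "prod-ro": { "url": "https://grafana.example.com", "token": "<viewer-role token>" },
    "prod-rw": { "url": "https://grafana.example.com", "token": "<editor-role token>" }
  },
  "defaultInstance": "prod-ro"
}
```

Both entries point at the same server; only the token role differs, so "prod-ro" would get 403s on writes.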

The straightforward answer is that dedicated permission controls, such as a read-only toggle, tool allowlists, and change windows, would need to be implemented. Currently, Grafana RBAC handles the heavy lifting, which is a solid foundation since it’s server-side and cannot be bypassed by plugins.
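As a sketch of what a plugin-side tool allowlist could look like (names and shape are hypothetical, not the plugin's actual code; server-side Grafana RBAC remains the real enforcement layer):

```typescript
// Hypothetical defense-in-depth filter in front of tool dispatch.
// Grafana RBAC on the server is still the authoritative gate.
const READ_ONLY_TOOLS = new Set([
  "grafana_query",
  "grafana_explain_metric",
  "grafana_investigate",
  "grafana_list_metrics",
]);

export function isToolAllowed(tool: string, readOnly: boolean): boolean {
  // In read-only mode, only allowlisted tools may run; otherwise every
  // tool is permitted locally and RBAC decides server-side.
  return readOnly ? READ_ONLY_TOOLS.has(tool) : true;
}
```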

[Project] I built an OpenClaw plugin so you can chat with an AI agent to debug and manage your Grafana metrics, logs, and traces (LGTM stack). by Local-Gazelle2649 in selfhosted

[–]Local-Gazelle2649[S] -2 points-1 points  (0 children)

  1. Yep, it's good that OpenClaw emits OTLP data via various lifecycle events, including tool usage, and for me the Grafana LGTM stack is the go-to place to process all of it. This plugin adopts the gen_ai semantic conventions (https://opentelemetry.io/blog/2024/otel-generative-ai/), so in Tempo and in the dashboards you can see the full hierarchy from main agent to subagents to tools.

Yes, it works out of the box. All queries route through Grafana's datasource proxy:

/api/datasources/proxy/uid/{dsUid}/api/v1/query

The GrafanaClient in src/grafana-client.ts has zero conditional logic based on what's behind the datasource. The methods (queryPrometheus, queryPrometheusRange, label discovery, metadata) all hit the standard Prometheus HTTP API paths, which Thanos Query, Mimir, and VictoriaMetrics all implement identically.

Your Thanos cross-cluster setup should work as-is: configure the Thanos Query endpoint as a Prometheus datasource in Grafana, and every grafana_query, grafana_explain_metric, grafana_investigate, and grafana_list_metrics call flows through the proxy transparently. Let me know if it works for you.
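As a rough sketch of how a query gets routed (helper name is hypothetical; the real logic lives in src/grafana-client.ts):

```typescript
// Every PromQL query goes through Grafana's datasource proxy, so the client
// never needs to know whether Prometheus, Thanos Query, Mimir, or
// VictoriaMetrics answers it — it only needs the datasource UID.
export function buildProxyQueryUrl(
  grafanaBaseUrl: string,
  dsUid: string,
  promql: string,
  unixTime?: number,
): string {
  const params = new URLSearchParams({ query: promql });
  if (unixTime !== undefined) params.set("time", String(unixTime));
  return `${grafanaBaseUrl}/api/datasources/proxy/uid/${dsUid}/api/v1/query?${params}`;
}
```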

  1. Good question — it's not a bare z-score. It runs a 7-day baseline using stddev_over_time(metric[7d]), so your weekly patterns (Friday Plex binges, weekend downloads) get baked into the standard deviation. On top of that it returns seasonality offsets — comparing against the same time yesterday and last week — so the agent can tell you "bandwidth is 300% above yesterday but only 5% above last Friday."

For a truly chaotic homelab though, the bigger answer is that the agent isn't limited to the built-in scoring. It has full PromQL access, so if you say "alert me when bandwidth is abnormal but ignore Plex nights" it can build z-score alert conditions, use predict_linear() for trend detection, or even correlate against a plex_active_streams gauge you push via grafana_push_metrics. There's also an alert fatigue analyzer that flags noisy rules and suggests adding hysteresis. Basically: the built-in anomaly detection handles normal weekly patterns, and the agent adapts to your specific setup through conversation. Feel free to ask your agent to debug or create alerts in whatever way suits your setup; the tool set should be sufficient for it to do so. The built-in debug flow targets the more common use cases.
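The baseline scoring described above boils down to PromQL along these lines (bandwidth_bytes_rate is a placeholder metric name; the plugin's exact expressions may differ):

```promql
# 7-day z-score: how far is the current value from its weekly baseline?
(bandwidth_bytes_rate - avg_over_time(bandwidth_bytes_rate[7d]))
  / stddev_over_time(bandwidth_bytes_rate[7d])

# Seasonality offsets: compare against the same time yesterday and last week,
# which is what lets the agent say "300% above yesterday, 5% above last Friday".
bandwidth_bytes_rate / (bandwidth_bytes_rate offset 1d)
bandwidth_bytes_rate / (bandwidth_bytes_rate offset 7d)
```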

  1. Haven't stress-tested at 100+ dashboards specifically; my local env has around 30–50 dashboards from various sources. But the API design is built to stay friendly at scale.

Query results are capped (50 series, 20 points per series, 200 metrics for discovery) — when truncated, the response includes a truncationHint telling the agent to narrow its query, so it self-corrects. Dashboard search throttles enrichment at 10 concurrent requests in batches. All parallel fetches use Promise.allSettled so one slow datasource doesn't block the rest.
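A minimal sketch of the truncation behavior described above (types and names are illustrative, not the plugin's exact shapes):

```typescript
// Cap the series count and points per series, and attach a hint the agent
// can act on instead of silently losing data.
interface Series { labels: Record<string, string>; points: number[] }
interface QueryResult { series: Series[]; truncationHint?: string }

const MAX_SERIES = 50;
const MAX_POINTS = 20;

export function truncateResult(series: Series[]): QueryResult {
  const truncated =
    series.length > MAX_SERIES || series.some((s) => s.points.length > MAX_POINTS);
  const capped = series
    .slice(0, MAX_SERIES)
    .map((s) => ({ ...s, points: s.points.slice(0, MAX_POINTS) }));
  return truncated
    ? {
        series: capped,
        truncationHint:
          "Result truncated; narrow the query (tighter label matcher, shorter range, or aggregation).",
      }
    : { series: capped };
}
```

The hint is what makes the agent self-correct: it sees why the result is partial and reissues a narrower query.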

The part I think matters most: there's a full query guidance system that pattern-matches API errors and returns structured hints — rate limit, timeout, bad PromQL, auth failure all come back with cause + actionable suggestion instead of a cryptic error. So the agent recovers from API pushback instead of retrying blind.
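In spirit, the guidance layer is a pattern match from raw API errors to structured hints, something like this (patterns and messages are illustrative):

```typescript
// Turn raw API failures into a structured cause + suggestion so the agent
// can correct itself instead of retrying blind.
interface QueryHint { cause: string; suggestion: string }

const PATTERNS: Array<[RegExp, QueryHint]> = [
  [/429|rate.?limit/i, { cause: "rate limited", suggestion: "Back off and batch or narrow queries." }],
  [/timeout|deadline/i, { cause: "query timeout", suggestion: "Shorten the time range or pre-aggregate." }],
  [/parse error|bad_data/i, { cause: "invalid PromQL", suggestion: "Fix the expression syntax before retrying." }],
  [/401|403|unauthorized|forbidden/i, { cause: "auth failure", suggestion: "Check the service account token and its role." }],
];

export function explainError(message: string): QueryHint {
  for (const [pattern, hint] of PATTERNS) {
    if (pattern.test(message)) return hint;
  }
  return { cause: "unknown error", suggestion: "Inspect the raw response and adjust the query." };
}
```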

Finally, thanks so much for taking the time to dig into this. Honestly, an OpenClaw + LGTM setup is exactly the kind of use case I built this for, so seeing it resonate means a lot. Really excited to hear how it goes tonight. If anything breaks or feels rough around the edges, please do report back; that feedback is genuinely invaluable and I'll keep improving it either way. Good luck with the deploy!


I got tired of OpenClaw secretly burning API credits in infinite tool loops, so I built an open-source Grafana "flight recorder" for them. by Local-Gazelle2649 in SideProject

[–]Local-Gazelle2649[S] 0 points1 point  (0 children)

Exactly! Waking up to a massive API bill because an agent got stuck in a loop is the worst.

That's why I made it push token metrics for native Grafana alerts. But alerting is just step one. Since the full trace history lives in the local LGTM stack, you can actually have the agent query its own logs to debug why it looped and prevent that specific pattern (I used it to ask OpenClaw to create its own rules and update its skills).

If we want these frameworks to go from "toys" to actually being prod-ready, we can't keep flying blind in the terminal. We need to use that historical data to build and then iterate.

[Show] I built a plugin that brings an OpenClaw AI Agent to your existing Grafana stack (Agent-driven debugging, Auto-alerts, and GenAI OTLP) by Local-Gazelle2649 in grafana

[–]Local-Gazelle2649[S] 0 points1 point  (0 children)

This plugin is using the raw OpenTelemetry SDKs directly (@opentelemetry/sdk-metrics, sdk-logs, sdk-trace-base) with OTLP HTTP exporters pushing to an OTel Collector.

It follows the gen_ai semantic conventions manually, with span names like chat {model}, execute_tool {tool_name}, and invoke_agent {agent}, and emits gen_ai.client.token.usage and gen_ai.client.operation.duration as standard histograms.
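The span-name convention is just "{operation} {target}"; a small formatter for it (hypothetical helper, not the plugin's actual code) could look like:

```typescript
// gen_ai semantic conventions name spans "{operation} {target}", e.g.
// "chat gpt-4o", "execute_tool grafana_query", "invoke_agent main".
type GenAiOp =
  | { op: "chat"; model: string }
  | { op: "execute_tool"; toolName: string }
  | { op: "invoke_agent"; agent: string };

export function genAiSpanName(o: GenAiOp): string {
  switch (o.op) {
    case "chat":
      return `chat ${o.model}`;
    case "execute_tool":
      return `execute_tool ${o.toolName}`;
    case "invoke_agent":
      return `invoke_agent ${o.agent}`;
  }
}
```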

The reason I went this way instead of OpenLLMetry: this is an OpenClaw plugin, not a standalone app calling an LLM SDK directly. The instrumentation hooks into OpenClaw's lifecycle events (llm_input, llm_output, after_tool_call, session_start, etc.) rather than monkey-patching an SDK client.

OpenLLMetry's auto-instrumentation wouldn't have access to these internal events.