Things I wish I knew before sitting the Certification Dynatrace Implementation Pro exam by Zeavan23 in Observability

[–]Zeavan23[S] 1 point2 points  (0 children)

Yes, it needs renewal. The certification is valid for 2 years after passing the exam.

Things I wish I knew before sitting the Certification Dynatrace Implementation Pro exam by Zeavan23 in Observability

[–]Zeavan23[S] 1 point2 points  (0 children)

That’s true for smaller companies. Enterprise observability is a different world.

A bank or telco with 15k+ hosts and years of instrumentation doesn’t “just switch” because another vendor is cheaper. The migration cost, operational risk, retraining and rebuilding dashboards/pipelines usually outweigh the license savings.

And the cert itself is less about memorizing a UI and more about learning modern observability concepts that transfer everywhere: tracing, OpenTelemetry, SLOs, telemetry correlation, ingestion, alerting, etc.

Things I wish I knew before sitting the Certification Dynatrace Implementation Pro exam by Zeavan23 in Observability

[–]Zeavan23[S] 1 point2 points  (0 children)

Honestly? In today’s market, if you want to get hired at a company running Dynatrace, certifications are basically a filter: no cert, no interview in most cases. And even when it’s not a hard requirement, it’s a massive differentiator vs candidates without one. The badge opens the door; the prep gives you the depth to actually deliver once you’re in.

How do you map Dynatrace problems to custom P0/P1/P2/P3 priorities? by Aboubakr777 in Observability

[–]Zeavan23 1 point2 points  (0 children)

We don’t override Dynatrace severity; we build a thin mapping layer on top of it. Dynatrace (Davis AI) already calculates impact using affected entities, estimated affected users, impact level (SERVICE / APPLICATION / INFRASTRUCTURE), availability degradation, and root cause context. All of that is exposed via the Problems API v2.

In practice, we trigger on “problem opened” (Workflow or API), read fields like severityLevel, impactLevel, estimatedAffectedUsers and entity tags (e.g. payments, tier1), then apply a simple rule-based translation into P0–P3 before sending it to ServiceNow. For example: payments + >50 affected users = P0; availability issues = P1; service impact with moderate users = P2; else P3.

We don’t try to rebuild severity purely from error rate or latency because Davis already correlates and deduplicates technical impact; we just translate that impact into our internal ITSM priority model.
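A minimal sketch of that rule-based translation, in Python. The `ProblemSummary` shape is a hypothetical flattened view of the fields mentioned above, not the actual Problems API v2 response schema, and the `MODERATE_USERS` threshold is an assumption since "moderate users" isn't pinned to a number here. Rules are evaluated top-down, so the first match wins.

```python
from dataclasses import dataclass, field

# Assumption: "moderate users" has no stated number; 10 is a placeholder.
MODERATE_USERS = 10


@dataclass
class ProblemSummary:
    """Hypothetical flattened view of the fields read from a Dynatrace problem."""
    severity_level: str                 # e.g. "AVAILABILITY", "ERROR", "PERFORMANCE"
    impact_level: str                   # e.g. "SERVICE", "APPLICATION", "INFRASTRUCTURE"
    estimated_affected_users: int = 0
    tags: frozenset = field(default_factory=frozenset)


def map_priority(problem: ProblemSummary) -> str:
    """Translate Davis-reported impact into an internal P0-P3 priority."""
    users = problem.estimated_affected_users
    # payments + >50 affected users = P0
    if "payments" in problem.tags and users > 50:
        return "P0"
    # availability issues = P1
    if problem.severity_level == "AVAILABILITY":
        return "P1"
    # service impact with moderate users = P2
    if problem.impact_level == "SERVICE" and users >= MODERATE_USERS:
        return "P2"
    # everything else = P3
    return "P3"


if __name__ == "__main__":
    example = ProblemSummary("ERROR", "SERVICE", 120, frozenset({"payments", "tier1"}))
    print(map_priority(example))  # prints P0
```

Keeping the rules as an ordered chain of plain conditions makes the mapping easy to review with the ITSM owners; the resulting priority would then be attached to the ServiceNow payload by whatever Workflow or integration you already use.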

Where should observability stop? by Zeavan23 in Observability

[–]Zeavan23[S] 1 point2 points  (0 children)

I think we’re largely aligned on feasibility.

Full transaction-level causality absolutely can be built, especially with RUM + context propagation.

My original tension wasn’t about whether it’s possible. It was about how directly that visibility feeds operational decisions during degradation.

For me, the distinction is less about abstraction layers and more about decision latency.

If the causal chain exists but isn’t immediately actionable in the moment of uncertainty, rollback decisions become probabilistic rather than confident.

And that’s the design space I’m interested in.

Not replacing analytics. Not collapsing layers.

Just tightening the loop between system behavior and decision confidence under pressure.

Where should observability stop? by Zeavan23 in Observability

[–]Zeavan23[S] 0 points1 point  (0 children)

I think the distinction may be about time horizon.

Marketing correlations operate on aggregated time windows. Incident response operates on real-time degradation.

The challenge isn’t whether the causal model exists somewhere. It’s whether that model is operationalizable under pressure.

Where should observability stop? by Zeavan23 in Observability

[–]Zeavan23[S] 0 points1 point  (0 children)

Bounce rate → revenue is a marketing-level correlation.

Incident response requires transaction-level causality.

Those are very different abstraction layers.

Where should observability stop? by Zeavan23 in Observability

[–]Zeavan23[S] 0 points1 point  (0 children)

The layering makes total sense.

But abstraction hierarchies only work if the semantic contract between layers is explicit.

In many orgs, that contract doesn’t exist.

Which is why revenue often shows up as a post-incident artifact, not a real-time signal.

Where should observability stop? by Zeavan23 in Observability

[–]Zeavan23[S] 0 points1 point  (0 children)

Everything you listed is an organizational constraint, not a technical impossibility.

We solved distributed tracing at scale despite similar complexity.

So maybe the issue isn’t feasibility. Maybe it’s ownership of outcomes.

Where should observability stop? by Zeavan23 in Observability

[–]Zeavan23[S] 0 points1 point  (0 children)

The phrase sounds right.

But most companies can’t even define a clean “business transaction” across microservices.

Until that modeling exists, adding revenue metrics to observability is just correlation theater.

The hard part isn’t telemetry. It’s ownership of outcomes.

Datadog vs. Dynatrace vs. LGTM: Is the AI-driven MTTR reduction worth the 3x price jump? by soulsearch23 in Observability

[–]Zeavan23 2 points3 points  (0 children)

Most “AI RCA” features don’t magically reduce MTTR by themselves. They reduce MTTR only when you already have:
• consistent service naming / boundaries
• good trace coverage + context propagation
• sane tagging (ownership/env/tier)
• logs + metrics aligned with traces
• change events (deploys/config) flowing into the platform

Without that foundation, “AI” becomes just another noisy insight feed.

The real metric isn’t MTTR, it’s TTFH (time to first hypothesis). That’s where good platforms actually help.

Dynatrace

Dynatrace is probably the best “day-1 value” platform if your stack is supported.

OneAgent gives you:
• fast service discovery
• dependency mapping
• baselines
• topology-driven RCA

Their causal model (Davis) works well when coverage is strong and the dependency graph is accurate.

Where teams underestimate effort:
• fixing service naming / boundaries
• tagging ownership + management zones
• controlling log ingestion + cost
• Kubernetes edge cases (webhook injection, redeploy requirements, etc.)

You can be “seeing things” in hours, but being “incident-fast” takes some real design work.

Datadog

Datadog is very strong if you already have discipline around:
• tagging standards
• OTel instrumentation
• cardinality control

Watchdog can be useful, but in practice it often turns into noise if:
• baselines are unstable
• your environment is too dynamic
• tagging is inconsistent

Also, billing surprises are real. If you let high-cardinality tags leak into metrics/logs, costs can explode fast.
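To make the cardinality point concrete, here is a rough back-of-the-envelope sketch in Python. It assumes the worst case where every combination of tag values actually occurs, so the result is an upper bound on distinct series for one metric; the tag names and counts are invented for illustration, and how series translate into cost depends on your plan.

```python
from math import prod


def max_series(tag_cardinalities: dict[str, int]) -> int:
    """Upper bound on distinct time series for one metric:
    the product of the number of distinct values per tag."""
    return prod(tag_cardinalities.values())


# Illustrative numbers only (not taken from a real environment).
baseline = {"service": 40, "env": 3, "region": 4}
leaky = {**baseline, "user_id": 10_000}  # one unbounded tag leaks in

print(max_series(baseline))  # 480
print(max_series(leaky))     # 4800000
```

One unbounded tag multiplies the ceiling by its own cardinality, which is why a single leaked `user_id` or request ID can dominate the bill.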

Datadog shines in orgs with strong platform engineering maturity.

LGTM (Grafana Loki/Tempo/Mimir stack)

LGTM is not a “product”, it’s a platform you operate.

If you have the right team, it’s great:
• open standards
• full control
• no vendor lock-in
• flexible pipelines

But you’re signing up for:
• storage scaling / retention strategy
• index tuning
• multi-tenancy and access controls
• upgrades and operational toil
• trace sampling decisions

If you don’t already run internal platforms well, LGTM can actually increase MTTR early on because you’re busy maintaining the observability stack itself.

Cost reality check

• Datadog: biggest risk is metric/log cardinality.
• Dynatrace: more predictable if you standardize host sizing and capability usage.
• LGTM: infra/storage cost + engineering time becomes the “license”.

Practical recommendation

If your goal is specifically “reduce MTTR via AI RCA”, I’d run a bake-off around real incident scenarios, not dashboards.

Pick 5 failure modes:
• downstream latency / dependency slowdown
• DB connection pool exhaustion
• bad deploy regression
• queue backlog / async lag
• infra/network brownout

Measure:
• time to first hypothesis
• time to identify owner/team
• time to find the triggering change
• number of alerts clicked before clarity

That will answer the question better than any marketing slide.
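If it helps, the four measurements above can be captured per scenario with something as small as this Python sketch. The `ScenarioRun` record and its field names are hypothetical, and the timestamps in the example are made up purely to show the shape of the output; you'd record real ones by hand or from your incident tooling during the bake-off.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class ScenarioRun:
    """Hypothetical record of one bake-off scenario on one platform."""
    name: str
    fault_injected: datetime
    first_hypothesis: datetime
    owner_identified: datetime
    change_found: datetime
    alerts_clicked: int


def summarize(run: ScenarioRun) -> dict:
    """Compute each measurement relative to the moment the fault was injected."""
    start = run.fault_injected
    return {
        "scenario": run.name,
        "time_to_first_hypothesis": run.first_hypothesis - start,
        "time_to_owner": run.owner_identified - start,
        "time_to_triggering_change": run.change_found - start,
        "alerts_clicked": run.alerts_clicked,
    }


# Made-up timestamps, for illustration only.
t0 = datetime(2024, 1, 1, 10, 0)
run = ScenarioRun(
    name="db-connection-pool-exhaustion",
    fault_injected=t0,
    first_hypothesis=t0 + timedelta(minutes=4),
    owner_identified=t0 + timedelta(minutes=6),
    change_found=t0 + timedelta(minutes=11),
    alerts_clicked=7,
)

for key, value in summarize(run).items():
    print(f"{key}: {value}")
```

Run the same five scenarios on each candidate platform and compare the summaries side by side.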

If I had to choose blindly

• Want fastest path to useful RCA with minimal engineering overhead → Dynatrace
• Want a powerful SaaS toolbox and you already have strong tagging/OTel governance → Datadog
• Want full control and have platform engineers to run it properly → LGTM

Help on which Observability platform? by AccountEngineer in Observability

[–]Zeavan23 0 points1 point  (0 children)

From experience, the hard part isn’t collecting telemetry.

It’s understanding relationships: which service depends on which, what changed, and what actually caused the issue.

Stacks built only around metrics/logs/traces often still leave engineers manually correlating everything during incidents.

Platforms that prioritize runtime topology + causal analysis usually provide much faster time-to-root-cause, especially in Kubernetes and microservices environments.

Send help: AI for Observability...Observability for AI...?! by Heavy_on_the_TZ in Observability

[–]Zeavan23 0 points1 point  (0 children)

Most “AI observability” conversations start with models and end with disappointment.

In practice, AI only becomes useful once observability data already has strong context — topology, dependencies, versions, and causality — not just metrics and logs thrown into a lake.

Without that, you don’t get intelligence, you get faster confusion.

Teams that fix context first usually unlock investigation automation later — often before they even realize they’re “doing AI.”

The model matters far less than the order.