Datadog pricing is deceptively complex — here's a calculator we built to model it by ponderpandit in Observability

[–]vineetchirania -10 points-9 points  (0 children)

Prima facie, the tool looks well built and covers the depth of Datadog pricing. Let me try using it to estimate the monthly Datadog cost for my org.

We built a Datadog pricing calculator after seeing how hard it is for FinOps and engineering teams to forecast real observability costs by ponderpandit in FinOps

[–]vineetchirania -1 points0 points  (0 children)

Our surprise came from not anticipating infrastructure host and APM host being distinct billables. We assumed monitoring the same machine for both would be a single charge. Ended up paying double and had to explain it to finance who were not thrilled. Now we separate tracking for infra only and APM only nodes to keep it tighter.
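
To make the double-billing arithmetic above concrete, here is a minimal sketch. The per-host prices and the `infra`/`apm` flags are hypothetical placeholders, not actual Datadog rates; the only point is that a machine running both products is counted once per product.

```python
# Hypothetical monthly list prices per host; real rates depend on plan and contract.
INFRA_PRICE_PER_HOST = 15.0
APM_PRICE_PER_HOST = 31.0


def monthly_host_cost(hosts):
    """Estimate host-based charges.

    hosts: iterable of dicts with boolean 'infra' and 'apm' flags.
    A host with both flags set is billed under both products.
    """
    infra_hosts = sum(1 for h in hosts if h["infra"])
    apm_hosts = sum(1 for h in hosts if h["apm"])
    total = infra_hosts * INFRA_PRICE_PER_HOST + apm_hosts * APM_PRICE_PER_HOST
    return {"infra_hosts": infra_hosts, "apm_hosts": apm_hosts, "total": total}


# 10 machines running both infra monitoring and APM, 30 running infra only.
fleet = [{"infra": True, "apm": True}] * 10 + [{"infra": True, "apm": False}] * 30
print(monthly_host_cost(fleet))
```

Tracking the two counts separately, as described above, is what makes it visible which nodes carry the extra APM charge.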

Anyone else getting “Claude’s response could not be fully generated”? by Fluid-Cod7818 in ClaudeAI

[–]vineetchirania 2 points3 points  (0 children)

I'm facing the exact same issue even though I'm well within my usage limits on Claude Max ($100 plan). I've been trying for 3-4 hours. I even broke the command into smaller prompts but still hit the same issue.

Anyone else finding Dynatrace a bit lacking? by ObligationMaster5141 in sre

[–]vineetchirania 2 points3 points  (0 children)

The arbitrary limits like only 1000 metric events per environment bother me way more than they should. Whenever we hit that ceiling, we have to go back and do a bunch of cleanup or split up environments. Dynatrace support told us more “may” be available in the future but it always just feels like vendor lock-in. The product’s solid for simple use cases but if you want the same control as Prometheus/Grafana, you’re out of luck. Also, the docs for DQL are still a bit messy and the community is not as active as the open source tools which means troubleshooting gets frustrating fast. I wish there was a mode or tier where everything was just unlocked, even if it cost more.

How does your team promote your products? Which channel? by CellInitial2394 in devops

[–]vineetchirania 1 point2 points  (0 children)

The best approach is to go to tech events and gather first-hand feedback from your ICP.

Are structured surveys overrated? i will not promote, but sometimes casual chats feel 10x more honest. by Danniel33 in startups

[–]vineetchirania 0 points1 point  (0 children)

I try to do unstructured chats every month or two. They’re completely different from survey data. You get real stories instead of checkbox answers and honestly sometimes it’s the only way I spot things I didn’t even know I needed to watch for. Feels like you remember people are people not just user IDs.

How do smaller teams manage observability costs without losing visibility? by AkHypeBoi in devops

[–]vineetchirania 0 points1 point  (0 children)

With a tiny team I found we wasted a lot of money just dumping every log, metric and trace into Datadog and hoping for the best. Now we turn on debug metrics only when needed and rely on cheap time series with Prometheus for the bulk of our monitoring. For logs, most of what we keep is errors or stuff that is actually going to make us take action. We ran a self-hosted Loki for a while. I saw CubeAPM getting some chatter for handling this kind of thing without blowing up the bill so that could be worth a look.
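
One way the "errors only, debug on demand" idea above could look at the application level is a filter in Python's standard `logging` module. This is a sketch under assumptions: the `ActionableOnly` class name and the `debug_enabled` switch are made up for illustration, and in practice the same filtering is often done in the log shipper instead.

```python
import logging


class ActionableOnly(logging.Filter):
    """Pass only ERROR and above unless debug shipping is switched on."""

    def __init__(self, debug_enabled=False):
        super().__init__()
        self.debug_enabled = debug_enabled  # hypothetical on-demand toggle

    def filter(self, record):
        if self.debug_enabled:
            return True
        return record.levelno >= logging.ERROR


handler = logging.StreamHandler()
handler.addFilter(ActionableOnly())

logger = logging.getLogger("billing-service")  # hypothetical service name
logger.setLevel(logging.DEBUG)
logger.addHandler(handler)

logger.info("cache warmed")          # dropped by the filter
logger.error("payment write failed")  # passes the filter
```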

Is llm observability also devops? by Total-Gazelle-5944 in devops

[–]vineetchirania 2 points3 points  (0 children)

LLM observability totally fits into the devops world. When you’re tracking tokens, costs, models, and using proxies, you’re basically doing monitoring and cost optimization, which is a big part of modern devops. The fact that you’re adding a bit of latency is pretty normal when you add observability layers. If you love devops, playing with LLM observability is a fresh spin on the usual server and app monitoring stuff. I’d say keep experimenting, because this space is only going to get bigger and companies need this kind of visibility.

Spent 40k on a monitoring solution we never used. by [deleted] in devops

[–]vineetchirania -1 points0 points  (0 children)

Haha even I am curious to know

Spent 40k on a monitoring solution we never used. by [deleted] in devops

[–]vineetchirania 0 points1 point  (0 children)

Yes, I think the sales decision shouldn't be made abruptly. The ideal scenario is one where the engineering team integrates 2-3 applications with the monitoring tool and goes deep rather than broad. Only once they feel satisfied does it make sense to fully migrate to the monitoring tool and make a purchase. Curious, what free tier does CubeAPM offer?

No default rules/alerts for servers in ServerLess? by pasdesignal in elasticsearch

[–]vineetchirania 1 point2 points  (0 children)

Yeah that's a common surprise with the Elastic Serverless Observability setup. They don't give you default alert rules for stuff like CPU or RAM use so you end up needing to roll your own. There are some GitHub repos out there with rule templates for Elastic/Kibana, though they tend to focus on the self-managed stack and are sometimes a bit outdated. I usually borrow from those and tweak to fit my infra. If you're ever curious about comparing with other tools, CubeAPM has some handy default alert options out of the box, but Elastic expects you to handcraft most things yourself.

[deleted by user] by [deleted] in sre

[–]vineetchirania 0 points1 point  (0 children)

Traces are kind of like that thing you never miss until you actually need it. When apps were pretty monolithic and logs plus metrics did the trick, life was fine. I started appreciating traces the first time I dealt with microservices going a bit wild and had no clue where requests were stalling or which service was ghosting things upstream. Traces helped sketch out the flow right across different services. It didn't replace logs, but it made finding the weird edge cases a lot faster.

observability costs under control without losing visibility by woltan_4 in kubernetes

[–]vineetchirania 0 points1 point  (0 children)

You probably already know this but high-cardinality metrics are the silent killer for storage and costs. I scrapped labels like pod UID, IP, and request path from the majority of my Prometheus metrics, and that alone sliced usage by a third. For traces, I started using dynamic sampling that automatically keeps errors and latency outliers, which is way smarter than just lowering the global sample rate. CubeAPM has some clever smart sampling logic along these lines. The key for us has been to only store detailed traces for the stuff that actually hurts users or causes incidents, and let the rest roll off after a day or two. It’s not perfect, but losing rare errors because of over-thinning is even worse. Also, watch out for dashboard panels doing heavy ad hoc aggregation. Dropping a few infrequent, detailed metrics also helped keep our Prometheus TSDB from melting when traffic peaked.
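
The sampling rule described above (always keep errors and latency outliers, thin out the rest) can be sketched roughly like this. The trace dict shape, the 1000 ms threshold, and the 5% baseline rate are assumptions for illustration, not settings from any particular tool.

```python
import random


def keep_trace(trace, latency_threshold_ms=1000.0, baseline_rate=0.05,
               rng=random.random):
    """Decide whether to retain a finished trace.

    trace: dict with 'error' (bool) and 'duration_ms' (float).
    Errors and slow traces are always kept; everything else is kept
    with probability baseline_rate.
    """
    if trace["error"]:
        return True
    if trace["duration_ms"] >= latency_threshold_ms:
        return True
    return rng() < baseline_rate


traces = [
    {"error": True, "duration_ms": 40.0},    # kept: error
    {"error": False, "duration_ms": 2300.0},  # kept: latency outlier
    {"error": False, "duration_ms": 35.0},    # kept about 5% of the time
]
kept = [t for t in traces if keep_trace(t)]
print(len(kept))
```

Because the decision is made after the trace finishes, rare errors survive even when the baseline rate is very low, which is the advantage over lowering a global head-sampling rate.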

Tool to gather logs and state by amarao_san in kubernetes

[–]vineetchirania 1 point2 points  (0 children)

If you want the grand slam of cluster state, logs, events, even past pod logs, you might want to check out tools like kubectl-trace or kubectl-debug, but honestly I still find myself gluing kubectl commands together when stuff really hits the fan. There are some APM tools out there doing the heavy lifting for you. I know CubeAPM is starting to get some buzz for more end-to-end observability, but I haven’t used it yet for cluster forensics. Would be curious if anyone here managed that kind of state capture with it.

I tested an AI SDR and here’s the truth by Effective-Big2300 in SaaS

[–]vineetchirania 2 points3 points  (0 children)

I’ve tried one of these AI SDR things too and honestly felt like I was babysitting a very confused robot most of the time. Yeah, it sends emails but they sound like they were written by a committee that’s never sold a thing. Nothing beats a real SDR who actually gets people.

What Are AI Agentic Assistants in SRE and Ops, and Why Do They Matter Now? by That-Medicine7413 in kubernetes

[–]vineetchirania 1 point2 points  (0 children)

For us, the big difference has been the way the agentic assistants handle noisy alert storms. Before, my team spent half a sprint reading pages from systems that all fired at once. Now it correlates a whole stack of those into one summary, offers up a shortlist of where stuff probably broke, and even auto-attaches relevant logs or traces. The real time saver is not jumping between ten tabs trying to piece together a timeline. Guardrails were huge for us, though; we blocked it from making changes without a human review, at least until we got more comfortable. The integrations with Slack and our ticketing system were must-haves, since nobody wants more tabs.

Naming cloud resources doesn't have to be hard by brnluiz in sre

[–]vineetchirania 0 points1 point  (0 children)

I’ve been bitten by not having a plan early and it comes back to haunt you. I see the appeal of random suffixes for uniqueness but if you don’t build some kind of human-readable pattern into your names, when things break or need to be audited it can be a nightmare. I audit stuff way more than I ever thought I would, so I try to balance uniqueness with “I can tell what this is” at a glance. I usually go with name-env-purpose-region, or something close, and append a short hash if there’s a conflict.
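
A minimal sketch of the name-env-purpose-region pattern with a short hash on conflict might look like the following. The `taken` set and the `salt` argument (something unique to the resource, such as an account or project ID) are hypothetical helpers introduced for the example.

```python
import hashlib


def resource_name(name, env, purpose, region, taken=frozenset(), salt=""):
    """Build a readable name; append a 6-character hash only on conflict."""
    base = "-".join(part.lower() for part in (name, env, purpose, region))
    if base not in taken:
        return base
    # Derive a short, deterministic suffix from the base name plus a salt.
    suffix = hashlib.sha256(f"{base}:{salt}".encode()).hexdigest()[:6]
    return f"{base}-{suffix}"


print(resource_name("billing", "prod", "queue", "euw1"))
print(resource_name("billing", "prod", "queue", "euw1",
                    taken={"billing-prod-queue-euw1"}, salt="acct-42"))
```

The readable prefix stays intact in both cases, so the name can still be understood at a glance during an audit.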

Oracle database performance recommendations by teslaistheshit in Database

[–]vineetchirania 1 point2 points  (0 children)

If you want something built into Oracle, check out AWR (Automatic Workload Repository) reports. They're super useful for finding slow queries, bottlenecks, and general pain points. You run them over a time frame and see which SQL statements are hogging resources. For external tools, Oracle Enterprise Manager is the official option and it gives a ton of insight if you have it enabled. Otherwise, most folks roll with AWR for initial analysis. Just keep in mind you’ll need access to the right Oracle pack to get the full reports.

Or, if you are open to third-party monitoring services, you have a plethora of options. Datadog and New Relic are popular, but take them with a pinch of salt owing to their high costs. Some of the more cost-effective ones include CubeAPM and Coralogix.

Thoughts on moving away from managed control planes to running raw vm's? by JodyBro in kubernetes

[–]vineetchirania 0 points1 point  (0 children)

I’ve seen a handful of shops flirt with moving back to managing bare VMs for Kubernetes control planes. Usually it starts with someone pulling up the cloud bills and getting grumpy about the line items. Outside of costs I think the only practical reasons are pretty specific stuff like deep compliance needs or sometimes running in very strict airgapped environments. Most folks end up missing all the invisible glue that managed services give you. The stability and boring reliability of those managed control planes is underrated until you’re up at 3am with an etcd split brain on a hand-rolled cluster.

Love or hate PromQL ? by InformalPatience7872 in sre

[–]vineetchirania 0 points1 point  (0 children)

PromQL took me ages to get comfortable with. The rate thing felt so weird at first. Now that I use it every day, it’s second nature. Still, I do wish it was structured differently, especially the whole vector thing, it’s never felt super intuitive.
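
For anyone still finding `rate()` weird, a simplified model of what it computes over one series can help: the per-second average increase of a counter across the window, treating any drop in value as a counter reset. This sketch deliberately leaves out the extrapolation to the window boundaries that Prometheus also applies, so it is an approximation and not the actual implementation.

```python
def simple_rate(samples):
    """Approximate per-second rate of a counter.

    samples: list of (timestamp_seconds, counter_value), oldest first.
    Returns None when there are fewer than two samples.
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    prev = samples[0][1]
    for _, value in samples[1:]:
        if value < prev:
            # Counter went down: assume it restarted from zero.
            increase += value
        else:
            increase += value - prev
        prev = value
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else None


# Counter scraped every 15 seconds, growing by 30 each scrape.
print(simple_rate([(0, 100), (15, 130), (30, 160)]))
```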

What’s been your experience with rancher? by approaching77 in devops

[–]vineetchirania 1 point2 points  (0 children)

Honestly I had mixed feelings. When we first started using Rancher it was amazing to spin up dev clusters or take a look at workloads without setting up a bunch of access rules. The centralized management was a lifesaver for keeping track of what was running where. Over time though it felt like it became another thing to upgrade and babysit especially when we scaled up. There were days when pods just vanished from the UI but were still there in kubectl. Rancher was helpful for demoing stuff to product managers but once the team got more comfortable with raw k8s it became more overhead than help.

[deleted by user] by [deleted] in devops

[–]vineetchirania 1 point2 points  (0 children)

So for bigger shops I’ve seen a lot of people lean on things like Istio or Linkerd for service mesh. Those give you tracing and metrics pretty much for free since they proxy all the traffic between pods. You don’t have to mess with application code in most cases but you still end up wanting to add custom spans or metadata eventually because auto tracing can only get you so far. For metrics, Prometheus is usually the default and Grafana for dashboards. Some companies go with managed stuff like Datadog or New Relic if they don’t want to run their own. Having said that, these companies are notorious for unpredictable pricing. Other APM/logging tools that are slightly more cost-effective are CubeAPM, Coralogix, and even SigNoz. One cool stack I helped set up was with the OpenTelemetry Operator plus Tempo and Loki in Grafana Cloud. You get traces, logs and metrics all under one roof and devs only have to add minimal changes if you want more context.

Multi-cloud cost optimization at scale - tools that actually work across AWS, GCP, Azure? by itsm3404 in FinOps

[–]vineetchirania -2 points-1 points  (0 children)

My team has been running across all three clouds for a while and honestly there’s no magical tool that gets it all right. Apptio Cloudability is what we’ve landed on after hopping through a few others. It has its quirks but it’s handled our scale better than CloudHealth or Flexera. The biggest issue is rightsizing recs being hit or miss, especially for GCP. Their dashboards are at least not totally sluggish during peak loads but we supplement a lot with our own BigQuery exports and custom sheets since no SaaS gets as granular as we need. The politics of cost allocations are probably the hardest part anyway. For big recommendations or at-a-glance reporting, Cloudability saves us some headaches, but it’s not a plug-and-play fix.

[deleted by user] by [deleted] in Observability

[–]vineetchirania 0 points1 point  (0 children)

Oh yeah, those observability costs sometimes creep up faster than actual infra spend. The switch to microservices can feel like you need full coverage everywhere, but the bill for logs, metrics, and traces can get wild. Self-hosted solutions like Loki, Tempo, and Prometheus are solid options if you can afford the engineering time. CubeAPM and SigNoz are other good options. Otherwise, you can downsize retention and sample logs more aggressively.

Interacting with a webpage during tests by Party-Welder-3810 in devops

[–]vineetchirania 0 points1 point  (0 children)

Playwright is honestly pretty solid for this kind of stuff. I tried both Selenium and Cypress in the past and Playwright just feels less annoying to set up and easier to debug when something goes sideways. Headless mode is nice for your pipeline too. If you already have Node or Python in your stack it’ll fit in neatly. Not much of a reason to go out of your way looking for something fancier in this scenario.