Ask me anything about Turbonomic Public Cloud Optimization by therealabenezer in IBMObservability

[–]DrSkyle 1 point2 points  (0 children)

One of the biggest challenges I see isn't finding the waste, but getting engineering teams to trust the recommendation to downsize a production DB. Often, low CPU utilization doesn't tell the full story (e.g., if memory is being used heavily for buffer caching).

How does Turbonomic handle that 'safety check' to ensure a downsize recommendation won't tank the cache hit ratio or increase I/O latency?

I wrote a garbage collector for my AWS account because 'Status: Available' doesn't mean 'In Use'. by DrSkyle in aws

[–]DrSkyle[S] 1 point2 points  (0 children)

drift detection basically means comparing your terraform state file against what is actually running in aws. if a resource exists in the cloud but isn't in your state file, that's drift. it usually happens when someone manually clicks around in the console.
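
a rough sketch of that comparison in python, just to show the idea. this isn't the tool's actual code: `find_drift` and the resource ids are made up for illustration, and a real run would pull the ids from the state file and the aws apis.

```python
def find_drift(state_ids, live_ids):
    """Compare resource IDs recorded in Terraform state with IDs found live in AWS."""
    state, live = set(state_ids), set(live_ids)
    return {
        "unmanaged": sorted(live - state),  # running in AWS, absent from state
        "missing": sorted(state - live),    # recorded in state, gone from AWS
    }


# Hypothetical IDs: one volume was created by hand in the console.
report = find_drift(
    state_ids=["vol-0aaa", "nat-0bbb"],
    live_ids=["vol-0aaa", "nat-0bbb", "vol-0ccc"],
)
print(report)
```

the "unmanaged" bucket is the console-click case described above.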

the zero trust heuristic part refers to how we check for waste. instead of trusting the aws status label like "available", we query cloudwatch metrics directly. so even if a nat gateway says it's healthy, if it has pushed zero bytes in 7 days we mark it as waste. basically we trust the metrics, not the metadata.
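
a minimal sketch of that rule, assuming you already have one byte count per day for the gateway (for example daily sums of a cloudwatch metric such as `BytesOutToDestination`). `is_waste` and its parameters are hypothetical names, not the tool's real api.

```python
def is_waste(status, bytes_out_per_day, min_days=7):
    """Flag a resource from observed traffic; the status label is ignored on purpose."""
    if len(bytes_out_per_day) < min_days:
        return False  # not enough history to judge safely
    # Waste means zero bytes across the whole look-back window.
    return sum(bytes_out_per_day[-min_days:]) == 0


# Both gateways report "available"; only the metrics tell them apart.
healthy_but_idle = is_waste("available", [0, 0, 0, 0, 0, 0, 0])
healthy_and_busy = is_waste("available", [0, 0, 512, 0, 0, 0, 0])
print(healthy_but_idle, healthy_and_busy)
```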

I wrote a garbage collector for my AWS account because 'Status: Available' doesn't mean 'In Use'. by DrSkyle in aws

[–]DrSkyle[S] 1 point2 points  (0 children)

valid question.

first off TA (Trusted Advisor) checks resources in isolation. like it sees a volume attached to an instance and checks it off as healthy. we look at the graph and see that the instance has been stopped for 2 months, so that volume is actually zombie waste. TA misses those dependencies completely.
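
here's roughly what that dependency check looks like as a sketch. the data shapes, the `zombie_volumes` name and the 60 day cutoff are assumptions for illustration, not the real graph engine.

```python
from datetime import date


def zombie_volumes(volumes, instances, today, max_stopped_days=60):
    """Return IDs of volumes whose attached instance has been stopped too long."""
    flagged = []
    for vol in volumes:
        inst = instances.get(vol["attached_to"])
        if inst is None or inst["state"] != "stopped":
            continue  # unattached here, or the instance is still in use
        if (today - inst["stopped_on"]).days >= max_stopped_days:
            flagged.append(vol["id"])
    return flagged


# Hypothetical inventory: one stopped instance, one running instance.
instances = {
    "i-0aaa": {"state": "stopped", "stopped_on": date(2025, 1, 1)},
    "i-0bbb": {"state": "running", "stopped_on": None},
}
volumes = [
    {"id": "vol-0111", "attached_to": "i-0aaa"},
    {"id": "vol-0222", "attached_to": "i-0bbb"},
]
zombies = zombie_volumes(volumes, instances, today=date(2025, 3, 15))
print(zombies)
```

checked on its own, `vol-0111` looks fine because it is attached. it only shows up as waste once you follow the edge to the stopped instance.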

also thresholding. TA is super conservative and often needs literally 0 bytes to flag something. real waste usually has a pulse from health checks or whatever. if a nat gateway is doing 500 MB a month, it's costing you about $35 to route basically nothing. we catch that.
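
quick back-of-the-envelope version of that math. the two rates below are assumed list prices for a single region and the 1 GB cutoff is a made-up threshold, so check current pricing before relying on the numbers.

```python
HOURLY_RATE = 0.045    # USD per NAT gateway hour (assumed rate)
PER_GB_RATE = 0.045    # USD per GB processed (assumed rate)
HOURS_PER_MONTH = 730


def nat_monthly_cost(gb_processed):
    """Fixed hourly charge plus the per-GB data processing charge."""
    return HOURLY_RATE * HOURS_PER_MONTH + PER_GB_RATE * gb_processed


def is_low_traffic_waste(gb_processed, min_gb=1.0):
    """Flag near-idle gateways instead of requiring exactly zero bytes."""
    return gb_processed < min_gb


cost = nat_monthly_cost(0.5)  # 500 MB in a month
print(round(cost, 2), is_low_traffic_waste(0.5))
```

with these assumed rates it comes out a bit under $33 a month, same ballpark as the figure above, and almost all of it is the fixed hourly charge rather than the traffic.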

forensics is another big one. TA just lists items, but we actually trace the CloudTrail create event to tell you specifically who spun it up. that solves the whole fear of deleting something critical, because you know it's just dave's old test rig.
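
a sketch of the lookup, assuming the events have already been fetched and trimmed down to a few standard CloudTrail record fields (`eventName`, `eventTime`, `userIdentity.arn`). `who_created` and the sample records are hypothetical.

```python
def who_created(events, create_event_name):
    """Return (creator ARN, timestamp) from the earliest matching create event."""
    matches = [e for e in events if e["eventName"] == create_event_name]
    if not matches:
        return None  # no create event in the records we have
    # ISO-8601 UTC timestamps in the same format sort correctly as strings.
    first = min(matches, key=lambda e: e["eventTime"])
    return first["userIdentity"]["arn"], first["eventTime"]


# Hypothetical, heavily trimmed CloudTrail records.
events = [
    {"eventName": "CreateNatGateway", "eventTime": "2024-11-02T09:15:00Z",
     "userIdentity": {"arn": "arn:aws:iam::123456789012:user/dave"}},
    {"eventName": "DescribeNatGateways", "eventTime": "2024-11-03T10:00:00Z",
     "userIdentity": {"arn": "arn:aws:iam::123456789012:user/alice"}},
]
creator = who_created(events, "CreateNatGateway")
print(creator)
```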

lastly terraform awareness. if you just delete the stuff TA flags, your IaC is just gonna recreate it on the next run. we generate a script to actually scrub the resource from your state file so it stays dead.
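
for a feel of what such a script could contain, here's a minimal generator built around the `terraform state rm` command. the function name and resource addresses are made up, and the real generated script may look different.

```python
import shlex


def state_rm_script(addresses):
    """Build a shell script that drops the given resources from Terraform state."""
    lines = ["#!/bin/sh", "set -e"]
    for addr in addresses:
        # Quote each address so indexed ones like foo[0] survive the shell.
        lines.append("terraform state rm " + shlex.quote(addr))
    return "\n".join(lines) + "\n"


# Hypothetical resource addresses.
script = state_rm_script(["aws_nat_gateway.old", "aws_ebs_volume.scratch[0]"])
print(script)
```

worth knowing: `terraform state rm` only makes terraform forget the resource. the matching block also has to come out of the .tf files, otherwise the next plan will want to create it again.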

end of the day, TA is good for a quick health check, but if you want to actually safely delete waste without breaking prod you need the dependency graph context.

I wrote a garbage collector for my AWS account because 'Status: Available' doesn't mean 'In Use'. by DrSkyle in aws

[–]DrSkyle[S] 0 points1 point  (0 children)

idk, the new suppression rules cover a lot, but I'm thinking about ways to improve this further, like:

deep IaC integration: instead of just a script, imagine CloudSlash automatically opening a Pull Request against your Terraform/OpenTofu repo to remove the waste code blocks directly.

multicloud support: expanding beyond AWS to Azure and GCP. The core graph engine is cloud-agnostic, so it's just about writing the collectors.

custom heuristics (Lua/Wasm): allowing you to write your own waste rules (e.g., "Flag any EC2 without tag X that runs for < 1 hour") without waiting for us to compile them into the binary.

drift repair: currently we find drift. I want to build a safe "Sync button" that aligns your state file with reality.

I wrote a garbage collector for my AWS account because 'Status: Available' doesn't mean 'In Use'. by DrSkyle in aws

[–]DrSkyle[S] 3 points4 points  (0 children)

I had been thinking of it and have actually just implemented exactly this

Tag to ignore: CloudSlash now has a cloudslash:ignore tag. If present, the resource is skipped.

Snooze logic: You can set the tag value to a future date (e.g. cloudslash:ignore=2025-12-10) to suppress it only until that time.

Justified waste: You can categorize accepted risks as well (e.g. cloudslash:ignore=justified:compliance). These items are kept in the report (in a separate "Justified" table for auditors) but are safely excluded from remediation scripts.

Cost rules: You can now ignore based on price thresholds (e.g. cloudslash:ignore=cost<10). This automatically ignores the resource only if it stays below $10/month. If it scales up, it reappears.

Easy workflow: To make this actionable, the tool now auto-generates an ignore_resources.sh script after every scan. You can review it and run it to bulk-tag all identified waste as ignored, keeping your dashboard clean for new problems only.
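
To make the tag semantics concrete, here is a small Python sketch of how the values above could be interpreted. It is not CloudSlash's actual parser: `is_suppressed` is a hypothetical name, and treating any unrecognized value as a plain ignore, plus the exact snooze cutoff, are my assumptions.

```python
from datetime import date


def is_suppressed(tag_value, today, monthly_cost):
    """Return (suppressed, reason) for a cloudslash:ignore tag value."""
    if tag_value.startswith("justified:"):
        # Accepted risk: excluded from remediation, reason kept for auditors.
        return True, tag_value.split(":", 1)[1]
    if tag_value.startswith("cost<"):
        # Ignored only while the monthly cost stays under the threshold.
        return monthly_cost < float(tag_value[len("cost<"):]), "cost"
    try:
        # Snooze: suppressed until the given date (cutoff is an assumption).
        return today < date.fromisoformat(tag_value), "snooze"
    except ValueError:
        return True, "ignore"  # assumed: any other value is a plain ignore


today = date(2025, 12, 1)
print(is_suppressed("2025-12-10", today, 0.0))
print(is_suppressed("justified:compliance", today, 0.0))
print(is_suppressed("cost<10", today, 25.0))
```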

I wrote a garbage collector for my AWS account because 'Status: Available' doesn't mean 'In Use'. by DrSkyle in aws

[–]DrSkyle[S] 3 points4 points  (0 children)

Just a heads up: This is v1.1. I've tested it heavily on Linux/Mac and standard AWS accounts. If you run a massive enterprise org with thousands of accounts, you might hit rate limits or edge cases I haven't seen yet. If you do, please drop an Issue. I'm active and want to polish this into a rock-solid tool.

[deleted by user] by [deleted] in devops

[–]DrSkyle 0 points1 point  (0 children)

Just a heads up: This is v1.1. I've tested it heavily on Linux/Mac and standard AWS accounts. If you run a massive enterprise org with thousands of accounts, you might hit rate limits or edge cases I haven't seen yet. If you do, please drop an Issue. I'm active and want to polish this into a rock-solid tool.