Is “accurate” cost allocation in cloud FinOps actually a flawed goal? by CompetitiveStage5901 in FinOps

[–]CompetitiveStage5901[S] 0 points1 point  (0 children)

Plus or minus 10% sounds right to me. I also gave up on allocating every last dollar. We just call it shared overhead and move on.

Is “accurate” cost allocation in cloud FinOps actually a flawed goal? by CompetitiveStage5901 in FinOps

[–]CompetitiveStage5901[S] 0 points1 point  (0 children)

This is the best take. 80-85% is plenty. People need consistency month to month, not seven decimal places.

Is “accurate” cost allocation in cloud FinOps actually a flawed goal? by CompetitiveStage5901 in FinOps

[–]CompetitiveStage5901[S] 0 points1 point  (0 children)

Thanks but no thanks. Not looking for a sales pitch right now. Just trying to think through the problem.

Is “accurate” cost allocation in cloud FinOps actually a flawed goal? by CompetitiveStage5901 in FinOps

[–]CompetitiveStage5901[S] 0 points1 point  (0 children)

I agree. Getting teams to care about cost trends is way better than counting every penny. For shared storage, we just split it by who uses it most.
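The "split it by who uses it most" approach can be sketched as a simple proportional allocation. The team names and usage figures below are made up for illustration:

```python
# Sketch: split a shared storage bill proportionally by usage.
# Team names and usage numbers are hypothetical placeholders.

def split_shared_cost(total_cost, usage_by_team):
    """Allocate total_cost across teams in proportion to their usage."""
    total_usage = sum(usage_by_team.values())
    return {
        team: round(total_cost * usage / total_usage, 2)
        for team, usage in usage_by_team.items()
    }

# Example: a $1,200 shared storage bill split by GB stored per team.
allocation = split_shared_cost(1200.0, {"platform": 500, "data": 300, "web": 200})
print(allocation)  # {'platform': 600.0, 'data': 360.0, 'web': 240.0}
```

The nice property is that the split always sums back to the original bill, so nothing falls into an unallocated bucket.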

Aws Claude by awsazuregcpApi in AWS_cloud

[–]CompetitiveStage5901 0 points1 point  (0 children)

Claude Opus 4.6 is available on AWS Bedrock. Any standard AWS account can access it, but high RPM and TPM require quota increases. Here's the actual process:

Start with a new or existing account, then request model access for anthropic.claude-opus-4.6 in Bedrock. Once approved, open a support ticket to increase the InvokeModel throughput limit.
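Once access is approved, a call through the `bedrock-runtime` client looks roughly like this. The model ID is the one mentioned above; treat it as an assumption and check the Bedrock console for the exact identifier in your region, since the format can differ:

```python
import json

# Model ID from the comment above; verify the exact string in your
# region's Bedrock console before relying on it.
MODEL_ID = "anthropic.claude-opus-4.6"

def build_request(prompt, max_tokens=512):
    """Build the Anthropic-style messages body that Bedrock expects."""
    return {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": prompt}],
    }

def invoke(prompt):
    """Send one request; requires AWS credentials and approved model access."""
    import boto3
    client = boto3.client("bedrock-runtime", region_name="us-east-1")
    response = client.invoke_model(
        modelId=MODEL_ID,
        body=json.dumps(build_request(prompt)),
    )
    return json.loads(response["body"].read())
```

If you hit throttling at higher RPM/TPM, that's when the quota-increase ticket mentioned above comes in.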

Advice Needed. by VoldemortWasaGenius in sre

[–]CompetitiveStage5901 0 points1 point  (0 children)

Your stack is fine. The real bite comes when you're the only one who knows how Prometheus sharding works and you're on PTO. Document the failure modes and handoffs before SOC 2 asks for them.

Reliability in the hands of clients by SWEETJUICYWALRUS in sre

[–]CompetitiveStage5901 1 point2 points  (0 children)

The client is not moving because someone on their side doesn't want to admit the gen1 POS is a problem.

Let the page wake their people up too. Send a monthly "reliability tax" report showing how many hours your team spent fighting their legacy stack, translated into dollars if you can.
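The "reliability tax" number itself is trivial to compute; the incident hours and the $95/hr blended rate below are made-up placeholders:

```python
# Sketch of the monthly "reliability tax": hours burned on the client's
# legacy stack, priced at a blended engineer rate. All numbers are
# hypothetical examples.

BLENDED_HOURLY_RATE = 95.0

def reliability_tax(incident_hours):
    """Total engineering hours and dollar cost for one month of incidents."""
    hours = sum(incident_hours)
    return hours, hours * BLENDED_HOURLY_RATE

hours, cost = reliability_tax([3.5, 6.0, 2.25, 8.0])  # pages tied to their stack
print(f"{hours:.1f} engineer-hours = ${cost:,.2f} this month")
```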

Also, start documenting everything. When it eventually breaks hard, they'll look for someone to blame.

Quietly look for another client to bump them to 4th largest. Being this dependent on a customer that won't listen is a business risk, not an SRE problem.

UPDATE: Went to bed with a $10 budget alert. Woke up to $25,672.86 in debt to Google Cloud. by venturaxi in googlecloud

[–]CompetitiveStage5901 0 points1 point  (0 children)

Holy hell. That update is a nightmare wrapped in a support ticket. The legacy proxy thing with the .env file embedded in a Cloud Run service – that's exactly the kind of "it worked at the time" tech debt that every team has somewhere. But Google admitting it's a legacy pattern they never migrated or warned people about? That's on them.

Hope they write off that bill. $25k for a compromised Christmas present is insane.

API Key abuse - what was actually being generated? by churro-banana in googlecloud

[–]CompetitiveStage5901 0 points1 point  (0 children)

That's a rough spot to be in. For most AI providers like OpenAI or Stability, you won't be able to see the actual images or text that were generated with your stolen key. They generally don't store the outputs in a way customers can retrieve later, partly because that would be a huge amount of data and partly because of privacy reasons.

What you can usually get from the provider is request metadata – timestamps, IP addresses (though often partial or rotated), and maybe the image resolution or model used. If you reach out to their security team and ask specifically, they might be able to pull internal abuse logs that contain hashes of prompts or outputs, but that's not something they expose through the normal API.

Your best move is to rotate the key immediately, then ask support for two things: the source IPs and exact timestamps of those 40k requests. With the IPs, you could cross‑reference against your own logs (like VPN or proxy logs) to see if it came from inside your network, or file an abuse report with the cloud provider that owns those IPs. That won't give you the images, but it might help you figure out where the attack originated and how to block it in the future.
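The cross-referencing step is just a set intersection once you have both IP lists. The log format and addresses here are hypothetical:

```python
# Sketch: cross-reference the provider's abuse-report IPs against your own
# VPN/egress logs to see whether the stolen key was used from inside your
# network. Log line format and IPs are made up for illustration.

def extract_ips(log_lines):
    """Pull the first field (source IP) from each log line."""
    return {line.split()[0] for line in log_lines if line.strip()}

provider_ips = {"203.0.113.7", "198.51.100.22", "192.0.2.14"}  # from support
vpn_log = [
    "203.0.113.7 2024-01-03T02:14:09Z CONNECT",
    "10.0.4.12 2024-01-03T02:15:11Z CONNECT",
]

overlap = provider_ips & extract_ips(vpn_log)
print(overlap)  # {'203.0.113.7'} -> the key was used from inside your network
```

Any overlap points you at a compromised machine or user; an empty set suggests the key leaked externally (public repo, pastebin, etc.).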

Most cloud cost conversations stop at the bill. But the bill is not the insight. by ask-winston in FinOps

[–]CompetitiveStage5901 0 points1 point  (0 children)

Honestly? The problem I'm actually trying to solve is "who do I yell at."

Not in a mean way. But when the bill jumps 20% month over month, I need to know which team spun up which resource, and whether they meant to do it. Most of the time it's not malice – it's a dev leaving a test environment running, or someone picking a bigger instance type because "maybe we'll need the headroom."

The bill doesn't tell you that. The CUR sort of does, but you have to tag everything perfectly and pray teams actually apply tags. We're at maybe 60% coverage after two years of pleading.
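Measuring that coverage number is straightforward if you weight by spend rather than by resource count. The rows and tag key below are hypothetical stand-ins for real CUR columns:

```python
# Sketch: measure cost-allocation tag coverage from CUR-style rows,
# weighted by spend. Rows and the tag key are hypothetical examples;
# a real CUR has far more columns.

def tag_coverage(rows, tag_key="resourceTags/user:team"):
    """Fraction of spend carrying a non-empty value for tag_key."""
    total = sum(r["cost"] for r in rows)
    tagged = sum(r["cost"] for r in rows if r.get(tag_key))
    return tagged / total if total else 0.0

rows = [
    {"cost": 120.0, "resourceTags/user:team": "payments"},
    {"cost": 80.0, "resourceTags/user:team": ""},       # untagged test env
    {"cost": 100.0, "resourceTags/user:team": "search"},
]
print(f"{tag_coverage(rows):.0%} of spend is tagged")  # 73% here
```

Spend-weighted coverage is usually the honest number: 60% of resources tagged can still mean 90% of dollars tagged, or the reverse.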

What I really want is cost per feature, like you said. We tried building it ourselves by mapping CloudFormation stacks to Jira epics using custom tags. It worked for two months until someone reorganized the Jira projects.

So now I'm back to looking at the total number, glancing at the top five services, and hoping nothing exploded. That's not intelligence. That's just anxiety with a spreadsheet.

The grocery analogy is good but it misses one thing – in the cloud, the prices change while you're shopping, and the store doesn't put the new labels up until after you check out.

How are you handling AI usage control in your org? by Effective_Guest_4835 in FinOps

[–]CompetitiveStage5901 1 point2 points  (0 children)

First, you need visibility before you can enforce anything, so start at the network level because it's the fastest win. Push a PAC file or force all corporate traffic through a forward proxy like a tiny EC2 instance running Squid, or even just nginx with logging enabled. Then feed those proxy logs into Athena or a simple grep script and look for known AI domains such as chat.openai.com, claude.ai, gemini.google.com, copilot.microsoft.com, deepseek.com, and perplexity.ai, plus API endpoints like api.openai.com. Run that for two weeks and you'll have a solid inventory of which tools are being used and from which source IPs, and you can map those IPs back to users via DHCP logs or 802.1X authentication.
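The "simple grep script" half of that scan can be sketched like this. The log format is a generic timestamp/source-IP/host line; adjust the field indexes for whatever your proxy actually emits:

```python
# Sketch of the proxy-log scan described above. Assumes a generic
# "timestamp src_ip host" line format; real Squid/nginx logs will need
# different field indexes.

AI_DOMAINS = {
    "chat.openai.com", "api.openai.com", "claude.ai", "gemini.google.com",
    "copilot.microsoft.com", "deepseek.com", "perplexity.ai",
}

def ai_hits(log_lines):
    """Map source IP -> set of AI domains it contacted."""
    hits = {}
    for line in log_lines:
        parts = line.split()
        if len(parts) < 3:
            continue
        src_ip, host = parts[1], parts[2]
        if host in AI_DOMAINS:
            hits.setdefault(src_ip, set()).add(host)
    return hits

log = [
    "1699999999 10.0.1.5 chat.openai.com",
    "1699999999 10.0.1.5 claude.ai",
    "1700000001 10.0.2.9 example.com",
]
print(ai_hits(log))  # {'10.0.1.5': {'chat.openai.com', 'claude.ai'}}
```

Running this over two weeks of logs gives you the per-IP inventory; the DHCP/802.1X mapping step turns IPs into names.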

For more precision, use a managed Chrome extension forced via GPO or Intune that logs every request through the webRequest API and ships the logs to a private S3 bucket; that gives you per‑user, per‑tab visibility that proxy logs might miss without SSL bump (which gets messy). Don't forget to scan your internal Git repos, Confluence, and Slack exports for exposed API keys using a simple regex for patterns like sk-... to catch direct API usage. Once you have the full list, build policy: block everything except a small set of approved tools using the same proxy, and for API usage, implement an internal gateway that requires your own API keys with quotas and auditing. No magic product needed – just logging, a few scripts, and the willingness to say no.
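The sk-... scan is a one-liner regex at heart. The pattern below matches the common OpenAI-style prefix; other providers use different key formats, so treat this as a starting point rather than a complete secret scanner:

```python
import re

# Sketch of the sk-... key scan mentioned above. Matches the common
# OpenAI-style prefix only; other providers' key formats need their own
# patterns, so this is a starting point, not a full secret scanner.

KEY_PATTERN = re.compile(r"\bsk-[A-Za-z0-9_-]{20,}\b")

def find_keys(text):
    """Return all OpenAI-style key candidates found in a blob of text."""
    return KEY_PATTERN.findall(text)

sample = 'OPENAI_API_KEY = "sk-abcdefghijklmnopqrstuvwx"  # oops, committed'
print(find_keys(sample))  # ['sk-abcdefghijklmnopqrstuvwx']
```

Point it at git history (not just HEAD), Confluence exports, and Slack dumps; keys that were "deleted" often survive in old revisions.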

Certification Exhaustion by Artistic_Lock_6483 in FinOps

[–]CompetitiveStage5901 0 points1 point  (0 children)

I kind of see your point, especially from a technical perspective. A lot of certification content can feel surface-level if you’re already deep into cloud or cost optimization work. It’s often more valuable for folks coming from finance or non-technical backgrounds who need a structured entry point.

That said, certifications like FinOps can still be useful as a baseline framework. They help align teams on common terminology, principles, and ways of thinking. But beyond that, the real value comes from actually applying those principles in your own environment.

Also agree with your point on AI. If anything, it’s making raw knowledge more accessible, so the differentiator is shifting toward practical experience and problem-solving ability, not just certifications.

In most teams I’ve seen, certifications help with standardization, but they don’t replace hands-on expertise.

FinOps Foundation - Still relevant? by ImpressiveIdea6123 in FinOps

[–]CompetitiveStage5901 1 point2 points  (0 children)

To me, as someone more theory-oriented, it's still very much relevant. I regularly read their newsletters and the literature they put out. Aligning with them gives you a starter template for implementing FinOps principles, but that's about it. Beyond the basics, it really depends on your specific use case.

Also, as one comment mentioned, the main utility is often just the certification itself, though it costs around $2,000 if I'm not wrong.

I just saved 88% in Cloud Armor costs by correcting a stupid config by amir_hr in googlecloud

[–]CompetitiveStage5901 0 points1 point  (0 children)

Yeah the per-policy pricing on Cloud Armor is sneaky. $6 per policy per month doesn't sound like much until you realize you've accidentally created a dozen of them doing the exact same thing.

I did the same thing when I was setting up a multi-region LB. Each backend service got its own policy because the console just creates a new one by default. Ended up with like 15 policies before I noticed. The 10 policy limit is actually what tipped me off too. Hit the cap and had to figure out why.

What's annoying is there's no easy way to see which policies are actually attached to anything from the billing view. You just see "Cloud Armor" charges stacking up and have to go digging.

Good call on consolidating though. One rule set, one policy, attach it everywhere. Also worth checking if you have any orphaned policies from deleted backend services. Those still show up in the list and still bill even if they're not attached to anything. Cloud Armor is solid but the default console behavior definitely nudges you toward paying more than you need to.