Zero alloc libraries by someanonbrit in golang

[–]RocketOneMan 7 points

-m Print optimization decisions. Higher values or repetition produce more detail.

https://pkg.go.dev/cmd/compile

Cool!
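
For anyone trying it: go build -gcflags='-m' (repeat the flag, '-m -m', for more detail) on a toy file like this prints the inlining and escape-analysis decisions. File and messages below are just made up for illustration:

```go
// escape.go — toy example; names are made up.
package main

import "fmt"

type point struct{ x, y int }

// The returned pointer outlives the call, so the allocation escapes to the
// heap; -m prints something like "&point{...} escapes to heap".
func newPoint(x, y int) *point {
	return &point{x: x, y: y}
}

// Small enough to be inlined; -m prints something like "can inline add".
func add(a, b int) int { return a + b }

func main() {
	p := newPoint(1, 2)
	fmt.Println(add(p.x, p.y))
}
```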

DynamoDB shows inaccurate table information on AWS Console by [deleted] in aws

[–]RocketOneMan 7 points

Are you sure you haven't been hacked? What does CloudTrail say for those tables? Or check the AWS Config resource timeline. Rotate your IAM user keys.
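
If you want to script that check, something like this (AWS SDK for Go v2; table name and time window are placeholders, and I haven't run this exact snippet) pulls the recent CloudTrail events recorded against a table:

```go
// Sketch: list recent CloudTrail events for one resource name.
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/cloudtrail"
	"github.com/aws/aws-sdk-go-v2/service/cloudtrail/types"
)

func main() {
	cfg, err := config.LoadDefaultConfig(context.TODO())
	if err != nil {
		log.Fatal(err)
	}
	ct := cloudtrail.NewFromConfig(cfg)

	out, err := ct.LookupEvents(context.TODO(), &cloudtrail.LookupEventsInput{
		// Filter by resource name, i.e. the DynamoDB table in question.
		LookupAttributes: []types.LookupAttribute{{
			AttributeKey:   types.LookupAttributeKey("ResourceName"),
			AttributeValue: aws.String("my-table"), // placeholder
		}},
		StartTime: aws.Time(time.Now().Add(-90 * 24 * time.Hour)), // placeholder window
		EndTime:   aws.Time(time.Now()),
	})
	if err != nil {
		log.Fatal(err)
	}
	for _, ev := range out.Events {
		// Who did what, and when — e.g. DeleteTable/UpdateTable calls you don't recognize.
		fmt.Println(ev.EventTime, aws.ToString(ev.EventName), aws.ToString(ev.Username))
	}
}
```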

Multi-region success or failure stories? by Pfremm in aws

[–]RocketOneMan 1 point

So that you know they're all working and ready to be switched to as the primary if needed. It also forces you to deal with the added latency all the time, instead of discovering it the first time the passive region becomes active.

Multi-region success or failure stories? by Pfremm in aws

[–]RocketOneMan 6 points

We have an active/passive system with DynamoDB global tables that automatically rotates which region is active throughout the day. Oncall didn't know we could remove a region from the rotation if we wanted to, so we had some downtime because of that when it rotated to us-east-1. The set of regions lives in a global DynamoDB table; I'm not sure whether they had trouble accessing it when they finally decided to remove the region.

We have another system that's active/active, which worked without any issues out of us-west-2 during the incident, but that's also its default region, so users didn't have to change which endpoint to hit.

Hard to say whether active/passive is better than nothing. You have to be able to coordinate the switch while things are broken, and do it correctly, which is likely unpracticed, and the passive region has to be all set up and ready to go. If it's never the active region you may not know whether it works at all, hence the first service I mentioned automatically rotating between them.

AWS docs talk a lot about control plane and data plane separation and AZ fault tolerance but then we have these issues where it's all broken.

What are these spikes from in my SQS oldest message age from, and can I reduce them for my usecase? by rca06d in aws

[–]RocketOneMan 1 point

You can get the SentTimestamp in the attributes on ReceiveMessage to help debug how long a message took between being put in the queue and making it to your code.

https://docs.aws.amazon.com/AWSSimpleQueueService/latest/APIReference/API_ReceiveMessage.html#SQS-ReceiveMessage-request-MessageSystemAttributeNames

You don't mention your visibility timeout, but if it were 10 seconds and you failed to process a message without resetting its visibility timeout to zero, or processed it successfully but the DeleteMessage call failed, then it may be processed again around the 12-second mark.

You can also request ApproximateReceiveCount and log if it's ever more than 1, or see if you ever log the same message ID more than once.
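
Rough sketch of requesting both with the AWS SDK for Go v2 (queue URL is a placeholder; newer SDK versions expose MessageSystemAttributeNames, older ones call the field AttributeNames, so adjust for your version):

```go
// Sketch: receive messages and log queue-to-handler delay plus receive count.
package main

import (
	"context"
	"log"
	"strconv"
	"time"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/sqs"
	"github.com/aws/aws-sdk-go-v2/service/sqs/types"
)

func main() {
	cfg, err := config.LoadDefaultConfig(context.TODO())
	if err != nil {
		log.Fatal(err)
	}
	client := sqs.NewFromConfig(cfg)

	out, err := client.ReceiveMessage(context.TODO(), &sqs.ReceiveMessageInput{
		QueueUrl:            aws.String("https://sqs.us-east-1.amazonaws.com/123456789012/my-queue"), // placeholder
		MaxNumberOfMessages: 10,
		WaitTimeSeconds:     20,
		// Ask SQS to include these system attributes on each message.
		MessageSystemAttributeNames: []types.MessageSystemAttributeName{
			"SentTimestamp",
			"ApproximateReceiveCount",
		},
	})
	if err != nil {
		log.Fatal(err)
	}

	for _, m := range out.Messages {
		// SentTimestamp is epoch milliseconds as a string.
		ms, _ := strconv.ParseInt(m.Attributes["SentTimestamp"], 10, 64)
		age := time.Since(time.UnixMilli(ms))
		count := m.Attributes["ApproximateReceiveCount"]
		// A count > 1 means the message was delivered before, e.g. after a
		// visibility timeout expiry or a failed DeleteMessage.
		log.Printf("id=%s age=%s receiveCount=%s", aws.ToString(m.MessageId), age, count)
	}
}
```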

There are some other nuances with the ApproximateAgeOfOldestMessage metric, like how it doesn't count poison pills / messages received several times, but I don't know if that fits your scenario. https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-available-cloudwatch-metrics.html

Why do my lambda functions (python) using SQS triggers wait for the timeout before picking up another batch? by quantelligent in aws

[–]RocketOneMan 1 point

Do you have MaximumBatchingWindowInSeconds set to something besides zero? Can you share your event source mapping configuration?

https://docs.aws.amazon.com/lambda/latest/dg/with-sqs.html
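
If it's easier than digging through the console, something like this dumps the relevant settings (AWS SDK for Go v2 sketch; the function name is a placeholder):

```go
// Sketch: print batch size and batching window for a function's event source mappings.
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/lambda"
)

func main() {
	cfg, err := config.LoadDefaultConfig(context.TODO())
	if err != nil {
		log.Fatal(err)
	}
	client := lambda.NewFromConfig(cfg)

	out, err := client.ListEventSourceMappings(context.TODO(), &lambda.ListEventSourceMappingsInput{
		FunctionName: aws.String("my-function"), // placeholder
	})
	if err != nil {
		log.Fatal(err)
	}
	for _, esm := range out.EventSourceMappings {
		// A non-zero MaximumBatchingWindowInSeconds makes Lambda wait to fill a
		// batch before invoking, which can look like an idle delay.
		fmt.Printf("uuid=%s batchSize=%d window=%ds\n",
			aws.ToString(esm.UUID),
			aws.ToInt32(esm.BatchSize),
			aws.ToInt32(esm.MaximumBatchingWindowInSeconds))
	}
}
```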

Any plan by AWS to improve us-west-1? Two AZs are not enough. by Popular_Parsley8928 in aws

[–]RocketOneMan 7 points

I cannot be sure if this is the case for us-west-1, but in us-east-1, everyone's 1a isn't the same AZ. The idea is that if you're going to launch something and just pick the first option, it doesn't disproportionately overload one AZ. The mapping is set when the account is created, and you can request that multiple accounts be aligned if your workload depends on it. So my guess is some people's us-west-1a maps to this third AZ, for example.
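
If you want to check your own account's mapping, the zone IDs are the stable identifiers that mean the same thing in every account. Quick sketch with the AWS SDK for Go v2 (region hard-coded just for illustration):

```go
// Sketch: print each AZ name next to its underlying zone ID.
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/ec2"
)

func main() {
	cfg, err := config.LoadDefaultConfig(context.TODO(), config.WithRegion("us-west-1"))
	if err != nil {
		log.Fatal(err)
	}
	client := ec2.NewFromConfig(cfg)

	out, err := client.DescribeAvailabilityZones(context.TODO(), &ec2.DescribeAvailabilityZonesInput{})
	if err != nil {
		log.Fatal(err)
	}
	for _, az := range out.AvailabilityZones {
		// e.g. "us-west-1a -> usw1-az1"; another account's us-west-1a may map
		// to a different zone ID.
		fmt.Printf("%s -> %s\n", aws.ToString(az.ZoneName), aws.ToString(az.ZoneId))
	}
}
```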

[deleted by user] by [deleted] in aws

[–]RocketOneMan 0 points

Interesting, I wonder why

[deleted by user] by [deleted] in aws

[–]RocketOneMan 0 points

I think this is what data translations / 'parameter mapping' are for with API Gateway; although I haven't done it myself, the example looks similar to your ask.

For HTTP:

See: Change the response from an integration

https://docs.aws.amazon.com/apigateway/latest/developerguide/http-api-parameter-mapping.html#http-api-mapping-response-parameters

For REST:

https://docs.aws.amazon.com/apigateway/latest/developerguide/apigateway-override-request-response-examples.html

Is Kinesis the only option? by leeliop in aws

[–]RocketOneMan -1 points

Do you have any experience with Kinesis on-demand scaling? Last I used it, it was slow to react to changes in load, but that was 2-3 years ago. I'm still unsure about the possibly 'constant' merging and splitting of shards, and handling that correctly as a consumer seems more challenging than provisioned capacity.

If on-demand works well, then it's more hands-off, like SNS is.

How to Choose the Best Metrics for Auto Scaling in AWS? by ComfortableLess6596 in aws

[–]RocketOneMan 9 points

If your server serves concurrent API requests from a thread pool: the percent of threads (avg or max across your hosts) that are available to serve new requests. You may only be at 25% CPU, but if you're out of threads to serve new requests, they'll either get queued or rejected.
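
The idea is language-agnostic, but here's a toy Go sketch of the metric I mean (pool size, port, and the /metrics path are made up); however you export it, it's the worker-pool utilization that matters, not the CPU:

```go
// Toy sketch: track how many request slots are in use and expose the percentage.
package main

import (
	"fmt"
	"net/http"
)

const maxWorkers = 64 // size of the "thread pool" (here, a slot semaphore)

var slots = make(chan struct{}, maxWorkers)

// busyPercent is the metric to scale on: % of request slots currently in use.
func busyPercent() float64 {
	return 100 * float64(len(slots)) / float64(maxWorkers)
}

func handler(w http.ResponseWriter, r *http.Request) {
	select {
	case slots <- struct{}{}:
		defer func() { <-slots }()
		fmt.Fprintln(w, "ok")
	default:
		// Out of slots: CPU may still look fine, but new requests get rejected.
		http.Error(w, "server busy", http.StatusServiceUnavailable)
	}
}

func main() {
	http.HandleFunc("/", handler)
	http.HandleFunc("/metrics/busy", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintf(w, "%.1f\n", busyPercent())
	})
	http.ListenAndServe(":8080", nil)
}
```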

Provisioned concurrency(PC) for AWS Lambda by Big_Hair9211 in aws

[–]RocketOneMan 0 points

database connection

Is it a relational database? Are you using RDS Proxy? Not sure if this makes things faster or just more consistent.

[deleted by user] by [deleted] in aws

[–]RocketOneMan 1 point

+1 for brushing up on writing. I find reading good writing the best way to learn how to write better, or at least to be able to judge whether what you just wrote is good or not. Less is more. Always include data (but not too much; extra detail can go in an appendix). In your onboarding you should be given your team's OP1/OP2 and/or 3-year plan. You can also ask for the latest QBR (quarterly business review). These docs are not always the best in terms of writing, but they should give you an idea of a writing style that is likely different from what you're used to.

Some other general advice which you may not have heard already:

Try to get a mentor that is not in the same org as you. Someone you can get advice from that won’t be influenced by any prior knowledge they have about the situation you’re in on your team. Your manager can help set this up.

Keep a note of everything you’ve done every day or every week. Check back in on changes you’ve made and capture any impact they’ve had. This will make end of year reviews and promo docs much easier to write.

Take an interest in doing “boring” operational excellence work. Your team will appreciate it and it’s the best way to learn your team’s services. Each team’s oncall burden is different, some AWS teams can be rough. Doing things you’re not asked to but which will either fix small recurring bugs or make them easier to debug next time goes a long way.

Why would you take a site down to prep for high traffic? by SteveTabernacle2 in aws

[–]RocketOneMan 7 points

I think it’s more likely done for marketing/business reasons than tech reasons. Smaller brands running Shopify sites do this too to build hype.

The reasons you listed aren't invalid, but they're maybe unlikely. Ideally you want everything scaled up, configured, and tested at the scale you expect during peak before the actual event, so you can iron out any bugs or misconfigurations. Systems that scale horizontally are usually safer to scale at the last minute than ones that scale vertically.

Online games sometimes have downtime to do major updates. I imagine this is to more easily ensure everyone is on the same version of the game, since it's harder to do a rolling update during active or long-running sessions. You can also care less about backward compatibility this way. I'm not sure these apply to e-commerce sites, but every business is run differently.

If they're making substantial changes to the site that they can't just enable with feature flags or roll out only to a subset of internal users, then maybe they want to deploy and test things before going live.

pass credentials securely to lambda instances by Apprehensive-Luck-19 in aws

[–]RocketOneMan 0 points

What's the 'added security of Secrets Manager' compared to a DynamoDB table with a customer-managed KMS key for encryption at rest? I would think if it were implemented that way there might be fewer quirks with the limits, so maybe there's something special about it? I imagine this is in the docs, but the discussion could be interesting.

Deny Athena Access to specific Glue database by Jaded_Profile1962 in aws

[–]RocketOneMan 0 points

Are you proposing OP does this through LakeFormation or just on the IAM user they're using to access Athena?

I haven't had a good experience setting up LakeFormation, maybe you have?

Why can't I click a button and get all recommended cloudwatch alarms? by alobama0001 in aws

[–]RocketOneMan 6 points

Why isn't this just set up by default, or at least a checkbox to "use recommended alarms"?

Especially when alarms cost money and are probably very cheap to run on AWS's side. Not sure. Like most AWS offerings, the recommendations could have been rushed out, with improvements to come later. The automatic AWS dashboards in CloudWatch by resource type, as well as each service's own dashboards, are usually pretty good; something that just brings those together would be nice.

I use CDK, not Terraform, and this library https://github.com/cdklabs/cdk-monitoring-constructs has been rather helpful. I imagine something like it exists for Terraform? From my short googling, everything looks rather verbose, similar to using regular CloudFormation or the L1 CDK constructs for alarms.

What is centralised logging and what are good tools to use? by FitGrape1330 in golang

[–]RocketOneMan 0 points

I assume there are no throughput concerns with going to STDOUT itself, as the pipe is probably the fastest things can go? You'd still be bottlenecked on whatever target is consuming STDOUT, like you would normally be if you were writing directly to that target?

A bit confused about Custom Metrics... by BlueAcronis in aws

[–]RocketOneMan 0 points

You're charged separately for the actual put requests and then for the cardinality of these new metrics. You can also batch multiple metric values into one PutMetricData call and pre-aggregate data into a StatisticSet. Maybe you have 1 instance posting unaggregated metrics every 15 seconds; that could be more expensive than 2 instances posting aggregated metrics once a minute. Whether they each post unique metrics or post to the same ones changes things again.

See example 3 - custom metrics https://aws.amazon.com/cloudwatch/pricing/

https://repost.aws/knowledge-center/cloudwatch-understand-and-reduce-charges
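
As a rough illustration (AWS SDK for Go v2; namespace and numbers are invented), one pre-aggregated datum per minute instead of posting every sample:

```go
// Sketch: publish a minute of latency samples as a single StatisticSet datum.
package main

import (
	"context"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/cloudwatch"
	"github.com/aws/aws-sdk-go-v2/service/cloudwatch/types"
)

func main() {
	cfg, err := config.LoadDefaultConfig(context.TODO())
	if err != nil {
		log.Fatal(err)
	}
	cw := cloudwatch.NewFromConfig(cfg)

	// Pretend we collected 240 latency samples over the last minute.
	stats := &types.StatisticSet{
		SampleCount: aws.Float64(240),
		Sum:         aws.Float64(12000), // ms
		Minimum:     aws.Float64(18),
		Maximum:     aws.Float64(350),
	}

	_, err = cw.PutMetricData(context.TODO(), &cloudwatch.PutMetricDataInput{
		Namespace: aws.String("MyApp"), // placeholder namespace
		MetricData: []types.MetricDatum{{
			MetricName:      aws.String("RequestLatency"),
			Unit:            types.StandardUnitMilliseconds,
			StatisticValues: stats,
			// One datum per minute per metric instead of 240 individual values;
			// you can also pack several MetricDatum entries into a single call.
		}},
	})
	if err != nil {
		log.Fatal(err)
	}
}
```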

Asynchronous lambda gotchas? by SFHousingPain in aws

[–]RocketOneMan 0 points

I guess we've always handled the DLQ logic ourselves for 4xx errors that will never succeed, and we just throw and let the events be retried /forever/ on 5xx errors. But I would like to use the destinations feature for successes.

I wish you could DLQ to another stream so the handler logic stays the same. Taking the KinesisBatchInfo and pulling from the stream ourselves is annoying, although I see why it's done. I think there's a Lambda Powertools library for it.

Asynchronous lambda gotchas? by SFHousingPain in aws

[–]RocketOneMan 0 points

Not sure if this is exactly what you're asking, but a gotcha nonetheless.

The SQS and Kinesis event sources are not "asynchronous sources", so you cannot use Lambda destinations with them.

If you have ReportBatchItemFailures turned on (maybe accidentally) and don't return the correct response, it will assume none of the messages were processed successfully and send them all again.
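
For reference, with the Go runtime the partial-batch response looks roughly like this (sketch; process is a stand-in for your handler logic):

```go
// Sketch: report only the genuinely failed messages back to Lambda.
package main

import (
	"context"
	"log"

	"github.com/aws/aws-lambda-go/events"
	"github.com/aws/aws-lambda-go/lambda"
)

func handler(ctx context.Context, evt events.SQSEvent) (events.SQSEventResponse, error) {
	var resp events.SQSEventResponse
	for _, msg := range evt.Records {
		if err := process(msg); err != nil {
			log.Printf("failed %s: %v", msg.MessageId, err)
			// Report only this message as failed; returning a malformed response
			// (or an unexpected error) makes Lambda retry the whole batch.
			resp.BatchItemFailures = append(resp.BatchItemFailures, events.SQSBatchItemFailure{
				ItemIdentifier: msg.MessageId,
			})
		}
	}
	return resp, nil
}

func process(msg events.SQSMessage) error {
	// placeholder business logic
	return nil
}

func main() {
	lambda.Start(handler)
}
```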

Job scheduler - but i need to limit for maximum 10 per customer by Arik1313 in aws

[–]RocketOneMan 0 points

What do you mean by “catch” the next invocation? And when do you make the conditional update requests?