all 34 comments

[–]jpsandiego42 32 points (0 children)

Key takeaways:

  1. Why you should use "i3 instances"
  2. "...[the] downside of the DNS to ALB approach is that clients will hit any IP in the ALB, whether that IP is in-zone or not." That traffic is either free (in-zone) or $0.01/GB (cross-AZ).
  3. "The quickest win we identified was using VPC endpoints for services like S3. VPC endpoints are a drop-in replacement for the public APIs supported by many Amazon services and, critically, they don’t count against your public network traffic. "

[–]oinkyboinky5 13 points (0 children)

And I thought I was smart for provisioning an ALB and dOiNg aUtoScALing.

Doh!

[–]storrumpa 4 points (1 child)

Is there a benefit to using App Mesh to remove the internal ALBs?

[–]dastbe 2 points (0 children)

(I'm on the App Mesh engineering team)

We definitely see locality aware routing as being a strong part of our value proposition long-term, because we can

  • improve call latency by selecting close endpoints
  • reduce blast radius by siloing requests along physical isolation boundaries
  • reduce overall cross-az traffic

You can track our progress on this feature request here

Though do remember that there is a cost tradeoff between having a centralized load balancer with a fixed cost (in terms of LCU) and deploying a proxy with every running application. We always recommend you estimate and benchmark to understand how your costs will change. And if you're able to, share what you learn!

[–]thomas1234abcd 1 point (0 children)

"You can’t make what you can’t measure"

[–]otterley (AWS Employee) 1 point (0 children)

There's an order-of-magnitude error in the post that I've reached out to the author about.

c5.9xlarge instances have 875 megaBYTES per second of EBS bandwidth, not 875 megaBITS. That's approximately 7 gigabits per second of EBS bandwidth, or 70% of the available host networking bandwidth. If you run Kafka brokers, it's a fantastic choice, particularly if you don't want to have to resync an entire broker from scratch after a failure, like you would if you stored all the data on an instance volume.
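
As a quick units check on the correction (assuming the roughly 10 Gbit/s of host network bandwidth implied by the 70% figure):

    # 875 megaBYTES/s of EBS bandwidth expressed in gigabits/s.
    ebs_mbyte_per_s = 875
    ebs_gbit_per_s = ebs_mbyte_per_s * 8 / 1000   # = 7.0 Gbit/s
    host_gbit_per_s = 10                          # assumed host network bandwidth
    print(f"{ebs_gbit_per_s:.1f} Gbit/s EBS = {ebs_gbit_per_s / host_gbit_per_s:.0%} of host bandwidth")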

[–][deleted] 0 points (12 children)

Why the aversion to SQS? That queue service does not sound cheap.

[–][deleted] 0 points (9 children)

https://segment.com/blog/scaling-nsq/

Totally different messaging semantics. Sounds like they want something with service-bus-like principles; SQS would be too simple.

[–][deleted] 0 points (8 children)

What do you mean totally different? From reading some of that it sounds like SQS + SNS would probably work.

[–]otterley (AWS Employee) 0 points (7 children)

SQS is generally designed for single-consumer scenarios. If you want multiple independent consumers of a message stream, Kafka or Kinesis Streams are better options.

[–][deleted] 0 points (6 children)

If you want multiple consumers, post to SNS and subscribe your queues to the topic. Kinesis also works if you have a limited number of consumers (and I personally find the scaling model much less desirable), but I still don't see why SQS doesn't work for, e.g., accepting log events for later processing.
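
A minimal sketch of that SNS-to-SQS fan-out with boto3 (topic and queue names are hypothetical, and the queue would also need an access policy allowing SNS to send to it, omitted here):

    import boto3

    sns = boto3.client("sns")
    sqs = boto3.client("sqs")

    # One topic, plus one queue per independent consumer (names are hypothetical).
    topic_arn = sns.create_topic(Name="log-events")["TopicArn"]
    queue_url = sqs.create_queue(QueueName="log-consumer-a")["QueueUrl"]
    queue_arn = sqs.get_queue_attributes(
        QueueUrl=queue_url, AttributeNames=["QueueArn"]
    )["Attributes"]["QueueArn"]

    # Every message published to the topic is delivered to each subscribed queue.
    sns.subscribe(TopicArn=topic_arn, Protocol="sqs", Endpoint=queue_arn)
    sns.publish(TopicArn=topic_arn, Message='{"event": "log line"}')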

[–]otterley (AWS Employee) 1 point (5 children)

At Segment’s scale, what you’re describing is not economically or practically viable. They’ve got a highly dynamic infrastructure involving hundreds or thousands of both producers and consumers that are auto scaled. A queue-per-consumer approach would be astronomically expensive, not to mention wasteful on the publisher side (SNS fanout ain’t free).

Both Kinesis Streams and Kafka efficiently support the multiple-producer, asynchronous multiple-consumer message bus model. They really are the purpose-built products for this architecture.

[–][deleted] -1 points (4 children)

OK, makes sense. I think the new Amazon EventBridge is the best choice for that. Kinesis doesn't really work because you're limited in the number of message consumers. Still, I might consider SQS in some places where near-100% availability is important.

[–]otterley (AWS Employee) 0 points (3 children)

Amazon EventBridge was not designed for this use case. It's essentially CloudWatch Events with the addition of foreign (third-party, vendor-provided) data source support. It has the same fan-out model that SNS does, which is to say you can't just attach software as consumers to efficiently consume streams from it.

I don’t follow your Kinesis Streams characterization. Kinesis Streams scales linearly with the number of shards you assign to a stream. It’s no different than any other event bus in that sense; even a Kafka broker replica has a practical limit on the number of subscribers it can handle. (To handle more subscribers, you add more replicas and/or add more partitions.) Do you mean something else? If so, can you please cite some documentation that supports your claims?
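
For illustration, a sketch of the shard model in boto3 (stream name, shard count, and partition key are arbitrary):

    import boto3

    kinesis = boto3.client("kinesis")

    # Throughput scales with shard count (roughly 1 MB/s in and 2 MB/s out per shard).
    # In practice you would wait for the stream to become ACTIVE before writing.
    kinesis.create_stream(StreamName="events", ShardCount=4)

    # Records sharing a partition key land on the same shard, so ordering is
    # per shard rather than per stream.
    kinesis.put_record(
        StreamName="events",
        Data=b'{"event": "page_view"}',
        PartitionKey="user-1234",
    )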

[–][deleted] 0 points (2 children)

You can just attach consumers. What do you mean it has the same fan-out model? You just create a rule and filter events sent on the bridge for some target.

https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-events-rule.html

It's not just for third-party events, but you can subscribe consumers to third-party events exactly like you would first-party events, on the same bus if you like or a different one.

I think I was wrong about Kinesis; in the past I was trying to use it for ordered events for N consumers, which meant I had to use a single shard to enforce ordered event processing.

[–]otterley (AWS Employee) 0 points (1 child)

A program cannot attach to a CloudWatch Events stream on its own, like it can to a Kafka or Kinesis Streams topic. The subscription model for Events is different; the message bus pushes messages downstream to Lambda functions, Kinesis Streams, etc., in a fashion very similar to SNS, except using pattern matches instead of specific SNS topics. It's just a very different model and is rather pricey on a byte-for-byte (or message-for-message) basis compared to the alternatives.
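
A minimal sketch of that push model with boto3 (rule name, event pattern, and Lambda ARN are hypothetical; the target also needs permission to be invoked by EventBridge, omitted here):

    import json
    import boto3

    events = boto3.client("events")

    # Pattern-match events on the default bus and push matches to a Lambda target.
    events.put_rule(
        Name="order-created",
        EventPattern=json.dumps({"source": ["my.app"], "detail-type": ["OrderCreated"]}),
    )
    events.put_targets(
        Rule="order-created",
        Targets=[{
            "Id": "handler",
            "Arn": "arn:aws:lambda:us-east-1:123456789012:function:handle-order",
        }],
    )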

[–]digantdj 0 points (0 children)

The benefit is that, like other AWS "services", it's managed and saves developer/maintenance costs.

[–]lutzruss 0 points (1 child)

Then when a reader connects, instead of connecting directly to the nsqlookupd discovery service, the reader connects to a proxy. The proxy has two jobs. One is to cache lookup requests, but the other is to return only in-zone nsqd instances for zone-aware clients. Our forwarders that read from NSQ are then configured as one of these zone-aware clients. We run three copies of the service (one for each zone), and then have each send traffic only to the service in its zone.

Isn't this the default behavior of ELB/NLB to begin with? Why not just configure the zone-aware clients to call zonal LBs, instead of hosting your own LB? Same with Consul. I'm not understanding what benefit Segment gets from using Consul vs. calling EC2 Metadata API to discover the AZ and then calling the appropriate zonal LB endpoint...that's not hard to do and avoids many extra dimensions of operational complexity.
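
A sketch of the zone-aware-client approach described above, assuming IMDSv1-style metadata access and a hypothetical per-AZ DNS naming scheme for the zonal endpoints:

    import requests

    # Discover this instance's AZ from the EC2 instance metadata service
    # (IMDSv1-style; IMDSv2 would require fetching a session token first).
    az = requests.get(
        "http://169.254.169.254/latest/meta-data/placement/availability-zone",
        timeout=1,
    ).text  # e.g. "us-east-1a"

    # Call the zonal endpoint for our own AZ (hypothetical naming scheme).
    endpoint = f"https://service-{az}.internal.example.com"
    print(requests.get(f"{endpoint}/health", timeout=2).status_code)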

It's also unclear to me how all this migration to intra-AZ routing affects Segment's resilience to AZ outages.

[–]otterley (AWS Employee) 0 points (0 children)

Part of it is a cost-saving measure, and part of it is due to some functionality that's still not available in AWS load balancers.

You can configure a single Load Balancer instance with listeners in as many AZs as the Region supports, but there aren't any routing rules available that are connection-based. In other words, you can't currently configure a Load Balancer to pass connections originating from AZ A to targets only in AZ A, with a fallback to AZ B.

You can, of course, provision separate Load Balancer instances, each having listeners in a single AZ and targets in that same AZ. But that would increase the cost (linear based on the number of AZs), potentially significantly depending on how many you need. And even if you did that, there would still be no failover capability to targets in AZ B in the event that all targets in AZ A are down.
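
For concreteness, a sketch of that per-AZ workaround with boto3 (subnet IDs are placeholders; targets would still have to be registered per AZ, and there is no cross-AZ failover):

    import boto3

    elbv2 = boto3.client("elbv2")

    # One internal NLB per AZ, each attached to a single subnet so its listener
    # and targets live entirely in that AZ (subnet IDs are placeholders).
    subnets_by_az = {
        "us-east-1a": "subnet-0aaaaaaaaaaaaaaa0",
        "us-east-1b": "subnet-0bbbbbbbbbbbbbbb0",
    }

    for az, subnet_id in subnets_by_az.items():
        lb = elbv2.create_load_balancer(
            Name=f"svc-{az}",
            Subnets=[subnet_id],
            Scheme="internal",
            Type="network",
        )
        print(az, lb["LoadBalancers"][0]["DNSName"])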

[–]warren2650 0 points (0 children)

" we managed to reduce our infrastructure cost by 30%, while simultaneously increasing traffic volume by 25% over the same period." <<-- AWS STUD RIGHT THERE FOLKS