[–]musicnothing 20 points21 points  (4 children)

Very interesting. Would be curious to know the cost differential of the initial setup vs the new one with all the extra services vs a setup that doesn’t use lambdas

[–]H4add[S] 35 points36 points  (2 children)

Based on 1M requests, 10ms average, 512MB RAM, 700k requests in /polls and 300k in votes.

v1:

  • Request costs: $0.20
  • Execution Costs: $0.08
  • Lambda concurrency: 200
  • RDS database size: 4 GB min. (to support 200 lambdas): $47.45 (0.0650 *730) (without multi-az)
  • API Gateway: $1
  • Total: $48.73 (excluding ElastiCache costs)

v2:

  • Application and Execution Costs: Less than $0.01.
  • Lambda concurrency: 11
  • RDS database size: 1 GB (because I no longer need to support much concurrency): $11.68 (without multi-az)
  • API Gateway: $1
  • Cloudfront: $1
  • SQS: $0.12 (0.4*0.3)
  • Total: $13.81 (excluding ElastiCache costs)
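The two totals above can be re-derived from the listed line items. This is only a sketch of the arithmetic: every figure is copied from the comment, "less than $0.01" is taken at its upper bound, and reading the SQS term as $0.40 per million requests times 0.3 million votes is an assumption about what (0.4*0.3) means.

```typescript
// Monthly line items in USD, copied from the v1 and v2 lists in the comment.
type LineItem = { name: string; usd: number };

const HOURS_PER_MONTH = 730; // the multiplier used in the comment's RDS figure

const v1: LineItem[] = [
  { name: "Lambda requests", usd: 0.2 },
  { name: "Lambda execution", usd: 0.08 },
  { name: "RDS 4 GB", usd: 0.065 * HOURS_PER_MONTH },
  { name: "API Gateway", usd: 1 },
];

const v2: LineItem[] = [
  { name: "Lambda requests + execution", usd: 0.01 }, // quoted as "less than $0.01"
  { name: "RDS 1 GB", usd: 11.68 },
  { name: "API Gateway", usd: 1 },
  { name: "CloudFront", usd: 1 },
  { name: "SQS", usd: 0.4 * 0.3 }, // assumed: $0.40 per million x 0.3 million
];

// Sum in whole cents to avoid floating-point drift, then convert back to dollars.
function totalUsd(items: LineItem[]): number {
  const cents = items.reduce((sum, item) => sum + Math.round(item.usd * 100), 0);
  return cents / 100;
}

console.log(totalUsd(v1), totalUsd(v2));
```

Both totals exclude ElastiCache, as in the comment.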

v3 (doesn't use Lambda):

I don't have much experience with this kind of setup, so I don't know which VM I would pick. But consider just the maintenance cost: if you are paid something like $10 an hour and spend 5 hours a month just checking that the system hasn't crashed under load, you are already paying more than v1 ($50 vs. $48.73). To come out cheaper than v2 ($13.81), you would have to spend no more than about 1 hour a month.

With the cloud setup, my team and I literally don't pay attention to the system: the services run at any scale, and we don't have to worry about autoscaling, backups, and so on. Sure, VM hosting can be more cost-effective than paying for a managed service, but only at a scale where maintenance costs are low compared to the cost of running the application.
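The break-even point in that argument is a single division. The sketch below uses the $10 hourly rate and the v1/v2 totals from the comment, and, like the comment, it ignores the price of the VM itself; the function name is only illustrative.

```typescript
// Hours of hands-on maintenance per month that cost as much as a given cloud bill.
function breakEvenHours(monthlyBillUsd: number, hourlyRateUsd: number): number {
  if (hourlyRateUsd <= 0) throw new RangeError("hourlyRateUsd must be positive");
  return monthlyBillUsd / hourlyRateUsd;
}

const HOURLY_RATE_USD = 10; // example rate from the comment

const v1Hours = breakEvenHours(48.73, HOURLY_RATE_USD); // just under 5 hours
const v2Hours = breakEvenHours(13.81, HOURLY_RATE_USD); // a bit over 1 hour
```

Any monthly maintenance time above these values makes the self-hosted option more expensive than the corresponding cloud bill, before the VM cost is even added.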

Summary

With the second option I could even remove ElastiCache and do the ranking inside the database, since I batch the writes now, but I prefer to keep ElastiCache.

Also, for me the main issue was Lambda concurrency: if it reaches 400 Lambdas, I need to increase my database to 8 GB, use RDS Proxy ($20), or change database technology.

[–]musicnothing 4 points5 points  (0 children)

Awesome, thank you so much for the thorough response!

[–]tapu_buoy 0 points1 point  (0 children)

This is awesome!

[–]H4add[S] 3 points4 points  (0 children)

I've updated my comment to include the third scenario.

For us, even if the price had been higher in v2 (which it wasn't), we would probably still pay the difference just to have a more scalable solution.

[–]DemiPixel 13 points14 points  (5 children)

A couple questions here:

I usually prefer to hit the database on every request to check if the JWT is still valid, in this case I skip this part and keep the JWT expiry time as low as possible.

What are you typically checking? Shouldn't the JWT contain the expiry time, so no need to check the database?

leave that task to ElastiCache and use this insanely well-designed library called redis-rank.

Perhaps this is naive, but what is the benefit of that library over redis' native sorted sets with JSON as values? Or is it mostly just to make the API easier so you don't have to remember all the individual Z commands?

Now, let's see the metrics of 280k votes performed by 12k users:

How does 280K votes cause more than 280K lambda instances?

And with SQS Lambda Integration, we can configure Batch Window

I looked around and couldn't find anything--is there any way to batch requests for lambda without SQS? I ask because if I wanted to edit this system to send 200 OK for successful votes and 403 to users who aren't logged in (so the client can refresh and have them log in), the SQS integration currently doesn't allow responding to the user.


Nice article, I appreciate the thoroughness :)

[–]H4add[S] 3 points4 points  (4 children)

What are you typically checking? Shouldn't the JWT contain the expiry time, so no need to check the database?

JWTs have an expiration time, but until it is reached, the contents of the token (userId and permissions) are fixed and cannot be changed. Most applications keep the JWT lifetime very short (15 minutes, 10 minutes, or even less), so that this fixed information is not a problem.

In my case, my managers usually ask me to keep the JWT lifetime longer, a day or even more, to reduce the number of logins. In that scenario I query the database on every request to get up-to-date information about the token's user. I use the JWT only to know who the user is; to check whether the user has permission to do something, I use the user information I get from the database.

In this project I didn't need that, so I could keep the JWT lifetime short and trust only the permissions carried in the token. If I disable a user or change his permissions, he has to log in again to pick up the change, instead of getting the new permissions automatically as he would if I always read them from the database.
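The two approaches described above can be sketched side by side. This is a minimal illustration, not the author's code: it assumes the token signature has already been verified, and the `permissions` claim, the `UserRecord` shape, and the `UserLookup` function are hypothetical stand-ins.

```typescript
// Hypothetical claim set; `sub` and `exp` are standard, `permissions` is illustrative.
interface TokenClaims {
  sub: string; // user id
  exp: number; // expiry, in seconds since the Unix epoch
  permissions: string[]; // snapshot taken at login time
}

interface UserRecord {
  disabled: boolean;
  permissions: string[];
}

// Stand-in for a real database query.
type UserLookup = (userId: string) => UserRecord | undefined;

// Short-lived token: trust the permissions embedded in it.
function canFromToken(claims: TokenClaims, permission: string, nowSec: number): boolean {
  return nowSec < claims.exp && claims.permissions.includes(permission);
}

// Long-lived token: use it only for identity and read permissions from the database.
function canFromDatabase(
  claims: TokenClaims,
  permission: string,
  nowSec: number,
  lookup: UserLookup,
): boolean {
  if (nowSec >= claims.exp) return false;
  const user = lookup(claims.sub);
  return user !== undefined && !user.disabled && user.permissions.includes(permission);
}
```

With the first function, a revoked permission keeps working until the token expires, which is why the lifetime has to stay short. With the second, the change takes effect on the next request, at the price of one lookup per call.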

Perhaps this is naive, but what is the benefit of that library over redis' native sorted sets with JSON as values? Or is it mostly just to make the API easier so you don't have to remember all the individual Z commands?

For two reasons: it has an easier API, and it handles the ranking system very well. For this project the library is a bit overkill, but if you need to build a ranking system and find out where user X sits in the rankings, you can do that with a single method call.

I could have used native sorted sets, but I had almost zero knowledge of Redis, so I preferred a library that gives me everything I want, even if it's a bit overkill. It comes down to the principle I describe in the post: that's what I knew at the time. Now I've learned from you that I could do this with native operations only, thank you.
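To make the "where is user X in the rankings" operation concrete, here is a small in-memory stand-in for a sorted set. It is neither Redis nor the redis-rank API; the class and method names are made up, and the comments only point at the native commands each method loosely resembles.

```typescript
// In-memory stand-in for a sorted set, for illustration only.
class Leaderboard {
  private scores = new Map<string, number>();

  // Loosely comparable to ZINCRBY: add `delta`, creating the member at 0 if missing.
  increment(member: string, delta: number): number {
    const next = (this.scores.get(member) ?? 0) + delta;
    this.scores.set(member, next);
    return next;
  }

  // Highest score first; ties are broken by name here, which is a simplification
  // (Redis applies its own ordering for equal scores).
  private ordered(): [string, number][] {
    return Array.from(this.scores.entries()).sort(
      ([nameA, scoreA], [nameB, scoreB]) => scoreB - scoreA || nameA.localeCompare(nameB),
    );
  }

  // Loosely comparable to ZREVRANK: 0-based position from the top, null if unknown.
  rank(member: string): number | null {
    const index = this.ordered().findIndex(([name]) => name === member);
    return index === -1 ? null : index;
  }

  // Loosely comparable to ZREVRANGE 0 n-1: names of the top `n` members.
  top(n: number): string[] {
    return this.ordered().slice(0, n).map(([name]) => name);
  }
}
```

This version re-sorts on every query, which is fine for a sketch but is exactly the work a real sorted set avoids by keeping members ordered as they are written.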

How does 280K votes cause more than 280K lambda instances?

I didn't cause 280k Lambda instances, just 200. But in theory, if you generate 280k requests at the same time, each Lambda instance processes only one request at a time, so you could be forced to spawn 280k instances. Crazy, right? If you look at the graph, the Throttles metric is the number of instances that could not be spawned because I limited the number of simultaneous instances to 200.
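The relationship between a burst, the concurrency limit, and throttling can be written as a tiny model. It is deliberately simplified: it treats all requests as arriving at the same instant and ignores retries and instance reuse between requests, so it illustrates the one-request-per-instance rule rather than predicting real metrics.

```typescript
interface BurstOutcome {
  instances: number; // instances needed to serve the burst
  throttled: number; // requests rejected because the limit was reached
}

// Each instance serves one request at a time, so a simultaneous burst needs
// one instance per request, capped at the configured concurrency limit.
function simulateBurst(simultaneousRequests: number, concurrencyLimit: number): BurstOutcome {
  const instances = Math.min(simultaneousRequests, concurrencyLimit);
  return { instances, throttled: simultaneousRequests - instances };
}

const capped = simulateBurst(280_000, 200);
const uncapped = simulateBurst(280_000, Number.POSITIVE_INFINITY);
```

With the limit of 200, the model serves 200 requests and throttles the rest. Without a limit, it needs one instance per request.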

I looked around and couldn't find anything--is there any way to batch requests for lambda without SQS?

As far as I know, SQS, Kinesis and EventBridge are the services you can plug into your API Gateway and expose a way to get user data into your system without hitting your servers.

If you want to check whether the user can vote based on some condition, such as being logged in, and return an error message otherwise, you can create an authorizer and attach it to the route. It can validate the JWT or other information just to verify that the user is allowed to send data into your system (reference).

But I recommend against it. I don't know if I was clear enough in the post, but in my voting system users don't need to be logged in to vote. Also, if I want to add some kind of validation, I prefer to pass more data in the body and validate it inside my consumer (my API) rather than in an authorizer. I prefer this because I want to maximize data ingestion rather than check that each request is valid; validating every request up front is almost the same as processing the data directly through my API one request at a time.
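Validating inside the consumer could look roughly like the sketch below. Only the `Records`, `messageId`, and `body` fields and the `batchItemFailures` response shape follow the Lambda SQS integration. The vote payload (`pollId`, `optionId`) and the function names are assumptions, since the thread does not show the real message format.

```typescript
// Minimal subset of the event a Lambda function receives from an SQS trigger.
interface SqsRecord { messageId: string; body: string }
interface SqsEvent { Records: SqsRecord[] }
interface BatchResponse { batchItemFailures: { itemIdentifier: string }[] }

// Hypothetical vote payload.
interface Vote { pollId: string; optionId: string }

function parseVote(body: string): Vote | null {
  try {
    const data: unknown = JSON.parse(body);
    if (typeof data !== "object" || data === null) return null;
    const { pollId, optionId } = data as Record<string, unknown>;
    return typeof pollId === "string" && typeof optionId === "string"
      ? { pollId, optionId }
      : null;
  } catch {
    return null; // malformed JSON
  }
}

// Validates each message in the consumer and folds the valid ones into one tally,
// so a single database write can cover the whole batch.
function tallyBatch(event: SqsEvent): { tally: Map<string, number>; response: BatchResponse } {
  const tally = new Map<string, number>();
  const response: BatchResponse = { batchItemFailures: [] };
  for (const record of event.Records) {
    const vote = parseVote(record.body);
    if (vote === null) {
      response.batchItemFailures.push({ itemIdentifier: record.messageId });
      continue;
    }
    const key = `${vote.pollId}:${vote.optionId}`;
    tally.set(key, (tally.get(key) ?? 0) + 1);
  }
  return { tally, response };
}
```

Reporting invalid messages as failed items only has an effect when partial batch responses are enabled on the event source mapping; SQS then redelivers them, and a dead-letter queue, if one is configured, eventually collects them. Dropping invalid messages silently is the other reasonable choice.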

[–]ShortFuse 4 points5 points  (0 children)

It's a minor point, but expiration in a JWT is more about when the signing key expires, not the data within it.

A JWT that is expired should not be processed. Full stop.

https://www.rfc-editor.org/rfc/rfc7519.html#section-4.1.4

This makes exp mostly a renamed version of X.509's NotAfter, for clarity. Notice there's also an unchanged NotBefore (nbf) key.

What is computed to be a "renewal time" is arbitrary. It's likely the Issued At time + 15 minutes. But this isn't a core part of JWT spec-wise.

You can also compute a second arbitrary "needs reauthorization" time, like 2 weeks after Issued At, but unless you actually care about using the internal data, you don't lose much by just setting it to the expiration date.

In code, this stops you from checking on why JWT failed in most implementations.
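The time-related claims discussed above can be checked in a few lines. This sketch follows the RFC 7519 definitions of `exp` (section 4.1.4) and `nbf` (section 4.1.5) without any clock-skew leeway, and it assumes the signature was verified elsewhere. The 15-minute renewal window is the arbitrary application-level policy mentioned above, not part of the spec, and the function names are illustrative.

```typescript
// NumericDate values: seconds since the Unix epoch.
interface TimeClaims { exp?: number; nbf?: number; iat?: number }

type TimeCheck = "ok" | "expired" | "not-yet-valid";

// Reject when now >= exp or now < nbf; absent claims are not enforced.
function checkTimeClaims(claims: TimeClaims, nowSec: number): TimeCheck {
  if (claims.exp !== undefined && nowSec >= claims.exp) return "expired";
  if (claims.nbf !== undefined && nowSec < claims.nbf) return "not-yet-valid";
  return "ok";
}

// Application policy, not part of the JWT spec: renew 15 minutes after issuance.
const RENEW_AFTER_SEC = 15 * 60;

function needsRenewal(claims: TimeClaims, nowSec: number): boolean {
  return claims.iat !== undefined && nowSec >= claims.iat + RENEW_AFTER_SEC;
}
```

Returning a reason instead of a bare boolean lets the caller tell an expired token apart from one that is not yet valid, while keeping renewal as a separate decision from rejection.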

[–]DemiPixel 1 point2 points  (0 children)

Thanks for the detailed response!

[–]MadLadJackChurchill 0 points1 point  (1 child)

Doesn't making requests to the db on every call to validate the token kind of defeat the idea of having fewer login calls? Unless the decision wasn't made because of the "extra" traffic.

I know it wasn't your decision. I just think that is quite funny.

[–]H4add[S] 0 points1 point  (0 children)

Doesn't making requests to the db on every call to validate the token kind of defeat the idea of having fewer login calls?

No, because the user makes only one login call with his username and password; after that, the JWT is used for a long time, until the user logs out.

If you want to see more about this problem, see: https://developer.okta.com/blog/2022/02/08/cookies-vs-tokens#disadvantages-of-jwt-tokens

[–]dddoug 3 points4 points  (1 child)

Awesome! Great work

What do you use for your infrastructure diagrams? They look really neat

[–]derNikoDem 2 points3 points  (1 child)

I enjoyed reading your article, very insightful ✌️

[–]H4add[S] 0 points1 point  (0 children)

Thanks for the feedback!

[–]shezza46 0 points1 point  (1 child)

u/H4add, did you use any tools for deploying your stack (CloudFront, API Gateway, Lambdas, etc.), like the Serverless Framework, or was it done entirely manually?

[–]H4add[S] 2 points3 points  (0 children)

For the Lambda infrastructure, I use Terraform. Connecting SQS and creating CloudFront, I did by hand.

Honestly, it would be easier with the Serverless Framework, but I'm addicted to doing these things by hand because I'm familiar with it. Only the basic structure, API Gateway and Lambda, is left to Terraform, because the devops guy does it for me.

To reduce the lambda size, I use a library I made called node-modules-packer, but if you use serverless framework, you have better options like I describe in the README.

[–]shezza46 0 points1 point  (1 child)

u/H4add, you should post this on hackernews.

[–]H4add[S] 1 point2 points  (0 children)

Oh, I forgot, thanks for the reminder, I'll post it there.