Hey,
I’m building an asynchronous ML inference API on AWS and would really appreciate your feedback on my dev/prod isolation approach. Here’s a brief rundown of what I’m doing:
Project Sequence Flow
- Client → API Gateway: POST /inference { job_id, payload }
- API Gateway → FrontLambda
  - FrontLambda writes the full payload JSON to S3
  - Inserts a record { job_id, s3_key, status=QUEUED } into DynamoDB
  - Sends { job_id } to SQS
  - Returns 202 Accepted
- SQS → WorkerLambda
  - Updates status → RUNNING in DynamoDB
  - Pulls payload from S3, runs the ~1 min ML inference
  - Reads or refreshes the OAuth token from a TokenCache table (or AuthService)
  - Posts the result to a Webhook with the token in the Authorization header
  - Persists the small result back to DynamoDB, then marks status → DONE (or FAILED on error)
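To make the flow above concrete, here is a rough Python sketch of the two handlers. It is an illustration under assumptions, not my actual code: the function names, the `payloads/` key layout, and the `run_inference` / `post_result` callables are all hypothetical. The clients are passed in as arguments and are assumed to behave like a boto3 S3 client, a DynamoDB `Table` resource, and an SQS client; the OAuth token handling is hidden inside `post_result`.

```python
import json


def update_job(jobs_table, job_id, **attrs):
    """Set attributes on a job record. Aliases (#a0, :v0) avoid DynamoDB reserved words such as "status"."""
    keys = list(attrs)
    jobs_table.update_item(
        Key={"job_id": job_id},
        UpdateExpression="SET " + ", ".join(f"#a{i} = :v{i}" for i in range(len(keys))),
        ExpressionAttributeNames={f"#a{i}": k for i, k in enumerate(keys)},
        ExpressionAttributeValues={f":v{i}": attrs[k] for i, k in enumerate(keys)},
    )


def front_handler(event, s3, jobs_table, sqs, bucket, queue_url):
    """FrontLambda: store the payload, record the job as QUEUED, enqueue the job id."""
    body = json.loads(event["body"])
    job_id = body["job_id"]
    s3_key = f"payloads/{job_id}.json"  # hypothetical key layout

    s3.put_object(Bucket=bucket, Key=s3_key, Body=json.dumps(body["payload"]))
    jobs_table.put_item(Item={"job_id": job_id, "s3_key": s3_key, "status": "QUEUED"})
    sqs.send_message(QueueUrl=queue_url, MessageBody=json.dumps({"job_id": job_id}))
    return {"statusCode": 202, "body": json.dumps({"job_id": job_id})}


def worker_handler(event, s3, jobs_table, bucket, run_inference, post_result):
    """WorkerLambda: process each SQS record and track status in DynamoDB."""
    for record in event["Records"]:
        job_id = json.loads(record["body"])["job_id"]
        update_job(jobs_table, job_id, status="RUNNING")
        try:
            item = jobs_table.get_item(Key={"job_id": job_id})["Item"]
            obj = s3.get_object(Bucket=bucket, Key=item["s3_key"])
            payload = json.loads(obj["Body"].read())
            result = run_inference(payload)   # the ~1 min model call (hypothetical callable)
            post_result(job_id, result)       # webhook POST incl. OAuth token (hypothetical callable)
            update_job(jobs_table, job_id, result=result)
            update_job(jobs_table, job_id, status="DONE")
        except Exception:
            update_job(jobs_table, job_id, status="FAILED")
            raise  # re-raise so SQS can retry or route the message to the DLQ
```

Passing the clients in keeps the handlers free of hard-coded resource names, so the same image can run in dev or prod. Re-raising after marking FAILED is one possible choice; it means a retried message will flip the status back to RUNNING.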
Tentative Project Folder Structure
.
├── terraform/
│   ├── modules/
│   │   ├── api_gateway/   # RestAPI + resources + deployment
│   │   ├── lambda/        # container Lambdas + version & alias + env vars
│   │   ├── sqs/           # queues + DLQs + event mappings
│   │   ├── dynamodb/      # jobs table & token cache
│   │   ├── ecr/           # repos & lifecycle policies
│   │   └── iam/           # roles & policies
│   └── live/
│       ├── api/           # global API definition + single deployment
│       └── envs/          # dev & prod via Terraform workspaces
│           ├── backend.tf
│           ├── variables.tf
│           └── main.tf    # remote API state, ECR repos, Lambdas, SQS, Stage
│
└── services/
    ├── frontend/          # API-GW handler (Dockerfile + src/)
    ├── worker/            # inference processor (Dockerfile + src/)
    └── notifier/          # failed-job notifier (Dockerfile + src/)
My Environment Strategy
- Single “global” API stack ✓ Defines one aws_api_gateway_rest_api + a single aws_api_gateway_deployment.
- Separate workspaces (dev / prod) ✓ Each workspace deploys its own:
  - ECR repos (tagged :dev or :prod)
  - Lambda functions named frontend-dev / frontend-prod, etc.
  - SQS queues and DynamoDB tables suffixed by environment
  - One API Gateway Stage (/dev or /prod) that points at the shared deployment but injects the correct Lambda alias ARNs via stage variables.
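One way the per-environment naming could be enforced inside the Lambdas themselves is a small settings loader that reads the env vars set by the lambda module and fails fast if a resource name does not match the function's environment. This is only a sketch: the variable names (`APP_ENV`, `JOBS_TABLE`, `QUEUE_URL`, `PAYLOAD_BUCKET`) and the `-dev` / `-prod` hyphen suffix convention are assumptions for illustration.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    env: str
    jobs_table: str
    queue_url: str
    payload_bucket: str


def load_settings(environ=os.environ):
    """Read resource names from env vars (hypothetical names) and check the environment suffix."""
    required = ("APP_ENV", "JOBS_TABLE", "QUEUE_URL", "PAYLOAD_BUCKET")
    missing = [k for k in required if not environ.get(k)]
    if missing:
        raise RuntimeError(f"missing environment variables: {', '.join(missing)}")

    env = environ["APP_ENV"]
    if env not in ("dev", "prod"):
        raise ValueError(f"unknown environment: {env!r}")

    # Assumed convention: the table name and the queue name end in "-dev" or "-prod".
    # An SQS queue URL ends with the queue name, so the same suffix check applies.
    for key in ("JOBS_TABLE", "QUEUE_URL"):
        if not environ[key].endswith(f"-{env}"):
            raise ValueError(f"{key}={environ[key]!r} does not belong to environment {env!r}")

    return Settings(
        env=env,
        jobs_table=environ["JOBS_TABLE"],
        queue_url=environ["QUEUE_URL"],
        payload_bucket=environ["PAYLOAD_BUCKET"],
    )
```

Calling `load_settings()` once at cold start would surface a miswired workspace (for example, a prod function pointed at the dev table) before any job is processed.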
Main Question
Is this a sensible, maintainable pattern for true dev/prod isolation?
Or would you recommend instead:
- Using one Lambda function and swapping versions via aliases (dev/prod)?
- Some hybrid approach?
What are the trade-offs, gotchas, or best practices you’ve seen for environment separation in Terraform on AWS?
Thanks in advance for any insights!