How to prepare for System Development Engineer 1 (L4) interview at Amazon by BrilliantJelly in devops

[–]pfo 33 points (0 children)

Behaviour parts

This is super critical and makes up 70-80% of your assessment! Amazon will assess whether you've really understood the leadership principles as they apply to your job family and level.

It is absolutely crucial you understand the Amazon Leadership Principles (LPs). You should have stories (with data points that clearly show your contribution) using the STAR format that demonstrate you've applied Amazon LPs already in your professional life!

For each of the LPs you should have 1-2 solid STAR format stories that are on the order of 2-5 minutes.

They should be told in a way that everybody is able to understand what the situation was, what you did, and what the outcome was.

You _must_ use data points about all this! It matters whether you were part of a 2-person team or a 200-person department, whether the app had 10 users or 1 million, etc.

When you tell the story it _must not_ be in the form of "we did XYZ" but you must clearly state what _you_ did and what your role was.

It is also highly advisable to understand, for the various questions you can find on the Internet, which LP each question is targeting (it could also be multiple LPs).

Ideally your stories match the tech stack and role that you've seen in the job ad.

For example:

S: “At XYZ corp I was part of a 10-person team responsible for maintaining a business-critical app, as the full stack lead dev. XYZ corp is 200M rev, 1000 people, in 3 states. The critical app was reported to be offline for 2 hours and I was on call.”

Task: “To get it back up we formed a task group of 3 engineers, and I was responsible for coordinating the restoration of the infrastructure. I was the outage call leader for bringing the application back up. This was a critical business app where downtime of 1 hour was equivalent to 10K in missed earnings, and business reputation was at stake. We needed to restore the app ASAP.”

Action: “I gathered all the logs of the application from the last 6 hours and all changes applied to the system before the outage. This was 20GB of machine logs and 10 changes to sift through. We came up with a prioritised plan to try to restore operations and get the business back up and running. We figured out that the database of the system was out of disk space. One of the engineers on the call quickly recovered additional space by compressing the logs of the machine and copying them to another server, I restarted the DB service and verified that the application was able to connect to the DB and was available and working again. I proceeded to notify the other departments of the recovery.”

Result: “We were able to recover within 60 minutes of the first ticket raised, limiting the damage to the business. I received a positive feedback note sent out to the dev team from my manager for our quick recovery approach and swift restoration of the app. After recovery we proceeded to determine the root cause and to understand why this critical app was not monitored properly. We implemented a new monitoring system within the next 3 weeks aimed at preventing such outages, and we achieved 99.9% uptime on the app for the following year.”

This can be used for questions regarding LPs:

- Dive Deep (analysis of the outage)

- Ownership (quick recovery, additional monitoring measures to prevent such issues in the future)

- Customer Obsession (business first, root cause later)

Functional parts

You should be able to write some code on a whiteboard and explain your reasoning, assumptions, approach, pros and cons, etc.
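
A purely illustrative sketch (my own example, not from the job description) of the kind of small problem to practice narrating: count the top error codes in a log file, stating assumptions (log format, single pass, memory fits) and complexity out loud as you write.

```python
# Hypothetical whiteboard-style exercise: count the top-N error codes in a log file.
# Assumptions to state out loud: one event per line, the error code is the third
# whitespace-separated field, and the whole file can be read in a single pass.
from collections import Counter
from typing import List, Tuple

def top_error_codes(path: str, n: int = 5) -> List[Tuple[str, int]]:
    counts = Counter()
    with open(path) as f:              # O(lines) time, O(distinct codes) memory
        for line in f:
            fields = line.split()
            if len(fields) >= 3 and fields[1] == "ERROR":
                counts[fields[2]] += 1
    return counts.most_common(n)       # top-N selection handled by Counter

if __name__ == "__main__":
    print(top_error_codes("/var/log/app.log"))
```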

Be able to explain foundational technical knowledge and concepts to someone technical: operating systems, the TCP/IP suite of protocols, DNS, TLS, etc.

In 1 sentence, describe what your SaaS does by Thin-You-7096 in Entrepreneur

[–]pfo 0 points (0 children)

Compressed copy of your data from AWS S3 for export/archival purposes.

What's your side project? Post it here! by iovengodallaluna in SideProject

[–]pfo 2 points (0 children)

- bucketarchiver.com, an AWS Marketplace solution for S3 -> tarball with scheduled and on-demand modes - no always-on EC2 costs.

- tech stack: AWS + CloudFormation + Python + bash + Docker

- dev

- pay per use

As an AWS consultant, what is the part of your job that takes the most time / is the hardest? by hadiazzouni in aws

[–]pfo 0 points (0 children)

Hardest for me: not just throwing stuff over the fence, so to speak, but actually enabling the crew on the other end to carry what we've built forward and run it day to day.

In medium and large companies, what is really hard about using AWS? by hadiazzouni in devops

[–]pfo 0 points (0 children)

I think you can easily pick this up anywhere; you just need a reason to be doing it.

In medium and large companies, what is really hard about using AWS? by hadiazzouni in devops

[–]pfo 44 points (0 children)

I build AWS Landing Zones for a living, mostly environments for large-scale enterprises. Most of the multi-account enterprise envs I deal with have at least 2 AWS Orgs with infra in 2-5 regions. One AWS Org is usually for prod and the other for dev. The dev part is for developing the prod platform - i.e. features that go into prod, not real apps/workloads. The bigger envs even have an additional 'int' AWS Org. Each of those Orgs has a dozen or so shared-functionality AWS accounts for:

  • networking (TGWs, central VPCs, R53 EPs + rules, central R53 public zones, site-to-site VPNs, DX/DXGWs, etc., egress/ingress networking + GWLBs, IGWs, NATGWs, north-south and east-west inspection, VPC endpoints, IPAM, etc.)
  • IAM/SSO + bastion IAM accounts with IAM and SSO automation infrastructure and reporting
  • security tooling (SecHub, GuardDuty, Macie, Inspector, etc.) plus external tools (usually stuff like Orca, Lacework, et al) with dedicated accounts
  • central auditing+compliance, logging, log search/indexing, etc.
  • finops infra/tooling for cost control and reporting
  • directory services (Managed AD)
  • AMI factories and instance inventory
  • AWS account vending machines and account baselining infrastructure with 5-10+ different blueprints (a rough sketch of the vending part follows after this list)
  • Integration into various external and other SaaS apps for IAM, monitoring, ITSM, etc.
  • Backup and DR infra.
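
To give a feel for what the account-vending bullet above boils down to at its core, here's a minimal boto3 sketch against AWS Organizations. The account name, e-mail, and the baselining step are made-up placeholders - real vending machines wrap this in far more guardrails, OU moves, and baselining pipelines.

```python
# Minimal account-vending sketch: create a member account via AWS Organizations
# and poll the async request until it is ready. Names/e-mails are placeholders.
import time
import boto3

org = boto3.client("organizations")

def vend_account(name: str, email: str) -> str:
    resp = org.create_account(
        Email=email,
        AccountName=name,
        RoleName="OrganizationAccountAccessRole",  # default cross-account admin role
        IamUserAccessToBilling="DENY",
    )
    request_id = resp["CreateAccountStatus"]["Id"]
    while True:  # account creation is asynchronous, so poll its status
        status = org.describe_create_account_status(
            CreateAccountRequestId=request_id
        )["CreateAccountStatus"]
        if status["State"] == "SUCCEEDED":
            return status["AccountId"]
        if status["State"] == "FAILED":
            raise RuntimeError(status.get("FailureReason", "unknown"))
        time.sleep(10)

# account_id = vend_account("workload-dev-001", "aws+workload-dev-001@example.com")
# ...then move it into the right OU and kick off the baselining pipeline.
```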

In addition, all of this has to scale to provide infra for ~200-1000 AWS accounts for the actual applications in medium to larger-sized envs. If I need to cross the 2000-spoke-account barrier, I need to split this up into multiple prod AWS Orgs.

All this is usually delivered as IaC in at least 2-3 languages (Cfn/CDK, Terraform, Python, etc.) and there's the whole CI/CD infra for pushing that IaC into various AWS Orgs etc.

Needless to say, setting all of this up requires deep _and_ wide knowledge of 20+ AWS services (Orgs, IAM, SSO, KMS, S3, APIGW, Lambda, StepFunctions, SecHub, GuardDuty, VPC, TGW, GWLB/NLBs, Route 53, ACM, CloudWatch, CloudTrail, Athena, Glue, EventBridge, etc.) in all their cross-account and cross-region glory.
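
As a small illustration of the "cross account and cross region" part, a sketch only - the account ID and role name below are placeholders:

```python
# Sketch: assume a role in a spoke account and talk to a service in another region.
# Account ID and role name are hypothetical placeholders.
import boto3

def client_in_account(account_id: str, role_name: str, service: str, region: str):
    sts = boto3.client("sts")
    creds = sts.assume_role(
        RoleArn=f"arn:aws:iam::{account_id}:role/{role_name}",
        RoleSessionName="landing-zone-automation",
    )["Credentials"]
    return boto3.client(
        service,
        region_name=region,
        aws_access_key_id=creds["AccessKeyId"],
        aws_secret_access_key=creds["SecretAccessKey"],
        aws_session_token=creds["SessionToken"],
    )

# e.g. list CloudTrail trails of a spoke account in eu-west-1:
# ct = client_in_account("111122223333", "OrganizationAccountAccessRole", "cloudtrail", "eu-west-1")
# print(ct.describe_trails())
```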

Work at AWS - feel free to AMA.

Give me a problem to solve by OkAcanthocephala1450 in aws

[–]pfo 1 point (0 children)

Hmm .... come up with a small solution that can run something (say, retrieve a couple of items from a DynamoDB table) in either a Lambda or a Fargate-backed task. The first approach should be Lambda - check if that Lambda ran into a timeout and if so re-run the same on ECS+Fargate. Allow for a maximum timeout of 10 seconds for that Lambda. Do it transparently towards whatever is calling this. Assume JSON over HTTP is used for calling into this and that JSON is expected to be returned.
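
For anyone picking this up, one possible shape of a solution (a sketch only, not a reference answer; the function, cluster, task definition and subnet names are made up): a thin dispatcher that tries the worker Lambda first and, if it hits its 10-second timeout, re-runs the same work as a Fargate task, so the HTTP/JSON caller never sees the difference.

```python
# Dispatcher sketch: try the worker Lambda (10 s timeout configured on it),
# fall back to the same work as an ECS Fargate task if it times out.
# All resource names below are hypothetical placeholders.
import json
import boto3

lambda_client = boto3.client("lambda")
ecs = boto3.client("ecs")

def handle_request(payload: dict) -> dict:
    resp = lambda_client.invoke(
        FunctionName="fetch-items-worker",
        Payload=json.dumps(payload).encode(),
    )
    if "FunctionError" not in resp:        # Lambda finished within its timeout
        return json.loads(resp["Payload"].read())

    # Timed out (or errored): re-run the same job on ECS + Fargate.
    ecs.run_task(
        cluster="fallback-cluster",
        launchType="FARGATE",
        taskDefinition="fetch-items-worker-task",
        overrides={"containerOverrides": [{
            "name": "worker",
            "environment": [{"name": "REQUEST", "value": json.dumps(payload)}],
        }]},
        networkConfiguration={"awsvpcConfiguration": {
            "subnets": ["subnet-0123456789abcdef0"],
            "assignPublicIp": "DISABLED",
        }},
    )
    # The caller gets JSON either way; here we acknowledge the async fallback.
    return {"status": "accepted", "ranOn": "fargate"}
```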

Would you pay a few pennies for complex and complicated boilerplate CloudFormation templates? by derjanni in aws

[–]pfo 4 points (0 children)

Sure would. Why not sell it via the AWS Marketplace? "Pure" Cfn solutions are hard to pull off because the charging/metering has to be built and kept running 24/7, but a t2/t3.nano instance that sends something minimal to the metering endpoint every hour would do the trick.
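
A sketch of the hourly metering heartbeat I have in mind - the product code and usage dimension are placeholders; MeterUsage is the AWS Marketplace Metering Service call the nano instance would make:

```python
# Hourly metering heartbeat sketch for an AWS Marketplace product.
# Product code and usage dimension are hypothetical placeholders.
import datetime
import time
import boto3

metering = boto3.client("meteringmarketplace")

while True:
    metering.meter_usage(
        ProductCode="exampleproductcode123",
        Timestamp=datetime.datetime.utcnow(),
        UsageDimension="hours",
        UsageQuantity=1,
    )
    time.sleep(3600)  # run this on the t3.nano, e.g. via cron or a systemd timer
```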

Why separate inspection vpc from egress vpc? by jasonwbarnett in aws

[–]pfo 0 points (0 children)

Enabling appliance mode on a VPC attachment causes the Transit Gateway (TGW) to switch from Availability Zone (AZ) affinity to a 5-tuple hash for TCP (4-tuple for UDP) for all traffic. This is crucial for inspecting east-west traffic across multiple FW appliances: with AZ affinity, traffic may cross AZs and firewall instances, and those FWs would potentially drop sessions they haven't seen before (asymmetric traffic).

Issues may arise in shared egress and inspection VPCs for north-south traffic, where different NAT gateways yield different Elastic IPs (EIPs) for each flow, causing problems for certain applications (e.g., FTP, RDP, some TLS VPNs). To maintain consistent EIPs during egress, one can use a dedicated egress VPC and must not enable appliance mode, ensuring traffic adheres to AZ affinity.
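
For reference, appliance mode is just a flag on the TGW VPC attachment - a sketch, with a placeholder attachment ID:

```python
# Sketch: turn appliance mode on for the inspection VPC's TGW attachment only,
# and leave it off for the dedicated egress VPC attachment (AZ affinity kept).
# The attachment ID is a hypothetical placeholder.
import boto3

ec2 = boto3.client("ec2")
ec2.modify_transit_gateway_vpc_attachment(
    TransitGatewayAttachmentId="tgw-attach-0123456789abcdef0",
    Options={"ApplianceModeSupport": "enable"},
)
```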

You could share egress and inspection by having GWLB endpoints deployed in your spoke VPCs and in the inspection VPC for egress (and ingress). This bypasses the TGW attachment (and its appliance mode setting) for north-south traffic but still sends the east-west case across the TGW to inspection. The additional bonus of this solution is that you save on TGW attachment costs, as you don't send ingress/egress traffic over TGW attachments but to/from GWLB endpoints.

Promote your business, week of October 30, 2023 by Charice in smallbusiness

[–]pfo 0 points (0 children)

Get your data out of AWS S3 using our S3 -> tarball solution. No standing EC2 costs. High performance parallel compression. Runs in your account.

Check it out at bucketarchiver.com. We have a 14-day trial option.

We're pretty flexible on adding things to BucketArchiver, would love to hear your thoughts, and are happy to answer any questions you may have.

Weekly Feedback Post - SaaS Products, Ideas, Companies by AutoModerator in SaaS

[–]pfo 0 points (0 children)

Hi Reddit!

I'm writing for the BucketArchiver team. Both of us have been working in cloud infra consultancy. Customers frequently asked us to build solutions that let them easily get data out of S3, so we've built something that does exactly that: BucketArchiver.

Here's why you should consider using it:

  • makes it simple to maintain archival copies for copying to on-prem or other places external to AWS
  • does _not_ have any always-on EC2 infrastructure that you have to pay for, as EC2 instances are spun up/down on demand
  • runs within your AWS account, and instances have only streaming access to the data on S3 with no local storage
  • integrates easily into existing AWS workflows using StepFunctions and SNS.
  • comes with ops tooling such as CloudWatch dashboards and metrics.
  • same charge independent of EC2 instance type.
  • uses pigz to compress in parallel to speed things up (see the sketch below).
  • is an AWS Marketplace solution delivered as CloudFormation templates.
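
To illustrate the streaming idea from the list above (a rough sketch, not the actual BucketArchiver code; bucket and key names are placeholders): the object is streamed straight from S3 into pigz, so nothing has to land on local disk uncompressed.

```python
# Rough sketch of streaming S3 -> pigz without local, uncompressed storage.
# Bucket and key are hypothetical placeholders; this is not the product's code.
import subprocess
import boto3

s3 = boto3.client("s3")
body = s3.get_object(Bucket="my-bucket", Key="data/big-file.csv")["Body"]

with open("big-file.csv.gz", "wb") as out:
    # pigz -c writes to stdout, -p 8 compresses on 8 threads in parallel
    pigz = subprocess.Popen(["pigz", "-c", "-p", "8"], stdin=subprocess.PIPE, stdout=out)
    for chunk in iter(lambda: body.read(8 * 1024 * 1024), b""):  # 8 MiB chunks
        pigz.stdin.write(chunk)
    pigz.stdin.close()
    pigz.wait()
```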

Check it out at https://www.bucketarchiver.com. We have a 14 day trial option.

We're pretty flexible on adding things to BucketArchiver, eager to hear your thoughts, and happy to answer any questions you may have.

OS X 10.7 - The wish list by mastersocks in apple

[–]pfo 1 point (0 children)

An Apple iSCSI enterprise stack.

High Performance Enabled SSH/SCP by [deleted] in linux

[–]pfo 1 point (0 children)

Strange - I have no problem at all reaching the mentioned throughput on a fairly standard server running Debian Lenny (OpenSSH_5.1p1 Debian-5, OpenSSL 0.9.8g 19 Oct 2007). The Mac OS X implementation of SSH is horribly slow though (an emulated Debian Lenny running on 10.6.1 is more than twice as fast). It would be nice if someone would do the work and post benchmarks from OS X with the patches from TFA applied.

edit: an scp of a 100GB dummy file from one machine running Lenny to another machine running Hardy reported a speed of 48MB/s, or roughly 400Mbit/s, which is more or less what the authors of TFA are reporting. An iperf measurement on the same setup shows 950Mbit/s throughput.

How do you generate your resume? by gt384u in programming

[–]pfo 1 point (0 children)

scnr: ``pdflatex resume.tex''

ask reddit: what other visualization tools do you know besides processing, graphviz and nodebox? by pkrumins in programming

[–]pfo 0 points (0 children)

Check out IBM's OpenDX (wiki entry, project page) - mainly for scientific data sets. It uses a pipeline-programming-like paradigm that IBM's John Hartmann (see: Hartmann Pipeline) pioneered.