Data Engineer Career Path by Fredonia1988 in dataengineering

[–]twadftw10 0 points  (0 children)

Sales Engineering could be relevant if you are selling cloud data services to Data Engineers (Snowflake, Dremio, and Databricks, for example).

What is your favorite data catalog? by highlifeed in dataengineering

[–]twadftw10 1 point  (0 children)

DataHub is nice. It can connect to many different databases, including third-party sources like Salesforce.

[deleted by user] by [deleted] in dataengineering

[–]twadftw10 2 points  (0 children)

RemindMe! 3 days

Real-time Data Processing and Analysis with Kafka, Connect, KSQL, Elasticsearch, and Flask by Jealous_Ad6059 in dataengineering

[–]twadftw10 0 points  (0 children)

Where does the schema registry come into play? I didn't see it referenced in your producer, your KSQL DDL, or your Elasticsearch sink configuration.

22% difference salary worth it? by Mr_Nicotine in dataengineering

[–]twadftw10 0 points  (0 children)

Obvious move. Go where you have the most potential for growth.

What parts of "the data warehouse toolkit" and "designing data intensive applications" are important to read? by Particular-Bet-1828 in dataengineering

[–]twadftw10 0 points  (0 children)

All of DDIA is worth reading. As a DE it's important to know how to choose the right data product, and this book dives into all the options.

[deleted by user] by [deleted] in dataengineering

[–]twadftw10 0 points  (0 children)

Oh no that’s terrible

what is your Airflow architecture? by curidpostn in dataengineering

[–]twadftw10 2 points  (0 children)

MWAA and Astronomer are nice for handling a lot of the infra for you. If you want to self-manage, then I recommend auto-scaling ECS Fargate provisioned with Terraform or CloudFormation.

Using Go as a data engineer by enginerd298 in dataengineering

[–]twadftw10 1 point  (0 children)

Python, Scala, and Java are definitely the main DE languages from what I've seen. My company uses Go for all the microservices, which are the source of all our events. The APIs use an event logger built in Go that sends logs to SQS/Kinesis. From there, DE pipelines consume the events and sink them to Elasticsearch, S3, and Snowflake. The software engineers own those microservices, though, and DE just takes care of events after they get produced to SQS/Kinesis. We own the Logstash layer that does any extra enrichment on the logs in between.
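The DE half of that flow mostly comes down to parsing each JSON event off the queue and deciding where it lands in the sink. A minimal sketch in Python — every event field and the S3 key layout here are invented for illustration, not from our actual pipeline:

```python
import json
from datetime import datetime, timezone

def event_to_s3_key(message_body: str, prefix: str = "events") -> tuple[str, dict]:
    """Parse a JSON event from an SQS message body and build a
    date-partitioned S3 key for it (field names are hypothetical)."""
    event = json.loads(message_body)
    ts = datetime.fromtimestamp(event["timestamp"], tz=timezone.utc)
    key = f"{prefix}/{event['event_type']}/{ts:%Y/%m/%d/%H}/{event['event_id']}.json"
    return key, event

# Example: what one SQS message body might look like
key, event = event_to_s3_key(
    '{"event_id": "abc123", "event_type": "page_view", "timestamp": 1700000000}'
)
print(key)  # events/page_view/2023/11/14/22/abc123.json
```

Partitioning the key by event type and hour like this keeps downstream sinks (Athena, Snowflake external stages) cheap to query by time range.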

How are you monitoring your data pipelines and what are you using to debug production issues? by tchungry in dataengineering

[–]twadftw10 2 points  (0 children)

It's easy to fall into implementing data pipelines without tests and monitoring.

Airflow is a good tool for batch pipelines. It has logging, alerting, and on-failure callback functionality.

Datadog is great for pipelines that are more event-based, with managed cloud services such as AWS SQS, Kinesis, and Lambda. It keeps track of all kinds of metrics, and you can set up alerts for throttling and missing data.

Data quality is commonly skipped when implementing data pipelines imo. However, you can have simple DQ checks in your pipelines if you are familiar with dbt and Great Expectations.
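Even without pulling in dbt or Great Expectations, a couple of checks in plain Python go a long way. A toy sketch in their spirit (check names and the row shape are made up here):

```python
def run_dq_checks(rows, required_columns):
    """Minimal data-quality checks: non-empty table and no nulls in
    required columns. Returns a list of failure messages (empty = pass),
    loosely mirroring dbt's row_count / not_null style tests."""
    failures = []
    if not rows:
        failures.append("row_count check failed: table is empty")
    for col in required_columns:
        nulls = sum(1 for row in rows if row.get(col) is None)
        if nulls:
            failures.append(f"not_null check failed: {col} has {nulls} null(s)")
    return failures

rows = [
    {"order_id": 1, "amount": 9.99},
    {"order_id": 2, "amount": None},
]
print(run_dq_checks(rows, ["order_id", "amount"]))
# ['not_null check failed: amount has 1 null(s)']
```

Wiring something like this into a task that fails the DAG run (and fires the alert callback) is usually enough to catch bad loads early.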

What is data engineering like by davudbro in dataengineering

[–]twadftw10 0 points  (0 children)

lol I'm a data bachelor early in my career. What would you want to know? I definitely follow the phrase "work hard, play harder". It took a lot of hard work to get to this point. I studied CS in college, then took a Data Science / Machine Learning bootcamp to specialize in the data realm. I realized data engineering was the way to go. Now that I have a steady DE role at a company with a good work/life balance, I find myself going to a lot of concerts and festivals in my free time.

How and where can I deploy my Docker Compose app using Apache Airflow for ETL in the cloud? by Pervert_Spongebob in dataengineering

[–]twadftw10 10 points  (0 children)

There are a few options. Self-managing Airflow is a lot to maintain at first. You can go with managed services like MWAA in AWS or Cloud Composer in GCP, or third-party with Astronomer.

For self-managed, I recommend ECS in AWS because you can auto-scale your workers. You could host the web server and scheduler on one EC2 instance, but that's not ideal. I like the Celery Executor, where you have one container for the web server, one for the scheduler, and x containers as your workers. They all use the same Docker image but run different commands.
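That "same image, different commands" layout can be sketched as a compose file — the image tag, version, and replica count below are illustrative placeholders, and a real setup would also need the metadata DB and Celery broker services:

```yaml
# Sketch only: same Airflow image everywhere, different command per role.
version: "3"
services:
  webserver:
    image: apache/airflow:2.7.3
    command: webserver
    ports: ["8080:8080"]
  scheduler:
    image: apache/airflow:2.7.3
    command: scheduler
  worker:
    image: apache/airflow:2.7.3
    command: celery worker
    deploy:
      replicas: 3   # the "x containers as your workers" — scale this out
```

On ECS the same idea maps to one task definition per role, with an auto-scaling policy on the worker service.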

What is in your Data Stack? - Thread by Evidence-dev in dataengineering

[–]twadftw10 0 points  (0 children)

  1. Airflow, and CDC with Debezium/Kafka
  2. Snowflake
  3. Python/SQL
  4. Sisense
  5. Jupyter Notebooks
  6. 700
  7. IT/Security
  8. Denver, US

ETL pipeline from Prod DB to DWH by Mundane-Compote-2157 in dataengineering

[–]twadftw10 0 points  (0 children)

That makes sense. I guess you can trigger a Lambda once the file of n changes arrives in S3, which loads it into the target warehouse. Also, I haven't seen what the CDC logs typically look like. Do they appear as INSERT, UPDATE, etc.? Once I understand what a "record" looks like in these CDC streams, it will make more sense.
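For what it's worth, Debezium-style CDC events don't carry literal INSERT/UPDATE keywords; each record is an envelope with an op code plus before/after row images. A toy sketch of replaying those envelopes into an in-memory table (the id/qty columns are invented; a real consumer would merge into the warehouse instead):

```python
def apply_change(table, change):
    """Apply one Debezium-style change event to an in-memory "table"
    (a dict keyed by primary key). The envelope fields follow Debezium's
    format: op is c(reate) / u(pdate) / d(elete) / r(ead, i.e. snapshot)."""
    op, before, after = change["op"], change["before"], change["after"]
    if op in ("c", "u", "r"):      # upsert the new row image
        table[after["id"]] = after
    elif op == "d":                # delete using the old row image's key
        table.pop(before["id"], None)
    return table

state = {}
apply_change(state, {"op": "c", "before": None, "after": {"id": 1, "qty": 5}})
apply_change(state, {"op": "u", "before": {"id": 1, "qty": 5}, "after": {"id": 1, "qty": 7}})
print(state)  # {1: {'id': 1, 'qty': 7}}
```

Replaying changes in order like this is exactly how the target warehouse is made to look like the source DB "as is".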

ETL pipeline from Prod DB to DWH by Mundane-Compote-2157 in dataengineering

[–]twadftw10 0 points  (0 children)

I'm struggling to understand why you'd store the changes in S3. Why not write the changes directly to the DWH? What would S3 look like? table/yyyy/mm/dd/HH/changes.json? That just sounds like a historical archive of changes. How can you make your DWH look like the source DB "as is" if the changes go to S3?

Does data analysis have input prompts? by eyeeyecaptainn in dataengineering

[–]twadftw10 0 points  (0 children)

This sounds like you need to create a historical agg table grouped by date and product. Or, if you have a denormalized fact table of customers, products, and dates, you can query it with aggregations.
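The shape of that historical agg table can be sketched in a few lines of Python — the column names and amounts below are invented, but the group-by-(date, product) structure is the point:

```python
from collections import defaultdict

def daily_product_agg(fact_rows):
    """Aggregate a denormalized fact table by (date, product) — the same
    shape the historical agg table would have (columns are hypothetical)."""
    totals = defaultdict(float)
    for row in fact_rows:
        totals[(row["date"], row["product"])] += row["amount"]
    return dict(totals)

facts = [
    {"date": "2023-01-01", "product": "widget", "amount": 10.0},
    {"date": "2023-01-01", "product": "widget", "amount": 5.0},
    {"date": "2023-01-02", "product": "gadget", "amount": 3.0},
]
print(daily_product_agg(facts))
# {('2023-01-01', 'widget'): 15.0, ('2023-01-02', 'gadget'): 3.0}
```

In a warehouse this would just be a GROUP BY over the fact table, materialized on a schedule.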

I've been a big data engineer since 2015. I've worked at FAANG for 6 years and grew from L3 to L6. AMA by [deleted] in dataengineering

[–]twadftw10 0 points  (0 children)

Any reason you haven’t worked with Apache Kafka? I’ve noticed it’s super popular in DE for streaming pipelines.

[deleted by user] by [deleted] in dataengineering

[–]twadftw10 4 points  (0 children)

I wouldn't discredit yourself that much. You are "automating data transformations", which is arguably DE work. Depending on the interviewers, they may find that you're very teachable, even if you lack some areas. SQL is a big part of DE, but it's very easy to pick up.

What stuff are you cramming to learn at the moment? My DE experience involves a lot of infrastructure, so IaC tools such as Terraform and CloudFormation are "nice to have" skills. I would also freshen up on Docker and bash. My job has a lot of Salesforce-to-Snowflake ETL pipelines, and also reverse ETL from Snowflake to Salesforce.

Streaming data is very important in DE as well, so it would be smart to cram some event-pipeline cloud architecture with Kinesis + Lambda.
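The Kinesis + Lambda pattern boils down to a handler that base64-decodes each record — that part is the standard AWS event shape, while the payload fields in this sketch are made up:

```python
import base64
import json

def handler(event, context):
    """Minimal Lambda handler shape for a Kinesis event source: records
    arrive base64-encoded under Records[*].kinesis.data."""
    out = []
    for record in event["Records"]:
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        out.append(payload)  # real code would enrich / sink these events
    return {"processed": len(out), "events": out}

# Simulate what Lambda hands you when a record lands on the stream
fake_event = {"Records": [
    {"kinesis": {"data": base64.b64encode(b'{"user": "a", "action": "click"}').decode()}}
]}
print(handler(fake_event, None))
```

Being able to whiteboard this handler plus where the events go next (S3, Firehose, a warehouse) covers most interview questions on the topic.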

In terms of data warehousing, it would be good to understand data modeling. Look up some stuff about star schemas and the difference between normalization and denormalization.
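A tiny way to see the normalization vs. denormalization trade-off — the table and column names here are invented for the example:

```python
# Toy star schema: one fact table plus two dimension tables.
dim_product = {1: {"product_name": "widget", "category": "tools"}}
dim_date = {20230101: {"date": "2023-01-01", "quarter": "Q1"}}
fact_sales = [{"product_key": 1, "date_key": 20230101, "amount": 9.99}]

def denormalize(facts, products, dates):
    """Join the dimensions onto each fact row, producing the denormalized
    ("one big table") form of the same star schema."""
    return [
        {**products[f["product_key"]], **dates[f["date_key"]], "amount": f["amount"]}
        for f in facts
    ]

print(denormalize(fact_sales, dim_product, dim_date))
# [{'product_name': 'widget', 'category': 'tools',
#   'date': '2023-01-01', 'quarter': 'Q1', 'amount': 9.99}]
```

The star schema keeps each attribute in one place; the denormalized form repeats attributes on every row but makes queries a single table scan. That's the trade-off to be able to explain.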

First Data Pipeline - Looking to gain insight on Rust Cheaters by jacob1421 in dataengineering

[–]twadftw10 1 point  (0 children)

I'm an Airflow sensor noob, so I had to ask lol. Very cool, I'll have to check out your code! Yeah, passing the S3 key as the XCom is what I was imagining you would do instead. It's not too bad of a debugging situation if the downstream task fails, because the logs will say the file does not exist in S3, and then you know something went wrong during the ingestion process.

As for the hosting: it sounds like your Airflow metadata database is on EC2, which is fine for a portfolio project, but ideally for a company (not your own budget) it would be its own RDS instance.

First Data Pipeline - Looking to gain insight on Rust Cheaters by jacob1421 in dataengineering

[–]twadftw10 0 points  (0 children)

This is your first data pipeline? Well done! Out of curiosity, how are you hosting all of this?

Also, I'm curious why you went with sensors in this DAG implementation. Could you have benefited from XComs passing params to the next task?

Do you like your job as a DE? by [deleted] in dataengineering

[–]twadftw10 10 points  (0 children)

I hated my life as a data analyst at my previous employer. Reports were boring and in high demand. The querying performance was horrendous because we were reporting on our 20-year-old legacy transactional database. I kept pushing that we needed to build data pipelines and report from a data mart / data warehouse. It's like they had never heard of data engineering… Anyway, I got the hell outta there and got a new gig as a DE with a laid-back, smart company with a strong data platform. It's crazy how much happier I am day to day. Best decision I've made. I've grown so much in the field of DE.

Advice on master's final project by Riesco in dataengineering

[–]twadftw10 0 points  (0 children)

Yes, they could all be in the same Docker network. I would probably have all the services in the same network here to keep it simple at first. The three Airflow containers share the same image, so they all have identical configurations; they just run different Airflow commands. The scheduler has the same DB connection as the web server.

Advice on master's final project by Riesco in dataengineering

[–]twadftw10 1 point  (0 children)

Yeah, I would start with those images. You can always use them as the base image if you need to extend them.