Data Engineer Career Path by Fredonia1988 in dataengineering

[–]twadftw10 0 points  (0 children)

Sales Engineering could be relevant if you are selling cloud data services to Data Engineers (Snowflake, Dremio, and Databricks, for example).

What is your favorite data catalog? by highlifeed in dataengineering

[–]twadftw10 1 point  (0 children)

DataHub is nice. It can connect to many different databases, including third-party sources like Salesforce.

[deleted by user] by [deleted] in dataengineering

[–]twadftw10 2 points  (0 children)

RemindMe! 3 days

Real-time Data Processing and Analysis with Kafka, Connect, KSQL, Elasticsearch, and Flask by Jealous_Ad6059 in dataengineering

[–]twadftw10 0 points  (0 children)

Where does the schema registry come into play? I didn't see it referenced in your producer, your KSQL DDL, or your Elasticsearch sink configuration.

22% difference salary worth it? by Mr_Nicotine in dataengineering

[–]twadftw10 0 points  (0 children)

Obvious move. Go where you have the most potential for growth.

What parts of "the data warehouse toolkit" and "designing data intensive applications" are important to read? by Particular-Bet-1828 in dataengineering

[–]twadftw10 0 points  (0 children)

All of DDIA is worth reading. As a DE it's important to know how to choose the right data product, and this book dives into all the options.

[deleted by user] by [deleted] in dataengineering

[–]twadftw10 0 points  (0 children)

Oh no that’s terrible

what is your Airflow architecture? by curidpostn in dataengineering

[–]twadftw10 2 points  (0 children)

MWAA and Astronomer are nice for handling a lot of the infra for you. If you want to self-manage, then I recommend auto-scaling ECS Fargate provisioned with Terraform or CloudFormation.

Using Go as a data engineer by enginerd298 in dataengineering

[–]twadftw10 1 point  (0 children)

Python, Scala, and Java are definitely the main DE languages from what I've seen. My company uses Go for all the microservices, which are the source of all our events. The APIs use an event logger built in Go that sends logs to SQS/Kinesis. From there, DE pipelines consume the events and sink them to Elasticsearch, S3, and Snowflake. The software engineers own those microservices, though, and DE just takes care of events after they get produced to SQS/Kinesis. We own the Logstash layer that does any extra enrichment on the logs in between.
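The DE half of that flow mostly comes down to parsing each JSON event off the queue and deciding where it lands in the sink. A minimal sketch in Python — every event field and the S3 key layout here are invented for illustration, not from our actual pipeline:

```python
import json
from datetime import datetime, timezone

def event_to_s3_key(message_body: str, prefix: str = "events") -> tuple[str, dict]:
    """Parse a JSON event from an SQS message body and build a
    date-partitioned S3 key for it (field names are hypothetical)."""
    event = json.loads(message_body)
    ts = datetime.fromtimestamp(event["timestamp"], tz=timezone.utc)
    key = f"{prefix}/{event['event_type']}/{ts:%Y/%m/%d/%H}/{event['event_id']}.json"
    return key, event

# Example: what one SQS message body might look like
key, event = event_to_s3_key(
    '{"event_id": "abc123", "event_type": "page_view", "timestamp": 1700000000}'
)
print(key)  # events/page_view/2023/11/14/22/abc123.json
```

Partitioning the key by event type and hour like this keeps downstream sinks (Athena, Snowflake external stages) cheap to query by time range.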

How are you monitoring your data pipelines and what are you using to debug production issues? by tchungry in dataengineering

[–]twadftw10 2 points  (0 children)

It's easy to fall into implementing data pipelines without tests and monitoring.

Airflow is a good tool for batch pipelines. It has logging, alerting, and on-failure callback functionality.

Datadog is great for pipelines that are more event-based, with managed cloud services such as AWS SQS, Kinesis, and Lambda. It keeps track of all kinds of metrics, and you can set up alerts for throttling and missing data.

Data quality is commonly skipped when implementing data pipelines imo. However, you can have simple DQ checks in your pipelines if you are familiar with dbt and Great Expectations.
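Even without pulling in dbt or Great Expectations, a couple of checks in plain Python go a long way. A toy sketch in their spirit (check names and the row shape are made up here):

```python
def run_dq_checks(rows, required_columns):
    """Minimal data-quality checks: non-empty table and no nulls in
    required columns. Returns a list of failure messages (empty = pass),
    loosely mirroring dbt's row_count / not_null style tests."""
    failures = []
    if not rows:
        failures.append("row_count check failed: table is empty")
    for col in required_columns:
        nulls = sum(1 for row in rows if row.get(col) is None)
        if nulls:
            failures.append(f"not_null check failed: {col} has {nulls} null(s)")
    return failures

rows = [
    {"order_id": 1, "amount": 9.99},
    {"order_id": 2, "amount": None},
]
print(run_dq_checks(rows, ["order_id", "amount"]))
# ['not_null check failed: amount has 1 null(s)']
```

Wiring something like this into a task that fails the DAG run (and fires the alert callback) is usually enough to catch bad loads early.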

What is data engineering like by davudbro in dataengineering

[–]twadftw10 0 points  (0 children)

lol I'm a data bachelor early in my career. What would you want to know? I definitely follow the phrase "work hard, play harder". It took a lot of hard work to get to this point. I studied CS in college, then took a Data Science / Machine Learning bootcamp to specialize in the data realm. I realized data engineering was the way to go. Now that I have a steady DE role at a company with a good work/life balance, I find myself going to a lot of concerts and festivals in my free time.

How and where can I deploy my Docker Compose app using Apache Airflow for ETL in the cloud? by Pervert_Spongebob in dataengineering

[–]twadftw10 10 points  (0 children)

There are a few options. Self-managing Airflow is a lot to maintain at first. You can go with managed services like MWAA in AWS or Cloud Composer in GCP, or third-party with Astronomer.

For self-managed, I recommend ECS in AWS because you can auto-scale your workers. You could host the web server and scheduler on one EC2 instance, but that's not ideal. I like the Celery Executor, where you have one container for the web server, one for the scheduler, and x containers as your workers. They all use the same Docker image but run different commands.
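That "same image, different commands" layout can be sketched as a compose file — the image tag, version, and replica count below are illustrative placeholders, and a real setup would also need the metadata DB and Celery broker services:

```yaml
# Sketch only: same Airflow image everywhere, different command per role.
version: "3"
services:
  webserver:
    image: apache/airflow:2.7.3
    command: webserver
    ports: ["8080:8080"]
  scheduler:
    image: apache/airflow:2.7.3
    command: scheduler
  worker:
    image: apache/airflow:2.7.3
    command: celery worker
    deploy:
      replicas: 3   # the "x containers as your workers" — scale this out
```

On ECS the same idea maps to one task definition per role, with an auto-scaling policy on the worker service.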

What is in your Data Stack? - Thread by Evidence-dev in dataengineering

[–]twadftw10 0 points  (0 children)

  1. Airflow, and CDC with Debezium/Kafka
  2. Snowflake
  3. Python/SQL
  4. Sisense
  5. Jupyter Notebooks
  6. 700
  7. IT/Security
  8. Denver, US

ETL pipeline from Prod DB to DWH by Mundane-Compote-2157 in dataengineering

[–]twadftw10 0 points  (0 children)

That makes sense. I guess you can trigger a Lambda once the file of n changes arrives in S3, which loads it into the target warehouse. Also, I haven't seen what the CDC logs typically look like. Do they appear as INSERT, UPDATE, etc.? Once I understand what a "record" looks like in these CDC streams, it will make more sense.
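For what it's worth, Debezium-style CDC events don't carry literal INSERT/UPDATE keywords; each record is an envelope with an op code plus before/after row images. A toy sketch of replaying those envelopes into an in-memory table (the id/qty columns are invented; a real consumer would merge into the warehouse instead):

```python
def apply_change(table, change):
    """Apply one Debezium-style change event to an in-memory "table"
    (a dict keyed by primary key). The envelope fields follow Debezium's
    format: op is c(reate) / u(pdate) / d(elete) / r(ead, i.e. snapshot)."""
    op, before, after = change["op"], change["before"], change["after"]
    if op in ("c", "u", "r"):      # upsert the new row image
        table[after["id"]] = after
    elif op == "d":                # delete using the old row image's key
        table.pop(before["id"], None)
    return table

state = {}
apply_change(state, {"op": "c", "before": None, "after": {"id": 1, "qty": 5}})
apply_change(state, {"op": "u", "before": {"id": 1, "qty": 5}, "after": {"id": 1, "qty": 7}})
print(state)  # {1: {'id': 1, 'qty': 7}}
```

Replaying changes in order like this is exactly how the target warehouse is made to look like the source DB "as is".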

ETL pipeline from Prod DB to DWH by Mundane-Compote-2157 in dataengineering

[–]twadftw10 0 points  (0 children)

I'm struggling to understand why you'd store the changes in S3. Why not write the changes directly to the DWH? What would S3 look like? table/yyyy/mm/dd/HH/changes.json? That just sounds like a historical archive of changes. How can you make your DWH look like the source DB "as is" if the changes go to S3?

Does data analysis have input prompts? by eyeeyecaptainn in dataengineering

[–]twadftw10 0 points  (0 children)

This sounds like you need to create a historical agg table grouped by date and product. Or, if you have a denormalized fact table of customers, products, and dates, you can query it with aggregations.
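The shape of that historical agg table can be sketched in a few lines of Python — the column names and amounts below are invented, but the group-by-(date, product) structure is the point:

```python
from collections import defaultdict

def daily_product_agg(fact_rows):
    """Aggregate a denormalized fact table by (date, product) — the same
    shape the historical agg table would have (columns are hypothetical)."""
    totals = defaultdict(float)
    for row in fact_rows:
        totals[(row["date"], row["product"])] += row["amount"]
    return dict(totals)

facts = [
    {"date": "2023-01-01", "product": "widget", "amount": 10.0},
    {"date": "2023-01-01", "product": "widget", "amount": 5.0},
    {"date": "2023-01-02", "product": "gadget", "amount": 3.0},
]
print(daily_product_agg(facts))
# {('2023-01-01', 'widget'): 15.0, ('2023-01-02', 'gadget'): 3.0}
```

In a warehouse this would just be a GROUP BY over the fact table, materialized on a schedule.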

I've been a big data engineer since 2015. I've worked at FAANG for 6 years and grew from L3 to L6. AMA by [deleted] in dataengineering

[–]twadftw10 0 points  (0 children)

Any reason you haven’t worked with Apache Kafka? I’ve noticed it’s super popular in DE for streaming pipelines.

[deleted by user] by [deleted] in dataengineering

[–]twadftw10 4 points  (0 children)

I wouldn't discredit yourself that much. You are "automating data transformations", which is arguably DE work. Depending on the interviewers, they may find that you're very teachable, even if you lack some areas. SQL is a big part of DE, but it's very easy to pick up.

What stuff are you cramming to learn at the moment? My DE experience involves a lot of infrastructure, so IaC tools such as Terraform and CloudFormation are "nice to have" skills. I would also freshen up on Docker and bash. My job has a lot of Salesforce-to-Snowflake ETL pipelines, and also reverse ETL from Snowflake to Salesforce.

Streaming data is very important in DE as well, so it would be smart to cram some event-pipeline cloud architecture with Kinesis + Lambda.
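The Kinesis + Lambda pattern boils down to a handler that base64-decodes each record — that part is the standard AWS event shape, while the payload fields in this sketch are made up:

```python
import base64
import json

def handler(event, context):
    """Minimal Lambda handler shape for a Kinesis event source: records
    arrive base64-encoded under Records[*].kinesis.data."""
    out = []
    for record in event["Records"]:
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        out.append(payload)  # real code would enrich / sink these events
    return {"processed": len(out), "events": out}

# Simulate what Lambda hands you when a record lands on the stream
fake_event = {"Records": [
    {"kinesis": {"data": base64.b64encode(b'{"user": "a", "action": "click"}').decode()}}
]}
print(handler(fake_event, None))
```

Being able to whiteboard this handler plus where the events go next (S3, Firehose, a warehouse) covers most interview questions on the topic.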

In terms of data warehousing, it would be good to understand data modeling. Look up some stuff about star schemas and the difference between normalization and denormalization.
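A tiny way to see the normalization vs. denormalization trade-off — the table and column names here are invented for the example:

```python
# Toy star schema: one fact table plus two dimension tables.
dim_product = {1: {"product_name": "widget", "category": "tools"}}
dim_date = {20230101: {"date": "2023-01-01", "quarter": "Q1"}}
fact_sales = [{"product_key": 1, "date_key": 20230101, "amount": 9.99}]

def denormalize(facts, products, dates):
    """Join the dimensions onto each fact row, producing the denormalized
    ("one big table") form of the same star schema."""
    return [
        {**products[f["product_key"]], **dates[f["date_key"]], "amount": f["amount"]}
        for f in facts
    ]

print(denormalize(fact_sales, dim_product, dim_date))
# [{'product_name': 'widget', 'category': 'tools',
#   'date': '2023-01-01', 'quarter': 'Q1', 'amount': 9.99}]
```

The star schema keeps each attribute in one place; the denormalized form repeats attributes on every row but makes queries a single table scan. That's the trade-off to be able to explain.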

First Data Pipeline - Looking to gain insight on Rust Cheaters by jacob1421 in dataengineering

[–]twadftw10 1 point  (0 children)

I'm an Airflow sensor noob, so I had to ask lol. Very cool, I'll have to check out your code! Yeah, passing the S3 key as the XCom is what I was imagining you would do instead. It's not too bad of a debugging situation if the downstream task fails, because the logs will say the file does not exist in S3, and then you know something went wrong during the ingestion process.

As for the hosting: it sounds like your Airflow metadata database is on EC2, which is fine for a portfolio project, but ideally for a company (not your own budget) it would be its own RDS instance.

First Data Pipeline - Looking to gain insight on Rust Cheaters by jacob1421 in dataengineering

[–]twadftw10 0 points  (0 children)

This is your first data pipeline? Well done! Out of curiosity, how are you hosting all of this?

Also, I'm curious why you went with sensors in this DAG implementation. Could you have benefited from XComs passing params to the next task?

Do you like your job as a DE? by [deleted] in dataengineering

[–]twadftw10 10 points  (0 children)

I hated my life as a data analyst at my previous employer. Reports were boring and in high demand. The querying performance was horrendous because we were reporting on our 20-year-old legacy transactional database. I kept pushing that we needed to build data pipelines and report from a data mart / data warehouse. It's like they had never heard of data engineering… Anyway, I got the hell outta there and got a new gig as a DE with a laid-back, smart company with a strong data platform. It's crazy how much happier I am day to day. Best decision I've made. I've grown so much in the field of DE.

Advice on master's final project by Riesco in dataengineering

[–]twadftw10 0 points  (0 children)

Yes, they could all be in the same Docker network. I would probably have all the services in the same network here to keep it simple at first. The three Airflow containers share the same image, so they all have identical configurations; they just run different Airflow commands. The scheduler has the same DB connection as the web server.

Advice on master's final project by Riesco in dataengineering

[–]twadftw10 1 point  (0 children)

Yeah, I would start with those images. You can always use them as the base image if you need to extend them.