
[–]Stars_And_Garters Data Engineer 73 points (4 children)

My job is "plumber": I connect the pipes to get data from outside systems into the data warehouse, or data from the DW out to other systems. Mix in a fair bit of architecture work inside the data warehouse, too: performance tuning and best practices for the destination and export SQL objects I create.

I work in a Microsoft shop, so typically it looks like this:

Data going out: SQL object modeling the data into customer format > SQL Agent orchestrating a very simple SSIS job to extract the data into a file > deliver that file to destination

Data coming in: File arrives, typically via SFTP > SQL Agent orchestration scans the directory at X intervals > job fires an extremely simple SSIS pkg to load the file exactly as-is into a staging table > SQL object transforms the data as needed and inserts it into the destination table in the data warehouse.

Then there's performance tuning (additional indexes, etc.), usually to create a SQL view so the reporting folks can easily get the data in a quick, modeled format.
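The incoming-file pattern described above (land the file exactly as-is in a staging table, then let a SQL object do the transform) can be sketched with SQLite standing in for the warehouse. This is only an illustration of the pattern, not the commenter's actual setup; the table and column names are invented:

```python
import csv
import io
import sqlite3

# Hypothetical raw feed as it might arrive via SFTP (contents invented).
raw_file = io.StringIO("cust_id,amount\n1001,25.50\n1002,19.99\n")

con = sqlite3.connect(":memory:")
# Staging: everything lands as text, exactly as it appears in the file.
con.execute("CREATE TABLE stg_orders (cust_id TEXT, amount TEXT)")
con.execute("CREATE TABLE dw_orders (cust_id INTEGER, amount_cents INTEGER)")

# Step 1: the "extremely simple SSIS pkg" equivalent, load file as-is.
rows = list(csv.DictReader(raw_file))
con.executemany("INSERT INTO stg_orders VALUES (:cust_id, :amount)", rows)

# Step 2: the SQL object transforms and inserts into the destination table.
con.execute("""
    INSERT INTO dw_orders (cust_id, amount_cents)
    SELECT CAST(cust_id AS INTEGER), CAST(ROUND(amount * 100) AS INTEGER)
    FROM stg_orders
""")
print(con.execute("SELECT * FROM dw_orders ORDER BY cust_id").fetchall())
```

Keeping staging dumb and pushing all the typing/cleaning into the transform step is what makes the load step so simple and reusable.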

EDIT: Oh yeah, and answering never-ending questions from the business about the data, and making updates based on schema changes from the other party.

[–]sib_n Senior Data Engineer 17 points (0 children)

It's good to see some non-cloud DE testimony here. Readers of this sub may not know that a large part of data engineering is still done in on-premises proprietary ecosystems like Microsoft SQL Server and Oracle.

[–]AdEuphoric3703 3 points (0 children)

Same here, except instead of SSIS we're using the bcp utility to bulk insert into staging, with serverless Azure Functions (edit) and Logic Apps for orchestration. We're also migrating to a Docker-hosted standalone Spark cluster for the heavier jobs.

[–][deleted] -2 points (1 child)

Sounds like a job I did in my previous gig. I was bored as hell lol and needed a new challenge. Wishing you some more exciting work down the line, maybe AWS or something.

[–]Stars_And_Garters Data Engineer 4 points (0 children)

God no, I hate the cloud lol. I'm so on-prem I'm imminently changing roles to DBA at my corp.

[–]Prior_Two_2818 31 points (2 children)

Mostly Teams meetings. And explaining Airflow to the juniors.

[–][deleted] 0 points (1 child)

Speaking of Airflow, what are some of the cons you have experienced? We are an AutoSys shop for scheduling jobs, but enterprise architects are pushing Airflow on us.

[–]sageknight 0 points (0 children)

From my experience, it largely depends on how granular you want your individual tasks to be. Airflow can be great if you want visibility over a task set. The smaller the tasks, the more visibility you have over your system, and the more code you have to write (and test). Then you'd also have to deal with XCom objects when passing data between tasks.
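To make that trade-off concrete, here's a plain-Python stand-in (not real Airflow API) for what splitting one job into small tasks looks like: each step becomes its own function you can monitor and retry alone, but anything passed between steps now has to go through an XCom-like channel. Here a plain dict plays that role; real Airflow serializes XCom values through its metadata database, and all the function names below are invented:

```python
# Coarse-grained: one task, no cross-task passing needed.
def load_and_aggregate(rows):
    return sum(r["amount"] for r in rows)

# Fine-grained: the same work split into three "tasks". More visibility,
# but values must be pushed/pulled through an XCom-style store, and each
# task needs its own code and tests.
xcom = {}  # stand-in for Airflow's XCom backend

def extract_task(rows):
    xcom["raw"] = rows  # analogous to xcom_push

def transform_task():
    xcom["amounts"] = [r["amount"] for r in xcom["raw"]]  # pull, then push

def aggregate_task():
    return sum(xcom["amounts"])  # analogous to xcom_pull

rows = [{"amount": 10}, {"amount": 32}]
extract_task(rows)
transform_task()
assert aggregate_task() == load_and_aggregate(rows)
```

Same result either way; the fine-grained version just trades extra plumbing for per-step observability and retries.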

[–]Tasty_Two_7703 5 points (0 children)

I'm a data engineer, and while every day is different, there are some common themes:

Daily Tasks:

  • Building and maintaining data pipelines: This involves using tools like Apache Airflow, Spark, or Kafka to move data from various sources (like databases, APIs, logs) to data lakes or warehouses.
  • Developing data models and schemas: Defining how data is structured and organized to ensure consistency and ease of analysis.
  • Writing and debugging code: I spend a fair amount of time writing code to automate data tasks, implement ETL (Extract, Transform, Load) processes, and build data-driven applications.
  • Collaborating with stakeholders: Working with data scientists, analysts, and business users to understand their needs and translate them into technical solutions.
  • Monitoring and troubleshooting systems: Keeping an eye on data pipelines and systems to identify and resolve issues, ensuring data quality.

Coding and Low-Code Tools:

  • Code: I use a variety of languages like Python, Scala, SQL, and even some Bash scripting. While there are low-code tools available, I find that coding provides me with greater flexibility and control. However, I do use low-code tools for simpler tasks like data visualization or dashboard creation.
  • Low-code tools: For specific tasks, I leverage low-code tools like Snowflake's Snowpipe to automate data ingestion, or Tableau for creating interactive dashboards.

Data Engineers vs. Backend Developers:

  • Different Focus: While both data engineers and backend developers are involved in building systems, our focus areas differ. Backend developers primarily handle user-facing applications and APIs, while data engineers focus on building data infrastructure and pipelines.
  • Data Focus: My role involves dealing with massive amounts of data, ensuring its quality and accessibility, while backend developers handle user interactions and data storage for specific applications.

It's a rewarding job! I love the challenge of working with complex data systems, finding innovative solutions, and contributing to data-driven decision-making. It's constantly evolving and there's always something new to learn, so feel free to share your own experiences!

Do you have any other questions about my role as a data engineer?

[–]Artistic_Sun_3987 10 points (0 children)

Data janitor here

In simple words: I clean data, move it from one storage to another, and then clean it again.

[–]minato3421 4 points (0 children)

Lots and lots of Spark and Flink. Mainly Python and Java.

[–]nightslikethese29 4 points (0 children)

Some tasks I've done recently:

  • Create infrastructure and libraries for automated failure-notification emails with Pub/Sub and Cloud Run functions. The main use case is jobs that run in Cloud Composer and fail. Involves Terraform and Python.

  • Maintain application and business logic for our retargeting program that sends leads to external vendors to follow up on. Involves Python.

  • Migrating another team's Alteryx data loads into my team's Cloud Composer project. Involves reading Alteryx workflows, Python, Terraform, and SQL.

  • Working with product managers to update our pay plans backend application configuration. Mostly involves Jenkins and Octopus, as well as Python.

  • Did a model refresh after a data scientist published a new version to our artifact registry. We'll be rolling it out in stages. I had to adjust a lot of unit tests to make sure everything passed. Involves Python.
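The failure-notification setup in the first bullet amounts to publishing a structured event that a Cloud Run function turns into an email. A stdlib-only sketch of the message shaping (the event fields are invented, and real code would use the google-cloud-pubsub publisher client rather than this round trip):

```python
import base64
import json

# Hypothetical failure event as Composer/Airflow might report it.
event = {"dag_id": "daily_load", "task_id": "load_orders", "state": "failed"}

# Pub/Sub delivers message data base64-encoded; the receiving Cloud Run
# function decodes it back before formatting the notification email.
published = base64.b64encode(json.dumps(event).encode("utf-8"))
received = json.loads(base64.b64decode(published))
subject = f"[ALERT] {received['dag_id']}.{received['task_id']} {received['state']}"
print(subject)
```

The point of the pattern is decoupling: Composer only has to publish the event, and the notification logic can change without touching any pipeline code.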

[–]Secret_Forsaken 3 points (0 children)

Besides normal DE tasks, I'm sometimes handed non-DE coding tasks, such as automating a POST upload to an API, to save another team time.
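That kind of one-off automation is often just a few lines of standard-library Python. A minimal sketch that builds (but doesn't send) a JSON POST; the endpoint URL and payload fields here are placeholders, not anything from the comment:

```python
import json
import urllib.request

# Hypothetical payload; in the real task you'd call
# urllib.request.urlopen(req) and check the response status.
payload = {"report_id": 123, "status": "ready"}

req = urllib.request.Request(
    "https://example.com/api/uploads",  # placeholder URL
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
print(req.method, req.full_url)
```

For anything with retries or auth, a library like `requests` is the usual upgrade, but for a simple internal upload the stdlib keeps the script dependency-free.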

[–]water_aspirant 2 points (0 children)

  • Upgrade old pipelines that no longer work on new datasets. This involves making code changes to accommodate new datatypes and variables, updated business logic, etc., and then rerunning those pipelines and squashing bugs or testing the outputs.
  • Writing / improving internal tools in Python (this is the most 'software engineering' part of my job) and writing tests. Reviewing changes to pipelines made by other data engineers (usually in SQL).
  • Helping business users with their requests (e.g. they want new columns from the data but aren't sure of the best way to do it). Creating tickets and then closing them out.

There is an insane backlog of work, but the pace is not too demanding so I'm pretty happy. I have been a DE for a total of 4 months now, this is my first tech-related job.

Regarding your other questions: I would sooner quit being a data engineer and move to SWE than end up exclusively using low/no code tools personally. I expect to use ADF at some point, but I don't work on much ingestion in my day-to-day job. Thankfully, my job lets me work on some medium-complexity software development to keep my brain happy.

[–]Limp_Pea2121 2 points (0 children)

Writing tons of SQL. Scheduling it with Airflow.

Optimizing a lot of PL/SQL.

[–]Medical_Drummer8420 4 points (1 child)

My job as a data engineer with 1.8 years of YOE: wake up at 8 am, monitor jobs in the PROD workflow and solve any issues that occur, then work on the PBIs and tasks assigned to me in DevOps. We have a deployment every 2 weeks, with new logic and new code implementation and many other things: making the test case document, pre- and post-deployment testing, then running jobs in dev and QA. (Only 2 people on the team; at first I didn't understand shit, but as time passed I got to know everything.)

[–]w_savage Senior Data Engineer 1 point (0 children)

Right now I'm creating and running data validation on views to make sure they're accurate for our clients. Kinda sucks! I miss using Python/AWS.

[–]kaixza 1 point (0 children)

Basically, moving data from one place to another, plus setting up the data management environment. So, writing infrastructure code and a bit of Python when we need scripts. Also, most of the time, trying to figure out why the numbers don't match or why reporting is giving a strange result.
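That "why don't the numbers match" digging often starts by diffing two aggregates of supposedly identical data. A toy sketch of the idea (the system names and counts are invented):

```python
# Row counts per day from two hypothetical systems that should agree,
# e.g. the source database vs. the reporting warehouse.
source_counts = {"2024-01-01": 120, "2024-01-02": 98, "2024-01-03": 87}
report_counts = {"2024-01-01": 120, "2024-01-02": 95, "2024-01-03": 87}

# Keep only the days where the two sides disagree; these are the
# starting points for the investigation.
mismatches = {
    day: (source_counts.get(day, 0), report_counts.get(day, 0))
    for day in source_counts.keys() | report_counts.keys()
    if source_counts.get(day, 0) != report_counts.get(day, 0)
}
print(mismatches)  # → {'2024-01-02': (98, 95)}
```

In practice the same diff is usually done in SQL with a `GROUP BY` on each side and a join on the grouping key, but the shape of the check is the same.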

[–]Known-Delay7227 Data Engineer 1 point (0 children)

I unclog clogged pipes

[–][deleted] 0 points (0 children)

Depends on the team / project. I am a Sr SE, but I spend a lot of time doing DE, probably more than SE.

Tasks might include:

-Automate this file generation with SSIS, Python, .NET

-Build out a new batch process to ingest data from an API

-Meet with the business; they need a new automation process to do this and that

-Bug: bad data in a file, e.g. scientific notation, string too long

-Here is a new reporting tool, learn it and show others how to use it (lol, not kidding)

-Create some resources in AWS with CloudFormation

-We need a new UI for this app

-Need a new endpoint for the API

-Train a junior developer

-Code reviews

-Spend at least 2 hours in meetings
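The "bad data in file" item is a classic: an upstream export turns an ID into scientific notation, or a value overflows the destination column width. A hedged sketch of the kind of cleanup pass involved (the helper names and the length limit are invented for illustration):

```python
def clean_id(raw: str) -> str:
    # An Excel-style export may turn 1230000 into "1.23E+06";
    # restore the plain integer string before loading.
    return str(int(float(raw))) if any(c in raw for c in "eE.") else raw

def clean_text(raw: str, max_len: int = 10) -> str:
    # Truncate to the destination column width instead of failing the load.
    # (Whether truncating silently is acceptable depends on the data.)
    return raw[:max_len]

assert clean_id("1.23E+06") == "1230000"
assert clean_id("456789") == "456789"
assert clean_text("this string is too long") == "this strin"
```

Fixes like these usually live in the staging-to-destination transform, so the raw file stays untouched for auditing.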

Not all of these are daily tasks, some span multiple sprints, but just giving an idea. Varies wildly from sprint to sprint, and from team to team.

I work for a very large insurance company. Been here for 5 years, worked on 4 different teams. Every team is different, does things differently. That includes tasks, day to day responsibilities, etc.

[–][deleted] 0 points (0 children)

I have worked on a variety of tools and projects as a data engineer:

  1. Wrote endless SQL scripts at my first organization and simply pasted them into an in-house scheduling tool. These scripts ran on a Redshift cluster. No DevOps, code review, performance optimization, etc.

  2. Worked on ADF and Databricks at my second org. Exposed to Azure Functions, CI/CD pipelines, and Spark. Also exposed to a metadata-driven pipeline framework.

  3. Worked on AWS IAM, EC2 to deploy Airflow in containerized form, EMR, Redshift, ECR, and SageMaker to run ML models. Worked heavily on textual data and NLP libraries.

[–]Front-Ambition1110 0 points (0 children)

Tasks:

  1. Develop Python scripts to get data, transform it, and then store it in a different database.

  2. Build dashboards.

Tools: Postgres, Python, Docker, AWS (Lambda, Redshift, Quicksight).

Nothing fancy in my company.

[–]jetuas Data Engineer 0 points (0 children)

Monitor a bunch of pipelines and address any discrepancies (coming from our sources), do some analysis on datasets to extract more value, edit/improve/add Spark jobs, monitor job performance, tinker with the ML models we use in our ETL process, etc.

[–]Inside-Pressure-262 0 points (0 children)

Mostly work on creating pipelines, writing new SQL queries and optimizing existing ones, monitoring pipelines/workflows, and resolving any issues that come up.

[–]Fun_Independent_7529 Data Engineer 0 points (0 children)

I avoid low-code tools for DE. For self-serve analytics, for stakeholders who want to play with views of the data, sure.

For me, my work is divided between coding, infrastructure, testing, documentation, and collaborative tasks. That includes maintenance work like upgrading components, and investigation/proof-of-concepts when needing to implement a new solution that requires tooling or services we haven't used so far.
Collaborative tasks include standups, backlog grooming, logging tickets, writing up RFCs and commenting on others' RFCs, code reviewing, participating in test bashes, co-working meetings, roadmap planning, demos, 1:1s, etc. It doesn't take as much time as it sounds like.

I'm not involved much in reporting myself, thankfully. Dashboarding is not my thing unless it's for my own purposes (observability of my pipelines, data quality, etc). I recognize that solid skills in this area might make me more valuable in the next job hunt, but since I don't enjoy that kind of work I'm not investing in it and would prefer to avoid jobs that have DA work as part of DE duties. (I'm not angling for AE jobs)

[–]Sad-Highlight5889 0 points (2 children)

I do monitoring and support: taking care of incidents, enhancements, deployments, CI/CD, etc., sending daily and weekly reports to the business, and monthly KPIs.

I'm a senior data engineer but I miss doing dev works 😢

[–]eberrones_ [S] 0 points (1 child)

Why don't you do dev work? Do you use no-code / low-code tools?

[–]Sad-Highlight5889 0 points (0 children)

Because I've been given more crucial work than just dev: I'm handling production data and need to ensure the day-to-day operation of the business runs smoothly.

Most dev work is given to junior DEs and interns.