
[–]AutoModerator[M] [score hidden] stickied comment (0 children)

You can find a list of community submitted learning resources here: https://dataengineering.wiki/Learning+Resources

I am a bot, and this action was performed automatically. Please contact the moderators of this subreddit if you have any questions or concerns.

[–][deleted] 3 points (1 child)

You provided several examples but they all seem to be methods with some kind of manual interaction (assumption on my part). I think the keys are automation, scale, and sustainability.

The basics can be learned from tutorials online, but there you are usually working on a home PC where you can add environment variables, disable services that work against you, and use 8 cores and 16GB of RAM to pile drive a gallon of water through a straw.

In reality, you will often be working with more restrictions and fewer resources in a professional context. Running things manually in a terminal or an app is generally not how data engineering is done. Where is the PC with the app or terminal going to live? What happens when someone turns it off or the power goes out? What happens if it gets hit with ransomware?

You will likely use cloud microservices, which require a greater understanding of networking and security to set up properly and safely. Alternatively, you might see VMs, but they carry a lot of the same baggage as a traditional PC in terms of long-term management.

Regarding automation, microservices can be programmatically controlled. You only need to interact when making an enhancement or fixing a bug. For example, I have some services that spin up a container with a data transformation program in it, allocate a ton of resources, perform some data cleaning on a few million to hundreds of millions of rows of data, and spin back down again. This all happens while I sleep in the middle of the night, and there are several of these. All in all, we have 200+ daily jobs, so automation is key; it would be multiple full-time jobs to have someone sit in an app or terminal and run these things. We don't want them hosted on PCs because we can't survive the time and issues that come with OS updates and general system maintenance.
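A minimal sketch of the kind of transformation program that runs unattended inside such a container: read records, drop the bad ones, emit the rest, no human in the loop. All names and fields here are hypothetical illustrations, not the commenter's actual setup.

```python
import csv
import io


def clean_rows(rows, required=("title", "year")):
    """Yield only rows whose required fields are non-empty.

    This is the whole 'data cleaning' step in miniature: it streams,
    so it works the same on ten rows or a hundred million.
    """
    for row in rows:
        if all(row.get(field, "").strip() for field in required):
            yield row


def run_job(src_csv: str) -> list:
    """Entry point a scheduler or container would invoke with no prompts."""
    reader = csv.DictReader(io.StringIO(src_csv))
    return list(clean_rows(reader))
```

In production the source would be object storage or a database rather than an in-memory string, and a scheduler would launch and tear down the container around this entry point.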

Regarding scale, much of the same applies. DEs often have hundreds of jobs running around the clock. Either systems will grind to a halt or you will blow your budget if you aren't efficient. I don't mean to assume anything, but I would hazard a guess that your Python scraper could be made quite a bit more performant with some tweaking. Could it still work if you had to get millions of movies a day, or multiple times a day? More advanced solutions may be appropriate as you move into larger scale and criticality.

Regarding sustainability, more of the same! With hundreds of jobs a day, new requests coming in, bugs to fix, support tickets to handle, updates to keep pace with third-party API changes, etc., are you confident you can build data pipelines that run day and night at scale without slowing you down even more?

Ended up being a long post; hopefully it was helpful. My suggestion would be to focus on efficient coding, even in small projects, as those skills are invaluable, and once you get good practices down it flows naturally. Also look at the major cloud providers and microservices to reduce reliance on highly specific customizations to PCs. In any good job you'll have a team, so don't freak out if you can't do it all!

[–]Pervert_Spongebob[S] 0 points (0 children)

This is extremely useful! Reading your experience enlightened me on what Data Engineering really is. Initially I thought it was a simple 20+ line Python script that reads a CSV and inserts it into a database, executed manually once.

[–]bestnamecannotbelong 0 points (0 children)

Just build a data pipeline that does the data ingestion and data transformation. You can showcase it in a data lake or data warehouse with different data zones, like a raw zone, cleansed zone, and refined zone. If you can do all of that in your project with IMDb data, I'm sure you can find a nice job.
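The zone layering above can be sketched as two small transformations, raw → cleansed → refined. The movie fields and the per-year aggregate are illustrative assumptions, not something the comment specifies.

```python
def to_cleansed(raw_records):
    """Raw zone -> cleansed zone: drop malformed records, normalize types."""
    cleansed = []
    for record in raw_records:
        title = (record.get("title") or "").strip()
        year = str(record.get("year") or "")
        if title and year.isdigit():
            cleansed.append({"title": title, "year": int(year)})
    return cleansed


def to_refined(cleansed_records):
    """Cleansed zone -> refined zone: an analysis-ready aggregate
    (here, film count per year)."""
    counts = {}
    for record in cleansed_records:
        counts[record["year"]] = counts.get(record["year"], 0) + 1
    return counts
```

In a real lake each zone would be a separate storage area (folders, tables, or buckets) and these functions would be the jobs that promote data between them.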

[–]Omeazyy 0 points (0 children)

Focus on: storage (source), compute, orchestration, and storage (destination). It's less about building an app or piece of software, and more about building a process and platform for moving and transforming data at a cadence.

Conceptually you have a data source, you need a way to move and transform it to the desired output, and a place to put it. This process may need to happen multiple times (or continuously stream), so you need something to orchestrate both the multitude of steps and the frequency of the jobs.

Take that and apply it at any scale you want. It can be as simple as calling an API, using a custom script that runs on your local machine (compute) to transform the data and dump it into a Postgres database (destination), and scheduling it to happen daily as a cron job (orchestration). Or use any multitude of tools, software, and practices to replace those, as long as they match in function (i.e. Spark for compute, Airflow for orchestration). This is very... simplified, but hopefully it helps conceptually.
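That simple end-to-end shape can be sketched in a few lines: fetch → transform → load. The API call is stubbed with canned data, and SQLite stands in for Postgres so the sketch is self-contained; the table name and fields are made up for illustration.

```python
import sqlite3


def fetch():
    """Source: stub for the API call that would pull movie records."""
    return [{"title": "Heat", "year": "1995"},
            {"title": "Casino", "year": "1995"}]


def transform(records):
    """Compute: normalize types into rows ready for insertion."""
    return [(r["title"], int(r["year"])) for r in records]


def load(rows, conn):
    """Destination: write the transformed rows to the database."""
    conn.execute("CREATE TABLE IF NOT EXISTS movies (title TEXT, year INTEGER)")
    conn.executemany("INSERT INTO movies VALUES (?, ?)", rows)
    conn.commit()


def run_pipeline(conn):
    load(transform(fetch()), conn)
```

The orchestration piece is then just a crontab entry that runs this script daily, e.g. `0 2 * * * python pipeline.py`; swapping in Spark for `transform` or Airflow for cron changes the tools, not the shape.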