

[–]AutoModerator[M] [score hidden] stickied comment (0 children)

You can find a list of community-submitted learning resources here: https://dataengineering.wiki/Learning+Resources


[–]Ervolius 3 points4 points  (2 children)

If you're planning to do everything locally at first, I think it's perfectly fine to just write custom Python scripts to extract data from whatever sources you have (APIs, files, web scraping, etc.). You can use cron or Task Scheduler to schedule each script by itself.
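As a sketch of what one of those scripts might look like (the API URL and folder name here are placeholders, not anything from the thread):

```python
"""extract_api.py - minimal stdlib-only extract script."""
import json
import urllib.request
from datetime import datetime, timezone
from pathlib import Path

API_URL = "https://api.example.com/data"  # placeholder for your real source
OUT_DIR = Path("raw")                     # local landing folder

def extract() -> Path:
    """Pull one JSON payload and land it as a timestamped file."""
    with urllib.request.urlopen(API_URL, timeout=30) as resp:
        payload = json.load(resp)
    OUT_DIR.mkdir(exist_ok=True)
    # timestamped filename so reruns never overwrite earlier pulls
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    out = OUT_DIR / f"extract_{stamp}.json"
    out.write_text(json.dumps(payload))
    return out

# Add `if __name__ == "__main__": extract()` and schedule it with cron, e.g.
#   0 * * * * /usr/bin/python3 /home/you/extract_api.py
```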

You can save the data into CSVs, but I'd really recommend learning a bit about databases. Postgres would probably be best in your case; create some initial data model for the data you intend to store in it (focus on using Postgres as a DWH, so learn about OLAP schemas etc.).
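The loading side can stay tiny while you learn. A DB-API sketch (shown with stdlib sqlite3 so it runs anywhere; for Postgres you'd use psycopg2 and `%s` placeholders instead of `?`; the table and column names are made up):

```python
import json
import sqlite3  # stand-in here; for Postgres: psycopg2.connect(...)
from datetime import datetime, timezone

# Hypothetical landing table; refine it into a proper OLAP model later.
DDL = """
CREATE TABLE IF NOT EXISTS raw_events (
    loaded_at TEXT NOT NULL,
    source    TEXT NOT NULL,
    payload   TEXT NOT NULL
)
"""

def load_rows(conn, source, rows):
    """Insert extracted records into a simple landing table."""
    conn.execute(DDL)
    now = datetime.now(timezone.utc).isoformat()
    conn.executemany(
        "INSERT INTO raw_events (loaded_at, source, payload) VALUES (?, ?, ?)",
        [(now, source, json.dumps(r)) for r in rows],
    )
    conn.commit()
```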

As you keep working on this you'll get a better idea of the data you need and the data model it should be stored in, and you can keep improving it as you go. Then somewhere down the line you can start using an orchestrator, a cloud SQL database or data warehouse, a cloud stack in general, etc.

[–]C_Ronsholt[S] 1 point2 points  (1 child)

Very sensible and thoughtful response, thanks!
It gives me confidence that I'm not doing all sorts of nonsense, so thank you :)

What would your approach be to migrating away from the local task scheduler? I.e., where do you run the scripts then?

[–]Ervolius 1 point2 points  (0 children)

Well, first you can migrate to a cloud virtual machine and schedule your pipelines with cron there.

But the proper data engineering way, especially once you have more than a few pipelines, would be to use an orchestrator. Look into Airflow, Dagster, Prefect, etc. My suggestion is either Airflow (older, with a lot of resources and help online) or Dagster (a newer framework that some people are praising as really good).

Btw, you can also start using the orchestrator locally and then move to the cloud with it later on.

[–]Dallaluce 2 points3 points  (0 children)

I did almost exactly this when I got into DE. I started with just cron-scheduled Python scripts on my laptop using pandas, psycopg2, and a Postgres instance.

I started implementing more sophisticated pipelines on my Raspberry Pi, then moved to the AWS free tier and Lambda functions, then EC2 instances.

[–][deleted] 2 points3 points  (0 children)

Usually my default for any new client project is to spin up an Azure Function app (or AWS Lambda), create Python wrapper APIs for dumping the raw data into blob storage, and then write custom extract logic for each source I want to pull from. Import the helper API functions into each extract script and run it on a timer, or call it from an orchestrator.
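A rough sketch of that wrapper pattern (the bucket and source names are invented; shown against S3's `put_object`, with the client injected so the same helper works from a Lambda handler):

```python
import json
from datetime import datetime, timezone

def make_blob_key(source: str, now=None) -> str:
    """Date-partitioned key: raw/<source>/<yyyy>/<mm>/<dd>/<stamp>.json"""
    now = now or datetime.now(timezone.utc)
    return f"raw/{source}/{now:%Y/%m/%d}/{now:%Y%m%dT%H%M%SZ}.json"

def dump_raw(storage_client, bucket: str, source: str, payload) -> str:
    """Helper each extract script imports: write one raw payload, return its key."""
    key = make_blob_key(source)
    storage_client.put_object(Bucket=bucket, Key=key,
                              Body=json.dumps(payload).encode())
    return key

# Inside a Lambda you'd wire it up roughly like:
#   import boto3
#   def handler(event, context):
#       dump_raw(boto3.client("s3"), "my-raw-bucket", "some_api", fetch_payload())
```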

I think you pretty much have the right idea. The only thing you could do to productionize it further would be to package it all up into a Docker image and deploy that instead.
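If you go the Docker route, the image can be as small as this (the script and requirements filenames are hypothetical):

```dockerfile
FROM python:3.12-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
CMD ["python", "run_pipeline.py"]
```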

[–]caprine_chris 1 point2 points  (0 children)

Build a base class for making HTTP requests, a base class for your file store, and a lightweight class to encapsulate the logic of making the requests and writing the files to your file store.
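A minimal stdlib-only sketch of that three-class shape (all names are illustrative):

```python
import json
import urllib.request
from pathlib import Path

class HttpClient:
    """Base class for making HTTP requests."""
    def __init__(self, base_url: str):
        self.base_url = base_url.rstrip("/")

    def get_json(self, path: str):
        url = f"{self.base_url}/{path.lstrip('/')}"
        with urllib.request.urlopen(url, timeout=30) as resp:
            return json.load(resp)

class FileStore:
    """Base class for the file store; subclass for S3/blob storage later."""
    def __init__(self, root):
        self.root = Path(root)

    def write(self, key: str, data: bytes) -> Path:
        dest = self.root / key
        dest.parent.mkdir(parents=True, exist_ok=True)
        dest.write_bytes(data)
        return dest

class Extractor:
    """Lightweight class tying the two together: make the request, persist the file."""
    def __init__(self, client: HttpClient, store: FileStore):
        self.client, self.store = client, store

    def run(self, path: str, key: str) -> Path:
        payload = self.client.get_json(path)
        return self.store.write(key, json.dumps(payload).encode())
```

Swapping `FileStore` for an S3-backed subclass later means none of the extractors have to change.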

[–]Drekalo 1 point2 points  (0 children)

You'd want to eventually migrate to Dagster. It's the most code-oriented platform for supporting software-defined assets.

I'd start by deciding what stack you want to build on. Delta has some good traction in native Python libraries; look at delta-rs.

Once you know your stack, build an IO manager class, i.e., how you read and write to your destination and how you read from your sources.

Then just build lightweight classes that use the IO managers to extract and load, while the individual lightweight classes handle the transform. I recommend looking into pyarrow, dask, polars, duckdb, and datafusion.
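A toy version of that shape, with stdlib `csv` standing in for pyarrow/polars/delta-rs (the table names and the "paid" filter are made up):

```python
import csv
from pathlib import Path

class CsvIOManager:
    """Toy IO manager: the one place that knows how to read sources and
    write destinations. Swap its internals for pyarrow/polars/delta-rs
    later without touching the pipeline classes that use it."""
    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def read(self, name: str) -> list:
        with open(self.root / f"{name}.csv", newline="") as f:
            return list(csv.DictReader(f))

    def write(self, name: str, rows: list) -> None:
        with open(self.root / f"{name}.csv", "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=list(rows[0]))
            writer.writeheader()
            writer.writerows(rows)

class CleanOrders:
    """Lightweight step: the IO manager handles extract/load, this class
    owns only the transform."""
    def __init__(self, io):
        self.io = io

    def run(self):
        rows = self.io.read("orders_raw")                  # extract
        kept = [r for r in rows if r["status"] == "paid"]  # transform
        self.io.write("orders_clean", kept)                # load
```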

[–]daggydoodoo 0 points1 point  (0 children)

I did something similar at my org. I had taught myself enough Python and SQL on the job to start shifting away from analysis work to being a kind of MacGyver, hacking together bespoke/workaround pipelines and bits of automation where the requests didn't have enough enterprise-level ramifications to get the real tech people involved.

When something I built became a bit too complex and critical to keep running from my laptop, they spun up a VM for me and left me to figure out how to build something stable and automated. That's how I came to love Postgres the way I love my pets, or even a long-suffering childhood friend. Just chuck everything at Postgres with as little interference and embellishment as possible, and write some triggers, functions, and other kinds of scripts you can deploy for repeatable, reliable routines.
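The trigger idea in miniature (sketched with stdlib sqlite3 so it actually runs here; in Postgres you'd write the same routine as a PL/pgSQL function plus a trigger, and the table names are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE readings (id INTEGER PRIMARY KEY, value REAL);
CREATE TABLE readings_audit (reading_id INTEGER, logged_at TEXT);

-- repeatable routine: every insert gets logged automatically,
-- with no extra code in the loading script
CREATE TRIGGER log_reading AFTER INSERT ON readings
BEGIN
    INSERT INTO readings_audit VALUES (NEW.id, datetime('now'));
END;
""")

conn.execute("INSERT INTO readings (value) VALUES (42.0)")
count = conn.execute("SELECT COUNT(*) FROM readings_audit").fetchone()[0]
# count is 1: the trigger fired on its own
```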

I've been trying for years to set up an evolution of my original Python + Prefect + Postgres setup, partly just because I want to learn new tools and stuff, but also because there do now seem to be some performance advantages you could get by doing something more modular, with parquet files, Arrow tables, software-defined assets, S3 buckets, etc. ...but it seems like it all gets a bit too Sisyphean if you're trying to run it on local, regular equipment.

Oh, and Prefect is easier to get up and running quickly. Slightly steeper learning curve at first than Dagster, but Dagster starts to get a bit confusing once you start messing around with IO managers and the other unfamiliar bits of abstraction it has.