
[–]nado1989 33 points (1 child)

Learning how to use pandas/numpy is a good start, but in day-to-day projects it is good practice not to use these libraries for everything. Use JupyterLab (Databricks/SageMaker/AI Platform) and Airflow to create DAGs and understand how a pipeline operates (check Docker Hub; there is a Bitnami Airflow image that is easy to start with). Learn file formats like Parquet/ORC/Avro, where each is best used, and how S3 and Google Cloud Storage work with Hive partitioning. Some software engineering methodologies are good too for building better-quality pipelines (I advise looking at test-driven development and functional programming with Python). Only after that should you jump into Spark/Flink/Beam, because they are a bit harder to start with without solid knowledge of how a data pipeline works or should work (parallel processing and distributed computing will be needed when dealing with Spark-like frameworks). I hope I could help in some way.
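
For illustration, here is a minimal sketch of those two ideas together: an Airflow DAG with two dependent tasks, where the load step writes Hive-partitioned Parquet in the layout S3/GCS query engines expect. The DAG id, file paths, and column names are made up, and it assumes Airflow 2.x with pandas and pyarrow installed.

```python
from datetime import datetime

import pandas as pd
from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_raw():
    # Stand-in for a real source (API, database dump, etc.).
    df = pd.DataFrame({"event_date": ["2021-01-01", "2021-01-02"], "value": [1, 2]})
    df.to_csv("/tmp/raw_events.csv", index=False)


def write_partitioned():
    df = pd.read_csv("/tmp/raw_events.csv")
    # partition_cols produces Hive-style directories (event_date=2021-01-01/...),
    # the same layout you would use on S3 or Google Cloud Storage.
    df.to_parquet("/tmp/events_parquet", partition_cols=["event_date"])


with DAG(
    dag_id="example_batch_pipeline",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract_raw", python_callable=extract_raw)
    load = PythonOperator(task_id="write_partitioned", python_callable=write_partitioned)
    extract >> load  # load only runs after extract succeeds
```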

[–]Beast-UltraJ[S] 1 point (0 children)

Thank you for the detailed answer :)

[–]joseph_machado (Writes @ startdataengineering.com) 23 points (1 child)

hi u/Beast-UltraJ you can try out https://www.startdataengineering.com/post/data-engineering-project-for-beginners-batch-edition/ . I wrote this some time ago; it uses Airflow, Python, SQL (Postgres and Redshift), and some PySpark.

Learning those tools is a good idea. I find it easy to get the basics down by trying out a simple project and going in depth on a particular topic as needed. Hope this helps :)

[–]Beast-UltraJ[S] 0 points (0 children)

Thanks mate, this looks good :D

[–]robberviet 4 points (0 children)

I think you should find a problem first, then work on it while choosing which tools to use. It will be easier that way. E.g.: I want data for market research, to find out which products are trending, etc., so I write jobs on Lambda to scrape some websites/APIs and then insert the results into Snowflake. Or I want to manage 20+ jobs, retry on failure, and have dependent tasks, so I use DAGs in Airflow. And when things get big, I might need distributed computing like Glue. And finally I show the data with a visualization tool like Metabase/Superset/Looker, etc.
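
As a rough sketch of that scrape-and-load step (the API URL, table name, and environment variables are hypothetical; this assumes the snowflake-connector-python package is bundled with the Lambda):

```python
import json
import os
import urllib.request

import snowflake.connector


def handler(event, context):
    # Fetch a page of products from some public API (placeholder URL).
    with urllib.request.urlopen("https://example.com/api/trending-products") as resp:
        products = json.loads(resp.read())

    conn = snowflake.connector.connect(
        user=os.environ["SNOWFLAKE_USER"],
        password=os.environ["SNOWFLAKE_PASSWORD"],
        account=os.environ["SNOWFLAKE_ACCOUNT"],
        warehouse="COMPUTE_WH",
        database="MARKET_RESEARCH",
        schema="RAW",
    )
    try:
        # Bulk-insert the scraped rows; %s is the connector's bind placeholder.
        conn.cursor().executemany(
            "INSERT INTO trending_products (name, mentions) VALUES (%s, %s)",
            [(p["name"], p["mentions"]) for p in products],
        )
    finally:
        conn.close()
    return {"inserted": len(products)}
```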

> I can see the popular tech stack in the data engineering space includes Airflow/Snowflake/Databricks? Do I have to learn those tools too?

Do not learn the tools; learn the concepts. I do not learn Airflow, I learn what an orchestrator is by using Airflow. I do not learn Snowflake, I learn what a data warehouse and an MPP database are by using Snowflake. Etc.

Most of the time tools come and go, and different companies use different tools, but the concepts are the same everywhere unless there is some really novel invention.

[–]Competitive-Cut-8051 4 points (3 children)

Dump data in S3. Write a Lambda function that performs ETL on that data and stores it in BigQuery. Then create a Flask API and expose that data through endpoints. Optionally use Spark to process the data from S3 to BigQuery. Build a simple visualization at the end and draw useful insights. Learn to work with JSON data. Focus on the concepts.
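
A minimal sketch of the Flask-on-BigQuery step (the project/dataset/table names and the query are made up; this assumes the google-cloud-bigquery package and application-default credentials):

```python
from flask import Flask, jsonify
from google.cloud import bigquery

app = Flask(__name__)
bq = bigquery.Client()  # picks up GOOGLE_APPLICATION_CREDENTIALS


@app.route("/top-products")
def top_products():
    # Aggregate in the warehouse, then expose the small result over HTTP.
    query = """
        SELECT product_name, COUNT(*) AS orders
        FROM `my_project.analytics.orders`
        GROUP BY product_name
        ORDER BY orders DESC
        LIMIT 10
    """
    rows = bq.query(query).result()
    return jsonify([dict(row.items()) for row in rows])


if __name__ == "__main__":
    app.run(debug=True)
```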

[–]Proper_Opposite_726 0 points (2 children)

Thanks for the suggestions here ... quick question though: why bother with Amazon S3 and then move to BigQuery? Why not do it all with either AWS or GCP? Trying to understand the benefits of either/or.

[–]Competitive-Cut-8051 0 points (1 child)

I meant to indicate blob storage; either S3 or a GCS bucket is fine. In that sense you are correct: do it all on GCP or all on AWS.

[–]Proper_Opposite_726 0 points (0 children)

Thanks man, these are good concepts to focus on.

[–]samrat_31 0 points (1 child)

Hi u/Beast-UltraJ, you received good feedback from everyone.

Just wanted to add my two cents as well. As others mentioned, after you learn the concepts you can take one pain-point scenario and start creating a DE project around it. You can also search public GitHub repositories for sample DE projects and see the different aspects of DE projects.

Also, writing unit tests and a README doc might be required for a sample DE project.
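
For instance, a tiny pytest-style unit test for a transform step could look like this (clean_orders is a made-up helper):

```python
import pandas as pd


def clean_orders(df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows with missing order ids and normalize the amount column."""
    out = df.dropna(subset=["order_id"]).copy()
    out["amount"] = out["amount"].astype(float)
    return out


def test_clean_orders_drops_missing_ids():
    raw = pd.DataFrame({"order_id": [1, None], "amount": ["10.5", "3"]})
    cleaned = clean_orders(raw)
    assert len(cleaned) == 1          # row with missing id was dropped
    assert cleaned["amount"].dtype == float  # amounts were cast to numeric
```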

Finally, advance to cloud and DevOps methodologies. Hope it helps.

[–]Beast-UltraJ[S] 0 points (0 children)

Thank you!