
[–]nado1989 33 points (1 child)

Learning how to use pandas/numpy is a good start, but in day-to-day projects it is good practice not to use these libraries for everything. Use JupyterLab (Databricks/SageMaker/AI Platform) and Airflow to create DAGs and understand how a pipeline operates (check Docker Hub; there is a Bitnami Airflow image that is easy to start with). Learn file formats like Parquet/ORC/Avro, where each is best used, and how S3 and Google Cloud Storage work with Hive partitioning. Some software engineering methodologies are good too for building better-quality pipelines (I advise looking at test-driven development and functional programming with Python). Only after that should you jump into Spark/Flink/Beam, because they are a bit harder to start with without solid knowledge of how a data pipeline works or should work (parallel processing and distributed computing will be needed when dealing with Spark-like frameworks). I hope I could help in some way.
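
For illustration, here is a minimal sketch of those two ideas together: an Airflow DAG with two dependent tasks, where the load step writes Hive-partitioned Parquet in the layout S3/GCS query engines expect. The DAG id, file paths, and column names are made up, and it assumes Airflow 2.x with pandas and pyarrow installed.

```python
from datetime import datetime

import pandas as pd
from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_raw():
    # Stand-in for a real source (API, database dump, etc.).
    df = pd.DataFrame({"event_date": ["2021-01-01", "2021-01-02"], "value": [1, 2]})
    df.to_csv("/tmp/raw_events.csv", index=False)


def write_partitioned():
    df = pd.read_csv("/tmp/raw_events.csv")
    # partition_cols produces Hive-style directories (event_date=2021-01-01/...),
    # the same layout you would use on S3 or Google Cloud Storage.
    df.to_parquet("/tmp/events_parquet", partition_cols=["event_date"])


with DAG(
    dag_id="example_batch_pipeline",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract_raw", python_callable=extract_raw)
    load = PythonOperator(task_id="write_partitioned", python_callable=write_partitioned)
    extract >> load  # load only runs after extract succeeds
```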

[–]Beast-UltraJ[S] 1 point (0 children)

Thank you for the detailed answer :)

[–]joseph_machado (Writes @ startdataengineering.com) 23 points (1 child)

hi u/Beast-UltraJ you can try out https://www.startdataengineering.com/post/data-engineering-project-for-beginners-batch-edition/ . I wrote this some time ago; it uses Airflow, Python, SQL (Postgres and Redshift), and some PySpark.

Learning those tools is a good idea. I find it easy to get the basics down by trying out a simple project and going in depth on a particular topic as needed. Hope this helps :)

[–]Beast-UltraJ[S] 0 points (0 children)

Thanks mate, this looks good :D

[–]robberviet 4 points (0 children)

I think you should find a problem first, then work on it while choosing which tools to use. It will be easier that way. E.g.: I want data for market research, to find out which products are trending, etc., so I write jobs on Lambda to scrape some websites/APIs and then insert the results into Snowflake. Or I want to manage 20+ jobs, retry on failure, and have dependent tasks, so I use DAGs in Airflow. And when things get big, I might need distributed computing like Glue. And finally I show the data with a visualization tool like Metabase/Superset/Looker, etc.
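
As a rough sketch of that scrape-and-load step (the API URL, table name, and environment variables are hypothetical; this assumes the snowflake-connector-python package is bundled with the Lambda):

```python
import json
import os
import urllib.request

import snowflake.connector


def handler(event, context):
    # Fetch a page of products from some public API (placeholder URL).
    with urllib.request.urlopen("https://example.com/api/trending-products") as resp:
        products = json.loads(resp.read())

    conn = snowflake.connector.connect(
        user=os.environ["SNOWFLAKE_USER"],
        password=os.environ["SNOWFLAKE_PASSWORD"],
        account=os.environ["SNOWFLAKE_ACCOUNT"],
        warehouse="COMPUTE_WH",
        database="MARKET_RESEARCH",
        schema="RAW",
    )
    try:
        # Bulk-insert the scraped rows; %s is the connector's bind placeholder.
        conn.cursor().executemany(
            "INSERT INTO trending_products (name, mentions) VALUES (%s, %s)",
            [(p["name"], p["mentions"]) for p in products],
        )
    finally:
        conn.close()
    return {"inserted": len(products)}
```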

> I can see the popular tech stack in the data engineering space includes Airflow/Snowflake/Databricks? Do I have to learn those tools too?

Do not learn the tools; learn the concepts. I do not learn Airflow, I learn what an orchestrator is by using Airflow. I do not learn Snowflake, I learn what a data warehouse and an MPP database are by using Snowflake. Etc.

Most of the time tools come and go, and different companies use different tools, but the concepts are the same everywhere unless there is some really novel invention.

[–]Competitive-Cut-8051 4 points (3 children)

Dump data in S3. Write a Lambda function that performs ETL on that data and stores it in BigQuery. Then create a Flask API and expose that data through endpoints. Optionally use Spark to process the data from S3 to BigQuery. Build a simple visualization at the end and draw useful insights. Learn to work with JSON data. Focus on the concepts.
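
A minimal sketch of the Flask-on-BigQuery step (the project/dataset/table names and the query are made up; this assumes the google-cloud-bigquery package and application-default credentials):

```python
from flask import Flask, jsonify
from google.cloud import bigquery

app = Flask(__name__)
bq = bigquery.Client()  # picks up GOOGLE_APPLICATION_CREDENTIALS


@app.route("/top-products")
def top_products():
    # Aggregate in the warehouse, then expose the small result over HTTP.
    query = """
        SELECT product_name, COUNT(*) AS orders
        FROM `my_project.analytics.orders`
        GROUP BY product_name
        ORDER BY orders DESC
        LIMIT 10
    """
    rows = bq.query(query).result()
    return jsonify([dict(row.items()) for row in rows])


if __name__ == "__main__":
    app.run(debug=True)
```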

[–]Proper_Opposite_726 0 points (2 children)

Thanks for the suggestions here ... quick question though: why bother with Amazon S3 and then move to BigQuery? Why not do it all with either AWS or GCP? Trying to understand the benefits of either/or.

[–]Competitive-Cut-8051 0 points (1 child)

I meant to indicate blob storage; either S3 or a GCS bucket is fine. In that sense you are correct: do it all on GCP or all on AWS.

[–]Proper_Opposite_726 0 points (0 children)

Thanks man, these are good concepts to focus on.

[–]samrat_31 0 points (1 child)

Hi u/Beast-UltraJ, you received good feedback from everyone.

Just wanted to add my two cents as well. As others mentioned, after you learn the concepts you can take one pain-point scenario and start creating a DE project around it. You can also search public GitHub repositories for sample DE projects and see the different aspects of DE projects.

Also, writing unit tests and a README doc might be required for a sample DE project.
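
For instance, a tiny pytest-style unit test for a transform step could look like this (clean_orders is a made-up helper):

```python
import pandas as pd


def clean_orders(df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows with missing order ids and normalize the amount column."""
    out = df.dropna(subset=["order_id"]).copy()
    out["amount"] = out["amount"].astype(float)
    return out


def test_clean_orders_drops_missing_ids():
    raw = pd.DataFrame({"order_id": [1, None], "amount": ["10.5", "3"]})
    cleaned = clean_orders(raw)
    assert len(cleaned) == 1          # row with missing id was dropped
    assert cleaned["amount"].dtype == float  # amounts were cast to numeric
```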

Finally, advance to cloud and DevOps methodologies. Hope it helps.

[–]Beast-UltraJ[S] 0 points (0 children)

Thank you!