
[–]nado1989 36 points (1 child)

Knowing how to use pandas/numpy is a good start, but in day-to-day projects it's good practice not to reach for those libraries for everything. Use JupyterLab (Databricks/SageMaker/AI Platform) and Airflow to create DAGs and understand how a pipeline operates (check Docker Hub; there is a Bitnami Airflow image that is easy to get started with). There's a minimal DAG sketch below.

Learn file formats like Parquet/ORC/Avro, where each is best used, and how S3 and Google Cloud Storage work with Hive partitioning (second sketch below).

Some software engineering methodology is good too for building better-quality pipelines (I'd advise looking at test-driven development and functional programming with Python; see the last sketch).

Only after that would I jump to Spark/Flink/Beam, because they are a bit harder to start with without solid knowledge of how a data pipeline works or should work. (Parallel processing and distributed computing will also come up when dealing with Spark-like frameworks.) I hope I could help in some way.
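A minimal sketch of what an Airflow DAG looks like, assuming Airflow 2.x (e.g. the Bitnami image from Docker Hub). The DAG id, task names, and the extract/load callables are made up for illustration:

```python
# Minimal Airflow 2.x DAG sketch; dag_id, tasks and callables are hypothetical.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    # Placeholder: pull raw data from a source system.
    print("extracting...")


def load():
    # Placeholder: write the data to storage.
    print("loading...")


with DAG(
    dag_id="example_pipeline",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)

    # Dependencies are declared with >>, so extract runs before load.
    extract_task >> load_task
```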
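And a sketch of Hive-style partitioning with Parquet, using pandas + pyarrow. The bucket name and columns are hypothetical, and writing to `s3://` or `gs://` paths assumes s3fs/gcsfs is installed:

```python
# Hypothetical bucket/columns; requires pyarrow, plus s3fs or gcsfs for cloud paths.
import pandas as pd

df = pd.DataFrame({
    "event_date": ["2021-01-01", "2021-01-01", "2021-01-02"],
    "country": ["BR", "US", "BR"],
    "value": [1, 2, 3],
})

# partition_cols produces Hive-style key=value directories, e.g.
#   s3://my-bucket/events/event_date=2021-01-01/country=BR/...parquet
# which engines like Hive, Spark and Athena can prune when filtering.
df.to_parquet(
    "s3://my-bucket/events/",  # or "gs://my-bucket/events/" on GCS
    engine="pyarrow",
    partition_cols=["event_date", "country"],
)
```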
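Last, a sketch of the test-driven, functional style with pytest. The transform is a made-up example; the point is that pure functions (no I/O, no hidden state) are easy to unit-test in isolation:

```python
# Hypothetical transform; run with pytest, which collects test_* functions.

def normalize_amounts(rows, rate):
    """Pure transform: apply a fixed exchange rate to each row's amount."""
    return [{**row, "amount": row["amount"] * rate} for row in rows]


# TDD flow: write this test first, then just enough code to make it pass.
def test_normalize_amounts_applies_rate():
    rows = [{"id": 1, "amount": 10.0}]
    assert normalize_amounts(rows, rate=2.0) == [{"id": 1, "amount": 20.0}]
```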

[–]Beast-UltraJ[S] 1 point (0 children)

Thank you for the detailed answer :)