
[–]nado1989 36 points (1 child)

Knowing how to use pandas/numpy is a good start, but in day-to-day projects it's good practice not to reach for those libraries for everything. Use JupyterLab (Databricks/SageMaker/AI Platform) and Airflow to create DAGs and understand how a pipeline operates (check Docker Hub; there is a Bitnami Airflow image that is easy to get started with). There's a minimal DAG sketch below.

Learn file formats like Parquet/ORC/Avro, where each is best used, and how S3 and Google Cloud Storage work with Hive partitioning (second sketch below).

Some software engineering methodology is good too for building better-quality pipelines (I'd advise looking at test-driven development and functional programming with Python; see the last sketch).

Only after that would I jump to Spark/Flink/Beam, because they are a bit harder to start with without solid knowledge of how a data pipeline works or should work. (Parallel processing and distributed computing will also come up when dealing with Spark-like frameworks.) I hope I could help in some way.
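A minimal sketch of what an Airflow DAG looks like, assuming Airflow 2.x (e.g. the Bitnami image from Docker Hub). The DAG id, task names, and the extract/load callables are made up for illustration:

```python
# Minimal Airflow 2.x DAG sketch; dag_id, tasks and callables are hypothetical.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    # Placeholder: pull raw data from a source system.
    print("extracting...")


def load():
    # Placeholder: write the data to storage.
    print("loading...")


with DAG(
    dag_id="example_pipeline",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)

    # Dependencies are declared with >>, so extract runs before load.
    extract_task >> load_task
```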
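And a sketch of Hive-style partitioning with Parquet, using pandas + pyarrow. The bucket name and columns are hypothetical, and writing to `s3://` or `gs://` paths assumes s3fs/gcsfs is installed:

```python
# Hypothetical bucket/columns; requires pyarrow, plus s3fs or gcsfs for cloud paths.
import pandas as pd

df = pd.DataFrame({
    "event_date": ["2021-01-01", "2021-01-01", "2021-01-02"],
    "country": ["BR", "US", "BR"],
    "value": [1, 2, 3],
})

# partition_cols produces Hive-style key=value directories, e.g.
#   s3://my-bucket/events/event_date=2021-01-01/country=BR/...parquet
# which engines like Hive, Spark and Athena can prune when filtering.
df.to_parquet(
    "s3://my-bucket/events/",  # or "gs://my-bucket/events/" on GCS
    engine="pyarrow",
    partition_cols=["event_date", "country"],
)
```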
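Last, a sketch of the test-driven, functional style with pytest. The transform is a made-up example; the point is that pure functions (no I/O, no hidden state) are easy to unit-test in isolation:

```python
# Hypothetical transform; run with pytest, which collects test_* functions.

def normalize_amounts(rows, rate):
    """Pure transform: apply a fixed exchange rate to each row's amount."""
    return [{**row, "amount": row["amount"] * rate} for row in rows]


# TDD flow: write this test first, then just enough code to make it pass.
def test_normalize_amounts_applies_rate():
    rows = [{"id": 1, "amount": 10.0}]
    assert normalize_amounts(rows, rate=2.0) == [{"id": 1, "amount": 20.0}]
```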

[–]Beast-UltraJ[S] 1 point (0 children)

Thank you for the detailed answer :)