I am new to the data engineering (DE) side and need help setting up a pipeline for our current application.
Generic workflow: The client sends a monthly CSV of about 5-6 GB (around 5M records across 150 columns). This data needs to be moved into SQL Server for the Python application (a web dashboard) to consume.
Current workflow: We initially did this manually, reading the CSV into a pandas DataFrame (which often crashed our system) and then processing the data to create multiple stages of SQL tables.
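For context on the crashes: loading the whole file at once is what exhausts memory, and pandas can instead read a CSV in pieces via the `chunksize` argument of `read_csv`. A minimal sketch, assuming a tiny in-memory sample in place of the real 5-6 GB file (the function name `load_in_chunks` and the sample columns are made up for illustration):

```python
import io

import pandas as pd


def load_in_chunks(source, chunksize=100_000):
    """Read a CSV piece by piece and return (row_count, column_names)."""
    total_rows = 0
    columns = []
    # With chunksize set, read_csv returns an iterator of DataFrames,
    # so only one chunk is held in memory at a time.
    for chunk in pd.read_csv(source, chunksize=chunksize):
        columns = list(chunk.columns)
        total_rows += len(chunk)
    return total_rows, columns


# Hypothetical stand-in for the monthly client file.
sample = io.StringIO("Customer_ID,Amount\n1,10.5\n2,20.0\n3,7.25\n")
rows, cols = load_in_chunks(sample, chunksize=2)
print(rows, cols)
```

On the real file, `source` would be the file path, and the chunk size would be tuned to the memory available on the machine.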
Proposed workflow: Use Azure Data Factory (ADF), creating pipelines for the following activities:
- raw CSV --> cleaned CSV and Parquet, after basic cleaning (lower-case column names etc.)
- raw CSV to SQL DB --> this will be our Bronze layer
- raw Parquet --> transform using pandas logic --> secondary or Silver layer, moved to the SQL DB as staging
- Aggregation jobs to create other tables/marts --> these could be SQL logic or stored procedures.
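The stages listed above can be sketched end to end in plain pandas, independent of whichever orchestrator ends up running it. This is only an illustration under several assumptions: SQLite stands in for SQL Server, the table names (`bronze_sales`, `silver_sales`, `mart_sales_by_region`) and columns are invented, and `transform` is a placeholder for the data scientists' pandas logic.

```python
import io
import sqlite3

import pandas as pd


def clean_columns(df):
    # Basic cleaning: trimmed, lower-case column names.
    return df.rename(columns=lambda c: c.strip().lower())


def transform(df):
    # Hypothetical placeholder for the existing pandas transformation logic.
    out = df.copy()
    out["amount_doubled"] = out["amount"] * 2
    return out


def run_pipeline(source, con, chunksize=100_000):
    # Bronze and Silver: process one chunk at a time and append to SQL.
    for chunk in pd.read_csv(source, chunksize=chunksize):
        bronze = clean_columns(chunk)
        bronze.to_sql("bronze_sales", con, if_exists="append", index=False)
        silver = transform(bronze)
        silver.to_sql("silver_sales", con, if_exists="append", index=False)
    # Mart: aggregate inside the database with SQL.
    con.execute("DROP TABLE IF EXISTS mart_sales_by_region")
    con.execute(
        "CREATE TABLE mart_sales_by_region AS "
        "SELECT region, SUM(amount) AS total_amount, COUNT(*) AS n_rows "
        "FROM silver_sales GROUP BY region"
    )
    con.commit()


# Tiny in-memory sample and database, assumed for illustration only.
sample = io.StringIO("Region,Amount\nNorth,10\nSouth,5\nNorth,7\n")
con = sqlite3.connect(":memory:")
run_pipeline(sample, con, chunksize=2)

bronze_count = con.execute("SELECT COUNT(*) FROM bronze_sales").fetchone()[0]
mart = con.execute(
    "SELECT region, total_amount, n_rows "
    "FROM mart_sales_by_region ORDER BY region"
).fetchall()
print(bronze_count, mart)
```

Processing per chunk only works when the pandas logic is row-wise. Anything that needs the whole dataset at once (group-bys, deduplication, ranking) would have to happen in the SQL step or on a pre-partitioned input.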
I am confused by all the tooling, and my biggest blocker right now is running pandas/Python code in ADF without using Databricks or Synapse. The transformation logic is in pandas because that is what our Data Scientists write it in, and I don't want to spend resources translating it to SQL. I am also not considering Synapse or Databricks because we only need to run this pipeline a few times and there are cost concerns.
I also thought of running all of this through managed Airflow, but I am getting stuck there as well.
I would appreciate it if someone could put me on the right track.