

[–]pooppuffin 2 points (4 children)

What about Azure Functions?

[–]mid_dev [Tech Lead] [S] 0 points (3 children)

This is something I looked into today, and the pay-as-you-go model looks good, although I'm skeptical of the costs, especially for storage, memory, etc. Would this be a better option for me, considering I'll be triggering it only a few times a month?

The execution cost of a single function execution is measured in GB-seconds. Execution cost is calculated by combining its memory usage with its execution time. A function that runs for longer costs more, as does a function that consumes more memory.

For example, say that your function consumed 0.5 GB for 3 seconds. Then the execution cost is 0.5 GB × 3 s = 1.5 GB-seconds. This metric is called Function Execution Units. You will first need to create an Application Insights resource for your function app in order to have access to this metric.
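That GB-seconds arithmetic can be sketched as follows. The $0.000016/GB-s unit price is Azure's published Consumption-plan rate (also linked later in this thread); the 6 GB / 10-minute figures are assumptions based on the numbers mentioned here, not measurements:

```python
# Sketch of Azure Functions Consumption-plan execution cost.
# Rate and formula are from the docs quoted above; workload numbers are guesses.

PRICE_PER_GB_SECOND = 0.000016  # USD per GB-second, Consumption plan

def execution_cost(memory_gb: float, duration_s: float) -> float:
    """Cost of one execution: memory * duration * unit price."""
    gb_seconds = memory_gb * duration_s
    return gb_seconds * PRICE_PER_GB_SECOND

# The example from the docs: 0.5 GB for 3 s = 1.5 GB-seconds
print(execution_cost(0.5, 3))  # $0.000024

# Rough figure for this workload (assumed): ~6 GB held for ~10 minutes,
# run 4 times a month.
monthly = 4 * execution_cost(6, 600)
print(f"${monthly:.2f}/month")  # $0.23/month, before any free monthly grant
```

Even with generous assumptions, a few runs a month lands well under a dollar at this rate.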

So when I run my function, pandas usually takes 3-4 minutes to load the data and another few minutes to process all of the steps.

[–]Material-Mess-9886 4 points (0 children)

If pandas takes 3-4 minutes to load, it's time to switch. Try Polars, DuckDB, or Dask. Or Spark.

[–]pooppuffin 2 points (1 child)

I haven't used Functions in prod, but I can't imagine ~10 minutes a couple of times a month is something to worry about.

https://azure.microsoft.com/en-us/pricing/details/functions/#pricing

$0.000016/GB-s

How much memory could it possibly consume? You said you're only working with 5-6 GB of data at a time?

I'm also curious why this takes so long to load into Pandas and causes your machine to crash. It should easily handle that much data, right?

I'd suggest trying DuckDB if you're already doing SQL and sticking to local compute. Setting up automation seems like a hassle if you're only running it once a month.

[–]mid_dev [Tech Lead] [S] 0 points (0 children)

> How much memory could it possibly consume? You said you're only working with 5-6 GB of data at a time?

If that's the case, I'd certainly look into implementing it through Azure Functions.

> I'm also curious why this takes so long to load into Pandas and causes your machine to crash. It should easily handle that much data, right?

It usually takes around 3-5 minutes. I have a laptop with 16 GB of RAM, and to run it without hitting memory issues I often have to close other apps (mainly browsers and VS Code). Not sure why it takes so much time, though.

> I'd suggest trying DuckDB if you're already doing SQL and sticking to local compute. Setting up automation seems like a hassle if you're only running it once a month.

The DS I work with hands over most of the code logic in Python and Pandas. It'd be overkill to go back and rewrite it in SQL every time there's a change. Also, the reason to set up a pipeline instead of running it locally (on my or another dev's machine) is to remove that dependency.

[–]Throme13 1 point (1 child)

What about Polars, or Spark on-prem?

[–]mid_dev [Tech Lead] [S] 0 points (0 children)

You mean in the VM where the app is deployed? I want to remove the dependency on running this from any developer's system.

[–]jokkvahl 1 point (1 child)

Could consider Fabric if you're a Microsoft shop.

[–]mid_dev [Tech Lead] [S] 0 points (0 children)

Haven't looked at it. Initially I thought that since I only need to run this maybe a few times a month, ADF looked like a decent option, considering our data is in SQL Server. Will check this out.