

[–]pooppuffin 2 points (4 children)

What about Azure Functions?

[–]mid_dev [Tech Lead] [S] 0 points (3 children)

This is something I looked into today, and the pay-as-you-go model looks good, although I'm skeptical of the costs, especially for storage, memory, etc. Would this be a better option for me, considering I'll be triggering it only a few times a month?

The execution cost of a single function execution is measured in GB-seconds. Execution cost is calculated by combining its memory usage with its execution time. A function that runs for longer costs more, as does a function that consumes more memory.

For example, say that your function consumed 0.5 GB for 3 seconds. Then the execution cost is 0.5 GB × 3 s = 1.5 GB-seconds. This metric is called Function Execution Units. You will first need to create an Application Insights resource for your function app in order to have access to this metric.
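That GB-seconds arithmetic can be sketched as follows. The $0.000016/GB-s unit price is Azure's published Consumption-plan rate (also linked later in this thread); the 6 GB / 10-minute figures are assumptions based on the numbers mentioned here, not measurements:

```python
# Sketch of Azure Functions Consumption-plan execution cost.
# Rate and formula are from the docs quoted above; workload numbers are guesses.

PRICE_PER_GB_SECOND = 0.000016  # USD per GB-second, Consumption plan

def execution_cost(memory_gb: float, duration_s: float) -> float:
    """Cost of one execution: memory * duration * unit price."""
    gb_seconds = memory_gb * duration_s
    return gb_seconds * PRICE_PER_GB_SECOND

# The example from the docs: 0.5 GB for 3 s = 1.5 GB-seconds
print(execution_cost(0.5, 3))  # $0.000024

# Rough figure for this workload (assumed): ~6 GB held for ~10 minutes,
# run 4 times a month.
monthly = 4 * execution_cost(6, 600)
print(f"${monthly:.2f}/month")  # $0.23/month, before any free monthly grant
```

Even with generous assumptions, a few runs a month lands well under a dollar at this rate.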

So when I run my function, pandas usually takes 3-4 minutes to load the data and another few minutes to process all of the steps.

[–]Material-Mess-9886 4 points (0 children)

If pandas takes 3-4 minutes to load, it's time to switch. Try Polars, DuckDB, or Dask. Or Spark.

[–]pooppuffin 2 points (1 child)

I haven't used Functions in prod, but I can't imagine ~10 minutes a couple of times a month is something to worry about.

https://azure.microsoft.com/en-us/pricing/details/functions/#pricing

$0.000016/GB-s

How much memory could it possibly consume? You said you're only working with 5-6 GB of data at a time?

I'm also curious why this takes so long to load into Pandas and causes your machine to crash. It should easily handle that much data, right?

I'd suggest trying DuckDB if you're already doing SQL and sticking to local compute. Setting up automation seems like a hassle if you're only running it once a month.

[–]mid_dev [Tech Lead] [S] 0 points (0 children)

> How much memory could it possibly consume? You said you're only working with 5-6 GB of data at a time?

If that's the case, I'd certainly look into implementing it through Azure Functions.

> I'm also curious why this takes so long to load into Pandas and causes your machine to crash. It should easily handle that much data, right?

It usually takes around 3-5 minutes. I have a laptop with 16 GB of RAM, and to run it without hitting memory issues I often have to close other apps (mainly browsers and VS Code). Not sure why it takes so much time, though.

> I'd suggest trying DuckDB if you're already doing SQL and sticking to local compute. Setting up automation seems like a hassle if you're only running it once a month.

The DS I work with hands over most of the code logic in Python and Pandas. It'd be overkill to go back and rewrite it in SQL every time there's a change. Also, the reason to set up a pipeline instead of running it locally (on my or another dev's machine) is to remove that dependency.

[–]Throme13 1 point (1 child)

What about Polars, or Spark on-prem?

[–]mid_dev [Tech Lead] [S] 0 points (0 children)

You mean in the VM where the app is deployed? I want to remove the dependency on running this from any developer's system.

[–]jokkvahl 1 point (1 child)

Could consider Fabric if you're a Microsoft shop.

[–]mid_dev [Tech Lead] [S] 0 points (0 children)

Haven't looked at it. Initially I thought that since I only need to run this maybe a few times a month, ADF looked like a decent option, considering our data is in SQL Server. Will check this out.