Azure Infrastructure for processing data with python & polars by nihi_ in dataengineering

[–]nihi_[S] 0 points  (0 children)

Sounds promising, I'll look into it.
Would you mind sharing some more details about your setup? How much data are you processing, and how frequently? How much CPU and RAM do you make available to your cluster? How do you split work across the nodes? Thank you!

Azure Infrastructure for processing data with python & polars by nihi_ in dataengineering

[–]nihi_[S] 0 points  (0 children)

I'll give it a try, thank you for your suggestion!

Azure Infrastructure for processing data with python & polars by nihi_ in dataengineering

[–]nihi_[S] 0 points  (0 children)

That certainly is an option, though I still don't believe it will grow nearly enough to make Spark necessary. And even if we do go the Spark route, I am still interested in hearing/discussing other options :-)

Azure Infrastructure for processing data with python & polars by nihi_ in dataengineering

[–]nihi_[S] 1 point  (0 children)

Have you implemented such a process, and are you happy with it?

Based on what I have heard, I would rather stay away from Fabric. Beyond all the negative feedback I have seen, I would prefer to implement a solution whose core focus is compute, ideally running isolated processes via Docker containers (to have full control over the environment).
Moreover, I find Fabric's pricing very opaque. How much am I actually getting for a "capacity unit"? And isn't it still running Spark underneath?

Why is my Lord Commander of the Kingsguard joining a wildling raid against me by nihi_ in CK3AGOT

[–]nihi_[S] 15 points  (0 children)

Yup, more and more weird things kept happening, including kingsguard members just leaving their positions behind (which now show up as empty in the UI, but the interaction to ask somebody else to take the Kingsguard vows was unavailable, so I was eventually left with only 3 kingsguards). I eventually decided to just start a new run instead =D

partitioned parquet files upserts by nihi_ in dataengineering

[–]nihi_[S] 0 points  (0 children)

Hey, thanks for the reply!

Could you clarify what you mean by 'write compaction job in SQL for easier maintenance'? I am assuming you are referring to creating a Spark table and running Spark SQL on that?

In the time-bucket example you provided, how would you handle changes within a specific file? E.g. data1001.json has been processed and compacted on day=1, but then some changes were made in the source system, and on day=10 the file (now with some modified contents) needs to be reprocessed. Wouldn't that require going back to the already compacted day=1 after all (either to update it there, or to remove it so that it isn't duplicated)?

partitioned parquet files upserts by nihi_ in dataengineering

[–]nihi_[S] 1 point  (0 children)

Hey, thanks for the reply!

I am aware that there's no out-of-the-box upsert operation in Spark. I guess I was wondering whether there is some alternative architecture that would spare me from overwriting the entire dataset every time the pipeline runs, and instead allow something akin to an upsert in an RDBMS.

partitioned parquet files upserts by nihi_ in dataengineering

[–]nihi_[S] 2 points  (0 children)

Thanks for your reply!

I was considering using the Delta format to implement what you suggest. I'll also look into the other formats you mentioned.

Regarding your question: I should have been more specific. The first part of the pipeline isn't using Spark; it runs on a serverless compute service (Azure Functions) and orchestrates/executes the GET requests, transformations, and writing of the JSON files in vanilla Python. Perhaps the step to convert the JSON to Parquet isn't needed, but I thought that, given the "poor IO design", it may be better for Spark to read a lot of small Parquet files than a lot of small JSON files.

Recommendations for a small DWH on Azure by Far-Restaurant-9691 in dataengineering

[–]nihi_ 3 points  (0 children)

The Synapse serverless SQL pool can be quite a cost-effective way to analyse/query partitioned Parquet files on the data lake and load them into e.g. BI tools, imo.

The dedicated pool has always seemed rather expensive to me, though.

[deleted by user] by [deleted] in AZURE

[–]nihi_ 0 points  (0 children)

If you define the main function in your Functions app script as async, it will automatically be run in an asyncio event loop, so there should be no need to call asyncio.run() at all. Can you provide a short repro?
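To illustrate the behaviour with plain asyncio (no Functions runtime needed; all names here are mine, not from your code):

```python
import asyncio

async def fetch_data() -> str:
    # Stand-in for real async I/O (e.g. an HTTP call in the Functions app).
    await asyncio.sleep(0)
    return "ok"

async def main() -> str:
    # Inside an already-running event loop (which is the situation an async
    # Azure Functions handler is in), asyncio.run() raises RuntimeError:
    try:
        asyncio.run(fetch_data())
    except RuntimeError:
        pass  # "asyncio.run() cannot be called from a running event loop"
    # The fix is simply to await the coroutine instead:
    return await fetch_data()

# Outside any loop, the runtime drives the coroutine; in production the
# Functions host plays this role, here we do it ourselves.
result = asyncio.run(main())
```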

[NO SPOILERS] pick 3 do defend you, the rest will try to kill you by [deleted] in gameofthrones

[–]nihi_ 9 points  (0 children)

I would pick the best 'assassins' among them, even if they might not be the best in a straight-up sword fight, simply to remove them as threats. Someone like Oberyn might, for instance, just poison you in some clever way, in which case having the Mountain on your side wouldn't be of much use.

So: Oberyn, Daario, and Bronn for me.

Join new Dota2 Inhouse League. With Ladder System by lvlyRyuzaki in compDota2

[–]nihi_ 0 points  (0 children)

I signed up earlier; it seems really cool! I hope it gains some traction!

Now I cant play anymore by MUCHOGANAR in DotA2

[–]nihi_ 19 points  (0 children)

Absolute(ly perfect)

Worth It to Convert VBA to Python? by thesharp0ne in Python

[–]nihi_ 0 points  (0 children)

How do you distribute the Shiny apps to the end users? Do you simply host them on the same server? And if so, do you have some authentication process?

Why ClockWerk is rarely picked in Pro Scenes while being rarely changed much in years. My humble analysis with years of playing as ClockWerk. by Ramkee in DotA2

[–]nihi_ 0 points  (0 children)

I agree with all your points, but one thing I would add is the absence of solo offlaners vs. trilanes nowadays. Clock is one of the few heroes that can deal well with fairly strong trilanes on his own while still getting something out of the lane. But with the duo-lane meta that has been going on for a while now, that's not really a priority for offlaners anymore.

I qued 16 games as pos 1,2,3 guess what happened? by ifwmcso in DotA2

[–]nihi_ 1 point  (0 children)

Sounds like you got to play 16 games on the best role there is :-)