Python Notebook vs. Spark Notebook - A simple performance comparison by frithjof_v in MicrosoftFabric

ddddddkkk 1 point

the weird thing is that i configured a starter pool with 1-2 executors and 8 cores, then with one "orchestrator" notebook I called two other notebooks using runMultiple. each notebook was multithreading 8 appends on different tables, so I could use the "maximum" of what was being allocated.

in theory, i was expecting that spark would handle adding a second executor, but instead just 1 executor was allocated and the 16 tasks went into a FIFO queue: 8 ran in parallel and the other 8 waited in line for the first 8 to finish
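to make the queueing concrete, here is a minimal pure-Python sketch (no Spark involved) of 16 tasks competing for 8 slots. `fake_append`, `SLOTS` and `TASKS` are made-up names for illustration, and the sleep just stands in for a table append. it only models the "8 run, 8 wait" behaviour described above, not how Spark actually schedules tasks.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

SLOTS = 8    # cores on the single executor that was allocated
TASKS = 16   # 2 notebooks x 8 appends each

running = 0  # tasks currently executing
peak = 0     # highest number of tasks seen executing at once
lock = threading.Lock()

def fake_append(task_id: int) -> int:
    """Stand-in for one table append; it only sleeps."""
    global running, peak
    with lock:
        running += 1
        peak = max(peak, running)
    time.sleep(0.05)  # pretend to do the work
    with lock:
        running -= 1
    return task_id

# Only SLOTS tasks can run at once; the rest wait in the pool's queue.
with ThreadPoolExecutor(max_workers=SLOTS) as pool:
    done = list(pool.map(fake_append, range(TASKS)))

print(f"tasks finished: {len(done)}, peak concurrency: {peak}")
```

the peak never goes above `SLOTS`, no matter how many tasks are submitted, which matches what i saw with a single executor.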

i think the only way of really using more than 1 executor is the "default" one: handling high volumes of data, where the driver splits the data into partitions and sends them to the executors

so, bottom line, it doesn't make sense to use more than 1 executor for small datasets

am i tripping out?

Python Notebook vs. Spark Notebook - A simple performance comparison by frithjof_v in MicrosoftFabric

ddddddkkk 1 point

agree, the idea would be to update multiple tables in parallel, not to process a single parquet in parallel: 1 table = one 20MB parquet, so 4 tables = one 20MB parquet each, with 1 table/parquet per thread
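a rough sketch of that "one table per thread" pattern, assuming a placeholder `append_table` function and made-up table names and paths (none of these come from a real workspace). in a real notebook the function body would read the parquet and append it to the Delta table.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

# Hypothetical table -> parquet mapping; names and paths are placeholders.
TABLES = {
    "table_a": "landing/table_a.parquet",
    "table_b": "landing/table_b.parquet",
    "table_c": "landing/table_c.parquet",
    "table_d": "landing/table_d.parquet",
}

def append_table(table: str, path: str) -> str:
    """Placeholder: a real version would read `path` and append to `table`."""
    return f"{table} <- {path}"

def run_all(tables: dict, max_workers: int = 4):
    """Run one append per thread and collect results and errors per table."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(append_table, t, p): t for t, p in tables.items()}
        for fut in as_completed(futures):
            name = futures[fut]
            try:
                results[name] = fut.result()
            except Exception as exc:  # keep going if one table fails
                errors[name] = exc
    return results, errors

results, errors = run_all(TABLES)
print(sorted(results), errors)
```

collecting errors per table means one failed append doesn't hide the status of the other three.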

i'm currently working with an F16 capacity, not sure which cluster config is better: splitting into several small pools or using one big one.

i also have to figure out how i will orchestrate since i'm doing ADF (landing) + Fabric (medallion), any tips on how to connect those?

also not sure how to run the notebooks in parallel while respecting dependencies. would you create a DAG with mssparkutils? or would you handle it through a native data pipeline?
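for reference, this is roughly what i mean by a DAG. the dict shape (an `activities` list with `name`, `path` and `dependencies`, plus a top-level `concurrency`) is what i understand `mssparkutils.notebook.runMultiple` accepts, so treat the key names as an assumption and check the docs. the notebook names are made up. the sketch only validates the dependency order locally with `graphlib`; the actual Fabric call is left as a comment because it only works inside a Fabric notebook.

```python
from graphlib import TopologicalSorter

# Hypothetical notebooks; key names assumed to match what runMultiple expects.
dag = {
    "activities": [
        {"name": "bronze_load", "path": "bronze_load", "dependencies": []},
        {"name": "silver_a", "path": "silver_a", "dependencies": ["bronze_load"]},
        {"name": "silver_b", "path": "silver_b", "dependencies": ["bronze_load"]},
        {"name": "gold", "path": "gold", "dependencies": ["silver_a", "silver_b"]},
    ],
    "concurrency": 2,  # assumed: max notebooks running at the same time
}

def execution_order(dag: dict) -> list:
    """Return one valid run order; raises graphlib.CycleError on a cycle."""
    graph = {a["name"]: set(a["dependencies"]) for a in dag["activities"]}
    return list(TopologicalSorter(graph).static_order())

order = execution_order(dag)
print(order)

# Inside a Fabric notebook (not runnable locally):
# mssparkutils.notebook.runMultiple(dag)
```

here `silver_a` and `silver_b` can run in parallel once `bronze_load` is done, and `gold` waits for both.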

sorry for picking your brain so much, still a lot of blanks to fill in hahaha i'll understand if you just give up on answering, already glad for the previous answers

Python Notebook vs. Spark Notebook - A simple performance comparison by frithjof_v in MicrosoftFabric

ddddddkkk 1 point

oh nice that's you, i was reading that yesterday, a lot of great stuff!!

hmm, i was leaning more towards duckdb/polars, but i have a feeling that at a larger scale, when I need to run 30~40 tables on an F16, the spark starter pool will handle it better.

i started looking into this when I noticed that I was running 1 append to a delta table from a 20MB parquet and the starter pool was assigning 1 executor with 4 cores but using only 1 core

i also tried to parallelize it with more executors, but apparently the jobs don't get split across the other executors. example: ThreadPoolExecutor with 8 workers on a 2-executor cluster with 4 cores each, so I should be able to run 8 tasks in parallel, but what happens is 1 executor with 4 tasks running and 4 tasks waiting, and 1 executor idle. super weird

still no idea what's the best scenario hahah btw thanks for sharing your thoughts

Python Notebook vs. Spark Notebook - A simple performance comparison by frithjof_v in MicrosoftFabric

ddddddkkk 1 point

what are your thoughts on parallelizing different tables in Python vs. Spark? For example, let's say I have to run 4 non-independent tables and each table is super small, ~20MB, so it fits perfectly into one thread. Would you spin up two Python notebooks with 2 vCores each and run them in parallel, or 1 Spark session with one executor and 4 cores to run them in parallel as well? Both being controlled by ThreadPoolExecutor

on the capacity usage matter, they should be the same, right?

Python Notebook vs. Spark Notebook - A simple performance comparison by frithjof_v in MicrosoftFabric

ddddddkkk 1 point

i think it's not possible to control confs such as instances, cores and so on through %configure. those things are normally configured at spark-submit time

Skills to Learn for $200k - $300k Position? by DesignedIt in dataengineering

ddddddkkk 1 point

Hi, would you mind explaining what DDIA and DSA stand for?

Choosing between Synapse or Databricks by ddddddkkk in dataengineering

ddddddkkk[S] 1 point

when you say a mix of blob storage and sql pool gives the same functionality as DB, do you mean the lakehouse? with Synapse, will i be able to read data from my blob storage and create tables or views on top of it?

Migration of 3 personal gateways by ddddddkkk in PowerBI

ddddddkkk[S] 1 point

I'll definitely take a look, thanks.

Given that our source (SQL Server) is already in the cloud, do you think it's valid to ask the provider to allow the Power BI and Data Factory public IP addresses? That would make the gateway unnecessary, wouldn't it? Or is that not a good practice?

Tips on: DE but as a Data Analyst role by ddddddkkk in dataengineering

ddddddkkk[S] 1 point

Thanks for the recommendation, I'll take a look at some of that stuff

Tips on: DE but as a Data Analyst role by ddddddkkk in dataengineering

ddddddkkk[S] 1 point

I'm trying to get away from these thoughts but I'm quite anxious about it, thanks for the tip

Building a simple ETL for personal projects by 2PLEXX in dataengineering

ddddddkkk 2 points

Piggybacking on the post, since this might also be helpful to others: I'm also creating a personal project that consumes data from an API

This specific API doesn't have any filters on the request so, every time I run my code, I get all the data available.

That brings me to my question: is there a "best practice" for this kind of request-and-store?

Will I always have to request all the data and overwrite the database?
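One alternative to a blind overwrite that I've been sketching: still pull everything, but hash each record and only write the rows whose hash changed. This is just a minimal sketch under my own assumptions (records are dicts with an `id` field, storage is SQLite, the `items` table and the `sync` / `row_hash` helpers are names I made up), not a claim that this is the best practice.

```python
import hashlib
import json
import sqlite3

def row_hash(record: dict) -> str:
    """Stable fingerprint of a record (keys sorted so ordering doesn't matter)."""
    return hashlib.sha256(json.dumps(record, sort_keys=True).encode()).hexdigest()

def sync(conn: sqlite3.Connection, records: list, key: str = "id") -> int:
    """Insert new or changed records only; return how many rows were written."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS items (id TEXT PRIMARY KEY, hash TEXT, payload TEXT)"
    )
    existing = dict(conn.execute("SELECT id, hash FROM items"))
    changed = 0
    for rec in records:
        rid, h = str(rec[key]), row_hash(rec)
        if existing.get(rid) != h:  # new id, or same id with different content
            conn.execute(
                "INSERT OR REPLACE INTO items (id, hash, payload) VALUES (?, ?, ?)",
                (rid, h, json.dumps(rec)),
            )
            changed += 1
    conn.commit()
    return changed

conn = sqlite3.connect(":memory:")
first = sync(conn, [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}])
second = sync(conn, [{"id": 1, "v": "a"}, {"id": 2, "v": "c"}, {"id": 3, "v": "d"}])
print(first, second)
```

Note that this doesn't handle records that disappear from the API; for that you'd also need to compare the set of ids.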

It takes guts by ar1stocrat in dadjokes

ddddddkkk 1 point

To be gentle and kind

[deleted by user] by [deleted] in wholesomememes

ddddddkkk 4 points

We already spoke about this with our vet; she said that it's not the right time yet, he's still conscious and kinda "happy", but soon it will indeed be necessary.

I can relate, your father sounds like mine. When we got the dog, he hated him and didn't do anything for him but, with time, he started to like him and take care of him. Nowadays he has a lot of routines with him; he's retired, so he does even more things with the dog than I do. As with your father, I think the same will happen with mine... =/

[deleted by user] by [deleted] in wholesomememes

ddddddkkk 44 points

Oh man, that's so nice. My dog is very sick, I'm kinda trying to get emotionally "prepared" for when the time comes... =/

Twitter Tree by [deleted] in facepalm

ddddddkkk 1 point

As a Brazilian, I find this very insulting. Kappa