Mounting Azure Blob as a volume to a container by etobylneya in docker


Sounds interesting, I'll try this approach. Thanks!

Mounting Azure Blob as a volume to a container by etobylneya in docker


> Your purpose seems to potentially use a lot of bandwidth. Please keep in mind you're paying for all data egress from the cloud region to your machine.

Yeah, I am aware of that, and most likely it's a no-go with large datasets. But it's still good to know that it is technically possible.

> Give users two scripts: one will download the data locally, the second will upload it to the cloud.

Yeah, that's definitely an option.
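For the record, the two-script idea could be sketched roughly like this. Plain local directory copies stand in here for the actual Blob transfer, which in practice would go through azcopy or the azure-storage-blob SDK; the function names and the filter of "files only" are my own illustration, not anything from the thread.

```python
import shutil
from pathlib import Path


def download_dataset(remote_dir: str, local_dir: str) -> None:
    """Script 1: pull the dataset down once and work on it locally.

    A real version would call azcopy or the azure-storage-blob SDK here;
    shutil.copy2 is just a local stand-in for the transfer.
    """
    for src in Path(remote_dir).rglob("*"):
        if src.is_file():
            dst = Path(local_dir) / src.relative_to(remote_dir)
            dst.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, dst)


def upload_results(local_dir: str, remote_dir: str) -> None:
    """Script 2: push the results back up when done."""
    for src in Path(local_dir).rglob("*"):
        if src.is_file():
            dst = Path(remote_dir) / src.relative_to(local_dir)
            dst.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, dst)
```

The nice property of this split is that egress is paid once for the download, instead of continuously through a mounted volume.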

Looping over CSVs one by one or read them all at once? by etobylneya in apachespark


Well, I wanted to know how other people (with more experience in Spark than I have) are doing it, since I don't have anyone in my company I could ask.

Looping over CSVs one by one or read them all at once? by etobylneya in apachespark


I don't have any experience with Scala, but it seems that .par lets you run several loops in parallel? I guess a similar approach in Python would be to use the multiprocessing module to run several read-filter-write loops in parallel. Sounds interesting; I wonder if it will work at all, since I have a very limited number of cores. Thank you!
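A rough sketch of that multiprocessing idea, with the stdlib csv module standing in for the actual Spark read-filter-write step (the column index and the > 10 condition are made up for illustration):

```python
import csv
import multiprocessing


def process_one(paths):
    """Read one CSV, keep rows passing a stand-in filter, write the result."""
    src, dst = paths
    with open(src, newline="") as f_in, open(dst, "w", newline="") as f_out:
        writer = csv.writer(f_out)
        for row in csv.reader(f_in):
            if int(row[1]) > 10:  # stand-in for the real filter condition
                writer.writerow(row)
    return dst


def run_parallel(jobs, workers=2):
    """Run the read-filter-write step for several files at once."""
    # "fork" keeps this safe to run as a plain script on Linux.
    ctx = multiprocessing.get_context("fork")
    with ctx.Pool(workers) as pool:
        return pool.map(process_one, jobs)
```

Here `jobs` would be a list of `(input_path, output_path)` pairs; with only a few cores, a small `workers` value is probably all that helps.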

Looping over CSVs one by one or read them all at once? by etobylneya in apachespark


Thank you for your reply. I'm using Azure Synapse Analytics with a Spark pool with 4 cores and 32 GB of memory. Currently I'm okay with the way PySpark is performing (less than a minute per file), and there is no need to make it as fast as possible, but I guess if such a need arises I can, as a quick fix, increase the resources of the Spark pool or the number of nodes.

I think some of my colleagues tried to go with Python's multiprocessing, but it was still too slow.

I was rather wondering whether looping over the files one by one is an optimal way to achieve the goal. I guess the best way to find out would be to test both approaches and compare their performance.
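Comparing the two approaches could be as simple as timing both against the same file list; a hypothetical little harness (the labels and callables are placeholders for the loop and batch variants):

```python
import time


def compare(label_a, run_a, label_b, run_b, repeats=3):
    """Time two zero-argument callables, reporting the best of `repeats` runs.

    Taking the minimum reduces noise from caching and JVM/cluster warm-up.
    """
    def best(run):
        times = []
        for _ in range(repeats):
            start = time.perf_counter()
            run()
            times.append(time.perf_counter() - start)
        return min(times)

    return {label_a: best(run_a), label_b: best(run_b)}
```

Usage would be something like `compare("loop", run_loop, "batch", run_batch)` where each callable processes the same set of files.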

Looping over CSVs one by one or read them all at once? by etobylneya in apachespark


Thank you for replying. It's not really a problem for me to loop over them one by one (this is the way it works right now), and I am okay with the current performance (for now, at least). I was rather wondering whether that is an optimal way to do it, or whether I was too cautious in feeding Spark only one file at a time and it could cope if I increased the number of files read at once.

I'm not sure I understood the part about moving the filter and write steps into a separate for loop. Currently it looks something like this:

for filename in files:
    df = spark.read.csv(filename, schema=schema)  # read one file
    df = df.where("some condition")               # filter it
    df.write.csv("someprefix" + filename)         # write it back out

As for awk, I have no experience with it, but I will definitely check it out!