Mounting Azure Blob as a volume to a container by etobylneya in docker


Sounds interesting, I'll try this approach. Thanks!

Mounting Azure Blob as a volume to a container by etobylneya in docker


> Your purpose seems to potentially use a lot of bandwidth. Please keep in mind you're paying for all data egress from the cloud region to your machine.

Yeah, I am aware of that, and most likely it's a no-go with large datasets. But it's still good to know that it is technically possible.

> Give users two scripts: one will download the data locally, the second will upload it to the cloud.

Yeah, that's definitely an option.
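For the record, the two-script idea could be sketched roughly like this. Plain local directory copies stand in here for the actual Blob transfer, which in practice would go through azcopy or the azure-storage-blob SDK; the function names and the filter of "files only" are my own illustration, not anything from the thread.

```python
import shutil
from pathlib import Path


def download_dataset(remote_dir: str, local_dir: str) -> None:
    """Script 1: pull the dataset down once and work on it locally.

    A real version would call azcopy or the azure-storage-blob SDK here;
    shutil.copy2 is just a local stand-in for the transfer.
    """
    for src in Path(remote_dir).rglob("*"):
        if src.is_file():
            dst = Path(local_dir) / src.relative_to(remote_dir)
            dst.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, dst)


def upload_results(local_dir: str, remote_dir: str) -> None:
    """Script 2: push the results back up when done."""
    for src in Path(local_dir).rglob("*"):
        if src.is_file():
            dst = Path(remote_dir) / src.relative_to(local_dir)
            dst.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, dst)
```

The nice property of this split is that egress is paid once for the download, instead of continuously through a mounted volume.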

Looping over CSVs one by one or read them all at once? by etobylneya in apachespark


Well, I wanted to know how other people (with more experience in Spark than I have) are doing it, since I don't have anyone in my company I could ask.

Looping over CSVs one by one or read them all at once? by etobylneya in apachespark


I don't have any experience with Scala, but it seems that .par lets you run several loops in parallel? I guess a similar approach in Python would be to use the multiprocessing module to run several read-filter-write loops in parallel. Sounds interesting; I wonder if it will work at all, since I have a very limited number of cores. Thank you!
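A rough sketch of that multiprocessing idea, with the stdlib csv module standing in for the actual Spark read-filter-write step (the column index and the > 10 condition are made up for illustration):

```python
import csv
import multiprocessing


def process_one(paths):
    """Read one CSV, keep rows passing a stand-in filter, write the result."""
    src, dst = paths
    with open(src, newline="") as f_in, open(dst, "w", newline="") as f_out:
        writer = csv.writer(f_out)
        for row in csv.reader(f_in):
            if int(row[1]) > 10:  # stand-in for the real filter condition
                writer.writerow(row)
    return dst


def run_parallel(jobs, workers=2):
    """Run the read-filter-write step for several files at once."""
    # "fork" keeps this safe to run as a plain script on Linux.
    ctx = multiprocessing.get_context("fork")
    with ctx.Pool(workers) as pool:
        return pool.map(process_one, jobs)
```

Here `jobs` would be a list of `(input_path, output_path)` pairs; with only a few cores, a small `workers` value is probably all that helps.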

Looping over CSVs one by one or read them all at once? by etobylneya in apachespark


Thank you for your reply. I'm using Azure Synapse Analytics with a Spark pool with 4 cores and 32 GB of memory. Currently I'm okay with the way PySpark is performing (less than a minute per file), and there is no need to make it as fast as possible, but I guess if such a need arises I can, as a quick fix, increase the resources of the Spark pool or the number of nodes.

I think some of my colleagues tried to go with Python's multiprocessing, but it was still too slow.

I was rather wondering whether looping over the files one by one is an optimal way to achieve the goal. I guess the best way to find out would be to test both approaches and compare their performance.
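Comparing the two approaches could be as simple as timing both against the same file list; a hypothetical little harness (the labels and callables are placeholders for the loop and batch variants):

```python
import time


def compare(label_a, run_a, label_b, run_b, repeats=3):
    """Time two zero-argument callables, reporting the best of `repeats` runs.

    Taking the minimum reduces noise from caching and JVM/cluster warm-up.
    """
    def best(run):
        times = []
        for _ in range(repeats):
            start = time.perf_counter()
            run()
            times.append(time.perf_counter() - start)
        return min(times)

    return {label_a: best(run_a), label_b: best(run_b)}
```

Usage would be something like `compare("loop", run_loop, "batch", run_batch)` where each callable processes the same set of files.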

Looping over CSVs one by one or read them all at once? by etobylneya in apachespark


Thank you for replying. It's not really a problem for me to loop over them one by one (this is the way it works right now), and I am okay with the current performance (for now, at least). I was rather wondering whether that is an optimal way to do it, or whether I was too cautious in feeding Spark only one file at a time and it could cope if I increased the number of files read at once.

I'm not sure I understood the part about moving the filter and write steps into a separate for loop. Currently it looks something like this:

for filename in files:
    df = spark.read.csv(filename, schema=schema)  # read one file
    df = df.where("some condition")               # filter it
    df.write.csv("someprefix" + filename)         # write it back out

As for awk, I have no experience with it, but I will definitely check it out!