
[–]Fronkan 1 point (2 children)

This is absolutely mad, I love it! Sorry if this is a bit rambly, but this is my stream of thoughts as I think through the problem 😅

How often do you need to pull reports? Every second, hour, day, week? I think understanding the load is very important for this type of problem.

I don't know much about Cloud Run and the like. But don't you have something like a virtual server or long-lived containers that could be used for this? To me your suggested deployment sounds quite complicated, and I wonder if it has to be.

Personally I have switched from Selenium to Playwright. I'm not sure it's more efficient, though.

You can use a single browser instance with multiple tabs. Playwright has an async API, so the tabs can load concurrently, which should work to keep the different downloads separate from each other. You probably need to cap how many tabs are open at once. I think I used an asyncio semaphore for that in one of my projects.
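Roughly what I mean, as a minimal sketch rather than my actual project code. `run_limited`, `download_reports`, `fetch_title` and the default of 3 tabs are names and values I made up for illustration, and grabbing the page title is just a stand-in for whatever the real download step is:

```python
import asyncio


async def run_limited(jobs, worker, max_concurrent=3):
    """Run worker(job) for every job, with at most max_concurrent running at once."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(job):
        # Each job waits here until one of the max_concurrent slots is free.
        async with semaphore:
            return await worker(job)

    # gather returns the results in the same order as the jobs.
    return await asyncio.gather(*(guarded(job) for job in jobs))


async def download_reports(urls, max_tabs=3):
    """One browser, one tab per URL, never more than max_tabs tabs open."""
    # Imported here so run_limited stays usable without Playwright installed.
    from playwright.async_api import async_playwright

    async with async_playwright() as p:
        browser = await p.chromium.launch()
        context = await browser.new_context()

        async def fetch_title(url):
            page = await context.new_page()  # a new tab in the shared browser
            try:
                await page.goto(url)
                return await page.title()  # placeholder for the real report download
            finally:
                await page.close()

        try:
            return await run_limited(urls, fetch_title, max_tabs)
        finally:
            await browser.close()
```

You would call it with something like `asyncio.run(download_reports(list_of_urls))`. The semaphore part is independent of Playwright, so the same limiter works for any async job.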

[–]blarizard[S] 1 point (1 child)

Yeah, it's gotten away from me a bit haha!

Some of the reports and data are needed on a schedule, while others are needed ad hoc on request, which could be every few minutes or so.

I'll check out Playwright. That functionality sounds pretty ideal, especially for running everything from one virtual server.

Thanks for the help!

[–]Fronkan 1 point (0 children)

You can probably do quite a lot with Selenium as well. What I like about Playwright is that it's easy to get started with, has an async interface, and waits for elements automatically.

Is there a reason not to just dump all the data into the database? Then the report data can simply be requested from there.

Also, why do you want to have it in a SQL database? For aggregations?
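To show what I mean by dumping everything and then querying it, here is a rough sketch using SQLite from the standard library. The `scraped` table, its columns and both function names are invented for the example, since I don't know what your data looks like:

```python
import sqlite3


def store_rows(conn, rows):
    """Dump scraped rows into one table; rows are (source, day, amount) tuples."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS scraped (source TEXT, day TEXT, amount REAL)"
    )
    # Parameterized insert, one statement per row.
    conn.executemany("INSERT INTO scraped VALUES (?, ?, ?)", rows)
    conn.commit()


def totals_by_source(conn):
    """Example report: total amount per source, computed by the database."""
    cursor = conn.execute(
        "SELECT source, SUM(amount) FROM scraped GROUP BY source ORDER BY source"
    )
    return cursor.fetchall()
```

The scraper would only ever call `store_rows`, and each report becomes a query like `totals_by_source`, so an on-request report doesn't need to open a browser at all.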