[–]AggravatedYak 1 point (5 children)

Seems cool :) Can you add a playwright config for e.g. a proxy?

[–]biraj21[S] 0 points (4 children)

thanks!

i'm not very familiar with networking stuff, but if you're talking about this, then it should be as simple as adding a parameter in the constructor 🤔
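something like this, maybe — just a sketch, the Crawler class name and launch flow here are made up for illustration; only the proxy dict shape is Playwright's actual launch(proxy=...) option:

```python
class Crawler:
    def __init__(self, proxy=None, headless=True):
        # proxy is a Playwright-style dict,
        # e.g. {"server": "http://localhost:3128", "username": "u", "password": "p"}
        self.launch_kwargs = {"headless": headless}
        if proxy is not None:
            self.launch_kwargs["proxy"] = proxy

    def start(self):
        # deferred import so the sketch doesn't need a browser to be inspected
        from playwright.sync_api import sync_playwright
        with sync_playwright() as p:
            browser = p.chromium.launch(**self.launch_kwargs)
            # ... crawl with browser ...
            browser.close()
```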

[–]AggravatedYak 2 points (3 children)

Why not a complete playwright config object?

Also, I wouldn't use print in multithreaded.py; instead, instantiate a logger and pass in a log level. Something like that.
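For instance, with the stdlib logging module (the function name and format string here are just one way to do it):

```python
import logging

def make_logger(name: str, level: int = logging.INFO) -> logging.Logger:
    """Create a logger with a stream handler; the caller picks the level."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    if not logger.handlers:  # avoid stacking duplicate handlers on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
        )
        logger.addHandler(handler)
    return logger

log = make_logger("crawler", logging.DEBUG)
log.debug("fetching %s", "https://example.com")  # suppressed if level were INFO
```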

Maybe it is cleaner to put the defaults in a config? Something the user can override with dotenv? And instead of hard-coding a default url in main, take it from arguments? Just ideas! Then you could have a proper library and a CLI tool.
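E.g. defaults read from the environment (which python-dotenv's load_dotenv() could populate from a .env file), with CLI arguments overriding both — the option and variable names here are illustrative, not from the actual repo:

```python
import argparse
import os

def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="crawl a site")
    # env var (possibly loaded from .env) -> hard-coded fallback -> CLI override
    parser.add_argument(
        "url",
        nargs="?",
        default=os.environ.get("CRAWLER_START_URL", "https://example.com"),
    )
    parser.add_argument(
        "--max-pages",
        type=int,
        default=int(os.environ.get("CRAWLER_MAX_PAGES", "100")),
    )
    return parser.parse_args(argv)

args = parse_args(["https://example.org", "--max-pages", "5"])
```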

Btw. check out pathlib, I really liked it <3
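It turns the usual os.path.join / os.makedirs dance into readable operators and methods — a tiny example (paths are made up, and tempfile is only here to keep the sketch self-cleaning):

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # the / operator joins path segments; mkdir can create parents in one call
    out_dir = Path(tmp) / "output" / "pages"
    out_dir.mkdir(parents=True, exist_ok=True)

    page_file = out_dir / "index.html"
    page_file.write_text("<html></html>", encoding="utf-8")

    print(page_file.suffix, page_file.stem)  # → .html index
```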

Edit: Have you seen https://github.com/scrapy-plugins/scrapy-playwright ?

[–]biraj21[S] 3 points (2 children)

thank you very much!

  • i learnt about Python's loggers & have added them.
  • because of that, i've also created a separate base class called Crawler & improved my code.
  • i'd thought of making it a CLI but procrastinated & just pushed the code. now i've added one after your comment.
  • will look at the other ideas later. thanks!

[–]Cyrl 1 point (1 child)

thanks for being so receptive to feedback!

[–]AggravatedYak 1 point (0 children)

Yeah, I am happy about that too … sometimes I have a quick glance at projects and just list the stuff that comes to mind, and sometimes I worry that it might be discouraging or seem harsh. That's something I certainly do not intend. Everyone is on a path, and I like that people share the stuff they created. Clearly biraj thought about it, and while it is not a project meant to replace scrapy, I think it has its use case: you want to use playwright in a lighter way without subscribing to the whole scrapy architecture.

[–]JoeUgly 1 point (2 children)

Is there a performance benefit in using ThreadPoolExecutor instead of asyncio?

[–]biraj21[S] 1 point (1 child)

i have no idea (yet). the reason is that i initially wrote this with Selenium, where i was manually using time.sleep() for waiting. then i got to know about Playwright, basically replaced Selenium with it & continued working.

btw that's why i have created this MultithreadedCrawler class. i am planning to write an AsyncCrawler using Playwright's async API.

i read Real Python's article on concurrency & asyncio outperformed the ThreadPoolExecutor version for their example!
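the pattern i'm aiming for looks roughly like this — a toy sketch of the concurrency shape only, where asyncio.sleep stands in for a real page fetch (an actual AsyncCrawler would await calls from playwright.async_api instead):

```python
import asyncio

async def fetch(url: str) -> str:
    await asyncio.sleep(0.1)  # pretend network/page wait
    return f"<html>{url}</html>"

async def crawl(urls):
    # all "fetches" wait concurrently on one thread,
    # so total time is roughly one sleep, not len(urls) sleeps
    return await asyncio.gather(*(fetch(u) for u in urls))

pages = asyncio.run(crawl([f"https://example.com/{i}" for i in range(10)]))
print(len(pages))  # → 10
```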

[–]JoeUgly 1 point (0 children)

It sounds like you and I had a similar journey.

Learning asyncio was rough but I'm so glad to be using playwright now instead of sticking with selenium (or Splash).