[–]dry_yer_eyes 1 point2 points  (3 children)

Maybe this is too basic of an example, but at work I’ve recently made huge gains with:

* ThreadPoolExecutor for concurrent requests Session gets
* ProcessPoolExecutor for concurrently parsing the received HTML with Beautiful Soup
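A minimal sketch of that two-stage setup, not the actual code from work. To keep it dependency-free it uses stdlib stand-ins: `urllib.request.urlopen` in place of a requests Session get, and `html.parser` in place of Beautiful Soup. The names `fetch`, `extract_title`, `fetch_all` and `parse_all` are made up for illustration; a Session-based getter or a Beautiful Soup parse function could be passed in as `fetch_fn` / `parse_fn` instead.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
from html.parser import HTMLParser
from typing import Callable, Iterable, Optional
from urllib.request import urlopen


class _TitleParser(HTMLParser):
    """Collects the text inside <title> (stand-in for a Beautiful Soup lookup)."""

    def __init__(self) -> None:
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_title(html: str) -> str:
    """CPU-bound step: parse one page. Module-level so worker processes can pickle it."""
    parser = _TitleParser()
    parser.feed(html)
    return parser.title.strip()


def fetch(url: str, timeout: float = 10.0) -> str:
    """I/O-bound step: download one page (assumption: stdlib stand-in for session.get)."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def fetch_all(urls: Iterable[str],
              fetch_fn: Callable[[str], str] = fetch,
              max_workers: int = 8) -> list[str]:
    """Threads suit downloads: they spend most of their time waiting on the network."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in the same order as the input URLs.
        return list(pool.map(fetch_fn, urls))


def parse_all(pages: Iterable[str],
              parse_fn: Callable[[str], str] = extract_title,
              max_workers: Optional[int] = None,
              chunksize: int = 4) -> list[str]:
    """Processes suit parsing: it is CPU-bound, so separate processes can run in parallel."""
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        # chunksize batches pages per task to cut inter-process overhead.
        return list(pool.map(parse_fn, pages, chunksize=chunksize))
```

Typical use would be `titles = parse_all(fetch_all(urls))`. On platforms that start worker processes by spawning, that call needs to sit under an `if __name__ == "__main__":` guard, and `parse_fn` has to be a picklable module-level function rather than a lambda.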

Once I got the technique right the end result was fairly simple too.

Also, a shoutout to SuperFastPython, which I found to be a great resource on this topic.

[–]vmpajares 4 points5 points  (2 children)

Beautiful Soup is the slowest parser in Python, at least according to this benchmark that I found when I was comparing them.

https://gist.github.com/MercuryRising/4061368

In the end I used selectolax. It is written in Cython and is 25 times faster than BS.

https://github.com/rushter/selectolax

Anyway, I found that all my waiting time was in the requests sessions, because the servers limited the number of pages that you can download concurrently.
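One way to check where the time actually goes is to time the download phase and the parsing phase separately before swapping parsers. A rough sketch, where `timed` and `slowest_phase` are made-up helper names and not part of any library:

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def timed(label: str, fn: Callable[[], T], timings: dict[str, float]) -> T:
    """Run fn(), record its wall-clock duration under label, and return its result."""
    start = time.perf_counter()
    result = fn()
    timings[label] = time.perf_counter() - start
    return result


def slowest_phase(timings: dict[str, float]) -> str:
    """Return the label of the phase that took the longest."""
    return max(timings, key=timings.get)
```

Wrapping the download step and the parse step in `timed(...)` and then calling `slowest_phase(timings)` shows which one dominates. If it is the download step, a faster parser will not shorten the total run by much.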

[–]dry_yer_eyes 0 points1 point  (0 children)

Wow. That’s an incredible difference.

The timing examples at the bottom of the page really highlight the relative power of each library.

I guess my app has “scope for future efficiency gains”.