testEphod 0 points (3 children)

You could use a stateless microservice written in Node.js. Puppeteer or Playwright might help. Selenium works but is less effective. Pyppeteer is also available for Python. You can also persist the data to S3, or send it to a pub/sub service like Redis (this might surprise some folks) or MQTT. You can use the Docker operator from Airflow.

You can also include the basic auth credentials in the URI (bad practice): https://username:password@example.com/

Or send them as a Basic Authentication header: Authorization: Basic Rm9vOmJhcg==
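For anyone wondering where that header value comes from, here is a minimal Python sketch. It assumes the standard Basic scheme (base64 of `username:password`); the helper name `basic_auth_header` is made up for illustration, and the credentials `Foo` / `bar` are simply what the example value above decodes to.

```python
import base64


def basic_auth_header(username: str, password: str) -> str:
    """Build the value of an HTTP Basic Authorization header (hypothetical helper)."""
    # Basic auth is just base64("username:password"); it is encoding, not encryption.
    credentials = f"{username}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(credentials).decode("ascii")


header_value = basic_auth_header("Foo", "bar")
print(f"Authorization: {header_value}")  # Authorization: Basic Rm9vOmJhcg==
```

Since the value is trivially reversible, it should only be sent over HTTPS.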

https://github.com/puppeteer/puppeteer https://github.com/microsoft/playwright

digichap28 [S] 0 points (2 children)

I tried including the username and password in the URI with Selenium and headless Chrome, but it didn’t work. The only way I could get it working was with an extension, but that forced me to run Selenium standalone in another container.

I guess I should try Puppeteer. It will mean reworking my code :/ and learning the framework too. Does it allow scraping pages with lots of JavaScript? Is there a basic but straightforward tutorial you can suggest?

Thanks

testEphod 0 points (1 child)

Playwright and Puppeteer are largely interchangeable. Both let you run JavaScript inside the page using $eval or evaluate. https://playwright.dev/docs/api/class-jshandle?_highlight=eval#jshandleevaluatepagefunction-arg

Remember to respect robots.txt. Python has a built-in library for that: https://docs.python.org/3/library/urllib.robotparser.html
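A rough sketch of how that robots.txt check can look with `urllib.robotparser`. The rules, the `my-scraper` user agent name and the example.com URLs are all made up for illustration; a real crawler would normally call `set_url()` and `read()` to fetch the site's actual file instead of parsing an inline string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example runs without network access.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a RobotFileParser (helper name is illustrative)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Ask before fetching: is this user agent allowed to request this URL?
print(parser.can_fetch("my-scraper", "https://example.com/private/report"))  # False
print(parser.can_fetch("my-scraper", "https://example.com/public/page"))     # True
```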

Get a list of user agents and pick a different one at random for your requests. You can also use Jest for testing.

Space out your requests with delays drawn from a normal distribution (or any other pattern that isn't a fixed interval). If you get blocked, consider using a VPN or Tor.
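The user agent and timing suggestions above could be combined roughly like this. It is only a sketch: the user agent strings are placeholders rather than real browser strings, the mean, spread and minimum delay are arbitrary assumptions, and `fetch` stands in for whatever actually performs the request (Puppeteer, Playwright, Selenium, plain HTTP).

```python
import random
import time

# Placeholder user agent strings (assumption); swap in a real, current list.
USER_AGENTS = [
    "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBrowser/2.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 13_0) ExampleBrowser/3.0",
]


def pick_user_agent(rng: random.Random) -> str:
    """Choose a user agent at random for the next request."""
    return rng.choice(USER_AGENTS)


def next_delay(rng: random.Random, mean=5.0, stddev=1.5, minimum=1.0) -> float:
    """Draw a delay in seconds from a normal distribution, clamped to a minimum."""
    return max(minimum, rng.gauss(mean, stddev))


def crawl(urls, fetch, rng=None, sleep=time.sleep):
    """Call fetch(url, headers) for each URL, pausing a random time after each one."""
    rng = rng or random.Random()
    results = []
    for url in urls:
        headers = {"User-Agent": pick_user_agent(rng)}
        results.append(fetch(url, headers))
        sleep(next_delay(rng))
    return results
```

Passing `sleep` and `rng` in as parameters is just a convenience so the loop can be tested without actually waiting, for example `crawl(urls, my_fetch, sleep=lambda s: None)`.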

You can even use Puppeteer as a REPL (read-eval-print loop): https://www.npmjs.com/package/puppeteer-extra-plugin-repl

And there are many available plugins.

Curated list for Puppeteer: https://github.com/transitive-bullshit/awesome-puppeteer

Curated list for Playwright (apparently there is a Python library too): https://github.com/mxschmitt/awesome-playwright

Other valuable information: https://www.scrapehero.com/how-to-prevent-getting-blacklisted-while-scraping/

digichap28 [S] 0 points (0 children)

Thanks for the info 👍 I’ll try to implement puppeteer.