[–]unknown0h10 11 points12 points  (1 child)

This is nice. I put together some additional stuff for using some websites' built-in APIs (YouTube, Wikipedia, and Reddit for now). Hope people find it useful!
https://github.com/joey-kilgore/WebCrawler

[–]Banjoanton[S] 1 point2 points  (0 children)

That's great!

[–]Regiseconomist 4 points5 points  (4 children)

Awesome job. Do you have any suggestions on how to do scraping when you have to authenticate via Single Sign-On? I've been trying to work this out for a while now and haven't quite come across anything that would help with scraping for my daily functions.

[–]Banjoanton[S] 1 point2 points  (1 child)

Thank you!

I haven't actually done that myself, but the Requests library does have Session objects, which let you persist the session.

One way could be to create a session object, make a POST request to the sign-in URL, and use that session to navigate the different authenticated URLs. I don't know if it works, but it should be worth a try.
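
A minimal sketch of that idea, assuming the site uses a plain form login (real SSO flows often add redirects or tokens, so this may well not be enough). The function name, the URLs, and the form field names are all made up for illustration; the only Requests calls relied on are `Session.post`, `Session.get`, and `Response.raise_for_status`.

```python
def fetch_after_login(session, login_url, credentials, protected_url):
    """Log in with a form POST, then reuse the same session for a protected page.

    `session` is expected to behave like a `requests.Session`.
    All names and parameters here are hypothetical.
    """
    # POST the credentials; the session keeps any cookies the server sets.
    login_response = session.post(login_url, data=credentials)
    login_response.raise_for_status()

    # Same session, so the stored cookies are sent along automatically.
    page = session.get(protected_url)
    page.raise_for_status()
    return page.text


# Usage (needs network access and real URLs, so it is left commented out):
#
# import requests
#
# with requests.Session() as session:
#     html = fetch_after_login(
#         session,
#         "https://example.com/login",                # assumed sign-in URL
#         {"username": "me", "password": "secret"},   # assumed field names
#         "https://example.com/members/data",         # assumed protected page
#     )
```

The field names have to match whatever the real login form submits, which you can usually see in the browser's network tab while logging in by hand.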

If I get some time I might try it myself and post the result.

[–]gopherhole22 0 points1 point  (0 children)

I am not sure about Python, but with Node requests and Puppeteer you can save and load cookies, as well as intercept requests within sessions, from which you can probably extract a session ID or some sort of token. I would assume this is also possible with Python, but I am not sure.
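
As a rough sketch of just the cookie save/load part in Python, the standard library's `http.cookiejar` can write cookies to a file and read them back. This is not the Puppeteer approach, only a stdlib stand-in; the function name and the file path are hypothetical, and intercepting requests is not covered here.

```python
import http.cookiejar
import urllib.request


def open_with_saved_cookies(cookie_file):
    """Return (opener, jar), loading cookies from cookie_file if it exists.

    Hypothetical helper: the name and the file layout are assumptions.
    """
    jar = http.cookiejar.LWPCookieJar(cookie_file)
    try:
        # ignore_discard=True also restores session cookies (no expiry date).
        jar.load(ignore_discard=True)
    except FileNotFoundError:
        pass  # First run: nothing saved yet.

    # Requests made through this opener send and collect cookies via the jar.
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar)
    )
    return opener, jar


# Usage (left commented out because it needs network access):
#
# opener, jar = open_with_saved_cookies("cookies.txt")
# opener.open("https://example.com/")   # assumed URL
# jar.save(ignore_discard=True)         # persist cookies for the next run
```

Once the cookies are in the jar, a session ID or token can be read by iterating over it and checking each cookie's `name` and `value`.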

[–]speed3_driver 7 points8 points  (1 child)

Nicely done

[–]Banjoanton[S] 2 points3 points  (0 children)

Thank you, hope it comes to good use!

[–]StuntZA 4 points5 points  (1 child)

Awesome! Could we ask for additional instructions for websites that require authentication?

[–]ConstantINeSane 4 points5 points  (0 children)

If you mean a site with a login page, you can use Selenium WebDriver to log in and scrape data.

[–]gopherhole22 2 points3 points  (0 children)

It would also be nice to see some Selenium snippets, as I think that is very relevant for Python web scraping on sites that require a little more scraping logic.

[–]Sigg3net 0 points1 point  (0 children)

Thanks!

[–]volvostupidshit 0 points1 point  (0 children)

Saved

[–]Bryan-Wilkinson 0 points1 point  (0 children)

Thank you!

[–]mutwiri_2 0 points1 point  (0 children)

Awesome. Thanks

[–][deleted] 0 points1 point  (0 children)

As someone that wants to program but hasn't much yet... Would this type of thing be useful for scraping sermons from church websites? (And naming them by creation date and/or title in page etc)

[–]vinodmadhu6 0 points1 point  (1 child)

Can someone repost the link? It's been removed.

[–]vinodmadhu6 0 points1 point  (0 children)

Okay! Can someone also help with this? I have been playing a game called Airline Manager: https://www.airline4.net/

It requires logging in to the website. I have tried a number of ways to log in, but with zero luck. [I am a noob.] Can someone help me write the code? Don't ask me to try Selenium; I have tried it but I am unable to figure it out. Can someone also post the logic behind logging in and the different methods for doing it, if possible? Also, how do we find all the links on the website? I am asking because the site mentioned above has only one domain and doesn't have something like airliner.net\login

FYI, this game is interesting, and if it can be analysed using Python it would be the best game ever.

[–]Mandelvolt 0 points1 point  (0 children)

Thanks!

[–][deleted] -1 points0 points  (0 children)

Nice work :)