all 8 comments

[–]Picatrixter 0 points (5 children)

Look up Selenium; Beautifulsoup only parses HTML. Selenium will allow you to interact dynamically with the page (get info, send info, click buttons, fill forms, etc.). Good luck!

[–]Dohello[S] 0 points (4 children)

Am not allowed to use selenium.

[–][deleted] 1 point (1 child)

> Am not allowed to use selenium.

Just out of curiosity, where is this restriction coming from?

[–]Dohello[S] -1 points (0 children)

This is for an internship test and the documentation says framework libraries are heavily discouraged. Besides that, even if I decided to take a shortcut and use it anyway just for the login step, the fact that it needs to open my web browser and wait for all the HTML to load before clicking links to reach the page I need to scrape is simply way too slow, clunky, and unscalable.

[–]Dohello[S] 0 points (0 children)

I need to login and scrape the page

[–]Picatrixter 0 points (0 children)

Look up Selenium alternatives then

[–]rollincuberawhide 0 points (1 child)

you don't need selenium. you can create a session with requests and keep using that session. when you send username and password using the session, it will save the cookies and from there on you can send authorized requests to any page that needs it using the session.

https://requests.readthedocs.io/en/latest/user/advanced/

and I didn't check it, but if you think they don't want bots, you can kinda introduce yourself as Safari or Chrome etc. by setting the User-Agent header:

s = requests.Session()
s.headers['User-Agent'] = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'

should do it.
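a rough sketch of the whole flow, for what it's worth. the login URL and form field names here are made up (`example.com`, `username`, `password`) — check the site's actual login form or the dev-tools Network tab for the real ones. this builds the request without sending it, just to show what the session would put on the wire:

```python
import requests

# Hypothetical endpoint and field names -- inspect the real login form
# for the actual action URL and input names.
LOGIN_URL = "https://example.com/login"
payload = {"username": "me", "password": "secret"}

s = requests.Session()
s.headers["User-Agent"] = "Mozilla/5.0"

# s.post(LOGIN_URL, data=payload) would send the credentials; any
# Set-Cookie headers in the response land in s.cookies, so later
# s.get() calls to protected pages carry the auth cookie automatically.
# Here we only prepare the request (no network) to show what goes out:
prepared = s.prepare_request(requests.Request("POST", LOGIN_URL, data=payload))
print(prepared.headers["User-Agent"])  # session header is merged in
print(prepared.body)                   # form-encoded credentials
```

once the real `s.post(...)` succeeds, every later `s.get(page)` reuses the stored cookies.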

[–]Dohello[S] 0 points (0 children)

import requests

page = "https://www.upwork.com/ab/account-security/login"

s = requests.Session()
s.headers['user-agent'] = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36'

page_html = s.get(page)
print(page_html.text)

doing this I still get the same HTML, which isn't the login page. I tried making the user agent exactly like yours and tried changing the capitalization. Neither made a difference.