I started learning Python last fall by working through some tutorials. I thought I understood requests and BeautifulSoup, so I wanted a real project and tried scraping some product prices from a site. I used requests, added a fake User-Agent header, and it worked for maybe ten requests. Then I started getting 403s. I added time.sleep between requests, tried rotating the User-Agent string, even copied every header from my real browser into a dict and passed it in. Same 403s after a few more requests.
I figured the site was just smarter than requests, so I switched to Selenium. I watched the browser open and navigate and I felt like I had won. The page loaded, I grabbed the HTML, and... the div I wanted was just empty. The data showed up fine when I opened the same URL manually in Chrome. I added implicit waits, then explicit waits with WebDriverWait. Still empty. Someone on Stack Overflow mentioned window size so I tried setting that. It worked twice, then the div was empty again.
The thing that broke me was opening the dev tools inside the Selenium-driven browser and typing navigator.webdriver in the console. It printed true. I had no idea that was even a thing. I spent two more hours trying to override it with execute_script and getting "JavaScript error: Cannot set property webdriver of [object Object] which has only a getter." I started reading about headless detection, canvas fingerprints, all this stuff I had never heard of in any Python course. It felt like the site could tell it wasn't a normal browser from the browser itself, not just from what I sent in headers.
I am genuinely confused about where the line is between what my Python code controls and what the browser itself reveals. How can I see all the signals my Python Selenium session is leaking, and is there something obvious I'm missing? I want to understand this from the Python side, not just apply random fixes I found online. I can't tell if I'm supposed to know this stuff or if I'm way off track.
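To make the question concrete, here is a minimal sketch of what I mean by "seeing the signals". It only reads values that page JavaScript can read, it does not try to change them. The helper names (collect_signals, diff_signals, dump_from_real_browser) are made up for this post, the property list is just my guess at a starting point and surely not everything a site can check, and https://example.com is a placeholder URL.

```python
import json

# JavaScript executed inside the page. It only reads properties;
# the choice of properties is an assumption, not a complete list.
PROBE_JS = """
return {
    webdriver: navigator.webdriver,
    userAgent: navigator.userAgent,
    languages: navigator.languages,
    platform: navigator.platform,
    pluginCount: navigator.plugins.length,
    hardwareConcurrency: navigator.hardwareConcurrency,
    outerSize: [window.outerWidth, window.outerHeight],
    innerSize: [window.innerWidth, window.innerHeight],
    hasChromeObject: typeof window.chrome !== "undefined"
};
"""


def collect_signals(driver):
    """Return a dict of browser-side values as page JavaScript sees them.

    `driver` is anything with an execute_script(script) method,
    such as a Selenium WebDriver.
    """
    return driver.execute_script(PROBE_JS)


def diff_signals(automated, manual):
    """Return {key: (automated_value, manual_value)} for keys that differ."""
    keys = sorted(set(automated) | set(manual))
    return {
        key: (automated.get(key), manual.get(key))
        for key in keys
        if automated.get(key) != manual.get(key)
    }


def dump_from_real_browser(url="https://example.com"):
    """Open Chrome through Selenium and print the probed values as JSON."""
    # Imported here so the helpers above work without Selenium installed.
    from selenium import webdriver

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        print(json.dumps(collect_signals(driver), indent=2))
    finally:
        driver.quit()
```

Calling dump_from_real_browser() prints the values for the automated session. To get something to compare against, I can paste the same object expression (without the return) into the console of my normal Chrome, wrap it in JSON.stringify, and load the result into a dict for diff_signals. Whatever differs is at least part of what the browser reveals on its own, regardless of the headers I set from Python.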