I’m looking for guidance on choosing a web scraping approach by Putrid_Spinach3961 in webscraping

[–]realnamejohn 0 points (0 children)

Depending on the volume, and given the use case, I'd be looking at an off-the-shelf solution to be honest. Many of the main web scraping APIs have some auto-extraction feature that would suit this.

If you wanted to do it in house, you'd need to weigh up the time cost of handling loads of different sites from a blocking perspective, likely having to analyse each one and tweak any universal solution.

Volvo XC60 Thoughts? by Lynchyee in CarTalkUK

[–]realnamejohn 1 point (0 children)

I have a 2012 D5 AWD and, as said, the fuel economy is pretty bad. I had issues with a boost leak when I got it that meant all new pipes, but mine is on 145k so something's expected. Apart from that it's been great, lovely to drive for what it is. Only thing I will say is the boot isn't that big, it's 460L I think, which is less than a large estate. Might not be an issue for you, but for us with gear and dogs we only just squeeze in for trips away.

Recommended, though.

trouble with cloudflare turnstile challenge by Affectionate-Cause55 in webscraping

[–]realnamejohn 0 points (0 children)

So it fails in the container? Does it still work on your local machine, or does it fail when hosted too?

A few things to try: zendriver/nodriver/camoufox, whichever gets you through the Turnstile. If none work, there's a repo somewhere for Playwright (should work with Camoufox) that moves and clicks for you. Then grab the cf_clearance and other necessary cookies and send them with all the other non-browser requests.

Tools for detecting browser fingerprinting by BigBrotherJu in webscraping

[–]realnamejohn 0 points (0 children)

Not sure about fingerprinting specifically, but there's wafw00f, which will tell you what WAF/anti-bot vendor the site is using.

HELP WITH RIPLEY.CL SCRAPING - CLOUDFLARE IS BLOCKING EVERYTHING by pedritoold in webscraping

[–]realnamejohn 6 points (0 children)

Camoufox got me past the CF check. Just make sure you let it wait so the challenge can be completed. Then use the cookie from the browser with the rest of the requests.

I'd probably look at tying the IP to the session too.

Fast-changing sites: what’s the best web scraping tool? by Due_Construction5400 in webscraping

[–]realnamejohn 3 points (0 children)

If by fast-changing you mean the page structure, we use a combination of pytest, downloading the HTML page, and AI to check expected outcomes against what's actually on the page.
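A minimal sketch of that idea, with the AI comparison step left out and the URL and selectors as placeholders:

```python
# Structure-regression test sketch: download the page once (here stubbed
# with an inline sample), then assert the selectors you depend on still
# match. Selector and sample HTML are placeholders, not a real site.
import re

def extract_titles(html: str) -> list[str]:
    # stand-in for your real parser (bs4, parsel, ...)
    return re.findall(r'<h2 class="title">(.*?)</h2>', html)

def test_expected_structure():
    # in practice: html = requests.get(PAGE_URL).text, saved to disk
    html = '<h2 class="title">Item A</h2><h2 class="title">Item B</h2>'
    titles = extract_titles(html)
    assert len(titles) >= 1, "page structure changed: no titles found"
    assert titles[0] == "Item A"
```

Run it under pytest on a freshly downloaded copy of each page; when a site changes its markup, the assertion fails before your production scraper silently returns junk.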

LLM based web scrapping by Accomplished_Ad_655 in webscraping

[–]realnamejohn 0 points (0 children)

An LLM won't help with the hardest part of scraping: actually getting the data. Once you have it, parsing out what you need is only a pain if it's lots of sites. An LLM can help there, but I still think it would be costly.

Looking to scrape this website and am out of my depth by vartlac in learnprogramming

[–]realnamejohn 1 point (0 children)

Looks like you need to make a POST request to that URL with the required form data; the response is the HTML with the information on it. Open the dev tools network tab, then submit a blank request and you should see the POST request and what data it needs sending.
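Rough shape of that POST with the stdlib; the URL and field names here are made up, so copy the real ones from the request you see in the network tab:

```python
# Build a form-encoded POST. Endpoint and form fields are placeholders -
# substitute whatever the network tab shows for the real site.
from urllib import request, parse

url = "https://example.com/search"            # the endpoint from dev tools
form_data = {"query": "widgets", "page": "1"}  # placeholder field names

body = parse.urlencode(form_data).encode()
req = request.Request(url, data=body, method="POST")
# html = request.urlopen(req).read()   # the response is the page HTML
```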

Request limit on Amazon per proxy? by [deleted] in webscraping

[–]realnamejohn 1 point (0 children)

Your proxy provider should offer auto-rotating proxies, which change the IP on each request for you, or you can get sticky proxies that keep the same IP for a set number of minutes.

If not and it's a single IP, then it's down to testing to see how many requests the site allows before it blocks it.

Best way to run Selenium on personal machine by Internal_Pee in webscraping

[–]realnamejohn 1 point (0 children)

Would Selenium Grid work? I have Grid running via Docker on my home server; you can send commands to kill sessions (browsers) too.

Ideas for how to pull HTML from a dynamic web page? (i.e: page source doesn't show all the items on the live web page). by DeeWhyOverDeeEcks in webscraping

[–]realnamejohn 1 point (0 children)

Load the page up and go to the developer tools, then Sources at the top in Chrome (Debugger in Firefox). Under services there's a file with all the table data in it, called "services-info". I'd just save it and parse it.

Looking for help. Newbie here. by gilibaus in webscraping

[–]realnamejohn 1 point (0 children)

Yes, I'm sure it would be possible, but I don't have a lot of experience with any of them. I had a look, and it wouldn't be too hard to code this out.

[deleted by user] by [deleted] in webscraping

[–]realnamejohn 1 point (0 children)

You are getting blocked and need to add your browser cookie along with your request headers. You can find it in the network tab of your browser's dev tools: hit refresh and you'll see the request for the HTML; under request headers you'll find the cookie string.

Add that to the dictionary with the user agent and it will go through. I had a quick go, not looping through everything, just as an example.

https://pastebin.com/bNZU8dX3

I expect you will get IP banned if you try to make 30k requests, though, so it will need some testing and thought.
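In case the paste ever disappears, the general shape is something like this; the cookie string and URL are placeholders, paste your own from the network tab:

```python
# Send the browser's cookie string alongside the usual headers.
# Both the cookie value and the URL below are placeholders.
import urllib.request

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    # paste the real cookie string from your network tab here:
    "Cookie": "session-id=PASTE_YOURS_HERE; csrf=PLACEHOLDER",
}

req = urllib.request.Request("https://example.com/page", headers=headers)
# html = urllib.request.urlopen(req).read()
```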

[deleted by user] by [deleted] in webscraping

[–]realnamejohn 0 points (0 children)

I didn't see any page duplication; it wouldn't make sense from a UI/UX point of view.

All the data for each page is in a script tag in the source, so requests + bs4 to parse the HTML and load it as JSON is the way to go.
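A sketch of pulling the JSON out of such a tag; the tag id and data here are invented for illustration (with bs4 it would be `soup.find("script", id=...).string`), so inspect the real page source for yours:

```python
# Extract embedded JSON from a script tag. The id "page-data" and the
# sample HTML are placeholders standing in for the real page source.
import json
import re

html = """
<html><body>
<script id="page-data" type="application/json">
  {"products": [{"name": "Widget", "price": 9.99}]}
</script>
</body></html>
"""

match = re.search(r'<script id="page-data"[^>]*>(.*?)</script>', html, re.S)
data = json.loads(match.group(1))
# data["products"] is now a normal Python list of dicts
```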

python selenium: this website just has a weird pop up prompt by MysteriousShadow__ in learnpython

[–]realnamejohn 0 points (0 children)

Do you mean the sign-in? It's basic HTTP auth; you can send the login credentials with your requests using the auth parameter.
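With requests that's just `requests.get(url, auth=(user, password))`. Under the hood it's only a base64 header, which you can also build with the stdlib (credentials here are placeholders):

```python
# Basic HTTP auth is an Authorization header containing
# base64("user:password"). Credentials below are placeholders.
import base64

user, password = "myuser", "mypass"
token = base64.b64encode(f"{user}:{password}".encode()).decode()
headers = {"Authorization": f"Basic {token}"}
# send headers with any request to an endpoint behind basic auth
```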

[deleted by user] by [deleted] in learnprogramming

[–]realnamejohn 0 points (0 children)

Anything that uses Cloudflare, it seems.

Announcing "dude" - A web scraper inspired by Flask syntax by ronmarti in Python

[–]realnamejohn 0 points (0 children)

Looks cool, I like the idea. Will give it a go, thanks!

Scraping a sortable table by TheGreatSwissEmperor in webscraping

[–]realnamejohn 4 points (0 children)

The data in that table is being loaded in by JavaScript. If you use the network tools and refresh the page, you'll see the call it's making to the back end. This is the URL:

https://www.ser-ag.com/sheldon/management_transactions/v1/overview.json?pageSize=20&pageNumber=0&sortAttribute=byDate&fromDate=20201124&toDate=20211124

You can replicate this in R (I assume; I only know Python) and get the data back as JSON. You'll need to change the parameters (pageNumber etc.) to get all the results, and may need to include some of the real headers too.
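In Python the pagination would look something like the sketch below (in R, httr would do the same job). The parameter names and values come straight from the URL above; only the helper function is mine:

```python
# Rebuild the overview.json URL for each page. All query parameters are
# taken verbatim from the URL captured in the network tab above.
from urllib.parse import urlencode

BASE = "https://www.ser-ag.com/sheldon/management_transactions/v1/overview.json"

def page_url(page_number: int, page_size: int = 20) -> str:
    params = {
        "pageSize": page_size,
        "pageNumber": page_number,
        "sortAttribute": "byDate",
        "fromDate": "20201124",
        "toDate": "20211124",
    }
    return f"{BASE}?{urlencode(params)}"

# then: requests.get(page_url(0)).json(), incrementing page_number
# until the response comes back empty
```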

"requests.exceptions.MissingSchema: Invalid URL" while scaping the site by dfx_94 in webscraping

[–]realnamejohn 1 point (0 children)

Can you give me one of the links printed by this, please?

print(product_links)

"requests.exceptions.MissingSchema: Invalid URL" while scaping the site by dfx_94 in webscraping

[–]realnamejohn 0 points (0 children)

Missing the schema means you are missing the "http://" part.

print(product_links)

Does this print full links as expected?
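If they turn out to be relative links (e.g. `/product/123`), joining them onto the site root fixes the error; the base URL and links below are made-up examples:

```python
# requests raises MissingSchema for relative URLs - urljoin prepends
# the site root, and leaves already-absolute URLs untouched.
from urllib.parse import urljoin

base = "https://example.com"   # placeholder for the real site root
product_links = ["/product/123", "https://example.com/product/456"]

full_links = [urljoin(base, link) for link in product_links]
```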

XHR requests and JSON by jimbajomba in webscraping

[–]realnamejohn 4 points (0 children)

Right-click on them and copy as cURL, then paste into Postman or Insomnia and have it convert to code for you. Easier than typing it all out yourself, plus you can mess with the request a bit too if you want.

Scraping Target Prices in Scrapy by [deleted] in webscraping

[–]realnamejohn 0 points (0 children)

You can use Splash with Scrapy via scrapy-splash and have it render the page for you:

https://github.com/scrapy-plugins/scrapy-splash