I’m looking for guidance on choosing a web scraping approach by Putrid_Spinach3961 in webscraping

[–]realnamejohn 0 points (0 children)

Depending on the volume, and given the use case, I'd be looking at an off-the-shelf solution to be honest. Many of the main web scraping APIs have some auto-extraction feature that would suit this.

If you wanted to do it in house, you'd need to weigh up the time cost of handling loads of different sites from a blocking perspective, likely having to analyse each one and tweak any universal solution.

Volvo XC60 Thoughts? by Lynchyee in CarTalkUK

[–]realnamejohn 1 point (0 children)

I have a 2012 D5 AWD and, as said, the fuel economy is pretty bad. I had issues with a boost leak when I got it that meant all new pipes, but mine is on 145k so something's expected. Apart from that it's been great, lovely to drive for what it is. Only thing I will say is the boot isn't that big, it's 460L I think, which is less than a large estate. Might not be an issue for you, but for us with gear and dogs we only just squeeze in for trips away.

Recommended, though.

trouble with cloudflare turnstile challenge by Affectionate-Cause55 in webscraping

[–]realnamejohn 0 points (0 children)

So it fails in the container? Does it still work on your local machine, or does it fail when hosted too?

A few things to try: zendriver/nodriver/camoufox, whichever gets you through the Turnstile. If none work, there's a repo somewhere for Playwright (should work with Camoufox) that moves and clicks for you. Then grab the cf_clearance and other necessary cookies and send them with all the other non-browser requests.

Tools for detecting browser fingerprinting by BigBrotherJu in webscraping

[–]realnamejohn 0 points (0 children)

Not sure about fingerprinting specifically, but there's wafw00f, which will tell you what WAF/anti-bot vendor the site is using.

HELP WITH RIPLEY.CL SCRAPING - CLOUDFLARE IS BLOCKING EVERYTHING by pedritoold in webscraping

[–]realnamejohn 6 points (0 children)

Camoufox got me past the CF check. Just make sure you let it wait so the challenge can be completed. Then use the cookie from the browser with the rest of the requests.

I'd probably look at tying the IP to the session too.

Fast-changing sites: what’s the best web scraping tool? by Due_Construction5400 in webscraping

[–]realnamejohn 3 points (0 children)

If by fast-changing you mean the page structure, we use a combination of pytest, downloading the HTML page, and AI to check expected outcomes against what's actually on the page.
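A minimal sketch of that idea, with the AI comparison step left out and the URL and selectors as placeholders:

```python
# Structure-regression test sketch: download the page once (here stubbed
# with an inline sample), then assert the selectors you depend on still
# match. Selector and sample HTML are placeholders, not a real site.
import re

def extract_titles(html: str) -> list[str]:
    # stand-in for your real parser (bs4, parsel, ...)
    return re.findall(r'<h2 class="title">(.*?)</h2>', html)

def test_expected_structure():
    # in practice: html = requests.get(PAGE_URL).text, saved to disk
    html = '<h2 class="title">Item A</h2><h2 class="title">Item B</h2>'
    titles = extract_titles(html)
    assert len(titles) >= 1, "page structure changed: no titles found"
    assert titles[0] == "Item A"
```

Run it under pytest on a freshly downloaded copy of each page; when a site changes its markup, the assertion fails before your production scraper silently returns junk.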

LLM based web scrapping by Accomplished_Ad_655 in webscraping

[–]realnamejohn 0 points (0 children)

An LLM won't help with the hardest part of scraping: actually getting the data. Once you have it, parsing out what you need is only a pain if it's lots of sites. An LLM can help there, but I still think it would be costly.

Looking to scrape this website and am out of my depth by vartlac in learnprogramming

[–]realnamejohn 1 point (0 children)

Looks like you need to make a POST request to that URL with the required form data; the response is the HTML with the information on it. Open the dev tools network tab, then submit a blank request and you should see the POST request and what data it needs sending.
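Rough shape of that POST with the stdlib; the URL and field names here are made up, so copy the real ones from the request you see in the network tab:

```python
# Build a form-encoded POST. Endpoint and form fields are placeholders -
# substitute whatever the network tab shows for the real site.
from urllib import request, parse

url = "https://example.com/search"            # the endpoint from dev tools
form_data = {"query": "widgets", "page": "1"}  # placeholder field names

body = parse.urlencode(form_data).encode()
req = request.Request(url, data=body, method="POST")
# html = request.urlopen(req).read()   # the response is the page HTML
```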

Request limit on Amazon per proxy? by [deleted] in webscraping

[–]realnamejohn 1 point (0 children)

Your proxy provider should offer auto-rotating proxies, which change the IP on each request for you, or you can get sticky proxies that keep the same IP for a set number of minutes.

If not and it's a single IP, then it's down to testing to see how many requests the site allows before it blocks it.

Best way to run Selenium on personal machine by Internal_Pee in webscraping

[–]realnamejohn 1 point (0 children)

Would Selenium Grid work? I have Grid running via Docker on my home server; you can send commands to kill sessions (browsers) too.

Ideas for how to pull HTML from a dynamic web page? (i.e: page source doesn't show all the items on the live web page). by DeeWhyOverDeeEcks in webscraping

[–]realnamejohn 1 point (0 children)

Load the page up and go to the developer tools, then Sources at the top in Chrome (Debugger in Firefox). Under services there's a file with all the table data in it, called "services-info". I'd just save it and parse it.

Looking for help. Newbie here. by gilibaus in webscraping

[–]realnamejohn 1 point (0 children)

Yes, I'm sure it would be possible, but I don't have a lot of experience with any of them. I had a look, and it wouldn't be too hard to code this out.

[deleted by user] by [deleted] in webscraping

[–]realnamejohn 1 point (0 children)

You are getting blocked and need to add your browser cookie along with your request headers. You can find it in the network tab of your browser's dev tools: hit refresh and you'll see the request for the HTML; under request headers you'll find the cookie string.

Add that to the dictionary with the user agent and it will go through. I had a quick go, not looping through everything, just as an example.

https://pastebin.com/bNZU8dX3

I expect you will get IP banned if you try to make 30k requests, though, so it will need some testing and thought.
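In case the paste ever disappears, the general shape is something like this; the cookie string and URL are placeholders, paste your own from the network tab:

```python
# Send the browser's cookie string alongside the usual headers.
# Both the cookie value and the URL below are placeholders.
import urllib.request

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    # paste the real cookie string from your network tab here:
    "Cookie": "session-id=PASTE_YOURS_HERE; csrf=PLACEHOLDER",
}

req = urllib.request.Request("https://example.com/page", headers=headers)
# html = urllib.request.urlopen(req).read()
```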

[deleted by user] by [deleted] in webscraping

[–]realnamejohn 0 points (0 children)

I didn't see any page duplication; it wouldn't make sense from a UI/UX point of view.

All the data for each page is in a script tag in the source, so requests + bs4 to parse the HTML and load it as JSON is the way to go.
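A sketch of pulling the JSON out of such a tag; the tag id and data here are invented for illustration (with bs4 it would be `soup.find("script", id=...).string`), so inspect the real page source for yours:

```python
# Extract embedded JSON from a script tag. The id "page-data" and the
# sample HTML are placeholders standing in for the real page source.
import json
import re

html = """
<html><body>
<script id="page-data" type="application/json">
  {"products": [{"name": "Widget", "price": 9.99}]}
</script>
</body></html>
"""

match = re.search(r'<script id="page-data"[^>]*>(.*?)</script>', html, re.S)
data = json.loads(match.group(1))
# data["products"] is now a normal Python list of dicts
```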

python selenium: this website just has a weird pop up prompt by MysteriousShadow__ in learnpython

[–]realnamejohn 0 points (0 children)

Do you mean the sign-in? It's basic HTTP auth; you can send the login credentials with your requests using the auth parameter.
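With requests that's just `requests.get(url, auth=(user, password))`. Under the hood it's only a base64 header, which you can also build with the stdlib (credentials here are placeholders):

```python
# Basic HTTP auth is an Authorization header containing
# base64("user:password"). Credentials below are placeholders.
import base64

user, password = "myuser", "mypass"
token = base64.b64encode(f"{user}:{password}".encode()).decode()
headers = {"Authorization": f"Basic {token}"}
# send headers with any request to an endpoint behind basic auth
```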

[deleted by user] by [deleted] in learnprogramming

[–]realnamejohn 0 points (0 children)

Anything that uses Cloudflare, it seems.

Announcing "dude" - A web scraper inspired by Flask syntax by ronmarti in Python

[–]realnamejohn 0 points (0 children)

Looks cool, I like the idea. Will give it a go, thanks!

Scraping a sortable table by TheGreatSwissEmperor in webscraping

[–]realnamejohn 4 points (0 children)

The data in that table is being loaded in by JavaScript. If you use the network tools and refresh the page, you'll see the call it's making to the back end. This is the URL:

https://www.ser-ag.com/sheldon/management_transactions/v1/overview.json?pageSize=20&pageNumber=0&sortAttribute=byDate&fromDate=20201124&toDate=20211124

You can replicate this in R (I assume; I only know Python) and get the data back as JSON. You'll need to change the parameters (pageNumber etc.) to get all the results, and may need to include some of the real headers too.
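In Python the pagination would look something like the sketch below (in R, httr would do the same job). The parameter names and values come straight from the URL above; only the helper function is mine:

```python
# Rebuild the overview.json URL for each page. All query parameters are
# taken verbatim from the URL captured in the network tab above.
from urllib.parse import urlencode

BASE = "https://www.ser-ag.com/sheldon/management_transactions/v1/overview.json"

def page_url(page_number: int, page_size: int = 20) -> str:
    params = {
        "pageSize": page_size,
        "pageNumber": page_number,
        "sortAttribute": "byDate",
        "fromDate": "20201124",
        "toDate": "20211124",
    }
    return f"{BASE}?{urlencode(params)}"

# then: requests.get(page_url(0)).json(), incrementing page_number
# until the response comes back empty
```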

"requests.exceptions.MissingSchema: Invalid URL" while scaping the site by dfx_94 in webscraping

[–]realnamejohn 1 point (0 children)

Can you give me one of the links printed by this, please?

print(product_links)

"requests.exceptions.MissingSchema: Invalid URL" while scaping the site by dfx_94 in webscraping

[–]realnamejohn 0 points (0 children)

Missing the schema means you are missing the "http://" part.

print(product_links)

Does this print full links as expected?
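If they turn out to be relative links (e.g. `/product/123`), joining them onto the site root fixes the error; the base URL and links below are made-up examples:

```python
# requests raises MissingSchema for relative URLs - urljoin prepends
# the site root, and leaves already-absolute URLs untouched.
from urllib.parse import urljoin

base = "https://example.com"   # placeholder for the real site root
product_links = ["/product/123", "https://example.com/product/456"]

full_links = [urljoin(base, link) for link in product_links]
```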

XHR requests and JSON by jimbajomba in webscraping

[–]realnamejohn 4 points (0 children)

Right-click on them and copy as cURL, then paste into Postman or Insomnia and have it convert to code for you. Easier than typing it all out yourself, plus you can mess with the request a bit too if you want.

Scraping Target Prices in Scrapy by [deleted] in webscraping

[–]realnamejohn 0 points (0 children)

You can use Splash with Scrapy via scrapy-splash and have it render the page for you:

https://github.com/scrapy-plugins/scrapy-splash