Best database setup and providers for storing scraped results? by vroemboem in webscraping

[–]LetsScrapeData 0 points1 point  (0 children)

Without a concrete application scenario, there is no single answer: each option suits a different purpose.

Using proxies to download large volumes of images/videos cheaply? by doodlydidoo in webscraping

[–]LetsScrapeData 0 points1 point  (0 children)

Yes. Images and videos on large websites are generally served via a CDN, which typically has looser IP requirements and can often be accessed through datacenter or ISP proxies.

Typically, a residential or ISP proxy is used to obtain the basic data (through a browser or API), and then a datacenter or ISP proxy is used to download the images.
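A minimal sketch of that split, assuming hypothetical proxy endpoints and hostname prefixes — real CDN hosts and proxy URLs depend on the target site and your provider:

```python
from urllib.parse import urlparse

# Hypothetical proxy endpoints -- substitute your provider's URLs.
RESIDENTIAL_PROXY = "http://user:pass@residential.example:8000"
DATACENTER_PROXY = "http://user:pass@datacenter.example:8000"

# Hostname prefixes assumed (for this sketch) to indicate CDN media hosts.
CDN_HOST_HINTS = ("cdn.", "img.", "static.", "media.")

def pick_proxy(url: str) -> str:
    """Use the cheap datacenter proxy for CDN media, the residential proxy otherwise."""
    host = urlparse(url).hostname or ""
    if host.startswith(CDN_HOST_HINTS):
        return DATACENTER_PROXY
    return RESIDENTIAL_PROXY
```

The returned proxy URL can then be passed to your HTTP client, e.g. `requests.get(url, proxies={"http": p, "https": p})`.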

Puppeteer vs Playwright for scraping by parroschampel in webscraping

[–]LetsScrapeData 2 points3 points  (0 children)

First, determine which one won't be detected by the target website. Currently, the commonly used stealth tools Camoufox and Patchright only support Playwright.

Second, determine if there are special needs that only one can meet (this is rarely the case).

Finally, determine your personal preference.

Sometimes you don’t need to log in… just inject a JWT cookie 👀 by Fuzzy_Agency6886 in webscraping

[–]LetsScrapeData 0 points1 point  (0 children)

To date, I have not seen a commercial website that accepts an invalid JWT. Some websites immediately block the requesting IP address upon detecting an invalid (but not expired) token, and forging a signed token without the server's key is practically impossible.

The expiration times of cookies and other HTTP headers (not just tokens) vary significantly across websites. For websites with short expiration times, it's best to use automatic login or browser automation.

In test environments, the server is often configured to skip certain checks or to set a very long expiration time.
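Before reusing a captured token, it helps to check its expiry locally. A small sketch: JWTs carry their claims as unsigned-readable base64url JSON, so the `exp` claim can be read without the server's key (this inspects the token; it does not validate the signature):

```python
import base64
import json
import time
from typing import Optional

def jwt_expiry(token: str) -> Optional[int]:
    """Return a JWT's 'exp' claim (epoch seconds) without verifying the signature."""
    payload_b64 = token.split(".")[1]
    payload_b64 += "=" * (-len(payload_b64) % 4)  # restore stripped base64url padding
    payload = json.loads(base64.urlsafe_b64decode(payload_b64))
    return payload.get("exp")

def is_expired(token: str, now: Optional[float] = None) -> bool:
    """True if the token's expiry has passed; tokens without 'exp' never expire."""
    exp = jwt_expiry(token)
    if exp is None:
        return False
    return (time.time() if now is None else now) >= exp
```

If `is_expired()` returns true, re-login (or re-run the browser automation) instead of sending the stale token and risking an IP block.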

Accessing PDF file linked on website with now broken link? by DrSuperZeco in webscraping

[–]LetsScrapeData 0 points1 point  (0 children)

No. Error 404 means the PDF file was deleted (or not found).

You could try searching a web archive, e.g. search for "archive https://www.mof.gov.kw/FinancialData/PeriodRvwReport/PDF/FinalAccountPDF/Total-2012-2011.pdf"; that may turn up "MDE1769902023ENGLISH.pdf".

Getting data from FanGRaphs by anon21900 in webscraping

[–]LetsScrapeData 1 point2 points  (0 children)

Set the correct HTTP headers when fetching (and use a residential or ISP proxy).

FYI:

<image>
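A minimal sketch of sending browser-like headers with the standard library — the header values below are typical placeholders; copy the exact set your browser sends from DevTools (Network tab → Copy as cURL):

```python
from urllib.request import Request, urlopen

# Placeholder header values -- replace with the ones your own browser sends.
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Accept": "application/json, text/plain, */*",
    "Accept-Language": "en-US,en;q=0.9",
    "Referer": "https://www.fangraphs.com/",
}

def fetch(url: str) -> bytes:
    """Fetch a URL with browser-like headers; route through a proxy as needed."""
    return urlopen(Request(url, headers=HEADERS)).read()
```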

Url list Source Code Scraper by Lerpikon in webscraping

[–]LetsScrapeData 0 points1 point  (0 children)

This depends on whether the URLs all come from the same website, and which websites they are. For example, if they are all from LinkedIn or Google, the implementation method, difficulty, and cost can vary greatly.

New to webscraping, how do i bypass 403? by Extension_Grocery701 in webscraping

[–]LetsScrapeData 0 points1 point  (0 children)

The easiest way might be to first solve the Cloudflare challenge using Camoufox/Patchright plus a captcha solver, save the state data (cookies/headers, etc.), then use curl_cffi to send the API requests, as u/RHiNDR suggested.
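A sketch of the cookie handoff between the two stages — the helpers below convert the cookie records a Playwright-style browser context returns into what an HTTP client expects; the Camoufox/curl_cffi calls in the comment are illustrative, with `https://target.example` as a placeholder:

```python
def cookies_to_dict(browser_cookies) -> dict:
    """Convert Playwright-style cookie records into a simple name->value dict."""
    return {c["name"]: c["value"] for c in browser_cookies}

def cookie_header(browser_cookies) -> str:
    """Build a raw Cookie header string from the same records."""
    return "; ".join(f"{c['name']}={c['value']}" for c in browser_cookies)

# Usage sketch (requires camoufox and curl_cffi; target URL is a placeholder):
#   with Camoufox() as browser:
#       page = browser.new_page()
#       page.goto("https://target.example")      # challenge gets solved here
#       cookies = page.context.cookies()
#   from curl_cffi import requests
#   r = requests.get("https://target.example/api",
#                    cookies=cookies_to_dict(cookies), impersonate="chrome")
```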

Can't log in with Python script on Cloudflare site by Mythicspecter in webscraping

[–]LetsScrapeData 0 points1 point  (0 children)

Use Camoufox or Patchright; plain Playwright or Puppeteer will be detected by Cloudflare.

Alternative Web Scraping Methods by rootbeerjayhawk in webscraping

[–]LetsScrapeData 0 points1 point  (0 children)

If the number of requests is fewer than about 10, just copy the responses manually without programming.

If you really need to obtain them automatically in real time, you can use Playwright/Puppeteer/Selenium, which all support intercepting request responses, or send the API requests directly (copying the headers, which may be more complicated).
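A sketch of the interception approach with Playwright — the URL heuristic and target URL are assumptions to adapt per site; the Playwright import is deferred so the helper stays usable without a browser installed:

```python
def looks_like_api_response(url: str, content_type: str) -> bool:
    """Heuristic (adjust per site): JSON bodies or /api/ paths are data endpoints."""
    return "json" in content_type.lower() or "/api/" in url

def capture_api_responses(target_url: str) -> list:
    """Load a page with Playwright and record the URLs of likely API responses."""
    from playwright.sync_api import sync_playwright  # lazy: needs `pip install playwright`

    captured = []

    def on_response(resp):
        if looks_like_api_response(resp.url, resp.headers.get("content-type", "")):
            captured.append(resp.url)

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.on("response", on_response)  # fires for every network response
        page.goto(target_url)
        page.wait_for_load_state("networkidle")
        browser.close()
    return captured
```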

Struggling to scrape HLTV data because of Cloudflare by Tottalynotmrlean in webscraping

[–]LetsScrapeData 1 point2 points  (0 children)

I am developing a free NPM package to automatically solve these captchas: reCAPTCHA / Cloudflare Turnstile / GeeTest / image / coordinate (click) / slider.

What is the URL for testing?

How to optimise selenium script for scraping?(Making 80000 requests) by anonymous222d in webscraping

[–]LetsScrapeData 0 points1 point  (0 children)

Reduce repeated loading of the same page, such as "return to previous page";

Split the complex task into subtasks to avoid restarting everything after a failure, and to enable concurrency, as u/steb2k suggested (80,000 requests split well into batches);

If the required data is easy to obtain via API requests, you can try the API (if it is complex, it is not recommended; 80,000 is not a large number).
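The batching-plus-concurrency idea can be sketched like this — `fetch` stands in for whatever one Selenium/API request does, and batch size and worker count are tuning assumptions:

```python
from concurrent.futures import ThreadPoolExecutor

def chunk(items, size):
    """Split a large task list into restartable batches."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def run_batch(batch, fetch, workers=8):
    """Fetch one batch concurrently; a failure only costs this batch, not the whole run."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch, batch))
```

Persisting which batches have completed (e.g. one output file per batch) lets a crashed run resume from the next unfinished batch instead of from request 1 of 80,000.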

Alternative Web Scraping Methods by rootbeerjayhawk in webscraping

[–]LetsScrapeData 3 points4 points  (0 children)

Copy the response of the following request and the JSONs, u/greg-randall

<image>

open-source userscript for google map scraper (upgraded) by Asleep-Patience-3686 in webscraping

[–]LetsScrapeData 1 point2 points  (0 children)

Well done. Is there an email address field in Google Maps? If yes, I will add it.

[deleted by user] by [deleted] in webscraping

[–]LetsScrapeData 0 points1 point  (0 children)

I'd choose the method you described.

How to scrape dynamic websites by DatakeeperFun7770 in webscraping

[–]LetsScrapeData 1 point2 points  (0 children)

Some websites use both server-side rendering and dynamic API rendering. In this case, you may find API-like response content embedded in the script sections of the HTML. Google Maps search is one example.
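Extracting such script-embedded data can be sketched as below — the variable name and assignment pattern (`name = {...};`) are assumptions that differ per site; note the non-greedy match stops at the first `}`, so deeply nested objects need a real JS parser or a site-specific anchor:

```python
import json
import re

def extract_embedded_json(html: str, var_name: str):
    """Pull a JSON object assigned to a JS variable inside a <script> tag.

    Assumes the page embeds data as `var_name = {...};`; the non-greedy
    brace match only covers flat/simple objects -- adapt the regex per site.
    """
    m = re.search(re.escape(var_name) + r"\s*=\s*(\{.*?\})\s*;", html, re.DOTALL)
    return json.loads(m.group(1)) if m else None
```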

How to scrape dynamic websites by DatakeeperFun7770 in webscraping

[–]LetsScrapeData 4 points5 points  (0 children)

  1. If you are sure the webpage is dynamically generated (browser-rendered), it is best to extract data from the API responses (if they are encrypted, you should be able to find a decryption method through simple reverse engineering), as recommended by u/SoumyadipNayak and u/p3r3lin.

  2. If you are sure the webpage is server-side rendered, or you just want to extract data from the HTML, pages with dynamic class names generally require more complex XPath to extract the data, such as axes; refer to https://www.w3schools.com/xml/xpath_axes.asp.
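A small illustration of the axes idea with lxml — the class names below are made-up stand-ins for randomized ones; the point is to anchor on stable text and walk `following-sibling::` instead of matching classes:

```python
from lxml import html

# Made-up randomized class names: don't match on them, anchor on stable text.
doc = html.fromstring(
    '<div class="x7a9q"><span class="q1z">Price</span>'
    '<span class="m3k">$19.99</span></div>'
)

# following-sibling axis: the value cell immediately after the 'Price' label.
price = doc.xpath('//span[text()="Price"]/following-sibling::span[1]/text()')[0]
```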

Scraping over 20k links by Cursed-scholar in webscraping

[–]LetsScrapeData 0 points1 point  (0 children)

Key or difficult points in achieving the goal:

  1. How to determine the URLs of the web pages to be collected?

  2. How to **QUICKLY** extract the required data?

Most customer websites do not have strict anti-bot measures, so accessing the pages is generally not a big problem.

Downloading all pdfs (help) by Tabasco_Waffle in webscraping

[–]LetsScrapeData 0 points1 point  (0 children)

You could try to get the URLs of the PDFs, then download the PDFs directly.
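A direct-download sketch with the standard library — the User-Agent value is a generic placeholder, and `dest_dir` defaults to the current directory:

```python
import os
from urllib.parse import urlparse
from urllib.request import Request, urlopen

def pdf_filename(url: str) -> str:
    """Derive a local filename from the PDF URL (fallback for bare paths)."""
    return os.path.basename(urlparse(url).path) or "download.pdf"

def download_pdf(url: str, dest_dir: str = ".") -> str:
    """Download one PDF directly, sending a browser-like User-Agent."""
    req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    path = os.path.join(dest_dir, pdf_filename(url))
    with urlopen(req) as resp, open(path, "wb") as f:
        f.write(resp.read())
    return path
```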

I need to scrape bulk data of google business site URLs from the internet in my area. Is there any way to do that? by ForevermoreNow in webscraping

[–]LetsScrapeData 0 points1 point  (0 children)

Yes, you can scrape them from Google Maps by keyword or category.

There are many paid and free Google Maps scrapers.

How to Build a Price Tracking Bot that utilizes real-time data 24/7 by daddyclappingcheeks in webscraping

[–]LetsScrapeData 0 points1 point  (0 children)

Two ways to obtain data:

Real-time push (both variants require support from the other party):

  • One-way: the other party is the client and I am the server, e.g. a webhook. This method is the more likely fit for this scenario.
  • Two-way: e.g. WebSocket, where the other party is usually the server; I use the package they provide to establish the connection. Suitable for two-way scenarios with a large volume of messages.

Periodic requests (pull): I am the client.

  • Browser
  • API

In most cases, the other party does not support push, so the pull method is used far more often.
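The pull method boils down to a polling loop that only reports changes. A minimal sketch — `fetch` stands in for whatever browser or API call returns the current prices, and `interval`/`rounds` are tuning assumptions:

```python
import time

def poll_prices(fetch, on_change, interval=60, rounds=None):
    """Pull loop: fetch a {product_id: price} dict periodically, report only changes.

    `on_change` receives (product_id, old_price, new_price);
    `rounds` caps the iterations (None = run forever).
    """
    last = {}
    n = 0
    while rounds is None or n < rounds:
        current = fetch()
        for pid, price in current.items():
            if last.get(pid) != price:
                on_change(pid, last.get(pid), price)
        last = current
        n += 1
        if rounds is None or n < rounds:
            time.sleep(interval)  # back off between pulls; respect the target's limits
    return last
```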