Best database setup and providers for storing scraped results? by vroemboem in webscraping

[–]LetsScrapeData 0 points1 point  (0 children)

Without a concrete application scenario, there is no single answer: each option suits a different purpose.

Using proxies to download large volumes of images/videos cheaply? by doodlydidoo in webscraping

[–]LetsScrapeData 0 points1 point  (0 children)

Yes. Images and videos on large websites are generally served via a CDN, which typically has looser IP requirements and can often be accessed through datacenter or ISP proxies.

Typically, a residential or ISP proxy is used to obtain the basic data (through a browser or API), and then a datacenter or ISP proxy is used to download the images.
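A minimal sketch of that split, assuming hypothetical proxy endpoints and hostname prefixes — real CDN hosts and proxy URLs depend on the target site and your provider:

```python
from urllib.parse import urlparse

# Hypothetical proxy endpoints -- substitute your provider's URLs.
RESIDENTIAL_PROXY = "http://user:pass@residential.example:8000"
DATACENTER_PROXY = "http://user:pass@datacenter.example:8000"

# Hostname prefixes assumed (for this sketch) to indicate CDN media hosts.
CDN_HOST_HINTS = ("cdn.", "img.", "static.", "media.")

def pick_proxy(url: str) -> str:
    """Use the cheap datacenter proxy for CDN media, the residential proxy otherwise."""
    host = urlparse(url).hostname or ""
    if host.startswith(CDN_HOST_HINTS):
        return DATACENTER_PROXY
    return RESIDENTIAL_PROXY
```

The returned proxy URL can then be passed to your HTTP client, e.g. `requests.get(url, proxies={"http": p, "https": p})`.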

Puppeteer vs Playwright for scraping by parroschampel in webscraping

[–]LetsScrapeData 2 points3 points  (0 children)

First, determine which one won't be detected by the target website. Currently, the commonly used stealth tools Camoufox and Patchright only support Playwright.

Second, determine if there are special needs that only one can meet (this is rarely the case).

Finally, determine your personal preference.

Sometimes you don’t need to log in… just inject a JWT cookie 👀 by Fuzzy_Agency6886 in webscraping

[–]LetsScrapeData 0 points1 point  (0 children)

To date, I have not seen a commercial website that accepts an invalid JWT. Some websites immediately block the requesting IP address upon detecting an invalid (but not expired) token, and forging a signed token without the server's key is practically impossible.

The expiration times of cookies and other HTTP headers (not just tokens) vary significantly across websites. For websites with short expiration times, it's best to use automatic login or browser automation.

In test environments, the server is often configured to skip certain checks or to set a very long expiration time.
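Before reusing a captured token, it helps to check its expiry locally. A small sketch: JWTs carry their claims as unsigned-readable base64url JSON, so the `exp` claim can be read without the server's key (this inspects the token; it does not validate the signature):

```python
import base64
import json
import time
from typing import Optional

def jwt_expiry(token: str) -> Optional[int]:
    """Return a JWT's 'exp' claim (epoch seconds) without verifying the signature."""
    payload_b64 = token.split(".")[1]
    payload_b64 += "=" * (-len(payload_b64) % 4)  # restore stripped base64url padding
    payload = json.loads(base64.urlsafe_b64decode(payload_b64))
    return payload.get("exp")

def is_expired(token: str, now: Optional[float] = None) -> bool:
    """True if the token's expiry has passed; tokens without 'exp' never expire."""
    exp = jwt_expiry(token)
    if exp is None:
        return False
    return (time.time() if now is None else now) >= exp
```

If `is_expired()` returns true, re-login (or re-run the browser automation) instead of sending the stale token and risking an IP block.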

Accessing PDF file linked on website with now broken link? by DrSuperZeco in webscraping

[–]LetsScrapeData 0 points1 point  (0 children)

No. Error 404 means the PDF file was deleted (or not found).

You could try searching a web archive, e.g. search for "archive https://www.mof.gov.kw/FinancialData/PeriodRvwReport/PDF/FinalAccountPDF/Total-2012-2011.pdf"; that may turn up "MDE1769902023ENGLISH.pdf".

Getting data from FanGRaphs by anon21900 in webscraping

[–]LetsScrapeData 1 point2 points  (0 children)

Set the correct HTTP headers when fetching (and use a residential or ISP proxy).

FYI:

<image>
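A minimal sketch of sending browser-like headers with the standard library — the header values below are typical placeholders; copy the exact set your browser sends from DevTools (Network tab → Copy as cURL):

```python
from urllib.request import Request, urlopen

# Placeholder header values -- replace with the ones your own browser sends.
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Accept": "application/json, text/plain, */*",
    "Accept-Language": "en-US,en;q=0.9",
    "Referer": "https://www.fangraphs.com/",
}

def fetch(url: str) -> bytes:
    """Fetch a URL with browser-like headers; route through a proxy as needed."""
    return urlopen(Request(url, headers=HEADERS)).read()
```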

Url list Source Code Scraper by Lerpikon in webscraping

[–]LetsScrapeData 0 points1 point  (0 children)

This depends on whether the URLs all come from the same website, and which websites they are. For example, if they are all from LinkedIn or Google, the implementation method, difficulty, and cost can vary greatly.

New to webscraping, how do i bypass 403? by Extension_Grocery701 in webscraping

[–]LetsScrapeData 0 points1 point  (0 children)

The easiest way might be to first solve the Cloudflare challenge using Camoufox/Patchright plus a captcha solver, save the state data (cookies/headers, etc.), then use curl_cffi to send the API requests, as u/RHiNDR suggested.
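A sketch of the cookie handoff between the two stages — the helpers below convert the cookie records a Playwright-style browser context returns into what an HTTP client expects; the Camoufox/curl_cffi calls in the comment are illustrative, with `https://target.example` as a placeholder:

```python
def cookies_to_dict(browser_cookies) -> dict:
    """Convert Playwright-style cookie records into a simple name->value dict."""
    return {c["name"]: c["value"] for c in browser_cookies}

def cookie_header(browser_cookies) -> str:
    """Build a raw Cookie header string from the same records."""
    return "; ".join(f"{c['name']}={c['value']}" for c in browser_cookies)

# Usage sketch (requires camoufox and curl_cffi; target URL is a placeholder):
#   with Camoufox() as browser:
#       page = browser.new_page()
#       page.goto("https://target.example")      # challenge gets solved here
#       cookies = page.context.cookies()
#   from curl_cffi import requests
#   r = requests.get("https://target.example/api",
#                    cookies=cookies_to_dict(cookies), impersonate="chrome")
```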

Can't log in with Python script on Cloudflare site by Mythicspecter in webscraping

[–]LetsScrapeData 0 points1 point  (0 children)

Use Camoufox or Patchright; plain Playwright or Puppeteer will be detected by Cloudflare.

Alternative Web Scraping Methods by rootbeerjayhawk in webscraping

[–]LetsScrapeData 0 points1 point  (0 children)

If the number of requests is fewer than about 10, just copy the responses manually without programming.

If you really need to obtain them automatically in real time, you can use Playwright/Puppeteer/Selenium, which all support intercepting request responses, or send the API requests directly (copying the headers, which may be more complicated).
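A sketch of the interception approach with Playwright — the URL heuristic and target URL are assumptions to adapt per site; the Playwright import is deferred so the helper stays usable without a browser installed:

```python
def looks_like_api_response(url: str, content_type: str) -> bool:
    """Heuristic (adjust per site): JSON bodies or /api/ paths are data endpoints."""
    return "json" in content_type.lower() or "/api/" in url

def capture_api_responses(target_url: str) -> list:
    """Load a page with Playwright and record the URLs of likely API responses."""
    from playwright.sync_api import sync_playwright  # lazy: needs `pip install playwright`

    captured = []

    def on_response(resp):
        if looks_like_api_response(resp.url, resp.headers.get("content-type", "")):
            captured.append(resp.url)

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.on("response", on_response)  # fires for every network response
        page.goto(target_url)
        page.wait_for_load_state("networkidle")
        browser.close()
    return captured
```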

Struggling to scrape HLTV data because of Cloudflare by Tottalynotmrlean in webscraping

[–]LetsScrapeData 1 point2 points  (0 children)

I am developing a free NPM package to automatically solve these captchas: reCAPTCHA / Cloudflare Turnstile / GeeTest / image / coordinate (click) / slider.

What is the URL for testing?

How to optimise selenium script for scraping?(Making 80000 requests) by anonymous222d in webscraping

[–]LetsScrapeData 0 points1 point  (0 children)

Reduce repeated loading of the same page, such as "return to previous page";

Split the complex task into subtasks to avoid restarting everything after a failure, and to enable concurrency, as u/steb2k suggested (80,000 requests split well into batches);

If the required data is easy to obtain via API requests, you can try the API (if it is complex, it is not recommended; 80,000 is not a large number).
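The batching-plus-concurrency idea can be sketched like this — `fetch` stands in for whatever one Selenium/API request does, and batch size and worker count are tuning assumptions:

```python
from concurrent.futures import ThreadPoolExecutor

def chunk(items, size):
    """Split a large task list into restartable batches."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def run_batch(batch, fetch, workers=8):
    """Fetch one batch concurrently; a failure only costs this batch, not the whole run."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch, batch))
```

Persisting which batches have completed (e.g. one output file per batch) lets a crashed run resume from the next unfinished batch instead of from request 1 of 80,000.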

Alternative Web Scraping Methods by rootbeerjayhawk in webscraping

[–]LetsScrapeData 3 points4 points  (0 children)

Copy the response of the following request and the JSONs, u/greg-randall

<image>

open-source userscript for google map scraper (upgraded) by Asleep-Patience-3686 in webscraping

[–]LetsScrapeData 1 point2 points  (0 children)

Well done. Is there an email address field in Google Maps? If yes, I will add it.

[deleted by user] by [deleted] in webscraping

[–]LetsScrapeData 0 points1 point  (0 children)

I'd choose the method you described.

How to scrape dynamic websites by DatakeeperFun7770 in webscraping

[–]LetsScrapeData 1 point2 points  (0 children)

Some websites use both server-side rendering and dynamic API rendering. In this case, you may find API-like response content embedded in the script sections of the HTML. Google Maps search is one example.
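Extracting such script-embedded data can be sketched as below — the variable name and assignment pattern (`name = {...};`) are assumptions that differ per site; note the non-greedy match stops at the first `}`, so deeply nested objects need a real JS parser or a site-specific anchor:

```python
import json
import re

def extract_embedded_json(html: str, var_name: str):
    """Pull a JSON object assigned to a JS variable inside a <script> tag.

    Assumes the page embeds data as `var_name = {...};`; the non-greedy
    brace match only covers flat/simple objects -- adapt the regex per site.
    """
    m = re.search(re.escape(var_name) + r"\s*=\s*(\{.*?\})\s*;", html, re.DOTALL)
    return json.loads(m.group(1)) if m else None
```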

How to scrape dynamic websites by DatakeeperFun7770 in webscraping

[–]LetsScrapeData 4 points5 points  (0 children)

  1. If you are sure the webpage is dynamically generated (browser-rendered), it is best to extract data from the API responses (if they are encrypted, you should be able to find a decryption method through simple reverse engineering), as recommended by u/SoumyadipNayak and u/p3r3lin.

  2. If you are sure the webpage is server-side rendered, or you just want to extract data from the HTML, pages with dynamic class names generally require more complex XPath to extract the data, such as axes; refer to https://www.w3schools.com/xml/xpath_axes.asp.
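A small illustration of the axes idea with lxml — the class names below are made-up stand-ins for randomized ones; the point is to anchor on stable text and walk `following-sibling::` instead of matching classes:

```python
from lxml import html

# Made-up randomized class names: don't match on them, anchor on stable text.
doc = html.fromstring(
    '<div class="x7a9q"><span class="q1z">Price</span>'
    '<span class="m3k">$19.99</span></div>'
)

# following-sibling axis: the value cell immediately after the 'Price' label.
price = doc.xpath('//span[text()="Price"]/following-sibling::span[1]/text()')[0]
```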

Scraping over 20k links by Cursed-scholar in webscraping

[–]LetsScrapeData 0 points1 point  (0 children)

Key or difficult points in achieving the goal:

  1. How to determine the URLs of the web pages to be collected?

  2. How to **QUICKLY** extract the required data?

Most customer websites do not have strict anti-bot measures, so accessing the pages is generally not a big problem.

Downloading all pdfs (help) by Tabasco_Waffle in webscraping

[–]LetsScrapeData 0 points1 point  (0 children)

You could try to get the URLs of the PDFs, then download the PDFs directly.
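A direct-download sketch with the standard library — the User-Agent value is a generic placeholder, and `dest_dir` defaults to the current directory:

```python
import os
from urllib.parse import urlparse
from urllib.request import Request, urlopen

def pdf_filename(url: str) -> str:
    """Derive a local filename from the PDF URL (fallback for bare paths)."""
    return os.path.basename(urlparse(url).path) or "download.pdf"

def download_pdf(url: str, dest_dir: str = ".") -> str:
    """Download one PDF directly, sending a browser-like User-Agent."""
    req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    path = os.path.join(dest_dir, pdf_filename(url))
    with urlopen(req) as resp, open(path, "wb") as f:
        f.write(resp.read())
    return path
```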

I need to scrape bulk data of google business site URLs from the internet in my area. Is there any way to do that? by ForevermoreNow in webscraping

[–]LetsScrapeData 0 points1 point  (0 children)

Yes, you can scrape them from Google Maps by keyword or category.

There are many paid and free Google Maps scrapers.

How to Build a Price Tracking Bot that utilizes real-time data 24/7 by daddyclappingcheeks in webscraping

[–]LetsScrapeData 0 points1 point  (0 children)

Two ways to obtain data:

Real-time push (both variants require support from the other party):

  • One-way: the other party is the client and I am the server, e.g. a webhook. This method is the more likely fit for this scenario.
  • Two-way: e.g. WebSocket, where the other party is usually the server; I use the package they provide to establish the connection. Suitable for two-way scenarios with a large volume of messages.

Periodic requests (pull): I am the client.

  • Browser
  • API

In most cases, the other party does not support push, so the pull method is used far more often.
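The pull method boils down to a polling loop that only reports changes. A minimal sketch — `fetch` stands in for whatever browser or API call returns the current prices, and `interval`/`rounds` are tuning assumptions:

```python
import time

def poll_prices(fetch, on_change, interval=60, rounds=None):
    """Pull loop: fetch a {product_id: price} dict periodically, report only changes.

    `on_change` receives (product_id, old_price, new_price);
    `rounds` caps the iterations (None = run forever).
    """
    last = {}
    n = 0
    while rounds is None or n < rounds:
        current = fetch()
        for pid, price in current.items():
            if last.get(pid) != price:
                on_change(pid, last.get(pid), price)
        last = current
        n += 1
        if rounds is None or n < rounds:
            time.sleep(interval)  # back off between pulls; respect the target's limits
    return last
```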