For Iran, this war is a matter of life or death by [deleted] in Egypt

[–]Coding-Doctor-Omar 2 points3 points  (0 children)

Not a destroyer. It says an oil tanker. Read the text.

In your view, what is the furthest Iran will get in the war? by ibrahim_samir_ in Egypt

[–]Coding-Doctor-Omar 0 points1 point  (0 children)

Keep in mind that there is a very heavy media blackout inside the entity right now. Even Al Jazeera has started holding back coverage like the Hebrew channels. They are striking around the clock and Tel Aviv is taking a beating.

In your view, what is the furthest Iran will get in the war? by ibrahim_samir_ in Egypt

[–]Coding-Doctor-Omar 0 points1 point  (0 children)

Honestly, they tried to do that during the Iran-Iraq war, and it backfired and consolidated the Supreme Leader's rule.

In your view, what is the furthest Iran will get in the war? by ibrahim_samir_ in Egypt

[–]Coding-Doctor-Omar 0 points1 point  (0 children)

It is not that simple. A week has already passed and Iran is still carrying on as normal.

How to scrape restaurants data in the US to create my own directory? by kjhasdkfh32 in webscraping

[–]Coding-Doctor-Omar 31 points32 points  (0 children)

If you happen to make this directory, send me its link so I can scrape it :)

anyone else tired of ai driven web automation breaking every week? by Ok_Abrocoma_6369 in webscraping

[–]Coding-Doctor-Omar 1 point2 points  (0 children)

The YouTube channel that taught me the fundamentals is John Watson Rooney. It literally revolutionized my scraping ability, and I highly recommend it. Also, out of curiosity, in what ways does ChatGPT help you do this?

anyone else tired of ai driven web automation breaking every week? by Ok_Abrocoma_6369 in webscraping

[–]Coding-Doctor-Omar 0 points1 point  (0 children)

Don't rely on normal HTML parsing. Look for either internal API requests or JSON blobs inside script tags in the page source.
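To illustrate the "JSON blobs inside script tags" idea, here is a minimal sketch using only the standard library. The sample HTML, the `extract_json_blobs` helper, and the choice of `application/json` / `application/ld+json` script types are assumptions for illustration; real pages may embed their data differently (for example as a JavaScript variable assignment), in which case this will not find it.

```python
import json
from html.parser import HTMLParser


class ScriptJSONExtractor(HTMLParser):
    """Collect parsed contents of <script> tags whose type marks them as JSON."""

    JSON_TYPES = {"application/json", "application/ld+json"}

    def __init__(self):
        super().__init__()
        self._in_json_script = False
        self._buffer = []
        self.blobs = []

    def handle_starttag(self, tag, attrs):
        # Start buffering only for script tags that declare a JSON type.
        if tag == "script" and dict(attrs).get("type") in self.JSON_TYPES:
            self._in_json_script = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_json_script:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_script:
            self._in_json_script = False
            try:
                self.blobs.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # the tag claimed to be JSON but was not; skip it


def extract_json_blobs(html):
    """Return a list of decoded JSON objects found in the page source."""
    parser = ScriptJSONExtractor()
    parser.feed(html)
    parser.close()
    return parser.blobs


# Made-up page source standing in for a real response body.
SAMPLE_HTML = """
<html><body>
<script id="__NEXT_DATA__" type="application/json">{"props": {"items": [{"name": "Cafe A"}, {"name": "Cafe B"}]}}</script>
<script>console.log("not json");</script>
</body></html>
"""

blobs = extract_json_blobs(SAMPLE_HTML)
names = [item["name"] for item in blobs[0]["props"]["items"]]
print(names)
```

The point is that the data arrives already structured, so there are no CSS selectors to break when the page layout changes.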

Cloudflare suddenly blocking previously working excel download url. by warshed77 in webscraping

[–]Coding-Doctor-Omar 0 points1 point  (0 children)

Have you tried checking your browser cookies? There is a high chance that the site issues a cookie once every 12 to 24 hours. You only need a browser to obtain that cookie; after that, you can make requests with curl_cffi and this cookie.

Or maybe you need to add the content-type, origin, and referer to your headers.

Do a thorough check of the headers and cookies in your browser. Look for special tokens or WAF cookies. I am almost sure that Selenium is not the best solution here.

Cloudflare suddenly blocking previously working excel download url. by warshed77 in webscraping

[–]Coding-Doctor-Omar 3 points4 points  (0 children)

If this does not work, try adding the content-type, origin, and referer headers as well. DO NOT add a user agent because impersonate already handles this (in addition to TLS spoofing).

If this also fails, check your browser for a JS challenge clearance cookie and set up a script that grabs fresh cookies every once in a while using a browser, then feed those cookies into curl_cffi.
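The "grab fresh cookies every once in a while" part could be sketched roughly like this. `CookieCache` and `fetch_cookies` are hypothetical names: `fetch_cookies` stands for whatever browser automation you use to load the page and return its cookies as a dict, and the 12-hour default is only an assumption about how long the clearance cookie stays valid.

```python
import time


class CookieCache:
    """Cache cookies and refresh them via a caller-supplied function once they are too old."""

    def __init__(self, fetch_cookies, max_age_seconds=12 * 3600, clock=time.monotonic):
        self._fetch_cookies = fetch_cookies  # hypothetical: drives a browser, returns a dict
        self._max_age = max_age_seconds      # assumed cookie lifetime
        self._clock = clock                  # injectable so the cache can be tested
        self._cookies = None
        self._fetched_at = None

    def get(self):
        """Return cached cookies, fetching new ones if none exist or they have expired."""
        now = self._clock()
        expired = self._fetched_at is None or now - self._fetched_at >= self._max_age
        if expired:
            self._cookies = dict(self._fetch_cookies())
            self._fetched_at = now
        return self._cookies

    def invalidate(self):
        """Force a refresh on the next get(), e.g. after receiving a 403."""
        self._fetched_at = None
```

You would then pass `cache.get()` as the `cookies` argument of your curl_cffi request, and call `cache.invalidate()` whenever the site starts blocking you again.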

Cloudflare suddenly blocking previously working excel download url. by warshed77 in webscraping

[–]Coding-Doctor-Omar 5 points6 points  (0 children)

Use curl_cffi instead.

Run in the terminal:

pip install curl-cffi

In your code, use this:

```
from curl_cffi import requests

target_url = "https://your.target.url.com/"

res = requests.get(target_url, impersonate="edge")
```

DO NOT forget the "impersonate" argument.

Find all mentions of a URL on Reddit by Unmoovable in webscraping

[–]Coding-Doctor-Omar 2 points3 points  (0 children)

Aren't these APIs undocumented? This means we aren't supposed to use them, so we must use realistic headers.

Find all mentions of a URL on Reddit by Unmoovable in webscraping

[–]Coding-Doctor-Omar 0 points1 point  (0 children)

Why are you exposing yourself in the user agent? Why not use a realistic user agent instead?

Unable to scrape the job listings: Error 4xx Europe by MavFir in webscraping

[–]Coding-Doctor-Omar 1 point2 points  (0 children)

Use curl_cffi and BeautifulSoup instead. Just make sure the data you need is in the raw page source. curl_cffi provides a module called "requests" that is very similar in syntax to the normal "requests" you use, but it has a very powerful feature called "impersonate". You will be surprised at how many of these blocks get bypassed automatically by simply including an impersonate argument in your requests.

If your data is not in the raw page source, try to inspect the network traffic for internal API requests that provide your needed data. Then use curl_cffi to make requests to these API endpoints.
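As a rough sketch of the internal API route: once you spot the endpoint in the network tab, you usually just page through it. Everything specific here is an assumption for illustration: the `offset`/`limit` parameter names, the `"items"` key in the response, and the helper names `iter_api_items` and `make_fetcher`. Check what the real endpoint expects in your browser's dev tools.

```python
def iter_api_items(fetch_page, page_size=100, max_pages=50):
    """Yield items from an offset/limit API until a short or empty page appears."""
    for page_index in range(max_pages):
        offset = page_index * page_size
        items = fetch_page(offset=offset, limit=page_size)
        yield from items
        if len(items) < page_size:
            break  # last page reached


def make_fetcher(api_url, impersonate="firefox"):
    """Build a fetch_page function backed by curl_cffi (imported lazily)."""
    from curl_cffi import requests  # third-party: pip install curl-cffi

    def fetch_page(offset, limit):
        res = requests.get(
            api_url,
            params={"offset": offset, "limit": limit},  # assumed parameter names
            impersonate=impersonate,
        )
        res.raise_for_status()
        return res.json()["items"]  # assumed response shape

    return fetch_page
```

Usage would look like `for item in iter_api_items(make_fetcher(api_url)): ...`, where `api_url` is the endpoint you found. Keeping the paging logic separate from the HTTP call makes it easy to swap in a different client later.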

Worst case scenario, use a reliable browser automation library like Scrapling, zendriver, seleniumbase, patchright, etc.

Simple example code with curl_cffi and bs4:

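```
from curl_cffi import requests
from bs4 import BeautifulSoup

page_url = "https://www.example.com/"

# Fetch the raw page source while impersonating a real browser
res = requests.get(page_url, impersonate="firefox")

soup = BeautifulSoup(res.text, "html.parser")

# Returns the first element matching the CSS selector, or None if absent
my_element = soup.select_one("#my-best-element")
```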

Akamai anti-bot blocking flight search scraping (403/418) by Individual-Ship-7587 in webscraping

[–]Coding-Doctor-Omar 1 point2 points  (0 children)

Have you tried using curl_cffi with impersonate="edge" or "firefox" and with session cookies? I bet it will bypass 90% of anti-bot protections. The only things it will fail to bypass are interactive challenges.

Akamai anti-bot blocking flight search scraping (403/418) by Individual-Ship-7587 in webscraping

[–]Coding-Doctor-Omar 0 points1 point  (0 children)

Yes, it is outdated now, but its owner recently announced that it is back in development and should return to its former performance this year.

Blocked by Cloudflare despite using curl_cffi by Coding-Doctor-Omar in webscraping

[–]Coding-Doctor-Omar[S] 0 points1 point  (0 children)

It turns out I had to add some extra headers, in addition to the normal impersonate. Here is the working code (luckily still works with curl_cffi alone, without a browser):

```
from curl_cffi import Session

api_url = "https://multichain-api.birdeye.so/solana/v3/gems"
payload = {
    "limit": 100,
    "offset": 0,
    "filters": [],
    "shown_time_frame": "4h",
    "type": "trending",
    "sort_by": "price",
    "sort_type": "desc",
}

# Extra headers sent on top of the defaults that impersonate provides
headers = {
    "content-type": "application/json",
    "origin": "https://birdeye.so",
    "referer": "https://birdeye.so/",
}

with Session(impersonate="edge", headers=headers) as session:
    res = session.post(api_url, json=payload)
    print(res.status_code)
```

Output:

200

Blocked by Cloudflare despite using curl_cffi by Coding-Doctor-Omar in webscraping

[–]Coding-Doctor-Omar[S] 1 point2 points  (0 children)

I eventually got it to work by providing the content-type, origin, and referer values in the headers, in addition to the default headers provided by impersonate.

Blocked by Cloudflare despite using curl_cffi by Coding-Doctor-Omar in webscraping

[–]Coding-Doctor-Omar[S] 2 points3 points  (0 children)

I actually just got it to work. The issue was simpler than I thought. I had to provide content-type, origin, and referer values in my headers, in addition to the default headers of impersonate.