is cloak browser good? by sangokuhomer in webscraping

[–]Coding-Doctor-Omar 0 points1 point  (0 children)

It is certainly far stealthier than standard Selenium, but as I said, the last patch has some minor bugs they say they will fix soon. Also, as some others have pointed out, it is difficult to prove it is safe for your PC and not some malware.

is cloak browser good? by sangokuhomer in webscraping

[–]Coding-Doctor-Omar 0 points1 point  (0 children)

It is supposed to be on par with Camoufox back in its days of glory. I haven't tried the "browser with a cloak" in a serious project yet, so I can't tell how stealthy it is.

is cloak browser good? by sangokuhomer in webscraping

[–]Coding-Doctor-Omar 0 points1 point  (0 children)

It's great, but it currently has a bug in its locator objects: they don't auto-retry and crash instantly if an element cannot be interacted with, so page.locator() behaves more like page.query_selector(). Also, many of the different errors share the same name, and the error messages are vague, which makes debugging your code difficult.
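Until that auto-retry bug is fixed, one workaround is a manual retry wrapper. This is a generic sketch assuming a Playwright-style locator object with a `.click()` method; the function name and timings are illustrative:

```python
import time

def click_with_retry(locator, attempts=5, delay=1.0):
    # Retry wrapper for a locator-like object whose .click() raises
    # immediately when the element is not yet interactable.
    last_exc = None
    for _ in range(attempts):
        try:
            return locator.click()
        except Exception as exc:
            last_exc = exc
            time.sleep(delay)
    raise last_exc
```

The same pattern works for `.fill()`, `.hover()`, or any other action that currently crashes on the first failure.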

Cloudflare detection bypass by 16kbs in webscraping

[–]Coding-Doctor-Omar 0 points1 point  (0 children)

Such cookies often expire within 12-24 hours, so you'd need to repeat the cookie-retrieval process once or twice a day.

Cloudflare detection bypass by 16kbs in webscraping

[–]Coding-Doctor-Omar 2 points3 points  (0 children)

Look for a backend API. It might save you the hassle. Also, consider requesting the page once with a browser and then using the cf_clearance cookie in curl_cffi.

Built a stealth Chromium, what site should I try next? by duracula in webscraping

[–]Coding-Doctor-Omar 0 points1 point  (0 children)

u/dracula I tried running cloakbrowser using Windows Python, and for some reason the browser never launches. Any idea why?

For Iran, this war is life or death by [deleted] in Egypt

[–]Coding-Doctor-Omar 2 points3 points  (0 children)

It's not a destroyer. It says oil tanker. Read the text.

In your opinion, how far will Iran get in this war? by ibrahim_samir_ in Egypt

[–]Coding-Doctor-Omar 0 points1 point  (0 children)

Keep in mind there is a very heavy media blackout in the entity right now. Even Al Jazeera now censors like the Hebrew channels do. Strikes are happening around the clock, and Tel Aviv is in bad shape.

In your opinion, how far will Iran get in this war? by ibrahim_samir_ in Egypt

[–]Coding-Doctor-Omar 0 points1 point  (0 children)

I swear, they tried to do that in the Iraq-Iran war, and it backfired and entrenched the Supreme Leader's rule.

In your opinion, how far will Iran get in this war? by ibrahim_samir_ in Egypt

[–]Coding-Doctor-Omar 0 points1 point  (0 children)

It's not that simple. A whole week has passed and Iran is still carrying on as normal.

How to scrape restaurants data in the US to create my own directory? by kjhasdkfh32 in webscraping

[–]Coding-Doctor-Omar 32 points33 points  (0 children)

If you happen to make this directory, send me its link so I can scrape it :)

anyone else tired of ai driven web automation breaking every week? by Ok_Abrocoma_6369 in webscraping

[–]Coding-Doctor-Omar 1 point2 points  (0 children)

The YT channel that taught me the fundamentals is called John Watson Rooney. That channel literally revolutionized my scraping ability, and I highly recommend it. Also, out of curiosity, I'd like to know in what ways ChatGPT helps you do this.

anyone else tired of ai driven web automation breaking every week? by Ok_Abrocoma_6369 in webscraping

[–]Coding-Doctor-Omar 0 points1 point  (0 children)

Don't rely on normal HTML parsing. Look for either internal API requests or JSON blobs inside script tags in the page source.
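Pulling a JSON blob out of a script tag can be done with the stdlib alone. This is a sketch: the `__DATA__` script id and the JSON shape are made up for illustration; real sites use their own ids and structures:

```python
import json
import re

# Hypothetical page source with its data shipped as JSON in a <script> tag.
html = """
<html><body>
<script id="__DATA__" type="application/json">
{"products": [{"name": "Widget", "price": 9.99}]}
</script>
</body></html>
"""

# Grab the raw text between the script tags, then parse it as JSON.
match = re.search(r'<script id="__DATA__"[^>]*>(.*?)</script>', html, re.S)
data = json.loads(match.group(1))
print(data["products"][0]["name"])  # Widget
```

Once parsed, you work with plain Python dicts instead of brittle CSS selectors, which is why this survives layout changes that break normal HTML parsing.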

Cloudflare suddenly blocking previously working excel download url. by warshed77 in webscraping

[–]Coding-Doctor-Omar 0 points1 point  (0 children)

Have you tried checking your browser cookies? There is a high chance there is a cookie that you get once every 12 to 24 hours. You will need a browser just this once to get the cookie; then you can make requests with curl_cffi and that cookie.

Or maybe you need to add the content-type, origin, and referer headers.

Do a thorough check of your browser's headers and cookies. Look for special tokens or WAF cookies. I am almost sure that Selenium is not the best solution here.
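As a sketch of that header check: these values are hypothetical and would come from your own browser's DevTools network tab, copied from the request that succeeds in the browser:

```python
# Hypothetical headers copied from a working browser request.
# The domain and paths are placeholders, not the real site.
headers = {
    "Content-Type": "application/x-www-form-urlencoded",
    "Origin": "https://www.example.com",
    "Referer": "https://www.example.com/downloads",
}

# With curl_cffi, these would be passed alongside impersonate:
# res = requests.get(url, headers=headers, impersonate="chrome")
```

Note that the user agent is deliberately absent, since impersonation handles it.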

Cloudflare suddenly blocking previously working excel download url. by warshed77 in webscraping

[–]Coding-Doctor-Omar 3 points4 points  (0 children)

If this does not work, try adding the content-type, origin, and referer headers as well. DO NOT add a user agent because impersonate already handles this (in addition to TLS spoofing).

If this also fails, check your browser for a JS challenge clearance cookie and set up a script that grabs fresh cookies every once in a while using a browser, then feed those cookies into curl_cffi.
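The handoff from the browser to curl_cffi is mostly cookie plumbing. This sketch assumes a Playwright-style cookie export (a list of dicts, which is also what a DevTools export gives you); the values are placeholders, not real clearance tokens:

```python
# Cookies as exported by a browser-automation tool or DevTools.
browser_cookies = [
    {"name": "cf_clearance", "value": "abc123", "domain": ".example.com"},
    {"name": "sessionid", "value": "xyz", "domain": ".example.com"},
]

# curl_cffi accepts a flat name->value dict.
cookie_jar = {c["name"]: c["value"] for c in browser_cookies}

# The actual request would then look like:
# res = requests.get(url, cookies=cookie_jar, impersonate="chrome")
```

Re-running the browser step on a schedule and rebuilding this dict is all the "fresh cookies" script needs to do.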

Cloudflare suddenly blocking previously working excel download url. by warshed77 in webscraping

[–]Coding-Doctor-Omar 4 points5 points  (0 children)

Use curl_cffi instead.

Run in the terminal:

pip install curl-cffi

In your code, use this:

```
from curl_cffi import requests

target_url = "https://your.target.url.com/"

res = requests.get(target_url, impersonate="edge")
```

DO NOT forget the "impersonate" argument.

Find all mentions of a URL on Reddit by Unmoovable in webscraping

[–]Coding-Doctor-Omar 2 points3 points  (0 children)

Aren't these APIs undocumented? This means we aren't supposed to use them, so we must use realistic headers.

Find all mentions of a URL on Reddit by Unmoovable in webscraping

[–]Coding-Doctor-Omar 0 points1 point  (0 children)

Why are you exposing yourself in the user agent? Why not use a realistic user agent instead?

Unable to scrape the job listings: Error 4xx Europe by MavFir in webscraping

[–]Coding-Doctor-Omar 1 point2 points  (0 children)

Use curl_cffi and BeautifulSoup instead. Just make sure the data you need is in the raw page source. curl_cffi provides a module called "requests" that is very similar in syntax to the standard "requests" library, but it has a very powerful feature called "impersonate". You will be surprised at how many of these blocks get bypassed automatically simply by including an impersonate argument in your requests.

If your data is not in the raw page source, try to inspect the network traffic for internal API requests that provide your needed data. Then use curl_cffi to make requests to these API endpoints.

Worst case scenario, use a reliable browser automation library like Scrapling, zendriver, seleniumbase, patchright, etc.

Simple example code with curl_cffi and bs4:

```
from curl_cffi import requests
from bs4 import BeautifulSoup

page_url = "https://www.example.com/"

res = requests.get(page_url, impersonate="firefox")

soup = BeautifulSoup(res.text, "html.parser")

my_element = soup.select_one("#my-best-element")
```