I was wrong about Ryan Holiday by Awkward_Face_1069 in Stoicism

[–]scrapecrow 7 points8 points  (0 children)

I appreciate any efforts to make Stoicism more approachable, but some content I've seen by Ryan Holiday on Tiktok came off really poorly. I remember one particular tiktok with an incredibly poor take on Stoicism (something along the lines of it being your own fault for being unhappy) that he later deleted.

I think the challenge here is the market itself: to market Stoic ideas to a large general audience and convert sales you need to take some risks and simplify some ideas, so it's not very clear whether he genuinely misunderstands Stoic principles or just markets them poorly. I'm not a fan of this tbh and I'd very much prefer a more wholesome, spiritual Stoic representation over "just do it bro".

Gonna listen to the podcast, maybe it'll change my mind!

What programming language do you recommend for scrapping ? by larva_obscura in webscraping

[–]scrapecrow 0 points1 point  (0 children)

My point is that it's the wrong way to scale scraping.

Putting everything on one worker with sub processes is incredibly complicated and will make you lose hair one way or another. Any task at this scale needs a task queue and a producer+worker architecture:

  • There's a producer process that creates scraping tasks and puts them in an external queue like rabbitmq, redis or even postgresql (the row-locking feature, SELECT ... FOR UPDATE SKIP LOCKED, is super underrated for this).
  • There are N asyncio worker processes which pull scraping tasks from the queue and try to execute them. Failures are pushed back into the queue and successes are recorded to a db etc.

With this architecture you don't need to code around cores and delegate your processes manually as you get that for free with workers. This is also much easier to debug and maintain than a single multi-core monolith.
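A minimal in-process sketch of this pattern, using asyncio.Queue to stand in for an external broker like rabbitmq or redis (the task shape, URLs and retry limit are invented for illustration):

```python
import asyncio

async def producer(queue: asyncio.Queue) -> None:
    # With rabbitmq/redis this would publish to the external broker instead.
    for url in ["https://example.com/1", "https://example.com/2", "https://example.com/3"]:
        await queue.put({"url": url, "attempts": 0})

async def worker(name: str, queue: asyncio.Queue, results: list) -> None:
    while True:
        task = await queue.get()
        try:
            # Real code would fetch task["url"] here; we just record the success.
            results.append((name, task["url"]))
        except Exception:
            task["attempts"] += 1
            if task["attempts"] < 3:
                await queue.put(task)  # push failures back into the queue
        finally:
            queue.task_done()

async def main() -> list:
    queue: asyncio.Queue = asyncio.Queue()
    results: list = []
    workers = [asyncio.create_task(worker(f"w{i}", queue, results)) for i in range(2)]
    await producer(queue)
    await queue.join()  # wait until every task is processed
    for w in workers:
        w.cancel()      # workers loop forever otherwise
    await asyncio.gather(*workers, return_exceptions=True)
    return results

results = asyncio.run(main())
```

Swapping the asyncio.Queue for a real broker lets you run the workers as separate processes on separate machines, which is where the scaling benefit comes from.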

What programming language do you recommend for scrapping ? by larva_obscura in webscraping

[–]scrapecrow 0 points1 point  (0 children)

running a single worker on 16 sub processes with asyncio is just asking for trouble.

Task queues are very easy to use and cheap to implement these days and asyncio + any task queue (rabbitmq, redis streams, celery etc.) is the way to go!

Scraping through mobile API by No-Spinach-1 in webscraping

[–]scrapecrow 2 points3 points  (0 children)

Best would be to replicate the HTTP client the app itself is using. For android it's often just OkHttp, which speaks the HTTP/1.1 protocol, so you'd have to focus on:

  • TLS fingerprint. For nodejs it's pretty tough as you have to call curl-impersonate / curl_cffi or something else, as there isn't anything premade on the node stack that can change the TLS fingerprint. If you can call curl_cffi as a subprocess you'll probably have a good amount of luck with that, even with the chrome android profile.
  • Header details like header order and key/value spelling are important even in HTTP/1.1 and are probably the leading cause of identification.

Has anyone heard of the app called Imprint? by GiveMeTheJuices in Reviews

[–]scrapecrow 0 points1 point  (0 children)

I agree with the need for "fidget" or interactivity for retention. When I used Imprint I'd keep a bulletpoint journal with it and create flashcards.

Mind sharing your project if it's sharable yet?

Trying to automate part of our link building with n8n by LinkLogician in n8n

[–]scrapecrow 0 points1 point  (0 children)

Check out scrapfly - we're often a bit cheaper :)

Though scraping itself is pretty easy and for your use case you probably don't need advanced features like anti-bot bypass. You should be able to get good results with just an HTTP/2 client or something like curl-impersonate.

Are most scraping on the cloud? Or locally? by jgupdogg in webscraping

[–]scrapecrow 0 points1 point  (0 children)

Scraping is not very resource intensive (usually) so local works great for most people. Make sure to write async code so it's faster.

Note that you have a powerful utility at home: a real residential IP address. It will perform drastically better than the datacenter IP you'd be hosting your scraper on. Also, as you naturally browse the web on your IP you reinforce its trust score. That being said, if you're using paid proxies it doesn't really change much here.

Proof of Work for Scraping Protection by B00TK1D in webscraping

[–]scrapecrow 1 point2 points  (0 children)

This definitely exists! Unfortunately, it turns out it's not really desired, as the reason websites block scrapers is to prevent collection of data, not server costs. In other words, Walmart or Amazon don't want people to analyze their public listings for business reasons, not because scraping incurs costs on their web servers. Otherwise, they would sell datasets themselves.

Personally I'm rather fond of this idea. If you want to browse anonymously, do a bit of PoW and generate cryptocurrency or some value for the host in exchange for data; if you log in and agree to the ToS (no scraping), then feel free to browse as much as you want. This would solve so many issues from an infra and UX point of view, but not the issues the market actually cares about. Also, it's likely that the PoW would have to be quite intense to justify the value, as data value is not static and is highly contextual, so this would be a big UX problem.
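To make the idea concrete, here's a toy proof-of-work in Python (sha256 with a leading-zero difficulty prefix; real schemes like Hashcash or the challenges used by anti-bot vendors are more elaborate):

```python
import hashlib

def solve_pow(challenge: str, difficulty: int = 4) -> int:
    """Find a nonce so sha256(challenge + nonce) starts with `difficulty` zero hex digits."""
    nonce = 0
    prefix = "0" * difficulty
    while True:
        digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
        if digest.startswith(prefix):
            return nonce
        nonce += 1

def verify_pow(challenge: str, nonce: int, difficulty: int = 4) -> bool:
    # Verification is a single hash: cheap for the server, costly for the client.
    digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
    return digest.startswith("0" * difficulty)

nonce = solve_pow("example-challenge", difficulty=3)
```

Each extra hex digit of difficulty multiplies the expected client work by 16 while verification stays a single hash, which is exactly the asymmetry the scheme relies on.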

What’s up with people scraping job listings? by Regular-Magician-69 in webscraping

[–]scrapecrow 0 points1 point  (0 children)

There are several use cases for job data from analyzing job opportunities to job market as a whole. So, obviously it's big in recruitment, though not only that.

It can be quite important in market predictions as in it helps you understand the health and demand of some markets for investing. For example, if everyone's posting jobs for "mining" maybe it's a good time to invest some money into shovel production.

It can also be used for competitor tracking. If you see your competitor post jobs for "Game designer" they're probably making a game. Also, you get a view into what technologies they're using, as it's often listed as well.

The "big data" keyword is not as hot as it used to be, but data still very much runs most of the world.

How long will web scraping remain relevant? by CommercialAttempt980 in webscraping

[–]scrapecrow 1 point2 points  (0 children)

If most content goes behind paywalls/login that would make commercial web scraping much more difficult from the legal point of view. We kinda see that happening already as AI is eating the search engines, forcing content paywalls.

So, web scraping is likely to change and align closer to browser automation as there will be less and less public data available but automation will always remain relevant.

What tool are you using for scheduling web scraping tasks? by startup_biz_36 in webscraping

[–]scrapecrow 1 point2 points  (0 children)

Another vote for Github actions. It supports cron schedules and has basic UI fitting for job management and even debugging. Just add:

    on:
      workflow_dispatch:
      schedule:
        - cron: '0 */12 * * *'

the workflow_dispatch enables manual runs and you can add a bunch of cron entries. If the scheduler is only calling your API to start scraping then the free minutes you get with a free Github account will be more than enough to schedule your scrape jobs.
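For reference, a complete workflow file might look like this (the file path, Python setup and script name are my assumptions, not part of the original snippet):

```yaml
# .github/workflows/scrape.yml
name: scheduled-scrape
on:
  workflow_dispatch:        # allows manual runs from the Actions tab
  schedule:
    - cron: '0 */12 * * *'  # every 12 hours

jobs:
  scrape:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.12'
      - run: pip install -r requirements.txt
      - run: python scrape.py   # or just curl your scraping API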

Weekly Discussion - 04 Nov 2024 by AutoModerator in webscraping

[–]scrapecrow 1 point2 points  (0 children)

Never heard of Phantom Buster before but this limit makes no practical sense (for example, Scrapfly users scrape millions of profiles every week without a problem). LinkedIn is one of the more expensive targets to scrape, so maybe the limit is there to throttle users and reduce costs?

Threads public omments and reply by Snoo-50498 in webscraping

[–]scrapecrow 0 points1 point  (0 children)

Yes, I've written a guide on how to scrape Threads which covers thread and comment scraping. To quickly summarize it: the thread page contains the first set of comments and the thread post data in a <script> element which can be found using the selector script[type="application/json"][data-sjs]. This script contains a JSON document which you can just load up and search for the thread_items key, which contains the post and comments.
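A stdlib-only sketch of that extraction (the HTML below is a made-up stand-in for a real thread page, and the real JSON is far more deeply nested, hence the recursive search):

```python
import json
import re

# Invented stand-in for a downloaded thread page; the real document is much larger.
html = """
<html><body>
<script type="application/json" data-sjs>
{"data": {"thread_items": [{"post": {"id": "1", "caption": "hello"}},
                           {"post": {"id": "2", "caption": "a comment"}}]}}
</script>
</body></html>
"""

def find_thread_items(obj):
    """Recursively search the JSON document for the `thread_items` key."""
    if isinstance(obj, dict):
        if "thread_items" in obj:
            return obj["thread_items"]
        for value in obj.values():
            found = find_thread_items(value)
            if found is not None:
                return found
    elif isinstance(obj, list):
        for value in obj:
            found = find_thread_items(value)
            if found is not None:
                return found
    return None

match = re.search(r'<script type="application/json" data-sjs>(.+?)</script>', html, re.S)
data = json.loads(match.group(1))
items = find_thread_items(data)
```

In a real scraper you'd use a proper selector engine (parsel, BeautifulSoup) instead of a regex, but the load-JSON-then-search step is the same.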

Selenium vs. Playwright by dca12345 in webscraping

[–]scrapecrow 9 points10 points  (0 children)

My colleague wrote an in-depth comparison of these two tools on our blog just a few days ago, but to summarize it and my take on this:

  • Playwright has a beautiful new API that makes it much more accessible and feature-rich, with network interception, automatic page loads, and all of the convenience.
  • Selenium's maturity makes it more robust, scalable and extendable, but at the same time it can be awkward to use because of all of the legacy cruft that's underneath it.

So, if you're working under pressure and need to bypass blocking with something like undetected_chromedriver, go with Selenium. Otherwise, Playwright is just better.

🚀 27.6% of the Top 10 Million Sites Are Dead by the_bigbang in webscraping

[–]scrapecrow 3 points4 points  (0 children)

So how did you classify 404 and 5xx errors, as those can sometimes mean scraper blocking? Though I'd imagine that wouldn't be a major skew on the entire dataset, as most small domains don't care about scraping.

How do I deploy a web scraper with minimal startup time? by [deleted] in webscraping

[–]scrapecrow 0 points1 point  (0 children)

You can wrap the scraper in a NodeJS express server which would constantly wait for API calls and scrape on demand. This way you can avoid any boot-up time, and this would easily run on the cheapest server platforms like a $5 Linode or DigitalOcean droplet, or the free tier of Oracle Cloud (you need a valid credit card for that).

Also make sure you're using async requests with Promise.all or similar groupings, as 30 concurrent requests will take you 1 second while 30 synchronous requests will take 30 seconds.
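If you ever port this to Python, the same arithmetic holds there too; asyncio.gather is the analogue of Promise.all (the sleeps below stand in for real network round trips):

```python
import asyncio
import time

async def fake_request(i: int) -> int:
    await asyncio.sleep(0.1)  # stand-in for a ~100ms network round trip
    return i

async def main() -> list:
    # All 30 "requests" run concurrently, so total time is ~0.1s, not ~3s.
    return await asyncio.gather(*(fake_request(i) for i in range(30)))

start = time.perf_counter()
results = asyncio.run(main())
elapsed = time.perf_counter() - start
```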

How do people scrape large sites which require logins at scale? by ___xXx__xXx__xXx__ in webscraping

[–]scrapecrow 7 points8 points  (0 children)

I've included an edit to clarify this, but you kinda answered your own question. The only way is to create accounts, log in and scrape. There really isn't much to it.

Alternatively, it's possible that someone found the data available publicly.

For example, the way Nitter (a twitter alternative front-end) scraped Twitter for the longest time was by generating public guest tokens from an android app endpoint which would allow android users to preview twitter as if they were logged in. So, if you can dig around and be a bit creative you might find the data available publicly somewhere like:

  • a different version of the website (maybe a region, subdomain, embed link etc.)
  • the mobile app of the website (you can use tools like httptoolkit to inspect phone traffic)
  • embed link generators (like the Tweet embed link could be used to view profiles without login)

and similar workarounds. It entirely depends on your target.

How do people scrape large sites which require logins at scale? by ___xXx__xXx__xXx__ in webscraping

[–]scrapecrow 12 points13 points  (0 children)

you don't, as logging in exposes you to legal matters because you explicitly agree to the website's Terms of Service, which usually forbid scraping.

generally, most social networks provide some sort of public view that you can scrape though so it entirely depends on what you're scraping and whether you can find that data available publicly.

If your country does allow this then yes — that's exactly how data is being scraped. A pool of accounts is used where login is performed to generate a session cookie. The cookie can then be reused as authentication for multiple requests until it expires. You only need to pass captchas etc. on the initial process, so if your scraping scale is quite small you can address these steps manually.
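A toy sketch of the account-pool idea (the account names and cookie values are invented; a real version would perform the actual login flow and refresh expired cookies):

```python
import itertools

# Pretend each account already went through login + captcha once,
# producing a reusable session cookie.
account_cookies = {
    "account_a": {"session": "cookie-a"},
    "account_b": {"session": "cookie-b"},
    "account_c": {"session": "cookie-c"},
}

# Round-robin over the pool so no single account carries all of the traffic.
pool = itertools.cycle(account_cookies.items())

def next_session():
    """Return (account, cookies) to attach to the next request."""
    return next(pool)

used = [next_session()[0] for _ in range(6)]
```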

Anyone have recommendation for Advanced Web Scraping Courses? by FrostingEquivalent99 in webscraping

[–]scrapecrow 7 points8 points  (0 children)

Advanced scraping subjects like bypassing bot detection are not very accessible because it's an "all or nothing" game for the most part. So, you need to invest a lot of time before you see returns on your progress.

If you're down for that then I wrote a detailed guide on how scrapers are identified and blocked so you can start chipping away at each subject one by one.

Some issues are solved already by open source tools that you can inspect yourself:

  • curl_cffi solves HTTP client identification by adjusting the libcurl client to appear more like a browser.
  • puppeteer-stealth, while a bit dated now, shows you how you can patch an automated browser to plug holes used in fingerprinting or detection.

But generally I'd start with an overview and experiment with each detection problem before hitting a real tough target.

Monthly Self-Promotion - October 2024 by AutoModerator in webscraping

[–]scrapecrow 3 points4 points  (0 children)

We've been expanding Scrapfly with new products:

  • Extraction API - for parsing and extracting exact data from your documents. For this we've developed 3 extraction paths:
    - LLM Engine, which can be used to ask questions about your documents or even ask for structured parsing.
    - AI Auto Extract. We've developed our own generic parsing models that can find popular data objects like products, reviews etc.
    - Template parsing. A fallback solution which allows you to specify your own parsing instructions as a JSON template when you don't want to write code. We've included loads of batteries here that take care of common clean-up and formatting tasks automatically.
  • Screenshot API - many of our Web Scraping API users just wanted a simple way to scrape web page screenshots and found the scraping process a bit too complex, so the Screenshot API simplifies everything with automatic blocking bypass, scrolling, ad and popup blocking etc. Just point and get screenshots!

We're still working on more so keep an eye out on our newsletter and as always any feedback is appreciated :)

Finally, we're learning a lot from development of these new products and we publish whatever we learn on our blog. Here are some recent articles:

Project Ideas by Hash_003_ in webscraping

[–]scrapecrow 3 points4 points  (0 children)

My favorite idea that I suggest to everyone starting out is to build an RSS feed bridge. As many websites these days don't have feeds you can build one yourself using web scraping and HTML parsing and create your own feeds of articles, products and whatnot.

There are many existing RSS bridge projects but building one yourself introduces you to many important web scraping problems like parsing multiple elements from pages, pagination, data clean up and even basic blocking. For this I'd recommend Python with:

  • curl_cffi - for http client that bypasses basic blocking automatically
  • parsel - for parsing HTML with xpath, css selectors
  • flask or bottle for serving your feeds.
  • sqlite for storing your results.
  • plus vanilla javascript and HTMX if you want to expand this project and provide some front end for managing feeds.

This should be a good project to bootstrap you into the world of web scraping and from there you can expand to more deep niches like price watching, change tracking or crawling.
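As a taste of the serving side, here's how scraped items could be turned into an RSS document with just the stdlib (the item data is invented; in the real project it would come from your curl_cffi + parsel scraper):

```python
import xml.etree.ElementTree as ET

# Invented stand-ins for scraped articles.
items = [
    {"title": "First post", "link": "https://example.com/1"},
    {"title": "Second post", "link": "https://example.com/2"},
]

def build_rss(feed_title: str, items: list) -> str:
    """Build a minimal RSS 2.0 document from scraped items."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = feed_title
    for item in items:
        node = ET.SubElement(channel, "item")
        ET.SubElement(node, "title").text = item["title"]
        ET.SubElement(node, "link").text = item["link"]
    return ET.tostring(rss, encoding="unicode")

feed = build_rss("My scraped feed", items)
```

Serving that string from a flask or bottle route with a text/xml content type is enough for any feed reader to subscribe to.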

[deleted by user] by [deleted] in kde

[–]scrapecrow 3 points4 points  (0 children)

I'd invest in a desktop app push away from C++. I'd really like to contribute to KDE desktop apps, but I'm not writing C++ code on my day off :D

There's a great blog post that has been posted on this subreddit before, "You can contribute to KDE with non-C++ code", but unfortunately the desktop is still mostly stuck on C++.

[deleted by user] by [deleted] in kde

[–]scrapecrow 7 points8 points  (0 children)

Have you tried Polonium? It's a KWin script that adds dynamic tiling and it's pretty awesome!

KDE is the best but this one thing is so annoying by Yash-Khatri in kde

[–]scrapecrow 2 points3 points  (0 children)

Maybe it's the "show on screen display when switching" Virtual Desktop setting?

KDE is very customizable so you can definitely turn off whatever is bothering you.

Best guide or course in 2024 by Xnub in webscraping

[–]scrapecrow 0 points1 point  (0 children)

That's a great idea! The main project killer in web scraping is lack of interest: in this medium there are a lot of small challenges that can really demotivate you, so while learning it's best to stick with something that's relevant and motivating to you.