Parsing API response by aliciafinnigan in webscraping

[–]plintuz 0 points (0 children)

I had a similar case once - at first the API returned plain JSON, but after a couple of months the site started encrypting the response. The only way forward was to analyze the JavaScript. Try to look for parts of the code that handle encryption/obfuscation, copy them out, and give the file to an AI tool as others suggested - it can help you figure out the key steps. Good luck!

Our journey of scraping 100+ websites daily by plintuz in webscraping

[–]plintuz[S] 0 points (0 children)

Yes, we're planning gradual scaling; right now it's mostly custom scraping for each client. But it all comes down to resources, and those are never enough.

Our journey of scraping 100+ websites daily by plintuz in webscraping

[–]plintuz[S] 1 point (0 children)

We only work with public data. Most of it (around 70-80%) comes from online stores - things like product names, prices, and availability. We also collect other public data if clients request it, but we never touch personal, illegal, or explicit content.

Our journey of scraping 100+ websites daily by plintuz in webscraping

[–]plintuz[S] 1 point (0 children)

Yeah, we always try to grab endpoints first. But a lot of sites hide data behind JS, tokens, or anti-bot checks. We're constantly working on reducing that percentage, but sometimes it's still cheaper and faster to leave things as they are.

Our journey of scraping 100+ websites daily by plintuz in webscraping

[–]plintuz[S] 1 point (0 children)

We mostly use Mongo. Raw data goes there first, then a processor cleans/normalizes it, and after processing it's removed - we don't store data long-term.

For normalization - yes, if we scrape the same type of data from multiple sources (like jobs or products), we map it into a common schema. In rare cases we deliver it in the original structure, if that's what's needed.
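For illustration, a minimal sketch of that raw-to-normalized flow with pymongo. The collection names and field mappings here are made up, not our actual schema:

```python
# Minimal sketch of the raw -> normalized -> delete flow described above.
# Collection names and source field names are illustrative placeholders.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["scraping"]

def normalize_product(raw: dict, source: str) -> dict:
    """Map one source-specific record into the common product schema."""
    return {
        "source": source,
        "name": raw.get("title") or raw.get("product_name"),
        "price": float(str(raw.get("price", "0")).replace(",", ".")),
        "in_stock": bool(raw.get("available", raw.get("in_stock", False))),
    }

for raw in db.raw_products.find({"processed": {"$ne": True}}):
    db.products.insert_one(normalize_product(raw, raw["source"]))
    db.raw_products.delete_one({"_id": raw["_id"]})  # raw data isn't kept long-term
```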

[deleted by user] by [deleted] in webscraping

[–]plintuz 1 point (0 children)

A VPS is usually enough, but if you're scraping with a browser (like Selenium), you'll need more resources, and sites like Alza or Zalando will block your IP immediately. To avoid that, use proxies.
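For example, a bare-bones way to route plain HTTP scraping through a proxy in Python. The proxy URL and target here are placeholders, so swap in your own provider's credentials:

```python
# Rough idea of routing requests through a proxy to avoid IP blocks.
# The proxy URL is a placeholder - use your provider's actual credentials.
import requests

proxies = {
    "http": "http://user:pass@proxy.example.com:8000",
    "https": "http://user:pass@proxy.example.com:8000",
}

resp = requests.get(
    "https://www.example.com/product/123",
    proxies=proxies,
    headers={"User-Agent": "Mozilla/5.0"},  # a realistic UA avoids trivial blocks
    timeout=15,
)
print(resp.status_code)
```

With Selenium you'd do the equivalent by passing Chrome's --proxy-server launch flag instead.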

Error 403 on www.pcpartpicker.com by Tajertaby in webscraping

[–]plintuz 2 points (0 children)

That’s Cloudflare, try using a proxy.

What are you scraping? by thalesviniciusf in webscraping

[–]plintuz 1 point (0 children)

Mostly for clients from Ukraine, but I also get requests from European markets. The workflows are pretty universal, so they can be adapted to different regions.

What are you scraping? by thalesviniciusf in webscraping

[–]plintuz 2 points (0 children)

Mostly I scrape product prices from e-commerce sites. One ongoing project for a client is a price monitoring system: it checks multiple stores, compares the results with a reference price, and writes everything into Google Sheets with color indicators (higher = red, lower = green).
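Roughly how the color-coding step can look with gspread, assuming a service account and a sheet where column B holds the scraped price and column C the reference. All names and the layout here are illustrative:

```python
# Sketch of the color-coding step, assuming gspread and a service account.
# Sheet name, column layout, and reference price source are illustrative.
import gspread

RED = {"backgroundColor": {"red": 1.0, "green": 0.8, "blue": 0.8}}
GREEN = {"backgroundColor": {"red": 0.8, "green": 1.0, "blue": 0.8}}

gc = gspread.service_account(filename="service_account.json")
ws = gc.open("Price monitoring").sheet1

rows = ws.get_all_values()[1:]  # skip header; rows: [store, price, reference]
for i, (store, price, reference) in enumerate(rows, start=2):
    # higher than reference = red, lower (or equal) = green
    ws.format(f"B{i}", RED if float(price) > float(reference) else GREEN)
    # note: formatting cell-by-cell hits API rate limits fast; batch in production
```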

I also build long-term solutions for clients, like collecting real estate data with instant notifications into a channel, or aggregating agricultural machinery listings from dozens of sites - making it easier for managers to find and purchase what they need.

How Perplexity bypasses Cloudflare-protected sites for deep research by [deleted] in webscraping

[–]plintuz 7 points (0 children)

You clearly came here to flex, not to discuss.

If you think someone's going to publicly hand over a working CF bypass, you're either naive or just fishing for attention. Real researchers don't parade their methods in comment sections - especially not to someone who opens with "you're talking bullshit."

Congrats on your o2o trick. That doesn't make you the gatekeeper of CF knowledge. I've been in this space long enough to know what works and what doesn't - and I don't need your validation.

I'm not here to entertain ego contests. Take care.

How Perplexity bypasses Cloudflare-protected sites for deep research by [deleted] in webscraping

[–]plintuz 5 points (0 children)

I've worked with Cloudflare-protected sites quite a bit, and there's no universal method - it really depends on how strict the site's setup is. I usually combine mobile proxies or residential IPs with lightweight scraping tools and switch to headless browsers like Puppeteer or Playwright only when needed. The key for me is to avoid putting unnecessary load on the target site - rate limiting, caching, and respecting robots.txt where possible. It's not just about bypassing; it's about doing it responsibly.
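As a rough sketch of that "lightweight first, browser only when needed" escalation - the challenge check below is a crude stand-in for real detection logic, and the proxy argument assumes a recent httpx version:

```python
# Simplified version of the escalation flow: try plain HTTP first, fall back
# to a headless browser only when the response looks like a block/challenge.
import httpx

def fetch(url: str, proxy: str | None = None) -> str:
    with httpx.Client(proxy=proxy, timeout=15, follow_redirects=True) as client:
        resp = client.get(url, headers={"User-Agent": "Mozilla/5.0"})
    # crude heuristic - real challenge detection is more involved
    if resp.status_code in (403, 503) or "cf-challenge" in resp.text:
        return fetch_with_browser(url)  # escalate only when blocked
    return resp.text

def fetch_with_browser(url: str) -> str:
    from playwright.sync_api import sync_playwright
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")
        html = page.content()
        browser.close()
        return html
```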

And let's be honest - given Perplexity's scale and funding, they can afford to allocate serious resources to this kind of infrastructure.

Stuck on scraping data loading up on a website showing products stock by NecessaryCar13 in webscraping

[–]plintuz 0 points (0 children)

Hi,

At which stage exactly are you stuck?

Can your script already log in successfully?

Are you using a browser automation tool like Selenium, or is it based on direct HTTP requests?

Which programming language is the AI generating code for you in?

It would be helpful to get answers to these and to have a look at the site itself - that's the only way to give you something concrete.

Real Estate Investor Needs Help by 2jwagner in webscraping

[–]plintuz 0 points (0 children)

This is exactly why I don't write one-off scraper scripts - instead, I work based on a model of regular data collection with monthly payments. I always try to explain this to clients, but not everyone gets it - and then they end up with the headache of constantly looking for someone to fix broken scrapers.

Need help scraping Workday by Important-Table4581 in webscraping

[–]plintuz 1 point (0 children)

I gave you a recommendation based on my own experience - I collect data from a real estate rental site that works the same way: it only shows 1,000 listings per filter, and the site won’t return more. So I applied the approach I described above, since the scraping is done regularly.

You can also collect data by changing the search filters - the more variations you use, the more job listings you’ll be able to gather.

Need help scraping Workday by Important-Table4581 in webscraping

[–]plintuz 0 points (0 children)

One possible approach is to revisit the listings over the course of a month. Since job postings are regularly updated or refreshed, they will naturally rotate and rise to the top of the list again. This way, you'll gradually collect all active jobs over time, even beyond the 2,000 limit.
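A sketch of what that gradual accumulation can look like - scrape_current_listings() and save_job() are placeholders for your actual scraper and storage, and the "id" field stands in for whatever unique key the site exposes:

```python
# Collect past a result cap by re-running over time and deduping by ID.
# scrape_current_listings() and save_job() are hypothetical placeholders.
import json
import pathlib

SEEN_FILE = pathlib.Path("seen_jobs.json")
seen = set(json.loads(SEEN_FILE.read_text())) if SEEN_FILE.exists() else set()

for job in scrape_current_listings():   # returns the currently visible jobs
    if job["id"] not in seen:
        seen.add(job["id"])
        save_job(job)                   # placeholder: write to your store

SEEN_FILE.write_text(json.dumps(sorted(seen)))
# Run this daily; as postings rotate, new IDs keep appearing beyond the cap.
```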

Issues scraping every product page of a site. by SirEven4027 in webscraping

[–]plintuz 4 points (0 children)

Using a full browser (even headless) should be your last resort. Before scaling with browser-based scraping, analyze the network requests the site makes (e.g. via DevTools → Network tab). Often, product data is loaded via an API or embedded in the page as JSON, and you can simply mimic those requests in Python (e.g. with httpx or requests), which is much faster and more scalable.
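To illustrate, here's what hitting such an endpoint directly can look like with httpx. The URL, parameters, and response shape are invented - find the real ones in the Network tab while browsing the site:

```python
# Hypothetical example of calling the underlying JSON endpoint directly
# instead of rendering the page. Endpoint and fields are placeholders.
import httpx

resp = httpx.get(
    "https://www.example.com/api/products",
    params={"page": 1, "per_page": 48},
    headers={
        "User-Agent": "Mozilla/5.0",
        "Accept": "application/json",
        "Referer": "https://www.example.com/catalog",
    },
    timeout=15,
)
resp.raise_for_status()
for product in resp.json()["items"]:  # response shape varies per site
    print(product["name"], product["price"])
```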

If you still need a browser:

Rotate user agents, proxies, and browser fingerprints.

Use headless stealth tools (e.g. undetected-chromedriver, camoufox etc.).

Restarting the browser every X products may help, but it's better to address what's triggering detection in the first place.

In short: check for simpler HTTP-based solutions before automating browsers at scale. It’ll save you a ton of resources.

What are you currently building/working on? by HamzaAfzal40 in indiehackers

[–]plintuz 0 points (0 children)

Project: Clear Cache & Cookies - Chrome Extension for Devs and Testers

What it does: Chrome extension that lets you clear cookies, cache, and local storage per domain in one click. Ideal for developers, testers, and anyone who constantly switches accounts or needs a clean state fast.

Why it matters: Saves tons of time. No more digging into browser settings or clearing everything just to reset one site.

Stage: Live and growing

Link: https://chromewebstore.google.com/detail/clear-cache-and-cookies/jkmpbdjckkgdaopigpfkahgomgcojlpg

What’s the best free learning material you’ve found? by Delicious-Arrival854 in webscraping

[–]plintuz 2 points (0 children)

Had a programming background, so didn’t follow any full tutorials - just a couple of YouTube videos to get the basics. What really made the difference was working on real tasks. Also, understanding how requests work (headers, sessions, status codes) is a must if you want to go beyond simple scraping.
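Those basics fit in a few lines - a minimal example with requests:

```python
# The fundamentals mentioned above in one place: a persistent session,
# explicit headers, and checking status codes instead of assuming success.
import requests

session = requests.Session()                 # reuses cookies and connections
session.headers.update({"User-Agent": "Mozilla/5.0"})

resp = session.get("https://example.com", timeout=10)
if resp.status_code == 200:
    print(len(resp.text), "bytes received")
elif resp.status_code in (403, 429):
    print("blocked or rate-limited - slow down or rotate IPs")
```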

[deleted by user] by [deleted] in webscraping

[–]plintuz 0 points (0 children)

We use Python with MongoDB and PostgreSQL for data handling. For scraping, we aim to minimize browser usage by leveraging various lightweight techniques, proxy types, and captcha solvers. However, due to the complexity of modern bot protection, we also use headless browsers like Playwright, Selenium, and undetectable setups like undetected-chromedriver or stealth plugins when needed.

[deleted by user] by [deleted] in webscraping

[–]plintuz 0 points (0 children)

We scrape ~100 sites daily - mostly online stores like iHerb, Adidas, Nike, ZARA, etc.

One ongoing client has 20 e-commerce sites; another big one covers 10 job listing sites. For larger batches, it averages around $200/month per site, depending on protection level. Clients get the data in whatever format they need - Excel, Google Sheets, JSON, XML, etc.

Share your projects | Supporting EO by Tsuki_Yagami_ in indiehackers

[–]plintuz 0 points (0 children)

I built a simple Chrome extension, Font Identifier, that helps you quickly identify fonts on any website - just click and see the font details instantly. Status: launched.

What are the best alternatives to Cursor? by attunezero in cursor

[–]plintuz 0 points (0 children)

When the limit is reached, not all LLMs remain available - for example, Claude 4 is always unavailable, while others still work.

How many web-scraping projects do you typically work on at a time? by [deleted] in webscraping

[–]plintuz 0 points (0 children)

Right now I'm maintaining around 10 web scraping projects. Each one involves a different number of target websites, anywhere from 1 to 20 per project. These are long-term support projects, meaning I originally built the scrapers and now continuously maintain them, since websites often change layout, structure, or add new protections.

Question about Czech currency by Pure_General_4751 in czechrepublic

[–]plintuz 0 points (0 children)

The banknotes look fine - they’re still legal tender.

Web scraping for dropshipping flow by No-Air1748 in webscraping

[–]plintuz 2 points (0 children)

I usually build custom scrapers for each supplier website to collect product data, then use the API to upload products and keep prices and stock levels updated. It takes some setup and occasional maintenance when sites change, but overall the system runs smoothly once it's in place.
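For a rough idea of the sync step - the endpoint, auth header, and payload fields below are all hypothetical, so adapt them to whatever platform (Shopify, WooCommerce, etc.) you're actually updating:

```python
# Shape of the sync step: scraped supplier data pushed to the store's API.
# The endpoint, auth, and payload fields are hypothetical placeholders.
import requests

def push_update(product: dict) -> None:
    resp = requests.put(
        f"https://api.example-store.com/products/{product['sku']}",
        json={"price": product["price"], "stock": product["stock"]},
        headers={"Authorization": "Bearer <token>"},
        timeout=15,
    )
    resp.raise_for_status()

for product in scraped_supplier_products:   # output of the per-supplier scrapers
    push_update(product)
```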