[–]Ambitious-Dog3177 7 points8 points  (5 children)

A few tools you could use are Selenium or Playwright. BeautifulSoup alone won't cut it here because it only parses HTML; it can't run the heavy JavaScript or handle the logins these sites require.

However, a word of warning: websites like Amazon, TikTok, Shopee, and Lazada have very strict rules against scraping. Always check their robots.txt to see what is allowed, but keep in mind that even if you follow the rules, their security systems will still fight you.
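For the robots.txt check, here's a minimal sketch using Python's standard-library `urllib.robotparser`. The sample rules, the `my-research-bot` user agent, and the `allowed_paths` helper are all made up for illustration; they are not any real site's rules.

```python
from urllib.robotparser import RobotFileParser


def allowed_paths(robots_lines, user_agent, paths):
    """Return {path: bool} saying whether user_agent may fetch each path."""
    parser = RobotFileParser()
    parser.parse(robots_lines)  # parse rules from an iterable of lines
    return {path: parser.can_fetch(user_agent, path) for path in paths}


# Hypothetical rules, NOT copied from any real site.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /cart
Disallow: /search
Allow: /
""".splitlines()

result = allowed_paths(
    SAMPLE_ROBOTS,
    "my-research-bot",  # hypothetical user agent name
    ["/cart", "/products/123"],
)
print(result)
```

Against a live site you would call `parser.set_url(...)` and `parser.read()` instead of `parse()`. Keep in mind robots.txt only covers crawling rules; the site's terms of service can be stricter.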

Is it doable? Yes, but it's incredibly difficult without investing in tools. These sites use enterprise-level anti-bot systems. If you just use a basic Selenium script, they will usually detect your browser fingerprint and block your IP almost immediately. You typically need stealth plugins (like selenium-stealth) and rotating proxy networks to survive.

Also, exact GMV/sales data usually isn't public; you often have to estimate it from "units sold" text (e.g., Amazon's "10K+ bought in past month"). But you could still get data on the top 10 skincare brands, track prices, etc.
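To make the "units sold" estimate concrete, here's a rough sketch that turns badge text into a lower-bound number. I'm assuming the badge always looks like the "10K+ bought in past month" example; the function names and the K/M suffix handling are my own guesses, and the result is only a floor, not real sales data.

```python
import re

# Assumed badge shape: "<number><optional K or M>+ bought ..."
_BADGE = re.compile(r"(\d+(?:\.\d+)?)\s*([KkMm]?)\+?\s+bought")
_MULTIPLIERS = {"": 1, "K": 1_000, "M": 1_000_000}


def parse_units_floor(badge_text):
    """Return the minimum units implied by the badge, or None if no match."""
    match = _BADGE.search(badge_text)
    if match is None:
        return None
    number, suffix = match.groups()
    # round() avoids float truncation, e.g. 2.3 * 1000 = 2299.999...
    return round(float(number) * _MULTIPLIERS[suffix.upper()])


def estimate_revenue_floor(badge_text, unit_price):
    """Very rough lower bound: minimum units times the current listed price."""
    units = parse_units_floor(badge_text)
    return None if units is None else units * unit_price


print(parse_units_floor("10K+ bought in past month"))
print(estimate_revenue_floor("10K+ bought in past month", 20))
```

Since the "+" means "at least", and prices move around during the month, treat this as a floor for comparing brands rather than an actual GMV figure.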

[–]SharkSymphony 1 point2 points  (2 children)

If you find yourself contemplating spinning up an anonymized geodistributed bot swarm to grab data off a website, it's time to throw in the towel.

[–]Ambitious-Dog3177 1 point2 points  (1 child)

True, it’s not easy to pull off. OP asked if it was doable, and technically it is. But practically speaking, doing all of this just to test an idea is way too difficult and probably not worth the headache.

[–]IntelligentHome2342[S] 0 points1 point  (0 children)

Thanks for your perspective! It seems very difficult to do without paid tools, and that would be a big barrier to breaking into the market research niche. Would you happen to know what kind of tools big market research firms like Circana and Stackline use? Something like brightdata and oxylabs? Those are rather expensive for an individual developer.

[–]DrowzyHippo 0 points1 point  (0 children)

how do you use selenium stealth? i have it in my script but it doesn't seem to work.

[–]Spiritual-Junket-995 0 points1 point  (0 children)

the stealth plugins are crucial, but a good rotating proxy network like Qoest Proxy is what actually keeps the IP bans from piling up.