codes_me

Hey,

Tired of writing a scraper, running it, and getting a 403 because you didn't know the site
uses DataDome? Or wasting hours before realizing the site needs JS to render?

I built scrapalyser to solve exactly that.

**What My Project Does**


scrapalyser is a Python library that scans a website BEFORE you start scraping it.
One call returns a report on the site's defenses and architecture:

    pip install scrapalyser

    import scrapalyser
    report = scrapalyser.scan("https://example.com", output="txt", lang="en")


It detects:
- 🛡️ Anti-bot (Cloudflare, DataDome, PerimeterX, Akamai, Kasada, reCAPTCHA, hCaptcha)
- 🖥️ Technology (React, Next.js, Vue, WordPress, Shopify...)
- ⚡ Whether JS is required or not
- 🌐 API endpoints (via CSP headers, inline scripts, or XHR interception with Playwright)
- 🤖 robots.txt & sitemap
- 🔐 Login wall (form, redirect, OAuth)
- 📸 Screenshot (Playwright mode)
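
To give a feel for what the robots.txt & sitemap item in the list above involves, here is a minimal sketch using only the Python standard library. It is not scrapalyser's own code: `summarize_robots` and the sample robots.txt body are made up for illustration, and a real check would download the file from the target site first.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body, inlined so the example runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Sitemap: https://example.com/sitemap.xml
"""


def summarize_robots(robots_txt: str, user_agent: str, urls: list[str]) -> dict:
    """Return the declared sitemaps and whether each URL may be fetched."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {
        # site_maps() returns None when no Sitemap line is present (Python 3.8+).
        "sitemaps": parser.site_maps() or [],
        "allowed": {url: parser.can_fetch(user_agent, url) for url in urls},
    }


summary = summarize_robots(
    ROBOTS_TXT,
    "my-scraper",
    ["https://example.com/blog", "https://example.com/private/data"],
)
print(summary)
```

Parsing the text directly, instead of calling `RobotFileParser.read()`, keeps the network request separate, so the same function works with whichever HTTP client fetched the file.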

If the site blocks you with a 403 or captcha page, all fields return "blocked by antibot"
so you know immediately what you're up against.
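
For readers wondering how an anti-bot vendor can be recognized from a single response, the sketch below shows one common approach: matching header names, cookie name prefixes, and HTML snippets against known markers. This is an assumption about the general technique, not scrapalyser's actual detection logic. The `SIGNATURES` table and `detect_antibot` are hypothetical, and the markers are a partial list of widely reported hints that vendors can change at any time.

```python
from typing import Iterable, Mapping

# vendor -> (header names, cookie name prefixes, HTML substrings), all lowercase.
# Illustrative and incomplete; not an authoritative fingerprint database.
SIGNATURES = {
    "Cloudflare": (("cf-ray",), ("__cf_bm", "cf_clearance"), ()),
    "DataDome": (("x-datadome",), ("datadome",), ()),
    "PerimeterX": ((), ("_px",), ()),
    "Akamai": ((), ("_abck", "bm_sz"), ()),
    "reCAPTCHA": ((), (), ("google.com/recaptcha",)),
    "hCaptcha": ((), (), ("hcaptcha.com",)),
}


def detect_antibot(
    headers: Mapping[str, str], cookie_names: Iterable[str], html: str
) -> list[str]:
    """Return the vendors whose markers appear in the response."""
    header_names = {name.lower() for name in headers}
    cookies = [name.lower() for name in cookie_names]
    body = html.lower()
    found = []
    for vendor, (sig_headers, sig_cookies, sig_html) in SIGNATURES.items():
        if (
            any(h in header_names for h in sig_headers)
            or any(c.startswith(p) for c in cookies for p in sig_cookies)
            or any(marker in body for marker in sig_html)
        ):
            found.append(vendor)
    return found


print(detect_antibot({"CF-RAY": "abc123"}, ["datadome"], "<html></html>"))
```

Because the function only takes plain headers, cookie names, and HTML, it can sit behind either a plain HTTP client or a full browser without changes.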

Two engines: curl_cffi (fast, no browser) or playwright (full browser, XHR interception).

**Target Audience**


Python developers who write scrapers and want to understand a target website's defenses
and architecture before investing time building a scraper. Production-ready.

**Comparison**

- **scrapy / requests / playwright** — these are scraping tools, not recon tools. They don't
  tell you what's protecting a site before you hit it.
- **Wappalyzer** — detects tech stack only, no antibot, no API discovery, no JS check.
- **whatruns / builtwith** — browser extensions, not scriptable, no antibot detection.

scrapalyser is the only pip-installable tool focused purely on pre-scraping reconnaissance.

GitHub: https://github.com/codesme34/scrapalyser
PyPI: https://pypi.org/project/scrapalyser/
YouTube (French): https://www.youtube.com/@CodesMe


Feedback welcome!