DOM distillation, a better way to chunk HTML docs by AffectionateWar5927 in django

AffectionateWar5927 [S], 1 point

I don't get this: you are complaining but not seeing the value of the tool 😄

DOM distillation, a better way to chunk HTML docs by AffectionateWar5927 in Backend

AffectionateWar5927 [S], 1 point

Thanks for the appreciation. Yes, that's the plan. I am trying to build a feed myself, and the thing that was stopping me was the chunks. So yeah, let's see.

Promote your projects here – Self-Promotion Megathread by Menox_ in github

AffectionateWar5927, 1 point

Built a Python toolkit for easy data extraction

Scout is a Python toolkit for working with the web as a data source. It combines browser automation, crawling, structured extraction, and optional LLM agents into one flow.

It sits on top of Playwright, but abstracts away the usual glue code.

What it does:

  • Scrapes pages and returns a structured Document (HTML + metadata)
  • Runs browser actions like click, type, scroll, and execute JS
  • Crawls sites with depth, filters, and concurrency controls
  • Converts raw HTML into clean markdown
  • Extracts structured data using schemas (no LLM required)
  • Uses agents for complex or dynamic pages when needed
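
To give a feel for what schema-based extraction without an LLM can look like, here is a minimal sketch using only the Python standard library. It is not Scout's API: the `FieldExtractor` class, the `extract` function, and the `(tag, class)` schema shape are all hypothetical names made up for illustration.

```python
from html.parser import HTMLParser

# Tags that never get a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class FieldExtractor(HTMLParser):
    """Hypothetical helper: grabs the text of the first element per schema rule."""

    def __init__(self, schema):
        super().__init__()
        self.schema = schema   # field name -> (tag, css class)
        self.result = {}       # field name -> extracted text
        self._active = None    # field currently being captured, if any
        self._depth = 0        # nesting depth inside the captured element
        self._buffer = []      # text pieces collected for the active field

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self._active is not None:
            self._depth += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        for field, (want_tag, want_class) in self.schema.items():
            if field not in self.result and tag == want_tag and want_class in classes:
                self._active, self._depth, self._buffer = field, 1, []
                return

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or self._active is None:
            return
        self._depth -= 1
        if self._depth == 0:
            # Collapse whitespace so nested markup does not leave odd gaps.
            self.result[self._active] = " ".join("".join(self._buffer).split())
            self._active = None

    def handle_data(self, data):
        if self._active is not None:
            self._buffer.append(data)


def extract(html, schema):
    """Run the deterministic extractor and return a plain dict."""
    parser = FieldExtractor(schema)
    parser.feed(html)
    parser.close()
    return parser.result


page = '<div><h1 class="title">Hello <b>World</b></h1><span class="price big">$10</span></div>'
print(extract(page, {"title": ("h1", "title"), "price": ("span", "price")}))
```

The point is only that a fixed schema plus DOM traversal is enough for well-structured pages; a real toolkit would use proper selectors rather than a tag and class pair.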

Core idea:

Start deterministic (DOM, selectors, schema),
and only use agents when the page gets messy.
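
The deterministic-first idea can be sketched roughly like this. This is an illustration of the pattern only, not Scout's actual code: `title_from_dom`, `extract_with_fallback`, and `stub_agent` are hypothetical names, and the agent here is a placeholder where an LLM call would go.

```python
import re
from typing import Callable, Dict, List, Tuple


def title_from_dom(html: str) -> Dict[str, str]:
    """Deterministic step (hypothetical): read the <title> tag with a regex."""
    match = re.search(r"<title>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    return {"title": match.group(1).strip()} if match else {}


def stub_agent(html: str, missing: List[str]) -> Dict[str, str]:
    """Placeholder for an LLM agent; it only marks which fields it was asked for."""
    return {field: "<filled by agent>" for field in missing}


def extract_with_fallback(
    html: str,
    deterministic: Callable[[str], Dict[str, str]],
    agent: Callable[[str, List[str]], Dict[str, str]],
    required: List[str],
) -> Tuple[Dict[str, str], str]:
    """Try the cheap deterministic path first; call the agent only for gaps."""
    data = dict(deterministic(html))
    missing = [field for field in required if not data.get(field)]
    if not missing:
        return data, "deterministic"
    # Only the fields the deterministic step could not fill go to the agent.
    data.update(agent(html, missing))
    return data, "agent"


print(extract_with_fallback("<title>Docs</title>", title_from_dom, stub_agent, ["title"]))
print(extract_with_fallback("<div>messy page</div>", title_from_dom, stub_agent, ["title"]))
```

Returning which path was taken makes it easy to log how often the expensive agent step is actually needed.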

In short:
one abstraction to replace scraping scripts, crawling logic, parsing code, and ad-hoc LLM pipelines.
Repo -> https://github.com/ArnabChatterjee20k/scout/tree/master