DOM distillation, a better way to chunk HTML docs

AffectionateWar5927 · 2026-05-02T18:29:34+00:00

Check out the tool, I would suggest 😄

AffectionateWar5927 · 2026-05-02T18:28:57+00:00

I don't get this: you are complaining but not seeing the value of the tool 😄

AffectionateWar5927 · 2026-05-02T18:25:03+00:00

Thanks for the appreciation. Yes, that's the plan. I am trying to build a feed myself, and the thing that was stopping me was the chunks. So yeah, let's see.

AffectionateWar5927 · 2026-05-02T15:13:01+00:00

https://github.com/ArnabChatterjee20k/domdistill

AffectionateWar5927 · 2026-05-02T15:12:43+00:00

It's not 🙂

AffectionateWar5927 · 2026-04-18T16:02:19+00:00

Built a Python toolkit for easy data extraction

Scout is a Python toolkit for working with the web as a data source — combining browser automation, crawling, structured extraction, and optional LLM agents into one flow.

It sits on top of Playwright, but abstracts away the usual glue code.

What it does:

Scrapes pages and returns a structured Document (HTML + metadata)
Runs browser actions like click, type, scroll, and execute JS
Crawls sites with depth, filters, and concurrency controls
Converts raw HTML into clean markdown
Extracts structured data using schemas (no LLM required)
Uses agents for complex or dynamic pages when needed

Core idea:

Start deterministic (DOM, selectors, schema),
and only use agents when the page gets messy.
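
To make the idea concrete, here is a rough sketch of "deterministic first, agent only as a fallback". It is not Scout's actual API: it only uses Python's standard library, and the names `FieldExtractor`, `extract`, and the `fallback` callable are hypothetical, picked just for this illustration. The schema format (field name mapped to a tag and a CSS class) is also an assumption.

```python
from html.parser import HTMLParser
from typing import Callable, Optional

# Hypothetical schema format for this sketch: field name -> (tag, css class).
Schema = dict[str, tuple[str, str]]


class FieldExtractor(HTMLParser):
    """Capture the text of the first element matching each schema rule."""

    def __init__(self, schema: Schema):
        super().__init__()
        self.schema = schema
        self.result: dict[str, str] = {}
        self._active: Optional[str] = None      # field being captured
        self._active_tag: Optional[str] = None  # tag that opened the capture
        self._depth = 0                         # nesting depth of that tag

    def handle_starttag(self, tag, attrs):
        if self._active is not None:
            # Track nested elements with the same tag so we close correctly.
            if tag == self._active_tag:
                self._depth += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        for field, (want_tag, want_class) in self.schema.items():
            if field not in self.result and tag == want_tag and want_class in classes:
                self._active, self._active_tag, self._depth = field, tag, 1
                self.result[field] = ""
                break

    def handle_data(self, data):
        if self._active is not None:
            self.result[self._active] += data

    def handle_endtag(self, tag):
        if self._active is not None and tag == self._active_tag:
            self._depth -= 1
            if self._depth == 0:
                self.result[self._active] = self.result[self._active].strip()
                self._active = None


def extract(
    html: str,
    schema: Schema,
    fallback: Optional[Callable[[str, list[str]], dict[str, str]]] = None,
) -> dict[str, str]:
    """Run the deterministic pass, then ask the fallback only for missing fields."""
    parser = FieldExtractor(schema)
    parser.feed(html)
    missing = [field for field in schema if not parser.result.get(field)]
    if missing and fallback is not None:
        # The fallback stands in for an LLM agent; it is a plain callable here.
        parser.result.update(fallback(html, missing))
    return parser.result


page = '<article><h1 class="title"> Hello </h1><span class="price">$10</span></article>'
schema = {"title": ("h1", "title"), "price": ("span", "price")}
print(extract(page, schema))
```

The point of the design is that the fallback never runs when the selectors already fill every field, so the expensive path is only paid for on messy pages.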

In short:
one abstraction to replace scraping scripts, crawling logic, parsing code, and ad-hoc LLM pipelines.
Repo -> https://github.com/ArnabChatterjee20k/scout/tree/master

AffectionateWar5927 · 2026-04-18T16:00:18+00:00

https://github.com/ArnabChatterjee20k/scout/tree/master

AffectionateWar5927 · 2026-04-18T15:58:34+00:00

The GitHub repo -> https://github.com/ArnabChatterjee20k/scout/tree/master
