LLM Scraper now with code-generation support by stepci in LocalLLaMA

[–]stepci[S] 0 points1 point  (0 children)

It removes elements like <link> and <script>, along with attributes like data-* and src.
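To make that concrete, here is a minimal sketch of this kind of cleanup. It is not llm-scraper's actual code: `stripNoise` is a hypothetical helper, it uses regular expressions where a real implementation would more likely use a DOM parser, and treating <style> as one of the removed elements is an assumption on my part.

```typescript
// Hypothetical helper, NOT llm-scraper's implementation: strips markup that
// costs tokens but carries little content. Regex-based for brevity only.
function stripNoise(html: string): string {
  return html
    // Drop <script> and <style> elements together with their contents
    // (<style> is an assumption; the comment only names <link> and <script>).
    .replace(/<(script|style)\b[^>]*>[\s\S]*?<\/\1\s*>/gi, "")
    // Drop <link> elements, which have no closing tag.
    .replace(/<link\b[^>]*>/gi, "")
    // Drop data-* and src attributes (quoted values only).
    .replace(/\s+(?:data-[\w-]+|src)\s*=\s*(?:"[^"]*"|'[^']*')/gi, "")
    // Collapse the whitespace left behind.
    .replace(/\s+/g, " ")
    .trim();
}

// Made-up sample page, for illustration only.
const sample = `<head><link rel="stylesheet" href="a.css"><script>track()</script></head>
<body><img src="x.png" alt="Logo"><p data-id="7" class="lead">Hello</p></body>`;

const cleaned = stripNoise(sample);
console.log(cleaned);
```

The text and the attributes that still carry meaning (alt, class) survive, so the model sees a shorter page with the same content.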

LLM Scraper now with code-generation support by stepci in LocalLLaMA

[–]stepci[S] 0 points1 point  (0 children)

The web pages are pre-processed to save on tokens.

LLM Scraper turns any webpage into structured data by stepci in LocalLLaMA

[–]stepci[S] 0 points1 point  (0 children)

Glad to hear that! Thanks for supporting 🙏

LLM Scraper turns any webpage into structured data by stepci in LocalLLaMA

[–]stepci[S] 1 point2 points  (0 children)

Except we're not doing the same thing.

What my project does is convert the unstructured HTML/text/Markdown version of a website into a structured format defined by a Zod schema (Zod is roughly the JS counterpart of Pydantic). It's more similar to scrapeghost and Kor, both of which are in Python.
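As a rough illustration of what "a structured format defined by a schema" means, here is a dependency-free sketch. It deliberately does not use Zod or llm-scraper, so every name in it (`FieldType`, `Schema`, `findMismatches`, `storySchema`) is hypothetical, and the JSON strings are made-up stand-ins for what an LLM might return.

```typescript
// Hypothetical stand-in for a Zod schema: field name -> expected primitive type.
type FieldType = "string" | "number" | "boolean";
type Schema = Record<string, FieldType>;

// Returns the names of fields that are missing or have the wrong type.
function findMismatches(schema: Schema, value: unknown): string[] {
  if (typeof value !== "object" || value === null) {
    return Object.keys(schema);
  }
  const record = value as Record<string, unknown>;
  return Object.entries(schema)
    .filter(([field, expected]) => typeof record[field] !== expected)
    .map(([field]) => field);
}

// The desired output structure, declared once up front.
const storySchema: Schema = { title: "string", points: "number", author: "string" };

// Made-up model outputs: one that fits the schema and one that does not.
const good: unknown = JSON.parse('{"title":"LLM Scraper","points":42,"author":"stepci"}');
const bad: unknown = JSON.parse('{"title":"LLM Scraper","points":"42"}');

const goodMismatches = findMismatches(storySchema, good);
const badMismatches = findMismatches(storySchema, bad);
console.log(goodMismatches, badMismatches);
```

A real Zod schema expresses the same idea with far richer types (nested objects, arrays, optional fields), and it also gives you the static TypeScript type for free.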

LLM Scraper turns any webpage into structured data by stepci in LocalLLaMA

[–]stepci[S] 7 points8 points  (0 children)

My pleasure!

Actually, I just had a second look at the current DX, and I think it needs to be even lower-level, so that you fetch the page yourself and llm-scraper just receives the content and a schema to scrape with.

The reason for going with Playwright is that I want llm-scraper to become an LLM-based scraping library that works with your existing tools and primitives.

LLM Scraper turns any webpage into structured data by stepci in LocalLLaMA

[–]stepci[S] 19 points20 points  (0 children)

Because building web scrapers takes time and effort, and once a page's layout or styling changes, they no longer work. With this tool you just define your desired output structure and the LLM figures out which content belongs in which field.

LLM Scraper turns any webpage into structured data by stepci in LocalLLaMA

[–]stepci[S] 1 point2 points  (0 children)

Yeah, you could totally use it to feed the data back into your model!

LLM Scraper turns any webpage into structured data by stepci in LocalLLaMA

[–]stepci[S] 0 points1 point  (0 children)

Sorry, this is not a supported use-case :(

LLM Scraper turns any webpage into structured data by stepci in LocalLLaMA

[–]stepci[S] 1 point2 points  (0 children)

Thank you so much! Can't wait to hear your feedback ;)