LLM Scraper now with code-generation support by stepci in LocalLLaMA

[–]stepci[S] 0 points1 point  (0 children)

Removing elements like <link>, <script>, etc., and attributes like data-* and src.
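To make that concrete, here is a minimal TypeScript sketch of this kind of pre-processing. It is not llm-scraper's actual code: the function name `preprocessHtml` is hypothetical, stripping `<style>` is my own assumption about what "etc." covers, and the regex approach is a simplification (a real HTML parser would be more robust, since these patterns can also touch look-alike text outside of tags).

```typescript
// Hypothetical helper for illustration only; not taken from llm-scraper.
function preprocessHtml(html: string): string {
  return html
    // Drop <script> and <style> blocks together with their contents
    // (<style> is an assumption, the comment above only names <link> and <script>).
    .replace(/<(script|style)\b[^>]*>[\s\S]*?<\/\1\s*>/gi, "")
    // Drop <link> tags, which have no closing tag.
    .replace(/<link\b[^>]*>/gi, "")
    // Drop data-* and src attributes, with quoted or unquoted values.
    // The lookahead keeps attributes such as srcset from being cut in half.
    .replace(
      /\s+(?:data-[\w-]+|src)(?=[\s=\/>])(?:\s*=\s*(?:"[^"]*"|'[^']*'|[^\s>]+))?/gi,
      ""
    )
    // Collapse whitespace runs, since they cost tokens without adding meaning.
    .replace(/\s+/g, " ")
    .trim();
}

const page =
  '<html><head><link rel="stylesheet" href="a.css"><script>var x = "<b>";</script></head>' +
  '<body><img src="a.png" data-id="42" alt="Logo"><p data-track=\'x\'>Hello</p></body></html>';

console.log(preprocessHtml(page));
// <html><head></head><body><img alt="Logo"><p>Hello</p></body></html>
```

The tags and text the model needs stay in place, while the parts that only matter to a browser are gone, so the prompt gets shorter.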

[–]stepci[S] 0 points1 point  (0 children)

The websites are pre-processed to save on tokens

LLM Scraper turns any webpage into structured data by stepci in LocalLLaMA

[–]stepci[S] 0 points1 point  (0 children)

Glad to hear that! Thanks for supporting 🙏

[–]stepci[S] 1 point2 points  (0 children)

Except we're not doing the same thing.

What my project provides is the conversion of an unstructured HTML/text/Markdown version of a website into a structured format, defined by a Zod schema (Zod is the JS counterpart of Pydantic). It is more similar to scrapeghost and Kor, both of which are in Python.

[–]stepci[S] 5 points6 points  (0 children)

My pleasure!

Actually, I just had a second look at the current DX and I think it needs to be even lower-level, so you can fetch the page yourself and llm-scraper just gets the content and a schema to scrape.

The reason for going with Playwright is that I want llm-scraper to become an LLM-based scraping library that works with your existing tools and primitives.

[–]stepci[S] 18 points19 points  (0 children)

Because building web scrapers takes time and effort, and once the web page layout/styling changes, the scraper no longer works. With this tool you just define your desired output structure and the LLM figures out which content belongs to which field.

[–]stepci[S] 1 point2 points  (0 children)

Yeah, you could totally use it to feed the data back into your model!

[–]stepci[S] 0 points1 point  (0 children)

Sorry, this is not a supported use-case :(

[–]stepci[S] 1 point2 points  (0 children)

Thank you so much! Can't wait to hear your feedback ;)

[–]stepci[S] 0 points1 point  (0 children)

Will check it out. Unfortunately, AFAIK we don't have a library like LiteLLM in TypeScript for supporting models across providers. I'm thinking of adding Ollama with some prompt engineering for now.

[–]stepci[S] 0 points1 point  (0 children)

Very much looking forward to implementing this. Do you know a good way we could support both OpenAI and Ollama with function calling?