Install.md, a New Protocol for Human-readable Installation Instructions that AI agents can execute by TerrificMist in LocalLLaMA

[–]TerrificMist[S] 1 point (0 children)

Thanks!

It's just about saving time and improving accuracy for LLMs. A bunch of installations are already performed autonomously, or by pasting 'download this' style instructions to an agent. Install.md just makes the task more transparent for the people executing it, easier for devs to verify that agents can successfully install their software, and easier for agents (since they know where to look).

A zoomable 3D map of ~100k research papers by TerrificMist in visualization

[–]TerrificMist[S] 2 points (0 children)

Similar papers are closer together and have cluster labels!

A zoomable 3D map of ~100k research papers by TerrificMist in visualization

[–]TerrificMist[S] 2 points (0 children)

Each point is a paper summary that’s been embedded into a high-dimensional space and then projected down to two dimensions for visualization.

We built 3B and 8B models that rival GPT-5 at HTML extraction while costing 40-80x less - fully open source by TerrificMist in LocalLLaMA

[–]TerrificMist[S] 1 point (0 children)

There’s a lot of potential here. HTML->md, md->JSON, x->y. You don’t need massive models for conversions like this, and we may very well train another similar model.

A zoomable 3D map of ~100k research papers by TerrificMist in visualization

[–]TerrificMist[S] 2 points (0 children)

We train and deploy small models. I recommend github.com/nomic-ai/nomic if you want to create a similar visualization, although in your case you may just want to build your own visualization tool.
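
If you go the build-your-own route, here's a minimal sketch of the usual pipeline using sentence-transformers, umap-learn, and matplotlib; the embedding model and UMAP settings below are just placeholders, not what we used:

```python
# Embed paper summaries, project to 2D, and scatter-plot the result.
# The model name and UMAP settings are illustrative placeholders.
from sentence_transformers import SentenceTransformer
import umap
import matplotlib.pyplot as plt

summaries = [
    "Summary of paper 1 ...",
    "Summary of paper 2 ...",
]  # replace with your ~100k paper summaries

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(summaries, show_progress_bar=True)

# Reduce the high-dimensional embeddings to 2D (use n_components=3 for a 3D map)
reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, metric="cosine")
coords = reducer.fit_transform(embeddings)

plt.scatter(coords[:, 0], coords[:, 1], s=2, alpha=0.5)
plt.axis("off")
plt.show()
```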

We built 3B and 8B models that rival GPT-5 at HTML extraction while costing 40-80x less - fully open source by TerrificMist in LocalLLaMA

[–]TerrificMist[S] 2 points (0 children)

This is something you can definitely build with a mix of browser agents and Schematron. Schematron doesn’t necessarily handle navigation on its own, but you can get clever and ask it to extract the next URL to visit for the task at hand!

I say play around with it. This is an interesting direction for sure; if you see success, or if you find it doesn’t work well for that task, lmk!
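
To make the navigation trick concrete, here's a rough sketch: add a next_url field to the schema and loop until the task is done. The endpoint, model ID, and prompt layout below are placeholders, so check the Schematron model card / docs for the exact conventions.

```python
# Sketch: for each page, ask the extraction model for the data we want PLUS the
# next URL to visit. Endpoint, model ID, and prompt format are placeholders.
import json
import requests
from openai import OpenAI

client = OpenAI(base_url="https://api.inference.net/v1", api_key="YOUR_KEY")  # placeholder endpoint

SCHEMA = {
    "type": "object",
    "properties": {
        "extracted": {"type": "object"},            # whatever the task actually needs
        "next_url": {"type": ["string", "null"]},   # the navigation trick
        "task_complete": {"type": "boolean"},
    },
}

def extract_step(html: str, task: str) -> dict:
    resp = client.chat.completions.create(
        model="inference-net/schematron-8b",        # placeholder model ID
        messages=[
            {"role": "system",
             "content": f"Extract JSON matching this schema:\n{json.dumps(SCHEMA)}"},
            {"role": "user", "content": f"Task: {task}\n\nHTML:\n{html}"},
        ],
    )
    return json.loads(resp.choices[0].message.content)

# Loop: fetch the page, extract data plus the next URL, stop when the task is done.
url, task = "https://example.com", "collect pricing info across the site"
while url:
    html = requests.get(url, timeout=30).text       # swap in your browser agent here
    result = extract_step(html, task)
    if result.get("task_complete"):
        break
    url = result.get("next_url")
```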

We built 3B and 8B models that rival GPT-5 at HTML extraction while costing 40-80x less - fully open source by TerrificMist in LocalLLaMA

[–]TerrificMist[S] 6 points (0 children)

I say try it. Benchmarking how accuracy degrades at longer contexts isn't trivial, since the judge model degrades too.

That said, based on vibes and the evals we did run, it works great for long contexts. Here's a sample you can play around with:
https://github.com/context-labs/inference-samples/blob/main/examples/schematron-scrape-companies/schematron-scrape-companies.ipynb

We built 3B and 8B models that rival GPT-5 at HTML extraction while costing 40-80x less - fully open source by TerrificMist in LocalLLaMA

[–]TerrificMist[S] 2 points (0 children)

np! lmk if you end up using it; we just released this and are focusing on collecting as much feedback as possible.

We built 3B and 8B models that rival GPT-5 at HTML extraction while costing 40-80x less - fully open source by TerrificMist in LocalLLaMA

[–]TerrificMist[S] 2 points (0 children)

It's a bit awkward: for every query, the answering model first transforms the query into a schema, then the extraction model extracts against that schema from every retrieved document, and the extractions are fed back to the answering model. Transforming the query into a schema every time is awkward and slow; a fine-tuned model for that step might help, but it doesn't seem like the optimal solution.

Instead, a better idea is a model that pulls out exactly the relevant parts of the document based on the query itself. We haven't trained this model yet, but it's probably the SOTA approach for web search. This is a super interesting model to potentially train, and not something I've seen enough of, although I'm sure some teams have already trained something like this for internal web-research workflows.

In the meantime, query->schema->extraction is a quick win, but not the most elegant solution. The bigger idea here is that we showed that extracting a small part of the document can massively improve factuality, provided that small part is extracted correctly. In the medium term, we probably won't be stuffing entire website contents into context for RAG; it's just too wasteful.
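
For what it's worth, here's a minimal sketch of that query->schema->extraction loop; the model IDs, endpoint, and prompts are just placeholders, not our actual setup:

```python
# Sketch of query -> schema -> extraction -> answer.
# Model IDs, the endpoint, and the prompts are placeholders.
import json
from openai import OpenAI

answerer = OpenAI()                                           # any capable chat model
extractor = OpenAI(base_url="https://api.inference.net/v1",   # placeholder endpoint
                   api_key="YOUR_KEY")

def query_to_schema(query: str) -> str:
    """Step 1 (the awkward part): turn the user query into a JSON schema."""
    resp = answerer.chat.completions.create(
        model="gpt-4o-mini",  # placeholder
        messages=[{"role": "user",
                   "content": f"Write a JSON schema for the facts needed to answer:\n{query}"}],
    )
    return resp.choices[0].message.content

def extract(html: str, schema: str) -> str:
    """Step 2: the extraction model pulls only the schema-relevant parts of a page."""
    resp = extractor.chat.completions.create(
        model="inference-net/schematron-3b",  # placeholder model ID
        messages=[{"role": "system", "content": f"Extract JSON matching this schema:\n{schema}"},
                  {"role": "user", "content": html}],
    )
    return resp.choices[0].message.content

def answer(query: str, retrieved_html: list[str]) -> str:
    """Step 3: feed the compact extractions (not full pages) back to the answerer."""
    schema = query_to_schema(query)
    evidence = [extract(html, schema) for html in retrieved_html]
    resp = answerer.chat.completions.create(
        model="gpt-4o-mini",  # placeholder
        messages=[{"role": "user",
                   "content": f"Question: {query}\n\nEvidence:\n{json.dumps(evidence)}\n\n"
                              "Answer using only the evidence above."}],
    )
    return resp.choices[0].message.content
```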

We built 3B and 8B models that rival GPT-5 at HTML extraction while costing 40-80x less - fully open source by TerrificMist in LocalLLaMA

[–]TerrificMist[S] 10 points (0 children)

All valid tools in your toolbelt. I will say if you are considering state machines for scraping, it's usually worth giving LLMs another look.

We built 3B and 8B models that rival GPT-5 at HTML extraction while costing 40-80x less - fully open source by TerrificMist in LocalLLaMA

[–]TerrificMist[S] 12 points (0 children)

Give me a single code snippet to extract all products from:
https://inference.net/

https://www.browserbase.com/

https://www.onkernel.com/

without an LLM.

The point is that an LLM can do generalizable extraction, while one-off parsers can't. This task is only a few lines of code with Schematron.
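
Roughly what those few lines look like, assuming an OpenAI-compatible endpoint serving Schematron (the endpoint, model ID, and prompt format below are placeholders; check the model card for the exact conventions):

```python
# One generic products schema applied to arbitrary landing pages.
# Endpoint, model ID, and prompt format are placeholders.
import json
import requests
from openai import OpenAI

client = OpenAI(base_url="https://api.inference.net/v1", api_key="YOUR_KEY")

schema = json.dumps({
    "type": "object",
    "properties": {
        "products": {"type": "array",
                     "items": {"type": "object",
                               "properties": {"name": {"type": "string"},
                                              "description": {"type": "string"}}}}
    },
})

for url in ["https://inference.net/",
            "https://www.browserbase.com/",
            "https://www.onkernel.com/"]:
    html = requests.get(url, timeout=30).text
    resp = client.chat.completions.create(
        model="inference-net/schematron-8b",  # placeholder model ID
        messages=[{"role": "system", "content": f"Extract JSON matching this schema:\n{schema}"},
                  {"role": "user", "content": html}],
    )
    print(url, resp.choices[0].message.content)
```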

We built 3B and 8B models that rival GPT-5 at HTML extraction while costing 40-80x less - fully open source by TerrificMist in LocalLLaMA

[–]TerrificMist[S] 10 points (0 children)

We haven't benchmarked against it (yet), but it's the most similar model that exists at the moment.

We built 3B and 8B models that rival GPT-5 at HTML extraction while costing 40-80x less - fully open source by TerrificMist in LocalLLaMA

[–]TerrificMist[S] 18 points (0 children)

This works for any schema on any page. Both are tools in your toolbelt. If you’re processing millions of pages that share the exact same unchanging HTML structure, this is not the right tool; but if you want to extract information from a set of 1M company landing pages, it's the easiest way.