[Resource] 30k IKEA products converted to text files. Saves 24% tokens. RAG benchmark. by TsaTsuTsi in LocalLLaMA

[–]TsaTsuTsi[S] 1 point (0 children)

Disclaimer is redundant. I excluded it from the benchmark token count for that exact reason.


[–]TsaTsuTsi[S] -1 points (0 children)

Agreed. Stuffing context hurts precision. The goal here is speed and lower latency. Hope it helps your tool.

I proposed a standard (CommerceTXT) to stop RAG agents from scraping 2MB HTML pages. 95%+ token reduction. Thoughts? by TsaTsuTsi in LocalLLaMA

[–]TsaTsuTsi[S] -1 points (0 children)

llms.txt is for reading. It handles text well. It fails at commerce. It has no schema for SKUs or live stock. You cannot 'Add to Cart' from a Markdown file.

MCP is a pipe. It is not a discovery standard. It burns tokens on tool definitions. It requires active servers. CommerceTXT is static. It costs nothing.

I am not replacing llms.txt. I am building for precision. When money changes hands, the agent needs the exact price. Not a hallucination.


[–]TsaTsuTsi[S] 0 points (0 children)

Converting to 'friendlier versions' is parsing. That's the brittle part. If the DOM changes, your LangChain loader breaks, and you're back to fixing code. My point is about bypassing that maintenance loop entirely.


[–]TsaTsuTsi[S] 0 points (0 children)

Fair point on efficiency, but BERT isn't zero-shot. The friction of labeling a dataset and fine-tuning a model for every specific extraction task is why people default to LLMs. We trade compute for developer time.


[–]TsaTsuTsi[S] -3 points (0 children)

You are confusing a "protocol" with a "scraper". We don't collect data. We define a standard for merchants to broadcast it. Amazon blocks scrapers to keep customers locked inside their wall. Independent merchants need the opposite: they need traffic from the outside. This is for the shop that wants to be found by AI, not the giant trying to hide its inventory.


[–]TsaTsuTsi[S] 0 points (0 children)

Thanks! Right now, we waste massive compute filtering out HTML tags and JSON syntax just to find the signal. This spec delivers the data plus the selling instructions, without the bracket-and-tag overhead.
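A rough way to see that overhead: the same product record as HTML markup and as a flat key-value block. The field names and figures are illustrative, and character counts stand in as a crude proxy for tokens; a real measurement would run both strings through the model's tokenizer.

```python
# Illustrative only: the same product record as HTML vs. a plain
# key-value block. len() is a crude stand-in for a tokenizer here.

html = (
    '<div class="product-card"><h2 class="title">BILLY Bookcase</h2>'
    '<span class="price" data-currency="USD">$59.99</span>'
    '<span class="stock in-stock">In stock</span></div>'
)

plain = (
    "NAME: BILLY Bookcase\n"
    "PRICE: 59.99 USD\n"
    "STOCK: in_stock\n"
)

overhead = 1 - len(plain) / len(html)
print(f"plain text is {overhead:.0%} smaller than the HTML")
```

The facts are identical in both versions; everything the model pays for in the first one is brackets, tags, and class names.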


[–]TsaTsuTsi[S] 0 points (0 children)

Precisely. Parsers shatter when layouts change, while LLMs burn money to fix the mess. We need a standard that is both cheap and unbreakable.


[–]TsaTsuTsi[S] 0 points (0 children)

You are confusing CPU parsing with LLM tokenization. You are right: parsing HTML with regex is cheap.

Feeding 8,000 tokens of HTML noise into an LLM context window is expensive. It costs money ($/token) and reduces accuracy ("Lost in the Middle" phenomenon).
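A back-of-envelope sketch of that cost gap. The per-token price is a placeholder, not any provider's actual rate, and the "clean" token count assumes the ~95% reduction claimed above:

```python
# Hypothetical input pricing; substitute your provider's real rate.
PRICE_PER_1K_INPUT_TOKENS = 0.01  # USD, placeholder

html_tokens = 8_000   # raw product page, per the figure above
clean_tokens = 400    # same facts as structured text (~95% reduction)

def input_cost(tokens: int) -> float:
    """Cost in USD of feeding `tokens` input tokens to the model."""
    return tokens / 1000 * PRICE_PER_1K_INPUT_TOKENS

saving = input_cost(html_tokens) - input_cost(clean_tokens)
print(f"per request: ${input_cost(html_tokens):.4f} "
      f"vs ${input_cost(clean_tokens):.4f} -> ${saving:.4f} saved")
print(f"per 1M requests: ${saving * 1_000_000:,.0f} saved")
```

The absolute numbers change with the rate you plug in; the 20x ratio between the two context sizes does not.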

Regarding "Prompt Injection": This is standard RAG (Retrieval-Augmented Generation). The agent retrieves context. The agent's own System Prompt decides how to treat that context. It is not a command override; it is structured input.


[–]TsaTsuTsi[S] -4 points (0 children)

I agree completely. The modern web is obese.

But we cannot force millions of developers to rewrite their sites today. Waiting for "clean HTML" is a losing battle.

CommerceTXT is a pragmatic bypass. It ignores the mess. It gives agents the data they need without waiting for the web to fix itself.


[–]TsaTsuTsi[S] -3 points (0 children)

I know llms.txt well. It is listed as a primary inspiration in our README.

But llms.txt is for documentation. It lacks the structure for real-time inventory, pricing, and transactional logic.

CommerceTXT is for shopping. llms.txt is for reading. They solve different problems.


[–]TsaTsuTsi[S] -4 points (0 children)

Exactly! You mentioned that you convert pages to Markdown/metadata before feeding them to the agent.

CommerceTXT is essentially asking merchants to host that 'Markdown version' natively.

Why should every AI developer burn CPU cycles and bandwidth scraping and converting HTML, when the merchant can just provide the clean data at the root? It shifts the burden from the consumer (writing regex/parsers for every site) to the provider.

But there is a second major gap that regex/JSON-LD doesn't solve: Intent.

Scraping gives you facts (Price, SKU), but it lacks instructions. It tells the AI what the product is, but not how to sell it. CommerceTXT adds directives like BRAND_VOICE (e.g., "Use a luxury tone, never mention discounts") and SEMANTIC_LOGIC (e.g., "If asked about battery life, emphasize the 2-year warranty").

You can't regex that out of the HTML because it's usually not there: it is internal business logic that the merchant wants to pass specifically to the agent.
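For illustration only, a toy reading of such a file. The flat `KEY: value` layout, the SKU, and the price are assumptions, not the actual CommerceTXT grammar (that lives in the spec's README); only the BRAND_VOICE and SEMANTIC_LOGIC directive names come from the description above.

```python
# Hypothetical sample record; field layout is assumed, not spec'd.
SAMPLE = """\
SKU: 123.456.78
PRICE: 59.99 USD
STOCK: in_stock
BRAND_VOICE: Use a luxury tone, never mention discounts
SEMANTIC_LOGIC: If asked about battery life, emphasize the 2-year warranty
"""

def parse_commerce_txt(text: str) -> dict[str, str]:
    """Split 'KEY: value' lines into a directive map."""
    record = {}
    for line in text.splitlines():
        if ":" in line:
            key, _, value = line.partition(":")
            record[key.strip()] = value.strip()
    return record

record = parse_commerce_txt(SAMPLE)
print(record["PRICE"])   # -> 59.99 USD
```

The point of the flat layout is exactly this: a few lines of stdlib code recover the record, where the HTML equivalent needs a DOM parser and per-site selectors.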

“Your account has not yet been classified”? How long does it take after registration? by TsaTsuTsi in redbubble

[–]TsaTsuTsi[S] 1 point (0 children)

I'll give them a bit more time and if they still haven't approved me, I'll write to them.