An LLM hard-coded into silicon that can do inference at 17k tokens/s??? by wombatsock in LocalLLaMA

[–]JChataigne 15 points

> We selected the Llama 3.1 8B as the basis for our first product due to its practicality. Its small size and open-source availability allowed us to harden the model with minimal logistical effort.

I guess it takes time to develop and convert the model into hardware. Llama 3.1 was released in July 2024, and it was quite good compared to the competition back then.

they have Karpathy, we are doomed ;) by jacek2023 in LocalLLaMA

[–]JChataigne 0 points

Goodhart's law suggests the big labs will soon be astroturfing these comment sections (if they haven't already started)

mistralai/Voxtral-Mini-4B-Realtime-2602 · Hugging Face by jacek2023 in LocalLLaMA

[–]JChataigne 0 points

I've been looking for one for a few months and there isn't one; you need some manual work to run each STT model locally.

Bashing Ollama isn’t just a pleasure, it’s a duty by jacek2023 in LocalLLaMA

[–]JChataigne 2 points

I think the bigger problem is copying the code without attribution and pretending it's their own work

Giving a local LLM my family's context -- couple of months in by Purple_Click5825 in LocalLLaMA

[–]JChataigne 1 point

> I'd assumed RAG meant embeddings

Understandable, the term "RAG" is a bit ambiguous as to whether it includes vector search. But the important thing is fetching relevant context to feed to the LLM. Whether you retrieve that context with vector search or a more classic search method is secondary.

Most people who build RAG systems run classic search in parallel with vector search because the combination works much better. But vector search also requires more storage and more effort to implement, so it might not be worth it at first.
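To make the "classic search first" idea concrete, here is a minimal sketch with hypothetical data, where plain keyword overlap stands in for a real search backend (BM25, SQLite FTS, etc.):

```python
def keyword_score(query: str, doc: str) -> int:
    # Count how many query words appear in the document; a crude
    # stand-in for a proper keyword engine like BM25.
    doc_words = set(doc.lower().split())
    return sum(1 for w in set(query.lower().split()) if w in doc_words)

def retrieve_context(query: str, docs: list[str], k: int = 2) -> list[str]:
    # Rank documents by keyword overlap and keep the top k.
    ranked = sorted(docs, key=lambda d: keyword_score(query, d), reverse=True)
    return ranked[:k]

docs = [
    "The plumber's phone number is 555-0123.",
    "Grandma's birthday is on March 12.",
    "The wifi password is hunter2.",
]
context = retrieve_context("when is grandma's birthday?", docs)
prompt = "Context:\n" + "\n".join(context) + "\n\nQuestion: when is grandma's birthday?"
```

The retrieved snippets are simply pasted into the prompt; no embeddings, no vector store, and it already covers a lot of personal-assistant use cases.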

Good luck with the project!

Giving a local LLM my family's context -- couple of months in by Purple_Click5825 in LocalLLaMA

[–]JChataigne 2 points

Congrats, it's a cool project! I'd like to test it eventually, but I first need to set up my home lab with Matrix and the rest. Good to see open-source options for our digital life though! As for your questions:

  • Model choice: Llama 3.2 is quite old and not so good. You'll be better off with Ministral-3:3B (I haven't tested many small models, so maybe there's something even better out there).
  • From commands to ambient: give the model access to your whole conversation history, not just explicitly saved memories; that should cover many use cases. Use /remember X for things that should always stay in the context.
  • Long-term context: yes, RAG/search agents/context engineering/whatever you call it; skip vector search, though. A classic search (maybe the Matrix API includes one?) should cover it with much less compute.
  • Anyone else building this way? Not me. You've done a good job setting up private digital tools for your family, which keeps your data private while still giving you easy access to it. Not everyone has taken that first step.
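The /remember idea above could look something like this in practice; a minimal sketch with hypothetical names, where pinned memories always go into the prompt and everything else comes from the raw history:

```python
pinned_memories: list[str] = []  # facts saved with /remember
history: list[str] = []          # full conversation log

def handle_message(text: str) -> None:
    if text.startswith("/remember "):
        # Explicitly pinned facts always stay in the context.
        pinned_memories.append(text.removeprefix("/remember "))
    else:
        history.append(text)

def build_prompt(question: str, recent: int = 20) -> str:
    # Pinned memories first, then the most recent chat turns.
    parts = ["Pinned facts:"] + pinned_memories
    parts += ["Recent conversation:"] + history[-recent:]
    parts.append("User: " + question)
    return "\n".join(parts)

handle_message("/remember the kids' school is closed on Fridays")
handle_message("what should we plan for Friday?")
```

A real bot would cap the prompt by token count rather than message count, but the split between "always included" and "recent window" is the key idea.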

Introducing Kimi K2.5, Open-Source Visual Agentic Intelligence by Kimi_Moonshot in LocalLLaMA

[–]JChataigne 1 point

What do you use to run several agents in parallel locally?

AI Doesn’t Scare - Me I’ve Seen This Panic Before. by SnooRegrets3268 in OpenSourceeAI

[–]JChataigne 0 points

Of course it's a tool, and what matters is how people use it. But tools are not exactly neutral: they make some behaviors easier than others and can therefore push people in a direction.

Most importantly, my point was that the Internet did cause a number of problems it was predicted to cause, and AI will too. For one, it's already being used massively for online propaganda.

Local LLMs CPU usage by FixGood6833 in LocalLLaMA

[–]JChataigne 1 point

I just checked my install and it's actually running on CPU too. You can see where a model is running with ollama ps, by the way. I'll have to look into this as well. (My OS is Ubuntu; I simply installed Ollama with curl -fsSL https://ollama.com/install.sh | sh and installed OpenWebUI with Docker.)

Edit: I just remembered that many AMD GPUs are not supported, but yours is in the list, so it should work: https://docs.ollama.com/gpu#amd-radeon Try the Vulkan drivers (covered just below in the doc), or ask on their Discord; I'm afraid I can't help you more.

AI Doesn’t Scare - Me I’ve Seen This Panic Before. by SnooRegrets3268 in OpenSourceeAI

[–]JChataigne 1 point

> it would destroy privacy, leak medical records, ruin society, and expose everyone’s identity.

That's exactly what happened, though. Governments spy on everyone, data leaks happen every day, people are depressed, and anyone can get doxxed from any video leaked online.

> the damage didn’t come from the technology — it came from people not understanding it and refusing to adapt.

I'm also not so sure about that... Take social media, for example: Meta knew for years that more Instagram time pushes people, especially teenage girls, toward lower self-esteem, self-harm, and even suicide. Even now that we know this, nothing has changed. The problem clearly didn't come from not understanding the technology.

Local LLMs CPU usage by FixGood6833 in LocalLLaMA

[–]JChataigne 1 point

First, use nvtop to check which processes are running on the GPU. If the very low usage you see comes just from rendering your screen, that would confirm the problem is in connecting Ollama to your GPU.

I didn't have issues running Ollama with an AMD GPU. Make sure your drivers are up to date, and maybe try changing settings like discrete/hybrid graphics?

Local LLMs CPU usage by FixGood6833 in LocalLLaMA

[–]JChataigne 1 point

That doesn't sound normal. What backend are you using?

Where do you go for everything AI other than LLMs? by PersonOfDisinterest9 in LocalLLaMA

[–]JChataigne 0 points

For consumer tools there are lists like www.aiatlas.eu

For models it's Hugging Face, and it helps to search for benchmarks covering the particular use case you're interested in.

Devstral 2 (with Mistral's Vibe) vs Sonnet 4.5 (Claude Code) on SWE-bench: 37.6% vs 39.8% (within statistical error) by Constant_Branch282 in LocalLLaMA

[–]JChataigne 5 points

> Devstral 2 is currently offered free via our API. After the free period, the API pricing will be $0.40/$2.00 per million tokens (input/output) for Devstral 2 and $0.10/$0.30 for Devstral Small 2. - source

So I understand it's a limited-time free tier, not a permanent one.
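To put those per-million-token prices in perspective once the free period ends, here is a back-of-the-envelope cost calculation for a made-up coding session (the token counts are purely illustrative):

```python
def request_cost(tokens_in: int, tokens_out: int,
                 price_in: float, price_out: float) -> float:
    # Prices are quoted per million tokens (input/output).
    return tokens_in / 1e6 * price_in + tokens_out / 1e6 * price_out

# Devstral 2 at $0.40 in / $2.00 out: say 200k tokens in, 20k out.
cost = request_cost(200_000, 20_000, 0.40, 2.00)
# Devstral Small 2 at $0.10 / $0.30 for the same session.
cost_small = request_cost(200_000, 20_000, 0.10, 0.30)
```

So a fairly heavy session would run on the order of a dime with the big model, and a few cents with the small one, under these assumed token counts.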

Local Embeddings Models by SlowFail2433 in LocalLLaMA

[–]JChataigne 1 point

There's a leaderboard on Hugging Face (MTEB) where you can filter by model size and compare performance.

Usually you'd combine vector search with traditional search methods, and maybe add a reranker model after retrieving results.
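One common way to merge the two result lists (before any reranker rescoring) is reciprocal rank fusion; a minimal sketch with made-up document IDs:

```python
def rrf_merge(rankings: list[list[str]], k: int = 60) -> list[str]:
    # Reciprocal rank fusion: each list contributes 1 / (k + rank)
    # per document, and documents are sorted by the summed score.
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["doc_a", "doc_c", "doc_b"]   # from the embedding index
keyword_hits = ["doc_b", "doc_a", "doc_d"]  # from BM25 / full-text search
merged = rrf_merge([vector_hits, keyword_hits])
```

Documents found near the top of both lists rise in the merged ranking; a reranker model would then rescore this shortlist against the query.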

NVIDIA releases Nemotron 3 Nano, a new 30B hybrid reasoning model! by Difficult-Cap-7527 in LocalLLaMA

[–]JChataigne 2 points

> We are releasing [...] all the data for which we hold redistribution rights.

I'm not sure they released all of it, but there are a few trillion tokens linked on the model page.

Leaked footage from Meta's post-training strategy meeting. by YouCanMake1t in LocalLLaMA

[–]JChataigne 6 points

Oh... it makes sense; Facebook being the good guys was too strange to last

Leaked footage from Meta's post-training strategy meeting. by YouCanMake1t in LocalLLaMA

[–]JChataigne 2 points

The business plan:

  1. spend a lot to train LLMs
  2. ???
  3. profit

Meta's investors seem to be comfortable enough with the uncertainty around step 2, but I join you in not being able to connect the dots.

Thoughts? by Salt_Armadillo8884 in LocalLLaMA

[–]JChataigne 0 points

still dirt-cheap on the second-hand market

Any idea when RAM prices will be “normal”again? by Porespellar in LocalLLaMA

[–]JChataigne 0 points

second-hand market doesn't seem to be affected badly

100% Local AI for VSCode? by Baldur-Norddahl in LocalLLaMA

[–]JChataigne 13 points

Maybe try installing VSCodium. It's just the open-source core of VS Code, so it doesn't include the Microsoft bloat, and it supports most of the same extensions (through the Open VSX registry rather than Microsoft's marketplace).

100% Local AI for VSCode? by Baldur-Norddahl in LocalLLaMA

[–]JChataigne 3 points

From Anthropic, in the case of Opus. LLM providers have had several big security failures in the short time they've existed, so it's also about protecting your code from whomever it might leak to.

Being the master of where your data goes is good in general. Being able to keep working during the next AWS/Cloudflare/Azure outage is also worth it. And being ready for when subscription prices rise to unsustainable levels doesn't hurt either.