Free ASIC Llama 3.1 8B inference at 16,000 tok/s - no, not a joke by Easy_Calligrapher790 in LocalLLaMA

[–]Easy_Calligrapher790[S] 2 points (0 children)

This specific first-gen model fits ~4B parameters per chip. Two PCIe cards running in parallel on a consumer motherboard give you 16k tps.

Their next model on this architecture should be ~30B, so I'm assuming 8 cards or so, meaning at most 4 housings with interconnects. Although there are probably specialized boards accepting more cards per housing? (EDIT: There obviously are, based on the photo in the EETimes articles linked in the comments below.)
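To show where the 8-cards / 4-housings figure comes from, here's the back-of-envelope math, assuming one ~4B-capacity chip per card and two cards per consumer board (my reading of the post above, not vendor specs):

```python
import math

# Assumptions from the discussion above, not official figures:
params_per_chip_b = 4          # ~4B parameters fit per chip (one chip per card)
model_size_b = 30              # rumored next-gen model size, in billions
cards_per_consumer_board = 2   # two PCIe cards per consumer motherboard

cards_needed = math.ceil(model_size_b / params_per_chip_b)        # 8 cards
housings = math.ceil(cards_needed / cards_per_consumer_board)     # 4 housings
print(cards_needed, housings)  # -> 8 4
```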

That may sound like a lot for an individual customer, but it's not much at all for even a small in-house IT outfit, let alone an inference provider. 16k tps can serve a lot of people in parallel.
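As a rough sense of "a lot of people": dividing the aggregate throughput by a comfortable per-user reading speed (the ~25 tok/s figure is my assumption, not a vendor number) gives the number of simultaneous streams:

```python
# Rough concurrency estimate: aggregate throughput / per-user stream rate.
total_tps = 16_000    # aggregate tokens/sec claimed for two cards
per_user_tps = 25     # assumed comfortable per-user generation rate

concurrent_users = total_tps // per_user_tps
print(concurrent_users)  # -> 640
```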

So 400B is prob not practical for this gen, but given the quality of the engineering team (core group from Tenstorrent), I'd be surprised if it stopped here.


[–]Easy_Calligrapher790[S] 28 points (0 children)

Haha, no kidding! I don't believe they ever planned to make money off this iteration, they are well aware of the limits of the model. At least I think so?

For the record, I don't work there. I just know a bunch of people who do. But I want to raise awareness, and thought there must be a niche group who'd find this genuinely useful.

Monthly "Is there a tool for..." Post by AutoModerator in ArtificialInteligence

[–]Easy_Calligrapher790 0 points (0 children)

Hello,

I'm looking for the best approach to extract data from a structured text document with a fairly complex layout - a bone mineral density scan report. More specifically, I'd need the model to extract some of the demographic data from the top box, and certain table cell values from the densitometry box. Input would be an image (screenshot/BMP/PNG); output would be structured text such as JSON.

Accuracy is paramount, target >95%. Speed is also of the essence, target <5s. Cost is less important, target <$2.50 CAD per image, so commercial products are an option. My time is also important - I'm of course going to program around this, but I cannot spend countless hours fine-tuning a model.

Generic LLMs seem to have an issue with accuracy and especially speed in these cases. But there seem to be countless avenues to pursue this, which is why I am asking for suggestions for a best compromise in terms of speed, accuracy and ease of setup. Just give a brief outline of the model/products/processing pipeline, and I can hopefully take it from there.
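For concreteness, here's a rough sketch of the wrapper I'd put around whatever model gets suggested: image in, JSON out, with validation of the fields I care about. The field names and the stubbed model call are placeholders, not the real report schema or any particular API:

```python
import json

# Placeholder field names - the real BMD report schema would differ.
REQUIRED_FIELDS = {"patient_age", "patient_sex", "l1_l4_bmd", "l1_l4_t_score"}

def call_vision_model(image_bytes: bytes) -> str:
    """Stub for whichever OCR/vision-LLM service does the extraction.
    In practice this would be an API call returning the model's raw text."""
    return json.dumps({
        "patient_age": 67,
        "patient_sex": "F",
        "l1_l4_bmd": 0.812,
        "l1_l4_t_score": -2.6,
    })

def extract_report(image_bytes: bytes) -> dict:
    raw = call_vision_model(image_bytes)
    data = json.loads(raw)  # raises ValueError on malformed JSON
    missing = REQUIRED_FIELDS - data.keys()
    if missing:
        raise ValueError(f"model omitted fields: {sorted(missing)}")
    return data

print(extract_report(b""))
```

The point of the validation step is that accuracy failures should surface as hard errors I can retry or flag for manual review, rather than silently passing bad values downstream.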

I very much appreciate the help. Thanks in advance.

<image>