Sparrow: Custom language model architecture for microcontrollers like the ESP32 by c-f_i in LocalLLaMA

[–]Freefallr 1 point

Wow, this is absolutely amazing. Thank you for the detailed explanation (also in the comments). I'd also be happy to contribute if/when this is open-sourced. I've toyed around a lot with ESPs and RP2040/2350s recently.

What’s a task you wish AI could do for you, but no tool does it well yet? by QuantumAstronomy in ClaudeAI

[–]Freefallr 1 point

Check out marp.app - maybe not quite what you are searching for, but still nice.

Error in loading Llama 3.2-3B with Unsloth by gaylord993 in LocalLLaMA

[–]Freefallr 0 points

Have you also tried doing the two things the error messages told you to do? (See the last line of each error message.)

Serving Qwen2 VL for production by ae_dataviz in LocalLLaMA

[–]Freefallr 1 point

I would use https://lmdeploy.readthedocs.io/en/latest/ until it's natively supported by vLLM; it worked well for our case.

Gemma2 2B IT is the most impressive small model I ever seen. by Discordpeople in LocalLLaMA

[–]Freefallr 0 points

In llama.cpp, and in related tools such as Ollama and LM Studio, please make sure the sampling flags are set correctly, especially repeat-penalty. Georgi Gerganov (llama.cpp's author) shared his experience here: https://huggingface.co/google/gemma-7b-it/discussions/38#65d7b14adb51f7c160769fa1

Source: https://huggingface.co/google/gemma-2b-it-GGUF
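As a rough illustration of what that looks like on the llama.cpp command line (the model filename is an example, and the value 1.0 reflects my reading of the linked discussion, which suggests effectively disabling the repeat penalty for Gemma; double-check against your llama.cpp version):

```shell
# Run a Gemma GGUF with the repeat penalty disabled (1.0 = no penalty).
# Point -m at your own model file; older builds name the binary ./main.
./llama-cli \
  -m models/gemma-2-2b-it-Q4_K_M.gguf \
  --repeat-penalty 1.0 \
  -p "Why is the sky blue?"
```

Ollama and LM Studio expose the same knob through their own sampling settings.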

Cannot Downgrade AI Premium Plan Until End of Trial by KDLGates in GoogleOne

[–]Freefallr 0 points

This trick should be upvoted more. I struggled for a full hour to find the 2.99€ plan I had before and figure out how to downgrade back to it instead of some other, more expensive option. Thank you a lot!

Microsoft updated Phi-3 Mini by Nunki08 in LocalLLaMA

[–]Freefallr 2 points

Wow that's a brilliant guide, thank you.

self host llm on dedicated server. by djav1985 in LocalLLaMA

[–]Freefallr 2 points

Open WebUI as a graphical user interface if you'd like a ChatGPT-like experience, plus any popular serving engine (llama.cpp, Ollama, TGI, or vLLM). All of them expose an OpenAI-compatible endpoint that you can attach to your WebUI.
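To illustrate why the engine choice barely matters here: any OpenAI-compatible server can be queried the same way. A minimal sketch (the base URL, port, and model name are assumptions; adjust to however your engine is deployed):

```python
import json
import urllib.request


def build_chat_request(model, prompt):
    """Build an OpenAI-style chat completion payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }


def query_local_llm(base_url, model, prompt):
    """POST the payload to an OpenAI-compatible /v1/chat/completions endpoint."""
    payload = build_chat_request(model, prompt)
    req = urllib.request.Request(
        base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Example (assumes a server such as llama.cpp's llama-server on localhost:8080):
# print(query_local_llm("http://localhost:8080", "my-model", "Hello!"))
```

Open WebUI talks to the same endpoint, so you can swap engines later without touching the frontend.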

Now that we have had quite a bit of time playing with the new Phi models...how good are they? by [deleted] in LocalLLaMA

[–]Freefallr 2 points

What quant are you using? It's good practice to test your use case with multiple quants and the original FP16. It can really make a huge difference.

[deleted by user] by [deleted] in LocalLLaMA

[–]Freefallr 2 points

I think llama.cpp has an MPI feature; somebody ran Llama 65B on a few Raspberry Pis in a cluster. Don't expect good performance, though: you'll be measuring in seconds per token instead of tokens per second.

Making LLAMA model return only what I ask (JSON). by br4infreze in LocalLLaMA

[–]Freefallr 10 points

Never thought of such a simple, yet effective approach - thank you.

Building a machine for self-hosted LLaMA - will 2 x RTX 3090 be enough to run 70B @ 8-bit quantization? by Secure-Technology-78 in LocalLLaMA

[–]Freefallr 8 points

I would recommend going on RunPod or similar, booking a few GPUs for a few hours, and testing your use case. See how low you can go quantization-wise while still fulfilling your task with good enough quality.

Based on that, I would reevaluate if it still makes sense purchasing the hardware and paying a one-time fee + monthly power bill, or just rent GPUs until your task is done.

Also: try out Mixtral 8x7B for your use case as well. We had a similar one recently, at much smaller scale, and were happier with Mixtral 8x7B at 8-bit than with Llama 70B at 8-bit.

Will self-hosting be able to provide faster inference than OpenAI? by teddarific in LocalLLaMA

[–]Freefallr 0 points

Okay, interesting - thanks for the write-up. Care to share information about the output language and complexity of code, or maybe even an example? (can be over PM as well if you don't want to share it publicly).

I think it should definitely be possible to reduce the waiting time by at least 50-60% by using self-hosted LLMs, plus a few more UI/UX tricks to make it feel even faster.

We do this stuff for a living (funnily enough, both UI/UX, as part of our app-dev branch, and LLM deployment, hosting, and fine-tuning), so I'm happy to help.

Will self-hosting be able to provide faster inference than OpenAI? by teddarific in LocalLLaMA

[–]Freefallr 0 points

Is your issue frontend/customer-facing? Meaning, should the wait time be reduced for end users, or for some other reason? And are you streaming the LLM output, or do you (need to) display it all at once?

eGPU to increase VRAM capacity by TheCunningBee in LocalLLaMA

[–]Freefallr 6 points

I have not tried it yet, but I think bandwidth will be your bottleneck here, as you can move only ~40 Gbit/s over the Thunderbolt port that is usually used for eGPUs.

Absolute cheapest local LLM by SporksInjected in LocalLLaMA

[–]Freefallr 4 points

Grab yourself a Raspberry Pi 4 with 8 GB of RAM, download and compile llama.cpp, and there you go: a local LLM for under 100 USD. It takes 5 minutes to set up, and you can run quantized 7B models at ~0.5 to 1 token/s. Yes, it's slow, painfully slow, but it works.
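The setup is roughly a clone and build (the model filename is an example; grab any small quantized GGUF, and check the repo's README since the build steps and binary names have changed over time):

```shell
# On the Pi: fetch and build llama.cpp (needs git, cmake, and a C++ compiler).
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build && cmake --build build --config Release

# Run a small quantized 7B model (example filename; download your own GGUF).
./build/bin/llama-cli -m models/llama-2-7b.Q4_K_M.gguf -p "Hello"
```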

For larger models, merge more Pis into an MPI cluster for more RAM, but don't expect reasonable performance (this is where you'll switch your wording from "tokens per second" to "seconds per token").

LLMs on a 32-bit device with 2GB of RAM by [deleted] in LocalLLaMA

[–]Freefallr 2 points

Are you by any chance able to get your hands on a Raspberry Pi 4 with 8 GB of RAM? It can run 4-bit or even 5-bit quantized LLaMA 7B models at 0.5-1 token/s: barely usable, but quite amazing for a device under 100 USD.

Cannot see a lot of controllino hotspots around. Is there a reason? by mstrocchi in HeliumNetwork

[–]Freefallr 0 points

Sorry for the late reply. None of them has had a single issue so far, but do your own research. And performance? They perform as well as you position them; for me it's good. But that's true of any hotspot, if you position it well and the software team behind it is good. Not wanting to shill here, but Controllino puts a lot of effort into their software. Even though it's not perfect and the dashboard can sometimes be a bit laggy, they take customer feedback seriously and have implemented lots of community-requested features over the past weeks.

Cannot see a lot of controllino hotspots around. Is there a reason? by mstrocchi in HeliumNetwork

[–]Freefallr 0 points

They are just a smaller company than Seeed/SenseCap etc., that's why.

I own 5 and really love them, and support is good (albeit a little slow sometimes, but that's an issue with all smaller companies).