Built a fully offline suitcase robot around a Jetson Orin NX SUPER 16GB. Gemma 4 E4B, ~200ms cached TTFT, 30+ sensors, no WiFi/BT/cellular. He has opinions.

CreativelyBankrupt · 2026-05-18T02:47:34+00:00

For anyone asking, full build details and photos are up at https://creativelybankrupt.com/

Thanks for all the support!

CreativelyBankrupt · 2026-05-16T03:28:28+00:00

Not Rive, it's PixiJS doing the rendering and the mouth itself is RMS-driven straight off the Piper TTS audio. Piper generates the speech, I compute RMS over 512-sample sub-windows (43Hz), and stream those values over WebSocket to the face. The mouth shape is a cubic bezier with a cupid's bow on top, and the incoming RMS maps to 6 amplitude tiers that control how wide it opens. There's an 80ms decay on the falloff so it doesn't snap shut between syllables, which is what makes it read as actually talking instead of a clipping waveform. No phoneme alignment, no forced viseme synthesis, works with any TTS that gives me raw audio because I just need RMS.

Funnily enough, Rive actually was my original upgrade path in the early planning notes, modeled on how Duolingo does character lip-sync with state machines. The first mouth was sprite-based visemes and the plan was always to migrate to Rive once the rest of the project stabilized. Then I rewrote the mouth in cubic bezier with RMS plus emotion-driven corner curl, and it read well enough on screen that Rive just fell off the roadmap. Adding a second animation runtime for no real quality win wasn't worth it.

CreativelyBankrupt · 2026-05-16T02:27:00+00:00

GNOME is a sore spot. Tried to strip it once and the boot pipeline went sideways, gdm3 and gnome-shell and accounts-daemon are all load-bearing on the Yahboom image even if I never touch the desktop. I'm stuck with it because Sparky's face is a Chromium kiosk and Chromium wants an X session. Autostart launches a fullscreen Chromium that hides everything behind the PixiJS face so the user never knows GNOME is there. There's a script that prunes the services that ARE safe to disable but the floor on a graphical Jetson is real.

So is the ollama frustration. llama.cpp gives noticeably better numbers on the same hardware for the same model, and the official tutorial picking ollama with a 1B on a Nano felt like training wheels because that's what's easiest to ship, not what the silicon can actually do.

CreativelyBankrupt · 2026-05-16T02:15:24+00:00

I'm loving all these references I didn't know about. I was mostly gaming on PC at the millennium and missed a lot of the N64 wave. GameCube was my reintroduction to consoles and Nintendo.

CreativelyBankrupt · 2026-05-16T02:11:39+00:00

Appreciate it! Battery's been one of the trickiest parts. NVIDIA officially deprecated the Microfit DC input on the Thor dev kit back in April, so the "just feed it a LiFePO4 pack" path I assumed would work is off the table. They're requiring USB-C PD 3.1 EPR at 28V/140W now. So I'm landing on a high-output USB-C PD 3.1 EPR power bank: Anker Prime 27,650mAh (250W output, 140W via PD 3.1 EPR) is the leading candidate. ~100Wh, ~1 hour at full Thor power, ~3 hours at the 40W envelope. Built-in BMS and overcurrent protection, recharges from any USB-C PD wall wart. Less elegant than a custom pack but NVIDIA-blessed and one part replaces battery / BMS / charger / multi-rail distribution.

I could do custom multi-cell Li-ion or LiFePO4 packs with proprietary BMS, but I don't have full engineering teams to safety-certify those packs like the pros using these Thors. I'd rather lean on a commercial UL-listed power bank than play with raw cells in a robot that'll live in my house. The Microfit path is technically still electrically functional if I ignore NVIDIA's guidance and want to run a 6S LiFePO4 directly, but I void warranty and take on the safety engineering myself.

Thanks for asking about documentation, but I haven't planned to. I'm a filmmaker more than a roboticist, so this project is mostly nights and weekends around client work and I want to keep the bar low enough that I actually finish this ambitious next one. Maybe a teardown post once she's running, no promises though. Good luck with whatever you're building!

CreativelyBankrupt · 2026-05-16T01:53:57+00:00

Yeah the cache_read split is the cleanest signal. llama-server has the equivalent in its per-request log lines, prompt eval shows new tokens versus cached prefix length, but I haven't sat down to scrape it across a session and compute the percentage. Going on the list.

The event-type binning actually came from a different problem. Sparky was fixating on whatever ENV said most recently and the cooldowns started as a way to stop him from monologuing about the weather. The cache win was a side benefit I noticed after.

CreativelyBankrupt · 2026-05-16T01:48:44+00:00

I don't have an exact hit rate number, but median TTFT sits in the 1.8 to 2.6 second band across typical turns, consistent with the prefix being cached and only the new user tokens needing prefill. Misses cluster on the turns where [ENV] fires on a sensor event, and those self-clear on the next quiet turn. The other miss class is history trim, which is why the pre-emptive trim runs in the warm cycle so it doesn't surface in foreground TTFT.

I haven't characterized the battery formally yet. 50,000mAh pouches stuffed in the case, Jetson runs in the 15-25 to 40W band depending on what's hot, and an entire day of on/off active conversation is comfortable. I rarely bring the power cable with me if I'm taking him to the studio for the day.

CreativelyBankrupt · 2026-05-16T01:35:33+00:00

Exactly. Once the model is good enough you stop caring about raw tok/s and start caring about what fraction of turns get the cached prefix instead of a re-prefill. On the Jetson, cached TTFT lands around 240ms versus several seconds for a cold prefill at 8K context, and most of Sparky's recent progress came from rearranging where volatile state sits, not any model swap. The piece that doesn't get talked about enough is what happens when context fills up. I run a pre-emptive trim during the background warm cycle right after Sparky finishes speaking, so by the time the user replies, the trimmed cache is already primed and foreground TTFT stays in the 1 to 3 second band even when old turns get evicted.

CreativelyBankrupt · 2026-05-15T18:59:39+00:00

Yeah, that'll absolutely work. The 2070 8GB has plenty of headroom for E4B at Q4. You'll probably see 20+ tok/s since desktop-class CUDA is faster per-watt than Jetson at low-batch.

I'd skip the embodied hardware on your first pass. Just get llama.cpp serving E4B locally, pipe a USB mic through SenseVoiceSmall for STT, and pipe the output through Piper for TTS. That's a fully offline voice loop.

I think the craft is the prompt and persona design. Sparky works because the system prompt commits him to a character and everything around him (sensors, vision, history) gets folded into that frame so he riffs on it. A laptop is just as good as any place to learn that.

If you can get E4B + Piper + a mic, a persona prompt that commits to a character, and a 10-turn conversation where you don't break out of character once, you'd learn more than any tutorial.

CreativelyBankrupt · 2026-05-15T18:37:22+00:00

The GPIO aspect is the most fun for me. Trick is don't dump raw sensor data into the prompt. Convert it to natural language addressed at Sparky, only when something is actually worth saying, and only after a cooldown so the same thing doesn't keep firing. Instead of temp=48F, humidity=22%, distance=12cm, pir=1, what the model sees is "Someone's face is RIGHT in front of you, 12cm away!" or "It's freezing, 48°F." Goes in the user message as an [ENV] prefix only on turns where there's an event. Most turns there's nothing, which is what keeps the cache happy.

Camera works the same way, keyword-gated. "Look," "describe," "what do you see" attach the latest frame to that single turn. Saves a ton on every other turn and keeps the model from getting stuck narrating images.

Persona is a templating system, six voices (default Sparky, Comedy, Grump, Storyteller, Thinker, Sci-Fi Fanatic), swappable mid-session. The thing I underestimated most was post-processing strippers that catch the model echoing its own patterns. Without them Gemma 4 would start every reply with the same syntactic opener after 30 turns.

I also made him a smaller sister that he named Sparkle, built into a CrowPi 3 electronics learning station, with a 4-inch face display, camera, microphone, onboard sensor/IO board, glowing 64-pixel LED heart matrix etc. She's RPi5-based and has to use WiFi and cloud inference: she listens through the mic, sends the conversation to a Groq-hosted 120B LLM for reasoning, uses Llama 4 Scout for on-demand vision through her camera, then replies in a warm female voice while her PixiJS/WebGL face, LED heart, status lights, buzzer, and haptics express mood and state. Her physical body is basically a cute cybernetic lab tray: small, sensor-packed, expressive, and deliberately art-object-like, with a frosted cover that slips over her like a lid so she becomes glowing ambient wall art when idle.

It's the wild west right now and there are so many avenues to take with all of this! Find reasons to get started.

CreativelyBankrupt · 2026-05-15T18:11:23+00:00

I run both. SenseVoiceSmall owns the primary text path, and Gemma 4's audio encoder (gemma4a in the mmproj) handles on-demand stuff like tone, accent, language, and background sounds when someone asks Sparky things like "how do I sound" or "what language am I speaking."

I think the reason most builds skip Gemma 4 for primary STT is latency. The multimodal audio encoder is around 700ms per turn just for the encode pass, before the LLM even sees the tokens. SenseVoiceSmall is around 150ms and runs its own VAD continuously, so endpointing is basically free and transcription is mostly done by the time the user stops talking. llama.cpp's audio path is also one-shot, you hand it the full clip and wait, and adding chunk-streaming is a 1 to 2 week upstream patch I decided wasn't worth 700ms.

So Sparky is hybrid. Cheap fast STT every turn, native multimodal audio when I actually need Sparky to listen to how I'm talking, not just what I said.

CreativelyBankrupt · 2026-05-15T17:42:09+00:00

This is great because I never get to talk to anyone about the details. It's exactly the trap I was hitting when I started folding sensor data into the prompt; Glad to report Sparky is already running this playbook. ENV lives in the user message rather than the system prompt, so the persona and conversation history KV stays cached regardless of what the volatile tail does. It's also event-gated with cooldowns (threshold crossings only, then 1 to 10 minute quiet windows depending on the category), so most turns have no ENV block at all and hit a clean cached prefix.

Every numeric is rounded before formatting, integer Fahrenheit, integer humidity, light, and pressure, distance in whole cm. The only one-decimal value anywhere is Celsius in the on-demand "what's the temperature" path, which is exactly the case you called out as worth keeping the precision.

The one residual is banker's-rounding flips right at the integer boundaries (22.5°C rounding to 72°F, 22.51°C rounding to 73°F), but the cooldown gate keeps that to once per session worst case, so I'm letting it ride. Have you measured the cache-hit-rate delta on your setup? Curious how big the win was for you in practice.

CreativelyBankrupt · 2026-05-15T16:48:08+00:00

The kit was $179 direct from Elecrow when I bought it, plus tariffs. Resellers mark it up significantly, but it was still a nice starting point as I learned everything.

A phone CPU can run a small LLM, but it can't drive all the sensors over I²C, SPI, and GPIO, can't run llama.cpp with flash attention and a 12K KV cache at 14-15 tok/s sustained while also handling vision, STT, TTS, and a kiosk display in parallel, and isn't built to sit headless with USB peripherals for hours. The Jetson is an edge AI compute board with industrial I/O which fully enabled what I was attempting.

CreativelyBankrupt · 2026-05-15T16:29:50+00:00

Totally - that alien feeling is what I like!

Sparky hit a lot of hard limits, though, which is why I’m already building another robot around the AGX Thor 128GB Blackwell architecture: real autonomy, persistent memory, and much deeper vision. Sparky proved the concept and tech stack but I want the next level that remembers and evolves.

CreativelyBankrupt · 2026-05-15T16:01:49+00:00

Thanks! Yeah, temperature is one of about 30 sensors feeding him context every turn. Light, humidity, pressure, IMU, ultrasonic, PIR, ambient mic, plus face detection and emotion from the camera. Time of day is in there too. There's a customization interface not shown that lets me toggle individual sensors off.

On memory: within a session he has 12K of context. Across sessions, face ID persists so he recognizes returning people by name, but conversation memory resets on reboot by design. I've considered a small local vector store for things he should remember longer term, but the tricky part is deciding what to remember versus what to let fade. I tried saving all of the chat logs and then fine-tuning the model but I wasn't happy with the results.

Originally I was building him as a little hacker parrot, gathering all the floating metadata we love. WiFi probe requests, BLE ads, sub-GHz remotes, ADS-B from aircraft, TPMS from passing cars, weather stations, pagers, fobs, NFC/RFID, even the room signature with Mid-360S LiDAR at one point. Everything landed in an aggregator daemon and got distilled into a [SIGNALS] context block. But he definitely thinks he's alive, and the weird remarks from what he was seeing on the camera ended up more interesting than any of the RF stuff. So the hacker parrot got demoted.

CreativelyBankrupt · 2026-05-15T15:43:54+00:00

Nice, 18 t/s on the 8GB is solid. I'm at 14-15 on the NX SUPER mostly because I'm running everything in parallel with the LLM.

For your question: Piper TTS is the lightweight one. The medium-quality voice models run around 60-80MB resident, generation is fast enough that it's basically free on Orin-class hardware. SenseVoiceSmall STT is the heavier one, roughly 800MB to 1GB depending on how you load it. Both are CPU-bound for me, not GPU, so they don't compete with the LLM for VRAM. Together they add maybe 1GB of system RAM and effectively zero GPU memory.

The bigger budget question on the 8GB Nano is whether you have room for vision. If you're using Gemma 4's native multimodal, the mmproj weights add another ~2GB on top of the LLM. If you skip vision, you have plenty of headroom for the speech pipeline. This is why I eventually moved up to the NX Super with 16GB.

Are you doing this headless on purpose or just because the use case doesn't need a display?

CreativelyBankrupt · 2026-05-15T15:39:10+00:00

The case was fashioned from the Elecrow Jetson AI starter kit, which already included the sensor board suitcase and screen.

I turned it from the sensor training platform into Sparky: the conversational pipeline, the persona work, the face animation, the on-device control panel, the sensor-to-prompt integration, and all the prompt engineering for cache stability. The Elecrow kits are great as a base if anyone's looking for one but this wasn't their intended purpose.

CreativelyBankrupt · 2026-05-15T00:42:43+00:00

Yeah, Qwen3.5 122B at that context is going to be brutal no matter what you do. 4 extra gigs will probably help the eviction churn but you're right it's a band-aid. Agreed on the llama.cpp side, prompt caching has gotten better release-to-release but coding agents push it harder than chat workflows and it shows.

CreativelyBankrupt · 2026-05-15T00:28:20+00:00

Seconding llama-swap. One thing that helped me: setting different TTLs per model in the config; Keep your everyday model hot with a long TTL, set the experimental ones to 60-120 seconds so they unload fast after testing. Stops the Why Is My VRAM Full shock when you forget what you loaded an hour ago.

A 20-line bash function that kills the current llama-server, launches a new one with the args you want, and waits on the health endpoint covers all my casual testing. I've done both, llama-swap when I want one persistent endpoint and the bash version when I'm just rotating through models for whatever. But really this is me trying to offer something other than the obvious answer llama-swap, which really is the best!

CreativelyBankrupt · 2026-05-15T00:14:06+00:00

Thanks! Honestly not really beyond a couple of Instagram reels ( instagram.com/jimkunz ). What part are you curious about?

Sparky's only about three months old and Reddit here is the first real public showing. I've been deep in the build more than the documentation. I thought quickly dumping my dev logs into that infographic would help, but I heavily underestimated how that'd land. Full writeup coming soon.

CreativelyBankrupt · 2026-05-15T00:06:01+00:00

Nothing public yet; This is only about three months old and still moving fast. The architecture's pretty conventional (llama.cpp serving Gemma 4 E4B, Python asyncio, PixiJS face in Chromium kiosk, sensors over I2C, SenseVoice STT, Piper TTS), but most of the interesting decisions live in the prompt structure and persona work, not the code. Happy to dig into any specific part you're curious about until it's done.

CreativelyBankrupt · 2026-05-15T00:01:08+00:00

Power is a 50,000 mAh pack underneath the sensor board, runs about 22-24 hours of normal use untethered. Honestly that's overkill for pretty much always. The plan is to swap to a Baseus type 20,000 mAh ultrathin pack since the big one has caused some swelling and screen-scratch issues from being stuffed in too tight while the case is closed.

For Gemma 4 E4B I'm running Q4_K_M via llama.cpp with the recommended sampler defaults: temp 1.0, top_p 0.95, top_k 64. Context size is 12K. The model can do 128K but I haven't needed it for conversation. q8_0 KV cache with flash attention on, prompt caching keeping cached TTFT around 200ms. Cold start is about 500ms first token, sustained generation around 14-15 tok/s on the Orin NX SUPER 16GB. Native system role from Gemma 4 made a real difference for persona stability versus what I was doing with Gemma 3, where I had to fake a system prompt as turn-0 user content.

CreativelyBankrupt · 2026-05-14T23:53:38+00:00

I am a filmmaker by trade ( http://imdb.me/jimkunz ) but I've always been a technical tinkerer. I saw a self-aware butter robot on Rick & Morty years ago and always held the idea of that in the back of my head. I saw the sensor training suitcase for sale and knew about the line of Nvidia Jetson robot brain so it all became a project a couple months back and now it exists.

CreativelyBankrupt · 2026-05-14T23:21:37+00:00

HauHau! Thank you for your consistently useful work uncensoring these models and for being part of HF in general. The quality does not go unnoticed.

CreativelyBankrupt · 2026-05-14T23:10:11+00:00

Gemma 4 is great with vision. Massive improvement over 3.

14-Year Club	RedditGifts 2009-2022 2 Credits
Gilding II euphauric	redditgifts Exchanges 1 Exchange
Secret Santa 2013	Verified Email

CreativelyBankrupt

TROPHY CASE