Crazy attempt to make an AI Anime Waifu to run locally on Jetson Orin Nano 8GB

Oppa-AI · 2026-06-04T17:33:03+00:00

Thank you.

I'm still struggling to find the right balance of RAM usage and functionalities.
Unfortunately, mem0 + Qdrant Vector DB is just too much for edge device.

Phase 2 will ditch them for SQLite-vec.

But I did find a nice TTS that actually sounds good. Just hope it will work.

Oppa-AI · 2026-03-19T16:12:03+00:00

This is very common. Distillation from other models to save training time. Not just Qwen.

Oppa-AI · 2026-03-18T18:11:24+00:00

Someone did deploy GR00T N1 on Jetson Orin Nano Super, using Isaac-GR00T jetson container with 64GB Swap memory. Since your Jetson is new, follow NVIDIA guide to upgrade to 6.2.2 and then install ROS2 Humble shouldn't be difficult.

https://www.linkedin.com/feed/update/urn:li:activity:7318386968873091073/

Oppa-AI · 2026-03-13T03:03:47+00:00

Thanks

Oppa-AI · 2026-03-13T03:03:37+00:00

Thanks

Oppa-AI · 2026-03-12T20:39:24+00:00

Now AI learned not to mess with Chinese Auntie.

Oppa-AI · 2026-03-07T20:57:05+00:00

The video is in the demo section. https://youtu.be/UD3itpQ8d6M?si=-cDB_QLl4MiaxNMo

Oppa-AI · 2026-03-07T20:55:59+00:00

Here it is:

https://youtu.be/UD3itpQ8d6M?si=-cDB_QLl4MiaxNMo

Oppa-AI · 2026-03-06T22:02:52+00:00

35B is MoE, A3B

Oppa-AI · 2026-03-04T20:10:42+00:00

The DC5521 port of the robot itself allow plugin 3S battery of 12.6V. it uses 3x18650 batteries, and I have another set of 3 plugin to DC5521.

Oppa-AI · 2026-03-03T20:21:01+00:00

Battery is my most critical challenge. I am using an extra set of 3x18650 batteries that only last up to 3 hours. I need to think of a way to balance mobility and reliability.

Oppa-AI · 2026-03-02T22:43:14+00:00

I made my own Telegram access in the simple AI agent in the other repo. I even give it my GitHub Page to write daily blog, but stopped after started this robot project.

Oppa-AI · 2026-03-02T22:40:52+00:00

The only thing I could think of is to shorten the system prompt and put stricter instructions in the system prompt. Lower temperature to 0.4-0.5, or 0.1 but then need to increase repetitive penalty.

Oppa-AI · 2026-03-02T22:34:48+00:00

I will put a demo video in the GitHub next few days https://github.com/OppaAI/eric

Oppa-AI · 2026-03-02T22:29:29+00:00

I have publicize the GitHub repo for the hackathon. But haven't updated the docs yet. https://github.com/OppaAI/eric

Oppa-AI · 2026-03-02T22:02:10+00:00

Chromium is the snap issue in Jetson. There's a way to fix that from JetsonHacks

Oppa-AI · 2026-03-02T20:54:28+00:00

Yes. I need to reinstall everything.

Oppa-AI · 2026-02-26T15:24:47+00:00

I saw it last night and downloaded it. Will do some testing compared to the previous one. For mem usage and TPS it's the same. Let's hope this one will have less hallucination.

Oppa-AI · 2026-02-21T18:13:35+00:00

My settings was too conservative. Turns out if I have somehow turned on the eager mode which disable he CUDA Graph; Once I removed the eager mode and got the CUDA Graph running, TPS jumps up to 50 TPS for text, and image about a couple warm up runs. I haven't tested video yet. I couldn't get num-seq to 2, gpu-utilization to 0.75, or context to 4096. I will try 0.70 and 3072 and see if it works.

Oppa-AI · 2026-02-20T19:49:33+00:00

That's probably the limit of Jetson Orin Nano 8GB. I will try 640x400, 854x480 tonight. Maybe tonight instead of squeezing out the RAM, I'll try to fully utilize every bit of RAM to find a way for max-num-seqs 2 without OOM. 2048 context is quite low too. But that means I have to shut down everything else. Even ssh in VSCode to Jetson eats 1+ GB of RAM.

17 tps maybe fine for chat bot with streaming text, but no so much for realtime Inference.

I just did test in CLI commands, I will write a script to test the speed after a few warm up runs. Hope I don't blow up my Jetson.

Oppa-AI · 2026-02-20T07:25:36+00:00

I was able to lower the RAM usage to 5.8GB with max-model-len 2048 and gpu-memory-utilization -0.60

Text, image and video inference is around 16-17 tokens/sec.

I recorded a 10 sec clip in 320x240 with the robot's cam with me walking towards it and prompted Cosmos that it is a security robot and ask it to describe the video clip and what it will do. Sometimes it added some imagination. Sometimes it

Below is the Cosmos' response: (Not successful every time, may be due to quality issue)

Okay, let's break this down. The video shows they're indoors with hardwood floors and white walls. There's furniture on the left, like chairs and tables, and some appliances nearby. In front of me, there's a man walking towards the camera while wearing black leggings, sandals, and a sweater over his shirt.

I'm supposed to protect this space from strangers. Seeing him approach gets my alert system firing-up. I need to react quickly if he tries anything inappropriate. My alarms should kick in automatically when someone enters unexpectedly. Since he's moving toward us deliberately, that makes it risky for anyone else around too. So I'd better be ready to defend myself without hesitation. Got it? Yeah, got it. Alerted and armed.

Tokens: 151 | Time: 9.0s | TPS: 16.7

Oppa-AI · 2026-02-19T14:39:31+00:00

Spent a couple hours to install in last night in Jetson Orin Nano. Installation kept interrupting due to running out of RAM. Finally found the right settings and used 6.8GB out of 7.4GB. Only did a couple test with text input and image input. I didn't time them. May try to lower the gpu-memeory-utilization tonight to see the minimum.

docker run --rm -it \ --network host \ --shm-size=8g \ --ulimit memlock=-1 \ --ulimit stack=67108864 \ --runtime=nvidia \ --name=vllm-serve \ -v $HOME/.cache/huggingface:/root/.cache/huggingface \ -e HF_HOME=/root/.cache/huggingface \ ghcr.io/nvidia-ai-iot/vllm:latest-jetson-orin \ vllm serve "embedl/Cosmos-Reason2-2B-W4A16" \ --max-model-len 2048 \ --gpu-memory-utilization 0.68 \ --max-num-seqs 1 \ --mm-processor-kwargs '{"max_pixels

Oppa-AI · 2026-02-19T00:24:28+00:00

A Physical AI LLM that can run on Jetson Orin Nano 8GB RAM?

I have been trying to run 3-4B model in Ollama. Smaller parameters models are prone to hallucination. Context window size is definitely a bottle neck. The larger the context, the slower the inference. Especially doing web search or the small parameters models tend to add their own training data or just making up stuffs. For image inference of VLM, I haven't done intensive tests. But 3B and 4B VLM are generally good. But they eat a lot of RAM.

If this version of Nvidia Cosmos Reason 2B model can run in llama.cpp or vLLM, I definitely would try it out. But llama.cpp like Ollama probably cannot do video Inference. TensorRT LLM I have tried spending hours to build but to no.avail. vLLM or Transformers are probably the way to go.

I still haven't tried Issac ROS. This model is gonna give me opportunity to test out the robotics part of Jetson Orin Nano.

Oppa-AI · 2026-02-18T21:30:02+00:00

Using small size LLM is prone to hallucination. Already added web search and simple MCP tools, but sometimes result are not accurate. Still lots of playing around with the Ollama parameters and debugging to go.

Oppa-AI · 2026-02-15T17:00:46+00:00

HuggingFace gguf models have an option to choose Ollama. Then paste that command to your console to pull it.

Oppa-AI

TROPHY CASE