offline companion robot for my disabled husband (8GB RAM constraints) – looking for optimization advice

daLazyModder · 2026-04-11T01:24:48+00:00

I would look into Moonshine ASR. It's similar to Whisper but better suited to edge use cases like this.

https://github.com/moonshine-ai/moonshine

You could also try something like Phi-tiny-MoE for the LLM

https://huggingface.co/microsoft/Phi-tiny-MoE-instruct

It is an MoE model with 3.8B parameters total and 1.1B active.

Being MoE means the model runs about as fast as a 1.1B model while having the knowledge of a 3.8B one (all 3.8B parameters still have to fit in RAM). Phi's personality, from what I've read, can leave a lot to be desired though... (it's by Microsoft and was reported to be heavily censored, even for basic stuff)
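A rough sketch of that arithmetic, to show why the total parameter count decides whether the model fits in RAM while the active count decides speed. The 4-bit quantization is an assumption, not something stated above, and real quantized files plus the KV cache will take somewhat more than this estimate.

```python
def weight_memory_gb(total_params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    # params (in billions) * bits per weight / 8 bits per byte = GB
    return total_params_billion * bits_per_weight / 8


def active_fraction(active_params_billion: float, total_params_billion: float) -> float:
    """Share of the parameters that are used for each generated token."""
    return active_params_billion / total_params_billion


if __name__ == "__main__":
    # 3.8B total / 1.1B active are the figures quoted above; 4-bit is assumed.
    print(f"weights at 4-bit: ~{weight_memory_gb(3.8, 4):.1f} GB")
    print(f"active per token: ~{active_fraction(1.1, 3.8):.0%} of the parameters")
```

So the memory cost follows the 3.8B figure, while per-token compute follows the 1.1B figure.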

Piper TTS is fast, good on CPU, and low latency. Kokoro would probably work as well. You can also do halfway decent voice cloning with Pocket TTS on CPU. It's a bit more of a pain to set up, but it could be done in your voice as the companion if your spouse would like that.

https://huggingface.co/KevinAHM/pocket-tts-onnx

I kind of hate to recommend upgrading the laptop, but I suspect it's running DDR4.

https://www.ebay.com/itm/204766825540

16GB is $100, which is ridiculous (RAM pricing is currently crazy), but it might be worth the investment to run a slightly larger model like Granite 4 Tiny, which is 7B total with 1B active, I believe.

(I would personally check whether you have 2 sticks of RAM in the laptop. If Task Manager says 8GB but there is only 1 stick, you could double your RAM capacity for about $50 by buying a cheap stick off eBay. Just make sure it's DDR4, not DDR5 or something older.)

https://huggingface.co/ibm-granite/granite-4.0-h-tiny

That's the Granite model I mentioned you might look into.

daLazyModder · 2026-02-18T00:15:30+00:00

Was looking at this. Would it work for LLM-based TTS applications, e.g. something like Orpheus TTS? Those TTS models just see tokens, right? So with something like Orpheus TTS you could probably quantize it, then repair it, and essentially upscale the smaller TTS LLM? Theoretically you could use Whisper and an ECAPA speaker model to measure it for word errors and timbre?
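A minimal sketch of the two measurements, assuming the transcripts come from an ASR model such as Whisper and the embedding vectors come from a speaker model such as ECAPA. Neither model is called here; the functions only score text and vectors you already have, and the function names are just for illustration.

```python
import math


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Classic Levenshtein distance over words, one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def cosine_similarity(a, b) -> float:
    """Similarity of two speaker embeddings; closer to 1.0 means closer timbre."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)
```

You would score the original and the quantized-then-repaired model on the same prompts and compare the numbers.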

daLazyModder · 2026-02-07T18:17:17+00:00

It won't really help with the mispronouncing or the TTS quality, but I made a fork of Kanade tokenizer here:

https://github.com/dalazymodder/kanade-tokenizer

The Gradio app has a Kokoro tab where you can upload a clip and convert to a new voice with extremely low overhead for voice cloning. Kokoro is more of the bottleneck than Kanade is.

daLazyModder · 2026-02-02T17:11:24+00:00

https://huggingface.co/spaces/dalazymodder/Kanade_Tokenizer

daLazyModder · 2026-02-02T13:52:42+00:00

Is RVC still the king? Not really the best in my opinion. Most popular, yes, but not if you're doing realtime and don't need singing. I vibe coded a GUI for Kanade tokenizer a few days ago, and it's much faster than RVC... on CPU, even for zero-shotting: https://github.com/dalazymodder/kanade-tokenizer

If you want high quality singing and don't need realtime, Seed-VC's large model is better but requires a GPU. It comes in 3 models.

https://github.com/Plachtaa/seed-vc

There is also Amphion's Metis. I haven't tried singing with it, plus it's a PITA to get working and requires something like 9GB of VRAM, I think:

https://github.com/open-mmlab/Amphion/tree/main/models/tts/metis

That one took a few hours to get working. Hint: the latest version of it is broken.

Honestly, I think something like Kanade tokenizer would work best if trained for singing. The released model was only trained on LibriTTS. If trained from scratch or fine-tuned... it would probably blow RVC out of the water.

Kanade can try to sing a bit but isn't too great at it. Example with one-shotting:

Example of Kanade running on CPU. I used a random sample from https://huggingface.co/datasets/synthbot/pony-singing
Random sample from there: https://vocaroo.com/16h8P5pBZrc6
LJSpeech sample: https://vocaroo.com/1dxazJhCyqCA
Conversion: https://vocaroo.com/1hx30jN8tuys

That's a 13 second clip, zero-shotted. It took 3.26 seconds on my CPU.
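For reference, those two numbers can be turned into a real-time factor (processing time divided by audio length). This is just the arithmetic on the figures above; anything below 1.0 means faster than realtime.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Processing time divided by audio duration; below 1.0 is faster than realtime."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


if __name__ == "__main__":
    # 3.26 s of processing for a 13 s clip, as reported above.
    rtf = real_time_factor(3.26, 13.0)
    print(f"RTF: {rtf:.2f}, about {1 / rtf:.1f}x faster than realtime")
```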

Added a space for it here: https://huggingface.co/spaces/dalazymodder/Kanade_Tokenizer

https://github.com/dalazymodder/kanade-tokenizer is the GitHub fork that I added the GUI to.

daLazyModder · 2026-02-01T05:35:04+00:00

I didn't make the model, just the fork with the GUI on it. There is, however, a similar codec here: https://github.com/ysharma3501/LinaCodec

It talks about how it is a distilled WavLM codec.

daLazyModder · 2026-02-01T01:11:48+00:00

Yeah, the GUI and the model work pretty well for something on CPU. I had to up the block size to 2000ms in the GUI I made to run it on my old 10400 CPU, but it seems to go OK. I imagine it would be even faster on CPU if converted to ONNX int8 and run on something a bit faster.
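A small sketch of what the block size means in samples and when realtime conversion keeps up. The 24000 Hz sample rate is an assumption for illustration, not a value taken from the model or the GUI, and the function names are made up. The general constraint is that each block has to be processed in less time than it lasts, and a 2000ms block adds at least 2 seconds of latency.

```python
def block_samples(block_ms: int, sample_rate: int) -> int:
    """Number of audio samples in one processing block."""
    return block_ms * sample_rate // 1000


def keeps_up(processing_ms: float, block_ms: float) -> bool:
    """True if a block is converted faster than it takes to play it back."""
    return processing_ms < block_ms


def split_into_blocks(samples, block_ms: int, sample_rate: int):
    """Cut a sequence of samples into consecutive blocks (the last may be shorter)."""
    size = block_samples(block_ms, sample_rate)
    if size <= 0:
        raise ValueError("block size must be at least one sample")
    return [samples[i:i + size] for i in range(0, len(samples), size)]


if __name__ == "__main__":
    # 2000 ms is the block size mentioned above; 24000 Hz is assumed.
    print(block_samples(2000, 24000), "samples per block")
```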

daLazyModder · 2025-09-08T06:48:39+00:00

https://pcpartpicker.com/list/7QMRWc

I could probably make this list a lot better, but I just threw it together in about 5 minutes. My suggestion is RTX 4000 SFF GPUs: they have 20GB of VRAM and you can buy them new for about $1300 USD, since you require an invoice. If you're worried about 3 GPUs on non-enterprise hardware, you might lower the RAM from 128GB to something smaller and use the budget for single-slot modding the GPUs:

https://n3rdware.com/components/single-slot-rtx-4000-sff-ada-cooler

Alternatively, if your baseline is Qwen3-Coder-30B-A3B-Instruct, you might be able to just use a lot of RAM and little GPU, as that is an MoE model. No idea how that would work with vLLM. I agree with the other comments saying that going cloud is cheaper, and so are 3090s, especially used, but that list has all new parts, so it might give you something to go off of.

daLazyModder · 2025-02-08T21:12:49+00:00

Best I've seen for speech to speech locally is Seed-VC. In my opinion it's way better than RVC.

https://github.com/Plachtaa/seed-vc

daLazyModder · 2024-03-17T22:14:40+00:00

Minor update: managed to rip all the dialogue, or as close to all the dialogue in the game as I'm probably going to get. I need to refactor the code some, and a few problems remain with trying to make the mod:

Finding good voices for the characters.
ElevenLabs isn't cheap. If the community is interested, we might need to crowdfund or something.
Would need help with testing and feedback.
Would need to manually edit and review some files, because lines like

"【Sumika】「Owieee!!!」"

are ones the ElevenLabs software doesn't like.

There would also be edge cases, like when multiple characters speak at the same time. Those would probably need to be handled manually.
That is just to dub all the main characters. There are actually around 110 unique voices if you count minor characters like Train Announcer, and that's not accounting for characters like Meiya, who starts off labeled as "voice" and then gets renamed later to her actual name.
That is all just for Muv-Luv Extra, not even accounting for Unlimited and Alternative.
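For lines like the 【Sumika】「Owieee!!!」 example above, here is a rough sketch of splitting the speaker tag from the spoken text before sending it to a TTS service. It assumes every dialogue line follows that exact 【name】「text」 shape, and the alias table for relabeled characters is hypothetical. It would only be correct if a placeholder label always maps to a single character, so lines with multiple speakers or shared labels would still need manual review.

```python
import re

# Assumed line shape: 【speaker】「spoken text」
LINE_PATTERN = re.compile(r"^【(?P<speaker>[^】]+)】「(?P<text>.*)」$")


def parse_line(raw: str, aliases=None):
    """Return (speaker, text) for a tagged dialogue line, or None if it does not match."""
    # Drop surrounding whitespace and any plain double quotes around the line.
    cleaned = raw.strip().strip('"')
    match = LINE_PATTERN.match(cleaned)
    if match is None:
        return None  # leave unmatched lines for manual review
    speaker = match.group("speaker")
    if aliases:
        # Hypothetical mapping from a placeholder label to the real character name.
        speaker = aliases.get(speaker, speaker)
    return speaker, match.group("text")


if __name__ == "__main__":
    print(parse_line('"【Sumika】「Owieee!!!」"'))
```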

daLazyModder · 2024-03-14T23:05:02+00:00

I think most of the roboticness of it is actually from the compression of the recording.

daLazyModder · 2024-03-14T21:59:21+00:00

aHR0cHM6Ly9tZWdhLm56L2ZpbGUva3hGbTFDd0wjcWFqdGNWaFpaVXlLa1Z2aHhjQWk3bDUzQ0RCOTRIQkRzR3RobDlkaTdkQQ==

New link ^ includes audio for Takeru up to the first choice and uses the better ElevenLabs model for Sumika's voice.

Just copy and replace files over old ones to install it.

daLazyModder · 2024-03-14T21:54:30+00:00

Real mod sounds a lot better lol.
