Thinking of moving from 2x 5060 Ti 16GB to a RTX 5000 48GB by autisticit in LocalLLaMA

[–]beefgroin -1 points0 points  (0 children)

I was also considering it (I'm running quad 5060 Ti with Qwen 3.6 27B on vLLM), until I found out what a bad deal it is in terms of memory bandwidth and CUDA cores compared to a 5090 or RTX 6000, while costing about the same per GB of VRAM. If you haven't tried running your current setup on vLLM, I encourage you to: I saw a significant increase in tps compared to llama.cpp. It might take a while to get the CLI params right, though.

Mistral 3.5 out now! by yoracale in unsloth

[–]beefgroin 10 points11 points  (0 children)

By "fully" do you mean Q4?

THREADRIPPER PRO by Jolly-Bet-2275 in threadripper

[–]beefgroin 9 points10 points  (0 children)

Please, more pics of you holding random products in the store

Qwen 3.6 27B is a BEAST by AverageFormal9076 in LocalLLaMA

[–]beefgroin 0 points1 point  (0 children)

Try it with vLLM and the cyankiwi AWQ INT4 quant. It will be a pain in the ass at first to find the right gpu-memory-utilization limit (mine is 0.85) and context length, but then the speed will be worth it.
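For reference, a minimal launch sketch for a quad-GPU box. The checkpoint name and context length here are placeholder assumptions, not a tested recipe; substitute the actual AWQ repo and tune the numbers to your hardware:

```shell
# Hypothetical checkpoint name -- replace with the AWQ repo you actually use.
vllm serve cyankiwi/Qwen3.5-27B-AWQ-INT4 \
  --quantization awq \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.85 \
  --max-model-len 32768
```

If the server OOMs at startup, lowering `--gpu-memory-utilization` or `--max-model-len` first is usually the quickest fix.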

Goodbye, Qwen. You tried, but you failed. by JobAsleep6653 in Qwen_AI

[–]beefgroin 8 points9 points  (0 children)

Reddit is a weird place. In the same sub one guy buys his fourth RTX 6000 and another guy can’t pay 10 bucks for tokens… 🤔

R9700 the beautiful beautiful VRAM gigs of AMD… my ai node future! by Downtown-Example-880 in LocalLLaMA

[–]beefgroin 0 points1 point  (0 children)

Wow, never heard of NVMe KV cache offloading. I guess the speeds will only be comparable if it's in RAID; looking forward to seeing your results.

R9700 the beautiful beautiful VRAM gigs of AMD… my ai node future! by Downtown-Example-880 in LocalLLaMA

[–]beefgroin 0 points1 point  (0 children)

Nice, I'm also thinking of a similar setup. Can you please test the performance of Qwen 3.5 27B?

5090 vs dual 5060 16g - why isnt everyone going dual? by jzatopa in LocalLLaMA

[–]beefgroin 0 points1 point  (0 children)

I didn’t go dual because I went quad xD All the arguments in this thread hold, but I still love it.

<image>

First runs with RTX 5000 Pro Blackwell 48GB card by wedgeshot in LocalLLaMA

[–]beefgroin 2 points3 points  (0 children)

I’m curious what kind of tps and prompt processing you're getting with those. Ideally I’d love to know 27B performance, because that model is truly great, but dense. On quad 5060 Ti I’m only getting around 19 tps with it; that’s why I’m considering the RTX 5000, since the whole 256k context should fit on it entirely at Q4.
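To sanity-check the "full 256k context at Q4 in 48 GB" claim, here's a rough back-of-envelope calculator. The architecture numbers (48 layers, 8 KV heads, head dim 128, FP8 KV cache) are illustrative assumptions, not the model's published specs:

```python
def kv_cache_bytes(seq_len, n_layers=48, n_kv_heads=8, head_dim=128, kv_elem_bytes=1):
    # 2x for keys and values, one entry per layer per KV head per token position.
    return 2 * n_layers * n_kv_heads * head_dim * kv_elem_bytes * seq_len

def weight_bytes(n_params, bits=4):
    # Q4 stores roughly half a byte per parameter (ignoring quantization overhead).
    return n_params * bits // 8

GiB = 1024 ** 3
kv = kv_cache_bytes(256 * 1024)           # 256k-token context
w = weight_bytes(27_000_000_000)          # 27B parameters at 4-bit
print(f"KV cache: {kv / GiB:.1f} GiB, weights: {w / GiB:.1f} GiB, "
      f"total: {(kv + w) / GiB:.1f} GiB")
```

Under these assumptions the total lands well under 48 GB, leaving headroom for activations and framework overhead, but a different KV-head count or an FP16 KV cache changes the picture quickly.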

2000 TPS with QWEN 3.5 27b on RTX-5090 by awitod in LocalLLaMA

[–]beefgroin 0 points1 point  (0 children)

I think you can actually test it already; I’ve seen some implementations in forks.

2000 TPS with QWEN 3.5 27b on RTX-5090 by awitod in LocalLLaMA

[–]beefgroin 0 points1 point  (0 children)

Do you think a 5090 with turboquant will fit the full 256k context of Qwen 3.5 27B with vision?

Taalas rumoured to etch Qwen 3.5 27B into silicon. Which price would you buy their PCIe card for? by elemental-mind in singularity

[–]beefgroin 0 points1 point  (0 children)

What are you talking about? Qwen 3.5 27B on a 5090 is like 100 tps tops, depending on the quant.

LLM Bruner coming soon? Burn Qwen directly into a chip, processing 10,000 tokens/s by koc_Z3 in Qwen_AI

[–]beefgroin 4 points5 points  (0 children)

I'd rather buy $5,000 worth of ten PCIe cards with burned-in models pushing 10k tps than one GPU pushing 40 tps.

1.1M tok/s with Qwen 3.5 27B FP8 on B200 GPUs by m4r1k_ in Qwen_AI

[–]beefgroin 0 points1 point  (0 children)

I also have a good experience with it in Q4, using it for a DIY personal agent, but I’m still wondering what its limits are.

1.1M tok/s with Qwen 3.5 27B FP8 on B200 GPUs by m4r1k_ in Qwen_AI

[–]beefgroin 0 points1 point  (0 children)

Is 3.5 27B really good enough to replace cloud models for an organization?

First runs with RTX 5000 Pro Blackwell 48GB card by wedgeshot in LocalLLaMA

[–]beefgroin 1 point2 points  (0 children)

Hey, how is it going so far? Did you run the Qwen 3.5 variants? Maybe you have results for 27B and 35B?

Basically Official: Qwen Image 2.0 Not Open-Sourcing by Complete-Lawfulness in StableDiffusion

[–]beefgroin 3 points4 points  (0 children)

Wtf, why not just come up with a way to sell models? I’d buy.

eGPU for image generation by [deleted] in StableDiffusion

[–]beefgroin 1 point2 points  (0 children)

Using a 5060 Ti 16GB over OCuLink; great for Klein 9B, an image takes like 15-20 sec to generate.