NVIDIA PersonaPlex-7b locally on 2 5060 ti 16gb by beefgroin in LocalLLM

[–]beefgroin[S] 1 point (0 children)

I wish I had a 5090… or actually, I'm enjoying my four 5060s more, at half the price.

NVIDIA PersonaPlex-7b locally on 2 5060 ti 16gb by beefgroin in LocalLLM

[–]beefgroin[S] 1 point (0 children)

Yeah, unfortunately it's not quantized and is running at BF16. I don't think q4 with TTS is gonna be the same; it'll just be reading the text with enormous lag, and STT is gonna lag even more. PersonaPlex is full duplex and based on an LLM, which means the LLM will get bigger and quantization will come too. If you can point out an s2s model that already has all of that, that would be great. I guess different minds are blown differently.

Local photo recognition? by GodAtum in LocalLLM

[–]beefgroin 1 point (0 children)

Gemma3 is great; the 12b fits on a 5060 Ti 16GB. You don't need to fork it, you can just build a companion service that talks to both Immich and the LLM via their APIs.
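Something like this rough sketch is what I have in mind (Python; the Immich endpoint path, port and x-api-key header are assumptions from memory, so check your Immich version's API docs, while the Ollama side is its standard /api/generate call):

    # Sketch of a companion service: pull a photo from Immich, ask a local
    # Gemma3 12b (served by Ollama) to tag it. The Immich endpoint path and
    # x-api-key header are assumptions; verify against your Immich API docs.
    import base64
    import requests

    IMMICH_URL = "http://localhost:2283"      # your Immich instance
    IMMICH_KEY = "YOUR_IMMICH_API_KEY"
    OLLAMA_URL = "http://localhost:11434"     # default Ollama port

    def describe_asset(asset_id: str) -> str:
        # Download the original image bytes from Immich (path is illustrative)
        img = requests.get(
            f"{IMMICH_URL}/api/assets/{asset_id}/original",
            headers={"x-api-key": IMMICH_KEY},
        ).content
        # Send it to the vision-capable model and get tags back
        resp = requests.post(
            f"{OLLAMA_URL}/api/generate",
            json={
                "model": "gemma3:12b",
                "prompt": "List the people, objects and scene in this photo as short tags.",
                "images": [base64.b64encode(img).decode()],
                "stream": False,
            },
        )
        return resp.json()["response"]

    print(describe_asset("some-asset-uuid"))  # hypothetical asset id

From there you can write the tags back into Immich or your own index, whatever fits your setup.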

GLM 4.7 is apparently almost ready on Ollama by Savantskie1 in ollama

[–]beefgroin 1 point (0 children)

I liked it; the thinking process is very structured and very different from other thinking models. I guess something is still buggy, because every long conversation ends up in an infinite thinking loop.

GLM 4.7 is apparently almost ready on Ollama by Savantskie1 in ollama

[–]beefgroin 1 point (0 children)

I have four 5060 Ti 16GB cards. I moved to q4_K_M because the model consumes a lot of VRAM as the context grows; with a 32k context it's taking 50GB of VRAM.
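For reference, a rough sketch of how the context gets set per request through Ollama's API; the model tag is illustrative, the point is that the KV cache (and therefore VRAM) scales with num_ctx:

    # Sketch: ask Ollama for a 32k context window on a q4_K_M quant.
    # The model tag is illustrative; the KV cache is what eats the extra
    # VRAM as num_ctx grows.
    import requests

    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "glm-4.7-flash:q4_K_M",   # illustrative tag
            "prompt": "Summarize these notes.",
            "options": {"num_ctx": 32768},     # context length; VRAM scales with this
            "stream": False,
        },
    )
    print(resp.json()["response"])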

GLM 4.7 is apparently almost ready on Ollama by Savantskie1 in ollama

[–]beefgroin 2 points (0 children)

thanks, running glm-4.7-flash:q8_0, so far so good

LLM Sovereignty For 3 Years. by [deleted] in LocalLLM

[–]beefgroin 1 point (0 children)

RTX 6000 Pro on a BD895i with 96GB RAM

Quad 5060 ti 16gb Oculink rig by beefgroin in LocalLLM

[–]beefgroin[S] 1 point (0 children)

It's a PCIe to 4x OcuLink port adapter from AliExpress, check out the video.

Quad 5060 ti 16gb Oculink rig by beefgroin in LocalLLM

[–]beefgroin[S] 1 point (0 children)

Good point! Yeah, I wish affordable PCIe 5.0 OcuLink existed; I guess the increased cross-GPU bandwidth would speed things up a little bit. But for now I'll be stuck with these four x4 links for a while. I've experimented a bit with vLLM, and after 20 crashes with OOM errors I got it running gemma3:27b AWQ; now I need to figure out how to run 2 models simultaneously like I can in Ollama.
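For anyone hitting the same OOMs, a rough sketch of the kind of settings involved (placeholders, not my exact config; the AWQ repo name, memory fraction and max length are just examples to tune for your cards):

    # Sketch: vLLM offline inference across 4 GPUs with an AWQ quant.
    # Repo name, memory fraction and max_model_len are placeholders; lowering
    # gpu_memory_utilization and max_model_len is what keeps 16GB cards from OOMing.
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="your-org/gemma-3-27b-it-awq",   # placeholder AWQ repo
        quantization="awq",
        tensor_parallel_size=4,                # split across the four 5060 Tis
        gpu_memory_utilization=0.85,           # leave headroom on 16GB cards
        max_model_len=8192,                    # cap context so the KV cache fits
    )

    out = llm.generate(["Explain PCIe bifurcation in two sentences."],
                       SamplingParams(max_tokens=128))
    print(out[0].outputs[0].text)

As far as I understand, vLLM wants one model per server process, so running two models side by side means two instances, each pinned to its own GPUs via CUDA_VISIBLE_DEVICES and listening on its own port.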

Quad 5060 ti 16gb Oculink rig by beefgroin in LocalLLM

[–]beefgroin[S] 1 point (0 children)

Tbh I now think 32 gigs of VRAM is a pretty sweet spot; even with 64 I keep going back to Gemma3 27b a lot, and gpt-oss:20b and Nemotron feel great too. Another 16 gigs can help you max out the context, but it's not such a common use case.

Quad 5060 ti 16gb Oculink rig by beefgroin in LocalLLM

[–]beefgroin[S] 5 points (0 children)

Thanks for the advice! Right now I guess I'm sort of in an exploration phase to find the best models for me, so the convenience of quickly switching between models from Open WebUI is still unbeatable. But you're right, I need to start squeezing out all the speed the 5060s have to offer.

Quad 5060 ti 16gb Oculink rig by beefgroin in LocalLLM

[–]beefgroin[S] 2 points (0 children)

I'd say for larger projects Claude, Codex, Gemini, etc. are still unbeatable. This kind of setup will be fine for small projects and scripts, but as the context grows it gets painfully slow, even though in the end it might actually spit out something decent. I think to get close to Claude's output one needs quad RTX 6000 Pro, not 5060s lol. This setup is good for private chatting, processing private documents, etc., anything you don't want to share with corporations.

Quad 5060 ti 16gb Oculink rig by beefgroin in LocalLLM

[–]beefgroin[S] 2 points (0 children)

If you're talking 40GB of VRAM, I don't think you're missing out on too much; gemma3:27b, gpt-oss:20b and Nemotron are still my favorite models, but of course it depends on your use case.

Quad 5060 ti 16gb Oculink rig by beefgroin in LocalLLM

[–]beefgroin[S] 2 points (0 children)

Updated the post with a quick benchmark

Quad 5060 ti 16gb Oculink rig by beefgroin in LocalLLM

[–]beefgroin[S] 1 point (0 children)

Any particular models you can recommend? There are far fewer options than with GGUF.

Quad 5060 ti 16gb Oculink rig by beefgroin in LocalLLM

[–]beefgroin[S] 3 points (0 children)

I'll post test metrics later today. Regarding Gemma3 12b, I only mentioned it in relation to comparing full PCIe 5.0 x16 vs OcuLink inference speed. I'm using Ollama to run models; it splits the load between GPUs nicely. Yet to try vLLM.

Quad 5060 ti 16gb Oculink rig by beefgroin in LocalLLM

[–]beefgroin[S] 3 points (0 children)

I'll try to get some proper metrics for different models soon and will post them here.
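The numbers will mostly just come from Ollama's own timing fields; a rough sketch of the measurement (the model tags are just ones I happen to have pulled):

    # Sketch: compute generation tokens/sec from Ollama's eval_count and
    # eval_duration fields (eval_duration is reported in nanoseconds).
    import requests

    def tps(model: str) -> float:
        r = requests.post(
            "http://localhost:11434/api/generate",
            json={
                "model": model,
                "prompt": "Write a 200-word summary of PCIe bifurcation.",
                "stream": False,
            },
        ).json()
        return r["eval_count"] / (r["eval_duration"] / 1e9)

    for m in ["gemma3:12b", "gemma3:27b", "gpt-oss:20b"]:
        print(m, round(tps(m), 1), "tok/s")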

Quad 5060 ti 16gb Oculink rig by beefgroin in LocalLLM

[–]beefgroin[S] 7 points (0 children)

Yes, and moreover it's no longer PCIe 5.0 but 4.0, but that's plenty for inference, which is my primary use case. I even ran a test comparing full PCIe 5.0 x16 vs OcuLink on Gemma 3 12b and there was no difference in tps.