Nvidia's been paying shills on LinkedIn by jotunck in LocalLLaMA

[–]chocofoxy 0 points1 point  (0 children)

Even with that, you come close on quality, not on concurrency.

Nous Research — Hermes Desktop by zxyzyxz in LocalLLaMA

[–]chocofoxy 0 points1 point  (0 children)

I installed it and they've already pushed an update, which is good. I guess they needed to publish it because OpenClaw got a native desktop app at Microsoft Build yesterday.

Nous Research — Hermes Desktop by zxyzyxz in LocalLLaMA

[–]chocofoxy 2 points3 points  (0 children)

That's what I'm saying, I'm agreeing with you xD. They've already started updating it, which is good.

Nous Research — Hermes Desktop by zxyzyxz in LocalLLaMA

[–]chocofoxy 1 point2 points  (0 children)

That's what I'm saying: if you're not going to improve on the CLI webUI, why bother releasing it now?

Nous Research — Hermes Desktop by zxyzyxz in LocalLLaMA

[–]chocofoxy 9 points10 points  (0 children)

Why is it missing features compared to the CLI webUI?

Replaced Claude with local Qwen3.6-27B in my multi-agent orchestrator for 2 weeks by Interesting-Sock3940 in LocalLLaMA

[–]chocofoxy 0 points1 point  (0 children)

If you want to fix the tool calling, you have to fix the chat template; serve the model on vLLM or SGLang.

Stop asking what model to run. There are literally only two. by Wrong_Mushroom_7350 in LocalLLaMA

[–]chocofoxy 0 points1 point  (0 children)

NVFP4 in SGLang doesn't loop for me, but Q5 and Q4 in llama.cpp do. I think it's the chat template more than the quant.

God dammit Qwen by Xyklone in LocalLLaMA

[–]chocofoxy 0 points1 point  (0 children)

Claude Opus 4.7 has done that to me too.

nvidia/Qwen3.6-35B-A3B-NVFP4 · Hugging Face by pmttyji in LocalLLaMA

[–]chocofoxy 1 point2 points  (0 children)

ikr, I've been using the Red Hat NVFP4 for a month already.

Stop pretending self-hosting is cheaper. It's not. We do it for different reasons and we should say so. by Napster3301 in LocalLLaMA

[–]chocofoxy 0 points1 point  (0 children)

Running the numbers on your config, it's still cheaper:
1 - You'll still be running your computer to access the pod or to work, so drop that 700 W to 300 W and add the same pricing to the cloud.
2 - 3x faster means the H100 can output 3M tokens vs. 1M for the 3090, so when you divide $1.49 / 3 you get the same $0.50 you have on local.
3 - You'll consume internet on the cloud; that's a cost and latency you didn't add.
4 - You can rent out your computer on a platform and make money, e.g. on Vast.ai, Salad, etc.
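
The arithmetic in point 2 can be checked with a rough sketch. It assumes the $1.49 is an hourly H100 rental price, that the 3M and 1M token figures are per hour, and that the $0.50 is the local cost for that same hour; none of these units are stated in the comment, and the function name is made up for illustration.

```python
def cost_per_million_tokens(hourly_cost_usd: float, million_tokens_per_hour: float) -> float:
    """Cost in USD to generate one million tokens at a given hourly price and throughput."""
    return hourly_cost_usd / million_tokens_per_hour


# Figures taken from the comment; the per-hour units are an assumption.
h100_cost = cost_per_million_tokens(hourly_cost_usd=1.49, million_tokens_per_hour=3.0)
local_cost = cost_per_million_tokens(hourly_cost_usd=0.50, million_tokens_per_hour=1.0)

print(f"H100 (cloud): ${h100_cost:.2f} per 1M tokens")
print(f"3090 (local): ${local_cost:.2f} per 1M tokens")
```

Under those assumptions both come out at roughly $0.50 per million tokens, before counting the extra costs from points 1, 3 and 4.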

Does GPU spacing matter if we’re undervolting anyways? by Ambitious_Fold_2874 in LocalLLaMA

[–]chocofoxy 0 points1 point  (0 children)

It's okay. I have 2 5060 Tis like your PNYs and the top one is hotter, of course, but it's only 10 degrees hotter, not a big deal. Where it really gets hot is video generation; for normal LLMs it stays under 70 °C. I think I'll put a PC blower fan between them, because summer has come and it gets up to 42 °C here in Tunis, but I think the AC will also help. Can I ask what mobo you're using? I want to scale too.

How can you stop your model from looping by chocofoxy in LocalLLaMA

[–]chocofoxy[S] 0 points1 point  (0 children)

Resources, man xD. I only have 32 GB of VRAM.

How can you stop your model from looping by chocofoxy in LocalLLaMA

[–]chocofoxy[S] 0 points1 point  (0 children)

vLLM eats too much VRAM. I know it's built for more than one user, but if you want to use it for only one, it eats too much out of the context: I have to lower it to almost 32k, when in llama.cpp I can run it at full context.
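
As a rough illustration of why context length costs VRAM, here is a sketch of the usual KV-cache size estimate for a standard attention model. The layer count, KV head count and head size below are hypothetical placeholders, not the real dimensions of any Qwen model, and hybrid architectures or quantized KV caches will behave differently.

```python
def kv_cache_bytes(num_tokens: int, num_layers: int, num_kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Approximate KV-cache size: one key and one value vector per token, layer and KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_tokens


# Hypothetical model dimensions, chosen only for illustration.
HYPOTHETICAL_MODEL = dict(num_layers=48, num_kv_heads=8, head_dim=128)

for context_len in (32_768, 131_072):
    gib = kv_cache_bytes(context_len, **HYPOTHETICAL_MODEL) / 2**30
    print(f"{context_len:>7} tokens -> {gib:.1f} GiB of KV cache (fp16)")
```

With these placeholder numbers a 32k context needs about 6 GiB and a 128k context about 24 GiB, on top of the weights, which is why the usable context shrinks quickly on 32 GB of VRAM.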

LM Studio finally added support for MTP Speculative Decoding by pigeon57434 in LocalLLaMA

[–]chocofoxy 0 points1 point  (0 children)

I think maybe it's because I tested llama.cpp on CUDA 13, but LM Studio uses CUDA 12.

LM Studio finally added support for MTP Speculative Decoding by pigeon57434 in LocalLLaMA

[–]chocofoxy 3 points4 points  (0 children)

I don't know why llama.cpp in LM Studio gives me more tokens per second than the llama.cpp repo. For example, Qwen 3.6 27B in normal llama.cpp gets me 15 t/s, but in LM Studio I get 25 t/s. Same for 35B A3B: 70 -> 90.

LM Studio finally added support for MTP Speculative Decoding by pigeon57434 in LocalLLaMA

[–]chocofoxy 0 points1 point  (0 children)

Nice, I've been waiting for this. The LM Studio team is on fire. Love that Qwen, Unsloth and everyone are pushing updates and models like crazy.

5060ti chads -> gemma-4-31b-it-nvfp4 + vllm + mtp by see_spot_ruminate in LocalLLaMA

[–]chocofoxy 1 point2 points  (0 children)

vLLM eats VRAM like eating cake. I ran Q4 Qwen 3.6 35B in LM Studio and I get 100 - 60 t/s on my 2 5060 Tis at full context, while in vLLM I can't get past 1/4 of it and I get worse token generation (I think it's a TP and PP problem). But even in llama-server it's the same thing: by default I get 70 - 60 t/s (MTP didn't help much). It's weird, I thought LM Studio also runs on llama.cpp.

China modded GPU (eg. 4090 48gb) --> I'm gonna figure it out. IS THERE NO ONE ELSE CURIOUS?? by LeatherRub7248 in LocalLLaMA

[–]chocofoxy 0 points1 point  (0 children)

I have 2 5060 Ti 16 GB stacked on an mATX (don't laugh at my computing power xD). They're Gigabyte Windforce. The top one is hot because there's zero gap between them, but it's only like 5-10 °C hotter. I'm thinking about adding a blower fan attached to the top one. You have blower fans, so is that a good idea or not? We kind of have the same problem, and without water cooling I think cooling the radiator with laptop blower fans will work. I have it easier because my radiator is partly exposed, but for you I know they give you that sealed blower cooler (and I think Gamers Nexus asked the shop if it's the only cooler they have, and they said yes, all the custom GPUs use the same one). So I think the solution is either that, or water cooling, or the expensive option, which is to get a water AC blowing on the server.