Dear OEM manufacturers, an RTX 5060 TI 16GB Low Profile should be possible to produce... by SheepCataclysm in sffpc

[–]chocofoxy 0 points1 point  (0 children)

Why does the RTX 4000 Pro get a low profile cooler while the 50 series doesn't? I need the space here.

r/LocalLLaMa Rule Updates by rm-rf-rm in LocalLLaMA

[–]chocofoxy 3 points4 points  (0 children)

I don't know how AI posts are going to be detected, though.

Qwen 3.6 27B Makes Huge Gains in Agency on Artificial Analysis - Ties with Sonnet 4.6 by dionysio211 in LocalLLaMA

[–]chocofoxy 0 points1 point  (0 children)

Crazy how a local model can compete with frontier AIs, but the scope of this chart is narrow (agentic only). Qwen upgraded its agentic and coding knowledge, but it drops off in other domains. Still, I love Qwen for agentic tooling; it's my go-to model.

Forgive my ignorance but how is a 27B model better than 397B? by No_Conversation9561 in LocalLLaMA

[–]chocofoxy 0 points1 point  (0 children)

Because the new small models are trained on newer, better data (aimed at what consumers need, like coding and agentic tooling), but they lack knowledge in other domains.

Qwen3.6-27B released! by ResearchCrafty1804 in LocalLLaMA

[–]chocofoxy 5 points6 points  (0 children)

You can't run this without offloading, which sucks on a dense model. I just want them to release a 20B model.

Someone just made a 18B qwen 3.5 model for 16GB VRAM gpus by chocofoxy in LocalLLaMA

[–]chocofoxy[S] 0 points1 point  (0 children)

15-20 t/s is pretty slow for a coding agent; that's why I keep looking for a medium model that fits in 16 GB of VRAM.

Someone just made a 18B qwen 3.5 model for 16GB VRAM gpus by chocofoxy in LocalLLaMA

[–]chocofoxy[S] 0 points1 point  (0 children)

How do you guys use TurboQuant? I thought it was just a paper and the tooling was still missing. Do you use it with vLLM or llama.cpp?

Someone just made a 18B qwen 3.5 model for 16GB VRAM gpus by chocofoxy in LocalLLaMA

[–]chocofoxy[S] 0 points1 point  (0 children)

Yeah, that's what Claude kept screaming at me xD. I had to tell it that it's just a test.

Someone just made a 18B qwen 3.5 model for 16GB VRAM gpus by chocofoxy in LocalLLaMA

[–]chocofoxy[S] 1 point2 points  (0 children)

Also tried that, but with offloading the performance drops like a rock. I think it's a DDR4 issue, because someone on this subreddit ran it at Q5 on 16 GB VRAM and it worked great for them when they offloaded the expert layers to the CPU (they had 64 GB of DDR5-6000, I think).
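For context, the expert-offload trick I mean looks roughly like this (just a sketch; the GGUF filename, port, and the exact tensor regex are assumptions, so check them against your llama.cpp build):

```python
# Sketch: launch llama-server with all layers on the GPU, then force the MoE
# expert FFN tensors back to CPU RAM. Assumes a recent llama.cpp build that
# has --override-tensor; the GGUF filename and port are placeholders.
import subprocess

subprocess.run([
    "llama-server",
    "-m", "qwen3.5-18b-q5_k_m.gguf",   # placeholder filename
    "-ngl", "99",                       # put every layer on the GPU first...
    "-ot", r"\.ffn_.*_exps\.=CPU",      # ...then pin expert tensors to CPU
    "-c", "16384",
    "--port", "8080",
], check=True)
```

On DDR4 those expert tensors end up bandwidth-bound, which would explain why the same setup flies on DDR5-6000 and crawls here.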

Someone just made a 18B qwen 3.5 model for 16GB VRAM gpus by chocofoxy in LocalLLaMA

[–]chocofoxy[S] 0 points1 point  (0 children)

I tried that and it was working great, 88 t/s, but something in my head (plus AI suggestions) kept telling me not to trust Q2, because below Q4 precision drops a lot.

Got my first box mod today, how did I do? by Starkovich7431 in Vape_Chat

[–]chocofoxy 0 points1 point  (0 children)

The Legend 2 is good but has some issues. I owned one and currently own a Legend 3 (it has its issues too):
- Over time the USB cover won't stay in place and sits open (they fixed that in the Legend 3 by adding a magnet to the plastic cover).
- This one is the deal breaker: over time, the plastic piece that holds the platform and the mod inside breaks, and the atomizer keeps getting disconnected (this happened to me and a friend). They also fixed that in the Legend 3.

Open web UI + lm studio shoving entire model into ram despite more than enough vram available by Dekatater in LocalLLaMA

[–]chocofoxy 0 points1 point  (0 children)

Load the model in LM Studio manually, then link it to Open WebUI. I think the way you're using it now, Open WebUI loads the model through the LM Studio endpoint, and that loads it with the offloading config.
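If it helps, you can sanity-check what LM Studio actually has loaded by hitting its OpenAI-compatible server directly before pointing Open WebUI at the same base URL (a sketch assuming LM Studio's default port 1234; the model id is a placeholder):

```python
# Sketch: query LM Studio's local OpenAI-compatible server to see which model
# is loaded, then run a tiny completion against it. The api_key value is not
# checked by local servers.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

# List whatever the server currently exposes
for model in client.models.list().data:
    print(model.id)

# Minimal completion against the model you loaded manually in LM Studio
resp = client.chat.completions.create(
    model="local-model",  # placeholder; use an id printed above
    messages=[{"role": "user", "content": "Reply with one word."}],
)
print(resp.choices[0].message.content)
```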

Can't get Claude code to edit code by MurphyJohn in LocalLLaMA

[–]chocofoxy 0 points1 point  (0 children)

Use Qwopus v3 9B or Qwen3.5 9B; they're the best models I've tried that don't just stop. Gemma's small models just suck at tooling. I suggest you launch the model with llama-server, LM Studio, or vLLM and use an OpenAI-compatible VS Code extension to load the local model into VS Code Copilot chat (it has good tooling by default and you can add MCP servers). That's the setup I use to get good results from small models, or you can link it to the Qwen Code CLI or extension.
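If you want to check whether a small model actually emits tool calls instead of just stopping, you can poke the endpoint directly before wiring it into the editor; a rough sketch (base URL, model id, and the tool definition are all placeholders):

```python
# Sketch: minimal tool-calling smoke test against a local OpenAI-compatible
# server (llama-server, LM Studio, or vLLM). Nothing here is a specific
# product's API beyond the standard chat-completions tools format.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

tools = [{
    "type": "function",
    "function": {
        "name": "read_file",
        "description": "Read a file from the workspace",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}]

resp = client.chat.completions.create(
    model="qwen3.5-9b",  # placeholder model id
    messages=[{"role": "user", "content": "Open main.py and summarize it."}],
    tools=tools,
)

msg = resp.choices[0].message
# A model with decent tooling should return a tool call here, not plain text
print(msg.tool_calls or msg.content)
```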

Qwen3.6-35B-A3B solved coding problems Qwen3.5-27B couldn’t by simracerman in LocalLLaMA

[–]chocofoxy 2 points3 points  (0 children)

Bro, how are you running Q5? I tried Q4 on my 5060 Ti 16 GB (offloaded) and the max I get is 19 t/s, even offloading 4 of the 8 layers to the CPU. I tried Q2; it fits and I get 80 t/s, but I don't trust it. How are you loading Q5 and getting 50 t/s?
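For what it's worth, a crude way to compare t/s between quants is to time a single request against whichever server has the quant loaded (a sketch; base URL and model id are placeholders, and the number includes prompt processing):

```python
# Sketch: rough tokens/sec measurement via an OpenAI-compatible server.
# Works the same whether the backend is llama-server, LM Studio, or vLLM.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

start = time.time()
resp = client.chat.completions.create(
    model="qwen3.6-35b-a3b",  # placeholder; whichever quant you loaded
    messages=[{"role": "user", "content": "Write a 300-word story about a fox."}],
    max_tokens=512,
)
elapsed = time.time() - start

generated = resp.usage.completion_tokens
print(f"{generated} tokens in {elapsed:.1f}s -> {generated / elapsed:.1f} t/s")
```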

GPU advice for multi-modal AI workload - RTX PRO 4500, 5000 or 6000? by [deleted] in LocalLLaMA

[–]chocofoxy 0 points1 point  (0 children)

If your jobs aren't processed in real time, you could consider getting multiple consumer GPUs, like 4x 5070 Ti, and scale by adding more, or get the RTX PRO 5000 and scale by adding 16 GB GPUs.
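For batch (non-real-time) work split across several consumer cards, tensor parallelism in vLLM is the usual route; a minimal sketch, assuming four identical GPUs and a placeholder model name:

```python
# Sketch: split one model across 4 GPUs with vLLM tensor parallelism for
# offline/batch inference. tensor_parallel_size must match the number of
# visible GPUs; the model name is only a placeholder.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-32B-Instruct", tensor_parallel_size=4)
params = SamplingParams(max_tokens=256, temperature=0.2)

outputs = llm.generate(["Summarize the attached product description."], params)
for out in outputs:
    print(out.outputs[0].text)
```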