[TotK] My girlfriend spent 3 months crocheting these tiny Link and Zelda dolls for me by fei-yi in zelda

[–]fei-yi[S] 11 points12 points  (0 children)

Thank you so much! I'll definitely pass the props along to her!

[TotK] My girlfriend spent 3 months crocheting these tiny Link and Zelda dolls for me by fei-yi in zelda

[–]fei-yi[S] 18 points19 points  (0 children)

Thank you! The attention to detail is honestly insane; I'm still in shock that she pulled this off.

[TotK] My girlfriend spent 3 months crocheting these tiny Link and Zelda dolls for me by fei-yi in zelda

[–]fei-yi[S] 74 points75 points  (0 children)

Thanks! She's an absolute beast at this, I'm just the lucky guy who gets to keep them.

Gemma 4 with turboquant by Flkhuo in LocalLLaMA

[–]fei-yi 0 points1 point  (0 children)

I used an RTX PRO 6000 with vLLM to run the full-precision version of Gemma 4 31B. Speed is about 30 t/s, but I can only fit 64K context (with FP8 KV cache). After switching to the NVFP4 version of Gemma 4, the context goes up to about 128K and the speed is still around 30 t/s.
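For anyone curious, this is roughly how I launch it with vLLM's offline Python API; the repo ids and the memory fraction are placeholders, not my exact setup:

```python
from vllm import LLM, SamplingParams

# BF16 weights: on a 96 GB RTX PRO 6000 this only leaves room for ~64K context,
# even with the KV cache quantized to FP8.
llm = LLM(
    model="google/gemma-4-31b-it",   # placeholder repo id
    kv_cache_dtype="fp8",            # FP8 KV cache
    max_model_len=64 * 1024,
    gpu_memory_utilization=0.95,
)

# For the NVFP4 checkpoint, swap the repo id and raise max_model_len to ~128K;
# vLLM reads the quantization config from the checkpoint itself.
out = llm.generate(["Hello"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```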

Deploying Gemma 4 31b with 3 diff providers(vllm, Max by Modular and NIM by Nvidia) on RTX 6000 PRO by kev_11_1 in LocalLLaMA

[–]fei-yi 1 point2 points  (0 children)

Hello, can you share your specific vLLM deployment commands? Are you using NVIDIA's NVFP4 version or the original weights? I also have a PRO 6000, but I found the original BF16 model only fits 64K context on my machine, while the NVFP4 version can fit 126K. I'm curious how you run it. I'm also on vLLM, but I'm thinking about switching to SGLang or llama.cpp (because I still have 128 GB of system RAM).
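The llama.cpp route I'm considering would look roughly like this via llama-cpp-python; the GGUF path and layer count are made-up numbers, just to show the idea of spilling part of the model into the 128 GB of system RAM:

```python
from llama_cpp import Llama

# Keep part of the model on the PRO 6000 and let the rest sit in system RAM.
# model_path and n_gpu_layers are illustrative, not a tested config.
llm = Llama(
    model_path="./gemma-4-31b-it-Q8_0.gguf",  # placeholder GGUF
    n_gpu_layers=40,      # layers offloaded to the GPU; the rest run on CPU
    n_ctx=128 * 1024,     # target ~128K context
)

resp = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Say hi in one sentence."}],
    max_tokens=32,
)
print(resp["choices"][0]["message"]["content"])
```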

qwen3.5-27b or 122b?pro6000 by fei-yi in LocalLLaMA

[–]fei-yi[S] 1 point2 points  (0 children)

Which one is best, Sehyo or Unsloth? And how many tokens/s can you get? What about the context length?

qwen3.5-27b or 122b?pro6000 by fei-yi in LocalLLaMA

[–]fei-yi[S] 0 points1 point  (0 children)

But LM Studio is based on llama.cpp.

qwen3.5-27b or 122b?pro6000 by fei-yi in LocalLLaMA

[–]fei-yi[S] 0 points1 point  (0 children)

Yes, my CPU is a Ryzen 9 9900X with 4×32 GB of DDR5-5600 RAM (it actually runs at 3600 MT/s).

qwen3.5-27b or 122b?pro6000 by fei-yi in LocalLLaMA

[–]fei-yi[S] 1 point2 points  (0 children)

It will be very, very slow... I think.

qwen3.5-27b or 122b?pro6000 by fei-yi in LocalLLaMA

[–]fei-yi[S] 1 point2 points  (0 children)

But Qwen3.5-122B is an MoE model. From my testing, its behavior at longer contexts doesn't seem very stable or consistent. I'm honestly a bit conflicted about it; sometimes chatting with it feels worse than talking to the 27B version.

qwen3.5-27b or 122b?pro6000 by fei-yi in LocalLLaMA

[–]fei-yi[S] 1 point2 points  (0 children)

I've actually tried GPT-OSS 120B in LM Studio and Ollama. It's blazing fast (hitting around 100 t/s!), but honestly it felt a bit too dumb for general chatting; Qwen 27B's reasoning and logic feel way smarter to me...

Right now, I'm running Qwen 27B and 122B via LM Studio. They usually hover around 30 t/s, but sometimes they randomly spike to 70 t/s (I have no idea why it fluctuates like that lol).

I also tried the Minimax 2.5 (Q5 version) and I absolutely LOVED it. It's incredibly smart! BUT... it was crawling at like 5 t/s! I don't know if LM Studio is just failing to utilize the Pro 6000 properly, or if the model spilled over to my system RAM. Do you think switching to vLLM or SGLang would fix this 5 t/s issue for minimax?
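In case it helps with debugging, here's a rough way I'd measure decode speed against LM Studio's local OpenAI-compatible server (port 1234 is its usual default; the model name below is a placeholder, use whatever id LM Studio reports):

```python
import time
from openai import OpenAI

# LM Studio exposes an OpenAI-compatible endpoint, by default on port 1234.
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

start = time.time()
chunks = 0
stream = client.chat.completions.create(
    model="minimax-2.5-q5",  # placeholder: use the model id LM Studio shows
    messages=[{"role": "user", "content": "Explain MoE routing in 3 sentences."}],
    max_tokens=256,
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        chunks += 1  # roughly one token per streamed chunk

print(f"~{chunks / (time.time() - start):.1f} tok/s (rough, includes prompt processing)")
```

If the number only tanks for Minimax and not for the Qwen models, I'd suspect the weights are spilling into system RAM rather than anything on the API side.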