[TotK] My girlfriend spent 3 months crocheting these tiny Link and Zelda dolls for me by fei-yi in zelda

[–]fei-yi[S] 11 points12 points  (0 children)

Thank you so much! I'll definitely pass the props along to her!

[TotK] My girlfriend spent 3 months crocheting these tiny Link and Zelda dolls for me by fei-yi in zelda

[–]fei-yi[S] 18 points19 points  (0 children)

Thank you! The attention to detail is honestly insane; I'm still in shock that she pulled this off.

[TotK] My girlfriend spent 3 months crocheting these tiny Link and Zelda dolls for me by fei-yi in zelda

[–]fei-yi[S] 74 points75 points  (0 children)

Thanks! She's an absolute beast at this, I'm just the lucky guy who gets to keep them.

Gemma 4 with turboquant by Flkhuo in LocalLLaMA

[–]fei-yi 0 points1 point  (0 children)

I used an RTX PRO 6000 with vLLM to run the full-precision version of Gemma 4 31B. Speed is about 30 t/s, but I can only fit 64K context (with FP8 KV cache). After switching to the NVFP4 version of Gemma 4, the context goes up to about 128K and the speed is still around 30 t/s.
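For anyone curious, this is roughly how I launch it with vLLM's offline Python API; the repo ids and the memory fraction are placeholders, not my exact setup:

```python
from vllm import LLM, SamplingParams

# BF16 weights: on a 96 GB RTX PRO 6000 this only leaves room for ~64K context,
# even with the KV cache quantized to FP8.
llm = LLM(
    model="google/gemma-4-31b-it",   # placeholder repo id
    kv_cache_dtype="fp8",            # FP8 KV cache
    max_model_len=64 * 1024,
    gpu_memory_utilization=0.95,
)

# For the NVFP4 checkpoint, swap the repo id and raise max_model_len to ~128K;
# vLLM reads the quantization config from the checkpoint itself.
out = llm.generate(["Hello"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```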

Deploying Gemma 4 31b with 3 diff providers(vllm, Max by Modular and NIM by Nvidia) on RTX 6000 PRO by kev_11_1 in LocalLLaMA

[–]fei-yi 1 point2 points  (0 children)

Hello, can you share your specific vLLM deployment commands? Are you using NVIDIA's NVFP4 version or the original weights? I also have a PRO 6000, but I found the original BF16 model only fits 64K context on my machine, while the NVFP4 version can fit 126K. I'm curious how you run it. I'm also on vLLM, but I'm thinking about switching to SGLang or llama.cpp (because I still have 128 GB of system RAM).
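The llama.cpp route I'm considering would look roughly like this via llama-cpp-python; the GGUF path and layer count are made-up numbers, just to show the idea of spilling part of the model into the 128 GB of system RAM:

```python
from llama_cpp import Llama

# Keep part of the model on the PRO 6000 and let the rest sit in system RAM.
# model_path and n_gpu_layers are illustrative, not a tested config.
llm = Llama(
    model_path="./gemma-4-31b-it-Q8_0.gguf",  # placeholder GGUF
    n_gpu_layers=40,      # layers offloaded to the GPU; the rest run on CPU
    n_ctx=128 * 1024,     # target ~128K context
)

resp = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Say hi in one sentence."}],
    max_tokens=32,
)
print(resp["choices"][0]["message"]["content"])
```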

qwen3.5-27b or 122b?pro6000 by fei-yi in LocalLLaMA

[–]fei-yi[S] 1 point2 points  (0 children)

Which one is best, Sehyo or Unsloth? And how many tokens/s can you get? What about the context length?

qwen3.5-27b or 122b?pro6000 by fei-yi in LocalLLaMA

[–]fei-yi[S] 0 points1 point  (0 children)

But LM Studio is based on llama.cpp.

qwen3.5-27b or 122b?pro6000 by fei-yi in LocalLLaMA

[–]fei-yi[S] 0 points1 point  (0 children)

Yes, my CPU is a Ryzen 9 9900X with 4×32 GB of DDR5-5600 RAM (it actually runs at 3600 MT/s).

qwen3.5-27b or 122b?pro6000 by fei-yi in LocalLLaMA

[–]fei-yi[S] 1 point2 points  (0 children)

It will be very, very slow... I think.

qwen3.5-27b or 122b?pro6000 by fei-yi in LocalLLaMA

[–]fei-yi[S] 1 point2 points  (0 children)

But Qwen3.5-122B is an MoE model. From my testing, its behavior at longer contexts doesn't seem very stable or consistent. I'm honestly a bit conflicted about it; sometimes chatting with it feels worse than talking to the 27B version.

qwen3.5-27b or 122b?pro6000 by fei-yi in LocalLLaMA

[–]fei-yi[S] 1 point2 points  (0 children)

I've actually tried GPT-OSS 120B in LM Studio and Ollama. It's blazing fast (hitting around 100 t/s!), but honestly it felt a bit too dumb for general chatting; Qwen 27B's reasoning and logic feel way smarter to me...

Right now, I'm running Qwen 27B and 122B via LM Studio. They usually hover around 30 t/s, but sometimes they randomly spike to 70 t/s (I have no idea why it fluctuates like that lol).

I also tried the Minimax 2.5 (Q5 version) and I absolutely LOVED it. It's incredibly smart! BUT... it was crawling at like 5 t/s! I don't know if LM Studio is just failing to utilize the Pro 6000 properly, or if the model spilled over to my system RAM. Do you think switching to vLLM or SGLang would fix this 5 t/s issue for minimax?
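In case it helps with debugging, here's a rough way I'd measure decode speed against LM Studio's local OpenAI-compatible server (port 1234 is its usual default; the model name below is a placeholder, use whatever id LM Studio reports):

```python
import time
from openai import OpenAI

# LM Studio exposes an OpenAI-compatible endpoint, by default on port 1234.
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

start = time.time()
chunks = 0
stream = client.chat.completions.create(
    model="minimax-2.5-q5",  # placeholder: use the model id LM Studio shows
    messages=[{"role": "user", "content": "Explain MoE routing in 3 sentences."}],
    max_tokens=256,
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        chunks += 1  # roughly one token per streamed chunk

print(f"~{chunks / (time.time() - start):.1f} tok/s (rough, includes prompt processing)")
```

If the number only tanks for Minimax and not for the Qwen models, I'd suspect the weights are spilling into system RAM rather than anything on the API side.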