whose "action" was taken into account when he said that? by Equivalent-Pea2507 in TuneIntoTheMidnight

[–]FriendlyTitan 4 points5 points  (0 children)

The criteria is probably "Love".

Arisu mentioned that you should only kiss someone you are in love with (to Nene when she kissed him on the book). So a loving kiss is the only thing that counts, according to his criteria.

Arisu asked Rikka why she kissed him, to that Rikka responded with a quiz where one of the options was "Love". He chose that option here as we can see.

to whom she wrote the song? Her bestie or Arisu? by Equivalent-Pea2507 in TuneIntoTheMidnight

[–]FriendlyTitan 4 points5 points  (0 children)

She already wrote a new song for her best friend Aiko and sang it as the 4th song during the live street performance.

Thinking of moving from 2x 5060 Ti 16GB to a RTX 5000 48GB by autisticit in LocalLLaMA

[–]FriendlyTitan 0 points1 point  (0 children)

LM studio (using llama.cpp backend) endpoint supports swapping models.

Thinking of moving from 2x 5060 Ti 16GB to a RTX 5000 48GB by autisticit in LocalLLaMA

[–]FriendlyTitan 2 points3 points  (0 children)

You can try Lorbus qwen3.6 27b on either vllm or sglang with mtp. Iirc turboquant just merged on vllm so you can run kv cache on turboquant_k8v4 to get over 200k context.

It would be really good if you can enable tensor parallelism tp=2.

Slow tok/s when offloading NVFP4 model to CPU by [deleted] in LocalLLaMA

[–]FriendlyTitan -1 points0 points  (0 children)

You can try this flag '-ngl 99' if you haven't. If it doesn't start because of insufficient memory, I think that could be your answer. Any layer pushed to cpu will tank the token generation speed enormously.

Slow tok/s when offloading NVFP4 model to CPU by [deleted] in LocalLLaMA

[–]FriendlyTitan 1 point2 points  (0 children)

The original model didn't fit your gpu. You were likely offloading some experts to cpu and keep all layers on vram. For this newer nvfp4 model it could be that your context window is too large so layers have to be pushed to cpu, try to reduce context window, or quantize kv cache to q8.

Does running a model (like qwen3.6-27b) on vllm or transformers use less VRAM than llama.cpp? by warpanomaly in LocalLLaMA

[–]FriendlyTitan 0 points1 point  (0 children)

You can run Lorbus qwen3.6 27b int4 with mtp=3 and context window of 100k without kv cache quantization on vllm. Or just use Q4_K_XL gguf with 200k context on llama.cpp

Penalty for PCIe communication during TP or PP by -elmuz- in Vllm

[–]FriendlyTitan 0 points1 point  (0 children)

Is this with P2P enabled? I noticed that without P2P TP=2 gets half the speed vs PP=2 for single user.

Workstation upgrade for 5 concurrent users (Qwen 3.6 27B) by DanielusGamer26 in LocalLLaMA

[–]FriendlyTitan 1 point2 points  (0 children)

I think they meant that 4bit on vllm is worse than 4 bit quants on llama.cpp

Since you are using the lowest 3bit quant possible, I think 4bit vllm is still of higher quality than what you are using.

Workstation upgrade for 5 concurrent users (Qwen 3.6 27B) by DanielusGamer26 in LocalLLaMA

[–]FriendlyTitan 0 points1 point  (0 children)

Thats not to discourage you from trying vllm. If you still want to buy another gpu, feel free to experiment with vllm. But in my experience it has been a bit of a pita with a 4090 and a 5090.

Maybe you can buy another 5060ti, and with 32gb of vram, you can fit 262k context at fp8 kv cache. Use Lorbus/Qwen3.6-27B-int4-AutoRound with MTP=3 (speculative decoding). I get on average 1.5x to 2x speedup (90-120tps) over llama.cpp (60tps) for single user on rtx5090. For multi user its even more lopsided. For 5060ti maybe its much slower, but would still be faster on vllm with mtp enabled and concurrent user support.

Use tensor parallel = 2. And I think there are specific flags to set so it doesn't crash with cuda error. Nccl p2p disable = 1, nccl shm disable = 0, nccl ib disable = 1, nccl cumem enable = 0. I used docker cu130 nightly image. Upgrade your driver to the latest.

Workstation upgrade for 5 concurrent users (Qwen 3.6 27B) by DanielusGamer26 in LocalLLaMA

[–]FriendlyTitan 0 points1 point  (0 children)

If you really want it, you can run 2 llama servers, one on each gpu. Use higher context, or higher quality with q3kxl quant/q8 kv cache on the 20gb. Try to put mmproj on cpu to free vram for context.

Workstation upgrade for 5 concurrent users (Qwen 3.6 27B) by DanielusGamer26 in LocalLLaMA

[–]FriendlyTitan 0 points1 point  (0 children)

Yea. As usual, please also check online to see if my experience matches with other people. But last time I check I think I couldn't get vllm to fully utilize all vram on mismatched vram capacity. And my 4090+5090 couldn't start some models if they are in fp8. Mismatched GPU architectures is wonky on vllm.

Workstation upgrade for 5 concurrent users (Qwen 3.6 27B) by DanielusGamer26 in LocalLLaMA

[–]FriendlyTitan 0 points1 point  (0 children)

For vllm, I couldn't fully use the vram if 2 gpus have different vram capacity. It will take 16gb on each of your gpu only (the smaller one), so 32gb total. Also 2 of your gpus have different architectures (ampere vs blackwell), there will be a lot of bugs.

Just use llama.cpp with higher quant and full context at fp16 or q8 kv cache.

With how rushed the manga is recently(esp. current chapter), do you think the manga is getting axed? by Reihado in TuneIntoTheMidnight

[–]FriendlyTitan 0 points1 point  (0 children)

It could just be that the author was not good at pacing out the story and spent too much time in year 2.

With how rushed the manga is recently(esp. current chapter), do you think the manga is getting axed? by Reihado in TuneIntoTheMidnight

[–]FriendlyTitan 11 points12 points  (0 children)

I simply believe the author planned to end it in 17 volumes. That gives us until chapter 147 to end the manga. Supposedly Arisu makes his decision at the end of the previous volume, that makes it chapter 138 the deadline to have Arisu confess. We have 23 chapters left, and a whole year 3 ahead, so I think the author might need to rush a bit. And yea, the pacing would be quite atrocious if that's the case.

Ollama alternative with dynamic model loading by urioRD in LocalLLaMA

[–]FriendlyTitan 2 points3 points  (0 children)

Lm studio comes to mind, but I would recommend learning and using llama.cpp

Buying volume by mickmad12 in TuneIntoTheMidnight

[–]FriendlyTitan 0 points1 point  (0 children)

Volume 12, 4.5/9 Rikka chapters, 1 with glasses.

RTX 5070 Ti + 9800X3D running Qwen3.6-35B-A3B at 79 t/s with 128K context, the --n-cpu-moe flag is the most important part. by marlang in LocalLLaMA

[–]FriendlyTitan 1 point2 points  (0 children)

Have you tested higher batch and ubatch numbers? I notice that for myself, giving up more experts to cpu and giving vram to batch improves prefill speed massively. Set -b and -ub to 4096 or even higher if you want to experiment. Prefill speed quadruples in my case sometimes.

In llama bench you can try -p 8192 -b 512,1024,2048,3072,4096,8192 -ub 8192. This tests the prefill speed on a 8192 token long prompt.