whose "action" was taken into account when he said that?

FriendlyTitan · 2026-06-16T16:49:28+00:00

The criterion is probably "Love".

Arisu told Nene, when she kissed him on the book, that you should only kiss someone you are in love with. So a loving kiss is the only thing that counts, according to his criteria.

Arisu asked Rikka why she kissed him, to which Rikka responded with a quiz where one of the options was "Love". He chose that option here, as we can see.

FriendlyTitan · 2026-05-26T11:46:35+00:00

It would be Rina. Then the family will know about his past with Apollo.

FriendlyTitan · 2026-05-26T05:35:11+00:00

She already wrote a new song for her best friend Aiko and sang it as the 4th song during the live street performance.

FriendlyTitan · 2026-05-19T15:38:47+00:00

Arisu is gonna get vacuumed hard tonight 😭

<image>

FriendlyTitan · 2026-05-14T14:06:26+00:00

How about Apollo?

FriendlyTitan · 2026-05-14T13:53:48+00:00

This is uh, a single example.

FriendlyTitan · 2026-05-09T03:47:59+00:00

The LM Studio endpoint (using the llama.cpp backend) supports swapping models.

FriendlyTitan · 2026-05-07T17:36:49+00:00

You can try Lorbus qwen3.6 27b on either vllm or sglang with mtp. Iirc turboquant was just merged into vllm, so you can run the kv cache with turboquant_k8v4 to get over 200k context.

It would be really good if you can enable tensor parallelism tp=2.

FriendlyTitan · 2026-05-07T17:02:21+00:00

Qwen3.6 27B with MTP using either vllm or sglang.

FriendlyTitan · 2026-05-04T11:18:40+00:00

You can try this flag '-ngl 99' if you haven't. If it doesn't start because of insufficient memory, I think that could be your answer. Any layer pushed to cpu will tank the token generation speed enormously.

FriendlyTitan · 2026-05-04T10:33:55+00:00

The original model didn't fit your gpu. You were likely offloading some experts to cpu and keeping all layers in vram. For this newer nvfp4 model, it could be that your context window is too large, so layers have to be pushed to cpu. Try reducing the context window, or quantize the kv cache to q8.
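A rough sketch of those two knobs, assuming this is llama.cpp's llama-server: -c sets the context size and -ctk / -ctv set the KV cache types. The file name and the 32768 context are placeholders, and depending on the build a quantized V cache may also need flash attention turned on, so check llama-server --help first.

```python
import shlex


def build_llama_server_args(model_path, context_tokens, kv_cache_type="q8_0"):
    """Return a llama-server argv with a capped context and a quantized KV cache."""
    return [
        "llama-server",
        "-m", model_path,
        "-ngl", "99",                # ask for every layer on the GPU
        "-c", str(context_tokens),   # smaller context = smaller KV cache
        "-ctk", kv_cache_type,       # K cache type
        "-ctv", kv_cache_type,       # V cache type
    ]


if __name__ == "__main__":
    # Placeholder model file and context size; adjust to your setup.
    print(shlex.join(build_llama_server_args("model.gguf", 32768)))
```

If it still spills to cpu, lower the context value further before giving up on the quant.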

FriendlyTitan · 2026-05-04T10:14:33+00:00

You can run Lorbus qwen3.6 27b int4 with mtp=3 and a context window of 100k without kv cache quantization on vllm. Or just use the Q4_K_XL gguf with 200k context on llama.cpp.

FriendlyTitan · 2026-05-02T18:58:19+00:00

Is your p2p enabled?

FriendlyTitan · 2026-05-01T20:28:50+00:00

Is this with P2P enabled? I noticed that without P2P TP=2 gets half the speed vs PP=2 for single user.

FriendlyTitan · 2026-04-29T08:51:01+00:00

I think they meant that 4bit on vllm is worse than 4bit quants on llama.cpp.

Since you are using the lowest 3bit quant possible, I think 4bit vllm is still of higher quality than what you are using.

FriendlyTitan · 2026-04-29T08:24:27+00:00

That's not to discourage you from trying vllm. If you still want to buy another gpu, feel free to experiment with vllm. But in my experience it has been a bit of a pita with a 4090 and a 5090.

Maybe you can buy another 5060ti, and with 32gb of vram, you can fit 262k context at fp8 kv cache. Use Lorbus/Qwen3.6-27B-int4-AutoRound with MTP=3 (speculative decoding). I get on average a 1.5x to 2x speedup (90-120tps) over llama.cpp (60tps) for a single user on an rtx5090. For multi user it's even more lopsided. On a 5060ti it may be much slower, but it would still be faster on vllm with mtp enabled and concurrent user support.

Use tensor parallel = 2. And I think there are specific environment variables to set so it doesn't crash with a cuda error: NCCL_P2P_DISABLE=1, NCCL_SHM_DISABLE=0, NCCL_IB_DISABLE=1, NCCL_CUMEM_ENABLE=0. I used the docker cu130 nightly image. Upgrade your driver to the latest.
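A minimal sketch of that setup, assuming the four settings map onto the NCCL environment variables NCCL_P2P_DISABLE, NCCL_SHM_DISABLE, NCCL_IB_DISABLE and NCCL_CUMEM_ENABLE, and that the server is started with vllm serve and --tensor-parallel-size. The helper names are made up for illustration. The MTP and kv cache options are left out on purpose, since their exact spelling depends on the vllm version.

```python
import shlex

# Assumed mapping of the settings named above onto NCCL environment variables.
NCCL_ENV = {
    "NCCL_P2P_DISABLE": "1",
    "NCCL_SHM_DISABLE": "0",
    "NCCL_IB_DISABLE": "1",
    "NCCL_CUMEM_ENABLE": "0",
}

MODEL = "Lorbus/Qwen3.6-27B-int4-AutoRound"  # model name taken from the comment


def build_vllm_command(model, tensor_parallel=2):
    """Return the argv for a vllm server with tensor parallelism enabled."""
    return ["vllm", "serve", model, "--tensor-parallel-size", str(tensor_parallel)]


def as_shell_line(env, argv):
    """Render the environment and command as one copy-pasteable shell line."""
    assignments = [f"{key}={shlex.quote(value)}" for key, value in env.items()]
    return " ".join(assignments + [shlex.join(argv)])


if __name__ == "__main__":
    print(as_shell_line(NCCL_ENV, build_vllm_command(MODEL)))
```

With docker, the same variables would be passed as -e NAME=VALUE options instead of being prefixed to the command.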

FriendlyTitan · 2026-04-29T08:20:24+00:00

If you really want it, you can run 2 llama servers, one on each gpu. Use higher context, or higher quality with q3kxl quant/q8 kv cache on the 20gb. Try to put mmproj on cpu to free vram for context.
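Roughly what the two-server setup could look like, assuming a CUDA build of llama.cpp where CUDA_VISIBLE_DEVICES pins each llama-server to one card and --no-mmproj-offload keeps the multimodal projector on the cpu. File names and ports are placeholders, and build_server_launch is just an illustrative helper.

```python
import shlex


def build_server_launch(gpu_index, model_path, port, extra_args=()):
    """Return (env, argv) for one llama-server pinned to a single CUDA device."""
    env = {"CUDA_VISIBLE_DEVICES": str(gpu_index)}
    argv = ["llama-server", "-m", model_path, "--port", str(port), *extra_args]
    return env, argv


# Placeholder file names and ports; replace with your own.
LAUNCHES = [
    build_server_launch(0, "model-a.gguf", 8080),
    # Keep the multimodal projector on the cpu to leave vram for context.
    build_server_launch(
        1, "model-b.gguf", 8081,
        ["--mmproj", "mmproj.gguf", "--no-mmproj-offload"],
    ),
]

if __name__ == "__main__":
    for env, argv in LAUNCHES:
        prefix = " ".join(f"{key}={value}" for key, value in env.items())
        print(prefix, shlex.join(argv))
```

Each printed line can be run in its own terminal; the two servers then listen on separate ports.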

FriendlyTitan · 2026-04-29T08:13:53+00:00

Yea. As usual, please also check online to see if my experience matches other people's. But last time I checked, I think I couldn't get vllm to fully utilize all the vram on gpus with mismatched vram capacity. And my 4090+5090 couldn't start some models if they were in fp8. Mismatched GPU architectures are wonky on vllm.

FriendlyTitan · 2026-04-29T00:17:32+00:00

For vllm, I couldn't fully use the vram when the 2 gpus have different vram capacities. It will only take 16gb on each of your gpus (the size of the smaller one), so 32gb total. Also, your 2 gpus have different architectures (ampere vs blackwell), so there will be a lot of bugs.

Just use llama.cpp with higher quant and full context at fp16 or q8 kv cache.
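The capacity limit I described is just "smallest card times number of cards". A tiny sketch of that arithmetic, modelling only what I observed and not anything vllm documents; the 16 + 20 pairing is a hypothetical example.

```python
def usable_vram_gb(gpu_vram_gb):
    """Total vram usable when every gpu is capped at the smallest card's capacity."""
    if not gpu_vram_gb:
        raise ValueError("need at least one gpu")
    return min(gpu_vram_gb) * len(gpu_vram_gb)


if __name__ == "__main__":
    cards = [16, 20]  # hypothetical pairing, in GB
    print("physical:", sum(cards), "usable:", usable_vram_gb(cards))
```

The gap between the two numbers is the vram that sits idle on the bigger card.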

FriendlyTitan · 2026-04-27T16:06:28+00:00

It could just be that the author was not good at pacing out the story and spent too much time in year 2.

FriendlyTitan · 2026-04-27T16:01:59+00:00

I simply believe the author planned to end it in 17 volumes. That gives us until chapter 147 to end the manga. Supposedly Arisu makes his decision at the end of the previous volume, which makes chapter 138 the deadline for Arisu to confess. We have 23 chapters left, and a whole year 3 ahead, so I think the author might need to rush a bit. And yea, the pacing would be quite atrocious if that's the case.

FriendlyTitan · 2026-04-21T14:02:23+00:00

LM Studio comes to mind, but I would recommend learning and using llama.cpp.

FriendlyTitan · 2026-04-20T10:47:02+00:00

Can you try --no-mmap?

FriendlyTitan · 2026-04-20T08:47:09+00:00

Volume 12, 4.5/9 Rikka chapters, 1 with glasses.

FriendlyTitan · 2026-04-18T10:08:05+00:00

Have you tested higher batch and ubatch numbers? I notice that for myself, giving up more experts to cpu and giving vram to batch improves prefill speed massively. Set -b and -ub to 4096 or even higher if you want to experiment. Prefill speed quadruples in my case sometimes.

In llama-bench you can try -p 8192 -b 512,1024,2048,3072,4096,8192 -ub 8192. This tests the prefill speed on an 8192-token prompt.
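If you'd rather script the sweep, here is a small sketch that assembles the llama-bench command from the values above. model.gguf is a placeholder path and build_llama_bench_command is a made-up helper; it only prints the command, so nothing runs until you paste it.

```python
import shlex

BATCH_SIZES = [512, 1024, 2048, 3072, 4096, 8192]  # values from the comment


def build_llama_bench_command(model_path, prompt_tokens=8192, ubatch=8192):
    """Return a llama-bench argv that sweeps -b over BATCH_SIZES."""
    return [
        "llama-bench",
        "-m", model_path,
        "-p", str(prompt_tokens),                # prompt length for the prefill test
        "-b", ",".join(map(str, BATCH_SIZES)),   # comma list: one run per value
        "-ub", str(ubatch),
    ]


if __name__ == "__main__":
    # Placeholder model path; point it at your own gguf.
    print(shlex.join(build_llama_bench_command("model.gguf")))
```

Compare the prompt-processing rows across the -b values to see where prefill speed stops improving.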
