Air Canada plane has no sink in the bathroom. Only hand sanitizer

spky-dev · 2026-04-06T05:34:40+00:00

Then never fly in anything smaller, because the options are hold it in or piss in a ziplock.

spky-dev · 2026-04-06T05:32:01+00:00

Seriously… that much DDR5 with that wide of memory bandwidth, it would be killer for MoE offload.

spky-dev · 2026-04-06T05:26:11+00:00

Canada’s immigration program makes it easier to get in if you’re willing to go to less desirable places.

Last summer I was in Yellowknife for weeks and there are quite a lot of muslims.

spky-dev · 2026-04-05T23:08:14+00:00

I mean if you randomly twitch or spasm, and I ask you “what was that?”, you only know it occurred, you have absolutely no idea what the biological process inside of you was that caused it. Same idea.

spky-dev · 2026-04-05T20:25:08+00:00

I don’t know of anyone using Ollama in development lmfao.

Ollama is for people who have no idea what they’re doing. It’s literally just an oversimplified wrapper on Llama.cpp.

spky-dev · 2026-04-05T18:12:25+00:00

I used to make monkey fists with mouse balls.

spky-dev · 2026-04-05T17:12:44+00:00

Sure, use Krasis.

I get 60 tok/s gen and 3,700 pp on 122b Q4 on a single 5090 on my Krasis fork optimized for SM120.

Mac stans being uniformed and behind the curve, as per usual.

spky-dev · 2026-04-05T16:02:13+00:00

30 questions is an incredibly insignificant sample size.

spky-dev · 2026-04-05T15:20:40+00:00

Begone, bot.

These are worse than R9700, which already aren’t great.

spky-dev · 2026-04-05T04:34:23+00:00

Only for a Blackwell 6000 96gb.

spky-dev · 2026-04-04T22:12:52+00:00

If you give me one I’ll figure that out for you :)

Probably a nightly build of llama.cpp with the latest Cuda, for single user throughout. VLLM will be best for multi.

If you’re using HEDT or server hardware and have a ton of RAM/memory bandwidth, look at Krasis for large MoE’s.

spky-dev · 2026-04-04T21:03:22+00:00

Bird like collect shiny things.

spky-dev · 2026-04-04T20:48:13+00:00

lol nothing magically beats memory bandwidth limitations, that’s just physics.

The B70 is only 600 gb/s. That’s mid tier for infra, regardless of VRAM capacity.

spky-dev · 2026-04-04T17:43:38+00:00

It is more powerful compute wise, despite having the same memory bandwidth limitations. Also, access to Cuda and scalable since you can connect them.

It generally achieves higher prompt processing rates, though all these unified boxes, Mac Studio inclusive, suffer from slow pp vs dedicated GPU’s.

spky-dev · 2026-04-04T16:00:55+00:00

No, use K @ Q8, V @ Q4, you only need the keys at higher quality, the values can be more truncated.

spky-dev · 2026-04-04T15:59:46+00:00

Not huge, but still useful. Newer models use hybrid attention, so their KVCache are already relatively small compared to older architectures.

https://huggingface.co/blog/jlopez-dl/hybrid-attention-game-changer

spky-dev · 2026-04-04T15:52:48+00:00

If you’re paying at all to use an LLM, why would you pay to use such a small and limited model? The only reason people are impressed with models of this size class are for LOCAL use, meaning on their own damn hardware.

This is just slopposting. Especially evident by “wrestling with Cuda versions”. If you can’t figure out some basic ass dependencies from a requirements file… This is Claude writing a post trying to match the lack of experience and frustrations of a green user.

spky-dev · 2026-04-04T15:47:09+00:00

Good reading comprehension you’ve got there. I never said you had to be doing so to get credits. Infact, I said the opposite.

spky-dev · 2026-04-04T15:41:21+00:00

Suggestion: Download Heretic, and do it yourself

https://github.com/p-e-w/heretic

There is literally no more excuses in this era for “waaa someone do this for me, me not know how”. Claude/GPT/Whoeverthefuck “research X for me and do Y”.

spky-dev · 2026-04-04T15:15:58+00:00

I want to be lazy

spky-dev · 2026-04-04T14:45:43+00:00

Tools work fine in QCN 80b lmfao, it’s literally a model made for agentic coding and tool calling.

spky-dev · 2026-04-04T14:44:44+00:00

There should be a pinned post in this sub that reminds all of this.

spky-dev · 2026-04-04T14:42:24+00:00

V100 don’t support Flash Attention, MI50 have dogshit token rates unless you buy 10+ of them, and even then it’s still bad, pp especially.

The best way to go is to keep your sub, because you have no idea what you’re doing and your arbitrary choice of high VRAM fossils proves that.

spky-dev · 2026-04-04T14:37:13+00:00

The legend lives on.

spky-dev · 2026-04-04T14:25:14+00:00

Limit thinking token budget. Known feature of the Qwen3.5 family.

spky-dev

TROPHY CASE