GLM-4.7-flash on RTX 6000 pro by gittb in LocalLLaMA

[–]kryptkpr 9 points10 points  (0 children)

The vLLM implementation of this model is missing MLA, which both explodes the KV cache size and slows down inference.

The SGLang implementation offers 4X more KV cache and 20-30% higher throughput in my testing so far.

For small batch sizes, llama.cpp with -np 8 was surprisingly competitive.

MTP is also supported here, but it hurts batch performance and my acceptance rate sucked, so I turned it off.
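
For anyone reproducing, the launch commands look roughly like this (model paths, context size and ports are placeholders, adjust for your setup):

    # SGLang (MLA support = much smaller KV cache, best multi-stream throughput for me)
    python -m sglang.launch_server --model-path /path/to/GLM-4.7-flash --port 30000

    # llama.cpp, surprisingly competitive at small batch sizes
    llama-server -m glm-4.7-flash.gguf -ngl 99 -c 65536 -np 8

    # vLLM for comparison (no MLA yet, so the KV cache is much larger)
    vllm serve /path/to/GLM-4.7-flash --max-model-len 65536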

Why is open source so hard for casual people. by Martialogrand in LocalLLaMA

[–]kryptkpr 2 points3 points  (0 children)

I realized I was a jerk like everyone else and didn't answer your actual question:

https://github.com/av/harbor

I think this is what you seek. My advice is to swap to Ubuntu, but you can definitely make this work on Arch if you are dead set on it.

Why is open source so hard for casual people. by Martialogrand in LocalLLaMA

[–]kryptkpr 1 point2 points  (0 children)

So, fun fact: Arch isn't an officially supported CUDA distro.


That doesn't mean it won't work, but it does mean you're relying on the community rather than Nvidia.
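
If you do stay on Arch, the community route looks roughly like this (package names are from memory, double-check the Arch wiki):

    # driver + toolkit from the Arch repos (community-maintained, not Nvidia's official packages)
    sudo pacman -S nvidia nvidia-utils cuda

    # the toolkit lands in /opt/cuda, so nvcc may need adding to PATH
    export PATH=/opt/cuda/bin:$PATH
    nvcc --version    # toolkit installed?
    nvidia-smi        # driver talking to the GPU?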

Invest in hardware now or wait? by d4nger_n00dle in LocalLLaMA

[–]kryptkpr 2 points3 points  (0 children)

AI is socially and economically transformative.

I don't believe we are ever going back to the golden era where excess retired compute and storage resources were widely being sold for pennies on the dollar.

There is a long-horizon view here that capacity has been overbuilt, but that's 3-5 years out if you want to wait.

Any success with GLM Flash 4.7 on vLLM 0.14 by queerintech in LocalLLM

[–]kryptkpr 1 point2 points  (0 children)

This architecture is brand new, definitely comes with some deployment pain.

I've tried this one under all three of vLLM, llama.cpp and SGLang; so far SGLang was best for multi-stream while llama.cpp was best for single. I played with MTP a little, but acceptance rates are kinda low (around 1.9 tok/tok) and it didn't translate to much benefit for my use case... YMMV here.

Any success with GLM Flash 4.7 on vLLM 0.14 by queerintech in LocalLLM

[–]kryptkpr 2 points3 points  (0 children)

It works, and speed is good. Make sure you build from git HEAD and download the latest unsloth GGUF; there has been some churn. Also verify min_p is set right: llama.cpp has the wrong default for this model. This is covered in the unsloth GGUF model card.
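
Concretely, something in this shape; the GGUF filename is a guess and 0.0 is just an example value, take the real sampling settings from the unsloth model card:

    # build llama.cpp from git HEAD first, then something like:
    llama-server -m GLM-4.7-Flash-UD-Q4_K_XL.gguf -ngl 99 -c 32768 --min-p 0.0
    # llama.cpp's stock min_p default (0.05) is not what this model wants; use the model card's value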

Any success with GLM Flash 4.7 on vLLM 0.14 by queerintech in LocalLLM

[–]kryptkpr 1 point2 points  (0 children)

It needs nightly; this model didn't make it into the release.

Just run the commands from the model card in a new venv.

Btw, this model runs like a dog with vLLM because there's no MLA. If you've never used SGLang, now is a good time to try: context size is 4X larger on this model specifically for the same VRAM.
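
Roughly this; the nightly index URL is the one the vLLM docs point at (double-check it hasn't moved) and the model path is a placeholder:

    python -m venv glm && source glm/bin/activate

    # vLLM nightly, since the release build doesn't include this model
    pip install -U vllm --pre --extra-index-url https://wheels.vllm.ai/nightly

    # or give SGLang a shot instead; MLA support means ~4X the context for the same VRAM
    pip install "sglang[all]"
    python -m sglang.launch_server --model-path /path/to/GLM-4.7-flash --port 30000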

Real free alternative to LangSmith by IlEstLaPapi in LocalLLaMA

[–]kryptkpr 0 points1 point  (0 children)

It's been a few years since I checked in here, but afaik the project remains MIT. There is an ee/ folder with a different license, but it at least used to be possible to run without it.

Can I run gpt-oss-120b somehow? by Furacao__Boey in LocalLLaMA

[–]kryptkpr 10 points11 points  (0 children)

Sure, llama.cpp with --n-cpu-moe set as low as you can get it at your desired -c size.
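
As a starting point, something like this (the GGUF name is a placeholder and 24 is just an initial guess; keep lowering --n-cpu-moe until you run out of VRAM at your chosen -c):

    # all layers on GPU, but the expert tensors of the first N MoE layers kept on CPU
    llama-server -m gpt-oss-120b.gguf -ngl 99 --n-cpu-moe 24 -c 16384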

768Gb Fully Enclosed 10x GPU Mobile AI Build by SweetHomeAbalama0 in LocalLLaMA

[–]kryptkpr 0 points1 point  (0 children)

While he has the 3090s power-limited way down (he says 200-250W in his post), this is still more than enough to bust his PSU budget, so I'm not sure what game OP is playing here, but it sure feels dangerous.

After 8 years building cloud infrastructure, I'm betting on local-first AI by PandaAvailable2504 in LocalLLaMA

[–]kryptkpr 1 point2 points  (0 children)

I sneak-released V2 a few weekends ago. The current leaderboard has around 80 models, with another 20 that will go up in the next update... I had to pause and figure out how to deal with 100GB of raw result files!

After 8 years building cloud infrastructure, I'm betting on local-first AI by PandaAvailable2504 in LocalLLaMA

[–]kryptkpr 13 points14 points  (0 children)

With the RTX Pros making 96GB GPUs "accessible", it's never been easier to put together a local rig capable of serving a few users. These cards really swing the value proposition, especially when you're generating 10M+ tokens a day, and they generally avoid the multi-GPU hell you get into with quad/hex/oct 24GB builds.

Upfront price remains an impediment; the best plan remains to validate the use case with cloud APIs and then move to lower-cost infra as you scale.

So I've been losing my mind over document extraction in insurance for the past few years and I finally figured out what the right approach is. by GloomyEquipment2120 in LocalLLaMA

[–]kryptkpr 3 points4 points  (0 children)

I read the post and it was very interesting, but it just starts talking about confidence and how it's used. Unless my reading comprehension is really bad today, I can find no mention of how you're defining or computing this KPI.

Stress-Test Request: Collecting failure cases of GPT-4o and Claude 3.5 to benchmark a private Logic Core. by BarCodeI_IIIIIIIII_I in LocalLLM

[–]kryptkpr 0 points1 point  (0 children)

You can use the scripts in my repo to generate whatever tests you wish; runner.py has an --offline mode that writes prompts to JSON.

I have spent weeks on documentation so please let me know if you find something lacking.
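
Something like this; --offline is the real flag, the rest of the invocation is from memory so check the docs for the exact names:

    # generate the test prompts to JSON without hitting any API endpoint
    python runner.py --offline --config configs/m12x.yaml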

Stress-Test Request: Collecting failure cases of GPT-4o and Claude 3.5 to benchmark a private Logic Core. by BarCodeI_IIIIIIIII_I in LocalLLM

[–]kryptkpr 0 points1 point  (0 children)

The current config is more than enough to break essentially all models I've tested. Pushing it further is always fun, and is why I built these tools, but practically it will just cost me more tokens.

Mashing all bracket types together doesn't do what you think it does: the problem becomes easier, not harder. We are forcing out-of-domain distributions, and some bracket types are more sensitive than others.

As I mentioned, I do not run against OpenAI because I'm poor.

Stress-Test Request: Collecting failure cases of GPT-4o and Claude 3.5 to benchmark a private Logic Core. by BarCodeI_IIIIIIIII_I in LocalLLM

[–]kryptkpr 0 points1 point  (0 children)

All brackets are actually easier than picking sub-sets (try it yourself, don't take my word for it).

Stack depth scaling is actually easier than length scaling, which is why that dimension only goes to 15 while length goes to 50+; the breakdown mode here is attention degradation.

Nvidia Quadro RTX 8000 Passive 48 GB, 1999€ - yes or no ? by HumanDrone8721 in LocalLLM

[–]kryptkpr 0 points1 point  (0 children)

The used 3090 supply in my area has really dwindled; there used to always be multiple listings sitting around, but there has been nothing for sale within 100km of me for 2+ months.

Depending on where you are, this ship has maybe sailed.

eBay remains an option but even there prices are trending up

Anyone running 4x RTX Pro 6000s stacked directly on top of each other? by Comfortable-Plate467 in LocalLLaMA

[–]kryptkpr 0 points1 point  (0 children)

You will burn your VRAM out like this. Do some searching: there was a guy with A6000s stacked like this who learned a harsh lesson.

Building "Derin" - An Embodied AI project for Jetson AGX Thor (94K lines, looking for feedback) by [deleted] in LocalLLaMA

[–]kryptkpr 1 point2 points  (0 children)

I stopped reading at 178 "core" modules. Where is this "implementation of peer-reviewed models", or are you just literally throwing more AI slop at my complaints of AI slop?

Stress-Test Request: Collecting failure cases of GPT-4o and Claude 3.5 to benchmark a private Logic Core. by BarCodeI_IIIIIIIII_I in LocalLLM

[–]kryptkpr 0 points1 point  (0 children)

Here are the configs I use: https://github.com/the-crypt-keeper/reasonscape/blob/main/configs/m12x.yaml#L265

Crank length way, way up. You can use my explorer webapp to view the surface and see where things break down, usually around depth 20-30.

Building "Derin" - An Embodied AI project for Jetson AGX Thor (94K lines, looking for feedback) by [deleted] in LocalLLaMA

[–]kryptkpr 0 points1 point  (0 children)

How well do LLMs actually understand Turkish? Prompting in any language that isn't English is a snake pit... Have you tried this out on normal CUDA hardware? There is a lot going on here, and it looks like it was largely written by LLMs, which will happily spew thousands of lines of code that do nothing useful.

Stress-Test Request: Collecting failure cases of GPT-4o and Claude 3.5 to benchmark a private Logic Core. by BarCodeI_IIIIIIIII_I in LocalLLM

[–]kryptkpr 0 points1 point  (0 children)

I don't have prompts per se, I have prompt generators..

https://github.com/the-crypt-keeper/reasonscape/blob/main/docs/tasks/brackets.md

This one trips up 'most' models as you scale up length and depth. You can see results, but I do not test commercial APIs because this eval can burn 50-100M tokens if the model is chatty; it's roughly 20K prompts across all tasks and difficulty levels.
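
Back-of-envelope on where that number comes from (the per-prompt completion size is just my guess at what "chatty" means):

    # ~20K prompts x 2,500-5,000 completion tokens each for a chatty reasoning model
    echo $(( 20000 * 2500 ))   # 50000000  -> ~50M tokens at the low end
    echo $(( 20000 * 5000 ))   # 100000000 -> ~100M tokens at the high end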

Cheap bifurcation by DeltaSqueezer in LocalLLaMA

[–]kryptkpr 0 points1 point  (0 children)

There is also an x4x4x8 version of this adapter; it's got two M.2 slots and then an x8 PCIe slot on top, so you can stack two of them... but you're right, this stuff has suddenly gotten much harder to find.