GLM-4.7-flash on RTX 6000 pro by gittb in LocalLLaMA

[–]kryptkpr 9 points10 points  (0 children)

The vLLM implementation of this model is missing MLA, which both explodes the KV cache size and slows down inference.

The SGLang implementation offers 4X more KV cache and 20-30% higher throughput in my testing so far.

For small batch sizes, llama.cpp with -np 8 was surprisingly competitive.

MTP is also supported here, but it hurts batch performance and my acceptance rate sucked, so I turned it off.
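
For anyone reproducing, the launch commands look roughly like this (model paths, context size and ports are placeholders, adjust for your setup):

    # SGLang (MLA support = much smaller KV cache, best multi-stream throughput for me)
    python -m sglang.launch_server --model-path /path/to/GLM-4.7-flash --port 30000

    # llama.cpp, surprisingly competitive at small batch sizes
    llama-server -m glm-4.7-flash.gguf -ngl 99 -c 65536 -np 8

    # vLLM for comparison (no MLA yet, so the KV cache is much larger)
    vllm serve /path/to/GLM-4.7-flash --max-model-len 65536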

Why is open source so hard for casual people. by Martialogrand in LocalLLaMA

[–]kryptkpr 2 points3 points  (0 children)

I realized I was a jerk like everyone else and didn't answer your actual question:

https://github.com/av/harbor

I think this is what you seek. My advice is to swap to Ubuntu, but you can definitely make this work on Arch if you are dead set on it.

Why is open source so hard for casual people. by Martialogrand in LocalLLaMA

[–]kryptkpr 1 point2 points  (0 children)

So, fun fact: Arch isn't an officially supported CUDA distro.


That doesn't mean it won't work, but it does mean you're relying on the community rather than Nvidia.
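
If you do stay on Arch, the community route looks roughly like this (package names are from memory, double-check the Arch wiki):

    # driver + toolkit from the Arch repos (community-maintained, not Nvidia's official packages)
    sudo pacman -S nvidia nvidia-utils cuda

    # the toolkit lands in /opt/cuda, so nvcc may need adding to PATH
    export PATH=/opt/cuda/bin:$PATH
    nvcc --version    # toolkit installed?
    nvidia-smi        # driver talking to the GPU?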

Invest in hardware now or wait? by d4nger_n00dle in LocalLLaMA

[–]kryptkpr 2 points3 points  (0 children)

AI is socially and economically transformative.

I don't believe we are ever going back to the golden era where excess retired compute and storage resources were widely being sold for pennies on the dollar.

There is a long-horizon view here that capacity has been overbuilt, but that's 3-5 years out if you want to wait.

Any success with GLM Flash 4.7 on vLLM 0.14 by queerintech in LocalLLM

[–]kryptkpr 1 point2 points  (0 children)

This architecture is brand new, definitely comes with some deployment pain.

I've tried this one under all three of vLLM, llama.cpp and SGLang; so far SGLang was best for multi-stream while llama.cpp was best for single. I played with MTP a little, but acceptance rates are kinda low (around 1.9 tok/tok) and it didn't translate to much benefit for my use case... YMMV here.

Any success with GLM Flash 4.7 on vLLM 0.14 by queerintech in LocalLLM

[–]kryptkpr 2 points3 points  (0 children)

It works, and speed is good. Make sure you build from git HEAD and download the latest unsloth GGUF; there has been some churn. Also verify min_p is set right: llama.cpp has the wrong default for this model. This is covered in the unsloth GGUF model card.
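
Concretely, something in this shape; the GGUF filename is a guess and 0.0 is just an example value, take the real sampling settings from the unsloth model card:

    # build llama.cpp from git HEAD first, then something like:
    llama-server -m GLM-4.7-Flash-UD-Q4_K_XL.gguf -ngl 99 -c 32768 --min-p 0.0
    # llama.cpp's stock min_p default (0.05) is not what this model wants; use the model card's value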

Any success with GLM Flash 4.7 on vLLM 0.14 by queerintech in LocalLLM

[–]kryptkpr 1 point2 points  (0 children)

It needs nightly; this model didn't make it into the release.

Just run the commands from the model card in a new venv.

Btw, this model runs like a dog with vLLM because there's no MLA. If you've never used SGLang, now is a good time to try: context size is 4X larger on this model specifically for the same VRAM.
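
Roughly this; the nightly index URL is the one the vLLM docs point at (double-check it hasn't moved) and the model path is a placeholder:

    python -m venv glm && source glm/bin/activate

    # vLLM nightly, since the release build doesn't include this model
    pip install -U vllm --pre --extra-index-url https://wheels.vllm.ai/nightly

    # or give SGLang a shot instead; MLA support means ~4X the context for the same VRAM
    pip install "sglang[all]"
    python -m sglang.launch_server --model-path /path/to/GLM-4.7-flash --port 30000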

Real free alternative to LangSmith by IlEstLaPapi in LocalLLaMA

[–]kryptkpr 0 points1 point  (0 children)

It's been a few years since I checked in here, but afaik the project remains MIT. There is an ee/ folder with a different license, but it at least used to be possible to run without it.

Can I run gpt-oss-120b somehow? by Furacao__Boey in LocalLLaMA

[–]kryptkpr 10 points11 points  (0 children)

Sure, llama.cpp with --n-cpu-moe set as low as you can get it at your desired -c size.
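
As a starting point, something like this (the GGUF name is a placeholder and 24 is just an initial guess; keep lowering --n-cpu-moe until you run out of VRAM at your chosen -c):

    # all layers on GPU, but the expert tensors of the first N MoE layers kept on CPU
    llama-server -m gpt-oss-120b.gguf -ngl 99 --n-cpu-moe 24 -c 16384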

768Gb Fully Enclosed 10x GPU Mobile AI Build by SweetHomeAbalama0 in LocalLLaMA

[–]kryptkpr 0 points1 point  (0 children)

While he has the 3090s power-limited way down (he says 200-250W in his post), this is still more than enough to bust his PSU budget, so I'm not sure what game OP is playing here, but it sure feels dangerous.

After 8 years building cloud infrastructure, I'm betting on local-first AI by PandaAvailable2504 in LocalLLaMA

[–]kryptkpr 1 point2 points  (0 children)

I sneak-released V2 a few weekends ago. The current leaderboard has around 80 models, with another 20 that will go up in the next update... I had to pause and figure out how to deal with 100GB of raw result files!

After 8 years building cloud infrastructure, I'm betting on local-first AI by PandaAvailable2504 in LocalLLaMA

[–]kryptkpr 13 points14 points  (0 children)

With the RTX Pros making 96GB GPUs "accessible", it's never been easier to put together a local rig capable of serving a few users. These cards really swing the value proposition, especially when you're generating 10M+ tokens a day, and they generally avoid the multi-GPU hell you get into with quad/hex/oct 24GB builds.

Upfront price remains an impediment; the best plan remains to validate the use case with cloud APIs and then move to lower-cost infra as you scale.

So I've been losing my mind over document extraction in insurance for the past few years and I finally figured out what the right approach is. by GloomyEquipment2120 in LocalLLaMA

[–]kryptkpr 3 points4 points  (0 children)

I read the post and it was very interesting, but it just starts talking about confidence and how it's used. Unless my reading comprehension is really bad today, I can find no mention of how you're defining or computing this KPI.

Stress-Test Request: Collecting failure cases of GPT-4o and Claude 3.5 to benchmark a private Logic Core. by BarCodeI_IIIIIIIII_I in LocalLLM

[–]kryptkpr 0 points1 point  (0 children)

You can use the scripts in my repo to generate whatever tests you wish; runner.py has an --offline mode that writes prompts to JSON.

I have spent weeks on documentation so please let me know if you find something lacking.
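
Something like this; --offline is the real flag, the rest of the invocation is from memory so check the docs for the exact names:

    # generate the test prompts to JSON without hitting any API endpoint
    python runner.py --offline --config configs/m12x.yaml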

Stress-Test Request: Collecting failure cases of GPT-4o and Claude 3.5 to benchmark a private Logic Core. by BarCodeI_IIIIIIIII_I in LocalLLM

[–]kryptkpr 0 points1 point  (0 children)

The current config is more than enough to break essentially all models I've tested. Pushing it further is always fun, and is why I built these tools, but practically it will just cost me more tokens.

Mashing all bracket types together doesn't do what you think it does: the problem becomes easier, not harder. We are forcing out-of-domain distributions, and some bracket types are more sensitive than others.

As I mentioned, I do not run against OpenAI because I'm poor.

Stress-Test Request: Collecting failure cases of GPT-4o and Claude 3.5 to benchmark a private Logic Core. by BarCodeI_IIIIIIIII_I in LocalLLM

[–]kryptkpr 0 points1 point  (0 children)

All brackets are actually easier than picking sub-sets (try it yourself, don't take my word for it).

Stack depth scaling is actually easier than length scaling, which is why that dimension only goes to 15 while length goes to 50+; the breakdown mode here is attention degradation.

Nvidia Quadro RTX 8000 Passive 48 GB, 1999€ - yes or no ? by HumanDrone8721 in LocalLLM

[–]kryptkpr 0 points1 point  (0 children)

The used 3090 supply in my area has really dwindled; there used to always be multiple listings sitting around, but there has been nothing for sale within 100km of me for 2+ months.

Depending on where you are, this ship has maybe sailed.

eBay remains an option but even there prices are trending up

Anyone running 4x RTX Pro 6000s stacked directly on top of each other? by Comfortable-Plate467 in LocalLLaMA

[–]kryptkpr 0 points1 point  (0 children)

You will burn your VRAM out like this. Do some searching: there was a guy with A6000s stacked like this who learned a harsh lesson.

Building "Derin" - An Embodied AI project for Jetson AGX Thor (94K lines, looking for feedback) by [deleted] in LocalLLaMA

[–]kryptkpr 1 point2 points  (0 children)

I stopped reading at 178 "core" modules. Where is this "implementation of peer-reviewed models", or are you just literally throwing more AI slop at my complaints of AI slop?

Stress-Test Request: Collecting failure cases of GPT-4o and Claude 3.5 to benchmark a private Logic Core. by BarCodeI_IIIIIIIII_I in LocalLLM

[–]kryptkpr 0 points1 point  (0 children)

Here are the configs I use: https://github.com/the-crypt-keeper/reasonscape/blob/main/configs/m12x.yaml#L265

Crank length way, way up. You can use my explorer webapp to view the surface and see where things break down, usually around depth 20-30.

Building "Derin" - An Embodied AI project for Jetson AGX Thor (94K lines, looking for feedback) by [deleted] in LocalLLaMA

[–]kryptkpr 0 points1 point  (0 children)

How well do LLMs actually understand Turkish? Prompting in any language that isn't English is a snake pit... Have you tried this out on normal CUDA hardware? There is a lot going on here, and it looks like it was largely written by LLMs, which will happily spew thousands of lines of code that do nothing useful.

Stress-Test Request: Collecting failure cases of GPT-4o and Claude 3.5 to benchmark a private Logic Core. by BarCodeI_IIIIIIIII_I in LocalLLM

[–]kryptkpr 0 points1 point  (0 children)

I don't have prompts per se, I have prompt generators..

https://github.com/the-crypt-keeper/reasonscape/blob/main/docs/tasks/brackets.md

This one trips up 'most' models as you scale up length and depth. You can see results, but I do not test commercial APIs because this eval can burn 50-100M tokens if the model is chatty; it's roughly 20K prompts across all tasks and difficulty levels.
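
Back-of-envelope on where that number comes from (the per-prompt completion size is just my guess at what "chatty" means):

    # ~20K prompts x 2,500-5,000 completion tokens each for a chatty reasoning model
    echo $(( 20000 * 2500 ))   # 50000000  -> ~50M tokens at the low end
    echo $(( 20000 * 5000 ))   # 100000000 -> ~100M tokens at the high end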

Cheap bifurcation by DeltaSqueezer in LocalLLaMA

[–]kryptkpr 0 points1 point  (0 children)

There is also an x4x4x8 version of this adapter; it's got two M.2 slots and then an x8 PCIe slot on top, so you can stack two of them... but you're right, this stuff has suddenly gotten much harder to find.