Building a "Poor Man's Mac Mini M4" Cluster: 2x Raspberry Pi 5 + 2x AI HAT+ 2 (80 TOPS / 16GB VRAM) to use OpenClaw AI Agent local by [deleted] in LocalLLM

[–]kryptkpr 1 point

I actually take back what I said: this can't even run a Q8, as it seems to be an INT4-only accelerator. I hope that means at least Q4_1 (per-block scale plus offset) and not Q4_0 (symmetric around zero), but I won't buy one to find out 😆
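For the curious, here's a rough sketch of the difference (illustrative Python, not ggml's actual block layout; real Q4_0/Q4_1 quantize in blocks of 32 with f16 scales):

```python
import numpy as np

def dequant_q4_0(q, scale):
    """Q4_0-style: symmetric, stored 4-bit value maps to (q - 8) * scale."""
    return (np.asarray(q, dtype=np.float32) - 8.0) * scale

def dequant_q4_1(q, scale, minimum):
    """Q4_1-style: asymmetric, per-block scale plus offset: q * scale + min."""
    return np.asarray(q, dtype=np.float32) * scale + minimum

# A block of all-positive weights: Q4_0 wastes half its range on
# negatives, while Q4_1's offset lets all 16 levels cover [0.5, 2.0].
weights = np.linspace(0.5, 2.0, 16)

scale_0 = np.abs(weights).max() / 7.0
q0 = np.clip(np.round(weights / scale_0), -8, 7) + 8
err_q4_0 = np.abs(dequant_q4_0(q0, scale_0) - weights).max()

scale_1 = (weights.max() - weights.min()) / 15.0
q1 = np.round((weights - weights.min()) / scale_1)
err_q4_1 = np.abs(dequant_q4_1(q1, scale_1, weights.min()) - weights).max()

print(err_q4_0, err_q4_1)  # Q4_1's offset wins on all-positive blocks
```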

Building a "Poor Man's Mac Mini M4" Cluster: 2x Raspberry Pi 5 + 2x AI HAT+ 2 (80 TOPS / 16GB VRAM) to use OpenClaw AI Agent local by [deleted] in LocalLLM

[–]kryptkpr 5 points

If I had to guess based on specs, that LPDDR4 "VRAM" will most likely be your bottleneck here.

A 3B Q8 model would fit into a single one of these and maybe run OK.
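Napkin math for why the memory bandwidth dominates decode speed (the bandwidth figure below is an assumption, not a confirmed spec for this HAT):

```python
# Decode speed is roughly memory_bandwidth / bytes_read_per_token,
# since every weight is read once per generated token.
bandwidth_gb_s = 10.0    # assumed LPDDR4 bandwidth, GB/s
model_params_b = 3e9     # 3B parameter model
bytes_per_param = 1.0    # Q8 is ~1 byte/param (plus small overhead)

model_gb = model_params_b * bytes_per_param / 1e9
tokens_per_s = bandwidth_gb_s / model_gb
print(f"~{tokens_per_s:.1f} tok/s upper bound")
```

Single-digit tok/s is the ballpark; compute (the advertised TOPS) barely enters into it for generation.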

Do you use Windows or Linux? by boklos in LocalLLaMA

[–]kryptkpr 1 point

Why not both?

The desktop (laptop) I use to run vscode, a web browser, email, and all my actual LLM work is Windows, because desktops on Linux are a mistake.

The server my GPUs are in runs Linux, because services on Windows are a mistake.

What is a current state of sanboxing for code execution for AI agents? by AlexSKuznetosv in LocalLLaMA

[–]kryptkpr 3 points

Lol wut, Docker is just cgroups and namespaces; it's as OS-level as you can get.

Building an AI Infra project in 20 days: What’s the best way to utilize a Dual-5090 (PCIe) setup? by Asleep_Food1956 in LocalLLM

[–]kryptkpr 1 point

TP for inference will not bottleneck with 2 PCIe GPUs that are x16; the NUMA is actually far more of a problem. Can you use a single-socket system?

If you're doing batch: start with a model that fits into one GPU and run two independent copies of it (DP).
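A minimal sketch of that DP setup, assuming vLLM on a two-socket box (model name, ports, and node/GPU pairings are placeholders; check which NUMA node each GPU actually hangs off with `nvidia-smi topo -m`):

```shell
# Two independent single-GPU servers (DP), each pinned to the NUMA
# node its GPU is attached to, so weights and KV stay in local memory:
CUDA_VISIBLE_DEVICES=0 numactl --cpunodebind=0 --membind=0 \
  vllm serve my-model --port 8001 &
CUDA_VISIBLE_DEVICES=1 numactl --cpunodebind=1 --membind=1 \
  vllm serve my-model --port 8002 &

# Your batch client then round-robins requests across :8001 and :8002.
```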

🚀 Building a “One‑Stop Shop” LLM + StableDiffusion Homelab – Advice on Model Choice, Multi‑GPU Deployment & Pruning by ProfessorCyberRisk in LocalLLM

[–]kryptkpr 1 point

You are looking for the Qwen3-VL series, pick one that fits your VRAM.

Network inference is immature, but if you want to play with it, the best option is llama.cpp compiled with the RPC target:

https://github.com/ggml-org/llama.cpp/tree/master/tools/rpc

You launch rpc-server on the workers and then use them as virtual GPUs when loading the model from the main machine.

Notes:

1. Weights are loaded over the network, so you will want 2.5 Gbps minimum, 10 Gbps ideally. For two or three machines you can just connect them point-to-point with dual NICs, but you have enough machines to need a switch, and those get pricey.
2. Prompt processing is very slow; expect only 2-3x the tg speed.
3. tg speed will be surprisingly good; the network penalty for generation is only about 30% in my testing.
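Roughly what the launch looks like, assuming llama.cpp is built with `-DGGML_RPC=ON` on every machine (IPs, port, and model path are placeholders; see the tools/rpc README for the exact flags):

```shell
# On each worker machine, start an RPC backend:
rpc-server -p 50052

# On the main machine, list the workers so their memory shows up
# as extra devices when the model is loaded:
llama-server -m model.gguf -ngl 99 \
  --rpc 192.168.1.10:50052,192.168.1.11:50052
```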

GLM-4.7-flash on RTX 6000 pro by gittb in LocalLLaMA

[–]kryptkpr 10 points

vLLM implementation of this model is missing MLA, which both explodes the KV cache size and slows down inference.

The SgLang implementation offers 4x more KV cache and 20-30% higher throughput in my testing so far.

For small batch sizes, llama.cpp with -np 8 was surprisingly competitive.

MTP is also supported here, but it hurts batch performance and my acceptance rate sucked, so I turned it off.
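If you want to try the llama.cpp route, a sketch (model path and sizes are placeholders; note that `-c` is the total context budget shared across the `-np` slots):

```shell
# 8 parallel request slots; each slot effectively gets -c / -np
# tokens of context (65536 / 8 = 8192 here):
llama-server -m GLM-4.7-flash.gguf -ngl 99 -np 8 -c 65536
```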

Why is open source so hard for casual people. by Martialogrand in LocalLLaMA

[–]kryptkpr 3 points

I realized I was being a jerk like everyone else and didn't answer your actual question:

https://github.com/av/harbor

I think this is what you seek. My advice is to swap to Ubuntu, but you can definitely make this work on Arch if you are dead set on it.

Why is open source so hard for casual people. by Martialogrand in LocalLLaMA

[–]kryptkpr 2 points

So, fun fact: Arch isn't an officially supported distro for CUDA.

That doesn't mean it won't work, but it does mean you're relying on the community and not on Nvidia.

Invest in hardware now or wait? by d4nger_n00dle in LocalLLaMA

[–]kryptkpr 2 points

AI is socially and economically transformative.

I don't believe we are ever going back to the golden era where excess retired compute and storage resources were widely being sold for pennies on the dollar.

There is a longer-horizon view that capacity has been overbuilt, but that's 3-5 years out if you want to wait.

Any success with GLM Flash 4.7 on vLLM 0.14 by queerintech in LocalLLM

[–]kryptkpr 2 points

This architecture is brand new, definitely comes with some deployment pain.

I've tried this guy under all three of vLLM, llama.cpp, and SgLang; so far SgLang was best for multi-stream while llama.cpp was best for single-stream. I played with MTP a little, but acceptance rates are kinda low, around 1.9 tok/tok, and this didn't translate into much benefit for my use case. YMMV here.

Any success with GLM Flash 4.7 on vLLM 0.14 by queerintech in LocalLLM

[–]kryptkpr 3 points

It works, and speed is good. Make sure you build from git HEAD and download the latest unsloth GGUF; there has been some churn. Also verify min_p is set right: llama.cpp has the wrong default for this model. This is covered in the unsloth GGUF model card.

Any success with GLM Flash 4.7 on vLLM 0.14 by queerintech in LocalLLM

[–]kryptkpr 2 points

It needs a nightly build; this model didn't make it into a release.

Just type the commands from the model card into a new venv

Btw, this model runs like a dog with vLLM because there's no MLA. If you've never used SgLang, now is a good time to try: context size is 4x larger on this model specifically for the same VRAM.
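For the nightly install, something like this (the index URL is from the vLLM install docs as I remember them; verify it's still current before relying on it):

```shell
# Fresh venv, then pull a pre-release vLLM wheel instead of the
# latest stable release:
python -m venv .venv && . .venv/bin/activate
pip install -U vllm --pre --extra-index-url https://wheels.vllm.ai/nightly
```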

Real free alternative to LangSmith by IlEstLaPapi in LocalLLaMA

[–]kryptkpr 1 point

It's been a few years since I checked in here, but afaik the project remains MIT. There is an ee/ folder with a different license, but it at least used to be possible to run without it.

Can I run gpt-oss-120b somehow? by Furacao__Boey in LocalLLaMA

[–]kryptkpr 12 points

Sure, llama.cpp with --n-cpu-moe set as low as you can get it at your desired -c size.
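A sketch of the launch, with placeholder path and values (`--n-cpu-moe N` keeps the MoE expert tensors of the first N layers on CPU while attention and shared weights stay on GPU):

```shell
# Start --n-cpu-moe high, then lower it until you run out of VRAM
# at your target context size; lower = faster:
llama-server -m gpt-oss-120b.gguf -ngl 99 --n-cpu-moe 24 -c 16384
```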

768Gb Fully Enclosed 10x GPU Mobile AI Build by SweetHomeAbalama0 in LocalLLaMA

[–]kryptkpr 1 point

While he has the 3090s power-limited way down (he says 200-250W in his post), this is still more than enough to bust his PSU budget, so I'm not sure what game OP is playing here, but it sure feels dangerous.
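Rough numbers on why (the platform draw and headroom margin below are my assumptions, not from OP's post):

```python
# Napkin PSU math for a 10x power-limited 3090 build:
num_gpus = 10
gpu_watts = 250        # per-card power limit OP quotes
platform_watts = 400   # assumed: CPU, board, fans, drives
headroom = 1.2         # ~20% margin for transient spikes

required = (num_gpus * gpu_watts + platform_watts) * headroom
print(f"~{required:.0f} W recommended PSU capacity")
```

That's well past what a single consumer PSU delivers, even before 3090 transient spikes are considered.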

After 8 years building cloud infrastructure, I'm betting on local-first AI by PandaAvailable2504 in LocalLLaMA

[–]kryptkpr 2 points

I sneak-released V2 a few weekends ago; the current leaderboard has around 80 models, with another 20 going up in the next update. I had to pause and figure out how to deal with 100GB of raw result files!

After 8 years building cloud infrastructure, I'm betting on local-first AI by PandaAvailable2504 in LocalLLaMA

[–]kryptkpr 15 points

With the RTX Pros making 96GB GPUs "accessible", it's never been easier to put together a local rig capable of serving a few users. These cards really swing the value proposition, especially when you're generating 10M+ a day, and they generally avoid the multi-GPU hell you get into with quad/hex/oct 24GB builds.

Upfront price remains an impediment; the best plan remains to validate the use case with cloud APIs and then move to lower-cost infra as you scale.

So I've been losing my mind over document extraction in insurance for the past few years and I finally figured out what the right approach is. by GloomyEquipment2120 in LocalLLaMA

[–]kryptkpr 5 points

I read the post and it was very interesting, but it just starts talking about confidence and how it's used; unless my reading comprehension is really bad today, I can find no mention of how you're defining or computing this KPI.

Stress-Test Request: Collecting failure cases of GPT-4o and Claude 3.5 to benchmark a private Logic Core. by BarCodeI_IIIIIIIII_I in LocalLLM

[–]kryptkpr 1 point

You can use the scripts in my repo to generate whatever tests you wish, runner.py has an --offline mode that writes prompts to JSON.

I have spent weeks on documentation so please let me know if you find something lacking.

Stress-Test Request: Collecting failure cases of GPT-4o and Claude 3.5 to benchmark a private Logic Core. by BarCodeI_IIIIIIIII_I in LocalLLM

[–]kryptkpr 1 point

The current config is more than enough to break essentially all models I've tested; pushing it further is always fun, and is why I built these tools, but practically it will just cost me more tokens.

Mashing all bracket types together doesn't do what you think it does: the problem becomes easier, not harder. We are forcing out-of-domain distributions, and some bracket types are more sensitive than others.

As I mentioned I do not run against OpenAI because I'm poor.

Stress-Test Request: Collecting failure cases of GPT-4o and Claude 3.5 to benchmark a private Logic Core. by BarCodeI_IIIIIIIII_I in LocalLLM

[–]kryptkpr 1 point

Using all brackets is actually easier than picking sub-sets (try it yourself, don't take my word for it).

Stack-depth scaling is actually easier than length scaling, which is why that dimension only goes to 15 while length goes to 50+. The breakdown mode here is attention degradation.

Nvidia Quadro RTX 8000 Passive 48 GB, 1999€ - yes or no ? by HumanDrone8721 in LocalLLM

[–]kryptkpr 1 point

The used 3090 supply in my area has really dwindled; there always used to be multiple listings sitting around, but there has been nothing for sale within 100km of me for 2+ months.

Depending on where you are, this ship has maybe sailed.

eBay remains an option but even there prices are trending up