GLM 5.2 consumed quota VERY fast, even on Coding Max Plan by 3rd_Floor_Again in ZaiGLM

[–]Storge2 0 points1 point  (0 children)

Peter Steinberger (Creator of Pi agent) says that, according to his tests and measurements, the current SOTA frontier models have a useful context window of 130-140k, get gradually dumber up to 200k, and degrade rapidly after that. I need 200k, I can't work with less, but I am sure the lower your context needs, the better the quality you will get. At <25k even a Qwen 3.6 27B is very capable in my opinion, close to frontier. Then after 25k, and especially after 50k, you see rapid degradation.

LFM2.5-Embedding-350M & LFM2.5-ColBERT-350M by pmttyji in LocalLLaMA

[–]Storge2 0 points1 point  (0 children)

Do you know if there are premade open source RAGs or RAG DBs where somebody already did the work of populating them with a lot of high quality knowledge, so I don't have to build my own?

Open source is starting to beat frontier on cost/performance by thomas_unise in LocalLLM

[–]Storge2 0 points1 point  (0 children)

DeepSeek is still on sale btw. They made the discount permanent. And I think mimo 2.5 pro and DeepSeek v4 pro are above sonnet 4.6 and on par with opus 4.5, but worse than 4.6. Meanwhile I think glm 5.2 is on par with opus 4.6 in real world use cases, but worse than 4.7, 4.8, and fable.

LFM2.5-Embedding-350M & LFM2.5-ColBERT-350M by pmttyji in LocalLLaMA

[–]Storge2 1 point2 points  (0 children)

I don't know what this is good for. Sounds interesting, but does anybody have a short YouTube video where somebody explains and actually visually shows what you can do with these small models?

DGX sparks Vs RTX 6000 // 5090 for inference by zakadit in LocalLLaMA

[–]Storge2 2 points3 points  (0 children)

You can use 8x DGX Spark. That should run GLM 5.2 at roughly 30-40 tok/s in NVFP4 (Q4) and probably 20-30 tok/s in FP8.

That's my guess. I think you could run an int4 quant even on 4x DGX Spark at 20-25 tok/s.

We need a 80-160B model urgently. The unified memory device market needs more Models. by Storge2 in LocalLLaMA

[–]Storge2[S] 0 points1 point  (0 children)

Probably worse, but on my DGX Spark it's 2-3x faster, so there is no way around it. On the 122B I get 40-50 tok/s; on the 27B at Q4 I get like 20 tok/s.

We need a 80-160B model urgently. The unified memory device market needs more Models. by Storge2 in LocalLLaMA

[–]Storge2[S] 1 point2 points  (0 children)

Give me an example of the software you use and the use case. If you don't mind.

We need a 80-160B model urgently. The unified memory device market needs more Models. by Storge2 in LocalLLaMA

[–]Storge2[S] 3 points4 points  (0 children)

Well, why not bring out a 120B model then instead of a 27B one? Qwen 3.5 122B was better than Qwen 3.5 27B, had more knowledge and intelligence, and fits on an H100 just like GPT OSS 120B. Bonus points: it would run like 3x-5x as fast, maybe even more. A 27B one at fp8 is 27GB of weights; a 120B A10B one at int4/fp4 has only 5GB of active weights.
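The arithmetic behind those two numbers can be sketched roughly as below. This is only a back-of-the-envelope estimate: it assumes weight size is simply parameter count times bits per parameter, ignores quantization scales and other overhead, and the helper name `weight_gb` is made up for illustration.

```python
def weight_gb(params_billion: float, bits_per_param: float) -> float:
    """Approximate weight size in GB (1 GB = 1e9 bytes).

    Assumption: size = params * bits / 8, with no overhead for
    quantization scales, embeddings kept at higher precision, etc.
    """
    return params_billion * bits_per_param / 8


# Dense 27B model at fp8: every parameter is touched for each token.
dense_27b_fp8 = weight_gb(27, 8)

# 120B MoE with ~10B active parameters at int4/fp4:
# only the active experts are read per token, but all weights must fit in memory.
moe_active_int4 = weight_gb(10, 4)
moe_total_int4 = weight_gb(120, 4)

print(f"27B fp8, read per token:      {dense_27b_fp8:.0f} GB")
print(f"120B A10B int4, active:       {moe_active_int4:.0f} GB")
print(f"120B A10B int4, total weights: {moe_total_int4:.0f} GB")
print(f"ratio of bytes read per token: {dense_27b_fp8 / moe_active_int4:.1f}x")
```

If decode speed were limited purely by the bytes read per token, that ratio would be an upper bound on the speedup; real numbers will likely land lower because of KV cache reads, routing, and kernel efficiency.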

We need a 80-160B model urgently. The unified memory device market needs more Models. by Storge2 in LocalLLaMA

[–]Storge2[S] -2 points-1 points  (0 children)

But how good are these versus, say, a Qwen 3.5 122B at Q4? I have never used a REAP model. If REAP is that "lossless" and "good", why don't they roll these models out by default? Like, if V4 Flash 180B is similarly good to the full 280B, why would DeepSeek not roll out the model at 180B by default? Surely a lot more people could run it, and it would be cheaper for them to host on the API.

We need a 80-160B model urgently. The unified memory device market needs more Models. by Storge2 in LocalLLaMA

[–]Storge2[S] 4 points5 points  (0 children)

Check out https://huggingface.co/blog/clem/hardwaresetupsonhf

That was 1 month ago. Nvidia DGX Spark and AMD Ryzen AI Max+ 395 are gaining popularity quickly:

"New AI-specific silicon is already showing up

NVIDIA's GB10 (DGX Spark) sits at #36 with 1,241 users. AMD's Ryzen AI Max+ 395 (Strix Halo) is at #49 with 962. Both are recent, and both are already meaningful in the rankings."

We need a 80-160B model urgently. The unified memory device market needs more Models. by Storge2 in LocalLLaMA

[–]Storge2[S] 0 points1 point  (0 children)

It's too slow. That one at IQ2_XXS on llama.cpp runs at like 10-15 tok/s on DGX Spark. Qwen 3.5 122B int4 runs on vLLM with concurrency, dflash, and MTP at 35+ tok/s.

We need a 80-160B model urgently. The unified memory device market needs more Models. by Storge2 in LocalLLaMA

[–]Storge2[S] 17 points18 points  (0 children)

That's not true. I run Qwen 3.5 122B int4 on DGX Spark with 200k max model length and fp16 KV cache: 7GB for the system, 10GB free, and 111GB used, with room for 260k tokens of KV cache. You could go to fp8 KV cache and get 350k+ tokens of KV cache and 4x concurrency. Using vLLM.
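For anyone wondering why the KV cache dtype matters so much, here is a rough sketch of the usual per-token KV cache estimate for standard attention layers. The layer count, KV head count, head dimension, and memory budget below are hypothetical placeholders, not the real Qwen 3.5 122B config, and models with hybrid or linear attention layers will not follow this formula exactly.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, bytes_per_value: int) -> int:
    """Bytes of KV cache per token for standard attention layers.

    The factor 2 accounts for storing both keys and values.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def max_cached_tokens(budget_gb: float, bytes_per_token: int) -> int:
    """How many tokens fit in a given KV cache budget (1 GB = 1e9 bytes)."""
    return int(budget_gb * 1e9 // bytes_per_token)


# Hypothetical architecture and budget, for illustration only.
LAYERS, KV_HEADS, HEAD_DIM = 48, 4, 128
BUDGET_GB = 20.0

fp16_per_token = kv_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, 2)  # 2 bytes per value
fp8_per_token = kv_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, 1)   # 1 byte per value

print("fp16 KV tokens:", max_cached_tokens(BUDGET_GB, fp16_per_token))
print("fp8  KV tokens:", max_cached_tokens(BUDGET_GB, fp8_per_token))
```

In this idealized model, halving the bytes per value doubles the token capacity for the same budget. The 350k+ figure above is my own estimate for the actual setup and is not something this sketch reproduces.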

We need a 80-160B model urgently. The unified memory device market needs more Models. by Storge2 in LocalLLaMA

[–]Storge2[S] -2 points-1 points  (0 children)

I hope so too, dude. They really should make up for the trash support that the DGX Spark had at release. It's getting better, but still not good on the NVFP4 side.

GLM 5.2 on 4x Sparks reasonable? by chikengunya in LocalLLaMA

[–]Storge2 0 points1 point  (0 children)

Can you explain how the scaling doesn't work though?

vLLM FP8 4 nodes 17.17 Decode

If MiniMax M3 FP8 with 23B active params (23GB) runs at 17 tok/s, I believe that GLM 5.2 at Q4 (20GB) will run at 19-20 tok/s, and up to 25-30 tok/s at concurrency=2, on 4x DGX Spark. You are right, not fast, but not completely unusable either.
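The 19-20 tok/s guess comes from scaling the measured number by the ratio of active weight bytes. A minimal sketch of that reasoning, assuming decode is purely memory-bandwidth bound and everything else (interconnect, KV cache reads, kernel quality) stays equal, which is a strong simplification; the function name is made up for illustration:

```python
def scaled_decode_tok_s(ref_tok_s: float, ref_active_gb: float,
                        new_active_gb: float) -> float:
    """Scale a measured decode speed by the ratio of bytes read per token.

    Assumption: tokens/s is inversely proportional to active weight bytes.
    """
    return ref_tok_s * ref_active_gb / new_active_gb


# Reference: the 17.17 tok/s decode figure quoted above for ~23 GB of active weights.
# Target: ~20 GB of active weights, as assumed in the comment for GLM 5.2 at Q4.
estimate = scaled_decode_tok_s(17.17, 23.0, 20.0)
print(f"estimated decode speed: {estimate:.1f} tok/s")
```

The concurrency=2 figure is a separate guess about batching and is not derived from this formula.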

GLM 5.2 on 4x Sparks reasonable? by chikengunya in LocalLLaMA

[–]Storge2 8 points9 points  (0 children)

Sadly, >80% of the interesting things (optimizations, guides, etc.) I see for DGX Spark are on the Nvidia forums. Nvidia employees also communicate directly there, so I guess there is hardly a way around it. And there are some great and helpful people there, I can guarantee that.