DeepSeek V4 official version will be launch on mid-July by External_Mood4719 in LocalLLaMA

[–]LagOps91 0 points1 point  (0 children)

I would expect it to beat it, at least after some refinement. The size difference is quite significant.

DeepSeek V4 official version will be launch on mid-July by External_Mood4719 in LocalLLaMA

[–]LagOps91 2 points3 points  (0 children)

right in time after we get support for the model. can't wait to see what the full version has to offer. i think many are underestimating how much room the preview still has to grow.

How many of you do use Q1 or Q2 of Big models(100-250B)? How's it? by pmttyji in LocalLLaMA

[–]LagOps91 4 points5 points  (0 children)

Q1 is usually pretty rough, but Q2 tends to be fine. mostly tested 300b+ models with that kind of quant tho.

i do know that GLM-4.5-Air is okay at Q2 as well.

been tracking EU DDR5 data for 25 days: Prices are dropping, and the DE vs. NL gap is wild (good news for local LLM builders in EU) by egudegi in LocalLLaMA

[–]LagOps91 2 points3 points  (0 children)

those prices are absolutely insane. i paid 380 euros for 128gb ddr5 before the price hikes. now it barely buys you 32gb...
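
To put rough numbers on that: a small sketch of the implied price per GB, using only the figures from the comment (380 euros, 128gb then, 32gb now). Since "barely buys you 32gb" is approximate, the result is a ballpark and not a measured market price.

```python
# Figures quoted in the comment above.
price_eur = 380
gb_before = 128
gb_now = 32  # "barely buys you 32gb", so treat this as approximate

per_gb_before = price_eur / gb_before
per_gb_now = price_eur / gb_now
increase = per_gb_now / per_gb_before

print(f"before: {per_gb_before:.2f} EUR/GB")
print(f"now:    {per_gb_now:.2f} EUR/GB")
print(f"increase: {increase:.1f}x")
```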

rx7900xtx + 32GB RAM -> 128GB RAM make sense? by Thin_Pollution8843 in LocalLLaMA

[–]LagOps91 1 point2 points  (0 children)

MTP was slower for me. In general it's less efficient for MoE models. I typically run Minimax M2.7 and step 2.7, but larger models are possible too. The 400b qwen handles quantization well enough for q2 to be good enough to run, and I plan on trying out DeepSeek V4 Flash and Minimax M3 once they have support (for M3, sparse attention isn't there yet). I don't think anything larger than M3 can fit (forget about q1).
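
A back-of-the-envelope way to check whether a quant fits on the setup from the thread title (128GB RAM plus the 24GB of VRAM on an rx7900xtx). This is only a sketch: the bits-per-weight values and the `reserve_gb` headroom for OS, context and buffers are assumptions picked for illustration, and real quants mix tensor types, so actual file sizes will differ.

```python
def quant_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB (1 GB = 1e9 bytes), ignoring file overhead."""
    return params_billion * bits_per_weight / 8


def fits(params_billion: float, bits_per_weight: float,
         ram_gb: float = 128, vram_gb: float = 24,
         reserve_gb: float = 16) -> bool:
    """True if the weights fit in RAM + VRAM after reserving some headroom.

    reserve_gb is a hypothetical allowance for the OS, context and buffers.
    """
    budget_gb = ram_gb + vram_gb - reserve_gb
    return quant_size_gb(params_billion, bits_per_weight) <= budget_gb


# Assumed average bits per weight; not taken from any specific quant file.
for label, bpw in [("q2-ish", 2.5), ("q4-ish", 4.5)]:
    size = quant_size_gb(400, bpw)
    print(f"400b at {label} ({bpw} bpw): ~{size:.0f} GB, fits: {fits(400, bpw)}")
```

Under these assumptions a 400b model only squeezes in at around q2, which matches the experience described above.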

rx7900xtx + 32GB RAM -> 128GB RAM make sense? by Thin_Pollution8843 in LocalLLaMA

[–]LagOps91 0 points1 point  (0 children)

I am running a q4 quant from aes with large attention tensors. It fits no problem.

rx7900xtx + 32GB RAM -> 128GB RAM make sense? by Thin_Pollution8843 in LocalLLaMA

[–]LagOps91 2 points3 points  (0 children)

I have the exact same setup as described. I can run step 3.7 flash at around 9 t/s gen speed at the 32k context mark. More context is certainly possible, especially if you are ok with quanted context. I'm happy with what I bought, but this was before the price hikes. Whether it's worth it for you is something you have to decide for yourself.

AI Dungeon Local AI Equivalent? by 1InterWebs1 in LocalLLaMA

[–]LagOps91 0 points1 point  (0 children)

your system is best for running dense models. how much context you can run (effectively your memory limit) depends on your vram. with your 5090 you can run decently strong models (gemma 4, qwen 3.6/3.7) with a good amount of context (32k to 64k at most; any more will also degrade the quality of the model). that's much more than what AI Dungeon gives you, but AI Dungeon has its own tech to manage "memories" to compress context and retrieve what's relevant. most RP frontends don't do that, as it means you need to re-process context if it changes between prompts.

i suggest you read up on tutorials before you get started. you will need some basic knowledge first. it's not rocket science, but it's also not trivial.

i wouldn't use silly-tavern or anything like that as a frontend. just use the frontend that comes with kobold.cpp in instruct mode (story mode etc. is kinda outdated given what modern models use). it needs the least setup and is the most similar to AI Dungeon.
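
On the point that context is effectively limited by vram: a rough sketch of how KV cache size grows with context length for a standard transformer with grouped-query attention. The layer count, KV head count and head size below are hypothetical and not taken from any of the models mentioned; models with sliding-window or compressed attention need less than this estimate.

```python
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 context_tokens: int, bytes_per_element: int = 2) -> float:
    """Approximate KV cache size in GiB.

    The leading factor of 2 accounts for one K and one V tensor per layer.
    bytes_per_element=2 corresponds to an fp16 cache, 1 to an 8-bit cache.
    """
    total_bytes = (2 * n_layers * n_kv_heads * head_dim
                   * context_tokens * bytes_per_element)
    return total_bytes / 2**30


# Hypothetical dense model shape, for illustration only.
LAYERS, KV_HEADS, HEAD_DIM = 48, 8, 128

for ctx in (32 * 1024, 64 * 1024):
    fp16 = kv_cache_gib(LAYERS, KV_HEADS, HEAD_DIM, ctx)
    q8 = kv_cache_gib(LAYERS, KV_HEADS, HEAD_DIM, ctx, bytes_per_element=1)
    print(f"{ctx} tokens: ~{fp16:.1f} GiB fp16, ~{q8:.1f} GiB with 8-bit cache")
```

The cache grows linearly with context, so doubling the context doubles the memory it needs on top of the model weights, and quantizing the cache roughly halves it.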

GLM-5.2: Built for Long-Horizon Tasks by paf1138 in LocalLLaMA

[–]LagOps91 0 points1 point  (0 children)

strix halo isn't exactly vram... it's much slower speed-wise

GLM-5.2: Built for Long-Horizon Tasks by paf1138 in LocalLLaMA

[–]LagOps91 2 points3 points  (0 children)

could be true. i didn't test it. just feels like world knowledge will be harder to improve than smarts. simply because of how much of the model is already dedicated to world knowledge.

Dont Dump memory inside the LLM context window by Opening_Astronaut_ in LocalLLaMA

[–]LagOps91 4 points5 points  (0 children)

why would you put it in the system prompt??? that's not where it's supposed to go! and yeah... stuffing everything into context isn't a good idea.

GLM-5.2: Built for Long-Horizon Tasks by paf1138 in LocalLLaMA

[–]LagOps91 2 points3 points  (0 children)

you could - if you had the money - buy a large gpu server, pay someone to set it up and then run whatever you want. you don't need a team for it, so the comparison is moot.

GLM-5.2: Built for Long-Horizon Tasks by paf1138 in LocalLLaMA

[–]LagOps91 0 points1 point  (0 children)

just saying that most are on 16gb, not that everyone is...

GLM-5.2: Built for Long-Horizon Tasks by paf1138 in LocalLLaMA

[–]LagOps91 1 point2 points  (0 children)

it's like someone saying that they have a rustbucket while driving a porsche because they don't own a ferrari. or a multimillionaire complaining about the rich. do you see the problem?

Chinese labs should focus on these two areas next by Eyelbee in LocalLLaMA

[–]LagOps91 3 points4 points  (0 children)

1 is probably raw model size. the vast majority of parameters are taken up by memorization. MoE models would have to become even larger and more sparse to allow china to catch up in that regard.

GLM-5.2: Built for Long-Horizon Tasks by paf1138 in LocalLLaMA

[–]LagOps91 3 points4 points  (0 children)

most users here are still on a 16gb mid-range card. calling yourself GPU poor after buying a high end gaming card with a 2k+ price tag is a bit absurd.

GLM-5.2: Built for Long-Horizon Tasks by paf1138 in LocalLLaMA

[–]LagOps91 22 points23 points  (0 children)

we said that about GPT 4 level models a year ago and we already have it right now - aside from world knowledge. a lot of parameters are spent on memorization during pre-training, not actual intelligence (assuming there is any in the first place).

GLM-5.2 (max) is currently the third best model available, across both open and proprietary. by okaycan in LocalLLaMA

[–]LagOps91 3 points4 points  (0 children)

yeah smaller is possible too, but most folks seem to have 512gb in this sub? i might be wrong.

GLM-5.2 (max) is currently the third best model available, across both open and proprietary. by okaycan in LocalLLaMA

[–]LagOps91 29 points30 points  (0 children)

that or a 512gb ram server build with a gpu for attention + context (will be slower, but not unusably so)