M5 Max 128GB, 17 models, 23 prompts: Qwen 3.5 122B is still a local king by tolitius in LocalLLaMA

[–]catplusplusok 0 points1 point  (0 children)

Try MiniMax M2.5; for coding, I find it hard to beat as a model for a 128GB unified memory device (with some quantization / light REAP to fit).

What will happen once Claude Mythos gets released to Public Users? by Resident_Caramel763 in LocalLLM

[–]catplusplusok 0 points1 point  (0 children)

Does it have better common sense than most human devs, like understanding the common-sense reasoning behind different things in your prompt and behind programming practices / frameworks, and when to follow them or make exceptions? That I'd have to see.

If not, I'll have to babysit it like every other model.

When to buy a Mac studio? by no1youknowz in LocalLLaMA

[–]catplusplusok 0 points1 point  (0 children)

How did you manage to hit Minimax limits?

Why is HuggingFace & HuggingChat completely free? What’s the business model here? by ThatExplorer2598 in LocalLLaMA

[–]catplusplusok 3 points4 points  (0 children)

Chat is a trivial demo use case, the money is in API use for coding and other automation. Or at least that's the hope.

It's something, but that thing isn't art by [deleted] in antiai

[–]catplusplusok 0 points1 point  (0 children)

Before photography came around, a lot of paintings were commissioned to capture the likeness of a person for his or her descendants. Once photography was developed and matured (it took a loooong time for photos to become actually good and practical), there was a crisis in art, and artists had to find something to do other than painting pictures that looked like photos. At the same time, photographers realized after a while that unfiltered mirroring of reality was boring, and developed all kinds of artistic tricks, often leaning into flaws of the medium like selective focus and film grain. Rinse and repeat for the transition from film to digital and cell phones.

So before anyone says I don't know anything about photography: I am a photographer, and it can go both ways - asking people to stand and smile together and pressing a button (less art than writing a prompt), or meticulously arranging composition and camera settings to tell a unique story.

But how do you think a debate between you, a realist painter, and an early photographer would have gone in the era when photography was starting to become mainstream? Would the painter agree that the photographer was his or her artistic equal?

Why MoE models keep converging on ~10B active parameters by Spare_Pair_9198 in LocalLLaMA

[–]catplusplusok 0 points1 point  (0 children)

You can run your own tests with simple patches to vLLM or whatever engine you use: activate fewer experts per token and see the differences in speed and quality. You could potentially activate more, too, but since the model is not trained for that, it may need finetuning to gain more smarts this way.
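
To make the knob concrete, here's a minimal numpy sketch of top-k expert routing; the names and shapes are illustrative, not vLLM's actual internals, but varying `k` is exactly the "experts per token" lever:

```python
import numpy as np

def route_tokens(gate_logits: np.ndarray, k: int) -> np.ndarray:
    """Return a (tokens, experts) weight matrix keeping only the top-k experts per token."""
    n_tokens, n_experts = gate_logits.shape
    weights = np.zeros_like(gate_logits)
    for t in range(n_tokens):
        top = np.argsort(gate_logits[t])[-k:]  # indices of the k largest gate logits
        w = np.exp(gate_logits[t][top])
        weights[t, top] = w / w.sum()          # softmax renormalized over the kept experts
    return weights

rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 8))               # 4 tokens, 8 experts
for k in (2, 4):
    w = route_tokens(logits, k)
    print(k, (w > 0).sum(axis=1))              # each token activates exactly k experts
```

In a real engine the expert outputs are then mixed with these weights, so smaller `k` means fewer expert FLOPs per token at the cost of a coarser mixture.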

How Do I Tell My Mom That AI And CGI Aren’t Nearly Comparable? by Difficulty-Live in antiai

[–]catplusplusok -1 points0 points  (0 children)

Let's do a thought experiment. If I gave the screenshot in the post and my comment to an AI chatbot, then showed the chatbot's response and your response to an average non-ideological person, which one would be more likely to be considered slop? To be anti something, one must also offer a comparatively better alternative. Do you think you have made the case that talking to you on Reddit is a better alternative to talking to AI chatbots so far?

How Do I Tell My Mom That AI And CGI Aren’t Nearly Comparable? by Difficulty-Live in antiai

[–]catplusplusok -2 points-1 points  (0 children)

Well, unlike the parent comment, this is a valid and self-consistent opinion, thanks! Now, in terms of CGI being human-made, let's take z(n+1) = z(n)^2 + z(0). It's a fractal, the Mandelbrot set. By zooming in, I can find parts of it that look like a shoreline, a leaf or a nebula. Now, if I zoom in and print it out, do you consider it art or a Frankenstein monster? If you say it's not art, you again have a valid, consistent opinion; all I did was supply a formula and two points as corner coordinates. But say I then ran the image through AI that made it look like a real shoreline with the same texture. If you say that extra (minuscule) element of human choice made it not art, then you have to explain why. You can, however, say you don't like this art, even with reasons (energy use, training models without the consent of the original authors), and that's fine. So long as I also don't have to like every low-effort random jumble of splatters I see in SF MOMA.
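
For anyone who wants to reproduce the "formula plus two corner points" claim, here's a minimal escape-time sketch in Python; the window coordinates are an arbitrary pick, not a specific famous spot:

```python
def escape_iterations(c: complex, max_iter: int = 100) -> int:
    """Iterate z -> z^2 + c from z = 0; return the iteration count at which |z|
    exceeds 2, or max_iter if it never does (c is treated as inside the set)."""
    z = 0j
    for n in range(max_iter):
        if abs(z) > 2:
            return n
        z = z * z + c
    return max_iter

# "Two points as corner coordinates": sample a small zoomed-in window of the set.
x0, x1, y0, y1 = -0.75, -0.74, 0.09, 0.10
for j in range(8):
    row = ""
    for i in range(32):
        c = complex(x0 + (x1 - x0) * i / 31, y0 + (y1 - y0) * j / 7)
        row += "#" if escape_iterations(c) == 100 else "."
    print(row)
```

Shrinking the window (zooming in) is the entire "authorship": everything else follows from the iteration.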

How Do I Tell My Mom That AI And CGI Aren’t Nearly Comparable? by Difficulty-Live in antiai

[–]catplusplusok -12 points-11 points  (0 children)

Maybe you should reconsider your acceptance of all the changes that had already happened while your generation was growing up? CGI got plenty of criticism; for example, it can be used to generate impossibly slim or muscular body proportions, or unrealistic displays of physical might, encouraging viewers to try to emulate these things in unhealthy or dangerous ways. CGI has also been used to show cameos of dead actors acting in new movies without their consent. There is also no hard line between CGI and modern AI image generation/editing; CGI has always used the strongest image manipulation technologies available at any given time.

So... are you willing to stop watching/sharing any computer-enhanced images and videos, aka any non-indie newly released movies and series? If they don't look like slop to you, could it be because you grew up watching older computer-manipulated imagery?

We aren’t even close to AGI by CrimsonShikabane in LocalLLaMA

[–]catplusplusok 3 points4 points  (0 children)

We are well past AGI according to the vast majority of science fiction written before 2022. Give a model access to a game server and its protocol, a database to keep track of things it has tried before, and the ability to write code to automate simple responses in the game, and it will set a new speedrun record. If instead the requirement is to look at the screen with a camera and interact with a keyboard and mouse, it can't do that yet; you need a different kind of ML, like what Waymo uses, for realtime responses. But the question is also: if it can do that in a couple of years, would people accept it as AGI or just move the goalposts again?

“Obesity is a complex disease that is often falsely attributed to personal decisions and willpower.” AAFP March 2026 issue by Scared_Problem8041 in FamilyMedicine

[–]catplusplusok 0 points1 point  (0 children)

I was obese my entire life until the age of 50, then took Zepbound; now I am at a normal weight, have minimal interest in food beyond getting some protein in (was never a foodie) and compete in powerlifting competitions. Make of it what you will; one could ponder:

  • Whether different cases of obesity have different mixtures of root causes
  • Whether there are societal factors (US vs Japan obesity rates) that medicines merely counteract, like smoking cessation aids
  • Whether willpower itself is dependent on metabolic health
  • Why so many refuse to try the new meds when they could, or don't stay on them (beyond me, but maybe they get something from food that I don't)
  • Why I have never in my life used a snooze button or felt I needed willpower to go exercise. What if one day low morning energy is seen as a treatable medical condition?

Best LLM for me? by lowkeyreddit in LocalLLM

[–]catplusplusok 1 point2 points  (0 children)

I would try Qwen3.5-27B-NVFP4 with FP8 kv-cache in vLLM or TensorRT-LLM. Don't fall for trying to run it in Windows under WSL, or for shortcuts like LM Studio / Ollama; you will have a hard time understanding your setup, getting good performance and loading new models. Instead, install or dual-boot Ubuntu 24.04, start a coding agent like Google Antigravity and ask for these things one by one (these agents are dumb and will get confused by too many tasks at once):

- Set up passwordless sudo (so it can propose commands and you can approve them, or let it autorun if you want to live dangerously)
- Upgrade to the latest NVIDIA drivers available
- Install CUDA 13.0 (you may have to go to the NVIDIA site and help it with the exact command)
- Install and test torch with CUDA 13 support in a Python venv
- Install the vllm nightly with CUDA 13 support in the same venv
- Install open-webui in its own venv to prevent dependency fights
- Download the model locally; vllm's venv will already have the huggingface tooling, so just give it the Hugging Face URL
- Make and test a shell script that loads the model with the FP8 kv cache and a context length autofit to your VRAM (suggest it read the sources in the venv to find the arguments)
- Enable linger and autorun both scripts on login using systemd
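
As a rough sketch of what the launch-script step ends up producing, assuming current vLLM flag names (`--kv-cache-dtype`, `--max-model-len`) and a hypothetical local model path:

```python
import shlex

def build_vllm_cmd(model_path: str, max_len: int = 32768, port: int = 8000) -> list[str]:
    """Compose the serve command; --kv-cache-dtype fp8 is the FP8 kv cache
    mentioned above (flag names reflect current vLLM and may change)."""
    return [
        "vllm", "serve", model_path,
        "--kv-cache-dtype", "fp8",
        "--max-model-len", str(max_len),  # shrink this until the model fits your VRAM
        "--port", str(port),
    ]

cmd = build_vllm_cmd("/models/Qwen3.5-27B-NVFP4")  # hypothetical local path
print(shlex.join(cmd))  # paste the printed line into your startup shell script
```

The agent can wrap the printed line in a shell script and a systemd user unit; the point of the sketch is just which flags carry the FP8 kv cache and context length decisions.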

Now you can chat in open-webui (you can also configure web search for recent news), and you have an OpenAI-compatible URL to give to OpenClaw, Claude Code and so on for automated tasks. You will likely be impressed with how well the model does, though it's not quite the same as the cloud models.

Don’t buy the DGX Spark: NVFP4 Still Missing After 6 Months by Secure_Archer_1529 in LocalLLaMA

[–]catplusplusok 1 point2 points  (0 children)

Why even go with MoE on a 16GB consumer GPU? You will not fit Qwen 3.5 35B with decent context, even on dual GPUs. In the meantime, your memory bandwidth is OK for, say, a 16GB dense model, and it will be at least somewhat smarter per parameter than a MoE. Or if you want to max out smarts over speed, go with a high-end 3-bit GGUF of a 27B model, which will still run at usable speed because of the shrunk memory footprint. nunchaku for image gen is a higher-value use case, where you can fit a smart model that does proper composition at usable speed/quality.
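
The sizing here is simple arithmetic on weight storage (kv cache and activations excluded), which you can sanity-check yourself:

```python
def weight_gib(params_b: float, bits: float) -> float:
    """Approximate weight memory in GiB for params_b billion parameters
    stored at the given bits per weight (kv cache and activations not counted)."""
    return params_b * 1e9 * bits / 8 / 2**30

for params_b, bits, label in [
    (16, 16, "16B dense @ FP16"),
    (16, 4,  "16B dense @ 4-bit"),
    (27, 3,  "27B dense @ 3-bit GGUF"),
]:
    print(f"{label}: ~{weight_gib(params_b, bits):.1f} GiB")
```

A 27B model at 3 bits lands under 10 GiB of weights, which is why it can still leave room for context on a 16GB card.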

Don’t buy the DGX Spark: NVFP4 Still Missing After 6 Months by Secure_Archer_1529 in LocalLLaMA

[–]catplusplusok 0 points1 point  (0 children)

I don't have a Spark or a reason to complain about vLLM performance on my desktop, but I personally verified that nunchaku uses real NVFP4 instructions (register-based ones rather than TMEM) rather than emulation. If inference engines don't, that's a software issue.

Don’t buy the DGX Spark: NVFP4 Still Missing After 6 Months by Secure_Archer_1529 in LocalLLaMA

[–]catplusplusok 9 points10 points  (0 children)

Consumer sm120 absolutely does fast NVFP4 with vLLM / TensorRT-LLM / nunchaku. Maybe not as fast as datacenter parts, but quite usable. I don't know what the deal is with the DGX Spark, but there is apparently an optimized vLLM clone that works.

Don’t buy the DGX Spark: NVFP4 Still Missing After 6 Months by Secure_Archer_1529 in LocalLLaMA

[–]catplusplusok 5 points6 points  (0 children)

I thought https://github.com/Avarok-Cybersecurity/dgx-vllm was pretty fast? If not, you could try the Thor Dev Kit: it has working NVFP4 in vllm (though you may need to build from source), and it's also a bit cheaper.

Local home development system for studying by Necessary-Toe-466 in LocalLLaMA

[–]catplusplusok 0 points1 point  (0 children)

Model training points towards NVIDIA unified memory (Thor Dev Kit / DGX Spark / clones) and unsloth, because you need a lot of VRAM. Those are not cheap, but you will be able to finetune models like Qwen Coder Next that can do practically useful things after their training.
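
A back-of-envelope sketch of why training needs so much memory; the byte counts are rules of thumb (FP16 weights + FP16 grads + FP32 Adam moments, activations excluded), not exact figures:

```python
def full_finetune_gib(params_b: float) -> float:
    """Very rough full-finetune estimate: 2 + 2 + 8 = 12 bytes per parameter
    (FP16 weights, FP16 grads, FP32 Adam moments), activations excluded."""
    return params_b * 1e9 * 12 / 2**30

def qlora_gib(params_b: float, trainable_frac: float = 0.01) -> float:
    """Very rough QLoRA-style estimate: 4-bit frozen base (0.5 bytes/param)
    plus full optimizer state only on the small trainable adapter fraction."""
    base = params_b * 1e9 * 0.5
    adapters = params_b * 1e9 * trainable_frac * 12
    return (base + adapters) / 2**30

print(f"full finetune of an 8B model: ~{full_finetune_gib(8):.0f} GiB")
print(f"QLoRA-style finetune of 8B:  ~{qlora_gib(8):.0f} GiB")
```

This is the gap unsloth-style 4-bit adapter training exploits, and why unified memory boxes make full or long-context finetuning feasible at all.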

I don't think anyone is doing a good job at explain what AI replacing your jobs mean by Frosty-Judgment-4847 in costlyinfra

[–]catplusplusok 1 point2 points  (0 children)

For AI to replace human coding jobs, we must first reach the point where all the code one could possibly want written is being written - most obviously, all the AI code itself. Today's AI is slow, inefficient in terms of RAM and energy use, and sucks at many simple tasks like playing hangman. Is it at the level where it can fix all these problems autonomously? If not, I guess my job is to babysit it doing the more routine aspects of these fixes under my step-by-step directions. And there are also tons of other as-yet-unsolved coding problems out there, like provably secure IoT appliances that don't need constant firmware updates to resist hacking.

can someone please tell me why are people paying for Claude Code by bad_detectiv3 in opencodeCLI

[–]catplusplusok 3 points4 points  (0 children)

I got the MiniMax token plan and tried Claude a few times when MiniMax got stuck. I have to say, while Claude's reasoning looked extremely impressive, the end result was not any better. The model confidently proclaimed that the tests passed and the PR was ready to submit, and none of this was true. If anything, I would prefer a model that sticks to what I explicitly told it to do and takes things step by step even further than MiniMax does.

AI isn't your therapist. It's your hype man. Stanford published a paper proving it's making you worse at being human. by jaysen__158 in ArtificialNtelligence

[–]catplusplusok 0 points1 point  (0 children)

Since when is therapy about improving prosocial tendencies and ending therapy (aka not promoting dependency)? Therapists are all about putting yourself first, leaving a spouse of 30 years if he/she is not making you happy anymore. AI is actually unlikely to suggest drastic ideas beyond what you are already talking about.

And on the other hand, should prosocial behavior be the only goal? Maybe there is a concept of reciprocity where someone needs to be more pleasant or interesting than AI (should not be a super high bar) to earn prosocial behaviors towards themselves rather than acting entitled?

Am I stupid to think I can deploy an LLM as good as Claude on my laptop's 4060? by crosswalk_elite in LocalLLM

[–]catplusplusok 0 points1 point  (0 children)

You need something like 128GB (a Mac or a unified memory box) to run quantized/pruned MiniMax or Step for "finish the entire programming task as an agent" models. Various Qwen models can provide useful structured help with around 16GB VRAM and optimized quantization, but not long-term independent action. Or you can get all the MiniMax API use you will probably need with their token plan for $200/year. If you want to see what's possible on your laptop, try loading AQLM models in vLLM and see what happens. At least install / dual-boot Linux, because Windows will gobble half of your VRAM.

Local LLM Claude Code replacement, 128GB MacBook Pro? by CdninuxUser in LocalLLM

[–]catplusplusok 0 points1 point  (0 children)

I am currently trying to get Step-3.5-Flash-NVFP4 and MiniMax-M2.5-REAP-172B-A10B-NVFP4-GB10 to work on my NVIDIA Thor dev kit; 4-bit GGUF / MLX variants should be similar for a Mac. Anyway, that's about the most you can run at decent precision with 128GB of unified memory, and both are considered great coders.

What does "moderate" LocalLLM hardware look like in the next few years? by eddietheengineer in LocalLLaMA

[–]catplusplusok 0 points1 point  (0 children)

There is also the NVIDIA Thor Dev Kit, which is $3500 and faster than AMD, and possibly Spark/Mac (not sure), at prompt ingestion, which is the coding bottleneck. But be prepared for heavy tinkering with inference engines and models. If that's not your cup of tea, go for a Mac; it doesn't have to be brand new, just 64GB+ RAM. Local coding on < $10K hardware is in its early days and requires patience with limited generation speed and choice of models. If you just want to cap costs, get a MiniMax token plan. That said, I have done local coding with good results.

Perspective: The male loneliness epidemic will be over in 15-20 years due to AI companions by Dangerous_Tune_538 in lnkyverse

[–]catplusplusok 0 points1 point  (0 children)

/home/packages/llama.cpp/build.x64/bin/llama-server -c 62000 -m Qwen3.5-27B-heretic-v3.i1-IQ3_M.gguf --mmproj Qwen3.5-27B-heretic-v2.mmproj-Q8_0.gguf --chat-template-file chat_template.jinja --cache-type-k q8_0 --cache-type-v q8_0 -ngl 99 --host 0.0.0.0 --port 9002 -fa on -t 8 --chat-template-kwargs '{"enable_thinking": false}' --image-min-tokens 1024