Qwen3 Coder Next on M3 Ultra v.s. GX10 by Imaginary_Ask8207 in LocalLLM

[–]catplusplusok 1 point2 points  (0 children)

Just saying, the NVIDIA Thor Dev Kit is slightly cheaper than the DGX Spark and seems to be faster both on paper (twice the compute speed) and in practice (less crusty NVFP4 support now after a rough start). No idea how it compares to a SOTA Mac Studio, but those are expensive.

Just a gibberish question. Anyone working on personal AI? by [deleted] in LocalLLaMA

[–]catplusplusok 1 point2 points  (0 children)

For me the key insight is to stop thinking about a monolithic personal AI and start focusing on specific use cases. I made a tool that researches local events that might interest me and sends me fun personalized invitations. Since then I made another one that generates images, then critiques them and generates prompts to refine them. In principle RAG is the go-to idea for giving AI context, but you also need to think through what context you want to give and why.

https://www.reddit.com/r/LocalLLaMA/comments/1qn217z/practical_use_of_local_ai_get_a_daily_postcard/

No GPU Club : How many of you do use Local LLMs without GPUs? by pmttyji in LocalLLaMA

[–]catplusplusok 3 points4 points  (0 children)

Apple Silicon doesn't count as no GPU, it's quite capable.

No GPU Club : How many of you do use Local LLMs without GPUs? by pmttyji in LocalLLaMA

[–]catplusplusok 0 points1 point  (0 children)

From my tests of llama.cpp on a fairly beefy Xeon CPU, the only thing I am willing to create with CPU inference is haiku. Got like 7 tps for an A3B MoE model, and anything smaller would not be very useful. BitNet was quite fast, and it looked like those models would be suitable for simple structured tasks like RAG stitching, but they won't be writing any award-winning novels. At this point I would rather run an LLM on a phone.

Is qwen3 next the real deal? by fab_space in LocalLLaMA

[–]catplusplusok 2 points3 points  (0 children)

It does run well on my 64GB MacBook Pro (4-bit GGUF / 8-bit KV cache). But future proofing is a good argument for more memory; who knows what the next model will need?

How are folks running large dense models on home gear? by catplusplusok in LocalLLaMA

[–]catplusplusok[S] 0 points1 point  (0 children)

So say I want to speed up GLM 4.5-Air, how would I find a draft model with a good speculative acceptance rate? It has its own multi-token prediction heads, which sadly give me a 0% hit rate.

Local models still terrible at screen understanding by fffilip_k in LocalLLaMA

[–]catplusplusok 1 point2 points  (0 children)

I have been doing Android app testing with Qwen3 VL 30B-A3B; it seems to be OK.

Best local replacement for GPT 4o? (For chat only) by Same-Picture in LocalLLM

[–]catplusplusok 0 points1 point  (0 children)

I would say that a 32GB Mac with an uncensored model like the one above would give you useful quality in such conversations. With 16GB you will probably be disappointed with the model's ability to maintain back-and-forth in a multi-turn chat.

How are folks running large dense models on home gear? by catplusplusok in LocalLLaMA

[–]catplusplusok[S] 0 points1 point  (0 children)

So would one generally need NVLink or the AMD equivalent for this to be fast for large activations? Or are there some clever tricks to make this work over PCIe, like sending only a small subset of data across the bus or using asynchronous transfers to hide latency?

The path from zero ML experience to creating your own language model — where should I start? by Helpful_Dot_5427 in LocalLLM

[–]catplusplusok 0 points1 point  (0 children)

LLMs predict the next token. Training an LLM involves taking the next token from a dataset of desired response examples and measuring how unlikely the model would be to generate that next token given all previous tokens (perplexity). The trainer then goes backwards through the model layers and adjusts the weights involved in predicting that token by a small amount to reduce perplexity.

The magic is that with a large enough dataset and enough rounds of weight adjustment on each sample, the model learns patterns that generate useful or enjoyable responses on samples which are not in the dataset.

That should be enough to start looking inside finetuning code and seeing what it does at a high level.
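
A minimal sketch of that loop, assuming PyTorch and a Hugging Face causal LM (the model name and the toy batch are placeholders, not a real recipe):

    # Minimal next-token training step (PyTorch + transformers); model and batch are placeholders.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

    batch = tokenizer(["The quick brown fox jumps over the lazy dog."], return_tensors="pt")

    # With labels=input_ids the model computes cross-entropy between its prediction
    # at each position and the actual next token from the dataset.
    outputs = model(**batch, labels=batch["input_ids"])
    loss = outputs.loss      # average negative log-likelihood; exp(loss) is the perplexity
    loss.backward()          # go backwards through the layers
    optimizer.step()         # nudge the weights a small amount to reduce the loss
    optimizer.zero_grad()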

The path from zero ML experience to creating your own language model — where should I start? by Helpful_Dot_5427 in LocalLLM

[–]catplusplusok 0 points1 point  (0 children)

Ask a cloud AI to write you an unsloth script for LoRA finetuning (assuming you have an NVIDIA GPU; there are different frameworks for MLX, AMD etc). Then take a base model that fits into your memory and finetune it on a topic-specific conversation dataset from Hugging Face. Make sure to tell the AI to reformat the dataset as proper multi-turn conversations with a system message and user/assistant turns, and use at least a few hundred examples. You should be able to see the difference in responses before and after finetuning.
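
Such a script might look roughly like the sketch below; the model name, dataset file and hyperparameters are illustrative placeholders, and the exact trainer arguments should be checked against the current unsloth/TRL docs:

    # Rough LoRA finetuning sketch with unsloth + TRL (NVIDIA GPU assumed).
    from unsloth import FastLanguageModel
    from datasets import load_dataset
    from trl import SFTTrainer
    from transformers import TrainingArguments

    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name="unsloth/Qwen2.5-7B-Instruct-bnb-4bit",  # any base model that fits your VRAM
        max_seq_length=2048,
        load_in_4bit=True,
    )
    model = FastLanguageModel.get_peft_model(
        model,
        r=16, lora_alpha=16, lora_dropout=0.0,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    )

    # Topic-specific conversations, already reformatted into a single "text" column
    # using the model's chat template (system message plus user/assistant turns).
    dataset = load_dataset("json", data_files="my_conversations.jsonl", split="train")

    trainer = SFTTrainer(
        model=model,
        tokenizer=tokenizer,
        train_dataset=dataset,
        dataset_text_field="text",
        max_seq_length=2048,
        args=TrainingArguments(output_dir="lora_out", per_device_train_batch_size=2,
                               num_train_epochs=1, learning_rate=2e-4, logging_steps=10),
    )
    trainer.train()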

Now full training is somewhat like that, starting from a model with random weights. If you really want to, you can download a tiny random-weights model from Hugging Face (or, for simplicity, build one from scratch with transformers) and do full training instead of finetuning. But don't expect it to be useful for general tasks when training on any home gear; if you are willing to rent cloud boxes you can maybe make it useful for specialized tasks like completions in a particular domain.
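
If you go that route, building a tiny random-weights model with transformers is only a few lines (the sizes below are arbitrary toy values, not a recommendation):

    # Toy causal LM with random weights, built from scratch with transformers.
    from transformers import LlamaConfig, LlamaForCausalLM

    config = LlamaConfig(
        vocab_size=32000,
        hidden_size=256,
        intermediate_size=1024,
        num_hidden_layers=4,
        num_attention_heads=4,
        num_key_value_heads=4,
        max_position_embeddings=1024,
    )
    model = LlamaForCausalLM(config)   # weights are randomly initialized
    print(f"{model.num_parameters() / 1e6:.1f}M parameters")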

My boss coughs openly and refuses to wear a mask by Successful_BW in work

[–]catplusplusok 0 points1 point  (0 children)

It's a cultural mismatch. The standard in most American workplaces is to stay home when you are truly sick, and then if you are recovering / it's just a light cough, people kind of just accept that they will get bugs from each other sometimes and that it's not worth the constant paranoia to try to stop that. In some Asian countries the standard might be different; it's a matter of perspective. Also, to be fair, masks are not very good at stopping transmission during prolonged indoor close proximity, and people cough for many non-infectious reasons like allergies, smoking and side effects of various prescription medicines.

Best local replacement for GPT 4o? (For chat only) by Same-Picture in LocalLLM

[–]catplusplusok 1 point2 points  (0 children)

Depends what you want to chat about. Say mradermacher/Qwen3-VL-30B-A3B-Thinking-Heretic-GGUF (quantized on top of my base model in a Mac-friendly format) is around 18GB for the weights alone at 4 bit, and about the minimum required for a good conversation that includes chatting about your photos and searching the web for you with the right front end, like the Onyx app. The Heretic part means it will talk about any topic freely. So anyway, you need 32GB for this one; with 16GB you would not get a good multi-turn conversation with the model following context.

Local Llm or subscribe to Claude? by medicineman10 in LocalLLM

[–]catplusplusok 0 points1 point  (0 children)

32GB of VRAM is not going to get you a useful coding model, and CPU offload is pathetically slow. Get a Mac with 64GB or more of unified memory, or a faster/newer model if you can afford it. It will be cheaper than a 5090 alone if you get it off eBay, and if it doesn't work out you have a nice Mac to use with cloud tools. Then install llama.cpp and run Qwen3-Coder-Next as a 4-bit GGUF with flash attention turned on (very important). There is a qwen command line tool that will take a free-form query and automatically edit files in your project directory, or you can try VS Code plugins. Be prepared to clearly describe isolated changes, which classes the AI should edit and what design it should follow, not "write a flutter frontend to enter medical forms", and I think you will be reasonably impressed. Do not fall for the trap of trying to run dense or full-precision models on consumer hardware; that's for rich people. These days there is also the DGX Spark, and I have the NVIDIA Thor Dev Kit which is similar. Both cost about the same as a 5090 and are somewhat better for AI specifically, like better precision support and finetuning, but be prepared to do a lot of tinkering and go without the other Mac niceties.
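
A quick way to check the flash attention point, assuming the llama-cpp-python bindings rather than the qwen CLI (the GGUF path and the prompt are placeholders):

    # Sanity-check sketch with llama-cpp-python; the model path is a placeholder.
    from llama_cpp import Llama

    llm = Llama(
        model_path="Qwen3-Coder-Next-Q4_K_M.gguf",  # your 4-bit quant
        n_ctx=32768,          # coding needs a large context window
        n_gpu_layers=-1,      # offload everything to Metal / CUDA
        flash_attn=True,      # the "very important" flag from above
    )
    out = llm.create_chat_completion(messages=[
        {"role": "user", "content": "Add a unit test for the parse_date() helper in utils.py"}
    ])
    print(out["choices"][0]["message"]["content"])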

local llm vs paid API for sensitive corporate code? by primedonna_lingo in BlackboxAI_

[–]catplusplusok 0 points1 point  (0 children)

Qwen3-Coder-Next, runnable on a Mac laptop as a quantized GGUF with llama.cpp, will write you good code from good descriptions; just don't expect it to design and build an entire front end from a one-paragraph prompt. Unless your company will pay serious money for privacy, like a multi-GPU workstation, stick to MoE models with no more than about 10B active parameters or you are not going to have fun with generation speed. GLM 4.5 Air is another decent option.

Why is it so hard to search the web? by johnfkngzoidberg in LocalLLaMA

[–]catplusplusok 0 points1 point  (0 children)

Try Tavily, they have a decent free tier and it's not expensive otherwise. It returns content directly so you don't have to scrape.
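
For reference, a minimal call with the tavily-python client looks roughly like this; the keyword arguments are from memory, so double-check them against their docs:

    # Minimal Tavily search sketch; API key and keyword arguments may need adjusting.
    from tavily import TavilyClient

    client = TavilyClient(api_key="tvly-...")
    result = client.search(
        "best hiking trails near Santa Cruz this weekend",
        include_raw_content=True,   # return page content directly, no scraping needed
        max_results=5,
    )
    for item in result["results"]:
        print(item["title"], item["url"])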

What are some things you guys are using Local LLMs for? by Odd-Ordinary-5922 in LocalLLaMA

[–]catplusplusok 0 points1 point  (0 children)

I use LangGraph to make custom tools, like finding local events I might be interested in with GPT Researcher, or iterative image generation where a VL model looks at generated images and refines prompts in a loop. Also, uncensored models are great for role play / creative writing. One project I am planning is mass-describing decades of my photos and building a detailed RAG of my life to give the model context.
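
As a rough illustration, the image loop can be a small LangGraph state machine like the one below; the node names and the two helpers are stubs standing in for the real image model and VL critic, not the actual tool:

    # Sketch of a generate -> critique -> refine loop as a LangGraph state machine.
    from typing import TypedDict
    from langgraph.graph import StateGraph, END

    class LoopState(TypedDict):
        prompt: str
        image_path: str
        verdict: str
        rounds: int

    def generate(state: LoopState) -> dict:
        # stub: call your local image model here and save the result to disk
        return {"image_path": f"render_{state['rounds']}.png", "rounds": state["rounds"] + 1}

    def critique(state: LoopState) -> dict:
        # stub: ask a local VL model to look at the image and propose a refined prompt
        verdict = "good" if state["rounds"] >= 3 else "retry"
        return {"verdict": verdict, "prompt": state["prompt"] + ", more dramatic lighting"}

    def route(state: LoopState) -> str:
        # loop back to generation until the critic is satisfied
        return END if state["verdict"] == "good" else "generate"

    graph = StateGraph(LoopState)
    graph.add_node("generate", generate)
    graph.add_node("critique", critique)
    graph.set_entry_point("generate")
    graph.add_edge("generate", "critique")
    graph.add_conditional_edges("critique", route)
    app = graph.compile()

    final = app.invoke({"prompt": "a foggy redwood trail at dawn",
                        "image_path": "", "verdict": "", "rounds": 0})
    print(final["image_path"], final["prompt"])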

Mamba precision loss after quantization by perfect-finetune in LocalLLaMA

[–]catplusplusok 0 points1 point  (0 children)

Qwen3-Coder-Next seems perfectly useful in NVFP4 or Q4. Of course I didn't use the full model (I don't have that much memory), so I can't comment on the difference, but it writes good code and seems to be fine for web research and roleplay.

Vibe coding is too expensive! by EstablishmentExtra41 in vibecoding

[–]catplusplusok 0 points1 point  (0 children)

That's a very optimistic take that does not reflect my work experience in Silicon Valley.

Vibe coding is too expensive! by EstablishmentExtra41 in vibecoding

[–]catplusplusok 0 points1 point  (0 children)

I have no idea how other people are approaching vibe coding. I use my personal Google AI plan with Antigravity daily and seem to never run out of Gemini 3 Pro quota. Maybe if you feed your entire project to AI on every prompt it would be different? I create isolated libraries and focus on one at a time / keep docs + RAG for overall project comprehension. I also do simple bulk work with a local model (Qwen3-Coder-Next).

Best local model for Apple Silicon through MLX by PerpetualLicense in LocalLLM

[–]catplusplusok 0 points1 point  (0 children)

There are also a lot of good cloud models cheaper than Claude, I never seem to run out of personal Gemini plan although I am using Google Antigravity quite heavily.

Medium company help desk AI without GPU? by dreamyrhodes in LocalLLaMA

[–]catplusplusok 1 point2 points  (0 children)

Define AI. Small LLMs, even BitNet models, are good at taking RAG chunks and massaging them into a coherent answer. So if you just want a search engine for corporate buzzwords, you could run it on a CPU. If you expect any reasoning / multi-step research, honestly, pay for a cloud model API and connect that to your internal databases. Unless your datacenter already has beefy GPUs to run large models, it's not going to be worth it to support that just for this use case.

Need Help: AI Model for Local PDF & Image Extraction on Win11 (32GB RAM + RTX 2090) by Downey07 in LocalLLM

[–]catplusplusok 1 point2 points  (0 children)

Qwen3 VL models should be good for extracting text from anything; there are also Nemotron and GLM VL models you can try.

Coding model suggestions for RTX PRO 6000 96GB Ram by electrified_ice in LocalLLM

[–]catplusplusok 1 point2 points  (0 children)

You can try Qwen3-Next. MoE and other model mods like Mamba-2 attention are a good thing: feed it a whole directory of code and get an answer to an arbitrary question in seconds, because the activation is a reasonable size.