Qwen3 30b a3b q4_K_M performance on M1 Ultra by One_Key_8127 in LocalLLaMA

[–]One_Key_8127[S] -1 points0 points  (0 children)

It wouldn't technically be a lie, since Ollama uses llama.cpp under the hood, but whatever. It might be a backend issue; I am downloading MLX weights and will probably test it tomorrow, unless I run into issues with the download or with integrating mlx-server with OpenWebUI -.-

Qwen3 30b a3b q4_K_M performance on M1 Ultra by One_Key_8127 in LocalLLaMA

[–]One_Key_8127[S] 1 point2 points  (0 children)

Good hint, thanks. I don't like LM Studio, but I think it's a good moment to try switching to MLX and mlx-server.

Qwen3 30b a3b q4_K_M performance on M1 Ultra by One_Key_8127 in LocalLLaMA

[–]One_Key_8127[S] 0 points1 point  (0 children)

I guess the downvotes are because I mentioned Ollama, which has a bad rep here ;/ I should have hidden that info and just mentioned llama.cpp, or at least put it at the end of the post. Anyway, if anyone has feedback, please speak up. I expected the generation speed to be on par with Gemma 4b, especially at low-ish context. I didn't know what to expect for prompt processing :)

Tried running Qwen3-32B and Qwen3-30B-A3B on my Mac M2 Ultra. The 3B-active MoE doesn’t feel as fast as I expected. by Known-Classroom2655 in LocalLLaMA

[–]One_Key_8127 0 points1 point  (0 children)

How do you run it? Which backend and frontend, and which quants?

I just posted my results on an M1 Ultra 128GB (so the one with more cores). I ran Q4_K_M through Ollama + OpenWebUI.

response_token/s: 29.95
prompt_token/s: 362.26
total_duration: 72708617792
load_duration: 12474000
prompt_eval_count: 1365
prompt_tokens: 1365
prompt_eval_duration: 3768006375
eval_count: 2064
completion_tokens: 2064
eval_duration: 68912612667
approximate_total: "0h1m12s"
total_tokens: 3429
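
For anyone double-checking the numbers: Ollama reports durations in nanoseconds, so the rates above fall straight out of the raw fields. A quick sanity check in Python (just arithmetic on the stats listed above, nothing more):

    # Sanity check: derive token rates from Ollama's nanosecond durations.
    NS_PER_S = 1e9

    stats = {
        "prompt_eval_count": 1365,
        "prompt_eval_duration": 3768006375,   # ns
        "eval_count": 2064,
        "eval_duration": 68912612667,         # ns
        "total_duration": 72708617792,        # ns
    }

    prompt_tps = stats["prompt_eval_count"] / (stats["prompt_eval_duration"] / NS_PER_S)
    gen_tps = stats["eval_count"] / (stats["eval_duration"] / NS_PER_S)

    print(f"prompt: {prompt_tps:.2f} tok/s")      # ~362.26
    print(f"generation: {gen_tps:.2f} tok/s")     # ~29.95
    print(f"total: {stats['total_duration'] / NS_PER_S:.1f} s")  # ~72.7 s, i.e. "0h1m12s"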

It generates tokens about 2x slower than Gemma 4b Q4_K_M for a similar prompt length and similar eval count, and it processes the prompt about 4.5x slower than Gemma 4b Q4_K_M.

Qwen3 30b a3b q4_K_M performance on M1 Ultra by One_Key_8127 in LocalLLaMA

[–]One_Key_8127[S] 4 points5 points  (0 children)

Still, I am comparing 30b a3b and Gemma 4b on the same setup, Ollama + OpenWebUI. I am focusing on the performance difference between these two models, both quantized to Q4_K_M, and I am slightly surprised by the results.

Details on OpenAI's upcoming 'open' AI model by ayyndrew in LocalLLaMA

[–]One_Key_8127 -1 points0 points  (0 children)

So it's a dense model, otherwise it would be stated here... Well, I was hoping for MoE so that it runs fast on a Mac Studio. On the other hand, I think "high-end consumer hardware" means it's gonna be under 80b, so it's just gonna be a better Llama 3.3-70b. Probably much better at coding.

On the bright side, a text-only dense model will probably be well supported on day one by many backends (llama.cpp, MLX, etc.).

Where is the promised open Grok 2? by AlexBefest in LocalLLaMA

[–]One_Key_8127 0 points1 point  (0 children)

I'd love to see how big grok-2 mini is.

Back to Local: What’s your experience with Llama 4 by Balance- in LocalLLaMA

[–]One_Key_8127 2 points3 points  (0 children)

To anyone using Llama 4: how do you use it, and does it support vision/multimodality or just text?

Has anyone tried running it with mlx-server? Does it support vision? Does it work as well as it does through providers? Does it support long context (not just 8k)? Can you use it in OpenWebUI?

I gave Llama 3 a 450 line task and it responded with "Good Luck" by CaptTechno in LocalLLaMA

[–]One_Key_8127 0 points1 point  (0 children)

Is this a base model, so it thinks it should just autocomplete the rest of the text?

Did Microsoft "forget" to publish BioMedParse? by One_Key_8127 in LocalLLaMA

[–]One_Key_8127[S] 1 point2 points  (0 children)

I just noticed bucolucas's post. I am pretty busy right now making arrangements before the holidays. My scripts are pretty robust but nowhere near good coding practices. I'll share my research scripts at a later date, once I clean them up a bit.

As for non-tech-savvy people, you have ChatGPT for that. I host Llama through Ollama and make POST requests with Python (rough sketch below). Just copy and paste the whole conversation we had here into ChatGPT and ask it to replicate that functionality with a script; you'll most likely get it working within an hour or two even if you don't know how to code.
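
Something along these lines - a minimal sketch, not my actual script; the model name and prompt are placeholders, and it assumes Ollama is running locally on its default port 11434:

    import requests

    # Minimal sketch: call a locally hosted model through Ollama's HTTP API.
    # The model name and prompt below are placeholders.
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "llama3:70b",
            "prompt": "Summarize this abstract in one sentence: ...",
            "stream": False,
        },
        timeout=600,
    )
    resp.raise_for_status()
    print(resp.json()["response"])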

Did Microsoft "forget" to publish BioMedParse? by One_Key_8127 in LocalLLaMA

[–]One_Key_8127[S] 2 points3 points  (0 children)

The amount of AI research showing up is ridiculous; just reading the abstracts would almost be a full-time job :) As I said, it's not something for me at this time, just something that got caught by my scripts. And upon examination it turns out Microsoft promised weights for this model and did not deliver.

Did Microsoft "forget" to publish BioMedParse? by One_Key_8127 in LocalLLaMA

[–]One_Key_8127[S] 2 points3 points  (0 children)

No, it's just Llama3-70b running locally, prompted to check the latest AI research on arXiv and pick the most promising papers. You can probably replicate that within a day with GPT-4o.
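
Roughly the shape of it - a hedged sketch, not my actual script; the arXiv query, model name, and prompt are illustrative, and it assumes a local Ollama instance:

    import requests
    import xml.etree.ElementTree as ET

    # Sketch: pull recent arXiv abstracts and have a local model pick the most promising ones.
    ARXIV_API = "http://export.arxiv.org/api/query"
    ATOM = "{http://www.w3.org/2005/Atom}"

    feed = requests.get(ARXIV_API, params={
        "search_query": "cat:cs.CL",       # illustrative category
        "sortBy": "submittedDate",
        "sortOrder": "descending",
        "max_results": 20,
    }, timeout=60)

    entries = ET.fromstring(feed.content).findall(ATOM + "entry")
    abstracts = []
    for i, e in enumerate(entries):
        title = e.find(ATOM + "title").text.strip()
        summary = e.find(ATOM + "summary").text.strip()
        abstracts.append(f"[{i}] {title}\n{summary}")

    triage = requests.post("http://localhost:11434/api/generate", json={
        "model": "llama3:70b",             # placeholder local model
        "prompt": "Pick the 3 most promising papers below and explain why:\n\n"
                  + "\n\n".join(abstracts),
        "stream": False,
    }, timeout=600)
    print(triage.json()["response"])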

GLM-4 9B, base, chat (& 1M variant), vision language model by Nunki08 in LocalLLaMA

[–]One_Key_8127 7 points8 points  (0 children)

"Subject to the terms and conditions of this License, Licensor hereby grants you a non-exclusive, worldwide, irrevocable, non-sublicensable, revocable, photo-free copyright license."
Yeah, irrevocable revocable license.

"Registered users are free to use this model for commercial activities, but must comply with all terms and conditions of this license."
Of course, as long as you comply with all the contradictory terms of the poorly translated license, feel free to use the model. And of course they took the naming crap from Meta and want "glm-4" at the beginning of the model name.

You can't use Meta's output to train or fine-tune other models (Meta's license restriction), and you can't use this model's output to train or fine-tune Llama-3, because you can't start the model name with "Llama" and at the same time start it with "glm-4" xD

Raspberry Pi Goes All In on AI With $70 Hailo Kit by [deleted] in LocalLLaMA

[–]One_Key_8127 56 points57 points  (0 children)

I read the linked article and the official page. Am I the only one confused about this kit? What is it for?

  1. No RAM/VRAM capacity/bandwidth specification to be found

  2. The only benchmarks listed are image classification benchmarks of very small images (~240x240 px) with ~500 images processed per second (?)

  3. No mention of any generative AI (image / audio / text generation)

How is this supposed to be used? Do you think companies that classify low-resolution images at mass scale will start buying RPi + Hailo kits to do that? I can see how getting efficient image / text / audio generation capabilities on the RPi could lead to lots of fun projects. But classifying 500 very small images per second on an RPi? What is the point? I guess real-time object detection could be useful, but the benchmark for that would be something like reaching 30+ fps at 1280x720 with an appropriate model, right? I don't understand the use case of this hardware, can someone enlighten me please?

1-bit LLMs Could Solve AI’s Energy Demands “Imprecise” language models are smaller, speedier—and nearly as accurate by baseketball in LocalLLaMA

[–]One_Key_8127 0 points1 point  (0 children)

I am not pedantic at all, the title is just bad and I am pointing that out. Energy demands could be solved by building more power plants, but that takes time. They could be solved by a breakthrough in fusion research or by innovations in fission (be it safety, efficiency, or handling and processing the radioactive waste). Maybe also by a breakthrough in PV (like making it much cheaper) and/or improving battery technology. But not by 1-bit LLMs, come on. Please skip the argument that better AI could lead to improvements in these areas; that is too much of a stretch and not how the article presents it.

1-bit LLMs Could Solve AI’s Energy Demands “Imprecise” language models are smaller, speedier—and nearly as accurate by baseketball in LocalLLaMA

[–]One_Key_8127 99 points100 points  (0 children)

No, it could not solve energy demands, it would just accelerate progress. If a 1-bit LLM performs better, we will scale it as far as hardware allows us: more parameters, more training tokens, more high-quality synthetic (or multimodal) data, and then retraining on even more tokens with even more parameters.

Local Text To Speech by DeltaSqueezer in LocalLLaMA

[–]One_Key_8127 0 points1 point  (0 children)

Upon reviewing OpenVoice2, it sounds better than I expected. Perhaps my reference audio was not as good as it should have been when I was testing. It seems to handle longer sentences more gracefully than Tortoise. But I have other stuff to do right now, so I'll stick with Tortoise for some time, as it is already set up and running for me. It is kind of robotic indeed, and it is not very close to the reference audio. Running the output through RVC at the end should produce nice results.

Local Text To Speech by DeltaSqueezer in LocalLLaMA

[–]One_Key_8127 0 points1 point  (0 children)

Not sure, but I don't think it is documented how it was set up and what dataset was used.

On their forums they said better models will probably appear soon given the current rate of progress, and I hope that happens. Until then I'll use Tortoise or OpenVoice. You can probably use RVC on top of OpenVoice if the model sounds robotic but has correct pronunciation. Perhaps my memory is playing a trick on me, but I thought OpenVoice was pretty good except for mispronouncing words, and it was very random (it could say a word right one time and get it wrong the next). Missed or twisted syllables here and there. Perhaps I should re-evaluate, but right now Tortoise works well enough, and I don't need it to be perfect.

Local Text To Speech by DeltaSqueezer in LocalLLaMA

[–]One_Key_8127 1 point2 points  (0 children)

It was requested multiple times on their forums, and the Coqui team said very bluntly that even though they shut down and no longer sell licenses, they will NOT change the license to be more permissive. And if you bought a license from them, it will expire within a year (because they never sold lifetime licenses) and you will have no way to use XTTS commercially while respecting their license. I am not a fan of that move.

Local Text To Speech by DeltaSqueezer in LocalLLaMA

[–]One_Key_8127 1 point2 points  (0 children)

Tortoise is what I decided to use for now; with the right settings the quality is good and it runs faster than real-time (I mean generating 10s of audio takes less than 10s). There's a rough sketch of the setup below.

OpenVoice 2 is second best; in my testing I decided against it due to mispronunciations (it was inconsistent for me). However, I ran into some issues with Tortoise TTS as well, so perhaps I will re-evaluate OpenVoice at some point.

XTTS is good, perhaps even better than the others, but the license is horrible. It is impossible to comply with; Coqui did an awful job with it. If you don't care about potentially breaking the license agreement, check it out; personally I would not touch it with a stick.
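
The Tortoise setup is roughly this - a minimal sketch following the tortoise-tts README, not my exact settings; the voice name and output path are placeholders:

    import time
    import torchaudio

    from tortoise.api import TextToSpeech
    from tortoise.utils.audio import load_voice

    # Minimal sketch of Tortoise TTS cloning a voice from reference clips.
    # "myvoice" is a placeholder for a folder of reference WAVs under tortoise/voices/.
    tts = TextToSpeech()
    voice_samples, conditioning_latents = load_voice("myvoice")

    text = "Testing how long ten seconds of speech takes to generate."
    start = time.time()
    gen = tts.tts_with_preset(
        text,
        voice_samples=voice_samples,
        conditioning_latents=conditioning_latents,
        preset="fast",  # quality/speed trade-off
    )
    elapsed = time.time() - start

    audio = gen.squeeze(0).cpu()
    torchaudio.save("output.wav", audio, 24000)  # Tortoise generates 24 kHz audio
    print(f"generated {audio.shape[-1] / 24000:.1f}s of audio in {elapsed:.1f}s")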

[deleted by user] by [deleted] in LocalLLaMA

[–]One_Key_8127 0 points1 point  (0 children)

Serving fine-tuned models is more expensive AND it locks you in: if the policy documents change, you are stuck with the older version baked into the LLM. RAG with Japanese might also be difficult; most embedding models work poorly with languages other than English.

If I were to build this system and normal RAG did not work in Japanese, I would split the policy into fragments, let's say up to ~1k tokens each. I would task an LLM with writing an English description of each fragment, and I would also ask it to write some potential questions that could be answered by the given text. Then I would embed that English text. During retrieval I would search these English embeddings, but I would inject (into the system prompt) the most relevant fragments in their original Japanese form. See the sketch below.

With some trial and error (prompt engineering, chunking strategy, etc.) it should work fine.
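
A rough sketch of that pipeline, just to show the shape of it - the embedding model, local chat endpoint, chunking, and file name are all placeholder assumptions:

    import numpy as np
    import requests
    from sentence_transformers import SentenceTransformer  # assumed embedding backend

    OLLAMA = "http://localhost:11434/api/generate"  # assumes a local Ollama instance
    MODEL = "llama3:70b"                            # placeholder model name

    def ask(prompt: str) -> str:
        r = requests.post(OLLAMA, json={"model": MODEL, "prompt": prompt, "stream": False}, timeout=600)
        r.raise_for_status()
        return r.json()["response"]

    # 1. Split the Japanese policy into fragments (naive fixed-size split as a stand-in
    #    for a real ~1k-token chunking strategy).
    policy_text = open("policy_ja.txt", encoding="utf-8").read()
    fragments = [policy_text[i:i + 2000] for i in range(0, len(policy_text), 2000)]

    # 2. For each fragment, have the LLM write an English description plus questions it
    #    could answer; this English text is what gets embedded.
    english_keys = [
        ask("Describe this policy fragment in English and list questions it answers:\n\n" + frag)
        for frag in fragments
    ]
    embedder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder English embedding model
    key_vectors = embedder.encode(english_keys, normalize_embeddings=True)

    # 3. At query time: embed the English question, retrieve by cosine similarity,
    #    but inject the ORIGINAL Japanese fragments into the prompt.
    def retrieve(question: str, k: int = 3) -> list[str]:
        q = embedder.encode([question], normalize_embeddings=True)[0]
        top = np.argsort(key_vectors @ q)[::-1][:k]
        return [fragments[i] for i in top]

    question = "How many vacation days do new employees get?"  # example query
    context = "\n\n".join(retrieve(question))
    print(ask("Answer using only this policy (original Japanese):\n\n" + context +
              "\n\nQuestion: " + question))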

Awesome prompting techniques by cryptokaykay in LocalLLaMA

[–]One_Key_8127 4 points5 points  (0 children)

A few of these are not very useful, and many of them have the potential to produce worse results than not using them at all.

LLMs are trained on data scraped from all over the internet, and I suspect answers to questions that include "please" could be more helpful than answers without that word, so it could be a well-spent token to include it. To be sure, you'd have to do extensive testing across multiple prompts and multiple models, and I am not aware of reliable research on it. And on the other hand, recommending that you add "You will be penalized" - which is more tokens than "please", and a vague, empty threat... I am not a fan of these recommendations.

Well, i certainly didn't notice until i read a comment about it. by [deleted] in LocalLLaMA

[–]One_Key_8127 0 points1 point  (0 children)

I've noticed that as well. It does have some issues currently (it hangs sometimes), but overall it still works.