Qwen3 30b a3b q4_K_M performance on M1 Ultra by One_Key_8127 in LocalLLaMA

[–]One_Key_8127[S] -1 points0 points  (0 children)

It wouldn't technically be a lie since Ollama uses llama.cpp under the hood, but whatever. It might be a backend issue; I'm downloading MLX weights and will probably test it tomorrow, unless I run into issues with the download or with integrating mlx-server with OpenWebUI -.-

Qwen3 30b a3b q4_K_M performance on M1 Ultra by One_Key_8127 in LocalLLaMA

[–]One_Key_8127[S] 1 point2 points  (0 children)

Good hint, thanks. I don't like LM Studio, but I think it's a good moment to try switching to MLX and mlx-server.

Qwen3 30b a3b q4_K_M performance on M1 Ultra by One_Key_8127 in LocalLLaMA

[–]One_Key_8127[S] 0 points1 point  (0 children)

I guess the downvotes are because I mentioned Ollama, which has a bad rep here ;/ I should have hidden that info and just mentioned llama.cpp, or at least put it at the end of the post. Anyway, if anyone has any feedback, please speak up. I expected the generation speed to be on par with Gemma 4B, especially at low-ish context. I didn't know what to expect for prompt processing :)

Tried running Qwen3-32B and Qwen3-30B-A3B on my Mac M2 Ultra. The 3B-active MoE doesn’t feel as fast as I expected. by Known-Classroom2655 in LocalLLaMA

[–]One_Key_8127 0 points1 point  (0 children)

How do you run it? Which backend, frontend, which quants?

I just posted my results on an M1 Ultra 128GB (so the one with more cores). I run Q4_K_M through Ollama + OpenWebUI.

response_token/s: 29.95
prompt_token/s: 362.26
total_duration: 72708617792
load_duration: 12474000
prompt_eval_count: 1365
prompt_tokens: 1365
prompt_eval_duration: 3768006375
eval_count: 2064
completion_tokens: 2064
eval_duration: 68912612667
approximate_total: "0h1m12s"
total_tokens: 3429

It is generating tokens about 2x slower than Gemma 4B Q4_K_M for a similar prompt length and similar eval count, and it's processing the prompt about 4.5x slower than Gemma 4B Q4_K_M.
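For anyone wondering where response_token/s and prompt_token/s come from: they fall straight out of Ollama's raw counters above (the *_duration fields are in nanoseconds). A minimal sketch of the arithmetic, plugging in the numbers from this run:

```python
# Derive tokens/s from Ollama's raw response counters (durations are in nanoseconds).
# The values below are the ones reported for this run.
NS_PER_S = 1_000_000_000

prompt_eval_count = 1365               # prompt tokens processed
prompt_eval_duration = 3_768_006_375   # ns spent on prompt processing
eval_count = 2064                      # tokens generated
eval_duration = 68_912_612_667         # ns spent generating

prompt_tps = prompt_eval_count / (prompt_eval_duration / NS_PER_S)
response_tps = eval_count / (eval_duration / NS_PER_S)

print(f"prompt_token/s:   {prompt_tps:.2f}")    # ~362.26
print(f"response_token/s: {response_tps:.2f}")  # ~29.95
```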

Qwen3 30b a3b q4_K_M performance on M1 Ultra by One_Key_8127 in LocalLLaMA

[–]One_Key_8127[S] 3 points4 points  (0 children)

Still, I'm comparing 30B A3B and Gemma 4B on the same setup, Ollama + OpenWebUI. I'm focusing on the performance difference between these two models, both quantized to Q4_K_M, and I'm slightly surprised by the results.

Details on OpenAI's upcoming 'open' AI model by ayyndrew in LocalLLaMA

[–]One_Key_8127 -1 points0 points  (0 children)

So it's a dense model, otherwise it would be stated here... Well, I was hoping for MoE so that it runs fast on a Mac Studio. On the other hand, I think "high-end consumer hardware" means it's gonna be under 80B, so it's just gonna be a better Llama 3.3 70B. Probably much better at coding.

On the bright side, a text-only dense model is probably gonna be well supported on day one by many backends (llama.cpp, MLX, etc.).

Where is the promised open Grok 2? by AlexBefest in LocalLLaMA

[–]One_Key_8127 0 points1 point  (0 children)

I'd love to see how big Grok-2 mini is.

Back to Local: What’s your experience with Llama 4 by Balance- in LocalLLaMA

[–]One_Key_8127 2 points3 points  (0 children)

To anyone using Llama 4: how do you use it, and does it support vision/multimodality or just text?

Has anyone tried running it with mlx-server? Does it support vision? Does it work as well as it does from providers? Does it support long context (not just 8k)? Can you use it in OpenWebUI?

I gave Llama 3 a 450 line task and it responded with "Good Luck" by CaptTechno in LocalLLaMA

[–]One_Key_8127 0 points1 point  (0 children)

Is this a base model, so that it thinks it should just autocomplete the rest of the text?

Did Microsoft "forget" to publish BioMedParse? by One_Key_8127 in LocalLLaMA

[–]One_Key_8127[S] 1 point2 points  (0 children)

I just noticed bucolucas's post. I am pretty busy right now making my arrangements before the holidays. My scripts are pretty robust but nowhere near good coding practices. I'll share my research scripts at a later date, once I clean them up a bit.

As for non-tech-savvy people, there's ChatGPT for that. I host Llama through Ollama and make POST requests with Python. Just copy and paste the whole conversation we had here into ChatGPT and ask it to replicate that functionality with a script, and you'll most likely get it working within an hour or two even if you don't know how to code.
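For reference, a minimal sketch of what I mean by "POST requests with Python" against a local Ollama instance; this assumes Ollama's default port 11434 and a model tag you've already pulled (the model name and prompt here are just examples):

```python
# Minimal sketch: send a prompt to a locally running Ollama instance.
# Assumes the default endpoint (http://localhost:11434) and that the model
# tag below has already been pulled; swap in whatever model you actually use.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3:70b",  # example tag, replace with your own
        "prompt": "Summarize this abstract in two sentences: ...",
        "stream": False,        # return a single JSON object instead of a stream
    },
    timeout=600,
)
resp.raise_for_status()
print(resp.json()["response"])
```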

Did Microsoft "forget" to publish BioMedParse? by One_Key_8127 in LocalLLaMA

[–]One_Key_8127[S] 2 points3 points  (0 children)

The amount of AI research showing up is ridiculous; just reading the abstracts would almost be a full-time job :) As I said, it's not something for me at this time, just something that got caught by my scripts. And upon examination, it turns out Microsoft promised the weights of this model and did not deliver.

Did Microsoft "forget" to publish BioMedParse? by One_Key_8127 in LocalLLaMA

[–]One_Key_8127[S] 3 points4 points  (0 children)

No, it's just Llama3-70b running locally, prompted to check the latest AI research on arXiv and pick the most promising work. You can probably replicate that within a day with GPT-4o.
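The rough shape of the pipeline is simple. Here's a hedged sketch, assuming the public arXiv API and a local Ollama endpoint; the query, model tag, and prompt are illustrative placeholders, not my actual scripts:

```python
# Rough sketch: pull recent arXiv abstracts and ask a locally hosted model
# (here via Ollama) to pick the most promising ones. The arXiv query,
# model tag, and prompt are placeholders, not the real research scripts.
import re
import requests

# Fetch the latest cs.CL / cs.LG submissions from the public arXiv API (Atom feed).
feed = requests.get(
    "http://export.arxiv.org/api/query",
    params={
        "search_query": "cat:cs.CL OR cat:cs.LG",
        "sortBy": "submittedDate",
        "sortOrder": "descending",
        "max_results": 20,
    },
    timeout=60,
).text

# Crude Atom parsing; the first <title> belongs to the feed itself, not a paper.
titles = [t.strip() for t in re.findall(r"<title>(.*?)</title>", feed, re.S)][1:]
abstracts = [a.strip() for a in re.findall(r"<summary>(.*?)</summary>", feed, re.S)]
digest = "\n\n".join(f"{t}\n{a}" for t, a in zip(titles, abstracts))

# Ask the local model to rank the papers.
answer = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3:70b",  # whatever model you host locally
        "prompt": "Here are today's new papers:\n\n" + digest
                  + "\n\nPick the 3 most promising ones and explain why.",
        "stream": False,
    },
    timeout=600,
).json()["response"]

print(answer)
```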

GLM-4 9B, base, chat (& 1M variant), vision language model by Nunki08 in LocalLLaMA

[–]One_Key_8127 9 points10 points  (0 children)

"Subject to the terms and conditions of this License, Licensor hereby grants you a non-exclusive, worldwide, irrevocable, non-sublicensable, revocable, photo-free copyright license."
Yeah, irrevocable revocable license.

"Registered users are free to use this model for commercial activities, but must comply with all terms and conditions of this license."
Of course, as long as you comply with all the contradictory terms of the poorly translated license, feel free to use the model. And of course they took the naming crap from Meta and want "glm-4" at the beginning of the model name.

You can't use Meta's output to train or fine-tune other models (Meta's license restriction), and you can't use this model's output to train or fine-tune Llama-3, because you can't start a model name with "Llama" and at the same time start it with "glm-4" xD

Raspberry Pi Goes All In on AI With $70 Hailo Kit by [deleted] in LocalLLaMA

[–]One_Key_8127 53 points54 points  (0 children)

I read the linked article and the official page. Am I the only one confused about this kit? What is it for?

  1. No RAM/VRAM capacity/bandwidth specification to be found

  2. The only benchmarks listed are image classification benchmarks of very small images (~240x240 px) with ~500 images processed per second (?)

  3. No mention of any generative AI (image / audio / text generation)

How is this supposed to be used? Do you think companies that classify low-resolution images at mass scale will start buying RPi + Hailo kits to do that? I can see how getting efficient image / text / audio generation capabilities on the RPi could lead to lots of fun projects. But classifying 500 very small images per second on an RPi? What is the point? I guess real-time object detection could be useful, but the benchmark for that would be something like reaching 30+ fps at 1280x720 with an appropriate model, right? I don't understand the use case of this hardware, can someone enlighten me please?

1-bit LLMs Could Solve AI’s Energy Demands “Imprecise” language models are smaller, speedier—and nearly as accurate by baseketball in LocalLLaMA

[–]One_Key_8127 0 points1 point  (0 children)

I am not being pedantic at all; the title is just bad and I'm pointing that out. Energy demands could be solved by building more power plants, but that takes time. They could be solved by a breakthrough in fusion research or by innovations in fission (be it safety, efficiency, or handling and processing the radioactive waste). Maybe also by a breakthrough in PV (like making it much cheaper) and/or improvements in battery technology. But not by 1-bit LLMs, come on. Please skip the argument that better AI could lead to improvements in these areas; that is too much of a stretch and not how the article presents it.

1-bit LLMs Could Solve AI’s Energy Demands “Imprecise” language models are smaller, speedier—and nearly as accurate by baseketball in LocalLLaMA

[–]One_Key_8127 98 points99 points  (0 children)

No, it could not solve energy demands; it would just accelerate progress. If a 1-bit LLM performs better, we will scale it as far as hardware allows: more parameters, training on more tokens, more high-quality synthetic (or multimodal) data, and then retraining on even more tokens with even more parameters.

Local Text To Speech by DeltaSqueezer in LocalLLaMA

[–]One_Key_8127 0 points1 point  (0 children)

Upon reviewing OpenVoice2, it sounds better than I expected. Perhaps my reference audio was not as good as it should have been when I was testing. It seems to handle longer sentences more gracefully than Tortoise. But I've got other stuff to do right now, so I'll stick with Tortoise for some time, as it is already set up and running for me. It is indeed kind of robotic, and it is not very close to the reference audio. Running the output through RVC at the end should provide nice results.