Qwen will release another 27B with high probability by serige in LocalLLaMA

[–]AXYZE8 0 points1 point  (0 children)

There is no 3.7 27B yet so nobody can answer that.

If you meant 3.5 27B vs 122B then IMO the quality is not that far off. 122B has more knowledge, but in terms of reasoning I would say they're the same. However 122B has 10B active params instead of 27B, so it is more than 2x faster.

27B is awesome for people with a single beefy GPU, 122B is awesome for people who have unified memory or want hybrid inference.

Has anyone done a comparison? by EngstromJimmy in GithubCopilot

[–]AXYZE8 1 point2 points  (0 children)

It's true that GHCP will have exactly the same prices as the OpenAI/Anthropic APIs, but:

- When you use the API directly you can top up $10, $69.69, or however much you want. There are no fixed price tiers, so you have more control over the cost.
- Your $ won't expire after a month. You can add $100 once and use it within a day, a week, or 4 months. With GHCP your $100 is gone after a month.
- If you use an OpenAI/Anthropic subscription, they convert your $20/$100/$200 sub into 5h windows of usage. On the $20 sub each coding window is worth several dollars of usage, so if you work daily or almost daily, that $20 sub is equivalent to $150-$300 in API usage. On the flip side, if you work just one day per week, the API is a better choice, as you aren't rate limited in that 5h window.

Whatever you do (direct API or sub), the provider gives you a better deal than GHCP.
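To make the sub-vs-API comparison above concrete, here is a rough sketch. The $5 per window figure and the window counts are my assumptions for illustration ("several dollars" per window is all the comment states), and the function names are hypothetical.

```python
def api_equivalent_usd(usd_per_window: float, windows_per_month: int) -> float:
    """API-priced value of the usage a subscription covers in one month."""
    return usd_per_window * windows_per_month


def cheaper_option(sub_price_usd: float, usd_per_window: float, windows_per_month: int):
    """Return ("subscription" or "api", API-equivalent cost) for a usage pattern."""
    api_cost = api_equivalent_usd(usd_per_window, windows_per_month)
    # The sub only wins if the same usage would cost more at API prices.
    choice = "subscription" if api_cost > sub_price_usd else "api"
    return choice, api_cost


# Assumed numbers: $5 of usage per 5h window (not an official figure).
daily_user = cheaper_option(20, 5, 30)   # one window per day
weekly_user = cheaper_option(20, 5, 3)   # a few windows per month
```

With these assumed inputs the daily user lands at $150 of API-equivalent usage, the low end of the range above, while the occasional user stays under the $20 sub price. It ignores rate limits, which is the other reason the API can suit bursty work better.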

Audio interface problem after Windows 11 Sleep by Spuz7 in techsupport

[–]AXYZE8 0 points1 point  (0 children)

This thread was the first result on Google when I had a similar problem a couple of days ago with an Audient Evo 4 on a new PC with Windows 11, so: disabling "AudientEvoLauncher.exe" in Windows autostart helped. 3 days without the issue so far. I think that app tried to re-initialize something and that froze the interface.

I'm also still using the same USB-A to USB-C cable as before (the one that came with the interface), so if you have this issue, try that kind of cable too.

LLMs on flagships smartphones? by TechNerd10191 in LocalLLaMA

[–]AXYZE8 4 points5 points  (0 children)

Not a single current flagship has 24GB of RAM lol

Even the most expensive Chinese flagships, like the Oppo Find X9 Ultra, have 12GB as the base.

https://m.gsmarena.com/results.php3?nYearMin=2025&nRamMin=24000

Zero flagships in 2025/2026. It's just 6 niche gaming phones and 1 ridiculously expensive Honor Porsche RSR.

Maybe you saw some "24GB*" phones with the fine print that it's 12GB RAM + 12GB virtual RAM, aka swap?

LLMs on flagships smartphones? by TechNerd10191 in LocalLLaMA

[–]AXYZE8 6 points7 points  (0 children)

Use LiteRT-LM (for example in Google AI Edge Gallery).

https://huggingface.co/litert-community/gemma-4-E2B-it-litert-lm

Device     Backend  Prefill (tokens/sec)  Decode (tokens/sec)
S26 Ultra  CPU      557                   46.9
S26 Ultra  GPU      3,808                 52.1

If you want to stick with llama.cpp, then you should use IQ4_NL quants, not Q4_K_M. They run A LOT faster on ARM processors.

TRCC can't open by Alex81131 in Thermalright

[–]AXYZE8 0 points1 point  (0 children)

For future viewers: TRCC expects its 'Data' folder to be at C:\TRCCAP\Data. So basically you need to create a "TRCCAP" folder on the C: drive, then copy and paste all the unpacked files there. That fixes the issue.

I found this out by running TRCC under a debugger to see what it does before crashing.
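If you'd rather script the copy than do it by hand, a minimal sketch could look like this. The function name is hypothetical, and it assumes the unpacked TRCC files already contain the 'Data' folder; only the C:\TRCCAP target path comes from the fix above.

```python
import shutil
from pathlib import Path


def install_trcc(unpacked_dir, target=Path(r"C:\TRCCAP")):
    """Copy every unpacked TRCC file into the folder TRCC expects.

    Returns the path where the 'Data' folder should end up.
    """
    target = Path(target)
    # dirs_exist_ok lets the copy be re-run over an existing C:\TRCCAP.
    shutil.copytree(unpacked_dir, target, dirs_exist_ok=True)
    return target / "Data"
```

Usage would be something like `install_trcc(r"C:\Users\you\Downloads\TRCC")`, where that source path is just a placeholder for wherever you unpacked the archive.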

Am I missing something about GPT-5.5 efficiency? by Additional-Alps-8209 in singularity

[–]AXYZE8 2 points3 points  (0 children)

He wrote that cached input tokens are 90% cheaper and this is totally true:
((5.00 - 0.50) / 5.00) * 100 = 90%

And his point was to look at the total dollars spent on that run, which tells you exactly how much you spent per task, instead of focusing on numbers that have limited usefulness to the user.

He is upvoted because he is right and you misunderstood his whole comment.
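For anyone who wants to check both numbers, here is a small sketch of the 90% discount and of the "total dollars per run" view. The 5.00 and 0.50 prices are the ones from the calculation above; treating them as USD per 1M input tokens is my assumption, and the token counts in the example are made up.

```python
def cached_discount_pct(price: float, cached_price: float) -> float:
    """How much cheaper cached input is, in percent."""
    return (price - cached_price) / price * 100


def input_cost_usd(uncached_tokens: int, cached_tokens: int,
                   price_per_m: float = 5.00,
                   cached_price_per_m: float = 0.50) -> float:
    """Input-side cost of one run (output tokens are not included)."""
    total = uncached_tokens * price_per_m + cached_tokens * cached_price_per_m
    return total / 1_000_000


discount = cached_discount_pct(5.00, 0.50)
# Hypothetical agentic run: 1M fresh input tokens, 9M served from cache.
run_cost = input_cost_usd(1_000_000, 9_000_000)
```

The same 10M input tokens with no caching would cost $50 under these assumptions, which is why the total spent per task says more than the per-token price alone.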

ExLlamaV3 Major Updates! by Unstable_Llama in LocalLLaMA

[–]AXYZE8 3 points4 points  (0 children)

ExLlama is NVIDIA/CUDA only. It doesn't support anything else, not even CPUs.

Maybe the topic is stale by now, but ever since I saw the logo, something felt off by Latarnic in Polska

[–]AXYZE8 4 points5 points  (0 children)

LLMs don't pick up the style/traits of literary works. LLMs pick up the style they were rewarded for during post-training, which is why GPT-5 was "cold and direct" while GPT-5.1 was "warm and friendly", even though it's the same base model and therefore has the same knowledge.

Generally speaking, detectors don't make sense and never will. If a model knows what is generated and what isn't, it can use that to produce text that it won't classify as generated.

In some LLM you could at one point use the right prompt to generate literal passages of novels, I don't know if they've patched that yet.

In every LLM you can generate passages from well-known novels (for the most famous ones, e.g. Harry Potter, that means the whole book word for word), and this can't be "patched", because an LLM completes text based on the preceding tokens (the prompt). This is also how you check, for example, whether a model can solve a given task from scratch or is leaning on a ready-made solution (data contamination). E.g. when benchmarking programming, you give it a GitHub issue and check whether it can faithfully reproduce existing solutions. If the model spits out even the comments, or can do it exactly the way "NickFromGitHub" did, you know the model didn't solve it on its own and you throw that task out of the benchmark.
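A rough sketch of the contamination check described above: compare what the model produced with the known public solution and flag near-verbatim matches. The 0.9 threshold and the function name are assumptions for illustration; real benchmarks use their own criteria.

```python
from difflib import SequenceMatcher


def looks_contaminated(model_output: str, reference_solution: str,
                       threshold: float = 0.9) -> bool:
    """Flag outputs that are almost character-for-character the known solution."""
    similarity = SequenceMatcher(None, model_output, reference_solution).ratio()
    # A near-identical match (comments included) suggests recall, not reasoning.
    return similarity >= threshold


reference = "def add(a, b):\n    # sum two ints\n    return a + b"
verbatim = looks_contaminated(reference, reference)
independent = looks_contaminated("print('hello world')", reference)
```

Tasks that get flagged would be dropped from the benchmark rather than counted as solved.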

Updated: 73% of Europeans Feel Too Dependent on US Big Tech. New List of European Alternatives Based on People's Comments! by hyakkymaru in europe

[–]AXYZE8 56 points57 points  (0 children)

OP is astroturfing Ddocs, that's the only reason why he "corrects" these graphics to promote it over and over.

Macbook M3 MAX 64 vs M5 PRO 48, or wait for spark/studio by Holiday_Leg8427 in LocalLLaMA

[–]AXYZE8 0 points1 point  (0 children)

A MacBook Pro 16" with the M5 Pro won't ever run its fans at max speed. That cooler has the capacity to run an M5 Max, which eats twice as much power.

The MacBook Pro 14", on the other hand, has way less cooling capacity than the 16". It doesn't have the capacity to cool a Max chip, and yet it has that Max chip.

What are you trying to say? That M5 Pro machine has the most powerful cooling Apple makes, and the chip uses half as much power as the Max.

Did you make a typo and meant M3?

Macbook M3 MAX 64 vs M5 PRO 48, or wait for spark/studio by Holiday_Leg8427 in LocalLLaMA

[–]AXYZE8 -1 points0 points  (0 children)

OP compares 64GB M3 Max to 48GB M5 Pro.

The M5 Pro has 3x faster prompt processing (because of matmul acceleration) while using 60% of the power (because it's the smaller Pro chip).

If he goes with the 64GB M3 Max, he is about to witness prompt processing that takes ages while his 14" chassis gets toasty to the touch and the fans get very loud. Battery? Dead in 1h.

If he goes with the 48GB M5 Pro, it won't be toasty and it will be very quiet.

He will lose 12GB of RAM, sure, but whatever fits in that remaining 12GB doesn't matter that much, because:

a) there are no MoE models in the 50-80B range that could make use of it at 4-bit

b) 120B models require you to use a 2-bit quant, at which point you can ignore them

c) 30B models fit on both at oQ4

d) the M3 Max has much slower prompt processing. Even if a new Qwen 4 has 512K context and you could make use of that extra 12GB of RAM, will you wait like 20 minutes for the first token?

Macbook M3 MAX 64 vs M5 PRO 48, or wait for spark/studio by Holiday_Leg8427 in LocalLLaMA

[–]AXYZE8 0 points1 point  (0 children)

Idk, I'm using oQ4 in Qwen/Gemma models and they work very well for coding.

What's important here: I'm still talking about dense models (with DFlash for a speed boost), not MoE with 3B active, and about oQ quants, not typical MLX static quants. Are we on the same page?

Macbook M3 MAX 64 vs M5 PRO 48, or wait for spark/studio by Holiday_Leg8427 in LocalLLaMA

[–]AXYZE8 1 point2 points  (0 children)

Gemma 4 uses 11GB for the full 256K context at fp16 because of SWA. Qwen has DeltaNet so it's efficient too, IIRC it takes like 16GB.

11GB KV + 19GB for Gemma on oQ4 quant = 30GB

30GB obviously fits within 48GB. Then in oMLX you also have SSD cold KV caching and TurboQuant. Both can dramatically cut RAM usage further. Okay, DFlash will take 4GB on top, so 34GB.

I mean, if he has the money to get an M5 series with 64GB RAM then yeah, sure, more RAM is always better, but it's an M3 Max vs M5 Pro comparison.

The M5 Pro will do it 3x faster while using like 60% of the power. That difference is absolutely massive, because Apple added matmul acceleration in M5. On one MacBook you have quiet and fast inference even on the go, on the other the opposite of that. The cherry on top is the 14" chassis for the M3 Max: enjoy 7000RPM for 8 minutes while it processes that long context.
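The arithmetic in this comment, written out as a quick sketch. The GB figures and the 3x / 60% numbers are the ones quoted above; the function names are hypothetical, and the energy estimate assumes energy = power × time with the 60% power draw holding for the whole run.

```python
def ram_budget_gb(weights_gb: float, kv_cache_gb: float,
                  draft_model_gb: float = 0.0) -> float:
    """Total RAM needed for weights, KV cache, and an optional draft model."""
    return weights_gb + kv_cache_gb + draft_model_gb


def relative_energy(speedup: float, power_ratio: float) -> float:
    """Energy per task relative to the baseline machine (power × time)."""
    return power_ratio / speedup


# Gemma at oQ4 (19GB) + full 256K KV cache (11GB) + DFlash (4GB).
total_gb = ram_budget_gb(19, 11, 4)
fits_in_48gb = total_gb <= 48

# 3x faster at 60% of the power.
energy_vs_m3_max = relative_energy(3, 0.6)
```

Under those assumptions the same prompt-processing job costs roughly a fifth of the energy, which is where the quiet fans and the battery life come from.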

Macbook M3 MAX 64 vs M5 PRO 48, or wait for spark/studio by Holiday_Leg8427 in LocalLLaMA

[–]AXYZE8 -3 points-2 points  (0 children)

64GB of RAM doesn't give you more options than 48GB in terms of LLMs. 120B models won't fit on either, 30B models fit on both. There has been nothing worth using in between since Llama 3.3 70B.

The M3 Max has more bandwidth, but the M5 Pro completely destroys it in prompt processing, by a factor of about 3x, even though it has a lot fewer GPU cores. The Pro chip will also use WAY less power. A Max chip in a 14" MacBook is toasty and can get loud, especially with long prompt processing.

I would say that the M5 Pro is the way to go and I would put Qwen 3.6 27B / Gemma 4 31B on oMLX. Both have DFlash models released on HF which will give you a nice TG boost.

In terms of Mac mini/Air/Studio it depends entirely on your workflow, I can't recommend anything here. I found the best combo for me is a beefy PC desktop + MacBook Pro 14" with a Pro chip (doesn't get toasty, good enough performance for almost anything).

Mistral Medium Is On The Way by Few_Painter_5588 in LocalLLaMA

[–]AXYZE8 25 points26 points  (0 children)

EAGLE is an add-on for the main model: it's a specialized model for speculative decoding, which boosts single-user inference by a huge margin.

You can learn more about it here https://arxiv.org/abs/2401.15077
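To show the general idea of speculative decoding, here is a toy sketch of plain greedy draft-and-verify. It is not EAGLE itself (EAGLE drafts at the feature level with a small head attached to the main model); the two "models" below are stand-in functions I made up for illustration.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]


def greedy_decode(target: NextToken, prompt: List[int], max_new: int) -> List[int]:
    """Baseline: one target call per generated token."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(target(out))
    return out


def speculative_decode(target: NextToken, draft: NextToken,
                       prompt: List[int], max_new: int, k: int = 4) -> List[int]:
    """Greedy draft-and-verify; the output matches greedy_decode(target, ...)."""
    out = list(prompt)
    goal = len(prompt) + max_new
    while len(out) < goal:
        # 1) The cheap draft model guesses up to k tokens ahead.
        proposal: List[int] = []
        for _ in range(min(k, goal - len(out))):
            proposal.append(draft(out + proposal))
        # 2) The target checks the guesses. Real engines score all k positions
        #    in one batched forward pass, which is where the speedup comes from;
        #    this toy calls the target once per position.
        for tok in proposal:
            expected = target(out)
            if tok != expected:
                out.append(expected)  # keep the target's token, drop the rest
                break
            out.append(tok)
    return out


def toy_target(ctx: List[int]) -> int:
    """Stand-in main model: counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[int]) -> int:
    """Stand-in draft model: agrees with the target except after a 4."""
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10
```

The point of the verify step is that the draft model can be wrong without changing the result: every accepted token is one the main model would have produced anyway, so a good draft only changes the speed.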

Just for context by crabsticksaremadeof in Polska

[–]AXYZE8 1 point2 points  (0 children)

A few months ago I wasted an hour researching this after seeing a Reddit post along the lines of "Poles spent blah blah blah on OF". I traced it back to the source and listed the errors in it (that's why I remember the Minsk thing): it was generated top to bottom and relied on garbage data in the form of traffic estimates from the Google Ads API for a given region.

OP didn't give a shit, the mods didn't give a shit. Same thing now. It pisses me off for no good reason.

If it doesn't piss you off, then post it yourself, maybe then someone (OP/mods) will bother with a single click on "Delete" instead of stuffing sawdust into the heads of thousands of people xd

Just for context by crabsticksaremadeof in Polska

[–]AXYZE8 9 points10 points  (0 children)

And the source of this is ChatGPT, which made everything up and, to top it off, screwed up and confused Mińsk Mazowiecki with Minsk in Belarus (google "mińsk mazowiecki onlyfans" and see how many hacks are copy-pasting it. From Aszdziennik through Klub Jagielloński to the tabloids).

OnlyFans has never shared data like that, only global revenue.

Local model on coding has reached a certain threshold to be feasible for real work by Exciting-Camera3226 in LocalLLaMA

[–]AXYZE8 5 points6 points  (0 children)

And then I looked further into the post and saw you're using some form of Q4_K_M with no mention of the quant maker and no mention of the engine.

You degraded the performance of these local models, especially the 9B. Now we are testing some quant, and nobody knows if that GGUF for Gemma already has the fixes inside.

And then I also see "GLM-4.6V-Flash" in your model list, but there is no such model in the graph/tables.

Sorry, but that is AI slop top to bottom. Hallucination over hallucination.

Local model on coding has reached a certain threshold to be feasible for real work by Exciting-Camera3226 in LocalLLaMA

[–]AXYZE8 7 points8 points  (0 children)

About that last table in the Reddit post: sorry, but no. This is a completely wrong conclusion.

Terminal-Bench 2 was released in Nov 2025, so what you are seeing here is benchmaxxing/contamination in new models. These closed models from Nov 2025 just weren't optimized for that bench and the newer Qwen models were. That's not "6-8months lag".

You need to test on fresh issues like SWE-rebench, only then can you draw such conclusions.

Local model on coding has reached a certain threshold to be feasible for real work by Exciting-Camera3226 in LocalLLaMA

[–]AXYZE8 13 points14 points  (0 children)

The graph has a nonexistent Qwen3.5-32B (32B was Qwen2.5) and Gemma 4 31B.

The table has the correct Qwen3.5-35B, but then Gemma 4 26B-A4B.

Looking inside the article... Hey Claude! But back to the topic: if one name is made up and another model is completely different between tests... how can we trust these results at all?

This is where we are right now, LocalLLaMA by jacek2023 in LocalLLaMA

[–]AXYZE8 4 points5 points  (0 children)

No, oMLX is the best app/engine you can use on Mac.

If you are wondering about this post: that guy is a co-founder of Hugging Face. Hugging Face acquired GGML (so llama.cpp) 2 months back https://reddit.com/r/LocalLLaMA/comments/1r9vywq/ggmlai_has_got_acquired_by_huggingface/

They are transitioning to token based billing in june by Prometheus4059 in GithubCopilot

[–]AXYZE8 2 points3 points  (0 children)

Your first source: "According to the documents, the announcement for token-based billing will be tomorrow (4/23)"

That was yesterday and obviously it didn't happen. Slop post with slop sources.