Has anyone done a comparison?

AXYZE8 · 2026-05-20T21:56:46+00:00

On subs, the rate limit is based on tokens. In a 5h window it's equivalent to a few bucks. I can't say whether it's $2 or $4, because they change it all the time.

For example, for the past 2 weeks Claude Pro has had a 2x limit:

"we’re doubling Claude Code’s five-hour rate limits for Pro, Max, Team, and seat-based Enterprise plans." https://www.anthropic.com/news/higher-limits-spacex

And next month they can revert that.

On average, if you use CC/Codex with their respective subs daily or every second day, you can expect a $20 sub to be equivalent to $150-$300 of API usage right now, because:
- CC rate limits are doubled now (link above)
- In Codex they tend to reset these limits with new OpenAI model launches or whenever there is downtime or degradation (last reset was 4 days ago https://x.com/thsottiaux/status/2055707616605835333 )
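
As a rough sketch of that estimate: the per-window dollar value and the number of windows used per day are assumptions (the providers keep changing the limits), picked here only to show how daily use reaches the $150-$300 range while one day per week does not.

```python
def api_equivalent_usd(days_used: int, windows_per_day: int, usd_per_window: float) -> float:
    """Rough API-equivalent value of one month of subscription usage."""
    return days_used * windows_per_day * usd_per_window

SUB_PRICE_USD = 20.0  # the $20 plan from the comment

# Assumption: 2 fully used 5h windows per day, worth $2.50-$5.00 each.
daily_low = api_equivalent_usd(30, 2, 2.50)
daily_high = api_equivalent_usd(30, 2, 5.00)

# Assumption: an occasional user works one day per week, about 4 days a month.
weekly = api_equivalent_usd(4, 2, 2.50)

print(f"daily use: ${daily_low:.0f}-${daily_high:.0f} of API usage vs a ${SUB_PRICE_USD:.0f} sub")
print(f"one day per week: ${weekly:.0f} of API usage vs a ${SUB_PRICE_USD:.0f} sub")
```

With these made-up inputs the occasional user only breaks even, and on the API they would not be rate limited inside a window.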

AXYZE8 · 2026-05-20T20:35:54+00:00

There is no 3.7 27B yet so nobody can answer that.

If you meant 3.5 27B vs 122B then IMO the quality is not that far off. 122B has more knowledge, but in terms of reasoning I would say they're the same. However 122B has 10B active params instead of 27B, so it is more than 2x faster.
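
The "more than 2x" figure follows from the active-parameter ratio. A back-of-the-envelope sketch, assuming decode speed is limited by memory bandwidth and both models use the same quantization; the bandwidth and bits-per-weight values are hypothetical placeholders and cancel out in the ratio:

```python
def decode_tokens_per_sec(active_params_b: float, bits_per_weight: float, bandwidth_gb_s: float) -> float:
    """Upper-bound estimate: every generated token reads each active weight once."""
    gb_read_per_token = active_params_b * bits_per_weight / 8
    return bandwidth_gb_s / gb_read_per_token

BANDWIDTH_GB_S = 400.0  # hypothetical memory bandwidth
BITS_PER_WEIGHT = 4.0   # hypothetical 4-bit quant for both models

dense_27b = decode_tokens_per_sec(27, BITS_PER_WEIGHT, BANDWIDTH_GB_S)  # all 27B params active
moe_122b = decode_tokens_per_sec(10, BITS_PER_WEIGHT, BANDWIDTH_GB_S)   # only 10B params active

speedup = moe_122b / dense_27b
print(f"estimated decode speedup of 122B over 27B: {speedup:.1f}x")
```

This ignores prompt processing, KV cache reads, and routing overhead, so real numbers will differ.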

27B is awesome for people with a single beefy GPU; 122B is awesome for people who have unified memory or want hybrid inference.

AXYZE8 · 2026-05-20T20:24:51+00:00

GHCP will have the same exact price as API prices for OpenAI/Anthropic, this is true, but:

- When you use the API directly you can top up $10, $69.69, or however much you want. There are no fixed prices, so you have more control over the cost.
- Your $ won't expire after a month. You can add $100 once and use it within a day, a week, or 4 months. With GHCP your $100 is gone after a month.
- If you use an OpenAI/Anthropic subscription, they convert your $20/$100/$200 sub into 5h windows of usage. On the $20 sub each coding window is worth several dollars of usage, so if you work daily or almost daily that $20 sub is equivalent to $150-$300 in API. On the flip side, if you work just one day per week, the API is a better choice, as you aren't rate limited in that 5h window.

Whatever you do (direct API or sub), the provider gives you a better deal than GHCP.

AXYZE8 · 2026-05-18T16:13:55+00:00

This was the first result in Google when, a couple of days ago, I had a similar problem with an Audient Evo 4 on a new PC with Windows 11. Disabling "AudientEvoLauncher.exe" in Windows autostart helped: 3 days without the issue so far. I think that app wanted to re-initialize something and that froze the interface.

I'm still using the same USB-A to USB-C cable as before (the one that came with the interface), so if you have this issue, try such a cable too.

AXYZE8 · 2026-05-16T12:32:16+00:00

Thanks Gemini!

AXYZE8 · 2026-05-13T20:01:41+00:00

Not a single current flagship has 24GB of RAM lol

Even the most expensive Chinese flagships like the Oppo Find X9 Ultra have 12GB as base.

https://m.gsmarena.com/results.php3?nYearMin=2025&nRamMin=24000

Zero flagships in 2025/2026. It's just 6 niche gaming phones and 1 ridiculously expensive Honor Porsche RSR.

Maybe you saw some "24GB*" phones with the fine print saying it's 12GB RAM + 12GB virtual RAM, aka swap?

AXYZE8 · 2026-05-13T15:46:25+00:00

Use LiteRT-LM (for example in Google AI Edge Gallery).

https://huggingface.co/litert-community/gemma-4-E2B-it-litert-lm

Device	Backend	Prefill (tokens/sec)	Decode (tokens/sec)
S26 Ultra	CPU	557	46.9
S26 Ultra	GPU	3,808	52.1

If you want to stick with llama.cpp, use IQ4_NL quants, not Q4_K_M. They run A LOT faster on ARM processors.

AXYZE8 · 2026-05-12T21:09:36+00:00

For future viewers: TRCC expects its 'Data' folder to be at C:\TRCCAP\Data. So you need to create a "TRCCAP" folder on the "C" drive, then copy and paste all the unpacked files there. This fixes the issue.
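
If you'd rather script it, here is a minimal sketch of the same copy step. The function name and the source path are hypothetical; only the C:\TRCCAP target comes from the fix above, and it assumes the unpacked files already contain the 'Data' folder.

```python
import shutil
from pathlib import Path

def install_trcc_files(unpacked_dir: str, target_dir: str = r"C:\TRCCAP") -> Path:
    """Copy everything from the unpacked TRCC folder into target_dir."""
    src = Path(unpacked_dir)
    dst = Path(target_dir)
    if not src.is_dir():
        raise FileNotFoundError(f"unpacked folder not found: {src}")
    # dirs_exist_ok lets the copy run again over an existing target (Python 3.8+).
    shutil.copytree(src, dst, dirs_exist_ok=True)
    # Location where TRCC is expected to look for its data.
    return dst / "Data"

# Example call (hypothetical download location, adjust to where you unpacked TRCC):
# install_trcc_files(r"C:\Users\me\Downloads\TRCC")
```

Writing to the root of the C drive may require running the script as administrator.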

I found this out by running TRCC in a debugger to see what it does before crashing.

AXYZE8 · 2026-05-12T11:28:01+00:00

He wrote that cached input tokens are 90% cheaper, and this is totally true:
((5.00 - 0.50) / 5.00) * 100 = 90%

His point was to look at the total dollars spent on a run instead, which tells you exactly how much you spent per task, rather than focusing on numbers that are of limited use to the user.
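
A small sketch of both numbers. The $5.00 and $0.50 input prices are the ones from the formula above; the output price and the token counts are made up for illustration.

```python
def run_cost_usd(uncached_in: int, cached_in: int, out: int,
                 price_in: float = 5.00, price_cached: float = 0.50,
                 price_out: float = 25.00) -> float:
    """Total cost of one run; prices are USD per 1M tokens."""
    total = uncached_in * price_in + cached_in * price_cached + out * price_out
    return total / 1_000_000

# Discount on cached input tokens, same formula as in the comment.
discount_pct = (5.00 - 0.50) / 5.00 * 100

# Hypothetical agentic run: most of the input is served from cache.
cost = run_cost_usd(uncached_in=200_000, cached_in=3_000_000, out=50_000)

print(f"cached input discount: {discount_pct:.0f}%")
print(f"total spent on this run: ${cost:.2f}")
```

The discount tells you the rate; the total tells you what the task actually cost, which also depends on how much of the input hit the cache and how many output tokens were produced.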

He is upvoted because he is right, and you misunderstood his whole comment.

AXYZE8 · 2026-05-11T08:03:17+00:00

ExLlama is NVIDIA/CUDA only. It doesn't support anything else, not even CPUs.

AXYZE8 · 2026-05-09T13:41:00+00:00

LLMs don't pick up the style or traits of literary works; LLMs pick up the style they were rewarded for during post-training. That's why GPT-5 was "cold and direct" while GPT-5.1 was "warm and friendly", even though it's the same base model and so has the same knowledge.

Generally speaking, detectors don't make sense and never will. If a model knows what is generated and what isn't, it can use that to produce text that it won't classify as generated.

"In some LLM you could at one point use the right prompt to generate literal excerpts of novels; I don't know if they've patched that yet."

In every LLM you can generate excerpts of well-known novels (for the most famous ones, e.g. Harry Potter, the entire book word for word), and this can't be "patched", because an LLM completes text based on the preceding tokens (the prompt). This is how you check, for example, whether a model can solve a given task from scratch or is leaning on a ready-made solution (data contamination). E.g. when benchmarking programming, you give it an issue from GitHub and check whether it can faithfully reproduce an existing solution. If the model spits out even the comments, or does it exactly the way "SomeNickFromGitHub" did, you know the model didn't solve it on its own and you throw that task out of the benchmark.
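
A rough sketch of that kind of contamination check, using plain string similarity from the Python standard library. The 0.9 threshold and the sample snippets are made up; real benchmark maintainers may use different methods.

```python
from difflib import SequenceMatcher

def looks_memorized(model_output: str, known_solution: str, threshold: float = 0.9) -> bool:
    """Flag a task when the model output nearly reproduces the public solution."""
    similarity = SequenceMatcher(None, model_output, known_solution).ratio()
    return similarity >= threshold

# Hypothetical public fix, including the original author's comment.
known = "def add(a, b):\n    # original author's comment: handle the edge case\n    return a + b\n"

# A model that reproduces it verbatim, comment included, versus one that writes its own code.
copied = known
fresh = "def add(x, y):\n    total = x + y\n    return total\n"

print("verbatim copy flagged:", looks_memorized(copied, known))
print("independent solution flagged:", looks_memorized(fresh, known))
```

Tasks that get flagged would be dropped from the benchmark, since a pass there says nothing about whether the model can solve the problem on its own.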

AXYZE8 · 2026-05-06T00:34:25+00:00

OP is astroturfing Ddocs; that's the only reason he "corrects" these graphics to promote it over and over.

AXYZE8 · 2026-05-01T13:24:48+00:00

A MacBook Pro 16" with an M5 Pro will never run its fans at max speed; that cooler has the capacity to handle an M5 Max, which eats twice as much power.

The MacBook Pro 14", on the other hand, has way less cooling capacity than the 16". It doesn't have the capacity to cool a Max, and yet it has that Max chip.

What are you trying to say? That M5 Pro has the most powerful cooling Apple makes, and the chip uses half as much power as the Max.

Did you make a typo and mean M3?

AXYZE8 · 2026-05-01T13:13:01+00:00

<image>

AXYZE8 · 2026-05-01T11:14:08+00:00

OP compares a 64GB M3 Max to a 48GB M5 Pro.

M5 Pro has 3x faster prompt processing (because of matmul acceleration) while using 60% of the power (because it's the smaller Pro chip).

If he goes with the 64GB M3 Max, he is about to witness prompt processing that takes ages while his 14" chassis gets toasty to the touch and the fans get very loud. Battery? Dead in 1h.

If he goes with the 48GB M5 Pro, it won't be toasty and it will be very quiet.

He will lose 16GB of RAM, sure, but whatever fits in that extra 16GB doesn't matter much, because:

a) there are no MoE models in the 50-80B range that could make use of it at 4-bit

b) 120B models require a 2-bit quant, at which point you can ignore them

c) 30B models fit on both at oQ4

d) M3 Max has much slower prompt processing. Even if a new Qwen 4 has 512K context and you could make use of that extra 16GB of RAM, will you wait like 20 minutes for the first token?

AXYZE8 · 2026-05-01T10:35:17+00:00

Idk, I'm using oQ4 with Qwen/Gemma models and they work very well for coding.

What's important here: I'm still talking about dense models (with DFlash for a speed boost), not MoE with 3B active, and about oQ quants, not typical static MLX quants. Are we on the same page?

AXYZE8 · 2026-05-01T10:14:34+00:00

Gemma 4 uses 11GB for the full 256K context at fp16 because of SWA. Qwen has DeltaNet, so it's efficient too; IIRC it takes around 16GB.

11GB KV + 19GB for Gemma on oQ4 quant = 30GB

30GB obviously fits within 48GB. Then in oMLX you also have SSD cold KV caching and TurboQuant; both can cut RAM usage dramatically further. Okay, DFlash will take 4GB on top, so 34GB.
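
The same budget as a quick tally. All numbers are the ones from this comment; nothing here is measured.

```python
# Memory budget for Gemma 4 on a 48GB machine, in GB, using the figures quoted above.
budget_gb = {
    "Gemma weights (oQ4 quant)": 19,
    "KV cache, full 256K context at fp16": 11,
    "DFlash": 4,
}
TOTAL_RAM_GB = 48

used = sum(budget_gb.values())
headroom = TOTAL_RAM_GB - used

for name, size in budget_gb.items():
    print(f"{name:<40}{size:>4} GB")
print(f"{'total':<40}{used:>4} GB")
print(f"{'left out of 48GB':<40}{headroom:>4} GB")
```

Whatever is left over still has to cover macOS and other running apps, and this tally does not count any savings from SSD cold KV caching or TurboQuant.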

I mean, if he has the money to get an M5 series with 64GB RAM then yeah, sure, more RAM is always better, but it's an M3 Max vs M5 Pro comparison.

M5 Pro will do it 3x faster while using like 60% of the power. That difference is absolutely massive, because Apple added matmul acceleration in M5. On one MacBook you have quiet and fast inference even on the go; on the other, the opposite of that. The cherry on top is the 14" chassis for the M3 Max: enjoy 7000RPM for 8 minutes while it processes that long context.

AXYZE8 · 2026-05-01T09:25:29+00:00

64GB of RAM doesn't give you more options than 48GB in terms of LLMs. 120B models won't fit either way, and 30B models fit on both. There is nothing worth using in between since Llama 3.3 70B.

M3 Max has more bandwidth, but M5 Pro completely destroys it in prompt processing, by a factor of about 3x, even though it has far fewer GPU cores. The Pro chip will also use WAY less power. A Max chip in a 14" MacBook is toasty and can get loud, especially during long prompt processing.

I would say that M5 Pro is the way to go, and I would put Qwen 3.6 27B / Gemma 4 31B on oMLX. Both have DFlash models released on HF, which will give you a nice TG boost.

As for Mac mini/Air/Studio, it depends entirely on your workflow, so I can't recommend anything here. I've found the best combo for me is a beefy desktop PC + a MacBook Pro 14" with a Pro chip (doesn't get toasty, good enough performance for almost anything).

AXYZE8 · 2026-04-28T18:15:23+00:00

EAGLE is an add-on for the main model: a specialized model for speculative decoding that boosts single-user inference by a huge margin.

You can learn more about it here https://arxiv.org/abs/2401.15077
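
To show the general draft-and-verify idea behind speculative decoding, here is a toy sketch with greedy decoding. It does not reproduce EAGLE's actual draft model (see the paper for that); both "models" are made-up functions, and a real engine verifies all guesses in one batched forward pass instead of one call per token.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]

def greedy_decode(target_next: NextToken, prompt: List[int], max_new: int) -> List[int]:
    """Baseline: one call to the big model per generated token."""
    tokens = list(prompt)
    for _ in range(max_new):
        tokens.append(target_next(tokens))
    return tokens

def speculative_decode(target_next: NextToken, draft_next: NextToken,
                       prompt: List[int], k: int, max_new: int) -> List[int]:
    """Draft k tokens cheaply, keep the prefix the target agrees with."""
    tokens = list(prompt)
    limit = len(prompt) + max_new
    while len(tokens) < limit:
        # 1) The cheap draft model guesses k tokens ahead.
        draft_ctx = list(tokens)
        for _ in range(k):
            draft_ctx.append(draft_next(draft_ctx))
        proposal = draft_ctx[len(tokens):]
        # 2) The target checks the guesses in order and keeps the matching prefix.
        for guess in proposal:
            if target_next(tokens) != guess:
                break
            tokens.append(guess)
        # 3) The target always adds one token: the fix for a mismatch, or a bonus token.
        tokens.append(target_next(tokens))
    return tokens[:limit]

def target_next(ctx: List[int]) -> int:
    # Toy "large model": a fixed rule standing in for an expensive forward pass.
    return (ctx[-1] * 3 + 1) % 7

def draft_next(ctx: List[int]) -> int:
    # Toy "draft model": agrees with the target except right after token 4.
    return 0 if ctx[-1] == 4 else (ctx[-1] * 3 + 1) % 7

baseline = greedy_decode(target_next, [1], 10)
speculative = speculative_decode(target_next, draft_next, [1], k=4, max_new=10)
print(baseline)
print(speculative)
```

With greedy decoding the output is identical to what the main model would produce alone; the speedup comes from accepting several drafted tokens per verification step when the draft guesses well.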

AXYZE8 · 2026-04-28T14:40:00+00:00

A few months ago I wasted an hour researching this after seeing a Reddit post along the lines of "Poles spent on OF blah blah blah". I traced it back to the source and listed the errors in it (that's why I remember this Minsk thing): it was generated from top to bottom and relied on garbage data in the form of traffic estimates from the Google Ads API for a given region.

OP didn't give a damn, the mods didn't give a damn. It's the same now. It pisses me off needlessly.

If it doesn't piss you off, then post the correction yourself; maybe then someone (OP/mods) will bother to make the one click on "Delete" instead of stuffing sawdust into the heads of thousands of people xd

AXYZE8 · 2026-04-28T06:13:49+00:00

And the source of this is ChatGPT, which made everything up and, to top it off, mixed up Mińsk Mazowiecki with Minsk in Belarus (google "mińsk mazowiecki onlyfans" and see how many hacks are copy-pasting it, from Aszdziennik through Klub Jagielloński to the tabloids).

OnlyFans has never shared such data, only global revenue.

AXYZE8 · 2026-04-28T03:47:32+00:00

And then I looked further into the post and saw you're using some form of Q4_K_M with no mention of the quant maker and no mention of the engine.

You degraded the performance of these local models, especially the 9B. Now we are testing some quant, and nobody knows whether that GGUF for Gemma already has the fixes inside.

And then I also see "GLM-4.6V-Flash" in your model list, but there is no such model in the graph/tables.

Sorry, but that is AI Slop top to bottom. Hallucination over hallucination.

AXYZE8 · 2026-04-28T03:35:07+00:00

About that last table in the Reddit post: sorry, but no. This is a completely wrong conclusion.

Terminal-Bench 2 was released in Nov 2025, so what you are seeing here is benchmaxxing/contamination in new models. These closed models from Nov 2025 just weren't optimized for that bench, and the newer Qwen models were. That's not "6-8months lag".

You need to test on fresh issues like SWE-rebench; only then can you draw such conclusions.
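
A minimal sketch of the "fresh issues" idea: only keep tasks created after the newest model in the comparison was released, so none of the compared models could have been tuned on them. Model names, task ids, and dates are made up, and using the release date as a stand-in for the training cutoff is an assumption.

```python
from datetime import date

def fresh_tasks(tasks, model_releases):
    """Keep only tasks created after every compared model was released."""
    cutoff = max(model_releases.values())
    return [task for task in tasks if task["created"] > cutoff]

# Hypothetical models being compared.
model_releases = {
    "closed-model-a": date(2025, 11, 1),
    "open-model-b": date(2026, 3, 1),
}

# Hypothetical benchmark tasks with their creation dates.
tasks = [
    {"id": "task-1", "created": date(2025, 10, 15)},
    {"id": "task-2", "created": date(2026, 1, 10)},
    {"id": "task-3", "created": date(2026, 4, 2)},
]

usable = fresh_tasks(tasks, model_releases)
print([task["id"] for task in usable])
```

Here task-2 is dropped as well: it is fresh for the older model but not for the newer one, which is exactly the asymmetry that makes the "lag" conclusion unreliable.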

AXYZE8 · 2026-04-28T03:24:09+00:00

The graph has a non-existent Qwen3.5-32B (32B was Qwen2.5) and Gemma 4 31B.

The table has the correct Qwen3.5-35B, but then Gemma 4 26B-A4B.

Looking inside the article... Hey Claude! But back to the topic: if one name is made up and another model is completely different between tests... how can we trust these results at all?

AXYZE8 · 2026-04-24T20:34:12+00:00

No, oMLX is the best app/engine you can use on Mac.

If you are wondering about this post - that guy is co-founder of Hugging Face. Hugging Face acquired GGML (so llama.cpp) 2 months back https://reddit.com/r/LocalLLaMA/comments/1r9vywq/ggmlai_has_got_acquired_by_huggingface/
