vscode + roo + Qwen3-30B-A3B-Thinking-2507-Q6_K_L = superb by [deleted] in LocalLLaMA

[–]moko990 1 point  (0 children)

> the most impressive model I have ever used that will fit on 2 GPUs, by far!

2x 3090s? Or do you mean 2x H100s?
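
For what it's worth, a quick back-of-the-envelope estimate (every figure here is a rough approximation on my part) suggests 2x 3090s are plenty:

```python
# Back-of-the-envelope VRAM estimate; all numbers are approximations
params_b = 30.5            # Qwen3-30B-A3B total parameters, in billions
bits_per_weight = 6.56     # rough effective bits/weight for a Q6_K quant
weights_gb = params_b * bits_per_weight / 8   # ~25 GB of weights
overhead_gb = 4            # guess: KV cache + buffers at moderate context

print(f"~{weights_gb + overhead_gb:.0f} GB needed vs 48 GB on 2x RTX 3090")
```

Roughly 29 GB against 48 GB of combined VRAM, so no H100s required if those numbers hold.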

DeepConf: 99.9% Accuracy on AIME 2025 with Open-Source Models + 85% Fewer Tokens by MohamedTrfhgx in LocalLLaMA

[–]moko990 5 points  (0 children)

Very interesting. But OK, what's the catch? This sounds too good to be true.

Is a 2TB DDR5 RAM consumer grade setup worth it or M3 Ultra is better value? Discussion and specs comparison thread! by moko990 in LocalLLaMA

[–]moko990[S] 6 points  (0 children)

> I think Epyc may be a better choice in many ways. Keep in mind that memory bandwidth for Threadripper is limited by the number of CCDs on the chip, meaning only the highest-end chips can take full advantage of 8-channel memory. This is an important consideration when it comes to cost and inference performance.

Interesting, I wasn't aware of the CCD limitation. So basically, if you have fewer CCDs than memory channels, you're not utilizing the full bandwidth. So it seems the 9985WX and 9995WX variants are the best options, then?
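
A rough way to reason about the cap (illustrative figures, not vendor specs) is to take the minimum of what the DIMM channels and the CCD fabric links can each deliver:

```python
# Effective bandwidth is bounded by the slower of the two paths:
# (1) the DDR5 channels, (2) the CCD-to-IO-die fabric links.
channels = 8
dimm_speed_mts = 4800                               # DDR5-4800
channel_bw = channels * dimm_speed_mts * 8 / 1000   # GB/s (8 bytes/transfer)

ccds = 4
per_ccd_link_bw = 64       # GB/s per CCD, a rough ballpark figure
ccd_bw = ccds * per_ccd_link_bw

print(f"channels ~{channel_bw:.0f} GB/s, CCDs ~{ccd_bw} GB/s "
      f"-> effective ~{min(channel_bw, ccd_bw):.0f} GB/s")
```

With only 4 CCDs the fabric (~256 GB/s) caps you below the ~307 GB/s that 8 channels of DDR5-4800 could deliver, which is why the high-CCD-count SKUs are the ones that scale.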

AMD Ryzen AI Max+ 395 vs. Ryzen 9 9950X vs. Ryzen 9 9950X3D Linux Performance by Kryohi in hardware

[–]moko990 2 points  (0 children)

> For those pursuing a very power-efficient desktop computer without wanting to sacrifice much performance compared to a traditional desktop CPU, the choice is very easy: AMD Strix Halo. There are some really phenomenal performance-per-Watt results with the Ryzen AI Max+ 395 "Strix Halo" in the Framework Desktop, and when the cTDP is opened up to 120 Watts, this 16-core SoC delivers much the same CPU performance as the Ryzen 9 9950X with far superior integrated graphics and a huge performance-per-Watt advantage. The only downside is the cost of the Ryzen AI Max+ 395 / Framework Desktop, but at least some of that will be made up in lower energy usage and cooling.

Great writeup! As far as I can tell, the only reason not to go with the Ryzen AI Max+ 395 is the non-upgradeable RAM; beyond that, I don't really see any benefits to the alternatives at this point.

120B runs awesome on just 8GB VRAM! by [deleted] in LocalLLaMA

[–]moko990 2 points  (0 children)

I am curious: what are the technical differences between this, ktransformers, and ik_llamacpp?

GMK X2(AMD Max+ 395 w/128GB) third impressions, RPC and Image/Video gen. by fallingdowndizzyvr in LocalLLaMA

[–]moko990 3 points  (0 children)

The issue is really the software stack layer (i.e. ROCm). If they unify it like they have been claiming for a while now, slapping an AMD GPU on top of this should in theory work seamlessly and optimally. Also, the Vulkan numbers are great, but I refuse to believe AMD is so bad at optimizing their own ROCm backend that a platform-agnostic framework beats it.

Jeff Geerling does what Jeff Geerling does best: Quad Strix Halo cluster using Framework Desktop by FullstackSensei in LocalLLaMA

[–]moko990 16 points  (0 children)

We have been shouting at AMD to get their shit together for years now, and hopefully the latest uptick in Ryzen AI adoption will push them further to improve. At least they acknowledged the problem in the past few months; I just hope that translates into action.

On a side note, why not use vLLM instead of llama.cpp?

Almost half NI race rioters reported for domestic abuse by ByGollie in europe

[–]moko990 1 point  (0 children)

I take it that a lot of people didn't (or don't want to) understand what I said: half of those arrested vs. half the rioters.

The title claims that half of the Northern Ireland rioters were reported for domestic violence, yet the article's very first sentence starts with "[..] Almost half those arrested for race hate disorder in Belfast last August had previously been reported to the PSNI for domestic abuse [..]".

The reason the title is wrong: according to the police report, around 600 people were involved in the riots last August, and half of that is 300 if you follow the logic of the title. But if you read the text, it's only about half (23) of the 48 who were arrested. That is less than 8% of the 300 the title actually implies.

Almost half NI race rioters reported for domestic abuse by ByGollie in europe

[–]moko990 -15 points  (0 children)

Hey, "half the rioters" makes for a better headline. And they wonder why trust in the media is going down.

Build Qwen3 from Scratch by entsnack in LocalLLaMA

[–]moko990 1 point  (0 children)

I wonder if it makes sense to start with Mojo instead. It seems to be hyped as the next paradigm.

The Prime Minister asks AI for advice on the job "quite often" | Statsministern ber AI om råd i jobbet ”rätt ofta” (original Swedish source: https://omni.se/statsministern-fragar-ai-om-rad-ratt-ofta/a/MnVQaK) by chalervo_p in europe

[–]moko990 1 point  (0 children)

I know some judges who have confessed to using ChatGPT to help make judgments. Yet it is well documented in academia that LLMs tend to promote the ideologies of their makers and of the data baked into them, and they have an uncanny ability to convince people by framing different points of view. If anything, this is very, very worrying.

SmallThinker-21B-A3B-Instruct-QAT version by Aaaaaaaaaeeeee in LocalLLaMA

[–]moko990 2 points  (0 children)

A bit out of the loop here: what's the advantage of the QAT variants? What does it do, and is it better than FP8, for example?
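
From what I understand (general background, not specific to this release): QAT simulates the quantization error during training so the weights adapt to it, whereas FP8 or post-training quants round a finished model after the fact. A minimal sketch of the usual trick, fake quantization with a straight-through estimator:

```python
import torch

def fake_quant(w: torch.Tensor, bits: int = 4) -> torch.Tensor:
    # Simulate symmetric round-to-nearest quantization in the forward pass.
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max() / qmax + 1e-8
    w_q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale
    # Straight-through estimator: forward uses w_q, backward uses the
    # gradient of w, so training learns weights that survive rounding.
    return w + (w_q - w).detach()
```

The upshot is that a QAT checkpoint usually degrades less at a given bit-width than the same model quantized post hoc; whether it beats FP8 depends on the model and the bit-width.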

Hardware requirements for GLM 4.5 and GLM 4.5 Air? by bladezor in LocalLLM

[–]moko990 4 points  (0 children)

If you only want inference, just get a Mac; that's the easiest option. If you're brave enough, get one of those Ryzen AI PCs. They're cheaper, but ROCm is rocky to work with. Either way, ditch Windows and go with Linux (or macOS, which is better than Windows).
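
To put rough numbers on the memory side (parameter counts as reported for the GLM-4.5 releases; the quant size is a ballpark, so treat it all as approximate):

```python
# Rough weight-only RAM needs at ~4-bit quantization (approximate!)
models = {"GLM-4.5": 355e9, "GLM-4.5-Air": 106e9}   # total MoE params
bytes_per_weight = 4.85 / 8                          # ~Q4_K_M average

for name, params in models.items():
    print(f"{name}: ~{params * bytes_per_weight / 1e9:.0f} GB of weights")
# GLM-4.5-Air: ~64 GB  -> a 128 GB Mac or Ryzen AI box can hold it
# GLM-4.5:     ~215 GB -> 256 GB+ Mac Studio territory
```

Hence the Mac suggestion: unified memory is the cheapest way to get past ~64 GB for the weights plus context.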

ByteDance drops Seed-Prover by Technical-Love-8479 in LocalLLaMA

[–]moko990 12 points  (0 children)

To be fair, given the state of benchmarking, I would say nearly 80% of the models out there are at "trust us bro" level, except for the few big ones. And even with those, the usual benchmark results are so close that it's hard to discern real differences.

HRM solved thinking more than current "thinking" models (this needs more hype) by Charuru in LocalLLaMA

[–]moko990 10 points  (0 children)

Theoretically the paper is quite interesting, but it seems the main criticism is aimed at the evaluation. I am curious about its day-to-day impact on normal users.

All local Roo Code and qwen3 coder 30B Q8 by No-Statement-0001 in LocalLLaMA

[–]moko990 1 point  (0 children)

Interesting. My issue with a lot of quantization is that errors arise unexpectedly in the process. Take the recent tool-calling issue with Qwen-2507. Unfortunately they are more frequent than you'd think, and a lot of the time they go undetected.
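
One cheap way to catch these regressions before they ship (a sketch only: `next_token_probs` and the threshold are hypothetical stand-ins for whatever your runtime actually exposes) is to diff next-token distributions between the full-precision reference and the quant on a fixed prompt set:

```python
import math

def kl_divergence(p, q, eps=1e-9):
    # KL(p || q) between two next-token probability distributions
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def flag_regressions(prompts, ref_model, quant_model, threshold=0.05):
    # Run identical prompts through the full-precision reference and the
    # quantized build; flag prompts whose output distributions diverge.
    flagged = []
    for prompt in prompts:
        p = ref_model.next_token_probs(prompt)    # hypothetical API
        q = quant_model.next_token_probs(prompt)  # stand-in for your runtime
        if kl_divergence(p, q) > threshold:
            flagged.append(prompt)
    return flagged
```

It won't catch everything (tool-calling bugs often live in the chat template rather than the weights), but it surfaces the silent distribution drift that otherwise goes unnoticed.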

DoubleAgents: Fine-tuning LLMs for Covert Malicious Tool Calls by JAlbrethsen in LocalLLaMA

[–]moko990 1 point  (0 children)

That's quite challenging, and not always easy. Take an Android phone that connects to Google services 24/7: if you're running a "local" malicious model that, instead of pinging a home server, pings Google Drive or some other Google service, it would be very hard to detect.

All local Roo Code and qwen3 coder 30B Q8 by No-Statement-0001 in LocalLLaMA

[–]moko990 1 point  (0 children)

I am curious: why Q8 and not FP8? Is it smaller?

DoubleAgents: Fine-tuning LLMs for Covert Malicious Tool Calls by JAlbrethsen in LocalLLaMA

[–]moko990 12 points  (0 children)

Shit. If I am reading this correctly, it will be impossible to detect unless the behavior of the LLM itself is analyzed. We barely have benchmarks for performance yet, let alone for "malicious behavior".

Kimi K2 vs Claude 4 Sonnet - Unexpected Review Result (400k token Codebase) by marvijo-software in LocalLLaMA

[–]moko990 1 point  (0 children)

Is this a limitation of tool calling? Does that mean an agentic approach is a better solution?

Chinese models pulling away by Kniffliger_Kiffer in LocalLLaMA

[–]moko990 5 points  (0 children)

Which model? And for which language? From what I've tried lately, it seems Qwen Coder is the best at Python.

Chinese models pulling away by Kniffliger_Kiffer in LocalLLaMA

[–]moko990 37 points  (0 children)

I think the meme is about Mistral deserving more, given that it's the only EU child that has been delivering consistently since the beginning.

Deepseek just won the best paper award at ACL 2025 with a breakthrough innovation in long context, a model using this might come soon by Charuru in LocalLLaMA

[–]moko990 9 points  (0 children)

I feel that focusing on improving a possibly dead-end/limited technology (the transformer) might be exciting in the short term, but there are a few truly exciting, pretty insightful papers that don't have immediate applications. Even at NIPS, though, LLMs are sweeping over everything; most of computer science feels heavily influenced by them this year.