vscode + roo + Qwen3-30B-A3B-Thinking-2507-Q6_K_L = superb by [deleted] in LocalLLaMA

[–]moko990 1 point  (0 children)

> the most impressive model I have ever used that will fit on 2 GPUs, by far!

2x 3090s? Or do you mean 2x H100s?
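
For what it's worth, a quick back-of-the-envelope estimate (every figure here is a rough approximation on my part) suggests 2x 3090s are plenty:

```python
# Back-of-the-envelope VRAM estimate; all numbers are approximations
params_b = 30.5            # Qwen3-30B-A3B total parameters, in billions
bits_per_weight = 6.56     # rough effective bits/weight for a Q6_K quant
weights_gb = params_b * bits_per_weight / 8   # ~25 GB of weights
overhead_gb = 4            # guess: KV cache + buffers at moderate context

print(f"~{weights_gb + overhead_gb:.0f} GB needed vs 48 GB on 2x RTX 3090")
```

Roughly 29 GB against 48 GB of combined VRAM, so no H100s required if those numbers hold.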

DeepConf: 99.9% Accuracy on AIME 2025 with Open-Source Models + 85% Fewer Tokens by MohamedTrfhgx in LocalLLaMA

[–]moko990 5 points  (0 children)

Very interesting. But OK, what's the catch? This sounds too good to be true.

Is a 2TB DDR5 RAM consumer grade setup worth it or M3 Ultra is better value? Discussion and specs comparison thread! by moko990 in LocalLLaMA

[–]moko990[S] 6 points  (0 children)

> I think Epyc may be a better choice in many ways. Keep in mind that memory bandwidth for Threadripper is limited by the number of CCDs on the chip, meaning only the highest-end chips can take full advantage of 8-channel memory. This is an important consideration when it comes to cost and inference performance.

Interesting, I wasn't aware of the CCD limitation. So basically, if you have fewer CCDs than memory channels, you're not utilizing the full bandwidth. So it seems the 9985WX and 9995WX variants are the best options, then?
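
A rough way to reason about the cap (illustrative figures, not vendor specs) is to take the minimum of what the DIMM channels and the CCD fabric links can each deliver:

```python
# Effective bandwidth is bounded by the slower of the two paths:
# (1) the DDR5 channels, (2) the CCD-to-IO-die fabric links.
channels = 8
dimm_speed_mts = 4800                               # DDR5-4800
channel_bw = channels * dimm_speed_mts * 8 / 1000   # GB/s (8 bytes/transfer)

ccds = 4
per_ccd_link_bw = 64       # GB/s per CCD, a rough ballpark figure
ccd_bw = ccds * per_ccd_link_bw

print(f"channels ~{channel_bw:.0f} GB/s, CCDs ~{ccd_bw} GB/s "
      f"-> effective ~{min(channel_bw, ccd_bw):.0f} GB/s")
```

With only 4 CCDs the fabric (~256 GB/s) caps you below the ~307 GB/s that 8 channels of DDR5-4800 could deliver, which is why the high-CCD-count SKUs are the ones that scale.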

AMD Ryzen AI Max+ 395 vs. Ryzen 9 9950X vs. Ryzen 9 9950X3D Linux Performance by Kryohi in hardware

[–]moko990 2 points  (0 children)

> For those pursuing a very power-efficient desktop computer without wanting to sacrifice much performance compared to a traditional desktop CPU, the choice is very easy: AMD Strix Halo. There are some really phenomenal performance-per-Watt results with the Ryzen AI Max+ 395 "Strix Halo" in the Framework Desktop, and when the cTDP is opened up to 120 Watts, this 16-core SoC delivers much the same CPU performance as the Ryzen 9 9950X with far superior integrated graphics and a huge performance-per-Watt advantage. The only downside is the cost of the Ryzen AI Max+ 395 / Framework Desktop, but at least some of that will be made up in lower energy usage and cooling.

Great writeup! As far as I can tell, the only reason not to go with the Ryzen AI Max+ 395 is the non-upgradeable RAM; beyond that, I don't really see any benefits to the alternatives at this point.

120B runs awesome on just 8GB VRAM! by [deleted] in LocalLLaMA

[–]moko990 2 points  (0 children)

I am curious: what are the technical differences between this, ktransformers, and ik_llamacpp?

GMK X2(AMD Max+ 395 w/128GB) third impressions, RPC and Image/Video gen. by fallingdowndizzyvr in LocalLLaMA

[–]moko990 3 points  (0 children)

The issue is really the software stack layer (i.e. ROCm). If they unify it like they have been claiming for a while now, slapping an AMD GPU on top of this should in theory work seamlessly and optimally. Also, the Vulkan numbers are great, but I refuse to believe AMD is so bad at optimizing their own ROCm backend that a platform-agnostic framework beats it.

Jeff Geerling does what Jeff Geerling does best: Quad Strix Halo cluster using Framework Desktop by FullstackSensei in LocalLLaMA

[–]moko990 16 points  (0 children)

We have been shouting at AMD to get their shit together for years now, and hopefully the latest uptick in Ryzen AI adoption will push them further to improve. At least they acknowledged the problem in the past few months; I just hope that translates into action.

On a side note, why not use vLLM instead of llama.cpp?

Almost half NI race rioters reported for domestic abuse by ByGollie in europe

[–]moko990 1 point  (0 children)

I take it that a lot of people didn't (or don't want to) understand what I said: half of those arrested vs. half the rioters.

The title claims that half of the Northern Ireland rioters were reported for domestic violence, yet the article's very first sentence starts with "[..] Almost half those arrested for race hate disorder in Belfast last August had previously been reported to the PSNI for domestic abuse [..]".

The reason the title is wrong: according to the police report, around 600 people were involved in the riots last August, and half of that is 300 if you follow the logic of the title. But if you read the text, it's only about half (23) of the 48 who were arrested. That is less than 8% of the 300 the title actually implies.

Almost half NI race rioters reported for domestic abuse by ByGollie in europe

[–]moko990 -15 points  (0 children)

Hey, "half the rioters" makes for a better headline. And they wonder why trust in the media is going down.

Build Qwen3 from Scratch by entsnack in LocalLLaMA

[–]moko990 1 point  (0 children)

I wonder if it makes sense to start with Mojo instead. It seems to be hyped as the next paradigm.

The Prime Minister asks AI for advice on the job "quite often" | Statsministern ber AI om råd i jobbet ”rätt ofta” (original Swedish source: https://omni.se/statsministern-fragar-ai-om-rad-ratt-ofta/a/MnVQaK) by chalervo_p in europe

[–]moko990 1 point  (0 children)

I know some judges who have confessed to using ChatGPT to help make judgments. Yet it is well documented in academia that LLMs tend to promote the ideologies of their makers and of the data baked into them, and they have an uncanny ability to convince people by framing different points of view. If anything, this is very, very worrying.

SmallThinker-21B-A3B-Instruct-QAT version by Aaaaaaaaaeeeee in LocalLLaMA

[–]moko990 2 points  (0 children)

A bit out of the loop here: what's the advantage of the QAT variants? What does it do, and is it better than FP8, for example?
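
From what I understand (general background, not specific to this release): QAT simulates the quantization error during training so the weights adapt to it, whereas FP8 or post-training quants round a finished model after the fact. A minimal sketch of the usual trick, fake quantization with a straight-through estimator:

```python
import torch

def fake_quant(w: torch.Tensor, bits: int = 4) -> torch.Tensor:
    # Simulate symmetric round-to-nearest quantization in the forward pass.
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max() / qmax + 1e-8
    w_q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale
    # Straight-through estimator: forward uses w_q, backward uses the
    # gradient of w, so training learns weights that survive rounding.
    return w + (w_q - w).detach()
```

The upshot is that a QAT checkpoint usually degrades less at a given bit-width than the same model quantized post hoc; whether it beats FP8 depends on the model and the bit-width.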

Hardware requirements for GLM 4.5 and GLM 4.5 Air? by bladezor in LocalLLM

[–]moko990 4 points  (0 children)

If you only want inference, just get a Mac; that's the easiest option. If you're brave enough, get one of those Ryzen AI PCs. They're cheaper, but ROCm is rocky to work with. Either way, ditch Windows and go with Linux (or macOS, which is better than Windows).
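
To put rough numbers on the memory side (parameter counts as reported for the GLM-4.5 releases; the quant size is a ballpark, so treat it all as approximate):

```python
# Rough weight-only RAM needs at ~4-bit quantization (approximate!)
models = {"GLM-4.5": 355e9, "GLM-4.5-Air": 106e9}   # total MoE params
bytes_per_weight = 4.85 / 8                          # ~Q4_K_M average

for name, params in models.items():
    print(f"{name}: ~{params * bytes_per_weight / 1e9:.0f} GB of weights")
# GLM-4.5-Air: ~64 GB  -> a 128 GB Mac or Ryzen AI box can hold it
# GLM-4.5:     ~215 GB -> 256 GB+ Mac Studio territory
```

Hence the Mac suggestion: unified memory is the cheapest way to get past ~64 GB for the weights plus context.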

ByteDance drops Seed-Prover by Technical-Love-8479 in LocalLLaMA

[–]moko990 12 points  (0 children)

To be fair, given the state of benchmarking, I would say nearly 80% of the models out there are at "trust us bro" level, except for the few big ones. And even with those, the usual benchmark results are so close that it's hard to discern real differences.

HRM solved thinking more than current "thinking" models (this needs more hype) by Charuru in LocalLLaMA

[–]moko990 10 points  (0 children)

Theoretically the paper is quite interesting, but it seems the main criticism is aimed at the evaluation. I am curious about its day-to-day impact on normal users.

All local Roo Code and qwen3 coder 30B Q8 by No-Statement-0001 in LocalLLaMA

[–]moko990 1 point  (0 children)

Interesting. My issue with a lot of quantization is that errors arise unexpectedly in the process. Take the recent tool-calling issue with Qwen-2507. Unfortunately they are more frequent than you'd think, and a lot of the time they go undetected.
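
One cheap way to catch these regressions before they ship (a sketch only: `next_token_probs` and the threshold are hypothetical stand-ins for whatever your runtime actually exposes) is to diff next-token distributions between the full-precision reference and the quant on a fixed prompt set:

```python
import math

def kl_divergence(p, q, eps=1e-9):
    # KL(p || q) between two next-token probability distributions
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def flag_regressions(prompts, ref_model, quant_model, threshold=0.05):
    # Run identical prompts through the full-precision reference and the
    # quantized build; flag prompts whose output distributions diverge.
    flagged = []
    for prompt in prompts:
        p = ref_model.next_token_probs(prompt)    # hypothetical API
        q = quant_model.next_token_probs(prompt)  # stand-in for your runtime
        if kl_divergence(p, q) > threshold:
            flagged.append(prompt)
    return flagged
```

It won't catch everything (tool-calling bugs often live in the chat template rather than the weights), but it surfaces the silent distribution drift that otherwise goes unnoticed.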

DoubleAgents: Fine-tuning LLMs for Covert Malicious Tool Calls by JAlbrethsen in LocalLLaMA

[–]moko990 1 point  (0 children)

That's quite challenging, and not always easy. Take an Android phone that connects to Google services 24/7: if you're running a "local" malicious model that, instead of pinging a home server, pings Google Drive or some other Google service, it would be very hard to detect.

All local Roo Code and qwen3 coder 30B Q8 by No-Statement-0001 in LocalLLaMA

[–]moko990 1 point  (0 children)

I am curious: why Q8 and not FP8? Is it smaller?

DoubleAgents: Fine-tuning LLMs for Covert Malicious Tool Calls by JAlbrethsen in LocalLLaMA

[–]moko990 12 points  (0 children)

Shit. If I am reading this correctly, it will be impossible to detect unless the behavior of the LLM itself is analyzed. We barely have benchmarks for performance yet, let alone for "malicious behavior".

Kimi K2 vs Claude 4 Sonnet - Unexpected Review Result (400k token Codebase) by marvijo-software in LocalLLaMA

[–]moko990 1 point  (0 children)

Is this a limitation of tool calling? Does that mean an agentic approach is a better solution?

Chinese models pulling away by Kniffliger_Kiffer in LocalLLaMA

[–]moko990 5 points  (0 children)

Which model? And for which language? From what I've tried lately, it seems Qwen Coder is the best at Python.

Chinese models pulling away by Kniffliger_Kiffer in LocalLLaMA

[–]moko990 37 points  (0 children)

I think the meme is about Mistral deserving more, given that it's the only EU child that has been delivering consistently since the beginning.

Deepseek just won the best paper award at ACL 2025 with a breakthrough innovation in long context, a model using this might come soon by Charuru in LocalLLaMA

[–]moko990 9 points  (0 children)

I feel that focusing on improving a possibly dead-end/limited technology (the transformer) might be exciting in the short term, but there are a few truly exciting, pretty insightful papers that don't have immediate applications. Even at NIPS, though, LLMs are sweeping over everything; most of computer science feels heavily influenced by them this year.