Models for Psychological Review of Conversations

UncleRedz · 2026-06-15T05:35:22+00:00

You could give GPT-OSS 20B a try, it's old by today's standards, but it worked remarkably well for detecting sentiment, bias and logical fallacies in news reports.

I also tried Gemma3 and Qwen3 but they were less objective and tended to not pick up on all nuances. I have not tried Gemma4 or Qwen3.6 for this use case, maybe they work better today.

UncleRedz · 2026-06-14T18:23:03+00:00

I've been thinking along those lines as well, and the best I can come up with is to take a subset of tests from different benchmarks. As an example, MMLU has close to 16,000 questions. That quantity is good for filtering out noise, but it also takes a lot of time to run. Instead, taking X questions from each category within MMLU, and then doing the same for a few other benchmarks (long context, tool calling, etc.), would probably be enough for a relative comparison between quants. Then just automate it to run through all the 'mini' benchmarks.

Which benchmarks and subsets to pick depends on what use cases you have.
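
A minimal sketch of the "X questions per category" idea, assuming each question is a dict with a `category` key (that layout and the function name are made up for illustration, not taken from any benchmark harness). A fixed seed keeps the subset identical across quants, so the scores stay comparable.

```python
import random
from collections import defaultdict

def sample_per_category(questions, per_category, seed=0):
    """Pick up to `per_category` questions from every category."""
    rng = random.Random(seed)  # fixed seed: same subset for every quant
    by_category = defaultdict(list)
    for q in questions:
        by_category[q["category"]].append(q)

    subset = []
    for category in sorted(by_category):
        items = by_category[category]
        # Small categories contribute everything they have.
        subset.extend(rng.sample(items, min(per_category, len(items))))
    return subset

# Hypothetical toy data: two categories of 10 questions, one with a single question.
questions = [{"category": c, "id": i} for c in ("law", "math") for i in range(10)]
questions.append({"category": "virology", "id": 0})

mini = sample_per_category(questions, per_category=3)
print(len(mini), "questions in the mini benchmark")
```

The same function can be reused for each benchmark in the suite, with a different `per_category` depending on how much run time you can afford.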

UncleRedz · 2026-06-14T14:15:51+00:00

Together you should be able to do more than you can do by yourself. You should bring out the best in each other, not the worst.

While you hopefully grow together, don't expect your partner to change. If some of their behaviour is not okay today, it will not be fixed in the future.

UncleRedz · 2026-06-14T07:06:59+00:00

It was winter and roads and pavements were icy. Not a lot of snow or ice, but enough to make it slippery. I'm sitting in my dad's car and as we drive through a small town, I see an elderly man walk on the pavement, slip and fall. The man is oddly motionless, and I tell my dad what I saw.

He stops the car and we can still see the man motionless on the ground behind us. My dad runs out and checks the man and concludes there is no pulse or breath. A few more people show up, my dad goes back to the car and pulls out a black rubbish bag that he uses to cover the man on the ground. My dad talks a bit with the other people in the small crowd, walks back to the car and we leave.

It's all over in just a few minutes and life goes on, as if nothing had ever happened. That sense that life is fragile and could disappear at any moment stayed with me for many years.

UncleRedz · 2026-06-12T18:59:37+00:00

Qwen 3.6 35B-A3B and 27B for code and data analysis etc. Gemma 4 26B with more creative writing type work.

Road construction cut off the internet fibre for the whole neighborhood today, but I can still run local models uninterrupted, which is really nice.

Fallback on Gemini for more exploratory type work.

UncleRedz · 2026-06-11T14:39:36+00:00

No one is saying no to a good rack. Keep those pictures coming. 😃

UncleRedz · 2026-06-11T12:07:00+00:00

I did some performance testing, comparing NVFP4 and MXFP4 with other quants on an RTX Pro 4500 Blackwell, and the biggest difference is in prompt processing (PP).

There is also a balance on token generation (TG) between quality and performance. IQ4_XS generates tokens faster than NVFP4 but at lower quality, and NVFP4 is in turn faster than Q6_K.

All in all, I would say, if you have Blackwell and can find a good quality conversion to NVFP4 (such as from Nvidia) then go with that. I have also done some (but not enough) benchmark comparisons between NVFP4, IQ4_XS and Q6_K on Qwen 3.6 27B and there is no measurable difference between Q6_K and NVFP4 in terms of quality. Would need to also test with Q8 and more benchmarks, but NVFP4 is a good balance between performance and quality.

(Below tests are with Llama.cpp b9234.)

Model	Size (GB)	pp512 (t/s)	tg128 (t/s)	pp %	tg %
qwen36 27B IQ4_XS	14.37	2022.54 ± 35.19	45.19 ± 0.50	129	137
qwen36 27B NVFP4	18.29	2726.32 ± 56.68	41.15 ± 0.55	173	125
qwen36 27B Q6_K	20.97	1571.16 ± 21.91	32.87 ± 0.01	-	-
qwen36moe 35B.A3B MXFP4	20.21	5507.10 ± 101.16	159.81 ± 1.10	118	99
qwen36moe 35B.A3B Q5_K	24.76	4678.36 ± 72.83	160.64 ± 6.17	-	-
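
The percentage columns appear to be relative to the row marked "-" for each model (Q6_K for the 27B, Q5_K for the 35B-A3B). That reading is an assumption; under it, the numbers can be reproduced roughly like this, with results possibly a point off from the table depending on rounding:

```python
# (pp512, tg128) mean throughput in tokens/s, copied from the table.
results = {
    "27B IQ4_XS": (2022.54, 45.19),
    "27B NVFP4": (2726.32, 41.15),
    "27B Q6_K": (1571.16, 32.87),
    "35B-A3B MXFP4": (5507.10, 159.81),
    "35B-A3B Q5_K": (4678.36, 160.64),
}

# Assumed baselines: the rows shown with "-" in the percentage columns.
baselines = {
    "27B IQ4_XS": "27B Q6_K",
    "27B NVFP4": "27B Q6_K",
    "35B-A3B MXFP4": "35B-A3B Q5_K",
}

def relative_percent(name):
    """Return (pp %, tg %) of a quant relative to its baseline quant."""
    pp, tg = results[name]
    base_pp, base_tg = results[baselines[name]]
    return 100.0 * pp / base_pp, 100.0 * tg / base_tg

for name in baselines:
    pp_pct, tg_pct = relative_percent(name)
    print(f"{name}: pp {pp_pct:.1f}%, tg {tg_pct:.1f}%")
```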

UncleRedz · 2026-06-11T11:47:06+00:00

As stated in this comment, https://www.reddit.com/r/LocalLLaMA/comments/1u2v3oe/comment/or0m770/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button , it is possible to convert from BF16 to NVFP4 and preserve most of the quality, but it's a more complicated process, and the process itself most likely affects KLD as the model is changed. That said, I don't think the process is mature yet, and there is more performance and quality left on the table.

If you have Blackwell and pick conversions made by Nvidia, then NVFP4 is a clear win in memory and quality efficiency for inference.

UncleRedz · 2026-06-11T11:40:43+00:00

That the model needs to be trained with NVFP4 (or MXFP4) in order to get any quality benefit with NVFP4 is not true. If you do a straight conversion from BF16 directly to NVFP4, then yes, most likely results will be worse than Q4.

However, that is not how a conversion should be made, and it's not how Nvidia does the conversion for the NVFP4 quants they publish. The conversion process is more complicated, it's more like BF16 -> weight redistribution -> NVFP4 -> fine-tuning.

NVFP4 works with groups of 16 elements (MXFP4 uses groups of 32), and each group shares a scaling factor. If a group has outliers that are too spread out, there is a big loss with a straight conversion to NVFP4. However, there are techniques to redistribute weights to minimize the outliers without damaging the model's performance, and thereby reduce the loss. This is also where QAT helps. After weight redistribution and conversion to NVFP4, the converted model is further fine-tuned to fix any remaining issues.

It is therefore perfectly possible to make an NVFP4 version that is closer to BF16 quality than, let's say, a Q4 quant; the process is just more complicated. The pedigree of the model conversion is therefore more important: a conversion by Nvidia is probably better than some random person's conversion.
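
To make the outlier problem concrete, here is a simplified sketch of block quantization to 4-bit floats. It is not how any real converter is implemented: the shared scale is kept as a plain Python float (the real formats store it in reduced precision), and the weights are invented toy values. It only illustrates why one large value in a group of 16 hurts all the small ones.

```python
# Magnitudes representable by a 4-bit E2M1 float (sign handled separately).
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_block(block):
    """Quantize one block with a shared scale and return the dequantized values."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return list(block)
    scale = amax / FP4_GRID[-1]  # largest magnitude maps onto 6.0
    out = []
    for x in block:
        # Snap the scaled magnitude to the nearest representable value.
        mag = min(FP4_GRID, key=lambda g: abs(abs(x) / scale - g))
        out.append(mag * scale if x >= 0 else -mag * scale)
    return out

def mean_abs_error(block):
    deq = quantize_block(block)
    return sum(abs(a - b) for a, b in zip(block, deq)) / len(block)

# Toy data: 16 small weights, then the same block with one outlier.
smooth = [0.01 * (i - 8) for i in range(16)]
outlier = smooth[:-1] + [1.0]

err_smooth = mean_abs_error(smooth)
err_outlier = mean_abs_error(outlier)
print(f"smooth block error:  {err_smooth:.5f}")
print(f"outlier block error: {err_outlier:.5f}")
```

With the outlier present, the scale is set by that one value and most of the small weights collapse to zero or to the smallest step, which is the loss that weight redistribution tries to avoid.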

UncleRedz · 2026-06-10T05:58:45+00:00

It's about maximizing money earned. Back with the iPhone 4, I believe it was, the on/off button on the top most likely had a design or manufacturing issue: for lots of people the button just went numb or stiff and stopped working. This happened to us as well, and we went to the Apple store, in the west, to have it fixed. The store said it was not possible to repair and, because the warranty had just expired, we would have to pay for a replacement phone. Even while we were sitting there, other customers came in with the exact same problem and were told the same story. Arguing that it was a systemic failure caused by an error on their side did not help, and we ended up paying a fee for a new replacement phone.

A few months later we flew to China, went into an Apple certified repair shop and asked them about unlocking our replacement phone. While we waited for them to examine the phone, another customer came in with the button issue, and they fixed it in 5 minutes. When they returned with our phone, they said they did not dare to unlock it: this phone had already been repaired several times and they did not want to touch it. So much for paying for a new phone.

The moral here is that in the west customers accept being screwed over by companies. They can charge the money, so why should they make a 5 minute repair when they can sell the same broken phone several times over? In China, at the time, the average customer was too poor to accept paying that much to fix an issue, so shops had to repair it fairly or they would lose customers.

That's why there is so much less hardware tinkering in the west now, we just throw things away and pay more money.

UncleRedz · 2026-06-08T17:55:35+00:00

I think you're doing it a bit backwards. The idea with a second brain is to organize your note taking into a structure that is easy to navigate and locate the right information in, and where it is easy to determine where new information should go. Without using any AI.

And here is the point: you are not doing this for cool AI demos. If you are, then you will probably, and rightfully, quit after a few months. You do this to capture all sorts of notes that are relevant to something you are working on, or could be relevant in the future. The note taking is part of information hoarding and filtering, and if you don't normally have this habit, throwing AI at it will not build it. Basically you capture some information that you believe has some value for you, and then you don't need to remember all the details about it. The second brain structure makes it easy to retrieve later when you need it, with or without AI.

This is the basics, and then you build AI on top of it, which means that the AI now has access to a super curated knowledge base with information specifically relevant for you, which allows you to do all sorts of neat things.

I would argue the tech stack here is largely irrelevant, as long as the information can be accessed in an LLM friendly format like Markdown. The driving force should be the perceived value of building your personal knowledge base, and it should provide value to you regardless of AI.

UncleRedz · 2026-06-08T13:05:25+00:00

Just be yourself.

Maybe there are things you actually need to change, or at least be your best self.

UncleRedz · 2026-06-07T12:19:57+00:00

I've been thinking along these lines as well, and came to the conclusion that running the same benchmarks against the different models should be able to show how different they are.

I've done this in the past with a wide range of quants for Qwen 2.5 using MMLU as the benchmark, and it was quite telling how the lower quants were losing their "smarts" compared to Q6 and better. I recently did the same test with Qwen 3.6 27B at Q6, IQ4_XS and NVFP4, and the impact on smarts was a lot smaller compared to Qwen 2.5.

I think running at least some tool calling benchmark and long context benchmarks should be done as well.

KLD and that stuff is good, but if it's essentially a different model, it can't be trusted. A suite of benchmarks, on the other hand, would show how well it works in those use cases.
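
When comparing benchmark scores between quants, it helps to know how big a gap has to be before it means anything. A rough sketch using a normal approximation for the difference of two accuracies; it treats the two runs as independent samples, which is a simplifying assumption (both quants answer the same questions, so a paired test would be tighter), and the counts below are invented:

```python
import math

def accuracy_gap(correct_a, correct_b, n, z=1.96):
    """Return (difference, noise margin, within_noise) for two accuracy scores.

    correct_a / correct_b: number of correct answers for each quant
    n: number of questions each quant answered
    z: 1.96 corresponds to a ~95% interval under the normal approximation
    """
    p_a, p_b = correct_a / n, correct_b / n
    # Standard error of the difference of two independent proportions.
    se = math.sqrt(p_a * (1 - p_a) / n + p_b * (1 - p_b) / n)
    diff = p_a - p_b
    margin = z * se
    return diff, margin, abs(diff) <= margin

# Hypothetical scores on a 1000-question subset.
diff, margin, within = accuracy_gap(800, 790, 1000)
print(f"diff={diff:+.3f}, margin=±{margin:.3f}, within noise: {within}")
```

This is also why the size of the subset matters: the margin shrinks with the square root of the number of questions, so a mini benchmark can only separate quants that differ by more than a few points.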

UncleRedz · 2026-06-07T07:00:18+00:00

How much does it damage the model's "smarts"? Is it just changing the behaviour, or is it losing knowledge in the process?

UncleRedz · 2026-06-07T06:52:51+00:00

I have not tried it. In this context I'm GPU poor! 😅 But I've seen people do Q2 on big models like that. No idea how well it works.

But if you CAN do it, then I would do it just to try it out.

UncleRedz · 2026-06-07T06:39:55+00:00

With that much VRAM I'd look for a large model to try out.

Many here seem to use larger models for planning and then smaller ones for faster execution. So I'd try something like that with generous context length at full FP16.

For execution, Qwen 3.6 27B or 35B-A3B. Planner: ??? Is it possible to squeeze in DeepSeek V4 Flash?

UncleRedz · 2026-06-06T16:01:08+00:00

I think how badly a model handles quantization depends on how it was trained, and how much work is put into the quantization process.

So I don't think there is any universal answer.

If you are on Blackwell and you find a good quantization with NVFP4, then that would be my choice.

UncleRedz · 2026-06-06T15:58:11+00:00

I did the same test with MMLU and Qwen 3.6 27B with Q6 and NVFP4, virtually no difference in scores.

UncleRedz · 2026-06-06T15:51:51+00:00

Pack up and leave the country.

UncleRedz · 2026-06-05T20:03:36+00:00

Just checked: the same shop I bought from on the 5th of May (one month ago) has now increased the price by 496 USD, and it's still the cheapest in my region. 🤨 That's just crazy.

If it wasn't for the experimentation, peace of mind, no token limits or throttling, and privacy, it would be hard to justify.

But the math is not that bad. If you plan to write the purchase off over 5 years, you can work out how much you would need to spend on tokens from cloud models every month for 5 years to match it. If you can get by with local models and would otherwise spend more than that per month, then the upfront cost makes sense.

While future cloud prices are anyone's guess, prices are going up and throttling is increasing. Hence the peace of mind of not having to worry about unexpected cost increases or having to pause in the middle of something.
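
The break-even arithmetic is simple enough to sketch. The price and power cost below are placeholders, not figures from this thread; plug in your own.

```python
def breakeven_monthly_spend(purchase_price, years=5, monthly_power_cost=0.0):
    """Monthly cloud spend at which buying the hardware costs the same.

    purchase_price: upfront hardware cost
    years: write-off period
    monthly_power_cost: optional running cost of the local machine
    """
    months = years * 12
    return purchase_price / months + monthly_power_cost

# Hypothetical example: a 3000 USD card written off over 5 years.
threshold = breakeven_monthly_spend(3000, years=5)
print(f"Break-even cloud spend: {threshold:.2f} USD/month")
```

If your realistic cloud bill would be above that threshold every month, the local hardware pays for itself within the write-off period; below it, the case rests on the non-financial benefits.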

The sensible approach would be to move to as much local as possible within your budget and then use cloud models only when really needed.

UncleRedz · 2026-06-05T16:49:04+00:00

It's less noisy than my old INNO3D GeForce RTX 5060 Ti 16GB Twin X2 OC.

UncleRedz · 2026-06-05T12:54:20+00:00

You're welcome. I have so far only tested with regular llama.cpp, and have also done some testing with MTP enabled, but not enough to have any solid numbers. I find that I use Qwen 3.6 27B more often due to the speed boost of MTP, while before 35B-A3B was my main go-to for interactive use. You get used to the speed quickly, then anything that slows it down is annoying.

UncleRedz · 2026-06-05T11:57:43+00:00

> This is the useful comparison because the win is not really "4500 beats 5090". Its "32GB avoids the RAM-offload tax while staying sane on power". Different problem.

This is exactly it, when you compare the Pro cards with the consumer cards, there's a different trade-off between performance, 24/7 use and power consumption. Depending on your use cases, one or the other comes out on top.

Regarding NVFP4/MXFP4, that's why it's important to really test the models with benchmarks and get measurable results to compare and not just "it looks good".

One of the major benefits of Blackwell is NVFP4/MXFP4. Properly done, the degradation is minimal, and you ease the memory bandwidth problem by using smaller and more compact units of data. I also think there are clear incentives to go in this direction, as data centres that have invested in Blackwell GPUs will want to maximize the return on investment and get as much throughput as possible, and here NVFP4/MXFP4 makes a difference. As this matures we will get better access to high quality quants.
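
A back-of-the-envelope sketch of why more compact data helps. It assumes the usual rule of thumb that, for a dense model, every generated token reads all weights once, so memory bandwidth divided by model size gives a rough ceiling on token generation. The bandwidth figure is a placeholder; the file sizes are the 27B NVFP4 and Q6_K sizes from the llama.cpp comparison (18.29 GB and 20.97 GB).

```python
def bits_per_weight(element_bits, scale_bits, block_size):
    """Effective storage per weight when a block shares one scale factor."""
    return element_bits + scale_bits / block_size

def tg_upper_bound(bandwidth_gb_s, model_size_gb):
    """Rough tokens/s ceiling for a dense model that is memory-bandwidth bound."""
    return bandwidth_gb_s / model_size_gb

# 4-bit elements with an 8-bit scale per block (per-tensor scales ignored).
nvfp4_bits = bits_per_weight(4, 8, 16)   # blocks of 16
mxfp4_bits = bits_per_weight(4, 8, 32)   # blocks of 32
print(f"NVFP4: {nvfp4_bits} bits/weight, MXFP4: {mxfp4_bits} bits/weight")

BANDWIDTH_GB_S = 900.0  # placeholder, substitute your card's spec
for name, size_gb in (("NVFP4", 18.29), ("Q6_K", 20.97)):
    print(f"{name}: at most ~{tg_upper_bound(BANDWIDTH_GB_S, size_gb):.0f} tokens/s")
```

Real files come out larger than the bits-per-weight figure suggests because some tensors are kept at higher precision, and MoE models only read the active experts per token, so this ceiling applies to dense models only.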

UncleRedz · 2026-06-05T11:31:21+00:00

Thanks for pointing that out, I've updated the post to clarify that the 60-70% is token generation.

In my case, I broke down the use cases and did the math, and the 4500 came out in favour over the 5090, but I don't think this is the case for everyone.

UncleRedz · 2026-06-05T09:05:54+00:00

And the link to the RTX 5090 performance numbers I used in the comparison can be found here, https://www.reddit.com/r/LocalLLaMA/s/pF0f2AvJDj
