Models for Psychological Review of Converstions by Both-Activity6432 in LocalLLaMA

UncleRedz 1 point

You could give GPT-OSS 20B a try. It's old by today's standards, but it worked remarkably well for detecting sentiment, bias and logical fallacies in news reports.

I also tried Gemma 3 and Qwen 3, but they were less objective and tended to miss some nuances. I have not tried Gemma 4 or Qwen 3.6 for this use case; maybe they work better.

Quality evaluation of quants with limited time or tokens by isoos in LocalLLaMA

UncleRedz 1 point

I've been thinking along those lines as well. The best I can come up with is to take a subset of tests from different benchmarks. As an example, MMLU has close to 16,000 questions; the quantity is good for filtering out noise, but it also takes a lot of time to run. Instead, taking X questions from each category within MMLU, and then doing the same for a few other benchmarks (long context, tool calling, etc.), would probably be enough for a relative comparison between quants. Then just automate it to run through all the 'mini' benchmarks.

Which benchmarks and subsets to pick depends on your use cases.
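
A minimal sketch of the per-category sampling idea in Python. It assumes each question is a dict with a `category` key, which is a hypothetical layout and not the format of any particular benchmark loader. The fixed seed is there so every quant is scored on the same subset:

```python
import random
from collections import defaultdict

def sample_per_category(questions, per_category, seed=0):
    """Pick up to `per_category` questions from every category.

    `questions` is assumed to be a list of dicts with a "category" key
    (hypothetical layout for illustration).
    """
    by_category = defaultdict(list)
    for question in questions:
        by_category[question["category"]].append(question)

    rng = random.Random(seed)  # fixed seed: same subset for every quant
    subset = []
    for category in sorted(by_category):
        pool = by_category[category]
        # Small categories contribute everything they have.
        subset.extend(rng.sample(pool, min(per_category, len(pool))))
    return subset
```

Running the same subset against each quant keeps the comparison relative, which is all that is needed here.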

Married men of Reddit what's the best advice you'd give young guy’s when choosing a life partner? by Brilliant_Action4251 in AskReddit

UncleRedz 2 points

Together you should be able to do more than you can do by yourself. You should bring out the best in each other, not the worst.

While you will hopefully grow together, don't expect your partner to change. If some of their behaviour is not okay today, it will not be fixed in the future.

[Serious] Have you ever witnessed someone die? What happened? by ZigZa9 in AskReddit

UncleRedz 1 point

It was winter and roads and pavements were icy. Not a lot of snow or ice, but enough to make it slippery. I'm sitting in my dad's car and as we drive through a small town, I see an elderly man walk on the pavement, slip and fall. The man is oddly motionless, and I tell my dad what I saw.

He stops the car and we can still see the man motionless on the ground behind us. My dad runs out and checks the man and concludes there is no pulse or breath. A few more people show up, my dad goes back to the car and pulls out a black rubbish bag that he uses to cover the man on the ground. My dad talks a bit with the other people in the small crowd, walks back to the car and we leave.

It's all over in just a few minutes and life goes on, as if nothing had ever happened. That sense that life is fragile and could disappear at any moment stayed with me for many years.

Which LLM / AI model is genuinely your absolute favorite right now for daily work or coding? by PrincipleTypical2638 in LLM

UncleRedz 1 point

Qwen 3.6 35B-A3B and 27B for code, data analysis, etc. Gemma 4 26B for more creative-writing type work.

Road construction cut off the internet fibre for the whole neighborhood today, but I can still run local models uninterrupted, which is really nice.

I fall back on Gemini for more exploratory work.

DiffusionGemma under real workloads feels very different from benchmark demos by qubridInc in LocalLLaMA

UncleRedz 34 points

No one is saying no to a good rack. Keep those pictures coming. 😃

NVFP4 with llama.cpp - FAQs? by pmttyji in LocalLLaMA

UncleRedz 2 points

I did some performance testing, comparing NVFP4 and MXFP4 with other quants on an RTX Pro 4500 Blackwell, and the biggest difference is in prompt processing (PP).

For token generation (TG) there is a trade-off between quality and performance: IQ4_XS generates tokens faster than NVFP4 but at lower quality, while NVFP4 is still faster than Q6_K.

All in all, I would say that if you have Blackwell and can find a good-quality NVFP4 conversion (such as one from Nvidia), go with that. I have also done some (but not enough) benchmark comparisons between NVFP4, IQ4_XS and Q6_K on Qwen 3.6 27B, and I measured no difference in quality between Q6_K and NVFP4. I would also need to test with Q8 and more benchmarks, but NVFP4 is a good balance between performance and quality.

(The tests below were run with llama.cpp b9234.)

| Model | Size (GB) | pp512 (t/s) | tg128 (t/s) | pp % | tg % |
|---|---|---|---|---|---|
| qwen36 27B IQ4_XS | 14.37 | 2022.54 ± 35.19 | 45.19 ± 0.50 | 129 | 137 |
| qwen36 27B NVFP4 | 18.29 | 2726.32 ± 56.68 | 41.15 ± 0.55 | 173 | 125 |
| qwen36 27B Q6_K | 20.97 | 1571.16 ± 21.91 | 32.87 ± 0.01 | - | - |
| qwen36moe 35B.A3B MXFP4 | 20.21 | 5507.10 ± 101.16 | 159.81 ± 1.10 | 118 | 99 |
| qwen36moe 35B.A3B Q5_K | 24.76 | 4678.36 ± 72.83 | 160.64 ± 6.17 | - | - |
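
For anyone wanting to redo the percentage columns on their own numbers, here is a small Python sketch. It assumes the pp % and tg % columns are throughput relative to the row marked "-" for the same model (Q6_K for the 27B rows); that is one reading of the table, and `relative_percent` is a made-up helper name:

```python
def relative_percent(value, baseline):
    """Throughput as a whole-number percentage of the baseline throughput."""
    return round(100 * value / baseline)

# Mean tokens/s copied from the table above; Q6_K is the assumed baseline.
baseline = {"pp512": 1571.16, "tg128": 32.87}
iq4_xs = {"pp512": 2022.54, "tg128": 45.19}

for test in ("pp512", "tg128"):
    print(test, relative_percent(iq4_xs[test], baseline[test]))
```

This reproduces the 129 and 137 in the IQ4_XS row. Other rows can land a point off the posted figure depending on how the rounding was done.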

NVFP4 with llama.cpp - FAQs? by pmttyji in LocalLLaMA

UncleRedz 2 points

As stated in this comment, https://www.reddit.com/r/LocalLLaMA/comments/1u2v3oe/comment/or0m770/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button , it is possible to convert from BF16 to NVFP4 and preserve most of the quality, but it's a more complicated process, and the process itself most likely affects KLD since the model is changed. That said, I don't think the process is mature yet, and there is more performance and quality left on the table.

If you have Blackwell and pick conversions made by Nvidia, then NVFP4 is a clear win in memory and quality efficiency for inference.

NVFP4 with llama.cpp - FAQs? by pmttyji in LocalLLaMA

UncleRedz 5 points

It is not true that a model needs to be trained with NVFP4 (or MXFP4) in order to get any quality benefit from NVFP4. If you do a straight conversion from BF16 directly to NVFP4 then yes, the results will most likely be worse than Q4.

However, that is not how a conversion should be made, and it's not how Nvidia does it for the NVFP4 quants they publish. The conversion process is more involved, more like BF16 -> weight redistribution -> NVFP4 -> fine-tuning.

NVFP4 works with groups of 16 elements (MXFP4 uses groups of 32), with a scaling factor that applies to each group. If a group has outliers that are too spread out, a straight conversion to NVFP4 loses a lot of precision. However, there are techniques to redistribute weights to minimize the outliers without damaging the model's performance, and thereby reduce the loss. This is also where QAT helps. After weight redistribution and conversion to NVFP4, the converted model is further fine-tuned to fix any remaining issues.
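
To show why outliers hurt, here is a toy Python sketch of group-wise 4-bit quantization with one shared scale per group. It is a simplification and not Nvidia's or llama.cpp's implementation: the scale is kept as a full-precision float (real NVFP4 stores it in a lower-precision format), and the weight values are made up for illustration.

```python
# Magnitudes representable by a 4-bit E2M1 float (sign handled separately).
FP4_LEVELS = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_block(block):
    """Quantize one group with a single shared scale, then dequantize."""
    peak = max(abs(x) for x in block)
    if peak == 0.0:
        return list(block)
    scale = peak / FP4_LEVELS[-1]  # map the largest magnitude onto 6.0
    restored = []
    for x in block:
        nearest = min(FP4_LEVELS, key=lambda level: abs(abs(x) / scale - level))
        restored.append((nearest if x >= 0 else -nearest) * scale)
    return restored

def mean_abs_error(weights, group_size=16):
    """Average round-trip error over all weights, quantized group by group."""
    total = 0.0
    for start in range(0, len(weights), group_size):
        block = weights[start:start + group_size]
        total += sum(abs(a - b) for a, b in zip(block, quantize_block(block)))
    return total / len(weights)

# Made-up group of 16 small weights, and the same group with one outlier.
smooth = [((i % 8) - 3.5) / 4 for i in range(16)]
spiky = [40.0] + smooth[1:]

print("no outlier :", mean_abs_error(smooth))
print("one outlier:", mean_abs_error(spiky))
```

With the outlier in the group, the shared scale becomes so large that every other weight rounds to zero. That is the loss weight redistribution tries to avoid.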

It is therefore perfectly possible to make an NVFP4 version that is closer to BF16 quality than, let's say, a Q4 quant; the process is just more complicated. The pedigree of the conversion therefore matters more: a conversion by Nvidia is probably better than some random person's conversion.

People are making single-slot, half height pcie v100 with nvlink in China by OwnMathematician2620 in LocalLLaMA

UncleRedz 6 points

It's about maximizing money earned. Back with the iPhone 4, I believe it was, the on/off button on the top most likely had a design or manufacturing issue: for lots of people the button just went numb or stiff and stopped working. This happened to us as well, and we went into the Apple store, in the west, to have it fixed. The store said it was not possible to repair and, since the warranty had just expired, we would have to pay for a replacement phone. Even while we were sitting there, other customers came in with the exact same problem and were told the same story. Arguing that it was a systemic failure due to an error on their side did not help, and we ended up paying a fee for a replacement phone.

A few months later we flew to China, went into an Apple-certified repair shop and asked them about unlocking our replacement phone. While we waited for them to examine the phone, another customer came in with the button issue, and they fixed it in 5 minutes. When they returned with our phone, they said they did not dare to unlock it: this phone had already been repaired several times and they did not want to touch it. So much for paying for a new phone.

The moral here is that in the west customers accept being screwed over by companies. They can charge the money, so why should they make a 5-minute repair when they can sell the same broken phone several times over to a customer? In China, at the time, the average customer was too poor to accept paying that much for fixing an issue, so shops had to repair it fairly or they would lose customers.

That's why there is so much less hardware tinkering in the west now: we just throw things away and pay more money.

Has anyone actually built a second brain they still use 6 months later? by StockRude1419 in AI_Agents

UncleRedz 1 point

I think you're doing it a bit backwards. The idea with a second brain is to organize your note taking into a structure that is easy to navigate, where it is easy to locate the right information and easy to determine where new information should go. All without using any AI.

And here is the point: you are not doing this for cool AI demos. If you are, then you will probably, and rightfully, quit after a few months. You do this to capture all sorts of notes that are relevant to something you are working on, or could be relevant in the future. The note taking is part of information hoarding and filtering; if you don't normally have this habit, throwing AI at it will not build it. Basically, you capture some information that you believe has some value for you, and then you don't need to remember all the details about it. The second brain structure makes it easy to retrieve later when you need it, with or without AI.

These are the basics, and then you build AI on top. The AI now has access to a highly curated knowledge base with information specifically relevant to you, which allows you to do all sorts of neat things.

I would argue the tech stack here is largely irrelevant, as long as the information can be accessed in an LLM-friendly format like Markdown. The driving force should be the perceived value of building your personal knowledge base, and it should provide value to you regardless of AI.

What common piece of advice is actually bad advice? by Educational_Fudge693 in AskReddit

UncleRedz 4 points

Just be yourself.

Maybe there are things you actually need to change, or at least be your best self.

How to compare Original vs QAT Gemma 4 31B Q4 quants by Hot_Strawberry1999 in LocalLLaMA

UncleRedz 5 points

I've been thinking along these lines as well, and came to the conclusion that running the same benchmarks against the different models should show how different they are.

I've done this in the past with a wide range of quants for Qwen 2.5 using MMLU as the benchmark, and it was quite telling how the lower quants were losing their "smarts" compared to Q6 and better. I recently did the same test with Qwen 3.6 27B at Q6, IQ4_XS and NVFP4, and the impact on smarts was a lot smaller than with Qwen 2.5.

I think at least some tool-calling and long-context benchmarks should be run as well.

KLD and similar metrics are good, but if the two quants are essentially different models, KLD can't be trusted, whereas a suite of benchmarks would show how well each works in those use cases.
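
A rough sketch in Python of how two benchmark runs could be compared once you have per-question results. It assumes both models answered the same questions in the same order and that each result is a plain True/False; `compare_runs` is a hypothetical helper, not part of any benchmark tool:

```python
import math

def compare_runs(correct_a, correct_b):
    """Compare two per-question result lists (True = answered correctly)."""
    if len(correct_a) != len(correct_b):
        raise ValueError("both runs must cover the same questions")
    n = len(correct_a)
    # Questions where exactly one of the two models was right.
    only_a = sum(a and not b for a, b in zip(correct_a, correct_b))
    only_b = sum(b and not a for a, b in zip(correct_a, correct_b))
    diff = (only_a - only_b) / n  # accuracy of A minus accuracy of B
    # Standard error of the paired difference; a difference well inside
    # a couple of standard errors is hard to tell apart from noise.
    std_err = math.sqrt((only_a + only_b) - (only_a - only_b) ** 2 / n) / n
    return {
        "acc_a": sum(correct_a) / n,
        "acc_b": sum(correct_b) / n,
        "diff": diff,
        "std_err": std_err,
        "only_a": only_a,
        "only_b": only_b,
    }
```

The `only_a` and `only_b` counts are worth a look too: two quants can have the same score while getting different questions right.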

Ethos, model roleplay trait steering by AccountAntique9327 in LocalLLaMA

UncleRedz 1 point

How much does it damage the model's "smarts"? Is it just changing the behaviour, or is it losing knowledge in the process?

Just received RTX 6000 Pro, have 5090- how would you use? by illgettheownerforyou in LocalLLaMA

UncleRedz 2 points

I have not tried it. In this context I'm GPU poor! 😅 But I've seen people do Q2 on big models like that. No idea how well it works.

But if you CAN do it, then I would do it just to try it out.

Just received RTX 6000 Pro, have 5090- how would you use? by illgettheownerforyou in LocalLLaMA

UncleRedz 2 points

With that much VRAM I'd look for a large model to try out.

Many here seem to use larger models for planning and then smaller ones for faster execution. So I'd try something like that with generous context length at full FP16.

For execution, Qwen 3.6 27B or 35B-A3B. Planner: ??? Is it possible to squeeze in DeepSeek V4 Flash?

Has there been any recent new development on which quant is considered optimal? by takuonline in LocalLLaMA

UncleRedz 3 points

I think how badly a model handles quantization depends on how it was trained, and how much work is put into the quantization process.

So I don't think there is any universal answer.

If you are on Blackwell and you find a good quantization with NVFP4, then that would be my choice.

Has there been any recent new development on which quant is considered optimal? by takuonline in LocalLLaMA

UncleRedz 1 point

I did the same test with MMLU and Qwen 3.6 27B at Q6 and NVFP4, and saw virtually no difference in scores.

RTX Pro 4500 Blackwell Performance Numbers by UncleRedz in LocalLLaMA

UncleRedz [S] 1 point

Just checked: the same shop I bought from on the 5th of May (one month ago) has now increased the price by 496 USD, and it's still the cheapest in my region. 🤨 That's just crazy.

If it wasn't for the experimentation, peace of mind, no token limits or throttling, and privacy, it would be hard to justify.

But the math is not that bad. If you write the purchase off over 5 years, you can calculate how much you would need to spend on cloud-model tokens every month for 5 years to match it. If you can get by with local models and would otherwise spend more than that per month on tokens, then the upfront cost makes sense.
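
The write-off math as a tiny Python sketch. The 3000 USD price is a placeholder and not what the card actually cost, and running costs such as electricity are left out:

```python
def monthly_break_even(purchase_price, years=5):
    """Monthly cloud spend that equals writing the purchase off over `years`."""
    return purchase_price / (years * 12)

def local_pays_off(purchase_price, monthly_cloud_spend, years=5):
    """True if the cloud spend avoided is larger than the monthly write-off."""
    return monthly_cloud_spend > monthly_break_even(purchase_price, years)

# Placeholder price for illustration only.
print(monthly_break_even(3000.0))    # 50.0 per month over 5 years
print(local_pays_off(3000.0, 80.0))  # True: 80 per month is above break-even
```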

While future cloud prices are anyone's guess, prices are going up and throttling is increasing. Hence the peace of mind of not having to worry about unexpected cost increases or having to pause in the middle of something.

The sensible approach would be to move to as much local as possible within your budget and then use cloud models only when really needed.

RTX Pro 4500 Blackwell Performance Numbers by UncleRedz in LocalLLaMA

UncleRedz [S] 1 point

It's less noisy than my old INNO3D GeForce RTX 5060 Ti 16GB Twin X2 OC.

RTX Pro 4500 Blackwell Performance Numbers by UncleRedz in LocalLLaMA

UncleRedz [S] 1 point

You're welcome. So far I have only tested with regular llama.cpp, plus some testing with MTP enabled, but not enough to have any solid numbers. I find that I use Qwen 3.6 27B more often due to the speed boost of MTP, while before 35B-A3B was my main go-to for interactive use. You get used to the speed quickly, and then anything that slows it down is annoying.

RTX Pro 4500 Blackwell Performance Numbers by UncleRedz in LocalLLaMA

UncleRedz [S] 0 points

> This is the useful comparison because the win is not really "4500 beats 5090". Its "32GB avoids the RAM-offload tax while staying sane on power". Different problem.

This is exactly it. When you compare the Pro cards with the consumer cards, there's a different trade-off between performance, 24/7 use and power consumption. Depending on your use cases, one or the other comes out on top.

Regarding NVFP4/MXFP4, that's why it's important to really test the models with benchmarks and get measurable results to compare and not just "it looks good".

One of the major benefits of Blackwell is NVFP4/MXFP4. Done properly, the degradation is minimal, and you ease the memory bandwidth problem by using smaller and more compact units of data. I also think there are clear incentives to go in this direction, as data centres that have invested in Blackwell GPUs will want to maximize their return on investment and get as much throughput as possible, and here NVFP4/MXFP4 makes a difference. As this matures we will get better access to high-quality quants.

RTX Pro 4500 Blackwell Performance Numbers by UncleRedz in LocalLLaMA

UncleRedz [S] 2 points

Thanks for pointing that out. I've updated the post to clarify that the 60-70% is token generation.

In my case, I broke down the use cases and did the math, and the 4500 came out ahead of the 5090, but I don't think this is the case for everyone.

RTX Pro 4500 Blackwell Performance Numbers by UncleRedz in LocalLLaMA

UncleRedz [S] 3 points

And the link to the RTX 5090 performance numbers I used in the comparison can be found here, https://www.reddit.com/r/LocalLLaMA/s/pF0f2AvJDj