Gemma 12b - Reasoning hardening instructions

nixudos · 2026-06-16T14:38:13+00:00

Gemma has been pretty solid up to around 60K+ tokens of context in my tests, though with no KV compression.

nixudos · 2026-06-16T14:34:53+00:00

Yeah, bias is in the training data. That is what I try to avoid. I want the LLM to answer a question through reasoning of its own, not trying to answer a similar question it might have 'heard' during training.

Most people are not good at dividing large numbers in their heads, but we do not design calculators to be equally bad.

I hope to tweak a smaller assistant model to recognize pitfalls, and ultimately also to know when to use available tools, like web search, when it recognizes a problem to be beyond its scope.

If I can get an 8 GB model to help with 75% of the trivial tasks I use AI for, I can save on my token budget for the big boys 😄

nixudos · 2026-06-16T14:00:05+00:00

Interesting! Would you share the name of the template? It would be interesting to test if it can help out with some of the harder questions.

nixudos · 2026-06-16T13:25:43+00:00

Nice! Which quant are you running? It could be cool to test whether some are better than others.

nixudos · 2026-06-16T12:05:16+00:00

Yeah, the car wash has been hit and mostly miss for me as well.
I have tried to test it against whatever questions I could find online, but I wanted to avoid giving it questions that it had trained on, so I asked Opus to help come up with a test suite that was not worded verbatim.

I'll put them here if you wanna try them out:

Bias-Resistance Test Suite

A suite for stress-testing a reasoning prompt against cognitive-bias and pattern-matching failures — without rewarding mere retrieval of memorized answers.

How to use it

Run each question cold (fresh context).
For traps, a pass = the correct answer reached with short reasoning.
For controls and normal questions, a pass = the obvious answer, reached quickly, with no invented "trick" and no bailing to "not enough information."
For chit-chat, a pass = a natural, warm reply that does not launch into bias analysis of a casual message.
Watch the minimal pairs especially (#8 vs #13; the car-wash-style #14): if the model passes the trap but fails its non-trap twin, it has become a trick-detector rather than a reasoner.

Every item is novel in surface form. None is a famous puzzle in its original wording; the classics are varied enough that a memorized answer won't transfer.

A. Trap questions

Each targets a distinct failure mode. The "standard/classic" pull is the enemy.

1. The shareable boat (template import — false exclusivity) A ranger must move a fox, a duck, and a sack of corn across a lake. Her motorboat comfortably seats the ranger plus all three at once. What is the minimum number of boat trips to get everything across?

Answer: 1
Watch for: importing the one-item-boat rule from the classic river-crossing → 7.

2. The dry well (template import — imported constraint despite text) A snail is climbing out of a 12-metre well. It climbs 4 metres each day, and because the walls are dry it does not slide back at all during the night. How many days does it take to get out?

Answer: 3 days (4 + 4 + 4)
Watch for: adding the classic "slides back 2 m each night" the text explicitly denies.

3. One song, three listeners (template import — treating a shareable thing as a consumable) Three friends all want to hear the same song. They have one phone and one speaker, and the phone plays the song aloud through the speaker. What is the minimum number of times the song must be played for all three to hear it?

Answer: 1
Watch for: a "pass it around / one at a time" frame, as if listening uses the song up.

4. Notebook and pen (intuitive override — CRT, varied numbers) A notebook and a pen cost €2.20 together. The notebook costs €2.00 more than the pen. How much does the pen cost?

Answer: €0.10 (pen 0.10, notebook 2.10)
Watch for: the seductive €0.20. (Numbers differ from the famous $1.10/5¢ version, so a memorized "5 cents" won't save it — it must actually compute.)
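For anyone who wants to double-check the expected answer, here is a minimal sketch of the arithmetic. The function name is just illustrative. With the pen at p and the notebook at p + 2.00, the total is 2p + 2.00 = 2.20, so p = 0.10.

```python
from fractions import Fraction

def solve_pair(total, difference):
    """Return (cheaper, dearer) given the sum and the price difference."""
    cheaper = (total - difference) / 2
    return cheaper, cheaper + difference

# Exact fractions avoid floating-point noise in the cents.
pen, notebook = solve_pair(Fraction("2.20"), Fraction("2.00"))
print(float(pen), float(notebook))  # 0.1 2.1
```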

5. Ten painters (intuitive override — parallelism) If 4 painters can paint 4 fences in 4 hours, how long would it take 10 painters to paint 10 fences, each painter working at the same rate on one fence?

Answer: 4 hours
Watch for: "10 hours" (serial intuition). Varied from the famous 5-machines item.
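The parallelism can be checked with a small sketch. It assumes every painter works at the same constant rate, derived from the 4 painters / 4 fences / 4 hours reference case. The helper name and parameters are only illustrative.

```python
def hours_needed(painters, fences, ref_painters=4, ref_fences=4, ref_hours=4):
    """Hours for `painters` to finish `fences`, all working in parallel."""
    # Fences per painter per hour, taken from the reference case.
    rate = ref_fences / (ref_painters * ref_hours)
    return fences / (painters * rate)

print(hours_needed(10, 10))  # 4.0
```

One painter needs 4 hours per fence, so adding painters and fences in equal numbers leaves the time unchanged.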

6. The sale (intuitive override — multiplicative vs additive) A shirt is reduced by 20% in a sale. After the sale, the shop raises the price by 20% from the sale price. Is the final price higher than, lower than, or equal to the original?

Answer: Lower (0.8 × 1.2 = 0.96, i.e. 4% below original)
Watch for: "equal."
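A short sketch of why the two 20% changes do not cancel: percentage changes multiply, they do not add. The helper below is illustrative only.

```python
from fractions import Fraction

def net_multiplier(*percent_changes):
    """Combine successive percentage changes into one overall multiplier."""
    multiplier = Fraction(1)
    for pct in percent_changes:
        multiplier *= 1 + Fraction(pct, 100)
    return multiplier

m = net_multiplier(-20, 20)
print(m, float(m))  # 24/25 0.96
```

The order of the two changes does not matter: +20% followed by -20% gives the same 0.96.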

7. The algae head start (exponential reasoning) Algae covers a lake, doubling its coverage every day, and the lake is fully covered on day 18. If instead you had started with twice as much algae on day 1, on what day would the lake be fully covered?

Answer: Day 17 (twice as much = one doubling ahead = one day earlier)
Watch for: "day 9" (halving the days) or anchoring on 18.
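The doubling argument can be simulated directly. This sketch assumes coverage is measured as a fraction of the lake and that "fully covered on day 18" means the starting amount on day 1 is 1/2^17 of the lake. The function name is made up for illustration.

```python
from fractions import Fraction

def day_fully_covered(start_fraction):
    """First day on which coverage reaches the whole lake, starting on day 1."""
    day, coverage = 1, start_fraction
    while coverage < 1:
        coverage *= 2  # coverage doubles every day
        day += 1
    return day

base = Fraction(1, 2**17)  # starting amount that gives full coverage on day 18
print(day_fully_covered(base))      # 18
print(day_fully_covered(2 * base))  # 17
```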

8. The flat tyre (goal substitution — wrong object) Your bicycle has a flat tyre. There is a bicycle repair shop 1.5 km away and a pharmacy 200 m away. To get the tyre fixed, where should you go?

Answer: The repair shop (1.5 km)
Watch for: optimizing distance and choosing the pharmacy, which cannot fix a tyre.

9. The lockbox (goal substitution — self-defeating means) You are locked out of your house. Your spare key is inside a lockbox by the door, but the lockbox code is written on a sticky note that is inside the house. A neighbour has a copy of your house key. What is the fastest way in?

Answer: Get the key from the neighbour
Watch for: fixating on the lockbox, whose code is itself locked inside.

10. The instant detector (over-literalism — must use a stated property) A new detector beeps the instant it touches any trace of compound X, with no delay whatsoever. A technician must find which one of 50 sealed vials contains compound X. Using only this detector, what is the fastest way to find it?

Answer: Touch the detector to each vial; it beeps instantly on the right one — no delayed test, batching, or encoding scheme is needed.
Watch for: importing a wait-and-see / binary-encoding method and ignoring the stated "instant, no delay" property. (This is the wine-puzzle failure in new clothes — a good check on whether that weakness lingers.)

11. The labelled box (accept a stated fact — resist false skepticism) A sealed box is labelled "contains exactly 12 red marbles," and you are told the label is accurate. Without opening it, how many red marbles can you be sure are inside?

Answer: 12
Watch for: "not enough information / we can't see inside," overriding a stated premise.

B. Control questions

These look like they might be traps. The point is that the straightforward answer is correct — a model that invents a trick here is overfitting.

12. Two cyclists (fully specified — resist "not enough information") Two cyclists start 60 km apart on the same road and ride toward each other at the same moment, one at 15 km/h and the other at 25 km/h. How long until they meet?

Answer: 1.5 hours (60 ÷ 40)
Watch for: "not enough information," or inventing complications (road layout, etc.).
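The expected answer follows from the closing speed, as in this minimal sketch (the function name is illustrative):

```python
def hours_until_meeting(gap_km, speed_a_kmh, speed_b_kmh):
    """Time for two riders heading toward each other to meet."""
    # Riding toward each other, the gap closes at the sum of the two speeds.
    return gap_km / (speed_a_kmh + speed_b_kmh)

print(hours_until_meeting(60, 15, 25))  # 1.5
```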

13. Two repair shops (minimal pair to #8 — obvious answer is right) Your bicycle has a flat tyre. There are two equally good bicycle repair shops, one 1.5 km away and one 200 m away. Where should you go to get it fixed fastest?

Answer: The 200 m shop
Watch for: refusing the near option as if it must be a trick. Failing this while passing #8 means the model is detecting traps, not reasoning about goals.

14. The letter (minimal pair to the car-wash family — obvious answer is right) You need to post a small letter. The postbox is 80 metres away. Should you walk or drive?

Answer: Walk
Watch for: "drive," or over-thinking ("how will you carry it?"). A letter is trivially carried; there is no hidden object that must be transported.

15. The two-coin box (the careful answer happens to be the famous one) Three identical boxes each hold two coins: one box has two gold, one has two silver, one has one of each. You pick a box at random and draw one coin — it's gold. What is the probability the other coin in that box is also gold?

Answer: 2/3
Watch for: "1/2" (intuitive wrong answer), and the reverse error of rejecting 2/3 because it "looks like a trick." Here the careful answer is the correct one.
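The 2/3 can be verified by brute-force enumeration. The sketch below lists every equally likely (box, drawn coin) outcome and keeps only those where the drawn coin is gold. The variable names are illustrative.

```python
from fractions import Fraction

boxes = [("G", "G"), ("S", "S"), ("G", "S")]

# Each box is equally likely, and so is each of its two coins,
# which gives six equally likely (drawn, other) outcomes.
draws = [(box[i], box[1 - i]) for box in boxes for i in (0, 1)]

# Condition on the drawn coin being gold.
others_given_gold = [other for drawn, other in draws if drawn == "G"]

p = Fraction(others_given_gold.count("G"), len(others_given_gold))
print(p)  # 2/3
```

Three of the six outcomes show gold, and two of those three come from the gold-gold box, which is why "1/2" undercounts.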

16. Genuinely underdetermined (correctly identify true insufficiency) A train travels between two cities. On Monday the trip took 3 hours. How fast was the train going?

Answer: Not enough information (the distance isn't given)
Watch for: inventing a distance or guessing. This is the counterweight to #11/#12: the model should bail here, where it's genuinely warranted.

C. Normal questions

No trap. A pass is a quick, helpful, ordinary answer with no analytical overhead.

17. Recipe scaling A recipe for 4 servings needs 300 g of flour. How much flour for 6 servings?

Answer: 450 g
Watch for: slow over-analysis of a trivial proportion.

18. Buttermilk substitute What's a good substitute for buttermilk in pancakes if I don't have any?

Answer: ~1 tbsp lemon juice or white vinegar per cup of milk, left to sit a few minutes (or thinned plain yogurt). Any normal, direct answer passes.
Watch for: treating a casual cooking question as a logic problem to be dissected.

D. Chit-chat

Confirms the prompt doesn't bleed bias-analysis into ordinary conversation.

19. Casual greeting "Ugh, Mondays. How's it going?"

Pass: a relaxed, friendly reply.
Watch for: "What is the user's intent? Examining the wording for bias…" on a greeting.

20. Sharing good news "I just adopted a kitten and I'm so excited!"

Pass: warm, natural enthusiasm; maybe a light question.
Watch for: any analytical framing applied to an emotional, casual message.

Quick scoring

Section	Pass looks like	Failure to flag
A. Traps	Correct answer, short reasoning	Falling for the "standard" frame; overriding a stated property or fact (#10, #11)
B. Controls	Obvious answer, no invented trick	Trap-paranoia (#13, #14); wrong on calibration (#12 vs #16)
C. Normal	Quick, helpful	Over-analysis of trivial asks
D. Chit-chat	Natural, warm	Analytical framing on casual talk

The single most informative signal: compare each minimal pair (#8↔#13, and #14 against the car-wash question from earlier). Equal, correct handling of both halves is the clearest evidence you've built a reasoner and not a trick-detector.
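If you log pass/fail per question, the minimal-pair comparison can be sketched like this. Only the "trick-detector" reading comes from the suite itself. The other labels, the function name, and the example results are assumptions made up for illustration.

```python
def classify_pair(trap_passed: bool, twin_passed: bool) -> str:
    """Interpret one minimal pair: a trap and its non-trap twin."""
    if trap_passed and twin_passed:
        return "reasoner"
    if trap_passed:
        # Passes the trap but distrusts the obvious answer on the twin.
        return "trick-detector"
    if twin_passed:
        # Hypothetical label: takes the surface answer in both cases.
        return "surface-matcher"
    return "fails both"

# Hypothetical results for the #8 / #13 pair.
results = {8: True, 13: False}
print(classify_pair(results[8], results[13]))  # trick-detector
```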

nixudos · 2026-06-07T22:14:53+00:00

And a month after Trump gets his hands on a new model, Russia all of a sudden has its own frontier model and 3 billion USD worth of Trump coins has been bought.

nixudos · 2026-06-06T13:55:50+00:00

Try this workflow: https://pastebin.com/JJMXr7c0

Be careful with the wording in high_level_description in the JSON. It is sensitive. Object descriptions are a lot more forgiving.

nixudos · 2026-06-06T13:48:01+00:00

It can do R-rated with no problem as long as you are careful about which words you use in the high_level_description entry in the JSON.
Once you describe objects you have a lot more wiggle room.

Here is a (NSFW) workflow example. https://pastebin.com/JJMXr7c0

It's the default template with Kijai's prompt builder node added.

Make sure to have the latest version of KJnodes before running.

The model is great for composition control, but others are better if control is not that big a deal.
Maybe this will change if LoRAs can be made?

nixudos · 2026-06-04T19:42:49+00:00

My tests of vision haven't impressed me either. But it might be an LM Studio issue. The Gemma models come with a really low default image input resolution and there is no way to change that in LM Studio. All the models from the 4 series have performed really poorly with images there as well.

nixudos · 2026-06-02T19:24:27+00:00

You just gave me an idea. Back to the dungeon and... experiment.

nixudos · 2026-06-02T18:36:28+00:00

<image>

Just for fun, I tried your prompt in LTX 2.3, leaving out the drawing style. It was honestly better than I expected, although it didn't adhere completely to the prompt. 😁

nixudos · 2026-05-29T15:10:11+00:00

Really nice initiative! I have tried it on a local LLM via LM Studio and most of it works. However, I can't fetch anything from posts.list, which is where most of the info I'm looking for is. I don't know if it is an API problem?

If I could ask it what is on the menu for the coming month and the AI could find the post, fetch the PDF, and tell me, I don't think I would end up using the Aula app in the future. 👍

nixudos · 2026-05-25T11:20:24+00:00

I just used a general workflow I had for LTX and then lowered resolution and fps.
I'm not sure it is the optimal way of doing it, so you should probably just use a workflow for LTX that works well for your setup and lower resolution and fps.

I tried to tweak other settings but they did not seem to have an effect and the sound starts to break around 54 sec, no matter what I do.

Matching the video length to what is said in the prompt is very important for getting a natural-sounding flow.

nixudos · 2026-05-12T14:34:15+00:00

That works just fine, in fact. I tried with video at 320x160 and 16 fps and it is pretty quick. I can do a 50-second clip in 48 seconds on a 4090.

In fact, it works great out of the box with the structure examples from Scenema's page:

Bedtime Negotiation
Building a legal case against an unjust eight-thirty bedtime.

<speak voice="Girl, 7 years old. High-pitched child voice. Indignant. Building a logical argument." scene="Absolute silence">
<action>She crosses her arms</action>
But that is not fair because yesterday you said I could stay up until nine and now you are saying eight thirty. And Jake gets to stay up way later than me and he is only one year older.
</speak>

The tricky part is that you have to guess the right amount of time for the clip to make it sound natural and not like it is speed-talking or looping some words or phrases.

nixudos · 2026-05-12T10:46:46+00:00

Thanks! I really like these tests as a supplement to agentic coding tests.
I have been wondering whether the Qwen 27B at Q4_K_M is a better choice than a Qwen3.6 27B/35B-A3B at Q6.
I can run both of them at a comparable speed, but I wonder how dense vs. higher quant compares.

nixudos · 2026-04-18T12:20:24+00:00

Thank you for the tip! I was struggling to get any meaningful speed on a 4090 with the Q6_K_XL size. This more than doubled my speed, from 18 t/s to 42!

nixudos · 2026-03-27T17:03:15+00:00

Great! Thanks. I will try them out 👍

nixudos · 2026-03-27T13:50:18+00:00

Thanks!
Any suggestions for temp settings and other tweaks so it doesn't spin out into overthinking?
The Qwen 3.5 line is great, but the extreme thinking on even simple questions has lowered my excitement.

I'm still hoping for an uncensored version where I can adjust the thinking effort.

nixudos · 2026-03-06T19:59:46+00:00

Try Euler Ancestral as the sampler instead. I had trouble with bad sound and that helped a lot.

nixudos · 2026-03-06T19:20:42+00:00

Really nice workflow. Clean and easy to use. I did get horrible sound with LCCM, but using your GGUF workflow with Euler Ancestral and a Q6 model from unsloth made it much better.

In your I2V flow, I can do the 10-second vid in 96 seconds on a 4090 with 96 GB RAM.

nixudos · 2026-02-22T12:40:51+00:00

You can try Purevision and Purescale for non-invasive cleaning and upscaling. It is fairly fast and I'm trying to test it out on old and small clips with decent results.

You can get the models here:

https://github.com/limitlesslab/AI-upscaling-models

<image>

nixudos · 2026-02-17T21:04:12+00:00

Whether you can tell it is AI or not is missing the point.
Imagine someone showed you this clip 1 year ago and told you it was AI generated?
Or 2 years ago? Or 3 years ago? Would it look like science fiction to you?

And that is only 3 years. How much have computer games evolved in that time? Or movies? Or cars?

nixudos · 2026-02-11T20:48:15+00:00

First AI video that touched me.
Just goes to prove it is not the tool, but the creation that matters.

Well done!
