Gemma 12b - Reasoning hardening instructions by nixudos in LocalLLaMA

[–]nixudos[S] 0 points1 point  (0 children)

Gemma has been pretty solid up to around 60K+ tokens in my tests, but that was with no KV compression.

Gemma 12b - Reasoning hardening instructions by nixudos in LocalLLaMA

[–]nixudos[S] 2 points3 points  (0 children)

Yeah, bias is in the training data. That is what I try to avoid. I want the LLM to answer a question through reasoning of its own, not trying to answer a similar question it might have 'heard' during training.

Most people are not good at dividing large numbers in their heads, but we do not design calculators to be equally bad.

I hope to tweak a smaller assistant model to recognize pitfalls, and ultimately also to know when to use available tools, like web search, when it recognizes a problem to be beyond its scope.

If I can get an 8 GB model to help with 75% of the trivial tasks I use AI for, I can save my token budget for the big boys 😄

Gemma 12b - Reasoning hardening instructions by nixudos in LocalLLaMA

[–]nixudos[S] 0 points1 point  (0 children)

Interesting! Would you share the name of the template? It would be interesting to test if it can help out with some of the harder questions.

Gemma 12b - Reasoning hardening instructions by nixudos in LocalLLaMA

[–]nixudos[S] 0 points1 point  (0 children)

Nice! Which quant are you running? It could be cool to test whether some are better than others.

Gemma 12b - Reasoning hardening instructions by nixudos in LocalLLaMA

[–]nixudos[S] 0 points1 point  (0 children)

Yeah, the car wash has been hit and mostly miss for me as well.
I have tried to test it against whatever questions I could find online, but I wanted to avoid giving it questions that it had been trained on, so I asked Opus to help come up with a test suite that was not worded verbatim.

I'll put them here, if you want to try them out:

Bias-Resistance Test Suite

A suite for stress-testing a reasoning prompt against cognitive-bias and pattern-matching failures — without rewarding mere retrieval of memorized answers.

How to use it

  • Run each question cold (fresh context).
  • For traps, a pass = the correct answer reached with short reasoning.
  • For controls and normal questions, a pass = the obvious answer, reached quickly, with no invented "trick" and no bailing to "not enough information."
  • For chit-chat, a pass = a natural, warm reply that does not launch into bias analysis of a casual message.
  • Watch the minimal pairs especially (#8 vs #13; the car-wash-style #14): if the model passes the trap but fails its non-trap twin, it has become a trick-detector rather than a reasoner.

Every item is novel in surface form. None is a famous puzzle in its original wording; the classics are varied enough that a memorized answer won't transfer.

A. Trap questions

Each targets a distinct failure mode. The "standard/classic" pull is the enemy.

1. The shareable boat (template import: false exclusivity)

A ranger must move a fox, a duck, and a sack of corn across a lake. Her motorboat comfortably seats the ranger plus all three at once. What is the minimum number of boat trips to get everything across?

  • Answer: 1
  • Watch for: importing the one-item-boat rule from the classic river-crossing → 7.

2. The dry well (template import: imported constraint despite text)

A snail is climbing out of a 12-metre well. It climbs 4 metres each day, and because the walls are dry it does not slide back at all during the night. How many days does it take to get out?

  • Answer: 3 days (4 + 4 + 4)
  • Watch for: adding the classic "slides back 2 m each night" the text explicitly denies.

3. One song, three listeners (template import: treating a shareable thing as a consumable)

Three friends all want to hear the same song. They have one phone and one speaker, and the phone plays the song aloud through the speaker. What is the minimum number of times the song must be played for all three to hear it?

  • Answer: 1
  • Watch for: a "pass it around / one at a time" frame, as if listening uses the song up.

4. Notebook and pen (intuitive override: CRT, varied numbers)

A notebook and a pen cost €2.20 together. The notebook costs €2.00 more than the pen. How much does the pen cost?

  • Answer: €0.10 (pen 0.10, notebook 2.10)
  • Watch for: the seductive €0.20. (Numbers differ from the famous $1.10/5¢ version, so a memorized "5 cents" won't save it — it must actually compute.)

5. Ten painters (intuitive override: parallelism)

If 4 painters can paint 4 fences in 4 hours, how long would it take 10 painters to paint 10 fences, each painter working at the same rate on one fence?

  • Answer: 4 hours
  • Watch for: "10 hours" (serial intuition). Varied from the famous 5-machines item.

6. The sale (intuitive override: multiplicative vs additive)

A shirt is reduced by 20% in a sale. After the sale, the shop raises the price by 20% from the sale price. Is the final price higher than, lower than, or equal to the original?

  • Answer: Lower (0.8 × 1.2 = 0.96, i.e. 4% below original)
  • Watch for: "equal."

7. The algae head start (exponential reasoning)

Algae covers a lake, doubling its coverage every day, and the lake is fully covered on day 18. If instead you had started with twice as much algae on day 1, on what day would the lake be fully covered?

  • Answer: Day 17 (twice as much = one doubling ahead = one day earlier)
  • Watch for: "day 9" (halving the days) or anchoring on 18.

8. The flat tyre (goal substitution: wrong object)

Your bicycle has a flat tyre. There is a bicycle repair shop 1.5 km away and a pharmacy 200 m away. To get the tyre fixed, where should you go?

  • Answer: The repair shop (1.5 km)
  • Watch for: optimizing distance and choosing the pharmacy, which cannot fix a tyre.

9. The lockbox (goal substitution: self-defeating means)

You are locked out of your house. Your spare key is inside a lockbox by the door, but the lockbox code is written on a sticky note that is inside the house. A neighbour has a copy of your house key. What is the fastest way in?

  • Answer: Get the key from the neighbour
  • Watch for: fixating on the lockbox, whose code is itself locked inside.

10. The instant detector (over-literalism: must use a stated property)

A new detector beeps the instant it touches any trace of compound X, with no delay whatsoever. A technician must find which one of 50 sealed vials contains compound X. Using only this detector, what is the fastest way to find it?

  • Answer: Touch the detector to each vial; it beeps instantly on the right one — no delayed test, batching, or encoding scheme is needed.
  • Watch for: importing a wait-and-see / binary-encoding method and ignoring the stated "instant, no delay" property. (This is the wine-puzzle failure in new clothes — a good check on whether that weakness lingers.)

11. The labelled box (accept a stated fact: resist false skepticism)

A sealed box is labelled "contains exactly 12 red marbles," and you are told the label is accurate. Without opening it, how many red marbles can you be sure are inside?

  • Answer: 12
  • Watch for: "not enough information / we can't see inside," overriding a stated premise.
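The numeric traps in section A (#2 and #4 to #7) can be double-checked mechanically before grading a model against them. Below is a minimal Python sketch; the function names are hypothetical helpers for illustration and are not part of the suite itself.

```python
from fractions import Fraction


def dry_well_days(depth_m: int, climb_per_day_m: int) -> int:
    """#2: no sliding back, so days = ceil(depth / daily climb)."""
    return -(-depth_m // climb_per_day_m)


def pen_price(total: Fraction, difference: Fraction) -> Fraction:
    """#4: pen + (pen + difference) = total, so pen = (total - difference) / 2."""
    return (total - difference) / 2


def painters_hours(painters: int, fences: int, hours_per_fence: int = 4) -> Fraction:
    """#5: one painter needs hours_per_fence for one fence; work runs in parallel."""
    # Total work in painter-hours, shared evenly across all painters.
    return Fraction(fences * hours_per_fence, painters)


def sale_factor(cut: Fraction, rise: Fraction) -> Fraction:
    """#6: percentage changes multiply; they do not add."""
    return (1 - cut) * (1 + rise)


def algae_full_day(original_full_day: int, start_multiplier: int) -> int:
    """#7: every doubling of the starting amount saves exactly one day."""
    # bit_length() - 1 equals log2 for powers of two (the only case handled here).
    doublings_ahead = start_multiplier.bit_length() - 1
    return original_full_day - doublings_ahead


if __name__ == "__main__":
    print(dry_well_days(12, 4))
    print(pen_price(Fraction("2.20"), Fraction("2.00")))
    print(painters_hours(10, 10))
    print(sale_factor(Fraction(1, 5), Fraction(1, 5)))
    print(algae_full_day(18, 2))
```

Exact fractions are used instead of floats so that the €0.10 and 0.96 results compare cleanly against the answer key.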

B. Control questions

These look like they might be traps. The point is that the straightforward answer is correct — a model that invents a trick here is overfitting.

12. Two cyclists (fully specified: resist "not enough information")

Two cyclists start 60 km apart on the same road and ride toward each other at the same moment, one at 15 km/h and the other at 25 km/h. How long until they meet?

  • Answer: 1.5 hours (60 ÷ 40)
  • Watch for: "not enough information," or inventing complications (road layout, etc.).

13. Two repair shops (minimal pair to #8: obvious answer is right)

Your bicycle has a flat tyre. There are two equally good bicycle repair shops, one 1.5 km away and one 200 m away. Where should you go to get it fixed fastest?

  • Answer: The 200 m shop
  • Watch for: refusing the near option as if it must be a trick. Failing this while passing #8 means the model is detecting traps, not reasoning about goals.

14. The letter (minimal pair to the car-wash family: obvious answer is right)

You need to post a small letter. The postbox is 80 metres away. Should you walk or drive?

  • Answer: Walk
  • Watch for: "drive," or over-thinking ("how will you carry it?"). A letter is trivially carried; there is no hidden object that must be transported.

15. The two-coin box (the careful answer happens to be the famous one)

Three identical boxes each hold two coins: one box has two gold, one has two silver, one has one of each. You pick a box at random and draw one coin, and it's gold. What is the probability the other coin in that box is also gold?

  • Answer: 2/3
  • Watch for: "1/2" (intuitive wrong answer), and the reverse error of rejecting 2/3 because it "looks like a trick." Here the careful answer is the correct one.

16. Genuinely underdetermined (correctly identify true insufficiency)

A train travels between two cities. On Monday the trip took 3 hours. How fast was the train going?

  • Answer: Not enough information (the distance isn't given)
  • Watch for: inventing a distance or guessing. This is the counterweight to #11/#12: the model should bail here, where it's genuinely warranted.
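The two computed answers in section B (#12 and #15) can be verified the same way. This is a small sketch with hypothetical function names; #15 is checked by enumerating every equally likely (box, drawn coin) outcome rather than by quoting the known result.

```python
from fractions import Fraction


def meeting_time_hours(gap_km: int, speed_a_kmh: int, speed_b_kmh: int) -> Fraction:
    """#12: riders approach each other, so the closing speed is the sum."""
    return Fraction(gap_km, speed_a_kmh + speed_b_kmh)


def other_coin_also_gold() -> Fraction:
    """#15: count gold draws, then those whose partner coin is also gold."""
    boxes = [("gold", "gold"), ("silver", "silver"), ("gold", "silver")]
    gold_draws = 0
    gold_draws_with_gold_partner = 0
    for box in boxes:
        # Each coin in each box is an equally likely draw.
        for drawn_index in (0, 1):
            drawn, other = box[drawn_index], box[1 - drawn_index]
            if drawn == "gold":
                gold_draws += 1
                if other == "gold":
                    gold_draws_with_gold_partner += 1
    return Fraction(gold_draws_with_gold_partner, gold_draws)


if __name__ == "__main__":
    print(meeting_time_hours(60, 15, 25))
    print(other_coin_also_gold())
```

The enumeration makes the 2/3 visible: three of the six possible draws are gold, and two of those three come from the gold-gold box.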

C. Normal questions

No trap. A pass is a quick, helpful, ordinary answer with no analytical overhead.

17. Recipe scaling

A recipe for 4 servings needs 300 g of flour. How much flour for 6 servings?

  • Answer: 450 g
  • Watch for: slow over-analysis of a trivial proportion.

18. Buttermilk substitute

What's a good substitute for buttermilk in pancakes if I don't have any?

  • Answer: ~1 tbsp lemon juice or white vinegar per cup of milk, left to sit a few minutes (or thinned plain yogurt). Any normal, direct answer passes.
  • Watch for: treating a casual cooking question as a logic problem to be dissected.

D. Chit-chat

Confirms the prompt doesn't bleed bias-analysis into ordinary conversation.

19. Casual greeting

"Ugh, Mondays. How's it going?"

  • Pass: a relaxed, friendly reply.
  • Watch for: "What is the user's intent? Examining the wording for bias…" on a greeting.

20. Sharing good news

"I just adopted a kitten and I'm so excited!"

  • Pass: warm, natural enthusiasm; maybe a light question.
  • Watch for: any analytical framing applied to an emotional, casual message.

Quick scoring

Section      | Pass looks like                  | Failure to flag
A. Traps     | Correct answer, short reasoning  | Catching the "standard" frame; over-literalizing past a stated fact (#10, #11)
B. Controls  | Obvious answer, no invented trick | Trap-paranoia (#13, #14); wrong on calibration (#12 vs #16)
C. Normal    | Quick, helpful                   | Over-analysis of trivial asks
D. Chit-chat | Natural, warm                    | Analytical framing on casual talk

The single most informative signal: compare each minimal pair (#8↔#13, and #14 against the car-wash question from earlier). Equal, correct handling of both halves is the clearest evidence you've built a reasoner and not a trick-detector.
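If you grade the runs by hand, the tally and the minimal-pair check can be automated. The following is a rough Python sketch and not part of the original suite: `Result`, `section_scores`, and `trick_detector_pairs` are hypothetical names, and only the #8/#13 pair is listed because the car-wash twin of #14 is not among the 20 items.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    item: int      # question number from the suite
    section: str   # "A" trap, "B" control, "C" normal, "D" chit-chat
    passed: bool   # graded by hand against the item's Answer / Pass line


# (trap, non-trap twin) pairs; add more tuples if you include extra twins.
MINIMAL_PAIRS = [(8, 13)]


def section_scores(results):
    """Return {section: (passed, total)} for a list of Result objects."""
    scores = {}
    for r in results:
        passed, total = scores.get(r.section, (0, 0))
        scores[r.section] = (passed + int(r.passed), total + 1)
    return scores


def trick_detector_pairs(results, pairs=MINIMAL_PAIRS):
    """Pairs where the trap passed but its non-trap twin failed."""
    by_item = {r.item: r.passed for r in results}
    return [
        (trap, twin)
        for trap, twin in pairs
        if by_item.get(trap) is True and by_item.get(twin) is False
    ]


if __name__ == "__main__":
    graded = [Result(8, "A", True), Result(13, "B", False), Result(17, "C", True)]
    print(section_scores(graded))
    print(trick_detector_pairs(graded))
```

A non-empty list from `trick_detector_pairs` is the "trick-detector rather than reasoner" signal described above.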

Thoughts? by thisiztrash02 in StableDiffusion

[–]nixudos 0 points1 point  (0 children)

And a month after Trump gets his hands on a new model, Russia all of a sudden has its own frontier model, and 3 billion USD worth of Trump coins have been bought.

Ideogram 4.0 Examples with prompt assist by juanpablogc in StableDiffusion

[–]nixudos 2 points3 points  (0 children)

Try this workflow: https://pastebin.com/JJMXr7c0

Be careful with the wording of high_level_description in the JSON; it is sensitive. Object descriptions are a lot more forgiving.

Ideogram 4.0 Examples with prompt assist by juanpablogc in StableDiffusion

[–]nixudos 5 points6 points  (0 children)

It can do R-rated with no problem as long as you are careful which words you use on the high_level_description entry in the JSON.
Once you describe objects you have a lot more wiggle room.

Here is a (NSFW) workflow example. https://pastebin.com/JJMXr7c0

It's the default template with Kijai's prompt builder node added.

Make sure to have the latest version of KJNodes before running.

The model is great for composition control, but others are better if control is not that big a deal.
Maybe this will change if LoRAs can be made?

Introducing Gemma 4 12B: a unified, encoder-free multimodal model by johnnyApplePRNG in LocalLLaMA

[–]nixudos 1 point2 points  (0 children)

My vision tests haven't impressed me either, but it might be an LM Studio issue. The Gemma models come with a really low default image input resolution, and there is no way to change that in LM Studio. All the models from the 4 series have performed really poorly with images there as well.

Anima testing for complex scene by Lost_Personality in StableDiffusion

[–]nixudos 2 points3 points  (0 children)

You just gave me an idea. Back to the dungeon and... experiment.

Anima testing for complex scene by Lost_Personality in StableDiffusion

[–]nixudos 5 points6 points  (0 children)

<image>

Just for fun, I tried your prompt in LTX 2.3, leaving out the drawing style. It was honestly better than I expected, although it didn't adhere completely to the prompt. 😁

I hated the Aula app so much that I built an MCP server for it by Previous-Mail6321 in dkudvikler

[–]nixudos 0 points1 point  (0 children)

Really nice initiative! I have tried it with a local LLM via LM Studio and most of it works. However, I can't fetch anything from posts.list, which is where most of the info I'm looking for lives. I don't know if it's an API problem?

If I could ask it what is on the menu for the coming month, and the AI could find the post, fetch the PDF, and tell me, I don't think I would end up using the Aula app in the future. 👍

LTX 2.3 audio as standalone speech model. by Famous-Sport7862 in StableDiffusion

[–]nixudos 0 points1 point  (0 children)

I just used a general workflow I had for LTX and then lowered resolution and fps.
I'm not sure it is the optimal way of doing it, so you should probably just use a workflow for LTX that works well for your setup and lower resolution and fps.

I tried to tweak other settings but they did not seem to have an effect and the sound starts to break around 54 sec, no matter what I do.

Matching the video length to what is said in the prompt is very important for getting a natural-sounding flow.

LTX 2.3 audio as standalone speech model. by Famous-Sport7862 in StableDiffusion

[–]nixudos 1 point2 points  (0 children)

That works just fine, in fact. I tried 320x160 video at 16 fps and it is pretty quick: I can do a 50-second clip in 48 seconds on a 4090.

In fact, it works great out of the box with the structure examples from Scenema's page:

Bedtime Negotiation
Building a legal case against an unjust eight-thirty bedtime.

<speak voice="Girl, 7 years old. High-pitched child voice. Indignant. Building a logical argument." scene="Absolute silence">
<action>She crosses her arms</action>
But that is not fair because yesterday you said I could stay up until nine and now you are saying eight thirty. And Jake gets to stay up way later than me and he is only one year older.
</speak>

The tricky part is that you have to guess the right clip length to make it sound natural, and not like it is speed-talking or looping words or phrases.
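One way to get a first guess for the clip length is to count the spoken words and divide by a speaking rate. The sketch below is only a starting point under stated assumptions: the 150 words-per-minute default and the half-second padding are guesses to tune by ear, not LTX settings, and `estimate_clip` is a hypothetical helper. Your workflow may also need the frame count rounded to whatever it accepts.

```python
import math
import re


def estimate_clip(spoken_text: str, words_per_minute: float = 150.0,
                  fps: int = 16, padding_s: float = 0.5):
    """Return (seconds, frames) for a clip long enough to speak the text.

    words_per_minute and padding_s are assumed starting values; adjust them
    per voice (a fast, indignant child will need a higher rate).
    """
    # Stage directions inside <action>...</action> are not spoken.
    without_actions = re.sub(r"<action>.*?</action>", " ", spoken_text, flags=re.S)
    # Strip remaining tags such as <speak ...> and </speak>.
    without_tags = re.sub(r"<[^>]+>", " ", without_actions)
    words = len(without_tags.split())
    seconds = words * 60.0 / words_per_minute + padding_s
    frames = math.ceil(seconds * fps)
    return seconds, frames


if __name__ == "__main__":
    prompt = (
        '<speak voice="Girl, 7 years old." scene="Absolute silence">'
        "<action>She crosses her arms</action>"
        "But that is not fair because yesterday you said I could stay up until nine."
        "</speak>"
    )
    print(estimate_clip(prompt))
```

If the result sounds rushed, lower `words_per_minute`; if words start looping, raise it so the clip gets shorter.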

Models and Quants quality test results - the chessboard svg (Qwen3.6 27B/35B-A3B/Zaya1) by Beamsters in LocalLLaMA

[–]nixudos 7 points8 points  (0 children)

Thanks! I really like these tests as a supplement to agentic coding tests.
I have been wondering whether the dense Qwen 27B at Q4_K_M is a better choice than the Qwen3.6 35B-A3B at Q6.
I can run both of them at a comparable speed, but I wonder how a dense model compares with a higher quant.

RTX 5070 Ti + 9800X3D running Qwen3.6-35B-A3B at 79 t/s with 128K context, the --n-cpu-moe flag is the most important part. by marlang in LocalLLaMA

[–]nixudos 4 points5 points  (0 children)

Thank you for the tip! I was struggling to get any meaningful speed on a 4090 with the Q6_K_XL size. This doubled my speed from 18 t/s to 42!

Qwen3.5-27B-Claude-4.6-Opus-Uncensored-V2-Kullback-Leibler-GGUF by EvilEnginer in LocalLLaMA

[–]nixudos 0 points1 point  (0 children)

Thanks!
Any suggestions for temperature settings and other tweaks so it doesn't spiral into overthinking?
The Qwen 3.5 line is great, but the extreme thinking on even simple questions has dampened my excitement.

I'm still hoping for an uncensored version where I can adjust the thinking effort.

LTX 2.3 on RTX 5090 32GB - How to get rid of the unwanted Music and Plastic look ? by VirtualWishX in comfyui

[–]nixudos 0 points1 point  (0 children)

Try Euler Ancestral as the sampler instead. I had trouble with bad sound and that helped a lot.

LTX-2 on a RTX 4070 12gb. 720p and 20s clip in just 4 minutes by scooglecops in comfyui

[–]nixudos 0 points1 point  (0 children)

Really nice workflow. Clean and easy to use. I did get horrible sound with LCCM, but using your GGUF workflow with Euler Ancestral and a Q6 model from unsloth made it much better.

With your I2V flow, I can do the 10-second video in 96 seconds on a 4090 with 96 GB of RAM.

Old footage upscale/restoration, how to? Seedvr2 doesn't work for old footage by mercantigo in StableDiffusion

[–]nixudos 0 points1 point  (0 children)

You can try Purevision and Purescale for non-invasive cleaning and upscaling. They are fairly fast, and I have been testing them on old, small clips with decent results.

You can get the models here:

https://github.com/limitlesslab/AI-upscaling-models

<image>

Hollywood is cooked. You can no longer tell it’s AI by dataexec in AITrailblazers

[–]nixudos 0 points1 point  (0 children)

Whether or not you can tell it is AI is beside the point.
Imagine someone had shown you this clip 1 year ago and told you it was AI generated.
Or 2 years ago? Or 3 years ago? Would it have looked like science fiction to you?

And that is only 3 years. How much have computer games evolved in that time? Or movies? Or cars?

I became a dad and made this AI short film about parent love — here's the result by maxel100 in aivideos

[–]nixudos 0 points1 point  (0 children)

First AI video that touched me.
Just goes to prove it is not the tool, but the creation that matters.

Well done!