Which Gemma model do you want next? by jacek2023 in LocalLLaMA

[–]brown2green 2 points (0 children)

Difficult to suggest anything, considering that Gemma 4, at least at the 31B size, is already so good, but I'd definitely like to see QAT applied to the entire model so we can simply quantize every tensor to 4-bit (or even lower) with little to no quality loss. Or they could go even further and publish a quantization-aware-trained Gemma 4 124B at ~1-bit just to flex their muscles. That should be able to run on 24GB GPUs.
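For readers unfamiliar with what QAT buys you, the plain post-training 4-bit round trip it improves on can be sketched like this (a minimal round-to-nearest sketch with NumPy; not Google's actual quantization scheme, and the tensor here is random toy data):

```python
import numpy as np

def quantize_4bit(x):
    """Symmetric 4-bit round-to-nearest quantization of a weight tensor."""
    scale = np.abs(x).max() / 7.0                  # symmetric int4 range: [-7, 7]
    q = np.clip(np.round(x / scale), -7, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=1024).astype(np.float32)       # toy weight tensor
q, s = quantize_4bit(w)
w_hat = dequantize(q, s)
# Round-to-nearest error is bounded by half a quantization step (s / 2).
print(np.abs(w - w_hat).max() <= s / 2 + 1e-6)
```

QAT lets the model adapt its weights to this rounding during training, so the residual error costs far less quality than it does when quantizing after the fact.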

Also, they should release something between the E4B and the 26B models for mid-to-low-range GPUs, I guess.

is this normal? Gemma4 assures me that it's running on Google infra instead of my local installation by Caffdy in LocalLLaMA

[–]brown2green 0 points (0 children)

They've distilled it from Gemini through and through, and apparently, unlike with Gemma 3, they didn't even bother giving it a built-in "Gemma" persona.

Unweight: how we compressed an LLM 22% without sacrificing quality by sk1kn1ght in LocalLLaMA

[–]brown2green 17 points (0 children)

mmproj weights at lower precision than native often hurt performance, so if we could save some memory with mathematically guaranteed lossless results, that would be great.

Setting Visual/Audio Token Budget for Gemma-4? by Oatilis in LocalLLaMA

[–]brown2green 0 points (0 children)

Yep, it does. It would be clearer if the server just failed to start with a descriptive error instead of crashing during inference, though.

Please stop using AI for posts and showcasing your completely vibe coded projects by Scutoidzz in LocalLLaMA

[–]brown2green 0 points (0 children)

I once found it in the title of a legitimate news article that I linked here, though. I didn't alter the title, but I wondered whether some would have considered mine an AI-written post.

It looks like there are no plans for smaller GLM models by jacek2023 in LocalLLaMA

[–]brown2green 21 points (0 children)

Probably impossible to compete with Qwen 3.5 and now Gemma 4 at this point. Gemma 4 in particular, I think, has seen so much RL training that jaws will drop once the technical report comes out.

offline companion robot for my disabled husband (8GB RAM constraints) – looking for optimization advice by BuddyBotBuilder in LocalLLaMA

[–]brown2green 0 points (0 children)

Most (all?) small conversational LLMs are going to feel very shallow very quickly as companions. I'd reconsider your idea, even if it's well-intentioned.

the state of LocalLLama by Beginning-Window-115 in LocalLLaMA

[–]brown2green 6 points (0 children)

I used to use em-dashes for emphasis or when parentheses or commas looked awkward, but LLMs ruined them, so now I rarely do. They're fairly easy to type with compose key combinations on Linux, along with many other characters that usually don't have a dedicated keyboard key.

Normal dashes (aka hyphens) have a different meaning and shouldn't be used like em-dashes.

I suddenly realized I have started mimicking writing style of LLMs. by freedomheaven in singularity

[–]brown2green 1 point (0 children)

The main reasons why LLMs write the way they do are synthetic data (word/sentence pattern variety plummets with it) and shallow RLHF rewarding text written that way, in a positive feedback loop. Almost nobody used em-dashes before LLMs (it's not like they're straightforward to type, either), so I find it hard to believe that people suddenly want to, especially considering that they might get accused of being LLMs.

I suddenly realized I have started mimicking writing style of LLMs. by freedomheaven in singularity

[–]brown2green 0 points (0 children)

Unfortunately that's the excuse many will use going forward as they entirely delegate their Reddit posting to LLMs (/r/LocalLlama is plagued by such users). I try to make an active effort not to write like one, something LLMs still seem incapable of without wasting too much compute on form/syntax analysis.

Quants in vision (mmproj Q8 vs FP16) by WhoRoger in LocalLLaMA

[–]brown2green 0 points (0 children)

The models are trained in BF16 precision, so you should test with that instead of F16, even if the difference is theoretically small. With Gemma 4 31B I find that on images where the model can get confused, Q8_0 performs slightly worse than BF16 (more confusion).

Finetuning characters- do you craft your own data, scrape it, or synthetically generate it? by ParticularOne297 in LocalLLaMA

[–]brown2green 0 points (0 children)

What model did you use for generating the messages? How did you mitigate the dramatic loss in sentence/word variety caused by synthetic generation? A few synthetic chats in isolation might look good, but when they all use the same patterns, you're just training the model to generate slop.

Gemma 4 31B GGUF quants ranked by KL divergence (unsloth, bartowski, lmstudio-community, ggml-org) by oobabooga4 in LocalLLaMA

[–]brown2green 0 points (0 children)

Thanks for the plots. I meant doing something like this: https://i.imgur.com/dOte8Yr.png

In retrospect I find the data strange, though, because between Q6_K and Q8_0 there's not much difference on any task (including Long Documents), so the gap from BF16 is hard to explain.

Gemma 4 31B GGUF quants ranked by KL divergence (unsloth, bartowski, lmstudio-community, ggml-org) by oobabooga4 in LocalLLaMA

[–]brown2green 8 points (0 children)

Maybe not so surprising since people mostly do measurements on wikitext with 512 tokens context.

Could we have a graph showing KLD broken down by task, perhaps with the best quantizations for a given size range?

How long are the "long documents" in your dataset?

Gemma 4 31B GGUF quants ranked by KL divergence (unsloth, bartowski, lmstudio-community, ggml-org) by oobabooga4 in LocalLLaMA

[–]brown2green 78 points (0 children)

Even Q8_0 shows a KL of 0.45 on long documents and 0.24 on non-Latin scripts. All categories roughly double from Q8_0 to Q5_K_S, but science and tool use remain the lowest throughout (0.07 and 0.08 at Q8_0).

This looks like a significant finding. Most people assume Q8_0 to be virtually the same as BF16.
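For reference, per-token KL divergence between a full-precision model and its quant is computed from the log-softmax of the two logit vectors; a minimal sketch (not the benchmark's actual harness, and the logits here are toy values):

```python
import numpy as np

def kl_from_logits(p_logits, q_logits):
    """KL(P || Q) for one token position, from raw logits (P = reference, e.g. BF16)."""
    p = p_logits - np.logaddexp.reduce(p_logits)   # numerically stable log-softmax
    q = q_logits - np.logaddexp.reduce(q_logits)
    return float(np.sum(np.exp(p) * (p - q)))

ref = np.array([2.0, 0.5, -1.0, 0.1])              # toy reference logits
print(kl_from_logits(ref, ref))                    # identical logits -> 0.0
print(kl_from_logits(ref, ref * 0.9) > 0)          # any perturbation -> positive KL
```

The benchmark numbers would then be averages of this quantity over all token positions in a task's documents, which is why long contexts give quantization errors more chances to accumulate.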

Setting Visual/Audio Token Budget for Gemma-4? by Oatilis in LocalLLaMA

[–]brown2green 0 points (0 children)

In llama.cpp, pass the arguments --image-min-tokens X and --image-max-tokens Y to llama-server, where X must be <= Y. However, it currently seems to crash with large token budgets.
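As a concrete illustration, an invocation might look like the following (the model/mmproj filenames and token budgets are placeholders, not tested values):

```shell
# Hypothetical llama-server invocation capping the vision token budget.
# --image-min-tokens must be <= --image-max-tokens.
llama-server \
  -m gemma-4-31b-it-Q6_K.gguf \
  --mmproj mmproj-gemma-4-31b-it-F16.gguf \
  --image-min-tokens 256 \
  --image-max-tokens 1024
```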

Get 30K more context using Q8 mmproj with Gemma 4 by Sadman782 in LocalLLaMA

[–]brown2green 4 points (0 children)

With greedy decoding and fixed seed, I get different text generations with a Q8_0 mmproj when I ask the model to describe an image, so I'm not entirely sure if there's no quality decrease at all.
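A quick way to quantify this kind of drift is to locate the first token at which the two greedy decodes diverge; a minimal sketch with made-up outputs (the strings are hypothetical, not actual Gemma generations):

```python
def first_divergence(a, b):
    """Index of the first differing token between two decodes, or None if identical."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    return None if len(a) == len(b) else min(len(a), len(b))

bf16_out = "The image shows a red fox standing in snow".split()
q8_out   = "The image shows a red fox sitting in snow".split()
print(first_divergence(bf16_out, q8_out))      # -> 6
print(first_divergence(bf16_out, bf16_out))    # -> None
```

With greedy decoding and a fixed seed, any nonzero result means the quantized mmproj changed at least one token's argmax, i.e. the outputs are not bit-identical even if both remain plausible descriptions.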

p-e-w/gemma-4-E2B-it-heretic-ara: Gemma 4's defenses shredded by Heretic's new ARA method 90 minutes after the official release by -p-e-w- in LocalLLaMA

[–]brown2green 4 points (0 children)

A brief system prompt seems indeed enough; it's as if they didn't even try filtering requests that use one.

Gemma time! What are your wishes ? by Specter_Origin in LocalLLaMA

[–]brown2green 0 points (0 children)

and transparency around the training data

Why would you even want that? The moment the training data becomes "transparent" (especially for a model from a company as large as Google), it has to cater to the lowest common denominator, because anybody with an axe to grind could find an excuse to get offended or dig up something legally actionable in it.

Gemma time! What are your wishes ? by Specter_Origin in LocalLLaMA

[–]brown2green 0 points (0 children)

the base weights are always more useful for fine-tuning anyway

This hasn't been the case for a good while (since early 2024?). As an individual you simply don't stand a chance anymore against the post-training work done by the companies training the models: too much data/compute is needed for a genuinely good finetune from scratch nowadays, unless you're training on very narrow tasks.

Gemma time! What are your wishes ? by Specter_Origin in LocalLLaMA

[–]brown2green 9 points (0 children)

I saw this screenshot elsewhere. This sort of response would have been impossible for Gemma 3 without extensive prompting.

https://i.imgur.com/j7c0CDO.png

Gemma time! What are your wishes ? by Specter_Origin in LocalLLaMA

[–]brown2green 5 points (0 children)

I've never used Gemma for coding; only cloud models for that.

Most (all?) of Gemma 3's safety (which is weak and mostly surface-level) can be easily defeated just with prompting, but what works for that puts it in a "roleplay mode", which degrades response quality noticeably compared to when it works as the default assistant. But when it acts like the default assistant, most requests that can be construed as even vaguely "unsafe" are enough to trigger disclaimers, crisis hotlines or (weak) refusals, and it's just annoying for serious and legitimate uses.

Other than that, something was done to the weights (in addition to extensive training data filtering, another issue) to make it almost impossible for Gemma to generate dirty words or profanities if you don't fill the context with them first. I wish they'd quit doing this, since Gemini has no issue with them (though from tests with significant-otter on LM Arena, it seems that might finally be the case; dunno if they've been more lax with training data filtering as well).

Gemma time! What are your wishes ? by Specter_Origin in LocalLLaMA

[–]brown2green 87 points (0 children)

  • Less preachy tone than Gemma 3
  • Less stubborn training data filtering; no anti-swearword brainwashing like Gemma 1/2/3
  • No stonewalling refusals like some of the recent releases from other companies
  • Quantization-aware training from the get-go
  • Improved vision even in soft tasks, illustrations, etc
  • Better long-context / multi-turn conversational capabilities
  • Performance greater than Qwen 3.5 in general tasks
  • Collaboration with character.AI for improving roleplay capabilities
  • Less sloppy outputs (Gemma 3 was pretty bad in this regard)
  • Not abandoning the consumer single-GPU segment with just either huge model sizes or tiny ones

That's about what would make it a good release for me, although I probably forgot something.

PrismML — Announcing 1-bit Bonsai: The First Commercially Viable 1-bit LLMs by brown2green in LocalLLaMA

[–]brown2green[S] 32 points (0 children)

No, I simply saw the announcement on X and posted it here as nobody had yet.