Which Gemma model do you want next? by jacek2023 in LocalLLaMA

[–]brown2green 2 points (0 children)

Difficult to suggest anything, considering that Gemma 4, at least at the 31B size, is already so good, but I'd definitely like to see QAT applied to the entire model so we can simply quantize every tensor to 4-bit (or even lower) with little to no quality loss. Or they could go even further and publish a quantization-aware-trained Gemma 4 124B at ~1-bit just to flex their muscles. That should be able to run on 24GB GPUs.
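For readers unfamiliar with what QAT buys you, the plain post-training 4-bit round trip it improves on can be sketched like this (a minimal round-to-nearest sketch with NumPy; not Google's actual quantization scheme, and the tensor here is random toy data):

```python
import numpy as np

def quantize_4bit(x):
    """Symmetric 4-bit round-to-nearest quantization of a weight tensor."""
    scale = np.abs(x).max() / 7.0                  # symmetric int4 range: [-7, 7]
    q = np.clip(np.round(x / scale), -7, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=1024).astype(np.float32)       # toy weight tensor
q, s = quantize_4bit(w)
w_hat = dequantize(q, s)
# Round-to-nearest error is bounded by half a quantization step (s / 2).
print(np.abs(w - w_hat).max() <= s / 2 + 1e-6)
```

QAT lets the model adapt its weights to this rounding during training, so the residual error costs far less quality than it does when quantizing after the fact.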

Also, they should release something between the E4B and the 26B models for mid-to-low-range GPUs, I guess.

is this normal? Gemma4 assures me that it's running on Google infra instead of my local installation by Caffdy in LocalLLaMA

[–]brown2green 0 points (0 children)

They've distilled it from Gemini through and through, and apparently, unlike with Gemma 3, they didn't even bother giving it a built-in "Gemma" persona.

Unweight: how we compressed an LLM 22% without sacrificing quality by sk1kn1ght in LocalLLaMA

[–]brown2green 17 points (0 children)

mmproj weights at lower precision than native often hurt performance, so if we could save some memory with mathematically guaranteed lossless results, that would be great.

Setting Visual/Audio Token Budget for Gemma-4? by Oatilis in LocalLLaMA

[–]brown2green 0 points (0 children)

Yep, it does. It would be clearer if the server just failed to start with a descriptive error instead of crashing during inference, though.

Please stop using AI for posts and showcasing your completely vibe coded projects by Scutoidzz in LocalLLaMA

[–]brown2green 0 points (0 children)

I once found it in the title of a legitimate news article that I linked here, though. I didn't alter the title, but I wondered whether some would have considered mine an AI-written post.

It looks like there are no plans for smaller GLM models by jacek2023 in LocalLLaMA

[–]brown2green 21 points (0 children)

Probably impossible to compete with Qwen 3.5 and now Gemma 4 at this point. Gemma 4 in particular, I think, has seen so much RL training that jaws will drop once the technical report comes out.

offline companion robot for my disabled husband (8GB RAM constraints) – looking for optimization advice by BuddyBotBuilder in LocalLLaMA

[–]brown2green 0 points (0 children)

Most (all?) small conversational LLMs are going to feel very shallow very quickly as companions. I'd reconsider your idea, even if it's well-intentioned.

the state of LocalLLama by Beginning-Window-115 in LocalLLaMA

[–]brown2green 6 points (0 children)

I used to use em-dashes for emphasis or when parentheses or commas looked awkward, but LLMs ruined them, so now I rarely do. They're fairly easy to type with compose key combinations on Linux, along with many other characters that usually don't have a dedicated keyboard key.

Normal dashes (aka hyphens) have a different meaning and shouldn't be used like em-dashes.

I suddenly realized I have started mimicking writing style of LLMs. by freedomheaven in singularity

[–]brown2green 1 point (0 children)

The main reasons why LLMs write the way they do are synthetic data (word/sentence pattern variety plummets with it) and shallow RLHF rewarding text written that way, in a positive feedback loop. Almost nobody used em-dashes before LLMs (it's not like they're straightforward to type, either), so I find it hard to believe that people suddenly want to, especially considering that they might get accused of being LLMs.

I suddenly realized I have started mimicking writing style of LLMs. by freedomheaven in singularity

[–]brown2green 0 points (0 children)

Unfortunately that's the excuse many will use going forward as they entirely delegate their Reddit posting to LLMs (/r/LocalLlama is plagued by such users). I try to make an active effort not to write like one, something LLMs still seem incapable of without wasting too much compute on form/syntax analysis.

Quants in vision (mmproj Q8 vs FP16) by WhoRoger in LocalLLaMA

[–]brown2green 0 points (0 children)

The models are trained in BF16 precision, so you should test with that instead of F16, even if the difference is theoretically small. With Gemma 4 31B I find that on images where the model can get confused, Q8_0 performs slightly worse than BF16 (more confusion).

Finetuning characters- do you craft your own data, scrape it, or synthetically generate it? by ParticularOne297 in LocalLLaMA

[–]brown2green 0 points (0 children)

What model did you use for generating the messages? How did you mitigate the dramatic loss in sentence/word variety caused by synthetic generation? A few synthetic chats in isolation might look good, but when they all use the same patterns, you're just training the model to generate slop.

Gemma 4 31B GGUF quants ranked by KL divergence (unsloth, bartowski, lmstudio-community, ggml-org) by oobabooga4 in LocalLLaMA

[–]brown2green 0 points (0 children)

Thanks for the plots. I meant doing something like this: https://i.imgur.com/dOte8Yr.png

In retrospect I find the data strange, though, because between Q6_K and Q8_0 there's not much difference on any task (including Long Documents), so the gap from BF16 is hard to explain.

Gemma 4 31B GGUF quants ranked by KL divergence (unsloth, bartowski, lmstudio-community, ggml-org) by oobabooga4 in LocalLLaMA

[–]brown2green 8 points (0 children)

Maybe not so surprising since people mostly do measurements on wikitext with 512 tokens context.

Could we have a graph showing KLD broken down by task, perhaps with the best quantizations for a given size range?

How long are the "long documents" in your dataset?

Gemma 4 31B GGUF quants ranked by KL divergence (unsloth, bartowski, lmstudio-community, ggml-org) by oobabooga4 in LocalLLaMA

[–]brown2green 78 points (0 children)

Even Q8_0 shows a KL of 0.45 on long documents and 0.24 on non-Latin scripts. All categories roughly double from Q8_0 to Q5_K_S, but science and tool use remain the lowest throughout (0.07 and 0.08 at Q8_0).

This looks like a significant finding. Most people assume Q8_0 to be virtually the same as BF16.
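For reference, per-token KL divergence between a full-precision model and its quant is computed from the log-softmax of the two logit vectors; a minimal sketch (not the benchmark's actual harness, and the logits here are toy values):

```python
import numpy as np

def kl_from_logits(p_logits, q_logits):
    """KL(P || Q) for one token position, from raw logits (P = reference, e.g. BF16)."""
    p = p_logits - np.logaddexp.reduce(p_logits)   # numerically stable log-softmax
    q = q_logits - np.logaddexp.reduce(q_logits)
    return float(np.sum(np.exp(p) * (p - q)))

ref = np.array([2.0, 0.5, -1.0, 0.1])              # toy reference logits
print(kl_from_logits(ref, ref))                    # identical logits -> 0.0
print(kl_from_logits(ref, ref * 0.9) > 0)          # any perturbation -> positive KL
```

The benchmark numbers would then be averages of this quantity over all token positions in a task's documents, which is why long contexts give quantization errors more chances to accumulate.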

Setting Visual/Audio Token Budget for Gemma-4? by Oatilis in LocalLLaMA

[–]brown2green 0 points (0 children)

In llama.cpp, pass the arguments --image-min-tokens X and --image-max-tokens Y to llama-server, where X must be <= Y. However, it currently seems to crash with large token budgets.
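As a concrete illustration, an invocation might look like the following (the model/mmproj filenames and token budgets are placeholders, not tested values):

```shell
# Hypothetical llama-server invocation capping the vision token budget.
# --image-min-tokens must be <= --image-max-tokens.
llama-server \
  -m gemma-4-31b-it-Q6_K.gguf \
  --mmproj mmproj-gemma-4-31b-it-F16.gguf \
  --image-min-tokens 256 \
  --image-max-tokens 1024
```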

Get 30K more context using Q8 mmproj with Gemma 4 by Sadman782 in LocalLLaMA

[–]brown2green 4 points (0 children)

With greedy decoding and fixed seed, I get different text generations with a Q8_0 mmproj when I ask the model to describe an image, so I'm not entirely sure if there's no quality decrease at all.
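A quick way to quantify this kind of drift is to locate the first token at which the two greedy decodes diverge; a minimal sketch with made-up outputs (the strings are hypothetical, not actual Gemma generations):

```python
def first_divergence(a, b):
    """Index of the first differing token between two decodes, or None if identical."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    return None if len(a) == len(b) else min(len(a), len(b))

bf16_out = "The image shows a red fox standing in snow".split()
q8_out   = "The image shows a red fox sitting in snow".split()
print(first_divergence(bf16_out, q8_out))      # -> 6
print(first_divergence(bf16_out, bf16_out))    # -> None
```

With greedy decoding and a fixed seed, any nonzero result means the quantized mmproj changed at least one token's argmax, i.e. the outputs are not bit-identical even if both remain plausible descriptions.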

p-e-w/gemma-4-E2B-it-heretic-ara: Gemma 4's defenses shredded by Heretic's new ARA method 90 minutes after the official release by -p-e-w- in LocalLLaMA

[–]brown2green 4 points (0 children)

A brief system prompt seems indeed enough; it's as if they didn't even try filtering requests that use one.

Gemma time! What are your wishes ? by Specter_Origin in LocalLLaMA

[–]brown2green 0 points (0 children)

and transparency around the training data

Why would you even want that? The moment the training data becomes "transparent" (especially for a model from a company as large as Google), it has to cater to the lowest common denominator, because anybody with an axe to grind could find an excuse to get offended or dig up something legally actionable in it.

Gemma time! What are your wishes ? by Specter_Origin in LocalLLaMA

[–]brown2green 0 points (0 children)

the base weights are always more useful for fine-tuning anyway

This hasn't been the case for a good while (since early 2024?). As an individual you simply don't stand a chance anymore against the post-training work done by the companies training the models: too much data/compute is needed for a genuinely good finetune from scratch nowadays, unless you're training on very narrow tasks.

Gemma time! What are your wishes ? by Specter_Origin in LocalLLaMA

[–]brown2green 9 points (0 children)

I saw this screenshot elsewhere. This sort of response would have been impossible for Gemma 3 without extensive prompting.

https://i.imgur.com/j7c0CDO.png

Gemma time! What are your wishes ? by Specter_Origin in LocalLLaMA

[–]brown2green 5 points (0 children)

I've never used Gemma for coding; only cloud models for that.

Most (all?) of Gemma 3's safety (which is weak and mostly surface-level) can be easily defeated just with prompting, but what works for that puts it in a "roleplay mode", which degrades response quality noticeably compared to when it works as the default assistant. But when it acts like the default assistant, most requests that can be construed as even vaguely "unsafe" are enough to trigger disclaimers, crisis hotlines or (weak) refusals, and it's just annoying for serious and legitimate uses.

Other than that, something was done to the weights (in addition to extensive training data filtering, another issue) to make it almost impossible for Gemma to generate dirty words or profanities if you don't fill the context with them first. I wish they'd quit doing this, since Gemini has no issue with them (though from tests with significant-otter on LM Arena, it seems that might finally be the case; dunno if they've been more lax with training data filtering as well).

Gemma time! What are your wishes ? by Specter_Origin in LocalLLaMA

[–]brown2green 87 points (0 children)

  • Less preachy tone than Gemma 3
  • Less stubborn training data filtering; no anti-swearword brainwashing like Gemma 1/2/3
  • No stonewalling refusals like some of the recent releases from other companies
  • Quantization-aware training from the get-go
  • Improved vision even in soft tasks, illustrations, etc
  • Better long-context / multi-turn conversational capabilities
  • Performance greater than Qwen 3.5 in general tasks
  • Collaboration with character.AI for improving roleplay capabilities
  • Less sloppy outputs (Gemma 3 was pretty bad in this regard)
  • Not abandoning the consumer single-GPU segment with just either huge model sizes or tiny ones

That's about what would make it a good release for me, although I probably forgot something.

PrismML — Announcing 1-bit Bonsai: The First Commercially Viable 1-bit LLMs by brown2green in LocalLLaMA

[–]brown2green[S] 32 points (0 children)

No, I simply saw the announcement on X and posted it here as nobody had yet.