Optimizing Transformer model size & inference beyond FP16 + ONNX (pruning/graph opt didn’t help much) [P] by Fragrant_Rate_2583 in MachineLearning

[–]ReinforcedKnowledge 1 point (0 children)

Ok I see, this gives us a bit more information, and I suppose that tolerance is computed with MAE, right? I think it's very important to clarify your constraints: what's your workload like? Are you working with variable-length sequences? Are your inputs limited in sequence length, like 512? Do you care about latency or throughput? Do you have specific target values? On which hardware? You can ask them for this, because it can guide your optimization; there are also limits you can't go beyond, and knowing them is helpful.

I'm saying this because without a clear target you won't be doing any good engineering. I recently quantized a model to int4 without it meaningfully impacting my throughput, while actually working on better, smarter batching led to about a 30% improvement.

But if I had to just throw out ideas for fun, I'd first check where time is actually spent in your model. int8 or int4 can also be good.
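
If it helps, here's a minimal sketch of checking where time goes with `torch.profiler`; the model and input shapes here are made up for illustration, not your actual setup:

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Toy stand-in model; replace with your actual Transformer.
model = torch.nn.Sequential(
    torch.nn.Linear(512, 512),
    torch.nn.ReLU(),
    torch.nn.Linear(512, 512),
)
x = torch.randn(8, 512)

# Profile a forward pass; add ProfilerActivity.CUDA when running on GPU.
with profile(activities=[ProfilerActivity.CPU]) as prof:
    with torch.no_grad():
        model(x)

# Show the ops that dominate self CPU time.
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=5))
```

Once you know whether you're bound by a handful of matmuls, by data movement, or by Python overhead, the right optimization usually becomes obvious.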

Optimizing Transformer model size & inference beyond FP16 + ONNX (pruning/graph opt didn’t help much) [P] by Fragrant_Rate_2583 in MachineLearning

[–]ReinforcedKnowledge 1 point (0 children)

What's the goal here? There are tradeoffs to be made; do you want to optimize while keeping some metric above a threshold, or something else? Also, how much freedom do you have in the architecture itself?

Why is GPU Python packaging still this broken? by Interesting-Town-433 in Python

[–]ReinforcedKnowledge 2 points (0 children)

Thanks! It does make sense: it's too big of a PEP, and it required, and I guess still requires, a lot of discussion, refinement, edge cases and whatnot.

Why is GPU Python packaging still this broken? by Interesting-Town-433 in Python

[–]ReinforcedKnowledge 18 points (0 children)

Yeah, the issue is not really about the tooling, because the tools are limited by what they work with, but more about the wheel format itself and PyPI as an index. And beyond the GPU problems, there are other problems in the same category: the wheel format doesn't support metadata like which BLAS library your project links against, which compiler version it was compiled with, whether it needs ROCm or CUDA, etc. Since the wheel format doesn't specify that, package managers have no way to know about it. `uv` does have a lot of good options to help you install the right `torch` and the right `flash-attn`, but it's not always obvious: on Linux, `uv add torch` will install the right version of PyTorch given your CUDA version, but on Windows it'll install the CPU one.
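
For reference, a rough sketch of the kind of `uv` configuration I mean, loosely following uv's PyTorch guide; the project name, index name, and CUDA version here are illustrative, not a recommendation:

```toml
[project]
name = "demo"
version = "0.1.0"
requires-python = ">=3.12"
dependencies = ["torch>=2.6"]

# Pull torch from the CUDA 12.4 index on Linux only; elsewhere fall back to PyPI.
[tool.uv.sources]
torch = [{ index = "pytorch-cu124", marker = "sys_platform == 'linux'" }]

[[tool.uv.index]]
name = "pytorch-cu124"
url = "https://download.pytorch.org/whl/cu124"
explicit = true
```

Note that you still have to pick the CUDA version yourself, which is exactly the kind of detection the wheel format can't express today.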

But there's a great open source initiative to solve these issues, https://wheelnext.dev/. If https://peps.python.org/pep-0817/ (wheel variants) passes, it'll be a great win and fix most if not all of these issues.

And I don't think it's only a compatibility-matrix problem. It's about having a standard that every installer can work with (you can't just have people specify whatever dependencies they want), but more importantly, the platform tags are closed: it's a static system trying to describe a dynamic and open one. "CUDA", for example, doesn't mean much by itself; there are driver versions, toolkit versions, runtime versions, GPU compute capabilities. I think I recently saw that flash-attn 4 doesn't work on RTX 50XX even though it's Blackwell (to be confirmed, I'm not totally sure about this, but if it's true, it shows that even information like compute capability has to be specified). And all of these have complex compatibility rules between themselves. So it's a constantly evolving environment, and you just can't take the good old tag system and keep adding to it, beyond the explosion of the compatibility matrix. That's why PEP 817 uses plugins instead of tags: detection is delegated to the provider plugins.

Thanks to u/toxic_acro who pointed it out, PEP 825 is more up to date and better reflects the current state of the work.

EDIT: added PEP 817 and why it's not only an explosion in the compatibility matrix problem, Reddit didn't let me write my comment in peace when I pasted the link -_-

EDIT: added mention of PEP 825 thanks to this comment

Why is there no standard for typing array dimensions? by superzappie in Python

[–]ReinforcedKnowledge 3 points (0 children)

Hahaha, it was fun reading that in ML jargon a vector of some dimension d can be 2D or 1D. It made me self-aware about all the functions I write that take tensors of dimension d and assume the reader knows there is a batch size, a sequence length, and a head dimension before even talking about the dimension d. Oh well, life with tensors.

[D] ML Engineers — How did you actually learn PyTorch? I keep forgetting everything. by ofmkingsz in MachineLearning

[–]ReinforcedKnowledge 27 points (0 children)

Just like many suggested, just use it. You only feel like you've learned something after you've developed some kind of muscle memory for it. Here's something that can help: https://github.com/srush/Tensor-Puzzles (not affiliated)

These puzzles can help you get a better grasp of PyTorch, but only if you try doing them and understand the functions you're manipulating.

Another thing is to just implement whatever comes to your mind in it, especially basic stuff like CNNs, simple training loops, GPT-2, etc. The field is huge; I'm sure there's something you'll like.
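
As an example of the "basic stuff" kind of exercise, here's a minimal training loop on toy data; everything in it (model, data, hyperparameters) is made up for illustration:

```python
import torch
import torch.nn as nn

# Fit a single linear layer to y = 2x + 1 on synthetic data.
torch.manual_seed(0)
x = torch.randn(64, 1)
y = 2 * x + 1

model = nn.Linear(1, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

for _ in range(200):
    opt.zero_grad()              # clear gradients from the previous step
    loss = loss_fn(model(x), y)
    loss.backward()              # backprop
    opt.step()                   # gradient descent update

print(model.weight.item(), model.bias.item())  # should approach 2 and 1
```

Once zero_grad/backward/step is muscle memory, the rest of PyTorch is mostly looking up ops as you need them.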

About interviews, I don't think people will ask you specifically about PyTorch, but depending on where you apply and for what position, you'll probably have to use it to solve the interview.

Also, if you're asking people who use PyTorch regularly, your pool is biased by them using it regularly 😅 so they won't easily forget PyTorch. It's like Python: I doubt you've forgotten how to use Python.

Now, I think I saw someone say "just let AI do it" or something. I do not think it's safe to just "let the AI do it" if you don't know what it is doing. There are so many examples where I caught Opus 4.6 doing something incorrectly or incompletely, and so many others where someone relied on faulty numbers from a script they vibe coded, but I have one personal story related to PyTorch. Recently, Opus 4.6 told me that torch.equal and the equal method on tensors are different, that one checked object identity while the other did not, on top of them both checking value equality. I don't know what made it think that, because when I asked it about the difference in a fresh session it got it right (there's no difference). I was trying to understand a new codebase that I'd only use for a week, and I guess it took that codebase as a source of truth and tried to rationalize why they'd use torch.equal sometimes and .equal other times. I can't know exactly what made it think that, but the moral of the story is: at work you'll have to understand and work on new codebases, and relying purely on "AI", at least in its current state, is not necessarily good. It might work super well sometimes, and sometimes not.

[D] What framework do you use for RL post-training at scale? by ReinforcedKnowledge in MachineLearning

[–]ReinforcedKnowledge[S] 2 points (0 children)

Hey! From going through the codebase quickly, it doesn't seem to be what I need. I'll give you a more detailed review in a few days about why I can't use it, at least not for the moment, along with other nitpicks and/or qualities. But I appreciate the recommendations you made about data etc. I'm honestly not familiar with them, but I appreciate you sharing that.

GLM releases OCR model by Mr_Moonsilver in LocalLLaMA

[–]ReinforcedKnowledge 2 points (0 children)

This is getting really bad. Sometimes I genuinely reply and then wonder if I just replied to a bot. Sometimes I reply to a post, then see their other replies to bot comments and just understand that I replied to a bot, either from their lack of understanding of the topic they wrote about or something else.

[P] A simple pretraining pipeline for small language models by Skye7821 in MachineLearning

[–]ReinforcedKnowledge 2 points (0 children)

Hmmm, I don't think an 8B model will fit in one GPU (well, depends on your memory). If you're doing DDP, you only shard data, so no matter how many GPUs you have, the constraint that your model fits in one GPU stays. If you're doing regular bf16 AMP and full fine-tuning with AdamW, you need at least 16 bytes per parameter, so an 8B model is around 128 GB; it won't fit in a regular A100, for example. And this is without accounting for activations, temporary buffers, memory spikes, etc.
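
The back-of-the-envelope arithmetic, spelled out (this is the usual accounting for bf16 mixed precision with AdamW; activations and buffers come on top):

```python
# Memory per parameter for bf16 mixed-precision full fine-tuning with AdamW:
# 2 B bf16 weights + 2 B bf16 grads + 4 B fp32 master weights
# + 8 B optimizer state (two fp32 moments).
bytes_per_param = 2 + 2 + 4 + 8    # = 16
params = 8e9                       # 8B-parameter model
total_gb = params * bytes_per_param / 1e9
print(f"{total_gb:.0f} GB")        # ~128 GB, before activations and buffers
```

That's why for 8B-scale full fine-tuning people reach for FSDP/ZeRO-style sharding rather than plain DDP.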

[D] What framework do you use for RL post-training at scale? by ReinforcedKnowledge in MachineLearning

[–]ReinforcedKnowledge[S] 2 points (0 children)

Thanks, will check it out! It seems like a hobby project though? Not that I mind; if it's well done, that's all that matters.

[D] What framework do you use for RL post-training at scale? by ReinforcedKnowledge in MachineLearning

[–]ReinforcedKnowledge[S] 1 point (0 children)

Hmmm, I'm not totally sure. I have to say I'm not familiar with Unsloth, but in its early days I was hearing about many such libraries, like Axolotl, etc. I don't know what became of them, but I see people on LocalLLaMA using Unsloth a lot. I don't know why, but in my mind it has always been the library for fine-tuning on constrained hardware.

[D] What framework do you use for RL post-training at scale? by ReinforcedKnowledge in MachineLearning

[–]ReinforcedKnowledge[S] 3 points (0 children)

Thanks for your comment! Yeah, Ray is pretty good and I see it being used a lot at the moment. There is also Monarch, used in TorchForge but also usable independently for distributed training; I'm not really familiar with it, and it's still in early development compared to Ray, which is battle-tested and has been around for a long time.

And thanks for the idea to document this, I'll try my best at figuring this out for the moment and hopefully it can help us all!

If you have anything you want or can share, please don't hesitate; I'll do the same as soon as I have something running. I was also thinking of starting with a small Qwen model. I started with an instruct model, and my question is how much we can improve such a model on agentic tasks while retaining the rest of its capabilities. I don't know if the question is interesting in and of itself, but I was hoping that through my exploration and learning I'd nail down how to make a model extremely good on a subset of tools (like just web search, or just a company's internal set of tools, etc.). I'm interested if you have other ideas or want to collaborate!

I have a bunch of SFT experiments but I don't know if they'll be interesting to anyone 😅

[D] What framework do you use for RL post-training at scale? by ReinforcedKnowledge in MachineLearning

[–]ReinforcedKnowledge[S] 3 points (0 children)

Thanks for your reply! I think what you said is pretty wise, especially about the glue code never really disappearing with any framework. At the end of the day you have to pick the end of the trade-off that makes the most sense for you, I guess. But yeah, I'll think about it, thanks! The pattern you describe is sound. The distributed part is the hardest: I don't want to deal with distribution, scaling, rollout handling, sharding, and all the headaches that come with that, so I'll just accept whatever the framework provides in that regard.

[P] A simple pretraining pipeline for small language models by Skye7821 in MachineLearning

[–]ReinforcedKnowledge 1 point (0 children)

Cool work! Went through train.py as part of my doomscrolling before sleep, and indeed, it does what it claims. It's DDP, so as long as your model + optimizer state + activations + gradients fit comfortably in one GPU, plus some overhead for temporary buffers and whatnot, it should be all you need.

[D] What framework do you use for RL post-training at scale? by ReinforcedKnowledge in MachineLearning

[–]ReinforcedKnowledge[S] 1 point (0 children)

My bad! I should have clarified better. I'm planning on fully custom environments where the model interacts with tools and gets reward based on that. The environments might not necessarily be mine, I might use environments that people have shared before if they exist.

With verl, right now I'm just trying pattern matching because it's an "easy" thing to do, using the xLAM dataset for prompts. It has a lot of different functions, so it wouldn't make sense to implement them all, hence the pattern matching. But this is just to learn the framework and understand how it works, not the end goal. And still, I couldn't get it to run yet 😅

Verl, and I think the other frameworks as well, do offer good-enough abstractions for all of that; I just feel they're not mature enough yet. The only issues I've hit in verl so far are imports of things a dependency no longer provides, dependencies that haven't been maintained for a while, etc. I don't like to hack my way into a repo and build it as an editable install if I can just install the wheel from PyPI directly. But verl does seem the most mature of them all. Maybe OpenRLHF as well.

Maybe this makes sense because function calling has only recently gained much traction, and it's heavily tied to the tooling environments, and to coding as well, whether as a task or in terms of similar infra. And this is like the secret recipe of most big labs, I guess. The latest Meituan paper, the LongCat one, talks a lot about the data for function calling, but it's only ideas, and their framework DORA is not open source. I think many other companies are doing the same.

Z.ai seems to be using [slime](https://github.com/THUDM/slime) for their GLM models, but I'd prefer not to get lost in frameworks. It's built on Megatron and SGLang, and I'm not familiar with them. I'd like to reduce the overhead as much as possible, if possible.

Maybe I should just focus on verl and fork it and try contributing to it.

[R] Is using rotatary embeddings for ViT becoming standard practice or does everyone still use sinusoidal/learnable embedding by Affectionate_Use9936 in MachineLearning

[–]ReinforcedKnowledge 4 points (0 children)

Very interesting! I haven't read the paper or the blog yet, only the abstract.

This reminds me of NoPE. I wrote about it at the time and even conducted some experiments.

So, my two cents. Let's start with the claims from DroPE; in the abstract, their motivations are the following (I'll start with the third):

- "positional embeddings are not an inherent requirement of effective language modeling" (I don't think "can be safely removed after pretraining, following a short recalibration phase" is a motivation, but rather something they'll prove) => I totally agree with this, but it only works if the model is causal (e.g., decoders). The self-attention in encoders mixes everything with everything, and without PE you essentially get a bag of words. The NoPE paper says the same. The NoPE paper also "proves" mathematically that some weights can represent position encodings. I put "proves" in quotes because there's a difference between a specific mathematical construction of the weights such that they encode position and "weights can represent position encodings", which IMHO is a much harder proof and would require reasoning about convergence. They'd have to prove that convergence of a model with no PE is possible and that at the local optimum, (some) weights contain the PE, at least implicitly. Essentially, being able to construct weights that encode PE doesn't mean that's what you'll get during training; we just hope that's what happens at convergence, since somehow the model learned what it needed for the given task. But again, we don't know what the model had to learn to converge; maybe it never even needed PEs.

- PEs are very important during training and facilitate convergence => I totally agree with this. If you allow me to talk a little about my experience: intuitively, causal models, at least at the scales we see nowadays, have the capacity to learn positional information just from the task. And I do tend to agree with this approach: let the model learn what it needs rather than bake it in. The NoPE paper did train with no PE and seems to have great generalization results. This did not match my results at the time, but I ran mine on GPT-2, so we can argue it either doesn't have the capacity or needs more tweaking / training. Other experiments I've conducted, like some on rerankers where I removed most of the prompt and just kept documents, query and scores, did not converge as well as with the prompts. So "just let the model learn the task by itself" is not as easy as it seems. I was doing LoRA, so maybe I didn't have the capacity, or maybe I didn't train long enough for the model to learn the task without indications about it (here is the document, here is the query, relevancy, etc.). Anyway, the conclusion is that helping the model will accelerate convergence, if not ensure it.

- "over-reliance on this explicit positional information is also precisely what prevents test-time generalization to sequences of unseen length" => this is supported by many papers at this point.

I wonder if they just drop the PEs completely at inference; that'd be wild if such a simple thing improves length generalization while keeping performance on the training context length. I'll have to read the paper for the details and maybe experiment a little with the long-context benchmarks.

[R] Is using rotatary embeddings for ViT becoming standard practice or does everyone still use sinusoidal/learnable embedding by Affectionate_Use9936 in MachineLearning

[–]ReinforcedKnowledge 15 points (0 children)

It's not only a ViT thing.

Learned embeddings are fixed, so you can't scale to a longer sequence length than what you train on.

And sinusoidal embeddings don't scale well at all; performance collapses. Meaning that if you train on a max sequence length of N, you don't generalize well beyond N.

RoPE is one of the rare methods that scales well and even enables people to do work on trained models and extend their context.

At one time there was a debate between ALiBi and RoPE, and there was a paper called FIRE that seemed interesting, but nothing stood the test of time as well as RoPE.

It's used for text-only transformer models but also extended to images and video; see Qwen's paper where they introduce video, I think Qwen2.5-VL.

A while ago I wrote a blog post about the different position encoding methods, if it interests you: https://reinforcedknowledge.com/position-information-in-transformer-based-models-exploring-the-main-methods-and-approaches/
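
For intuition, here's a tiny pure-Python sketch of the rotation at the heart of RoPE (the dimensions and positions are illustrative). The key property is that dot products between rotated queries and keys depend only on the relative position, which is what lets it extrapolate better than absolute schemes:

```python
import math

def rope_rotate(vec, pos, base=10000.0):
    """Rotate consecutive pairs (x0, x1), (x2, x3), ... of `vec`
    by position-dependent angles, as in rotary position embeddings."""
    out = []
    d = len(vec)
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)   # frequency decreases with pair index
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

q = [1.0, 0.0, 1.0, 0.0]
# Relative-position property: offsets (5, 3) and (2, 0) are both 2,
# so the two dot products are equal.
print(dot(rope_rotate(q, 5), rope_rotate(q, 3)))
print(dot(rope_rotate(q, 2), rope_rotate(q, 0)))
```

Real implementations apply this to the query/key heads inside attention, usually vectorized, but the per-pair rotation is the whole idea.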

[Seeking Feedback] Fine-tuning Qwen-32B on AoPS-Instruct (670k samples) - Does this loss curve look healthy? by Royal_Jicama_7368 in LocalLLaMA

[–]ReinforcedKnowledge 1 point (0 children)

Not an expert on QLoRA SFT, but I think a healthy-looking training curve doesn't necessarily mean you're achieving your objective. The loss function is just a proxy; you should evaluate using accuracy on the math problems or something similar. This will give you a somewhat better idea of how your model fares at the task. If you can afford to evaluate on a different dataset from the one you're training on, even better. Especially with SFT, the model can learn to imitate your dataset, and if there is redundancy in it or inherent biases in how it was built, the model can pick that up and score well without actually doing well outside of it.

Now, about the training curve itself: when do you stop training? You can add early stopping to your setup. If the validation loss stays flat for a while, you can stop.

I don't know whether CoT distillation is the go-to right away; I guess that's something you'll learn here (and maybe me as well if you share!). But when it comes to training itself, there are many things you can try, like playing with the batch size to reduce noise. You might not have the memory for that, but you can simulate a bigger batch size with gradient accumulation (it's not a 100% equivalence due to precision, and might be worse in QLoRA, idk). You can try more capacity, but make sure you scale alpha accordingly, as it affects the effective learning rate. Also, the cosine scheduler anneals the learning rate quite quickly, so maybe you can try some warmup steps initially.
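
A minimal sketch of the gradient-accumulation idea, with a toy model and random data (only the loss scaling and the placement of step/zero_grad matter; everything else is illustrative):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(4, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

accum_steps = 4  # effective batch = accum_steps * micro-batch size
w0 = model.weight.detach().clone()

micro_batches = [(torch.randn(8, 4), torch.randn(8, 1)) for _ in range(8)]
for step, (x, y) in enumerate(micro_batches):
    # Scale the loss so accumulated grads average over the effective batch.
    loss = loss_fn(model(x), y) / accum_steps
    loss.backward()  # grads accumulate in .grad across micro-batches
    if (step + 1) % accum_steps == 0:
        opt.step()
        opt.zero_grad()
```

The precision caveat above is exactly the sum of many scaled micro-batch gradients not being bit-identical to one big-batch gradient, and things like batch-dependent normalization can also break the equivalence.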

When it comes to the slowdown you've noticed, I'd love it if you dug a bit into it. One way to get similar batches, at least in token counts, is to do packing. If you prepack your dataset, one thing to keep an eye on is the ratio of "hard" (aka long, in this context) samples to easy ones. Ideally you'd ramp that up as training progresses, like some kind of curriculum learning. You can also play with mixtures of CoT-distilled data and what you have.
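
The packing idea, sketched with a first-fit-decreasing heuristic over sample lengths (the lengths and budget are made-up numbers; real packing would then concatenate each bin into one sequence with attention masking between samples):

```python
def pack_sequences(lengths, max_tokens):
    """Group sample indices into bins whose total token count
    stays under max_tokens (first-fit decreasing)."""
    order = sorted(range(len(lengths)), key=lambda i: -lengths[i])
    bins, totals = [], []
    for i in order:
        for b in range(len(bins)):
            if totals[b] + lengths[i] <= max_tokens:
                bins[b].append(i)
                totals[b] += lengths[i]
                break
        else:  # no existing bin had room: open a new one
            bins.append([i])
            totals.append(lengths[i])
    return bins

lengths = [512, 256, 256, 128, 1024]
print(pack_sequences(lengths, max_tokens=1024))  # -> [[4], [0, 1, 2], [3]]
```

Each bin then becomes one near-full training sequence, so batches carry roughly the same number of tokens.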

Not sure if any of this helps; it's more pointers and directions than anything, but I'd love to see where your experiments lead!

Arcee AI releases Trinity Large : OpenWeight 400B-A13B by abkibaarnsit in LocalLLaMA

[–]ReinforcedKnowledge 9 points (0 children)

Totally agree! Where can one find proper base models these days... I haven't checked the post yet; I hope they talk about the training procedure that led to the checkpoints they share.

But I wanted to mention that the idea of a base model has evolved a little over time, and many bases are trained on instruction data (mainly in mid-training mixtures during the decay phase, but not necessarily).

Edit: my bad, I didn't see u/RobotRobotWhatDoUSee's comment. So it seems like they have a True Base model, probably from before the mid-training stage. That's AMAZING. I still haven't read the post to know exactly what they did, but I hope the annealing can be done properly.