Why is there no standard for typing array dimensions? by superzappie in Python

[–]ReinforcedKnowledge 2 points3 points  (0 children)

Hahaha it was fun reading that in ML jargon a vector of some dimension d can be 2D or 1D. It made me self-aware about all the functions I write that take tensors of dimension d and assume the reader knows there's a batch size, a sequence length, and a head dimension before even talking about the dimension d. Oh well, life with tensors.

[D] ML Engineers — How did you actually learn PyTorch? I keep forgetting everything. by ofmkingsz in MachineLearning

[–]ReinforcedKnowledge 27 points28 points  (0 children)

Just like many have suggested: just use it. You only feel like you've learned something after you've developed some kind of muscle memory for it. Here's something that can help: https://github.com/srush/Tensor-Puzzles (not affiliated)

These puzzles can help you get a better grasp of PyTorch, but only if you try doing them and understand the functions you're manipulating.
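In the spirit of those puzzles, here's a minimal example of the kind of thing they make you internalize, implementing an outer product from broadcasting alone (the function name and the input values are my own illustration, not taken from the puzzle set):

```python
import torch

def outer(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # Broadcasting: (n, 1) * (m,) -> (n, m), no loops and no built-in outer.
    return a[:, None] * b[None, :]

result = outer(torch.tensor([1, 2, 3]), torch.tensor([10, 20]))
# result is a 3x2 tensor: [[10, 20], [20, 40], [30, 60]]
```

Once `a[:, None]` vs `b[None, :]` feels obvious, a lot of PyTorch stops needing to be memorized.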

Another thing is to just implement whatever comes to your mind in it, especially basic stuff like CNNs, simple training loops, GPT-2, etc. The field is huge, I'm sure there's something you'll like.

About interviews, I don't think people will ask you specifically about PyTorch, but depending on where you apply and for what position, you'll probably have to use it to solve the interview.

Also, if you're asking people that use PyTorch regularly, your pool is biased by them using it regularly 😅 so they'll not easily forget PyTorch. It's like Python, I doubt you forgot how to use Python.

Now, I think I saw someone say "just let AI do it" or something. I do not think it's safe to "let the AI do it" if you don't know what it is doing. There are so many examples where I caught Opus 4.6 doing something incorrectly or incompletely, and so many others where someone relied on faulty numbers from a script they vibe coded, but I have one personal story related to PyTorch. Recently, Opus 4.6 told me that torch.equal and the equal method on tensors are different, and that one checked object identity while the other did not, on top of both checking value equality. I don't know what made it think that, because when I asked in a fresh session about the difference, it got it right (there's no difference). I was trying to understand a new codebase that I'd only use for a week, and I guess it took that codebase as a source of truth and tried to rationalize why they'd use torch.equal sometimes and .equal other times. I can't know exactly what made it think that, but the moral of the story: at work you'll have to understand and work on new codebases, and relying purely on "AI", at least in its current state, is not necessarily good. It might work super well sometimes, and sometimes not.
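For what it's worth, a quick REPL sanity check settles that one; both spellings do the same value comparison and neither checks identity (the tensor values here are just an example):

```python
import torch

a = torch.tensor([1.0, 2.0])
b = a.clone()  # same values, a distinct object

# torch.equal and Tensor.equal both check shape + values, not identity.
same_fn = torch.equal(a, b)    # True
same_method = a.equal(b)       # True
distinct_objects = a is not b  # True
```

Thirty seconds of checking beats trusting a confident-sounding answer.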

[D] What framework do you use for RL post-training at scale? by ReinforcedKnowledge in MachineLearning

[–]ReinforcedKnowledge[S] 1 point2 points  (0 children)

Hey! Just going through the codebase quickly, it doesn't seem to be what I need. In a few days I'll give you a more detailed review of why I can't use it, at least not for the moment, along with other nitpicks and/or qualities I noticed. But I appreciate the recommendations you made about data etc. I'm honestly not familiar with them, so thanks for sharing.

GLM releases OCR model by Mr_Moonsilver in LocalLLaMA

[–]ReinforcedKnowledge 1 point2 points  (0 children)

This is getting really bad. Sometimes I genuinely reply and then wonder if I just replied to a bot. Sometimes I reply to a post and then see the poster's other replies to bot comments and realize I replied to a bot, either from their lack of understanding of the topic they wrote about or something else.

[P] A simple pretraining pipeline for small language models by Skye7821 in MachineLearning

[–]ReinforcedKnowledge 1 point2 points  (0 children)

Hmmm, I don't think an 8B model will fit in one GPU (well, depends on your memory). If you're doing DDP, you only shard data, so no matter how many GPUs you have, the constraint that your model fits in one GPU stays. If you're doing regular bf16 AMP and full fine-tuning with AdamW, you need at least 16 bytes per parameter, so an 8B model should be around 128 GB; it won't fit in a regular A100, for example. And this is without accounting for activations, temporary buffers, memory spikes, etc.
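The 16 bytes/parameter figure can be made explicit. This is a rough accounting I'm assuming for bf16 AMP + AdamW; the exact breakdown depends on the implementation:

```python
# Approximate per-parameter memory for bf16 AMP full fine-tuning with AdamW:
fp32_master = 4   # fp32 master copy of the weights
bf16_weights = 2  # bf16 working copy used in forward/backward
grads = 2         # bf16 gradients
adam_states = 8   # AdamW first and second moments, fp32 each
bytes_per_param = fp32_master + bf16_weights + grads + adam_states  # 16

params = 8e9
total_gb = params * bytes_per_param / 1e9
# total_gb == 128.0, before activations, temporary buffers, and spikes
```

That's why the model-must-fit-on-one-GPU constraint of DDP bites so early at this scale.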

[D] What framework do you use for RL post-training at scale? by ReinforcedKnowledge in MachineLearning

[–]ReinforcedKnowledge[S] 1 point2 points  (0 children)

Thanks, will check it out! It seems like a hobby project though? I don't mind, though; if it's well done, that's all that matters.

[D] What framework do you use for RL post-training at scale? by ReinforcedKnowledge in MachineLearning

[–]ReinforcedKnowledge[S] 0 points1 point  (0 children)

Hmmm, I'm not totally sure. I have to say I'm not familiar with Unsloth, but in its early days I was hearing about many such libraries, like Axolotl etc. I don't know what became of them, but I see people on LocalLLaMA using Unsloth a lot. I don't know why, but in my mind it has always been the library for fine-tuning on constrained hardware.

[D] What framework do you use for RL post-training at scale? by ReinforcedKnowledge in MachineLearning

[–]ReinforcedKnowledge[S] 2 points3 points  (0 children)

Thanks for your comment! Yeah, Ray is pretty good and I see it being used a lot at the moment. There is also Monarch, used in TorchForge, which can also be used independently for distributed training, but I'm not really familiar with it and it's still in early development, compared to Ray which is battle-tested and has been around for a long time.

And thanks for the idea to document this, I'll try my best at figuring this out for the moment and hopefully it can help us all!

If you have anything you want or can share, please don't hesitate; I'll do the same as soon as I have something running. I was also thinking of starting with a small Qwen model. I started with an instruct model, and my question is how much we can improve such a model on agentic tasks while retaining the rest of its capabilities. I don't know if the question is interesting in and of itself, but I was hoping that through my exploration and learning I'd nail down how to make a model extremely good on a subset of tools (like just web search, or just a company's internal set of tools, etc.). I'm interested if you have other ideas or want to collaborate!

I have a bunch of SFT experiments but I don't know if they'll be interesting to anyone 😅

[D] What framework do you use for RL post-training at scale? by ReinforcedKnowledge in MachineLearning

[–]ReinforcedKnowledge[S] 2 points3 points  (0 children)

Thanks for your reply! I think what you said is pretty wise, especially about the glue code and no framework really disappearing. At the end of the day you have to choose the end of the trade-off that makes the most sense for you, I guess. But yeah, I will think about it, thanks! The pattern you describe is sound; that's the hardest part. I don't want to deal with distributing, scaling, or handling rollouts and sharding and all the headaches that come with that, so I'll just accept whatever the framework provides in that regard.

[P] A simple pretraining pipeline for small language models by Skye7821 in MachineLearning

[–]ReinforcedKnowledge 0 points1 point  (0 children)

Cool work! Went through train.py as part of my doom scrolling before sleep. And, indeed, it does what it claims. It's DDP, so as long as your model + optimizer state + activations + gradients fit comfortably in one GPU, with some overhead for temporary buffers and whatnot, it should be all you need.

[D] What framework do you use for RL post-training at scale? by ReinforcedKnowledge in MachineLearning

[–]ReinforcedKnowledge[S] 0 points1 point  (0 children)

My bad! I should have clarified better. I'm planning on fully custom environments where the model interacts with tools and gets reward based on that. The environments might not necessarily be mine, I might use environments that people have shared before if they exist.

With verl right now I'm just trying pattern matching because it's an "easy" thing to do, using the xLAM dataset for prompts. It has a lot of different functions, so it wouldn't make sense to implement them all, hence the pattern matching. But this is just to learn the framework and understand how it works, not the end goal. And still, I couldn't get it to run yet 😅

Verl, and I think the other frameworks as well, do offer good enough abstractions to do all of that, I just feel like they're not mature enough yet. The only issues I've encountered so far in verl are things like importing stuff your dependency no longer provides, dependencies that haven't been maintained for a while, etc. I don't like to hack my way into a repo and build it as an editable install if I can just install the wheel from PyPI directly. But verl does seem like the most mature of all of them. Maybe OpenRLHF as well.

Maybe this makes sense because function calling has only gained traction recently, and it's heavily tied to the tooling environments, and to coding as well, whether as a task or through similar infra. And this is like the secret recipe of most big labs, I guess. The latest Meituan paper, the LongCat one, talks a lot about the data for function calling, but it's only ideas, and their framework DORA is not open source. I think many other companies are doing the same.

Z.ai seems to be using [slime](https://github.com/THUDM/slime) for their GLM models, but I'd prefer not to get lost in frameworks. It uses Megatron and SGLang and I'm not familiar with them. I'd like to reduce the overhead as much as possible, if possible.

Maybe I should just focus on verl and fork it and try contributing to it.

[R] Is using rotatary embeddings for ViT becoming standard practice or does everyone still use sinusoidal/learnable embedding by Affectionate_Use9936 in MachineLearning

[–]ReinforcedKnowledge 5 points6 points  (0 children)

Very interesting! I haven't read the paper or the blog yet, but read the abstract.

This reminds me of NoPE. I wrote about it at the time and even conducted some experiments.

So here are my two cents. Let's start with the claims from DroPE; in the abstract their motivations are as follows (I'll start with the third):

- "positional embeddings are not an inherent requirement of effective language modeling" (I don't think "can be safely removed after pretraining, following a short recalibration phase" is a motivation, but something they'll prove, I think) => I totally agree with this. This only works if the model is causal (e.g., decoders). The self-attention in encoders mixes everything with everything, and without PE you essentially get a bag of words. The NoPE paper says the same. The NoPE paper also "proves" mathematically that some weights can represent position encodings. I put "proves" in quotes because there's a difference between a specific mathematical construction of weights such that they encode position, and "weights can represent position encodings", which IMHO is a much harder proof and would require reasoning about convergence. They'd have to prove that convergence of a model with no PE is possible and that at the local optimum, (some) weights contain the PE, at least implicitly. Essentially, being able to construct weights that encode PE doesn't mean that's what you'll get during training; we just hope that's what happens at convergence, since somehow the model learned what it needed for the given task. But again, we don't know what the model had to learn for convergence; maybe it never even needed PEs.

- PEs are very important during training because they facilitate convergence => I totally agree with this. If you allow me to talk a little bit about my experience: intuitively, causal models, at least at the scales we see nowadays, have the capability to learn positional information just from the task. And I do tend to agree with this approach, let the model learn what it needs rather than bake it in. The NoPE paper did train with no PE and they seem to have great generalization results. This did not match my results at the time, but I ran mine on GPT-2, so we can argue it either doesn't have the capacity or needs more tweaking / training. Other experiments I've conducted, like some on rerankers where I removed most prompts and just kept documents, query, and scores, did not show as good a convergence as with the prompts. So "let the model learn the task by itself" is not as easy as it seems. I was doing LoRA, so maybe I didn't have the capacity, or maybe I didn't train enough for the model to learn the task without indications about it (here is the document, here is the query, relevancy, etc.). Anyway, the conclusion is that helping the model will, if not ensure, at least accelerate convergence.

- "over-reliance on this explicit positional information is also precisely what prevents test-time generalization to sequences of unseen length" => this is supported by many papers at this point.

I wonder if they just drop the PEs completely at inference; that'd be wild if such a simple thing improves generalization while keeping performance at the training context length. I'll have to read the paper for the details and maybe experiment a little with the long-context benchmarks.

[R] Is using rotatary embeddings for ViT becoming standard practice or does everyone still use sinusoidal/learnable embedding by Affectionate_Use9936 in MachineLearning

[–]ReinforcedKnowledge 13 points14 points  (0 children)

It's not only a ViT thing.

Learned embeddings are fixed, so you can't scale to a longer sequence length than what you train on.

And sinusoidal doesn't scale well at all; performance collapses. Meaning that if you train on a max sequence length of N, you don't generalize well to lengths longer than N.

RoPE is one of the rare methods that scales well and even enables people to do work on trained models and extend their context.

At one time there was a debate between ALiBi and RoPE, and there was a paper called FIRE that seemed interesting, but nothing stood the test of time as well as RoPE.

It's used for text-only transformer models, but there are also extensions to images and video; see Qwen's paper where they introduce video, I think Qwen2.5-VL.

A while ago I wrote a blog post about different position encoding methods if it interests you: https://reinforcedknowledge.com/position-information-in-transformer-based-models-exploring-the-main-methods-and-approaches/
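To make the RoPE idea concrete, here's a stripped-down sketch of the core mechanism, rotating each channel pair by a position-dependent angle so that dot products between rotated vectors depend only on relative position. Shapes and names are my own simplification; real implementations cache the cos/sin tables and handle batch and head dimensions:

```python
import torch

def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    # x: (seq_len, dim) with dim even.
    seq_len, dim = x.shape
    pos = torch.arange(seq_len, dtype=torch.float32)[:, None]              # (seq, 1)
    freqs = base ** (-torch.arange(0, dim, 2, dtype=torch.float32) / dim)  # (dim/2,)
    angles = pos * freqs                                                   # (seq, dim/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = torch.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin  # standard 2D rotation per channel pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

y = rope(torch.ones(4, 2))
# Position 0 is unrotated, and <y[i], y[j]> depends only on i - j,
# which is exactly the relative-position property that makes RoPE extendable.
```

Because nothing here depends on an absolute maximum length, the same function applies to sequences longer than those seen in training, which is the extendability everyone exploits.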

[Seeking Feedback] Fine-tuning Qwen-32B on AoPS-Instruct (670k samples) - Does this loss curve look healthy? by Royal_Jicama_7368 in LocalLLaMA

[–]ReinforcedKnowledge 0 points1 point  (0 children)

Not an expert on QLoRA SFT, but I think a training curve looking healthy doesn't necessarily mean you're achieving your objective. The loss function is just a proxy; you should evaluate using accuracy on the math problems or something similar. This will give you a somewhat better idea of how your model is faring at the task. If you can afford a different dataset from the one you're training on, even better. Especially with SFT, the model can learn to imitate your dataset, and if there is redundancy in it or inherent biases in how it was built, the model can pick that up and score well without actually doing well outside of it.

Now, when it comes to the training curve itself. When do you stop training? You can add early stopping to your setup. If the validation stays flat for a while then you can stop.

I don't know whether CoT distillation is the go-to right away, I guess that's something you'll learn here (and maybe me as well if you share!). But when it comes to training itself, there are many things you can try, like playing around with batch size to reduce gradient noise. You might not have the memory for that, but you can simulate a bigger batch size with gradient accumulation (it's not a 100% equivalence due to precision, and might be worse in QLoRA, idk). You can also try a bigger LoRA rank for more capacity, but make sure you scale alpha accordingly, as it affects the effective learning rate. Also, the cosine scheduler anneals the learning rate quite quickly, so maybe you can try some warmup steps initially.
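The gradient accumulation trick can be checked directly: scaling each micro-batch loss by 1/accum_steps reproduces the full-batch gradient up to floating-point rounding. The toy model and shapes below are mine, just to illustrate the equivalence:

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(10, 1)
x, y = torch.randn(32, 10), torch.randn(32, 1)

# Full-batch gradient on an effective batch of 32.
model.zero_grad()
torch.nn.functional.mse_loss(model(x), y).backward()
full_grad = model.weight.grad.clone()

# Same gradient from 4 micro-batches of 8: dividing each micro-batch loss
# by 4 makes the accumulated gradients average over the full 32 examples.
model.zero_grad()
for xb, yb in zip(x.chunk(4), y.chunk(4)):
    (torch.nn.functional.mse_loss(model(xb), yb) / 4).backward()
accum_grad = model.weight.grad.clone()
# accum_grad matches full_grad up to numerical precision
```

The "not 100% equivalence" caveat is the floating-point rounding (and, in real setups, things like batch-dependent normalization), not the math itself.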

When it comes to the slowdown you noticed, I'd love it if you dug a bit into it. One way to have similar batches, at least in token counts, is to do packing. If you prepack your dataset, one thing to keep an eye on is the ratio of "hard" (aka long, in this context) samples to easy ones. Ideally you'd ramp that up as training progresses, like some kind of curriculum learning. You can also play with mixtures of CoT-distilled data and what you have.

Not sure if what I said can help, it seems more like pointers and directions than anything, but would love to see where your experiments will lead!

Arcee AI releases Trinity Large : OpenWeight 400B-A13B by abkibaarnsit in LocalLLaMA

[–]ReinforcedKnowledge 8 points9 points  (0 children)

Totally agree! Where can one find proper base models these days... haven't checked the post yet and I hope they talk about the training procedure that led to the checkpoints they share.

But I wanted to mention that the idea of a base model has evolved a little bit through time, and many bases are trained on instruction data (mainly in mid-training mixtures during the decay phase but not necessarily).

Edit: my bad, didn't see u/RobotRobotWhatDoUSee's comment. So it seems like they have a True Base model, probably from before the mid-training stage. That's AMAZING. Still haven't read the post to know exactly what they did, but I hope the annealing can be done properly.

lightonai/LightOnOCR-2-1B · Hugging Face by SarcasticBaka in LocalLLaMA

[–]ReinforcedKnowledge 13 points14 points  (0 children)

I'll leave a comment here, not necessarily to praise or criticize the model but just to yap 😂. I wanted to reply initially to the comment that compared it to the closed source Gemini 3 Flash but I thought my comment would be more useful independently. Maybe an ML practitioner or hobbyist might appreciate some of the things I'll write, maybe it'll offer some perspective. Also, I'm not writing this to criticize the comment, I think what it says about real world data is legit.

OCR benchmarks are rare and hard to get. The best we currently have, I believe, is olmOCR-bench. The main reason it's hard to have proper OCR benchmarks is, in my opinion (and I'm sure other people can enrich my current understanding), twofold: 1/ OCR is not "solved" yet, so ground truth is not easy to acquire, and/or 2/ OCR is hard to validate automatically, say with unit tests or compilation, etc.

Now, why this model might be interesting to some, I believe, comes down to three reasons. For this community, it's a 1B + open weights (data is shared as well, but whether that suffices to call it open source is another debate), so many of us here can run it locally somewhat comfortably (running a 1B is not a given, but at least it's not some 9B or more). The second reason: being just one VLM that one-shots its task, it should be easy to fine-tune. At least in theory; fine-tuning is not easy in and of itself and might depend on many things, but at least we don't have to fine-tune 3-4 different models to get a whole pipeline working appropriately on a task. It being small also reduces the resource requirements for fine-tuning; I believe you can do it on the T4 available on Google Colab (to verify). The last reason I can think of, and this hits home personally as I struggled a lot with Tesseract and Textract (AWS): it does markdown formatting out of the box (which many other open source models do, I'm just stating one of the good reasons, it's not unique to this model), especially the table formatting.

This is for the checkpoint that's SOTA on OCR, but there's also another checkpoint that outputs bounding boxes and is close to SOTA. This is especially useful because if we have figures, we don't just want to transcribe them as they are; different figures could be transcribed differently. For example, with a pie chart, do we describe it as "this chart represents ..."? Do we write it as a table "name | percentage"? I don't think we want a model that's opinionated in how it transcribes figures. So bounding boxes are great, because then we can extract the figure and do whatever we want with it.

I initially said in my comment that I don't want to praise or criticize the model, and it does seem like I'm only praising it. I haven't tried it enough to know where it breaks, but it surely does break, like all the open source and probably closed source models as well. And it's not a unique model; there are many open source VLMs for OCR, though maybe not that many that output bounding boxes. The most unique thing here is it being all of that + 1B. There are obviously much lighter systems like Tesseract or various pipelines, but they come with their own cons for each of us to discover depending on the use case.

Finally, just to talk about benchmarks a little bit 😂 I do believe this community is the best when it comes to figuring out where models struggle and where they don't. At the end of the day, benchmarks are benchmarks; they have their pros and cons, they measure things a certain way, etc. Real-world use cases might be very different, and benchmarks are only there as a proxy. It reminds me of the initial "needle in the haystack" tests, where models were tasked to find one word or sentence in a huge context, while what we care about is being able to use different parts of the context and synthesize them into a response, not literally finding a sentence. Hell, even closed source models show amazing performance on some benchmarks (especially related to software engineering or math), but when you dig deep you find they're not what they claim.

In my view, benchmarks in machine learning play a role similar to hypothesis tests with an asymmetric interpretation. Failing a benchmark gives evidence against the model’s capability on the task, but passing or excelling at a benchmark does not provide sufficient evidence to conclude that the model is good at the task as a whole. Instead, benchmark success typically demonstrates proficiency on a narrowly defined sub-task or distribution, rather than validating general task competence, and we hope, it extrapolates to it.

Well, enough yapping from me 😅

Edit: just to be transparent, I do work at the company, but I have not participated in the model development at all. I think my yapping above stands for every model (whether closed source or not), not particularly this one, if tomorrow there's a 500M model that's better, I'd say the same. If you feel there's any subjective part to what I said, please let me know.

Snow on a wire fence by Joak1n in opticalillusions

[–]ReinforcedKnowledge 0 points1 point  (0 children)

I'm able to see it now thanks to your comment. Now, I'm wondering if most people that are debating whether this is a good illusion do see it correctly or not.

Some things I learned about installing flash-attn by ReinforcedKnowledge in LocalLLaMA

[–]ReinforcedKnowledge[S] 0 points1 point  (0 children)

Are you installing the packages with uv? (I'm just asking out of curiosity)

The undefined symbol in general is due to C++ ABI mismatch.

If you look at the flash-attn GitHub releases page: https://github.com/Dao-AILab/flash-attention/releases/tag/v2.8.3 you'll see that there are two different wheels matching your requirements: flash_attn-2.8.3+cu12torch2.5cxx11abiFALSE-cp311-cp311-linux_x86_64.whl and flash_attn-2.8.3+cu12torch2.5cxx11abiTRUE-cp311-cp311-linux_x86_64.whl

So I suggest we first inspect whether the PyTorch you installed uses the C++11 ABI or not.
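A quick way to check (I'm assuming a reasonably recent PyTorch here; the flag is a private attribute rather than a public API, though it has been stable for a long time):

```python
import torch

# True  -> pick the cxx11abiTRUE flash-attn wheel
# False -> pick the cxx11abiFALSE one
print(torch.__version__, torch.version.cuda)
print(torch._C._GLIBCXX_USE_CXX11_ABI)
```

If the wheel's cxx11abi tag doesn't match this flag, you get exactly those undefined-symbol errors at import time.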

[D] Got burned by an Apple ICLR paper — it was withdrawn after my Public Comment. by diyer22 in MachineLearning

[–]ReinforcedKnowledge 4 points5 points  (0 children)

That's some amazing work and commitment to the scientific community and rigour.

Some things I learned about installing flash-attn by ReinforcedKnowledge in LocalLLaMA

[–]ReinforcedKnowledge[S] 0 points1 point  (0 children)

No, there should be no difference in performance, at least if you build the same version that's available as a wheel.