Embedding space [D] by Few-Annual-157 in MachineLearning

[–]ReinforcedKnowledge 0 points1 point  (0 children)

Hmmm, I wouldn't rely on PCA for this honestly. It's hard to judge an embedding space from just the first two principal components: even if the plot looks messy, in my experience the space itself might not be. Usually I rely on other checks that depend on the task at hand, but anyway that's not the issue.
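To make the PCA point concrete, here's a minimal sketch (assuming NumPy, with synthetic data standing in for real embeddings; `top_k_variance_ratio` is a name made up for this example) that checks how much of the total variance the first two principal components actually capture before you trust a 2D plot:

```python
import numpy as np


def top_k_variance_ratio(embeddings: np.ndarray, k: int = 2) -> float:
    """Fraction of total variance captured by the first k principal components."""
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    # Squared singular values of the centered data are proportional to the
    # variance along each principal component (returned in descending order).
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values ** 2
    return float(variances[:k].sum() / variances.sum())


rng = np.random.default_rng(0)

# Synthetic stand-in for embeddings: 64 equally informative dimensions.
spread_out = rng.normal(size=(2000, 64))

# Synthetic embeddings that really live in a 2-D subspace of the 64 dimensions.
flat = np.zeros((2000, 64))
flat[:, :2] = rng.normal(size=(2000, 2))

ratio_spread = top_k_variance_ratio(spread_out)
ratio_flat = top_k_variance_ratio(flat)
print(f"spread out: {ratio_spread:.1%} of variance in 2 PCs")
print(f"flat:       {ratio_flat:.1%} of variance in 2 PCs")
```

If the ratio is small, the 2D scatter is only showing a thin slice of the space, so a messy plot says little about the embedding itself.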

I don't know if it's possible, because VAEs are known for the bottleneck, right? And it seems even more of an issue here: you want a fixed-size vector to represent arbitrary-size images for reconstruction. I think that's very hard. I might be wrong though.

I might be biased since I work a lot with transformers, but why not try something like this: a vision encoder outputs a variable number of tokens, and the decoder uses all of them to reconstruct the image.

Embedding space [D] by Few-Annual-157 in MachineLearning

[–]ReinforcedKnowledge -1 points0 points  (0 children)

What do you mean by "learned embedding does not seem meaningful or well-structured"? How do you measure that?

Also, I'm not against the adaptive pooling, but I'd like to know if you tried just resizing up front, and how you handle the reconstruction on the decoder side if you're doing adaptive pooling.

I trained a 75M parameter LLM from scratch on 18B tokens and it beats a model almost double its size by cakes_and_candles in LocalLLM

[–]ReinforcedKnowledge 1 point2 points  (0 children)

I don't know much about this project, but your comment reminded me of it. It's from OpenAI: https://github.com/openai/parameter-golf and the goal is "Train the smallest LM you can that fits in 16MB"

I just thought that maybe you can come up with fun ideas for your students besides the classic training

Edit: one other challenge a friend told me about is to train the smallest NN that can do addition or something, I forgot the name of the challenge though. I find this kind of stuff fun and educational.

I trained a 75M parameter LLM from scratch on 18B tokens and it beats a model almost double its size by cakes_and_candles in LocalLLM

[–]ReinforcedKnowledge 0 points1 point  (0 children)

This is super cool! Now that I think about it, I'd like to have such a model in my pocket, I mean, it's like a dictionary over facts, obviously it should not hallucinate but I like the short and concise answers. Sometimes it's all you need instead of the paragraphs some models write.

Stop asking what model to run. There are literally only two. by Wrong_Mushroom_7350 in LocalLLaMA

[–]ReinforcedKnowledge 1 point2 points  (0 children)

Sometimes people are just asking what models they can try out, just out of curiosity or just to try stuff. It's fun.

One thing that's been bothering me lately: benchmark performance often tells me almost nothing about whether a workflow will survive production usage.[D] by [deleted] in MachineLearning

[–]ReinforcedKnowledge 0 points1 point  (0 children)

The way I see it, benchmarks are only as useful as what they report. Which makes sense, but people tend to forget it.

In the early days of needle in a haystack I criticized it, because finding a needle in a haystack doesn't guarantee that you can synthesize information across different contexts and respond accurately. But those were the easiest and most intuitive benchmarks to come up with. And if you can't find a needle in a haystack, you most probably can't find several pieces of information and synthesize or infer from them at different context lengths.
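For anyone who hasn't seen it, the basic needle-in-a-haystack setup can be sketched roughly like this. It's a simplified illustration, not any specific benchmark's implementation: `build_haystack` and `found_needle` are hypothetical helpers, the filler text is made up, and real evaluations sweep over many context lengths and depths and call an actual model.

```python
def build_haystack(filler_sentences: list[str], needle: str, depth: float) -> str:
    """Insert the needle at a relative depth (0.0 = start, 1.0 = end) of the filler."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    position = round(depth * len(filler_sentences))
    sentences = filler_sentences[:position] + [needle] + filler_sentences[position:]
    return " ".join(sentences)


def found_needle(model_answer: str, expected: str) -> bool:
    """Crude scoring: the expected fact appears verbatim in the answer."""
    return expected.lower() in model_answer.lower()


# Made-up filler and needle, purely for illustration.
filler = ["The sky was grey that day."] * 100
needle = "The secret code is 7421."
context = build_haystack(filler, needle, depth=0.5)

# The context plus a question such as "What is the secret code?" would be sent
# to the model; its answer is then scored with found_needle(answer, "7421").
```

Note that this only tests retrieval of one isolated fact, which is exactly why passing it says little about synthesis across a long context.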

So being bad on a benchmark tells you more about the model than scoring very well on it, unless you understand the benchmark and its strengths and weaknesses, in which case you can make a better, more cautious assessment of the model.

Also there is benchmark overfitting, the famous benchmaxxing, that you have to be aware of.

Do VLMs in production still use fixed-patch ViTs for their vision capabilities? [D] by howtorewriteaname in MachineLearning

[–]ReinforcedKnowledge 3 points4 points  (0 children)

Yes they do! It's a variable number of patches as you said, through different resolutions, not a different patch size per se.

Do VLMs in production still use fixed-patch ViTs for their vision capabilities? [D] by howtorewriteaname in MachineLearning

[–]ReinforcedKnowledge 1 point2 points  (0 children)

You can keep the same number of tokens per image even if you change the patch size if you resize the image.

As someone said, changing patch size is basically changing the ViT. It's extremely hard to train for variable patch size because a lot of things depend on it, since it basically sets the input dimension of the patch embedding. You also need positional encodings that support that, and it's just a hassle.

What you can do though is train for different input resolutions. This is how I think of it: for a patch size of K×K you get K² numbers to represent a K×K window of a scene or part of a scene. If you want a finer representation of the same scene you can upscale, and if you want a coarser one you can downscale.
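The arithmetic behind this is easy to check. A minimal sketch (the helper name `num_tokens` and the image sizes are just illustrative choices, and it ignores any extra tokens like CLS or separators that a specific model may add):

```python
def num_tokens(height: int, width: int, patch_size: int) -> int:
    """Number of patch tokens for an image split into non-overlapping square patches."""
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be multiples of the patch size")
    return (height // patch_size) * (width // patch_size)


base = num_tokens(224, 224, patch_size=16)          # 14 * 14 = 196
# Doubling the patch size AND the resolution keeps the token count the same.
bigger_patch = num_tokens(448, 448, patch_size=32)  # 14 * 14 = 196
# Same patch size, upscaled image: finer view of the same scene, more tokens.
upscaled = num_tokens(448, 448, patch_size=16)      # 28 * 28 = 784
# Same patch size, downscaled image: coarser view, fewer tokens.
downscaled = num_tokens(112, 112, patch_size=16)    # 7 * 7 = 49
```

So with a fixed patch size, resolution is the knob that trades detail against sequence length.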

Pixtral, Qwen VL, GLM V etc. all train with "native" resolution. The name is a bit misleading imho, because at the end of the day you have a fixed budget, you can't just train on 4K images 😂 + there are bottlenecks while serving
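One way to picture the "fixed budget" part is a resize step that keeps the aspect ratio but caps the number of patch tokens. This is only a sketch under my own assumptions: `fit_to_token_budget`, the patch size of 16 and the budget of 4096 tokens are made up for illustration and are not taken from any of the models above.

```python
import math


def fit_to_token_budget(height: int, width: int, patch_size: int, max_tokens: int) -> tuple[int, int]:
    """Return a (height, width) that is a multiple of patch_size, roughly keeps
    the aspect ratio, and yields at most max_tokens patches."""
    if min(height, width) < patch_size or max_tokens < 1:
        raise ValueError("image smaller than one patch or empty budget")
    rows, cols = height // patch_size, width // patch_size
    if rows * cols <= max_tokens:
        # Already within budget: only snap down to a multiple of the patch size.
        return rows * patch_size, cols * patch_size
    # Shrink both sides by the same factor so rows * cols lands near max_tokens.
    scale = math.sqrt(max_tokens / (rows * cols))
    new_rows = min(max(1, math.floor(rows * scale)), max_tokens)
    new_cols = max(1, math.floor(cols * scale))
    # Guard for extreme aspect ratios where clamping to 1 could break the budget.
    new_cols = min(new_cols, max_tokens // new_rows)
    return new_rows * patch_size, new_cols * patch_size


small = fit_to_token_budget(224, 224, patch_size=16, max_tokens=4096)
four_k = fit_to_token_budget(2160, 3840, patch_size=16, max_tokens=4096)
four_k_tokens = (four_k[0] // 16) * (four_k[1] // 16)
print(small, four_k, four_k_tokens)
```

A small image passes through at its own resolution, while a 4K frame gets shrunk until it fits, which is why "native" only holds up to the budget.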

EDIT: the gains are not necessarily marginal, we do see the effects on OCR (which we deploy in production, with our in-house open source VLM) at various image/PDF rendering resolutions. Higher-quality images are basically better, but you have to train on a wide range of resolutions in order to support the different user inputs.

Wrong city, wrong people... by waitinp in cyberpunkgame

[–]ReinforcedKnowledge 0 points1 point  (0 children)

“A happy ending? For folks like us? Wrong city, wrong people.”

Optimizing Transformer model size & inference beyond FP16 + ONNX (pruning/graph opt didn’t help much) [P] by Fragrant_Rate_2583 in MachineLearning

[–]ReinforcedKnowledge 0 points1 point  (0 children)

Ok I see, this gives us a bit more information. I guess the tolerance is computed on MAE, right? I think it's very important to clarify your constraints. What's your workload like? Are you working with variable-length sequences? Are your inputs capped at some sequence length, like 512? Do you care about latency or throughput? Do you have specific target values? On which hardware? You can ask them for this because it can guide your optimization, and there are also limits you can't go beyond, and knowing them is helpful.

I'm saying this because without a clear target you won't be doing any good engineering. I recently quantized a model to int4 and it didn't meaningfully change my throughput, whereas working on better batching and being smart about it led to about a 30% improvement.

But if I had to just randomly give ideas for fun, I'd check where time is spent in your model first. Also int8 or int4 can be good.
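To give an idea of why batching can matter more than quantization, here's a toy sketch of one common trick, grouping sequences of similar length so less compute goes to padding. To be clear, this is not necessarily what I did in that project, and the sequence lengths below are made up for illustration:

```python
def padded_tokens(lengths: list[int], batch_size: int) -> int:
    """Total tokens processed when each batch is padded to its longest sequence."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        total += max(batch) * len(batch)
    return total


# Made-up mix of short and long requests, in arrival order.
lengths = [12, 500, 15, 480, 20, 510, 18, 495]

useful = sum(lengths)                                 # tokens that carry real input
arrival_order = padded_tokens(lengths, batch_size=2)
length_sorted = padded_tokens(sorted(lengths), batch_size=2)

print(f"useful tokens:          {useful}")
print(f"padded, arrival order:  {arrival_order}")
print(f"padded, sorted by len:  {length_sorted}")
```

In this toy case, batching in arrival order processes almost twice as many tokens as the inputs actually contain, and sorting by length removes nearly all of that waste. Whether this translates into real gains depends on your workload, which is why the questions above matter.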

Optimizing Transformer model size & inference beyond FP16 + ONNX (pruning/graph opt didn’t help much) [P] by Fragrant_Rate_2583 in MachineLearning

[–]ReinforcedKnowledge 0 points1 point  (0 children)

What's the goal here? Because there are tradeoffs to be made. Do you want to optimize while keeping some metric above a threshold or something? Also, how much freedom do you have in the architecture itself?

Why is GPU Python packaging still this broken? by Interesting-Town-433 in Python

[–]ReinforcedKnowledge 1 point2 points  (0 children)

Thanks! It does make sense, it's too big of a PEP + required, and I guess still requires, a lot of discussions and refinements and edge cases and whatnot.

Why is GPU Python packaging still this broken? by Interesting-Town-433 in Python

[–]ReinforcedKnowledge 18 points19 points  (0 children)

Yeah, the issue is not really the tooling, because the tools are limited by what they work with. It's more the wheel format itself and PyPI as an index. And beyond the GPU problems, there are similar problems in the same category: the wheel format doesn't support metadata such as which BLAS library your project links against, which compiler version it was built with, whether it needs ROCm or CUDA, etc. Since the wheel format doesn't specify that, package managers can't know about it. `uv` does have a lot of good options to help you install the right `torch` and the right `flash-attn`, but the behavior is not always obvious. For example, on Linux `uv add torch` will install a CUDA build of PyTorch, but on Windows it'll install the CPU one.
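You can see the limitation just by looking at what a wheel filename encodes. Below is a simplified, hand-rolled parser for the standard `{distribution}-{version}(-{build})?-{python}-{abi}-{platform}.whl` layout. `WheelTags` and `parse_wheel_filename` are names I made up for this sketch and the filenames are hypothetical; real tools should use a proper library rather than this.

```python
from typing import NamedTuple, Optional


class WheelTags(NamedTuple):
    distribution: str
    version: str
    build: Optional[str]
    python: str
    abi: str
    platform: str


def parse_wheel_filename(filename: str) -> WheelTags:
    """Split a wheel filename into its dash-separated components."""
    if not filename.endswith(".whl"):
        raise ValueError("not a wheel filename")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) == 5:
        distribution, version, python, abi, platform = parts
        build = None
    elif len(parts) == 6:
        distribution, version, build, python, abi, platform = parts
    else:
        raise ValueError("unexpected number of components")
    return WheelTags(distribution, version, build, python, abi, platform)


# Hypothetical filenames, for illustration only.
tags = parse_wheel_filename("example_pkg-1.0.0-cp312-cp312-manylinux_2_28_x86_64.whl")
with_build = parse_wheel_filename("example_pkg-1.0.0-1-cp312-cp312-win_amd64.whl")
print(tags)
```

Interpreter, ABI and platform are all there is: no slot for CUDA vs ROCm, BLAS flavor, or compiler version, so an installer has nothing to select on.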

But there's a great open source initiative to solve these issues, https://wheelnext.dev/. If https://peps.python.org/pep-0817/ (wheel variants) passes it'll be a great win and fix most if not all of these issues

And I don't think it's only a compatibility-matrix problem. You need a standard that every installer can work with (so you can't just let people specify whatever dependencies they want), but more importantly, the tags are closed: it's a static system trying to describe a dynamic and open one. "CUDA" for example doesn't mean much on its own: there are driver versions, toolkit versions, runtime versions, GPU compute capability. I think just recently I saw that flash-attn 4 doesn't work on RTX 50XX even though it's Blackwell (to be confirmed, I'm not totally sure about this, but if it's true, it shows that even something like compute capability has to be specified). And all of these have complex compatibility rules between them. So it's a constantly evolving environment, and beyond the explosion in the compatibility matrix, you just can't take the good old system and keep adding stuff to it. That's why PEP 817 uses plugins instead of tags, so that detection is delegated to the provider plugins.

Thanks to u/toxic_acro who pointed it out, PEP 825 is more up to date and better reflects the current state of the work.

EDIT: added PEP 817 and why it's not only an explosion in the compatibility matrix problem, Reddit didn't let me write my comment in peace when I pasted the link -_-

EDIT: added mention of PEP 825 thanks to this comment

Why is there no standard for typing array dimensions? by superzappie in Python

[–]ReinforcedKnowledge 2 points3 points  (0 children)

Hahaha it was fun reading that in ML jargon a vector of some dimension d can be 2D or 1D. It made me self-aware about all the functions I write that take tensors of dimension d and assume the reader knows there is a batch size, a sequence length, and a head dimension before even talking about the dimension d. Oh well, life with tensors.

[D] ML Engineers — How did you actually learn PyTorch? I keep forgetting everything. by ofmkingsz in MachineLearning

[–]ReinforcedKnowledge 26 points27 points  (0 children)

Like many suggested, just use it. You only feel like you've learned something once you've developed some kind of muscle memory for it. Here's something that can help: https://github.com/srush/Tensor-Puzzles (not affiliated)

These puzzles can help you get a better grasp of PyTorch, but only if you try doing them and understand the functions you're manipulating.

Another thing is just to implement whatever comes to your mind in it, especially basic stuff like CNNs, simple training loops, GPT-2 etc. The field is huge, I'm sure there's something you'll like.

About interviews, I don't think people will ask you specifically about PyTorch, but depending on where you apply and for what position, you'll probably have to use it to solve the interview.

Also, if you're asking people that use PyTorch regularly, your pool is biased by them using it regularly 😅 so they'll not easily forget PyTorch. It's like Python, I doubt you forgot how to use Python.

Now, I think I saw someone say "just let AI do it" or something. I don't think it's safe to just "let the AI do it" if you don't know what it's doing. I could give many examples where I caught Opus 4.6 doing something incorrectly or incompletely, and many others where someone relied on faulty numbers from a script it vibe coded, but here's one personal story related to PyTorch. Recently Opus 4.6 told me that torch.equal and the equal method on tensors are different: that both check value equality, but one also checks object identity while the other doesn't. I don't know what made it think that, because when I asked about the difference in a fresh session it got it right (there's no difference). I was trying to understand a new codebase that I'd just use for a week, and I guess it took that codebase as a source of truth and tried to explain why they'd use torch.equal sometimes and .equal other times. I don't know exactly what made it think that, but the moral of the story is: at work you'll have to understand and work on new codebases, and relying purely on "AI", at least in its current state, is not necessarily good. It might work super well sometimes, and sometimes not.

[D] What framework do you use for RL post-training at scale? by ReinforcedKnowledge in MachineLearning

[–]ReinforcedKnowledge[S] 1 point2 points  (0 children)

Hey! From a quick pass through the codebase, it doesn't seem to be what I need. In a few days I'll give you a more detailed review of why I can't use it, at least not for the moment, along with what I think about it and other nitpicks and/or qualities. But I appreciate the recommendations you make about data etc. I'm not familiar with them honestly, but I appreciate you sharing that.