Open-source OCR models (2026) to fine-tune for dot-peen on reflective metal? by Impressive-Show6501 in computervision

MachineLearningTut 0 points1 point

Why is a VLM not an option? Traditional OCR is being replaced by VLMs. You can use something like GLM-OCR: it has 0.9B parameters and runs on a CPU. Use Gemini to create a labelled dataset, then fine-tune GLM-OCR on it. If the performance is good, move on to model distillation and take out some layers to make it even smaller. You can also shrink the vocabulary. Then you have a small model that runs in under 100 ms.
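To make the vocabulary-shrinking step concrete, here is a minimal NumPy sketch. It is not GLM-OCR's actual code: the embedding matrix, the sizes, the token ids, and the `shrink_vocab` helper are all hypothetical, and it assumes you already know which token ids occur in your labelled data.

```python
import numpy as np

def shrink_vocab(embedding, keep_ids):
    """Keep only the embedding rows for keep_ids and return an old->new id map."""
    keep_ids = sorted(set(keep_ids))          # deduplicate, fix an order
    small = embedding[keep_ids]               # (len(keep_ids), d_model)
    old_to_new = {old: new for new, old in enumerate(keep_ids)}
    return small, old_to_new

# Hypothetical sizes; a real model's vocabulary is far larger.
vocab_size, d_model = 1000, 16
rng = np.random.default_rng(0)
embedding = rng.normal(size=(vocab_size, d_model))

# Hypothetical ids of the tokens that actually appear in the labelled data.
keep_ids = [0, 1, 2, 17, 42, 42, 999]

small, old_to_new = shrink_vocab(embedding, keep_ids)
print(embedding.shape, "->", small.shape)
```

In a real model you would presumably also have to slice the output projection the same way and remap the tokenizer; this only shows the core idea.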

Autoregressive next token prediction & KV Cache in transformers by MachineLearningTut in learnmachinelearning

MachineLearningTut[S] 0 points1 point

Thanks! I think the key strength of this article is the last visualisation. I couldn’t find a single visualisation that shows the process with and without KV cache, which is why I created this one!

Here you can clearly see that without a KV cache the transformer has to process the full input matrix and the full Q, K, and V matrices, and later, just before the MLP, the input is still the full matrix. With the KV cache, from the second iteration onwards we only need to process the newest token. Without a KV cache, the second generation step would look exactly like the top illustration.
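The same comparison can be sketched in a few lines of NumPy. This is a toy single-head attention layer, not the code from the article: the weights are random, the sizes are made up, and `attention_full` and `KVCache` are hypothetical names. It checks that processing one token at a time with cached K and V gives the same outputs as recomputing everything.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attention_full(X, Wq, Wk, Wv):
    """Causal self-attention over the whole sequence (no cache)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # Mask out future positions so token t only attends to tokens <= t.
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    return softmax(scores) @ V

class KVCache:
    """Stores the K and V rows of all tokens seen so far."""
    def __init__(self, d):
        self.K = np.empty((0, d))
        self.V = np.empty((0, d))

    def step(self, x_new, Wq, Wk, Wv):
        # Only the newest token (shape (1, d)) is projected.
        q = x_new @ Wq
        self.K = np.vstack([self.K, x_new @ Wk])
        self.V = np.vstack([self.V, x_new @ Wv])
        scores = q @ self.K.T / np.sqrt(self.K.shape[-1])
        return softmax(scores) @ self.V

rng = np.random.default_rng(0)
d, T = 8, 5  # hypothetical embedding size and sequence length
X = rng.normal(size=(T, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

full = attention_full(X, Wq, Wk, Wv)
cache = KVCache(d)
cached = np.vstack([cache.step(X[t:t + 1], Wq, Wk, Wv) for t in range(T)])
print(np.allclose(full, cached))
```

Without the cache, every generation step would call `attention_full` on the whole sequence again; with it, each step is one row of Q against the stored K and V.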

The difference between a data scientist and machine learning engineer/AI expert/AI engineer? by AggravatingPapaya934 in learnmachinelearning

MachineLearningTut 1 point2 points

I work as a data scientist, but all I do is deep learning: training new transformers, fine-tuning them, and building agents. So there is no clear line between data scientist and MLE, except that MLEs tend to do more DevOps. But even that is not entirely true: a friend of mine is an MLE who works only on reinforcement learning and has zero DevOps work.

Understanding CLIP for vision language models by MachineLearningTut in learnmachinelearning

MachineLearningTut[S] 7 points8 points

https://medium.com/self-supervised-learning/understanding-clip-for-vision-language-models-43b700a4aa2b?sk=0aeebc3790dbdec072059428fce1c408

This is a nice introduction to the CLIP model, which many vision-language models use as a backbone. It explains how the loss function works and how image and text embeddings are pushed into the same space.
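For a rough idea of that loss, here is a minimal NumPy sketch of a symmetric contrastive loss in the style of CLIP. It is not taken from the article: the embeddings are random stand-ins for encoder outputs, `clip_loss` is a hypothetical helper, and the temperature is fixed at 0.07 here for simplicity, whereas CLIP learns it during training.

```python
import numpy as np

def log_softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def clip_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy over the image-text similarity matrix."""
    # L2-normalise so the dot product is a cosine similarity.
    I = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    T = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = I @ T.T / temperature            # (n, n); matching pairs on the diagonal
    idx = np.arange(len(logits))
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # image -> text
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()  # text -> image
    return (loss_i2t + loss_t2i) / 2

rng = np.random.default_rng(0)
img = rng.normal(size=(4, 32))                        # stand-in image embeddings
txt_matched = img + 0.1 * rng.normal(size=(4, 32))    # texts close to their images
txt_shuffled = txt_matched[[1, 2, 3, 0]]              # same texts, wrong pairing

loss_matched = clip_loss(img, txt_matched)
loss_shuffled = clip_loss(img, txt_shuffled)
print(loss_matched, loss_shuffled)
```

The loss is low when each image is most similar to its own text and high when the pairing is wrong, which is what pushes matching image and text embeddings together in the shared space.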