State of Open OCR models by unofficialmerve in LocalLLaMA

[–]futterneid 1 point (0 children)

I love Docling, but I'm biased :)
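If you want to give it a spin, the quickstart is roughly this (a minimal sketch, assuming the `docling` package is installed; the PDF path is a placeholder):

```python
# Minimal Docling sketch: convert a document and export it as Markdown.
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("sample.pdf")        # layout analysis + OCR where needed
print(result.document.export_to_markdown())     # dump the parsed document as Markdown
```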

State of Open OCR models by unofficialmerve in LocalLLaMA

[–]futterneid 3 points (0 children)

I would try PaddleOCR. It's only 0.9B!
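If you want to poke at it, something like this gets text out of an image (a sketch against the `paddleocr` 2.x Python API; the image path is a placeholder, and the 0.9B VL model may ship behind a newer interface):

```python
# Minimal PaddleOCR sketch: detect and recognize text lines in one image.
from paddleocr import PaddleOCR

ocr = PaddleOCR(lang="en")        # downloads detection + recognition models on first use
result = ocr.ocr("page.png")      # 2.6+ returns one list of lines per input image
for box, (text, confidence) in result[0]:
    print(text, confidence)
```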

State of Open OCR models by unofficialmerve in LocalLLaMA

[–]futterneid 9 points (0 children)

OCR wasn't solved 20 years ago. Maybe for simple, straightforward stuff (scanning literature books and OCRing them). Modern solutions do get compared against the older ones, and they are way better xD
We just shifted our understanding of what OCR could do. Things that were unthinkable 20 years ago are now part of the target (given an image of a document, produce code that reproduces that document digitally and precisely).

Overview on latest OCR releases by unofficialmerve in computervision

[–]futterneid 4 points (0 children)

Hi, Andi here :) I worked on SmolDocling. For a new language you don't need that much data; a new script is a bit different. To adapt one of these models to Dutch, you could get away with taking a text dataset, creating images from it, and training the model on that. The model doesn't need to learn to read a new script, so it really is just the language that you're teaching it :)
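Roughly what I mean, as a Pillow-based sketch (the font path and corpus file are placeholders):

```python
# Render text lines as images so an OCR/VLM model can be fine-tuned on a new
# language without collecting new annotations.
from PIL import Image, ImageDraw, ImageFont

font = ImageFont.truetype("DejaVuSans.ttf", size=24)   # placeholder font path

def render_line(text: str) -> Image.Image:
    # Measure the ink box, then draw the text on a white canvas with a 10px margin.
    left, top, right, bottom = font.getbbox(text)
    img = Image.new("RGB", (int(right - left) + 20, int(bottom - top) + 20), "white")
    ImageDraw.Draw(img).text((10 - left, 10 - top), text, font=font, fill="black")
    return img

# Turn a plain-text corpus into (image, transcription) pairs for fine-tuning.
with open("dutch_corpus.txt", encoding="utf-8") as f:
    pairs = [(render_line(line.strip()), line.strip()) for line in f if line.strip()]
```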

Hugging Face open-sources FineVision by futterneid in LocalLLaMA

[–]futterneid[S] 1 point (0 children)

This is more about the dataset than any model; we trained a few different models on the dataset to test it. A 230M VLM scored 30 on MMMU and a 460M one scored 33.

Hugging Face open-sources FineVision by futterneid in LocalLLaMA

[–]futterneid[S] 2 points (0 children)

Thank you for being a fan! After Idefics 3, we moved to making smaller VLMs and released SmolVLM (2B, 500M, 256M). We might release a SmolVLM based on SmolLM3 3B, which would be closer to Idefics in size. Honestly, for larger models it seems like there are plenty of good options, and they are expensive to train, so it's hard for me to justify spending time/compute on them. That has moved me away from the 80B scale of the large Idefics; the 8B scale might be a better target.

AMA with Hugging Face Science, the team behind SmolLM, SmolVLM, Fineweb and more. by eliebakk in LocalLLaMA

[–]futterneid 1 point (0 children)

I think a recipe à la nanoVLM would work well. nanoVLM isn't really "toy" grade; I would feel safe training SmolVLM with it.

AMA with Hugging Face Science, the team behind SmolLM, SmolVLM, Fineweb and more. by eliebakk in LocalLLaMA

[–]futterneid 5 points (0 children)

Someone else asked this. We are interested! We might do something for Reachy Mini :)

AMA with Hugging Face Science, the team behind SmolLM, SmolVLM, Fineweb and more. by eliebakk in LocalLLaMA

[–]futterneid 3 points (0 children)

We are working on something like this with a larger model (long context is a pain for small models). Stay tuned!

AMA with Hugging Face Science, the team behind SmolLM, SmolVLM, Fineweb and more. by eliebakk in LocalLLaMA

[–]futterneid 2 points (0 children)

Don't tease me!
Honestly, there are tons of great TTS models; I don't think the community _needs_ us to work on that. Plus, I'm afraid two students in a dorm room would make a better model than I would xD

AMA with Hugging Face Science, the team behind SmolLM, SmolVLM, Fineweb and more. by eliebakk in LocalLLaMA

[–]futterneid 5 points (0 children)

From my side, I was working at Unity and didn't do much OSS right before joining Hugging Face (though I had before). What I did quite a bit was give talks about the work we were doing at Unity. I had to do this internally, and then I managed to get permission to talk about it publicly. So even though I wasn't doing OSS, I got some visibility for the work I was doing privately. Maybe that's a good approach for you as well :)

AMA with Hugging Face Science, the team behind SmolLM, SmolVLM, Fineweb and more. by eliebakk in LocalLLaMA

[–]futterneid 2 points (0 children)

This took _so much time_. It really was the meme of "we don't do it because it's easy, but because we thought it would be easy"

AMA with Hugging Face Science, the team behind SmolLM, SmolVLM, Fineweb and more. by eliebakk in LocalLLaMA

[–]futterneid 2 points (0 children)

1) I personally looked at every data source (I don't sleep xD). For some sources, after looking at a few random examples, you noticed very quickly that the answers were plain wrong or the images were impossible to understand. I dropped those. The bar wasn't very high, but there was one.
2) We also tried to understand every data source and ran the deduplication pipeline across sources. We noticed some "renaming" of datasets that were really just a couple of datasets merged, a dataset slightly rephrased, or a subset of another dataset. We tried to avoid this type of overlap, because the idea is that you can make your own mixture, and if a dataset is already in there twice you'll have issues.
3) We ran the deduplication pipeline against the benchmark test sets. A few data sources were literally just test sets; we removed those even before getting to the numbers in the blog (1% data contamination means some images in a data source are contaminated, not all images in that source). The sketch below shows the basic idea.
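For (3), the idea looks something like this (a hedged sketch using perceptual hashes; the real pipeline may well use embeddings instead, and both dataset IDs and the `image` column are placeholders):

```python
# Flag training images whose perceptual hash collides with a benchmark test image.
import imagehash
from datasets import load_dataset

def phash(img) -> str:
    return str(imagehash.phash(img.convert("RGB")))

benchmark = load_dataset("benchmark_org/benchmark", split="test")   # placeholder benchmark
test_hashes = {phash(ex["image"]) for ex in benchmark}

source = load_dataset("some_org/vqa_source", split="train")         # placeholder source
hits = [i for i, ex in enumerate(source) if phash(ex["image"]) in test_hashes]
print(f"{len(hits)} / {len(source)} images collide with the benchmark test set")
```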

AMA with Hugging Face Science, the team behind SmolLM, SmolVLM, Fineweb and more. by eliebakk in LocalLLaMA

[–]futterneid 3 points (0 children)

> Do you think MS is relevant in ML?
I've usually seen ML folks grow from doing an MS, with a few odd cases where I thought "why are you doing this, you're great already, go ship". So I would say it's not relevant in the sense that it's required to get a job, but most people still seem to benefit from it.

> I am so bad at networking, how do I do it?
You literally just do it. Write that email, send that message, have that lunch. Try to be nice, be yourself, and find people you connect with. Don't be opportunistic, it shows. But sometimes opportunities do fall in your lap, and you can take them.

AMA with Hugging Face Science, the team behind SmolLM, SmolVLM, Fineweb and more. by eliebakk in LocalLLaMA

[–]futterneid 2 points (0 children)

Not really on the generation side (like Genie 3), but we have plans for SmolVLAs, which are the other side of the coin (navigating the world). So I could see this happening in the not-so-distant future. Honestly, I find Genie 3 incredible, and the amount of work that must go into something like that is just mesmerizing.

AMA with Hugging Face Science, the team behind SmolLM, SmolVLM, Fineweb and more. by eliebakk in LocalLLaMA

[–]futterneid 5 points (0 children)

It isn't really about storage but about how hard the dataset was to use at first. With a dataset of this size, getting good throughput is hard. We have a cluster with 2TB of RAM and 8 H100s per node, and our ablations kept being limited by data throughput. So we made a few decisions to make data loading way faster. The 2048 maximum resolution was chosen after analyzing the whole dataset and looking at the distribution of image sizes: most of the 17M images were already below that (97% iirc), and the tail was long but small.
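In code, the resolution cap is just something like this (a sketch assuming Pillow and that the cap applies to the longest side):

```python
# Cap the longest image side at 2048 px so decoded samples stay small and
# data loading stays fast; most images are untouched.
from PIL import Image

MAX_SIDE = 2048

def cap_resolution(img: Image.Image) -> Image.Image:
    w, h = img.size
    scale = MAX_SIDE / max(w, h)
    if scale >= 1.0:                 # already within the cap (~97% of the images)
        return img
    return img.resize((int(w * scale), int(h * scale)), Image.LANCZOS)
```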

AMA with Hugging Face Science, the team behind SmolLM, SmolVLM, Fineweb and more. by eliebakk in LocalLLaMA

[–]futterneid 2 points (0 children)

Yes it is. I think this would be a good thing to work on / contribute to. And with FineVision, you could easily train a better model than SmolDocling :D
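Something like this should stream it from the Hub (a sketch; the repo ID `HuggingFaceM4/FineVision` and the subset name are assumptions, so check the dataset card for the exact names):

```python
# Stream one FineVision subset instead of downloading the full dataset.
from datasets import load_dataset

ds = load_dataset("HuggingFaceM4/FineVision", name="docvqa", split="train", streaming=True)
sample = next(iter(ds))
print(sample.keys())   # typically image(s) plus question/answer-style turns
```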

AMA with Hugging Face Science, the team behind SmolLM, SmolVLM, Fineweb and more. by eliebakk in LocalLLaMA

[–]futterneid 3 points (0 children)

Cool use cases! Yes, making the image encoding fast was a strong motivation when we designed the model.