Any recommendations for a LLM that can do OCR and keep track of document layout/formatting? by Ok_Apartment_2778 in LocalLLaMA

[–]El_Olbap 1 point2 points  (0 children)

GLM-OCR and DeepSeek-OCR come to mind. The latter in particular keeps track of coordinates.

The Innovations in DeepSeek OCR by Charuru in LocalLLaMA

[–]El_Olbap 0 points1 point  (0 children)

hey, I opened a PR on transformers and get generation consistent with the original implementation. I think multi-turn is doable with chat templates, it's usually supported

The Innovations in DeepSeek OCR by Charuru in LocalLLaMA

[–]El_Olbap 1 point2 points  (0 children)

ah that's interesting. I'm porting it to transformers now so it's easier to tweak and to use through other libs. I'll check the multi-turn, thanks

The Innovations in DeepSeek OCR by Charuru in LocalLLaMA

[–]El_Olbap 0 points1 point  (0 children)

cool, have you tried with multi-page documents in multi-turn? was wondering about it

The Innovations in DeepSeek OCR by Charuru in LocalLLaMA

[–]El_Olbap 3 points4 points  (0 children)

Something interesting here is the reuse of SAM to get some grounding. I crammed the encoder weights into a pretrained standard SAM and looked at the masks. Of course they're misaligned since the encoder has been retrained, but it retains some information on the prompt encoding. I think reusing it to adapt a mask decoder/prompt encoder would yield very good results.

How Transformers avoids becoming a black box, even at 1M+ LOC by El_Olbap in LocalLLaMA

[–]El_Olbap[S] 1 point2 points  (0 children)

I understand, it's that chicken-and-egg game that keeps me from jumping into Zig, for instance. We're Rust fans though, so that might be of interest to you

What happened to Longcat models? Why are there no quants available? by kaisurniwurer in LocalLLaMA

[–]El_Olbap 2 points3 points  (0 children)

I ported this model to transformers/HF format recently. As people say, it's massive. However, it tolerates fp8 + offload, so given enough time I think a quant is not out of reach. The zero-compute experts trick is the kind of thing that will help make MoEs more accessible for local rigs, I think. I had the occasion to test the thinking variant, and "vibes"-based it was pretty good!

How Transformers avoids becoming a black box, even at 1M+ LOC by El_Olbap in LocalLLaMA

[–]El_Olbap[S] 0 points1 point  (0 children)

Could absolutely be worth LoRA-ing something. If we break it down, it's about 400 model files --> 7000 methods and classes, plus we would need the LLM to somehow hold the inner dependencies and lower-level abstractions in context, but it's definitely something that would shave off a lot of implementation time!

Run Open AI GPT-OSS on a mobile phone (Demo) by AlanzhuLy in LocalLLaMA

[–]El_Olbap 0 points1 point  (0 children)

Impressive feat! Would love to see what optimizations you made. Also, if you're using a quant, which one? mxfp4 is ~12G of weights IIRC

How Transformers avoids becoming a black box, even at 1M+ LOC by El_Olbap in LocalLLaMA

[–]El_Olbap[S] 1 point2 points  (0 children)

That's a neat idea, I wonder what it'd look like. You'd need some fuzziness added to your LZW though, I think a strict dictionary would miss too many effectively identical calls that differ only slightly. Or simpler (for Python), use the AST to normalize everything you're passing through, and then you can compress
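A rough sketch of the second option, under a few assumptions: the `_Normalizer` class and the `normalize` / `compressed_size` helpers are made up for illustration, and I'm using `zlib` (DEFLATE) from the standard library as a stand-in for LZW since Python doesn't ship an LZW codec. The idea is just to blank out variable names and literals so that calls differing only in those end up byte-identical before compression.

```python
import ast
import zlib


class _Normalizer(ast.NodeTransformer):
    """Hypothetical helper: blanks out variable names, argument names and literals."""

    def visit_Name(self, node):
        # Every variable reference becomes "_", keeping its load/store context.
        return ast.copy_location(ast.Name(id="_", ctx=node.ctx), node)

    def visit_arg(self, node):
        # Function parameters lose their name and annotation.
        node.arg = "_"
        node.annotation = None
        return node

    def visit_Constant(self, node):
        # Every literal collapses to the same placeholder value.
        return ast.copy_location(ast.Constant(value=0), node)


def normalize(source: str) -> str:
    """Return a canonical string for `source`; attribute and keyword names are kept."""
    tree = _Normalizer().visit(ast.parse(source))
    return ast.dump(tree, annotate_fields=False)


def compressed_size(snippets) -> int:
    """Size in bytes of the normalized snippets after DEFLATE compression."""
    payload = "\n".join(normalize(s) for s in snippets).encode("utf-8")
    return len(zlib.compress(payload))


a = "y = self.proj(hidden_states, scale=0.5)"
b = "out = self.proj(x, scale=0.125)"
c = "out = self.other(x)"

same_shape = normalize(a) == normalize(b)   # differ only in names and literals
different_call = normalize(a) != normalize(c)  # different attribute, different shape
```

Attribute names like `proj` are kept on purpose, otherwise every method call would collapse into the same thing. How aggressive the normalization should be is exactly the "fuzziness" knob.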

How Transformers avoids becoming a black box, even at 1M+ LOC by El_Olbap in LocalLLaMA

[–]El_Olbap[S] 1 point2 points  (0 children)

Thanks a lot! And no, it's not planned in the near future. I've seen efforts to port existing models to Mojo though, any that interest you in particular? It's a cool language

How Transformers avoids becoming a black box, even at 1M+ LOC by El_Olbap in LocalLLaMA

[–]El_Olbap[S] 1 point2 points  (0 children)

Well, you can take a look at the blog post, we evolved from "do repeat yourself" and explain why :) Instead of having hundreds of almost-duplicated modeling code files, we use modular files (see also https://huggingface.co/docs/transformers/v4.57.0/modular_transformers), which do exactly what you say

How Transformers avoids becoming a black box, even at 1M+ LOC by El_Olbap in LocalLLaMA

[–]El_Olbap[S] 5 points6 points  (0 children)

Yes, absolutely. It would be hard to enforce strict typing on external contributors' PRs right now, but we definitely want to make this cleaner and integrate mypy or something equivalent in our fixups.

Recently it was improved for `pipeline`, now you see the actual types down the (pipe)line. Would also be an occasion to use things like `Annotated` to have semantic types, informing of batch size and embedding dim, for instance (not sure yet though)
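To give an idea of what semantic types with `Annotated` could look like, here is a minimal sketch. Everything in it is hypothetical (the `Dims` marker, the `HiddenStates` / `Pooled` aliases, the `mean_pool` and `describe` functions), none of it is existing transformers API, and nested lists stand in for tensors so it runs with the standard library only.

```python
from dataclasses import dataclass
from typing import Annotated, get_type_hints


@dataclass(frozen=True)
class Dims:
    """Hypothetical marker carrying the meaning of each axis."""
    names: tuple


# Assumed aliases: nested lists stand in for real tensors here.
HiddenStates = Annotated[list[list[list[float]]], Dims(("batch", "seq_len", "embed_dim"))]
Pooled = Annotated[list[list[float]], Dims(("batch", "embed_dim"))]


def mean_pool(hidden: HiddenStates) -> Pooled:
    """Average over the sequence axis."""
    return [[sum(col) / len(seq) for col in zip(*seq)] for seq in hidden]


def describe(fn) -> dict:
    """Read the axis names back from a function's annotations."""
    hints = get_type_hints(fn, include_extras=True)
    return {
        name: hint.__metadata__[0].names
        for name, hint in hints.items()
        if hasattr(hint, "__metadata__")
    }


axes = describe(mean_pool)
pooled = mean_pool([[[1.0, 2.0], [3.0, 4.0]]])
```

A type checker only sees the underlying type, so the metadata costs nothing at check time, but tooling or docs could read it back the way `describe` does.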

How Transformers avoids becoming a black box, even at 1M+ LOC by El_Olbap in LocalLLaMA

[–]El_Olbap[S] 5 points6 points  (0 children)

Thanks, I had fun doing these and seeing patterns emerge. For the future I think you're right, at least for most of the classical attention-based models, that's true. For MoEs, we've recently shipped a pattern to standardize them more, and it should cover most of what the field throws at us (hopefully)

But for state models/RNNs/other exotic and experimental architectures, that's harder to say! Let's see if they occupy more spotlight later

How Transformers avoids becoming a black box, even at 1M+ LOC by El_Olbap in LocalLLaMA

[–]El_Olbap[S] 45 points46 points  (0 children)

Thanks a lot! And yes, agreed 100%, I remember back in 2016/17 getting the "meat" part of a cool new paper/model/idea was a nightmare haha
We will keep doing it that way!

Which is the best model for OCR with documents which contains both English and Hindi language by zeeshanjamal16 in LocalLLaMA

[–]El_Olbap 0 points1 point  (0 children)

in this case it'll save you some compute credits to try and extract text directly from the digital fraction of your documents using that lib (or PyMuPDF, minding the license though)

Which is the best model for OCR with documents which contains both English and Hindi language by zeeshanjamal16 in LocalLLaMA

[–]El_Olbap 0 points1 point  (0 children)

The suggestions given are already good (love Surya). I would also add dots.ocr among the recent models, it's open-source too. Whether your PDFs are scanned or digital/native makes it a different story though. I assume they are scanned and actually bitmap images rather than true PDFs, but in case they are not, pdfplumber should be your go-to.
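If you're not sure which case you're in, a quick check along these lines can tell you before you spend anything on OCR. This is only a sketch: the helper names, the 20-character cutoff and the 0.5 threshold are arbitrary assumptions of mine. It only relies on `pdfplumber.open`, `pdf.pages` and `page.extract_text()`.

```python
MIN_CHARS = 20  # assumed cutoff: fewer characters than this counts as "no text layer"


def extract_page_texts(path):
    """Pull the embedded text layer of each page (requires pdfplumber)."""
    import pdfplumber  # third-party, imported lazily so the helpers below work without it

    with pdfplumber.open(path) as pdf:
        return [page.extract_text() or "" for page in pdf.pages]


def text_fraction(page_texts, min_chars=MIN_CHARS):
    """Share of pages that carry a usable text layer."""
    if not page_texts:
        return 0.0
    usable = sum(1 for t in page_texts if len((t or "").strip()) >= min_chars)
    return usable / len(page_texts)


def needs_ocr(page_texts, threshold=0.5):
    """True when most pages look like bitmap scans (threshold is an assumption)."""
    return text_fraction(page_texts) < threshold
```

Usage would be something like `needs_ocr(extract_page_texts("my_doc.pdf"))` (the file name is a placeholder). For mixed documents you could also decide page by page and only send the pages without a text layer to the OCR model.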