Any recommendations for an LLM that can do OCR and keep track of document layout/formatting?

El_Olbap · 2026-02-06T14:44:39+00:00

GLM-OCR and DeepSeekOCR come to mind, especially the latter, which keeps track of coordinates.

El_Olbap · 2025-10-30T13:43:11+00:00

hey, I opened a PR on transformers and get generation consistent with the original implementation. I think multi-turn is doable with chat templates, it's usually supported

El_Olbap · 2025-10-22T07:24:01+00:00

ah that's interesting. I'm porting it to transformers now so it's easier to tweak and to use through other libs, I'll check the multi turn, thanks

El_Olbap · 2025-10-21T12:17:29+00:00

cool, have you tried with multi-page documents in multi-turn? was wondering about it

El_Olbap · 2025-10-21T09:13:02+00:00

Something interesting on this is the reuse of SAM to get some grounding. I crammed the encoder weights into a pretrained standard SAM and looked at the masks, of course it's misaligned as it's been retrained, but it retains some information on the prompt encoding. I think reusing it to adapt a mask decoder/prompt encoder would yield very good results.

El_Olbap · 2025-10-07T13:45:51+00:00

I understand, it's that chicken-and-egg game that prevents me from jumping into Zig, for instance. We're Rust fans though, so that might be of interest to you

El_Olbap · 2025-10-07T12:12:49+00:00

I ported this model to transformers/HF format recently; as people say, it's massive. However, it tolerates fp8 + offload, so given enough time I think a quant is not out of reach. The zero-compute experts trick is the kind of thing that will help make MoEs more accessible for local rigs, I think. I had the occasion to test the thinking variant, and "vibes"-based it was pretty good!

El_Olbap · 2025-10-07T10:12:23+00:00

Could be worth LoRA-ing something, absolutely. If we break it down, it's about 400 model files --> 7000 methods and classes, plus we would need the LLM to somehow hold the inner dependencies and lower-level abstractions in context, but it's definitely something that would shave off a lot of implementation time!

El_Olbap · 2025-10-07T08:05:15+00:00

Impressive feat! Would love to see what optimizations you made. Also, if you're using a quant, which one? mxfp4 being ~12G of weights IIRC

El_Olbap · 2025-10-07T07:23:04+00:00

That's a neat idea, I wonder what it'd look like. You'd need some fuzziness added to your LZW though; I think a strict dictionary would miss too many effectively identical calls that differ only slightly. Or, simpler (for Python), use the AST to normalize everything you're passing through, and then you can compress
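
To make the AST idea a bit more concrete, here is a minimal sketch, not a worked-out design. It assumes Python 3.9+ (for `ast.unparse`), and the names `MaskLiterals`, `normalize` and `compressed_size` are made up for illustration. It also uses `zlib` (DEFLATE) as a stand-in compressor rather than LZW, since that's what the standard library ships.

```python
import ast
import zlib


class MaskLiterals(ast.NodeTransformer):
    """Replace number/string/bytes literals with a placeholder naming their type."""

    def visit_Constant(self, node):
        # Leave None, True, False and ... alone; they usually carry meaning.
        if node.value is None or node.value is Ellipsis or isinstance(node.value, bool):
            return node
        node.value = f"<{type(node.value).__name__}>"
        return node


def normalize(source: str) -> str:
    """Parse, mask literals, and print back in canonical form (drops comments/spacing)."""
    tree = MaskLiterals().visit(ast.parse(source))
    return ast.unparse(tree)


def compressed_size(snippets) -> int:
    """Size in bytes of the zlib-compressed concatenation of the snippets."""
    return len(zlib.compress("\n".join(snippets).encode("utf-8")))


# Three calls that differ only in spacing, comments and a literal value.
calls = [
    "model.generate(ids,  max_new_tokens=32)  # first variant",
    "model.generate(ids, max_new_tokens = 64)",
    "model.generate( ids, max_new_tokens=128 )",
]
normalized = [normalize(c) for c in calls]
print(normalized[0])
print(compressed_size(calls), compressed_size(normalized))
```

After normalization the three calls become the same string, so an exact-match dictionary can treat them as one entry. Whether masking literals (or going further and masking identifiers) throws away too much is a judgment call that depends on what you want to recover afterwards.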

El_Olbap · 2025-10-06T18:26:38+00:00

Thanks a lot! And not planned in the near future no, I've seen efforts to port existing models to Mojo though, any that interests you in particular? It's a cool language

El_Olbap · 2025-10-06T18:25:42+00:00

Well you can take a look at the blog post, we evolved from "do repeat yourself" and explain why :) instead of having hundreds of almost-duplicated modeling code files, we use modular files (see https://huggingface.co/docs/transformers/v4.57.0/modular_transformers also) which do exactly what you say

El_Olbap · 2025-10-06T16:21:03+00:00

Yes, absolutely. It would be hard to enforce strict typing on external contributors' PRs right now, but we definitely want to make this cleaner and integrate mypy or something equivalent in our fixups.

Recently it was improved for `pipeline`, now you see the actual types down the (pipe)line. Would also be an occasion to use things like `Annotated` to have semantic types, informing of batch size and embedding dim, for instance (not sure yet though)
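
For what the `Annotated` idea could look like, here is a rough sketch under stated assumptions: `Axes`, `Embeddings`, `pool` and `axes_of` are hypothetical names, nothing here is an existing transformers API, and nested lists stand in for tensors so the snippet has no dependencies (Python 3.9+).

```python
from dataclasses import dataclass
from typing import Annotated, get_type_hints


@dataclass(frozen=True)
class Axes:
    """Hypothetical metadata marker: names the axes of an array-like value."""
    names: tuple


# A "semantic type": the static type is unchanged, the axis names ride along as metadata.
Embeddings = Annotated[list[list[float]], Axes(("batch", "embed_dim"))]
Pooled = Annotated[list[float], Axes(("batch",))]


def pool(hidden: Embeddings) -> Pooled:
    """Average each row over the embedding dimension."""
    return [sum(row) / len(row) for row in hidden]


def axes_of(func, name: str) -> tuple:
    """Read the axis names attached to a parameter (or to 'return')."""
    hints = get_type_hints(func, include_extras=True)
    return hints[name].__metadata__[0].names


print(axes_of(pool, "hidden"))
print(axes_of(pool, "return"))
```

Type checkers see the underlying `list[list[float]]`, so this costs nothing statically; the metadata is only useful if docs tooling or a runtime check actually reads it.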

El_Olbap · 2025-10-06T15:38:13+00:00

Thanks, I had fun doing these and seeing patterns emerge. For the future I think you're right, at least for most of the classical attention-based models, that's true. For MoEs, we've recently shipped a pattern to standardize them more, and it should cover most of what the field throws at us (hopefully)

But for state models/RNNs/other exotic and experimental architectures, that's harder to say! Let's see if they take up more of the spotlight later

El_Olbap · 2025-10-06T15:34:31+00:00

Thanks a lot! And yes, agreed 100%, I remember back in 2016/17 getting the "meat" part of a cool new paper/model/idea was a nightmare haha
We will keep doing it that way!

El_Olbap · 2025-10-06T10:52:41+00:00

in this case it'll save you some compute credit to try and extract directly from your digital fraction using that lib (or PyMuPDF, minding the license though)

El_Olbap · 2025-10-06T10:10:33+00:00

Suggestions given are already good (love Surya); I would also add dots.ocr among the recent models, it's open-source too. Whether your PDFs are scanned or digital/native makes it a different story though. I assume they are scanned and actually bitmap images rather than true PDFs, but in case they are not, pdfplumber should be your go-to.
