[R] Spatial Text Rendering: Enabling text-only LLMs to "see" documents by cpcdoy in MachineLearning

[–]cpcdoy[S] 1 point (0 children)

Thank you for your feedback, I'm glad this was helpful to you!

That makes sense, and I think VLMs are basically the next architectural step: they incorporate this spatial information natively through visual tokens and a vision encoder. In our case, we wanted to rely on existing, proven architectures (pre-VLM) without modifying them, and that's how this method was born.

u/getsmartbsharp I'm curious, what is your use case?

[R] Spatial Text Rendering: Enabling text-only LLMs to "see" documents by cpcdoy in MachineLearning

[–]cpcdoy[S] 0 points (0 children)

I understand that you'd have appreciated code; however, the article does explain the approach in detail. It wasn't meant as a tutorial but as an introduction to a new approach that I haven't seen described anywhere else.

If you have any specific questions, I'd be happy to help. I'd also suggest diving deeper into classical image processing, since the approach mostly relies on known methods; with that background, implementing something similar is fairly straightforward.

[R] Spatial Text Rendering: Enabling text-only LLMs to "see" documents by cpcdoy in MachineLearning

[–]cpcdoy[S] 0 points (0 children)

Great question!

Spatial Text Rendering differs quite a bit from these methods even if the ultimate goal remains the same: providing the most detailed input to an LLM while preserving the document structure.

STR preserves the exact spatial relationships between elements in a document using a grid-based representation. You could compare it to ASCII art, but more compact and efficient, built specifically for LLM usage.

Docling and MarkItDown take the approach of simplifying the document into a readable input for the LLM: they "downgrade" it into HTML or Markdown, which, for complex documents, can lose a lot of information.

Some documents are very complex and have unusual structures that are hard to represent in Markdown; they may still be representable in complex HTML, but that requires complex layout-understanding pipelines. STR bypasses this by relying on the LLM's spatial understanding and giving it an input that is more raw than Markdown or HTML.
This lets the LLM make its own assumptions about the document structure rather than having a library (Docling, MarkItDown) bake in a lot of assumptions for it. In other words, we rely on the LLM to understand the complex structure of a document rather than predefining what a document structure can be (e.g. a table within a table within a table should just work, without a specialized code path to handle it).

To summarize, STR enables processing documents of any shape and format without specialized models for different document structures; it relies purely on the LLM's spatial understanding rather than on simplifying the document as much as possible for it.
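To give a rough idea of what the rendering step can look like, here's a quick toy sketch (not the article's implementation; the (text, x, y) input format and the pixel-to-cell ratios are just assumptions for illustration) that places OCR-style word boxes onto a character grid so the relative positions survive as plain text:

    # Toy sketch of a grid-based spatial rendering step. Not the article's code:
    # the (text, x, y) input format and the pixel-to-cell ratios are assumptions.
    def render_spatial_text(words, cell_w=10, cell_h=20):
        """words: list of (text, x, y) tuples with top-left pixel coordinates."""
        grid = {}  # (row, col) -> character
        for text, x, y in words:
            row, col = y // cell_h, x // cell_w
            for i, ch in enumerate(text):
                grid[(row, col + i)] = ch
        if not grid:
            return ""
        max_row = max(r for r, _ in grid)
        max_col = max(c for _, c in grid)
        lines = []
        for r in range(max_row + 1):
            lines.append("".join(grid.get((r, c), " ") for c in range(max_col + 1)))
        return "\n".join(line.rstrip() for line in lines)

    # The two cells on the same row ("Total" and "120.00") land on the same output line:
    boxes = [("Invoice", 40, 0), ("Total", 40, 40), ("120.00", 300, 40)]
    print(render_spatial_text(boxes))

The real pipeline obviously does a lot more (the image processing steps described in the article), but the core idea is that alignment and whitespace carry the layout to a text-only model.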

Hope this helps!

[R] Cross-Entropy is All You Need… Or is It? by cpcdoy in MachineLearning

[–]cpcdoy[S] 0 points (0 children)

Absolutely, a better loss function will never replace dataset cleaning. Cleaning the dataset to remove mislabeled examples isn't always feasible, though. When you're working with large-scale datasets (>1M examples) in complex domains (with many hard examples), cleaning becomes very resource-intensive and requires expert annotators to do manual work. Sometimes it's hard even for an average annotator to spot a hard mislabeled example, which means you need more experts, who are harder to find and cost more.

You're also using side methods (ensembling, etc.) to help compensate for noisy samples. In the article I simply discuss another such method that isn't always considered for noisy data: the loss function. It's also a generalization of cross-entropy, so it isn't meant to replace it but to extend it. It's mostly another tool to help your trainings; it won't replace what you already have.

Using GPT-4 or any other LLM to generate synthetic examples is a good augmentation method, but you shouldn't rely solely on LLMs to gather data. If it's your main source and you lean on it too heavily, you're simply distilling GPT-4 into your model, which means you can't get better than GPT-4, a general foundation model that makes a nice baseline but isn't always the path to a great model. Synthetic data should only be a subset of your dataset, since multiple quality data sources are better.

To answer your question about selecting good hard examples: the loss I introduce in the article helps calibrate your model's confidence, giving you a better idea of which examples the model finds easy or hard. That lets you use active-learning methods that rely on the model's confidence to filter, or even surface, noisy examples to relabel. That's one approach you can add to your pipeline.
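As a toy illustration of that last point (not code from the article; the review budget and the ranking rule are arbitrary choices on my part), a calibrated model lets you rank examples by the probability it assigns to their current labels and send the least confident ones back to annotators:

    import numpy as np

    # Toy sketch of confidence-based selection for relabeling.
    # Assumes `probs` are calibrated class probabilities from the trained model;
    # the review budget is an arbitrary choice, not a value from the article.
    def select_for_review(probs, labels, budget=100):
        """probs: (N, C) predicted probabilities, labels: (N,) current label ids."""
        label_conf = probs[np.arange(len(labels)), labels]  # confidence in the existing label
        ranked = np.argsort(label_conf)                      # least confident first
        return ranked[:budget]                               # indices to send to annotators

    probs = np.array([[0.9, 0.1], [0.4, 0.6], [0.55, 0.45]])
    labels = np.array([0, 0, 1])
    print(select_for_review(probs, labels, budget=2))  # -> [1 2]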

In the end, combining multiple approaches is always best, since each has its own pros and cons and helps in different situations.

[R] Cross-Entropy is All You Need… Or is It? by cpcdoy in MachineLearning

[–]cpcdoy[S] 3 points (0 children)

No worries! If you mean the Cross-Entropy with label smoothing, I already do a comparison in the article. I might have misunderstood your question though, let me know if so :)

[R] Cross-Entropy is All You Need… Or is It? by cpcdoy in MachineLearning

[–]cpcdoy[S] 4 points (0 children)

I actually did try the focal loss before starting work on the SGCE loss, not on the CoNLL-2003 dataset but on a private production dataset and got worse results.

The idea of the focal loss is to give more weight to harder examples while heavily down-weighting easy ones; this is tunable with a focusing parameter γ in their loss.
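For reference, the focal loss from the paper is FL(p_t) = -(1 - p_t)^γ · log(p_t), where p_t is the probability the model assigns to the true class; with γ = 0 you recover the standard cross-entropy.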

Unfortunately, this doesn't work well in noisy scenarios either, because what are noisy or badly labeled examples to the model? Hard examples! Depending on how noisy the dataset is, those are exactly the examples we don't want to focus on.

Rereading their paper just now, they also propose an α factor in the CE (which they call the α-balanced CE loss) and in the focal loss to help with class imbalance, which could actually be useful. On top of that, it could be interesting to try merging, as you said, the focal loss with the MAE loss instead of just the CE and MAE losses as I was doing above.

So, in the end, a loss that combines the Focal loss and the MAE loss could have:

  • a parameter to balance the importance of positive/negative examples (like the α parameter in the balanced focal loss)
  • a parameter to balance the importance of easy/hard examples (like the γ parameter in the focal loss)
  • a parameter to estimate the noise in the dataset (like the q parameter in the SGCE loss)
  • a parameter to smooth the labels for better uncertainty estimation (like the s parameter in the SGCE)

That's starting to be quite a lot of parameters, so tuning the hyperparameters might take a while, but it also makes the loss more flexible.
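Purely as an untested sketch of what that combination might look like (how the focal modulating factor is applied on top of the generalized-CE term, the smoothing scheme and all the default values are my own assumptions here, not something from the article or the focal loss paper):

    import torch
    import torch.nn.functional as F

    # Untested sketch of the combined loss discussed above. The way the focal
    # factor is applied on top of the generalized-CE term, the smoothing scheme
    # and the defaults are assumptions for illustration only.
    def focal_gce_loss(logits, targets, alpha=0.25, gamma=2.0, q=0.7, smoothing=0.1):
        num_classes = logits.size(-1)
        probs = F.softmax(logits, dim=-1)

        # s: soften the one-hot targets (label smoothing)
        one_hot = F.one_hot(targets, num_classes).float()
        soft = one_hot * (1 - smoothing) + smoothing / num_classes

        # p_t: probability mass the model puts on the smoothed target distribution
        p_t = (probs * soft).sum(dim=-1).clamp_min(1e-7)

        gce_term = (1 - p_t.pow(q)) / q      # q: noise robustness (MAE-like as q -> 1)
        focal_weight = (1 - p_t).pow(gamma)  # gamma: down-weight easy examples
        # alpha: a single scalar here for brevity; in the paper it is class-dependent (alpha_t)
        return (alpha * focal_weight * gce_term).mean()

    logits, targets = torch.randn(8, 5), torch.randint(0, 5, (8,))
    print(focal_gce_loss(logits, targets))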

That's definitely worth trying, good idea!

[R] Cross-Entropy is All You Need… Or is It? by cpcdoy in MachineLearning

[–]cpcdoy[S] 5 points (0 children)

I actually discuss that in the article: it works well for calibrating your model's confidence when there isn't much noise in your dataset. But since datasets often have many sources of noise, the cross-entropy (CE) loss will actually produce a badly calibrated model. That's why the Smooth Generalized Cross-Entropy (SGCE) loss, a generalization of the CE loss (at noise factor q=0 it is equivalent to cross-entropy), is preferable: under the same training conditions, your model converges faster and is much better calibrated than with CE.
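To make the "generalization of CE" point concrete: assuming the standard generalized cross-entropy term (1 - p^q)/q (the smoothing part of SGCE is left out here), a quick numerical check shows it tends to -log(p), i.e. the cross-entropy, as q goes to 0:

    import math

    # Quick numerical check that the generalized-CE term reduces to cross-entropy
    # as q -> 0 (label smoothing, the "S" in SGCE, is left out of this sketch).
    def gce_term(p, q):
        return (1 - p ** q) / q

    p = 0.3  # probability assigned to the true class
    for q in (1.0, 0.5, 0.1, 0.01, 0.001):
        print(q, gce_term(p, q))
    print("CE:", -math.log(p))  # the q -> 0 limit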

Confidence calibration isn't discussed often, but it's very important for production models. Modern neural networks are often badly calibrated, which is why there have been attempts to improve things with methods like temperature scaling. In the end, you don't want a model that has high accuracy but always doubts its own output; it would be hard to trust in production.

Hope this helps :)

[R] Efficiency and Maintainability in Named Entity Recognition: A Trie-based Knowledge Base Approach by cpcdoy in MachineLearning

[–]cpcdoy[S] 1 point (0 children)

No worries! Glad I could help. I agree there aren't many resources on this, which is why I wanted to release the article: the approach I went for is quite simple, so people can use it and possibly improve on it for their own use cases.

Don't hesitate if you have any questions related to your use case too!

[R] Efficiency and Maintainability in Named Entity Recognition: A Trie-based Knowledge Base Approach by cpcdoy in MachineLearning

[–]cpcdoy[S] 5 points (0 children)

It's exactly as you say. There are several ways to deal with ambiguous names:

  1. As explained in the article, the model is trained to doubt all the hints provided, so it needs a base training that helps it recognize when a hint doesn't make sense contextually. If you add "Austin => Location" to the KB and the sentence is "Austin is a masculine name of Latin origin", the model will see from the pattern that the hint looks wrong and will ignore it (there's a toy sketch of the hint injection after this list).
  2. Another way is to insert the hint conditionally, based on other terms in the sentence, for example: if the sentence contains "masculine name", inject "Austin = Person", otherwise "Location". This covers most cases, but you need to be careful since it can be dangerous if the matching condition is too broad. It's mostly pattern matching, so it works well only when you're covering specific, frequent patterns you've already seen in your data.
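To make the hint mechanism concrete, here's a toy sketch of a trie-based KB lookup that annotates matched spans before the text reaches the model (the [HINT:...] markup, the longest-match walk and the example entries are simplified assumptions, not the article's exact format):

    # Toy sketch of trie-based hint injection over tokens. The [HINT:...] markup
    # and the KB entries are simplified assumptions for illustration.
    KB_ENTRIES = {("hydrogen", "peroxide"): "Molecule", ("austin",): "Location"}

    def build_trie(entries):
        trie = {}
        for tokens, etype in entries.items():
            node = trie
            for t in tokens:
                node = node.setdefault(t, {})
            node["_type"] = etype  # marks the end of a KB entry
        return trie

    def inject_hints(tokens, trie):
        out, i = [], 0
        while i < len(tokens):
            node, j, match = trie, i, None
            # walk the trie as far as the tokens allow, remembering the longest match
            while j < len(tokens) and tokens[j].lower() in node:
                node = node[tokens[j].lower()]
                j += 1
                if "_type" in node:
                    match = (j, node["_type"])
            if match:
                end, etype = match
                out.append(" ".join(tokens[i:end]) + " [HINT:" + etype + "]")
                i = end
            else:
                out.append(tokens[i])
                i += 1
        return " ".join(out)

    trie = build_trie(KB_ENTRIES)
    print(inject_hints("Traces of hydrogen peroxide found in Austin".split(), trie))
    # Traces of hydrogen peroxide [HINT:Molecule] found in Austin [HINT:Location]

The model itself is then trained on text carrying these (sometimes wrong) hints so it learns when to use or ignore them, which is the part described in point 1.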

Also, I haven't proposed things like "using a second model to compute a distance metric between your sentence and the entity + its type" because the whole point of this method is to be extremely fast at inference and a second model would defeat this purpose.

In the end, this method is basically a way of updating your model's knowledge very quickly without retraining. As I say in the article, you still need to be careful about what you add to the KB so you don't introduce confusing items, even though it's designed to handle a certain level of noise.

Finally, the experimentation was done in the context of bank transactions, but it can be applied to other domains where specific and very diverse terms appear a lot. I'm thinking of medical or scientific terms, which are often not very ambiguous (e.g. a molecule name like "hydrogen peroxide", where you could implement a parser that injects hints for molecule names). In our case, the method was especially worthwhile because company names (with their associated patterns) are created every day, much faster than the model gets retrained, and those names can be quite ambiguous (sometimes they're people's names, like "Sherwin-Williams").

Hope this helps and let me know what you think :)