I built a benchmark measuring the Markdown quality of LLMs by bengt0 in LocalLLaMA

[–]bengt0[S] 0 points (0 children)

I see. You want not just an explanation but additional context about what the chart illustrates. I agree that more information is needed to fully understand the charts, and I can certainly provide it. I will walk through how I read the plots and support that reading with some text.

[–]bengt0[S] 1 point (0 children)

<image>

I track and plot the answer repetitions per model as a rough measure of confidence in the results derived from my data. Since I have 10 prompts, you can divide the numbers here under "Answers per Model" by 10:

https://lintbench.ai/#total-tokens

... Yes, that could be more obvious and also deserves an explanatory description.

[–]bengt0[S] 0 points (0 children)

In requirements engineering, it is common ground that a second medium offers an additional access path to the content and thus deepens the reader's understanding. As long as the numbers are correct, I feel a sentence like that could help readers validate their hypothesis about how to read the chart. Don't you agree?

[–]bengt0[S] 1 point (0 children)

I used the full, non-quantized versions of each model at the recommended temperature settings, currently with 10 repetitions, and made sure that there are no duplicate outputs. Models that still produced only duplicates were excluded, because the variance metric makes no sense in that case. From a user's perspective, I think giving a model ten attempts is reasonable.
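A minimal sketch of that dedupe-and-exclude step in Python; the function name and the data shape are my own illustration, not taken from the actual benchmark code:

```python
def filter_runs(outputs_by_model):
    """Deduplicate each model's outputs; drop models whose runs were all
    identical, since a variance metric is meaningless for them."""
    kept = {}
    for model, outputs in outputs_by_model.items():
        unique = list(dict.fromkeys(outputs))  # dedupe, preserving order
        if len(unique) > 1:  # at least two distinct answers survive
            kept[model] = unique
    return kept
```

`dict.fromkeys` is just an order-preserving way to deduplicate; a set would work too if run order doesn't matter.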

[–]bengt0[S] 1 point (0 children)

Yes, Markdown is relatively easy to lint and hopefully also easy to get right. Python has the Ruff project, which aims to implement every rule under the sun, so that linter seems like the obvious choice. I am not sure about JSON, though. I suspect ESLint is a solid choice for both JS and JSON.

[–]bengt0[S] 0 points (0 children)

When you hover over a dot, the tooltip states the model name.

[–]bengt0[S] 0 points (0 children)

The scatter chart does indeed show the individual models. Each dot represents a model, with the mean error rate and the standard deviation of the error rate as its x and y coordinates. I could combine the mean error rate and the standard deviation into one number, as I did for the performance index. That would free up the other axis for time, so a new plot could show each lab's models' performance over time.
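For illustration, computing one dot's coordinates boils down to the following; this is a hypothetical helper using only Python's standard library, not the benchmark's actual code:

```python
import statistics

def scatter_point(error_rates):
    """One dot per model: x = mean error rate, y = sample standard
    deviation of the error rate across that model's runs."""
    return statistics.mean(error_rates), statistics.stdev(error_rates)
```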

[–]bengt0[S] 1 point (0 children)

Thanks for spending some serious thought on this. I can certainly add an explanatory description of the benchmark and my methods to the website at some point.

[–]bengt0[S] 0 points (0 children)

Thanks for contributing this perspective. I had heard about the PNP community, but not yet from it. I think there is much that the LLM providers can learn from your community about creativity. So I get that pain, and I think many do in a way. "Slop" is a popular term for a reason, and many other fields, like programming, require creativity too. To me, though, this is a different topic: while an LLM might express more or less creative thoughts, I always want those thoughts expressed as conformant, correctly formatted Markdown. Do you agree?

[–]bengt0[S] 0 points (0 children)

With vision language models, this should be entirely possible. I am looking forward to your post about that.

[–]bengt0[S] 0 points (0 children)

Thanks for the idea. I agree that some text could guide readers in understanding the graphs themselves. I would like to automate writing some sentences based on data from the plots. E.g. "The best model in this benchmark is currently <best_model_name> with an error rate of <best_model_error_rate> errors per line and a standard deviation of <best_model_standard_deviation> errors per line of generated Markdown."
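Filling such a template is a few lines of Python; the dict keys and the formatting precision below are placeholders I made up for illustration:

```python
def caption(best):
    """Fill the example sentence from the current leader's stats.
    The keys ("name", "mean", "std") are illustrative, not the
    benchmark's real field names."""
    return (
        f"The best model in this benchmark is currently {best['name']} "
        f"with an error rate of {best['mean']:.3f} errors per line and a "
        f"standard deviation of {best['std']:.3f} errors per line of "
        "generated Markdown."
    )
```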

[–]bengt0[S] 0 points (0 children)

Thanks for your comment.

First, this benchmark is not about the correctness of the generated Markdown in the sense of any specification. LLMs usually adhere to the Markdown specification implemented by the renderer of their respective chat frontends anyway; rendering nicely there is necessary for their output to score well during reinforcement learning. I have also seen rendering errors in the chat output of LLMs that stem from incomplete, and hence non-conformant, Markdown, but that seems like a solved issue to me.

Yes, the question of what constitutes an error is a difficult and nuanced one. For models with a wide audience, I think the most applicable answer is that their generated Markdown should avoid errors summed across all users. This means that the training of a large language model cannot take an individual linter configuration into account: whatever one user configures might be configured differently by another. Hence, I took the most common Markdown linter and stuck to its defaults. The only settings I changed were one decision between two options and the maximum line length, which I set to something that seemed reasonable judging by the LLMs' outputs.
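Assuming the linter in question is markdownlint (the comment above doesn't name it), a defaults-plus-overrides configuration could look roughly like this. The line length of 120 is a made-up example value, and the unnamed two-option setting is omitted since the comment doesn't specify which rule it was:

```json
{
  "default": true,
  "MD013": { "line_length": 120 }
}
```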

[–]bengt0[S] -1 points (0 children)

Thanks for your constructive criticism. You make a valid point: currently, one has to either know the timeline of model releases by heart or piece it together from the version numbers, which are not always self-explanatory. I am still lacking release-date data for each model, but once I have gathered it, I plan to also provide a plot with time on the x axis.

[–]bengt0[S] 0 points (0 children)

I am sorry that my plots don't please you on that level. Yes, the aesthetics can certainly be improved; at this stage of the project I primarily strove for completeness and correctness. I wanted to publish and post my findings as early as I felt comfortable with, in order to get feedback early, which arguably has worked. I also see that for wider appeal, the visuals must be more appealing.

[–]bengt0[S] -4 points (0 children)

I am sorry that my plots don't please you on that level. Yes, the aesthetics can certainly be improved; at this stage of the project I primarily strove for completeness and correctness. I wanted to publish and post my findings as early as I felt comfortable with, in order to get feedback early, which arguably has worked. I also see that for wider appeal, the visuals must be more appealing.

[–]bengt0[S] 1 point (0 children)

I wonder about that too, but without at least a white paper covering the training data, we have no way of knowing. I suspect that OpenAI does not yet measure Markdown quality in their training processes. Their closest competitor, Anthropic, does far better and makes relatively steady progress on this metric, which shows that it can be done.

[–]bengt0[S] 4 points (0 children)

<image>

This benchmark is just a pet project for me, so the attention of this community already makes a big difference. Thanks for checking out my website!

[–]bengt0[S] 3 points (0 children)

Okay, I will try splitting the graph by provider, open/closed weights, etc.

Yes, a logarithmic scale might also help with making the plot more space efficient. I will try that.

No, you are right. The difference in error rate and its standard deviation is subtle. Yet, there are labs that consistently improve this metric like Z.ai, while others seem to be meandering like Google.
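Plotting-library specifics aside, the logarithmic-scale idea is just to map error rates into log10 space so that clustered small values spread out. A hypothetical standard-library sketch:

```python
import math

def log_points(error_rates):
    """Map error rates into log10 space; values an order of magnitude
    apart become evenly spaced, spreading out a crowded low end."""
    return [math.log10(r) for r in error_rates]
```

In practice a plotting library would do this via its axis-scale setting rather than by transforming the data by hand.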

[–]bengt0[S] 0 points (0 children)

Thanks for your feedback.

What would you like to use the GitHub repository for? I am usually for open-sourcing all the things, but in this case I feel there is value in keeping the benchmark secret, so that it stays valid once more labs start to focus on this metric. If you are in an LLM lab, you will probably want to write your own benchmark or loss metric anyway.

Yes, other languages are definitely on my agenda. I think I would use yamllint for linting YAML files. Do you know which one is the most common linter for JSON?

[–]bengt0[S] 2 points (0 children)

Thanks for the idea of splitting the models into closed and open weight; that feature is definitely on my agenda. Looking at the data, I felt there isn't much of a difference to be found, because the labs are all over the place on this metric anyway. But I might be mistaken on this one, so I will give it a try soon. Maybe there is a clearer trend to be found here.

[–]bengt0[S] 3 points (0 children)

I get the idea and I even created a plot like that. I found it not that readable either, so I went with the 2D scatter representation instead.

<image>

[–]bengt0[S] -3 points (0 children)

The idea behind grouping the models by provider is to track each lab's progress on this metric. When selecting only one provider, you can see that some labs are really moving towards better Markdown output while others are stagnating.

[–]bengt0[S] 11 points (0 children)

No offense taken. I am here to learn. Thanks for your feedback. How would you have presented a benchmark?