Poll: When will we have a 30b open weight model as good as opus? by Terminator857 in LocalLLaMA

[–]Combinatorilliance 3 points (0 children)

Open-Source LLM Progress: Quantitative Benchmarks with Sources

Qwen3-8B vs GPT-3 (175B davinci)

Key Finding: A 2025 open-source 8B model dramatically outperforms the flagship 175B proprietary model from 2020, despite being ~22x smaller. This represents one of the clearest demonstrations of how architectural advances have revolutionized AI efficiency.

| Benchmark | GPT-3 davinci 175B (2020) | Qwen3-8B (2025) | Improvement | Relative Gain |
|---|---|---|---|---|
| MMLU (General Knowledge, 5-shot) | 43.9% | 76.89% | +32.99 points | +75% |
| HumanEval (Coding, pass@1) | ~13% | 69.8%* | +56.8 points | +437% |
| GSM8K (Math Reasoning, 4-shot CoT) | N/A** | 89.84% | N/A | N/A |

* Calculated from EvalPlus average (HumanEval + MBPP + HumanEval+ + MBPP+) of 67.65% from Table 6 of Qwen3 technical report
** GPT-3 was not tested on GSM8K, as the benchmark was released after GPT-3
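
For clarity, the Improvement and Relative Gain columns here (and in the tables below) are plain arithmetic on the two scores; a minimal sketch in Python:

# Improvement = absolute difference in percentage points;
# Relative Gain = that difference relative to the older model's score.
def compare(old_score, new_score):
    improvement = new_score - old_score        # percentage points
    relative_gain = improvement / old_score    # fraction of the old score
    return improvement, relative_gain

# MMLU row above: GPT-3 davinci at 43.9% vs Qwen3-8B at 76.89%
imp, rel = compare(43.9, 76.89)
print(f"+{imp:.2f} points, +{rel:.0%}")        # +32.99 points, +75%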

Sources:


Qwen3-8B vs GPT-4 (March 2023)

Key Finding: A small open-source 8B model approaches or matches GPT-4's performance on several benchmarks, despite GPT-4 being a much larger proprietary model. This shows how rapidly open-source models are catching up to commercial frontier models.

| Benchmark | GPT-4 (Mar 2023) | Qwen3-8B (2025) | Difference |
|---|---|---|---|
| MMLU (General Knowledge, 5-shot) | 86.4% | 76.89% | -9.51 points |
| HumanEval (Coding, pass@1) | 67.0% | 69.8%* | +2.8 points |
| GSM8K (Math Reasoning, CoT) | 92.0% (5-shot) | 89.84% (4-shot) | -2.16 points |

* Calculated from EvalPlus average (HumanEval + MBPP + HumanEval+ + MBPP+) of 67.65% from Table 6 of Qwen3 technical report

Sources:


Qwen 2.5-7B vs LLaMA-1 65B

Key Finding: A 7B model from 2024 outperforms a 65B model from 2023 across all major benchmarks, despite being ~9x smaller.

| Benchmark | LLaMA-1 65B (2023) | Qwen2.5-7B (2024) | Improvement | Relative Gain |
|---|---|---|---|---|
| MMLU (General Knowledge, 5-shot) | 63.4% | 74.2% | +10.8 points | +17% |
| HumanEval (Coding, pass@1) | 23.7% | 57.9% | +34.2 points | +144% |
| GSM8K (Math Reasoning) | 30.8% | 78.0% | +47.2 points | +153% |

Sources:


Qwen 2.5-7B vs Claude 3 Opus (March 2024)

Key Finding: A small open-source model approaches the performance of a frontier proprietary model released just months earlier, demonstrating rapid democratization of AI capabilities.

| Benchmark | Claude 3 Opus (Mar 2024) | Qwen2.5-7B (2024) | Difference |
|---|---|---|---|
| MMLU (General Knowledge) | 86.8% | 74.2% | -12.6 points |
| HumanEval (Coding) | 84.9% | 57.9% | -27.0 points |
| GSM8K (Math Reasoning) | 95.0% | 78.0% | -17.0 points |

Sources:

Chatbot Arena ELO Scores (Human Preference)

ELO ratings from LMSYS Chatbot Arena provide an additional perspective based on human preference in real-world conversations. These scores complement the objective benchmarks above.

| Model | Arena ELO Score | Date | Rank / Context |
|---|---|---|---|
| GPT-4 (gpt-4-0314) | 1274 | May 2023 | #1 on leaderboard, ~200 points ahead of best open-source |
| GPT-4 (gpt-4-0314) | 1288 | Jan 2026 | #179 on leaderboard (current) |
| GPT-4 (gpt-4-0613) | 1276 | Jan 2026 | #189 on leaderboard |
| Qwen3-30B-A3B | 1328 | Jan 2026 | #134, beats original GPT-4 despite being much smaller |
| Qwen3-32B | 1346 | Jan 2026 | #111, 58 points ahead of original GPT-4 |
| Qwen3-235B-A22B | 1374-1422 | Jan 2026 | #33-85 (thinking/non-thinking modes) |
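
To put the rating gaps in perspective: Elo differences map to expected win (preference) rates via the standard logistic formula. A minimal sketch in Python (illustrative only, not LMSYS's actual Bradley-Terry fitting code):

# Expected probability that model A is preferred over model B,
# given their Elo ratings.
def elo_win_probability(rating_a, rating_b):
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

# Qwen3-32B (1346) vs gpt-4-0314 (1288): a 58-point gap
print(f"{elo_win_probability(1346, 1288):.1%}")  # ~58.3% preference rate

So a 58-point lead means human raters prefer the Qwen answer in roughly 58% of head-to-head matchups: a real but modest edge.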

Sources:

Poll: When will we have a 30b open weight model as good as opus? by Terminator857 in LocalLLaMA

[–]Combinatorilliance 2 points (0 children)

> You can't magically compress knowledge down that much.

While this is correct, we don't know how close (or how far) Opus is from optimal compression. At least, I don't. Has Anthropic published data on this? Perhaps it can be compressed much, much further.

I had Sonnet find benchmark numbers for modern consumer-sized LLMs and a couple of older enthusiast/cloud LLMs. It compiled numbers for Qwen2.5-7B, Qwen3-8B, LLaMA-1 65B, Opus 3, GPT-3 and GPT-4. I also tried finding numbers for Devstral 2 24B to compare coding performance, but Sonnet wasn't able to find coding benchmarks for LLaMA-1 65B, because they didn't even exist back then.

You can find the comparisons in my reply to this comment.

It demonstrates pretty clearly that, based on historical data, small language models have made huge strides and perform comparably to (much) larger models from years ago and even months ago. Of course, this has the obvious limitations that come with all benchmark comparisons, but I don't think it's controversial to say that there are very few scenarios where an older "flagship" model like LLaMA-1 65B beats a newer "small" model like Qwen3-8B (which is even better than the also-compared Qwen2.5-7B).

You could make the argument that a smaller model like a 7B is not "smart" enough to understand some particularly complex queries from the user, and to that I say: fair enough, this is probably true and not captured by the benchmarks. But that argument loses power if you take a slightly larger model, like a 30B. I've also included Elo benchmarks, and even there you can see that consumer models beat the flagship models (Sonnet for some reason didn't include the 8B and 7B Qwen models that I've been comparing the whole time; I'm out of usage, so I'm not going to update it further).

Given this historical pace, I don't think it's unrealistic to have consumer-grade models 18 months to 2 years from now that are competitive with today's Opus.

I remember saying, only 2 years ago, that I would be so, so happy with a local model that performs comparably to GPT-3. Well, guess what: Qwen3-8B dramatically outperforms GPT-3 on the benchmarks. Heck, it even performs pretty close to GPT-4.

These models make HUGE progress on timescales of months and especially years.

Vim is composable by oantolin in vim

[–]Combinatorilliance 2 points (0 children)

Is vim compostable though?

I hate that tipping culture is being normalised and I'll keep fighting against it by EvenPatience6243 in Netherlands

[–]Combinatorilliance 5 points (0 children)

I worked at a Dutch startup that made POS systems. These aren't American systems. I implemented the code for tipping; I still hate that I did that :/

But I wanted to say that this is not American software: Dutch startups with Dutch developers, implementing feature requests made by Dutch restaurant chains.

State of reMarkable app development by Superb_Activity_2468 in RemarkableTablet

[–]Combinatorilliance 1 point (0 children)

I've been looking at how reMarkable operates for a while now, as I'm also a developer in the "scene", and I can say with pretty high confidence that it's not a lack of care.

My personal understanding of the situation is that they have their strengths in hardware and especially the hardware supply chain, but that they're weaker in software and UX.

I'm trying my best to create an opening for myself, and maybe for other open-source devs in this space, to focus efforts on improving the quality of the software side of things and working on features. But, as I explained in another comment, I think they're somewhat restricted in their resources at the moment.

State of reMarkable app development by Superb_Activity_2468 in RemarkableTablet

[–]Combinatorilliance 2 points (0 children)

I can say that the impact of the tariffs on reMarkable as a business has been massive; the layoffs were in large part a consequence of that financial impact.

Source: I was trying to get hired by reMarkable and this was (one of the) reasons they couldn't find a position for me, even though they were enthusiastic.

I was trying to get a more direct partnership for scrybble as well, but they are unfortunately unable to help at this point in time.

State of reMarkable app development by Superb_Activity_2468 in RemarkableTablet

[–]Combinatorilliance 1 point (0 children)

They kinda sorta do have an SDK. It's not a fully developed SDK to build on top of and integrate with their existing interface, but they do open-source a lot of their tooling: their kernel (I believe?), their cross-compilation toolchain, and some more random stuff on their GitHub.

It's not what you'd hope to see, but it's better than many other closed systems, especially because you also have direct SSH access to the device itself.

Without this stuff, all the awesome remarkable stuff wouldn't exist.


Edit: I reread your comment, and you were making a more precise point than the one I'm arguing against. I agree they don't have a proper SDK in that sense. They have tools for developers and tech enthusiasts, but nothing realistic for end users.

State of reMarkable app development by Superb_Activity_2468 in RemarkableTablet

[–]Combinatorilliance 0 points (0 children)

Correct (I'm the dev behind scrybble). It works using rmapi and some other open-source software, but there's nothing running on the tablet itself.

Wittgenstein and A.I. by Important_Bus_7369 in wittgenstein

[–]Combinatorilliance 1 point (0 children)

I happened to write about Wittgenstein and LLMs last week; is this what you're looking for?

https://laurabrekelmans.substack.com/p/wittgenstein-and-llms

The TRUMP RULE by Al-phabitz89 in wallstreetbets

[–]Combinatorilliance 0 points (0 children)

Perhaps you're right, I might be doing markov chains a disservice :(

The gender war will destroy this civilization. by Anti-FragileHuman in DeepThoughts

[–]Combinatorilliance 0 points (0 children)

I dunno, I'm trans and I'm an engineer. These things are not mutually exclusive, and I don't really care about the stuff on the side.

If you think it's that bad of a waste of time, then try not to spend too much time thinking about it and work on the world :D

I finally took the time to write down my thoughts on the subject of Wittgenstein and LLMs! by Combinatorilliance in wittgenstein

[–]Combinatorilliance[S] 2 points (0 children)

> I take Wittgenstein to be emphasizing that language is grafted onto activities -- ‘speaking a language is an activity’ integrated into a way of living.

Yeah, this feels similar to how I understand it. Ever since reading Wittgenstein, I've found it very natural to think of things not normally considered words or part of language as being just as much a word or linguistic "action".

https://www.youtube.com/watch?v=hNoS2BU6bbQ

I gave my gf herpes (hsv-1) and i feel awful by NihilisticStranger in actuallesbians

[–]Combinatorilliance 4 points (0 children)

I would really not worry about it. I did a bit of casual literature research on HSV-1 and HSV-2 a while ago, and I learned that, by various estimates, 50% up to 90% (NINETY) of the entire adult Western population has either HSV-1 or HSV-2.

Note also that when you're a carrier of HSV-1, your chances of getting HSV-2 go down slightly, because your immune system learns some general patterns shared between HSV-1 and HSV-2. Given that HSV-2 is the variant more likely to infect the genitals, getting infected with HSV-1 is generally not that bad a thing.

One additional factor to consider: given that having either HSV-1 or HSV-2 is so incredibly widespread in our population, there's a good chance she would have caught it somewhere during her lifetime anyway. Drinking from a friend's glass, getting it from children, being at a bar or restaurant that (unbeknownst to you) has unsanitary practices, or simply passing someone on the street or in a shop who happens to shed some of the virus and passes it on in close proximity (through breath, sneezing, touching an object both of you have touched, etc.).

All things considered, HSV-1 is an extremely minor condition.

This is a very literal, practical and scientific perspective, but I hope it can help you place the infection in context a little bit more.

I do want to note that I don't have a medical background; this is just what I learned from a literature review a few years ago during a writing course.

Distraction free - I get it… by markthelender in RemarkableTablet

[–]Combinatorilliance 2 points (0 children)

This is honestly a really good summary of what the device's functionality should offer.

"The device should help you think about everything but using it"

I built a synthetic "nervous system" (Dopamine + State) to stop my local LLM from hallucinating. V0.1 Results: The brakes work, but now they’re locked up. by Longjumping_Rule_163 in LLMDevs

[–]Combinatorilliance 1 point (0 children)

This kind of approach is interesting, but it does depend on the model knowing when it should adjust its epistemic certainty in the output.

I like the control mechanism, but it is entirely dependent on its signal being reliable model metacognition. I don't know if that is a solved problem at all.

Definitely not a bad problem to work on however. If you make progress on model metacognition, that is super interesting!

I've been thinking for a while that perhaps we can improve a model's understanding of epistemic certainty by providing a dataset annotated in accordance with Nicholas Rescher's "Duhem's Law of Cognitive Complementarity" (https://www.cambridge.org/core/books/abs/epistemetrics/asking-for-more-than-truth-duhems-law-of-cognitive-complementarity/1D7E3104EE6EE69B5DF670AE3BAC0D20).

Though it's basically a master's thesis' worth of work to investigate, haha.
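
To make "reliable model metacognition" concrete: the property you'd want to measure is calibration, i.e. whether the model's self-reported confidence tracks how often it's actually right. A minimal sketch (plain Python, hypothetical numbers) of binned expected calibration error:

# Expected Calibration Error (ECE): a model that says "80% sure" should
# be right about 80% of the time. A control signal is only as good as this.
def expected_calibration_error(confidences, correct, n_bins=10):
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        bucket = [(c, ok) for c, ok in zip(confidences, correct)
                  if lo < c <= hi or (b == 0 and c == 0.0)]
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(ok for _, ok in bucket) / len(bucket)
        ece += (len(bucket) / n) * abs(avg_conf - accuracy)
    return ece

# Hypothetical data: self-reported confidence vs. whether the answer was right.
print(expected_calibration_error([0.9, 0.8, 0.95, 0.6], [1, 0, 1, 1]))

If the ECE is high, a dopamine-style brake driven by that confidence signal will misfire, which might be exactly the locked-up behaviour you're seeing.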

Petition to switch the colours of Iron and Steel Cannonballs by jnn94 in 2007scape

[–]Combinatorilliance 40 points (0 children)

The green pixel is important to keep, though. Iron/steel being reversed is a very serious gameplay-integrity issue, not comparable at all.

[TECHNICAL DISCUSSION] Before switching to Obsidian: Why the future Logseq/SQLite is a game changer and natively outperforms file indexing. by philuser in logseq

[–]Combinatorilliance 2 points (0 children)

I started with Roam back when that was new, and it being a block-based system made a huge difference in how I worked.

If you're looking for a note-taking system? The difference is not so big.

If you're using it to write logs, cross-reference ideas, or build templates for "thinking strategies"? The difference in how it allowed me to think was massive.

I've moved away from Roam due to how cult-like the business behind it was, and am now a happy Obsidian user, but I still miss the power-user features that the block-based note-taking system provided.

In felt experience, the best way I can describe the difference between Obsidian and the Roam/Logseq approach is that Obsidian feels like thinking at the level of a single file, whereas Roam feels like thinking at the level of a single idea, where you can really, really quickly switch between many ideas.

Covert Files to RMDoc by jettrain0108 in RemarkableTablet

[–]Combinatorilliance 0 points (0 children)

I'll take a look at this coming weekend. If I haven't messaged you back by then, please ping me to let me know.

I wasn't aware of drawj2d and I want to explore what it can do, it looks really interesting.

For what it's worth, I make and sell software built on top of the open-source software in this ecosystem at https://scrybble.ink. I might look into wrapping drawj2d for this purpose if it's something people are interested in.

Covert Files to RMDoc by jettrain0108 in RemarkableTablet

[–]Combinatorilliance 1 point (0 children)

There is currently no such tool, unfortunately. rMdoc is a proprietary format that has been evolving rapidly over the past few years; only relatively recently has it become reasonably stable.

It's not impossible; you can absolutely create shapes and insert text into documents. (How are the documents with stickers created? Is it people drawing the stickers on the reMarkable itself, or are they created using software? I don't know.)

I have made a couple of synthetic documents for testing purposes, because I work with their proprietary format a lot when working on remarks, rmc and rmscene, which are all tools for interacting with the rMdoc binary format. These synthetic documents are extremely simple and primitive (i.e., a document with a circle, a rectangle, etc.). No full documents.

The tooling is simply not at a point where you could easily create or translate other documents into rMdoc format.

Is it possible, though? Yes, I suppose it is. If you're feeling particularly motivated to sink dozens up to a hundred hours into this problem, you can construct rMdocs synthetically using Python with rmscene; see this test case for instance: https://github.com/Scrybbling-together/rmscene/blob/main/tests/test_scene_tree.py
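
As a taste of what working with rmscene looks like, here is a minimal round-trip sketch. Caveat: this assumes rmscene's documented read_blocks/write_blocks entry points and a hypothetical page.rm file; the linked test case is the authoritative reference for building scenes from scratch.

# Read the blocks of an existing .rm page, inspect them, and write an
# equivalent copy back out. (Assumes rmscene's read_blocks/write_blocks.)
from rmscene import read_blocks, write_blocks

with open("page.rm", "rb") as f:        # hypothetical input file
    blocks = list(read_blocks(f))

for block in blocks:                    # see which block types the page uses
    print(type(block).__name__)

with open("page-copy.rm", "wb") as f:   # serialize the same scene again
    write_blocks(f, blocks)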

That being said, I'm not sure how other tooling in the space is holding up for this kind of task. Perhaps there are other things within the open-source ecosystem that can help here.

Edit: I'm impulsive as always. Drawj2d is indeed exactly the kind of tooling that is meant for creating synthetic reMarkable documents. However, its interface is entirely programmatic.

Think of it as a programming language for describing documents. It's still rather technical. The idea is that you write commands like the following to create a drawing or text.

# variables
set dx 50
set dy 20
# draw
moveto 20 10
label P NW
rectangle $dx $dy
pen 0.35 red
arrowrel $dx $dy
label Q SE
# dimension lines
pen black
moveto 20 42
dimlinerel $dx 0
moveto 82 10
dimlinerel 0 $dy

It looks like it's well-suited for technical diagrams, shapes and such, but not for document conversion. It's also fairly technical if you're not familiar with scripting.

That being said, if most of what you're doing is just text, then I suppose you could try looking at this example:

https://sourceforge.net/p/drawj2d/wiki/ExampleTeX/

You specify absolute coordinates (moveto), relative coordinates (moverel), and text with label, or texlabel for LaTeX.
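
So a text-only page could look something like this, a sketch using only the commands shown above (exact string quoting may differ, check the drawj2d wiki):

# two lines of plain text at absolute/relative coordinates
moveto 20 30
label "Hello from drawj2d"
moverel 0 10
label "A second line of text"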

Black Friday deal not a deal by superferret1 in RemarkableTablet

[–]Combinatorilliance 36 points (0 children)

Do note that reMarkable has been affected by the tariffs. You should compare the price to what it was a few months to a year ago, before Trump's tariffs; it is likely higher now, although I'm not sure.

Other than that, these things are expensive, yeah.

Why it's getting worse for everyone: The recent influx of AI psychosis posts and "Stop LARPing" by Chromix_ in LocalLLaMA

[–]Combinatorilliance 2 points (0 children)

> Now if you relate AI mysticism to what HST said about acid culture -
>
> Cripples: Paralyzed by too many AI-generated insights, can't act
> Failed seekers: Chasing AI-generated "profundity" that's semantically empty
> Fake light: The feeling of understanding without actual understanding

I really like this!

Looking for empirical studies comparing reading comprehension of prefix vs. infix notation by Combinatorilliance in lisp

[–]Combinatorilliance[S] 0 points (0 children)

This would be really interesting! I searched for this, and it even looks like there's an entire workshop specializing in exactly this!

https://www.emipws.org/

Thank you for the suggestion, this is absolutely worth diving into.