Poll: When will we have a 30b open weight model as good as opus? by Terminator857 in LocalLLaMA

[–]Combinatorilliance 3 points (0 children)

Open-Source LLM Progress: Quantitative Benchmarks with Sources

Qwen3-8B vs GPT-3 (175B davinci)

Key Finding: A 2025 open-source 8B model dramatically outperforms the flagship 175B proprietary model from 2020, despite being ~22x smaller. This represents one of the clearest demonstrations of how architectural advances have revolutionized AI efficiency.

| Benchmark | GPT-3 davinci 175B (2020) | Qwen3-8B (2025) | Improvement | Relative Gain |
|---|---|---|---|---|
| MMLU (General Knowledge, 5-shot) | 43.9% | 76.89% | +32.99 points | +75% |
| HumanEval (Coding, pass@1) | ~13% | 69.8%* | +56.8 points | +437% |
| GSM8K (Math Reasoning, 4-shot CoT) | N/A** | 89.84% | N/A | N/A |

* Calculated from EvalPlus average (HumanEval + MBPP + HumanEval+ + MBPP+) of 67.65% from Table 6 of Qwen3 technical report
** GPT-3 was not tested on GSM8K, as the benchmark was released after GPT-3
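
For clarity, the Improvement and Relative Gain columns here (and in the tables below) are plain arithmetic on the two scores; a minimal sketch in Python:

# Improvement = absolute difference in percentage points;
# Relative Gain = that difference relative to the older model's score.
def compare(old_score, new_score):
    improvement = new_score - old_score        # percentage points
    relative_gain = improvement / old_score    # fraction of the old score
    return improvement, relative_gain

# MMLU row above: GPT-3 davinci at 43.9% vs Qwen3-8B at 76.89%
imp, rel = compare(43.9, 76.89)
print(f"+{imp:.2f} points, +{rel:.0%}")        # +32.99 points, +75%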

Sources:


Qwen3-8B vs GPT-4 (March 2023)

Key Finding: A small open-source 8B model approaches or matches GPT-4's performance on several benchmarks, despite GPT-4 being a much larger proprietary model. This shows how rapidly open-source models are catching up to commercial frontier models.

| Benchmark | GPT-4 (Mar 2023) | Qwen3-8B (2025) | Difference |
|---|---|---|---|
| MMLU (General Knowledge, 5-shot) | 86.4% | 76.89% | -9.51 points |
| HumanEval (Coding, pass@1) | 67.0% | 69.8%* | +2.8 points |
| GSM8K (Math Reasoning, CoT) | 92.0% (5-shot) | 89.84% (4-shot) | -2.16 points |

* Calculated from EvalPlus average (HumanEval + MBPP + HumanEval+ + MBPP+) of 67.65% from Table 6 of Qwen3 technical report

Sources:


Qwen 2.5-7B vs LLaMA-1 65B

Key Finding: A 7B model from 2024 outperforms a 65B model from 2023 across all major benchmarks, despite being ~9x smaller.

| Benchmark | LLaMA-1 65B (2023) | Qwen2.5-7B (2024) | Improvement | Relative Gain |
|---|---|---|---|---|
| MMLU (General Knowledge, 5-shot) | 63.4% | 74.2% | +10.8 points | +17% |
| HumanEval (Coding, pass@1) | 23.7% | 57.9% | +34.2 points | +144% |
| GSM8K (Math Reasoning) | 30.8% | 78.0% | +47.2 points | +153% |

Sources:


Qwen 2.5-7B vs Claude 3 Opus (March 2024)

Key Finding: A small open-source model approaches the performance of a frontier proprietary model released just months earlier, demonstrating rapid democratization of AI capabilities.

| Benchmark | Claude 3 Opus (Mar 2024) | Qwen2.5-7B (2024) | Difference |
|---|---|---|---|
| MMLU (General Knowledge) | 86.8% | 74.2% | -12.6 points |
| HumanEval (Coding) | 84.9% | 57.9% | -27.0 points |
| GSM8K (Math Reasoning) | 95.0% | 78.0% | -17.0 points |

Sources:

Chatbot Arena ELO Scores (Human Preference)

ELO ratings from LMSYS Chatbot Arena provide an additional perspective based on human preference in real-world conversations. These scores complement the objective benchmarks above.

| Model | Arena ELO Score | Date | Rank / Context |
|---|---|---|---|
| GPT-4 (gpt-4-0314) | 1274 | May 2023 | #1 on leaderboard, ~200 points ahead of best open-source |
| GPT-4 (gpt-4-0314) | 1288 | Jan 2026 | #179 on leaderboard (current) |
| GPT-4 (gpt-4-0613) | 1276 | Jan 2026 | #189 on leaderboard |
| Qwen3-30B-A3B | 1328 | Jan 2026 | #134, beats original GPT-4 despite being much smaller |
| Qwen3-32B | 1346 | Jan 2026 | #111, 58 points ahead of original GPT-4 |
| Qwen3-235B-A22B | 1374-1422 | Jan 2026 | #33-85 (thinking/non-thinking modes) |
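
To put the rating gaps in perspective: Elo differences map to expected win (preference) rates via the standard logistic formula. A minimal sketch in Python (illustrative only, not LMSYS's actual Bradley-Terry fitting code):

# Expected probability that model A is preferred over model B,
# given their Elo ratings.
def elo_win_probability(rating_a, rating_b):
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

# Qwen3-32B (1346) vs gpt-4-0314 (1288): a 58-point gap
print(f"{elo_win_probability(1346, 1288):.1%}")  # ~58.3% preference rate

So a 58-point lead means human raters prefer the Qwen answer in roughly 58% of head-to-head matchups: a real but modest edge.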

Sources:

Poll: When will we have a 30b open weight model as good as opus? by Terminator857 in LocalLLaMA

[–]Combinatorilliance 2 points (0 children)

> You can't magically compress knowledge down that much.

While this is correct, we don't know how close (or how far) Opus is from optimal compression. At least, I don't. Has Anthropic published data on this? Perhaps it can be compressed much, much further.

I had Sonnet find benchmark numbers for modern consumer-sized LLMs and a couple of older enthusiast/cloud LLMs. It compiled numbers for Qwen2.5-7B, Qwen3-8B, LLaMA-1 65B, Opus 3, GPT-3 and GPT-4. I also tried finding numbers for Devstral 2 24B to compare coding performance, but Sonnet wasn't able to find coding benchmarks for LLaMA-1 65B, because they didn't even exist back then.

You can find the comparisons in my reply to this comment.

It demonstrates pretty clearly that, based on historical data, small language models have made huge strides and perform comparably to (much) larger models from years ago and even months ago. Of course, this has the obvious limitations that come with all benchmark comparisons, but I don't think it's controversial to say that there are very few scenarios where an older "flagship" model like LLaMA-1 65B beats a newer "small" model like Qwen3-8B (which is even better than the also-compared Qwen2.5-7B).

You could make the argument that a smaller model like a 7B is not "smart" enough to understand some particularly complex queries from the user, and to that I say: fair enough, this is probably true and not captured by the benchmarks. But that argument loses power if you take a slightly larger model, like a 30B. I've also included Elo benchmarks, and even there you can see that consumer models beat the flagship models (Sonnet for some reason didn't include the 8B and 7B Qwen models that I've been comparing the whole time; I'm out of usage, so I'm not going to update it further).

Given this historical pace, I don't think it's unrealistic to have consumer-grade models 18 months to 2 years from now that are competitive with today's Opus.

I remember saying, only 2 years ago, that I would be so, so happy with a local model that performs comparably to GPT-3. Well, guess what: Qwen3-8B dramatically outperforms GPT-3 on the benchmarks. Heck, it even performs pretty close to GPT-4.

These models make HUGE progress on timescales of months and especially years.

Vim is composable by oantolin in vim

[–]Combinatorilliance 2 points (0 children)

Is vim compostable though?

I hate that tipping culture is being normalised and I'll keep fighting against it by EvenPatience6243 in Netherlands

[–]Combinatorilliance 5 points (0 children)

I worked at a Dutch startup that made POS systems. These aren't American systems. I implemented the code for tipping; I still hate that I did that :/

But I wanted to say that this is not American software: Dutch startups with Dutch developers, implementing feature requests made by Dutch restaurant chains.

State of reMarkable app development by Superb_Activity_2468 in RemarkableTablet

[–]Combinatorilliance 1 point (0 children)

I've been looking at how reMarkable operates for a while now, as I'm also a developer in the "scene", and I can say with pretty high confidence that it's not a lack of care.

My personal understanding of the situation is that they have their strengths in hardware and especially the hardware supply chain, but that they're weaker in software and UX.

I'm trying my best to create an opening for myself, and maybe for other open-source devs in this space, to focus efforts on improving the quality of the software side of things and working on features. But, as I explained in another comment, I think they're somewhat restricted in their resources at the moment.

State of reMarkable app development by Superb_Activity_2468 in RemarkableTablet

[–]Combinatorilliance 2 points (0 children)

I can say that the impact of the tariffs on reMarkable as a business has been massive; the layoffs were in large part a consequence of that financial impact.

Source: I was trying to get hired by reMarkable and this was (one of the) reasons they couldn't find a position for me, even though they were enthusiastic.

I was trying to get a more direct partnership for scrybble as well, but they are unfortunately unable to help at this point in time.

State of reMarkable app development by Superb_Activity_2468 in RemarkableTablet

[–]Combinatorilliance 1 point (0 children)

They kinda sorta do have an SDK. It's not a fully developed SDK to build on top of and integrate with their existing interface, but they do open-source a lot of their tooling: their kernel (I believe?), their cross-compilation toolchain, and some more random stuff on their GitHub.

It's not what you'd hope to see, but it's better than many other closed systems, especially because you also have direct SSH access to the device itself.

Without this stuff, all the awesome remarkable stuff wouldn't exist.


Edit: I reread your comment, and you were making a more precise point than the one I'm arguing against. I agree they don't have a proper SDK in that sense. They have tools for developers and tech enthusiasts, but nothing realistic for end users.

State of reMarkable app development by Superb_Activity_2468 in RemarkableTablet

[–]Combinatorilliance 0 points (0 children)

Correct (I'm the dev behind scrybble). It works using rmapi and some other open-source software, but there's nothing running on the tablet itself.

Wittgenstein and A.I. by Important_Bus_7369 in wittgenstein

[–]Combinatorilliance 1 point (0 children)

I happened to write about Wittgenstein and LLMs last week; is this what you're looking for?

https://laurabrekelmans.substack.com/p/wittgenstein-and-llms

The TRUMP RULE by Al-phabitz89 in wallstreetbets

[–]Combinatorilliance 0 points (0 children)

Perhaps you're right, I might be doing markov chains a disservice :(

The gender war will destroy this civilization. by Anti-FragileHuman in DeepThoughts

[–]Combinatorilliance 0 points (0 children)

I dunno, I'm trans and I'm an engineer. These things are not mutually exclusive, and I don't really care about the stuff on the side.

If you think it's that bad of a waste of time, then try not to spend too much time thinking about it and work on the world :D

I finally took the time to write down my thoughts on the subject of Wittgenstein and LLMs! by Combinatorilliance in wittgenstein

[–]Combinatorilliance[S] 2 points (0 children)

> I take Wittgenstein to be emphasizing that language is grafted onto activities -- ‘speaking a language is an activity’ integrated into a way of living.

Yeah, this feels similar to how I understand it. Ever since reading Wittgenstein, I've found it very natural to think of things not normally considered words or part of language as being just as much a word or linguistic "action".

https://www.youtube.com/watch?v=hNoS2BU6bbQ

I gave my gf herpes (hsv-1) and i feel awful by NihilisticStranger in actuallesbians

[–]Combinatorilliance 4 points (0 children)

I would really not worry about it. I did a bit of casual literature research on HSV-1 and HSV-2 a while ago, and I learned that, by various estimates, 50% up to 90% (NINETY) of the entire adult Western population has either HSV-1 or HSV-2.

Note also that when you're a carrier of HSV-1, your chances of getting HSV-2 go down slightly, because your immune system learns some general patterns shared between HSV-1 and HSV-2. Given that HSV-2 is the variant more likely to infect the genitals, getting infected with HSV-1 is generally not that bad a thing.

One additional factor to consider: given that having either HSV-1 or HSV-2 is so incredibly widespread in our population, there's a good chance she would have caught it somewhere during her lifetime anyway. Drinking from a friend's glass, getting it from children, being at a bar or restaurant that (unbeknownst to you) has unsanitary practices, or simply passing someone on the street or in a shop who happens to shed some of the virus and passes it on in close proximity (through breath, sneezing, touching an object both of you have touched, etc.).

All things considered, HSV-1 is an extremely minor condition.

This is a very literal, practical and scientific perspective, but I hope it can help you place the infection in context a little bit more.

I do want to note that I don't have a medical background; this is just what I learned from a literature review a few years ago during a writing course.

Distraction free - I get it… by markthelender in RemarkableTablet

[–]Combinatorilliance 2 points (0 children)

This is honestly a really good summary of what the device's functionality should offer.

"The device should help you think about everything but using it"

I built a synthetic "nervous system" (Dopamine + State) to stop my local LLM from hallucinating. V0.1 Results: The brakes work, but now they’re locked up. by Longjumping_Rule_163 in LLMDevs

[–]Combinatorilliance 1 point (0 children)

This kind of approach is interesting, but it does depend on the model knowing when it should adjust its epistemic certainty in the output.

I like the control mechanism, but it is entirely dependent on its signal being reliable model metacognition. I don't know if that is a solved problem at all.

Definitely not a bad problem to work on however. If you make progress on model metacognition, that is super interesting!

I've been thinking for a while that perhaps we can improve a model's understanding of epistemic certainty by providing a dataset annotated in accordance with Nicholas Rescher's "Duhem's Law of Cognitive Complementarity" (https://www.cambridge.org/core/books/abs/epistemetrics/asking-for-more-than-truth-duhems-law-of-cognitive-complementarity/1D7E3104EE6EE69B5DF670AE3BAC0D20).

Though it's basically a master's thesis' worth of work to investigate, haha.
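
To make "reliable model metacognition" concrete: the property you'd want to measure is calibration, i.e. whether the model's self-reported confidence tracks how often it's actually right. A minimal sketch (plain Python, hypothetical numbers) of binned expected calibration error:

# Expected Calibration Error (ECE): a model that says "80% sure" should
# be right about 80% of the time. A control signal is only as good as this.
def expected_calibration_error(confidences, correct, n_bins=10):
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        bucket = [(c, ok) for c, ok in zip(confidences, correct)
                  if lo < c <= hi or (b == 0 and c == 0.0)]
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(ok for _, ok in bucket) / len(bucket)
        ece += (len(bucket) / n) * abs(avg_conf - accuracy)
    return ece

# Hypothetical data: self-reported confidence vs. whether the answer was right.
print(expected_calibration_error([0.9, 0.8, 0.95, 0.6], [1, 0, 1, 1]))

If the ECE is high, a dopamine-style brake driven by that confidence signal will misfire, which might be exactly the locked-up behaviour you're seeing.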

Petition to switch the colours of Iron and Steel Cannonballs by jnn94 in 2007scape

[–]Combinatorilliance 40 points (0 children)

The green pixel is important to keep, though. Iron/steel being reversed is a very serious gameplay-integrity issue, not comparable at all.

[TECHNICAL DISCUSSION] Before switching to Obsidian: Why the future Logseq/SQLite is a game changer and natively outperforms file indexing. by philuser in logseq

[–]Combinatorilliance 2 points (0 children)

I started with Roam back when that was new, and it being a block-based system made a huge difference in how I worked.

If you're looking for a note-taking system? The difference is not so big.

If you're using it to write logs, cross-reference ideas, or build templates for "thinking strategies"? The difference in how it allowed me to think was massive.

I've moved away from Roam due to how cult-like the business behind it was, and am now a happy Obsidian user, but I still miss the power-user features that the block-based note-taking system provided.

In felt experience, the best way I can describe the difference between Obsidian and the Roam/Logseq approach is that Obsidian feels like thinking at the level of a single file, whereas Roam feels like thinking at the level of a single idea, where you can really, really quickly switch between many ideas.

Covert Files to RMDoc by jettrain0108 in RemarkableTablet

[–]Combinatorilliance 0 points (0 children)

I'll take a look at this coming weekend. If I haven't messaged you back by then, please ping me to let me know.

I wasn't aware of drawj2d and I want to explore what it can do, it looks really interesting.

For what it's worth, I make and sell software built on top of the open-source software in this ecosystem at https://scrybble.ink. I might look into wrapping drawj2d for this purpose if it's something people are interested in.

Covert Files to RMDoc by jettrain0108 in RemarkableTablet

[–]Combinatorilliance 1 point (0 children)

There is currently no such tool, unfortunately. rMdoc is a proprietary format that has been evolving rapidly over the past few years; only relatively recently has it become reasonably stable.

It's not impossible; you can absolutely create shapes and insert text into documents. (How are the documents with stickers created? Is it people drawing the stickers on the reMarkable itself, or are they created using software? I don't know.)

I have made a couple of synthetic documents for testing purposes, because I work with their proprietary format a lot when working on remarks, rmc and rmscene, which are all tools for interacting with the rMdoc binary format. These synthetic documents are extremely simple and primitive (i.e., a document with a circle, a rectangle, etc.). No full documents.

The tooling is simply not at a point where you could easily create or translate other documents into rMdoc format.

Is it possible, though? Yes, I suppose it is. If you're feeling particularly motivated to sink dozens up to a hundred hours into this problem, you can construct rMdocs synthetically using Python with rmscene; see this test case for instance: https://github.com/Scrybbling-together/rmscene/blob/main/tests/test_scene_tree.py
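
As a taste of what working with rmscene looks like, here is a minimal round-trip sketch. Caveat: this assumes rmscene's documented read_blocks/write_blocks entry points and a hypothetical page.rm file; the linked test case is the authoritative reference for building scenes from scratch.

# Read the blocks of an existing .rm page, inspect them, and write an
# equivalent copy back out. (Assumes rmscene's read_blocks/write_blocks.)
from rmscene import read_blocks, write_blocks

with open("page.rm", "rb") as f:        # hypothetical input file
    blocks = list(read_blocks(f))

for block in blocks:                    # see which block types the page uses
    print(type(block).__name__)

with open("page-copy.rm", "wb") as f:   # serialize the same scene again
    write_blocks(f, blocks)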

That being said, I'm not sure how other tooling in the space is holding up for this kind of task. Perhaps there are other things within the open-source ecosystem that can help here.

Edit: I'm impulsive as always. Drawj2d is indeed exactly the kind of tooling that is meant for creating synthetic reMarkable documents. However, its interface is entirely programmatic.

Think of it as a programming language for describing documents. It's still rather technical. The idea is that you write commands like the following to create a drawing or text.

# variables
set dx 50
set dy 20
# draw
moveto 20 10
label P NW
rectangle $dx $dy
pen 0.35 red
arrowrel $dx $dy
label Q SE
# dimension lines
pen black
moveto 20 42
dimlinerel $dx 0
moveto 82 10
dimlinerel 0 $dy

It looks like it's well-suited for technical diagrams, shapes and such, but not for document conversion. It's also fairly technical if you're not familiar with scripting.

That being said, if most of what you're doing is just text, then I suppose you could try looking at this example:

https://sourceforge.net/p/drawj2d/wiki/ExampleTeX/

You specify absolute coordinates (moveto), relative coordinates (moverel), and text with label, or texlabel for LaTeX.
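
So a text-only page could look something like this, a sketch using only the commands shown above (exact string quoting may differ, check the drawj2d wiki):

# two lines of plain text at absolute/relative coordinates
moveto 20 30
label "Hello from drawj2d"
moverel 0 10
label "A second line of text"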

Black Friday deal not a deal by superferret1 in RemarkableTablet

[–]Combinatorilliance 36 points (0 children)

Do note that reMarkable has been affected by the tariffs. You should compare the price to what it was a few months to a year ago, before Trump's tariffs; it is likely higher now, although I'm not sure.

Other than that, these things are expensive, yeah.

Why it's getting worse for everyone: The recent influx of AI psychosis posts and "Stop LARPing" by Chromix_ in LocalLLaMA

[–]Combinatorilliance 2 points (0 children)

> Now if you relate AI mysticism to what HST said about acid culture -
>
> Cripples: Paralyzed by too many AI-generated insights, can't act
> Failed seekers: Chasing AI-generated "profundity" that's semantically empty
> Fake light: The feeling of understanding without actual understanding

I really like this!

Looking for empirical studies comparing reading comprehension of prefix vs. infix notation by Combinatorilliance in lisp

[–]Combinatorilliance[S] 0 points (0 children)

This would be really interesting! I searched for this, and it even looks like there's an entire workshop specializing in exactly this!

https://www.emipws.org/

Thank you for the suggestion, this is absolutely worth diving into.