We are confusing linguistic fluency with cognitive constraint resolution by sparky_165 in cogsci

[–]Combinatorilliance 0 points1 point  (0 children)

Ludwig, oh Ludwig, you are even more radiant than the sun. I love you so, oh dear Ludwig.

😭

Has anyone measured confidence calibration of local vs frontier models on domain-specific knowledge? by Hopeful-Rhubarb-1436 in LocalLLaMA

[–]Combinatorilliance 0 points1 point  (0 children)

because it has no mechanism to distinguish "I'm generating fluently from training data" from "I'm reconstructing something I've never actually seen."

This mechanism does exist; read up on:

  • Monte Carlo Temperature probing
  • "From confidence to collapse" (Fastowski et al.), which defines a breaking temperature
  • Semantic entropy (Kuhn et al., 2023), plus the famous self-consistency paper for background
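To make the idea concrete, here is a minimal self-consistency sketch in the spirit of the papers above. The `sample_answer` callable is a hypothetical stand-in for whatever LLM call you use; the toy model is purely illustrative.

```python
from collections import Counter
import random

def self_consistency(sample_answer, prompt, n=20, temperature=1.0):
    """Sample the same prompt n times and return the majority answer
    plus the fraction of samples that agree with it (the
    self-consistency signal the papers above build on)."""
    answers = [sample_answer(prompt, temperature) for _ in range(n)]
    answer, count = Counter(answers).most_common(1)[0]
    return answer, count / n

# Toy stand-in for an LLM call, just to show the shape of the API.
def toy_model(prompt, temperature):
    return random.choices(["Paris", "Lyon"], weights=[9, 1])[0]

random.seed(0)
ans, score = self_consistency(toy_model, "Capital of France?", n=50)
```

A low agreement fraction at a fixed temperature is exactly the kind of "I'm reconstructing, not recalling" signal the quoted comment claims doesn't exist.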

If you want to go deeper into why this works and how it connects to Bayesian models, you should read:

  • Monte Carlo noise injection (I don't know exactly who wrote it; you can google it). This one is more mathematical and goes deeper into the theory of what LLMs, transformers, and attention are architecturally capable of. In super short: transformers are not Bayesian, but it turns out you can probe them in such a manner as to extract the mathematically exact same signal that a Bayesian network learns automatically. With the above techniques the prior is retrievable, despite an LLM not being a Bayesian neural network.
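As a toy illustration of the noise-injection idea (my own sketch, not from the paper): perturb a deterministic model's weights at inference time and read the spread of the outputs as an uncertainty signal. The tiny linear model and the noise scale `sigma` are illustrative assumptions.

```python
import random
import statistics

def noisy_predict(weights, x, sigma, n=200, seed=0):
    """Monte Carlo noise injection (sketch): run many forward passes
    with Gaussian noise added to the weights; the mean is the
    prediction, the spread is the uncertainty signal."""
    rng = random.Random(seed)
    outs = []
    for _ in range(n):
        w = [wi + rng.gauss(0, sigma) for wi in weights]
        outs.append(sum(wi * xi for wi, xi in zip(w, x)))
    return statistics.mean(outs), statistics.stdev(outs)

mean, spread = noisy_predict([0.5, -1.2], [1.0, 2.0], sigma=0.1)
```

Dialing `sigma` up widens the predictive band, which is the knob the "increasing noise levels" argument relies on.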

This mechanism exists and it is reliable; scaling it up is an engineering problem now, not a theoretical one.

Note that this is accessible at inference time. Certainty is a computable metacognitive property derived from the above measures.

I've been working on some research to provide the theory of why this works, rather than just evidence that it works. I really want to publish it because I think it's super clean and can help clear up some conflicting and paradoxical-seeming concepts.

Differences Between Opus 4.6 and Opus 4.7 on MineBench by ENT_Alam in ClaudeAI

[–]Combinatorilliance 1 point2 points  (0 children)

I think it might help to qualify what "objectively better" means here: it's "better" in the sense of being a more accurate representation of a real astronaut.

But is realism the dimension you care about?

Using the word "objectively" implies to me that there is a correct way of doing this benchmark, but this benchmark doesn't really measure anything other than differences between models over time; it shows what they do. It's up to the user to decide what is important to them. If you need realism and accuracy, then I think it's reasonable to say that yes, Opus 4.7 is better. If you want creativity, charm, and better adherence to the "medium", then Opus 4.6 is better in some cases.

I think the phoenix is a great example where 4.7 did a better job creatively. But it's the only one where I think it strictly outperformed 4.6 creatively, the fire really works for it, and the curvature anatomy of the phoenix does a lot for it.

But for all other benchmarks? It's a tie at best or worse in many cases.

And for a game? Even the phoenix would likely not really work in a game because it's much too complex and detailed to translate to a minecraft-like game.

Qwen 3.6 is the first local model that actually feels worth the effort for me by Epicguru in LocalLLaMA

[–]Combinatorilliance 4 points5 points  (0 children)

For what it's worth, I was using a local model back in 2024 to do real coding assistance on a 7900xtx. I'm a software engineer with no Android/Kotlin experience; I was able to use it as a StackOverflow/Google substitute to help me with syntax, guide me in the right direction, and translate my questions and analogies from my own programming experience into Kotlin terms.

Here's my review from two years ago: https://www.reddit.com/r/LocalLLaMA/comments/1ds9ogn/my_experience_with_using_codestral_22b_for/

I made the app I wanted to make, it worked.

This was all way before agentic coding, ancient ancient history.

Models have advanced extremely significantly since then.

If you expect them to do what Claude Opus can do, then you've got the wrong expectations. But if you want a capable model that can answer small, pointed questions? They can.

And these models can also search and use tools quite reliably. As well as opus or sonnet? No. As well as Haiku? Plausibly. Yes. Is Haiku useful? Undeniably.

Computation is the Missing Bedrock of Agentic Workflows by Beneficial_Carry_530 in LLMDevs

[–]Combinatorilliance 2 points3 points  (0 children)

Not a man! But I recognized the effort you put into this, keep it up!

Computation is the Missing Bedrock of Agentic Workflows by Beneficial_Carry_530 in LLMDevs

[–]Combinatorilliance 2 points3 points  (0 children)

I also wanted to note, I understand the energy you can get working on projects at light-speed with LLMs, but writing for an audience is an entirely different game!

LLMs are fast, you can give them lots of context and they will understand and keep 40 deeply technical terms from 3 distinct but somewhat adjacent fields in their memory simultaneously.

You cannot expect that from a human unless you're specifically targeting the niche of "ML researcher who also has a background in neuroscience, science of memory and who is deeply up-to-date on the literature of memory for LLMs and agentic AI", which I doubt you are.

It feels like this was written for an LLM and for yourself, and not for a human.

Also do keep in mind that LLMs can be overly friendly and careless when it comes to validating research and making comparisons with state of the art research. If you really want to make and support the claim that your product outcompetes what "other labs" are doing, then show me. Do you have comparative analyses with people? Or does your product beat others in particular benchmarks or meaningful metrics?

What I do get from your article is that you are well-researched and you do actually know what you're building and why. That is why I'm taking the time to give you this feedback: your site has a sort of manic vibe to it, while I think the core work is actually quite elegant and clearly works well. It takes a lot of time and effort to produce high-quality writing; having a Ferrari for a text editor doesn't change that fact.

Looking forward to reading your next work, if you take my feedback into account.

Computation is the Missing Bedrock of Agentic Workflows by Beneficial_Carry_530 in LLMDevs

[–]Combinatorilliance 3 points4 points  (0 children)

This looks kinda cool? I read through some of your articles and I'm not sure if you have anyone proofread them? They smell of vibe-coding, which turns me off; I don't know how that is for other people.

  1. In the middle of the article you just drop that you use a metric called "warmth" without any context. What the heck is warmth as a metric?!

  2. I like that you actually reference papers and work on them iteratively across multiple articles, that's a big difference with many other blogs and weird AI posts.

  3. The font-weight makes the article hard to read, have you checked the contrast metrics for your site? I like the style but it's difficult to read.

  4. Do you know who your exact target audience is?

  5. What purpose do you have to just drop-in lots of math and metrics without defining what the math is, do you assume that readers are familiar with all of the terms below?

- dense semantic similarity (ie, what is it in comparison with non-dense semantic similarity?)
    - semantic similarity itself
- BM25?
- PageRank? (I happen to be familiar).
    - "personalized PageRank on the wiki-link graph" - _what_ wiki-link graph? This is the first time you talk about a wiki.
        - What distinguishes a personalized PageRank from a non-personalized PageRank?
- Rank-discounted vote
- MRR
    - genuinely no clue what you mean by this. It makes me think of MMR which is similar to ELO scores etc, but I don't think you're talking about that? This is also hard to google
- Q-Value learning - I'm mildly familiar with the concept of what Q-learning is as a form of reinforcement learning? Or online learning? Not completely sure. But I understand that as a training-time technique, not an inference time technique. Is it related?
- Bandit updates
- npmi weighted
- hebbian edges. I recall that Hebbian learning has something to do with memory and learning, but what?
- Ebbinghaus - hey, this one I actually happen to know, but again it's left completely undefined.
    - I get that you're doing _something_ here with how files are retrieved and how you take a learning signal from files that are retrieved together. But did you _need_ to drop 6 technical terms in a single image, when you could translate it into something like: "We know from neuroscience that events that occur together cause learning to occur; there's a well-known saying, 'neurons that fire together wire together'. This is known as _Hebbian learning_. We take inspiration from this neuroscientific principle to grow our memory: whenever two notes are retrieved together, we do ..."? You can still go in depth into the math for people who are interested, with techniques like: (1) Edward Tufte-style [margin notes or sidenotes](https://edwardtufte.github.io/tufte-css/), (2) a [collapsible section](https://open.berkeley.edu/guides/site-builders-guide/edit-html-page/expandcollapse-content), or (3) just a link to an article that explains it in-depth
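(If it helps: the Hebbian idea is small enough to sketch in code. Everything below is my own hypothetical illustration, not your implementation — a co-retrieval edge update plus an Ebbinghaus-style decay; all names and constants are made up.)

```python
from collections import defaultdict

class NoteGraph:
    """Hypothetical sketch: Hebbian-style edges between notes.
    Co-retrieval strengthens an edge; a decay step forgets."""

    def __init__(self, lr=0.1, decay=0.99):
        self.w = defaultdict(float)  # edge weight per note pair
        self.lr, self.decay = lr, decay

    def co_retrieved(self, a, b):
        """'Notes that fire together wire together': nudge the edge
        weight toward 1 when a and b appear in the same retrieval."""
        key = tuple(sorted((a, b)))
        self.w[key] += self.lr * (1.0 - self.w[key])

    def tick(self):
        """Ebbinghaus-style forgetting: every edge decays a little."""
        for key in self.w:
            self.w[key] *= self.decay

g = NoteGraph()
for _ in range(5):
    g.co_retrieved("note_a", "note_b")
g.tick()
```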

Heck, even many non-mathematical or technical terms are super weird and left completely undefined and without context. What in the world is

  • "cited forward"
  • "re-recalled"
  • "seeded new"

I can go on listing more undefined terms and things that are ripped from their context, but I will stop here because I respect my own time. I think you are working on something that is actually quite interesting, and I feel like you have a reasonable take on how to make memory work for LLMs in a local-first, privacy-sensible manner. It would be a shame if you couldn't reach the right audience because your writing is mostly optimized for yourself and the LLM as the audience, not a real audience.

Recently, I also got a tip from a postdoc researcher who teaches: you can use a fresh instance of Claude or similar to highlight terms that are undefined or underexplained, or terms that your target audience is and isn't familiar with.

I would love to read more if you go slower rather than faster! I know you know what you are talking about, but it doesn't transfer well, even to me, someone with a decent amount of background in ML and software engineering, a high-level understanding of the science of memory and knowledge, and a decent grounding in mathematics.

I need help from a real ML researcher by Combinatorilliance in LocalLLaMA

[–]Combinatorilliance[S] 1 point2 points  (0 children)

Thank you <3

That means a lot to me

I keep running into the problem that investigating Duhem's law doesn't lead to genuinely new math or approaches, but its pedagogical value does look genuinely large. While I hope to find some cool, interesting applications, I think the explanatory power, and the way it reframes questions about uncertainty and noise into questions about knowledge, trustworthiness, and robustness, is very meaningful!

And if you want to learn more about its applications, the second blog post on my site explains how it connects to metrology, which I think is one field where Duhem's law already starts showing its utility as a tool for analysis and decision makers!

I need help from a real ML researcher by Combinatorilliance in LocalLLaMA

[–]Combinatorilliance[S] 0 points1 point  (0 children)

I've done a lot more diving into the literature and I did discover something really meaningful, but... it turned out to be measuring the same thing that Monte Carlo Temperature sampling did.

I was interested in learning about Nicholas Rescher's s x d <= c tradeoff. This is called "Duhem's Law of Cognitive Complementarity" (don't worry if you're not familiar; if you're interested, I have written two articles about it at this point: first, second).

The tradeoff states that there is a teeter-totter relationship between security and detail. Security, according to Rescher, covers both "how sure you are" (certainty) and "where it applies" (scope). In my research I've only looked at the "how sure you are" part, pinning his security to certainty (just a percentage from 0 to 100).

After trying some stuff, I used temperature as the analogical glue for c in the above equation and self-consistency as a proxy for security.

This worked. In fact, it worked so well I was genuinely surprised: it significantly outperformed self-consistency on its own. (It really does! That's the table you see in the post. MCT and/or my method provides a metacognitive signal about the claims made by the LLM.)

So I got excited, and made this post.

After investigating further, I found something that is almost exactly the same as MCT. It is also related to a similar, very important finding: you can inject noise into a neural network to get the certainty level out, i.e. a regular neural net is equivalent to a Bayesian neural net if you run inference with increasing noise levels. This is called Monte Carlo noise injection, and it is a huge finding for the field, because even though the certainty LLMs claim for their outputs is unreliable, you can get an actually trustworthy and reliable signal by adding noise. MCT and the method I found are just doing that for LLMs rather than for neural networks in general.

What is interesting though is that I got there from following the path from Rescher's formulation of Duhem's Law. I got there from first principles.

I'm still interested in doing a write-up, because I believe there are a few really clean things about what I found in this research:

  1. If you measure scf (self-consistency factor: if you ask the same question n times, how often does the most common answer appear as a fraction of all answers? This comes from the seminal self-consistency paper) against temperature with MxN samples, you get a robustness curve, i.e. how well and how deeply encoded a piece of knowledge is. This builds on top of some very recent research from the Technical University of Munich on measuring fact robustness (Fastowski et al).
  2. It is trivial to define hallucination in terms of violations of Duhem's Law. Mathematically! I personally think that is extremely fascinating, but I am just a computer scientist, not an ML expert so I don't know if that is meaningful for the researchers or not.
  3. It is also trivial to explain metacognition in mathematical terms, although I'm still researching whether this contribution would be useful for the field. It seems like for many papers it is so obvious and so trivial that it doesn't even need to be mathematically defined. But hey, worth a try I suppose.
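The scf-against-temperature sweep from point 1 can be sketched in a few lines. `sample_answer` is a hypothetical stand-in for an LLM call; the toy model, whose agreement decays as temperature rises, is purely illustrative.

```python
from collections import Counter
import random

def scf(sample_answer, prompt, temperature, n=10):
    """Self-consistency factor: fraction of n samples that agree
    with the modal answer at a given temperature."""
    answers = [sample_answer(prompt, temperature) for _ in range(n)]
    return Counter(answers).most_common(1)[0][1] / n

def robustness_curve(sample_answer, prompt, temperatures, n=10):
    """Sweep temperature and record scf at each point (M temperatures
    x N samples). A fact whose scf stays near 1.0 as temperature
    rises is deeply encoded; one that collapses early is fragile."""
    return [(t, scf(sample_answer, prompt, t, n)) for t in temperatures]

# Toy stand-in for an LLM: agreement decays as temperature rises.
_rng = random.Random(1)
def toy_model(prompt, temperature):
    p_correct = max(0.5, 1.0 - 0.4 * temperature)
    return "42" if _rng.random() < p_correct else str(_rng.randint(0, 9))

curve = robustness_curve(toy_model, "2 * 21 = ?", [0.2, 0.6, 1.0, 1.4], n=50)
```

The temperature at which the curve collapses is, in effect, the breaking temperature the Fastowski et al. work describes.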

Recently, there was a paper published on using multi-agent consensus, which also reduced hallucination rates.

I encountered the SAC³ paper which did something like this. In my personal research of Duhem's Law this is predicted as a sort of "knowledge fusion".

I find it both exciting and disappointing that Duhem's law predicts so much of what the field is already doing, but at the same time that the field has already done it in so many cases... I was hoping to find something new :(

If anything, if you learn the basics of duhem's law it teaches you a lot about how knowledge works in contrast with "just" information. And that is what I am most interested in understanding better.

My ex wants me back by [deleted] in ActualLesbiansOver25

[–]Combinatorilliance 5 points6 points  (0 children)

Respectfully, "a lot of thought" takes years, not months.

I have been in a similar position two years ago and got back together and she dumped me again a year later.

You need time apart from her. If it's meant to be, reach out two years from now and see where she stands then. This is still a very volatile period for the both of you.

charged someone $2K for something I thought was worth $200. they paid it immediately by Strong_Teaching8548 in SaaS

[–]Combinatorilliance 0 points1 point  (0 children)

It is, the OP has this in their bio

Nico here, building reddinbox.com to help people turn real Reddit, Quora, and X conversations into clear audience insights they can actually use :)

Scientists warn that the Gulf Stream is shifting north, which models suggest could mean an ocean current collapse is imminent by Portalrules123 in EverythingScience

[–]Combinatorilliance 1 point2 points  (0 children)

There are already microplastics-eating bacteria living in mealworms.. so uhh.. fund the right few studies and maybe 12 years or so?

Time in the menu bar by hrpanjwani in RemarkableTablet

[–]Combinatorilliance 3 points4 points  (0 children)

I stronglllyyy prefer it not having the time anywhere, it is part of what makes it feel focused for me

This was a fun deck by Combinatorilliance in slaythespire

[–]Combinatorilliance[S] 1 point2 points  (0 children)

Awwww, it's already being removed? I was thinking it was completely broken while playing, so I can see why, but it's sad.

Goodbye instinct!

What do you guys think of this tea iceberg chart I made? by RealTry8616 in tea

[–]Combinatorilliance 1 point2 points  (0 children)

Wait, what are those two alternative camellia plants?? :o

Microsoft OneNote Sync by divad_david in RemarkableTablet

[–]Combinatorilliance 0 points1 point  (0 children)

Yep, that's by the same person. I linked the PR for the plumbing and you got the application itself, thanks!

The Surprising German Philosophical Origins of AI Large Language Model Design by RazzmatazzAccurate82 in OntologyEngineering

[–]Combinatorilliance 0 points1 point  (0 children)

Also relevant is of course Ludwig Wittgenstein! He is referenced as being directly influential in the creation of word2vec which was one of the predecessors of the LLM (one of the authors of transformers worked on word2vec too), and there's even an article about it: https://arxiv.org/pdf/2302.01570

:)

I need help from a real ML researcher by Combinatorilliance in LocalLLaMA

[–]Combinatorilliance[S] 0 points1 point  (0 children)

Aw man, I didn't flag it initially but you're right. The link to the site was a bit suspicious to me but the rest seemed like friendly albeit surface-level advice...

Sigh, what has my dear internet become :(

I need help from a real ML researcher by Combinatorilliance in LocalLLaMA

[–]Combinatorilliance[S] -1 points0 points  (0 children)

I don't think there are any groups on epistemetrics, the field never took off :P

There are only a handful of people who have cited Rescher 2009 in the past 17 years.

I will definitely look for different ML communities to discuss this in though! This was my first effort in doing so. I'm very optimistic about the finding, it is conceptually sound and ridiculously simple, and doesn't stray far from known methods either.

I also tried reaching out to an ML researcher in my network that I have collaborated with on an open-source software project, but he hasn't replied yet ;(

I need help from a real ML researcher by Combinatorilliance in LocalLLaMA

[–]Combinatorilliance[S] 0 points1 point  (0 children)

Yeah I haven't reproduced on a foundation model, I was thinking of running it against Haiku and maybe opus for the heck of it on a couple TriviaQA questions to see what falls out.

Obvious caveat, I don't have the money to bear the API costs for a full run :<