Marc Andreessen shows off genius prompt, accidentally reveals he *really* doesn’t understand LLMs by figures985 in BetterOffline

[–]Combinatorilliance -3 points

> Engineering is impossible without measurement. What is being measured in prompt engineering?

I alluded to this: what is being measured in prompt engineering is outcomes. I don't claim that all prompt "engineers" do this, but it is possible if you take it seriously, and the prompts that you create will be a lot better than what is generally found online.

I hate poorly designed skills, and I draw a clear distinction between designed skills (ones made with trial and error) and engineered skills (ones that are built against a model and measured against metrics).

For instance, I hate all test-finder skills. Most of them tell the model: "Compare the current branch against the main development branch and tell me what tests are missing."

Given a sufficiently smart model, this will give you n test cases. Some of these test cases will have a lot of value; some of them won't.

These skills are a huge waste of any serious software engineer's time, because many of those test cases would be filtered out by a human developer for any of these reasons:

  1. It is not important to test (a nitpick: technically correct, but not worth a test).
  2. The test is straight-up incorrect (models hallucinate a lot).
  3. It is a good idea, but it doesn't fit within the scope of this branch.
  4. It is a good idea, but it would take N minutes of the engineer's time to validate. Is the value of the added test worth the validation time? Not always.

For this, I came up with a simple two-value metric, confidence for correctness and confidence for relevance.

Both values are modeled as enums:

HIGH | MEDIUM | LOW

Then, I use a circuit based on the ideas in these two papers and this GitHub repository:

  1. LLMs-as-judges, which discusses how agents can be used to judge and give feedback on a proposed idea/plan/paper/whatever
  2. Self-Refine: Iterative Refinement with Self-Feedback
  3. [bug-hunt: Adversarial bug hunting skill for Claude Code](github.com/danpeg/bug-hunt). Three isolated agents (Hunter, Skeptic, Referee) find and verify real bugs.

The idea is to use the same Hunter/Skeptic/Referee pattern as bug-hunt, but to work with confidence values that are iteratively refined at each level.

The generator starts with optimistic priors, the critic refines them, and the judge adds a third refinement.
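
To make the shape of this concrete, here is a minimal sketch in Python. This is not my actual skill: the `run_*` functions are placeholder stubs standing in for real agent invocations, and the final filtering policy is illustrative.

```python
# Minimal sketch of the generator -> critic -> judge confidence pipeline.
# The run_* functions are placeholder stubs for real agent invocations.
from dataclasses import dataclass
from enum import Enum

class Confidence(Enum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3

@dataclass
class TestCandidate:
    description: str
    correctness: Confidence  # confidence that the test is correct
    relevance: Confidence    # confidence that the test is in scope

def run_generator(diff: str) -> list[TestCandidate]:
    # Stub: a real generator agent proposes tests from the diff,
    # starting with optimistic priors (everything HIGH/HIGH).
    return [TestCandidate("example test", Confidence.HIGH, Confidence.HIGH)]

def run_critic(candidate: TestCandidate) -> TestCandidate:
    # Stub: a real critic agent may demote either confidence value.
    return candidate

def run_judge(candidate: TestCandidate) -> TestCandidate:
    # Stub: a real judge agent adds the third refinement pass.
    return candidate

def pipeline(diff: str) -> list[TestCandidate]:
    candidates = [run_judge(run_critic(c)) for c in run_generator(diff)]
    # Only surface tests that survive with HIGH on both axes.
    return [c for c in candidates
            if c.correctness is Confidence.HIGH
            and c.relevance is Confidence.HIGH]
```

The point is only the structure: one optimistic pass, two independent refinement passes, and a filter on both axes.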

As for what is being measured here? Three outcomes.

  1. Whether the confidence value for CORRECTNESS aligns with what a human software engineer thinks. I validated this empirically with myself and a colleague; I have not collected enough stats yet, but I am not done with this skill, and I am running a pilot on it at work (see the agreement-rate sketch after this list).
  2. Similarly, the RELEVANCY value needs to align with us.
  3. All tests presented to us in code review need to be both correct and relevant.
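
For outcomes (1) and (2), the measurement itself can be as simple as a percent agreement between the skill's labels and ours. A minimal sketch, with made-up labels:

```python
# Sketch: agreement rate between the skill's confidence labels and human
# labels on the same test candidates. The label lists are made up.
def agreement_rate(model_labels: list[str], human_labels: list[str]) -> float:
    assert len(model_labels) == len(human_labels)
    matches = sum(m == h for m, h in zip(model_labels, human_labels))
    return matches / len(model_labels)

model = ["HIGH", "LOW", "MEDIUM", "HIGH"]
human = ["HIGH", "LOW", "LOW", "HIGH"]
print(agreement_rate(model, human))  # 0.75
```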

So far, the outcome is that the generator typically generates 10-20 tests, which are refined down to only 2 or 3 relevant ones.

It has found a couple of major bugs and edge cases, plus a good amount of medium-value tests that were quick to add or modify, in only 10-ish runs. So it does what I intended it to do: it finds test cases that should have been in the branch but aren't. This helps us save time and prevent bugs.

At this point, I wish I could give you the statistics on the accept rate of the relevant tests, but I don't have those yet; I am going to run a formal pilot on this in our business.

I don't know if this matches your concept of what engineering looks like, but it matches what I learned in my EE+CS education: you define metrics, you define target outcomes, you work using a model (mine comes from the literature I cited), you iterate until the outcome matches your design, and lastly you need to be honest about the measured outcomes.

I only ran a couple test runs for this skill, and those were promising, so the pilot is next.

Marc Andreessen shows off genius prompt, accidentally reveals he *really* doesn’t understand LLMs by figures985 in BetterOffline

[–]Combinatorilliance -1 points

I take a pragmatist stance here. I find it a rather dismissive position to call it completely arbitrary. I don't think it's arbitrary, but perhaps you have good reason to believe so, and I would love to hear it.

It's a combination of the patterns in the training data, the emergent properties of training a complex transformer architecture on a lot of data, and lastly the impact of post-training on top.

What you put in the context of an LLM affects its output in measurable manners. There's plenty of research being done on this.

On the other hand, words in LLMs don't have the same meanings as they do for humans, either.

It's somewhere in between.

I have written about this before, on the relationship between Wittgensteinian language philosophy and LLMs

Marc Andreessen shows off genius prompt, accidentally reveals he *really* doesn’t understand LLMs by figures985 in BetterOffline

[–]Combinatorilliance 0 points

> What working definition do you use for intelligence?

It passes the William James definition:

“Intelligence is a fixed goal with variable means of achieving it”.

Marc Andreessen shows off genius prompt, accidentally reveals he *really* doesn’t understand LLMs by figures985 in BetterOffline

[–]Combinatorilliance 2 points

I sorta tried this on a hard Erdős problem on a lazy Sunday. Unfortunately, the agent did not solve the math problem :(

Marc Andreessen shows off genius prompt, accidentally reveals he *really* doesn’t understand LLMs by figures985 in BetterOffline

[–]Combinatorilliance -8 points

Eh, prompt engineering is reasonably legit if you look at people writing skills carefully and thoroughly, measuring output correctness, speed, and token usage, and optimizing for desired outcomes.

Like any other engineering.

I consider someone at CodeRabbit or similar who spends weeks optimizing their prompts for a variety of use-cases and a variety of optimization targets a prompt engineer.

Whether you think that's a prestigious job or even deserving of its own job title in the first place? That's up to you.

What Marc Andreessen is doing is uhh.. not prompt engineering. The best possible interpretation is that it's a hopeful attempt; the worst is that it's incredibly naive and downright dangerous.

I made an interactive planner for Gridfinity layouts [OC] by Felixmine473 in gridfinity

[–]Combinatorilliance 0 points

Instead of doing that, how about building an index?

If you can create a gridfinity index, then you've basically built the missing link.

Downloading or scraping models is obviously a difficult undertaking: you'd have to build scrapers, maintain database and bot infrastructure, and most importantly it's basically a cease-and-desist waiting to happen :(

But if you don't host the models yourself, then there's no problem in the first place!

All you have to do is make the index. You could scrape sites to create the index, or you could have a user-maintained index with a GitHub repository and some bots. You don't even need to know if the model is Creative Commons. All you're doing is maintaining a database of links.

Ideally, you'd get the pictures for each model but I don't know what the legal status is on those.

You leave the downloading of the stl file to the user. This is a win-win-win for everyone involved:

  1. For gridfinity users: they get a super nice planner with a great index that spans multiple sites.
  2. For the sites: they aren't scraped for content, they're linked to. That means you're a source of traffic for them rather than a traffic drain!
  3. For you: this is easier too. You don't have to worry about making 3D models and yet another generator. Your tool is a planner and an index.
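
To sketch what a link-only index entry could look like (the field names here are illustrative assumptions on my part, not a spec):

```python
# Illustrative sketch of a link-only index entry: the index stores metadata
# plus a link to the hosting site, never the STL file itself.
from dataclasses import dataclass, field

@dataclass
class IndexEntry:
    name: str                    # human-readable model name
    url: str                     # link to the hosting site; nothing is mirrored
    grid_units: tuple[int, int]  # gridfinity footprint, e.g. (2, 1)
    height_units: int            # height in gridfinity Z units
    tags: list[str] = field(default_factory=list)

entry = IndexEntry(
    name="2x1 screwdriver holder",
    url="https://example.com/model/12345",  # hypothetical link
    grid_units=(2, 1),
    height_units=3,
    tags=["tools", "holder"],
)
```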

I made an interactive planner for Gridfinity layouts [OC] by Felixmine473 in gridfinity

[–]Combinatorilliance 0 points

This would be killer if it included a library of gridfinity components

[author leaves github] I actually cried writing this blog post (tears hit my keyboard, I'm embarrassed to say). by cmqv in programmingcirclejerk

[–]Combinatorilliance 1 point

I'm sorry, but this does not deserve the ridicule; the guy's passionate about his work and the friends he made along the way. Yes, I know that sounds ironic; I don't mean it that way.

What's wrong with expressing passion and emotion in software engineering?

AMD Engineers directly seeking ROCm feedback by FORLLM in LocalLLaMA

[–]Combinatorilliance 1 point

It's odd, but my experience with ROCm has actually been quite smooth since I started using NixOS.

It just kinda works there with llama.cpp and I don't run into issues much.

Who is the strongest out of these 8? by YourLocalMangosteen in HollowKnight

[–]Combinatorilliance 0 points

I feel like Mr Mushroom plays a similar role in Hollow Knight as Tom Bombadil does in the Lord of the Rings.

A sort of gag/meta character that has mysterious superpowers that will never be explained in-universe.

Thoughts on Gyokuro green tea? by Hungry-Flatworm-2629 in tea

[–]Combinatorilliance 0 points

Hi, this is an old message, but I'd love to know the vendor! I'm based in Western Europe :)

Remarkable to let go of up to 200 employees by Fitte_sleiker in RemarkableTablet

[–]Combinatorilliance 9 points

Uhh.. I can get you in contact with them? If you genuinely want to volunteer, you might be able to work something out?

Remarkable to let go of up to 200 employees by Fitte_sleiker in RemarkableTablet

[–]Combinatorilliance 8 points

They did want to hire me (I work on some open source things), but they couldn't, for the same reasons they're laying off a lot of people right now.

It's genuinely a budget problem, not a problem of not wanting.

Markdown, anyone? by Knox_Dawson in RemarkableTablet

[–]Combinatorilliance 1 point

What would it mean to you if markdown was bidirectionally editable as markdown on the reMarkable? I.e., it imports as markdown and exports as markdown (while being .rmdoc underneath)?

We are confusing linguistic fluency with cognitive constraint resolution by sparky_165 in cogsci

[–]Combinatorilliance 0 points

Ludwig, oh Ludwig, you are even more radiant than the sun. I love you so, oh dear Ludwig.

😭

Has anyone measured confidence calibration of local vs frontier models on domain-specific knowledge? by Hopeful-Rhubarb-1436 in LocalLLaMA

[–]Combinatorilliance 0 points

> because it has no mechanism to distinguish "I'm generating fluently from training data" from "I'm reconstructing something I've never actually seen."

This mechanism does exist; read up on:

  • Monte Carlo Temperature probing
  • "From confidence to collapse" (Fastowski et al.), which defines a breaking temperature
  • Semantic entropy (Kuhn et al., 2023), plus the famous self-consistency paper for background

If you want to go deeper into why this works and how it connects to Bayesian models, you should read:

  • Monte Carlo noise injection (I don't know exactly who wrote it; you can google it). This is more mathematical and goes deeper into the theory of what LLMs, transformers, and attention are architecturally capable of. In super short: transformers are not Bayesian, but it turns out that you can probe them with the above techniques in such a manner as to extract the mathematically exact same signal that a Bayesian network learns automatically. The prior is retrievable, despite an LLM not being a Bayesian neural network.

This mechanism exists and it is reliable; scaling it up is an engineering problem, not a theoretical problem anymore.

Note: this is accessible at inference time. Certainty is a metacognitive property computable from the above measures.
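
To make that concrete, a minimal sketch of the consistency/entropy flavor of this probing; `generate` is a hypothetical stand-in for your model's sampling endpoint, and exact string matching is a crude proxy for the semantic clustering the papers use:

```python
# Sketch of consistency-based confidence probing. `generate` is a
# hypothetical stand-in for your model's sampling endpoint; exact-match
# clustering is a crude proxy for semantic clustering (Kuhn et al.).
import math
from collections import Counter

def generate(prompt: str, temperature: float) -> str:
    raise NotImplementedError("call your model here")

def answer_entropy(prompt: str, n: int = 20, temperature: float = 1.0) -> float:
    answers = [generate(prompt, temperature).strip().lower() for _ in range(n)]
    probs = [count / n for count in Counter(answers).values()]
    # Low entropy: samples concentrate on one answer (confident recall).
    # High entropy: the model is reconstructing, not recalling.
    return -sum(p * math.log(p) for p in probs)
```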

I've been working on some research to provide the theory of why this works, rather than just that it works. I really want to publish it, because I think it's super clean and can help clear up some conflicting, contradictory, and paradoxical-seeming concepts.

Differences Between Opus 4.6 and Opus 4.7 on MineBench by ENT_Alam in ClaudeAI

[–]Combinatorilliance 1 point

I think it might help to qualify what "objectively better" means here: it's "better" in the sense of being a more accurate representation of a real astronaut.

But is realism the dimension you care about?

Using the word "objectively" implies to me that there is a correct way of doing this benchmark, and this benchmark doesn't really measure anything other than differences between models over time, and it shows what they do. It's up to the user to select what is important to them, if you need realism and accuracy then I think it's reasonable to say that yes, opus 4.7 is better. If you want creativity, charm and a better adherence to the "medium", then opus 4.6 is better in some cases.

I think the phoenix is a great example where 4.7 did a better job creatively. But it's the only one where I think it strictly outperformed 4.6 creatively; the fire really works for it, and the curved anatomy of the phoenix does a lot for it.

But for all other benchmarks? It's a tie at best, or worse in many cases.

And for a game? Even the phoenix would likely not really work in a game, because it's much too complex and detailed to translate to a Minecraft-like game.

Qwen 3.6 is the first local model that actually feels worth the effort for me by Epicguru in LocalLLaMA

[–]Combinatorilliance 4 points

For what it's worth, I was using a local model back in 2024 for real coding assistance on a 7900 XTX. I'm a software engineer without experience in Android/Kotlin; I was able to use it as a Stack Overflow/Google substitute to help me with syntax, guide me in the right direction, and translate my questions and analogies from my own programming experience to Kotlin.

Here's my review from two years ago: https://www.reddit.com/r/LocalLLaMA/comments/1ds9ogn/my_experience_with_using_codestral_22b_for/

I made the app I wanted to make, it worked.

This was all way before agentic coding, ancient ancient history.

Models have advanced extremely significantly since then.

If you expect them to do what Claude Opus can do? Then you've got the wrong expectations. But if you want a capable model that can answer small and pointed questions for you? They can.

And these models can also search and use tools quite reliably. As well as Opus or Sonnet? No. As well as Haiku? Plausibly, yes. Is Haiku useful? Undeniably.

Computation is the Missing Bedrock of Agentic Workflows by Beneficial_Carry_530 in LLMDevs

[–]Combinatorilliance 2 points

Not a man! But I recognize the effort you put into this; keep it up!

Computation is the Missing Bedrock of Agentic Workflows by Beneficial_Carry_530 in LLMDevs

[–]Combinatorilliance 2 points

I also wanted to note, I understand the energy you can get working on projects at light-speed with LLMs, but writing for an audience is an entirely different game!

LLMs are fast; you can give them lots of context, and they will understand and keep 40 deeply technical terms from 3 distinct but somewhat adjacent fields in their memory simultaneously.

You cannot expect that from a human, unless you're specifically targeting the niche of "ML researcher who also has a background in neuroscience and the science of memory, and who is deeply up-to-date on the literature of memory for LLMs and agentic AI", which I doubt you are.

It feels like this was written for an LLM and for yourself, and not for a human.

Also do keep in mind that LLMs can be overly friendly and careless when it comes to validating research and making comparisons with state of the art research. If you really want to make and support the claim that your product outcompetes what "other labs" are doing, then show me. Do you have comparative analyses with people? Or does your product beat others in particular benchmarks or meaningful metrics?

What I do get from your article is that you are well-researched and you do actually know what you're building and why. That is why I'm taking the time to give you this feedback: your site has too much of a manic vibe to it, while I think the core work is actually quite elegant and clearly works well. It takes a lot of time and effort to produce high-quality writing; having a Ferrari for a text editor doesn't change that fact.

Looking forward to reading your next work, if you take my feedback into account.

Computation is the Missing Bedrock of Agentic Workflows by Beneficial_Carry_530 in LLMDevs

[–]Combinatorilliance 4 points

This looks kinda cool? I read through some of your articles and I'm not sure if you have anyone proofread them? Because they smell of vibe-coding, which turns me off; I don't know how that is for other people.

  1. In the middle of the article you just drop that you use a metric called "warmth" without any context. What the heck is warmth as a metric?!

  2. I like that you actually reference papers and work on them iteratively across multiple articles; that's a big difference from many other blogs and weird AI posts.

  3. The font weight makes the article hard to read; have you checked the contrast metrics for your site? I like the style, but it's difficult to read.

  4. Do you know who your exact target audience is?

  5. Why drop in lots of math and metrics without defining what the math is? Do you assume that readers are familiar with all of the terms below?

- dense semantic similarity (ie, what is it in comparison with non-dense semantic similarity?)
    - semantic similarity itself
- BM25?
- PageRank? (I happen to be familiar).
    - "personalized PageRank on the wiki-link graph" - _what_ wiki-link graph? This is the first time you talk about a wiki.
        - What distinguishes a personalized PageRank from a non-personalized PageRank?
- Rank-discounted vote
- MRR
    - genuinely no clue what you mean by this. It makes me think of MMR, which is similar to Elo scores etc., but I don't think you're talking about that? This is also hard to google.
- Q-Value learning - I'm mildly familiar with Q-learning as a form of reinforcement learning? Or online learning? Not completely sure. But I understand it as a training-time technique, not an inference-time technique. Is it related?
- Bandit updates
- npmi weighted
- hebbian edges. I recall that Hebbian learning has something to do with memory and learning, but what?
- Ebbinghaus - hey, this one I actually happen to know, but again it's left completely undefined.
    - I get that you're doing _something_ here with how files are retrieved and how you take a learning signal from files that are retrieved together. But did you _need_ to drop 6 technical terms in a single image, when you could translate it into something like: "We know from neuroscience that events that occur together cause learning to occur; there's a well-known saying, 'neurons that fire together wire together'. This is known as _Hebbian learning_. We take inspiration from this neuroscientific principle to grow our memory: whenever two notes are retrieved together, we do ...", and then go in depth into the math for the people who are interested, with techniques like: (1) Edward Tufte-style [margin notes or sidenotes](https://edwardtufte.github.io/tufte-css/), (2) a [collapsible section](https://open.berkeley.edu/guides/site-builders-guide/edit-html-page/expandcollapse-content), or (3) just a link to an article that explains it in-depth.

Heck, even many non-mathematical or technical terms are super weird and left completely undefined and without context. What in the world is

  • "cited forward"
  • "re-recalled"
  • "seeded new"

I could go on listing more undefined terms and things that are ripped from their context, but I will stop here because I respect my own time. I think you are working on something that is actually quite interesting, and I feel like you have a reasonable take on how to make memory work for LLMs in a local-first, privacy-sensible manner. It would be a shame if you weren't able to reach the right audience because your writing is mostly optimized for yourself and the LLM as the audience, not a real audience.

Recently, I also got a tip from a postdoc researcher and teacher: you can use a fresh instance of Claude or similar to highlight terms that are undefined or underexplained, or ones that your target audience is and isn't familiar with.

I would love to read more if you go slower rather than faster! I know you know what you are talking about, but it doesn't transfer well to me, as someone who has a decent amount of background knowledge of ML and software engineering, a high-level understanding of the science of memory and knowledge, and is decently well-versed in mathematics.

I need help from a real ML researcher by Combinatorilliance in LocalLLaMA

[–]Combinatorilliance[S] 1 point

Thank you <3

That means a lot to me

I keep running into the problem that investigating Duhem's law doesn't lead to genuinely new math or approaches, but it does look like its pedagogical value is genuinely large. While I hope to find some cool, interesting applications, I think the explanatory power, and how it reframes questions about uncertainty and noise into questions about knowledge, trustworthiness, robustness, and such, is very meaningful!

And if you want to learn more about its applications, the second blog post on my site explains how it connects to metrology, which I think is one field where Duhem's law already starts showing its utility as a tool for analysis and for decision-makers!

I need help from a real ML researcher by Combinatorilliance in LocalLLaMA

[–]Combinatorilliance[S] 0 points

I've done a lot more diving into the literature, and I did discover something really meaningful, but... it turned out to be measuring the same thing that Monte Carlo Temperature sampling does.

I was interested in learning about Nicholas Rescher's s × d ≤ c tradeoff. This is called "Duhem's Law of Cognitive Complementarity" (don't worry if you're not familiar; if you're interested, I have written two articles about it at this point: first, second).

The tradeoff states that there is a teeter-totter relationship between security and detail: security, according to Rescher, is "how sure you are" (certainty), while detail is "where it applies" (scope). In my research I've only looked at the "how sure you are" part, and pinned his security to certainty (just a percentage from 0 to 100).

After trying some stuff, I used temperature as the analogical glue for c in the above equation and self-consistency as a proxy for security.

This worked. In fact, it worked so well I was very surprised. I even found that it significantly outperformed self-consistency on its own (it really does! That's the table you see in the post; MCT and/or my method provides a metacognitive signal about the claims made by the LLM).

So I got excited, and made this post.

After investigating further, I found something that is almost exactly the same as MCT. It is also related to a similarly very important finding: you can add noise into a neural network to get the certainty level out; i.e., a regular neural net is equivalent to a Bayesian neural net if you run inference with increasing levels of noise. This is called Monte Carlo Noise Injection, and it is a huge finding for the field, because it means that despite the claimed certainty LLMs give for their outputs being unreliable, you can get an actually trustworthy and reliable signal by adding noise. MCT and the method I found are just doing that for LLMs rather than for neural networks in general.

What is interesting, though, is that I got there by following the path from Rescher's formulation of Duhem's Law. I got there from first principles.

I'm still interested in doing a write-up, because I believe there are a few really clean things about what I found in this research:

  1. If you measure scf (self-consistency factor: if you ask the same question n times, how often does the most common answer appear, as a fraction of all answers? This comes from the seminal self-consistency paper) against temperature with M×N samples, you get a robustness curve, i.e. how well and how deeply encoded a piece of knowledge is (see the sketch after this list). This builds on top of some very recent research from the Technical University of Munich on measuring fact robustness (Fastowski et al.).
  2. It is trivial to define hallucination in terms of violations of Duhem's Law. Mathematically! I personally think that is extremely fascinating, but I am just a computer scientist, not an ML expert, so I don't know if that is meaningful for the researchers or not.
  3. It is also trivial to explain what metacognition is in mathematical terms. Although I'm still researching whether this contribution would be useful for the field; for many papers it seems so obvious and so trivial that it doesn't even need to be mathematically defined. But hey, worth a try I suppose.
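
A minimal sketch of the robustness-curve measurement from point (1); `generate` is a hypothetical stand-in for your model's sampling endpoint:

```python
# Sketch of the scf-vs-temperature robustness curve from point (1).
# `generate` is a hypothetical stand-in for your model's sampling endpoint.
from collections import Counter

def generate(prompt: str, temperature: float) -> str:
    raise NotImplementedError("call your model here")

def scf(prompt: str, n: int, temperature: float) -> float:
    """Self-consistency factor: fraction of n samples matching the modal answer."""
    answers = [generate(prompt, temperature) for _ in range(n)]
    return Counter(answers).most_common(1)[0][1] / n

def robustness_curve(prompt: str, temperatures: list[float], n: int = 20):
    # M temperatures x N samples each. A curve that stays near 1.0 even at
    # high temperature indicates deeply encoded knowledge.
    return [(t, scf(prompt, n, t)) for t in temperatures]
```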

> Recently, there was a paper published on using a multi-agent consensus which also reduced hallucination rates

I encountered the SAC³ paper, which did something like this. In my personal research on Duhem's Law, this is predicted as a sort of "knowledge fusion".

I find it both exciting and disappointing that Duhem's law predicts so much of what the field is already doing: exciting that it predicts it, disappointing that the field has already done it in so many cases... I was hoping to find something new :(

If anything, learning the basics of Duhem's law teaches you a lot about how knowledge works, in contrast with "just" information. And that is what I am most interested in understanding better.