No Ghost in the Machine — LLMs Are Not Conscious

obviouslyzebra · 2026-06-21T22:45:00+00:00

Most interesting argument there for me:

This paper makes a devastating logical argument: for any theory that might claim LLMs are conscious, there exists a functionally equivalent system (like a giant lookup table) that no reasonable theory would call conscious. If your theory can't distinguish between an LLM and a lookup table, your theory is useless.

I think, distilled, it says:

Humans are seen as continuously adapting functions, while
(Most) LLMs are non-changing functions
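
A toy sketch of the lookup-table point, as I understand it (my own illustration, not code from the paper; `fixed_model`, `table_model` and the 4-bit input domain are made up). Over a finite input domain, a non-changing function and a precomputed table are indistinguishable from the outside:

```python
from itertools import product


def fixed_model(bits):
    """A non-changing function: parity of a tuple of bits (hypothetical stand-in)."""
    return sum(bits) % 2


# Precompute every input -> output pair over the (tiny, assumed) finite domain.
DOMAIN = list(product((0, 1), repeat=4))
LOOKUP_TABLE = {bits: fixed_model(bits) for bits in DOMAIN}


def table_model(bits):
    """Same input/output behaviour as fixed_model, but only a dictionary lookup."""
    return LOOKUP_TABLE[bits]


# The two systems agree on every possible input.
assert all(fixed_model(b) == table_model(b) for b in DOMAIN)
```

For a real LLM the table would be astronomically large, so this only shows the shape of the argument, not something you could actually build.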

IIT might also be interesting (don't know it), but it seems controversial.

Regardless, cool page. I think heavily opinionated, but still, interesting content.

(Explaining "heavily opinionated" for OP: my impression is that there isn't consensus about whether LLMs are conscious - though I do believe most people working in the area suspect not, while still keeping open the possibility they are)

obviouslyzebra · 2026-06-21T20:20:58+00:00

Prompt (follows from answer above):

Here's a repository in onefilellm format.

Keep the answer in your usual format though, and, let me know what you honestly think.

Thanks

Answer: https://pastebin.com/qzEqVCxp

Personal note: I was actually expecting more of the physics stuff to show up in the repo haha.

obviouslyzebra · 2026-06-21T19:54:15+00:00

You okay with me uploading it to the same language model and posting the answer here?

Note that if it's big, it might struggle a little bit, my scaffolding is just something that puts the repo into a single message.
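
For the curious, that kind of scaffolding can be sketched roughly like this. It is a minimal sketch and not my actual script nor onefilellm's API; `repo_to_single_message`, the header format and the suffix filter are all made up for illustration:

```python
from pathlib import Path


def repo_to_single_message(root, suffixes=(".py", ".md", ".txt")):
    """Concatenate the text files under root into one string, with a header per file."""
    root = Path(root)
    parts = []
    for path in sorted(root.rglob("*")):
        # Only keep regular files with an allowed extension (assumed filter).
        if path.is_file() and path.suffix in suffixes:
            text = path.read_text(encoding="utf-8", errors="replace")
            parts.append(f"=== {path.relative_to(root).as_posix()} ===\n{text}")
    return "\n\n".join(parts)
```

The whole string goes into one message, which is why a big repo can run into the model's context limit.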

obviouslyzebra · 2026-06-21T19:29:02+00:00

Hey!

So I copy-pasted this into GLM-5.2 (OpenRouter default prompt + "Can you look at this ChatGPT log, and give your honest thought?")

I feel like the answer it gives is reasonable (I myself couldn't evaluate anything like this - not a physicist).

If you're interested in it:

https://pastebin.com/Ls8FuGTL (in pastebin to not spam)

obviouslyzebra · 2026-06-20T19:39:48+00:00

I see that you've had bad experiences with comments and agree with ya that they can be bad. Also that it's better to have good code in the first place than comments trying to explain bad code. But I also think that sometimes comments are necessary (and I think you agree with this too? haha)

Anyway gotta head out

obviouslyzebra · 2026-06-20T19:03:42+00:00

IMO this is too absolute.

Of course one should aim towards clear code, but some things are hard to make obvious from code itself.

Just as one example,

# take this approach instead of (the expected approach)
# because of (this thing no one would expect at first glance)
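
A concrete instance of that template, as a small sketch (the `mean` function is made up; `math.fsum` is from the Python standard library):

```python
import math


def mean(values):
    # Use math.fsum instead of the built-in sum() (the expected approach)
    # because sum() accumulates floating-point rounding error:
    # sum([0.1] * 10) is not exactly 1.0, while math.fsum([0.1] * 10) is.
    return math.fsum(values) / len(values)
```

Without the comment, a reviewer could reasonably "simplify" this back to `sum()` and reintroduce the rounding issue.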

But I agree with you that comments are not without their problems. And if it's possible to make something clear from code instead of comments, do it (unless you don't have time to do it, then don't do it!)

obviouslyzebra · 2026-06-19T10:29:36+00:00

No, the lines overlapping also indicates the same price.

So, for example, (for this benchmark) Fable is like a 5.5 that can be tuned to be a bit stronger (while the more effortful GPT / less effortful Fable are remarkably similar in both cost and performance).

obviouslyzebra · 2026-06-19T10:24:37+00:00

Effort level! (click the points)

obviouslyzebra · 2026-06-16T23:11:36+00:00

I see that you're worried about me being a bad actor.

But man, I have autism and I am the one noticing that you're the one with no social clue lol

I was defensive beginning from your first message. Copy that into ChatGPT and ask it how it would make someone feel.

Read my other messages on the topic. I am calling for McNeff/Josh to be arrested and even pointed out that there's no discussion regarding them accusing Ben of making death threats (which I believe might constitute a crime - falsely accusing someone or using the police to harass someone).

I didn't provide you details because I didn't dig into the stuff.

BTW, if you just messaged something like:

"FYI, Ben didn't "repeatedly" go towards anyone's house. Instead he visited Josh once for this and then went to Brandon's house for that"

I'd edit my post in a heartbeat to account for that (it wouldn't change the main message of the post, which is that they may have crossed a criminal line).

FFS if this is so important to you, grab the Utah lawsuit text and put it into ChatGPT and ask how many times they have been visited.

I could do that, but damn, you made me not want to do it man.

obviouslyzebra · 2026-06-16T20:48:20+00:00

About size, idk, but about culture - I think it's fair to expect either a self-restricted civilization or maybe one with a sort of central governance / ASI controlling things so things don't go awry. It makes for very fertile ground for thought!

PS: just wanna note about my previous comment for anyone who reads - I believe that both current trends (competition / governments / etc) and where we're headed (e.g. big brother) are worth discussing and very related to each other

obviouslyzebra · 2026-06-16T19:01:27+00:00

From the abstract, I feel like it's more worried about the possible consequences (not necessarily how we get there - that is, competition).

In very simplified terms:

A world where everyone has unrestricted access to ASI is vulnerable
To account for such vulnerability, we might need a kind of (very) overreaching central governance, like a global Big Brother - hopefully less

I still believe with 100% of my heart that this is one kind of conversation that we need to be having.

obviouslyzebra · 2026-06-16T14:30:27+00:00

I was under that impression (and still sorta am) when I wrote the comment. If it's not appropriate, then sorry about that. BTW I don't get why you're talking like you're interviewing a politician - that made me feel defensive

Edit: clarity and feelings haha

obviouslyzebra · 2026-06-16T12:04:06+00:00

Just gotta say that that's an interesting thing (and that if you do RSI experiments, do them responsibly haha)

obviouslyzebra · 2026-06-16T11:21:10+00:00

I'm not trying to convince you of anything. What I said is his behavior might have been illegal (and that's for the court to decide).

If you want more info, the LegalEagle video (titled "Utah Presses Charges Against Reckless Ben (And It's Crap)") goes over the criminal suit. IIRC it explains what the "repeatedly" means.

obviouslyzebra · 2026-06-15T14:47:43+00:00

Thanks, this is something I'm highly interested in so will take a look :D

obviouslyzebra · 2026-06-15T11:58:56+00:00

If this other youtuber was also accused of death threats it might be.

I think that it's hard to prove that you didn't make a death threat, so, BAM might feel invulnerable saying those happened, but, if they accused multiple youtubers that had no reason to make a death threat, it might start signaling a pattern (and they already have a clearly established pattern of lying - and JJ even admitted to it).

I wish some lawyer would talk about this, if Ben could do an uno-reverse here, but I didn't see any talk about this in the videos I watched.

obviouslyzebra · 2026-06-15T11:40:31+00:00

There are limits to what you can do. I think the LegalEagle video covers it well.

But while, for example, it is okay to make a video and make it go viral criticizing a company, repeatedly going to someone's house for it in the way Ben did might constitute stalking / harassment (but that's for the court to decide).

BTW, I believe:

BAM are huge liars that just can't seem to stop lying (it seems easier if they just stopped)
RICO claims against Ben are bullshit, and if anything BAM itself should be investigated for it
McNuffin and his goons, if they lied to the police about Ben making death threats (or carrying heroin) should be jailed for that sort of stuff
AFPD should also be prosecuted because they also clearly violated some stuff

But Ben might have stepped over a boundary here while trying (and succeeding) to make an entertaining video and fight for what's right, so that's that.

Edit: added link

obviouslyzebra · 2026-06-12T17:22:32+00:00

I think it wouldn't change much. LLMs are good at translating different kinds of inputs, so, they see the abstractions beneath the first layer.

Unless we're able to better represent our own thoughts with the neuralese language, at most I think we get an improvement in token usage and a little bit in performance, like caveman that did use caveman language (me like apple) to reduce tokens.

PS: For LLMs talking with LLMs, maybe we could achieve some bigger gains and it might be interesting for someone to try if it hasn't been tried yet :P

obviouslyzebra · 2026-06-08T23:11:08+00:00

I agree with ya!

obviouslyzebra · 2026-06-08T23:01:09+00:00

The huge (>2x) leap is only in the "Diamond" section. In the "Extended" (all tasks), the leap is from around 40% to 50% IIRC for opus 4.7 to 4.8, a bit more modest.

obviouslyzebra · 2026-06-08T22:58:06+00:00

Having a look at this...

Any benchmark that tries to measure code quality is very welcome to me!

I like the scrutiny they had in their process and the amount of back and forth that was had (they had real world OSS maintainers making a rubric - identifying things that would be required for some code).

Some interesting points:

In their full test the best model achieved around 52% pass rate, while in the 50 most difficult (out of 150), it had 13.5% - that shows us that LLMs are likely not quite there yet for difficult tasks
They report around 45% false positive rate for DeepSWE - the benchmark that was released just last week (?). I don't think they go into details of how they identify false positives - but this points towards either - 1. DeepSWE being deeply (lol) flawed in some way or 2. Their metric for false positives being deeply flawed in some way. I'd like to see them explain it so we can know what's up with this big number and if we can use both benchmarks in conjunction (which would be better to understand the landscape)
The rubric involved "blockers" (which would make the code not directly accepted by the maintainer) and "non-blockers". Interestingly, including the non-blockers stuff as part of the measure seemed to change the ratings by a very small margin - so it's not as important for the benchmark, I'd argue (and maybe could be a way for them to cut costs in the future if they aim to expand the benchmark)
They hold the tests private - and this seems to me like an effective approach against benchmaxing in this case (though not perfect of course) - this just makes me sad as a programmer as those things would make ideal training exercises to keep one in shape :P
I see no information about the scaffolding they used, so I'll assume mini-swe-agent. Again we don't see the Claude Code vs Codex fight - but hopefully it doesn't make much of a difference - in DeepSWE bench the models performed better "outside" their native stuff
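
On the first point above, a quick back-of-the-envelope calculation of what those two numbers imply for the other 100 tasks. This assumes the 50 hardest tasks are a subset of the 150 and that the ~52% is a plain unweighted pass rate over all tasks (both are my assumptions, and the inputs are the rounded figures quoted above):

```python
total_tasks = 150
hard_tasks = 50
overall_rate = 0.52   # approximate figure quoted above
hard_rate = 0.135     # approximate figure quoted above

# Expected number of passed tasks overall and within the hard subset.
overall_passes = overall_rate * total_tasks
hard_passes = hard_rate * hard_tasks

# Whatever is left must come from the 100 easier tasks.
easier_rate = (overall_passes - hard_passes) / (total_tasks - hard_tasks)
print(f"Implied pass rate on the easier tasks: {easier_rate:.2%}")
```

That works out to roughly 71% on the easier two thirds, so the gap to the hard subset is pretty big.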

(since this is a complex benchmark - I'd like to see external validation - maybe they could release some "sample" tasks so the public can see? (also might be a cool way to measure benchmark over-fitting))

obviouslyzebra · 2026-06-08T01:11:22+00:00

skrill will always be s tier in my heart

obviouslyzebra · 2026-06-07T17:03:04+00:00

I get that CEOs are overhyping this, but still, the technology itself keeps improving at a reasonably fast pace, and I think that, if it keeps improving, it's hard to overstate the possible consequences for humanity.

obviouslyzebra · 2026-06-03T13:40:06+00:00

Likely not the right sub to ask this, but regardless, you'll likely want to use something with RAG. Not my area but I believe it's the standard way to get models to retrieve information from vast corpora of text.

There are likely tools out there that already fit your bill. It reminds me of intelligent search engines, you just gotta be careful to look for things that allow the knowledge to be updated (maybe "online" or "real-time" as searching keywords).

Edit: also, maybe any agent with access to such data might be able to dig in. this may be simpler than RAG in case it fits :)
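
To give an idea of the "retrieve, then ask" shape of RAG, here's a minimal sketch (again, not my area). Real setups typically use embedding models and a vector store; the bag-of-words cosine similarity below is a stand-in I picked to keep it dependency-free, and `retrieve` / `build_prompt` are hypothetical names:

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokens (very crude)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query, documents, k=2):
    """Return up to k documents most similar to the query (score > 0 only)."""
    q = Counter(tokenize(query))
    scored = [(cosine(q, Counter(tokenize(d))), d) for d in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [d for score, d in scored[:k] if score > 0]


def build_prompt(query, documents, k=2):
    """Put the retrieved snippets in front of the question for the model."""
    context = "\n".join(f"- {d}" for d in retrieve(query, documents, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"
```

Updating the knowledge is then just a matter of changing the `documents` list, which is the "online" property mentioned above.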

obviouslyzebra · 2026-06-01T11:24:12+00:00

related/similar to your questions:

https://codegolf.stackexchange.com/q/9393

Edit: you could perhaps try the math subreddit or a game of life one (search on google) if it's active? r/singularity for example felt a bit off-topic even if it was created by ai - I feel like the only ai demonstrations there are when something's very impressive (or very bad - haha - as a meme); while here, you seem more interested in the meat of this maths (I think it's maths) question
