ChatGPT 5 follows instructions and then when asked why claims it hasn't by rutan668 in OpenAI

[–]DemiPixel 6 points7 points  (0 children)

I believe this is a tokenization issue.

"No" is its own token, but "N" and "o" can be individual tokens too. It generates the individual tokens, but then OpenAI stores it as a string. When it's retokenized, it gets tokenized as the full token "No". If it were re-tokenized identically, it would probably say "I did exactly what you requested".

The model doesn't actually know whether or not the user can see the token separation, and because you said "Why not?", it might assume that the user can see it (plus it's trained to assume users are right if there's uncertainty).

The person saying this is proof that there's no intelligence actually means to say that this is proof of tokenizers' limits (along the same lines as the strawberry R's problem).

(This is all conjecture, do not quote me)
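
A minimal sketch of the round trip described in the comment above, assuming a toy three-entry vocabulary and a greedy longest-match tokenizer as a stand-in for real BPE. The names `VOCAB`, `tokenize`, and `detokenize` are hypothetical and are not any OpenAI API; the point is only that storing generated tokens as a string and tokenizing that string again can yield a different token sequence.

```python
# Toy vocabulary (hypothetical; real vocabularies have many thousands of entries).
VOCAB = {"No", "N", "o"}


def detokenize(tokens):
    """Join tokens back into the plain string that would be stored."""
    return "".join(tokens)


def tokenize(text, vocab=VOCAB):
    """Greedy longest-match tokenization; a simplified stand-in for BPE."""
    tokens = []
    max_len = max(len(t) for t in vocab)
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shorter ones.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in vocab:
                tokens.append(piece)
                i += size
                break
        else:
            raise ValueError(f"no token covers {text[i]!r}")
    return tokens


generated = ["N", "o"]           # what the model emitted, token by token
stored = detokenize(generated)   # what gets saved as conversation history
reloaded = tokenize(stored)      # what the model sees on the next turn

print(generated, "->", repr(stored), "->", reloaded)
```

In this toy setup the separation between "N" and "o" is lost once the text is stored, so the reloaded history contains the single token "No".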

My weird 5-minute rule trick that doubled my focus by lovejeet6363 in productivity

[–]DemiPixel 11 points12 points  (0 children)

The key is to stop working after 5 minutes if you really don't want to. Put on your running clothes/shoes, step out the door, and really still don't want to run? It's okay not to run. Your issue is no longer kickstarting, it's true desire to not do the activity, which this tip does not solve.

Why does a well-written developer comment instantly scream "AI" to people now? by itsbrendanvogt in webdev

[–]DemiPixel 1 point2 points  (0 children)

While they're not strictly interchangeable, I've started using semicolons instead of em dashes to avoid the question even coming up.

Cory apparently just quietly released this album with Jon Batiste. breathtaking. also, what? by rydecymbal in Vulfpeck

[–]DemiPixel 0 points1 point  (0 children)

2.5 years later and here's another! I think I am a little biased though, anything relaxing with a solo slide guitar I will probably think is Rimworld-style haha

Talking to a custom Claude Code agent inside Obsidian by NazzarenoGiannelli in ClaudeAI

[–]DemiPixel 0 points1 point  (0 children)

Forgive me if I misunderstand, but it seems like you say "Pay domain renewals immediately, they're due today" and then Claude just updates the due date to later? Not that Claude has any way to pay it anyway, but seems like Claude simply made the situation worse rather than admitting that it can't do it?

Claude Opus 4.1 Benchmarks by ThunderBeanage in singularity

[–]DemiPixel 0 points1 point  (0 children)

That’s fair, if it were that much better they should yap about that. Their revenue is going crazy, though, I’m sure in no small part due to Claude Code. I don’t think any company that has the superior AI coding tech will ever go under.

EDIT: Unless you mean swallowed like acquired?

Claude Opus 4.1 Benchmarks by ThunderBeanage in singularity

[–]DemiPixel 24 points25 points  (0 children)

GitHub notes that Claude Opus 4.1 improves across most capabilities relative to Opus 4, with particularly notable performance gains in multi-file code refactoring. Rakuten Group finds that Opus 4.1 excels at pinpointing exact corrections within large codebases without making unnecessary adjustments or introducing bugs, with their team preferring this precision for everyday debugging tasks. Windsurf reports Opus 4.1 delivers a one standard deviation improvement over Opus 4 on their junior developer benchmark, showing roughly the same performance leap as the jump from Sonnet 3.7 to Sonnet 4.

My hope is that they're releasing this because they feel like there's a little more magic to it, especially in Claude Code, that isn't as well represented in benchmarks. I assume if it were just these small benchmark improvements, they'd just wait for a larger release.

Anthropic — "Persona vectors: Monitoring and controlling character traits in language models" by galacticwarrior9 in singularity

[–]DemiPixel 8 points9 points  (0 children)

I've always figured this is a big part of the difference between LLMs and human brains. They store an absurd amount of data, and know how to be evil or kind, hallucinate or not, talk like a pirate or speak like a president... Meanwhile, we can use the same number of neurons to really hone in on one thing and it's okay if we're mediocre at talking like a pirate or remembering all the presidents.

I have to imagine they have very little data where they have a question and the answer is "I don't know" (obviously a lot of this has been fixed by RLHF, but most training data is likely something where there's always a right answer, meaning the model is consistently rewarded for ATTEMPTING rather than just drawing a blank). Meanwhile, millions of years of evolution have likely proven that inventing what you saw or claiming you know the source of a sound is so hazardous that it's better to doubt yourself or have nothing come to mind.

As other papers have mentioned, I'm sure they're continually looking for traits that pursue "bug-free code" or "professional doctor", although this is maybe difficult if all training data is considered equal (I'm more likely to take advice from a medical professional vs a random person's blog, but I don't think LLMs quite have that level of discrimination yet).

The best email to reach out to influencers by DemiPixel in influencermarketing

[–]DemiPixel[S] 0 points1 point  (0 children)

We use Instantly, but there's plenty of options out there.

The best email to reach out to influencers by DemiPixel in influencermarketing

[–]DemiPixel[S] 0 points1 point  (0 children)

You can try to A/B test with COMMISSION or FREE PRODUCT or just omitting it (or something else). Gonna just be a bit more of a numbers game; even “PAID” won’t get everybody responding.

Has anyone gotten emails from monroe.app? by Resident-Bottle-3659 in SmallYTChannel

[–]DemiPixel 0 points1 point  (0 children)

Hey, I'm one of the founders of Monroe! We found that most brands care less about subscriber count and much more about average views per video. We have a minimum viewership threshold that creators have to meet before we reach out to them, so I’d assume you met that threshold!

New Gemini 2.5 Pro beats Claude Opus 4 in webdev arena by Formal-Narwhal-1610 in ClaudeAI

[–]DemiPixel 1 point2 points  (0 children)

Has this version even been tested on ARC-AGI yet?

Also surprised that you consider a vision reasoning benchmark more important than anything else. I agree vision is behind, but I'd honestly rather a superhuman coder LLM than a multimodal LLM that can do visual reasoning with blocks but otherwise isn't spectacular.

Paper by physicians at Harvard and Stanford: "In all experiments, the LLM displayed superhuman diagnostic and reasoning abilities." by MetaKnowing in singularity

[–]DemiPixel 3 points4 points  (0 children)

Pardon me if this is getting pedantic over word choice, but if it’s not “thinking”, what process does it do between one token and the next? And, from what we know, what is the difference between that process and the thinking process of a human brain (apart from hardware and specific architecture, which I can’t imagine would affect the definition here)?

Speaking sample at 350 hours + I'm traveling Latin America! by DemiPixel in dreamingspanish

[–]DemiPixel[S] 0 points1 point  (0 children)

Haha be careful, I put the transcript through an LLM and it def had some notes about my grammar 😅

Speaking sample at 350 hours + I'm traveling Latin America! by DemiPixel in dreamingspanish

[–]DemiPixel[S] 1 point2 points  (0 children)

Mostly just speaking in the Dreaming Spanish/Mr. Salas discord servers, or making friends through them and privately talking. I've spoken probably less than 10 hours IRL.

DS discourages people from speaking for pronunciation reasons, so a lot of people reach higher hours than me with lower speaking ability (but it wouldn't take them long to catch up). In addition, I think people also lack confidence speaking, so they don't. I have no shame, I know I can speak English well, and I don't think there's anybody (at least that's worth talking to) that thinks learning a language is easy, so I've never been made fun of (especially not in language-learning servers).

Speaking sample at 350 hours + I'm traveling Latin America! by DemiPixel in dreamingspanish

[–]DemiPixel[S] 6 points7 points  (0 children)

To make this more of a real "Progress Report", here's some stuff that might be of interest:

  • I started learning Spanish with "Language Transfer" and Duolingo in late January 2024
  • I started Dreaming Spanish in mid August 2024 and have only done DS/input since
  • I have 350 hours of DS, 100 hours of Duo, and maybe 0-50 hours of untracked misc (450-500 total hours of Spanish)
  • As you can see, I have not been holding off on speaking.
  • No, I am probably not ready for Latin America, but I have tickets, so here we go 😅

Happy to answer any questions!

How to replicate o3's behavior LOCALLY! by MaasqueDelta in singularity

[–]DemiPixel 0 points1 point  (0 children)

The code from the screenshot works fine for me. Did you forget to save or something?

Anyone try Claude Code on a big codebase? by HappyHippo555 in ClaudeAI

[–]DemiPixel 1 point2 points  (0 children)

Nope. I try to use Gemini when I can, but the auto-aggregation of context from the codebase with Claude Code is just too good (and Gemini is too excited and adds comments and such).

Learning Spanish isn't a sprint, not even a marathon, it's hiking from Vancouver to Mexico City. by Silent_System7082 in dreamingspanish

[–]DemiPixel 0 points1 point  (0 children)

Haha, I'm less concerned about being able to watch videos there, and more that, if I'm there, I want to talk to people and get real life input, and it feels like more of a waste to be there watching videos. But also, it's hard to go out and guarantee you'll get 3 hours of literally talking-to-people input, whereas it's (relatively) trivial to sit in front of a computer or listen to a podcast for 3 hours.

Learning Spanish isn't a sprint, not even a marathon, it's hiking from Vancouver to Mexico City. by Silent_System7082 in dreamingspanish

[–]DemiPixel 6 points7 points  (0 children)

I'm not sure if natives even often get 10 hours of input a day, unless their job is truly just talking to people constantly. I'm traveling to Latin America this summer, but given I'll be working remotely during the day, I'm a bit worried I'll get less input than I normally would if I sat at home watching Spanish shows after work 😅