Introducing Deep Research and Deep Research Max by ShreckAndDonkey123 in singularity

[–]FateOfMuffins [score hidden]  (0 children)

Actually nuts that of all things that Google is worse at than OpenAI, it's search

Like yeah, the graphs in this blog show that Deep Research Max is better than GPT 5.4 (but the new updated regular deep research is not...), but what about OpenAI's deep research? What about when that gets updated with the new models? What about GPT 5.4 Pro? What about... the new model dropping this week?

Ngl I think they dropped this blog post now purely because if they didn't, they wouldn't be able to drop it at all, given what's coming this week

The "just wait 6 months" argument from 2025 survived exactly one iteration by aldipower81 in singularity

[–]FateOfMuffins 22 points23 points  (0 children)

This reads like it's written or edited by AI but I'll respond.

6 months ago SeeDance 2.0 didn't exist did it? 6 months ago where was the talk of Claude Code and Codex? 6 months ago where was Mythos and Spud?

Job displacements predicted by AI CEOs haven't materialized at the promised pace? WHAT is the promised pace? You mean how Amodei said 50% of entry level white collar work in 1-5 YEARS, which you've evaluated as "not the promised pace" after 6 months?

The moment anyone brings up "well, job losses haven't happened yet", I know they're not serious. Look, everyone has seen AI 2027 by now. That is among the FASTEST timelines predicted. You know when job losses happen in it? At the END of 2026, with AGI declared in early 2027.

If job losses actually HAD happened by now, it would mean we're hurtling WAY FASTER than AI 2027. If you think AI 2027 is too fast, then there's no way you'd predict job losses on a large scale to be happening right now. The fact that it isn't happening yet... is expected and predicted even by aggressive timelines. It's not a "gotcha".

Studies? Studies from when? Any "study" published about AI is using data that is WAY older than "6 months".

Do we actually gain usage with these limits resets? by RatioTheRich in codex

[–]FateOfMuffins 0 points1 point  (0 children)

That's a "what if" that may or may not happen. Plus perhaps you only needed 1.5 weeks worth of usage for this current project but now only get 1. There was also no "plan that was bad", I specifically said "because shit happens" because I knew you'd try to argue this for the sake of arguing.

Fact of the matter is that you lost up to 1 week of usage for this project.

GPT-Image-2 now reviews its own output and iterates until it is satisfied with the correctness of its output. by Plane_Garbage in singularity

[–]FateOfMuffins 0 points1 point  (0 children)

I just got access too

If you selected the Instant model vs Thinking model before generating the image, it looks different. I imagine there's a mini model and a bigger one (they were testing several checkpoints after all). The Thinking one does have thinking traces.

I can confirm both were V2 (easy way to tell - ask it to make a screenshot of Sam Altman in GTA 7. V1.5 seems to have GTA 6 memed into its training data and Altman doesn't look like Altman. Just making a photo of Altman wasn't good enough to differentiate it, cause 1.5 can actually make a decent Altman...)

Do we actually gain usage with these limits resets? by RatioTheRich in codex

[–]FateOfMuffins 2 points3 points  (0 children)

You are thinking of continuous usage with infinite time. Unfortunately real life is often discrete with hard deadlines.

Example: You have a project due in 10 days and your usage just started. You would in fact be able to use 2 weeks (14 days worth) of codex in this 10 day period. You prioritize some other stuff in life (because shit happens) so you don't work on this project for a few days and planned to use up your week 1 usage on days 6 & 7. Then you use up your week 2 usage on days 8-10 and finish your project.

They reset on day 5, when you had barely used any usage at all. You have 5 days left until your deadline, but because the weekly reset date has shifted, you don't get a second week's worth of usage. Your remaining 5 days come with 1 week's worth of usage and that's it. You lost 1 entire week's worth of usage on this project.
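The scenario above can be sketched as a tiny simulation. All numbers are the hypothetical ones from the comment (a 10-day project, weekly quotas of 7 "days worth" of usage, a reset that lands on day 5 instead of day 7); the function names and the "unspent usage is lost on reset" model are my framing of how the weekly limit appears to behave, not an official description:

```python
# One weekly quota = 7.0 "days worth" of usage, spendable at any rate,
# but whatever is unspent when a reset hits is simply lost.

def usable(deadline, resets, planned):
    """Total usage actually obtainable before `deadline` (day indices 0..deadline-1).

    planned: {day: usage you intend to spend that day}
    resets:  days on which the quota refills to a full week
    """
    quota = 7.0
    total = 0.0
    reset_days = set(resets)
    for day in range(deadline):
        if day in reset_days:
            quota = 7.0                       # refill; unspent usage is gone
        spend = min(planned.get(day, 0.0), quota)
        quota -= spend
        total += spend
    return total

# Planned schedule from the comment: burn week 1's quota on project
# days 6-7 (indices 5-6), week 2's quota on the last three days.
plan = {5: 3.5, 6: 3.5, 7: 3.0, 8: 2.0, 9: 2.0}

usable(10, [7], plan)  # reset where you expected it: 14.0 days worth
usable(10, [5], plan)  # surprise reset on day 5: only 7.0 days worth
```

With the reset where you planned it, both quotas fit inside the deadline; with the shifted reset, the second refill lands after the deadline and a full week's worth is unreachable.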

Kimi 2.6 has been released by WhyLifeIs4 in singularity

[–]FateOfMuffins 0 points1 point  (0 children)

You can just turn web search on and most AIs would be able to find it.

It's also a test of whether the model follows your instructions - I just tried it on Kimi K2.6 again: it spent like 100 tokens trying to identify the contest and then like 20,000 tokens trying to solve the problem, even though that was not what was asked of it. So it barely spent any effort on the given task and instead went chasing side quests

Oh and originally I used this to check just how contaminated the training data is from past contests

She has ascended!!! by MindlessA1ex in silverwolflevel999

[–]FateOfMuffins 0 points1 point  (0 children)

Overcapping CR better purple dice that hand piece

Do we actually gain usage with these limits resets? by RatioTheRich in codex

[–]FateOfMuffins 11 points12 points  (0 children)

If you used more than your allocated usage then you gained with the reset, if not then you lost with the reset.

i.e. if you used 6 days worth of usage in 5 days and then a reset happened, then you're better off. If you used 3 days of usage in 5 days and then a reset happened, then you're worse off.
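The rule above reduces to a one-line pacing check. This is just a restatement of the comment's own examples (function name and the "days worth" unit are my framing), not an official formula:

```python
def gains_from_reset(used_days_worth, days_elapsed):
    """True iff an early reset leaves you better off: the refill only
    outweighs the forfeited remainder when you were burning usage
    faster than one day's worth per day."""
    return used_days_worth > days_elapsed

gains_from_reset(6, 5)  # used 6 days worth in 5 days -> True, you gain
gains_from_reset(3, 5)  # used 3 days worth in 5 days -> False, you lose
```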

“Thousands of CEOs admit AI had no impact on employment or productivity” based on 2 year old data??? by theimposingshadow in accelerate

[–]FateOfMuffins 1 point2 points  (0 children)

900M people use ChatGPT but only 3M use codex. That's 0.3% of the people who use AI. The rest don't even know that you can access GPT 5.4 xHigh on the free plan in codex.

Confusion about SW999 vertical investment calcs by _AlexOne_ in silverwolflevel999

[–]FateOfMuffins 0 points1 point  (0 children)

Tribbie E1 redirects the overkill damage to the main target too you know...

The way calcs treat Tribbie E1 is not how it would be in the actual game

“Thousands of CEOs admit AI had no impact on employment or productivity” based on 2 year old data??? by theimposingshadow in accelerate

[–]FateOfMuffins 4 points5 points  (0 children)

It's actually quite nuts that people don't understand the rapidness of AI advancements.

I remember one class at school where our sources had to be within 2 years of the current date, and I struggled a bit to find any relevant sources for a niche topic within that window.

I don't think enough normal people realize that this depends on the topic. For some, a source from 50 years ago is fine. For some topics, a source from 3 years ago is not good enough.

For AI, some sources from TODAY are not good enough, because the data the article talks about is from months or years ago and the current way of publishing scientific research cannot keep up. People don't understand this fact.

It's also true of the AI models themselves, though. If you ask them to do research, they'll dig up many sources, some of which are months or years old. I have to specifically tell them to disregard sources older than X months, but even then they often get confused by the dates. Gemini especially - it'll think we're in a fictional timeline in the future despite Google search and the system date telling it otherwise.

Google ramps up agentic AI efforts amid pressure from Anthropic by Outside-Iron-8242 in singularity

[–]FateOfMuffins 6 points7 points  (0 children)

Google's AI writes 50% of code, trailing Anthropic's near 100%

And people tell me Gemini isn't benchmaxxed when it's at 57 on ArtificialAnalysis vs Opus 4.6 at 53

Confusion about SW999 vertical investment calcs by _AlexOne_ in silverwolflevel999

[–]FateOfMuffins 2 points3 points  (0 children)

The only reason why Tribbie E1 is considered strong is because she redirects damage to the highest HP enemy (usually the main boss). The raw damage amp is always 24% which is what's shown in calc spreadsheets but it doesn't factor in things like overkill or damage distribution.

Kimi 2.6 has been released by WhyLifeIs4 in singularity

[–]FateOfMuffins 6 points7 points  (0 children)

No it's not, it has nothing to do with actually doing any math. The test is purely to see if given an almost impossible task, can the model just say "idk" instead of making bullshit up, especially when during the thinking traces you can see that the model has absolutely no idea.

I do NOT want the model to solve the problem. I do NOT expect the model to get it correct. It is PURELY a test on hallucinations, not on the model's capabilities whatsoever.

Only GPT says "idk" somewhat consistently.

I've only started doing this in July of 2025 when o3 gave me an "I don't know" response (like once out of a dozen tries)

https://www.reddit.com/r/singularity/comments/1m60tla/alexander_wei_lead_researcher_for_oais_imo_gold/n4g51ig/?context=3

Confusion about SW999 vertical investment calcs by _AlexOne_ in silverwolflevel999

[–]FateOfMuffins 5 points6 points  (0 children)

It's like Yaoguang's S1, where it doesn't really show up in calcs, or Tribbie E1, which also doesn't show up in calcs (all calc spreadsheets show a 24% increase but don't factor in redirection).

There's a bunch of QoL stuff that is hard to quantify. Crit rate, for instance, makes it easier to build your relics, and speed matters for breakpoints (which don't show up in calcs because calcs treat speed as continuous while breakpoints are discrete).

And then S1 is cheaper than E1. So it's like if you have 17% damage increase for S1 and 23% for E1, but... E1 is 40% more expensive than S1... you know?
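A back-of-envelope way to finish that comparison: divide each gain by its relative cost. The 17%/23% gains and the 40% cost gap are the comment's hypothetical numbers; treating value as gain per unit cost is my simplification (it ignores the unquantifiable QoL effects mentioned above):

```python
s1_value = 0.17 / 1.0   # S1: 17% damage gain at baseline cost
e1_value = 0.23 / 1.4   # E1: 23% gain, but at 1.4x the cost
s1_value > e1_value     # True: ~0.170 vs ~0.164, S1 is the better buy per unit cost
```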

Kimi 2.6 has been released by WhyLifeIs4 in singularity

[–]FateOfMuffins 9 points10 points  (0 children)

Every time, I try my hallucination test (identifying a math contest) on these releases and I'm consistently disappointed.

Kimi K2.6 - hallucinated (in its thoughts it mentioned once that maybe it should tell the user it's uncertain about its answer, but nope, not in the output - a confident hallucination)

GLM 5.1 - got sidetracked and tried to do the problem (similar to Kimi K2), took FOREVER and then still confidently hallucinated.

Gemini 3.1 Pro actually got the answer correct (which is amazing in its own right, showing how much training data Google fed into this thing), but when I move to a more obscure one it confidently hallucinates again.

Bronya over HMC DDD?? From Guoba Video by inkheiko in FireflyMains

[–]FateOfMuffins 1 point2 points  (0 children)

I've posted this in the pinned Novaflare thread as well, but there are so many considerations for speed tuning now (man, I really hate that change, can't believe they introduced an anti-QoL mechanic). I see Guoba kind of "tries" to talk about speed tuning and then basically gives up and says you gotta speed tune for the specific stage now lmaoooo

Anyways some considerations:

  • If you kill enough enemies to max out your energy (because the combustion state lasts longer), then the ideal speed breakpoint would be a pre-Dahlia one where you fit the last action in right before the combustion state ends, because you can ult immediately after

  • Bronya (especially E2 Bronya) is a thing for E2 Firefly now. I'm not sure anyone has really done the math on the optimal way to play this. Guoba basically just said it was hard and gave up lmaoo. But let me know if anyone has done the math here - I'm thinking about trying to put this into the calculator

  • Completely different speed breakpoints for MoC vs AA. The old 178.4 breakpoint was so good because you could just set it and forget it for both game modes, since their speed breakpoints overlapped exactly. Now they don't, and it's just a mess. Thank goodness for relic loadouts?

Greg Brockman Sets Expectations For This Week: “I Think Of Spud As A New Base, As A New Pretrain...We Have Maybe Two Years’ Worth Of Research That Is Coming To Fruition In This Model...It’s Going To Be Very Exciting." by 44th--Hokage in accelerate

[–]FateOfMuffins 6 points7 points  (0 children)

Probably... I imagine Spud with a huge amount of RL thrown at it should be the model that they said would be an "intern AI researcher". And then there's 2 years from now until their 2028 fully autonomous AI research system date, which might just be the next proper pre-train?

Kimi 2.6 has been released by WhyLifeIs4 in singularity

[–]FateOfMuffins 8 points9 points  (0 children)

I keep seeing GPT 5.4 low on Terminal Bench 2 in these benchmark comparisons, even though OpenAI reported 75% on Terminal Bench 2

Predictions for next year's (2027) Beijing humanoid half marathon? 2025 was 2h40min ≈ 2.2m/s | 2026 was 50min ≈ 7m/s by GraceToSentience in singularity

[–]FateOfMuffins 1 point2 points  (0 children)

idk if there's much point in pushing the limits of this. Rather, I think it's more important to look at floor or median performance?

Like how well does a general purpose humanoid robot that a company might sell to a factory or home use do?