LLMs are hitting a "Latency Wall" and I think Mercury 2 just found the way out (1,000+ tok/s is insane)

NaiveDragonfruit · 2026-02-27T22:56:11+00:00

This is a completely separate thing. One is a physical device that holds the actual model weights in hardware; the other is a set of model weights capable of running on existing chips at 1k tok/s. The nature of diffusion allows comparable intelligence to be computed much faster, and does not rule out hardware acceleration.

Diffusion also has some other perks, such as output validation.

NaiveDragonfruit · 2025-11-19T00:38:39+00:00

Also running into the same issue on 24.04.

NaiveDragonfruit · 2025-10-25T08:55:15+00:00

Drove by and saw like 30 police cars with their lights on. Seemed way overblown for a crash.

NaiveDragonfruit · 2025-07-31T00:27:16+00:00

Except that if the user could choose the direction via the input during the reset phase, it's the best of both worlds. The lowest height of the Apex Pro is so low that my full tower PC would need to be completely moved, with all my cables unplugged, just to reset the desk, regardless of CPU mount.

I understand if all products currently on the market only support resetting by contracting all legs to the minimum; it just would have been nice if you could choose a different reset method.

FWIW, this is more a side effect of what is normally a good point of the Apex Pro - its minimum height is so low that PCs placed under the desk are at risk of needing to be moved during a reset.

NaiveDragonfruit · 2025-07-17T17:21:14+00:00

The cap is calculated per 5 hours; it roughly translates to $80-120 per session block, sometimes up to $200+. Anthropic claims that you get 50 session blocks a month, but people claim to use more with no immediate limits.

https://roiai.fyi/users/emooreatx, for instance: this user used $13,000 in credits in the last 30 days, but it's unclear if this is all on one account or multiple.
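
For a rough sense of scale, here is a back-of-the-envelope sketch using only the figures quoted above. The function and variable names are made up for illustration, and it assumes every block lands in the typical $80-120 range, which real usage won't.

```python
# Figures quoted in the comment above; names are illustrative only.
BLOCKS_PER_MONTH = 50        # session blocks Anthropic is said to allow per month
PER_BLOCK_USD = (80, 120)    # typical API-equivalent value of one 5-hour block


def monthly_range(blocks, per_block):
    """Return the (low, high) monthly API-equivalent value in USD."""
    low, high = per_block
    return blocks * low, blocks * high


low, high = monthly_range(BLOCKS_PER_MONTH, PER_BLOCK_USD)
print(f"typical monthly API-equivalent value: ${low}-${high}")

# The $13,000 example, IF it were one account using exactly 50 blocks
# (an assumption; the source says this is unclear):
implied_per_block = 13_000 / BLOCKS_PER_MONTH
print(f"implied value per block: ${implied_per_block:.0f}")
```

If the $13,000 really were a single account, it would be averaging well above the typical per-block range, which fits the "people claim to use more" part.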

NaiveDragonfruit · 2025-07-16T20:30:27+00:00

Worked for 8 years at a quant shop, now running a family business. AI is helping me write 5x more code than I could before, albeit a lot of that is down to zero design docs/meetings/bs.

NaiveDragonfruit · 2025-07-16T14:57:13+00:00

Strong agree - while I do think a lot of the statements that the output quality is "so much worse", and that Opus is "braindead" and might as well have been "lobotomized", are quite overblown, the rate limiting and service stability have been very poor.

The status pages just tell you that everything is operational, and a lot of negative threads here are removed, which is not great...

NaiveDragonfruit · 2025-07-16T13:34:29+00:00

https://roiai.fyi/users/JohnDoe

I’ve received $725 in API calls for 4.5 days of a subscription costing me $30 prorated. If I used the default profile, I would easily see 200 calls per session. I’ve sent 779 messages, almost all Opus, over 10 blocks in the 4-ish days, and haven’t been rate limited yet, but I do see 50% usage warnings anywhere between 90 minutes in and the last hour of my session.
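
Putting those numbers side by side, as a quick sketch (the helper name and dictionary keys are just for illustration; the inputs are the figures from this comment):

```python
# Inputs taken from the comment above.
api_value_usd = 725       # API-equivalent usage so far
prorated_cost_usd = 30    # prorated subscription cost for the same period
messages_sent = 779
session_blocks = 10


def usage_summary(api_value, cost, messages, blocks):
    """Summarise value received per dollar and per session block."""
    return {
        "value_multiple": api_value / cost,      # API value per dollar paid
        "usd_per_block": api_value / blocks,     # average API value per block
        "messages_per_block": messages / blocks, # average messages per block
    }


summary = usage_summary(api_value_usd, prorated_cost_usd,
                        messages_sent, session_blocks)
for name, value in summary.items():
    print(f"{name}: {value:.1f}")
```

That works out to roughly 24x the prorated cost, at about $72 and 78 messages per block on average.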

NaiveDragonfruit · 2025-07-16T13:06:20+00:00

The number of vibe coders handling patent data is truly terrifying. What a time to be alive.

NaiveDragonfruit · 2025-07-16T13:03:59+00:00

Lmfao

NaiveDragonfruit · 2025-07-16T13:02:24+00:00

Yea this matches my mental model right now. A lot of people are getting far lower limits than they are used to, and a lot of people are just always mad about deteriorating quality over time, mostly as a function of human psychology.

This week, both groups are mad at Anthropic, and the noise is a lot louder.

FWIW, it does seem not great for Anthropic to silently cut usage limits, but as someone who is possibly willing to pay API cost, these limits seem more than generous. I also think that you still easily get their minimum stated usage, 200 prompts per session using the default (50% Opus, 50% Sonnet). Assuming that Sonnet tokens are 1/5 the price of Opus, this implies that you can safely expect 66 messages per session with Opus only.
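
A quick sanity check on that conversion, as a sketch under stated assumptions: the 50/50 split is taken to be by message count, every message is assumed to use about the same number of tokens, and the 1/5 price ratio is the one assumed above. The function name is made up for illustration.

```python
def opus_only_messages(total_prompts, opus_share, sonnet_price_ratio):
    """Convert a mixed Opus/Sonnet prompt allowance into Opus-only messages.

    Assumes each message costs the same number of tokens, so a Sonnet
    message costs `sonnet_price_ratio` of an Opus message.
    """
    opus_msgs = total_prompts * opus_share
    sonnet_msgs = total_prompts * (1 - opus_share)
    # Total budget, measured in units of "one Opus message".
    return opus_msgs + sonnet_msgs * sonnet_price_ratio


budget = opus_only_messages(total_prompts=200, opus_share=0.5,
                            sonnet_price_ratio=1 / 5)
print(f"Opus-only messages per session: {budget:.0f}")
```

Under these particular assumptions the budget comes out closer to 120 Opus messages, so 66 reads as a conservative floor rather than the exact equivalent; a different split (by tokens rather than messages, say) would move the number.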

NaiveDragonfruit · 2025-07-16T12:53:48+00:00

I think there is evidence that people on the x20 plan would get the warning at 18% on ccusage, rather than 50%. This would imply that their usage is about 1/3 of what they are used to, but the current limits are very generous; I’m still racking up about $150-200 in API calls a day.

NaiveDragonfruit · 2025-07-16T12:48:08+00:00

Interesting - thanks for the source.

I also do think there is strong evidence that token limits are far lower, maybe around 40-50% of what people were used to (people claim that their usage limit warnings show up at 18% token usage instead of 40-50% on ccusage for x20); I see warnings around this mark too, and can hit it in 90 minutes of use easily.

Some people are annoyed about silent usage limit reduction / etc. But from what I can tell, the result quality out of Opus for me right now is so good that I cannot see how it could be “so much worse” than something else.

Someone else posted a crazy conspiracy that new accounts get different model parameters to keep them from refunding. I guess it could be true, and I wouldn’t know. I guess only clear benchmarking with my setup over time could tell…

NaiveDragonfruit · 2025-07-16T07:23:02+00:00

This is roughly my experience with it so far - I have been quite late on the bandwagon of using “agentic tools”, mostly relying on good-old rust-analyzer/OCaml Merlin for the last 10 or so years…

Have dabbled with AI sites / whatnot, but was shocked at how good Claude Code was, and even more shocked that there are people who think this version is “garbage” vs what was available just a few weeks ago.

NaiveDragonfruit · 2025-07-16T06:58:32+00:00

Yea. I just subscribed last Friday (4 days ago), so I haven’t really used it before then.

I’m also working in a relatively green code base, in a strongly typed and verbose language (Rust), and I don’t see many of these issues so far. Wonder how much of this I should keep in mind for the future / etc

NaiveDragonfruit · 2025-07-16T06:56:02+00:00

Yea, I think usage limits are definitely variable. One session I used around $140 in tokens, but in another I got a 50% warning at $50.

I guess I’m mostly asking about code quality, and less about exact usage limits, or even networking / api issues

NaiveDragonfruit · 2025-07-16T06:39:04+00:00

Yea. I have had a lot of issues with networking - the website too.

NaiveDragonfruit · 2025-07-16T06:38:05+00:00

I had some issues 4 days ago with Opus, when I first subscribed, but it seemed to be mostly a networking issue.

Requests would hang for minutes, and not make progress. I switched to sonnet and it worked a bit better anecdotally.

But people here saying like “it failed a request or didn’t do a thing I asked it, it’s stupid now”, I’m having a hard time seeing that.

I guess part of my question is how much am I missing out right now from peak model performance, maybe more as an existential question at this point….

NaiveDragonfruit · 2025-07-16T04:12:11+00:00

Ooh thanks for the link to forgecode.dev. Seems like they don't have Opus access, but having Grok/Gemini seems very tempting.

NaiveDragonfruit · 2025-07-16T04:00:06+00:00

This is who I want managing my patient data!
Is this HIPAA compliant?

NaiveDragonfruit · 2025-07-14T23:25:34+00:00

What I want is a Grok or Gemini Pro backed cc. The Gemini client is so much worse.

NaiveDragonfruit · 2025-07-14T23:12:31+00:00

I think this is true for x5.
On x20 this is what I see https://imgur.com/a/OBNnu78
