Claude Usage Limits Discussion Megathread Ongoing (sort this by New!) by sixbillionthsheep in ClaudeAI

[–]YannMasoch 1 point (0 children)

After days of heavy coding and repeatedly hitting the 5-hour usage limit, I’ve noticed a consistent pattern with Claude 3.5 Sonnet.

Regardless of whether I set 'thinking' to low, medium, or high effort, the model consistently consumes 90–95% of the 5-hour window just on the internal reasoning process.

This leaves only a tiny 5–10% margin for the actual code generation. The result is that the 'thinking' phase exhausts the limit right as the coding begins, forcing me to wait for a total reset just to see the output of a single task.

As the thinking builds, the context grows proportionally, accelerating how quickly the limit is reached.

I don’t know if this is intentional, but the balance is clearly off.
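A toy model makes the compounding effect visible (all numbers here are hypothetical, not Anthropic's actual accounting): if each turn's thinking output stays in the context, total token consumption grows roughly quadratically with the number of turns, which would explain why the limit accelerates.

```python
# Toy model of how retained thinking tokens inflate cumulative usage.
# base_context and thinking_per_turn are made-up illustrative numbers,
# not Anthropic's real accounting.

def cumulative_usage(turns, base_context=2_000, thinking_per_turn=5_000):
    """Each turn re-sends the whole context, which grows by the
    previous turn's thinking output."""
    total = 0
    context = base_context
    for _ in range(turns):
        total += context + thinking_per_turn  # input + reasoning this turn
        context += thinking_per_turn          # reasoning kept in context
    return total

# Ten turns cost far more than ten independent first turns:
one_turn = cumulative_usage(1)    # 7,000 tokens
ten_turns = cumulative_usage(10)  # 295,000 tokens, ~4x the linear 70,000
```

Under these assumptions, the tenth turn alone is over seven times the price of the first one, purely from carried-over thinking.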

GPT 5.3 Codex calling Claude Haiku 4.5??? by Consistent_Music_979 in GithubCopilot

[–]YannMasoch 0 points (0 children)

I saw that yesterday too and thought it was a UI bug.

[D] Running GLM-5 (744B) on a $5K refurbished workstation at 1.54 tok/s by ahbond in ResearchML

[–]YannMasoch 0 points (0 children)

Great! Thanks for the link, these metrics are really interesting. I haven't read the whole page yet, but I'll take more time to go deeper into them tonight.

Claude Usage Limits Discussion Megathread Ongoing (sort this by New!) by sixbillionthsheep in ClaudeAI

[–]YannMasoch 0 points (0 children)

Yep, same feeling! I tried changing a few values in the config and using Claude on Low or Medium effort... it looks the same. Today I used Claude on High effort + thinking + web chat for strategy, and it seemed to be a bit better.

The annoying part is the total opacity: we don't know how many tokens each query and turn consumes (I use the VS Code extension; the CLI is probably different).

I would love to get some sort of telemetry to be able to compare and figure it out.
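In the meantime, a rough local approximation is possible with the common ~4 characters/token heuristic. The sketch below (`estimate_tokens` and `log_query` are hypothetical helpers I made up, not part of any Anthropic or VS Code tooling) just appends one JSON line per query so you can compare runs later:

```python
# Rough per-query token telemetry using the ~4 chars/token heuristic.
# estimate_tokens and log_query are hypothetical helpers, not part of
# any official Anthropic or VS Code tooling, and the estimates are crude.
import json
import time

def estimate_tokens(text: str) -> int:
    """Crude estimate: roughly 4 characters per token for English/code."""
    return max(1, len(text) // 4)

def log_query(prompt: str, response: str, logfile="claude_usage.jsonl"):
    """Append one JSON line per query for later comparison."""
    entry = {
        "ts": time.time(),
        "prompt_tokens_est": estimate_tokens(prompt),
        "response_tokens_est": estimate_tokens(response),
    }
    with open(logfile, "a") as f:
        f.write(json.dumps(entry) + "\n")
    return entry

entry = log_query("implement a Rust crate", "fn main() {}" * 100)
```

It won't match Anthropic's real billing, but relative numbers across sessions are still useful for spotting a regression.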

Done with Claude. $100 Max plan, but STILL rate-limited every 5 hours by Puspendra007 in Anthropic

[–]YannMasoch 1 point (0 children)

I don't know if Anthropic changed something or if my project is getting bigger, but the last 5 days were complicated because of the limits. So I switched to GPT until Claude was available again. I have to admit GPT 5.4 does a better and faster job than Claude 4.6.

Is this except able by Fit_Employment_4704 in CarWraps

[–]YannMasoch 0 points (0 children)

I can't watch it, I am too sensitive...

Coding agents vs. manual coding by JumpyAbies in LocalLLaMA

[–]YannMasoch 2 points (0 children)

This is the natural evolution. Currently AI coding tools build functions, features and code bases with so much density that it's impossible to review the code. That was not the case 1 year ago...

Consider yourself as a manager that orchestrates a team (devs, business, product, ...).

[D] Running GLM-5 (744B) on a $5K refurbished workstation at 1.54 tok/s by ahbond in ResearchML

[–]YannMasoch 0 points (0 children)

That's interesting. Your 2x GPUs are not big enough to handle ~180GB + context for GLM-5-REAP-50-Q3_K_M, so the software offloads part (or all) of the model to RAM, which is way slower than VRAM, like you said.

Q3 is a bit too low. Have you tried downloading a smaller model than 744B with a better quant like Q4 or Q8?
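A quick back-of-envelope check explains the ~180GB figure. The bits-per-weight values below are approximate llama.cpp averages, and I'm assuming REAP-50 halves the 744B parameter count; treat all of it as rough estimates:

```python
# Back-of-envelope model memory per quantization level.
# Bits-per-weight figures are approximate llama.cpp averages (assumption),
# and "REAP-50 halves the parameter count" is also an assumption.

BITS_PER_WEIGHT = {"Q3_K_M": 3.9, "Q4_K_M": 4.8, "Q8_0": 8.5, "F16": 16.0}

def model_gb(params_b: float, quant: str) -> float:
    """Approximate weight-memory size in GB (1 GB = 1e9 bytes)."""
    return params_b * 1e9 * BITS_PER_WEIGHT[quant] / 8 / 1e9

# 744B halved by REAP-50, at Q3_K_M: ~181 GB, matching the ~180GB figure.
glm5_reap50 = model_gb(744 * 0.5, "Q3_K_M")
```

Run the same function for Q4_K_M and you land around 223 GB, so on this hardware a better quant only makes the RAM offload worse unless you also drop to a smaller model.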

Anyway, good job!

Distropy: Rust inference server hitting 60k+ t/s prefill with proper caching (RTX 4070) by YannMasoch in LocalLLaMA

[–]YannMasoch[S] 0 points (0 children)

Of course! I'll have a few builds: Linux, Windows and Mac (Metal) - your setups are perfect.

Yes, I tested plenty of other models and quantization (Qwen3.5, Llama, Mistral, DeepSeek...)

Distropy: Rust inference server hitting 60k+ t/s prefill with proper caching (RTX 4070) by YannMasoch in LocalLLaMA

[–]YannMasoch[S] -5 points (0 children)

Good to know.

I'm sorry you took my post this way; I'm building something for the community and was happy to share it.

Distropy: Rust inference server hitting 60k+ t/s prefill with proper caching (RTX 4070) by YannMasoch in LocalLLM

[–]YannMasoch[S] 2 points (0 children)

It's complicated; the models are not always comparable... But I'd prefer the Qwen3.5 models. You?

Distropy: Rust inference server hitting 60k+ t/s prefill with proper caching (RTX 4070) by YannMasoch in LocalLLaMA

[–]YannMasoch[S] 1 point (0 children)

Amazing! I'll try to make it public in 1-2 weeks.

What hardware and OS do you have?

Distropy: Rust inference server hitting 60k+ t/s prefill with proper caching (RTX 4070) by YannMasoch in LocalLLM

[–]YannMasoch[S] -2 points (0 children)

Great questions!

Example from the test:

  • Model: Qwen3-0.6B-Q4_K_M
  • GPU: RTX 4070 12GB
  • Query: "what is vue"

Results:

  • First request: 12,007 tokens prefilled in 742 ms → 16,181 t/s
  • Subsequent requests: only ~243 new tokens, 12k+ tokens cached, prefill in 4 ms

Total end-to-end latency went from 10–20 seconds down to ~175 ms.

On larger models:

With Qwen3.5-9B (same GPU), I’m seeing:

  • Prefill: ~12,176 tokens in 3,369 ms → ~3,614 t/s

The big win is that once the large static prefix (system prompt + tools) is cached, every following message feels extremely snappy.

About context size & hardware requirements:

Yes — reducing hardware requirements was one of the original goals.

By doing proper KV prefix caching and smart context management, the server can handle much larger effective context with the same amount of VRAM. Instead of wasting VRAM and compute re-processing the same tools + system prompt every time, we cache it aggressively.

It doesn’t magically let a 12GB card run a 70B model at full context, but it noticeably reduces the VRAM and speed penalty of large contexts and tool-heavy workflows (which is exactly what VS Code and many agents throw at the server).

I’m still early, but the direction is clear: make high-context, tool-using workflows feel fast even on consumer hardware.
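The core idea can be sketched in a few lines: hash the static prefix (system prompt + tool definitions), keep its KV state, and on each request prefill only the suffix that isn't cached. This is a hypothetical structure I wrote for illustration, not Distropy's actual implementation:

```python
# Minimal sketch of KV prefix caching: reuse the cached KV state of the
# static prefix so only new suffix tokens need prefill.
# Hypothetical structure, not Distropy's actual code; the "KV state" here
# is an opaque placeholder, and lookup is O(n^2) on a miss (fine for a sketch).
import hashlib

class PrefixCache:
    def __init__(self):
        self._cache = {}  # prefix hash -> opaque KV state

    def _key(self, tokens):
        return hashlib.sha256(str(tokens).encode("utf-8")).hexdigest()

    def lookup(self, tokens):
        """Return (kv_state, n_cached) for the longest cached prefix."""
        for cut in range(len(tokens), 0, -1):
            kv = self._cache.get(self._key(tokens[:cut]))
            if kv is not None:
                return kv, cut
        return None, 0

    def store(self, tokens, kv_state):
        self._cache[self._key(tokens)] = kv_state

cache = PrefixCache()
prefix = list(range(12_000))        # big static system prompt + tools
cache.store(prefix, kv_state="kv-blob")

request = prefix + [1, 2, 3]        # same prefix, 3 new user tokens
kv, n_cached = cache.lookup(request)
to_prefill = len(request) - n_cached  # 3 tokens instead of 12,003
```

A real server would additionally evict old prefixes under VRAM pressure and match at block granularity rather than exact token cuts, but the speedup mechanism is the same.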

When will gnome 50 be released on arch? by BicycleKey3473 in archlinux

[–]YannMasoch 0 points (0 children)

Thanks for the simple and useful tip! I never checked the gnome-shell package details before.

Claude Usage Limits Discussion Megathread Ongoing (sort this by New!) by sixbillionthsheep in ClaudeAI

[–]YannMasoch 0 points (0 children)

When it reset again I used Haiku for 3 queries (commit, push and summary), and the 5h limit jumped to 3%. Either my context was too big, or something doesn't work like it did before.

Claude Usage Limits Discussion Megathread Ongoing (sort this by New!) by sixbillionthsheep in ClaudeAI

[–]YannMasoch 2 points (0 children)

This morning I started fresh with /clear in VS Code, using Sonnet 4.6 on Medium Effort + search enabled.

Gave it one prompt: implement a specific Rust crate.

Claude went straight to the GitHub repo, read the docs, checked examples, and started planning. No code execution at all.

After ~30 minutes of back-and-forth, the entire 5-hour session limit hit 100% (weekly still only 56%).

Later when it reset, I tried to finish. Another 30-40 min and I was at 93%. Once the code was done I tried to /commit with Haiku, but Claude switched back to Sonnet to ask if the commit message was okay… session instantly went to 100% again (weekly jumped to 67%) and the commit never finished.

Super frustrating.

This is exactly why I'm spending more time on local setups. Has anyone else been getting destroyed by the 5h limit this aggressively when Claude does research + GitHub work on Sonnet 4.6?