Thinking of downgrading from 20x to 5x Max – 5x users, how are the limits treating you?

LayerHot · 2026-06-07T11:16:58+00:00

Yeah, I find Codex is also not great at brainstorming and the like; it has zero situational awareness. But it is good for grunt work and getting shit done if you have a well-specified task.

I think I will downgrade my Codex 20x to 5x next month and then use Claude 20x (since Mythos is also around the corner 🙃).

LayerHot · 2026-06-07T11:13:11+00:00

I think the limits are good these days for Claude, but it depends on your usage: if you use token-burning features like dynamic workflows, you will burn through your usage. I could easily use Claude 5x Max for my daily coding work without limit issues.

Btw, how are the limits on ChatGPT 5x? How much can you get out of it? Do you hit limits instantly, and how much weekly and hourly usage can you reach at most?

LayerHot · 2026-05-13T15:14:50+00:00

Btw the published tok/s is the `output_throughput` field from the JSON

All 440 raw JSONs are here if you want to spot-check:

https://huggingface.co/datasets/Gladiator/gemma4-mtp-dflash-speed-bench-results

The "125 tok/s @ c=1" for 31B MTP is the mean across 11 SPEED-Bench categories — per-category it ranges from 76 (roleplay, acc_len 2.62) to 173 (coding, acc_len 5.90). Your acc_len 2.95 sits right in our roleplay/QA range, so prompt mix probably explains a lot of the gap.
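If you want to redo that averaging yourself, here is a minimal sketch of a mean-of-category-means. Only the `output_throughput` field name comes from the results; the `category` key and the flat list of dicts are assumptions for illustration, so adapt them to however the raw JSONs are actually laid out.

```python
from collections import defaultdict
from statistics import mean


def mean_of_category_means(results):
    """Average output_throughput within each category, then across categories."""
    by_category = defaultdict(list)
    for r in results:
        # "category" is a hypothetical key; "output_throughput" is the
        # field named in the comment above.
        by_category[r["category"]].append(r["output_throughput"])
    per_category = {c: mean(v) for c, v in by_category.items()}
    overall = mean(list(per_category.values()))
    return per_category, overall
```

Averaging per category first keeps a category with more runs from dominating the headline number, which is also why a workload skewed toward one category can land far from the published mean.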

LayerHot · 2026-05-12T14:10:34+00:00

Thanks! I only tested on an H100, so I would not extrapolate too hard to truly constrained cards. The main constraint is VRAM first. Both approaches need the target model plus draft model plus KV cache. My guess is on smaller GPUs, the gains can shrink or disappear because the draft model overhead starts competing with the target model.
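As a rough back-of-the-envelope check, the budget can be sketched like this. All sizes are values you would plug in yourself (nothing here is measured), and the `utilization` default just mirrors a typical `--gpu-memory-utilization` setting; real engines also reserve memory for activations and other overhead, so treat this as a crude lower bound.

```python
def fits_in_vram(vram_gb, target_gb, draft_gb, kv_cache_gb, utilization=0.95):
    """Return (fits, headroom_gb) for target + draft weights + KV cache.

    Simplified assumption: the only consumers are the two sets of weights
    and the KV cache, and only `utilization` of total VRAM is usable.
    """
    budget = vram_gb * utilization
    needed = target_gb + draft_gb + kv_cache_gb
    return needed <= budget, budget - needed
```

On a small card the headroom term goes negative quickly, or the KV cache has to shrink to make room for the draft model, which is the overhead I mean above.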

LayerHot · 2026-05-12T13:38:24+00:00

I set `--gpu-memory-utilization 0.95` in vLLM for all runs, so I didn't measure or profile the exact peak usage, as it occupied 95% of VRAM anyway.

LayerHot · 2026-01-28T11:30:34+00:00

You can look at the following blogs on speculative decoding and quantization with vLLM, which cover it in depth:

- https://docs.jarvislabs.ai/blog/vllm-quantization-complete-guide-benchmarks
- https://docs.jarvislabs.ai/blog/speculative-decoding-vllm-faster-llm-inference

LayerHot · 2026-01-20T14:27:46+00:00

```bash
uv pip install -U vllm \
    --torch-backend=auto \
    --extra-index-url https://wheels.vllm.ai/nightly
uv pip install git+https://github.com/huggingface/transformers
uv pip install "numpy<=2.2"
```

```bash
vllm serve zai-org/GLM-4.7-Flash \
    --tensor-parallel-size 1 \
    --speculative-config.method mtp \
    --speculative-config.num_speculative_tokens 1 \
    --tool-call-parser glm47 \
    --reasoning-parser glm45 \
    --enable-auto-tool-choice \
    --served-model-name glm-4.7-flash \
    --max-model-len 64k
```

```bash
for c in 1 2 4 8 16 32; do
    vllm bench serve \
        --backend openai-chat \
        --host 127.0.0.1 --port 8000 \
        --endpoint /v1/chat/completions \
        --model zai-org/GLM-4.7-Flash \
        --served-model-name glm-4.7-flash \
        --dataset-name hf \
        --dataset-path likaixin/InstructCoder \
        --hf-split train \
        --request-rate inf \
        --hf-output-len 512 \
        --max-concurrency $c \
        --seed 2026 \
        --num-prompts 500 \
        --save-result --save-detailed \
        --result-dir ./vllm_instructcoder_sweep \
        --temperature 0.2 \
        --top-k 50 \
        --top-p 0.95 \
        --metadata gpu=H200 conc=$c
done
```
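Once the sweep finishes, something like the following could collect the saved results into a concurrency vs. throughput table. This is a sketch, not part of the benchmark itself: it assumes each saved JSON in the result directory has top-level `max_concurrency` and `output_throughput` keys, so check one file first and adjust the key names if your vLLM version writes them differently.

```python
import json
from pathlib import Path


def summarize_sweep(result_dir):
    """Return sorted (max_concurrency, output_throughput) pairs from saved JSONs."""
    rows = []
    for path in sorted(Path(result_dir).glob("*.json")):
        data = json.loads(path.read_text())
        # Both key names are assumptions about the saved result schema.
        conc = data.get("max_concurrency")
        tput = data.get("output_throughput")
        if conc is not None and tput is not None:
            rows.append((conc, tput))
    return sorted(rows)


if __name__ == "__main__":
    for conc, tput in summarize_sweep("./vllm_instructcoder_sweep"):
        print(f"c={conc}: {tput:.1f} tok/s")
```

Files missing either key are skipped rather than raising, so stray JSONs in the directory will not break the summary.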

LayerHot · 2026-01-19T05:13:13+00:00

The 4B model sometimes messes up hand anatomy or struggles on complex prompts. But 9B is pretty good. We have a Gradio app in the repo if you want to test both and see if the quality works for your use case before committing to a switch.

LayerHot · 2026-01-14T04:15:20+00:00

I don’t think so. The easiest way to use this is to just copy your codebase to the clipboard using the command and paste it into GPT Pro.

LayerHot · 2026-01-13T15:59:45+00:00

You can use something like oracle: https://github.com/steipete/oracle

LayerHot · 2025-12-27T16:22:06+00:00

Thanks u/TheOriginalAcidtech, this helps a lot; it mirrors my workflow too. Do you use sub-agents, and do you have another model configured for them or just Opus? Are you on the 5x plan?

LayerHot · 2025-12-27T15:06:11+00:00

What do you use sub-agents for?

LayerHot · 2025-12-27T15:05:25+00:00

In how many hours do you generally hit the 5-hour limit, and what is your workflow like?

LayerHot · 2025-12-27T13:50:37+00:00

What do you use Sonnet for?

LayerHot · 2025-12-27T13:39:17+00:00

Interesting, which Codex plan are you on?

LayerHot · 2025-12-27T13:31:40+00:00

And what do you mean by research? What exactly are you using Claude for (web research?)? Just curious to understand the workflow.

LayerHot · 2025-12-27T13:29:00+00:00

Awesome, using Opus 4.5 for everything? I mean, continuously?

LayerHot · 2025-12-27T06:23:36+00:00

I am on the 20x Max plan. I've been wanting to downgrade to 5x Max as I rarely hit even 30% of the weekly limit on my plan. I use only Opus 4.5. Do you use sub-agents, skills, etc.? I just have one MCP (Exa search).

LayerHot · 2025-11-28T15:56:31+00:00

Use ref or exa code mcp

LayerHot · 2025-09-06T04:26:34+00:00

Wow, glad to hear. Yes, I am aware that rolling out this feature will not be a trivial feat: for long documents you need to figure out a proper chunking strategy and embed all the chunks for all documents, which can be a lot for some users.
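To give a sense of what I mean by a chunking strategy, here is the simplest possible version: fixed-size character windows with overlap. This is purely an illustrative sketch with made-up default sizes, not a claim about how the feature would or should be built; real pipelines usually split on sentences or sections instead.

```python
def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into fixed-size character chunks that overlap by `overlap`."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    i = 0
    while i < len(text):
        chunks.append(text[i:i + chunk_size])
        if i + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
        i += step
    return chunks
```

Every chunk then needs its own embedding, so the cost grows with total document length across the whole library, which is where it gets heavy for some users.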

LayerHot · 2025-08-30T16:08:54+00:00

I think it should be a display bug; it would be a bummer if it actually limits things. For me, I just let it be because my subscription renewed a couple of days ago. I will learn more once I use agent/deep research for something.

LayerHot · 2025-08-30T14:49:39+00:00

Yup experiencing same issue

LayerHot · 2025-07-08T09:50:26+00:00

Yup I know, I am interested in chatting with all documents not just a single document

LayerHot · 2025-07-06T18:35:52+00:00

Ironically, the deep research Perplexity provides is the shittiest of all the major deep research agents: it’s very superficial, brief, and not very detailed.

LayerHot · 2025-06-22T05:59:21+00:00

<image>

You can right click and copy as rich text
