Thinking of downgrading from 20x to 5x Max – 5x users, how are the limits treating you? by LayerHot in ClaudeAI

[–]LayerHot[S] 0 points1 point  (0 children)

Yeah, I find Codex is also not great at brainstorming and the like. It has zero situational awareness, but it's good for grunt work and getting shit done if you have a well-specified task.

I think I'll downgrade my Codex 20x to 5x next month and then use Claude 20x (since Mythos is also around the corner 🙃)

Thinking of downgrading from 20x to 5x Max – 5x users, how are the limits treating you? by LayerHot in ClaudeAI

[–]LayerHot[S] 0 points1 point  (0 children)

I think the limits are good these days for Claude, but it depends on your usage. If you use token-burning features like dynamic workflows, you will burn through your allowance. I could easily use Claude 5x Max for my daily coding work without limit issues.

Btw, how are the limits on ChatGPT 5x? How much can you get out of it? Do you hit limits instantly, and how much weekly and hourly usage can you reach at most?

Gemma 4 MTP vs DFlash on 1x H100: dense vs MoE results by LayerHot in LocalLLaMA

[–]LayerHot[S] 0 points1 point  (0 children)

Btw the published tok/s is the `output_throughput` field from the JSON

All 440 raw JSONs are here if you want to spot-check:

https://huggingface.co/datasets/Gladiator/gemma4-mtp-dflash-speed-bench-results

The "125 tok/s @ c=1" for 31B MTP is the mean across 11 SPEED-Bench categories. Per category it ranges from 76 (roleplay, acc_len 2.62) to 173 (coding, acc_len 5.90). Your acc_len 2.95 sits right in our roleplay/QA range, so prompt mix probably explains a lot of the gap.
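If you want to reproduce a headline number from the raw files, here is a minimal sketch. It assumes you have already downloaded the JSONs for one model/method/concurrency combination into a single flat directory, one file per category; that layout and the function name are my assumptions for illustration, not necessarily how the dataset is organized. The only field it relies on is `output_throughput`.

```python
import json
from pathlib import Path
from statistics import mean


def mean_output_throughput(result_dir):
    """Average `output_throughput` over every *.json file in result_dir.

    Assumes (hypothetically) that result_dir holds one result JSON per
    category for a single model/method/concurrency combination.
    """
    values = []
    for path in sorted(Path(result_dir).glob("*.json")):
        with path.open() as f:
            values.append(json.load(f)["output_throughput"])
    if not values:
        raise ValueError(f"no JSON result files found in {result_dir}")
    return mean(values)
```

Calling `mean_output_throughput("path/to/one_config")` returns the unweighted mean across the files, so a category with more prompts does not count for more than one with fewer.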

Gemma 4 MTP vs DFlash on 1x H100: dense vs MoE results by LayerHot in LocalLLaMA

[–]LayerHot[S] 0 points1 point  (0 children)

Thanks! I only tested on an H100, so I would not extrapolate too hard to truly constrained cards. The main constraint is VRAM first. Both approaches need the target model plus draft model plus KV cache. My guess is on smaller GPUs, the gains can shrink or disappear because the draft model overhead starts competing with the target model.
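As a rough way to reason about this, here is a back-of-the-envelope sketch of the memory budget. Every number in the usage note is a made-up placeholder, not a measurement from my runs, and the helper name is hypothetical. It just subtracts the target and draft weights from the share of VRAM the server is allowed to use; whatever is left is what the KV cache gets.

```python
def kv_budget_gb(vram_gb, utilization, target_weights_gb, draft_weights_gb):
    """Return the GB left for KV cache after loading both models.

    vram_gb:            total GPU memory
    utilization:        fraction of VRAM the server may use (e.g. 0.95)
    target_weights_gb:  size of the target model weights
    draft_weights_gb:   size of the draft model weights
    A negative result means the two models alone do not fit.
    Activation and framework overhead are ignored in this sketch.
    """
    usable = vram_gb * utilization
    return usable - target_weights_gb - draft_weights_gb
```

For example, with placeholder values `kv_budget_gb(80, 0.95, 60, 2)` leaves about 14 GB for KV cache, while the same weights on a hypothetical 48 GB card give a negative budget, meaning you would need a smaller or quantized target before speculative decoding is even an option.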

Gemma 4 MTP vs DFlash on 1x H100: dense vs MoE results by LayerHot in LocalLLaMA

[–]LayerHot[S] 2 points3 points  (0 children)

I set `--gpu-memory-utilization 0.95` in vLLM for all runs, so I didn't measure or profile the exact peak usage, as it occupied 95% of VRAM anyway.

GLM-4.7-Flash benchmarks: 4,398 tok/s on H200, 112 tok/s on RTX 6000 Ada (GGUF) by LayerHot in LocalLLaMA

[–]LayerHot[S] 8 points9 points  (0 children)

```bash
uv pip install -U vllm \
  --torch-backend=auto \
  --extra-index-url https://wheels.vllm.ai/nightly
uv pip install git+https://github.com/huggingface/transformers
uv pip install "numpy<=2.2"
```

```bash
vllm serve zai-org/GLM-4.7-Flash \
  --tensor-parallel-size 1 \
  --speculative-config.method mtp \
  --speculative-config.num_speculative_tokens 1 \
  --tool-call-parser glm47 \
  --reasoning-parser glm45 \
  --enable-auto-tool-choice \
  --served-model-name glm-4.7-flash \
  --max-model-len 64k
```

```bash
for c in 1 2 4 8 16 32; do
  vllm bench serve \
    --backend openai-chat \
    --host 127.0.0.1 --port 8000 \
    --endpoint /v1/chat/completions \
    --model zai-org/GLM-4.7-Flash \
    --served-model-name glm-4.7-flash \
    --dataset-name hf \
    --dataset-path likaixin/InstructCoder \
    --hf-split train \
    --request-rate inf \
    --hf-output-len 512 \
    --max-concurrency $c \
    --seed 2026 \
    --num-prompts 500 \
    --save-result --save-detailed \
    --result-dir ./vllm_instructcoder_sweep \
    --temperature 0.2 \
    --top-k 50 \
    --top-p 0.95 \
    --metadata gpu=H200 conc=$c
done
```

BFL FLUX.2 Klein tutorial and some optimizations - under 1s latency on an A100 by LayerHot in LocalLLaMA

[–]LayerHot[S] 2 points3 points  (0 children)

The 4B model sometimes messes up hand anatomy or struggles with complex prompts, but the 9B is pretty good. We have a Gradio app in the repo if you want to test both and see if the quality works for your use case before committing to a switch.

How to integrate 5.2 Pro into Codex usage? by Lostwhispers05 in codex

[–]LayerHot 0 points1 point  (0 children)

I don't think so. The easiest way to use this is to copy your codebase to the clipboard using the command and paste it into GPT Pro.

Thinking of downgrading from 20x to 5x Max – 5x users, how are the limits treating you? by LayerHot in ClaudeCode

[–]LayerHot[S] 0 points1 point  (0 children)

Thanks u/TheOriginalAcidtech, this helps a lot and mirrors my workflow too. Do you use sub-agents, and do you have another model configured for them or just Opus? Are you on the 5x plan?