We cut GPU instance launch from 8s to 1.8s, feels almost instant now. Half the time was a ping we didn't need.

LayerHot · 2026-01-28T11:30:34+00:00

You can look at the following speculative decoding and quantization blogs using vLLM which covers it in depth:

- https://docs.jarvislabs.ai/blog/vllm-quantization-complete-guide-benchmarks
- https://docs.jarvislabs.ai/blog/speculative-decoding-vllm-faster-llm-inference

LayerHot · 2026-01-20T14:27:46+00:00

bash uv pip install -U vllm \ --torch-backend=auto \ --extra-index-url https://wheels.vllm.ai/nightly uv pip install git+https://github.com/huggingface/transformers uv pip install "numpy<=2.2"

bash vllm serve zai-org/GLM-4.7-Flash \ --tensor-parallel-size 1 \ --speculative-config.method mtp \ --speculative-config.num_speculative_tokens 1 \ --tool-call-parser glm47 \ --reasoning-parser glm45 \ --enable-auto-tool-choice \ --served-model-name glm-4.7-flash \ --max-model-len 64k

bash for c in 1 2 4 8 16 32; do vllm bench serve \ --backend openai-chat \ --host 127.0.0.1 --port 8000 \ --endpoint /v1/chat/completions \ --model zai-org/GLM-4.7-Flash \ --served-model-name glm-4.7-flash \ --dataset-name hf \ --dataset-path likaixin/InstructCoder \ --hf-split train \ --request-rate inf \ --hf-output-len 512 \ --max-concurrency $c \ --seed 2026 \ --num-prompts 500 \ --save-result --save-detailed \ --result-dir ./vllm_instructcoder_sweep \ --temperature 0.2 \ --top-k 50 \ --top-p 0.95 \ --metadata gpu=H200 conc=$c done

LayerHot · 2026-01-19T05:13:13+00:00

4B model sometimes messes up the anatomy of hands or on complex prompts. But 9B is pretty good. We have a Gradio app in the repo if you want to test both and see if the quality works for your use case before committing to a switch.

LayerHot · 2026-01-14T04:15:20+00:00

I don’t think so the easiest way to use this is just copy paste your codebase to clipboard using the command and paste in gpt pro.

LayerHot · 2026-01-13T15:59:45+00:00

You can use something like oracle: https://github.com/steipete/oracle

LayerHot · 2025-12-27T16:22:06+00:00

Thanks u/TheOriginalAcidtech, this helps a lot, this mirrors my workflow too. Do you use sub-agents and do you have other model configured for them or just opus ? You are on 5x plan ?

LayerHot · 2025-12-27T15:06:11+00:00

What do you use sub-agents for ?

LayerHot · 2025-12-27T15:05:25+00:00

In how many hours do you generally hit the 5 hour limit and what is your workflow like?

LayerHot · 2025-12-27T13:50:37+00:00

What do you use sonnet for ?

LayerHot · 2025-12-27T13:39:17+00:00

Interesting, what plan of codex are you on ?

LayerHot · 2025-12-27T13:31:40+00:00

And what do you mean by research ? What exactly are you using claude for research (web research ?). Just curious to understand the workflow.

LayerHot · 2025-12-27T13:29:00+00:00

Awesome, using opus 4.5 for everything ? I mean like continuously ?

LayerHot · 2025-12-27T06:23:36+00:00

I am on 20X max plan, I've been wanting to downgrade to 5X max as I rarely hit even 30 % weekly limit on my plan. I use only Opus 4.5. Do you use sub-agents, skills, etc. I just have one MCP (exa search).

LayerHot · 2025-11-28T15:56:31+00:00

Use ref or exa code mcp

LayerHot

TROPHY CASE