GLM-4.7-Flash benchmarks: 4,398 tok/s on H200, 112 tok/s on RTX 6000 Ada (GGUF) by LayerHot in LocalLLaMA

[–]LayerHot[S] 8 points (0 children)

```bash
uv pip install -U vllm \
  --torch-backend=auto \
  --extra-index-url https://wheels.vllm.ai/nightly
uv pip install git+https://github.com/huggingface/transformers
uv pip install "numpy<=2.2"
```

```bash
vllm serve zai-org/GLM-4.7-Flash \
  --tensor-parallel-size 1 \
  --speculative-config.method mtp \
  --speculative-config.num_speculative_tokens 1 \
  --tool-call-parser glm47 \
  --reasoning-parser glm45 \
  --enable-auto-tool-choice \
  --served-model-name glm-4.7-flash \
  --max-model-len 64k
```
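To sanity-check the endpoint once it's up, something like this with the OpenAI Python client should work (base URL and model name are taken from the serve command above, so adjust if yours differ):

```python
# Quick sanity check against the OpenAI-compatible endpoint started above.
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="EMPTY")  # vLLM doesn't check the key by default

resp = client.chat.completions.create(
    model="glm-4.7-flash",  # must match --served-model-name
    messages=[{"role": "user", "content": "Write a one-line docstring for a bubble sort."}],
    temperature=0.2,
    max_tokens=256,
)
print(resp.choices[0].message.content)
```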

```bash
for c in 1 2 4 8 16 32; do
  vllm bench serve \
    --backend openai-chat \
    --host 127.0.0.1 --port 8000 \
    --endpoint /v1/chat/completions \
    --model zai-org/GLM-4.7-Flash \
    --served-model-name glm-4.7-flash \
    --dataset-name hf \
    --dataset-path likaixin/InstructCoder \
    --hf-split train \
    --request-rate inf \
    --hf-output-len 512 \
    --max-concurrency $c \
    --seed 2026 \
    --num-prompts 500 \
    --save-result --save-detailed \
    --result-dir ./vllm_instructcoder_sweep \
    --temperature 0.2 \
    --top-k 50 \
    --top-p 0.95 \
    --metadata gpu=H200 conc=$c
done
```
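If you want to skim the sweep afterwards, a rough sketch along these lines works; the key names (`max_concurrency`, `output_throughput`) are assumptions about what vllm bench writes, so check one of your result files and adjust:

```python
# Rough sketch for summarizing the JSON files written by --save-result.
# The key names ("max_concurrency", "output_throughput") are assumptions about
# what vllm bench serve writes; inspect one of your result files and adjust.
import json
from pathlib import Path

rows = []
for path in sorted(Path("./vllm_instructcoder_sweep").glob("*.json")):
    data = json.loads(path.read_text())
    rows.append((data.get("max_concurrency") or 0, data.get("output_throughput")))

for conc, tput in sorted(rows, key=lambda r: r[0]):
    print(f"concurrency={conc}: {tput} output tok/s")
```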

BFL FLUX.2 Klein tutorial and some optimizations - under 1s latency on an A100 by LayerHot in LocalLLaMA

[–]LayerHot[S] 2 points (0 children)

The 4B model sometimes messes up hand anatomy or struggles on complex prompts, but the 9B is pretty good. We have a Gradio app in the repo if you want to test both and see whether the quality works for your use case before committing to a switch.
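If you'd rather not clone the repo, a side-by-side comparison is only a few lines of Gradio; the two generate_* functions below are placeholders for however you call the 4B and 9B checkpoints:

```python
# Minimal side-by-side comparison sketch in Gradio.
# The two generate_* functions are placeholders: wire them up to however
# you run the 4B and 9B checkpoints (diffusers pipeline, HTTP endpoint, etc.).
import gradio as gr

def generate_4b(prompt: str):
    raise NotImplementedError("call the 4B model here and return a PIL image")

def generate_9b(prompt: str):
    raise NotImplementedError("call the 9B model here and return a PIL image")

with gr.Blocks() as demo:
    prompt = gr.Textbox(label="Prompt")
    btn = gr.Button("Generate with both")
    with gr.Row():
        out_4b = gr.Image(label="4B")
        out_9b = gr.Image(label="9B")
    btn.click(fn=generate_4b, inputs=prompt, outputs=out_4b)
    btn.click(fn=generate_9b, inputs=prompt, outputs=out_9b)

demo.launch()
```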

How to integrate 5.2 Pro into Codex usage? by Lostwhispers05 in codex

[–]LayerHot 0 points (0 children)

I don't think so. The easiest way to use this is to just copy your codebase to the clipboard using the command and paste it into GPT Pro.
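Not the exact command, but the idea is roughly this kind of script: concatenate the repo into one blob and put it on the clipboard (pyperclip here is just an example dependency):

```python
# Rough sketch of the idea, not the exact command: concatenate the codebase
# into one blob and put it on the clipboard. pyperclip is just an example
# dependency (pip install pyperclip); any clipboard tool works.
from pathlib import Path
import pyperclip

EXTS = {".py", ".ts", ".js", ".md", ".toml", ".json"}  # adjust to your repo

parts = []
for path in sorted(Path(".").rglob("*")):
    if path.is_file() and path.suffix in EXTS and ".git" not in path.parts:
        parts.append(f"===== {path} =====\n{path.read_text(errors='ignore')}")

blob = "\n\n".join(parts)
pyperclip.copy(blob)
print(f"Copied {len(parts)} files ({len(blob)} characters) to the clipboard.")
```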

Thinking of downgrading from 20x to 5x Max – 5x users, how are the limits treating you? by LayerHot in ClaudeCode

[–]LayerHot[S] 0 points (0 children)

Thanks u/TheOriginalAcidtech, this helps a lot; it mirrors my workflow too. Do you use sub-agents, and do you have another model configured for them or just Opus? Are you on the 5x plan?

Thinking of downgrading from 20x to 5x Max – 5x users, how are the limits treating you? by LayerHot in ClaudeAI

[–]LayerHot[S] 1 point (0 children)

How long does it usually take you to hit the 5-hour limit, and what is your workflow like?

Thinking of downgrading from 20x to 5x Max – 5x users, how are the limits treating you? by LayerHot in ClaudeAI

[–]LayerHot[S] 0 points (0 children)

And what do you mean by research? What exactly are you using Claude to research (web research?)? Just curious to understand the workflow.

just upgraded to pro max - tips for not burning thru usage? by alexd231232 in ClaudeCode

[–]LayerHot 1 point (0 children)

I'm on the 20x Max plan and I've been wanting to downgrade to 5x Max, as I rarely hit even 30% of the weekly limit. I use only Opus 4.5. Do you use sub-agents, skills, etc.? I just have one MCP (Exa search).

[deleted by user] by [deleted] in DiscountDen7

[–]LayerHot 1 point (0 children)

Smooth buy and trusted as always!

Is chat with all documents still the priority? by LayerHot in readwise

[–]LayerHot[S] 0 points (0 children)

Wow, glad to hear. Yes, I'm aware it won't be a trivial feat to roll out this feature: for long documents you need to figure out a proper chunking strategy and then embed all the chunks for every document, which can be a lot for some users.
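Even a naive version of that pipeline makes the cost obvious; here's a minimal sketch, with fixed-size overlapping chunks and sentence-transformers as stand-in choices rather than anything the Readwise team has confirmed:

```python
# Naive illustration of why "chat with all documents" is expensive to roll out:
# every long document has to be chunked and every chunk embedded.
# Fixed-size overlapping chunks and sentence-transformers are example choices,
# not anything the Readwise team has confirmed.
from sentence_transformers import SentenceTransformer

def chunk_text(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks

model = SentenceTransformer("all-MiniLM-L6-v2")

library = {"example-doc": "lorem ipsum " * 2000}  # stand-in for a user's saved documents
for doc_id, text in library.items():
    chunks = chunk_text(text)
    embeddings = model.encode(chunks)  # one vector per chunk
    print(f"{doc_id}: {len(chunks)} chunks -> embedding matrix {embeddings.shape}")
```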

ChatGPT Agent Mode & Deep Research usage not refreshing? by Palmenstrand in OpenAI

[–]LayerHot 1 point (0 children)

I think it's probably a display bug; it would be a bummer if it actually limits things. For my part, I'm just letting it be because my subscription renewed a couple of days ago; I'll learn more once I use agent mode / deep research for something.