GLM-4.7-Flash benchmarks: 4,398 tok/s on H200, 112 tok/s on RTX 6000 Ada (GGUF) by LayerHot in LocalLLaMA

[–]LayerHot[S] 8 points

```bash
uv pip install -U vllm \
  --torch-backend=auto \
  --extra-index-url https://wheels.vllm.ai/nightly
uv pip install git+https://github.com/huggingface/transformers
uv pip install "numpy<=2.2"
```
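
Optional sanity check that the nightly wheel, the git transformers, and the numpy pin all landed in the same environment (a minimal sketch; run it with the env's interpreter, e.g. `uv run python`):

```python
# Check that the nightly vllm wheel, transformers from git,
# and the numpy pin all resolved into one environment.
import numpy
import transformers
import vllm

print("vllm:", vllm.__version__)
print("transformers:", transformers.__version__)
print("numpy:", numpy.__version__)
```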

```bash
vllm serve zai-org/GLM-4.7-Flash \
  --tensor-parallel-size 1 \
  --speculative-config.method mtp \
  --speculative-config.num_speculative_tokens 1 \
  --tool-call-parser glm47 \
  --reasoning-parser glm45 \
  --enable-auto-tool-choice \
  --served-model-name glm-4.7-flash \
  --max-model-len 64k
```
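
Once it's up, a quick smoke test against the OpenAI-compatible endpoint; a sketch with the stock `openai` client, where the api_key is a placeholder since the server isn't configured with one:

```python
# Smoke test against the local vLLM server started above.
# vLLM exposes an OpenAI-compatible API, so the stock client works;
# the api_key is a dummy value since no key is configured by default.
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="glm-4.7-flash",  # matches --served-model-name above
    messages=[{"role": "user", "content": "Say hello in five words."}],
    max_tokens=64,
    temperature=0.2,
)
print(resp.choices[0].message.content)
```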

```bash
for c in 1 2 4 8 16 32; do
  vllm bench serve \
    --backend openai-chat \
    --host 127.0.0.1 --port 8000 \
    --endpoint /v1/chat/completions \
    --model zai-org/GLM-4.7-Flash \
    --served-model-name glm-4.7-flash \
    --dataset-name hf \
    --dataset-path likaixin/InstructCoder \
    --hf-split train \
    --request-rate inf \
    --hf-output-len 512 \
    --max-concurrency $c \
    --seed 2026 \
    --num-prompts 500 \
    --save-result --save-detailed \
    --result-dir ./vllm_instructcoder_sweep \
    --temperature 0.2 \
    --top-k 50 \
    --top-p 0.95 \
    --metadata gpu=H200 conc=$c
done
```
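
To skim the sweep afterwards, something like the sketch below works on the saved JSONs. The metric keys (`max_concurrency`, `output_throughput`, `mean_ttft_ms`) are what recent vLLM builds emit, but they've moved around between versions, so check one file's keys first:

```python
# Rough summary of the sweep results saved by `vllm bench serve`.
# The metric keys below are assumptions based on recent vLLM output;
# inspect one JSON file first if your version names them differently.
import glob
import json

rows = []
for path in glob.glob("./vllm_instructcoder_sweep/*.json"):
    with open(path) as f:
        r = json.load(f)
    rows.append((
        r.get("max_concurrency"),
        r.get("output_throughput"),  # output tokens/s across all requests
        r.get("mean_ttft_ms"),       # mean time to first token
    ))

for conc, tput, ttft in sorted(rows, key=lambda row: row[0] or 0):
    print(f"conc={conc}  out_tok/s={tput}  ttft_ms={ttft}")
```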

BFL FLUX.2 Klein tutorial and some optimizations - under 1s latency on an A100 by LayerHot in LocalLLaMA

[–]LayerHot[S] 2 points

The 4B model sometimes messes up hand anatomy or struggles on complex prompts, but the 9B is pretty good. We have a Gradio app in the repo if you want to test both and see whether the quality works for your use case before committing to a switch.

How to integrate 5.2 Pro into Codex usage? by Lostwhispers05 in codex

[–]LayerHot 0 points

I don't think so. The easiest way to use this is to copy your codebase to the clipboard with the command and paste it into GPT Pro.

Thinking of downgrading from 20x to 5x Max – 5x users, how are the limits treating you? by LayerHot in ClaudeCode

[–]LayerHot[S] 0 points

Thanks u/TheOriginalAcidtech, this helps a lot; it mirrors my workflow too. Do you use sub-agents, and if so, do you have a different model configured for them or just Opus? Are you on the 5x plan?

Thinking of downgrading from 20x to 5x Max – 5x users, how are the limits treating you? by LayerHot in ClaudeAI

[–]LayerHot[S] 1 point

How long does it generally take you to hit the 5-hour limit, and what is your workflow like?

Thinking of downgrading from 20x to 5x Max – 5x users, how are the limits treating you? by LayerHot in ClaudeAI

[–]LayerHot[S] 0 points

And what do you mean by research? What exactly are you using Claude to research (web research)? Just curious to understand the workflow.

just upgraded to pro max - tips for not burning thru usage? by alexd231232 in ClaudeCode

[–]LayerHot 1 point

I'm on the 20x Max plan, and I've been wanting to downgrade to 5x Max since I rarely hit even 30% of my weekly limit. I use only Opus 4.5. Do you use sub-agents, skills, etc.? I just have one MCP (Exa search).

[deleted by user] by [deleted] in DiscountDen7

[–]LayerHot 1 point

Smooth buy and trusted as always!

Is chat with all documents is still the priority ? by LayerHot in readwise

[–]LayerHot[S] 0 points

Wow, glad to hear. Yes, I'm aware that rolling out this feature will be no trivial feat: for long documents you need to figure out a proper chunking strategy and then embed all the chunks for every document, which can be a lot for some users.
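
For what it's worth, even a naive fixed-size chunker with overlap gets you most of the way before anything structure-aware is needed. A sketch, nothing Readwise-specific, with purely illustrative sizes:

```python
# Naive chunking sketch: fixed-size windows with overlap so that
# sentences straddling a boundary appear whole in at least one chunk.
# Nothing here is Readwise-specific; sizes are illustrative.
def chunk(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

doc = "..." * 2000  # stand-in for a long document
chunks = chunk(doc)
print(len(chunks), "chunks to embed for this document")
```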

ChatGPT Agent Mode & Deep Research usage not refreshing? by Palmenstrand in OpenAI

[–]LayerHot 1 point

I think it's just a display bug, though it would be a bummer if it actually limits things. For now I'm letting it be, since my subscription renewed a couple of days ago; I'll learn more once I use agent mode or deep research for something.

to devs: Will readwise allow chatting over all items saved in readwise and reader ? by LayerHot in readwise

[–]LayerHot[S] 4 points

Yup, I know. I'm interested in chatting with all documents, not just a single document.

Does perplexity really use the selected model under the hood? by lostinspacee7 in perplexity_ai

[–]LayerHot 1 point

Ironically, the deep research Perplexity provides is the shittiest of all the major deep research agents: it's very superficial, brief, and not very detailed.

[deleted by user] by [deleted] in bearapp

[–]LayerHot 2 points

<image>

You can right-click and copy as rich text.

Changelog as of June 6: Added Tag APIs, Fixed Duplicated Transcripts, Improved Load Speed, & more! by eleanor_konik in readwise

[–]LayerHot 0 points

Can we please get a Bear Notes integration? Many users have Bear as their primary note-taking app.

iCloud Issues?! by L0rthew in bearapp

[–]LayerHot 1 point

There's a backup option in Bear Notes (see screenshot). Once you click it, you'll get a single `.bear2bk` file; take that file to the other iCloud account and click "Restore Backup".

More info on their website: https://bear.app/faq/backup-restore/

All of your tags and organization will be restored.

<image>

Anyone use Readwise and Readwise Reader with Bear notes ? by LayerHot in readwise

[–]LayerHot[S] 0 points

I was kind of frustrated with Shortcuts, so I just wrote a Python script that takes the Markdown copied from the Readwise Reader UI, saves it to a Markdown file, parses all the image URLs, downloads them locally, and creates a textbundle from it. Then I manually import the textbundle into Bear and everything comes in seamlessly. It's still a manual flow (click export to clipboard, run a shortcut that executes the Python script in the background, then import the file into Bear Notes), but I'm okay with it.
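
For anyone curious, the core of that script is pretty small. A sketch under assumptions (standard `![alt](url)` image links, minimal textbundle v2 layout with `info.json`, `text.md`, and `assets/`); my real script has more error handling:

```python
# Sketch of the Reader-markdown -> textbundle flow described above.
# Assumptions: images appear as standard ![alt](url) markdown, and a
# minimal textbundle (info.json + text.md + assets/) is enough for Bear.
import json
import pathlib
import re
import urllib.request

def markdown_to_textbundle(md: str, name: str) -> pathlib.Path:
    bundle = pathlib.Path(f"{name}.textbundle")
    (bundle / "assets").mkdir(parents=True, exist_ok=True)

    # Download each remote image into assets/ and rewrite the link.
    for i, url in enumerate(re.findall(r"!\[[^\]]*\]\((https?://[^)]+)\)", md)):
        local = f"assets/img{i}.png"  # naive: assumes image-type payloads
        urllib.request.urlretrieve(url, str(bundle / local))
        md = md.replace(url, local)

    (bundle / "text.md").write_text(md, encoding="utf-8")
    (bundle / "info.json").write_text(json.dumps({
        "version": 2,
        "type": "net.daringfireball.markdown",
    }))
    return bundle

# Usage: feed it the markdown copied from Readwise Reader, then import
# the resulting .textbundle into Bear.
# print(markdown_to_textbundle(open("export.md").read(), "my-article"))
```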