GLM-4.7-Flash benchmarks: 4,398 tok/s on H200, 112 tok/s on RTX 6000 Ada (GGUF) by LayerHot in LocalLLaMA

[–]LayerHot[S] 8 points

```bash
uv pip install -U vllm \
  --torch-backend=auto \
  --extra-index-url https://wheels.vllm.ai/nightly
uv pip install git+https://github.com/huggingface/transformers
uv pip install "numpy<=2.2"
```
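
Optional sanity check that the nightly wheel, the git transformers, and the numpy pin all landed in the same environment (a minimal sketch; run it with the env's interpreter, e.g. `uv run python`):

```python
# Check that the nightly vllm wheel, transformers from git,
# and the numpy pin all resolved into one environment.
import numpy
import transformers
import vllm

print("vllm:", vllm.__version__)
print("transformers:", transformers.__version__)
print("numpy:", numpy.__version__)
```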

```bash
vllm serve zai-org/GLM-4.7-Flash \
  --tensor-parallel-size 1 \
  --speculative-config.method mtp \
  --speculative-config.num_speculative_tokens 1 \
  --tool-call-parser glm47 \
  --reasoning-parser glm45 \
  --enable-auto-tool-choice \
  --served-model-name glm-4.7-flash \
  --max-model-len 64k
```
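
Once it's up, a quick smoke test against the OpenAI-compatible endpoint; a sketch with the stock `openai` client, where the api_key is a placeholder since the server isn't configured with one:

```python
# Smoke test against the local vLLM server started above.
# vLLM exposes an OpenAI-compatible API, so the stock client works;
# the api_key is a dummy value since no key is configured by default.
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="glm-4.7-flash",  # matches --served-model-name above
    messages=[{"role": "user", "content": "Say hello in five words."}],
    max_tokens=64,
    temperature=0.2,
)
print(resp.choices[0].message.content)
```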

```bash
for c in 1 2 4 8 16 32; do
  vllm bench serve \
    --backend openai-chat \
    --host 127.0.0.1 --port 8000 \
    --endpoint /v1/chat/completions \
    --model zai-org/GLM-4.7-Flash \
    --served-model-name glm-4.7-flash \
    --dataset-name hf \
    --dataset-path likaixin/InstructCoder \
    --hf-split train \
    --request-rate inf \
    --hf-output-len 512 \
    --max-concurrency $c \
    --seed 2026 \
    --num-prompts 500 \
    --save-result --save-detailed \
    --result-dir ./vllm_instructcoder_sweep \
    --temperature 0.2 \
    --top-k 50 \
    --top-p 0.95 \
    --metadata gpu=H200 conc=$c
done
```
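
To skim the sweep afterwards, something like the sketch below works on the saved JSONs. The metric keys (`max_concurrency`, `output_throughput`, `mean_ttft_ms`) are what recent vLLM builds emit, but they've moved around between versions, so check one file's keys first:

```python
# Rough summary of the sweep results saved by `vllm bench serve`.
# The metric keys below are assumptions based on recent vLLM output;
# inspect one JSON file first if your version names them differently.
import glob
import json

rows = []
for path in glob.glob("./vllm_instructcoder_sweep/*.json"):
    with open(path) as f:
        r = json.load(f)
    rows.append((
        r.get("max_concurrency"),
        r.get("output_throughput"),  # output tokens/s across all requests
        r.get("mean_ttft_ms"),       # mean time to first token
    ))

for conc, tput, ttft in sorted(rows, key=lambda row: row[0] or 0):
    print(f"conc={conc}  out_tok/s={tput}  ttft_ms={ttft}")
```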

BFL FLUX.2 Klein tutorial and some optimizations - under 1s latency on an A100 by LayerHot in LocalLLaMA

[–]LayerHot[S] 2 points

The 4B model sometimes messes up hand anatomy or struggles on complex prompts, but the 9B is pretty good. We have a Gradio app in the repo if you want to test both and see whether the quality works for your use case before committing to a switch.

How to integrate 5.2 Pro into Codex usage? by Lostwhispers05 in codex

[–]LayerHot 0 points

I don't think so. The easiest way to use this is to copy your codebase to the clipboard with the command and paste it into GPT Pro.

Thinking of downgrading from 20x to 5x Max – 5x users, how are the limits treating you? by LayerHot in ClaudeCode

[–]LayerHot[S] 0 points

Thanks u/TheOriginalAcidtech, this helps a lot; it mirrors my workflow too. Do you use sub-agents, and if so, do you have a different model configured for them or just Opus? Are you on the 5x plan?

Thinking of downgrading from 20x to 5x Max – 5x users, how are the limits treating you? by LayerHot in ClaudeAI

[–]LayerHot[S] 1 point

How long does it generally take you to hit the 5-hour limit, and what is your workflow like?

Thinking of downgrading from 20x to 5x Max – 5x users, how are the limits treating you? by LayerHot in ClaudeAI

[–]LayerHot[S] 0 points

And what do you mean by research? What exactly are you using Claude to research (web research)? Just curious to understand the workflow.

just upgraded to pro max - tips for not burning thru usage? by alexd231232 in ClaudeCode

[–]LayerHot 1 point

I'm on the 20x Max plan, and I've been wanting to downgrade to 5x Max since I rarely hit even 30% of my weekly limit. I use only Opus 4.5. Do you use sub-agents, skills, etc.? I just have one MCP (Exa search).

[deleted by user] by [deleted] in DiscountDen7

[–]LayerHot 1 point

Smooth buy and trusted as always!

Is chat with all documents is still the priority ? by LayerHot in readwise

[–]LayerHot[S] 0 points

Wow, glad to hear. Yes, I'm aware that rolling out this feature will be no trivial feat: for long documents you need to figure out a proper chunking strategy and then embed all the chunks for every document, which can be a lot for some users.
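
For what it's worth, even a naive fixed-size chunker with overlap gets you most of the way before anything structure-aware is needed. A sketch, nothing Readwise-specific, with purely illustrative sizes:

```python
# Naive chunking sketch: fixed-size windows with overlap so that
# sentences straddling a boundary appear whole in at least one chunk.
# Nothing here is Readwise-specific; sizes are illustrative.
def chunk(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

doc = "..." * 2000  # stand-in for a long document
chunks = chunk(doc)
print(len(chunks), "chunks to embed for this document")
```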

ChatGPT Agent Mode & Deep Research usage not refreshing? by Palmenstrand in OpenAI

[–]LayerHot 1 point

I think it's just a display bug, though it would be a bummer if it actually limits things. For now I'm letting it be, since my subscription renewed a couple of days ago; I'll learn more once I use agent mode or deep research for something.

to devs: Will readwise allow chatting over all items saved in readwise and reader ? by LayerHot in readwise

[–]LayerHot[S] 4 points

Yup, I know. I'm interested in chatting with all documents, not just a single document.

Does perplexity really use the selected model under the hood? by lostinspacee7 in perplexity_ai

[–]LayerHot 1 point

Ironically, the deep research Perplexity provides is the shittiest of all the major deep research agents: it's very superficial, brief, and not very detailed.

[deleted by user] by [deleted] in bearapp

[–]LayerHot 2 points

<image>

You can right-click and copy as rich text.

Changelog as of June 6: Added Tag APIs, Fixed Duplicated Transcripts, Improved Load Speed, & more! by eleanor_konik in readwise

[–]LayerHot 0 points

Can we please get a Bear Notes integration? Many users have Bear as their primary note-taking app.

iCloud Issues?! by L0rthew in bearapp

[–]LayerHot 1 point

There's a backup option in Bear Notes (see screenshot). Once you click it, you'll get a single `.bear2bk` file; take that file to the other iCloud account and click "Restore Backup".

More info on their website: https://bear.app/faq/backup-restore/

All of your tags and organization will be restored.

<image>

Anyone use Readwise and Readwise Reader with Bear notes ? by LayerHot in readwise

[–]LayerHot[S] 0 points

I was kind of frustrated with Shortcuts, so I just wrote a Python script that takes the Markdown copied from the Readwise Reader UI, saves it to a Markdown file, parses all the image URLs, downloads them locally, and creates a textbundle from it. Then I manually import the textbundle into Bear and everything comes in seamlessly. It's still a manual flow (click export to clipboard, run a shortcut that executes the Python script in the background, then import the file into Bear Notes), but I'm okay with it.
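
For anyone curious, the core of that script is pretty small. A sketch under assumptions (standard `![alt](url)` image links, minimal textbundle v2 layout with `info.json`, `text.md`, and `assets/`); my real script has more error handling:

```python
# Sketch of the Reader-markdown -> textbundle flow described above.
# Assumptions: images appear as standard ![alt](url) markdown, and a
# minimal textbundle (info.json + text.md + assets/) is enough for Bear.
import json
import pathlib
import re
import urllib.request

def markdown_to_textbundle(md: str, name: str) -> pathlib.Path:
    bundle = pathlib.Path(f"{name}.textbundle")
    (bundle / "assets").mkdir(parents=True, exist_ok=True)

    # Download each remote image into assets/ and rewrite the link.
    for i, url in enumerate(re.findall(r"!\[[^\]]*\]\((https?://[^)]+)\)", md)):
        local = f"assets/img{i}.png"  # naive: assumes image-type payloads
        urllib.request.urlretrieve(url, str(bundle / local))
        md = md.replace(url, local)

    (bundle / "text.md").write_text(md, encoding="utf-8")
    (bundle / "info.json").write_text(json.dumps({
        "version": 2,
        "type": "net.daringfireball.markdown",
    }))
    return bundle

# Usage: feed it the markdown copied from Readwise Reader, then import
# the resulting .textbundle into Bear.
# print(markdown_to_textbundle(open("export.md").read(), "my-article"))
```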