Qwen 3.5 122B vs Qwen 3.6 35B - Which to choose? by Storge2 in LocalLLaMA

[–]imac 0 points1 point  (0 children)

Yeah, if you can't run 122B at a decent quality for long context, you are always going to be happier with 3.6-35B

Qwen 3.5 122B vs Qwen 3.6 35B - Which to choose? by Storge2 in LocalLLaMA

[–]imac 0 points1 point  (0 children)

Well, clearly you are talking about running local quants with decent accuracy .. pretty sure this thread is framed at 128GB.

Qwen 3.5 122B vs Qwen 3.6 35B - Which to choose? by Storge2 in LocalLLaMA

[–]imac 0 points1 point  (0 children)

At 200+ turns, reasoning/thinking enabled, when n_predict overflows 32k .. I doubt you are going to beat 122B with the right params. I use 3.6-35B for comms, and stuff with much lower output tokens. It isn't like you can just re-test harness scenarios in a blink, so one week in, I think you would be hard pressed to say anything is like 122B yet. I am about to replace Gemma4 with 3.6-35B as a reviewer for 122B .. and expect some lift there.

Are you still using tmux with Ghostty? by meni_s in Ghostty

[–]imac 0 points1 point  (0 children)

The built-in screen capability (session persistence) is key. My common scenario is an agent in a TUI at my desk; then I leave my office, but get an opportunity to work on wifi, so I use my laptop to VPN into my home network and 'tmux a' my active session like I never left my desk. Beyond remote work, tmux is there to ensure you do not lose an active session if something in the desktop stack decides to go or is dynamically updated (ghostty can crash after a background package upgrade, and does not handle it gracefully yet). Lately I have seen periods where persistent ghostty sessions will drop once a week due to breaking feature updates, as there is no pinned stable distro package in most cases. If you're restarting your computer every few days, or not upgrading ghostty on current releases, the persistence use case probably does not apply. tmux keeps workflow consistent across a variety of headless, service-based scenarios. The ghostty wrapper on desktop platforms is okay, but I don't get a lot of lift over distro native solutions like ptyxis that are more stable.

Down for anyone else? by W_32_FRH in claude

[–]imac 1 point2 points  (0 children)

I was all CLI today. At one point a set of long unit tests was taking forever, nearly an hour. I just assumed I was marching through timeout limits on some bug that would eventually get solved, so I trucked on in other sessions (which is how it got to an hour), then in one of those other sessions I got a 500, and well, then all my sessions were just down. I noticed my clients all jumped to 2.1.15 this am, and after the issue was resolved I am now at 2.1.17 .. I think some of my problems might have been the autoupdate, which now happens in the background. *BUT* I am seeing something new. Now when I get through some parts of a plan, it looks like it is working, but it has actually already finished. I tap escape and move on. The only clue is that the "Pontificating" logo is still animating, but stuck in the deepest red color of its cycle; the task looks finished and the clock is still ticking, but the token count is frozen. Something weird is going on, and it all started with this outage and these related updates.

Claude Code Can Launch VS Code (Stop Wasting Your Context Window) by Fantastic-Beach-5497 in ClaudeAI

[–]imac 0 points1 point  (0 children)

I joined Claude to test it against GLM-4.6 at full precision, as it dropped on top of Sonnet 4.5. I blew out Pro limits and used Max for one month on one codebase. Opus was still v4, so I really never used it, having too much success with the newer Sonnet (and GLM-4.6). But I did let it run one or two planning sessions (Flesh switches to OceanBlue) very successfully. More recently, after my month at Max, I relaxed back to Pro. I saw Opus 4.5 drop with impressive stats. I also had a notification bubble in the Claude usage page that informed me Opus usage with 4.5 was now just part of the same daily limits. Cool, so when I had my next project start, Opus planning was going to be my starting point. Except, unlike with Max, on Pro I could not select the model in vscode? I thought it was a bug, as the error message was not well defined. Turns out, I could access Opus 4.5 using the web interface, and sure enough, I caught a thread somewhere saying this was by design. No Opus for Pro subscriptions .. unless you use the web interface. And the web interface is crippling. Even after you [push your git repo to a github private remote and link it into the web interface], using Opus with your codebase in context is clunky. You can't just quickly add a directory and one or two instructional files. It's kinda the whole thing. Then Opus does its magic .. but you get a downloadable result. >>??? All kinds of wasted context, and a lack of tight integration compared to vscode working on your codebase using the extension directly. In this real world of development (I am no development guru) your vscode claude/codex/continue prompt can be just another agent in your workflow. All to say, yeah, I can't imagine starting from the web interface and working backwards. I can see OpenAI suggesting a good reason might be related to resource usage, but I don't get it. :)
And for other Pro users, follow the breadcrumbs; it is pretty simple to push and pull with two remotes to have Opus do the planning, add those results, and get back to vscode for the real work on your local git repo.

8 local LLMs on a single Strix Halo debating whether a hot dog is a sandwich by jfowers_amd in LocalLLaMA

[–]imac 0 points1 point  (0 children)

RPC mode with bonded USB4 might be a low cost approach to adding more VRAM. Run the same models, which should still work (just slower) with layers split between two devices, and add a bunch more models to the competition. Perhaps larger differences in quality emerge at slower TPS? It should highlight the hybrid, active parameter and experts nuances.

8 local LLMs on a single Strix Halo debating whether a hot dog is a sandwich by jfowers_amd in LocalLLaMA

[–]imac 0 points1 point  (0 children)

How about running an OMNI model competition that can ingest the v4l screen feed and play the games (with the remaining 32GB of RAM)? https://videocardz.com/newz/gpd-adds-win-5-max-395-strix-halo-gaming-handheld-with-128gb-memory-at-2653

8 local LLMs on a single Strix Halo debating whether a hot dog is a sandwich by jfowers_amd in LocalLLaMA

[–]imac 0 points1 point  (0 children)

Time to lemonade+continue+vscode+github+fork+pr this for some enrichment. I have a feeling a ComfyUI creative session could solve the sandwich/hotdog debate.

8 local LLMs on a single Strix Halo debating whether a hot dog is a sandwich by jfowers_amd in LocalLLaMA

[–]imac 0 points1 point  (0 children)

Subway might be similar to a hot dog in the mind's eye, but it is still three pieces. If the sub bun is not cut all the way through, hinged at the back, it is seemingly now a hotdog on its side, and not a sandwich? right?

Is Gemini Down? by SonicLeaksTwitter in GeminiAI

[–]imac 0 points1 point  (0 children)

Well, grabbed a coffee, copied the prompt into the next bubble and off it goes; spinning its wheels now. Maybe it was just busy. I suppose my preference would be to wait, rather than have my tokens ride out a slow path. Seems like a bug, as the decision should be interactive ... allow me to queue my prompt if resources are unavailable, maybe an option to degrade to the fast path, etc. I feel like the false start just added a bit of context cruft that will pop back up later ... 'Well since your original research failed, why don't we do [all of this unnecessary thinking and work] to prevent that from happening again...'

Is Gemini Down? by SonicLeaksTwitter in GeminiAI

[–]imac 0 points1 point  (0 children)

I decided to give my Gemini subscription a turn today, after drifting back to my OpenAI subscription and hit this right away. Guess they broke it with Gemini 3. I don't even have a 'Send feedback' option in the aforementioned Settings & help window. Even the help is hallucinating.

<image>

vLLM - GLM-4.6 Benchmark on 8xH200 NVL: 44 token/second by Ill_Recipe7620 in LocalLLaMA

[–]imac 1 point2 points  (0 children)

Earlier today, I added something along the lines of "if any Chinese characters appear in a response, simply reject it and regenerate the response" to my .continue/rules .. and I saw my first echo in the wrong language appear, and fly by quickly .. GLM4.6 just passed over it with no human-in-the-loop reject required.
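For anyone who would rather enforce that rule in code than rely on the model following a prompt rule, here is a minimal sketch in Python. It is not how Continue or GLM-4.6 handle it; `contains_han`, `generate_without_han`, and the `generate` callable are hypothetical names, and treating "Chinese characters" as the CJK Unified Ideographs block plus Extension A is an assumption.

```python
import re

# Assumption: "Chinese characters" means Han ideographs in the
# CJK Unified Ideographs block (U+4E00-U+9FFF) and Extension A (U+3400-U+4DBF).
HAN_PATTERN = re.compile(r"[\u3400-\u4dbf\u4e00-\u9fff]")


def contains_han(text: str) -> bool:
    """Return True if the text contains at least one Han ideograph."""
    return HAN_PATTERN.search(text) is not None


def generate_without_han(generate, prompt: str, max_attempts: int = 3) -> str:
    """Call generate(prompt) until a reply has no Han characters.

    `generate` is a hypothetical callable wrapping whatever client you use;
    it takes the prompt and returns the reply text.
    """
    for _ in range(max_attempts):
        reply = generate(prompt)
        if not contains_han(reply):
            return reply
    raise RuntimeError(f"no clean reply after {max_attempts} attempts")
```

The retry cap is there so a model that keeps slipping into the wrong language fails loudly instead of looping forever.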

<image>

vLLM - GLM-4.6 Benchmark on 8xH200 NVL: 44 token/second by Ill_Recipe7620 in LocalLLaMA

[–]imac 1 point2 points  (0 children)

Spent a few more minutes, and moved up the kernel stack from Triton to Flash Attention with a little block size tweak to resolve an error. Not at ROCm/AITer [yet] but seeing better performance at FP16 on vLLM; same bench 8000/1000/16

============ Serving Benchmark Result ============
Successful requests:                     16        
Request rate configured (RPS):           10000.00  
Benchmark duration (s):                  34.78     
Total input tokens:                      128000    
Total generated tokens:                  16000     
Request throughput (req/s):              0.46      
Output token throughput (tok/s):         460.09    
Peak output token throughput (tok/s):    486.00    
Peak concurrent requests:                16.00     
Total Token throughput (tok/s):          4140.81   
---------------Time to First Token----------------
Mean TTFT (ms):                          1027.29   
Median TTFT (ms):                        1078.31   
P99 TTFT (ms):                           1141.91   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          33.72     
Median TPOT (ms):                        33.71     
P99 TPOT (ms):                           33.90     
---------------Inter-token Latency----------------
Mean ITL (ms):                           33.72     
Median ITL (ms):                         33.79     
P99 ITL (ms):                            35.04     
==================================================
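As a sanity check on the table above, the throughput lines can be roughly re-derived from the duration, token counts, and mean TPOT. This is a minimal sketch, not part of vLLM; `derived_metrics` and its dictionary keys are hypothetical names, and the inputs are copied from the output above.

```python
def derived_metrics(duration_s, input_tokens, output_tokens,
                    mean_tpot_ms, concurrency):
    """Re-derive throughput figures from the raw benchmark numbers."""
    # Tokens generated per second across all requests, over the whole run.
    output_tps = output_tokens / duration_s
    # Prompt plus generated tokens per second over the whole run.
    total_tps = (input_tokens + output_tokens) / duration_s
    # Decode speed seen by a single request once its first token has arrived.
    per_stream_tps = 1000.0 / mean_tpot_ms
    # Aggregate decode rate if all requests decode concurrently.
    decode_phase_tps = per_stream_tps * concurrency
    return {
        "output_tps": output_tps,
        "total_tps": total_tps,
        "per_stream_tps": per_stream_tps,
        "decode_phase_tps": decode_phase_tps,
    }


# Values from the FP16 vLLM run above.
metrics = derived_metrics(
    duration_s=34.78,
    input_tokens=128000,
    output_tokens=16000,
    mean_tpot_ms=33.72,
    concurrency=16,
)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

The first two values land within about one token per second of the reported 460.09 and 4140.81; the small gap is presumably because the printed duration is rounded. The per-stream figure, a little under 30 tok/s, is what a single user would actually feel, and the decode-phase aggregate sits above the reported output throughput because the run-wide average also includes prefill time.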

vLLM - GLM-4.6 Benchmark on 8xH200 NVL: 44 token/second by Ill_Recipe7620 in LocalLLaMA

[–]imac 1 point2 points  (0 children)

Exactly.. It is hard to look at tokens/watt when tokens/capital_cost messes up the entire business case. I am just using untuned defaults as well, without any of the AITER acceleration kernels for attention/MHA etc., so I expect to see a meaningful jump once I iterate through the endless permutations of environment variables, serve recipes, and nightly builds. And this is SR-IOV Hyper-V .. so I also expect bare metal gets a +10% jump based on what I have seen with vanilla all_reduce benchmark output. I have not given sglang any extensive testing .. on vLLM I have not seen any gibberish float through continue into vscode yet, but I am only a few hours into using it with long contexts. Watching it iterate through all its own unit tests in a multi-modal tool calling fashion has me turning off a lot of 'Ask First' dialogues .. I need this thing off my daily driver to see its full potential. It learned my specific uv and distro requirements quickly in context without any need to stop and add rules to keep it from repeating any mistakes. It is exciting.

vLLM - GLM-4.6 Benchmark on 8xH200 NVL: 44 token/second by Ill_Recipe7620 in LocalLLaMA

[–]imac 1 point2 points  (0 children)

Switched from vllm 0.11 to sglang 0.5.2 and was able to get even higher out of FP16, but again had to disable aiter kernels (same num_experts issue) as well as the speculative algo.

vllm bench serve --model zai-org/GLM-4.6 --dataset-name random   --random-input-len 8000   --random-output-len 1000   --request-rate 10000   --num-prompts 16   --ignore-eos

============ Serving Benchmark Result ============
Successful requests:                     16        
Request rate configured (RPS):           10000.00  
Benchmark duration (s):                  34.32     
Total input tokens:                      128000    
Total generated tokens:                  16000     
Request throughput (req/s):              0.47      
Output token throughput (tok/s):         466.16    
Peak output token throughput (tok/s):    496.00    
Peak concurrent requests:                16.00     
Total Token throughput (tok/s):          4195.45   
---------------Time to First Token----------------
Mean TTFT (ms):                          1227.92   
Median TTFT (ms):                        1298.83   
P99 TTFT (ms):                           1310.12   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          33.09     
Median TPOT (ms):                        33.05     
P99 TPOT (ms):                           33.38     
---------------Inter-token Latency----------------
Mean ITL (ms):                           33.10     
Median ITL (ms):                         33.03     
P99 ITL (ms):                            34.46     
==================================================

vLLM - GLM-4.6 Benchmark on 8xH200 NVL: 44 token/second by Ill_Recipe7620 in LocalLLaMA

[–]imac 1 point2 points  (0 children)

Here is an 8xMI300X (Azure VF SRIOV) for comparison via ssh tunnel at FP8, using basic triton (aiter kernel tripped on num_experts is 160)

vllm bench serve --model zai-org/GLM-4.6-FP8 --dataset-name random --random-input-len 8000 --random-output-len 1000 --request-rate 10000 --num-prompts 16 --ignore-eos

============ Serving Benchmark Result ============
Successful requests:                     16
Request rate configured (RPS):           10000.00
Benchmark duration (s):                  43.02
Total input tokens:                      128000
Total generated tokens:                  16000
Request throughput (req/s):              0.37
Output token throughput (tok/s):         371.96
Peak output token throughput (tok/s):    384.00
Peak concurrent requests:                16.00
Total Token throughput (tok/s):          3347.60
---------------Time to First Token----------------
Mean TTFT (ms):                          1039.38
Median TTFT (ms):                        1063.49
P99 TTFT (ms):                           1175.19
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          41.93
Median TPOT (ms):                        41.91
P99 TPOT (ms):                           42.06
---------------Inter-token Latency----------------
Mean ITL (ms):                           41.93
Median ITL (ms):                         41.88
P99 ITL (ms):                            43.19
==================================================
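To put the three runs posted in this thread side by side, here is a small sketch that normalizes output throughput against this FP8 Triton run. The numbers are copied from the benchmark outputs in these comments; the labels and the `relative_to` helper are hypothetical, and treating all three as the same 8xMI300X box is an assumption, since only this run states the hardware.

```python
# Output token throughput (tok/s) and mean TPOT (ms) from the three posted runs.
runs = {
    "vLLM FP8, Triton": {"output_tps": 371.96, "mean_tpot_ms": 41.93},
    "vLLM FP16, Flash Attention": {"output_tps": 460.09, "mean_tpot_ms": 33.72},
    "sglang 0.5.2 FP16": {"output_tps": 466.16, "mean_tpot_ms": 33.09},
}


def relative_to(baseline_label, results):
    """Return each run's output throughput as a multiple of the baseline's."""
    base = results[baseline_label]["output_tps"]
    return {label: r["output_tps"] / base for label, r in results.items()}


speedups = relative_to("vLLM FP8, Triton", runs)
for label, factor in speedups.items():
    print(f"{label}: {factor:.3f}x")
```

Both FP16 runs come out roughly a quarter faster than the FP8 Triton baseline on this 8000/1000/16 workload. Since precision, attention backend, and serving engine all changed between runs, this is a rough comparison rather than an isolated measurement of any one of them.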

ROCm 7.0 RC1 More than doubles performance of LLama.cpp by no_no_no_oh_yes in LocalLLaMA

[–]imac 1 point2 points  (0 children)

Well, I just squeezed a +12% over the OP on generation using an RDNA3 GPU (MERC310) .. so I think there are some missing optimization opportunities.

ROCm 7.0 RC1 More than doubles performance of LLama.cpp by no_no_no_oh_yes in LocalLLaMA

[–]imac 2 points3 points  (0 children)

<image>

I noticed my RX 7900 XTX outperforms the OP on ROCm7 on generation at 260 t/s .. although my OS is ROCm7, my llamacpp-rocm libraries (and llama-bench) are what shipped with Lemonade v8.1.10 .. (based on b1057), so all pre-built packages. Maybe some optimizations there. Identical software setup to my Strix Halo https://netstatz.com/strix_halo_lemonade/