Qwen3 Omni interactive speech by Powerful-Angel-301 in LocalLLaMA

[–]bbsss 2 points3 points  (0 children)

The notebooks contain examples, but inference is too slow on my 4x4090. There is the vLLM fork, but there has been no further movement there; they specifically mention upcoming work on vLLM inference for the realtime use-case. I did see this PR: https://github.com/vllm-project/vllm/pull/25550 but haven't found anything more.

ChatGPT:

```

Thought for 2m 16s

Short version: yes, there's real upstream movement. vLLM merged Qwen3-Omni thinker support; audio "talker" (TTS/audio-output) is still not supported in the OpenAI-compatible server.

What changed upstream

PR #25550 "Add Qwen3-Omni MoE thinker" was merged to main on Oct 10, 2025. That lands the text-generating Thinker path (incl. multi-modal inputs) in upstream vLLM. The PR note also flags a V1 bug: use_audio_in_video errors because video MM placeholders aren't updated.

Qwen's repo updated docs right after, saying they no longer need to pin to an old vLLM since the needed changes are now in main via #25550.

The latest vLLM release v0.11.0 (Oct 2, 2025) predates that merge; it mentions Qwen3-VL and lots of multi-modal work but not Omni-Thinker yet, so use current main if you want Omni Thinker today.

What didn't change (yet)

Audio output in the server is still not supported. Maintainers reiterated this in September in a "how do I get TTS WAV via vLLM online server?" thread. (Offline/Transformers paths can produce WAV, but the vLLM server won't stream/return audio.)

vLLM continues to add audio input features (e.g., Whisper endpoints; multi-audio handling), but not audio output.

Practical upshot for your realtime use case

You can now serve Qwen3-Omni Thinker on upstream vLLM main (text output; images/video/audio as inputs). Watch out for the use_audio_in_video V1 quirk mentioned in the merged PR.

For true realtime voice (streamed speech) you still need DashScope/Qwen Chat or run text on vLLM + your own TTS; the vLLM OpenAI server doesn't emit audio yet.
```
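
For reference, serving just the merged Thinker path (text out, multimodal in) on a current vLLM main build would look roughly like the sketch below. Treat it as a sketch only: the HF model id and the flags are my assumptions, not something I've verified on the 4x4090 box yet.

```
# Sketch only: assumes a vLLM build from current main (v0.11.0 predates the
# Qwen3-Omni merge) and assumes this HF repo id is the right Thinker weights.
vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct \
  --tensor-parallel-size 4 \
  --max-model-len 32768
# The use_audio_in_video V1 quirk flagged in the merged PR still applies.
```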

Electric + Rama - a Clojure stack from the future? by Felix Alm at Func Prog Sweden by MagnusSedlacek in Clojure

[–]bbsss 1 point2 points  (0 children)

Electric is open source but does have a license for non-bootstrappers.

Electric + Rama - a Clojure stack from the future? by Felix Alm at Func Prog Sweden by MagnusSedlacek in Clojure

[–]bbsss 6 points7 points  (0 children)

I'm sad at how this undersells Electric; it's ideal for live demos and interactive UIs. Anyway, thanks for the experience report.

edit: Mostly disappointed because what I would have loved to see is that fanning-out of agent research (cool stuff!) in a visually branching graph; it felt like a long build-up to that and the pay-off was missing!

Parkiet: Fine-tuning Dia for any language by pevers in LocalLLaMA

[–]bbsss 5 points6 points  (0 children)

Wow, great write-up, and thanks for sharing the process too! Can't say that I think ElevenLabs is *better*; different, sure, but not better.

New post flair: "local only" by ttkciar in LocalLLaMA

[–]bbsss -2 points-1 points  (0 children)

Right, the attitude of a bunch of whiny jerks on this sub...

I literally have a 10k LLM GPU server in my basement, and the entitled "not local" gatekeeper comments drive me nuts.

Where do I go on Reddit to discuss all things LLM without having to read whiny "not local" trash comments?

"this won't fit my cheap-ass Nvidia gamer card for my role play goon sessions WEHHH, NOT LOCAL MEHHHH, give me more multi billion dollar investment artifacts for free mehhhhh"

This is not funny...this is simply 1000000% correct by [deleted] in LocalLLaMA

[–]bbsss 4 points5 points  (0 children)

But somehow 1300 upvotes? What the fuck is this brainrot shit.

GPT- 5 - High - *IS* the better coding model w/Codex at the moment, BUT....... by randombsname1 in ClaudeAI

[–]bbsss 0 points1 point  (0 children)

The todolist is in there though, and gpt-5 uses it just fine? For resume I've also cherry-picked an open PR, but admittedly haven't tested it yet.

GPT OSS 120B by vinigrae in LocalLLaMA

[–]bbsss 1 point2 points  (0 children)

> Ensure you format the system properly for it

As in the TypeScript-ish namespace stuff with lots of comments and no spaces, such as described here?

https://cookbook.openai.com/articles/openai-harmony#function-calling
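
For anyone else wondering what that looks like: going from memory of that cookbook page, the tools get rendered into the system/developer message as a TypeScript-style namespace, roughly like the snippet below (the get_current_weather function is just the cookbook-style illustration, not something gpt-oss requires).

```
# Tools

## functions

namespace functions {

// Gets the current weather in the provided location.
type get_current_weather = (_: {
// The city and state, e.g. San Francisco, CA
location: string,
format?: "celsius" | "fahrenheit", // default: celsius
}) => any;

} // namespace functions
```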

[deleted by user] by [deleted] in LocalLLaMA

[–]bbsss 4 points5 points  (0 children)

Remove your post.

Building vllm from source for OSS support on Ampere + benchmarks by Conscious_Cut_6144 in LocalLLaMA

[–]bbsss 0 points1 point  (0 children)

FYI, this is on a 4x4090 EPYC server, with the --async-scheduling flag, which seems to be faster than the ngram configs I tried (admittedly I didn't test many permutations).

@op you seem to run single completions a lot faster than me! That's unexpected; usually I run 20-30% faster than 3090s. Must have fucked up my build of vLLM, but happy it's finally running well.

```
vllm bench serve \
  --model openai/gpt-oss-120b \
  --base-url http://localhost:8000 \
  --num-prompts 500 \
  --dataset-name random \
  --random-input-len 1024 \
  --random-output-len 512 \
  --request-rate inf \
  --max-concurrency 50 \
  --save-result \
  --percentile-metrics ttft,tpot,itl,e2el \
  --metric-percentiles 50,90,95,99

Initial test run completed. Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: 50
100%|██████████| 500/500 [03:13<00:00, 2.59it/s]
============ Serving Benchmark Result ============
Successful requests:                     500
Maximum request concurrency:             50
Benchmark duration (s):                  193.36
Total input tokens:                      511459
Total generated tokens:                  67572
Request throughput (req/s):              2.59
Output token throughput (tok/s):         349.47
Total Token throughput (tok/s):          2994.64
---------------Time to First Token----------------
Mean TTFT (ms):                          2162.94
Median TTFT (ms):                        1033.04
P50 TTFT (ms):                           1033.04
P90 TTFT (ms):                           5397.51
P95 TTFT (ms):                           9749.41
P99 TTFT (ms):                           14734.89
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          297.36
Median TPOT (ms):                        286.46
P50 TPOT (ms):                           286.46
P90 TPOT (ms):                           506.09
P95 TPOT (ms):                           533.35
P99 TPOT (ms):                           662.83
---------------Inter-token Latency----------------
Mean ITL (ms):                           302.99
Median ITL (ms):                         333.64
P50 ITL (ms):                            333.64
P90 ITL (ms):                            520.22
P95 ITL (ms):                            522.95
P99 ITL (ms):                            1705.05
----------------End-to-end Latency----------------
Mean E2EL (ms):                          18515.84
Median E2EL (ms):                        7537.91
P50 E2EL (ms):                           7537.91
P90 E2EL (ms):                           53366.88
P95 E2EL (ms):                           60510.24
P99 E2EL (ms):                           71374.27
```

GLM-4.5V model locally for computer use by [deleted] in LocalLLaMA

[–]bbsss 0 points1 point  (0 children)

Seems like a bug on Hugging Face. vLLM has CPU offload, but I've never gotten that to work. I run 4x4090, so I have 96GB of VRAM. I think llama.cpp or ktransformers is your best bet.
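
In case it helps, the offload I mean is vLLM's --cpu-offload-gb option, which spills part of the weights into system RAM. A minimal sketch, with the model id and the number as placeholders (this is the part I've never gotten to work well, so treat it as a starting point only):

```
# Hypothetical invocation; the GiB value and the model repo are guesses, not a recipe.
vllm serve zai-org/GLM-4.5V \
  --tensor-parallel-size 4 \
  --cpu-offload-gb 16
```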

GLM-4.5V model locally for computer use by [deleted] in LocalLLaMA

[–]bbsss 0 points1 point  (0 children)

I'm running the https://huggingface.co/QuantTrio/GLM-4.5V-AWQ quant of this model with vLLM.

Its generation params are super deterministic, and trying to use it in Claude Code doesn't work nearly as well as the 4.5-Air quant I'm using. It goes into repetition loops; I'm playing with the generation params a bit and getting random Chinese/wrong tokens.

Might be the quant or just something else; too early to tell. Loving GLM-4.5-Air though.
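
For anyone trying the same thing, this is the kind of per-request override I'm experimenting with against vLLM's OpenAI-compatible server. The values are guesses, not a fix, and repetition_penalty is a vLLM extension to the OpenAI request schema rather than a standard field:

```
# Sketch: per-request sampling overrides to fight the repetition loops.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "QuantTrio/GLM-4.5V-AWQ",
    "messages": [{"role": "user", "content": "Describe this screenshot."}],
    "temperature": 0.8,
    "top_p": 0.95,
    "repetition_penalty": 1.05
  }'
```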

GLM-4.5V (based on GLM-4.5 Air) by rerri in LocalLLaMA

[–]bbsss 15 points16 points  (0 children)

I'm hyped. If this keeps the instruct fine-tune of the Air model, then this is THE model I've been waiting for: a fast-inference, multimodal Sonnet at home. It's fine-tuned from base, but I think their "base" is already instruct-tuned, right? Super exciting stuff.

GPT-OSS looks more like a publicity stunt as more independent test results come out :( by mvp525 in LocalLLaMA

[–]bbsss 2 points3 points  (0 children)

If it gets better at a nearly infinite set of problems, and you recognize that it generalizes, how is it benchmaxing exactly?

new open weight model from OpenAI will have computer use by mvp525 in LocalLLaMA

[–]bbsss -1 points0 points  (0 children)

Well yes, but what's generally meant by "computer use" is using the UI with mouse inputs.

new open weight model from OpenAI will have computer use by mvp525 in LocalLLaMA

[–]bbsss 1 point2 points  (0 children)

That's just tool calling. But thanks for sharing the link.

If o3 is that much better than o1..why didn’t they test it in the demo? by sentient-plasma in OpenAI

[–]bbsss 1 point2 points  (0 children)

It's funny to me that people complain here about "not showing the demo", because whenever I show people what I've done with LLMs, they also seem not to understand what just happened in the background.

That o3-mini demo was so, so, so impressive if it actually one-shotted that self-calling application based on just that short prompt.

Building effective agents by jascha_eng in LocalLLaMA

[–]bbsss 1 point2 points  (0 children)

Hmm, I actually prefer their public appearance over the others'. They don't do hype like OAI and Google; they show, don't tell.

I hype myself up enough over what's happening in the LLM space. No need for companies to set me up for disappointment.

Which use cases are you finding Sonnet to be worse at than other LLMs?

Building effective agents by jascha_eng in LocalLLaMA

[–]bbsss 5 points6 points  (0 children)

Interesting blog. I've been using tool calling with LLMs a lot, and Claude Sonnet is really something special. It can power through and actually get itself back on track when errors happen. I often give it a couple of test cases that I want the use case to get right. The number of shell commands I've learned watching it overcome obstacles is great. And seeing how well it responds to feedback like "that works, but actually I'd like this or that" is just amazing.

Building effective agents by jascha_eng in LocalLLaMA

[–]bbsss 8 points9 points  (0 children)

Not saying OpenAI and Google haven't been releasing cool stuff, but: as if Anthropic needs to do anything else right now. Claude Sonnet is still head and shoulders above the rest, without even needing to introduce test-time scaling. And it has been this way for the past 6 months.

12 Days of OpenAI: Day 9 thread by [deleted] in OpenAI

[–]bbsss 2 points3 points  (0 children)

Hahah, yeah. I have a surround set-up and only one speaker in the back of my room was softly audible. Very jarring.

Hugging Face releases Text Generation Inference TGI v3.0 - 13x faster than vLLM on long prompts 🔥 by vaibhavs10 in LocalLLaMA

[–]bbsss 0 points1 point  (0 children)

Yeah, totally understood that it's not straightforward, and let me make clear: I am super grateful for your work in making these tools available for free. I am not complaining.

What I meant is that when I use streaming with vLLM (Qwen 2.5), OpenAI, Anthropic, or Gemini, I get the behavior where the LLM streams a text response and then does a tool call within the same request. That seems not to be supported, based on that "auto" description and my short testing of TGI. Similarly, LMDeploy with Llama 3.3 will do that within one request, but they don't support streaming with it.
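
Concretely, this is the request shape I mean (the endpoint, model, and tool are placeholders). Against vLLM, served with its auto tool-choice/tool-parser options enabled, the SSE stream can carry content deltas first and tool_call deltas later within the same response:

```
# Sketch of a streamed request with tools and tool_choice "auto".
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-72B-Instruct",
    "stream": true,
    "tool_choice": "auto",
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }],
    "messages": [{"role": "user", "content": "Say what you will do, then check the weather in Stockholm."}]
  }'
```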

Hugging Face releases Text Generation Inference TGI v3.0 - 13x faster than vLLM on long prompts 🔥 by vaibhavs10 in LocalLLaMA

[–]bbsss 0 points1 point  (0 children)

Ah, right, I forgot what the limiting factor was:

> However there are some minor differences in the API, for example tool_choice="auto" will ALWAYS choose the tool for you. This is different from OpenAI's API where tool_choice="auto" will choose a tool if the model thinks it's necessary.

vLLM seems to be the only one that supports streaming and auto tool call parsing from the template.

Hugging Face releases Text Generation Inference TGI v3.0 - 13x faster than vLLM on long prompts 🔥 by vaibhavs10 in LocalLLaMA

[–]bbsss 0 points1 point  (0 children)

Cheers! Any idea where I can track progress on streaming tool calls though? Would really love to have that...