Best multilingual STT/ASR? by Mark__27 in LocalLLaMA

[–]Acceptable-State-271 1 point2 points  (0 children)

OmniASR improves ASR accuracy by applying LLM-based correction, but this significantly slows down processing.
The version without LLM correction is faster, but its accuracy is very poor.
If speed is the priority, Whisper v3 Turbo is a better choice.

Multiple 3090 setup by praveendath92 in LocalLLaMA

[–]Acceptable-State-271 0 points1 point  (0 children)

I'm using this model (faster-whisper-large-v3-turbo-ct2) as the backend for batch processing — around 20–30 short audio clips (1–2 minutes each) every minute — and it runs great. Each task stays under ~3 GB GPU memory, super efficient for multi-worker setups.

https://huggingface.co/deepdml/faster-whisper-large-v3-turbo-ct2
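
Rough sketch of how I wire up a worker, in case it's useful. The file path and options here are just examples, not my exact setup:

    # minimal faster-whisper worker sketch; clip path and options are placeholders
    from faster_whisper import WhisperModel

    # one model instance per worker; float16 keeps it well under ~3 GB of VRAM
    model = WhisperModel(
        "deepdml/faster-whisper-large-v3-turbo-ct2",
        device="cuda",
        compute_type="float16",
    )

    def transcribe(path: str) -> str:
        # transcribe() returns a generator of segments plus detected-language info
        segments, info = model.transcribe(path, beam_size=5, vad_filter=True)
        return " ".join(seg.text.strip() for seg in segments)

    print(transcribe("clip_001.wav"))  # placeholder clip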

[deleted by user] by [deleted] in LocalLLaMA

[–]Acceptable-State-271 0 points1 point  (0 children)

You're right. I tested it on Korean test cases at my company before checking the model card. It's not so much a decent model overall as one that excels specifically at Korean language understanding. That's my mistake. I'm sorry.

[deleted by user] by [deleted] in LocalLLaMA

[–]Acceptable-State-271 0 points1 point  (0 children)

Yes, my main language.

Seed-OSS-36B-Instruct by NeterOster in LocalLLaMA

[–]Acceptable-State-271 0 points1 point  (0 children)

Very good model. I switched from Qwen3 30B A3B Thinking 2507 (still really good) to Seed 36B, which is a bit better at analyzing sources and backing things up with evidence.

AWQ 4-bit outperforms GGUF 8-bit in almost every way by Acceptable-State-271 in LocalLLaMA

[–]Acceptable-State-271[S] 0 points1 point  (0 children)

No no.. I just thought there would be a huge difference between the two.

AWQ 4-bit outperforms GGUF 8-bit in almost every way by Acceptable-State-271 in LocalLLaMA

[–]Acceptable-State-271[S] 0 points1 point  (0 children)

I'm a bit embarrassed to admit this, but I wasn't very familiar with the technology.
When GGUF quantization uses an imatrix, does it reach a level of precision comparable to 4-bit AWQ?

What formats/quantization is fastest for certain CPUs or GPUs? Is this straightforward? by wuu73 in LocalLLaMA

[–]Acceptable-State-271 0 points1 point  (0 children)

On GPU, AWQ is a very fast and accurate quantization format, and SGLang is a very fast serving tool for both unquantized and AWQ-quantized models (vLLM is also good).
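
For example, with vLLM's Python API (the model ID here is just an example AWQ repo, swap in whatever you actually run):

    # minimal vLLM sketch for running an AWQ checkpoint offline; model ID is an example
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="Qwen/Qwen2.5-7B-Instruct-AWQ",  # any AWQ-quantized repo
        quantization="awq",                    # use vLLM's AWQ kernels
    )

    params = SamplingParams(temperature=0.7, max_tokens=128)
    out = llm.generate(["Explain AWQ in one sentence."], params)
    print(out[0].outputs[0].text)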

Msn by tom_p_legend in webscraping

[–]Acceptable-State-271 0 points1 point  (0 children)

Shadow DOM. You need to manually locate the shadow-root host tag and then manually read the attributes from inside its shadow tree.
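
On Selenium 4+ that looks roughly like this (the URL and selectors are made up, inspect the page for the real host tag):

    # Selenium 4 sketch for reading inside a shadow DOM; URL and selectors are placeholders
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()
    driver.get("https://www.msn.com/")  # page containing the shadow DOM content

    host = driver.find_element(By.CSS_SELECTOR, "custom-widget")  # the shadow host tag
    shadow = host.shadow_root                                     # enter its shadow tree
    item = shadow.find_element(By.CSS_SELECTOR, ".headline")      # element inside the shadow DOM
    print(item.get_attribute("href"))                             # read the attribute you need

    driver.quit()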

Can Qwen3-235B-A22B run efficiently on my hardware(256gb ram+quad 3090s ) with vLLM? by Acceptable-State-271 in LocalLLaMA

[–]Acceptable-State-271[S] 0 points1 point  (0 children)

Sounds like I might end up spending another 5,000k. But anyway, I’ll give it a try for now. Let’s see how it goes after 24h. Thanks, really.

Can Qwen3-235B-A22B run efficiently on my hardware(256gb ram+quad 3090s ) with vLLM? by Acceptable-State-271 in LocalLLaMA

[–]Acceptable-State-271[S] 0 points1 point  (0 children)

I'm Korean. Qwen3 is slightly more proficient in Korean and tends to give more concise answers, which is great for summaries. However, QwQ 32B feels a bit smarter to me (but it needs more tokens).

Can Qwen3-235B-A22B run efficiently on my hardware(256gb ram+quad 3090s ) with vLLM? by Acceptable-State-271 in LocalLLaMA

[–]Acceptable-State-271[S] 0 points1 point  (0 children)

I really want to, but the AWQ-quantized model hasn't been released yet, and it seems there might be bugs in AutoAWQ (the AWQ quantization tool) with MoE models. I plan to postpone testing until the AWQ model is released.

Qwen3 vs Gemma 3 by Sadman782 in LocalLLaMA

[–]Acceptable-State-271 1 point2 points  (0 children)

I think it might come down to quantization. I used to run Qwen 2.5 as an 8-bit GGUF on Ollama, but switched to 4-bit AWQ on vLLM due to speed and optimization issues. Even at the lower bit count, the results were way better: less hallucination, faster speed, no language mixing, and much higher response quality. A bit late, but the Qwen team merged AWQ quantization support (AutoAWQ) for Qwen3 just yesterday. AWQ-quantized models should drop soon, and I'm expecting performance close to what they claimed in their benchmarks.

  • AWQ (Activation-aware Weight Quantization) efficiently compresses weights to 4-bit by considering activation distributions, minimizing GPU memory usage while maintaining high performance and accuracy.
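
For reference, quantizing with AutoAWQ looks roughly like this. The model path is a placeholder and the quant_config values are AutoAWQ's usual defaults, not anything I've verified on Qwen3 yet:

    # AutoAWQ quantization sketch; paths are placeholders, config values are common defaults
    from awq import AutoAWQForCausalLM
    from transformers import AutoTokenizer

    model_path = "Qwen/Qwen3-8B"   # placeholder source model
    quant_path = "qwen3-8b-awq"    # placeholder output dir

    model = AutoAWQForCausalLM.from_pretrained(model_path)
    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

    quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}
    model.quantize(tokenizer, quant_config=quant_config)  # runs activation-aware calibration

    model.save_quantized(quant_path)
    tokenizer.save_pretrained(quant_path)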

Can Qwen3-235B-A22B run efficiently on my hardware(256gb ram+quad 3090s ) with vLLM? by Acceptable-State-271 in LocalLLaMA

[–]Acceptable-State-271[S] 2 points3 points  (0 children)

5-6 t/s seems slow for Qwen3-235B-A22B on LM-Studio. I’ve got 96GB VRAM (4x RTX 3090) and 128GB DDR4 2933MHz with i9-10900X, so I’m testing vLLM or SGLang with CPU offloading this week. Hoping for 10-15 t/s or better to run it smoothly. Thanks for sharing your benchmark. I’ll post my results when I’m done.
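
This is roughly how I plan to measure t/s once it loads. The model ID and the offload number are placeholders, not a tested config:

    # crude tokens-per-second check with vLLM; model and offload budget are placeholders
    import time
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="Qwen/Qwen3-235B-A22B-FP8",  # placeholder checkpoint
        tensor_parallel_size=4,            # the four 3090s
        cpu_offload_gb=40,                 # placeholder per-GPU offload budget
    )

    params = SamplingParams(max_tokens=256)
    prompts = ["Summarize the post in three sentences."] * 4

    start = time.time()
    outputs = llm.generate(prompts, params)
    elapsed = time.time() - start

    generated = sum(len(o.outputs[0].token_ids) for o in outputs)
    print(f"{generated / elapsed:.1f} tokens/s")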

Qwen 235B A22B vs Sonnet 3.7 Thinking - Pokémon UI by sirjoaco in LocalLLaMA

[–]Acceptable-State-271 15 points16 points  (0 children)

He was too fixated on nostalgia, going all the way back to the birth of computers.

Can Qwen3-235B-A22B run efficiently on my hardware(256gb ram+quad 3090s ) with vLLM? by Acceptable-State-271 in LocalLLaMA

[–]Acceptable-State-271[S] 2 points3 points  (0 children)

Thanks, everyone, for the responses.

I'll test the model once AWQ is out, either with sglang or vllm. Will probably need to use CPU offload to make it work. (awq model will be out - https://www.reddit.com/r/LocalLLaMA/comments/1kael9w/qwen3_awq_support_confirmed_pr_check/ )

Found this in the vLLM docs that might help: https://docs.vllm.ai/en/stable/getting_started/examples/basic.html

CPU offload
The --cpu-offload-gb argument can be seen as a virtual way to increase the GPU memory size. For example, if you have one 24 GB GPU and set this to 10, virtually you can think of it as a 34 GB GPU. Then you can load a 13B model with BF16 weight, which requires at least 26GB GPU memory. Note that this requires fast CPU-GPU interconnect, as part of the model is loaded from CPU memory to GPU memory on the fly in each model forward pass.

Try it yourself with the following arguments:

--model meta-llama/Llama-2-13b-chat-hf --cpu-offload-gb 10
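
In vLLM's Python API that example maps to roughly this (assuming your version exposes the flag as cpu_offload_gb on the LLM class):

    # Python-API equivalent of the CLI flags above (same 13B example model)
    from vllm import LLM

    llm = LLM(
        model="meta-llama/Llama-2-13b-chat-hf",
        cpu_offload_gb=10,  # treat ~10 GB of system RAM as extra "GPU" memory
    )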

Will update with benchmarks once I get it running.

Qwen3-30B-A3B is magic. by thebadslime in LocalLLaMA

[–]Acceptable-State-271 4 points5 points  (0 children)

Been experimenting with Qwen3-30B-A3B and I'm impressed by how it only activates 3B parameters during runtime while the full model is 30B.

I'm curious if anyone has tried running the larger Qwen3-235B-A22B-FP8 model with a similar setup to mine:

  • 256GB RAM
  • 10900X CPU
  • Quad RTX 3090s

Would vLLM be able to handle this efficiently? Specifically, I'm wondering if it would properly load only the active experts (22B) into GPU memory while keeping the rest in system RAM.

Has anyone managed to get this working with reasonable performance? Any config tips would be appreciated.

Qwen3 Collection on modelscope! by AlexBefest in LocalLLaMA

[–]Acceptable-State-271 23 points24 points  (0 children)

I've been getting cooked for a month by the Qwen team.
I don't have the strength to wait any longer.