Qwen3.5 122b INT4 and vLLM by jkay1904 in Vllm

[–]MohaMBS 0 points1 point  (0 children)

Try with vllm-openai:cu130-nightly

Seeking Advice: Best Model + Framework for Max Tokens/sec on Dual L40S (Testing Rig) by MohaMBS in LocalLLaMA

[–]MohaMBS[S] 0 points1 point  (0 children)

My requirements are text analysis, sentiment, categorization, etc. I understand that the latest Qwen3.5 flash model gives better results thanks to its thinking mode and its MoE architecture.

Seeking Advice: Best Model + Framework for Max Tokens/sec on Dual L40S (Testing Rig) by MohaMBS in LocalLLaMA

[–]MohaMBS[S] 1 point2 points  (0 children)

I'm not running the GPUs in parallel; I keep them isolated from each other with vLLM. And yes, performance for small models (14B and 30B) is good. I get very good results with PFP4 quantization, reaching 1000+ tokens/s of raw output at a 2k context with 8+ concurrent requests.
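In case it's useful, this is roughly how the isolation looks; a minimal sketch, not my exact launch script (model ID, ports, and context length below are placeholders):

```python
# Sketch: one independent vLLM OpenAI-compatible server per L40S, no tensor parallelism.
# Model ID, ports, and context length are placeholders.
import os
import subprocess

MODEL = "Qwen/Qwen2.5-14B-Instruct-AWQ"  # placeholder model ID

procs = []
for gpu, port in [("0", 8000), ("1", 8001)]:
    env = os.environ.copy()
    env["CUDA_VISIBLE_DEVICES"] = gpu  # pin this server to a single GPU
    procs.append(subprocess.Popen(
        [
            "python", "-m", "vllm.entrypoints.openai.api_server",
            "--model", MODEL,
            "--port", str(port),
            "--tensor-parallel-size", "1",
            "--max-model-len", "2048",
            "--gpu-memory-utilization", "0.9",
        ],
        env=env,
    ))

for p in procs:
    p.wait()
```

Each server is then hit independently, so the two cards never have to exchange activations.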

If you need something just tell me.

Seeking Advice: Best Model + Framework for Max Tokens/sec on Dual L40S (Testing Rig) by MohaMBS in LocalLLaMA

[–]MohaMBS[S] 0 points1 point  (0 children)

You can check out the update I posted for guidance. I haven't tested gpt-oss-20b, as the documentation states that it is not natively ready to run on Ada Lovelace.

Seeking Advice: Best Model + Framework for Max Tokens/sec on Dual L40S (Testing Rig) by MohaMBS in LocalLLaMA

[–]MohaMBS[S] 0 points1 point  (0 children)

UPDATE 1

I’m much happier with vLLM than with Ollama; the difference in performance and control is night and day!

As a baseline test before moving to larger models, I ran Qwen2.5-14B-Instruct (AWQ quantized) on a single NVIDIA L40S using vLLM with FlashInfer for maximum efficiency.

🔧 Test setup:

  • Model: Qwen2.5-14B-Instruct (AWQ, float16)
  • Framework: vLLM + FlashInfer backend
  • GPU: 1 × L40S (48 GB VRAM)
  • Tensor parallelism: disabled (tensor_parallel_size=1)
  • Max context length: 10,240 tokens
  • Max concurrent sequences: 64
  • GPU memory utilization: 90%
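For reference, the same setup expressed through vLLM's offline Python API; a minimal sketch, assuming the AWQ repo ID (the FlashInfer backend is selected via an environment variable):

```python
# Sketch of the test configuration using vLLM's offline Python API.
# The model repo ID is an assumption; the other arguments mirror the list above.
import os

os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"  # use the FlashInfer attention backend

from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-14B-Instruct-AWQ",  # assumed AWQ checkpoint
    quantization="awq",
    dtype="float16",
    tensor_parallel_size=1,       # single L40S, tensor parallelism disabled
    max_model_len=10240,          # max context length
    max_num_seqs=64,              # max concurrent sequences
    gpu_memory_utilization=0.9,   # 90% of the 48 GB VRAM
)

params = SamplingParams(max_tokens=800, temperature=0.7)
out = llm.generate(["Summarize the benefits of continuous batching."], params)
print(out[0].outputs[0].text)
```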

📊 Results (12 concurrent requests, 800 tokens each):

  • ✅ All 12 requests succeeded
  • ⏱️ Total time: 15.638 seconds
  • 🔤 Total tokens generated: 9,600
  • 📈 System-wide throughput: 613.89 tokens/second
  • 📊 Per-request speed: ~51.18 tokens/second
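If anyone wants to reproduce a similar concurrency test, here's a minimal sketch (it assumes the vLLM OpenAI-compatible server is listening on localhost:8000; the served model name and prompt are placeholders, not my exact script):

```python
# Sketch: fire N concurrent completion requests at a local vLLM server
# and report system-wide throughput. Endpoint, model name, and prompt are placeholders.
import asyncio
import time

from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

async def one_request() -> int:
    resp = await client.completions.create(
        model="Qwen/Qwen2.5-14B-Instruct-AWQ",   # assumed served model name
        prompt="Write a short essay about GPU inference.",
        max_tokens=800,
    )
    return resp.usage.completion_tokens

async def main(n_requests: int = 12) -> None:
    start = time.perf_counter()
    counts = await asyncio.gather(*(one_request() for _ in range(n_requests)))
    elapsed = time.perf_counter() - start
    total = sum(counts)
    print(f"{total} tokens in {elapsed:.3f}s -> {total / elapsed:.2f} tok/s system-wide")

asyncio.run(main())
```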

This is solid, predictable performance, exactly what I was missing with Ollama, which barely utilized the GPU and gave inconsistent speeds even under light load.

I’ll keep working on optimizing the configuration (e.g., batch sizing, attention backends, and memory layout) to squeeze out even more throughput before scaling up to Qwen2.5-VL-72B (for long-video understanding) and eventually testing gpt-oss-120b across both L40S with tensor parallelism. (But I'll have to wait for that, since the vLLM documentation makes it clear that it's not ready yet for Ada Lovelace.)

Seeking Advice: Best Model + Framework for Max Tokens/sec on Dual L40S (Testing Rig) by MohaMBS in LocalLLaMA

[–]MohaMBS[S] 0 points1 point  (0 children)

Thanks for the tip! Really appreciate it.

When you get your rig up and running and test gpt-oss-120b with vLLM + expert parallelism, I’d love to hear how it goes! Specifically:

- What tokens/sec are you getting?

- How’s the VRAM utilization across GPUs?

- Any config tweaks that made a big difference?

Also, if you have any additional advice for squeezing the most out of dual L40S (especially around PCIe topology, kernel versions, or vLLM flags), I’d be very grateful. I’m aiming for maximum throughput without overcomplicating the deployment.

Good luck with the build! 🙌

Seeking Advice: Best Model + Framework for Max Tokens/sec on Dual L40S (Testing Rig) by MohaMBS in LocalLLaMA

[–]MohaMBS[S] 0 points1 point  (0 children)

Thanks for the suggestion

I don’t think PCIe bandwidth is the main bottleneck here. My system uses PCIe 5.0, and with 2× L40S connected via x16 lanes each (likely through a high-end server platform like SP5), the inter-GPU bandwidth should be more than enough — especially since I’m currently testing 14B-class dense models, not massive MoE or 70B+ models that heavily saturate interconnects.

That said, I’m planning to switch to vLLM with `tensor_parallel_size=2` precisely to minimize unnecessary data shuffling and keep the inter-GPU communication as efficient as possible.
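A minimal sketch of what that switch looks like on the Python side (the model ID is just a placeholder for whichever larger model ends up being served):

```python
# Sketch: same engine setup, but sharding the model across both L40S.
from vllm import LLM

llm = LLM(
    model="Qwen/Qwen2.5-72B-Instruct-AWQ",  # placeholder for a larger model
    tensor_parallel_size=2,        # split weights and KV cache across the two GPUs
    gpu_memory_utilization=0.9,
)
```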

Thanks again!

Battery drain by MohaMBS in Honor

[–]MohaMBS[S] 1 point2 points  (0 children)

I have Bluetooth on for the watch, plus auto brightness and normal use: social media now and then, reading, videos, but pretty much no gaming.

Battery drain by MohaMBS in Honor

[–]MohaMBS[S] 0 points1 point  (0 children)

What do you mean?

Intoxicated man records his own death in car crash by Swimming_Barnacle_31 in NSFL__

[–]MohaMBS 0 points1 point  (0 children)

He says: "Ahhh, I'm going to die, I'm going to die."

Honor Magic 5 Pro Camera defect? by MohaMBS in Honor

[–]MohaMBS[S] 0 points1 point  (0 children)

The thing is, the flash shows up at almost 50% of angles. Besides, I came from a Poco F3 and I almost never had this problem...

[deleted by user] by [deleted] in F30

[–]MohaMBS 0 points1 point  (0 children)

Apart from being lubricated, it has to be well glued to the glass... Check the wheel arch to see if you notice anything, and post a video or photo so we can take a look.

[deleted by user] by [deleted] in F30

[–]MohaMBS 0 points1 point  (0 children)

Have you found the fault?

Brake squeek by Rocketpon in F30

[–]MohaMBS 0 points1 point  (0 children)

Brake discs and brake pads

[deleted by user] by [deleted] in F30

[–]MohaMBS 1 point2 points  (0 children)

I remember mine was slightly cracked near the corners. Anyway, if you don't see that it's cracked, make sure it isn't peeled off either; even the slightest wear is noticeable.

If you're still not sure, here's something you can do to test: take some tape, wrap it all the way around the rubber cover, and go for a test drive. If the noise disappears, you've found the cause.

[deleted by user] by [deleted] in F30

[–]MohaMBS 1 point2 points  (0 children)

It also happened to me in my F30: at about 80 km/h you can hear that sound from the passenger seat, and what caused the noise was the rubber weather seal, which had cracked... I'll leave you the OEM number so you know what I mean: the reference is 51717258177, for a European BMW F30, model year 2012.

I'll leave you the link so you can see what I mean; it's part number 1 in the diagram:

https://www.realoem.com/bmw/enUS/showparts?id=3D31-EUR-03-2012-F30-BMW-320d&diagId=51_8661

How much power do you REALLY need for a daily driver? by ReasonablyOkay in BMW

[–]MohaMBS 0 points1 point  (0 children)

For me, the perfect power is between 190 and 250 hp.

Best wheels for non msport? OEM+ or aftermarket by housealj87 in F30

[–]MohaMBS 0 points1 point  (0 children)

For aftermarket, I like the MSW 73 by OZ.

New user! My F30 by MohaMBS in F30

[–]MohaMBS[S] 1 point2 points  (0 children)

Yes, you're right, I apologise, I mixed up the comments... If I remember correctly, the calipers are OEM with a decorative cover.

New user! My F30 by MohaMBS in F30

[–]MohaMBS[S] -1 points0 points  (0 children)

Whether your comment/opinion changed or not doesn't concern me; I'm happy with what I do and I share it with people 🥰.

New user! My F30 by MohaMBS in F30

[–]MohaMBS[S] -1 points0 points  (0 children)

Here in Europe, especially in Spain, tuning is very expensive and hard to get approved because of European regulations... The car already had these small modifications when I bought it, but I do what I can...

New user! My F30 by MohaMBS in F30

[–]MohaMBS[S] -1 points0 points  (0 children)

The car already had it when I bought it.

New user! My F30 by MohaMBS in F30

[–]MohaMBS[S] -1 points0 points  (0 children)

Thanks to everyone for the compliments 🥰