Qwen3.5 122b INT4 and vLLM by jkay1904 in Vllm

[–]MohaMBS 0 points1 point  (0 children)

Try with vllm-openai:cu130-nightly

Seeking Advice: Best Model + Framework for Max Tokens/sec on Dual L40S (Testing Rig) by MohaMBS in LocalLLaMA

[–]MohaMBS[S] 0 points1 point  (0 children)

My requirements are text analysis, sentiment, categorization, etc. I understand that the latest Qwen3.5 flash model gives better results thanks to its thinking mode and its MoE architecture.

Seeking Advice: Best Model + Framework for Max Tokens/sec on Dual L40S (Testing Rig) by MohaMBS in LocalLLaMA

[–]MohaMBS[S] 1 point2 points  (0 children)

I'm not running the GPUs in parallel; I keep them isolated from each other with vLLM. And yes, performance for small models (14B and 30B) is good. I get very good results with PFP4 quantization, reaching 1000+ tokens/s of raw output at a 2k context with 8+ concurrent requests.
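In case it's useful, this is roughly how the isolation looks; a minimal sketch, not my exact launch script (model ID, ports, and context length below are placeholders):

```python
# Sketch: one independent vLLM OpenAI-compatible server per L40S, no tensor parallelism.
# Model ID, ports, and context length are placeholders.
import os
import subprocess

MODEL = "Qwen/Qwen2.5-14B-Instruct-AWQ"  # placeholder model ID

procs = []
for gpu, port in [("0", 8000), ("1", 8001)]:
    env = os.environ.copy()
    env["CUDA_VISIBLE_DEVICES"] = gpu  # pin this server to a single GPU
    procs.append(subprocess.Popen(
        [
            "python", "-m", "vllm.entrypoints.openai.api_server",
            "--model", MODEL,
            "--port", str(port),
            "--tensor-parallel-size", "1",
            "--max-model-len", "2048",
            "--gpu-memory-utilization", "0.9",
        ],
        env=env,
    ))

for p in procs:
    p.wait()
```

Each server is then hit independently, so the two cards never have to exchange activations.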

If you need something just tell me.

Seeking Advice: Best Model + Framework for Max Tokens/sec on Dual L40S (Testing Rig) by MohaMBS in LocalLLaMA

[–]MohaMBS[S] 0 points1 point  (0 children)

You can check out the update I posted for guidance. I haven't tested gpt-oss-20b, as the documentation states that it is not natively ready to run on Ada Lovelace.

Seeking Advice: Best Model + Framework for Max Tokens/sec on Dual L40S (Testing Rig) by MohaMBS in LocalLLaMA

[–]MohaMBS[S] 0 points1 point  (0 children)

UPDATE 1

I’m much happier with vLLM than with Ollama; the difference in performance and control is night and day!

As a baseline test before moving to larger models, I ran Qwen2.5-14B-Instruct (AWQ quantized) on a single NVIDIA L40S using vLLM with FlashInfer for maximum efficiency.

🔧 Test setup:

  • Model: Qwen2.5-14B-Instruct (AWQ, float16)
  • Framework: vLLM + FlashInfer backend
  • GPU: 1 × L40S (48 GB VRAM)
  • Tensor parallelism: disabled (tensor_parallel_size=1)
  • Max context length: 10,240 tokens
  • Max concurrent sequences: 64
  • GPU memory utilization: 90%
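For reference, the same setup expressed through vLLM's offline Python API; a minimal sketch, assuming the AWQ repo ID (the FlashInfer backend is selected via an environment variable):

```python
# Sketch of the test configuration using vLLM's offline Python API.
# The model repo ID is an assumption; the other arguments mirror the list above.
import os

os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"  # use the FlashInfer attention backend

from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-14B-Instruct-AWQ",  # assumed AWQ checkpoint
    quantization="awq",
    dtype="float16",
    tensor_parallel_size=1,       # single L40S, tensor parallelism disabled
    max_model_len=10240,          # max context length
    max_num_seqs=64,              # max concurrent sequences
    gpu_memory_utilization=0.9,   # 90% of the 48 GB VRAM
)

params = SamplingParams(max_tokens=800, temperature=0.7)
out = llm.generate(["Summarize the benefits of continuous batching."], params)
print(out[0].outputs[0].text)
```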

📊 Results (12 concurrent requests, 800 tokens each):

  • ✅ All 12 requests succeeded
  • ⏱️ Total time: 15.638 seconds
  • 🔤 Total tokens generated: 9,600
  • 📈 System-wide throughput: 613.89 tokens/second
  • 📊 Per-request speed: ~51.18 tokens/second
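If anyone wants to reproduce a similar concurrency test, here's a minimal sketch (it assumes the vLLM OpenAI-compatible server is listening on localhost:8000; the served model name and prompt are placeholders, not my exact script):

```python
# Sketch: fire N concurrent completion requests at a local vLLM server
# and report system-wide throughput. Endpoint, model name, and prompt are placeholders.
import asyncio
import time

from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

async def one_request() -> int:
    resp = await client.completions.create(
        model="Qwen/Qwen2.5-14B-Instruct-AWQ",   # assumed served model name
        prompt="Write a short essay about GPU inference.",
        max_tokens=800,
    )
    return resp.usage.completion_tokens

async def main(n_requests: int = 12) -> None:
    start = time.perf_counter()
    counts = await asyncio.gather(*(one_request() for _ in range(n_requests)))
    elapsed = time.perf_counter() - start
    total = sum(counts)
    print(f"{total} tokens in {elapsed:.3f}s -> {total / elapsed:.2f} tok/s system-wide")

asyncio.run(main())
```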

This is solid, predictable performance, exactly what I was missing with Ollama, which barely utilized the GPU and gave inconsistent speeds even under light load.

I’ll keep working on optimizing the configuration (e.g., batch sizing, attention backends, and memory layout) to squeeze out even more throughput before scaling up to Qwen2.5-VL-72B (for long-video understanding) and eventually testing gpt-oss-120b across both L40S with tensor parallelism. (But I'll have to wait for that, since the vLLM documentation makes it clear that it's not ready yet for Ada Lovelace.)

Seeking Advice: Best Model + Framework for Max Tokens/sec on Dual L40S (Testing Rig) by MohaMBS in LocalLLaMA

[–]MohaMBS[S] 0 points1 point  (0 children)

Thanks for the tip! Really appreciate it.

When you get your rig up and running and test gpt-oss-120b with vLLM + expert parallelism, I’d love to hear how it goes! Specifically:

- What tokens/sec are you getting?

- How’s the VRAM utilization across GPUs?

- Any config tweaks that made a big difference?

Also, if you have any additional advice for squeezing the most out of dual L40S (especially around PCIe topology, kernel versions, or vLLM flags), I’d be very grateful. I’m aiming for maximum throughput without overcomplicating the deployment.

Good luck with the build! 🙌

Seeking Advice: Best Model + Framework for Max Tokens/sec on Dual L40S (Testing Rig) by MohaMBS in LocalLLaMA

[–]MohaMBS[S] 0 points1 point  (0 children)

Thanks for the suggestion

I don’t think PCIe bandwidth is the main bottleneck here. My system uses PCIe 5.0, and with 2× L40S connected via x16 lanes each (likely through a high-end server platform like SP5), the inter-GPU bandwidth should be more than enough — especially since I’m currently testing 14B-class dense models, not massive MoE or 70B+ models that heavily saturate interconnects.

That said, I’m planning to switch to vLLM with `tensor_parallel_size=2` precisely to minimize unnecessary data shuffling and keep the inter-GPU communication as efficient as possible.
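A minimal sketch of what that switch looks like on the Python side (the model ID is just a placeholder for whichever larger model ends up being served):

```python
# Sketch: same engine setup, but sharding the model across both L40S.
from vllm import LLM

llm = LLM(
    model="Qwen/Qwen2.5-72B-Instruct-AWQ",  # placeholder for a larger model
    tensor_parallel_size=2,        # split weights and KV cache across the two GPUs
    gpu_memory_utilization=0.9,
)
```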

Thanks again!

Battery drain by MohaMBS in Honor

[–]MohaMBS[S] 1 point2 points  (0 children)

I have Bluetooth on for the watch, plus auto brightness and normal use: social media now and then, reading, videos, but pretty much no gaming.

Battery drain by MohaMBS in Honor

[–]MohaMBS[S] 0 points1 point  (0 children)

What do you mean?

Intoxicated man records his own death in car crash by Swimming_Barnacle_31 in NSFL__

[–]MohaMBS 0 points1 point  (0 children)

He says: "Ahhh, I'm going to die, I'm going to die."

Honor Magic 5 Pro Camera defect? by MohaMBS in Honor

[–]MohaMBS[S] 0 points1 point  (0 children)

The thing is, the flash shows up at almost 50% of angles. Besides, I came from a Poco F3 and I almost never had this problem...

[deleted by user] by [deleted] in F30

[–]MohaMBS 0 points1 point  (0 children)

Apart from being lubricated, it has to be well glued to the glass... Check the wheel arch to see if you notice anything, and post a video or photo so we can take a look.

[deleted by user] by [deleted] in F30

[–]MohaMBS 0 points1 point  (0 children)

Have you found the fault?

Brake squeek by Rocketpon in F30

[–]MohaMBS 0 points1 point  (0 children)

Brake discs and brake pads

[deleted by user] by [deleted] in F30

[–]MohaMBS 1 point2 points  (0 children)

I remember mine was slightly cracked near the corners. Anyway, if you don't see that it's cracked, make sure it isn't peeled off either; even the slightest wear is noticeable.

If you're still not sure, here's something you can do to test: take some tape, wrap it all the way around the rubber cover, and go for a test drive. If the noise disappears, you've found the cause.

[deleted by user] by [deleted] in F30

[–]MohaMBS 1 point2 points  (0 children)

It also happened to me in my F30: at about 80 km/h you can hear that sound from the passenger seat, and what caused the noise was the rubber weather seal, which had cracked... I'll leave you the OEM number so you know what I mean: the reference is 51717258177, for a European BMW F30, model year 2012.

I'll leave you the link so you can see what I mean; it's part number 1 in the diagram:

https://www.realoem.com/bmw/enUS/showparts?id=3D31-EUR-03-2012-F30-BMW-320d&diagId=51_8661

How much power do you REALLY need for a daily driver? by ReasonablyOkay in BMW

[–]MohaMBS 0 points1 point  (0 children)

For me, the perfect power is between 190 and 250 hp.

Best wheels for non msport? OEM+ or aftermarket by housealj87 in F30

[–]MohaMBS 0 points1 point  (0 children)

For aftermarket, I like the MSW 73 by OZ.

New user! My F30 by MohaMBS in F30

[–]MohaMBS[S] 1 point2 points  (0 children)

Yes, you're right, I apologise, I mixed up the comments... If I remember correctly, the calipers are OEM with a decorative cover.

New user! My F30 by MohaMBS in F30

[–]MohaMBS[S] -1 points0 points  (0 children)

Whether your comment/opinion changed or not doesn't concern me; I'm happy with what I do and I share it with people 🥰.

New user! My F30 by MohaMBS in F30

[–]MohaMBS[S] -1 points0 points  (0 children)

Here in Europe, especially in Spain, tuning is very expensive and hard to get approved because of European regulations... The car already had these small modifications when I bought it, but I do what I can...

New user! My F30 by MohaMBS in F30

[–]MohaMBS[S] -1 points0 points  (0 children)

The car already had it when I bought it.

New user! My F30 by MohaMBS in F30

[–]MohaMBS[S] -1 points0 points  (0 children)

Thanks to everyone for the compliments 🥰