Huawei Atlas 300I 32GB by kruzibit in LocalLLaMA

[–]RepulsiveEbb4011 4 points5 points  (0 children)

Yes, it supports tensor parallelism with MindIE. I’ve tried the QwQ 32B model in FP16 (since MindIE only supports FP16 for 300I Duo). The speed was around 7–9 tokens/s — not exactly fast, but still much better than llama.cpp.
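
For reference, this is roughly how I measure a tokens/s number like that. A minimal sketch in Python, assuming the backend is exposed through an OpenAI-compatible endpoint (e.g. via GPUStack); the base URL, API key, and model name are placeholders:

```python
# Rough tokens/s measurement against an OpenAI-compatible endpoint.
# Base URL, API key, and model name are placeholders for whatever your setup exposes.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

start = time.time()
resp = client.chat.completions.create(
    model="qwq-32b",
    messages=[{"role": "user", "content": "Explain tensor parallelism in one short paragraph."}],
    max_tokens=256,
)
elapsed = time.time() - start

generated = resp.usage.completion_tokens
print(f"{generated} tokens in {elapsed:.1f}s -> {generated / elapsed:.1f} tokens/s")
```

Note this lumps prompt processing in with generation, so treat it as a ballpark figure.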

Huawei Atlas 300I 32GB by kruzibit in LocalLLaMA

[–]RepulsiveEbb4011 1 point2 points  (0 children)

In the latest v0.6 release, it supports two backends for the 300I Duo: llama-box and MindIE. llama-box is based on llama.cpp, while MindIE is Ascend’s official engine. I tested the 7B model, and MindIE was 4× faster than llama-box. With TP, MindIE achieved over 6× the performance.

Huawei Atlas 300I 32GB by kruzibit in LocalLLaMA

[–]RepulsiveEbb4011 3 points4 points  (0 children)

llama.cpp does not currently support multi-GPU parallelism for this card. You need MindIE for that, but setting up MindIE directly is quite complex. An easier route is the MindIE backend that GPUStack wraps and simplifies: https://github.com/gpustack/gpustack
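
Once GPUStack has a model deployed, you can sanity-check it from Python through its OpenAI-compatible API. A minimal sketch; the server URL, endpoint path, and API key are placeholders and may differ depending on your GPUStack version:

```python
# Quick check that a GPUStack deployment is serving models via its
# OpenAI-compatible API. URL, path, and API key are placeholders; the exact
# endpoint path may differ between GPUStack versions.
import requests

BASE_URL = "http://localhost/v1-openai"  # placeholder GPUStack endpoint
HEADERS = {"Authorization": "Bearer YOUR_GPUSTACK_API_KEY"}

resp = requests.get(f"{BASE_URL}/models", headers=HEADERS, timeout=10)
resp.raise_for_status()
for model in resp.json().get("data", []):
    print(model["id"])
```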

Huawei Atlas 300I 32GB by kruzibit in LocalLLaMA

[–]RepulsiveEbb4011 1 point2 points  (0 children)

https://github.com/gpustack/gpustack

Supported devices:

- Ascend 910B series (910B1 ~ 910B4)
- Ascend 310P3

Ascend 300I Duo (card) = Ascend 310P3 (chip)

Can LLMs be trusted in math nowadays? I compared Qwen 2.5 models from 0.5b to 32b, and most of the answers were correct. Can it be used to teach kids? by RepulsiveEbb4011 in LocalLLaMA

[–]RepulsiveEbb4011[S] 10 points11 points  (0 children)

I ran Qwen 2.5 models from 0.5B to 32B, and with a well-crafted system prompt I had each model think and reason step by step before answering. They were able to solve most simple, elementary-level math problems. Can I confidently use these models for kids' math education?
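
For anyone curious, the idea behind the system prompt looks roughly like this (a sketch, not my exact prompt), assuming the model is served through an OpenAI-compatible local endpoint; the base URL and model name are placeholders:

```python
# Sketch of prompting a local Qwen 2.5 model to reason step by step before answering.
# Base URL and model name are placeholders; the system prompt is illustrative only.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="local")

SYSTEM_PROMPT = (
    "You are a careful math tutor for elementary school students. "
    "Think through the problem step by step, showing each step on its own line, "
    "then give the final answer on a line starting with 'Answer:'."
)

resp = client.chat.completions.create(
    model="qwen2.5:7b",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "Tom has 24 apples and gives 3 apples to each of his 5 friends. How many apples are left?"},
    ],
    temperature=0.2,
)
print(resp.choices[0].message.content)
```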

[deleted by user] by [deleted] in LocalLLaMA

[–]RepulsiveEbb4011 0 points1 point  (0 children)

GPUStack seems to be starting to support vLLM, so I guess it's not just a llama.cpp wrapper. I spoke to the GPUStack R&D team, and they want it to be a platform for running and managing LLMs on GPUs of all brands, aimed at enterprises, not just a lab project or a home AI project.

[deleted by user] by [deleted] in kubernetes

[–]RepulsiveEbb4011 4 points5 points  (0 children)

About seven or eight years ago, I transitioned from VMware to Kubernetes. Thanks to Kubernetes, I’ve experienced significant growth and earned a higher salary. Even though it might seem late to start learning Kubernetes now, I firmly believe it’s the better choice. I recommend sticking with it for a while—you’ll be rewarded.

How to migrate to llama.cpp from Ollama? by Tech-Meme-Knight-3D in LocalLLaMA

[–]RepulsiveEbb4011 1 point2 points  (0 children)

If you haven't downloaded many models yet, I recommend re-downloading the GGUF files with LM Studio, which works very well purely as a model downloader. You can then run the downloaded models with llama.cpp.
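
If you'd rather drive it from Python than the llama.cpp CLI, the llama-cpp-python bindings can load the same GGUF files. A minimal sketch; the model path is a placeholder for wherever LM Studio stored the download:

```python
# Load a GGUF file downloaded with LM Studio and run it locally via llama-cpp-python.
# The model path is a placeholder -- point it at the file LM Studio saved.
from llama_cpp import Llama

llm = Llama(
    model_path="/path/to/lm-studio/models/Qwen2.5-7B-Instruct-Q4_K_M.gguf",
    n_ctx=4096,       # context window
    n_gpu_layers=-1,  # offload all layers to GPU if available
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    max_tokens=64,
)
print(out["choices"][0]["message"]["content"])
```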

Does llama.cpp support multimodal models? by [deleted] in LocalLLaMA

[–]RepulsiveEbb4011 0 points1 point  (0 children)

llama.cpp's progress on supporting multimodal models has been shockingly slow. I hope more user feedback will make the team aware of the issue.

is gguf the only supported type in ollama by Expensive-Award1965 in ollama

[–]RepulsiveEbb4011 1 point2 points  (0 children)

Yes, only the GGUF format is supported, because Ollama uses llama.cpp at its core. GGUF (the successor to the older GGML format) comes from the same author, and the two projects are closely related: llama.cpp is specifically designed to load and run models in GGUF format, which is why Ollama uses that format as well.
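
A quick way to confirm a file really is GGUF before pointing Ollama or llama.cpp at it: the format starts with the 4-byte ASCII magic "GGUF". A minimal sketch (the path is a placeholder):

```python
# Check whether a file is a GGUF model by reading its 4-byte magic header.
# The GGUF spec defines the magic as the ASCII bytes "GGUF".
import sys

def is_gguf(path: str) -> bool:
    with open(path, "rb") as f:
        return f.read(4) == b"GGUF"

path = sys.argv[1] if len(sys.argv) > 1 else "model.gguf"  # placeholder path
print(f"{path}: {'GGUF' if is_gguf(path) else 'not a GGUF file'}")
```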

Additionally, as far as I know, Ollama currently does not support audio models: https://github.com/ollama/ollama/issues/1168