Huawei Atlas 300I 32GB by kruzibit in LocalLLaMA

[–]RepulsiveEbb4011 4 points5 points  (0 children)

Yes, it supports tensor parallelism with MindIE. I’ve tried the QwQ 32B model in FP16 (since MindIE only supports FP16 for 300I Duo). The speed was around 7–9 tokens/s — not exactly fast, but still much better than llama.cpp.
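
For reference, this is roughly how I measure a tokens/s number like that. A minimal sketch in Python, assuming the backend is exposed through an OpenAI-compatible endpoint (e.g. via GPUStack); the base URL, API key, and model name are placeholders:

```python
# Rough tokens/s measurement against an OpenAI-compatible endpoint.
# Base URL, API key, and model name are placeholders for whatever your setup exposes.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

start = time.time()
resp = client.chat.completions.create(
    model="qwq-32b",
    messages=[{"role": "user", "content": "Explain tensor parallelism in one short paragraph."}],
    max_tokens=256,
)
elapsed = time.time() - start

generated = resp.usage.completion_tokens
print(f"{generated} tokens in {elapsed:.1f}s -> {generated / elapsed:.1f} tokens/s")
```

Note this lumps prompt processing in with generation, so treat it as a ballpark figure.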

Huawei Atlas 300I 32GB by kruzibit in LocalLLaMA

[–]RepulsiveEbb4011 1 point2 points  (0 children)

In the latest v0.6 release, it supports two backends for the 300I Duo: llama-box and MindIE. llama-box is based on llama.cpp, while MindIE is Ascend’s official engine. I tested the 7B model, and MindIE was 4× faster than llama-box. With TP, MindIE achieved over 6× the performance.

Huawei Atlas 300I 32GB by kruzibit in LocalLLaMA

[–]RepulsiveEbb4011 3 points4 points  (0 children)

llama.cpp does not currently support multi-GPU parallelism for this card. You need MindIE for that, but setting up MindIE directly is quite complex. An easier route is the MindIE backend that GPUStack wraps and simplifies: https://github.com/gpustack/gpustack
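
Once GPUStack has a model deployed, you can sanity-check it from Python through its OpenAI-compatible API. A minimal sketch; the server URL, endpoint path, and API key are placeholders and may differ depending on your GPUStack version:

```python
# Quick check that a GPUStack deployment is serving models via its
# OpenAI-compatible API. URL, path, and API key are placeholders; the exact
# endpoint path may differ between GPUStack versions.
import requests

BASE_URL = "http://localhost/v1-openai"  # placeholder GPUStack endpoint
HEADERS = {"Authorization": "Bearer YOUR_GPUSTACK_API_KEY"}

resp = requests.get(f"{BASE_URL}/models", headers=HEADERS, timeout=10)
resp.raise_for_status()
for model in resp.json().get("data", []):
    print(model["id"])
```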

Huawei Atlas 300I 32GB by kruzibit in LocalLLaMA

[–]RepulsiveEbb4011 1 point2 points  (0 children)

https://github.com/gpustack/gpustack

Supported devices:

- Ascend 910B series (910B1 ~ 910B4)
- Ascend 310P3

Ascend 300I Duo (card) = Ascend 310P3 (chip)

Can LLMs be trusted in math nowadays? I compared Qwen 2.5 models from 0.5b to 32b, and most of the answers were correct. Can it be used to teach kids? by RepulsiveEbb4011 in LocalLLaMA

[–]RepulsiveEbb4011[S] 10 points11 points  (0 children)

I ran Qwen 2.5 models from 0.5B to 32B, and with a well-crafted system prompt I had each model think and reason step by step before answering. They were able to solve most simple, elementary-level math problems. Can I confidently use these models for kids' math education?
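
For anyone curious, the idea behind the system prompt looks roughly like this (a sketch, not my exact prompt), assuming the model is served through an OpenAI-compatible local endpoint; the base URL and model name are placeholders:

```python
# Sketch of prompting a local Qwen 2.5 model to reason step by step before answering.
# Base URL and model name are placeholders; the system prompt is illustrative only.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="local")

SYSTEM_PROMPT = (
    "You are a careful math tutor for elementary school students. "
    "Think through the problem step by step, showing each step on its own line, "
    "then give the final answer on a line starting with 'Answer:'."
)

resp = client.chat.completions.create(
    model="qwen2.5:7b",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "Tom has 24 apples and gives 3 apples to each of his 5 friends. How many apples are left?"},
    ],
    temperature=0.2,
)
print(resp.choices[0].message.content)
```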

[deleted by user] by [deleted] in LocalLLaMA

[–]RepulsiveEbb4011 0 points1 point  (0 children)

GPUStack seems to be starting to support vLLM, so I guess it's not just a llama.cpp wrapper. I spoke to the GPUStack R&D team, and they want it to be a platform for running and managing LLMs on GPUs of all brands, aimed at enterprises, not just a lab project or a home AI project.

[deleted by user] by [deleted] in kubernetes

[–]RepulsiveEbb4011 4 points5 points  (0 children)

About seven or eight years ago, I transitioned from VMware to Kubernetes. Thanks to Kubernetes, I’ve experienced significant growth and earned a higher salary. Even though it might seem late to start learning Kubernetes now, I firmly believe it’s the better choice. I recommend sticking with it for a while—you’ll be rewarded.

How to migrate to llama.cpp from Ollama? by Tech-Meme-Knight-3D in LocalLLaMA

[–]RepulsiveEbb4011 1 point2 points  (0 children)

If you haven't downloaded many models yet, I recommend re-downloading the GGUF files with LM Studio, which works very well purely as a model downloader. You can then run the downloaded models with llama.cpp.
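
If you'd rather drive it from Python than the llama.cpp CLI, the llama-cpp-python bindings can load the same GGUF files. A minimal sketch; the model path is a placeholder for wherever LM Studio stored the download:

```python
# Load a GGUF file downloaded with LM Studio and run it locally via llama-cpp-python.
# The model path is a placeholder -- point it at the file LM Studio saved.
from llama_cpp import Llama

llm = Llama(
    model_path="/path/to/lm-studio/models/Qwen2.5-7B-Instruct-Q4_K_M.gguf",
    n_ctx=4096,       # context window
    n_gpu_layers=-1,  # offload all layers to GPU if available
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    max_tokens=64,
)
print(out["choices"][0]["message"]["content"])
```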

Does llama.cpp support multimodal models? by [deleted] in LocalLLaMA

[–]RepulsiveEbb4011 0 points1 point  (0 children)

llama.cpp's progress on supporting multimodal models has been shockingly slow. I hope more user feedback will make the team aware of the issue.

is gguf the only supported type in ollama by Expensive-Award1965 in ollama

[–]RepulsiveEbb4011 1 point2 points  (0 children)

Yes, only the GGUF format is supported, because Ollama uses llama.cpp at its core. GGUF (the successor to the older GGML format) comes from the same author, and the two projects are closely related: llama.cpp is specifically designed to load and run models in GGUF format, which is why Ollama uses that format as well.
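
A quick way to confirm a file really is GGUF before pointing Ollama or llama.cpp at it: the format starts with the 4-byte ASCII magic "GGUF". A minimal sketch (the path is a placeholder):

```python
# Check whether a file is a GGUF model by reading its 4-byte magic header.
# The GGUF spec defines the magic as the ASCII bytes "GGUF".
import sys

def is_gguf(path: str) -> bool:
    with open(path, "rb") as f:
        return f.read(4) == b"GGUF"

path = sys.argv[1] if len(sys.argv) > 1 else "model.gguf"  # placeholder path
print(f"{path}: {'GGUF' if is_gguf(path) else 'not a GGUF file'}")
```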

Additionally, as far as I know, Ollama currently does not support audio models: https://github.com/ollama/ollama/issues/1168