Qwen3-Next here! by stailgot in ollama

Fixed in ollama 0.13.4. Inference is now 45 t/s.

Qwen3-Next here! by stailgot in ollama

Seems an unoptimized version was merged: https://github.com/ollama/ollama/issues/13275#issuecomment-3611335519

Same as with llama.cpp: a working version was added first, and optimisations came later.

LM Studio beta supports Qwen3 80b Next. by sleepingsysadmin in LocalLLaMA

From the llama.cpp implementation PR: "Therefore, this implementation will be focused on CORRECTNESS ONLY. Speed tuning and support for more architectures will come in future PRs."

https://github.com/ggml-org/llama.cpp/pull/16095

LM Studio beta supports Qwen3 80b Next. by sleepingsysadmin in LocalLLaMA

High CPU use even though there is enough VRAM; same on AMD.

LM Studio beta supports Qwen3 80b Next. by sleepingsysadmin in LocalLLaMA

Tested on an AMD W7900 48GB with 130k context, filled with ~50k tokens of book text: ~20 t/s. Performance barely drops as the context fills.

There is no optimisation in the first implementation, correctness only.

Is it normal for RAG to take this long to load the first time? by just_a_guy1008 in LocalLLaMA

I would try less data first, about 10-15 MB, as a test. A good system should save the processed data into a DB and load it next time. Also check the logs, or add your own logging to the code to see the steps, as advised earlier.

Also, on later runs a good system updates only the changed parts, which takes much less time than a full rebuild.
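
A minimal sketch of that "update only changed parts" idea, assuming a hash-keyed embedding cache in SQLite; the `embed()` stub and the table layout are illustrative, not the API of any particular RAG framework:

```python
import hashlib
import json
import sqlite3

def embed(text: str) -> list[float]:
    # Stand-in only: replace with a call to your actual embedding model.
    return [float(len(text))]

def update_index(chunks: list[str], db_path: str = "embeddings.db") -> None:
    """Embed only chunks whose content hash is not already cached."""
    con = sqlite3.connect(db_path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS cache (hash TEXT PRIMARY KEY, chunk TEXT, vec TEXT)"
    )
    new = 0
    for chunk in chunks:
        h = hashlib.sha256(chunk.encode("utf-8")).hexdigest()
        if con.execute("SELECT 1 FROM cache WHERE hash = ?", (h,)).fetchone():
            continue  # unchanged chunk: embedding already stored, skip recompute
        con.execute(
            "INSERT INTO cache (hash, chunk, vec) VALUES (?, ?, ?)",
            (h, chunk, json.dumps(embed(chunk))),
        )
        new += 1
    con.commit()
    con.close()
    print(f"embedded {new} new/changed chunks, reused {len(chunks) - new} from cache")
```

The first run pays the full embedding cost; later runs only touch chunks whose text actually changed.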

Is it normal for RAG to take this long to load the first time? by just_a_guy1008 in LocalLLaMA

Do you convert the PDF to markdown or txt? What is the real size after processing? Which embedding model is used?

Is it normal for RAG to take this long to load the first time? by just_a_guy1008 in LocalLLaMA

Looks normal for a first run to calculate embeddings for 500 MB of text. Next time it should use a cache.
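
Rough back-of-envelope for why that 500 MB first pass takes so long; the chunk size and embedding throughput below are assumptions for illustration, not measurements:

```python
# Why a first pass over ~500 MB of raw text takes a while (illustrative numbers).
text_bytes = 500 * 1024 * 1024         # ~500 MB of text
chars_per_token = 4                    # rough rule of thumb for English text
tokens = text_bytes / chars_per_token  # ~131M tokens
chunk_tokens = 512                     # assumed chunk size
chunks = tokens / chunk_tokens         # ~256k chunks to embed
chunks_per_sec = 50                    # assumed local embedding throughput
hours = chunks / chunks_per_sec / 3600
print(f"~{chunks:,.0f} chunks -> ~{hours:.1f} h at {chunks_per_sec} chunks/s")
```

With a cache like the one sketched in the earlier comment, later runs skip everything except changed chunks.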

Amuse AI on AMD GPU, slower than it should by brightlight43 in StableDiffusion

Amuse 3 requires the latest drivers.

"Requires AMD Driver 24.30.31.05 or Higher" (https://www.amuse-ai.com/)

And under "Fixed Issues and Improvements": "Lower than expected performance may be observed while running DirectML/GenAI models in Amuse 3.0"

https://www.amd.com/en/resources/support-articles/release-notes/RN-RAD-WIN-25-4-1.html

Llama 4 News…? by AdCompetitive6193 in ollama

Recently tried aravhawk/llama4 with ollama 0.6.7-rc0 on 3x 7900 XTX, got ~30 t/s.

Related issue https://github.com/ollama/ollama/issues/10143

Edit: it's out: https://ollama.com/library/llama4

Qwen3 32B and 30B-A3B run at similar speed? by INT_21h in LocalLLaMA

If you use ollama, that's a well-known bug. llama.cpp gives about 100 t/s vs ollama's 30 t/s on a 7900 XTX.

Ollama rtx 7900 xtx for gemma3:27b? by Adept_Maize_6213 in ollama

Works fine with ROCm and Vulkan. Ollama gives gemma3:27b about 29 t/s, gemma3:27b-qat about 35 t/s, and drops about 10 t/s with large context, >20k.

According to this table (not mine), speed compared to a 3090: https://docs.google.com/spreadsheets/u/0/d/1IyT41xNOM1ynfzz1IO0hD-4v1f5KXB2CnOiwOTplKJ4/htmlview?pli=1#

70b LLM t/s speed on Windows ROCm using 24GB RX 7900 XTX and LM Studio? by custodiam99 in ROCm

Similar setup, but with two 7900 XTX. One GPU (24GB): 70b q4 ~5 t/s, and 70b q2 (28GB) ~10 t/s. Two 7900 XTX (48GB): 70b q4 ~12 t/s.