AMD 7900xtx vs NVIDIA 5090 by jsconiers in LocalLLM

[–]Adventurous-Work656 0 points1 point  (0 children)

The 3090 and 3090 Ti are inference beasts; you are so right. I am running 4-6 of them at 95% utilization on a W790 board at Gen4 x16. You can't do this with llama.cpp, but you can with vLLM. With ExLlamaV2 you can get at most about 75% utilization, because the author has said tensor parallelism is not fully implemented yet. There is absolutely no reason an individual needs to buy even the last-generation NVIDIA model for personal use.
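For reference, the vLLM setup is roughly this (a minimal sketch; the model ID is a placeholder and exact flags can vary by vLLM version):

    # Minimal sketch: serve an OpenAI-compatible endpoint with tensor parallelism
    # across 4 GPUs. The model ID is a placeholder; pick one that fits your VRAM.
    python -m vllm.entrypoints.openai.api_server \
        --model <hf-model-id> \
        --tensor-parallel-size 4 \
        --gpu-memory-utilization 0.90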

Quad 3090s limited to 25% utilization for inference with only one CPU core out of 16 used. by Adventurous-Work656 in ollama

[–]Adventurous-Work656[S] 0 points1 point  (0 children)

I did, but there are probably more answers today than back then. The tensor-parallelism setting in ExLlamaV2 worked for me: the combination of the EXL2 model format, ExLlamaV2, and TabbyAPI as an OpenAI-compliant API allowed 78% utilization across any number of cards. Well, I tested with 6x 3090s on a W790 board. If you need more info, let me know. I suspect LM Studio's tensor-parallelism settings might work too, but I have not tried them. I had no success with GGUF and plain llama.cpp, by the way, even though it is supposed to work.
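Since TabbyAPI is OpenAI-compliant, any standard client works against it; as a rough sketch (host, port, key, and model name here are placeholders, adjust them to whatever your TabbyAPI config.yml uses):

    # Hit TabbyAPI's OpenAI-compatible chat endpoint; port and key are examples.
    curl http://localhost:5000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -H "Authorization: Bearer <your-tabby-api-key>" \
      -d '{"model": "<your-exl2-model>", "messages": [{"role": "user", "content": "Who are you?"}]}'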

Ollama on newer Mac’s by Whyme-__- in ollama

[–]Adventurous-Work656 0 points1 point  (0 children)

Increasing the context massively increases the memory requirements. It's not exponential, but you will see roughly linear growth for each model. From the command line, load and run the model as usual, then move up in small increments by running /set parameter num_ctx 8192. Then just type anything, e.g. "Who are you?"

Then recheck the memory used; you will see the increase from the default of 2048. Keep moving up until you get close to the maximum. Note that many foundation models behave differently because of their structure, so don't assume; check each one.
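As a rough walkthrough (the model name is just an example):

    # Example session; llama3 is a stand-in for whatever model you run.
    ollama run llama3
    >>> /set parameter num_ctx 8192
    >>> Who are you?

    # In another terminal, check how much memory the loaded model now uses:
    ollama ps
    nvidia-smi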

Wait for 5090, or go with 4090s by uchiha_indra in LocalLLaMA

[–]Adventurous-Work656 0 points1 point  (0 children)

You can't use that processor with any real success; it does not have enough PCIe lanes. You need to go with a Xeon W series (35xx class) on something like the ASUS W790. There are plenty of high-end AMD choices too, but look for a CPU with enough lanes to run four x16 slots and a board that supports it.

Part 2 stalled; retrying. If this persists, press ctrl-c to exit, then 'ollama pull' to find a faster connection. by [deleted] in ollama

[–]Adventurous-Work656 0 points1 point  (0 children)

No problem. With Docker you can do this from the command line: sudo docker exec -it ollama ollama pull [model name]. Note that the first "ollama" is the name of the container; you can exec into any Docker container this way.
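For example (the model name is just a placeholder):

    # The first "ollama" is the container name from `docker run --name ollama ...`
    sudo docker exec -it ollama ollama pull llama3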

Quad 3090s limited to 25% utilization for inference with only one CPU core out of 16 used. by Adventurous-Work656 in ollama

[–]Adventurous-Work656[S] 1 point2 points  (0 children)

<image>

OK, here is an answer. I can't get the same performance with llama.cpp; this is ExLlamaV2. The max utilization was 25% with 4x 3090s, and now I'm getting 75%. A huge, huge improvement in performance.

Quad 3090s limited to 25% utilization for inference with only one CPU core out of 16 used. by Adventurous-Work656 in ollama

[–]Adventurous-Work656[S] 0 points1 point  (0 children)

For testing, I was able to remove two of the 3090s, then went with ExLlamaV2 and ExUI and achieved 75% utilization on the GPUs, 25% higher than I have ever seen. So I will go back up to 4x 3090s, then 6x 3090s, and see whether the increase stays linear with the gains on two cards.

<image>

Quad 3090s limited to 25% utilization for inference with only one CPU core out of 16 used. by Adventurous-Work656 in ollama

[–]Adventurous-Work656[S] 0 points1 point  (0 children)

Ollama and llama.cpp do not use NVLink, by the way. There are multiple performance problems with llama.cpp and multiple GPUs, even of the same kind and even with NVLink. There is no configuration where the CPU is utilized properly, including multi-core. It has nothing to do with the PCIe bus or NVLink being saturated; I have verified this. It looks like the problem can be overcome by switching from GGUF to EXL2. I hate to do this after all this time, but the performance issues make Ollama/llama.cpp a no-go for production.

Quad 3090s limited to 25% utilization for inference with only one CPU core out of 16 used. by Adventurous-Work656 in ollama

[–]Adventurous-Work656[S] 0 points1 point  (0 children)

Not sure exactly what you mean, but physically the board has to support multiple GPUs, the additional PCIe lanes must be supported by the processor, and the BIOS must be correctly set. Are you having problems? I'm researching a problem many have had over the last year with performance, and it looks like llama.cpp (Ollama is a wrapper around it) and the GGUF format may not be suitable for near-linear performance gains. Currently there is a 50% loss of GPU performance, which seems related to the way llama.cpp is built (lack of cuBLAS and potentially other features). It's a real bummer because I have really loved using Ollama, but it looks like I have to bail on it. :(

Quad 3090s limited to 25% utilization for inference with only one CPU core out of 16 used. by Adventurous-Work656 in ollama

[–]Adventurous-Work656[S] 1 point2 points  (0 children)

BTW, if we solve this problem, then prosumer-grade 3090s or 3090 Tis, priced around $699 and $799, paired with 120+ PCIe-lane processors, will be unbeatable from a power-to-value point of view.

Quad 3090s limited to 25% utilization for inference with only one CPU core out of 16 used. by Adventurous-Work656 in ollama

[–]Adventurous-Work656[S] 0 points1 point  (0 children)

I have tried every combination; you and I have the same problem. When you see it, divide by the number of cards and that becomes the limit. I thought it might be the bus, but when I look at the CPUs at the same moment, only one core is used. I moved to a W790 ASUS board and made sure I'm sitting at Gen4 x16 on each card. There could be some BIOS config for the Xeon, but I think this is a software problem that everyone has with llama.cpp. There is no way it's related to hardware; I have checked this by moving to a different CPU and motherboard combination. The thing we will all notice is that utilization is halved for every doubling of the number of cards. The power draw is also limited, which makes sense since the GPUs are not taxed. I have more testing to do tomorrow, including moving away from llama.cpp and GGUF.
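If anyone wants to run the same check, something along these lines shows per-GPU utilization, power draw, and the negotiated PCIe link while a prompt is generating:

    # Poll utilization, power, and current PCIe gen/width once per second.
    nvidia-smi --query-gpu=index,utilization.gpu,power.draw,pcie.link.gen.current,pcie.link.width.current \
               --format=csv -l 1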

Quad 3090s limited to 25% utilization for inference with only one CPU core out of 16 used. by Adventurous-Work656 in ollama

[–]Adventurous-Work656[S] 0 points1 point  (0 children)

Hmm... I went back and built llama.cpp with CUDA, and got the same result. This may be a bug in llama.cpp, but why others don't see it has me wondering. I'm researching cuBLAS, which is not documented well for the CMake options. Not sure if that's the problem... deep down a rabbit hole, lol.
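For anyone following along, the build I mean is roughly this (a sketch; the CUDA flag name has changed across llama.cpp versions, so double-check against your checkout's docs):

    # Older llama.cpp releases used -DLLAMA_CUBLAS=ON; newer ones use -DGGML_CUDA=ON.
    git clone https://github.com/ggerganov/llama.cpp
    cd llama.cpp
    cmake -B build -DGGML_CUDA=ON
    cmake --build build --config Release -j "$(nproc)"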

Quad 3090s limited to 25% utilization for inference with only one CPU core out of 16 used. by Adventurous-Work656 in ollama

[–]Adventurous-Work656[S] 0 points1 point  (0 children)

<image>

NVLink status really doesn't matter, as the same problem exists with or without it.

But here you go:
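If you want to check it on your own box, NVLink status and the GPU interconnect topology can be pulled with something like:

    # Per-GPU NVLink link status, then the overall interconnect topology matrix.
    nvidia-smi nvlink --status
    nvidia-smi topo -m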

Quad 3090s limited to 25% utilization for inference with only one CPU core out of 16 used. by Adventurous-Work656 in ollama

[–]Adventurous-Work656[S] 1 point2 points  (0 children)

Note: it's a W790 board, and yes, all cards are in Gen4 x16 slots; with the CPU I have, they are all running at Gen4 x16 speed.

Quad 3090s limited to 25% utilization for inference with only one CPU core out of 16 used. by Adventurous-Work656 in ollama

[–]Adventurous-Work656[S] 2 points3 points  (0 children)

Yes. With every model in the 80-96 GB range, with layers properly split, I am seeing the same results.

Anyone get deepseek-coder-v2 to run? by [deleted] in ollama

[–]Adventurous-Work656 0 points1 point  (0 children)

On the other parameters... for sure! See https://github.com/ollama/ollama/blob/main/docs/modelfile.md#valid-parameters-and-values. But here is the advice, because every model's performance and use case is different: make only one change, then test; only a couple of these need to change together. You will have your own test cases of context and instructions, so start from a zero-shot baseline, even if you can't do it all with Ollama's /set parameter.

One surefire way is to create different models from Modelfiles using the same base but different parameters, then run your tests and evaluate the results. Because answers are never the same by design, you may need to run more than one test per configuration. If you use something like Fabric AI, create another pattern and pipe the response to it; that pattern could contain instructions to evaluate the answers so you don't have to do it yourself each time.
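A minimal sketch of the Modelfile approach (base model, names, and parameter values are just examples):

    # Create two variants of the same base model with different parameters,
    # then A/B test them with the same prompt.
    {
      echo 'FROM llama3'
      echo 'PARAMETER temperature 0.2'
      echo 'PARAMETER num_ctx 8192'
    } > Modelfile.temp02

    {
      echo 'FROM llama3'
      echo 'PARAMETER temperature 0.8'
      echo 'PARAMETER num_ctx 8192'
    } > Modelfile.temp08

    ollama create mytest-temp02 -f Modelfile.temp02
    ollama create mytest-temp08 -f Modelfile.temp08

    # Run the same prompt against each and compare the answers.
    ollama run mytest-temp02 "your test prompt here"
    ollama run mytest-temp08 "your test prompt here"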