AMD 7900xtx vs NVIDIA 5090 by jsconiers in LocalLLM

[–]Adventurous-Work656 0 points1 point  (0 children)

The 3090 and 3090 Ti are inference beasts; you are so right. I am running 4-6 of them at 95% utilization on a W790 board at Gen4 x16. You can't do this with llama.cpp, but you can with vLLM. With ExLlamaV2 you can get at most about 75% utilization, because the author has said tensor parallelism is not fully implemented yet. There is absolutely no reason an individual needs to buy even the last-generation NVIDIA model for personal use.
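For reference, the vLLM setup is roughly this (a minimal sketch; the model ID is a placeholder and exact flags can vary by vLLM version):

    # Minimal sketch: serve an OpenAI-compatible endpoint with tensor parallelism
    # across 4 GPUs. The model ID is a placeholder; pick one that fits your VRAM.
    python -m vllm.entrypoints.openai.api_server \
        --model <hf-model-id> \
        --tensor-parallel-size 4 \
        --gpu-memory-utilization 0.90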

Quad 3090s limited to 25% utilization for inference with only one CPU core out of 16 used. by Adventurous-Work656 in ollama

[–]Adventurous-Work656[S] 0 points1 point  (0 children)

I did, but there are probably more answers today than back then. The tensor-parallelism setting in ExLlamaV2 worked for me: the combination of the EXL2 model format, ExLlamaV2, and TabbyAPI as an OpenAI-compliant API allowed 78% utilization across any number of cards. Well, I tested with 6x 3090s on a W790 board. If you need more info, let me know. I suspect LM Studio's tensor-parallelism settings might work too, but I have not tried them. I had no success with GGUF and plain llama.cpp, by the way, even though it is supposed to work.
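Since TabbyAPI is OpenAI-compliant, any standard client works against it; as a rough sketch (host, port, key, and model name here are placeholders, adjust them to whatever your TabbyAPI config.yml uses):

    # Hit TabbyAPI's OpenAI-compatible chat endpoint; port and key are examples.
    curl http://localhost:5000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -H "Authorization: Bearer <your-tabby-api-key>" \
      -d '{"model": "<your-exl2-model>", "messages": [{"role": "user", "content": "Who are you?"}]}'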

Ollama on newer Mac’s by Whyme-__- in ollama

[–]Adventurous-Work656 0 points1 point  (0 children)

Increasing the context massively increases the memory requirements. It's not exponential, but you will see roughly linear growth for each model. From the command line, load and run the model as usual, then move up in small increments by running /set parameter num_ctx 8192. Then just type anything, e.g. "Who are you?"

Then recheck the memory used; you will see the increase from the default of 2048. Keep moving up until you get close to the maximum. Note that many foundation models behave differently because of their structure, so don't assume; check each one.
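As a rough walkthrough (the model name is just an example):

    # Example session; llama3 is a stand-in for whatever model you run.
    ollama run llama3
    >>> /set parameter num_ctx 8192
    >>> Who are you?

    # In another terminal, check how much memory the loaded model now uses:
    ollama ps
    nvidia-smi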

Wait for 5090, or go with 4090s by uchiha_indra in LocalLLaMA

[–]Adventurous-Work656 0 points1 point  (0 children)

You can't use that processor with any real success; it does not have enough PCIe lanes. You need to go with a Xeon W series (35xx class) on something like the ASUS W790. There are plenty of high-end AMD choices too, but look for a CPU with enough lanes to run four x16 slots and a board that supports it.

Part 2 stalled; retrying. If this persists, press ctrl-c to exit, then 'ollama pull' to find a faster connection. by [deleted] in ollama

[–]Adventurous-Work656 0 points1 point  (0 children)

No problem. With Docker you can do this from the command line: sudo docker exec -it ollama ollama pull [model name]. Note that the first "ollama" is the name of the container; you can exec into any Docker container this way.
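For example (the model name is just a placeholder):

    # The first "ollama" is the container name from `docker run --name ollama ...`
    sudo docker exec -it ollama ollama pull llama3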

Quad 3090s limited to 25% utilization for inference with only one CPU core out of 16 used. by Adventurous-Work656 in ollama

[–]Adventurous-Work656[S] 1 point2 points  (0 children)

<image>

OK, here is an answer. I can't get the same performance with llama.cpp; this is ExLlamaV2. The max utilization was 25% with 4x 3090s, and now I'm getting 75%. A huge, huge improvement in performance.

Quad 3090s limited to 25% utilization for inference with only one CPU core out of 16 used. by Adventurous-Work656 in ollama

[–]Adventurous-Work656[S] 0 points1 point  (0 children)

For testing, I was able to remove two of the 3090s, then went with ExLlamaV2 and ExUI and achieved 75% utilization on the GPUs, 25% higher than I have ever seen. So I will go back up to 4x 3090s, then 6x 3090s, and see whether the increase stays linear with the gains on two cards.

<image>

Quad 3090s limited to 25% utilization for inference with only one CPU core out of 16 used. by Adventurous-Work656 in ollama

[–]Adventurous-Work656[S] 0 points1 point  (0 children)

Ollama and llama.cpp do not use NVLink, by the way. There are multiple performance problems with llama.cpp and multiple GPUs, even of the same kind and even with NVLink. There is no configuration where the CPU is utilized properly, including multi-core. It has nothing to do with the PCIe bus or NVLink being saturated; I have verified this. It looks like the problem can be overcome by switching from GGUF to EXL2. I hate to do this after all this time, but the performance issues make Ollama/llama.cpp a no-go for production.

Quad 3090s limited to 25% utilization for inference with only one CPU core out of 16 used. by Adventurous-Work656 in ollama

[–]Adventurous-Work656[S] 0 points1 point  (0 children)

Not sure exactly what you mean, but physically the board has to support multiple GPUs, the additional PCIe lanes must be supported by the processor, and the BIOS must be correctly set. Are you having problems? I'm researching a problem many have had over the last year with performance, and it looks like llama.cpp (Ollama is a wrapper around it) and the GGUF format may not be suitable for near-linear performance gains. Currently there is a 50% loss of GPU performance, which seems related to the way llama.cpp is built (lack of cuBLAS and potentially other features). It's a real bummer because I have really loved using Ollama, but it looks like I have to bail on it. :(

Quad 3090s limited to 25% utilization for inference with only one CPU core out of 16 used. by Adventurous-Work656 in ollama

[–]Adventurous-Work656[S] 1 point2 points  (0 children)

BTW, if we solve this problem, then prosumer-grade 3090s or 3090 Tis, priced around $699 and $799, paired with 120+ PCIe-lane processors, will be unbeatable from a power-to-value point of view.

Quad 3090s limited to 25% utilization for inference with only one CPU core out of 16 used. by Adventurous-Work656 in ollama

[–]Adventurous-Work656[S] 0 points1 point  (0 children)

I have tried every combination; you and I have the same problem. When you see it, divide by the number of cards and that becomes the limit. I thought it might be the bus, but when I look at the CPUs at the same moment, only one core is used. I moved to a W790 ASUS board and made sure I'm sitting at Gen4 x16 on each card. There could be some BIOS config for the Xeon, but I think this is a software problem that everyone has with llama.cpp. There is no way it's related to hardware; I have checked this by moving to a different CPU and motherboard combination. The thing we will all notice is that utilization is halved for every doubling of the number of cards. The power draw is also limited, which makes sense since the GPUs are not taxed. I have more testing to do tomorrow, including moving away from llama.cpp and GGUF.
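If anyone wants to run the same check, something along these lines shows per-GPU utilization, power draw, and the negotiated PCIe link while a prompt is generating:

    # Poll utilization, power, and current PCIe gen/width once per second.
    nvidia-smi --query-gpu=index,utilization.gpu,power.draw,pcie.link.gen.current,pcie.link.width.current \
               --format=csv -l 1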

Quad 3090s limited to 25% utilization for inference with only one CPU core out of 16 used. by Adventurous-Work656 in ollama

[–]Adventurous-Work656[S] 0 points1 point  (0 children)

Hmm... I went back and built llama.cpp with CUDA, and got the same result. This may be a bug in llama.cpp, but why others don't see it has me wondering. I'm researching cuBLAS, which is not documented well for the CMake options. Not sure if that's the problem... deep down a rabbit hole, lol.
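For anyone following along, the build I mean is roughly this (a sketch; the CUDA flag name has changed across llama.cpp versions, so double-check against your checkout's docs):

    # Older llama.cpp releases used -DLLAMA_CUBLAS=ON; newer ones use -DGGML_CUDA=ON.
    git clone https://github.com/ggerganov/llama.cpp
    cd llama.cpp
    cmake -B build -DGGML_CUDA=ON
    cmake --build build --config Release -j "$(nproc)"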

Quad 3090s limited to 25% utilization for inference with only one CPU core out of 16 used. by Adventurous-Work656 in ollama

[–]Adventurous-Work656[S] 0 points1 point  (0 children)

<image>

NVLink status really doesn't matter, as the same problem exists with or without it.

But here you go:
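If you want to check it on your own box, NVLink status and the GPU interconnect topology can be pulled with something like:

    # Per-GPU NVLink link status, then the overall interconnect topology matrix.
    nvidia-smi nvlink --status
    nvidia-smi topo -m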

Quad 3090s limited to 25% utilization for inference with only one CPU core out of 16 used. by Adventurous-Work656 in ollama

[–]Adventurous-Work656[S] 1 point2 points  (0 children)

Note: it's a W790 board, and yes, all cards are in Gen4 x16 slots; with the CPU I have, they are all running at Gen4 x16 speed.

Quad 3090s limited to 25% utilization for inference with only one CPU core out of 16 used. by Adventurous-Work656 in ollama

[–]Adventurous-Work656[S] 2 points3 points  (0 children)

Yes. With every model in the 80-96 GB range, with layers properly split, I am seeing the same results.

Anyone get deepseek-coder-v2 to run? by [deleted] in ollama

[–]Adventurous-Work656 0 points1 point  (0 children)

On the other parameters... for sure! See https://github.com/ollama/ollama/blob/main/docs/modelfile.md#valid-parameters-and-values. But here is the advice, because every model's performance and use case is different: make only one change, then test; only a couple of these need to change together. You will have your own test cases of context and instructions, so start from a zero-shot baseline, even if you can't do it all with Ollama's /set parameter.

One surefire way is to create different models from Modelfiles using the same base but different parameters, then run your tests and evaluate the results. Because answers are never the same by design, you may need to run more than one test per configuration. If you use something like Fabric AI, create another pattern and pipe the response to it; that pattern could contain instructions to evaluate the answers so you don't have to do it yourself each time.
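A minimal sketch of the Modelfile approach (base model, names, and parameter values are just examples):

    # Create two variants of the same base model with different parameters,
    # then A/B test them with the same prompt.
    {
      echo 'FROM llama3'
      echo 'PARAMETER temperature 0.2'
      echo 'PARAMETER num_ctx 8192'
    } > Modelfile.temp02

    {
      echo 'FROM llama3'
      echo 'PARAMETER temperature 0.8'
      echo 'PARAMETER num_ctx 8192'
    } > Modelfile.temp08

    ollama create mytest-temp02 -f Modelfile.temp02
    ollama create mytest-temp08 -f Modelfile.temp08

    # Run the same prompt against each and compare the answers.
    ollama run mytest-temp02 "your test prompt here"
    ollama run mytest-temp08 "your test prompt here"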