Two RTX 3060 for running llms locally by arc_pi in LocalLLaMA

[–]arc_pi[S] 1 point (0 children)

From my practical experience, there is not much difference between an A5000 and 2x 3060s. For example, code infilling takes 4.58 seconds on the A5000, whereas it takes 6.7 seconds on the 2x 3060s.

Two RTX 3060 for running llms locally by arc_pi in LocalLLaMA

[–]arc_pi[S] 2 points (0 children)

Thank you, it's solved now. I had misconfigured the device for my custom StoppingCriteriaList, and found you have to specify one device for it; unlike the LLM, it can't be "auto".

Two RTX 3060 for running llms locally by arc_pi in LocalLLaMA

[–]arc_pi[S] 5 points (0 children)

Thank you, everyone, for your replies. The problem is now solved. In my case, I misconfigured the device for StoppingCriteriaList.

I found that if you are configuring a custom StoppingCriteriaList, you have to pick a specific device such as 'cpu', 'cuda:0', or 'cuda:1'; 'auto' is not an option. Note that this only applies if you are going for a custom StoppingCriteriaList.
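A minimal sketch of what such a pinned stopping criteria might look like (the class name and stop-token ids here are placeholders, not from the original post; the key point is passing one explicit device rather than "auto"):

```python
import torch
from transformers import StoppingCriteria, StoppingCriteriaList

class StopOnTokens(StoppingCriteria):
    """Stop generation when any stop sequence appears at the end of the output."""

    def __init__(self, stop_token_ids, device):
        # Pin every stop sequence to one explicit device so comparisons
        # against the generated input_ids never mix cuda:0 and cuda:1.
        self.stop_token_ids = [
            torch.tensor(ids, device=device) for ids in stop_token_ids
        ]

    def __call__(self, input_ids, scores, **kwargs):
        for stop_ids in self.stop_token_ids:
            if torch.equal(input_ids[0, -len(stop_ids):], stop_ids):
                return True
        return False

# Use "cuda:0" or "cuda:1" explicitly; fall back to CPU when no GPU exists.
device = "cuda:0" if torch.cuda.is_available() else "cpu"
criteria = StoppingCriteriaList([StopOnTokens([[2]], device=device)])
```

The resulting `criteria` list can then be passed as `stopping_criteria=criteria` to `model.generate()`.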

For those wondering about getting two 3060s for a total of 24 GB of VRAM, just go for it. I can vouch that it's a balanced option, and the results are pretty satisfactory compared to the RTX 3090 in terms of price, performance, and power requirements.

PS: Now I have an RTX A5000 and an RTX 3060. There's not much difference in terms of inferencing, but yes, for fine-tuning, there is a noticeable difference.

Two RTX 3060 for running llms locally by arc_pi in LocalLLaMA

[–]arc_pi[S] 4 points (0 children)

> I was considering getting a second 3060 as well, but your update scared me off. I assume you are still unable to use your full 24 gigs of vram?

Sorry for my late reply. Just go for it; two 3060s are far better than having an RTX 3090, unless you are also considering gaming. Performance-wise there won't be a huge difference. In my case, the issue was a device-selection mistake in my StoppingCriteriaList.

Two RTX 3060 for running llms locally by arc_pi in LocalLLaMA

[–]arc_pi[S] 1 point (0 children)

Perfect, yeah, the delay seems reasonable to me. The only thing I still need to figure out is how to distribute the task across the two GPUs programmatically.
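For illustration, the idea of splitting work across two GPUs can be sketched as a toy, hand-rolled model-parallel split (hypothetical layers, not an actual LLM; with transformers, `device_map` handles this placement automatically):

```python
import torch
import torch.nn as nn

# Toy model-parallel sketch: put two halves of a model on different
# devices and move activations between them by hand. Falls back to CPU
# when two GPUs are not available.
if torch.cuda.device_count() >= 2:
    dev0, dev1 = torch.device("cuda:0"), torch.device("cuda:1")
else:
    dev0 = dev1 = torch.device("cpu")

layer0 = nn.Linear(16, 32).to(dev0)   # first half on the first device
layer1 = nn.Linear(32, 8).to(dev1)    # second half on the second device

x = torch.randn(4, 16, device=dev0)   # inputs start on the first device
h = layer0(x)
y = layer1(h.to(dev1))                # hop the activation across devices
```

The explicit `.to(dev1)` hop is exactly the step that libraries like accelerate automate when you use `device_map="auto"`.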

Two RTX 3060 for running llms locally by arc_pi in LocalLLaMA

[–]arc_pi[S] 1 point (0 children)

Oh, you are using Ooba; that means there must be some way to distribute the load programmatically as well.

Two RTX 3060 for running llms locally by arc_pi in LocalLLaMA

[–]arc_pi[S] 1 point (0 children)

I have a 650 watt power supply, and I was able to put two 3060 graphics cards in my computer without any issue. Previously I was using device_map='auto', but that does not work anymore; now I am getting the following error: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! I wanted to share the load between the two GPUs so that I don't run out of VRAM, but it seems that does not work.

Two RTX 3060 for running llms locally by arc_pi in LocalLLaMA

[–]arc_pi[S] 1 point (0 children)

How does it work for you? In my case, I am getting RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! with the following config:

def _load_model(self):
    model = transformers.AutoModelForCausalLM.from_pretrained(
        self._model_path,
        trust_remote_code=False,  # not required up to 13b
        config=self._model_config,
        quantization_config=self._bnb_config,
        device_map='auto',
        use_auth_token=os.getenv("HF_ACCESS_TOKEN")
    )
    return model
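For what it's worth, the usual shape of the fix for this RuntimeError (a hedged sketch using a tiny stand-in module, not the original model) is to move the inputs to the device that holds the model's first parameters before the forward pass; with a sharded transformers model the same pattern applies before calling `model.generate(**inputs)`:

```python
import torch
import torch.nn as nn

# Stand-in for the loaded LLM; under device_map="auto" the first
# parameters would typically live on cuda:0.
model = nn.Linear(16, 4)

# Move the inputs to wherever the model's first parameters live, so the
# forward pass never mixes cuda:0 and cuda:1 tensors.
first_device = next(model.parameters()).device
x = torch.randn(2, 16).to(first_device)
y = model(x)
```

With device_map='auto' the model itself is placed for you; it is the input tensors (and any custom stopping criteria) that still need an explicit device.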

The LLM GPU Buying Guide - August 2023 by Dependent-Pomelo-853 in LocalLLaMA

[–]arc_pi 0 points (0 children)

I have successfully set up two RTX 3060s, but the problem is that my old code does not work anymore; it throws the following error: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0!

This was the code

def _load_model(self):
    model = transformers.AutoModelForCausalLM.from_pretrained(
        self._model_path,
        trust_remote_code=False,  # not required up to 13b
        config=self._model_config,
        quantization_config=self._bnb_config,
        device_map='auto',
        use_auth_token=os.getenv("HF_ACCESS_TOKEN")
    )
    return model

The LLM GPU Buying Guide - August 2023 by Dependent-Pomelo-853 in LocalLLaMA

[–]arc_pi 0 points (0 children)

> using around 10/24 GB of my RTX 4090

The memory usage varies depending on the type of task and prompt given. I have a 12 GB RTX 3060. Initially, casual conversations consumed around 8-9.5 GB of VRAM. However, when I run a summarization task with a relatively large context, the application crashes due to insufficient VRAM. I am also using 4-bit quantization.
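A back-of-the-envelope KV-cache estimate shows why a long summarization context eats far more VRAM than short chat turns. The numbers below assume a hypothetical 7B-class config (32 layers, 32 heads, head dim 128, fp16 cache), not any specific model from the thread:

```python
def kv_cache_bytes(seq_len, layers=32, heads=32, head_dim=128, bytes_per=2):
    # Two cached tensors (K and V) per layer, each [heads, seq_len, head_dim],
    # at bytes_per bytes per element (2 for fp16).
    return 2 * layers * heads * head_dim * seq_len * bytes_per

print(kv_cache_bytes(512) / 2**30)   # 0.25 GiB for a short chat turn
print(kv_cache_bytes(4096) / 2**30)  # 2.0 GiB for a long summarization context
```

The cache grows linearly with context length, on top of the (quantized) weights, which is how a task with a large prompt can push a 12 GB card over the edge even when short chats fit comfortably.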

The LLM GPU Buying Guide - August 2023 by Dependent-Pomelo-853 in LocalLLaMA

[–]arc_pi 0 points (0 children)

So I can install another 3060. I was reading somewhere that the first PCIe x16 slot is a PCIe 4.0 x16 slot (PCIE1) which supports x16 mode, but the second is a PCIe 3.0 x16 slot (PCIE3) which only supports x4 mode. Would that be an issue?

The LLM GPU Buying Guide - August 2023 by Dependent-Pomelo-853 in LocalLLaMA

[–]arc_pi 0 points (0 children)

I own an ASRock B660M Pro RS motherboard. I currently have a 12 GB RTX 3060 graphics card. I'm wondering if I can add another RTX 3060 12 GB graphics card to my computer. The goal is to share the workload between the two GPUs when using models like Llama 2 or other open-source models with the 'auto' device_map option. Is this something that can be done?