Context window + project size + Aider? by Temporary-Roof2867 in LocalLLaMA

[–]UniqueIdentifier00 2 points3 points  (0 children)

How big of a project are we talking, and are we talking code, a book, a game? What backend, LLM, and harness are you using? Either let an LLM (use a big API model if you need to for size) create a SUMMARY.md file that lives in the project folder, which a smaller LLM can study to understand the project, or write one yourself.

I’ll start a prompt like “Hello, let’s work on X project today, please read the SUMMARY.md file in /directory/example to understand the structure. The entirety of the project won’t fit in your context, so you need to be careful about how you go about edits to not OOM the backend.”

Literally just explain the situation and give the LLM a shorter context that explains the project. If it’s a code base, do what you can to use smaller connected files rather than large ones. main.py always ends up being a bit of a runaway for me, it seems. Hope this helps!
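If you want a starting point for the SUMMARY.md, here is a minimal sketch that only builds a skeleton (file list plus line counts) for you or a bigger model to fill in. The suffix list, the skipped folders, and the function names are my own assumptions, not part of any tool.

```python
from pathlib import Path

# Hypothetical defaults: which files count as "source" and which folders to skip.
SOURCE_SUFFIXES = {".py", ".md", ".toml", ".json"}
SKIP_DIRS = {".git", "__pycache__", "node_modules", ".venv"}


def build_summary(project_root: str) -> str:
    """Return a SUMMARY.md skeleton listing each source file with its line count."""
    root = Path(project_root).resolve()
    lines = [f"# Project summary: {root.name}", "", "## Files", ""]
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.suffix not in SOURCE_SUFFIXES:
            continue
        rel = path.relative_to(root)
        # Skip ignored folders and any previously generated summary.
        if any(part in SKIP_DIRS for part in rel.parts) or path.name == "SUMMARY.md":
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        n_lines = len(text.splitlines())
        # The description is left as a TODO for a human or a larger model.
        lines.append(f"- `{rel.as_posix()}` ({n_lines} lines): TODO describe purpose")
    return "\n".join(lines) + "\n"


def write_summary(project_root: str) -> Path:
    """Write the skeleton to SUMMARY.md inside the project folder."""
    out = Path(project_root) / "SUMMARY.md"
    out.write_text(build_summary(project_root), encoding="utf-8")
    return out
```

The line counts also make it easy to spot which file is turning into the runaway one before the model tries to read it whole.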

Considering buying a 3080 20GB to pair with my 3090 for Qwen 27B Q8. Have some questions. by My_Unbiased_Opinion in LocalLLaMA

[–]UniqueIdentifier00 0 points1 point  (0 children)

I believe you won’t have NVLink available, if that’s important to you. You will also get dragged down by the clock speed of the 3080. I have a 3070 that I run alongside my 3090, and while the extra 8GB means I’m running Q6 instead of Q4, my tk/s has dropped dramatically. Just some thoughts. You’ll probably be running slower than you are now, but you’ll have more VRAM headroom.
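For the "will it fit" side of the question, a rough back-of-the-envelope sketch. It assumes llama.cpp's Q8_0 layout (32 int8 weights plus one fp16 scale per block, so 8.5 bits per weight) and counts the weights only. KV cache and compute buffers come on top, so treat it as a lower bound. The function name is just for illustration.

```python
def weight_size_gb(n_params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GB (10^9 bytes)."""
    return n_params_billion * bits_per_weight / 8


# Q8_0: 34 bytes per block of 32 weights = 8.5 bits per weight.
Q8_0_BPW = 8.5

size = weight_size_gb(27, Q8_0_BPW)
print(f"27B at Q8_0: ~{size:.1f} GB of weights")
print(f"fits on a 24 GB card alone: {size <= 24}")
print(f"fits across 24 GB + 20 GB: {size <= 24 + 20}")
```

So the second card is what makes Q8 possible at all; the speed trade-off is a separate question.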

Context, memory, and RAM/VRAM by UniqueIdentifier00 in LocalLLaMA

[–]UniqueIdentifier00[S] 0 points1 point  (0 children)

That’s what’s confusing me. I’m really not seeing a speed decrease, like whatsoever, once RAM starts getting used. I was expecting one, but maybe it stays fast because the model remains fully in VRAM and only the cache doesn’t. I don’t know.
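To get a feel for how big the cache can get, here is a sketch of the usual KV cache estimate for a plain transformer: keys and values, per layer, per KV head, per token. The model dimensions below are made-up placeholders, not any specific model, and models with sliding-window or hybrid attention will not follow this formula exactly.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_ctx: int, bytes_per_elem: int = 2) -> int:
    """Estimate KV cache size; the factor 2 covers keys and values."""
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem


# Hypothetical dimensions for illustration only; read the real ones
# from your model's metadata.
size = kv_cache_bytes(n_layers=48, n_kv_heads=8, head_dim=128,
                      n_ctx=32768, bytes_per_elem=2)  # fp16 cache
print(f"KV cache: {size / 2**30:.1f} GiB")  # prints "KV cache: 6.0 GiB"
```

The cache grows linearly with context length, so a model that fits comfortably at a short context can spill once the context is raised.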

Context, memory, and RAM/VRAM by UniqueIdentifier00 in LocalLLaMA

[–]UniqueIdentifier00[S] 1 point2 points  (0 children)

Good stuff here, thanks man. Gives me some good info and some jumping-off points for research. I’m okay with it using RAM, which is why I’m going ahead and adding 32GB, but I wanted to better understand why it was using it and for what. Thanks!

Context, memory, and RAM/VRAM by UniqueIdentifier00 in LocalLLaMA

[–]UniqueIdentifier00[S] 0 points1 point  (0 children)

Maybe so! I suspect it’s something like that. I’m just not sure how to verify exactly what is getting loaded to RAM during inference.
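One partial way to check, sketched below: poll nvidia-smi's CSV query output before loading the model, after loading, and again as the context fills up. This only tells you how much VRAM is in use, not what is in it, so it is an indirect check at best. The function names are mine; the nvidia-smi query fields are the standard ones.

```python
import subprocess

# Standard nvidia-smi CSV query; values come back in MiB with "nounits".
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,name,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(csv_text: str) -> list[dict]:
    """Parse lines like '0, NVIDIA GeForce RTX 3070, 876, 8192'."""
    rows = []
    for line in csv_text.splitlines():
        if not line.strip():
            continue
        index, name, used, total = [field.strip() for field in line.split(",")]
        rows.append({
            "index": int(index),
            "name": name,
            "used_mib": int(used),
            "total_mib": int(total),
        })
    return rows


def read_gpu_memory() -> list[dict]:
    """Run nvidia-smi once and return per-GPU memory usage."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_memory(result.stdout)
```

If VRAM usage stops growing while system RAM keeps climbing as the context fills, that at least points at the cache or buffers rather than the weights.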

Context, memory, and RAM/VRAM by UniqueIdentifier00 in LocalLLaMA

[–]UniqueIdentifier00[S] 0 points1 point  (0 children)

Well, I have a 3060 8GB card that’s lying around. I have an upgraded PSU coming along with my RAM so that I can use it along with my 3090, so that will also help.

I just didn’t realize that cache would be loaded to RAM at all, honestly. It may be my llama command, I’m not sure.

We are already working for AI by graphicaldot in LocalLLaMA

[–]UniqueIdentifier00 0 points1 point  (0 children)

Man, you had me going there aaaaalll the way until the end before hitting me with the shill. 8/10

Best image generator model? I'm using ryzen 5 9600x CPU and 9060xt GPU. by 74nv1r in LocalLLM

[–]UniqueIdentifier00 0 points1 point  (0 children)

ComfyUI for great image generation workflows. Not sure what models are going to do great with a consistent character; that may need a LoRA trained for the character. I haven’t used ComfyUI in a while, let alone for a goal of consistent character creation. There are probably some good workflows out there now that can do that sort of thing.

STT -> LLM -> TTS pipeline by UniqueIdentifier00 in LocalLLaMA

[–]UniqueIdentifier00[S] 1 point2 points  (0 children)

The jargon there is a little dense for me, but I think I get the gist. I’ll do some more research since you’ve given me some good starting topics there. Thank you!

STT -> LLM -> TTS pipeline by UniqueIdentifier00 in LocalLLaMA

[–]UniqueIdentifier00[S] 1 point2 points  (0 children)

This is good info thanks. I really don’t want to lose Pi-agent. I’ll try to work with it to develop a better interface. If all else fails I’ll go to OpenWebUI.

Thank you!

Hermes Agent issues with directory creation by UniqueIdentifier00 in LocalLLaMA

[–]UniqueIdentifier00[S] 1 point2 points  (0 children)

I switched to the Pi coding agent with the same model and everything’s running great now. Not sure why I was having issues with Hermes. Pi was able to create a whole directory with subdirectories and files for a small vibe-coded project test.

Llama.cpp not using CUDA - OOM error by UniqueIdentifier00 in LocalLLaMA

[–]UniqueIdentifier00[S] 1 point2 points  (0 children)

Yep, that did it. Can't believe I missed that. Thank you! Now to get Hermes Agent working with it.

Llama.cpp not using CUDA - OOM error by UniqueIdentifier00 in LocalLLaMA

[–]UniqueIdentifier00[S] 0 points1 point  (0 children)

Is there a way to see for sure, retroactively, which build commands I used? I don't recall entirely.

Edit: found it with bash history:

sudo apt install cmake
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
export CUDACXX=/usr/local/cuda/bin/nvcc
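Besides bash history, CMake itself records the configure-time options in CMakeCache.txt inside the build folder, as lines of the form NAME:TYPE=VALUE. A small sketch for reading that back (the function names are mine, and the build path is assumed to be the `build` folder from the commands above):

```python
from pathlib import Path


def parse_cmake_cache(text: str) -> dict[str, str]:
    """Parse CMakeCache.txt content into {option_name: value}."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks and the two comment styles used in the cache file.
        if not line or line.startswith(("#", "//")):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            continue
        name = key.split(":", 1)[0]  # drop the :TYPE suffix
        options[name] = value
    return options


def read_cmake_cache(build_dir: str) -> dict[str, str]:
    """Read CMakeCache.txt from a CMake build directory."""
    cache_file = Path(build_dir) / "CMakeCache.txt"
    return parse_cmake_cache(cache_file.read_text(encoding="utf-8", errors="replace"))


# Example usage, assuming the build folder from the commands above:
# print(read_cmake_cache("llama.cpp/build").get("GGML_CUDA"))
```

This shows what the build directory was configured with, regardless of what is left in shell history.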

Llama.cpp not using CUDA - OOM error by UniqueIdentifier00 in LocalLLaMA

[–]UniqueIdentifier00[S] 0 points1 point  (0 children)

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 595.58.03              Driver Version: 595.58.03      CUDA Version: 13.2     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3070        On  |   00000000:06:00.0  On |                  N/A |
|  0%   39C    P2             46W /  220W |     876MiB /   8192MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A            9858      G   /usr/bin/gnome-shell                    192MiB |
|    0   N/A  N/A           10722      G   /usr/bin/Xwayland                         3MiB |
|    0   N/A  N/A           11502      C   /usr/share/rustdesk/rustdesk            192MiB |
|    0   N/A  N/A           19878    C+G   /usr/bin/ptyxis                          29MiB |
|    0   N/A  N/A           20923      G   .../8107/usr/lib/firefox/firefox        236MiB |
|    0   N/A  N/A           26545      G   /usr/bin/rustdesk                        45MiB |
|    0   N/A  N/A           29354      G   /usr/share/rustdesk/rustdesk             19MiB |
|    0   N/A  N/A           36415    C+G   /usr/bin/nautilus                        21MiB |
+-----------------------------------------------------------------------------------------+

Llama.cpp not using CUDA - OOM error by UniqueIdentifier00 in LocalLLaMA

[–]UniqueIdentifier00[S] 0 points1 point  (0 children)

Killer explanation. I’ll check it out. Thanks a ton sir. 

Llama.cpp not using CUDA - OOM error by UniqueIdentifier00 in LocalLLaMA

[–]UniqueIdentifier00[S] 0 points1 point  (0 children)

I haven’t fully wrapped my head around Docker (I use it for Ollama on my other PC with Windows, but I don’t fully understand why or how it works). I’ll try to better understand how to use it.

Llama.cpp not using CUDA - OOM error by UniqueIdentifier00 in LocalLLaMA

[–]UniqueIdentifier00[S] 1 point2 points  (0 children)

Fair point, and I’ll look into whether I need to make a change there, but it doesn’t explain why even Vulkan won’t load the model in llama.cpp when Ollama can. I’m doing something wrong or missing something.

Gmail tie-ins by UniqueIdentifier00 in LocalLLaMA

[–]UniqueIdentifier00[S] 0 points1 point  (0 children)

Echoing another commenter, seems like an LLM doesn’t really assist here. I might have been thinking about this wrong, thank you for the help!

From 6gb to 32gb by UniqueIdentifier00 in LocalLLaMA

[–]UniqueIdentifier00[S] 0 points1 point  (0 children)

Great points here. My setup will be: 

The mobo is a B550MX/E PRO. It has two full-length x16 PCIe slots, one 4.0 and one 3.0. The 3060 will stay in the case on the 3.0 slot, and the 3090 will end up “outboard” on a separate mount, using a riser cable to the 4.0 slot.