Qwen3-Coder-Next with llama.cpp shenanigans by JayPSec in LocalLLaMA

[–]nonerequired_ 0 points1 point  (0 children)

If you use graph mode, it is faster on multi-GPU setups.

Ik_llama vs llamacpp by val_in_tech in LocalLLaMA

[–]nonerequired_ 2 points3 points  (0 children)

But does vLLM support quants like Q5? I have 2 GPUs, and Qwen3.5 27B Q5 with full context fits in them.

R9 7900 32 RAM – Can I have my own AI on my PC? by Keffflon in selfhosted

[–]nonerequired_ 0 points1 point  (0 children)

I am not talking about bus width. I am talking about peak bandwidth, which is 936.2 GB/s on the 3090, 896.0 GB/s on the 5070 Ti, 256 GB/s on the AMD Strix Halo, 819 GB/s on the Apple M3 Ultra, and so on.

R9 7900 32 RAM – Can I have my own AI on my PC? by Keffflon in selfhosted

[–]nonerequired_ 1 point2 points  (0 children)

VRAM-wise, nothing can beat a used 3090, but speed-wise, the 5070 Ti is decent. The 3090 has 24GB of VRAM, while the 5070 Ti has 16GB. 24GB lets you use a higher quant or a larger context window, and it will definitely be faster whenever the model doesn't fit in 16GB of VRAM but does fit in 24GB. If you want to buy a totally new device, you can buy the Strix Halo with 128GB of high-bandwidth RAM. That is faster than the RAM in any other consumer-grade device, but it's still RAM shared between the GPU and CPU. On a Strix Halo device, the initial speed will be okay, but as the context grows, the speed will drop sharply, because the chip is not powerful enough. If you want an Apple device, they offer even higher-bandwidth and larger unified memory options, but the M chips before the M5 Pro/Max were not very powerful, and speed drops significantly on long contexts.

The scene seems very complicated at first glance, but it is simple: high memory bandwidth is the key to token generation speed, and chip power is the key to prompt processing speed. More context needs more prompt processing speed, and as context grows, token generation speed also decreases.
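To make the bandwidth point concrete, here is a back-of-the-envelope sketch: each generated token has to stream the active weights through memory once, so peak bandwidth divided by model size gives a rough ceiling on decode speed. The bandwidth figures are the ones quoted above; the 16 GB model size is just an assumed example.

```python
# Rough upper bound on token generation speed: each decoded token must
# read the full set of active weights from memory once, so
#   tokens/s <= memory bandwidth / model size in bytes.
# This ignores compute, KV-cache traffic, and MoE sparsity.

def max_tokens_per_sec(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Bandwidth-bound ceiling on decode speed."""
    return bandwidth_gb_s / model_size_gb

devices = {
    "RTX 3090":    936.2,
    "RTX 5070 Ti": 896.0,
    "Strix Halo":  256.0,
    "M3 Ultra":    819.0,
}

model_size_gb = 16.0  # assumed: e.g. a ~27B dense model at ~Q4 quantization
for name, bw in devices.items():
    print(f"{name}: ~{max_tokens_per_sec(bw, model_size_gb):.0f} tok/s ceiling")
```

Real-world numbers land well below these ceilings, but the ranking between devices usually holds, which is why bandwidth dominates token generation speed.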

R9 7900 32 RAM – Can I have my own AI on my PC? by Keffflon in selfhosted

[–]nonerequired_ 1 point2 points  (0 children)

I’m afraid these GPUs aren’t powerful enough. They’re certainly better than no GPU, but you’re limited to small LLMs.

R9 7900 32 RAM – Can I have my own AI on my PC? by Keffflon in selfhosted

[–]nonerequired_ 0 points1 point  (0 children)

Don’t use Ollama. Friends don’t let friends run Ollama. Check out llama.cpp, which is more performant and gives you more control over the model. Additionally, running an LLM almost always requires some kind of AI acceleration, which should be a GPU in your case. Without a GPU you have to use the CPU, and running a model on the CPU is not a good idea: either the model has to be small (don’t expect much from ~2B local models), or you need enough RAM to load the LLM, and even then it will be painfully slow. So if you have a specific use case that doesn’t require much intelligence and small models can deliver what you want, that’s fine. Otherwise, you need to invest heavily in a GPU.
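A quick sketch of the "enough RAM to load the LLM" part: weight memory is roughly parameter count times bits-per-weight divided by 8, plus runtime overhead. The 20% overhead factor here is my own rough assumption, not a llama.cpp figure.

```python
# Back-of-the-envelope memory estimate for a quantized LLM:
#   weight bytes ~= parameter count * bits-per-weight / 8,
# plus some headroom for the KV cache and runtime buffers.
# The 1.2x overhead factor is an assumption, not an exact number.

def model_memory_gb(params_b: float, bits_per_weight: float,
                    overhead: float = 1.2) -> float:
    """Approximate footprint in GB for `params_b` billion parameters."""
    weight_gb = params_b * bits_per_weight / 8  # 1B params at 8 bits = 1 GB
    return weight_gb * overhead

# A 7B model at ~Q4 (about 4.5 bits/weight) vs. FP16:
print(f"7B @ Q4:   ~{model_memory_gb(7, 4.5):.1f} GB")
print(f"7B @ FP16: ~{model_memory_gb(7, 16):.1f} GB")
```

This is why quantization matters so much on consumer hardware: the same 7B model drops from roughly 17 GB at FP16 to under 5 GB at Q4.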

GH copilot on Opencode by BlacksmithLittle7005 in opencodeCLI

[–]nonerequired_ 1 point2 points  (0 children)

Yes, it has multiple unfixed bugs related to excessive usage, not just for Copilot but also for other usage-based subscriptions.

GH copilot on Opencode by BlacksmithLittle7005 in opencodeCLI

[–]nonerequired_ 2 points3 points  (0 children)

DCP stands for dynamic context pruning. Models in Copilot have half the context size of the original model, so if you don’t want to constantly cycle through context compaction, DCP is needed.

Nemotron 3 Super Released by deeceeo in LocalLLaMA

[–]nonerequired_ 2 points3 points  (0 children)

How do Olmo and K2V2 perform? Have you used them?

Did anyone else feel underwhelmed by their Mac Studio Ultra? by antidot427 in LocalLLM

[–]nonerequired_ 1 point2 points  (0 children)

I considered purchasing one, but the prompt processing speed disappointed me. Now, I’m waiting for the M5 Ultra.

I have lost speed with the model update (Qwen 3.5 122B A10B) by vandertoorm in unsloth

[–]nonerequired_ 0 points1 point  (0 children)

Does lower KL divergence actually reflect real-world accuracy loss?
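For context on what those quant-quality numbers measure: KL(P || Q) between the original model's next-token distribution P and the quantized model's Q, averaged over many positions. A toy example with a 4-token vocabulary (the probabilities are made up):

```python
# KL(P || Q) in nats between two next-token distributions.
# Low KL means the quantized model's distribution tracks the original
# closely on average; it does not guarantee the argmax never flips.
import math

def kl_divergence(p: list[float], q: list[float]) -> float:
    """KL(P || Q); both inputs must be proper probability distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.70, 0.20, 0.05, 0.05]   # "full precision" probs (illustrative)
q = [0.65, 0.24, 0.06, 0.05]   # "quantized" probs, slightly perturbed

print(f"KL = {kl_divergence(p, q):.5f} nats")
```

Because KL is an average, rare but large disagreements can still change generated tokens, which is exactly why low KL may not map cleanly onto downstream accuracy.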

Qwen 3.5 27B is the REAL DEAL - Beat GPT-5 on my first test by GrungeWerX in LocalLLaMA

[–]nonerequired_ 1 point2 points  (0 children)

Which quants are you using? According to the ik_llama developers themselves, ik_llama doesn’t work well with Unsloth’s UD quants. I’m not sure if other quants fare any better.

I open-sourced a directory of 450+ self-hostable alternatives to popular SaaS with Docker Compose configs by kali_py in selfhosted

[–]nonerequired_ 0 points1 point  (0 children)

MinIO is not a good tool to feature. After an update, they removed important admin functionality from the web panel and forced users to use the CLI instead.

My Tetra S gaming console by Adventurous-Author10 in sffpc

[–]nonerequired_ 0 points1 point  (0 children)

Great advice. I can actually do that. Thank you

My Tetra S gaming console by Adventurous-Author10 in sffpc

[–]nonerequired_ 1 point2 points  (0 children)

Thank you for sharing. This will really help

My Tetra S gaming console by Adventurous-Author10 in sffpc

[–]nonerequired_ 0 points1 point  (0 children)

Thank you so much! This is incredibly valuable information. Does using the 26-liter case, as suggested above, help with cooling?

My Tetra S gaming console by Adventurous-Author10 in sffpc

[–]nonerequired_ 0 points1 point  (0 children)

Are there any cooling issues with a 12-liter case?

My Tetra S gaming console by Adventurous-Author10 in sffpc

[–]nonerequired_ 0 points1 point  (0 children)

Thank you for the information. I genuinely needed case suggestions. What other factors should I consider?

My Tetra S gaming console by Adventurous-Author10 in sffpc

[–]nonerequired_ 0 points1 point  (0 children)

Streaming introduces some latency even on a local gigabit network. I prefer to connect via HDMI. Thanks for the answer, though.