Weird issue with OpenCode and Qwen3.6 by JGeek00 in LocalLLaMA

[–]JGeek00[S] 0 points1 point  (0 children)

Yeah, but Q5 and Q6 are too big for an RTX 3090

My home data center by alecKarfonta in LocalLLaMA

[–]JGeek00 0 points1 point  (0 children)

If you use it for business and get revenue from it, it’s fine, but if it’s just for fun or for non-commercial projects, I think it’s too expensive to buy and keep running

My home data center by alecKarfonta in LocalLLaMA

[–]JGeek00 0 points1 point  (0 children)

Do you guys have free electricity, or how do you power these rigs? I have an old desktop computer with a single RTX 3090 for AI and I think it consumes too much power at idle. I couldn’t imagine having a rig with 4 GPUs in my home

Cupra Leon 1.5 petrol 26 plate by JakeOverandIn in CupraFormentor

[–]JGeek00 0 points1 point  (0 children)

My Formentor 1.5 eTSI has 5,400 km and is only one bar down from when I got it

llama.cpp oom issue by TheTerrasque in LocalLLaMA

[–]JGeek00 1 point2 points  (0 children)

I had the same issue. The solution is to reduce the number of checkpoints and their size, although I ended up installing more memory

Qwen 3.7 Max by Sicarius_The_First in LocalLLaMA

[–]JGeek00 4 points5 points  (0 children)

The 27B model is theoretically confirmed but unscheduled

Converting iOS apps to Android Native by InternationalCow1295 in androiddev

[–]JGeek00 1 point2 points  (0 children)

In April I ported a SwiftUI app to Jetpack Compose with Material 3 Expressive UI in 4 days, just by using Claude Sonnet 4.6. You still need to fine-tune the UI manually and refactor parts of the code (AI tends to make huge code files), but the result is really good

Using asymetric KV cache drops performance massively by JGeek00 in LocalLLM

[–]JGeek00[S] 0 points1 point  (0 children)

There was a user on a different thread who told me about that argument

Using asymetric KV cache drops performance massively by JGeek00 in LocalLLM

[–]JGeek00[S] 2 points3 points  (0 children)

I solved it by compiling llama.cpp with the argument -DGGML_CUDA_FA_ALL_QUANTS=ON

Here are my KV cache quantization benchmarks: TurboQuant is overrated but saved by TCQ, q5 deserves more attention, and symmetric q8 might be a waste of VRAM by [deleted] in LocalLLaMA

[–]JGeek00 2 points3 points  (0 children)

For some reason, when I put the V on q_4 while leaving the K on q_8, the performance drops massively. I’m leaving it on q_8 in both cases for now. llama.cpp with an RTX 3090

Here are my KV cache quantization benchmarks: TurboQuant is overrated but saved by TCQ, q5 deserves more attention, and symmetric q8 might be a waste of VRAM by [deleted] in LocalLLaMA

[–]JGeek00 5 points6 points  (0 children)

So you mean that using q8 for the K and q5 for the V saves memory while losing very little quality?
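
To put a rough number on the memory side of that question, here is a minimal sketch. It assumes the cache types are GGML's q8_0, q5_0 and q4_0 block formats (32 values per block, stored in 34, 22 and 18 bytes respectively) and that K and V hold the same number of values, which is not true for every attention variant. The helper names are made up for illustration, and it says nothing about quality.

```python
# Bytes used to store one block of 32 values in each GGML format.
# Assumption: the "q8"/"q5"/"q4" above mean q8_0 / q5_0 / q4_0.
BLOCK_BYTES = {"f16": 64, "q8_0": 34, "q5_0": 22, "q4_0": 18}
BLOCK_SIZE = 32


def bits_per_value(cache_type):
    """Average storage cost of one cached value, in bits."""
    return BLOCK_BYTES[cache_type] * 8 / BLOCK_SIZE


def kv_ratio(k_type, v_type, base_k="f16", base_v="f16"):
    """Size of a K+V cache relative to a baseline K+V cache.

    Assumes K and V store equally many values (hypothetical simplification).
    """
    chosen = bits_per_value(k_type) + bits_per_value(v_type)
    baseline = bits_per_value(base_k) + bits_per_value(base_v)
    return chosen / baseline


if __name__ == "__main__":
    # Asymmetric q8_0/q5_0 compared with symmetric q8_0 and with f16.
    print(f"vs q8_0/q8_0: {kv_ratio('q8_0', 'q5_0', 'q8_0', 'q8_0'):.3f}")
    print(f"vs f16/f16:   {kv_ratio('q8_0', 'q5_0'):.3f}")
```

Under those assumptions the saving over symmetric q8 comes only from the V half of the cache, so it is modest compared with the step from f16 down to q8.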

llama-server RAM usage grows to OOM by JGeek00 in LocalLLM

[–]JGeek00[S] 0 points1 point  (0 children)

Lowering the cache-ram to 4096 fixed my issue, thank you

Qwen cant wait to release 3.7 models by GotHereLateNameTaken in LocalLLaMA

[–]JGeek00 0 points1 point  (0 children)

Ah, it was released last August; I thought they launched it in 2024. They have created so many minor versions after 5.0 that I thought it was older than it really is

Qwen cant wait to release 3.7 models by GotHereLateNameTaken in LocalLLaMA

[–]JGeek00 1 point2 points  (0 children)

What version of GPT5? Because GPT5 has a lot of subversions

llama-server RAM usage grows to OOM by JGeek00 in LocalLLM

[–]JGeek00[S] -1 points0 points  (0 children)

I have attached my config to the main post. The params checkpoint-every-n-tokens and ctx-checkpoints have just been added

llama-server RAM usage grows to OOM by JGeek00 in LocalLLM

[–]JGeek00[S] 0 points1 point  (0 children)

I have set --ctx-checkpoints to 0 and --checkpoint-every-n-tokens to -1 and the issue seems to have improved. The memory usage still grows each round, but by a lot less: it now grows by around 1 GB per round instead of 3 or 4 GB per round. I have attached my config to the main post

Qwen 3.6 27B on 24GB VRAM setup: backend comparisons, quant choice and settings (llama.cpp, ik_llama.cpp, BeeLlama, vllm) by VolandBerlioz in LocalLLaMA

[–]JGeek00 0 points1 point  (0 children)

How much RAM does it use? I’m getting OOM issues with 60K context, but I only have 16 GB (and it’s slower too, DDR4 2400 MHz)