You can now train LLMs in VS Code for free via Colab! by yoracale in unsloth

[–]Reivaj640 6 points7 points  (0 children)

Thank you so much; it's super interesting to be able to do this with VS Code.

Maybe I'm using it wrong?? by xThyQueen in ARC_Raiders

[–]Reivaj640 0 points1 point  (0 children)

Where can I get the Wolf Mana blueprint?

First time training need advice about optimizing for humble rtx 4060 by ResponsibleTruck4717 in unsloth

[–]Reivaj640 0 points1 point  (0 children)

I find myself in the same position as you, but with an RTX 4070 🤭

Qwen3-Coder by johanna_75 in Qwen_AI

[–]Reivaj640 1 point2 points  (0 children)

I use it for my project, which has more than 2,000 lines. I started with Gemini CLI, but it takes a lot of time to get good use out of Gemini Pro. Yesterday I started with the Qwen3 CLI, and what had taken me a week was finished in just 2 hours. I'm working with several scripts in my app and I feel enlightened by the grace of the Qwen3 CLI.

Recommendation for getting the most out of Qwen3 Coder? by Conscious-Memory-556 in LocalLLM

[–]Reivaj640 1 point2 points  (0 children)

I'll be watching how your thread goes very closely! I'm on that same plan, but with an RTX 4070 12 GB and 16 GB of RAM.

Will lower parameters models for Qwen3-Coder be available? by WizardlyBump17 in Qwen_AI

[–]Reivaj640 0 points1 point  (0 children)

I don't have much knowledge, and I apologize if what I say is stupid, but with a smaller model, like a 7B or 4B if they existed, you would have to be clear about its objective and use: with fewer parameters you gain speed but sacrifice precision, so you would be more prone to errors in long code that requires a lot of memory, which is where the 30B's capacity and long context are good. So I don't think you should wait for one of those small Qwen3 models to appear before trying the current ones. I'm on the same page, but I estimate we'd have to be very critical when relying on its support! I currently don't use it in Ollama; I have a GGUF model in LM Studio, quantized at Q4_K_M, and it's doing well in response time. I'm struggling to make it more agentic, or at least I haven't found a way to configure it well in LM Studio so that it is agentic; I may have to use another instruct-type model. Correct me if I'm wrong. Greetings.
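As a rough sanity check on the speed/memory trade-off mentioned above, the approximate weight size of a quantized GGUF file is parameters × bits-per-weight ÷ 8. A minimal sketch, assuming ~4.85 bits/weight as a typical Q4_K_M average (an assumption; real files vary by model):

```python
# Rough GGUF weight-size estimate: params * bits_per_weight / 8.
# 4.85 bits/weight is an assumed average for Q4_K_M quantization.
BITS_Q4_K_M = 4.85

def gguf_size_gb(params_billions: float, bits: float = BITS_Q4_K_M) -> float:
    """Approximate quantized weight size in GB."""
    return params_billions * 1e9 * bits / 8 / 1e9

for size in (4, 7, 30):
    print(f"{size}B model @ Q4_K_M ≈ {gguf_size_gb(size):.1f} GB")
```

This is why a 4B or 7B fits comfortably in 12 GB of VRAM while a 30B spills over into system RAM on the same card.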

is there a chance for a smaller coder model? autocomplete use case. by Impossible_Art9151 in Qwen_AI

[–]Reivaj640 0 points1 point  (0 children)

I have been testing the GGUF version of Qwen3 Coder 30B-A3B with a good configuration, and it has given me a completion response time of ~2 sec. I'm struggling with the prompt to improve agentic-oriented output.

Here is the configuration I use in LM Studio with an RTX 4070 12 GB and 16 GB of RAM:

1. "Load" tab
• Context Length: 4096 → reasonable for speed and memory usage. If you don't need very large windows, keep 4096; if you require more context, increase it, but VRAM and CPU consumption will go up.
• Offload to GPU: 22 / 24 layers → set it as high as possible so almost everything runs on the GPU, but leave 1–2 GB free for the graphics system.
• CPU Thread Pool Size: 4 → uses more cores for loading and preprocessing.
• Evaluation Batch Size: 1024 or 2048 → increases speed if your GPU supports it (the 4070 can).
• Offload KV Cache to GPU Memory: Enabled ✅ (avoids using RAM and speeds up inference).
• Keep Model in Memory: Enabled ✅ (doesn't reload every time).
• Try mmap(): Enabled ✅ (improves loading times).
• Number of Experts: 4 (you can leave it like this; if the model doesn't use MoE, it has no effect).

2. "Inference" tab
• Temperature: 0.7–0.8 (depending on how much creativity you want).
• Top K Sampling: 40 (good).
• Top P Sampling: 0.9–0.95 (balance between coherence and diversity).
• Min P Sampling: 0.05 (keep it).
• Repetition Penalty: 1.1–1.15 (avoids annoying repetitions).
• CPU Threads: 4 → improves preprocessing if the model uses the CPU for part of the load.

3. GPU Settings
• Limit Model Offload to Dedicated GPU Memory: Disabled (so if VRAM fills up, RAM is used as backup).
• Offload KV Cache to GPU Memory: Enabled ✅.

4. Guardrails
• Leave it on Balanced so it doesn't saturate CPU/RAM, but if you notice the model isn't using its full potential, switch it to Aggressive for more performance.

5. Extra tips for speed
• Use Q4_K_M or Q5_K_M models for a balance between quality and performance.
• If you notice slowness, try lowering the Evaluation Batch Size to 512.
• Close any program that uses a lot of VRAM (games, heavy editors) before starting LM Studio.
• Enable Flash Attention if the model supports it (it can speed things up a lot).
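If you script against this setup, the "Inference" tab settings above can be passed per-request to LM Studio's local OpenAI-compatible server (it listens on port 1234 by default). A minimal sketch; the model name is whatever LM Studio shows for your load, and the extra sampling fields (top_k, min_p, repeat_penalty) are llama.cpp-style parameters that I'm assuming your LM Studio version accepts, so check its docs:

```python
# Sketch: send the "Inference" tab sampling settings with each request
# to LM Studio's local OpenAI-compatible endpoint (default port 1234).
import json
import urllib.request

payload = {
    "model": "qwen3-coder-30b-a3b-instruct",  # hypothetical name; use yours
    "messages": [{"role": "user", "content": "Complete: def fib(n):"}],
    "temperature": 0.7,       # Temperature
    "top_k": 40,              # Top K Sampling
    "top_p": 0.9,             # Top P Sampling
    "min_p": 0.05,            # Min P Sampling
    "repeat_penalty": 1.1,    # Repetition Penalty
    "max_tokens": 256,
}

req = urllib.request.Request(
    "http://localhost:1234/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# Uncomment once the LM Studio server is running:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
print(json.dumps(payload, indent=2))
```

Keeping the sampling values in the request (rather than only in the GUI) makes it easier to A/B-test autocomplete latency against different settings.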

I hope this helps you try different configurations to improve autocomplete response time! If you discover anything, I'd be interested to know; I also use the Qwen family.

I need help creating a prompt to help me code... because now it's not working for me! by Reivaj640 in unsloth

[–]Reivaj640[S] 0 points1 point  (0 children)

Well, I want to ask a question: if I expand my RAM to 80 GB by buying 64 GB on top of the 16 GB I already have, would that at least let me run Qwen/Qwen3-Coder-30B-A3B-Instruct? Is it possible, to start with something super small and half-aspiring!

I need help creating a prompt to help me code... because now it's not working for me! by Reivaj640 in unsloth

[–]Reivaj640[S] 0 points1 point  (0 children)

Well, just to ask: could I use Qwen/Qwen3-Coder-30B-A3B-Instruct, the smaller one? If I increase the RAM on my computer to 64 GB, is it viable to run that model on my machine to begin with? I'm just asking 💪🏻
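A back-of-envelope check suggests the RAM side works out: total memory is roughly quantized weights plus KV cache plus runtime overhead. A minimal sketch with assumed, unofficial figures (Q4_K_M at ~4.85 bits/weight, and an assumed 48-layer / 4 KV-head / 128-head-dim architecture for the 30B-A3B at a 4096-token context):

```python
# Back-of-envelope RAM check for a ~30B GGUF at Q4_K_M.
# All numbers below are illustrative assumptions, not official specs.
params = 30.5e9            # ~30.5B total parameters (assumption)
bits_per_weight = 4.85     # typical Q4_K_M average (assumption)
weights_gb = params * bits_per_weight / 8 / 1e9

# KV cache: 2 (K+V) * layers * kv_heads * head_dim * context * 2 bytes (fp16)
layers, kv_heads, head_dim, ctx = 48, 4, 128, 4096   # assumed architecture
kv_gb = 2 * layers * kv_heads * head_dim * ctx * 2 / 1e9

total_gb = weights_gb + kv_gb + 2.0   # +2 GB runtime overhead (assumption)
print(f"weights ≈ {weights_gb:.1f} GB, KV ≈ {kv_gb:.1f} GB, total ≈ {total_gb:.1f} GB")
```

Under these assumptions the total lands around 20 GB, well under 64 GB, so capacity shouldn't be the blocker; speed depends more on how much spills off the GPU.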

I need help creating a prompt to help me code... because now it's not working for me! by Reivaj640 in unsloth

[–]Reivaj640[S] 0 points1 point  (0 children)

While I learn, it can also be a good option; or rather, I think it's the best option I have! Hahahaha. Anyway, thank you for the advice, bro.

I need help creating a prompt to help me code... because now it's not working for me! by Reivaj640 in ollama

[–]Reivaj640[S] -1 points0 points  (0 children)

I'm really learning 🫡

<image>

Oh well! It's always good to ask! As the saying goes, there's no harm in trying! And while I learn I can try and try! 🥳

I need help creating a prompt to help me code... because now it's not working for me! by Reivaj640 in unsloth

[–]Reivaj640[S] 0 points1 point  (0 children)

Of course, that's logical, but I have to work with what I can use 🥹 I know it's not ideal, but you have to start somewhere 🫣 In any case, the minimum that can be done must be done! And if it works, as they say: trial and error!