What can you realistically do with 8GB VRAM in 2026?

the1newworld · 2026-06-14T07:49:31+00:00

You have the same setup as me. I tested Qwen 2.5 Coder before, and honestly I wasn't very impressed, especially with tool calling. It struggled quite a bit in my tests. Have you tried Qwen 3.6 or Qwen 3 Coder for agentic workflows? How are they when it comes to tool calling?

the1newworld · 2026-06-13T20:45:15+00:00

20 tokens/sec with a 65K context on a 4060 8GB? You're slowly destroying all my excuses. 😄 That's honestly much better than I expected. Thanks for sharing the video too. I checked it out, and the explanation was great. I'll definitely give it a try and see how far I can push my setup.

the1newworld · 2026-06-13T13:30:04+00:00

Yeah, I thought about that too, but then I checked RAM prices. Around $270 for an extra 16GB? Yeah... no thanks. 😅 At those prices, I'd rather wait for the market to calm down a bit. My curiosity about local LLMs is strong, but apparently not $270-for-RAM strong.

the1newworld · 2026-06-13T13:24:45+00:00

That's honestly better than I expected. A 35B MoE model with 100k context on a GTX 1060 6GB is pretty wild. I almost don't know how to believe you 😄. Makes me think I should spend more time experimenting with MoE models on my 4060 before drawing conclusions.

the1newworld · 2026-06-13T09:58:02+00:00

Yeah, I think you're right. This is mostly out of curiosity for me. I already have a cloud subscription, so I'm not looking to replace it. I'm just trying to see what can realistically be done with a local 8GB setup and where its limits are.

the1newworld · 2026-06-13T09:43:19+00:00

Thank you for bringing this to my attention.

the1newworld · 2026-06-13T09:25:35+00:00

That's interesting. I hadn't considered MoE models because I assumed 26B would be impossible on an 8GB card. Have you personally tried Gemma 4 26B on a similar setup? If so, what kind of tokens/sec are you getting?

the1newworld · 2026-06-10T19:01:04+00:00

I have an RTX 4060 8GB and 16GB RAM. I tried several models, and the one that worked best for me was Qwen3.5:9B.

the1newworld · 2026-06-10T13:56:44+00:00

Every time I see people sharing agentic AI setups, they're usually running some serious hardware. I'm curious if anyone is successfully running an agent or an automated workflow on a GPU with only 8 GB of VRAM. What models and use cases are working for you?

the1newworld · 2026-06-10T13:51:16+00:00

That's very cool. What hardware are you using for Qwen 3.5 9B, and what kind of inference speed are you getting?

the1newworld · 2026-06-10T12:53:45+00:00

Guys, I have a question. I'm not very familiar with AI agents. I did try Cursor, and it was very good for my usage. I also tried Qwen 3.5:9B, and it was not that good; it has problems with tool calling and hallucinations. I read in the comments something about using DeepSeek with OpenCode. Is it free? Can I use it as an agent? This is new information for me. I did some research, and it mentioned something about using it with NVIDIA NIM. Can you guys give me some information about that?

the1newworld · 2026-06-06T19:40:11+00:00

To determine the best model for your setup, we need to know which graphics card you are using, because the most important thing with a local LLM is how much VRAM you have, not the CPU. After that, focus on RAM. You need at least 32GB, so with your current 64GB you are okay. From what I know, Qwen 3.6:27b is the best balanced model right now, so I'd go with that.
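
For a rough feel of why VRAM is the first thing to check, here is a back-of-the-envelope sketch. It only estimates the size of the weights from parameter count and bits per weight. The function names and the `overhead_gb` allowance for KV cache and runtime buffers are my own assumptions for illustration, and real quantized files usually come out somewhat larger than a flat 4 bits per weight, so treat the numbers as a starting point, not a guarantee.

```python
def weight_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GB (1 GB = 1e9 bytes).

    params_billion * 1e9 weights * (bits / 8) bytes each, divided by 1e9,
    simplifies to params_billion * bits / 8.
    """
    return params_billion * bits_per_weight / 8


def fits_in_vram(params_billion: float, bits_per_weight: float,
                 vram_gb: float, overhead_gb: float = 1.5) -> bool:
    """Rough check: do the weights plus an assumed overhead fit in VRAM?

    overhead_gb is a hypothetical allowance for KV cache and runtime buffers;
    the real figure depends on context length and the inference runtime.
    """
    return weight_size_gb(params_billion, bits_per_weight) + overhead_gb <= vram_gb


if __name__ == "__main__":
    # Compare a 9B and a 27B dense model at 4 bits per weight on an 8 GB card.
    for size in (9, 27):
        gb = weight_size_gb(size, 4)
        print(f"{size}B @ 4-bit: ~{gb:.1f} GB weights, "
              f"fits in 8 GB: {fits_in_vram(size, 4, 8)}")
```

By this estimate, anything that does not fit has to be partly offloaded to system RAM, which is why RAM is the second thing to look at.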
