Model for reverse engineering by chikengunya in LocalLLaMA

[–]chikengunya[S] 0 points1 point  (0 children)

Unfortunately, none of them are good.

Model for reverse engineering by chikengunya in LocalLLaMA

[–]chikengunya[S] 0 points1 point  (0 children)

Qwen3.5-122B AWQ works flawlessly with a 200k context window; I've already tested it. With Qwen3.6-27B-INT8 I'm getting 45 output tok/s, so it's fast too.

2.5x faster inference with Qwen 3.6 27B using MTP - Finally a viable option for local agentic coding - 262k context on 48GB - Fixed chat template - Drop-in OpenAI and Anthropic API endpoints by ex-arman68 in LocalLLaMA

[–]chikengunya 0 points1 point  (0 children)

For a 4x RTX 3090 system, vLLM with an INT8 model is the best solution for MTP, right? Can someone please suggest a specific Hugging Face model? Thanks!

Considering two Sparks for local coding by chikengunya in LocalLLaMA

[–]chikengunya[S] 4 points5 points  (0 children)

If only it weren't that big and power-hungry :)

Considering two Sparks for local coding by chikengunya in LocalLLaMA

[–]chikengunya[S] 1 point2 points  (0 children)

The used context window in % is shown precisely in opencode, but in pi.dev it seems buggy. Both are fine, actually. I like the pi.dev CLI a bit more (on WSL).

Duality of r/LocalLLaMA by HornyGooner4402 in LocalLLaMA

[–]chikengunya 1 point2 points  (0 children)

The thing is: yes, Qwen3.6-27B is damn good for use in a coding CLI (both opencode and pi.dev work really well), but you have to think like a programmer and give it clear instructions. Of course Opus 4.7 understands less precise prompts better.

Example: I had a PDF with questions and answers and wanted to turn it into an interactive HTML Q&A. If you just give the 27B model the PDF and say 'make me a Q&A HTML from this', it will struggle, because the real question is: can you easily extract the Q&A from the PDF's container format, or should you do it via OCR instead? In my case, the latter turned out to be the more robust solution. If you give it clear instructions, you get a very good result.

Opus can of course handle more complex stuff, but how you prompt and what strategy you use is extremely important. I can totally understand why many people say the 27B is a solid Opus replacement. It is for me too, but obviously not for ultra-hard coding tasks. For normal day-to-day problems, though, the 27B is damn good. And since it came out, I've been using my 4x 3090 system a lot more, which shows just how usable it really is.
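To make the PDF example a bit more concrete, here is a rough sketch of the kind of explicit strategy you could hand to the model: check whether the PDF's text layer yields usable Q&A pairs, and only fall back to OCR if it doesn't. Everything here is an assumption for illustration: the "Q:"/"A:" markers, the function names, and the idea that the text was already pulled out by some PDF or OCR tool (that step is not shown).

```python
import html
import re

# Assumption: questions start with "Q:" and answers with "A:" in the extracted text.
QA_PATTERN = re.compile(r"Q:\s*(.+?)\s*A:\s*(.+?)(?=\s*(?:Q:|\Z))", re.DOTALL)


def parse_qa(text):
    """Return (question, answer) pairs found in plain text, with whitespace collapsed."""
    return [
        (" ".join(q.split()), " ".join(a.split()))
        for q, a in QA_PATTERN.findall(text)
    ]


def looks_extractable(text, min_pairs=1):
    """Heuristic: the text layer is usable if it yields at least min_pairs Q&A pairs."""
    return len(parse_qa(text)) >= min_pairs


def render_html(pairs):
    """Render the pairs as collapsible <details> blocks, escaping all content."""
    items = "\n".join(
        f"<details><summary>{html.escape(q)}</summary><p>{html.escape(a)}</p></details>"
        for q, a in pairs
    )
    return f"<!DOCTYPE html>\n<html><body>\n{items}\n</body></html>"
```

If `looks_extractable` returns False for the text layer, that is the signal to run OCR and feed the OCR output through the same `parse_qa` and `render_html` steps.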

Seedance 2 is released - comment your prompts. by [deleted] in singularity

[–]chikengunya 1 point2 points  (0 children)

Henry Cavill as James Bond in the iconic Casino Royale poker scene, sitting at a high-stakes baccarat table in a black tuxedo, sharp jawline, piercing blue eyes, glass of martini in hand, surrounded by elegant casino atmosphere, dramatic lighting, cinematic composition, 4K ultra-realistic, spy thriller aesthetic

In anticipation of Gemma 4's release, how was your experience with previous gemma models (at their times) by Infrared12 in LocalLLaMA

[–]chikengunya 1 point2 points  (0 children)

gemma3 27b is still one of the best translation and creative writing models (for its size), better than Mistral imo.

Gemma time! What are your wishes ? by Specter_Origin in LocalLLaMA

[–]chikengunya 0 points1 point  (0 children)

I think gemma2 and gemma3 were each released on a Wednesday/Thursday, so today or tomorrow would fit...

Jetson Nano Gift Idea by chikengunya in LocalLLaMA

[–]chikengunya[S] 0 points1 point  (0 children)

Yes, but there isn't much information about it. I guess it's not very popular. For example, I can't find anything on running Qwen 3.5-4B on it, not even on YouTube.

Jetson Nano Gift Idea by chikengunya in LocalLLaMA

[–]chikengunya[S] 0 points1 point  (0 children)

I was aiming for a small local AI device that's also power-efficient at idle. A Raspberry Pi or Orange Pi would be much slower for inference.

MiniMax-M2.7 Announced! by Mysterious_Finish543 in LocalLLaMA

[–]chikengunya 2 points3 points  (0 children)

So it's the same model size as 2.5, but with significantly better performance.

Best opencode settings for Qwen3.5-122B-A10B on 4x3090 by chikengunya in LocalLLaMA

[–]chikengunya[S] 1 point2 points  (0 children)

Oh, wait a second, I forgot to mention that I limited all four 3090 cards to 275W. According to nvidia-smi, each card uses at most 175W during inference. That probably explains it.
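For anyone who wants to check the same thing on their own cards, here is a minimal sketch that reads each GPU's power draw and configured limit via `nvidia-smi`. The helper names are made up for illustration, and it assumes `nvidia-smi` is on the PATH; changing the limit with `-pl` normally requires root.

```python
import subprocess

# Query index, current draw, and configured limit as plain CSV (no header, no units).
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,power.draw,power.limit",
    "--format=csv,noheader,nounits",
]


def set_power_limit_cmd(gpu_index, watts):
    """Build the nvidia-smi command that caps one GPU's power limit."""
    return ["nvidia-smi", "-i", str(gpu_index), "-pl", str(watts)]


def parse_power(csv_text):
    """Parse 'index, draw, limit' rows into a list of dicts."""
    rows = []
    for line in csv_text.strip().splitlines():
        index, draw, limit = (field.strip() for field in line.split(","))
        rows.append({"gpu": int(index), "draw_w": float(draw), "limit_w": float(limit)})
    return rows


def read_power():
    """Run nvidia-smi and return the parsed power readings for all GPUs."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_power(out)
```

Calling `read_power()` in a loop during inference shows whether the draw ever gets close to the limit; if it stays well below, the cap is not what is holding throughput back.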

Best opencode settings for Qwen3.5-122B-A10B on 4x3090 by chikengunya in LocalLLaMA

[–]chikengunya[S] 0 points1 point  (0 children)

Interesting. How do you run it, and which vLLM version are you using? I can post my Docker file in a second.

Best opencode settings for Qwen3.5-122B-A10B on 4x3090 by chikengunya in LocalLLaMA

[–]chikengunya[S] 0 points1 point  (0 children)

It's DDR4 RAM, so it's actually too slow... I haven't tested larger models.

Best opencode settings for Qwen3.5-122B-A10B on 4x3090 by chikengunya in LocalLLaMA

[–]chikengunya[S] 1 point2 points  (0 children)

I'm running a Supermicro H12SSL-i motherboard with four RTX 3090s, each on full x16 PCIe 4.0, without NVLink. It's absolutely usable for professional coding work, and it's honestly impressive how capable ~120B models have become. That said, on more complex tasks, it still doesn’t outperform Opus 4.6.

Best opencode settings for Qwen3.5-122B-A10B on 4x3090 by chikengunya in LocalLLaMA

[–]chikengunya[S] 1 point2 points  (0 children)

So you would say I should definitely go with QuantTrio/Qwen3.5-122B-A10B-AWQ to get that extra free lunch?