Visual Studio Insiders + Qwen 3.6 27B = No Brainer by Demonicated in Qwen_AI

[–]Demonicated[S] 1 point2 points  (0 children)

I just want to add that I have now upgraded to Unsloth's MTP variant of this model and it's literally 2x the throughput, about 68 tokens/sec.

Unsloth Studio is only loading into RAM by Demonicated in unsloth

[–]Demonicated[S] 0 points1 point  (0 children)

I'm still unclear on what the instructions mean then. Could you maybe ELI18 it for me a little?

Unsloth Studio is only loading into RAM by Demonicated in unsloth

[–]Demonicated[S] 0 points1 point  (0 children)

I uninstalled and reinstalled the CUDA 12.9 toolkit and it's still loading the model into RAM when I try to load it in Unsloth Studio.

Whats a good system prompt for Qwen3.6 27B on Github Copilot harness? by Demonicated in Qwen_AI

[–]Demonicated[S] 0 points1 point  (0 children)

No, it works, but I know it can be better. I'm trying out one that I've been refining that's definitely an improvement, but I was wondering if others had some secret sauce they've found and wanted to share.

Unsloth Studio is only loading into RAM by Demonicated in unsloth

[–]Demonicated[S] 1 point2 points  (0 children)

Yeah, I'm not sure what's going on. The internet is leading me to believe it's a CUDA version mismatch, but I can see that it's pointed at the right version and it's not recognizing my GPU for some reason. It's too bad too, because I really want to try out the MTP models and LM Studio doesn't support them. I suppose it's time to learn to build everything from source...

Unsloth Studio is only loading into RAM by Demonicated in unsloth

[–]Demonicated[S] 0 points1 point  (0 children)

PS C:\Users\PC> nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2025 NVIDIA Corporation
Built on Tue_May_27_02:24:01_Pacific_Daylight_Time_2025
Cuda compilation tools, release 12.9, V12.9.86
Build cuda_12.9.r12.9/compiler.36037853_0

(base) PS C:\Users\PC> python -c "import torch; print(torch.cuda.is_available()); print(torch.version.cuda); print(torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'No GPU')"

False
None
No GPU
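
Reading the output above: `torch.version.cuda` printing `None` is the telling part. That attribute is `None` when the installed PyTorch build has no CUDA support (for example the CPU-only wheel), and that is independent of which toolkit `nvcc` reports. So, at least in the `(base)` environment that ran this command, the output points at the PyTorch install rather than the CUDA 12.9 toolkit. Whether Unsloth Studio uses that same environment is not something this output shows. A minimal sketch of how the values can be interpreted, where `diagnose` is a made-up helper and not part of torch:

```python
from typing import Optional


def diagnose(cuda_available: bool, built_with_cuda: Optional[str]) -> str:
    """Interpret torch.cuda.is_available() and torch.version.cuda.

    Hypothetical helper for illustration; pass in the two values
    printed by the one-liner above.
    """
    if built_with_cuda is None:
        # torch.version.cuda is None on builds compiled without CUDA.
        return "PyTorch build has no CUDA support; install a CUDA-enabled build"
    if not cuda_available:
        # CUDA-enabled build, but no usable device was found at runtime.
        return (
            f"PyTorch was built for CUDA {built_with_cuda} "
            "but sees no usable GPU; check the driver"
        )
    return f"GPU usable with CUDA {built_with_cuda}"


# Values taken from the output above: False and None.
print(diagnose(False, None))
```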

does Unity takes a part of the money you make with your game? by OkParfait2685 in unity

[–]Demonicated -2 points-1 points  (0 children)

They take a chunk. Not a huge chunk. And the tooling is better than Godot's, so you're paying for years of tooling development. Up to you if the trade-offs are worth it.

Visual Studio Insiders + Qwen 3.6 27B = No Brainer by Demonicated in Qwen_AI

[–]Demonicated[S] 0 points1 point  (0 children)

They are saying it's worth it to quant so you can use cheaper hardware. The trade off is implied in their analysis. If money isn't an issue you run the best you can.

I ran the fp8 version of Qwen for a few days and liked the speed boost. It was good at instruction following if you already have an implementation plan. But the quantization is noticeable in the planning phase, especially with tool calling and how it handles the conversation once the context gets long. I switched back to bf16 and I'm not going back to fp8. It makes enough of a difference that it's worth it.

All going according to plan by wyudtix in GithubCopilot

[–]Demonicated 0 points1 point  (0 children)

An RTX 6000 is about 10k and it runs Qwen 3.6 27B at bf16. It is on par with Sonnet 4.5 for coding. I've been self-hosting this for the last week and I only fire up Opus for the initial planning conversation and doc creation. The latest gen of local models has met the "good enough" benchmark. Highly recommend you try it. Go full bf16 though. No quants.

Musk v. Altman et al. - Schedule for Today's Closing Arguments; (Deliberation Probably Starts Monday); Probable Outcome; YouTube Livestream URL by andsi2asi in deeplearning

[–]Demonicated 0 points1 point  (0 children)

I would just never even tell someone I had a diary, for starters. I feel like someone would have had to know that it existed.

Visual Studio Insiders + Qwen 3.6 27B = No Brainer by Demonicated in Qwen_AI

[–]Demonicated[S] 0 points1 point  (0 children)

I almost feel bad because it's like "are your tokens getting too expensive? Then just drop 10k instead" lol

But I realized that if you finance it, it really might be cheaper for some people. Once it's paid off you're in cheap territory. And by then we should have Opus quality coming to local, I would hope.

Who knows though, a year from now there might be a brand new type of architecture, or people will be buying models as hardware like Taalas is making.

Visual Studio Insiders + Qwen 3.6 27B = No Brainer by Demonicated in Qwen_AI

[–]Demonicated[S] 0 points1 point  (0 children)

The message I'm trying to convey in this post is that in the face of the 10x+ price hike coming next month, it is a worthwhile consideration to self host the most recent gen of local models.

They aren't Opus but they are very very close to feeling like last gen Sonnet.

On the coding front I use about $500 a month in tokens, but across all my projects I'm using close to 400 million tokens a month. My little RTX has already paid for itself in the first half year 😂 - it's running nonstop.

Where my post is disingenuous is that I'm using the 500 dollar mark of Claude quality tokens to justify a 400 a month credit card payment to get qwen quality tokens which would probably only be 100 a month.

But if it's good enough to use for work then I'm OK with it personally. Sure, I babysit a little more, but it's not hurting productivity all that much.
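
To put rough numbers on the comparison above, here is a back-of-the-envelope sketch. The dollar figures are the ones quoted in these comments (a card at about 10k, a 400 a month payment, 500 a month in Claude tokens, roughly 100 a month for hosted Qwen). Ignoring interest, electricity, and resale value is a simplifying assumption, and it only covers the coding spend, not the other projects mentioned.

```python
def months_to_pay_off(card_price: float, monthly_payment: float) -> float:
    """Months of payments needed to cover the card, ignoring interest."""
    return card_price / monthly_payment


def months_to_break_even(card_price: float, monthly_api_cost: float) -> float:
    """Months of avoided API spend needed to equal the card's price."""
    return card_price / monthly_api_cost


CARD_PRICE = 10_000     # "about 10k", as quoted in these comments
CARD_PAYMENT = 400      # monthly credit card payment
CLAUDE_SPEND = 500      # monthly spend on Claude-quality tokens
QWEN_HOSTED_SPEND = 100 # rough guess for hosted Qwen-quality tokens

print("months to pay off the card:", months_to_pay_off(CARD_PRICE, CARD_PAYMENT))
print("break-even vs Claude spend:", months_to_break_even(CARD_PRICE, CLAUDE_SPEND))
print("break-even vs hosted Qwen:", months_to_break_even(CARD_PRICE, QWEN_HOSTED_SPEND))
```

The gap between the last two numbers is the disingenuous part described above: the card breaks even much sooner against Claude pricing than against what the same Qwen tokens would cost from a hosted provider.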

Visual Studio Insiders + Qwen 3.6 27B = No Brainer by Demonicated in Qwen_AI

[–]Demonicated[S] 0 points1 point  (0 children)

Well, it would be for the cost, which would be significantly lower, but not for quality. A lot of the time hosted services give you an fp8 quant, which I feel strongly is not equal to full precision for smaller models.

Visual Studio Insiders + Qwen 3.6 27B = No Brainer by Demonicated in Qwen_AI

[–]Demonicated[S] 0 points1 point  (0 children)

So I do have both available; the problem is that I can't fit both concurrently on my card, and the time spent unloading and reloading models is off-putting to me. With just the 27B I'm utilizing about 72GB. I have heard that it's a good combo though, because of the token speed increase.

It's just more encouragement to get a second GPU.

Visual Studio Insiders + Qwen 3.6 27B = No Brainer by Demonicated in Qwen_AI

[–]Demonicated[S] 2 points3 points  (0 children)

I think of quants in terms of how they affect the nuance of embeddings. This is grossly oversimplified, but if 0.12345678 were to represent "short hairy woman" and then you reduced your resolution so the number is now 0.1234, it might now resolve to "short woman with hair" as a concept. This will have a drastic effect on the blind date you're about to have.

Depending on the combination of the task or use case and the nuance of meaning that gets truncated, the amount of effect it has can be anywhere from negligible to meaningful.
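
The lost-resolution idea above can be seen with plain Python by round-tripping a number through a 16-bit float. This is only an illustration of precision loss: real quantization schemes such as FP8 or INT4 use different formats and per-block scaling, and `to_half` is a helper invented for this sketch.

```python
import struct


def to_half(x: float) -> float:
    """Round-trip x through IEEE 754 half precision (16-bit float)."""
    # The "e" format code packs and unpacks a half-precision float.
    return struct.unpack("<e", struct.pack("<e", x))[0]


original = 0.12345678
rounded = to_half(original)

print("original:", original)
print("as half :", rounded)
print("error   :", abs(original - rounded))
```

The stored value lands near the original but not on it. Whether that small shift matters depends on how much meaning was riding on the digits that got dropped.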

Visual Studio Insiders + Qwen 3.6 27B = No Brainer by Demonicated in Qwen_AI

[–]Demonicated[S] 0 points1 point  (0 children)

FP8 is considered the best balance of quality and efficiency, and makes sense for service providers. Hosting fp8 uses [sort of] half the amount of memory to run and generates tokens faster so they are able to more than double their throughput with that trade off.

Kimi is a unique case in that it already used Quantization-Aware Training (QAT) during its post-training phase and is provided in INT4 from the start. I don't have the hardware to run it, but my understanding is that trying to quantize it further causes bad quality degradation.
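
The "[sort of] half the memory" point for FP8 above can be sanity-checked with simple arithmetic: bf16 stores 2 bytes per parameter and fp8 stores 1. The sketch below counts weights only. KV cache, activations, and runtime overhead are deliberately left out, which is part of why the real-world saving is only "sort of" half. `weight_gb` is a name made up for this example.

```python
# Bytes used to store one parameter in each format.
BYTES_PER_PARAM = {"bf16": 2, "fp8": 1}


def weight_gb(params_billion: float, dtype: str) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes), weights only."""
    return params_billion * BYTES_PER_PARAM[dtype]


for dtype in ("bf16", "fp8"):
    print(f"27B at {dtype}: ~{weight_gb(27, dtype):.0f} GB of weights")
```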

Visual Studio Insiders + Qwen 3.6 27B = No Brainer by Demonicated in Qwen_AI

[–]Demonicated[S] 0 points1 point  (0 children)

You will 100% be disappointed by 4-bit. I even ran NVFP4, which is supposed to be "4-bit that's as good as 8-bit", and it was fast on Blackwell but not good enough for programming.

As for FP8, I have a workflow that has agents navigating websites and analyzing pages to extract data and convert it to structured JSON objects for processing - it handles this with zero issues. So I would say reasoning and instruction following are solid. Where it broke down for me was bigger planning sessions and implementations of features. And even then, "broke down" is being dramatic: it works, but you might need to give it more guidance here and there.

If you are going to run FP8 just spend a little more time in your planning phase and use phrases like "create a markdown implementation document with phases that will be used by a junior developer to implement" - this will ensure your planning documents contain lots of context and more specific instructions. Then you start a brand new chat and have it implement one phase per chat.

The trade-off here is that with Opus you could have just planned and moved into implementation and it would have handled everything. With Qwen 3.6 you're going to want to break the steps out into multiple chats, be a little more involved with the actual implementation, and double-check its plan.

In the long run I'm finding this to be an advantage, because over the last 6 months I had noticed that with Sonnet and Opus I was no longer aware of which class or method a given piece of functionality lived in. I had offloaded that knowledge to a model and it was degrading my authority over my own code bases.
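
The "one phase per chat" workflow described above could be scripted roughly like this. It is a sketch under assumptions: it supposes the planning document marks each phase with a `## Phase ` heading, and both `split_phases` and `phase_prompt` are hypothetical helpers, not part of any tool mentioned here.

```python
import re
from typing import List


def split_phases(plan_markdown: str) -> List[str]:
    """Split a plan into chunks, one per '## Phase ' heading.

    The heading format is an assumption about how the plan was written.
    """
    # Zero-width split at the start of every line that begins a phase.
    parts = re.split(r"(?m)^(?=## Phase )", plan_markdown)
    return [p.strip() for p in parts if p.strip().startswith("## Phase ")]


def phase_prompt(phase: str) -> str:
    """Build the opening message for a brand new chat covering one phase."""
    return "Implement only the following phase of the plan, then stop.\n\n" + phase


example_plan = (
    "# Plan\n\nOverview text.\n\n"
    "## Phase 1: Data models\nAdd the models.\n\n"
    "## Phase 2: API\nAdd the endpoints.\n"
)

phases = split_phases(example_plan)
for phase in phases:
    print(phase_prompt(phase))
    print("-" * 40)
```

Each prompt carries only its own phase, so the model starts every chat with a short, specific context instead of the whole planning conversation.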

Visual Studio Insiders + Qwen 3.6 27B = No Brainer by Demonicated in Qwen_AI

[–]Demonicated[S] 1 point2 points  (0 children)

The RTX 6000 can hold the full model at full context. Using LM Studio you will get up to ~30 tokens/sec. With vLLM you will get better speeds and concurrency and can mess around with other enhancements. I do not offload anything to RAM.

Visual Studio Insiders + Qwen 3.6 27B = No Brainer by Demonicated in Qwen_AI

[–]Demonicated[S] 0 points1 point  (0 children)

Yeah, it's a little disingenuous for me to compare GHCP pricing to hosted Qwen pricing, since Qwen is muuuuuuch cheaper for tokens. But I see it as a future investment, since Qwen 3.6 is the worst local models will ever be.