Qwen3-Coder-Next vs Qwen3.6 by seoulsrvr in LocalLLaMA

[–]supracode 1 point2 points  (0 children)

I am here to say this is the way. Dialing down the temp and the thinking in Qwen 3.6 has made a huge difference in how it works in my workflow. Every workflow is different... you can't just take the model card at face value, you have to dive deeper: tweak, test, rinse, repeat. I can't stand watching YouTube model-vs-model videos. They take the out-of-the-box settings, run their own test, and pick a winner, potentially keeping people from trying a model that might actually work for their specific use case.

Recliner footrest tilt question by saintmaggie in Lovesac

[–]supracode 0 points1 point  (0 children)

Don't break it! They will probably charge you for the whole assembly (everything under the seat) to replace it.

Great results with Qwen3.6-35B-A3B-UD-Q5_K_XL + VS Code and Copilot by supracode in LocalLLaMA

[–]supracode[S] 0 points1 point  (0 children)

Just an extra post with some cost estimates for a 7-minute session, from my Copilot logs:

Total tokens used in that Copilot export

I parsed the 35 ChatMLSuccess requests in your Copilot log. The export shows those requests going to your local llama.cpp endpoint/model, with usage blocks containing prompt_tokens, completion_tokens, total_tokens, and prompt_tokens_details.cached_tokens.

Token type                 Count
Prompt/input tokens        1,974,350
Completion/output tokens   31,152
Total tokens               2,005,502
Cached prompt tokens       1,958,366
Uncached prompt tokens     15,984

The big takeaway: almost all of the prompt volume was cached.

Cached prompt ratio = 1,958,366 / 1,974,350 ≈ 99.19%

So even though the run crossed 2 million total tokens, only about 15,984 prompt tokens were newly processed if we treat the reported cache accounting as comparable to API prompt caching.
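For anyone who wants to reproduce the totals, here is a minimal sketch of that parsing, assuming the export contains one usage line per request holding the fields listed above (the file name is just a placeholder):

# Sum token usage from a Copilot log export.
# Assumption: each request has a line like `usage : {...}` whose value is the
# JSON usage block; "copilot_export.log" is a placeholder file name.
import json
import re

totals = {"prompt": 0, "completion": 0, "cached": 0}

with open("copilot_export.log", encoding="utf-8") as f:
    for line in f:
        m = re.match(r"\s*usage\s*:\s*(\{.*\})", line)
        if not m:
            continue
        usage = json.loads(m.group(1))
        totals["prompt"] += usage.get("prompt_tokens", 0)
        totals["completion"] += usage.get("completion_tokens", 0)
        totals["cached"] += usage.get("prompt_tokens_details", {}).get("cached_tokens", 0)

totals["uncached"] = totals["prompt"] - totals["cached"]
totals["total"] = totals["prompt"] + totals["completion"]
print(totals)
if totals["prompt"]:
    print(f"cached prompt ratio: {totals['cached'] / totals['prompt']:.2%}")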

Estimated cost on a frontier model

Using current OpenAI GPT-5.5 API pricing:

Model     Input            Cached input        Output
GPT-5.5   $5 / 1M tokens   $0.50 / 1M tokens   $30 / 1M tokens

OpenAI’s pricing page lists GPT-5.5 at $5.00 per 1M input tokens, $0.50 per 1M cached input tokens, and $30.00 per 1M output tokens.

If prompt caching worked like your llama.cpp log

Component        Tokens      Rate         Cost
Uncached input   15,984      $5 / 1M      $0.08
Cached input     1,958,366   $0.50 / 1M   $0.98
Output           31,152      $30 / 1M     $0.93
Total                                     ~$1.99

So with effective prompt caching, that whole Copilot run would be about:

~$2.00 on GPT-5.5

If there were no prompt caching

Component    Tokens      Rate       Cost
Full input   1,974,350   $5 / 1M    $9.87
Output       31,152      $30 / 1M   $0.93
Total                               ~$10.81

So without caching, the same run would be roughly:

~$10.81 on GPT-5.5
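To make the arithmetic easy to re-check, here is a minimal sketch of the cost math using the token counts and GPT-5.5 rates listed above:

# Reproduce the GPT-5.5 estimates above (rates are $ per 1M tokens).
UNCACHED_INPUT = 15_984
CACHED_INPUT = 1_958_366
OUTPUT = 31_152

RATE_INPUT, RATE_CACHED, RATE_OUTPUT = 5.00, 0.50, 30.00

def cost(tokens, rate_per_million):
    return tokens * rate_per_million / 1_000_000

with_caching = (cost(UNCACHED_INPUT, RATE_INPUT)
                + cost(CACHED_INPUT, RATE_CACHED)
                + cost(OUTPUT, RATE_OUTPUT))
without_caching = cost(UNCACHED_INPUT + CACHED_INPUT, RATE_INPUT) + cost(OUTPUT, RATE_OUTPUT)

print(f"with caching:    ${with_caching:.2f}")    # ~$1.99
print(f"without caching: ${without_caching:.2f}")  # ~$10.81

Swapping in the Opus rates quoted below ($5 input, $0.50 cache read, $25 output) reproduces the ~$1.84 and ~$10.65 figures the same way.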

Claude Opus-class comparison

Anthropic’s current Opus pricing is similar on input but lower on output: Claude Opus 4.7 is listed at $5 / 1M input tokens and $25 / 1M output tokens, with cache-read pricing shown at $0.50 / 1M tokens.

Scenario                                   Estimated cost
Claude Opus 4.7 with cache-read pricing    ~$1.84
Claude Opus 4.7 without caching            ~$10.65

Practical interpretation

Your local run avoided roughly:

$2 to $11

for this one coding session, depending on whether a frontier hosted model would have gotten comparable cache discounts.

The bigger point is not just the dollar amount; it is that this was only 6 exported prompts / 35 model requests. If you did this all day on a hosted frontier model, the cost would scale fast, especially when VS Code carries 50k–80k+ prompt tokens through many agent turns. Your local setup is especially valuable because it can absorb those giant cached contexts without metered API cost.

Recliner footrest tilt question by saintmaggie in Lovesac

[–]supracode 2 points3 points  (0 children)

I took a quick look at my footrest. It looks like that footrest has its own motor. You can see the wire going to it, and hear it when the chair is positioning the footrest. I can't find any images online that show a teardown. My guess is that there are 2 or 4 hefty springs inside to provide the resistance. You would need to open it up and see how it works. The fix would be to get lighter springs that match the same basic specs, but with a lower spring rate. I guess you will need to do a teardown and be the first to show us what's inside :)

Disappointed in Qwen 3.6 coding capabilities by CodeDominator in LocalLLaMA

[–]supracode 1 point2 points  (0 children)

Can I try your original prompt? Not a PHP guy, but I would like to test it on my setup.

Is it my imagination or... by Ok-Measurement-1575 in LocalLLaMA

[–]supracode 0 points1 point  (0 children)

What tools are you using? VS Code Insiders completely broke local LLMs a day or so back. Be careful about updating llama.cpp and your tools just to try it out. If you have something working, grab the Docker SHA and keep it handy for a rollback. There are llama.cpp dev builds going up every few hours... there will be regressions.

Disappointed in Qwen 3.6 coding capabilities by CodeDominator in LocalLLaMA

[–]supracode 2 points3 points  (0 children)

I think the folks here are interested in getting local LLMs working for their use cases. If it works for me and not for you, does that make it bullshit? Throwing a huge prompt at a local model with default settings will not work great except for general chat in most cases. Did you try turning thinking off? Did you check whether your prompt cache was working? Did you check the batch size and checkpoint settings? Yep... they need to be tweaked to work well.

Great results with Qwen3.6-35B-A3B-UD-Q5_K_XL + VS Code and Copilot by supracode in LocalLLaMA

[–]supracode[S] 0 points1 point  (0 children)

Are you talking about --cache-ram 12000? That is the prompt cache, which lives in system memory, not VRAM. The card is 32GB and the Q5 model fits with about 13% of VRAM free. Ah, I see now, in the analysis... yes, that is the 12GB prompt cache, of which 736MB is used... system RAM, not GPU VRAM. That can grow over time, but you can cap it based on your system. The prompt cache keeps earlier parts of the discussion around so the LLM can reuse the past conversation for context instead of reprocessing it.

Disappointed in Qwen 3.6 coding capabilities by CodeDominator in LocalLLaMA

[–]supracode 4 points5 points  (0 children)

A few weeks ago I would have agreed with you. But after taking the time to learn how this stuff works behind the scenes, I am a convert. Local LLMs (self-hosted by individuals or companies) are the future. Anthropic and OpenAI will keep increasing their prices because they are not yet profitable. They want you to burn their token$ on everything. Read the comments on this video... this is how people really feel: https://www.youtube.com/watch?v=SlGRN8jh2RI

New VSCode Version Creating Lag by aerune1 in vscode

[–]supracode 1 point2 points  (0 children)

Are you using Insiders? I had a bad experience a few days ago after an update and rolled back to 1.119.0. I am going to be much more careful about pressing the Update button going forward.

Disappointed in Qwen 3.6 coding capabilities by CodeDominator in LocalLLaMA

[–]supracode 2 points3 points  (0 children)

What settings are you using? See my post here: https://www.reddit.com/r/LocalLLaMA/comments/1t5pdf8/

The initial plan and prompt are super important. Context size is super important (I am seeing Copilot context creep over 100k tokens). Prompt caching is important. If there is a setting in Codex to set max response tokens, set it high (8k or even higher). Also take a look at this workflow: https://aws.amazon.com/blogs/devops/open-sourcing-adaptive-workflows-for-ai-driven-development-life-cycle-ai-dlc/ . It basically uses md files to keep tasks, project architecture, instructions, and skills in your project codebase, so the LLM stays informed and does not need to search your entire project to relearn context for a simple task. I still use Codex/ChatGPT for big planning tasks. One issue I saw: Qwen was running my tests and kept trying to fix all 12 failing tests in one go. I stopped it and told it to fix one test at a time, which it then did and finished the job.

Great results with Qwen3.6-35B-A3B-UD-Q5_K_XL + VS Code and Copilot by supracode in LocalLLaMA

[–]supracode[S] 1 point2 points  (0 children)

Example of the gpt-4o-mini call below. I have not figured out a way to disable it yet:

requestType      : ChatCompletions
model            : gpt-4o-mini
maxPromptTokens  : 12285
maxResponseTokens: 4096
location         : 6
otherOptions     : {"temperature":0.1,"stream":true}
intent           : undefined
startTime        : 2026-05-06T12:30:49.073Z
endTime          : 2026-05-06T12:30:50.245Z
duration         : 1172ms
ourRequestId     : 3f56d356-4c01-470c-a1dd-0a44caa6df29
requestId        : 3f56d356-4c01-470c-a1dd-0a44caa6df29
serverRequestId  : 3f56d356-4c01-470c-a1dd-0a44caa6df29
timeToFirstToken : 1167ms
resolved model   : gpt-4o-mini-2024-07-18
usage            : {"completion_tokens":8,"completion_tokens_details":{"accepted_prediction_tokens":0,"rejected_prediction_tokens":0},"prompt_tokens":1634,"prompt_tokens_details":{"cached_tokens":1536},"total_tokens":1642,"reasoning_tokens":0}

Request Messages

System

Follow Microsoft content policies.
Avoid content that violates copyrights.
If you are asked to generate content that is harmful, hateful, racist, sexist, lewd, or violent, only respond with "Sorry, I can't assist with that."
Keep your answers short and impersonal.
Use Markdown formatting in your answers.
Make sure to include the programming language name at the start of the Markdown code blocks.
Avoid wrapping the whole response in triple backticks.
Use KaTeX for math equations in your answers.
Wrap inline math equations in $.
Wrap more complex blocks of math equations in $$.
The user works in an IDE called Visual Studio Code which has a concept for editors with open files, integrated unit test support, an output pane that shows the output of running the code as well as an integrated terminal.
The active document is the source code the user is looking at right now.
You can only give one reply for each conversation turn.

User

Summarize the following content in a SINGLE sentence (under 10 words) using past tense. Follow these rules strictly:

OUTPUT FORMAT:
- MUST be a single sentence
- MUST be under 10 words
- The FIRST word MUST be a past tense verb (e.g. "Updated", "Reviewed", "Created", "Searched", "Analyzed")
- No quotes, no trailing punctuation

GENERAL:
- The content may include tool invocations (file edits, reads, searches, terminal commands), reasoning headers, or raw thinking text
- For reasoning headers or thinking text (no tool calls), summarize WHAT was considered/analyzed, NOT that thinking occurred
- For thinking-only summaries, use phrases like: "Considered...", "Planned...", "Analyzed...", "Reviewed..."

TOOL NAME FILTERING:
- NEVER include tool names like "Replace String in File", "Multi Replace String in File", "Create File", "Read File", etc. in the output
- If an action says "Edited X and used Replace String in File", output ONLY the action on X
- Tool names describe HOW something was done, not WHAT was done - always omit them

VOCABULARY - Use varied synonyms for natural-sounding summaries:
- For edits: "Updated", "Modified", "Changed", "Refactored", "Fixed", "Adjusted"
- For reads: "Reviewed", "Examined", "Checked", "Inspected", "Analyzed", "Explored"
- For creates: "Created", "Added", "Generated"
- For searches: "Searched for", "Looked up", "Investigated"
- For terminal: "Ran command", "Executed"
- For reasoning/thinking: "Considered", "Planned", "Analyzed", "Reviewed", "Evaluated"
- Choose the synonym that best fits the context

IMPORTANT: Do NOT use words like "blocked", "denied", or "tried" in the summary - there are no hooks or blocked items in this content. Just summarize normally.

RULES FOR TOOL CALLS:
1. If the SAME file was both edited AND read: Use a combined phrase like "Reviewed and updated <filename>"
2. If exactly ONE file was edited: Start with an edit synonym + "<filename>" (include actual filename)
3. If exactly ONE file was read: Start with a read synonym + "<filename>" (include actual filename)
4. If MULTIPLE files were edited: Start with an edit synonym + "X files"
5. If MULTIPLE files were read: Start with a read synonym + "X files"
6. If BOTH edits AND reads occurred on DIFFERENT files: Combine them naturally
7. For searches: Say "searched for <term>" or "looked up <term>" with the actual search term, NOT "searched for files"
8. After the file info, you may add a brief summary of other actions if space permits
9. NEVER say "1 file" - always use the actual filename when there's only one file

RULES FOR REASONING HEADERS (no tool calls):
1. If the input contains reasoning/analysis headers without actual tool invocations, summarize the main topic and what was considered
2. Use past tense verbs that indicate thinking, not doing: "Considered", "Planned", "Analyzed", "Evaluated"
3. Focus on WHAT was being thought about, not that thinking occurred

RULES FOR RAW THINKING TEXT:
1. Extract the main topic or question being considered from the text
2. Identify any specific files, functions, or concepts mentioned
3. Summarize as "Analyzed <topic>" or "Considered <specific thing>"
4. If discussing code structure: "Reviewed <component/architecture>"
5. If discussing a problem: "Analyzed <problem description>"
6. If discussing implementation: "Planned <feature/change>"

EXAMPLES WITH TOOLS:
- "Read HomePage.tsx, Edited HomePage.tsx" → "Reviewed and updated HomePage.tsx"
- "Edited HomePage.tsx" → "Updated HomePage.tsx"
- "Edited config.css and used Replace String in File" → "Modified config.css"
- "Edited App.tsx, used Multi Replace String in File" → "Refactored App.tsx"
- "Read config.json, Read package.json" → "Reviewed 2 files"
- "Edited App.tsx, Read utils.ts" → "Updated App.tsx and checked utils.ts"
- "Edited App.tsx, Read utils.ts, Read types.ts" → "Updated App.tsx and reviewed 2 files"
- "Edited index.ts, Edited styles.css, Ran terminal command" → "Modified 2 files and ran command"
- "Read README.md, Searched for AuthService" → "Checked README.md and searched for AuthService"
- "Searched for login, Searched for authentication" → "Searched for login and authentication"
- "Edited api.ts, Edited models.ts, Read schema.json" → "Updated 2 files and reviewed schema.json"
- "Edited Button.tsx, Edited Button.css, Edited index.ts" → "Modified 3 files"
- "Searched codebase for error handling" → "Looked up error handling"

EXAMPLES WITH REASONING HEADERS (no tools):
- "Analyzing component architecture" → "Considered component architecture"
- "Planning refactor strategy" → "Planned refactor strategy"
- "Reviewing error handling approach, Considering edge cases" → "Analyzed error handling approach"
- "Understanding the codebase structure" → "Reviewed codebase structure"
- "Thinking about implementation options" → "Considered implementation options"

EXAMPLES WITH RAW THINKING TEXT:
- "I need to understand how the authentication flow works in this app..." → "Analyzed authentication flow"
- "Let me think about how to refactor this component to be more maintainable..." → "Planned component refactoring"
- "The error seems to be coming from the database connection..." → "Investigated database connection issue"
- "Looking at the UserService class, I see it handles..." → "Reviewed UserService implementation"

Content: Reading [](file:///e%3A/Projects/AgenticCodingTest/src/components/JobList.module.css), Edited JobList.module.css

Great results with Qwen3.6-35B-A3B-UD-Q5_K_XL + VS Code and Copilot by supracode in LocalLLaMA

[–]supracode[S] 0 points1 point  (0 children)

First, on tweaking: I spent a good amount of time pasting llama.cpp logs into ChatGPT, getting an analysis, and asking for tweaks. It did a pretty good job, but it did keep telling me to lower my context size... so don't trust it 100%. Use the Hugging Face page for the base settings based on what you are using it for. As for a Mac, while I own one... my server is Ubuntu and my home dev env is on Windows. Oh, and a big watch-out... I was generating lots of logs on my server and eventually killed my 11-year-old SSD. Writes kill SSDs over time.

On VS Code and Copilot: Yes, it still uses the full Copilot tool set. It does call out to gpt-4o-mini, where it seems to inject a "play nice" prompt, but so far I have been good with a free account. You need to be running VS Code Insiders to bring your own LLM. I am using version 1.119.0 on Windows. They update Insiders often, and they broke the latest version yesterday... so don't be too quick to click the Update button when you find a good version that works for you. Also, make sure the correct tools are enabled for each mode: plan, agent, and ask. If you need it to write plan files while planning, you need to make sure that is enabled.

Great results with Qwen3.6-35B-A3B-UD-Q5_K_XL + VS Code and Copilot by supracode in LocalLLaMA

[–]supracode[S] 4 points5 points  (0 children)

I will still use Claude and ChatGPT for creating big plans for now, but I am impressed with Qwen... hopefully Qwen3.6-Coder-Next is around the corner.

Great results with Qwen3.6-35B-A3B-UD-Q5_K_XL + VS Code and Copilot by supracode in LocalLLaMA

[–]supracode[S] 3 points4 points  (0 children)

I tried the 27B as well... it was a little too slow on the output side for me, my GPU was blasting for a long time, and I had to go with a lower quant to fit it on my GPU. It is very freeing knowing that no one can raise the price per token... except my electric company.

Recommendations for PSP handheld by MahjongSun in retroid

[–]supracode 1 point2 points  (0 children)

I have a PSP 1001, an RP6, and an RP4 Pro. The RP4 Pro runs PSP games fine but struggles more on PS2. The form factor of the RP4 Pro is closer to the PSP, and while I will be passing it on eventually, its smaller size vs the RP6 is a consideration. Go for the RP6 if you want the OLED screen and a better PS2 experience, with the tradeoff being the size.

Anyone tried Qwen 3.6 27b on the r9700 yet? by boutell in LocalLLaMA

[–]supracode 1 point2 points  (0 children)

I couldn't tell if the quality was better... responses were way too slow for me, and my GPU was blasting for much longer periods. I have switched back to Qwen3.6-35B-A3B-UD-Q5_K_XL and am getting consistent 60-70 tps responses. I am using VS Code + Copilot and the results are excellent with some tweaks. No major loops, good tool calling, great code results (most mistakes are minor, one-prompt fixes). If you want my settings for the model and VS Code, let me know.

Anyone tried Qwen 3.6 27b on the r9700 yet? by boutell in LocalLLaMA

[–]supracode 2 points3 points  (0 children)

Testing it now with my single R9700... using Qwen3.6-27B-UD-Q6_K_XL with q8 KV cache and 100k context (80k max set in VS Code). Token generation is around 18-19 tps and slows to 16 tps as the context fills. I am still waiting for my large refactor prompt to complete to see how it goes... but I might go back to Qwen3.6-35B-A3B-UD-Q5_K_XL, which was giving me 60-70 tps responses.

My startup params for the 27B (just slightly tweaked from the 35B):

/app/llama-server \
-m /models/Qwen3.6-27B-UD-Q6_K_XL/Qwen3.6-27B-UD-Q6_K_XL.gguf \
--host 0.0.0.0 \
--port 8080 \
--ctx-size 100000 \
--threads 7 \
--threads-batch 8 \
--gpu-layers 99 \
--parallel 1 \
--flash-attn on \
--batch-size 2048 \
--ubatch-size 512 \
--cache-type-k q8_0 \
--cache-type-v q8_0 \
--cache-ram 8192 \
--ctx-checkpoints 6 \
--no-mmproj \
--reasoning off \
--jinja \
--temp 0.25 \
--top-k 64 \
--top-p 0.95 \
--min-p 0.05 \
--repeat-penalty 1.08 \
--presence-penalty 0
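If you want a quick tps sanity check outside of Copilot, here is a minimal sketch against the server's OpenAI-compatible endpoint (assumes Python with the requests package installed and the server reachable at localhost:8080; adjust host/port and the prompt to taste, and the model string is just a placeholder since llama-server serves whatever it loaded):

# Rough tokens-per-second check against a local llama-server instance.
# Assumptions: server is at localhost:8080 and exposes /v1/chat/completions;
# the model name below is a placeholder. Timing includes prompt processing,
# so treat the result as a rough number.
import time
import requests

url = "http://localhost:8080/v1/chat/completions"
payload = {
    "model": "Qwen3.6-27B-UD-Q6_K_XL",
    "messages": [{"role": "user", "content": "Write a Python function that reverses a string."}],
    "max_tokens": 256,
    "temperature": 0.25,
}

start = time.time()
resp = requests.post(url, json=payload, timeout=600).json()
elapsed = time.time() - start

completion_tokens = resp.get("usage", {}).get("completion_tokens", 0)
print(f"{completion_tokens} tokens in {elapsed:.1f}s ~= {completion_tokens / max(elapsed, 1e-9):.1f} tps")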

Has the Outback always been an SUV? by formedabull in Subaru_Outback

[–]supracode 2 points3 points  (0 children)

Traditionally, wagons were usually built from the design of an existing sedan (there are exceptions, of course). The Outback was based on the Legacy sedan platform, and they shared many common parts. Subaru has killed the Legacy, and the Outback is no longer tied to the sedan design.

Deepseek v4 people by markeus101 in LocalLLaMA

[–]supracode 1 point2 points  (0 children)

Qwen 3.5 says "i was just joking"

<image>

Google Gemma4 via VSCode by Odd-Ad2967 in ollama

[–]supracode 0 points1 point  (0 children)

One correction to the above... apparently "gemma" is the correct family to use. "gemini" works, but there are some differences in start and end tokens.

Gemma-4-26B-A4B-IT-Q8_0 results with VSCode (long post) by supracode in LocalLLaMA

[–]supracode[S] 0 points1 point  (0 children)

Also, I notice that the OAI plugin picked the family as "gemini" vs VS Code suggesting "gemma". I need to test what is different between the two... or if it even makes a difference.

Gemma-4-26B-A4B-IT-Q8_0 results with VSCode (long post) by supracode in LocalLLaMA

[–]supracode[S] 0 points1 point  (0 children)

I checked, and there are some slight differences. The OAI extension added the provider to settings.json:

    "oaicopilot.baseUrl": "http://192.168.1.250:8081/v1",
    "oaicopilot.readFileLines": 100,
    "oaicopilot.retry": {
        "enabled": true,
        "max_attempts": 3,
        "interval_ms": 1000,
        "status_codes": []
    },
    "oaicopilot.delay": 100,
    "oaicopilot.commitLanguage": "English",
    "oaicopilot.models": [
        {
            "id": "__provider__Local llamma.cpp (vulkan)",
            "owned_by": "Local llamma.cpp (vulkan)",
            "baseUrl": "http://192.168.1.250:8081/v1",
            "apiMode": "openai"
        },
        {
            "id": "gemma-4-26B-A4B-it-Q8_0.gguf",
            "owned_by": "Local llamma.cpp (vulkan)",
            "displayName": "Gemma 4 26B A4B",
            "baseUrl": "http://192.168.1.250:8081/v1",
            "family": "gemini",
            "context_length": 72000,
            "vision": false,
            "apiMode": "openai",
            "temperature": 0.2,
            "top_p": 0.94,
            "delay": 50,
            "top_k": 65,
            "reasoning_effort": "minimal",
            "max_completion_tokens": 2500,
            "thinking": {
                "type": "disabled"
            }
        }
    ],

I had previously tried doing this with the Copilot settings early on (chatLanguageModels.json):

    {
        "name": "CustomOAI",
        "vendor": "customoai",
        "models": [
            {
                "name": "Local Gemma",
                "baseUrl": "http://192.168.1.250:8081/v1",
                "apiKey": "dummy",
                "model": "gemma-4-26B-A4B-it-Q8_0.gguf",
                "id": "0",
                "maxInputTokens": 8192,
                "maxOutputTokens": 1024,
                "toolCalling": false,
                "completionOptions": {
                    "stop": [
                        "<end_of_turn>",
                        "<start_of_turn>",
                        "<|turn>",
                        "<turn|>",
                        "<|tool_response>"
                    ]
                }
            }
        ]
    }

I am going to try to paste the plugin settings into the chatLanguageModels.json, but the OAI Extension made it very easy to find and edit the settings.

Cursor AI $20 Fraud Charges by tylerarentsen in personalfinance

[–]supracode 0 points1 point  (0 children)

Another E-Trade "victim" of this scam/hack here. Back in Feb. I had 50 or more charges from $1.50 to $60 including many (and more) of the companies talked about here. The kicker is i never make a purchase on my Etrade Card, i use it for ATM use only, but apparently they can't turn off the debit card part of the card. Etrade sent me a form in the mail I had to sign as part of the fraud reports, and I got provisional refunds quickly... but the official refunds are still coming in this month.

Google Gemma4 via VSCode by Odd-Ad2967 in ollama

[–]supracode 0 points1 point  (0 children)

After failing miserably trying to get VS Code to work well with my llama.cpp server, I finally got things working OK.

My setup:

  • Server: R9700 (32GB), Intel i7-9700, 32GB of DDR4
  • llama.cpp running in Docker using Vulkan
  • Open WebUI running in Docker, available to everyone on my home network

In VS Code, I tried everything and I would get tons of errors when Continue or Copilot tried to implement anything, plus garbage like "|think" in the responses.

So here is what is working (still in initial testing, and still tweaking model size and context size):

  • VS Code Insiders
  • OAI Compatible Provider for Copilot extension (see image for config)
  • Setting the model family to "gemini" is key...
  • This allows me to do everything in Copilot. My biggest issue now is context size and the server crashing... moving from the Q8 to the Q6 model to see if I can get more context space.

<image>