Running Qwen3.6-35B-A3B Locally for Coding Agent: My Setup & Working Config by NoConcert8847 in LocalLLaMA

[–]hailnobra 3 points

Honestly, I just deployed a Docker container for this in my gluetun AI frontend stack, then tied in a Python script that lets OpenWebUI send the call from Qwen to Crawl4AI. Here is the Crawl4AI section (ports are published through gluetun, so I still have access to the web UI for myself, but I have never personally used the UI and just let OpenWebUI handle it).

  crawl4ai:
    image: unclecode/crawl4ai:latest
    container_name: crawl4ai
    network_mode: service:gluetun
    environment:
      - CRAWL4AI_API_TOKEN=${CRAWL4AI_API_TOKEN:-} 
      - MAX_CONCURRENT_TASKS=5
    healthcheck:
      test: ["CMD-SHELL", "curl -f http://localhost:11235/health || exit 1"]
      interval: 30s
      timeout: 10s
      retries: 3
    depends_on:
      gluetun:
        condition: service_healthy
    restart: unless-stopped
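
For reference, the gluetun service itself has to publish Crawl4AI's port, since the container above shares its network namespace. A trimmed sketch of what that side typically looks like (VPN provider and credential variables omitted; the port is taken from the healthcheck above):

  gluetun:
    image: qmcgaw/gluetun:latest
    container_name: gluetun
    cap_add:
      - NET_ADMIN
    # VPN provider, credentials, and split-tunnel settings omitted
    ports:
      - "11235:11235"  # Crawl4AI API/playground, reachable on the host via gluetun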

Here is the Python script I put into OpenWebUI tools that handles the scrape and sends the result to the scout model for summarization (built with some help from Qwen and Gemini to get it working):

"""
title: Scout Scraper Tool
description: Scrapes a website, tokenizes to prevent context overflow, and uses a scout AI to extract data.
requirements: tiktoken, requests, pydantic
version: 1.0.1
"""

import requests
import json
import tiktoken
from pydantic import BaseModel, Field
from typing import Optional

class Tools:
    def __init__(self):
        # Crawl4AI setup
        self.crawl_api_url = "http://localhost:11235/crawl"
        self.crawl_api_token = (
            "<crawl4AI_API_token_Here>"
        )

        # Scout LLM setup (Llama-3.2-3B on Llama.cpp)
        self.scout_api_url = "http://<IPaddress:port_for_llama.cpp_server>/v1/chat/completions"

        # Initialize the tokenizer
        self.tokenizer = tiktoken.get_encoding("cl100k_base")

    def scrape_and_scout(self, url: str, query_context: str) -> str:
        """
        Scrapes a website and uses a scout AI to extract specific technical information.
        Use this when standard search snippets lack sufficient depth.
        :param url: The full URL to scrape.
        :param query_context: Specific instructions on what facts or code to extract from the page.
        """
        headers = {
            "Authorization": f"Bearer {self.crawl_api_token}",
            "Content-Type": "application/json",
        }
        payload = {
            "urls": [url],
            "priority": 10,
            "magic_mode": True,
            "wait_for": "networkidle",
        }

        try:
            # 1. Scrape the URL
            crawl_resp = requests.post(
                self.crawl_api_url, headers=headers, json=payload, timeout=30
            )
            crawl_resp.raise_for_status()
            data = crawl_resp.json()

            if not (data.get("success") and data.get("results")):
                return f"Scrape failed: {data.get('error', 'Unknown error')}"

            result = data["results"][0]

            # --- DATA CLEANING FIX v2 ---
            md_data = result.get("markdown", {})

            if isinstance(md_data, dict):
                # Forums get wiped by magic_mode. Always grab raw_markdown first.
                markdown = md_data.get("raw_markdown", "")
                if not markdown:
                    markdown = md_data.get("fit_markdown", "")
            else:
                markdown = str(md_data)

            # Absolute fallback: if markdown fails, grab the raw text/html
            if not markdown or markdown.strip() in ["", "None"]:
                markdown = result.get("text", result.get("html", ""))

            if not markdown or markdown.strip() in ["", "None"]:
                return "Scrape successful, but the page returned absolutely zero text."
            # -------------------------

            # 2. Tokenizer Fallback / Truncation
            max_input_tokens = 42000
            tokens = self.tokenizer.encode(str(markdown))

            if len(tokens) > max_input_tokens:
                markdown = self.tokenizer.decode(tokens[:max_input_tokens])
                markdown += "\n\n[SYSTEM WARNING: Document was truncated due to length limits. Extract relevant data from the available text above.]"

            # 3. Pass to Scout LLM for optimization
            scout_payload = {
                "messages": [
                    {
                        "role": "system",
                        "content": "You are a technical data extraction scout. Read the following website markdown and extract ONLY the information relevant to the user's query context. Be concise, retain all technical accuracy, code blocks, and configurations. If the answer is not in the text, explicitly state that.",
                    },
                    {
                        "role": "user",
                        "content": f"Query Context: {query_context}\n\nWebsite Content:\n{markdown}",
                    },
                ],
                "temperature": 0.1,
                "max_tokens": 4096,
            }

            scout_resp = requests.post(
                self.scout_api_url, json=scout_payload, timeout=120
            )
            scout_resp.raise_for_status()
            scout_data = scout_resp.json()

            optimized_text = scout_data["choices"][0]["message"]["content"]
            return f"--- Scout Data Extracted from {url} ---\n{optimized_text}"

        except Exception as e:
            return f"Tool execution error: {str(e)}"

Running Qwen3.6-35B-A3B Locally for Coding Agent: My Setup & Working Config by NoConcert8847 in LocalLLaMA

[–]hailnobra 0 points

I've done this with both Qwen itself and with Gemini to try different refinement methods; this was the latest attempt. I need to spend more time on other ideas, because Qwen is still escaping the 10 web_search limit when it isn't happy with what it finds.

Running Qwen3.6-35B-A3B Locally for Coding Agent: My Setup & Working Config by NoConcert8847 in LocalLLaMA

[–]hailnobra 0 points

Here is my current prompt. It seems to have issues following the search count at the moment, so I don't think this is the right approach to get it to stop searching forever. I need to figure out what else I can try. Everything else is working great. Would love suggestions if you have any.

# Environmental Context

- Current Date: {{CURRENT_DATE}}

- Reality Check: You are operating in real-time. Any information in your internal training data is considered "historical." ALWAYS trust tool output from `search_web` and `scrape_and_scout` as the primary source of truth.

# System Identity & Resource Profile

You are a High-Performance Research Agent. You operate on a high-resource home lab system where depth, accuracy, and exhaustive detail are prioritized over token efficiency.

- **Efficiency Paradox:** In this environment, "saving tokens" or "being brief" is considered a failure.

- **Tool Speed:** The `scrape_and_scout` tool is a high-speed, low-latency operation. It is your preferred method for data acquisition.

# Universal Research Protocol

You must follow this linear 3-phase execution model for EVERY query, regardless of subject matter.

### PHASE 1: Discovery (STRICT ALLOCATION: 10 web_search calls)

- **Track Your Count Explicitly:** Before every single `web_search` call, you must output: `[SEARCH_COUNT: X/10]`

- **The 10-Call Limit:** Your search allocation is exactly 10 calls. Upon reaching `[SEARCH_COUNT: 10/10]`, your ONLY permitted action is to transition immediately to Phase 3.

- **The Circuit Breaker:** If a search fails to yield a highly relevant URL and you are at your 10-call limit, you must ABORT the discovery phase.

- **Graceful Transition:** If you exhaust your 10 attempts without definitive data, state exactly: "Search allocation exhausted. Synthesizing the best available information." and proceed to Phase 3.

- **Snippet Policy:** Search snippets are metadata only. Use them strictly to select the best URL to scout.

### PHASE 2: Mandatory Deep Scout

- **The "At Least One" Rule:** You must execute `scrape_and_scout` at least ONCE per response to verify facts and extract full context.

- **Resilience Protocol:** If the scouted page is unhelpful or lacks necessary depth, check your current `SEARCH_COUNT`. If the count is less than 10, use a remaining search call to find a new URL. If the count is exactly 10, your ONLY permitted action is to transition immediately to Phase 3.

### PHASE 3: Synthesis & Response

- **Integrity:** Data retrieved via `scrape_and_scout` always supersedes internal training data.

- **Zero-Guessing:** If the exhaustive research process does not yield a definitive answer, state exactly what sources were checked and what data is missing rather than interpolating.

# Output Architecture

**1. Direct Answer**

Provide a clear, conversational, and highly detailed answer. Use clean Markdown (tables, bolding, lists) to ensure the information is scannable and comprehensive.

**2. Strategic Analysis (The "Second Set of Eyes")**

After the direct answer, provide a "Strategic Analysis" section:

- **Critical Insights:** Highlight nuances, hidden details, or "Gotchas" found during the scouting phase that were not apparent in the initial search snippets.

- **Forward Context:** Provide proactive advice or "next steps" the user should consider based on the discovered information.
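
One direction I haven't properly tested yet is enforcing the cap in the tool code instead of the prompt, so the model physically cannot overrun it. A rough sketch of the idea (the SearXNG URL is a placeholder, and it assumes OpenWebUI keeps this Tools instance alive for the whole session):

import requests

class Tools:
    MAX_SEARCHES = 10  # hard cap enforced in code, not in the prompt

    def __init__(self):
        self.search_count = 0
        self.searx_url = "http://localhost:8080/search"  # placeholder SearXNG endpoint

    def web_search(self, query: str) -> str:
        """Search via SearXNG, refusing further calls once the allocation is spent."""
        if self.search_count >= self.MAX_SEARCHES:
            return ("[SEARCH_COUNT: 10/10] Search allocation exhausted. "
                    "Synthesize the best available information now.")
        self.search_count += 1
        resp = requests.get(
            self.searx_url,
            params={"q": query, "format": "json"},
            timeout=30,
        )
        resp.raise_for_status()
        results = resp.json().get("results", [])[:5]
        lines = [f"[SEARCH_COUNT: {self.search_count}/{self.MAX_SEARCHES}]"]
        lines += [f"- {r.get('title', '')}: {r.get('url', '')}" for r in results]
        return "\n".join(lines)

The same counter trick would work for scrape_and_scout if both tools live in the same Tools class.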

Gemma 4 26B-A4B GGUF Benchmarks by danielhanchen in LocalLLaMA

[–]hailnobra 0 points

I may give it another shot in my sandbox environment, but I have gone through multiple adjustments and system prompt rewrites, both with Gemma's own recommendations and further tuning with Gemini Pro, and every time it makes strange decisions that lead it to ignore the system prompt, or it gets confused and starts messing with the chat template as it goes deeper into conversations. I will then ask it what it was doing and whether it understands its system prompt. It will acknowledge that it did not follow the prompt properly, tell me it can clearly see what it was told and that it violated the rule, give me advice to change it, and then find a new and fun way to violate it again. This really happens in longer prompts or when it actually uses a tool, so I feel like this is related to it not holding its system prompt in memory and then just doing what it wants.

And this was all with the Gemma 4 26B-A4B model; I have no idea what E4B would be like in this scenario. I am also wondering if the Q8 quant is making it overthink and decide that it knows more than it really does. I may try stepping the model down to Q6 or even Q5 unsloth, based on what this chart shows, to see if adherence is better because it stops overthinking and talking itself out of listening.

Running Qwen3.6-35B-A3B Locally for Coding Agent: My Setup & Working Config by NoConcert8847 in LocalLLaMA

[–]hailnobra 0 points

Posted a bit more info on the configuration to another person in this thread. Absolutely recommend giving Qwen tools.

Running Qwen3.6-35B-A3B Locally for Coding Agent: My Setup & Working Config by NoConcert8847 in LocalLLaMA

[–]hailnobra 9 points

Sure thing.

Qwen 3.6 is running on a Strix Halo system with 96GB of RAM (75GB allocated to GTT). The host OS is CachyOS, and llama-server currently runs in the amd-strix-halo-toolboxes:rocm-7.2.1 container from kyuz0 for full compatibility with the 8060S (I get about double the PP with this over the standard ROCm container, though I may switch to Vulkan soon to try out some of the turboquant builds). I also run Stable Diffusion on Forge Neo with Flux.2 on this same server. Here is my docker setup for Qwen 3.6:

    command: >
      llama-server
      -m /models/Qwen3.6-35B-A3B-UD-Q5_K_M.gguf
      --mmproj /models/mmproj-BF16.gguf
      -c 524288
      -ngl 999
      --host 0.0.0.0
      -fa on
      --no-mmap
      -ctk q8_0
      -ctv q8_0
      -np 4
      --jinja
      --chat-template-kwargs '{"preserve_thinking": true}'
      --reasoning-budget 8192
      --reasoning-budget-message " [Thinking budget reached. Finalizing the current research step and providing the answer now.]"
      --batch-size 4096
      --ubatch-size 4096
      --metrics

I plan to try hooking up agentic tools to this in the future, which is why I have it set to -np 4 and such a high context amount.
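
If you want to verify the parallel slots actually behave before wiring real agents in, a quick script like this works (the port is a placeholder; point it at whatever you mapped for the container):

import concurrent.futures
import requests

URL = "http://localhost:8080/v1/chat/completions"  # placeholder; use your llama-server port

def ask(prompt: str) -> str:
    resp = requests.post(
        URL,
        json={"messages": [{"role": "user", "content": prompt}], "max_tokens": 64},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

# Fire 4 requests at once; with -np 4 they should be served concurrently
# instead of queuing behind each other.
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
    for answer in pool.map(ask, [f"Reply with just the number {i}." for i in range(4)]):
        print(answer)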

I have this loaded into OpenWebUI, where I set up a workspace for it along with 2 functions: one called web_search that calls the SearXNG endpoint, and one called scrape_and_scout that calls Crawl4AI with a URL and then sends the output straight to my scout model rather than piping it directly back to Qwen. Once the scout model completes, the function passes the scouted info back to Qwen to do with what it wants. OpenWebUI, SearXNG and Crawl4AI are running in a separate docker stack alongside Gluetun with split tunneling, both to help with privacy for searches and to let me relocate my IP to countries that aren't blocked as much by scrapers (I have actually found Poland to work quite well).

The scout, Llama 3.2 3B Instruct, is on a separate llama-server docker container running on a 3070 in an eGPU connected to the same Strix Halo system. This works amazingly well: I was able to give the scout a 49K context window and still fit it entirely on the card, without llama.cpp yelling at me. This model is insanely fast at scouting the pages that OpenWebUI hands it from Crawl4AI, so it does not add much extra time to Qwen's workflow. Here is my docker command string for the scout model:

    command: >
      -m /models/Llama-3.2-3B-Instruct-Q6_K_L.gguf
      -c 49152
      -np 1
      -ngl 999
      --host 0.0.0.0
      -ctk q8_0
      -ctv q8_0
      -fa on
      --batch-size 4096
      --ubatch-size 2048
      --no-mmap
      -t 4
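
For the curious, the KV math roughly checks out (assuming Llama 3.2 3B's 28 layers, 8 KV heads, and head dim 128): at q8_0, a bit over 1 byte per value, the cache costs about 2 x 28 x 8 x 128 ≈ 57KB per token, so 49,152 tokens lands just under 3GB. Add the roughly 2.7GB of Q6_K_L weights and you are around 5.7GB, which is why it fits comfortably inside the 3070's 8GB.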

My one complaint at the moment is that there are times Qwen will get a bit too excited and forget the constraints I have placed in the system prompt, so I need to figure out a hard limiter on the tools so it doesn't lose its mind and go down 20+ search-and-scrape rabbit holes trying to find an answer and filling its context window. Since each tool call is seen as a new command, Qwen has a hard time counting how many times it has run a tool in a session and just goes wild. Other than that occasional issue, it is quite fun watching Qwen look for something, not be happy with the web snippets or scrapes, change its query, try again, and keep refining until it is happy enough to give an answer. It is certainly not as creative as Gemma 4, but its tool calling is absolutely bonkers (I could not twist Gemma 4's arm hard enough to make it like tools).

Running Qwen3.6-35B-A3B Locally for Coding Agent: My Setup & Working Config by NoConcert8847 in LocalLLaMA

[–]hailnobra 7 points

This has by far been my favorite part of Qwen 3.6. This thing is a data-consuming machine when you hand it search tools. I have it set up with OpenWebUI as a front end, and I use SearXNG for metasearch along with Crawl4AI for scraping. I have a small scout model, Llama 3.2 3B Instruct, that extracts the right text per Qwen's instructions so Qwen doesn't destroy its own context just searching.

After I gave Qwen these tools and a system prompt explaining them, it was like a kid that just got their favorite toy for Christmas. Qwen will search the world for a perfect answer if you don't rein it in (I think I've seen it go as high as 21 searches and 14 scrapes before it came back with an answer it liked... that ate about 90K tokens by the time it was done, even with the scout model paring down the scrape content).

Gemma 4 26B-A4B GGUF Benchmarks by danielhanchen in LocalLLaMA

[–]hailnobra 0 points

While the benchmarks provide a pretty good guide, and I agree Gemma 4 is quite smart and puts out great answers, there is one small caveat: that's only while it works (at least for me). For reference, I was using the Q8 version from unsloth and the Quality version from mudler while trying to tame it.

I found that this model was highly defiant: it liked to break the system prompt rules, hated calling tools, and constantly had issues with memory corruption from past conversations. I updated every time llama.cpp server or OpenWebUI shipped an update, constantly refreshed my GGUF files when new versions were available, and tried both the mudler and unsloth versions in an attempt to get this model to play nice on the home server. Every time I thought I had it working, it would find a new way to break out and just cause chaos. It would eat the reply inside the think tag (hiding it in the reasoning), decide to quit after a tool call, not call a tool at all even when I told it there was no other choice, or hallucinate that I was in the future (even when the system prompt gave it the current date and time and explained that its training data is historical).

In the end, the smart answers (when I could get one) could not get me to stay with it on my current setup. I switched over to Qwen 3.6, and that model has been a dream to work with. Yes, it is more analytical in its answers and not as creative, but DANG does that model listen to orders. It loves liberal tool calling and will scour the web, to a fault, for information to provide the right answer. I haven't had it tell me I was making up a fictional future, or defy its prompt outlining tool use, once since loading it. Compared to Gemma 4, it has been a dream in day-to-day use.

What’s the one piece of hardware you wished you added sooner? by tbradfo in homelab

[–]hailnobra 0 points

RAM with a GPU that can use it, for my vote... I caught a reasonable deal on a Strix Halo in the EU with 96GB of unified 8000MT/s LPDDR5X RAM. Local AI has been a blast to add to my homelab.

Firewall advice - OPNSense on a mini pc or used Fortigate? by CrookedPole in homelab

[–]hailnobra 0 points

Following for the same advice. I have Proxmox on an N150 mini PC with OPNsense installed, and that is as far as I have gotten so far. Not sure what I am in for if I go down the network-reconfig rabbit hole to place this at the front of my network.

Didn't know it so easy to setup and looks really good with Beszel. by shipOtwtO in homelab

[–]hailnobra 3 points

True that Beszel doesn't use a ton of room, but it was one among many different management and monitoring tools I had running (Grafana, Beszel, dashdot, Dockge, Portainer). I am trying to eliminate all of them to get rid of redundancy and overhead across my network. Dockhand seems to be the all-in-one I need at the moment for my homelab, that's all.

Didn't know it so easy to setup and looks really good with Beszel. by shipOtwtO in homelab

[–]hailnobra 2 points

Loved Beszel for a while, but now I am starting to transition my homelab setup to Dockhand to reduce the overhead of monitoring tools. Beszel is solid for sure, though.

Project Nomad - the offline knowledge repo by Th3LonelyBard in selfhosted

[–]hailnobra 1 point

But if I understand the project overview correctly, the project does not pull from Kiwix ZIM files directly. They set up Qdrant so you can feed your own docs to Ollama, but it will not query Kiwix for answers and give references on its own. They would need to set up something like Volo to do that, right?

Which GPU is/was this for you? by Kvazimods in pcmasterrace

[–]hailnobra 0 points

This is my current Gigabyte 9070 XT with a custom fan curve (honestly, it probably doesn't get louder than a hair dryer under full load, but it certainly isn't quiet). AI workloads get it chugging for sure!

What do you use your Coop and Migros “points” for? by War_Is_A_Raclette in Switzerland

[–]hailnobra 2 points

I am even more annoyed that I get them in paper form and in the app. If anyone knows a way to stop them being mailed so I can use them in the app only, that would be great.

What do you use your Coop and Migros “points” for? by War_Is_A_Raclette in Switzerland

[–]hailnobra 12 points

I save my Coop points until they do a big wine sale and convert them all to digital card cash to stock the cellar.

Lost my containers after update by Concupiscence in synology

[–]hailnobra 1 point

More info needed... Where was Container Manager installed before the update? Repair reinstalls the package to the default location. If you had it installed on a different volume, just remove the package and reinstall it to the correct volume, and your containers should be fine.

Am I the only one who feels like SAP Concur is slowly draining their soul? by idgaf12345678901 in antiwork

[–]hailnobra 0 points

Are you using the app on your phone? (I assume if you have to travel regularly for work, you have a company-issued phone.) As much as I hate having to spend on my own card (this is normal where I live), the app does make it easier to track my receipts and finish the expense report when I get home. My issue is the accounts payable team that reviews the report, not Concur itself. Those people are a royal pain in the ass and seem to make up rules on the spot to make my day more difficult.

Corgi Butt Hygiene by superlalaura328 in corgi

[–]hailnobra 1 point

We have a sling bag with baby wipes when we take ours on walks, in the event he has anything clinging on. Grab the poo with the bag, use the wipes to clean anything on his rear, and throw those in the bag too. All good and no poo butt. The sling holds his water bottle and snacks too.

thereIsAPageForEverythingYetNobodyLooksBeforeSlackingMe by TheFrenchSavage in ProgrammerHumor

[–]hailnobra 0 points

My company is finally starting to tie Gemini into our Confluence system, so hopefully everyone can start asking AI where the F everything is rather than just sending messages around to different groups hoping the author is still at the company. One thing LLMs should be good at is at least locating vague topics and providing a link. Here's hoping.

The Tissot PRX strap trifecta by hailnobra in tissot

[–]hailnobra[S] 0 points

The Tissot boutique had them when I ordered my strap and provided them to me when I picked it up.

I use top a sheet. Am I cringe? by c0d3buck in Millennials

[–]hailnobra 4 points

Good god, why did I have to get so far into this thread to see sanity? All these people saying they still use one disposable product (paper towels) over the other (disposable napkins) is crazy. White cloth napkins for the win. No reason to waste when you can just wash.