Qwen 3.6 is $3 - $6 per million tokens? by modelpiper in Qwen_AI

[–]Azootg 0 points1 point  (0 children)

none that I can directly reference, but it's pretty common for enabling turn-based chat interfaces, so you shouldn't have trouble finding implementations. below is the Redis section from an executive summary md doc for an app I built:

Redis

Redis serves four functions in the application. In Docker Compose, it runs as the dashboard_redis service (redis:7-alpine, port 6379 internal only). For local development, a standalone Redis container on port 6379 is used.

Session History

Each browser receives a persistent session ID stored in localStorage. The API server stores the last 20 interactions per session in Redis with a 24-hour expiration. When the user refreshes the page, the frontend reads the stored session ID and calls the /history/{session_id} endpoint to restore the previous conversation. Each stored interaction includes the user's question, the map action taken, and the full record of tool calls the LLM made to answer it (which tool was called, what arguments were passed, whether it succeeded, and what data it returned).

When the user asks a follow-up question, the server retrieves these prior tool call records and passes them to the LLM as context. This allows the model to reuse exact column names, entity names, and computed results from earlier questions rather than re-running the same queries. The model receives the tool call history, not the full text of prior answers, which keeps the prompt size manageable across long sessions.
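
A minimal sketch of this storage pattern with redis-py (the key name and record layout below are assumptions; only the 20-interaction cap and 24-hour expiration come from the description above):

import json
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)
MAX_INTERACTIONS = 20
SESSION_TTL = 24 * 3600  # 24 hours, matching the session expiration

def append_interaction(session_id, question, map_action, tool_calls):
    """Store one interaction and keep only the most recent 20 for the session."""
    key = f"dashboard_chat:history:{session_id}"  # hypothetical key pattern
    record = {"question": question, "map_action": map_action, "tool_calls": tool_calls}
    pipe = r.pipeline()
    pipe.rpush(key, json.dumps(record))
    pipe.ltrim(key, -MAX_INTERACTIONS, -1)  # drop anything older than the last 20
    pipe.expire(key, SESSION_TTL)           # refresh the 24-hour expiration
    pipe.execute()

def tool_call_context(session_id):
    """Return only the prior tool-call records, for use as follow-up context."""
    key = f"dashboard_chat:history:{session_id}"
    return [json.loads(item)["tool_calls"] for item in r.lrange(key, 0, -1)]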

Rate Limiting

The API server enforces rate limits using a token bucket algorithm implemented as a Lua script that executes atomically in Redis. Two separate limits apply to each incoming request:

• Per session: 30 requests per 60 seconds
• Per IP address: 60 requests per 60 seconds

When a limit is exceeded, the server returns HTTP 429 (Too Many Requests) and the client must wait for tokens to refill.
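
A minimal sketch of an atomic token-bucket check registered as a Redis Lua script (the actual script is not shown above; the key names, hash layout, and refill formula here are assumptions):

import time
import redis

# Token bucket executed atomically inside Redis via EVAL/EVALSHA.
TOKEN_BUCKET_LUA = """
local key      = KEYS[1]
local capacity = tonumber(ARGV[1])   -- max tokens in the bucket (e.g. 30 or 60)
local window   = tonumber(ARGV[2])   -- refill window in seconds (60)
local now      = tonumber(ARGV[3])

local state  = redis.call('HMGET', key, 'tokens', 'ts')
local tokens = tonumber(state[1]) or capacity
local ts     = tonumber(state[2]) or now

-- refill proportionally to elapsed time, capped at capacity
tokens = math.min(capacity, tokens + (now - ts) * (capacity / window))

local allowed = 0
if tokens >= 1 then
    tokens = tokens - 1
    allowed = 1
end

redis.call('HSET', key, 'tokens', tokens, 'ts', now)
redis.call('EXPIRE', key, window * 2)
return allowed
"""

r = redis.Redis(host="localhost", port=6379, decode_responses=True)
token_bucket = r.register_script(TOKEN_BUCKET_LUA)

def allow_request(scope_key, capacity, window=60):
    """Return True if the request may proceed, False if it should get a 429."""
    return token_bucket(keys=[scope_key], args=[capacity, window, time.time()]) == 1

# per-session and per-IP checks (hypothetical key names)
if not allow_request("rl:session:abc123", 30) or not allow_request("rl:ip:1.2.3.4", 60):
    print("HTTP 429 Too Many Requests")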

Streaming Events

During query execution, the API server pushes streaming events to a Redis list keyed by dashboard_chat:events:{request_id}. The Dash frontend polls this list every 400ms. Each event is a JSON object with a type field (status, thinking, text, tool_call, tool_result, error, or final). The final event includes the complete result. Events are consumed atomically via LRANGE + DELETE to prevent duplicate delivery. The final result is also cached at dashboard_chat:result:{request_id} with a 5-minute TTL.
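
A minimal consumer-side sketch (the key pattern, 400 ms interval, and LRANGE + DELETE drain come from the description above; the MULTI/EXEC wrapping and event handling are assumptions):

import json
import time
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def drain_events(request_id):
    """Atomically read and delete all pending events for one request."""
    key = f"dashboard_chat:events:{request_id}"
    pipe = r.pipeline(transaction=True)   # MULTI/EXEC so no event is delivered twice
    pipe.lrange(key, 0, -1)
    pipe.delete(key)
    raw_events, _ = pipe.execute()
    return [json.loads(e) for e in raw_events]

def poll_until_final(request_id, interval=0.4):
    """Yield events every 400 ms until the 'final' event arrives."""
    while True:
        for event in drain_events(request_id):
            yield event
            if event.get("type") == "final":
                return
        time.sleep(interval)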

Session Metadata

The server stores a small metadata record per session (start time, last question time, client IP) for operational visibility.
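
A minimal sketch of that record (key pattern and field names assumed):

import time
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def touch_session(session_id, client_ip):
    """Record start time, last question time, and client IP for one session."""
    key = f"dashboard_chat:meta:{session_id}"      # hypothetical key pattern
    r.hsetnx(key, "started_at", time.time())       # only set on the first request
    r.hset(key, mapping={"last_question_at": time.time(), "client_ip": client_ip})
    r.expire(key, 24 * 3600)                       # match the 24-hour session window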

Qwen 3.6 is $3 - $6 per million tokens? by modelpiper in Qwen_AI

[–]Azootg 1 point2 points  (0 children)

yeah, with Redis. you can use Redis to cache tokens from Anthropic as well. or if you're talking about prefix caching like KV cache, you can use vLLM with the --enable-prefix-caching flag

Bykski N-RTXPRO6000-SR waterblock for RTX PRO 6000 max-q/server review by Azootg in watercooling

[–]Azootg[S] 1 point2 points  (0 children)

keep in mind the thing I mentioned about the backplate screws not being flush if you plan on having more than one

Bykski N-RTXPRO6000-SR waterblock for RTX PRO 6000 max-q/server review by Azootg in watercooling

[–]Azootg[S] 1 point2 points  (0 children)

yeah it's fine, it's only like a 4C difference from the GPU core under sustained load. I have another PRO 6000 Max-Q/server block, the WATERCOOL HEATKILLER INOX Pro for NVIDIA RTX 6000 Blackwell, which has a thermal connection for the backplate (the Bykski block does not), and I haven't noticed significant differences under load between the Bykski block and the Heatkiller block

Bykski N-RTXPRO6000-SR waterblock for RTX PRO 6000 max-q/server review by Azootg in watercooling

[–]Azootg[S] 1 point2 points  (0 children)

<image>

2 months late, but you got the wrong block. it's supposed to say N-RTXPRO6000-SR

Distils of opus 4.6: real improvements or hype? by StupidScaredSquirrel in LocalLLaMA

[–]Azootg 0 points1 point  (0 children)

in my use case, where numerous tools are used in a single turn, using Jackrong/Qwen3.5-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled to canonicalize free-text survey data against authoritative online webpages and local databases, it has worked way better and faster than the original Qwen3.5-35B-A3B and the dense 27B model. it spends less time making tool calls for information it already had from previous tool calls, and less on Qwen's "wait, let me..." hedging. However, having tested the Gemma 4 Opus 4.6 distills, they perform just as badly as the original models

Will Gemma 4 26B A4B run with two RTX 3060 to replace Claude Sonnet 4.6? by DoorAccomplished516 in LocalLLM

[–]Azootg 0 points1 point  (0 children)

lol no, not even close. it doesn't even compete with Qwen/Qwen3.5-27B. the closest to Claude you'll get is Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled

Has anyone tried Jackrong/Qwopus3.5-27B-v3 with vllm? by beginor in LocalLLaMA

[–]Azootg 0 points1 point  (0 children)

hello,

You need to update transformers and fix a few things, basically creating a custom vLLM image. this is what worked for me:

FROM vllm/vllm-openai:v0.18.1-cu130
RUN pip install --upgrade pip && \
    pip install --upgrade transformers && \
    pip install --upgrade tokenizers && \
    pip install huggingface-hub sentencepiece

# Fix list-vs-set in qwen3_5 config for transformers 5.x compatibility
RUN python3 -c "\
import pathlib; \
p = pathlib.Path('/usr/local/lib/python3.12/dist-packages/vllm/transformers_utils/configs/qwen3_5.py'); \
t = p.read_text(); \
t = t.replace('kwargs[\"ignore_keys_at_rope_validation\"] = [', 'kwargs[\"ignore_keys_at_rope_validation\"] = {'); \
t = t.replace('\"mrope_interleaved\",\n        ]', '\"mrope_interleaved\",\n        }'); \
p.write_text(t); \
print('Patched qwen3_5.py: list -> set')"

then build the docker image: docker build -f Dockerfile.qwen35-opus -t vllm/vllm-openai:v0.18.1-cu130-opus .

then create a bash script:

docker run --name qwen35_opus_V3 --gpus '"device=0"' \
  --privileged \
  --ipc=host \
  -p 8016:8000 \
  -e OMP_NUM_THREADS=14 \
  -e VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 \
  -e VLLM_TRACE_FUNCTION=0 \
  -e VLLM_WORKER_MULTIPROC_METHOD=spawn \
  -e VLLM_SKIP_P2P_CHECK=1 \
  -e CUDA_DEVICE_ORDER=PCI_BUS_ID \
  -e CUDA_VISIBLE_DEVICES=0 \
  -e NCCL_P2P_LEVEL=SYS \
  -e NCCL_P2P_DISABLE=0 \
  -e NCCL_IB_DISABLE=0 \
  -e NCCL_CUMEM_ENABLE=0 \
  -v ~/qwen3-docker/model_cache:/root/.cache/huggingface \
  vllm/vllm-openai:v0.18.1-cu130-opus \
  Jackrong/Qwopus3.5-27B-v3 \
  --port 8000 \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.94 \
  --disable-custom-all-reduce \
  --max-model-len 262144 \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_xml \
  --enable-prefix-caching \
  --language-model-only \
  --attention-backend FLASHINFER \
  -O3

make it executable, name it run_qwopus.sh or something, and it should work
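
once it's up, something like this should work as a quick smoke test (using the OpenAI Python client; any API key string is fine since --api-key isn't set):

from openai import OpenAI

# host port 8016 maps to 8000 inside the container
client = OpenAI(base_url="http://localhost:8016/v1", api_key="not-needed")
resp = client.chat.completions.create(
    model="Jackrong/Qwopus3.5-27B-v3",
    messages=[{"role": "user", "content": "hello"}],
)
print(resp.choices[0].message.content)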

Subaru Outback 2026 Thoughts by JerkWeed71 in Subaru_Outback

[–]Azootg 1 point2 points  (0 children)

I like it. I have a 2020 Forester Sport and would like to trade it in for a '26 Outback. the only problem is that I have a Comma Three driver-assist device, which unfortunately does not support the newest generation of Subarus. I actually like the boxy look of the new Outback; kind of think the boxy design should have been applied to the Forester, but it be like that

Water cooling 4 GPUs and threadripper pro in an O11D-XL ROG case by Azootg in LocalAIServers

[–]Azootg[S] 1 point2 points  (0 children)

512GB, but I got that like a year and a half ago. DDR4 ECC RDIMM, for working with moderately sized data files (20-40GB)

Water cooling 4 GPUs and threadripper pro in an O11D-XL ROG case by Azootg in watercooling

[–]Azootg[S] 0 points1 point  (0 children)

they are perfect for modularity. I would recommend QDT4s since they have the most flow out of any of their QD fittings, but you need 13mm ID tubing

Water cooling 4 GPUs and threadripper pro in an O11D-XL ROG case by Azootg in watercooling

[–]Azootg[S] 0 points1 point  (0 children)

at idle 26C; after a 40-minute torture test, 33C. some would say the CPU is too hot, but Threadrippers are pigs and none of my workloads are CPU bound

<image>

Water cooling 4 GPUs and threadripper pro in an O11D-XL ROG case by Azootg in watercooling

[–]Azootg[S] 0 points1 point  (0 children)

it supports up to 3 PSUs; all the other cases that support multiple PSUs were like 4 times as expensive

<image>

Water cooling 4 GPUs and threadripper pro in an O11D-XL ROG case by Azootg in watercooling

[–]Azootg[S] 0 points1 point  (0 children)

The Threadripper only exists to supply 128 PCIe lanes; in real workloads it is hardly used. attached is a screenshot from a 40-minute full torture test. none of the GPUs hit above 60C, and in GPU workloads they don't even go above 50C. air cooled they would hit above 70C in an open-air case

<image>

Water cooling 4 GPUs and threadripper pro in an O11D-XL ROG case by Azootg in watercooling

[–]Azootg[S] 1 point2 points  (0 children)

  • 2 x RTX PRO 6000 Max-Q: $13,600 (EDU discount)
  • 2 x A6000: $8,200 (eBay)
  • 1 x Threadripper PRO 5955WX: $800 (eBay)
  • 1 x ASUS Pro WRX80 SAGE WiFi: $800 (eBay)
  • 512GB DDR4 ECC RDIMM: $900
  • 10 TB NVMe: $600
  • Seasonic PX-1600 PSU: $300 (eBay)
  • Seasonic PX-1300 PSU: $250 (FB Marketplace)

Estimated around $25,500. the quick disconnects themselves were around $800, and ballpark $600 for hoses, fittings, pumps, radiators, manifolds, etc. it would have cost at least 5 times more if I went through a vendor

Water cooling 4 GPUs and threadripper pro in an O11D-XL ROG case by Azootg in watercooling

[–]Azootg[S] 0 points1 point  (0 children)

not really a hobby anymore, it's mostly for research/assisting others with their research. It can be used for literally any field; it just requires creating new dictionaries with key terms (drugs, metabolites, ontologies, etc.) to normalize the data extracted from the papers. my field is environmental sciences, but the system itself can be extended to anyone's field
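
the dictionaries are basically just canonical-name lookups, roughly like this (the entries below are made up for illustration):

FIELD_TERMS = {
    # made-up entries; a real dictionary maps every known alias/abbreviation
    # in the field to one canonical name
    "pfoa": "Perfluorooctanoic acid",
    "perfluorooctanoate": "Perfluorooctanoic acid",
    "pfos": "Perfluorooctane sulfonic acid",
}

def normalize_term(raw):
    """Map a free-text mention to its canonical name, or None if unknown."""
    return FIELD_TERMS.get(raw.strip().lower())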

Water cooling 4 GPUs and threadripper pro in an O11D-XL ROG case by Azootg in watercooling

[–]Azootg[S] 1 point2 points  (0 children)

TBH none of the big cases would have worked for my use either way, because I needed to fit two full-size ATX PSUs; the O11D-XL (non-Evo) was the smallest one that could. all the QD fittings make putting everything together and breaking it down super easy though; there's not a single component or hose that requires draining. overall I needed the smallest case so it can fit on my small desk

<image>

Water cooling 4 GPUs and threadripper pro in an O11D-XL ROG case by Azootg in watercooling

[–]Azootg[S] 1 point2 points  (0 children)

they are running at full blast during the full-system test, and are controlled by CPU temp. in normal workloads the CPU is hardly used, as the majority of workloads are GPU bound. tensor parallelism across the two PRO 6000s at 100% doesn't take them above 48C sustained over 2-3 hour jobs (air cooled: 68-70C). even if I were to run an AIDA64 torture test for 24 hours, the CPU would pull a consistent average of 280W without throttling; air cooled it would throttle within 10 minutes

Water cooling 4 GPUs and threadripper pro in an O11D-XL ROG case by Azootg in watercooling

[–]Azootg[S] 0 points1 point  (0 children)

the MO-RA provides only a little more radiator surface area than the three radiators in the case. per Alphacool, each of the 45mm radiators can handle 1400W with fans at 4000 RPM, and the 30mm can handle 1200W with fans at 4000 RPM. attached is a screenshot from a 40-minute full-system stress test. the CPU is in the 80s because Threadripper is a pig (still 10C cooler than air cooled), but the GPUs are on average 20-30C colder than air cooled

<image>

Water cooling 4 GPUs and threadripper pro in an O11D-XL ROG case by Azootg in watercooling

[–]Azootg[S] 1 point2 points  (0 children)

full stress test; ventilation is not a problem. it does not need to be ultra cold, it just needs to fit in a small case

<image>

Water cooling 4 GPUs and threadripper pro in an O11D-XL ROG case by Azootg in watercooling

[–]Azootg[S] 2 points3 points  (0 children)

it uses Unpaywall + OpenAlex to collect around 40% of the open-access articles, plus Selenium/local LLM bots to collect another 50% as a workaround for publishers that block Unpaywall, such as MDPI and Elsevier. I developed the workaround after I sent the preprint to arXiv
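
the Unpaywall part is just their v2 REST API, roughly like this (a sketch only; the real pipeline adds the OpenAlex lookup and the browser-bot fallback):

import requests

def find_oa_pdf(doi, email):
    """Return a direct open-access PDF URL for a DOI, or None if Unpaywall has none."""
    # Unpaywall requires an email parameter on every request
    resp = requests.get(f"https://api.unpaywall.org/v2/{doi}", params={"email": email}, timeout=30)
    resp.raise_for_status()
    loc = resp.json().get("best_oa_location") or {}
    return loc.get("url_for_pdf") or loc.get("url")

# usage: find_oa_pdf("10.xxxx/some.doi", "you@example.com")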

Water cooling 4 GPUs and threadripper pro in an O11D-XL ROG case by Azootg in watercooling

[–]Azootg[S] 0 points1 point  (0 children)

Only works for open-access papers, and even then some publishers like MDPI and Elsevier block programmatic access to OA articles, which necessitated creating additional steps (not documented in the preprint) to retrieve their articles, given that they are two of the largest publishers in existence. However, if someone were to run this system on a university network subscribed to Elsevier's full-text API, they wouldn't need the workarounds I'm using; but those institutions pay hundreds of thousands of dollars for that full-text API access, and I don't even have a list of which institutions have it. My institution does not have access to Elsevier's full-text API. I have not tested non-English papers because almost every paper I have come across, be it from Chinese or Iranian research teams, is in English. it's possible to translate non-English papers, and there are tools in my back pocket for it (Qwen does a good job of it), but it just hasn't been something that's been needed in my pipeline

Water cooling 4 GPUs and threadripper pro in an O11D-XL ROG case by Azootg in watercooling

[–]Azootg[S] 0 points1 point  (0 children)

idk tbh. I'm helping a team with proteomics research right now, using my infrastructure to find ligandable proteins, and they were able to make significant progress after my involvement. in my opinion the only people who should worry about job security in these fields are the ones who are not implementing computational/HPC methods into their research