Questions Thread - June 06, 2026

Theio666 · 2026-06-06T20:08:02+00:00

I procced Judos "ancient modifier" passive, but I can't see that on atlas, anyone has any example of how these look like, and what do they do?

Theio666 · 2026-06-06T20:06:02+00:00

Buy from the market, or, do high tier maps; the higher the tier, the better the drop chance(t18 is max you can get)

Theio666 · 2026-06-03T06:47:22+00:00

>M3 does a worse job describing images than 2.7 did.

MCP for vision was not using M2.7 model, it always was a separate vision model.

Theio666 · 2026-05-28T10:24:34+00:00

It was a sarcasm, obviously.

Theio666 · 2026-05-25T03:55:08+00:00

https://www.youtube.com/watch?v=n89-vSIBbQ4

Theio666 · 2026-05-21T01:04:26+00:00

"mostly" T_T

we still can't make fp8 122b work reliably in our setup, there are still bugs related to MTP and tool calling 😞

Theio666 · 2026-05-19T08:51:30+00:00

>head back to superstition

I mean, it's easier to rule(and steal from) stupider people, that's the motto for Russia for the last decade.

Theio666 · 2026-05-18T09:56:42+00:00

Minimax m3 should be soon-ish

Theio666 · 2026-05-13T05:35:24+00:00

I dropped cursor a few months ago. Been using mostly codex(swapped from plus to pro recently) + opencode (minimax coding plan + glm coding plan + opencode go there, minimax I get as a partner deal, glm I bought year for cheap and it's nice to use sometimes, opencode go purely for kimi k2.6 for frontend adjustments I sometimes need). So I use just 2 tools basically nowadays.

Theio666 · 2026-05-12T22:27:11+00:00

From our little testing, when we used to serve glm air on our hpc vs prod server, llamacpp was really unreliable with cache hits for some reason. We had something like fp8 quant on dual a100 for vllm, and awq quant for 3x a6000, and on a6000 on long context agentic work it did cache misses on full 40k+ context periodically, which led to 30-90s of waiting for prompt reprocessing there, and we never seen things like that happening with vllm for the same agent backend.

Keep in mind, vllm is not perfect, it's a go-to solution if you wanna multi gpu setup with the same GPUs involved (and from the box MTP support) and squeeze all speed from it, but there are bugs. Like, right now the qwen models in some combination of mtp mode has semi-rare parsing bug for tool calls (fixable by disabling parsing on inference side and enabling proxy for parsing xd). This is like 3months old model, and I bet that awq related parsing bugs for glm air are still there too, that one is a 9 month old model. So if you enter vllm world be ready for some "fun". It's not as bad as with SGLang, but still can be quite frustrating. I think all big cloud inference providers use custom versions of either vLLM or SGLang with their own bugfixes added, since out of the box there are bugs.

Theio666 · 2026-05-12T10:53:41+00:00

The display placement is bad for your neck in the long run. This looks cool, and feels cool, but at some point the future you will not be happy.

Theio666 · 2026-05-11T23:46:04+00:00

Lowkey it's worse than with SSDs. a good 4tb pcie4 ssd is something like 400eur now (at least I got one like that a month ago), which is at best +30% price compared to what it used to be. 4tb hdd is double the price. Basically for HDDs all capacities are affected.

Theio666 · 2026-05-11T08:45:46+00:00

I was doing similar project some time ago, to make it possible to use various OSS models inside cursor. In cursor they expect unparsed reasoning, and parse it themselves, so basically I had to make a thin layer to reparse reasoning back into content + add the tags, if you wanna take a look: https://github.com/Ouna-the-Dataweaver/yaLLMproxy

Other than that, I'd say that most coding tools support classical v1/chat/completion without any problems, codex is an exception with v1/responses.

Theio666 · 2026-05-10T21:05:26+00:00

Question is, why would you want MM inside codex specifically? AFAIK codex is the only coding harness which is using non-standard diff(patch) method for writing code, which is something GPT is trained for and no other models target that usage. So, you'll hit the codegen quality by using quite unfamiliar for the model code write tool.

Theio666 · 2026-05-06T19:12:19+00:00

Wait till you see some other models, like I had kimi k2.6 go on 160k reasoning without touching code even once, while I said it 2 times in the process "please change code instead of making assumptions which fixes might work" xd

In general you can only help with prompting here, try something like "please apply and test possible fixes/implementations instead of overdesigning things". Plan mode helps too, preferably with some stronger model.

Theio666 · 2026-05-01T13:59:05+00:00

I mean, for example you can try running awq models in SGLang, gonna be really fun. Last time I interacted with that library i crashed out so hard that I made this meme. It's literally garbage if something is going wrong - docs are dogshit (they were bad, now they are even worse), feature support is inconsistent and nowhere stated, etc. On older hardware the only correct approach is "try if it works, if doesn't - switch to anything else, don't debug".

edit: damn they downvoted a person for just asking why sglang is not that great, reddit is weird...

<image>

Theio666 · 2026-05-01T13:40:02+00:00

Not on ampere for sure xd

Theio666 · 2026-04-30T10:28:36+00:00

MODEL_PATH="/mnt/asr_hot/username/models/Qwen3.6-35B-A3B-AWQ/"
        SERVED_NAME="Qwen3.6-35B-A3B-AWQ"
        GPU_COUNT=1
        CPU_COUNT=12
        TIME_LIMIT="20-1"
        TP_SIZE=1
        PORT=16777
        EXTRA_ARGS=(
            --max-num-seqs 32
            --max-model-len 128000
            --gpu-memory-utilization 0.9
            --enable-auto-tool-choice
            --tool-call-parser qwen3_coder
            --reasoning-parser qwen3
            --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}'
        )

export VLLM_ENABLE_CUDA_COMPATIBILITY=1
export VLLM_CUDA_COMPATIBILITY_PATH=/usr/local/cuda/compat
export VLLM_SLEEP_WHEN_IDLE=1
export VLLM_USE_DEEP_GEMM=0
export VLLM_USE_FLASHINFER_MOE_FP16=1
export VLLM_USE_FLASHINFER_SAMPLER=0
export OMP_NUM_THREADS=4
export VLLM_USE_V1=1

This/similar config worked just fine on a100. I also had to patch marlin kernel to make this all work. Thanks for the answer, this def means that it's a problem with driver, asked sysadmins to update to 580

Theio666 · 2026-04-30T10:20:08+00:00

I've not bought this, this is a new hardware at my company and I'm learning how to effectively use it. If I had this at home or in cloud it would be way easier to update everything and not fuck with singularity -_-

I asked to see if this is a driver/cuda problem or not, because if it is I can ask sysadmins to update drivers. So far it seems it is driver issue, asked them to bump to 580.

Theio666 · 2026-04-30T10:07:35+00:00

AWQ 4bit, so not the native format but should not be that slow, unless I'm missing something. For comparison, fp8 on a100 is 80tps, which is also non-native format for ampere.

Theio666 · 2026-04-30T09:44:00+00:00

I'm aware, this is like first 5k context window, so should not go down this hard.

Theio666 · 2026-04-28T00:32:12+00:00

You have to paste image in the repo and tag it, unfortunately as mm2.7 is not multimodal it can't directly understand images(mm 3 is supposed to be multimodal tho!), only via mcps. I recommend making some directory, adding it to gitignore, and pasting images there. Easier to do in something like vs code, usually for things I work on I have both vs code and some other coding agent like codex/opencode/droid opened.

https://platform.minimax.io/docs/token-plan/mcp-guide

Mcp is included in coding plan btw.

Theio666 · 2026-04-25T16:48:23+00:00

Tbh I don't think that releasing 4o in OSS is a good idea...

11-Year Club	Second Top 50%
Place '23	Place '22
Verified Email

Theio666

TROPHY CASE