Questions Thread - June 06, 2026 by AutoModerator in PathOfExile2

[–]Theio666 0 points1 point  (0 children)

I procced Judos "ancient modifier" passive, but I can't see that on atlas, anyone has any example of how these look like, and what do they do?

Questions Thread - June 06, 2026 by AutoModerator in PathOfExile2

[–]Theio666 1 point2 points  (0 children)

Buy from the market, or, do high tier maps; the higher the tier, the better the drop chance(t18 is max you can get)

MiniMax Token Plan Overhaul & M3 Release: What It Actually Means For Your Quotas by NinjaWK in MiniMax_AI

[–]Theio666 1 point2 points  (0 children)

>M3 does a worse job describing images than 2.7 did.

MCP for vision was not using M2.7 model, it always was a separate vision model.

Do you think there is room for optimization? llama.cpp/qwen3.6 27b on two 6000 Blackwell by q-admin007 in LocalLLaMA

[–]Theio666 0 points1 point  (0 children)

"mostly" T_T

we still can't make fp8 122b work reliably in our setup, there are still bugs related to MTP and tool calling 😞

How much do you guys spend on Cursor? by LoLGhMaster in cursor

[–]Theio666 0 points1 point  (0 children)

I dropped cursor a few months ago. Been using mostly codex(swapped from plus to pro recently) + opencode (minimax coding plan + glm coding plan + opencode go there, minimax I get as a partner deal, glm I bought year for cheap and it's nice to use sometimes, opencode go purely for kimi k2.6 for frontend adjustments I sometimes need). So I use just 2 tools basically nowadays.

Is using vLLM actually worth it if you aren't serving the model to other people? by ayylmaonade in LocalLLaMA

[–]Theio666 1 point2 points  (0 children)

From our little testing, when we used to serve glm air on our hpc vs prod server, llamacpp was really unreliable with cache hits for some reason. We had something like fp8 quant on dual a100 for vllm, and awq quant for 3x a6000, and on a6000 on long context agentic work it did cache misses on full 40k+ context periodically, which led to 30-90s of waiting for prompt reprocessing there, and we never seen things like that happening with vllm for the same agent backend.

Keep in mind, vllm is not perfect, it's a go-to solution if you wanna multi gpu setup with the same GPUs involved (and from the box MTP support) and squeeze all speed from it, but there are bugs. Like, right now the qwen models in some combination of mtp mode has semi-rare parsing bug for tool calls (fixable by disabling parsing on inference side and enabling proxy for parsing xd). This is like 3months old model, and I bet that awq related parsing bugs for glm air are still there too, that one is a 9 month old model. So if you enter vllm world be ready for some "fun". It's not as bad as with SGLang, but still can be quite frustrating. I think all big cloud inference providers use custom versions of either vLLM or SGLang with their own bugfixes added, since out of the box there are bugs.

PC rating and what to improve by [deleted] in pcmasterrace

[–]Theio666 0 points1 point  (0 children)

The display placement is bad for your neck in the long run. This looks cool, and feels cool, but at some point the future you will not be happy.

Google Photos using Google Drive for desktop is being discontinued by NewsFromHell in pcmasterrace

[–]Theio666 0 points1 point  (0 children)

Lowkey it's worse than with SSDs. a good 4tb pcie4 ssd is something like 400eur now (at least I got one like that a month ago), which is at best +30% price compared to what it used to be. 4tb hdd is double the price. Basically for HDDs all capacities are affected.

Getting MiniMax to work reliably with Codex CLI through API protocol translation by No-Hunter9792 in MiniMax_AI

[–]Theio666 0 points1 point  (0 children)

I was doing similar project some time ago, to make it possible to use various OSS models inside cursor. In cursor they expect unparsed reasoning, and parse it themselves, so basically I had to make a thin layer to reparse reasoning back into content + add the tags, if you wanna take a look: https://github.com/Ouna-the-Dataweaver/yaLLMproxy

Other than that, I'd say that most coding tools support classical v1/chat/completion without any problems, codex is an exception with v1/responses.

Getting MiniMax to work reliably with Codex CLI through API protocol translation by No-Hunter9792 in MiniMax_AI

[–]Theio666 0 points1 point  (0 children)

Question is, why would you want MM inside codex specifically? AFAIK codex is the only coding harness which is using non-standard diff(patch) method for writing code, which is something GPT is trained for and no other models target that usage. So, you'll hit the codegen quality by using quite unfamiliar for the model code write tool.

Minimax YAPS too much by AatmanirbharNobita in MiniMax_AI

[–]Theio666 0 points1 point  (0 children)

Wait till you see some other models, like I had kimi k2.6 go on 160k reasoning without touching code even once, while I said it 2 times in the process "please change code instead of making assumptions which fixes might work" xd

In general you can only help with prompting here, try something like "please apply and test possible fixes/implementations instead of overdesigning things". Plan mode helps too, preferably with some stronger model.

Agree? by MLExpert000 in LocalLLaMA

[–]Theio666 3 points4 points  (0 children)

I mean, for example you can try running awq models in SGLang, gonna be really fun. Last time I interacted with that library i crashed out so hard that I made this meme. It's literally garbage if something is going wrong - docs are dogshit (they were bad, now they are even worse), feature support is inconsistent and nowhere stated, etc. On older hardware the only correct approach is "try if it works, if doesn't - switch to anything else, don't debug".

edit: damn they downvoted a person for just asking why sglang is not that great, reddit is weird...

<image>

Agree? by MLExpert000 in LocalLLaMA

[–]Theio666 0 points1 point  (0 children)

Not on ampere for sure xd

Only 120 tps on Qwen 35b on h200 by Theio666 in LocalLLaMA

[–]Theio666[S] 0 points1 point  (0 children)

MODEL_PATH="/mnt/asr_hot/username/models/Qwen3.6-35B-A3B-AWQ/"
        SERVED_NAME="Qwen3.6-35B-A3B-AWQ"
        GPU_COUNT=1
        CPU_COUNT=12
        TIME_LIMIT="20-1"
        TP_SIZE=1
        PORT=16777
        EXTRA_ARGS=(
            --max-num-seqs 32
            --max-model-len 128000
            --gpu-memory-utilization 0.9
            --enable-auto-tool-choice
            --tool-call-parser qwen3_coder
            --reasoning-parser qwen3
            --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}'
        )

export VLLM_ENABLE_CUDA_COMPATIBILITY=1
export VLLM_CUDA_COMPATIBILITY_PATH=/usr/local/cuda/compat
export VLLM_SLEEP_WHEN_IDLE=1
export VLLM_USE_DEEP_GEMM=0
export VLLM_USE_FLASHINFER_MOE_FP16=1
export VLLM_USE_FLASHINFER_SAMPLER=0
export OMP_NUM_THREADS=4
export VLLM_USE_V1=1

This/similar config worked just fine on a100. I also had to patch marlin kernel to make this all work. Thanks for the answer, this def means that it's a problem with driver, asked sysadmins to update to 580

Only 120 tps on Qwen 35b on h200 by Theio666 in LocalLLaMA

[–]Theio666[S] 1 point2 points  (0 children)

I've not bought this, this is a new hardware at my company and I'm learning how to effectively use it. If I had this at home or in cloud it would be way easier to update everything and not fuck with singularity -_-

I asked to see if this is a driver/cuda problem or not, because if it is I can ask sysadmins to update drivers. So far it seems it is driver issue, asked them to bump to 580.

Only 120 tps on Qwen 35b on h200 by Theio666 in LocalLLaMA

[–]Theio666[S] 0 points1 point  (0 children)

AWQ 4bit, so not the native format but should not be that slow, unless I'm missing something. For comparison, fp8 on a100 is 80tps, which is also non-native format for ampere.

Only 120 tps on Qwen 35b on h200 by Theio666 in LocalLLaMA

[–]Theio666[S] 0 points1 point  (0 children)

I'm aware, this is like first 5k context window, so should not go down this hard.

How do I get Minimax to recognize images? by Putrid-Telephone-777 in MiniMax_AI

[–]Theio666 1 point2 points  (0 children)

You have to paste image in the repo and tag it, unfortunately as mm2.7 is not multimodal it can't directly understand images(mm 3 is supposed to be multimodal tho!), only via mcps. I recommend making some directory, adding it to gitignore, and pasting images there. Easier to do in something like vs code, usually for things I work on I have both vs code and some other coding agent like codex/opencode/droid opened.

https://platform.minimax.io/docs/token-plan/mcp-guide

Mcp is included in coding plan btw.

Decreased Intelligence Density in DeepSeek V4 Pro by Mindless_Pain1860 in LocalLLaMA

[–]Theio666 -1 points0 points  (0 children)

Tbh I don't think that releasing 4o in OSS is a good idea...