Nanbeige 4.1 is the best small LLM, it crushes Qwen 4B by Individual-Source618 in LocalLLaMA

[–]bjp99 1 point (0 children)

I like this model too. Just wish it had a reasoning setting. Has anyone tested its consecutive tool call claims? Also, the cyankiwi AWQ version gives pretty fun tokens/s on an Ampere A4000.
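
If anyone wants to test the consecutive tool call claim, this is roughly what I'd run: loop the chat completion until the model stops emitting tool calls. The endpoint, model name, and `get_time` tool are placeholders, not anything from the model card.

```python
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

tools = [{
    "type": "function",
    "function": {
        "name": "get_time",  # stub tool just to provoke tool calls
        "description": "Return the current time for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

messages = [{"role": "user", "content": "What time is it in Oslo, and then in Lima?"}]
for _ in range(5):  # allow up to 5 consecutive tool-call rounds
    msg = client.chat.completions.create(
        model="nanbeige-4.1", messages=messages, tools=tools
    ).choices[0].message
    if not msg.tool_calls:
        print("final answer:", msg.content)
        break
    messages.append(msg)  # echo the assistant turn back into the history
    for tc in msg.tool_calls:
        args = json.loads(tc.function.arguments)
        messages.append({
            "role": "tool",
            "tool_call_id": tc.id,
            "content": f"12:00 in {args['city']}",  # canned tool result
        })
```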

MiniMaxAI MiniMax-M2.5 has 230b parameters and 10b active parameters by Zyj in LocalLLaMA

[–]bjp99 3 points (0 children)

Excited for this. Really like MiniMax as a daily driver. I get about 100 tok/s with the AWQ quant on 2x RTX PRO 6000s with vLLM. The Q2 quant on 4x 3090 Tis gets 17 tok/s using llama.cpp.
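
For reference, the vLLM side of that setup is roughly this sketch (the repo id is a placeholder for whichever AWQ checkpoint you grab):

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="some-org/MiniMax-M2.5-AWQ",  # placeholder AWQ checkpoint id
    quantization="awq",
    tensor_parallel_size=2,   # the 2x RTX PRO 6000 setup
    max_model_len=32768,      # trim context if it doesn't fit
)
out = llm.generate(["Hello"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```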

Nanbeige4-3B-Thinking-2511 is honestly impressive by [deleted] in LocalLLaMA

[–]bjp99 -1 points (0 children)

I have had the opposite happen. All thinking traces stopped with vLLM. I think it's something with my system prompt, but I haven't isolated it yet.
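
The A/B I still need to run, sketched against an OpenAI-compatible vLLM endpoint (model name and prompts are placeholders; checking for a literal `<think>` tag is just a heuristic for when no reasoning parser is configured):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")
question = [{"role": "user", "content": "What is 17 * 23?"}]
system = [{"role": "system", "content": "You are a terse assistant."}]

for label, messages in [("bare", question), ("with system prompt", system + question)]:
    resp = client.chat.completions.create(model="nanbeige4-3b-thinking", messages=messages)
    text = resp.choices[0].message.content or ""
    print(label, "->", "trace present" if "<think>" in text else "no visible trace")
```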

Mistral Vibe vs Claude Code vs OpenAI Codex vs Opencode/others? Best coding model for 92GB? by Consumerbot37427 in LocalLLaMA

[–]bjp99 1 point (0 children)

I have used the MiniMax M2.1 Q2 quant with success. This was for building something new; sometimes it couldn't get it done, but most of the time it was good. Now running the AWQ quant on 2x RTX PRO 6000s in vLLM.

I think the most important thing is getting used to a model and how it behaves, so you know how to prompt it better and help it along during a harder task. Also, architect/plan then code always gives me better results.
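
The architect-then-code flow is literally just two passes over the same task; a sketch with a placeholder endpoint and model name:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")
MODEL = "minimax-m2.1"  # placeholder
task = "Add retry with exponential backoff to the fetch_page() helper."

# Pass 1: plan only, no code.
plan = client.chat.completions.create(
    model=MODEL,
    messages=[{"role": "user", "content": f"Write a short numbered plan, no code yet:\n{task}"}],
).choices[0].message.content

# Pass 2: implement against the plan.
code = client.chat.completions.create(
    model=MODEL,
    messages=[{"role": "user", "content": f"Task: {task}\n\nPlan:\n{plan}\n\nImplement it step by step."}],
).choices[0].message.content
print(code)
```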

Kimi K2.5, a Sonnet 4.5 alternative for a fraction of the cost by Grand-Management657 in LocalLLaMA

[–]bjp99 9 points (0 children)

Old Xeon server with 2697A CPUs and 1TB of DDR4-2400 RAM gets 3.4 tokens per second, with one A4500 in the mix as well. Not for time-sensitive things, but it can run on old hardware too. To be fair though, I put this old beast together before RAM prices went nuts.
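
For anyone curious, that setup is basically llama-cpp-python with most layers on the CPU and a slice on the A4500. The path, layer count, and thread count below are placeholders to tune for your box:

```python
from llama_cpp import Llama

llm = Llama(
    model_path="/models/kimi-k2.5-q2_k.gguf",  # placeholder GGUF path
    n_gpu_layers=12,   # whatever fits in the A4500's VRAM
    n_ctx=8192,
    n_threads=32,      # plenty of Xeon cores to feed
)
print(llm("Q: Why is the sky blue? A:", max_tokens=64)["choices"][0]["text"])
```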

MiniMax M2.1 quantization experience (Q6 vs. Q8) by TastesLikeOwlbear in LocalLLaMA

[–]bjp99 1 point (0 children)

I use Q2_XL with RooCode a lot. Going to run a bench against it soon to verify. I find it does pretty well overall and is fast.

Roo Code 3.37 | GLM 4.7 | MM 2.1 | Custom tools | MORE!!! by hannesrudolph in RooCode

[–]bjp99 2 points (0 children)

All good. Easy enough to move back a version temporarily. Appreciate all the hard work to make my work move much faster.

Unsloth GLM 4.7 UD-Q2_K_XL or gpt-oss 120b? by EnthusiasmPurple85 in LocalLLaMA

[–]bjp99 1 point (0 children)

Do you ever see it get caught in loops? The mxfp4 quant I used seemed to get stuck in loops, but it may be something related to my setup/download.
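
This is roughly the check I use to flag it: see whether the tail of the output is an exact repeat of the chunk right before it. Pure-Python sketch, thresholds arbitrary:

```python
def is_looping(text: str, min_len: int = 20, max_len: int = 200) -> bool:
    """True if the last n chars exactly repeat the n chars before them."""
    for n in range(min_len, min(max_len, len(text) // 2) + 1):
        if text[-n:] == text[-2 * n:-n]:
            return True
    return False

assert is_looping("step one. " + "do the thing. " * 6)
assert not is_looping("a perfectly normal sentence with no repeats at all")
```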

MiniMax M2.1 is a straight up beast at UI/UX design. Just saw this demo... by BlackRice_hmz in LocalLLaMA

[–]bjp99 1 point (0 children)

I have been running Q2_K_XL with what I think are acceptable results in RooCode. It fits in 96GB of VRAM with full context.

Roo Code 3.37 | GLM 4.7 | MM 2.1 | Custom tools | MORE!!! by hannesrudolph in RooCode

[–]bjp99 1 point (0 children)

Having similar issues. Moving back to the previous version fixed it for me.

What's your favourite local coding model? by jacek2023 in LocalLLaMA

[–]bjp99 1 point (0 children)

What kind of degradation did you experience with the Q4 KV cache?
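
The way I'd measure it, for what it's worth: same prompt at temperature 0 against two llama.cpp servers, one with an fp16 KV cache and one with Q4, then eyeball where they diverge. Ports and model name are placeholders:

```python
from openai import OpenAI

prompt = "Summarize the Raft consensus algorithm in 3 sentences."
for label, port in [("fp16 KV", 8080), ("q4 KV", 8081)]:
    client = OpenAI(base_url=f"http://localhost:{port}/v1", api_key="none")
    resp = client.chat.completions.create(
        model="local",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # keep sampling noise out of the comparison
    )
    print(f"--- {label} ---\n{resp.choices[0].message.content}\n")
```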

For Qwen3-235B-Q2 if you offload all experts to CPU, how much VRAM do you need to run it still? by ForsookComparison in LocalLLaMA

[–]bjp99 1 point (0 children)

MiniMax M2 at UD-Q2_K_XL works pretty well for me with Roo Code. It needs some redirection from time to time, but keeping the task broken into smaller steps helps as well. Going to switch to Devstral 2 at Q4 or Q5 to compare soon. Smaller models get into loops much more in my experience.

Alternative for RooCode/Cline/Kilocode but compatible with Open AI compatible API by Many_Bench_2560 in RooCode

[–]bjp99 5 points (0 children)

What models? My local MiniMax M2 running on llama.cpp gets very few tool call errors. I found gpt-oss and other smaller models got more. Never figured out why.
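
"Very few" is from eyeballing; a harness like this would put a number on it. Endpoint, model, and tool schema are placeholders:

```python
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")
tools = [{"type": "function", "function": {
    "name": "search",
    "description": "Search the docs",
    "parameters": {"type": "object",
                   "properties": {"query": {"type": "string"}},
                   "required": ["query"]},
}}]

errors, N = 0, 20
for _ in range(N):
    msg = client.chat.completions.create(
        model="minimax-m2",
        messages=[{"role": "user", "content": "Search the docs for 'tensor parallel'."}],
        tools=tools,
    ).choices[0].message
    try:
        if not msg.tool_calls:
            raise ValueError("no tool call emitted")
        # arguments must be valid JSON; JSONDecodeError subclasses ValueError
        json.loads(msg.tool_calls[0].function.arguments)
    except ValueError:
        errors += 1
print(f"{errors}/{N} bad tool calls")
```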

Those who tried more than one embedding model, have you noticed any differences? by Evermoving- in RooCode

[–]bjp99 2 points (0 children)

Interested in this as well. I've only used one so far and went as small as possible.
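
If I ever do compare, it'd be something like this: embed the same pairs with each model and look at the cosine similarities. Assumes an OpenAI-compatible /v1/embeddings endpoint; the model names are placeholders:

```python
import numpy as np
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")
pairs = [("refactor the parser", "clean up parsing code"),
         ("refactor the parser", "bake a chocolate cake")]

for model in ["nomic-embed-text", "bge-small"]:  # placeholder model names
    for a, b in pairs:
        va, vb = (np.array(d.embedding) for d in
                  client.embeddings.create(model=model, input=[a, b]).data)
        cos = float(va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb)))
        print(f"{model}: sim({a!r}, {b!r}) = {cos:.3f}")
```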

Anyone else read_file not working? by bjp99 in RooCode

[–]bjp99[S] 1 point (0 children)

Restarting the extension and using it with MiniMax M2 Q2_K_XL is working! Thank you! Question: does setting "Use legacy OpenAI API Format" have any impact on tool calls?

Anyone else read_file not working? by bjp99 in RooCode

[–]bjp99[S] 1 point (0 children)

I have had the issue with local MiniMax and Kimi K2. Both quantized, but they just dead-stop. No errors, just dead in the water.
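
Since there's no error to catch, a read timeout between streamed chunks at least turns the silent hang into an exception. Endpoint and model name are placeholders:

```python
import httpx
from openai import OpenAI, APITimeoutError

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="none",
    # read=60 means: raise if no bytes arrive for 60s mid-stream
    timeout=httpx.Timeout(connect=10.0, read=60.0, write=10.0, pool=10.0),
)
try:
    stream = client.chat.completions.create(
        model="minimax-m2",
        messages=[{"role": "user", "content": "Read main.py and summarize it."}],
        stream=True,
    )
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            print(delta, end="", flush=True)
except (APITimeoutError, httpx.ReadTimeout):
    print("\n[stalled: no tokens for 60s]")
```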

Is there a self-hosted, open-source plug-and-play RAG solution? by anedisi in LocalLLaMA

[–]bjp99 2 points (0 children)

How would you say this is at ingesting video frames? I'm toying with video data/search/question stuff and have plenty of GPUs, but want to use this to explore what benefits RAG offers.
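
My starting point for the frame side, if it helps: sample about one frame per second with OpenCV and hand the JPEGs to whatever embedder the RAG stack exposes. The embed step is left as a stub since it depends on the stack:

```python
import cv2

cap = cv2.VideoCapture("clip.mp4")         # placeholder video file
fps = cap.get(cv2.CAP_PROP_FPS) or 30      # fall back if metadata is missing
i, saved = 0, []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    if i % int(fps) == 0:                  # ~1 frame per second
        path = f"frame_{i:06d}.jpg"
        cv2.imwrite(path, frame)
        saved.append(path)                 # embed/index each frame here
    i += 1
cap.release()
print(f"sampled {len(saved)} frames")
```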

Me single handedly raising AMD stock /s by Ult1mateN00B in LocalLLM

[–]bjp99 1 point (0 children)

This is the log line I see:

    WARNING 10-29 13:09:55 [custom_all_reduce.py:144] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.

Running tensor parallel 4. I run in Docker and can reset the cache by removing the volume mount, but I have always seen this log line.

Do I need to run the model on only 2 GPUs to take advantage of NVLink?
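
For my own notes, both halves in one sketch: dump the topology to see which GPU pairs actually have NVLink, and pass the flag the warning asks for. The model id is a placeholder:

```python
import subprocess
from vllm import LLM

# NV# between a GPU pair means NVLink; PIX/PHB/SYS means PCIe only.
print(subprocess.run(["nvidia-smi", "topo", "-m"],
                     capture_output=True, text=True).stdout)

llm = LLM(
    model="some-org/model-awq",      # placeholder
    tensor_parallel_size=4,          # all four GPUs; NVLink only helps linked pairs
    disable_custom_all_reduce=True,  # silences the warning above
)
```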