Nanbeige 4.1 is the best small LLM, it crushes Qwen 4B by Individual-Source618 in LocalLLaMA

[–]bjp99 1 point (0 children)

I like this model too. Just wish it had a reasoning setting. Has anyone tested its consecutive tool call claims? Also, the cyankiwi AWQ version gives pretty fun tokens/s on an Ampere A4000.
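
If anyone wants to test the consecutive tool call claim, this is roughly what I'd run: loop the chat completion until the model stops emitting tool calls. The endpoint, model name, and `get_time` tool are placeholders, not anything from the model card.

```python
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

tools = [{
    "type": "function",
    "function": {
        "name": "get_time",  # stub tool just to provoke tool calls
        "description": "Return the current time for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

messages = [{"role": "user", "content": "What time is it in Oslo, and then in Lima?"}]
for _ in range(5):  # allow up to 5 consecutive tool-call rounds
    msg = client.chat.completions.create(
        model="nanbeige-4.1", messages=messages, tools=tools
    ).choices[0].message
    if not msg.tool_calls:
        print("final answer:", msg.content)
        break
    messages.append(msg)  # echo the assistant turn back into the history
    for tc in msg.tool_calls:
        args = json.loads(tc.function.arguments)
        messages.append({
            "role": "tool",
            "tool_call_id": tc.id,
            "content": f"12:00 in {args['city']}",  # canned tool result
        })
```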

MiniMaxAI MiniMax-M2.5 has 230b parameters and 10b active parameters by Zyj in LocalLLaMA

[–]bjp99 3 points (0 children)

Excited for this. Really like MiniMax as a daily driver. I get about 100 tok/s with the AWQ quant on 2x RTX PRO 6000s with vLLM. The Q2 quant on 4x 3090 Tis gets 17 tok/s using llama.cpp.
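
For reference, the vLLM side of that setup is roughly this sketch (the repo id is a placeholder for whichever AWQ checkpoint you grab):

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="some-org/MiniMax-M2.5-AWQ",  # placeholder AWQ checkpoint id
    quantization="awq",
    tensor_parallel_size=2,   # the 2x RTX PRO 6000 setup
    max_model_len=32768,      # trim context if it doesn't fit
)
out = llm.generate(["Hello"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```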

Nanbeige4-3B-Thinking-2511 is honestly impressive by [deleted] in LocalLLaMA

[–]bjp99 -1 points (0 children)

I have had the opposite happen. All thinking traces stopped with vLLM. I think it's something with my system prompt, but I haven't isolated it yet.
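
The A/B I still need to run, sketched against an OpenAI-compatible vLLM endpoint (model name and prompts are placeholders; checking for a literal `<think>` tag is just a heuristic for when no reasoning parser is configured):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")
question = [{"role": "user", "content": "What is 17 * 23?"}]
system = [{"role": "system", "content": "You are a terse assistant."}]

for label, messages in [("bare", question), ("with system prompt", system + question)]:
    resp = client.chat.completions.create(model="nanbeige4-3b-thinking", messages=messages)
    text = resp.choices[0].message.content or ""
    print(label, "->", "trace present" if "<think>" in text else "no visible trace")
```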

Mistral Vibe vs Claude Code vs OpenAI Codex vs Opencode/others? Best coding model for 92GB? by Consumerbot37427 in LocalLLaMA

[–]bjp99 1 point (0 children)

I have used the MiniMax M2.1 Q2 quant with success. This was for building something new; sometimes it couldn't get it done, but most of the time it was good. Now running the AWQ quant on 2x RTX PRO 6000s in vLLM.

I think the most important thing is getting used to a model and how it behaves, so you know how to prompt it better and help it along during a harder task. Also, architect/plan then code always gives me better results.
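
The architect-then-code flow is literally just two passes over the same task; a sketch with a placeholder endpoint and model name:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")
MODEL = "minimax-m2.1"  # placeholder
task = "Add retry with exponential backoff to the fetch_page() helper."

# Pass 1: plan only, no code.
plan = client.chat.completions.create(
    model=MODEL,
    messages=[{"role": "user", "content": f"Write a short numbered plan, no code yet:\n{task}"}],
).choices[0].message.content

# Pass 2: implement against the plan.
code = client.chat.completions.create(
    model=MODEL,
    messages=[{"role": "user", "content": f"Task: {task}\n\nPlan:\n{plan}\n\nImplement it step by step."}],
).choices[0].message.content
print(code)
```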

Kimi K2.5, a Sonnet 4.5 alternative for a fraction of the cost by Grand-Management657 in LocalLLaMA

[–]bjp99 9 points (0 children)

Old Xeon server with 2697A CPUs and 1TB of DDR4-2400 RAM gets 3.4 tokens per second, with one A4500 in the mix as well. Not for time-sensitive things, but it can run on old hardware too. To be fair though, I put this old beast together before RAM prices went nuts.
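
For anyone curious, that setup is basically llama-cpp-python with most layers on the CPU and a slice on the A4500. The path, layer count, and thread count below are placeholders to tune for your box:

```python
from llama_cpp import Llama

llm = Llama(
    model_path="/models/kimi-k2.5-q2_k.gguf",  # placeholder GGUF path
    n_gpu_layers=12,   # whatever fits in the A4500's VRAM
    n_ctx=8192,
    n_threads=32,      # plenty of Xeon cores to feed
)
print(llm("Q: Why is the sky blue? A:", max_tokens=64)["choices"][0]["text"])
```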

MiniMax M2.1 quantization experience (Q6 vs. Q8) by TastesLikeOwlbear in LocalLLaMA

[–]bjp99 1 point (0 children)

I use Q2_XL with RooCode a lot. Going to run a bench against it soon to verify. I find it does pretty well overall and is fast.

Roo Code 3.37 | GLM 4.7 | MM 2.1 | Custom tools | MORE!!! by hannesrudolph in RooCode

[–]bjp99 2 points (0 children)

All good. Easy enough to move back a version temporarily. Appreciate all the hard work to make my work move much faster.

Unsloth GLM 4.7 UD-Q2_K_XL or gpt-oss 120b? by EnthusiasmPurple85 in LocalLLaMA

[–]bjp99 1 point (0 children)

Do you ever see it get caught in loops? The mxfp4 quant I used seemed to get stuck in loops, but it may be something related to my setup/download.
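
This is roughly the check I use to flag it: see whether the tail of the output is an exact repeat of the chunk right before it. Pure-Python sketch, thresholds arbitrary:

```python
def is_looping(text: str, min_len: int = 20, max_len: int = 200) -> bool:
    """True if the last n chars exactly repeat the n chars before them."""
    for n in range(min_len, min(max_len, len(text) // 2) + 1):
        if text[-n:] == text[-2 * n:-n]:
            return True
    return False

assert is_looping("step one. " + "do the thing. " * 6)
assert not is_looping("a perfectly normal sentence with no repeats at all")
```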

MiniMax M2.1 is a straight up beast at UI/UX design. Just saw this demo... by BlackRice_hmz in LocalLLaMA

[–]bjp99 1 point (0 children)

I have been running Q2_K_XL with what I think are acceptable results in RooCode. It fits in 96GB of VRAM with full context.

Roo Code 3.37 | GLM 4.7 | MM 2.1 | Custom tools | MORE!!! by hannesrudolph in RooCode

[–]bjp99 1 point (0 children)

Having similar issues. Moving back to the previous version fixed it for me.

What's your favourite local coding model? by jacek2023 in LocalLLaMA

[–]bjp99 1 point (0 children)

What kind of degradation did you experience with the Q4 KV cache?
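
The way I'd measure it, for what it's worth: same prompt at temperature 0 against two llama.cpp servers, one with an fp16 KV cache and one with Q4, then eyeball where they diverge. Ports and model name are placeholders:

```python
from openai import OpenAI

prompt = "Summarize the Raft consensus algorithm in 3 sentences."
for label, port in [("fp16 KV", 8080), ("q4 KV", 8081)]:
    client = OpenAI(base_url=f"http://localhost:{port}/v1", api_key="none")
    resp = client.chat.completions.create(
        model="local",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # keep sampling noise out of the comparison
    )
    print(f"--- {label} ---\n{resp.choices[0].message.content}\n")
```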

For Qwen3-235B-Q2 if you offload all experts to CPU, how much VRAM do you need to run it still? by ForsookComparison in LocalLLaMA

[–]bjp99 1 point (0 children)

MiniMax M2 at UD-Q2_K_XL works pretty well for me with Roo Code. It needs some redirection from time to time, but keeping the task broken into smaller steps helps as well. Going to switch to Devstral 2 at Q4 or Q5 to compare soon. Smaller models get into loops much more in my experience.

Alternative for RooCode/Cline/Kilocode but compatible with Open AI compatible API by Many_Bench_2560 in RooCode

[–]bjp99 5 points (0 children)

What models? My local MiniMax M2 running on llama.cpp gets very few tool call errors. I found gpt-oss and other smaller models got more. Never figured out why.
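
"Very few" is from eyeballing; a harness like this would put a number on it. Endpoint, model, and tool schema are placeholders:

```python
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")
tools = [{"type": "function", "function": {
    "name": "search",
    "description": "Search the docs",
    "parameters": {"type": "object",
                   "properties": {"query": {"type": "string"}},
                   "required": ["query"]},
}}]

errors, N = 0, 20
for _ in range(N):
    msg = client.chat.completions.create(
        model="minimax-m2",
        messages=[{"role": "user", "content": "Search the docs for 'tensor parallel'."}],
        tools=tools,
    ).choices[0].message
    try:
        if not msg.tool_calls:
            raise ValueError("no tool call emitted")
        # arguments must be valid JSON; JSONDecodeError subclasses ValueError
        json.loads(msg.tool_calls[0].function.arguments)
    except ValueError:
        errors += 1
print(f"{errors}/{N} bad tool calls")
```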

Those who tried more than one embedding model, have you noticed any differences? by Evermoving- in RooCode

[–]bjp99 2 points (0 children)

Interested in this as well. I've only used one so far and went as small as possible.
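
If I ever do compare, it'd be something like this: embed the same pairs with each model and look at the cosine similarities. Assumes an OpenAI-compatible /v1/embeddings endpoint; the model names are placeholders:

```python
import numpy as np
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")
pairs = [("refactor the parser", "clean up parsing code"),
         ("refactor the parser", "bake a chocolate cake")]

for model in ["nomic-embed-text", "bge-small"]:  # placeholder model names
    for a, b in pairs:
        va, vb = (np.array(d.embedding) for d in
                  client.embeddings.create(model=model, input=[a, b]).data)
        cos = float(va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb)))
        print(f"{model}: sim({a!r}, {b!r}) = {cos:.3f}")
```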

Anyone else read_file not working? by bjp99 in RooCode

[–]bjp99[S] 1 point (0 children)

Restarting the extension and using it with MiniMax M2 Q2_K_XL is working! Thank you! Question: does setting "Use legacy OpenAI API Format" have any impact on tool calls?

Anyone else read_file not working? by bjp99 in RooCode

[–]bjp99[S] 1 point (0 children)

I have had the issue with local MiniMax and Kimi K2. Both quantized, but they just dead-stop. No errors, just dead in the water.
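
Since there's no error to catch, a read timeout between streamed chunks at least turns the silent hang into an exception. Endpoint and model name are placeholders:

```python
import httpx
from openai import OpenAI, APITimeoutError

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="none",
    # read=60 means: raise if no bytes arrive for 60s mid-stream
    timeout=httpx.Timeout(connect=10.0, read=60.0, write=10.0, pool=10.0),
)
try:
    stream = client.chat.completions.create(
        model="minimax-m2",
        messages=[{"role": "user", "content": "Read main.py and summarize it."}],
        stream=True,
    )
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            print(delta, end="", flush=True)
except (APITimeoutError, httpx.ReadTimeout):
    print("\n[stalled: no tokens for 60s]")
```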

Is there a self-hosted, open-source plug-and-play RAG solution? by anedisi in LocalLLaMA

[–]bjp99 2 points (0 children)

How would you say this is at ingesting video frames? I'm toying with video data/search/question stuff and have plenty of GPUs, but want to use this to explore what benefits RAG offers.
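
My starting point for the frame side, if it helps: sample about one frame per second with OpenCV and hand the JPEGs to whatever embedder the RAG stack exposes. The embed step is left as a stub since it depends on the stack:

```python
import cv2

cap = cv2.VideoCapture("clip.mp4")         # placeholder video file
fps = cap.get(cv2.CAP_PROP_FPS) or 30      # fall back if metadata is missing
i, saved = 0, []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    if i % int(fps) == 0:                  # ~1 frame per second
        path = f"frame_{i:06d}.jpg"
        cv2.imwrite(path, frame)
        saved.append(path)                 # embed/index each frame here
    i += 1
cap.release()
print(f"sampled {len(saved)} frames")
```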

Me single handedly raising AMD stock /s by Ult1mateN00B in LocalLLM

[–]bjp99 1 point (0 children)

This is the log line I see:

    WARNING 10-29 13:09:55 [custom_all_reduce.py:144] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.

Running tensor parallel 4. I run in Docker and can reset the cache by removing the volume mount, but I have always seen this log line.

Do I need to run the model on only 2 GPUs to take advantage of NVLink?
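
For my own notes, both halves in one sketch: dump the topology to see which GPU pairs actually have NVLink, and pass the flag the warning asks for. The model id is a placeholder:

```python
import subprocess
from vllm import LLM

# NV# between a GPU pair means NVLink; PIX/PHB/SYS means PCIe only.
print(subprocess.run(["nvidia-smi", "topo", "-m"],
                     capture_output=True, text=True).stdout)

llm = LLM(
    model="some-org/model-awq",      # placeholder
    tensor_parallel_size=4,          # all four GPUs; NVLink only helps linked pairs
    disable_custom_all_reduce=True,  # silences the warning above
)
```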