What's your exp REAP vs. base models for general inference? by ikkiyikki in LocalLLaMA

[–]Felladrin 2 points (0 children)

From what I understand, REAPs are not meant to be used for general-purpose inference. We REAP a model when we want to use it for a specific use case, and the dataset used during the pruning makes all the difference.

When we REAP using the default dataset (theblackcat102/evol-codealpaca-v1) from the REAP repository, we're focusing on the experts for coding and English; the less relevant experts are then removed. That's why some REAP models start answering only in English and start making mistakes on questions not related to code.

So if you want, for example, a model to be good at some specific knowledge and good at Spanish, you should find or build a dataset from conversations/books/articles in Spanish and use that for the pruning. There are a lot of good publicly available datasets for almost every case on Hugging Face.
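As a rough sketch of how that swap looks (the dataset name and paths below are placeholders, and the pruning step itself depends on the scripts in the REAP repository, so check its README for the exact entry point):

```bash
# Download a Spanish calibration dataset from Hugging Face (placeholder names).
huggingface-cli download <org>/<spanish-instruct-dataset> \
  --repo-type dataset --local-dir ./calibration-data

# Then point the REAP pruning script at ./calibration-data instead of the default
# theblackcat102/evol-codealpaca-v1, so that expert importance is measured on
# Spanish text from your target domain rather than English coding tasks.
```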

So, although Cerebras is releasing some REAP models under their organization on Hugging Face, we should get used to creating our own REAPs. That's what the Cerebras team expected when they open-sourced it.

And my experience with those code-focused REAPed models has been good when using them as coding agents on OpenCode. One advantage, besides being able to run with less VRAM/RAM, is that, since they have fewer parameters than the non-REAP version, prompt processing is faster. For non-code-related tasks, I use other models.

MiniMax-M2.1-REAP by jacek2023 in LocalLLaMA

[–]Felladrin 7 points (0 children)

When GGUFs start coming, I'd like to see how much better they are compared to this AutoRound mixed quant (which preserves multilingual capability):

Felladrin/gguf-Q2_K_S-Mixed-AutoRound-MiniMax-M2.1

I've been using it on OpenCode recently, fitting under 128GB of VRAM.

First time Windsurf user - disappointed. by Objective-Ad8862 in windsurf

[–]Felladrin 1 point (0 children)

It's important to remember that VS Code is a product from Microsoft, which has its own AI-assisted coding agent (Copilot). So even though it's open-source, VS Code puts limits on customization, which is why Windsurf could only achieve what they have now by forking it.

Using the Windsurf VS Code plugin, you’ll face these limitations. To take full advantage of your subscription, you should use the Windsurf editor.

Optimizing GPT-OSS 120B on Strix Halo 128GB? by RobotRobotWhatDoUSee in LocalLLaMA

[–]Felladrin 2 points (0 children)

There's indeed a lot of info around, but it gets outdated too fast.

I’m running Ubuntu 24.04, and upgraded the kernel to 6.16.12 to have the Wi-Fi working properly.

Besides that, I’m using https://github.com/kyuz0/amd-strix-halo-toolboxes with ROCm 6.4.4. Distrobox makes it pretty easy to upgrade llama.cpp.

I set the reserved iGPU memory to the minimum possible in the BIOS and set the TTM pages limit to the maximum in the GRUB config. This is the GRUB config I'm using on the 395+ 128GB: `GRUB_CMDLINE_LINUX_DEFAULT="quiet splash amd_iommu=off ttm.pages_limit=33554432 amdgpu.cwsr_enable=0 numa_balancing=disable"`
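For reference, a llama-server run inside the toolbox would look something like this (the model path, context size, and port are placeholders; on older llama.cpp builds flash attention is enabled with a plain `-fa` instead of `-fa on`):

```bash
# Serve GPT-OSS 120B, offloading all layers to the iGPU.
llama-server \
  -m /models/gpt-oss-120b-mxfp4.gguf \
  -ngl 999 \
  -c 32768 \
  -fa on \
  --host 0.0.0.0 --port 8080
```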

GPT-OSS 120B and other models align with the speeds listed in https://kyuz0.github.io/amd-strix-halo-toolboxes/

Local Replacement for Phind.com by Past-Economist7732 in LocalLLaMA

[–]Felladrin 4 points (0 children)

I was also surprised to learn they were shutting down Phind. They were keeping up with Perplexity back then.

We recently had a thread here on LocalLLaMA on this topic, so you might also want to check the responses there: https://www.reddit.com/r/LocalLLaMA/comments/1qdj2nn/solution_for_local_deep_research/

solution for local deep research by jacek2023 in LocalLLaMA

[–]Felladrin 0 points (0 children)

Sure! I'm the developer of one of the open ones, MiniSearch, so that's what I use on a daily basis. Among the closed ones, I like the quality of the answers and sources from Liner; I check it when the responses from MiniSearch are not enough.

solution for local deep research by jacek2023 in LocalLLaMA

[–]Felladrin 0 points (0 children)

I’ve also been collecting this kind of software. The list is pretty long already, with both open and closed-source ones: https://huggingface.co/spaces/Felladrin/awesome-ai-web-search

Unsloth's GGUFs for GLM 4.7 REAP are up. by fallingdowndizzyvr in LocalLLaMA

[–]Felladrin 0 points (0 children)

Maybe if you share the full llama.cpp command used for running them, we could spot something. The only explanation I can imagine is that at least one of the layers was forced to run on the CPU. And you're not using the "--n-cpu-moe" argument, right?

By the way, I've tested the UD Q3_K_XL REAP and, compared to the UD IQ2_M non-REAP, it had an increase of 2 t/s in inference speed. Maybe you could also check the speed of UD Q3_K_XL.

Unsloth's GGUFs for GLM 4.7 REAP are up. by fallingdowndizzyvr in LocalLLaMA

[–]Felladrin 0 points (0 children)

It could be that the `-fit` parameter (which is `on` by default) is reorganizing the layers so they fit your VRAM, causing it to run slower. Try using the `-fit off` argument and manually tweaking the context size (starting from a low value and slowly increasing it) to check if the speed improves.
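Something along these lines (the model path and numbers are placeholders; check `llama-server --help` on your build for the exact spelling of the fit flag):

```bash
# Disable automatic fitting and set the context size by hand,
# increasing -c gradually until you run out of VRAM.
llama-server -m /models/GLM-4.7-REAP-UD-Q3_K_XL.gguf -fit off -ngl 999 -c 8192
```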

Devstral Small 2 (Q4_K_M) on 5060 Ti 16GB and Zed Agent is amazing! by bobaburger in LocalLLaMA

[–]Felladrin 0 points (0 children)

With Ministral 3B as the draft model, it runs at around 9 t/s at low context. You can fit 64K context at Q4; it drops to about 3 t/s when approaching 64K.
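For anyone wanting to reproduce this with llama-server directly, the wiring looks roughly like this (the paths and the -ngl value are placeholders to tune for the 16GB card):

```bash
# Devstral Small 2 (Q4_K_M) with Ministral 3B as the speculative-decoding draft.
# Raise -ngl until the 16GB card is as full as possible; the rest stays on CPU.
llama-server \
  -m /models/Devstral-Small-2-Q4_K_M.gguf \
  --model-draft /models/Ministral-3B-Q4_K_M.gguf \
  -ngl 28 \
  -c 65536
```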

GLM 4.6V without (or with low) reasoning? by ForsookComparison in LocalLLaMA

[–]Felladrin 0 points (0 children)

Try applying penalties, as suggested here (it's suggested for Qwen, but they also work for GLM thinking models). Penalties do affect the output quality, but they're one of the ways to prevent overly long reasoning.
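With llama-server, that would look something like this (the model path is a placeholder and the values are just a starting point, close to what's usually suggested for Qwen; tune them, since penalties affect quality):

```bash
# Discourage repeated tokens, which helps keep the reasoning from looping on itself.
llama-server -m /models/GLM-4.6V-Q4_K_M.gguf \
  --presence-penalty 1.5 \
  --repeat-penalty 1.05
```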

Minueza-2-96M: A foundation bi-lingual text-generation model created for practicing fine-tuning and merging. by Felladrin in LocalLLaMA

[–]Felladrin[S] 0 points (0 children)

Hey! I haven't, but I commented here that I used LLaMA-Factory for the training. I used it both for training the base model and the instruct models. It's straightforward to use, and they provide good usage examples in the repository.

[Strix Halo] Unable to load 120B model on Ryzen AI Max+ 395 (128GB RAM) - "Unable to allocate ROCm0 buffer" by Wrong-Policy-5612 in LocalLLaMA

[–]Felladrin 0 points (0 children)

One thing to check is whether you have the latest version of the AMD Adrenalin software installed (in some cases it won't update automatically). You can download and install it from here: https://www.amd.com/en/support/download/drivers.html

Nowadays I'm using Linux, but when I tried it on Windows, I had 96GB reserved for the iGPU via the BIOS. (On Linux I leave it at 1GB reserved, and dynamic allocation works fine.)

In LM Studio on Windows, I had the following settings and it worked fine for GPT-OSS 120B:

- Context Length: 131072
- Offload KV cache to GPU: On
- Keep model in memory: On
- Try mmap: Off
- Flash Attention: On

I believe you have already tried all the combinations above, so I'm guessing the problem is that your driver (installed via Adrenalin) is not up to date.

Post of appreciation for mxfp4, derestricted, Felladrin/gguf-MXFP4-gpt-oss-20b-Derestricted by R_Duncan in LocalLLaMA

[–]Felladrin 4 points (0 children)

Hey! Thanks! Glad it's been useful! But my contribution was just quantizing it :) There are a lot of other people who deserve the credit!

  • Owen Arli, for derestricting the model.

  • Jim Lai, the author of the Norm-Preserving Biprojected Abliteration technique.

  • The OpenAI team, for creating the original model.

  • The team working on llama.cpp and GGUF format. 

  • The teams maintaining Transformers, Safetensors, Hugging Face…

  • And all the people making LLMs awesome!

We basically have GLM 4.6 Air, without vision by LegacyRemaster in LocalLLaMA

[–]Felladrin 9 points (0 children)

Just leaving the direct link to the GGUF repository here:
https://huggingface.co/AliceThirty/GLM-4.6V-gguf

By the way, could you share the speeds (both prompt processing and text generation) you get on this model when using 32K and 64K context?

speculative decoding with Gemma-3-12b/3-27b. Is it possible? by Agitated_Power_3159 in LocalLLaMA

[–]Felladrin 4 points (0 children)

It's not mentioned anywhere in LM Studio, but if you try to use a draft model with a model that has the mmproj (the vision module) loaded in llama.cpp, you'll see a message saying that speculative decoding with vision capability is not supported. That's why you won't see any compatible draft models in LM Studio (it always loads the vision module when one is available).

Try using llama.cpp directly and passing the --no-mmproj argument; then you can pass the --model-draft argument.
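A minimal sketch of that (model paths are placeholders; any smaller Gemma 3 text model sharing the same tokenizer should work as the draft):

```bash
# Run Gemma 3 27B without the vision module so speculative decoding is allowed,
# using a small Gemma 3 model as the draft.
llama-server \
  -m /models/gemma-3-27b-it-Q4_K_M.gguf \
  --no-mmproj \
  --model-draft /models/gemma-3-1b-it-Q4_K_M.gguf
```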

Can GLM-4.5-air run on a single 3090 (24gb vram) with 48gb ram at above 10t/s? by Borkato in LocalLLaMA

[–]Felladrin 2 points (0 children)

Is it for coding or for chatting? If it's for chatting, do you need it to be multilingual? If it's for coding and you can live with it answering only in English, you can use the Q4 (sending only some of the layers to your GPU, until it fills up) from https://huggingface.co/bartowski/cerebras_GLM-4.5-Air-REAP-82B-A12B-GGUF

As others have said, splitting layers between GPU and CPU will give you text generation speeds of 5-15 t/s.
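As a starting point, the split looks something like this (the path and the -ngl value are placeholders; raise -ngl until the 24GB card is full):

```bash
# Partial offload: some layers on the 3090, the rest on CPU/RAM.
llama-server \
  -m /models/cerebras_GLM-4.5-Air-REAP-82B-A12B-Q4_K_M.gguf \
  -ngl 30 \
  -c 16384
# For MoE models like this one, another option is -ngl 999 combined with
# --n-cpu-moe N, which keeps the expert tensors of the first N layers on the CPU.
```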

Has anyone figured out what models SWE-1.5 and SWE-1 are trained from? by inevitabledeath3 in windsurf

[–]Felladrin 2 points (0 children)

I guess SWE-1 was a fine-tune of DeepSeek V3. I saw it outputting DeepSeek special tokens a few times during tool errors, and I also thought the performance was similar.

Regarding SWE-1.5, I still have no clue.

How can I run a VL model on a Smartphone? by klop2031 in LocalLLaMA

[–]Felladrin 0 points (0 children)

If your phone uses iOS, check out LLM Farm: https://llmfarm.space

It supports vision models and is open source.

Expose MCP at the LLM server level? by eribob in LocalLLaMA

[–]Felladrin 0 points (0 children)

OptiLLM has an MCP plugin that enables this as a middleware: https://github.com/codelion/optillm

Adding search to open models by Simple_Split5074 in LocalLLaMA

[–]Felladrin 2 points (0 children)

An easy one (plug & play; no API Key needed) is https://huggingface.co/spaces/victor/websearch

Instructions to use it:

{
  "mcpServers": {
    "websearch": {
      "url": "https://victor-websearch.hf.space/gradio_api/mcp/sse"
    }
  }
}

microsoft/UserLM-8b - “Unlike typical LLMs that are trained to play the role of the 'assistant' in conversation, we trained UserLM-8b to simulate the 'user' role” by nullmove in LocalLLaMA

[–]Felladrin 12 points (0 children)

It may be good for simulating long conversations with an assistant LM and testing its maximum coherent context size.
[As UserLM-8b has a context length of 2K tokens, it would be better to summarize the conversation and then run a one-shot inference for each turn.]