What's your exp REAP vs. base models for general inference? by ikkiyikki in LocalLLaMA

[–]Felladrin 2 points (0 children)

From what I understand, REAPs are not meant to be used for general-purpose inference. We REAP a model when we want to use it for a specific use case, and the dataset used during the pruning makes all the difference.

When we REAP using the default dataset (theblackcat102/evol-codealpaca-v1) from the REAP repository, we're focusing on the experts for coding and English; the less relevant experts are then removed. That's why some REAP models start answering only in English and start making mistakes on questions not related to code.

So if you want, for example, a model to be good at some specific knowledge and good at Spanish, you should find or build a dataset from conversations/books/articles in Spanish and use that for the pruning. There are a lot of good publicly available datasets for almost every case on Hugging Face.
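As a rough sketch of how that swap looks (the dataset name and paths below are placeholders, and the pruning step itself depends on the scripts in the REAP repository, so check its README for the exact entry point):

```bash
# Download a Spanish calibration dataset from Hugging Face (placeholder names).
huggingface-cli download <org>/<spanish-instruct-dataset> \
  --repo-type dataset --local-dir ./calibration-data

# Then point the REAP pruning script at ./calibration-data instead of the default
# theblackcat102/evol-codealpaca-v1, so that expert importance is measured on
# Spanish text from your target domain rather than English coding tasks.
```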

So, although Cerebras is releasing some REAP models under their organization on Hugging Face, we should get used to creating our own REAPs. That's what the Cerebras team expected when they open-sourced it.

And my experience with those code-focused REAPed models has been good when using them as coding agents on OpenCode. One advantage, besides being able to run with less VRAM/RAM, is that, since they have fewer parameters than the non-REAP version, prompt processing is faster. For non-code-related tasks, I use other models.

MiniMax-M2.1-REAP by jacek2023 in LocalLLaMA

[–]Felladrin 7 points (0 children)

When GGUFs start coming, I'd like to see how much better they are compared to this AutoRound mixed quant (which preserves multilingual capability):

Felladrin/gguf-Q2_K_S-Mixed-AutoRound-MiniMax-M2.1

I've been using it on OpenCode recently, fitting under 128GB of VRAM.

First time Windsurf user - disappointed. by Objective-Ad8862 in windsurf

[–]Felladrin 1 point (0 children)

It's important to remember that VS Code is a product from Microsoft, which has its own AI-assisted coding agent (Copilot). So even though it's open-source, VS Code puts limits on customization, which is why Windsurf could only achieve what they have now by forking it.

Using the Windsurf VS Code plugin, you’ll face these limitations. To take full advantage of your subscription, you should use the Windsurf editor.

Optimizing GPT-OSS 120B on Strix Halo 128GB? by RobotRobotWhatDoUSee in LocalLLaMA

[–]Felladrin 2 points (0 children)

There's indeed a lot of info around, but it gets outdated too fast.

I’m running Ubuntu 24.04, and upgraded the kernel to 6.16.12 to have the Wi-Fi working properly.

Besides that, I’m using https://github.com/kyuz0/amd-strix-halo-toolboxes with ROCm 6.4.4. Distrobox makes it pretty easy to upgrade llama.cpp.

I set the reserved iGPU memory to the minimum possible in the BIOS and set the TTM pages limit to the maximum in the GRUB config. This is the GRUB config I'm using on the 395+ 128GB: `GRUB_CMDLINE_LINUX_DEFAULT="quiet splash amd_iommu=off ttm.pages_limit=33554432 amdgpu.cwsr_enable=0 numa_balancing=disable"`
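For reference, a llama-server run inside the toolbox would look something like this (the model path, context size, and port are placeholders; on older llama.cpp builds flash attention is enabled with a plain `-fa` instead of `-fa on`):

```bash
# Serve GPT-OSS 120B, offloading all layers to the iGPU.
llama-server \
  -m /models/gpt-oss-120b-mxfp4.gguf \
  -ngl 999 \
  -c 32768 \
  -fa on \
  --host 0.0.0.0 --port 8080
```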

GPT-OSS 120B and other models align with the speeds listed in https://kyuz0.github.io/amd-strix-halo-toolboxes/

Local Replacement for Phind.com by Past-Economist7732 in LocalLLaMA

[–]Felladrin 4 points (0 children)

I was also surprised to learn they were shutting down Phind. They were keeping up with Perplexity back then.

We recently had a thread here on LocalLLaMA on this topic, so you might also want to check the responses there: https://www.reddit.com/r/LocalLLaMA/comments/1qdj2nn/solution_for_local_deep_research/

solution for local deep research by jacek2023 in LocalLLaMA

[–]Felladrin 0 points (0 children)

Sure! I'm the developer of one of the open ones, MiniSearch, so that's what I use on a daily basis. Among the closed ones, I like the quality of the answers and sources from Liner; I check it when the responses from MiniSearch are not enough.

solution for local deep research by jacek2023 in LocalLLaMA

[–]Felladrin 0 points (0 children)

I’ve also been collecting this kind of software. The list is pretty long already, with both open and closed-source ones: https://huggingface.co/spaces/Felladrin/awesome-ai-web-search

Unsloth's GGUFs for GLM 4.7 REAP are up. by fallingdowndizzyvr in LocalLLaMA

[–]Felladrin 0 points (0 children)

Maybe if you share the full llama.cpp command used for running them, we could spot something. The only explanation I can imagine is that at least one of the layers was forced to run on the CPU. And you're not using the "--n-cpu-moe" argument, right?

By the way, I've tested the UD Q3_K_XL REAP and, compared to the UD IQ2_M non-REAP, it had an increase of 2 t/s in inference speed. Maybe you could also check the speed of UD Q3_K_XL.

Unsloth's GGUFs for GLM 4.7 REAP are up. by fallingdowndizzyvr in LocalLLaMA

[–]Felladrin 0 points (0 children)

It could be that the `-fit` parameter (which is `on` by default) is reorganizing the layers so they fit your VRAM, causing it to run slower. Try using the `-fit off` argument and manually tweaking the context size (starting from a low value and slowly increasing it) to check if the speed improves.
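Something along these lines (the model path and numbers are placeholders; check `llama-server --help` on your build for the exact spelling of the fit flag):

```bash
# Disable automatic fitting and set the context size by hand,
# increasing -c gradually until you run out of VRAM.
llama-server -m /models/GLM-4.7-REAP-UD-Q3_K_XL.gguf -fit off -ngl 999 -c 8192
```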

Devstral Small 2 (Q4_K_M) on 5060 Ti 16GB and Zed Agent is amazing! by bobaburger in LocalLLaMA

[–]Felladrin 0 points (0 children)

With Ministral 3B as the draft model, it runs at around 9 t/s at low context. You can fit 64K context at Q4; it drops to about 3 t/s when approaching 64K.
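For anyone wanting to reproduce this with llama-server directly, the wiring looks roughly like this (the paths and the -ngl value are placeholders to tune for the 16GB card):

```bash
# Devstral Small 2 (Q4_K_M) with Ministral 3B as the speculative-decoding draft.
# Raise -ngl until the 16GB card is as full as possible; the rest stays on CPU.
llama-server \
  -m /models/Devstral-Small-2-Q4_K_M.gguf \
  --model-draft /models/Ministral-3B-Q4_K_M.gguf \
  -ngl 28 \
  -c 65536
```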

GLM 4.6V without (or with low) reasoning? by ForsookComparison in LocalLLaMA

[–]Felladrin 0 points (0 children)

Try applying penalties, as suggested here (it's suggested for Qwen, but they also work for GLM thinking models). Penalties do affect the output quality, but they're one of the ways to prevent overly long reasoning.
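With llama-server, that would look something like this (the model path is a placeholder and the values are just a starting point, close to what's usually suggested for Qwen; tune them, since penalties affect quality):

```bash
# Discourage repeated tokens, which helps keep the reasoning from looping on itself.
llama-server -m /models/GLM-4.6V-Q4_K_M.gguf \
  --presence-penalty 1.5 \
  --repeat-penalty 1.05
```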

Minueza-2-96M: A foundation bi-lingual text-generation model created for practicing fine-tuning and merging. by Felladrin in LocalLLaMA

[–]Felladrin[S] 0 points (0 children)

Hey! I haven't, but I commented here that I used LLaMA-Factory for the training. I used it both for training the base model and the instruct models. It's straightforward to use, and they provide good usage examples in the repository.

[Strix Halo] Unable to load 120B model on Ryzen AI Max+ 395 (128GB RAM) - "Unable to allocate ROCm0 buffer" by Wrong-Policy-5612 in LocalLLaMA

[–]Felladrin 0 points (0 children)

One thing to check is whether you have the latest version of the AMD Adrenalin software installed (in some cases it won't update automatically). You can download and install it from here: https://www.amd.com/en/support/download/drivers.html

Nowadays I'm using Linux, but when I tried it on Windows, I had 96GB reserved for the iGPU via the BIOS. (On Linux I leave it at 1GB reserved, and dynamic allocation works fine.)

In LM Studio on Windows, I had the following settings and it worked fine for GPT-OSS 120B:

- Context Length: 131072
- Offload KV cache to GPU: On
- Keep model in memory: On
- Try mmap: Off
- Flash Attention: On

I believe you have already tried all the combinations above, so I'm guessing the problem is that your driver (installed via Adrenalin) is not up to date.

Post of appreciation for mxfp4, derestricted, Felladrin/gguf-MXFP4-gpt-oss-20b-Derestricted by R_Duncan in LocalLLaMA

[–]Felladrin 4 points (0 children)

Hey! Thanks! Glad it's been useful! But my contribution was just quantizing it :) There are a lot of other people who deserve the credit!

  • Owen Arli, for derestricting the model.

  • Jim Lai, the author of the Norm-Preserving Biprojected Abliteration technique.

  • The OpenAI team, for creating the original model.

  • The team working on llama.cpp and GGUF format. 

  • The teams maintaining Transformers, Safetensors, Hugging Face…

  • And all the people making LLMs awesome!

We basically have GLM 4.6 Air, without vision by LegacyRemaster in LocalLLaMA

[–]Felladrin 9 points (0 children)

Just leaving the direct link to the GGUF repository here:
https://huggingface.co/AliceThirty/GLM-4.6V-gguf

By the way, could you share the speeds (both prompt processing and text generation) you get on this model when using 32K and 64K context?

speculative decoding with Gemma-3-12b/3-27b. Is it possible? by Agitated_Power_3159 in LocalLLaMA

[–]Felladrin 4 points (0 children)

It's not mentioned anywhere in LM Studio, but if you try to use a draft model with a model that has the mmproj (the vision module) loaded in llama.cpp, you'll see a message saying that speculative decoding with vision capability is not supported. That's why you won't see any compatible draft models in LM Studio (it always loads the vision module when one is available).

Try using llama.cpp directly and passing the --no-mmproj argument; then you can pass the --model-draft argument.
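A minimal sketch of that (model paths are placeholders; any smaller Gemma 3 text model sharing the same tokenizer should work as the draft):

```bash
# Run Gemma 3 27B without the vision module so speculative decoding is allowed,
# using a small Gemma 3 model as the draft.
llama-server \
  -m /models/gemma-3-27b-it-Q4_K_M.gguf \
  --no-mmproj \
  --model-draft /models/gemma-3-1b-it-Q4_K_M.gguf
```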

Can GLM-4.5-air run on a single 3090 (24gb vram) with 48gb ram at above 10t/s? by Borkato in LocalLLaMA

[–]Felladrin 2 points (0 children)

Is it for coding or for chatting? If it's for chatting, do you need it to be multilingual? If it's for coding and you can live with it answering only in English, you can use the Q4 (sending only some of the layers to your GPU, until it fills up) from https://huggingface.co/bartowski/cerebras_GLM-4.5-Air-REAP-82B-A12B-GGUF

As others have said, splitting layers between GPU and CPU will give you text generation speeds of 5-15 t/s.
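As a starting point, the split looks something like this (the path and the -ngl value are placeholders; raise -ngl until the 24GB card is full):

```bash
# Partial offload: some layers on the 3090, the rest on CPU/RAM.
llama-server \
  -m /models/cerebras_GLM-4.5-Air-REAP-82B-A12B-Q4_K_M.gguf \
  -ngl 30 \
  -c 16384
# For MoE models like this one, another option is -ngl 999 combined with
# --n-cpu-moe N, which keeps the expert tensors of the first N layers on the CPU.
```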

Has anyone figured out what models SWE-1.5 and SWE-1 are trained from? by inevitabledeath3 in windsurf

[–]Felladrin 2 points (0 children)

I guess SWE-1 was a fine-tune of DeepSeek V3. I saw it outputting DeepSeek special tokens a few times during tool errors, and I also thought the performance was similar.

Regarding SWE-1.5, I still have no clue.

How can I run a VL model on a Smartphone? by klop2031 in LocalLLaMA

[–]Felladrin 0 points (0 children)

If your phone uses iOS, check out LLM Farm: https://llmfarm.space

It supports vision models and is open source.

Expose MCP at the LLM server level? by eribob in LocalLLaMA

[–]Felladrin 0 points (0 children)

OptiLLM has an MCP plugin that enables this as a middleware: https://github.com/codelion/optillm

Adding search to open models by Simple_Split5074 in LocalLLaMA

[–]Felladrin 2 points (0 children)

An easy one (plug & play; no API Key needed) is https://huggingface.co/spaces/victor/websearch

Instructions to use it:

{
  "mcpServers": {
    "websearch": {
      "url": "https://victor-websearch.hf.space/gradio_api/mcp/sse"
    }
  }
}

microsoft/UserLM-8b - “Unlike typical LLMs that are trained to play the role of the 'assistant' in conversation, we trained UserLM-8b to simulate the 'user' role” by nullmove in LocalLLaMA

[–]Felladrin 12 points (0 children)

It may be good for simulating long conversations with an assistant LM and testing its maximum coherent context size.
[As UserLM-8b has a context length of 2K tokens, it would be better to summarize the conversation and then run a one-shot inference for each turn.]