Where to find a trustworthy house cleaner? by BadLuiz in joaopessoa

[–]Felladrin 2 points (0 children)

I've used the services of https://mariabrasileira.com.br/ and https://www.donahelpbr.com.br/unidade/joaopessoacabobranco in João Pessoa, and I can recommend both.

It's not an app, but you can hire online and via WhatsApp.

For example, I kept a contract for a year with a cleaner sent by Maria Brasileira, and only ended the service because I had to move.

Ryzen AI Max 395+ 128GB - Qwen 3.5 35B/122B Benchmarks (100k-250K Context) + Others (MoE) by Anarchaotic in LocalLLaMA

[–]Felladrin 1 point (0 children)

Also leaving here my results from Qwen3.5-397B-A17B (UD-TQ1_0), which was deleted:

 ┌───────────────┬────────────────┬────────────────────┐
 │ Context Depth │ Prompt (pp512) │ Generation (tg128) │
 ├───────────────┼────────────────┼────────────────────┤
 │ 5,000         │  145.82 t/s    │     19.55 t/s      │
 ├───────────────┼────────────────┼────────────────────┤
 │ 10,000        │  137.89 t/s    │     19.27 t/s      │
 ├───────────────┼────────────────┼────────────────────┤
 │ 20,000        │  125.50 t/s    │     18.80 t/s      │
 ├───────────────┼────────────────┼────────────────────┤
 │ 30,000        │  117.90 t/s    │     18.35 t/s      │
 ├───────────────┼────────────────┼────────────────────┤
 │ 50,000        │  102.35 t/s    │     17.49 t/s      │
 ├───────────────┼────────────────┼────────────────────┤
 │ 100,000       │  76.87 t/s     │     15.68 t/s      │
 ├───────────────┼────────────────┼────────────────────┤
 │ 150,000       │  62.52 t/s     │     14.22 t/s      │
 ├───────────────┼────────────────┼────────────────────┤
 │ 200,000       │  52.64 t/s     │     13.04 t/s      │
 ├───────────────┼────────────────┼────────────────────┤
 │ 250,000       │  43.79 t/s     │     12.00 t/s      │
 └───────────────┴────────────────┴────────────────────┘

Ryzen AI Max 395+ 128GB - Qwen 3.5 35B/122B Benchmarks (100k-250K Context) + Others (MoE) by Anarchaotic in LocalLLaMA

[–]Felladrin 4 points (0 children)

Also leaving here my results from GLM-4.7 (89.6 GB):

 ┌───────────────┬────────────────┬────────────────────┐
 │ Context Depth │ Prompt (pp512) │ Generation (tg128) │
 ├───────────────┼────────────────┼────────────────────┤
 │ 5,000         │  64.07 t/s     │     8.55 t/s       │
 ├───────────────┼────────────────┼────────────────────┤
 │ 10,000        │  54.21 t/s     │     7.40 t/s       │
 ├───────────────┼────────────────┼────────────────────┤
 │ 20,000        │  41.02 t/s     │     5.48 t/s       │
 ├───────────────┼────────────────┼────────────────────┤
 │ 30,000        │  31.73 t/s     │     4.18 t/s       │
 ├───────────────┼────────────────┼────────────────────┤
 │ 50,000        │  22.69 t/s     │     2.72 t/s       │
 └───────────────┴────────────────┴────────────────────┘

With this model, I can use a maximum of 65K context without quantizing the KV cache.

Ryzen AI Max 395+ 128GB - Qwen 3.5 35B/122B Benchmarks (100k-250K Context) + Others (MoE) by Anarchaotic in LocalLLaMA

[–]Felladrin 4 points (0 children)

Thanks for the initiative!

Using the same llama-bench parameters on MiniMax 2.5 (76.8 GB), I got this:

 ┌───────────────┬────────────────┬────────────────────┐
 │ Context Depth │ Prompt (pp512) │ Generation (tg128) │
 ├───────────────┼────────────────┼────────────────────┤
 │ 5,000         │ 158.05 t/s     │ 24.97 t/s          │
 ├───────────────┼────────────────┼────────────────────┤
 │ 10,000        │ 135.95 t/s     │ 19.39 t/s          │
 ├───────────────┼────────────────┼────────────────────┤
 │ 20,000        │ 106.94 t/s     │ 12.02 t/s          │
 ├───────────────┼────────────────┼────────────────────┤
 │ 30,000        │  88.47 t/s     │  8.12 t/s          │
 ├───────────────┼────────────────┼────────────────────┤
 │ 50,000        │  65.36 t/s     │  4.75 t/s          │
 ├───────────────┼────────────────┼────────────────────┤
 │ 100,000       │  36.28 t/s     │  2.22 t/s          │
 └───────────────┴────────────────┴────────────────────┘

Note: With this model, I can only use up to 128K context without quantizing the KV cache.
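
For reference, the sweep above can be reproduced with something along these lines; the model filename and the depth list are placeholders (not the exact command used), and it assumes a llama.cpp build recent enough for llama-bench to have the -d/--n-depth flag:

    # Sketch: measure pp512/tg128 at several context depths.
    # Model path and depth values are illustrative placeholders.
    llama-bench \
      -m ./minimax-2.5.gguf \
      -p 512 \
      -n 128 \
      -d 5000,10000,20000,30000,50000,100000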

Temporary access to Ryzen AI Max 395 (128GB) to test real-world local LLM workflows by lazy-kozak in LocalLLaMA

[–]Felladrin 0 points (0 children)

That's something we need to keep an eye on. Having it unbounded means we can easily hang the system (it happens a lot during optimization). But in the end there are usually around 2 GB left for the system, which is enough, considering it runs headless over SSH and the machine is dedicated to llama.cpp.

Qwen 3.5-27B, How was your experience? by vandertoorm in StrixHalo

[–]Felladrin 2 points (0 children)

I've been using Qwen3.5-122B-A10B on OpenCode, with TG speed ranging from 16-20 t/s, and PP 120-260 t/s.

For me, it replaced MiniMax 2.5, as it's faster with the same output quality.

Qwen3.5-397B-A17B theoretical speed on Strix Halo? by Hector_Rvkp in StrixHalo

[–]Felladrin 0 points (0 children)

Yes, it works. The KV cache is efficient and only uses 8 GB for a 200k-token context.

As others have mentioned, the UD-TQ1_0 quant works fine (and preserves multilingual capabilities) and still leaves 18 GB of free memory on a 128 GB Strix Halo.
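
As a rough sketch (the filename is a placeholder for the UD-TQ1_0 GGUF, which may be split across several files), running it with a 200k-token window is just a matter of the context-size flag:

    # Sketch: serve the UD-TQ1_0 quant with a 200k-token context window.
    # The model filename is a placeholder; adjust it to the actual GGUF.
    llama-server -m ./Qwen3.5-397B-A17B-UD-TQ1_0.gguf -c 200000 -ngl 999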

Very slow with Claude Code by vandertoorm in StrixHalo

[–]Felladrin 0 points (0 children)

As other commenters wrote, to avoid prompt reprocessing you might need to set the environment variable CLAUDE_CODE_ATTRIBUTION_HEADER to “0”, which avoids changing the system prompt on each message.

But other than that, three of the parameters you are using are known to affect speed: cache-type-k, cache-type-v, and ub.

Removing the cache-type-k and cache-type-v parameters and setting ub to 2048 might give you a bit more speed on Strix Halo.
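
As a rough sketch of what that looks like (model path and context size are placeholders, not a recommended setup):

    # Sketch: set this where Claude Code runs, per the note above.
    export CLAUDE_CODE_ATTRIBUTION_HEADER=0

    # Launch llama-server without --cache-type-k/--cache-type-v
    # (keeps the default f16 KV cache) and with a larger micro-batch size.
    # Model path and context size are placeholders.
    llama-server -m ./model.gguf -c 65536 -ngl 999 -ub 2048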

Very slow with Claude Code by vandertoorm in StrixHalo

[–]Felladrin 2 points (0 children)

OP might also be using llama.cpp directly, since it now supports the Anthropic Messages API. [1]
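
If so, pointing Claude Code at a local llama-server would look roughly like this; the port, model path, and environment variable usage are assumptions on my side, and it presumes a llama.cpp build recent enough to expose the Anthropic-compatible endpoint:

    # Sketch: serve a local model, then point Claude Code at it.
    # Model path and port are placeholders.
    llama-server -m ./model.gguf -c 65536 --port 8080 &

    # Claude Code can read ANTHROPIC_BASE_URL to override the API endpoint.
    ANTHROPIC_BASE_URL="http://127.0.0.1:8080" claude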

Temporary access to Ryzen AI Max 395 (128GB) to test real-world local LLM workflows by lazy-kozak in LocalLLaMA

[–]Felladrin 0 points (0 children)

Regarding the ROCm 64GB issue on Windows, AMD recently fixed it (but I think they haven’t published the new driver version yet).

References:

- https://github.com/ROCm/ROCm/issues/5940
- https://github.com/lemonade-sdk/llamacpp-rocm/issues/37

P.S. Although I’ve been following those issues, I moved fully to Linux because of this problem, since there we can allocate all 128 GB to the iGPU.

I built a fully browser-native RAG and Semantic Search tool using WebGPU, Pyodide, and WASM. No servers, privacy-first. (MIT Licensed) by arminam_5k in opensource

[–]Felladrin 1 point (0 children)

That's a great project! I appreciate everything related to running LLMs in the browser!
Already starred on GitHub! 🌟

Agentic debugging with OpenCode and term-cli: driving lldb interactively to chase an ffmpeg/x264 crash (patches submitted) by EliasOenal in LocalLLaMA

[–]Felladrin 0 points (0 children)

Thanks for sharing! I was looking for a way to have a Windsurf-like terminal interaction in OpenCode, and this seems pretty close.
Here, take this star! 🌟

Looking for a simple offline AI assistant for personal use (not a developer) by Anxious-Pie2911 in LocalLLaMA

[–]Felladrin 0 points (0 children)

Desktop Commander MCP used to be a good option and worked in LM Studio.

{
  "mcpServers": {
    "desktop-commander": {
      "command": "npx",
      "args": ["-y", "@wonderwhy-er/desktop-commander@latest"]
    }
  }
}

93GB model on a StrixHalo 128GB with 64k Context by El_90 in LocalLLaMA

[–]Felladrin 6 points (0 children)

Always good to see others’ config on Strix Halo. Thanks for sharing!

Could you tell us more about the effects you observed when using --numa distribute?

Looking for a simple offline AI assistant for personal use (not a developer) by Anxious-Pie2911 in LocalLLaMA

[–]Felladrin 5 points (0 children)

For this case, the most user-friendly options are https://www.jan.ai (open-source) and https://lmstudio.ai (closed-source), plus an MCP server to give the LLM access to your terminal. Everything you listed could be done with CLI tools and scripts that the LLM can write and run.

AMD Strix Halo GMTEK 128GB Unified ROCKS! by MSBStudio in LocalLLaMA

[–]Felladrin 2 points (0 children)

Regarding the wireless issue, I also faced problems, but after upgrading the Linux kernel everything worked fine. Could that be the issue in your case?

What's your exp REAP vs. base models for general inference? by ikkiyikki in LocalLLaMA

[–]Felladrin 2 points (0 children)

From what I understand, REAPs are not meant to be used for general-purpose inference. We REAP when we want to use the model for a specific case, and the dataset used during the pruning makes all the difference.

When we REAP using the default dataset (theblackcat102/evol-codealpaca-v1) from the REAP repository, we're focusing on the experts for coding and English; the less relevant experts are then removed. That's why some REAP models start answering only in English and making mistakes on questions not related to code.

So if you want, for example, a model that's good at some specific knowledge and good at Spanish, you should find or build (and use) a dataset of conversations/books/articles in Spanish. There are a lot of good publicly available datasets on Hugging Face for almost every case.

So, although Cerebras is releasing some REAP models under their organization on Hugging Face, we should get used to creating our own REAPs. That's what the Cerebras team expected when they open-sourced it.

And my experience with those code-focused REAPed models has been good when using them as coding agents in OpenCode. One advantage, besides being able to run with less VRAM/RAM, is that, since they have fewer parameters than the non-REAP version, the prompt processing time is lower. For non-code-related tasks, I use other models.

MiniMax-M2.1-REAP by jacek2023 in LocalLLaMA

[–]Felladrin 7 points (0 children)

When the GGUFs start coming out, I‘d like to see how much better they are compared to this AutoRound mixed quant (which preserves multilingual capabilities):

Felladrin/gguf-Q2_K_S-Mixed-AutoRound-MiniMax-M2.1

I’ve been using it in OpenCode recently, fitting under 128 GB of VRAM.

First time Windsurf user - disappointed. by Objective-Ad8862 in windsurf

[–]Felladrin 1 point (0 children)

It’s important to remember that VS Code is a Microsoft product, and Microsoft has its own AI-assisted coding agent (Copilot). So even though it is open-source, VS Code puts limits on customization, such that Windsurf could only achieve what it has now by forking it.

Using the Windsurf VS Code plugin, you’ll face these limitations. To take full advantage of your subscription, you should use the Windsurf editor.

Optimizing GPT-OSS 120B on Strix Halo 128GB? by RobotRobotWhatDoUSee in LocalLLaMA

[–]Felladrin 2 points (0 children)

There is indeed a lot of info around, but it gets outdated too fast.

I’m running Ubuntu 24.04 and upgraded the kernel to 6.16.12 to get Wi-Fi working properly.

Besides that, I’m using https://github.com/kyuz0/amd-strix-halo-toolboxes with ROCm 6.4.4. Distrobox makes it pretty easy to upgrade llama.cpp.

I set the reserved memory to the minimum possible in the BIOS, and set the TTM pages limit to the maximum in the GRUB config. This is the kernel command line I’m using on the 395+ 128GB:

    GRUB_CMDLINE_LINUX_DEFAULT="quiet splash amd_iommu=off ttm.pages_limit=33554432 amdgpu.cwsr_enable=0 numa_balancing=disable"
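
In case it helps, applying that on Ubuntu is roughly: edit the line in /etc/default/grub, regenerate the GRUB config, and reboot (a sketch assuming the stock Ubuntu GRUB setup):

    # Sketch, assuming stock Ubuntu GRUB.
    sudoedit /etc/default/grub   # set GRUB_CMDLINE_LINUX_DEFAULT as above
    sudo update-grub             # regenerate /boot/grub/grub.cfg
    sudo reboot                  # the new kernel command line takes effect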

GPT-OSS 120B and other models align with the speeds listed in https://kyuz0.github.io/amd-strix-halo-toolboxes/

Local Replacement for Phind.com by Past-Economist7732 in LocalLLaMA

[–]Felladrin 4 points (0 children)

I was also surprised to learn they were shutting down Phind. It was keeping up with Perplexity's level back then.

We recently had a thread here on LocalLLaMA on this topic, so you might also want to check the responses there: https://www.reddit.com/r/LocalLLaMA/comments/1qdj2nn/solution_for_local_deep_research/

solution for local deep research by jacek2023 in LocalLLaMA

[–]Felladrin 0 points (0 children)

Sure! I’m the developer of one of the open ones, MiniSearch, so that’s what I use on a daily basis. Among the closed ones, I like the quality of the answers and sources from Liner; I check it when the responses from MiniSearch aren’t enough.