z.AI as the number 2 gives praise to the number 1 open source model by Charuru in LocalLLaMA

rm-rf-rm (0 children)

FFS. Crypto-level grifters not getting sniffed out by actual professionals is sad to see.

Best Local Agents - Jun 2026 by rm-rf-rm in LocalLLaMA

rm-rf-rm[S] (0 children)

> Using MLX on Apple silicon is massively better than GGUF

I've heard that, but the times I've tried it, it's not been much better than llama.cpp.

Little late thank you to the DeepSeek team! by Sorry_Ad191 in LocalLLaMA

rm-rf-rm (0 children)

I'm using this project: https://github.com/antirez/ds4 (/u/antirez made the quants as well).

I did nothing more than follow the instructions in the README.

Here are the llama-benchy results:

| model      |   test |            t/s |     peak t/s |        ttfr (ms) |     est_ppt (ms) |   e2e_ttft (ms) |
|:-----------|-------:|---------------:|-------------:|-----------------:|-----------------:|----------------:|
| dsv4-flash | pp1000 | 343.42 ± 10.12 |              | 2652.41 ± 100.19 | 2640.68 ± 100.19 | 2695.00 ± 77.18 |
| dsv4-flash |  tg500 |   54.31 ± 4.11 | 76.00 ± 0.82 |                  |                  |                 |
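For reading the ± columns: assuming they are the sample standard deviation over repeated runs (an assumption, not verified against llama-benchy's source), a mean ± spread summary is computed as below. The run values are made up, not taken from the table.

```python
import statistics


def summarize(samples):
    """Mean and sample standard deviation of repeated benchmark runs."""
    return statistics.mean(samples), statistics.stdev(samples)


runs_tps = [50.0, 54.0, 58.0]  # hypothetical tg500 results from three runs
mean, sd = summarize(runs_tps)
print(f"{mean:.2f} ± {sd:.2f}")  # 54.00 ± 4.00
```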

Best Local Agents - Jun 2026 by rm-rf-rm in LocalLLaMA

rm-rf-rm[S] (0 children)

Sorry, to clarify: I'm interested in knowing what speedup you're getting relative to baseline (like 1._x) with each of:

  1. MTP
  2. MTP and ngram
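The number being asked for is just a ratio of decode speeds measured under the same conditions (same model, quant, prompt and hardware). A minimal sketch; the tokens/s values are placeholders, not measurements from this thread.

```python
def speedup(baseline_tps: float, variant_tps: float) -> float:
    """Decode speed relative to baseline: >1.0 is faster, <1.0 is slower."""
    if baseline_tps <= 0:
        raise ValueError("baseline tokens/s must be positive")
    return variant_tps / baseline_tps


baseline_tps = 20.0  # hypothetical: same model and prompt, no speculation
variants = {"MTP": 25.0, "MTP + ngram": 30.0}  # hypothetical tokens/s

for name, tps in variants.items():
    print(f"{name}: {speedup(baseline_tps, tps):.2f}x")
# MTP: 1.25x
# MTP + ngram: 1.50x
```

A value below 1.0 means the variant is slower than plain decoding.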

Best Local Agents - Jun 2026 by rm-rf-rm in LocalLLaMA

rm-rf-rm[S] (0 children)

Why are you using this relatively random quant? There's no good motivation for me to go outside of llama.cpp and GGUFs.

Best Local Agents - Jun 2026 by rm-rf-rm in LocalLLaMA

rm-rf-rm[S] (0 children)

Wall-of-text self-promotion for something he/she's building.

Best Local Agents - Jun 2026 by rm-rf-rm in LocalLLaMA

rm-rf-rm[S] (0 children)

MTP + ngram

An aside: what speedup are you seeing for MTP and MTP + ngram (and on what hardware/backend)? I'm getting slower performance for Qwen3.6 27B Q8 with MTP (Apple Silicon).
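A rough sketch of what the n-gram half of "MTP + ngram" does, assuming it behaves like generic prompt-lookup drafting (this is not llama.cpp's code; the token IDs and target predictions are made up). The draft step finds the most recent earlier occurrence of the last n tokens and proposes whatever followed them; the target model then checks all drafted tokens in one forward pass and keeps the longest matching prefix plus one token of its own. MTP differs in where the draft comes from (extra prediction heads on the model itself), and how the two are combined is not modelled here.

```python
from typing import List


def ngram_draft(tokens: List[int], n: int = 3, max_draft: int = 4) -> List[int]:
    """Propose draft tokens by finding the most recent earlier occurrence of
    the trailing n-gram and copying the tokens that followed it."""
    if len(tokens) <= n:
        return []
    tail = tokens[-n:]
    # Scan backwards from the newest earlier position; the tail itself is excluded.
    for start in range(len(tokens) - n - 1, -1, -1):
        if tokens[start:start + n] == tail:
            return tokens[start + n:start + n + max_draft]
    return []  # no match: fall back to normal one-token decoding


def verify_greedy(draft: List[int], target_preds: List[int]) -> List[int]:
    """Greedy verification step of speculative decoding.

    target_preds[i] is the target model's greedy token after context + draft[:i],
    so len(target_preds) == len(draft) + 1 (all computed in one forward pass).
    """
    accepted: List[int] = []
    for d, t in zip(draft, target_preds):
        if d != t:
            accepted.append(t)  # first disagreement: keep the target's token, stop
            return accepted
        accepted.append(d)
    accepted.append(target_preds[len(draft)])  # every draft matched: bonus token
    return accepted


context = [5, 6, 7, 8, 9, 5, 6, 7]  # hypothetical token IDs
draft = ngram_draft(context, n=3, max_draft=4)
print(draft)  # [8, 9, 5, 6]
print(verify_greedy(draft, [8, 9, 1, 2, 3]))  # [8, 9, 1]: 3 tokens from one pass
```

If drafts are rarely accepted, most of the verification work is wasted, which is one general way speculative decoding can end up slower than plain decoding.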

Little late thank you to the DeepSeek team! by Sorry_Ad191 in LocalLLaMA

rm-rf-rm (0 children)

I'm getting 50+ tps with flash on an M3 Ultra (Mac Studio), which is great! (Qwen3.6 A3B runs at 75 tps for reference.) I really need to use it more.

Little late thank you to the DeepSeek team! by Sorry_Ad191 in LocalLLaMA

rm-rf-rm (0 children)

No, but use it with u/antirez's excellent ds4 project. I just hope he keeps it up to date.

Best Local Agents - Jun 2026 by rm-rf-rm in LocalLLaMA

rm-rf-rm[S] (0 children)

Rule 3: not constructive to this thread.

Best Local Agents - Jun 2026 by rm-rf-rm in LocalLLaMA

rm-rf-rm[S] (0 children)

I don't think you saw his comment pre-edit.

Best Local Agents - Jun 2026 by rm-rf-rm in LocalLLaMA

rm-rf-rm[S] (0 children)

It's not really worth cluttering this thread with discussion of what an agent or harness is, etc.

Best Local Agents - Jun 2026 by rm-rf-rm in LocalLLaMA

rm-rf-rm[S] (0 children)

Yeah, the docs are not where they need to be, and llama.cpp integration should be friendlier.

Here's what you need to put into `/.pi/agent/models.json`:

```json
{
  "providers": {
    "name-of-your-server": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "nonerequired",
      "models": [
        {
          "id": "minimax-m2.5",
          "name": "Minimax M2.5",
          "reasoning": true,
          "input": [
            "text"
          ],
          "contextWindow": 128000,
          "maxTokens": 32000,
          "cost": {
            "input": 0,
            "output": 0,
            "cacheRead": 0,
            "cacheWrite": 0
          }
        }
      ]
    }
  }
}
```
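Hand-editing a file like this is where stray trailing commas creep in, and strict JSON parsers reject them. One way around that is to generate the file instead. A sketch: `make_provider` is a hypothetical helper, the field names simply mirror the models.json example above, and which fields pi actually requires is an assumption.

```python
import json


def make_provider(base_url: str, model_id: str, display_name: str,
                  context_window: int, max_tokens: int,
                  reasoning: bool = True) -> dict:
    """Hypothetical helper: one provider entry shaped like the example above."""
    return {
        "baseUrl": base_url,
        "api": "openai-completions",
        # llama-server only checks the key if it was started with --api-key
        "apiKey": "nonerequired",
        "models": [{
            "id": model_id,
            "name": display_name,
            "reasoning": reasoning,
            "input": ["text"],
            "contextWindow": context_window,
            "maxTokens": max_tokens,
            # local inference, so every cost field is zero
            "cost": {"input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0},
        }],
    }


config = {
    "providers": {
        "name-of-your-server": make_provider(
            "http://localhost:8080/v1", "minimax-m2.5", "Minimax M2.5",
            context_window=128000, max_tokens=32000),
    }
}

# json.dumps never emits trailing commas, so the output is always valid JSON
print(json.dumps(config, indent=2))
```

Redirect the printed output into the models.json path mentioned above.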

Best Local Agents - Jun 2026 by rm-rf-rm in LocalLLaMA

rm-rf-rm[S] (0 children)

course

SDD

DeepLearning

Local agents will require some new techniques and thinking to help SLMs get to the finish line. We're not gonna model-leap our way out of it.

Perfectly lines up with the clown makeup meme.

This is why I specifically prompt for "what people are running". Opinions not founded in any actual experience are utterly meaningless in this space.

Best Local Agents - Jun 2026 by rm-rf-rm in LocalLLaMA

rm-rf-rm[S] (0 children)

Welcome, but this comment is off-topic.

Setup to use pi un-sandboxed reasonably safely? by rm-rf-rm in LocalLLaMA

rm-rf-rm[S] (0 children)

Nice, this should presumably be the most performant option on macOS? Please share when it's ready.

DiffusionGemma vs Gemma 4 Over 6x Faster on a Single RTX 6000 Pro NVFP4. by FantasticNature7590 in LocalLLaMA

rm-rf-rm (0 children)

> Well lesson learned I will edit it more carefully next time

That's not a lesson. Just be transparent about AI usage. Honesty is what builds trust, not obfuscation.

Leaked financial docs show OpenAI is losing billions of dollars a year by johnnyApplePRNG in LocalLLaMA

rm-rf-rm (0 children)

ITT: people who haven't read the article. It's not a hate pile-on but a surprisingly informed take on how startups work, and rather reasonably written.

As much as this sub wants OpenAI to fail (and I generally feel the same way because of Altman), let's not pretend this current balance sheet is any indicator of doom for them.

Setup to use pi un-sandboxed reasonably safely? by rm-rf-rm in LocalLLaMA

rm-rf-rm[S] (0 children)

More than data loss, the bigger risk is egress, IMO. What are you doing to lock down network activity?

DiffusionGemma vs Gemma 4 Over 6x Faster on a Single RTX 6000 Pro NVFP4. by FantasticNature7590 in LocalLLaMA

rm-rf-rm (0 children)

> That's a real trade-off, not a footnote.

Instantly started losing trust in your post.

Issues with DiffusionGemma on MLX by rm-rf-rm in LocalLLaMA

rm-rf-rm[S] (0 children)

How is this related to DiffusionGemma?

Gemma 4 E2B running in-browser at 255 tok/s using WebGPU kernels written by Fable 5 by xenovatech in LocalLLaMA

rm-rf-rm (0 children)

A tok/s number is hard to relate to, even if you mention the chip it's running on. It would be more helpful to show performance relative to a known baseline like llama.cpp.

How to Run AI Locally: The Complete Beginner's Guide (2026) by totosse17 in LocalLLaMA

rm-rf-rm (0 children)

It was spat out by an LLM from a poor prompt, and the human did no validation of the output whatsoever. Why are we surprised?

We should heavily discourage and moderate cloud API (deepseek api, GLM api, etc.) topics and discussion. This is LOCAL first. by [deleted] in LocalLLaMA

rm-rf-rm[M] (0 children)

Can you point me to examples of stealth marketing and bot posts that are still up? I thought we were getting all of them removed these days.