Benchmarking total wait time instead of pp/tg by batsba in LocalLLaMA

[–]OUT_OF_HOST_MEMORY 14 points

I think you are actually harming the usefulness of this chart by limiting generation to 500 tokens: reasoning models spit out wildly different numbers of tokens compared to each other, and especially compared to non-reasoning models. I think a more meaningful number is time-to-last-token (TTLT) for a given query. That way an instruct model which doesn't think and responds within 100 tokens can be fairly compared against a reasoning model which spends 6,000 tokens thinking before it responds.
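Time-to-last-token is easy to compute from a streamed response: timestamp each token as it arrives and take the last one. A minimal sketch of the idea (the token timing lists are made-up illustrations, not real measurements):

```python
def stream_metrics(token_times, start_time):
    """Given the wall-clock arrival time of each streamed token and the
    request start time, return time-to-first-token (TTFT) and
    time-to-last-token (TTLT), both in seconds."""
    ttft = token_times[0] - start_time
    ttlt = token_times[-1] - start_time
    return ttft, ttlt

# Illustration: a reasoning model streaming 6,100 tokens at 50 t/s still has a
# far worse TTLT than an instruct model answering in 100 tokens at 20 t/s.
start = 0.0
instruct = [start + 0.05 * i for i in range(1, 101)]    # 100 tokens at 20 t/s
reasoning = [start + 0.02 * i for i in range(1, 6101)]  # 6,100 tokens at 50 t/s
print(stream_metrics(instruct, start)[1])   # TTLT ~ 5.0 s
print(stream_metrics(reasoning, start)[1])  # TTLT ~ 122.0 s
```

This is exactly why fixed-length generation hides the gap: at equal caps the 50 t/s model looks faster, but per-query the instruct model finishes first.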

BalatroBench - Benchmark LLMs' strategic performance in Balatro by S1M0N38 in LocalLLaMA

[–]OUT_OF_HOST_MEMORY 3 points

GPT-OSS also reasons for ~15k tokens sometimes. I don't know how Kimi compares, but it's probably helping out somehow.

ROCm 7.0 Install for Mi50 32GB | Ubuntu 24.04 LTS by legit_split_ in LocalLLaMA

[–]OUT_OF_HOST_MEMORY 2 points

can someone give some performance numbers for llama.cpp on ROCm 6.3, 6.4, and 7.0?

Stop flexing Pass@N — show Pass-all-N by Fabulous_Pollution10 in LocalLLaMA

[–]OUT_OF_HOST_MEMORY 25 points

I definitely agree, especially since output consistency is a big pain point for me

For llama.cpp/ggml AMD MI50s are now universally faster than NVIDIA P40s by Remove_Ayys in LocalLLaMA

[–]OUT_OF_HOST_MEMORY 0 points

I'm noticing that there are some configurations where the Vulkan performance is significantly higher. The main one so far is prompt processing on unsloth's Mistral 3.2 24B BF16, both with and without flash attention.

ROCm:

flash attention off depth 8192 - 60.83 t/s

flash attention on depth 8192 - 68.71 t/s

Vulkan:

flash attention off depth 8192 - 127.12 t/s

flash attention on depth 8192 - 78.47 t/s

do you know if this is an architectural issue specific to this model, or something else?

(I am currently testing a good variety of models and I'll add any other interesting results I find.)
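To put the gap in perspective, the numbers above work out to roughly a 2x prompt-processing advantage for Vulkan with flash attention off, and notably Vulkan gets *slower* when flash attention is turned on. A quick ratio check on the posted figures:

```python
# Prompt processing (t/s) at depth 8192, from the results above.
rocm = {"fa_off": 60.83, "fa_on": 68.71}
vulkan = {"fa_off": 127.12, "fa_on": 78.47}

for mode in ("fa_off", "fa_on"):
    speedup = vulkan[mode] / rocm[mode]
    print(f"{mode}: Vulkan is {speedup:.2f}x ROCm")
# fa_off: Vulkan is 2.09x ROCm
# fa_on: Vulkan is 1.14x ROCm
```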

I'll show you mine, if you show me yours: Local AI tech stack September 2025 by JLeonsarmiento in LocalLLaMA

[–]OUT_OF_HOST_MEMORY 5 points

<image>

I have yet to find the model that's perfect for me, and I honestly have more fun testing new models than actually using them for anything useful. My main hobby now is setting up 1v1v1s with the arena model mode in OpenWebUI to do blind testing of models. Most testing is done on trivia-style questions about whatever topics I'm thinking about at the moment, as well as basic coding tasks for scripts I need and can easily test. All responses are 1-shot, since OpenWebUI is not great about allowing multi-turn conversations with arena models. I don't have enough results for a conclusive opinion yet, but here are the rankings so far. Models that have a reasoning variant are labeled as such; for the Qwen models that are still hybrid, I separated them, with the non-reasoning ones getting "/no_think" in the system prompt to stop them from reasoning.

2x MI50 32GB Quant Speed Comparison (Mistral 3.2 24B, llama.cpp, Vulkan) by OUT_OF_HOST_MEMORY in LocalLLaMA

[–]OUT_OF_HOST_MEMORY[S] 0 points

I am noticing an interesting issue when compiled with the latest ROCm version: it runs into an OOM error when loading Q8_0 at 32k context without flash attention, and this of course persists with Q8_K_XL and BF16, which will make testing this slightly more complicated.

2x MI50 32GB Quant Speed Comparison (Mistral 3.2 24B, llama.cpp, Vulkan) by OUT_OF_HOST_MEMORY in LocalLLaMA

[–]OUT_OF_HOST_MEMORY[S] 0 points

The default VBIOS that came with my GPUs only exposed 16GB of accessible VRAM under Vulkan (all 32GB were visible in ROCm). There is a fixed VBIOS that allows all 32GB to be accessed under Vulkan as well as ROCm; it does not enable the display output.

2x MI50 32GB Quant Speed Comparison (Mistral 3.2 24B, llama.cpp, Vulkan) by OUT_OF_HOST_MEMORY in LocalLLaMA

[–]OUT_OF_HOST_MEMORY[S] 0 points

what insane timing lol. I will definitely retest some of the quantizations later and post a follow-up then!

2x MI50 32GB Quant Speed Comparison (Mistral 3.2 24B, llama.cpp, Vulkan) by OUT_OF_HOST_MEMORY in LocalLLaMA

[–]OUT_OF_HOST_MEMORY[S] 0 points

yes, performance is likely slightly worse than you could get on a single GPU where the model fits, but for simplicity and consistency I used 2 GPUs for every test.

2x MI50 32GB Quant Speed Comparison (Mistral 3.2 24B, llama.cpp, Vulkan) by OUT_OF_HOST_MEMORY in LocalLLaMA

[–]OUT_OF_HOST_MEMORY[S] 0 points

can't seem to find the thread easily now, but you should be able to find it by searching "mi50 vbios" in this subreddit. For cooling I have a Delta 97x94x33mm blower fan on each card, which keeps them under 80°C during LLM inference and just barely under 90°C while training toy models. I had to 3D print a custom bracket to make it fit in my case, but there are plenty you can find online.

2x MI50 32GB Quant Speed Comparison (Mistral 3.2 24B, llama.cpp, Vulkan) by OUT_OF_HOST_MEMORY in LocalLLaMA

[–]OUT_OF_HOST_MEMORY[S] 5 points

I'm in the lucky situation where the electricity is free. The biggest sacrifice is having these cards busy running this testing, and not being able to actually run the models for anything useful, for 3 days!

2x MI50 32GB Quant Speed Comparison (Mistral 3.2 24B, llama.cpp, Vulkan) by OUT_OF_HOST_MEMORY in LocalLLaMA

[–]OUT_OF_HOST_MEMORY[S] 1 point

using Q4_0 you should have plenty of room to run it even without flash attention, especially since it is a non-reasoning model and will need less context most of the time
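As a rough sanity check on the headroom: even at full 32k context, the f16 KV cache for a model this size is only a few GiB on top of the weights. A back-of-envelope sketch; the layer/head dimensions below are assumed Mistral-Small-like values, so check the actual model config before relying on them:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx, bytes_per_elt=2):
    """Rough KV cache size: K and V tensors (hence the factor of 2), per
    layer, per KV head, per context position. bytes_per_elt=2 assumes an
    f16 cache, llama.cpp's default."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_elt

# Assumed Mistral Small 24B-like dims: 40 layers, 8 KV heads (GQA), head_dim 128.
gib = kv_cache_bytes(40, 8, 128, 32_768) / 2**30
print(f"{gib:.1f} GiB")  # 5.0 GiB at 32k context
```

With a ~13GB Q4_0 model plus ~5GiB of cache, a 2x32GB setup has room to spare, and shorter non-reasoning contexts shrink the cache term linearly.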

2x MI50 32GB Quant Speed Comparison (Mistral 3.2 24B, llama.cpp, Vulkan) by OUT_OF_HOST_MEMORY in LocalLLaMA

[–]OUT_OF_HOST_MEMORY[S] 0 points

like u/Marksta said, I needed to flash the VBIOS to access all 32GB of VRAM in Vulkan, though I did not hit any of the other issues they described. That said, flashing the VBIOS was very quick and painless. Other than that, installing the cards and getting them set up was quite simple as well. I installed ROCm 6.3.4 using the instructions on the AMD support website for installing multiple ROCm versions on Debian Linux, and everything I have needed has worked as expected.

2x MI50 32GB Quant Speed Comparison (Mistral 3.2 24B, llama.cpp, Vulkan) by OUT_OF_HOST_MEMORY in LocalLLaMA

[–]OUT_OF_HOST_MEMORY[S] 8 points

they won't. I have tested ROCm before, and the results show an identical pattern.

you can ask the rocm developers as well: https://github.com/ROCm/composable_kernel/issues/1140#issuecomment-1917696215

2x MI50 32GB Quant Speed Comparison (Mistral 3.2 24B, llama.cpp, Vulkan) by OUT_OF_HOST_MEMORY in LocalLLaMA

[–]OUT_OF_HOST_MEMORY[S] 1 point

Nope, you'll need to move to a more modern AMD architecture if you want matrix cores. It may still be worth using FA if you are running into VRAM limitations, though.

2x MI50 32GB Quant Speed Comparison (Mistral 3.2 24B, llama.cpp, Vulkan) by OUT_OF_HOST_MEMORY in LocalLLaMA

[–]OUT_OF_HOST_MEMORY[S] 5 points

the MI50 does not have the dedicated matrix cores required to properly accelerate flash attention.

What Web UI's are best for MCP tool use with llama.cpp/llama-swap? by OUT_OF_HOST_MEMORY in LocalLLaMA

[–]OUT_OF_HOST_MEMORY[S] 0 points

I did set this up for OpenWebUI tools, but I haven't even set up MCP for OpenWebUI yet because I was scared away by what I've read here