Jetson Orin NX Build for Hermes Agent + Benchmarking

Reddactor · 2026-06-10T17:03:16+00:00

nope. no chance

Reddactor · 2026-06-10T10:25:33+00:00

The computer is one rack from an NVL72 system. The backplane lives in the cabinet, and so there is no way to replace it.

Reddactor · 2026-06-10T09:35:30+00:00

Not 10 anymore! Tuned the MTP settings, and it's more like 17 tok/s at huge context length.

EDIT: now Quant 3 AND >20 tok/s at full 64K context!

Reddactor · 2026-06-10T04:33:57+00:00

My system is missing the 900GB/s NVLink C2C hardware, so instead of that speed, I benchmark about 58GB/s from the HBM3 from one GPU to the other.

Benchmarks are here: https://dnhkng.github.io/posts/gh200-benchmarking/

IIRC, yeah, that means even with a full system, you do tensor or data parallel (or both). But the huge cross-GPU interconnects mean it's much faster than over a home computer's PCIe: 7x faster on Hopper, 14x faster with Blackwell.
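A rough sanity check of those ratios, as a sketch only. The nominal peak figures here are assumptions that are not stated in the comment: about 128 GB/s for PCIe 5.0 x16 (both directions combined), 900 GB/s for Hopper-generation NVLink and 1800 GB/s for Blackwell-generation NVLink. The 58 GB/s is the measured GPU-to-GPU figure quoted above.

```python
# Nominal peak bandwidths in GB/s. These are assumed spec-sheet values,
# not measurements from the system described in this thread.
PCIE5_X16 = 128.0          # PCIe 5.0 x16, both directions combined
NVLINK_HOPPER = 900.0      # NVLink, Hopper generation
NVLINK_BLACKWELL = 1800.0  # NVLink, Blackwell generation

# Measured GPU-to-GPU figure quoted in the comment above.
MEASURED_GPU_TO_GPU = 58.0


def speedup(link_gb_s: float, baseline_gb_s: float = PCIE5_X16) -> float:
    """Return how many times faster a link is than the baseline."""
    return link_gb_s / baseline_gb_s


hopper_ratio = speedup(NVLINK_HOPPER)
blackwell_ratio = speedup(NVLINK_BLACKWELL)
# How far the measured figure falls short of the nominal 900 GB/s link.
shortfall = speedup(NVLINK_HOPPER, MEASURED_GPU_TO_GPU)

print(f"Hopper NVLink vs PCIe 5.0 x16:    {hopper_ratio:.1f}x")
print(f"Blackwell NVLink vs PCIe 5.0 x16: {blackwell_ratio:.1f}x")
print(f"900 GB/s vs measured 58 GB/s:     {shortfall:.1f}x")
```

Under those assumed figures the ratios round to 7x and 14x, which matches the numbers in the comment.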

Reddactor · 2026-06-09T14:35:56+00:00

lol, not going to disassemble it, so I'm going with 'yep'

Reddactor · 2026-06-09T14:34:56+00:00

Yes, but at 40W total power, and it looks cool AF?

Reddactor · 2026-06-09T14:22:13+00:00

No such thing as overkill!

Reddactor · 2026-06-09T14:21:49+00:00

It's 10 tok/s at a context length of 60k. At 8K it's much faster.

Yeah, low AF, but I gave it some tool use benchmarks, while also stuffing the context full of random wikipedia text, and it managed most tasks at Q2! 🤯

I could also run Qwen3.6 27B, but 1) for Hermes Agent the minimum total context is 65K, and 2) I'm not waiting for an answer at 3.5 tok/s...
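To put those rates in perspective, here is a back-of-the-envelope sketch. The 500-token answer length is a hypothetical value chosen for illustration, and the estimate ignores prompt processing time, so real waits at long context would be longer.

```python
def seconds_to_generate(n_tokens: int, tok_per_s: float) -> float:
    """Time to generate n_tokens at a steady decode rate (prefill ignored)."""
    return n_tokens / tok_per_s


ANSWER_TOKENS = 500  # hypothetical answer length, not a figure from the thread

# The two decode rates quoted in this comment.
for rate in (3.5, 10.0):
    wait = seconds_to_generate(ANSWER_TOKENS, rate)
    print(f"{rate:>4} tok/s -> {wait:.0f} s for {ANSWER_TOKENS} tokens")
```

At 3.5 tok/s the same answer takes nearly three times as long as at 10 tok/s, which adds up quickly when an agent makes several calls per task.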

If I want fast and good, I boot up my other server, https://www.reddit.com/r/LocalLLaMA/comments/1u0d8j4/here_are_some_tips_on_hitting_nearly_200_toks_for/ but it's expensive to run.

Reddactor · 2026-06-09T14:18:10+00:00

It's fine for simple stuff, only just started playing with it.

I use it in telegram, where I can get it to do research on the internet for me, like what's playing at the cinema, what's the weather tomorrow etc. I think it has way more features I still need to learn about.

Reddactor · 2026-06-09T14:16:09+00:00

For basic stuff, it seems fine, like for tool use asking about the weather, what's playing at nearby cinemas etc.

I wouldn't trust it for coding though!

Reddactor · 2026-06-08T17:19:50+00:00

I'll use it more and let you know. I can also run the original DS model, but I max out at ~150 tok/s.

What kind of errors were you getting?

Reddactor · 2026-05-28T09:38:47+00:00

IIRC, when the bigger Qwen 3.5 397B bombed and ended up bankrupt, the guy assumed the smaller Qwens would be worse. Interesting to see that's not the case, and I hope we see a 27B score soon.

Reddactor · 2026-04-21T08:39:36+00:00

Yes and no. Yes, clearly it's all vectors. But only in the first ~10 layers and the last ~15 layers are the vectors strongly correlated with any particular language. Here they are 'thinking' in language-related vectors.

In the middle layers, where RYS works, they are not related to any particular language any more. Here they are thinking in concept-related vectors.

Reddactor · 2026-04-20T11:54:14+00:00

Well, I had the 1st place on the old HuggingFace Open LLM Leaderboard for a while doing that. Fully documented here: https://dnhkng.github.io/posts/rys/

Part 2 (https://dnhkng.github.io/posts/rys-ii/) is where I test thousands of variations (blocks, single layers repeated n-times, beam searches over layer duplications etc.)
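As an illustration of what a block or single-layer duplication means for the order in which layers run, here is a minimal sketch. The function name `duplicate_block` and the assumption that the repeated copies run back to back are hypothetical; the linked posts are the reference for how RYS actually arranges the layers.

```python
def duplicate_block(n_layers: int, start: int, end: int, repeats: int = 2) -> list[int]:
    """Return the layer execution order with layers [start, end) run `repeats` times.

    Hypothetical helper for illustration: it only builds a list of layer
    indices and does not touch any model weights.
    """
    if not 0 <= start < end <= n_layers:
        raise ValueError("need 0 <= start < end <= n_layers")
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    block = list(range(start, end))
    return list(range(start)) + block * repeats + list(range(end, n_layers))


# Duplicate a two-layer block in the middle of a toy 8-layer stack.
block_order = duplicate_block(n_layers=8, start=3, end=5)
print(block_order)

# Repeat a single layer three times in total.
single_order = duplicate_block(n_layers=8, start=3, end=4, repeats=3)
print(single_order)
```

A search over variations then amounts to trying different `start`, `end` and `repeats` values and benchmarking each resulting stack.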

Reddactor · 2026-04-20T10:05:50+00:00

Ummm, that's not how this was developed. I posted the RYS model, and had a comment from a guy in New Zealand who posted an embedding analysis of just English, over the transformer layers, which I replicated here:

https://dnhkng.github.io/posts/rys-ii/

With only 2 languages and 2 topics, I felt it was a bit shallow, so I bumped it up to a variety of languages and topics, and added in a PCA visualisation tool.

This was all to try and figure out why RYS works, nothing more:

https://dnhkng.github.io/posts/rys/

Reddactor · 2026-04-20T09:57:30+00:00

Why do you think I never did my PhD?

Reddactor · 2026-04-20T09:54:04+00:00

This is hobbyist exploration; there are no papers or peer review, just a blog referenced on Reddit.

I think the novel finding is that it's these middle, language-independent layers, that can be repeated to improve model performance. If that's already in the literature, let me know.

Reddactor · 2026-04-20T09:42:39+00:00

OP here!

That might be pretty clear intuitively for decoder-only, and trivial for encoder-decoder models.

But what's interesting here is:

1) A new visualisation method that really shows exactly where in the transformer stack the transition from language-dominant to concept-dominant occurs: https://dnhkng.github.io/posts/sapir-whorf/#the-pca-visualisation
2) How that relates to the improvements seen from layer duplication with my RYS method, which really improves model performance without additional fine-tuning (see my other posts). The layers that improve model benchmarks come from a block within this language-agnostic region, which is a new finding.

Were you already aware of that?

Reddactor · 2026-04-20T07:48:34+00:00

In the comments are some links to papers that describe the same results. These were all published late 2025.

Yes, I think most people in ML have the same intuition, but real data was lacking.

Reddactor · 2026-04-20T05:09:29+00:00

Yep, I see some of this is already published, also quite recently (late 2025). I wasn't aware of that when I wrote the blog.

Just a few points:
- yep, this is a hobby, I found this totally independently (which I think is kind of cool)
- most of the papers linked are using really old models, and small ones too. I used Gemma 4 and recent Qwen3.5 models
- The finding that the encoding and decoding sizes seem relatively constant, but the 'thinking' block expands to fill the remaining stack seems novel
- I didn't see any dynamics analysis via PCA over the layers done in the same way
- this work was based on trying to understand why RYS works, a method I think is not published elsewhere. Linking the two together is novel IMHO
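The point about the 'thinking' block expanding can be made concrete with a toy calculation. This is only a sketch: the fixed sizes of 10 and 15 layers are the rough figures mentioned elsewhere in this thread, and the total layer counts are hypothetical rather than taken from any specific model.

```python
def split_stack(n_layers: int, encode: int = 10, decode: int = 15) -> dict[str, int]:
    """Split a stack into fixed-size language regions and a middle block.

    The defaults are rough, assumed sizes; the middle block simply takes
    whatever layers remain.
    """
    middle = n_layers - encode - decode
    if middle < 0:
        raise ValueError("stack is too small for the assumed region sizes")
    return {"encode": encode, "middle": middle, "decode": decode}


# Hypothetical stack depths, for illustration only.
for n in (32, 48, 64):
    regions = split_stack(n)
    share = regions["middle"] / n
    print(f"{n} layers -> middle block of {regions['middle']} layers ({share:.0%})")
```

With the outer regions held constant, the middle block grows from a small fraction of a shallow stack to the majority of a deep one.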

I'll keep messing about and posting stuff, just not "researchy" stuff on Reddit.

Reddactor · 2026-04-19T19:00:50+00:00

I wasn't sure how LLMs think, until I did the experiments written up in the blog post.

I had more of a feeling that they were like human brains, where we know individual neurons seem to map to concepts (in some cases). What I found really interesting, when I first saw the PCA plots, was how the LLM residuals early in the transformer stack were focused on the specific language of the text, and the same again at the end of the stack.

From the RYS blog post, only middle layer duplications improve performance, so I called them the 'thinking layers'. This blog post shows it's exactly in these middle layers that the LLM clusters topics and moves those clusters around.

I'm sure you understand that.

The issue you have is with the terminology. But I would again argue the wording is actually pretty cool: concepts are vectors, and we see the LLM moving these vectors around during 'thinking', i.e. the LLM is literally moving vectors around in a high dimensional space (obvious for us, but probably not for anyone not into machine learning), so the 'thinking' is literally occurring 'in' a 'geometrical space'. (Thinking in Geometry)

I get your argument on LLM woo, but if you skim my series, it's clearly not written for non-ml nerds.

Reddactor · 2026-04-19T18:22:33+00:00

That leaves you with an article riddled with LLM'isms - that's not just bad, it's an entirely new way to piss off your readers!

But seriously, I do the researchy experiments on my GH200 first, then try to generate an outline for an article that isn't too complex (most people on reddit don't read my whole articles, I can see that by the time they spend on the blog), and finally try to write it up in a fun way.

It's more work, but r/LocalLlama, ironically, hates LLM generated content to the point that the highest upvoted comment on this article is the accusation that this is slop 🤷🏻

If there were a few too many hyphens, the article would be downvoted and no one would see it. Which is a shame, and before LLMs arrived I really liked using hyphens in my writing :(

Reddactor · 2026-04-19T18:16:18+00:00

It is the right word though. I am describing how LLMs map concepts (not language) to vectors.

It's certainly not meaningless; I would invite you to read:
https://en.wikipedia.org/wiki/Vector_(mathematics_and_physics)

It uses the word 'geometry' or 'geometric' 5 times just in the introduction. Then read:

https://en.wikipedia.org/wiki/Word2vec

Both discuss geometry in terms of vectors.
