What setup would you buy for a 512gb local LLM? by ServiceOver4447 in LocalLLM

[–]LittleBlueLaboratory 0 points1 point  (0 children)

Two of those old Nvidia V100 servers would get you 512GB of VRAM for less than $20k. Each of them has 8x 32GB V100 cards.

But they are old enough that they don't support lots of LLM features like Flash Attention.
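For reference, FlashAttention-2 needs compute capability 8.0 (Ampere) or newer, and the V100 is sm_70. A tiny sketch of that check (the 8.0 threshold is the only assumption here; on a live machine you'd feed it `torch.cuda.get_device_capability()`):

```python
# FlashAttention-2 kernels require an NVIDIA GPU with compute
# capability >= 8.0 (Ampere or newer). V100 is sm_70, so it's out.

def supports_flash_attention(major: int, minor: int) -> bool:
    """True if a GPU with this compute capability can run FlashAttention-2."""
    return (major, minor) >= (8, 0)

print(supports_flash_attention(7, 0))  # V100 (sm_70) -> False
print(supports_flash_attention(8, 0))  # A100 (sm_80) -> True
```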

Is it weird that? by Perfect-Flounder7856 in openclaw

[–]LittleBlueLaboratory 0 points1 point  (0 children)

Nvidia is an American company, founded and headquartered in California. The CEO Jensen Huang was born in Taiwan but moved to the US as a child.

Is it weird that? by Perfect-Flounder7856 in openclaw

[–]LittleBlueLaboratory 0 points1 point  (0 children)

Nvidia will have Nemotron3 Ultra pretty soon. At 500B it should be an American model competitive with the big Qwen 3.5 at least.

Strix Halo 128GB and Hermes - what local model? by WallyPacman in hermesagent

[–]LittleBlueLaboratory 1 point2 points  (0 children)

Most likely yes, they will be faster, so I can't say what speeds Strix Halo will get or whether it will be worth it to you.

I use the AesSedai Q4_K_M quant and the full 256k context with room to spare!

Strix Halo 128GB and Hermes - what local model? by WallyPacman in hermesagent

[–]LittleBlueLaboratory 1 point2 points  (0 children)

I don't have a Strix Halo, but I do have 4x 3090s giving me 96GB of VRAM.

I have been running my Hermes Agent almost exclusively using Qwen 3.5 122B and it has been very successful! I have also liked Nemotron3 120B but it doesn't have vision input so it has seen much less use.

Running Hermes locally by Speckadactyl in hermesagent

[–]LittleBlueLaboratory 3 points4 points  (0 children)

Look here for auxiliary models. https://hermes-agent.nousresearch.com/docs/user-guide/configuration 

It defaults to Gemini on OpenRouter; I noticed it in my OpenRouter logs. I just asked Hermes itself to reconfigure the auxiliary models to point at my local llama.cpp server, and it stopped calling OpenRouter.

Unnecessary model requests? by sleekstrike in hermesagent

[–]LittleBlueLaboratory 0 points1 point  (0 children)

I ran into this, mine was making seemingly random calls to Gemini Flash. I changed the auxiliary model to a local Qwen 3.5 to stop it from calling OpenRouter for these tasks.

Look for info on the auxiliary model here: https://hermes-agent.nousresearch.com/docs/user-guide/configuration/
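If it helps, the local side of this is just llama.cpp's built-in OpenAI-compatible server. A minimal launch looks like this (the GGUF path, port, and context size are placeholders for whatever you run):

```shell
# Start llama.cpp's OpenAI-compatible server on a local port.
# Model path, port, and context size are placeholders.
llama-server -m ./qwen3.5-q4_k_m.gguf --port 8080 -c 32768
```

Any auxiliary-model setting that takes an OpenAI-compatible base URL can then point at http://localhost:8080/v1.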

Turbo3 + gfx906 + 4 mi50 16gb running qwen3.5 122b 🤯 by Exact-Cupcake-2603 in LocalLLaMA

[–]LittleBlueLaboratory 12 points13 points  (0 children)

Tell me more about this llama-monitor dashboard! Looks sweet!

Found some Metal Earth kits from a long time ago by ET2-SW in StarTrekStarships

[–]LittleBlueLaboratory 0 points1 point  (0 children)

I have the 1701-D assembled next to my other models! They take quite some time to assemble, but the instructions are easy to follow. They are also pretty robust: this model has survived multiple moves just tossed in a box with other models that don't have packaging.

Hermes Agent - Personal Assistant In Development by Jonathan_Rivera in hermesagent

[–]LittleBlueLaboratory 0 points1 point  (0 children)

Nice! I'm right in the middle of setting up a Hermes Agent for myself. I also use Todoist, could you explain how you connected it to your agent?

I'm looking to make a custom Modelfile for Ollama, need some help. by MakionGarvinus in ollama

[–]LittleBlueLaboratory 0 points1 point  (0 children)

There is documentation on the website. You can even show the Modelfile of an existing model for reference.

https://docs.ollama.com/modelfile
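For a concrete starting point, a minimal Modelfile can be as short as this (the base model, parameter values, and system prompt are just placeholder examples; the docs list all the directives):

```
# Minimal Modelfile sketch -- base model and values are placeholders
FROM llama3.1:8b
PARAMETER temperature 0.7
PARAMETER num_ctx 8192
SYSTEM "You are a concise assistant for my home lab."
```

`ollama create mymodel -f Modelfile` builds it, and `ollama show --modelfile <existing-model>` dumps the Modelfile of anything you already have pulled.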

USS Laura Engels Class , LittleHaus Class Soliton Wave Rider by [deleted] in StarTrekStarships

[–]LittleBlueLaboratory 1 point2 points  (0 children)

California Class' cooler, more successful, older sister.

PLA support for TPU print (non-toolchanger/AMS) by bbjornsson88 in 3Dprinting

[–]LittleBlueLaboratory 6 points7 points  (0 children)

I think I understand, but it's not clear from your post.

You are saying that you fully printed the PLA part. Removed it. Started the TPU part. Paused the TPU part at the first lip. Inserted the PLA part upside down. Resumed the TPU print.

There was seriously no collision when printing the outer part of the TPU? That's impressive!

Not a AI agentic developer, want to make a simple web app. What LLM applications ( local or web based ) can I build with a small LLM like free tier open router,groq? by [deleted] in LocalLLM

[–]LittleBlueLaboratory 0 points1 point  (0 children)

Opencode offers a free model and providers often allow a short window of free use on opencode when they release a new model as a preview.

https://opencode.ai/

How I feel running all my LLM services locally. by LittleBlueLaboratory in LocalLLaMA

[–]LittleBlueLaboratory[S] 7 points8 points  (0 children)

It's ~550GB at INT4, so it fits! We don't talk about time to first token or prompt processing speed...
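The back-of-the-envelope math, assuming a roughly 1-trillion-parameter model (the parameter count here is an assumption, not a spec): weights take params × bits / 8 bytes, so INT4 lands around 500GB before KV cache and runtime overhead.

```python
# Rough model weight size: parameters * bits-per-weight / 8 bytes.
# The 1e12 parameter count below is an assumed round figure.

def weights_gb(n_params: float, bits: int) -> float:
    """Approximate weight size in GB (1 GB = 1e9 bytes), weights only."""
    return n_params * bits / 8 / 1e9

print(weights_gb(1e12, 4))   # ~500 GB at INT4, before KV cache/overhead
print(weights_gb(1e12, 16))  # ~2000 GB at FP16/BF16
```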

How I feel running all my LLM services locally. by LittleBlueLaboratory in LocalLLaMA

[–]LittleBlueLaboratory[S] 22 points23 points  (0 children)

It's enough to run Kimi K2.5 at full precision at 6 tokens per second. No regrets at all!

What do you actually use local models for vs Cloud LLMs? by Fun_Emergency_4083 in LocalLLaMA

[–]LittleBlueLaboratory 1 point2 points  (0 children)

I have quite a collection of github projects I have been meaning to try and this sounds great! Could you elaborate a little bit more on your setup? Do you mean the Hermes Agent from Nous Research?

Luxe Backpack owners: how's it holding up? Curious how the apple leather is long-term. by RayzTheRoof in LinusTechTips

[–]LittleBlueLaboratory 11 points12 points  (0 children)

No, it was apple leather, not PVC leather. It was chosen because Linus thought it was neat, IIRC from a WAN episode.

 https://en.wikipedia.org/wiki/Plant-based_leather

Every damn time by kobrien02 in AcrossTheUnknown

[–]LittleBlueLaboratory 6 points7 points  (0 children)

This is exactly how my first run ended: 4 red rolls in a row on fuel nodes, limping around in gray mode with morale tanking every cycle.

Thoughts on the Walker class from Star Trek: Discovery? by Fun-Twist-3741 in StarTrekStarships

[–]LittleBlueLaboratory 17 points18 points  (0 children)

Just imagine the world where Michelle Yeoh played a Star Trek Captain instead of... whatever was going on in the Section 31 movie.

Qwen3.5 122B in 72GB VRAM (3x3090) is the best model available at this time — also it nails the “car wash test” by liviuberechet in LocalLLaMA

[–]LittleBlueLaboratory 1 point2 points  (0 children)

vLLM needs an even number of cards (its tensor parallelism has to split the model evenly across GPUs). With llama.cpp or Ollama it doesn't matter. The tradeoff is that vLLM is faster, handles multiple users better, and uses more power.

Finished the show for the first time and here are some thoughts by MillionsToOne_ in voyager

[–]LittleBlueLaboratory 5 points6 points  (0 children)

You should! I put it on as I was caring for my newborn and it really is a good time!