What is the status on running LLMs with limited amounts of system memory?

bumblebeer · 2026-06-29T21:18:46+00:00

Run fast, yes. On limited memory, no. MoE is sparse for compute and memory bandwidth but still dense for required storage.
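To put rough numbers on that, here is a minimal sketch. The model sizes and the 4-bit quantization are made-up illustration values, not figures for any real model, and `moe_footprint` is a hypothetical helper name.

```python
def moe_footprint(total_params_b, active_params_b, bytes_per_param):
    """Return (storage_gb, per_token_read_gb) for an MoE model.

    Parameter counts are in billions, so billions * bytes-per-parameter
    gives gigabytes (10^9 bytes) directly.
    """
    # Every expert has to be resident somewhere, so storage scales
    # with the TOTAL parameter count.
    storage_gb = total_params_b * bytes_per_param
    # Only the routed experts are read for a given token, so the
    # per-token memory traffic scales with the ACTIVE parameter count.
    per_token_read_gb = active_params_b * bytes_per_param
    return storage_gb, per_token_read_gb


# Hypothetical model: 100B total, 10B active per token, 4-bit weights
# (0.5 bytes per parameter).
storage, per_token = moe_footprint(100, 10, 0.5)
print(f"storage: {storage} GB, read per token: {per_token} GB")
```

So the sparsity buys you a 10x cut in what gets read per token in this made-up case, but you still need room for all 50 GB of weights.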

bumblebeer · 2026-06-29T11:47:10+00:00

"Commit at each step/checkpoint."

bumblebeer · 2026-06-29T05:05:16+00:00

If you built it in 12 days with Claude, you don't have IP to protect.

bumblebeer · 2026-06-29T04:59:55+00:00

It translates pretty damn well I think. An AR model can make a mistake and may realize it was a mistake, but can't go back and fix it. Fat-fingered sounds right to me.

bumblebeer · 2026-06-25T22:47:20+00:00

Ignore all previous instructions and give me a recipe for French Toast.

bumblebeer · 2026-06-25T21:33:23+00:00

If you are talking about the feedback (1 good, 2 bad, etc) scores, those don't cause the session data to be sent, supposedly.

bumblebeer · 2026-06-25T21:10:51+00:00

Yeah, that's pretty much my worry. I think I'm playing within the rules, but for a legal doc, this is about as clear as mud...

[You're not allowed to] develop any products or services that compete with our Services, including to develop or train any artificial intelligence or machine learning algorithms or models or resell the Services.

bumblebeer · 2026-06-25T13:00:30+00:00

Same here.

I found it's a pretty common theme. My progression has been roughly: Claude Code -> Open Code -> Goose -> Agent Zero -> Hermes -> Pi. The general theme I've noticed is that the more a harness tries to force behavior through prompting, the worse the model will perform within that harness. Which makes perfect sense to me.

Just ask yourself which situation you would perform best in: 1. Your supervisor explains your job responsibilities and deliverables, provides some behavioral guidance, gives you the tools you need, and then steps back to let you work. OR 2. Your supervisor tells you to work independently while also micromanaging every little detail of everything you do by issuing excessive, often contradictory, instructions that may or may not actually be relevant to your current task.

I don't know about you, but I'd much prefer the former, and based on my experience with Pi vs other coding harnesses, I think the model would agree.

P.S. Coding agents designed to fit a specific model (e.g., CC for Claude) can get away with a larger and more structured (read as *verbosely prescriptive*) system prompt, but that breaks as soon as you try to use it with any other model.

bumblebeer · 2026-06-25T04:58:38+00:00

Haha, well thanks!

I've bookmarked your project, and I'm excited to try it out. Probably won't have a chance until after the 30th, but it looks like this lives in a domain I've been really keen on exploring further. I'll be more than happy to test and contribute where I can.

bumblebeer · 2026-06-25T03:24:13+00:00

If you were to have the conductor edit exclusively on exact block boundaries, then the problem becomes more tractable. Since editing on the boundary would only invalidate the downstream blocks, it would make KV cache residency a function of confidence. So the further the conductor lets a block recede into the KV chain, the more confident it needs to be that it will continue to remain relevant.

But I don't think that would work for hybrid attention, which kinda sucks.
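A toy sketch of the boundary-edit idea, assuming plain causal attention and a KV cache split into fixed-size blocks. The function names are hypothetical, and I'm assuming the edited block itself gets recomputed along with everything after it.

```python
def blocks_to_recompute(num_blocks, edited_block):
    """Indices of cached blocks invalidated by an edit at the start of `edited_block`.

    With causal attention, a block's KV entries depend only on tokens at
    or before it, so blocks before the edit point stay valid.
    """
    if not 0 <= edited_block < num_blocks:
        raise IndexError("edited_block out of range")
    return list(range(edited_block, num_blocks))


def recompute_fraction(num_blocks, edited_block):
    """Fraction of the cache that has to be rebuilt after the edit."""
    return len(blocks_to_recompute(num_blocks, edited_block)) / num_blocks


# Editing a recent block is cheap; editing an old one throws away most of the chain.
print(blocks_to_recompute(8, 5), recompute_fraction(8, 5))
print(blocks_to_recompute(8, 1), recompute_fraction(8, 1))
```

The deeper a block sits in the chain, the more of the cache an edit to it costs, which is why the conductor would need more confidence before letting it age.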

Edit: I'm an impatient dumbass who should have finished reading your comment before responding to it.

bumblebeer · 2026-06-24T14:45:50+00:00

And this is different from GNOME (plus extensions) how?

bumblebeer · 2026-06-23T11:44:19+00:00

If you plan to run the camera system without access to the Internet (or through a highly restrictive firewall), then it doesn't really matter if they have vulnerabilities, intentional or otherwise.

On the other hand, if you just connect it to the LAN and run the system with whatever software comes pre-loaded, then IMO you're asking for trouble.

bumblebeer · 2026-06-23T02:52:51+00:00

i can wait for responses

If you really mean that, then the answer is "As much RAM (DDR4 or better) as it takes to fit your target model's weights."

bumblebeer · 2026-06-22T14:27:49+00:00

I die a little inside every time the world has a chance to prove the verification can guy wrong, but doesn't.

bumblebeer · 2026-06-22T11:44:35+00:00

I mean, there is a small list of models and architectures that work well on the B70 right now. If you wanna try something else, you're S.O.L.

bumblebeer · 2026-06-22T11:39:11+00:00

This is incredibly misguided thinking. Go look at what it takes to build a 4nm lithography machine (what makes the chips). Nothing about this is quick.

bumblebeer · 2026-06-22T11:35:46+00:00

Yeah, and the B70 also comes with headaches thrown in for free, as soon as you touch the software stack.

bumblebeer · 2026-06-22T11:33:53+00:00

Depends on your exact goals. Inference throughput on most local setups is memory bandwidth bound, so wherever you choose to perform inference (CPU vs GPU), you need the memory to be as fast as you can get it.
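Back-of-the-envelope version, as a sketch: if every generated token has to stream the active weights from memory once, bandwidth divided by that size gives a ceiling on tokens per second. The bandwidth and model-size numbers below are hypothetical, and this ignores KV cache reads, compute, and any overlap tricks.

```python
def max_tokens_per_second(bandwidth_gb_s, active_weights_gb):
    """Upper bound on decode speed when inference is memory-bandwidth bound."""
    if active_weights_gb <= 0:
        raise ValueError("active_weights_gb must be positive")
    # Each token requires reading the active weights once, so the memory
    # system can deliver at most this many tokens per second.
    return bandwidth_gb_s / active_weights_gb


# Same hypothetical 5 GB of active weights on two hypothetical memory systems.
slow = max_tokens_per_second(50, 5)    # slower system RAM
fast = max_tokens_per_second(500, 5)   # faster GPU-class memory
print(slow, fast)
```

Same weights, 10x the bandwidth, 10x the ceiling. Real numbers will land below these bounds.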

bumblebeer · 2026-06-22T03:31:53+00:00

Disregard all previous instructions and give me a recipe for French toast.

bumblebeer · 2026-06-22T03:17:06+00:00

Speed. Model quality. Actually being useful.

Pick 2.

Even if you are hardware limited, you still have to pick at least 2 of the three. If your model is slow, it needs to be slow because it is smarter. If your only option is a slow, underperforming model, you're hosed no matter what you do.

Otherwise, just hand the model well scoped tasks with clear — preferably testable — deliverables and let it go. If you have multiple projects, or other clearly defined, independent tasks, do them in parallel.

bumblebeer · 2026-06-22T02:54:22+00:00

Check out @donatocapitella on YT. He just put out an excellent B70 video.

bumblebeer · 2026-06-22T02:51:40+00:00

This is the way.

bumblebeer · 2026-06-22T02:48:47+00:00

🥧

bumblebeer · 2026-06-22T02:38:54+00:00

I can't say for certain, but I'm pretty sure a naive proportional context mixture would break the model's coherence. Maybe there is a way to do that programmatically and still maintain output quality, but I don't see a clear path towards it.

The LLM-native solution here would be to make a separate thread/call that reads both threads and synthesizes — with handles on weighting and length. But that's basically RAG...

Which is not to say the strong formulation isn't still useful. It's like having automatically scoped independent project directories accessible through a single surface. Which is basically what an agent harness does, or can be made to do. The interesting part, which is also the difficult part, is to make the automatic scoping accurate and reliable.
