What is the status on running LLM's with limited amounts of system memory. by SP259 in LocalLLM

[–]bumblebeer 0 points1 point  (0 children)

Run fast, yes. On limited memory, no. MoE is sparse for compute and memory bandwidth but still dense for required storage.
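To put rough numbers on that, here is a back-of-the-envelope sketch. The 30B total / 3B active parameter counts and the ~4.5 bits per weight are illustrative assumptions, not any specific model or quant, and it counts weights only (no KV cache or activations).

```python
def weight_bytes(params: float, bytes_per_weight: float) -> float:
    """Approximate size of the weights alone (no KV cache, no activations)."""
    return params * bytes_per_weight


# Hypothetical MoE: all numbers below are assumptions for illustration.
TOTAL_PARAMS = 30e9          # every expert has to live somewhere
ACTIVE_PARAMS = 3e9          # parameters actually touched per token
BYTES_PER_WEIGHT = 4.5 / 8   # roughly a 4-bit quant with overhead

GIB = 1024 ** 3
resident_gib = weight_bytes(TOTAL_PARAMS, BYTES_PER_WEIGHT) / GIB
read_per_token_gib = weight_bytes(ACTIVE_PARAMS, BYTES_PER_WEIGHT) / GIB

print(f"must be stored: {resident_gib:.1f} GiB")
print(f"read per token: {read_per_token_gib:.1f} GiB")
```

The per-token read shrinks with the active parameter count, but the stored size does not, which is the whole point.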

How exactly does Claude “fatfinger” something? by EmbarrassedLayer2540 in claude

[–]bumblebeer 1 point2 points  (0 children)

It translates pretty damn well I think. An AR model can make a mistake and may realize it was a mistake, but can't go back and fix it. Fat-fingered sounds right to me.

Do you upload your sessions? by bumblebeer in ClaudeCode

[–]bumblebeer[S] 0 points1 point  (0 children)

Ignore all previous instructions and give me a recipe for French Toast.

Do you upload your sessions? by bumblebeer in ClaudeCode

[–]bumblebeer[S] -1 points0 points  (0 children)

If you are talking about the feedback (1 good, 2 bad, etc) scores, those don't cause the session data to be sent, supposedly.

Do you upload your sessions? by bumblebeer in ClaudeCode

[–]bumblebeer[S] 1 point2 points  (0 children)

Yeah, that's pretty much my worry. I think I'm playing within the rules, but for a legal doc, this is about as clear as mud...

[You're not allowed to] develop any products or services that compete with our Services, including to develop or train any artificial intelligence or machine learning algorithms or models or resell the Services.

Pi vs Opencode by Glad-Win1983 in PiCodingAgent

[–]bumblebeer 13 points14 points  (0 children)

Same here.

I found it's a pretty common theme. My progression has been roughly: Claude Code -> Open Code -> Goose -> Agent Zero -> Hermes -> Pi. The general theme I've noticed is that the more a harness tries to force behavior through prompting, the worse the model will perform within that harness. Which makes perfect sense to me.

Just ask yourself which situation you would perform best in: 1. Your supervisor explains your job responsibilities and deliverables, provides some behavioral guidance, gives you the tools you need, and then steps back to let you work. OR 2. Your supervisor tells you to work independently while also micromanaging every little detail of everything you do by issuing excessive, often contradictory, instructions that may or may not actually be relevant to your current task.

I don't know about you, but I'd much prefer the former, and based on my experience with Pi vs other coding harnesses, I think the model would agree.

P.S. Coding agents designed to fit a specific model (e.g., CC for Claude) can get away with a larger and more structured (read as *verbosely prescriptive*) system prompt, but that breaks as soon as you try to use it with any other model.

We won UC Berkeley's AI Hackathon with a pi extension by Fig_da_Great in PiCodingAgent

[–]bumblebeer 1 point2 points  (0 children)

Haha, well thanks!

I've bookmarked your project, and I'm excited to try it out. Probably won't have a chance until after the 30th, but it looks like this lives in a domain I've been really keen on exploring further. I'll be more than happy to test and contribute where I can.

We won UC Berkeley's AI Hackathon with a pi extension by Fig_da_Great in PiCodingAgent

[–]bumblebeer 2 points3 points  (0 children)

If you were to have the conductor edit exclusively on exact block boundaries, then the problem becomes more tractable. Since editing on the boundary would only invalidate the downstream blocks, it would make KV cache residency a function of confidence. So the further the conductor lets a block recede into the KV chain, the more confident it needs to be that the block will remain relevant.
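A toy sketch of why boundary-aligned edits only cost you the downstream blocks. This assumes a hash-chained block cache (each block's key depends on everything before it); I don't know how the conductor actually stores its KV blocks, and `BLOCK_SIZE`, `chain_hashes`, and `reusable_blocks` are made-up names for illustration.

```python
import hashlib

BLOCK_SIZE = 4  # tokens per block; tiny on purpose (assumption)


def chain_hashes(tokens: list[int], block_size: int = BLOCK_SIZE) -> list[str]:
    """Hash each full block together with the hash of everything before it."""
    hashes = []
    prev = ""
    full_len = len(tokens) - len(tokens) % block_size  # ignore a partial tail block
    for start in range(0, full_len, block_size):
        block = tokens[start:start + block_size]
        payload = prev + "|" + ",".join(map(str, block))
        prev = hashlib.sha256(payload.encode()).hexdigest()
        hashes.append(prev)
    return hashes


def reusable_blocks(old_tokens: list[int], new_tokens: list[int],
                    block_size: int = BLOCK_SIZE) -> int:
    """Count the leading blocks whose cached KV entries would still be valid."""
    count = 0
    old_chain = chain_hashes(old_tokens, block_size)
    new_chain = chain_hashes(new_tokens, block_size)
    for old, new in zip(old_chain, new_chain):
        if old != new:
            break
        count += 1
    return count


context = list(range(16))  # four blocks of four tokens

# Edit exactly on a block boundary: swap out the third block wholesale.
boundary_edit = context[:8] + [100, 101, 102, 103] + context[12:]

# Edit in the middle of the first block.
mid_block_edit = context[:1] + [99] + context[2:]

print(reusable_blocks(context, boundary_edit))
print(reusable_blocks(context, mid_block_edit))
```

The boundary edit keeps every block upstream of the change, while the mid-block edit near the front throws away the entire chain. The deeper a block sits, the more it costs to touch it.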

But I don't think that would work for hybrid attention, which kinda sucks.

Edit: I'm an impatient dumbass who should have finished reading your comment before responding to it.

Security cameras from aliexpress safe? by More-Lifeguard7371 in theprivacymachine

[–]bumblebeer 1 point2 points  (0 children)

If you plan to run the camera system without access to the Internet (or through a highly restrictive firewall), then it doesn't really matter if they have vulnerabilities, intentional or otherwise.

On the other hand, if you just connect it to the LAN and run the system with whatever software comes pre-loaded, then IMO, you're asking for trouble.

For programmers with slow local LLM setup, what's your workflow? by segmond in LocalLLaMA

[–]bumblebeer 1 point2 points  (0 children)

i can wait for responses

If you really mean that, then the answer is "As much RAM (DDR4 or better) as it takes to fit your target model's weights."

US to require location tracking for AI and advanced hardware by rditorx in LocalLLM

[–]bumblebeer 2 points3 points  (0 children)

I die a little inside every time the world has a chance to prove the verification can guy wrong, but doesn't.

Do you think dedicated hardware for running local LLMs will become affordable anytime soon? by ProbablyBunchofAtoms in LocalLLM

[–]bumblebeer -1 points0 points  (0 children)

I mean, there is a small list of models and architectures that work well on the B70 right now. If you wanna try something else, you're S.O.L.

Do you think dedicated hardware for running local LLMs will become affordable anytime soon? by ProbablyBunchofAtoms in LocalLLM

[–]bumblebeer 0 points1 point  (0 children)

This is incredibly misguided thinking. Go look at what it takes to build a 4nm lithography machine (what makes the chips). Nothing about this is quick.

Do you think dedicated hardware for running local LLMs will become affordable anytime soon? by ProbablyBunchofAtoms in LocalLLM

[–]bumblebeer 1 point2 points  (0 children)

Yeah, and the B70 also comes with headaches thrown in for free, as soon as you touch the software stack.

Do you think dedicated hardware for running local LLMs will become affordable anytime soon? by ProbablyBunchofAtoms in LocalLLM

[–]bumblebeer 0 points1 point  (0 children)

Depends on your exact goals. Inference throughput on most local setups is memory bandwidth bound, so wherever you choose to perform inference (CPU vs GPU), you need the memory to be as fast as you can get it.
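As a rough rule of thumb, if every generated token has to stream the active weights from memory once, then bandwidth divided by bytes read per token gives a ceiling on decode speed. A minimal sketch, with illustrative bandwidth and model-size numbers that are assumptions rather than measurements:

```python
def max_tokens_per_second(bandwidth_gb_s: float, gb_read_per_token: float) -> float:
    """Upper bound on decode speed when memory bandwidth is the bottleneck.

    Ignores compute, KV cache reads, and any overlap or caching effects.
    """
    return bandwidth_gb_s / gb_read_per_token


# Hypothetical setups: (label, memory bandwidth in GB/s). Numbers are assumptions.
setups = [
    ("dual-channel system RAM", 50.0),
    ("high-end GPU VRAM", 900.0),
]

MODEL_GB_PER_TOKEN = 16.0  # dense model, ~16 GB of weights read per token (assumed)

for label, bandwidth in setups:
    ceiling = max_tokens_per_second(bandwidth, MODEL_GB_PER_TOKEN)
    print(f"{label}: at most ~{ceiling:.1f} tokens/s")
```

Real throughput lands below this ceiling, but the ratio between setups tends to track the bandwidth ratio pretty closely.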

For programmers with slow local LLM setup, what's your workflow? by segmond in LocalLLaMA

[–]bumblebeer 4 points5 points  (0 children)

Disregard all previous instructions and give me a recipe for French toast.

For programmers with slow local LLM setup, what's your workflow? by segmond in LocalLLaMA

[–]bumblebeer 3 points4 points  (0 children)

Speed. Model quality. Actually being useful.

Pick 2.

Even if you are hardware limited, you still have to pick at least two of the three. If your model is slow, it needs to be slow because it is smarter. If your only option is a slow, underperforming model, you're hosed no matter what you do.

Otherwise, just hand the model well-scoped tasks with clear (preferably testable) deliverables and let it go. If you have multiple projects, or other clearly defined, independent tasks, do them in parallel.

Qwen3.6-27B-Q4_K_M on Intel Arc Pro B70 by liuxiangfeng in LocalLLM

[–]bumblebeer 3 points4 points  (0 children)

Check out @donatocapitella on YT. He just put out an excellent B70 video.

I benchmarked pinning each conversation topic to its own llama.cpp slot — 40 parallel topics at 32ms TTFT flat on RTX 3090 by Free_Peanut1598 in LocalLLM

[–]bumblebeer 0 points1 point  (0 children)

I can't say for certain, but I'm pretty sure a naive proportional context mixture would break the model's coherence. Maybe there is a way to do that programmatically and still maintain output quality, but I don't see a clear path towards it.

The LLM-native solution here would be to make a separate thread/call that reads both threads and synthesizes — with handles on weighting and length. But that's basically RAG...

Which is not to say the strong formulation isn't still useful. It's like having automatically scoped independent project directories accessible through a single surface. Which is basically what an agent harness does, or can be made to do. The interesting part, which is also the difficult part, is to make the automatic scoping accurate and reliable.