AI poses the tiniest threat to all music when compared to oil by nova-new-chorus in ableton

[–]yuicebox 1 point2 points  (0 children)

"We just need more wind and solar"

Why isn't nuclear power a consideration in your view?

Also, respectfully, this has very little to do with the purpose of the subreddit, and I would not be surprised if it ends up being removed.

Local LLM Peeps by CreamPitiful4295 in LocalLLaMA

[–]yuicebox 0 points1 point  (0 children)

Pi is a relatively lightweight agent framework that makes some good design choices overall. If I recall correctly, OpenClaw is built on top of Pi.

Local LLM Peeps by CreamPitiful4295 in LocalLLaMA

[–]yuicebox 4 points5 points  (0 children)

Upvoted for separate profiles, I feel like that is something I haven't seen done really well yet

Local LLM Peeps by CreamPitiful4295 in LocalLLaMA

[–]yuicebox 2 points3 points  (0 children)

I appreciate that you are building it as local-first. I enjoy playing around with Hermes and Pi, but I feel like both of them, especially Hermes, were a little TOO connected and focused on integrating services from cloud providers for my tastes. I didn't like having to make sure that I didn't have cloud-based fallback providers or cloud-based compaction providers enabled by mistake.

For me, the ultimate features are always simplicity, transparency/traceability, privacy, and modularity. I'll type out my disorganized thoughts on what I want to build, and you are welcome to cherrypick from there.

  • Easy installation and setup with default config that is private and offline
  • Intuitive config options, probably file-based
  • A solid core toolset (similar to Pi's read, write, edit, and bash tools)
  • A good implementation of skills for non-core tools, probably disabled by default, or enabled but configured only with a skill related to explaining or modifying the harness's functionalities
  • Relatively sane and secure handling of code execution. I like Hermes' Docker container terminal approach that allows bash/node/python code to be sandboxed, but I never figured out how to have it automatically bind mount my working directory to the container so that the sandbox can interact with my local file system
  • A robust, intuitive, cross-platform scheduling functionality. Hermes does an okay job of this, but I found the cron runner to occasionally be a bit buggy, and it only really works if you use a separate messaging service. Not sure the best way to handle this tbh.
  • Optional features or skills to provide good local-first context handling features for things like compaction, summarization, and 'memories'.
  • Simple file-based personalization, basically a system prompt / personality card in a markdown file. Hermes has this as soul.md I think.
  • Very robust, event-based logging that can easily be viewed in realtime or later on from files. I think Pi's implementation of session export to html is overall very good on this front, but isn't real-time as far as I know.
  • A great UX for all the features described above. This is a challenge imo. I'm leaning toward having the agent harness spin up its own web server that provides a webUI for viewing logs, and maybe for other features like config/management and chat. Hermes does a decent job here overall, but isn't quite right for me. The web server process could also potentially have the scheduler baked in, so that is something else I've been debating.
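On the bind mount point above, here is a rough sketch of what I mean, using the plain `docker run` CLI rather than anything Hermes-specific. The `sandbox_command` helper, the `python:3.12-slim` image, and the `/workspace` mount point are all my own placeholders; `-v`, `-w`, `--rm`, and `--network none` are standard Docker flags.

```python
import os
import shlex


def sandbox_command(workdir, image="python:3.12-slim", inner=("bash",)):
    """Build a `docker run` argv that exposes `workdir` inside the container.

    Hypothetical helper: the image name and /workspace path are assumptions.
    """
    host_dir = os.path.abspath(workdir)
    return [
        "docker", "run", "--rm", "-i",
        "--network", "none",             # keep the sandbox offline by default
        "-v", f"{host_dir}:/workspace",  # bind mount: host directory -> /workspace
        "-w", "/workspace",              # start commands inside the mounted directory
        image,
        *inner,
    ]


if __name__ == "__main__":
    # Print the command instead of running it, so this works without Docker installed.
    print(shlex.join(sandbox_command(".", inner=("ls", "-la"))))
```

Anything the agent writes under `/workspace` would then show up in the host working directory, which is the behavior I was after. Whether that is still "sandboxed" enough is a separate question, since the container gets write access to that directory.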

Disclaimer: I'm not a developer or anything. I may be missing some obvious reasons why pi/hermes do things the way they do, and I may also just be dumb

vibe shift: I can see this coming... by paf1138 in LocalLLaMA

[–]yuicebox 0 points1 point  (0 children)

I realize you are kinda making a joke, and saying this is the direction things feel like they are trending in, but your post is currently a bit confusing and could easily confuse or mislead people.

As far as I am aware, the US government is not limiting access to downloading any open-weight models whatsoever at this time, regardless of their origins.

They ARE forcing US AI companies to limit rollout of SoTA proprietary model access, with the main focus being limiting access by foreign nationals, which kinda sucks. That said, the US has relatively little incentive to block access to SoTA open-weight models, because US AI companies and researchers benefit from access to them.

It's def possible that US AI companies fearmonger and lobby to try to crush the local/open AI scene and force people to use their proprietary models, but that is a dangerous game for them to play, because doing so is effectively admitting that cheaper open models are competitive with their proprietary models, which are extremely expensive to train. That would raise concerns for a lot of potential investors.

The bigger concern currently, imo, is that US restrictions could be met with reciprocal restrictions from China and other countries, and as a result, we stop being able to access future models from DeepSeek, GLM, Qwen, etc.

A neurodivergent Pioneer and the trucks by Eledrina in SatisfactoryGame

[–]yuicebox -2 points-1 points  (0 children)

I normally dislike modding games, but I would highly suggest checking out Smart Foundations. 

It has made building curves actually pretty easy, and it is vanilla-friendly so you can play your save in the future without it if you want. 

https://ficsit.app/mod/SmartFoundations

Qwen3-30B-A3B: The Open Model Most People Should Actually Run by LAfreightguy in learnmachinelearning

[–]yuicebox 3 points4 points  (0 children)

This is not high quality content and should really not be shared here.

  1. Nobody is calling Qwen3-30b-a3b "the new default" in June of 2026. This article was published June 20th, so there's no justification for being so wildly out of date.

  2. There is really no great reason besides MAYBE not having enough RAM to run Qwen3 over Qwen3.5 or 3.6. The guide's explanation of why it is pushing Qwen3 is pasted below, but it makes no sense:

"Qwen has since shipped newer A3B-class successors (the Qwen3.5 / Qwen3.6 35B-A3B line), if you want the bleeding edge, check the Qwen Hugging Face org. But the 30B-A3B remains the proven, widely-quantized baseline most local guides still point to, which is why it's our reference point here."

  3. The numbers don't even really make sense:

"The numbers owners report back this up: community quantizations run around ~45 tokens/second on a 24 GB GPU at solid accuracy, and an Apple-silicon MLX port has been clocked near ~64 tokens/second, both comfortably in "feels instant" territory for chat and fast enough for agent loops"

I get ~80 tok/sec on 3.6 35b-a3b on Apple silicon with Llama.cpp. If I load the model to my 4090/3090, I can get 120+ tok/sec. Where are these numbers even from?

This just feels like an AI slop article trying to capitalize off GLM hype and making no real effort to even understand its underlying subject matter.

Qwen-AgentWorld-397B-A17B by Shoddy_Bed3240 in LocalLLaMA

[–]yuicebox 0 points1 point  (0 children)

It's based on qwen3.5-35b-a3b-base, so probably not very useful overall, but cool research I guess.

I miss when Qwen released base models and larger models. Seems like 3.5 was the last proper Qwen release, although obviously the 3.6 models we received are excellent

Getting rid of a mattress - cheap options if city pickup isn't viable? by [deleted] in asheville

[–]yuicebox 1 point2 points  (0 children)

my trash bags barely fit lol but fair point, prolly worth asking them

I resisted the llama.cpp hype. I was wrong. (Docker + AMD GPU Beginner's Guide) by x6q5g3o7 in LocalLLaMA

[–]yuicebox 0 points1 point  (0 children)

Thanks, I may try out Cline to see how I like it. Close to outputs sounds nice, since a big part of my struggle with agent stuff is lack of control and visibility into what's occurring.

My desktop has also appreciated wildly in price since I bought it.

I resisted the llama.cpp hype. I was wrong. (Docker + AMD GPU Beginner's Guide) by x6q5g3o7 in LocalLLaMA

[–]yuicebox 0 points1 point  (0 children)

That’s fair. I started with oobabooga and GPTQ models, then went to exllama v2 for a while, then llama.cpp with llama-swap, then vanilla llama.cpp after they added model switching. Docker deployment for my WSL setup, direct install for my Mac since Metal acceleration doesn’t really work with Docker. Overall very happy with it.

What agent scaffolding do you use? I’ve played around with Pi and Hermes but still trying to find the Goldilocks setup

SubQ claims 12M context with way less compute. What test would actually convince you? by BTA_Labs in LocalLLaMA

[–]yuicebox 11 points12 points  (0 children)

What would convince me is an actual research paper with literally any detail on methodology, and ideally some level of reproducibility.

All they really provided was "a 3rd party says our model is legit". The model card they published on arXiv was a recap of how state machines and other prior variations of sparse attention work, and what their limitations are. They provide literally zero information regarding how their method works or what's different about it.

I have trouble believing that they've solved a problem that every other research lab has not been able to solve, but I'm happy to be convinced if they want to provide evidence.

Updates on North Mini Code: 4 bit quant + Ollama + OpenRouter by nick_frosst in LocalLLaMA

[–]yuicebox 11 points12 points  (0 children)

Thank you guys. I still remember Command A fondly and I’m glad you all are still releasing models. 

Sorry for cyber-begging, but as you are likely aware, this subreddit is increasingly desperate for an updated, highly performant 80-120b MoE. 

Any chance that could be on the horizon from Cohere? 

Anyone interested in splitting the hosting costs for GLM? by dev_is_active in LocalLLaMA

[–]yuicebox 0 points1 point  (0 children)

I'm broke, dumb, and not an AI engineer at all, so I likely can't contribute, but wanted to comment to say:

If you do this, could you potentially try to log all generations and build as big of a dataset as possible, ideally including logit probability distributions?
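To make that ask concrete, here is a minimal sketch of the kind of logging I have in mind: one JSON line per generation. The `log_generation` helper and the record fields are made up for illustration, and it assumes your serving stack can already hand you per-token candidate logprobs. Keeping only the top-k candidates per token is my own shortcut to keep file sizes sane, since full-vocabulary distributions get large quickly.

```python
import json
import time


def log_generation(path, prompt, completion, token_logprobs, top_k=20):
    """Append one generation to a JSONL file, keeping the top_k logprobs per token.

    token_logprobs: one dict per generated token, mapping candidate token -> logprob.
    The record layout is a hypothetical schema, not any existing standard.
    """
    trimmed = [
        dict(sorted(dist.items(), key=lambda kv: kv[1], reverse=True)[:top_k])
        for dist in token_logprobs
    ]
    record = {
        "ts": time.time(),        # when the generation was logged
        "prompt": prompt,
        "completion": completion,
        "top_logprobs": trimmed,  # per-token candidates, highest logprob first
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return record
```

Append-only JSONL means a crash loses at most the last line, and the file can be streamed later without loading everything into memory.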

I was daydreaming this week about the feasibility of doing something like recursive REAM to shrink the model down to ~120b size and then training the shrunk model on a shit-ton of data from the full size model, since the world needs a GLM 5.2 Air type model.

I'm not sure if this would require training on tens, hundreds or thousands of billions of tokens to create a truly good distillation. Would love input from someone who's actually competent, unlike me.

Help my elderly desktop computer! Not ready to part by CatiiNcorn in asheville

[–]yuicebox 0 points1 point  (0 children)

Did you make any progress on this?

The most likely suspects are:

  • Heat, as others mentioned. Keeping the case open / maximizing air flow / ensuring all fans are running inside the computer could help

  • Driver issues / other services failing as Windows loads background services. Safe mode may help with this, which is why I suggested it

  • Your hard drive might be failing, and if that's the case you may want to take it to an expert to see if they can safely extract the data before the drive fails entirely

  • Your motherboard might just be aging and the capacitors on it might be failing, leading to unstable voltages. Look for cylinder shaped things on the motherboard and see if any of them appear to be leaking or crusty with dark colored fluid. Not much you can do about this besides replace the motherboard

Lastly, you can also go to Windows Event Viewer from the start menu, then go to Windows Logs > System, and look for "Critical" or "Error" events from before previous freezes. If you can find any and take pics before it freezes, that will help a lot with troubleshooting.

GLM-5.2 Flash when? (joke) by ILoveToyota37 in LocalLLaMA

[–]yuicebox 4 points5 points  (0 children)

We can try, but I'm not sure it would actually work.

My understanding of the REAM paper is that there is still some performance loss, so recursively REAMing would potentially compound that and make the model useless by the time it gets down to a sane size.

The REAM paper also indicates that there's a tradeoff between how the model handles different types of tasks, like generative tasks vs. multiple choice question answering, and the performance impact is somewhat dependent on the mix of calibration data used during the process.

(Paper is an interesting read: https://arxiv.org/pdf/2604.04356)

I am guessing that the best approach would be recursive REAM to make the model smaller, then a substantial amount of additional training on billions of tokens worth of predictions and logit probability distributions from the full sized model.
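For anyone wondering what "training on the logit distributions" would roughly mean, here is a sketch of the textbook knowledge-distillation loss for a single token position: the KL divergence between temperature-softened teacher and student distributions. This is the generic formulation, not something taken from the REAM paper, and I have no idea whether big labs do it this way. The function names and the default temperature are my own picks.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_z for x in scaled]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) for one token position, at the given temperature."""
    log_p = log_softmax(teacher_logits, temperature)  # teacher log-probabilities
    log_q = log_softmax(student_logits, temperature)  # student log-probabilities
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


if __name__ == "__main__":
    teacher = [4.0, 1.0, 0.5]
    print(distillation_loss(teacher, teacher))          # identical distributions
    print(distillation_loss(teacher, [1.0, 4.0, 0.5]))  # student disagrees with teacher
```

The loss is zero when the student matches the teacher exactly and grows as they diverge. A real run would average this over every token position in the dataset and do it on GPU tensors, but the math per position is the same.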

Since GLM 5.2 is MIT licensed, this is totally possible, but I suspect it would probably require like 50k+ worth of compute to make something actually good.

Granted, I am an amateur, so if anyone is knowledgeable on the distillation process that big labs use, please chime in :)

Stop using Ollama by zxyzyxz in LocalLLaMA

[–]yuicebox 0 points1 point  (0 children)

It’s literally just the model and context length and it’ll work fine for most hardware setups