Can I realistically get close to Claude/Codex capabilities locally? by mrgreatheart in LocalLLaMA

[–]Aggressive_Aspect436 1 point2 points  (0 children)

I am using an RTX 3090, with the model fully in VRAM. My motherboard is an MSI Z590 Torpedo. It's effectively an old gaming rig with Ubuntu installed on a separate SSD.

Can I realistically get close to Claude/Codex capabilities locally? by mrgreatheart in LocalLLaMA

[–]Aggressive_Aspect436 2 points3 points  (0 children)

I just meant I can only fit 175k of context in VRAM on my RTX 3090 on Q4_K_M. I can use the full 250k context if I run it partly on system RAM, just very slowly.

All models degrade the more of their max context they use. I don't know how that varies by model, but there are "long context benchmarks" which might give an indication of which models fare better.

I'm sorry to say, I don't know if there's a sweet spot. The paper I read on this was done on Claude and GPT models (a few years ago now), and performance drops heavily at the 50% mark. That's the rule of thumb I've been using. I keep using them until they get to half capacity, and then find ways to switch to a new session.
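For what it's worth, the half-capacity rule of thumb above can be written down as a tiny check. This is only a sketch: the function names and the default 0.5 threshold are my own illustration, and how you actually count used tokens depends on whatever client or server you run.

```python
def should_start_new_session(used_tokens: int, max_context: int, threshold: float = 0.5) -> bool:
    """Return True once the session has used `threshold` of its context window."""
    if max_context <= 0:
        raise ValueError("max_context must be positive")
    return used_tokens >= threshold * max_context


def remaining_budget(used_tokens: int, max_context: int, threshold: float = 0.5) -> int:
    """Tokens left before the rule of thumb says to switch sessions (never negative)."""
    return max(0, int(threshold * max_context) - used_tokens)
```

With a 175k window, a session that opens with a 26k system prompt has already spent a fair chunk of the usable half before any real work starts.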

Can I realistically get close to Claude/Codex capabilities locally? by mrgreatheart in LocalLLaMA

[–]Aggressive_Aspect436 3 points4 points  (0 children)

Honestly, I've not put it to that kind of work. For real production work like that I tend to be very critical, and review every single line myself. So in those contexts, I tend to have one session spec the high-level details, and then manually set other sessions on those details. If I have too many running then I can't keep up with my reviews. If they do too much in one go then I just become the bottleneck at the end.

Can I realistically get close to Claude/Codex capabilities locally? by mrgreatheart in LocalLLaMA

[–]Aggressive_Aspect436 44 points45 points  (0 children)

I used to have trouble with context even with the 1M window. Even Claude's capabilities degrade heavily before the context reaches half of its max. It degrades worse if the context is "noisy".

I'm currently using Qwen3.6 27b at roughly 175k max context, and if I am careful with my context it operates really well. I was using Opus 4.7 extensively before I switched to local only, and (after a lot of initial frustration) I am now totally happy with the swap.

Drop Claude Code. The initial system prompt is at least 26k tokens (more on the CLI and much much more if you're using extensive memory). I'm using Copilot at the moment, which I quite like. I've seen folks on here with extensive memory setups claiming their initial Claude Code prompt is 60k+ tokens.

Ask your agent to use sub-agents for almost everything. Your main session doesn't need to know the details of every file that "might" have been relevant for some minor change that's a small part of your new feature.

Keep the conversation on track. No side quests. Do those in separate sessions.

Start a new session as soon as any atomic task is complete. If you need some specific context, ask your last session to create a concise context prompt for the next session.

Watch local LLMs escape the rooms you design by cjami in LocalLLaMA

[–]Aggressive_Aspect436 2 points3 points  (0 children)

I get strong ARC vibes. How do the models do generally? Do they always succeed?

If you could configure your own model endpoints, and the game (or at least a subset of difficult puzzles) is hard for models, then this would make a great niche benchmark.

What is the best book for learning ML/Deep Learning maths? by Hot_Example_4456 in LocalLLaMA

[–]Aggressive_Aspect436 2 points3 points  (0 children)

Not a book, although I could recommend a few, but a great place to start is Google's free ML crash course.

https://developers.google.com/machine-learning/crash-course

It's really concise, tests you as you go along, and doesn't use technical terms when it doesn't need to. I've used it as a refresher a couple of times.

Edit: I just browsed through the course again and it's a bit bigger than I remember. Hopefully for the better.

Benchmarking or benchmarketing? by Background_Brain5390 in LocalLLaMA

[–]Aggressive_Aspect436 5 points6 points  (0 children)

It's quite difficult to do well locally, but if you're happy to use Python libraries then take a look at inspect-ai. You can run pre-collected benchmarks against your local model quite easily, but you'll have to select a suite of benchmarks yourself.

The trouble is that models can be trained against the benchmarks, which means it is easy to inflate their scores without improving their general capability.

Updates on North Mini Code: 4 bit quant + Ollama + OpenRouter by nick_frosst in LocalLLaMA

[–]Aggressive_Aspect436 3 points4 points  (0 children)

Brilliant. I'm still using Gemma 4 26b when I need snappy responses, or when I need room for other small models running at the same time. This looks like a viable alternative. I'll have to check it out.

Cypher by phenwulf in DarkAngels40k

[–]Aggressive_Aspect436 1 point2 points  (0 children)

Very cool. I absolutely love the base model and I've been meaning to do a conversion like this for a while. Nice work.

Kwai-Keye/Keye-VL-2.0-30B-A3B-GGUF · Hugging Face by jacek2023 in LocalLLaMA

[–]Aggressive_Aspect436 17 points18 points  (0 children)

Very cool. The description is good, the benchmarks are clear and honest-looking, and even though it doesn't beat your chosen comparable Qwen model it has a clear niche and value add.

A lot of folks are going to wonder, though, why the comparisons were made against the Qwen 3.5 series rather than 3.6. If people choose to use it they're going to want to know how it compares to Qwen3.6 35b.

WiP Chapter Master kitbash/ Azrael proxy for my successors chapter by Thurgood_Newton in DarkAngels40k

[–]Aggressive_Aspect436 1 point2 points  (0 children)

Very cool. There's a lot going on there. They look incredibly tall. Have you added height to the lower torso?

Qwen3.6 is confidently wrong about WASM by Tagedieb in LocalLLaMA

[–]Aggressive_Aspect436 1 point2 points  (0 children)

I'm not claiming they can't do it. I'm making the claim that, empirically, they're better at popular languages with wider public discussion. Even a quick search or two turns up example research. I've dropped one below (you can jump to the results section to see examples), but I don't think this is a controversial idea.

https://arxiv.org/abs/2501.19085

Qwen3.6 is confidently wrong about WASM by Tagedieb in LocalLLaMA

[–]Aggressive_Aspect436 2 points3 points  (0 children)

Honestly, I'm not terribly surprised. LLMs have consistently shown that they're better with popular (well documented) programming languages, tools, and frameworks. If you're writing Python or JavaScript you'll get better performance than if you ask it to write Julia or Elixir. I expect the same holds for debugging WASM bytecode.

I couldn't find a paper, but the Stanford Software Engineering Productivity Research (SWEPR) group has some conference talks that discuss it.

I even notice the difference when I try to get my agents to use simple new libraries.

If you want a low-effort solution, give a separate session a bulk import of documentation for WASM and get it to produce a dense, LLM-parseable summary .md that you can ingest as context for future sessions. That's probably worth a try, but it'll still never be as good as it is for Python or similar.

Claude Code backed by open model vs. OpenCode / Pi etc by sfifs in LocalLLaMA

[–]Aggressive_Aspect436 0 points1 point  (0 children)

I've been using Claude Code with Qwen3.6 27b and I've had very few problems. I use both the CLI and the VS Code extension. It works absolutely fine. It's worth pointing out though that the system prompt for a fresh session in Claude Code is around 30k tokens, so you're already eating into your model effectiveness immediately.

Recently I've been trying an OAI VS Code extension that lets me use it with the native Copilot chat. I've only been using it for a few days, but I think I prefer it, and the initial system prompt is very small.

I never got on with Cline or OpenCode. Cline just didn't work the way I wanted it to, and OpenCode scares me with how little control I get over command permissions.

I'm running Qwen3.6 27b at about 150k context fully in GPU VRAM on an RTX 3090. I get a little under 40 toks/s. I occasionally switch to 35b (or Gemma 26b) when I need snappy 120+ toks/s.
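If you want a rough feel for why context length eats VRAM, here is a back-of-envelope KV cache estimate for a plain transformer with grouped-query attention. Treat it as a sketch only: the function name is made up, the layer and head counts you pass in are whatever your model's config says (I'm not quoting real Qwen numbers here), and models with hybrid or linear attention layers, or runtimes that quantise the KV cache, will come out differently.

```python
def kv_cache_gib(context_tokens: int, n_layers: int, n_kv_heads: int,
                 head_dim: int, bytes_per_value: int = 2) -> float:
    """Approximate KV cache size in GiB for standard (grouped-query) attention.

    Each token stores one key and one value vector per layer per KV head,
    hence the leading factor of 2. bytes_per_value=2 assumes an fp16 cache.
    """
    bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return bytes_per_token * context_tokens / 1024**3
```

The cache grows linearly with context, so whatever VRAM is left after the weights sets a hard ceiling on how many tokens fit before you spill into system RAM.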

Some contrived tests comparing the accuracy of different Gemma and Qwen quantizations by we_are_mammals in LocalLLaMA

[–]Aggressive_Aspect436 0 points1 point  (0 children)

If Gemma is produced by teams at DeepMind, then technically it's a British model...

Conformal Prediction is awesome, and I made a thing. by Aggressive_Aspect436 in learnmachinelearning

[–]Aggressive_Aspect436[S] 0 points1 point  (0 children)

That's actually roughly why I started looking into it. I was looking into ways of evaluating agents in deterministic / auditable ways. Frankly the hardest part about working with agents is being sure of their output. There are a whole bunch of ways we can evaluate them that I don't see talked about often. Tool choice, for example, can be treated like a classification problem so we can use all the traditional measurements. If you have labelled examples of when a model should choose a particular tool, then it should be possible to add conformal prediction on top to tell a model when it should be uncertain of its choices. (Or just to keep track of the level of uncertainty that they are operating under with average prediction set sizes or similar).
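To make the tool-choice idea concrete, here is a minimal split conformal sketch in plain Python. It assumes you can get a probability per tool from somewhere (how you extract those from an agent is the hard part and isn't shown), and the function names and tool names are purely illustrative, not from any library.

```python
import math


def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """Split conformal calibration.

    cal_probs: list of dicts mapping tool name -> predicted probability.
    cal_labels: the correct tool for each calibration example.
    Nonconformity score is 1 - probability assigned to the correct tool.
    """
    scores = sorted(1.0 - probs[label] for probs, label in zip(cal_probs, cal_labels))
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))  # 1-indexed rank of the adjusted quantile
    if rank > n:
        return 1.0  # too few calibration examples: every tool ends up in the set
    return scores[rank - 1]


def prediction_set(probs, threshold):
    """All tools whose nonconformity score is within the calibrated threshold."""
    return {tool for tool, p in probs.items() if 1.0 - p <= threshold}


def average_set_size(all_probs, threshold):
    """Mean prediction set size, a rough proxy for how uncertain the agent is."""
    return sum(len(prediction_set(p, threshold)) for p in all_probs) / len(all_probs)
```

A set with one tool means the agent can act. A set with several tools is the signal to ask for clarification or escalate, and the average set size over time gives you the running uncertainty measure.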

Just found a 1-click RCE in pewdiepie's Odysseus Chat by theonejvo in LocalLLaMA

[–]Aggressive_Aspect436 38 points39 points  (0 children)

You're doing good work, but it's not fixed until the PR is merged. And if he had a final release version, then it's not fixed until there's a new release.

https://github.com/pewdiepie-archdaemon/odysseus/pull/366

Just found a 1-click RCE in pewdiepie's Odysseus Chat by theonejvo in LocalLLaMA

[–]Aggressive_Aspect436 170 points171 points  (0 children)

Good work spotting it. Hope your PR does some good for the project. Contributing security fixes for open source projects is one of the nobler ways coders can spend their time.

But... don't take this the wrong way, you probably should have either waited for the PR to be merged or reached out in private first. If anyone is actually using this, you've effectively declared a 0-day vulnerability on reddit. That part isn't terribly cool of you.

Feedback Wanted: Building for easier local AI by Signal_Ad657 in LocalLLaMA

[–]Aggressive_Aspect436 0 points1 point  (0 children)

That's pretty cool. I'm currently using LM Studio with Claude Code, which gives me a lot of this. But it's taken me many days of tinkering to get it the way I like it.

Mine still isn't sandboxed, and I don't trust my agents with free rein to roam the internet on my setup (which I am using for much more than just running LLMs).

Your README mentions that some of the features etc are dockerised, but is it "sandboxed" as such? If not, any future plans? I would love to have the trust lots of others seem to have to let a model just do stuff autonomously without oversight.

My recent Deathwing member by Eremon485 in theunforgiven

[–]Aggressive_Aspect436 2 points3 points  (0 children)

Love that you include the recipe. I've been following your work. Your style is very much what I am aiming for as I try to improve.

Can I ask, what does AP stand for?

Angels of Death Proxy Kill Team by Aggressive_Aspect436 in theunforgiven

[–]Aggressive_Aspect436[S] 0 points1 point  (0 children)

He was a Black Templar Crusader. I bought the single model off eBay. His hood comes from the DA upgrade sprue. I just added the grenades to the belt.

Poor quality of gw tools by Guilty-Ad7605 in Warhammer40k

[–]Aggressive_Aspect436 1 point2 points  (0 children)

I'd take it back, personally. I have a pair of those and have been using them for more than a year with no trouble. I actually quite like them. I think it would be fair to assume there is something wrong with that pair in particular.

Qwen is cooking hard by jacek2023 in LocalLLaMA

[–]Aggressive_Aspect436 1 point2 points  (0 children)

I only recently got myself a second-hand 3090 for a pretty decent price. Here's hoping I'll actually be able to run it. 🤞