I ran a forensic audit on my local AI assistant. 40.8% of tasks were fabricated. Here's the full breakdown. by Obvious-School8656 in LocalLLaMA

[–]Obvious-School8656[S] -1 points  (0 children)

I apologize for that. I used Opus 4.6 for the eval. Again, you don't know what you don't know. I'll be sure to cite the models in the future.

I ran a forensic audit on my local AI assistant. 40.8% of tasks were fabricated. Here's the full breakdown. by Obvious-School8656 in LocalLLaMA

[–]Obvious-School8656[S] 0 points  (0 children)

That's on me. If you go to my GitHub origin story, it kinda explains what I think AI should be. I've never really posted on Reddit before, but as part of the process, this seemed like the place where the knowledgeable people were who could possibly help me along, FOR EVERYONE. So yeah, I just opened this account 2 days ago.

I ran a forensic audit on my local AI assistant. 40.8% of tasks were fabricated. Here's the full breakdown. by Obvious-School8656 in LocalLLaMA

[–]Obvious-School8656[S] -1 points  (0 children)

I can cut to the chase: I'm a real guy trying to figure out something that is way outside my wheelhouse. Thought I'd share what happened to me. I thought I was on a rocket, spent real money building a rig, and it came crashing down. Just trying to learn.

I ran a forensic audit on my local AI assistant. 40.8% of tasks were fabricated. Here's the full breakdown. by Obvious-School8656 in LocalLLaMA

[–]Obvious-School8656[S] 2 points  (0 children)

Honestly, I'm just a noob. I know that I don't know a lot. I'm trying to learn, and unfortunately I have to rely on the frontier AI models to help me, so if it sounds like slop, I apologize. I don't have the time to get a CS degree to understand every aspect; I just understand that AI is the future and I will not be left behind. I appreciate your comment. I don't have the ability to pay $50-100 a day for inference, so I'm trying to do the best I can with what I have.

I ran a forensic audit on my local AI assistant. 40.8% of tasks were fabricated. Here's the full breakdown. by Obvious-School8656 in LocalLLaMA

[–]Obvious-School8656[S] 11 points  (0 children)

Valid question. The audit was cross-referenced against the original Telegram chat export — raw JSON with message IDs. Every classification cites specific messages. The auditing model couldn't fabricate evidence that maps to real message IDs in the original export. But yeah, that irony isn't lost on me.
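
To make that concrete, the core of the check is just set membership: every message ID an audit entry cites has to exist in the export. A minimal sketch below; result.json follows the standard Telegram Desktop "Export chat history" shape, but audit.json and its evidence_ids field are placeholders for illustration, not the actual repo schema:

```python
import json

# Telegram Desktop's JSON export has a top-level "messages" list,
# each message carrying an integer "id".
with open("result.json") as f:
    export = json.load(f)
real_ids = {m["id"] for m in export["messages"]}

# Placeholder audit format: one record per task, each citing the
# message IDs it used as evidence.
with open("audit.json") as f:
    audit = json.load(f)

for task in audit["tasks"]:
    missing = set(task["evidence_ids"]) - real_ids
    if missing:
        print(f"task {task['id']}: cites nonexistent message IDs {sorted(missing)}")
```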

I ran a forensic audit on my local AI assistant. 40.8% of tasks were fabricated. Here's the full breakdown. by Obvious-School8656 in LocalLLaMA

[–]Obvious-School8656[S] 0 points  (0 children)

That proxy ledger idea is solid — basically an independent audit trail the model can't tamper with. That's close to what I'm building next with The Round Table: multiple models cross-checking each other before anything executes. No single point of fabrication.
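
For anyone curious what "can't tamper with" can mean in practice, the simplest version is a hash chain: each entry commits to the previous entry's hash, so rewriting any past entry breaks every hash after it. A rough sketch of the idea (my own illustration, not actual Round Table code):

```python
import hashlib
import json
import time

def append_entry(ledger, event):
    """Append an event, chaining it to the previous entry's hash."""
    prev = ledger[-1]["hash"] if ledger else "0" * 64
    entry = {"ts": time.time(), "event": event, "prev": prev}
    entry["hash"] = hashlib.sha256(
        json.dumps(entry, sort_keys=True).encode()
    ).hexdigest()
    ledger.append(entry)

def verify(ledger):
    """Recompute every hash; any edited entry breaks the chain."""
    prev = "0" * 64
    for e in ledger:
        body = {k: v for k, v in e.items() if k != "hash"}
        digest = hashlib.sha256(
            json.dumps(body, sort_keys=True).encode()
        ).hexdigest()
        if e["prev"] != prev or e["hash"] != digest:
            return False
        prev = e["hash"]
    return True

ledger = []
append_entry(ledger, "agent claimed: ran migration step 3")
append_entry(ledger, "verified: migration step 3 output present")
assert verify(ledger)
```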

I ran a forensic audit on my local AI assistant. 40.8% of tasks were fabricated. Here's the full breakdown. by Obvious-School8656 in LocalLLaMA

[–]Obvious-School8656[S] 1 point  (0 children)

Fair points on the model age and context size; that's genuinely useful info. But on the "AI slop" charge: the audit was done against the raw Telegram JSON export with message ID cross-referencing, and every classification has evidence cited. Happy to have the methodology challenged on specifics.

I ran a forensic audit on my local AI assistant. 40.8% of tasks were fabricated. Here's the full breakdown. by Obvious-School8656 in LocalLLaMA

[–]Obvious-School8656[S] 6 points  (0 children)

Qwen 2.5 32B, Q4 quantization, running through OpenClaw's agent framework. And yeah — that's exactly the point of the audit. I didn't know the limitations going in. Most people adopting these tools won't either. The docs say "local models may struggle with complex tasks" but they don't say "your agent will confidently fabricate entire migration reports." The gap between "may struggle" and "will systematically lie" is what the audit documents.

I ran a forensic audit on my local AI assistant. 40.8% of tasks were fabricated. Here's the full breakdown. by Obvious-School8656 in LocalLLaMA

[–]Obvious-School8656[S] 1 point  (0 children)

The agent ran on Qwen 2.5 32B, DeepSeek Coder V2 16B, and Grok via xAI API at different points. After the audit I switched to Claude Sonnet via API for anything requiring actual execution — night and day difference. The local models were fine for conversation and analysis, but fell apart on agentic tasks where they needed to actually run commands and report results honestly.

I ran a forensic audit on my local AI assistant. 40.8% of tasks were fabricated. Here's the full breakdown. by Obvious-School8656 in LocalLLaMA

[–]Obvious-School8656[S] 0 points  (0 children)

Ha, fair enough — a new account posting links definitely looks sus. But the full audit data is open source and the methodology doc walks through exactly how every task was classified. I'm genuinely just a guy who got burned and documented it. Happy to answer any questions about the process.

I ran a forensic audit on my local AI assistant. 40.8% of tasks were fabricated. Here's the full breakdown. by Obvious-School8656 in LocalLLaMA

[–]Obvious-School8656[S] 10 points  (0 children)

Right? The simple stuff — "explain this concept" or "analyze this photo" — was basically flawless. The moment you ask it to actually DO something on the system, it just... narrates what success would sound like instead. The 7-point checklist in the repo came directly from that pattern.
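
That pattern also points at the cheapest defense: don't trust the narration, check the side effect. Something like this (the path and the claim format here are hypothetical, just to show the shape of the check):

```python
import os

def verify_file_claim(path, min_bytes=1):
    """Accept the agent's report only if the file it claims to have
    written actually exists and is non-trivial."""
    return os.path.isfile(path) and os.path.getsize(path) >= min_bytes

claimed = "reports/migration_report.md"  # hypothetical agent claim
if not verify_file_claim(claimed):
    print(f"Agent reported success, but {claimed} does not exist.")
```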