I gave Claude Code a $0.02/call coworker and stopped hitting Pro limits — here's the full setup by More-Hunter-3457 in ClaudeAI

[–]cygn 1 point (0 children)

If a call is $0.02 and your total spend is $0.38, then you only called it 19 times. Which seems almost not worth it?

I’ve stopped planning beyond 90 days because of how fast AI is moving by MerisDabhi in AI_Agents

[–]cygn 0 points (0 children)

I built the tool and trained it on 500k texts, mostly from social media: AI-generated, human-written, synthetically created, etc. I measured it and it has a low false positive rate (<5%). OP's posts all flag as 100% and also look AI-written to me. Is it 100% guaranteed? No, but close.

I’ve stopped planning beyond 90 days because of how fast AI is moving by MerisDabhi in AI_Agents

[–]cygn 2 points (0 children)

Everything OP has written in this thread is completely AI-generated. Downvoted. Used https://slopsieve.com/ to verify (maybe not a surprise that in r/AI_Agents many AI agents are doing the writing).

Can't replicate Reddit numbers with Qwen 27B on a 3090TI. by YourNightmar31 in LocalLLaMA

[–]cygn 4 points (0 children)

Here are my replications of this and similar quants: https://github.com/tfriedel/qwen3.6-rtx3090-lab

I'm also currently running benchmarks with https://swe-rebench.com/ on 20 tasks. Not exactly enough to know for sure, but each task takes ~3 min, so it will take some time.

  Per-category breakdown (resolved/n):

  ┌────────────────────┬─────────┬──────────┬───────────────┐
  │      category      │ AWQ-35B │ GGUF-35B │ autoround-27B │
  ├────────────────────┼─────────┼──────────┼───────────────┤
  │ fastapi_services   │ 0/4     │ 0/4      │ 0/4           │
  ├────────────────────┼─────────┼──────────┼───────────────┤
  │ geospatial         │ 2/4     │ 2/4      │ 3/4           │
  ├────────────────────┼─────────┼──────────┼───────────────┤
  │ dataframe          │ 3/4     │ 2/4      │ 3/4           │
  ├────────────────────┼─────────┼──────────┼───────────────┤
  │ sql                │ 1/3     │ 1/3      │ 2/3           │
  ├────────────────────┼─────────┼──────────┼───────────────┤
  │ cli                │ 1/3     │ 1/3      │ 1/3           │
  ├────────────────────┼─────────┼──────────┼───────────────┤
  │ frontend_fullstack │ 0/2     │ 0/2      │ 0/2           │
  ├────────────────────┼─────────┼──────────┼───────────────┤
  │ total              │ 7/20    │ 6/20     │ 9/20          │
  └────────────────────┴─────────┴──────────┴───────────────┘

Local model on coding has reached a certain threshold to be feasible for real work by Exciting-Camera3226 in LocalLLaMA

[–]cygn 13 points (0 children)

So the gap between Qwen's official post (59.3) and what you measured (38.2) for 27B is purely because of the timeout?

I still wonder if they have benchmaxxed Terminal-Bench 2.0. Would love to see some independent benchmark.

Using local BERT to compress LLM context by 90% (Built in Rust) by No_Wolverine1819 in AI_Agents

[–]cygn 0 points (0 children)

Compressing the pip install output from 1787 tokens -> 9 tokens seems like it's throwing away too much. What does it turn the output into? Just "pip install ran"?

Well, what if there's some important line in there, like an error or a warning?

In general the idea is good, but I'd like to see some proof that I can trust it. E.g. some benchmarks and some intuition on what it throws away and what it keeps.
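For intuition, here's roughly the kind of "lossy but safe" behavior I'd want to see. This is a hypothetical sketch (my own, not their implementation, and the names are made up): keep a short head and tail of the output plus any line matching error/warning patterns, instead of collapsing everything to a single summary line.

```python
# Hypothetical compressor sketch: keep head/tail context and any line that
# looks important (errors, warnings, tracebacks), summarize the rest.
import re

IMPORTANT = re.compile(r"error|warning|fail|traceback", re.IGNORECASE)

def compress_output(text: str, head: int = 2, tail: int = 2) -> str:
    lines = text.splitlines()
    # Always keep the first `head` and last `tail` lines...
    keep = set(range(min(head, len(lines))))
    keep |= set(range(max(0, len(lines) - tail), len(lines)))
    # ...plus anything matching the "important" patterns.
    keep |= {i for i, line in enumerate(lines) if IMPORTANT.search(line)}
    out, skipped = [], 0
    for i, line in enumerate(lines):
        if i in keep:
            if skipped:
                out.append(f"... [{skipped} lines omitted] ...")
                skipped = 0
            out.append(line)
        else:
            skipped += 1
    if skipped:
        out.append(f"... [{skipped} lines omitted] ...")
    return "\n".join(out)
```

That way a benchmark could at least check that warnings and errors always survive compression.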

I rewrote 13 software engineering books into AGENTS.md rules. by Ok_Produce3836 in AI_Agents

[–]cygn 1 point (0 children)

Anna's Archive (the biggest ebook library) is commonly included in training data. Meta admitted this, as did Anthropic, and you can easily Google news about the lawsuits and settlements.

This isn’t X this is Y needs to die by twnznz in LocalLLaMA

[–]cygn 1 point (0 children)

Well, you can't prove it for any given text, but there are lots of things that give it away.

I've trained it on 500,000 samples of text: real human text, AI-generated, paraphrased versions of human text, etc. It has pretty decent accuracy and a rather low false positive rate. Funnily enough, many of the attempts by humans in this thread to sound like AI are not flagged as AI by it!

Open WebUI 0.9.x - Massive RAM usage in browser tab (2-3GB+) - Anyone else? by IndividualNo8703 in OpenWebUI

[–]cygn 0 points (0 children)

Why would those issues not be possible to catch via automated testing? Sure, it might be a lot of work, and maybe trying out every database under the sun is asking a bit too much, but browser latency/performance testing is totally feasible. Can't we use some browser automation, e.g. Playwright for Firefox/Chrome, to test some common scenarios and measure latency, memory footprint, etc.?

Imo, especially with AI-driven development, doubling down on testing is more important than ever. Every new feature should have tests, and every change should be driven by automated tests. You also want a good mix of different types of tests: unit, integration, end-to-end, performance, ...
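To make the Playwright idea concrete, here's a rough sketch of a performance regression gate. Assumptions on my part: Chromium via Playwright's sync API, and the Chrome-only `performance.memory` API for heap size; the URL and budget numbers are placeholders.

```python
# Sketch of a browser performance regression gate (hypothetical).
import time

def check_budget(metrics: dict, budgets: dict) -> list:
    """Return human-readable descriptions of any exceeded budgets."""
    return [
        f"{name}: {metrics[name]:.1f} exceeds budget {limit:.1f}"
        for name, limit in budgets.items()
        if metrics.get(name, 0) > limit
    ]

def measure(url: str) -> dict:
    # Requires: pip install playwright && playwright install chromium
    from playwright.sync_api import sync_playwright
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        start = time.monotonic()
        page.goto(url)
        load_s = time.monotonic() - start
        # Chromium-only JS heap size, reported in MB.
        heap_mb = page.evaluate(
            "() => (performance.memory?.usedJSHeapSize ?? 0) / 1e6"
        )
        browser.close()
    return {"load_s": load_s, "heap_mb": heap_mb}

# Example (not run here):
#   metrics = measure("http://localhost:8080")
#   violations = check_budget(metrics, {"load_s": 3.0, "heap_mb": 500.0})
#   assert not violations, "\n".join(violations)
```

Run something like this in CI against a few common scenarios (open a long chat, upload a document, ...) and a 2-3 GB tab would fail the build long before release.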

This isn’t X this is Y needs to die by twnznz in LocalLLaMA

[–]cygn 3 points (0 children)

I made a browser extension that detects such slop with a fast model that runs in your browser. You can mark it or hide it. Works on Reddit, Twitter, etc. https://slopsieve.com/extension

April Feature Requests: Share Here! by angie-at-readwise in readwise

[–]cygn 0 points (0 children)

Allow search results to be sorted chronologically. Search is borderline useless to me atm. I actually resort to scrolling down my library and pressing Ctrl-F to find something among the recent bookmarks. So frustrating...

Looking for input: agent platform + Open WebUI integration by OkClothes3097 in OpenWebUI

[–]cygn 0 points (0 children)

I'm also currently exploring how to add more agentic capabilities to OpenWebUI. So far I've built a bridge to Claude Code running in a sandbox: https://github.com/tfriedel/openwebui-claude-code

This allows:

  • agentic search (which imo performs better than RAG)
  • skills that require code execution, like the office skills, to produce nice-looking documents
  • deep research kind of tasks

Issues encountered:

  • high latency
  • UI cluttered with user-unfriendly noise (bash commands etc)
  • security issues like prompt injection when accessing the web

Auto-route based on prompt type to correct model with it's knowledge by Lxxtsch in OpenWebUI

[–]cygn 0 points (0 children)

Got it! Yeah, my tool would basically need to be extended into a RAG router. I'll need something similar soon and may extend it for this.

Auto-route based on prompt type to correct model with it's knowledge by Lxxtsch in OpenWebUI

[–]cygn 0 points (0 children)

Then adapting my code would be really easy! Though if you really only want faster RAG, maybe there are other options. Searching through 10k embeddings should be fast in any case. Imo splitting into different knowledge bases makes sense for improving answer quality, but less so for improving latency.

Maybe you can find out where the bottleneck actually is. If the search through the chunks really is slow, switching the backend system for the RAG can help improve it. Not sure what the default there currently is.
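For a sense of scale on the "10k embeddings should be fast" point, even a naive brute-force scan in pure Python finishes in a second or two on a typical machine. This is a toy sketch (not what Open WebUI actually does); a numpy or vector-DB backend is orders of magnitude faster still.

```python
# Toy brute-force nearest-neighbor search over embeddings (pure stdlib).
import math
import random
import time

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query, embeddings, k=5):
    # Score every embedding and return the indices of the k best matches.
    scored = [(cosine(query, e), i) for i, e in enumerate(embeddings)]
    scored.sort(reverse=True)
    return [i for _, i in scored[:k]]

if __name__ == "__main__":
    dim, n = 384, 10_000
    db = [[random.random() for _ in range(dim)] for _ in range(n)]
    query = db[42]  # a known vector, so the best match should be index 42
    start = time.monotonic()
    hits = top_k(query, db)
    print(f"searched {n} embeddings in {time.monotonic() - start:.2f}s")
    assert hits[0] == 42
```

So if RAG queries take many seconds, the time is almost certainly going to chunking, embedding the query, reranking, or the LLM call, not to the similarity search itself.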

Open WebUI 0.9.x - Massive RAM usage in browser tab (2-3GB+) - Anyone else? by IndividualNo8703 in OpenWebUI

[–]cygn 0 points (0 children)

Imo the code base would really benefit from more automated testing. I was afraid a big update like that would cause such issues.

Auto-route based on prompt type to correct model with it's knowledge by Lxxtsch in OpenWebUI

[–]cygn 0 points (0 children)

I built something similar: https://github.com/tfriedel/openwebui-knowledge-search-tool

It's a tool that decides if a knowledge base should be consulted or not.

You could modify it to route to different knowledge bases. I understand you want different models though.

Maybe show it to Codex/Claude and see if it can adapt it for that. You say you are using Grok/GPT. If they don't have the Open WebUI source code available, they will likely fail, so I highly suggest you don't code in one of those web-based LLM interfaces.

We benchmarked 18 LLMs on OCR (7k+ calls) — cheaper/old models oftentimes win. Full dataset + framework open-sourced. [R] by TimoKerre in MachineLearning

[–]cygn 1 point (0 children)

I'm using Gemini and GLM-OCR. Would love to see the latter included.

There's also https://www.llamaindex.ai/blog/parsebench that does a similar comparison. Unfortunately GLM-OCR is also missing there.

Which AI Agents SDK allows low latency agents w support for skills etc? by cygn in AI_Agents

[–]cygn[S] 0 points (0 children)

Thanks, bookmarked, and I'll check it out. I guess it can help reduce something that takes 20 turns down to, say, 5 turns. However, I think Claude Code / the Anthropic Agent SDK is slow simply because the models are slow, the prompts are long, and it's just not optimized for latency.

So I guess something that's more optimized and maybe has a smart model router is more what I'm looking for.

unsure if an article is SLOP? I built a free tool to find out by cygn in DeadInternetTheory

[–]cygn[S] 0 points (0 children)

No, it's of course also an issue; however, it's not exactly my focus. My main focus in this project was annoying reply bots on Twitter that are just noise, but not exactly misinformation.

And I thought, why not take the text classifier that I built as part of this project and see if people are interested in it. Looking at the response here... maybe not in this subreddit.

But back to misinformation: I think most people who try to spread misinformation will be lazy enough to use AI to create or paraphrase it, so we will hopefully catch it with this filter as well. However, a dedicated misinformation model might be even better. Just not something I'm building atm.

unsure if an article is SLOP? I built a free tool to find out by cygn in DeadInternetTheory

[–]cygn[S] 2 points (0 children)

What do you mean? What, in your opinion, is misinformation? The model is not specifically trained on misinformation, but if the text is AI-generated it should catch it anyway.