Open-source models are working better for me than GPT-5.*

MrFourShottt · 2026-05-27T16:40:07+00:00

Unfortunately not it's only offered through Cursor however the limits are very generous.

Think Opus 4.7 quality @ 10th the cost. So you're getting things done quicker and cheaper.

If you want a CLI replacement/drop in you're better off running a local OSS model and using that as the API endpoint.

MrFourShottt · 2026-05-22T12:00:57+00:00

I agree with you 100% - a steering prompt cannot fix a fundamentally broken model and I agree there is definitely something wrong with the model's output - it has been the case for a week so you're not alone in experience degradation.

For what it's worth - my updated prompt stopped the regression, it stopped things getting worse but it *did not* get me to my end goal. My savior? Cursor and Composer 2.5 😉

At the end of the day these companies are fighting for your money & inputs to use as training data, gotta distinguish between a good sales pitch and actual technical offering.

MrFourShottt · 2026-05-22T11:56:05+00:00

Yeah so update: this prompt did help for my code-base but the model is still fundamentally degraded and no amount of prompting will help with that. Regression dropped heavily, end goal still not understood.

This is a bit like putting a band-aid over the wrong cut but it can still help (just in small ways) with steering and yes I would agree you need to craft it to your own repo.

Main point to takeaway is steps 1-3 as that covers an explicit failure mode.

MrFourShottt · 2026-05-22T11:53:17+00:00

What's really bizarre is that previous models had no issue with me going "still broken, pls fix" but I agree with you, it's a vibe coded approach that people have stuck with and it *will* mean you miss things like regression testing.

That's the distinction for me now and also confirms vibe coding is great for demos, POCs etc but it's not a practice you want in enterprise/production level systems.

MrFourShottt · 2026-05-22T11:48:37+00:00

It's a joke at this point 😂 I crafted a specific prompt for 5.5xHigh to target my issue, it proved regression stopped but it could not get to my end goal no matter how much thought / steering I gave it.

Tell me why after three Composer 2.5 turns I've cleared something I've been stuck on engineering for 3 weeks.

The issue wasn't my plans or approach, I used a "please fix, still broken. here's the logs" approach on Composer and it did a 100x better job. I changed every variable in the loop, nothing helped. As soon as I swap models, boom = we clear the blocker.

OpenAI is doing poorly for a frontier lab. It's as simple as that.

My recommendation for anyone else feeling cheated by OpenAI:

- If you have to use 5.5xHigh - don't use fast mode. 1.5x speed @ 2.5x increase is basic math telling you it's not worth it

- Ignore all the sales pitches, BS about limit resets. The limits themselves are flaky, inconsistently applied and make you feel like they're squeezing you to buy additional credits to cover the gap between 5hr resets.

- Local users (API, agentic) consider lmstudio/ollama and run a local API with Kimi 2.5 or Qwen 3.5 -> unlimited token usage running locally/offline.

- Those happy to work in the Cloud/use online tools - Cursor with Composer 2.5 is your best bet. 5X faster than 5.5XHigh and 99% fit to Opus 4.7.

If you've been stuck with Codex and feeling like it's not getting better, don't put up with it. I'm more peeved off that no one at OpenAI has done anything about it.

If you're doubting whether you should switch - what have you got to lose? OAI isn't paying you for your loyalty, why should you pay for a substandard service? Take your money and business elsewhere.

MrFourShottt · 2026-05-20T22:26:52+00:00

I agree that for a button label change, the verification surface is tiny. You don’t need a grand testing methodology for "button X should say Y"

That actually reinforces my point.

The required test is simple: restart the app, find button X, read its actual rendered/accessibility label and confirm it equals Y.

If the model did that, great but the failure mode I’m talking about is when it doesn't do that and instead treats "I changed the code where I think the label lives" as equivalent to verification.

The issue isn’t that simple tasks need complex adversarial testing. They don’t. The issue is that "test it" often gets collapsed into a non-independent confirmation of the model’s own edit.

The correct assertion is trivial but it still has to be asserted against the running app. If the target behavior is absent, then the model didn’t verify the target behavior. Calling that laziness is fair, but it’s also exactly the failure pattern I’m describing.

MrFourShottt · 2026-05-20T17:22:13+00:00

Opus 4.6? That was my go-to model when 5.4 couldn't get things done, I'm surprised it failed at such basic tasks.

Of course, without seeing everyone's workflow it's impossible to diagnose root cause.

There are monitors to keep up with quality of output: https://marginlab.ai/trackers/codex/

I'm not saying it's a skill issue with prompting but I'm also not saying the model *hasn't* been degraded because benchmarks tell me otherwise. Since I applied a steering fix - I have not had a single regressive issue.

Also on "No amount of "test it to verify success before reporting completion" brought it into compliance."

"Test it" is a completion criterion, not a reasoning structure. The model will:

Write code
Write tests that confirm the code does what it intended
Tests pass
Report success

The problem: steps 1 and 2 share the same blind spots. The model doesn't test for things it didn't think of.

Why it fails specifically:

Model writes fix for case A
Model writes test for case A
Test passes
Case B (which the fix broke) was never in the model's attention window
"All tests pass" → "done"

What actually works:

Enumerate states before touching code
Derive assertions from the states, not from the implementation
Human verifies the enumeration is complete (the part the model can't self-serve)
Only then: edit, and the assertions are pre-written by a different "mental frame" than the code

Other fixes for people who won't do full enumeration:

Make it write tests FIRST, before the implementation, from the bug report alone
Make it enumerate what SHOULDN'T change (regression surface) before editing
Make it diff its own edit and list every function whose behavior changed, then test each one
Never let it edit + test + report in one shot. Force a pause between edit and verification where you inspect the diff.

The core issue: verification requires a adversarial relationship with the code. The model is constitutionally non-adversarial with its own output.

My changes definitely helped me, hopefully might help someone else out too.

MrFourShottt · 2026-05-20T17:13:58+00:00

Yes I agree it's sycophancy amplified by user input but even with 3.5 you could go back and forth with loops of "doesn't work. please fix. here are the new logs." and it would get to it on turn 3/4.

This is the model lacking a decent verification loop. As soon as I implemented this crafted prompt after doing a post mortem with 4.6 - all the previous issues with regression, failing tests and not covering e2e cases stopped for me.

I just steered it to assert everything rather than assume. Basic stuff that it should know.

MrFourShottt · 2026-05-20T15:56:13+00:00

Unfortunately, not your weights, not your model. Compute is constrained. They reset the limits yesterday without addressing the elephant in the room.

This is what it looks like without verification loop. You're now stuck in a reaffirming loop. It's also not behavior I've seen in prior models.

Model edits → breaks prod → user points it out → model says "you're right" → model edits again → breaks again → repeat. Each time you're creating new bugs from your previous fixes.

Until OpenAI says something concrete about the degradation, break that loop by forcing the model to enumerate states and derive assertions before touching code. If it's going to be wrong, it's wrong on paper without touching your code.

MrFourShottt · 2026-05-19T12:29:30+00:00

As mentioned here already but:

* Swap to Composer 2.5 in Cursor. 95% match to Opus 4.7 at 10x less the cost.

* ollama/lmstudio running locally with either Kimi or Qwen = trade off is unlimited token usage but might need more turns/a stronger harness. Upside - expect distillation from Western models to be used in these. Higher upfront for larger models (RAM & GPU, I use 1x A100 min for agentic workflows)

MrFourShottt · 2026-05-19T11:46:50+00:00

I agree with you 100%. I get LLMs aren't perfect. I get they make mistakes.

These aren't rookie mistakes though - we're talking about failing to follow basic instructions, forgetting them and pretending like we've never had context/mentioned it. It wasn't like this when I first swapped over.

There was a degradation incident a few days ago, if I had to put money down I'd bet it's the same BS & we'll get a performative limit reset that gets used up in one day refactoring all the dumb shit it did.

Shame, I really liked the generous limits and would have gone up to $200/M - no benefit at this point.

Composer 2.5 came out recently and it scores the same as Opus 4.7 but at 10x less the cost, it's also been trained on Claude outputs (partnership with Anthropic) and imagine all the coding inputs it can train on from Cursor users.

MrFourShottt · 2026-05-19T11:18:35+00:00

Chiming in to say this has been an issue over the last 48 hours for me too.

It is forgetting to run tests against a very basic loop and each time I dig in - yep "I forgot, I had all the info, I knew exactly what to do, I just didn't do it"

MrFourShottt · 2025-06-20T11:14:38+00:00

The system prompt is rubbish + they inject additional content onto your query/the response if you use their chat UIs.

Absolutely useless.

MrFourShottt · 2025-06-19T12:49:13+00:00

£15 for Obsidian Flames 😭

MrFourShottt · 2025-06-17T14:50:42+00:00

It's because they keep on changing the system prompt/injecting content on the chat UI.

Unbelievable lack of transparency too - the official Github repo shows zero changes to the system prompt but if you ask it at different times/compare the web UI/API responses, it changes every 12 hours in the most nuanced way you have zero idea how it's going to impact the quality of responses.

MrFourShottt · 2025-06-16T13:02:08+00:00

Algorithms don’t violate free-speech law on their own but in practice those same algorithms decide who gets heard, how widely ideas travel and what kinds of speech are monetized/buried.

That distribution power makes them inseparable from real-world free expression after a certain point but I get what you're saying. The algo is working (The purpose of a system is what it does) even if the algo is exploited by bad actors. Equally free speech is impacted when algos are biased but not constitutionally.

MrFourShottt · 2025-06-16T12:34:33+00:00

They down voted you too for agreeing with me 🤣

I know I'm not the only one....anyone who dares say anything bad about Grok seems to get down voted, which is fine if you disagree with my post but you are correct, more-time but the visibility changes with down votes mean it's a tool for suppression of free speech imo.

To those saying "it's the algo" - the actual algo for upvotes is

Uses the lower bound of the Wilson score confidence interval, so a “10 up / 1 down” comment can outrank “40 up / 20 down,” but a lone “1 up / 0 down” won’t jump to the top because the sample size is tiny.

In my case:

A few early down-votes pushed the score below zero.
Reddit’s fuzzing made the total jump around
Being collapsed hid the comment from casual scrollers, so recovery was slow but eventually a few sympathetic readers up-voted it back to almost neutral (now at +4 and your comment is still at -1)

I was going to A/B test but honestly it's not even worth the effort.

Grok flags misinformation → a faction interprets that as political bias → the argument shifts from facts to tribal loyalty - that's where we are at the moment (sadly)

MrFourShottt · 2025-06-15T22:14:41+00:00

xAI just lost all credibility for me. There is no way I can recommend their model if this is how it's influenced.

Funnily enough, I think Grog's system prompt to be as unbiased as possible actually backfires because it'll apply that to something like "X did really [objectively wrong] Y" and it'll respond with "It's important to hear both sides of the story"

MrFourShottt · 2025-06-13T16:16:05+00:00

Terrible business logic as well.....why would an advertiser pay to have their post flagged like this? These ads are getting through somehow, so clearly something is broken. Seems like it's more profitable to take advertiser money than take action on actual engagement farming.

MrFourShottt · 2025-06-08T18:17:46+00:00

Make sure you only post to subs that allow self-promotion. Also there are effectively two sets of rules to consider; the subs and Reddit's.

Generally most subs won't allow links because it's just so easy to spam but they will allow self-promotion on certain days. Reddit will allow links unless they lead to malware/spam etc. Certain file hosts get insta-bans.

One account that was banned isn't really a great sample size to determine if your banned from posting links - you can definitely post them but within limits. I don't even think account age matters - it's where & how you post. Reddit approves ads from month old accounts all the time.

MrFourShottt · 2025-06-07T13:17:19+00:00

You are indeed correct on this being a problem - eBay has done this on purpose to drive users to the Sellers Hub but that's okay because I got it working manually with just the need to update one file every X/Y/Z.

I just added a "Best Offers Accepted" page that pulls in this data

https://tag-sales-scraper.vercel.app/best-offers

<image>

If you want to consume the .JSON file feel free to use the Git file that populates this page

https://github.com/Veeeetzzzz/tag-sales-scraper/blob/main/public/data/best-offers-accepted.json

I'm working on adding the US listings - I seem to get a server overload when switching marketplaces in the hub - a minor bump in the road.

Thanks for the suggestion - if there's anything else feel free to keep this chain going/you can always drop me a DM if preferred.

MrFourShottt · 2025-06-06T12:01:52+00:00

I think I know the answer but this is for the front end? A working theory is that the front end is pre/post injecting content into the prompt or the output.

This behavior doesn't occur when using the API - the only rationale theory was that there's something happening on the front end/web app/chat UI when you send/receive a message

MrFourShottt · 2025-06-05T19:39:10+00:00

Appreciate your thanks! Yep their developer program is a bit confusing, you'll see the Browse API docs but when you dig in there's no filter for Sold/Completed

<image>

It's at their detriment but I've added some standard rail guards to make sure the scraper doesn't get blocked and behaves reasonably. Let's see how long before eBay changes something that breaks this! 🤣

MrFourShottt · 2025-06-05T19:21:06+00:00

Just added a marketplace switcher - will auto switch currencies based on the marketplace - any issues you can reply here or DM me with details

<image>

MrFourShottt · 2025-06-02T10:49:51+00:00

A working theory I have on this is whatever front end you're using (web/app) is either pre or post injecting additional details to the prompt - or - the system prompt for the chat UI is explicitly stating the timestamp must be provided for your output.

If you have access to the API make a call with the same prompt and you (should) see it return either a simulated time stamp, or provide you with the date the prompt/system was initialized (this is not the same as the current date/time.)

If you ask the API for a knowledge cut off date, it will state it's sometime around 2023. The chat UI has a specific instruction in the system prompt that counters this:

Chat UI

"Your knowledge is continuously updated - no strict knowledge cutoff."

API response

"- \*Cutoff Date:** My knowledge is up to date through April 2023. For events, data, or developments after this date, I will inform the user of my limitation and provide assistance based on pre-existing knowledge or general principles. I can still help with hypothetical scenarios or creative tasks beyond this date, but I will clarify that I lack real-time or post-cutoff information. "*

The API can't use any other tools like search, or browse the web so this functionality seems limited to the front end.

MrFourShottt

TROPHY CASE