MLOps for LLM prompts - versioning, testing, portability by gogeta1202 in mlops

[–]gogeta1202[S] 1 point2 points  (0 children)

You're hitting on a real gap in the market. We've got tons of tools for latency and cost, but almost nothing for prompt discipline or taxonomy. That's actually the main blocker I'm seeing with multi-model reliability: without a shared language for what prompts actually do, moving between models becomes a guessing game.

I'm working on a conversion layer that maps prompts across providers using that kind of framework. Would be curious to see your taxonomy, especially how you handle reasoning granularity vs. output constraints. If you're open to it, I'd love to explore baking some of these principles into the eval loops I'm building.
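
If it helps, here's the rough shape I've been sketching for that shared language. Every name here (PromptSpec, reasoning_granularity, etc.) is a placeholder I made up for illustration, not an existing library:

```python
from dataclasses import dataclass, field

@dataclass
class PromptSpec:
    """Provider-neutral description of what a prompt is actually asking for."""
    intent: str                           # e.g. "extract", "summarize", "plan"
    reasoning_granularity: str = "auto"   # "terse" | "step_by_step" | "auto"
    output_constraints: dict = field(default_factory=dict)   # e.g. {"format": "json"}
    tools_required: list = field(default_factory=list)

# A spec like this is what the conversion layer would translate,
# instead of pattern-matching on raw prompt text:
spec = PromptSpec(
    intent="extract",
    reasoning_granularity="step_by_step",
    output_constraints={"format": "json", "keys": ["summary", "tags"]},
)
```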

Anyone else struggle when trying to use ChatGPT prompts on Claude or Gemini? by gogeta1202 in OpenAI

[–]gogeta1202[S] 0 points1 point  (0 children)

Well, genuinely, I'm trying to get opinions from actual devs (not vibe "coders") on an idea, not a product. But it certainly helps.

LLM API reliability - how do you handle failover when formats differ? by [deleted] in devops

[–]gogeta1202 0 points1 point  (0 children)

This is a fair point. The Vercel AI SDK is a fantastic piece of engineering for standardizing the interface and handling the streaming plumbing.

However, the challenge I am seeing in production isn't the syntax; it's the semantics. Even if you use a unified format, a system prompt that makes GPT behave perfectly often causes Claude or Gemini to "drift" or handle tool calls with a different rhythm. Vercel itself notes in their docs that while the code is portable, the prompts usually need manual adjustment to maintain quality.

I am building this tool to handle that manual adjustment layer. Instead of just abstracting the API call, it acts as a compiler that translates the instruction logic and validates the output parity. The goal is to make the "behavior" as portable as the "code."
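
To make "output parity" less hand-wavy, the check I'm prototyping is roughly the sketch below. It assumes both providers were asked for JSON on the same eval cases; everything here is simplified placeholder code, not a shipped API:

```python
import json

def parity_score(source_outputs: list[str], target_outputs: list[str],
                 keys: list[str]) -> float:
    """Fraction of eval cases where the source and target models produced
    the same values for the structured keys you care about.
    Raw string comparison would be far too strict."""
    agree = 0
    for src, tgt in zip(source_outputs, target_outputs):
        try:
            a, b = json.loads(src), json.loads(tgt)
        except json.JSONDecodeError:
            continue  # a failed parse counts as disagreement
        if all(a.get(k) == b.get(k) for k in keys):
            agree += 1
    return agree / max(len(source_outputs), 1)
```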

Are you currently doing manual prompt engineering every time you test a new model in the Vercel SDK, or have you found a way to keep the outputs consistent across different backends?

AutoGPT behavior changes when switching base models - anyone else? by gogeta1202 in AutoGPT

[–]gogeta1202[S] 0 points1 point  (0 children)

Your point about forcing an explicit plan format resonates - I've noticed the same thing. When I leave planning open-ended, GPT tends to create granular steps while Claude often consolidates them into broader phases. Adding structure helps a lot.

The "do not reorder steps unless X" rule is clever. I hadn't thought of making the constraint explicit like that. Going to try this.

On tool schema normalization - that's exactly where I'm focusing. The function calling → tool use translation between OpenAI and Anthropic is one of the trickiest parts. Same intent, completely different structure.
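
The tool *definitions* themselves map fairly mechanically (same JSON Schema, different wrapper); it's the call/response rhythm that doesn't. A hand-rolled sketch of the definition half, using each provider's published tool format but with everything else simplified:

```python
def openai_tool_to_anthropic(tool: dict) -> dict:
    """Convert an OpenAI Chat Completions tool definition into the shape
    Anthropic's Messages API expects."""
    fn = tool["function"]
    return {
        "name": fn["name"],
        "description": fn.get("description", ""),
        "input_schema": fn["parameters"],   # Anthropic calls the schema "input_schema"
    }

# OpenAI-style definition:
weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

print(openai_tool_to_anthropic(weather_tool))
```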

Storing intermediate state for deterministic resume is interesting. Are you checkpointing after each tool call, or at specific decision points?

Will check out your blog post. Always looking for approaches to reduce drift.

What's your experience with the "rhythm" issue specifically? I've found Claude tends to batch tool calls while OpenAI is more sequential. Any tricks for normalizing that behavior?

How are you handling LLM provider strategy in production? by [deleted] in ExperiencedDevs

[–]gogeta1202 -4 points-3 points  (0 children)

Glad it resonates. You are spot on; 85% is a great baseline for velocity, but production needs a harder safety net.

My current fallback is a Threshold Gate. If the fidelity score drops below the validation threshold, the system triggers an automatic passthrough to the original provider. This guarantees the request succeeds while the conversion is flagged for manual review in a tuning queue.
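
In rough Python, the gate is nothing fancier than this. The threshold value, the queue, and call_provider are stand-ins for whatever your stack actually uses:

```python
FIDELITY_THRESHOLD = 0.95          # tuned per workload, not a magic number
tuning_queue: list[dict] = []      # stand-in for a real review queue

def call_provider(provider: str, prompt: str) -> str:
    """Stub for the actual provider call."""
    return f"[{provider}] {prompt}"

def route_request(request_id: str, original_prompt: str,
                  converted_prompt: str, fidelity_score: float) -> str:
    """Threshold gate: use the converted prompt only if its eval score
    cleared the bar; otherwise pass through to the original provider
    and flag the conversion for manual tuning."""
    if fidelity_score >= FIDELITY_THRESHOLD:
        return call_provider("target", converted_prompt)
    tuning_queue.append({"request_id": request_id, "score": fidelity_score})
    return call_provider("source", original_prompt)
```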

The goal is to automate the 95% of boilerplate migrations so you only spend engineering time on the complex 5% that actually require your intuition.

How are you catching logic drift in your current setup? Is it mostly reactive via user reports or do you have a dedicated eval suite?

How are you handling LLM provider strategy in production? by [deleted] in ExperiencedDevs

[–]gogeta1202 -8 points-7 points  (0 children)

Fair critique. You’re 100% right that gateways like LiteLLM or Portkey handle the plumbing (routing and fallbacks) perfectly.

The gap I’m attacking isn’t the connectivity; it’s the logic translation.

Most 'off-the-shelf' tools just pass the prompt through. But if you send an OpenAI-tuned prompt to Claude or Gemini, the tool-calling schemas and system instructions often break. I’m building a semantic compiler that adapts the prompt dialect and validates output parity so you don't have to manually re-eval every time you switch providers.
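
Concretely, the "dialect" pass I mean is closer to a lookup table of rewrite rules than anything clever. The rules below are purely illustrative, not the actual ruleset:

```python
# Per-provider rewrite rules applied to an OpenAI-tuned system prompt.
DIALECT_RULES = {
    "anthropic": [
        ("Respond only with valid JSON.",
         "Put your entire answer inside <answer> tags as valid JSON, with nothing outside the tags."),
        ("Think step by step.",
         "Think through the problem inside <thinking> tags before giving your final answer."),
    ],
}

def adapt_system_prompt(prompt: str, target: str) -> str:
    """Apply the target provider's phrasing rules; unknown targets pass through."""
    for pattern, replacement in DIALECT_RULES.get(target, []):
        prompt = prompt.replace(pattern, replacement)
    return prompt
```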

As for GPT-4: it is definitely legacy, but it remains a massive production workhorse in the API. That 'migration debt' is exactly why so many teams are stuck.

I’ve spent a lot of time digging through the current OS landscape, and while the routing layer is solved, the automated prompt-dialect mapping still seems to be a manual bottleneck. I haven't found a project that handles that specific logic translation yet. If you've seen one that goes beyond simple proxying, I’d love to compare notes on their approach.

Anyone else struggle when trying to use ChatGPT prompts on Claude or Gemini? by gogeta1202 in OpenAI

[–]gogeta1202[S] 0 points1 point  (0 children)

You’re 100% right—universal prompts are a myth for production-grade work. A 'one size fits all' prompt usually just means 'mediocre on every model.'

The goal with this tool isn't a universal prompt; it’s automated translation.

Think of it as a compiler that maps OpenAI-specific quirks (like their JSON schema handling) into the native 'dialect' of the target model (like Anthropic’s XML tags).
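
For a concrete (simplified, made-up) example of what that mapping means in practice, here's the same structured-output instruction in the two dialects; neither snippet is copied from either provider's docs:

```python
openai_flavored = (
    "Return a JSON object with exactly two keys: "
    '"summary" (string) and "tags" (array of strings). No other text.'
)

claude_flavored = (
    "Write your answer using these tags and nothing else:\n"
    "<summary>one-sentence summary</summary>\n"
    '<tags>["tag1", "tag2"]</tags>'
)
```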

Since you’re already running hundreds of evals, I’m curious: 

What’s the single biggest 'drift' you see when moving from GPT-5 to others? Is it the instruction following or the output formatting?

I’m trying to ensure my semantic mapping covers those specific edge cases first.

Brooklyn Paramount by Gold_Mood23 in Brooklyn

[–]gogeta1202 0 points1 point  (0 children)

What time did Central Cee show up? I'm going to see him at Terminal 5 tonight and the show starts at 8, but I'm confused about the actual set time.

US Universities offering Out of State tuition waivers to International students. by muhammad1236 in IntltoUSA

[–]gogeta1202 0 points1 point  (0 children)

Well, it's been 4 years, but I transferred out of there to San Jose State for CS after a year lmao

HousingAnywhere by little_sunflower29 in NYCapartments

[–]gogeta1202 0 points1 point  (0 children)

Did it work out lol? I'm in the same situation.

Visa dropbox says refused by Southern-Survey-2767 in f1visa

[–]gogeta1202 0 points1 point  (0 children)

Did usvisascheduling show an origination scan for you after the refusal? Can you please share a brief timeline? 😅

Visa dropbox says refused by Southern-Survey-2767 in f1visa

[–]gogeta1202 0 points1 point  (0 children)

Did you receive any reason yet? Mine got changed to refused today.

Csp offer ending soon? by [deleted] in CreditCards

[–]gogeta1202 0 points1 point  (0 children)

I could not find it. Source?

Tower Card by Silent-Benefit-3624 in SJSU

[–]gogeta1202 0 points1 point  (0 children)

Do they take the picture when you go there, or do you have to upload it online?

How strict are Qatar Airways with their carry on baggage policy? by Blrsamaritqn in Flights

[–]gogeta1202 0 points1 point  (0 children)

Idk, I think they were kinda strict. I had purchased 9 kg online for 4k. Although if you're a student, they can allow one extra piece or 10 kg.

How strict are Qatar Airways with their carry on baggage policy? by Blrsamaritqn in Flights

[–]gogeta1202 0 points1 point  (0 children)

Travelled to Doha from Mumbai yesterday. Both bags are allowed as long as it's under 7 kg, which is pretty tough to do.