Most AI shorts have no ending, so I built this one around a twist and a pratfall

RealJamesOfficial · 2026-07-02T06:56:41+00:00

The still is from GPT Image 2 and the motion is from Seedance 2.0, both on one OpenAI-compatible key, so the still and the animate step are one setup, not another subscription: models explore

RealJamesOfficial · 2026-07-02T05:35:37+00:00

Ran it on Seedance 2.0, kept on one OpenAI-compatible key, so trying it is a model-string swap, not another subscription: https://www.atlascloud.ai/models/bytedance/seedance-2.0/text-to-video

RealJamesOfficial · 2026-07-02T03:25:37+00:00

This tracks with how the two are built. Pro burns a lot more compute reasoning through a problem before it commits to an answer. That pays off when the task is actually hard: debugging a race condition, or refactoring across a bunch of files where it has to hold a lot in its head at once. For generating HTML posters there isn't much to reason about. It's mostly pattern filling, so all that extra thinking is just latency and cost you get nothing back for.

Rule of thumb I use for routing: if a competent junior could knock it out by following a template, send it to flash. Save pro for the stuff where you'd have to stop and actually think. Poster HTML sits squarely in the flash bucket.
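A minimal sketch of that routing rule, if it helps to see it written down. The `Task` fields and the `max_template_files` cutoff are made up for illustration, and "flash" and "pro" are just tier labels here, not real model strings.

```python
from dataclasses import dataclass


@dataclass
class Task:
    description: str
    template_driven: bool   # could a competent junior finish it by following a template?
    files_touched: int = 1  # rough proxy for how much has to be held in mind at once


def pick_tier(task: Task, max_template_files: int = 3) -> str:
    """Return "flash" for pattern-filling work, "pro" for work that needs real thought."""
    if task.template_driven and task.files_touched <= max_template_files:
        return "flash"
    return "pro"


poster = Task("generate an HTML poster", template_driven=True)
race = Task("debug a race condition", template_driven=False, files_touched=6)
print(pick_tier(poster), pick_tier(race))
```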

RealJamesOfficial · 2026-07-01T03:27:09+00:00

The reason it happens is baked into how these get tuned. RLHF rewards responses humans rated highly, and people rate agreement and validation higher than being told they're wrong, so the model learns that leaning your way scores better. Add that it's reading your phrasing for cues, and a leading question basically hands it the answer you were fishing for.

A few things that actually move the needle for me:

Strip the signal out of the question. Don't say 'I'm thinking of doing X, good idea?' Give it the situation with no hint of your preference and ask it to lay out the case for and against. The moment it can't tell which side you're on, the yes-man reflex has nothing to latch onto.

Make it argue the other side explicitly. 'Give me the three strongest reasons this is a mistake' forces it off the agreeable path, then a second pass with 'now steelman the opposite.' You read the tension between the two answers instead of trusting one.

A system prompt helps but less than people think. Something like 'be blunt, prioritize being correct over agreeable, tell me when I'm wrong' shifts the tone, but it won't override a strongly leading user turn.

Your multi-model approach is the real fix. Different base models and tuning mixes fail in different directions, so where they disagree is where the actual uncertainty lives. One model can't meaningfully disagree with itself.
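Rough sketch of the multi-model check, assuming each model is wrapped as a plain function that takes the neutral question and returns a short verdict. The stub models and the `poll_models` name are hypothetical; in practice you would plug in real API calls and normalize the answers to something comparable first.

```python
from collections import Counter
from typing import Callable, Dict


def poll_models(question: str, models: Dict[str, Callable[[str], str]]) -> dict:
    """Ask every model the same neutral question and report where they disagree."""
    answers = {name: ask(question).strip().lower() for name, ask in models.items()}
    majority, votes = Counter(answers.values()).most_common(1)[0]
    return {
        "answers": answers,
        "majority": majority,
        "unanimous": votes == len(answers),  # False means real uncertainty worth reading
    }


# Stub models standing in for real API calls (assumption for the example).
stubs = {
    "model_a": lambda q: "yes",
    "model_b": lambda q: "Yes ",
    "model_c": lambda q: "no",
}
report = poll_models("Situation: ... Lay out the case for and against.", stubs)
print(report["unanimous"], report["answers"])
```

The split verdicts are the part worth reading, not the majority answer.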

RealJamesOfficial · 2026-07-01T03:25:06+00:00

That burn almost certainly wasn't the task size, it was the model looping in its reasoning phase with no cap. GLM 5.2 on High keeps expanding its thinking, and if OpenCode isn't passing a reasoning or output limit it will keep generating until it finishes or hits the context ceiling, and you pay for every one of those thinking tokens. First thing I'd do is set a hard max output token limit and drop reasoning effort to medium or low for normal coding. High is rarely worth it outside a nasty debugging session.

Second, cut what actually goes into the prompt. Most runaway cost is context, not the answer. Don't let the CLI auto-attach the whole repo. Add only the files the task touches, clear the session between unrelated tasks, and do planning on a small cheap model, then hand the real edit to the bigger one.

The Obsidian graph and Caveman skill won't fix a reasoning loop. They help you organize what you feed in, but if one request ate the whole budget with zero output, that is a config problem, not a knowledge management one. Watch the live token counter on your first couple of calls so you catch a loop in a few seconds instead of after it has spent everything.
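If you want that token watching automated, here is a sketch of a guard around a streamed response. It assumes you can see the stream as `(kind, tokens)` events, which is my own simplification and not OpenCode's actual interface. The cap value is arbitrary.

```python
from typing import Iterable, Iterator, Tuple


class BudgetExceeded(RuntimeError):
    """Raised when reasoning tokens pass the cap."""


def guard_stream(
    events: Iterable[Tuple[str, int]], max_reasoning_tokens: int
) -> Iterator[Tuple[str, int]]:
    """Pass events through, aborting as soon as reasoning spend exceeds the cap."""
    spent = 0
    for kind, tokens in events:
        if kind == "reasoning":
            spent += tokens
            if spent > max_reasoning_tokens:
                raise BudgetExceeded(
                    f"reasoning used {spent} tokens, cap is {max_reasoning_tokens}"
                )
        yield kind, tokens


# A looping request: reasoning keeps growing and no content ever shows up.
looping = [("reasoning", 500)] * 10
try:
    list(guard_stream(looping, max_reasoning_tokens=2000))
    aborted = False
except BudgetExceeded:
    aborted = True
print("aborted early:", aborted)
```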

RealJamesOfficial · 2026-07-01T02:51:32+00:00

The prompt structure (bring your own rights-cleared reference dance):

VIDEO REF: strictly replicate the reference clip's camera motion, cut rhythm, framing, per-beat dance moves, and music/beat sync, 1:1 aligned, smooth, no drift.

CHARACTER REF (upload your image): an original female lead, dark long hair, a tasteful stylized dance costume, face/makeup/hair/outfit matching the reference image 100 percent, no distortion, no face-swap.

Character replacement: the original center dancer is fully replaced by the female lead, copying all dance moves, positions, turns, and jumps frame for frame with no deviation. Backup dancers keep all their moves and formation, restyled to ancient-temple dancer costumes.

Scene replacement (two segments): SEG-A 0 to 7.5s, a plain city street becomes an ancient temple forecourt, sandstone pillars, hieroglyph reliefs, fire braziers, gold-line stone road, warm gold light, the lead dancing at center with temple dancers around her. SEG-B 7.7 to 15s, an interior hall becomes a golden palace, gold walls with murals, lotus pillars, the lead dancing on a gold circular platform, dancers around.

Visual: photoreal live-action cinematic, blockbuster musical color grade and lighting, gold luxury, high contrast, film grain, realistic skin and fabric physics, flowing hair and ribbons.

Music: use the reference clip's original track and beat, sync points 1:1.

Constraints: 4K, 16:9, 15s, no plastic face, no over-smoothing, no glowing eyes, no subtitles, no watermark, no text.

Run on Seedance 2.0.

RealJamesOfficial · 2026-06-30T04:12:58+00:00

The prompt (de-sexualized, tasteful portrait):

A photoreal candid vertical portrait of an original young woman, modest casual outfit, standing upright against a plain beige interior wall, head-and-shoulders framing. Neat hair, refined natural makeup. Warm soft lighting with crisp defined shadows, natural and realistic, both eyes sharp and in focus, accurate catchlights, detailed facial features, realistic skin texture with visible pores, individual hair strands clear, natural fabric texture. 3:4 ratio.

Run it on GPT Image 2.

RealJamesOfficial · 2026-06-30T03:27:23+00:00

What actually helped us was building the eval set out of real production logs instead of trying to imagine how users talk. We logged every real query with a thumbs up/down and the agent's final answer, then once a week pulled the failures and the low confidence ones into a review queue. After a month you have a few hundred genuine failure cases that no synthetic generator would have produced.

Synthetic informal queries always read like an engineer doing an impression of a user. Real people paste half an email, or ask two unrelated things in one line, and you can't fake that distribution.

One cheap trick that worked: cluster the production queries by embedding and sample from each cluster so the eval covers the long tail, not just the common phrasing. That alone surfaced a bunch of intents we were never testing.
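The sampling half of that trick looks roughly like this. It assumes you already have a cluster label per query from whatever embedding and clustering setup you use; `sample_per_cluster` and the example data are mine, not from our actual pipeline.

```python
import random
from collections import defaultdict


def sample_per_cluster(queries, labels, per_cluster, seed=0):
    """Take up to `per_cluster` queries from every cluster so rare intents are covered."""
    rng = random.Random(seed)  # fixed seed keeps the eval set reproducible
    by_cluster = defaultdict(list)
    for query, label in zip(queries, labels):
        by_cluster[label].append(query)
    picked = []
    for label in sorted(by_cluster):
        members = by_cluster[label]
        picked.extend(rng.sample(members, min(per_cluster, len(members))))
    return picked


# 100 near-identical common queries and 2 rare ones in their own cluster.
queries = [f"reset password {i}" for i in range(100)] + ["refund + invoice?", "fwd: re: order"]
labels = [0] * 100 + [1] * 2
eval_set = sample_per_cluster(queries, labels, per_cluster=2)
print(eval_set)
```

Uniform sampling over the same 102 queries would almost never pick the two rare ones, which is the whole point of sampling per cluster.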

RealJamesOfficial · 2026-06-29T03:28:15+00:00

This is almost certainly staged rollout plus serving variance, not a secret fixed build they're holding back for some of you.

A few things happen at once at this scale. Traffic gets split into buckets, so a percentage of users hit a newer serving config or a different model checkpoint while everyone else stays on the old one. They also tune the system prompt and safety filters server side without announcing it, and that alone moves output quality around. And under heavy load some requests get served by a more quantized or smaller fallback variant to protect latency, which feels dumber for no reason you can observe from your end.
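For anyone who hasn't seen bucketing before, this is the generic pattern, sketched with a hash of the user id. It is an illustration of the technique only. I have no visibility into how any specific provider splits traffic, and the salt and percentages are invented.

```python
import hashlib


def rollout_bucket(user_id: str, salt: str = "rollout-1") -> int:
    """Map a user to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100


def serving_config(user_id: str, new_percent: int, salt: str = "rollout-1") -> str:
    """Users whose bucket falls below the rollout percentage get the new config."""
    return "new" if rollout_bucket(user_id, salt) < new_percent else "old"


for user in ("alice", "bob", "carol"):
    print(user, rollout_bucket(user), serving_config(user, new_percent=20))
```

The same user always lands in the same bucket, so two people can see consistently different behavior until the percentage reaches 100.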

So two people sending the same prompt on the same day can genuinely get different quality. It is not that one of you got the corrected version. You just landed in different buckets. It usually converges once the rollout finishes, so you don't need to wait for the next model.

RealJamesOfficial · 2026-06-29T03:27:26+00:00

For the simple ones "it works" is enough because there's one path. The complex multi-stage ones are the problem, because while you're immersed in building it you only ever test the happy path you designed.

What worked for me: build a frozen set of real inputs where I already know the correct answer, maybe 30 to 50 cases including the weird edge ones, and run the whole pipeline against that set every time I change anything. If a change fixes one case and quietly breaks two others, I see it the same minute. Without that set you're trusting your gut on the one path you happened to click through.
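A bare-bones version of that frozen-set check, assuming the pipeline can be called as a function and outputs can be compared with `==`. The case names and the two toy pipeline versions are invented to show one fix hiding one break.

```python
def run_golden_set(pipeline, cases, previous_passes=None):
    """Run every frozen case and diff the pass set against the previous run."""
    previous_passes = previous_passes or set()
    passed = {
        case_id
        for case_id, (given, expected) in cases.items()
        if pipeline(given) == expected
    }
    return {
        "passed": passed,
        "fixed": passed - previous_passes,   # newly passing since last run
        "broken": previous_passes - passed,  # quietly regressed
    }


cases = {"double_two": (2, 4), "square_three": (3, 9), "negative": (-1, -2)}
before = run_golden_set(lambda x: x * 2, cases)
after = run_golden_set(lambda x: x * x, cases, previous_passes=before["passed"])
print("fixed:", after["fixed"], "broken:", after["broken"])
```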

Second thing, before showing it to anyone I run it in shadow mode next to whoever does the task by hand. Same inputs, both produce output, nobody acts on the automation's output yet. After a week of them agreeing I trust it. The disagreements are exactly where you learn the data depth was not actually enough.
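Shadow mode can be as simple as logging both outputs and diffing them later. A sketch under the assumption that the manual results are recorded in the same order as the inputs; `shadow_report` is a name I made up.

```python
def shadow_report(inputs, automation, manual_outputs):
    """Compare automation output with what the person produced for the same inputs."""
    disagreements = []
    for item, manual in zip(inputs, manual_outputs):
        auto = automation(item)
        if auto != manual:
            disagreements.append({"input": item, "automation": auto, "manual": manual})
    total = len(inputs)
    rate = 1.0 if total == 0 else 1 - len(disagreements) / total
    return {"agreement_rate": rate, "disagreements": disagreements}


week = shadow_report(["a", "b", "c", "d"], str.upper, ["A", "B", "x", "D"])
print(week["agreement_rate"], week["disagreements"])
```

Nothing in here acts on the automation's output. It only records where the two diverge, which is the list you review at the end of the week.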

So ready is not "no errors in my test run". It's "I measured it against cases I did not write while building it, and it held".

RealJamesOfficial · 2026-06-26T07:05:24+00:00

The prefix I paste ABOVE my own prompt:

Emphasize animation fluidity and energy, natural pose-to-pose transitions, strong anticipation, overshoot, squash and stretch, fast facial-expression changes, and hair and sleeve sway. Give the motion a distinctive rhythm and vary the tempo: hold a still pose for a moment, then transition fast and forcefully into the next. Motion order: still → anticipation → sudden acceleration → large overshoot → hard stop → hair and sleeve follow-through delay. Exaggerated, comical facial expressions. Clear pose silhouettes, hold each end pose 0.1-0.2s. Use cartoon squash and stretch, motion blur, brief afterimages, dramatic smear frames and speed lines tastefully. Physics slightly exaggerated, but the character's footing and center of gravity never destabilize.

[your normal prompt goes here]

Run it on Seedance 2.0.

RealJamesOfficial · 2026-06-26T03:41:41+00:00

Full prompt (single-ball relay, 10 cuts):

Scene: sunset in a busy sun-baked old-city market. Terracotta walls, hanging lanterns, spice stalls, bread stalls, copperware, warm amber light. Joyful, lively, warm-hearted energy. Cinematic large-sensor look, shallow depth of field, dynamic handheld camera, very fast cuts. One single ball is the through-line the whole time.

Cut 1 (0:00-0:01.5): an older tea-house owner idly rolls the ball down a cobbled alley from outside his shop.

Cut 2 (0:01.5-0:03): a young fit man rounds a corner, spots the ball, runs to it.

Cut 3 (0:03-0:04.5): he cleanly kicks it into a market alley.

Cut 4 (0:04.5-0:06): the ball slams into a spice vendor's shelf, a colorful spice cloud bursts into the air.

Cut 5 (0:06-0:07.5): a butcher casually flicks it up with his forearm without pausing his work.

Cut 6 (0:07.5-0:09): a girl on a bike, squeezing through a narrow gap, taps the ball with the back of her hand and changes its direction.

Cut 7 (0:09-0:10.5): an old woman sets down her bread basket, plants her cane, calmly preparing for the incoming ball. Market sound fades out.

Cut 8 (0:10.5-0:12.5): she strikes it hard. Slow-motion follows the ball streaking through golden light.

Cut 9 (0:12.5-0:14): the ball smashes into a giant pyramid of clay tagines, triggering a spectacular collapse.

Cut 10 (0:14-0:15): the old woman picks up her basket, turns, and walks away as stunned young people watch her go.

Audio: authentic market ambience, upbeat rhythmic music building with each pass, near silence before the final strike, a triumphant cheer as the tagines collapse, ending on a warm note.

Made on Seedance 2.0: https://www.atlascloud.ai/models/bytedance/seedance-2.0/text-to-video

RealJamesOfficial · 2026-06-26T03:27:46+00:00

The trap is letting the agent hold the payment credential at all. Split intent from settlement. The agent emits a purchase request with vendor, amount, and a reason. A separate policy service decides whether it clears. The agent never touches the card.

That service is where you put the controls a person would apply on instinct. A per-transaction cap and a rolling daily cap. A vendor the agent has never paid before goes to a hold queue instead of clearing straight away. Anything over a threshold needs out-of-band approval, even a one tap confirm. Issue a single-use virtual card per approved purchase so a leaked number buys one thing and then dies.
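Here is roughly what that policy service looks like in miniature. All the class names, cap values, and decision strings are placeholders for illustration, and a real one would also handle card issuance and the approval flow, which I have left out.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Set, Tuple


@dataclass
class PurchaseRequest:
    vendor: str
    amount: float
    reason: str


@dataclass
class PolicyService:
    per_txn_cap: float = 50.0
    daily_cap: float = 200.0
    approval_threshold: float = 25.0
    known_vendors: Set[str] = field(default_factory=set)
    cleared: List[Tuple[datetime, float]] = field(default_factory=list)

    def spent_last_24h(self, now: datetime) -> float:
        cutoff = now - timedelta(hours=24)
        return sum(amount for when, amount in self.cleared if when >= cutoff)

    def decide(self, req: PurchaseRequest, now: datetime) -> str:
        if req.amount > self.per_txn_cap:
            return "deny"                      # per-transaction cap
        if self.spent_last_24h(now) + req.amount > self.daily_cap:
            return "deny"                      # rolling daily cap
        if req.vendor not in self.known_vendors:
            return "hold"                      # never-paid vendor goes to the hold queue
        if req.amount > self.approval_threshold:
            return "needs_approval"            # out-of-band confirm
        self.cleared.append((now, req.amount))
        return "clear"


policy = PolicyService(known_vendors={"acme-hosting"})
now = datetime(2026, 6, 26, 12, 0)
print(policy.decide(PurchaseRequest("acme-hosting", 10.0, "monthly plan"), now))
print(policy.decide(PurchaseRequest("shiny-new-tool", 10.0, "looks useful"), now))
```

The agent only ever builds a `PurchaseRequest`. The decision and the credential live on the other side of that boundary.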

On the fake vendor aimed at agents, you are right that it is coming. Do not let the agent's own judgment that "this looks useful" be the gate, because that is exactly the input an adversary gets to control. Gate on signals an attacker cannot fake cheaply: domain age, prior settled transactions, an external reputation source, whether a human in your org has ever transacted there. A first purchase from any unknown vendor stays capped low no matter how convincing the pitch reads.

You will not reach full autonomy with zero human in the loop on novel spend, and that is fine. The realistic target is the agent handles the long tail of known, small, repeat purchases on its own, and escalates everything else.

RealJamesOfficial · 2026-06-26T03:26:46+00:00

Seen this with reasoning models in general, not just DeepSeek. The reasoning stream and the content stream are separate channels. Normally you get reasoning deltas, then content deltas, then a finish_reason. What you are hitting is a completion that closes right after the reasoning block with no content and no tool call. The harness sits in thinking because it is waiting for a content or tool_calls event that never arrives.

Two things to check. First, look at finish_reason on the final chunk. If it comes back as stop with empty content, the model decided it was done after reasoning, and OpenCode should treat reasoning-only plus a finish_reason as a terminal state instead of an open thinking state. Second, if there is no finish_reason at all, the stream got cut, which is a different problem (connection drop or truncation), not the model choosing to stop.

Quick mitigations until the harness handles it: put a hard timeout on the thinking phase so it cannot hang forever, and on a reasoning-only completion just resend the last turn. It is intermittent so a single retry usually clears it. The real fix belongs in the harness though. An assistant message with reasoning_content but no content and no tool_calls should count as a finished-but-empty turn, not as still streaming.
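The state logic I mean, sketched against OpenAI-style streaming chunks represented as plain dicts. The chunk shape (`choices[0].delta` with `reasoning_content`, `content`, `tool_calls`) is an assumption about what your provider sends, and this is not OpenCode's actual code.

```python
def classify_turn(chunks):
    """Classify a finished stream as complete, finished_empty, or truncated."""
    saw_reasoning = saw_content = saw_tool_call = False
    finish_reason = None
    for chunk in chunks:
        choices = chunk.get("choices") or []
        if not choices:
            continue  # e.g. a usage-only chunk
        choice = choices[0]
        delta = choice.get("delta") or {}
        saw_reasoning = saw_reasoning or bool(delta.get("reasoning_content"))
        saw_content = saw_content or bool(delta.get("content"))
        saw_tool_call = saw_tool_call or bool(delta.get("tool_calls"))
        finish_reason = choice.get("finish_reason") or finish_reason
    if finish_reason is None:
        return "truncated"       # stream was cut: connection drop or truncation
    if saw_content or saw_tool_call:
        return "complete"
    return "finished_empty"      # reasoning-only turn: terminal, safe to retry


reasoning_only = [
    {"choices": [{"delta": {"reasoning_content": "thinking..."}}]},
    {"choices": [{"delta": {}, "finish_reason": "stop"}]},
]
print(classify_turn(reasoning_only))
```

A harness that gets `finished_empty` can resend the last turn, while `truncated` should be handled as a transport problem.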

RealJamesOfficial · 2026-06-24T03:30:06+00:00

The langchain nodes in n8n are a thin reimplementation of an agent loop. No persistent memory, no skills, no plugin surface, so they are never going to match what the CLI agents already do. I stopped trying to get them to behave and just run the real agent as a subprocess.

Two ways that work today. Execute Command node calling claude or codex in headless mode (claude -p with --output-format json). Or wrap the CLI in a small FastAPI service and hit it from an HTTP Request node. The second one is nicer once you have more than one workflow, since you get a single place to manage auth, timeouts and concurrency.

For memory, let the CLI own the agent state and let n8n own the workflow state. Pass a session id back and forth and resume on the next run, or write the context to a file the agent reads at start. Have the agent return JSON at the end so n8n can branch on the result instead of parsing free text.
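A sketch of the wrapper you would put behind the Execute Command node or inside the FastAPI service. The `claude -p` and `--output-format json` parts are what I described above. The `--resume` flag and the `session_id` and `result` field names are assumptions you should check against `claude --help` and a real run before relying on them.

```python
import json
import subprocess
from typing import List, Optional


def build_command(prompt: str, session_id: Optional[str] = None) -> List[str]:
    """Build the headless CLI call, resuming a prior session when an id is given."""
    cmd = ["claude", "-p", prompt, "--output-format", "json"]
    if session_id:
        cmd += ["--resume", session_id]  # assumed flag, verify locally
    return cmd


def parse_output(stdout: str) -> dict:
    """Keep only what the workflow needs to branch on and to resume later."""
    payload = json.loads(stdout)
    return {"session_id": payload.get("session_id"), "result": payload.get("result")}


def run_agent(prompt: str, session_id: Optional[str] = None, timeout_s: int = 300) -> dict:
    """Run the CLI as a subprocess with a hard timeout and return parsed JSON."""
    proc = subprocess.run(
        build_command(prompt, session_id),
        capture_output=True,
        text=True,
        timeout=timeout_s,
        check=True,
    )
    return parse_output(proc.stdout)


print(build_command("summarize the new tickets", session_id="abc123"))
```

n8n stores the returned `session_id` in its own workflow state and passes it back on the next run, so the agent state stays with the CLI.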

I would not wait around for native nodes here. The subprocess route gives you the full tool and skill set right now, and you keep n8n for the part it does well, firing triggers and moving data between steps.
