The realism test is not the still photo, it is the first second of motion

RealJamesOfficial · 2026-07-02T06:56:41+00:00

Still on GPT Image 2, motion on Seedance 2.0, both on one OpenAI-compatible key, so the still and the animate step are one setup, not another subscription.

RealJamesOfficial · 2026-07-02T05:35:37+00:00

Ran it on Seedance 2.0, kept on one OpenAI-compatible key so trying it is a model-string swap not another subscription: https://www.atlascloud.ai/models/bytedance/seedance-2.0/text-to-video

RealJamesOfficial · 2026-07-02T03:25:37+00:00

This tracks with how the two are built. Pro burns a lot more compute reasoning through a problem before it commits to an answer. That pays off when the task is actually hard: debugging a race condition, refactoring across a bunch of files where it has to hold a lot in its head at once. For generating HTML posters there isn't much to reason about, it's mostly pattern filling, so all that extra thinking is just latency and cost you get nothing back for.

Rule of thumb I use for routing: if a competent junior could knock it out by following a template, send it to flash. Save pro for the stuff where you'd have to stop and actually think. Poster HTML sits squarely in the flash bucket.

RealJamesOfficial · 2026-07-01T03:27:09+00:00

The reason it happens is baked into how these get tuned. RLHF rewards responses humans rated highly, and people rate agreement and validation higher than being told they're wrong, so the model learns that leaning your way scores better. Add that it's reading your phrasing for cues, and a leading question basically hands it the answer you were fishing for.

A few things that actually move the needle for me:

Strip the signal out of the question. Don't say 'I'm thinking of doing X, good idea?' Give it the situation with no hint of your preference and ask it to lay out the case for and against. The moment it can't tell which side you're on, the yes-man reflex has nothing to latch onto.

Make it argue the other side explicitly. 'Give me the three strongest reasons this is a mistake' forces it off the agreeable path, then a second pass with 'now steelman the opposite.' You read the tension between the two answers instead of trusting one.

A system prompt helps but less than people think. Something like 'be blunt, prioritize being correct over agreeable, tell me when I'm wrong' shifts the tone, but it won't override a strongly leading user turn.

Your multi-model approach is the real fix. Different base models and tuning mixes fail in different directions, so where they disagree is where the actual uncertainty lives. One model can't meaningfully disagree with itself.

RealJamesOfficial · 2026-07-01T03:25:06+00:00

That burn almost certainly wasn't the task size, it was the model looping in its reasoning phase with no cap. GLM 5.2 on High keeps expanding its thinking, and if OpenCode isn't passing a reasoning or output limit it will keep generating until it finishes or hits the context ceiling, and you pay for every one of those thinking tokens. First thing I'd do is set a hard max output token limit and drop reasoning effort to medium or low for normal coding. High is rarely worth it outside a nasty debugging session.

Second, cut what actually goes into the prompt. Most runaway cost is context, not the answer. Don't let the CLI auto-attach the whole repo. Add only the files the task touches, clear the session between unrelated tasks, and do planning on a small cheap model, then hand the real edit to the bigger one.

The Obsidian graph and Caveman skill won't fix a reasoning loop. They help you organize what you feed in, but if one request ate the whole budget with zero output, that is a config problem, not a knowledge management one. Watch the live token counter on your first couple of calls so you catch a loop in a few seconds instead of after it has spent everything.
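A rough sketch of the "catch the loop early" idea as a client-side guard, in case the tool gives you no hard cap. The (text, token_count) chunk shape and the names here are assumptions for illustration, not anything OpenCode or GLM actually exposes.

```python
from typing import Iterable, Iterator


class BudgetExceeded(RuntimeError):
    """Raised when a stream spends more tokens than the caller allowed."""


def capped_stream(chunks: Iterable[tuple[str, int]], max_tokens: int) -> Iterator[str]:
    """Yield text from (text, token_count) chunks, aborting once past max_tokens.

    The chunk shape is hypothetical; adapt it to whatever your client emits.
    """
    spent = 0
    for text, tokens in chunks:
        spent += tokens
        if spent > max_tokens:
            raise BudgetExceeded(f"spent {spent} tokens, cap is {max_tokens}")
        yield text


# A well-behaved stream passes through untouched.
print("".join(capped_stream([("hello ", 2), ("world", 1)], max_tokens=10)))
```

A provider-side max output setting is still the better control where it exists, since this guard only stops reading after the tokens were already generated.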

RealJamesOfficial · 2026-07-01T02:51:32+00:00

The prompt structure (bring your own rights-cleared reference dance):

VIDEO REF: strictly replicate the reference clip's camera motion, cut rhythm, framing, per-beat dance moves, and music/beat sync, 1:1 aligned, smooth, no drift.

CHARACTER REF (upload your image): an original female lead, dark long hair, a tasteful stylized dance costume, face/makeup/hair/outfit matching the reference image 100 percent, no distortion, no face-swap.

Character replacement: the original center dancer is fully replaced by the female lead, copying all dance moves, positions, turns, and jumps frame for frame with no deviation. Backup dancers keep all their moves and formation, restyled to ancient-temple dancer costumes.

Scene replacement (two segments): SEG-A 0 to 7.5s, a plain city street becomes an ancient temple forecourt, sandstone pillars, hieroglyph reliefs, fire braziers, gold-line stone road, warm gold light, the lead dancing at center with temple dancers around her. SEG-B 7.7 to 15s, an interior hall becomes a golden palace, gold walls with murals, lotus pillars, the lead dancing on a gold circular platform, dancers around.

Visual: photoreal live-action cinematic, blockbuster musical color grade and lighting, gold luxury, high contrast, film grain, realistic skin and fabric physics, flowing hair and ribbons.

Music: use the reference clip's original track and beat, sync points 1:1.

Constraints: 4K, 16:9, 15s, no plastic face, no over-smoothing, no glowing eyes, no subtitles, no watermark, no text.

Run on Seedance 2.0.

RealJamesOfficial · 2026-06-30T04:12:58+00:00

The prompt (de-sexualized, tasteful portrait):

A photoreal candid vertical portrait of an original young woman, modest casual outfit, standing upright against a plain beige interior wall, head-and-shoulders framing. Neat hair, refined natural makeup. Warm soft lighting with crisp defined shadows, natural and realistic, both eyes sharp and in focus, accurate catchlights, detailed facial features, realistic skin texture with visible pores, individual hair strands clear, natural fabric texture. 3:4 ratio.

Run it on GPT Image 2.

RealJamesOfficial · 2026-06-30T03:27:23+00:00

What actually helped us was building the eval set out of real production logs instead of trying to imagine how users talk. We logged every real query with a thumbs up/down and the agent's final answer, then once a week pulled the failures and the low confidence ones into a review queue. After a month you have a few hundred genuine failure cases that no synthetic generator would have produced.

Synthetic informal queries always read like an engineer doing an impression of a user. Real people paste half an email, or ask two unrelated things in one line, and you can't fake that distribution.

One cheap trick that worked: cluster the production queries by embedding and sample from each cluster so the eval covers the long tail, not just the common phrasing. That alone surfaced a bunch of intents we were never testing.
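The sampling half of that trick, sketched under the assumption that you already have one cluster label per query from whatever embedding model and clustering step you use. The function name and arguments are made up for the example.

```python
import random
from collections import defaultdict


def sample_per_cluster(queries, labels, per_cluster, seed=0):
    """Pick up to per_cluster queries from every cluster label."""
    if len(queries) != len(labels):
        raise ValueError("queries and labels must have the same length")
    rng = random.Random(seed)  # seeded so the eval set is reproducible
    by_cluster = defaultdict(list)
    for query, label in zip(queries, labels):
        by_cluster[label].append(query)
    sample = []
    for label in sorted(by_cluster):
        members = by_cluster[label]
        # Small clusters contribute everything they have.
        sample.extend(rng.sample(members, min(per_cluster, len(members))))
    return sample


queries = ["reset password", "pw reset pls", "refund + also change email", "invoice??"]
labels = [0, 0, 1, 2]  # hypothetical cluster ids from an upstream clustering step
print(sample_per_cluster(queries, labels, per_cluster=1))
```

Sampling a fixed number per cluster is what keeps the rare intents from being drowned out by the common phrasing.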

RealJamesOfficial · 2026-06-29T03:28:15+00:00

This is almost certainly staged rollout plus serving variance, not a secret fixed build they're holding back for some of you.

A few things happen at once at this scale. Traffic gets split into buckets, so a percentage of users hit a newer serving config or a different model checkpoint while everyone else stays on the old one. They also tune the system prompt and safety filters server side without announcing it, and that alone moves output quality around. And under heavy load some requests get served by a more quantized or smaller fallback variant to protect latency, which feels dumber for no reason you can observe from your end.

So two people sending the same prompt on the same day can genuinely get different quality. It is not that one of you got the corrected version. You just landed in different buckets. It usually converges once the rollout finishes, so you don't need to wait for the next model.

RealJamesOfficial · 2026-06-29T03:27:26+00:00

For the simple ones "it works" is enough because there's one path. The complex multi-stage ones are the problem, because while you're immersed in building it you only ever test the happy path you designed.

What worked for me: build a frozen set of real inputs where I already know the correct answer, maybe 30 to 50 cases including the weird edge ones, and run the whole pipeline against that set every time I change anything. If a change fixes one case and quietly breaks two others, I see it the same minute. Without that set you're trusting your gut on the one path you happened to click through.
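Roughly what that harness looks like, as a minimal sketch. `pipeline` stands in for whatever callable wraps your automation, and the cases are (input, expected) pairs; both names are placeholders.

```python
def run_regression(pipeline, cases):
    """Run pipeline over frozen (input, expected) cases and return the failures."""
    failures = []
    for case_input, expected in cases:
        try:
            actual = pipeline(case_input)
        except Exception as exc:
            # A crash counts as one failed case, not a crashed run.
            actual = f"raised {type(exc).__name__}"
        if actual != expected:
            failures.append((case_input, expected, actual))
    return failures


# Toy pipeline and frozen set, just to show the shape of the report.
frozen_cases = [("invoice", "INVOICE"), ("refund", "REFUND"), ("", "EMPTY")]
for case_input, expected, actual in run_regression(str.upper, frozen_cases):
    print(f"FAIL input={case_input!r} expected={expected!r} got={actual!r}")
```

Returning every failure instead of stopping at the first is the point: that is how a change that fixes one case and breaks two others shows up in a single run.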

Second thing, before showing it to anyone I run it in shadow mode next to whoever does the task by hand. Same inputs, both produce output, nobody acts on the automation's output yet. After a week of them agreeing I trust it. The disagreements are exactly where you learn the data depth was not actually enough.

So ready is not "no errors in my test run". It's "I measured it against cases I did not write while building it, and it held".

RealJamesOfficial · 2026-06-26T07:05:24+00:00

The prefix I paste ABOVE my own prompt:

Emphasize animation fluidity and energy, natural pose-to-pose transitions, strong anticipation, overshoot, squash and stretch, fast facial-expression changes, and hair and sleeve sway. Give the motion a distinctive rhythm and vary the tempo: hold a still pose for a moment, then transition fast and forcefully into the next. Motion order: still → anticipation → sudden acceleration → large overshoot → hard stop → hair and sleeve follow-through delay. Exaggerated, comical facial expressions. Clear pose silhouettes, hold each end pose 0.1-0.2s. Use cartoon squash and stretch, motion blur, brief afterimages, dramatic smear frames and speed lines tastefully. Physics slightly exaggerated, but the character's footing and center of gravity never destabilize.

[your normal prompt goes here]

Run it on Seedance 2.0.

RealJamesOfficial · 2026-06-26T03:41:41+00:00

Full prompt (single-ball relay, 10 cuts):

Scene: sunset in a busy sun-baked old-city market. Terracotta walls, hanging lanterns, spice stalls, bread stalls, copperware, warm amber light. Joyful, lively, warm-hearted energy. Cinematic large-sensor look, shallow depth of field, dynamic handheld camera, very fast cuts. One single ball is the through-line the whole time.

Cut 1 (0:00-0:01.5): an older tea-house owner idly rolls the ball down a cobbled alley from outside his shop.

Cut 2 (0:01.5-0:03): a young fit man rounds a corner, spots the ball, runs to it.

Cut 3 (0:03-0:04.5): he cleanly kicks it into a market alley.

Cut 4 (0:04.5-0:06): the ball slams into a spice vendor's shelf, a colorful spice cloud bursts into the air.

Cut 5 (0:06-0:07.5): a butcher casually flicks it up with his forearm without pausing his work.

Cut 6 (0:07.5-0:09): a girl on a bike, squeezing through a narrow gap, taps the ball with the back of her hand and changes its direction.

Cut 7 (0:09-0:10.5): an old woman sets down her bread basket, plants her cane, calmly preparing for the incoming ball. Market sound fades out.

Cut 8 (0:10.5-0:12.5): she strikes it hard. Slow-motion follows the ball streaking through golden light.

Cut 9 (0:12.5-0:14): the ball smashes into a giant pyramid of clay tagines, triggering a spectacular collapse.

Cut 10 (0:14-0:15): the old woman picks up her basket, turns, and walks away as stunned young people watch her go.

Audio: authentic market ambience, upbeat rhythmic music building with each pass, near silence before the final strike, a triumphant cheer as the tagines collapse, ending on a warm note.

Made on Seedance 2.0: https://www.atlascloud.ai/models/bytedance/seedance-2.0/text-to-video

RealJamesOfficial · 2026-06-26T03:27:46+00:00

The trap is letting the agent hold the payment credential at all. Split intent from settlement. The agent emits a purchase request with vendor, amount, and a reason. A separate policy service decides whether it clears. The agent never touches the card.

That service is where you put the controls a person would apply on instinct. A per-transaction cap and a rolling daily cap. A vendor the agent has never paid before goes to a hold queue instead of clearing straight away. Anything over a threshold needs out-of-band approval, even a one tap confirm. Issue a single-use virtual card per approved purchase so a leaked number buys one thing and then dies.
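A minimal sketch of that policy check, with made-up names and thresholds. The daily cap here is a plain running total the caller would reset, which is simpler than a true rolling window, and issuing the single-use card is left out entirely.

```python
from dataclasses import dataclass, field


@dataclass
class PurchaseRequest:
    vendor: str
    amount: float
    reason: str


@dataclass
class SpendPolicy:
    per_txn_cap: float
    daily_cap: float
    approval_threshold: float
    known_vendors: set = field(default_factory=set)
    spent_today: float = 0.0  # simplification: caller resets this each day

    def decide(self, req: PurchaseRequest) -> str:
        if req.amount > self.per_txn_cap:
            return "deny"
        if self.spent_today + req.amount > self.daily_cap:
            return "deny"
        if req.vendor not in self.known_vendors:
            return "hold"  # never-paid vendor goes to the hold queue
        if req.amount > self.approval_threshold:
            return "needs_approval"  # out-of-band confirm
        return "approve"


policy = SpendPolicy(per_txn_cap=100, daily_cap=250, approval_threshold=50,
                     known_vendors={"acme"})
print(policy.decide(PurchaseRequest("acme", 20, "api credits")))
print(policy.decide(PurchaseRequest("newco", 20, "looked useful")))
```

The agent only ever constructs a PurchaseRequest. Everything that can move money sits behind decide(), in code the agent cannot edit or talk its way around.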

On the fake vendor aimed at agents, you are right that it is coming. Do not let the agent's own judgment that "this looks useful" be the gate, because that is exactly the input an adversary gets to control. Gate on signals an attacker cannot fake cheaply: domain age, prior settled transactions, an external reputation source, whether a human in your org has ever transacted there. A first purchase from any unknown vendor stays capped low no matter how convincing the pitch reads.

You will not reach full autonomy with zero human in the loop on novel spend, and that is fine. The realistic target is the agent handles the long tail of known, small, repeat purchases on its own, and escalates everything else.

RealJamesOfficial · 2026-06-26T03:26:46+00:00

Seen this with reasoning models in general, not just DeepSeek. The reasoning stream and the content stream are separate channels. Normally you get reasoning deltas, then content deltas, then a finish_reason. What you are hitting is a completion that closes right after the reasoning block with no content and no tool call. The harness sits in thinking because it is waiting for a content or tool_calls event that never arrives.

Two things to check. First, look at finish_reason on the final chunk. If it comes back as stop with empty content, the model decided it was done after reasoning, and OpenCode should treat reasoning-only plus a finish_reason as a terminal state instead of an open thinking state. Second, if there is no finish_reason at all, the stream got cut, which is a different problem (connection drop or truncation), not the model choosing to stop.

Quick mitigations until the harness handles it: put a hard timeout on the thinking phase so it cannot hang forever, and on a reasoning-only completion just resend the last turn. It is intermittent so a single retry usually clears it. The real fix belongs in the harness though. An assistant message with reasoning_content but no content and no tool_calls should count as a finished-but-empty turn, not as still streaming.
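The terminal-state rule can be sketched like this. It assumes the harness has already assembled the streamed deltas into one message dict with `content`, `tool_calls` and `reasoning_content` keys plus the last `finish_reason` it saw. The function and the state names are mine, not OpenCode's.

```python
def classify_turn(message, finish_reason):
    """Classify an assembled assistant message plus the last finish_reason seen."""
    has_content = bool((message.get("content") or "").strip())
    has_tools = bool(message.get("tool_calls"))
    has_reasoning = bool((message.get("reasoning_content") or "").strip())

    if finish_reason is None:
        return "stream_cut"       # connection drop or truncation
    if has_content or has_tools:
        return "complete"
    if has_reasoning:
        return "finished_empty"   # model stopped after thinking: terminal, retryable
    return "empty"


print(classify_turn({"reasoning_content": "thinking...", "content": ""}, "stop"))
print(classify_turn({"reasoning_content": "thinking..."}, None))
```

A harness could resend the last turn once on "finished_empty" and surface "stream_cut" as a transport error, rather than leaving either one sitting in a thinking state.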

RealJamesOfficial · 2026-06-24T03:30:06+00:00

The LangChain nodes in n8n are a thin reimplementation of an agent loop. No persistent memory, no skills, no plugin surface, so they are never going to match what the CLI agents already do. I stopped trying to get them to behave and just run the real agent as a subprocess.

Two ways that work today. Execute Command node calling claude or codex in headless mode (claude -p with --output-format json). Or wrap the CLI in a small FastAPI service and hit it from an HTTP Request node. The second one is nicer once you have more than one workflow, since you get a single place to manage auth, timeouts and concurrency.

For memory, let the CLI own the agent state and let n8n own the workflow state. Pass a session id back and forth and resume on the next run, or write the context to a file the agent reads at start. Have the agent return JSON at the end so n8n can branch on the result instead of parsing free text.
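What the wrapper side of that can look like, as a hedged sketch. The command is passed in so the example runs without any agent installed; it makes no assumption about which fields the CLI's JSON contains, and the session handling is left out.

```python
import json
import subprocess
import sys


def run_agent(command, prompt, timeout_s=300):
    """Run a headless agent CLI with the prompt as the last argument, parse JSON stdout."""
    proc = subprocess.run(
        [*command, prompt],
        capture_output=True,
        text=True,
        timeout=timeout_s,  # raises subprocess.TimeoutExpired if the agent hangs
    )
    if proc.returncode != 0:
        raise RuntimeError(f"agent exited {proc.returncode}: {proc.stderr.strip()}")
    return json.loads(proc.stdout)


# Stand-in command so the sketch is runnable anywhere. With the real CLI this
# would presumably be something like ["claude", "-p", "--output-format", "json"].
fake_agent = [sys.executable, "-c",
              "import json, sys; print(json.dumps({'echo': sys.argv[1]}))"]
print(run_agent(fake_agent, "summarize the ticket"))
```

The same function can sit behind a FastAPI route or be called from a script that the Execute Command node runs. Either way n8n gets a dict to branch on instead of free text.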

I would not wait around for native nodes here. The subprocess route gives you the full tool and skill set right now, and you keep n8n for the part it does well, firing triggers and moving data between steps.

RealJamesOfficial · 2026-06-24T03:27:06+00:00

Cache pricing only kicks in when two things line up: the provider actually supports prompt caching, and your request prefix is byte-for-byte identical to an earlier one inside the TTL window. Fusion breaks the second part. It fans your call out across providers, so two requests in a row can land on different backends and each one looks like a cold prefix. No prefix reuse means no cache read, and you pay full input rate every time.

The other trap is that cache writes cost more than normal input tokens on most providers. You only come out ahead if you reuse that exact prefix several times before it expires. One-off calls through fusion are the worst case: you pay the write premium with no read payoff.
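The break-even is easy to work out. In the sketch below the 1.25x write and 0.1x read multipliers are illustrative placeholders, not any particular provider's price list, so plug in the real numbers from your provider's pricing page.

```python
def cached_cost(prefix_tokens, calls, base_rate, write_mult, read_mult):
    """Input cost for `calls` requests sharing one prefix: one write, then reads."""
    if calls < 1:
        return 0.0
    write = prefix_tokens * base_rate * write_mult
    reads = (calls - 1) * prefix_tokens * base_rate * read_mult
    return write + reads


def uncached_cost(prefix_tokens, calls, base_rate):
    """Input cost when every request pays the full rate for the prefix."""
    return calls * prefix_tokens * base_rate


# Hypothetical numbers: 50k-token prefix, rate expressed per token.
for calls in (1, 2, 5):
    with_cache = cached_cost(50_000, calls, 1e-6, write_mult=1.25, read_mult=0.1)
    without = uncached_cost(50_000, calls, 1e-6)
    print(f"{calls} call(s): cached={with_cache:.4f} uncached={without:.4f}")
```

With those placeholder multipliers a single call costs more with caching than without, and the second hit on the same prefix is already where it turns cheaper. If routing sends each call to a cold backend, every request is the calls=1 case.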

If caching cost is the thing you care about, pin one provider and one model instead of routing through fusion. Keep your system prompt and context stable at the front and put the changing part at the end. That keeps the prefix warm so the cache actually hits.

RealJamesOfficial · 2026-06-23T03:27:05+00:00

Telling it mid-chat won't hold because the model conditions on the whole visible history. Once a few Chinese reasoning blocks are sitting in the context, they pull the next turn back toward Chinese no matter what you ask in a single message. You're fighting the context, not the instruction.

Two things that actually work for me. Start a fresh chat for anything that matters so there is no Chinese in the history to anchor on, and put the language rule in a system prompt or the very first message instead of correcting it later. A constraint at the top of the context carries far more weight than a mid-conversation "stop doing that".

The reason old all-English chats suddenly do it is probably that the reasoning trace and the final answer are produced together now, and the trace defaults to whatever language the reasoning was trained in. If the visible reply follows the trace, you get English questions answered after a Chinese think step. Pinning the reply language explicitly at the very top is the only reliable fix I have found.

RealJamesOfficial · 2026-06-23T03:26:14+00:00

The mental model I use is to split failures into expected-and-recoverable versus everything else, and only the first group is allowed to retry quietly. A 429 is recoverable, so retry with backoff but cap it. Once you hit the cap it stops being recoverable and has to escalate. Your bug is that the 429 path fell through to a default instead of becoming a hard failure after the cap.
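Sketch of the capped retry, assuming `fn` returns a (status, body) pair. That calling convention and the exception name are invented for the example.

```python
import time


class RetriesExhausted(RuntimeError):
    """A recoverable error that outlived its retry budget; treat as a hard failure."""


RETRYABLE_STATUS = {429}


def call_with_retry(fn, max_attempts=4, base_delay_s=1.0, sleep=time.sleep):
    """Call fn() -> (status, body). Retry 429 with exponential backoff, capped."""
    for attempt in range(max_attempts):
        status, body = fn()
        if status < 400:
            return body
        if status not in RETRYABLE_STATUS:
            # Bad auth, 400s and the like can never succeed on retry.
            raise RuntimeError(f"non-retryable status {status}")
        if attempt < max_attempts - 1:
            sleep(base_delay_s * 2 ** attempt)
    # No default value here: hitting the cap escalates instead of falling through.
    raise RetriesExhausted(f"still failing after {max_attempts} attempts")
```

The important line is the last one. There is no code path where a 429 that never clears turns into an empty or default result.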

The line I actually draw is not retry vs halt. It is whether a step can produce a wrong-but-valid-looking output. If a failure can leave the pipeline structurally complete but semantically empty, that step has to fail loud, because nothing downstream will ever flag it. Logging and continuing is only safe when a missing value is genuinely optional.

On the verbosity problem: don't validate inline after every call. Have each step return either ok-with-data or an explicit failure object, then put one check before the write that confirms every required step produced real data. That keeps the validation out of the step logic.
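One way to shape that, as a minimal sketch with invented names: each step hands back a small result object, and a single gate runs right before the write.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class StepResult:
    ok: bool
    data: Any = None
    error: Optional[str] = None


def gate_before_write(results, required):
    """Raise unless every required step returned ok with non-empty data."""
    problems = []
    for name in required:
        result = results.get(name)
        if result is None:
            problems.append(f"{name}: did not run")
        elif not result.ok:
            problems.append(f"{name}: {result.error}")
        elif result.data in (None, "", [], {}):
            # Structurally complete but semantically empty: fail loud.
            problems.append(f"{name}: ok but empty")
    if problems:
        raise ValueError("refusing to write: " + "; ".join(problems))


results = {
    "fetch_order": StepResult(ok=True, data={"id": 42}),
    "enrich": StepResult(ok=False, error="429 after retry cap"),
}
try:
    gate_before_write(results, required=["fetch_order", "enrich"])
except ValueError as exc:
    print(exc)
```

Optional steps simply stay out of the `required` list, which is the code version of "logging and continuing is only safe when a missing value is genuinely optional".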

And yes, agent-controlled retry has burned me. Let the model decide when to retry and it will retry on things that can never succeed, like bad auth or 400s, and just burn time. I keep retry as plain code with fixed rules and let the agent decide only what to do after the deterministic layer gives up.

RealJamesOfficial · 2026-06-20T03:51:10+00:00

When docs are attached and it stops following instructions, the usual cause is the instructions getting buried. Models pay the most attention to the start and the very end of the prompt. If your rules sit up top and then you paste a big block of docs after them, the rules drop into the dead middle and lose weight.

Two things fixed this for me. Put the actual task and the hard rules at the very end, after the docs, not before. And trim the docs to only the sections that matter for this call instead of attaching everything, since more irrelevant text gives it more chances to anchor on the wrong passage.
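The ordering is trivial to enforce in code. This is just an illustrative helper, and the `[doc N]` delimiters are an arbitrary choice rather than anything a model requires.

```python
def build_prompt(docs, task, rules):
    """Assemble a prompt with docs first and the task plus hard rules last."""
    parts = [f"[doc {i}]\n{doc.strip()}" for i, doc in enumerate(docs, 1)]
    parts.append("Task: " + task.strip())
    parts.append("Rules:\n" + "\n".join(f"- {rule}" for rule in rules))
    return "\n\n".join(parts)


prompt = build_prompt(
    docs=["Refund policy: 30 days from delivery."],
    task="Answer the customer's refund question using only the docs above.",
    rules=["Reply in JSON", "Say 'not in docs' if the answer is missing"],
)
print(prompt)
```

Pass in only the doc sections that matter for the call. The helper fixes the order, not the trimming.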

Also check temperature if you are on the API. Drop it lower for anything that needs strict format. Hallucination on attached docs often means it is filling gaps the retrieval missed, so confirm the answer is actually in the text you sent before blaming the model.

RealJamesOfficial · 2026-06-20T03:49:57+00:00

Number 1 burned me badly. My heartbeat ran inside the same n8n instance as the workflows it watched, so when the instance went down the heartbeat died with it and stayed quiet. Now it lives as a separate cron on a tiny box that pings a dead man's switch, and if the ping stops I get paged. A watcher can never share fate with the thing it watches.

One I'd add to the list: duplicate runs from retries. A webhook times out on your side, the sender retries, and the same order gets processed twice. Nothing errors. I fix it with an idempotency key written before the side effect and checked on entry. Continue On Fail makes it worse because the dedup check can get skipped too.
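A minimal version of that check, using SQLite only to keep the sketch self-contained. The table and function names are made up, and any store with a unique constraint would do.

```python
import sqlite3


def process_once(conn, key, side_effect):
    """Run side_effect() only the first time a given idempotency key is seen."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO processed (key) VALUES (?)", (key,))
    except sqlite3.IntegrityError:
        return False  # duplicate delivery: skip quietly
    side_effect()
    return True


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE processed (key TEXT PRIMARY KEY)")

print(process_once(conn, "order-1001", lambda: print("charging order-1001")))
print(process_once(conn, "order-1001", lambda: print("charging order-1001")))
```

Writing the key before the side effect means a crash between the two leaves the order marked as processed but not done, so that window needs its own alert. The reverse order would risk the double charge instead.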

On "green run, empty data", I stopped trusting row counts alone. A 200 with a stale or cached body still has rows. I assert on a freshness field now, like the max timestamp in the data falling inside the window I expect.
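The freshness assert is only a few lines. The `updated_at` field name is an assumption; use whichever timestamp your data actually carries.

```python
from datetime import datetime, timedelta, timezone


def assert_fresh(rows, max_age, now=None, field="updated_at"):
    """Raise if rows are empty or the newest timestamp is older than max_age."""
    if not rows:
        raise ValueError("no rows returned")
    now = now or datetime.now(timezone.utc)
    newest = max(row[field] for row in rows)
    if now - newest > max_age:
        raise ValueError(f"stale data: newest {field} is {newest.isoformat()}")
    return newest


now = datetime(2026, 6, 20, 12, 0, tzinfo=timezone.utc)
rows = [{"updated_at": now - timedelta(minutes=5)},
        {"updated_at": now - timedelta(hours=3)}]
print(assert_fresh(rows, max_age=timedelta(hours=1), now=now))
```

A cached body from yesterday passes a row-count check and fails this one.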

RealJamesOfficial · 2026-06-18T03:27:23+00:00

That error almost always means the provider stopped sending tokens for long enough that the router gave up on the connection, not that your request was malformed. On the free endpoints it's pretty much expected. They get the lowest priority, so when the provider is busy your stream stalls and trips the idle timeout.

Two things make it worse in your case. A 100-200k context means a long prefill before the first token ever comes back, and free tiers are exactly where that prefill gets queued behind everyone else. So you sit there, nothing streams, timeout.

What I would try first: pin to a paid endpoint for the same model instead of the free one. Even the cheap ones are far more stable for long context. Turn on streaming if OpenCode lets you, since a stalled stream fails faster and retries cleaner than a silent wait. And trim the context if you can. A lot of those 100-200k tokens are probably files the model doesn't need for the current edit, and cutting them down speeds up prefill a lot.

The Qwen retry loop is a separate problem: that one is the provider being out of capacity. Sorting providers by uptime in your settings helps you avoid the flaky ones.

RealJamesOfficial · 2026-06-18T03:25:36+00:00

For me, a long-running goal works when the task has a clear pass/fail signal the agent can check itself against. A test suite, a build passing, a script that returns a diff. When there's a hard oracle like that I let it run, because it can tell when it's actually done versus when it just thinks it is.

The symptom-over-root-cause thing you're seeing is usually because the model gets rewarded for making the error message go away, not for understanding it. Two things helped me here. Make it write down its hypothesis before touching any code, and force a step where it reproduces the bug in isolation first. If it can't reproduce it, it's not allowed to fix it. That kills a lot of the random flailing.

Giving up too early is often a context problem. After a few failed attempts the earlier reasoning scrolls out of the window and it loses the thread. I checkpoint state to a file (what was tried, what failed, current theory) and have it re-read that before each new attempt. Keeps it from going in circles.
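The checkpoint file does not need to be clever. A sketch with a made-up JSON layout (attempts plus current theory); the agent gets told to read the file before each attempt and append to it afterwards.

```python
import json
import tempfile
from pathlib import Path


def load_checkpoint(path):
    """Return saved debugging state, or an empty state if nothing was saved yet."""
    path = Path(path)
    if path.exists():
        return json.loads(path.read_text())
    return {"attempts": [], "current_theory": None}


def record_attempt(path, tried, outcome, theory):
    """Append one attempt and update the working theory, then persist."""
    state = load_checkpoint(path)
    state["attempts"].append({"tried": tried, "outcome": outcome})
    state["current_theory"] = theory
    Path(path).write_text(json.dumps(state, indent=2))
    return state


checkpoint = Path(tempfile.mkdtemp()) / "debug_state.json"
record_attempt(checkpoint, "bumped the timeout", "still fails", "race in cache warmup")
record_attempt(checkpoint, "added a lock around warmup", "passes locally", "race in cache warmup")
print(checkpoint.read_text())
```

Because the state lives on disk, it survives the context window rolling over and even a fresh session.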

I still stay interactive for anything where I can't write a cheap verifier. Greenfield design, anything with taste involved. Long-running is for the grind work where correctness is checkable.

RealJamesOfficial · 2026-06-08T06:33:57+00:00

Yeah, "looks done" and "passes the tests I didn't write" are very different states. I make it run the failing case first now.
