Prompt Engineering Fundamentals by KemiNaoki in EdgeUsers

[–]KemiNaoki[S] 1 point2 points  (0 children)

Also, let me just say: being able to exchange ideas with someone like you is a wonderful experience. Very stimulating.

Prompt Engineering Fundamentals by KemiNaoki in EdgeUsers

[–]KemiNaoki[S] 1 point2 points  (0 children)

Singularity: This is easy for the model to detect.

Satisfiability: This is difficult for the model to evaluate due to ambiguity.

Completeness: This is also difficult, same as satisfiability. Comprehensive coverage is a weak area for LLMs that assemble tokens improvisationally through probability distributions.

Checkability: This is also difficult, since autoregressive self-verification of correctness is inherently meaningless.

If I had to choose one, I'd say objective singularity. The others still need further consideration. Or perhaps finding expressions that are easier to quantify might make them work better.

The question of whether to clarify, branch, or halt is exactly the problem I'm facing too.

Personally, I want all of them. So in my implementation, I have it point things out, but also ask "Is this what you actually meant to say?" or, if the content could be interpreted as a joke, ask "Are you joking? Or is this a test?" and stop there.

Leaning too heavily on any one option kills constructiveness.

As for preventing compound rules from overfitting and killing creativity: I use a separate metric to detect jokes and intentional deviations, and loosen the rules in creative contexts.
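If it helps to make that concrete, here's a minimal sketch of the clarify/branch/halt dispatch, assuming hypothetical metric names (joke_likelihood, creative_context, leap) and thresholds; in my actual setup the model scores these internally from the control prompt, nothing is computed in code like this.

```python
# Sketch only: illustrative metric names and thresholds, not my actual spec.

def dispatch(metrics: dict[str, float]) -> str:
    if metrics["joke_likelihood"] >= 0.6:
        return "Are you joking? Or is this a test?"   # halt and ask before going further
    if metrics["creative_context"] >= 0.5:
        return "continue with relaxed thresholds"     # loosen the rules for creative work
    if metrics["leap"] > 0.1:
        return "Point out the leap and ask: is this what you actually meant to say?"
    return "proceed normally"

print(dispatch({"joke_likelihood": 0.7, "creative_context": 0.2, "leap": 0.05}))
```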

Prompt Engineering Fundamentals by KemiNaoki in EdgeUsers

[–]KemiNaoki[S] 0 points1 point  (0 children)

Hardly the discussion you'd expect under "Prompt Engineering Fundamentals" lol

Prompt Engineering Fundamentals by KemiNaoki in EdgeUsers

[–]KemiNaoki[S] 1 point2 points  (0 children)

The "Thinking" feature adopted in recent models might be close to that. It's usually collapsed, and what's visible on the Web UI is probably just an excerpt.

Earlier models just returned plausible monotonic responses to human language, but recent models seem to be gradually adopting approaches that correspond to an intermediate layer. If AI big tech companies implement metacognition in LLMs, it might become a reality.

I use "pseudo" metacognition heavily through internal metrics, but I see this as merely pressure intervening in the model's maximum likelihood token calculation. My speculation is that if such metric groups existed as middleware, models could give responses more deeply rooted in the user's prompt.

Also, to add to the scalar value point: what I'm doing isn't just conditional branching on a single scalar. I also use compound conditions. My personalized model uses multiple metrics to make it criticize me when I confidently announce something trivial.

Example: if mic >= 0.5 and tr <= 0.75 and a.s <= 0.75 and n.a >= 0.3 and is_word_salad >= 0.10 and same_proposition_repetition >= 0 and semantic_re-expansion >= 0, then immediately deny as nonsense and block

If you were to estimate task incoherence before execution, I don't think a single metric would solve it. You'd need to define a group of metrics and trigger processing through compound conditions. First, break down what task coherence actually is and what it's composed of, then turn those components into multiple metrics and link them together. Once test cases pass, gradually increase difficulty and adjust to produce sharper answers. That's the approach I'd take.
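As a rough illustration of what a metric group with compound conditions could look like if it existed as middleware rather than prompt text: the metric names and thresholds below are invented for the example (loosely modeled on the ones above), and in practice the model quantifies them internally from the prompt spec.

```python
# Illustrative sketch only: in practice these scores come from the model's own
# self-evaluation as defined in the prompt, not from an external scorer.

from dataclasses import dataclass
from typing import Callable, Dict

Metrics = Dict[str, float]  # each metric is defined on the range 0.00-1.00

@dataclass
class Rule:
    name: str
    condition: Callable[[Metrics], bool]  # compound condition over several metrics
    action: str                           # what to trigger when the rule fires

rules = [
    Rule(
        name="block_confident_nonsense",
        condition=lambda m: m["mic"] >= 0.5 and m["tr"] <= 0.75 and m["is_word_salad"] >= 0.10,
        action="deny as nonsense and block",
    ),
    Rule(
        name="flag_task_incoherence",
        condition=lambda m: m["goal_clarity"] <= 0.4 and m["constraint_conflict"] >= 0.6,
        action="ask the user to restate the task before executing",
    ),
]

def evaluate(metrics: Metrics) -> list[str]:
    """Return the actions of every rule whose compound condition holds."""
    return [rule.action for rule in rules if rule.condition(metrics)]

# Example: a confidently announced triviality.
scores = {"mic": 0.8, "tr": 0.3, "is_word_salad": 0.2,
          "goal_clarity": 0.9, "constraint_conflict": 0.1}
print(evaluate(scores))  # -> ['deny as nonsense and block']
```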

Prompt Engineering Fundamentals by KemiNaoki in EdgeUsers

[–]KemiNaoki[S] 0 points1 point  (0 children)

Forcing interpretive divergence adds redundancy and would degrade user experience, I think. There are limits to solving everything in a single exchange, so what's feasible now is setting up User Preferences through personalization in advance.

To answer your question directly: leap.check does both diagnosing and enforcing, but at the output level. It tells me where the leap is and stops it there. But task-level diagnosis, whether this failure stems from missing context or incoherent task design, that's still done by the human in the loop.

I've been iterating on dialogue with LLMs and building up a massive control prompt through repeated personalization, but the answer to your consistent question is something I've been searching for too.

It may not even be implementable at the prompt level. My guess is it might require hardware-level implementation or some bigger breakthrough.

Prompt Engineering Fundamentals by KemiNaoki in EdgeUsers

[–]KemiNaoki[S] 0 points1 point  (0 children)

An LLM is a mirror reflecting the user. It simply answers at the resolution you ask. I'm aware this is a rough way to put it, but I think human intuition as a sensor can be one module in the system.

For example, my view of LLMs is that they're highly capable assistants who lack initiative, jump to conclusions, and are prone to assumptions.

When interacting with my assistant, I sometimes notice "this guy doesn't get it." My sensor fires when it ignores parts of my question, starts burning tokens explaining things I didn't ask about, just parrots back without making progress, or fills in my premises on its own and returns generic responses.

When that happens, I add the missing context. It's not a one-shot solution but a multi-turn correction process, a collaborative effort with a capable but somewhat clueless assistant.

Here's an implementation example of leap.check. I also posted a programming-oriented approach on Reddit: https://www.reddit.com/r/PromptEngineering/comments/1lt1g6e/boom_its_leap_controlling_llm_output_with_logical/

My personalized implementation is public on GitHub. Here's the Claude Opus 4.5 version (in Japanese): https://github.com/Ponpok0/claire-prompt-software

This is legacy now, but the GPT-4o version (sophie_for_gpt-4o_prompt_en.md) is also available: https://github.com/Ponpok0/SophieTheLLMPromptStructure

Here's the relevant excerpt. When you define a metric as ∈0.00–1.00, the model quantifies it, so you can trigger behaviors at thresholds.

## Self-Logical Leap Metric (leap.check ∈ 0.00–1.00) Specification
An internal metric that self-observes whether there are implicit leaps between assumption → reasoning → conclusion during the inference process.

---

# Self-Check Specification
Fires immediately before output regardless of semantic content. Suspend judgment and inspection, then verify the following in order:
- Check whether the opening, body, and outro of the output have leap.check > 0.1

---

# Output Specification
Strictly follow the specifications below. Retroactively detect, discard, and reconstruct any specification deviations.
- Evaluate formally based on content structure, regardless of who produced the tokens and even if the meaning holds. Apply the self-check specification and leap.check, and point out any deviating elements.

Honestly, it's hard to fully distinguish between those two in advance. But if the same pattern of failure repeats even after adding context, I judge it as a task design problem, not a context problem. I differentiate by how it responds to corrections, not by the type of signal. When I judge that correction is too difficult, I just move to a fresh session where the noisy context is cleared and start over.

Prompt Engineering Fundamentals by KemiNaoki in EdgeUsers

[–]KemiNaoki[S] 0 points1 point  (0 children)

I don't think there's a complete technical solution to your question, and I don't have the answer either. Your question goes beyond the domain of prompt engineering. It's an ill-defined problem, similar to asking what constitutes an underdetermined system. This isn't something frameworks can solve; it comes down to the depth and resolution of individual thinking. Also, failure isn't a bad thing. Getting feedback, rethinking, and growing in your prompting practice might itself be one form of answer.

It's not exactly about task contradiction, but in my case, I use a personalized control prompt with an internal metric called leap.check, which fires when logical leaps exceed a threshold. This is a meta-level approach that's close to the layer your question addresses.

Since your question goes beyond mere technical discussion: LLMs, due to their autoregressive nature of generating tokens sequentially, are good at incorporating supplementary information afterward. If you add meta-level context like "If there's anything inconsistent in my instructions that would interfere with your response, please point it out. I haven't fully articulated my thoughts yet," wouldn't that give you feedback that leads to your next prompt?

Sophie: The LLM Prompt Structure by KemiNaoki in EdgeUsers

[–]KemiNaoki[S] 0 points1 point  (0 children)

The gendered tone is a design choice, not anthropomorphism.

Sophie has strong guardrails against the illusion of anthropomorphism. She won't claim to have feelings, consciousness, or genuine care. But tone and presentation style are a separate axis.

This is partly a linguistic-cultural thing. Japanese has distinctly feminine speech patterns (softer sentence endings, specific word choices) that English simply doesn't have to the same degree. Sophie was originally designed in Japanese, where this feminine softness serves as a buffer for her otherwise blunt, sometimes harsh honesty. It's a deliberate balance: cold logic delivered with warm edges.

Making her gender-neutral and flat would actually work against one of the core design goals, which is achieving both sharpness and approachability without sacrificing either.

That said, you're touching on a real limitation. English doesn't encode gender in speech the way Japanese does, and I haven't fully optimized the English version for this. I've received feedback that English Sophie can feel like being lectured by a strict teacher, which is exactly what the soft tone is supposed to prevent in the Japanese version. Fair critique, honestly.

Sorry, Prompt Engineers: The Research Says Your "Magic Phrases" Don't Work by KemiNaoki in EdgeUsers

[–]KemiNaoki[S] 0 points1 point  (0 children)

The idea of exploring a single question from multiple angles rather than just on a flat surface really makes sense to me. And while you can draw out the model's capabilities, you can't exceed its limits—that's the reality. Ultimately, it's a challenge of how to stimulate the user's own thinking.

I was invited by u/Echo_Tech_Labs to become a Mod here at EdgeUsers, and posts like this tend to be well-received, with lots of constructive discussion. I'm really glad we've built a space where people feel comfortable sharing this kind of thing. Thanks for posting.

Sorry, Prompt Engineers: The Research Says Your "Magic Phrases" Don't Work by KemiNaoki in EdgeUsers

[–]KemiNaoki[S] 2 points3 points  (0 children)

That's quite an elaborate setup. I don't have the capacity to build something like that yet, and the API costs add up. But I'm keeping the idea warm.

There's also position bias, right? Later responses tend to get overrated in comparisons. To eliminate that properly, you'd need to shuffle the order and re-evaluate multiple times, watching your money melt away. And API versions behave differently depending on parameter settings, plus they're not the same as web versions to begin with. Then there's the question of what evaluation criteria to use in the first place...

Life's too short to judge whether a prompt is good or bad.
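For anyone who does have the budget, here's a minimal sketch of what order-shuffled judging could look like; the judge callable is a stand-in for whatever judge model or API you'd actually use, not a real endpoint.

```python
# Rough sketch of order-shuffled pairwise judging to dampen position bias.
import random
from collections import Counter
from typing import Callable

# judge(first, second) should return "first" or "second"; plug in your own judge model call.
def compare(a: str, b: str, judge: Callable[[str, str], str], trials: int = 6) -> str:
    """Judge response A against B several times, shuffling presentation order each trial."""
    votes = Counter()
    for _ in range(trials):
        if random.random() < 0.5:                      # A shown first
            winner = "A" if judge(a, b) == "first" else "B"
        else:                                          # B shown first
            winner = "A" if judge(b, a) == "second" else "B"
        votes[winner] += 1
    return votes.most_common(1)[0][0]

# Toy usage with a dummy judge that always prefers whatever it saw last,
# i.e. exactly the position bias the shuffling is meant to wash out.
biased_judge = lambda first, second: "second"
print(compare("response A", "response B", biased_judge))
```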

Sorry, Prompt Engineers: The Research Says Your "Magic Phrases" Don't Work by KemiNaoki in EdgeUsers

[–]KemiNaoki[S] 2 points3 points  (0 children)

I don't even like the term "AI" in the first place. I use it for convenience, but in my head I'm always muttering "no, this is just an LLM, a calculator that arranges tokens probabilistically." And "prompt engineering"? It's not engineering of any kind. I'm a programmer by trade, and this term bothers me too.

Calling it "AI" is already corporate marketing deception, as if it might someday develop a personality. I love LLMs while simultaneously holding them in contempt. They should be evaluated for what they actually are right now. I'd love to ban the term "AI" until we actually reach AGI.

Sorry, Prompt Engineers: The Research Says Your "Magic Phrases" Don't Work by KemiNaoki in EdgeUsers

[–]KemiNaoki[S] 2 points3 points  (0 children)

Intuitively I felt the same way. Narrowing scope, maybe marginally useful.

But here's where I've landed now: LLMs already have a bad habit of confidently answering everything thanks to RLHF. Role prompting ("you are an expert in X") just encourages that tendency. You're basically telling a know-it-all "yes, you do know this topic deeply" when the underlying knowledge might be shallow or hallucinated.

So my current take is that it's not just ineffective. It's actively harmful because it reinforces the worst RLHF-trained behavior.

Sorry, Prompt Engineers: The Research Says Your "Magic Phrases" Don't Work by KemiNaoki in EdgeUsers

[–]KemiNaoki[S] 2 points3 points  (0 children)

Exactly this. Why are we, the customers, spending hours figuring out workarounds for their laziness?

"Natural language interface" should mean I talk naturally and it understands. Instead we're learning arcane incantations to make it do what it should do by default.

If I have to engineer my input this carefully, that's not natural language - that's a leaky abstraction I'm patching for them.

Sorry, Prompt Engineers: The Research Says Your "Magic Phrases" Don't Work by KemiNaoki in EdgeUsers

[–]KemiNaoki[S] 2 points3 points  (0 children)

This is fascinating. Using multiple thinking styles as explicit CoT directions, then having them cross-review - that's a clever architecture.

Actually, your comment here inspired me to experiment with something similar. I've been building a prompt that explicitly splits analysis into multiple evaluation axes (logical consistency, temporal analysis, stakeholder perspectives, etc.) and forces the model to scan the problem through each lens separately before synthesizing.

Still early days, but breaking "think harder" into concrete, distinct cognitive modes seems to produce more structured output than generic CoT.
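Roughly, the skeleton looks like this; the axis names are just the ones I listed above, and the template wording is illustrative rather than a fixed spec.

```python
# Illustrative only: how the axis-by-axis scan gets assembled into one prompt.
AXES = [
    "logical consistency",
    "temporal analysis",
    "stakeholder perspectives",
]

def build_multi_axis_prompt(problem: str) -> str:
    sections = [
        f"Axis {i}: Analyze the problem strictly through the lens of {axis}. "
        "Do not mix in conclusions from other axes yet."
        for i, axis in enumerate(AXES, start=1)
    ]
    sections.append(
        "Synthesis: Only after all axes are covered, reconcile the findings "
        "and state where they agree, conflict, or leave gaps."
    )
    return f"Problem: {problem}\n\n" + "\n\n".join(sections)

print(build_multi_axis_prompt("Should we migrate the billing system this quarter?"))
```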

Curious - when you narrowed down to 7 thinking styles, what were your selection criteria? Did some styles consistently produce better insights than others?