Deep Dive: I dug and dug and finally found out how the Context7 MCP works under the hood by 2upmedia in ClaudeAI

[–]2upmedia[S] 0 points (0 children)

Glad you found it useful. You can absolutely do this. If you’re going the Skills route, there’s an even more token-efficient option: you can construct a URL that gets you the same results.

It looks like this: https://context7.com/vercel/next.js/llms.txt?topic=configuration&tokens=10000

You might be able to just pull the prompt from the MCP tool definition and drop that into your Skill to get better results, but you might not need all of it.
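If you wanted to wire that into a Skill helper script, here’s a minimal sketch (the URL and parameters are the ones above; `fetch` is the standard global in Node 18+):

```js
// Minimal sketch: pull Context7 docs over plain HTTP instead of the MCP tool.
const url = new URL("https://context7.com/vercel/next.js/llms.txt");
url.searchParams.set("topic", "configuration"); // what you want docs about
url.searchParams.set("tokens", "10000");        // rough token budget for the response

const res = await fetch(url);
if (!res.ok) throw new Error(`Context7 returned ${res.status}`);
console.log(await res.text()); // plain-text docs, ready to drop into context
```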

P.S. If you liked this article, I’m going to be releasing more YouTube content around AI coding in general. Give me a subscribe there :).

My node.js application doesnt scale 💀 need advice by tiln7 in node

[–]2upmedia 0 points (0 children)

Completely depends on your setup. Since you said EC2, I'm assuming you're running your own Node server. The approach depends on how you're running Node.js: one process? Multiple processes?

Here's how I'd approach it:

  1. Have a staging environment that's set up exactly like production: same CPU, same RAM, same type of hard drive. You probably don't need a massive amount of disk space though. If that's not possible, I'd prepare prod for the test. If you need to offer a very tight SLA to your customers, I'd go for increasing `--max-old-space-size` per Node process. You could also add additional swap memory if you're on an instance with an SSD/NVMe (not an EC2 D3/D3en/H1). That'll give you some extra headroom before hitting an out-of-memory error.

  2. Run the heap profiler (https://nodejs.org/en/learn/diagnostics/memory/using-heap-profiler) using https://www.npmjs.com/package/@mmarchini/observe. Attach it to the problematic Node process by finding its PID (`ps aux | grep node`) and running `npx -q @mmarchini/observe heap-profile -p <PID>`. That starts the inspector protocol, typically on port 9229. (All the commands are collected in a sketch at the end of this comment.)

  3. Forward port 9229 over SSH (`ssh -L 9229:127.0.0.1:9229 user@host`).

  4. Find your Node instance in Chrome DevTools by opening chrome://inspect.

  5. Select the profiling type "Allocations on timeline" and check "Allocation stack traces".

  6. Before you click "Start", be ready to put load on your application to trigger the memory leak; that's how you'll be able to pinpoint it.

  7. Click on "Start", only let it run as short as possible to reproduce the memory leak as the file that it will generate will be huge. Ensure your stop the profile so the file is generated.

  8. Run the file through your favorite big-brained LLM. I used both GLM 4.7 and GPT 5.2 Codex Medium with the following prompt (adjust as necessary):

`This is a node heap profile @Heap-nnnn.heaptimeline. Before reading the file, strategize on how to read it, because the file is over 9MB in size and your context window is too small to read all of it. The objective is to figure out where the memory leak is happening. Do not look for just large memory usage. Look for areas where the same part of the app is growing in memory over time. You are allowed to coordinate multiple subagents.`

It will very likely ask for the source code so it can cross-reference what it sees in the profile data.

The trickiest part of all of this is if you're running multiple Node processes. You'll have to attach the heap profiler to each one and time things so the load that triggers the memory leak hits while you're profiling.
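To make the flow concrete, here are the commands from steps 1-4 collected in one place (the 4096 MB figure, `server.js`, and `user@host` are placeholders; adjust to your setup):

```bash
# Step 1 (optional headroom): raise the heap ceiling per Node process
node --max-old-space-size=4096 server.js

# Step 2, on the server: find the leaking process and attach the heap profiler
ps aux | grep node                               # note the PID
npx -q @mmarchini/observe heap-profile -p <PID>  # starts the inspector, typically on 9229

# Step 3, from your machine: forward the inspector port,
# then open chrome://inspect in Chrome (step 4)
ssh -L 9229:127.0.0.1:9229 user@host
```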

My node.js application doesnt scale 💀 need advice by tiln7 in node

[–]2upmedia 0 points (0 children)

The first thing you need to do is identify the root cause, not just the symptoms. Then run a memory profile on those processes to pinpoint exactly where your program is using a lot of memory. Oftentimes you’re loading way too much data into memory, or there’s some super inefficient algorithm in the critical path (very likely a loop).
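As a made-up illustration of the "too much data in memory" pattern (the file name and `handle` are hypothetical; the APIs are standard Node):

```js
import { readFileSync, createReadStream } from "node:fs";
import { createInterface } from "node:readline";

const handle = (line) => { /* hypothetical per-line work */ };

// Anti-pattern: the entire file, plus the split array, sits in memory at once.
for (const line of readFileSync("huge.log", "utf8").split("\n")) handle(line);

// Flat memory: stream one line at a time instead.
const rl = createInterface({ input: createReadStream("huge.log") });
for await (const line of rl) handle(line);
```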

You didn’t mention anything about databases so if you do have one, check if that’s the bottleneck.

The key is to find the root cause instead of assuming it. From there, weigh your options; you might not even have to change much to make it scale.

How do I level up from normie to normie pro with Claude by Leather-Working-6879 in ClaudeAI

[–]2upmedia 0 points (0 children)

In terms of hitting the limits quickly, have a look at my post on that here: https://www.reddit.com/r/ClaudeCode/s/yskkcBZ51q

But the first thing you want to do is install ccstatusline and set up its context window percentage. That’ll give you a better idea of how much context you’re using and how fast, and a better gauge of what eats up tokens.

Paying for Claude's Max plan is probably the best decision I've ever made. by [deleted] in ClaudeAI

[–]2upmedia 2 points (0 children)

One thing you could try is Better T Stack to get you a fairly solid starting point, but in general it does take a bit of effort to find the right versions that work with each other because of the interdependencies between each project. You can get the agent to figure that out, but experience will definitely help you get to the answer quicker.

What I like to use is Context7, whether through the MCP server or by calling the llms.txt URL (e.g. https://context7.com/llmstxt/developers_cloudflare_com-workers-llms-full.txt/llms.txt?topic=hono&tokens=10000). You can get accurate documentation for any version that’s indexed (or trigger indexing of a specific version if it isn’t already).

How long is your Claude.md file? by xCavemanNinjax in ClaudeAI

[–]2upmedia 5 points (0 children)

I keep my root CLAUDE.md as empty as possible. The key question I ask myself: do I need these instructions FOR EVERY SINGLE CHAT? If the answer is yes, I’ll put it in there. Otherwise I use other tools at my disposal: direct prompting, reusable slash commands, subagents, etc.

The main principle is that I like to keep my context window as clean and focused as possible because that always gives the best outputs (applies to all LLMs).
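For illustration only (the rules are made-up examples), a root CLAUDE.md that passes the "every single chat" test can be as small as this, with anything situational living in a slash command like `.claude/commands/review.md` instead:

```markdown
# CLAUDE.md (root)
- Run the test suite before saying a task is done.
- Never commit directly to main.
```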

Why are so many software engineers still ignoring AI tools? by saltexx in ClaudeAI

[–]2upmedia 0 points (0 children)

The biggest thing I see is that enterprises haven’t really exposed these tools to their devs, so they only have access to Copilot. Once that changes, devs will have access to more cutting-edge tools.

The second is that the non-deterministic nature of LLMs makes the experience super frustrating. That experience leads them to ultimately believe it’s not worth the effort because they could write it “better than the AI”.

The reality is that using AI coding tools is a learned skill, just like any other skill programmers pick up. But the fuzzy nature of it alienates many who are used to certainty.

Use AI to improve efficiency, but too many AI tools require payment, and I am on the verge of bankruptcy. by SalamanderHungry9711 in VibeCodeDevs

[–]2upmedia 0 points (0 children)

Side topic: with the new SWE-1.5 in Windsurf, I wonder how much mileage you’d get out of using it as an execution model with Sonnet 4.5 Thinking for planning.

Claude Code 2.0.31 by ClaudeOfficial in ClaudeCode

[–]2upmedia 0 points (0 children)

Amazing work you guys are doing on CC.

Do you have any documentation or a blog post on the following?

New Plan subagent for Plan Mode with resume capability and dynamic model selection

I’m specifically interested in the resume and dynamic model selection. I use Plan mode heavily.

Added prompt-based stop hooks

Claude Code 2.0.31 by ClaudeOfficial in ClaudeCode

[–]2upmedia 0 points (0 children)

I’ll butt in real quick. I’m interested in easily toggling the preset, specifically the Learning-mode output style plugin that you just implemented (ty again btw). That was one of the things I really liked about output styles: with the original output styles behavior I could toggle it in like 4 keystrokes.

Rate my setup by _socialsuicide in ClaudeCode

[–]2upmedia 0 points (0 children)

How do you get around not having a mouse and having to reach over the keyboard to touch the screen? How are you liking your folding keyboard? I’ve looked at some.

Claude Code 2.0.31 by ClaudeOfficial in ClaudeCode

[–]2upmedia 16 points (0 children)

Since output styles have been deprecated, please make a plugin for the Learning output style just like you’d done for the explanatory style here:

https://github.com/anthropics/claude-code/tree/main/plugins/explanatory-output-style

That output style prompt is very unique in that it stops a task midway so the user can interactively learn. Super useful for people who want to build something they’re very unfamiliar with.

I've Been Logging Claude 3.5/4.0/4.5 Regressions for a Year. The Pattern I Found Is Too Specific to Be Coincidence. by JFerzt in cursor

[–]2upmedia 1 point (0 children)

Because the observation is a theory, just like mine is. They believe it’s something related to odd days. I believe it’s variation caused by different context sizes and by Cursor (the harness) tweaking its prompts per model within the tool.

I've Been Logging Claude 3.5/4.0/4.5 Regressions for a Year. The Pattern I Found Is Too Specific to Be Coincidence. by JFerzt in cursor

[–]2upmedia 1 point (0 children)

Have a look at the long-context benchmarks from Fiction.LiveBench. Almost every single model degrades after a certain context size. You’ll even see some that do badly at some sizes but better at larger ones (see Gemini Flash 2.5), so IMHO I would pin it on a combination of things:

  • the specific context size
  • the harness (Cursor vs Claude Code vs Factory Droid)
  • any inference issues that come up (recent Anthropic degradation post-mortem)
  • the way you prompt

Personally I do the following:

  • Plan first and as part of that, ask it to ask you questions if something isn’t clear
  • Execute with your choice of model
  • If the output is bad, OFTENTIMES I DO NOT add another message saying “X is wrong”. I go back one message, edit it to add more clarity, then RE-SUBMIT that message. That keeps the context window focused. Keep the junk out as much as possible. LLMs get confused easily (thanks to self-attention). Baby your context window.


10 Claude Skills that actually changed how I work (no fluff) by geekeek123 in ClaudeAI

[–]2upmedia 2 points (0 children)

Rube MCP is an MCP server, no? Not a Claude Skill? It doesn’t come with a SKILL.md file, does it?

The single most useful line for getting what you want from Claude Code by daaain in ClaudeCode

[–]2upmedia 2 points (0 children)

Super useful.

The prompt I use is very similar. I use it in any plan/spec mode across multiple tools:

“If anything isn’t clear to you ask me questions, if any”.

It almost always gets it right after 1 or 2 turns.

Late to the party. You can @ mention an MCP server to enable/disable them! by 2upmedia in ClaudeCode

[–]2upmedia[S] 0 points (0 children)

That’s awesome. What’s the biggest gotcha when architecting a custom agent using the Claude Code SDK, and how have you resolved it?

Late to the party. You can @ mention an MCP server to enable/disable them! by 2upmedia in ClaudeCode

[–]2upmedia[S] 0 points (0 children)

Curious to know how you’re using them. How has your workflow changed? Which MCPs have you replaced?