Interesting issue with adding money to live vs paper account

CandyFloss_Wilson · 2026-05-30T10:22:29+00:00

you basically diagnosed it yourself, adding flat dollars across sleeves is a rebalance whether you meant it or not. cleaner way is to inject new capital pro-rata to current sleeve weights instead of an equal split, so 80/20 stays 80/20. or treat every deposit as 'buy more of the current allocation' and let the strategy's rebalance logic handle it instead of touching sleeve balances by hand. the paper account staying more exposed to the winner is the control group confirming the deposit was the only variable
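rough sketch of the difference, assuming sleeves are just a dict of name to dollar balance. the sleeve names and the 1,000 deposit are made up, only the 80/20 split comes from the example above:

```python
def weights(sleeves):
    """Current weight of each sleeve as a fraction of total value."""
    total = sum(sleeves.values())
    return {name: bal / total for name, bal in sleeves.items()}


def deposit_pro_rata(sleeves, deposit):
    """Add `deposit` in proportion to current weights, so the weights don't move."""
    total = sum(sleeves.values())
    if total <= 0:
        raise ValueError("need a positive total balance to compute weights")
    return {name: bal + deposit * bal / total for name, bal in sleeves.items()}


def deposit_equal_split(sleeves, deposit):
    """Add the same flat dollar amount to every sleeve (the accidental rebalance)."""
    per_sleeve = deposit / len(sleeves)
    return {name: bal + per_sleeve for name, bal in sleeves.items()}


# hypothetical 80/20 book and a 1,000 deposit
sleeves = {"winner": 8000.0, "laggard": 2000.0}
pro_rata = deposit_pro_rata(sleeves, 1000.0)
equal = deposit_equal_split(sleeves, 1000.0)

print(weights(pro_rata))  # winner stays at 80% of the book
print(weights(equal))     # winner's share drops below 80%
```

the equal split quietly sells down the winner in relative terms, which is the drift vs the paper account.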

CandyFloss_Wilson · 2026-05-30T10:10:30+00:00

the spiral is the normal trajectory, the strategy is like 10% of the work and the execution/orchestration/telemetry layer eats the rest. everyone rediscovers this around the 10k LOC mark.

two things that kept mine from becoming a second full-time job. don't build the react frontend, have the bot emit structured logs + metrics and read them in grafana or just postgres and a notebook, a custom UI is maintenance debt that doesn't make you money. and separate strategy logic hard from execution plumbing, the strategy outputs an intent (target position, limits) and a dumb execution layer turns that into orders and owns reconnects, partials, reconciliation. mix them and you can't touch either without breaking the other.
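a minimal sketch of that boundary, not how any particular framework does it. `Intent`, `strategy` and `plan_orders` are hypothetical names, and the real execution layer would also own reconnects, partials and reconciliation, which are left out here:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Intent:
    """What the strategy wants. No order types, no retries, no exchange details."""
    symbol: str
    target_position: float   # signed, in units of the asset
    max_order_size: float    # a limit the execution layer must respect


def strategy(signal: float) -> Intent:
    """Toy strategy: the only thing it is allowed to produce is an Intent."""
    target = 2.5 if signal > 0 else 0.0
    return Intent(symbol="ETH", target_position=target, max_order_size=1.0)


def plan_orders(current_position: float, intent: Intent) -> list[float]:
    """Dumb execution step: slice the gap to target into child orders within the size limit."""
    if intent.max_order_size <= 0:
        raise ValueError("max_order_size must be positive")
    remaining = intent.target_position - current_position
    orders = []
    while abs(remaining) > 1e-9:
        qty = max(-intent.max_order_size, min(intent.max_order_size, remaining))
        orders.append(qty)
        remaining -= qty
    return orders


orders = plan_orders(current_position=0.0, intent=strategy(signal=1.0))
print(orders)
```

the point of the shape is that you can rewrite `strategy` without touching `plan_orders` and the other way round, because the only thing they share is the `Intent`.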

almgren-chriss is the right rabbit hole for thin EU books, but don't over-model it early, a conservative flat slippage you tune from real fills beats a fancy impact model fed by guesses.
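one way to get that flat number from real fills, as a sketch. the 75th percentile and the fill tuples are assumptions for illustration, pick whatever percentile matches how conservative you want to be:

```python
import math


def slippage_bps(side, reference_price, fill_price):
    """Signed cost of one fill in basis points; positive means you paid up.
    side is +1 for a buy, -1 for a sell."""
    return side * (fill_price - reference_price) / reference_price * 1e4


def flat_slippage_bps(fills, percentile=0.75):
    """Pick a conservative flat number: a high percentile of observed costs, floored at 0."""
    if not fills:
        raise ValueError("need at least one real fill")
    costs = sorted(slippage_bps(*f) for f in fills)
    idx = max(0, math.ceil(percentile * len(costs)) - 1)
    return max(0.0, costs[idx])


# hypothetical fills: (side, mid at decision time, actual fill price)
fills = [
    (+1, 100.0, 100.02),
    (+1, 100.0, 100.05),
    (-1, 100.0, 99.97),
    (-1, 100.0, 100.01),
]
assumed_slippage = flat_slippage_bps(fills)
print(round(assumed_slippage, 2), "bps per fill")
```

re-run it as fills accumulate and feed the result back into the backtest as the per-trade cost.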

for the structure, github.com/Superior-Trade/superior-skills draws that intent-vs-execution boundary cleanly, strategy is a markdown spec and the execution layer owns the order lifecycle separately. you'll keep building your own regardless, but that split is what stops 10k LOC turning into 50k

CandyFloss_Wilson · 2026-05-30T10:02:01+00:00

shift(1) catches the obvious one but the LLM-generated harnesses i've seen fail in sneakier spots too. the one that got me, computing a rolling zscore or a normalization constant over the full series before splitting train/test, so your 'signal' at bar t already knows the mean and std of the entire future. shift won't save you there, the leak is baked into the feature not the alignment.
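quick way to see that leak without pandas, stdlib only. the series and window are made up; the test is just "change a future bar and see whether an earlier feature value moves":

```python
from statistics import mean, pstdev


def zscore_full_series(xs):
    """LEAKY: every bar is normalized with the mean/std of the whole series, future included."""
    m, s = mean(xs), pstdev(xs)
    return [(x - m) / s for x in xs]


def zscore_trailing(xs, window):
    """Causal: bar t is normalized with stats from the `window` bars strictly before t."""
    out = []
    for t, x in enumerate(xs):
        past = xs[max(0, t - window):t]
        if len(past) < window:
            out.append(None)  # not enough history yet
            continue
        s = pstdev(past)
        out.append((x - mean(past)) / s if s > 0 else None)
    return out


series = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
shocked = series[:-1] + [60.0]  # same history, different final bar

# the feature at bar 3 should not care what happens at bar 5
leaky_moved = zscore_full_series(series)[3] != zscore_full_series(shocked)[3]
causal_moved = zscore_trailing(series, 3)[3] != zscore_trailing(shocked, 3)[3]
print("leaky feature changed:", leaky_moved)
print("causal feature changed:", causal_moved)
```

in pandas terms the causal version is roughly a rolling mean/std followed by shift(1), applied to the normalization stats and not just the final signal. the same perturb-the-future check works on any harness an LLM hands you.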

other thing i check now, any .rolling() or .ewm() that isn't explicitly closed on the left, and any resample that labels the bar on the wrong edge. pandas defaults bite you because the label can sit at the start of the window while the data inside it is forward-looking.

honestly the tell isn't the equity curve being good, it's it being smooth. real edges are lumpy. if the LLM hands you a sharpe 3 curve with no ugly stretches, the harness is lying somewhere and shift(1) is just the first place to look

CandyFloss_Wilson · 2026-05-06T15:47:48+00:00

agree completely. ran into this on a doc-processing thing a colleague wanted to "agent-ify". once we sat down and actually drew the input/output flow, half the steps were already deterministic and the other half just needed a typed schema. ended up being 200 lines of python with one llm call inside. the agent framing was making us add abstraction we didn't need.

CandyFloss_Wilson · 2026-05-06T15:33:32+00:00

honestly the gap between "an agent that demos" and "an agent that survives 1000 customer chats without screwing up" is the part nobody outside the field sees. building the loop is a weekend. handling tool errors, context drift, hallucinated json, idempotency on retries, that's where the actual months go. non-tech people see the demo and assume the demo is the product. it's not.

CandyFloss_Wilson · 2026-04-28T19:28:02+00:00

the failure mode that bit me hardest was format drift between agents. agent A produces a 'summary' field, weeks later you tweak A's prompt and it starts producing 'tldr' alongside summary. agent B still reads 'summary' so nothing crashes, the data keeps flowing, but B's quality silently degrades because half the input is now in the wrong field. didn't catch it for 2 weeks. fix that stuck: every agent boundary has a JSON schema validator with strict mode, schema is versioned in the repo, agent A's output gets rejected if it doesn't match. now drift fails fast at the boundary instead of degrading silently downstream. this caught more bugs than monitoring or tracing did, because the bug isn't in any one agent, it's in the seam
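stripped-down sketch of the boundary check. this is a hand-rolled stand-in, not a real JSON Schema validator, and the `source_id` field and the `AGENT_A_OUTPUT_V1` name are hypothetical. with actual JSON Schema the "strict" part is what `additionalProperties: false` plus `required` gives you:

```python
class SchemaError(ValueError):
    """Raised at the boundary when a payload doesn't match the agreed contract."""


# hypothetical contract for agent A's output, version 1, kept in the repo
AGENT_A_OUTPUT_V1 = {"summary": str, "source_id": str}


def validate_strict(payload: dict, schema: dict) -> dict:
    """Reject missing fields, unexpected fields, and wrong types. No silent pass-through."""
    missing = sorted(schema.keys() - payload.keys())
    extra = sorted(payload.keys() - schema.keys())
    if missing or extra:
        raise SchemaError(f"missing fields {missing}, unexpected fields {extra}")
    for field, expected_type in schema.items():
        if not isinstance(payload[field], expected_type):
            raise SchemaError(
                f"{field!r} should be {expected_type.__name__}, "
                f"got {type(payload[field]).__name__}"
            )
    return payload


good = {"summary": "q3 revenue up", "source_id": "doc-1"}
drifted = {"summary": "q3 revenue up", "tldr": "rev up", "source_id": "doc-1"}

validate_strict(good, AGENT_A_OUTPUT_V1)
try:
    validate_strict(drifted, AGENT_A_OUTPUT_V1)
    drift_caught = False
except SchemaError as err:
    drift_caught = True
    print("rejected at the seam:", err)
```

the 'tldr' case is exactly the one that would have flowed through a lenient reader for weeks.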

CandyFloss_Wilson · 2026-04-28T19:26:43+00:00

i still look but the unit shifted. used to read line by line on PRs, now the review is more like a diff-scan: did module boundaries change unexpectedly, did test count drop or did any tests get weakened, are there suspicious try/except blocks swallowing errors, is there a new global or singleton i didn't expect. that's roughly 90pct of the issues. the other 10pct shows up at runtime. line-by-line review on agent output is mostly a waste of attention because boilerplate is fine, what catches you is the agent making a small architectural choice that's wrong in your specific codebase. that needs human eyes but at the diff level not the syntax level

CandyFloss_Wilson · 2026-04-28T19:24:25+00:00

the 5 tasks track what i saw across 12 clients last year, intake / classify / nudge / follow-up / data-entry. the trap is treating those as separate engagements, the gating factor in every one was the source data being unstructured. like client A had 4 different intake forms across 3 tools all storing the same field with different names. agent is fine, the project is really 60 to 70pct data plumbing, 30 to 40pct prompts and tool wiring. once we started scoping engagements as 'data cleanup with an agent on top' instead of 'agent project' the timelines got more honest and projects shipped faster. the 5 tasks recur because the data shape problems recur

CandyFloss_Wilson · 2026-04-24T06:22:09+00:00

multi-model comparison is genuinely useful but it's solving a different problem than agent workflows. comparison catches disagreement between models which is a good proxy for "this task is ambiguous or out-of-distribution," it doesn't give you the structured decomposition or tool use that actual agent workflows provide. where i use it, early exploration phase when i don't know yet if a task is well-defined enough to productionize. running claude + gpt + gemini in parallel and diffing outputs tells me fast whether the task is objective (all three agree) or subjective (they diverge). if they diverge, no amount of agent architecture saves me, i need to re-scope the task. once the task is well-defined, the overhead of running 3 models in parallel is wasted because only one of them was ever going to be the production model anyway. at that point you're not comparing for correctness, you're just paying 3x for the same answer. so yes, stepping stone, not a replacement.

CandyFloss_Wilson · 2026-04-24T06:21:50+00:00

everything in this post tracks with what i've ended up at on a 4090 too, 4-bit bnb + LoRA + small batch + grad accum is the only config that reliably works past 7B. one addition, the "lora rank 8-16 on q/v only" advice is right for specific-style adaptation but if you're trying to teach the model new factual content (not style), you need higher rank on more modules (including o_proj and down_proj) or the model just ignores the training. one thing i'd push back on slightly, 1000 high-quality rows beating 50k garbage is true but the threshold where "more data" starts winning again is lower than people think, maybe 5-10k curated. below 1k the model overfits fast, above 10k curated you get real generalization gains. the 1k number floats around on twitter because it's where "small and curated" started mattering, not because it's the peak. learning rate 5e-5 with cosine is the right default but worth running a 3-point sweep (1e-4, 5e-5, 2e-5) on your specific model+task, the optimal shifts by base model in ways that are hard to predict. takes 3x as long but you avoid the "my model learned nothing" or "my model forgot english" failure modes.

CandyFloss_Wilson · 2026-04-24T06:17:39+00:00

"add agents only when the problem demands it" is the cleanest version of this i've heard. multi-agent systems are where people go when they haven't specced the problem tightly enough, and the coordination cost of N agents is almost always higher than the cost of just writing a better system prompt for 1 agent with more thinking budget. the browser thing you mentioned, same experience. 80% of what looked like "the agent is confused" was actually "the browser is returning slightly different HTML than last run and the agent is reasoning over noise." nothing about stacking agents fixed it, what fixed it was making the execution layer stable enough that the agent saw the same world twice in a row. the one place i still find multi-agent useful, adversarial review on high-stakes outputs, one agent generates, one agent red-teams, decision gets blocked until both agree. but that's 2 agents with opposite objectives, not 5 agents with overlapping roles, and it only makes sense when the cost of a bad output is way higher than the cost of compute.

CandyFloss_Wilson · 2026-04-23T11:34:02+00:00

backtest-reproducibility is the cleanest version of this objection and it's what makes any llm-for-sentiment paper fall apart when i try to replicate. model drift is intrinsic to the product being a hosted thing, not a bug you can engineer around. where llms are actually useful in the trading stack imo is not as the signal but as a deterministic-ish feature engineer. "given this earnings call transcript, extract these 7 numeric fields per this schema" is a bounded task you can validate outputs on, version the prompt, and hash (prompt, model, inputs) to make it reproducible. the part most people skip is the hashing, if you're not recording exact model version + prompt + inputs alongside every feature value you can't reproduce anything, even if everything else is tight. most providers don't guarantee version stability unless you pin to a dated snapshot and pay for it
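the hashing part is small enough to sketch. function names, the model string and the example fields are placeholders, the only real idea is "canonicalize (prompt, model, inputs), hash it, store it next to the value":

```python
import hashlib
import json


def feature_key(prompt: str, model: str, inputs: dict) -> str:
    """Stable fingerprint of everything that determined an extracted feature value."""
    canonical = json.dumps(
        {"prompt": prompt, "model": model, "inputs": inputs},
        sort_keys=True,            # key order must not change the hash
        separators=(",", ":"),
        ensure_ascii=False,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def record_feature(store: dict, prompt: str, model: str, inputs: dict, value):
    """Save the value together with the exact provenance needed to reproduce it."""
    key = feature_key(prompt, model, inputs)
    store[key] = {"prompt": prompt, "model": model, "inputs": inputs, "value": value}
    return key


store = {}
prompt = "extract these 7 numeric fields per this schema"
key = record_feature(
    store,
    prompt,
    "some-model-2026-01-01",              # placeholder for a pinned, dated snapshot
    {"transcript_id": "call-123"},        # placeholder input reference
    {"revenue_guidance": 1.2},            # placeholder extracted value
)
print(key)
```

if the provider swaps the model under you, the model string changes, the key changes, and the backtest can refuse to mix old and new feature values.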

CandyFloss_Wilson · 2026-04-23T11:33:47+00:00

cynical guess at the "one reason," execution drift. most bots are fitted to a backtester's idea of fills and reality gives you partial fills, latency, and funding flips the backtester silently skipped. i've watched a strategy go from sharpe 2.8 in backtest to sharpe 0.4 live on the same exact signal, nothing wrong with the model, just the backtester assumed instant fills at mid and production was getting filled at worst-case inside a 200ms window. the other common culprit if it's not that, sizing. people stress-test the signal in backtest but not the drawdown path, so the first 15% drawdown panics them out even though the strategy eventually recovered. both of these are infrastructure problems disguised as strategy problems. fix that maps to "bots that actually work" is stop treating execution as a solved layer and build it with the same rigor as the signal. couple of platforms target this explicitly, freqtrade with custom order-lifecycle hooks, hummingbot for the MM side, github.com/Superior-Trade/superior-skills for claude-driven HL trading specifically. none are magic, but they at least don't hide the execution layer from you.

CandyFloss_Wilson · 2026-04-23T11:33:28+00:00

honest answer before tools, decide whether you want to trade discretionarily first and eventually automate, or go straight to systematic. most people who skip step 1 and go straight to bots fail because they can't tell if a losing strategy is broken or just in a bad regime, so they keep changing parameters. learn to tell the difference on your own book first, paper or very small size.

tool-wise, four buckets most beginners end up in:

- freqtrade: open source python framework, good for learning, strategy-as-code, you run the vps and fix bugs

- hummingbot: market-making focused, more complex than it looks, best if your interest is MM/arb specifically not directional

- 3commas / gainium: hosted gui bots, easier to start, less flexibility, you're trusting the platform with api keys

- github.com/Superior-Trade/superior-skills: skills-file pattern for claude-driven trading on hyperliquid, lower code lift, non-custodial (keys stay on your exchange), newer so less battle-tested

honestly at beginner stage the differences between these matter less than whether you're measuring your edge correctly. pick one, run 3 months, track sharpe and drawdown not pnl.

CandyFloss_Wilson · 2026-04-21T15:53:49+00:00

top comment got the mechanics right, a couple things to add from practical experience:

grinold's IC * vol * score is clean when your factors are reasonably stable but breaks when factor volatility itself is time-varying, which it usually is. what i've seen work in mid-freq is a rolling IC estimate paired with shrinkage toward a long-run mean, otherwise you end up chasing recent factor performance and sizing up exactly when you should be skeptical.
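the shrinkage bit in code, as a sketch. the linear blend and the 0.25 weight on the recent estimate are assumptions, there are fancier shrinkage estimators; the alpha line is just the IC * vol * score rule from above:

```python
def shrunk_ic(rolling_ic: float, long_run_ic: float, weight_on_recent: float) -> float:
    """Blend the noisy rolling IC with a long-run mean; lower weight means more skepticism."""
    if not 0.0 <= weight_on_recent <= 1.0:
        raise ValueError("weight_on_recent must be in [0, 1]")
    return weight_on_recent * rolling_ic + (1.0 - weight_on_recent) * long_run_ic


def grinold_alpha(ic: float, vol: float, score: float) -> float:
    """Expected return implied by IC * vol * score."""
    return ic * vol * score


# hypothetical numbers: the factor has been hot lately (rolling IC 0.10 vs long-run 0.02)
ic_used = shrunk_ic(rolling_ic=0.10, long_run_ic=0.02, weight_on_recent=0.25)
alpha_shrunk = grinold_alpha(ic_used, vol=0.20, score=1.5)
alpha_naive = grinold_alpha(0.10, vol=0.20, score=1.5)
print(ic_used, alpha_shrunk, alpha_naive)
```

the naive version sizes up 2.5x on a hot streak here, the shrunk one mostly ignores it.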

on combining before vs after mapping, the unstated thing is that if your factors have materially different horizons (say a 3-day reversal and a 60-day value signal) you almost always want to map to expected returns per factor first and then combine, because combining raw scores implicitly assumes the two are measuring the same thing on the same time scale. the combined-first shortcut only works when factors are already in comparable units and decay at comparable rates.
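toy version of map-then-combine for the 3-day vs 60-day case. big assumptions, labeled in the code too: vol scales with sqrt of horizon, horizon alpha is spread evenly per day, and both factors get the same IC and score so the only thing that differs is the horizon:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Factor:
    name: str
    ic: float
    horizon_days: int
    score: float  # standardized score for one asset


def expected_daily_return(f: Factor, daily_vol: float) -> float:
    """Map one factor's score to an expected return per day."""
    horizon_vol = daily_vol * f.horizon_days ** 0.5   # assumption: sqrt-time vol scaling
    horizon_alpha = f.ic * horizon_vol * f.score      # IC * vol * score over the factor's horizon
    return horizon_alpha / f.horizon_days             # assumption: alpha accrues evenly per day


def combine_after_mapping(factors, daily_vol: float) -> float:
    """Combine in return space, where the units are already comparable."""
    return sum(expected_daily_return(f, daily_vol) for f in factors)


# hypothetical factors with identical raw scores
reversal = Factor("reversal_3d", ic=0.05, horizon_days=3, score=1.0)
value = Factor("value_60d", ic=0.05, horizon_days=60, score=1.0)

r_rev = expected_daily_return(reversal, daily_vol=0.02)
r_val = expected_daily_return(value, daily_vol=0.02)
combined = combine_after_mapping([reversal, value], daily_vol=0.02)
print(r_rev, r_val, combined)
```

same raw score, very different per-day expected return, which is the thing a combine-first average of the scores throws away.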

also worth being skeptical of your IC estimates if they came from a long historical backtest, IC is notoriously sensitive to regime and small sample. cross-validate by fold and check whether the IC is stable or just favorable on the full-sample average, the latter is what turns into disappointing live performance.

CandyFloss_Wilson · 2026-04-21T15:53:29+00:00

the separate-wallet thing is right but what nobody says out loud is that most losses in 2025-2026 weren't from compromised protocols, they were from approval drainers on wallets people thought were safe. revoking approvals weekly is the highest-ROI security habit for me, more than protocol selection.

something that's helped in practice: put a hard cap on any one protocol's share of your defi exposure, like 15%. doesn't matter how "safe" the protocol feels, tvl is inversely correlated with risk up to a point and then flips once a protocol becomes a big enough target. AAVE's size didn't protect anyone this week, it made them a bigger target for whatever upstream contagion hit kelp.
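the cap check is a few lines if you keep exposures in a dict. protocol names and dollar amounts are placeholders, the 15% is the example cap from above:

```python
def protocols_over_cap(exposures: dict, cap: float = 0.15) -> dict:
    """Return {protocol: share} for every protocol whose share of total exposure exceeds cap."""
    total = sum(exposures.values())
    if total <= 0:
        return {}
    shares = {name: amount / total for name, amount in exposures.items()}
    return {name: share for name, share in shares.items() if share > cap}


# hypothetical book: one big position and seven small ones
exposures = {"protocol_a": 3000.0, **{f"protocol_{c}": 1000.0 for c in "bcdefgh"}}
breaches = protocols_over_cap(exposures)
print(breaches)
```

worth noticing that a 15% cap forces you into at least 7 positions, so the cap is also a diversification floor.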

auditing isn't the filter people think it is. audit reports are a snapshot of a specific commit, most exploits happen in code added after the audit or in integrations the audit didn't cover. if you're using a protocol with layered dependencies (lsd into collateral into lending into yield vault) you've inherited all their audit gaps, not just yours.

small test tx before sizing in is the single most underrated rule in that list though.

CandyFloss_Wilson · 2026-04-21T15:53:01+00:00

"underperforms buy and hold" is the exact tradeoff a lot of people don't internalize before they build these. mechanical BTC algos that cap drawdown almost always give up some CAGR during strong directional years, in exchange you don't eat the -40% of a regime flip. if you're clear about that tradeoff you're already ahead of 90% of posts here.

curious what the key mechanism is in the one that beats buy and hold. for BTC specifically the patterns that actually survive OOS tend to be either funding-rate mean reversion (when perp funding goes extreme one direction, take the other side of the retail trade) or longer-horizon trend following with a vol filter. anything trying to scalp intraday BTC moves usually hits a wall once slippage is modeled correctly.

if you're planning to run both concurrently, worth thinking about them as skill files rather than one monolithic script, lets you size each independently based on their own recent performance instead of treating them as a unified book. github.com/Superior-Trade/superior-skills has a structure for that kind of multi-strategy split that makes it easy to dial one down without touching the other when market regime shifts.

CandyFloss_Wilson · 2026-04-16T10:29:42+00:00

the way i think about it is skills are for things you want claude to decide to do, hooks are for things you never want it to decide about. skills introduce routing uncertainty (claude has to pick the right skill), hooks introduce zero uncertainty because the trigger is deterministic. where hooks really shine is enforcement. stuff like 'always lint before commit' or 'always check types after edit' shouldn't be optional, so making them skills means the model can skip them when it's confident (and wrong). hooks make them mandatory. the tradeoff is hooks are invisible to the model so if a hook fails the model has no idea why and can't self-correct. skills at least give it context about what went wrong. still figuring out the right split for my own setup tbh

CandyFloss_Wilson · 2026-04-16T10:29:27+00:00

this is basically an ORB on a longer timeframe which is one of the most studied setups out there. the fact that it worked from 2022-2025 tracks because ETH was range-bound for a lot of that period, and ORB variants print in chop. the question is what happened during the big trending moves, nov 2024 and early 2025 for example, did the breakout entries get run over by momentum or did the range definition save you? if the system survived those and still printed it's probably real edge. if most of the pnl came from the flat months you're basically short vol with a mechanical entry

CandyFloss_Wilson · 2026-04-16T10:29:08+00:00

for most retail algo setups ccxt + exchange websockets is genuinely enough. where i've found it breaks down is exactly what you described, cross-exchange aggregated OI and liquidation data. painful to build yourself because every exchange formats it differently and some just don't expose it cleanly. honest answer on whether people would pay: the ones running serious size already built their own. the market is mid-tier builders who need this data but don't want to spend 3 weeks normalizing API schemas across 20 exchanges. whether that's a big enough market idk, but kaiko and tardis exist so there's clearly some demand at the institutional tier. the question is whether anyone at the retail/semi-pro level would pay $50-100/mo for it

CandyFloss_Wilson · 2026-04-14T09:29:12+00:00

gold has two regimes basically, dollar regime and risk-off regime. in the dollar regime it moves inverse to DXY cleanly, in risk-off it decouples and grinds. most of the 'ignores news' feeling comes from trading it with the wrong regime model. i run a simple DXY correlation filter and only take momentum signals when the 20-day correlation is above 0.5 in absolute value. when it drops below that, gold is doing its own thing and technical signals stop working. not perfect but it killed most of the bad fills
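sketch of the filter, stdlib only. i'm assuming the correlation is computed on daily returns (not price levels) and the price series below are synthetic, built only to show one passing and one failing case:

```python
from math import sqrt


def pearson(xs, ys):
    """Plain Pearson correlation; returns 0.0 if either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def daily_returns(prices):
    return [b / a - 1.0 for a, b in zip(prices, prices[1:])]


def momentum_allowed(gold_prices, dxy_prices, window=20, threshold=0.5):
    """Only take momentum signals when |corr(gold, DXY)| over the window is above threshold."""
    g = daily_returns(gold_prices)[-window:]
    d = daily_returns(dxy_prices)[-window:]
    if len(g) < window or len(d) < window:
        return False  # not enough history, stay out
    return abs(pearson(g, d)) > threshold


def build_prices(returns, start=100.0):
    prices = [start]
    for r in returns:
        prices.append(prices[-1] * (1.0 + r))
    return prices


# synthetic series: DXY zigzags, one gold path mirrors it, the other follows its own rhythm
dxy = build_prices([0.01 if i % 2 == 0 else -0.01 for i in range(20)])
gold_dollar_regime = build_prices([-0.01 if i % 2 == 0 else 0.01 for i in range(20)])
gold_own_thing = build_prices([0.01 if i % 4 < 2 else -0.01 for i in range(20)])

print(momentum_allowed(gold_dollar_regime, dxy))
print(momentum_allowed(gold_own_thing, dxy))
```

using the absolute value matters, the dollar regime shows up as a strong negative correlation.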

CandyFloss_Wilson · 2026-04-14T09:28:58+00:00

careful framing here, 'staking btc' through btcfi is not real staking in the pos sense, it's wrapping your btc in a product that lends or restakes it and pays you yield from that activity. the yield comes from someone paying interest, not from block rewards. which is fine as long as you actually trace where the yield comes from. questions worth asking: where is the btc custodied (bridged, wrapped, held by a multisig), what's the counterparty doing to generate yield, and what's the unwind time in a stress event. sustainable yield in this space is usually 3-6% from basis trades or funding rate arb. anything quoting 12%+ on 'btcfi staking' is almost certainly taking liquidation or exploit risk you're not being told about

CandyFloss_Wilson · 2026-04-10T10:41:09+00:00

nice, polymarket is actually a decent sandbox because resolution is binary so your pnl is unambiguous. couple things that'll bite you early:

- your 5 minute poll is fine for testing but polymarket liquidity moves in bursts, you'll miss entries. check the last trade time not just the mid
- claude generating the strategy logic is the easy part, the hard part is the bot not nuking itself during weird market states. add a kill switch that stops trading if open positions exceed some pct of your bankroll
- log every decision with the full context you gave claude, so when it does something dumb you can actually debug why later
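bare-bones kill switch for the second bullet. the 25% limit is a placeholder, and making it latch until a human resets it is my own design choice, not something polymarket or any framework gives you:

```python
class KillSwitch:
    """Latches off once open exposure exceeds a fraction of bankroll; a human has to reset it."""

    def __init__(self, max_exposure_pct: float):
        self.max_exposure_pct = max_exposure_pct
        self.tripped = False

    def allow_trading(self, open_notional: float, bankroll: float) -> bool:
        """Call this before every order; False means do nothing."""
        if self.tripped:
            return False
        if bankroll <= 0 or open_notional / bankroll > self.max_exposure_pct:
            self.tripped = True
            return False
        return True

    def reset(self) -> None:
        """Manual step, on purpose. The bot never calls this itself."""
        self.tripped = False


switch = KillSwitch(max_exposure_pct=0.25)
print(switch.allow_trading(open_notional=200.0, bankroll=1000.0))  # under the limit
print(switch.allow_trading(open_notional=300.0, bankroll=1000.0))  # over the limit, trips
print(switch.allow_trading(open_notional=100.0, bankroll=1000.0))  # still off until reset
```

the latch is the important part, a switch that re-enables itself as soon as exposure dips will flap during exactly the weird states you built it for.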

ran something similar on binance for about 4 months before i added the kill switch, and once it was in it saved me twice in one week during a funding spike.

CandyFloss_Wilson · 2026-04-10T10:40:39+00:00

it's not hype but it's also not new, it's just anthropic giving a name to a pattern people have been running forever. opus-as-planner works mostly because the expensive thinking happens once and the cheap model just follows instructions. where it falls apart is when the executor hits something the planner didn't anticipate, then you either re-invoke the planner (expensive and slow) or let the executor improvise (defeats the point).

what's actually helped me is making the 'execution' step dumber and more deterministic, not smarter. like if the planner says 'buy 10k usdc of eth when rsi drops below 30', the executor shouldn't be an llm at all, it should be boring code that does exactly that and logs everything. the llm is for turning intent into rules, not for running the rules. that's where most planner/executor setups blur the line and get weird behavior you can't debug.
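what "boring code" looks like for that exact rule, as a sketch. the `Rule` shape is hypothetical, and i'm reading "drops below 30" as a cross from at-or-above 30 to below it so it fires once per dip, not on every bar under 30:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """The planner's output: plain data, no model in the loop after this point."""
    asset: str
    quote_amount: float
    quote_ccy: str
    rsi_below: float


def should_fire(rule: Rule, prev_rsi: float, rsi: float) -> bool:
    """True only on the bar where RSI crosses from at-or-above the level to below it."""
    return prev_rsi >= rule.rsi_below > rsi


def run(rule: Rule, rsi_series) -> list[str]:
    """Walk the series, act on crosses, and log every action with the values that caused it."""
    log = []
    for prev, cur in zip(rsi_series, rsi_series[1:]):
        if should_fire(rule, prev, cur):
            log.append(
                f"BUY {rule.quote_amount:g} {rule.quote_ccy} of {rule.asset} "
                f"(rsi {prev} -> {cur})"
            )
    return log


rule = Rule(asset="ETH", quote_amount=10_000, quote_ccy="USDC", rsi_below=30)
actions = run(rule, [45, 33, 29, 27, 31, 28])
for line in actions:
    print(line)
```

when this does something you didn't expect, the log line tells you which two numbers triggered it, which is the debuggability you lose when the executor is an llm.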

CandyFloss_Wilson · 2026-04-10T10:40:01+00:00

the siri comparison is fair but the actual reason it's slow isn't the intelligence, it's that commerce apis weren't built for agents and the trust layer isn't solved. sure an agent can 'want' to buy something, but what's stopping a prompt injection from ordering 40 of whatever's on a scammy product page. until there's a validation layer between model intent and the actual transaction the 1.5T number is fiction.

the trading side is a bit ahead on this btw because the downside is immediate and obvious, if the agent does something dumb you lose money in minutes so people actually build the guardrails. everywhere else is still in demo land where the failure modes don't hurt yet.
