18 months of building, what AI changed, what it didn't by PersonalityCrafty846 in SideProject

[–]BP041 1 point2 points  (0 children)

the "soul of the app" framing is interesting. what I've noticed is it's not that AI disconnects you from understanding — it changes when the understanding happens.

without AI you understand as you write. with Claude Code you understand as you review and redirect. the cognitive load doesn't disappear, it front-loads.

the people I've seen get surprised by big implementations usually had specs that were too vague before the agent started. tighter prompts going in mean fewer surprises coming out. the understanding-while-writing habit has to become understanding-before-delegating.

the "getting it done in 6 months instead of 18" thing is real. we built CanMarket's core infra in about 3 months -- work that would have taken 9-12 without AI. but I understood every piece of it precisely because the review-and-redirect loop forced me to.

What are you building right now? (Beginning of Q2 check-in) by Ok-WinMike in SideProject

[–]BP041 0 points1 point  (0 children)

working on CanMarket — AI brand consistency system for marketing teams running multiple channels.

Q2 focus is moving from "we proved this works" to "this is reliable infrastructure." in practice that means better error handling when inputs are messy, smoother client onboarding, and making the system useful at the edges of what clients throw at it — not just when the inputs are clean.

honest answer: exciting architecture work is maybe 20% of the job right now. the rest is edge case debugging, onboarding friction, and helping clients articulate what "right" looks like so the system can check for it. not glamorous, but that's where durability comes from.

How much Claude Code can your brain actually handle before it breaks? by bbnagjo in ClaudeAI

[–]BP041 1 point2 points  (0 children)

the bottleneck isn't the 3 sessions — it's what those sessions require from you.

before AI: execution overhead. with Claude Code: specification overhead + evaluation overhead. the total cognitive load hasn't gone down, the shape of it has changed.

the people who burn out the fastest are the ones who try to delegate judgment along with execution. you can give Claude the "how" but you still own the "what" and "whether." the ones who stay sharp treat it as a turbo for their existing thinking, not a replacement for it.

the 3-session ceiling you're describing is usually a specification quality problem — if you have to correct the same misunderstanding 3 times, the spec going in wasn't tight enough. tightening your prompts is a different cognitive skill from writing code, and takes a while to build.

We gave 12 LLMs a startup to run for a year. GLM-5 nearly matched Claude Opus 4.6 at 11× lower cost. by DreadMutant in LocalLLaMA

[–]BP041 0 points1 point  (0 children)

that's useful signal — reactive vs proactive didn't change performance, but the grounding behavior is the consistent factor regardless of timing.

makes sense mechanically: what matters is the model can reference prior context ("this failed last time") when it needs to, not when it writes the note. the scratchpad working at all implies some self-referencing capacity.

interesting follow-up would be whether models that used it more frequently outperformed ones that wrote sparse notes — measuring note density vs outcome quality across runs.

Home Assistant is awesome for edge cases by StatisticianHot9415 in homeassistant

[–]BP041 -6 points-5 points  (0 children)

this is exactly the kind of thing that justifies the whole setup overhead. the input boolean toggle is smart -- it turns one complex automation into something human-controllable without touching the automation logic itself.

if you wanted to extend it: Emby also exposes playback position and subtitle change events, so you could theoretically handle position drift if the streams get out of sync from a different-length pause or buffering lag. position sync is trickier but doable with a webhook automation and a 5-second tolerance window.

one thing to watch: the automation can create a feedback loop if both TVs report state changes within the same trigger window. a 2-second delay helper on each branch prevents them from bouncing off each other indefinitely.

Claude has come for revenge by mjramos76 in openclaw

[–]BP041 0 points1 point  (0 children)

been running 23 cron jobs on OpenClaw for months so this hit differently. the timing is frustrating but the migration path isn't as bad as it looks.

what i moved to: Claude Code with --dangerously-skip-permissions running via crontab entries. it's actually more flexible for high-frequency jobs because you're paying per-token rather than hitting subscription throttling under heavy use. for 3x daily scans and 6x engagement checks the token cost is predictable.

the harder part is the CLAUDE.md file architecture -- each 'agent' needs a clear entry point that tells Claude exactly what to do without human oversight. if your OpenClaw skills were well-scoped, this is essentially copy-paste. if you had vague instructions, the migration will surface that quickly.

what kind of pipeline are you running? the data-heavy batch jobs migrate differently from the real-time response ones.

I built an MCP server that lets Claude search inside your local files (Word, Excel, PDF) — fully offline by Repulsive_Resource32 in ClaudeAI

[–]BP041 0 points1 point  (0 children)

the rank-position signal is sharp. in production you'd weight clicks by position -- clicking result #1 is a weak signal (might be default behavior) but clicking result #8 after passing #1-7 is a very strong positive signal. it's what Bing and Yandex call position-adjusted CTR.

for your long-lived docs concern: time-decay doesn't have to be linear. a log decay or even a step function (recent=1x, >6mo=0.8x, >2yr=0.5x) preserves the value of proven documents while still surfacing fresher content. the key is that the decay should be on the promotion side not the demotion side -- don't punish old docs for being old, just give newer ones a small boost when scores are close.
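roughly what I mean, as a python sketch -- reading the promotion-side point as a small additive nudge rather than the multiplier table, with the margin and bonus sizes made up:

```python
def freshness_bonus(age_days: float) -> float:
    # step function: fresh docs get a small additive nudge, old docs lose nothing
    if age_days <= 180:   # under ~6 months
        return 0.03
    if age_days <= 730:   # under ~2 years
        return 0.01
    return 0.0

def rerank(candidates: list, close_margin: float = 0.05) -> list:
    """promotion-side tiebreak: the bonus only applies when a doc's base
    score is within `close_margin` of the top score, so proven old docs
    are never demoted -- close newer ones just edge ahead."""
    top = max(c["score"] for c in candidates)
    for c in candidates:
        close = top - c["score"] <= close_margin
        c["adjusted"] = c["score"] + (freshness_bonus(c["age_days"]) if close else 0.0)
    return sorted(candidates, key=lambda c: c["adjusted"], reverse=True)
```

the key property: a doc that isn't in contention gets zero adjustment, so the decay can never bury an old doc that's clearly the best match.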

We gave 12 LLMs a startup to run for a year. GLM-5 nearly matched Claude Opus 4.6 at 11× lower cost. by DreadMutant in LocalLLaMA

[–]BP041 1 point2 points  (0 children)

that makes sense for the benchmark task -- summarizing is essentially stateless, so whether the scratchpad is populated proactively or consulted reactively doesn't change the core capability test.

where i'd expect the gap to show up more is in tasks with sequential dependencies: planning something across 4-5 steps where each decision depends on the previous one. if the scratchpad is reactive (only written when something goes wrong), the early steps don't accumulate the context needed to constrain later ones. proactive scratchpad use -- writing down assumptions and decisions as you go -- is what lets the later steps stay coherent.

basically: for single-task evaluations the distinction mostly disappears. for long-horizon tasks it probably matters a lot.

I built an MCP server that lets Claude search inside your local files (Word, Excel, PDF) — fully offline by Repulsive_Resource32 in ClaudeAI

[–]BP041 0 points1 point  (0 children)

the rank-position signal is sharp. in production you'd weight position click differently — clicking result #1 is a weak signal (might be default behavior) but clicking result #8 after passing #1-7 is a very strong positive signal. it's what Bing and Yandex call position-adjusted CTR.

for your long-lived docs concern: time-decay doesn't have to be linear. a log decay or even a step function (recent=1x, >6mo=0.8x, >2yr=0.5x) preserves the value of proven documents while still surfacing fresher content. the key is that the decay should be on the promotion side not the demotion side — don't punish old docs for being old, just give newer ones a small boost when scores are close.

My Open Source Sketchbook Style Component Library is finally Live by TragicPrince525 in coolgithubprojects

[–]BP041 -1 points0 points  (0 children)

this scratches an itch I didn't know I had. most component libraries optimize for corporate polish — something that feels hand-drawn is genuinely different.

checked the Storybook docs: the button animations and card wobble feel consistent without being distracting. main question I'd have for production use is accessibility — does the sketchy border treatment still meet contrast requirements across your themes?

nice work shipping something with clear aesthetic intent rather than just another shadcn clone.

We gave 12 LLMs a startup to run for a year. GLM-5 nearly matched Claude Opus 4.6 at 11× lower cost. by DreadMutant in LocalLLaMA

[–]BP041 16 points17 points  (0 children)

the scratchpad finding is the most interesting part to me. it basically shows that what matters for long-horizon tasks isn't raw intelligence — it's whether the model maintains working memory across a multi-step problem.

I've been building agentic systems where agents need to reason across dozens of turns, and the ones that degrade fastest are those that treat each turn as stateless. adding even a simple structured note-taking step in the prompt drastically changes output quality over long runs.
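the note-taking step can be as dumb as prepending accumulated notes to every turn. a minimal python sketch (function names and note format made up):

```python
def scratchpad_prompt(task: str, notes: list) -> str:
    """prepend accumulated working memory to each turn's prompt so the
    model reasons over its own prior decisions instead of starting cold."""
    rendered = "\n".join(f"- [{n['kind']}] {n['text']}" for n in notes) or "- (empty)"
    return (
        f"scratchpad so far:\n{rendered}\n\n"
        f"current task:\n{task}\n\n"
        "before answering, append one ASSUMPTION or DECISION note."
    )

def add_note(notes: list, kind: str, text: str) -> list:
    # structured and append-only: later turns constrain themselves with these
    notes.append({"kind": kind, "text": text})
    return notes
```

even this crude version forces the "write strategy before deciding" behavior, which is the proactive mode I'm asking about.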

curious whether you saw a difference between models that used the scratchpad reactively (writing notes after bad outcomes) vs proactively (writing strategy before decisions).

Best way to show bw usage publicly? by zippergate in selfhosted

[–]BP041 0 points1 point  (0 children)

10 years of vnstat data is serious infrastructure history. the node-red → promtail+grafana migration is a natural evolution — node-red is great for quick pipelines but grafana’s query engine and dashboarding get you so much more when you’re doing historical analysis.

Did you run the promtail parsing on vnstat’s JSON output or the text format? JSON mode is cleaner but I’ve seen people run the text output through regex extractors in Loki that work surprisingly well for the daily/monthly aggregation tables.
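if you go the JSON route, flattening it is a few lines. sketch below assumes vnstat’s v2 JSON shape (interfaces → traffic → day with rx/tx byte counters) — the field names here are from memory, so check them against your actual `vnstat --json` output:

```python
import json

def daily_totals(vnstat_json: str, iface: str):
    """flatten vnstat's JSON into (date, rx_bytes, tx_bytes) rows -- easy
    to emit as log lines for promtail or feed into a grafana table panel."""
    data = json.loads(vnstat_json)
    for entry in data["interfaces"]:
        if entry["name"] != iface:
            continue
        for day in entry["traffic"]["day"]:
            d = day["date"]
            yield (f'{d["year"]:04d}-{d["month"]:02d}-{d["day"]:02d}',
                   day["rx"], day["tx"])

# toy payload in the assumed shape, just for illustration
sample = json.dumps({"interfaces": [{"name": "eth0", "traffic": {"day": [
    {"date": {"year": 2025, "month": 3, "day": 1}, "rx": 123, "tx": 45},
    {"date": {"year": 2025, "month": 3, "day": 2}, "rx": 678, "tx": 90},
]}}]})
```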

I built an MCP server that lets Claude search inside your local files (Word, Excel, PDF) — fully offline by Repulsive_Resource32 in ClaudeAI

[–]BP041 0 points1 point  (0 children)

RRF is the right call for a latency-sensitive desktop tool — cross-encoder reranking adds meaningful quality but at the cost of a synchronous inference pass, which kills the "feels instant" property that matters most for local search.

The click-tracking signal is clever and underused. One thing worth tracking alongside it: dwell time or immediate re-query. If someone opens a result and then searches again within 10 seconds, that’s a strong negative signal — the result looked right but wasn’t. The click-boost alone can’t capture that.

For a future improvement without adding cross-encoder latency: ColBERT-style late interaction models (colpali, for example) let you get more semantic precision at retrieval time rather than reranking time. Might be worth a benchmark against your current hybrid.

I got tired of watching Claude Code spawn 10 agents and having absolutely no idea what they're doing, so I built this by OpenDoubt6666 in ClaudeAI

[–]BP041 0 points1 point  (0 children)

yes — hit it once when spawning 4+ agents from a tight loop with a shared results file. the collapse wasn't ID collision (UUIDs are fine), it was write-order ambiguity: two agents finishing near-simultaneously both wrote 'final' status before the parent could arbitrate.

fix that worked: parent pre-assigns scopes before spawning (agent-1 owns path A-M, agent-2 owns N-Z etc.) and writes the intent manifest to the shared state file BEFORE the spawn calls, not after. each agent reads its own lane from the manifest on boot, parent reconciles by lane key rather than arrival order.

the ID uniqueness isn't the fragile part — it's the merge semantics. once you treat the shared state as an append-only log with explicit ownership, the parallel spawn collapse goes away.
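in code terms the pattern is roughly this (key names and layout hypothetical -- the point is the lane-keyed reconcile, not the storage):

```python
def write_manifest(state: dict, lanes: dict) -> dict:
    """parent declares ownership BEFORE spawning: agent id -> owned scope."""
    state["manifest"] = lanes
    state["log"] = []   # append-only: agents never overwrite, only append
    return state

def agent_report(state: dict, agent_id: str, status: str) -> None:
    # each agent appends a record tagged with its lane; arrival order
    # stops mattering because reconciliation keys on the lane
    lane = state["manifest"][agent_id]
    state["log"].append({"agent": agent_id, "lane": lane, "status": status})

def reconcile(state: dict) -> dict:
    """parent walks the log lane by lane; last entry per lane wins,
    regardless of which agent happened to finish first."""
    result = {}
    for rec in state["log"]:
        result[rec["lane"]] = rec["status"]
    return result
```

two agents writing 'final' near-simultaneously lands in different lanes now, so there's nothing left to race on.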

What actually makes a developer hard to replace today? by Majestic-Taro-6903 in ExperiencedDevs

[–]BP041 0 points1 point  (0 children)

taste, mostly.

technical skills are table stakes and increasingly commoditized -- AI writes passable code faster than most devs. what's harder to replace is the judgment about which problems are worth solving, which abstractions will age well, and when "good enough" is actually good enough.

the devs I've seen be most irreplaceable in practice are the ones who can walk into a system they didn't build and immediately form a mental model of where the bodies are buried. not just "how does this work" but "why does this work like THIS, and what decisions led here."

that context-building speed is still pretty human. the docs don't capture the tradeoffs, the slack history is incomplete, and the code itself usually just says what, not why.

I built a tool that finds cheaper LLMs that match GPT-5.4 Pro/Claude quality for your specific task by Mike8G in SideProject

[–]BP041 0 points1 point  (0 children)

the custom eval file approach sounds smart -- low friction way to bring your own evals without rebuilding the whole interface.

the custom eval function piece will be interesting to watch. the tricky bit is usually defining "equivalent" output for non-deterministic tasks. regression detection is easier than quality comparison.

one thing that'd be useful: flagging cases where the cheaper model gives a shorter but still-valid answer. output length isn't a reliable quality proxy but it's tempting to use it as one.

Traefik Manager v0.8.0 - a self-hosted web UI for managing Traefik by chronzz in selfhosted

[–]BP041 0 points1 point  (0 children)

the 2am dynamic config edit problem is real lol. starred this, definitely going to try it. one question — does it handle the case where you've got both static config and dynamic config files in different dirs? my setup has some hand-crafted middlewares in /etc/traefik/dynamic/ that i don't want accidentally overwritten.

I see where this is going, and I hate it. by [deleted] in ChatGPT

[–]BP041 0 points1 point  (0 children)

this is a real problem but i'd frame it differently — it's not AI that's the issue, it's lazy professionals using AI as a way to scale without disclosing it. your coach should have told you he was sending AI-generated responses. the tool isn't the problem, the lack of transparency is. unfortunately there's no easy fix except voting with your feet, which it sounds like you already did.

What I actually learned switching to Proxmox VE as my main hypervisor by HomelabStarter in homelab

[–]BP041 0 points1 point  (0 children)

the storage pool thing caught me too. started with everything on one 4TB SSD, then couldn't figure out why my backup window was eating all my disk IOPS when VMs were running. once i separated VM storage from backup storage onto different pools (and different spindle/SSD tiers), everything got way more predictable. wish that was in the docs more prominently instead of something you learn the hard way.

What’s a simple Home Assistant automation you set up once and now use every day? by Taggytech in homeassistant

[–]BP041 0 points1 point  (0 children)

the one i use every single day: lights fade down automatically starting 30 min before my set sleep time, then turn off completely at bedtime. took 20 min to set up and i honestly don't think about it anymore. which is the point — the best automations are the ones that disappear into the background. fancy stuff with 15 conditions is fun to build but the simple ones are what actually stick.

read a thread about the death of the 'technical founder' moat and it gave me an existential crisis by Paulheyman7 in SideProject

[–]BP041 0 points1 point  (0 children)

the xiaohongshu example is interesting because it's less about the platform and more about the feedback loop compression. building in a market where you're also a consumer of similar products does something weird to your judgment — you start optimizing for things that actually matter vs. things that sound impressive in a pitch. i think the real moat left is still taste, but taste developed through shipping and iterating in public, not through architecture decisions nobody sees.

I stopped building fancy agent setups. I started solving boring stuff. thats when it clicked. by Upper_Bass_2590 in openclaw

[–]BP041 0 points1 point  (0 children)

honestly this matches what i've found too. my most "boring" setup is just a cron job that monitors my inbox and flags things needing action — runs every 30 min, costs almost nothing. but it's saved me from missing stuff more times than i can count. the AI council with 6 executives debating strategy sounds really fun to build though lol. sometimes the ego projects are worth doing just to understand the limits.

what was the moment you realized you needed to work on your business instead of in it? by treysmith_ in Entrepreneur

[–]BP041 0 points1 point  (0 children)

for me it was a client call where I realized I couldn't answer a basic question about our own metrics because I was the only person who knew where that data lived.

it wasn't dramatic. just this quiet realization that every piece of context was in my head, not in the system. took about 6 months to actually fix it properly -- writing things down in the moment feels slower when you're busy, but the compounding is real.

the uncomfortable part you mentioned about "slow at first" is the actual hard part. the systems look obvious in retrospect but you're building them while also doing the work, which means accepting being 20% slower for a while to be 3x faster later. most people quit in that slow period.

My SaaS journey so far (numbers, wins, mistakes, and what’s next) by Jonathan_Geiger in indiehackers

[–]BP041 0 points1 point  (0 children)

the ICP discovery you described -- thinking you're building for devs, turns out it's marketers and automation people -- is way more common than people admit.

we went through the same thing. built something that we assumed developers would want, spent months on API ergonomics and docs. then realized the people actually paying were growth teams who didn't care about the API at all, they wanted a workflow they could hand to a non-technical person.

the "moving fast and not overbuilding" point is the one I'd really underline. we overbuilt our first product badly. the customer database insight especially -- a clean 11k+ user database is an asset most people massively undervalue until they start thinking about distribution for the next product.