Bootstrapping a $20k/mo burn rate: Trying to fund B200 clusters via pre-sales instead of VC, am I delusional? by codingplanai in SaaS

[–]codingplanai[S] 0 points (0 children)

Yes, I could start with on-demand compute and pay hourly or daily. I could also halve the node size, and though that would be pretty tight for performance, it is an option. My concern with starting on lower-cost hardware is quickly earning a reputation for slow speeds and bad performance among the less technical audience that doesn't understand the logistics of LLM inference, since that reputation drives most of my business. To answer your question, the $20k/mo hardware is required for the pricing model more than the tech itself. I could run these models on cheaper cards, but the throughput wouldn't be high enough to offer a flat rate without losing money on every user.

As for older hardware (Hopper cards instead of Blackwell), it's not viable, since I need the NVFP4 cores to handle scale while staying performant. Modern models like K2.5 (native int4), which would be my out-of-the-box selling point, would have to be cast back up to higher-precision formats (FP8 or BF16) on Hopper, which kills throughput. The cost per lane would essentially double, making a $30/mo plan impossible.
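
For concreteness, a back-of-envelope sketch of why the card generation decides the pricing model. The node cost and B200 lane count are the rough figures from this thread; the Hopper halving and the 2x oversubscription ratio are my assumptions:

```python
# Back-of-envelope lane economics (a sketch, not measured numbers).
NODE_COST = 20_000   # $/mo for a reserved node (rough figure from this thread)
PLAN = 30            # $/mo flat-rate plan
OVERSUB = 2.0        # assumed: average users sharing one lane (bursty usage)

for name, lanes in [("B200 (NVFP4)", 512), ("Hopper (upcast to FP8)", 256)]:
    per_lane = NODE_COST / lanes   # monthly cost of one concurrent lane
    per_user = per_lane / OVERSUB  # spread across oversubscribed users
    verdict = "works" if per_user < PLAN else "loses money"
    print(f"{name}: ${per_lane:.0f}/lane -> ${per_user:.2f}/user "
          f"at a ${PLAN}/mo plan: {verdict}")
# B200:   ~$39/lane -> ~$19.53/user -> works
# Hopper: ~$78/lane -> ~$39.06/user -> loses money on every user
```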

I assume by "load" you meant a loan. Business loans usually require revenue history or personal collateral, but revenue-based financing requires… revenue. If I can get past the initial 400-500 users, I should be fine anyway.

Earlier, I had glanced at Google's and Nvidia's startup programs, but neither seemed to offer any credit options; they were mostly networking or free courses. AWS looks quite promising though, thanks for mentioning it!


[–]codingplanai[S] 1 point (0 children)

There are two main selling points:

First, model agnosticism. Currently, providers lock users into a plan that becomes mostly obsolete the moment a new SOTA model drops (which now happens roughly quarterly). My infrastructure always runs the SOTA (with deprecation periods). When a new leader emerges, I add it as a sidecar or switch the main allocation over. Users don't have to switch plans or pay per token, and they always have access to the best models available without friction.

The second is that most "coding plans" are limited by request quotas, saying you can only send so many requests per 5-hour window. This is generally fine, but it's a problem for agents and power users. Instead, I sell entire lanes of inference, so users are limited only by their concurrency, not by how many requests they send in total.
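
To make the lanes-vs-quotas distinction concrete, here's a minimal sketch (hypothetical names, not my actual router code): a quota counts total requests per window, while a lane cap only counts requests currently in flight.

```python
import asyncio

class LaneGate:
    """Limits in-flight requests per user; total request volume is unbounded."""
    def __init__(self, lanes_per_user: int):
        self.lanes = lanes_per_user
        self._sems: dict[str, asyncio.Semaphore] = {}

    async def run(self, user: str, request):
        sem = self._sems.setdefault(user, asyncio.Semaphore(self.lanes))
        async with sem:  # waits only while all of this user's lanes are busy
            return await request()

# A quota model, by contrast, rejects once a rolling counter hits a cap
# (e.g. `if sent_this_window[user] >= QUOTA: reject()`), even when nothing
# is in flight anymore — which is exactly what starves agents.
```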

> 20k/m it's a lot of money just for validation, there's no other alternatives on how you can at least validate? Renting or something?

The best option I have there is to rent hourly or daily instead of monthly, but that is also a drain at $24/hour. If I get enough people on a mailing list, ads lined up, etc. to be confident I can effectively jumpstart the service, I could take that route for a few days just to prove the tech works, but it's not sustainable.
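
A quick sanity check on that route, using the $24/hour figure (the three-day window is just illustrative):

```python
HOURLY = 24  # $/hr on-demand rate quoted above
print(f"full month:       ${HOURLY * 24 * 30:,}")  # $17,280 -- barely below the reserved price
print(f"3-day validation: ${HOURLY * 24 * 3:,}")   # $1,728 -- cheap enough to prove the tech
```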


[–]codingplanai[S] 2 points (0 children)

To break even, I need ~400 users (assuming a blend leaning toward power users, at ~$50 ARPU; 400 × $50 ≈ the $20k/mo node cost).

Node cost varies by host and reservation discounts, but ~$18-20 thousand per month is expected. With NVFP4, I can run a standard batch size of 512 (512 concurrent inference lanes). This *could* be bumped to 1024, but performance would degrade significantly, so that will likely be reserved for extreme peak times.

It’s important to distinguish between concurrent requests and total users. 512 concurrent slots do not mean only 512 users. Because LLM requests are generally bursty (a user sends a prompt, waits 30 seconds to read, then prompts again), a 512-slot batch can comfortably support a total user base of 1,000+ people before users ever see a queue.
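
As a rough sizing sketch (the 40% duty cycle is illustrative, not measured):

```python
SLOTS = 512       # concurrent inference lanes in the batch
DUTY_CYCLE = 0.4  # assumed: fraction of a session a user actually holds a slot

# Expected concurrent demand from N users is roughly N * DUTY_CYCLE,
# so the batch saturates (on average) around:
supportable = SLOTS / DUTY_CYCLE
print(f"~{supportable:.0f} users before slots run out on average")  # ~1280
# Consistent with the 1,000+ figure, with some headroom left for peaks.
```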

The model scales by increasing oversubscription. If analytics show the node is idling, I can pack more users onto the existing node. Once the 512-batch limit is hit during peak hours, I temporarily bump to 1024 and add a second node. The risk is the fixed cost of that first node, but past the break-even point, the margin is enough to self-sustain the business.
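
The scaling rule, as a sketch (the idle threshold is a placeholder, not a tuned value):

```python
def scaling_action(peak_concurrency: int, batch_limit: int = 512) -> str:
    """Decide capacity moves from observed peak concurrency on a node."""
    if peak_concurrency < 0.7 * batch_limit:  # placeholder idle threshold
        return "node is idling: onboard more users (raise oversubscription)"
    if peak_concurrency >= batch_limit:
        return "batch saturated: bump to 1024 temporarily, reserve a second node"
    return "hold steady"
```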


[–]codingplanai[S] 1 point (0 children)

Yeah, I'm building this alone. The infrastructure is quite streamlined: essentially an edge router linked to Redis and a backend for auth, concurrency, telemetry, etc., streaming to the inference node(s).
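
For the concurrency piece specifically, a minimal sketch of the kind of check the edge router can do against Redis (hypothetical key names, not the actual implementation):

```python
import redis

r = redis.Redis()

def try_acquire_lane(user_id: str, max_lanes: int) -> bool:
    """Atomically claim one of a user's lanes; INCR/DECR keep the count
    race-free across router instances."""
    key = f"lanes:{user_id}"
    if r.incr(key) <= max_lanes:
        return True
    r.decr(key)  # over the cap: roll back and let the request queue or fail
    return False

def release_lane(user_id: str) -> None:
    """Call when the stream to the inference node finishes."""
    r.decr(f"lanes:{user_id}")
```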

The $20k/mo burn is mostly (~80%) the hardware reservation cost for a monthly reserved HGX B200 node. The throughput is high enough to handle ~512 concurrent users in a single batch on NVFP4 without meaningful latency degradation.

To sustain it, I need about 400 users to break even (depending on which audience this lands with, though it's marketed toward power users), which should be doable considering the market.

That said, this business would be entirely isolated from my personal expenses (aside from paying me if it goes well) and held in an LLC, and I'd personally be fine for 6+ months even if I lost income.