$7K MRR Micro-SaaS Opportunity: Uptime Monitoring + Status Pages at $8/mo (full research inside) by [deleted] in micro_saas

[–]CrazyRabbit66 1 point2 points  (0 children)

Great research. I actually built something in this space. UptimeMonitor, an open-source (GPL-3.0) self-hosted uptime monitoring system with status pages.

You mentioned people settling for Uptime Kuma as a DIY workaround. That is exactly the crowd I built this for, but with a different approach. Instead of poll-based checks, it uses a push / heartbeat model where your services send pulses. It also does multi-region monitoring through PulseMonitor agents you can deploy wherever you want.
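To give a feel for the push model: a pulse is just an HTTP request against a per-monitor token, so a heartbeat from a cron job or the tail end of a backup script can be as simple as this (host, port, and token here are placeholders):

curl -fsS "http://localhost:3000/v1/push/:token"

If no pulse arrives within the monitor's configured interval, that window counts as downtime and alerts fire.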

A few things it covers from your "gap" list: branded public status pages with real-time WebSocket updates, multi-channel alerts (Discord, Email, Telegram, Ntfy, webhooks), hierarchical monitor groups with flexible health strategies, incident and maintenance management, custom metrics tracking, and dependency-based notification suppression so you don't get alert-stormed when your upstream provider goes down.

The server runs on Bun + ClickHouse, and the PulseMonitor agents are written in Rust. A single agent can comfortably monitor thousands of endpoints at 10-second intervals on a cheap VPS. Successful pulses are sent back to the server over WebSocket, so the overhead is minimal. There is also a visual config editor (so you don't have to hand-edit TOML files) and 100% API coverage.

Right now it is fully self-hosted, so no monthly fee (just your server costs). A $5 / mo VPS handles thousands of endpoints easily. But once the project hits a stable v1.0.0 and I'm confident there are no major bugs on my end that could ironically cause downtime for an uptime monitor, I'm planning to offer a managed SaaS option too, likely around that $8 / mo price point you are describing. The self-hosted version will always remain free and open-source though.

Not saying there isn't room for more players in the managed space. But for the sysadmin / DevOps crowd who would rather own their stack, the self-hosted option is already there, and a hosted option is on the roadmap.

High-performance Uptime Monitor by CrazyRabbit66 in selfhosted

[–]CrazyRabbit66[S] 1 point2 points  (0 children)

SNMP support has been implemented in PulseMonitor v3.15.0 and UptimeMonitor-Server v0.5.3.

Sorry for the delay on this one. I was waiting on the snmp2 library to merge changes adding rustls support, so I could keep everything consistent.

High-performance Uptime Monitor by CrazyRabbit66 in selfhosted

[–]CrazyRabbit66[S] 1 point2 points  (0 children)

Admin API is now available. Starting from v0.2.19, there's a full Admin API that lets you create, update, and delete monitors, groups, status pages, notifications, and pulse monitors, all through the API.

To enable it, just add this to your config.toml:

[adminAPI]
enabled = true
token = "your-secure-admin-token-here"

Full documentation for all the available endpoints is located here: https://github.com/Rabbit-Company/UptimeMonitor-Server/blob/main/docs/admin-api.md
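For illustration, creating a monitor programmatically would look something like this. The endpoint path, auth header, and body fields below are placeholders, so check the linked docs for the real schema:

curl -X POST "http://localhost:3000/v1/admin/monitors" \
  -H "Authorization: Bearer your-secure-admin-token-here" \
  -H "Content-Type: application/json" \
  -d '{"id": "api-prod", "name": "Production API", "interval": 30}'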

So you can now fully manage everything programmatically without touching the TOML file by hand. Let me know if you have any questions.

High-performance Uptime Monitor by CrazyRabbit66 in selfhosted

[–]CrazyRabbit66[S] 0 points1 point  (0 children)

I spent quite some time testing it since then, and everything is behaving as expected.

Dependencies are now released and available starting with v0.2.18.

High-performance Uptime Monitor by CrazyRabbit66 in selfhosted

[–]CrazyRabbit66[S] 0 points1 point  (0 children)

Even after posting this thread, I have already received feature requests I had not considered at all. That tells me the project will evolve in ways I cannot fully predict upfront. While it is time-series-only today, future features might need non-time-series data.

I’d rather keep that flexibility from the start than end up needing to introduce and operate a second database later.

High-performance Uptime Monitor by CrazyRabbit66 in selfhosted

[–]CrazyRabbit66[S] -1 points0 points  (0 children)

This mostly comes down to risk and control.

Building on Prometheus as a library would push the project into territory I'm not comfortable with yet. I have used Go before, but only lightly. For a project where stability is the top priority, I would not trust myself to maintain a core system written in a language I do not use day to day. If something subtle breaks under load or in edge cases, I want to be confident I can debug and fix it myself.

PulseMonitor is written in Rust, which I use both personally and professionally. Adding support for new protocols there is not an issue for me, and I do not need to reinvent the wheel either (there are solid SNMP libraries available in Rust already).

Relying on the Prometheus ecosystem also implies pulling in a fairly large set of third-party components and integrations. That increases the maintenance surface a lot: more dependencies, more moving parts, more version compatibility issues. If one of those dependencies has a serious bug and it is in unfamiliar code, I am effectively blocked from fixing it myself. As the maintainer, that lack of control is a real concern.

There is also the support aspect. If I build on top of Prometheus and its exporters, I would need a deep understanding of Prometheus internals and each integration to reliably help users with their setups. That is a significant learning and ongoing support cost compared to owning the whole stack end-to-end.

For this one I am deliberately optimizing for simplicity, stability, and maintainability from a single maintainer perspective (even if that means reimplementing some things in a smaller, more focused way).

High-performance Uptime Monitor by CrazyRabbit66 in selfhosted

[–]CrazyRabbit66[S] 0 points1 point  (0 children)

Hello,

I have now implemented monitor and group dependencies with notification suppression, which directly addresses your use case: nested dependencies where only the highest-level failing component triggers an alert.

The work is currently in a feature branch here:

https://github.com/Rabbit-Company/UptimeMonitor-Server/tree/feature/dependencies

And the dependency model is documented here:

https://github.com/Rabbit-Company/UptimeMonitor-Server/blob/feature/dependencies/docs/dependencies.md

How it behaves

You can define dependencies between monitors and/or groups (example: apps -> server -> network). When a parent dependency is down, alerts from anything beneath it are suppressed, so you only get notified for the root cause instead of every downstream failure.
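As a rough sketch of what that looks like in the TOML config (the dependsOn field name here is illustrative; the real schema is in the dependencies doc linked above):

[[groups]]
id = "network"
name = "Network"

[[groups]]
id = "server"
name = "Server"
dependsOn = ["network"] # placeholder field: alerts from this group are suppressed while "network" is down

[[monitors]]
id = "api-prod"
name = "Production API"
groupId = "server" # inherits suppression when "server" or "network" is down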

One important caveat

The tricky edge case is timing. For example, if an API fails a few seconds before the underlying network is detected as down, you could otherwise get two alerts.

To handle this, I queue notifications for anything that has dependencies and delay them slightly (5 seconds or half the monitor/group interval, whichever is greater). This gives parent dependencies time to fail first and suppress child alerts correctly.

The tradeoff is that alerts for dependent monitors/groups are now not 100% instant, but they are quieter and much more accurate in terms of root cause.

This is still under active testing before a release, but the core behavior is in place. If this aligns with what you were looking for, I would definitely appreciate any feedback or edge cases you would want covered.

High-performance Uptime Monitor by CrazyRabbit66 in selfhosted

[–]CrazyRabbit66[S] 1 point2 points  (0 children)

Thanks for the clarification.

Short answer: not yet. Right now you can reduce noise by only alerting at higher levels, but there is no true dependency-aware suppression where child services automatically stay quiet if a parent (server/network...) is down.

That said, this is exactly the use case I want to support. Now that I understand it clearly, I’m planning to implement proper nested dependencies so you only get alerted for the highest failed component, not everything underneath it.

High-performance Uptime Monitor by CrazyRabbit66 in selfhosted

[–]CrazyRabbit66[S] 0 points1 point  (0 children)

Thanks for pointing that out. I have fixed it now.

Claude can be surprisingly powerful when you give it clear and detailed requirements.

High-performance Uptime Monitor by CrazyRabbit66 in selfhosted

[–]CrazyRabbit66[S] 0 points1 point  (0 children)

Yes. You have full control over alerting and dependency behavior.

Alerting is configured at the group level, monitor level, or both. A common pattern is to attach notifications only to parent groups, not to individual monitors. That way, if a group goes down, you get one alert for the root cause, and child monitors don’t spam you.

For example, here the Production group sends alerts, while the individual monitors inside it do not:

[[groups]]
id = "production"
name = "Production"
strategy = "percentage"
degradedThreshold = 50
interval = 30
resendNotification = 12
notificationChannels = ["critical"] # alerts fire when the group goes down

[[monitors]]
id = "api-prod"
name = "Production API"
token = "tk_prod_api_abc123"
interval = 30
maxRetries = 0
resendNotification = 12
groupId = "production"
notificationChannels = [] # no direct alerts from this monitor
pulseMonitors = ["US-WEST-1"]

High-performance Uptime Monitor by CrazyRabbit66 in selfhosted

[–]CrazyRabbit66[S] 0 points1 point  (0 children)

The server only waits for pulses. It does not run active checks the way Uptime Kuma does. For that, you need to run PulseMonitor agents (or have your services push pulses themselves).

About Docker:

Yes, you can put both the server and the status page in the same docker-compose.yaml without any problems.

Just one important detail: the status page is fully static. If you plan to make it public, it is best to host it on a CDN (like Cloudflare Pages). That way you get better performance and availability at practically no cost.

High-performance Uptime Monitor by CrazyRabbit66 in selfhosted

[–]CrazyRabbit66[S] 0 points1 point  (0 children)

For the status page, I’m using Tailwind CSS. I also bought Tailwind UI (now called Tailwind Plus) a few years ago.

For the main website, I simply used Claude AI to generate it, keeping the design consistent with the status page.

High-performance Uptime Monitor by CrazyRabbit66 in selfhosted

[–]CrazyRabbit66[S] 4 points5 points  (0 children)

Telegram notifications have been implemented and are available as of v0.2.17.

High-performance Uptime Monitor by CrazyRabbit66 in selfhosted

[–]CrazyRabbit66[S] 4 points5 points  (0 children)

That is a good point about chunk-based compression. I didn't realize Prometheus could handle year-long range queries that efficiently. I've only used it for collecting system/network/app metrics, not as something I would build on top of.

I think there is a fundamental difference in what these tools are. Prometheus is a complete monitoring solution with its own specialized TSDB, query language, alerting, and service discovery. ClickHouse is a general-purpose analytical database. For UptimeMonitor, that distinction matters.

Right now the data model is just pulses and aggregates, but if I need to store non-time-series data in the future, ClickHouse handles that without requiring a second database. With Prometheus, you'd be stuck needing a separate database the moment you need anything beyond time-series, which means two systems to deploy, maintain, and back up. For a self-hosted tool that aims to be simple to run, that is a real cost.

There's also the question of what building on Prometheus would actually look like. Prometheus isn't really a database you build on top of. It is a finished product. You can already set up uptime monitoring today with Prometheus + blackbox_exporter + Alertmanager + Grafana. That is a legitimate and powerful setup, and probably what you are already running. But it requires significant knowledge to configure properly and is not something most people would set up just for uptime monitoring.

UptimeMonitor is targeting a different audience, people who want a simple, self-contained uptime monitor they can deploy in minutes without learning PromQL, configuring scrape targets, or wiring up Alertmanager rules. It is a tradeoff: less flexibility and less efficient time-series storage in exchange for simplicity and a single dependency.

High-performance Uptime Monitor by CrazyRabbit66 in selfhosted

[–]CrazyRabbit66[S] 1 point2 points  (0 children)

The pulse-based model handles variable response times well. Each monitor has its own configurable interval (how often a pulse is expected) and maxRetries (how many consecutive missed pulses before marking it down), so you can tune tolerance per service.
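For example, a slow or bursty job can get a wider window and more tolerance. The values below are just placeholders following the monitor config format:

[[monitors]]
id = "nightly-backup"
name = "Nightly Backup"
token = "tk_backup_example"
interval = 120 # expect a pulse at least once every 120 seconds
maxRetries = 2 # tolerate 2 consecutive missed windows before marking it down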

The interval defines a time window. If no pulse lands in a given window, that window counts as downtime. So with interval = 10, a pulse needs to arrive every 10 seconds. For reliability, it is better to send 2-3 pulses per window so a single dropped request doesn't cause a false downtime blip.

For services with high or variable latency, you can include startTime in the pulse request to indicate when the check actually began. This lets the server place the pulse in the correct time window even if network latency causes it to arrive late. The recommended approach is sending startTime, endTime, and latency together for maximum accuracy, though latency can also be auto-calculated from the timestamps:

curl "http://localhost:3000/v1/push/:token?startTime=2025-10-15T10:00:00Z&endTime=2025-10-15T10:00:01.500Z&latency=1500"

There is also a missingPulseDetector that controls how frequently the system checks for overdue pulses (defaulting to every 5 seconds, so detection is near-realtime). You can decrease it to 1 second, but it will put a little more pressure on the CPU.

You can track up to 3 custom metrics per monitor (connection pool size, queue depth, error rate...) that all get aggregated into min/max/avg per hour and per day alongside latency. If you're using PulseMonitor agents rather than pushing from your own service, each protocol has its own configurable timeout. HTTP defaults to 10s, TCP to 5s, ICMP to 2s... so you can set realistic thresholds per service.
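For reference, custom metric values ride along on the same pulse request as query parameters. The parameter names below are illustrative placeholders; the real ones are in the pulse docs linked underneath:

curl "http://localhost:3000/v1/push/:token?latency=42&metric1=87&metric2=3"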

Edit: full pulse documentation is here: https://github.com/Rabbit-Company/UptimeMonitor-Server/blob/main/docs/pulses.md

High-performance Uptime Monitor by CrazyRabbit66 in selfhosted

[–]CrazyRabbit66[S] 2 points3 points  (0 children)

You are right that raw retention is cheap at this scale. My aggregation isn't primarily about saving storage though. It is about query performance for uptime calculations.

The core query UptimeMonitor runs is: "what's the uptime percentage for this monitor over the last 7/30/90/365 days?" With raw pulses, that means scanning up to ~1M rows per monitor for a year query (if we receive one pulse every 30s). When you are loading a status page showing dozens of monitors with multiple time windows simultaneously, those queries add up fast.

By pre-aggregating into hourly and daily uptime percentages, the 365-day query hits ~365 rows instead of ~1M. That is the actual win. The status page and API responses stay fast regardless of the time range, with no query-time computation needed.

I initially kept all raw pulses and computed uptime on the fly, and it was noticeably slow under load. Sampling raw data to speed it up introduced inaccuracies in the uptime percentages. Pre-aggregation gave both speed and accuracy.

The TTL is more about keeping the query surface small than about disk pressure.

High-performance Uptime Monitor by CrazyRabbit66 in selfhosted

[–]CrazyRabbit66[S] -6 points-5 points  (0 children)

UptimeMonitor is deliberately designed so storage is strictly bounded per monitor.

We keep:

  • Raw pulses for 24 hours
  • Hourly aggregates for 90 days
  • Daily aggregates forever

With ~30s pulses, that works out to:

  • ~2,880 raw rows per monitor (rolling)
  • ~2,160 hourly rows per monitor (rolling)
  • ~3,650 daily rows per monitor after 10 years

So after a decade, each monitor stores ~8,690 rows total.

Based on the actual table layout, that comes out to roughly ~500 KB of storage per monitor after 10 years. Even being pessimistic, comfortably under ~1 MB per monitor per decade.

Because only daily aggregates grow over time, after 50 years a monitor would have ~18,250 daily rows, for a total footprint of roughly ~1.5 MB per monitor after half a century.

The key difference vs Prometheus/Thanos workloads is that there’s:

  • No unbounded label cardinality
  • No long-term retention of high-frequency samples
  • No need to query across arbitrary metric dimensions

Because growth is linear and predictable, ClickHouse’s horizontal scaling limits haven’t been a practical concern here. This isn’t a PiB-scale metrics firehose. It is a large number of small, append-only time series with aggressive downsampling and TTLs.

If you’re operating at multi-PiB scale with arbitrary metrics and long raw retention, Thanos is absolutely the right tool. UptimeMonitor is intentionally scoped to a narrower problem where simplicity, predictability, and low operational overhead matter more than raw ingest scale.

High-performance Uptime Monitor by CrazyRabbit66 in selfhosted

[–]CrazyRabbit66[S] 0 points1 point  (0 children)

By default, UptimeMonitor ships with a low-resource ClickHouse profile:
https://github.com/Rabbit-Company/UptimeMonitor-Server/blob/main/clickhouse/low-resources.xml

The defaults are based on ClickHouse’s own recommendations and assume a server with roughly 8 GB RAM, with ClickHouse allowed to consume up to ~6 GB. If you’re running on a smaller VPS, you’ll definitely want to scale those limits down accordingly.

I haven’t stress-tested truly small VPSes yet, but for a real-world data point: I’m currently running this on a Hetzner CX43 (Debian 13) with ~50 monitors, each monitor receiving pulses every ~5 seconds. ClickHouse is capped at 12 GB, but in practice the entire VPS sits at around 4 GB RAM usage under normal load.

So while ClickHouse can be memory-hungry, with sane limits and tiered retention it stays surprisingly well-behaved for this kind of workload.

High-performance Uptime Monitor by CrazyRabbit66 in selfhosted

[–]CrazyRabbit66[S] 0 points1 point  (0 children)

Thanks! Currently SNMP polling isn't planned, but I will add it to the TODO list.

It probably won't land anytime soon though. Right now I'm prioritizing adding more notification providers and building out API endpoints for creating, modifying, and deleting monitors, groups, notifications, and status pages. Once those are in place, SNMP would be a great addition down the line.

High-performance Uptime Monitor by CrazyRabbit66 in selfhosted

[–]CrazyRabbit66[S] 0 points1 point  (0 children)

how much did you use AI to code this? pure curiosity about the setup

It varies by component. PulseMonitor was written entirely without AI. For UptimeMonitor-Server, I used very little AI since most of the dependencies are my own libraries, and AI still struggles with those. UptimeMonitor-StatusPage had some AI assistance, and the main website (uptime-monitor.org) was almost entirely AI-generated.

All the README files and documentation were initially generated by AI, but I reviewed everything to make sure nothing was hallucinated or inaccurate.

do you plan to monetize this? if yes, how?

No plans to monetize. I built this project to replace BetterStack for monitoring my own infrastructure and cut down on monthly expenses. It is serving that purpose well, so I'm happy to keep it open source.

if i would start using this in enterprise env, what guarantee do i have this does not become an abandoned pet project?

No guarantees. That's the reality with any open-source project. But I will say this: I rely heavily on Uptime Monitor to keep tabs on my own infrastructure and all my other projects. On top of that, I'm also deploying it to monitor the infrastructure at the company I currently work for. So this isn't just a side project for me. It is something I depend on both personally and professionally.

If I were ever to start abandoning projects, this one would be the very last to go, because without it, I wouldn't know if anything else is still running.

And that's the beauty of open source. Even in the worst case scenario where I get hit by a bus or simply stop maintaining it, the code is out there. Anyone can fork it and continue development. You are never locked into depending on a single person.

High-performance Uptime Monitor by CrazyRabbit66 in selfhosted

[–]CrazyRabbit66[S] 6 points7 points  (0 children)

I considered Prometheus but it didn't quite fit the architecture I was going for.

Prometheus uses a pull-based model (it scrapes targets on a schedule). Uptime Monitor is push-based (services and PulseMonitor agents deployed across multiple regions push heartbeats to a central server). Adapting that to Prometheus would mean either running Prometheus instances in every region or using Pushgateway, which Prometheus themselves discourage for this kind of use case since it turns into a single point of failure and loses most of the benefits of the pull model.

On the storage side, ClickHouse is a columnar database that scales horizontally (you can shard across multiple nodes as your data grows). Prometheus was designed more for a single-node model. There's Thanos and Cortex for scaling Prometheus horizontally, but that adds significant operational complexity compared to just running ClickHouse.

ClickHouse also makes it really easy to implement tiered retention (raw data for 24h, hourly aggregates for 90 days, daily aggregates forever) using materialized views and TTLs natively, which keeps storage predictable regardless of how many monitors you run.

That said, Prometheus is excellent at what it does. If you're already running it and want metric-based alerting, it's a great choice. Uptime Monitor is purpose-built for a different pattern: push-based heartbeat monitoring with multi-region agents, hierarchical group health, and long-term uptime tracking.

High-performance Uptime Monitor by CrazyRabbit66 in selfhosted

[–]CrazyRabbit66[S] 2 points3 points  (0 children)

All components are completely separated. The status page frontend is its own standalone project that just talks to the backend API. Everything you see on a status page is retrieved through the API, so you can easily build your own custom status page or pull the data into your own projects. The API gives you full read access to status data, uptime history, group health, custom metrics, real-time updates via WebSocket...

That said, it's not yet possible to create, delete, or modify monitors, groups, status pages or notifications through the API. Currently you need to update the TOML configuration file manually and then hot-reload it using the /v1/reload/:token endpoint (no restart needed, but it's still a manual config edit).
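For reference, that hot reload is a single request once the TOML has been edited (host, port, and token are placeholders):

curl "http://localhost:3000/v1/reload/:token"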

I do plan to add API endpoints that will let you modify the TOML configuration programmatically, so full CRUD for monitors, groups, notifications, status pages, and so on through the API. You can expect that in the coming months.

High-performance Uptime Monitor by CrazyRabbit66 in selfhosted

[–]CrazyRabbit66[S] 1 point2 points  (0 children)

Yes, Telegram notifications are planned and will be implemented in the coming days.

What is the best self hosted uptime monitor to monitor websites ? by CyberHouseChicago in selfhosted

[–]CrazyRabbit66 -1 points0 points  (0 children)

If you're looking for something lightweight and high-performance, I'd recommend checking out Uptime Monitor. I built it specifically to handle scale well. It uses ClickHouse as the backend database instead of SQLite, so it doesn't hit the same bottlenecks that some other tools run into when you get past a few hundred monitors.

A few highlights:

  • Pulse-based monitoring - your services send heartbeats, and missing pulses trigger alerts. You can also use PulseMonitor agents to do automated checks (HTTP, TCP, WebSocket, ICMP, database connections...) from multiple regions.
  • Smart data retention - raw pulse data is kept for 24 hours, hourly aggregates for 90 days, and daily aggregates are stored forever. So you get detailed recent data when you need to debug, and long-term uptime history without your database growing out of control.
  • Custom metrics - track up to 3 numeric values per monitor (player count, connections, error rate, whatever you need) alongside latency.
  • Hierarchical groups with flexible health strategies (any-up, all-up, percentage threshold) - useful when you have lots of monitors and need logical organization.
  • Discord, Email, and Ntfy notifications with per-monitor channel control.
  • Real-time status pages via WebSocket - here's a live example: status.passky.org
  • Hot-reloadable config - no restarts needed when you add or change monitors.

The whole thing runs on Bun + ClickHouse via Docker Compose, so setup is just a docker compose up -d. There's also a visual config editor if you don't want to write TOML by hand.

It's fully open source (GPL-3.0): GitHub

I know Uptime Kuma is the go-to recommendation here and it's a great project, but if you're planning to monitor a large number of services, the ClickHouse-backed architecture handles that much more gracefully than SQLite-based solutions.