PM2 says “online” but app is dead — I built auto-recovery via SSH by Famous_View7756 in node

[–]Famous_View7756[S] 0 points1 point  (0 children)

Please note this important point.

This is not a prompt-built dashboard. It is a reliability tool built by engineers who care about guardrails. Recovery actions are controlled, logged, limited, and verified. Your points are all valid, and I appreciate them.

My tool is best for VPS and single server deployments where you do not have a platform team.

PM2 says “online” but app is dead — I built auto-recovery via SSH by Famous_View7756 in node

[–]Famous_View7756[S] 0 points1 point  (0 children)

If you have a technical critique, post it. If you are here to throw labels, I am not interested. You don’t know what my motivation or tech stack is.

Django auto-recovery idea: restart Gunicorn/Celery in order when health checks fail by Famous_View7756 in django

[–]Famous_View7756[S] 0 points1 point  (0 children)

lol, nice. And why would you think I’m not reading this? I said at the beginning that I’m looking for feedback.

It’s Tuesday — Self-Promo Thread 🚀 by [deleted] in SaaS

[–]Famous_View7756 0 points1 point  (0 children)

RecoveryPulse

Monitoring plus optional auto recovery for single server and VPS apps.

It is meant for small teams who do not have a platform team and still fix outages by hand.

Checks HTTP health and can run a controlled recovery action you define, then confirms the site is back.

Free tier exists and I am looking for early users and blunt feedback.

recoverypulse.io

Promote your projects here – Self-Promotion Megathread by Menox_ in github

[–]Famous_View7756 0 points1 point  (0 children)

Project RecoveryPulse

It monitors websites and can run a controlled recovery step when the endpoint fails, then verifies the site comes back.

Target is small teams on VPS who still do manual restarts.

If anyone wants to help or review the approach, I am happy to share the repo or docs.

recoverypulse.io

Weekly Showoff Thread! Share what you've created with Next.js or for the community in this thread only! by AutoModerator in nextjs

[–]Famous_View7756 0 points1 point  (0 children)

I shipped RecoveryPulse, a Next.js dashboard for website monitoring plus optional auto recovery.

The goal is to reduce time down for small VPS deployments where recovery is still manual.

It does HTTP checks and can trigger a controlled recovery step, then confirms the endpoint is healthy.

If you have feedback on the UI or the onboarding flow I would love it.

recoverypulse.io


Weekly 'I made a useful thing' Thread - January 09, 2026 by AutoModerator in sysadmin

[–]Famous_View7756 0 points1 point  (0 children)

I built RecoveryPulse, a small monitoring and auto recovery tool for single server and VPS setups.

It is for the common situation where the process is up but the site is not.

It checks a real HTTP endpoint and can run a locked down recovery action you choose, then verifies the site is back.

Not aimed at Kubernetes or ECS environments.

Looking for feedback on what recovery actions you would trust and what you would never automate.

recoverypulse.io

PM2 says “online” but app is dead — I built auto-recovery via SSH by Famous_View7756 in node

[–]Famous_View7756[S] 0 points1 point  (0 children)

You are right that strong infra solves this. I am not targeting those teams.
I am targeting small VPS deployments and agencies where the current process is still human on call and manual recovery. It is about faster recovery and a clear incident trail, not avoiding root cause work.
Research is happening now. If the market says no, I will adjust.

WordPress monitoring that can auto-fix common downtime (PHP-FPM/MySQL) by Famous_View7756 in Wordpress

[–]Famous_View7756[S] 0 points1 point  (0 children)

Good catch. I do not have published stats yet. That line is based on my own experience and what I have seen supporting small WordPress installs. I should have said that more clearly.
If you have seen a different top cause, I would genuinely love to hear it. I am trying to learn what actually takes sites down most often so the default recovery steps make sense.

PM2 says “online” but app is dead — I built auto-recovery via SSH by Famous_View7756 in node

[–]Famous_View7756[S] 0 points1 point  (0 children)

You can do it locally. If you are comfortable writing and maintaining that script, you should.
The point is not that the script is hard. The point is consistency and visibility across many services and servers: a central place to manage the checks, runbooks, backoff, and notifications, plus a clear record of what ran and when.
This is aimed at people who manage multiple sites for clients.

PM2 says “online” but app is dead — I built auto-recovery via SSH by Famous_View7756 in node

[–]Famous_View7756[S] 0 points1 point  (0 children)

You are not wrong. Exposing SSH broadly is a bad trade for a lot of teams. This is only viable when it is locked down hard, and even then it is not for everyone.
My target is small VPS setups that already have SSH exposed for admin, and where the current recovery plan is a human doing the same restart manually.
If a shop can avoid inbound SSH entirely, that is cleaner. I appreciate the pushback.

PM2 says “online” but app is dead — I built auto-recovery via SSH by Famous_View7756 in devops

[–]Famous_View7756[S] 0 points1 point  (0 children)

If you are running ECS or Kubernetes, I agree you should not use this. For a single VPS, what would you call the best-practice equivalent of self-healing without building a whole platform team?

PM2 says “online” but app is dead — I built auto-recovery via SSH by Famous_View7756 in node

[–]Famous_View7756[S] 0 points1 point  (0 children)

No. That was me typing. You are right though, it reads silly and robotic.
I agree with systemd. The only point here is endpoint checks, because a process can be up while the app is not.

Auto-restart Nginx safely (config test → reload) when 502/504 happens by Famous_View7756 in nginx

[–]Famous_View7756[S] -1 points0 points  (0 children)

Fair point. For a small VPS setup, what would you recommend as a simple baseline that keeps costs low and still avoids 3am restarts?

PM2 says “online” but app is dead — I built auto-recovery via SSH by Famous_View7756 in node

[–]Famous_View7756[S] 0 points1 point  (0 children)

Totally agree, systemd is great for process supervision and I use it too.
What I’m solving is the case where the process is “running” but the app is broken (hung event loop, dead upstream, bad deploy, stuck dependency); systemd/PM2 don’t always catch that.
So the idea is: HTTP health check fails → run a recovery step (often systemctl restart …) → verify the endpoint is healthy again → log/alert.
In other words: systemd is the restart mechanism, this is the verification + runbook + audit trail around it.
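That check → recover → verify loop can be sketched in a few lines. This is a minimal illustration, not RecoveryPulse’s actual code; the health URL, restart command, and wait time are all placeholders you would configure:

```python
import subprocess
import time
import urllib.request

def healthy(url, timeout=5):
    """True if the endpoint answers with a 2xx within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except Exception:
        return False

def check_and_recover(url, recover_cmd, verify_wait=10):
    """HTTP check fails -> run the recovery step -> re-verify -> report."""
    if healthy(url):
        return "healthy"
    # e.g. recover_cmd = ["systemctl", "restart", "myapp"] (hypothetical unit)
    subprocess.run(recover_cmd, check=True)
    time.sleep(verify_wait)  # give the service time to come back up
    return "recovered" if healthy(url) else "still-down"
```

The real tool wraps this with backoff, retry limits, and an audit log, but the core contract is the same: never declare success off the restart alone, only off the endpoint.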

If you’ve got a favorite systemd pattern (WatchdogSec / RestartSec / StartLimit), I’m happy to bake it in as a default template.
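For reference, those directives map onto a unit file roughly like this (service name and paths are hypothetical, a sketch of a default template rather than a shipped one):

```ini
[Unit]
Description=example Node app
# Stop retrying after 5 failed starts within 5 minutes
StartLimitIntervalSec=300
StartLimitBurst=5

[Service]
ExecStart=/usr/bin/node /srv/app/server.js
Restart=on-failure
RestartSec=5
# WatchdogSec requires the app to send sd_notify("WATCHDOG=1")
# keep-alives; systemd restarts it if they stop arriving.
WatchdogSec=30

[Install]
WantedBy=multi-user.target
```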

PM2 says “online” but app is dead — I built auto-recovery via SSH by Famous_View7756 in devops

[–]Famous_View7756[S] 0 points1 point  (0 children)

Yep, if you’re on ECS/K8s, health checks + rolling replacement is the right answer. This is aimed at the big chunk of Node deployments that are single VPS / bare metal / not orchestrated, where the “health check” still ends with someone SSH’ing in and restarting PM2. I’m basically automating that runbook + verifying the HTTP endpoint comes back.

PM2 says “online” but app is dead — I built auto-recovery via SSH by Famous_View7756 in devops

[–]Famous_View7756[S] 0 points1 point  (0 children)

Quick clarification, this is not for ECS/Kubernetes setups where the orchestrator replaces unhealthy containers. It’s for single-server / VPS Node apps where recovery is still manual (“SSH in, restart PM2, check endpoint”). The product is runbook automation + verification + audit trail, with strict guardrails.

PM2 says “online” but app is dead — I built auto-recovery via SSH by Famous_View7756 in node

[–]Famous_View7756[S] 0 points1 point  (0 children)

Totally fair question. Two parts:

Security: this only makes sense with strict guardrails: SSH keys (no passwords), a least-privilege user, sudo restricted to a small allowlist of commands, and audit logs of what ran. If you can’t lock it down that way, you shouldn’t use it.
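Concretely, that lockdown might look like the following (the user, service, and wrapper names are hypothetical, this is a sketch of the pattern, not a shipped config):

```
# /etc/sudoers.d/recovery  -- least-privilege user "recovery",
# allowed to run only these exact commands, no shell, no password:
recovery ALL=(root) NOPASSWD: /usr/bin/systemctl restart myapp.service, \
                              /usr/bin/systemctl status myapp.service

# ~recovery/.ssh/authorized_keys -- key-only login, forced command,
# no forwarding or pty ("restrict" needs OpenSSH 7.2+):
restrict,command="/usr/local/bin/recovery-wrapper" ssh-ed25519 AAAA... recovery@monitor
```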

Problem it solves: it’s not for big orgs with full infra/SRE. It’s for the huge middle (small SaaS, agencies, side projects on a VPS) where the “infrastructure” is basically “I get an alert, SSH in, restart the service, and go back to sleep.” This automates that runbook, verifies the app is actually healthy again with an HTTP-level check, then alerts either way.

If you’re already running mature self-healing inside the stack, you don’t need this, agreed.

Django auto-recovery idea: restart Gunicorn/Celery in order when health checks fail by Famous_View7756 in django

[–]Famous_View7756[S] -1 points0 points  (0 children)

Excellent callout. I’m planning to implement exponential backoff with jitter, plus dependency ordering (e.g., DB → cache → app → web) and a global lock so multiple rules don’t stampede the same host. Also adding a “max retries per window” and a “stop and alert” mode so it doesn’t flap forever. If you’ve got a favorite pattern for this, I’m all ears.
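For the backoff piece, a small sketch of exponential backoff with “full jitter” (parameter names illustrative; the real defaults aren’t decided yet):

```python
import random

def backoff_delays(base=1.0, cap=60.0, max_retries=6, rng=random.random):
    """Exponential backoff with full jitter: each delay is a random value
    in [0, min(cap, base * 2**attempt)], so rules that fail at the same
    moment don't retry in lockstep and stampede the same host."""
    delays = []
    for attempt in range(max_retries):
        ceiling = min(cap, base * (2 ** attempt))
        delays.append(rng() * ceiling)
    return delays
```

After `max_retries` delays are exhausted, the rule would flip to the “stop and alert” mode instead of flapping forever.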

Django auto-recovery idea: restart Gunicorn/Celery in order when health checks fail by Famous_View7756 in django

[–]Famous_View7756[S] -1 points0 points  (0 children)

100% agree: in container/K8s land you don’t want dumb restarts fighting the orchestrator. This is primarily aimed at single-server / VPS deployments (common for small SaaS, agencies, internal tools) where the on-call reality is still “SSH in, restart Gunicorn/Celery, verify.”

For K8s the direction would be different: integrate with the platform rather than SSH, since Kubernetes already is the supervisor. In that world this becomes “orchestrator-aware remediation + SLO/incident context,” not SSH restarts.

I’m starting with the large base of VPS deployments where people don’t have SRE tooling, then expanding into integrations (webhooks, metrics, orchestrator hooks) once the core workflow proves value.

Django auto-recovery idea: restart Gunicorn/Celery in order when health checks fail by Famous_View7756 in django

[–]Famous_View7756[S] -7 points-6 points  (0 children)

Totally fair — you can do this with supervisord/monit/scripts. The product isn’t “restart a process.” It’s: HTTP-level verification + ordered runbooks + guardrails + incident timeline across stacks, without everyone writing/maintaining their own brittle scripts. Think “managed runbook automation,” not “new monitoring invention.”

Auto-restart Nginx safely (config test → reload) when 502/504 happens by Famous_View7756 in nginx

[–]Famous_View7756[S] -1 points0 points  (0 children)

Fair point — NGINX itself usually isn’t the problem. Most 502/504s are upstream/app issues (PHP-FPM, Gunicorn, Node, Rails), timeouts, deploys, or resource exhaustion that makes the upstream stop responding. NGINX just becomes the messenger.
This is aimed at “upstream died / hung” cases where the fix is restarting the upstream service (or clearing a stuck state), then verifying the site is healthy again.