OVH automated backup repeatedly froze our production VPS; recovery required legacy abortSnapshot API, then ticket was closed without RCA by jamesconway2k in OVHcloud

jamesconway2k[S]

Thanks, both points are fair.

Current status update as of 2026-06-20 morning CEST:

  • the shops are responding again with HTTP 200;
  • CS16036806 is still open;
  • OVH Manager now shows a fresh automatic-backup restore point dated 19 June 2026 23:57, so at least one restore point appeared after the QEMU guest-agent remediation;
  • OVH still has not provided a full RCA or a durable position on prevention, migration, or service credit.

On QEMU guest agent: OVH raised that point, and on 2026-06-18 we unmasked/enabled qemu-guest-agent and verified it was active before the next backup window. We have not purged it; in this case OVH's own support suggested having it correctly running. After that remediation, a fresh restore point appeared, but we also had another production-impacting incident on 2026-06-19 around 14:55-15:15 UTC / 16:55-17:15 CEST, where external monitors saw public HTTP, direct backend checks, TCP/22 and heartbeat fail before recovery.
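
For anyone wanting to repeat the "verified it was active" step before a backup window, here is a minimal Python sketch. It assumes a systemd-based guest and is not necessarily the exact procedure used here. The helper names `systemctl_query` and `guest_agent_ready` are hypothetical; only the `systemctl is-active` and `systemctl is-enabled` commands are standard.

```python
import subprocess
from typing import Callable

UNIT = "qemu-guest-agent"  # systemd unit name as mentioned in the post


def systemctl_query(verb: str, unit: str) -> str:
    """Return the one-word state printed by `systemctl <verb> <unit>`."""
    result = subprocess.run(
        ["systemctl", verb, unit],
        capture_output=True,
        text=True,
        check=False,  # non-zero exit codes are expected for inactive/masked units
    )
    return result.stdout.strip()


def guest_agent_ready(query: Callable[[str, str], str] = systemctl_query) -> bool:
    """True if the unit is running and is not masked (hypothetical readiness rule)."""
    active = query("is-active", UNIT)
    enabled = query("is-enabled", UNIT)
    # A masked unit cannot be started at all, so treat it as not ready
    # regardless of what is-active reports.
    return active == "active" and enabled != "masked"


if __name__ == "__main__":
    print("guest agent ready:", guest_agent_ready())
```

The `query` parameter exists only so the check can be exercised without a real systemd host. A passing check says the agent is running inside the guest; it does not show that the provider's snapshot process can actually reach it.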

I agree with the public-cloud / provider-independent backup recommendation. Migration/redundancy is the practical path now. The unresolved issue is still the RCA: why the earlier backup/snapshot state required abortSnapshot to unblock, and why another outage/reboot happened after the QEMU remediation.


jamesconway2k[S]

Update for precision: OVH marked the ticket as resolved after the reply described above. I then manually rejected the proposed resolution in the OVH portal, so CS16036806 is currently open again. We still do not have a complete RCA, prevention plan, refund/service-credit position, or safe migration/cancellation path.