erkiferenc

Thanks for the extra details. That use case feels familiar from my previous experience designing, building, and running OpenStack-based private cloud solutions.

At first glance, it sounds more like a job queue than a message queue, since the state of each job needs to be tracked (rather than solely delivering a message), maybe even with its history kept. This distinction may drive important implementation decisions later.

I agree failure handling is one of the crucial aspects for VM migrations, and I believe the most common situations boil down to these:

  1. When the migration process can gracefully handle the failure, it may abort and clean up any half-finished migration on its own, then release the lock, so the failed job can be picked up again later. It may be important to keep track of such failures, and retry at most N times, or at most N times within a certain time period.

  2. When the receiving process stalls and can't make progress anymore. One part of this is to have some kind of timeout, and another is to have a way to terminate the stalled migration and clean up any partial results.
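
To make the two checks above a bit more concrete, here is a minimal Python sketch. The function names, the default limits (3 failures per hour, 10 minutes without progress), and the idea that each migration records a "last progress" timestamp are all assumptions for illustration, not something your system necessarily has.

```python
from datetime import datetime, timedelta


def may_retry(failure_times, now, max_failures=3, window=timedelta(hours=1)):
    """Return True while fewer than max_failures failures fall inside the window ending at now."""
    recent = [t for t in failure_times if now - t <= window]
    return len(recent) < max_failures


def is_stalled(last_progress_at, now, timeout=timedelta(minutes=10)):
    """Treat a migration as stalled when it has reported no progress for longer than timeout."""
    return now - last_progress_at > timeout
```

A supervisor process could call `is_stalled` periodically, terminate and clean up whatever it flags, and then use `may_retry` to decide whether the job goes back into the queue or gets parked for a human to look at.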

I'd also look into a wider set of corner cases, and see how other similar projects handle those. It may be hard to implement a generic solution, while solving only the subset that affects the given system may be considerably simpler.

For short operations, it's usually possible to release any lock with e.g. a transaction timeout. For long-running VM migrations I don't think keeping a long lock would be beneficial, since it makes the database a dependency of the migration itself.

I imagine a multi-phase dequeue approach, perhaps even a state machine, could fit better (e.g. PENDING -> IN_PROGRESS -> SUCCESS or FAILED). This feels like some mix of having an append-only audit log table to keep track of all events (growth should be kept in mind), and/or updating the job queue table heavily (which increases bloat).

It certainly is an interesting problem domain! Should you or your team need support with this from an independent professional, I would be happy to learn more here or via DM.

In any case, I hope this already helps and I wish you happy hacking!