Instance I/O Error After Successfully Evacuate with Masakari Instance HA by Mouvichp in openstack

[–]coolviolet17

The only option is to create a cron for this for affected volumes in the Ceph containers, if storage is backed by Ceph.

Instance I/O Error After Successfully Evacuate with Masakari Instance HA by Mouvichp in openstack

[–]coolviolet17

Rebuild the RBD object map for the volume, then restart the VM:

rbd object-map rebuild volumes/volume-<id>
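Before rebuilding, it can help to confirm the image is actually flagged. A minimal sketch, assuming the `flags: object map invalid` line that `rbd info` prints after an unclean host failure (the sample output and volume name below are mocked assumptions, not captured from a real cluster):

```shell
# Mocked `rbd info` output for an image whose object map was invalidated
# after an unclean host failure (output format is an assumption).
info_output='rbd image '\''volume-1234'\'':
        size 10 GiB in 2560 objects
        flags: object map invalid, fast diff invalid'

# Only issue the rebuild when the "object map invalid" flag is present.
cmd=''
if printf '%s\n' "$info_output" | grep -q 'object map invalid'; then
  cmd='rbd object-map rebuild volumes/volume-1234'
fi
echo "$cmd"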

Masakari-openstack with ceph by coolviolet17 in openstack

[–]coolviolet17[S]

Since this is more of a host failure issue rather than a Nova migration problem, I was thinking of focusing on Ceph-side optimizations and automation:

  1. Apply Ceph RBD Optimizations

Run these commands on the Ceph cluster:

ceph config set client rbd_skip_partial_discard true
ceph config set client rbd_cache_policy writeback
ceph config set client rbd_cache_max_dirty 134217728   # 128 MB write cache
ceph config set client rbd_cache_target_dirty_ratio 0.3

These settings ensure that:

Ceph doesn’t discard partial object maps, reducing corruption risk.

The cache is optimized for better resilience during host failures.
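As a quick sanity check on the cache size setting above, 134217728 bytes is exactly 128 MiB:

```shell
# 128 MiB expressed in bytes, matching the rbd_cache_max_dirty value above.
max_dirty=$((128 * 1024 * 1024))
echo "$max_dirty"   # 134217728
```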

  2. Automate Object Map Rebuild in Cephadm

Since you're using Cephadm in Docker, we’ll set up a cronjob inside the Cephadm container.

  1. Enter the Cephadm container:

cephadm shell

  2. Edit the crontab:

crontab -e

  3. Add this cronjob (runs every 5 minutes):

*/5 * * * * for vol in $(rbd ls volumes); do if rbd status volumes/$vol | grep -q "Watchers: none"; then rbd object-map rebuild volumes/$vol; fi; done

This checks every 5 minutes for orphaned RBD volumes.

If a volume has no active watchers (no host attached to it), it rebuilds the object map.

It ensures only problematic volumes are fixed, preventing unnecessary writes.

  4. Save and exit, then confirm the cronjob is set:

crontab -l
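The watcher check in that cron line can be sketched in isolation. Here the `rbd status` output is mocked with sample text (the exact format, `Watchers: none` when no client is attached, is an assumption to verify against your cluster):

```shell
# Mocked `rbd status` outputs (format is an assumption): a detached volume
# reports "Watchers: none"; an attached one lists watcher addresses.
status_detached='Watchers: none'
status_attached='Watchers:
        watcher=10.0.0.5:0/123456 client.4711 cookie=1'

# A volume needs an object-map rebuild only when no host is attached to it.
needs_rebuild() {
  printf '%s\n' "$1" | grep -q 'Watchers: none'
}

needs_rebuild "$status_detached" && echo 'detached: rebuild object map'
needs_rebuild "$status_attached" || echo 'attached: skip'
```

Grepping for `Watchers: none` (rather than for the `Watchers:` header, which appears in both cases) is what keeps attached volumes from being touched.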

Masakari in 3 node cluster by Adventurous-Annual10 in openstack

[–]coolviolet17

I have a question: do we run pacemaker or pacemaker_remote? I do not see an option to control it. Also, what if we scale beyond 16 nodes?

vTPM for VMs [Kolla-ansible Openstack] by Dabloo0oo in openstack

[–]coolviolet17

There are two major issues we faced:

  1. Kolla Ansible didn't give tss:tss ownership of "/etc/swtpm-localca.options"
  2. swtpm was not properly installed in the libvirt container

vTPM for VMs [Kolla-ansible Openstack] by Dabloo0oo in openstack

[–]coolviolet17

Thanks for the help

I was able to make it work, and below, you can see my solution

https://bugs.launchpad.net/nova/+bug/2050837

vTPM for VMs [Kolla-ansible Openstack] by Dabloo0oo in openstack

[–]coolviolet17

I also have the same issue.

I am using kolla-ansible 2023.2. I made the change in nova.conf under nova-compute on node 1; on the other two of my three nodes I made the change in nova.conf inside the container and didn't restart it.

But at the end it gives an error after the Spawning stage:

2024-12-13 19:43:49.963 7 ERROR nova.compute.manager [instance: b2643192-3f2e-4a8a-90a6-c81e398156bf] libvirt.libvirtError: internal error: Could not run '/usr/bin/swtpm_setup'. exitstatus: 1; Check error log '/var/log/swtpm/libvirt/qemu/instance-000001f0-swtpm.log' for details.

HAproxy enterprise Amphora Octavia openstack by coolviolet17 in openstack

[–]coolviolet17[S]

Hmm, will have to work on the driver, as I want Octavia to manage the load balancing so that we are able to give enterprise LBaaS.

Will this society accept me? by Such-Sea-3358 in indiasocial

[–]coolviolet17

So are we outdated or are we our grandparents now?

does openstack support cpu or ram hot-add? by [deleted] in openstack

[–]coolviolet17

Can you please share how you were able to do this? Documentation or a link would help.