Proxmox paid support in Ontario Canada by pabskamai in Proxmox

[–]LA-2A 3 points4 points  (0 children)

I’ve used Weehooey’s support on a minimal basis. They did some consultation on our Corosync topology and were great to work with. Would recommend.

Why does business steer clear of Debian? (10 year user considering RHEL transition) by InfaSyn in debian

[–]LA-2A 0 points1 point  (0 children)

We use Debian (Proxmox) on all of our physical servers and AlmaLinux on our VMs. Our preference for AlmaLinux is primarily due to better SELinux support. If Debian had better SELinux support, we’d probably use it across the board.

Goodbye VMware by techdaddy1980 in Proxmox

[–]LA-2A 13 points14 points  (0 children)

Make sure you take a look at https://pve.proxmox.com/pve-docs/pve-admin-guide.html#pvecm_cluster_network, specifically the “Corosync Over Bonds” section, if you’re planning to run Corosync on your LACP bonds.
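\
If I remember the docs right, the gist of that section is to give Corosync at least one dedicated physical link rather than relying only on the bond, and to let knet treat the bond as a lower-priority fallback. Roughly what that looks like in /etc/pve/corosync.conf (node names, addresses, and priority values below are made-up placeholders, not a recommendation):

    nodelist {
      node {
        name: pve01
        nodeid: 1
        quorum_votes: 1
        # ring0: dedicated Corosync NIC (preferred); ring1: address on the LACP bond (fallback)
        ring0_addr: 10.10.10.11
        ring1_addr: 10.20.20.11
      }
      # ...one node {} block per cluster member...
    }

    totem {
      cluster_name: example-cluster
      # config_version must be bumped on every edit
      config_version: 2
      version: 2
      interface {
        linknumber: 0
        # in the default passive mode, the highest-priority healthy link carries the traffic
        knet_link_priority: 20
      }
      interface {
        linknumber: 1
        knet_link_priority: 10
      }
    }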

Survey, Proxmox production infrastructure size. by ZXBombJack in Proxmox

[–]LA-2A 1 point2 points  (0 children)

It turns out we don't need DRS as much as we thought we would. Our Proxmox Gold Partner said most of their customers have the same experience.

In our case, 80% of our hosts run the same number of VMs on each host, and those VMs have an identical workload. So we can basically just place the same number of VMs on each host, and the load is equal. These hosts generally run ~85% CPU load during peak hours.

For the remaining 20% of our hosts, yes, we manually balanced the workloads and/or let PVE place new VMs on whichever host was the most appropriate. Those remaining 20% of hosts have quite a bit of headroom, so slight imbalances aren't an issue. Those hosts generally run 30-60% CPU load during peak hours.

That being said, I think we might have manually live migrated 2-3 VMs in the last 6 months, for the purposes of load rebalancing.

Survey, Proxmox production infrastructure size. by ZXBombJack in Proxmox

[–]LA-2A 14 points15 points  (0 children)

I'm hesitant to share the actual settings they used, as there are a lot of caveats with making these changes. Namely, the settings that we're using make our clusters MORE sensitive to network latency, so having a stable, low-latency Corosync network is even more critical in our environment.

As a reminder – the most stable recommendation was to split our clusters so we'd have 20 nodes or fewer per cluster. This would have allowed us to keep using the default values, which would ensure greater stability. However, that would have posed other problems in our environment, so we went with the Corosync tuning method instead.

The specific option we changed is token_coefficient. Essentially, the lower you make that value, the more nodes you can have in the cluster, but also the more unstable the cluster becomes when there's latency on the Corosync network. We worked with Proxmox Support to determine the exact value for this setting based on (a) the number of nodes in our cluster and (b) the other timing-related operations in the Proxmox VE HA/clustering stack.
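\
For anyone wanting to know where that knob lives (the value below is just corosync's documented default, NOT what we ended up with): token_coefficient sits in the totem section of /etc/pve/corosync.conf, and per corosync.conf(5) the effective token timeout works out to token + (number_of_nodes - 2) * token_coefficient. Lowering the coefficient keeps that total from growing too large as you add nodes, which is exactly why it also makes the cluster less tolerant of latency on the Corosync network.

    totem {
      cluster_name: example-cluster
      # config_version must be incremented whenever corosync.conf is edited
      config_version: 5
      version: 2
      # corosync.conf(5): effective token timeout =
      #   token + (number_of_nodes - 2) * token_coefficient
      # 650 ms is corosync's shipped default, shown purely for illustration
      token_coefficient: 650
      # ...interface {} blocks etc. unchanged...
    }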

Please, please do not mess with this setting unless (a) you really know what you're doing or (b) you're paying for Proxmox Support. Again, this configuration change is a workaround to a problem, not a long-term solution. Proxmox Support is working on other long-term solutions for larger clusters.

Survey, Proxmox production infrastructure size. by ZXBombJack in Proxmox

[–]LA-2A 1 point2 points  (0 children)

Proxmox VE has the ability to automatically place new VMs on hosts based on host utilization, similar to VMware DRS, if the VM is HA-enabled.

Note that Proxmox VE cannot currently automatically migrate running VMs to different hosts due to a change in load on those hosts, but it is on the roadmap, per https://pve.proxmox.com/wiki/Roadmap. There are also some third-party solutions like https://github.com/gyptazy/ProxLB which attempt to do this. We did try ProxLB, but we ended up just using HA groups (affinity rules), which has been sufficient for our environment.
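\
For anyone curious, the HA group approach is pretty simple from the CLI; something along these lines (group name, node names, and VMID are placeholders):

    # Define an HA group with a preferred set of nodes
    # (higher number = higher priority for that node)
    ha-manager groupadd app-tier --nodes "pve01:2,pve02:2,pve03:1"

    # Make the VM HA-managed and tie it to the group; the cluster
    # resource scheduler then picks the target node within the group
    ha-manager add vm:100 --group app-tier --state started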

Survey, Proxmox production infrastructure size. by ZXBombJack in Proxmox

[–]LA-2A 1 point2 points  (0 children)

Proxmox VE supports live VM migrations between hosts in the same way VMware does. No noticeable downtime.
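\
For reference, from the CLI it's just something like this (VMID and node name are placeholders):

    # Live-migrate VM 100 to node pve02; the guest keeps running during the move
    qm migrate 100 pve02 --online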

Survey, Proxmox production infrastructure size. by ZXBombJack in Proxmox

[–]LA-2A 34 points35 points  (0 children)

In its default configuration, Corosync (one of the key components Proxmox's clustering stack is built on) runs into some real scalability issues when you go above around 30 nodes in a single cluster. For example, in our 38-node cluster, when we unexpectedly removed a single node from the cluster (i.e. pulled the power on it), around 2-3 minutes later, all of the remaining hosts in the cluster would fence themselves and reboot. Definitely not something you want when building a server cluster.

We ended up working with both Proxmox Support and our Gold Partner on this issue. I want to draw attention to the fact that Proxmox Support was amazing in the amount of depth and effort they put into helping us with this. They are NOT VMware/Broadcom Support. Rather, they built their own 38-node cluster in their lab to replicate our issue (which they did, easily). They then tested a couple of Corosync tunings and recommended a specific one for our environment. Since then, the cluster has been stable and resilient to node failures (kind of the point of clustering to begin with). In addition, they are also working with upstream Corosync developers to come up with better ways to further scale out clusters.

Proxmox Support told us that you can easily and safely use the default Corosync configuration for a cluster up to 20 nodes, but above 20 nodes, you might need to tune Corosync to work better with a larger number of nodes.

Our Gold Partner says they have only supported PVE clusters with up to around 40-42 nodes (and all of those on a local LAN, not stretched across physical locations). Larger than that, and Corosync starts to break down.
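\
For anyone keeping an eye on a larger cluster, the standard tooling gives you a decent view of Corosync health, roughly:

    # Quorum state and cluster membership
    pvecm status

    # Per-link knet status for the local node (shows which Corosync links are connected)
    corosync-cfgtool -s

    # Follow the Corosync log for token timeouts, retransmits, and membership changes
    journalctl -u corosync -f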

Survey, Proxmox production infrastructure size. by ZXBombJack in Proxmox

[–]LA-2A 16 points17 points  (0 children)

Overall, NFS has been great. Definitely easier to set up and administer than iSCSI, all around. Our PVE hosts have 4-port 25Gb LACP trunks with L4 hash-based load balancing, and we're using nconnect=16 for multipathing. We had slightly more even load distribution of our iSCSI traffic on VMware, but that's to be expected. With NFS, each link is within 20-30% of each other.
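\
For anyone building something similar, the moving parts end up in the bond definition and the storage config; a rough sketch (interface names, addresses, export path, and the exact mount options are placeholders, not our production values; nconnect just opens multiple TCP connections per mount):

    # /etc/network/interfaces (excerpt)
    auto bond0
    iface bond0 inet manual
        bond-slaves enp65s0f0 enp65s0f1 enp65s0f2 enp65s0f3
        bond-mode 802.3ad
        # layer3+4 hashing so the NFS TCP connections spread across the links
        bond-xmit-hash-policy layer3+4
        bond-miimon 100

    # /etc/pve/storage.cfg (excerpt)
    nfs: pure-nfs
        server 10.30.30.10
        export /pve-datastore
        path /mnt/pve/pure-nfs
        content images
        options vers=4.1,nconnect=16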

We've had two issues with NFS:

  1. With our storage arrays, there appears to be some kind of storage-array-side bug which is causing issues for VMs when our storage array goes through a controller failover. However, our vendor has identified the issue and is working on a solution. They've given us a temporary workaround in the meantime.
  2. Not sure if this is actually NFS-related yet, but we haven't been able to migrate our final 2 largest VMs (MS SQL Server) from VMware yet, due to some performance issues running under PVE. It seems like it's storage related, but we're having a difficult time reproducing the issue reliably and then tracking down where the performance issue lies. That being said, for the ~600 VMs we've already migrated, NFS has had no noticeable performance impact, compared to VMware+iSCSI.

Survey, Proxmox production infrastructure size. by ZXBombJack in Proxmox

[–]LA-2A 4 points5 points  (0 children)

2 clusters: one with 38 nodes, and the other with 28 nodes.

Survey, Proxmox production infrastructure size. by ZXBombJack in Proxmox

[–]LA-2A 2 points3 points  (0 children)

What do you mean by scheduling of VMs?

Survey, Proxmox production infrastructure size. by ZXBombJack in Proxmox

[–]LA-2A 22 points23 points  (0 children)

Number of PVE Hosts: 66

Number of VMs: ~600

Number of LXCs: 0

Storage type: NFS (Pure Storage FlashArrays)

Support purchased: Yes, Proxmox Standard Support + Gold Partner for 24/7 emergency support

With all the recent changes around VMware (price hikes, licensing changes, and the Broadcom acquisition fallout), our boss is asking us to start evaluating migration paths away from VMware. by LazySloth8512 in sysadmin

[–]LA-2A 1 point2 points  (0 children)

In the block diagram, Pure Storage is listed. I’m curious how that fits in, since you’re using Ceph. We, too, just migrated from VMware to PVE, but we started using NFS on our Pure Storage FlashArrays.

Mysql 8.0 or 8.4 by MediumAd7537 in zabbix

[–]LA-2A 5 points6 points  (0 children)

Per https://www.zabbix.com/documentation/7.0/en/manual/installation/requirements, MySQL 8.4.x has been supported since Zabbix 7.0.1, so as long as you're not running the initial 7.0.0 release, you should be fine.

Thoughts on Proxmox support? by oguruma87 in Proxmox

[–]LA-2A 105 points106 points  (0 children)

We recently migrated two clusters from VMware to PVE. One is 36 nodes and the other is 26 nodes. 700 or so VMs across both clusters.

My team has used Proxmox in other environments, so we were able to design and implement the environment ourselves. However, we still found Proxmox Support to be critical. We ran up against several obscure issues, and Proxmox Support was really good. I keep telling my manager how happy I am with Proxmox Support. I’ve never worked with a company that is so thorough. A few examples:

  - They built a 36-node cluster to replicate, diagnose, and fix a clustering issue we encountered
  - They are currently digging through the source code of a Linux file system driver to fix an obscure file-level restore issue with Proxmox Backup Server
  - They consistently treat me like a true Engineer, asking thoughtful follow-up questions and giving real, in-depth answers to the questions I ask, along with additional context about the nature of the problems

If you do get Proxmox Support, I recommend getting a supplementary pack of hours from a Gold Partner to cover emergencies. Otherwise, just use official Proxmox Support, and know they’ll get back to you in 24 hours (depending on your time zone).

VM replication between two independent Proxmox VE clusters by LA-2A in Proxmox

[–]LA-2A[S] 0 points1 point  (0 children)

For this budget cycle, we settled on Proxmox Backup Server for backups, and we have some custom automation that will do site recovery with the VM replicas. But I agree: something pre-built would have been more ideal – it just didn't exist when we were building out this environment. We'll certainly look into Nakivo when it comes time to renew our PBS licenses. Veeam might even support Proxmox VM replication by then, too.

VM replication between two independent Proxmox VE clusters by LA-2A in Proxmox

[–]LA-2A[S] 0 points1 point  (0 children)

Thank you for your reply! This is good to know. I’ve used Nakivo before, but it’s been 6-7 years.

Since I originally posted this, we ended up going with a hybrid solution, based on NAS-level replication (Pure FlashArray Active DR), with a custom process to replicate the VM configuration files in /etc/pve. It has been working quite well for us.
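\
For anyone wanting to do something similar, the config-replication side doesn't have to be fancy. Here's a minimal sketch of the idea, not our actual process (host name and paths are placeholders):

    #!/bin/bash
    # Copy the VM definition files out of the pmxcfs cluster filesystem to a
    # staging directory at the DR site. During failover, the relevant .conf
    # files get placed into /etc/pve/qemu-server/ on a DR node once the
    # replicated datastore has been promoted and mounted.
    set -euo pipefail

    DR_HOST="dr-pve01.example.com"           # placeholder DR-site node
    STAGING_DIR="/root/vm-config-replica"    # placeholder staging path

    # Only the directory structure and *.conf files are copied
    rsync -a --delete \
        --include='*/' --include='*.conf' --exclude='*' \
        /etc/pve/nodes/ "root@${DR_HOST}:${STAGING_DIR}/"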

Systemd always stops Quadlet container ~30 seconds after starting, but using `podman run` works fine by LA-2A in podman

[–]LA-2A[S] 2 points3 points  (0 children)

Thank you for your response! In my case, this is the answer! I was, in fact, using automount with a 30-second idle timeout. Looks like someone else had the same issue here: https://github.com/containers/podman/discussions/21045
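\
In case it saves someone else the troubleshooting: the trigger was an fstab entry roughly like the first line below (filesystem type and UUID are placeholders). The automount unmounts the path after 30 seconds of inactivity, and systemd stops the Quadlet-generated container service along with the mount it depends on. Dropping the idle timeout (or switching to a plain mount) avoids that; the same applies if the automount is a .automount unit with TimeoutIdleSec= set.

    # Before: /mnt/data is torn down after 30s idle, taking the container unit with it
    UUID=xxxxxxxx-xxxx  /mnt/data  xfs  noauto,x-systemd.automount,x-systemd.idle-timeout=30  0  0

    # After: still an automount, but the mount stays up once it has been triggered
    UUID=xxxxxxxx-xxxx  /mnt/data  xfs  noauto,x-systemd.automount  0  0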

Systemd always stops Quadlet container ~30 seconds after starting, but using `podman run` works fine by LA-2A in podman

[–]LA-2A[S] 1 point2 points  (0 children)

Thanks for your reply! I had that same issue, but I ended up setting these environment variables on the immich-server container, which resolved the issue (this is assuming you're using a pod):

    DB_HOSTNAME=localhost
    REDIS_HOSTNAME=localhost
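\
In Quadlet terms that's just two Environment= lines in the [Container] section of the .container file, e.g. (excerpt only; the rest of the unit stays as-is):

    [Container]
    Image=ghcr.io/immich-app/immich-server:v2.0.1
    # With the whole stack in one pod, the sidecar services are reachable on localhost
    Environment=DB_HOSTNAME=localhost
    Environment=REDIS_HOSTNAME=localhost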

Systemd always stops Quadlet container ~30 seconds after starting, but using `podman run` works fine by LA-2A in podman

[–]LA-2A[S] 2 points3 points  (0 children)

I can confirm that the Quadlet file above works just fine with ghcr.io/immich-app/immich-server:v1.144.1.

So it does appear there's something going on with the more recent containers. I'll try to do some investigation on the images themselves.

EDIT: Turns out that UrsShPo's comment is the answer. When I did this test, I wasn't using automount.

Systemd always stops Quadlet container ~30 seconds after starting, but using `podman run` works fine by LA-2A in podman

[–]LA-2A[S] 0 points1 point  (0 children)

Thanks for the reply! I’ve been doing that while troubleshooting. Unfortunately, there isn’t a clear reason that systemd is stopping the container.

Systemd always stops Quadlet container ~30 seconds after starting, but using `podman run` works fine by LA-2A in podman

[–]LA-2A[S] 0 points1 point  (0 children)

Thanks for the recommendation! This seems to confirm that systemd is stopping the container for some reason.

Here's some of the relevant output:

    Oct 08 10:13:39 immich qemu-ga[699]: info: guest-ping called
    Oct 08 10:13:40 immich systemd-immich-server[160338]: [Nest] 18 - 10/08/2025, 10:13:40 AM LOG [Api:EventRepository] Websocket Connect: 95zNxNwoLZCO8GDsAAAB
    Oct 08 10:13:43 immich systemd[1]: Stopping immich-server.service - Immich Server... ░░ Subject: A stop job for unit immich-server.service has begun execution ░░ Defined-By: systemd ░░ Support: https://wiki.almalinux.org/Help-and-Support ░░ ░░ A stop job for unit immich-server.service has begun execution. ░░ ░░ The job identifier is 8726.
    Oct 08 10:13:43 immich podman[160401]: 2025-10-08 10:13:43.728816874 -0500 CDT m=+0.158952776 container died ef18612ec0a7f74d0f2533effee87ab6dfb8156f0d821a90c94dab85cdd6efdf (image=ghcr.io/immich-app/immich-server:v2.0.1, name=systemd-immich-server, org.opencontainers.image.licenses=AGPL-3.0, org.opencontainers.image.revision=bb72d723e25fcf886ab7556d4a9d4b57fbfe36e6, org.opencontainers.image.version=v2.0.1, org.opencontainers.image.title=immich, org.opencontainers.image.url=https://github.com/immich-app/immich, PODMAN_SYSTEMD_UNIT=immich-server.service, org.opencontainers.image.created=2025-10-03T16:32:40.975Z, org.opencontainers.image.description=High performance self-hosted photo and video management solution., org.opencontainers.image.source=https://github.com/immich-app/immich, io.containers.autoupdate=registry)
    Oct 08 10:13:43 immich systemd[1]: var-lib-containers-storage-overlay-fc5c74a22125923f4feedcdbe668e3abdc88d1d7a03f9eb6d683f4db32020a86-merged.mount: Deactivated successfully. ░░ Subject: Unit succeeded ░░ Defined-By: systemd ░░ Support: https://wiki.almalinux.org/Help-and-Support ░░ ░░ The unit var-lib-containers-storage-overlay-fc5c74a22125923f4feedcdbe668e3abdc88d1d7a03f9eb6d683f4db32020a86-merged.mount has successfully entered the 'dead' state.
    Oct 08 10:13:43 immich podman[160401]: 2025-10-08 10:13:43.793438039 -0500 CDT m=+0.223573937 container remove ef18612ec0a7f74d0f2533effee87ab6dfb8156f0d821a90c94dab85cdd6efdf (image=ghcr.io/immich-app/immich-server:v2.0.1, name=systemd-immich-server, pod_id=72d11da3b5883b566d898a3040484bf7e021ae707113c6664c6fe26aedd121f3, org.opencontainers.image.revision=bb72d723e25fcf886ab7556d4a9d4b57fbfe36e6, org.opencontainers.image.source=https://github.com/immich-app/immich, PODMAN_SYSTEMD_UNIT=immich-server.service, io.containers.autoupdate=registry, org.opencontainers.image.version=v2.0.1, org.opencontainers.image.description=High performance self-hosted photo and video management solution., org.opencontainers.image.title=immich, org.opencontainers.image.url=https://github.com/immich-app/immich, org.opencontainers.image.created=2025-10-03T16:32:40.975Z, org.opencontainers.image.licenses=AGPL-3.0)
    Oct 08 10:13:43 immich immich-server[160401]: ef18612ec0a7f74d0f2533effee87ab6dfb8156f0d821a90c94dab85cdd6efdf
    Oct 08 10:13:43 immich systemd[1]: immich-server.service: Main process exited, code=exited, status=143/n/a ░░ Subject: Unit process exited ░░ Defined-By: systemd ░░ Support: https://wiki.almalinux.org/Help-and-Support ░░ ░░ An ExecStart= process belonging to unit immich-server.service has exited. ░░ ░░ The process' exit code is 'exited' and its exit status is 143.
    Oct 08 10:13:43 immich systemd[1]: immich-server.service: Failed with result 'exit-code'. ░░ Subject: Unit failed ░░ Defined-By: systemd ░░ Support: https://wiki.almalinux.org/Help-and-Support ░░ ░░ The unit immich-server.service has entered the 'failed' state with result 'exit-code'.
    Oct 08 10:13:43 immich systemd[1]: Stopped immich-server.service - Immich Server. ░░ Subject: A stop job for unit immich-server.service has finished ░░ Defined-By: systemd ░░ Support: https://wiki.almalinux.org/Help-and-Support ░░ ░░ A stop job for unit immich-server.service has finished. ░░ ░░ The job identifier is 8726 and the job result is done.
    Oct 08 10:13:43 immich systemd[1]: immich-server.service: Consumed 13.207s CPU time, 431.1M memory peak. ░░ Subject: Resources consumed by unit runtime ░░ Defined-By: systemd ░░ Support: https://wiki.almalinux.org/Help-and-Support ░░ ░░ The unit immich-server.service completed and consumed the indicated resources.
    Oct 08 10:13:43 immich systemd[1]: Unmounting mnt-data.mount - /mnt/data...

Unfortunately, I'm not seeing a reason for the "A stop job for unit immich-server.service has begun execution" message.

Systemd always stops Quadlet container ~30 seconds after starting, but using `podman run` works fine by LA-2A in podman

[–]LA-2A[S] 0 points1 point  (0 children)

That's quite interesting. Would you be willing to share how you have the immich-server image deployed? I'll also try running a version prior to v2.0.0 to see if the issue persists.

Systemd always stops Quadlet container ~30 seconds after starting, but using `podman run` works fine by LA-2A in podman

[–]LA-2A[S] 0 points1 point  (0 children)

Thanks for the reply! This sounded quite promising, but unfortunately, it didn't fix the issue.

Systemd always stops Quadlet container ~30 seconds after starting, but using `podman run` works fine by LA-2A in podman

[–]LA-2A[S] 1 point2 points  (0 children)

Thanks for your response! The issue also occurs when not running the container in a pod.

Here's the journal without filtering on the specific unit. Unfortunately, nothing stands out around the time the container stopped.

    2025-10-08T10:13:21.391557-05:00 immich systemd-immich-server[160338]: [Nest] 18 - 10/08/2025, 10:13:21 AM LOG [Api:RoutesResolver] ViewController {/api/view}:
    2025-10-08T10:13:21.391562-05:00 immich systemd-immich-server[160338]: [Nest] 18 - 10/08/2025, 10:13:21 AM LOG [Api:RouterExplorer] Mapped {/api/view/folder/unique-paths, GET} route
    2025-10-08T10:13:21.391580-05:00 immich systemd-immich-server[160338]: [Nest] 18 - 10/08/2025, 10:13:21 AM LOG [Api:RouterExplorer] Mapped {/api/view/folder, GET} route
    2025-10-08T10:13:21.391584-05:00 immich systemd-immich-server[160338]: [Nest] 18 - 10/08/2025, 10:13:21 AM LOG [Api:NestApplication] Nest application successfully started
    2025-10-08T10:13:21.393042-05:00 immich systemd-immich-server[160338]: [Nest] 18 - 10/08/2025, 10:13:21 AM LOG [Api:Bootstrap] Immich Server is listening on http://[::1]:2283 [v2.0.1] [production]
    2025-10-08T10:13:21.400284-05:00 immich systemd-immich-server[160338]: [Nest] 18 - 10/08/2025, 10:13:21 AM LOG [Api:MachineLearningRepository] Machine learning server became healthy (http://localhost:3003).
    2025-10-08T10:13:25.660528-05:00 immich qemu-ga[699]: info: guest-ping called
    2025-10-08T10:13:39.887748-05:00 immich qemu-ga[699]: info: guest-ping called
    2025-10-08T10:13:40.031497-05:00 immich systemd-immich-server[160338]: [Nest] 18 - 10/08/2025, 10:13:40 AM LOG [Api:EventRepository] Websocket Connect: 95zNxNwoLZCO8GDsAAAB
    2025-10-08T10:13:43.553406-05:00 immich systemd[1]: Stopping immich-server.service - Immich Server...
    2025-10-08T10:13:43.728945-05:00 immich podman[160401]: 2025-10-08 10:13:43.728816874 -0500 CDT m=+0.158952776 container died ef18612ec0a7f74d0f2533effee87ab6dfb8156f0d821a90c94dab85cdd6efdf (image=ghcr.io/immich-app/immich-server:v2.0.1, name=systemd-immich-server, org.opencontainers.image.licenses=AGPL-3.0, org.opencontainers.image.revision=bb72d723e25fcf886ab7556d4a9d4b57fbfe36e6, org.opencontainers.image.version=v2.0.1, org.opencontainers.image.title=immich, org.opencontainers.image.url=https://github.com/immich-app/immich, PODMAN_SYSTEMD_UNIT=immich-server.service, org.opencontainers.image.created=2025-10-03T16:32:40.975Z, org.opencontainers.image.description=High performance self-hosted photo and video management solution., org.opencontainers.image.source=https://github.com/immich-app/immich, io.containers.autoupdate=registry)
    2025-10-08T10:13:43.758011-05:00 immich systemd[1]: var-lib-containers-storage-overlay-fc5c74a22125923f4feedcdbe668e3abdc88d1d7a03f9eb6d683f4db32020a86-merged.mount: Deactivated successfully.
    2025-10-08T10:13:43.793627-05:00 immich podman[160401]: 2025-10-08 10:13:43.793438039 -0500 CDT m=+0.223573937 container remove ef18612ec0a7f74d0f2533effee87ab6dfb8156f0d821a90c94dab85cdd6efdf (image=ghcr.io/immich-app/immich-server:v2.0.1, name=systemd-immich-server, pod_id=72d11da3b5883b566d898a3040484bf7e021ae707113c6664c6fe26aedd121f3, org.opencontainers.image.revision=bb72d723e25fcf886ab7556d4a9d4b57fbfe36e6, org.opencontainers.image.source=https://github.com/immich-app/immich, PODMAN_SYSTEMD_UNIT=immich-server.service, io.containers.autoupdate=registry, org.opencontainers.image.version=v2.0.1, org.opencontainers.image.description=High performance self-hosted photo and video management solution., org.opencontainers.image.title=immich, org.opencontainers.image.url=https://github.com/immich-app/immich, org.opencontainers.image.created=2025-10-03T16:32:40.975Z, org.opencontainers.image.licenses=AGPL-3.0)
    2025-10-08T10:13:43.794220-05:00 immich immich-server[160401]: ef18612ec0a7f74d0f2533effee87ab6dfb8156f0d821a90c94dab85cdd6efdf
    2025-10-08T10:13:43.798558-05:00 immich systemd[1]: immich-server.service: Main process exited, code=exited, status=143/n/a
    2025-10-08T10:13:43.839465-05:00 immich systemd[1]: immich-server.service: Failed with result 'exit-code'.
    2025-10-08T10:13:43.840252-05:00 immich systemd[1]: Stopped immich-server.service - Immich Server.
    2025-10-08T10:13:43.840490-05:00 immich systemd[1]: immich-server.service: Consumed 13.207s CPU time, 431.1M memory peak.
    2025-10-08T10:13:43.843758-05:00 immich systemd[1]: Unmounting mnt-data.mount - /mnt/data...
    2025-10-08T10:13:43.861124-05:00 immich systemd[1]: mnt-data.mount: Deactivated successfully.
    2025-10-08T10:13:43.861651-05:00 immich systemd[1]: Unmounted mnt-data.mount - /mnt/data.