Rolling Upgrade of a Proxmox VE Cluster
How to perform a rolling upgrade of a Proxmox VE cluster with zero downtime, including node ordering, VM migration, HA group adjustments, and verification steps between nodes.
Why Rolling Upgrades Matter
A Proxmox VE cluster running production workloads cannot afford a complete shutdown for upgrades. The rolling upgrade strategy lets you upgrade one node at a time while the remaining nodes continue serving VMs and containers. This approach works for both minor point releases and major version upgrades (such as PVE 7 to 8 or PVE 8 to 9), though major upgrades require more careful planning.
Planning the Node Upgrade Order
The order in which you upgrade nodes matters significantly. Follow these principles:
- Start with a non-quorum-critical node. In a 3-node cluster, any single node can go down without losing quorum. In larger clusters, identify nodes whose absence will not break the majority.
- Save the most critical node for last. The node running your most important workloads or acting as the primary Ceph monitor should be the final node upgraded.
- Never upgrade more than one node at a time. Complete each node fully before starting the next.
# Check current cluster status and quorum
pvecm status
# View all nodes and their status
pvecm nodes
# Check which node is the current quorum leader
corosync-quorumtool -s
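Before touching any node, it can help to gate the whole run on quorum. A minimal sketch, wrapping the check as a function so it can be exercised against captured output — the `Quorate:` line format is assumed from `pvecm status` on PVE 7/8:

```shell
#!/bin/bash
# is_quorate: reads `pvecm status` output on stdin and succeeds only if
# the cluster reports "Quorate: Yes".
is_quorate() {
    grep -q 'Quorate:[[:space:]]*Yes'
}

# On a live node (sketch):
#   pvecm status | is_quorate || { echo "Cluster not quorate, aborting"; exit 1; }
```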
Step 1: Adjust HA Groups and Resources
Before removing a node from service, you need to handle High Availability resources. If HA is configured, VMs managed by the HA system will be automatically restarted on another node if the current node goes down. However, a planned upgrade benefits from a controlled migration rather than an HA failover event.
# List HA resources
ha-manager status
# Check HA groups
cat /etc/pve/ha/groups.cfg
# Temporarily restrict HA resources from running on the node being upgraded
# Option 1: Set the node to maintenance mode (PVE 8+)
ha-manager crm-command node-maintenance enable node1
# Option 2: Manually migrate HA-managed VMs
ha-manager migrate vm:100 node2
ha-manager migrate vm:101 node2
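If many HA resources sit on the node, migrating them one by one gets tedious. A hedged sketch of a helper that extracts the service IDs placed on a given node by parsing `ha-manager status` output — the line format (`service vm:100 (node1, started)`) is assumed from PVE 7/8 and should be verified against your version:

```shell
#!/bin/bash
# ha_sids_on_node: reads `ha-manager status` output on stdin and prints the
# service IDs (e.g. vm:100) currently placed on the node given as $1.
# Assumes status lines of the form: "service vm:100 (node1, started)".
ha_sids_on_node() {
    awk -v n="$1" '$1 == "service" && $3 == "(" n "," {print $2}'
}

# Usage sketch on a live cluster:
#   for sid in $(ha-manager status | ha_sids_on_node node1); do
#       ha-manager migrate "$sid" node2
#   done
```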
Step 2: Migrate All Workloads Off the Target Node
Move all VMs and containers to other cluster nodes. Use live migration for running VMs to avoid any downtime for the guests:
# List running VMs on this node (qm list only shows local guests)
qm list | grep running
# Live migrate a running VM to another node
qm migrate 100 node2 --online
# Migrate a container (containers cannot truly live migrate;
# --restart stops, migrates, and restarts a running CT)
pct migrate 200 node2 --restart
# For VMs with local disk that cannot live migrate,
# shut down first then migrate
qm shutdown 103
qm migrate 103 node2
qm start 103
# Verify no VMs or containers remain on the target node
qm list
pct list
For bulk migration, you can script the process:
#!/bin/bash
TARGET_NODE="node2"

# Migrate all running VMs (live)
for vmid in $(qm list | awk '/running/ {print $1}'); do
    echo "Migrating VM $vmid to $TARGET_NODE"
    qm migrate "$vmid" "$TARGET_NODE" --online
done

# Migrate all running containers (restart mode; CTs cannot live migrate)
for ctid in $(pct list | awk '/running/ {print $1}'); do
    echo "Migrating CT $ctid to $TARGET_NODE"
    pct migrate "$ctid" "$TARGET_NODE" --restart
done
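Before rebooting, it is worth a final machine-checkable confirmation that the node is empty. A small sketch of a helper that counts running guests from `qm list` or `pct list` output, written to read stdin so it can be tested against captured output:

```shell
#!/bin/bash
# count_running: reads `qm list` or `pct list` output on stdin and prints
# how many guests are in the "running" state.
count_running() {
    awk '/running/ {n++} END {print n+0}'
}

# On the node being upgraded (sketch): refuse to proceed until empty.
#   [ "$(qm list | count_running)" -eq 0 ]  || { echo "VMs still running"; exit 1; }
#   [ "$(pct list | count_running)" -eq 0 ] || { echo "CTs still running"; exit 1; }
```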
Step 3: Upgrade the Node
With the node evacuated, perform the upgrade. For a minor update:
# Standard update
apt update && apt dist-upgrade -y
# Reboot to apply kernel changes
reboot
For a major version upgrade (e.g., PVE 7 to 8), follow the full major upgrade procedure: update APT sources, run the checker script, perform dist-upgrade, and handle all dpkg prompts. See the dedicated PVE major upgrade guides for complete instructions.
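As a rough sketch of the sources step for a PVE 7 to 8 upgrade: Debian `bullseye` entries are retargeted to `bookworm`. The function below is demonstrated against a scratch file; on a real node the edits apply to /etc/apt/sources.list and the files under /etc/apt/sources.list.d/ (including the enterprise or no-subscription repository lists):

```shell
#!/bin/bash
# retarget_sources: rewrite a sources file in place, replacing the old
# Debian codename with the new one (bullseye -> bookworm for PVE 7 -> 8).
retarget_sources() {    # $1 = sources file to rewrite in place
    sed -i 's/bullseye/bookworm/g' "$1"
}

# Proxmox ships a readiness checker with each major release; run it first:
#   pve7to8 --full
```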
Step 4: Verify the Node After Upgrade
After the node reboots, verify it has rejoined the cluster and is functioning correctly before proceeding:
# Check node has rejoined the cluster
pvecm status
pvecm nodes
# Verify Proxmox version
pveversion -v
# Check running kernel
uname -r
# Verify core services are running (pve-cluster backs /etc/pve)
systemctl status pve-cluster corosync pvedaemon pveproxy pvestatd
# Check storage availability
pvesm status
# If using Ceph, verify Ceph health
ceph -s
ceph osd tree
# Test web UI access on this node
curl -sk https://localhost:8006 | head -5
If the node fails to rejoin the cluster, check corosync logs (journalctl -u corosync) and ensure the corosync configuration is consistent across all nodes. During major version upgrades, temporary corosync protocol mismatches between old and new versions are possible but typically resolve once both sides negotiate.
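One quick way to spot configuration drift is to compare a digest of the corosync config on every node. A sketch, with placeholder node names and assuming root SSH between cluster nodes:

```shell
#!/bin/bash
# config_digest: print the SHA-256 digest of a config file, for comparing
# /etc/corosync/corosync.conf across nodes after an upgrade.
config_digest() {    # $1 = path to a config file
    sha256sum "$1" | awk '{print $1}'
}

# On any node (sketch): all digests should be identical.
#   for node in node1 node2 node3; do
#       ssh "$node" "sha256sum /etc/corosync/corosync.conf"
#   done
```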
Step 5: Migrate Some Workloads Back (Optional)
You can optionally migrate some VMs back to the upgraded node to rebalance the cluster before moving to the next node. This is especially important if the remaining nodes are under heavy memory or CPU pressure:
# Move some VMs back to the upgraded node
qm migrate 100 node1 --online
pct migrate 200 node1 --online
Step 6: Repeat for Each Remaining Node
Proceed to the next node in your planned order. For each node, repeat the full cycle: adjust HA settings, migrate workloads off, upgrade, reboot, verify, optionally rebalance. A 3-node cluster typically takes 1-3 hours for minor updates or 3-6 hours for major version upgrades.
Post-Upgrade: Restore HA and Final Verification
After all nodes are upgraded, restore normal HA operation and verify the complete cluster:
# Disable maintenance mode on all nodes
ha-manager crm-command node-maintenance disable node1
ha-manager crm-command node-maintenance disable node2
ha-manager crm-command node-maintenance disable node3
# Verify HA status
ha-manager status
# Verify all nodes show the same version
pvecm nodes
for node in node1 node2 node3; do
ssh $node pveversion
done
# If using Ceph, verify all OSDs are up
ceph osd tree
ceph -s
Rolling upgrades require discipline and patience, but they are the safest way to maintain a Proxmox cluster. Keeping track of which nodes are upgraded, which still need attention, and the overall health of your cluster during the process is critical. ProxmoxR provides a centralized view of your cluster's status that is especially valuable during rolling upgrades, letting you confirm each node is healthy before moving to the next.
Take Proxmox management mobile
All the features discussed in this guide — accessible from your phone with ProxmoxR. Real-time monitoring, power control, firewall management, and more.