
Proxmox Corosync Troubleshooting: Fix Common Cluster Communication Issues

Troubleshoot Proxmox corosync problems including totem timeouts, multicast vs unicast issues, firewall port configuration, and rejoining failed nodes.


Understanding Corosync in Proxmox

Corosync is the communication backbone of every Proxmox VE cluster. It handles node membership, messaging between nodes, and quorum tracking. When corosync has problems, your entire cluster can grind to a halt: VMs may freeze, the web interface becomes unresponsive, and configuration changes fail. Knowing how to diagnose and fix corosync issues is a critical skill for any Proxmox administrator.
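Quorum means a strict majority of votes: for n nodes, floor(n/2) + 1. The arithmetic below is an illustrative sketch of the numbers `pvecm status` reports, not a Proxmox command:

```shell
# Quorum is a strict majority of expected votes: floor(n/2) + 1
quorum_for() {
  local nodes=$1
  echo $(( nodes / 2 + 1 ))
}

quorum_for 3   # 2: a 3-node cluster survives one node failure
quorum_for 5   # 3: a 5-node cluster survives two failures
quorum_for 2   # 2: a 2-node cluster cannot lose either node
```

This is why odd-sized clusters are recommended: adding a fourth node to a three-node cluster raises the quorum requirement without adding failure tolerance.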

Reading Corosync Logs

Your first step in any corosync troubleshooting session should be checking the logs. Corosync logs to the systemd journal.

# View recent corosync log entries
journalctl -u corosync --no-pager -n 100

# Follow corosync logs in real time
journalctl -u corosync -f

# Filter for warnings and above only
journalctl -u corosync -p warning -n 50

# Check corosync status
systemctl status corosync

# Common log messages to look for:
# "Retransmit List" - network issues causing packet loss
# "new membership" - node joined or left
# "Token has not been received" - a node stopped responding
# "failed to receive" - communication breakdown
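A quick triage pass can count how often these patterns appear. This is a hypothetical helper; the heredoc lines stand in for real `journalctl -u corosync` output:

```shell
# Count corosync warning patterns in a saved journal dump (sample data below)
log=$(cat <<'EOF'
Jan 10 12:00:01 pve1 corosync[1234]:   [TOTEM ] Retransmit List: 1a 1b 1c
Jan 10 12:00:05 pve1 corosync[1234]:   [TOTEM ] A new membership (1.2.3) was formed.
Jan 10 12:00:09 pve1 corosync[1234]:   [TOTEM ] Retransmit List: 2a
EOF
)

retransmits=$(printf '%s\n' "$log" | grep -c 'Retransmit List')
memberships=$(printf '%s\n' "$log" | grep -c 'new membership')
echo "retransmits=$retransmits memberships=$memberships"
```

A steadily climbing retransmit count points at packet loss on the corosync link; frequent membership changes suggest nodes flapping in and out of the cluster.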

Totem Timeout Issues

The totem protocol is how corosync nodes communicate. Timeout issues typically indicate network problems between nodes: high latency, packet loss, or misconfigured network interfaces.

# View totem configuration in corosync.conf
grep -A 10 "totem {" /etc/pve/corosync.conf

# Typical totem section (token/consensus are usually omitted; the values
# shown here are the corosync defaults):
# totem {
#     version: 2
#     cluster_name: proxmox
#     transport: knet
#     token: 1000       # ms without the token before a node is declared dead
#     consensus: 1200   # defaults to 1.2 x token
# }

# For high-latency networks (e.g., across datacenters), increase timeouts
# Edit /etc/pve/corosync.conf (increment config_version!)
# totem {
#     version: 2
#     cluster_name: proxmox
#     transport: knet
#     token: 5000
#     consensus: 10000
# }
Always increment the config_version number when editing corosync.conf manually. Failure to do so will cause your changes to be silently rejected.
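A sketch of bumping config_version safely, performed on a scratch copy so nothing touches the live cluster (the path and values are illustrative):

```shell
# Work on a scratch copy; on a real cluster you would edit /etc/pve/corosync.conf
cp=/tmp/corosync.conf.demo
cat > "$cp" <<'EOF'
totem {
  cluster_name: proxmox
  config_version: 4
  version: 2
}
EOF

# Read the current config_version, add one, and rewrite the line in place
cur=$(awk '/config_version:/ {print $2}' "$cp")
sed -i "s/config_version: $cur/config_version: $((cur + 1))/" "$cp"
grep config_version "$cp"   # now reads: config_version: 5
```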

Multicast vs. Unicast (Kronosnet)

Modern Proxmox VE versions use Kronosnet (knet) as the default transport, which is unicast-based. Older clusters may still use legacy multicast (udp) or legacy unicast (udpu). If you are upgrading from an older version, mixed transport types between nodes will prevent communication.

# Check current transport
grep "transport:" /etc/pve/corosync.conf

# knet (default in PVE 6+) - uses unicast, more reliable
# udp  - legacy multicast
# udpu - legacy unicast

# If using knet, verify kronosnet links
corosync-cfgtool -s

# Example output for healthy knet links:
# Link 0:
#   addr = 192.168.1.10
#   status:
#     nodeid: 2 link: enabled connected

# Test network connectivity between nodes on the corosync link
ping -I 192.168.1.10 192.168.1.11

If your nodes are behind NAT or on networks that block multicast, make sure you are using knet or udpu transport.
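A small check over saved `corosync-cfgtool -s` output can confirm every knet link is up. This is a hypothetical helper; the heredoc mimics the output format shown above:

```shell
# Verify all knet links report "connected" (sample output stands in for
# real `corosync-cfgtool -s` text)
status=$(cat <<'EOF'
Link 0:
  addr = 192.168.1.10
  status:
    nodeid: 2 link: enabled connected
    nodeid: 3 link: enabled connected
EOF
)

links=$(printf '%s\n' "$status" | grep -c 'nodeid:')
up=$(printf '%s\n' "$status" | grep -c 'connected')
if [ "$links" -eq "$up" ]; then
  echo "all $up links connected"
else
  echo "degraded: $up/$links links connected"
fi
```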

Firewall Port Configuration

Corosync requires specific ports to be open between all cluster nodes. Blocked ports are one of the most common causes of cluster communication failures.

# Required ports for Proxmox cluster communication:
# UDP 5405-5412      - Corosync/Kronosnet cluster communication
# TCP 22             - SSH (for node management and migration)
# TCP 8006           - Proxmox web interface
# TCP 3128           - SPICE proxy
# TCP 60000-60050    - Live migration

# Check if ports are open with ss
ss -tulnp | grep -E '5405|5406|5407'

# Probe another node's corosync port (UDP, hence -u)
nc -zvu pve2 5405

# If using Proxmox firewall, allow cluster communication
# In /etc/pve/firewall/cluster.fw:
# [RULES]
# IN ACCEPT -p udp -dport 5405:5412

# If using iptables directly (corosync/knet uses UDP only)
iptables -A INPUT -p udp --dport 5405:5412 -j ACCEPT
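To probe the whole range in one go, a simple loop works. This is a sketch; "pve2" is a placeholder hostname, and nc's UDP mode gives only a rough open/closed hint, so the actual probe is left commented:

```shell
# Probe each corosync port on a peer node (hostname is a placeholder)
probe_ports() {
  local host=$1
  for port in $(seq 5405 5412); do
    echo "checking $host udp/$port"
    # nc -zuv "$host" "$port"   # uncomment on a real cluster node
  done
}

probe_ports pve2
```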

Rejoining a Failed Node

When a node loses cluster communication and cannot rejoin automatically, you may need to manually intervene. This often happens after a network outage or when a node was offline for an extended period.

# On the failed node, check corosync status
systemctl status corosync

# Try restarting corosync
systemctl restart corosync

# If that fails, check if the node can reach other cluster nodes
ping pve1
nc -zv pve1 5405

# If cluster filesystem is stuck, restart pve-cluster
systemctl restart pve-cluster
systemctl restart corosync

# If the node's corosync config is corrupted, copy from a working node
# On working node:
scp /etc/pve/corosync.conf pve3:/tmp/

# On the failed node:
systemctl stop pve-cluster corosync
pmxcfs -l
cp /tmp/corosync.conf /etc/pve/corosync.conf
killall pmxcfs
systemctl start pve-cluster corosync
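Before copying configs around, it is worth confirming the versions actually differ. A sketch of that comparison, using stand-in files in place of copies pulled from each node:

```shell
# Compare config_version between a working node's config and the failed
# node's config (the two files below are illustrative stand-ins)
cat > /tmp/conf_working <<'EOF'
totem {
  config_version: 7
}
EOF
cat > /tmp/conf_failed <<'EOF'
totem {
  config_version: 5
}
EOF

v1=$(awk '/config_version:/ {print $2}' /tmp/conf_working)
v2=$(awk '/config_version:/ {print $2}' /tmp/conf_failed)
if [ "$v1" != "$v2" ]; then
  echo "config mismatch: working=$v1 failed=$v2 -- copy the newer file over"
fi
```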

Diagnosing Network Issues

Many corosync problems stem from the underlying network. Use these commands to check for issues that monitoring tools like ProxmoxR might flag in your cluster overview.

# Check for packet loss between nodes
ping -c 100 -i 0.1 pve2 | tail -3

# Measure latency and jitter
mtr -rwc 50 pve2

# Check the network interface used by corosync
ip addr show

# Verify the correct interface is in corosync.conf
grep -A 5 "interface {" /etc/pve/corosync.conf

# Look for network errors on the interface
ethtool -S ens18 | grep -i error
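The ping summary line can be parsed directly to extract the loss figure. A hypothetical parse, with sample text standing in for real `ping -c 100` output:

```shell
# Pull the packet-loss percentage out of ping's summary line
summary="100 packets transmitted, 97 received, 3% packet loss, time 9912ms"
loss=$(printf '%s\n' "$summary" | sed -n 's/.* \([0-9.]*\)% packet loss.*/\1/p')
echo "loss=${loss}%"
```

Even low single-digit loss on the corosync link tends to surface as Retransmit List entries in the corosync journal, so treat anything above roughly 1% as worth investigating.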

Quick Reference: Corosync Troubleshooting Checklist

When corosync is misbehaving, work through this checklist systematically:

  • Check logs with journalctl -u corosync
  • Verify network connectivity between all nodes on the cluster network
  • Confirm firewall ports 5405-5412 are open (both TCP and UDP)
  • Ensure config_version in corosync.conf is consistent across nodes
  • Verify time synchronization (NTP) across all nodes
  • Check that the correct network interface and IP are configured in corosync.conf
  • Restart corosync and pve-cluster services if needed
  • Increase totem timeouts for high-latency links

Keeping corosync healthy is fundamental to cluster stability. Most issues come down to network problems, firewall rules, or configuration mismatches between nodes.
