
Proxmox VE Best Practices: A Production Checklist for Reliable Infrastructure

A comprehensive production checklist for Proxmox VE covering storage pool separation, VLAN segmentation, regular backups with PBS, monitoring setup, two-factor authentication, API tokens, update scheduling, and high availability configuration.

Building a Reliable Proxmox Environment

Running Proxmox VE in production is straightforward, but doing it well requires planning. This checklist covers the essential practices that separate a resilient infrastructure from one that breaks at the worst possible time. Whether you are managing a small business cluster or a large deployment, these recommendations will help you avoid common pitfalls.

1. Separate Your Storage Pools

One of the most common mistakes is putting everything on a single storage pool. Separate your storage by function to improve performance, simplify management, and reduce blast radius when issues occur.

# Recommended storage layout:
# Pool 1: local-lvm or local-zfs  — VM/container disks (fast SSD/NVMe)
# Pool 2: backups                  — Backup storage (large HDD or NFS)
# Pool 3: iso-templates            — ISO images and container templates
# Pool 4: ceph-pool (if applicable) — Distributed storage for HA workloads

# Example: Create a separate ZFS pool for backups
# (in production, prefer stable /dev/disk/by-id/... paths; sdX names can change between boots)
zpool create backup-pool mirror /dev/sdc /dev/sdd
pvesm add dir backup-storage --path /backup-pool --content backup

Never store backups on the same physical storage as your VMs. If that storage fails, you lose both your production data and your ability to recover.

2. Implement VLAN Segmentation

Flat networks are simple but risky. Segment your traffic using VLANs to isolate different types of communication:

# Recommended VLAN layout:
# VLAN 10  — Management (Proxmox web UI, SSH, API)
# VLAN 20  — VM/container production traffic
# VLAN 30  — Storage network (Ceph, NFS, iSCSI)
# VLAN 40  — Cluster/Corosync communication
# VLAN 50  — Backup traffic

# Configure a VLAN-aware bridge in /etc/network/interfaces
auto vmbr0
iface vmbr0 inet manual
    bridge-ports eno1
    bridge-stp off
    bridge-fd 0
    bridge-vlan-aware yes
    bridge-vids 10 20 30 40 50

At minimum, keep management traffic separate from VM production traffic and storage traffic. This prevents a misbehaving VM from affecting your ability to manage the hypervisor.
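Guests join a VLAN by tagging their virtual NIC on the VLAN-aware bridge. The sketch below builds and validates the `--net0` value for `qm set`. It assumes the `vmbr0` bridge from the example above; the helper name `net0_arg`, the virtio NIC model, and VMID 100 are illustrative assumptions rather than part of the checklist.

```shell
# Build the --net0 value for a guest NIC on a given VLAN.
# net0_arg is a hypothetical helper; vmbr0 matches the bridge defined above.
net0_arg() {
    vlan=$1
    bridge=${2:-vmbr0}
    # Reject anything that is not a plain number.
    case $vlan in
        ''|*[!0-9]*) echo "invalid VLAN ID: $vlan" >&2; return 1 ;;
    esac
    # Valid 802.1Q VLAN IDs are 1-4094.
    if [ "$vlan" -lt 1 ] || [ "$vlan" -gt 4094 ]; then
        echo "VLAN ID out of range: $vlan" >&2
        return 1
    fi
    printf 'virtio,bridge=%s,tag=%s\n' "$bridge" "$vlan"
}

# Example (run on a Proxmox node; VMID 100 is an assumption):
#   qm set 100 --net0 "$(net0_arg 20)"
```

Validating the tag before calling `qm set` catches typos such as a five-digit VLAN ID before they reach the guest configuration. Note that this replaces the guest's existing `net0` definition.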

3. Regular Backups with Proxmox Backup Server

Every production environment needs a backup strategy that is automated, verified, and tested. Use Proxmox Backup Server (PBS) for deduplication and incremental backups, and schedule backup jobs to run during off-peak hours.

# Recommended backup schedule:
# - Daily incremental backups of all VMs/containers
# - Retention policy: keep-daily=7, keep-weekly=4, keep-monthly=6
# - Verify jobs: run weekly to check backup integrity
# - Test restore: monthly, pick one VM and restore it to verify

# Example vzdump job configuration in /etc/pve/jobs.cfg
# (schedule uses Proxmox calendar-event syntax, not cron fields)
vzdump: daily-backup
    schedule 02:00
    storage pbs-backup
    mode snapshot
    all 1
    mailnotification failure
    enabled 1

Critically, test your restores regularly. A backup you have never tested is a backup you cannot trust.

4. Set Up Monitoring

Proxmox provides basic monitoring in the web UI, but production environments need proactive alerting. Configure notifications for disk space, failed backups, HA events, and hardware issues.

# Configure email notifications in /etc/pve/notifications.cfg
# Or use the Datacenter > Notifications section in the web UI

# Monitor critical metrics:
# - Disk space (alert at 80% usage)
# - ZFS pool health (scrub errors)
# - Backup job results
# - Cluster quorum status
# - CPU and RAM usage trends

# Quick CLI health check
pvesh get /cluster/resources --type node --output-format json-pretty
zpool status
pvecm status
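The 80% disk-space threshold listed above can be turned into a check that runs from cron or a systemd timer. This is a minimal sketch, not a replacement for a full monitoring stack. The helper names `usage_level` and `report_full_filesystems` are hypothetical, and the parsing assumes the POSIX output format of `df -P`.

```shell
# Print "ALERT" when usage is at or above the limit, otherwise "OK".
# usage_level is a hypothetical helper; the 80% default mirrors the checklist.
usage_level() {
    used=$1
    limit=${2:-80}
    if [ "$used" -ge "$limit" ]; then
        echo "ALERT"
    else
        echo "OK"
    fi
}

# Read `df -P` output on stdin and report every filesystem over the limit.
report_full_filesystems() {
    limit=${1:-80}
    # Skip the header row, keep rows with a numeric capacity,
    # strip the % sign, and emit "percent mountpoint".
    awk 'NR > 1 && $5 ~ /^[0-9]+%$/ { sub(/%/, "", $5); print $5, $6 }' |
    while read -r pct mount; do
        if [ "$(usage_level "$pct" "$limit")" = "ALERT" ]; then
            echo "ALERT: $mount at ${pct}%"
        fi
    done
}

# Example: df -P | report_full_filesystems 80
```

Because the function reads from stdin, the same logic can be exercised against saved `df` output before it is wired into a notification target.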

For on-the-go monitoring, ProxmoxR lets you check node health, VM status, and resource usage from your phone — useful for staying aware of issues even when you are away from your workstation.

5. Enable Two-Factor Authentication

Every admin account should have 2FA enabled. Proxmox supports TOTP (time-based one-time passwords) and WebAuthn/U2F hardware keys. This is configured per-user through the web UI under Datacenter > Permissions > Two Factor.

# Force 2FA requirement for all users in a realm
# Datacenter > Permissions > Realms > Edit > TFA: select "OATH/TOTP" or "Yubico"
# (WebAuthn is enrolled per user and cannot be enforced at the realm level)

# For PAM users, TOTP is the most practical option
# Each user sets up their own TOTP through:
# Username dropdown (top right) > TFA

6. Use API Tokens Instead of Passwords

If you automate tasks using the Proxmox API, never hardcode user passwords. Create dedicated API tokens with minimum required privileges:

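# Create a user for automation and grant it a role on the /vms path only
pveum user add automation@pve
pveum acl modify /vms --users automation@pve --roles PVEVMAdmin

# Create an API token (save the secret: it is shown only once)
# --privsep 0 gives the token the same permissions as its user
pveum user token add automation@pve monitoring-token --privsep 0

# Use the token in API calls
# (-k skips TLS verification; drop it once the node has a trusted certificate)
# /cluster/resources returns only the guests this user is allowed to see
curl -k -H "Authorization: PVEAPIToken=automation@pve!monitoring-token=uuid-secret" \
  "https://proxmox:8006/api2/json/cluster/resources?type=vm"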

Tokens can be revoked individually without affecting the user account or other tokens.
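To keep the token secret out of scripts and version control, the header can be assembled from environment variables at run time. The sketch below assumes two hypothetical variable names, `PVE_TOKEN_ID` and `PVE_TOKEN_SECRET`; how they are populated (a secrets manager, a root-only env file) is left open.

```shell
# Build the Authorization header from environment variables so the
# secret never appears in the script itself.
# PVE_TOKEN_ID and PVE_TOKEN_SECRET are hypothetical variable names.
pve_auth_header() {
    if [ -z "${PVE_TOKEN_ID:-}" ] || [ -z "${PVE_TOKEN_SECRET:-}" ]; then
        echo "PVE_TOKEN_ID and PVE_TOKEN_SECRET must be set" >&2
        return 1
    fi
    printf 'Authorization: PVEAPIToken=%s=%s\n' "$PVE_TOKEN_ID" "$PVE_TOKEN_SECRET"
}

# Example (the host name "proxmox" is taken from the curl call above;
# add -k only if the node still uses its self-signed certificate):
#   export PVE_TOKEN_ID='automation@pve!monitoring-token'
#   curl -H "$(pve_auth_header)" https://proxmox:8006/api2/json/version
```

Failing early when a variable is missing avoids sending a malformed header, which would otherwise show up only as an authentication error from the API.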

7. Maintain an Update Schedule

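Updates should be regular but controlled. Never upgrade production nodes without testing first, and use apt full-upgrade (or apt dist-upgrade) rather than plain apt upgrade, which can leave Proxmox packages in a partially upgraded state.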

# Recommended update process:
# 1. Read the release notes for any breaking changes
# 2. Snapshot or backup the node (if possible)
# 3. Update a non-critical node first and test for 24-48 hours
# 4. Rolling update: migrate VMs off a node, update, verify, repeat
# 5. Never update all nodes simultaneously

apt update
apt list --upgradable
apt full-upgrade -y   # After testing on non-critical node first
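Step 4 of the process above depends on the node being empty before the upgrade starts. A small guard can enforce that. This sketch parses `qm list` output and assumes the third column is STATUS; the helper names `count_running` and `node_is_drained` are hypothetical, and containers would need the equivalent check against `pct list`.

```shell
# Count running VMs from `qm list` output supplied on stdin.
# Assumes the third column is STATUS.
count_running() {
    awk 'NR > 1 && $3 == "running" { n++ } END { print n + 0 }'
}

# Succeed only when no VM is still running on this node.
# node_is_drained is a hypothetical helper name.
node_is_drained() {
    [ "$(count_running)" -eq 0 ]
}

# Example (on the node about to be updated):
#   if qm list | node_is_drained; then
#       apt full-upgrade
#   else
#       echo "migrate running VMs first" >&2
#   fi
```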

8. Configure High Availability Properly

HA requires at least three nodes for proper quorum. Do not enable HA on a two-node cluster without understanding the fencing implications. For critical workloads:

# Set up HA groups with proper priorities
ha-manager groupadd production -nodes node1,node2,node3 -nofailback 0

# Add critical VMs to HA
ha-manager add vm:100 --group production --max_restart 3 --max_relocate 2
ha-manager add vm:101 --group production --max_restart 3 --max_relocate 2

# Verify HA status
ha-manager status

Ensure your fencing mechanism works correctly — test it by deliberately shutting down a node and verifying that VMs restart on surviving nodes within an acceptable timeframe.
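The three-node minimum follows from majority-vote arithmetic. Assuming one vote per node and no external QDevice, a cluster of n nodes needs floor(n/2) + 1 votes to stay quorate. The helper names below are hypothetical and exist only to make the numbers concrete.

```shell
# Votes needed for quorum with one vote per node: floor(n/2) + 1.
quorum_votes() {
    echo $(( $1 / 2 + 1 ))
}

# Nodes that can fail while the remaining nodes still hold quorum.
tolerated_failures() {
    echo $(( $1 - ($1 / 2 + 1) ))
}

# Example:
#   for n in 2 3 4 5; do
#       echo "$n nodes: quorum $(quorum_votes "$n"), can lose $(tolerated_failures "$n")"
#   done
```

With two nodes the quorum is two votes, so losing either node stops the cluster from making HA decisions. Three nodes tolerate one failure, and a fourth node adds no extra tolerance over three.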

Summary Checklist

  • Separate storage pools for VMs, backups, and ISOs
  • VLAN segmentation for management, production, and storage traffic
  • Automated daily backups with PBS and regular restore testing
  • Monitoring and alerting for critical infrastructure events
  • Two-factor authentication on all admin accounts
  • API tokens for automation instead of passwords
  • Controlled, rolling update schedule
  • Properly configured HA with tested fencing

No single practice makes an environment reliable. It is the combination of all these elements, consistently applied and regularly reviewed, that creates infrastructure you can depend on.
