How to Set Up High Availability Services on Fedora

Fedora supports high-availability clustering through the Pacemaker and Corosync stack, allowing services to fail over automatically between nodes when hardware or software problems occur.

The web server is down and you need it back up now

You have a Fedora server hosting a critical web application. The disk controller drops, the kernel panics, or the power supply fails. The machine reboots, but your users see a 502 Bad Gateway error. You log in, restart the service, and traffic resumes. Ten minutes of downtime. Your boss asks why the service wasn't available. You need the workload to move to a standby node automatically. You need high availability.

How the cluster stack works

Fedora's high availability stack relies on two engines working together. Corosync handles the gossip. It keeps the nodes talking, tracks membership, and enforces quorum. Pacemaker is the boss. It watches the resources and moves them around when things break. The pcs command is the remote control you use to tell them what to do.

Think of Corosync as the nervous system passing signals between nodes. Pacemaker is the brain deciding which node runs the workload. If one node stops sending heartbeats, Pacemaker moves the workload to a healthy node. You never edit the XML configuration files directly. pcs generates the configuration and validates it before applying changes.
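You can watch both layers directly. These are read-only commands and safe to run on a live cluster:

```shell
sudo corosync-quorumtool -s
# WHY: Shows Corosync's view: total votes, the quorum threshold, and current members.
sudo pcs status corosync
# WHY: The same membership information, reported through the pcs tool.
```

If the two reports disagree about membership, the messaging layer and the resource manager are out of sync, which is itself a problem worth investigating.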


Install and configure the cluster

Install the stack on every node in the cluster. The packages must be present on all members before you create the cluster.

sudo dnf install -y pacemaker corosync pcs
# WHY: Installs the resource manager, the messaging layer, and the configuration tool.
sudo systemctl enable --now pcsd
# WHY: Starts the pcsd daemon which handles authentication and remote commands.

Set the password for the hacluster system account. This password must be identical on every node.

sudo passwd hacluster
# WHY: Sets the password for the cluster management account. Authentication fails if passwords differ.

Authenticate the nodes and create the cluster from one node. This command pushes the configuration to all members.

sudo pcs host auth node1 node2 -u hacluster
# WHY: Authenticates the hacluster user against each node's pcsd daemon (TCP port 2224) so configuration can be pushed to all members.
sudo pcs cluster setup my-cluster node1 node2
# WHY: Writes the Corosync configuration to /etc/corosync/ on all nodes and sets the cluster name.
sudo pcs cluster enable --all
# WHY: Enables the cluster services to start automatically on boot.
sudo pcs cluster start --all
# WHY: Starts the cluster immediately without requiring a reboot.

Cluster configuration lives in /etc/corosync/; pcs cluster setup writes corosync.conf there on every node. Anything the packages install under /usr/lib/ is a vendor default owned by the RPM and replaced on update. Make changes only under /etc/, and prefer pcs over hand-editing even there.

Check that all nodes appear online.

sudo pcs status
# WHY: Displays node status, quorum state, and resource locations.

Run pcs status immediately. If nodes are offline, check the logs before proceeding.

Configure fencing to prevent data corruption

STONITH stands for Shoot The Other Node In The Head. It is the mechanism that guarantees a failed node is truly dead before its resources move. Without STONITH, you risk split-brain, where two nodes each think they own the same IP or database, causing data corruption. Pacemaker enables STONITH by default, and you should never disable it in a production setup.

Define a fence device that can power-cycle a node. IPMI is common for bare metal. Virtual machines often use fence_xvm.

sudo pcs stonith create fence-node1 fence_ipmilan \
  ip=192.168.1.50 username=admin password=secret lanplus=1 \
  pcmk_host_list="node1"
# WHY: Defines a fence device that can power-cycle node1 through its management controller via IPMI.
sudo pcs stonith create fence-node2 fence_ipmilan \
  ip=192.168.1.51 username=admin password=secret lanplus=1 \
  pcmk_host_list="node2"
# WHY: Each node needs its own fence device pointing at its own controller; a node cannot be trusted to fence itself.

For a two-node test cluster with no fencing hardware, you can disable STONITH temporarily. This is unsafe for production. Use this only in a lab where data loss is acceptable.

sudo pcs property set stonith-enabled=false
# WHY: Disables the STONITH requirement. Never use this in production.
sudo pcs property set no-quorum-policy=ignore
# WHY: Allows a two-node cluster to operate without a tie-breaker quorum device.

Test the fence device before trusting the cluster. A fence that doesn't work is worse than no fence.
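One way to run that test: check that the fence devices are started, then fence a node on purpose and watch it power-cycle. The node names here assume the two-node example above.

```shell
sudo pcs stonith status
# WHY: Confirms the fence devices are started and shows which node currently hosts them.
sudo pcs stonith fence node2
# WHY: Deliberately fences node2. It should power off, reboot, and rejoin the cluster.
```

If nothing happens, or the command reports success while the node stays up, the device is misconfigured and the cluster cannot be trusted with real data yet.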

Define resources and groups

Create resources for the services you want to manage. Group resources so they always run on the same node and start in the correct order.

sudo pcs resource create VIP ocf:heartbeat:IPaddr2 \
  ip=192.168.1.100 cidr_netmask=24 op monitor interval=20s
# WHY: Creates a virtual IP resource with a health check every 20 seconds.
sudo pcs resource create WebServer ocf:heartbeat:apache \
  configfile=/etc/httpd/conf/httpd.conf op monitor interval=20s
# WHY: Wraps Apache so Pacemaker can manage its lifecycle and monitor status.
sudo pcs resource group add web-group VIP WebServer
# WHY: Groups resources to ensure they run on the same node and start in order.

SELinux denials can prevent resources from starting. Check journalctl -t setroubleshoot for one-line summaries of denials. Read those before disabling SELinux.

Group resources tightly. A VIP without a service is useless. A service without a VIP is unreachable.
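A simple failover drill, assuming the web-group created above: put the active node in standby and confirm the whole group moves together.

```shell
sudo pcs node standby node1
# WHY: Drains node1; Pacemaker relocates web-group to the other node.
sudo pcs status
# WHY: Verify that VIP and WebServer both report Started on node2.
sudo pcs node unstandby node1
# WHY: Returns node1 to service. Resources may stay put or move back depending on resource-stickiness.
```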

Open firewall ports

Corosync and Pacemaker need specific ports to communicate. Open the high-availability service in the firewall.

sudo firewall-cmd --permanent --add-service=high-availability
# WHY: Opens the ports Corosync and Pacemaker need to communicate between nodes.
sudo firewall-cmd --reload
# WHY: Applies the permanent change to the runtime firewall configuration.

Run firewall-cmd --reload after every rule change. The runtime and persistent configs will drift otherwise.
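To check for drift, compare the runtime and permanent service lists. The high-availability entry should appear in both.

```shell
sudo firewall-cmd --list-services
# WHY: Services active right now (runtime configuration).
sudo firewall-cmd --permanent --list-services
# WHY: Services that survive a reload or reboot. Any difference between the two lists is drift.
```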

Verify the cluster and handle errors

Monitor the cluster health and test failover. Use pcs status for a quick overview and journalctl for details.

sudo pcs status
# WHY: Shows node status, quorum, and resource locations.
sudo journalctl -xeu pacemaker
# WHY: Displays detailed logs for the Pacemaker service with explanatory context.

If you see an error like this, the cluster service is not running on the node.

Error: cluster is not currently running on this node

Check systemctl status pacemaker and systemctl status corosync. Ensure the services are enabled and started. Time synchronization is also critical: run chronyd or ntpd on all nodes. Skewed clocks make cross-node logs impossible to correlate and can cause authentication and certificate failures between nodes.
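To verify time synchronization on a node running chronyd:

```shell
sudo systemctl is-active chronyd
# WHY: Confirms the time daemon is actually running.
chronyc tracking
# WHY: Shows the offset from the reference clock. It should be milliseconds, not seconds.
```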

Read the actual error in the journal before guessing. journalctl -xe tells you exactly why a resource failed.

Common pitfalls

Split-brain occurs when nodes lose communication but both think they are alive. STONITH prevents this by killing the suspect node. Without STONITH, you must rely on quorum. An odd number of nodes helps, but a two-node cluster needs a tie-breaker or no-quorum-policy=ignore.
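If a third machine is available, a quorum device is a cleaner tie-breaker for a two-node cluster than no-quorum-policy=ignore. A sketch, assuming a hypothetical third host named qdevice-host that runs corosync-qnetd:

```shell
sudo dnf install -y corosync-qdevice
# WHY: Installs the qdevice daemon on the cluster nodes (the third host runs corosync-qnetd instead).
sudo pcs quorum device add model net host=qdevice-host algorithm=ffsplit
# WHY: Registers the third host as a tie-breaker vote so the surviving node keeps quorum when its peer dies.
```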

Firewall rules often block cluster traffic. Ensure high-availability is allowed. SELinux can block resource agents. Check journalctl -t setroubleshoot. Configuration drift happens when you edit files manually. Always use pcs to modify the cluster.


Choose the right tool for your workload

Use Pacemaker and Corosync when you need active-passive failover for stateful services like databases or virtual IPs. Use Keepalived when you only need a floating IP and don't require complex resource management. Use Kubernetes when you are running containerized workloads and need scaling across many nodes. Use a single node with backups when the service can tolerate downtime and recovery time is acceptable.

Where to go next