Ceph Under Pressure: How I Recovered a Ceph Cluster in Proxmox Without a Configuration Backup

A big scare turned into a victory: Ceph is truly robust.

There are days when everything seems against you. That day, I got proof that Ceph is truly robust. I’ve been building Proxmox VE clusters with Ceph for over six years, and today I experienced a situation that really made me appreciate the strength of this technology. I use Ceph to store both data and virtual machines. The VMs range from lightweight web servers to more complex database servers, as well as virtual machines backing Kubernetes and its pods.

To provide some context, I have a Proxmox VE 9.1.x cluster with three nodes (pve1 to pve3) and a Ceph Squid (19.2.3) cluster installed on these machines. The nodes have been interconnected from the start and use Ceph storage. This cluster was originally installed with Proxmox VE 8 and Ceph Quincy (17).
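
As a quick sanity check, the versions on each node can be read with the standard tools:

bash
# Show the Proxmox VE version on this node
pveversion
# Show the installed Ceph version
ceph --version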

The two nodes pve4 and pve5 are new and not yet part of the Proxmox or Ceph cluster. However, these two nodes already had the Ceph Squid tools installed.

The Problems#

First mistake: installing Ceph squid on machines pve4 and pve5 before adding them to the cluster. Because of how Ceph works, this created a configuration conflict between the different nodes. But what did it actually cause?

When you install Ceph on a node, it automatically creates a local Ceph cluster with that node as the only member. This means that the two nodes pve4 and pve5 each had their own local Ceph cluster, with their own configuration and data. The ceph_fsid, ceph.conf, etc. were different on each node.
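
A quick way to see this kind of conflict is to compare the cluster fsid on each node; a node that created its own local cluster reports a different value (the paths shown are the Ceph defaults):

bash
# The fsid written into the local configuration at installation time
grep fsid /etc/ceph/ceph.conf
# The fsid of the cluster the local tools actually talk to
ceph fsid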

Meanwhile, the existing Ceph cluster on pve1 to pve3 continued to function normally.

Once I saw this setup was wrong, I uninstalled Ceph from pve4 and pve5 (the command run on both machines: apt purge ceph-* && apt autoremove && rm -rf /var/lib/ceph && rm -rf /etc/ceph).
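
For the record, here is the same cleanup as a commented block. Note that rm -rf /var/lib/ceph destroys any local monitor and OSD data; this was acceptable only because pve4 and pve5 had never joined the production cluster:

bash
# Remove the Ceph packages and their configuration
apt purge ceph-* && apt autoremove
# Wipe local Ceph state: monitor stores, OSD data, and keyrings
# (DANGEROUS on a node that holds production data)
rm -rf /var/lib/ceph
rm -rf /etc/ceph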

Once both machines pve4 and pve5 were without Ceph, I started integrating pve4 into the Proxmox cluster; the integration went smoothly. Encouraged by this small victory, I launched the Ceph installation on pve4. During configuration, the wizard asked about Ceph’s public and cluster networks, forcing some parameters. That’s where the biggest problem started. I still don’t understand why or how node pve4 managed to force its configuration onto the existing Ceph cluster.
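
In hindsight, one plausible explanation: /etc/pve is the cluster-wide Proxmox filesystem (pmxcfs), so initializing Ceph on pve4 would rewrite the shared /etc/pve/ceph.conf for every node. The GUI wizard step roughly corresponds to this CLI call (the subnets here are placeholders):

bash
# pveceph init (re)writes /etc/pve/ceph.conf, which is synchronized to all cluster nodes
pveceph init --network 10.0.0.0/24 --cluster-network 10.0.1.0/24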

As you can guess, the consequences were severe. With the Ceph cluster configuration lost, the Proxmox nodes could no longer access the storage holding the virtual machine disks. Each virtual machine became inaccessible, with a message like kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(0,0). No storage means no accessible services. The outage spread to all virtual machines and all services backed by Ceph storage.

Actions Taken#

First things first, I had to warn users about the ongoing incident, ensure access to virtual machine backups, and not turn off the PVE nodes. A quick assessment was conducted to identify the affected virtual machines (in my case, all of them), and the backups were verified. Bad news: no Ceph configuration, crushmap, or authentication keys had been backed up.

At this point, here’s the situation:

  • all virtual machines are inaccessible, stuck on kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(0,0)
  • virtual machine backups are accessible, validated, and ready to be restored
  • no Ceph configuration is available
  • access to all Proxmox nodes is possible via SSH or web interface
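
For reference, a minimal assessment of this situation can be done with the standard Proxmox tooling (exact output depends on the setup):

bash
# Storage status as Proxmox sees it; Ceph-backed storages show up as inactive
pvesm status
# Cluster status; this may hang or error out while the configuration is lost
ceph -s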

I had to act; time was passing and users were affected. Several approaches were explored to solve the problem:

  • install a new Ceph cluster on machines pve4 and pve5, and restore virtual machines from backups on these nodes while repairing or rebuilding the existing Ceph cluster
  • recover the Ceph configuration from another Proxmox node (pve1, pve2, or pve3)
  • throw in the towel

The first approach is easy for a quick recovery, but requires restoring all virtual machines from backups, which is time-consuming. Plus, it doesn’t restore the existing Ceph cluster. The second approach is complex but allows restoring the existing Ceph cluster with minimal risk of data loss. The third approach is the simplest, with non-negligible consequences.

Let’s go with the second approach.

Problem Resolution#

The Ceph configuration file is stored in /etc/pve/ceph.conf. This file contains the Ceph cluster configuration: monitor addresses, the cluster identifier (fsid), the public and cluster networks, and authentication settings, among others. The file had been overwritten, so it had to be recreated entirely, based on the official Proxmox documentation and the information still available on the other Proxmox nodes running Ceph.
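
As an illustration only, a reconstructed /etc/pve/ceph.conf looks roughly like the sketch below; every value is a placeholder and must come from your own cluster (on an OSD node, the cluster fsid can typically be read from /var/lib/ceph/osd/ceph-*/ceph_fsid):

bash
# Hypothetical minimal rebuild of the shared config; all values are placeholders
cat > /etc/pve/ceph.conf <<'EOF'
[global]
    fsid = 00000000-0000-0000-0000-000000000000
    mon_host = 10.0.0.1 10.0.0.2 10.0.0.3
    public_network = 10.0.0.0/24
    cluster_network = 10.0.1.0/24
    auth_cluster_required = cephx
    auth_service_required = cephx
    auth_client_required = cephx
EOF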

In the Ceph configuration used in this Proxmox cluster, each node runs a monitor (ceph-mon), a manager (ceph-mgr), and several OSDs (ceph-osd). There are also ceph-mds daemons for CephFS support, but they won’t be useful to us here. Since node pve4 caused the problem, it had to be removed from the Ceph cluster. The node was put in maintenance mode (ha-manager crm-command node-maintenance enable pve4), the Ceph removal command was run again, and the node was rebooted. The reboot greatly helped with the rest of the procedure.
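
The sequence on pve4 looked like this (the maintenance commands are the standard ha-manager ones):

bash
# Migrate HA resources away from the faulty node before touching Ceph on it
ha-manager crm-command node-maintenance enable pve4
# Remove Ceph again, as in the first cleanup
apt purge ceph-* && apt autoremove && rm -rf /var/lib/ceph && rm -rf /etc/ceph
reboot
# Once the node is back and healthy, leave maintenance mode
ha-manager crm-command node-maintenance disable pve4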

Once pve4 came back online, it was time to regain control of the Ceph cluster. About fifteen minutes had elapsed between the initial assessment and this reboot of pve4.

Now, let’s deal with node pve1. Ceph has exclusive access to its resources (configurations, services, sockets, etc.), and given all the problems encountered, the decision was made to shut down all virtual machines on Proxmox node pve1. Once they were shut down, all Ceph services on node pve1 had to be stopped (systemctl stop ceph.target && systemctl status ceph.target). With the monitor stopped, the crushmap could be recovered from the ceph-mon store on node pve1. Here are the commands to run:

bash
# Extract the crushmap (compressed binary) from the local Ceph monitor store
# (the monitor must be stopped; depending on the Ceph version, the output file
# may need to be passed as "-- --out /tmp/crushmap_dump.bin" instead of a redirect)
ceph-monstore-tool /var/lib/ceph/mon/ceph-pve1 get crushmap > /tmp/crushmap_dump.bin
# Decompile the crushmap into readable text and verify it looks sane
crushtool -d /tmp/crushmap_dump.bin -o /tmp/crushmap.txt
# Recompile the crushmap into binary format
crushtool -c /tmp/crushmap.txt -o /tmp/crushmap.bin
# Inject the crushmap into the Ceph cluster
# (this needs running monitors, so it comes after the service restart below)
ceph osd setcrushmap -i /tmp/crushmap.bin

The above procedure recovers the current Ceph crushmap and reinjects it into the cluster. One of Ceph’s strengths is that it stores important metadata, like the crushmap, inside the cluster itself, which makes it possible to recover the current cluster state even after a restart.

This crushmap is crucial for the cluster’s proper functioning, as it defines the data topology (where data is located, how it’s replicated, and so on). Every monitor stores the authoritative copy, and OSDs and clients keep cached copies of the maps they need. It must be kept and backed up regularly and securely.
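
On a healthy cluster, backing it up is a one-liner from any node with the admin keyring:

bash
# Export the current crushmap (binary) and keep a readable copy alongside
ceph osd getcrushmap -o /root/crushmap_backup.bin
crushtool -d /root/crushmap_backup.bin -o /root/crushmap_backup.txt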

All commands executed without errors. It’s time to check the Ceph cluster status. Since everything is already in place or in an unstable state, the simplest approach is to restart the Ceph services on node pve1.

bash
# Restart Ceph services on node pve1 (monitor first, then everything else)
systemctl restart ceph-mon.target
systemctl restart ceph.target
# Check the cluster status
ceph -s

A big part of the work is done, as confirmed by the ceph -s output showing the cluster is operational. Operational means the cluster is able to function and data is accessible from node pve1, but it’s not yet fully stable (“HEALTH_WARN” status).

All that’s left is to restart the Ceph services on the other nodes and verify the cluster status. Indeed, after restarting the Ceph services on the other nodes, the cluster is stable and reports “HEALTH_OK”. The configuration file /etc/pve/ceph.conf was resynchronized automatically, since the /etc/pve/ directory is replicated to all nodes in the cluster.
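
A quick way to confirm the resynchronization (assuming root SSH between nodes, as in this setup):

bash
# The checksum of the shared config should be identical on every cluster node
for n in pve1 pve2 pve3 pve4; do ssh "$n" md5sum /etc/pve/ceph.conf; done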

Finally, it’s time to verify that the virtual machines are working correctly and that the cluster is stable. Each virtual machine was restarted and each service was verified. Users were informed that the service was operational again.
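
On the Proxmox side this boils down to the usual qm commands (VM ID 101 is an example):

bash
# List the VMs and their state on this node
qm list
# Start a VM and check its status
qm start 101
qm status 101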

In total, this operation took about 1 hour, from the start of the outage to the final verification.

Lessons for Next Time#

Backups and their verification are important:

  • Back up the configuration file /etc/pve/ceph.conf (a backup sketch follows this list)
  • Back up the Ceph crushmap (it is visible in the Proxmox web interface)
  • If the cluster already has a Ceph configuration, do not proceed past the installation wizard page that asks for the Ceph communication subnets
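
A minimal backup sketch along these lines (destination path and retention are up to you):

bash
# Hypothetical nightly backup of the pieces that were missing during this incident
DEST=/root/ceph-backups/$(date +%F)
mkdir -p "$DEST"
cp /etc/pve/ceph.conf "$DEST/ceph.conf"
ceph osd getcrushmap -o "$DEST/crushmap.bin"
# Export the cephx keyring (keep this file secret)
ceph auth export > "$DEST/ceph.auth"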

Getting started with Ceph and Proxmox requires good documentation and testing before going into production. Optimizing such a setup also requires in-depth knowledge of the architecture and of the specific project’s needs.

The official documentation for Proxmox and Ceph (the Squid release) is a good starting point for understanding the concepts and configuration.
