Ceph

2025-12-14

Pain and Cephering

Ceph is pretty neat. It joins together nodes to pool their storage and replicate objects between them, providing a sort of Kubernetes-for-storage service that I really like (I love Kubernetes so I’m going to use that as a point of reference a lot). And like Kubernetes, it comes with a ton of complexity the average person... probably won’t need in their life.

It starts at the bootstrapping stage. There are two ways to deploy Ceph, effectively - by hand, or using the Cephadm tool to do the hard work for you. Initially I opted for the "by hand" approach, since my goal was to learn Ceph anyways and what better way than to dive in headfirst and ignore things that make life easier.

This took some tinkering. My setup started as three Debian 13 based virtual machines, across my three Proxmox servers in my homelab. Each one has a small 64GiB disk that I'll let Ceph manage, 2GiB of memory and 2 vCPU cores. I was mildly tempted to go smaller, but was aware that Ceph has a reputation for taking plenty of compute to run. The initial stumbling block was purely my use of Debian 13 (Trixie). While there are packages for it in the repositories, they... don't quite work, and the official Ceph repositories only go up to Bookworm! It's likely this will change in the future as Trixie gets more adoption, but for the time being I was forced to go to Debian 12. This isn't really an issue - Debian is renowned for its stability and lasting support - but the rest of my virtual machines are running Debian 13 so these ones are odd ducks. Regardless, once I was on Debian 12 the repositories were available (technically you could point Debian 13 to Bookworm repos, but I erred on the side of doing things semi-properly) and I got to work installing the various Ceph components to bootstrap.

I'm mildly annoyed I didn't take better notes during this process, but suffice to say there was a lot of going in circles trying to get components talking to each other and bootstrapped properly. One thing that particularly stands out is Ceph's concept of compute workloads, or I suppose modules, and how you have to be sure that the dependencies for those modules are installed - Ceph, or its packages, won't do this automatically. The actual bootstrapping of the storage component went relatively smoothly, but I could not for the life of me get the CephFS and NFS components working properly so they could be mounted on my desktop. It was turning into a frustrating experience, and I was about ready to drop Ceph entirely until I realised I was ignoring a very, very key tool: Cephadm.

Cephadm made things an absolute breeze, even if it came with some defaults I didn't need (like Grafana and Prometheus). Cephadm will take care of creating an initial small cluster using Podman Quadlets (which are cool in their own right), generating an SSH key, then assimilating the other nodes for the cluster. The process was annoyingly painless, in the sense I'd spent a good day or two on the manual approach. Which isn't to say I didn't learn a bunch I would have otherwise ignored with Cephadm, but I was kicking myself a bit.

The process was so easy that it sort of boils down to two main cephadm commands after installing it, as the Ansible playbook I put together demonstrates. I am aware that there is a first-party Ansible playbook guide, but I opted to write my own so I understood the steps that were being taken - and I wasn't looking for anything "enterprise ready" anyways, so basic cephadm commands served me just fine.
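For a rough idea of what those steps look like outside of Ansible, here is a minimal sketch. It is not my playbook: the hostnames and IP addresses are hypothetical placeholders, and the `run` helper is only there so the script prints each command instead of executing it unless `APPLY=1` is set.

```shell
#!/bin/sh
# Dry-run by default: print each command rather than executing it.
# Set APPLY=1 to actually run them (as root, on the first node).
run() {
  if [ "${APPLY:-0}" = "1" ]; then "$@"; else echo "+ $*"; fi
}

# Hypothetical address of the first node; replace with your own.
FIRST_IP="10.0.0.11"

# Step 1: bootstrap a single-node cluster on the first host.
# (cephadm also accepts --skip-monitoring-stack if you don't want the
# bundled Grafana and Prometheus; it is left out here to keep the defaults.)
run cephadm bootstrap --mon-ip "$FIRST_IP"

# Step 2: give each remaining host the cluster's SSH public key,
# then ask the orchestrator to add it to the cluster.
add_host() {
  # $1 = hostname, $2 = IP address (both hypothetical examples below)
  run ssh-copy-id -f -i /etc/ceph/ceph.pub "root@$2"
  run ceph orch host add "$1" "$2"
}

add_host ceph2 10.0.0.12
add_host ceph3 10.0.0.13
```

Run as-is it only echoes the commands, which makes it easy to sanity check before pointing it at real machines.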

Now that I had a functional Ceph cluster, I had a couple goals. First and foremost was creating an NFS mount so I could test it from my desktop. This mostly involved some basic clickops in the now fully functional web interface. Since the relevant Ceph processes were running in containers, whatever dependencies I missed during the manual setup were present and functional, so this was a pretty straightforward process. For this I opted for CephFS, since as far as I can tell it's closest to a POSIX filesystem and easiest for me to reason with. Ceph's Object Gateway exposes S3 endpoints and terminology, and Block Pools are available to expose something akin to raw disks (or images) over the network. I'm not too interested in those at the moment, but I'm sure as I dig into Ceph more I'll understand them more and implement them elsewhere.

Creating the share (now that everything was running nicely) was easy - create a pool, telling Ceph how many replicas of the data should be present along with any quotas, then create a filesystem using that pool. You also need a second pool for metadata. After that, you're good! Just set up an NFS share if you want to mount using that, or mount it directly with mount.ceph.
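I did this through the web interface, but the same steps can be sketched on the command line. This is an approximation under assumptions rather than a record of what I ran: the filesystem name, pool names, monitor address, and secret file path are all hypothetical, and the mount line uses the older monitor-address device string (newer releases also have a different syntax). As before, it only prints commands unless `APPLY=1` is set.

```shell
#!/bin/sh
# Dry-run by default: print each command rather than executing it.
run() {
  if [ "${APPLY:-0}" = "1" ]; then "$@"; else echo "+ $*"; fi
}

# Create a data pool and a metadata pool, set the replica count on both,
# then build a CephFS filesystem on top of them.
create_cephfs() {
  fs="$1"        # filesystem name (hypothetical)
  replicas="$2"  # how many copies of each object Ceph should keep

  run ceph osd pool create "${fs}_data"
  run ceph osd pool create "${fs}_metadata"
  run ceph osd pool set "${fs}_data" size "$replicas"
  run ceph osd pool set "${fs}_metadata" size "$replicas"
  # Note the argument order: metadata pool first, then data pool.
  run ceph fs new "$fs" "${fs}_metadata" "${fs}_data"
  # On a cephadm-managed cluster the filesystem also needs MDS daemons.
  run ceph orch apply mds "$fs"
}

create_cephfs homelabfs 2

# An optional quota on the data pool would look something like:
#   ceph osd pool set-quota homelabfs_data max_bytes <bytes>

# Mount directly from a client (mount -t ceph invokes the mount.ceph helper).
run mount -t ceph 10.0.0.11:6789:/ /mnt/cephfs \
  -o name=admin,secretfile=/etc/ceph/admin.secret
```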

Satisfied with the result, it was time to move on to the ultimate goal - giving it to my Kubernetes cluster.

My aim was making the Kubernetes cluster less reliant on a single service for storage. At that moment I had two options - S3, via my k8s-csi-s3 fork, or an NFS mount from my NAS. Neither option was ideal for one reason or another. NFS meant a single point of failure, relying on my NAS, but did have higher throughput (2.5Gbps) and lower latency than reaching out to a remote S3 API. By shifting to Ceph, I could remove the SPOF while retaining fairly decent speed and latency. I am aware it's possible to run Ceph in Kubernetes itself, using Rook, but where's the fun in that?

My feelings towards stateful workloads in Kubernetes remain fairly unchanged, but the need to persist some things is somewhat inevitable. I only have 8 PVCs in my homelab cluster, with more stateful workloads living in VMs or LXCs in Proxmox. These 8 include things like Redis state for a few Discord bots, caches of built WASM artifacts for GoToSocial, and notably, kube-image-keeper (KUIK)'s persistence.

KUIK was the thing I was most interested in keeping in Ceph - while container image registries are fairly reliable, I maintain a handful of my own images pushed to my Forgejo instance, which can go offline from time to time (usually deliberately). The idea is to be able to save images for the Kubernetes cluster within Kubernetes itself, with nodes effectively having a shared cache they can pull from before reaching out to the upstream. If the upstream (e.g. Forgejo) is offline, it won't interrupt pods shuffling around. As a concrete example, if I power off my NAS, Kubernetes has to move pods from the NAS's worker node, which can cause problems if those pods use images from Forgejo, which is on the NAS as well. With this setup, that's far less of an issue since the Ceph cluster is also running across all three machines and the nodes will try fetching from it first.

Deploying a container storage interface (CSI) for Ceph is pretty easy. ceph-csi exists for this purpose, and while the manifests I ended up with are a bit messy, they do work as expected. Using it is exactly the same as any other CSI, which is to say you assign the storage class in the PVC and the CSI figures out the rest. This was probably the more boring aspect, although figuring out how to provision a user was at least a bit exciting.
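To illustrate the "assign the storage class in the PVC" part, here is a small sketch that prints a PVC manifest. The claim name, size, and the `cephfs` storage class name are hypothetical; the storage class has to match whatever you defined when deploying ceph-csi, and my actual manifests look different.

```shell
#!/bin/sh
# Print a PersistentVolumeClaim manifest that requests storage from a
# given storage class. The default class name "cephfs" is an assumption.
pvc_manifest() {
  # $1 = claim name, $2 = requested size, $3 = storage class (optional)
  cat <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: $1
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ${3:-cephfs}
  resources:
    requests:
      storage: $2
EOF
}

# Example: a 10Gi claim for an image cache (names are illustrative only).
pvc_manifest kuik-cache 10Gi
```

Piping the output into `kubectl apply -f -` would create the claim, at which point the CSI provisions the backing volume.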

Anyways - at this point, I now have a highly available, local storage option for my homelab cluster. And anything else I want to use it for, for that matter. Much like my clustered PostgreSQL setup, which now backs my Forgejo instance along with various Kubernetes based services, it being outside of my Kubernetes cluster/not using Rook gives me a lot of flexibility with where I want to apply it. That's not to say Rook is a poor choice, but for this particular deployment I wanted to keep my options open (my bad experiences with Longhorn did have some influence here too).

Resource usage wise, the nodes are doing fine. At least, I haven't seen anything to cause concern, but my deployment is also relatively light on the usage front. With 192GiB of raw capacity, and 2 replicas of data in the Kubernetes CephFS filesystem, it'll keep me going for a little while at least.
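The capacity numbers work out as follows. This is back-of-the-envelope arithmetic only, and it ignores metadata and other overhead, so the real usable figure will be somewhat lower.

```shell
#!/bin/sh
# Rough usable capacity: nodes * disk size / replica count (integer GiB).
usable_gib() {
  # $1 = number of nodes, $2 = disk size per node in GiB, $3 = replicas
  echo $(( $1 * $2 / $3 ))
}

raw=$(( 3 * 64 ))   # three nodes with one 64GiB disk each
echo "raw capacity: ${raw} GiB"
echo "usable at 2 replicas: $(usable_gib 3 64 2) GiB"
```

So the 192GiB of raw disk comes to roughly 96GiB of actual data at 2 replicas.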

Whether or not I'd recommend Ceph does, as always, come down to circumstance. But, it is an invaluable thing to learn if you're really into ~~clusterfucks~~ clustering services.

If anything changes, I'll be sure to post about it on Floofy.tech, so feel free to shoot me a follow request!