Maulana's personal blog

Fix Longhorn Volumes that refuse to attach


I’m quite impressed with Longhorn, and I’ve been using it since before v1. The stack is not without bugs, but it keeps getting more stable, with useful new features added over time.

These are some issues that I encountered in the past and how to solve them.

Volumes got detached unexpectedly and can’t reattach

Perhaps because I’m using self-managed Kubernetes (I built the cluster myself on Hetzner), my nodes experience “disruptions” very often. My nodes are in Hetzner Cloud, which is a shared VPS. I assume that, since it is shared, Hetzner imposes some limits that cause all of my pods on that node to be killed. In this situation, Kubernetes will not create new pods; it restarts the containers instead.

The problem is that when the container restarts, the CSI driver doesn’t attach the volume again, so you get a container in an invalid state that can’t access its volumes. It will be stuck in a restart loop.

How to fix:

Restart the Deployment/StatefulSet manually. This causes the pods to be recreated, so Longhorn will attach the volume again.
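A minimal sketch of the manual restart, using kubectl rollout restart. The function name, namespace, and workload names are hypothetical placeholders, not values from the cluster described here:

```shell
# Recreate all pods of a workload so that Longhorn re-attaches its volumes.
# Arguments: <namespace> <kind/name> (hypothetical placeholders).
restart_workload() {
  kubectl -n "$1" rollout restart "$2"
}

# Usage:
#   restart_workload default deployment/my-app
#   restart_workload default statefulset/my-db
```

The rollout restart replaces the pods one by one according to the workload’s update strategy, which is usually gentler than deleting the pods by hand.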

Since Longhorn 1.2.0, the default behavior is that if a volume is detached unexpectedly, the pods are recreated. You can change this setting; see the Longhorn settings reference.

Volumes can’t attach even if the pod is recreated

This sometimes happens randomly when pods are migrated to a different node, usually because of eviction rules or because the node is cordoned/drained. Even when you recreate the pods multiple times, the event log says that it can’t start the pod because it can’t attach a set of volumes (it will tell you which PVCs can’t be attached).

How to fix:

Take note of the PVCs that can’t be attached. Track down which PersistentVolume each one is bound to. This happens because Longhorn still thinks that a different node owns the Volume attachment.
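To look up the PersistentVolume behind a PVC from the command line, something like the following should work. It reads spec.volumeName from the PVC; the function name and the example arguments are made up for illustration. For dynamically provisioned volumes, the PV name is normally also the name of the Longhorn Volume.

```shell
# Print the name of the PersistentVolume that a PVC is bound to.
# Arguments: <namespace> <pvc-name> (hypothetical placeholders).
pv_for_pvc() {
  kubectl -n "$1" get pvc "$2" -o jsonpath='{.spec.volumeName}'
}

# Usage:
#   pv_for_pvc default data-my-db-0
```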

How do I know? It’s a little complicated, but in summary you need to get the attachment process logs and know how Longhorn works. When Longhorn attaches a Volume, it only attaches a Replica. From the Longhorn Volume CRD, you get the replicas. From the Longhorn Replica CRD, you get the name of the instance manager. Then you tail the logs of the instance manager pod that does the attachment.
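The tracing steps above could be sketched roughly as follows. This assumes that replica names start with the volume name followed by -r-, as in the sample replica name later in this post, and that the Longhorn namespace is longhorn-system. The exact field that holds the instance manager name is not assumed, so the sketch simply greps the Replica manifest for it. All function names are hypothetical.

```shell
# List the Replica resources that belong to a volume.
# Assumption: replica names look like <volume-name>-r-<suffix>.
replicas_of_volume() {
  kubectl -n longhorn-system get replicas.longhorn.io -o name | grep "/$1-r-"
}

# Show the lines of a Replica manifest that mention the instance manager.
# The field name is not assumed, hence the case-insensitive grep.
instance_manager_of_replica() {
  kubectl -n longhorn-system get replicas.longhorn.io "$1" -o yaml | grep -i 'instancemanager'
}

# Follow the logs of an instance manager pod.
tail_instance_manager() {
  kubectl -n longhorn-system logs -f "$1"
}

# Usage:
#   replicas_of_volume <volume-name>
#   instance_manager_of_replica <replica-name>
#   tail_instance_manager <instance-manager-pod-name>
```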

Long story short, before proceeding, make sure that no one is actually accessing your volume. Scale down any pods that use the Volume. Use the Longhorn UI to make sure that the Volumes are in a Detached state.

Open the Longhorn Volume CRD (not the Kubernetes core PersistentVolume object). I usually use Rancher to see the YAML manifests, but you can use other tools. For example, the equivalent kubectl command:
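If you prefer the command line over the UI for this step, a sketch like the one below may help. It assumes the Longhorn Volume resource reports its state under status.state; check your own Volume manifest if the output is empty. Function names and arguments are hypothetical.

```shell
# Scale a workload down to zero so that nothing is using the volume.
# Arguments: <namespace> <kind/name> (hypothetical placeholders).
scale_down() {
  kubectl -n "$1" scale "$2" --replicas=0
}

# Print the state that Longhorn reports for a Volume.
# Assumption: the state is exposed as status.state on the Volume resource.
volume_state() {
  kubectl -n longhorn-system get volumes.longhorn.io "$1" -o jsonpath='{.status.state}'
}

# Usage:
#   scale_down default deployment/my-app
#   volume_state <volume-name>
```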

# If you don't know the full CRD name, list it using api-resources
kubectl api-resources | grep longhorn
# You will see that the Longhorn Volume CRD is in the longhorn.io API group
# The Volume itself is stored in your Longhorn namespace, usually longhorn-system
kubectl -n longhorn-system edit volumes.longhorn.io <volume-name>

Here’s an example of the resource. It is truncated to show only the relevant part:

kind: Volume
metadata:
  name: pvc-d84ab4dc-11d9-4436-9f2f-8d07cafb5a46
  namespace: longhorn-system
spec:
  nodeID: worker-large4
status:
  currentNodeID: worker-large4
  ownerID: worker-large4
  pendingNodeID: ""

If your volume is stuck on a particular node that is not the node where the pod is trying to run, it will not attach. Make sure that spec.nodeID, status.currentNodeID, status.ownerID, and status.pendingNodeID are all set to the empty string "". Depending on your situation, some of these keys may already be empty after the Volume is detached. But you need to make sure that all of them are empty to trigger the Longhorn manager to take care of the detachment.

Once you apply the edit, attach the volume again by running the pods/scaling up. It doesn’t matter if, after you apply, the Volume resource still gets a nonempty node ID in any of the keys I mentioned above. What is important is that when you edit the resource and save, the manager is notified and takes care of the detachment. So this time the attachment will work.

If your pod still doesn’t run, check the pod events again. Perhaps more than one volume is stuck, so you need to take care of all of them.
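To check the four fields quickly before and after editing, a read-only sketch like this prints them side by side. The function name is made up, and longhorn-system is assumed as the namespace.

```shell
# Print the node-related fields of a Longhorn Volume, one per line.
# Argument: <volume-name>
volume_node_fields() {
  kubectl -n longhorn-system get volumes.longhorn.io "$1" \
    -o jsonpath='spec.nodeID={.spec.nodeID}{"\n"}status.currentNodeID={.status.currentNodeID}{"\n"}status.ownerID={.status.ownerID}{"\n"}status.pendingNodeID={.status.pendingNodeID}{"\n"}'
}

# Usage:
#   volume_node_fields <volume-name>
```

This only reads the resource. The actual edit is still done with kubectl edit (or Rancher) as described above.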

Volumes can’t attach because other pods have already attached the volume

If you have been using Longhorn since before v1, you will know that Longhorn only supported RWO (ReadWriteOnce) Volumes. That means a Volume can only be attached to a single node, but it can be used by more than one pod on the same node.

If the Pod event says the Volume can’t be attached because it has already been attached elsewhere, then you don’t have any other option. You must schedule all the pods that access the same Volume on the same node. There is no other way.

Of course this defeats the purpose of horizontal scalability, but that is another topic. With an RWO Volume, multiple pods can access the same Volume only if they run on one node.

How to fix:

There are several ways to do this. Refer to the Kubernetes documentation on how to schedule pods. Here are some of them:

  • In the Pod template spec of your workload, set spec.nodeName directly to the name of the node where the Volume is currently attached.
  • Use Pod Affinity so that one Workload decides where it runs, and the other Workload’s Pods use Pod Affinity to be placed on the same node as the first Workload.
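As an illustration of the first option, the pod template of a Deployment can be pinned to a node with a merge patch. This is only a sketch: the function name, namespace, and Deployment name are hypothetical, and the node name is whatever node currently holds the Volume.

```shell
# Pin a Deployment's pods to one node by setting spec.nodeName in the pod template.
# Arguments: <namespace> <deployment-name> <node-name> (hypothetical placeholders).
pin_deployment_to_node() {
  kubectl -n "$1" patch deployment "$2" --type merge \
    -p "{\"spec\":{\"template\":{\"spec\":{\"nodeName\":\"$3\"}}}}"
}

# Usage:
#   pin_deployment_to_node default my-app worker-large4
```

Setting nodeName bypasses the scheduler entirely, so the pods will stay Pending or fail if that node is unavailable. Pod Affinity is the more flexible choice when that matters.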

Alternatively, Longhorn now supports (somewhat experimental) RWX (ReadWriteMany) Volumes, which are regular Longhorn Volumes exposed via NFS by the Longhorn Share Manager. You can move your data to this new volume type and then mount the volume from any node in the cluster (no need to use affinity).

Volumes can’t attach because the Replicas are in the Faulted state

This is a very rare case, since Longhorn handles Faulted replicas just fine. But it can happen if you only have one replica, not enough disk space, and then your node gets disrupted. If you have enough space, Longhorn can make a snapshot and try to recover the replica even with just one replica. But if you don’t have enough space, it will not know what to do, and you need to make the decision. Longhorn calls this a “Salvage” operation.

How to fix:

You can try the Salvage button first from the Longhorn UI of the Volume. If it doesn’t do anything, then you most likely have only one replica with not enough disk space on the node.

If you are sure (by hunch) that this is just a false alarm (an error reported when there should be none), and the pod will mount the volume just fine or can handle it gracefully, you can force-mount the volume. A case where this makes sense is a cache volume that stores compiled JavaScript assets. If the volume is corrupted, it won’t be a problem because you can just regenerate the content.

To force-mount the volume, we need a little trick to make Longhorn think that the volume is okay.

Find your Faulted replica name from the Longhorn UI (the easiest method I can think of). Of course you can also track it down via the CLI by opening the CRD, but the Longhorn UI is the fastest:

  1. Open the Volume tab in the Longhorn UI, filter by Status, and choose Faulted.
  2. Click your Faulted Volume.
  3. In the Volume menu, click Salvage (if you have more than one replica). If it doesn’t work, just delete the Faulted replicas so Longhorn will create new ones.
  4. If you only have one replica and you want to try to force-mount, take note of the replica name.
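For the CLI route, a sketch like the following lists the replicas of one volume together with their spec.failedAt value. It assumes replica names start with the volume name followed by -r-, as in the sample replica name in this post. The function name is hypothetical.

```shell
# List the replicas of one volume with their spec.failedAt value.
# Argument: <volume-name>
# Assumption: replica names look like <volume-name>-r-<suffix>.
replica_failures() {
  kubectl -n longhorn-system get replicas.longhorn.io \
    -o custom-columns=NAME:.metadata.name,FAILED_AT:.spec.failedAt | grep "^$1-r-"
}

# Usage:
#   replica_failures <volume-name>
```

A replica with a timestamp in the second column is one that Longhorn has marked as failed.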

Now, with the replica name, you can edit its resource YAML manifest. I used to do it via Rancher, but if you want to do it via kubectl:

# If you don't know the full CRD name, list it using api-resources
kubectl api-resources | grep longhorn
# You will see that the Longhorn Replica CRD is in the longhorn.io API group
# The Replica itself is stored in your Longhorn namespace, usually longhorn-system
kubectl -n longhorn-system edit replicas.longhorn.io <replica-name>

Here’s the sample YAML, trimmed to only show relevant section:

kind: Replica
metadata:
  name: pvc-04b355d9-d2ac-4e20-9aad-c545d3cda70b-r-3d6b5e4d
  namespace: longhorn-system
spec:
  failedAt: "2021-12-16T13:46:18Z"

The spec.failedAt field records when the replica faulted. It is used to compare multiple replicas to see which one is the most recent. Simply replace the value with "" to trick Longhorn into thinking that the replica is no longer faulty. It will then be able to attach the volume.

However, I should mention that this is a very hacky thing to do. If you experience this very often and your data is very important (like transaction logs), I suggest using multiple replicas next time.
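The same edit can be sketched as a non-interactive merge patch instead of opening an editor. The function name is made up, and longhorn-system is assumed as the namespace.

```shell
# Clear spec.failedAt on a Replica so Longhorn no longer treats it as faulted.
# Argument: <replica-name>
clear_failed_at() {
  kubectl -n longhorn-system patch replicas.longhorn.io "$1" --type merge \
    -p '{"spec":{"failedAt":""}}'
}

# Usage:
#   clear_failed_at <replica-name>
```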

Wrapping up…

The issues I mentioned above are the most common. If you use Longhorn, you will find that its snapshotting ability is what really matters. You can have a fault any day, but if you have snapshots, you will be fine. As you may know, Kubernetes already defines an alpha/beta volume snapshot spec, but not every CSI driver supports it. Longhorn fills this gap with a useful UI and scheduled snapshots. This is why I’m still using Longhorn instead of the Hetzner CSI driver: to my knowledge, Hetzner volumes can’t be snapshotted.

Written by Rizky Maulana Nugraha
Software Developer. Currently remotely working from Indonesia.