I used Rook to build a Ceph cluster, but my PVC is stuck in Pending. When I run kubectl describe pvc, I see this event from the persistentvolume-controller:
waiting for a volume to be created, either by external provisioner "rook-ceph.rbd.csi.ceph.com" or manually created by system administrator
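For context, a minimal StorageClass/PVC pair for the rook-ceph.rbd.csi.ceph.com provisioner looks like the Rook sample below. The pool replicapool, the StorageClass name, and the secret names are the Rook example defaults and are only assumptions here; my actual manifests may differ:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: rook-ceph-block
provisioner: rook-ceph.rbd.csi.ceph.com
parameters:
  clusterID: rook-ceph
  pool: replicapool
  imageFormat: "2"
  imageFeatures: layering
  csi.storage.k8s.io/provisioner-secret-name: rook-csi-rbd-provisioner
  csi.storage.k8s.io/provisioner-secret-namespace: rook-ceph
  csi.storage.k8s.io/node-stage-secret-name: rook-csi-rbd-node
  csi.storage.k8s.io/node-stage-secret-namespace: rook-ceph
  csi.storage.k8s.io/fstype: ext4
reclaimPolicy: Delete
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: rbd-pvc
spec:
  storageClassName: rook-ceph-block
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi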
All my pods are in the Running state:
NAME READY STATUS RESTARTS AGE
csi-cephfsplugin-ntqk6 3/3 Running 0 14d
csi-cephfsplugin-pqxdw 3/3 Running 6 14d
csi-cephfsplugin-provisioner-c68f789b8-dt4jf 6/6 Running 49 14d
csi-cephfsplugin-provisioner-c68f789b8-rn42r 6/6 Running 73 14d
csi-rbdplugin-6pgf4 3/3 Running 0 14d
csi-rbdplugin-l8fkm 3/3 Running 6 14d
csi-rbdplugin-provisioner-6c75466c49-tzqcr 6/6 Running 106 14d
csi-rbdplugin-provisioner-6c75466c49-x8675 6/6 Running 17 14d
rook-ceph-crashcollector-compute08.dc-56b86f7c4c-9mh2j 1/1 Running 2 12d
rook-ceph-crashcollector-compute09.dc-6998676d86-wpsrs 1/1 Running 0 12d
rook-ceph-crashcollector-compute10.dc-684599bcd8-7hzlc 1/1 Running 0 12d
rook-ceph-mgr-a-69fd54cccf-tjkxh 1/1 Running 200 12d
rook-ceph-mon-at-8568b88589-2bm5h 1/1 Running 0 4d3h
rook-ceph-mon-av-7b4444c8f4-2mlpc 1/1 Running 0 4d1h
rook-ceph-mon-aw-7df9f76fcd-zzmkw 1/1 Running 0 4d1h
rook-ceph-operator-7647888f87-zjgsj 1/1 Running 1 15d
rook-ceph-osd-0-6db4d57455-p4cz9 1/1 Running 2 12d
rook-ceph-osd-1-649d74dc6c-5r9dj 1/1 Running 0 12d
rook-ceph-osd-2-7c57d4498c-dh6nk 1/1 Running 0 12d
rook-ceph-osd-prepare-compute08.dc-gxt8p 0/1 Completed 0 3h9m
rook-ceph-osd-prepare-compute09.dc-wj2fp 0/1 Completed 0 3h9m
rook-ceph-osd-prepare-compute10.dc-22kth 0/1 Completed 0 3h9m
rook-ceph-tools-6b4889fdfd-d6xdg 1/1 Running 0 12d
Here is the output of kubectl logs -n rook-ceph csi-cephfsplugin-provisioner-c68f789b8-dt4jf csi-provisioner:
I0120 11:57:13.283362 1 csi-provisioner.go:121] Version: v2.0.0
I0120 11:57:13.283493 1 csi-provisioner.go:135] Building kube configs for running in cluster...
I0120 11:57:13.294506 1 connection.go:153] Connecting to unix:///csi/csi-provisioner.sock
I0120 11:57:13.294984 1 common.go:111] Probing CSI driver for readiness
W0120 11:57:13.296379 1 metrics.go:142] metrics endpoint will not be started because `metrics-address` was not specified.
I0120 11:57:13.299629 1 leaderelection.go:243] attempting to acquire leader lease rook-ceph/rook-ceph-cephfs-csi-ceph-com...
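Since the PVC event points at the rook-ceph.rbd.csi.ceph.com driver rather than CephFS, I guess the more relevant logs would come from the RBD provisioner pods (pod names taken from the list above; I don't know which replica currently holds the leader lease):
kubectl logs -n rook-ceph csi-rbdplugin-provisioner-6c75466c49-tzqcr csi-provisioner
kubectl logs -n rook-ceph csi-rbdplugin-provisioner-6c75466c49-x8675 csi-provisioner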
Here is the ceph status output from the toolbox container:
cluster:
id: 0b71fd4c-9731-4fea-81a7-1b5194e14204
health: HEALTH_ERR
Module 'dashboard' has failed: [('x509 certificate routines', 'X509_check_private_key', 'key values mismatch')]
Degraded data redundancy: 2/6 objects degraded (33.333%), 1 pg degraded, 1 pg undersized
1 pgs not deep-scrubbed in time
1 pgs not scrubbed in time
services:
mon: 3 daemons, quorum at,av,aw (age 4d)
mgr: a(active, since 4d)
osd: 3 osds: 3 up (since 12d), 3 in (since 12d)
data:
pools: 1 pools, 1 pgs
objects: 2 objects, 0 B
usage: 3.3 GiB used, 3.2 TiB / 3.2 TiB avail
pgs: 2/6 objects degraded (33.333%)
1 active+undersized+degraded
I think it's because the cluster's health is HEALTH_ERR, but I don't know how to fix it. I currently build the Ceph cluster on raw partitions: one partition on one node and two partitions on another node.
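I'm guessing I can check whether the pool's replica size can actually be satisfied with these commands in the toolbox (the pool name replicapool below is the Rook default and just an assumption), and that the dashboard error can be cleared by regenerating its certificate:
ceph osd tree
ceph osd pool ls detail
ceph osd pool get replicapool size
# if the CRUSH failure domain is "host" but the OSDs only span two hosts,
# a replica size of 3 can never be satisfied, so the PG stays undersized
ceph osd pool set replicapool size 2
# regenerate the dashboard certificate and restart the module
ceph dashboard create-self-signed-cert
ceph mgr module disable dashboard
ceph mgr module enable dashboard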
I found that a few pods have restarted several times, so I checked their logs. In the csi-rbdplugin-provisioner pod, the csi-resizer, csi-attacher, and csi-snapshotter containers all show the same error:
E0122 08:08:37.891106 1 leaderelection.go:321] error retrieving resource lock rook-ceph/external-resizer-rook-ceph-rbd-csi-ceph-com: Get "https://10.96.0.1:443/apis/coordination.k8s.io/v1/namespaces/rook-ceph/leases/external-resizer-rook-ceph-rbd-csi-ceph-com": dial tcp 10.96.0.1:443: i/o timeout
The csi-snapshotter container also shows a repeating error:
E0122 08:08:48.420082 1 reflector.go:127] github.com/kubernetes-csi/external-snapshotter/client/v3/informers/externalversions/factory.go:117: Failed to watch *v1beta1.VolumeSnapshotClass: failed to list *v1beta1.VolumeSnapshotClass: the server could not find the requested resource (get volumesnapshotclasses.snapshot.storage.k8s.io)
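If it matters, my understanding is that the 10.96.0.1:443 i/o timeout points at node networking / kube-proxy rather than Ceph itself, and that the missing volumesnapshotclasses resource means the VolumeSnapshot CRDs are not installed. A rough check and fix I have in mind (the curl image and the repo path are assumptions from memory, not verified):
# can a pod reach the API server service IP at all?
# (ideally this should be scheduled on the same node as the failing provisioner pod)
kubectl run api-check --rm -it --restart=Never --image=curlimages/curl -- curl -k -m 5 https://10.96.0.1:443/healthz
# install the VolumeSnapshot CRDs shipped with the external-snapshotter project
git clone https://github.com/kubernetes-csi/external-snapshotter.git
kubectl apply -f external-snapshotter/client/config/crd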
As for the mgr pod, there is a repeating log entry:
debug 2021-01-29T00:47:22.155+0000 7f10fdb48700 0 log_channel(cluster) log [DBG] : pgmap v28775: 1 pgs: 1 active+undersized+degraded; 0 B data, 337 MiB used, 3.2 TiB / 3.2 TiB avail; 2/6 objects degraded (33.333%)
It's also strange that the mon pods are named at, av, and aw rather than a, b, and c. It seems the mon pods have been deleted and recreated several times, but I don't know why.
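In case it helps with the mon question, I plan to look at the operator log and the mon endpoints ConfigMap, which I believe record the mon failovers (resource names are from my deployment; the grep is just to narrow the output):
kubectl -n rook-ceph logs deploy/rook-ceph-operator | grep -i mon
kubectl -n rook-ceph get configmap rook-ceph-mon-endpoints -o yaml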
Thanks for any advice.