Errors encountered in using k8s

This is a record of problems encountered while deploying k8s at work.

Issue 1: pod failed to schedule: error while running the “VolumeBinding” filter plugin

describe pod:

Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 55s (x19 over 26m) default-scheduler error while running "VolumeBinding" filter plugin for pod "eri-cec-9dcd4d6c8-whvrm": pod has unbound immediate PersistentVolumeClaims

check pod specs by kubectl edit pod <pod-name>:

volumes:
- name: cec-database
  persistentVolumeClaim:
    claimName: eri-cec-database-pvc
- name: cec-misc
  persistentVolumeClaim:
    claimName: eri-cec-misc-pvc
- name: default-token-pxxg8
  secret:
    defaultMode: 420
    secretName: default-token-pxxg8
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2020-04-08T01:58:02Z"
    message: 'error while running "VolumeBinding" filter plugin for pod "eri-cec-9dcd4d6c8-whvrm":
      pod has unbound immediate PersistentVolumeClaims'
    reason: Unschedulable
    status: "False"
    type: PodScheduled
  phase: Pending
  qosClass: BestEffort

The two persistentVolumeClaims were set during helm install:

helm install -n eri ./cec-release-1.0.0.tgz \
> --set persistence.enabled=true \
> --set persistence.storageClass=nfs \
> --set persistence.database.size=5Gi \
> --set persistence.miscellaneous.size=5Gi \
> --set ingress.cecManager.hostName=dual-test \
> --set ingress.cecApi.hostName=dual-api
NAME:   eri
LAST DEPLOYED: Wed Apr 8 09:58:00 2020
NAMESPACE: default
STATUS: DEPLOYED
======================================
RESOURCES:
==> v1/Deployment
NAME AGE
eri-cec 1s
======================================
==> v1/PersistentVolumeClaim
NAME AGE
eri-cec-database-pvc 2s
eri-cec-misc-pvc 2s
======================================
==> v1/Pod(related)
NAME AGE
eri-cec-9dcd4d6c8-whvrm 1s
======================================
==> v1/Secret
NAME AGE
eri-cec-database-secret 2s
======================================
==> v1/Service
NAME AGE
eri-cec 1s
======================================
==> v1beta1/Ingress
NAME AGE
eri-cec-ingress 1s

========================= Solution =========================
References:

  • https://stackoverflow.com/questions/60774220/kubernetes-pod-has-unbound-immediate-persistentvolumeclaims
  • https://blog.csdn.net/oguro/article/details/96964440
  • https://blog.csdn.net/liumiaocn/article/details/103388607
  • https://kubernetes.io/docs/tasks/configure-pod-container/configure-persistent-volume-storage/

From the information above, something is wrong with the PVCs (PersistentVolumeClaims), which are left in the "unbound" state. A PVC should be bound to a PV (PersistentVolume) that has enough capacity to hold the claim.

Check the PVCs and PVs:
    [root@host63 cec-installer]# kubectl get pvc -A
    NAMESPACE NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
    cec eri-sh-cec-database-pvc Pending nfs 125m
    cec eri-sh-cec-misc-pvc Pending nfs 125m
    default eri-cec-database-pvc Pending nfs 3h22m
    default eri-cec-misc-pvc Pending nfs 3h22m
    [root@host63 cec-installer]# kubectl get pv -A
    No resources found
There is no PV on the current node (host63), so I create two PVs, one for the database PVC and one for the misc PVC.
Make a directory for the PVs first:
    [root@host63 mnt]# sudo mkdir /mnt/data
Check the access mode and the capacity that the PVC needs:
    [root@host63 cec-installer]# kubectl edit pvc eri-sh-cec-database-pvc -n cec
    ...
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 5Gi
    vi pv-init.yaml:
    apiVersion: v1
    kind: PersistentVolume
    metadata:
      name: eri-sh-cec-database-pv
      labels:
        name: eri-sh-cec-database-pv
    spec:
      nfs:
        path: /mnt/data
        server: nfs
      accessModes: ["ReadWriteMany", "ReadWriteOnce"]
      capacity:
        storage: 5Gi
    ---
    apiVersion: v1
    kind: PersistentVolume
    metadata:
      name: eri-sh-cec-misc-pv
      labels:
        name: eri-sh-cec-misc-pv
    spec:
      nfs:
        path: /mnt/data
        server: nfs
      accessModes: ["ReadWriteMany", "ReadWriteOnce"]
      capacity:
        storage: 5Gi
    [root@host63 cec-installer]# kubectl apply -f pv-init.yaml
    persistentvolume/eri-sh-cec-database-pv created
    persistentvolume/eri-sh-cec-misc-pv created
    [root@host63 cec-installer]# kubectl get pv -A
    NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE
    eri-sh-cec-database-pv 5Gi RWO,RWX Retain Available 15s
    eri-sh-cec-misc-pv 5Gi RWO,RWX Retain Available 15s
The PVCs still do not bind; check their status:
    [root@host63 cec-installer]# kubectl describe pvc eri-sh-cec-database-pvc -n cec
    Events:
    Type Reason Age From Message
    ---- ------ ---- ---- -------
    Warning ProvisioningFailed 54m (x3 over 104m) cluster.local/nfs-provisioner-nfs-server-provisioner_nfs-provisioner-nfs-server-provisioner-0_79370fad-78b7-11ea-8b82-66556e93189d failed to provision volume with StorageClass "nfs": error getting NFS server IP for volume: service SERVICE_NAME=nfs-provisioner-nfs-server-provisioner is not valid; check that it has for ports map[{111 UDP}:true {111 TCP}:true {2049 TCP}:true {20048 TCP}:true] exactly one endpoint, this pod's IP POD_IP=192.168.220.144
    Warning ProvisioningFailed 38m (x8 over 3h6m) cluster.local/nfs-provisioner-nfs-server-provisioner_nfs-provisioner-nfs-server-provisioner-0_79370fad-78b7-11ea-8b82-66556e93189d failed to provision volume with StorageClass "nfs": error getting NFS server IP for volume: service SERVICE_NAME=nfs-provisioner-nfs-server-provisioner is not valid; check that it has for ports map[{2049 TCP}:true {20048 TCP}:true {111 UDP}:true {111 TCP}:true] exactly one endpoint, this pod's IP POD_IP=192.168.220.144
    Normal Provisioning 21m (x16 over 3h6m) cluster.local/nfs-provisioner-nfs-server-provisioner_nfs-provisioner-nfs-server-provisioner-0_79370fad-78b7-11ea-8b82-66556e93189d External provisioner is provisioning volume for claim "cec/eri-sh-cec-database-pvc"
    Warning ProvisioningFailed 21m (x2 over 3h4m) cluster.local/nfs-provisioner-nfs-server-provisioner_nfs-provisioner-nfs-server-provisioner-0_79370fad-78b7-11ea-8b82-66556e93189d failed to provision volume with StorageClass "nfs": error getting NFS server IP for volume: service SERVICE_NAME=nfs-provisioner-nfs-server-provisioner is not valid; check that it has for ports map[{111 TCP}:true {2049 TCP}:true {20048 TCP}:true {111 UDP}:true] exactly one endpoint, this pod's IP POD_IP=192.168.220.144
    Normal ExternalProvisioning 87s (x742 over 3h6m) persistentvolume-controller waiting for a volume to be created, either by external provisioner "cluster.local/nfs-provisioner-nfs-server-provisioner" or manually created by system administrator
I did not figure this one out.
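In hindsight, one likely reason the manually created PVs never bound is that the PVCs request `storageClassName: nfs` (set through `persistence.storageClass` at install time), while the PVs created above carry no `storageClassName` at all, and Kubernetes only binds a claim to a volume whose storage class matches. A minimal sketch of a static PV that could bind under that assumption (the NFS server address is a placeholder, not a value from this cluster):

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: eri-sh-cec-database-pv
spec:
  storageClassName: nfs        # must match the PVC's storageClassName to be considered for binding
  capacity:
    storage: 5Gi
  accessModes:
    - ReadWriteOnce
  nfs:
    path: /mnt/data
    server: <nfs-server-ip>    # placeholder: an NFS server reachable from every node
```

The repeated ProvisioningFailed events also suggest that the `nfs-provisioner-nfs-server-provisioner` Service has no valid endpoint, so something like `kubectl get endpoints -A | grep nfs` would show whether dynamic provisioning through the "nfs" StorageClass could ever succeed here.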


Issue 2. The deployed pod (without NFS) uses IPv4 by default, and the nginx-ingress service takes an IPv4 address as its default route; this could not be changed directly.

nginx-ingress controller logs:

[root@host63 cec-installer]# kubectl logs nginx-ingress-controller-64d58897bd-b99gw
-------------------------------------------------------------------------------
NGINX Ingress controller
Release: 0.29.0
Build: git-eedcdcdbf
Repository: https://github.com/kubernetes/ingress-nginx
nginx version: nginx/1.17.8
-------------------------------------------------------------------------------
I0407 10:17:14.810155 8 flags.go:215] Watching for Ingress class: nginx
W0407 10:17:14.811042 8 flags.go:260] SSL certificate chain completion is disabled (--enable-ssl-chain-completion=false)
W0407 10:17:14.811123 8 client_config.go:543] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
I0407 10:17:14.811367 8 main.go:193] Creating API client for https://192.167.0.1:443
I0407 10:17:14.820212 8 main.go:237] Running in Kubernetes cluster version v1.17 (v1.17.4) - git (clean) commit 8d8aa39598534325ad77120c120a22b3a990b5ea - platform linux/amd64
I0407 10:17:14.823302 8 main.go:91] Validated default/nginx-ingress-default-backend as the default backend.
I0407 10:17:15.126113 8 main.go:102] SSL fake certificate created /etc/ingress-controller/ssl/default-fake-certificate.pem
W0407 10:17:15.147723 8 store.go:657] Unexpected error reading configuration configmap: configmaps "nginx-ingress-controller" not found
I0407 10:17:15.156374 8 nginx.go:263] Starting NGINX Ingress controller
I0407 10:17:16.357204 8 nginx.go:307] Starting NGINX process
I0407 10:17:16.357338 8 leaderelection.go:242] attempting to acquire leader lease default/ingress-controller-leader-nginx...
W0407 10:17:16.358186 8 controller.go:394] Service "default/nginx-ingress-default-backend" does not have any active Endpoint
I0407 10:17:16.358304 8 controller.go:137] Configuration changes detected, backend reload required.
I0407 10:17:16.360127 8 status.go:86] new leader elected: nginx-ingress-controller-64d58897bd-cthrs
I0407 10:17:16.450895 8 controller.go:153] Backend successfully reloaded.
I0407 10:17:16.450966 8 controller.go:162] Initial sync, sleeping for 1 second.
W0407 10:17:20.280746 8 controller.go:394] Service "default/nginx-ingress-default-backend" does not have any active Endpoint
W0407 10:17:23.614240 8 controller.go:394] Service "default/nginx-ingress-default-backend" does not have any active Endpoint
W0407 10:17:33.458971 8 controller.go:394] Service "default/nginx-ingress-default-backend" does not have any active Endpoint
I0407 10:17:53.811527 8 leaderelection.go:252] successfully acquired lease default/ingress-controller-leader-nginx
I0407 10:17:53.811566 8 status.go:86] new leader elected: nginx-ingress-controller-64d58897bd-b99gw
W0407 10:18:00.868971 8 controller.go:394] Service "default/nginx-ingress-default-backend" does not have any active Endpoint
I0408 03:14:30.743173 8 event.go:281] Event(v1.ObjectReference{Kind:"Ingress", Namespace:"cec", Name:"eri-sh-cec-ingress", UID:"dd298fd3-3c16-42e8-a544-c7f942ec4e3e", APIVersion:"networking.k8s.io/v1beta1", ResourceVersion:"211359", FieldPath:""}): type: 'Normal' reason: 'CREATE' Ingress cec/eri-sh-cec-ingress
W0408 03:14:34.068588 8 controller.go:921] Service "default/eri-cec" does not have any active Endpoint.
W0408 03:14:34.068631 8 controller.go:921] Service "default/eri-cec" does not have any active Endpoint.
W0408 03:14:34.068648 8 controller.go:921] Service "cec/eri-sh-cec" does not have any active Endpoint.
W0408 03:14:34.068661 8 controller.go:921] Service "cec/eri-sh-cec" does not have any active Endpoint.
I0408 03:14:53.817883 8 status.go:274] updating Ingress cec/eri-sh-cec-ingress status from [] to [{10.136.40.63 }]
I0408 03:14:53.820045 8 event.go:281] Event(v1.ObjectReference{Kind:"Ingress", Namespace:"cec", Name:"eri-sh-cec-ingress", UID:"dd298fd3-3c16-42e8-a544-c7f942ec4e3e", APIVersion:"networking.k8s.io/v1beta1", ResourceVersion:"211445", FieldPath:""}): type: 'Normal' reason: 'UPDATE' Ingress cec/eri-sh-cec-ingress
W0408 03:14:53.820285 8 controller.go:921] Service "default/eri-cec" does not have any active Endpoint.
W0408 03:14:53.820310 8 controller.go:921] Service "default/eri-cec" does not have any active Endpoint.
W0408 03:14:53.820326 8 controller.go:921] Service "cec/eri-sh-cec" does not have any active Endpoint.
W0408 03:14:53.820341 8 controller.go:921] Service "cec/eri-sh-cec" does not have any active Endpoint.

Solved the problem of the product service using an IPv4 address as its default cluster IP by adding new parameters to the helm deployment chart (a sketch of both files follows this list):

  • add `ipFamily:` below `service:` in `values.yaml`
  • add `ipFamily: {{.Values.service.ipFamily}}` below `spec:` in `service.yaml`
  • pass `--set service.ipFamily=IPv6` when running helm install for the product
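A sketch of what those two changes could look like; apart from the `ipFamily` lines taken from the list above, the surrounding keys and names are assumptions (`spec.ipFamily` is the alpha dual-stack field available in this cluster's Kubernetes v1.17):

```yaml
# values.yaml (excerpt; "type" is an assumed neighbouring key)
service:
  type: ClusterIP
  ipFamily: IPv4   # default; overridden at install time with --set service.ipFamily=IPv6
```

```yaml
# templates/service.yaml (skeleton; metadata, ports and selector are assumptions)
apiVersion: v1
kind: Service
metadata:
  name: eri-cec
spec:
  ipFamily: {{ .Values.service.ipFamily }}
  type: {{ .Values.service.type }}
  ports:
    - port: 80
  selector:
    app: cec
```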

**Solution:**
Modify the helm chart before installing the ingress controller. In `values.yaml`, configure:
```yaml
hostNetwork: true
reportNodeInternalIp: true
daemonset:
  useHostPort: true
kind: DaemonSet
```
You can also set the service type and external IP here. After helm install, check whether the service can be reached by hostname via the host IP.
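To confirm the controller really ended up on the host network after reinstalling, a quick check could be the following (the pod and DaemonSet names are assumptions based on a default nginx-ingress install):

```bash
# with hostNetwork: true the pod IP should equal the node IP
kubectl get pods -o wide | grep nginx-ingress-controller
# kind: DaemonSet means one controller pod per node
kubectl get daemonset nginx-ingress-controller
```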
    [root@host63 cec-installer]# curl http://dual-ipv6:80
    <!DOCTYPE html>
    <html>
    ...
    </html>
    [root@host63 cec-installer]# curl http://dual-ipv4:80
    <!DOCTYPE html>
    <html>
    ...
    </html>
Also verify with the client UI.


Issue 3. Failed to init tiller; pod creation error: rpc error: code = DeadlineExceeded desc = context deadline exceeded

The pod hangs in the ContainerCreating state after helm init. Describe the pod:

[root@host59 ~]# kubectl describe -n kube-system pod tiller-deploy-969865475-sn2k2
Name: tiller-deploy-969865475-sn2k2
Namespace: kube-system
Node: host59/2001:1b74:88:9400::59:59
Controlled By: ReplicaSet/tiller-deploy-969865475
Containers:
tiller:
Container ID:
Image: gcr.io/kubernetes-helm/tiller:v2.16.1
Image ID:
Ports: 44134/TCP, 44135/TCP
Events:
Type Reason Age From Message
Normal Scheduled 54m default-scheduler Successfully assigned kube-system/tiller-deploy-969865475-sn2k2 to host59
Warning FailedCreatePodSandBox 2m47s (x13 over 50m) kubelet, host59 Failed to create pod sandbox: rpc error: code = DeadlineExceeded desc = context deadline exceeded
Normal SandboxChanged 2m47s (x13 over 50m) kubelet, host59 Pod sandbox changed, it will be killed and re-created.

Check the docker containers: they are already running, so docker itself seems fine and something else must be going wrong when the pod sandbox is created.
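That check was roughly the following (a sketch; nothing here is specific to this cluster):

```bash
# the docker daemon is active and existing containers are running normally,
# so the DeadlineExceeded error is not caused by docker being down
systemctl status docker
docker ps
```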

[root@host59 ~]# systemctl status kubelet -l
Apr 17 17:25:27 host59 kubelet[19205]: E0417 17:25:27.660517 19205 dns.go:135] Nameserver limits were exceeded, some nameservers have been omitted, the applied nameserver line is: 10.221.16.11 10.221.16.10 150.236.34.180
[root@host59 ~]# journalctl -u  kubelet -f
Apr 17 17:27:18 host59 kubelet[19205]: E0417 17:27:18.262418 19205 cni.go:385] Error deleting kube-system_tiller-deploy-969865475-sn2k2/f35df2a630d07b0ec7149fb06d7216c60a3c77a7118924c7b7eb9556b02f5cab from network multus/multus-cni-network: netplugin failed with no error message
Apr 17 17:27:18 host59 kubelet[19205]: W0417 17:27:18.263092 19205 cni.go:331] CNI failed to retrieve network namespace path: Error: No such container: beb6e83c61bc47ba808dcc51e6c76e89817efb1f518fe28bc1083c99ad4721e1
Apr 17 17:27:19 host59 kubelet[19205]: E0417 17:27:19.660435 19205 dns.go:135] Nameserver limits were exceeded, some nameservers have been omitted, the applied nameserver line is: 10.221.16.11 10.221.16.10 150.236.34.180

So the multus pod/container is not running correctly. Check the pod:

[root@host59 ~]# kubectl describe pod  -n kube-system pod kube-multus-ds-amd64-wz5xj
Name: kube-multus-ds-amd64-wz5xj
Namespace: kube-system
Node: host59/2001:1b74:88:9400::59:59
Events:
Type Reason Age From Message
Warning DNSConfigForming 117s (x291 over 6h7m) kubelet, host59 Nameserver limits were exceeded, some nameservers have been omitted, the applied nameserver line is: 10.221.16.11 10.221.16.10 150.236.34.180
Error from server (NotFound): pods "pod" not found

Delete the kube-multus daemonset:

kubectl delete -f multus-daemonset.yml

Still not working; edit the tiller deployment and find:

[root@host59 opt]# kubectl edit deploy tiller-deploy -n kube-system
status:
  conditions:
  - lastTransitionTime: "2020-04-20T01:53:55Z"
    lastUpdateTime: "2020-04-20T01:53:55Z"
    message: Deployment does not have minimum availability.
    reason: MinimumReplicasUnavailable
    status: "False"
    type: Available
  - lastTransitionTime: "2020-04-20T02:03:56Z"
    lastUpdateTime: "2020-04-20T02:03:56Z"
    message: ReplicaSet "tiller-deploy-b747845f" has timed out progressing.
    reason: ProgressDeadlineExceeded
    status: "False"
    type: Progressing
  observedGeneration: 1
  replicas: 1
  unavailableReplicas: 1
  updatedReplicas: 1

Finally found the reason: multus is not compatible with calico here, which is why the error above appeared: `Error deleting kube-system_tiller-deploy... from network multus/multus-cni-network: netplugin failed with no error message`. Even though I had deleted multus earlier, it had already been configured in the etcd config file under /etc/kubernetes. After modifying the related config, the tiller pod returns to normal.
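For reference, the kind of per-node cleanup this usually involves, assuming multus left its CNI configuration in the default location (the paths below are the multus defaults and are assumptions here; in this cluster the leftover config was reportedly under /etc/kubernetes instead):

```bash
# multus installs itself as the first CNI config that kubelet reads;
# removing it lets kubelet fall back to the remaining calico config
ls /etc/cni/net.d/
#   00-multus.conf  10-calico.conflist  calico-kubeconfig  multus.d/
sudo rm /etc/cni/net.d/00-multus.conf
sudo systemctl restart kubelet
```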