Notes on a giant pitfall: enabling IPv6 and dual stack for our product, starting from the Kubernetes installation
0. The product used to be installed and tested locally with IPv4 only, which was simple. This time we need IPv6/dual-stack support, and according to the official docs that requires Kubernetes 1.16 or above. Check the local version with kubectl version:

```bash
Client Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.8", GitCommit:"211047e9a1922595eaa3a1127ed365e9299a6c23", GitTreeState:"clean", BuildDate:"2019-10-15T12:11:03Z", GoVersion:"go1.12.10", Compiler:"gc", Platform:"windows/amd64"}
```

1.14… I quickly looked up how to upgrade the Kubernetes bundled with Docker Desktop and it apparently involves writing deployment tests or reinstalling Kubernetes, which didn't feel applicable: since it ships with Docker, I wasn't sure whether a reinstall would have side effects. So I decided to install directly on our team's remote server, which also lets the testing happen in one place.
- Logging on to the remote server revealed the second pit: the machine has no internet access… even yum install requires configuring a local yum repo. My first idea was to install another VM that can reach the internet, download Docker and Kubernetes there, package everything, and transfer it to the remote server.
- Set up RHEL 7 in a local VM:
  error when setting up the VM: VT-x is not available (VERR_VMX_NO_VMX)
  solution: https://blog.csdn.net/imilano/article/details/83038682 (note: this action affects the auto-start of Docker). Enable the subscription before downloading Docker: https://blog.csdn.net/yl_1314/article/details/52044022
- That didn't really work out, so I switched to this instead: https://github.com/wxdlong/ok8s, downloading the packages locally and then copying them onto the 60 machine.
Before downloading, the local user's permission level for Docker needs to be changed. Run commands with:

```bash
export MSYS_NO_PATHCONV=1
```

Then add the local user to the groups "docker-users", "Hyper-V Administrators", "Remote Desktop Users", and "Remote Management Users":

```powershell
PS H:\> net localgroup docker-users ERICSSON\<eid> /add
System error 1378 has occurred.
The specified account name is already a member of the group.

PS H:\> net localgroup "Hyper-V Administrators" ERICSSON\<eid> /add
The command completed successfully.

PS H:\> net localgroup "Remote Desktop Users" ERICSSON\<eid> /add
System error 1378 has occurred.
The specified account name is already a member of the group.

PS H:\> net localgroup "Remote Management Users" ERICSSON\<eid> /add
The command completed successfully.
```

Check the shared drive in Docker Desktop. If the volume still can't be seen, log off and back on to the PC and reset the credentials. Then download the packages to a local folder:

```bash
$ docker run --rm -v '//c//Users//<eid>//download:/ok8s' registry.cn-hangzhou.aliyuncs.com/wxdlong/ok8s:v1.16.3
```
Next, copy this folder to the host server and follow the tutorial from there.
- Errors encountered while configuring Kubernetes and Docker

kubectl cluster-info returns:

[error] The connection to the server localhost:8080 was refused - did you specify the right host or port?

Fix per https://blog.csdn.net/wzygis/article/details/91354870:

```bash
echo "export KUBECONFIG=/etc/kubernetes/admin.conf" >> ~/.bash_profile
source ~/.bash_profile
```

The package is missing the flannel plugin, so install it yourself.

Copy /etc/kubernetes/admin.conf to $HOME/.kube/ ($HOME on Windows; on the host server we just use /root):

```bash
[root@192-168-1-61 ~]# mkdir -p /root/.kube
[root@192-168-1-61 .kube]# cp -i /etc/kubernetes/admin.conf /root/.kube/config
[root@192-168-1-61 .kube]# ls -ltr /root/.kube/
total 8
-rw-------. 1 root root 5448 Mar 17 21:33 config
```

This step may be redundant, since the first error above already exported the path.
Check pod status:

```bash
[root@192-168-1-61 ok8s]# kubectl get pods -A
NAMESPACE     NAME                                        READY   STATUS    RESTARTS   AGE
kube-system   coredns-5644d7b6d9-5sp86                    0/1     Pending   0          15h
kube-system   coredns-5644d7b6d9-qjfz9                    0/1     Pending   0          15h
kube-system   etcd-192-168-1-61.maas                      1/1     Running   0          15h
kube-system   kube-apiserver-192-168-1-61.maas            1/1     Running   0          15h
kube-system   kube-controller-manager-192-168-1-61.maas   1/1     Running   0          15h
kube-system   kube-flannel-ds-amd64-vx4bw                 1/1     Running   0          15h
kube-system   kube-proxy-h4gdc                            1/1     Running   0          15h
kube-system   kube-scheduler-192-168-1-61.maas            1/1     Running   0          15h
```

According to the result above, the CoreDNS service has not started successfully. Flannel still does not work; check with tail -f /var/log/messages, which shows:

Unable to update cni config: no valid networks found in /etc/cni/net.d
[fork/exec /opt/ok8s/cni/flannel: permission denied fork/exec /opt/ok8s/cni/portmap: permission denied]

Go to the folder with cd /opt/ok8s/cni/ and run chmod +x *.

Tip: files without execute permission usually show as plain white text; after granting execute permission as above they turn green, while chmod 777 turns them grey-backgrounded. Run kubectl get pods -A again and the two CoreDNS pods are now running.
Install the dashboard

```bash
[root@192-168-1-61 ok8s]# kubectl get pods -A
kubernetes-dashboard   dashboard-metrics-scraper-76585494d8-qb5mw   1/1   Running            0   101m
kubernetes-dashboard   kubernetes-dashboard-5996555fd8-88qz6        0/1   ImagePullBackOff   0   101m
```

The image pull failed here, apparently because the host server cannot reach the internet. The dashboard is not that important, so skip this step for now.
Switch the network plugin to Calico

After installing flannel I found out it does not support IPv6 yet. Facepalm. So I had to install Calico instead. First, bookmarking the official Calico docs for IPv6 / dual-stack support, plus a very detailed Chinese installation tutorial. After kubeadm reset, change pod-network-cidr and re-init the cluster:

```bash
[root@192-168-1-61 ok8s]# kubeadm init --v=7 --pod-network-cidr=192.168.0.0/16 --kubernetes-version=v1.16.3
```

Calico is installed as a CNI plugin, so the kubelet must be configured to use CNI networking by passing --network-plugin=cni. The --pod-network-cidr=192.168.0.0/16 value is what the controller-manager uses to automatically carve out pod subnets, i.e. to allocate IP addresses to the pods on each node.

Follow this official doc.
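To confirm what the controller-manager actually handed out to the node, you can read the pod CIDR back; a small sketch (node name taken from the cluster above):

```bash
# print the pod CIDR the controller-manager allocated to this node
kubectl get node 192-168-1-61.maas -o jsonpath='{.spec.podCIDR}{"\n"}'
```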
Because the server cannot reach the internet, the images cannot be pulled, so download them manually and copy them over.

```bash
kube-system   calico-node-vcbmc   0/1   Init:ImagePullBackOff   0   9m12s
```

```bash
[root@192-168-1-61 nodeagent~uds]# kubectl describe pods calico-node-vcbmc -n kube-system
Events:
  Type     Reason     Age                     From                         Message
  Normal   Scheduled  9m44s                   default-scheduler            Successfully assigned kube-system/calico-node-vcbmc to 192-168-1-61.maas
  Warning  Failed     9m3s                    kubelet, 192-168-1-61.maas   Failed to pull image "calico/cni:v3.13.1": rpc error: code = Unknown desc = Error response from daemon: Get https://registry-1.docker.io/v2/: dial tcp: lookup registry-1.docker.io on 10.136.40.87:53: server misbehaving
  Normal   Pulling    7m37s (x4 over 9m44s)   kubelet, 192-168-1-61.maas   Pulling image "calico/cni:v3.13.1"
  Warning  Failed     7m22s (x3 over 9m29s)   kubelet, 192-168-1-61.maas   Failed to pull image "calico/cni:v3.13.1": rpc error: code = Unknown desc = Error response from daemon: Get https://registry-1.docker.io/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
  Warning  Failed     7m22s (x4 over 9m29s)   kubelet, 192-168-1-61.maas   Error: ErrImagePull
  Warning  Failed     7m9s (x6 over 9m28s)    kubelet, 192-168-1-61.maas   Error: ImagePullBackOff
  Normal   BackOff    4m36s (x16 over 9m28s)  kubelet, 192-168-1-61.maas   Back-off pulling image "calico/cni:v3.13.1"
```

Download calico v3.13.1 via https://docs.projectcalico.org/release-notes/ and transfer the tgz package to the host server.

```bash
[root@192-168-1-61 calico]# docker load --input /home/eshibij/calico-v3.13.1/release-v3.13.1/images/calico-node.tar
```

--> load the calico image and check:

```bash
[root@192-168-1-61 calico]# docker images
REPOSITORY    TAG       IMAGE ID       CREATED      SIZE
calico/node   v3.13.1   2e5029b93d4a   5 days ago   260MB
```

Load pod2daemon-flexvol, calico-cni and calico-kube-controllers the same way (a loop over the bundle's images directory is sketched below).
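A minimal sketch for loading the whole bundle in one go, assuming the release archive is unpacked under the same path as the docker load command above:

```bash
# load every image tar shipped in the calico release bundle, then verify
for tar in /home/eshibij/calico-v3.13.1/release-v3.13.1/images/*.tar; do
  docker load --input "$tar"
done
docker images | grep calico
```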
[error] kubectl describe pod shows:

calico 0/1 nodes are available: 1 node(s) had taints that the pod didn't tolerate.

--> This usually means another pod it depends on has not come up yet. Check /var/log/messages:

--> failed to find plugin "calico" in path [/opt/ok8s/cni]

--> The cause is that ok8s changed the default plugin path of the loaded image to /opt/ok8s/cni. You can either copy the calico and calico-ipam binaries from /opt/cni/bin (the default plugin path) to /opt/ok8s/cni (sketched after the pod listing below), or edit the file paths in calico.yaml and re-apply it.

```bash
[root@192-168-1-61 cni]# kubectl get pods -A
NAMESPACE     NAME                                        READY   STATUS    RESTARTS   AGE
kube-system   calico-kube-controllers-788d6b9876-dzlrz    1/1     Running   0          15h
kube-system   calico-node-fttdx                           1/1     Running   0          15h
kube-system   coredns-5644d7b6d9-dl8ft                    1/1     Running   0          15h
kube-system   coredns-5644d7b6d9-dzrdv                    1/1     Running   0          15h
kube-system   etcd-192-168-1-61.maas                      1/1     Running   0          15h
kube-system   kube-apiserver-192-168-1-61.maas            1/1     Running   0          15h
kube-system   kube-controller-manager-192-168-1-61.maas   1/1     Running   0          15h
kube-system   kube-proxy-l54q7                            1/1     Running   0          15h
kube-system   kube-scheduler-192-168-1-61.maas            1/1     Running   0          15h
```
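The copy option, as a rough sketch (source and target paths taken from the error above):

```bash
# make the calico CNI binaries available under the path ok8s expects, and keep them executable
cp /opt/cni/bin/calico /opt/cni/bin/calico-ipam /opt/ok8s/cni/
chmod +x /opt/ok8s/cni/calico /opt/ok8s/cni/calico-ipam
```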
Modify kubelet.service

Add --feature-gates="IPv6DualStack=true" after ExecStart=/opt/ok8s/bin/kubelet in the file /usr/lib/systemd/system/kubelet.service.
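After editing the unit file, systemd has to be reloaded for the change to take effect; a minimal sketch:

```bash
# the ExecStart line should now read something like:
#   ExecStart=/opt/ok8s/bin/kubelet --feature-gates="IPv6DualStack=true" ...
systemctl daemon-reload
systemctl restart kubelet
systemctl status kubelet --no-pager   # quick sanity check
```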
Host server network config

Add these settings to /etc/sysctl.d/98-ok8s.conf:

```bash
net.ipv6.conf.all.disable_ipv6 = 0
net.ipv6.conf.default.disable_ipv6 = 0
net.ipv6.conf.lo.disable_ipv6 = 0
net.ipv6.conf.all.forwarding = 1
```

Run sysctl -p to make these settings take effect.
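One caveat worth noting (an addition, not from the original steps): on RHEL 7, sysctl -p with no argument only reloads /etc/sysctl.conf, so the drop-in file may need to be named explicitly:

```bash
sysctl -p /etc/sysctl.d/98-ok8s.conf       # apply just this file
sysctl --system                            # or reload every sysctl config file
sysctl net.ipv6.conf.all.disable_ipv6      # should print 0 once applied
```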
Enable IPv6 on the host server by adding this setting to /etc/sysconfig/network:

```bash
NETWORKING_IPV6=yes
```

Check ifcfg-xxx under /etc/sysconfig/network-scripts/:

```bash
IPV6INIT=yes
IPV6_AUTOCONF=yes
```
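A quick way to verify the change took effect, assuming the legacy network service is in use on this RHEL 7 host:

```bash
systemctl restart network
ip -6 addr show      # the interface should now carry an IPv6 address
```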
Generate the kubeconfig file to parse the configs

[error] k8s v1.16.3 cannot parse a subnet list with commas: https://github.com/kubernetes/kubeadm/issues/1828

Update Kubernetes to v1.17.4. Packages/images updated:

- binary files: kubeadm, kubelet, kubectl, kube-proxy, kube-scheduler
- docker images: controller-manager, proxy, scheduler, apiserver, etcd, coredns

Note: the official coredns package has to be loaded with docker import coredns_1.6.5_linux_amd64.tgz (for the reason see https://visionary-s.github.io/%2F2020%2F01%2F20%2Fdocker%2F).
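A hedged example of that import step; the target tag is an assumption and should match whatever kubeadm config images list reports:

```bash
# the coredns release archive is not a `docker save` image tarball, so docker load
# rejects it; import the filesystem tar and tag it manually instead
docker import coredns_1.6.5_linux_amd64.tgz k8s.gcr.io/coredns:1.6.5
```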
Check the package versions after replacing the kubeadm binary: kubeadm config images list
Reinstall following the official docs.

- [error] "failed to execute operation: File exists" when running systemctl enable kubelet
  solution: run systemctl disable kubelet first, then redo the enable action. (ref)
- [error] cannot use "fe80::2a80:23ff:feb5:a150" as the bind address for the API Server
  Found in kubeconfig.conf. Cause: an IPv6 address with link scope cannot be used here; it has to be one with global scope.

```yaml
apiVersion: kubeadm.k8s.io/v1beta1
kind: InitConfiguration
nodeRegistration:
  kubeletExtraArgs:
    ##cgroup-driver: "systemd"
localAPIEndpoint:
  advertiseAddress: fe80::2a80:23ff:feb5:a150   # change to the global-scope address 2001:250:4000:2000::53
```
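To pick a usable address, list only the global-scope IPv6 addresses on the host:

```bash
# link-local fe80:: addresses (scope link) are rejected by the API server;
# use one that shows up here instead
ip -6 addr show scope global
```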
- The official etcd 3.4.3 image seems to have some problem; etcd kept failing to connect on startup.
  solution: re-tagged the old 3.1 image and used that instead...
- kube-controller-manager-192-168-1-61.maas CrashLoopBackOff
  kubectl describe shows it restarting endlessly; check the docker logs:

```bash
Warning  Unhealthy  91s                   kubelet, 192-168-1-61.maas  Liveness probe failed: Get https://127.0.0.1:10257/healthz: dial tcp 127.0.0.1:10257: connect: connection refused
Warning  BackOff    57s (x12 over 4m30s)  kubelet, 192-168-1-61.maas  Back-off restarting failed container
```

solution: ?

```bash
E0325 03:00:25.517141       1 core.go:91] Failed to start service controller: WARNING: no cloud provider provided, services of type LoadBalancer will fail
E0325 03:00:25.638913       1 core.go:232] failed to start cloud node lifecycle controller: no cloud provider provided
E0325 03:00:37.441597       1 controllermanager.go:521] Error starting "nodeipam"
F0325 03:00:37.441623       1 controllermanager.go:235] error starting controllers: New CIDR set failed; the node CIDR size is too big
```
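The post leaves this one unresolved. For what it's worth, the message comes from the nodeipam controller, which for IPv6 refuses a gap of more than 16 bits between the cluster pod CIDR prefix and the per-node CIDR mask; a hedged way to inspect and adjust it (manifest path per a standard kubeadm layout, flag value an assumption, not something verified in this post):

```bash
# check which cluster-cidr / node-cidr-mask-size the controller-manager runs with;
# for a /120 pod subnet something like --node-cidr-mask-size=123 keeps the gap small
grep -nE "cluster-cidr|node-cidr-mask-size" /etc/kubernetes/manifests/kube-controller-manager.yaml
```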
- [error] failed to execute operation: File exists when installing calico

Error when starting the calico node:

```bash
kube-system   calico-node-mzcnp   0/1   Init:CrashLoopBackOff   2   31s
```

describe events:

```bash
Normal   Scheduled  56s                default-scheduler           Successfully assigned kube-system/calico-node-mzcnp to 192-168-1-61.maas
Normal   Pulled     12s (x4 over 56s)  kubelet, 192-168-1-61.maas  Container image "calico/cni:v3.13.1" already present on machine
Normal   Created    12s (x4 over 56s)  kubelet, 192-168-1-61.maas  Created container upgrade-ipam
Warning  Failed     12s (x4 over 56s)  kubelet, 192-168-1-61.maas  Error: failed to start container "upgrade-ipam": Error response from daemon: OCI runtime create failed: container_linux.go:346: starting container process caused "exec: \"/opt/ok8s/cni/calico-ipam\": stat /opt/ok8s/cni/calico-ipam: no such file or directory": unknown
Warning  BackOff    8s (x4 over 41s)   kubelet, 192-168-1-61.maas  Back-off restarting failed container
```

But I found calico-ipam is already under /opt/ok8s/cni/.

Cause: the paths in the calico.yaml config were wrong; with the stock values from the official site the calico node would not come up either.
1
2
3
4
5
6
7
8
9
10
11[root@192-168-1-61 calico]# kubectl get pods -A
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system calico-kube-controllers-788d6b9876-wkd5h 0/1 ContainerCreating 0 39m
kube-system calico-node-64q4h 0/1 Running 0 59s
kube-system coredns-6955765f44-k9cb5 0/1 ContainerCreating 0 45m
kube-system coredns-6955765f44-nznr9 0/1 ContainerCreating 0 45m
kube-system etcd-192-168-1-61.maas 1/1 Running 0 45m
kube-system kube-apiserver-192-168-1-61.maas 1/1 Running 0 45m
kube-system kube-controller-manager-192-168-1-61.maas 1/1 Running 0 45m
kube-system kube-proxy-ggjvb 1/1 Running 0 45m
kube-system kube-scheduler-192-168-1-61.maas 1/1 Running 0 45mdescribe calico-kube-controller:
1
2
3
4Warning FailedScheduling 59s (x30 over 39m) default-scheduler 0/1 nodes are available: 1 node(s) had taints that the pod didn't tolerate.
Normal Scheduled 57s default-scheduler Successfully assigned kube-system/calico-kube-controllers-788d6b9876-wkd5h to 192-168-1-61.maas
Warning FailedCreatePodSandBox 54s kubelet, 192-168-1-61.maas Failed to create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "27761f7a236bf9c092826ecb546dddad7c44a40fa1891faf6c45b14f330ad25d" network for pod "calico-kube-controllers-788d6b9876-wkd5h": networkPlugin cni failed to set up pod "calico-kube-controllers-788d6b9876-wkd5h_kube-system" network: error getting ClusterInformation: Get https://[10.24.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default: dial tcp 10.24.0.1:443: connect: connection refused, failed to clean up sandbox container "27761f7a236bf9c092826ecb546dddad7c44a40fa1891faf6c45b14f330ad25d" network for pod "calico-kube-controllers-788d6b9876-wkd5h": networkPlugin cni failed to teardown pod "calico-kube-controllers-788d6b9876-wkd5h_kube-system" network: error getting ClusterInformation: Get https://[10.24.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default: dial tcp 10.24.0.1:443: connect: connection refused]
Normal SandboxChanged 9s (x5 over 53s) kubelet, 192-168-1-61.maas Pod sandbox changed, it will be killed and re-created.原因:
https://[10.24.0.1]:443
这个应该对应kubeconfig中kube-proxy下的clusterCIDR,应该跟kubelet下的podSubnet保持一致?(我用了serviceSubnet的网段)
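One way to compare what is actually in effect, as a sketch (ConfigMap names are the kubeadm defaults):

```bash
# clusterCIDR handed to kube-proxy vs. the subnets recorded at init time
kubectl -n kube-system get configmap kube-proxy -o yaml | grep -i clusterCIDR
kubectl -n kube-system get configmap kubeadm-config -o yaml | grep -iE 'podSubnet|serviceSubnet'
```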
Still not working. Later I switched to a pure IPv6 environment instead; the config file kubeconfig.yml is as follows:

```yaml
apiVersion: kubeadm.k8s.io/v1beta2
kind: InitConfiguration
localAPIEndpoint:
  advertiseAddress: 2001:250:4000:2000::53
  bindPort: 6443
nodeRegistration:
  criSocket: /var/run/dockershim.sock
  taints:
  - effect: NoSchedule
    key: node-role.kubernetes.io/master
  kubeletExtraArgs:
    node-ip: "2001:250:4000:2000::53"
---
apiServer:
  timeoutForControlPlane: 4m0s
apiVersion: kubeadm.k8s.io/v1beta2
certificatesDir: /etc/kubernetes/pki
clusterName: kubernetes
controlPlaneEndpoint: "[2001:250:4000:2000::53]:6443"
dns:
  type: CoreDNS
imageRepository: k8s.gcr.io
controllerManager:
  extraArgs:
    #feature-gates: IPv6DualStack=true
    bind-address: "::"
    service-cluster-ip-range: fd03::/120
    cluster-cidr: "fd04::/120"
kind: ClusterConfiguration
kubernetesVersion: v1.17.4
networking:
  dnsDomain: cluster.local
  serviceSubnet: "fd03::/120"
  podSubnet: "fd04::/120"
---
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
failSwapOn: false
---
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
clusterCIDR: "fd03::/120"
mode: "ipvs"
```
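Assuming the file above is saved as kubeconfig.yml, the cluster is then reinitialised from it; a sketch (the verbosity flag just mirrors the earlier init):

```bash
kubeadm reset -f
kubeadm init --config=kubeconfig.yml --v=7
```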
The parameters in the calico.yaml manifest were modified as follows (note: this is the dual-stack version; for IPv6-only, remove the IPv4 parts):
```yaml
kind: ConfigMap
...
data:
  ...
  cni_network_config: |-
    {
      "name": "k8s-pod-network",
      "cniVersion": "0.3.1",
      "plugins": [
        {
          ...
          "ipam": {
              "type": "calico-ipam",
              "assign_ipv4": "true",
              "assign_ipv6": "true",
              "ipv4_pools": ["192.170.30.0/16", "default-ipv4-ippool"],
              "ipv6_pools": ["fd03::/120", "default-ipv6-ippool"]
          },
          ...
        },
        ...
      ]
    }
  ...
...
---
kind: DaemonSet
...
spec:
  ...
  template:
    ...
    spec:
      ...
      containers:
        - name: calico-node
          image: calico/node:v3.13.1
          env:
            ...
            - name: CALICO_IPV4POOL_CIDR
              value: "192.170.30.0/16"
            ...
            # Enable IPv6 on Kubernetes.
            - name: FELIX_IPV6SUPPORT
              value: "true"
            ...
            - name: CALICO_IPV6POOL_CIDR
              value: "fd04::/120"
            - name: IP6
              value: "autodetect"
            # NOTE: discovered later when working on outgoing traffic: in an IPv6
            # environment NAT is not configured by default, see
            # https://docs.projectcalico.org/reference/node/configuration
            # so this parameter has to be added as well
            - name: CALICO_IPV6POOL_NAT_OUTGOING
              value: "true"
            ...
```

Then create it again. To check whether the service addresses line up (I forget the exact debug steps), roughly these commands were used:

```bash
ss -atnlp | grep 443
netstat -an | grep 6443
```
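In addition, a hedged way to confirm that the edited pools actually made it into the cluster (resource name per a standard Calico install):

```bash
# both the IPv4 and IPv6 pools configured above should be listed here
kubectl get ippools.crd.projectcalico.org -o yaml | grep -i cidr
```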
After installing an Nginx pod, I tried writing a test service to see whether the IPv6 address could be reached through Nginx. It failed.

- service.yaml

```yaml
apiVersion: v1
kind: Service
metadata:
  name: cec
spec:
  ipFamily: IPv6
  ports:
  - name: http
    port: 8080
    protocol: TCP
    targetPort: 80
  selector:
    app: nginx
```

- nginx.yaml

```yaml
apiVersion: apps/v1 # for versions before 1.9.0 use apps/v1beta2
kind: Deployment
metadata:
  name: nginx-deployment
spec:
  selector:
    matchLabels:
      app: nginx
  replicas: 2 # tells deployment to run 2 pods matching the template
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.14.2
        ports:
        - containerPort: 80
```

- debug
```bash
[root@192-168-1-61 ok8s]# kubectl apply -f service.yaml
service/cec created
[root@192-168-1-61 ok8s]# kubectl get service -A
NAMESPACE     NAME         TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)                  AGE
default       cec          ClusterIP   fd03::7d     <none>        8080/TCP                 35s
default       kubernetes   ClusterIP   fd03::1      <none>        443/TCP                  4h36m
kube-system   kube-dns     ClusterIP   fd03::a      <none>        53/UDP,53/TCP,9153/TCP   4h36m
[root@192-168-1-61 ok8s]# kubectl describe -n default pod nginx-deployment-574b87c764-qptmz
...
Node:         192-168-1-61.maas/2001:250:4000:2000::53
Labels:       app=nginx
Annotations:  cni.projectcalico.org/podIP: fd04::d5/128
              cni.projectcalico.org/podIPs: fd04::d5/128
IP:           fd04::d5
Containers:
  nginx:
    Container ID:   docker://199c74a7f44d431d95091d991025e43b24dca7fa9cc9d77e3781af1b89f160ce
    Image:          nginx:1.14.2
    Port:           80/TCP
    Host Port:      0/TCP
...
[root@192-168-1-61 ok8s]# kubectl get pods -o yaml | grep -i podip
      cni.projectcalico.org/podIP: fd04::d5/128
      cni.projectcalico.org/podIPs: fd04::d5/128
    podIP: fd04::d5
    podIPs:
[root@192-168-1-61 ok8s]# kubectl describe svc cec
...
Selector:     app=nginx
Type:         ClusterIP
IP:           fd03::7d
Port:         http  8080/TCP
TargetPort:   80/TCP
Endpoints:    [fd04::d5]:80
[root@192-168-1-61 ok8s]# curl -vvv -k http://[fd03::7d]:8080 -g
* About to connect() to fd03::7d port 8080 (#0)
*   Trying fd03::7d...
* Connection refused
* Failed connect to fd03::7d:8080; Connection refused
* Closing connection 0
curl: (7) Failed connect to fd03::7d:8080; Connection refused
[root@192-168-1-61 ok8s]# >/dev/tcp/fd03::7d/8080
-bash: connect: Connection refused
-bash: /dev/tcp/fd03::7d/8080: Connection refused
[root@192-168-1-61 ok8s]# ipvsadm -ln
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port            Forward Weight ActiveConn InActConn
TCP  [fd03::1]:443 rr
  -> [2001:250:4000:2000::53]:6443 Masq    1      4          0
TCP  [fd03::a]:53 rr
  -> [fd04::cf]:53                 Masq    1      0          0
  -> [fd04::d7]:53                 Masq    1      0          0
TCP  [fd03::a]:9153 rr
  -> [fd04::cf]:9153               Masq    1      0          0
  -> [fd04::d7]:9153               Masq    1      0          0
TCP  [fd03::7d]:8080 rr
  -> [fd04::d5]:80                 Masq    1      0          0
UDP  [fd03::a]:53 rr
  -> [fd04::cf]:53                 Masq    1      0          0
  -> [fd04::d7]:53                 Masq    1      0          0
```

```bash
[root@192-168-1-61 cec-installer]# kubectl get svc,deploy
NAME                 TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)   AGE
service/kubernetes   ClusterIP   fd03::1      <none>        443/TCP   137m

NAME                               READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/nginx-deployment   2/2     2            2           105m
```

```bash
[root@192-168-1-61 ok8s]# kubectl exec -it nginx-deployment-574b87c764-qptmz bash
root@nginx-deployment-574b87c764-qptmz:/# >/dev/tcp/127.0.0.1/80
root@nginx-deployment-574b87c764-qptmz:/# >/dev/tcp/fd04::d5/80
bash: connect: Connection refused
bash: /dev/tcp/fd04::d5/80: Connection refused
root@nginx-deployment-574b87c764-qptmz:/# cd /etc/nginx/
root@nginx-deployment-574b87c764-qptmz:/etc/nginx# cat nginx.conf
user  nginx;
worker_processes  1;
error_log  /var/log/nginx/error.log warn;
pid        /var/run/nginx.pid;
events {
    worker_connections  1024;
}
http {
    include       /etc/nginx/mime.types;
    default_type  application/octet-stream;
    log_format  main  '$remote_addr - $remote_user [$time_local] "$request" '
                      '$status $body_bytes_sent "$http_referer" '
                      '"$http_user_agent" "$http_x_forwarded_for"';
    access_log  /var/log/nginx/access.log  main;
    sendfile        on;
    #tcp_nopush     on;
    keepalive_timeout  65;
    #gzip  on;
    include /etc/nginx/conf.d/*.conf;
}
// not modified on my desktop, thus skip the modification here
root@nginx-deployment-574b87c764-qptmz:/etc/nginx# nginx -s reload
2020/03/26 09:01:37 [notice] 28#28: signal process started
root@nginx-deployment-574b87c764-qptmz:/etc/nginx# >/dev/tcp/fd04::d5/80
```
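The post skips the actual nginx config change. For this symptom (the pod only has an IPv6 address while nginx listens on 0.0.0.0:80), a plausible edit, purely an assumption and not confirmed by the source, is adding an IPv6 listener in the default server block:

```bash
# hypothetical fix in /etc/nginx/conf.d/default.conf (stock nginx:1.14.2 image):
#     listen       80;
#     listen  [::]:80;     # <- add this line
nginx -s reload            # then reload, as done above
```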
Install helm/tiller/nfs/ingress

Background: the host server has no connection to the outside internet, so install helm and tiller locally, export the tiller image, and transfer it to the host server.

Step 1: install helm

```bash
wget https://get.helm.sh/helm-v2.16.1-linux-amd64.tar.gz
// transfer package to host server:/var/tmp/xxx-installer/
[root@192-168-1-61 cec-installer]# tar zxvf helm-v2.16.1-linux-amd64.tar.gz
linux-amd64/
linux-amd64/helm
linux-amd64/LICENSE
linux-amd64/tiller
linux-amd64/README.md
[root@192-168-1-61 cec-installer]# sudo install **/helm /usr/bin
[root@192-168-1-61 cec-installer]# helm version
Client: &version.Version{SemVer:"v2.16.1", GitCommit:"bbdfe5e7803a12bbdf97e94cd847859890cf4050", GitTreeState:"clean"}
Error: could not find tiller
[root@192-168-1-61 cec-installer]# helm init
Creating /root/.helm
Creating /root/.helm/repository
Creating /root/.helm/repository/cache
Creating /root/.helm/repository/local
Creating /root/.helm/plugins
Creating /root/.helm/starters
Creating /root/.helm/cache/archive
Creating /root/.helm/repository/repositories.yaml
Adding stable repo with URL: https://kubernetes-charts.storage.googleapis.com
Error: error initializing: Looks like "https://kubernetes-charts.storage.googleapis.com" is not a valid chart repository or cannot be reached: Get https://kubernetes-charts.storage.googleapis.com/index.yaml: dial tcp: lookup kubernetes-charts.storage.googleapis.com on 10.136.40.87:53: server misbehaving
```

Here I referred to https://www.jianshu.com/p/2bb1dfdadee8.
```bash
// not sure whether this step is required, but I did it anyway
[root@192-168-1-61 cec-installer]# cd linux-amd64/
[root@192-168-1-61 linux-amd64]# ls
LICENSE  README.md  helm  tiller
[root@192-168-1-61 linux-amd64]# vim rbac-config.yaml
```

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: tiller
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  name: tiller
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin
subjects:
  - kind: ServiceAccount
    name: tiller
    namespace: kube-system
```

```bash
[root@192-168-1-61 linux-amd64]# kubectl create -f rbac-config.yaml
serviceaccount/tiller created
clusterrolebinding.rbac.authorization.k8s.io/tiller created
```

Step 2: init tiller

Save the local tiller image and transfer it to the host server (sketched below), then follow the answer from @Amit-Thawait. One more issue came up here: my local Win10 box previously had helm and tiller 2.14 installed, while this time 2.16 is used, and I did not know how to upgrade. After asking around, it turns out you just replace the two corresponding exe files under /.helm/ with the 2.16 helm.exe and tiller.exe and run init again; presumably the upgrade on Linux is likewise just a matter of swapping the binaries.
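A sketch of that export/import round trip, assuming the tiller image is available on the machine that has internet access:

```bash
# on the machine with internet access
docker pull gcr.io/kubernetes-helm/tiller:v2.16.1
docker save -o tiller-v2.16.1.tar gcr.io/kubernetes-helm/tiller:v2.16.1
# transfer the tar to the host server, then load it there
docker load -i tiller-v2.16.1.tar
```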
```bash
[root@host63 cec-installer]# docker image tag 1f92aa902d73 gcr.io/kubernetes-helm/tiller:v2.16.1
[root@192-168-1-61 cec-installer]# helm init --client-only --skip-refresh
Creating /root/.helm/repository/repositories.yaml
Adding stable repo with URL: https://kubernetes-charts.storage.googleapis.com
Adding local repo with URL: http://127.0.0.1:8879/charts
$HELM_HOME has been configured at /root/.helm.
Not installing Tiller due to 'client-only' flag having been set
// load tiller image
[root@192-168-1-61 cec-installer]# helm init
$HELM_HOME has been configured at /root/.helm.
Tiller (the Helm server-side component) has been installed into your Kubernetes Cluster.
Please note: by default, Tiller is deployed with an insecure 'allow unauthenticated users' policy.
To prevent this, run `helm init` with the --tiller-tls-verify flag.
For more information on securing your installation see: https://docs.helm.sh/using_helm/#securing-your-helm-installation
```

A helm error happens:

```bash
[root@192-168-1-61 cec-installer]# helm version
Client: &version.Version{SemVer:"v2.16.1", GitCommit:"bbdfe5e7803a12bbdf97e94cd847859890cf4050", GitTreeState:"clean"}
E0326 23:21:17.649683   11541 portforward.go:400] an error occurred forwarding 39127 -> 44134: error forwarding port 44134 to pod 8d197e975824256d2de574f4577161702560b55a4aa70978bac91f2e43abe712, uid : unable to do port forwarding: socat not found
E0326 23:21:18.652732   11541 portforward.go:400] an error occurred forwarding 39127 -> 44134: error forwarding port 44134 to pod 8d197e975824256d2de574f4577161702560b55a4aa70978bac91f2e43abe712, uid : unable to do port forwarding: socat not found
```

According to https://www.kubernetes.org.cn/3879.html, this error happens because the node is missing socat, which kubectl/helm port forwarding relies on, so install socat to fix it. (An internal yum repository had been set up by now, so it can be installed directly.)

```bash
[root@192-168-1-61 cec-installer]# yum install socat.x86_64
[root@192-168-1-61 cec-installer]# helm version
Client: &version.Version{SemVer:"v2.16.1", GitCommit:"bbdfe5e7803a12bbdf97e94cd847859890cf4050", GitTreeState:"clean"}
Server: &version.Version{SemVer:"v2.16.4", GitCommit:"5e135cc465d4231d9bfe2c5a43fd2978ef527e83", GitTreeState:"clean"}
```

The versions don't quite match, but I was too lazy to reinstall, so it'll do for now.
Step 3: install nfs

[error] cannot install nfs:

```bash
helm install stable/nfs-server-provisioner --name nfs-provisioner --set persistence.storageClass=nfs --set persistence.size=20Gi --set rbac.create=true
Error: validation failed: [storageclasses.storage.k8s.io "nfs" not found, serviceaccounts "nfs-provisioner-nfs-server-provisioner" not found, clusterroles.rbac.authorization.k8s.io "nfs-provisioner-nfs-server-provisioner" not found, clusterrolebindings.rbac.authorization.k8s.io "nfs-provisioner-nfs-server-provisioner" not found, services "nfs-provisioner-nfs-server-provisioner" not found, statefulsets.apps "nfs-provisioner-nfs-server-provisioner" not found]
```

Try downgrading tiller to 2.16.1 on Windows:

```bash
$ helm reset --force
$ helm init -i registry.cn-hangzhou.aliyuncs.com/google_containers/tiller:v2.16.1
```

nfs installed on Windows successfully:

```bash
$ kubectl get storageclass
NAME                 PROVISIONER                                             AGE
hostpath (default)   docker.io/hostpath                                      7d3h
nfs                  cluster.local/nfs-provisioner-nfs-server-provisioner   102s
```

For the offline Linux server, first download the helm chart package, then package the chart files into a tgz and helm install it on the host server:

```bash
[root@192-168-1-61 cec-installer]# helm install -n nfs-provisioner ./nfs-servier-provisoner.tgz \
> --set persistence.storageClass=nfs \
> --set persistence.size=20Gi
Error: release nfs-provisioner failed: namespaces "default" is forbidden: User "system:serviceaccount:kube-system:default" cannot get resource "namespaces" in API group "" in the namespace "default"
```

This is probably because the helm-tiller service account was not added when helm was set up; adding it as follows fixes it:
```bash
[root@192-168-1-61 cec-installer]# kubectl create serviceaccount --namespace kube-system helm-tiller
serviceaccount/helm-tiller created
[root@192-168-1-61 cec-installer]# kubectl create clusterrolebinding helm-tiller-cluster-rule --clusterrole=cluster-admin --serviceaccount=kube-system:helm-tiller
clusterrolebinding.rbac.authorization.k8s.io/helm-tiller-cluster-rule created
[root@192-168-1-61 cec-installer]# helm init --service-account=helm-tiller --upgrade
$HELM_HOME has been configured at /root/.helm.
Tiller (the Helm server-side component) has been updated to gcr.io/kubernetes-helm/tiller:v2.16.1 .
[root@192-168-1-61 cec-installer]# helm install -n nfs-provisioner ./nfs-servier-provisoner.tgz --set persistence.storageClass=nfs --set persistence.size=20Gi
NAME:   nfs-provisioner
LAST DEPLOYED: Mon Mar 30 02:13:19 2020
NAMESPACE: default
STATUS: DEPLOYED
...
[root@192-168-1-61 cec-installer]# kubectl get storageclass
NAME   PROVISIONER                                             RECLAIMPOLICY   VOLUMEBINDINGMODE   ALLOWVOLUMEEXPANSION   AGE
nfs    cluster.local/nfs-provisioner-nfs-server-provisioner    Delete          Immediate           true                   2m42s
```

Step 4: install nginx-ingress, similar to installing nfs

```bash
docker load -i ingress-nginx.tar.gz
docker image tag 20c7790fd73d quay.io/kubernetes-ingress-controller/nginx-ingress-controller:0.29.0
helm install -n nginx-ingress ./nginx-ingress-deploy.tgz --set rbac.create=true
```
Finally, start to install our product

The product's Jenkins CI is down because nobody maintains it… so the packages have to be built locally.

Step 1: run ./startindesignenv.sh -c -d. The last thing the -d step does is npm start, so it does not exit on its own and the process has to be cancelled manually. After build + deploy, the compiled frontend files that get copied over can be found under coreservice/backend/webroot (similar to a target directory).

Step 2: in any service directory (e.g. coreservice), run $ docker build -t <package-name> . to build the package. coreservice is special: since frontend and backend are bundled together, step 1 has to be executed first to integrate the frontend into the backend; the other services can be packaged directly.

error1:

```bash
Step 1/15 : FROM node:10.15.3-alpine
error pulling image configuration: Get https://registry-1.docker.io/v2/library/node/blobs.. net/http: TLS handshake timeout
```

solution: the network is not stable; ignore it and build again.
error2:

```bash
Step 2/15 : RUN apk add --no-cache bash openssl
fetch http://dl-cdn.alpinelinux.org/alpine/v3.9/main/x86_64/APKINDEX.tar.gz
WARNING: Ignoring http://dl-cdn.alpinelinux.org/alpine/v3.9/main/x86_64/APKINDEX.tar.gz: network error (check Internet connection and firewall)
ERROR: unsatisfiable constraints:
  bash (missing):
    required by: world[bash]
  openssl (missing):
    required by: world[openssl]
The command '/bin/sh -c apk add --no-cache bash openssl' returned a non-zero code: 2
```

solution: I didn't figure out the detailed reason, but restarting Docker does not solve this issue. Thanks to Ye, who taught me to add

```dockerfile
ENV HTTP_PROXY="http://www-proxy.lmera.ericsson.se:8080"
ENV HTTPS_PROXY="http://www-proxy.lmera.ericsson.se:8080"
```

to the Dockerfile; the problem is caused by having no external connectivity while fetching the Alpine Linux package index. Note that the proxy ENV settings will also take effect in the container once the image is loaded and run on the host server, so it is better to pass --build-arg when executing docker build, see https://www.cntofu.com/book/139/image/dockerfile/arg.md

e.g. $ docker build -t core-cec . --build-arg HTTP_PROXY="http://www-proxy.lmera.ericsson.se:8080"
Step 3: install on the Linux server

```bash
docker load -i <package>
helm install -n eri2 ./cec-release-1.0.0.tgz --set service.type=NodePort --set persistence.enabled=false --set persistence.storageClass=nfs --set persistence.database.size=5Gi --set persistence.miscellaneous.size=5Gi
```

- Deployed, but got an error while building one of the service containers. Run the image as a container and enter it to check:

```bash
[root@192-168-1-61 cec-installer]# docker run -it registry/cec-sai:1.0.0 sh
/opt/ericsson/cec/saiservice # /usr/local/bin/npm start
/opt/ericsson/cec/saiservice/node_modules/bindings/bindings.js:91
        throw e
        ^
Error: Error loading shared library /opt/ericsson/cec/saiservice/node_modules/libxmljs/build/Release/xmljs.node: Exec format error
    at Object.Module._extensions..node (internal/modules/cjs/loader.js:730:18)
    at Module.load (internal/modules/cjs/loader.js:600:32)
    at tryModuleLoad (internal/modules/cjs/loader.js:539:12)
    at Function.Module._load (internal/modules/cjs/loader.js:531:3)
    at Module.require (internal/modules/cjs/loader.js:637:17)
    at require (internal/modules/cjs/helpers.js:22:18)
    at bindings (/opt/ericsson/cec/saiservice/node_modules/bindings/bindings.js:84:48)
    at Object.<anonymous> (/opt/ericsson/cec/saiservice/node_modules/libxmljs/lib/bindings.js:1:99)
    at Module._compile (internal/modules/cjs/loader.js:701:30)
    at Object.Module._extensions..js (internal/modules/cjs/loader.js:712:10)
npm ERR! code ELIFECYCLE
npm ERR! errno 1
npm ERR! CEC_SAI_Handler@1.0.0 start: `node server.js`
npm ERR! Exit status 1
npm ERR!
npm ERR! Failed at the CEC_SAI_Handler@1.0.0 start script.
npm ERR! This is probably not a problem with npm. There is likely additional logging output above.
npm ERR! A complete log of this run can be found in:
npm ERR!     /root/.npm/_logs/2020-03-31T02_30_46_424Z-debug.log
```

It may be because some native modules in node_modules do not work across both Windows and Linux environments. ref: https://dzone.com/articles/packaging-a-node-app-for-docker-from-windows
solution: remove node_modules before the docker image build and npm install all packages inside the docker container, so that they are generated in Linux format. Build and check the image:

```dockerfile
...
COPY . ./
RUN rm -rf node_modules
...
npm install
```

Save the image:

```bash
$ docker build -t cec-sai . --build-arg HTTP_PROXY="http://www-proxy.lmera.ericsson.se:8080"
$ winpty docker run -it cec-sai sh
/opt/ericsson/cec/saiservice # find / -name npm
/usr/local/lib/node_modules/npm
/usr/local/lib/node_modules/npm/bin/npm
/usr/local/bin/npm
/opt/ericsson/cec/saiservice # /usr/local/bin/npm start

> CEC_SAI_Handler@1.0.0 start /opt/ericsson/cec/saiservice
> node server.js

sequelize deprecated String based operators are now deprecated. Please use Symbol based operators for better security, read more at http://docs.sequelizejs.com/manual/tutorial/querying.html#operators node_modules/sequelize/lib/sequelize.js:245:13
```

Then load it onto the host server on Linux:

```bash
$ docker save -o cec-sai-200327.tar.gz cec-sai
```
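An alternative to deleting node_modules inside the image (not what the post did, just a common variant of the same idea): keep the host-built modules out of the build context entirely.

```bash
# node_modules built on Windows never reaches the daemon, so RUN npm install
# always produces Linux-format modules
echo "node_modules" >> .dockerignore
docker build -t cec-sai . --build-arg HTTP_PROXY="http://www-proxy.lmera.ericsson.se:8080"
```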
Running the ENM server in a docker container is a simpler way: use docker run -itd --name <simulator-name> -p 8091:8091 -v <volume-path> <image-id or name>. The volume path can be left out, but remember to expose port 8091 or the container cannot be accessed (the port depends on your design).

Step 4: install the simulator

Save the image on Windows, load it on Linux, then start the simulator service on the Linux server:

```bash
$ kubectl create deployment xxx-simulator --image=xxx-simulator:1.0.0
deployment.apps/xxx-simulator created
```

Get the simulator IP address:

```bash
$ kubectl get pod -l app=enm-simulator -o wide
NAME                             READY   STATUS    RESTARTS   AGE    IP         NODE                NOMINATED NODE   READINESS GATES
xxx-simulator-79d68f588f-8l2fq   1/1     Running   0          4m3s   fd04::fb   192-168-1-61.maas   <none>           <none>
```
Test functionalities

[error1] cannot apply an IPv6 address in our settings.

solution: change the validation regex from ^http(s)?://[\\w\\.\\-?=%&:/]+$ to ^http(s)?://([\[\\w\\.\\-?=%&:/](\])?)+$ so that bracketed IPv6 hosts pass.
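A quick scratch test of the relaxed pattern against a bracketed IPv6 URL (node is only used here as a throwaway REPL; the pattern is copied from above):

```bash
node -e 'const re = /^http(s)?:\/\/([\[\w\.\-?=%&:/](\])?)+$/;
console.log(re.test("http://[fd04::fe]:8091"));   // expect true
console.log(re.test("http://10.0.0.1:8091"));     // expect true'
```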
[error2] cannot connect to the simulator when doing cell import:

solution: modify the default HOST config in server.js from '0.0.0.0' to ''. See the official Node.js docs, section server.listen(options[, callback]).
Here I record my debug history:

```bash
[root@192-168-1-61 ~]# kubectl exec -it xxx-simulator-79d68f588f-8l2fq bash
bash-4.4# netstat -tunlp
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 0.0.0.0:8091            0.0.0.0:*               LISTEN      23/node
tcp        0      0 0.0.0.0:8092            0.0.0.0:*               LISTEN      23/node
```

The listening address defaults to IPv4 format, so enter our core service (frontend and backend) pod to check further.

```bash
[root@192-168-1-61 ~]# kubectl exec -it -c xxx-core eri1-xxx-6c4b88cd75-7plb8 bash
bash-4.4# ping6 fd04::fd
PING fd04::fd (fd04::fd): 56 data bytes
64 bytes from fd04::fd: seq=0 ttl=63 time=0.144 ms
64 bytes from fd04::fd: seq=1 ttl=63 time=0.179 ms
64 bytes from fd04::fd: seq=2 ttl=63 time=0.069 ms
```

fd04::fd is the IPv6 address of the simulator, so the simulator can be reached from the core service.
```bash
bash-4.4# netstat -tunlp
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 0.0.0.0:5432            0.0.0.0:*               LISTEN      -
tcp        0      0 :::8082                 :::*                    LISTEN      -
tcp        0      0 :::8083                 :::*                    LISTEN      -
tcp        0      0 :::8084                 :::*                    LISTEN      -
tcp        0      0 :::8888                 :::*                    LISTEN      16/node
tcp        0      0 :::5432                 :::*                    LISTEN      -
tcp        0      0 :::8443                 :::*                    LISTEN      16/node
tcp        0      0 :::8585                 :::*                    LISTEN      16/node
```

The ports are all correct, so there might be errors in the simulator code.

```bash
[root@192-168-1-61 ~]# kubectl exec -it enm-simulator-77c8886b88-7ksxk bash
bash-4.4# grep -i host /opt/eri../xxx/simulatorservice/server.js
const HOST = '0.0.0.0';
httpServer.listen(HTTP_PORT, HOST, () => {
    LOGGER.info("HTTP Server is running on : http://%s:%s", HOST, HTTP_PORT);
httpsServer.listen(HTTPS_PORT, HOST, () => {
...
```

Sure enough… change the default config from HOST = '0.0.0.0' to HOST = ''.
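With HOST left empty, Node's server.listen binds to the unspecified address ("::" when IPv6 is available), so both address families work; a quick way to confirm from inside the pod:

```bash
# after the change, the listener should show up as :::8091 instead of 0.0.0.0:8091
netstat -tunlp | grep node
```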
System error:

connect ECONNREFUSED 127.0.0.1:8091, ENM=http://[fd04::fe]:8091

This error indicates that the default route the simulator service connects to is not correct. Checking the code, I found:

```js
module.exports = {
    ENMSERVICEPROTOCO: process.env.enmserviceprotoco || 'http:',
    ENMSERVICEPORT: process.env.enmserviceport || 8083,
    ENMSERVICEHOST: process.env.enmservicehost || '',
    ENMNODEPROTOCO: process.env.enmnodeprotocol || 'http:',
    ENMNODEPORT: process.env.enmnodeport || 8091,
    ENMNODEHOST: process.env.enmnodehost || 'localhost',
    CORENODEPROTOCO: process.env.corenodeprotocol || 'http:',
    CORENODEPORT: process.env.corenodeport || 8585,
    CORENODEHOST: process.env.corenodehost || 'localhost',
}
```

I guessed there should be an ENV parameter configured in the xxx simulator deployment.yaml. To temporarily fix this problem, I edited the deployment config of xxx-simulator, referring to an answer and the k8s official doc:
```yaml
...
spec:
  ...
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: xxx-simulator
    spec:
      containers:
      - env:
        - name: enmnodehost
          value: '[fd04::fe]'
        image: xxx-simulator:1.0.0
        ...
```

This approach is wrong, though. Reason: connect ECONNREFUSED 127.0.0.1:8091 comes from the enmservice talking to the simulator, which amounts to connecting to localhost:8091 (the enmservice logs confirm this), whereas in theory it should connect to [fd04::fe]:8091. The problem lies in the address the enmservice uses when it requests the export from the simulator, so changing the simulator's container startup config does not help.

Checking the simulator's export code, I found IPv6 simply was not considered when it was written, because after a job is created the jobUrl exported from the simulator is passed like this…

```js
const fqdn = req.headers && req.headers["host"] ? req.headers["host"].split(':')[0] : null;
JOBSDATA = new JOBS(moJobId, fqdn);
```

That is why URLs like http://[fd04:8091/bulk-con... show up in the error message:
2
3
4
5
6
7
8
9
10"message": "Error: connect ECONNREFUSED ::1:8091",
"options": {
"ca": [
""
],
"jar": {},
"url": "http://[fd04:8091/bulk-configuration/v1/import-jobs/jobs/18974277/files",
"method": "POST"
...
}改了一下
```js
var fqdn = "";
if (req.headers && req.headers["host"]) {
    const headerHost = req.headers["host"];
    if (/\[/.test(headerHost)) {
        // bracketed IPv6 host, e.g. "[fd04::fe]:8091" -> keep the brackets
        let start = headerHost.indexOf('[');
        let end = headerHost.indexOf(']');
        fqdn = headerHost.slice(start, end + 1);
    } else {
        // plain hostname or IPv4, e.g. "10.0.0.1:8091" -> drop the port
        fqdn = headerHost.split(':')[0];
    }
} else {
    fqdn = null;
}
```