Pivotal Knowledge Base

Service failure after vSphere HA Failover

Environment

Pivotal Container Service (PKS) 1.0.0 on VMware vSphere with High Availability enabled

Symptom

Following a vSphere HA failover, a PKS cluster node may have services failing to start.

Error Message:

On a cluster's master node, monit shows failed services:

monit summary
The Monit daemon 5.2.5 uptime: 10m
Process 'flanneld' Execution failed
Process 'kube-apiserver' Does not exist
Process 'kube-controller-manager' running
Process 'kube-scheduler' running
Process 'etcd' Execution failed
Process 'etcd_consistency_checker' running
Process 'blackbox' running
Process 'bosh-dns' running
Process 'bosh-dns-healthcheck' running
System 'system_localhost' running

Flannel shows a DNS lookup failure in /var/vcap/sys/log/flanneld/flanneld_ctl.stderr.log:

Error: client: etcd cluster is unavailable or misconfigured
error #0: dial tcp: lookup master-0.etcd.cfcr.internal on 10.193.62.2:53: no such host
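
To confirm that this is a resolver-configuration problem rather than a missing record, compare a lookup through the node's default resolver against one sent directly to the local bosh-dns listener. This is a minimal sketch, assuming nslookup is present on the stemcell; 169.254.0.2 is the standard bosh-dns listen address.

# Uses the resolvers in /etc/resolv.conf; fails while only the external DNS is configured
nslookup master-0.etcd.cfcr.internal

# Queries the local bosh-dns resolver directly; succeeds when bosh-dns itself is healthy
nslookup master-0.etcd.cfcr.internal 169.254.0.2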

etcd will have no leader, as seen in /var/vcap/sys/log/etcd/etcd_consistency_checker.stdout.log:

2018/02/26 09:34:08 [INFO] the leader is []

and in /var/vcap/sys/log/etcd/etcd.stdout.log:

{"timestamp":"1519038072.378881931","source":"etcdfab","message":"etcdfab.cluster.get-initial-cluster-state.member-list.failed","log_level":2,"data":{"error":"client: etcd cluster is unavailable or misconfigured"}}

The Kubernetes controller manager and scheduler show as running, but they are streaming the following errors:

/var/vcap/sys/log/kube-controller-manager/kube_controller_manager.stderr.log

E0226 09:38:30.063569 2652 leaderelection.go:224] error retrieving resource lock kube-system/kube-controller-manager: Get https://master.cfcr.internal:8443/api/v1/namespaces/kube-system/endpoints/kube-controller-manager: dial tcp: lookup master.cfcr.internal on 10.193.62.2:53: no such host 

/var/vcap/sys/log/kube-scheduler/kube_scheduler.stderr.log

E0226 09:39:47.524453 2674 reflector.go:205] k8s.io/kubernetes/plugin/cmd/kube-scheduler/app/server.go:590: Failed to list *v1.Pod: Get https://master.cfcr.internal:8443/api/v1/pods?fieldSelector=spec.schedulerName%3Ddefault-scheduler%2Cstatus.phase%21%3DFailed%2Cstatus.phase%21%3DSucceeded&limit=500&resourceVersion=0: dial tcp: lookup master.cfcr.internal on 10.193.62.2:53: no such host
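
Both errors are the same DNS failure resolving master.cfcr.internal rather than a problem with the API server itself. Once name resolution works again, a simple reachability probe through the same name can confirm it. This is a sketch; depending on the API server's anonymous-auth settings it may return ok or an authorization error, either of which shows the name resolves and the port is reachable.

# -k skips certificate verification, which is acceptable for a reachability check
curl -k https://master.cfcr.internal:8443/healthz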

On a cluster worker node, monit shows failed services:

monit summary
The Monit daemon 5.2.5 uptime: 4m
Process 'flanneld' Execution failed
Process 'docker' initializing
Process 'kubelet' not monitored
Process 'kube-proxy' running
Process 'blackbox' running
Process 'bosh-dns' running
Process 'bosh-dns-healthcheck' running
System 'system_localhost' running

Flannel shows the same DNS lookup failure in /var/vcap/sys/log/flanneld/flanneld_ctl.stderr.log:

Error: client: etcd cluster is unavailable or misconfigured
error #0: dial tcp: lookup master-0.etcd.cfcr.internal on 10.193.62.2:53: no such host

Even though kube-proxy appears to be running, it is actually waiting on DNS, as seen in /var/vcap/sys/log/kube-proxy/kube_proxy.stderr.log:

E0226 09:08:05.470159 2746 reflector.go:205] k8s.io/kubernetes/pkg/client/informers/informers_generated/internalversion/factory.go:85: Failed to list *core.Endpoints: Get https://master.cfcr.internal:8443/api/v1/endpoints?limit=500&resourceVersion=0: dial tcp: lookup master.cfcr.internal on 10.193.62.2:53: no such host
E0226 09:08:05.470247 2746 event.go:209] Unable to write event: 'Post https://master.cfcr.internal:8443/api/v1/namespaces/default/events: dial tcp: lookup master.cfcr.internal on 10.193.62.2:53: no such host' (may retry after sleeping) 

Resolution

Check /etc/resolv.conf to see if it contains only the external DNS server and is missing the local bosh-dns entry:

cat /etc/resolv.conf
nameserver 10.1.2.3
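
A healthy node lists the bosh-dns listener 169.254.0.2 as its first nameserver. A quick way to spot an affected node is the check below; it is a minimal sketch, and the external nameserver address will differ per environment.

# Reports whether the local bosh-dns entry is present in the resolver configuration
grep -q '^nameserver 169.254.0.2' /etc/resolv.conf && echo "bosh-dns entry present" || echo "bosh-dns entry missing"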

The workaround is to run the bosh-dns-enable pre-start script manually on the impacted nodes.

/var/vcap/jobs/bosh-dns-enable/bin/pre-start

cat /etc/resolv.conf

nameserver 169.254.0.2
nameserver 10.1.2.3  

After that, monit will restart the services on that node.
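
To watch the recovery, monit can be polled until every process reports running again. A minimal sketch, assuming watch is installed on the stemcell; otherwise re-run monit summary by hand.

# Poll every five seconds until all processes show 'running'
watch -n 5 monit summary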

This is a known issue, and a fix will be released in an upcoming version.

 

Comments

  • Sangdon Shin

    I have a few additions to the article -

    - This same symptom can also happen when you just restart (or stop/start) VMs inside the cluster
    - An easy way to implement the solution to a VM would be :
    $ bosh -e <bosh_env> -d <deployment> ssh <vm_instance> -c 'sudo /var/vcap/jobs/bosh-dns-enable/bin/pre-start'

  • Sangdon Shin

    With PKS 1.0.2 this symptom has mostly been resolved; however, the Harbor VM can still hit the same issue, so this article remains valid in that case.
