Hi community and K8s experts,
I installed a clean K8s cluster on virtual machines (Debian 10). After the installation and the integration into my landscape, the first thing I fixed was the CoreDNS resolution. I then ran further tests and found the following. The test setup consists of an nslookup of google.com and an nslookup of a local pod's cluster-internal DNS name.
Basic setup:
- K8s version: 1.19.0
- K8s setup: 1 master + 2 worker nodes
- Based on: Debian 10 VMs
- CNI: Flannel
Status of CoreDNS Pods
kube-system   coredns-xxxx   1/1   Running   1   26h
kube-system   coredns-yyyy   1/1   Running   1   26h
CoreDNS Log:
.:53
[INFO] plugin/reload: Running configuration MD5 = 4e235fcc3696966e76816bcd9034ebc7
CoreDNS-1.6.7
CoreDNS config:
apiVersion: v1
data:
  Corefile: |
    .:53 {
        errors
        health {
           lameduck 5s
        }
        ready
        kubernetes cluster.local in-addr.arpa ip6.arpa {
           pods insecure
           fallthrough in-addr.arpa ip6.arpa
           ttl 30
        }
        prometheus :9153
        forward . /etc/resolv.conf
        cache 30
        loop
        reload
        loadbalance
    }
kind: ConfigMap
metadata:
  creationTimestamp: ""
  name: coredns
  namespace: kube-system
  resourceVersion: "219"
  selfLink: /api/v1/namespaces/kube-system/configmaps/coredns
  uid: xxx
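To see whether queries reach CoreDNS at all, one could temporarily add the log plugin to the Corefile above; the already enabled reload plugin picks the change up without restarting the pods (a debugging sketch, not part of my original setup):

kubectl -n kube-system edit configmap coredns
# add a line reading "log" inside the .:53 { ... } block
kubectl -n kube-system logs -l k8s-app=kube-dns -f
# with "log" enabled, every query and its response code is printed here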
CoreDNS Service
kubectl -n kube-system get svc -o wide
NAME       TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)                  AGE   SELECTOR
kube-dns   ClusterIP   10.96.0.10   <none>        53/UDP,53/TCP,9153/TCP   15d   k8s-app=kube-dns
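The Service itself looks fine; to rule out a selector problem, the kube-dns endpoints should list both CoreDNS pod IPs (a check sketched here, not part of my original tests):

kubectl -n kube-system get endpoints kube-dns
# an empty ENDPOINTS column would mean the Service matches no ready CoreDNS pods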
Kubelet config yaml
apiVersion: kubelet.config.k8s.io/v1beta1
authentication:
  anonymous:
    enabled: false
  webhook:
    cacheTTL: 0s
    enabled: true
  x509:
    clientCAFile: /etc/kubernetes/pki/ca.crt
authorization:
  mode: Webhook
  webhook:
    cacheAuthorizedTTL: 0s
    cacheUnauthorizedTTL: 0s
clusterDNS:
- 10.96.0.10
clusterDomain: cluster.local
cpuManagerReconcilePeriod: 0s
evictionPressureTransitionPeriod: 0s
fileCheckFrequency: 0s
healthzBindAddress: 127.0.0.1
healthzPort: 10248
httpCheckFrequency: 0s
imageMinimumGCAge: 0s
kind: KubeletConfiguration
nodeStatusReportFrequency: 0s
nodeStatusUpdateFrequency: 0s
rotateCertificates: true
runtimeRequestTimeout: 0s
staticPodPath: /etc/kubernetes/manifests
streamingConnectionIdleTimeout: 0s
syncFrequency: 0s
volumeStatsAggPeriod: 0s
Output of the pod's resolv.conf
/ # cat /etc/resolv.conf
nameserver 10.96.0.10
search development.svc.cluster.local svc.cluster.local cluster.local invalid
options ndots:5
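Note that with ndots:5, a name with fewer than five dots such as google.com is first expanded through every search suffix before being tried as an absolute name, so a single lookup fires up to five queries:

google.com.development.svc.cluster.local
google.com.svc.cluster.local
google.com.cluster.local
google.com.invalid
google.com

If a resolver replica is unreachable, each of these expansions can run into its own timeout, which would explain why even successful answers take very long.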
Output of the host's resolv.conf
cat /etc/resolv.conf
# Dynamic resolv.conf(5) file for glibc resolver(3) generated by resolvconf(8)
# DO NOT EDIT THIS FILE BY HAND -- YOUR CHANGES WILL BE OVERWRITTEN
nameserver 213.136.95.11
nameserver 213.136.95.10
search invalid
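Because the Corefile forwards external names to these resolvers (forward . /etc/resolv.conf), it is worth ruling them out by querying one directly from the node (assuming nslookup or dig is installed there):

nslookup google.com 213.136.95.11
# if this answers quickly, the upstream is fine and the delay is inside the cluster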
Output of the host's /run/flannel/subnet.env
cat /run/flannel/subnet.env
FLANNEL_NETWORK=10.244.0.0/16
FLANNEL_SUBNET=10.244.0.1/24
FLANNEL_MTU=1450
FLANNEL_IPMASQ=true
Test setup
kubectl exec -i -t busybox -n development -- nslookup google.com
kubectl exec -i -t busybox -n development -- nslookup development.default
Busybox v1.28 image
- The google.com nslookup works, but the answer takes very long.
- The local pod DNS lookup fails, and the answer also takes very long.
Test setup
kubectl exec -i -t dnsutils -- nslookup google.com
kubectl exec -i -t busybox -n development -- nslookup development.default
K8s dnsutils test image
- The google.com nslookup works only sporadically. It feels like the address is sometimes served from a cache and sometimes the lookup fails.
- The local pod DNS lookup also works only sporadically, with the same cache-like behavior.
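This on/off pattern is what one would expect if only one of the two CoreDNS replicas is reachable, since the kube-dns Service round-robins between them. One way to check is to query each replica directly (the pod IPs below are placeholders; take the real ones from the first command):

kubectl -n kube-system get pods -l k8s-app=kube-dns -o wide
kubectl exec -i -t dnsutils -- nslookup google.com 10.244.0.2
kubectl exec -i -t dnsutils -- nslookup google.com 10.244.1.2
# if one pod answers and the other times out, inter-node pod traffic is broken
# rather than CoreDNS itself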
Test setup
kubectl exec -i -t dnsutilsalpine -n development -- nslookup google.com
kubectl exec -i -t dnsutilsalpine -n development -- nslookup development.default
Alpine image v3.12
- The google.com nslookup works only sporadically, again with the same cache-like behavior.
- The local pod DNS lookup fails.
The logs are empty. Does anyone have an idea where the problem is?
IP Routes master node
default via X.X.X.X dev eth0 onlink
10.244.0.0/24 dev cni0 proto kernel scope link src 10.244.0.1
10.244.1.0/24 via 10.244.1.0 dev flannel.1 onlink
10.244.2.0/24 via 10.244.2.0 dev flannel.1 onlink
X.X.X.X via X.X.X.X dev eth0
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1 linkdown
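The routes to the other nodes' pod subnets go via the flannel.1 device, i.e. Flannel's VXLAN backend, which tunnels pod traffic between the nodes in UDP packets on port 8472. On a firewalled host it is worth checking that this port is open between the nodes; a minimal iptables sketch (the node subnet is a placeholder):

iptables -A INPUT -p udp --dport 8472 -s <node-subnet> -j ACCEPT
# if the firewall drops this traffic, a pod on one node can never reach a
# CoreDNS pod on another node, which would match the timeouts above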
UPDATE
I reinstalled the cluster, this time with Calico as the CNI, and I see the same problem.
UPDATE 2
After a detailed error analysis under Calico, I found that the affected pods were not working properly. Digging deeper, I discovered that I had not opened port 179 in the firewall. After fixing this, the pods work properly, and name resolution works as well.
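For reference, Calico's node-to-node mesh exchanges routes over BGP, which runs on TCP 179; a minimal sketch of the kind of rule that was missing (assuming iptables, with the node subnet as a placeholder):

iptables -A INPUT -p tcp --dport 179 -s <node-subnet> -j ACCEPT
# with the port closed, the nodes never learn the routes to each other's pod
# CIDRs, so cross-node pod traffic, including DNS, fails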
Answer
Unable to post that much via comments, so posting as an answer.
I checked the guide you’ve been referring to and set up my own test cluster (GCP, 3 x Debian 10 VMs).
The difference is that my ~/kube-cluster/master.yml points to a different kube-flannel.yml (and the content of that file differs from the file in the guide :)).
$ grep http master.yml
shell: kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml >> pod_network_setup.txt
On my cluster:
$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
instance-1 Ready master 2m48s v1.19.0
instance-2 Ready <none> 38s v1.19.0
instance-3 Ready <none> 38s v1.19.0
$ kubectl get pods -o wide -n kube-system
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
coredns-f9fd979d6-8sxg7 1/1 Running 0 4m48s 10.244.0.2 instance-1 <none> <none>
coredns-f9fd979d6-z5gdl 1/1 Running 0 4m48s 10.244.0.3 instance-1 <none> <none>
kube-flannel-ds-4khll 1/1 Running 0 2m58s 10.156.0.21 instance-3 <none> <none>
kube-flannel-ds-h8d9l 1/1 Running 0 2m58s 10.156.0.20 instance-2 <none> <none>
kube-flannel-ds-zhzbf 1/1 Running 0 4m49s 10.156.0.19 instance-1 <none> <none>
$ kubectl -n kube-system get svc -o wide
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE SELECTOR
kube-dns ClusterIP 10.96.0.10 <none> 53/UDP,53/TCP,9153/TCP 6m15s k8s-app=kube-dns
sammy@instance-1:~$ ip route
default via 10.156.0.1 dev ens4
10.156.0.1 dev ens4 scope link
10.244.0.0/24 dev cni0 proto kernel scope link src 10.244.0.1
10.244.1.0/24 via 10.244.1.0 dev flannel.1 onlink
10.244.2.0/24 via 10.244.2.0 dev flannel.1 onlink
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1 linkdown
I see no DNS lag issues.
kubectl create deployment busybox --image=nkolchenko/enea:server_go_latest
deployment.apps/busybox created
sammy@instance-1:~$ time kubectl exec -it busybox-6f744547bf-hkxnk -- nslookup default.default
Server: 10.96.0.10
Address: 10.96.0.10:53
** server can't find default.default: NXDOMAIN
** server can't find default.default: NXDOMAIN
command terminated with exit code 1
real 0m0.227s
user 0m0.106s
sys 0m0.012s
sammy@instance-1:~$ time kubectl exec -it busybox-6f744547bf-hkxnk -- nslookup google.com
Server: 10.96.0.10
Address: 10.96.0.10:53
Non-authoritative answer:
Name: google.com
Address: 172.217.22.78
Non-authoritative answer:
Name: google.com
Address: 2a00:1450:4001:820::200e
real 0m0.223s
user 0m0.102s
sys 0m0.012s
Let me know if you need me to run any other tests; I’ll keep this cluster up throughout the weekend and then tear it down.
UPDATE:
$ cat ololo
apiVersion: v1
kind: Pod
metadata:
  name: dnsutils
  namespace: default
spec:
  containers:
  - name: dnsutils
    image: gcr.io/kubernetes-e2e-test-images/dnsutils:1.3
    command:
      - sleep
      - "3600"
    imagePullPolicy: IfNotPresent
  restartPolicy: Always
$ kubectl create -f ololo
pod/dnsutils created
$ kubectl get -A all -o wide | grep dns
default pod/dnsutils 1/1 Running 0 63s 10.244.2.8 instance-2 <none> <none>
kube-system pod/coredns-cc8845745-jtvlh 1/1 Running 0 10m 10.244.1.3 instance-3 <none> <none>
kube-system pod/coredns-cc8845745-xxh28 1/1 Running 0 10m 10.244.0.4 instance-1 <none> <none>
kube-system pod/coredns-cc8845745-zlv84 1/1 Running 0 10m 10.244.2.6 instance-2 <none> <none>
instance-1:~$ kubectl exec -i -t dnsutils -- time nslookup google.com
Server: 10.96.0.10
Address: 10.96.0.10#53
Name: google.com
Address: 172.217.21.206
Name: google.com
Address: 2a00:1450:4001:818::200e
real 0m 0.01s
user 0m 0.00s
sys 0m 0.00s
Attribution
Source: Link, Question Author: ZPascal, Answer Author: Nick