Debugging a Kubernetes Cluster, Part 2
Debugging a Kubernetes cluster requires a solid understanding of its components and how they depend on one another. This Part 2 guide covers debugging techniques for the most common cluster issues, grouped by component:
1. Node Issues
A. Node Not Ready
- Check Node Status:
    kubectl get nodes
    kubectl describe node <node-name>
- Inspect Kubelet Logs:
  SSH into the node, confirm the kubelet is running, and review its logs for errors:
    systemctl status kubelet
    journalctl -u kubelet -l
- Possible Causes:
  - Resource exhaustion (e.g., CPU, memory, disk).
  - Misconfigured networking (e.g., the kubelet cannot reach the API server).
  - Problems with the container runtime (e.g., containerd, Docker).
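To spot NotReady nodes from a script, the bash sketch below reads each node's Ready condition directly instead of parsing the human-readable STATUS column. It assumes kubectl is already configured for the cluster; the function name not_ready_nodes is hypothetical.

```bash
# Print the names of nodes whose Ready condition is not "True"
# (i.e., "False" or "Unknown"). Uses kubectl's JSONPath filter
# to pull the Ready condition for every node, one node per line.
not_ready_nodes() {
  kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}' \
    | awk -F'\t' '$2 != "True" { print $1 }'
}
```

A node whose Ready status is Unknown has stopped reporting to the control plane altogether, which usually points at the kubelet or node networking rather than resource exhaustion.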
B. Node Disk Pressure or Memory Pressure
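- Check Allocations and Node Conditions:
    kubectl describe node <node-name> | grep -A 10 "Allocated resources"
    kubectl top node <node-name>    # requires metrics-server
  The Conditions section of kubectl describe node shows whether MemoryPressure, DiskPressure, or PIDPressure is True.
- Clean Up Disk Space:
  Remove unused images and old logs on the node:
    crictl rmi --prune               # containerd or CRI-O nodes
    docker system prune              # Docker-based nodes
    journalctl --vacuum-size=500M    # shrink the systemd journal (example size)
- Reconfigure Resource Limits:
  Adjust pod resource requests and limits (including ephemeral-storage) so workloads fit within node capacity.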
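Node pressure can also be checked from a script. This minimal sketch (hypothetical function name pressure_conditions, assuming kubectl access) prints whichever of MemoryPressure, DiskPressure, or PIDPressure is currently True on one node:

```bash
# Print every *Pressure condition that is currently "True" on a node.
# Each condition is emitted as "type<TAB>status" by JSONPath, then
# awk keeps only types ending in "Pressure" with status "True".
pressure_conditions() {
  local node="$1"
  kubectl get node "$node" -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\n"}{end}' \
    | awk -F'\t' '$1 ~ /Pressure$/ && $2 == "True" { print $1 }'
}
```

Run it as pressure_conditions <node-name>; an empty result means the node reports no pressure.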
2. Pod Issues
A. Pod Stuck in Pending
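- Inspect Events:
    kubectl describe pod <pod-name>
- Possible Causes:
  - Insufficient resources: compare node allocatable capacity with the pod's requests.
  - Scheduling constraints: inspect nodeSelector, affinity rules, taints, and tolerations.
  - Unbound PersistentVolumeClaims: the pod waits until its PVCs are bound (see section 5).
  - Networking issues: if the CNI plugin is not running, nodes stay NotReady and pods cannot be scheduled onto them.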
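The scheduler records why it could not place a pod in the pod's PodScheduled condition. The sketch below (hypothetical function names pending_pods and scheduling_message, assuming kubectl access) lists every Pending pod and then prints that explanation for one of them:

```bash
# List Pending pods in all namespaces as namespace/name.
# With -A and --no-headers, column 1 is NAMESPACE and column 2 is NAME.
pending_pods() {
  kubectl get pods -A --field-selector=status.phase=Pending --no-headers \
    | awk '{ print $1 "/" $2 }'
}

# Print the scheduler's message for one pod, read from its
# PodScheduled condition (e.g., "0/3 nodes are available: ...").
scheduling_message() {
  local ns="$1" pod="$2"
  kubectl get pod "$pod" -n "$ns" \
    -o jsonpath='{.status.conditions[?(@.type=="PodScheduled")].message}'
}
```

The message usually names the blocking constraint directly (insufficient CPU or memory, an untolerated taint, or a node affinity mismatch), which tells you which of the causes above to chase.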
B. CrashLoopBackOff
- View Logs (from the previous, crashed container):
    kubectl logs <pod-name> --previous
- Check Events:
    kubectl describe pod <pod-name>
  Look at Last State for the termination reason and exit code.
- Debugging Steps:
  - Ensure the container's entrypoint (command and args) is correct.
  - Verify environment variables and mounted volumes.
  - Test locally using the same image.
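The exit code in Last State often narrows the cause quickly. This sketch (hypothetical function names last_termination and explain_exit_code, assuming kubectl access) prints each container's previous termination and maps a few common exit codes to likely causes, using the shell convention that codes above 128 mean the process was killed by signal (code - 128):

```bash
# For each container in a pod, print: name, restart count, and the
# reason and exit code of its previous termination (tab-separated).
last_termination() {
  local ns="$1" pod="$2"
  kubectl get pod "$pod" -n "$ns" -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.restartCount}{"\t"}{.lastState.terminated.reason}{"\t"}{.lastState.terminated.exitCode}{"\n"}{end}'
}

# Rough interpretation of common exit codes (a heuristic, not a diagnosis).
explain_exit_code() {
  case "$1" in
    0)   echo "exited successfully; the container's command may end immediately" ;;
    1)   echo "generic application error; check the logs" ;;
    126) echo "command found but not executable; check permissions" ;;
    127) echo "command not found; check the entrypoint/command" ;;
    137) echo "SIGKILL (128+9); often OOMKilled, check memory limits" ;;
    143) echo "SIGTERM (128+15); the container was asked to stop" ;;
    *)   echo "no canned explanation for exit code $1" ;;
  esac
}
```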
C. Container Image Pull Issues
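- Inspect Events:
    kubectl describe pod <pod-name>
- Common Errors (the pod status shows ErrImagePull or ImagePullBackOff):
  - Unauthorized: verify the image pull secret and that the pod (or its ServiceAccount) references it.
  - Image not found: confirm the image name and tag exist in the registry.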
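To find every affected pod in a namespace at once, the sketch below (hypothetical function name image_pull_failures, assuming kubectl access) prints pods with at least one container waiting on ErrImagePull or ImagePullBackOff:

```bash
# Print pods that have a container waiting with ErrImagePull or
# ImagePullBackOff. JSONPath emits "name<TAB>reason1 reason2 ..."
# (one reason per container that is waiting), and awk scans the reasons.
image_pull_failures() {
  local ns="$1"
  kubectl get pods -n "$ns" -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].state.waiting.reason}{"\n"}{end}' \
    | awk '{ for (i = 2; i <= NF; i++) if ($i == "ErrImagePull" || $i == "ImagePullBackOff") { print $1; break } }'
}
```

For an Unauthorized error, a pull secret can be created with kubectl create secret docker-registry <secret-name> --docker-server=<registry> --docker-username=<user> --docker-password=<password> -n <namespace> and then listed under imagePullSecrets in the pod spec or its ServiceAccount.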
3. Networking Issues
A. Pods Can't Communicate
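- Ping Other Pods (the source image must include ping):
    kubectl exec -it <pod-name> -- ping <pod-ip>
- Check Network Policies:
    kubectl get networkpolicy -n <namespace>
- Debug CNI Plugins:
  - Inspect the CNI logs on the node:
      cat /var/log/containers/<cni-plugin-name>*.log
  - Or, if the CNI runs as pods, read the logs through the API:
      kubectl logs -n <cni-namespace> <cni-pod-name>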
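The ping check needs the target pod's IP first. This minimal sketch (hypothetical function name ping_pod, assuming both pods are in the same namespace and the source image ships ping) looks up the IP and pings it from inside the source pod:

```bash
# Ping <dst-pod> from inside <src-pod>, both in <namespace>.
ping_pod() {
  local ns="$1" src="$2" dst="$3" dst_ip
  # Declare first, assign separately, so a kubectl failure is not masked.
  dst_ip="$(kubectl get pod "$dst" -n "$ns" -o jsonpath='{.status.podIP}')" || return 1
  if [ -z "$dst_ip" ]; then
    echo "pod $dst has no IP yet" >&2
    return 1
  fi
  kubectl exec -n "$ns" "$src" -- ping -c 3 "$dst_ip"
}
```

If the ping fails while both pods are Running, list the NetworkPolicies in the namespace next; a policy that selects either pod can block traffic that is otherwise routable.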
B. Service Not Accessible
- Check Service Description:
    kubectl describe svc <service-name>
- Inspect Endpoints:
    kubectl get endpoints <service-name>
  An empty endpoints list usually means the Service selector does not match the labels of any ready pod.
- Test Connectivity:
  - From within a pod:
      curl http://<service-name>.<namespace>:<port>
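The endpoints check is easy to script. This sketch (hypothetical function name service_has_endpoints, assuming kubectl access) succeeds only when the Service has at least one ready endpoint address:

```bash
# Exit 0 and print the ready endpoint IPs if the Service has any;
# exit 1 if there are none; exit 2 if kubectl itself fails.
service_has_endpoints() {
  local ns="$1" svc="$2" ips
  ips="$(kubectl get endpoints "$svc" -n "$ns" -o jsonpath='{.subsets[*].addresses[*].ip}')" || return 2
  if [ -z "$ips" ]; then
    echo "no ready endpoints for $ns/$svc; compare the Service selector with the pod labels"
    return 1
  fi
  echo "ready endpoints: $ips"
}
```

Only ready pods appear under addresses, so a pod that is Running but failing its readiness probe also leaves the Service without endpoints.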
4. API Server Issues
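- Inspect Logs:
  - If the API server runs as a systemd service:
      journalctl -u kube-apiserver
  - If it runs as a static pod (e.g., in kubeadm clusters):
      kubectl logs -n kube-system kube-apiserver-<node-name>
- Test API Server Availability (/healthz is deprecated in favor of /livez and /readyz):
    kubectl get --raw='/readyz?verbose'
- Common Causes:
  - SSL/TLS issues: check certificates (including expiry) and the CA bundle; on kubeadm clusters, kubeadm certs check-expiration lists expiry dates.
  - Resource bottlenecks: monitor CPU/memory usage.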
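When kubectl cannot reach the API server at all, its logs have to be read on the control-plane node through the container runtime. The sketch below (hypothetical function names apiserver_ready and apiserver_logs) checks /readyz and, as a fallback, uses crictl; it assumes crictl is installed and run as root on a control-plane node, and that your crictl version supports the --name, --latest, and --tail flags (check crictl ps --help).

```bash
# Return 0 if the API server's /readyz endpoint answers "ok".
apiserver_ready() {
  local out
  out="$(kubectl get --raw='/readyz' 2>&1)"
  if [ "$out" = "ok" ]; then
    echo "API server is ready"
    return 0
  fi
  echo "API server not ready: $out" >&2
  return 1
}

# Print the last N (default 50) log lines of the most recently created
# kube-apiserver container, found through crictl instead of kubectl.
apiserver_logs() {
  local id
  id="$(crictl ps -a --name kube-apiserver --latest -q)"
  if [ -z "$id" ]; then
    echo "no kube-apiserver container found" >&2
    return 1
  fi
  crictl logs --tail "${1:-50}" "$id"
}
```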
5. Persistent Volume Issues
A. PVC Pending
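- Inspect Events:
    kubectl describe pvc <pvc-name>
- Common Causes:
  - No matching StorageClass, or no default StorageClass when the PVC does not name one.
  - No available PersistentVolume satisfies the requested size or access mode, or the dynamic provisioner failed.
  - The StorageClass uses volumeBindingMode: WaitForFirstConsumer, so the PVC stays Pending until a pod that uses it is scheduled (this is expected behavior).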
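The first and third causes can be told apart by looking at the PVC's StorageClass. This sketch (hypothetical function name check_pvc_class, assuming kubectl access) reports whether the class is missing, exists, or deliberately delays binding:

```bash
# Inspect the StorageClass a PVC refers to.
# Exit 2 if the PVC cannot be read, 1 if there is a class problem.
check_pvc_class() {
  local ns="$1" pvc="$2" sc mode
  sc="$(kubectl get pvc "$pvc" -n "$ns" -o jsonpath='{.spec.storageClassName}')" || return 2
  if [ -z "$sc" ]; then
    echo "PVC has no storageClassName; it can only bind to a pre-created PV without a class"
    return 1
  fi
  if ! mode="$(kubectl get storageclass "$sc" -o jsonpath='{.volumeBindingMode}' 2>/dev/null)"; then
    echo "StorageClass $sc not found"
    return 1
  fi
  if [ "$mode" = "WaitForFirstConsumer" ]; then
    echo "StorageClass $sc uses WaitForFirstConsumer; the PVC stays Pending until a pod uses it"
  else
    # volumeBindingMode defaults to Immediate when unset.
    echo "StorageClass $sc exists (binding mode: ${mode:-Immediate})"
  fi
}
```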
B. PV Bound But Pod Can't Mount
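- Inspect Events (a container whose volume fails to mount never starts, so kubectl logs has nothing to show):
    kubectl describe pod <pod-name>
  Look for FailedMount or FailedAttachVolume events, then check the kubelet logs on the node.
- Debugging Steps:
  - Verify volume permissions (e.g., fsGroup or the file ownership the container expects).
  - Test mounting the volume manually on a node.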
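To pull only the mount-related events for one pod, this sketch (hypothetical function name mount_failures, assuming kubectl access) filters the pod's events by reason:

```bash
# Print FailedMount / FailedAttachVolume events recorded for one pod.
# custom-columns puts the reason in column 1 so awk can filter on it.
mount_failures() {
  local ns="$1" pod="$2"
  kubectl get events -n "$ns" \
    --field-selector "involvedObject.name=$pod" \
    -o custom-columns=REASON:.reason,MESSAGE:.message --no-headers \
    | awk '$1 == "FailedMount" || $1 == "FailedAttachVolume"'
}
```

Events expire after a while (one hour by default), so run this soon after the failure or fall back to the kubelet logs on the node.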
6. Cluster DNS Issues
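- Test DNS Resolution:
    kubectl exec -it <pod-name> -- nslookup <service-name>
- Inspect CoreDNS Logs:
    kubectl logs -n kube-system <coredns-pod-name>
  or all CoreDNS pods at once (they usually carry the label k8s-app=kube-dns):
    kubectl logs -n kube-system -l k8s-app=kube-dns
- Common Fixes:
  - Restart CoreDNS pods if they are unresponsive:
      kubectl rollout restart deployment/coredns -n kube-system
  - Validate the CoreDNS ConfigMap:
      kubectl get cm -n kube-system coredns -o yaml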
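Resolving a fully qualified name removes search-path ambiguity from the test. This sketch (hypothetical function names svc_fqdn, resolve_in_pod, and unhealthy_coredns_pods) assumes the default cluster domain cluster.local and the k8s-app=kube-dns label on CoreDNS pods; adjust both if your cluster differs.

```bash
# Build a Service FQDN: <svc>.<namespace>.svc.<cluster-domain>.
# The cluster domain defaults to "cluster.local" (an assumption).
svc_fqdn() {
  echo "$1.$2.svc.${3:-cluster.local}"
}

# Resolve <svc> in <svc-ns> from inside <pod> in <ns>.
resolve_in_pod() {
  local ns="$1" pod="$2" svc="$3" svc_ns="$4"
  kubectl exec -n "$ns" "$pod" -- nslookup "$(svc_fqdn "$svc" "$svc_ns")"
}

# Print CoreDNS pods whose STATUS column (column 3) is not Running.
unhealthy_coredns_pods() {
  kubectl get pods -n kube-system -l k8s-app=kube-dns --no-headers \
    | awk '$3 != "Running" { print $1 }'
}
```

If the FQDN resolves but the short name does not, compare the pod's /etc/resolv.conf search domains with the namespace you are querying.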
7. Troubleshooting Tools
A. kubectl Debugging Tools
- Debug running pods:
    kubectl exec -it <pod-name> -- /bin/sh
- Debug containers with ephemeral containers (kubectl debug; beta in v1.23, stable in v1.25):
    kubectl debug -it <pod-name> --image=busybox --target=<container-name>
B. Third-Party Tools
- Lens: GUI for Kubernetes cluster monitoring.
- K9s: Terminal-based cluster management.
- kubectl-trace: System-level tracing for Kubernetes.
C. Logs Aggregation
- Use tools like Fluentd, ELK Stack, or Loki for centralized logging.
8. Proactive Cluster Monitoring
- Implement monitoring systems like Prometheus, Grafana, or Datadog.
- Set up alerting for critical metrics (e.g., node health, pod restarts).
Example: Debugging Workflow for a Non-Responsive Service
- Check Pod Status:
    kubectl get pods -n <namespace>
- Describe the Service:
    kubectl describe svc <service-name> -n <namespace>
- Inspect Logs:
    kubectl logs <pod-name> -n <namespace>
- Test Connectivity:
  - From a pod inside the cluster:
      curl http://<service-name>.<namespace>:<port>
  - From outside the cluster:
      curl http://<external-ip>:<port>
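The first and third steps of this workflow can be tied together by finding the pods the Service actually selects. The sketch below (hypothetical function names service_selector and service_pods, assuming kubectl access) turns the Service's spec.selector into a label selector with a go-template and lists the matching pods:

```bash
# Print a Service's selector as "k1=v1,k2=v2" (keys sorted by the template).
service_selector() {
  local ns="$1" svc="$2" sel
  sel="$(kubectl get svc "$svc" -n "$ns" -o go-template='{{range $k, $v := .spec.selector}}{{$k}}={{$v}},{{end}}')" || return 1
  echo "${sel%,}"   # drop the trailing comma
}

# List the pods selected by a Service, with node and IP columns.
service_pods() {
  local ns="$1" svc="$2" sel
  sel="$(service_selector "$ns" "$svc")" || return 1
  if [ -z "$sel" ]; then
    echo "service $ns/$svc has no selector" >&2
    return 1
  fi
  kubectl get pods -n "$ns" -l "$sel" -o wide
}
```

The same selector works for logs across all backing pods: kubectl logs -n <namespace> -l "$(service_selector <namespace> <service-name>)" --tail=50. If service_pods finds nothing, the Service has no backends, which explains the outage before any network test.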
Working through these checks component by component (nodes, pods, networking, the API server, storage, and DNS) narrows most cluster problems down to a specific cause. Let me know if you'd like specific scenarios or additional examples!