Debugging a Kubernetes Cluster, Part 2

Debugging a Kubernetes cluster requires understanding its components and how they depend on each other. This second part covers advanced debugging techniques for common cluster issues.


1. Node Issues

A. Node Not Ready

  • Check Node Status:

      kubectl get nodes
      kubectl describe node <node-name>

  • Inspect Kubelet Logs:
    SSH into the node and review the logs for errors:

      journalctl -u kubelet -l

  • Possible Causes:
    • Resource exhaustion (e.g., CPU, memory, disk).
    • Misconfigured networking (e.g., unable to reach the API server).
    • Issues with container runtime (Docker, containerd).
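
On a larger cluster it can help to list only the nodes that are not healthy. The following is a minimal sketch, assuming the default column layout of `kubectl get nodes --no-headers` (name first, status second). `not_ready_nodes` is a hypothetical helper name, not part of kubectl.

```shell
# Print the names of nodes whose STATUS column does not start with "Ready".
# Reads `kubectl get nodes --no-headers` output on stdin.
# A cordoned node ("Ready,SchedulingDisabled") still counts as ready here.
not_ready_nodes() {
  awk '$2 !~ /^Ready/ { print $1 }'
}

# Usage (requires cluster access):
#   kubectl get nodes --no-headers | not_ready_nodes
```

Each name it prints can then be passed to kubectl describe node for details.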

B. Node Disk Pressure or Memory Pressure

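  • Check Allocations:
    Grepping for "Allocated" alone prints only the section header, so include the lines that follow it:

      kubectl describe node <node-name> | grep -A 8 "Allocated resources"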

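  • Clean Up Disk Space:
    Remove unused images and stopped containers. On nodes that use Docker:

      docker system prune

    On nodes that use containerd or another CRI runtime, crictl can remove unused images:

      crictl rmi --prune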

  • Reconfigure Resource Limits:
    Adjust resource requests and limits for pods.
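
Pressure conditions can also be read directly from the node object instead of scanning the describe output. This is a minimal sketch: `active_pressure` is a hypothetical helper, and it assumes the conditions have been printed one per line as `Type=Status` (the jsonpath expression in the usage comment produces that shape).

```shell
# Print node conditions of the form "<Something>Pressure=True".
# Reads "Type=Status" lines on stdin, one condition per line.
# Prints nothing (and still succeeds) when no pressure condition is active.
active_pressure() {
  grep -E '^[A-Za-z]+Pressure=True$' || true
}

# Usage (requires cluster access):
#   kubectl get node <node-name> \
#     -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}' \
#     | active_pressure
```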


2. Pod Issues

A. Pod Stuck in Pending

  • Inspect Events:

      kubectl describe pod <pod-name>

  • Possible Causes:
    • Insufficient resources: Check node capacity and pod requests.
    • Scheduling constraints: Inspect nodeSelector, taints, and tolerations.
    • Networking issues: Ensure the CNI plugin is functioning correctly.
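
The scheduler's FailedScheduling message usually points at one of the causes listed above. The sketch below maps a message to a cause. `classify_pending` is a hypothetical helper, and its patterns are assumptions based on common scheduler wording, which varies between Kubernetes versions.

```shell
# Map a FailedScheduling event message to a likely cause.
# Only the first matching pattern is reported, even if the message
# mentions several reasons.
classify_pending() {
  case "$1" in
    *Insufficient*)        echo "insufficient resources" ;;
    *taint*)               echo "taints and tolerations" ;;
    *affinity*|*selector*) echo "nodeSelector or affinity" ;;
    *)                     echo "unknown" ;;
  esac
}

# Usage (requires cluster access): pass it a message taken from
#   kubectl get events --field-selector involvedObject.name=<pod-name>,reason=FailedScheduling
```

Treat the result as a hint for where to look next, not as a diagnosis.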

B. CrashLoopBackOff

  • View Logs:

      kubectl logs <pod-name> --previous

  • Check Events:

      kubectl describe pod <pod-name>

  • Debugging Steps:
    • Ensure the container's entrypoint is correct.
    • Verify environment variables and mounted volumes.
    • Test locally using the same image.
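
The exit code of the last terminated container often narrows down which of these steps to try first. The following is a rough sketch based on common shell and signal conventions (128 plus the signal number). `explain_exit_code` is a hypothetical helper, and an application may assign its own meaning to any code.

```shell
# Translate a container exit code into a likely cause.
# These mappings follow common conventions and are hints only.
explain_exit_code() {
  case "$1" in
    0)   echo "exited cleanly; check whether the process is meant to keep running" ;;
    1)   echo "application error; read the previous logs" ;;
    126) echo "command found but not executable" ;;
    127) echo "command not found; check the entrypoint and PATH" ;;
    137) echo "killed by SIGKILL; often OOMKilled" ;;
    143) echo "terminated by SIGTERM" ;;
    *)   echo "no conventional meaning for exit code $1" ;;
  esac
}

# Usage (requires cluster access; assumes the first container is the one crashing):
#   code="$(kubectl get pod <pod-name> \
#     -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}')"
#   explain_exit_code "$code"
```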

C. Container Image Pull Issues

  • Inspect Events:

      kubectl describe pod <pod-name>

  • Common Errors:
    • Unauthorized: Verify image pull secrets.
    • Image not found: Confirm the image exists in the registry.
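
To find every pod affected by a pull failure in one pass, the pod list can be filtered by status. This is a minimal sketch assuming the default columns of `kubectl get pods --no-headers` (name, ready, status, restarts, age). `image_pull_failures` is a hypothetical helper.

```shell
# Print the names of pods whose STATUS ends in ErrImagePull or ImagePullBackOff
# (this also catches init-container states such as "Init:ErrImagePull").
# Reads `kubectl get pods --no-headers` output on stdin.
image_pull_failures() {
  awk '$3 ~ /(ErrImagePull|ImagePullBackOff)$/ { print $1 }'
}

# Usage (requires cluster access):
#   kubectl get pods -n <namespace> --no-headers | image_pull_failures
```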


3. Networking Issues

A. Pods Can't Communicate

  • Ping Other Pods:

      kubectl exec -it <pod-name> -- ping <pod-ip>

  • Check Network Policies:

      kubectl get networkpolicy -n <namespace>


  • Debugging CNI Plugins:
    • Inspect CNI logs on the node:

        cat /var/log/containers/<cni-plugin-name>*.log

B. Service Not Accessible

  • Check Service Description:

      kubectl describe svc <service-name>

  • Inspect Endpoints:

      kubectl get endpoints <service-name>

  • Test Connectivity:
    • From within a pod:

        curl http://<service-name>.<namespace>:<port>
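
An empty endpoints list usually means the Service selector matches no ready pods, which explains a connection failure before any network debugging is needed. The sketch below turns that check into an exit status. `has_endpoints` is a hypothetical helper, and it assumes the default `kubectl get endpoints --no-headers` columns, where an empty list is shown as `<none>`.

```shell
# Succeed (exit 0) if the service has at least one endpoint address,
# fail (exit 1) otherwise.
# Reads `kubectl get endpoints <service-name> --no-headers` output on stdin.
has_endpoints() {
  awk '$2 != "" && $2 != "<none>" { found = 1 } END { exit !found }'
}

# Usage (requires cluster access):
#   kubectl get endpoints <service-name> -n <namespace> --no-headers | has_endpoints \
#     || echo "no endpoints: compare the Service selector with the pod labels"
```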

4. API Server Issues

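  • Inspect Logs:
    If the API server runs as a systemd unit:

      journalctl -u kube-apiserver

    On kubeadm-based clusters the API server runs as a static pod, so read its logs with kubectl instead (or with crictl on the control-plane node if the API is unreachable):

      kubectl logs -n kube-system kube-apiserver-<node-name>

  • Test API Server Availability:

      kubectl get --raw /healthz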

  • Common Causes:
    • SSL/TLS issues: Check certificates and CA bundle.
    • Resource bottlenecks: Monitor CPU/memory usage.
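
Beyond the single ok/not-ok answer of /healthz, the API server's verbose health output lists individual checks, with passing ones prefixed `[+]` and failing ones `[-]`. The sketch below keeps only the failures. `failed_checks` is a hypothetical helper, and the curl call in the usage comment assumes the health endpoints are reachable without authentication, which depends on the cluster's configuration.

```shell
# Print only the failing checks from verbose health output.
# Reads the response body of /readyz?verbose (or /livez?verbose) on stdin.
# Prints nothing (and still succeeds) when every check passes.
failed_checks() {
  grep -F '[-]' || true
}

# Usage (host and port are placeholders):
#   curl -sk "https://<api-server-host>:<port>/readyz?verbose" | failed_checks
```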


5. Persistent Volume Issues

A. PVC Pending

  • Inspect Events:

      kubectl describe pvc <pvc-name>

  • Common Causes:
    • No matching StorageClass.
    • Insufficient storage on nodes.

B. PV Bound But Pod Can't Mount

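  • Inspect Events:
    A pod that cannot mount its volume usually never starts its containers, so kubectl logs may return nothing. Look for FailedMount or FailedAttachVolume events instead:

      kubectl describe pod <pod-name>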

  • Debugging Steps:
    • Verify volume permissions.
    • Test mounting the volume manually on a node.


6. Cluster DNS Issues

  • Test DNS Resolution:

      kubectl exec -it <pod-name> -- nslookup <service-name>

  • Inspect CoreDNS Logs:

      kubectl logs -n kube-system <coredns-pod-name>

  • Common Fixes:
    • Restart CoreDNS pods if unresponsive.
    • Validate the CoreDNS ConfigMap (kubectl get cm -n kube-system coredns -o yaml).
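
A short service name resolves only from within the same namespace, so testing with the fully qualified name helps separate a DNS outage from a search-path problem. This is a minimal sketch: `service_fqdn` is a hypothetical helper, and `cluster.local` is the default cluster domain, which an individual cluster may override.

```shell
# Build the fully qualified DNS name of a Service.
# $1 = service name, $2 = namespace, $3 = cluster domain (optional,
# assumed to be "cluster.local" when omitted).
service_fqdn() {
  printf '%s.%s.svc.%s\n' "$1" "$2" "${3:-cluster.local}"
}

# Usage (requires cluster access):
#   kubectl exec -it <pod-name> -- nslookup "$(service_fqdn <service-name> <namespace>)"
```

If the fully qualified name resolves but the short name does not, inspect /etc/resolv.conf inside the pod.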


7. Troubleshooting Tools

A. kubectl Debugging Tools

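  • Debug running pods:

      kubectl exec -it <pod-name> -- /bin/sh

  • Debug with ephemeral containers (kubectl debug is available since kubectl v1.20; in v1.18 it was kubectl alpha debug, and ephemeral containers are stable as of Kubernetes v1.25):

      kubectl debug -it <pod-name> --image=busybox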

B. Third-Party Tools

  • Lens: GUI for Kubernetes cluster monitoring.
  • K9s: Terminal-based cluster management.
  • kubectl-trace: System-level tracing for Kubernetes.

C. Log Aggregation

  • Use tools like Fluentd, ELK Stack, or Loki for centralized logging.


8. Proactive Cluster Monitoring

  • Implement monitoring systems like Prometheus, Grafana, or Datadog.
  • Set up alerting for critical metrics (e.g., node health, pod restarts).


Example: Debugging Workflow for a Non-Responsive Service

  1. Check Pod Status:

       kubectl get pods -n <namespace>

  2. Describe the Service:

       kubectl describe svc <service-name> -n <namespace>

  3. Inspect Logs:

       kubectl logs <pod-name> -n <namespace>

  4. Test Connectivity:
    • From within the cluster:

        curl http://<service-name>.<namespace>:<port>

    • From outside:

        curl http://<external-ip>:<port>
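
The read-only steps of this workflow can be bundled into one function so that they run in the same order every time. The following is a sketch, not a definitive tool. `debug_service` and the `KUBECTL` override variable are hypothetical names introduced here, and the endpoints check is borrowed from the networking section rather than from the numbered steps.

```shell
# Run the read-only checks for a service in a fixed order.
# $1 = service name, $2 = namespace.
# KUBECTL can be set to another binary or wrapper; it defaults to kubectl.
debug_service() {
  svc="$1"
  ns="$2"
  k="${KUBECTL:-kubectl}"

  echo "== Pods in $ns =="
  "$k" get pods -n "$ns"
  echo "== Service $svc =="
  "$k" describe svc "$svc" -n "$ns"
  echo "== Endpoints for $svc =="
  "$k" get endpoints "$svc" -n "$ns"
}

# Usage (requires cluster access):
#   debug_service <service-name> <namespace>
```

Log inspection and the curl tests are left out because they need a specific pod name or a network vantage point that the function cannot guess.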


This deeper dive equips you to troubleshoot and resolve complex Kubernetes issues effectively.
