Azure Kubernetes Service: Lessons Learned from Production Deployments

Abiola Akinbade

4/1/2025

I still remember my first AKS cluster crash. It was 2 AM, Opsgenie was buzzing with alerts, and I spent the next four hours trying to bring our production services back online. That night taught me more about Kubernetes than weeks of reading documentation.

After running AKS in production for a few years now, I've made plenty of mistakes. These are the lessons I wish someone had shared with me when I started.

Resource Planning

I used to think "bigger is better" for cluster sizing. I was wrong.

  • I wasted thousands of dollars on oversized clusters before finding the right balance.

  • Start with 3-5 nodes for most apps. You can always scale up later.

  • We moved our system pods and application pods onto separate node pools last year. Our stability improved right away.

  • My team cut costs by 80% on our dev environment by running it on spot node pools.
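
To make the sizing advice concrete, here is a minimal sketch of a node-count estimate based on per-pod requests. The function name, the 25% headroom default, and the example node size are all assumptions for illustration; plug in the allocatable capacity your own nodes report.

```python
import math


def nodes_needed(pod_count, cpu_request_m, mem_request_mi,
                 node_alloc_cpu_m, node_alloc_mem_mi, headroom=0.25):
    """Estimate how many nodes are needed to schedule `pod_count` identical pods.

    CPU is in millicores, memory in MiB. `headroom` is the fraction of each
    node kept free for system pods and bursts (an assumed default, not an AKS value).
    """
    if not 0 <= headroom < 1:
        raise ValueError("headroom must be in [0, 1)")
    usable_cpu = node_alloc_cpu_m * (1 - headroom)
    usable_mem = node_alloc_mem_mi * (1 - headroom)
    # A node fits as many pods as its scarcer resource allows.
    pods_per_node = int(min(usable_cpu // cpu_request_m,
                            usable_mem // mem_request_mi))
    if pods_per_node < 1:
        raise ValueError("a single pod does not fit on this node size")
    return math.ceil(pod_count / pods_per_node)


if __name__ == "__main__":
    # Hypothetical workload: 40 pods at 250m CPU / 512 MiB each,
    # on nodes with 4000m CPU / 16000 MiB allocatable.
    print(nodes_needed(40, 250, 512, 4000, 16000))
```

With those example numbers CPU is the limiting resource (12 pods per node), which lands inside the 3-5 node starting range.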

What's your current node count? Are you spending more than needed?

Performance Optimization

My first production issue came from a memory leak that crashed our entire cluster.

  • Set resource requests and limits on all your containers. I learned this lesson the hard way.

  • Check your actual usage weekly. My team found we overprovisioned by 40%.

  • We set up the Horizontal Pod Autoscaler to scale on request rate, not just CPU.

  • Our first autoscale test failed in production. Test scaling before real traffic hits.
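
If you want to sanity-check autoscaling numbers before a load test, the sketch below applies the scaling rule from the Kubernetes Horizontal Pod Autoscaler documentation (desired = ceil(current replicas × current metric / target metric), skipped when the ratio is within a tolerance) to a requests-per-second metric. The function name and the min/max defaults are assumptions, and the real controller also handles stabilization windows and unready pods, which this ignores.

```python
import math


def desired_replicas(current_replicas, current_rps_per_pod, target_rps_per_pod,
                     min_replicas=2, max_replicas=20, tolerance=0.1):
    """Approximate the HPA replica calculation for a per-pod request-rate metric."""
    ratio = current_rps_per_pod / target_rps_per_pod
    # Within tolerance of the target, the autoscaler leaves the count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    wanted = math.ceil(current_replicas * ratio)
    # Clamp to the configured bounds (assumed values here).
    return max(min_replicas, min(max_replicas, wanted))


if __name__ == "__main__":
    # 4 pods each seeing 150 req/s against a 100 req/s target.
    print(desired_replicas(4, 150, 100))
```

Running a few traffic scenarios through this before a test makes it easier to spot a max replica count that is too low for your peak.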

Networking

Networking issues are hard to debug. Plan ahead.

  • We switched to Azure CNI after our pod IP issues with kubenet.

  • Network policies saved us during an attempted breach between namespaces.

  • Set up proper ingress controllers with TLS. We use NGINX with Let's Encrypt.

  • We hit network throughput limits before CPU limits. Monitor your network traffic.
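
As an illustration of namespace isolation, this sketch builds a NetworkPolicy that only admits traffic from pods in the same namespace and prints it as JSON, which kubectl apply -f accepts. The policy name and the "payments" namespace are made up for the example, and it only takes effect on clusters that have a network policy engine enabled.

```python
import json


def same_namespace_only_policy(namespace):
    """Build a NetworkPolicy that blocks ingress from other namespaces."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "allow-same-namespace", "namespace": namespace},
        "spec": {
            # An empty podSelector selects every pod in the namespace.
            "podSelector": {},
            "policyTypes": ["Ingress"],
            # A podSelector without a namespaceSelector matches only
            # pods in the policy's own namespace.
            "ingress": [{"from": [{"podSelector": {}}]}],
        },
    }


if __name__ == "__main__":
    # "payments" is a hypothetical namespace name.
    print(json.dumps(same_namespace_only_policy("payments"), indent=2))
```

Ingress controllers and monitoring agents that live in other namespaces will need their own allow rules on top of this.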

Storage Considerations

  • Use managed disks for anything you care about.

  • We back up persistent volumes daily now.

  • Our database slowed down under load because of poor storage performance.

  • We use premium storage for databases and standard for logs.
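
Here is a rough sketch of picking a storage tier per workload when generating a PersistentVolumeClaim. It assumes the cluster has the built-in AKS storage classes managed-csi-premium and managed-csi; check kubectl get storageclass on your cluster before relying on those names. The function and the workload labels are hypothetical.

```python
import json

# Assumed mapping from workload type to storage class name.
STORAGE_CLASS_BY_WORKLOAD = {
    "database": "managed-csi-premium",  # premium SSD-backed managed disk
    "logs": "managed-csi",              # standard managed disk
}


def pvc_manifest(name, size_gi, workload):
    """Build a PersistentVolumeClaim for the given workload type."""
    if workload not in STORAGE_CLASS_BY_WORKLOAD:
        raise ValueError(f"unknown workload type: {workload!r}")
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": {
            # Managed disks attach to one node at a time.
            "accessModes": ["ReadWriteOnce"],
            "storageClassName": STORAGE_CLASS_BY_WORKLOAD[workload],
            "resources": {"requests": {"storage": f"{size_gi}Gi"}},
        },
    }


if __name__ == "__main__":
    print(json.dumps(pvc_manifest("orders-db-data", 256, "database"), indent=2))
```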

Security Best Practices

Security feels like extra work until you get hacked.

  • We enabled Microsoft Defender after finding unauthorized cryptomining in our cluster.

  • Microsoft Entra ID (formerly Azure AD) integration lets us control who can do what in each namespace.

  • We scan all container images in our pipeline. We've caught vulnerabilities before deployment.

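  • We enforce pod security rules to prevent privilege escalation. Pod security policies were removed in Kubernetes 1.25, so on current clusters this means Pod Security Admission or Azure Policy.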

  • Our team rotates service principal credentials every 90 days now.
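
For a quick manual check, this sketch scans a pod spec (as a plain dict, for example loaded from kubectl get pod -o json) for containers that could escalate privileges. The function name is hypothetical, and it treats a missing allowPrivilegeEscalation as a finding because Kubernetes allows escalation when the field is unset.

```python
def privilege_escalation_findings(pod_spec):
    """Return a list of human-readable findings for risky container settings."""
    findings = []
    containers = pod_spec.get("initContainers", []) + pod_spec.get("containers", [])
    for container in containers:
        name = container.get("name", "<unnamed>")
        security = container.get("securityContext", {})
        if security.get("privileged", False):
            findings.append(f"{name}: runs privileged")
        # Unset means escalation is allowed, so only an explicit False passes.
        if security.get("allowPrivilegeEscalation", True):
            findings.append(f"{name}: allowPrivilegeEscalation is not false")
    return findings


if __name__ == "__main__":
    # Hypothetical pod spec with one hardened and one default container.
    spec = {
        "containers": [
            {"name": "api", "securityContext": {"allowPrivilegeEscalation": False}},
            {"name": "sidecar"},
        ]
    }
    for finding in privilege_escalation_findings(spec):
        print(finding)
```

A check like this is no substitute for admission-time enforcement, but it is handy for auditing what is already running.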

When was the last time you reviewed your cluster security?

CI/CD Pipeline Insights

  • We built our pipeline with these key steps:

    • Scan container images for vulnerabilities

    • Validate Kubernetes manifests

    • Deploy to dev → staging → prod

    • Auto-rollback on high error rates

  • Helm templates cut our manifest maintenance time in half.

  • We moved secrets to Azure Key Vault after a GitHub token exposure.
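
The promotion and auto-rollback steps can be sketched as two small decisions. The 5% error threshold and the 100-request minimum are assumed values, not figures from our pipeline, and the function names are made up for the example.

```python
STAGES = ["dev", "staging", "prod"]


def should_rollback(total_requests, failed_requests,
                    max_error_rate=0.05, min_requests=100):
    """Decide whether a fresh deployment should be rolled back."""
    # Too little traffic gives a noisy error rate, so wait for more data.
    if total_requests < min_requests:
        return False
    return failed_requests / total_requests > max_error_rate


def next_stage(current):
    """Return the stage that follows `current`, or None after prod."""
    index = STAGES.index(current)
    return STAGES[index + 1] if index + 1 < len(STAGES) else None


if __name__ == "__main__":
    stage = "staging"
    if should_rollback(total_requests=1000, failed_requests=80):
        print(f"rolling back {stage}")
    else:
        print(f"promoting {stage} to {next_stage(stage)}")
```

In a real pipeline the request counts would come from your monitoring system, and the rollback itself would be something like a Helm rollback to the previous release.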

Monitoring Setup

You can't fix what you can't see.

  • I set up monitoring before deploying our first pod.

  • We track both cluster health and application metrics.

  • Key metrics we watch daily:

    • Node CPU/memory usage

    • Pod restart count

    • HTTP error rates

    • Request latency

  • Our alerts caught memory leaks twice before users noticed any issues.

  • We use Prometheus and Grafana with Azure Monitor.
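
To show what these metrics boil down to, here is a small sketch that computes an HTTP error rate, a nearest-rank latency percentile, and a list of pods with too many restarts from raw samples. In practice Prometheus does this work for you; the function names and the restart threshold of 3 are assumptions for illustration.

```python
def error_rate(status_codes):
    """Fraction of responses that were server errors (5xx)."""
    if not status_codes:
        return 0.0
    return sum(1 for code in status_codes if code >= 500) / len(status_codes)


def percentile(values, pct):
    """Nearest-rank percentile; `pct` is an integer from 1 to 100."""
    if not values:
        raise ValueError("no samples")
    if not 1 <= pct <= 100:
        raise ValueError("pct must be between 1 and 100")
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100 avoids floating-point rounding.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[rank - 1]


def pods_restarting(restart_counts, threshold=3):
    """Names of pods whose restart count is at or above the threshold."""
    return sorted(name for name, count in restart_counts.items()
                  if count >= threshold)


if __name__ == "__main__":
    # Hypothetical samples.
    print(error_rate([200, 200, 500, 503]))
    print(percentile([12, 15, 11, 240, 14, 13, 16, 18, 17, 19], 95))
    print(pods_restarting({"api-1": 5, "api-2": 0, "worker-1": 3}))
```

Percentiles matter more than averages here: a single slow request in the sample above barely moves the mean but dominates the 95th percentile.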

Which metrics matter most for your apps?

Disaster Recovery

Disasters happen. Preparation matters.

  • I test recovery procedures monthly with my team.

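  • We back up cluster state daily. AKS manages etcd for you and doesn't expose it, so this means backing up Kubernetes resources and persistent volumes rather than etcd itself.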

  • Our recovery runbook has step-by-step instructions anyone can follow.

  • We ran a full disaster simulation last quarter. We recovered in 15 minutes.

  • Chaos engineering is key. Breaking things on purpose shows you the weak spots before a real outage does.

Could your team recover your cluster if you weren't available?

Cost Management

  • We tag all resources by team, project, and environment.

  • Our weekly resource review cut costs by 30%.

  • We delete test namespaces after 7 days of inactivity.

  • Azure reservations saved us 40% on our production clusters.
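
The namespace cleanup rule is easy to express in code. This sketch picks namespaces that have been idle longer than 7 days, given a last-activity timestamp per namespace; how you measure "activity" (last deployment, last pod start) is up to you, and the function name and protected list are assumptions.

```python
from datetime import datetime, timedelta, timezone

# System namespaces that should never be cleaned up.
PROTECTED = ("default", "kube-system", "kube-public", "kube-node-lease")


def stale_namespaces(last_activity, now, max_idle_days=7):
    """Return namespaces whose last activity is older than the idle limit.

    `last_activity` maps namespace name to a timezone-aware datetime.
    """
    cutoff = now - timedelta(days=max_idle_days)
    return sorted(name for name, seen in last_activity.items()
                  if name not in PROTECTED and seen < cutoff)


if __name__ == "__main__":
    now = datetime(2025, 4, 1, tzinfo=timezone.utc)
    # Hypothetical test namespaces.
    activity = {
        "test-feature-a": now - timedelta(days=10),
        "test-feature-b": now - timedelta(days=2),
        "kube-system": now - timedelta(days=400),
    }
    print(stale_namespaces(activity, now))
```

Review the list before deleting anything; removing a namespace removes everything in it.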

My AKS journey had many bumps. Start small, learn constantly, and improve step by step. The benefits are worth the effort.