Jesus Paz · 2 min read
The Hidden Ways Kubernetes Makes Your AWS Bill Explode (and How to Fix It)
Identify the ten most expensive Kubernetes anti-patterns and learn the exact playbooks to eliminate them.
Kubernetes is incredibly efficient—until it is not. The following misconfigurations have silently inflated AWS spend across the hundreds of clusters I have reviewed. Use this list as a diagnostic guide and a remediation plan.
1. Oversized CPU and memory requests
- Symptom: Requests exceed 2x actual peak usage.
- Fix: Feed ClusterCost metrics into right-sizing policies. Start with dev/stage, then roll into prod with PodDisruptionBudgets.
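Here is a minimal sketch of what the right-sized result looks like: a hypothetical checkout-api Deployment with requests set just above its observed peak, plus a PodDisruptionBudget so the rollout cannot take too many replicas down at once. All names and numbers are illustrative; derive yours from ClusterCost or metrics-server data.

```yaml
# Hypothetical example: requests set just above observed peak usage.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout-api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: checkout-api
  template:
    metadata:
      labels:
        app: checkout-api
    spec:
      containers:
        - name: app
          image: example.com/checkout-api:1.4.0
          resources:
            requests:
              cpu: 350m      # observed peak ~300m, small headroom
              memory: 450Mi  # observed peak ~400Mi
            limits:
              memory: 600Mi
---
# Keep at least 2 pods available while the right-sized rollout happens.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: checkout-api-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: checkout-api
```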
2. Idle node pools
- Symptom: Node groups with <20% utilization for weeks.
- Fix: Enable Cluster Autoscaler scale-down settings (--scale-down-utilization-threshold=0.5) and schedule nightly audits via ClusterCost alerts.
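The relevant knobs live on the Cluster Autoscaler container itself. A fragment of its args might look like this; the flag names are real Cluster Autoscaler flags, while the values and image tag are illustrative:

```yaml
# Fragment of the cluster-autoscaler container spec; only the
# scale-down flags are shown, and the values are illustrative.
containers:
  - name: cluster-autoscaler
    image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.29.0
    command:
      - ./cluster-autoscaler
      - --cloud-provider=aws
      - --scale-down-enabled=true
      - --scale-down-utilization-threshold=0.5   # remove nodes below 50% requested
      - --scale-down-unneeded-time=10m           # only if underutilized this long
      - --scale-down-delay-after-add=10m
```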
3. Forgotten CronJobs
- Symptom: CronJobs keep spinning up expensive pods after the owning service is sunset.
- Fix: Set successfulJobsHistoryLimit/failedJobsHistoryLimit and add lifecycle policies that remove CronJobs when repos are archived.
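A trimmed-down CronJob with cleanup built in could look like the following sketch; the name, schedule, and image are hypothetical:

```yaml
# History limits cap leftover pods/objects, ttlSecondsAfterFinished
# cleans up finished Jobs, and suspend parks the job once the owning
# service is sunset.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-report
spec:
  schedule: "0 2 * * *"
  successfulJobsHistoryLimit: 1
  failedJobsHistoryLimit: 1
  suspend: false   # flip to true when the service is retired
  jobTemplate:
    spec:
      ttlSecondsAfterFinished: 3600  # remove finished Jobs after 1h
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: report
              image: example.com/report-runner:2.1.0
```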
4. Orphaned load balancers and NAT gateways
- Symptom: ALBs remain provisioned after ingress deletions.
- Fix: Run automated sweeps using AWS Config + ClusterCost metadata. Terminate unused infrastructure and bill the last owner.
5. Storage left behind
- Symptom: PVCs and snapshots persist long after workloads migrate.
- Fix: Use reclaimPolicy: Delete where possible and create monthly ClusterCost storage reports flagged by owner and TTL.
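For dynamically provisioned volumes, the reclaim policy is set on the StorageClass. A sketch, assuming the AWS EBS CSI driver is installed:

```yaml
# Delete reclaim policy: EBS volumes are removed when their PVCs go away.
# Keep Retain only for data you genuinely need to preserve.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: standard-delete
provisioner: ebs.csi.aws.com
reclaimPolicy: Delete            # Delete is the default, but make it explicit
volumeBindingMode: WaitForFirstConsumer
```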
6. Misconfigured Horizontal Pod Autoscalers
- Symptom: HPAs scale out but never scale in because min replicas are set too high.
- Fix: Right-size min/max values using actual demand, and tie HPAs to business metrics (QPS) instead of CPU only.
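Here is a sketch using the autoscaling/v2 API; the requests_per_second metric is hypothetical and assumes a custom-metrics adapter is installed, and the replica bounds are illustrative:

```yaml
# Low minReplicas so the HPA can actually scale in, plus a business
# metric (QPS) alongside CPU.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: checkout-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout-api
  minReplicas: 2       # sized to real off-peak demand, not "just in case"
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Pods
      pods:
        metric:
          name: requests_per_second   # assumes a custom-metrics adapter
        target:
          type: AverageValue
          averageValue: "100"
```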
7. Over-provisioned system namespaces
- Symptom: Logging, monitoring, and service mesh components run with production-grade requests in every environment.
- Fix: Separate system workloads per environment and tune requests using capacity tiering (prod vs. non-prod).
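One way to implement the tiering, assuming you already template manifests with Kustomize (an assumption; any overlay or Helm values mechanism works the same way), is a non-prod overlay that patches requests down for a hypothetical fluent-bit DaemonSet:

```yaml
# overlays/dev/kustomization.yaml (hypothetical layout): shrink a logging
# agent's requests instead of copying prod-grade sizing everywhere.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
patches:
  - target:
      kind: DaemonSet
      name: fluent-bit
    patch: |-
      - op: replace
        path: /spec/template/spec/containers/0/resources/requests/cpu
        value: 50m
      - op: replace
        path: /spec/template/spec/containers/0/resources/requests/memory
        value: 64Mi
```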
8. Expensive demo and preview environments
- Symptom: Preview clusters run 24/7 even when unused.
- Fix: Automate hibernation via ClusterCost schedules or GitHub workflows that tear down namespaces after inactivity.
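As a sketch, a scheduled GitHub Actions workflow can handle the teardown. This version simply deletes every namespace labeled env=preview each night; a real one would first check last-activity timestamps before deleting. The secret name and label are assumptions:

```yaml
# Nightly teardown of preview namespaces; assumes a base64-encoded
# kubeconfig is stored in the KUBECONFIG_B64 repository secret.
name: hibernate-previews
on:
  schedule:
    - cron: "0 1 * * *"   # every night at 01:00 UTC
jobs:
  teardown:
    runs-on: ubuntu-latest
    steps:
      - name: Configure cluster access
        run: |
          mkdir -p ~/.kube
          echo "${{ secrets.KUBECONFIG_B64 }}" | base64 -d > ~/.kube/config
      - name: Delete preview namespaces
        run: |
          kubectl delete namespace -l env=preview --ignore-not-found
```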
9. gp2 and io1 storage defaults
- Symptom: Stateful workloads default to legacy gp2/io1 volumes with high baseline cost.
- Fix: Move to gp3 with tuned throughput and leverage EBS volume tagging to rebate teams that modernize.
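A gp3 StorageClass through the AWS EBS CSI driver might look like the following; the throughput/iops values are illustrative, and the tagSpecification parameter (supported by recent EBS CSI driver releases) handles the tagging for showback:

```yaml
# gp3 with tuned throughput/iops; values are illustrative.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3-tuned
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  throughput: "125"   # MiB/s, gp3 baseline
  iops: "3000"        # gp3 baseline
  tagSpecification_1: "team={{ .PVCNamespace }}"   # volume tags for showback
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
```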
10. Zombie DaemonSets
- Symptom: Security/observability DaemonSets remain enabled in every cluster, even where they provide no value.
- Fix: Audit DaemonSets quarterly, track ownership in ClusterCost, and remove ones that no longer have downstream consumers.
Remediation framework
- Discover issues via ClusterCost dashboards (top idle nodes, unused storage, oversized workloads).
- Prioritize by potential savings × implementation effort.
- Assign owners—platform team for infra, product teams for workload sizing.
- Prove savings with before/after reports exported automatically.
Kubernetes will always spend whatever you allow it to. Shine a light on these hidden drivers, and the AWS bill becomes a lever for improvement rather than a monthly scare.