Jesus Paz · 2 min read
The Hidden Ways Kubernetes Makes Your AWS Bill Explode (and How to Fix It)
Identify the ten most expensive Kubernetes anti-patterns and learn the exact playbooks to eliminate them.
Kubernetes is incredibly efficient—until it is not. The following misconfigurations have silently inflated AWS spend across the hundreds of clusters I have reviewed. Use this list as a diagnostic guide and a remediation plan.
1. Oversized CPU and memory requests
- Symptom: Requests exceed 2x actual peak usage.
- Fix: Feed ClusterCost metrics into right-sizing policies. Start with dev/stage, then roll into prod with PodDisruptionBudgets.
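Here is a minimal sketch of what the right-sized result looks like: a hypothetical checkout-api Deployment with requests set just above its observed peak, plus a PodDisruptionBudget so the rollout cannot take too many replicas down at once. All names and numbers are illustrative; derive yours from ClusterCost or metrics-server data.

```yaml
# Hypothetical example: requests set just above observed peak usage.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout-api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: checkout-api
  template:
    metadata:
      labels:
        app: checkout-api
    spec:
      containers:
        - name: app
          image: example.com/checkout-api:1.4.0
          resources:
            requests:
              cpu: 350m      # observed peak ~300m, small headroom
              memory: 450Mi  # observed peak ~400Mi
            limits:
              memory: 600Mi
---
# Keep at least 2 pods available while the right-sized rollout happens.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: checkout-api-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: checkout-api
```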
2. Idle node pools
- Symptom: Node groups with <20% utilization for weeks.
- Fix: Enable Cluster Autoscaler scale-down settings (--scale-down-utilization-threshold=0.5) and schedule nightly audits via ClusterCost alerts.
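The relevant knobs live on the Cluster Autoscaler container itself. A fragment of its args might look like this; the flag names are real Cluster Autoscaler flags, while the values and image tag are illustrative:

```yaml
# Fragment of the cluster-autoscaler container spec; only the
# scale-down flags are shown, and the values are illustrative.
containers:
  - name: cluster-autoscaler
    image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.29.0
    command:
      - ./cluster-autoscaler
      - --cloud-provider=aws
      - --scale-down-enabled=true
      - --scale-down-utilization-threshold=0.5   # remove nodes below 50% requested
      - --scale-down-unneeded-time=10m           # only if underutilized this long
      - --scale-down-delay-after-add=10m
```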
3. Forgotten CronJobs
- Symptom: CronJobs keep spinning up expensive pods after the owning service is sunset.
- Fix: Set successfulJobsHistoryLimit/failedJobsHistoryLimit and add lifecycle policies that remove CronJobs when repos are archived.
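A trimmed-down CronJob with cleanup built in could look like the following sketch; the name, schedule, and image are hypothetical:

```yaml
# History limits cap leftover pods/objects, ttlSecondsAfterFinished
# cleans up finished Jobs, and suspend parks the job once the owning
# service is sunset.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-report
spec:
  schedule: "0 2 * * *"
  successfulJobsHistoryLimit: 1
  failedJobsHistoryLimit: 1
  suspend: false   # flip to true when the service is retired
  jobTemplate:
    spec:
      ttlSecondsAfterFinished: 3600  # remove finished Jobs after 1h
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: report
              image: example.com/report-runner:2.1.0
```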
4. Orphaned load balancers and NAT gateways
- Symptom: ALBs remain provisioned after ingress deletions.
- Fix: Run automated sweeps using AWS Config + ClusterCost metadata. Terminate unused infrastructure and bill the last owner.
5. Storage left behind
- Symptom: PVCs and snapshots persist long after workloads migrate.
- Fix: Use reclaimPolicy: Delete where possible and create monthly ClusterCost storage reports flagged by owner and TTL.
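For dynamically provisioned volumes, the reclaim policy is set on the StorageClass. A sketch, assuming the AWS EBS CSI driver is installed:

```yaml
# Delete reclaim policy: EBS volumes are removed when their PVCs go away.
# Keep Retain only for data you genuinely need to preserve.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: standard-delete
provisioner: ebs.csi.aws.com
reclaimPolicy: Delete            # Delete is the default, but make it explicit
volumeBindingMode: WaitForFirstConsumer
```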
6. Misconfigured Horizontal Pod Autoscalers
- Symptom: HPAs scale out but never scale in because min replicas are set too high.
- Fix: Right-size min/max values using actual demand, and tie HPAs to business metrics (QPS) instead of CPU only.
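Here is a sketch using the autoscaling/v2 API; the requests_per_second metric is hypothetical and assumes a custom-metrics adapter is installed, and the replica bounds are illustrative:

```yaml
# Low minReplicas so the HPA can actually scale in, plus a business
# metric (QPS) alongside CPU.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: checkout-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout-api
  minReplicas: 2       # sized to real off-peak demand, not "just in case"
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Pods
      pods:
        metric:
          name: requests_per_second   # assumes a custom-metrics adapter
        target:
          type: AverageValue
          averageValue: "100"
```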
7. Over-provisioned system namespaces
- Symptom: Logging, monitoring, and service mesh components run with production-grade requests in every environment.
- Fix: Separate system workloads per environment and tune requests using capacity tiering (prod vs. non-prod).
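One way to implement the tiering, assuming you already template manifests with Kustomize (an assumption; any overlay or Helm values mechanism works the same way), is a non-prod overlay that patches requests down for a hypothetical fluent-bit DaemonSet:

```yaml
# overlays/dev/kustomization.yaml (hypothetical layout): shrink a logging
# agent's requests instead of copying prod-grade sizing everywhere.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
patches:
  - target:
      kind: DaemonSet
      name: fluent-bit
    patch: |-
      - op: replace
        path: /spec/template/spec/containers/0/resources/requests/cpu
        value: 50m
      - op: replace
        path: /spec/template/spec/containers/0/resources/requests/memory
        value: 64Mi
```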
8. Expensive demo and preview environments
- Symptom: Preview clusters run 24/7 even when unused.
- Fix: Automate hibernation via ClusterCost schedules or GitHub workflows that tear down namespaces after inactivity.
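As a sketch, a scheduled GitHub Actions workflow can handle the teardown. This version simply deletes every namespace labeled env=preview each night; a real one would first check last-activity timestamps before deleting. The secret name and label are assumptions:

```yaml
# Nightly teardown of preview namespaces; assumes a base64-encoded
# kubeconfig is stored in the KUBECONFIG_B64 repository secret.
name: hibernate-previews
on:
  schedule:
    - cron: "0 1 * * *"   # every night at 01:00 UTC
jobs:
  teardown:
    runs-on: ubuntu-latest
    steps:
      - name: Configure cluster access
        run: |
          mkdir -p ~/.kube
          echo "${{ secrets.KUBECONFIG_B64 }}" | base64 -d > ~/.kube/config
      - name: Delete preview namespaces
        run: |
          kubectl delete namespace -l env=preview --ignore-not-found
```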
9. gp2 and io1 storage defaults
- Symptom: Stateful workloads default to legacy gp2/io1 volumes with high baseline cost.
- Fix: Move to gp3 with tuned throughput and leverage EBS volume tagging to rebate teams that modernize.
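A gp3 StorageClass through the AWS EBS CSI driver might look like the following; the throughput/iops values are illustrative, and the tagSpecification parameter (supported by recent EBS CSI driver releases) handles the tagging for showback:

```yaml
# gp3 with tuned throughput/iops; values are illustrative.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3-tuned
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  throughput: "125"   # MiB/s, gp3 baseline
  iops: "3000"        # gp3 baseline
  tagSpecification_1: "team={{ .PVCNamespace }}"   # volume tags for showback
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
```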
10. Zombie DaemonSets
- Symptom: Security/observability DaemonSets remain enabled in every cluster, even where they provide no value.
- Fix: Audit DaemonSets quarterly, track ownership in ClusterCost, and remove ones that no longer have downstream consumers.
Remediation framework
- Discover issues via ClusterCost dashboards (top idle nodes, unused storage, oversized workloads).
- Prioritize by potential savings × implementation effort.
- Assign owners—platform team for infra, product teams for workload sizing.
- Prove savings with before/after reports exported automatically.
Kubernetes will always spend whatever you allow it to. Shine a light on these hidden drivers, and the AWS bill becomes a lever for improvement rather than a monthly scare.