The Hidden Cost of Spot Instances: Why Cheap Compute Can Be Expensive

Spot Instances promise savings of up to 90%, but interruptions, cross-AZ data transfer, and engineering fatigue can wipe that out. Here is when not to use them.

Jesus Paz

Everyone loves Spot Instances. “Save up to 90%!” screams the AWS marketing page. And for stateless, fault-tolerant batch jobs, they are indeed a miracle.

But for long-running microservices in Kubernetes? The math isn’t so simple.

I’ve seen teams migrate their entire production fleet to Spot, high-fiving over the projected savings, only to watch their actual bill—and their burnout rate—creep back up.

Here are the hidden taxes of Spot Instances that the calculator doesn’t show you.

1. The Interruption Tax

When AWS reclaims a Spot node, your pods have 2 minutes to evacuate. Best case? A graceful shutdown. Worst case? Dropped connections, failed transactions, and a retry storm.

If your application isn’t perfectly architected for chaos (and be honest, is it?), every interruption is a customer-facing blip.
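To make the two-minute window concrete, here is a minimal sketch of checking whether a pod's shutdown budget fits inside it. It assumes the interruption notice is the small JSON document EC2 exposes at the instance metadata path /latest/meta-data/spot/instance-action, with an "action" and a UTC "time" field. The function names and the 10-second safety margin are mine, purely for illustration, and fetching the document (including the IMDSv2 token dance) is left out.

```python
import json
from datetime import datetime, timezone


def seconds_until_interruption(notice_json: str, now: datetime) -> float:
    """Return the seconds left before the action named in a Spot notice."""
    notice = json.loads(notice_json)
    # Assumed timestamp format: UTC, e.g. "2024-01-01T00:02:00Z".
    deadline = datetime.strptime(
        notice["time"], "%Y-%m-%dT%H:%M:%SZ"
    ).replace(tzinfo=timezone.utc)
    return max(0.0, (deadline - now).total_seconds())


def can_drain(notice_json: str, now: datetime,
              grace_period_s: float, safety_margin_s: float = 10.0) -> bool:
    """True if the pod's grace period plus a margin fits before the deadline.

    grace_period_s stands in for the pod's terminationGracePeriodSeconds;
    safety_margin_s is a hypothetical buffer for detection and eviction lag.
    """
    remaining = seconds_until_interruption(notice_json, now)
    return grace_period_s + safety_margin_s <= remaining


if __name__ == "__main__":
    sample = '{"action": "terminate", "time": "2024-01-01T00:02:00Z"}'
    start = datetime(2024, 1, 1, 0, 0, 0, tzinfo=timezone.utc)
    print(seconds_until_interruption(sample, start))
    print(can_drain(sample, start, grace_period_s=30))
```

If your slowest service needs longer than the remaining window to finish in-flight work, that service is the one that turns an interruption into a customer-facing blip.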

The Real Cost:

  • Retry Traffic: Failed requests get retried, increasing load on your database and other services.
  • Cold Starts: New nodes take time to spin up. Your app takes time to boot. During that window, you’re running at reduced capacity or over-provisioning to compensate.
  • Over-provisioning: To handle the churn, teams often run more replicas than they need, negating the unit cost savings.
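The over-provisioning effect is easy to put numbers on. The sketch below is back-of-the-envelope arithmetic, not a pricing model: spot_discount and extra_replica_ratio are hypothetical inputs you would replace with your own figures, and it assumes the On-Demand fleet would run with no extra headroom.

```python
def effective_savings(spot_discount: float, extra_replica_ratio: float) -> float:
    """Fraction saved versus an On-Demand fleet sized without headroom.

    spot_discount: e.g. 0.70 for a 70% lower hourly price (assumed input).
    extra_replica_ratio: e.g. 0.50 if you run 50% more replicas on Spot.
    """
    spot_cost = (1.0 - spot_discount) * (1.0 + extra_replica_ratio)
    return 1.0 - spot_cost


def break_even_headroom(spot_discount: float) -> float:
    """Extra replica ratio at which Spot costs the same as On-Demand."""
    return spot_discount / (1.0 - spot_discount)


if __name__ == "__main__":
    # A 70% discount with 50% extra replicas leaves roughly 55% savings.
    print(round(effective_savings(0.70, 0.50), 2))
    # At a 50% discount, doubling the fleet erases the savings entirely.
    print(break_even_headroom(0.50))
```

The headline discount only holds if the replica count stays the same. Every extra replica you add for churn cuts into it.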

2. The Cross-AZ Data Transfer Trap

Spot capacity is fluid. To maintain availability, you often have to enable “Capacity Rebalancing,” which proactively launches replacement instances when AWS signals that a running Spot Instance is at elevated risk of interruption.

Often, the replacement lands in a different Availability Zone (AZ), and your workloads move with it.

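In AWS, traffic between instances in the same AZ is free when it uses private IP addresses. Traffic between AZs in the same region costs $0.01/GB in each direction, so every cross-AZ gigabyte is effectively billed twice.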

The Scenario: You have a chatty microservice architecture. Service A calls Service B 1,000 times a second.

  • On-Demand: You pin them to us-east-1a. Cost: $0.
  • Spot: The scheduler scatters them across 1a, 1b, and 1c to find capacity. Cost: Hundreds of dollars a month in hidden network fees.

If your “compute savings” get eaten by “networking costs,” you’ve taken on extra complexity and saved nothing.
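Here is a rough estimate of how the scenario above reaches “hundreds of dollars.” Everything except the 1,000 calls per second is an assumption on my part: 10 KB moved per call (request plus response), a 30-day month, 1 GB treated as 10^9 bytes, $0.01/GB billed in each direction, and pods spread evenly across three AZs with no topology-aware routing, so about two thirds of calls cross an AZ boundary.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # assumes a 30-day month


def monthly_cross_az_cost(requests_per_s: float,
                          bytes_per_call: float,
                          cross_az_fraction: float,
                          rate_per_gb_each_way: float = 0.01) -> float:
    """Estimate monthly cross-AZ transfer cost in dollars.

    bytes_per_call is request plus response size (an assumed figure).
    cross_az_fraction is the share of calls that leave the caller's AZ.
    """
    gb_per_month = requests_per_s * bytes_per_call * SECONDS_PER_MONTH / 1e9
    # Each cross-AZ gigabyte is charged on the sending and receiving side.
    return gb_per_month * cross_az_fraction * rate_per_gb_each_way * 2


if __name__ == "__main__":
    pinned = monthly_cross_az_cost(1000, 10_000, cross_az_fraction=0.0)
    scattered = monthly_cross_az_cost(1000, 10_000, cross_az_fraction=2 / 3)
    print(f"pinned to one AZ: ${pinned:.2f}")
    print(f"scattered across three AZs: ${scattered:.2f}")
```

Under these assumptions the scattered case comes to about $345 a month for a single pair of services. Multiply that by every chatty pair in the mesh, or by larger payloads, and the line item grows quickly.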

3. The Engineering Fatigue

This is the most expensive line item.

When a Spot interruption causes a weird race condition or a brief outage, who gets paged? Your engineers.

If your team spends 5 hours a week debugging “ghost issues” that turn out to be Spot interruptions, you aren’t saving money. You’re burning expensive engineering hours to save cheap EC2 hours.

Rule of Thumb: If you spend more on engineering time fixing Spot issues than you save on the bill, go back to On-Demand.
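The rule of thumb can be checked with a few lines of arithmetic. This is only a sketch. The bills and the $100/hour loaded engineering rate are hypothetical placeholders, and the 5 hours a week comes from the example above.

```python
def net_monthly_savings(on_demand_bill: float,
                        spot_bill: float,
                        eng_hours_per_week: float,
                        loaded_hourly_rate: float,
                        weeks_per_month: float = 52 / 12) -> float:
    """Monthly savings from Spot after subtracting engineering time.

    All inputs are in the same currency; loaded_hourly_rate is the fully
    loaded cost of an engineer-hour (an assumed figure you supply).
    """
    gross_savings = on_demand_bill - spot_bill
    engineering_cost = eng_hours_per_week * weeks_per_month * loaded_hourly_rate
    return gross_savings - engineering_cost


if __name__ == "__main__":
    # Hypothetical: $10,000 On-Demand vs $8,000 on Spot, 5 h/week at $100/h.
    print(round(net_monthly_savings(10_000, 8_000, 5, 100), 2))
```

A negative result means the Spot fleet costs more than it saves once people's time is counted. It also leaves out the harder-to-price costs of pager fatigue and lost focus.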

When to Use Spot (and When to Run)

✅ Use Spot for:

  • CI/CD Runners: Jenkins agents, self-hosted GitHub Actions runners. If a build fails, you just retry it.
  • Batch Processing: Spark, Hadoop, image processing.
  • Stateless Frontends: If you have aggressive caching and can tolerate 1% error rates during rebalancing.

❌ Stick to On-Demand / Savings Plans for:

  • Databases: Obviously. Never put state on Spot.
  • Core API Services: If it powers your checkout flow, pay the premium for stability.
  • Single-Replica Apps: If you only run one copy, a Spot interruption means downtime.

The Verdict

Don’t default to Spot. Default to Savings Plans for your baseline load. In exchange for a one- or three-year commitment, they typically offer 40-60% savings with zero engineering overhead.

Use Spot only for the burstable, interruptible peaks. Reliability is a feature, and it has a price tag.

Jesus Paz

Founder & CEO