Graceful Shutdowns: Surviving Spot Interruptions

In my post on Spot Instances, I warned about the “Interruption Tax.” The only way to pay that tax without going bankrupt is Graceful Shutdowns.

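When AWS reclaims a Spot node, it gives you a 2-minute warning. If your cluster runs an interruption handler (for example, AWS Node Termination Handler or Karpenter), the node is drained and Kubernetes sends a SIGTERM signal to each pod on it.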

If your app ignores SIGTERM, it gets SIGKILLed when the termination grace period expires (30 seconds by default). In-flight requests fail. Database connections leak. Customers get 500 errors.

Here is how to fix it.

The Lifecycle

1. Spot Reclaim: AWS issues the interruption notice for the node.
2. Node Drain: The node is cordoned and drained.
3. Pod Termination: Kubernetes sends SIGTERM to your app.
4. Service Removal: Simultaneously, Kubernetes removes the pod IP from the Service/Ingress endpoints.

The Race Condition

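Steps 3 and 4 happen at the same time, and nothing orders them. This is the problem.

Removing the pod IP has to propagate to kube-proxy, the ingress controller, and any cloud load balancer, which can take several seconds. Your app might receive the SIGTERM and shut down before the load balancer stops sending it traffic. Result: Dropped requests.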

The Fix: preStop Sleep

You need to tell your app to wait for the load balancer to update. The simplest way is a preStop hook.

lifecycle:
  preStop:
    exec:
      command: ["/bin/sh", "-c", "sleep 10"]

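Kubernetes runs the preStop hook before it sends SIGTERM, so the pod keeps serving for 10 seconds after termination starts, giving the load balancer time to propagate the change. Two caveats: the container image must include /bin/sh and sleep, and the sleep counts against the termination grace period, so leave enough of it for your app to drain.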

The Code: Handling SIGTERM

In your application code (Go example), you must catch the signal and drain connections.

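sigChan := make(chan os.Signal, 1)
signal.Notify(sigChan, syscall.SIGINT, syscall.SIGTERM)

<-sigChan // Block until a signal arrives
log.Println("Shutting down...")

// Bound the drain so it finishes before Kubernetes sends SIGKILL
ctx, cancel := context.WithTimeout(context.Background(), 15*time.Second)
defer cancel()

// Stop accepting NEW requests and wait for in-flight ones to complete
if err := server.Shutdown(ctx); err != nil {
    log.Printf("Shutdown did not complete cleanly: %v", err)
}

// Finish background work (waitForJobsToFinish is your own function)
waitForJobsToFinish()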

Summary

Spot instances are only “production ready” if your app is “interruption ready.”

Add a preStop sleep (10-15s).
Catch SIGTERM.
Drain connections gracefully.

Do this, and you can save up to 90% on compute without waking up at 3 AM.

Jesus Paz

Founder & CEO
