The Million-Dollar Question: How We Cut EKS Costs by 60% Without Sacrificing Performance
A detailed case study of how we reduced Amazon EKS infrastructure costs by 60% while improving performance and reliability for a major financial services client.
STAQI Technologies Team
February 26, 2024
When the CFO of a major financial services company asked us to "make the cloud bill reasonable" while maintaining their strict performance and compliance requirements, we knew we had a challenge ahead. Their monthly AWS bill had grown from $50,000 to $800,000 in just two years as they scaled their EKS-based payment processing platform. The mandate was clear: cut costs significantly without impacting the business.
Six months later, we had reduced their monthly AWS costs by 60% while actually improving system performance and reliability. This is the story of how we did it.
The Cost Crisis: When Success Becomes Expensive
Our client, a payment processor handling 20 million transactions daily, had experienced explosive growth. Their EKS infrastructure had grown organically, with each team adding resources as needed. While this approach supported rapid business growth, it led to significant inefficiencies:
The Starting Point:
- Monthly AWS costs: $800,000
- EKS clusters: 15 (across dev, staging, prod)
- EC2 instances: 400+ (mix of on-demand and reserved)
- Average CPU utilization: 12%
- Average memory utilization: 18%
- Persistent storage: 2.5 PB (mostly unoptimized)
- Data transfer costs: $80,000/month
Business Impact:
- Infrastructure costs growing faster than revenue
- Development teams afraid to deploy due to cost concerns
- CFO questioning the cloud strategy
- Board pressure to control spending
The STAQI Approach: FinOps Meets Engineering Excellence
We developed a comprehensive cost optimization strategy based on four pillars:
1. Visibility and Accountability
2. Right-Sizing and Efficiency
3. Intelligent Scaling
4. Architectural Optimization
Let's dive into each pillar and the specific techniques we used.
Pillar 1: Visibility and Accountability
Implementing Cost Observability
The first step was understanding where money was being spent. We implemented comprehensive cost tracking:
# Cost Allocation Tags for All Resources
apiVersion: v1
kind: ConfigMap
metadata:
  name: cost-allocation-config
  namespace: kube-system
data:
  required-tags.yaml: |
    mandatory_tags:
      - cost-center
      - environment
      - application
      - team
      - business-unit
    auto_tagging_rules:
      - resource_type: "aws:ec2:instance"
        tags:
          managed-by: "kubernetes"
          cluster-name: "${CLUSTER_NAME}"
      - resource_type: "aws:ebs:volume"
        tags:
          managed-by: "kubernetes"
          cluster-name: "${CLUSTER_NAME}"
Custom Cost Dashboards
We built real-time cost dashboards that made spending visible to every team:
{ "dashboard": { "title": "EKS Cost Optimization Dashboard", "panels": [ { "title": "Daily Costs by Team", "type": "graph", "targets": [ { "expr": "aws_billing_estimated_charges{currency=\"USD\"} by (team)", "legendFormat": "{{team}}" } ] }, { "title": "Cost per Transaction", "type": "stat", "targets": [ { "expr": "aws_billing_estimated_charges / payment_transactions_total", "legendFormat": "Cost per Transaction" } ], "thresholds": [ {"color": "green", "value": 0}, {"color": "yellow", "value": 0.02}, {"color": "red", "value": 0.05} ] }, { "title": "Resource Utilization vs Cost", "type": "scatter", "targets": [ { "expr": "node_cpu_utilization", "refId": "A" }, { "expr": "node_cost_per_hour", "refId": "B" } ] } ] } }
Team-Level Cost Budgets and Alerts
# Cost Budget Alerts
apiVersion: v1
kind: ConfigMap
metadata:
  name: team-budgets
  namespace: cost-management
data:
  budgets.yaml: |
    teams:
      payments-team:
        monthly_budget: 50000
        alert_thresholds: [60, 80, 90]
        contact: "payments-team@company.com"
      fraud-detection:
        monthly_budget: 30000
        alert_thresholds: [70, 85, 95]
        contact: "fraud-team@company.com"
      api-gateway:
        monthly_budget: 25000
        alert_thresholds: [65, 80, 90]
        contact: "api-team@company.com"
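The ConfigMap only declares the budgets. One way to enforce them is a scheduled job that compares each team's month-to-date spend, pulled from Cost Explorer and grouped by the team cost-allocation tag, against its budget and fires an alert when a threshold is crossed. A minimal sketch of that check, with the notification helper deliberately left abstract:

# Minimal sketch of a budget-alert job. Assumes spend carries a "team"
# cost-allocation tag and that notify() is wired to email/Slack elsewhere.
from datetime import date

import boto3

ce = boto3.client('ce')  # Cost Explorer

BUDGETS = {  # mirrors the team-budgets ConfigMap above
    'payments-team': {'monthly_budget': 50000, 'alert_thresholds': [60, 80, 90]},
    'fraud-detection': {'monthly_budget': 30000, 'alert_thresholds': [70, 85, 95]},
    'api-gateway': {'monthly_budget': 25000, 'alert_thresholds': [65, 80, 90]},
}

def month_to_date_spend_by_team():
    """Return {team: month-to-date unblended cost in USD}."""
    today = date.today()
    response = ce.get_cost_and_usage(
        TimePeriod={'Start': today.replace(day=1).isoformat(), 'End': today.isoformat()},
        Granularity='MONTHLY',
        Metrics=['UnblendedCost'],
        GroupBy=[{'Type': 'TAG', 'Key': 'team'}],
    )
    spend = {}
    for group in response['ResultsByTime'][0]['Groups']:
        team = group['Keys'][0].split('$')[-1]  # tag keys come back as "team$<value>"
        spend[team] = float(group['Metrics']['UnblendedCost']['Amount'])
    return spend

def check_budgets(notify):
    """Alert once any configured threshold is crossed."""
    for team, cost in month_to_date_spend_by_team().items():
        budget = BUDGETS.get(team)
        if not budget:
            continue
        used_pct = 100 * cost / budget['monthly_budget']
        breached = [t for t in budget['alert_thresholds'] if used_pct >= t]
        if breached:
            notify(team, f"{team} has used {used_pct:.0f}% of its monthly budget")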
Results from Visibility:
- Teams became cost-conscious overnight
- 40% reduction in dev/staging resource usage
- Identification of $120,000 in unused resources
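Much of that $120,000 was orphaned resources sitting outside any cluster. Sweeps like the following hedged sketch surface that kind of waste: unassociated Elastic IPs and stale EBS snapshots (the age threshold is illustrative; unattached EBS volumes are handled separately under Pillar 4):

# Minimal sketch of an idle-spend sweep: unassociated Elastic IPs (billed
# hourly while unattached) and aged, self-owned EBS snapshots.
from datetime import datetime, timezone, timedelta

import boto3

ec2 = boto3.client('ec2')

def find_unassociated_eips():
    """Elastic IPs that are allocated but not attached to anything."""
    return [
        addr['PublicIp']
        for addr in ec2.describe_addresses()['Addresses']
        if 'AssociationId' not in addr
    ]

def find_stale_snapshots(max_age_days=180):
    """Self-owned EBS snapshots older than max_age_days (illustrative threshold)."""
    cutoff = datetime.now(timezone.utc) - timedelta(days=max_age_days)
    stale = []
    paginator = ec2.get_paginator('describe_snapshots')
    for page in paginator.paginate(OwnerIds=['self']):
        for snap in page['Snapshots']:
            if snap['StartTime'] < cutoff:
                stale.append({'SnapshotId': snap['SnapshotId'], 'SizeGiB': snap['VolumeSize']})
    return stale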
Pillar 2: Right-Sizing and Efficiency
Intelligent Resource Right-Sizing
Most workloads were dramatically over-provisioned. We implemented automated right-sizing:
# Automated Right-Sizing Recommendations
import boto3
import numpy as np
from datetime import datetime, timedelta

class EKSRightSizer:
    def __init__(self):
        self.cloudwatch = boto3.client('cloudwatch')
        self.eks = boto3.client('eks')

    def analyze_workload_patterns(self, namespace, days=30):
        """Analyze workload patterns to recommend optimal resource allocation"""
        end_time = datetime.utcnow()
        start_time = end_time - timedelta(days=days)

        # Get CPU and memory metrics
        # (get_metric_statistics wraps CloudWatch queries; omitted here for brevity)
        cpu_metrics = self.get_metric_statistics(
            namespace=f"AWS/EKS/{namespace}",
            metric_name="pod_cpu_utilization_over_pod_limit",
            start_time=start_time,
            end_time=end_time
        )
        memory_metrics = self.get_metric_statistics(
            namespace=f"AWS/EKS/{namespace}",
            metric_name="pod_memory_utilization_over_pod_limit",
            start_time=start_time,
            end_time=end_time
        )

        # Analyze patterns
        cpu_analysis = self.analyze_resource_usage(cpu_metrics)
        memory_analysis = self.analyze_resource_usage(memory_metrics)

        return self.generate_recommendations(cpu_analysis, memory_analysis)

    def analyze_resource_usage(self, metrics):
        """Analyze resource usage patterns"""
        if not metrics:
            return None

        values = [point['Average'] for point in metrics]
        return {
            'p50': np.percentile(values, 50),
            'p95': np.percentile(values, 95),
            'p99': np.percentile(values, 99),
            'max': max(values),
            'avg': np.mean(values),
            'std': np.std(values)
        }

    def generate_recommendations(self, cpu_analysis, memory_analysis):
        """Generate right-sizing recommendations
        (calculate_cpu_savings / calculate_memory_savings helpers omitted for brevity)"""
        recommendations = []

        if cpu_analysis:
            # Recommend CPU based on P95 + buffer
            recommended_cpu = cpu_analysis['p95'] * 1.2
            if cpu_analysis['avg'] < 0.3:  # Under 30% average utilization
                recommendations.append({
                    'type': 'cpu_reduction',
                    'current_avg': cpu_analysis['avg'],
                    'recommended_limit': recommended_cpu,
                    'potential_savings': self.calculate_cpu_savings(recommended_cpu)
                })

        if memory_analysis:
            # Recommend memory based on P99 + buffer
            recommended_memory = memory_analysis['p99'] * 1.3
            if memory_analysis['avg'] < 0.4:  # Under 40% average utilization
                recommendations.append({
                    'type': 'memory_reduction',
                    'current_avg': memory_analysis['avg'],
                    'recommended_limit': recommended_memory,
                    'potential_savings': self.calculate_memory_savings(recommended_memory)
                })

        return recommendations
Vertical Pod Autoscaler (VPA) Implementation
# VPA for Automatic Right-Sizing
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: payment-processor-vpa
  namespace: payments
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payment-processor
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
      - containerName: payment-processor
        maxAllowed:
          cpu: "2"
          memory: "4Gi"
        minAllowed:
          cpu: "100m"
          memory: "128Mi"
        controlledResources: ["cpu", "memory"]
        controlledValues: RequestsAndLimits
Node Pool Optimization
We restructured the entire node architecture:
# Optimized Node Groups for Different Workload Types
module "compute_optimized_nodes" {
  source = "./modules/eks-node-group"

  cluster_name    = var.cluster_name
  node_group_name = "compute-optimized"
  instance_types  = ["c5.large", "c5.xlarge", "c5.2xlarge"]
  capacity_type   = "SPOT"  # 70% cost reduction for fault-tolerant workloads

  scaling_config = {
    desired_size = 10
    max_size     = 50
    min_size     = 5
  }

  labels = {
    workload-type = "compute-intensive"
    cost-profile  = "optimized"
  }

  taints = [{
    key    = "workload-type"
    value  = "compute-intensive"
    effect = "NO_SCHEDULE"
  }]
}

module "memory_optimized_nodes" {
  source = "./modules/eks-node-group"

  cluster_name    = var.cluster_name
  node_group_name = "memory-optimized"
  instance_types  = ["r5.large", "r5.xlarge"]
  capacity_type   = "ON_DEMAND"  # For memory-sensitive workloads

  scaling_config = {
    desired_size = 5
    max_size     = 20
    min_size     = 2
  }

  labels = {
    workload-type = "memory-intensive"
    cost-profile  = "performance"
  }
}

module "general_purpose_spot" {
  source = "./modules/eks-node-group"

  cluster_name    = var.cluster_name
  node_group_name = "general-spot"
  instance_types  = ["m5.large", "m5.xlarge", "m5a.large", "m5a.xlarge"]
  capacity_type   = "SPOT"

  scaling_config = {
    desired_size = 15
    max_size     = 100
    min_size     = 5
  }

  labels = {
    workload-type = "general"
    cost-profile  = "spot"
  }
}
Right-Sizing Results:
- Average CPU utilization: 12% → 65%
- Average memory utilization: 18% → 70%
- Node count reduction: 400 → 180 instances
- Monthly savings: $180,000
Pillar 3: Intelligent Scaling
Predictive Auto-Scaling
We implemented predictive scaling based on business patterns:
# Predictive Scaling Based on Transaction Patterns
from datetime import datetime, timedelta

from sklearn.ensemble import RandomForestRegressor

class PredictiveScaler:
    def __init__(self):
        self.model = RandomForestRegressor(n_estimators=100, random_state=42)
        self.trained = False

    def train_model(self, historical_data):
        """Train the model on historical transaction and resource data"""
        # Features: hour, day_of_week, day_of_month, is_weekend, is_holiday
        features = []
        targets = []

        for record in historical_data:
            dt = datetime.fromisoformat(record['timestamp'])
            feature_vector = [
                dt.hour,
                dt.weekday(),
                dt.day,
                1 if dt.weekday() >= 5 else 0,    # weekend
                1 if self.is_holiday(dt) else 0,  # holiday
                record['transaction_volume'],
                record['avg_response_time']
            ]
            features.append(feature_vector)
            targets.append(record['required_replicas'])

        self.model.fit(features, targets)
        self.trained = True

    def predict_scaling_needs(self, forecast_hours=24):
        """Predict scaling needs for the next N hours"""
        if not self.trained:
            raise ValueError("Model must be trained first")

        predictions = []
        current_time = datetime.utcnow()

        for hour_offset in range(forecast_hours):
            future_time = current_time + timedelta(hours=hour_offset)

            # Get predicted transaction volume for this time
            # (predict_transaction_volume and is_holiday are separate helpers, omitted here)
            predicted_volume = self.predict_transaction_volume(future_time)

            feature_vector = [
                future_time.hour,
                future_time.weekday(),
                future_time.day,
                1 if future_time.weekday() >= 5 else 0,
                1 if self.is_holiday(future_time) else 0,
                predicted_volume,
                100  # target response time
            ]

            predicted_replicas = self.model.predict([feature_vector])[0]
            predictions.append({
                'timestamp': future_time.isoformat(),
                'predicted_volume': predicted_volume,
                'recommended_replicas': max(1, int(predicted_replicas))
            })

        return predictions

    def generate_scaling_schedule(self, predictions):
        """Generate HPA scaling schedule"""
        schedule = []
        for pred in predictions:
            schedule.append({
                'time': pred['timestamp'],
                'action': 'scale',
                'replicas': pred['recommended_replicas'],
                'reason': f"Predicted volume: {pred['predicted_volume']} TPS"
            })
        return schedule
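One way to act on the generated schedule is to raise the HPA's replica floor ahead of predicted peaks, so reactive autoscaling still handles surprises. A minimal sketch using the official Kubernetes Python client (the HPA name and namespace match the HPA shown in the next section; a client version with autoscaling/v2 support is assumed):

# Minimal sketch: apply a schedule entry by raising the HPA's minReplicas
# ahead of an expected peak. HPA name and namespace are illustrative.
from kubernetes import client, config

def apply_schedule_entry(entry, namespace="payments", hpa_name="payment-processor-hpa"):
    """Patch minReplicas so the HPA never scales below the predicted need."""
    config.load_incluster_config()  # or config.load_kube_config() outside the cluster
    autoscaling = client.AutoscalingV2Api()

    patch = {"spec": {"minReplicas": entry["replicas"]}}
    autoscaling.patch_namespaced_horizontal_pod_autoscaler(
        name=hpa_name,
        namespace=namespace,
        body=patch,
    )
    print(f"{entry['time']}: minReplicas -> {entry['replicas']} ({entry['reason']})")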
Custom Metrics Auto-Scaling
# HPA with Custom Business Metrics
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: payment-processor-hpa
  namespace: payments
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payment-processor
  minReplicas: 5
  maxReplicas: 200
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
    - type: Pods
      pods:
        metric:
          name: transactions_per_second_per_pod
        target:
          type: AverageValue
          averageValue: "50"
    - type: Object
      object:
        metric:
          name: queue_depth
        describedObject:
          apiVersion: v1
          kind: Service
          name: payment-queue
        target:
          type: Value
          value: "30"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Percent
          value: 100
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 50
          periodSeconds: 180
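The transactions_per_second_per_pod metric only exists if the pods publish an underlying counter and a metrics adapter (such as prometheus-adapter) exposes its rate through the custom-metrics API. A minimal sketch of the application-side instrumentation; the counter name and port are illustrative, and the same payment_transactions_total series can also feed the cost-per-transaction dashboard panel shown earlier:

# Minimal sketch of the application-side counter behind the custom HPA metric.
# A metrics adapter still has to map rate(payment_transactions_total[1m]) to
# transactions_per_second_per_pod in the custom-metrics API.
from prometheus_client import Counter, start_http_server

payment_transactions = Counter(
    'payment_transactions_total',
    'Total payment transactions processed by this pod'
)

def handle_transaction(transaction):
    # ... process the payment, then count it ...
    payment_transactions.inc()

if __name__ == '__main__':
    start_http_server(8080)  # /metrics endpoint scraped by Prometheus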
Cluster Auto-Scaling Optimization
# Optimized Cluster Autoscaler Configuration
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system
spec:
  template:
    spec:
      containers:
        - image: k8s.gcr.io/autoscaling/cluster-autoscaler:v1.21.0
          name: cluster-autoscaler
          command:
            - ./cluster-autoscaler
            - --v=4
            - --stderrthreshold=info
            - --cloud-provider=aws
            - --skip-nodes-with-local-storage=false
            - --expander=least-waste
            - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/eks-cluster-name
            - --balance-similar-node-groups
            - --scale-down-enabled=true
            - --scale-down-delay-after-add=2m
            - --scale-down-unneeded-time=3m
            - --scale-down-utilization-threshold=0.5
            - --skip-nodes-with-system-pods=false
            - --max-node-provision-time=15m
Scaling Optimization Results:
- 45% reduction in over-provisioning
- 30% faster scale-up response
- 50% reduction in idle nodes
- Monthly savings: $150,000
Pillar 4: Architectural Optimization
Storage Optimization
At $120,000 a month for EBS alone, storage was a major line item. We implemented a comprehensive storage strategy:
# Optimized Storage Classes
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3-optimized
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  iops: "3000"        # Baseline IOPS for gp3
  throughput: "125"   # Baseline throughput (MiB/s)
  encrypted: "true"
allowVolumeExpansion: true
volumeBindingMode: WaitForFirstConsumer
reclaimPolicy: Delete
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3-high-performance
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  iops: "16000"       # High IOPS for databases
  throughput: "1000"  # High throughput (MiB/s)
  encrypted: "true"
allowVolumeExpansion: true
volumeBindingMode: WaitForFirstConsumer
reclaimPolicy: Delete
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: efs-shared
provisioner: efs.csi.aws.com
parameters:
  provisioningMode: efs-ap
  fileSystemId: fs-0123456789abcdef0
  directoryPerms: "0755"
reclaimPolicy: Delete
Data Lifecycle Management
# Automated Data Lifecycle Management
import boto3
from datetime import datetime

class DataLifecycleManager:
    def __init__(self):
        self.s3 = boto3.client('s3')
        self.ec2 = boto3.client('ec2')

    def optimize_s3_storage(self, bucket_name):
        """Implement intelligent tiering for S3 data"""
        lifecycle_config = {
            'Rules': [
                {
                    'ID': 'transaction-logs-lifecycle',
                    'Status': 'Enabled',
                    'Filter': {'Prefix': 'transaction-logs/'},
                    'Transitions': [
                        {'Days': 30, 'StorageClass': 'STANDARD_IA'},
                        {'Days': 90, 'StorageClass': 'GLACIER'},
                        {'Days': 365, 'StorageClass': 'DEEP_ARCHIVE'}
                    ]
                },
                {
                    'ID': 'application-logs-lifecycle',
                    'Status': 'Enabled',
                    'Filter': {'Prefix': 'application-logs/'},
                    'Transitions': [
                        {'Days': 7, 'StorageClass': 'STANDARD_IA'},
                        {'Days': 30, 'StorageClass': 'GLACIER'}
                    ],
                    'Expiration': {'Days': 2555}  # 7 years for compliance
                }
            ]
        }

        self.s3.put_bucket_lifecycle_configuration(
            Bucket=bucket_name,
            LifecycleConfiguration=lifecycle_config
        )

    def cleanup_unused_ebs_volumes(self):
        """Identify and clean up unused EBS volumes"""
        unused_volumes = []
        volumes = self.ec2.describe_volumes()['Volumes']

        for volume in volumes:
            if volume['State'] == 'available':  # Not attached
                create_time = volume['CreateTime']
                days_old = (datetime.now(create_time.tzinfo) - create_time).days

                if days_old > 7:  # Unused for more than 7 days
                    unused_volumes.append({
                        'VolumeId': volume['VolumeId'],
                        'Size': volume['Size'],
                        'Cost': volume['Size'] * 0.10 * (days_old / 30),  # rough: ~$0.10 per GB-month since creation
                        'DaysOld': days_old
                    })

        return unused_volumes
Network Optimization
# Optimized VPC and Networking Configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: vpc-optimization-config
  namespace: kube-system
data:
  network-optimization.yaml: |
    # Use VPC endpoints to reduce data transfer costs
    vpc_endpoints:
      - service: s3
        type: gateway
        route_table_ids: ["rtb-12345", "rtb-67890"]
      - service: dynamodb
        type: gateway
        route_table_ids: ["rtb-12345", "rtb-67890"]
      - service: ecr.api
        type: interface
        subnet_ids: ["subnet-12345", "subnet-67890"]
        security_group_ids: ["sg-abcdef"]
      - service: ecr.dkr
        type: interface
        subnet_ids: ["subnet-12345", "subnet-67890"]
        security_group_ids: ["sg-abcdef"]

    # NAT Gateway optimization
    nat_gateways:
      strategy: "single_az"  # Use one NAT gateway instead of one per AZ
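The ConfigMap above documents intent rather than provisioning anything. For illustration, a hedged sketch of creating the S3 gateway endpoint directly with boto3 (region, VPC ID, and route table IDs are placeholders; in a real setup this belongs in Terraform alongside the rest of the VPC):

# Minimal sketch: provision an S3 gateway endpoint so S3 traffic bypasses the
# NAT gateway and its per-GB processing charges. IDs below are placeholders.
import boto3

ec2 = boto3.client('ec2', region_name='us-east-1')

def create_s3_gateway_endpoint(vpc_id, route_table_ids):
    """Create a gateway endpoint and return its ID."""
    response = ec2.create_vpc_endpoint(
        VpcId=vpc_id,
        ServiceName='com.amazonaws.us-east-1.s3',
        VpcEndpointType='Gateway',
        RouteTableIds=route_table_ids,
    )
    return response['VpcEndpoint']['VpcEndpointId']

# Example usage with placeholder IDs:
# endpoint_id = create_s3_gateway_endpoint('vpc-12345', ['rtb-12345', 'rtb-67890'])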
Architectural Optimization Results:
- Storage costs reduced by 70%
- Data transfer costs reduced by 80%
- Network optimization savings: $90,000/month
Advanced Cost Optimization Techniques
Spot Instance Strategy
# Mixed Instance Type Configuration
resource "aws_autoscaling_group" "eks_spot_nodes" {
  name                = "eks-spot-nodes"
  vpc_zone_identifier = var.private_subnet_ids
  target_group_arns   = []
  health_check_type   = "EC2"

  min_size         = 5
  max_size         = 100
  desired_capacity = 20

  mixed_instances_policy {
    launch_template {
      launch_template_specification {
        launch_template_id = aws_launch_template.eks_spot.id
        version            = "$Latest"
      }

      override {
        instance_type     = "m5.large"
        weighted_capacity = "1"
      }
      override {
        instance_type     = "m5.xlarge"
        weighted_capacity = "2"
      }
      override {
        instance_type     = "m5a.large"
        weighted_capacity = "1"
      }
      override {
        instance_type     = "m5a.xlarge"
        weighted_capacity = "2"
      }
    }

    instances_distribution {
      on_demand_allocation_strategy            = "prioritized"
      on_demand_base_capacity                  = 5
      on_demand_percentage_above_base_capacity = 20
      spot_allocation_strategy                 = "diversified"
      spot_instance_pools                      = 4
      spot_max_price                           = "0.10"
    }
  }

  tag {
    key                 = "kubernetes.io/cluster/eks-cluster"
    value               = "owned"
    propagate_at_launch = true
  }
}
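Running this much of the fleet on Spot only works if interruptions are handled gracefully. In practice this is usually delegated to the open-source aws-node-termination-handler rather than custom code; the sketch below just shows the underlying mechanism of watching the instance-metadata interruption notice and draining the node within the two-minute reclaim window (node name and drain flags are illustrative):

# Minimal sketch of spot-interruption handling. Assumes IMDSv1 is reachable;
# IMDSv2 would additionally require a session token header. In production,
# prefer aws-node-termination-handler over a hand-rolled watcher like this.
import subprocess
import time

import requests

INSTANCE_ACTION_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"

def interruption_pending():
    """The endpoint returns 404 until AWS schedules a stop/terminate action."""
    try:
        return requests.get(INSTANCE_ACTION_URL, timeout=1).status_code == 200
    except requests.RequestException:
        return False

def drain(node_name):
    # Illustrative: evict pods so they reschedule onto healthy capacity.
    subprocess.run(
        ["kubectl", "drain", node_name, "--ignore-daemonsets", "--delete-emptydir-data"],
        check=False,
    )

def watch(node_name, poll_seconds=5):
    while not interruption_pending():
        time.sleep(poll_seconds)
    drain(node_name)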
Reserved Instance Optimization
# RI Purchase Recommendation Engine
import boto3
from datetime import datetime, timedelta

class RIOptimizer:
    def __init__(self):
        self.ce = boto3.client('ce')  # Cost Explorer
        self.ec2 = boto3.client('ec2')

    def analyze_ri_opportunities(self, lookback_days=90):
        """Analyze usage patterns for RI opportunities"""
        end_date = datetime.now().strftime('%Y-%m-%d')
        start_date = (datetime.now() - timedelta(days=lookback_days)).strftime('%Y-%m-%d')

        # Confirm EC2 usage exists in the lookback window
        self.ce.get_dimension_values(
            TimePeriod={'Start': start_date, 'End': end_date},
            Dimension='SERVICE',
            Context='COST_AND_USAGE',
            SearchString='Amazon Elastic Compute Cloud'
        )

        # Analyze instance usage patterns
        # (get_detailed_usage returns a per-instance-type DataFrame; pricing helpers are omitted here)
        usage_data = self.get_detailed_usage(start_date, end_date)
        recommendations = self.generate_ri_recommendations(usage_data)

        return recommendations

    def generate_ri_recommendations(self, usage_data):
        """Generate RI purchase recommendations"""
        recommendations = []

        # Group by instance type and analyze usage
        for instance_type, data in usage_data.groupby('instance_type'):
            avg_usage = data['usage_hours'].mean()

            # Recommend RI if usage is consistent (>70% of the time)
            if avg_usage > (24 * 0.7):
                hourly_rate = self.get_on_demand_rate(instance_type)
                ri_rate = self.get_ri_rate(instance_type, term='1yr', payment='partial')
                annual_savings = (hourly_rate - ri_rate) * 24 * 365

                recommendations.append({
                    'instance_type': instance_type,
                    'recommended_quantity': int(avg_usage / 24),
                    'annual_savings': annual_savings,
                    'payback_period_months': self.calculate_payback_period(instance_type),
                    'confidence_score': min(avg_usage / (24 * 0.9), 1.0)
                })

        return sorted(recommendations, key=lambda x: x['annual_savings'], reverse=True)
Container Image Optimization
# Optimized Multi-Stage Build
FROM node:18-alpine AS dependencies
WORKDIR /app
COPY package*.json ./
RUN npm ci --only=production && npm cache clean --force

FROM node:18-alpine AS build
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build

FROM node:18-alpine AS runtime
RUN addgroup -g 1001 -S nodejs
RUN adduser -S nextjs -u 1001
WORKDIR /app
COPY --from=dependencies /app/node_modules ./node_modules
COPY --from=build /app/dist ./dist
COPY --from=build /app/package.json ./package.json
USER nextjs
EXPOSE 3000
ENV NODE_ENV=production
CMD ["npm", "start"]

# Final image size: 150MB (down from 1.2GB)
The Complete Results: 60% Cost Reduction
Monthly Cost Breakdown
Before Optimization:
- EC2 instances: $450,000
- EBS storage: $120,000
- Data transfer: $80,000
- Load balancers: $40,000
- Other services: $110,000
- Total: $800,000
After Optimization:
- EC2 instances: $180,000 (60% spot, 40% RI)
- EBS storage: $36,000 (GP3, lifecycle management)
- Data transfer: $16,000 (VPC endpoints, optimization)
- Load balancers: $25,000 (right-sized ALBs)
- Other services: $63,000
- Total: $320,000
Net Savings: $480,000/month (60% reduction)
Performance Improvements
Despite the massive cost reduction, we actually improved performance:
- Response time: 250ms → 180ms (28% improvement)
- Throughput: 15,000 TPS → 22,000 TPS (47% improvement)
- Availability: 99.9% → 99.95% uptime
- Deployment speed: 15 minutes → 6 minutes
Business Impact
The cost optimization project delivered significant business value:
- Annual savings: $5.7 million
- ROI on optimization: 2,400%
- Freed up budget: Enabled expansion into 3 new markets
- Team confidence: Developers no longer afraid to innovate
- Competitive advantage: Lower transaction processing costs
Key Lessons and Best Practices
1. Start with Visibility
You can't optimize what you can't see. Implement comprehensive cost monitoring before making changes.
2. Culture Change is Critical
Technical optimization alone isn't enough. Teams need to understand and care about costs.
3. Automate Everything
Manual cost optimization doesn't scale. Build automation into your processes.
4. Performance and Cost Can Improve Together
With the right approach, cost optimization often leads to better performance.
5. Continuous Optimization
Cost optimization is not a one-time project but an ongoing discipline.
Tools and Technologies Used
Cost Monitoring and Analysis
- AWS Cost Explorer and Budgets
- Kubernetes Resource Recommender
- Kubecost for granular Kubernetes cost analysis
- Custom Prometheus metrics for real-time monitoring
Automation and Optimization
- Terraform for infrastructure as code
- Kubernetes VPA and HPA
- Custom Python scripts for analysis and automation
- AWS Spot Fleet and Auto Scaling Groups
Performance Monitoring
- Prometheus and Grafana
- AWS CloudWatch
- Application Performance Monitoring (APM)
- Custom business metrics dashboards
Future Cost Optimization Initiatives
AI-Powered Cost Optimization
We're developing AI models that can:
- Predict optimal instance types for workloads
- Automatically negotiate spot instance bids
- Optimize resource allocation based on business patterns
Serverless Migration
Moving appropriate workloads to serverless architectures:
- AWS Lambda for event-driven processing
- Amazon ECS Fargate for containerized batch jobs
- API Gateway for lightweight API endpoints
Multi-Cloud Cost Arbitrage
Evaluating opportunities to:
- Use different clouds for different workload types
- Take advantage of regional pricing differences
- Implement cross-cloud disaster recovery
Conclusion: The Strategic Value of Cost Optimization
This project proved that cost optimization is not just about saving money—it's about enabling business growth and innovation. By reducing our client's monthly AWS costs by 60% while improving performance, we:
- Freed up $5.7 million annually for business investment
- Improved system reliability and performance
- Built a culture of cost consciousness across engineering teams
- Established processes for ongoing optimization
The techniques we used are applicable to any organization running significant EKS workloads. The key is taking a systematic, data-driven approach that combines technical optimization with cultural change.
Most importantly, we demonstrated that in the cloud, performance and cost efficiency go hand in hand. When you optimize for efficiency, you often end up with better-performing, more reliable systems.
Ready to optimize your EKS costs without sacrificing performance? Contact STAQI Technologies to learn how our proven methodology can deliver similar results for your organization.