Cost Optimization · 11 min read

The Million-Dollar Question: How We Cut EKS Costs by 60% Without Sacrificing Performance

A detailed case study of how we reduced Amazon EKS infrastructure costs by 60% while improving performance and reliability for a major financial services client.

STAQI Technologies Team

February 26, 2024

EKS · Cost Optimization · FinOps · AWS · Kubernetes

When the CFO of a major financial services company asked us to "make the cloud bill reasonable" while maintaining their strict performance and compliance requirements, we knew we had a challenge ahead. Their monthly AWS bill had grown from $50,000 to $800,000 in just two years as they scaled their EKS-based payment processing platform. The mandate was clear: cut costs significantly without impacting the business.

Six months later, we had reduced their monthly AWS costs by 60% while actually improving system performance and reliability. This is the story of how we did it.

The Cost Crisis: When Success Becomes Expensive

Our client, a payment processor handling 20 million transactions daily, had experienced explosive growth. Their EKS infrastructure had grown organically, with each team adding resources as needed. While this approach supported rapid business growth, it led to significant inefficiencies:

The Starting Point:

  • Monthly AWS costs: $800,000
  • EKS clusters: 15 (across dev, staging, prod)
  • EC2 instances: 400+ (mix of on-demand and reserved)
  • Average CPU utilization: 12%
  • Average memory utilization: 18%
  • Persistent storage: 2.5 PB (mostly unoptimized)
  • Data transfer costs: $80,000/month

Business Impact:

  • Infrastructure costs growing faster than revenue
  • Development teams afraid to deploy due to cost concerns
  • CFO questioning the cloud strategy
  • Board pressure to control spending

The STAQI Approach: FinOps Meets Engineering Excellence

We developed a comprehensive cost optimization strategy based on four pillars:

1. Visibility and Accountability

2. Right-Sizing and Efficiency

3. Intelligent Scaling

4. Architectural Optimization

Let's dive into each pillar and the specific techniques we used.

Pillar 1: Visibility and Accountability

Implementing Cost Observability

The first step was understanding where money was being spent. We implemented comprehensive cost tracking:

# Cost Allocation Tags for All Resources
apiVersion: v1
kind: ConfigMap
metadata:
  name: cost-allocation-config
  namespace: kube-system
data:
  required-tags.yaml: |
    mandatory_tags:
      - cost-center
      - environment
      - application
      - team
      - business-unit
    auto_tagging_rules:
      - resource_type: "aws:ec2:instance"
        tags:
          managed-by: "kubernetes"
          cluster-name: "${CLUSTER_NAME}"
      - resource_type: "aws:ebs:volume"
        tags:
          managed-by: "kubernetes"
          cluster-name: "${CLUSTER_NAME}"

Custom Cost Dashboards

We built real-time cost dashboards that made spending visible to every team:

{ "dashboard": { "title": "EKS Cost Optimization Dashboard", "panels": [ { "title": "Daily Costs by Team", "type": "graph", "targets": [ { "expr": "aws_billing_estimated_charges{currency=\"USD\"} by (team)", "legendFormat": "{{team}}" } ] }, { "title": "Cost per Transaction", "type": "stat", "targets": [ { "expr": "aws_billing_estimated_charges / payment_transactions_total", "legendFormat": "Cost per Transaction" } ], "thresholds": [ {"color": "green", "value": 0}, {"color": "yellow", "value": 0.02}, {"color": "red", "value": 0.05} ] }, { "title": "Resource Utilization vs Cost", "type": "scatter", "targets": [ { "expr": "node_cpu_utilization", "refId": "A" }, { "expr": "node_cost_per_hour", "refId": "B" } ] } ] } }

Team-Level Cost Budgets and Alerts

# Cost Budget Alerts
apiVersion: v1
kind: ConfigMap
metadata:
  name: team-budgets
  namespace: cost-management
data:
  budgets.yaml: |
    teams:
      payments-team:
        monthly_budget: 50000
        alert_thresholds: [60, 80, 90]
        contact: "payments-team@company.com"
      fraud-detection:
        monthly_budget: 30000
        alert_thresholds: [70, 85, 95]
        contact: "fraud-team@company.com"
      api-gateway:
        monthly_budget: 25000
        alert_thresholds: [65, 80, 90]
        contact: "api-team@company.com"
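
The ConfigMap only declares the budgets; the alerts themselves came from our monitoring stack. As a sketch of the pattern, assuming the Prometheus Operator and a billing exporter that labels spend by team (the same aws_billing_estimated_charges metric the dashboard above queries), an 80% threshold alert for the payments team might look like this:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: team-budget-alerts
  namespace: cost-management
spec:
  groups:
    - name: team-budgets
      rules:
        - alert: PaymentsTeamBudget80Percent
          # 80% of the payments-team $50,000 monthly budget (illustrative rule)
          expr: sum(aws_billing_estimated_charges{team="payments-team"}) > 40000
          for: 1h
          labels:
            severity: warning
          annotations:
            summary: "payments-team has consumed 80% of its monthly budget"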

Results from Visibility:

  • Teams became cost-conscious overnight
  • 40% reduction in dev/staging resource usage
  • Identification of $120,000 in unused resources

Pillar 2: Right-Sizing and Efficiency

Intelligent Resource Right-Sizing

Most workloads were dramatically over-provisioned. We implemented automated right-sizing:

# Automated Right-Sizing Recommendations
import boto3
import numpy as np
from datetime import datetime, timedelta

class EKSRightSizer:
    def __init__(self):
        self.cloudwatch = boto3.client('cloudwatch')
        self.eks = boto3.client('eks')

    def get_metric_statistics(self, namespace, metric_name, start_time, end_time):
        """Fetch hourly averages for a CloudWatch metric."""
        response = self.cloudwatch.get_metric_statistics(
            Namespace=namespace,
            MetricName=metric_name,
            StartTime=start_time,
            EndTime=end_time,
            Period=3600,
            Statistics=['Average']
        )
        return response['Datapoints']

    def analyze_workload_patterns(self, namespace, days=30):
        """Analyze workload patterns to recommend optimal resource allocation."""
        end_time = datetime.utcnow()
        start_time = end_time - timedelta(days=days)

        # Get CPU and memory metrics
        cpu_metrics = self.get_metric_statistics(
            namespace=f"AWS/EKS/{namespace}",
            metric_name="pod_cpu_utilization_over_pod_limit",
            start_time=start_time,
            end_time=end_time
        )
        memory_metrics = self.get_metric_statistics(
            namespace=f"AWS/EKS/{namespace}",
            metric_name="pod_memory_utilization_over_pod_limit",
            start_time=start_time,
            end_time=end_time
        )

        # Analyze patterns
        cpu_analysis = self.analyze_resource_usage(cpu_metrics)
        memory_analysis = self.analyze_resource_usage(memory_metrics)
        return self.generate_recommendations(cpu_analysis, memory_analysis)

    def analyze_resource_usage(self, metrics):
        """Analyze resource usage patterns."""
        if not metrics:
            return None
        values = [point['Average'] for point in metrics]
        return {
            'p50': np.percentile(values, 50),
            'p95': np.percentile(values, 95),
            'p99': np.percentile(values, 99),
            'max': max(values),
            'avg': np.mean(values),
            'std': np.std(values)
        }

    def generate_recommendations(self, cpu_analysis, memory_analysis):
        """Generate right-sizing recommendations."""
        recommendations = []

        if cpu_analysis:
            # Recommend CPU based on P95 + 20% buffer
            recommended_cpu = cpu_analysis['p95'] * 1.2
            if cpu_analysis['avg'] < 0.3:  # Under 30% average utilization
                recommendations.append({
                    'type': 'cpu_reduction',
                    'current_avg': cpu_analysis['avg'],
                    'recommended_limit': recommended_cpu,
                    'potential_savings': self.calculate_cpu_savings(recommended_cpu)
                })

        if memory_analysis:
            # Recommend memory based on P99 + 30% buffer
            recommended_memory = memory_analysis['p99'] * 1.3
            if memory_analysis['avg'] < 0.4:  # Under 40% average utilization
                recommendations.append({
                    'type': 'memory_reduction',
                    'current_avg': memory_analysis['avg'],
                    'recommended_limit': recommended_memory,
                    'potential_savings': self.calculate_memory_savings(recommended_memory)
                })

        return recommendations

    # calculate_cpu_savings / calculate_memory_savings map recommended limits
    # to instance pricing; their implementations are omitted here.

Vertical Pod Autoscaler (VPA) Implementation

# VPA for Automatic Right-Sizing
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: payment-processor-vpa
  namespace: payments
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payment-processor
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
      - containerName: payment-processor
        maxAllowed:
          cpu: "2"
          memory: "4Gi"
        minAllowed:
          cpu: "100m"
          memory: "128Mi"
        controlledResources: ["cpu", "memory"]
        controlledValues: RequestsAndLimits
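
One caveat worth noting: in Auto mode the VPA applies new requests by evicting and recreating pods. A workload like this would typically be protected with a PodDisruptionBudget so evictions never take down too many replicas at once — a minimal sketch, assuming the pods carry an app: payment-processor label:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: payment-processor-pdb
  namespace: payments
spec:
  # Keep at least 80% of replicas running during voluntary evictions
  minAvailable: 80%
  selector:
    matchLabels:
      app: payment-processor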

Node Pool Optimization

We restructured the entire node architecture:

# Optimized Node Groups for Different Workload Types
module "compute_optimized_nodes" {
  source = "./modules/eks-node-group"

  cluster_name    = var.cluster_name
  node_group_name = "compute-optimized"
  instance_types  = ["c5.large", "c5.xlarge", "c5.2xlarge"]
  capacity_type   = "SPOT"  # ~70% cost reduction for fault-tolerant workloads

  scaling_config = {
    desired_size = 10
    max_size     = 50
    min_size     = 5
  }

  labels = {
    workload-type = "compute-intensive"
    cost-profile  = "optimized"
  }

  taints = [{
    key    = "workload-type"
    value  = "compute-intensive"
    effect = "NO_SCHEDULE"
  }]
}

module "memory_optimized_nodes" {
  source = "./modules/eks-node-group"

  cluster_name    = var.cluster_name
  node_group_name = "memory-optimized"
  instance_types  = ["r5.large", "r5.xlarge"]
  capacity_type   = "ON_DEMAND"  # For memory-sensitive workloads

  scaling_config = {
    desired_size = 5
    max_size     = 20
    min_size     = 2
  }

  labels = {
    workload-type = "memory-intensive"
    cost-profile  = "performance"
  }
}

module "general_purpose_spot" {
  source = "./modules/eks-node-group"

  cluster_name    = var.cluster_name
  node_group_name = "general-spot"
  instance_types  = ["m5.large", "m5.xlarge", "m5a.large", "m5a.xlarge"]
  capacity_type   = "SPOT"

  scaling_config = {
    desired_size = 15
    max_size     = 100
    min_size     = 5
  }

  labels = {
    workload-type = "general"
    cost-profile  = "spot"
  }
}
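
Note that the taint on the compute-optimized group only keeps other workloads out; pods meant to run there still need a matching toleration and node selector. A minimal pod-template fragment, assuming the node labels defined in the module above:

# Fragment of a Deployment's pod template (.spec.template.spec)
nodeSelector:
  workload-type: compute-intensive
tolerations:
  - key: workload-type
    operator: Equal
    value: compute-intensive
    effect: NoSchedule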

Right-Sizing Results:

  • Average CPU utilization: 12% → 65%
  • Average memory utilization: 18% → 70%
  • Node count reduction: 400 → 180 instances
  • Monthly savings: $180,000

Pillar 3: Intelligent Scaling

Predictive Auto-Scaling

We implemented predictive scaling based on business patterns:

# Predictive Scaling Based on Transaction Patterns
from datetime import datetime, timedelta
from sklearn.ensemble import RandomForestRegressor

class PredictiveScaler:
    def __init__(self):
        self.model = RandomForestRegressor(n_estimators=100, random_state=42)
        self.trained = False

    def train_model(self, historical_data):
        """Train the model on historical transaction and resource data."""
        # Features: hour, day_of_week, day_of_month, is_weekend, is_holiday,
        # transaction volume, and average response time
        features = []
        targets = []
        for record in historical_data:
            dt = datetime.fromisoformat(record['timestamp'])
            feature_vector = [
                dt.hour,
                dt.weekday(),
                dt.day,
                1 if dt.weekday() >= 5 else 0,    # weekend
                1 if self.is_holiday(dt) else 0,  # holiday
                record['transaction_volume'],
                record['avg_response_time']
            ]
            features.append(feature_vector)
            targets.append(record['required_replicas'])

        self.model.fit(features, targets)
        self.trained = True

    def predict_scaling_needs(self, forecast_hours=24):
        """Predict scaling needs for the next N hours."""
        if not self.trained:
            raise ValueError("Model must be trained first")

        predictions = []
        current_time = datetime.utcnow()
        for hour_offset in range(forecast_hours):
            future_time = current_time + timedelta(hours=hour_offset)
            # Get the predicted transaction volume for this time
            predicted_volume = self.predict_transaction_volume(future_time)
            feature_vector = [
                future_time.hour,
                future_time.weekday(),
                future_time.day,
                1 if future_time.weekday() >= 5 else 0,
                1 if self.is_holiday(future_time) else 0,
                predicted_volume,
                100  # target response time (ms)
            ]
            predicted_replicas = self.model.predict([feature_vector])[0]
            predictions.append({
                'timestamp': future_time.isoformat(),
                'predicted_volume': predicted_volume,
                'recommended_replicas': max(1, int(predicted_replicas))
            })
        return predictions

    def generate_scaling_schedule(self, predictions):
        """Generate an HPA scaling schedule from the predictions."""
        schedule = []
        for pred in predictions:
            schedule.append({
                'time': pred['timestamp'],
                'action': 'scale',
                'replicas': pred['recommended_replicas'],
                'reason': f"Predicted volume: {pred['predicted_volume']} TPS"
            })
        return schedule

    # is_holiday and predict_transaction_volume are domain-specific helpers
    # (holiday calendar lookup and a volume forecast model); their
    # implementations are omitted here.
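
The schedule the scaler emits still has to be applied to the cluster. One way to act on it — a sketch, assuming KEDA is installed and owns the HPA for this Deployment — is to render each predicted peak window into a cron trigger (window and replica count below are illustrative):

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: payment-processor-schedule
  namespace: payments
spec:
  scaleTargetRef:
    name: payment-processor
  minReplicaCount: 5
  triggers:
    - type: cron
      metadata:
        timezone: America/New_York
        # Pre-scale for the predicted weekday business peak
        start: 0 8 * * 1-5
        end: 0 18 * * 1-5
        desiredReplicas: "40"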

Custom Metrics Auto-Scaling

# HPA with Custom Business Metrics
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: payment-processor-hpa
  namespace: payments
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payment-processor
  minReplicas: 5
  maxReplicas: 200
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
    - type: Pods
      pods:
        metric:
          name: transactions_per_second_per_pod
        target:
          type: AverageValue
          averageValue: "50"
    - type: Object
      object:
        metric:
          name: queue_depth
        describedObject:
          apiVersion: v1
          kind: Service
          name: payment-queue
        target:
          type: Value
          value: "30"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Percent
          value: 100
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 50
          periodSeconds: 180
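
One detail the HPA manifest hides: transactions_per_second_per_pod is not a built-in metric and has to be served through the custom metrics API. A sketch of the corresponding prometheus-adapter rule, assuming the payment_transactions_total counter already used in the dashboards above (adapter deployment itself omitted):

# prometheus-adapter rule (fragment of its config.yaml)
rules:
  - seriesQuery: 'payment_transactions_total{namespace!="",pod!=""}'
    resources:
      overrides:
        namespace: {resource: "namespace"}
        pod: {resource: "pod"}
    name:
      matches: "payment_transactions_total"
      as: "transactions_per_second_per_pod"
    metricsQuery: 'rate(payment_transactions_total{<<.LabelMatchers>>}[2m])'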

Cluster Auto-Scaling Optimization

# Optimized Cluster Autoscaler Configuration
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system
spec:
  template:
    spec:
      containers:
        - image: k8s.gcr.io/autoscaling/cluster-autoscaler:v1.21.0
          name: cluster-autoscaler
          command:
            - ./cluster-autoscaler
            - --v=4
            - --stderrthreshold=info
            - --cloud-provider=aws
            - --skip-nodes-with-local-storage=false
            - --expander=least-waste
            - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/eks-cluster-name
            - --balance-similar-node-groups
            - --scale-down-enabled=true
            - --scale-down-delay-after-add=2m
            - --scale-down-unneeded-time=3m
            - --scale-down-utilization-threshold=0.5
            - --skip-nodes-with-system-pods=false
            - --max-node-provision-time=15m

Scaling Optimization Results:

  • 45% reduction in over-provisioning
  • 30% faster scale-up response
  • 50% reduction in idle nodes
  • Monthly savings: $150,000

Pillar 4: Architectural Optimization

Storage Optimization

Storage costs were a major component of the bill. We implemented a comprehensive storage strategy:

# Optimized Storage Classes
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3-optimized
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  iops: "3000"        # Baseline IOPS for gp3
  throughput: "125"   # Baseline throughput (MiB/s)
  encrypted: "true"
allowVolumeExpansion: true
volumeBindingMode: WaitForFirstConsumer
reclaimPolicy: Delete
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3-high-performance
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  iops: "16000"       # High IOPS for databases
  throughput: "1000"  # High throughput (MiB/s)
  encrypted: "true"
allowVolumeExpansion: true
volumeBindingMode: WaitForFirstConsumer
reclaimPolicy: Delete
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: efs-shared
provisioner: efs.csi.aws.com
parameters:
  provisioningMode: efs-ap
  fileSystemId: fs-0123456789abcdef0
  directoryPerms: "0755"
reclaimPolicy: Delete
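
Workloads then opt into a class through their PersistentVolumeClaims. For example, a claim bound to the baseline gp3 class might look like this (name and size are illustrative):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: transaction-data
  namespace: payments
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: gp3-optimized  # defined above
  resources:
    requests:
      storage: 100Gi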

Data Lifecycle Management

# Automated Data Lifecycle Management
import boto3
from datetime import datetime

class DataLifecycleManager:
    def __init__(self):
        self.s3 = boto3.client('s3')
        self.ec2 = boto3.client('ec2')

    def optimize_s3_storage(self, bucket_name):
        """Implement intelligent tiering for S3 data."""
        lifecycle_config = {
            'Rules': [
                {
                    'ID': 'transaction-logs-lifecycle',
                    'Status': 'Enabled',
                    'Filter': {'Prefix': 'transaction-logs/'},
                    'Transitions': [
                        {'Days': 30, 'StorageClass': 'STANDARD_IA'},
                        {'Days': 90, 'StorageClass': 'GLACIER'},
                        {'Days': 365, 'StorageClass': 'DEEP_ARCHIVE'}
                    ]
                },
                {
                    'ID': 'application-logs-lifecycle',
                    'Status': 'Enabled',
                    'Filter': {'Prefix': 'application-logs/'},
                    'Transitions': [
                        {'Days': 7, 'StorageClass': 'STANDARD_IA'},
                        {'Days': 30, 'StorageClass': 'GLACIER'}
                    ],
                    'Expiration': {'Days': 2555}  # 7 years for compliance
                }
            ]
        }
        self.s3.put_bucket_lifecycle_configuration(
            Bucket=bucket_name,
            LifecycleConfiguration=lifecycle_config
        )

    def cleanup_unused_ebs_volumes(self):
        """Identify unused EBS volumes that are candidates for deletion."""
        unused_volumes = []
        volumes = self.ec2.describe_volumes()['Volumes']
        for volume in volumes:
            if volume['State'] == 'available':  # Not attached to any instance
                create_time = volume['CreateTime']
                days_old = (datetime.now(create_time.tzinfo) - create_time).days
                if days_old > 7:  # Unused for more than 7 days
                    unused_volumes.append({
                        'VolumeId': volume['VolumeId'],
                        'Size': volume['Size'],
                        'Cost': volume['Size'] * 0.1 * days_old,  # Rough cost estimate
                        'DaysOld': days_old
                    })
        return unused_volumes

Network Optimization

# Optimized VPC and Networking Configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: vpc-optimization-config
  namespace: kube-system
data:
  network-optimization.yaml: |
    # Use VPC endpoints to reduce data transfer costs
    vpc_endpoints:
      - service: s3
        type: gateway
        route_table_ids: ["rtb-12345", "rtb-67890"]
      - service: dynamodb
        type: gateway
        route_table_ids: ["rtb-12345", "rtb-67890"]
      - service: ecr.api
        type: interface
        subnet_ids: ["subnet-12345", "subnet-67890"]
        security_group_ids: ["sg-abcdef"]
      - service: ecr.dkr
        type: interface
        subnet_ids: ["subnet-12345", "subnet-67890"]
        security_group_ids: ["sg-abcdef"]

    # NAT gateway optimization
    nat_gateways:
      strategy: "single_az"  # One shared NAT gateway instead of one per AZ

Architectural Optimization Results:

  • Storage costs reduced by 70%
  • Data transfer costs reduced by 80%
  • Network optimization savings: $90,000/month

Advanced Cost Optimization Techniques

Spot Instance Strategy

# Mixed Instance Type Configuration
resource "aws_autoscaling_group" "eks_spot_nodes" {
  name                = "eks-spot-nodes"
  vpc_zone_identifier = var.private_subnet_ids
  target_group_arns   = []
  health_check_type   = "EC2"

  min_size         = 5
  max_size         = 100
  desired_capacity = 20

  mixed_instances_policy {
    launch_template {
      launch_template_specification {
        launch_template_id = aws_launch_template.eks_spot.id
        version            = "$Latest"
      }

      override {
        instance_type     = "m5.large"
        weighted_capacity = "1"
      }
      override {
        instance_type     = "m5.xlarge"
        weighted_capacity = "2"
      }
      override {
        instance_type     = "m5a.large"
        weighted_capacity = "1"
      }
      override {
        instance_type     = "m5a.xlarge"
        weighted_capacity = "2"
      }
    }

    instances_distribution {
      on_demand_allocation_strategy            = "prioritized"
      on_demand_base_capacity                  = 5
      on_demand_percentage_above_base_capacity = 20
      spot_allocation_strategy                 = "diversified"
      spot_instance_pools                      = 4
      spot_max_price                           = "0.10"
    }
  }

  tag {
    key                 = "kubernetes.io/cluster/eks-cluster"
    value               = "owned"
    propagate_at_launch = true
  }
}

Reserved Instance Optimization

# RI Purchase Recommendation Engine
import boto3
from datetime import datetime, timedelta

class RIOptimizer:
    def __init__(self):
        self.ce = boto3.client('ce')  # Cost Explorer
        self.ec2 = boto3.client('ec2')

    def analyze_ri_opportunities(self, lookback_days=90):
        """Analyze usage patterns for RI opportunities."""
        end_date = datetime.now().strftime('%Y-%m-%d')
        start_date = (datetime.now() - timedelta(days=lookback_days)).strftime('%Y-%m-%d')

        # Confirm EC2 usage exists in the lookback window
        self.ce.get_dimension_values(
            TimePeriod={'Start': start_date, 'End': end_date},
            Dimension='SERVICE',
            Context='COST_AND_USAGE',
            SearchString='Amazon Elastic Compute Cloud'
        )

        # Analyze instance usage patterns
        usage_data = self.get_detailed_usage(start_date, end_date)
        return self.generate_ri_recommendations(usage_data)

    def generate_ri_recommendations(self, usage_data):
        """Generate RI purchase recommendations from a usage DataFrame."""
        recommendations = []

        # Group by instance type and analyze usage
        for instance_type, data in usage_data.groupby('instance_type'):
            avg_usage = data['usage_hours'].mean()

            # Recommend an RI if the instance type runs more than 70% of the time
            if avg_usage > (24 * 0.7):
                hourly_rate = self.get_on_demand_rate(instance_type)
                ri_rate = self.get_ri_rate(instance_type, term='1yr', payment='partial')
                annual_savings = (hourly_rate - ri_rate) * 24 * 365

                recommendations.append({
                    'instance_type': instance_type,
                    'recommended_quantity': int(avg_usage / 24),
                    'annual_savings': annual_savings,
                    'payback_period_months': self.calculate_payback_period(instance_type),
                    'confidence_score': min(avg_usage / (24 * 0.9), 1.0)
                })

        return sorted(recommendations, key=lambda x: x['annual_savings'], reverse=True)

    # get_detailed_usage (returns a pandas DataFrame), get_on_demand_rate,
    # get_ri_rate, and calculate_payback_period wrap Cost Explorer and the
    # AWS Pricing API; their implementations are omitted here.

Container Image Optimization

# Optimized Multi-Stage Build
FROM node:18-alpine AS dependencies
WORKDIR /app
COPY package*.json ./
RUN npm ci --only=production && npm cache clean --force

FROM node:18-alpine AS build
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build

FROM node:18-alpine AS runtime
RUN addgroup -g 1001 -S nodejs
RUN adduser -S nextjs -u 1001
WORKDIR /app
COPY --from=dependencies /app/node_modules ./node_modules
COPY --from=build /app/dist ./dist
COPY --from=build /app/package.json ./package.json
USER nextjs
EXPOSE 3000
ENV NODE_ENV=production
CMD ["npm", "start"]

# Final image size: 150MB (down from 1.2GB)

The Complete Results: 60% Cost Reduction

Monthly Cost Breakdown

Before Optimization:

  • EC2 instances: $450,000
  • EBS storage: $120,000
  • Data transfer: $80,000
  • Load balancers: $40,000
  • Other services: $110,000
  • Total: $800,000

After Optimization:

  • EC2 instances: $180,000 (60% spot, 40% RI)
  • EBS storage: $36,000 (GP3, lifecycle management)
  • Data transfer: $16,000 (VPC endpoints, optimization)
  • Load balancers: $25,000 (rightsized ALBs)
  • Other services: $63,000
  • Total: $320,000

Net Savings: $480,000/month (60% reduction)

Performance Improvements

Despite the massive cost reduction, we actually improved performance:

  • Response time: 250ms → 180ms (28% improvement)
  • Throughput: 15,000 TPS → 22,000 TPS (47% improvement)
  • Availability: 99.9% → 99.95% uptime
  • Deployment speed: 15 minutes → 6 minutes

Business Impact

The cost optimization project delivered significant business value:

  • Annual savings: $5.7 million
  • ROI on optimization: 2,400%
  • Freed up budget: Enabled expansion into 3 new markets
  • Team confidence: Developers no longer afraid to innovate
  • Competitive advantage: Lower transaction processing costs

Key Lessons and Best Practices

1. Start with Visibility

You can't optimize what you can't see. Implement comprehensive cost monitoring before making changes.

2. Culture Change is Critical

Technical optimization alone isn't enough. Teams need to understand and care about costs.

3. Automate Everything

Manual cost optimization doesn't scale. Build automation into your processes.
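
As one concrete example of what that automation can look like, the unused-EBS-volume report from the DataLifecycleManager shown earlier can run as a scheduled job rather than a human task. A sketch, with a placeholder image and entrypoint, and an assumed IRSA service account granting ec2:DescribeVolumes:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: ebs-usage-report
  namespace: cost-management
spec:
  schedule: "0 6 * * 1"  # every Monday at 06:00 UTC
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: cost-tooling  # assumed IRSA role with ec2:DescribeVolumes
          restartPolicy: OnFailure
          containers:
            - name: lifecycle-report
              image: registry.example.com/cost-tooling:latest  # placeholder image
              # Hypothetical entrypoint wrapping cleanup_unused_ebs_volumes()
              command: ["python", "-m", "cost_tooling.ebs_report"]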

4. Performance and Cost Can Improve Together

With the right approach, cost optimization often leads to better performance.

5. Continuous Optimization

Cost optimization is not a one-time project but an ongoing discipline.

Tools and Technologies Used

Cost Monitoring and Analysis

  • AWS Cost Explorer and Budgets
  • Kubernetes Resource Recommender
  • Kubecost for granular Kubernetes cost analysis
  • Custom Prometheus metrics for real-time monitoring

Automation and Optimization

  • Terraform for infrastructure as code
  • Kubernetes VPA and HPA
  • Custom Python scripts for analysis and automation
  • AWS Spot Fleet and Auto Scaling Groups

Performance Monitoring

  • Prometheus and Grafana
  • AWS CloudWatch
  • Application Performance Monitoring (APM)
  • Custom business metrics dashboards

Future Cost Optimization Initiatives

AI-Powered Cost Optimization

We're developing AI models that can:

  • Predict optimal instance types for workloads
  • Automatically negotiate spot instance bids
  • Optimize resource allocation based on business patterns

Serverless Migration

Moving appropriate workloads to serverless architectures:

  • AWS Lambda for event-driven processing
  • Amazon ECS Fargate for containerized batch jobs
  • API Gateway for lightweight API endpoints

Multi-Cloud Cost Arbitrage

Evaluating opportunities to:

  • Use different clouds for different workload types
  • Take advantage of regional pricing differences
  • Implement cross-cloud disaster recovery

Conclusion: The Strategic Value of Cost Optimization

This project proved that cost optimization is not just about saving money—it's about enabling business growth and innovation. By reducing our client's monthly AWS costs by 60% while improving performance, we:

  • Freed up $5.7 million annually for business investment
  • Improved system reliability and performance
  • Built a culture of cost consciousness across engineering teams
  • Established processes for ongoing optimization

The techniques we used are applicable to any organization running significant EKS workloads. The key is taking a systematic, data-driven approach that combines technical optimization with cultural change.

Most importantly, we demonstrated that in the cloud, performance and cost efficiency go hand in hand. When you optimize for efficiency, you often end up with better-performing, more reliable systems.

Ready to optimize your EKS costs without sacrificing performance? Contact STAQI Technologies to learn how our proven methodology can deliver similar results for your organization.
