EKS Cost Optimization: Advanced Techniques for Reducing Cloud Spend
Comprehensive strategies for optimizing EKS costs including spot instances, resource right-sizing, scheduling optimizations, and automated cost management practices that can reduce expenses by 40-60%.
STAQI Technologies
March 5, 2024
Cloud costs can quickly spiral out of control without proper optimization strategies. This comprehensive guide explores advanced EKS cost optimization techniques that have helped organizations reduce their Kubernetes spending by 40-60% while maintaining performance and reliability.
Introduction
EKS cost optimization requires a multi-faceted approach addressing compute costs, storage optimization, network efficiency, and operational overhead. This guide provides actionable strategies for implementing comprehensive cost controls across your EKS infrastructure.
Cost Analysis Foundation
Understanding EKS Cost Components
Break down EKS costs into manageable components:
# cost-analysis.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: cost-breakdown
data:
  compute_costs: "60-70% - EC2 instances, Fargate pods"
  storage_costs: "15-20% - EBS volumes, EFS storage"
  network_costs: "10-15% - Data transfer, NAT gateways"
  control_plane: "5-10% - EKS cluster management fee"
  additional_services: "5-10% - Load balancers, monitoring"
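The percentages above are typical starting points rather than measured values. To see how your own account actually breaks down, a minimal Cost Explorer sketch (assumes boto3 credentials with ce:GetCostAndUsage permission; Cost Explorer requires the end date to be after the start date, so run it after the first of the month):

# cost-breakdown.py -- hedged sketch; service names come back as Cost Explorer reports them
import boto3
from datetime import date

def monthly_cost_breakdown():
    ce = boto3.client("ce")
    today = date.today()
    start = today.replace(day=1).strftime("%Y-%m-%d")
    end = today.strftime("%Y-%m-%d")

    response = ce.get_cost_and_usage(
        TimePeriod={"Start": start, "End": end},
        Granularity="MONTHLY",
        Metrics=["UnblendedCost"],
        GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}],
    )
    groups = response["ResultsByTime"][0]["Groups"]
    costs = {g["Keys"][0]: float(g["Metrics"]["UnblendedCost"]["Amount"]) for g in groups}
    total = sum(costs.values()) or 1.0

    # Print each service's share of month-to-date spend, largest first
    for service, amount in sorted(costs.items(), key=lambda kv: -kv[1]):
        print(f"{service}: ${amount:,.2f} ({100 * amount / total:.1f}%)")

if __name__ == "__main__":
    monthly_cost_breakdown()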
Cost Monitoring Infrastructure
Implement comprehensive cost tracking:
# cost-monitoring.tf
resource "aws_ce_cost_category" "eks_costs" {
  name         = "EKS-Environment-Costs"
  rule_version = "CostCategoryExpression.v1"

  rule {
    value = "Production"
    rule {
      and {
        dimension {
          key           = "SERVICE"
          values        = ["Amazon Elastic Container Service for Kubernetes"]
          match_options = ["EQUALS"]
        }
      }
      and {
        tags {
          key           = "Environment"
          values        = ["production"]
          match_options = ["EQUALS"]
        }
      }
    }
  }

  rule {
    value = "Development"
    rule {
      and {
        dimension {
          key           = "SERVICE"
          values        = ["Amazon Elastic Container Service for Kubernetes"]
          match_options = ["EQUALS"]
        }
      }
      and {
        tags {
          key           = "Environment"
          values        = ["development", "staging"]
          match_options = ["EQUALS"]
        }
      }
    }
  }
}

resource "aws_budgets_budget" "eks_budget" {
  name         = "EKS-Monthly-Budget"
  budget_type  = "COST"
  limit_amount = "5000"
  limit_unit   = "USD"
  time_unit    = "MONTHLY"

  cost_filter {
    name   = "Service"
    values = ["Amazon Elastic Container Service for Kubernetes"]
  }

  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 80
    threshold_type             = "PERCENTAGE"
    notification_type          = "ACTUAL"
    subscriber_email_addresses = ["devops@company.com"]
  }

  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 100
    threshold_type             = "PERCENTAGE"
    notification_type          = "FORECASTED"
    subscriber_email_addresses = ["devops@company.com", "finance@company.com"]
  }
}
Compute Cost Optimization
Spot Instance Strategy
Implement aggressive spot instance usage for significant cost savings:
# spot-instances.tf
resource "aws_eks_node_group" "spot_optimized" {
  cluster_name    = aws_eks_cluster.main.name
  node_group_name = "spot-optimized-workers"
  node_role_arn   = aws_iam_role.node_group.arn
  subnet_ids      = var.private_subnet_ids
  capacity_type   = "SPOT"

  # Diversify instance types for better spot availability
  instance_types = [
    "m5.large", "m5.xlarge", "m5.2xlarge",
    "m5a.large", "m5a.xlarge", "m5a.2xlarge",
    "c5.large", "c5.xlarge", "c5.2xlarge",
    "c5a.large", "c5a.xlarge", "c5a.2xlarge"
  ]

  scaling_config {
    desired_size = 15
    max_size     = 100
    min_size     = 10
  }

  # Spot instance handling
  launch_template {
    id      = aws_launch_template.spot_template.id
    version = aws_launch_template.spot_template.latest_version
  }

  lifecycle {
    ignore_changes = [scaling_config[0].desired_size]
  }

  tags = {
    "Name"                                                         = "EKS-Spot-Nodes"
    "k8s.io/cluster-autoscaler/enabled"                            = "true"
    "k8s.io/cluster-autoscaler/${aws_eks_cluster.main.name}"       = "owned"
    "k8s.io/cluster-autoscaler/node-template/label/node-lifecycle" = "spot"
  }
}

resource "aws_launch_template" "spot_template" {
  name_prefix = "eks-spot-"
  image_id    = data.aws_ssm_parameter.eks_ami.value

  # Note: instance types are declared on the node group above; setting
  # instance_type here as well would conflict with that list.

  vpc_security_group_ids = [aws_security_group.node_group.id]

  user_data = base64encode(templatefile("${path.module}/user_data.sh", {
    cluster_name        = aws_eks_cluster.main.name
    bootstrap_arguments = "--container-runtime containerd"
  }))

  tag_specifications {
    resource_type = "instance"
    tags = {
      Name                                                 = "EKS-Spot-Node"
      "kubernetes.io/cluster/${aws_eks_cluster.main.name}" = "owned"
    }
  }

  metadata_options {
    http_endpoint               = "enabled"
    http_tokens                 = "required"
    http_put_response_hop_limit = 2
  }
}
Intelligent Workload Scheduling
Deploy workloads strategically across spot and on-demand instances:
# spot-workload-scheduling.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: stateless-web-app
  namespace: production
spec:
  replicas: 20
  selector:
    matchLabels:
      app: web-app
  template:
    metadata:
      labels:
        app: web-app
    spec:
      # Prefer spot instances but allow on-demand as fallback
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            preference:
              matchExpressions:
              - key: node-lifecycle
                operator: In
                values: ["spot"]
          - weight: 50
            preference:
              matchExpressions:
              - key: node-lifecycle
                operator: In
                values: ["on-demand"]
      # Tolerate spot instance interruptions
      tolerations:
      - key: "spot-instance"
        operator: "Equal"
        value: "true"
        effect: "NoSchedule"
      # Distribute across zones for availability
      topologySpreadConstraints:
      - maxSkew: 2
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: web-app
      containers:
      - name: web-app
        image: nginx:alpine
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
          limits:
            cpu: 500m
            memory: 512Mi
        # Graceful shutdown for spot interruptions
        lifecycle:
          preStop:
            exec:
              command: ["/bin/sh", "-c", "sleep 30"]
      terminationGracePeriodSeconds: 45
---
# Critical workloads on on-demand instances
apiVersion: apps/v1
kind: Deployment
metadata:
  name: critical-api
  namespace: production
spec:
  replicas: 5
  selector:
    matchLabels:
      app: critical-api
  template:
    metadata:
      labels:
        app: critical-api
    spec:
      # Require on-demand instances for critical workloads
      nodeSelector:
        node-lifecycle: on-demand
      containers:
      - name: api
        image: myapp/api:v1.0.0
        resources:
          requests:
            cpu: 200m
            memory: 256Mi
          limits:
            cpu: 1
            memory: 1Gi
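Spot capacity is often reclaimed in batches, so it helps to pair the spot-friendly deployment above with a PodDisruptionBudget that caps how many replicas a voluntary drain can evict at once. A minimal sketch (the minAvailable value is illustrative and should be tuned per workload):

# web-app-pdb.yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-app-pdb
  namespace: production
spec:
  minAvailable: 70%  # keep most replicas serving during node drains
  selector:
    matchLabels:
      app: web-app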
AWS Savings Plans Integration
Optimize with Compute Savings Plans for predictable workloads:
# savings-plan-optimizer.py
import boto3
from datetime import datetime, timedelta

class SavingsPlansOptimizer:
    def __init__(self, region='us-west-2'):
        self.ce_client = boto3.client('ce', region_name=region)
        self.savings_plans_client = boto3.client('savingsplans', region_name=region)

    def analyze_usage_patterns(self, days=90):
        """Analyze EC2 usage patterns to recommend Savings Plans"""
        end_date = datetime.now().strftime('%Y-%m-%d')
        start_date = (datetime.now() - timedelta(days=days)).strftime('%Y-%m-%d')

        response = self.ce_client.get_cost_and_usage(
            TimePeriod={'Start': start_date, 'End': end_date},
            Granularity='DAILY',
            Metrics=['UnblendedCost', 'UsageQuantity'],
            GroupBy=[
                {'Type': 'DIMENSION', 'Key': 'INSTANCE_TYPE'},
                {'Type': 'DIMENSION', 'Key': 'REGION'}
            ],
            Filter={
                'Dimensions': {
                    'Key': 'SERVICE',
                    'Values': ['Amazon Elastic Compute Cloud - Compute']
                }
            }
        )
        return self._calculate_savings_opportunity(response['ResultsByTime'])

    def _calculate_savings_opportunity(self, usage_data):
        """Calculate potential savings from Savings Plans"""
        total_cost = 0
        consistent_usage = {}

        for daily_usage in usage_data:
            for group in daily_usage['Groups']:
                instance_type = group['Keys'][0]
                cost = float(group['Metrics']['UnblendedCost']['Amount'])
                total_cost += cost
                if instance_type not in consistent_usage:
                    consistent_usage[instance_type] = []
                consistent_usage[instance_type].append(cost)

        # Calculate minimum consistent usage (baseline for Savings Plans)
        recommendations = {}
        for instance_type, costs in consistent_usage.items():
            if len(costs) >= 30:  # Sufficient data points
                min_daily_cost = min(costs)
                avg_daily_cost = sum(costs) / len(costs)

                # Recommend 70% of minimum usage for a 1-year plan
                recommended_commitment = min_daily_cost * 365 * 0.7
                potential_savings = recommended_commitment * 0.17  # ~17% savings

                recommendations[instance_type] = {
                    'recommended_commitment': recommended_commitment,
                    'potential_annual_savings': potential_savings,
                    'confidence': 'high' if min_daily_cost / avg_daily_cost > 0.6 else 'medium'
                }
        return recommendations

    def get_current_savings_plans(self):
        """Get current Savings Plans status"""
        response = self.savings_plans_client.describe_savings_plans()
        return response['savingsPlans']
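A short usage sketch for the class above; the printed fields simply mirror the dictionaries built in _calculate_savings_opportunity:

if __name__ == "__main__":
    optimizer = SavingsPlansOptimizer(region="us-west-2")

    for instance_type, rec in optimizer.analyze_usage_patterns(days=90).items():
        print(f"{instance_type}: commit ~${rec['recommended_commitment']:,.0f}/yr, "
              f"save ~${rec['potential_annual_savings']:,.0f}/yr "
              f"({rec['confidence']} confidence)")

    print(f"Active Savings Plans: {len(optimizer.get_current_savings_plans())}")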
Resource Right-Sizing
Automated Resource Optimization
Implement VPA-based resource optimization:
# resource-optimizer.yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: cost-optimizer-vpa
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: resource-intensive-app
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
    - containerName: main-container
      minAllowed:
        cpu: 50m
        memory: 64Mi
      maxAllowed:
        cpu: 2
        memory: 4Gi
      controlledResources: ["cpu", "memory"]
      # Aggressive scaling for cost optimization
      controlledValues: RequestsAndLimits
      mode: Auto
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: resource-optimization-script
  namespace: kube-system
data:
  optimize.py: |
    #!/usr/bin/env python3
    import subprocess
    import json

    def get_current_resources(namespace, deployment):
        """Fetch current resource requests for a deployment's first container."""
        cmd = ["kubectl", "get", "deployment", deployment, "-n", namespace, "-o", "json"]
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            return None
        spec = json.loads(result.stdout)
        containers = spec['spec']['template']['spec']['containers']
        return containers[0].get('resources', {}).get('requests')

    def cpu_to_cores(quantity):
        quantity = str(quantity)
        return float(quantity[:-1]) / 1000 if quantity.endswith('m') else float(quantity)

    def mem_to_gib(quantity):
        quantity = str(quantity)
        if quantity.endswith('Gi'):
            return float(quantity[:-2])
        if quantity.endswith('Mi'):
            return float(quantity[:-2]) / 1024
        if quantity.endswith('Ki'):
            return float(quantity[:-2]) / (1024 * 1024)
        if quantity.endswith('k'):
            return float(quantity[:-1]) * 1000 / (1024 ** 3)
        return float(quantity) / (1024 ** 3)  # plain bytes

    def estimate_monthly_cost(resources):
        """Rough monthly cost from requests (illustrative on-demand rates)."""
        if not resources:
            return 0.0
        # ~$29/vCPU-month and ~$3.90/GiB-month are rough m5-class figures
        return (cpu_to_cores(resources.get('cpu', 0)) * 29.0 +
                mem_to_gib(resources.get('memory', 0)) * 3.9)

    def get_vpa_recommendations():
        """Get VPA recommendations for all targeted deployments"""
        cmd = ["kubectl", "get", "vpa", "-o", "json", "--all-namespaces"]
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            return {}
        recommendations = {}
        for vpa in json.loads(result.stdout).get('items', []):
            if 'status' in vpa and 'recommendation' in vpa['status']:
                name = vpa['metadata']['name']
                namespace = vpa['metadata']['namespace']
                target = vpa['spec']['targetRef']['name']
                rec = vpa['status']['recommendation']
                recommendations[f"{namespace}/{name}"] = {
                    'current': get_current_resources(namespace, target),
                    'recommended': rec.get('containerRecommendations', [])
                }
        return recommendations

    def calculate_cost_impact(recommendations):
        """Calculate cost impact of applying VPA recommendations"""
        total_savings = 0
        for deployment, data in recommendations.items():
            current = data['current']
            recommended = data['recommended']
            if current and recommended:
                current_cost = estimate_monthly_cost(current)
                recommended_cost = estimate_monthly_cost(recommended[0].get('target', {}))
                savings = current_cost - recommended_cost
                total_savings += savings
                print(f"{deployment}: ${savings:.2f}/month savings")
        print(f"Total estimated monthly savings: ${total_savings:.2f}")
        return total_savings

    if __name__ == "__main__":
        calculate_cost_impact(get_vpa_recommendations())
Bin Packing Optimization
Optimize node utilization through intelligent scheduling:
# bin-packing-scheduler.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: scheduler-config
  namespace: kube-system
data:
  config.yaml: |
    apiVersion: kubescheduler.config.k8s.io/v1beta3
    kind: KubeSchedulerConfiguration
    profiles:
    - schedulerName: bin-packing-scheduler
      plugins:
        score:
          enabled:
          - name: NodeResourcesFit
          - name: NodeAffinity
      pluginConfig:
      - name: NodeResourcesFit
        args:
          scoringStrategy:
            type: MostAllocated  # Bin packing strategy
            resources:
            - name: cpu
              weight: 1
            - name: memory
              weight: 1
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: bin-packed-app
  namespace: production
spec:
  replicas: 10
  selector:
    matchLabels:
      app: bin-packed-app
  template:
    metadata:
      labels:
        app: bin-packed-app
    spec:
      schedulerName: bin-packing-scheduler
      containers:
      - name: app
        image: myapp:latest
        resources:
          requests:
            cpu: 200m
            memory: 256Mi
          limits:
            cpu: 400m
            memory: 512Mi
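To confirm the MostAllocated strategy is actually packing nodes more densely, compare requested resources against allocatable capacity per node. A minimal sketch using the official kubernetes Python client (assumes a reachable kubeconfig, counts only running pods' main containers, and parses only the common quantity suffixes):

# node-utilization-check.py
from kubernetes import client, config

def parse_cpu(value):
    """Convert a Kubernetes CPU quantity ('250m' or '2') to millicores."""
    return int(value[:-1]) if value.endswith("m") else int(float(value) * 1000)

def parse_memory(value):
    """Convert a Kubernetes memory quantity to MiB (common suffixes only)."""
    units = {"Ki": 1 / 1024, "Mi": 1, "Gi": 1024, "Ti": 1024 * 1024}
    for suffix, factor in units.items():
        if value.endswith(suffix):
            return float(value[:-len(suffix)]) * factor
    return float(value) / (1024 * 1024)  # plain bytes

def node_utilization():
    config.load_kube_config()  # or config.load_incluster_config() inside a pod
    v1 = client.CoreV1Api()
    pods = v1.list_pod_for_all_namespaces(field_selector="status.phase=Running").items
    for node in v1.list_node().items:
        name = node.metadata.name
        alloc_cpu = parse_cpu(node.status.allocatable["cpu"])
        alloc_mem = parse_memory(node.status.allocatable["memory"])
        req_cpu = req_mem = 0
        for pod in pods:
            if pod.spec.node_name != name:
                continue
            for c in pod.spec.containers:
                requests = (c.resources.requests or {}) if c.resources else {}
                req_cpu += parse_cpu(requests.get("cpu", "0m"))
                req_mem += parse_memory(requests.get("memory", "0Mi"))
        print(f"{name}: cpu {req_cpu}/{alloc_cpu}m ({100 * req_cpu / alloc_cpu:.0f}%), "
              f"memory {req_mem:.0f}/{alloc_mem:.0f}Mi ({100 * req_mem / alloc_mem:.0f}%)")

if __name__ == "__main__":
    node_utilization()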
Storage Cost Optimization
EBS Volume Optimization
Implement intelligent EBS volume management:
# storage-optimization.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: cost-optimized-gp3
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  iops: "3000"
  throughput: "125"
  fsType: ext4
  encrypted: "true"
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: infrequent-access-storage
provisioner: ebs.csi.aws.com
parameters:
  type: sc1  # Cold HDD for infrequent access
  fsType: ext4
  encrypted: "true"
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
---
# Automated volume cleanup
apiVersion: batch/v1
kind: CronJob
metadata:
  name: ebs-volume-cleanup
  namespace: kube-system
spec:
  schedule: "0 2 * * *"  # Daily at 2 AM
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: ebs-cleanup-sa
          containers:
          - name: cleanup
            image: amazon/aws-cli:latest
            command:
            - /bin/bash
            - -c
            - |
              # Find and delete unattached EBS volumes older than 7 days
              cutoff=$(date -d "7 days ago" +%Y-%m-%dT%H:%M:%S)
              aws ec2 describe-volumes \
                --filters Name=status,Values=available \
                --query "Volumes[?CreateTime<='${cutoff}'].[VolumeId,CreateTime]" \
                --output text | \
              while read volume_id create_time; do
                echo "Deleting volume: $volume_id (created: $create_time)"
                aws ec2 delete-volume --volume-id $volume_id
              done
            env:
            - name: AWS_DEFAULT_REGION
              value: "us-west-2"
          restartPolicy: OnFailure
Persistent Volume Resizing
Implement dynamic volume resizing based on usage:
# volume-resizer.py
import boto3
from datetime import datetime, timedelta

class VolumeOptimizer:
    def __init__(self, region='us-west-2'):
        self.ec2_client = boto3.client('ec2', region_name=region)
        self.cloudwatch = boto3.client('cloudwatch', region_name=region)

    def analyze_volume_usage(self, volume_id, days=30):
        """Analyze CloudWatch metrics for volume usage patterns"""
        end_time = datetime.utcnow()
        start_time = end_time - timedelta(days=days)

        # Get read activity; writes could be analyzed the same way
        response = self.cloudwatch.get_metric_statistics(
            Namespace='AWS/EBS',
            MetricName='VolumeReadBytes',
            Dimensions=[{'Name': 'VolumeId', 'Value': volume_id}],
            StartTime=start_time,
            EndTime=end_time,
            Period=86400,  # Daily
            Statistics=['Sum']
        )
        usage_pattern = self._analyze_usage_pattern(response['Datapoints'])
        return self._get_optimization_recommendation(volume_id, usage_pattern)

    def _analyze_usage_pattern(self, datapoints):
        """Analyze usage patterns to determine optimization strategy"""
        if not datapoints:
            return {'pattern': 'no_data', 'confidence': 'low'}

        daily_usage = [point['Sum'] for point in datapoints]
        avg_usage = sum(daily_usage) / len(daily_usage)
        max_usage = max(daily_usage)
        min_usage = min(daily_usage)

        # Absolute thresholds are illustrative -- tune them for your workloads
        GB = 1024 ** 3
        if max_usage < 1 * GB:
            return {'pattern': 'very_low', 'confidence': 'high'}
        elif avg_usage < 10 * GB:
            return {'pattern': 'low', 'confidence': 'medium'}
        elif min_usage > avg_usage * 0.8:
            return {'pattern': 'consistent_high', 'confidence': 'high'}
        else:
            return {'pattern': 'variable', 'confidence': 'medium'}

    def _get_optimization_recommendation(self, volume_id, usage_pattern):
        """Generate optimization recommendations"""
        volume_info = self.ec2_client.describe_volumes(VolumeIds=[volume_id])['Volumes'][0]
        current_type = volume_info['VolumeType']
        current_size = volume_info['Size']

        recommendations = {
            'volume_id': volume_id,
            'current_type': current_type,
            'current_size': current_size,
            'pattern': usage_pattern['pattern']
        }

        if usage_pattern['pattern'] == 'very_low':
            recommendations.update({
                'recommended_type': 'sc1',  # Cold HDD
                'recommended_size': max(500, current_size // 2),
                'potential_savings': self._calculate_savings(current_type, 'sc1', current_size)
            })
        elif usage_pattern['pattern'] == 'low':
            recommendations.update({
                'recommended_type': 'st1',  # Throughput Optimized HDD
                'recommended_size': current_size,
                'potential_savings': self._calculate_savings(current_type, 'st1', current_size)
            })
        return recommendations

    def _calculate_savings(self, current_type, recommended_type, size):
        """Calculate potential cost savings (approximate per-GB-month rates)"""
        pricing = {
            'gp3': 0.08,
            'gp2': 0.10,
            'st1': 0.045,
            'sc1': 0.015
        }
        current_cost = pricing.get(current_type, 0.10) * size
        recommended_cost = pricing.get(recommended_type, 0.08) * size
        return (current_cost - recommended_cost) * 12  # Annual savings
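A hypothetical usage example for VolumeOptimizer; the volume ID below is a placeholder, and in practice you would page through describe_volumes to build the candidate list:

if __name__ == "__main__":
    optimizer = VolumeOptimizer(region="us-west-2")
    for volume_id in ["vol-0123456789abcdef0"]:  # placeholder ID
        rec = optimizer.analyze_volume_usage(volume_id, days=30)
        if "recommended_type" in rec:
            print(f"{volume_id}: {rec['current_type']} -> {rec['recommended_type']}, "
                  f"~${rec['potential_savings']:.0f}/yr savings")
        else:
            print(f"{volume_id}: no change recommended (pattern: {rec['pattern']})")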
Network Cost Optimization
NAT Gateway Optimization
Reduce NAT Gateway costs through strategic placement:
# nat-gateway-optimization.tf
# Use a single NAT Gateway for development environments
resource "aws_nat_gateway" "cost_optimized" {
  count = var.environment == "production" ? length(var.availability_zones) : 1

  allocation_id = aws_eip.nat[count.index].id
  subnet_id     = var.public_subnet_ids[count.index]

  tags = {
    Name             = "${var.environment}-nat-${count.index + 1}"
    Environment      = var.environment
    CostOptimization = "true"
  }
}

# Route table configuration for cost optimization
resource "aws_route_table" "private" {
  count  = length(var.private_subnet_ids)
  vpc_id = var.vpc_id

  route {
    cidr_block     = "0.0.0.0/0"
    nat_gateway_id = var.environment == "production" ? aws_nat_gateway.cost_optimized[count.index].id : aws_nat_gateway.cost_optimized[0].id
  }

  tags = {
    Name             = "${var.environment}-private-rt-${count.index + 1}"
    CostOptimization = "true"
  }
}

# VPC Endpoints for reducing NAT Gateway usage
resource "aws_vpc_endpoint" "s3" {
  vpc_id            = var.vpc_id
  service_name      = "com.amazonaws.${var.region}.s3"
  vpc_endpoint_type = "Gateway"
  route_table_ids   = aws_route_table.private[*].id

  tags = {
    Name             = "${var.environment}-s3-endpoint"
    CostOptimization = "NAT-bypass"
  }
}

resource "aws_vpc_endpoint" "ecr_api" {
  vpc_id              = var.vpc_id
  service_name        = "com.amazonaws.${var.region}.ecr.api"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = var.private_subnet_ids
  security_group_ids  = [aws_security_group.vpc_endpoints.id]
  private_dns_enabled = true

  tags = {
    Name             = "${var.environment}-ecr-api-endpoint"
    CostOptimization = "NAT-bypass"
  }
}
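To quantify how much a VPC endpoint could save, measure how much data each NAT Gateway actually processes. A hedged sketch using the AWS/NATGateway CloudWatch metrics (the $0.045/GB processing rate is approximate and region-dependent; check current pricing for your region):

# nat-cost-estimate.py
import boto3
from datetime import datetime, timedelta

NAT_PROCESSING_PER_GB = 0.045  # approximate; region-dependent

def estimate_nat_processing_cost(nat_gateway_id, days=30, region="us-west-2"):
    cloudwatch = boto3.client("cloudwatch", region_name=region)
    end = datetime.utcnow()
    start = end - timedelta(days=days)
    total_bytes = 0

    # Sum bytes in both directions as an approximation of processed data
    for metric in ("BytesOutToDestination", "BytesInFromDestination"):
        resp = cloudwatch.get_metric_statistics(
            Namespace="AWS/NATGateway",
            MetricName=metric,
            Dimensions=[{"Name": "NatGatewayId", "Value": nat_gateway_id}],
            StartTime=start,
            EndTime=end,
            Period=86400,
            Statistics=["Sum"],
        )
        total_bytes += sum(p["Sum"] for p in resp["Datapoints"])

    gb = total_bytes / (1024 ** 3)
    print(f"{nat_gateway_id}: {gb:,.1f} GB processed over {days} days "
          f"(~${gb * NAT_PROCESSING_PER_GB:,.2f} in processing charges)")

If most of that traffic turns out to be S3 or ECR image pulls, the Gateway and Interface endpoints above will remove it from the NAT path entirely.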
Data Transfer Optimization
Minimize cross-AZ and internet data transfer costs:
# data-transfer-optimization.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: data-locality-app
  namespace: production
spec:
  replicas: 6
  selector:
    matchLabels:
      app: data-locality-app
  template:
    metadata:
      labels:
        app: data-locality-app
    spec:
      # Prefer same-zone scheduling to reduce cross-AZ data transfer
      affinity:
        podAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values: ["database"]
              topologyKey: topology.kubernetes.io/zone
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 80
            preference:
              matchExpressions:
              - key: topology.kubernetes.io/zone
                operator: In
                values: ["us-west-2a"]  # Primary zone
      containers:
      - name: app
        image: myapp:latest
        env:
        - name: PREFER_LOCAL_STORAGE
          value: "true"
        - name: CACHE_LOCALITY
          value: "zone-aware"
        resources:
          requests:
            cpu: 200m
            memory: 256Mi
          limits:
            cpu: 500m
            memory: 512Mi
---
# Network policy to control egress traffic
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: cost-optimized-egress
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: data-locality-app
  policyTypes:
  - Egress
  egress:
  # Allow internal cluster communication
  - to:
    - podSelector: {}
  # Allow DNS resolution
  - to:
    - namespaceSelector:
        matchLabels:
          name: kube-system
    ports:
    - protocol: UDP
      port: 53
  # Restrict external access to HTTPS only (omitting 'to' matches all destinations)
  - ports:
    - protocol: TCP
      port: 443
Automated Cost Management
Cost Monitoring and Alerting
Implement comprehensive cost monitoring:
# cost-monitoring.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: cost-monitor-script
  namespace: monitoring
data:
  monitor.py: |
    #!/usr/bin/env python3
    import boto3
    import os
    from datetime import datetime, timedelta

    class EKSCostMonitor:
        def __init__(self):
            self.ce_client = boto3.client('ce')
            self.cloudwatch = boto3.client('cloudwatch')
            self.cluster_name = os.getenv('CLUSTER_NAME', 'production-cluster')

        def get_daily_costs(self, days=7):
            """Get daily EKS costs for the past week"""
            end_date = datetime.now().strftime('%Y-%m-%d')
            start_date = (datetime.now() - timedelta(days=days)).strftime('%Y-%m-%d')

            response = self.ce_client.get_cost_and_usage(
                TimePeriod={'Start': start_date, 'End': end_date},
                Granularity='DAILY',
                Metrics=['UnblendedCost'],
                GroupBy=[{'Type': 'DIMENSION', 'Key': 'SERVICE'}],
                Filter={
                    'Dimensions': {
                        'Key': 'SERVICE',
                        'Values': [
                            'Amazon Elastic Container Service for Kubernetes',
                            'Amazon Elastic Compute Cloud - Compute'
                        ]
                    }
                }
            )
            return self._process_cost_data(response['ResultsByTime'])

        def _process_cost_data(self, cost_data):
            """Process and analyze cost trends"""
            daily_costs = []
            total_cost = 0

            for day_data in cost_data:
                date = day_data['TimePeriod']['Start']
                day_total = sum(
                    float(group['Metrics']['UnblendedCost']['Amount'])
                    for group in day_data['Groups']
                )
                daily_costs.append({'date': date, 'cost': day_total})
                total_cost += day_total

            # Calculate the week's trend, guarding against division by zero
            trend = 0
            if len(daily_costs) >= 2:
                recent = daily_costs[-3:]
                earlier = daily_costs[:3]
                recent_avg = sum(d['cost'] for d in recent) / len(recent)
                earlier_avg = sum(d['cost'] for d in earlier) / len(earlier)
                if earlier_avg > 0:
                    trend = ((recent_avg - earlier_avg) / earlier_avg) * 100

            return {
                'daily_costs': daily_costs,
                'total_cost': total_cost,
                'average_daily_cost': total_cost / len(daily_costs),
                'cost_trend_percent': trend
            }

        def send_cost_alert(self, cost_data):
            """Send cost alerts if thresholds are exceeded"""
            daily_budget = float(os.getenv('DAILY_BUDGET', '200'))
            trend_threshold = float(os.getenv('TREND_THRESHOLD', '20'))

            latest_cost = cost_data['daily_costs'][-1]['cost']
            trend = cost_data['cost_trend_percent']
            alerts = []

            if latest_cost > daily_budget:
                alerts.append({
                    'type': 'budget_exceeded',
                    'message': f"Daily cost ${latest_cost:.2f} exceeded budget ${daily_budget:.2f}",
                    'severity': 'high'
                })
            if trend > trend_threshold:
                alerts.append({
                    'type': 'cost_trend',
                    'message': f"Cost trend increased by {trend:.1f}% over the past week",
                    'severity': 'medium'
                })
            if alerts:
                self._publish_alerts(alerts)

        def _publish_alerts(self, alerts):
            """Publish alerts as CloudWatch metrics"""
            for alert in alerts:
                self.cloudwatch.put_metric_data(
                    Namespace='EKS/CostOptimization',
                    MetricData=[{
                        'MetricName': 'CostAlert',
                        'Value': 1,
                        'Unit': 'Count',
                        'Dimensions': [
                            {'Name': 'AlertType', 'Value': alert['type']},
                            {'Name': 'ClusterName', 'Value': self.cluster_name}
                        ]
                    }]
                )

    if __name__ == "__main__":
        monitor = EKSCostMonitor()
        monitor.send_cost_alert(monitor.get_daily_costs())
---
apiVersion: batch/v1
kind: CronJob
metadata:
  name: cost-monitor
  namespace: monitoring
spec:
  schedule: "0 8 * * *"  # Daily at 8 AM
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: cost-monitor-sa
          containers:
          - name: monitor
            image: python:3.9-slim
            command:
            - /bin/bash
            - -c
            - |
              pip install boto3 && python /scripts/monitor.py
            env:
            - name: CLUSTER_NAME
              value: "production-cluster"
            - name: DAILY_BUDGET
              value: "200"
            - name: TREND_THRESHOLD
              value: "20"
            volumeMounts:
            - name: scripts
              mountPath: /scripts
          volumes:
          - name: scripts
            configMap:
              name: cost-monitor-script
              defaultMode: 0755
          restartPolicy: OnFailure
Automated Resource Cleanup
Implement automated cleanup of unused resources:
# resource-cleanup.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: resource-cleanup
  namespace: kube-system
spec:
  schedule: "0 1 * * *"  # Daily at 1 AM
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: cleanup-sa
          containers:
          - name: cleanup
            image: bitnami/kubectl:latest
            command:
            - /bin/bash
            - -c
            - |
              # Remove completed jobs older than 24 hours
              kubectl get jobs --all-namespaces --field-selector status.successful=1 \
                -o json | jq -r '.items[] |
                  select(.metadata.creationTimestamp | fromdateiso8601 < (now - 86400)) |
                  "\(.metadata.namespace) \(.metadata.name)"' | \
              while read namespace job; do
                echo "Deleting old job: $namespace/$job"
                kubectl delete job "$job" -n "$namespace"
              done

              # Remove failed pods older than 1 hour
              kubectl get pods --all-namespaces --field-selector status.phase=Failed \
                -o json | jq -r '.items[] |
                  select(.metadata.creationTimestamp | fromdateiso8601 < (now - 3600)) |
                  "\(.metadata.namespace) \(.metadata.name)"' | \
              while read namespace pod; do
                echo "Deleting failed pod: $namespace/$pod"
                kubectl delete pod "$pod" -n "$namespace"
              done

              # Clean up unused ConfigMaps and Secrets
              # (Add logic to identify unused resources)

              # Report cleanup statistics
              echo "Cleanup completed at $(date)"
          restartPolicy: OnFailure
Cost Optimization Best Practices
Implementation Checklist
- ✅ Spot Instance Strategy: 70-80% spot instances for non-critical workloads
- ✅ Resource Right-sizing: VPA and monitoring-based optimization
- ✅ Storage Optimization: GP3 volumes with appropriate IOPS/throughput
- ✅ Network Optimization: VPC endpoints and single NAT for dev environments
- ✅ Automated Cleanup: Regular cleanup of unused resources
- ✅ Cost Monitoring: Daily cost tracking with trend analysis
- ✅ Savings Plans: Long-term commitments for predictable workloads
- ✅ Bin Packing: Efficient node utilization through scheduling
Continuous Optimization
- Weekly Cost Reviews: Analyze spending patterns and trends (a week-over-week sketch follows this list)
- Monthly Resource Audits: Review and optimize resource allocations
- Quarterly Strategy Updates: Adjust optimization strategies based on usage patterns
- Annual Savings Plan Reviews: Optimize long-term commitments
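As a starting point for the weekly review, a minimal week-over-week comparison sketch (reusing the Cost Explorer service names filtered on earlier):

# weekly-cost-review.py
import boto3
from datetime import date, timedelta

def week_over_week(services):
    ce = boto3.client("ce")
    today = date.today()

    def cost(start, end):
        resp = ce.get_cost_and_usage(
            TimePeriod={"Start": start.isoformat(), "End": end.isoformat()},
            Granularity="DAILY",
            Metrics=["UnblendedCost"],
            Filter={"Dimensions": {"Key": "SERVICE", "Values": services}},
        )
        return sum(float(d["Total"]["UnblendedCost"]["Amount"])
                   for d in resp["ResultsByTime"])

    this_week = cost(today - timedelta(days=7), today)
    last_week = cost(today - timedelta(days=14), today - timedelta(days=7))
    change = 100 * (this_week - last_week) / last_week if last_week else 0
    print(f"This week: ${this_week:,.2f} | Last week: ${last_week:,.2f} "
          f"| Change: {change:+.1f}%")

if __name__ == "__main__":
    week_over_week([
        "Amazon Elastic Container Service for Kubernetes",
        "Amazon Elastic Compute Cloud - Compute",
    ])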
Conclusion
Effective EKS cost optimization requires a comprehensive approach combining infrastructure efficiency, resource optimization, and operational discipline. The strategies outlined in this guide can typically reduce EKS costs by 40-60% while maintaining or improving performance and reliability.
Key success factors include:
- Continuous Monitoring: Real-time cost tracking and alerting
- Automation: Automated cleanup and optimization processes
- Strategic Planning: Long-term commitments for predictable workloads
- Team Alignment: Clear cost ownership and accountability
Remember that cost optimization is an ongoing process. Regular review and adjustment of these strategies ensures continued cost efficiency as your applications and infrastructure evolve.
For more cloud cost optimization strategies and FinOps best practices, follow STAQI Technologies' technical blog.