Cost Management · 18 min read

EKS Cost Optimization: Advanced Techniques for Reducing Cloud Spend

Comprehensive strategies for optimizing EKS costs including spot instances, resource right-sizing, scheduling optimizations, and automated cost management practices that can reduce expenses by 40-60%.


STAQI Technologies

March 5, 2024

EKS · Cost Optimization · AWS · FinOps · Cloud Economics


Cloud costs can quickly spiral out of control without proper optimization strategies. This comprehensive guide explores advanced EKS cost optimization techniques that have helped organizations reduce their Kubernetes spending by 40-60% while maintaining performance and reliability.

Introduction

EKS cost optimization requires a multi-faceted approach addressing compute costs, storage optimization, network efficiency, and operational overhead. This guide provides actionable strategies for implementing comprehensive cost controls across your EKS infrastructure.

Cost Analysis Foundation

Understanding EKS Cost Components

Break down EKS costs into manageable components:

# cost-analysis.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: cost-breakdown
data:
  compute_costs: "60-70% - EC2 instances, Fargate pods"
  storage_costs: "15-20% - EBS volumes, EFS storage"
  network_costs: "10-15% - Data transfer, NAT gateways"
  control_plane: "5-10% - EKS cluster management fee"
  additional_services: "5-10% - Load balancers, monitoring"

Cost Monitoring Infrastructure

Implement comprehensive cost tracking:

# cost-monitoring.tf
resource "aws_ce_cost_category" "eks_costs" {
  name         = "EKS-Environment-Costs"
  rule_version = "CostCategoryExpression.v1"

  rule {
    value = "Production"
    rule {
      and {
        dimension {
          key           = "SERVICE"
          values        = ["Amazon Elastic Container Service for Kubernetes"]
          match_options = ["EQUALS"]
        }
      }
      and {
        tags {
          key           = "Environment"
          values        = ["production"]
          match_options = ["EQUALS"]
        }
      }
    }
  }

  rule {
    value = "Development"
    rule {
      and {
        dimension {
          key           = "SERVICE"
          values        = ["Amazon Elastic Container Service for Kubernetes"]
          match_options = ["EQUALS"]
        }
      }
      and {
        tags {
          key           = "Environment"
          values        = ["development", "staging"]
          match_options = ["EQUALS"]
        }
      }
    }
  }
}

resource "aws_budgets_budget" "eks_budget" {
  name         = "EKS-Monthly-Budget"
  budget_type  = "COST"
  limit_amount = "5000"
  limit_unit   = "USD"
  time_unit    = "MONTHLY"

  cost_filter {
    name   = "Service"
    values = ["Amazon Elastic Container Service for Kubernetes"]
  }

  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 80
    threshold_type             = "PERCENTAGE"
    notification_type          = "ACTUAL"
    subscriber_email_addresses = ["devops@company.com"]
  }

  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 100
    threshold_type             = "PERCENTAGE"
    notification_type          = "FORECASTED"
    subscriber_email_addresses = ["devops@company.com", "finance@company.com"]
  }
}

Compute Cost Optimization

Spot Instance Strategy

Implement aggressive spot instance usage for significant cost savings:

# spot-instances.tf
resource "aws_eks_node_group" "spot_optimized" {
  cluster_name    = aws_eks_cluster.main.name
  node_group_name = "spot-optimized-workers"
  node_role_arn   = aws_iam_role.node_group.arn
  subnet_ids      = var.private_subnet_ids

  capacity_type = "SPOT"

  # Diversify instance types for better spot availability
  instance_types = [
    "m5.large", "m5.xlarge", "m5.2xlarge",
    "m5a.large", "m5a.xlarge", "m5a.2xlarge",
    "c5.large", "c5.xlarge", "c5.2xlarge",
    "c5a.large", "c5a.xlarge", "c5a.2xlarge"
  ]

  scaling_config {
    desired_size = 15
    max_size     = 100
    min_size     = 10
  }

  # Spot instance handling
  launch_template {
    id      = aws_launch_template.spot_template.id
    version = aws_launch_template.spot_template.latest_version
  }

  lifecycle {
    ignore_changes = [scaling_config[0].desired_size]
  }

  tags = {
    "Name"                                                         = "EKS-Spot-Nodes"
    "k8s.io/cluster-autoscaler/enabled"                            = "true"
    "k8s.io/cluster-autoscaler/${aws_eks_cluster.main.name}"       = "owned"
    "k8s.io/cluster-autoscaler/node-template/label/node-lifecycle" = "spot"
  }
}

resource "aws_launch_template" "spot_template" {
  name_prefix = "eks-spot-"
  image_id    = data.aws_ssm_parameter.eks_ami.value

  # instance_type is intentionally omitted: specifying it in the launch
  # template conflicts with instance_types on the managed node group.

  vpc_security_group_ids = [aws_security_group.node_group.id]

  user_data = base64encode(templatefile("${path.module}/user_data.sh", {
    cluster_name        = aws_eks_cluster.main.name
    bootstrap_arguments = "--container-runtime containerd"
  }))

  tag_specifications {
    resource_type = "instance"
    tags = {
      Name                                                 = "EKS-Spot-Node"
      "kubernetes.io/cluster/${aws_eks_cluster.main.name}" = "owned"
    }
  }

  metadata_options {
    http_endpoint               = "enabled"
    http_tokens                 = "required"
    http_put_response_hop_limit = 2
  }
}

Intelligent Workload Scheduling

Deploy workloads strategically across spot and on-demand instances:

# spot-workload-scheduling.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: stateless-web-app
  namespace: production
spec:
  replicas: 20
  selector:
    matchLabels:
      app: web-app
  template:
    metadata:
      labels:
        app: web-app
    spec:
      # Prefer spot instances but allow on-demand as fallback
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            preference:
              matchExpressions:
              - key: node-lifecycle
                operator: In
                values: ["spot"]
          - weight: 50
            preference:
              matchExpressions:
              - key: node-lifecycle
                operator: In
                values: ["on-demand"]
      # Tolerate spot interruptions (assumes spot nodes carry this taint)
      tolerations:
      - key: "spot-instance"
        operator: "Equal"
        value: "true"
        effect: "NoSchedule"
      # Distribute across zones for availability
      topologySpreadConstraints:
      - maxSkew: 2
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: web-app
      containers:
      - name: web-app
        image: nginx:alpine
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
          limits:
            cpu: 500m
            memory: 512Mi
        # Graceful shutdown for spot interruptions
        lifecycle:
          preStop:
            exec:
              command: ["/bin/sh", "-c", "sleep 30"]
      terminationGracePeriodSeconds: 45
---
# Critical workloads on on-demand instances
apiVersion: apps/v1
kind: Deployment
metadata:
  name: critical-api
  namespace: production
spec:
  replicas: 5
  selector:
    matchLabels:
      app: critical-api
  template:
    metadata:
      labels:
        app: critical-api
    spec:
      # Require on-demand instances for critical workloads
      nodeSelector:
        node-lifecycle: on-demand
      containers:
      - name: api
        image: myapp/api:v1.0.0
        resources:
          requests:
            cpu: 200m
            memory: 256Mi
          limits:
            cpu: 1
            memory: 1Gi
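
Spot reclaims can drain several nodes at once, so it helps to bound how many replicas a voluntary disruption may take down at a time. A minimal PodDisruptionBudget sketch for the web app above (the minAvailable value is an assumption to tune):

# web-app-pdb.yaml (illustrative)
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-app-pdb
  namespace: production
spec:
  # Keep at least 70% of replicas up during voluntary disruptions
  # such as node drains triggered by spot interruption handling
  minAvailable: 70%
  selector:
    matchLabels:
      app: web-app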

AWS Savings Plans Integration

Optimize with Compute Savings Plans for predictable workloads:

# savings-plan-optimizer.py
import boto3
from datetime import datetime, timedelta

class SavingsPlansOptimizer:
    def __init__(self, region='us-west-2'):
        self.ce_client = boto3.client('ce', region_name=region)
        self.savings_plans_client = boto3.client('savingsplans', region_name=region)

    def analyze_usage_patterns(self, days=90):
        """Analyze EC2 usage patterns to recommend Savings Plans"""
        end_date = datetime.now().strftime('%Y-%m-%d')
        start_date = (datetime.now() - timedelta(days=days)).strftime('%Y-%m-%d')

        response = self.ce_client.get_cost_and_usage(
            TimePeriod={'Start': start_date, 'End': end_date},
            Granularity='DAILY',
            Metrics=['UnblendedCost', 'UsageQuantity'],
            GroupBy=[
                {'Type': 'DIMENSION', 'Key': 'INSTANCE_TYPE'},
                {'Type': 'DIMENSION', 'Key': 'REGION'}
            ],
            Filter={
                'Dimensions': {
                    'Key': 'SERVICE',
                    'Values': ['Amazon Elastic Compute Cloud - Compute']
                }
            }
        )

        return self._calculate_savings_opportunity(response['ResultsByTime'])

    def _calculate_savings_opportunity(self, usage_data):
        """Calculate potential savings from Savings Plans"""
        total_cost = 0
        consistent_usage = {}

        for daily_usage in usage_data:
            for group in daily_usage['Groups']:
                instance_type = group['Keys'][0]
                cost = float(group['Metrics']['UnblendedCost']['Amount'])
                total_cost += cost

                if instance_type not in consistent_usage:
                    consistent_usage[instance_type] = []
                consistent_usage[instance_type].append(cost)

        # Calculate minimum consistent usage (baseline for Savings Plans)
        recommendations = {}
        for instance_type, costs in consistent_usage.items():
            if len(costs) >= 30:  # Sufficient data points
                min_daily_cost = min(costs)
                avg_daily_cost = sum(costs) / len(costs)

                # Recommend 70% of minimum usage for a 1-year plan
                recommended_commitment = min_daily_cost * 365 * 0.7
                potential_savings = recommended_commitment * 0.17  # ~17% savings

                recommendations[instance_type] = {
                    'recommended_commitment': recommended_commitment,
                    'potential_annual_savings': potential_savings,
                    'confidence': 'high' if min_daily_cost / avg_daily_cost > 0.6 else 'medium'
                }

        return recommendations

    def get_current_savings_plans(self):
        """Get current Savings Plans status"""
        response = self.savings_plans_client.describe_savings_plans()
        return response['savingsPlans']

Resource Right-Sizing

Automated Resource Optimization

Implement VPA-based resource optimization:

# resource-optimizer.yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: cost-optimizer-vpa
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: resource-intensive-app
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
    - containerName: main-container
      minAllowed:
        cpu: 50m
        memory: 64Mi
      maxAllowed:
        cpu: 2
        memory: 4Gi
      controlledResources: ["cpu", "memory"]
      controlledValues: RequestsAndLimits
      # Aggressive scaling for cost optimization
      mode: Auto
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: resource-optimization-script
  namespace: kube-system
data:
  optimize.py: |
    #!/usr/bin/env python3
    import subprocess
    import json

    def get_current_resources(namespace, name):
        """Fetch current container resources for a deployment.
        Assumes the VPA name matches the deployment name."""
        cmd = ["kubectl", "get", "deployment", name, "-n", namespace, "-o", "json"]
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            return None
        containers = json.loads(result.stdout)['spec']['template']['spec']['containers']
        return [c.get('resources', {}) for c in containers]

    def get_vpa_recommendations():
        """Get VPA recommendations for all deployments"""
        cmd = ["kubectl", "get", "vpa", "-o", "json", "--all-namespaces"]
        result = subprocess.run(cmd, capture_output=True, text=True)

        if result.returncode == 0:
            vpa_data = json.loads(result.stdout)
            recommendations = {}

            for vpa in vpa_data.get('items', []):
                if 'status' in vpa and 'recommendation' in vpa['status']:
                    name = vpa['metadata']['name']
                    namespace = vpa['metadata']['namespace']
                    rec = vpa['status']['recommendation']

                    recommendations[f"{namespace}/{name}"] = {
                        'current': get_current_resources(namespace, name),
                        'recommended': rec.get('containerRecommendations', [])
                    }

            return recommendations
        return {}

    def calculate_cost_impact(recommendations, estimate_cost):
        """Calculate cost impact of applying VPA recommendations.
        estimate_cost is a caller-supplied function mapping a resource
        spec to an estimated monthly dollar cost."""
        total_savings = 0
        for deployment, data in recommendations.items():
            current = data['current']
            recommended = data['recommended']

            if current and recommended:
                current_cost = estimate_cost(current)
                recommended_cost = estimate_cost(recommended[0])
                savings = current_cost - recommended_cost
                total_savings += savings
                print(f"{deployment}: ${savings:.2f}/month savings")

        print(f"Total estimated monthly savings: ${total_savings:.2f}")
        return total_savings

Bin Packing Optimization

Optimize node utilization through intelligent scheduling:

# bin-packing-scheduler.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: scheduler-config
  namespace: kube-system
data:
  config.yaml: |
    apiVersion: kubescheduler.config.k8s.io/v1beta3
    kind: KubeSchedulerConfiguration
    profiles:
    - schedulerName: bin-packing-scheduler
      plugins:
        score:
          enabled:
          - name: NodeResourcesFit
          - name: NodeAffinity
      pluginConfig:
      - name: NodeResourcesFit
        args:
          scoringStrategy:
            type: MostAllocated  # Bin packing strategy
            resources:
            - name: cpu
              weight: 1
            - name: memory
              weight: 1
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: bin-packed-app
  namespace: production
spec:
  replicas: 10
  selector:
    matchLabels:
      app: bin-packed-app
  template:
    metadata:
      labels:
        app: bin-packed-app
    spec:
      schedulerName: bin-packing-scheduler
      containers:
      - name: app
        image: myapp:latest
        resources:
          requests:
            cpu: 200m
            memory: 256Mi
          limits:
            cpu: 400m
            memory: 512Mi

Storage Cost Optimization

EBS Volume Optimization

Implement intelligent EBS volume management:

# storage-optimization.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: cost-optimized-gp3
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  iops: "3000"
  throughput: "125"
  fsType: ext4
  encrypted: "true"
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: infrequent-access-storage
provisioner: ebs.csi.aws.com
parameters:
  type: sc1  # Cold HDD for infrequent access
  fsType: ext4
  encrypted: "true"
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
---
# Automated volume cleanup
apiVersion: batch/v1
kind: CronJob
metadata:
  name: ebs-volume-cleanup
  namespace: kube-system
spec:
  schedule: "0 2 * * *"  # Daily at 2 AM
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: ebs-cleanup-sa
          containers:
          - name: cleanup
            image: amazon/aws-cli:latest
            command:
            - /bin/bash
            - -c
            - |
              # Find and delete unattached EBS volumes older than 7 days
              cutoff=$(date -d "7 days ago" +%Y-%m-%dT%H:%M:%S)
              aws ec2 describe-volumes \
                --filters Name=status,Values=available \
                --query "Volumes[?CreateTime<='${cutoff}'].[VolumeId,CreateTime]" \
                --output text | \
              while read volume_id create_time; do
                echo "Deleting volume: $volume_id (created: $create_time)"
                aws ec2 delete-volume --volume-id $volume_id
              done
            env:
            - name: AWS_DEFAULT_REGION
              value: "us-west-2"
          restartPolicy: OnFailure
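
Workloads opt into a cheaper tier simply by naming the storage class in their claim. A minimal sketch using the sc1-backed class defined above (the claim name and size are illustrative):

# archive-pvc.yaml (illustrative)
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: archive-data
  namespace: production
spec:
  accessModes:
  - ReadWriteOnce
  # Cold-HDD class defined above; suited to rarely read data
  storageClassName: infrequent-access-storage
  resources:
    requests:
      storage: 500Gi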

Persistent Volume Resizing

Implement dynamic volume resizing based on usage:

# volume-resizer.py
import boto3
from datetime import datetime, timedelta

class VolumeOptimizer:
    def __init__(self, region='us-west-2'):
        self.ec2_client = boto3.client('ec2', region_name=region)
        self.cloudwatch = boto3.client('cloudwatch', region_name=region)

    def analyze_volume_usage(self, volume_id, days=30):
        """Analyze CloudWatch metrics for volume usage patterns"""
        end_time = datetime.utcnow()
        start_time = end_time - timedelta(days=days)

        # Get volume read activity (write metrics can be added the same way)
        response = self.cloudwatch.get_metric_statistics(
            Namespace='AWS/EBS',
            MetricName='VolumeReadBytes',
            Dimensions=[{'Name': 'VolumeId', 'Value': volume_id}],
            StartTime=start_time,
            EndTime=end_time,
            Period=86400,  # Daily
            Statistics=['Sum']
        )

        usage_pattern = self._analyze_usage_pattern(response['Datapoints'])
        return self._get_optimization_recommendation(volume_id, usage_pattern)

    def _analyze_usage_pattern(self, datapoints):
        """Classify usage; byte and ratio thresholds are illustrative."""
        if not datapoints:
            return {'pattern': 'no_data', 'confidence': 'low'}

        daily_usage = [point['Sum'] for point in datapoints]
        avg_usage = sum(daily_usage) / len(daily_usage)
        max_usage = max(daily_usage)
        min_usage = min(daily_usage)

        if max_usage < 1024 ** 3:  # under 1 GiB read even on the busiest day
            return {'pattern': 'very_low', 'confidence': 'high'}
        elif avg_usage < max_usage * 0.1:  # mostly idle with occasional spikes
            return {'pattern': 'low', 'confidence': 'medium'}
        elif min_usage > avg_usage * 0.8:  # steady, sustained activity
            return {'pattern': 'consistent_high', 'confidence': 'high'}
        else:
            return {'pattern': 'variable', 'confidence': 'medium'}

    def _get_optimization_recommendation(self, volume_id, usage_pattern):
        """Generate optimization recommendations"""
        volume_info = self.ec2_client.describe_volumes(VolumeIds=[volume_id])['Volumes'][0]

        current_type = volume_info['VolumeType']
        current_size = volume_info['Size']

        recommendations = {
            'volume_id': volume_id,
            'current_type': current_type,
            'current_size': current_size,
            'pattern': usage_pattern['pattern']
        }

        if usage_pattern['pattern'] == 'very_low':
            recommendations.update({
                'recommended_type': 'sc1',  # Cold HDD
                'recommended_size': max(500, current_size // 2),
                'potential_savings': self._calculate_savings(current_type, 'sc1', current_size)
            })
        elif usage_pattern['pattern'] == 'low':
            recommendations.update({
                'recommended_type': 'st1',  # Throughput Optimized HDD
                'recommended_size': current_size,
                'potential_savings': self._calculate_savings(current_type, 'st1', current_size)
            })

        return recommendations

    def _calculate_savings(self, current_type, recommended_type, size):
        """Calculate potential cost savings (illustrative per-GB pricing)"""
        pricing = {
            'gp3': 0.08,  # per GB/month
            'gp2': 0.10,
            'st1': 0.045,
            'sc1': 0.015
        }

        current_cost = pricing.get(current_type, 0.10) * size
        recommended_cost = pricing.get(recommended_type, 0.08) * size

        return (current_cost - recommended_cost) * 12  # Annual savings

Network Cost Optimization

NAT Gateway Optimization

Reduce NAT Gateway costs through strategic placement:

# nat-gateway-optimization.tf
# Use a single NAT Gateway for development environments
resource "aws_nat_gateway" "cost_optimized" {
  count = var.environment == "production" ? length(var.availability_zones) : 1

  allocation_id = aws_eip.nat[count.index].id
  subnet_id     = var.public_subnet_ids[count.index]

  tags = {
    Name             = "${var.environment}-nat-${count.index + 1}"
    Environment      = var.environment
    CostOptimization = "true"
  }
}

# Route table configuration for cost optimization
resource "aws_route_table" "private" {
  count  = length(var.private_subnet_ids)
  vpc_id = var.vpc_id

  route {
    cidr_block     = "0.0.0.0/0"
    nat_gateway_id = var.environment == "production" ? aws_nat_gateway.cost_optimized[count.index].id : aws_nat_gateway.cost_optimized[0].id
  }

  tags = {
    Name             = "${var.environment}-private-rt-${count.index + 1}"
    CostOptimization = "true"
  }
}

# VPC Endpoints for reducing NAT Gateway usage
resource "aws_vpc_endpoint" "s3" {
  vpc_id            = var.vpc_id
  service_name      = "com.amazonaws.${var.region}.s3"
  vpc_endpoint_type = "Gateway"
  route_table_ids   = aws_route_table.private[*].id

  tags = {
    Name             = "${var.environment}-s3-endpoint"
    CostOptimization = "NAT-bypass"
  }
}

resource "aws_vpc_endpoint" "ecr_api" {
  vpc_id              = var.vpc_id
  service_name        = "com.amazonaws.${var.region}.ecr.api"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = var.private_subnet_ids
  security_group_ids  = [aws_security_group.vpc_endpoints.id]
  private_dns_enabled = true

  tags = {
    Name             = "${var.environment}-ecr-api-endpoint"
    CostOptimization = "NAT-bypass"
  }
}

Data Transfer Optimization

Minimize cross-AZ and internet data transfer costs:

# data-transfer-optimization.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: data-locality-app
  namespace: production
spec:
  replicas: 6
  selector:
    matchLabels:
      app: data-locality-app
  template:
    metadata:
      labels:
        app: data-locality-app
    spec:
      # Prefer same-zone scheduling to reduce cross-AZ data transfer
      affinity:
        podAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values: ["database"]
              topologyKey: topology.kubernetes.io/zone
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 80
            preference:
              matchExpressions:
              - key: topology.kubernetes.io/zone
                operator: In
                values: ["us-west-2a"]  # Primary zone
      containers:
      - name: app
        image: myapp:latest
        env:
        - name: PREFER_LOCAL_STORAGE
          value: "true"
        - name: CACHE_LOCALITY
          value: "zone-aware"
        resources:
          requests:
            cpu: 200m
            memory: 256Mi
          limits:
            cpu: 500m
            memory: 512Mi
---
# Network policy to control egress traffic
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: cost-optimized-egress
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: data-locality-app
  policyTypes:
  - Egress
  egress:
  # Allow internal cluster communication
  - to:
    - podSelector: {}
  # Allow DNS resolution
  - to:
    - namespaceSelector:
        matchLabels:
          name: kube-system
    ports:
    - protocol: UDP
      port: 53
  # Restrict external access to necessary services only (HTTPS)
  - ports:
    - protocol: TCP
      port: 443
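
Kubernetes can also keep Service traffic within a zone when enough endpoints exist there, which further cuts cross-AZ transfer. A minimal sketch, assuming a cluster version that supports topology aware hints (on 1.27+ the service.kubernetes.io/topology-mode annotation replaces this form; the port numbers are illustrative):

# zone-local-service.yaml (illustrative)
apiVersion: v1
kind: Service
metadata:
  name: data-locality-app
  namespace: production
  annotations:
    # Ask kube-proxy to prefer same-zone endpoints when routing
    service.kubernetes.io/topology-aware-hints: "auto"
spec:
  selector:
    app: data-locality-app
  ports:
  - port: 80
    targetPort: 8080
    protocol: TCP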

Automated Cost Management

Cost Monitoring and Alerting

Implement comprehensive cost monitoring:

# cost-monitoring.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: cost-monitor-script
  namespace: monitoring
data:
  monitor.py: |
    #!/usr/bin/env python3
    import boto3
    import os
    from datetime import datetime, timedelta

    class EKSCostMonitor:
        def __init__(self):
            self.ce_client = boto3.client('ce')
            self.cloudwatch = boto3.client('cloudwatch')
            self.cluster_name = os.getenv('CLUSTER_NAME', 'production-cluster')

        def get_daily_costs(self, days=7):
            """Get daily EKS costs for the past week"""
            end_date = datetime.now().strftime('%Y-%m-%d')
            start_date = (datetime.now() - timedelta(days=days)).strftime('%Y-%m-%d')

            response = self.ce_client.get_cost_and_usage(
                TimePeriod={'Start': start_date, 'End': end_date},
                Granularity='DAILY',
                Metrics=['UnblendedCost'],
                GroupBy=[{'Type': 'DIMENSION', 'Key': 'SERVICE'}],
                Filter={
                    'Dimensions': {
                        'Key': 'SERVICE',
                        'Values': [
                            'Amazon Elastic Container Service for Kubernetes',
                            'Amazon Elastic Compute Cloud - Compute'
                        ]
                    }
                }
            )

            return self._process_cost_data(response['ResultsByTime'])

        def _process_cost_data(self, cost_data):
            """Process and analyze cost trends"""
            daily_costs = []
            total_cost = 0

            for day_data in cost_data:
                date = day_data['TimePeriod']['Start']
                day_total = 0

                for group in day_data['Groups']:
                    cost = float(group['Metrics']['UnblendedCost']['Amount'])
                    day_total += cost

                daily_costs.append({'date': date, 'cost': day_total})
                total_cost += day_total

            # Calculate trend (guard against division by zero)
            if len(daily_costs) >= 2:
                recent_avg = sum(d['cost'] for d in daily_costs[-3:]) / 3
                earlier_avg = sum(d['cost'] for d in daily_costs[:3]) / 3
                trend = ((recent_avg - earlier_avg) / earlier_avg) * 100 if earlier_avg else 0
            else:
                trend = 0

            return {
                'daily_costs': daily_costs,
                'total_cost': total_cost,
                'average_daily_cost': total_cost / len(daily_costs),
                'cost_trend_percent': trend
            }

        def send_cost_alert(self, cost_data):
            """Send cost alerts if thresholds are exceeded"""
            daily_budget = float(os.getenv('DAILY_BUDGET', '200'))
            trend_threshold = float(os.getenv('TREND_THRESHOLD', '20'))

            latest_cost = cost_data['daily_costs'][-1]['cost']
            trend = cost_data['cost_trend_percent']

            alerts = []

            if latest_cost > daily_budget:
                alerts.append({
                    'type': 'budget_exceeded',
                    'message': f"Daily cost ${latest_cost:.2f} exceeded budget ${daily_budget:.2f}",
                    'severity': 'high'
                })

            if trend > trend_threshold:
                alerts.append({
                    'type': 'cost_trend',
                    'message': f"Cost trend increased by {trend:.1f}% over the past week",
                    'severity': 'medium'
                })

            if alerts:
                self._publish_alerts(alerts)

        def _publish_alerts(self, alerts):
            """Publish alerts as CloudWatch metrics"""
            for alert in alerts:
                self.cloudwatch.put_metric_data(
                    Namespace='EKS/CostOptimization',
                    MetricData=[{
                        'MetricName': 'CostAlert',
                        'Value': 1,
                        'Unit': 'Count',
                        'Dimensions': [
                            {'Name': 'AlertType', 'Value': alert['type']},
                            {'Name': 'ClusterName', 'Value': self.cluster_name}
                        ]
                    }]
                )

    if __name__ == '__main__':
        monitor = EKSCostMonitor()
        monitor.send_cost_alert(monitor.get_daily_costs())
---
apiVersion: batch/v1
kind: CronJob
metadata:
  name: cost-monitor
  namespace: monitoring
spec:
  schedule: "0 8 * * *"  # Daily at 8 AM
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: cost-monitor-sa
          containers:
          - name: monitor
            image: python:3.9-slim
            command:
            - /bin/bash
            - -c
            - |
              pip install boto3 && python /scripts/monitor.py
            env:
            - name: CLUSTER_NAME
              value: "production-cluster"
            - name: DAILY_BUDGET
              value: "200"
            - name: TREND_THRESHOLD
              value: "20"
            volumeMounts:
            - name: scripts
              mountPath: /scripts
          volumes:
          - name: scripts
            configMap:
              name: cost-monitor-script
              defaultMode: 0755
          restartPolicy: OnFailure
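
The CronJob above references a cost-monitor-sa service account without defining it. On EKS the usual pattern is IAM Roles for Service Accounts (IRSA); a minimal sketch, assuming an IAM role with ce:GetCostAndUsage and cloudwatch:PutMetricData permissions already exists (the account ID and role name are placeholders):

# cost-monitor-sa.yaml (illustrative)
apiVersion: v1
kind: ServiceAccount
metadata:
  name: cost-monitor-sa
  namespace: monitoring
  annotations:
    # IRSA: bind the pod to an IAM role scoped to Cost Explorer
    # reads and CloudWatch metric writes
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/eks-cost-monitor

The same IRSA pattern applies to the ebs-cleanup-sa and cleanup-sa accounts used by the cleanup jobs in this guide.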

Automated Resource Cleanup

Implement automated cleanup of unused resources:

# resource-cleanup.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: resource-cleanup
  namespace: kube-system
spec:
  schedule: "0 1 * * *"  # Daily at 1 AM
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: cleanup-sa
          containers:
          - name: cleanup
            image: bitnami/kubectl:latest
            command:
            - /bin/bash
            - -c
            - |
              # Remove completed jobs older than 24 hours
              kubectl get jobs --all-namespaces --field-selector status.successful=1 \
                -o json | jq -r '.items[] | select(.metadata.creationTimestamp | fromdateiso8601 < (now - 86400)) | "\(.metadata.namespace) \(.metadata.name)"' | \
              while read namespace job; do
                echo "Deleting old job: $namespace/$job"
                kubectl delete job $job -n $namespace
              done

              # Remove failed pods older than 1 hour
              kubectl get pods --all-namespaces --field-selector status.phase=Failed \
                -o json | jq -r '.items[] | select(.metadata.creationTimestamp | fromdateiso8601 < (now - 3600)) | "\(.metadata.namespace) \(.metadata.name)"' | \
              while read namespace pod; do
                echo "Deleting failed pod: $namespace/$pod"
                kubectl delete pod $pod -n $namespace
              done

              # Clean up unused ConfigMaps and Secrets
              # (Add logic to identify unused resources)

              # Report cleanup statistics
              echo "Cleanup completed at $(date)"
          restartPolicy: OnFailure

Cost Optimization Best Practices

Implementation Checklist

  • Spot Instance Strategy: 70-80% spot instances for non-critical workloads
  • Resource Right-sizing: VPA and monitoring-based optimization (see the LimitRange sketch after this list)
  • Storage Optimization: GP3 volumes with appropriate IOPS/throughput
  • Network Optimization: VPC endpoints and single NAT for dev environments
  • Automated Cleanup: Regular cleanup of unused resources
  • Cost Monitoring: Daily cost tracking with trend analysis
  • Savings Plans: Long-term commitments for predictable workloads
  • Bin Packing: Efficient node utilization through scheduling
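
Right-sizing also benefits from guardrails that keep new workloads from over-requesting by default. A minimal LimitRange sketch (the default values are assumptions to tune per namespace):

# default-requests.yaml (illustrative)
apiVersion: v1
kind: LimitRange
metadata:
  name: default-requests
  namespace: production
spec:
  limits:
  - type: Container
    # Applied when a container omits its own requests/limits
    defaultRequest:
      cpu: 100m
      memory: 128Mi
    default:
      cpu: 500m
      memory: 512Mi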

Continuous Optimization

  1. Weekly Cost Reviews: Analyze spending patterns and trends
  2. Monthly Resource Audits: Review and optimize resource allocations
  3. Quarterly Strategy Updates: Adjust optimization strategies based on usage patterns
  4. Annual Savings Plan Reviews: Optimize long-term commitments

Conclusion

Effective EKS cost optimization requires a comprehensive approach combining infrastructure efficiency, resource optimization, and operational discipline. The strategies outlined in this guide can typically reduce EKS costs by 40-60% while maintaining or improving performance and reliability.

Key success factors include:

  • Continuous Monitoring: Real-time cost tracking and alerting
  • Automation: Automated cleanup and optimization processes
  • Strategic Planning: Long-term commitments for predictable workloads
  • Team Alignment: Clear cost ownership and accountability

Remember that cost optimization is an ongoing process. Regular review and adjustment of these strategies ensures continued cost efficiency as your applications and infrastructure evolve.


For more cloud cost optimization strategies and FinOps best practices, follow STAQI Technologies' technical blog.

Ready to implement similar solutions?

Contact STAQI Technologies to learn how our expertise in high-volume systems, security operations, and compliance can benefit your organization.
