Scaling Payment Systems to 50 Million Daily Transactions: An EKS Success Story
How we architected and implemented a high-performance EKS infrastructure that processes 50 million payment transactions daily with sub-100ms latency.
STAQI Technologies Team
January 29, 2024
When a global payment platform approached STAQI Technologies with the challenge of scaling their transaction processing from 5 million to 50 million daily transactions, we knew we were dealing with more than just a 10x scaling problem. This was about reimagining their entire infrastructure architecture to handle massive scale while maintaining the sub-100ms response times their business required.
The Scale Challenge: From Millions to Tens of Millions
Our client, a major payment processor serving multiple financial institutions, was experiencing exponential growth. Their existing infrastructure was hitting critical limitations:
Current State:
- 5 million transactions per day
- Average response time: 200ms
- Peak capacity: 150 TPS (transactions per second)
- Frequent performance degradation during peak hours
- Manual scaling requiring 2-hour lead times
Target Requirements:
- 50 million transactions per day
- Sub-100ms response time (95th percentile)
- Peak capacity: 2,000+ TPS
- Auto-scaling with zero manual intervention
- 99.99% uptime during peak hours
Architecture Revolution: Microservices on EKS
The Monolith Problem
The existing monolithic payment processor was the primary bottleneck. We designed a complete microservices architecture using Amazon EKS:
```yaml
# Payment Processing Microservices Architecture
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-validator
  namespace: payments
spec:
  replicas: 10
  selector:
    matchLabels:
      app: payment-validator
  template:
    metadata:
      labels:
        app: payment-validator
    spec:
      containers:
        - name: validator
          image: staqi/payment-validator:v2.1.0
          ports:
            - containerPort: 8080
          resources:
            requests:
              memory: "256Mi"
              cpu: "200m"
            limits:
              memory: "512Mi"
              cpu: "500m"
          env:
            - name: DATABASE_POOL_SIZE
              value: "20"
            - name: REDIS_CLUSTER_ENDPOINT
              valueFrom:
                secretKeyRef:
                  name: redis-config
                  key: cluster-endpoint
```
Key Microservices Components
1. Payment Validator Service
- Handles initial transaction validation
- Performs fraud detection checks
- Scales independently based on request volume
2. Transaction Processor Service
- Core payment processing logic
- Integrates with multiple payment gateways
- Implements circuit breakers for resilience (see the sketch after this list)
3. Settlement Service
- Handles batch settlement operations
- Processes end-of-day reconciliation
- Manages cross-border payment compliance
4. Notification Service
- Real-time transaction status updates
- Webhook delivery to merchant systems
- Event-driven architecture using Kafka
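The circuit breakers mentioned for the Transaction Processor deserve a concrete illustration. Below is a minimal, hand-rolled sketch of the pattern in Go; the thresholds, names, and the `gateway.Charge` call in the usage note are illustrative assumptions, not our production code.

```go
package main

import (
	"errors"
	"sync"
	"time"
)

// CircuitBreaker is a minimal illustration of the pattern: after too many
// consecutive failures it "opens" and fails fast, then allows a probe
// request after a cooldown. Thresholds and names are illustrative only.
type CircuitBreaker struct {
	mu          sync.Mutex
	failures    int
	maxFailures int
	cooldown    time.Duration
	openedAt    time.Time
}

var ErrCircuitOpen = errors.New("circuit breaker open: failing fast")

func NewCircuitBreaker(maxFailures int, cooldown time.Duration) *CircuitBreaker {
	return &CircuitBreaker{maxFailures: maxFailures, cooldown: cooldown}
}

// Call runs fn unless the breaker is open; it trips after maxFailures
// consecutive errors and half-opens once the cooldown elapses.
func (cb *CircuitBreaker) Call(fn func() error) error {
	cb.mu.Lock()
	if cb.failures >= cb.maxFailures && time.Since(cb.openedAt) < cb.cooldown {
		cb.mu.Unlock()
		return ErrCircuitOpen // fail fast without hitting the downstream
	}
	cb.mu.Unlock()

	err := fn()

	cb.mu.Lock()
	defer cb.mu.Unlock()
	if err != nil {
		cb.failures++
		if cb.failures >= cb.maxFailures {
			cb.openedAt = time.Now() // (re)open the breaker
		}
		return err
	}
	cb.failures = 0 // a success closes the breaker
	return nil
}
```

In practice each downstream payment gateway gets its own breaker instance, e.g. `gatewayBreaker.Call(func() error { return gateway.Charge(tx) })`, so a failing gateway sheds load immediately instead of queueing requests behind it.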
Performance Optimization: Every Millisecond Counts
Database Architecture: Distributed for Scale
We implemented a multi-tier database strategy:
```sql
-- Optimized Payment Transaction Table
-- Partitioned by date for performance; in PostgreSQL the partition key
-- must be part of the primary key.
CREATE TABLE payment_transactions (
    transaction_id UUID NOT NULL,
    merchant_id    UUID NOT NULL,
    amount         DECIMAL(15,2) NOT NULL,
    currency_code  CHAR(3) NOT NULL,
    status         VARCHAR(20) NOT NULL,
    created_at     TIMESTAMP DEFAULT NOW(),
    processed_at   TIMESTAMP,
    PRIMARY KEY (transaction_id, created_at)
) PARTITION BY RANGE (created_at);

-- Partitions are created per period, e.g. monthly:
CREATE TABLE payment_transactions_2024_01
    PARTITION OF payment_transactions
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');

-- Indexes for high-performance queries. CONCURRENTLY is not supported
-- on partitioned parents; indexes declared here cascade to partitions.
CREATE INDEX idx_transactions_merchant_status
    ON payment_transactions (merchant_id, status, created_at);

CREATE INDEX idx_transactions_processing
    ON payment_transactions (status, created_at)
    WHERE status IN ('pending', 'processing');
```
Database Strategy:
- Primary Database: Amazon RDS Aurora PostgreSQL with read replicas
- Cache Layer: Redis Cluster for sub-millisecond lookups
- Analytics: Amazon Redshift for reporting and analytics
- Session Data & Real-time Events: Amazon DynamoDB (see the sketch after this list)
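To make the session-data tier concrete, here is a minimal sketch of writing a session record to DynamoDB with the AWS SDK for Go v2. The table name `payment_sessions`, the attribute layout, and the 15-minute TTL are assumptions for illustration.

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/dynamodb"
	"github.com/aws/aws-sdk-go-v2/service/dynamodb/types"
)

// putSession stores a payment session in DynamoDB. The table name and
// attribute layout here are hypothetical.
func putSession(ctx context.Context, db *dynamodb.Client, sessionID, merchantID string) error {
	expiry := time.Now().Add(15 * time.Minute).Unix()
	_, err := db.PutItem(ctx, &dynamodb.PutItemInput{
		TableName: aws.String("payment_sessions"),
		Item: map[string]types.AttributeValue{
			"session_id":  &types.AttributeValueMemberS{Value: sessionID},
			"merchant_id": &types.AttributeValueMemberS{Value: merchantID},
			// Numeric TTL attribute so DynamoDB expires stale sessions.
			"expires_at": &types.AttributeValueMemberN{Value: fmt.Sprintf("%d", expiry)},
		},
	})
	return err
}

func main() {
	cfg, err := config.LoadDefaultConfig(context.Background())
	if err != nil {
		panic(err)
	}
	db := dynamodb.NewFromConfig(cfg)
	if err := putSession(context.Background(), db, "sess-123", "merchant-42"); err != nil {
		panic(err)
	}
}
```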
Caching Strategy: Multi-Level Performance
```go
// High-Performance Caching Implementation
type PaymentCache struct {
	L1Cache *ristretto.Cache     // In-memory cache
	L2Cache *redis.ClusterClient // Redis cluster
	L3Cache *aurora.ReadReplica  // Database read replica
}

func (pc *PaymentCache) GetMerchantConfig(merchantID string) (*MerchantConfig, error) {
	// L1: Check in-memory cache (sub-microsecond)
	if config, found := pc.L1Cache.Get(merchantID); found {
		return config.(*MerchantConfig), nil
	}

	// L2: Check Redis cluster (sub-millisecond)
	if configJSON, err := pc.L2Cache.Get(merchantID).Result(); err == nil {
		config := &MerchantConfig{}
		if err := json.Unmarshal([]byte(configJSON), config); err == nil {
			// Store in L1 for next time (cost 1, 5-minute TTL)
			pc.L1Cache.SetWithTTL(merchantID, config, 1, 300*time.Second)
			return config, nil
		}
	}

	// L3: Fall back to the database
	config, err := pc.loadFromDatabase(merchantID)
	if err != nil {
		return nil, err
	}

	// Populate caches asynchronously for future requests
	go pc.populateCaches(merchantID, config)
	return config, nil
}
```
Auto-Scaling: Intelligent Resource Management
Horizontal Pod Autoscaler (HPA) Configuration
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: payment-processor-hpa
  namespace: payments
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payment-processor
  minReplicas: 10
  maxReplicas: 100
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
    # Custom per-pod metric; requires a custom metrics adapter
    # (e.g. prometheus-adapter) to be installed in the cluster.
    - type: Pods
      pods:
        metric:
          name: transactions_per_second
        target:
          type: AverageValue
          averageValue: "50"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Percent
          value: 100
          periodSeconds: 15
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 50
          periodSeconds: 60
```
Vertical Pod Autoscaler (VPA) for Right-Sizing
```yaml
# Note: running VPA in "Auto" mode alongside an HPA that also scales on
# CPU/memory can produce conflicting decisions; the HPA above additionally
# keys off a custom TPS metric, which softens but does not remove this.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: payment-processor-vpa
  namespace: payments
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payment-processor
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
      - containerName: payment-processor
        maxAllowed:
          cpu: "2"
          memory: "4Gi"
        minAllowed:
          cpu: "100m"
          memory: "128Mi"
```
Load Testing: Validating Performance at Scale
Comprehensive Load Testing Strategy
We implemented a multi-phase load testing approach:
```javascript
// K6 Load Testing Script for Payment API
import http from 'k6/http';
import { check, sleep } from 'k6';
import { Rate } from 'k6/metrics';

export let errorRate = new Rate('errors');

export let options = {
  stages: [
    { duration: '5m', target: 500 },   // Ramp to 500 VUs (~500 TPS with 1s sleep)
    { duration: '30m', target: 500 },  // Hold at 500
    { duration: '5m', target: 1000 },  // Ramp to 1000
    { duration: '30m', target: 1000 }, // Hold at 1000
    { duration: '5m', target: 2000 },  // Peak load test
    { duration: '10m', target: 2000 }, // Sustain peak load
    { duration: '10m', target: 0 },    // Ramp down
  ],
  thresholds: {
    http_req_duration: ['p(95)<100'], // 95% of requests under 100ms
    http_req_failed: ['rate<0.01'],   // Error rate under 1%
    errors: ['rate<0.01'],
  },
};

export default function () {
  const payload = JSON.stringify({
    merchant_id: 'merchant_' + Math.floor(Math.random() * 1000),
    amount: Math.floor(Math.random() * 10000) / 100,
    currency: 'USD',
    payment_method: 'credit_card',
  });

  const params = {
    headers: {
      'Content-Type': 'application/json',
      'Authorization': 'Bearer ' + __ENV.API_TOKEN,
    },
  };

  const response = http.post('https://api.payments.staqi.com/v1/process', payload, params);

  check(response, {
    'status is 200': (r) => r.status === 200,
    'response time < 100ms': (r) => r.timings.duration < 100,
  }) || errorRate.add(1);

  sleep(1);
}
```
Results: Exceeding Performance Targets
Performance Metrics Achieved
Transaction Volume:
- Successfully scaled to 50 million daily transactions
- Peak performance: 2,400 TPS sustained
- Burst capacity: 3,500 TPS for short periods
Response Time:
- Average response time: 45ms (down from 200ms)
- 95th percentile: 85ms (target was <100ms)
- 99th percentile: 150ms
- 99.9th percentile: 300ms
Availability & Reliability:
- 99.997% uptime achieved
- Zero payment failures during peak hours
- Automated failover in under 30 seconds
- Mean Time to Recovery (MTTR): 2.5 minutes
Cost Optimization Results
Despite a 10x increase in transaction volume:
- Infrastructure costs: only a 4x increase, a 60% reduction in cost per transaction
- Operational overhead: 70% reduction through automation
- Development velocity: deployment cycles 200% faster (roughly 3x)
Advanced Optimizations: The Secret Sauce
Connection Pool Optimization
```go
// Optimized Database Connection Pool
type DatabasePool struct {
	writePool *sql.DB
	readPools []*sql.DB
	poolIndex int64
}

// newConnection opens a database handle and applies pool limits via the
// standard database/sql settings.
func newConnection(endpoint string, maxOpen, maxIdle int) *sql.DB {
	db, err := sql.Open("postgres", endpoint)
	if err != nil {
		log.Fatalf("open %s: %v", endpoint, err)
	}
	db.SetMaxOpenConns(maxOpen)
	db.SetMaxIdleConns(maxIdle)
	db.SetConnMaxLifetime(5 * time.Minute)
	db.SetConnMaxIdleTime(1 * time.Minute)
	return db
}

func NewDatabasePool() *DatabasePool {
	writeDB := newConnection(writeEndpoint, 50, 25)

	// Create multiple read pools for load distribution
	readPools := make([]*sql.DB, 5)
	for i := range readPools {
		readPools[i] = newConnection(readEndpoint, 30, 15)
	}

	return &DatabasePool{
		writePool: writeDB,
		readPools: readPools,
	}
}

func (dp *DatabasePool) GetReadConnection() *sql.DB {
	// Round-robin load balancing across read replicas
	index := atomic.AddInt64(&dp.poolIndex, 1) % int64(len(dp.readPools))
	return dp.readPools[index]
}
```
JVM Tuning for Kubernetes
```yaml
# Optimized JVM Configuration for Payment Services
apiVersion: v1
kind: ConfigMap
metadata:
  name: jvm-config
  namespace: payments
data:
  # Note: an explicit -Xmx takes precedence over MaxRAMPercentage;
  # pick one strategy depending on whether pod memory limits vary.
  JAVA_OPTS: |
    -Xms2g
    -Xmx3g
    -XX:+UseG1GC
    -XX:MaxGCPauseMillis=20
    -XX:+UseContainerSupport
    -XX:MaxRAMPercentage=75.0
    -Djava.security.egd=file:/dev/./urandom
    -Dspring.jmx.enabled=false
    -Dmanagement.endpoints.jmx.exposure.exclude=*
```
Monitoring and Observability: Full Visibility
Comprehensive Metrics Dashboard
We implemented a complete observability stack:
```yaml
# Prometheus ServiceMonitor for Payment Metrics
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: payment-metrics
  namespace: monitoring
spec:
  # The payment services live in the payments namespace, so the
  # ServiceMonitor must select across namespaces.
  namespaceSelector:
    matchNames:
      - payments
  selector:
    matchLabels:
      app: payment-processor
  endpoints:
    - port: metrics
      interval: 15s
      path: /actuator/prometheus
```
Key Metrics Tracked:
- Transaction throughput and latency
- Error rates by service and endpoint
- Database connection pool utilization
- JVM garbage collection metrics
- Kubernetes resource utilization
- Business metrics (revenue, conversion rates)
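As a minimal sketch of how a Go service could expose the first two of these (throughput and latency) for the ServiceMonitor above to scrape, here is an example using the Prometheus client library; the metric names, labels, and bucket boundaries are illustrative assumptions, not our production schema.

```go
package main

import (
	"log"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	// Counter for transaction throughput, labeled by outcome.
	txTotal = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "payment_transactions_total",
		Help: "Processed payment transactions by status.",
	}, []string{"status"})

	// Histogram for end-to-end latency, with buckets tuned around
	// the sub-100ms target.
	txLatency = promauto.NewHistogram(prometheus.HistogramOpts{
		Name:    "payment_transaction_duration_seconds",
		Help:    "End-to-end transaction processing latency.",
		Buckets: []float64{.005, .01, .025, .05, .075, .1, .25, .5, 1},
	})
)

func handleProcess(w http.ResponseWriter, r *http.Request) {
	start := time.Now()
	// ... validate, process, settle ...
	txLatency.Observe(time.Since(start).Seconds())
	txTotal.WithLabelValues("approved").Inc()
	w.WriteHeader(http.StatusOK)
}

func main() {
	http.HandleFunc("/v1/process", handleProcess)
	// Scraped by the ServiceMonitor; note this Go-style /metrics path
	// differs from the /actuator/prometheus endpoint of the Java services.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```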
The STAQI Difference: Beyond Technology
This project's success came from combining:
Technical Excellence:
- Deep understanding of high-frequency trading systems
- Expertise in microservices architecture patterns
- Performance optimization at every layer
Operational Excellence:
- 24/7 monitoring and support
- Proactive performance tuning
- Rapid incident response capabilities
Business Understanding:
- Financial services compliance requirements
- Revenue impact of performance issues
- Scalability planning for business growth
Future-Proofing: What's Next
The architecture we built is designed for continued growth:
- Machine Learning Integration: Fraud detection and performance optimization
- Global Expansion: Multi-region deployment capabilities
- New Payment Methods: Cryptocurrency and digital wallet support
- Real-time Analytics: Enhanced business intelligence and reporting
Key Takeaways for High-Volume Systems
1. Design for Scale from Day One
Don't wait until you hit limits. Build your architecture assuming 10x current volume.
2. Cache Everything (Intelligently)
Multi-level caching strategy can provide 10x performance improvements.
3. Measure Everything
You can't optimize what you don't measure. Comprehensive monitoring is essential.
4. Test at Scale
Load testing should exceed your expected peak by at least 50%.
5. Automate Operations
Manual processes become bottlenecks at scale. Automate everything possible.
The journey from 5 million to 50 million daily transactions required more than just scaling existing systems—it demanded a complete rethinking of architecture, operations, and monitoring. With the right approach and expertise, these challenges become opportunities for business growth and technical innovation.
Need to scale your payment infrastructure? Contact STAQI Technologies to learn how we can help your organization handle massive transaction volumes with exceptional performance.