Scaling Payment Systems to 50 Million Daily Transactions: An EKS Success Story

How we architected and implemented a high-performance EKS infrastructure that processes 50 million payment transactions daily with sub-100ms latency.

STAQI Technologies Team

January 29, 2024

EKS · High-Volume · Performance · Scaling · Payment Processing

When a global payment platform approached STAQI Technologies with the challenge of scaling their transaction processing from 5 million to 50 million daily transactions, we knew we were dealing with more than just a 10x scaling problem. This was about reimagining their entire infrastructure architecture to handle massive scale while maintaining the sub-100ms response times their business required.

The Scale Challenge: From Millions to Tens of Millions

Our client, a major payment processor serving multiple financial institutions, was experiencing exponential growth. Their existing infrastructure was hitting critical limitations:

Current State:

  • 5 million transactions per day
  • Average response time: 200ms
  • Peak capacity: 150 TPS (transactions per second)
  • Frequent performance degradation during peak hours
  • Manual scaling requiring 2-hour lead times

Target Requirements:

  • 50 million transactions per day
  • Sub-100ms response time (95th percentile)
  • Peak capacity: 2,000+ TPS
  • Auto-scaling with zero manual intervention
  • 99.99% uptime during peak hours
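As a sanity check on these targets: 50 million transactions spread evenly over a day works out to roughly 579 TPS, and the 2,000+ TPS peak requirement corresponds to traffic bursting at about 3.5x the flat average. The burst factor below is our illustrative assumption, not a client-supplied figure:

```go
package main

import "fmt"

// requiredTPS converts a daily transaction volume into a transactions-per-
// second rate, scaled by a burst factor (1.0 = perfectly flat traffic).
func requiredTPS(dailyTxns, burstFactor float64) float64 {
	return dailyTxns / 86_400 * burstFactor
}

func main() {
	// Flat average for 50M/day: ~579 TPS.
	fmt.Printf("average: %.0f TPS\n", requiredTPS(50_000_000, 1))
	// Payment traffic is bursty; an assumed ~3.5x peak factor lands
	// right around the 2,000+ TPS capacity target.
	fmt.Printf("assumed peak: %.0f TPS\n", requiredTPS(50_000_000, 3.5))
}
```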

Architecture Revolution: Microservices on EKS

The Monolith Problem

The existing monolithic payment processor was the primary bottleneck. We designed a complete microservices architecture using Amazon EKS:

# Payment Processing Microservices Architecture
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-validator
  namespace: payments
spec:
  replicas: 10
  selector:
    matchLabels:
      app: payment-validator
  template:
    metadata:
      labels:
        app: payment-validator
    spec:
      containers:
        - name: validator
          image: staqi/payment-validator:v2.1.0
          ports:
            - containerPort: 8080
          resources:
            requests:
              memory: "256Mi"
              cpu: "200m"
            limits:
              memory: "512Mi"
              cpu: "500m"
          env:
            - name: DATABASE_POOL_SIZE
              value: "20"
            - name: REDIS_CLUSTER_ENDPOINT
              valueFrom:
                secretKeyRef:
                  name: redis-config
                  key: cluster-endpoint

Key Microservices Components

1. Payment Validator Service

  • Handles initial transaction validation
  • Performs fraud detection checks
  • Scales independently based on request volume

2. Transaction Processor Service

  • Core payment processing logic
  • Integrates with multiple payment gateways
  • Implements circuit breakers for resilience

3. Settlement Service

  • Handles batch settlement operations
  • Processes end-of-day reconciliation
  • Manages cross-border payment compliance

4. Notification Service

  • Real-time transaction status updates
  • Webhook delivery to merchant systems
  • Event-driven architecture using Kafka

Performance Optimization: Every Millisecond Counts

Database Architecture: Distributed for Scale

We implemented a multi-tier database strategy:

-- Optimized Payment Transaction Table
CREATE TABLE payment_transactions (
    transaction_id UUID NOT NULL,
    merchant_id    UUID NOT NULL,
    amount         DECIMAL(15,2) NOT NULL,
    currency_code  CHAR(3) NOT NULL,
    status         VARCHAR(20) NOT NULL,
    created_at     TIMESTAMP DEFAULT NOW(),
    processed_at   TIMESTAMP,
    -- PostgreSQL requires the partition key in the primary key
    PRIMARY KEY (transaction_id, created_at)
)
-- Partitioning by date for performance
PARTITION BY RANGE (created_at);

-- Indexes for high-performance queries
-- (CONCURRENTLY is not supported on partitioned parents)
CREATE INDEX idx_transactions_merchant_status
    ON payment_transactions (merchant_id, status, created_at);

CREATE INDEX idx_transactions_processing
    ON payment_transactions (status, created_at)
    WHERE status IN ('pending', 'processing');

Database Strategy:

  • Primary Database: Amazon RDS Aurora PostgreSQL with read replicas
  • Cache Layer: Redis Cluster for sub-millisecond lookups
  • Analytics: Amazon Redshift for reporting and analytics
  • Real-time Events: Amazon DynamoDB for session data

Caching Strategy: Multi-Level Performance

// High-Performance Caching Implementation
type PaymentCache struct {
	L1Cache *ristretto.Cache     // In-memory cache
	L2Cache *redis.ClusterClient // Redis cluster
	L3Cache *sql.DB              // Aurora read replica
}

func (pc *PaymentCache) GetMerchantConfig(ctx context.Context, merchantID string) (*MerchantConfig, error) {
	// L1: Check in-memory cache (sub-microsecond)
	if config, found := pc.L1Cache.Get(merchantID); found {
		return config.(*MerchantConfig), nil
	}

	// L2: Check Redis cluster (sub-millisecond)
	if configJSON, err := pc.L2Cache.Get(ctx, merchantID).Result(); err == nil {
		config := &MerchantConfig{}
		if err := json.Unmarshal([]byte(configJSON), config); err == nil {
			// Store in L1 for next time (cost 1, 5-minute TTL)
			pc.L1Cache.SetWithTTL(merchantID, config, 1, 300*time.Second)
			return config, nil
		}
	}

	// L3: Fallback to database
	config, err := pc.loadFromDatabase(ctx, merchantID)
	if err != nil {
		return nil, err
	}

	// Populate caches for future requests
	go pc.populateCaches(merchantID, config)
	return config, nil
}

Auto-Scaling: Intelligent Resource Management

Horizontal Pod Autoscaler (HPA) Configuration

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: payment-processor-hpa
  namespace: payments
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payment-processor
  minReplicas: 10
  maxReplicas: 100
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
    - type: Pods
      pods:
        metric:
          name: transactions_per_second
        target:
          type: AverageValue
          averageValue: "50"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Percent
          value: 100
          periodSeconds: 15
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 50
          periodSeconds: 60

Vertical Pod Autoscaler (VPA) for Right-Sizing

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: payment-processor-vpa
  namespace: payments
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payment-processor
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
      - containerName: payment-processor
        maxAllowed:
          cpu: "2"
          memory: "4Gi"
        minAllowed:
          cpu: "100m"
          memory: "128Mi"

Load Testing: Validating Performance at Scale

Comprehensive Load Testing Strategy

We implemented a multi-phase load testing approach:

// K6 Load Testing Script for Payment API
import http from 'k6/http';
import { check, sleep } from 'k6';
import { Rate } from 'k6/metrics';

export let errorRate = new Rate('errors');

export let options = {
  stages: [
    { duration: '5m', target: 500 },   // Ramp up to 500 TPS
    { duration: '30m', target: 500 },  // Stay at 500 TPS
    { duration: '5m', target: 1000 },  // Ramp up to 1000 TPS
    { duration: '30m', target: 1000 }, // Stay at 1000 TPS
    { duration: '5m', target: 2000 },  // Peak load test
    { duration: '10m', target: 2000 }, // Sustain peak load
    { duration: '10m', target: 0 },    // Ramp down
  ],
  thresholds: {
    http_req_duration: ['p(95)<100'], // 95% of requests under 100ms
    http_req_failed: ['rate<0.01'],   // Error rate under 1%
    errors: ['rate<0.01'],
  },
};

export default function () {
  const payload = JSON.stringify({
    merchant_id: 'merchant_' + Math.floor(Math.random() * 1000),
    amount: Math.floor(Math.random() * 10000) / 100,
    currency: 'USD',
    payment_method: 'credit_card',
  });

  const params = {
    headers: {
      'Content-Type': 'application/json',
      'Authorization': 'Bearer ' + __ENV.API_TOKEN,
    },
  };

  const response = http.post('https://api.payments.staqi.com/v1/process', payload, params);

  check(response, {
    'status is 200': (r) => r.status === 200,
    'response time < 100ms': (r) => r.timings.duration < 100,
  }) || errorRate.add(1);

  sleep(1);
}

Results: Exceeding Performance Targets

Performance Metrics Achieved

Transaction Volume:

  • Successfully scaled to 50 million daily transactions
  • Peak performance: 2,400 TPS sustained
  • Burst capacity: 3,500 TPS for short periods

Response Time:

  • Average response time: 45ms (down from 200ms)
  • 95th percentile: 85ms (target was <100ms)
  • 99th percentile: 150ms
  • 99.9th percentile: 300ms

Availability & Reliability:

  • 99.997% uptime achieved
  • Zero payment failures during peak hours
  • Automated failover in under 30 seconds
  • Mean Time to Recovery (MTTR): 2.5 minutes

Cost Optimization Results

Despite the 10x increase in transaction volume:

  • Infrastructure costs: only a 4x increase (60% lower cost per transaction)
  • Operational overhead: 70% reduction through automation
  • Development velocity: 200% faster deployment cycles

Advanced Optimizations: The Secret Sauce

Connection Pool Optimization

// Optimized Database Connection Pool
type DatabasePool struct {
	writePool *sql.DB
	readPools []*sql.DB
	poolIndex int64
}

func createConnection(endpoint string, maxOpen, maxIdle int) *sql.DB {
	db, err := sql.Open("postgres", endpoint)
	if err != nil {
		log.Fatalf("connect %s: %v", endpoint, err)
	}
	// database/sql exposes pool limits as setters, not a config struct
	db.SetMaxOpenConns(maxOpen)
	db.SetMaxIdleConns(maxIdle)
	db.SetConnMaxLifetime(5 * time.Minute)
	db.SetConnMaxIdleTime(1 * time.Minute)
	return db
}

func NewDatabasePool() *DatabasePool {
	writeDB := createConnection(writeEndpoint, 50, 25)

	// Create multiple read pools for load distribution
	readPools := make([]*sql.DB, 5)
	for i := range readPools {
		readPools[i] = createConnection(readEndpoint, 30, 15)
	}

	return &DatabasePool{
		writePool: writeDB,
		readPools: readPools,
	}
}

func (dp *DatabasePool) GetReadConnection() *sql.DB {
	// Round-robin load balancing across read replicas
	index := atomic.AddInt64(&dp.poolIndex, 1) % int64(len(dp.readPools))
	return dp.readPools[index]
}

JVM Tuning for Kubernetes

# Optimized JVM Configuration for Payment Services
apiVersion: v1
kind: ConfigMap
metadata:
  name: jvm-config
  namespace: payments
data:
  JAVA_OPTS: |
    -Xms2g
    -Xmx3g
    -XX:+UseG1GC
    -XX:MaxGCPauseMillis=20
    -XX:+UnlockExperimentalVMOptions
    -XX:+UseContainerSupport
    -XX:MaxRAMPercentage=75.0
    -Djava.security.egd=file:/dev/./urandom
    -Dspring.jmx.enabled=false
    -Dmanagement.endpoints.jmx.exposure.exclude=*

Monitoring and Observability: Full Visibility

Comprehensive Metrics Dashboard

We implemented a complete observability stack:

# Prometheus ServiceMonitor for Payment Metrics
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: payment-metrics
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: payment-processor
  endpoints:
    - port: metrics
      interval: 15s
      path: /actuator/prometheus

Key Metrics Tracked:

  • Transaction throughput and latency
  • Error rates by service and endpoint
  • Database connection pool utilization
  • JVM garbage collection metrics
  • Kubernetes resource utilization
  • Business metrics (revenue, conversion rates)

The STAQI Difference: Beyond Technology

This project's success came from combining:

Technical Excellence:

  • Deep understanding of high-frequency trading systems
  • Expertise in microservices architecture patterns
  • Performance optimization at every layer

Operational Excellence:

  • 24/7 monitoring and support
  • Proactive performance tuning
  • Rapid incident response capabilities

Business Understanding:

  • Financial services compliance requirements
  • Revenue impact of performance issues
  • Scalability planning for business growth

Future-Proofing: What's Next

The architecture we built is designed for continued growth:

  • Machine Learning Integration: Fraud detection and performance optimization
  • Global Expansion: Multi-region deployment capabilities
  • New Payment Methods: Cryptocurrency and digital wallet support
  • Real-time Analytics: Enhanced business intelligence and reporting

Key Takeaways for High-Volume Systems

1. Design for Scale from Day One

Don't wait until you hit limits. Build your architecture assuming 10x current volume.

2. Cache Everything (Intelligently)

Multi-level caching strategy can provide 10x performance improvements.

3. Measure Everything

You can't optimize what you don't measure. Comprehensive monitoring is essential.

4. Test at Scale

Load testing should exceed your expected peak by at least 50%.

5. Automate Operations

Manual processes become bottlenecks at scale. Automate everything possible.

The journey from 5 million to 50 million daily transactions required more than just scaling existing systems—it demanded a complete rethinking of architecture, operations, and monitoring. With the right approach and expertise, these challenges become opportunities for business growth and technical innovation.

Need to scale your payment infrastructure? Contact STAQI Technologies to learn how we can help your organization handle massive transaction volumes with exceptional performance.
