EKS Monitoring & Observability: Production-Grade Implementation Guide

Comprehensive guide to implementing enterprise-grade monitoring, logging, and observability for EKS clusters using Prometheus, Grafana, ELK stack, and AWS native services for production environments.

STAQI Technologies

March 22, 2024

EKS, Monitoring, Observability, Prometheus, Grafana, AWS

Production EKS environments require comprehensive monitoring and observability to ensure reliability, performance, and rapid incident resolution. This guide provides a complete implementation of enterprise-grade monitoring solutions combining open-source tools with AWS native services.

Introduction

Effective observability in EKS environments encompasses metrics collection, log aggregation, distributed tracing, and alerting across multiple layers: infrastructure, Kubernetes platform, and applications. This guide demonstrates how to implement a robust monitoring stack that scales with your production workloads.

Observability Architecture Overview

Three Pillars of Observability

Modern observability practice rests on three core pillars (metrics, logs, and traces), supported by alerting and dashboards:

# observability-strategy.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: observability-strategy
data:
  metrics: "Prometheus + Grafana + CloudWatch"
  logging: "Fluent Bit + ElasticSearch + CloudWatch Logs"
  tracing: "Jaeger + AWS X-Ray + OpenTelemetry"
  alerting: "AlertManager + SNS + PagerDuty"
  dashboards: "Grafana + CloudWatch Dashboards"

Monitoring Stack Components

Comprehensive monitoring architecture for EKS:

graph TB
  A[EKS Cluster] --> B[Prometheus]
  A --> C[Fluent Bit]
  A --> D[Jaeger]
  B --> E[Grafana]
  C --> F[ElasticSearch]
  D --> G[Jaeger UI]
  B --> H[AlertManager]
  H --> I[SNS/Slack]
  A --> J[CloudWatch]
  J --> K[CloudWatch Dashboards]

Prometheus Implementation

Prometheus Operator Deployment

Deploy Prometheus, Alertmanager, and Grafana using the kube-prometheus-stack Helm chart, customized through a values file:

# prometheus-values.yaml
prometheus:
  prometheusSpec:
    retention: 15d
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: gp3-encrypted
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 100Gi
    resources:
      requests:
        memory: 2Gi
        cpu: 1000m
      limits:
        memory: 8Gi
        cpu: 4000m
    nodeSelector:
      node-type: monitoring
    tolerations:
      - key: "monitoring"
        operator: "Equal"
        value: "true"
        effect: "NoSchedule"
    # External labels for federation
    externalLabels:
      cluster: production-eks
      region: us-west-2
      environment: production
    # Remote write to Amazon Managed Service for Prometheus
    remoteWrite:
      - url: https://aps-workspaces.us-west-2.amazonaws.com/workspaces/ws-12345678-1234-1234-1234-123456789012/api/v1/remote_write
        sigv4:
          region: us-west-2
        writeRelabelConfigs:
          - sourceLabels: [__name__]
            regex: 'container_.*|kubelet_.*|kube_.*'
            action: keep

alertmanager:
  alertmanagerSpec:
    storage:
      volumeClaimTemplate:
        spec:
          storageClassName: gp3-encrypted
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 10Gi
    resources:
      requests:
        memory: 256Mi
        cpu: 100m
      limits:
        memory: 1Gi
        cpu: 500m

grafana:
  enabled: true
  persistence:
    enabled: true
    storageClassName: gp3-encrypted
    size: 10Gi
  resources:
    requests:
      memory: 512Mi
      cpu: 250m
    limits:
      memory: 2Gi
      cpu: 1000m
  # AWS IAM role for CloudWatch access
  serviceAccount:
    annotations:
      eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/GrafanaCloudWatchRole
  grafana.ini:
    auth.anonymous:
      enabled: false
    security:
      admin_password: ${GRAFANA_ADMIN_PASSWORD}
  plugins:
    - grafana-piechart-panel
    - grafana-worldmap-panel
    - cloudwatch
  sidecar:
    dashboards:
      enabled: true
      searchNamespace: ALL
    datasources:
      enabled: true
      searchNamespace: ALL

nodeExporter:
  enabled: true

kubeStateMetrics:
  enabled: true

prometheusOperator:
  resources:
    requests:
      memory: 256Mi
      cpu: 100m
    limits:
      memory: 512Mi
      cpu: 200m
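
With the values file above saved as prometheus-values.yaml, the stack can be installed from the prometheus-community Helm repository. A minimal sketch, assuming Helm 3 and cluster-admin access:

# Register the chart repository and install/upgrade the stack
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

helm upgrade --install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --values prometheus-values.yaml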

Custom ServiceMonitor Configurations

Configure application-specific monitoring:

# application-monitoring.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: api-service-monitor
  namespace: production
  labels:
    app: api-server
spec:
  selector:
    matchLabels:
      app: api-server
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics
      relabelings:
        - sourceLabels: [__meta_kubernetes_pod_name]
          targetLabel: pod
        - sourceLabels: [__meta_kubernetes_pod_node_name]
          targetLabel: node
        - sourceLabels: [__meta_kubernetes_namespace]
          targetLabel: namespace
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: nginx-ingress-monitor
  namespace: ingress-nginx
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: ingress-nginx
  endpoints:
    - port: prometheus
      interval: 30s
      path: /metrics
---
# Custom PodMonitor for specific applications
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: application-pods
  namespace: production
spec:
  selector:
    matchLabels:
      monitoring: enabled
  podMetricsEndpoints:
    - port: metrics
      interval: 15s
      path: /metrics
      relabelings:
        - sourceLabels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
          action: keep
          regex: true
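
For api-service-monitor to discover targets, the application's Service must carry the app: api-server label and expose a port named metrics. A hypothetical Service sketch illustrating that contract (the 8080 port number is an assumption):

apiVersion: v1
kind: Service
metadata:
  name: api-server
  namespace: production
  labels:
    app: api-server        # must match the ServiceMonitor's spec.selector
spec:
  selector:
    app: api-server
  ports:
    - name: metrics        # must match the endpoint port name in the ServiceMonitor
      port: 8080
      targetPort: 8080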

PrometheusRules for Alerting

Comprehensive alerting rules for production environments:

# prometheus-rules.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: eks-cluster-alerts
  namespace: monitoring
spec:
  groups:
    - name: kubernetes-cluster
      rules:
        - alert: KubernetesNodeReady
          expr: kube_node_status_condition{condition="Ready",status="true"} == 0
          for: 10m
          labels:
            severity: critical
          annotations:
            summary: "Kubernetes Node not ready"
            description: "Node {{ $labels.node }} has been unready for more than 10 minutes"
        - alert: KubernetesPodCrashLooping
          expr: increase(kube_pod_container_status_restarts_total[1h]) > 5
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Pod is crash looping"
            description: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping"
        - alert: KubernetesNodeHighCPU
          expr: (100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)) > 90
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "High CPU usage on node"
            description: "Node {{ $labels.instance }} has CPU usage above 90% for 15 minutes"
        - alert: KubernetesNodeHighMemory
          expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 90
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "High memory usage on node"
            description: "Node {{ $labels.instance }} has memory usage above 90%"
        - alert: KubernetesPersistentVolumeSpaceUsage
          expr: 100 * (kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes) > 85
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "PersistentVolume space usage is high"
            description: "PV {{ $labels.persistentvolumeclaim }} usage is {{ $value }}%"
    - name: application-alerts
      rules:
        - alert: ApplicationHighErrorRate
          expr: (sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))) * 100 > 5
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "High error rate detected"
            description: "Application error rate is {{ $value }}% over the last 5 minutes"
        - alert: ApplicationHighLatency
          expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 1
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "High application latency"
            description: "95th percentile latency is {{ $value }}s"
        - alert: ApplicationLowThroughput
          expr: sum(rate(http_requests_total[5m])) < 10
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "Low application throughput"
            description: "Request rate is {{ $value }} requests/second"
    - name: eks-specific
      rules:
        - alert: EKSClusterAutoscalerErrors
          expr: increase(cluster_autoscaler_errors_total[10m]) > 5
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Cluster Autoscaler experiencing errors"
            description: "Cluster Autoscaler has {{ $value }} errors in 10 minutes"
        - alert: EKSNodeGroupCapacityIssue
          expr: kube_node_status_capacity_pods - kube_node_status_allocatable_pods < 5
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Node approaching pod capacity"
            description: "Node {{ $labels.node }} has less than 5 available pod slots"
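
Rule changes are worth validating before they reach the cluster. A sketch of two offline checks, assuming kubectl, yq (v4), and promtool are available locally:

# Validate the manifest against the PrometheusRule CRD schema (server-side dry run)
kubectl apply --dry-run=server -f prometheus-rules.yaml

# Optionally lint the PromQL itself: extract spec.groups into a plain
# Prometheus rules file and run promtool over it
yq '.spec' prometheus-rules.yaml > rules-only.yaml
promtool check rules rules-only.yaml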

Logging Implementation

Fluent Bit Configuration

Deploy Fluent Bit for comprehensive log collection:

# fluent-bit-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
  namespace: amazon-cloudwatch
data:
  fluent-bit.conf: |
    [SERVICE]
        Flush                     5
        Grace                     30
        Log_Level                 info
        Daemon                    off
        Parsers_File              parsers.conf
        HTTP_Server               On
        HTTP_Listen               0.0.0.0
        HTTP_Port                 2020
        storage.path              /var/fluent-bit/state/flb-storage/
        storage.sync              normal
        storage.checksum          off
        storage.backlog.mem_limit 5M

    [INPUT]
        Name                tail
        Tag                 application.*
        Exclude_Path        /var/log/containers/cloudwatch-agent*, /var/log/containers/fluent-bit*, /var/log/containers/aws-node*, /var/log/containers/kube-proxy*
        Path                /var/log/containers/*.log
        multiline.parser    docker, cri
        DB                  /var/fluent-bit/state/flb_container.db
        Mem_Buf_Limit       50MB
        Skip_Long_Lines     On
        Refresh_Interval    10
        Rotate_Wait         30
        storage.type        filesystem
        Read_from_Head      ${READ_FROM_HEAD}

    [INPUT]
        Name                tail
        Tag                 dataplane.systemd.*
        Path                /var/log/journal
        multiline.parser    systemd
        DB                  /var/fluent-bit/state/systemd.db
        Mem_Buf_Limit       25MB
        Skip_Long_Lines     On
        Refresh_Interval    10
        Read_from_Head      ${READ_FROM_HEAD}

    [FILTER]
        Name                kubernetes
        Match               application.*
        Kube_URL            https://kubernetes.default.svc:443
        Kube_Tag_Prefix     application.var.log.containers.
        Merge_Log           On
        Merge_Log_Key       log_processed
        K8S-Logging.Parser  On
        K8S-Logging.Exclude Off
        Labels              Off
        Annotations         Off
        Use_Kubelet         On
        Kubelet_Port        10250
        Buffer_Size         0

    [FILTER]
        Name                modify
        Match               dataplane.systemd.*
        Rename              _HOSTNAME hostname
        Rename              _SYSTEMD_UNIT systemd_unit
        Rename              MESSAGE message
        Remove_regex        ^((?!hostname|systemd_unit|message).)*$

    [FILTER]
        Name                aws
        Match               dataplane.*
        imds_version        v1

    [OUTPUT]
        Name                cloudwatch_logs
        Match               application.*
        region              ${AWS_REGION}
        log_group_name      /aws/containerinsights/${CLUSTER_NAME}/application
        log_stream_template $kubernetes['namespace_name'].$kubernetes['pod_name'].$kubernetes['container_name']
        auto_create_group   true
        extra_user_agent    container-insights

    [OUTPUT]
        Name                cloudwatch_logs
        Match               dataplane.systemd.*
        region              ${AWS_REGION}
        log_group_name      /aws/containerinsights/${CLUSTER_NAME}/dataplane
        log_stream_template $kubernetes['host'].$systemd_unit
        auto_create_group   true
        extra_user_agent    container-insights

    [OUTPUT]
        Name                elasticsearch
        Match               application.*
        Host                elasticsearch.logging.svc.cluster.local
        Port                9200
        Index               eks-logs
        Type                _doc
        Time_Key            @timestamp
        Time_Key_Format     %Y-%m-%dT%H:%M:%S.%L%z
        Include_Tag_Key     true
        Tag_Key             _tag
        Buffer_Size         false
        tls                 off
        tls.verify          off

  parsers.conf: |
    [PARSER]
        Name        docker
        Format      json
        Time_Key    time
        Time_Format %Y-%m-%dT%H:%M:%S.%LZ

    [PARSER]
        Name        cri
        Format      regex
        Regex       ^(?<time>[^ ]+) (?<stream>stdout|stderr) (?<logtag>[^ ]*) (?<message>.*)$
        Time_Key    time
        Time_Format %Y-%m-%dT%H:%M:%S.%L%z

    [PARSER]
        Name        systemd
        Format      regex
        Regex       ^\<(?<pri>[0-9]+)\>(?<time>[^ ]* {1,2}[^ ]* [^ ]*) (?<host>[^ ]*) (?<ident>[a-zA-Z0-9_\/\.\-]*)(?:\[(?<pid>[0-9]+)\])?(?:[^\:]*\:)? *(?<message>.*)$
        Time_Key    time
        Time_Format %b %d %H:%M:%S
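
The ${CLUSTER_NAME}, ${AWS_REGION}, and ${READ_FROM_HEAD} placeholders are resolved from environment variables on the Fluent Bit DaemonSet, which the Container Insights quickstart populates from a fluent-bit-cluster-info ConfigMap. A hedged sketch of that supporting setup, using the cluster name and region assumed throughout this guide:

# Namespace and cluster metadata consumed by the Fluent Bit DaemonSet
kubectl create namespace amazon-cloudwatch --dry-run=client -o yaml | kubectl apply -f -

kubectl create configmap fluent-bit-cluster-info \
  --namespace amazon-cloudwatch \
  --from-literal=cluster.name=production-eks \
  --from-literal=logs.region=us-west-2 \
  --from-literal=read.from.head=Off \
  --from-literal=read.from.tail=On \
  --from-literal=http.server=On \
  --from-literal=http.port=2020

# Apply the Fluent Bit configuration above; the DaemonSet (from the
# Container Insights manifests) mounts it as its main configuration
kubectl apply -f fluent-bit-config.yaml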

ElasticSearch and Kibana Setup

Deploy ELK stack for advanced log analysis:

# elasticsearch.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: elasticsearch
  namespace: logging
spec:
  serviceName: elasticsearch
  replicas: 3
  selector:
    matchLabels:
      app: elasticsearch
  template:
    metadata:
      labels:
        app: elasticsearch
    spec:
      containers:
        - name: elasticsearch
          image: docker.elastic.co/elasticsearch/elasticsearch:8.8.0
          resources:
            requests:
              memory: 2Gi
              cpu: 1000m
            limits:
              memory: 4Gi
              cpu: 2000m
          ports:
            - containerPort: 9200
              name: rest
            - containerPort: 9300
              name: inter-node
          volumeMounts:
            - name: data
              mountPath: /usr/share/elasticsearch/data
          env:
            - name: cluster.name
              value: k8s-logs
            - name: node.name
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: discovery.seed_hosts
              value: "elasticsearch-0.elasticsearch,elasticsearch-1.elasticsearch,elasticsearch-2.elasticsearch"
            - name: cluster.initial_master_nodes
              value: "elasticsearch-0,elasticsearch-1,elasticsearch-2"
            - name: ES_JAVA_OPTS
              value: "-Xms1g -Xmx1g"
            - name: xpack.security.enabled
              value: "false"
            - name: xpack.monitoring.collection.enabled
              value: "true"
      initContainers:
        - name: fix-permissions
          image: busybox
          command: ["sh", "-c", "chown -R 1000:1000 /usr/share/elasticsearch/data"]
          securityContext:
            privileged: true
          volumeMounts:
            - name: data
              mountPath: /usr/share/elasticsearch/data
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: gp3-encrypted
        resources:
          requests:
            storage: 100Gi
---
apiVersion: v1
kind: Service
metadata:
  name: elasticsearch
  namespace: logging
spec:
  selector:
    app: elasticsearch
  clusterIP: None
  ports:
    - port: 9200
      name: rest
    - port: 9300
      name: inter-node
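
To complete the Kibana half of the setup, a minimal Deployment and Service can point at the headless Elasticsearch service above. A sketch, keeping the image version in line with Elasticsearch 8.8.0 (resource sizes are assumptions):

# kibana.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kibana
  namespace: logging
spec:
  replicas: 1
  selector:
    matchLabels:
      app: kibana
  template:
    metadata:
      labels:
        app: kibana
    spec:
      containers:
        - name: kibana
          image: docker.elastic.co/kibana/kibana:8.8.0
          env:
            - name: ELASTICSEARCH_HOSTS
              value: "http://elasticsearch.logging.svc.cluster.local:9200"
          ports:
            - containerPort: 5601
              name: http
          resources:
            requests:
              memory: 1Gi
              cpu: 500m
            limits:
              memory: 2Gi
              cpu: 1000m
---
apiVersion: v1
kind: Service
metadata:
  name: kibana
  namespace: logging
spec:
  selector:
    app: kibana
  ports:
    - name: http
      port: 5601
      targetPort: 5601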

Advanced Log Processing

Implement structured logging and log enrichment:

# log-processor.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: logstash-config
  namespace: logging
data:
  logstash.yml: |
    http.host: "0.0.0.0"
    xpack.monitoring.elasticsearch.hosts: ["http://elasticsearch:9200"]
  pipelines.yml: |
    - pipeline.id: kubernetes
      path.config: "/usr/share/logstash/pipeline/kubernetes.conf"
  kubernetes.conf: |
    input {
      beats {
        port => 5044
      }
    }
    filter {
      if [kubernetes] {
        # Parse JSON logs
        if [message] =~ /^\{.*\}$/ {
          json {
            source => "message"
            target => "json"
          }
        }
        # Extract log level
        grok {
          match => { "message" => "%{LOGLEVEL:log_level}" }
          tag_on_failure => []
        }
        # Parse timestamps
        date {
          match => [ "timestamp", "ISO8601" ]
        }
        # Enrich with cluster information
        mutate {
          add_field => {
            "cluster_name" => "${CLUSTER_NAME}"
            "environment" => "${ENVIRONMENT}"
          }
        }
        # Remove sensitive information
        mutate {
          remove_field => [ "password", "token", "secret" ]
        }
      }
    }
    output {
      elasticsearch {
        hosts => ["elasticsearch:9200"]
        index => "logstash-kubernetes-%{+YYYY.MM.dd}"
      }
      # Send critical errors to CloudWatch
      if [log_level] == "ERROR" or [log_level] == "FATAL" {
        cloudwatch_logs {
          log_group_name => "/aws/eks/critical-errors"
          log_stream_name => "%{kubernetes.namespace}-%{kubernetes.pod_name}"
          region => "${AWS_REGION}"
        }
      }
    }
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: logstash
  namespace: logging
spec:
  replicas: 2
  selector:
    matchLabels:
      app: logstash
  template:
    metadata:
      labels:
        app: logstash
    spec:
      containers:
        - name: logstash
          image: docker.elastic.co/logstash/logstash:8.8.0
          resources:
            requests:
              memory: 1Gi
              cpu: 500m
            limits:
              memory: 2Gi
              cpu: 1000m
          ports:
            - containerPort: 5044
          volumeMounts:
            - name: config
              mountPath: /usr/share/logstash/config
            - name: pipeline
              mountPath: /usr/share/logstash/pipeline
          env:
            - name: CLUSTER_NAME
              value: "production-eks"
            - name: ENVIRONMENT
              value: "production"
            - name: AWS_REGION
              value: "us-west-2"
      volumes:
        - name: config
          configMap:
            name: logstash-config
            items:
              - key: logstash.yml
                path: logstash.yml
              - key: pipelines.yml
                path: pipelines.yml
        - name: pipeline
          configMap:
            name: logstash-config
            items:
              - key: kubernetes.conf
                path: kubernetes.conf

Distributed Tracing

Jaeger Implementation

Deploy Jaeger for distributed tracing:

# jaeger-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: jaeger
  namespace: observability
spec:
  replicas: 1
  selector:
    matchLabels:
      app: jaeger
  template:
    metadata:
      labels:
        app: jaeger
    spec:
      containers:
        - name: jaeger
          image: jaegertracing/all-in-one:1.47
          ports:
            - containerPort: 16686
              name: ui
            - containerPort: 14268
              name: collector
            - containerPort: 14250
              name: grpc
            - containerPort: 6831
              name: agent-compact
            - containerPort: 6832
              name: agent-binary
          env:
            - name: COLLECTOR_ZIPKIN_HOST_PORT
              value: ":9411"
            - name: SPAN_STORAGE_TYPE
              value: elasticsearch
            - name: ES_SERVER_URLS
              value: http://elasticsearch.logging.svc.cluster.local:9200
            - name: ES_INDEX_PREFIX
              value: jaeger
          resources:
            requests:
              memory: 512Mi
              cpu: 200m
            limits:
              memory: 1Gi
              cpu: 500m
---
apiVersion: v1
kind: Service
metadata:
  name: jaeger
  namespace: observability
spec:
  selector:
    app: jaeger
  ports:
    - name: ui
      port: 16686
      targetPort: 16686
    - name: collector
      port: 14268
      targetPort: 14268
    - name: grpc
      port: 14250
      targetPort: 14250
    - name: agent-compact
      port: 6831
      targetPort: 6831
      protocol: UDP
    - name: agent-binary
      port: 6832
      targetPort: 6832
      protocol: UDP
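
A quick way to verify the deployment is to port-forward the UI and point an instrumented service at the collector endpoint. A sketch, assuming kubectl access to the observability namespace:

# Open the Jaeger UI locally at http://localhost:16686
kubectl port-forward -n observability svc/jaeger 16686:16686

# HTTP (Thrift) collector endpoint most client libraries can be configured with:
#   http://jaeger.observability.svc.cluster.local:14268/api/traces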

OpenTelemetry Collector

Deploy OpenTelemetry for comprehensive telemetry collection:

# otel-collector.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-collector-config
  namespace: observability
data:
  config.yaml: |
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318
      prometheus:
        config:
          scrape_configs:
            - job_name: 'kubernetes-pods'
              kubernetes_sd_configs:
                - role: pod
              relabel_configs:
                - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
                  action: keep
                  regex: true
      jaeger:
        protocols:
          grpc:
            endpoint: 0.0.0.0:14250
          thrift_http:
            endpoint: 0.0.0.0:14268

    processors:
      batch:
        timeout: 1s
        send_batch_size: 1024
      memory_limiter:
        limit_mib: 512
      attributes:
        actions:
          - key: cluster.name
            value: production-eks
            action: upsert
          - key: environment
            value: production
            action: upsert

    exporters:
      jaeger:
        endpoint: jaeger.observability.svc.cluster.local:14250
        tls:
          insecure: true
      prometheus:
        endpoint: 0.0.0.0:8889
      awsxray:
        region: us-west-2
        no_verify_ssl: false
      logging:
        loglevel: debug

    service:
      pipelines:
        traces:
          receivers: [otlp, jaeger]
          processors: [memory_limiter, batch, attributes]
          exporters: [jaeger, awsxray]
        metrics:
          receivers: [otlp, prometheus]
          processors: [memory_limiter, batch, attributes]
          exporters: [prometheus]
        logs:
          receivers: [otlp]
          processors: [memory_limiter, batch, attributes]
          exporters: [logging]
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: otel-collector
  namespace: observability
spec:
  replicas: 2
  selector:
    matchLabels:
      app: otel-collector
  template:
    metadata:
      labels:
        app: otel-collector
    spec:
      serviceAccountName: otel-collector
      containers:
        - name: otel-collector
          image: otel/opentelemetry-collector-contrib:0.81.0
          command:
            - "/otelcol-contrib"
            - "--config=/conf/config.yaml"
          resources:
            requests:
              memory: 512Mi
              cpu: 200m
            limits:
              memory: 1Gi
              cpu: 500m
          ports:
            - containerPort: 4317
              name: otlp-grpc
            - containerPort: 4318
              name: otlp-http
            - containerPort: 8889
              name: prometheus
          volumeMounts:
            - name: config
              mountPath: /conf
      volumes:
        - name: config
          configMap:
            name: otel-collector-config
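
The collector Deployment above exposes OTLP and Prometheus ports but defines no Service, so workloads have nothing stable to send telemetry to. A sketch of a matching Service, plus the standard OTLP endpoint variable an instrumented pod would use (the Service name is an assumption):

apiVersion: v1
kind: Service
metadata:
  name: otel-collector
  namespace: observability
spec:
  selector:
    app: otel-collector
  ports:
    - name: otlp-grpc
      port: 4317
      targetPort: 4317
    - name: otlp-http
      port: 4318
      targetPort: 4318
    - name: prometheus
      port: 8889
      targetPort: 8889
---
# Example environment variable on an instrumented Deployment
# (standard OpenTelemetry SDK setting):
#   - name: OTEL_EXPORTER_OTLP_ENDPOINT
#     value: "http://otel-collector.observability.svc.cluster.local:4317"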

AWS Native Monitoring Integration

CloudWatch Container Insights

Enable comprehensive AWS CloudWatch monitoring:

# cloudwatch-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: cwagentconfig
  namespace: amazon-cloudwatch
data:
  cwagentconfig.json: |
    {
      "logs": {
        "metrics_collected": {
          "kubernetes": {
            "cluster_name": "production-eks",
            "metrics_collection_interval": 60
          }
        },
        "force_flush_interval": 5
      },
      "metrics": {
        "namespace": "CWAgent",
        "metrics_collected": {
          "cpu": {
            "measurement": ["cpu_usage_idle", "cpu_usage_iowait", "cpu_usage_user", "cpu_usage_system"],
            "metrics_collection_interval": 60,
            "totalcpu": false
          },
          "disk": {
            "measurement": ["used_percent"],
            "metrics_collection_interval": 60,
            "resources": ["*"]
          },
          "diskio": {
            "measurement": ["io_time"],
            "metrics_collection_interval": 60,
            "resources": ["*"]
          },
          "mem": {
            "measurement": ["mem_used_percent"],
            "metrics_collection_interval": 60
          },
          "netstat": {
            "measurement": ["tcp_established", "tcp_time_wait"],
            "metrics_collection_interval": 60
          },
          "swap": {
            "measurement": ["swap_used_percent"],
            "metrics_collection_interval": 60
          }
        }
      }
    }
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: cloudwatch-agent
  namespace: amazon-cloudwatch
spec:
  selector:
    matchLabels:
      name: cloudwatch-agent
  template:
    metadata:
      labels:
        name: cloudwatch-agent
    spec:
      containers:
        - name: cloudwatch-agent
          image: amazon/cloudwatch-agent:1.247356.0b251814
          ports:
            - containerPort: 8125
              hostPort: 8125
              protocol: UDP
          resources:
            limits:
              cpu: 200m
              memory: 200Mi
            requests:
              cpu: 200m
              memory: 200Mi
          env:
            - name: HOST_IP
              valueFrom:
                fieldRef:
                  fieldPath: status.hostIP
            - name: HOST_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
            - name: K8S_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
          volumeMounts:
            - name: cwagentconfig
              mountPath: /etc/cwagentconfig
            - name: rootfs
              mountPath: /rootfs
              readOnly: true
            - name: dockersock
              mountPath: /var/run/docker.sock
              readOnly: true
            - name: varlibdocker
              mountPath: /var/lib/docker
              readOnly: true
            - name: sys
              mountPath: /sys
              readOnly: true
            - name: devdisk
              mountPath: /dev/disk
              readOnly: true
      volumes:
        - name: cwagentconfig
          configMap:
            name: cwagentconfig
        - name: rootfs
          hostPath:
            path: /
        - name: dockersock
          hostPath:
            path: /var/run/docker.sock
        - name: varlibdocker
          hostPath:
            path: /var/lib/docker
        - name: sys
          hostPath:
            path: /sys
        - name: devdisk
          hostPath:
            path: /dev/disk
      terminationGracePeriodSeconds: 60
      serviceAccountName: cloudwatch-agent
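
The DaemonSet references a cloudwatch-agent service account, which needs IAM permissions to publish metrics and logs. One way to create it with IAM Roles for Service Accounts, assuming eksctl is installed and an OIDC provider is already associated with the cluster:

# Create the service account bound to the CloudWatchAgentServerPolicy managed policy
eksctl create iamserviceaccount \
  --cluster production-eks \
  --namespace amazon-cloudwatch \
  --name cloudwatch-agent \
  --attach-policy-arn arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy \
  --approve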

AWS Load Balancer Controller Metrics

Monitor ALB/NLB performance:

# alb-controller-monitoring.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: aws-load-balancer-controller
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: aws-load-balancer-controller
  endpoints:
    - port: webhook-server
      path: /metrics
      interval: 30s
---
# Custom metrics for ALB
apiVersion: v1
kind: ConfigMap
metadata:
  name: alb-custom-metrics
  namespace: monitoring
data:
  record.rules: |
    groups:
      - name: alb.rules
        rules:
          - record: alb:request_count_rate5m
            expr: sum(rate(alb_request_count_total[5m])) by (load_balancer)
          - record: alb:target_response_time_p95
            expr: histogram_quantile(0.95, sum(rate(alb_target_response_time_seconds_bucket[5m])) by (load_balancer, le))
          - record: alb:error_rate_5m
            expr: sum(rate(alb_http_code_target_5xx_count_total[5m])) by (load_balancer) / sum(rate(alb_request_count_total[5m])) by (load_balancer)

Grafana Dashboards

Comprehensive EKS Dashboard

Create production-ready Grafana dashboards:

{ "dashboard": { "id": null, "title": "EKS Cluster Overview", "tags": ["kubernetes", "eks"], "timezone": "browser", "panels": [ { "id": 1, "title": "Cluster Resource Usage", "type": "stat", "targets": [ { "expr": "sum(kube_node_status_allocatable{resource=\"cpu\"}) - sum(kube_node_status_capacity{resource=\"cpu\"})", "legendFormat": "CPU Available" }, { "expr": "sum(kube_node_status_allocatable{resource=\"memory\"}) - sum(kube_node_status_capacity{resource=\"memory\"})", "legendFormat": "Memory Available" } ], "fieldConfig": { "defaults": { "color": { "mode": "palette-classic" }, "unit": "short" } } }, { "id": 2, "title": "Pod Status Distribution", "type": "piechart", "targets": [ { "expr": "sum by (phase) (kube_pod_status_phase)", "legendFormat": "{{phase}}" } ] }, { "id": 3, "title": "Node CPU Usage", "type": "graph", "targets": [ { "expr": "100 - (avg by (instance) (irate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)", "legendFormat": "{{instance}}" } ], "yAxes": [ { "max": 100, "min": 0, "unit": "percent" } ] }, { "id": 4, "title": "Application Error Rate", "type": "graph", "targets": [ { "expr": "sum(rate(http_requests_total{status=~\"5..\"}[5m])) by (service) / sum(rate(http_requests_total[5m])) by (service) * 100", "legendFormat": "{{service}}" } ] } ], "time": { "from": "now-1h", "to": "now" }, "refresh": "30s" } }

Alerting Configuration

AlertManager Configuration

Configure comprehensive alerting:

# alertmanager-config.yaml
apiVersion: v1
kind: Secret
metadata:
  name: alertmanager-main
  namespace: monitoring
stringData:
  alertmanager.yml: |
    global:
      slack_api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
      smtp_smarthost: 'smtp.company.com:587'
      smtp_from: 'alerts@company.com'

    route:
      group_by: ['alertname', 'cluster', 'service']
      group_wait: 10s
      group_interval: 10s
      repeat_interval: 1h
      receiver: 'web.hook'
      routes:
        - match:
            severity: critical
          receiver: 'critical-alerts'
          continue: true
        - match:
            severity: warning
          receiver: 'warning-alerts'

    receivers:
      - name: 'web.hook'
        webhook_configs:
          - url: 'http://webhook-service.monitoring.svc.cluster.local:8080/webhook'
      - name: 'critical-alerts'
        slack_configs:
          - channel: '#alerts-critical'
            title: 'Critical Alert: {{ .GroupLabels.alertname }}'
            text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
            send_resolved: true
        pagerduty_configs:
          - routing_key: 'YOUR_PAGERDUTY_INTEGRATION_KEY'
            description: '{{ .GroupLabels.alertname }}: {{ .CommonAnnotations.summary }}'
        email_configs:
          - to: 'oncall@company.com'
            subject: 'Critical Alert: {{ .GroupLabels.alertname }}'
            body: |
              {{ range .Alerts }}
              Alert: {{ .Annotations.summary }}
              Description: {{ .Annotations.description }}
              {{ end }}
      - name: 'warning-alerts'
        slack_configs:
          - channel: '#alerts-warning'
            title: 'Warning: {{ .GroupLabels.alertname }}'
            text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
            send_resolved: true

    inhibit_rules:
      - source_match:
          severity: 'critical'
        target_match:
          severity: 'warning'
        equal: ['alertname', 'instance']
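
Routing trees are easy to get wrong, so it is worth checking the file before replacing the secret. A sketch using amtool, which ships alongside Alertmanager:

# Syntax and semantic check of the Alertmanager configuration
amtool check-config alertmanager.yml

# Preview which receiver a given alert would be routed to
amtool config routes test --config.file=alertmanager.yml \
  severity=critical alertname=KubernetesNodeReady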

Best Practices Summary

Monitoring Strategy Checklist

  • Metrics Collection: Prometheus with 15-day retention
  • Log Aggregation: Fluent Bit → ElasticSearch/CloudWatch
  • Distributed Tracing: Jaeger with OpenTelemetry
  • Alerting: Multi-channel alerts with escalation
  • Dashboards: Role-based Grafana dashboards
  • Cost Monitoring: Track monitoring infrastructure costs
  • Security: RBAC for monitoring tools
  • High Availability: Redundant monitoring components

Operational Guidelines

  1. Retention Policies: Balance storage costs with retention needs
  2. Alert Fatigue: Fine-tune alert thresholds to reduce noise
  3. Dashboard Maintenance: Regular review and updates
  4. Performance Impact: Monitor monitoring overhead
  5. Security: Secure monitoring endpoints and data

Conclusion

Implementing comprehensive observability for EKS requires careful orchestration of multiple tools and practices. The architecture outlined in this guide provides production-grade monitoring capabilities that scale with your infrastructure while maintaining operational efficiency.

Success factors include:

  • Layered Approach: Metrics, logs, and traces working together
  • Automation: Automated alerting and response procedures
  • Cost Awareness: Monitoring infrastructure costs alongside application costs
  • Team Alignment: Clear ownership and escalation procedures

Regular review and optimization of monitoring strategies ensures continued effectiveness as your EKS environment evolves and scales.


For more advanced observability strategies and monitoring best practices, follow STAQI Technologies' technical blog.

Ready to implement similar solutions?

Contact STAQI Technologies to learn how our expertise in high-volume systems, security operations, and compliance can benefit your organization.
