EKS Monitoring & Observability: Production-Grade Implementation Guide

Comprehensive guide to implementing enterprise-grade monitoring, logging, and observability for EKS clusters using Prometheus, Grafana, ELK stack, and AWS native services for production environments.

STAQI Technologies

March 22, 2024

EKS, Monitoring, Observability, Prometheus, Grafana, AWS

Production EKS environments require comprehensive monitoring and observability to ensure reliability, performance, and rapid incident resolution. This guide provides a complete implementation of enterprise-grade monitoring solutions combining open-source tools with AWS native services.

Introduction

Effective observability in EKS environments encompasses metrics collection, log aggregation, distributed tracing, and alerting across multiple layers: infrastructure, Kubernetes platform, and applications. This guide demonstrates how to implement a robust monitoring stack that scales with your production workloads.

Observability Architecture Overview

Three Pillars of Observability

Modern observability practice rests on three core pillars (metrics, logs, and traces), supported by alerting and dashboards:

# observability-strategy.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: observability-strategy
data:
  metrics: "Prometheus + Grafana + CloudWatch"
  logging: "Fluent Bit + ElasticSearch + CloudWatch Logs"
  tracing: "Jaeger + AWS X-Ray + OpenTelemetry"
  alerting: "AlertManager + SNS + PagerDuty"
  dashboards: "Grafana + CloudWatch Dashboards"

Monitoring Stack Components

Comprehensive monitoring architecture for EKS:

graph TB
  A[EKS Cluster] --> B[Prometheus]
  A --> C[Fluent Bit]
  A --> D[Jaeger]
  B --> E[Grafana]
  C --> F[ElasticSearch]
  D --> G[Jaeger UI]
  B --> H[AlertManager]
  H --> I[SNS/Slack]
  A --> J[CloudWatch]
  J --> K[CloudWatch Dashboards]

Prometheus Implementation

Prometheus Operator Deployment

Deploy Prometheus, Alertmanager, and Grafana using the kube-prometheus-stack Helm chart, customized through a values file:

# prometheus-values.yaml
prometheus:
  prometheusSpec:
    retention: 15d
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: gp3-encrypted
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 100Gi
    resources:
      requests:
        memory: 2Gi
        cpu: 1000m
      limits:
        memory: 8Gi
        cpu: 4000m
    nodeSelector:
      node-type: monitoring
    tolerations:
      - key: "monitoring"
        operator: "Equal"
        value: "true"
        effect: "NoSchedule"
    # External labels for federation
    externalLabels:
      cluster: production-eks
      region: us-west-2
      environment: production
    # Remote write to Amazon Managed Service for Prometheus
    remoteWrite:
      - url: https://aps-workspaces.us-west-2.amazonaws.com/workspaces/ws-12345678-1234-1234-1234-123456789012/api/v1/remote_write
        sigv4:
          region: us-west-2
        writeRelabelConfigs:
          - sourceLabels: [__name__]
            regex: 'container_.*|kubelet_.*|kube_.*'
            action: keep

alertmanager:
  alertmanagerSpec:
    storage:
      volumeClaimTemplate:
        spec:
          storageClassName: gp3-encrypted
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 10Gi
    resources:
      requests:
        memory: 256Mi
        cpu: 100m
      limits:
        memory: 1Gi
        cpu: 500m

grafana:
  enabled: true
  persistence:
    enabled: true
    storageClassName: gp3-encrypted
    size: 10Gi
  resources:
    requests:
      memory: 512Mi
      cpu: 250m
    limits:
      memory: 2Gi
      cpu: 1000m
  # AWS IAM role for CloudWatch access
  serviceAccount:
    annotations:
      eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/GrafanaCloudWatchRole
  grafana.ini:
    auth.anonymous:
      enabled: false
    security:
      admin_password: ${GRAFANA_ADMIN_PASSWORD}
  plugins:
    - grafana-piechart-panel
    - grafana-worldmap-panel
    - cloudwatch
  sidecar:
    dashboards:
      enabled: true
      searchNamespace: ALL
    datasources:
      enabled: true
      searchNamespace: ALL

nodeExporter:
  enabled: true

kubeStateMetrics:
  enabled: true

prometheusOperator:
  resources:
    requests:
      memory: 256Mi
      cpu: 100m
    limits:
      memory: 512Mi
      cpu: 200m
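
With the values file above saved as prometheus-values.yaml, the stack can be installed from the prometheus-community Helm repository. A minimal sketch, assuming Helm 3 and cluster-admin access:

# Register the chart repository and install/upgrade the stack
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

helm upgrade --install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --values prometheus-values.yaml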

Custom ServiceMonitor Configurations

Configure application-specific monitoring:

# application-monitoring.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: api-service-monitor
  namespace: production
  labels:
    app: api-server
spec:
  selector:
    matchLabels:
      app: api-server
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics
      relabelings:
        - sourceLabels: [__meta_kubernetes_pod_name]
          targetLabel: pod
        - sourceLabels: [__meta_kubernetes_pod_node_name]
          targetLabel: node
        - sourceLabels: [__meta_kubernetes_namespace]
          targetLabel: namespace
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: nginx-ingress-monitor
  namespace: ingress-nginx
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: ingress-nginx
  endpoints:
    - port: prometheus
      interval: 30s
      path: /metrics
---
# Custom PodMonitor for specific applications
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: application-pods
  namespace: production
spec:
  selector:
    matchLabels:
      monitoring: enabled
  podMetricsEndpoints:
    - port: metrics
      interval: 15s
      path: /metrics
      relabelings:
        - sourceLabels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
          action: keep
          regex: true
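
For api-service-monitor to discover targets, the application's Service must carry the app: api-server label and expose a port named metrics. A hypothetical Service sketch illustrating that contract (the 8080 port number is an assumption):

apiVersion: v1
kind: Service
metadata:
  name: api-server
  namespace: production
  labels:
    app: api-server        # must match the ServiceMonitor's spec.selector
spec:
  selector:
    app: api-server
  ports:
    - name: metrics        # must match the endpoint port name in the ServiceMonitor
      port: 8080
      targetPort: 8080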

PrometheusRules for Alerting

Comprehensive alerting rules for production environments:

# prometheus-rules.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: eks-cluster-alerts
  namespace: monitoring
spec:
  groups:
    - name: kubernetes-cluster
      rules:
        - alert: KubernetesNodeReady
          expr: kube_node_status_condition{condition="Ready",status="true"} == 0
          for: 10m
          labels:
            severity: critical
          annotations:
            summary: "Kubernetes Node not ready"
            description: "Node {{ $labels.node }} has been unready for more than 10 minutes"
        - alert: KubernetesPodCrashLooping
          expr: increase(kube_pod_container_status_restarts_total[1h]) > 5
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Pod is crash looping"
            description: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping"
        - alert: KubernetesNodeHighCPU
          expr: (100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)) > 90
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "High CPU usage on node"
            description: "Node {{ $labels.instance }} has CPU usage above 90% for 15 minutes"
        - alert: KubernetesNodeHighMemory
          expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 90
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "High memory usage on node"
            description: "Node {{ $labels.instance }} has memory usage above 90%"
        - alert: KubernetesPersistentVolumeSpaceUsage
          expr: 100 * (kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes) > 85
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "PersistentVolume space usage is high"
            description: "PV {{ $labels.persistentvolumeclaim }} usage is {{ $value }}%"
    - name: application-alerts
      rules:
        - alert: ApplicationHighErrorRate
          expr: (sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))) * 100 > 5
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "High error rate detected"
            description: "Application error rate is {{ $value }}% over the last 5 minutes"
        - alert: ApplicationHighLatency
          expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 1
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "High application latency"
            description: "95th percentile latency is {{ $value }}s"
        - alert: ApplicationLowThroughput
          expr: sum(rate(http_requests_total[5m])) < 10
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "Low application throughput"
            description: "Request rate is {{ $value }} requests/second"
    - name: eks-specific
      rules:
        - alert: EKSClusterAutoscalerErrors
          expr: increase(cluster_autoscaler_errors_total[10m]) > 5
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Cluster Autoscaler experiencing errors"
            description: "Cluster Autoscaler has {{ $value }} errors in 10 minutes"
        - alert: EKSNodeGroupCapacityIssue
          expr: kube_node_status_capacity_pods - kube_node_status_allocatable_pods < 5
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Node approaching pod capacity"
            description: "Node {{ $labels.node }} has less than 5 available pod slots"
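
Rule changes are worth validating before they reach the cluster. A sketch of two offline checks, assuming kubectl, yq (v4), and promtool are available locally:

# Validate the manifest against the PrometheusRule CRD schema (server-side dry run)
kubectl apply --dry-run=server -f prometheus-rules.yaml

# Optionally lint the PromQL itself: extract spec.groups into a plain
# Prometheus rules file and run promtool over it
yq '.spec' prometheus-rules.yaml > rules-only.yaml
promtool check rules rules-only.yaml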

Logging Implementation

Fluent Bit Configuration

Deploy Fluent Bit for comprehensive log collection:

# fluent-bit-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
  namespace: amazon-cloudwatch
data:
  fluent-bit.conf: |
    [SERVICE]
        Flush                     5
        Grace                     30
        Log_Level                 info
        Daemon                    off
        Parsers_File              parsers.conf
        HTTP_Server               On
        HTTP_Listen               0.0.0.0
        HTTP_Port                 2020
        storage.path              /var/fluent-bit/state/flb-storage/
        storage.sync              normal
        storage.checksum          off
        storage.backlog.mem_limit 5M

    [INPUT]
        Name                tail
        Tag                 application.*
        Exclude_Path        /var/log/containers/cloudwatch-agent*, /var/log/containers/fluent-bit*, /var/log/containers/aws-node*, /var/log/containers/kube-proxy*
        Path                /var/log/containers/*.log
        multiline.parser    docker, cri
        DB                  /var/fluent-bit/state/flb_container.db
        Mem_Buf_Limit       50MB
        Skip_Long_Lines     On
        Refresh_Interval    10
        Rotate_Wait         30
        storage.type        filesystem
        Read_from_Head      ${READ_FROM_HEAD}

    [INPUT]
        Name                tail
        Tag                 dataplane.systemd.*
        Path                /var/log/journal
        multiline.parser    systemd
        DB                  /var/fluent-bit/state/systemd.db
        Mem_Buf_Limit       25MB
        Skip_Long_Lines     On
        Refresh_Interval    10
        Read_from_Head      ${READ_FROM_HEAD}

    [FILTER]
        Name                kubernetes
        Match               application.*
        Kube_URL            https://kubernetes.default.svc:443
        Kube_Tag_Prefix     application.var.log.containers.
        Merge_Log           On
        Merge_Log_Key       log_processed
        K8S-Logging.Parser  On
        K8S-Logging.Exclude Off
        Labels              Off
        Annotations         Off
        Use_Kubelet         On
        Kubelet_Port        10250
        Buffer_Size         0

    [FILTER]
        Name                modify
        Match               dataplane.systemd.*
        Rename              _HOSTNAME hostname
        Rename              _SYSTEMD_UNIT systemd_unit
        Rename              MESSAGE message
        Remove_regex        ^((?!hostname|systemd_unit|message).)*$

    [FILTER]
        Name                aws
        Match               dataplane.*
        imds_version        v1

    [OUTPUT]
        Name                cloudwatch_logs
        Match               application.*
        region              ${AWS_REGION}
        log_group_name      /aws/containerinsights/${CLUSTER_NAME}/application
        log_stream_template $kubernetes['namespace_name'].$kubernetes['pod_name'].$kubernetes['container_name']
        auto_create_group   true
        extra_user_agent    container-insights

    [OUTPUT]
        Name                cloudwatch_logs
        Match               dataplane.systemd.*
        region              ${AWS_REGION}
        log_group_name      /aws/containerinsights/${CLUSTER_NAME}/dataplane
        log_stream_template $kubernetes['host'].$systemd_unit
        auto_create_group   true
        extra_user_agent    container-insights

    [OUTPUT]
        Name                elasticsearch
        Match               application.*
        Host                elasticsearch.logging.svc.cluster.local
        Port                9200
        Index               eks-logs
        Type                _doc
        Time_Key            @timestamp
        Time_Key_Format     %Y-%m-%dT%H:%M:%S.%L%z
        Include_Tag_Key     true
        Tag_Key             _tag
        Buffer_Size         false
        tls                 off
        tls.verify          off

  parsers.conf: |
    [PARSER]
        Name        docker
        Format      json
        Time_Key    time
        Time_Format %Y-%m-%dT%H:%M:%S.%LZ

    [PARSER]
        Name        cri
        Format      regex
        Regex       ^(?<time>[^ ]+) (?<stream>stdout|stderr) (?<logtag>[^ ]*) (?<message>.*)$
        Time_Key    time
        Time_Format %Y-%m-%dT%H:%M:%S.%L%z

    [PARSER]
        Name        systemd
        Format      regex
        Regex       ^\<(?<pri>[0-9]+)\>(?<time>[^ ]* {1,2}[^ ]* [^ ]*) (?<host>[^ ]*) (?<ident>[a-zA-Z0-9_\/\.\-]*)(?:\[(?<pid>[0-9]+)\])?(?:[^\:]*\:)? *(?<message>.*)$
        Time_Key    time
        Time_Format %b %d %H:%M:%S
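
The ${CLUSTER_NAME}, ${AWS_REGION}, and ${READ_FROM_HEAD} placeholders are resolved from environment variables on the Fluent Bit DaemonSet, which the Container Insights quickstart populates from a fluent-bit-cluster-info ConfigMap. A hedged sketch of that supporting setup, using the cluster name and region assumed throughout this guide:

# Namespace and cluster metadata consumed by the Fluent Bit DaemonSet
kubectl create namespace amazon-cloudwatch --dry-run=client -o yaml | kubectl apply -f -

kubectl create configmap fluent-bit-cluster-info \
  --namespace amazon-cloudwatch \
  --from-literal=cluster.name=production-eks \
  --from-literal=logs.region=us-west-2 \
  --from-literal=read.from.head=Off \
  --from-literal=read.from.tail=On \
  --from-literal=http.server=On \
  --from-literal=http.port=2020

# Apply the Fluent Bit configuration above; the DaemonSet (from the
# Container Insights manifests) mounts it as its main configuration
kubectl apply -f fluent-bit-config.yaml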

ElasticSearch and Kibana Setup

Deploy ELK stack for advanced log analysis:

# elasticsearch.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: elasticsearch
  namespace: logging
spec:
  serviceName: elasticsearch
  replicas: 3
  selector:
    matchLabels:
      app: elasticsearch
  template:
    metadata:
      labels:
        app: elasticsearch
    spec:
      containers:
        - name: elasticsearch
          image: docker.elastic.co/elasticsearch/elasticsearch:8.8.0
          resources:
            requests:
              memory: 2Gi
              cpu: 1000m
            limits:
              memory: 4Gi
              cpu: 2000m
          ports:
            - containerPort: 9200
              name: rest
            - containerPort: 9300
              name: inter-node
          volumeMounts:
            - name: data
              mountPath: /usr/share/elasticsearch/data
          env:
            - name: cluster.name
              value: k8s-logs
            - name: node.name
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: discovery.seed_hosts
              value: "elasticsearch-0.elasticsearch,elasticsearch-1.elasticsearch,elasticsearch-2.elasticsearch"
            - name: cluster.initial_master_nodes
              value: "elasticsearch-0,elasticsearch-1,elasticsearch-2"
            - name: ES_JAVA_OPTS
              value: "-Xms1g -Xmx1g"
            - name: xpack.security.enabled
              value: "false"
            - name: xpack.monitoring.collection.enabled
              value: "true"
      initContainers:
        - name: fix-permissions
          image: busybox
          command: ["sh", "-c", "chown -R 1000:1000 /usr/share/elasticsearch/data"]
          securityContext:
            privileged: true
          volumeMounts:
            - name: data
              mountPath: /usr/share/elasticsearch/data
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: gp3-encrypted
        resources:
          requests:
            storage: 100Gi
---
apiVersion: v1
kind: Service
metadata:
  name: elasticsearch
  namespace: logging
spec:
  selector:
    app: elasticsearch
  clusterIP: None
  ports:
    - port: 9200
      name: rest
    - port: 9300
      name: inter-node
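
To complete the Kibana half of the setup, a minimal Deployment and Service can point at the headless Elasticsearch service above. A sketch, keeping the image version in line with Elasticsearch 8.8.0 (resource sizes are assumptions):

# kibana.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kibana
  namespace: logging
spec:
  replicas: 1
  selector:
    matchLabels:
      app: kibana
  template:
    metadata:
      labels:
        app: kibana
    spec:
      containers:
        - name: kibana
          image: docker.elastic.co/kibana/kibana:8.8.0
          env:
            - name: ELASTICSEARCH_HOSTS
              value: "http://elasticsearch.logging.svc.cluster.local:9200"
          ports:
            - containerPort: 5601
              name: http
          resources:
            requests:
              memory: 1Gi
              cpu: 500m
            limits:
              memory: 2Gi
              cpu: 1000m
---
apiVersion: v1
kind: Service
metadata:
  name: kibana
  namespace: logging
spec:
  selector:
    app: kibana
  ports:
    - name: http
      port: 5601
      targetPort: 5601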

Advanced Log Processing

Implement structured logging and log enrichment:

# log-processor.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: logstash-config
  namespace: logging
data:
  logstash.yml: |
    http.host: "0.0.0.0"
    xpack.monitoring.elasticsearch.hosts: ["http://elasticsearch:9200"]
  pipelines.yml: |
    - pipeline.id: kubernetes
      path.config: "/usr/share/logstash/pipeline/kubernetes.conf"
  kubernetes.conf: |
    input {
      beats {
        port => 5044
      }
    }
    filter {
      if [kubernetes] {
        # Parse JSON logs
        if [message] =~ /^\{.*\}$/ {
          json {
            source => "message"
            target => "json"
          }
        }
        # Extract log level
        grok {
          match => { "message" => "%{LOGLEVEL:log_level}" }
          tag_on_failure => []
        }
        # Parse timestamps
        date {
          match => [ "timestamp", "ISO8601" ]
        }
        # Enrich with cluster information
        mutate {
          add_field => {
            "cluster_name" => "${CLUSTER_NAME}"
            "environment" => "${ENVIRONMENT}"
          }
        }
        # Remove sensitive information
        mutate {
          remove_field => [ "password", "token", "secret" ]
        }
      }
    }
    output {
      elasticsearch {
        hosts => ["elasticsearch:9200"]
        index => "logstash-kubernetes-%{+YYYY.MM.dd}"
      }
      # Send critical errors to CloudWatch
      if [log_level] == "ERROR" or [log_level] == "FATAL" {
        cloudwatch_logs {
          log_group_name => "/aws/eks/critical-errors"
          log_stream_name => "%{kubernetes.namespace}-%{kubernetes.pod_name}"
          region => "${AWS_REGION}"
        }
      }
    }
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: logstash
  namespace: logging
spec:
  replicas: 2
  selector:
    matchLabels:
      app: logstash
  template:
    metadata:
      labels:
        app: logstash
    spec:
      containers:
        - name: logstash
          image: docker.elastic.co/logstash/logstash:8.8.0
          resources:
            requests:
              memory: 1Gi
              cpu: 500m
            limits:
              memory: 2Gi
              cpu: 1000m
          ports:
            - containerPort: 5044
          volumeMounts:
            - name: config
              mountPath: /usr/share/logstash/config
            - name: pipeline
              mountPath: /usr/share/logstash/pipeline
          env:
            - name: CLUSTER_NAME
              value: "production-eks"
            - name: ENVIRONMENT
              value: "production"
            - name: AWS_REGION
              value: "us-west-2"
      volumes:
        - name: config
          configMap:
            name: logstash-config
            items:
              - key: logstash.yml
                path: logstash.yml
              - key: pipelines.yml
                path: pipelines.yml
        - name: pipeline
          configMap:
            name: logstash-config
            items:
              - key: kubernetes.conf
                path: kubernetes.conf

Distributed Tracing

Jaeger Implementation

Deploy Jaeger for distributed tracing:

# jaeger-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: jaeger
  namespace: observability
spec:
  replicas: 1
  selector:
    matchLabels:
      app: jaeger
  template:
    metadata:
      labels:
        app: jaeger
    spec:
      containers:
        - name: jaeger
          image: jaegertracing/all-in-one:1.47
          ports:
            - containerPort: 16686
              name: ui
            - containerPort: 14268
              name: collector
            - containerPort: 14250
              name: grpc
            - containerPort: 6831
              name: agent-compact
            - containerPort: 6832
              name: agent-binary
          env:
            - name: COLLECTOR_ZIPKIN_HOST_PORT
              value: ":9411"
            - name: SPAN_STORAGE_TYPE
              value: elasticsearch
            - name: ES_SERVER_URLS
              value: http://elasticsearch.logging.svc.cluster.local:9200
            - name: ES_INDEX_PREFIX
              value: jaeger
          resources:
            requests:
              memory: 512Mi
              cpu: 200m
            limits:
              memory: 1Gi
              cpu: 500m
---
apiVersion: v1
kind: Service
metadata:
  name: jaeger
  namespace: observability
spec:
  selector:
    app: jaeger
  ports:
    - name: ui
      port: 16686
      targetPort: 16686
    - name: collector
      port: 14268
      targetPort: 14268
    - name: grpc
      port: 14250
      targetPort: 14250
    - name: agent-compact
      port: 6831
      targetPort: 6831
      protocol: UDP
    - name: agent-binary
      port: 6832
      targetPort: 6832
      protocol: UDP
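
A quick way to verify the deployment is to port-forward the UI and point an instrumented service at the collector endpoint. A sketch, assuming kubectl access to the observability namespace:

# Open the Jaeger UI locally at http://localhost:16686
kubectl port-forward -n observability svc/jaeger 16686:16686

# HTTP (Thrift) collector endpoint most client libraries can be configured with:
#   http://jaeger.observability.svc.cluster.local:14268/api/traces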

OpenTelemetry Collector

Deploy OpenTelemetry for comprehensive telemetry collection:

# otel-collector.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-collector-config
  namespace: observability
data:
  config.yaml: |
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318
      prometheus:
        config:
          scrape_configs:
            - job_name: 'kubernetes-pods'
              kubernetes_sd_configs:
                - role: pod
              relabel_configs:
                - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
                  action: keep
                  regex: true
      jaeger:
        protocols:
          grpc:
            endpoint: 0.0.0.0:14250
          thrift_http:
            endpoint: 0.0.0.0:14268

    processors:
      batch:
        timeout: 1s
        send_batch_size: 1024
      memory_limiter:
        limit_mib: 512
      attributes:
        actions:
          - key: cluster.name
            value: production-eks
            action: upsert
          - key: environment
            value: production
            action: upsert

    exporters:
      jaeger:
        endpoint: jaeger.observability.svc.cluster.local:14250
        tls:
          insecure: true
      prometheus:
        endpoint: 0.0.0.0:8889
      awsxray:
        region: us-west-2
        no_verify_ssl: false
      logging:
        loglevel: debug

    service:
      pipelines:
        traces:
          receivers: [otlp, jaeger]
          processors: [memory_limiter, batch, attributes]
          exporters: [jaeger, awsxray]
        metrics:
          receivers: [otlp, prometheus]
          processors: [memory_limiter, batch, attributes]
          exporters: [prometheus]
        logs:
          receivers: [otlp]
          processors: [memory_limiter, batch, attributes]
          exporters: [logging]
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: otel-collector
  namespace: observability
spec:
  replicas: 2
  selector:
    matchLabels:
      app: otel-collector
  template:
    metadata:
      labels:
        app: otel-collector
    spec:
      serviceAccountName: otel-collector
      containers:
        - name: otel-collector
          image: otel/opentelemetry-collector-contrib:0.81.0
          command:
            - "/otelcol-contrib"
            - "--config=/conf/config.yaml"
          resources:
            requests:
              memory: 512Mi
              cpu: 200m
            limits:
              memory: 1Gi
              cpu: 500m
          ports:
            - containerPort: 4317
              name: otlp-grpc
            - containerPort: 4318
              name: otlp-http
            - containerPort: 8889
              name: prometheus
          volumeMounts:
            - name: config
              mountPath: /conf
      volumes:
        - name: config
          configMap:
            name: otel-collector-config
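
The collector Deployment above exposes OTLP and Prometheus ports but defines no Service, so workloads have nothing stable to send telemetry to. A sketch of a matching Service, plus the standard OTLP endpoint variable an instrumented pod would use (the Service name is an assumption):

apiVersion: v1
kind: Service
metadata:
  name: otel-collector
  namespace: observability
spec:
  selector:
    app: otel-collector
  ports:
    - name: otlp-grpc
      port: 4317
      targetPort: 4317
    - name: otlp-http
      port: 4318
      targetPort: 4318
    - name: prometheus
      port: 8889
      targetPort: 8889
---
# Example environment variable on an instrumented Deployment
# (standard OpenTelemetry SDK setting):
#   - name: OTEL_EXPORTER_OTLP_ENDPOINT
#     value: "http://otel-collector.observability.svc.cluster.local:4317"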

AWS Native Monitoring Integration

CloudWatch Container Insights

Enable comprehensive AWS CloudWatch monitoring:

# cloudwatch-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: cwagentconfig
  namespace: amazon-cloudwatch
data:
  cwagentconfig.json: |
    {
      "logs": {
        "metrics_collected": {
          "kubernetes": {
            "cluster_name": "production-eks",
            "metrics_collection_interval": 60
          }
        },
        "force_flush_interval": 5
      },
      "metrics": {
        "namespace": "CWAgent",
        "metrics_collected": {
          "cpu": {
            "measurement": ["cpu_usage_idle", "cpu_usage_iowait", "cpu_usage_user", "cpu_usage_system"],
            "metrics_collection_interval": 60,
            "totalcpu": false
          },
          "disk": {
            "measurement": ["used_percent"],
            "metrics_collection_interval": 60,
            "resources": ["*"]
          },
          "diskio": {
            "measurement": ["io_time"],
            "metrics_collection_interval": 60,
            "resources": ["*"]
          },
          "mem": {
            "measurement": ["mem_used_percent"],
            "metrics_collection_interval": 60
          },
          "netstat": {
            "measurement": ["tcp_established", "tcp_time_wait"],
            "metrics_collection_interval": 60
          },
          "swap": {
            "measurement": ["swap_used_percent"],
            "metrics_collection_interval": 60
          }
        }
      }
    }
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: cloudwatch-agent
  namespace: amazon-cloudwatch
spec:
  selector:
    matchLabels:
      name: cloudwatch-agent
  template:
    metadata:
      labels:
        name: cloudwatch-agent
    spec:
      containers:
        - name: cloudwatch-agent
          image: amazon/cloudwatch-agent:1.247356.0b251814
          ports:
            - containerPort: 8125
              hostPort: 8125
              protocol: UDP
          resources:
            limits:
              cpu: 200m
              memory: 200Mi
            requests:
              cpu: 200m
              memory: 200Mi
          env:
            - name: HOST_IP
              valueFrom:
                fieldRef:
                  fieldPath: status.hostIP
            - name: HOST_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
            - name: K8S_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
          volumeMounts:
            - name: cwagentconfig
              mountPath: /etc/cwagentconfig
            - name: rootfs
              mountPath: /rootfs
              readOnly: true
            - name: dockersock
              mountPath: /var/run/docker.sock
              readOnly: true
            - name: varlibdocker
              mountPath: /var/lib/docker
              readOnly: true
            - name: sys
              mountPath: /sys
              readOnly: true
            - name: devdisk
              mountPath: /dev/disk
              readOnly: true
      volumes:
        - name: cwagentconfig
          configMap:
            name: cwagentconfig
        - name: rootfs
          hostPath:
            path: /
        - name: dockersock
          hostPath:
            path: /var/run/docker.sock
        - name: varlibdocker
          hostPath:
            path: /var/lib/docker
        - name: sys
          hostPath:
            path: /sys
        - name: devdisk
          hostPath:
            path: /dev/disk
      terminationGracePeriodSeconds: 60
      serviceAccountName: cloudwatch-agent
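
The DaemonSet references a cloudwatch-agent service account, which needs IAM permissions to publish metrics and logs. One way to create it with IAM Roles for Service Accounts, assuming eksctl is installed and an OIDC provider is already associated with the cluster:

# Create the service account bound to the CloudWatchAgentServerPolicy managed policy
eksctl create iamserviceaccount \
  --cluster production-eks \
  --namespace amazon-cloudwatch \
  --name cloudwatch-agent \
  --attach-policy-arn arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy \
  --approve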

AWS Load Balancer Controller Metrics

Monitor ALB/NLB performance:

# alb-controller-monitoring.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: aws-load-balancer-controller
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: aws-load-balancer-controller
  endpoints:
    - port: webhook-server
      path: /metrics
      interval: 30s
---
# Custom metrics for ALB
apiVersion: v1
kind: ConfigMap
metadata:
  name: alb-custom-metrics
  namespace: monitoring
data:
  record.rules: |
    groups:
      - name: alb.rules
        rules:
          - record: alb:request_count_rate5m
            expr: sum(rate(alb_request_count_total[5m])) by (load_balancer)
          - record: alb:target_response_time_p95
            expr: histogram_quantile(0.95, sum(rate(alb_target_response_time_seconds_bucket[5m])) by (load_balancer, le))
          - record: alb:error_rate_5m
            expr: sum(rate(alb_http_code_target_5xx_count_total[5m])) by (load_balancer) / sum(rate(alb_request_count_total[5m])) by (load_balancer)

Grafana Dashboards

Comprehensive EKS Dashboard

Create production-ready Grafana dashboards:

{ "dashboard": { "id": null, "title": "EKS Cluster Overview", "tags": ["kubernetes", "eks"], "timezone": "browser", "panels": [ { "id": 1, "title": "Cluster Resource Usage", "type": "stat", "targets": [ { "expr": "sum(kube_node_status_allocatable{resource=\"cpu\"}) - sum(kube_node_status_capacity{resource=\"cpu\"})", "legendFormat": "CPU Available" }, { "expr": "sum(kube_node_status_allocatable{resource=\"memory\"}) - sum(kube_node_status_capacity{resource=\"memory\"})", "legendFormat": "Memory Available" } ], "fieldConfig": { "defaults": { "color": { "mode": "palette-classic" }, "unit": "short" } } }, { "id": 2, "title": "Pod Status Distribution", "type": "piechart", "targets": [ { "expr": "sum by (phase) (kube_pod_status_phase)", "legendFormat": "{{phase}}" } ] }, { "id": 3, "title": "Node CPU Usage", "type": "graph", "targets": [ { "expr": "100 - (avg by (instance) (irate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)", "legendFormat": "{{instance}}" } ], "yAxes": [ { "max": 100, "min": 0, "unit": "percent" } ] }, { "id": 4, "title": "Application Error Rate", "type": "graph", "targets": [ { "expr": "sum(rate(http_requests_total{status=~\"5..\"}[5m])) by (service) / sum(rate(http_requests_total[5m])) by (service) * 100", "legendFormat": "{{service}}" } ] } ], "time": { "from": "now-1h", "to": "now" }, "refresh": "30s" } }

Alerting Configuration

AlertManager Configuration

Configure comprehensive alerting:

# alertmanager-config.yaml
apiVersion: v1
kind: Secret
metadata:
  name: alertmanager-main
  namespace: monitoring
stringData:
  alertmanager.yml: |
    global:
      slack_api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
      smtp_smarthost: 'smtp.company.com:587'
      smtp_from: 'alerts@company.com'

    route:
      group_by: ['alertname', 'cluster', 'service']
      group_wait: 10s
      group_interval: 10s
      repeat_interval: 1h
      receiver: 'web.hook'
      routes:
        - match:
            severity: critical
          receiver: 'critical-alerts'
          continue: true
        - match:
            severity: warning
          receiver: 'warning-alerts'

    receivers:
      - name: 'web.hook'
        webhook_configs:
          - url: 'http://webhook-service.monitoring.svc.cluster.local:8080/webhook'
      - name: 'critical-alerts'
        slack_configs:
          - channel: '#alerts-critical'
            title: 'Critical Alert: {{ .GroupLabels.alertname }}'
            text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
            send_resolved: true
        pagerduty_configs:
          - routing_key: 'YOUR_PAGERDUTY_INTEGRATION_KEY'
            description: '{{ .GroupLabels.alertname }}: {{ .CommonAnnotations.summary }}'
        email_configs:
          - to: 'oncall@company.com'
            subject: 'Critical Alert: {{ .GroupLabels.alertname }}'
            body: |
              {{ range .Alerts }}
              Alert: {{ .Annotations.summary }}
              Description: {{ .Annotations.description }}
              {{ end }}
      - name: 'warning-alerts'
        slack_configs:
          - channel: '#alerts-warning'
            title: 'Warning: {{ .GroupLabels.alertname }}'
            text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
            send_resolved: true

    inhibit_rules:
      - source_match:
          severity: 'critical'
        target_match:
          severity: 'warning'
        equal: ['alertname', 'instance']
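
Routing trees are easy to get wrong, so it is worth checking the file before replacing the secret. A sketch using amtool, which ships alongside Alertmanager:

# Syntax and semantic check of the Alertmanager configuration
amtool check-config alertmanager.yml

# Preview which receiver a given alert would be routed to
amtool config routes test --config.file=alertmanager.yml \
  severity=critical alertname=KubernetesNodeReady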

Best Practices Summary

Monitoring Strategy Checklist

  • Metrics Collection: Prometheus with 15-day retention
  • Log Aggregation: Fluent Bit → ElasticSearch/CloudWatch
  • Distributed Tracing: Jaeger with OpenTelemetry
  • Alerting: Multi-channel alerts with escalation
  • Dashboards: Role-based Grafana dashboards
  • Cost Monitoring: Track monitoring infrastructure costs
  • Security: RBAC for monitoring tools
  • High Availability: Redundant monitoring components

Operational Guidelines

  1. Retention Policies: Balance storage costs with retention needs
  2. Alert Fatigue: Fine-tune alert thresholds to reduce noise
  3. Dashboard Maintenance: Regular review and updates
  4. Performance Impact: Monitor monitoring overhead
  5. Security: Secure monitoring endpoints and data

Conclusion

Implementing comprehensive observability for EKS requires careful orchestration of multiple tools and practices. The architecture outlined in this guide provides production-grade monitoring capabilities that scale with your infrastructure while maintaining operational efficiency.

Success factors include:

  • Layered Approach: Metrics, logs, and traces working together
  • Automation: Automated alerting and response procedures
  • Cost Awareness: Monitoring infrastructure costs alongside application costs
  • Team Alignment: Clear ownership and escalation procedures

Regular review and optimization of monitoring strategies ensures continued effectiveness as your EKS environment evolves and scales.


For more advanced observability strategies and monitoring best practices, follow STAQI Technologies' technical blog.

Ready to implement similar solutions?

Contact STAQI Technologies to learn how our expertise in high-volume systems, security operations, and compliance can benefit your organization.
