DNS Failures in EKS? The Real Bottleneck Was AWS Network Limits

Published: December 18, 2025 at 01:51 AM EST
2 min read
Source: Dev.to

DNS Investigation Overview

During the DNS investigation I initially focused on CoreDNS and NodeLocal DNS metrics.
The real breakthrough came when I started correlating DNS failures with AWS instance‑level network limits.
The most useful signals came from network allowance metrics exposed by the EC2 ENA driver via ethtool.
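These counters can also be read directly on a node before any monitoring is wired up. A minimal check, assuming the primary ENA interface is named eth0 (adjust the interface name to match your nodes):

# Show the ENA allowance counters for the primary interface
ethtool -S eth0 | grep allowance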

EC2 ENA Allowance Metrics

  • ethtool_linklocal_allowance_exceeded: packets dropped because traffic to link‑local services exceeded the packets‑per‑second (PPS) limit. This directly affects DNS, IMDS, and Amazon Time Sync.
  • ethtool_conntrack_allowance_available: remaining number of connections that can be tracked before the instance’s connection‑tracking limit is reached. (Supported on Nitro‑based instances only.)
  • ethtool_conntrack_allowance_exceeded: packets dropped because the connection‑tracking limit was exceeded and new connections could not be established.
  • ethtool_bw_in_allowance_exceeded: packets queued or dropped because inbound aggregate bandwidth exceeded the instance limit.
  • ethtool_bw_out_allowance_exceeded: packets queued or dropped because outbound aggregate bandwidth exceeded the instance limit.
  • ethtool_pps_allowance_exceeded: packets queued or dropped because the bidirectional packets‑per‑second (PPS) limit was exceeded.

All *_allowance_exceeded metrics should ideally remain zero.
Any sustained non‑zero value indicates a networking bottleneck at the instance level.

Collecting the Metrics

These metrics are exposed by the EC2 ENA driver via ethtool, collected by node exporter, scraped by Prometheus, and visualized in Grafana.
On Amazon Linux EKS nodes, ethtool is installed by default. To collect the metrics, enable the ethtool collector in the node exporter container:

# node exporter container configuration
containers:
  - args:
      - --collector.ethtool                                              # enable the ethtool collector
      - --collector.ethtool.device-include=(eth|em|eno|ens|enp)[0-9s]+   # match ENA-style interface names
      - --collector.ethtool.metrics-include=.*                           # expose all ethtool statistics

After applying this change, the metrics become available in Prometheus and Grafana.
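To verify that the collector is actually exporting the new series, you can also query a node exporter instance directly, assuming it listens on its default port 9100:

# Check that the ethtool allowance metrics are exposed
curl -s http://localhost:9100/metrics | grep allowance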

Prometheus Metrics

  • Current connection‑tracking allowance available

    node_ethtool_conntrack_allowance_available{job="node-exporter"}
  • Allowance exceeded counters (converted to rates)

    sum by (instance) (
      rate(
        node_ethtool_conntrack_allowance_exceeded{job="node-exporter"}[1m]
      )
    )

    Similar panels can be created for the other allowance‑exceeded metrics:

    • node_ethtool_bw_in_allowance_exceeded
    • node_ethtool_bw_out_allowance_exceeded
    • node_ethtool_pps_allowance_exceeded
    • node_ethtool_linklocal_allowance_exceeded

Each panel shows packets dropped per second per node.
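The same expressions can also drive alerts rather than only dashboard panels. Below is a minimal sketch of a Prometheus alerting rule; the group name, threshold, window, and labels are illustrative assumptions, not values from this setup:

groups:
  - name: ec2-network-allowances        # illustrative group name
    rules:
      - alert: EC2PpsAllowanceExceeded
        # Fires when a node keeps dropping packets at the EC2 PPS limit
        expr: |
          sum by (instance) (
            rate(node_ethtool_pps_allowance_exceeded{job="node-exporter"}[5m])
          ) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.instance }} is hitting its EC2 PPS allowance"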

Grafana Dashboard

A full Grafana dashboard JSON (named Network limits dashboard) visualizes all allowance‑exceeded metrics. Each panel is a per‑node time‑series panel, which makes it easy to correlate network saturation on a specific node with DNS errors or latency.
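For reference, an individual panel for one of these metrics can be as small as the fragment below (a sketch of a single time‑series panel, not the full dashboard JSON; the title and query are illustrative):

{
  "title": "PPS allowance exceeded (per node)",
  "type": "timeseries",
  "targets": [
    {
      "expr": "sum by (instance) (rate(node_ethtool_pps_allowance_exceeded{job=\"node-exporter\"}[1m]))",
      "legendFormat": "{{instance}}"
    }
  ]
}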

Key Takeaways

  • All allowance‑exceeded metrics are tied to EC2 instance sizing, except link‑local traffic, which has a fixed limit of 1024 packets per second regardless of instance size.
  • This fixed limit explains why DNS can fail even when CPU, memory, and pod‑level metrics look healthy.
  • The bottleneck exists below Kubernetes, at the EC2 networking layer.
  • When debugging intermittent DNS failures on EKS, do not stop at CoreDNS metrics; always inspect the instance‑level network allowances, starting with the link‑local counter queried below.
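Because the link‑local budget is fixed, a per‑node rate of the corresponding counter is usually the quickest way to confirm whether DNS traffic is being throttled, for example:

    sum by (instance) (
      rate(
        node_ethtool_linklocal_allowance_exceeded{job="node-exporter"}[5m]
      )
    )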

For more context on DNS misconfiguration, see the post The Hidden DNS Misconfiguration That Was Killing Performance in Our EKS Cluster (and How We Fixed it).
