DNS Failures in EKS? The Real Bottleneck Was AWS Network Limits

Published: December 18, 2025 at 01:51 AM EST
2 min read
Source: Dev.to

DNS Investigation Overview

During the DNS investigation I initially focused on CoreDNS and NodeLocal DNS metrics.
The real breakthrough came when I started correlating DNS failures with AWS instance‑level network limits.
The most useful signals came from network allowance metrics exposed by the EC2 ENA driver via ethtool.
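These counters can also be read directly on a node before any monitoring is wired up. A minimal check, assuming the primary ENA interface is named eth0 (adjust the interface name to match your nodes):

# Show the ENA allowance counters for the primary interface
ethtool -S eth0 | grep allowance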

EC2 ENA Allowance Metrics

  • ethtool_linklocal_allowance_exceeded: packets dropped because traffic to link‑local services exceeded the packets‑per‑second (PPS) limit. This directly affects DNS, IMDS, and Amazon Time Sync.
  • ethtool_conntrack_allowance_available: remaining number of connections that can be tracked before the instance’s connection‑tracking limit is reached. (Supported on Nitro‑based instances only.)
  • ethtool_conntrack_allowance_exceeded: packets dropped because the connection‑tracking limit was exceeded and new connections could not be established.
  • ethtool_bw_in_allowance_exceeded: packets queued or dropped because inbound aggregate bandwidth exceeded the instance limit.
  • ethtool_bw_out_allowance_exceeded: packets queued or dropped because outbound aggregate bandwidth exceeded the instance limit.
  • ethtool_pps_allowance_exceeded: packets queued or dropped because the bidirectional packets‑per‑second (PPS) limit was exceeded.

All *_allowance_exceeded metrics should ideally remain zero.
Any sustained non‑zero value indicates a networking bottleneck at the instance level.

Collecting the Metrics

These metrics are exposed by the EC2 ENA driver via ethtool, collected by node exporter, scraped by Prometheus, and visualized in Grafana.
On Amazon Linux EKS nodes, ethtool is installed by default. To collect the metrics, enable the ethtool collector in the node exporter container:

# node exporter container configuration
containers:
  - args:
      - --collector.ethtool                                              # enable the ethtool collector
      - --collector.ethtool.device-include=(eth|em|eno|ens|enp)[0-9s]+   # match ENA-style interface names
      - --collector.ethtool.metrics-include=.*                           # expose all ethtool statistics

After applying this change, the metrics become available in Prometheus and Grafana.
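To verify that the collector is actually exporting the new series, you can also query a node exporter instance directly, assuming it listens on its default port 9100:

# Check that the ethtool allowance metrics are exposed
curl -s http://localhost:9100/metrics | grep allowance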

Prometheus Metrics

  • Current connection‑tracking allowance available

    node_ethtool_conntrack_allowance_available{job="node-exporter"}
  • Allowance exceeded counters (converted to rates)

    sum by (instance) (
      rate(
        node_ethtool_conntrack_allowance_exceeded{job="node-exporter"}[1m]
      )
    )

    Similar panels can be created for the other allowance‑exceeded metrics:

    • node_ethtool_bw_in_allowance_exceeded
    • node_ethtool_bw_out_allowance_exceeded
    • node_ethtool_pps_allowance_exceeded
    • node_ethtool_linklocal_allowance_exceeded

Each panel shows packets dropped per second per node.
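The same expressions can also drive alerts rather than only dashboard panels. Below is a minimal sketch of a Prometheus alerting rule; the group name, threshold, window, and labels are illustrative assumptions, not values from this setup:

groups:
  - name: ec2-network-allowances        # illustrative group name
    rules:
      - alert: EC2PpsAllowanceExceeded
        # Fires when a node keeps dropping packets at the EC2 PPS limit
        expr: |
          sum by (instance) (
            rate(node_ethtool_pps_allowance_exceeded{job="node-exporter"}[5m])
          ) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.instance }} is hitting its EC2 PPS allowance"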

Grafana Dashboard

A full Grafana dashboard JSON (named Network limits dashboard) visualizes all allowance‑exceeded metrics. Each panel is a per‑node time‑series panel, which makes it easy to correlate network saturation on a specific node with DNS errors or latency.
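For reference, an individual panel for one of these metrics can be as small as the fragment below (a sketch of a single time‑series panel, not the full dashboard JSON; the title and query are illustrative):

{
  "title": "PPS allowance exceeded (per node)",
  "type": "timeseries",
  "targets": [
    {
      "expr": "sum by (instance) (rate(node_ethtool_pps_allowance_exceeded{job=\"node-exporter\"}[1m]))",
      "legendFormat": "{{instance}}"
    }
  ]
}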

Key Takeaways

  • All allowance‑exceeded metrics are tied to EC2 instance sizing, except link‑local traffic, which has a fixed limit of 1024 packets per second regardless of instance size.
  • This fixed limit explains why DNS can fail even when CPU, memory, and pod‑level metrics look healthy.
  • The bottleneck exists below Kubernetes, at the EC2 networking layer.
  • When debugging intermittent DNS failures on EKS, do not stop at CoreDNS metrics; always inspect the instance‑level network allowances, starting with the link‑local counter queried below.
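Because the link‑local budget is fixed, a per‑node rate of the corresponding counter is usually the quickest way to confirm whether DNS traffic is being throttled, for example:

    sum by (instance) (
      rate(
        node_ethtool_linklocal_allowance_exceeded{job="node-exporter"}[5m]
      )
    )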

For more context on DNS misconfiguration, see the post The Hidden DNS Misconfiguration That Was Killing Performance in Our EKS Cluster (and How We Fixed it).
