
Why DNS Breaks Inside Kubernetes on AWS

How ndots defaults, NodeLocal DNS, and EC2's hard packet limits conspire to break your cluster's name resolution.

Emeka Okafor

Security Editor · Jun 21, 2026 · 6 min read

DNS is the silent killer of distributed systems. In Kubernetes clusters running on AWS—whether via Amazon EKS or self-managed EC2 instances—DNS issues are notoriously difficult to debug. They rarely manifest as clean, permanent failures. Instead, they present as intermittent 5-second latency spikes, random connection timeouts, or sudden microservice degradation under load.

These symptoms are not random. They are the predictable result of how default Kubernetes DNS configurations interact with AWS infrastructure limits. To keep a cluster stable, platform engineers must understand the interplay between the ndots search path, NodeLocal DNSCache, and EC2's hard network limits.

The `ndots:5` Multiplier: How One Lookup Becomes Ten

The first major culprit behind DNS performance issues is the default DNS client configuration inside Kubernetes pods. If you inspect /etc/resolv.conf inside almost any standard Kubernetes pod, you will see a configuration similar to this:

nameserver 10.100.0.10
search my-namespace.svc.cluster.local svc.cluster.local cluster.local ec2.internal
options ndots:5

The nameserver points to the cluster's internal DNS service (typically CoreDNS), and the search list defines the suffixes appended to queries. The critical parameter here is ndots:5.

The ndots option tells the system's resolver: "If a hostname contains fewer than $N$ dots, treat it as a relative name and try appending each search suffix first before attempting to resolve it as an absolute domain."

Because the default is 5, almost every external Fully Qualified Domain Name (FQDN) triggers this fallback behavior. Consider an application trying to connect to an external API at api.github.com (which contains 2 dots). Because $2 < 5$, the resolver assumes the hostname is relative and sequentially queries:

api.github.com.my-namespace.svc.cluster.local (NXDOMAIN)
api.github.com.svc.cluster.local (NXDOMAIN)
api.github.com.cluster.local (NXDOMAIN)
api.github.com.ec2.internal (NXDOMAIN)
api.github.com (Success)

This single lookup generates five separate DNS queries, four of which are guaranteed to fail (returning NXDOMAIN). Furthermore, because modern runtimes query IPv4 (A) and IPv6 (AAAA) records in parallel, a single external connection attempt can generate up to 10 DNS queries, each with its own response packet.

For high-throughput applications, this 10x multiplier quickly overwhelms DNS servers and network interfaces.
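The expansion described above can be sketched in a few lines of Python. This is a simplified model of a glibc-style stub resolver, not the resolver's actual code; the function names are made up for illustration, and other libc implementations may order or skip candidates differently.

```python
def candidate_names(hostname, search, ndots):
    """Return the names a glibc-style stub resolver would try, in order."""
    # A trailing dot marks the name as fully qualified: the search list is skipped.
    if hostname.endswith("."):
        return [hostname.rstrip(".")]
    suffixed = [f"{hostname}.{suffix}" for suffix in search]
    if hostname.count(".") >= ndots:
        # Enough dots: try the name as-is first, then fall back to the search list.
        return [hostname] + suffixed
    # Too few dots: walk the search list first and try the bare name last.
    return suffixed + [hostname]


def queries_until_success(names, resolvable):
    """Count the names tried up to and including the first resolvable one."""
    for count, name in enumerate(names, start=1):
        if name in resolvable:
            return count
    return len(names)


# Search list copied from the resolv.conf example in the article.
SEARCH = ["my-namespace.svc.cluster.local", "svc.cluster.local",
          "cluster.local", "ec2.internal"]

for ndots in (5, 1):
    names = candidate_names("api.github.com", SEARCH, ndots)
    print(ndots, queries_until_success(names, {"api.github.com"}))
# prints "5 5" then "1 1"
```

Each name in the list is typically looked up once for A and once for AAAA, which is where the factor of ten comes from.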

The Silent Ceiling: EC2's 1024 PPS Limit

The ndots query multiplier becomes dangerous when it collides with AWS infrastructure limits. Amazon EC2 enforces a hard limit on DNS traffic sent to the VPC resolver (the .2 address in your VPC CIDR, also known as AmazonProvidedDNS).

This limit is approximately 1,024 packets per second (pps) per Elastic Network Interface (ENI). Any DNS traffic exceeding this threshold is silently dropped.

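Because EKS pods normally send traffic through the host node's ENIs rather than through interfaces of their own (security groups for pods, which attaches a dedicated branch ENI to a pod, is the main exception), this 1,024 pps limit is shared by every pod whose DNS traffic leaves through the same interface.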

flowchart TD
    Pod1[Pod A] -->|ndots:5 Multiplier| CoreDNS[CoreDNS Pod]
    Pod2[Pod B] -->|ndots:5 Multiplier| CoreDNS
    CoreDNS -->|Forward External Queries| VPC[VPC Resolver .2]
    style VPC stroke:#f96,stroke-width:2px
    note[Hard Limit: 1024 pps per ENI] -.-> VPC

If a single node hosts a few chatty microservices making external API calls, the ndots:5 multiplier can easily push the node's aggregate DNS traffic past 1,024 pps. When this happens, EC2 begins dropping packets. The application resolver waits, times out, and retries, leading to intermittent latency spikes that are incredibly difficult to isolate using standard application-level APM tools.
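A rough back-of-the-envelope calculation shows how little headroom the limit leaves. The sketch below is an upper-bound estimate under stated assumptions: every generated query is forwarded to the VPC resolver through a single ENI and nothing is cached. In practice some search-path queries may be answered by CoreDNS without being forwarded, so real numbers will be lower. The function names and the 150 lookups-per-second figure are illustrative.

```python
VPC_RESOLVER_PPS_LIMIT = 1024  # per-ENI limit quoted in the article


def estimated_query_pps(lookups_per_second, names_per_lookup=5, record_types=2):
    """Upper-bound DNS queries per second produced by a workload."""
    return lookups_per_second * names_per_lookup * record_types


def max_lookups_before_drops(names_per_lookup=5, record_types=2):
    """Largest whole lookup rate whose query rate stays within the limit."""
    return VPC_RESOLVER_PPS_LIMIT // (names_per_lookup * record_types)


# Hypothetical node whose pods perform 150 external lookups per second in total.
print(estimated_query_pps(150))                      # 1500, above the limit
print(max_lookups_before_drops())                    # 102 lookups/s with ndots:5
print(max_lookups_before_drops(names_per_lookup=1))  # 512 lookups/s without search-path expansion
```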

NodeLocal DNSCache: The Intermediary Shield

To mitigate both the ndots query storm and the packet limits, Kubernetes platform engineers should deploy NodeLocal DNSCache.

NodeLocal DNSCache runs as a DaemonSet—one pod per node—and listens on a link-local IP address (typically within the 169.254.0.0/16 range) that is local to the node's network stack.

flowchart TD
    Pod[Pod] -->|UDP :53| NL[NodeLocal DNSCache\nDaemonSet on 169.254.0.0/16]
    NL -->|Cache Hit| Pod
    NL -->|Cache Miss: *.cluster.local| CoreDNS[CoreDNS\nClusterIP Service]
    NL -->|Cache Miss: External| VPC[VPC Resolver\nAmazonProvidedDNS .2]

When NodeLocal DNSCache is active, pods send their queries to this local daemon instead of routing them across the network to the central CoreDNS service. This architecture provides three major benefits:

Bypassing Conntrack: It avoids the notorious Linux connection tracking (conntrack) UDP race condition, which is a common cause of random 5-second DNS delays in highly active clusters.
Caching NXDOMAIN Responses: NodeLocal DNSCache caches negative responses. When the ndots search path generates useless queries like api.github.com.svc.cluster.local, the local cache absorbs subsequent requests, preventing them from hitting CoreDNS or the VPC resolver.
Reducing ENI Traffic: By serving both positive and negative cache hits locally on the node, the volume of DNS packets leaving the node's ENI is drastically reduced, keeping the traffic well below the 1,024 pps limit.
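The effect of negative caching can be illustrated with a toy cache. This is a minimal sketch of the idea, not how NodeLocal DNSCache is implemented: the class, the fixed 30-second TTL, and the fake upstream are assumptions, and the address comes from a range reserved for documentation.

```python
import time


class DnsCache:
    """Toy TTL cache that stores NXDOMAIN (None) answers as well as addresses."""

    def __init__(self, upstream, ttl_seconds=30, clock=time.monotonic):
        self._upstream = upstream   # callable: name -> address, or None for NXDOMAIN
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}          # name -> (expires_at, answer)
        self.upstream_queries = 0

    def resolve(self, name):
        now = self._clock()
        entry = self._entries.get(name)
        if entry is not None and entry[0] > now:
            return entry[1]         # cache hit, positive or negative
        self.upstream_queries += 1
        answer = self._upstream(name)
        self._entries[name] = (now + self._ttl, answer)
        return answer


def fake_upstream(name):
    # Stand-in for CoreDNS or the VPC resolver; 203.0.113.10 is a documentation address.
    return "203.0.113.10" if name == "api.github.com" else None


cache = DnsCache(fake_upstream)
for _ in range(100):
    cache.resolve("api.github.com.svc.cluster.local")  # NXDOMAIN, cached after the first miss
    cache.resolve("api.github.com")
print(cache.upstream_queries)  # 2
```

Two hundred lookups reach the upstream only twice; without the negative entry, the NXDOMAIN name alone would have gone upstream a hundred times.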

Network and Resource Bottlenecks

Beyond packet limits and search paths, two other common failure modes frequently disrupt DNS inside AWS-based clusters:

1. Network ACLs and Ephemeral Ports

When running worker nodes in private subnets, platform engineers often tighten Network Access Control Lists (NACLs). A common mistake is allowing outbound UDP/TCP port 53 for DNS queries but failing to allow the inbound ephemeral port range (1024–65535).

NACLs are stateless, so return traffic needs its own rule. Because DNS clients receive responses on ephemeral ports, blocking that range lets CoreDNS receive and answer queries while the response packets are silently dropped on their way back to the requesting pod. Because NACLs are enforced at the subnet boundary, this issue often manifests as cross-Availability Zone DNS timeouts while same-AZ queries succeed.
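The failure is easier to see with a small model of stateless rule evaluation. The sketch below is a simplification that matches on destination port only and ignores protocol and CIDR; the `Rule` class, the rule numbers, and the client port are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    number: int      # lower numbers are evaluated first
    port_from: int
    port_to: int
    allow: bool


def inbound_allowed(rules, dst_port):
    """First matching rule wins; a packet that matches nothing is denied."""
    for rule in sorted(rules, key=lambda r: r.number):
        if rule.port_from <= dst_port <= rule.port_to:
            return rule.allow
    return False


dns_only = [Rule(100, 53, 53, True)]
with_ephemeral = dns_only + [Rule(110, 1024, 65535, True)]

# A DNS response travels from port 53 back to the client's ephemeral port.
client_port = 40000  # hypothetical ephemeral port chosen by the resolver
print(inbound_allowed(dns_only, client_port))        # False: the response is dropped
print(inbound_allowed(with_ephemeral, client_port))  # True
```

The query itself is never rejected, which is why the server side looks healthy while clients time out.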

2. CoreDNS Resource Starvation

The default Amazon EKS CoreDNS add-on is configured with a strict memory limit (typically 170 MiB) but no CPU limit. If the node hosting a CoreDNS pod experiences a sudden CPU spike, the CoreDNS container can be starved of CPU cycles.

Because DNS is highly latency-sensitive, even a brief period of CPU starvation leads to queued queries, packet drops, and application-side timeouts.

The Platform Engineer's Runbook

To diagnose and resolve these DNS anomalies, follow this systematic debugging workflow.

Step 1: Deploy a Diagnostic Pod

Deploy a standard network utility pod to isolate the issue from your application code. Kubernetes provides a standard dnsutils manifest for this purpose:

kubectl apply -f https://k8s.io/examples/admin/dns/dnsutils.yaml

Once the pod is running, execute an nslookup to verify basic resolution:

kubectl exec -i -t dnsutils -- nslookup kubernetes.default
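A single successful nslookup says little about intermittent failures, so it can help to resolve the same name repeatedly and count the slow attempts. The helper below is a sketch that assumes a pod with a Python interpreter available; `probe` and its parameters are illustrative names, and the one-second threshold is an arbitrary choice.

```python
import socket
import time


def probe(hostname, attempts=20, slow_after=1.0,
          resolve=socket.getaddrinfo, clock=time.perf_counter):
    """Resolve hostname repeatedly; return (latencies, slow_count, failures)."""
    latencies, failures = [], 0
    for _ in range(attempts):
        start = clock()
        try:
            resolve(hostname, 443)
        except OSError:  # socket.gaierror is a subclass of OSError
            failures += 1
        latencies.append(clock() - start)
    slow = sum(1 for seconds in latencies if seconds >= slow_after)
    return latencies, slow, failures


# Example call from inside a pod (performs real lookups):
#   latencies, slow, failures = probe("api.github.com")
#   print(f"max={max(latencies):.2f}s slow={slow} failures={failures}")
```

A handful of attempts clustered around five seconds among otherwise fast lookups is the pattern the article describes.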

Step 2: Optimize the Pod Spec `dnsConfig`

If your application primarily communicates with external APIs and does not rely on short-name service discovery across namespaces, override the default ndots behavior in your Pod specification:

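apiVersion: apps/v1
kind: Deployment
metadata:
  name: external-api-client
spec:
  # apps/v1 requires a selector that matches the pod template labels
  # (the label value here is illustrative)
  selector:
    matchLabels:
      app: external-api-client
  template:
    metadata:
      labels:
        app: external-api-client
    spec:
      dnsConfig:
        options:
          # Names with at least one dot are tried as absolute names first
          - name: ndots
            value: "1"
      containers:
        - name: app
          image: my-app:latest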

Setting ndots:1 ensures that any domain with at least one dot (e.g., api.github.com) is queried as an absolute domain first, eliminating the search path overhead.

Step 3: Enable CoreDNS Query Logging

To determine if your cluster is suffering from an ndots query storm, enable query logging in CoreDNS. Edit the Corefile ConfigMap:

kubectl -n kube-system edit configmap coredns

Add the log plugin inside the main block:

apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        errors
        health
        log
        kubernetes cluster.local in-addr.arpa ip6.arpa {
            pods insecure
            fallthrough in-addr.arpa ip6.arpa
        }
        prometheus :9153
        forward . /etc/resolv.conf
        cache 30
    }

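Save the changes. CoreDNS picks up a modified Corefile on its own only when the reload plugin is enabled; if your Corefile omits it, restart the CoreDNS pods (for example with kubectl -n kube-system rollout restart deployment coredns). Then monitor the logs of the CoreDNS pods: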

kubectl logs -n kube-system -l k8s-app=kube-dns --tail=100

Look for a high volume of NXDOMAIN responses, which indicates that your applications are generating excessive search-path queries.
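Counting response codes by hand gets tedious, so a short script can summarize the log output. The sketch below assumes the default format of the CoreDNS log plugin, where the quoted query is followed by the response code; the sample lines are illustrative and not taken from a real cluster.

```python
import re
from collections import Counter

# Matches the quoted query section and the response code that follows it.
LOG_PATTERN = re.compile(r'"(?P<qtype>\S+) IN (?P<name>\S+) [^"]*" (?P<rcode>[A-Z]+)')


def rcode_counts(lines):
    """Count response codes across CoreDNS query log lines."""
    counts = Counter()
    for line in lines:
        match = LOG_PATTERN.search(line)
        if match:
            counts[match.group("rcode")] += 1
    return counts


def nxdomain_ratio(lines):
    """Fraction of logged queries that were answered with NXDOMAIN."""
    counts = rcode_counts(lines)
    total = sum(counts.values())
    return counts["NXDOMAIN"] / total if total else 0.0


SAMPLE = [  # illustrative lines in the assumed log format
    '[INFO] 10.0.1.23:41234 - 1021 "A IN api.github.com.my-namespace.svc.cluster.local. udp 64 false 512" NXDOMAIN qr,aa,rd 157 0.000131s',
    '[INFO] 10.0.1.23:41234 - 1022 "A IN api.github.com.svc.cluster.local. udp 51 false 512" NXDOMAIN qr,aa,rd 144 0.000120s',
    '[INFO] 10.0.1.23:41234 - 1023 "A IN api.github.com.cluster.local. udp 47 false 512" NXDOMAIN qr,aa,rd 140 0.000118s',
    '[INFO] 10.0.1.23:41234 - 1024 "A IN api.github.com.ec2.internal. udp 46 false 512" NXDOMAIN qr,rd,ra 121 0.001532s',
    '[INFO] 10.0.1.23:41234 - 1025 "A IN api.github.com. udp 32 false 512" NOERROR qr,rd,ra 62 0.002210s',
]

print(nxdomain_ratio(SAMPLE))  # 0.8
```

To run it against a live cluster, feed the output of the kubectl logs command above into `nxdomain_ratio`, for example by reading lines from standard input.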

Step 4: Monitor Resource Utilization

Check if CoreDNS pods are experiencing resource starvation:

kubectl top pods -n kube-system -l k8s-app=kube-dns

If CPU utilization is high or memory is approaching the 170 MiB limit, consider adjusting the resource requests and limits in the CoreDNS deployment, or scaling the deployment horizontally to distribute the load.
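The check against the memory limit can be scripted as well. This sketch parses the plain-text table printed by kubectl top pods, assuming CPU is reported in millicores (`m`) and memory in `Mi`; the pod names, the sample figures, and the 80% threshold are hypothetical. It needs Python 3.9 or later for `str.removesuffix`.

```python
MEMORY_LIMIT_MIB = 170  # limit cited in the article; check your own deployment


def parse_top(output):
    """Parse `kubectl top pods` text into {pod: (cpu_millicores, memory_mib)}."""
    usage = {}
    for line in output.strip().splitlines()[1:]:  # skip the header row
        name, cpu, memory = line.split()[:3]
        usage[name] = (int(cpu.removesuffix("m")), int(memory.removesuffix("Mi")))
    return usage


def near_memory_limit(usage, threshold=0.8):
    """Return pods whose memory use is at or above threshold * limit."""
    return [pod for pod, (_, memory_mib) in usage.items()
            if memory_mib >= threshold * MEMORY_LIMIT_MIB]


SAMPLE_OUTPUT = """\
NAME            CPU(cores)   MEMORY(bytes)
coredns-aaaaa   4m           24Mi
coredns-bbbbb   310m         151Mi
"""

print(near_memory_limit(parse_top(SAMPLE_OUTPUT)))  # ['coredns-bbbbb']
```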

The Verdict

Relying on out-of-the-box Kubernetes DNS settings on AWS is a recipe for intermittent production failures. The default ndots:5 setting is optimized for local service discovery, not external cloud APIs. When combined with EC2's strict 1,024 pps ENI limit, it creates a silent bottleneck.

For any production-grade cluster on AWS, deploying NodeLocal DNSCache is not an optional optimization—it is a foundational requirement. Combined with targeted ndots:1 overrides for external-facing workloads and properly configured NACLs, it transforms a fragile, opaque network layer into a predictable and resilient utility.


