AI is transforming networking. Not in some vague, conference-keynote way. Right now, inference traffic is breaking the assumptions every proxy, load balancer, and ingress controller was built on. The Kubernetes ecosystem needed a place to define how networking adapts. So I built one.
I have spent years on Gateway API and SIG Network, watching the API mature from an experimental sketch into the definitive standard for Kubernetes traffic management, adopted by over 30 implementations. Gateway API’s mental model is simple: requests arrive with headers, a proxy routes, a backend handles.
LLM inference breaks that model completely.
Models are large, expensive, and slow. Requests are long-lived. Responses stream token by token. Instances are not interchangeable; a server with a LoRA adapter loaded in GPU memory is fundamentally different from one without.
The routing decisions that matter for AI cannot be made from headers alone. The gateway must inspect the payload: which model, what priority, whether the prompt is safe. The entire networking stack was built to avoid exactly that.
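To make the header-versus-payload point concrete, here is a minimal sketch of pulling routing inputs out of a request body. It assumes an OpenAI-style JSON body with top-level "model" and "stream" fields; the function name routing_inputs and the returned dictionary shape are hypothetical, chosen only for illustration.

```python
import json


def routing_inputs(body: bytes) -> dict:
    """Extract routing inputs from a request body (hypothetical helper).

    Assumes an OpenAI-style JSON object with optional top-level
    "model" and "stream" fields. Anything else yields empty inputs.
    """
    try:
        payload = json.loads(body)
    except ValueError:
        # Covers malformed JSON and invalid UTF-8 alike.
        payload = None

    if not isinstance(payload, dict):
        return {"model": None, "stream": False}

    model = payload.get("model")
    return {
        "model": model if isinstance(model, str) else None,
        "stream": payload.get("stream") is True,
    }
```

Note that none of this information is visible to a proxy that only looks at the method, path, and headers; the body has to be buffered and parsed before a routing decision can be made.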
By mid-2024, the Gateway API Inference Extension had started addressing model-aware routing. Valuable, but narrow in scope: it targeted companies already running AI at scale. The broader community needed standards for token-based rate limiting, prompt injection guardrails, content filtering, and egress patterns for third-party providers like OpenAI, Vertex AI, and Bedrock. None of these had a home.
An AI gateway is a network gateway implementing the Gateway API spec with capabilities purpose-built for AI workloads. Three core areas:
Inference routing and model-aware load balancing. The gateway understands which model a request targets, which servers have that model loaded, and factors in GPU memory, queue depth, and request criticality. The InferenceModel and InferencePool CRDs introduced this pattern; the working group (WG) builds on it.
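As a rough illustration of what model-aware selection might look like, the sketch below filters servers by loaded model and then ranks them by queue depth and free GPU memory. The Endpoint type, its fields, the pick_endpoint function, and the MAX_QUEUE_SHEDDABLE threshold are all assumptions made for this example; they are not taken from the Inference Extension or any implementation.

```python
from dataclasses import dataclass

# Hypothetical threshold: non-critical requests are shed above this depth.
MAX_QUEUE_SHEDDABLE = 10


@dataclass
class Endpoint:
    name: str
    loaded_models: set      # models (or adapters) resident in GPU memory
    queue_depth: int        # requests waiting on this server
    gpu_memory_free: float  # fraction of GPU memory free, 0.0 to 1.0


def pick_endpoint(endpoints, model, critical):
    """Return the best endpoint for `model`, or None if none is suitable."""
    # Only servers that already have the model loaded are candidates.
    candidates = [e for e in endpoints if model in e.loaded_models]

    if not critical:
        # Non-critical traffic is dropped rather than queued on busy servers.
        candidates = [e for e in candidates if e.queue_depth < MAX_QUEUE_SHEDDABLE]

    if not candidates:
        return None

    # Prefer the shortest queue; break ties with the most free GPU memory.
    return min(candidates, key=lambda e: (e.queue_depth, -e.gpu_memory_free))
```

The design choice worth noticing is that criticality changes the candidate set, not just the ranking: a critical request will still queue on a saturated server, while a sheddable one gets no endpoint at all.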
Payload processing. The gateway inspects and transforms full HTTP bodies to support prompt guardrails, content filtering, semantic routing, caching, and RAG integration. The WG’s payload processing proposal defines standards for declarative processor configuration, ordered pipelines, and configurable failure modes.
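One way to picture an ordered pipeline with configurable failure modes is the sketch below. It is not the proposal's API: the Processor type, the "fail_open" and "fail_closed" labels, and the Rejected exception are hypothetical names used to show the idea that a processor deliberately blocking a request is different from a processor breaking.

```python
from dataclasses import dataclass
from typing import Callable


class Rejected(Exception):
    """The request must not be forwarded."""


@dataclass
class Processor:
    name: str
    run: Callable[[dict], dict]        # takes the parsed body, returns a body
    failure_mode: str = "fail_closed"  # or "fail_open" (hypothetical labels)


def run_pipeline(processors, body):
    """Apply processors in order and return the final body."""
    for p in processors:
        try:
            body = p.run(body)
        except Rejected:
            # A deliberate block always stops the request.
            raise
        except Exception as exc:
            if p.failure_mode == "fail_open":
                # The processor broke; skip it and keep the body unchanged.
                continue
            raise Rejected(f"processor {p.name!r} failed") from exc
    return body


def block_injection(body):
    """Toy guardrail: reject one well-known injection phrase."""
    if "ignore previous instructions" in body.get("prompt", "").lower():
        raise Rejected("prompt blocked by guardrail")
    return body
```

A guardrail would typically be configured fail-closed, so an outage blocks traffic, while something like a cache lookup could reasonably fail open, since skipping it only costs latency.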
Egress gateways. Many organizations route to external AI services and need secure, policy-controlled egress with managed auth, token injection, regional compliance, and provider failover. The egress gateway proposal addresses this.
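The egress concerns above (managed auth, token injection, regional compliance, failover) can be sketched in a few lines. Everything here is illustrative: the Provider type, the egress function, and the use of a Bearer header are assumptions, and real providers differ in their authentication schemes.

```python
from dataclasses import dataclass
from typing import Callable


class ProviderUnavailable(Exception):
    """Raised when a provider cannot serve the request."""


@dataclass
class Provider:
    name: str
    region: str
    credential: str                     # held by the gateway, never by the client
    send: Callable[[dict, dict], dict]  # (headers, body) -> response


def egress(providers, body, allowed_regions):
    """Try providers in order and return (provider name, response)."""
    errors = []
    for p in providers:
        # Regional compliance: never send data outside the allowed regions.
        if p.region not in allowed_regions:
            continue
        # Token injection: the gateway attaches the credential on the way out.
        headers = {"Authorization": f"Bearer {p.credential}"}
        try:
            return p.name, p.send(headers, body)
        except ProviderUnavailable as exc:
            # Failover: remember the error and move to the next provider.
            errors.append(f"{p.name}: {exc}")
    raise ProviderUnavailable("; ".join(errors) or "no provider in an allowed region")
```

Because the credential lives on the Provider entry rather than in the incoming request, workloads inside the cluster never need to hold third-party API keys.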
Working groups require a charter, SIG sponsorship, steering committee approval, and community consensus. The process started on the kubernetes-dev mailing list. I presented at SIG Architecture for alignment, then submitted PR #8521: the formal request for wg-ai-gateway.
Two months of collaborative refinement followed. Reviewers pushed on scope, governance, and the relationship to existing efforts. Four steering committee members approved.
Operational pieces followed: Slack channel, proposal template and README, user stories. The co-organizers span five companies. This is not a single-vendor effort.
wg-ai-gateway does not own production code. It produces proposals: discussions and specifications that, once consensus is reached, are submitted to the relevant SIGs for implementation.
The problems span multiple SIGs. Payload processing touches Gateway API. Egress involves SIG Multicluster. Model-aware routing ties to the Inference Extension under SIG Network. A WG is the right unit for cross-cutting concerns.
The charter includes an explicit exit strategy. The WG concludes after establishing definitions, identifying needed API support, submitting proposals, and documenting best practices. The goal is to solve the problem and dissolve.
The agentic AI wave makes this more urgent. When autonomous agents call tools and chain actions across services, the gateway becomes a critical control plane: enforcing guardrails, inspecting payloads, applying defense-in-depth before requests reach model servers.
The working group meets weekly, Thursdays at 2 PM EST. Proposals are in active development. If you work on Kubernetes networking, AI infrastructure, or the intersection of the two, I want you in the room.
We are defining how AI networking works on Kubernetes. Now.