Data Center Fabric Design and Migration: Spine-Leaf, VXLAN EVPN, ACI

Multi-CCIE engineers with 25 years in data center switching and fabric design. Fixed-fee SOW on every engagement. We design and migrate DC fabrics across Cisco Nexus 9000/ACI, Arista EOS, Juniper QFX / Apstra Data Center Director, and NVIDIA Spectrum (Cumulus Linux) — with no vendor bias baked into the recommendation.

25 years of enterprise networking leadership

Multi-CCIE engineering bench

Ekahau Certified Survey Engineer (ECSE)

Minority-owned · Fixed-fee SOW on every project

Every data center failure we get called in to diagnose traces back to the same short list: underlay MTU left at 1500, HSRP still running on leaf switches, head-end replication that nobody audited when the VTEP count crossed 200. We design, migrate, and validate data center fabrics—spine-leaf VXLAN EVPN, Cisco ACI, microsegmented overlays, and data center interconnect—without those inherited mistakes baked in. Our bench holds multiple CCIEs and 25 years of leadership in enterprise networking. Work is scoped and priced on a fixed-fee SOW, not hourly. If the fabric also needs to carry AI/GPU back-end traffic, see our AI-ready infrastructure design practice; if the physical layer needs an audit first, our structured cabling team can clear that dependency before fabric day-one.

Spine-Leaf Architecture: Why It Replaced 3-Tier

A classic 3-tier design (access → distribution → core) creates unequal latency paths and concentration points that become failure domains as east-west traffic scales. Spine-leaf eliminates both. Every leaf switch has a direct uplink to every spine switch, so any leaf-to-leaf flow is exactly two hops: leaf → spine → leaf, with no exceptions, no shortcuts, and no variance. That uniformity means ECMP across all spine uplinks gives you linear bandwidth scaling as you add spines: a four-spine, 24-leaf fabric with a 100 GbE uplink from each leaf to each spine delivers 400 Gb/s of non-blocking bandwidth between any two leaves.

Oversubscription ratios are an explicit design decision, not a default. General compute workloads tolerate 2:1 to 3:1. Storage replication, RoCEv2 AI back-end traffic, and NVMe-oF require 1:1 non-blocking—a single congestion-induced drop on a RoCEv2 path triggers a retransmit storm that cascades across the entire priority group. We size leaf uplink count and spine port density to the ratio the workload actually demands, then validate it under load before production handoff. Our independent validation testing practice runs that traffic-generation phase with real tooling, not a ping sweep and a handshake.
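The ratio itself is simple arithmetic: leaf downlink capacity divided by spine-facing uplink capacity. A minimal sketch, with illustrative port counts and speeds (not a sizing recommendation):

```python
def oversubscription(server_ports: int, server_gbps: int,
                     uplinks: int, uplink_gbps: int) -> float:
    """Ratio of leaf downlink capacity to spine-facing uplink capacity."""
    return (server_ports * server_gbps) / (uplinks * uplink_gbps)

# 48 x 25 GbE server-facing ports, 4 x 100 GbE spine uplinks -> 3:1,
# acceptable for general compute.
general_compute = oversubscription(48, 25, 4, 100)   # 3.0

# 16 x 100 GbE server ports, 4 x 400 GbE uplinks -> 1:1 non-blocking,
# what RoCEv2 and NVMe-oF paths demand.
storage_leaf = oversubscription(16, 100, 4, 400)     # 1.0
```

Working the division before buying hardware is the point: the ratio falls out of port counts and speeds, so it can be pinned in the design document and re-checked after any leaf or uplink change.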

VXLAN Encapsulation: What the Overhead Actually Costs You

VXLAN, defined in RFC 7348, tunnels Layer 2 frames across a Layer 3 underlay using UDP port 4789. The encapsulation overhead on an IPv4 underlay is 50 bytes per frame: 8-byte VXLAN header, 8-byte UDP header, 20-byte IPv4 outer header, 14-byte Ethernet outer header. IPv6 underlay pushes that to 70 bytes. Either way, if your underlay interfaces are running MTU 1500, every tenant frame larger than roughly 1450 bytes gets silently fragmented—or dropped if DF-bit is set. The correct baseline is jumbo frames at MTU 9216 end-to-end across every underlay interface, spine and leaf. We audit and enforce this before VTEP bring-up, not after the first latency complaint from the storage team.
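The byte math behind those figures, with header sizes per RFC 7348. One subtlety the sketch makes explicit: interface MTU counts the outer IP packet but not the outer Ethernet header, so the usable inner frame is underlay MTU minus 36 bytes, not minus 50:

```python
# Encapsulation header sizes in bytes (IPv4/IPv6 underlay).
OUTER_ETH, OUTER_IPV4, OUTER_IPV6, UDP, VXLAN = 14, 20, 40, 8, 8

overhead_v4 = OUTER_ETH + OUTER_IPV4 + UDP + VXLAN   # 50 bytes on the wire
overhead_v6 = OUTER_ETH + OUTER_IPV6 + UDP + VXLAN   # 70 bytes on the wire

def max_inner_frame(underlay_mtu: int, v6: bool = False) -> int:
    """Largest inner Ethernet frame that fits without fragmentation.
    MTU covers the outer IP packet; the outer Ethernet header rides free."""
    ip_hdr = OUTER_IPV6 if v6 else OUTER_IPV4
    return underlay_mtu - (ip_hdr + UDP + VXLAN)

print(max_inner_frame(1500))   # 1464-byte inner frame, i.e. ~1450-byte inner IP packet
print(max_inner_frame(9216))   # ample headroom for 9000-byte tenant jumbo frames
```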

The 24-bit VNI field supports 16,777,216 distinct segments—enough that VNI assignment policy matters more than VNI exhaustion. We document a VNI numbering scheme that encodes tenant, zone, and function directly in the identifier so that any engineer reading a packet capture can interpret the segment without a lookup table.
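One way such a scheme can be expressed is as fixed bit fields packed into the 24-bit identifier. The field widths below are hypothetical (8 bits tenant, 4 bits zone, 12 bits function); the point is that the mapping is mechanical in both directions:

```python
def encode_vni(tenant: int, zone: int, function: int) -> int:
    """Pack tenant/zone/function into a 24-bit VNI. Widths are illustrative."""
    assert tenant < 2**8 and zone < 2**4 and function < 2**12
    return (tenant << 16) | (zone << 12) | function

def decode_vni(vni: int) -> tuple[int, int, int]:
    """Recover (tenant, zone, function) from a VNI seen in a packet capture."""
    return (vni >> 16) & 0xFF, (vni >> 12) & 0xF, vni & 0xFFF

vni = encode_vni(tenant=10, zone=2, function=301)
print(vni, decode_vni(vni))   # 663853 (10, 2, 301)
```

With this layout the largest encodable value is exactly 2^24 − 1, so the scheme uses the whole VNI space without ever overflowing it.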

BGP EVPN Control Plane: Route Types That Matter in Production

BGP EVPN, defined across RFC 7432 and RFC 8365, is the control plane that makes VXLAN scalable. Three route types drive the majority of production behavior. Type 2 (MAC/IP Advertisement) distributes host MAC and IP bindings across VTEPs—this is what enables distributed ARP suppression so that ARP broadcasts never flood the underlay. Type 3 (Inclusive Multicast Ethernet Tag, IMET) advertises BUM replication policy per VNI: which VTEPs participate, and whether replication is head-end unicast or multicast-assisted. Type 5 (IP Prefix Route, RFC 9136) handles inter-VRF routing at the border leaf, replacing static inter-tenant routes with BGP-learned prefixes that track reachability automatically.

BUM replication mode selection is a scale decision. Head-end (ingress) replication has each VTEP unicast a separate copy of every BUM frame to every other VTEP in the VNI, which is O(N) copies per flood event. That stays manageable below roughly 50 VTEPs and a few hundred VNIs. Beyond roughly 200 VTEPs or 1,000 VNIs, unaudited ingress replication becomes a material underlay bandwidth consumer. Multicast underlay (PIM-SM or BiDir) moves replication into the network and scales efficiently, but adds PIM operational complexity. We model both against your VTEP count and VNI distribution before committing to a replication strategy in the design document.
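The scaling wall is quadratic, which a back-of-envelope model makes concrete. The per-VTEP BUM rate below is an assumption for illustration; plug in your own measured flood rate:

```python
def her_copies(vteps: int) -> int:
    """Copies one ingress VTEP sends per BUM frame: one to each peer."""
    return vteps - 1

def her_underlay_gbps(vteps: int, bum_mbps_per_vtep: float) -> float:
    """Aggregate underlay load if every VTEP floods bum_mbps_per_vtep of BUM.
    Each of N VTEPs replicates to N-1 peers: O(N^2) fabric-wide."""
    return vteps * (vteps - 1) * bum_mbps_per_vtep / 1000

print(her_copies(50))               # 49 copies per flood event
print(her_underlay_gbps(50, 1.0))   # 2.45 Gb/s fabric-wide: tolerable
print(her_underlay_gbps(250, 1.0))  # 62.25 Gb/s fabric-wide: material
```

Quintupling the VTEP count multiplies the replication load by roughly twenty-five, which is why the audit threshold matters more than the raw VTEP count.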

Cisco ACI vs. NX-OS EVPN: Choosing the Right Data Center Policy Model

ACI and standalone NX-OS EVPN are not interchangeable overlays on the same hardware. ACI runs a policy model—Endpoint Groups (EPGs), Contracts, Bridge Domains—defined in the APIC cluster's object model (3-node minimum, 5-node for large-scale fabrics) and enforced at line rate in the leaf-switch ASICs. Every policy decision is made in the APIC object model and pushed to hardware. If you attempt to hand-configure VLANs or VRFs directly on ACI leaf switches outside the APIC model, APIC overwrites your changes. We see that mistake repeatedly on inherited fabrics where a well-meaning network engineer bypassed APIC for a “quick fix” during an incident and created an inconsistent state that took days to reconcile.

Standalone NX-OS EVPN gives you full control-plane visibility, multi-vendor interoperability, and the ability to run the same fabric model across Cisco, Arista, and Juniper hardware in a multi-vendor or budget-constrained environment. The tradeoff is that segmentation is expressed through traditional VRF, VLAN, and ACL constructs rather than through ACI’s contract-enforced EPG boundaries. We design both models and migrate between them. The decision depends on your security posture, your team’s operational familiarity, and whether you need ACI’s hardware-enforced microsegmentation or can satisfy the same requirement with a distributed firewall layer above the fabric.

Rack elevation diagrams, current switch port maps, and server-to-ToR cabling documentation give us what we need to scope the engagement. Most engagements are scoped and quoted within two business days.

Frequently asked questions

How do you handle BGP-EVPN route policy when migrating from a legacy three-tier to spine-leaf?

The control-plane migration is the highest-risk phase. We run the legacy STP-based aggregation and the new BGP-EVPN spine-leaf in parallel, with VTEP-to-VTEP reachability validated before any workload moves. BGP-EVPN Type-2 MAC/IP routes (RFC 7432) carry host reachability; Type-5 IP Prefix routes (RFC 9136) carry subnet and VRF-prefix reachability for inter-subnet forwarding and DCI. We tune BGP route-refresh and route-dampening timers before go-live, and we validate ECMP hash distribution across all spine uplinks under synthetic load — ECMP hash polarization is a frequently observed silent performance failure in newly deployed spine-leaf fabrics. The go/no-go checklist is part of the fixed-fee SOW deliverable.
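Hash polarization is easiest to see in a toy model (this is not any vendor's hash algorithm). If the leaf stage and the spine stage apply the same hash function with the same seed, every flow a spine receives has already been pre-filtered into one bucket, so the spine's "distribution" collapses onto a single downstream link:

```python
import hashlib

def bucket(flow: tuple, n_links: int, seed: int) -> int:
    """Toy ECMP: hash the 5-tuple plus a seed, pick one of n_links."""
    h = hashlib.sha256(f"{seed}:{flow}".encode()).digest()
    return int.from_bytes(h[:4], "big") % n_links

# 1000 synthetic flows: (src, dst, proto, sport, dport).
flows = [("10.0.0.1", f"10.0.1.{i}", 6, 40000 + i, 443) for i in range(1000)]

# Stage 1: each leaf picks one of 4 spines. Keep the flows sent to spine 0.
to_spine0 = [f for f in flows if bucket(f, 4, seed=1) == 0]

# Stage 2 with the SAME seed: spine 0 re-hashes the same flows and every
# one lands back in bucket 0. Three of four downstream links sit idle.
second_stage = {bucket(f, 4, seed=1) for f in to_spine0}
print(second_stage)                 # {0}: polarized

# Stage 2 with a DISTINCT seed: distribution recovers across all links.
second_stage_fixed = {bucket(f, 4, seed=2) for f in to_spine0}
print(len(second_stage_fixed))      # all four links back in use
```

Real switches avoid this with per-stage hash seeds or rotated hash field selections; the synthetic-load validation exists precisely to catch fabrics where that knob was left at its default.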

What is the difference between VXLAN EVPN and Cisco ACI in practice, and how do you choose?

Both use VXLAN as the data-plane encapsulation (RFC 7348) and BGP-EVPN (RFC 7432, extended to VXLAN by RFC 8365) as the control plane for host reachability. ACI adds a policy model — the Application Policy Infrastructure Controller (APIC) enforces contract-based microsegmentation between Endpoint Groups (EPGs) without requiring per-VLAN ACL maintenance. Standards-based BGP-EVPN on Nexus 9000 (NX-OS mode) or Arista EOS gives more operator control and avoids APIC licensing costs. We recommend ACI when the operational team wants declarative policy enforcement; we recommend standards EVPN when the team has strong CLI discipline and wants to avoid a proprietary management dependency. EVPN Multisite with Border Gateways is in scope for DCI across two or more data centers.

Does the data center fabric engagement include server-to-fabric connectivity validation, or just the switching layer?

Both layers are in scope. Server-to-ToR connectivity validation covers NIC teaming mode confirmation (LACP per IEEE 802.1AX, originally standardized as 802.3ad, versus active-backup), VLAN tagging on the vSwitch or SR-IOV VF, and link-state verification after cutover. It also covers end-to-end MTU alignment: 9216 bytes on Cisco NX-OS, 9214 on Arista EOS, 9192 on Juniper QFX, aligned across the fabric so the 50-byte IPv4 VXLAN encapsulation overhead is absorbed without fragmentation. For Cisco UCS domains with fabric interconnects, we validate port-channel pinning and VSAN zoning as part of the same engagement. Storage fabric (FC, iSCSI, or NVMe-oF via RoCEv2) is scoped separately; we coordinate with the SAN/storage team but do not provide SAN design as a stand-alone deliverable.
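A headroom check like the one below is the shape of that MTU validation. The vendor maximums are the ones quoted above; the 9000-byte tenant MTU is an assumption about the server estate, and the 50-byte figure is conservative since interface MTU does not count the outer Ethernet header:

```python
VXLAN_IPV4_OVERHEAD = 50   # bytes, conservatively including outer Ethernet

fabric_mtu = {"nxos-leaf": 9216, "eos-spine": 9214, "qfx-border": 9192}
tenant_mtu = 9000          # assumed server-side jumbo frame setting

# Slack per hop after tenant frame + encapsulation overhead.
headroom = {hop: mtu - (tenant_mtu + VXLAN_IPV4_OVERHEAD)
            for hop, mtu in fabric_mtu.items()}

for hop, slack in headroom.items():
    status = "OK" if slack >= 0 else "FRAGMENTATION RISK"
    print(f"{hop}: {status} ({slack} bytes to spare)")
```

Any hop with negative slack is a fragmentation (or silent-drop) point and has to be fixed before cutover, not discovered afterward.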

How does AI GPU cluster back-end fabric differ from traditional spine-leaf, and do you design both?

They are distinct design disciplines. Traditional spine-leaf for compute and storage workloads uses ECMP over standard Ethernet, tolerates some congestion, and optimizes for cost-per-port. AI back-end fabric for GPU clusters requires non-blocking, near-lossless connectivity with 1:1 oversubscription ratios and congestion control tuned for all-to-all communication patterns. The primary platforms for AI back-end are NVIDIA Spectrum-X (adaptive routing, per-packet congestion control for RoCEv2 traffic), InfiniBand Quantum-2 for tightly coupled HPC workloads, and emerging Ultra Ethernet Consortium (UEC) specifications for open-standard AI Ethernet fabrics. Traditional spine-leaf (Cisco Nexus 9000, Arista EOS, Juniper QFX) remains correct for front-end and storage fabric. We scope both in the same engagement when the data center carries mixed general-purpose and GPU workloads.