Data Center Fabric Design FAQs

Question 1

What is the scope boundary between fabric design and server-to-fabric integration?

Accepted Answer

WFHS designs and migrates the fabric itself — spine, leaf, underlay routing, VXLAN EVPN overlay, border leaf, DCI, management fabric, and microsegmentation policy. Server-side NIC configuration (VMware vSphere Distributed Switch, Linux bonding, Windows NIC teaming), storage-fabric-specific tuning (iSCSI multipathing, NVMe-oF initiator config), and application-layer firewall policy are coordinated with the server and platform teams but implemented by them.

We will document the fabric-side parameters (VLAN/VNI assignments, MTU, LACP configuration, DCBX on the leaf) in a handoff specification that server engineers can implement against, and we validate end-to-end on a representative server before declaring a leaf pair production-ready.

Question 2

How does a VXLAN EVPN migration compare to a Cisco ACI deployment for the same enterprise?

Accepted Answer

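Both build a spine-leaf fabric on the same underlying Cisco Nexus 9000 hardware with a VXLAN-based data plane. The control planes differ in detail: NX-OS uses standards-based BGP EVPN, while ACI handles endpoint learning inside the fabric with its own COOP protocol and uses MP-BGP for external routes and for multi-pod or multi-site peering. The larger operational difference is where policy lives. ACI puts policy on APIC 6.0 as a central authority with EPG and contract abstractions; day-two operations work through the APIC.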

NX-OS EVPN leaves policy on the leaf (ACL, VRF, route-map) and uses Nexus Dashboard Fabric Controller (NDFC) as a configuration orchestrator rather than a policy authority. ACI is stronger for application-centric microsegmentation, multi-tenant isolation, and integrated L7 policy.

NX-OS EVPN is stronger for interoperability with non-Cisco EVPN peers, for teams that prefer routing-centric thinking, and for environments where fabric-level abstraction adds more complexity than it removes.

Question 3

What is the migration strategy from a legacy Catalyst 6500 core or Nexus 7000 aggregation layer to a spine-leaf EVPN-VXLAN fabric?

Accepted Answer

Parallel build, staged subnet cutover, and bake period before decommissioning. The new spine-leaf fabric is built alongside the legacy core with its own power, cabling, and management.

A temporary Layer 2 bridge at the border leaf extends each subnet from the legacy fabric to the new fabric, and the default gateway for that subnet moves from the legacy SVI to the new-fabric anycast gateway on a cutover window per subnet.

General application subnets migrate first; storage, database, and backup subnets migrate last after the general workload is stable on the new fabric. A 30-day bake period on the new fabric with no unexplained packet loss or latency events precedes legacy decommissioning. Rollback at any point is a single-subnet revert — the legacy SVI is re-activated and the BGP EVPN route for that subnet is withdrawn.

Question 4

How is an AI GPU training cluster fabric different from a general enterprise data center fabric?

Accepted Answer

Three differences dominate the design. First, oversubscription: AI training fabrics are 1:1 non-blocking because collective operations (all-reduce, all-gather) generate synchronized east-west bursts that saturate every uplink simultaneously, and oversubscription creates tail-latency events that stall the entire training job.

Second, the transport: RoCEv2 with PFC (802.1Qbb) and ECN/DCQCN is tuned for lossless Ethernet behavior, because packet loss triggers an RDMA NACK and retransmission and wastes GPU cycles. Third, platform selection leans toward NVIDIA Spectrum-4 SN5600 (51.2 Tbps in a 2RU system, RoCEv2-optimized ASIC, NetQ telemetry), with Arista 7800R3 and Cisco Nexus 9364E-SG2 as valid single-vendor alternatives.

The general-enterprise disciplines (BGP EVPN overlay, anycast gateway, microsegmentation) still apply, but the buffer tuning, traffic class separation, and NIC-to-fabric integration are materially more detailed than on a general workload fabric.
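The oversubscription point can be checked with simple arithmetic. The sketch below is illustrative only: the port counts are hypothetical, and the ratio is expressed as server-facing bandwidth divided by spine-facing bandwidth, so a value of 1 corresponds to the non-blocking design described above.

```python
from fractions import Fraction


def oversubscription(downlink_gbps: int, uplink_gbps: int) -> Fraction:
    """Ratio of server-facing to spine-facing bandwidth on one leaf.

    A result of 1 means non-blocking; 3 means three units of
    downlink bandwidth compete for one unit of uplink bandwidth.
    """
    if uplink_gbps <= 0:
        raise ValueError("a leaf needs at least one uplink")
    return Fraction(downlink_gbps, uplink_gbps)


# Hypothetical enterprise leaf: 48 x 25G down, 4 x 100G up.
enterprise = oversubscription(48 * 25, 4 * 100)

# Hypothetical AI leaf: 32 x 400G down, 32 x 400G up.
ai_training = oversubscription(32 * 400, 32 * 400)

print(f"enterprise leaf: {enterprise}:1, AI leaf: {ai_training}:1")
```

Any ratio above 1 on a training leaf means synchronized collective bursts can exceed uplink capacity, which is the tail-latency risk the answer describes.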

Question 5

When should we use Ingress Replication (IR) versus a multicast underlay for BUM traffic?

Accepted Answer

Ingress Replication is the correct default for most enterprise fabrics under roughly 64 leafs with low BUM volume — it is operationally simpler (no multicast routing on the underlay) and the O(n) cost per BUM frame is manageable at that scale.

A multicast underlay (PIM-SM with RPs on the spine, or BIDIR-PIM for simpler state) becomes the right answer at hyperscale, in VDI environments where PXE boot and broadcast-heavy profiles generate non-trivial BUM, and in multicast application workloads (market data, video distribution, clustering protocols) where the single multicast tree in the underlay is more efficient than ingress replication in the overlay.

The decision is a trade between operational simplicity and BUM efficiency, and it should be made on measured BUM-as-a-fraction-of-total data, not on vendor reference architectures alone.
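A rough way to frame the trade is to count how many copies of each BUM frame the ingress VTEP has to send. The sketch below assumes the worst case in which every leaf participates in the VNI; the function names and the sample byte counts are hypothetical.

```python
def bum_copies_from_ingress(leaf_count: int, mode: str) -> int:
    """Copies of a single BUM frame the ingress VTEP transmits.

    Assumes every leaf in the fabric has a VTEP in the VNI.
    """
    if leaf_count < 2:
        return 0
    if mode == "ingress-replication":
        # One unicast copy per remote VTEP: the O(n) cost.
        return leaf_count - 1
    if mode == "multicast":
        # One copy to the underlay group; the tree replicates it.
        return 1
    raise ValueError(f"unknown mode: {mode}")


def bum_fraction(bum_bytes: int, total_bytes: int) -> float:
    """Measured BUM traffic as a fraction of all traffic."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return bum_bytes / total_bytes


for leafs in (16, 64, 256):
    ir = bum_copies_from_ingress(leafs, "ingress-replication")
    print(f"{leafs} leafs: {ir} copies with IR, 1 with multicast")
```

Multiplying the copy count by the measured BUM fraction gives a first-order estimate of the extra uplink load that ingress replication adds at a given fabric size.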

Question 6

What are the standard EVPN route types we should expect to see in a production fabric?

Accepted Answer

Type-2 (MAC/IP advertisement) is the dominant route type on any fabric with endpoint mobility: every learned host MAC and host IP, in every VNI, is announced via a Type-2. Type-3 (Inclusive Multicast Ethernet Tag) sets up BUM replication per VNI, either as Ingress Replication or as a reference to a multicast underlay group.

Type-5 (IP Prefix) carries routed prefixes between VRFs and between fabrics — essential for DCI, for inter-VRF route leaking, and for summarization from leafs to border leafs. Type-4 (Ethernet Segment) elects a designated forwarder for ESI-LAG multi-homed servers, and Type-1 (Ethernet Auto-Discovery) supports fast convergence on ESI failure.

A fabric BGP table dominated by Type-2 with no Type-5 is a fabric where Layer 3 extensibility lives at a single tenant-edge device — usually a design gap rather than a design intent, and usually worth revisiting before the next fabric expansion.
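The check described above can be automated once route types have been extracted from the BGP table. This is a minimal sketch that assumes the route types are already available as a list of integers; how they are collected from a given platform is outside its scope, and the function names are hypothetical.

```python
from collections import Counter

EVPN_ROUTE_TYPES = {
    1: "Ethernet Auto-Discovery",
    2: "MAC/IP Advertisement",
    3: "Inclusive Multicast Ethernet Tag",
    4: "Ethernet Segment",
    5: "IP Prefix",
}


def summarize_route_types(route_types: list) -> dict:
    """Count routes per EVPN route type, keyed by type name."""
    counts = Counter(route_types)
    return {name: counts.get(rt, 0) for rt, name in EVPN_ROUTE_TYPES.items()}


def lacks_type5(route_types: list) -> bool:
    """True when Type-2 routes exist but no Type-5 routes do."""
    counts = Counter(route_types)
    return counts.get(2, 0) > 0 and counts.get(5, 0) == 0


# Hypothetical sample: mostly host routes, no IP Prefix routes.
sample = [2] * 950 + [3] * 40 + [1] * 6 + [4] * 4
print(summarize_route_types(sample))
print("review L3 design:", lacks_type5(sample))
```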

Question 7

How should DCI be designed between two enterprise data centers in 2024?

Accepted Answer

EVPN multisite is the default answer on new designs. Each data center is its own VXLAN EVPN fabric. A pair of border leafs at each site acts as the multisite gateway, peering with the other site’s border leafs across the DCI link as EVPN peers with a route-target rewrite.

That architecture controls which tenants extend across sites and which stay local on a per-VRF and per-VNI basis — the policy is explicit rather than implicit. Cisco Nexus 9000 supports EVPN multisite natively in NX-OS 10.4; Arista and Juniper support the equivalent border-leaf model under their respective orchestration.

For enterprise-to-colo extensions without full EVPN peering (typical when the colo is a managed service), MPLS Layer 3 VPN over a provider WAN is a valid design. Cisco OTV is sunsetted for new builds and is a migration target rather than a deployment target.

Question 8

What is the right microsegmentation approach for a brownfield fabric that was not designed for it?

Accepted Answer

Layered, starting with what can be deployed without touching the fabric. Agent-based segmentation (Illumio Core or Akamai Guardicore Centra) is the standard brownfield answer because the agent runs on the server OS and enforces policy at the iptables or Windows Filtering Platform layer — no dependency on the fabric being microsegmentation-capable.

That gets a working policy graph in place while the longer-horizon fabric refresh is planned. If the brownfield environment is heavily virtualized, the VMware NSX 4.2 (formerly branded NSX-T) distributed firewall is the alternate entry point: policy lives in the hypervisor, requires no fabric change, and follows the workload across vMotion.

Fabric-native segmentation (ACI EPGs, EVPN VRF with leaf ACL) comes in with the next fabric refresh, and the pre-built policy graph from the agent or hypervisor layer becomes the reference for the fabric-native contract library.

Question 9

In data center network design, what byte overhead does VXLAN add per tenant frame, and what underlay MTU do we plan for?

Accepted Answer

VXLAN adds 50 bytes with an IPv4 outer header and 70 bytes with an IPv6 outer header: 14 bytes outer Ethernet, 20 or 40 bytes outer IP, 8 bytes UDP, and 8 bytes VXLAN header per RFC 7348. Because every tenant frame is encapsulated at the ingress VTEP, the underlay MTU must exceed the tenant payload MTU by at least the applicable overhead.

That is why jumbo frames at 9216 bytes are the de facto standard on EVPN-VXLAN underlays — it leaves room for the encapsulation plus safety margin for nested tagging or IPv6 extension headers.

Our data center fabric engineers validate MTU end-to-end before any VXLAN cutover. A 1500-byte underlay silently drops full-size encapsulated frames while small packets still pass, so the failure mode looks like random packet loss.
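The overhead arithmetic is small enough to encode directly. The sketch below uses only the header sizes quoted in this answer; the function names and the 9000-byte tenant MTU in the example are illustrative assumptions.

```python
OUTER_ETHERNET = 14
OUTER_IP = {"ipv4": 20, "ipv6": 40}
UDP_HEADER = 8
VXLAN_HEADER = 8


def vxlan_overhead(outer: str = "ipv4") -> int:
    """Bytes added to each tenant frame by VXLAN encapsulation."""
    return OUTER_ETHERNET + OUTER_IP[outer] + UDP_HEADER + VXLAN_HEADER


def min_underlay_mtu(tenant_mtu: int, outer: str = "ipv4") -> int:
    """Smallest underlay MTU that carries a full-size tenant frame."""
    return tenant_mtu + vxlan_overhead(outer)


def headroom(underlay_mtu: int, tenant_mtu: int, outer: str = "ipv4") -> int:
    """Spare bytes left after encapsulation; negative means drops."""
    return underlay_mtu - min_underlay_mtu(tenant_mtu, outer)


print(vxlan_overhead("ipv4"), vxlan_overhead("ipv6"))
print(min_underlay_mtu(1500))   # tenant at 1500 over IPv4 underlay
print(headroom(9216, 9000))     # jumbo tenant on a 9216 underlay
print(headroom(1500, 1500))     # the silent-drop case
```

A negative headroom value is the condition to catch before cutover, since it only affects frames near the tenant MTU and is easy to miss with small test pings.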

Question 10

In data center network design, which UDP destination port is standardized for VXLAN, and how do we handle legacy 8472 deployments?

Accepted Answer

IANA assigned UDP port 4789 as the VXLAN destination port per RFC 7348, and that value should be used by default on every greenfield fabric. Early Linux kernel implementations and some pre-standard deployments used UDP 8472, which persists in a handful of brownfields where the original VTEP was built before the RFC finalized.

Greenfield and vendor-interop designs run 4789 without exception; brownfield integrations should be audited for port mismatch before any flows are cut over, because a 4789-to-8472 VTEP pair will silently drop encapsulated frames and log only generic forwarding counters.

Spec the destination port explicitly in the fixed-fee SOW so the cutover runbook flags any non-4789 VTEP discovered during cutover.
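A pre-cutover audit for port mismatch can be as simple as the sketch below. It assumes the VTEP inventory has already been collected into a mapping of VTEP name to configured UDP destination port; the inventory contents and function name are hypothetical.

```python
VXLAN_IANA_PORT = 4789      # assigned by IANA, per RFC 7348
LEGACY_LINUX_PORT = 8472    # pre-standard default on early Linux kernels


def audit_vtep_ports(vtep_ports: dict) -> dict:
    """Return the VTEPs whose destination port is not 4789."""
    return {
        name: port
        for name, port in vtep_ports.items()
        if port != VXLAN_IANA_PORT
    }


# Hypothetical inventory gathered before the cutover window.
inventory = {
    "leaf-101": 4789,
    "leaf-102": 4789,
    "legacy-hv-07": 8472,
}

for name, port in audit_vtep_ports(inventory).items():
    note = "legacy Linux default" if port == LEGACY_LINUX_PORT else "non-standard"
    print(f"{name}: UDP {port} ({note}), fix before cutover")
```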

Question 11

What does the 24-bit VXLAN VNI give us that a 12-bit 802.1Q VLAN ID cannot?

Accepted Answer

RFC 7348 defines the VXLAN Network Identifier as a 24-bit value, yielding roughly 16 million distinct overlay segments versus the 4,094 usable VLANs in 802.1Q. That scale ceiling is what makes EVPN-VXLAN viable for hyperscale multi-tenancy, IoT fleet isolation, and merged-entity fabrics where VLAN-ID collisions are guaranteed.

The 24-bit VNI is carried in the VXLAN header between VTEPs; from the tenant's perspective the segment is still a normal bridge domain, which means legacy 802.1Q-tagged workloads stitch into a VNI without application changes.

At scale this is the single number that decouples data-center segmentation from the legacy VLAN ceiling.
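To make the 24-bit field concrete, the sketch below packs and parses the 8-byte VXLAN header laid out in RFC 7348: 8 flag bits with the I flag set, 24 reserved bits, the 24-bit VNI, and 8 more reserved bits. The function names and the sample VNI are illustrative.

```python
import struct

VNI_BITS = 24
MAX_VNI = 2**VNI_BITS - 1          # 16,777,215
USABLE_VLANS = 4094                # 802.1Q IDs 1 through 4094
I_FLAG = 0x08                      # "VNI present" flag


def build_vxlan_header(vni: int) -> bytes:
    """Pack the 8-byte VXLAN header for a given VNI."""
    if not 0 <= vni <= MAX_VNI:
        raise ValueError("VNI must fit in 24 bits")
    # Word 1: flags in the top byte, 24 reserved bits.
    # Word 2: VNI in the top 24 bits, 8 reserved bits.
    return struct.pack("!II", I_FLAG << 24, vni << 8)


def parse_vni(header: bytes) -> int:
    """Extract the VNI from the first 8 bytes of a VXLAN header."""
    _, word2 = struct.unpack("!II", header[:8])
    return word2 >> 8


header = build_vxlan_header(10010)
print(header.hex(), parse_vni(header))
print(f"{MAX_VNI + 1:,} VNIs versus {USABLE_VLANS:,} VLANs")
```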

Question 12

What are the five EVPN route types our fabric actually carries, and what does each do?

Accepted Answer

RFC 7432 defines Types 1 through 4, and RFC 9136 adds Type 5; RFC 8365 covers how EVPN applies to VXLAN overlays. Type 1 is Ethernet Auto-Discovery: per-ES and per-EVI routes that signal all-active multihoming and enable mass-withdrawal on link failure. Type 2 is MAC/IP Advertisement, the workhorse that carries host learning and enables ARP suppression. Type 3 is Inclusive Multicast Ethernet Tag, which advertises L2 domain membership and builds the BUM replication list.

Type 4 is Ethernet Segment, which drives Designated Forwarder election across multihoming peers.

Type 5 is IP Prefix, used for inter-VRF prefix advertisement and subnet summarization between L3 VNIs.

All five types appear in production Arista, Cisco Nexus 9000, and Juniper QFX fabrics. Diagnosing a silent-host or stuck-DF problem always starts with dumping Type-2 and Type-4 routes from a route reflector.

Question 13

How do ESI length and DF election work for all-active EVPN multihoming?

Accepted Answer

RFC 7432 defines the Ethernet Segment Identifier as a 10-octet integer carried in Type-4 routes. Every PE attached to the same multihomed CE uses the same ESI, which is how the fabric identifies peers that share a segment.

Default Designated Forwarder election uses service carving: each PE builds a list of all segment members ordered by increasing IP address, then applies (V mod N) equals i, where V is the VLAN ID, N is the number of PEs on the segment, and i is the PE's zero-based ordinal position in that list.

That formula deterministically spreads BUM forwarding responsibility across all segment members, which is why an all-active ESI with three PEs does not bottleneck BUM on a single forwarder.
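The service-carving rule described above can be sketched in a few lines. This assumes the default procedure only (ordered PE list, zero-based ordinals, V mod N); the PE addresses and VLAN IDs are hypothetical.

```python
import ipaddress


def elect_df(pe_addresses: list, vlan_id: int) -> str:
    """Return the PE that is Designated Forwarder for a VLAN.

    PEs are ordered by increasing numeric IP address and numbered
    from 0; the DF is the PE whose ordinal equals vlan_id mod N.
    """
    if not pe_addresses:
        raise ValueError("segment has no PEs")
    ordered = sorted(pe_addresses, key=lambda a: int(ipaddress.ip_address(a)))
    return ordered[vlan_id % len(ordered)]


# Hypothetical three-PE all-active Ethernet Segment.
segment = ["10.0.0.3", "10.0.0.1", "10.0.0.2"]

for vlan in (100, 101, 102):
    print(f"VLAN {vlan}: DF is {elect_df(segment, vlan)}")
```

Consecutive VLAN IDs land on different PEs, which is the spreading behavior the answer describes. If one PE leaves the segment, N changes and the assignments are recomputed across the survivors.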

Question 14

Symmetric or asymmetric IRB — which do we deploy on EVPN-VXLAN fabrics?

Accepted Answer

Symmetric IRB by default, asymmetric only when tenant host counts are genuinely small. Per RFC 9135, symmetric IRB performs matching MAC and IP lookups on both the ingress and egress PEs and requires an L3 VNI per tenant VRF; in exchange it saves ARP and bridge-table memory at scale because each PE only holds ARP and MAC entries for locally attached hosts and reaches remote hosts through routes in the tenant VRF. Asymmetric IRB requires every PE to hold ARP entries for every remote host on every tenant, which typically stops scaling at a few thousand endpoints.

Cisco Nexus 9000, Arista, and Juniper EVPN fabrics all default to symmetric — it is the right answer for any fabric that will grow.

Our data center security architects review the L3 VNI assignment against the tenant VRF plan before the fabric is built.
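The memory argument can be illustrated with a deliberately simplified model. The sketch below assumes asymmetric IRB requires every leaf to resolve every host in the tenant, while symmetric IRB requires only locally attached hosts; it ignores platform-specific table sharing, and the host counts are hypothetical.

```python
def arp_entries_per_leaf(hosts_by_leaf: dict, mode: str) -> dict:
    """Estimate ARP entries each leaf must hold for one tenant."""
    total_hosts = sum(hosts_by_leaf.values())
    if mode == "asymmetric":
        # Every leaf resolves every host, local or remote.
        return {leaf: total_hosts for leaf in hosts_by_leaf}
    if mode == "symmetric":
        # Each leaf resolves only its locally attached hosts.
        return dict(hosts_by_leaf)
    raise ValueError(f"unknown mode: {mode}")


# Hypothetical tenant spread across three leafs.
tenant_hosts = {"leaf1": 400, "leaf2": 350, "leaf3": 250}

print(arp_entries_per_leaf(tenant_hosts, "asymmetric"))
print(arp_entries_per_leaf(tenant_hosts, "symmetric"))
```

Under this model the asymmetric footprint grows with total fabric host count on every leaf, while the symmetric footprint grows only with local attachment.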

Question 15

Why do webscale Clos fabrics run eBGP as the underlay instead of OSPF or IS-IS?

Accepted Answer

RFC 7938 spells out the reasoning. eBGP assigns a unique ASN per leaf and a shared ASN across the spine tier, which exploits BGP's path-vector scope — failure events propagate only where they affect reachability rather than flooding an entire IGP area. Multipath-relax and multipath-multiple-AS options engage every parallel leaf-spine uplink for ECMP, so traffic spreads across all available paths rather than hashing to a single next-hop.

The result is a fabric that scales to thousands of leafs without LSA-flooding storms or area boundaries, and where any single link or spine failure produces a bounded, local reconvergence rather than a fabric-wide event.
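The ASN scheme described above (one shared ASN for the spine tier, a unique ASN per leaf) can be sketched as a small allocator. It assumes the 32-bit private ASN range from RFC 6996; the device names and the choice of base ASN are hypothetical.

```python
PRIVATE_ASN32_FIRST = 4200000000   # start of 32-bit private range
PRIVATE_ASN32_LAST = 4294967294    # end of 32-bit private range


def plan_asns(spines: list, leafs: list, base: int = PRIVATE_ASN32_FIRST) -> dict:
    """Assign one shared ASN to all spines and a unique ASN per leaf."""
    plan = {spine: base for spine in spines}
    for offset, leaf in enumerate(leafs, start=1):
        asn = base + offset
        if asn > PRIVATE_ASN32_LAST:
            raise ValueError("ran out of private 32-bit ASNs")
        plan[leaf] = asn
    return plan


plan = plan_asns(["spine1", "spine2"], ["leaf1", "leaf2", "leaf3"])
for device, asn in plan.items():
    print(f"{device}: AS {asn}")
```

Because every leaf has its own ASN, a leaf sees equal-length AS paths through each spine, which is the condition the ECMP behavior in the answer relies on.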

Question 16

What does Arista MLAG deliver that EVPN multihoming does not, and when is each appropriate?

Accepted Answer

Arista MLAG is a two-chassis LACP dual-homing scheme: peer-link port-channel, shared domain-id, and heartbeat keep-alive with a 4000 ms default interval. The peer is declared dead after 30 seconds of missed heartbeats, at which point the surviving chassis takes over the LAG. MLAG is Layer-2 and chassis-pair limited; it cannot scale past two switches. EVPN multihoming using Type-1 and Type-4 routes scales to many leafs and does all-active forwarding across any number of segment members.

Use MLAG for legacy L2 server dual-homing or when a classic pair-at-the-top is the operational model; use EVPN multihoming for multi-leaf all-active designs.

Our engineers scope the choice against the server team's NIC-bonding posture before the fixed-fee SOW is cut.

Question 17

What is Arista VARP, and how does it differ from HSRP, VRRP, or EVPN Distributed Anycast Gateway?

Accepted Answer

VARP lets multiple Arista switches answer ARP for the same virtual IP and MAC, providing active-active first-hop routing without a master or standby election — effectively a static anycast gateway. HSRP and VRRP are active/standby protocols with preemption timers, where only the master forwards on behalf of the virtual IP.

EVPN Distributed Anycast Gateway does what VARP does but at fabric scale, using EVPN Type-2 advertisements so every VTEP across the fabric shares the same gateway IP and MAC.

Picking between VARP and DAG is usually a fabric-scope question: VARP inside an MLAG pair, DAG across a spine-leaf fabric.

Question 18

When does a design require Arista 7800R3 deep-buffer spines instead of shallow-buffer 7060X6?

Accepted Answer

The 7800R3 uses a deep-buffer Virtual Output Queue architecture — 7800R3-36P line cards carry 24 GB of buffer per card, and the flagship 7816R3 has 384 GB of system buffer. FIB scale on the L3-XXXL profile reaches 3,950k IPv4 routes, 384k MAC, and 112k ARP.

The shallow-buffer 7060X6 is the opposite trade-off: 51.2 Tbps of 800G OSFP with small on-chip buffers, tuned for AI lossless fabrics that rely on PFC and ECN rather than on-switch buffering.

Pick 7800R3 for traditional DC fabrics with bursty north-south or incast patterns; pick 7060X6 for GPU training clusters running RoCE.

Question 19

What Arista platform do we spec for an 800G AI training leaf, and what capacity does it deliver?

Accepted Answer

Arista 7060X6 is the current 800G AI leaf — positioned on Arista's platforms page as a best-of-breed 800G solution optimized for AI workloads, with 32 to 64 ports of 800G OSFP800. Pair it with the 7800R4 spine, which delivers up to 460 Tbps of throughput and 576 ports of 800G or 1152 ports of 400G per Arista's AI networking documentation.

For very large training clusters, the 7700R4 Distributed Etherlink Switch scales to more than 30,000 400GbE accelerators in a single fabric.

The right split between 7060X6, 7800R4, and 7700R4 depends on the GPU count and the rail topology the cluster operator has chosen.

Our AI-ready infrastructure team sizes the leaf-spine ratio against the collective-traffic profile.
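The headline capacity figures follow from port count multiplied by port speed, which is a useful sanity check when comparing platforms. The sketch below uses only the port counts quoted in this answer; the function name is illustrative.

```python
def aggregate_tbps(port_count: int, port_speed_gbps: int) -> float:
    """Aggregate one-direction port capacity in Tbps."""
    return port_count * port_speed_gbps / 1000


# 7060X6 leaf at its 64-port maximum.
print(aggregate_tbps(64, 800))

# 7800R4 spine: 576 x 800G and 1152 x 400G are the same capacity.
print(aggregate_tbps(576, 800), aggregate_tbps(1152, 400))
```

The 576-port figure works out to 460.8 Tbps, consistent with the "up to 460 Tbps" quoted above.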

Question 20

How does Arista Etherlink position against proprietary InfiniBand for RDMA and RoCE AI fabrics?

Accepted Answer

Arista Etherlink is standards-based Ethernet engineered for AI networking. The stack provides RDMA-aware QoS and load-balancing capabilities that ensure reliable packet delivery to NICs supporting RoCE, AI Analyzer with workload and NIC integration for end-to-end visibility, AVA machine-learning for anomaly detection, and forward compatibility with Ultra Ethernet Consortium specifications as they finalize.

The operational argument against InfiniBand is simple: Etherlink reuses the same EOS toolchain, the same CloudVision telemetry, and the same vendor-agnostic Ethernet cable plant that the rest of the data center already runs.

For organizations that do not want a parallel InfiniBand silo with its own OpEx, Etherlink is the path to keep AI training on a single fabric family.

Question 21

Does our DCI link support IEEE 802.1AE MACsec end-to-end, and at what cipher strength?

Accepted Answer

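Yes, on platforms that support it, with the caveat that MACsec protects traffic hop by hop at Layer 2, so it is end-to-end only where the DCI is a direct link or a transparent Ethernet service between the two MACsec endpoints. IEEE 802.1AEbn-2011 added GCM-AES-256 as an available cipher suite beyond the original GCM-AES-128 baseline. MACsec peer discovery and authentication use the MACsec Key Agreement (MKA) protocol defined in IEEE 802.1X, which lets two MACsec endpoints authenticate each other and distribute session keys without exposing key material in the clear.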

On the Arista side, the 7280R3A, R3AM, and R3AK variants provide AES-256-GCM MACsec on applicable port types while keeping FlexRoute FIB scale intact, which is what you need for an encrypted DCI that also carries full internet tables.

Question 22

What is the Arista 7388X5, and when is a 25.6 Tbps single-ASIC chassis preferable to a multi-ASIC build?

Accepted Answer

The 7388X5 is a 4U modular chassis built on a single 25.6 Tbps packet processor — 64 ports of 400G QSFP-DD or 128 ports of 200G QSFP56, 10.6 billion packets-per-second forwarding, 825 ns port-to-port latency, 114 MB of buffer, and under 10 W typical per 200G port. Use the 7388X5 for hyperscale or AI fabrics that value single-ASIC cut-through latency over the deep-buffer profile of a 7800R3.

Traffic that never leaves the single ASIC avoids the internal-fabric hop of a multi-ASIC chassis, which matters for latency-sensitive collectives and low-jitter workloads.

If bursty incast is the dominant pattern, a deep-buffer 7800R3 is still the right spine.

Question 23

Where does the Arista 7050X3 sit in a spine-leaf build, and what are its real scale limits?

Accepted Answer

The 7050X3 is the fixed 1U general-purpose leaf option — up to 32 ports of 100G or 128 ports of 10/25G, 48xSFP25 through 96x25G SFP variants, port-to-port latency of 800 ns on most models (one variant at 3 microseconds), a fully shared 32 MB packet buffer tuned for lossless networks, MAC table up to 288k, and 64-way MLAG.

It is the right leaf for general-purpose 25G server access, enterprise DC tenants, and sites that do not need AI-specific shallow-buffer 800G leafs.

It is not the right leaf for GPU training clusters or for high-radix 100/400G aggregation — those workloads belong on 7060X6 or 7388X5.

Question 24

Why do EVPN multihoming and fast convergence need BGP unnumbered in Cumulus and FRR leaf stacks?

Accepted Answer

Cumulus Linux uses FRR for BGP and implements RFC 5549 unnumbered (RFC 5549 has since been obsoleted by RFC 8950): peers exchange IPv4 routes with IPv6 link-local next hops, which eliminates per-interface /30 or /31 addressing across hundreds of fabric links. That removes an entire class of addressing errors at scale and makes Auto-BGP viable for Clos designs, since Cumulus Auto-BGP auto-generates 32-bit ASNs for two-tier leaf-spine topologies without operator intervention.

Cumulus ships ECMP by default on the data plane, so the combination of unnumbered peering plus default ECMP means a greenfield Cumulus fabric reaches all-active leaf-spine forwarding with a minimum of bespoke configuration.

For brownfield moves off traditional IGP underlays this is a large operational simplification.

Question 25

What are the three Arista EVPN service models, and when do we pick each?

Accepted Answer

Per Arista EOS there are three. VLAN-Based is one-to-one VLAN-to-MAC-VRF — granular route targets, finest policy control, largest route-table footprint. VLAN Bundle is N-to-one with a single bridge table across all VLANs in the MAC-VRF — smallest route-table footprint but no per-VLAN policy.

VLAN Aware Bundle keeps per-VLAN bridge tables inside a single MAC-VRF, combining bundling efficiency with VLAN awareness — the right choice for most enterprise tenants because it delivers per-VLAN policy without per-VLAN route-target bloat.

Pick VLAN-Based when every VLAN needs independent policy; pick VLAN Bundle for high-density IoT where policy is uniform; pick VLAN Aware Bundle for almost everything else.

Question 26

What does Arista CloudVision give us that APIC, Apstra, and Nexus Dashboard do not?

Accepted Answer

EOS is built on a publish-subscribe state database, SysDB, that is evolving into NetDB, a streaming state layer, and EOS itself ships as a single binary image across every Arista platform. CloudVision consumes the NetDB stream for telemetry, change control, and compliance across the entire portfolio, with first-party Ansible integration and Arista Validated Designs automation on top.

The operational delta versus APIC, Apstra, or Nexus Dashboard is the single-binary, single-state-model story: the same EOS image runs on a 7050X3 leaf, a 7280R3 DCI router, and a 7800R4 AI spine, and every one of them streams state into the same CloudVision instance.

That consistency is what removes per-platform tooling forks.

Question 27

Which Juniper QFX do we spec as a 400G data-center spine today?

Accepted Answer

Juniper QFX5700 is the current 400G spine. Per Juniper's QFX Series page it supports up to 32 ports of 400GbE or 144 ports of 50/40/25/10GbE and is positioned for data-center fabric spine, EVPN-VXLAN fabric, and data-center interconnect. The QFX5120 is the general-purpose 1/10/25G leaf with 8x40/100G uplinks, and the modern portfolio now groups QFX5240, 5230, 5220, 5210, 5200, 5130, 5120, 5110, and 5700 as the data-center fabric family.

For a greenfield 400G EVPN-VXLAN spine on Juniper, QFX5700 is the right answer; designs that call for an 800G spine tier should evaluate the QFX5240 instead.

Question 28

What is the scale delta between Arista 7280R3 standard and 7280R3K, and when does the K variant matter?

Accepted Answer

Standard 7280R3 models carry roughly 1,450k IPv4 base routes plus 1,792k FlexRoute routes. The 7280R3K variant scales to 2,250k base plus 2,048k FlexRoute, approaching 5 million total routes across profiles. The K variant matters when the box sits at a DCI edge, a peering boundary, or an internet-facing spine where full IPv4 and IPv6 tables plus regional carrier overlap all have to fit in FIB simultaneously.

The 7280DR3A-54 pairs 24 GB deep buffers with 54 ports of 400G QSFP-DD for lossless incast on the same platform family.

Pick 7280R3 for internal DC routing; pick 7280R3K when the FIB budget is a real constraint.

Data Center Fabric Design and Migration: Spine-Leaf, VXLAN EVPN, ACI

Spine-Leaf Architecture: Why It Replaced 3-Tier

Leaf Platform Selection: 10G, 25G, 100G, and 400G Access

Spine Platform Selection: Non-Blocking Backplane Capacity

VXLAN EVPN Design: Control Plane, Data Plane, and Route Types

EVPN Route Types That Matter in Production

Anycast Gateway and Distributed Routing

Ingress Replication vs. Multicast Underlay

Cisco ACI vs. NX-OS EVPN: Choosing the Right Data Center Policy Model

When ACI Is the Right Choice

When NX-OS EVPN Is the Right Choice

Vendor-Agnostic Fabric Design: Cisco, Arista, Juniper, NVIDIA

Cisco Nexus 9000 with NX-OS or ACI

Arista EOS with CloudVision

Juniper QFX with Apstra

NVIDIA Spectrum-4 with Cumulus Linux

Oversubscription Decisions: 1:3 vs. 1:1 Non-Blocking

General Enterprise: 1:3 Is the Working Default

AI Training and HPC: 1:1 Non-Blocking Is Non-Negotiable

Microsegmentation: Fabric-Native, Hypervisor, and Agent-Based

Fabric-Native: ACI EPGs, EVPN VRF, and Leaf ACL

Hypervisor: VMware NSX-T 4.2 Distributed Firewall

Agent-Based: Illumio Core and Akamai Guardicore

Data Center Interconnect: EVPN Multisite, DCI-EVPN, and MPLS L3VPN

When Layer 2 Extension Is the Right Answer

MPLS L3VPN and Legacy OTV

AI Training Fabrics: RoCEv2, PFC, and Lossless Ethernet

RoCEv2 and PFC 802.1Qbb

Platform Options for AI Fabrics

Fabric Telemetry, Observability, and Legacy Migration

Migrating from Legacy 3-Tier or Collapsed Core

Scoping a Data Center Fabric Project

Data Center Fabric Credentials and Engagement Model