AI Networking Fabric Comparison: NVIDIA Spectrum-X Ethernet vs Arista Etherlink vs Cisco Nexus (G200) vs NVIDIA Quantum-X800 InfiniBand
Four AI back-end fabric approaches — NVIDIA Spectrum-X Ethernet (SN5600 + BlueField-3 SuperNIC), Arista Etherlink (7060X6 leaf, 7800R4 AI spine), Cisco Nexus with Silicon One G200 (N9364E-SG2), and NVIDIA Quantum-X800 InfiniBand (Q3200 / Q3400 + ConnectX-8) — compared on line-speed ceiling, UEC 1.0 posture, RoCEv2 and DCQCN congestion control, SHARP / in-network reduction, adaptive routing and packet spraying, rail-optimized topology support, tail latency, and reference AI cluster scale.
WiFi Hotshots is a vendor-agnostic enterprise network engineering firm serving enterprise customers, AI-platform buyers, data center architects, and network engineering teams across Southern California and the broader US market.
Multi-CCIE engineering bench
AI-ready infrastructure practice — RoCE, InfiniBand, UEC
Fixed-fee SOW — no T&M surprises
25 years of enterprise networking leadership
AI back-end (scale-out) fabric selection is a transport-plus-topology decision, not a switch-model decision. All four platforms support 800G line rate on current generation and 400G on the preceding generation, but they differ on transport (Ethernet + UEC vs InfiniBand), congestion-control posture (DCQCN variants vs credit-based lossless vs in-network reduction), and the scale at which a non-blocking fat-tree can be sustained without a super-spine tier. See the AI-ready infrastructure practice, the data center engineering service, the full services directory, or browse adjacent comparisons in the vendor comparison library — the 400G data-center leaf comparison covers the front-end and storage-fabric tier.
Why Compare These Four, and Why Not RoCEv2-Only Ethernet
The AI back-end fabric market has consolidated around two transports and three vendor postures. InfiniBand remains the reference design for NVIDIA DGX SuperPOD — the H200, B200, and B300 reference architectures document a rail-optimized full fat-tree with Managed NDR or XDR switches. Ethernet has split into two camps: NVIDIA Spectrum-X Ethernet (tightly coupled Spectrum-4 switch plus BlueField-3 or ConnectX-8 SuperNIC, marketed as a 1.6x AI-networking-performance uplift over generic Ethernet per NVIDIA reference architecture), and Ultra Ethernet Consortium 1.0 compliant platforms from Arista, Cisco, and the broader UEC steering membership (AMD, Broadcom, Intel, Meta, Microsoft, Oracle and others).
UEC 1.0, released June 11, 2025, is an open multi-vendor transport stack that replaces RoCEv2's PFC-induced head-of-line blocking with packet-level multi-pathing and modern congestion control. Generic RoCEv2-only leaves (no UEC, no Spectrum-X, no Jericho3-AI) are excluded from this comparison because they cannot sustain collective-communication efficiency at the GPU counts modern training runs require — roughly 60% effective bandwidth on generic Ethernet versus ~95% on an optimized fabric, per NVIDIA whitepapers.
The Comparison Matrix: AI Back-End Fabric Specifications That Matter
Vendor marketing “non-blocking” claims assume specific traffic patterns — real collective-communication (NCCL AllReduce) efficiency depends on topology oversubscription ratio, ECMP hash collisions, and congestion reaction time in addition to switch radix. Pricing is framed as CapEx model class, never a list price. Where a specification reads “not published,” the vendor document does not disclose that value in primary sources reviewed.
| Specification | NVIDIA Spectrum-X (SN5600 + BlueField-3) | Arista Etherlink (7060X6 + 7800R4 AI) | Cisco Nexus (N9364E-SG2, G200) | NVIDIA Quantum-X800 InfiniBand (Q3200 / Q3400 + ConnectX-8) |
|---|---|---|---|---|
| Transport | Lossless Ethernet + RoCEv2 with NVIDIA Direct Data Placement (DDP) re-ordering on the SuperNIC. | Lossless Ethernet + RoCEv2 today; UEC-ready (UET) on roadmap per Arista AI networking positioning. | Lossless Ethernet + RoCEv2; Cisco is a UEC steering member (G200 aligned with UEC multi-pathing). | InfiniBand — credit-based lossless transport, no PFC required, native reliable delivery. |
| Line-speed ceiling (current gen) | 800G per port (64x 800G OSFP on SN5600, 51.2 Tbps Spectrum-4 ASIC). | 800G per port on 7060X6 leaf (51.2 Tbps Tomahawk 5); 7800R4 AI spine scales to 576x 800GbE / 460 Tbps. | 800G per port on N9364E-SG2 (64x 800G OSFP or QSFP-DD, 51.2 Tbps G200). | 800G XDR per port (Q3400: 144x 800G across 72 OSFP cages; Q3200: 72x 800G in 2U twin-port). |
| ASIC / silicon vendor | NVIDIA Spectrum-4 — 51.2 Tbps single chip, TSMC 4nm, 100B transistors per third-party analyses. | Leaf: Broadcom Tomahawk 5 (51.2 Tbps). AI Spine: Broadcom Jericho3-AI with Ramon fabric. | Cisco Silicon One G200 — 5 nm, 51.2 Tbps, 512x 112G SerDes, 256 MB on-die buffer. | NVIDIA Quantum-3 ASIC with 200 Gb/s-per-lane SerDes; Q3400 = 144x 800G in 4U. |
| Congestion control | Spectrum-X RoCE congestion control + switch-native high-frequency telemetry feeding SuperNIC rate adjustment. | RDMA-aware QoS, dynamic load balancing, PFC + ECN/DCQCN today; UEC transport on roadmap. | PFC + ECN with advanced load balancing and fault detection to improve AI/ML job completion times. | Credit-based flow control native to IB — no PFC storms, no DCQCN tuning; SHARP offloads reductions. |
| Lossless Ethernet — PFC + ECN | Required; tightly integrated with BlueField-3 SuperNIC for end-to-end adaptive routing. | Required; Arista EOS tooling for PFC + ECN telemetry and AI Analyzer visibility. | Required; Nexus Dashboard + NX-OS congestion-control telemetry for AI workloads. | N/A — IB uses credit-based flow control; PFC is an Ethernet-only concern. |
| UEC 1.0 posture | NVIDIA is not a UEC steering member; Spectrum-X is a proprietary alternative to UEC. | UEC-compatible hardware today; Arista is a UEC participant per company statements on Etherlink positioning. | UEC steering member; G200 and N9364E-SG2 marketed as UEC-ready AI fabric silicon. | N/A — UEC is an Ethernet-only specification; IB is governed by the IBTA. |
| Adaptive routing / packet spraying | Switch-level per-packet adaptive routing + SuperNIC Direct Data Placement (DDP) re-ordering; packet spraying native. | Cell-based fabric on 7800R4 AI spine — data sprayed across all fabric links; dynamic load balancing on 7060X6 leaf. | Advanced load balancing on G200 to avoid ECMP hash collisions and improve job completion times. | Native adaptive routing standard in Quantum InfiniBand since Quantum-2 generation. |
| Rail-optimized topology support | Reference in DGX SuperPOD B300 RA — rail-optimized twin-planar full fat-tree with Spectrum-X. | Supported in Arista AI reference designs; 7800R4 AI spine is purpose-built for rail-optimized clusters. | Supported on Nexus 9000 with G200; Cisco AI PODs validated designs document the topology pattern. | Reference architecture in DGX SuperPOD H200 RA — rail-optimized balanced full fat-tree, Managed NDR. |
| In-network reduction / collective offload | Not native at switch (Spectrum-X is Ethernet); NCCL-level optimization via DDP + adaptive routing. | Not native at switch; NCCL optimization via RDMA-aware QoS and cell-fabric scheduling on 7800R4. | Not native at switch; NCCL optimization via G200 load balancing and Nexus Dashboard analytics. | SHARP v3 on Quantum-X800 — switch-fabric aggregation of AllReduce, ~2x effective bandwidth on collective ops. |
| Reference AI cluster scale (published) | 128K GPUs in two-tier multiplane topology per Spectrum-X SN6000 / SN5000 positioning; SuperPOD B300 reference scales to 2,000+ DGX nodes (~16K+ GPUs) and beyond. | 7800R4 single-chassis supports 576x 800GbE spine ports; 7700R4 Distributed Etherlink scales to 30,000+ 400 GbE accelerators in one domain. | Nexus 9000 + G200 positioned for leaf-and-spine HPC / AI/ML at hyperscaler density; specific max GPU figure not published in primary sources. | Q3400 two-level fat-tree connects up to 10,368 ConnectX-8 NICs per NVIDIA Quantum-X800 documentation. |
| Buffer depth | 160 MB fully shared global cache on Spectrum-4 per third-party silicon analyses. | 165 MB fully shared buffer on 7060X6; deep-buffer VOQ on 7800R4 AI spine (cell-based, distributed credit). | 256 MB on-die packet buffer per Cisco N9364E-SG2 data sheet. | IB buffer model differs — credit-based flow control replaces large ingress buffers. |
| Tail latency (port-to-port) | Not published as a single number; Spectrum-X targets microsecond-precision rate reaction per NVIDIA technical blog. | Low / deterministic per Arista 7800R4 distributed scheduling; specific ns-class number not published. | “Deterministic, low-latency” per Cisco G200 data sheet; specific ns-class number not published. | Sub-microsecond switch latency is typical for InfiniBand; verify the specific Quantum-X800 figure in the NVIDIA hardware user manual. |
| Power envelope (per switch, indicative) | SN5600 2U system; PSU sizing per the NVIDIA or OEM datasheet; exact maximum not disclosed in sources reviewed. | 7060X6-64PE 2RU with 64x 800G; 7800R4 up to 16-slot chassis with 3 kW HVAC/HVDC/LVDC PSU modules. | N9364E-SG2 2RU with PSU3KW-HVPI 3 kW option; max 20 W per OSFP port. | Q3400 4U air-cooled (144x 800G); Q3200 2U air-cooled. |
| LPO / linear-drive optics support | Industry-standard 800G OSFP optics on SN5600; LPO ecosystem maturing on Spectrum-X generation. | Industry-standard OSFP / QSFP-DD optics; Arista positioned for LPO and XPO migration on 800G / 1.6T. | QSFP-DD and OSFP variants on N9364E-SG2; Cisco announced LPO / XPO roadmap alignment with G300 generation. | OSFP 800G XDR optics; NVIDIA has published CPO / silicon-photonics roadmap on Quantum-X Photonics. |
| Management plane | NVIDIA Unified Fabric Manager (UFM) for Spectrum-X + Cumulus Linux or SONiC on SN5600. | Arista EOS + CloudVision; AI Analyzer and Autonomous Virtual Assist (AVA) for AI fabric visibility. | Cisco NX-OS + Nexus Dashboard + Nexus Dashboard Insights (AI workload analytics). | NVIDIA UFM for InfiniBand — subnet manager, SHARP tree management, adaptive-routing policy. |
| SOC 2 / FedRAMP posture | NVIDIA / Mellanox products — FIPS, CC, and federal compliance handled on a per-product basis; verify with NVIDIA compliance. | Arista EOS carries federal certifications across the 7000 series; verify 7800R4 / 7060X6-specific status with Arista. | Cisco NX-OS + hardware federal certifications (Cisco Trust Portal); verify N9364E-SG2-specific status with Cisco. | NVIDIA Mellanox InfiniBand federal compliance is handled per product line; verify Q3200 / Q3400-specific status with NVIDIA. |
| CapEx model class (framing) | Premium — NVIDIA SN5600 + BlueField-3 SuperNIC pair, no generic-Ethernet substitution without efficiency loss. | Market-competitive 51.2T Ethernet with multi-vendor optics; 7800R4 chassis is a significant CapEx line item in spine tier. | Market-competitive 51.2T Ethernet; Cisco software licensing and Nexus Dashboard subscription add OpEx line items. | Premium — NVIDIA Quantum-X800 + ConnectX-8 pair, commonly bundled in DGX SuperPOD reference configurations. |
AI fabric selection is a GPU-count, collective-pattern, and budget-envelope decision — not a vendor-logo decision. Send GPU counts, model size, and whether you are constrained to Ethernet; WiFi Hotshots returns a fixed-fee fabric-architecture SOW.
Per-Platform Fact Summaries
NVIDIA Spectrum-X Ethernet (SN5600 + BlueField-3 SuperNIC)
Spectrum-X is NVIDIA’s tightly coupled Ethernet-for-AI platform. The SN5600 is a 2U 64x 800G OSFP switch built on the Spectrum-4 51.2 Tbps ASIC; the BlueField-3 SuperNIC delivers up to 400 Gb/s RDMA between GPU servers. The defining architectural pairing is switch-level per-packet adaptive routing plus SuperNIC Direct Data Placement (DDP) to re-order out-of-order packets in GPU memory without application visibility. Published NVIDIA whitepapers position Spectrum-X as delivering ~95% effective bandwidth versus ~60% on generic Ethernet, and the platform is a reference fabric in the DGX SuperPOD B300 architecture (rail-optimized twin-planar full fat-tree). NVIDIA is not a UEC steering member; Spectrum-X is a proprietary alternative. Managed via NVIDIA Unified Fabric Manager (UFM) with Cumulus Linux or SONiC options on SN5600.
Arista Etherlink (7060X6 leaf, 7800R4 AI spine)
Arista Etherlink is the portfolio positioning that spans the 7060X6 (Broadcom Tomahawk 5, 51.2 Tbps, 64x 800G or 128x 400G in 2RU with 165 MB shared buffer) at the leaf, and the 7800R4 AI spine (Broadcom Jericho3-AI with Ramon fabric, up to 576x 800GbE or 1,152x 400GbE, 460 Tbps system throughput) at the scale-out spine. The 7700R4 Distributed Etherlink Switch scales to over 27,000 800GbE ports and 30,000+ 400GbE accelerators in a single distributed domain with cell-based VOQ and distributed credit scheduling. Arista is explicit that Etherlink is “forwards compatible with UEC” in public product positioning, delivering dynamic load balancing, RDMA-aware QoS, and AI Analyzer telemetry on EOS / CloudVision today while the UEC 1.0 transport (released June 11, 2025) moves through implementation.
Cisco Nexus (N9364E-SG2 with Silicon One G200)
Cisco’s AI back-end posture is the Silicon One G200 inside the Nexus 9000 family, most visibly the N9364E-SG2 2RU 64x 800G switch available in QSFP-DD or OSFP form factors with 256 MB on-die packet buffer and 20 W max per OSFP port. G200 is 5 nm, 51.2 Tbps, 512x 112G SerDes with advanced load balancing and fault detection explicitly designed to improve AI/ML job completion times. Cisco is a UEC steering member, positioning the G200 generation as UEC-ready AI fabric silicon. Cisco announced Silicon One G300 in February 2026 as the next-generation scale-up platform for the agentic-AI era; the G200-based N9364E-SG2 remains the shipping AI leaf / spine platform in the Nexus 9000 line. Managed via Cisco NX-OS plus Nexus Dashboard and Nexus Dashboard Insights for AI workload analytics.
NVIDIA Quantum-X800 InfiniBand (Q3200 / Q3400 + ConnectX-8)
Quantum-X800 is the current-generation NVIDIA InfiniBand platform. The Q3400-RA is a 4U switch powered by the Quantum-3 ASIC with 200 Gb/s-per-lane SerDes, delivering 144x 800 Gb/s ports across 72 OSFP cages and supporting a two-level fat-tree of up to 10,368 ConnectX-8 NICs per NVIDIA Quantum-X800 documentation. The Q3200-RA is a 2U twin-switch configuration with 72x 800G effective (36 OSFP twin-port cages). InfiniBand’s native advantages for scale-out training are credit-based lossless flow control (no PFC storm exposure), native adaptive routing since Quantum-2, and SHARP in-network reduction that offloads AllReduce collective operations to the switch fabric — commonly cited as ~2x effective bandwidth uplift on reduction-heavy workloads per NVIDIA Quantum InfiniBand documentation. Reference fabric in DGX SuperPOD H200 and B200 architectures.
When Each Platform Is Worth Evaluating First
These are routing heuristics, not recommendations. A production AI fabric decision requires a GPU-count / model-size / collective-pattern review, a power and cooling audit, and a written scope. WiFi Hotshots engineers fabrics across all four platforms; the routing reflects what documented specifications favor for common scenarios, not a vendor preference.
- Greenfield DGX / HGX cluster where InfiniBand is the sanctioned back-end: Quantum-X800 (Q3400 or Q3200) with ConnectX-8 plus SHARP v3 is the reference-architecture answer. Training-heavy AllReduce workloads see the largest benefit from in-network reduction.
- Mandated Ethernet back-end with tight NVIDIA GPU coupling: NVIDIA Spectrum-X (SN5600 + BlueField-3 SuperNIC) delivers the tightest switch-plus-NIC integration for Ethernet. The 1.6x AI-networking-performance positioning in NVIDIA documentation is conditional on pairing Spectrum-4 and the SuperNIC end-to-end.
- Open multi-vendor Ethernet strategy with UEC direction: Arista Etherlink (7060X6 leaf, 7800R4 AI spine) or Cisco Nexus with Silicon One G200 are the two leading UEC-aligned answers. Arista has the deeper spine-scale story (460 Tbps 7800R4, 30K+ 400GbE 7700R4 distributed); Cisco leads on consolidated NX-OS operations with Nexus Dashboard.
- Very large scale-out (tens of thousands of accelerators): Arista 7800R4 AI spine (single chassis 576x 800GbE) or 7700R4 Distributed Etherlink (30K+ 400GbE) on the Ethernet side; Quantum-X800 Q3400 two-level fat-tree (10,368 NICs) on the InfiniBand side. Spectrum-X multiplane two-tier scales to 128K GPUs per NVIDIA positioning.
- Storage and front-end Ethernet fabric integration: All three Ethernet platforms (Spectrum-X, Arista Etherlink, Cisco Nexus / G200) integrate cleanly with 400G / 800G storage and front-end leaf tiers; see the 400G data-center leaf comparison for the front-end tier.
- Federal / FedRAMP-adjacent workloads: All four vendors maintain active federal compliance programs. Verify the specific certificate number, firmware train, and accreditation boundary with each vendor (NVIDIA compliance, Arista product certifications, Cisco Trust Portal) before downselecting — federal posture is not a differentiator at this fabric tier.
Frequently Asked Questions
What is the difference between Spectrum-X Ethernet and Quantum-X800 InfiniBand for AI training?
Spectrum-X is NVIDIA’s Ethernet-for-AI platform (SN5600 Spectrum-4 switch + BlueField-3 or ConnectX-8 SuperNIC) using RoCEv2 with switch-level adaptive routing and SuperNIC Direct Data Placement re-ordering. Quantum-X800 is NVIDIA’s InfiniBand platform (Q3400 / Q3200 + ConnectX-8) with credit-based lossless transport, native adaptive routing since Quantum-2, and SHARP v3 in-network reduction that offloads AllReduce collectives to the switch fabric. InfiniBand typically delivers higher collective-communication efficiency at training scale; Ethernet offers multi-vendor interoperability and integration with the broader data-center fabric.
When does UEC 1.0 matter, and which platforms are UEC-ready?
UEC 1.0 was released June 11, 2025 by the Ultra Ethernet Consortium as an open Ethernet transport (UET) stack with packet-level multi-pathing and modern congestion control, purpose-built for AI and HPC. It matters when buyers want a multi-vendor Ethernet back-end that replaces RoCEv2 head-of-line blocking.
Cisco is a UEC steering member; Arista positions Etherlink as forwards compatible with UEC; AMD, Broadcom, Intel, Meta, Microsoft, and Oracle are also steering members.
NVIDIA Spectrum-X is a proprietary alternative to UEC. InfiniBand is outside UEC scope (IB is governed by the IBTA).
What is SHARP and why does it matter for NCCL AllReduce?
SHARP (Scalable Hierarchical Aggregation and Reduction Protocol) is NVIDIA’s in-network computing feature on Quantum InfiniBand switches. The switch fabric performs collective aggregation (sum, min, max) on AllReduce operations, reducing data traversal of the network. NVIDIA Quantum InfiniBand documentation describes SHARP as offloading collective communication operations to the switch network. The practical effect on reduction-heavy training workloads is commonly cited at ~2x effective bandwidth uplift, because every GPU’s AllReduce contribution is combined in-fabric rather than round-tripped through every rank.
Spectrum-X Ethernet, Arista Etherlink, and Cisco Nexus with G200 do not offer a direct switch-fabric equivalent today — NCCL optimization on Ethernet runs at the NIC and collective-library layer.
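As a rough illustration of where the ~2x figure comes from, the sketch below is a back-of-the-envelope model under stated assumptions — not an NVIDIA formula — comparing per-GPU bytes sent for a classic ring AllReduce against an idealized switch-offloaded reduction. The gradient-buffer size and GPU count are hypothetical.

```python
# Hedged back-of-the-envelope sketch (not an NVIDIA formula): per-GPU bytes on the
# back-end fabric for one AllReduce, comparing ring AllReduce with an idealized
# in-network (SHARP-style) reduction.

def ring_allreduce_bytes_per_gpu(buffer_bytes: float, n_gpus: int) -> float:
    # Ring AllReduce sends 2 * (N - 1) / N of the buffer per GPU
    # (reduce-scatter phase plus all-gather phase).
    return 2 * (n_gpus - 1) / n_gpus * buffer_bytes

def in_network_reduce_bytes_per_gpu(buffer_bytes: float) -> float:
    # Idealized switch-offloaded reduction: each GPU sends its contribution up
    # once and receives the reduced result once, roughly 1x the buffer each way.
    return buffer_bytes

if __name__ == "__main__":
    grad_bytes = 10e9          # hypothetical 10 GB gradient buffer per step
    n = 1024                   # hypothetical GPU count
    ring = ring_allreduce_bytes_per_gpu(grad_bytes, n)
    sharp = in_network_reduce_bytes_per_gpu(grad_bytes)
    print(f"ring AllReduce:    {ring / 1e9:.1f} GB sent per GPU")
    print(f"in-network reduce: {sharp / 1e9:.1f} GB sent per GPU")
    print(f"approx uplift:     {ring / sharp:.2f}x")  # trends toward ~2x as N grows
```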
What is rail-optimized topology and do all four platforms support it?
Rail-optimized topology assigns each GPU index (0 through 7 on a typical 8-GPU server) its own dedicated leaf-to-spine “rail” so NCCL AllReduce traffic between same-index GPUs stays on a single rail and avoids cross-rail hops. It is the canonical back-end topology for large training clusters. NVIDIA DGX SuperPOD reference architectures (H200, B200, B300) document rail-optimized full fat-tree designs on both Quantum-X800 InfiniBand and Spectrum-X Ethernet.
Arista 7800R4 AI spine and 7060X6 leaf are designed for rail-optimized clusters per Arista AI networking positioning.
Cisco Nexus 9000 with G200 supports the topology pattern in Cisco AI PODs validated designs.
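A minimal sketch of the addressing idea, assuming 8 GPUs per node and purely hypothetical leaf and NIC names: every GPU index maps to its own rail, so same-index GPUs on different nodes share a leaf group and their NCCL traffic never crosses rails below the spine.

```python
# Illustrative rail-mapping sketch (not a vendor tool); names are hypothetical.

def rail_for(node_id: int, gpu_index: int, gpus_per_node: int = 8) -> dict:
    assert 0 <= gpu_index < gpus_per_node
    return {
        "rail": gpu_index,                          # one rail per GPU index
        "leaf": f"leaf-rail{gpu_index}",            # hypothetical leaf naming
        "nic": f"node{node_id}-mlx5_{gpu_index}",   # hypothetical NIC naming
    }

# GPU 3 on node 0 and GPU 3 on node 41 land on the same rail and leaf group.
print(rail_for(node_id=0, gpu_index=3))
print(rail_for(node_id=41, gpu_index=3))
```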
Do AI fabrics require 1:1 oversubscription (non-blocking fat-tree)?
For training workloads, yes — collective-communication bandwidth is gated by the slowest link in a ring or tree, so any oversubscription between leaf and spine reduces AllReduce throughput for every participating GPU. NVIDIA DGX SuperPOD reference architectures, Arista AI reference designs, and Cisco AI PODs validated designs all specify 1:1 oversubscription on the back-end compute fabric. The storage and front-end fabric tiers can tolerate moderate oversubscription because their traffic patterns are not collective-dominated.
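A quick way to sanity-check a proposed leaf design is the oversubscription ratio of downlink to uplink capacity. The helper below is an illustrative sketch with hypothetical port counts, not a vendor sizing tool.

```python
# Oversubscription ratio = total downlink bandwidth / total uplink bandwidth.
# Training fabrics target 1.0 (non-blocking); > 1.0 means the spine tier can
# gate AllReduce throughput.

def oversubscription(downlinks: int, downlink_gbps: int,
                     uplinks: int, uplink_gbps: int) -> float:
    return (downlinks * downlink_gbps) / (uplinks * uplink_gbps)

# Hypothetical 64 x 800G leaf: 32 ports broken out to 64 x 400G GPU links,
# 32 x 800G uplinks -> 25.6 Tbps down / 25.6 Tbps up.
print(oversubscription(64, 400, 32, 800))   # 1.0 — non-blocking

# Same leaf with only 16 uplinks populated -> 2:1, acceptable for storage or
# inference tiers, not for a training back-end.
print(oversubscription(64, 400, 16, 800))   # 2.0
```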
What are PFC and DCQCN, and why do they matter for RoCEv2 fabrics?
PFC (Priority-based Flow Control, IEEE 802.1Qbb) pauses specific traffic classes to avoid drops in the lossless Ethernet that RoCEv2 requires. DCQCN (Data Center Quantized Congestion Notification) is the reference congestion-control algorithm for RoCEv2, combining ECN marking with a rate-adjustment loop. Careless PFC configuration causes head-of-line blocking and pause storms; even well-tuned PFC plus ECN / DCQCN on generic Ethernet yields only about 60% effective bandwidth per NVIDIA whitepapers.
Spectrum-X adds switch-level telemetry plus SuperNIC DDP to reach ~95% effective bandwidth; Arista and Cisco add platform-specific load-balancing and telemetry extensions; UEC 1.0 replaces the DCQCN model entirely with UET transport primitives.
What is packet spraying and why is Direct Data Placement needed?
Packet spraying distributes flow packets across multiple equal-cost paths on a per-packet basis to avoid ECMP hash collisions and sustain fabric utilization under AllReduce traffic. It has the side effect that packets in the same RDMA message can arrive out of order at the receiving NIC. NVIDIA Direct Data Placement (DDP) on the BlueField-3 or ConnectX-8 SuperNIC re-orders those packets into the correct sequence in GPU memory before the application sees the data, making adaptive routing transparent per NVIDIA developer documentation.
UEC 1.0 codifies packet-level multi-pathing with equivalent re-ordering semantics as an open standard.
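The toy simulation below — illustrative only, not the DDP implementation — shows why re-ordering is needed: sprayed packets of one message ride paths with different queueing delay and arrive out of order, and the receiver places them back into the buffer by sequence number, which is the role DDP plays in hardware.

```python
import random

# One RDMA message split into 8 packets, sprayed across 4 equal-cost paths.
packets = [{"seq": i, "payload": f"chunk-{i}"} for i in range(8)]
paths = 4

# Each path has a different (random) queueing delay; packet i rides path i % paths.
delay = {p: random.uniform(0.0, 5.0) for p in range(paths)}
arrivals = sorted(packets, key=lambda pkt: delay[pkt["seq"] % paths])
print("arrival order:", [pkt["seq"] for pkt in arrivals])   # likely out of order

# Receiver-side placement by sequence number restores application-visible order.
buffer = [None] * len(packets)
for pkt in arrivals:
    buffer[pkt["seq"]] = pkt["payload"]
print("placed order: ", buffer)                              # always in order
```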
What GPU counts do these fabrics realistically scale to?
Quantum-X800 Q3400 supports up to 10,368 ConnectX-8 NICs in a two-level fat-tree per NVIDIA documentation (larger scales via three-tier designs). NVIDIA DGX SuperPOD reference architectures document H200 and B200 SuperPODs scaling beyond 2,000 DGX nodes (16,000+ GPUs). Spectrum-X in a two-tier multiplane configuration is positioned at 128K GPUs in NVIDIA's Spectrum-X documentation.
Arista 7800R4 AI spine supports 576x 800GbE in one chassis; the 7700R4 Distributed Etherlink scales to 30,000+ 400GbE accelerators in one domain.
Cisco Nexus 9000 with G200 targets hyperscaler leaf-and-spine AI/ML density but a specific published max-GPU figure was not extracted in primary sources reviewed.
Is InfiniBand HDR or NDR or XDR still the right call for large training clusters in 2026?
InfiniBand NDR (400G per port, 800G via 2-port bonding) is the current volume deployment for hyperscale AI training clusters. NVIDIA Quantum-2 NDR switches ship in production; XDR (800G per port, 1.6T via bonding) is the follow-on rollout on Quantum-X800 Q3400 / Q3200 platforms per NVIDIA documentation.
HDR (200G per port) is the mature / prior-gen InfiniBand tier — still in volume at many research supercomputers but not the default for new Blackwell-era GPU builds. For a new 2026+ cluster, NDR is baseline; XDR (Quantum-X800) is the growth tier for B200 / B300 NVLink-domain training at scale.
How do you decide between Spectrum-X 800G Ethernet, Ultra Ethernet, and Arista Etherlink for an Ethernet AI fabric?
Spectrum-X is NVIDIA’s vertically-integrated Ethernet-for-AI (SN5600 switch + BlueField-3 / ConnectX-8 SuperNIC + NCCL-tuned driver stack). It delivers ~95 percent effective bandwidth on RoCEv2 per NVIDIA whitepapers by combining switch-level adaptive routing with SuperNIC Direct Data Placement (DDP) packet reordering.
Ultra Ethernet (UEC 1.0, released June 11, 2025) is the open-standards alternative with multi-vendor governance — Cisco, AMD, Broadcom, Intel, Meta, Microsoft, Oracle as steering members. Arista Etherlink is Arista’s implementation approach, positioned as forwards-compatible with UEC. For multi-vendor interop and procurement-neutrality: UEC or Etherlink. For vertical-integration with NVIDIA GPUs: Spectrum-X.
What is NCCL AllReduce, and how does tuning the collective affect training throughput?
NCCL (NVIDIA Collective Communications Library) implements GPU-to-GPU collective operations — AllReduce, AllGather, Broadcast, ReduceScatter — on NVLink and InfiniBand / RoCEv2 fabrics. AllReduce is the dominant collective in synchronous gradient training (each training step performs one or more AllReduce across all GPUs).
Tuning knobs: NCCL_IB_HCA (select specific HCA ports), NCCL_NET_GDR_LEVEL (GPU-Direct RDMA level), NCCL_ALGO (Ring / Tree / NVLS), NCCL_PROTO (Simple / LL / LL128). Ring vs Tree selection typically auto-negotiates based on GPU count and fabric topology. Bottlenecks in AllReduce directly reduce effective GPU utilization — a 10 percent AllReduce slowdown becomes a 10 percent training-time increase.
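In practice these knobs are exported as environment variables before the launcher starts. The sketch below is a hedged example with placeholder values and a hypothetical torchrun command line; adapt the HCA list, values, and launcher to the actual fabric and training stack.

```python
import os
import subprocess

# Placeholder NCCL settings — examples to adapt, not recommendations.
nccl_env = {
    "NCCL_IB_HCA": "mlx5_0,mlx5_1,mlx5_2,mlx5_3",  # hypothetical HCA list, one per rail
    "NCCL_NET_GDR_LEVEL": "SYS",                    # GPUDirect RDMA scope
    "NCCL_ALGO": "Ring",                            # or Tree / NVLS; usually left to auto-select
    "NCCL_PROTO": "Simple",                         # or LL / LL128
    "NCCL_DEBUG": "INFO",                           # log selected rings, algorithms, and NICs
}

env = {**os.environ, **nccl_env}
# Hypothetical launch command; substitute the site's actual launcher and script.
subprocess.run(["torchrun", "--nproc_per_node=8", "train.py"], env=env, check=False)
```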
What does SHARP v3 do that SHARPv2 did not, and why does it matter for B200-era clusters?
SHARP v3 is the Scalable Hierarchical Aggregation and Reduction Protocol generation shipping on NVIDIA Quantum-X800 Q3400 InfiniBand switches. It extends SHARPv2 with support for larger aggregation trees, improved multi-tenant isolation, and enhanced data-type support for lower-precision training (FP8, FP16).
SHARP v3 offloads AllReduce reduction (sum / min / max) into the switch fabric itself — each switch combines incoming gradient contributions from its downstream ports before forwarding upward. For a large-cluster AllReduce, this halves the data traversal of the network compared to endpoint-only reduction. NVIDIA documentation cites roughly 2x effective bandwidth uplift for reduction-heavy workloads.
What is rail-optimized topology, and how does it differ from dragonfly-plus at AI cluster scale?
Rail-optimized topology assigns each GPU index (0 through 7 in a typical 8-GPU server) its own dedicated leaf-to-spine “rail.” AllReduce traffic between same-index GPUs across the cluster stays on a single rail, avoiding cross-rail hops. It is the canonical back-end topology for rail-optimized full fat-tree AI training clusters.
Dragonfly-plus is a hierarchical topology used at HPC-supercomputer scale (multi-thousand-node clusters) to reduce cable count and hop count vs a flat fat-tree. It is less common in enterprise AI training — rail-optimized fat-tree dominates the 256-to-4,096-GPU scale. Dragonfly-plus appears more often at national-lab scale (Frontier, Aurora, El Capitan).
What is the realistic GPU-per-leaf ratio for a rail-optimized 800G AI leaf?
NVIDIA DGX SuperPOD reference architectures for B200 / B300 document 8-GPU DGX nodes with 4 or 8 NICs per node depending on the back-end topology. For rail-optimized 400G per GPU (one 400G NIC per GPU), a 32-port 800G leaf accommodates 32 GPU links with 1:1 oversubscription to spine.
Typical AI cluster design: 1 leaf per 32 GPUs (4 DGX nodes with 8 GPUs each and 1 NIC per GPU at 400G), or 1 leaf per 16 GPUs if using 800G per GPU. Exact GPU-per-leaf varies by GPU generation, NIC count, and oversubscription target; Arista 7060X6 / NVIDIA SN5600 / Cisco Nexus 9332D-GX2A are the common leaf platforms at this tier.
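A simple port-budget calculation, under the assumption that half of a leaf's capacity faces GPUs and half faces the spine (the 1:1 rule above), reproduces those ratios; the port counts are examples, not a reference design.

```python
def gpus_per_leaf(leaf_ports_800g: int, gbps_per_gpu: int) -> int:
    leaf_capacity_gbps = leaf_ports_800g * 800
    # Non-blocking: half the capacity faces GPUs, half faces the spine.
    downlink_gbps = leaf_capacity_gbps // 2
    return downlink_gbps // gbps_per_gpu

print(gpus_per_leaf(64, 400))   # 64 GPUs at 400G each on a 64 x 800G leaf, 1:1
print(gpus_per_leaf(64, 800))   # 32 GPUs at 800G each
print(gpus_per_leaf(32, 400))   # 32 GPUs at 400G each on a 32-port leaf
```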
How does HPCC (High-Precision Congestion Control) differ from DCQCN, and where does Google Swift fit?
HPCC is a research-derived congestion control algorithm (developed by Alibaba with academic collaborators, published SIGCOMM 2019) that uses in-network telemetry (INT) to reduce convergence time compared to DCQCN’s ECN-driven rate adjustment. It is deployed at select hyperscalers but not the default on any commodity switch OS.
Google Swift is Google’s internal congestion control for its own data-center fabrics — not available outside Google. DCQCN (Data Center Quantized Congestion Notification) remains the deployed reference for RoCEv2 on NVIDIA Spectrum-X, Arista, Cisco, and Juniper. UEC 1.0 replaces the DCQCN model entirely with UET transport primitives that include telemetry-aware congestion control.
How do PFC and ETS tune together for lossless RoCEv2 at 400G / 800G?
PFC (IEEE 802.1Qbb) provides per-priority pause, typically assigning PFC priority 3 to RoCEv2 traffic. ETS (Enhanced Transmission Selection per IEEE 802.1Qaz) provides bandwidth allocation across priorities — e.g., 70 percent to RoCEv2 priority 3, 30 percent best-effort to other priorities.
Tuning: PFC high-watermark (XOFF) at ~80 percent of the ingress buffer, low-watermark (XON) at ~50 percent, headroom sized to 2x the cable-delay-bandwidth product per priority queue. Thresholds set too low trigger excessive pauses and pause storms; set too high, the buffer overruns and drops before PFC fires. NVIDIA Spectrum-X ships with auto-tuning; other platforms require explicit per-ASIC configuration validated against buffer size.
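A back-of-the-envelope headroom calculation under that 2x rule of thumb (an approximation, not a vendor formula); the 800G rate and 1 µs one-way delay are example inputs.

```python
def pfc_headroom_bytes(link_gbps: float, one_way_delay_us: float) -> float:
    # Headroom per lossless priority queue: roughly a round trip's worth of
    # in-flight bytes that can still land after a PFC pause is signalled.
    bytes_per_us = link_gbps * 1e9 / 8 / 1e6       # bytes on the wire per microsecond
    return 2 * bytes_per_us * one_way_delay_us

# Hypothetical 800G link with ~1 us one-way cable + PHY/forwarding delay:
print(f"{pfc_headroom_bytes(800, 1.0) / 1024:.0f} KiB headroom per priority queue")
```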
What does BlueField-3 DPU offload provide that a standard 400G NIC cannot?
BlueField-3 is a DPU-class SmartNIC with Arm-based CPU cores, a programmable datapath, and up to 400 Gb/s Ethernet or InfiniBand connectivity. Standard 400G NICs (e.g., ConnectX-7 or vendor-generic NICs) provide RDMA and basic offloads but lack the programmable data plane.
BlueField-3 specific offloads: storage acceleration (NVMe-oF, iSCSI, compression, encryption), network service offload (virtual switch, stateful firewall, load balancer), AI workload acceleration (Direct Data Placement for Spectrum-X). The SmartNIC capabilities matter most in cloud-tenant isolation scenarios or when the host CPU would otherwise be a bottleneck on network services.
What is the reach and power budget for 800G-FR4 vs 800G-DR4 optics in AI fabrics?
800G-FR4 (IEEE 802.3df) reaches 2 km on duplex singlemode — the typical cluster-to-cluster or cross-row interconnect. 800G-DR4 reaches 500 m on four-pair singlemode, carrying 200G-PAM4 on each fiber pair — the default intra-row or intra-rack optic.
Typical power budget: 800G-FR4 at ~15-17 W per module in OSFP; 800G-DR4 at ~14-16 W per module. OSFP form factor with integrated heat-sink is increasingly the AI-cluster choice because of the thermal headroom at sustained 800G traffic. QSFP-DD800 exists but thermally constrained at full load over long runs.
What are Linear-Drive Optics (LPO) and XPO, and when do they fit in AI fabrics?
Linear-Drive Pluggable Optics (LPO) remove the retimer DSP from pluggable optics — the host ASIC drives the laser directly, saving ~30-40 percent optics power and roughly $1,500-$2,000 per module at 800G per analyst estimates. LPO is an emerging category (not a ratified IEEE standard as of 2026).
XPO (eXtended-Reach Pluggable Optics) is a related category positioned for longer reach than LPO. Both are early-stage adoption: NVIDIA demonstrated LPO with Spectrum-X at NVIDIA GTC 2024; Arista and Cisco have announced LPO roadmap plans. For a greenfield AI cluster today, DSP-based 800G optics (standard QSFP-DD800, OSFP) are the volume-ship baseline; LPO is for 2027+ refresh cycles.
How do silicon photonics and co-packaged optics (CPO) change the 1.6T roadmap?
Co-Packaged Optics (CPO) integrates the optical transceiver directly onto the switch ASIC package, bypassing the pluggable form factor. At 1.6T and beyond, pluggable optics face power / thermal ceilings that CPO addresses — NVIDIA, Broadcom, and Cisco have announced CPO roadmap plans.
Silicon photonics is the underlying technology enabling CPO at scale — fabricating optical waveguides, modulators, and detectors on silicon alongside the switch ASIC. 1.6T per port shipping in pluggable form has been challenging; CPO is the expected bridge to 3.2T per port fabrics in late-2020s hyperscale deployments. Enterprise and mid-scale AI customers will continue with pluggable 800G / 1.6T for several years.
What is the realistic UEC 1.0 adoption timeline for non-hyperscale enterprise AI buyers?
UEC 1.0 specification was released June 11, 2025. Silicon supporting UEC is rolling out in 2025-2026 generation ASICs (Broadcom Tomahawk 5, upcoming Arista / Cisco / Juniper platforms). First shipping UEC-compliant switches at production volume are expected through 2026-2027.
Realistic enterprise adoption: early adopters in 2026-2027 on vendor-announced UEC-compliant platforms; broad mainstream adoption 2027-2028 as SuperNIC and UET-stack maturity catches up to switch silicon. For buyers today, Spectrum-X is the vertical-integration path (shipping now); UEC is the multi-vendor path (shipping 2026-2027); InfiniBand is the incumbent for the highest-performance training tier.
What topology patterns do public hyperscaler AI training clusters typically use?
Public research documents and conference talks from Meta, Microsoft, and Google describe rail-optimized full fat-tree at 16,000-100,000+ GPU scale for LLM training. Meta’s Llama 3 training on two 24,000-GPU clusters (one on H100 / NVIDIA Quantum-2 InfiniBand, one on H100 / Arista Ethernet) is documented in Meta’s engineering blog.
xAI Colossus cluster (announced 2024) is publicly documented at 100,000 H100 GPUs on Ethernet (NVIDIA Spectrum-X) with plans to expand. Microsoft Azure ND H100 v5 SKUs document 8-GPU nodes with NVIDIA Quantum-2 back-end fabric. Training cluster topology patterns across hyperscalers are converging on rail-optimized 1:1 oversubscribed fat-tree with per-GPU back-end NIC at 400G or 800G.
What is NVLink 5 bandwidth per GPU on Blackwell, and how does it compare to the back-end fabric per-GPU?
NVLink 5 on NVIDIA Blackwell B200 delivers 1.8 TB/s of bidirectional bandwidth per GPU per the NVIDIA B200 data sheet — roughly 14x a PCIe Gen 5 x16 slot (~128 GB/s bidirectional) and roughly 18x the typical back-end fabric per-GPU allocation (400G ≈ 100 GB/s bidirectional).
This bandwidth asymmetry is intentional: NVLink connects GPUs within a single NVLink domain (typically 72 GPUs in GB200 NVL72) at ultra-high bandwidth, while the back-end fabric interconnects NVLink domains at per-GPU bandwidth more than an order of magnitude lower. AllReduce patterns that fit within an NVLink domain avoid the back-end fabric; cross-NVLink-domain AllReduce rides the back-end.
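The unit arithmetic behind those ratios, using only the figures quoted above:

```python
# Simple unit conversion — no new data, just arithmetic on the figures above.
nvlink5_bidir_gbytes = 1800                 # 1.8 TB/s bidirectional per B200 GPU
pcie_gen5_x16_bidir_gbytes = 128            # ~64 GB/s per direction
backend_400g_bidir_gbytes = 400 / 8 * 2     # 400 Gb/s per direction -> ~100 GB/s bidirectional

print(nvlink5_bidir_gbytes / pcie_gen5_x16_bidir_gbytes)   # ~14x PCIe Gen 5 x16
print(nvlink5_bidir_gbytes / backend_400g_bidir_gbytes)    # ~18x a 400G back-end NIC
```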
How do collective-communication patterns differ between training and inference workloads in an AI fabric?
Training is collective-dominated: AllReduce (gradient synchronization) on every training step drives 60-80 percent of inter-GPU traffic in synchronous data-parallel training. Model-parallel and pipeline-parallel patterns add AllGather and ReduceScatter.
Inference is typically point-to-point — request routing, KV-cache sharing, and pipeline stages communicate pair-wise rather than collectively. An inference cluster can tolerate higher oversubscription at the fabric tier (2:1 or 3:1 is acceptable) because traffic is not collective-gated. Training fabrics must stay 1:1 non-blocking. Customer scoping should separate training-fabric from inference-fabric sizing requirements.
What is the typical cooling / power density of a 400G / 800G AI leaf rack?
An 800G leaf at full ASIC load draws roughly 1,500-2,000 W (NVIDIA SN5600 data sheet: typical ~1,500 W, max ~1,800 W). Per-rack density for AI leaf-plus-top-of-rack gear lands at 20-30 kW in modern AI builds. GPU compute racks (DGX B200 at ~10.2 kW per DGX, 8 DGX per rack ≈ 82 kW) are the dominant thermal load.
Liquid-cooled GPU racks are becoming the norm at hyperscale AI scale — direct-to-chip liquid cooling on the GPU plus air-cooled NIC / switch at the top-of-rack. For enterprise AI clusters at 500-5,000 GPU scale, air cooling remains common but at the upper end of what traditional raised-floor data centers can deliver.
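A simple roll-up of the per-rack figures quoted above (example inputs, not a facility design) shows how thoroughly GPU compute dominates the thermal budget relative to the leaf switches:

```python
# Illustrative rack power roll-up; treat the inputs as examples.
dgx_kw = 10.2          # per-system figure quoted above
dgx_per_rack = 8
leaf_kw = 1.8          # 800G leaf at max load
leaves_per_rack = 2    # hypothetical pair of top-of-rack leaves

compute_kw = dgx_kw * dgx_per_rack
network_kw = leaf_kw * leaves_per_rack
print(f"compute: {compute_kw:.1f} kW, network: {network_kw:.1f} kW, "
      f"total: {compute_kw + network_kw:.1f} kW per rack")
```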
Primary Sources Cited on This Page
Citations are grouped by platform for direct verification. If any specification on this page does not match the current vendor document, the vendor document takes precedence — please report the discrepancy to the WiFi Hotshots engineering team.
NVIDIA Spectrum-X Ethernet (SN5600 + BlueField-3 / ConnectX-8)
- NVIDIA Spectrum-X product page
- NVIDIA Developer Blog — Turbocharging AI Workloads with Spectrum-X
- NVIDIA Developer Blog — Next-Generation AI Networking with NVIDIA SuperNICs
- DGX SuperPOD B300 Reference Architecture — Network Fabrics (Spectrum-X)
Arista Etherlink (7060X6 + 7800R4 AI)
- Arista AI Networking solutions page
- Arista 7060X6 Series Data Sheet (800G, Tomahawk 5)
- Arista 7800R4 AI Spine Data Sheet
- Arista Unveils Etherlink AI Networking Platforms (press release)
Cisco Nexus 9364E-SG2 (Silicon One G200)
- Cisco Nexus 9364E-SG2 Data Sheet
- Cisco Silicon One G200 Data Sheet
- Cisco Nexus 9000 AI Networking White Paper
- Cisco Blog — Nexus 9000 AI Data Center Switches
NVIDIA Quantum-X800 InfiniBand (Q3200 / Q3400 + ConnectX-8)
- NVIDIA Quantum-X800 InfiniBand Platform
- NVIDIA Quantum-X800 (XDR) Clusters Documentation
- DGX SuperPOD H200 Reference Architecture (InfiniBand, SHARP, rail-optimized)
Ultra Ethernet Consortium (UEC) 1.0
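- Ultra Ethernet Consortium — UEC Specification 1.0 release (June 11, 2025)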
Buying a Fabric, Not a Switch Model
A comparison table is a starting point. The right AI back-end fabric for a 512-GPU fine-tuning lab is not the right fabric for a 16,000-GPU training campus is not the right fabric for an Ethernet-mandated enterprise AI factory. Send GPU counts, model size, collective patterns, power envelope, and whether InfiniBand is on the table — WiFi Hotshots returns a fixed-fee fabric-architecture SOW that picks the platform based on fit.

