Managed Network Services FAQs

Question 1

Under a managed services contract, what actually happens at NOC Tier 1, Tier 2, and Tier 3 — and how do you prevent every ticket from landing on Tier 3?

Accepted Answer

Tier 1 handles alert triage and pre-approved standard changes against documented runbooks; a Tier 1 engineer who cannot resolve within 20 minutes escalates to Tier 2 with a complete diagnostic package. Tier 2 holds CCNP, JNCIP, ACP, or ACE-P level certifications and handles root-cause analysis, vendor TAC case authoring, and normal-category change design. Tier 3 is the engineering tier (CCIE, JNCIE, and CWNE holders), handling major-change design, protocol-level defect resolution, and RCA authorship.

The matching happens at intake: ticket severity and symptom pattern route to the correct tier via the runbook library, so a known-issue ticket lands on Tier 1 even if the symptom looks dramatic.

Preventing Tier 3 flooding is a function of runbook depth.

If a problem recurs at Tier 2, the RCA output becomes a runbook entry that lets Tier 1 handle the next occurrence.
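The intake routing and the runbook feedback loop described above can be sketched roughly as follows. This is a minimal illustration, not WFHS tooling: the `Ticket` fields, the symptom keys, the severity scale, and the fallback rule for tickets with no runbook are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Ticket:
    symptom: str   # hypothetical normalised symptom key
    severity: int  # illustrative scale: 1 = most severe


def route_ticket(ticket: Ticket, runbooks: dict[str, int]) -> int:
    """Return the tier a ticket lands on at intake."""
    # A documented runbook wins over severity: a known issue goes to the
    # tier the runbook names, even if the symptom looks dramatic.
    if ticket.symptom in runbooks:
        return runbooks[ticket.symptom]
    # Assumed fallback for unknown symptoms: only the most severe skip
    # straight to Tier 3, everything else starts at Tier 2.
    return 3 if ticket.severity == 1 else 2


def promote_to_runbook(symptom: str, runbooks: dict[str, int], tier: int = 1) -> None:
    """Record an RCA outcome so the next occurrence is handled at `tier`."""
    runbooks[symptom] = tier


runbooks = {"bgp-flap-known-optic": 1}
print(route_ticket(Ticket("bgp-flap-known-optic", 1), runbooks))  # known issue
print(route_ticket(Ticket("ospf-stuck-exstart", 1), runbooks))    # unknown, severe
promote_to_runbook("ospf-stuck-exstart", runbooks)
print(route_ticket(Ticket("ospf-stuck-exstart", 1), runbooks))    # now runbook-driven
```

The point of the sketch is the last three lines: once the RCA output is written back into the runbook index, the same symptom stops reaching the upper tiers.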

Question 2

Under a managed services contract, what does Git-versioned network configuration actually give me that a nightly SNMP-triggered backup script does not?

Accepted Answer

A nightly SNMP or SSH-triggered backup script produces a file. Git produces a change history. The difference matters when an incident begins and the first question is "what changed in the last 48 hours." A Git log shows every committed configuration delta across every device in the repo, with author, timestamp, commit message, and the diff itself.

With a script-to-file workflow, you have last night's config and tonight's config, and you must diff them manually across potentially hundreds of devices to find the relevant change.

Git-versioned configuration also enables pre-deploy validation: Batfish can read the proposed configuration (a Merge Request that has not yet merged) and analyze it for routing loops, policy contradictions, or ACL shadowing before the change reaches production.

A nightly backup cannot do that — by the time the backup sees the change, the change is already running.
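As a rough illustration of the "what changed in the last 48 hours" question, the sketch below builds a `git log` command that lists recent commits with their diffs. The repository layout (device configs under a `configs/` directory) and the function name are assumptions for the example, not a description of any particular client repo.

```python
import subprocess


def git_log_since(hours: int, path: str = ".") -> list[str]:
    """Build a git command listing commits and diffs from the last `hours` hours."""
    return [
        "git", "log",
        f"--since={hours} hours ago",   # time window for the incident review
        "--patch",                      # include the configuration diff itself
        "--date=iso",
        "--format=%h %an %ad %s",       # short hash, author, date, subject
        "--", path,                     # limit to the config directory
    ]


def recent_changes(repo_dir: str, hours: int = 48, path: str = "configs/") -> str:
    """Run the command inside a checked-out repo and return its output."""
    result = subprocess.run(
        git_log_since(hours, path),
        cwd=repo_dir, capture_output=True, text=True, check=True,
    )
    return result.stdout
```

Calling `recent_changes("/path/to/repo")` returns every committed delta in the window across all devices under the assumed `configs/` directory, which is the query a file-per-night backup cannot answer without manual diffing.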

Question 3

What is the difference between an SLI, SLO, and SLA, and why does WFHS set internal SLOs tighter than contractual SLAs?

Accepted Answer

The SLI (Service Level Indicator) is the measurement — a specific, observable metric with a named measurement source and window, such as "percent of one-minute intervals in which end-to-end WAN loss was below 0.1% as measured by ThousandEyes probes." The SLO (Service Level Objective) is the internal target, set tighter than the contractual commit so the operational team has a buffer before SLA violation.

The SLA (Service Level Agreement) is the contractual commit with a remedy clause attached.

Setting the SLO tighter than the SLA is deliberate: when a metric misses its SLO, there is still operational runway before the SLA is at risk, which leaves time to investigate, remediate, and re-measure without the contractual clock running.

Error budgets are the gap between 100% and the SLO; exhausting the error budget triggers deliberate slowdown on change velocity to let reliability recover.

That discipline is the practical payoff of the SRE framework in managed operations.
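To make the three terms concrete, here is a small sketch that computes the example SLI (percent of one-minute intervals with loss below 0.1%) and compares it with an internal SLO and a contractual SLA. The 99.9% and 99.0% targets and the sample data are illustrative assumptions, not WFHS contract values.

```python
def sli_percent(loss_by_minute: list[float], threshold: float = 0.1) -> float:
    """Percent of one-minute intervals whose loss (in %) was below the threshold."""
    if not loss_by_minute:
        raise ValueError("no measurement intervals supplied")
    good = sum(1 for loss in loss_by_minute if loss < threshold)
    return 100.0 * good / len(loss_by_minute)


def service_status(sli: float, slo: float, sla: float) -> str:
    """Classify a measurement against the internal SLO and contractual SLA."""
    if sli >= slo:
        return "healthy"
    if sli >= sla:
        return "slo-missed"    # inside the buffer: investigate before the SLA is at risk
    return "sla-breached"      # contractual remedy clause applies


# 1,000 one-minute samples, 5 of them with loss above the threshold.
samples = [0.0] * 995 + [0.5] * 5
sli = sli_percent(samples)
print(sli, service_status(sli, slo=99.9, sla=99.0))
```

With these assumed numbers the measurement falls between the two targets, which is exactly the buffer the tighter SLO is meant to create.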

Question 4

How is streaming telemetry via gNMI and gRPC different from SNMPv3 polling, and when should I choose each?

Accepted Answer

SNMPv3 is a pull-model polling protocol over UDP. The collector asks the device for a specific OID at a polling interval — typically 60 seconds — and the device responds. At scale, SNMP polling creates request bursts against the device's management plane and captures data at coarse time resolution. gNMI (gRPC Network Management Interface) is a push-model streaming protocol over gRPC and HTTP/2.

The collector opens a long-lived subscription to specific OpenConfig YANG paths and the device pushes data on the subscription schedule — sub-second updates are practical — or on change.

Choose gNMI where you need high time resolution, where cross-vendor OpenConfig YANG models reduce schema fragmentation, or where management-plane load at scale is a real constraint.

Choose SNMPv3 where the device platform does not yet support gNMI, where the existing collector investment is significant, or where the data required is not yet modeled in OpenConfig.

Most 2026 managed services engagements run a hybrid: gNMI for platforms that support it, SNMPv3 for legacy devices, with both collected into the same Prometheus-and-Grafana stack.
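The selection rule above can be written down as a simple per-device decision. This is a sketch only: the `Device` fields are hypothetical inventory attributes, and a real decision would also weigh existing collector investment, which is not modelled here.

```python
from dataclasses import dataclass


@dataclass
class Device:
    name: str
    supports_gnmi: bool          # platform and software version expose gNMI
    paths_in_openconfig: bool    # required data is modelled in OpenConfig


def choose_collector(device: Device) -> str:
    """Pick the telemetry method for one device in a hybrid deployment."""
    if device.supports_gnmi and device.paths_in_openconfig:
        return "gnmi"
    # Legacy platform, or data not yet modelled in OpenConfig.
    return "snmpv3"


fleet = [
    Device("leaf-01", supports_gnmi=True, paths_in_openconfig=True),
    Device("legacy-core", supports_gnmi=False, paths_in_openconfig=False),
]
plan = {d.name: choose_collector(d) for d in fleet}
print(plan)
```

Both collection paths would then feed the same metrics stack, as described above.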

Question 5

How does WFHS handle vendor TAC cases — do you open them, or do we have to?

Accepted Answer

WFHS opens and manages TAC cases on the client's behalf under the client's support entitlement. We author the case at Tier 2 or Tier 3 with a complete diagnostic package — show-tech-support output, configuration excerpt, syslog deltas, streaming telemetry screenshots, and a clear problem statement.

Case authoring quality consistently reduces case duration by 40 to 70 percent versus a novice-authored case because the TAC engineer does not have to chase basic diagnostic context.

WFHS manages the escalation ladder with the vendor through severity-matched queues and, for critical accounts, named engineering contacts where one has been established.

Case closure produces a written RCA where one is warranted, and the RCA is added to the client's runbook library so the next occurrence of the same root cause is handled at Tier 1.

For platforms under vendor SmartNet or support subscription, we operate the entitlement; for platforms with no active support, we flag that as a scoping issue at contract time — we cannot open a case on a platform the client does not have current support coverage for.

Question 6

What does a realistic NetDevOps transformation look like — how fast can a Rung 1 team get to Rung 4?

Accepted Answer

Rung 1 to Rung 4 is a 12- to 18-month progression for most mid-market enterprise teams, and compressing it shorter is how these transformations fail. Quarter one moves daily configuration backups into Git and introduces Ansible for read-only operations — low-risk wins that build team fluency with the tooling.

Quarter two deploys NetBox or Nautobot as source of truth and begins Jinja2 template authoring for a single domain (campus access, or DC TOR, but not both simultaneously).

Quarter three introduces config-drift detection and alerts, plus Ansible Tower or AAP 2.4 for centralized playbook execution with CyberArk-vaulted credentials.

Quarter four expands templating coverage to additional domains and introduces a first Merge Request workflow for a low-risk change category (VLAN additions, standard port activation).

Year two introduces Batfish pre-deploy validation, moves normal-category changes into Git-gated workflow, and begins Containerlab pilot work.

Teams that try to hit Rung 4 in six months typically end up with tooling installed but not operationally adopted — the people side of the transformation lags the technology.
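The config-drift detection introduced in quarter three can be illustrated with a minimal comparison between the intended configuration (rendered from the source of truth) and the running configuration pulled from the device. The sketch uses only the Python standard library; the function name and the sample configs are assumptions, and production drift tooling would normally also ignore volatile lines such as timestamps.

```python
import difflib


def config_drift(intended: str, running: str) -> list[str]:
    """Return added (+) and removed (-) lines in running relative to intended."""
    lines = list(difflib.unified_diff(
        intended.splitlines(), running.splitlines(),
        fromfile="intended", tofile="running", lineterm="",
    ))
    # The first two lines are the ---/+++ file headers; keep only changed
    # lines from the hunks that follow.
    return [line for line in lines[2:] if line[:1] in ("+", "-")]


intended = "hostname sw1\nvlan 10\n"
running = "hostname sw1\nvlan 10\nvlan 20\n"
drift = config_drift(intended, running)
print(drift)  # a non-empty list would raise a drift alert
```

An empty list means the device matches its template; anything else is a candidate for an alert or a corrective Merge Request.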

Question 7

What happens at offboarding — how do we get our automation, runbooks, and configuration data back if we leave?

Accepted Answer

Every WFHS managed services SOW names the exit terms explicitly. The client owns the Git repository of runbooks, Ansible roles, and NetBox or Nautobot data, typically hosted in the client's GitLab, GitHub Enterprise, or Bitbucket instance from day one, not in a WFHS-owned SaaS tenant. Credentials vaulted in the client's CyberArk or HashiCorp Vault deployment remain with the client and are rotated on offboarding as a standard hygiene step.

Observability dashboards (Grafana panels, Prometheus rules, Alertmanager routing) are exported as JSON and YAML into the client repo.

Documentation — topology diagrams, RCAs, tabletop outputs — is versioned in the same documentation repository the client can reference independently.

The structural answer is that WFHS never holds client data hostage: our engagement value is engineering, not lock-in.

A client that offboards WFHS has the full operational stack and the documentation to hand it to the next operator.

Question 8

Can WFHS integrate with our existing ITSM platform — ServiceNow, Jira Service Management, or Cherwell — or do we have to use yours?

Accepted Answer

WFHS integrates with the client's existing ITSM platform as the authoritative ticket system. ServiceNow is the most common — integration happens via ServiceNow's REST API with scoped application credentials, routing alerts from the observability stack into ServiceNow incidents and pulling ticket context back into our engineer workstation. Jira Service Management integrates similarly via Atlassian's REST API.

Cherwell and other less-common platforms integrate via webhook plus REST where the API supports it.

We do not operate a parallel ticket queue; every ticket lives in the client's ITSM, and our engineers work inside the client's workflow.

The one exception is bidirectional alert routing from our observability stack: alert rules fire into PagerDuty or Opsgenie for on-call escalation, which opens the ITSM incident and assigns it.

PagerDuty-to-ServiceNow and Opsgenie-to-ServiceNow integrations are standard patterns that keep the on-call clock and the ticket system in sync.
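For illustration, the sketch below maps an alert into a ServiceNow incident request against the Table API (`/api/now/table/incident`). It only builds the request and does not send it. The alert shape follows the Alertmanager webhook convention of `labels` and `annotations`; the severity-to-urgency mapping and the instance name are assumptions, and authentication (scoped application credentials) is deliberately left out.

```python
import json
import urllib.request


def incident_request(instance: str, alert: dict) -> urllib.request.Request:
    """Build (but do not send) a ServiceNow Table API request for one alert."""
    labels = alert.get("labels", {})
    annotations = alert.get("annotations", {})
    body = {
        "short_description": f"{labels.get('alertname', 'alert')} on "
                             f"{labels.get('instance', 'unknown')}",
        "description": annotations.get("description", ""),
        # Assumed mapping: critical alerts become urgency 1, everything else 3.
        "urgency": "1" if labels.get("severity") == "critical" else "3",
    }
    return urllib.request.Request(
        url=f"https://{instance}.service-now.com/api/now/table/incident",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "Accept": "application/json"},
        method="POST",
    )


alert = {
    "labels": {"alertname": "WanLossHigh", "instance": "edge-rtr-01", "severity": "critical"},
    "annotations": {"description": "WAN loss above threshold for 5 minutes"},
}
req = incident_request("example", alert)  # "example" is a placeholder instance name
print(req.full_url)
```

In a live integration the request would carry credentials and be sent with `urllib.request.urlopen` or an HTTP client of choice.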

Question 9

How is Arista CloudVision different from legacy SNMP-based NMS platforms in managed services operations?

Accepted Answer

CloudVision streams continuous state telemetry into NetDL rather than polling counters every few minutes. Arista EOS supports native streaming of deep platform telemetry through OpenConfig APIs over gNMI and gRPC, so the managed services operator sees state changes as they happen, not on a 5-minute SNMP walk cadence. A counter that crosses a threshold at 10:00:12 appears in the NOC timeline at 10:00:12, not at 10:04:59 on the next poll cycle.

CloudVision deploys as CVaaS, an on-prem virtual appliance, or a physical appliance — the choice sets where NetDL lives, not how telemetry flows.

Our managed services NOC runs the cloud and on-prem delivery modes against the same runbooks.

Question 10

What does CloudVision Studios actually do for ongoing change management in a managed services contract?

Accepted Answer

Studios provides templated, parameterised workflows for provisioning, upgrades, and rollback — replacing ad-hoc configlet editing. Arista documents Studios as covering initial and ongoing provisioning, ZTP as-a-Service, configuration management, and network-wide change control, with workflow scope extending to automated upgrades, network rollback, and network snapshots.

Pre-deployment, CV UNO rigorously evaluates potential network changes before they land on production. In managed services practice that means a VLAN extension, an OSPF area change, or an EOS image push runs through a Studio template with input validation, a dry-run state diff, and a captured snapshot — the change advisory board reviews a structured artefact rather than a shell transcript.

Rollback is a single Studio action, not a CLI reconstruction exercise.

Question 11

What is Arista AVA and how does it fit into managed services detection and response workflows?

Accepted Answer

AVA is Arista NDR's autonomous analyst — an AI engine that classifies threats by behaviour rather than signature and feeds findings into NOC incident workflows. The AVA platform is built from AVA Sensors (traffic collection), AVA Nucleus (AI/ML processing engine), AVA AI (decision support), EntityIQ (security knowledge graph), and Adversarial Modeling. Detection uses unsupervised and supervised machine learning, deep neural networks, belief propagation and multi-dimensional clustering, decision tree classification, and outlier detection.

Native integrations include Microsoft Sentinel, Zscaler, CrowdStrike, SentinelOne, and the MITRE ATT&CK Framework.

Sensors deploy as software on network switches, standalone hardware, virtual instances, or cloud deployments — so an AVA finding can arrive in our managed services incident queue alongside the Sentinel alert that corroborates it, not as a disconnected vendor pane.

Question 12

How does CloudVision CUE govern a managed wireless fleet when the cloud is unreachable?

Accepted Answer

CUE uses a distributed control plane where APs handle control plane functions locally, so a CVaaS outage does not break client authentication, roaming, or forwarding. Arista documents CUE as scaling from a few to 100,000s of APs under one tenant, with built-in Wireless Intrusion Prevention System (WIPS) using patented Marker Packet technology for rogue classification. Identity integrations include Aruba ClearPass, ForeScout NAC, and Cisco ISE; CUE does not force a proprietary policy server.

Telemetry lands in NetDB, a state-based, cloud-hosted, network-wide database collecting real-time signal.

For our managed services wireless clients, that distributed design means we are not paging engineers at 3 a.m. because a cloud-controller region flapped — the wireless estate keeps serving traffic while we resolve the SaaS-side incident.

Question 13

What does CV UNO add beyond CloudVision Portal for managed services observability?

Accepted Answer

CV UNO correlates application and workload telemetry with network state to produce application-aware observability — available only on Arista CloudVision as-a-Service (CVaaS). Deployment places one or more strategically placed CV UNO Sensor VMs on-premises, and integrations include VMware vCenter, ServiceNow CMDB, and Infoblox IPAM.

The analysis layer surfaces application and topology-aware correlations, event patterns, behaviour changes, and flow anomalies. In practice that means a managed services operator investigating a slow application can see the VM, the vSwitch port, the physical leaf/spine path, and the flow anomaly in one correlated view — not three tabs and a spreadsheet.

Multi-domain visualisation extends to physical and virtual end hosts, which is the difference between "network is up" and "application is healthy."

Question 14

What does Arista EOS expose for managed services automation beyond the CLI?

Accepted Answer

Arista EOS publishes four machine interfaces: eAPI JSON-RPC, EOS SDK, OpenConfig/gNMI streaming, and cEOS containerised deployment. eAPI offers a REST-like interface using native CLI commands, so a config-get and a show-command both return JSON.

EOS SDK supports direct integration with the switch operating system for network applications that require low latency and high performance — useful for on-switch agents. Streaming telemetry uses gNMI, gRPC, and OpenConfig standards for improved network visibility. cEOS extends cloud automation for DevOps integration and runs on 3rd-party hardware.

Underneath, SysDB is a multi-process state-sharing architecture that separates state information and packet forwarding from protocol processing and application logic — crash-isolated by design.

Our managed services runbooks target eAPI and gNMI first; CLI screen-scrape is the deprecated path.
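As a small example of what targeting eAPI looks like, the sketch below builds the JSON-RPC body for eAPI's `runCmds` method, which is posted to the switch's `/command-api` endpoint over HTTPS. The helper name and the commands are illustrative; sending the request and handling credentials are left out.

```python
import json


def eapi_payload(cmds: list[str], request_id: str = "1") -> dict:
    """Build the JSON-RPC 2.0 body for an eAPI runCmds call."""
    return {
        "jsonrpc": "2.0",
        "method": "runCmds",
        "params": {
            "version": 1,
            "cmds": cmds,        # native CLI commands, executed in order
            "format": "json",    # ask for structured output rather than text
        },
        "id": request_id,
    }


body = json.dumps(eapi_payload(["show version", "show interfaces status"]))
print(body)
```

Because the response is structured JSON keyed per command, a runbook can read fields directly instead of parsing CLI text.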

Question 15

What does Arista MSS enforce and where does Zero Trust segmentation fit into managed services?

Accepted Answer

Arista MSS — Multi-Domain Segmentation Services — enforces Zero Trust segmentation policy across data centre, campus, WAN, and cloud domains with continuous policy violation monitoring. MSS provides endpoint identification and tagging, traffic mapping services, and continuous traffic and policy monitoring with real-time violation visibility.

Multi-domain visualisation spans data centre, campus, branch, and cloud networks, and CV UNO extends the topology view to include physical and virtual end hosts so segmentation is identity-aware rather than IP-prefix-aware.

Identity feeds come from Aruba ClearPass, ForeScout NAC, and Cisco ISE via CUE.

For our managed services customers the operational delta is this: a policy violation appears in the NOC queue the moment a tagged endpoint talks to a prohibited zone, not after a weekly audit. Change control for MSS policy runs through the same Studios workflow as VLAN and routing changes.

Question 16

What are the four gNMI RPCs and why do they matter for managed services NOC operations?

Accepted Answer

Capabilities, Get, Set, Subscribe. Capabilities is the initial handshake to exchange capability information. Get retrieves snapshots. Set modifies target state. Subscribe controls streaming subscriptions — three modes: STREAM (long-lived subscriptions), ONCE (single request/response channel), and POLL (on-demand retrieval).

Inside STREAM, subscription types split into ON_CHANGE (transmits data only when values change, with optional heartbeat intervals), SAMPLE (regular intervals specified by sample_interval), and TARGET_DEFINED (target determines the best subscription type per leaf). gNMI operates over gRPC, which uses HTTP/2 with mandatory TLS encryption — specifications state implementations MUST NOT fall back to unencrypted sessions and recommend TLS 1.2 or higher.

In managed services practice that means our telemetry pipeline is encrypted at transport, authenticated per session, and selective per leaf — no SNMP community strings, no UDP, no polling jitter.
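The relationship between the Subscribe modes and the per-path stream types can be modelled in a few lines. This is an illustrative data model only: it mirrors the mode names above but is not the gNMI protobuf API or any client library, and the OpenConfig paths and intervals are example values. Intervals are expressed in nanoseconds, as in the gNMI specification.

```python
from dataclasses import dataclass
from enum import Enum


class ListMode(Enum):
    STREAM = "STREAM"
    ONCE = "ONCE"
    POLL = "POLL"


class StreamMode(Enum):
    TARGET_DEFINED = "TARGET_DEFINED"
    ON_CHANGE = "ON_CHANGE"
    SAMPLE = "SAMPLE"


@dataclass(frozen=True)
class Subscription:
    path: str
    mode: StreamMode = StreamMode.TARGET_DEFINED
    sample_interval_ns: int = 0      # only meaningful for SAMPLE
    heartbeat_interval_ns: int = 0   # optional periodic resend for ON_CHANGE


def build_subscribe(list_mode: ListMode, subs: list[Subscription]) -> dict:
    """Assemble a plain-dict description of a Subscribe request."""
    entries = []
    for sub in subs:
        if sub.sample_interval_ns and sub.mode is not StreamMode.SAMPLE:
            raise ValueError(f"{sub.path}: sample interval requires SAMPLE mode")
        entry = {"path": sub.path}
        if list_mode is ListMode.STREAM:
            # Per-path stream types only apply inside a STREAM subscription.
            entry["mode"] = sub.mode.value
            if sub.sample_interval_ns:
                entry["sample_interval"] = sub.sample_interval_ns
            if sub.heartbeat_interval_ns:
                entry["heartbeat_interval"] = sub.heartbeat_interval_ns
        entries.append(entry)
    return {"mode": list_mode.value, "subscription": entries}


request = build_subscribe(ListMode.STREAM, [
    Subscription("/interfaces/interface/state/oper-status",
                 StreamMode.ON_CHANGE, heartbeat_interval_ns=60_000_000_000),
    Subscription("/interfaces/interface/state/counters",
                 StreamMode.SAMPLE, sample_interval_ns=10_000_000_000),
])
print(request)
```

The example reflects the "selective per leaf" point: state that rarely changes is subscribed ON_CHANGE, while counters are sampled on an interval.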

Question 17

What is the Meraki Dashboard API rate limit and how does it shape managed services automation?

Accepted Answer

Ten calls per second per organisation. Every automation we build against Meraki respects that cap — batched reads, cached inventory, exponential back-off on HTTP 429.

Authentication requires the Authorization: Bearer header per request, and API keys inherit the same permissions as the account that issued them, so key governance sits inside the identity model, not a separate vault. Claim endpoints carry stricter limits — 10 requests over a 5-minute period per IP.

Regional endpoints split traffic: api.meraki.com, api.meraki.cn, api.meraki.ca, api.meraki.in, and api.gov-meraki.com.

The current base path is https://api.meraki.com/api/v1/. Our managed services automation uses Meraki action batches where possible to stay under the cap on bulk changes, and we size Dashboard API work against the org's other integrations before rolling a new job.
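The back-off behaviour on HTTP 429 can be sketched as below. The `send` callable is a hypothetical stand-in for whatever HTTP client issues the Dashboard API request; it is assumed to return a `(status, headers, body)` tuple. The retry count and base delay are illustrative defaults, not Meraki recommendations.

```python
import random
import time


def call_with_backoff(send, max_attempts: int = 5, base_delay: float = 1.0,
                      sleep=time.sleep):
    """Call send() and retry on HTTP 429 with exponential back-off."""
    for attempt in range(max_attempts):
        status, headers, body = send()
        if status != 429:
            return status, body
        # Prefer the server's Retry-After hint; otherwise back off
        # exponentially with a little jitter to avoid synchronised retries.
        retry_after = headers.get("Retry-After")
        if retry_after:
            delay = float(retry_after)
        else:
            delay = base_delay * 2 ** attempt + random.uniform(0, 0.1)
        sleep(delay)
    raise RuntimeError("still rate-limited after retries")


# Usage with a fake sender that is rate-limited twice, then succeeds.
responses = iter([
    (429, {"Retry-After": "1"}, None),
    (429, {}, None),
    (200, {}, {"ok": True}),
])
delays = []
result = call_with_backoff(lambda: next(responses), sleep=delays.append)
print(result, delays)
```

Injecting `sleep` keeps the example testable without real waiting; in production the default `time.sleep` applies.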

Question 18

What SRE math underpins the error budget and what triggers a release freeze in managed services?

Accepted Answer

The error budget is the complement of the SLO: 100% minus the target. Per Google SRE practice, an SLI is a carefully defined quantitative measure of some aspect of the level of service; the SLO is a target value or range for that SLI; the SLA is an explicit or implicit contract with consequences for missing the SLO.

Error budget is a rate at which the SLOs can be missed, tracked daily or weekly. Arithmetic examples in the literature: 99% is "2 nines," 99.999% is "5 nines," and Google Compute Engine targets 99.95% availability.

A published example SLO reads "99% of Get RPC calls will complete in less than 100 ms."

In our managed services practice, exceeding the weekly error budget halts feature rollouts on that service until the budget resets — the SLO is a dial that controls change velocity, not a trophy.
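The arithmetic behind the weekly budget is short enough to show directly. In this sketch the budget is expressed in minutes of allowed SLO-violating time per window; the 99.95% target and the downtime figures are example inputs, and the function names are illustrative.

```python
WEEK_MINUTES = 7 * 24 * 60  # 10,080 minutes in a weekly window


def error_budget_minutes(slo_percent: float, window_minutes: int = WEEK_MINUTES) -> float:
    """Minutes of SLO-violating time allowed in the window: (100% - SLO) x window."""
    return (100.0 - slo_percent) / 100.0 * window_minutes


def freeze_rollouts(slo_percent: float, bad_minutes: float,
                    window_minutes: int = WEEK_MINUTES) -> bool:
    """True when consumed bad minutes exceed the budget for the window."""
    return bad_minutes > error_budget_minutes(slo_percent, window_minutes)


budget = error_budget_minutes(99.95)   # about 5.04 minutes per week
print(round(budget, 2))
print(freeze_rollouts(99.95, bad_minutes=3.0))   # within budget
print(freeze_rollouts(99.95, bad_minutes=7.5))   # budget exceeded, halt rollouts
```

A 99.95% weekly target leaves roughly five minutes of budget, which is why a single bad change can be enough to trigger the freeze.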

Question 19

Which ThousandEyes test types does a managed services team run for WAN and SaaS visibility?

Accepted Answer

Eight test types, each addressing a specific failure domain. Network tests measure loss, latency, jitter, MTU, and path trace between agent and target, in Agent-to-Server and Agent-to-Agent variants. HTTP Server tests measure availability, response time, throughput, redirects, and response codes.

Page Load tests add time to load page objects and a waterfall view of all page objects. Transaction tests execute scripted actions from within an actual browser process — useful for login flows.

DNS tests include DNS Server, DNS Trace, and DNSSEC variants.

BGP tests report routing path changes, reachability, and BGP updates. Voice tests include SIP Server and RTP Stream for VoIP quality monitoring. Our managed services WAN playbook layers Agent-to-Agent over every SD-WAN overlay plus Page Load against each tenant SaaS — so a Microsoft 365 slowdown is triangulated, not assumed.

Question 20

Why does our managed services observability stack use Prometheus instead of a push-based metrics system?

Accepted Answer

Time series collection in Prometheus happens via a pull model over HTTP, and each Prometheus server is autonomous — it does not depend on distributed storage, making it dependable during infrastructure outages.

When a customer's aggregation layer flaps, the NOC's Prometheus keeps scraping what it can reach and the rest is recorded as target-down, not as a cascading failure of the monitoring tier. Short-lived batch jobs push via an intermediary push gateway when pulling is not practical.

Prometheus documents one honest limitation: not suitable for scenarios requiring 100% accuracy, such as per-request billing — so managed services billing telemetry rides a separate, auditable pipeline.

The ecosystem around the core server (client libraries, exporters, Alertmanager, push gateway) lets us instrument Cisco, Arista, Juniper, HPE Aruba, and Meraki from one fabric.

Question 21

What core data models does NetBox maintain as the source of truth for managed services automation?

Accepted Answer

NetBox models the network hierarchically and physically in one schema. Hierarchical organisation covers regions, sites, and locations. Physical inventory covers racks, devices, device components, cables, and wireless connections, plus power distribution and circuits and providers. IP management covers IP prefixes, ranges, and addresses. Routing covers VRFs and route targets and FHRP groups such as VRRP and HSRP. Addressing covers VLANs and scoped VLAN groups plus L2VPN overlays.

Access is via REST and GraphQL APIs.

NetBox documents itself as the ideal source of truth to power network automation, and it ships under Apache 2 open source. In our managed services deployments NetBox is the one place where "what should be" is defined — Ansible, Nautobot Jobs, and CloudVision Studios all pull intent from NetBox rather than from a spreadsheet.
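As an example of pulling intent over the REST API, the sketch below builds a request for the devices at one site from NetBox's `/api/dcim/devices/` endpoint. It constructs the request without sending it. The base URL, token, and site slug are placeholders, and the `Token` authorization scheme shown is the long-standing NetBox form; check the deployed version's documentation before relying on it.

```python
import urllib.parse
import urllib.request


def netbox_devices_request(base_url: str, token: str, site_slug: str,
                           limit: int = 100) -> urllib.request.Request:
    """Build (but do not send) a GET request for devices at one site."""
    query = urllib.parse.urlencode({"site": site_slug, "limit": limit})
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/dcim/devices/?{query}",
        headers={
            "Authorization": f"Token {token}",   # placeholder token
            "Accept": "application/json",
        },
    )


req = netbox_devices_request("https://netbox.example.com/", "abc123", "dc1")
print(req.full_url)
```

An automation job would send this with `urllib.request.urlopen`, follow the `next` link for pagination, and feed the resulting device list into template rendering.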

Question 22

What LogicMonitor data collection methods does a managed services NOC typically configure?

Accepted Answer

SNMP v1/v2/v3, WMI, NetFlow, JMX, event-based webhooks, and cloud and Kubernetes-native collectors all funnel into the LogicMonitor Collector agent. LogicMonitor documents SNMP v1/v2 and SNMP v3 configuration explicitly, NetFlow for network traffic flow monitoring, WMI for Windows telemetry, and JMX with JMX Active Discovery for Java application stacks. EventSource Configuration supports event-based ingestion via webhook integrations. LM Logs supports ingestion from AWS, Azure, GCP, Kubernetes, Syslog, and Windows Events.

LM APM supports distributed tracing with .NET, Java, and Python instrumentation plus Selenium synthetic monitoring.

LM Container is Kubernetes-native monitoring with a Helm Chart deployment model. Dynamic Services provide auto-discovery that creates monitoring services from configurable rules — our managed services team defines those rules once per tenant, and the tool onboards new devices without a ticket.

Managed Services for Enterprise Networks: Engineer-Led NOC, NetDevOps, and Observability

Managed Services Engagement Model: Fixed-Fee SOW, Not Block Hours

Scope boundaries we write into every SOW

NOC Tier Structure: What Tier 1, Tier 2, and Tier 3 Actually Do

Tier 1 — first responder, ticket triage, known-issue playbooks

Tier 2 — troubleshoot, vendor case escalation, configuration change authoring

Tier 3 — engineering, root-cause authoring, change authorship

Observability Stack Design: Metrics, Logs, Traces, and Streaming Telemetry

Metrics layer — Prometheus, InfluxDB, and platform-native collectors

Streaming telemetry — gNMI and the OpenConfig model advantage

Logs and traces — structured logging, syslog, and path visibility

NetDevOps Automation Maturity: Six Rungs From Manual to Fully Intent-Driven

Rung 1 — Manual CLI, change requests via email or ticket

Rung 2 — Ansible playbooks for read-only operations

Rung 3 — Source of truth in NetBox or Nautobot, Jinja2 config generation

Rung 4 — Git-gated changes with pre-deploy validation via Batfish

Rung 5 — Containerlab test environments and CI-driven deployments

Rung 6 — Intent-driven operations with closed-loop verification

ITIL 4 Change Management: Standard, Normal, and Emergency Changes

Standard changes — pre-approved, low-risk, documented template

Normal changes — CAB-reviewed, risk-assessed, scheduled

Emergency changes — post-hoc CAB review, incident-driven, audit-tracked

SLI, SLO, SLA, and Error Budgets: Google SRE Math Applied to Managed Services Operations

How error-budget enforcement changes operational posture

DR, BCP, RPO, and RTO: Annual Tabletop Is the Floor, Not the Goal

Configuration RPO — Git commits as the durable state

Network RTO — warm site, cold site, and redundant-path math

Tabletop testing — quarterly minimum, full-restore annually

Platform Coverage and Vendor Case Management

Vendor TAC case management — authoring, escalation, and closure

Platform-specific operational handoff