Office 365 outage triage and bulk customer comms

PRTG flags the “M365 Auth Latency” sensor red on three different customer probes within five minutes. The on-call MSP engineer needs to know whether the fault lies in the customers’ networks or at Microsoft and, if it is Microsoft, to get one consistent message in front of every customer before the phones start ringing.

Systems involved

System	Role
PRTG	Source alarm and sensor history.
Studio diagnostics	Ping, traceroute, DNS, and HTTPS path checks against `outlook.office365.com`.
Microsoft 365 Service Health	Confirm whether Microsoft has acknowledged an incident.
Halo PSA / ConnectWise	Bulk-update affected customer tickets.
Microsoft Teams	Internal `#noc` channel and customer-shared channels.
StatusPage.io	Public status page update.
Gmail / Outlook	Customer comms with technical contacts.

Walkthrough

Acknowledge the PRTG alarm

Copilot pulls the three sensors and their history. They started failing within 90 seconds of each other across three different customer probes, so a customer-side coincidence is unlikely.
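The correlation check described above can be sketched as a small function. This is an illustration only, not Studio's implementation: the function name, the probe names, and the timestamps are hypothetical, and the 90-second window is taken from the example in this walkthrough.

```python
from datetime import datetime, timedelta

def failures_are_correlated(first_failures, window=timedelta(seconds=90), min_probes=2):
    """Return True when at least `min_probes` probes first failed within `window`.

    `first_failures` maps a probe name to the time its sensor first went red.
    """
    if len(first_failures) < min_probes:
        return False
    times = sorted(first_failures.values())
    # Slide over the sorted onsets and look for a run of min_probes inside the window.
    for i in range(len(times) - min_probes + 1):
        if times[i + min_probes - 1] - times[i] <= window:
            return True
    return False

# Hypothetical onset times for three customer probes (75 seconds apart in total).
onsets = {
    "customer-a": datetime(2024, 1, 1, 9, 0, 5),
    "customer-b": datetime(2024, 1, 1, 9, 0, 40),
    "customer-c": datetime(2024, 1, 1, 9, 1, 20),
}
correlated = failures_are_correlated(onsets, min_probes=3)
```

A tight cluster of onsets across unrelated networks is a hint rather than proof, which is why the walkthrough still rules out the customer paths in the next step.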

Rule out customer-network paths

Copilot runs a parallel diagnostic sweep: a ping and an HTTPS probe against `outlook.office365.com`, `login.microsoftonline.com`, and `graph.microsoft.com` from each customer probe via SSH. All three customers have a clean Internet path, but the Microsoft endpoints respond slowly or return 5xx errors.
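A minimal sketch of the fan-out and the verdict, assuming each check is wrapped in a callable. `ProbeResult`, `run_sweep`, `looks_upstream`, and the 2000 ms slowness threshold are all hypothetical names and values chosen for illustration; the real sweep runs over SSH from each probe, which is replaced here by an injected `check` function.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass

ENDPOINTS = ("outlook.office365.com", "login.microsoftonline.com", "graph.microsoft.com")

@dataclass(frozen=True)
class ProbeResult:
    probe: str
    endpoint: str
    path_ok: bool      # the network path from the probe was clean
    http_status: int   # status code returned by the HTTPS probe
    latency_ms: float  # time taken by the HTTPS probe

def run_sweep(probes, check, endpoints=ENDPOINTS, max_workers=8):
    """Call check(probe, endpoint) for every pair in parallel and collect the results."""
    pairs = [(p, e) for p in probes for e in endpoints]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda pair: check(*pair), pairs))

def looks_upstream(results, slow_ms=2000.0):
    """True when every probe has a clean path yet sees a slow or 5xx endpoint."""
    by_probe = {}
    for r in results:
        by_probe.setdefault(r.probe, []).append(r)
    if not by_probe:
        return False
    for probe_results in by_probe.values():
        clean_path = all(r.path_ok for r in probe_results)
        degraded = any(r.http_status >= 500 or r.latency_ms >= slow_ms for r in probe_results)
        if not (clean_path and degraded):
            return False
    return True
```

Injecting `check` keeps the classification logic testable without network access; a production version would substitute the actual remote probe.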

Check Microsoft Service Health

Copilot calls the Microsoft 365 Service Health connector. It shows an acknowledged incident, EX{number}, for Exchange Online authentication with global scope. That settles the diagnosis: the fault is on Microsoft’s side.
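The matching step might look like the sketch below. It assumes the connector returns a list of dictionaries whose field names (`service`, `classification`, `isResolved`, `title`) resemble the Microsoft Graph service health issue resource; that shape is an assumption, and the connector's real output may differ. The sample records are made up, and the incident ID is left as the placeholder used above.

```python
def matching_incidents(issues, service="Exchange Online", keyword="auth"):
    """Return unresolved incidents for `service` whose title mentions `keyword`."""
    keyword = keyword.lower()
    return [
        issue for issue in issues
        if issue.get("classification") == "incident"
        and not issue.get("isResolved", False)
        and issue.get("service") == service
        and keyword in issue.get("title", "").lower()
    ]

# Made-up sample records in the assumed shape.
issues = [
    {
        "id": "EX{number}",
        "service": "Exchange Online",
        "classification": "incident",
        "isResolved": False,
        "title": "Users may be unable to authenticate to Exchange Online",
    },
    {
        "id": "example-advisory",
        "service": "Microsoft Teams",
        "classification": "advisory",
        "isResolved": False,
        "title": "Delayed presence updates",
    },
]
confirmed = matching_incidents(issues)
```

An empty result does not clear Microsoft; it only means no incident has been acknowledged yet.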

Compose the customer message once

Copilot drafts a short customer-facing message: cause (Microsoft incident), scope (Exchange auth), what’s affected (Outlook, OWA), what isn’t (Teams chat, SharePoint), workaround (existing sessions still work), the Microsoft incident ID, and the next update time.
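Composing the message once is easier when its parts are held as structured fields and rendered on demand. The sketch below is one way to do that; the `OutageNotice` class and its field names are hypothetical, and the values come from the example in this walkthrough.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class OutageNotice:
    cause: str
    scope: str
    affected: tuple
    unaffected: tuple
    workaround: str
    incident_id: str
    next_update: str

    def render(self):
        """Render the notice as plain text, one field per line."""
        lines = [
            f"Cause: {self.cause}",
            f"Scope: {self.scope}",
            f"Affected: {', '.join(self.affected)}",
            f"Not affected: {', '.join(self.unaffected)}",
            f"Workaround: {self.workaround}",
            f"Microsoft incident ID: {self.incident_id}",
            f"Next update: {self.next_update}",
        ]
        return "\n".join(lines)

notice = OutageNotice(
    cause="Microsoft incident",
    scope="Exchange Online authentication",
    affected=("Outlook", "OWA"),
    unaffected=("Teams chat", "SharePoint"),
    workaround="Existing sessions still work",
    incident_id="EX{number}",  # placeholder, as in the text above
    next_update="in 30 minutes",
)
text = notice.render()
```

Because the same object feeds every channel, a later correction changes one field instead of several separate posts.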

Bulk-update tickets in the PSA

The PSA connector lists every open ticket from the last 60 minutes that mentions Outlook, M365, or “email is slow.” Copilot stages a bulk update with the message, links the Microsoft incident, and pauses for approval. You scan the list, untick two unrelated tickets, and approve.
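The select, stage, and approve flow can be sketched as two pure functions. The ticket dictionaries, field names, and ticket IDs below are hypothetical and do not reflect the Halo PSA or ConnectWise schemas; the example unticks a single ticket for brevity. Nothing is sent at this stage, since the staged list is only what a reviewer would approve.

```python
from datetime import datetime, timedelta

KEYWORDS = ("outlook", "m365", "email is slow")

def select_tickets(tickets, now, window=timedelta(minutes=60), keywords=KEYWORDS):
    """Open tickets created inside the window whose summary mentions a keyword."""
    cutoff = now - window
    return [
        t for t in tickets
        if t["status"] == "open"
        and t["created"] >= cutoff
        and any(k in t["summary"].lower() for k in keywords)
    ]

def stage_bulk_update(selected, message, incident_id, excluded_ids=()):
    """Build the pending updates, skipping tickets the reviewer unticked."""
    excluded = set(excluded_ids)
    return [
        {"ticket_id": t["id"], "note": message, "linked_incident": incident_id}
        for t in selected
        if t["id"] not in excluded
    ]

now = datetime(2024, 1, 1, 10, 0)
tickets = [
    {"id": 101, "summary": "Outlook keeps asking for password", "status": "open", "created": now - timedelta(minutes=10)},
    {"id": 102, "summary": "Email is slow since this morning", "status": "open", "created": now - timedelta(minutes=25)},
    {"id": 103, "summary": "Printer offline", "status": "open", "created": now - timedelta(minutes=5)},
    {"id": 104, "summary": "M365 licence renewal quote", "status": "open", "created": now - timedelta(minutes=30)},
    {"id": 105, "summary": "Outlook crash", "status": "open", "created": now - timedelta(hours=3)},
]
selected = select_tickets(tickets, now)
staged = stage_bulk_update(selected, "See incident notice", "EX{number}", excluded_ids=(104,))
```

Ticket 104 shows why the approval pause matters: a keyword match catches it, but it has nothing to do with the outage.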

Post in customer-shared Teams channels

For customers with a shared Teams channel, Copilot posts the same message and tags the right contacts. The message is pinned to the top of each channel for visibility.

Update the public status page

The StatusPage.io connector publishes a Monitoring incident pointing at the Microsoft outage and links the upstream Microsoft advisory.
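For orientation, the request body for such an incident could be assembled as below. The nested `incident` object with `name`, `status`, and `body` follows the Statuspage REST API as I understand it, and the status list is likewise from memory, so treat both as assumptions to check against the current API reference. The function name and the advisory URL are placeholders.

```python
VALID_STATUSES = ("investigating", "identified", "monitoring", "resolved")

def build_incident_payload(name, status, body, advisory_url):
    """Build a JSON-serialisable incident body that links the upstream advisory."""
    if status not in VALID_STATUSES:
        raise ValueError(f"unknown incident status: {status!r}")
    return {
        "incident": {
            "name": name,
            "status": status,
            "body": f"{body}\n\nUpstream advisory: {advisory_url}",
        }
    }

payload = build_incident_payload(
    name="Microsoft 365 authentication issues",
    status="monitoring",
    body="Microsoft has acknowledged an Exchange Online authentication incident.",
    advisory_url="https://example.com/upstream-advisory",  # placeholder URL
)
```

Validating the status before sending avoids a rejected request at the moment the page most needs updating.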

Set a follow-up timer

Copilot adds a 30-minute follow-up reminder. When the timer fires, it re-checks Service Health and the PRTG sensors, then updates the same channels with progress or an all-clear.
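The decision made when the timer fires can be sketched as follows. The function names, the `"up"` and `"down"` state strings, and the rule that both sources must agree before an all-clear are assumptions for illustration, not documented Studio behaviour.

```python
from datetime import datetime, timedelta

def follow_up_action(incident_resolved, sensor_states):
    """Return "all-clear" only when Microsoft reports resolution and every sensor is up."""
    sensors_ok = bool(sensor_states) and all(s == "up" for s in sensor_states.values())
    return "all-clear" if incident_resolved and sensors_ok else "progress"

def next_check(now, interval=timedelta(minutes=30)):
    """Time at which the follow-up should run again."""
    return now + interval

# Hypothetical states: Microsoft says resolved, but one probe is still red.
action = follow_up_action(True, {"customer-a": "up", "customer-b": "down", "customer-c": "up"})
due = next_check(datetime(2024, 1, 1, 10, 0))
```

Requiring both signals guards against announcing an all-clear while a customer probe is still failing.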

Where Studio earns its keep

One diagnostic run touches every customer probe at once — no SSH-jumping between consoles to confirm a global pattern.
The same message reaches the PSA, Teams, and the status page with one approval, instead of forty manual posts.
The follow-up loop is automatic: the 30-minute check happens whether you remember it or not.
The all-clear closes every ticket and posts a final status without you composing it three times.

Related

AI Copilot

Use Planning when the bulk update needs a careful review before it goes out.

Connectors and MCP

How PRTG, the PSA, Microsoft Service Health, Teams, and StatusPage.io are reachable.
