Why IAM Is the Most Critical Service in Your Estate, and Why We Trust Ping
- Paul McKeown
- May 20

Introduction – When the Log‑in Screen Goes Dark
On 15 March 2021, Microsoft’s Azure Active Directory (AAD) experienced a global authentication outage that lasted up to 14 hours in some regions. Employees suddenly could not sign in to Microsoft 365, Teams, Dynamics 365, or the Azure Portal. 1 Downdetector charts lit up, service desks were flooded, and businesses lost a full workday of productivity. The incident was traced to a botched key‑rotation and caching sequence inside the AAD control plane. 2
Even though Microsoft resolved the issue the next morning, the reputational impact was immediate: #AzureAD trended on social media, and risk officers everywhere asked uncomfortable questions about single‑point‑of‑failure identity stacks.
In short, when identity fails, everything else fails.
Take banking as an example. It used to be core banking platform failures that took out online banking services. That risk has shrunk as system architectures have advanced, for example by caching customer data so that a read-only view is always available. When IAM is down, even that will not work, because customers cannot sign in to see it.
(1) (practical365.com) (2) (pluralsight.com)
Why Availability & Resiliency Are Non‑Negotiable
Identity is the front door to every digital service you run:
Human impact – People unable to access their banking services may not be able to pay for groceries and fuel, causing embarrassment and stress.
Regulatory exposure – SLAs, PCI DSS, and regulations increasingly specify IAM uptime. High-profile fines have been issued the world over by regulators when IAM systems have failed.
Brand trust – Locked‑out users quickly voice frustrations publicly, eroding confidence.
Revenue protection – If customers cannot authenticate, they cannot transact.
The Azure AD event shows that an identity failure can ripple far beyond the login box—knocking out collaboration tools, call‑centres, and even IoT workflows that rely on tokens for device trust.
Engineering for High Availability & Self‑Healing IAM
Architectural Patterns
The blueprint for resilient IAM begins with architectures that assume failure is inevitable. The patterns below have emerged as battle‑tested ways to keep authentication services alive under duress.
Active‑active multi‑region replicas - Eliminate regional single points of failure; global clustered load balancers distribute the load.
Stateless token issuance - Encodes claims in JWTs so any node can verify them without database look‑ups.
Replication‑based config stores (e.g., PingDirectory, PingDS) - Support multi-master replication, so an update written to any node propagates across the topology.
Blue‑green / canary releases - Allows safe roll‑outs with instant rollback on auth error spikes.
Circuit‑breaker SDKs - Applications fail soft (graceful degradation) rather than hard‑crash.
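To make the circuit‑breaker pattern concrete, here is a minimal sketch in Python. It is not taken from any Ping SDK: the class name `CircuitBreaker`, its parameters, and the idea of passing a `fallback` callable are all assumptions made for illustration. After a configurable number of consecutive failures the breaker opens and the application serves the fallback (for example, a cached read-only view) instead of calling the identity provider; once a cool-down has passed, one trial call is let through.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Hypothetical minimal circuit breaker for calls to an identity provider."""

    def __init__(
        self,
        max_failures: int = 3,
        reset_after: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # cool-down in seconds
        self.clock = clock                # injectable so tests can control time
        self.failures = 0
        self.opened_at: Optional[float] = None

    def is_open(self) -> bool:
        """True while the breaker is open and the cool-down has not elapsed."""
        if self.opened_at is None:
            return False
        return self.clock() - self.opened_at < self.reset_after

    def call(self, primary: Callable[[], T], fallback: Callable[[], T]) -> T:
        """Run primary unless the breaker is open; degrade to fallback on failure."""
        if self.is_open():
            return fallback()  # fail soft without touching the failing service
        try:
            result = primary()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            return fallback()
        # A success closes the breaker and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each identity call, for example `breaker.call(fetch_live_profile, serve_cached_profile)`, where both function names are placeholders. A production breaker would also need thread safety and metrics, which this sketch leaves out.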
Process & Culture
Technology alone is insufficient; sustained uptime also depends on disciplined operational practices and a learning culture. The following habits institutionalize resilience and ensure people and process keep pace with architecture:
Define SLOs - 99.99% availability translates to roughly 52.6 minutes of downtime per year.
Observability first - Synthetic sign‑in probes, distributed tracing for every OAuth flow, real‑time alerting to a 24×7 on‑call rotation.
Chaos engineering - Proactively kill pods, sever network links, and invalidate tokens to validate auto‑recovery paths. Automatically test that your systems fail gracefully.
Runbooks & game‑days - Everyone from developers to executives practices a login‑failure scenario quarterly.
Blameless post‑mortems - Capture lessons quickly and convert them into config‑as‑code.
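The arithmetic behind an availability SLO is simple enough to script. The sketch below converts an availability percentage into a yearly downtime budget. It assumes a 365-day year, and the function name `downtime_budget_minutes` is made up for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years are ignored


def downtime_budget_minutes(availability_percent: float) -> float:
    """Allowed downtime per year, in minutes, for a given availability SLO."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be a percentage between 0 and 100")
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for slo in (99.9, 99.99, 99.999):
        print(f"{slo}% -> {downtime_budget_minutes(slo):.2f} min/year")
```

Each extra nine cuts the budget by a factor of ten: about 525.6 minutes a year at 99.9%, 52.56 at 99.99%, and 5.26 at 99.999%.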
Cloud reference: The Azure Well‑Architected Framework’s reliability pillar calls multi‑region active‑active table‑stakes for identity workloads. 3
(3) (learn.microsoft.com)
The Power of Open Standards
Open identity protocols are the glue that lets thousands of SaaS apps, on‑premise systems, and cloud APIs speak a common language. Before comparing the major options, it’s worth recalling why these standards exist: to decouple authentication logic from business code, reduce vendor lock‑in, and accelerate secure integrations at Internet scale.
OAuth 2.0
Primary Purpose: Authorisation delegation ("can this app call my API?")
Key Benefit: Granular consent‑driven scopes; token‑exchange & mTLS extensions enable zero‑trust APIs; ubiquitous support across clouds and SaaS.
OpenID Connect (OIDC)
Primary Purpose: Identity layer on top of OAuth 2.0
Key Benefit: Self‑contained ID tokens (JWT) enable stateless verification; Discovery & Dynamic Client Registration streamline integration; PKCE, nonce, & DPoP extensions harden security; ubiquitous support across web, mobile, and IoT.
SAML 2.0
Primary Purpose: Federated SSO, including interoperability with legacy systems
Key Benefit: Battle‑tested enterprise federation; signed & encrypted assertions carry rich attributes; ubiquitous support in commercial off‑the‑shelf apps.
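The "stateless verification" that self‑contained JWTs allow can be illustrated with a short sketch. It uses only the Python standard library and an HMAC (HS256) signature with a shared secret, purely to stay self-contained. That choice is an assumption for illustration: OIDC providers typically sign ID tokens with asymmetric keys (such as RS256) published via a JWKS endpoint, and production code should use a maintained JWT library and also validate issuer, audience, and nonce. The helper names `sign_hs256` and `verify_hs256` are hypothetical.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Any, Dict, Optional


def _b64url_encode(raw: bytes) -> str:
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    # Restore the padding that JWTs strip from base64url segments.
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def sign_hs256(claims: Dict[str, Any], secret: bytes) -> str:
    """Issue a compact JWT signed with HMAC-SHA256 (demo issuer side)."""
    header = {"alg": "HS256", "typ": "JWT"}
    signing_input = ".".join(
        _b64url_encode(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, claims)
    )
    signature = hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256).digest()
    return f"{signing_input}.{_b64url_encode(signature)}"


def verify_hs256(token: str, secret: bytes, now: Optional[float] = None) -> Dict[str, Any]:
    """Return the claims if signature and expiry check out; raise ValueError otherwise.

    No database or session store is consulted: everything needed is in the token.
    """
    try:
        header_b64, payload_b64, signature_b64 = token.split(".")
        header = json.loads(_b64url_decode(header_b64))
        claims = json.loads(_b64url_decode(payload_b64))
        signature = _b64url_decode(signature_b64)
        if not isinstance(header, dict) or not isinstance(claims, dict):
            raise ValueError("header and payload must be JSON objects")
    except Exception as exc:
        raise ValueError("malformed token") from exc
    if header.get("alg") != "HS256":
        raise ValueError("unexpected algorithm")
    expected = hmac.new(
        secret, f"{header_b64}.{payload_b64}".encode("ascii"), hashlib.sha256
    ).digest()
    if not hmac.compare_digest(signature, expected):  # constant-time comparison
        raise ValueError("bad signature")
    current = time.time() if now is None else now
    exp = claims.get("exp")
    if not isinstance(exp, (int, float)) or exp <= current:
        raise ValueError("token expired or missing exp")
    return claims
```

Because verification needs only the key and the token, any node in any region can validate a request without a shared session store. That property is what lets token validation scale horizontally.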
Why They Matter
Interoperability – Standards are baked into SDKs for every language and framework, letting you plug new services in days, not months.
Vendor independence – You can swap providers (Identity or Service Providers) without rewriting application authentication logic.
Developer velocity – New hires are already familiar with OAuth flows; less tribal knowledge trapped in proprietary APIs.
Security reviews – Protocols undergo open scrutiny and formal analysis, increasing confidence.
Integration marketplace – Thousands of pre‑built connectors (e.g., SCIM provisioning, identity bridges) accelerate projects.
The OpenID Foundation notes that billions of users now authenticate with OIDC every day across millions of applications, a testament to the force‑multiplying effect of open standards. 4
(4) (openid.net)
Operational Excellence: Uptime Is a Team Sport
Even the strongest architectures and protocols can falter without disciplined, cross‑functional operations. Sustained uptime emerges when engineering, security, and business teams rally around shared objectives and well‑rehearsed practices:
Measure what matters – Track end‑to‑end time‑to‑first‑token and 95th percentile refresh latencies alongside infrastructure metrics.
Automate everything – CI/CD pipelines lint policies, run unit and security tests, and push changes behind feature flags.
Service‑level incident playbooks – Include comms templates for status pages, customer success, and regulatory notices.
Continuous learning loops – Feed incident data into threat models and capacity plans.
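As a small illustration of "measure what matters", the sketch below computes a 95th percentile from latency samples using the nearest-rank method and checks it against a threshold. The function names and the threshold in the usage note are assumptions for illustration. In practice the percentile would normally come from your metrics platform rather than hand-rolled code.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def breaches_slo(samples: Sequence[float], threshold_ms: float, pct: float = 95.0) -> bool:
    """True when the chosen latency percentile exceeds the threshold."""
    return percentile(samples, pct) > threshold_ms
```

For five token-refresh latencies of 120, 80, 95, 300, and 110 ms, the nearest-rank p95 is 300 ms, so `breaches_slo(samples, 250)` is true even though the median is a healthy 110 ms. Tail percentiles surface the slow sign-ins that averages hide.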
Meeting the SLA: PingOne AIC & Ping Advanced Identity Software
Ping Identity offers two flagship platforms (which Midships are experts in integrating and supporting) that operationalise the patterns and practices outlined above, and meet your uptime requirements:
PingOne Advanced Identity Cloud (AIC)
Cloud‑native and multi‑zone – Deployed active‑active across GCP availability zones with a published 99.99% SLA and automated zero‑downtime patching.
Multi‑region data backup – For greater data protection, your data can be bunkered in another region in case of regional catastrophes.
Stateless token services – Global data‑grid caches enable horizontal autoscaling and millisecond‑level token issuance.
First‑class standards – Native OAuth 2.0, OpenID Connect, SAML 2.0, FIDO2, and SCIM connectors accelerate integrations.
Hybrid available – Midships can help you deploy AIC and on‑premises Advanced Identity Software (AIS) in a hybrid architecture for even greater resilience and recoverability.
Ping Advanced Identity Software (formerly ForgeRock)
Proven engines, self-managed control – PingDS, PingAM, and PingIDM components run as self‑managed Kubernetes workloads with rolling, zero‑downtime releases. These components can now all be autoscaled to meet your needs. 5
Proven availability – Can meet 99.999% uptime requirements.
Multi‑master PingDS replication – Sub‑second data convergence across regions underpins active‑active patterns and graceful degradation.
First‑class standards – Native OAuth 2.0, OpenID Connect, SAML 2.0, FIDO2, and SCIM connectors accelerate integrations.
Lightning fast – Midships have proven this can scale to over 10k transactions per second. 6
(5) (linkedin.com) (6) (linkedin.com)
Takeaway: Both platforms hard‑wire active‑active resiliency, self‑healing operations, and protocol interoperability—turning IAM uptime commitments into reality.
Conclusion
An IAM platform is more than a login box—it is the nervous system of modern business. The 2021 Azure AD outage proved that when identity fails, everything fails. By engineering for high availability, embracing self‑healing patterns, and anchoring your strategy in open, widely adopted standards, you transform IAM from a brittle gatekeeper into a resilient foundation that accelerates innovation rather than holding it back.
Identity downtime is costly; operational excellence is priceless. The choice is yours.
Writer’s Overview
Paul McKeown – Chief Technology Officer, Midships
Paul is a seasoned engineering leader with 19 years in IAM, DevOps, and continuous delivery, with a specialty in ForgeRock and secure banking platforms. He’s delivered CIAM on Kubernetes for major banks in New Zealand and Australia.
Short bio: Paul blends engineering rigor with coaching excellence, driving Midships' technical strategy and delivery risk reduction practices across markets.