MIDSHIPS

  • Yuxiang Lin

Hybrid CIAM Architecture With ID Cloud

Control CIAM Resiliency with Hybrid Architecture

This paper discusses Midships' design and implementation of a hybrid Customer Identity & Access Management (CIAM) architecture. The hybrid solution is proposed to address concerns about shifting resiliency responsibilities entirely to Software as a Service (SaaS) CIAM services, such as ID Cloud. The hybrid architecture utilises an on-premise CIAM deployment as a standby in the event that ID Cloud is unavailable.


Where ID Cloud is unavailable, this hybrid design allows for a near-zero recovery point objective (RPO) and a recovery time objective (RTO) within minutes. We cover:


  • How to manage user data synchronisation and traffic control;

  • The standby deployment options available to balance RTO and cost; and,

  • How Midships uses one pipeline to ensure both the ID Cloud and the on-premise deployment always share the same configuration.


By adopting this hybrid architecture, organisations can achieve a highly resilient CIAM architecture and retain control to act swiftly in the event of a disaster.


Background & Purpose

The Customer Identity & Access Management (CIAM) service is one of the most critical systems within an enterprise architecture. The availability of the CIAM service directly affects whether the organisation’s digital services can be accessed by its customers. Therefore, the CIAM service requires a resilient architecture design to meet the desired Recovery Point Objective (RPO) and Recovery Time Objective (RTO).


When organisations consider Software as a Service (SaaS) CIAM services such as ID Cloud, a common concern is that responsibility for CIAM resiliency shifts entirely to the SaaS provider. In this paper, we address this concern by illustrating a hybrid design where an on-premise deployment acts as a standby CIAM service to ID Cloud. This hybrid architecture allows the organisation to achieve the following key outcomes during an ID Cloud outage:


  • Near-zero RPO: This is achieved by synchronising user data from the ID Cloud to the on-premise setup.

  • RTO within minutes: This is achieved by changing the CIAM traffic routing to the on-premise setup.


By adopting this hybrid architecture, organisations can achieve a highly resilient CIAM architecture and retain the control to act swiftly in a disaster situation to recover the CIAM service and minimise any customer impact.


Solution

Hybrid Architecture

In this solution, we introduce a hybrid architecture that involves an active-passive (standby) deployment of the CIAM service. The active deployment of the CIAM service will be on ID Cloud and the standby deployment will be on-premise. To achieve the desired RPO and RTO outcomes, this solution addresses the following key considerations:


  • User Data Sync: To achieve a near-zero RPO, user data changes on the active deployment must be synchronised to the standby deployment in near real time.

  • Traffic Control: To achieve an RTO within minutes, there should be control over the routing of CIAM traffic so that traffic can easily be directed from the active deployment to the standby deployment.

  • Hot Standby Vs Warm Standby: Maintaining the standby deployment involves costs, so there should be options on the standby approach to optimise the balance between RTO and cost.

  • Single Pipeline, Same Configuration: Both deployments of CIAM should have the same configuration and behaviour. This confidence is important when deciding whether to switch traffic during an outage.


User Store Sync

Near real-time sync of user data is achieved by configuring the live-sync capability supported on ID Cloud. For ID Cloud to perform live-sync with the on-premise user store, a remote connector server (RCS) needs to be deployed on-premise to act as the on-premise user store connector for ID Cloud. RCS establishes WebSocket connections with ID Cloud for bi-directional communication to facilitate the data sync process. The communications between RCS and ID Cloud are secured via standard SSL/TLS and OAuth 2 access tokens (client credentials grant).


For more details on RCS, please refer to: Remote connectors :: ICF 1.5.20.21


Traffic Control

Identity Gateways (IG) deployed on-premise act as a reverse proxy to route all CIAM traffic to either ID Cloud or the on-premise CIAM deployment. This allows IG to be the platform that controls all CIAM traffic. In the event of an outage, IG can be updated to switch the routing targets and achieve an RTO within minutes. Being a stateless gateway, IG can auto-scale and recover easily. The resiliency and availability of IG can be addressed separately via multi-cluster deployment approaches.
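The routing switch described above can be pictured with a small sketch. This is a toy model only, not IG configuration: the `Router` class, its method names and both upstream URLs are hypothetical and exist purely to illustrate a single switchable routing target.

```python
from dataclasses import dataclass

# Hypothetical upstream URLs; a real deployment would use its own hostnames.
ID_CLOUD = "https://idcloud.example.com"
ON_PREM = "https://ciam.onprem.example.com"


@dataclass
class Router:
    """Toy model of a gateway with one switchable routing target."""
    active_target: str = ID_CLOUD

    def route(self, path: str) -> str:
        # Every CIAM request is forwarded to whichever target is active.
        return f"{self.active_target}{path}"

    def fail_over(self) -> None:
        # Operator-triggered switch to the standby deployment.
        self.active_target = ON_PREM

    def fail_back(self) -> None:
        # Return traffic to ID Cloud once it has recovered.
        self.active_target = ID_CLOUD


router = Router()
before = router.route("/oauth2/authorize")
router.fail_over()
after = router.route("/oauth2/authorize")
```

Because the switch is a single piece of state held at the gateway, clients keep calling the same gateway address before and after the failover.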


Hot Standby Vs Warm Standby

For the on-premise standby deployment, there are two options an organisation can adopt. The first option is a hot standby approach where all the components of the CIAM system are deployed and running. The compute layer (IDM and AM) can be maintained at the minimum capacity with auto-scaling capability. The data layer (user store, token store and app policy store) needs to be maintained at the production-required capacity. This approach can achieve an extremely short RTO, as the only activity involved in a traffic switch is updating the IG routing target.


The second option is a warm standby approach where only the user store is deployed and running with production-required capacity. In the event of a traffic switch, the CIAM deployment pipeline needs to be invoked to deploy the rest of the CIAM components before updating the IG routing target. Compared to the hot standby approach, this option has a longer RTO (around 30 minutes, and potentially as low as 10 minutes) but incurs a lower cost in maintaining the standby deployment.
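The difference between the two options comes down to which activities sit on the critical path of a traffic switch. A minimal sketch of that difference follows; the `StandbyMode` enum, the `failover_steps` function and the step descriptions are illustrative names, not part of any product.

```python
from enum import Enum


class StandbyMode(Enum):
    HOT = "hot"
    WARM = "warm"


def failover_steps(mode: StandbyMode) -> list[str]:
    """Return the ordered switch-over activities for a standby mode."""
    steps = []
    if mode is StandbyMode.WARM:
        # Only the user store is running, so the remaining CIAM
        # components must be deployed by the pipeline first.
        steps.append("run CIAM deployment pipeline for remaining components")
    # Both modes finish by repointing the gateway at the standby.
    steps.append("update IG routing target to on-premise")
    return steps
```

With hot standby the list contains only the routing update, which is why its RTO is the shorter of the two.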


Single Pipeline, Same Configuration

ID Cloud and the on-premise CIAM deployment should always share the same configuration and behaviour so that a traffic switch can be carried out confidently whenever the situation calls for it. This confidence is achieved via a single pipeline and codebase that facilitates a CI/CD process where the same configuration is applied to both deployments for every release. Manual configuration needs to be avoided completely by codifying the CIAM configuration. Using Midships' accelerator, one can achieve this single-pipeline CI/CD process. Additionally, the accelerator makes version upgrades and security patching much easier to handle as the organisation seeks to keep the on-premise deployment secure and up to date.
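One way to gain confidence that both deployments hold the same configuration is to compare a fingerprint of the configuration applied to each. The sketch below is an assumption about how such a check could look and is not a description of the Midships accelerator; `fingerprint` and `in_sync` are hypothetical helper names, and the configuration is modelled as a plain dictionary.

```python
import hashlib
import json


def fingerprint(config: dict) -> str:
    """SHA-256 fingerprint of a configuration, independent of key order."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def in_sync(cloud_config: dict, on_prem_config: dict) -> bool:
    """True when both deployments carry identical configuration."""
    return fingerprint(cloud_config) == fingerprint(on_prem_config)
```

A pipeline stage could run a check of this kind after each release and flag any drift before an outage forces a traffic switch.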


Other considerations

Shift back to ID Cloud

When ID Cloud has recovered and is ready for traffic to be switched back, a few key operational activities need to be carried out before and after the switch back:

  1. On ID Cloud, turn off user data live-sync (from ID Cloud to on-premise)

  2. On ID Cloud, run user data reconciliation (from on-premise to ID Cloud)

  3. On ID Cloud, turn on live-sync (from on-premise to ID Cloud) and note down the sync token

  4. Carry out the traffic switch

  5. On ID Cloud, turn off live-sync (from on-premise to ID Cloud)

  6. On ID Cloud, turn on user data live-sync (from ID Cloud to on-premise) with sync token
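The ordering of these activities matters, so it can help to encode them as a runbook that halts at the first failed step. The sketch below is illustrative only: `SWITCH_BACK`, `run_in_order` and the `execute` callback are hypothetical, and the callback stands in for whatever tooling actually performs each step.

```python
from typing import Callable

# The six switch-back activities, in the order given above.
SWITCH_BACK = [
    "turn off live-sync (ID Cloud to on-premise)",
    "run reconciliation (on-premise to ID Cloud)",
    "turn on live-sync (on-premise to ID Cloud) and note the sync token",
    "carry out the traffic switch",
    "turn off live-sync (on-premise to ID Cloud)",
    "turn on live-sync (ID Cloud to on-premise) with sync token",
]


def run_in_order(steps: list[str], execute: Callable[[str], bool]) -> list[str]:
    """Run steps in order; stop at the first failure.

    Returns the steps that completed successfully.
    """
    completed = []
    for step in steps:
        if not execute(step):
            break
        completed.append(step)
    return completed
```

Stopping at the first failure means, for example, that traffic is never switched back if reconciliation did not complete.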


Long-lived refresh tokens and Dynamic OAuth 2 client registration

By avoiding synchronisation of the token store, we reduce the complexity of data synchronisation between the two deployments without much drawback, as it is generally acceptable for customers to log in again in the event of an outage. However, where the organisation issues long-lived refresh tokens, token store synchronisation may be required.


Similarly, if the organisation supports dynamic OAuth 2 client registration (as part of OIDC), app policy store synchronisation may be required. The approach to synchronising the token store or app policy store is the same as for the user store, but one must pay extra attention to the secret management of AM to ensure both deployments use the same secrets.


Parallel Run

Parallel Run can be achieved in this hybrid architecture with additional enhancements.

First, IG needs logic to enforce stickiness so that CIAM requests within a user’s session are handled by one deployment. This stickiness can be lifted if no stateful features are used (such as refresh tokens, stateful sessions or Auth Code grants) and if extra attention is paid to the secret management of AM to ensure both deployments use the same secrets.
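One common way to obtain this kind of stickiness is to derive the target deployment deterministically from a session identifier. The sketch below assumes such an identifier is available to the gateway; `DEPLOYMENTS` and `pick_deployment` are hypothetical names and this is not IG configuration.

```python
import hashlib

# Hypothetical labels for the two deployments.
DEPLOYMENTS = ["id-cloud", "on-premise"]


def pick_deployment(session_id: str) -> str:
    """Map a session identifier to one deployment, deterministically."""
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    # The same session_id always yields the same digest, hence the same target.
    return DEPLOYMENTS[digest[0] % len(DEPLOYMENTS)]
```

Since the mapping depends only on the identifier, any gateway instance makes the same choice without sharing state with the others.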


Second, bi-directional live-sync needs to be configured on ID Cloud. Extra attention and filtering scripts will be needed to avoid synchronisation loops and to address data conflicts. In general, parallel run should be avoided unless necessary, due to the amount of added complexity.
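To illustrate what a loop-avoidance filter might check, the sketch below propagates a change only when it was made locally by something other than the sync process itself. The `Change` record, its `origin` and `applied_by` fields and the `should_propagate` function are assumptions made for illustration; real filtering scripts would depend on the attributes available in the actual user store.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    user_id: str
    origin: str      # deployment where the change was first made
    applied_by: str  # "user" or "sync" (hypothetical marker)


def should_propagate(change: Change, local: str) -> bool:
    """Propagate only changes made locally by a non-sync actor.

    A change written by the sync process is never sent back,
    which prevents it from bouncing between deployments.
    """
    return change.origin == local and change.applied_by != "sync"
```

This sketch addresses loops only; resolving conflicting edits to the same user on both sides would need a separate policy.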


Are you interested?

If you would like to learn more, please contact sales@midships.io
