Introduction
This article will explain how Midships supported one of our largest customers during a major ForgeRock upgrade whilst maintaining continuity of service throughout (i.e. zero downtime!). We will describe some of the difficulties encountered and the strategies employed to overcome them.
If you have any queries, please do not hesitate to contact me at ravivarma.baru@midships.io.
Background
Back in 2022, one of the leading financial services providers in Southeast Asia engaged Midships to support an upgrade of their Customer Identity & Access Management (CIAM) service from an older ForgeRock v5.x release to the latest version, ForgeRock v7.x.
Whilst all major upgrades are challenging, this remains one of the most difficult migrations Midships has been involved with to date, because:
the migration was across two major versions that were incompatible with one another;
the user store consisted of a large and complex data set, with operational attributes subject to frequent change;
the service had to support a high volume of transactions;
there could be no downtime;
data drift had to be minimal (ideally zero), to facilitate a seamless rollback where required; and
live sync (see below) could not use the change logs to trigger replication.
IDM Livesync
The ForgeRock IDM platform can synchronize data between two sources, either through a one-time reconciliation or through continuous synchronization, known as Livesync. A Livesync job detects modifications either by scanning the data source's change logs or by querying data timestamps at specified intervals (determined by a cron schedule). The changes are then pushed to the other data source (and transformed where required).
For more details on IDM synchronization, please refer to ForgeRock IDM 7 > Synchronization Guide.
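To make the timestamp-based mode concrete, each Livesync run effectively performs an LDAP search like the one below. This is a minimal sketch: the host, port, credentials and base DN are placeholders, and the timestamps use LDAP generalized time.

```bash
# Sketch: find entries created or modified since the last sync token.
# All connection details below are placeholders.
LAST_SYNC=20240101120000Z

ldapsearch --hostname ds-v5.example.com --port 1636 --useSsl \
  --bindDN "cn=idm-sync" --bindPassword "$PW" \
  --baseDN "ou=users,dc=example,dc=com" \
  "(|(createTimestamp>=${LAST_SYNC})(modifyTimestamp>=${LAST_SYNC}))" \
  dn createTimestamp modifyTimestamp
```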
The Challenge
These complexities meant that we faced several challenges during the implementation and post go-live. The biggest stemmed from our inability to use the changelog, which left us relying on timestamps to identify changes in the source. This meant that:
The timestamp change queries were resource intensive, affecting overall performance.
The high volume of changes during peak hours led to bottlenecks where IDM Livesync could not keep up with the changes. These in turn created other challenges, such as the timestamp queries returning more results than the index limits allow (thereby becoming unindexed queries!).
IDM could not guarantee that updates would be applied in the correct order.
Deletions and certain types of updates were not correctly detected.
Our Solutions
| Category | Issue/Impact | Solution |
| --- | --- | --- |
| Performance | Timestamp-based Livesync increased the CPU utilization of the VM, resulting in slow responses and, on occasion, blocked incoming requests from other DS consumers such as ForgeRock AM and IDM. | The timestamp-based Livesync mechanism uses the createTimestamp and modifyTimestamp attributes to detect entries created or modified within a specific period. As these are operational attributes (present on every ForgeRock DS entry), they are not indexed by default. Indexing them reduced CPU utilization from above 98% to less than 5% (see the indexing sketch below the table). |
| Index limits | Although IDM Livesync was configured to run every 10 seconds, it runs serially: if a run's query and updates take longer than 10 seconds (e.g. 20 seconds), the next query covers a longer period (e.g. 30 seconds). This can cause the search to return more results than the index limit allows, at which point the search becomes unindexed. | The default index-entry-limit of 4,000 proved insufficient. Through testing we were able to prove that increasing it to 10,000 was enough to keep the search indexed (see the index-limit sketch below the table). |
| Data drift | The initial data migration from v5.x to v7.x was a point-in-time export from v5.x followed by an import into v7.x, with Livesync enabled afterwards. This meant that the export contained stale data (i.e. it missed any changes made after the export and before Livesync was enabled). | We could not simply rewind Livesync to pick up changes made during this intervening period, as that generated an unindexed search. Instead we applied a manual data patch: we exported the data from both v5.x and v7.x with timestamps covering the affected period, and a Bash script compared the two exports to identify any entries created or changed on v5.x (see the comparison sketch below the table). These were then applied to v7.x, provided the data had not been modified by Livesync since the differences were identified. |
| Delete operations | With timestamp-based Livesync, IDM can only retrieve entries from the source DS that were created or modified within a specific period, using the createTimestamp or modifyTimestamp attributes. This method provides no information about entries that have been deleted. | A script was developed that monitors the ldap-access audit logs, captures delete operations on the source side, and deletes the corresponding entries from the target DS (see the audit-log sketch below the table). This way we avoided accumulating orphaned data in the target DS, giving us better data consistency. |
| Changes to a user record DN | Like delete operations, modifyDN operations cannot be detected by timestamp-based IDM Livesync. If the DN of an entry changes in the source DS, IDM can pick up the new DN, as it is recognized as a newly created entry, but the entry under the original DN is never deleted from the target. | A script was developed to capture modifyDN operations from the source DS and systematically delete the old DN entry from the target DS (handled in the same audit-log sketch below). This allowed us to keep the data in sync. |
| Loopback of changes | Bi-directional sync, i.e. from v5.x to v7.x and from v7.x to v5.x, can create data loops leading to dirty data. | Multi-master updates were not allowed at the user-object level, i.e. a user object was mastered on either v5.x or v7.x. We enforced this by creating new DS service accounts for IDM's writes to v5.x and v7.x, distinct from the accounts used by AM and other services. We then excluded (ignored) the changes IDM itself had made when running the timestamp search query (see the filter sketch below the table). |
| Replication delay | This one is complicated, but it is the most important. Both v5.x and v7.x have multiple DS servers connected to RS servers, and in a multi DS-RS topology there is always some replication delay. Say an entry is created at time t on one of the v5.x DS servers: it carries createTimestamp t. The entry is then replicated to the other DS servers in the topology as-is, so on each of them it keeps createTimestamp t rather than the local arrival time t+x (where x is the replication delay). Now suppose Livesync (scheduled every 5 seconds) queries the DS server it connects to for entries created between t-5 and t. An entry created on another DS server at t-1 that has not yet replicated to the connected server is missed, and the next run, which queries from t to t+5, misses it too, because the entry's createTimestamp is still t-1. The same argument applies to modifyTimestamp, so modifications were also missed. | Suppose an entry has createTimestamp/modifyTimestamp t-1 but, because of replication delay x, only arrives on the DS server Livesync connects to at t+x. By amending the syncToken, we forced the Livesync run at t+5 to query from t-1 or t-2 (based on the observed replication delay) to t+5, instead of from t to t+5. Changes applied by replication after the previous run had completed were then still picked up (see the window-overlap sketch below the table). |
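Indexing sketch. The following shows the shape of the DS commands involved in indexing the two timestamp attributes. The backend name, host, port and credentials are placeholders; check the exact options against your DS version's tooling.

```bash
# Create ordering indexes for the operational attributes that
# timestamp-based Livesync filters on (connection details are placeholders).
for attr in createTimestamp modifyTimestamp; do
  dsconfig create-backend-index \
    --hostname ds-v7.example.com --port 4444 \
    --bindDN "uid=admin" --bindPassword "$ADMIN_PW" \
    --backend-name userRoot \
    --index-name "$attr" \
    --set index-type:ordering \
    --trustAll --no-prompt
done

# Build the new indexes from the existing data.
rebuild-index \
  --hostname ds-v7.example.com --port 4444 \
  --bindDN "uid=admin" --bindPassword "$ADMIN_PW" \
  --baseDN "dc=example,dc=com" \
  --index createTimestamp --index modifyTimestamp \
  --trustAll
```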
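Index-limit sketch. Raising the index-entry-limit is a one-line dsconfig change per index, again with placeholder connection details; a rebuild of the affected indexes is typically needed for the new limit to take effect.

```bash
# Raise the per-key entry limit on the timestamp indexes from the
# default of 4000 to 10000 (values as proven sufficient in our testing).
for attr in createTimestamp modifyTimestamp; do
  dsconfig set-backend-index-prop \
    --hostname ds-v7.example.com --port 4444 \
    --bindDN "uid=admin" --bindPassword "$ADMIN_PW" \
    --backend-name userRoot \
    --index-name "$attr" \
    --set index-entry-limit:10000 \
    --trustAll --no-prompt
done
```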
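Comparison sketch. The actual data-drift patch script was customer-specific, but conceptually it reduces to something like the following: normalize the two LDIF exports, diff them, and keep the v5.x-side differences for review and replay. The filenames and the list of attributes to strip are illustrative only, and this assumes both exports are sorted consistently by DN.

```bash
#!/usr/bin/env bash
# Sketch: find entries/attributes that changed on v5.x but are missing
# from the v7.x import (filenames and stripped attributes are placeholders).

normalize() {
  # Drop operational attributes that legitimately differ between topologies.
  grep -Ev '^(createTimestamp|modifyTimestamp|entryUUID|ds-sync-hist):' "$1"
}

diff <(normalize v5-export.ldif) <(normalize v7-export.ldif) \
  | grep '^<' \
  | sed 's/^< //' > v5-only.txt   # candidate changes to re-apply to v7.x
```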
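Audit-log sketch. This shows the shape of the cleanup used for both the delete and the modifyDN gaps. It assumes DS's JSON access audit log with common-audit field names; the log path, target host and bind credentials are all placeholders, so treat the field names as assumptions to verify against your DS version.

```bash
#!/usr/bin/env bash
# Sketch: replay successful DELETE and MODIFYDN operations from the source
# DS audit log as deletes against the target DS (old DN only; for MODIFYDN
# the renamed entry is picked up by the timestamp query as a "new" entry).
AUDIT_LOG=/opt/ds/logs/ldap-access.audit.json   # placeholder path
TARGET_HOST=ds-v7.example.com                   # placeholder host

tail -F "$AUDIT_LOG" | jq --unbuffered -r '
  select(.response.status == "SUCCESSFUL")
  | select(.request.operation == "DELETE" or .request.operation == "MODIFYDN")
  | .request.dn' |
while read -r dn; do
  ldapdelete --hostname "$TARGET_HOST" --port 1636 --useSsl \
    --bindDN "cn=sync-cleanup" --bindPassword "$BIND_PW" "$dn"
done
```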
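Filter sketch. To stop IDM re-importing its own writes, the timestamp query can exclude entries last modified by IDM's dedicated service account. modifiersName is the standard operational attribute recording the DN of the last writer; the service-account DN below is a placeholder.

```bash
# Sketch: timestamp query that ignores changes written by IDM itself.
SINCE=20240101120000Z   # previous sync token, placeholder value
FILTER="(&(modifyTimestamp>=${SINCE})(!(modifiersName=uid=idm-sync,ou=services,dc=example,dc=com)))"

ldapsearch --hostname ds-v5.example.com --port 1636 --useSsl \
  --bindDN "cn=idm-read" --bindPassword "$PW" \
  --baseDN "ou=users,dc=example,dc=com" "$FILTER" dn modifyTimestamp
```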
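Window-overlap sketch. Conceptually, the syncToken amendment widens each query window backwards by the worst-case replication delay, as below (GNU date syntax; the interval and delay values are illustrative).

```bash
# Sketch: overlap each Livesync window by the worst-case replication delay
# so entries that replicate in late are still caught.
INTERVAL=5      # seconds between Livesync runs
REPL_DELAY=2    # observed worst-case replication delay, in seconds

NOW=$(date -u +%s)
FROM=$(date -u -d "@$((NOW - INTERVAL - REPL_DELAY))" +%Y%m%d%H%M%SZ)

# Query from NOW-INTERVAL-REPL_DELAY instead of NOW-INTERVAL. Entries
# re-read in the overlap are harmless: re-applying a change is idempotent.
echo "(|(createTimestamp>=${FROM})(modifyTimestamp>=${FROM}))"
```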
Generation of Reports
Validating the data across v5.x and v7.x is necessary: the business needs assurance that the changes made on v5.x or v7.x have all been successfully applied to the other side via Livesync. Again, with timestamp-based Livesync we cannot simply correlate changes by comparing the number of operations on DS v5.x against DS v7.x.
Hence we developed the following custom reports:
A report showing each branch's subordinate entry counts in both v5.x and v7.x and the difference between them. This shows whether the two data sets diverge, or whether Livesync lags, as time progresses. It runs every 30 minutes (see the sketch after this list).
A report showing the number of add and modify operations that occurred on v5.x and v7.x in a 5-minute window; it runs every 30 minutes. This indicates whether the modifications made on both sides are in sync.
A report showing any differences in user objects, built by grabbing all the changes that occurred on the source DS over the last 5 minutes (it runs every 30 minutes) and comparing those objects between the source and target DS. This verifies that IDM picks up every modification on the source DS and applies it to the target DS.
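As an illustration of the first report, comparing branch sizes can be as simple as reading the numSubordinates operational attribute from each branch entry on both sides. This is a sketch only; hostnames, branch DNs and credentials are placeholders.

```bash
#!/usr/bin/env bash
# Sketch: compare per-branch entry counts between v5.x and v7.x using the
# standard numSubordinates operational attribute (placeholders throughout).
BRANCHES=("ou=users,dc=example,dc=com" "ou=devices,dc=example,dc=com")

count() {  # count <host> <baseDN>
  ldapsearch --hostname "$1" --port 1636 --useSsl \
    --bindDN "cn=report" --bindPassword "$PW" \
    --baseDN "$2" --searchScope base "(objectClass=*)" numSubordinates \
    | awk -F': ' '/^numSubordinates/ {print $2}'
}

for b in "${BRANCHES[@]}"; do
  v5=$(count ds-v5.example.com "$b")
  v7=$(count ds-v7.example.com "$b")
  printf '%s  v5=%s  v7=%s  diff=%s\n' "$b" "$v5" "$v7" "$((v5 - v7))"
done
```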
Conclusion
Maintaining data synchronization is crucial during a version upgrade, as it helps to accurately identify and troubleshoot issues in the application layer during the testing phase or in production. Without data synchronization, it can be challenging to determine whether a problem is caused by the application layer or the data layer.
In summary, we delved into the challenges of keeping data in sync during a version upgrade using IDM Livesync: being unable to use changelogs, data drift, replication delay, and the difficulty of detecting deletes and modified DNs. We also discussed the solutions we implemented for these problems, including indexing, custom Groovy and Bash scripts, and custom Livesync schedules. Monitoring the performance of Livesync through the generation of reports is essential for evaluating the solution's effectiveness and boosting the confidence of the business.
I hope this article has been helpful. Please do get in touch if you have any questions. Thanks for reading,
Ravi