This is the first in a series of blogs detailing our experience resolving challenges that can arise when using the ForgeRock Platform.
Applicable to ForgeRock Directory Services v7.3 and earlier
About Taweh Ruhle
Taweh leads Midships’ DevSecOps practice and is a certified ForgeRock Access Manager engineer.
For any queries or feedback, please contact Taweh at taweh@midships.io
What are ChangeLogs?
A ForgeRock Directory Server running as a Replication Server has a Changelog Database enabled by default (this is due to change in future ForgeRock releases). The database holds the change history of all Directory Servers (non-replication servers) in the Replication Topology. For Token Stores, it holds all token changes made over time. For User Stores, it holds all user changes made over time.
Changes in the changelog database are held for a pre-defined window. This window is controlled by an attribute called the purge-delay. By default it is set to 3 days, meaning the changelog database only holds changes from the last 3 days.
How do you detect whether the ChangeLogs are purging on the Replication Servers (RS)?
This is simple: you can monitor the volume assigned to the Replication Server. The default directory is:
/path_to_ds_installation/changelogDb/
Check if the volume utilisation is increasing over a period of a couple of weeks. Check if there are ChangeLogs within that are more than 3 days old.
Finally, check if you can see either of these terms mentioned in the error logs:
disk-low-threshold
disk-full-threshold
If the utilisation is stable and you don't see ChangeLogs more than three days old or those terms in the error log, you can probably get that well-deserved coffee!
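The three checks above can be scripted. Below is a minimal sketch, not an official ForgeRock tool. The function names are hypothetical, and the error log location is passed in as an argument because it depends on your installation.

```shell
# volume_use_percent DIR: print how full the volume holding DIR is (e.g. "42%").
volume_use_percent() {
    df -Pk "$1" | awk 'NR==2 {print $5}'
}

# stale_changelog_files DIR DAYS: list files under DIR modified more than DAYS days ago.
# Note: find rounds file age down to whole days, so "+3" means at least 4 days old.
stale_changelog_files() {
    find "$1" -type f -mtime +"$2"
}

# threshold_hits LOGFILE: count log lines that mention either disk threshold.
threshold_hits() {
    grep -c -E 'disk-(low|full)-threshold' "$1" || true
}
```

For example, `stale_changelog_files /path_to_ds_installation/changelogDb 3` should print nothing on a healthy Replication Server, and `threshold_hits <path-to-error-log>` should print 0.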
What should happen?
By default, the changelog should auto-purge changes that are older than 3 days (or older than the configured purge delay).
What happens if I don't address the ChangeLog purge issue?
A service outage once the disk-full-threshold is breached. Let me explain!
When the available disk space falls below the disk-low-threshold, the directory server only allows updates from users and applications that have the bypass-lockdown privilege. When available space falls below the disk-full-threshold, the directory server stops allowing updates, instead returning an UNWILLING_TO_PERFORM error to each update request. Your service will be down. See here for more details.
When this happens it could prevent the following (depending on your architecture):
For unauthenticated users, session information cannot be written to the Core Token Store
Authenticated users will be unable to update session data or user store information. Note that if you run statelessly, authenticated users may be okay.
NOTE
By default, disk-low-threshold is set to 5% of the volume size + 5GB, and disk-full-threshold is set to 5% of the volume size + 1GB. Example command to check the Replication Server disk thresholds:
./dsconfig get-replication-server-prop \
--advanced --hostname <rs-fqdn> --port <rs-admin-port> \
--bindDn <rs-bind-dn> --bindPassword <rs-bind-password> \
--provider-name "Multimaster Synchronization" \
--property disk-low-threshold \
--property disk-full-threshold \
--no-prompt -X
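To make the default thresholds concrete, here is a small sketch that works them out for a given volume size. It is only an illustration of the arithmetic in the note above. The function names are hypothetical, and it assumes 1GB = 1024MB and integer rounding.

```shell
# default_thresholds_mb VOLUME_MB: print the default low and full thresholds in MB.
default_thresholds_mb() {
    five_percent=$(( $1 * 5 / 100 ))
    echo "low=$(( five_percent + 5 * 1024 )) full=$(( five_percent + 1024 ))"
}

# free_space_state FREE_MB VOLUME_MB: classify free space against the default thresholds.
free_space_state() {
    five_percent=$(( $2 * 5 / 100 ))
    low=$(( five_percent + 5 * 1024 ))
    full=$(( five_percent + 1024 ))
    if [ "$1" -lt "$full" ]; then
        echo "full"   # below disk-full-threshold: updates are rejected
    elif [ "$1" -lt "$low" ]; then
        echo "low"    # below disk-low-threshold: only bypass-lockdown updates
    else
        echo "ok"
    fi
}
```

For a 100GB volume, `default_thresholds_mb 102400` prints `low=10240 full=6144`, so restrictions start once free space drops under 10GB and updates stop under 6GB.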
Why didn’t the change logs purge?
A couple of reasons are mentioned below, but there could be others:
1. You used an LDIF export/import to populate the User Store. A default LDIF export from ForgeRock DS includes ForgeRock-specific server metadata. When imported into a new replication topology, this metadata causes a bad state, which in turn prevents the changelog purge from working as required. Use the following export-ldif command, which excludes the server metadata so that the changelog DB continues to be purged according to the purge delay:
./export-ldif --hostname <ds-fqdn> --port 4444 \
--bindDN uid=admin --bindPasswordFile <some-password> \
--backendId userStore \
--excludeAttribute ds-sync-hist \
--excludeAttribute ds-sync-state \
--excludeAttribute ds-sync-generation-id \
--ldifFile <path-to-export-data>
2. kill -9 was used to terminate DS when it had hung, which left the domain state inconsistent. kill -9 should only be used after waiting a minimum of 200ms, and you should check that your DS is healthy afterwards.
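For the first cause, it can be worth checking an LDIF file for replication metadata before you import it. The sketch below is an assumption-laden helper (the function name is hypothetical) that simply counts attribute lines for the three ds-sync attributes named in the export command.

```shell
# count_sync_metadata LDIF: count attribute lines carrying replication metadata.
# Matches both plain ("attr: value") and base64 ("attr:: value") LDIF lines.
count_sync_metadata() {
    grep -c -i -E '^ds-sync-(hist|state|generation-id):' "$1" || true
}
```

Running `count_sync_metadata <path-to-export-data>` should print 0 for an export that is safe to import into a new topology.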
How do you purge the Change Logs?
ForgeRock warns here:
Do not compress, tamper with, or otherwise alter changelog database files directly, unless specifically instructed to do so by a qualified ForgeRock technical support engineer. External changes to changelog database files can render them unusable by the server.
i.e. Do Not Use “rm -RF *” as suggested by Google Bard recently!
In order to purge the changelog, ForgeRock has created the below command:
./dsrepl clear-changelog
See ForgeRock Backstage here for more details.
Note: NEVER clear the changelog without using the above command.
Approach 01: Without Downtime (we recommend you raise a ticket and work with ForgeRock before you proceed)
Sample Replication Topology
3 User Stores with Customer Identities
2 Replication Servers with large Changelog Database
Confirm that service is up and running as normal in your environment (Health Check)
Verify the current changelog database size. You can do this by listing the size of the changelogDb directory located in the Directory Server (DS) instance folder. See below; yours should be significantly larger than the example below:
Shut down one of the Replication Servers using the ./stop-ds command. In this scenario, RS1.
Run the ./dsrepl clear-changelog command on RS1. This command requires the server to be offline. If you run it while the server is online, you will get a message like the below:
Verify on RS1 that the changelogDb folder has been cleared down by checking its size as you did before shutting down RS1.
Start up RS1 using the ./start-ds command and confirm that the server starts successfully. You can verify this by checking the server.out logs on the server for errors. At the end of the logs you should see a successful start message like the below:
You should also see confirmation of connection to the other replication server. In this scenario connection to RS2. For instance:
Current state of the environment:
Following the startup of RS1, it will sync up the changelog as required and align with the other Replication Servers in the Replication Topology, in this case RS2. Verify that the changelogDb folder size has increased since it was cleared down with ./dsrepl clear-changelog. Note: Monitor the size of the changelogDb folder for a few minutes and ensure it is either not increasing or increasing only minimally. This verifies that it is aligned with the other Replication Servers.
On RS1, run the ./dsrepl status command to verify the replication topology and status. Confirm everything (Delays, Domains, Records count, etc.) is as expected. Below is an example command and output:
./dsrepl status --hostname <ds-fqdn> \
--port 4444 --trustAll \
--bindDn uid=admin \
--bindPassword <some-password> \
--showReplicas
Example output:
Note: In DS v7.2.0 and below, the ./dsrepl status command does not include the Entry Count column. To see the entry count, we suggest you run the below command:
./status --hostname <ds-fqdn> \
--port 4444 --bindDn uid=admin \
--bindPassword <some-password> \
--trustAll
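The core of Approach 01 for a single Replication Server can be sketched as a small wrapper. This is a hedged outline rather than a supported script. DS_HOME, DRY_RUN, the bin/ sub-directory and the function names are assumptions, and by default it only prints the commands so you can review them first.

```shell
# DS_HOME is a hypothetical variable pointing at the RS installation directory.
DS_HOME=${DS_HOME:-/path_to_ds_installation}
# DRY_RUN=1 (the default here) prints each command instead of running it.
DRY_RUN=${DRY_RUN:-1}

# run CMD...: print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# clear_changelog_on_rs: stop the RS, clear its changelog offline, restart it.
# Each step only runs if the previous one succeeded.
clear_changelog_on_rs() {
    run du -sh "$DS_HOME/changelogDb" &&         # size before the cleardown
    run "$DS_HOME/bin/stop-ds" &&                # clear-changelog needs the server offline
    run "$DS_HOME/bin/dsrepl" clear-changelog &&
    run du -sh "$DS_HOME/changelogDb" &&         # size after the cleardown
    run "$DS_HOME/bin/start-ds"
}
```

Run it on one Replication Server at a time, and still check server.out and ./dsrepl status by hand afterwards, as described in the steps above.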
Possible Challenges with Solution Approach 01
“Bad Generation ID” error when you check your replication status or start up any of the DSes in the Replication Topology.
A DS Generation ID is a calculated value (a shorthand form) of the initial state of its dataset: it is a hash of the first 1000 entries in a backend. If the replicas' generation IDs match, the servers can replicate data without user intervention. This ID is used by both the DSes and the Replication Servers in the topology.
Steps to resolve on affected servers (at this point you should have raised a ForgeRock ticket especially when in production):
Locate the Replication Domain / baseDN with the Bad Generation ID error. For instance, it could be ou=Tokens. This can be seen in the DS server.out log file on server startup or from the ./dsrepl status command.
Run the below command to remove the affected Replication domain identities
./dsconfig delete-replication-domain \
--provider-name "Multimaster Synchronization" \
--domain-name ou=tokens --hostname <ds-fqdn> \
--port <ds-admin-port> --bindDN <ds-bind-dn> \
--bindPassword <ds-bind-password> \
--trustAll --no-prompt
Verify that the domain has been removed successfully from the Replication Server
./dsconfig list-replication-domains \
--provider-name Multimaster\ Synchronization \
--hostname <ds-fqdn> --port <ds-admin-port> \
--bindDn <ds-bind-dn> \
--bindPassword <ds-bind-password> \
--trustAll --no-prompt
Sample Output
Run the below command to re-add the affected Replication Domain / baseDN
./dsconfig create-replication-domain \
--provider-name "Multimaster Synchronization" \
--domain-name ou=tokens --set base-dn:ou=tokens \
--type generic --hostname <ds-fqdn> \
--port <ds-admin-port> \
--bindDn <ds-bind-dn> \
--bindPassword <ds-bind-password> \
--trustAll --no-prompt
Check the status of the replication using the below command:
./dsrepl status --hostname <ds-fqdn> \
--port 4444 --trustAll \
--bindDn uid=admin \
--bindPassword <some-password> \
--showReplicas
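When locating the affected domain, a quick search of server.out can save some scrolling. The helper below is a rough sketch with a hypothetical name. The exact wording of the log message is an assumption, so it matches any line that mentions a generation ID, regardless of case.

```shell
# generation_id_lines LOGFILE: print log lines that mention a generation ID.
# The match is deliberately loose because the exact message text is assumed.
generation_id_lines() {
    grep -i -E 'generation[ -]?id' "$1" || true
}
```

For example, `generation_id_lines <path-to-server.out>` prints the candidate lines, which you can then read to find the baseDN involved.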
NOTE: Approach 02 below can also be used to resolve “Bad Generation ID”. More details are available here from ForgeRock.
Approach 02: With Downtime or Blue-Green Deployment
Disable traffic to all DSes in the replication Topology affected with the ever-increasing change log database.
Check that all DSes from the first step above have the same data count. If they do not, either wait for all servers to catch up or initialise all DSes with the same data.
Below is an example command to get the status and records count (in v7+ only):
./dsrepl status --hostname <ds-fqdn> \
--port 4444 --trustAll \
--bindDn uid=admin \
--bindPassword <some-password> \
--showReplicas
Below is an example command to see the records count (in v7.2.0 and below only):
./status --hostname <ds-fqdn> \
--port 4444 --bindDn uid=admin \
--bindPassword <some-password> \
--trustAll
Below is an example command to initialize all DSes with the same data:
./dsrepl initialize --baseDN <ds-base-dn-to-initialize> \
--bindDN uid=admin --bindPasswordFile <some-password> \
--hostname "<ds-fqdn>" --toAllServers \
--port 4444 \
--trustAll
Shut down all Directory Servers from the first step above, including the Replication Servers that are affected. Below is an example command to stop the DS:
./stop-ds
Execute the ./dsrepl clear-changelog command on all Replication Servers in the replication topology.
Start up all Replication Servers in the replication topology. Below is an example command to start the DS:
./start-ds
Start up all DSes with data to be replicated in the replication topology. Below is an example command to start the DS:
./start-ds
Monitor the changelog database on the RSes and confirm that it decreases once the purge delay has elapsed.
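Monitoring the changelog size is easy to automate. The sketch below samples the directory size at a fixed interval and labels each sample as growing, shrinking or stable. All function names are hypothetical, and the sample count and interval are values you would choose yourself.

```shell
# dir_size_kb DIR: print the size of DIR in kilobytes.
dir_size_kb() {
    du -sk "$1" | awk '{print $1}'
}

# size_trend BEFORE_KB AFTER_KB: classify the change between two samples.
size_trend() {
    if [ "$2" -gt "$1" ]; then
        echo "growing"
    elif [ "$2" -lt "$1" ]; then
        echo "shrinking"
    else
        echo "stable"
    fi
}

# watch_changelog DIR SAMPLES INTERVAL_SECONDS: print one line per sample.
watch_changelog() {
    prev=$(dir_size_kb "$1")
    i=0
    while [ "$i" -lt "$2" ]; do
        sleep "$3"
        cur=$(dir_size_kb "$1")
        echo "$(date '+%H:%M:%S') ${cur}KB $(size_trend "$prev" "$cur")"
        prev=$cur
        i=$((i + 1))
    done
}
```

For example, `watch_changelog /path_to_ds_installation/changelogDb 10 60` takes ten samples a minute apart. For a longer view across the purge delay window, a scheduled job that records `dir_size_kb` once a day would be a better fit.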
I hope this post has been useful (although I hope you don’t face this issue in the first place). In theory, Bad Generation ID and similar issues should not occur from ForgeRock DS v7.2.1 onwards. If you face this and need help, please contact Midships. We will be happy to help!
To learn more about Midships, please get in touch with us at sales@midships.io