
Design a Reliable PingDS Backup Strategy

  • Atharva Thorkar
  • Jul 22
  • 4 min read

Updated: Sep 18


Introduction 

Backups are essential, but they’re only valuable if they can be restored quickly and reliably. In modern infrastructure, even a few minutes of downtime can lead to customer dissatisfaction, loss of trust, and damage to brand reputation. A failed or slow restore can be just as costly as having no backup at all.

That’s where RTO (Recovery Time Objective) and RPO (Recovery Point Objective) come in. RTO defines how fast you can recover, and RPO defines how much data you can afford to lose. If your backup strategy can’t meet these goals, it’s time to rethink it. 

This article explores how we made Ping Directory Server (PingDS) backup cron jobs more robust for a client.


Problem Statement

The client wanted to ensure their RTO was under 10 minutes and their RPO was 1 hour. Since the CIAM service was a mission-critical workload for them, minimal downtime, infrastructure resilience, and quick recovery were mandatory business requirements. Falling short on any front would not only have hurt their business but could also have invited regulatory investigations.

Our tech team used this opportunity to demonstrate how the Midships Ping AIS accelerator architecture can easily support these requirements.


Key Constraints

The default setup had each DS performing backups at a fixed time daily, using a local cron job within the containers. While this worked in theory, several limitations became obvious over time: 

  • Resource Contention: All replicas attempted backup simultaneously, causing spikes in disk I/O and memory usage. 

  • No Retry Mechanism: ForgeRock DS’s native backup utility doesn’t retry failed backups, leaving room for silent failures. 

  • Rigid Scheduling: Static timing meant no staggering; there was no way to distribute backup load by starting backups at different times across replicas.

  • Single Recovery Point Across All Pods: If a disaster struck at, say, 6 PM, and the last backup across all replicas was at 2 AM, our only recovery point would be 16 hours old — leading to major data loss across the system. 

  • Long Restore Times: Cloud backups in PingDS are incremental by default, with new backups accumulating in the same folder. Without purging, restore time crept up to as much as 30 minutes after several months of accumulation.

This setup severely constrained both RTO and RPO and made our backup architecture brittle in the face of real-world failures. 


The Solution

In response to this clearly defined business problem, the Midships team enhanced the DS backup solution in its PingAIS Accelerator.

Our solution was built around three key principles: scalability, reliability, and observability. These improvements directly targeted the client’s RTO and RPO gaps, transforming the backup system from a rigid setup into a scalable, fault-tolerant design. Here’s how we made it more robust:

  • Externalised the Backup Scheduler 

Instead of relying on pod-internal cron jobs, we introduced a centralized Kubernetes CronJob object that executes outside the DS pods. This shift alone gave us centralized orchestration and removed the risks of overlapping or isolated cron behavior within containers. A sketch of this setup follows.
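
To illustrate, here is a minimal sketch of how such an external CronJob could be created with the Kubernetes Python client. The namespace, image, entrypoint script, and schedule are placeholder assumptions, not the actual accelerator implementation:

```python
# Minimal sketch: an external Kubernetes CronJob that triggers DS backups
# from outside the DS pods. All names below are illustrative.
from kubernetes import client, config

def create_backup_cronjob(namespace: str = "pingds", schedule: str = "0 */6 * * *"):
    config.load_kube_config()  # use load_incluster_config() when running in-cluster

    container = client.V1Container(
        name="ds-backup-runner",
        image="example/ds-backup-runner:latest",  # hypothetical backup-runner image
        command=["/scripts/run-backup.sh"],       # hypothetical backup script
    )
    job_spec = client.V1JobSpec(
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(restart_policy="Never", containers=[container])
        ),
        backoff_limit=2,  # Kubernetes-level retries, on top of in-script retries
    )
    cronjob = client.V1CronJob(
        metadata=client.V1ObjectMeta(name="ds-backup"),
        spec=client.V1CronJobSpec(
            schedule=schedule,
            concurrency_policy="Forbid",  # never let two backup runs overlap
            job_template=client.V1JobTemplateSpec(spec=job_spec),
        ),
    )
    client.BatchV1Api().create_namespaced_cron_job(namespace, cronjob)
```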

  • Staggered Scheduling of Backups Across Pods

A key improvement we introduced was intelligent staggering of backup execution across all replicas. Instead of running all backups at the same fixed time, we now schedule them in a calculated, evenly distributed manner throughout the day. This approach ensures:

1. Each replica performs multiple backups per day without overlapping with the others.

2. CPU and memory spikes are balanced across all the pods, because the backups are staggered rather than simultaneous.

3. Recovery points are always recent, reducing the risk of data loss.

4. No matter how many DS pods are running, our scheduling logic adapts, so coverage, performance, and resilience scale together (see the sketch after this list).
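
As an illustration, here is one way to compute evenly staggered per-pod cron schedules. The function and slot layout are assumptions for this sketch, not the accelerator’s actual logic:

```python
# Minimal sketch: spread N pods x K backups/day evenly over 24 hours,
# so no two backups in the fleet start at the same time.
def staggered_schedules(num_pods: int, backups_per_day: int) -> dict[str, list[str]]:
    total_slots = num_pods * backups_per_day
    interval_min = (24 * 60) // total_slots  # gap between consecutive backups
    schedules: dict[str, list[str]] = {}
    for pod in range(num_pods):
        crons = []
        for run in range(backups_per_day):
            # pod 0 takes slots 0, N, 2N, ...; pod 1 is offset by one slot, etc.
            minutes = (run * num_pods + pod) * interval_min
            crons.append(f"{minutes % 60} {minutes // 60} * * *")
        schedules[f"ds-{pod}"] = crons
    return schedules

# Example: 3 pods x 4 backups/day = 12 slots, one every 120 minutes.
# ds-0 runs at 00:00, 06:00, 12:00, 18:00; ds-1 at 02:00, 08:00, 14:00, 20:00; ...
print(staggered_schedules(3, 4))
```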

  • Multiple Recovery Points Across the Day 

By increasing the backup frequency to 4x per pod per day, we drastically reduced the maximum data loss window (RPO). Instead of a single recovery point shared across all pods (e.g., 2 AM), we now have 12+ distributed restore points daily (assuming 3 pods), making disaster recovery far more precise and data loss minimal. The arithmetic is sketched after the list below.

This means: 

1. In case of a failure at 6 PM, we can restore from as recent as 4 PM (instead of 2 AM). 

2. Our RTO is faster, as backups are smaller, more recent, and easier to restore.
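
The worst-case RPO arithmetic is straightforward: with evenly staggered backups, the gap between consecutive restore points across the fleet is 24 hours divided by the total daily backup count. A quick sketch (the pod counts here are illustrative):

```python
# Worst-case data loss window when P pods each take B evenly staggered backups/day:
# a fresh restore point appears somewhere in the fleet every 24 / (P * B) hours.
def worst_case_rpo_hours(num_pods: int, backups_per_pod_per_day: int) -> float:
    return 24 / (num_pods * backups_per_pod_per_day)

print(worst_case_rpo_hours(3, 4))  # 3 pods x 4/day = 12 restore points -> 2.0 h
print(worst_case_rpo_hours(6, 4))  # 24 backups/day -> 1.0 h, the 1-hour RPO target
```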

  • Built-in Retry and Error Handling 

ForgeRock’s backup utility doesn’t natively support retries, so we implemented our own retry mechanism (sketched after the list below).

In addition to retries, the system also: 

1. Logs the outcome of each backup job (success or failure). 

2. Tracks retry counts for auditing and analysis. 
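
A retry wrapper along these lines can be built around the DS backup CLI. This is a minimal sketch; the command, retry count, and delay are illustrative, and the exact dsbackup flags vary by DS version and deployment:

```python
# Minimal sketch: run a backup command with retries, logging each outcome
# and the attempt count for later auditing.
import logging
import subprocess
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("ds-backup")

def run_backup_with_retries(cmd: list[str], max_retries: int = 3, delay_s: int = 60) -> bool:
    for attempt in range(1, max_retries + 1):
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode == 0:
            log.info("backup succeeded on attempt %d/%d", attempt, max_retries)
            return True
        log.warning("backup attempt %d/%d failed: %s",
                    attempt, max_retries, result.stderr.strip())
        if attempt < max_retries:
            time.sleep(delay_s)  # back off before retrying
    log.error("backup failed after %d attempts", max_retries)
    return False

# Example invocation (flags are deployment-specific):
# run_backup_with_retries(["dsbackup", "create", "--backupLocation", "/backups/ds-0"])
```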

  • Segregating the backups per pod and date

This brought the restore time down to under 10 minutes; a sketch of the folder layout follows the list below.

1. This segregation ensures that any folder storing backups holds no more than 4 of them (the maximum number of backups a single pod takes in a day). Restore time is therefore effectively fixed, because backups can no longer accumulate indefinitely in one folder.

2. It also gives the team a full backup every day in each folder: the first backup of the day lands in an empty folder, so it is a full backup.
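
A minimal sketch of the per-pod, per-date layout (the base path and naming convention are assumptions):

```python
# Minimal sketch: one backup folder per pod per day. Each day starts with an
# empty directory, so the day's first backup is full and a folder never holds
# more than one pod's backups for one day (at most 4 in this design).
from datetime import datetime, timezone
from pathlib import Path

def backup_dir(base: str, pod_name: str) -> Path:
    today = datetime.now(timezone.utc).strftime("%Y-%m-%d")
    path = Path(base) / pod_name / today     # e.g. /backups/ds-0/2024-07-22
    path.mkdir(parents=True, exist_ok=True)  # first run of the day creates it empty
    return path

print(backup_dir("/backups", "ds-0"))
```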


Results 

  • Backup frequency increased by 500% — from 4 to 24 backups per day, ensuring broad recovery coverage.

  • Improved RPO by 95% — reducing the maximum data loss window from 24 hours to just 1 hour.

  • Restore time reduced by 66% — from 30 minutes to under 10 minutes, thanks to smaller, more recent backups stored so that no folder holds more than 4 backups at any given time.

  • Decentralized load, centralized control — backup orchestration moved out of the pods into a unified Kubernetes CronJob structure that schedules distributed backups across the pods.

  • Automatic retry on failure — intelligent error detection and re-execution logic improve backup success rates and reliability.

  • Scalable and observable — design dynamically adjusts to replica count and integrates with monitoring and alerting systems.

  • Granular disaster recovery for PingDS — multiple restore points throughout the day enable near-real-time recovery if needed.


Conclusion 

Backups aren’t just routine tasks — they’re your last line of defense. By rethinking how we schedule, distribute, and monitor DS backups, we’ve created a system that aligns with real-world recovery needs. 

The result? More reliable restores, less risk, and better sleep for everyone involved. 
