Resilience Grows Through Repetition
Overview
Glen Willis explains why traditional Business Continuity and Disaster Recovery planning no longer matches how modern systems actually fail. Outages are not rare events anymore. In cloud native, distributed environments, disruption is constant, usually small in scope but persistent.
Instead of focusing only on recovery after something breaks, organizations need to operate with resilience as a daily discipline. That means designing systems to absorb stress, limiting blast radius when components fail, and keeping critical services running even when conditions are not ideal.
The conversation outlines five practical shifts: engineering for failure, introducing controlled stress before real outages occur, designing for graceful degradation, building strong observability, and practicing response regularly. Recovery plans still matter, but resilience becomes the operating model that reduces impact long before a disaster scenario unfolds.
Transcript
Glen Willis
I’m Glen Willis, Director of Risk at Kalles Group.
There’s a fundamental shift happening in enterprise technology.
For years, Business Continuity and Disaster Recovery, or BCDR, focused on catastrophic, low frequency events such as data center outages, natural disasters, hardware failure, and regional disruptions.
The assumption was simple: systems fail rarely, and when they do, you recover them. That assumption doesn’t hold anymore.
Modern environments are cloud native, API driven, distributed, and heavily integrated with SaaS providers. In that world, small failures happen constantly. Latency spikes. Dependencies degrade. Services time out. Third parties wobble.
Failure is no longer exceptional. It’s expected.
That’s where the shift to IT resilience begins.
Resilience isn’t about recovery. It’s about continuity under stress. It’s proactive instead of reactive. It’s engineered instead of documented. And it assumes disruption is inevitable.
Traditional BCDR focuses heavily on artifacts such as recovery plans, runbooks, RTO and RPO targets, backup validation, and annual tabletop exercises. Those are important. But they activate after something breaks.
Resilience asks a different question: what if the system never needed dramatic recovery in the first place? What if it degraded gracefully? What if it rerouted automatically? What if users never noticed?
That’s the mindset shift.
The first principle is engineering for failure. Resilient systems assume components will fail and design accordingly through redundant paths, stateless architecture, horizontal scaling, dependency isolation, and circuit breakers. Uptime comes from containing failure, not preventing it entirely.
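The circuit breaker pattern mentioned above can be sketched in a few lines. This is a minimal illustration, not a production implementation; the class name, thresholds, and cooldown values are illustrative assumptions rather than anything from the source.

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: after repeated failures it trips open
    and fails fast, then allows a single probe call once a cooldown
    has elapsed (the "half-open" state)."""

    def __init__(self, failure_threshold=3, cooldown_seconds=30.0):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.failure_count = 0
        self.opened_at = None  # timestamp when the breaker tripped

    def call(self, fn, *args, **kwargs):
        # While open, fail fast instead of hammering a sick dependency.
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown_seconds:
                raise RuntimeError("circuit open: failing fast")
            # Cooldown elapsed: half-open, allow one probe call through.
            self.opened_at = None
            self.failure_count = 0
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failure_count += 1
            if self.failure_count >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip open
            raise
        self.failure_count = 0  # a success resets the counter
        return result
```

The point of the pattern is containment: when a dependency is failing, callers stop waiting on it and get an immediate, handleable error, which limits the blast radius of the failure.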
Second is controlled stress testing. Historically, recovery was tested once a year. Resilient organizations introduce failure intentionally and continuously. They simulate outages, inject latency, and remove infrastructure nodes. You don’t learn resilience during an outage. You learn it before the outage happens.
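One simple form of the latency injection described above is a wrapper applied to dependency calls in a test or staging environment. This is a toy sketch under assumed names and parameters, not a real chaos engineering tool:

```python
import random
import time

def with_injected_latency(fn, probability=0.1, delay_seconds=0.5, rng=None):
    """Wrap a dependency call so that, with the given probability, an
    artificial delay runs before it. Used to rehearse how callers
    behave when a dependency gets slow, before it happens for real."""
    rng = rng or random.Random()

    def wrapped(*args, **kwargs):
        if rng.random() < probability:
            time.sleep(delay_seconds)  # simulated latency spike
        return fn(*args, **kwargs)

    return wrapped
```

Wrapping a client call this way quickly reveals whether timeouts, retries, and fallbacks actually behave the way the runbook assumes they do.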
Third is graceful degradation. Not every feature needs to remain fully available at all times. Under stress, critical functionality stays online while non essential features are reduced. Users experience less richness, not total interruption.
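Graceful degradation often comes down to separating critical from optional work when assembling a response. A minimal sketch, with hypothetical section names and fetchers:

```python
def render_product_page(get_reviews, get_recommendations, product):
    """Assemble a page where the product details are critical but
    reviews and recommendations are optional. If an optional data
    source fails, degrade that section instead of failing the page."""
    page = {"product": product}
    optional_sections = [
        ("reviews", get_reviews),
        ("recommendations", get_recommendations),
    ]
    for section, fetch in optional_sections:
        try:
            page[section] = fetch(product["id"])
        except Exception:
            page[section] = None  # degrade: drop the section, keep the page
    return page
```

The user sees a page with less richness, not an error screen: exactly the trade the transcript describes.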
Fourth is observability. It’s not enough to know something broke. You need to understand why, where, and how impact spreads. Metrics, traces, logs, dependency mapping, and user telemetry provide real time clarity.
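The raw material for that clarity is per-operation counters and latency measurements attached to every call. A toy recorder, purely illustrative (real systems would use an instrumentation library rather than this hand-rolled class):

```python
import time
from collections import defaultdict

class Telemetry:
    """Toy metrics recorder: counts successes and errors per operation
    and tracks latency samples, the inputs to dashboards and alerts."""

    def __init__(self):
        self.counters = defaultdict(int)
        self.latencies = defaultdict(list)

    def observe(self, operation, fn, *args, **kwargs):
        start = time.monotonic()
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.counters[(operation, "error")] += 1
            raise
        finally:
            # Record latency whether the call succeeded or failed.
            self.latencies[operation].append(time.monotonic() - start)
        self.counters[(operation, "ok")] += 1
        return result
```

Even this crude version answers the first incident questions: which operation is failing, how often, and how slowly.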
The fifth piece is habit and repetition. This is often overlooked.
The teams that handle incidents well don’t rely on heroics. They practice. They revisit decisions with business partners. They check assumptions regularly, even when nothing urgent is happening. That repetition builds coordination and improves decision quality. It reduces friction with the business. Muscle memory matters.
Many organizations invest heavily in plans and tooling. The gap usually appears in practice. Tabletop exercises happen once a year. Assumptions go untested. Confidence erodes quietly.
The most resilient teams don’t aim for perfection. They aim for familiarity.
Where does that leave BCDR?
It becomes the outer safety net, the last line of defense. Recovery planning still matters. Backups still matter. Compliance still matters. But resilience becomes the day to day operating model.
Organizations that make this shift see measurable benefits: reduced incident impact, faster recovery, stronger customer trust, lower operational stress, improved scalability, and a stronger risk posture overall.
If you’re assessing readiness, look less at writing new plans and more at building new habits. The future of enterprise technology isn’t about building systems that never fail. That’s unrealistic. It’s about building systems that bend, adapt, and continue delivering value even when conditions aren’t ideal.
That’s IT resilience.
