
VI #026: Graceful Degradation: Building Resilient Large-scale Distributed Software for Optimized User and Engineer Experience

Read time: 5.5 minutes

 

Today, we will explain how to build large-scale distributed software that never fails - or, more accurately, fails in a way that's invisible to the user.

Let's use a real-life example for clarity: a search interface in a B2B SaaS HR application. This example is inspired by an application that my team and I previously developed, which is now globally used by our clients.

Why should you want to learn this? Because it makes for happier users and engineers, contributes to overall system stability and resilience, and is essential for any growing SaaS startup.

Unfortunately, graceful degradation – the ability of a system to continue functioning in the event of partial system failure – often seems elusive to many.

 

Misunderstanding is the primary obstacle.

Here are the common reasons engineering teams struggle with it:

  • They don't comprehend that "local failures are okay, global failures are not" across a distributed application.
  • They overlook the UX implications of temporary failures & degradation.
  • They don't consider API dependencies and fallback mechanisms.
  • They disregard the implications of destructive operations.
  • They neglect service responsibility and partitioning.
  • They skip over the importance of thorough testing and system validation.
  • They fail to implement effective monitoring and alerting systems.
  • They don't cultivate a culture of resilience.

But there's hope: you can navigate these pitfalls and build a resilient system that ensures an optimized user and engineer experience.

Here's how, step by step:

 

1. Embrace the Principle of "Local Failures are Okay, Global Failures are Not."

In our HR application, imagine the search service encounters a hiccup. In a system designed with graceful degradation in mind, this failure stays local and is quickly rectified: the search function might be momentarily affected, but the rest of the application continues to run seamlessly.

This approach of containing and quickly addressing local failures mirrors the strategy we employed in the real-world implementation: the application stays up and converges to an eventually consistent state. It's crucial to remember that users typically don't notice inconsistencies that are resolved in less than a second, so understanding and operating within these tolerances can significantly improve your system's resilience.
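
To make this concrete, here's a minimal TypeScript sketch of the idea. It isn't our production code; `searchEmployees`, the dashboard shape, and the hard-coded notice are hypothetical stand-ins for any non-critical dependency:

```typescript
// Hypothetical types and calls for illustration; not the real application's code.
interface SearchResult { employeeIds: string[] }
interface DashboardData { notices: string[]; search: SearchResult | null }

// In a real system this would be an HTTP/RPC call to the search service.
async function searchEmployees(query: string): Promise<SearchResult> {
  throw new Error("search service unavailable"); // simulate a local failure
}

async function loadDashboard(query: string): Promise<DashboardData> {
  const notices = ["Payroll closes Friday"]; // parts of the page that don't need search

  let search: SearchResult | null = null;
  try {
    search = await searchEmployees(query);
  } catch (err) {
    // Local failure: note it and degrade, instead of failing the whole request.
    console.warn("search degraded:", (err as Error).message);
  }

  return { notices, search }; // the rest of the dashboard still renders
}

loadDashboard("engineering").then((data) => console.log(data));
```

The failure stays contained in one branch of the request; everything else behaves exactly as it would on a good day.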

 

2. Focus on the User Experience (UX) During Failures

When it comes to managing temporary failures, consider the user experience.

Let's return to our HR application example. If the search function fails, users should be able to utilize other navigational tools like breadcrumbs and menus until the search service is restored. This contingency plan keeps users engaged and productive even in the face of temporary service interruptions.

Similarly, any temporary failures or degradations should be communicated subtly to the users, avoiding panic-inducing error messages. If search results aren't immediately available, a non-intrusive prompt such as "search results temporarily unavailable, please try again" creates a much smoother experience than a glaring red error message.
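
As a sketch of that idea (the message text and names below are illustrative, not lifted from the actual product), the UI layer can translate a failed search call into a calm, retryable state rather than surfacing a raw error:

```typescript
// Illustrative view-model for the search box; not the real application's code.
type SearchView =
  | { kind: "results"; items: string[] }
  | { kind: "degraded"; message: string };

async function renderSearch(fetchResults: () => Promise<string[]>): Promise<SearchView> {
  try {
    return { kind: "results", items: await fetchResults() };
  } catch {
    // A subtle, actionable notice instead of a panic-inducing error screen.
    return {
      kind: "degraded",
      message: "Search results are temporarily unavailable. Please try again in a moment.",
    };
  }
}

// The caller renders either the result list or the gentle notice.
renderSearch(async () => { throw new Error("timeout"); }).then((view) => console.log(view));
```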

The user experience matters, even amid a service disruption, and thoughtful communication can go a long way.

 

3. Account for API Dependencies and Destructive Operations

Now, let's dive into managing API dependencies and handling destructive operations.

In the context of our HR application, suppose one of the data services feeding into the search function fails. The question then becomes: how much degradation can the services that depend on it, like the search function, tolerate while that data is unavailable? A robust application design gives those dependent services fallback mechanisms to use until the underlying service is restored. For instance, they could serve cached or slightly stale results, or temporarily suppress minor errors.
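
One way to sketch such a fallback, assuming a simple in-memory cache purely for illustration, is to serve the last known good response and mark it as possibly stale whenever the underlying data service is down:

```typescript
// Stale-cache fallback sketch; a production system would use a shared cache with TTLs.
interface Profile { id: string; name: string }

const lastGood = new Map<string, Profile>(); // last successful response per employee id

async function getProfile(
  id: string,
  fetchFresh: (id: string) => Promise<Profile>
): Promise<{ profile: Profile; stale: boolean }> {
  try {
    const fresh = await fetchFresh(id);
    lastGood.set(id, fresh); // remember the latest good value
    return { profile: fresh, stale: false };
  } catch (err) {
    const cached = lastGood.get(id);
    if (cached) {
      return { profile: cached, stale: true }; // degrade to slightly stale data
    }
    throw err; // nothing to fall back to: let the caller degrade further (e.g. hide the widget)
  }
}
```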

Additionally, your system must be prepared to handle destructive operations without causing data loss, particularly for critical operations such as financial transactions.

Implement strategies such as event sourcing, where user actions are captured in a durable, ordered log of events that can be replayed later, giving you eventual consistency even when disruptions occur. This way, the user experience remains largely unaffected during an outage.
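
Here's a minimal sketch of that pattern. The event type, the in-memory log, and the salary example are all hypothetical; in production the log would live in a durable, append-only store:

```typescript
// Event-sourcing sketch: commands are appended to an ordered log first, and the current
// state is derived by replaying that log. Names and the in-memory array are illustrative.
interface SalaryChanged { type: "SalaryChanged"; employeeId: string; newSalary: number; at: number }
type DomainEvent = SalaryChanged;

const eventLog: DomainEvent[] = []; // in production: a durable, ordered store

function recordSalaryChange(employeeId: string, newSalary: number): void {
  // The destructive intent is captured before any read model is updated.
  eventLog.push({ type: "SalaryChanged", employeeId, newSalary, at: Date.now() });
}

// Read models can be rebuilt at any time, so a crashed consumer simply replays the log.
function currentSalaries(): Map<string, number> {
  const salaries = new Map<string, number>();
  for (const e of eventLog) salaries.set(e.employeeId, e.newSalary);
  return salaries;
}

recordSalaryChange("emp-1", 70000);
recordSalaryChange("emp-1", 72000);
console.log(currentSalaries()); // Map { 'emp-1' => 72000 }
```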

It's about understanding the interconnectedness of your services, planning accordingly, and ensuring that the user's trust in your system remains unbroken.

 

4. Consider Partitioning and the Responsibility of Services

Step 4 is about effectively partitioning responsibilities and incorporating self-healing processes.

In our HR application, this approach breaks down the monolithic 'MVC' architecture, distributing the responsibilities among services based on their function or client base. This way, even if one partition fails, it doesn't impact the entire system but only a specific function or a subset of users, preserving the overall user experience.

Embracing the concept of idempotency further bolsters our architecture's simplicity and resilience.

When a service fails mid-operation, it can resume from where it left off upon restart, eliminating the need for cross-verification with other services. This reduces complexity and contributes to a robust and resilient system.
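
A minimal sketch of that idea, assuming a caller-supplied idempotency key (the in-memory store and names are illustrative):

```typescript
// Idempotency sketch: repeated attempts with the same key are detected and return the
// original result, so retries after a crash never double-apply the operation.
const processed = new Map<string, string>(); // idempotencyKey -> result (illustrative store)

async function applyOnce(
  idempotencyKey: string,
  operation: () => Promise<string>
): Promise<string> {
  const existing = processed.get(idempotencyKey);
  if (existing !== undefined) return existing; // already done: safe no-op on retry

  const result = await operation();
  processed.set(idempotencyKey, result);
  return result;
}

// A client that restarts mid-operation simply retries with the same key.
applyOnce("promote-emp-42-v1", async () => "promoted")
  .then(() => applyOnce("promote-emp-42-v1", async () => "promoted again"))
  .then((result) => console.log(result)); // "promoted"; the second call was a no-op
```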

Coupled with self-healing capabilities, where known errors are logged for later analysis and the support team is alerted for unknown ones, this step mitigates user disruptions and eases pressure on engineers.
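
A sketch of that routing logic might look like this; the error codes and console calls stand in for whatever logging and paging tools you actually use:

```typescript
// "Log known errors, alert on unknown ones" sketch. Error codes and outputs are placeholders.
const KNOWN_ERRORS = new Set(["SEARCH_TIMEOUT", "INDEX_REBUILDING"]);

function handleServiceError(code: string, details: string): void {
  if (KNOWN_ERRORS.has(code)) {
    // Known, self-healing condition: record it for later analysis and move on.
    console.info(`known error ${code}: ${details}`);
  } else {
    // Unknown condition: make it loud so the support team gets alerted.
    console.error(`ALERT unknown error ${code}: ${details}`);
  }
}

handleServiceError("SEARCH_TIMEOUT", "retrying in 5s"); // logged quietly
handleServiceError("NULL_TENANT", "tenant id missing on request"); // raises an alert
```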

By embracing partitioning and effective service healing, you're fostering an environment of graceful degradation that benefits both users and engineers.

 

5. Implement Thorough Testing and System Validation

The fifth step involves proactively testing your system's resilience, employing chaos testing to introduce failures intentionally.

By identifying and rectifying weak spots before users are affected, we validate our architecture's robustness. This is not just about endurance but understanding system behaviors under stress, allowing for the optimization of resilience and graceful degradation patterns.
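
A minimal sketch of that kind of fault injection, with a hypothetical failure rate you'd only ever enable in a test environment:

```typescript
// Fault-injection sketch: wrap a dependency so a configurable fraction of calls fails,
// then observe whether the caller still degrades gracefully. Illustrative only.
function withChaos<T>(call: () => Promise<T>, failureRate: number): () => Promise<T> {
  return async () => {
    if (Math.random() < failureRate) {
      throw new Error("chaos: injected failure");
    }
    return call();
  };
}

// Example: run the search dependency with 30% injected failures and exercise the fallback path.
const flakySearch = withChaos(async () => ["alice", "bob"], 0.3);
flakySearch()
  .then((results) => console.log("results:", results))
  .catch((err) => console.log("degraded path exercised:", (err as Error).message));
```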

Chaos testing isn't about eliminating all possible failures but learning to manage them effectively.

 

6. Invest in Monitoring, Alerting, and a Culture of Resilience

In the final step, couple a proactive approach to monitoring with a shift in organizational culture.

Embed real-time system monitoring and alerting in your architecture, as we did with our HR application, to enable swift, proactive detection and resolution of failures. Establish key performance indicators (KPIs) and service level objectives (SLOs) as central tools for maintaining user satisfaction.
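
As a sketch of how an SLO can drive alerting (the 99.9% target and the counters below are purely illustrative):

```typescript
// SLO check sketch: compute the success ratio over a window and alert when it drops
// below the objective. The target and request counters are illustrative numbers.
interface WindowStats { total: number; failed: number }

const SLO_TARGET = 0.999; // e.g. 99.9% of search requests succeed

function checkSlo(stats: WindowStats): void {
  const sli = stats.total === 0 ? 1 : (stats.total - stats.failed) / stats.total;
  if (sli < SLO_TARGET) {
    // In a real setup this would page the on-call engineer via your alerting system.
    console.error(`SLO breach: availability ${(sli * 100).toFixed(2)}% < ${SLO_TARGET * 100}%`);
  } else {
    console.info(`SLO healthy: availability ${(sli * 100).toFixed(2)}%`);
  }
}

checkSlo({ total: 10_000, failed: 25 }); // 99.75% availability -> breach
```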

Moreover, foster a culture where engineers view failures as an inevitable part of the process and focus on effectively designing systems to handle these events. Emphasize the importance of transparent communication, ensuring all these strategies and processes are well-documented and shared across your organization.

This fosters a culture of understanding and preparedness, enhancing your ability to consistently deliver an optimized user experience.

 

That's it for today.

 

By following these steps, you'll be well on your way to building a system that gracefully degrades, ensuring optimized user and engineer experience while enhancing overall system resilience:

  1. Embrace the Principle of "Local Failures are Okay, Global Failures are Not"
  2. Focus on the User Experience (UX) During Failures
  3. Account for API Dependencies and Destructive Operations
  4. Consider Partitioning and the Responsibility of Services
  5. Implement Thorough Testing and System Validation
  6. Invest in Monitoring, Alerting, and a Culture of Resilience

 

See you next Sunday.

 


Whenever you're ready:

Optimize your SaaS product and engineering to accelerate growth and attract investors in under 60 days. Let me show you how. Book a call here.
