VI #010: Scaling SaaS: How to Architect for Fault Tolerance

Read time: 6 minutes

Successful CTOs understand the importance of a scalable architecture that can handle rapid growth and change. As software systems and organizations grow, occasional failures are inevitable. To scale successfully without hurting users or the organization when part of a system fails, design the system so that a failure in one component stays local instead of causing a “global” failure of the entire system, and so that the component “self-heals” quickly, i.e. without requiring human intervention. Achieving such fault tolerance in your SaaS product can be challenging and requires careful planning and execution. In this article, we'll explore some practical steps you can take to architect your SaaS product or platform for fault tolerance.

Complex distributed systems make fault tolerance difficult

Unfortunately, many SaaS startups struggle to achieve fault tolerance because they must keep shipping features to meet customer needs while grappling with the complex nature of distributed systems. Other common obstacles include a lack of expertise, insufficient budget and planning, and limited resources.

However, with the right mindset and approach, these challenges can be overcome to build a highly available and resilient SaaS product.

Based on my experiences building global-scale and mission-critical SaaS applications for a wide range of organizations spanning startups to Fortune 500s to Defense, here are four steps to help achieve this:

Step 1: Use scalable architecture patterns

To build a fault-tolerant SaaS product, start with scalable architecture patterns, which help the system handle increased workload or demand without degrading its performance or functionality. The architecture should be flexible, resilient, performant, secure, and modular, and it should account for all application layers, cross-cutting concerns, and the ecosystem the system exists within. These factors contribute to a system's ability to withstand hardware or network failures, protect against security threats, and accommodate growth.

Collaboration among the system architect(s), product/business stakeholders, and the engineering team is essential for balancing business and technical requirements and constraints. Keeping YAGNI ("You Aren't Gonna Need It") in mind helps balance feature development velocity against building scalable architectural foundations. The key is to prioritize frequent feature delivery to meet customer needs, plan for scaling the system over a known time horizon, such as the next 6-12 months, and keep the system modular enough that its constituent parts can be upgraded as customer growth demands.

There are several ways to implement scalable architecture patterns to achieve fault-tolerance.

One of the primary methods is to divide the system into smaller, independent services, so that the system continues functioning even if some components fail. Graceful degradation (or progressive enhancement) and self-healing are also very helpful here, because they minimize user impact and support burden in failure scenarios. Redundancy is another critical aspect of fault tolerance and can be implemented at both the hardware and software levels. Load balancing distributes workloads so that individual system resources do not become overloaded, allowing the system to handle high traffic demands. Caching keeps frequently accessed data in memory so it can be retrieved quickly, which reduces load on the system and the impact of single-component failures. Asynchronous communication is particularly useful when a service is slow or unresponsive, because callers do not have to wait on it and the rest of the system can continue functioning.
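As a minimal sketch of graceful degradation backed by a cache, the Python below tries a live dependency first, falls back to the last known good value if the call fails, and finally falls back to a safe default. The names StaleCache and fetch_with_fallback, and the 60-second staleness limit in the usage note, are hypothetical and chosen only for illustration; a real system would likely use a shared cache and narrower exception handling.

```python
import time
from typing import Any, Callable, Dict, Tuple


class StaleCache:
    """Keeps the last successful value per key so callers can degrade gracefully."""

    def __init__(self, max_stale_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self._max_stale = max_stale_seconds
        self._clock = clock  # injectable so tests can control time
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def put(self, key: str, value: Any) -> None:
        self._entries[key] = (self._clock(), value)

    def get(self, key: str) -> Tuple[bool, Any]:
        """Return (found, value); entries older than the limit are ignored."""
        entry = self._entries.get(key)
        if entry is None:
            return False, None
        stored_at, value = entry
        if self._clock() - stored_at > self._max_stale:
            return False, None
        return True, value


def fetch_with_fallback(key: str, fetch: Callable[[str], Any],
                        cache: StaleCache, default: Any) -> Tuple[Any, str]:
    """Try the live dependency; on failure serve a cached value, then a default.

    Returns (value, source) where source is "live", "cached", or "default".
    """
    try:
        value = fetch(key)
    except Exception:
        # The dependency failed: degrade instead of failing the whole request.
        found, cached = cache.get(key)
        return (cached, "cached") if found else (default, "default")
    cache.put(key, value)
    return value, "live"
```

For example, a product page could call fetch_with_fallback("user-1", get_recommendations, StaleCache(60), []) and still render with slightly stale or empty recommendations while the (hypothetical) recommendation service is down.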

Step 2: Test for failure scenarios

Testing for failure scenarios is essential for ensuring the reliability of your SaaS application.

An effective approach is to identify potential failure scenarios, prioritize them by impact, simulate them in a production-like environment, automate the tests, and monitor system performance while they run. Testing recovery mechanisms, such as backup and restore procedures, is also important.

Incorporating failure testing into the development process can help identify issues early and reduce the risk of future system failures.

Teams should strive to automate recovery patterns, but should also prepare for manual intervention by creating runbooks and conducting regular drills that rehearse real-world situations. Identifying failure scenarios and implementing scalable processes can prevent firefighting and minimize disruption to customers. Chaos engineering tools and approaches, such as Chaos Monkey and “game days”, can also be helpful in this regard.
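To make the idea of testing a failure scenario concrete, here is a small sketch of fault injection at the unit-test level. It is not how Chaos Monkey works (that tool terminates real instances); it only illustrates the same principle in miniature. FlakyDependency is a hypothetical test double that fails a set number of times, and call_with_retry is a hypothetical automated recovery routine using exponential backoff.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class FlakyDependency:
    """Test double that raises for the first `failures` calls, then succeeds."""

    def __init__(self, failures: int):
        self.failures = failures
        self.calls = 0

    def __call__(self) -> str:
        self.calls += 1
        if self.calls <= self.failures:
            raise TimeoutError(f"injected failure #{self.calls}")
        return "ok"


def call_with_retry(fn: Callable[[], T], attempts: int = 3,
                    base_delay: float = 0.1,
                    sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fn, retrying with exponential backoff; re-raise after the last attempt."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            sleep(base_delay * (2 ** attempt))  # 0.1s, 0.2s, 0.4s, ...
    raise AssertionError("unreachable")


# A failure test: the dependency fails twice, and recovery must still succeed.
dependency = FlakyDependency(failures=2)
result = call_with_retry(dependency, attempts=3, sleep=lambda seconds: None)
assert result == "ok" and dependency.calls == 3
```

Passing sleep as a parameter keeps the test fast and deterministic; the same pattern can be pointed at a staging environment with real delays during a game day.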

Step 3: Implement effective monitors and alerts

Implementing proper monitoring and alerts is crucial for ensuring that your SaaS system stays up and running, even in the event of a failure.

Start by defining your key performance indicators (KPIs) and setting up monitoring tools to track system performance and availability. There are a variety of tools to choose from, such as Datadog, New Relic, Nagios, or Dynatrace, and these can be used to create dashboards that provide real-time visibility into the health of your system.

Configure alerts triggered by thresholds that indicate an issue with the system, and establish escalation procedures with clear notification protocols. Regularly test your alerting and response procedures and also monitor third-party services your system depends on. Where possible, use automated responses to quickly resolve issues that don't require human intervention, such as restarting a service or spinning up additional instances of an application.
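A threshold-based alert with a simple escalation path might look like the sketch below. The AlertRule fields, the "warn" and "page" levels, and the example thresholds are all assumptions made for illustration; tools such as Datadog or Nagios have their own rule formats, which this does not attempt to reproduce. Requiring several consecutive breaching samples is one common way to avoid alerting on a single spike.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class AlertRule:
    name: str
    warn_at: float        # notify the team channel at or above this value
    page_at: float        # page the on-call engineer at or above this value
    for_samples: int = 3  # consecutive breaching samples required before firing


def evaluate(rule: AlertRule, samples: Sequence[float]) -> str:
    """Return "ok", "warn", or "page" based on the most recent samples."""
    if rule.for_samples < 1:
        raise ValueError("for_samples must be at least 1")
    recent = list(samples)[-rule.for_samples:]
    if len(recent) < rule.for_samples:
        return "ok"  # not enough data yet to decide
    if all(value >= rule.page_at for value in recent):
        return "page"
    if all(value >= rule.warn_at for value in recent):
        return "warn"
    return "ok"


# Hypothetical rule: warn at a 2% error rate, page at 10%.
error_rate_rule = AlertRule(name="error_rate", warn_at=0.02, page_at=0.10)
```

With this rule, evaluate(error_rate_rule, [0.01, 0.50, 0.01]) stays "ok" because the spike is not sustained, while three consecutive samples at or above 0.10 escalate to "page". An automated response, such as restarting a service, could be attached to the "warn" level before a human is paged.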

To get started, consider the Four Golden Signals from Google's Site Reliability Engineering (SRE) practice: latency, traffic, errors, and saturation. Monitoring these signals provides a broad view of system behavior and helps you identify and respond to issues quickly, so that your system remains fault-tolerant and scalable. Continuously review your monitoring and alerting processes to identify areas for improvement.
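As a rough sketch of what computing these four signals involves, the function below derives them from a window of request records. The Request type, the choice of 95th-percentile latency, counting only HTTP 5xx responses as errors, and measuring saturation as in-flight requests divided by capacity are all assumptions for illustration; in practice a monitoring tool would compute these from its own telemetry.

```python
import math
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass(frozen=True)
class Request:
    duration_ms: float
    status: int  # HTTP status code


def golden_signals(requests: Sequence[Request], window_seconds: float,
                   in_flight: int, capacity: int,
                   percentile: float = 95.0) -> Dict[str, float]:
    """Summarize latency, traffic, errors, and saturation for one time window."""
    if window_seconds <= 0 or capacity <= 0:
        raise ValueError("window_seconds and capacity must be positive")
    total = len(requests)
    if total == 0:
        latency_ms, error_rate = 0.0, 0.0
    else:
        durations = sorted(r.duration_ms for r in requests)
        # Nearest-rank percentile: the smallest value with at least
        # `percentile` percent of the samples at or below it.
        rank = max(1, math.ceil(percentile * total / 100))
        latency_ms = durations[rank - 1]
        error_rate = sum(1 for r in requests if r.status >= 500) / total
    return {
        "latency_ms": latency_ms,               # latency
        "requests_per_second": total / window_seconds,  # traffic
        "error_rate": error_rate,               # errors
        "saturation": in_flight / capacity,     # saturation
    }
```

Each value in the returned dictionary can then be compared against an alert threshold or plotted on a dashboard.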

Step 4: Foster a culture of continuous improvement

Finally, to ensure the long-term fault tolerance of your SaaS product, it is crucial to create a culture of continuous improvement within your team. One way to achieve this is to encourage experimentation and risk-taking. By allowing your team members to experiment with new technologies and approaches, you can foster an environment where potential issues can be identified and addressed before they become critical. Additionally, this approach can lead to innovative solutions and new features that can differentiate your product from competitors.

Regular blameless post-mortems are another essential step in the continuous improvement process.

By reviewing outages and issues without blame, involving the entire team in the process, and determining how to prevent similar issues in the future, you can continuously improve your system. Open communication is also key, helping to create an environment where team members feel comfortable sharing their ideas and concerns, and everyone is aligned on the team's goals.

Providing training and development opportunities is another effective way to foster continuous improvement. Keeping the team up-to-date with the latest technologies and best practices helps them to identify potential issues and improve the system continuously.

Finally, using continuous integration and deployment can help to identify issues early, before they become critical, and ensure that fixes are rolled out quickly.

In summary

By following these steps, you can architect your SaaS product for fault tolerance and ensure that it can handle rapid growth and changes:

Use scalable architecture patterns
Test for failure scenarios
Implement effective monitors and alerts
Foster a culture of continuous improvement

Hope this helps. See you next Sunday.

Whenever you’re ready, there are 2 ways I can help you:

Work with me 1:1 to build your team, product, platform, or career in tech.
Book a free discovery call with me to explore if your business needs would be a good fit for my advisory services. If we’re not a good fit, rest assured I’ll kindly let you know.

Build, launch, and scale world-class AI-powered products and platforms.
