Building resilient systems is crucial to ensuring that applications keep functioning in the face of unexpected failures and disruptions. Designing systems that can bounce back quickly and effectively is essential for long-term success. In this episode, we speak with Software Developer Bart de Water, who shares his expertise on the principles and strategies for creating reliable systems that can withstand both anticipated and unexpected failures.
Bart de Water started his career working at a startup straight out of college. Later on, he moved to Montreal and joined the eCommerce platform Shopify to help develop their payment system. After six years of working at Shopify, Bart moved to Thatch, a brand-new application for managing health plan payments.
Bart became an expert in designing resilient systems after experiencing firsthand how easily small issues could grow to the point of taking down whole systems.
Manage timeouts and retries
“We were working on building a single sign-on system for the UK government called GOV.UK Verify,” Bart recalls. At launch, although his team’s own product worked as expected, it was effectively taken down by a failing third-party backend service.
Calls to that service had a 60-second timeout, so every thread that contacted it was tied up waiting for up to a minute and could not serve any other request in the meantime. A timeout sets a maximum time limit for an operation to complete: if the operation takes longer than the specified time, it is canceled automatically. In Bart’s anecdote, the timeout was so long that it kept the system from handling other critical tasks, such as sending email confirmation links or 2FA codes. “People were complaining that they couldn’t log in, they couldn’t complete their registration, even though our system was fine, it effectively was down because of this other system being down,” he says.
According to Bart, lowering timeouts can solve problems that arise when your system depends on external systems whose uptime you do not control. It is therefore crucial to set a timeout for every external service your system calls and to keep each one reasonably low. This applies to HTTP calls as well as to data structure stores such as Redis. Bart also recommends studying your tools and languages to find and configure their timeouts, as “even internally in your programming language, there might be something like a foot gun waiting for you to stumble upon.”
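As a rough illustration of setting an explicit, low timeout on an HTTP call, here is a minimal sketch using only Python's standard library. The function name `fetch_with_timeout` and the two-second default are assumptions chosen for the example, not values from the episode. Note that `urlopen`'s `timeout` applies to each blocking socket operation (connecting, each read), not to the request as a whole.

```python
import urllib.request
from typing import Optional

# Assumed budget for this example; tune it per dependency.
DEFAULT_TIMEOUT_SECONDS = 2.0


def fetch_with_timeout(url: str,
                       timeout_seconds: float = DEFAULT_TIMEOUT_SECONDS) -> Optional[bytes]:
    """Return the response body, or None if the call fails or times out.

    Without an explicit timeout, urlopen falls back to the global socket
    default, which normally means waiting indefinitely.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout_seconds) as response:
            return response.read()
    except OSError:
        # urllib.error.URLError and socket timeouts are both OSError subclasses.
        return None
```

Returning `None` keeps the sketch short; a real caller would more likely log the failure and decide whether to retry or degrade gracefully.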
How many retries a service should perform is, Bart explains, context-specific. Processes with real-time requirements, such as card authorizations, cannot afford to spend long retrying. For background jobs, such as jobs “to backfill some data from your upstream provider,” taking longer to complete is not an issue. For those, Bart recommends spacing retries at intervals that follow the Fibonacci sequence, so that the system keeps retrying without saturating itself or the service it calls: “you don’t care about how fast it’s done, just more that it eventually gets done.”
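The Fibonacci-spaced retries described above could be sketched as follows. This is an illustrative sketch, not Bart's implementation: the names `fibonacci_delays` and `retry_with_fibonacci_backoff`, the one-second unit, and the default of six attempts are all assumptions. The `sleep` parameter is injectable so the behavior can be checked without actually waiting.

```python
import time
from typing import Callable, Iterator, TypeVar

T = TypeVar("T")


def fibonacci_delays(unit_seconds: float = 1.0) -> Iterator[float]:
    """Yield 1, 1, 2, 3, 5, 8, ... multiplied by unit_seconds."""
    previous, current = 1, 1
    while True:
        yield previous * unit_seconds
        previous, current = current, previous + current


def retry_with_fibonacci_backoff(operation: Callable[[], T],
                                 max_attempts: int = 6,
                                 unit_seconds: float = 1.0,
                                 sleep: Callable[[float], None] = time.sleep) -> T:
    """Call operation until it succeeds or max_attempts is reached."""
    delays = fibonacci_delays(unit_seconds)
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            sleep(next(delays))  # wait longer after each failure
    raise ValueError("max_attempts must be at least 1")
```

For a real-time path such as a card authorization, the same function could be called with a small `max_attempts` and a short `unit_seconds`; a background job can afford larger values.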
Bart affirms that “networks are inherently unreliable” to the point that, in some cases, the system can time out and retry a request that was in fact already processed. This is particularly risky when handling sensitive operations such as credit card payments, where it can result in duplicate transactions.
To address unwanted retries and add a layer of resiliency to the system, Bart recommends implementing idempotency keys. An idempotency key is a unique, randomly generated value that identifies a single request. The same key accompanies every retry of that request, so the receiving system can recognize a repeat and avoid processing it more than once.
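To make the idea concrete, here is a small in-memory sketch. `PaymentService` is a hypothetical stand-in for a payment backend, and `new_idempotency_key` is an assumed helper name; a production system would store keys durably and handle concurrent requests, which this sketch does not attempt.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, List


def new_idempotency_key() -> str:
    """Generate a random key once per logical request, before the first attempt."""
    return str(uuid.uuid4())


@dataclass
class PaymentService:
    """Hypothetical in-memory payment backend that honors idempotency keys."""
    charges: List[int] = field(default_factory=list)
    responses_by_key: Dict[str, dict] = field(default_factory=dict)

    def charge(self, idempotency_key: str, amount_cents: int) -> dict:
        # A key seen before means this is a retry: replay the stored response
        # instead of charging again.
        if idempotency_key in self.responses_by_key:
            return self.responses_by_key[idempotency_key]
        self.charges.append(amount_cents)
        response = {"charge_id": len(self.charges), "amount_cents": amount_cents}
        self.responses_by_key[idempotency_key] = response
        return response
```

The important detail is on the caller's side: the key is generated once and reused for every retry of the same request, while a genuinely new payment gets a new key.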
Bart emphasizes the importance of queues in managing a system’s workload efficiently. He suggests learning about queuing theory and concepts like Little’s law, which can help design systems to match expected demand while remaining cost-effective.
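Little's law states that the average number of items in a system equals the average arrival rate multiplied by the average time each item spends in the system (L = λW). The sketch below applies it to estimate how many workers a job queue might need. The function names and the 80% utilization target are assumptions for illustration only.

```python
import math


def average_in_flight(arrivals_per_second: float, seconds_in_system: float) -> float:
    """Little's law: L = lambda * W."""
    return arrivals_per_second * seconds_in_system


def workers_needed(arrivals_per_second: float,
                   seconds_per_job: float,
                   target_utilization: float = 0.8) -> int:
    """Estimate worker count so average load stays at or below target_utilization.

    The utilization target is an assumed safety margin, not part of Little's law.
    """
    busy_workers = average_in_flight(arrivals_per_second, seconds_per_job)
    return math.ceil(busy_workers / target_utilization)
```

For example, 40 jobs per second at 0.25 seconds each keeps 10 workers busy on average, so running at most 80% utilized calls for 13 workers.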
Bart also notes that developers often overlook queue latency when monitoring and diagnosing systems, even though it is just as crucial as response latency. Everyone agrees that waiting five seconds for a page to render is unacceptable, yet a job queue quietly growing in the background can cause failures that are just as serious. As an example of a queue getting too long, Bart reflects on a feature launch at the startup. A background job system generated two-factor authentication codes and then queued them for sending. The queue was so backed up that by the time the message containing a code was sent, the code had already expired. For this reason, Bart suggests setting limits on queue length to prevent similar issues.
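One way to picture a length-limited queue is the sketch below. The class name `BoundedJobQueue` and its parameters are hypothetical. The length limit follows Bart's suggestion; discarding jobs that have waited longer than `max_age_seconds` is an additional assumption, added here because it matches the expired-code anecdote.

```python
import time
from collections import deque
from typing import Callable, Deque, Optional, Tuple


class BoundedJobQueue:
    """FIFO job queue that rejects new jobs when full and skips stale ones."""

    def __init__(self, max_length: int, max_age_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._max_length = max_length
        self._max_age_seconds = max_age_seconds
        self._clock = clock
        self._jobs: Deque[Tuple[float, str]] = deque()

    def enqueue(self, job: str) -> bool:
        """Return False instead of growing past the limit."""
        if len(self._jobs) >= self._max_length:
            return False
        self._jobs.append((self._clock(), job))
        return True

    def dequeue(self) -> Optional[str]:
        """Return the oldest job that is still fresh, or None if there is none."""
        while self._jobs:
            enqueued_at, job = self._jobs.popleft()
            if self._clock() - enqueued_at <= self._max_age_seconds:
                return job
            # Too old to be useful (for example an expired 2FA code): drop it.
        return None
```

A rejected `enqueue` gives the caller an immediate signal that the system is overloaded, which is usually easier to handle than a message that silently arrives too late.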
Implement circuit breakers
A circuit breaker is a design pattern that monitors calls to a dependency, such as a third-party service, and interrupts communication with it on detecting signs of malfunction, so that the failing dependency does not stall our own system. “If you know that something is down, you might as well just stop trying for a little while,” Bart recommends.
Once you have identified that a system is down, you should assume it will probably remain down for at least a couple of minutes. Hence, the next step is to “figure out whether your web framework or programming language has a circuit breaker that, once you’ve triggered an X amount of errors, short circuits the next attempt for a certain period.” This prevents requests from piling up and then hitting the recovering service all at once, a sudden flood known as the thundering herd problem, which can take the system down again.
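The behavior Bart describes, tripping after a number of errors and short-circuiting further attempts for a period, can be sketched in a few lines. This is a simplified illustration rather than any particular library's API: the class names, the default threshold of three errors, and the 60-second cooldown are all assumptions.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when a call is skipped because the circuit is open."""


class CircuitBreaker:
    def __init__(self, error_threshold: int = 3, cooldown_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._error_threshold = error_threshold
        self._cooldown_seconds = cooldown_seconds
        self._clock = clock
        self._consecutive_errors = 0
        self._opened_at: Optional[float] = None  # None means the circuit is closed

    def call(self, operation: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._cooldown_seconds:
                # Still cooling down: fail fast without touching the dependency.
                raise CircuitOpenError("dependency marked as down; skipping call")
            # Cooldown elapsed: let one trial call through.
        try:
            result = operation()
        except Exception:
            self._consecutive_errors += 1
            trial_failed = self._opened_at is not None
            if trial_failed or self._consecutive_errors >= self._error_threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and resets the error count.
        self._consecutive_errors = 0
        self._opened_at = None
        return result
```

Callers wrap each call to the dependency in `breaker.call(...)` and treat `CircuitOpenError` as a fast failure. Before writing your own, it is worth checking whether your framework or language ecosystem already provides a well-tested circuit breaker.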
The signals that inform circuit breaker design are summarized in Google’s four golden signals for monitoring distributed systems:
- Latency. How long it takes to process a request. It is important to distinguish the latency of successful requests from that of failed ones.
- Traffic. The demand placed on the system, for example how many requests arrive per second.
- Errors. The rate of requests that fail and why they fail.
- Saturation. How full the system is: how much of its capacity is in use now, and how soon it is expected to run out.
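As a rough sketch of how the four signals might be computed from raw request records, consider the following. The `RequestRecord` type, the `golden_signals` function, and the use of busy versus total workers as a saturation measure are assumptions made for this example. Real monitoring systems typically track latency percentiles rather than a simple mean.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, Optional, Sequence


@dataclass
class RequestRecord:
    duration_seconds: float
    succeeded: bool


def golden_signals(requests: Sequence[RequestRecord], window_seconds: float,
                   busy_workers: int, total_workers: int) -> Dict[str, Optional[float]]:
    """Summarize one observation window into the four golden signals."""
    ok = [r.duration_seconds for r in requests if r.succeeded]
    failed = [r.duration_seconds for r in requests if not r.succeeded]
    return {
        # Latency, reported separately for successful and failed requests.
        "latency_ok_seconds": mean(ok) if ok else None,
        "latency_failed_seconds": mean(failed) if failed else None,
        # Traffic: requests per second over the window.
        "traffic_per_second": len(requests) / window_seconds,
        # Errors: fraction of requests that failed.
        "error_rate": len(failed) / len(requests) if requests else 0.0,
        # Saturation: share of capacity currently in use.
        "saturation": busy_workers / total_workers,
    }
```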
Bart also advises being careful about how you key circuit breakers. He recalls a time at Shopify when a single endpoint handled payments worldwide: a high error rate in one country tripped that endpoint’s circuit breaker, taking down payment processing everywhere. To fix this, he added the merchant’s country code to the circuit breaker key, alongside the API endpoint’s host and port, so that each country got its own circuit breaker. As a result, “the next time there was an outage that was confined to a single country’s payment network that did not take down our global payment processing.”
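The effect of including the country code in the key can be shown with a deliberately simplified sketch. `KeyedErrorTracker` and `breaker_key` are hypothetical names, `payments.example.com` is a placeholder host, and the tracker only counts consecutive errors per key, leaving out the cooldown logic a full circuit breaker would have.

```python
from typing import Dict, Tuple

BreakerKey = Tuple[str, int, str]  # (host, port, country code)


def breaker_key(host: str, port: int, country_code: str) -> BreakerKey:
    """Build a key that isolates failures per country behind the same endpoint."""
    return (host, port, country_code.upper())


class KeyedErrorTracker:
    """Counts consecutive errors per key; a key is 'open' at the threshold."""

    def __init__(self, error_threshold: int = 3) -> None:
        self._error_threshold = error_threshold
        self._errors: Dict[BreakerKey, int] = {}

    def record(self, key: BreakerKey, succeeded: bool) -> None:
        # A success resets the count; a failure increments it.
        self._errors[key] = 0 if succeeded else self._errors.get(key, 0) + 1

    def is_open(self, key: BreakerKey) -> bool:
        return self._errors.get(key, 0) >= self._error_threshold
```

With keys built this way, repeated failures recorded under `breaker_key("payments.example.com", 443, "CA")` open only that key, while requests keyed with another country code for the same host and port continue to flow.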
Bart thinks it is essential to learn how to respond when a system or component fails in new ways: “There’s going to be new ways that your system is going to fail and that’s fine.” To learn from unexpected failures and prevent others, he also advises challenging the assumptions made when designing the system, to find out whether any were mistaken or whether blind spots were overlooked.
The bottom line
To learn more about designing resilient systems, check out Bart’s article “10 Tips for Building Resilient Payment Systems.” You can follow Bart on Twitter, GitHub, and Rubysocial.