Stress-Test Your Software to Prevent a Southwest-Type Calamity

Meltdowns are never fun. When failure isn't an option, make sure that your critical software is up to the task.

John Edwards, Technology Journalist & Author

March 9, 2023

If there’s one lesson to be learned from Southwest Airlines’ system collapse last December, it’s that critical software must be regularly tested to ensure that it can handle extreme conditions.

Stress testing critical software saves organizations time and money -- something Southwest learned the hard way, says Stephen Feloney, vice president of products, continuous testing, at application development tools provider Perforce. Performance testing can simulate high-traffic periods and show how software will behave under them. “Identifying and fixing errors before they’re customer-facing decreases the possibility of a crash and prevents fiascos like Southwest Airlines from occurring,” he notes.

The worst time to discover that your critical software can't handle a high load or another stressful situation is when that load hits the live environment, says Arie Trouw, CEO and CTO of XYO, developer of a technology protocol designed to improve data validity, certainty, and value. “Stress tests are the only way to validate that your architecture and implementation can weather a Southwest-like crisis.”

Stress testing is analogous to conducting a fire drill in an office building, says Rohan Padhye, an assistant professor at the Carnegie Mellon University School of Computer Science. “The goal is to ensure that contingencies designed for handling extreme and unexpected conditions, such as emergency protocols and fallback systems, actually operate as designed.”

Testing, Testing

Stress tests typically subject a software system to very large workloads in the form of a high volume of requests or a high rate of failure in individual components. “The idea is to simulate a worst-case scenario with potentially unpredictable side effects,” Padhye says.
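
As a rough illustration of both ingredients Padhye describes, a high volume of requests and a high rate of component failure, the following Python sketch fires concurrent requests at a stand-in handler while injecting faults. Everything here is an assumption made for illustration: `handle_request` and `run_stress_test` are hypothetical names, and the handler stands in for whatever real service would be under test.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor


def handle_request(component_fails):
    """Hypothetical request handler: falls back when its dependency fails."""
    start = time.perf_counter()
    try:
        if component_fails:
            raise RuntimeError("component unavailable")  # injected fault
        outcome = "ok"
    except RuntimeError:
        outcome = "fallback"  # the contingency path the stress test exercises
    return outcome, time.perf_counter() - start


def run_stress_test(total_requests, workers, failure_rate, seed=0):
    """Fire many concurrent requests while failing a share of component calls."""
    rng = random.Random(seed)
    # Decide up front which requests hit a failed component, so runs are repeatable.
    faults = [rng.random() < failure_rate for _ in range(total_requests)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(handle_request, faults))
    outcomes = [outcome for outcome, _ in results]
    return {
        "total": len(results),
        "ok": outcomes.count("ok"),
        "fallback": outcomes.count("fallback"),
        "max_latency_s": max(latency for _, latency in results),
    }


if __name__ == "__main__":
    print(run_stress_test(total_requests=5000, workers=50, failure_rate=0.3))
```

In a real test, the handler would call the actual system, and the request count, worker count, and failure rate would be pushed well past anything expected in normal operation.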

Testing reveals how a system will react to slowdowns, memory leaks, security issues, and data corruption. “Across performance-based testing, stress tests must be paired with load tests,” Feloney advises. “For example, spike tests examine how a system will fare under sudden, high ramp-up traffic, and soak tests examine the system’s sustainability over a long period.”
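
The difference between the spike and soak tests Feloney mentions comes down to the shape of the load over time. A minimal Python sketch, assuming load is expressed as requests per second for each second of the test (the function names and numbers are hypothetical, not taken from any particular testing tool):

```python
def spike_profile(baseline_rps, peak_rps, spike_start, spike_length, duration):
    """Requests per second for each second of a spike test.

    Load sits at the baseline, jumps to the peak for `spike_length`
    seconds starting at `spike_start`, then drops back.
    """
    return [
        peak_rps if spike_start <= t < spike_start + spike_length else baseline_rps
        for t in range(duration)
    ]


def soak_profile(steady_rps, duration):
    """Requests per second for each second of a soak test: steady load, long run."""
    return [steady_rps] * duration


if __name__ == "__main__":
    print(spike_profile(baseline_rps=10, peak_rps=100,
                        spike_start=2, spike_length=3, duration=6))
    # A soak test holds a moderate load for hours to surface slow problems
    # such as memory leaks; eight hours at 50 requests per second here.
    print(len(soak_profile(steady_rps=50, duration=8 * 60 * 60)))
```

A load generator would then replay either schedule against the system, sending the listed number of requests in each second.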

Stress tests can be performed either in an isolated environment built for quality assurance or directly on the live, customer-facing deployment. “While it sounds scary, testing a live deployment is far more representative of a real extreme scenario, because it also incorporates the human factor presented by users responding to the simulated events in a hard-to-predict way,” Padhye explains.

Developers should always run stress tests after an update is deployed as well as prior to anticipated high-demand events. “By identifying bottlenecks before peak traffic, teams can combat errors with the right resources and continuously monitor performance,” Feloney says. “For example, Ticketmaster's system breakdown during Taylor Swift’s The Eras Tour sale shows the importance of stress testing ahead of time to avoid the energy and costs associated with fixing a system breakdown.”

Stress tests can be conducted by IT staff or an external service provider. There's value in both approaches, Padhye says. “On the one hand, IT staff who run operations on a daily basis understand the system very well and are likely to quickly identify specific weaknesses or outdated components that must be thoroughly tested for extreme conditions,” he explains. “On the other hand, too much familiarity with operating a system can also introduce an unconscious bias about how the system is supposed to run.”

An external service provider can sometimes subject the system to corner case behavior that an internal team may not have even considered as a possibility. “A fresh pair of eyes can, therefore, enable an unbiased test of the overall system,” Padhye says. “External services are particularly useful when testing a software system for security incidents, such as potential data breaches or malicious disruptions.”

Problems and Risks

Even the most comprehensive stress test can't anticipate every possible situation, so it's important to develop a recovery plan for restarting or repairing a system after a stress-induced failure. A common example is when a specific system component fails under stress. “Restarting that part of the system is very difficult because pending queues outside of it have built up during the downtime,” Trouw says. “At that point, the stress during restart may be even higher than the stress that originally caused the outage.”
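
The restart problem Trouw describes can be seen in a toy simulation. The Python sketch below assumes a constant arrival rate, a fixed processing capacity, and an optional admission limit on how much queued work reaches the component per tick; the function name, the parameters, and the admission limit itself are illustrative assumptions rather than anything from a specific system.

```python
def simulate_restart(arrival_rate, capacity, downtime, recovery_ticks, admit_limit=None):
    """Return (peak load offered to the component after restart, final backlog).

    Work arrives at `arrival_rate` per tick. For the first `downtime` ticks the
    component is down and work only queues up. After restart it processes up to
    `capacity` per tick. If `admit_limit` is set, at most that much queued work
    is released to the component per tick.
    """
    backlog = 0
    peak_offered = 0
    for tick in range(downtime + recovery_ticks):
        backlog += arrival_rate
        if tick < downtime:
            continue  # component is down; work only piles up
        offered = backlog if admit_limit is None else min(backlog, admit_limit)
        peak_offered = max(peak_offered, offered)
        backlog -= min(offered, capacity)
    return peak_offered, backlog


if __name__ == "__main__":
    # Unthrottled restart: the whole queue lands on the component at once.
    print(simulate_restart(100, 150, downtime=10, recovery_ticks=30))
    # Throttled restart: the queue is released no faster than capacity.
    print(simulate_restart(100, 150, downtime=10, recovery_ticks=30, admit_limit=150))
```

With these numbers, the unthrottled restart offers the component 1,100 units of work in its first tick, 11 times the normal arrival rate, while the throttled restart never exceeds 150 and still clears the queue in the same window.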

One of the core problems affecting large and complex software deployments is a growing dependency on third-party products and services that aren't built or maintained by internal IT staff members. “These components can fail in many unexpected ways, or simply go out of date,” Padhye warns. “Simply deciding whether to update such components to their latest version is a challenging task.”

A risk associated with using an outdated component is that it may contain unpatched defects or security vulnerabilities. On the other hand, an updated component may cause a system failure if the component presents a significantly changed operating interface. “Testing protocols should specifically consider the various risks associated with depending on such third-party software when operating critical services,” Padhye recommends.
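
One narrow piece of this, catching a changed interface before an updated component reaches production, can be sketched as a test. The Python example below uses the standard library's `inspect.signature` to compare a function's parameter names against what the calling code relies on. The two `lookup_fare` functions are hypothetical stand-ins for an old and a new release of a third-party function, and a real check would cover far more than parameter names.

```python
import inspect


def interface_matches(func, expected_params):
    """True if `func` still takes exactly the parameter names we rely on, in order."""
    actual = list(inspect.signature(func).parameters)
    return actual == list(expected_params)


# Hypothetical stand-ins for two releases of the same third-party function.
def lookup_fare_old(origin, destination):
    return 100


def lookup_fare_new(origin, destination, fare_class):
    return 100


if __name__ == "__main__":
    expected = ["origin", "destination"]
    print(interface_matches(lookup_fare_old, expected))
    print(interface_matches(lookup_fare_new, expected))
```

Run as part of a test suite, a check like this flags the new release as incompatible before any caller breaks at runtime.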

About the Author(s)

John Edwards

Technology Journalist & Author

John Edwards is a veteran business technology journalist. His work has appeared in The New York Times, The Washington Post, and numerous business and technology publications, including Computerworld, CFO Magazine, IBM Data Management Magazine, RFID Journal, and Electronic Design. He has also written columns for The Economist's Business Intelligence Unit and PricewaterhouseCoopers' Communications Direct. John has authored several books on business technology topics. His work began appearing online as early as 1983. Throughout the 1980s and 90s, he wrote daily news and feature articles for both the CompuServe and Prodigy online services. His "Behind the Screens" commentaries made him the world's first known professional blogger.
