Anyone who was at RWTH that day or tried to connect via VPN from the road felt it: on November 21, 2024, starting around 10 a.m., many of our IT services stopped working as they should. RWTH e-mail, Eduroam, VPN, RWTHmoodle, and the entire RWTH Single Sign-On (the login for a large part of the services at RWTH) went down. All of this was caused by a malfunction of our F5 load balancer. In this blog post, we want to explain transparently how this massive disruption came about.
What Is the F5 Load Balancer?
By default, requests from a client go directly to a server. Using a load balancer, client requests are distributed among a pool of servers based on predefined criteria. The load balancer acts as an intermediary that ensures the traffic load is distributed across various servers so that utilization and speed are optimized. At the IT Center of RWTH, we currently use a load balancer from F5 Inc. Our login procedure, RWTH Single Sign-On, is particularly dependent on this load balancer. Without RWTH Single Sign-On, access to many other services such as RWTHonline, RWTHmoodle, Selfservice, etc., would no longer be possible.
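To make the distribution principle more concrete, here is a minimal Python sketch. It uses a simple round-robin strategy and invented server names; our F5 device of course supports far more sophisticated balancing criteria and health checks, so this is purely an illustration of the concept, not of our actual configuration.

```python
import itertools

# Hypothetical backend servers behind the load balancer (names are placeholders).
BACKEND_POOL = [
    "sso-node-1.example.rwth-aachen.de",
    "sso-node-2.example.rwth-aachen.de",
    "sso-node-3.example.rwth-aachen.de",
]

# Round-robin scheduling: each incoming request is handed to the next server
# in the pool, so the load is spread evenly across all of them.
_next_backend = itertools.cycle(BACKEND_POOL)

def pick_backend() -> str:
    """Return the backend server that should handle the next client request."""
    return next(_next_backend)

if __name__ == "__main__":
    for request_id in range(6):
        print(f"request {request_id} -> {pick_backend()}")
```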
How Is Our F5 Load Balancer Secured?
When so many important services depend on one device, redundancy is the top priority. Redundancy is implemented on two levels: device redundancy and site redundancy. Two devices at two locations are interconnected with cross-cabling to safeguard against line failures. In addition, there is a direct link between the devices to detect any failure of the data connection and to prevent a loss of connectivity during a switch-over. Both devices have independent uplinks to two data center locations. The connections are fiber optic links attached via transceivers known as SFPs (Small Form-factor Pluggable). Should a device, a line, or even an entire location fail, operations at RWTH can continue as usual thanks to this redundancy, without users noticing anything amiss.
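As a rough illustration of how such an active/standby pair can use the direct link between the two devices, here is a small Python sketch. The heartbeat timeout, node names, and take-over logic are assumptions made for the example; the real F5 high-availability mechanism works differently in detail.

```python
import time

# Seconds without a heartbeat before the peer is assumed down (illustrative value).
HEARTBEAT_TIMEOUT = 3.0

class LoadBalancerNode:
    """Simplified sketch of one node in an active/standby pair.

    Only illustrates the idea of a heartbeat over the direct device-to-device
    link; it is not the actual F5 failover implementation.
    """

    def __init__(self, name: str, active: bool):
        self.name = name
        self.active = active
        self.last_heartbeat_from_peer = time.monotonic()

    def receive_heartbeat(self) -> None:
        # Called whenever a heartbeat arrives over the direct link.
        self.last_heartbeat_from_peer = time.monotonic()

    def check_peer(self) -> None:
        # The standby promotes itself only if the peer has been silent too long.
        silent_for = time.monotonic() - self.last_heartbeat_from_peer
        if not self.active and silent_for > HEARTBEAT_TIMEOUT:
            print(f"{self.name}: peer silent for {silent_for:.1f}s, taking over as active")
            self.active = True

if __name__ == "__main__":
    standby = LoadBalancerNode("lb-standby", active=False)
    standby.receive_heartbeat()          # heartbeat arrives: peer is alive
    time.sleep(HEARTBEAT_TIMEOUT + 0.5)  # simulate the peer going silent
    standby.check_peer()                 # standby promotes itself to active
```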
What Happened on November 21, 2024?
Bringing down a redundant system requires not just one problem but several errors coinciding. That is exactly what happened on Thursday, November 21, 2024. An SFP on one line between two buildings had suffered a hardware failure; this went unnoticed and had no impact thanks to the existing redundancy in the systems. During the morning hours, a standard software update was performed on the device that was not actively serving traffic (the standby device), which normally would not have been noticed during operation. However, this software update also affected the use of the SFP on that internal building line, and combined with the hardware failure of the other SFP line, which was still unknown at that point, this led to an outage of the entire device. Unfortunately, instead of simply ceasing operation when disrupted, the faulty device sent an error status to the functioning device at the other location. That device reacted by switching itself into standby mode, automatically handing traffic back to the defective device. Throughout the afternoon, this ping-pong behavior caused logins and services to work intermittently for a few minutes at a time before being disrupted again.
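The ping-pong effect can be pictured with a small simulation. The following Python sketch is a deliberately simplified model of the behavior we observed, with made-up device names; it is not F5's actual failover logic.

```python
# Simplified model of the observed "ping-pong" behavior: the faulty device keeps
# reporting an error status instead of staying down, and the healthy peer keeps
# handing the active role back to it. Purely illustrative.

def simulate_pingpong(cycles: int = 4) -> None:
    active = "device_A_faulty"
    for step in range(cycles):
        if active == "device_A_faulty":
            # The faulty device signals an error status to its peer...
            print(f"step {step}: {active} reports error -> failover to device_B_healthy")
            active = "device_B_healthy"
        else:
            # ...but the peer then switches itself to standby and fails back.
            print(f"step {step}: device_B_healthy goes standby -> traffic back to device_A_faulty")
            active = "device_A_faulty"

if __name__ == "__main__":
    simulate_pingpong()
```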
The sources of error outlined here were not known during the actual disruption, so analysis and troubleshooting took some time. With the assistance of a technician from the manufacturer and by implementing a hotfix, we were fortunately able to stop the ping-pong behavior, so that by around 4:45 p.m. traffic could be stabilized through the one functioning device. With stable service provision restored, we were then able to identify and replace the faulty SFP that the software had handled incorrectly, perform a software downgrade on the disrupted device, and ultimately restore full system redundancy.
How Can We Prevent Such Outages in the Future?
Ultimately, we depend on hardware that can always be subject to malfunctions, so we cannot seriously promise that such disruptions are ruled out for the future. Our aim, however, must be to further minimize the probability of such a malfunction occurring. In concrete terms, this means that before any software update we now check the availability of all connections and only start the update once everything has been confirmed. In addition to purchasing further load balancer/proxy pairs, we are also examining which components can be detached from this high-performance hardware setup and operated in a different setup.
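As an example of what such a pre-update check could look like in principle, here is a hypothetical Python sketch. The link names are placeholders and a simple ping is only a stand-in; in practice the check relies on the device's own interface and SFP status information.

```python
import subprocess
import sys

# Hypothetical list of link endpoints that must be reachable before a software
# update on the standby device is started (names are placeholders).
REQUIRED_LINKS = [
    "uplink-datacenter-1.example.rwth-aachen.de",
    "uplink-datacenter-2.example.rwth-aachen.de",
    "peer-loadbalancer.example.rwth-aachen.de",
]

def link_is_up(host: str) -> bool:
    """Rough reachability check via a single ping; the real check would inspect
    interface and transceiver status on the device itself."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "2", host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

def main() -> int:
    down = [host for host in REQUIRED_LINKS if not link_is_up(host)]
    if down:
        print("Aborting update, the following links are not available:", ", ".join(down))
        return 1
    print("All links available, update may proceed.")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```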
Responsible for the content of this article are Bernd Kohler, Susanne Kubiak and Nils Neumann.