Spanning-Tree Stops Trains
13 years 6 months ago #36856
by TheBishop
Saw this today on Slashdot:
"The railway signaling failure which crippled Sydney on April 12 (some commuters reported trips of more than three hours) was caused by a failing LAN switch and software that couldn’t cope, an engineering report has found.
The switch, probably a Cisco device (Railcorp’s dominant LAN kit supplier) was part of the network in the Sydenham signaling station. That facility governs signaling for a large chunk of the Sydney rail network.
The guilty switch suffered partial failure of two electrolytic capacitors (probably the power supply). The switch is part of a dual redundant LAN which is supposed to be resilient to failure; however, the configuration couldn't handle an intermittent breakdown.
With the caps failing, the switch would shut down and try to re-start itself. This, the engineer’s report says, meant the Sydenham LAN was “caught in a cycle where it was continually trying to reconfigure itself to address the changing state of the network.”
It only took a little over ten minutes for technical staff to initiate a disaster recovery plan, but the procedure took more than an hour to complete. In that time, the software that governs the trains, known as ATRICS, was unable to cope with the flaky network. This led to a knock-on effect, taking out a system called Microloc at another station, Revesby.
With ATRICS and Microloc both failing, the rail network failed to a “safe state” in which the trains were halted where they were. Because of the hugely interdependent state of the Sydney rail network, 847 trains were delayed, 240 were cancelled, and it took the rest of April 12th for the system to recover."
Moral: We all depend on spanning-tree and assume we have resilient architectures, but spanning-tree doesn't do too well with repeated intermittent faults that come and go faster than its convergence time. The world will always need engineers...
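To put rough numbers on that moral, here is a toy sketch in Python. It is not taken from the engineering report, which does not say which spanning-tree variant or timers were in use. It simply assumes that every up/down transition of the faulty switch restarts a fixed reconvergence window (30 seconds here, roughly the listening plus learning time with the classic 802.1D default forward delay of 15 seconds). The function name and all parameters are made up for illustration.

```python
# Toy model only: every state change of the flapping switch is assumed to
# restart a fixed reconvergence window during which the LAN is unusable.

def stable_fraction(up_s, down_s, convergence_s, total_s):
    """Return the fraction of total_s during which the topology is settled.

    up_s / down_s : how long the faulty switch stays up / down per cycle
    convergence_s : assumed time to reconverge after each change
    total_s       : length of the observation window
    (All four names are illustrative, not taken from the report.)
    """
    if up_s <= 0 or down_s <= 0 or total_s <= 0:
        raise ValueError("durations must be positive")
    elapsed = 0.0
    stable = 0.0
    durations = (up_s, down_s)
    i = 0
    while elapsed < total_s:
        # Length of the current up or down phase, clipped to the window.
        interval = min(durations[i % 2], total_s - elapsed)
        # Only the part of the phase after reconvergence counts as stable.
        stable += max(0.0, interval - convergence_s)
        elapsed += interval
        i += 1
    return stable / total_s


if __name__ == "__main__":
    # Assumed 30 s window: about 2 x the default 802.1D forward delay.
    for up_s, down_s in [(20, 5), (60, 5), (600, 5)]:
        frac = stable_fraction(up_s, down_s, convergence_s=30, total_s=3600)
        print(f"up {up_s:>3}s / down {down_s}s -> stable {frac:.0%} of the hour")
```

With a switch that stays up for 20 seconds and down for 5, against a 30 second window, the function returns 0.0: under these assumptions the LAN never settles, even though a second path exists the whole time. Real behaviour depends on the protocol variant, the timers and the topology, so treat this as an illustration of the argument rather than a model of the Sydenham network.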
13 years 6 months ago #36870
by next_virus
Replied by next_virus on topic Re: Spanning-Tree Stops Trains
Excellent article.
13 years 6 months ago #36892
by S0lo
Studying CCNP...
Ammar Muqaddas
Forum Moderator
www.firewall.cx
Replied by S0lo on topic Re: Spanning-Tree Stops Trains
Cisco. Lesson learned the hard way! Well, let's just hope that it's learned.