On The Road: Network Disaster & Dual Public-Private Network

As an administrator within a mid-sized organization, you find yourself wearing many occupational hats, which becomes second nature after a while. One of the many hats I wear is that of lead network administrator, something I am particularly fond of… I love networking and everything about it (except maybe wiring racks and crimping :|).

Today many data center networks are designed with a dual public-private setup, which simply put means you have a private network running parallel to your public network; effectively, you run two cat6 copper runs to all racks and servers. The traditional thinking is that your servers and/or server customers get all the benefits a private network brings: unlimited server-to-server traffic, gigabit server-to-server data transfers, secure communication for dedicated back-end application or database servers, out-of-band VPN management, fast off-server local-network backups, less congestion on the public network, and the list goes on. Cat6 costs about $0.12/ft to run, so it is cheap, it makes for a more robust and flexible network environment, and it is simply good practice that leased/colo server customers like to see (we run our private network to all customer servers at work, free of charge!).

There is, however, an unconventional use of this dual public-private network implementation that can save you some serious headaches, make you the hero of the day, and give true meaning to thinking outside the box.

Our fateful day begins on a beautiful spring day in May of 2009 (how cliche does that sound?). I was just getting on the road, heading from Troy, MI to Montreal, Quebec, which is about an 8 hour drive. I had to return home to deal with a family emergency and my boss Bill was great about it, offering to drive me to Montreal… One problem though: being a smaller organization, our staffing in Troy for the data center pretty much consisted of Bill and myself. Not to worry, Bill said! A quick phone call later and we managed to secure a commitment from one of the data center owners to respond to any hardware events we might have while on the road. After all, it's only 8 hours away, and Bill expected to be back the next day after dropping me off. What could go wrong in a single day?

So off we went, setting out on the road. We quickly made our way out of Michigan, enjoying the wonderfully scenic view of the Ontario landscape (read: a whole lot of nothing). A few stops later for junk food, restroom breaks, and an obligatory stop at one of Canada's great attractions, Tim Horton's :), we found ourselves about 3 or so hours into the journey. Then it happened: Bill's pager started to go off, and seconds later mine did too. Something blew up. The laptop bag quickly got pulled out from under my seat as Bill drove, and I began to pound away at my keyboard trying to figure out what was going on. We had multiple servers reporting down, which quickly led to the realization that all the downed servers were on the same physical rack. My first thought was that maybe an APC strip (power outlet strip) had tripped or failed, but then I tried to ping some of the private IPs of the downed servers and they were responding. The conclusion was instant when I saw the private IPs responding: we had just lost a public switch. CRAP!

Immediately a call was placed to the data center hands who had committed to being available to assist us in the event of any kind of failure, even though we knew this might be something he couldn't handle anyway. It did not matter: when we needed him, he was nowhere near the data center and couldn't get there. Some commitment that was! At this point Bill took the first U-turn possible and we started the drive back to Michigan, some 3 hours away, with a rack of servers down from a failed public switch. We were left with our thumbs up our butts, both Bill and myself quietly freaking out in our heads and ultimately unable to do anything; our on-call data center hands had failed us, and the only other two people who could do anything about it were sitting in a truck 3 hours away.

I sat there in the truck, contemplating all sorts of things, hoping a power cycle of the switch would work. Sure enough, it did not; that would have been too simple! The more I thought about it, the more I kept returning to the private network: we have all these downed servers and they are responding on the private network… Then it hit me: why don't I try routing traffic for the downed servers through a new gateway, a gateway I create on the private network! That was it. I got to work frantically, Bill asking me what I was doing and me saying only that I was trying something and to give me a few minutes, all the while Bill still driving back towards Michigan.

The plan was fairly simple, in my head anyways. I would take a server from anywhere else on the network and temporarily use it as a Linux routing/gateway server (think: Windows Internet Connection Sharing) by enabling IP forwarding so it could forward/route packets. Then I would set a static route on each affected server, telling it to route traffic for the public IP network through the private IP of the designated temporary gateway server, and finally configure our edge router to statically route the IP block behind the downed switch to that same temporary gateway server.

That said, it sounds more complicated than it actually is. The server I chose for this temporary gateway role was one on the next rack over; let's call this server GW. The private IP on GW is 10.10.7.50 and its public IP is 172.11.14.5. The public IP space knocked offline by the downed public switch is 172.11.13.0-255, though each server only has 1-3 IPs in use.

First things first, on server GW we enabled IP Forwarding with:
# echo 1 > /proc/sys/net/ipv4/ip_forward
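As a side note, the same thing can be done through sysctl on most distributions, and appending the setting to /etc/sysctl.conf keeps it across a reboot (not that a temporary gateway like this should live that long!):
# sysctl -w net.ipv4.ip_forward=1
# echo "net.ipv4.ip_forward = 1" >> /etc/sysctl.conf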

Then on one of the affected servers on the downed rack, we needed to add a static route to tell it to route public traffic through the private network to our GW server; note eth1 is our private network interface (this takes care of traffic leaving the downed server):
# route add -net 172.11.13.0/24 gw 10.10.7.50 eth1
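If the old net-tools route command isn't on the box, the iproute2 equivalent should look something like this:
# ip route add 172.11.13.0/24 via 10.10.7.50 dev eth1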

Then on the GW server we needed to add a static route, similar to the one we just added on the downed server, but for each of the downed servers' main IPs. This is slightly tedious, but hey, it's not a perfect situation to begin with, right!? The downed server we are working on has a public IP of 172.11.13.20 and a private IP of 10.10.7.26; again, the private interface is eth1 (this route takes care of traffic going to the downed server):
# route add -host 172.11.13.20 gw 10.10.7.26 eth1
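If you have more than a handful of downed servers to deal with, nothing stops you from scripting these host routes on GW. Here is a rough sketch, assuming a hypothetical file named downed-servers.txt with one "public-IP private-IP" pair per line:
# while read pub priv; do route add -host "$pub" gw "$priv" eth1; done < downed-servers.txt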

With those two routes added, the server was immediately able to ping out to the internet, but none of its IPs were responding from outside the network. This is because our router was still sending routed traffic for 172.11.13.0/24 to the downed switch. We needed to tell it to redirect that traffic to the GW server at 172.11.14.5. Once logged into the edge router running Cisco IOS, I passed it the following static route:
router1(config)# ip route 172.11.13.0 255.255.255.0 172.11.14.5
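For the record, a quick show ip route is an easy sanity check that the new static route took, and once the failed switch is eventually replaced, the temporary route comes back out with its no form:
router1# show ip route 172.11.13.0
router1(config)# no ip route 172.11.13.0 255.255.255.0 172.11.14.5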

With the router route in place, the public IP on the downed server started to respond from the internet and traffic began to flow into the server. It was done! I had configured the downed server to route traffic out through an intermediate gateway server on the private network, and that gateway server to likewise route inbound traffic back through the private network to the affected server. Now all that was needed was to repeat the first route command on all downed servers and the second route command for each of the downed servers' main IPs. Tedious and far from ideal, but it was working; we were bringing servers back up, and our total outage time was about 40 minutes. Significant, yes, but a lot better than the 3+ hours it could have been!
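If you are the paranoid type (I am), a quick tcpdump on GW's private interface is a nice way to actually watch the forwarded traffic to and from an affected server, assuming tcpdump is installed on GW:
# tcpdump -ni eth1 host 172.11.13.20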

Once I had the routing working over the private network and the first server back online, that was enough for Bill to turn the truck back around and continue on to Montreal. Although we could have continued back to Michigan to replace the failed switch, we had a workable solution in place that allowed me to get where I needed to go and handle the failed public switch the next day.

There are arguably more than a few factors in this situation, and in the network we have at work, that made this approach possible, but they are outside the scope of this article. The takeaway is really very simple: a dual public-private network gives you a great many advantages, as listed earlier in this article, but it is the simple fact of having that private network parallel to your public network that affords you options in a disaster, options you may otherwise not have.

What do you think?
Was the choice to get back on the road to Montreal the right decision, or should we have returned to Michigan to replace that failed switch?