R-fx Networks

My Blog

BFD 1.5-1 Update: Forged Syslog Data Vulnerability

by on Jan.29, 2014, under Development, My Blog

Current Release:
http://www.rfxn.com/downloads/bfd-current.tar.gz
http://www.rfxn.com/appdocs/README.bfd
http://www.rfxn.com/appdocs/CHANGELOG.bfd

An updated version of BFD 1.5 has been released, version 1.5-1, which addresses an address scoping issue in the event forged syslog data is encountered on the host system running BFD from a malicious local user or any other sources that may generate forged syslog data. In such situations, BFD can be manipulated to ban addresses that it would otherwise not validly be triggered to do so, with wide scoped CIDR notation at up to a /8.

The 1.5-1 release addresses this by ensuring that addresses BFD passes onto the BAN_COMMAND are fully qualified C class (/32 CIDR) addresses only, as opposed to any CIDR notation address.

Thanks goes to rack911.com for responsibly advising of this issue and awaiting the release of a fix prior to any public disclosure. The responsible disclosure practices of rack911.com are a statement to their professionalism as a managed services provider as well as their dedication to improving the security landscape of the web hosting industry at large.

Leave a Comment :, , , more...

LMD 1.4.1: Delivering on your requests

by on Nov.20, 2011, under Development, My Blog

The release of LMD 1.4.1 is now live and with it comes a few new features. In this small update, I have tried to deliver on on a couple of common feature requests from users which were in-line with my development goals. That said, right to it…

The biggest change has come in the form of what has been dubbed public mode scanning. This is where non-root users can execute malware scans. For this to work, a new quarantine, session and temporary path directory tree needed to be created that users had write access under. This presented some challenges and in the early incarnation of this feature, the pub/ directory tree created for this feature was set world writable. The more I worked with the ideas around this feature the more I hated it, I simply could not impose upon users a world writable path. Then I flirted with the idea of simply creating the directory tree and if users wanted the feature they had to set it mode 777 themselves, though this was a fair trade it still felt like a lazy solution.

In the end, the solution I came up with was to populate the new pub directory tree with user paths based on passwd users and explicitly set ownership to each user for their pub/username path (–mkpubpaths). This meant that something was needed to regularly update the pub directory tree for new users and as such a cronjob was added that runs every 10 minutes to create said paths (cron.d/maldet_pub). This feature is controlled by conf.maldet variable public_scan which is disabled by default and when in a disabled state the cronjob simply does nothing along with user initiated scans exiting with an error that the feature is not currently enabled.

Supplementing the public mode scanning feature is the support for mod_security2 upload scanning which next to user initiated scans was one of the most requested features recently. Although the inotify real-time monitoring still works very well, it is not an option in some environments which makes mod_security2 upload scanning highly desirable. Conveniently, the only obstacle for upload scanning was simply that LMD did not support user initiated scans and with the introduction of public mode scanning there was only a few changes required to fully integrate it. That being the creation of a validation script for mod_security2’s inspectFile hook which returns an approved or denied status for uploaded files based on malware hits. This script was created as modsec.sh and is located in the LMD installation path. Full details on public mode and mod_security2 upload scanning are included in the README file.

Another highly requested feature is the ability to redefine configuration variables on the CLI on a per execution basis. This has been added through the -co|–config-option CLI flags. This was primarily requested by those creating integration interfaces for LMD along with those who create custom scan cronjobs. Likewise, this proved useful in the creation of the mod_security2 validation script. The usage of this feature is straight forward, simply append a comma spaced list of variables you would like to redefine in the format of VAR=VALUE.

For example, to change the email address for a specific scan and enable quarantining of hits:
maldet –config-option email_addr=you@domain.com,quar_hits=1

Effectively, any LMD variable located in conf.maldet or internal.conf can be redefined in this way.

Smaller changes include added support for Plesk in the cron.daily scans, email_ignore_clean conf.maldet variable that allows for reports where all hits are cleaned to be ignored and improved accuracy of (gz)base64 injection signatures to reduce false positives.

That covers the notable changes in this release. Although this isn’t as big or feature packed of an update as the last couple of releases, I am confident it will add to the maturity and utility of the project for all users. Please check the CHANGELOG and README files for further details. This update will push out automatically to LMD installations with the default daily cronjob enabled or you can manually update using the ‘maldet -d’ command.

LMD By The Numbers:
16,036 Downloads month-to-date (includes version updates)
15,261 Malware source URL’s tracked
14,443 Active installations (by unique IP daily signature queries)
11,017 Active 1.4.x installations (by unique IP daily signature queries)
10,192 File submissions pending malware review
9,644 Updates to 1.4.1 (by unique IP signature queries)
8,579 Total malware signatures
7,300 Google references to “linux malware detect”
6,715 MD5 malware signatures
6,374 Unique malware files in the LMD malware repository
3,221 Zombie server nodes seen in the last 30d on IRC C&C networks
1,864 HEX malware signatures
1,338 New signatures since 1.4.0
261 Command & Control IRC networks tracked
226 Signature updates in the last 12 months
196 Unique malware signature classifications
112 Files on average submitted daily through checkout feature
101 GB of bandwidth used per month on average to serve LMD updates
18 Signature updates per month on average
6.3 New signatures per day on average
1.6 Days between signature updates on average

2 Comments :, , more...

Linux Malware Detect: 2 Years Strong

by on Oct.06, 2011, under Development, My Blog

As cliche as it sounds, where has the time gone? Today we celebrate two years of Linux Malware Detect, open-source (web) malware detection.

The project has seen allot of change since the first release. What was initially started as an internal project to deal with a large increase in malware activity at my job, a mid-sized web hosting company, quickly grew into a larger, established, project that proved useful for the hosting community at large. I spent nearly three months collecting malware to form the base of the initial signature set, developing the program logic and engaging people in WHT & Cpanel IRC to test the early releases. Those first releases had less than 200 signatures, it was strictly MD5 based and used technique that were less than efficient and in many ways initially flawed.

As the project matured in it’s early releases, the reality of Linux (web) malware detection became evident, there was little to no tools that existed for the job and LMD was filling an important void. The few tools that did exist were either not focused on malware or were commercial solutions that made no effort to share malware signatures or resources with the Linux community at large. This quickly lead to a litany of feature requests for LMD along with a mountain of malware submissions from early adopters, all of whom saw in LMD what I saw; an ability to become an effective and crucial tool in combating malware.

Inside of the first couple of major releases, LMD saw an explosion of features and signatures which contributed to the maturity of the project. There were major additions such as hex based pattern matching, quarantine support, reporting system, real time inotify monitoring, malware checkouts, clean & restore features and much more. The signature base grew from 200 odd to now 8,388 at the time of this writing, an average of almost 350 new signatures per month.

The project now sits at version 1.4, which was released in April of 2011. Though the current release is 6 months old, that is by no means an indicator of the projects status but rather the success of it and the maturity there-in. The project still receives near daily signature updates, the malware queue from checkouts has never been more busy with an average of 85 malware submissions per day, the manual review queue for checkouts sits at just over 3300 files and is an ever challenging task to maintain but one I do willingly. Though there is much room for improvement and many features that can be added to LMD, at the moment there are no pressing features required by LMD. Do I have plans in store for the project in the short term? Yes, of course, but like many open source projects, time commitment to the project has to be balanced with my job and personal time so the priorities often shift between signature maintenance, feature development and work on other projects.

The success of the project can be measured by the 13,051 installations ( @ time of writing ) that report in daily, the 540+ new installations per month and the over 17,000 google references to the project. I am proud of LMD, where it has come in the last 24 months and am very encouraged by where I see it going in the future. I look forward to many years of success ahead for LMD and hope you will continue to trust in LMD to combat your malware threats.

10 Comments :, more...

ATA Over Ethernet: As an Alternative

by on Apr.04, 2011, under HowTo, My Blog

New technologies, new toys — Oh how I love getting my hands dirty with them. Today I am going to have a look at ATA Over Ethernet (AoE) as an alternative solution to NFS in the role of a NAS/SAN implementation. We will look at both the server side vblade setup and the client side AoE kernel module along with a practical deployment setup which includes a convenience script I developed to make vbladed slightly less of a nuisance to maintain.

First things first though, what exactly is ATA Over Ethernet? Straight off the wikipedia page, here are the important parts that describe AoE best:

"ATA over Ethernet (AoE) is a network protocol developed by the Brantley Coile Company, designed for simple, high-performance access of SATA storage devices over Ethernet networks. It is used to build storage area networks (SANs) with low-cost, standard technologies.
...
AoE runs on layer 2 Ethernet, it does not use internet protocol (IP), so it cannot be accessed over the Internet or other IP networks. In this regard it is more comparable to Fibre Channel over Ethernet.
...
SATA (and older PATA) hard drives use the Advanced Technology Attachment (ATA) protocol to issue commands, such as read, write, and status. AoE encapsulates those commands inside Ethernet frames and lets them travel over an Ethernet network instead of a SATA or 40-pin ribbon cable. By using an AoE driver, the host operating system is able to access a remote disk as if it were directly attached."

OK, of note here is that AoE is an ATA implementation over Ethernet, being layer 2 it is a dumb protocol with no knowledge of the TCP/IP stack, as such it can only communicate in the simplest of ways inside a switched network (its packets cant be routed between multiple networks). As such, AoE is ideal when used on a private network or better yet a network dedicated to SAN (Storage Area Network), it can however be used on a public facing network as so long as the hosts in the AoE network are all within the same switched segment of the network (More info here on routable AoE).

That all said, what makes AoE a viable alternative to NFS? Well in the role of storage access in its simplest capacity, NFS is just bloated and adds a significant amount of overhead and complexity to something that deserves to be simple. Further, NFS is woefully inadequate at maintaining the level of reliability required when you are, for example, exporting an entire file system to another device for the purpose of high-availability usage such as a /home extension or MySQL file system. Personally, I am slightly biased as I hate NFS; I use it, but only for a lack of often anything better to meet the role in exporting file systems and directory trees across networks. Although it does what its supposed to just fine, more often than not you can get woken up at 4AM with the most mysterious and sudden of NFS issues that are notorious for being mind numbing to resolve. It is for this simple reason — NFS’s lack of reliability, that sent me searching for a simple, scalable and reliable alternative. AoE has managed to meet two of these three points — simple and reliable, while coming up short on the scalable side, more on that in a bit.

There are two components of an AoE setup, the server side storage device that will run vblade and the client side that will access the exported storage using the AoE kernel module under Linux. I should note that although the vblade server package is for Linux, the client side drivers are available for Windows, OS X, FreeBSD and more; in Linux the AoE kernel module is part of the mainline kernel.

The server you choose to run vblade can be any device that you want to export files or devices on, there is little in the way of requirements as vblade is a pretty slim package and doesn’t consume much in the way of resources other than CPU. For a modest environment where you plan to export to no more than 10-15 clients, a dual core system with 2GB RAM is more than sufficient for the vblade server. For my deployment, I run vblade on a quad core Xeon 3.0Ghz, 6GB RAM and 9TB Raid5 array that exports to 54 client servers. More on my setup later when we review scalability but for now lets jump right into the vblade setup and usage.

Lets go ahead and grab the vblade package, compile and install it:


# wget http://iweb.dl.sourceforge.net/project/aoetools/vblade/20/vblade-20.tgz
# tar xvfz vblade-20.tgz
# cd vblade-20
# make && make install
install vblade /usr/sbin/
install vbladed /usr/sbin/
install vblade.8 /usr/share/man/man8/

There is no compile time configure script or any other real configuration required, vblade installs straight into /usr/sbin and is an overall painless process. The simplicity of the vblade package comes at a cost, in that there is no support for a configuration file to control multiple vblade instances, making things slightly tedious. This should not detract from the use of vblade, it is a mature and reliable package but one with a very simple approach that does little in the way other than what it is supposed to do.

To make life easier for myself, I created a wrapper of the sorts to add support for a configuration file along with limited error checking and some command line conveniences — we’ll grab the wrapper and default config template as follows:


# wget http://rfxn.com/downloads/vbladed.conf
# wget http://rfxn.com/downloads/vbladectl
# mv vbladed.conf /etc/
# mv vbladectl /usr/sbin
# chmod 640 /etc/vbladed.conf
# chmod 750 /usr/sbin/vbladectl
# ln -s /usr/sbin/vbladectl /etc/init.d/vbladed
# chkconfig --level 2345 on

You will note, that we enabled vblade to start on boot through init, although the wrapper is not technically an init script, it does support being called from init and managed through chkconfig for convenience. Lets look at the configuration file /etc/vbladed.conf then we’ll review the vbladectl usage after that:

##
# vbladed export configuration file
##

# unique shelf identifier for this vblade server
SHELF="0"     # must be numeric 0-254, default 0

##
# AOESLOT FILE MAC IFACE ALIAS
# 0 /data/server.img FF:FF:FF:FF:FF:FF eth1 server

The configuration file is pretty straight forward, the SHELF variable only matters if you intend to run multiple vblade servers on the same network, if that is the case then this value must be unique to each vblade server or you will run into client side conflicts of being unable to distinguish between vblade servers. The export definitions follow in the format of “AOESLOT FILE MAC IFACE ALIAS” which the below breaks down further:
AOESLOT is a per-client identifier for EACH exported file or device to the SAME client; in other words if you configure multiple exports to the same client server then this value needs to be unique for each.
FILE is the full path to a device or file you want to export, this can be an unformatted raw device such as /dev/sdb, a preformatted partition such as /dev/sdb5 or a loopback image such as /data/server.img.
MAC is the MAC address of the client-side interface that is attached to the network you intend AoE traffic to move over; more appropriately, it is the interface connected to your private network on the client server
IFACE is the server-side interface that can reach the client-side interface you defined the MAC address for; more appropriately, it is the interface connected to your private network on the vblade server
ALIAS is a reference alias for each configuration entry, this must be unique to each vbladed.conf definition

For the purpose of this article, we will go ahead and create a loopback image, format it and export it for a client server called apollo, then we will review how to import the file system onto the apollo server using the AoE kernel module. First, lets create our image:


# dd if=/dev/zero of=/home/apollo.img bs=1 count=0 seek=10G
# yes | mkfs.ext3 /home/apollo.img

This will create a sparse, zero filled file, meaning it will be 0bytes on disk and allocate space, up to 10G, as data is stored to it. There is a slight performance hit to this as the image file must grow itself as data is written, this however is made up for in improved efficiency of space usage. To create an image that preallocates space on the disk you would run ‘# dd if=/dev/zero of=/home/apollo.img bs=1M count=10000‘, be patient as this will take some time to complete, then format it as described above.

Now that we have the image/device we want to export, we need to add the definition for it into the vbladed.conf file, to do so we need to note the MAC address of the interface on apollo that will communicate with the vblade server, in our case this is a private interface eth1 but in your setup it can be a public facing interface if needed — just make sure its within the same subnet as the vblade server.

[root@apollo ~]# ifconfig eth1
eth1      Link encap:Ethernet  HWaddr 00:16:E6:D3:ED:E5
          inet addr:10.10.6.6  Bcast:10.10.7.255  Mask:255.255.252.0
    ... truncated ...

We now have the client side MAC address (00:16:E6:D3:ED:E5) and we have the device/file we want to export (/home/apollo.img), and we also know the private network interface on our vblade server is eth1 as well, so we can create the vbladed.conf definition:

0 /home/apollo.img 00:16:E6:D3:ED:E51 eth1 apollo

That should be appended into the bottom of /etc/vbladed.conf, then we are ready to start the vblade instance for the configuration we’ve added. The vbladectl wrapper includes start, stop and restart flags which also accept an optional alias for performing actions against only a specific vblade instance, run vbladectl with no options for usage help. Time to start the vblade instance for apollo as follows:

# /usr/sbin/vbladectl start apollo
started vbladed for apollo (pid:16320 file:/home/apollo.img iface:eth1 mac:00:16:E6:D3:ED:E51)
( you could also just pass the start option without an alias to start instances for all entries in vbladectl.conf )

The default behavior for vblade also sends log data to the kernel log, typically /var/log/messages on most systems, so tailing the log will produce the following logs if all is normal:

# tail /var/log/messages
Apr  3 16:49:25 backup5 vbladed: started vbladed for apollo (pid:16320 file:/home/apollo.img iface:eth1 mac:00:16:E6:D3:ED:E51)
Apr  3 16:49:24 backup5 vbladed: pid 16320: e0.0, 419430400 sectors O_RDWR

The important part there is the ‘vbladed: pid 16320: e0.0, 419430400 sectors O_RDWR’ entry in the log as this comes from vblade itself, the other log entry comes from the wrapper. This log entry tells us that vbladed forked off successfully and that it has exported our data for the defined server as e0.0 (etherdrive shelf 0 slot 0), you’ll see the significance of this shortly.

We are now ready to move over to our client server, apollo, and import our new AoE file system. This is an easy task and if you are running a current Fedora / RHEL (CentOS) based distribution, you’ll find the AoE kernel module already included. The module is also part of the mainline kernel so if you are using a custom kernel, please be sure to enable the corresponding config option (CONFIG_ATA_OVER_ETH).

There is really no right way to load a kernel module, you can either use modprobe which I recommend or you can use insmod on the modules full path, which is a matter of preference. Let’s first verify the module exists, which modprobe does for us but for the sake of this article and familiarity, we will check (remember you’re running this on the client server, i.e apollo):

# find /lib/modules/$(uname -r)/ -name "aoe.ko"
/lib/modules/2.6.18-194.32.1.el5PAE/kernel/drivers/block/aoe/aoe.ko

There we have it, the module returned fine, listing the full path to it. If you did not get anything back this may be that you are running a custom kernel by your own choosing, and need to configure the CONFIG_ATA_OVER_ETH option. It may also be that your data center provider or a software vendor installed a custom kernel without this feature and you should contact them requesting it. As an alternative, you could download the etherdrive sources for the AoE kernel module on the coraid website and compile it against your kernel, this requires your kernel build sources or on RHEL based systems the kernel-headers package.

That said, we will now load the module using modprobe, the preferred method:

# /sbin/modprobe aoe
( or you can run /sbin/insmod MODULE-PATH )

If everything went OK, then modprobe will generate no output and you can verify the module is loaded as follows:

# lsmod | grep aoe
aoe                    60385  1

When the AoE module is loaded it will start listening for broadcast traffic from AoE on all available interfaces, a very passive process. If you have done everything correct then the module will quickly detect the exported device/file from the vblade server and inform you in the kernel log along with creating the appropriate /dev/etherd/ device file. Let’s verify this by checking the log and then checking the /dev/etherd path:

# tail /var/log/messages
Apr  4 17:13:02 apollo kernel: aoe: aoe_init: AoE v22i initialised.
Apr  4 17:13:02 apollo kernel: aoe: 003048761643 e0.0 v4014 has 419430400 sectors
Apr  4 17:13:02 apollo kernel:  etherd/e0.0: unknown partition table
# ls /dev/etherd/
e0.0

If for some reason you do not see the log entries described above along with no e0.0 device file under /dev/ethered, this may be a misconfiguration on the vblade server, perhaps you got the interface or mac address in vbladed.conf wrong? Double check all values. If you opted to try run things over a public facing interface, the issue may be that your network VLAN’s each server (which is fairly common), in that case you may need to request that all your hardware be part of the same VLAN or the provisioning of a private switch and private links for your hardware.

Assuming that things went good, that you see the appropriate log entries and the e0.0 device file under /dev/etherd/, we are ready to mount the file system, we will mount it as /mnt/aoe for the purpose of this article:

# mkdir /mnt/aoe
# mount /dev/etherd/e0.0 /mnt/aoe
# df -h /mnt/aoe
Filesystem            Size  Used Avail Use% Mounted on
/dev/etherd/e0.0      5G   36M  4.9G  0% /mnt/aoe

You may run into an issue of unrecognized file system on the device, though the file system we created on it, on the vblade server, should show through. If it does not, simply run an ‘mkfs.ext3 /dev/etherd/e0.0’ on it and you will be all set. There is no hard set rule on creating the file system on the vblade server, you could just export raw images and devices then partition/format file systems on them on a per-client basis as you require it.

The only thing that is left is to set our new file system on apollo to load at boot time, the simplest way to do this is to append a couple of lines to /etc/rc.local as follows:

/sbin/modprobe aoe
sleep 5 ; mount /dev/etherd/e0.0 /mnt/aoe -onoatime

The rc.local script is run at boot time after all other services have started, so if you are loading a file system used for mysql, user home data or similar you will probably want to also add a line after the mount to restart said services. You’ll also notice two things about the entries we added to rc.local; The first is the sleep delay before the mount, this allows the aoe kernel module to complete its discovery process for AoE file systems before we try to mount it. Then, we are using the noatime option on the mount command, which disables the updating of the last access time on files during read/write operations. This is important because traditionally whenever a file is read from disk, it causes a write operation back to disk to update the atime attribute on the file, so disabling atime usage can greatly reduce i/o calls (effectively in half for reads), which is especially significant for networked file systems.

Conclusions
I have had an overall good experience with AoE so far, it is incredibly simple and very reliable as an implementation. The only issue I have seen is the scalability of it and I attribute this more to the vblade server package than AoE as a protocol. There appears to be a degradation in I/O throughput performance for exported file systems that is in-line with the number of (instances) file systems you export on the same physical server. The best usage example of this is that in my environment I run vblade on one server with exports to 54 servers, the throughput when there is 1-10 instances running averages about 51MB/s (408Mbit), as that increases though to 54 instances, the throughput per client server drops drastically to an average of 14MB/s (112Mbit). This is a very sharp decrease in performance, one that makes the viability of vblade in much larger of a setup questionable.

I do need to caution that this issue may be environment specific as speaking to other vblade users has produced mixed feedback, some do not experience this kind of performance loss while others do. I will also note that I run vblade on a second storage device, on the same private network as the 54 instance vblade server, and this second storage device has only 4 instances running with an average throughput of 71MB/s (568Mbit). So the conclusion you draw from this is up to you, at the end of the day I am more than happy with the implementation as a whole and can accept the loss of performance for the larger implementation in the name of reliability and simplicity.

5 Comments :, , , , more...

LMD 1.3.9r1: Hexdepth Bug

by on Apr.03, 2011, under Development, My Blog

I have put up a revision to the 1.3.9 release of LMD that fixes a hexdepth bug in which malware greater than 65Kbytes would cause an error in the internal hexstring.pl script and be considered clean on the stage2 hex scanning of malware. This would mean that unless malware had a MD5 signature for it to be caught on stage1 scan, it would not be picked up by a corresponding HEX rule in stage2 scan if its file size was greater than 65Kbyte, due to the bug.

In addition, I have made the decision in this revision to enable release update checks in the default cron.daily entry installed by LMD, this can be found at /etc/cron.daily/maldet line 9 (after update) if you wish to comment it out. I would however encourage users to leave this option enabled as it will greatly improve receiving timely updates for future bugs fixes and release updates. In the past, the decision was made to not enable automatic release updates for many reasons but mostly in the interest of the software still maturing and being in early development, thereby not wanting to rock any boats with large and sweeping release updates to a version they may have got working the way they prefer. Now though, LMD has come a long way, the installer imports most options and ignore files and there are no drastically sweeping changes planned that will cause a great deal of headaches — so it seemed fitting time to enable automatic updates.

You can update your installation using the ‘maldet -d|–update-ver’ flags or download the current build for new installations.

This release update also coincides with passing 7k signatures….. We now sit at 7,106 signatures or +146 signatures added today. This is no small feat, I remember when we had just a couple hundred signatures not so long ago and I thought that was a big deal! The LMD submissions repository stays very active, it is now the source of almost 60% of the weekly signature additions and has contributed greatly to creating a vastly more accurate signature set that is representative of the threats you, the users, face day-to-day.

That said, month ending March stats recorded +1,464 installations of LMD bringing the install count to 7,157 — which puts LMD now ahead of APF in month-to-month new installation growth. Although, APF still beat LMD on raw downloads last month (3,091 vs 2,583), it is reasonable to predict that LMD will soon take the number one spot for downloads as well. It however still has a long way to go for total active installations, which APF sits at a comfortable 24,791 currently.

Till next time, happy malware hunting 🙂

Leave a Comment :, , , more...

On The Road: Network Disaster & Dual Public-Private Network

by on Mar.24, 2011, under My Blog

As an administrator within a mid-sized organization, you can find yourself wearing many occupational hats, which becomes only second nature after awhile. One of these many hats I wear, is that of lead network administrator, which is something I am particularly fond of… I love networking and everything about it (except maybe wiring racks and crimping :|).

Today many data center networks are designed in a dual public-private network setup, which simply put is you have a private network parallel to your public network — effectively you run two cat6 copper runs to all racks and servers. The traditional concept behind this is that your servers and/or server customers receive all the benefits a private network entails; unlimited server-to-server traffic, gigabit server-to-server data transfers, secure communication for dedicated back-end application or database servers, out-of-band VPN management, fast off-server local-network backups, relieves congestion on the public network and the list goes on. The costs to run cat6 are about $0.12/ft, it is cheap, provides for a more robust and flexible network environment and simply put is just good practice that leased/colo server customers like to see (we run our private network to all customer servers at work, free of charge!).

There however is an unconventional usage to this dual public-private network implementation that can very well save you some serious headaches, make you the hero of the day and gives true meaning to thinking outside the box.

Our fateful day begins on a beautiful spring day in May of 2009 (how cliche does that sound?). I was just getting on the road heading from Troy, MI to Montreal, Quebec which is about an 8 hour drive. I had to return home to deal with a family emergency and my boss Bill was great about it, offering to drive me to Montreal… One problem though, being a smaller organization our staffing in Troy for the data center pretty much consisted of Bill and myself. Not to worry Bill said!, A quick phone call later and we managed to secure a commitment from one of the data center owners to respond to any hardware events we might have while on the road — after all its only 8 hours away and Bill expected to be back the next day after dropping me off, what could go wrong in a single day!

So, off we went, setting out on the road, we quickly made our way out of Michigan enjoying the wonderfully scenic view of the Ontario landscape (read: a whole lot of nothing). A few stops later for junk food, restroom breaks, and an obligatory stop at one of Canada’s great attractions, Tim Horton’s :), we now found ourselves about 3 or so hours into the journey. Then it happened, Bill’s pager started to go off, seconds later, mine started to go off – something blew up. The laptop bag quickly got pulled out from under my seat as Bill drove and I began to pound away at my keyboard, trying to figure out what was going on. We had multiple servers reporting down, which quickly lead to the realization that all the downed servers were on the same physical rack. The first thing I thought was maybe an APC strip (power outlet strip) tripped or failed, but then I tried to ping some of the private IP’s for the downed servers and they were responding. The conclusion was instant when I saw the private IP’s responding – we just lost a public switch – CRAP!.

Immediately a call was placed to the data center hands that we had thought, and who had committed too, being available to assist us in the event of any kind of failure but we knew this was something he may not even be able to handle. It did not matter, when we needed him, he was nowhere near the data center and couldn’t get there, some commitment that was! At this point Bill took the first U-Turn possible and we started the drive back to Michigan, some 3 hours away, with a rack of servers down from a failed public switch. We were at this point left with our thumbs up our butts, both Bill and myself quietly freaking out in our heads and ultimately unable to do anything, our on-call data center hands failed us and the only other two people who could do anything about it were sitting in the truck 3 hours away.

I sat there, in the truck, contemplating all sorts of things, hoping a power cycle of the switch would work, but sure enough, it did not, that would have been too simple! The more I thought about things, the more I kept returning to the private network, we have all these downed servers and they are responding on the private network…. Then it hit me, why don’t I try route traffic for the downed servers, through a new gateway, a gateway I create on the private network! That was it, I got to work frantically, Bill asking me what I was doing, and me saying only that I am trying something, to give me a few minutes — all the while Bill is still driving back towards Michigan.

The plan was fairly simple, in my head it seemed like that anyways. I would take a server from anywhere else on the network and temporarily use it as a Linux routing/gateway server (think: Windows Internet Connection Sharing) by enabling ip forwarding so it can forward/route packets. Then I would set a static route on the affected servers telling them to route traffic for the public IP network through to the private IP of the designated temporary gateway server, followed by configuring our edge router to static route the IP block of the downed switch to the designated temporary gateway server.

That said it sounds more complicated than it actually is. The server I chose to use for this temporary gateway role was one on the next rack over, lets call this server GW. The private IP on GW is 10.10.7.50 and its public IP is 172.11.14.5. The public IP space that is offline from the downed public switch is 172.11.13.0-255 but each server only has in use 1-3 IP’s.

First things first, on server GW we enabled IP Forwarding with:
# echo 1 > /proc/sys/net/ipv4/ip_forward

Then on one of the affected servers on the downed rack, we needed to add a static route to tell it to route public traffic through the private network to our GW server; note eth1 is our private network interface (this takes care of traffic leaving the downed server):
# route add -net 172.11.13.0/24 gw 10.10.7.50 eth1

Then on the GW server we needed to add a static route similar to above but for each of the downed servers main IP’s.This is slightly tedious but hey its not a perfect situation to begin with right!?. So the downed server we are working on has a public IP of 172.11.13.20 and a private IP of 10.10.7.26, again the private interface is eth1. (this route will take care of traffic going to the downed server):
# route add -host 172.11.13.20 gw 10.10.7.26 eth1

In adding the two routes, the server was immediately able to ping out to the internet but none of it’s IP’s were responding outside of the network. This is because our router was still sending routed traffic for 172.11.13.0/24 to the downed switch. We need to tell it to redirect the traffic to the GW server at 172.11.14.5. Once logged into the edge router running Cisco IOS, I passed it the following static route:
router1(config)# ip route 172.11.13.0 255.255.255.0 172.11.14.5

With that done, the public IP on the downed server started to respond from the internet and traffic began to flow into the server. It was done! I had configured the downed server to route traffic out through an intermediate gateway server on the private network and for that gateway server to likewise route inbound traffic back through the private network to the affected server. Now all that was needed was repeating the first route command on all downed servers then repeating the second route command for each of the downed servers main IP’s. Tedious, not ideal, far from it in fact but it was working – we were bringing servers back up and our total outage time was about 40 minutes. Though significant, less so than it could have been of 3+ hours!

Once I had got the routing working over the private network and the first server back online, this was enough for Bill to turn back around and continue me on my journey to Montreal. Although we could have continued back to Michigan to replace the failed switch, we had a workable solution in place that allowed me to get where I needed to go and handle the failed public switch the next day.

There is arguably more than a few factors to this situation and the network we have at work that made this approach possible but they are outside the scope of this article. The take away is really very simple; a Dual Public-Private Network allows you a great many advantages, as listed earlier in this article, however it is the simple fact of having that private network parallel to your public network that affords you options in a disaster, options you may otherwise not have without it.

What do you think?
Was the choice to continue back on the trip to Montreal the right decision or should we have completely returned to Michigan to replace that failed switch?

Leave a Comment : more...

Looking for something?

Use the form below to search the site:

Still not finding what you're looking for? Drop a comment on a post or contact us so we can take care of it!

Visit our friends!

A few highly recommended friends...