ATA Over Ethernet: As an Alternative

New technologies, new toys — Oh how I love getting my hands dirty with them. Today I am going to have a look at ATA Over Ethernet (AoE) as an alternative solution to NFS in the role of a NAS/SAN implementation. We will look at both the server side vblade setup and the client side AoE kernel module along with a practical deployment setup which includes a convenience script I developed to make vbladed slightly less of a nuisance to maintain.

First things first though, what exactly is ATA Over Ethernet? Straight off the wikipedia page, here are the important parts that describe AoE best:

"ATA over Ethernet (AoE) is a network protocol developed by the Brantley Coile Company, designed for simple, high-performance access of SATA storage devices over Ethernet networks. It is used to build storage area networks (SANs) with low-cost, standard technologies.
AoE runs on layer 2 Ethernet, it does not use internet protocol (IP), so it cannot be accessed over the Internet or other IP networks. In this regard it is more comparable to Fibre Channel over Ethernet.
SATA (and older PATA) hard drives use the Advanced Technology Attachment (ATA) protocol to issue commands, such as read, write, and status. AoE encapsulates those commands inside Ethernet frames and lets them travel over an Ethernet network instead of a SATA or 40-pin ribbon cable. By using an AoE driver, the host operating system is able to access a remote disk as if it were directly attached."

OK, of note here is that AoE is an ATA implementation over Ethernet, being layer 2 it is a dumb protocol with no knowledge of the TCP/IP stack, as such it can only communicate in the simplest of ways inside a switched network (its packets cant be routed between multiple networks). As such, AoE is ideal when used on a private network or better yet a network dedicated to SAN (Storage Area Network), it can however be used on a public facing network as so long as the hosts in the AoE network are all within the same switched segment of the network (More info here on routable AoE).

That all said, what makes AoE a viable alternative to NFS? Well in the role of storage access in its simplest capacity, NFS is just bloated and adds a significant amount of overhead and complexity to something that deserves to be simple. Further, NFS is woefully inadequate at maintaining the level of reliability required when you are, for example, exporting an entire file system to another device for the purpose of high-availability usage such as a /home extension or MySQL file system. Personally, I am slightly biased as I hate NFS; I use it, but only for a lack of often anything better to meet the role in exporting file systems and directory trees across networks. Although it does what its supposed to just fine, more often than not you can get woken up at 4AM with the most mysterious and sudden of NFS issues that are notorious for being mind numbing to resolve. It is for this simple reason — NFS’s lack of reliability, that sent me searching for a simple, scalable and reliable alternative. AoE has managed to meet two of these three points — simple and reliable, while coming up short on the scalable side, more on that in a bit.

There are two components of an AoE setup, the server side storage device that will run vblade and the client side that will access the exported storage using the AoE kernel module under Linux. I should note that although the vblade server package is for Linux, the client side drivers are available for Windows, OS X, FreeBSD and more; in Linux the AoE kernel module is part of the mainline kernel.

The server you choose to run vblade can be any device that you want to export files or devices on, there is little in the way of requirements as vblade is a pretty slim package and doesn’t consume much in the way of resources other than CPU. For a modest environment where you plan to export to no more than 10-15 clients, a dual core system with 2GB RAM is more than sufficient for the vblade server. For my deployment, I run vblade on a quad core Xeon 3.0Ghz, 6GB RAM and 9TB Raid5 array that exports to 54 client servers. More on my setup later when we review scalability but for now lets jump right into the vblade setup and usage.

Lets go ahead and grab the vblade package, compile and install it:

# wget
# tar xvfz vblade-20.tgz
# cd vblade-20
# make && make install
install vblade /usr/sbin/
install vbladed /usr/sbin/
install vblade.8 /usr/share/man/man8/

There is no compile time configure script or any other real configuration required, vblade installs straight into /usr/sbin and is an overall painless process. The simplicity of the vblade package comes at a cost, in that there is no support for a configuration file to control multiple vblade instances, making things slightly tedious. This should not detract from the use of vblade, it is a mature and reliable package but one with a very simple approach that does little in the way other than what it is supposed to do.

To make life easier for myself, I created a wrapper of the sorts to add support for a configuration file along with limited error checking and some command line conveniences — we’ll grab the wrapper and default config template as follows:

# wget
# wget
# mv vbladed.conf /etc/
# mv vbladectl /usr/sbin
# chmod 640 /etc/vbladed.conf
# chmod 750 /usr/sbin/vbladectl
# ln -s /usr/sbin/vbladectl /etc/init.d/vbladed
# chkconfig --level 2345 on

You will note, that we enabled vblade to start on boot through init, although the wrapper is not technically an init script, it does support being called from init and managed through chkconfig for convenience. Lets look at the configuration file /etc/vbladed.conf then we’ll review the vbladectl usage after that:

# vbladed export configuration file

# unique shelf identifier for this vblade server
SHELF="0"     # must be numeric 0-254, default 0

# 0 /data/server.img FF:FF:FF:FF:FF:FF eth1 server

The configuration file is pretty straight forward, the SHELF variable only matters if you intend to run multiple vblade servers on the same network, if that is the case then this value must be unique to each vblade server or you will run into client side conflicts of being unable to distinguish between vblade servers. The export definitions follow in the format of “AOESLOT FILE MAC IFACE ALIAS” which the below breaks down further:
AOESLOT is a per-client identifier for EACH exported file or device to the SAME client; in other words if you configure multiple exports to the same client server then this value needs to be unique for each.
FILE is the full path to a device or file you want to export, this can be an unformatted raw device such as /dev/sdb, a preformatted partition such as /dev/sdb5 or a loopback image such as /data/server.img.
MAC is the MAC address of the client-side interface that is attached to the network you intend AoE traffic to move over; more appropriately, it is the interface connected to your private network on the client server
IFACE is the server-side interface that can reach the client-side interface you defined the MAC address for; more appropriately, it is the interface connected to your private network on the vblade server
ALIAS is a reference alias for each configuration entry, this must be unique to each vbladed.conf definition

For the purpose of this article, we will go ahead and create a loopback image, format it and export it for a client server called apollo, then we will review how to import the file system onto the apollo server using the AoE kernel module. First, lets create our image:

# dd if=/dev/zero of=/home/apollo.img bs=1 count=0 seek=10G
# yes | mkfs.ext3 /home/apollo.img

This will create a sparse, zero filled file, meaning it will be 0bytes on disk and allocate space, up to 10G, as data is stored to it. There is a slight performance hit to this as the image file must grow itself as data is written, this however is made up for in improved efficiency of space usage. To create an image that preallocates space on the disk you would run ‘# dd if=/dev/zero of=/home/apollo.img bs=1M count=10000‘, be patient as this will take some time to complete, then format it as described above.

Now that we have the image/device we want to export, we need to add the definition for it into the vbladed.conf file, to do so we need to note the MAC address of the interface on apollo that will communicate with the vblade server, in our case this is a private interface eth1 but in your setup it can be a public facing interface if needed — just make sure its within the same subnet as the vblade server.

[root@apollo ~]# ifconfig eth1
eth1      Link encap:Ethernet  HWaddr 00:16:E6:D3:ED:E5
          inet addr:  Bcast:  Mask:
    ... truncated ...

We now have the client side MAC address (00:16:E6:D3:ED:E5) and we have the device/file we want to export (/home/apollo.img), and we also know the private network interface on our vblade server is eth1 as well, so we can create the vbladed.conf definition:

0 /home/apollo.img 00:16:E6:D3:ED:E51 eth1 apollo

That should be appended into the bottom of /etc/vbladed.conf, then we are ready to start the vblade instance for the configuration we’ve added. The vbladectl wrapper includes start, stop and restart flags which also accept an optional alias for performing actions against only a specific vblade instance, run vbladectl with no options for usage help. Time to start the vblade instance for apollo as follows:

# /usr/sbin/vbladectl start apollo
started vbladed for apollo (pid:16320 file:/home/apollo.img iface:eth1 mac:00:16:E6:D3:ED:E51)
( you could also just pass the start option without an alias to start instances for all entries in vbladectl.conf )

The default behavior for vblade also sends log data to the kernel log, typically /var/log/messages on most systems, so tailing the log will produce the following logs if all is normal:

# tail /var/log/messages
Apr  3 16:49:25 backup5 vbladed: started vbladed for apollo (pid:16320 file:/home/apollo.img iface:eth1 mac:00:16:E6:D3:ED:E51)
Apr  3 16:49:24 backup5 vbladed: pid 16320: e0.0, 419430400 sectors O_RDWR

The important part there is the ‘vbladed: pid 16320: e0.0, 419430400 sectors O_RDWR’ entry in the log as this comes from vblade itself, the other log entry comes from the wrapper. This log entry tells us that vbladed forked off successfully and that it has exported our data for the defined server as e0.0 (etherdrive shelf 0 slot 0), you’ll see the significance of this shortly.

We are now ready to move over to our client server, apollo, and import our new AoE file system. This is an easy task and if you are running a current Fedora / RHEL (CentOS) based distribution, you’ll find the AoE kernel module already included. The module is also part of the mainline kernel so if you are using a custom kernel, please be sure to enable the corresponding config option (CONFIG_ATA_OVER_ETH).

There is really no right way to load a kernel module, you can either use modprobe which I recommend or you can use insmod on the modules full path, which is a matter of preference. Let’s first verify the module exists, which modprobe does for us but for the sake of this article and familiarity, we will check (remember you’re running this on the client server, i.e apollo):

# find /lib/modules/$(uname -r)/ -name "aoe.ko"

There we have it, the module returned fine, listing the full path to it. If you did not get anything back this may be that you are running a custom kernel by your own choosing, and need to configure the CONFIG_ATA_OVER_ETH option. It may also be that your data center provider or a software vendor installed a custom kernel without this feature and you should contact them requesting it. As an alternative, you could download the etherdrive sources for the AoE kernel module on the coraid website and compile it against your kernel, this requires your kernel build sources or on RHEL based systems the kernel-headers package.

That said, we will now load the module using modprobe, the preferred method:

# /sbin/modprobe aoe
( or you can run /sbin/insmod MODULE-PATH )

If everything went OK, then modprobe will generate no output and you can verify the module is loaded as follows:

# lsmod | grep aoe
aoe                    60385  1

When the AoE module is loaded it will start listening for broadcast traffic from AoE on all available interfaces, a very passive process. If you have done everything correct then the module will quickly detect the exported device/file from the vblade server and inform you in the kernel log along with creating the appropriate /dev/etherd/ device file. Let’s verify this by checking the log and then checking the /dev/etherd path:

# tail /var/log/messages
Apr  4 17:13:02 apollo kernel: aoe: aoe_init: AoE v22i initialised.
Apr  4 17:13:02 apollo kernel: aoe: 003048761643 e0.0 v4014 has 419430400 sectors
Apr  4 17:13:02 apollo kernel:  etherd/e0.0: unknown partition table
# ls /dev/etherd/

If for some reason you do not see the log entries described above along with no e0.0 device file under /dev/ethered, this may be a misconfiguration on the vblade server, perhaps you got the interface or mac address in vbladed.conf wrong? Double check all values. If you opted to try run things over a public facing interface, the issue may be that your network VLAN’s each server (which is fairly common), in that case you may need to request that all your hardware be part of the same VLAN or the provisioning of a private switch and private links for your hardware.

Assuming that things went good, that you see the appropriate log entries and the e0.0 device file under /dev/etherd/, we are ready to mount the file system, we will mount it as /mnt/aoe for the purpose of this article:

# mkdir /mnt/aoe
# mount /dev/etherd/e0.0 /mnt/aoe
# df -h /mnt/aoe
Filesystem            Size  Used Avail Use% Mounted on
/dev/etherd/e0.0      5G   36M  4.9G  0% /mnt/aoe

You may run into an issue of unrecognized file system on the device, though the file system we created on it, on the vblade server, should show through. If it does not, simply run an ‘mkfs.ext3 /dev/etherd/e0.0’ on it and you will be all set. There is no hard set rule on creating the file system on the vblade server, you could just export raw images and devices then partition/format file systems on them on a per-client basis as you require it.

The only thing that is left is to set our new file system on apollo to load at boot time, the simplest way to do this is to append a couple of lines to /etc/rc.local as follows:

/sbin/modprobe aoe
sleep 5 ; mount /dev/etherd/e0.0 /mnt/aoe -onoatime

The rc.local script is run at boot time after all other services have started, so if you are loading a file system used for mysql, user home data or similar you will probably want to also add a line after the mount to restart said services. You’ll also notice two things about the entries we added to rc.local; The first is the sleep delay before the mount, this allows the aoe kernel module to complete its discovery process for AoE file systems before we try to mount it. Then, we are using the noatime option on the mount command, which disables the updating of the last access time on files during read/write operations. This is important because traditionally whenever a file is read from disk, it causes a write operation back to disk to update the atime attribute on the file, so disabling atime usage can greatly reduce i/o calls (effectively in half for reads), which is especially significant for networked file systems.

I have had an overall good experience with AoE so far, it is incredibly simple and very reliable as an implementation. The only issue I have seen is the scalability of it and I attribute this more to the vblade server package than AoE as a protocol. There appears to be a degradation in I/O throughput performance for exported file systems that is in-line with the number of (instances) file systems you export on the same physical server. The best usage example of this is that in my environment I run vblade on one server with exports to 54 servers, the throughput when there is 1-10 instances running averages about 51MB/s (408Mbit), as that increases though to 54 instances, the throughput per client server drops drastically to an average of 14MB/s (112Mbit). This is a very sharp decrease in performance, one that makes the viability of vblade in much larger of a setup questionable.

I do need to caution that this issue may be environment specific as speaking to other vblade users has produced mixed feedback, some do not experience this kind of performance loss while others do. I will also note that I run vblade on a second storage device, on the same private network as the 54 instance vblade server, and this second storage device has only 4 instances running with an average throughput of 71MB/s (568Mbit). So the conclusion you draw from this is up to you, at the end of the day I am more than happy with the implementation as a whole and can accept the loss of performance for the larger implementation in the name of reliability and simplicity.

IRSYNC & Limiting Passwordless SSH Keys

Anyone who has ever used SSH key-pairs to access more than a couple of servers (or hundreds in my case), will tell you they are an invaluable convenience. It is a natural progression and very common usage that SSH key-pairs are coupled with other common tasks or tools, where having a pass phrase attached to the key would be counter-intuitive to the task automation. So, what do we do despite our better judgment? We create key-pairs with absolutely no pass phrase. The implications are abundantly obvious, if the private key ever gets lost or stolen, any accounts that have the key-pair associated to it can be instantly compromised.

In the case of my recently released project Incremental Rsync (IRSYNC), one of the implementation hurdles at work was to have servers backup using a secure medium. This is easily handled with rsync’s -e option to have data transferred over ssh using a key-pair but then the obvious issue comes up that what if a client server ever gets compromised? Then the backup account on the backup server can be compromised (please don’t use root!@#!@#) allowing for backups to be deleted or worse yet data to be stolen for every server that backups to said server/account.

A solution to this is to limit the commands that can be executed over SSH by a specific public key, though this is not a perfect way to mitigate the threat it does go a long way to help. For my backup server implementation I have setup the user ‘irsync’ on the backup server, this account has the usual ‘~irsync/.ssh/authorized_keys’ file where I place the public key. Where things differ is that you prefix a script path in front of the public key that is used to interpret commands sent over ssh, which looks something like this:

command="/data/irsync/" ssh-dss AAAAB3NzaC1kc3MAAAC......87JVNLJ5nhaK1A== irsync@irsync

The ‘’ script is basically a simple interpreter, it looks at the commands being passed over ssh and either allows them or denies them with some logging thrown in for auditing purposes. The script can be downloaded from: Please take note to edit the scripts ‘log_file=’ value to an appropriate path, usually the base backup path or user homedir.

An example of in play would be as follows, first the client side view then the logs from $log_file:

root@praxis [~]# ssh -i /usr/local/irsync/ssh/id_dsa irsync@buserver3 "rm -rf /some/path"
sshval(13156): ssh command rejected from rm -rf /some/path

root@praxis [~]# ssh -i /usr/local/irsync/ssh/id_dsa irsync@buserver3
sshval(13403): interactive shell rejected from

May 04 11:36:15 buserver3 sshval(13156): ssh command rejected from rm -rf /some/path
May 04 11:40:03 buserver3 sshval(13403): interactive shell rejected from

On the flip side when a command is authorized, it gets recorded into the $log_file as follows:

May 04 05:29:08 buserver3 sshval(29993): ssh command accepted from rsync --server -lHogDtprx --timeout=600 --delete-excluded --ignore-errors --numeric-ids . /data/irsync/

Take note that if you do choose to use with irsync, you will need to create your own script to manage the snapshots as internally irsync uses the find command, piping results to xargs and rm which will not be authorized by (for good reason!). This is actually a very simple task, although all your snapshots will have to use the same rotation age (whatever).


for i in `ls $bkpath | grep snaps`; do
find $wd -maxdepth 1 -mtime +14 -type d | xargs rm -rf

You can save this to /root/, chmod 750 it and run it as a daily cronjob by linking it into /etc/cron.daily/ (ln -s /root/ /etc/cron.daily/) or you can add an entry into /etc/crontab as follows:

02 4 * * * root /root/ >> /dev/null 2>&1

Although I detailed the use of in the context of backups with irsync, this could easily be adapted to any usage when you want to restrict the commands executed over ssh with key pairs. You could even create your own script in perl or whatever floats your boat and use that instead — if you happen to go that route, please share with me what you created in the comments or by e-mail to ryan <at>

“oops” Wrong Server!

So this past weekend, I did the unthinkable, I accidentally recycled the wrong dedicated server at work. Usually, this is not much of an issue¬† (not that I make a habit of it) with the continuous data protection we have implemented at the data center (cdp r1soft) except that the backup server this particular client system was using had suffered a catastrophic raid failure the very night before. We have had raid arrays go bust on us before, typically very rare but it does happen… Obviously this resulted in the clients site and databases getting absolutely toasted and having only a static tar.gz cpanel backup available which was over a week old, they were none-too-happy about the loss of the database content.

I have dealt with data loss in the past of various degree’s but never had I dealt with it in the capacity where a format had occurred WITH data being rewritten to the disk. We are also not talking about just a few hundred megs of rewritten data but a complete OS reload along with cpanel install, which comprises multiple gigabytes worth of data and countless software compilations that consist of countless write-delete cycles of data to the disk.

So, the first thing I did on realization of the “incident” was stop everything on the server, remount all file systems read-only then had a “omg wtf” moment. Once I had collected myself I did the usual data loss chore of making a dd image of the disk to a NFS share over our gigabit private network while contemplating my next step. My last big data recovery task was some years ago, perhaps 2 or more years and since I am such a pack rat I still had a custom data recovery gzip on my workstation system that contained a number of tools I used back then. The main ones being testdisk and the sleuth (tsk) tool kit, these tools together are invaluable.

The testdisk tool is designed to recover partition data from a formatted disk and even those that have had minimal data rewrites, which it does exceptionally well. In this case I went in a bit unsure of the success I would have, sure enough though after some poking and prodding of teskdisk options I was able to recover the partition table for the previous system installation. This was an important task as any data that had not been overwritten on the disk instantly became available with the old partition scheme restored, sadly though this did not provide the data I required which was the clients databases. The partitions restored still provided me some metadata to work with and a relative range on the disk of where the data is located, instead of having to ferret over the whole disk. So with that, I created a new dd image of the disk with a more limited scope that comprised the /var partition which effectively cut the amount of unallocated space I needed to search down from 160gb to 40gb.

It was now time to crack out the latest version of the sleuth tool kit and its companion autopsy web application, installed it into shared memory through /dev/shm and then went through the chore of remembering how to use the autopsy webapp. After a few minutes poking around it started to come back to me and before I knew it I was browsing my image files, which is a painfully tedious task in hopes that the metadata can lead you to what your looking for through associative information on file-names to inode relationships. That is really pretty pointless in the end though as ext3 when it deletes data, as I understand, zeros the metadata before completing the unlink process from the journal. I quickly scrapped anything to do with metadata and moved on to generating ASCII string dumps of the images allocated and unallocated space, which allows for quick pattern based searches to find data.

The string dumps took a couple of hours to complete generating, after which I was able to keyword/regexp search the disks contents with relative ease (do not try searching large images without the string dumps, it is absurdly slow). I began some string searches looking for the backup SQL dumps that had been taken less than 24h ago during the weekend backups, although I did eventually find these dumps it turned out some of them were so large they spanned non-sequential parts of the disk. This made my job very difficult as it then became a matter of trying to string together various chunks of an SQL dump which I had no real knowledge of the underlying db structure for. After many hours of effort and some hit-or-miss results, I managed to recover a smaller database for the client which in the end turned out to be absolutely useless to them, that was it for the night for me – I needed sleep.

Sunday morning brought about an individual from the clients organization who was familiar with the database structure to the custom web application they have and was able to give me exact table names they needed recovered Рwhich was exactly what I needed. I was then able to craft some regexp queries that found all insert, update and structure definitions for each of the tables they required and despite some of the parts of these tables being spread across the disk Рknowing what they needed allowed my regexp queries to be accurate and provide me the locations of all the data. Now that I had all the locations it was just a task of browsing the data, modifying the fragment range of the data I was browsing so that it included the beginning and end of the data elements followed by exporting the data into notepad where I reconstructed the sql dumps to what turned out to be a very consistent state. This did take a little while though but was not near as painful a process as my efforts  from the night before, so I was very happy with where we had ended up.

A couple hours after I turned the data over to the client they were restoring tables they much needed to get back online, this was followed by ping from the client on AIM that they had successfully restored all data and were back online in a state near identicle to just before they went offline. What the client took from this is to never trust anyone else with safe guarding there data and they intend to keep regular backups of there own data now in addition to the backups we retain, which is a very sensible practice to say the least.