IRSYNC & Limiting Passwordless SSH Keys

Anyone who has ever used SSH key-pairs to access more than a couple of servers (or hundreds, in my case) will tell you they are an invaluable convenience. It is a natural progression, and very common practice, to couple SSH key-pairs with other tasks and tools where having a pass phrase attached to the key would defeat the purpose of the automation. So what do we do, despite our better judgment? We create key-pairs with no pass phrase at all. The implications are abundantly obvious: if the private key is ever lost or stolen, any account associated with that key-pair can be instantly compromised.

In the case of my recently released project Incremental Rsync (IRSYNC), one of the implementation hurdles at work was to have servers back up over a secure medium. This is easily handled with rsync’s -e option to transfer data over ssh using a key-pair, but then the obvious issue comes up: what if a client server is ever compromised? The backup account on the backup server can then be compromised as well (please don’t use root!@#!@#), allowing backups to be deleted or, worse yet, data to be stolen for every server that backs up to that server/account.

A solution to this is to limit the commands that can be executed over SSH by a specific public key. Though this is not a perfect way to mitigate the threat, it does go a long way to help. For my backup server implementation I have set up the user ‘irsync’ on the backup server; this account has the usual ‘~irsync/.ssh/authorized_keys’ file where I place the public key. Where things differ is that you prefix the public key with a command path pointing at a script that interprets the commands sent over ssh, which looks something like this:

command="/data/irsync/validate-ssh.sh" ssh-dss AAAAB3NzaC1kc3MAAAC......87JVNLJ5nhaK1A== irsync@irsync

The ‘validate-ssh.sh’ script is basically a simple interpreter: it looks at the commands being passed over ssh and either allows or denies them, with some logging thrown in for auditing purposes. The script can be downloaded from: http://www.rfxn.com/downloads/validate-ssh.sh. Please take note to edit the script’s ‘log_file=’ value to an appropriate path, usually the base backup path or the user homedir.
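
If you would rather roll your own, the general approach looks something like the sketch below. This is not the actual validate-ssh.sh code; it only illustrates how sshd hands the requested command to a forced-command script through the SSH_ORIGINAL_COMMAND environment variable. The log path and the allowed rsync pattern are example values, so adjust them to your setup.

#!/bin/sh
# minimal forced-command validator sketch (not the real validate-ssh.sh)
log_file=/data/irsync/validate-ssh.log

log() {
    stamp=`date '+%b %d %H:%M:%S'`
    echo "$stamp `hostname -s` sshval($$): $1" >> $log_file
}

# sshd exports the client's requested command as SSH_ORIGINAL_COMMAND when
# a command="..." restriction is present in authorized_keys
if [ -z "$SSH_ORIGINAL_COMMAND" ]; then
    log "interactive shell rejected from ${SSH_CLIENT%% *}"
    exit 1
fi

case "$SSH_ORIGINAL_COMMAND" in
    "rsync --server"*)
        log "ssh command accepted from ${SSH_CLIENT%% *}: $SSH_ORIGINAL_COMMAND"
        exec $SSH_ORIGINAL_COMMAND
        ;;
    *)
        log "ssh command rejected from ${SSH_CLIENT%% *}: $SSH_ORIGINAL_COMMAND"
        exit 1
        ;;
esac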

An example of validate-ssh.sh in play would be as follows; first the client side view, then the logs from $log_file:

root@praxis [~]# ssh -i /usr/local/irsync/ssh/id_dsa irsync@buserver3 "rm -rf /some/path"
sshval(13156): ssh command rejected from 192.168.3.33: rm -rf /some/path

root@praxis [~]# ssh -i /usr/local/irsync/ssh/id_dsa irsync@buserver3
sshval(13403): interactive shell rejected from 192.168.3.33

May 04 11:36:15 buserver3 sshval(13156): ssh command rejected from 192.168.3.33: rm -rf /some/path
May 04 11:40:03 buserver3 sshval(13403): interactive shell rejected from 192.168.3.33

On the flip side, when a command is authorized it gets recorded in the $log_file as follows:

May 04 05:29:08 buserver3 sshval(29993): ssh command accepted from 10.10.6.6: rsync --server -lHogDtprx --timeout=600 --delete-excluded --ignore-errors --numeric-ids . /data/irsync/mysql02.mynetwork.com.full

Take note that if you do choose to use validate-ssh.sh with irsync, you will need to create your own script to manage the snapshots, as internally irsync uses the find command, piping results to xargs and rm, which will not be authorized by validate-ssh.sh (for good reason!). This is actually a very simple task, although all of your snapshots will have to use the same rotation age.

#!/bin/sh
# remove irsync snapshot directories older than $age days
age=14
bkpath=/data/irsync

for i in `ls $bkpath | grep snaps`; do
wd=$bkpath/$i
# only look one level down at the snapshot dirs themselves, never $wd itself
find $wd -mindepth 1 -maxdepth 1 -mtime +$age -type d | xargs rm -rf
done

You can save this to /root/irsync_rotate.sh, chmod 750 it, and run it as a daily cronjob by linking it into /etc/cron.daily/ (ln -s /root/irsync_rotate.sh /etc/cron.daily/), or you can add an entry into /etc/crontab as follows:

02 4 * * * root /root/irsync_rotate.sh >> /dev/null 2>&1

Although I detailed the use of validate-ssh.sh in the context of backups with irsync, it could easily be adapted to any situation where you want to restrict the commands executed over ssh with key-pairs. You could even create your own script in perl or whatever floats your boat and use that instead; if you happen to go that route, please share what you created in the comments or by e-mail to ryan <at> rfxn.com.

Upgrade CentOS 4.8 to 5.x (32bit)

Traditionally, the dist upgrade path that many were familiar with from RH8/9 -> Fedora, or similarly from Fedora-to-Fedora dist upgrades, has applied more or less to RHEL/CentOS, but with the release of 4.5 and the early releases of 5.0 the actual dist upgrade path was messy or nearly impossible. The early versions of 5.0 (up to 5.2) had excessive dependency issues with versions later than 4.4 for straight dist upgrades, which would often result in a box blowing up on you or force a messy downgrade attempt from 4.5+ to 4.4 to try to get things to dist upgrade. With more recent release updates the gap has closed, and dist upgrades are now far more reasonable to complete with little in the way of problems.

If you are currently running a version of RHEL/CentOS earlier than 4.8 (cat /etc/redhat-release), then please do a proper ‘yum update’ and get yourself on 4.8. Although this is intended for CentOS, it “should” (read: at your own risk) work on RHEL systems as well; in the unfortunate situation that something does blow up, please post a comment and I will try to assist.

The first thing we must do is make sure none of our core binaries, libraries or other content is set immutable, as this will cause packages to fail on installation. If you are running an earlier version of LES, or you use immutable bits on system paths (sbin/bin/share/include/libexec/etc), then you should run the following:

wget http://www.rfxn.com/downloads/disable.les.rpmpkg
sh disable.les.rpmpkg
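
If you set immutable bits by hand rather than through LES, something along these lines will find and clear them; the path list below is only an example, so adjust it to whatever you actually locked down:

# list anything under the system paths with the immutable (i) attribute set
lsattr -R /bin /sbin /usr/bin /usr/sbin /usr/lib 2>/dev/null | awk 'NF == 2 && $1 ~ /i/ {print $2}'
# clear the immutable attribute so rpm can replace the files
chattr -R -i /bin /sbin /usr/bin /usr/sbin /usr/lib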

Once that is done, we should go ahead and do a quick run-through of cleaning up the yum cache, double-checking that any pending updates are installed, and rebuilding the rpmdb:

rpm --rebuilddb
yum clean all
yum update

If for some reason the rpm rebuild hangs for more than a few minutes then you may need to manually clear the rpmdb files:

rm -f /var/lib/rpm/__db.00*
rpm --rebuilddb

If you run into any minor dependency issues for packages that are not essential, such as syslinux and lftp, then you can either exclude them or, better yet, remove them. If you are not sure what a package does, then you should query it for description details and make an educated choice (rpm -qi PACKAGE):

rpm -e lftp syslinux mkbootdisk

OR (but not recommended)

yum update --exclude=syslinux --exclude=lftp --exclude=mkbootdisk

At this point you should be able to run a ‘yum update’ command, with optional excludes, and receive no errors (again, I recommend you remove conflicting items instead of using exclusions).

# yum update --exclude=nagios-plugins
Setting up Update Process
Setting up repositories
Reading repository metadata in from local files
Excluding Packages in global exclude list
Finished
No Packages marked for Update/Obsoletion

Now we are ready to get going. I have put together a small package that contains the packages needed for this upgrade, in addition to a few that you might require to resolve dependency conflicts:

wget http://www.rfxn.com/downloads/CentOS-5up.tar.gz
tar xvfz CentOS-5up.tar.gz
cd CentOS-5up

We need to go ahead and set up the centos-release package as follows:

rpm -Uhv centos-release-*

If you see that CentOS-Base.repo was created as /etc/yum.repos.d/CentOS-Base.repo.rpmnew then go ahead and move it into the proper place:

mv /etc/yum.repos.d/CentOS-Base.repo.rpmnew /etc/yum.repos.d/CentOS-Base.repo

Now we are ready for the kernel changes; this is an important part, so pay attention. The key to a successful upgrade is that you remove ALL OLD KERNELS, as many packages will fail to install during the upgrade if they detect a release 4.x kernel, due to minimum kernel version dependency checks. We will start by installing the new kernel first so it preserves the grub templating:

rpm -ivh kernel-2.6.18-164.el5.i686.rpm kernel-devel-2.6.18-164.el5.i686.rpm --nodeps

NOTE: release 5.x has SMP support integrated into the standard kernel, so no -smp kernel variant is required for multi-processor systems

If you are running an older system, the chances are you have a lot of older kernel packages installed, so make sure you get them all out of the way:

rpm -e $(rpm -qa | grep kernel | grep -v 2.6.18 | tr '\n' ' ')

You may end up with a few dependency errors coming up, such as lm_sensors and net-snmp; if the list is fairly small and contains only packages you do not recognize as critical, go ahead and remove them along with the kernels (if unsure, always query the package for info with ‘rpm -qi PACKAGE’, and remember you can reinstall them later):

# rpm -e $(rpm -qa | grep kernel | grep -v 2.6.18)
error: Failed dependencies:
kernel-utils is needed by (installed) lm_sensors-2.8.7-2.40.5.i386

The command that ended up being required on most of my servers to get the kernel packages and their related dependencies out of the way came to the following:

rpm -e $(rpm -qa | grep kernel | grep -v 2.6.18 | tr '\n' ' ') lm_sensors net-snmp net-snmp-devel net-snmp-utils

With that said and done, you should now have only 2 kernel packages installed, which are the 2.6.18 release 5.x kernels. DO NOT under any circumstance continue if you still have 2.6.9 release 4.x kernel packages installed; remove them!

# rpm -qa | grep -i kernel
kernel-2.6.18-164.el5
kernel-devel-2.6.18-164.el5

A cleanup of /etc/grub.conf may be required, though if all went as planned the rpm command should have handled this for us; review it anyway for good measure. You should find that 2.6.18-164.el5 is the only kernel in the file; if it is not, go ahead and clean it by removing all older entries for 2.6.9 kernels.
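
A quick way to eyeball this is to grep the file for the title/kernel lines; only the new 5.x kernel should show up (your root= value will of course differ from the example below):

grep -E 'title|kernel' /etc/grub.conf

# expected to show only entries along these lines:
# title CentOS (2.6.18-164.el5)
#       kernel /vmlinuz-2.6.18-164.el5 ro root=LABEL=/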

There is a known bug with python-elementtree package versions that causes yum/rpm to think the release 4.x version is newer than the 5.x version; to get around this without blowing up the entire python installation, we need to remove the package from just the rpmdb as follows:

rpm -e --justdb python-elementtree --nodeps

We can now go ahead and use yum to start the upgrade process; this is a dry run and will take a few minutes to compile the list of available packages and run the associated dependency checks. You should carry over the exclude options, if any, that you used during the ‘yum update’ process so as to avoid unresolvable dependencies:

yum clean all
yum upgrade --exclude=nagios-plugins

You will end up with a small list of dependency errors; these should be resolved by again evaluating a package’s need as a critical system component and either removing it with ‘rpm -e’ or excluding it with ‘--exclude’ (remember to query the description with ‘rpm -qi PACKAGE’ if you are unsure what something does). In my case, the packages that threw up red flags were things I had manually installed over time, such as iftop and mrtg, in addition to the default-installed samba; these can all safely be removed or excluded as you prefer (removal is always safest to prevent dependency chain issues).

Error: Missing Dependency: libpcap.so.0.8.3 is needed by package iftop
Error: Missing Dependency: perl(Convert::ASN1) is needed by package samba
Error: Missing Dependency: libevent-1.1a.so.1 is needed by package nfs-utils
Error: Missing Dependency: perl-Socket6 is needed by package mrtg
Error: Missing Dependency: perl-IO-Socket-INET6 is needed by package mrtg


rpm -e iftop samba nfs-utils mrtg system-config-samba

At this point we should be ready to do a final dry run with yum and see where we stand on dependencies; rerun the earlier ‘yum upgrade’ while making sure to carry over any exclude options you are using.

yum upgrade --exclude=nagios-plugins

You should now end up with a summary of actions that yum needs to perform; go ahead and kick it off… this will take a while to complete, so go grab some coffee/jolt/redbull and maybe a small snack, because it could be a long night if this blows up on you.

Transaction Summary
=============================================================================
Install 183 Package(s)
Update 327 Package(s)
Remove 0 Package(s)
Total download size: 299 M
Is this ok [y/N]:

Once yum has completed (hopefully without major errors) we need to fix a few things. The first is that the rpmdb needs a rebuild due to version changes, which will otherwise cause any rpm command to fail:

# rpm -qa
rpmdb: Program version 4.3 doesn't match environment version
error: db4 error(-30974) from dbenv->open: DB_VERSION_MISMATCH: Database environment version mismatch
error: cannot open Packages index using db3 - (-30974)
error: cannot open Packages database in /var/lib/rpm

This can be fixed by running the following to manually rebuild the rpmdb:

rm -f /var/lib/rpm/__db.00*
rpm --rebuilddb
yum clean all

The next issue on the list is python-elementtree and python-sqlite; one or both of these may have ended up in a broken state that will cause all yum commands to break, so we will go ahead and reinstall both of them for good measure:

rpm -e --justdb python-elementtree --nodeps
rpm -ivh python-elementtree-1.2.6-5.el5.i386.rpm
rpm -ivh python-sqlite-1.1.7-1.2.1.i386.rpm --nodeps --force

The yum command should now work; go ahead and run it with no options, and if you do not get any errors you are all sorted.

Hopefully the install went well for you; the only thing left to do is go ahead and reboot the system. This is your last chance to take backups before the reboot (but we all maintain backups, right?). For the sake of avoiding a heart attack if the system goes into an fsck, we will reboot with the -f option to skip fsck:

shutdown -rf now

That’s a wrap; I hope you found this HowTo useful. If you did run into any issues, go ahead and post them in the comments and I will try to assist, but when in doubt, google is typically the fastest alternative.

Linux Malware Detection

[ UPDATE: Linux Malware Detect has been released ]
For the last few weeks I have been working on a new project for malware detection on Linux web servers; it is already at a pre-release version in use at work and it has shown phenomenal promise.

Right to it, some background… On a daily basis the network I manage receives a large number of attacks; most of these are web-based abuses against common web application vulnerabilities which inject/upload to servers an array of malware such as phishing content, defacement tools, exploits for privilege escalation and irc c&c bots. All of these actions are typically logged and recorded by our network edge snort setup, which got me thinking: if we started to catalog some of the injected malware, I could hash it and then detect it on servers.

Now, some might be thinking: “network edge IDS? why not convert it to IPS and stop the attacks right away?” Though this is something I am actually in the process of doing, there is a much larger problem, and that is content encoding. A lot of malware attacks these days are coming in as base64 and gzip encoded data payloads, which snort, or any other IDS/IPS product for that matter, is currently NOT capable of decoding without fancy transparent proxy setups that are out of scope for standard network edge intrusion detection/prevention.

So, this brings us to a host based solution for malware detection, which as it turns out is not so easy, as there are no sites that actually track malware specifically targeting web applications, and the ones that do exist focus primarily on Windows based malware; utterly useless. To address this shortcoming, what I have done is essentially written a set of tools that extracts the payload data of attacks from specific ids events (decoding it if needed) and saves/downloads the content attackers are trying to inject. This data is then processed for false positives by me every couple of days, followed by the creation of md5 hashed definitions of the malware for the detection tool. The hashes are compiled in two ways: the first is straight md5 hashes of the data, and the second is hashes of “chunked” elements of the data in specific increments and formats, so as to detect commonly occurring malware code in otherwise unique files and content types.
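
To give a feel for the straight-hash half of that, the core check is no more complicated than the sketch below; this is not the project's actual code and the signature file path is made up for the example, it just shows the idea of comparing file md5s against a list of known-bad hashes:

#!/bin/sh
# compare the md5 of every file under the user web paths against a list of
# known-bad hashes, one hash per line in $sigs (hypothetical path)
sigs=/usr/local/maldet/sigs/md5.dat
find /home/*/public_html -type f 2>/dev/null | while IFS= read -r file; do
    sum=`md5sum "$file" | awk '{print $1}'`
    if grep -q "$sum" "$sigs"; then
        echo "malware hit: $file ($sum)"
    fi
done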

The scanner portion of the malware detection tool comes in 3 varieties: the first is a standard “scan all” feature which scans an entire defined path, the second is a “scan recent” feature that can scan a path for content created in the last X days (i.e: /home/*/public_html content created in the last 7 days), and the third is a real time monitoring service component that uses the Linux inotify() kernel feature to detect file create/move/modify operations in real time and scan content immediately as it is created under user web paths (default /home[2]/user/public_html).
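
The monitoring component is easiest to picture with a rough sketch; the tool itself talks to inotify() directly, but the same idea can be expressed with the inotifywait utility from inotify-tools, with scan_file standing in for whatever per-file check gets run:

#!/bin/sh
# watch the user web paths and hand every new or changed file to a scanner;
# scan_file is a placeholder for the actual hash/signature check
inotifywait -m -r -e create -e moved_to -e close_write --format '%w%f' \
    /home/*/public_html 2>/dev/null | while IFS= read -r file; do
    scan_file "$file"
done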

The malware hit management is a very simple anti-virus like quarantine system that moves offending files to ‘INSTALL_PATH/quarantine/’ and logs the exact source path and destination file name in the quarantine locker in case you need to restore any data due to false positives (though this should never happen, since we are using hashed detection). In addition, the quarantine function can optionally search the process table for running tasks that contain the file name of the offending malware and kick off a kill -9 against them.
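
Stripped down to its essentials, a quarantine step of that kind looks something like the following; again, this is only a sketch with made-up paths, not the tool's actual code:

#!/bin/sh
# move a flagged file into the quarantine locker, record where it came from,
# and kill anything still running with that file name on its command line
qdir=/usr/local/maldet/quarantine   # hypothetical install path
file="$1"
base=`basename "$file"`
stamp=`date +%s`
dest="$qdir/$base.$stamp"

mv "$file" "$dest"
ts=`date '+%b %d %H:%M:%S'`
echo "$ts $file => $dest" >> "$qdir/../quarantine.hits"
# kill -9 anything with the malware file name on its command line, except us
for pid in `pgrep -f "$base"`; do
    [ "$pid" != "$$" ] && kill -9 "$pid"
done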

The event management is handled in two ways. For manual user-invoked scans from cron/command line, emails are directly dispatched with the scan results including quarantine details; nothing really fancy here. The monitor component that uses inotify(), on the other hand, has the potential to generate a lot of quarantine events in rapid succession, so a standard email out on every hit isn’t appropriate. Instead, we have a daily cron job that runs an internal option in the malware detect tool to read ONLY new lines from the quarantine hit list and dispatch a daily event summary if any quarantine hits are found. Since we are only reading new lines from the hit list, we avoid repetitive daily alerts for events we already know about and retain the hit list as an “all-time” hit list that can later be used to derive trending data / phone home features for global trending.
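
The "only new lines" trick is simple enough to do from shell; a minimal version, assuming a hit list and a state file at made-up paths, would run from cron along these lines:

#!/bin/sh
# mail out only the hit list lines added since the last run
hits=/usr/local/maldet/quarantine.hits   # hypothetical hit list path
state=/usr/local/maldet/.hits.offset     # remembers how many lines were reported

[ -f "$hits" ] || exit 0
last=`cat "$state" 2>/dev/null`
[ -z "$last" ] && last=0
total=`wc -l < "$hits"`
if [ "$total" -gt "$last" ]; then
    start=`expr $last + 1`
    tail -n +$start "$hits" | mail -s "daily malware hit summary" root
fi
echo "$total" > "$state"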

Finally, the project also contains an internal update function to check for new hashes, which runs in the daily cron task, in addition to a simple check that determines if the inotify() based monitoring is running; if it is not, it kicks off a scan against /home[2]/user/public_html for content created in the last 48h.

“oops” Wrong Server!

So this past weekend I did the unthinkable: I accidentally recycled the wrong dedicated server at work. Usually this is not much of an issue (not that I make a habit of it) with the continuous data protection we have implemented at the data center (cdp r1soft), except that the backup server this particular client system was using had suffered a catastrophic raid failure the very night before. We have had raid arrays go bust on us before; it is typically very rare, but it does happen… Obviously this resulted in the client’s site and databases getting absolutely toasted, with only a static tar.gz cpanel backup available which was over a week old, and they were none too happy about the loss of the database content.

I have dealt with data loss of various degrees in the past, but never had I dealt with it in a situation where a format had occurred WITH data being rewritten to the disk. We are also not talking about just a few hundred megs of rewritten data, but a complete OS reload along with a cpanel install, which comprises multiple gigabytes worth of data and countless software compilations consisting of countless write-delete cycles of data to the disk.

So, the first thing I did on realization of the “incident” was stop everything on the server and remount all file systems read-only, then I had an “omg wtf” moment. Once I had collected myself, I did the usual data loss chore of making a dd image of the disk to an NFS share over our gigabit private network while contemplating my next step. My last big data recovery task was some years ago, perhaps 2 or more, and since I am such a pack rat I still had a custom data recovery gzip archive on my workstation that contained a number of tools I used back then, the main ones being testdisk and The Sleuth Kit (TSK); together these tools are invaluable.

The testdisk tool is designed to recover partition data from a formatted disk, even one that has had minimal data rewrites, and it does this exceptionally well. In this case I went in a bit unsure of the success I would have, but sure enough, after some poking and prodding of testdisk options, I was able to recover the partition table for the previous system installation. This was an important step, as any data that had not been overwritten on the disk instantly became available with the old partition scheme restored; sadly though, this did not provide the data I required, which was the client’s databases. The restored partitions still provided me some metadata to work with and a relative range on the disk of where the data was located, instead of having to ferret over the whole disk. So with that, I created a new dd image of the disk with a more limited scope that comprised the /var partition, which effectively cut the amount of unallocated space I needed to search from 160gb down to 40gb.

It was now time to crack out the latest version of The Sleuth Kit and its companion autopsy web application; I installed them into shared memory through /dev/shm and then went through the chore of remembering how to use the autopsy webapp. After a few minutes of poking around it started to come back to me, and before I knew it I was browsing my image files, which is a painfully tedious task done in the hope that the metadata can lead you to what you’re looking for through associative information on file-name to inode relationships. That turned out to be pretty pointless in the end though, as ext3, when it deletes data, zeros the metadata (as I understand it) before completing the unlink process from the journal. I quickly scrapped anything to do with metadata and moved on to generating ASCII string dumps of the image’s allocated and unallocated space, which allows for quick pattern based searches to find data.

The string dumps took a couple of hours to generate, after which I was able to keyword/regexp search the disk’s contents with relative ease (do not try searching large images without the string dumps, it is absurdly slow). I began some string searches looking for the backup SQL dumps that had been taken less than 24h earlier during the weekend backups; although I did eventually find these dumps, it turned out some of them were so large they spanned non-sequential parts of the disk. This made my job very difficult, as it then became a matter of trying to string together various chunks of an SQL dump for which I had no real knowledge of the underlying db structure. After many hours of effort and some hit-or-miss results, I managed to recover a smaller database for the client, which in the end turned out to be absolutely useless to them. That was it for the night for me; I needed sleep.
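
For anyone curious, the same string dump approach can be reproduced with nothing more than strings, grep and dd; these are not the exact commands I ran (autopsy drives this for you), and the image name, table name and offset below are made up purely for illustration:

# dump printable strings along with their decimal byte offsets, then search
# the dump instead of the raw image
strings -t d var.dd > var.str
grep 'INSERT INTO `orders`' var.str

# carve a chunk around a hit, e.g. 64k starting at byte offset 123456789
dd if=var.dd bs=1 skip=123456789 count=65536 of=fragment.sql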

Sunday morning brought an individual from the client’s organization who was familiar with the database structure of their custom web application and was able to give me the exact table names they needed recovered, which was exactly what I needed. I was then able to craft some regexp queries that found all the insert, update and structure definitions for each of the tables they required, and despite parts of these tables being spread across the disk, knowing what they needed allowed my regexp queries to be accurate and give me the locations of all the data. Now that I had all the locations, it was just a task of browsing the data, modifying the fragment range of the data I was browsing so that it included the beginning and end of the data elements, and then exporting the data into notepad where I reconstructed the sql dumps to what turned out to be a very consistent state. This did take a little while, but it was not nearly as painful a process as my efforts from the night before, so I was very happy with where we ended up.

A couple of hours after I turned the data over to the client, they were restoring the tables they needed to get back online; this was followed by a ping from the client on AIM that they had successfully restored all data and were back online in a state nearly identical to just before they went offline. What the client took from this is to never trust anyone else alone with safeguarding their data, and they intend to keep regular backups of their own now in addition to the backups we retain, which is a very sensible practice to say the least.