Tracking & Killing Bot Networks

In a previous blog post I discussed how one of the more enjoyable parts of my day-to-day malware rituals is tracking and killing command and control bot networks. Recently I have begun automating this process a bit. I have created a series of scripts that extract IRC servers, port numbers and channels from malware as it comes in and then check whether each IRC server is still online. A custom bot then logs into the server, queries the active channels and determines how many zombies are active on the network. If an IRC server is found to be active with zombies connected, the server is reported to the abuse address listed in the whois information for the server’s IP address.
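
As a rough illustration of the first two steps (extraction and the online check), here is a minimal Python sketch. It is not the actual set of scripts described above. The variable names matched by the patterns ($server, $port, $chan) and both function names are assumptions made for the example; real samples vary widely and are often obfuscated.

```python
import re
import socket

# Hypothetical patterns: assumes the bot keeps its settings in plain
# assignments such as  $server = "irc.example.net";  (an assumption).
PATTERNS = {
    "server": re.compile(r"""\$server\s*=\s*['"]([^'"]+)['"]""", re.I),
    "port": re.compile(r"""\$port\s*=\s*['"]?(\d{1,5})['"]?""", re.I),
    "channel": re.compile(r"""\$chan(?:nel)?\s*=\s*['"](#[^'"]+)['"]""", re.I),
}


def extract_endpoint(source: str) -> dict:
    """Return whichever of server/port/channel can be found in the script text."""
    found = {}
    for field, pattern in PATTERNS.items():
        match = pattern.search(source)
        if match:
            found[field] = match.group(1)
    if "port" in found:
        found["port"] = int(found["port"])
    return found


def is_online(host: str, port: int, timeout: float = 5.0) -> bool:
    """True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, DNS failure, timeout
        return False
```

A successful TCP connect only shows that something is listening on the port. Counting zombies still requires logging in and querying the channels, as described above.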

The automation of this process is something I have had on my todo list for a little while, but I finally stopped procrastinating and got it done. The real advantage of having it automated is that I can easily generate a tangible set of information showing how many bot networks are present in the malware I process daily, weekly and monthly, how many of those networks are still active and, more importantly, how many of those networks still have active zombies connected. Likewise, as I’ve discussed previously, I am working on a threat portal, and having the IRC C&C data processing automated will make it easier to put that information on the threat portal and integrate it into the aggregate threat feed that the portal will offer for route/firewall/DNSBL drops.

Here are some statistics on IRC command and control networks as seen in the malware processed by me in the last 30 days:
Total Processed Malware (30d): 607
Total IRC C&C Servers: 251
Total Online IRC C&C Servers (as of 08/17/10): 118
Total Online IRC C&C Servers with Active Zombie Hosts: 30
Total Zombies Observed on Online IRC C&C Servers: 1,679 (55 average per server)
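
For anyone who wants to check the arithmetic behind these figures, a small Python sketch follows. The numbers are the ones listed above; the variable names are only for illustration.

```python
# Figures taken from the 30-day statistics above.
total_servers = 251
online_servers = 118
servers_with_zombies = 30
zombies_observed = 1679

inactive_servers = total_servers - online_servers
online_pct = 100 * online_servers / total_servers
zombie_host_pct = 100 * servers_with_zombies / online_servers
avg_zombies = zombies_observed / servers_with_zombies

print(f"Inactive servers: {inactive_servers}")
print(f"Online: {online_pct:.1f}% of all C&C servers")
print(f"Online servers with zombies: {zombie_host_pct:.1f}%")
# The quotient is just under 56; the figure above lists it as 55.
print(f"Average zombies per server with zombies: {avg_zombies:.2f}")
```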

There are some notable observations. Out of the total of 251 noted IRC C&C servers, only 118 are still online. Of those 118, 64 utilize free DNS naming services and/or dynamic DNS services; the other 54 create C&C channels on established public IRC networks or use the DNS name of compromised hosts running an IRC server. Nearly every one of the 133 now inactive IRC servers used IP addresses within the host malware script; a small minority used DNS names of compromised hosts.

It goes without saying that using public DNS services / dynamic DNS services gives attackers the flexibility to quickly recover a C&C server and its participating zombies in the event of the host server being shut down. Further, a number of more mature IRC C&C bots will continue reconnection attempts periodically when disconnected from the host C&C server, increasing the chance of fully recovering the zombie network for the attacker.

PHP is also becoming increasingly common as a language of choice for C&C bot agents, though Perl agents are still vastly more popular. The LMD project has currently classified 44 unique C&C bot agents comprising 286 agent scripts/binaries: 14 classes (38 scripts) are PHP based, 21 classes (213 scripts) are Perl based and 9 classes (35 scripts/binaries) are Other (C/Ruby/Java).

Currently an average of 6 bot networks are being abuse reported per day; of those, only about 2-3 per day ever receive any form of followup and/or shutdown of the host running the network. That is a rate of 50% at best, which is abysmal to say the least. When the threat management portal goes up in the coming weeks, these networks will find themselves at the top of the threat feed and planted squarely on the front page of the portal. We might not be able to shut them down, but we sure can filter them off our networks.

Understanding Signatures

The signature naming scheme for LMD is a little confusing and something I’ve received more than a few questions about, especially about what the *.unclassed signatures mean. The naming scheme (to me) is straightforward and breaks down as follows:

{SIG_FORMAT}lang/vector.type.name.ID#

The ‘SIG_FORMAT’ is either HEX or MD5, reflecting the internal format of the signature; the ‘lang/vector’ is the language or attack vector of the malware; ‘type’ is a short descriptive field for what the malware does (e.g. ircbot, mailer, injection etc.); ‘name’ is a short descriptive name unique to the piece of malware; and ‘ID#’ is the internal signature ID number.

What some people appear confused about is signatures such as ‘{HEX}base64.inject.unclassed.7’ that use the term “unclassed” for the name field. Essentially, unclassed signatures represent a group of malware samples that are not necessarily distinct from each other but that follow the same attack vector, such as base64 encoded scripts. There are hundreds of these scripts and in encoded form it doesn’t really matter what they do; we are detecting the encoded format, not the decoded one, so they get lumped together. In other instances, I will throw some malware into an unclassed group when it is very new and I have not yet had time to process it into its own classification. For example, web.malware.unclassed is a dumping ground for a lot of newly submitted malware that I have reviewed and confirmed IS MALWARE but have not yet classified or determined to be a variant of an existing malware classification.
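
To make the field breakdown concrete, here is a minimal Python sketch that splits a signature name into its parts. It is not part of LMD. It assumes the fields are joined by dots after the braced format tag, which is inferred from the ‘{HEX}base64.inject.unclassed.7’ example, and the Signature class and parse_signature function are hypothetical names.

```python
import re
from typing import NamedTuple


class Signature(NamedTuple):
    sig_format: str  # HEX or MD5
    vector: str      # language or attack vector
    sig_type: str    # what the malware does (ircbot, mailer, ...)
    name: str        # descriptive name, or "unclassed"
    sig_id: int      # internal signature ID number


# Assumed layout: {FORMAT}vector.type.name.ID, with the trailing number as the ID.
SIG_RE = re.compile(r"^\{(HEX|MD5)\}([^.]+)\.([^.]+)\.(.+)\.(\d+)$")


def parse_signature(text: str) -> Signature:
    match = SIG_RE.match(text)
    if match is None:
        raise ValueError(f"not a recognised signature name: {text!r}")
    sig_format, vector, sig_type, name, sig_id = match.groups()
    return Signature(sig_format, vector, sig_type, name, int(sig_id))


sig = parse_signature("{HEX}base64.inject.unclassed.7")
print(sig.vector, sig.sig_type, sig.name == "unclassed")
```

A group name such as web.malware.unclassed carries no format tag or ID number, so this parser rejects it rather than guessing at the missing fields.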

It needs to be understood that the processing of malware is mostly a manual task. Though some elements of it are automated, the actual review of each malware file is done by hand to remove the chance of false positives, keeping LMD accurate and reliable. As such, not all malware makes it into a classification group right away; the important part is that malware is reviewed, verified and has signatures generated for it in a timely fashion. I process malware daily from the network edge IPS system at work, from user submitted files and from various malware news groups / web sites, and the priority is getting the signatures up for in-the-wild threats. The signature name/classification serves informative purposes; yes, it is important, but not as important as the actual verification and signature generation.

ATF v2: Weighted Threats

When I first introduced you all to the Aggregate Threat Feed back in May, it was a much smaller feed with very simple ambitions: pulling together threat data at work from our network edge and host based firewalls and aggregating the data into a usable feed. The intention was that as an attacker exposes themselves more on the network through invasive scans and attacks, they would quickly climb up the threat feed and end up banned proactively. Though this did and still does happen in a way, a problem was introduced when more and more data started to come in from the network edge and it quickly outweighed data from the hosts.

The old way the threat feed was sorted was by number of events. For the network edge IPS, the events correlated with actual signature events on the network edge, so these could number from 50 events for an SNMP community scan to thousands of events for an SSH scan. Then you have the host based firewall events (mostly brute force attacks). These events are correlated into the feed by the occurrence of an attacker’s address across unique servers, so if an attacker made a brute force attempt against 11 servers it would show in the feed as 11 events.

The problem that developed here is that the network edge IPS is far more noisy on an exposure level than the host based firewalls. You would end up with hundreds of IPs from the network edge with thousands of events each, while the host based firewalls, even though they also represent hundreds of attacking IPs, produced FAR lower event counts because those counts reflect only the unique servers each IP attacked. This meant that often the top 50 or 100 items in the threat feed were all IPS events. Those were quite valid events, but the host based events had more threat significance than some of the IPS events. The host events were simply being washed out of the top 100 on the list by the sheer volume of IPS events (who really wants to import 300 addresses from a threat feed, let alone even 100?).

So, what I decided on doing was adding a weighted field into the database that is based on unique targets for each attacking IP. This weighted field is the new sort method for the feed and it works something like this. If the IPS picks up an attacker hammering five servers with an SQL injection exploit, that attacking IP ends up in the threat feed with a weight of 5. If we then have an attacker that runs brute force attacks on 30 servers, that attacking IP ends up in the threat feed with a weight of 30. The end result is that the threat feed gets populated with the highest weighted attackers at the top, so those attackers who are more aggressive across unique targets quickly end up at the top of the list. This allows the feed to better protect the devices/hosts it is being used on from a developing attack before the attacker reaches that device/host on the network.
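
The weighting described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the feed’s actual database logic. The build_feed function, the event tuple layout and the tie-break on event count are assumptions, and the addresses are documentation-range placeholders.

```python
from collections import defaultdict


def build_feed(events):
    """events: iterable of (attacker_ip, target_host) pairs, one per logged event.

    Returns rows of (attacker_ip, event_count, weight), where weight is the
    number of unique targets. Rows are sorted by weight, highest first; ties
    fall back to event count (the tie-break is an assumption of this sketch).
    """
    targets = defaultdict(set)
    counts = defaultdict(int)
    for attacker, target in events:
        targets[attacker].add(target)
        counts[attacker] += 1
    feed = [(ip, counts[ip], len(targets[ip])) for ip in targets]
    feed.sort(key=lambda row: (-row[2], -row[1], row[0]))
    return feed


# A noisy attacker: 200 IPS events against each of 5 servers (1000 events, weight 5).
noisy = [("192.0.2.10", f"srv{n}") for n in range(5) for _ in range(200)]
# A broad attacker: one brute force event against each of 30 servers (weight 30).
broad = [("198.51.100.7", f"srv{n}") for n in range(30)]

for ip, event_count, weight in build_feed(noisy + broad):
    print(ip, event_count, weight)
```

Sorted by raw event count, the noisy attacker would come first with 1000 events. Sorted by weight, the attacker touching 30 unique servers rises to the top, which is the behaviour the new sort is meant to produce.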

Drop Format:

List Format (fields: IP | SERVICE | EVENTS | WEIGHT):