Incremental rsync

Current Release:
http://www.rfxn.com/downloads/irsync-current.tar.gz
http://www.rfxn.com/appdocs/README.irsync
http://www.rfxn.com/appdocs/CHANGELOG.irsync

Description:
The irsync tool is an incremental wrapper for the rsync utility. Although rsync supports incremental backups natively, irsync adds convenience features on top of it. The design goals were to create point-in-time incremental backups that use as little space as possible on the storage media, together with a complete and effective MySQL backup routine. The project was initially scoped to cover some personal hardware, but it quickly snowballed into a fully featured tool that I decided to package as a project.

Currently I have irsync running on 28 servers managing 448 snapshots consisting of 21TB of data. Usage varies from backups of DNS servers to dedicated SQL servers and critical web servers. The only usage note that stands out is that electing to use ‘--mysql-dump-gz’ or ‘mysql_dump_gz=1’ in conf.irsync will break incremental support for MySQL dump backups and force a full copy of the dumps to be retained within each snapshot. This may be desirable for some people, but with a large MySQL installation it could quickly get out of hand on space usage across the default retention of 14 days of snapshot data.
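
As a rough illustration of that note, a conf.irsync fragment that keeps dumps enabled but uncompressed might look like the following. The option names are the ones quoted on this page; that 0 disables an option and that the file uses plain shell-style assignments are assumptions, so check the shipped conf.irsync before relying on it.

```shell
# Hypothetical conf.irsync fragment (values are assumptions, not shipped defaults).
mysql_dump=1       # keep mysqldump backups enabled
mysql_dump_gz=0    # leave dumps uncompressed so they stay incremental across snapshots
```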

Features:
– traffic control (tc) shaping of outbound traffic for rate limiting
– preservation of full backup with incremental snapshots
– each incremental snapshot can be restored as a full point-in-time backup
– hard link based snapshots to reduce disk usage
– compatible with unmanaged storage space, all operations are client side
– optional local mode for performing serverless backups (e.g. to a backup disk)
– auto-deletion of snapshots based on configurable age values
– auto-generation of ssh public/private key pairs for irsync install
– mysql backups through mysqldump with non-locking fast dumps & gzip compression
– mysql backups through mysqlhotcopy of raw mysql database (var/lib/mysql/db/*)
– mysql backups flush to disk of all open tables for consistent backups
– mysql backups stored as full and point-in-time backups of hotcopy images

Storage:
The irsync storage logic uses hard links to create point-in-time snapshots of an incrementally updated full backup. On execution rsync creates a full backup of the defined paths, then the ‘cp’ tool is used to create a hard-linked copy of that data. On the next rsync run against the full backup path, any file that has been created, deleted or modified is replaced in the storage path, which breaks the hard link and leaves the previous version behind in the snapshot as a point-in-time backup of the changed data.
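
The cycle described above can be sketched in a scratch directory. This is a minimal illustration, not irsync's actual code: the paths and file names are made up, `cp -al` (GNU coreutils) is an assumed way of making the hard-linked copy, and the `mv` step stands in for rsync, which by default writes a changed file to a temporary name and renames it into place.

```shell
set -eu
work=$(mktemp -d)                 # scratch area, hypothetical layout
full="$work/host.full"
snaps="$work/host.snaps"
mkdir -p "$full/etc" "$snaps"
echo "version 1" > "$full/etc/app.conf"
echo "unchanged" > "$full/etc/hosts"

# Hard-linked copy of the full image: new directory entries, same inodes.
cp -al "$full" "$snaps/snap1"

# Simulate the next sync replacing a changed file with a new inode.
echo "version 2" > "$full/etc/app.conf.tmp"
mv "$full/etc/app.conf.tmp" "$full/etc/app.conf"

full_conf=$(cat "$full/etc/app.conf")
snap_conf=$(cat "$snaps/snap1/etc/app.conf")
hosts_links=$(stat -c %h "$full/etc/hosts")   # link count of an unchanged file
echo "full: $full_conf"
echo "snap: $snap_conf"
echo "links on unchanged file: $hosts_links"
rm -rf "$work"
```

The snapshot keeps the old contents of the changed file, while the unchanged file is still a single inode shared by the full image and the snapshot.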

The path structure is as follows:
STORAGE_PATH/HOSTNAME.FULL
STORAGE_PATH/HOSTNAME.FULL/MYSQLHOTCOPY
STORAGE_PATH/HOSTNAME.FULL/MYSQLDUMP
STORAGE_PATH/HOSTNAME.SNAPS/DATESTAMP

The point-in-time backups, which are restorable as full backups, are stored in the .SNAPS directory. They are rotated off for deletion based on the max age value in conf.irsync, using find’s mtime option piped to rm.
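
A rotation of this kind can be sketched as follows. It is an approximation under stated assumptions: the directory names and the `max_age` variable are hypothetical, `-exec` is used in place of a pipe to rm, and `touch -d` (GNU coreutils) only fakes an old snapshot for the demonstration.

```shell
set -eu
snaps=$(mktemp -d)       # stands in for STORAGE_PATH/HOSTNAME.SNAPS
max_age=14               # hypothetical retention in days
mkdir "$snaps/old" "$snaps/new"
touch -d "20 days ago" "$snaps/old"

# -mindepth 1 keeps the snaps directory itself out of the match;
# -maxdepth 1 looks only at each snapshot's top-level directory.
find "$snaps" -mindepth 1 -maxdepth 1 -type d -mtime +"$max_age" -exec rm -rf {} +

remaining=$(ls "$snaps")
echo "remaining: $remaining"
rm -rf "$snaps"
```

Note that find judges age by the mtime of the snapshot directory itself, which only changes when entries are added or removed directly inside it.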

A common misconception is that deleting a hard link will delete the source data, but this is not the case. When rm is run on a hard link, the link count of the file is decremented and the data is only deleted when that count reaches 0.
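
The link count behaviour is easy to check by hand. A small sketch, assuming GNU stat for the `-c %h` link count format and made-up file names:

```shell
set -eu
dir=$(mktemp -d)
echo "payload" > "$dir/original"
ln "$dir/original" "$dir/link"        # second name for the same inode

before=$(stat -c %h "$dir/link")      # link count with both names present
rm "$dir/original"
after=$(stat -c %h "$dir/link")       # link count after removing one name
data=$(cat "$dir/link")               # data is still reachable

echo "links before rm: $before, after rm: $after, data: $data"
rm -rf "$dir"
```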

To demonstrate how the backups work on the storage server we can look at the below storage layout details to see how the snapshots and full image get populated.

The synced full image, with its size and number of files:
# ls freedom.lan.full/
etc home local root var mysqldump mysqlhotcopy

# du -sh freedom.lan.full/
1.9G freedom.lan.full

# find freedom.lan.full | wc -l
17911

Now let’s assume we have run three iterations of irsync to date; the snapshots path would look something like this:

# ls freedom.lan.snaps/
2010-02-19.202026 2010-02-20.202718 2010-02-21.191503

# ls 2010-02-21.191503/
etc home local root var mysqldump mysqlhotcopy

# du -shc *
12M 2010-02-19.202026
133M 2010-02-20.202718
275M 2010-02-21.191503

# for i in *; do echo "$(find "$i" | wc -l) $i"; done
17819 2010-02-19.202026
18416 2010-02-20.202718
18227 2010-02-21.191503

So what does this all translate into? Our full backup is 1.9G in size with 17.9k files, and subsequent backups have synced in changed data only. The 2010-02-19.202026 image holds 12M of changed data and 92 fewer files than the full image. Although we capture the changed data in the 02-19 snap, we also have all our original data, as indicated by the file counts, without the space overhead of duplicating it.

This is done by hard linking to the full image for any unchanged data. On subsequent irsync runs, when newly changed data is synced in, it breaks the hard links in the snapshots, which leaves behind a copy of the original data in its previous state. This method of point-in-time incremental backups allows for easy retention of changed data with minimal space usage, while keeping a logical backup layout that is fully restorable from each individual snapshot and compatible with any utility, as hard links are treated just like regular files and directories.
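
One way to see which files a snapshot holds on its own, rather than sharing with the full image, is to compare inodes. The following is only a sketch: the layout is hypothetical, `cp -al` and the `mv` step are stand-ins for what irsync and rsync do, and the `-ef` test (same device and inode) is assumed to be available, as it is in bash and most sh implementations.

```shell
set -eu
work=$(mktemp -d)
full="$work/host.full"
snap="$work/host.snaps/snap1"
mkdir -p "$full" "$work/host.snaps"
echo a > "$full/kept"
echo b > "$full/changed"
cp -al "$full" "$snap"                      # hard-linked snapshot

# Simulate a later sync replacing one file with a new inode.
echo b2 > "$full/changed.tmp"
mv "$full/changed.tmp" "$full/changed"

# List snapshot files that no longer share an inode with the full image.
unique=$(cd "$snap" && find . -type f | while read -r f; do
    [ "$f" -ef "$full/$f" ] || printf '%s\n' "$f"
done)
echo "held only by the snapshot: $unique"
rm -rf "$work"
```

Only the replaced file shows up; everything else in the snapshot is still the same inode as in the full image and costs no extra space.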

Funding:

Funding for the continued development and research into this and other projects is solely dependent on public contributions and donations. If this is your first time using this software we ask that you evaluate it and consider a small donation; for those who frequent and are continued users of this and other projects we also ask that you make an occasional donation to help ensure the future of our public projects.

14 Replies to “Incremental rsync”

  1. Seems to be a problem with mysqldumps. I have set this..

    mysql_hotcopy=0
    mysql_dump=1
    mysql_dump_gz=1

    Now I expected to get a full set of dumps in each daily snapshot but instead I get nothing at all there.. only in ‘full’ is the latest set of dumps. Regular files/dirs as listed in paths.irsync do appear as expected in snapshots and full.

    Any way to fix this? Really do need daily mysqldump backups.

    Also the ‘mysql-only’ flag doesn’t work..

    /usr/local/irsync/irsync --remote --mysql-only
    ..
    Performing remote backup: ssh: --mysql-only: Name or service not known

    This way also..

    /usr/local/irsync/irsync -rm
    ..
    error: one of --local or --remote must be declared, see --help

  2. First off, thanks for a simple, elegant and flexible backup solution.

    However, I recently found that the snaps-directory was empty even though the script had been running every night. This turned out to be because the mtime on the full-directory had not changed since it was originally created, so the find-command that did the cleanup ended up wiping everything after the first 14 days had elapsed.

    Adding “touch -m $backup_storage/$lohost.full” before the find command seems to fix the problem.

    As far as I can tell the mtime of a directory will only be updated if its contents are changed (i.e. files and/or directories added or deleted). Since I’m just backing up a bunch of directories, the contents of the full-directory don’t change between rsyncs (only the contents of the directories within it do), so the mtime doesn’t update.

    This is on a Linux box with ext3 filesystems.

    Hope this helps!

  3. Ryan,

    Would it be possible to reverse the way this runs and have a backup server that polls each server and downloads the backups that way? Rather than the actual host sending to the backup?

    1. In theory yes, that is easily done with ssh keys; however, in practice you would be creating a lot more work than is needed. Having servers “phone home”, if you will, to the backup server is a lot less tedious to set up in the long run. You may also run into issues with the backup server properly retrieving backups from the servers if you are not SCP/FTP’ing into the server as root to copy the backups, as there are a lot of backed up files that will have root level permissions.

      1. The issue I continue to see is when the server itself phones the backup server: if the server is hacked, the hacker will then have access to the backups, and those are compromised.

  4. I might be missing something, but where can I see some examples of the paths/exclude files? Can you pattern match? For example, excluding *.tar.gz

  5. Is there a way to avoid having a seed for all the subsequent backups? rsync-ing only the changes after the initial seed?

    1. The full seed (servername.full) is only created once, subsequent runs sync only changes into the servername.full path. So, you will always see rsync running against the servername.full path but it is still only syncing in the new/changed data to the path. The .snaps folders are created off the .full path with hardlinks, when data is synced into the .full path, changed/new files will break/create new hardlinks, leaving unique snapshots behind for each respective date.
