Mirroring an Entire Site using Rsync over SSH

Sometimes you urgently need an exact duplicate, or "mirror", of a web site on a separate server. A mirror is useful for round-robin setups, load balancing, failover, or just plain vanilla backups. In the past I used a lot of different methods to copy data from one server to another: creating an archive of the whole directory and sending it over with scp, or creating an archive, encrypting it, and sending that file over with ftp, curl, etc. My persistence at learning new ways to do things has paid off, because now I use rsync to keep an exact replica of the entire directory on an external server without the CPU and resource cost of those other mirroring methods.

For this article I will show how I set up a web mirror from my DreamHost server to my HostGator server. For the transfer and synchronization of the directories we will use rsync over SSH. We will also automate the task with a cronjob that needs no user interaction, so creating public and private SSH keys is necessary. Finally, I provide a simple shell script that prevents anyone from logging into your account with the created keys for any purpose other than rsync.

Rsync Synchronization Magic

rsync is an open source file transfer program for Unix systems that uses the "rsync algorithm", which provides a very fast method for synchronizing files and directories from one location to another while minimizing data transfer by using delta encoding when appropriate. An important feature of rsync not found in most similar programs/protocols is that the mirroring takes place with only one transmission in each direction. It does this by sending just the differences in the files across the link, without requiring that both sets of files be present at one end of the link beforehand.

Securing Rsync with SSH

I NEVER transfer data unless it is encrypted in transit and going to a trusted recipient (I use HTTPS for WordPress administration), and I haven't had time to probe the HostGator system for security issues yet. So right away I decided I needed an automated way to securely transfer files TO HostGator while not allowing my HostGator account any access back to my main server. That way, if the HostGator account were to get hacked somehow, the cracker/spammer wouldn't have access back to my main server.

Generate Keys with No Password

First I created a key pair whose private key has no passphrase, meaning that to gain access with ssh you only need to supply the key, not a passphrase plus the key like normal.

[local@dreamhost] $ mkdir -p ~/.ssh && chmod 700 ~/.ssh

# Create the public and private keys
# public key at:
# private key at: z.askapache-hostgator.id_rsa
[local@dreamhost] $ ssh-keygen -t rsa -b 2048 -f ~/.ssh/z.askapache-hostgator.id_rsa

# add the public key to remote hosts ~/.ssh/authorized_keys file
[local@dreamhost] $ ssh-copy-id -i ~/.ssh/ remoteuser@remotehost

# or use scp + ssh to add the public key
[local@dreamhost] $ scp ~/.ssh/ remoteuser@remotehost:/web/remoteuser/
[local@dreamhost] $ ssh remoteuser@remotehost
[gatoraskapache@gator] $ mkdir -p ~/.ssh && chmod 700 ~/.ssh
[gatoraskapache@gator] $ cat ~/ >> ~/.ssh/authorized_keys
[gatoraskapache@gator] $ chmod 600 ~/.ssh/authorized_keys
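
Before moving on, it can help to confirm that the modes set above actually took effect, since sshd is commonly configured (via its StrictModes option) to ignore keys when these files are too permissive. The following is only a minimal sketch and not part of my original setup: the helper names mode_of and check_ssh_perms are made up for illustration, and the demo runs against a throwaway directory rather than a real ~/.ssh.

```shell
#!/bin/sh
# mode_of PATH: print the permission string that ls reports, e.g. drwx------
mode_of() { ls -ld "$1" | cut -c1-10; }

# check_ssh_perms DIR: succeed only if DIR is 700 and DIR/authorized_keys is 600
check_ssh_perms() {
  [ "$(mode_of "$1")" = "drwx------" ] || { echo "bad mode on $1"; return 1; }
  [ "$(mode_of "$1/authorized_keys")" = "-rw-------" ] || {
    echo "bad mode on $1/authorized_keys"; return 1
  }
  echo "ok"
}

# Demo on a temporary directory that imitates ~/.ssh
demo=$(mktemp -d)
chmod 700 "$demo"
touch "$demo/authorized_keys"
chmod 600 "$demo/authorized_keys"
check_ssh_perms "$demo"
```

On the remote host you would call check_ssh_perms ~/.ssh instead of using the temporary directory.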

Custom SSH Connection Info

This helps a lot: by adding this to the very top of my ~/.ssh/config file, I don't have to add all these options to the rsync command line. Whenever I connect to the host 'gator', ssh uses all of these options. It is very helpful, and you can add as many entries as you want.

Host gator
   IdentityFile ~/.ssh/z.askapache-hostgator.id_rsa
   Port 2222
   Protocol 2
   User gatoraskapache
   PasswordAuthentication no

Creating Cronjob for Synchronization

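This cronjob runs every 30 minutes, copying all modified files from my local directory ~/ to the remote directory ~/public_html/z/. Because of --delete, files that no longer exist locally are also removed from the remote copy. First I back up the current crontab, then I edit the crontab and add these lines.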

crontab -l > .crontab-`date +%F.backup`; crontab -e

*/30 * * * * /usr/bin/rsync -e 'ssh' -rt --delete ~/ gator:'~/public_html/z/' 1>/dev/null
@midnight /usr/bin/find ~/ -type d ! -perm 755 -exec chmod 755 {} \; 1>/dev/null
@midnight /usr/bin/find ~/ -type f ! -perm 644 -exec chmod 644 {} \; 1>/dev/null

Those 2 find commands scheduled to run at midnight simply fix the permissions on files and directories in my static folder. They are all static files, so there is no reason for them to have any other permissions.
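
If you want to see what the two find commands do before putting them in cron, here is a small sketch that wraps them in a function and runs it against a temporary directory. The function name fix_perms and the css/site.css test tree are invented for the example; the find expressions are the same as in the crontab, except that they use the `{} +` form so chmod is called once per batch of paths.

```shell
#!/bin/sh
# fix_perms DIR: set directories to 755 and regular files to 644 under DIR
fix_perms() {
  find "$1" -type d ! -perm 755 -exec chmod 755 {} +
  find "$1" -type f ! -perm 644 -exec chmod 644 {} +
}

# Build a throwaway tree with deliberately wrong modes
demo=$(mktemp -d)
mkdir "$demo/css"
touch "$demo/css/site.css"
chmod 700 "$demo/css"
chmod 600 "$demo/css/site.css"

fix_perms "$demo"

# Show the resulting modes
ls -ld "$demo/css" "$demo/css/site.css"
```

Only paths whose mode differs from the target are touched, so running it repeatedly is cheap.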

Only Allow rsync

This is SWEET! If you like candy, that is. The script below is called each time anything logs into your machine using the passwordless key we created above, and it simply checks what command the login process is attempting to issue. To set this up, edit the ~/.ssh/authorized_keys file on the remote host and prefix the public key you added with a command directive that executes the script:

command="/web/remoteuser/" ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEAwBhj6UCS7JbJ08C8pWJqCh2iXZMN7tXpYZh47f4gZZBwrNHZQ== localuser@dreamhost

For rsync requests the command will always start with rsync --server, so if the command is anything else this script:

  1. Sends you an email notifying you that something is up.
  2. Moves the ~/.ssh folder to ~/.locked-ssh
  3. Adds a cronjob to move the folder back in about an hour.

# Author:
# Version: 1.2
# Date: 04-08-2009

# If the command used to login to ssh correctly starts with 'rsync --server'
# then exit this script and dont process the rest of the script
case "$SSH_ORIGINAL_COMMAND" in 'rsync --server'*) exit 0; ;; esac;

# the home directory where the .ssh folder is located

# if there is a locked ssh folder, kill the rsync and die
[[ -d $H/.lssh ]] && echo "REJECTED" && exit 1 # notified about locked status
OC=$H/old-crontab.txt # the original crontab
NC=$H/new-crontab.txt #the new crontab

# When to unlock the rsync
UNLOCK_AT=$(date +'%M %k' --date='30 minute 1 hour') # "minute hour" for the crontab entry

# move the .ssh to .lssh which locks all key-logins
mv $H/.ssh $H/.lssh

# mail a notice to the boss

# subshell backs-up crontab then deletes active cron

 crontab -l > $OC 2>/dev/null || echo -n #backup current crontab (only stderr is discarded)
 crontab -r >/dev/null 2>&1 || echo -n # delete current crontab

# subshell creates new crontab combined with old crontab
 # create new crontab
 echo -en "MAILTO='${EMAIL}'\n${UNLOCK_AT} * * * mv $H/.lssh $H/.ssh" >> $NC
 # no -n here, so the unlock entry ends with a newline before the old crontab is appended
 echo " && date|mail -s 'UNLOCKED!!!' '${EMAIL}' && crontab $OC || rm $OC && rm $OC" >> $NC

 # add old crontab to new crontab minus any MAILTO lines
 cat $OC | sed '/^MAILTO/d' >> $NC

 # load the new crontab and if it doesnt work notify boss
 crontab $NC || echo "manually mv .lssh to .ssh" | mail -s 'CRONTAB PROBLEM!!!' "$EMAIL"

 # remove new crontab
 rm $NC

exit $?
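
The check at the top of the script can be tried in isolation before you trust it with real logins. This is a rough sketch under a few assumptions: is_rsync_server is a name I made up, and the sample command strings are invented (the exact arguments rsync passes after --server vary with the version and options used). An empty string stands in for an interactive login, where no command is supplied.

```shell
#!/bin/sh
# is_rsync_server CMD: succeed only when CMD starts with 'rsync --server'
is_rsync_server() {
  case "$1" in
    'rsync --server'*) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical values of SSH_ORIGINAL_COMMAND
for cmd in 'rsync --server -vlogDtpr . public_html/z/' 'bash -i' ''; do
  if is_rsync_server "$cmd"; then
    echo "allow:  $cmd"
  else
    echo "reject: $cmd"
  fi
done
```

Only the first sample is allowed; the other two would fall through to the locking part of the script.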

Here is the cronjob entry it creates. This will enable rsync again by moving the folder back, then it mails you to alert you that it's back up, and finally the original crontab is restored.

03 9 * * * mv ~/.locked-ssh ~/.ssh && date|mail -s 'RUNLOCKED!!!' "" && crontab ~/old-crontab.txt

Rsync/SSH Debugging and Stats

Adding the -v option to the ssh command, i.e. rsync -e 'ssh -v' (or -vv for even more detail), will give you a lot of debugging info.

By adding --stats to your rsync command you can get a detailed look at how efficient the transfer was.

Number of files: 14900
Number of files transferred: 0
Total file size: 1456832331 bytes
Total transferred file size: 0 bytes
Literal data: 0 bytes
Matched data: 0 bytes
File list size: 320551
File list generation time: 17.393 seconds
File list transfer time: 0.000 seconds
Total bytes sent: 320571
Total bytes received: 20

sent 320571 bytes  received 20 bytes  17329.24 bytes/sec
total size is 1456832331  speedup is 4544.21
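
The speedup figure on the last line is simply the total size of the files divided by the bytes that actually crossed the wire (sent plus received). Here is a quick sketch that reproduces it from the numbers above; the function name speedup is just for illustration.

```shell
#!/bin/sh
# speedup TOTAL_SIZE SENT RECEIVED: total size divided by bytes actually exchanged
speedup() {
  awk -v total="$1" -v sent="$2" -v recv="$3" \
    'BEGIN { printf "%.2f\n", total / (sent + recv) }'
}

# Values taken from the --stats output above
speedup 1456832331 320571 20   # prints 4544.21
```

In this run nothing had changed, so the only traffic was the file list, which is why the ratio is so high.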

Rsync Algorithm

The rsync utility uses an algorithm (invented by the Australian computer programmer Andrew Tridgell) for efficiently transmitting a structure (such as a file) across a communications link when the receiving computer already has a different version of the same structure.

The recipient splits its copy of the file into fixed-size non-overlapping chunks, of size S, and computes two checksums for each chunk: the MD4 hash, and a weaker 'rolling checksum'. It sends these checksums to the sender. Version 30 of the protocol (released with rsync version 3.0.0) now uses MD5 hashes rather than MD4.

The sender computes the rolling checksum for every chunk of size S in its own version of the file, even overlapping chunks. This can be calculated efficiently because of a special property of the rolling checksum: if the rolling checksum of bytes n through n + S - 1 is R, the rolling checksum of bytes n + 1 through n + S can be computed from R, byte n, and byte n + S without having to examine the intervening bytes. Thus, if one had already calculated the rolling checksum of bytes 1–25, one could calculate the rolling checksum of bytes 2–26 solely from the previous checksum, and from bytes 1 and 26.
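
To make the rolling property concrete, here is a small sketch of a simplified two-part checksum in the spirit of the description above. It is not rsync's exact implementation: the modulus M, the helper names bytes, weak_sum and roll, and the sample string are all assumptions chosen for the example. The first sum a adds up the bytes in the window, the second sum b weights each byte by its distance from the end of the window, and roll updates both using only the byte that leaves and the byte that enters.

```shell
#!/bin/sh
M=65536   # both sums are kept modulo 2^16 (an assumption for this sketch)

# bytes STRING: print the byte values of STRING as space-separated decimals
bytes() { printf '%s' "$1" | od -An -v -tu1 | tr -s ' \n' ' '; }

# weak_sum OFFSET LEN STRING: compute "a b" from scratch for one window
weak_sum() (
  off=$1 len=$2
  set -- $(bytes "$3")
  shift "$off"
  a=0 b=0 i=0
  while [ "$i" -lt "$len" ]; do
    a=$(( (a + $1) % M ))
    b=$(( (b + (len - i) * $1) % M ))
    shift
    i=$(( i + 1 ))
  done
  echo "$a $b"
)

# roll A B OLD NEW LEN: slide the window one byte to the right
roll() (
  a=$(( ($1 - $3 + $4 + M) % M ))
  b=$(( (($2 - $5 * $3 + a) % M + M) % M ))
  echo "$a $b"
)

data='the quick brown fox'
len=4
set -- $(bytes "$data")               # $1 is byte 0, $5 is byte 4
first=$(weak_sum 0 "$len" "$data")    # window "the "
rolled=$(roll ${first% *} ${first#* } "$1" "$5" "$len")
direct=$(weak_sum 1 "$len" "$data")   # window "he q", computed from scratch
echo "rolled=$rolled direct=$direct"  # prints rolled=350 896 direct=350 896
```

The rolled and directly computed values agree, which is what lets the sender test every byte offset without rescanning each window.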

The rolling checksum used in rsync is based on Mark Adler's Adler-32 checksum, which is used in zlib, and which itself is based on Fletcher's checksum. The sender then compares its rolling checksums with the set sent by the recipient to determine if any matches exist. If they do, it verifies the match by computing the MD4 checksum for the matching block and by comparing it with the MD4 checksum sent by the recipient.

The sender then sends the recipient those parts of its file that did not match any of the recipient's blocks, along with assembly instructions on how to merge these blocks into the recipient's version. In practice, this creates a file identical to the sender's copy. However, it is in principle possible that the recipient's copy differs at this point from the sender's: this can happen when the two files have different chunks that nonetheless possess the same MD4 hash and rolling checksum; the chances for this to happen are in practice extremely remote.

If the sender's and recipient's versions of the file have many sections in common, the utility needs to transfer relatively little data to synchronize the files.

While the rsync algorithm forms the heart of the rsync application and essentially optimizes transfers between two computers over TCP/IP, the rsync application supports other key features that aid significantly in data transfers or backup. They include compression and decompression of data block by block using zlib at the sending and receiving ends, respectively, and support for protocols such as ssh that enable encrypted transmission of the compressed, differential data produced by the rsync algorithm. Instead of ssh, stunnel can also be used to create an encrypted tunnel to secure the data transmitted.

Finally, rsync is capable of limiting the bandwidth consumed during a transfer, a useful feature that few other standard file transfer protocols offer.
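
For the mirror in this article, the limit would be set with rsync's --bwlimit option, whose value is read as roughly kilobytes per second. The sketch below only prints the command so you can inspect it before adding it to cron. The function name mirror_cmd and the value 500 are arbitrary choices for illustration, and the paths are the ones from the crontab entry earlier.

```shell
#!/bin/sh
# mirror_cmd [KBPS]: print the mirror command, with --bwlimit added when KBPS is given
mirror_cmd() {
  limit=${1:-}
  cmd="/usr/bin/rsync -e ssh -rt --delete"
  if [ -n "$limit" ]; then
    cmd="$cmd --bwlimit=$limit"
  fi
  echo "$cmd ~/ gator:'~/public_html/z/'"
}

mirror_cmd        # unlimited, same as the cron entry
mirror_cmd 500    # capped at about 500 KB/s
```

A cap like this keeps the half-hourly sync from competing with regular site traffic, at the cost of slower initial transfers.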

This page contains content from article and is released under the CC-BY-SA.
