Sunday, January 25, 2015

Backing up your data with Amazon S3 Glacier and rsnapshot. A complete guide, Part 2.


(Part I is here)

Let's get our hands dirty!
It's time to make automated backups with rsnapshot.

Remember: rsnapshot gives you what look like full backups at each point in time while keeping disk usage low, and it lets you go back to older versions of your files.

Install rsnapshot

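Step 1 is to install rsnapshot on your system. On Debian and its derivatives it is available as the "rsnapshot" package (apt-get install rsnapshot, as root).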

Configure rsnapshot

rsnapshot can be configured to store files over the network and do pretty complicated stuff. It is in fact just a layer written in Perl on top of rsync and other common Linux commands.
The configuration file /etc/rsnapshot.conf will tell you plenty on how to configure the program. I just want you to pay attention to these points that are not that clear in tutorials and hard to find in the documentation:
  • Use TABS, not spaces. If, like me, you have Vim set up to replace tabs with spaces, you can turn this off for the current buffer by typing ":set noexpandtab". The columns will look misaligned when you "cat" the file; that is expected.
  • Folder paths must end with a slash. Always.
  • Look at the rsync man page for the exclusion patterns you can use.
  • The retain lines should be read like below. Do not try to interpret it otherwise, it would be wrong.

    retain hourly  4


    Keep only the four most recent versions of the job named "hourly". Only a few people know this but "hourly" doesn't mean anything for rsnapshot. You could replace it with "darkvader" if you wanted to.
    Here are incorrect ways to read the "retain" lines:
    "4" is not the number of times per hour the backup must be done.
    "hourly 0.5" doesn't mean the job will be executed every two days.
  • The retain lines must be declared from the most to the least frequent. So: hourly, daily, weekly, monthly, yearly.
  • Again, the job name (e.g. "daily") doesn't mean anything. You can remove any of them. For instance you could have it configured to keep the last 4 "hourly" jobs and the last 2 "monthly" jobs without mentioning "daily" and "weekly".
  • I repeat for the third time: the job name has no meaning. So if you put "daily" before "hourly", then the folders named "daily" will actually contain the "hourly" backups.
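To make the tab rule and the ordering rule concrete, here is a small shell sketch. It writes an example fragment to a temporary file and then lists the lines that contain no tab at all. The daily, weekly and monthly counts, the exclude pattern and the backup line are illustrative assumptions of mine, and the fragment is not a complete configuration. find_untabbed_lines is a hypothetical helper, not something rsnapshot ships.

```shell
# Write an example fragment to a temporary file.
# printf turns every \t into a real tab character.
conf=$(mktemp)
{
    printf 'snapshot_root\t/var/cache/rsnapshot/\n'
    printf '# retain lines: most frequent job first\n'
    printf 'retain\thourly\t4\n'
    printf 'retain\tdaily\t7\n'      # count is an assumption
    printf 'retain\tweekly\t4\n'     # count is an assumption
    printf 'retain\tmonthly\t6\n'    # count is an assumption
    printf 'exclude\t*.tmp\n'        # illustrative pattern
    printf 'backup\t/home/\tlocalhost/\n'
} > "$conf"

# Hypothetical helper: print "line number: text" for every line that is
# neither blank nor a comment and contains no tab character.
find_untabbed_lines() {
    awk '!/^[ \t]*(#|$)/ && index($0, "\t") == 0 { print FNR ": " $0 }' "$1"
}

find_untabbed_lines "$conf"              # prints nothing, every line uses tabs

printf 'retain yearly 2\n' >> "$conf"    # oops, typed with spaces
find_untabbed_lines "$conf"              # prints: 9: retain yearly 2
```

The helper only catches lines with no tab whatsoever, so treat it as a quick sanity check. rsnapshot has its own syntax check, "rsnapshot configtest", and that is the one to trust.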

rsnapshot creates the output folder if it doesn't exist. On Debian, the default path is /var/cache/rsnapshot. The folder is owned by root and nobody else can access it.
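To picture what ends up in that folder with "retain hourly 4", here is a simplified simulation of the rotation in plain shell. It is only a sketch of the idea and not rsnapshot's actual code: the real program fills hourly.0 with rsync and hard links, while the hypothetical rotate function below just shuffles empty directories in a temporary folder.

```shell
# Hypothetical helper: simulate one run of a job.
# $1: snapshot root, $2: job name, $3: retain count
rotate() {
    i=$(( $3 - 1 ))
    rm -rf "$1/$2.$i"                    # drop the oldest snapshot
    while [ "$i" -gt 0 ]; do
        prev=$(( i - 1 ))
        if [ -d "$1/$2.$prev" ]; then
            mv "$1/$2.$prev" "$1/$2.$i"  # shift each snapshot one slot back
        fi
        i=$prev
    done
    mkdir -p "$1/$2.0"                   # slot 0 receives the new backup
}

root=$(mktemp -d)                        # stand-in for /var/cache/rsnapshot
for run in 1 2 3 4 5 6; do
    rotate "$root" hourly 4
    echo "$run" > "$root/hourly.0/run.txt"
done

for dir in "$root"/hourly.*; do
    echo "$(basename "$dir") holds run $(cat "$dir/run.txt")"
done
```

After six runs only four folders are left: hourly.0 holds run 6 and hourly.3 holds run 3. Runs 1 and 2 are gone, whatever the job is called and however often cron fires it.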

First run

The very first time, invoke rsnapshot manually as root from the command line (preferably inside screen, so the run survives a dropped connection) in verbose mode and see what happens:

rsnapshot -v hourly
where "hourly" is the name of the first retain job in the configuration. The very first run will take much longer than the ones that follow because it has to copy everything. The next runs are faster because only the modified files get copied.

Schedule rsnapshot to run every hour / day / week / month ...


If all went well, you can now create a few cron tasks to run rsnapshot automatically. Type "crontab -e" as root and enter something like this (I will explain it below):

# m h          dom mon dow  command
  0 1,7,13,19  *   *   *    /usr/bin/rsnapshot hourly
  0 2          *   *   *    /usr/bin/rsnapshot daily
  0 6          *   *   1    /usr/bin/rsnapshot weekly
  0 11         1   *   *    /usr/bin/rsnapshot monthly

Save and quit the crontab editor.

hourly: I could have written "*/6" to run the "hourly" backup every 6 hours, but that would start the first one at midnight, and I know other cron jobs are scheduled between midnight and 1 am. So I listed the hours explicitly, shifted by one hour.
If you keep the last 4 "hourly" backups, making one every 6 hours means they cover the last 24 hours.

daily: There is one big risk with these cron jobs: the hourly job may still be running when the daily job starts. In that case the daily job does not run. If a lockfile is configured (the "lockfile" parameter in /etc/rsnapshot.conf), rsnapshot refuses to start a second instance. I am pretty sure you can configure rsnapshot to run two jobs in parallel, but I would advise against that. The best bet is to leave enough time for the "hourly" job to complete.

weekly: Same remark. Fun fact: the value of "dow" can go from 0 to 7. Both "0" and "7" designate Sunday for portability reasons. Here "1" is for Monday. (In a corporate environment you should probably run the weekly job over the weekend.) In my case the job runs every Monday at 6 am.

monthly: Same remark regarding the hour (not too close to the other jobs). In my case the monthly job runs on the 1st day of every month at 11 am.

Trick question: How can you schedule a backup to run every 3 days instead of one and keep all of the backups from the past month? You must keep the daily and weekly backups.

In /etc/rsnapshot.conf:
retain everyotherday 10 
where "everyotherday" could just as well be "gogglydoe" (the name is arbitrary), and 10 is 30 days divided by 3 days.
The line must go between the "daily" and "weekly" retain lines, and it is written with tabs like the others.

In the crontab: 
# m h  dom mon dow   command
  0 0  */3 *   *    /usr/bin/rsnapshot everyotherday
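If you are not sure which days "*/3" actually selects, you can list them with seq. This is a sketch that assumes the usual cron behaviour, where a step on "*" starts from the first value of the field (1 for the day of the month, 0 for the hour). expand_step is a hypothetical helper name.

```shell
# Hypothetical helper: list the values matched by "*/STEP" in a cron field.
# $1: step, $2: first value of the field, $3: last value of the field
expand_step() {
    seq -s ' ' "$2" "$1" "$3"
}

expand_step 3 1 31   # day of month: 1 4 7 10 13 16 19 22 25 28 31
expand_step 6 0 23   # hour:         0 6 12 18
```

Note the small catch: the count restarts every month. After a run on the 31st the next one is on the 1st, only one day later, so "every 3 days" is an approximation and a 31-day month gets 11 runs instead of 10.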

Enjoy the power of full backups

You know what's nice with full backups (or almost full, since rsnapshot uses hard links to avoid duplicating unchanged files)?
You can browse the backup folders in /var/cache/rsnapshot just like the "live" folders!
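If hard links are new to you, the little experiment below shows why two snapshot folders can each look complete without storing the data twice. It is a sketch on a throwaway temporary folder, not on your real backups. The folder and file names are made up, and "cp -al" assumes GNU cp, as found on Linux.

```shell
demo=$(mktemp -d)
mkdir "$demo/hourly.0"
echo "version 1" > "$demo/hourly.0/notes.txt"

# Copy the snapshot using hard links (-l) instead of duplicating the data.
cp -al "$demo/hourly.0" "$demo/hourly.1"

# Both paths point at the same inode, so the content is stored only once.
inode_new=$(ls -i "$demo/hourly.0/notes.txt" | awk '{print $1}')
inode_old=$(ls -i "$demo/hourly.1/notes.txt" | awk '{print $1}')
[ "$inode_new" = "$inode_old" ] && echo "same file on disk, stored once"

# Replace the file in the newest snapshot with a new one
# (remove and recreate, rather than edit in place).
rm "$demo/hourly.0/notes.txt"
echo "version 2" > "$demo/hourly.0/notes.txt"

old_version=$(cat "$demo/hourly.1/notes.txt")
echo "hourly.1 still holds: $old_version"   # prints: hourly.1 still holds: version 1
```

The older folder keeps the old version because the changed file was replaced, not overwritten in place. This is also why you should never edit files directly inside the backup folders: an in-place edit would change every snapshot that shares the link.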

Continue to Part III
