Backup and Compression

What do we need to Backup?

Some data is critical for backup, some less critical, and some never needs saving.

What Needs Backup?

  • Definitely Yes: The following data should always be backed up:

    • Business-related data
    • System configuration files
    • User files (usually under /home)
  • Maybe

    • Spooling directories (for printing, mail, etc.)
    • Logging files (found in /var/log, and elsewhere)
  • Probably Not

    • Software that can easily be re-installed; on a well-managed system, this should be almost everything
    • The /tmp directory, because its contents are indeed supposed to be only temporary
  • Definitely Not

    • Pseudo-filesystems such as /proc, /dev and /sys
    • Any swap partitions or files
  • Files essential to your organization require backup. Configuration files may change frequently and, along with individual users’ files, require backup as well.

  • Logging files can be important if you have to investigate your system’s history, which can be particularly important for detecting intrusions and other security violations.

  • You don’t have to back up anything that can easily be re-installed. Also, the swap partitions (or files) and the /proc filesystem are generally not useful or necessary to back up, since the data in these areas is basically temporary (just like in the /tmp directory).
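The categories above could translate into an exclude list for an archiving tool. The following is a minimal sketch, assuming GNU tar; it works on a small stand-in tree in a scratch directory (all file names here are hypothetical) rather than on the real root filesystem:

```shell
#!/usr/bin/env bash
set -eu

# Build a tiny stand-in for a root filesystem in a scratch directory.
work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT
mkdir -p "$work/root/home/alice" "$work/root/etc" "$work/root/tmp" "$work/root/proc"
echo "report"  > "$work/root/home/alice/report.txt"
echo "setting" > "$work/root/etc/app.conf"
echo "scratch" > "$work/root/tmp/junk"
echo "fake"    > "$work/root/proc/cpuinfo"

# Archive everything except the "Probably Not" and "Definitely Not" areas.
# The --exclude options are given before the directory they apply to.
tar -C "$work/root" -czf "$work/backup.tar.gz" \
    --exclude=./tmp --exclude=./proc --exclude=./sys --exclude=./dev .

# List what was actually saved.
listing=$(tar -tzf "$work/backup.tar.gz")
echo "$listing"
```

The listing should contain the files under home and etc but nothing from tmp or proc.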

Backup Methods

  • You should never have all backups residing in the same physical location as the systems being protected.

Several different kinds of backup methods can be used, often in concert with each other.

  • Full Backup : Backup for all files on the system.
  • Incremental : Backup for all files that have changed since the last incremental or full backup.
  • Differential : Backup for all files that have changed since the last full backup.
  • Multiple level incremental : Backup for all files that have changed since the previous backup at the same or a previous level.
  • User : Backup only for files in a specific user’s directory.
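The difference between incremental and differential backups comes down to which reference point is used when selecting files. A rough illustration, assuming GNU find and touch, with hypothetical marker files (last_full, last_incremental) recording when each backup ran:

```shell
#!/usr/bin/env bash
set -eu

work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT
mkdir "$work/data"

# Marker files record when each backup ran (timestamps are set explicitly
# so the example does not depend on the wall clock).
touch -d '2024-01-01 00:00' "$work/last_full"
touch -d '2024-01-03 00:00' "$work/last_incremental"

# Three files modified at different times.
touch -d '2023-12-31 12:00' "$work/data/old.txt"        # before the full backup
touch -d '2024-01-02 12:00' "$work/data/changed_a.txt"  # after full, before incremental
touch -d '2024-01-04 12:00' "$work/data/changed_b.txt"  # after the incremental

# Differential: everything changed since the last FULL backup.
differential=$(cd "$work/data" && find . -type f -newer "$work/last_full" | sort)

# Incremental: everything changed since the most recent backup of any kind.
incremental=$(cd "$work/data" && find . -type f -newer "$work/last_incremental" | sort)

printf 'differential:\n%s\n' "$differential"
printf 'incremental:\n%s\n' "$incremental"
```

The differential selection picks up both changed files, while the incremental selection picks up only the file changed after the last incremental run.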

Backup Strategies

  • We should note that backup methods are useless without associated restore methods. You have to take into account the robustness, clarity and ease of both directions when selecting strategies.

  • The simplest backup scheme is to do a full backup of everything once, and then perform incremental backups of everything that subsequently changes.

  • While full backups can take a lot of time, restoring from incremental backups can be more difficult and time consuming. Thus, you can use a mix of both to optimize time and effort.
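One way to keep the restore direction honest is to restore every backup into a scratch location and compare it with the original. This is only a sketch of that idea, assuming tar and diff are available and using hypothetical directory names:

```shell
#!/usr/bin/env bash
set -eu

work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT
mkdir -p "$work/src/docs" "$work/restore"
echo "alpha" > "$work/src/docs/a.txt"
echo "beta"  > "$work/src/b.txt"

# Back up ...
tar -C "$work" -czf "$work/backup.tar.gz" src

# ... then check that the backup is usable by restoring it somewhere else
# and comparing the restored tree with the original.
tar -C "$work/restore" -xzf "$work/backup.tar.gz"
if diff -r "$work/src" "$work/restore/src"; then
    result="restore verified"
else
    result="restore MISMATCH"
fi
echo "$result"
```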

Some Backup Related Utilities

A number of programs are used for backup purposes.

  • cpio and tar create and extract archives of files.
  • The archives are often compressed with gzip, bzip2, or xz. The archive file may be written to disk, magnetic tape, or any other device which can hold files. Archives are very useful for transferring files from one filesystem or machine to another.
  • dd is a powerful utility often used to transfer raw data between media. It can be used to copy entire partitions or entire disks.
  • rsync is a powerful utility that can synchronize directory subtrees or entire filesystems across a network, or between different filesystem locations on a local machine.
  • dump and restore are ancient utilities which were designed specifically for backups. They read from the filesystem directly (which is more efficient). However, they must be restored only on the same filesystem type that they came from. There are newer alternatives.
  • mt is used for querying and positioning tapes before performing backups and restores.

Backing Up Data

  • Basic ways to do so include the use of simple copying with cp and use of the more robust rsync.
    • Both can be used to synchronize entire directory trees.
    • However, rsync is more efficient, because it checks if the file being copied already exists. If the file exists and there is no change in size or modification time, rsync will avoid an unnecessary copy and save time.
    • Furthermore, because rsync copies only the parts of files that have actually changed, it can be very fast.
    • cp can only copy files to and from destinations on the local machine (unless you are copying to or from a file system mounted using NFS), but rsync can also be used to copy files from one machine to another. Locations are designated in the target:path form, where target can be in the form of someone@host. The someone@ part is optional and used if the remote user is different from the local user.
    • rsync is very efficient when recursively copying one directory tree to another, because only the differences are transmitted over the network.
    • One often synchronizes the destination directory tree with the origin, using the -r option to recursively walk down the directory tree copying all files and directories below the one listed as the source.

Using rsync

  • rsync is a very powerful utility.

    • For example, a very useful way to back up a project directory might be to use the following command:
      rsync -r project-X archive-machine:archives/project-X
  • Note that rsync can be very destructive! Accidental misuse can do a lot of harm to data and programs, by inadvertently copying changes to where they are not wanted.

  • Take care to specify the correct options and paths.

  • It is highly recommended that you first test your rsync command using the --dry-run option to ensure that it provides the results that you want.

  • To use rsync at the command prompt, type:

    rsync sourcefile destinationfile

    where either file can be on the local machine or on a networked machine.

  • The contents of sourcefile will be copied to destinationfile. A good combination of options is:

    rsync --progress -avrxH --delete sourcedir destdir

Compressing Data

  • File data is often compressed to save disk space and reduce the time it takes to transmit files over networks.

  • Linux uses a number of methods to perform this compression, including:

    Command  Usage
    gzip     The most frequently used Linux compression utility
    bzip2    Produces files significantly smaller than those produced by gzip
    xz       The most space-efficient compression utility used in Linux
    zip      Is often required to examine and decompress archives from other operating systems
  • The tar utility is often used to group files in an archive and then compress the whole archive at once.
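To get a feel for how these tools compare, the same input can be compressed with each of them. This is a rough sketch that skips any tool that is not installed; the sample data is deliberately repetitive, so the sizes it prints are not representative of real files:

```shell
#!/usr/bin/env bash
set -eu

work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT

# Generate about 1 MB of repetitive text as sample input.
yes "the quick brown fox jumps over the lazy dog" | head -c 1000000 > "$work/sample.txt"
original_size=$(wc -c < "$work/sample.txt")
echo "original: $original_size bytes"

# Compress with each tool that is installed; -c writes to standard output
# and leaves the input file untouched.
for tool in gzip bzip2 xz; do
    if command -v "$tool" >/dev/null 2>&1; then
        "$tool" -c "$work/sample.txt" > "$work/sample.$tool"
        echo "$tool: $(wc -c < "$work/sample.$tool") bytes"
    else
        echo "$tool: not installed, skipped"
    fi
done
```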

Compressing Data Using gzip

  • gzip is the most often used Linux compression utility. It compresses very well and is very fast. The following table provides some usage examples:
    Command           Usage
    gzip *            Compresses all files in the current directory; each file is compressed and renamed with a .gz extension.
    gzip -r projectX  Compresses all files in the projectX directory, along with all files in all of the directories under projectX.
    gunzip foo        De-compresses foo found in the file foo.gz. Under the hood, the gunzip command is actually the same as gzip -d.
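Note that gzip replaces the original file by default. A short sketch of a full round trip in a scratch directory, including the -c option for cases where the original should be kept:

```shell
#!/usr/bin/env bash
set -eu

work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT
cd "$work"
printf 'hello backup\n' > foo

gzip foo                 # foo is replaced by foo.gz
ls                       # shows only foo.gz
gunzip foo.gz            # same as gzip -d foo.gz; foo is restored
gzip -c foo > foo.gz     # -c writes to standard output, so foo is kept
ls                       # shows foo and foo.gz
roundtrip=$(gzip -dc foo.gz)   # decompress to standard output only
echo "$roundtrip"
```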

Compressing Data Using bzip2

  • bzip2 has a syntax that is similar to gzip but it uses a different compression algorithm and produces significantly smaller files, at the price of taking a longer time to do its work. Thus, it is more likely to be used to compress larger files.

  • Examples of common usage are also similar to gzip:

    Command        Usage
    bzip2 *        Compresses all of the files in the current directory and replaces each file with a file renamed with a .bz2 extension.
    bunzip2 *.bz2  Decompresses all of the files with an extension of .bz2 in the current directory. Under the hood, bunzip2 is the same as calling bzip2 -d.

    NOTE: bzip2 has lately become deprecated due to lack of maintenance and the superior compression ratios of xz, which is actively maintained.

Compressing Data Using xz

  • xz is the most space efficient compression utility used in Linux and is used to store archives of the Linux kernel.

  • Once again, it trades a slower compression speed for an even higher compression ratio. Some usage examples:

    Command                            Usage
    xz *                               Compresses all of the files in the current directory and replaces each file with one with a .xz extension.
    xz foo                             Compresses foo into foo.xz using the default compression level (-6), and removes foo if compression succeeds.
    xz -dk bar.xz                      Decompresses bar.xz into bar and does not remove bar.xz even if decompression is successful.
    xz -dcf a.txt b.txt.xz > abcd.txt  Decompresses a mix of compressed and uncompressed files to standard output, using a single command.
    xz -d *.xz                         Decompresses the files compressed using xz.
  • Compressed files are stored with a .xz extension.

Handling Files Using zip

  • The zip program is not often used to compress files in Linux, but is often required to examine and decompress archives from other operating systems.
  • It is only used in Linux when you get a zipped file from a Windows user. It is a legacy program.
    Command               Usage
    zip backup *          Compresses all files in the current directory and places them in backup.zip.
    zip -r backup.zip ~   Archives your login directory (~) and all files and directories under it in backup.zip.
    unzip backup.zip      Extracts all files in backup.zip and places them in the current directory.

Archiving and Compressing Data Using tar

  • Historically, tar stood for “tape archive” and was used to archive files to a magnetic tape.

  • It allows you to create or extract files from an archive file, often called a tarball.

  • At the same time, you can optionally compress while creating the archive, and decompress while extracting its contents.

  • Here are some examples of the use of tar:

    Command                       Usage
    tar xvf mydir.tar             Extract all the files in mydir.tar into the mydir directory.
    tar zcvf mydir.tar.gz mydir   Create the archive and compress with gzip.
    tar jcvf mydir.tar.bz2 mydir  Create the archive and compress with bzip2.
    tar Jcvf mydir.tar.xz mydir   Create the archive and compress with xz.
    tar xvf mydir.tar.gz          Extract all the files in mydir.tar.gz into the mydir directory. NOTE: You do not have to tell tar it is in gzip format.
  • You can separate out the archiving and compression stages, as in:

    tar cvf mydir.tar mydir ; gzip mydir.tar
    gunzip mydir.tar.gz ; tar xvf mydir.tar

    but this is slower and wastes space by creating an unneeded intermediary .tar file.
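If the two stages need to stay separate for some reason, a pipe avoids the intermediate file. The sketch below, run in a scratch directory with hypothetical names, also shows listing an archive (t) and extracting into another directory with -C:

```shell
#!/usr/bin/env bash
set -eu

work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT
cd "$work"
mkdir -p mydir/sub out
echo "data" > mydir/sub/file.txt

# Create a gzip-compressed archive in one step.
tar -czf mydir.tar.gz mydir

# List the contents without extracting (t = list).
tar -tzf mydir.tar.gz

# Equivalent to the two-stage form, but streamed through a pipe so that
# no intermediate .tar file is written to disk ("-" means standard output).
tar -cf - mydir | gzip > mydir-piped.tar.gz

# Extract into a different directory with -C.
tar -xzf mydir-piped.tar.gz -C out
restored=$(cat out/mydir/sub/file.txt)
echo "$restored"
```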

Incremental Backups with tar

You can do an incremental backup with tar using the -N (or the equivalent --newer), or the --after-date options. Either option requires specifying either a date or a qualified (reference) file name. See commands below:

tar --create --newer '2011-12-1' -vzf backup1.tgz /var/tmp
tar --create --after-date '2011-12-1' -vzf backup1.tgz /var/tmp

Either form creates a backup archive of all files in /var/tmp which were modified after December 1, 2011.

Because tar only looks at a file’s date, it does not consider any other changes to the file, such as permissions or file name. To include files with these changes in the incremental backup, use find and create a list of files to be backed up.
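The find-based approach might look like the following sketch. It assumes GNU find and GNU tar, and uses a hypothetical marker file (last_backup) to stand in for the time of the previous backup. The -cnewer test compares the status-change time, so a permissions change is picked up even though the file contents are unchanged:

```shell
#!/usr/bin/env bash
set -eu

work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT
cd "$work"
mkdir data
echo "unchanged" > data/static.txt
echo "script"    > data/tool.sh

# Marker file standing in for "when the last backup ran".
touch last_backup
sleep 1

# Change only the permissions: the contents and modification time stay
# the same, but the status-change time (ctime) moves forward.
chmod +x data/tool.sh

# -cnewer selects files whose ctime is later than the marker's mtime.
find data -type f -cnewer last_backup > filelist.txt
cat filelist.txt

# -T (--files-from) reads the names to archive from the list.
tar -czf incremental.tgz -T filelist.txt
archived=$(tar -tzf incremental.tgz)
```

A newline-separated list breaks on file names that contain newlines; for such trees, find -print0 combined with tar --null -T is the safer pairing.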

Disk-to-Disk Copying (dd)

  • The dd program is very useful for making copies of raw disk space.

    • For example, to back up your Master Boot Record (MBR) (the first 512-byte sector on the disk that contains a table describing the partitions on that disk), you might type:
    dd if=/dev/sda of=sda.mbr bs=512 count=1

WARNING: The command dd if=/dev/sda of=/dev/sdb, which copies one disk onto another, will delete everything that previously existed on the second disk.

  • An exact copy of the first disk device is created on the second disk device. Do not experiment with this command as written above, as it can erase a hard disk!
  • Exactly what the name dd stands for is an often-argued item. The most popular theory is data definition, which has roots in early IBM history. Often, people joke that it means disk destroyer and other variants such as delete data!

  • The following command will back up the MBR (along with the partition table):

    dd if=/dev/sda of=mbrbackup bs=512 count=1
  • The MBR can be restored using the following command:

    sudo dd if=mbrbackup of=/dev/sda bs=512 count=1
  • The above dd commands only copy the primary partition table; they do not deal with any partition tables stored in the other partitions (for extended partitions, etc.).

  • For GPT systems it is best to use the sgdisk tool, as in this command:

    sudo sgdisk -p /dev/sda
    
    Output:
    Disk /dev/sda: 1000215216 sectors, 476.9 GiB
    Model: SAMSUNG MZNLN512
    Sector size (logical/physical): 512/512 bytes
    Disk identifier (GUID): DBDBE747-8392-4B62-97AA-6214490073DC
    Partition table holds up to 128 entries
    ....
    Number Start (sector) End (sector) Size      Code  Name
       1           2048        534527 260.0 MiB  EF00  EFI System Partition
       2         534528        567295 16.0 MiB   0C01  Microsoft reserved ...
    ....
  • If run on a pure MBR system, the output is different:

    sudo sgdisk -p /dev/sda
    
    Output:
    Found invalid GPT and valid MBR; converting MBR to GPT format in memory.
    Disk /dev/sda: 500118192 sectors, 238.5 GiB
    Model: Crucial_CT256MX1
    Sector size (logical/physical): 512/4096 bytes
    ...
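The MBR backup and restore commands shown earlier can be rehearsed safely on an ordinary file instead of a real disk. In this sketch, which assumes GNU dd and cmp, a hypothetical disk.img filled with random bytes stands in for /dev/sda. The conv=notrunc option is needed only because the target is a regular file, which dd would otherwise truncate:

```shell
#!/usr/bin/env bash
set -eu

work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT
cd "$work"

# A 1 MiB file of random bytes stands in for the disk.
dd if=/dev/urandom of=disk.img bs=512 count=2048 status=none

# Back up the first 512-byte sector, exactly as for a real MBR.
dd if=disk.img of=mbrbackup bs=512 count=1 status=none

# Simulate damage: overwrite the first sector with zeros.
dd if=/dev/zero of=disk.img bs=512 count=1 conv=notrunc status=none

# Restore the sector from the backup and confirm that it matches again.
dd if=mbrbackup of=disk.img bs=512 count=1 conv=notrunc status=none
if cmp -s -n 512 disk.img mbrbackup; then
    echo "first sector restored"
fi
image_size=$(wc -c < disk.img)
```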

Backup Programs

There is no shortage of backup program suites available for Linux, including proprietary applications and those supplied by storage vendors, as well as open-source applications.

  • Amanda

    Amanda (Advanced Maryland Automatic Network Disk Archiver) uses native utilities (including tar and dump), but is far more robust and controllable. Amanda is generally available on Enterprise Linux systems through the usual repositories.

  • Bacula

    Bacula is designed for automatic backup on heterogeneous networks. It can be rather complicated to use and is recommended (by its authors) only to experienced administrators. Bacula is generally available on Enterprise Linux systems through the usual repositories.

  • Clonezilla

    Clonezilla is a very robust disk cloning program, which can make images of disks and deploy them, either to restore a backup, or to be used for ghosting, to provide an image that can be used to install many machines.

    The program comes in two versions: Clonezilla Live, which is good for single machine backup and recovery, and Clonezilla SE, server edition, which can clone to many computers at the same time.

    Clonezilla is not very hard to use and is extremely flexible, supporting many operating systems (not just Linux), filesystem types, and boot loaders.