Tutorial: what is the best way to back up your data safely and efficiently?

Hello all,

I am not a Linux expert (yet), but I will share my experience here of how I back up my data safely and efficiently. This is only my personal approach and opinion; some experts may have better and more efficient methods.

From my experience, there are several ways to back up our data:
  1. Using free cloud storage, such as Dropbox, Google Drive or OneDrive
  2. Using an external hard drive
  3. Using a data server
In this article, I will share my experience in backing up my data using Ubuntu (Linux). This article should also be useful for macOS users.

As I mentioned in another of my articles, I have been using Linux for several reasons.
However, one of the main issues with Linux is that there is no official client for Google Drive, which is my favourite cloud storage because it offers 15 GB of free space to start with. Although I could install Dropbox, it only gives 2 GB of free space initially (before inviting other people to use the Dropbox service).

As a scientist working on artificial intelligence and machine learning, dealing with big scientific data, I really need a proper backup system that can store my work data safely and efficiently. Dropbox is not an option here due to its limited space, so I had to find an alternative way to back up my data.

Fortunately, at my current institution (the University of Helsinki), the IT office provides all staff and students with a network drive called the Z-drive, with between 5 GB and 20 GB of disk space. In addition, several departments and divisions also provide local drives for their scientists, with up to 60 GB per scientist.

Furthermore, I also have the privilege of an account on a supercomputer cluster at CSC – IT Center for Science Ltd. [1], the Finnish IT centre for science, which provides IT support, modelling, computing and information services for academia, research institutes and companies in Finland and across Europe. CSC hosts several supercomputer clusters, and I have access to a medium-sized supercluster named Taito [2]. On that supercluster, I am entitled to 50 GB in my home directory (used for saving our data as well as our developed programs) and 5 TB (i.e. 5,000 GB) in the work directory (mainly used to store data produced by complex scientific calculations). However, the work directory is not reliable for backing up data, since it is cleared from time to time due to maintenance and other reasons.
For your information, CSC – IT Center for Science Ltd. also has a more powerful supercomputer called SISU, the most powerful supercomputer in Finland and one of the most powerful in Northern Europe [3].

CSC headquarters in Espoo, Finland

Taito supercluster is intended for serial (single-core) and small to medium-size parallel jobs (photo courtesy of CSC IT, Finland).

In summary, I am entitled to have three different types of data server here:

  1. The Z-drive, with between 5 GB and 20 GB of space per staff member
  2. The departmental network disk, with 60 GB per departmental scientist
  3. The home directory on the CSC Taito supercluster, with 50 GB

Since I have several options here, I have opted to use only two of these data servers, namely nos. 2 and 3. To back up data to a network server, we can use Unix commands such as rsync or unison. The key difference between the two is the direction of the update: rsync synchronises data in one direction, whereas unison updates data in both directions.

The rsync command can be used in this way:

# rsync -av source dest.machine:dir/there

The -a flag stands for 'archive mode' and is equivalent to -rlptgoD (without -H, -A, -X). Among other things, it preserves timestamps, permissions and symlinks.

Now, let's take a look at the use of rsync in my case, using the departmental network disk.

First, I need to check that I can access the server by connecting to it over ssh from the Linux terminal:

# ssh login.physics.helsinki.fi

I am then logged in to the server, where I can navigate my files and also run several programs, such as Matlab, Mathematica, Python 3 and IPython. The above ssh command allows me to use the server whether I am inside or outside the university network.
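As an aside, repeated logins can be shortened with an entry in ~/.ssh/config; the alias and username below are hypothetical examples, not my actual settings:

```
# ~/.ssh/config (alias and username are examples only)
Host physics
    HostName login.physics.helsinki.fi
    User myusername
```

With this in place, `ssh physics` is enough, and the same alias works for sshfs and rsync below.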

Then I type:

# exit

and I am brought back to my local computer, then from my home directory, I create a back-up folder:

# mkdir Backup_Local2

The next step is to mount the remote directory with sshfs:

# sshfs login.physics.helsinki.fi:/data/people/zaidanma Backup_Local2

The above command mounts the data server onto my local directory (i.e. Backup_Local2). Now I can browse the Backup_Local2 folder and the files inside it from my local computer.
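When a backup session is finished, it is good practice to release the sshfs mount again. A sketch, assuming the FUSE tools are installed (these commands only make sense while the mount above is active, so they are shown as comments):

```shell
# Check that the remote filesystem is mounted:
#   mount | grep Backup_Local2
# Unmount it when done (Linux / FUSE):
#   fusermount -u Backup_Local2
# On macOS the equivalent would be:
#   umount Backup_Local2
```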

The next step is to synchronise the data from my local computer to the target, the departmental network disk, with this command:

# rsync -av Backup_Local/ login.physics.helsinki.fi:/data/people/zaidanma

Backup_Local is the folder where I keep all the files I want to back up.
Now the files are synchronised to the network disk. To check that they are really there, ssh to login.physics.helsinki.fi and type:

# cd /data/people/zaidanma

That is where the people directory lives on login.physics. We can now see that all my files have been copied there.

However, by default rsync does not propagate deletions. If we delete a file in the Backup_Local folder, it will still remain on the network disk after the next sync. Fortunately, the --delete-before or --delete-after options solve this.

--delete-before         receiver deletes before transfer, not during
This means that:

  • rsync deletes the files from TARGET that have been removed from SOURCE.
  • rsync starts syncing files.

--delete-after          receiver deletes after transfer, not during
This means that:

  • rsync starts syncing files.
  • rsync deletes the files from TARGET that have been removed from SOURCE after syncing.

Hence the command should be something like this:

# rsync -av --delete-after Backup_Local/ login.physics.helsinki.fi:/data/people/zaidanma

This way, files deleted on my local computer will also be removed from the network drive.
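The effect of --delete-after can be checked safely with two local directories before pointing it at a real backup (the /tmp paths below are just for illustration):

```shell
# Set up a source with two files and sync it once
mkdir -p /tmp/del_demo/src /tmp/del_demo/dst
touch /tmp/del_demo/src/keep.txt /tmp/del_demo/src/old.txt
rsync -av /tmp/del_demo/src/ /tmp/del_demo/dst/

# Remove one file from the source, then sync again with --delete-after
rm /tmp/del_demo/src/old.txt
rsync -av --delete-after /tmp/del_demo/src/ /tmp/del_demo/dst/

# old.txt is now gone from the destination as well
ls /tmp/del_demo/dst/
```

Combining --delete-after with -n first (as shown earlier) lets you preview exactly which files would be deleted before anything is touched.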

Unison command as an alternative?

As an alternative, it is sometimes also good to use unison:

# unison Backup_Local ssh://login.physics.helsinki.fi//data/people/zaidanma/Backup_Local2/Backup_Local

The above command will fully synchronise my local machine and the network drive in both directions. Some Linux experts do not recommend this for backup purposes, because if something happens to the network drive (where the data is backed up), the data on the local machine will be affected as well.

Nevertheless, the unison command is very useful for synchronising data between two machines, much like Dropbox. For example, if we work on a desktop computer in the office and want the same copy on a laptop at home, unison is very handy.
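By default unison asks interactively what to do with each change. For unattended syncing, its -batch flag accepts all non-conflicting changes without prompting; a sketch, assuming unison is installed on both machines (shown as a comment since it needs the remote server):

```shell
# Two-way sync without interactive prompts (non-conflicting changes only):
#   unison Backup_Local ssh://login.physics.helsinki.fi//data/people/zaidanma/Backup_Local2/Backup_Local -batch
```

Conflicting changes (the same file edited on both sides) are skipped in batch mode and still need a manual decision.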

I hope my shared experience here is useful.

By:
Martha Arbayani Zaidan
Helsinki, 23 October 2017

Reference:
[1] https://www.csc.fi/ (Accessed on 23 October 2017)
[2] https://research.csc.fi/taito-supercluster (Accessed on 23 October 2017)
[3] https://research.csc.fi/csc-s-servers (Accessed on 23 October 2017)
