Friday, June 9, 2017

Reimplementing Plan 9 dump, relaxing backups

Some time has passed since I last used Plan 9 and there are many things I miss: the simplicity, understanding a whole system from the ground up... One of the things I used to miss most was the backup system. For the user it was amazingly simple. In /dump one could find a directory tree with a directory per year, one per month inside it and so on, each holding a snapshot of the filesystem at that point in time.
This happened automatically, without any intervention from the user. It enabled me to relax and focus on the job (whatever that was).
While I have used similar features in other systems, they were too complex (and thus not trustworthy), or I didn't fully understand them, or they required intervention from the user, or they failed too often, or some combination of all of these.

The first answer I get when I tell a developer this is "you should be using Git".
Of course, I know Git (and other VCSs) and I use it, but I find three main problems with them as backup systems. One is that they are not trivial to navigate, so when I am doing something urgent they steal cycles from whatever I am doing. Another is that when you are working with things which are not code, like images for presentations or video, there are problems with binary diffs. Finally, they require intervention from the user: you have to commit and push every time you edit a file. Of course, you could do it periodically, but there are still other factors to consider, like simplicity. Adding a layer over Git to turn it into a dump filesystem is too complex which, again, makes it less trustworthy. Git is nice when you have a distributed team of developers and the complexity pays off. Backups are an entirely different problem.

The most important lessons I learnt from Plan 9's dump are that backups should happen without you doing anything and that you should be able to access and navigate them instantly and without fuss. This makes you check, without noticing, that the backup is online and working, because you are using it as part of your daily work, for example, by checking the history of a file's edits or modifications. All of that without giving it a second thought.

Another requirement I learnt the hard way, and which is also important, is for the backup system to be as simple as possible, especially in the part that takes the snapshots, as my data and part of my sanity are going to depend on it. If I can't completely understand the format in which the backups are stored, I end up not trusting the backup system. If parts of the backups are lost, I should be able to understand and piece together whatever is left.

While I was thinking about this, I found this post, which shows that rsync is the ideal candidate to make the snapshots. It is a battle-tested command in which it is going to be quite difficult to find serious bugs, and it has support for making the snapshots efficiently, by making hard links to the files that have not changed. I make my snapshots hourly, and at that frequency not making the dumps too redundant is quite important. Also, there are quite a lot of corner cases in filesystems these days (ACLs, metadata, etc.) which I don't want to deal with or think about if possible. I let rsync take care of all that.
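
As a rough illustration of the hard-link idea, here is a minimal sketch in Go of how such an rsync call could be assembled. It is not the script used here: the function names rsyncArgs and takeSnapshot and the choice of flags are assumptions. The only rsync options it relies on are -a (archive mode) and --link-dest, which makes rsync hard-link files that are unchanged with respect to a previous snapshot instead of copying them.

```go
package main

import (
	"os"
	"os/exec"
	"path"
)

// rsyncArgs builds the arguments for one snapshot run (hypothetical helper).
// src is the live tree, prev the previous snapshot ("" for the first run)
// and dst the new snapshot directory. Absolute paths are assumed, because
// a relative --link-dest is interpreted relative to the destination.
func rsyncArgs(src, prev, dst string) []string {
	// -a preserves permissions, times, owners and symlinks.
	// ACLs and extended attributes would additionally need -A and -X.
	args := []string{"-a"}
	if prev != "" {
		// Files identical to the copy in prev become hard links to it.
		args = append(args, "--link-dest="+path.Clean(prev))
	}
	// The trailing slash on src means "the contents of src".
	return append(args, path.Clean(src)+"/", path.Clean(dst)+"/")
}

// takeSnapshot creates the parent directories of dst and runs rsync.
func takeSnapshot(src, prev, dst string) error {
	if err := os.MkdirAll(path.Dir(dst), 0o755); err != nil {
		return err
	}
	cmd := exec.Command("rsync", rsyncArgs(src, prev, dst)...)
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	return cmd.Run()
}
```

A call such as takeSnapshot("/main", "/dump/2017/0513/1310", "/dump/2017/0513/1410") would then leave a full-looking tree in the new directory while only storing the files that changed during the last hour.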

The script to make the snapshot is very simple and, of course, running it hourly in Unix is no problem (before I would have used cron, now systemd).

Now I have a dump filesystem which I can bind or soft link to /dump (if you want to install the whole thing in Linux, it is explained here), or mount remotely with sshfs. This enables me to do things like:

cd /dump/2017/0513/1410/MAIN/paurea/doc/howto

to see that directory as it was on that date, like in Plan 9.

Having almost no redundancy is a little troubling, because if someone overwrites a file in the dump, we may lose it. The clients mount it read-only, so it shouldn't happen, exactly like all the other things the backup protects against, which shouldn't be happening either. To guard against this, I checkpoint the whole filesystem monthly with this script. It creates a fresh copy without any backwards dependencies, at the cost of extra space equal to the size of a working copy.
To do that I use the cp(1) command which, by default, copies hard links as separate files.
I think of dumpme (the snapshot script) as compression and chkptme (the checkpoint script) as controlled redundancy, two steps in channel coding (hopefully without catastrophic error propagation).

To navigate the dump comfortably, I reimplemented the yesterday(1) and history(1) commands from Plan 9 in Go. They are complete reimplementations from scratch, with the options I find useful.

I called these commands yest and hist, so now you can run

$ yest -y=1 /main/MAIN/paurea/doc/howto
/dump/2016/0609/1606/MAIN/paurea/doc/howto

to get the path of the file or directory in the dump as it was a year ago. Some of the features of yesterday(1), like -c to copy yesterday's file, are not there. I may add them later if I find them necessary.

The other command is hist, which, like history(1), gives you the history of a file or directory, including its diffs if it is a text file.

$ cd  /main/MAIN/paurea/doc/charlas/1.talk

$ hist -c slides.tex

#create    /dump/2017/0510/1605/MAIN/paurea/doc/charlas/1.talk/slides.tex 19718 0777 2017-05-09 17:52:09 +0200 CEST  f
#wstat    /dump/2017/0510/1728/MAIN/paurea/doc/charlas/1.talk/slides.tex 19718 0777 2017-05-10 16:55:52 +0200 CEST  f
#write    /dump/2017/0512/1851/MAIN/paurea/doc/charlas/1.talk/slides.tex 34710 0777 2017-05-12 18:14:37 +0200 CEST  f
#write    /dump/2017/0518/1130/MAIN/paurea/doc/charlas/1.talk/slides.tex 34844 0777 2017-05-17 16:08:08 +0200 CEST  f


Without -c it returns the incremental diffs of the file. The diff of two files is computed with diffmatchpatch, which returns a list of edits (insertions, deletions and matches). I had to convert these into line differences like the ones given by the diff(1) command.

One feature I would like to implement in the future is to somehow mark the checkpoints in the filesystem and use some form of scrubbing or error detection with the different (supposedly equal) copies of the files.