# The 4 Drive Backup Solution for Mere Mortals

In this post I describe a minimal, yet comprehensive personal backup solution. It is relatively easy to implement, using only the built-in features of your operating system, and is quite cheap as it requires only 4 hard drives (and can be accomplished with even fewer). Despite being extremely simple, it has the characteristics of a complete backup system and protects against several causes of data loss. It is a sensible backup strategy as of June 2014. This post is aimed towards the technologically-inclined reader.

##### The solution
• Preparation: Acquire 4 external hard drives, each as large as you wish, all of roughly the same capacity. I will refer to them as A1, A2, I1 and I2.
• Archival drives: Drives A1 and A2 are archival drives. They contain data that you no longer keep on your primary computer, and data that you no longer expect to change. This might include photos, music, and old work. You must ensure that A1 and A2 always have the same content as each other.
• Incremental backup drives: Drives I1 and I2 are incremental backup drives. They will contain a versioned history of all the files on your primary computer. For instance, you can set them both to be Time Machine drives. Time Machine is the incremental/differential backup software that comes standard with Mac OS X (alternative solutions are available for other operating systems).
• Location: Drives A1 and I1 are stored at the same primary location, such as your home. Drives A2 and I2 are stored a different, secondary location, such as your workplace.
• What you need to do: Update the content on A1 and A2 at your convenience, making sure they are always in sync. Make incremental backups with I1 and I2 as frequently as possible (at least once daily). With Time Machine this amounts to merely plugging in the drive (or connecting to the same network as the drive, if you use Time Capsule, or you can use something like a Transporter).

And that’s it.

##### What this scheme protects you against
• Under the event of data loss due to a hardware or software failure, that is, if one of the drives fails or the data on one of the drives gets corrupted, there is always another drive with a copy of the same data. This drive may be used until the failed/corrupt drive is replaced.
• Under the event of data loss due to human error, such as accidentally deleting or overwriting a file, there are two incremental backups from which any historic version of the file can be restored.
• Under the event of data loss due to natural disasters (such as a fire, power surge, or flood) or theft, which causes the drives in one location to be destroyed or stolen, there is always a duplicate of the drives in another location which may be used until the destroyed/stolen drives are replaced. This is what is known as an offsite backup.
##### What this scheme doesn’t protect you against
• Both archival drives or both incremental backup drives failing simultaneously: this is extremely unlikely, but if you’re worried about it you can add a third drive of each type.
• Failure to make incremental/archival backups often enough: this is your problem, not a problem with the scheme.
##### Modifying the scheme if it doesn’t work for you

This scheme can be directly implemented if:

1. You primarily use one computer, which is a Mac
2. Your day-to-day work does not create huge (i.e. comparable to the size of your hard drive), constantly changing files
3. You do not care for third party services or cloud services (which often require recurring monthly fees)
4. You are somewhat conscious of but not too restricted by price
5. You are okay with waiting a few hours to get going again from your backups in case the hard drive in your computer fails and you can no longer boot

If the above do not apply to you, it is easy to adapt this solution for other use cases. For instance, you can easily modify the solution if:

1. You use Windows/Linux: I believe Windows has an equivalent to Time Machine called “Windows Backup“. Linux users can probably fend for themselves and find something that works for them.
2. You primarily use multiple computers: You will need an additional pair of incremental backup drives for each additional computer you use.
3. You need to be able to immediately continue from where you left off in case your computer stops working: You will need to start creating bootable clones, which can be achieved using software such as Disk Utility (comes standard with Mac OS X), SuperDuper or Carbon Copy Cloner. For Windows users, Windows Backup can also create bootable clones. These can be stored on additional drives or on your archival drives.
4. You don’t mind third party or cloud services: I recommend looking into a solution such as Crashplan or BackBlaze. You can use these services to augment the 4 drive solution or to replace it entirely, depending on your level of trust and the quality of your Internet connection.
5. You are extremely price conscious: It is possible to implement this scheme with only two drives. In this scenario you will have to create two partitions on each drive, one for archival and the other for the incremental backup. The drives must of course still be stored at separate locations. I personally prefer the 4 drive version because (1) hard drives are not yet capacious enough that cheap commodity drives can be partitioned into useful sizes for those with lots of data, (2) partitioning necessitates erasing the drive, (3) I am leery of increased opportunities for filesystem corruption with multiple partitions, and (4) it is much less effort to replace drives if they only serve a single purpose.
##### Choosing a mix of drives

Since you will be acquiring multiple drives, you have the opportunity to spread your risk even further. By buying drives from different brands, you reduce your vulnerability if any single manufacturer or hard drive model has a faulty run. It is also good to have a mix of hard drive ages, since very young as well as very old drives appear to have a higher failure rate than those between the ages of 1 and 3 years.

I hope this is of some use. I was tired of thinking about backups and tired of researching third party backup solutions, so I settled on this compact, no-frills setup that can cope with all major threats to your data. If you have a suggestion or notice a deficiency, please leave a comment!

# The Metropolis-Hastings Algorithm: a Tutorial

In an earlier post I discussed how to use inverse transform sampling to generate a sequence of random numbers following an arbitrary, known probability distribution. In a nutshell, it involves drawing a number x from the uniform distribution between 0 and 1, and returning CDF-1(x), where CDF is the cumulative distribution function corresponding to the probability density/mass function (PDF) we desire.

Calculating the CDF requires that we are able to integrate the PDF easily. Therefore, this method only works when our known PDF is simple, i.e., it is easily integrable. This is not the case if:

• The integral of the PDF has no closed-form solution, and/or
• The PDF in question is a massive joint PDF over many variables, and so solving the integral is intractable.

In particular, the second case is very common in machine learning applications. However, what can we do if we still wish to sample a random sequence distributed according to the given PDF, despite being unable to calculate the CDF?

The solution is a probabilistic algorithm known as the Metropolis or Metropolis-Hastings algorithm. It is surprisingly simple, and works as follows:

1. Choose an arbitrary starting point x in the space. Remember P(x) as given by the PDF.
2. Jump away from x by a random amount in a random direction, to arrive at point x’. If P(x’) is greater than P(x), add x’ to the output sequence. Otherwise, if it is less, decide to add it to the output sequence with probability P(x’)/P(x).
3. If you have decided to add x’ to the output sequence, move to the new point and repeat the process from step 2 onwards (i.e. jump away from x’ to some x”, and if you add x”, then jump away from it to x”’ etc). If you did not add x’ to the sequence, then return to x and try to generate another x’ by jumping away again by a random amount in a random direction.

The PDF of the sequence of random numbers emitted by this process ultimately converges to the desired PDF. The process of “jumping away” from x is achieved by adding some random noise to it, this is usually chosen to be a random number from a normal distribution centred at x.

Why does this work? Imagine that you’re standing somewhere in a hilly region, and you want to visit each point in the region with a frequency proportional to its elevation; that is, you want to visit the hills more than the valleys, the highest hills most of all, and the lowest valleys least of all. From your starting point, you make a random step in a random direction and come to a new point. If the new point is higher than the old point, you stay at the new point. If the new point is lower than the old point, you flip a biased coin and depending on the result, either choose to stay at the new point or return to the old point (it turns out that in practice, this means choosing the lower point with probability P(x’)/P(x), but I will not bother proving why here). If you do this for an infinitely long time, you’ll probably visit most of the region at least once, but you’ll have visited the highest regions much more than the lower regions, simply because you always accept upwards steps, whereas you only accept downwards steps a certain amount of the time.

A nifty trick is to not use the desired PDF  to calculate P(x) directly, but instead to use a function f such that f(x) is proportional to P(x) (this results in the same probability for deciding whether to accept a new point or not). Such proportional approximations are often easier to compute and can speed up the operation of the algorithm dramatically.

You may have heard of the Metropolis algorithm being referred to as a Markov chain Monte-Carlo algorithm. There are two parts to this; the first is “Markov chain” — this is simply referring to the fact that at each step of the algorithm we only consider the point we visited immediately previously; we do not remember more than just the last step we took in order to compute the next step. The second is “Monte Carlo” — this simply means that we are using randomness in the algorithm, and that the output may not be exactly correct. By saying “not exactly correct”, we are acknowledging the fact that the distribution of the sequence converges to the desired distribution as we draw more and more samples; a very small sequence may not look like it follows the desired probability distribution at all.

# The Shortest Bayes Classifier Tutorial You’ll Ever Read

The Bayes classifier is one of the simplest machine learning techniques. Yet despite its simplicity, it is one of the most powerful and flexible.

Being a classifier, its job is to assign a class to some input. It chooses the most likely class given the input. That is, it chooses the class that maximises $P(class | input)$.

Being a Bayes classifier, it uses Bayes’ rule to express this as the class that maximises $P(input | class)*P(class)$.

All you need to build a Bayes classifier is a dataset that allows you to empirically measure $P(class)$ and $P(input | class)$ for all combinations of input and class. You can then store these values and reuse them to calculate the most likely class for an unseen input. It’s as simple as that.

This concludes the shortest Bayes classifier tutorial you’ll ever read.

Appendix: what happened to the denominator in Bayes’ rule?

Okay, so I cheated a little bit by adding an appendix. Even so, the tutorial above is a complete description of the Bayes classifier. Those familiar with Bayes’ rule would complain that when I rephrased $P(class | input)$ as $P(input | class)*P(class)$, the denominator $P(input)$ is missing. This is correct; but since this denominator is independent of the value of class, it can safely be removed from the expression with the guarantee that the class that maximises it is the same as the class that would have maximised it if the denominator was still present. Look at it this way: say you want to find the value $x$ that maximises the function $f(x) = -x*x$. This is the same value of $x$ that maximises the function $g(x) = f(x)/5$, simply because the denominator, 5, is independent of the value of $x$. We are not interested in the actual output of $f(x)$ or $g(x)$, merely the value of $x$ that maximises either.

Appendix: the naïve Bayes classifier

The Bayes classifier above comes with a caveat, though: if you have even reasonably complicated input, procuring a dataset that allows you to reliably measure $P(input | class)$ for all unique combinations of input and class isn’t easy! For example, if you are building a binary classifier and your input consists of four features that can take on ten values each, that’s already 20,000 combinations of features and classes! A common way to remedy this problem is to regard each feature as independent of each other. That way you only need to empirically measure the likelihood of each value of each feature occurring given a certain class. You then estimate the likelihood of an entire set of features by multiplying together the likelihood of occurrence of each of its constituent feature values. This is a naïve assumption, and so results in the creation of a naïve Bayes classifier. This is also a purposely vague summary of the workings of a naïve Bayes classifier. I would recommend an Internet search for a more in-depth treatment.