The 4 Drive Backup Solution for Mere Mortals

In this post I describe a minimal yet comprehensive personal backup solution. It is relatively easy to implement, using only the built-in features of your operating system, and is quite cheap as it requires only 4 hard drives (and can be accomplished with even fewer). Despite being extremely simple, it has the characteristics of a complete backup system and protects against several causes of data loss. It is a sensible backup strategy as of June 2014. This post is aimed at the technologically inclined reader.

The solution
  • Preparation: Acquire 4 external hard drives, each as large as you wish, all of roughly the same capacity. I will refer to them as A1, A2, I1 and I2.
  • Archival drives: Drives A1 and A2 are archival drives. They contain data that you no longer keep on your primary computer, and data that you no longer expect to change. This might include photos, music, and old work. You must ensure that A1 and A2 always have the same content as each other.
  • Incremental backup drives: Drives I1 and I2 are incremental backup drives. They will contain a versioned history of all the files on your primary computer. For instance, you can set them both to be Time Machine drives. Time Machine is the incremental/differential backup software that comes standard with Mac OS X (alternative solutions are available for other operating systems).
  • Location: Drives A1 and I1 are stored at the same primary location, such as your home. Drives A2 and I2 are stored at a different, secondary location, such as your workplace.
  • What you need to do: Update the content on A1 and A2 at your convenience, making sure they are always in sync. Make incremental backups with I1 and I2 as frequently as possible (at least once daily). With Time Machine this amounts to merely plugging in the drive, or connecting to the same network as the drive if you use a Time Capsule or something like a Transporter.

And that’s it.
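Keeping A1 and A2 in sync is the one step that relies on your own discipline, so it can help to check it mechanically. Here is a minimal sketch, assuming both archival drives are mounted as ordinary folders; the function names and the mount paths in the usage note are hypothetical, and hashing every file will be slow on large drives.

```python
import hashlib
from pathlib import Path


def tree_digest(root):
    """Map each file's path (relative to root) to the SHA-256 of its content."""
    root = Path(root)
    digests = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            h = hashlib.sha256()
            with path.open("rb") as f:
                # Read in 1 MiB chunks so large files do not fill memory.
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
            digests[path.relative_to(root).as_posix()] = h.hexdigest()
    return digests


def compare_drives(a1, a2):
    """Return (only_on_a1, only_on_a2, differing) as sorted lists of relative paths."""
    d1, d2 = tree_digest(a1), tree_digest(a2)
    only_a1 = sorted(set(d1) - set(d2))
    only_a2 = sorted(set(d2) - set(d1))
    differing = sorted(p for p in set(d1) & set(d2) if d1[p] != d2[p])
    return only_a1, only_a2, differing
```

On a Mac you might call something like compare_drives("/Volumes/A1", "/Volumes/A2") (assumed mount points); three empty lists mean the two drives hold identical files.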

What this scheme protects you against
  • In the event of data loss due to a hardware or software failure, that is, if one of the drives fails or the data on one of the drives gets corrupted, there is always another drive with a copy of the same data. This drive may be used until the failed/corrupt drive is replaced.
  • In the event of data loss due to human error, such as accidentally deleting or overwriting a file, there are two incremental backups from which any historic version of the file can be restored.
  • In the event of data loss due to natural disasters (such as a fire, power surge, or flood) or theft, which causes the drives in one location to be destroyed or stolen, there is always a duplicate of the drives in another location which may be used until the destroyed/stolen drives are replaced. This is what is known as an offsite backup.

What this scheme doesn’t protect you against
  • Both archival drives or both incremental backup drives failing simultaneously: this is extremely unlikely, but if you’re worried about it you can add a third drive of each type.
  • Failure to make incremental/archival backups often enough: this is your problem, not a problem with the scheme.

Modifying the scheme if it doesn’t work for you

This scheme can be directly implemented if:

  1. You primarily use one computer, which is a Mac
  2. Your day-to-day work does not create huge (i.e. comparable to the size of your hard drive), constantly changing files
  3. You do not care for third party services or cloud services (which often require recurring monthly fees)
  4. You are somewhat conscious of but not too restricted by price
  5. You are okay with waiting a few hours to get going again from your backups in case the hard drive in your computer fails and you can no longer boot

If the above do not apply to you, it is easy to adapt this solution for other use cases. For instance, you can easily modify the solution if:

  1. You use Windows/Linux: I believe Windows has an equivalent to Time Machine called “Windows Backup”. Linux users can probably fend for themselves and find something that works for them.
  2. You primarily use multiple computers: You will need an additional pair of incremental backup drives for each additional computer you use.
  3. You need to be able to immediately continue from where you left off in case your computer stops working: You will need to start creating bootable clones, which can be achieved using software such as Disk Utility (comes standard with Mac OS X), SuperDuper or Carbon Copy Cloner. For Windows users, Windows Backup can also create bootable clones. These can be stored on additional drives or on your archival drives.
  4. You don’t mind third party or cloud services: I recommend looking into a solution such as CrashPlan or Backblaze. You can use these services to augment the 4 drive solution or to replace it entirely, depending on your level of trust and the quality of your Internet connection.
  5. You are extremely price conscious: It is possible to implement this scheme with only two drives. In this scenario you will have to create two partitions on each drive, one for archival and the other for the incremental backup. The drives must of course still be stored at separate locations. I personally prefer the 4 drive version because (1) hard drives are not yet capacious enough that cheap commodity drives can be partitioned into useful sizes for those with lots of data, (2) partitioning necessitates erasing the drive, (3) I am leery of increased opportunities for filesystem corruption with multiple partitions, and (4) it is much less effort to replace drives if they only serve a single purpose.

Choosing a mix of drives

Since you will be acquiring multiple drives, you have the opportunity to spread your risk even further. By buying drives from different brands, you reduce your vulnerability if any single manufacturer or hard drive model has a faulty run. It is also good to have a mix of hard drive ages, since very young as well as very old drives appear to have a higher failure rate than those between the ages of 1 and 3 years.

I hope this is of some use. I was tired of thinking about backups and tired of researching third party backup solutions, so I settled on this compact, no-frills setup that can cope with all major threats to your data. If you have a suggestion or notice a deficiency, please leave a comment!

Data Science vs Data Analysis vs Data Mining: What’s the Difference?

This is a question that I often get asked by people new to data science. Because these are subjective, evolving terms, this question will never have a definitive answer. However, I think of it like this:

Data analysis is literally just the act of drawing an inference from some data. Something as simple as looking at a set of 10 numbers and calculating their average can constitute data analysis.

Data mining is, most generally, data analysis that has been partially or fully automated. Data mining is strongly associated with large datasets, which you would expect, given that the ability to automate analysis is particularly useful with large datasets.

Data science is the most nebulous and vague term of the three. It’s better to think of data science as a craft, rather than a specific activity. The ultimate aim of a data scientist is simply to draw inferences from data; in that sense they are simply data analysts. But a data scientist is also equipped with the knowledge and skills to manage this process from end to end:

  1. to gather the data, and store and process it until it is in a form suitable for analysis,
  2. to perform the analysis, and
  3. to present the results of the analysis in a manner useful to the person who needs it.

Data science has emerged as a separate entity largely because of the transition of data analysis from data-poor to data-rich. The transition has been extremely swift. People who were trained extensively to perform steps 2 and 3, because they were trained to work in a world where those steps were the bottleneck, are now choked by their inability to do step 1 well, simply because of the sheer volume, variety, and velocity of the data. Conventional data processing methods simply do not scale to data-rich environments. It is common knowledge in the industry that in data analysis, 90% of the time is spent preparing the data, and 10% of the time is spent doing actual science. These figures are not exaggerated.

Data scientists can not only do all of steps 1 to 3, but importantly should be able to do them in a way that scales, such that the human effort is redistributed more effectively between the steps. This is one of the best ways to tell whether you have hired a true data scientist, or merely a statistician pretender.

Why Certain Special Characters Reduce The SMS Character Limit To 70

I recently noticed that the character count of a text message I was drafting on my iPhone suddenly changed from “x/160” to “x/70” (here’s how to display a character count in Messages, if you didn’t already know).

Perplexed, I turned to the Internet for an answer, and found one quite quickly on this MacRumors thread.

It basically boils down to this: An SMS may contain up to 140 bytes (= 1120 bits) of data. UK mobile networks use the GSM standard. The basic GSM character set is encoded using 7 bits per character, which allows for a text message to consist of 1120/7 = 160 characters.

It is only possible to represent 128 different characters with 7 bits. This suffices to capture all common English characters. A few additional special characters (mostly punctuation) can be specified using the basic character set “extension”, which requires 14 bits for every character in the extended set.

However, support for the vast majority of foreign language characters comes in the form of the 16-bit UTF-16 encoding (the GSM specification names UCS-2, its fixed-width predecessor). If you have a mix of English and foreign language characters in your text message, the entire message must be sent in UTF-16, which reduces the number of available characters to 1120/16 = 70 characters. This explains the phenomenon I was experiencing.
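To make the rule concrete, here is a rough sketch of how a phone might decide between the two counters. The character tables are transcribed from the GSM 03.38 default alphabet and its extension table; the function name sms_usage is my own, and real handsets may differ in details (for example, they may split long texts into several messages rather than report a single limit).

```python
# Characters of the basic GSM 7-bit set (GSM 03.38), one septet each.
GSM_BASIC = set(
    "@£$¥èéùìòÇ\nØø\rÅåΔ_ΦΓΛΩΠΨΣΘΞÆæßÉ"
    " !\"#¤%&'()*+,-./0123456789:;<=>?"
    "¡ABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÑÜ§"
    "¿abcdefghijklmnopqrstuvwxyzäöñüà"
)
# Extension-table characters: an escape septet plus a character septet each.
GSM_EXTENDED = set("^{}\\[~]|€\f")

SMS_BITS = 140 * 8  # 1120 bits of payload in a single SMS


def sms_usage(text):
    """Return (encoding, units_used, units_available) for a single SMS."""
    if all(ch in GSM_BASIC or ch in GSM_EXTENDED for ch in text):
        # Extended characters cost two septets (14 bits), all others one.
        used = sum(2 if ch in GSM_EXTENDED else 1 for ch in text)
        return "GSM-7", used, SMS_BITS // 7
    # Any other character forces 16-bit code units for the whole message.
    used = len(text.encode("utf-16-be")) // 2
    return "UTF-16", used, SMS_BITS // 16
```

For instance, sms_usage("hello") reports 5 of 160, while adding a single Chinese character to the same text switches the whole message to the 70-unit counter.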

I know what you’re thinking: this sucks for those who text in languages other than English. Thankfully, the GSM standard has a solution called “national language shift tables”. In this scheme, several 7-bit character sets are recognised, each corresponding to the most commonly used characters of a particular language. The first four bytes of the text message indicate the specific character set to use, and the rest of the message (136 bytes, or 1088 bits) can be used for the actual content of the message, allowing for a respectable compromise of 1088/7 ≈ 155 characters (rounded down).
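All three limits mentioned so far come from the same piece of arithmetic, sketched below. The function name and parameters are hypothetical, and the 4-byte header is simply the figure quoted above.

```python
def chars_per_sms(header_bytes=0, bits_per_char=7, total_bytes=140):
    """Whole characters that fit in one SMS after reserving header bytes."""
    payload_bits = (total_bytes - header_bytes) * 8
    return payload_bits // bits_per_char


basic_limit = chars_per_sms()                    # plain GSM 7-bit message
shifted_limit = chars_per_sms(header_bytes=4)    # with a language shift table
utf16_limit = chars_per_sms(bits_per_char=16)    # 16-bit fallback
```

These evaluate to 160, 155 and 70 respectively, matching the figures in the text.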

Using characters that belong to multiple shift tables in the same text triggers a fallback to UTF-16, but the idea is to capture the vast majority of text communication.

If this has piqued your interest, Wikipedia has a fairly comprehensive article about GSM 03.38.

Civil Partnerships are Discriminatory

Civil partnerships are often incorrectly viewed as the panacea for reconciling the views of those in favour of human rights with the views of champions of “traditional” marriage. It’s easy to see the appeal: the same-sex couples get to enjoy the same legal benefits of marriage (which, by the way, they often don’t), and the tragically misguided get to cling to the vacuous notion that somehow, the “real” meaning of marriage remains sacrosanct. Everyone wins! Unfortunately, it isn’t that convenient.

The fallacy is concisely stated: having the same legal rights is not equality. The ostensibly well-intentioned civil partnership is a step in the right direction, but ultimately fails to satisfy some very basic, primitive human needs, and is therefore not a solution to the problem of marriage inequality.

But they’re different!

A common argument is that different things call for different names. This is nothing but a rehash of the separate but equal argument, attempting to hide the proponent’s homophobia under a thin veneer of semantics. It reaches for a reductio ad absurdum: in the interest of political correctness, why bother with semantic distinctions at all? Instead of having different words for “man” and “woman”, why not call them all “persons”?

In short, it makes no sense to have distinct words for two different kinds of marriage because there is no great utility to be found in making that distinction. This is why we don’t have separate words for interracial marriage, marriage between old people, marriage between wealthy people, or marriage between unpleasant people, even though these are all different things. Can we move on now?

The word can only be marriage.

Words do not have meaning. Rather, they convey meaning, and what a word conveys is a matter of soft social convention. No single party can claim global authority on the definition of a word.

The word “marriage” carries social weight which makes it absolutely essential for complete equality. A marriage is not simply a contract which entitles the underwritten to certain legal provisions. Marriage means confidence in a relationship and the ability to commit to one. Marriage means willingness and ability to reconcile differences. Marriage means the opportunity for mutual growth and support.

A marriage is a publicly visible and recognised milestone in a relationship. It is a deeply ingrained human aspiration in civilised society. Like anything else that has heavy aspirational value placed on it, we are taught the worth of marriage through our upbringing, through our friends and relatives, through experiences and conversation, and through media. Everyone is searching for meaning in life, and we have taught ourselves that marriage is a most important and meaningful human experience.

The value of marriage is a complex creature, evolved through centuries and across several cultures, and realised through arbitrary conventions and rituals. A contemporary Western marriage, for instance, becomes much less meaningful without arbitrary rituals such as stag nights and honeymoons.

I understand that no one is stopping civil partners from executing identical rituals in an effort to make their experience of civil partnership a better approximation of the marriage experience to which they aspire. However, that is exactly the problem: no matter how painstakingly planned and executed, the civil partnership remains an approximation. The value of the experience, the life-meaning that the individuals can glean, can never quite match the aspirational value they have imbibed from society.

Depriving an entire category of people of an opportunity to obtain meaning in life is discriminatory. It sends a message that certain people are unworthy of feeding such an aspiration because of something in which they had no choice.

Our soft social convention for what the word “marriage” conveys is easily lenient enough to accommodate same-sex couples, and the legal definition must now catch up.