The Kolmogorov-Smirnov Test: an Intuition

The Kolmogorov–Smirnov test (K–S test) tests if two probability distributions are equal. Therefore, you can compare an empirically observed distribution with a known reference distribution, or you can compare two observed distributions, to test whether they match.

It works in really quite a simple manner. Let the cumulative distribution functions of the two distributions be CDFA and CDFB respectively. We simply measure the maximum difference between these two functions for any given argument. This maximum difference is known as the Kolmogorov-Smirnov statistic, D, and is given by:

D = \max_x{(| CDF_A(x) - CDF_B(x) |)}

You can think about it this way: if you plotted of CDFA and CDFB together on the same set of axes, D is the length of the largest vertical line you could draw between the two plots.

This image illustrates the CDFs of two empirically observed distributions. The K-S test statistic is the maximum vertical distance between these two CDFs, and is represented by the black line.
This image illustrates the CDFs of two empirically observed distributions. The K-S test statistic is the maximum vertical distance between these two CDFs, and is represented by the black line.

To perform the Kolmogorov-Smirnov test, one simply compares D to a table of thresholds for statistical significance. The thresholds are calculated under the null hypothesis that the distributions are equal. If D is too big, the null hypothesis is rejected. The threshold for significance depends on the size of your sample (as your sample gets smaller, your D needs to get larger to show that the two distributions are different) and, of course, on the desired significance level.

The test is non-parametric or distribution-free, which means it makes no assumptions about the underlying distributions of the data. It is useful for one-dimensional distributions, but does not generalise easily to multivariate distributions.

How To Generate Any Probability Distribution, Part 1: Inverse Transform Sampling

In this post I’d like to briefly describe one of my favourite algorithmic techniques: inverse transform sampling. Despite its scary-sounding name, it is actually quite a simple and very useful procedure for generating random numbers from an arbitrary, known probability distribution — given random numbers drawn from a uniform distribution. For example, if you had empirically observed from a database that a variable took on some probability distribution, and you wanted to simulate similar conditions, you would need to draw a random variable with that same distribution. How would you go about doing this?

In essence, it is simply a two-step process:

  1. Generate a random value x from the uniform distribution between 0 and 1.
  2. Return the value y such that = CDF(y), where CDF is the cumulative distribution function of the probability distribution you wish to achieve.

Why does this work? Step 1 picks a uniformly random value between 0 and 1, so you can interpret this as a probability. Step 2 inverts the desired cumulative distribution function; you are calculating y = CDF-1(x), and therefore the returned value y is such that a random variable drawn from that distribution is less than or equal to y with probability x.

Thinking in terms of the original probability density function, we are uniformly randomly choosing a proportion of the area under the curve of the PDF and returning the number in the domain such that exactly this proportion of the area occurs to the left of that number. So numbers in the regions of the PDF with greater areas are more likely to occur. The uniform distribution is thereby projected onto this desired PDF.

This is a really neat algorithm. But what do you do if you don’t know the CDF of the distribution you want to sample from? I discuss a solution in Part 2 of this series: How To Generate Any Probability Distribution, Part 2: The Metropolis-Hastings Algorithm

The Shortest Bayes Classifier Tutorial You’ll Ever Read

The Bayes classifier is one of the simplest machine learning techniques. Yet despite its simplicity, it is one of the most powerful and flexible.

Being a classifier, its job is to assign a class to some input. It chooses the most likely class given the input. That is, it chooses the class that maximises P(class | input).

Being a Bayes classifier, it uses Bayes’ rule to express this as the class that maximises P(input | class)*P(class).

All you need to build a Bayes classifier is a dataset that allows you to empirically measure P(class) and P(input | class) for all combinations of input and class. You can then store these values and reuse them to calculate the most likely class for an unseen input. It’s as simple as that.

This concludes the shortest Bayes classifier tutorial you’ll ever read.

Appendix: what happened to the denominator in Bayes’ rule?

Okay, so I cheated a little bit by adding an appendix. Even so, the tutorial above is a complete description of the Bayes classifier. Those familiar with Bayes’ rule would complain that when I rephrased P(class | input) as P(input | class)*P(class), the denominator P(input) is missing. This is correct; but since this denominator is independent of the value of class, it can safely be removed from the expression with the guarantee that the class that maximises it is the same as the class that would have maximised it if the denominator was still present. Look at it this way: say you want to find the value x that maximises the function f(x) = -x*x. This is the same value of x that maximises the function g(x) = f(x)/5, simply because the denominator, 5, is independent of the value of x. We are not interested in the actual output of f(x) or g(x), merely the value of x that maximises either.

Appendix: the naïve Bayes classifier

The Bayes classifier above comes with a caveat, though: if you have even reasonably complicated input, procuring a dataset that allows you to reliably measure P(input | class) for all unique combinations of input and class isn’t easy! For example, if you are building a binary classifier and your input consists of four features that can take on ten values each, that’s already 20,000 combinations of features and classes! A common way to remedy this problem is to regard each feature as independent of each other. That way you only need to empirically measure the likelihood of each value of each feature occurring given a certain class. You then estimate the likelihood of an entire set of features by multiplying together the likelihood of occurrence of each of its constituent feature values. This is a naïve assumption, and so results in the creation of a naïve Bayes classifier. This is also a purposely vague summary of the workings of a naïve Bayes classifier. I would recommend an Internet search for a more in-depth treatment.

Data Science vs Data Analysis vs Data Mining: What’s the Difference?

This is a question that I often get asked by people new to data science. Because these are subjective, evolving terms, this question will never have a definitive answer. However, I think of it like this:

Data analysis is literally just the act of drawing an inference from some data. Something as simple as looking at a set of 10 numbers and calculating their average can constitute data analysis.

Data mining is, most generally, when the act of data analysis is partially or fully-automated. Data mining is strongly associated with large datasets, which you would expect, given that the ability to automate analysis is particularly useful with large datasets.

Data science is the most nebulous and vague term of the three. It’s better to think of data science as a craft, rather than a specific activity. The ultimate aim of a data scientist is simply to draw inferences from data; in that sense they are simply data analysts. But a data scientist is also equipped with the knowledge and skills to manage this process from end to end:

  1. to gather the data, and store and process it until it is in a form suitable for analysis,
  2. to perform the analysis, and
  3. to present the results of the analysis in a manner useful to the person who needs it.

Much of the reason that data science has emerged as a separate entity is because of the transition of data analysis from data-poor to data-rich. The transition has been extremely swift. People who were trained extensively to perform steps 2 and 3, because they were trained to work in a world where those steps were the bottleneck, are now choked by their inability to do step 1 well, simply because of the sheer volume, variety, and velocity of the data. Conventional data processing methods simply do not scale to data-rich environments. It is common knowledge in the industry that in data analysis, 90% of the time is spent preparing the data, and 10% of the time is spent doing actual science. These figures are not exaggerated.

Data scientists can not only do all steps 1-3, but importantly should be able to do them in a way that scales, such that the human effort is redistributed more effectively between the steps. This is one of the best ways to tell whether you have hired a true data scientist, or merely a statistician pretender.

Why Certain Special Characters Reduce The SMS Character Limit To 70

I recently noticed that the character count of a text message I was drafting on my iPhone suddenly changed from “x/160” to “x/70” (here’s how to display a character count in Messages, if you didn’t already know).

Perplexed, I turned to the Internet for an answer, and found one quite quickly on this MacRumors thread.

It basically boils down to this: An SMS may contain up to 140 bytes (= 1120 bits) of data. UK mobile networks use the GSM standard. The basic GSM character set is encoded using 7 bits per character, which allows for a text message to consist of 1120/7 = 160 characters.

It is only possible to represent 128 different characters with 7 bits. This suffices to capture all common English characters. A few additional special characters (mostly punctuation) can be specified using the basic character set “extension”, which requires 14 bits for every character in the extended set.

However, support for the vast majority of foreign language characters comes in the form of 16-bit UTF-16 alphabet. If you have a mix of English and foreign language characters in your text message, the entire message must be sent in UTF-16, which reduces the number of available characters to 1120/16 = 70 characters. This explains the phenomenon I was experiencing.

I know what you’re thinking: this sucks for those who text in languages other than English. Thankfully, the GSM standard has a solution called “national language shift tables”. In this scheme, several 7-bit character sets are recognised, each corresponding to the most commonly used characters of a particular language. The first four bytes of the text message indicate the specific character set to use, and the rest of the message (136 bytes, or 1088 bits) can be used for the actual content of the message, allowing for a respectable compromise of 1088/7 ≅ 155 characters.

Using characters that belong to multiple shift tables in the same text triggers a fallback to UTF-16, but the idea is to capture the vast majority of text communication.

If this has piqued your interest, Wikipedia has a fairly comprehensive article about GSM 03.38.

Civil Partnerships are Discriminatory

Civil partnerships are often incorrectly viewed as the panacea for reconciling the views of those in favour of human rights with the views of champions of “traditional” marriage. It’s easy to see the appeal: the same-sex couples get to enjoy the same legal benefits of marriage (which, by the way, they often don’t), and the tragically misguided get to cling to the vacuous notion that somehow, the “real” meaning of marriage remains sacrosanct. Everyone wins! Unfortunately, it isn’t that convenient.

The fallacy is concisely stated: having the same legal rights is not equality. The ostensibly well-intentioned civil partnership is a step in the right direction, but ultimately fails to satisfy some very basic, primitive human needs, and is therefore not a solution to the problem of marriage inequality.

But they’re different!

A common argument is that different things call for different names. This is nothing but a rehash of the separate but equal argument, attempting to hide the proponent’s homophobia under a thin veneer of semantics. It reaches for ad absurdum: in the interest of political correctness, why bother with semantic distinctions at all? Instead of having different words for “man” and “woman”, why not call them all “persons”?

In short, the reason it makes no sense to have distinct words to express two different kinds of marriage is because there is no great utility to be found in making that distinction. This is why we don’t have separate words for interracial marriage, marriage between old people, marriage between wealthy people, or marriage between unpleasant people, even though these are all different things. Can we move on now?

The word can only be marriage.

Words do not have meaning. Rather, they convey meaning, and what a word conveys is a matter of soft social convention. No single party can claim global authority on the definition of a word.

The word “marriage” carries social weight which makes it absolutely essential for complete equality. A marriage is not simply a contract which entitles the underwritten to certain legal provisions. Marriage means confidence in a relationship and the ability to commit to one. Marriage means willingness and ability to reconcile differences. Marriage means the opportunity for mutual growth and support.

A marriage is a publicly visible and recognised milestone in a relationship. It is a deeply ingrained human aspiration in civilised society. Like anything else that has heavy aspirational value placed on it, we are taught the worth of marriage through our upbringing, through our friends and relatives, through experiences and conversation, and through media. Everyone is searching for meaning in life, and we have taught ourselves that marriage is a most important and meaningful human experience.

The value of marriage is a complex creature, evolved through centuries and across several cultures, and realised through arbitrary conventions and rituals. A contemporary Western marriage, for instance, becomes much less meaningful without arbitrary rituals such as stag nights and honeymoons.

I understand that no one is stopping civil partners executing identical rituals in an effort to make their experience of civil partnership a better approximation for the marriage experience to which they aspire. However that is exactly the problem: no matter how painstakingly planned and executed, the civil partnership remains an approximation. The value of the experience, the life-meaning that the individuals can glean, can never quite match the aspirational value they have imbibed from society.

Depriving an entire category of people of an opportunity to obtain meaning in life is discriminatory. It sends a message that certain people are unworthy of feeding such an aspiration because of something in which they had no choice.

Our soft social convention for what the word “marriage” conveys is easily lenient enough to accommodate same-sex couples, and the legal definition needs must catch up.