Posted on June 14, 2008 by Peter Turney
Ockham’s razor is the principle that, when competing theories fit the observations equally well, we should prefer the simpler theory. For any set of observations, there are an infinite number of theories that can fit the observations, with varying degrees of accuracy. For any set of data points, there are an infinite number of curves that fit the points, exactly or approximately. We don’t usually pick the theory with the highest accuracy (in general, there are infinitely many theories that are perfectly accurate); instead, we pick the theory that best balances accuracy and simplicity. Why? Because simplicity is a guide to truth. That is, the theory that best balances accuracy and simplicity is the theory that will make the most reliable predictions about future observations.

My PhD thesis was, in essence, about Ockham’s razor. For a period of about a decade, extending from my late undergraduate years to my early postdoctoral years, it would be fair to say that I was obsessed with Ockham’s razor. I was convinced that it was the key to understanding how we acquire knowledge about the world. I no longer believe in Ockham’s razor.
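To make the curve-fitting version of this concrete, here is a small sketch in plain Python (the data points and function names are my own, purely for illustration). Any n points with distinct x-values are fit exactly by a polynomial of degree n−1 (the Lagrange form below), yet a simpler theory that only fits approximately may predict unseen points differently:

```python
def lagrange_fit(points):
    """Return a function interpolating the given (x, y) points exactly."""
    def p(x):
        total = 0.0
        for i, (xi, yi) in enumerate(points):
            term = yi
            for j, (xj, _) in enumerate(points):
                if j != i:
                    # Basis polynomial: 1 at xi, 0 at every other xj.
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return p

# Three noisy observations that lie roughly on the line y = x.
pts = [(0.0, 0.1), (1.0, 0.9), (2.0, 2.1)]

exact = lagrange_fit(pts)       # degree-2 curve: zero error on the data
simple = lambda x: x            # the "simple" theory y = x, approximate fit

# The complex theory is perfectly accurate on the observations...
for x, y in pts:
    assert abs(exact(x) - y) < 1e-9

# ...but the two theories diverge on a future observation at x = 3:
# exact(3.0) gives 3.7, while simple(3.0) gives 3.0.
```

Both theories "fit the points"; nothing in the data alone says which prediction at x = 3 to trust.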
There are various ways to make Ockham’s razor more precise, such as Bayesian inference, minimum description length, minimum message length, Kolmogorov complexity, and data compression. My own approach was to reduce simplicity to stability. That is, we should prefer simpler theories because they are more stable, more resistant to noise in the data. This approach is closely related to the Akaike information criterion and the bias/variance dilemma. For some time, I was satisfied with bias/variance as a justification for Ockham’s razor.

The papers that awoke me from my dogmatic slumber were Cullen Schaffer’s “A conservation law for generalization performance” and “Overfitting avoidance as bias”. The subsequent No Free Lunch theorems and Webb’s “Further experimental evidence against the utility of Occam’s razor” further persuaded me that Ockham’s razor is dull.

It is notoriously difficult to explain the No Free Lunch theorems informally. I think of the theorems in terms of Hume and Kant. The NFL theorems show that there is no a priori reason to prefer any one theory to another. If you don’t make any assumptions about the world (i.e., the universe, the environment, the particular dataset that you are analyzing), then you can’t make any claims about the reliability of your predictions.

In machine learning, the NFL theorems are explained in terms of inductive bias. Note that inductive bias is distinct from statistical bias, which is the kind of bias involved in the bias/variance dilemma. Inductive bias is anything that is used to select a unique theory from the set of all theories, other than fit of the theory to the training data. In the words of Tom Mitchell, inductive bias is “a means of selecting between competing hypotheses that utilizes criteria beyond those strictly encapsulated in the training data”.
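Here is a tiny illustration of inductive bias in plain Python (my own toy example, not from the NFL papers): two learners that both fit the training data perfectly, yet disagree on an unseen input. The data alone cannot break the tie; only an assumption about the world can.

```python
# Four observations, consistent with y = x**2.
train = {0: 0, 1: 1, 2: 4, 3: 9}

def algebraic_theory(x):
    return x * x                 # bias: prefer a short algebraic rule

def lookup_theory(x):
    return train.get(x, 17)      # bias: memorize; answer 17 everywhere else

# Both theories are perfectly accurate on the training data...
assert all(algebraic_theory(x) == y for x, y in train.items())
assert all(lookup_theory(x) == y for x, y in train.items())

# ...but they diverge on unseen data: at x = 4, one predicts 16,
# the other 17. Nothing "strictly encapsulated in the training data"
# decides between them.
```

The lookup theory looks perverse, but that judgment is itself an inductive bias, not something the data forces on us.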
The NFL theorems show that there is no way to avoid inductive bias, and there is no way to justify one inductive bias over another, without making assumptions about the world. The lesson I draw from this is that, instead of asking how to measure the complexity/simplicity of a theory, we should ask which inductive biases tend to work well in this particular universe in which we find ourselves. Simplicity is just one more example of an inductive bias, and there is no reason to prefer it to any other inductive bias, unless empirical evidence supports its value. Webb’s experiments suggest that simplicity is not a particularly good bias for our world. The success of support vector machines, on the other hand, suggests that maximum margin is a good bias for our world.

In my experience with machine learning algorithms, on a wide variety of problems, the choice of features in the feature vectors is far more important than the choice of learning algorithm. That is, the most important issue is representation, and the inductive bias due to a given representation. When applying decision tree induction to chess endgame classification, Quinlan reported that the most difficult part of the problem was finding the right features to represent the data. The initial attempt was a feature vector with 64 elements, one for each square of the board. This low-level representation was useless. The final features were much higher-level. This kind of experience is rarely written down, but it is often discussed informally, in hallways at machine learning conferences, for example.

I believe that it is not fruitful to think of the representation problem in terms of Ockham’s razor. I have nothing against Bayesian inference, for example. It is often a useful tool. But there are those who seem to believe that all problems can be solved using a Bayesian approach. I disagree. You can’t apply Bayesian inference until you figure out how to represent the problem.
But figuring out how to represent the problem is 95% of the work. By the time you have the representation right, the tool that you use to finish the remaining 5% is not terribly important.
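A toy version of the representation point, in plain Python (the data, feature, and function names are my own illustration, not Quinlan’s chess features): label points as inside (1) or outside (0) the unit circle. No threshold on a raw coordinate separates the classes, but one hand-crafted feature, x² + y², makes a trivial threshold rule perfect. The feature, not the learning algorithm, does the work.

```python
data = [((0.1, 0.2), 1), ((-0.3, 0.1), 1), ((0.2, -0.2), 1),   # inside
        ((1.5, 0.0), 0), ((0.0, -1.4), 0), ((1.0, 1.0), 0)]    # outside

def radial_rule(point, t=1.0):
    """Predict 1 (inside) when the engineered feature x**2 + y**2 is <= t."""
    x, y = point
    return 1 if x * x + y * y <= t else 0

def best_axis_accuracy(data):
    """Best accuracy achievable by thresholding a single raw coordinate."""
    best = 0
    for axis in (0, 1):
        for p, _ in data:                    # candidate thresholds at data points
            t = p[axis]
            for inside_if_small in (True, False):
                correct = sum(
                    1 for q, label in data
                    if ((q[axis] <= t) == inside_if_small) == (label == 1)
                )
                best = max(best, correct)
    return best / len(data)

# The engineered feature classifies every point correctly,
# while the best raw-coordinate threshold cannot reach 100%.
assert all(radial_rule(p) == label for p, label in data)
assert best_axis_accuracy(data) < 1.0
```

With the right representation, the remaining "learning" is a one-line threshold; with the wrong one, no amount of algorithmic cleverness in this hypothesis class suffices.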