## What, Exactly, Is Probability?

“Probability is the bane of the age,” said Moreland, now warming up. “Every Tom, Dick, and Harry thinks he knows what is probable. The fact is most people have not the smallest idea what is going on round them. Their conclusions about life are based on utterly irrelevant – and usually inaccurate – premises.”

Anthony Powell, “Casanova’s Chinese Restaurant”, in 2nd Movement in A Dance to the Music of Time, University of Chicago Press, 1995

Because many events can’t be predicted with total certainty, often the best we can do is say what the probability is that an event will occur – that is, how likely it is to happen. The probability that a particular event (or set of events) will occur is expressed on a linear scale from 0 (impossibility) to 1 (certainty), or as a percentage between 0 and 100%.

The analysis of events governed by probability is called statistics, a branch of mathematics that studies the possible outcomes of given events together with their relative likelihoods and distributions. It is one of the last major areas of mathematics to be developed, with its beginnings usually dated to correspondence between the mathematicians Blaise Pascal and Pierre de Fermat in the 1650s concerning certain problems that arose from gambling.

Chevalier de Méré, a French nobleman with an interest in gaming and gambling questions, called Pascal’s attention to an apparent contradiction concerning a popular dice game that consisted in throwing a pair of dice 24 times. The problem was to decide whether or not to bet even money on the occurrence of at least one “double six” during the 24 throws. A seemingly well-established gambling rule led de Méré to believe that betting on a double six in 24 throws would be profitable, but his own calculations indicated just the opposite. This problem (as well as others posed by de Méré) led to the correspondence in which the fundamental principles of probability theory were formulated for the first time.
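The arithmetic behind de Méré’s puzzle is short enough to check directly. The sketch below assumes fair dice and independent throws, and uses the standard modern calculation (one minus the probability of no double six at all) rather than anything taken from the original correspondence. The function name is purely illustrative.

```python
from fractions import Fraction

def prob_at_least_one(p_single: Fraction, throws: int) -> Fraction:
    """Chance of at least one success in `throws` independent attempts."""
    return 1 - (1 - p_single) ** throws

# A double six has probability 1/36 on any single throw of two fair dice.
p_double_six = Fraction(1, 36)

p24 = prob_at_least_one(p_double_six, 24)
p25 = prob_at_least_one(p_double_six, 25)

print(float(p24))  # a little under 0.5, so the even-money bet loses slightly
print(float(p25))  # a little over 0.5
```

With 24 throws the chance is just below one half, which is why the even-money bet is unprofitable; one extra throw would tip it the other way.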

Statistics is routinely used in every social and natural science. It is making inroads in law and in the humanities. It has been so successful as a discipline that most research is not regarded as legitimate without it. It’s also used in a wide variety of practical tasks. Physicians rely on computer programs that use probabilistic methods to interpret the results of some medical tests. Construction workers use a chart based on probability theory when mixing the concrete for the foundation of buildings, and tax assessors use a statistical package to decide how much a house is worth.

While there are a number of forms of statistical analysis, the two dominant forms are Frequentist and Bayesian.

Bayesian analysis is the older form, and focuses on P(H|D) – the probability (P) of the hypothesis (H), given the data (D). This approach treats the data as fixed (these are the only data you have) and hypotheses as random (the hypothesis might be true or false, with some probability between 0 and 1). This approach is called Bayesian because it uses Bayes’ Theorem to calculate P(H|D).
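For reference, Bayes’ Theorem can be written in the same notation. Here P(H) is the prior probability of the hypothesis before seeing the data, and P(D) is the overall probability of the data under all hypotheses being considered; these two labels are the usual textbook ones rather than terms introduced above.

```latex
P(H \mid D) = \frac{P(D \mid H)\, P(H)}{P(D)}
```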

The conceptual framework for Bayes’ Theorem was developed by the Reverend Thomas Bayes, and published posthumously in 1764. It was perfected and advanced by French physicist Pierre-Simon Laplace, who gave it its modern mathematical form and scientific application. Bayes’ theorem has a 250-year history, and the method of inverse probability that was developed from it dominated statistical thinking into the twentieth century.

For the Bayesian:

• Probability is subjective – a measurement of the degree of belief that an event will occur – and can be applied to single events based on degree of confidence or beliefs. For example, a Bayesian can refer to tomorrow’s weather as having a 50% chance of rain.

• Parameters are random variables that have a given distribution, and other probability statements can be made about them.

• Probability has a distribution over the parameters, and point estimates are usually done by either taking the mode or the mean of the distribution.

A Bayesian basically says, “I don’t know how the world is. All I have to go on is finite data. So I’ll use statistics to infer something from those data about how probable different possible states of the world are.”

Frequentist (sometimes called “a posteriori”, “empirical”, or “classical”) analysis focuses on P(D|H), the probability (P) of the data (D), given the hypothesis (H). That is, this approach treats data as random (if you repeated the study, the data might come out differently), and hypotheses as fixed (the hypothesis is either true or false, and so has a probability of either 1 or 0; you just don’t know for sure which it is). This approach is called frequentist because it’s concerned with the frequency with which one expects to observe the data, given some hypothesis about the world.

Frequentist statistical analysis is associated with Sir Ronald Fisher (who created the null hypothesis and p-values as evidence against the null), Jerzy Neyman (who was the first to introduce the modern concept of a confidence interval in hypothesis testing) and Egon Pearson (who with Neyman developed the concept of Type I and II errors, power, alternative hypotheses, and deciding to reject or not reject based on an alpha level). They use the relative frequency concept – you must perform one experiment lots of times and measure the proportion where you get a positive result.

For the Frequentist:

• Probability is objective and refers to the limit of an event’s relative frequency in a large number of trials. For example, a coin with a 50% probability of heads will turn up heads close to 50% of the time over a long run of flips.

• Parameters are all fixed and unknown constants.

• Any statistical procedure is interpreted only in terms of limiting frequencies. For example, if a study were repeated many times, 95% of the 95% confidence intervals constructed for a given parameter would contain the true value of the parameter.

• Referring to tomorrow’s weather as having a 50% chance of rain would not make sense to a Frequentist because tomorrow is just one unique event, and cannot be referred to as a relative frequency in a large number of trials. But they could say that 70% of days in April are rainy in Seattle.
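The confidence-interval bullet above is easiest to see by simulation. The following is a minimal sketch, not a general-purpose routine: it assumes normally distributed data with a known standard deviation (so the simple 1.96 multiplier applies), and the true mean, sample size, and number of repetitions are arbitrary illustrative values.

```python
import math
import random

def coverage(true_mean=10.0, sigma=2.0, n=30, trials=2000, seed=0):
    """Fraction of repeated studies whose 95% interval contains the true mean."""
    rng = random.Random(seed)
    # Half-width of a 95% interval for the mean when sigma is known.
    half_width = 1.96 * sigma / math.sqrt(n)
    hits = 0
    for _ in range(trials):
        sample_mean = sum(rng.gauss(true_mean, sigma) for _ in range(n)) / n
        if sample_mean - half_width <= true_mean <= sample_mean + half_width:
            hits += 1
    return hits / trials

print(coverage())  # should land close to 0.95
```

Note what is random here: the parameter stays fixed at `true_mean` throughout, and it is the interval that changes from one repetition to the next.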

A Frequentist basically says, “The world is a certain way, but I don’t know how it is. Further, I can’t necessarily tell how the world is just by collecting data, because data are always finite and noisy. So I’ll use statistics to line up the alternative possibilities, and see which ones the data more or less rule out.”

Frequentist and Bayesian approaches represent deeply conflicting approaches with deeply conflicting goals. Perhaps the most important conflict has to do with alternative interpretations of what “probability” means. These alternative interpretations arise because it often doesn’t make sense to talk about the frequency of possible states of the world. For instance, there’s either life on Mars, or there’s not.

We don’t know for sure which it is, but we can say with certainty that it’s one or the other. So if you insist on putting a number on the probability of life on Mars (i.e. the probability that the hypothesis “There is life on Mars” is true), you are forced to drop the Frequentist interpretation of probability. A Frequentist interprets the word “probability” as meaning “the frequency with which something would happen in a lengthy series of trials”.

The Bayesian interprets the word “probability” as “subjective degree of belief” – the probability that you (personally) attach to a hypothesis is a measure of how strongly you (personally) believe that hypothesis. So a Frequentist would never say “There’s probably not life on Mars”, unless they were speaking loosely and using that phrase as shorthand for “The data are inconsistent with the hypothesis of life on Mars”. But the Bayesian would say “There’s probably not life on Mars”, not as a loose way of speaking about Mars, but as a very literal and precise way of speaking about their beliefs about Mars. A lot of the choice between Frequentist and Bayesian statistics comes down to whether you think science should comprise statements about the world, or statements about our beliefs.

Let’s look at the simple task of flipping a coin. The flip of a fair coin has no memory, or as mathematicians would say, each flip is independent. Even if by chance the coin comes up heads ten times in a row, the probability of getting heads or tails on the next flip is precisely equal. You may believe that, because a flipped coin has come up heads ten times in a row, “tails is way overdue”, but the coin doesn’t know and doesn’t care about the last ten flips; the next flip is just as likely to be the eleventh head in a row as the tail that breaks the streak. The probability that the flip of a fair coin will come up heads, then, is 50%, and the same is true of tails.

But what, exactly, do we mean when we say that the probability is 50%? A Frequentist would say that if the probability of landing on either side is 50%, this means that if we were to repeat the experiment of flipping the coin a large number of times, we would expect to see approximately the same number of heads as tails. That is, the ratio of heads to tails will approach 1:1 as we flip the coin more and more times.
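The long-run reading of “50%” can be illustrated with a simulated coin. This is only a sketch: it assumes Python’s pseudo-random generator is a reasonable stand-in for a fair coin, and the flip counts are arbitrary.

```python
import random

def heads_fraction(flips, seed=0):
    """Simulate `flips` tosses of a fair coin and return the share of heads."""
    rng = random.Random(seed)
    heads = sum(rng.random() < 0.5 for _ in range(flips))
    return heads / flips

# The share of heads wanders for small samples and settles near 0.5 for large ones.
for n in (10, 100, 10_000, 1_000_000):
    print(n, heads_fraction(n))
```

Short runs can stray well away from one half; it is only the long-run fraction that the Frequentist definition refers to.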

In contrast, a Bayesian would say that probability is a very personal opinion. What probability of 50% means to you is different from what it might mean to me. If pressed to place a bet on the outcome of flipping a single coin, you would just as well guess heads or tails. More generally, if you were to bet on the flip of a coin and were told that the probability of either side coming up was 50%, and the rewards for guessing correctly on any outcome are equal, then it would make no difference to you what side of the coin you bet on.

Both approaches are addressing the same fundamental problem (what are the odds that flipping a coin will result in it landing heads up), but attack the problem in reverse orders (the probability of getting data, given a model, versus the probability of a model, given some data). It’s quite common to get the same basic result out of both methods, but many will argue that the Bayesian approach more closely relates to the fundamental problem in science (we have some data, and we want to infer the most likely truth).

So, which approach is best? The Frequentist position would seem to be the answer. In our coin-flipping example, the probability of a fair coin landing heads is 50% because it lands heads half the time. Defining probability in terms of frequency seems to be the empirical thing to do. After all, frequency is “real”. It isn’t metaphysical, like “degree of certainty,” or “degree of warranted belief.” You can go out and observe it.

However, the Frequentist position also has some significant problems. First, it requires the long run relative frequency interpretation of probability – that is, the limiting frequency with which that outcome appears in a long series of similar events. Dice, coins and shuffled playing cards can be used to generate random variables; therefore, they have a frequency distribution, and the frequency definition of probability theory can be used. Unfortunately, the frequency interpretation can only be used in cases such as these. Another problem is that almost all prior information is ignored, and it doesn’t allow you to incorporate what you already know. Even more seriously, a hypothesis that may be true may be rejected because it hasn’t predicted observable results that have not occurred.

But the Bayesian position has its own set of problems. First, Bayesian calculations almost invariably require integrations over uncertain parameters, making them computationally difficult. Second, Bayesian methods require specifying prior probability distributions, which are often themselves unknown. Bayesian analyses generally assume so-called “uninformative” (often uniform) priors in such cases. But such assumptions may or may not be valid, and more importantly, it may not be possible to determine their validity with any degree of certainty.

Finally, though Bayes’ theorem is trivially true for random variables X and Y, it’s not clear that parameters or hypotheses should be treated as random variables. It’s accepted that you can talk about the probability of observed data given a model – the frequency with which you would obtain those data in the limit of infinite trials. But if you talk about the “probability” of a one-time, non-repeatable event that is either true or false, there is no frequency interpretation.

While both approaches have their (often rabid) proponents, I would argue that the approach you take depends on the question (or questions) you’re asking. Let’s take the hypothetical case of a patient you want to perform a test on.

You know the patient is either healthy (H) or sick (S). Once you perform the test, the result will either be Positive (+) or Negative (-). Now, let’s assume that if the patient is sick, they will always get a Positive result. We’ll call this the Correct (C) result. If the patient is healthy, the test will be negative 95% of the time, but there will be some false positives. In other words, the probability of the test being Correct, for healthy people, is 95%. So the test is either 100% accurate or 95% accurate, depending on whether the patient is sick or healthy. Taken together, this means the test is at least 95% accurate.

These are the statements that would be made by a Frequentist. The statements are simple to understand and are demonstrably true. But what if we ask a more difficult, and arguably a more useful question – given the test result, what can you learn about the health of the patient?

If you get a negative test result, the patient is obviously healthy, as there are no false negatives. But what if the test is positive? Was the test positive because the patient was actually sick, or was it a false positive? This is where the Frequentist and the Bayesian diverge. Everybody will agree that this cannot be answered at the moment. The Frequentist will refuse to answer. The Bayesian will be prepared to give you an answer, but you’ll have to give the Bayesian a prior first – i.e. tell them what proportion of the patients are sick.

If you are satisfied with statements such as “for healthy patients, the test is very accurate” and “for sick patients, the test is very accurate”, the Frequentist approach is best. But for the question “for those patients that got a positive test result, how accurate is the test?”, a Bayesian approach is required.
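To make the role of the prior concrete, here is a small sketch of the Bayesian calculation for the test described above. The 100% and 95% figures come from the example; the prior proportions of sick patients are hypothetical values chosen only to show how strongly the answer depends on them, and the function and parameter names are illustrative.

```python
def prob_sick_given_positive(prior_sick, sensitivity=1.0, specificity=0.95):
    """P(S | +) via Bayes' theorem.

    sensitivity = P(+ | S), which is 1.0 in the example (no false negatives)
    specificity = P(- | H), which is 0.95 in the example
    prior_sick  = P(S), a hypothetical proportion of sick patients
    """
    true_positives = sensitivity * prior_sick
    false_positives = (1 - specificity) * (1 - prior_sick)
    return true_positives / (true_positives + false_positives)

# Hypothetical priors: the rarer the illness, the less a positive result tells you.
for prior in (0.001, 0.01, 0.1, 0.5):
    print(prior, round(prob_sick_given_positive(prior), 3))
```

If only 1% of patients are sick, a positive result from this “at least 95% accurate” test still leaves the patient more likely healthy than sick (about a 17% chance of being sick), whereas with a 50% prior the same result implies about a 95% chance.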

A good exposition, but I differ on some fundamental points. I think very few people even in the sciences fully grasp the underlying basis of probability, including myself up until reading some of the work of a brilliant guy named E.T. Jaynes, who you might want to check out.

But to the point, there is really only one meaning to probability. It is the degree of belief about outcomes which can only be verified via statistical measurement. Bayes’ Theorem applies to any probability values, regardless of whether they were derived by ad hoc belief or statistical experiment. The two camps that you have described are really just two different ways of assigning probability – either via a theoretical model or via experimental data. Once you have the values, Bayes’ Theorem applies from there as just a transformation between a priori and a posteriori. As you mention, the statistical method is not always available. In the absence of data, the best guess of probability given a particular model of the world is the one that maximizes statistical entropy. Jaynes works with this a lot and calls it the “maximum entropy method”. But it is really just another way to assign a priori probability in the absence of data. Even though a statistical method of estimating probability does not require a theoretical model, it is still only a degree of belief about future outcomes which may or may not be correct.

Referring to your last example, there is no reason why a Frequentist can’t answer the question using Bayes’ Theorem and still remain a Frequentist. Conversely, the 95% accuracy of the test is assumed true by both parties, whether it was derived from past trials, or derived from some model of how the test operates. The required information is the probability of being sick. The Frequentist might obtain this information by sampling the population. The “other” guy might come up with a model to predict the probability of being sick based on something like how the virus interacts with the human body or whatever and then using the maximum entropy method. The difference is not the use of Bayes’ Theorem, but in how you determine the a priori probability.