Milk, Tea, and Statistics: The Birth of Hypothesis Testing

The following is an excerpt from an article^† written by Carl Zimmer detailing an important moment in the history of the development of statistics:

People often think that the job of scientists is to prove a hypothesis is true—the existence of electrons, for example, or the ability of a drug to cure cancer. But very often, scientists do the reverse: They set out to disprove a hypothesis.

It took many decades for scientists to develop this method, but one afternoon in the early 1920s looms large in its history. At an agricultural research station in England, three scientists took a break for tea. A statistician named Ronald Fisher poured a cup and offered it to his colleague, Muriel Bristol.

Bristol declined it. She much preferred the taste of a cup into which the milk had been poured first.

“Nonsense,” Fisher reportedly said. “Surely it makes no difference.”

But Bristol was adamant. She maintained that she could tell the difference.

The third scientist in the conversation, William Roach, suggested that they run an experiment. (This may have actually been a moment of scientific flirtation: Roach and Bristol married in 1923.) But how to test Bristol’s claim? The simplest thing that Fisher and Roach could have done was pour a cup of tea out of her sight, hand it to her to sip, and then let her guess how it was prepared.

If Bristol got the answer right, however, that would not necessarily be proof that she had an eerie perception of tea. With a 50 percent chance of being right, she might easily answer correctly by chance alone.

Several years later, in his 1935 book The Design of Experiments, Fisher described how to test such a claim. Instead of trying to prove that Bristol could tell the difference between the cups of tea, he would try to reject the hypothesis that her choices were random. “We may speak of this hypothesis as the ‘null hypothesis,’ ” Fisher wrote. “The null hypothesis is never proved or established, but is possibly disproved, in the course of experimentation. Every experiment may be said to exist only in order to give the facts a chance of disproving the null hypothesis.”

Fisher sketched out a way to reject the null hypothesis—that Bristol’s choices were random. He would prepare eight cups, putting milk first into four of them, and milk second into the other four. He would scramble the cups into a random order and offer them to Bristol to sip, one at a time. She would then divide them into two groups—the cups that she believed had received milk first would go in one group, milk second in the other.

Bristol reportedly passed the test with flying colors, correctly identifying all eight cups. Thanks to the design of Fisher’s experiment, the odds that she would divide eight cups into two groups correctly by chance were small. There were 70 different possible ways to divide eight cups into two groups of four, which meant that Bristol could identify the cups correctly by chance only once out of every 70 trials.
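
The figure of 70 can be checked by brute force. The following is a minimal sketch, assuming the cups are labelled 0 through 7 and that a guesser is equally likely to pick any four of them as "milk first"; the labels and variable names are illustrative and do not come from the article.

```python
from itertools import combinations
from math import comb

cups = range(8)
# Hypothetical labelling: suppose cups 0-3 really had milk poured first.
truly_milk_first = frozenset({0, 1, 2, 3})

# Every way a taster could choose 4 of the 8 cups as "milk first".
possible_picks = [frozenset(c) for c in combinations(cups, 4)]

n_ways = len(possible_picks)
n_all_correct = sum(pick == truly_milk_first for pick in possible_picks)
p_all_correct = n_all_correct / n_ways

print(n_ways, comb(8, 4))   # enumeration and the binomial coefficient agree
print(n_all_correct, p_all_correct)
```

Only one of the enumerated selections matches the true arrangement, which is where the "once out of every 70 trials" comes from.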

Fisher’s test couldn’t completely eliminate the possibility that Bristol was guessing. It just meant that a result this good would be unlikely if she were. He could have reduced the odds of a lucky perfect score further by having Bristol drink more tea, but he could never reduce them to zero.
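
How quickly the chance of a perfect score by pure guessing shrinks as cups are added can be sketched as follows. This assumes the same design as above (half the cups milk first, half milk second, and the taster knows the split); the function name is hypothetical.

```python
from math import comb

def chance_of_perfect_guess(n_cups):
    """Probability of sorting n_cups perfectly by pure guessing,
    when exactly half of the cups had milk poured first."""
    return 1 / comb(n_cups, n_cups // 2)

for n in (2, 4, 6, 8, 10, 12):
    print(f"{n:2d} cups: {chance_of_perfect_guess(n):.5f}")
```

The probability keeps falling as the number of cups grows, but it stays above zero for any finite number of cups.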

Since absolute proof was impossible, Fisher preferred to be practical when he ran experiments. At the lab where he and Bristol worked, Fisher was charged with analyzing decades of collected data to determine whether that information could divine details, like the best recipe for crop fertilizer. Scientists could use that data to design ever larger experiments with increasingly more accurate results. Fisher thought it would be pointless to design an experiment that needed centuries to yield results. At some point, Fisher believed, scientists had to just call it a day.

He believed that a sensible threshold was 5 percent. If we assumed that the null hypothesis was true and found that the odds of observing the data were less than 5 percent, then we could safely reject it. In Bristol’s case, the odds were comfortably below Fisher’s threshold, at just 1.4 percent.
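
The comparison against the threshold can be written out as a short calculation. This is a sketch under the eight-cup design described above, with guessing modelled as choosing four cups at random; `ALPHA` and `p_value` are illustrative names. The three-out-of-four case is not discussed in the excerpt and is included only for contrast.

```python
from math import comb

ALPHA = 0.05  # the 5 percent threshold

def p_value(k_correct, n_each=4):
    """Chance, under pure guessing, of correctly picking at least
    k_correct of the n_each milk-first cups out of 2 * n_each cups."""
    total = comb(2 * n_each, n_each)
    favourable = sum(
        comb(n_each, k) * comb(n_each, n_each - k)
        for k in range(k_correct, n_each + 1)
    )
    return favourable / total

p_perfect = p_value(4)  # all four milk-first cups identified
p_three = p_value(3)    # at least three of the four identified

print(f"perfect score: {p_perfect:.3f}, reject null: {p_perfect < ALPHA}")
print(f"three or more: {p_three:.3f}, reject null: {p_three < ALPHA}")
```

A perfect score gives 1/70, about 1.4 percent, which falls below the threshold. Under the same assumptions, getting at least three of the four milk-first cups right would happen by chance far too often to reject the null hypothesis.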

Thanks in large part to Fisher, the null hypothesis has become an important tool for scientific discovery. You can find tests of null hypotheses in every branch of science, from psychology to virology to cosmology. And scientists have followed Fisher in using a 5-percent threshold.

^†"Why We Can't Rule Out Bigfoot: How the null hypothesis keeps the hairy hominid alive." by Carl Zimmer