## Tuesday, September 01, 2015

### Blogger's Desk #8 - Statistics in Layman's Language

Greetings,

How often have you had to deal with mathematics in biology? I have often heard from my students and some friends that they could never understand how statistics is used. The concept of the P value has become so powerful that journals reject papers whose data don't fall on the good side of it. There is an excellent article in Nature on how we are overusing the P value (Link). I thought it would be a good idea to give a short account of how it works. I will try to keep it as simple as possible.

One of the major problem areas in biology today is biostatistics. Biostatistics is the application of statistics to the life sciences, and much of it comes down to estimating the probability of a particular event. You could ask what this fuss about probability is. Why can’t we be absolutely sure? I have previously talked about probability in science (Link), and if you haven't read it, you should, to follow the idea in the paragraphs below.

Suppose you want to estimate the number of cars in your community. You know from background information that there are about 500 families in it. Since it is not possible for you to go and check every house, you decide on a simple way: you will check about 50 families and extrapolate the results (not a bad idea at all). Let’s call these 50 families the sample population. So you start, and find that every family has exactly one car. No more, no less. You can comfortably decide that there are 500 cars in the community, since everybody has one. (Of course, there is a chance that the 50 families you chose are outliers and everybody else has 2 cars. But for the moment assume there is no such deviation.) Let’s call this kind of difference between families variability. In an ideal case where there is no variation, checking even a single family would be sufficient to quantify the entire community.

Now assume that of the 50 families you surveyed, 45 had one car, 3 had 2 cars and 2 had no car. That means the total for 50 families is 45+6+0=51 cars. Scaling up gives about 510 cars for 500 families, but predicting that total with full confidence is now more difficult. The 3 families with 2 cars may be the only ones in the entire community to have 2 cars, or the same variation may exist in every set of 50 families. It is also possible that in the other 450 families everybody has 4 cars. The variation that has crept in has reduced your confidence in the absolute count.
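
For readers who like to see the arithmetic spelled out, here is a minimal Python sketch of the extrapolation in the car example. The function name `extrapolate_total` and the dictionary layout are my own choices for illustration; the numbers are the ones from the survey above.

```python
def extrapolate_total(cars_per_family_counts, community_size):
    """Scale the average cars per surveyed family up to the whole community.

    cars_per_family_counts maps "cars owned" -> "number of surveyed families".
    """
    families_surveyed = sum(cars_per_family_counts.values())
    cars_counted = sum(cars * n for cars, n in cars_per_family_counts.items())
    mean_cars = cars_counted / families_surveyed
    return cars_counted, mean_cars, mean_cars * community_size


# Survey from the text: 45 families with 1 car, 3 with 2 cars, 2 with none.
survey = {1: 45, 2: 3, 0: 2}
cars_counted, mean_cars, estimate = extrapolate_total(survey, community_size=500)

# 51 cars counted, about 1.02 cars per family, about 510 cars for 500 families.
print(cars_counted, mean_cars, estimate)
```

The estimate is only as good as the assumption that the other 450 families look like the 50 you visited, which is exactly the worry raised above.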

Here is where statistics comes in. The first issue is to avoid biased sampling. You don’t pick your sample families just like that. To ensure that you capture the variation in the community, you select the families at random. Random selection ensures a good probability that all kinds of families are picked up. Now, using a mathematical trick, you can predict that any other group of 50 families you pick will look almost the same. So even with variation, sampling 50 families can get you close to complete confidence about the total number of cars. But you still cannot give an absolute count with total confidence. We will not delve into the mathematics of it. The summary is that you estimate the probability that your count is close to the actual value.
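
If you would rather see this than take it on faith, a small simulation helps. The sketch below builds a made-up community of 500 families (the split of 450, 30 and 20 is an assumption for illustration, not survey data), draws many random samples of 50, and extrapolates a total from each one. The helper `simulate_estimates` is hypothetical, written only for this post.

```python
import random
import statistics


def simulate_estimates(community, sample_size, repeats, seed=0):
    """Draw `repeats` random samples and extrapolate a total from each."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(repeats):
        sample = rng.sample(community, sample_size)  # without replacement
        estimates.append(statistics.mean(sample) * len(community))
    return estimates


# Hypothetical community of 500 families (made-up numbers, not from a survey):
# 450 families with 1 car, 30 with 2 cars, 20 with none.
community = [1] * 450 + [2] * 30 + [0] * 20
true_total = sum(community)

estimates = simulate_estimates(community, sample_size=50, repeats=1000)
print("true total:", true_total)
print("typical estimate:", round(statistics.mean(estimates)))
print("spread (standard deviation):", round(statistics.stdev(estimates), 1))
```

Under these assumptions the estimates cluster around the true total, but no single sample is guaranteed to hit it. That spread is the "not total confidence" part.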

Now there are a couple of questions here. First, why sample? Why not just go to all the families and count the cars? Second, how do you know that randomly picking 50 samples will capture all the types that vary?

Estimating the total number of cars for 500 families is an over-simplified example. Even in this case, imagine yourself going to every family and counting the cars. The whole process takes much more time, sometimes the family isn’t there for you to ask, some may lie to you, and so on. There will be some families who don’t want to see your face. Experiments have similar problems. You cannot test every person that exists for a particular parameter.

The answer to the second question is a bit more mathematical. To understand it, play this simple game. Make 100 blank cards that cannot be told apart when turned face down. Label 90 cards with the letter “A”, 5 cards with the letter “B” and the remaining 5 cards with the letter “C”. Ask a person to shuffle them and lay them neatly on the table without any pattern. I’m bringing in the randomness here. Now pick up 20 cards at random. On average you will pick up 18 A cards, 1 B card and 1 C card. I said on average. It needn’t happen every time. This is probability. The different cards are the variation. Suppose you also had D and E cards; you would have to pick more cards to see them. There is a mathematical way of predicting how many cards you should draw from the set to be able to see all the different cards. To boil it down to a single sentence: the more types you have, the more cards you have to draw. The same applies to the scientific process.
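
The card game can be worked out exactly with a little counting. The sketch below is one way to do it in Python, using the 90/5/5 deck and the 20-card draw from the game; the helper names `expected_counts` and `prob_none_of` are mine, not standard functions.

```python
from math import comb

DECK = {"A": 90, "B": 5, "C": 5}   # card counts from the text
DRAWS = 20
TOTAL = sum(DECK.values())          # 100 cards


def expected_counts(deck, draws):
    """Average number of each letter in a random hand of `draws` cards."""
    total = sum(deck.values())
    return {letter: draws * n / total for letter, n in deck.items()}


def prob_none_of(cards_excluded, total, draws):
    """Chance that a hand contains none of a group of `cards_excluded` cards."""
    return comb(total - cards_excluded, draws) / comb(total, draws)


# Chance of drawing exactly 18 A, 1 B and 1 C.
p_exact = comb(90, 18) * comb(5, 1) * comb(5, 1) / comb(TOTAL, DRAWS)

# Chance of seeing every letter at least once (inclusion-exclusion on B and C;
# A cannot be missed, because only 10 cards are not A and we draw 20).
p_no_b = prob_none_of(5, TOTAL, DRAWS)
p_no_c = prob_none_of(5, TOTAL, DRAWS)
p_no_b_no_c = prob_none_of(10, TOTAL, DRAWS)
p_all_letters = 1 - p_no_b - p_no_c + p_no_b_no_c

print(expected_counts(DECK, DRAWS))
print(round(p_exact, 3), round(p_all_letters, 3))
```

By this calculation, 18 A, 1 B and 1 C is the average hand rather than a near-certainty: the exact 18/1/1 split comes up in fewer than one in five draws, and all three letters show up in a little under half of them. Rare types are easy to miss in a small sample.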

Fig 1: P-value.

Now let us add a puzzle. Suppose I have not told you how many types of cards there are, and you have to decide how many cards to pick up to find every type at least once. Further, you cannot draw all the cards. How will you do it? In practice, this is the situation encountered when studies are conducted on a preliminary basis. You have no idea what the unknown unknowns are. But if I give you 2 chances, you can probably achieve it. In the first round, you do something called a pilot survey. You pick about 10 cards at random. If all the cards are the same, there is a pretty good chance that the variation is not too high. If there were many different variants, you would most likely have picked up more than one. Let’s say you got 9 A cards and 1 B card. That means there are at least 2 variants, and probably more. But because 9 of the ten cards were A, there is very little chance that there are a lot of B or even C cards. Based on this, you can estimate the sample size required, that is, how many cards you need to pick up in the second round to get a fair picture of the set.
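
One simple way to turn a pilot result into a sample size is sketched below. It is not the only method, and it leans on a simplifying assumption that I am adding here: each draw is independent, as if the deck were very large or each card were put back. The function `draws_needed` and the 95% target are illustrative choices, not something fixed by the game.

```python
import math


def draws_needed(frequency, confidence=0.95):
    """Smallest number of draws that shows a card type at least once
    with probability >= `confidence`, if that type makes up `frequency`
    of the deck. Assumes independent draws (a large or reshuffled deck)."""
    if not 0 < frequency < 1 or not 0 < confidence < 1:
        raise ValueError("frequency and confidence must be between 0 and 1")
    # P(miss in n draws) = (1 - frequency) ** n, so solve for n.
    return math.ceil(math.log(1 - confidence) / math.log(1 - frequency))


# Pilot from the text: 1 B card in 10 draws, so B looks like roughly 10%.
pilot_b_frequency = 1 / 10
print(draws_needed(pilot_b_frequency))       # draws to see a ~10% type
print(draws_needed(pilot_b_frequency / 2))   # a type half as common needs more
```

The rarer the type you are worried about missing, the more cards the second round needs. A pilot of 10 cards can only give a rough idea of how rare "rare" is.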

As you might have already noted, this is a numbers game. I may be seriously lucky sometimes and get the exact number, or horribly unlucky at other times. How would I know? This is where the golden term of statistics comes in: the “P-value”. In layman’s terms, the P-value estimates how likely it is that you would see a result at least as striking as yours by pure chance, if there were really nothing going on. It is not, strictly, the probability that you are right or wrong. The calculation takes into account the variation and the possible deviations, the sample size, how predictive the sample is and so on, in a single mathematical equation. What most scientists have not been taught is that the P-value is a slippery slope of mathematics. It is an estimate of possibility. The P-value is designed to tell you whether or not you need to give the data a second look.
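
To make this concrete, here is a toy P-value calculation on the pilot draw of 9 A cards out of 10. The question it asks is hypothetical and my own: if the deck were really half A cards, how often would chance alone give 9 or more A cards in 10 independent draws? The function `one_sided_p_value` is a sketch of an exact binomial calculation, not a replacement for a proper statistics package.

```python
from math import comb


def one_sided_p_value(successes, trials, null_probability):
    """Probability of seeing `successes` or more in `trials` independent
    tries, if each try succeeds with probability `null_probability`."""
    return sum(
        comb(trials, k)
        * null_probability**k
        * (1 - null_probability) ** (trials - k)
        for k in range(successes, trials + 1)
    )


# Hypothetical question: if the deck were half A cards, how surprising
# would 9 or more A cards in a 10-card pilot be?
p = one_sided_p_value(successes=9, trials=10, null_probability=0.5)
print(round(p, 4))
```

A small value here says that a half-and-half deck would rarely produce such a lopsided draw, so that assumption deserves a second look. It does not say how many A cards the deck actually holds.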

Motulsky HJ (2015). Common misconceptions about data analysis and statistics. Pharmacology Research & Perspectives, 3 (1). PMID: 25692012

Nuzzo R (2014). Scientific method: Statistical errors. Nature, 506 (7487), 150-152. DOI: 10.1038/506150a