Part 3 of 3
Wednesday May 11, 2022

Measuring impact quantitatively - effect sizes

  • Host
    Alexander Bertram
About this webinar

This webinar is the third in our series "Measuring impact quantitatively". It is a one-hour session ideal for Monitoring and Evaluation professionals who are interested in learning more about measuring impact and using statistics. In this session, we look at statistics for demonstrating impact using data from random samples.

View the presentation slides of the webinar

During this third session, some of the points we cover are:

  • Understanding measures of statistical significance (p-values)
  • Moving from statistical significance (p-values) to measures of effect size and confidence intervals
  • Communicating results with confidence intervals

Is this Webinar for me?

  • Are you an M&E practitioner responsible for designing surveys and data collection tools for your programmes?
  • Do you wish to learn more about working with quantitative data?
  • Do you wish to better understand surveys and how you can demonstrate the impact of your programmes?
  • Are you interested in constantly improving your systems?

Then, watch our Webinar!

About the Speaker

Mr. Alexander Bertram is the Technical Director of BeDataDriven and the founder of ActivityInfo. A graduate of the American University's School of International Service, he started his career in international assistance fifteen years ago with IOM in Kunduz, Afghanistan, and later worked as an Information Management Officer with UNICEF in DR Congo. At UNICEF, frustrated with the time required to build data collection systems for each new programme, he worked on the team that developed ActivityInfo, a simplified platform for M&E data collection. In 2010, he left UNICEF to start BeDataDriven and develop ActivityInfo full time. Since then, he has worked with organizations in more than 50 countries to deploy ActivityInfo for monitoring and evaluation.

Transcript

00:00:00 Introduction

Thank you so much, Fay. I am really excited to see the interest in these quantitative method webinars. I think it's a really challenging and interesting topic and always lots to talk about. Today, what we're going to do is a quick review of the last two webinars in the series because it's been a while. Then, we'll look at two different methods for dealing with uncertainty in our impact results.

The first is statistical significance or hypothesis testing, but then we'll turn to effect size and see how that can often be more useful in understanding the uncertainty related to our impact measurements. We are hopefully going to have lots of time for questions. The last two sessions we had so many great questions, and I wanted to make sure that we had time for those at the end of this session. So, we'll go about 30 minutes, I think, and leave the rest of the hour for questions about this session or any of the previous presentations.

00:01:15 Recap of previous sessions

So, what have we been talking about in this series? We've been talking about impact evaluation, and specifically quantitative impact evaluation. It is not the only way to measure impact, of course, but it is a useful tool. It comes back down to this equation here that we've seen before. It's just a fancy way of looking at the difference. This "Y" is the outcome that we care about. That might be incomes, yields for agricultural programs, or a psychosocial rating for programs working with children. Whatever that number is, we talked about all the different challenges of measuring that.

That is the value without our program—when the program equals zero, when it's not there—and the value of that outcome when our program is there. So we subtract those two things and then we get a delta, a difference. That is the causal impact of our program, the difference that can be attributed to our program specifically.
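Written out in plain notation, the equation being described is simply:

    delta = Y(with program) - Y(without program)

where Y is the outcome we care about and delta is the difference attributed to the program.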

In the first session, we looked at all of the challenges of measuring: turning something that we care about into numbers, which can be very challenging. We also talked about why you would conduct a quantitative impact evaluation at all. That recording is up on our website; there is lots of really interesting discussion there. Then, in the previous session, we looked at causal inference. How can you compare values with and without your program if we don't have a multiverse where we can run experiments on different versions of people? This is what is called the fundamental problem of causal inference. We looked at four strategies for so-called "counterfeit counterfactuals," and specifically, we looked at the risk of just doing before-and-after comparisons.

00:04:00 Statistical significance and hypothesis testing

Today, we're going to be looking at the last step in this process. If you've gotten that difference, that delta, and you want to know: can I trust this number? Do I have enough data to trust this number? A phrase that you've undoubtedly heard is, "Is this statistically significant?" We're going to be looking at that first—what that means and how to determine whether something is statistically significant. But I really want to highlight some of the ways that that can fall short of what we really need and look at some alternatives using effect sizes in the second part of my presentation.

First, we want to talk about what people mean when they say statistical significance. Where does this come from? This statistical significance comes from the process of hypothesis testing. A hypothesis is just a fancy word for a theory. It's something that we think might be true, but we're not sure. This method of statistical testing comes from a tradition started by a scientist named Ronald Fisher and his friend "Student" back in the 1900s. They looked at ways to do this and came up with a process where you start with what they call the null hypothesis—the opposite of what you want to prove.

In our case, if we look at impact evaluation, the null hypothesis is often going to be the opposite of what we want to see: the hypothesis that our program had zero impact, that it changed absolutely nothing. We're going to start with that null hypothesis and try to disprove it. If we can disprove the null hypothesis with data, then we can prove the inverse, the alternate hypothesis, that our program had some impact. Putting this back into the equation, this delta is zero. The outcome with the program and without the program is exactly the same; you subtract them and you get zero. That's what we want to disprove, and we're going to use what's called hypothesis testing or significance testing to do that.

00:06:30 The coin flip analogy

We do this by starting with the assumption that the null hypothesis is true and then doing a survey or data collection. We're going to calculate the chances—the probability—of seeing the data that we end up with if our null hypothesis had been true. It's a bit of tortured logic, so we're going to start with a super simple example and then come back and reapply that to real life.

We're going to use a simple example with a coin flip. We want to prove that a coin is unfair. Sometimes my kids will come to me with magic tricks or say things like, "Hey, if we flip the coin and I get heads, then we can have pizza again tonight." I'll kind of shrug my shoulders and say, "Is this really a fair coin? Is this a coin that I can trust, or have you stuck gum on the other side so it's always going to come up heads?"

Let's say that I want to test whether this coin is fair. I want to prove that it's rigged. So we start with a null hypothesis that the coin is fair, and we run an experiment. We say, "Okay, I'm going to flip the coin three times and see what we get." Lo and behold, all three times the coin comes up heads. The question that we have to ask is: what are the chances of this happening? What is the probability of getting three heads in a row if the coin is fair? Is this something that we could expect? Is this something that is normal or reasonable?

We can come up with a number for this. If you go back to your high school probability math, we have these trees where you toss a coin and each time you toss, you have an equal chance of getting heads or tails if the coin is fair. At the end of three tosses, you have eight possible outcomes, all with the same probability (one out of eight). Heads-heads-heads is one possibility out of eight, which gives us a 12.5% chance. If the coin was fair, we would have a 12.5% chance of getting this result—about one out of eight times we run this experiment.
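If you want to check that 12.5% figure yourself, a couple of lines of R will do it (a minimal sketch; the only assumption is a fair coin with a 50% chance of heads on each flip):

    # Probability of getting heads on all 3 of 3 flips of a fair coin
    dbinom(3, size = 3, prob = 0.5)
    # 0.125, i.e. 12.5%, or one outcome out of eight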

Can we reject the null hypothesis? Can we say, "No, this coin is not fair, no pizza tonight, you guys have done something to this coin"? According to statistics best practices, no, we can't reject this hypothesis because it's not that unusual. If eight of us did this, one of us would get three heads in a row on average. It's not unusual enough to reject this hypothesis; we need more data.

What happens if we flip it eight times and get two tails? This is a 4% probability. We're starting to get more evidence that my kids are trying to deceive me. By some standards, this is accepted. This could happen about 1 out of 20 times. This is starting to meet the standard of statistical significance. But be careful, because if you do 20 surveys in your life and you use this standard, then you could end up being wrong on one of those surveys.

Let's do it again. This time we do it 20 times and we have only one tail. If you do the math, it's a 0.001% chance of getting this. This is pretty low. Most statisticians would say yes, we can reject the null hypothesis. The chances of this happening if the coin was fair are just way too low. This is not something we would expect, so we can safely conclude that the null hypothesis is wrong and our alternate hypothesis—that the coin is unfair—is valid.

00:11:00 Applying significance to impact measurement

What does all of this coin flipping have to do with impact measurement? Let's try and connect these principles back to what we're trying to do. We'll go back to our example from the previous webinar. The idea is that we have a refugee camp, and one of the things that we want to do is stimulate local trade within the camp. To do that, we're going to provide vocational training to the refugees to give them some freedom and dignity to earn their own money. But we're not sure if this is a good idea or if it is effective.

So, we're going to do a quantitative evaluation. We do all the right things: we come up with a way to measure this, we make sure our questionnaire is clear, and we randomly select people to participate in these trainings so that we have a solid basis for causal inference. But we still only have time to interview 50 people in each group.

The result is a difference of plus ten dollars. The group without the training had an average of $50 per month in extra earnings, and the group with the training had $60. That gives us a difference of $10. Can we be sure that this is a real result and not just due to chance? Just like our coin flips, when we randomly select people, we're introducing an element of chance. There's the possibility that this number is just random—that we happened to put into the training group the one person who was already very wealthy, and in the other group, we had the misfortune to talk to 50 people who all got sick.

Just like the coin flips, we want to be able to use some of this basic probability to figure out: what are the chances of getting this result—a difference of $10—if our training program had zero impact on the refugees? We want to be able to demonstrate to ourselves that this is not just randomness, that this is a real impact.

00:13:30 Calculating probability with T-tests

Just like we did with the coins, where we used that decision tree to figure out the chances of getting a specific number of heads or tails, we can use this statistic—the t-test—to calculate the probability of getting this much difference in averages if the null hypothesis were true. Fun fact: "Student" was a pseudonym for a scientist who worked at the Guinness brewery. He was a food scientist interested in getting the best beer out of the processes, so he developed these statistics. You have the Guinness company to thank for this math.

If we were using ActivityInfo for this, we would fire up the mobile app, collect the data, record how much income each person had in the last 30 days, bring this into ActivityInfo, fire up RStudio, connect to ActivityInfo, and run this t-test. R has this built into the base package. We split it up into two groups, training and control, and use a two-sided t-test. We come up with a p-value. R gives it to us in scientific notation; it is very small: 4.7 times 10 to the minus 6. It is super significant. That's enough to demonstrate our alternative hypothesis that the true difference in means is not equal to zero. So we can say that we had an impact.
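A minimal sketch of that analysis in R is below. The data frame and the column names (income, group) and group labels are hypothetical stand-ins for whatever your ActivityInfo form actually contains:

    # One row per respondent: income over the last 30 days plus a group label
    # ("training" or "control"). Names here are illustrative, not from the demo.
    training <- df$income[df$group == "training"]
    control  <- df$income[df$group == "control"]

    # Two-sided t-test of the null hypothesis that the difference in means is zero
    result <- t.test(training, control, alternative = "two.sided")
    result$p.value    # very small, e.g. on the order of 1e-06
    result$conf.int   # 95% confidence interval for the difference in means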

You can also do this with Excel, Google Sheets, or LibreOffice. I exported the data from ActivityInfo and used the T.TEST function. This gives you the p-value to determine the chances of getting these results if our null hypothesis were true. As you can see, written as a percentage, it's very small—less than a thousandth of a percent. So we should be confident that we've had some impact.
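For reference, the spreadsheet version looks something like the formula below, where the cell ranges are hypothetical and assume the training group's incomes sit in A2:A51 and the control group's in B2:B51; the third argument (2) asks for a two-tailed test and the fourth (3) for a two-sample test with unequal variances:

    =T.TEST(A2:A51, B2:B51, 2, 3)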

00:16:30 Drawbacks of hypothesis testing

These are some of the standards that are commonly used for significance testing. Less than 5% is often referred to as statistically significant, but not by all statisticians and not in all domains. Less than 1% is often called highly statistically significant. You'll also often see two asterisks next to the value to indicate this. The gold standard is less than 0.1% (a p-value below 0.001), which gets three stars.

Now I want to start looking at some of the drawbacks of this approach. As I mentioned, there's some disagreement about what is the right standard for treating a result as significant. Some domains, like branches of social sciences, have gone so far as to say we don't want to see this used at all. Basically, "not zero" is a very low standard for impact. Showing that our program has had some effect is maybe not as interesting because it's possible for programs to have very small effects that are not zero.

Let's pretend that we got different results from the survey. Instead of seeing this $10 difference in incomes, we see only one dollar. The control group is earning $50 a month and the treatment group gets $51. If we do the same test with 50 people in each group, we get a p-value of about 0.5. Definitely not statistically significant. If the impact was zero, getting a difference of one dollar or more is pretty likely—it's kind of 50/50.

The interesting thing about hypothesis testing is that the more data we collect, that p-value is going to start to go down. If we collect a thousand responses from both groups, that p-value is going to go real close to zero because a one-dollar difference is still greater than zero. With enough data, even very small differences will become statistically significant. Statistically significant doesn't mean this is a huge change or a huge impact; it means we have enough data to determine that the result was not zero.
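Here is a small simulation sketch in R that makes this concrete: it fabricates two groups whose true means differ by exactly one dollar and shows the p-value shrinking as the sample size grows (all values are made up for illustration):

    set.seed(1)
    for (n in c(50, 200, 1000, 5000)) {
      sim_control  <- rnorm(n, mean = 50, sd = 10)   # true mean $50/month
      sim_training <- rnorm(n, mean = 51, sd = 10)   # true mean $51/month: a $1 effect
      p <- t.test(sim_training, sim_control)$p.value
      cat("n =", n, " p-value =", signif(p, 2), "\n")
    }
    # With enough data, even a one-dollar difference becomes "statistically significant".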

For a program like this, if the impact is that low, we might want to rethink it. Maybe we haven't designed this program very well, or maybe it's not been implemented very well. Especially if it's costing you money to do this vocational training. If it costs you a thousand dollars to train 100 people for a one-dollar difference, that money is better spent somewhere else. Statistically significant does not mean a huge impact; it just means not zero.

00:19:30 Understanding effect sizes

What I want to turn to next is looking at uncertainty regarding the magnitude of change: the size of the effect. This can help us reason about, and make better use of, the uncertainty in our results. Effect size refers to a broad family of statistics. A statistic is just a number, a way of summarizing our data. An average is a statistic, a difference between averages is a statistic, and Pearson's correlation coefficient is a statistic.

The effect size that we've actually been looking at in the previous example was the difference in means. Looking at that average is really important and useful for things where the units make sense. If we're looking at changes in income, we know what that means. Plus ten dollars is very easy to reason about. If you're looking at a change in yield for an agricultural program or the number of students graduating, those are things that make sense to us.

But sometimes we have numbers that don't have meaning in themselves, like a score from a psychological test. For those, we'll look at an alternative statistic: Cohen's d. There are others like odds ratios—you often hear those referred to with things like vaccinations—or Eta squared for categorical variables.

00:21:30 Confidence intervals

Let's start with the difference in means. Why do we need statistics if we already have the difference in means? The statistics, just like the hypothesis testing, help us deal with the uncertainty involved in surveys and sampling. If we only take measurements from 100 people, we don't know the true value; we don't know the true difference in means. We just have an estimate. What effect sizes do, paired with confidence intervals, is help us understand how sure we are about this number.

If we start out with a sample size of 15 (I randomly generated data for this example), we get a confidence interval. The confidence interval gives us an estimate—in this case, a 95% confidence interval—of what we think the difference in the means could be, based on how much data we've collected. Think of it as a good range of possible values.

If we do a sample size of 50, the confidence interval starts to narrow down as we get more data. With a small sample, it's pretty wide; it could be as low as negative two or as high as four dollars. We don't have enough data to know where the true answer falls. As we collect more data—say, a sample size of a thousand—the p-value goes all the way down to zero, but the confidence interval is still telling a pretty consistent story. It's somewhere between an increase of one dollar and maybe three dollars. We still know that the effect size is small.

Let's look at a sample size of 100. A typical confidence interval might show a lower end of negative one dollar and an upper bound of plus four dollars. The way to think about that is: we did a survey and got a $1.67 difference, but in real life, that's just a sample. The real value could be as low as negative one—so we could have made things worse—or we could have made things better by four dollars a month. We don't really know where the value falls; it's somewhere in here. You notice that the confidence interval includes zero, so it could have had no impact. That's why the p-value is not telling us to reject the null hypothesis.

However, even given the uncertainty, this still gives us a fair bit of information. We know that the best we could expect is four dollars. If you imagine that you're spending $5,000 on this training, you could already do an estimate of how many people you would have to train to make this cost-effective. Even with this 50-person survey, I think you would conclude pretty quickly that this is just not worth continuing unless you have volunteers doing it. The impact you're having is not large enough to justify the cost, even given the uncertainty. Confidence intervals on effect sizes can be very useful in thinking about programs.
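As a purely illustrative back-of-the-envelope check in R (the number of trainees and the one-year horizon below are hypothetical assumptions, not figures from the webinar):

    # Best case from the confidence interval: +$4 per person per month
    upper_effect <- 4
    months       <- 12      # hypothetical one-year horizon
    trainees     <- 100     # hypothetical number of people trained
    cost         <- 5000    # total cost of the training, from the example
    best_case_benefit <- upper_effect * months * trainees   # $4,800 per year
    best_case_benefit / cost   # below 1: even the best case barely covers the cost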

Let's compare that to my first scenario where we do the survey and get a result close to ten dollars. The p-value is super low, so it passes the statistical significance test. But we also get a confidence interval, so we can start to work with the likely size of this impact. We're reasonably confident that it's somewhere between six and fourteen dollars. With that range, you can do your cost-benefit analysis. You can see if this is worth the money we're putting in and compare impacts between programs.

00:26:00 Cohen's d statistic

Here we were looking at the difference in the means. That's one measure of effect size, a very simple one. When the thing that you're measuring has units that everybody understands, this is a great tool. Everybody knows what a dollar is. But sometimes we deal with quantities that don't have any intrinsic meaning in their values. For example, measuring psychological resilience. We work with some customers that have programs intended to increase the resilience of young people. They have a measure, a survey tool, used to see whether they're having an impact. But that number is on a scale from zero to five. If you go from 2.5 to 3, is that a big impact? That's hard to reason about.

For that, Cohen's d is a good alternative. It's basically the difference in the means divided by the pooled standard deviation. Cohen came up with a way of pooling the standard deviation from both groups to get a standardized way to compare dollars with kilos or dollars with a psychological scale. The idea is to try to put them all on the same scale.
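A minimal sketch of that formula in R, written out by hand so the pooled standard deviation is visible (the training and control vectors are the same hypothetical income data as in the earlier sketch):

    # Cohen's d = difference in means / pooled standard deviation
    cohens_d_by_hand <- function(x, y) {
      nx <- length(x)
      ny <- length(y)
      pooled_sd <- sqrt(((nx - 1) * var(x) + (ny - 1) * var(y)) / (nx + ny - 2))
      (mean(x) - mean(y)) / pooled_sd
    }
    cohens_d_by_hand(training, control)
    # Cohen's rough benchmarks: about 0.2 is small, 0.5 medium, 0.8 large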

Roughly speaking, the way to think about this is in terms of how much of the variation is explained. Why do some people earn more money and others less? We're going to have people all over the map in both groups. The intuitive way to think about this is that it gives you an idea of how much of that difference your program accounts for versus how much difference is left over. Cohen's d gives you a scale that you can use for any measurement, ranging from a very small impact all the way up to around two. We can compute confidence intervals for this so that we can see the likely range of impacts that our intervention has had.

00:28:30 Demonstration in ActivityInfo and R

Let's try to make this a little bit more concrete. I'll spend just a couple of minutes doing a quick walkthrough on how to do this with ActivityInfo and R. I've got my income survey here that I collected with ActivityInfo's mobile app. I want to see what the differences are between these two groups: the training group and the control group. I'll use the analyze function in ActivityInfo to do a quick report. I can see the same results: $59 from my training group and $49 from the control group.

For statistics, I'm going to connect to R. I can do that from the export menu via the API. It gives me some code that I can copy and paste into RStudio. I'm going to add some code to split my sample into two groups, training and control. I'm just going to use R's built-in t-test here, which gives me my p-value, but it also very helpfully gives me a confidence interval: the 95% confidence interval for the difference, between 6 and 14 dollars.

We can do the same thing with Cohen's d. I have queried my data here from ActivityInfo. The nice thing about the R connection is that if you make any changes or update your data, you can always rerun your analysis. I'm going to load the effectsize package and use the cohens_d function. It gives me an estimate that's independent of the units I'm using, so I can compare this effect size with another experiment even if that other experiment is measured in kilos or on a psychological scale.
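A sketch of that step, again assuming hypothetical column names income and group in the data frame queried from ActivityInfo:

    # install.packages("effectsize")   # once, if the package is not yet installed
    library(effectsize)

    # Standardized effect size with a 95% confidence interval
    cohens_d(income ~ group, data = df)
    # Returns Cohen's d plus CI_low and CI_high, independent of the units measured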

It gives me a confidence interval for this effect size. This is what I think is really useful; it gives you a much more useful way of thinking about uncertainty than the p-value. If we go back and look at the table Cohen proposed, you can see the lower end of my confidence interval is 0.5, which is described as a medium impact, and the upper is 1.3, which could be very large. So I know that I can feel comfortable that my impact is somewhere between medium and very large.

Just a last note I really want to underscore: all of these statistics are for simple random surveys. If you're doing clustering or stratification, do not naively use these statistics. R also has a complex survey package. We did a webinar series on sampling last year that is very useful, but keep in mind that all of this math is for simple random samples.
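If your sample does involve clusters or strata, one option (a sketch, not something demonstrated in this webinar) is R's survey package, which takes the sampling design into account; the camp_block and weight columns here are hypothetical:

    library(survey)   # design-based analysis of complex surveys

    # Describe the sampling design: cluster IDs and sampling weights (illustrative names)
    design <- svydesign(ids = ~camp_block, weights = ~weight, data = df)

    # Design-based t-test of the difference in mean income between the two groups
    svyttest(income ~ group, design)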

00:33:30 Q&A session

Now, let's go to the questions. I see we have lots of people from all over the world in the chat.

Does the confidence interval correspond to the minimum and maximum values obtained from the survey? The answer is actually no. The confidence interval comes from a method called bootstrapping. The idea is that to create a confidence interval for a difference in the mean, we assume for a moment that the true value is the difference we found (e.g., $1.67). Then we randomly resample—we simulate this thousands of times with a computer—and look for the range that includes 95% of the simulated values. It gives you an indication not of the min and max values in the sample, but of what you would expect from a sample size of 50.
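A minimal sketch of that resampling idea in R (a generic percentile bootstrap of the difference in means; not necessarily the exact procedure used for the slides):

    set.seed(42)
    # training and control are the two income vectors from the earlier sketch
    boot_diffs <- replicate(10000, {
      mean(sample(training, replace = TRUE)) - mean(sample(control, replace = TRUE))
    })
    # The middle 95% of the simulated differences gives the confidence interval
    quantile(boot_diffs, c(0.025, 0.975))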

What is the difference between statistical significance versus practical significance? Statistical significance tells you whether the impact is not zero. Effect size speaks more to practical significance—what is the practical impact? That gives you a measure either in real units like dollars or kilos, or along a scale of explaining variance.

As sample size increases, the p-value tends towards zero. What does it really mean to say we have enough data to conclude the difference is not zero? That's indeed true. The more data that you have, the more likely you're going to be able to reject the null hypothesis, because in the real world it's very rare that you have zero difference between two groups. As long as you have some difference, no matter how small, as the sample size increases your p-value is going to trend towards zero.

If you have a high p-value but a large effect size, how should you interpret these results? That's exactly the case where you want to look at the confidence interval of the effect size. It could be the case that you have a borderline p-value, but your confidence interval would still include zero. In most cases, if your p-value is greater than 0.05 or 0.10, you probably just don't have enough data to draw a conclusion.

What's the best tool to analyze data between R and Google Sheets? If you're collecting or managing data, I recommend ActivityInfo. For basic tests, you can use the t-test in Google Sheets or Excel, but you don't get those confidence intervals. For that, I'd really recommend taking a look at some of our tutorials using R. It does have a learning curve, but it allows you to tap into real power because effect size calculations beyond simple statistical testing are not available in Excel or Google Sheets.

Is Cohen's d the most commonly used method for effect size analysis? I'm not sure I would characterize it as the most common method. It really depends on the kind of analysis you're doing. Probably the most commonly used would be the difference of the means, though only if your units have intrinsic value. Odds ratios are also a great effect size measure that you hear very often. Cohen's d is a good starting point.

I think we're going to have to wrap up there. Thank you so much for joining us. Please let me know your feedback; I'm keen to hear if it was an easy-to-follow presentation. We look forward to seeing you in the next webinar. Have a great day and a great week.

Sign up for our newsletter

Sign up for our newsletter and get notified about new resources on M&E and other interesting articles and ActivityInfo news.
