Can machines score nonprofit ‘impact’?

A discussion on a proposed ‘automated impact scorecard’ — which, to start, will rely on some studies done by humans.

By Timothy Catlett, Agora for Good Evaluation Advisor

Is it possible to develop a system that ‘automatically’ ranks nonprofits according to their impact? We previously mentioned that Agora is working on an ‘impact scorecard’ to rank nonprofits in our search engine. This ranking will include grants and endorsements from experts, users, and other nonprofits, but we also plan to use the growing body of data coming out of evaluations of the effectiveness of charity work. This is incredibly difficult in practice, and we don’t want to give the impression of overconfidence or naivety about how hard it will be. Rather, we hope to collect feedback and input, and to start a dialogue around the components of nonprofit impact.

Impact evaluations are increasingly common

Not only are individual nonprofits conducting studies to measure their impact, but these separate studies are being brought together into systematic reviews, where studies on an intervention conducted by many different nonprofits are analyzed to determine trends in outcomes. At Agora, we hope to use both of these sources of information, giving donors a more informed view of what a specific nonprofit is achieving, and what outcomes to expect based on similar interventions. As far as we can tell, no one has created a searchable database for this purpose yet.

We propose transforming evaluations into scores

How will this work, exactly? Our idea is that an ‘impact scorecard’ could rate nonprofits separately for individual evaluations and systematic reviews. We’ll categorize each study by two things. First, the results of the study: how positive are they, and with what degree of confidence? That is translated into a numeric score. Second, the “study design,” or the specific strategy used to separate the impact of the nonprofit’s intervention from the noise and happenstance of everything else. (The results may also present a wide array of conclusions on the intervention’s impact based on varying conditions or types of beneficiaries.)

As the database grows, could machine learning get smart about categorizing these studies, either by doing so automatically or by allowing users to self-categorize (honor system) and then having a system find ‘discrepancies’ or red flags that require a more detailed look?

Caveat: It would not be possible for our scoring system to take every particular variation into account, or to communicate those variations to users, so we are sorting the evaluations into concrete categories based on the type of study and its results. Studies with better designs and statistically significant positive results will receive more points in the ranking.

How, tactically, will we categorize impact evaluations?

Agora has started a database of systematic reviews pulled from various expert sources, and will score the interventions that nonprofits use via the following four steps:

  1. Define each systematic review by the intervention used and the outcome that was measured; for example, “microfinance” programs for the goal of “poverty rates reduction.” This will enable keyword tagging in our database by intervention.
  2. Categorize the results of each systematic review based on both how effective the interventions were on average, and how much variation there was between different studies.
  3. Evaluate the quality of the systematic reviews based on their ability to predict results for nonprofits, by reviewing the methodology of the study.
  4. Average those scores to give a “summary” of the current literature’s stance on the impact of an intervention, with higher quality studies being weighted more heavily in the summary.
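
To make step 4 concrete, here is a minimal sketch of a quality-weighted average. It is an illustration under stated assumptions, not our actual implementation: the `Review` fields, the result scale from -1 to +1, and the quality weights from 0 to 1 are all hypothetical, and the single `result_score` stands in for the richer categorization described in step 2.

```python
from dataclasses import dataclass


@dataclass
class Review:
    intervention: str    # step 1: the intervention keyword
    outcome: str         # step 1: the outcome keyword
    result_score: float  # step 2: hypothetical scale, -1 (negative) to +1 (positive)
    quality: float       # step 3: hypothetical weight, 0 (weak) to 1 (strong)


def summarize(reviews, intervention, outcome):
    """Step 4: quality-weighted average of the matching reviews, or None."""
    matching = [
        r for r in reviews
        if r.intervention == intervention and r.outcome == outcome
    ]
    total_weight = sum(r.quality for r in matching)
    if total_weight == 0:
        return None  # no usable evidence for this intervention/outcome pair
    return sum(r.result_score * r.quality for r in matching) / total_weight


# Made-up numbers, for illustration only.
example_reviews = [
    Review("microfinance", "poverty rates reduction", 1.0, 0.75),
    Review("microfinance", "poverty rates reduction", 0.0, 0.25),
]
print(summarize(example_reviews, "microfinance", "poverty rates reduction"))  # 0.75
```

Because the weights are normalized, a single low-quality review with a glowing result moves the summary much less than a high-quality review does, which is the behavior step 4 describes.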

From this system, each nonprofit receives a score for the degree of evidence behind its work, and users are informed when a nonprofit is using an intervention with expert consensus behind it. (As mentioned, this is just one component of the overall ‘impact score.’) As the database grows, can this system self-categorize, or rely on human-machine interaction to categorize?

Like this idea? Leave us a comment!


Let’s go through some examples…

Let’s take a global nonprofit that distributes bed nets to fight malaria. It could benefit from two types of points. If it had done its own internal impact evaluation with a high-quality design, it could upload that study as ‘an individual review’ and receive points for it. In addition, because other researchers have collected evidence about the positive impact of using bed nets, it would receive ‘systemic intervention’ points for using a proven intervention in its work.

By comparison, consider a small food assistance program that does not have an expensive evaluation, but has conducted surveys or recorded simple impact measures. These are also forms of evaluation, so the program would also receive points. Its score would benefit from the additional research academics have done on how food assistance helps communities in need. While its score would not be as high as that of the global bed nets nonprofit, it would still rank higher than groups that do not measure their impact at all, or that work on issues no outside research has supported.

The scores from the systematic and individual reviews are combined with our other metrics: professional endorsements, user feedback, and transparency about the nonprofit on the Agora site. Through this system, we hope that many different kinds of great nonprofits will be highlighted on the site, and that users can feel more confident in assessing the impact of their donations.

What about nonprofits that have not conducted evaluations?

Many nonprofits will not have any impact evaluations for their programs. Evaluations take time and resources to implement, and newer nonprofits, as well as those working in uncertain situations, will not be able to conduct them. We hope Agora can be a platform for new ideas to receive exposure and support, so we want to provide ways for these nonprofits to place highly in our ranking, and for users to gain an idea of what kind of impact to expect. When a nonprofit builds its profile, it can upload a copy of its own impact evaluations, which will be available to interested users, or it can search a database of studies we’ve created drawing from places like JPAL, AidGrade, etc. We will also enable funding of ‘new’ ideas for donors willing to take on some risk.


We recognize our current evaluation system isn’t perfect. Agora’s mission is not to conduct charity evaluations; we just want to make sure that we use the existing data as effectively as possible. In addition, the work of evaluating nonprofit ‘impact’ and the results of studies associated with nonprofit impact is extremely complex, and we want to make this information as accessible to users as possible. Below is a list of a few of the limitations of our methodology; we welcome comments and thoughts on others.

Individual nonprofits have particular needs for their own evaluations that we can’t always capture:

While we reward nonprofits that use more rigorous kinds of study designs, such as the “gold standard” of randomized controlled trials, some types of nonprofit work require different approaches due to their unique contexts. Furthermore, newer and smaller nonprofits may not have the resources to conduct more expensive forms of evaluation, and our system does not want to unduly punish nonprofits that don’t fit into a certain model. Finally, many evaluations will only look at a part of the nonprofit’s work, and cannot judge the organization as a whole.

How we address this: While it would be difficult to develop a system that accounted for all of these complexities, we want to reward better evaluation efforts. Nonprofits are granted a large number of points for simply having an evaluation and communicating to users how it was conducted, as it demonstrates a commitment to effectiveness regardless of results. Having a better design or better results adds points on top of that, but not so many that other good organizations can’t achieve similar scores.

Nonprofits are also permitted to upload multiple evaluations to the site, with each one addressing different parts of the program. They would receive points based on the average of the results of those evaluations.
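
As a rough illustration of the point structure described above, the sketch below gives a large base award for having any evaluation, adds smaller bonuses for design and results, and averages across multiple uploaded evaluations. Every number and category name here is a hypothetical placeholder chosen for the example, not Agora’s actual point values.

```python
# Hypothetical point values: the base award is deliberately larger than
# either bonus, so simply having and sharing an evaluation counts the most.
BASE_POINTS = 60
DESIGN_BONUS = {"rct": 20, "quasi_experimental": 10, "survey": 0}
RESULT_BONUS = {"positive": 20, "mixed": 10, "none": 0}


def evaluation_points(design, result):
    """Points for one uploaded evaluation; unknown labels earn no bonus."""
    return BASE_POINTS + DESIGN_BONUS.get(design, 0) + RESULT_BONUS.get(result, 0)


def nonprofit_evaluation_score(evaluations):
    """Average the points of all evaluations a nonprofit has uploaded.

    `evaluations` is a list of (design, result) pairs. A nonprofit with no
    evaluations scores 0.0 in this category.
    """
    if not evaluations:
        return 0.0
    return sum(evaluation_points(d, r) for d, r in evaluations) / len(evaluations)


# A nonprofit with one RCT showing positive results and one simple survey.
print(nonprofit_evaluation_score([("rct", "positive"), ("survey", "none")]))  # 80.0
```

With these placeholder values, a nonprofit with only a simple survey still earns 60 of a possible 100 points, which keeps smaller organizations within reach of similar scores.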

Systematic reviews are designed for specific research questions, not nonprofit evaluations:

Most systematic reviews only look at a specific intervention strategy, which may only be one part of the more complete strategy a nonprofit is using. The systematic reviews are also often conducted in isolation, so they measure similar outcomes in different ways, which makes comparisons between studies difficult: one education study will focus on standardized test scores, while another focuses on homework completion rates, and so on. If we were to only evaluate a nonprofit with systematic reviews that studied interventions and outcomes identical to the work of the nonprofit, most nonprofits would not have a matching review, and each individual nonprofit would require large amounts of focused research.

How we address this: In order to make this research useful, we created larger categories for the interventions and outcomes that the reviews can fit into — so all of the various ways to pay student tuition, for example, fall into one intervention, and all measures of student achievement fall into one outcome. This generalization lets us say a lot about many more nonprofits, but does come at the cost of accuracy. For this reason, we do not report any specific numbers from the research, and instead communicate a general positive or negative trend that the systematic reviews suggested.
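
A minimal sketch of this generalization follows, assuming a simple lookup table from specific terms to broad categories and a cutoff for reporting a trend. The specific tuition examples, the category names, the 0.1 threshold, and the ‘no clear trend’ label are all hypothetical choices for the example; only the two education outcomes come from the discussion above.

```python
# Hypothetical mapping from specific interventions to one broad category.
INTERVENTION_CATEGORIES = {
    "scholarships": "tuition support",
    "school vouchers": "tuition support",
    "tuition waivers": "tuition support",
}

# Different measures of the same underlying outcome share one category.
OUTCOME_CATEGORIES = {
    "standardized test scores": "student achievement",
    "homework completion rates": "student achievement",
}


def broad_tag(intervention, outcome):
    """Return the (intervention, outcome) categories, or None where unmapped."""
    return (
        INTERVENTION_CATEGORIES.get(intervention.lower()),
        OUTCOME_CATEGORIES.get(outcome.lower()),
    )


def trend_label(summary_score, threshold=0.1):
    """Report only a direction, never the underlying number."""
    if summary_score > threshold:
        return "positive trend"
    if summary_score < -threshold:
        return "negative trend"
    return "no clear trend"


print(broad_tag("Scholarships", "Homework completion rates"))
print(trend_label(0.75))
```

Reporting a label rather than the score itself reflects the trade-off described above: the broader the category, the less a precise number would mean.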

Fields with a shortage of evaluations are unfairly punished:

We know that many types of nonprofits, from advocacy groups to cultural programs, work in fields that do not lend themselves to impact evaluations or that have not been the subject of systematic reviews. Systematic reviews in particular are conducted with the goals of the researcher in mind, so a lack of such reviews on a particular type of intervention is not at all a reflection of the quality of the intervention.

How we address this: We want every nonprofit to be able to distinguish itself with its work, so our search rank is designed so that points from other categories, such as professional and user endorsements, can earn a nonprofit at least as many points as impact evaluations. This way, there are always multiple ways for a nonprofit to distinguish itself.
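
One simple way to get this property, sketched here with hypothetical category names and caps, is to cap each category’s contribution so that the non-evaluation categories together can contribute at least as much as impact evidence does. This is an illustration of the idea, not a description of our actual ranking formula.

```python
# Hypothetical caps. Each non-evaluation category can contribute as much
# as impact evidence, so evidence alone never decides the ranking.
CATEGORY_CAPS = {
    "impact_evidence": 100,
    "professional_endorsements": 100,
    "user_feedback": 100,
    "transparency": 100,
}


def search_rank_points(points_by_category):
    """Sum each category's points, clamped to the range 0..cap."""
    total = 0
    for category, cap in CATEGORY_CAPS.items():
        raw = points_by_category.get(category, 0)
        total += max(0, min(raw, cap))
    return total


# An advocacy group with no evaluations but strong endorsements...
advocacy = {"professional_endorsements": 100, "user_feedback": 50}
# ...versus a nonprofit whose only strength is its impact evidence.
evidence_only = {"impact_evidence": 100}

print(search_rank_points(advocacy))       # 150
print(search_rank_points(evidence_only))  # 100
```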

Our design is always evolving

This system is our first attempt at integrating the existing impact data into Agora, and we are open to revising the design decisions and rank system that we have made. Please leave us any feedback or comments you may have, and we hope we can continue to expand on the system to make it as inclusive for nonprofits and as helpful to users as possible.

Footnote. To date, we have a database that represents about 130 studies across sectors. This analysis was conducted by Timothy Catlett, a former researcher at AidGrade and a current student at Columbia University’s School of International and Public Affairs. His work has specialized in monitoring and evaluation systems for international development, and promoting the use of data by development practitioners.
