Across industries, we’re getting better at picking metrics

Everywhere I look, I see people optimizing bad metrics. Sometimes people optimize metrics that aren’t in their self-interest, like when startups focus entirely on signup counts while forgetting about retention rates. In other cases, people optimize metrics that serve their immediate short-term interest but are bad for social welfare, like when California corrections officers lobby for longer prison sentences.

The good news is that as we become a more data-driven society, there seems to be a broad trend, albeit a very slow one, towards better metrics. Take the media economy, for example. A few years ago, media companies optimized for clicks, and companies like Upworthy thrived by producing low-quality content with clickbaity headlines. But now, thanks to a more sustainable business model, companies like Buzzfeed are optimizing for shares rather than clicks. It’s not perfect, but overall it’s better for consumers.

In science, researchers used to optimize for publication counts and citation counts, which biased them towards publishing surprising and interesting results that were unlikely to be true. These metrics still loom large, but increasingly scientists are beginning to optimize for other metrics like open data badges and reproducibility, although we still have a long way to go before quality metrics are effectively measured and incentivized.

In health care, hospitals used to profit by maximizing the quantity of care. Perversely, hospitals benefited whenever patients were readmitted due to infections acquired in the hospital or due to the lack of an adequate follow-up plan. Now, with ACA policies that penalize hospitals for avoidable readmissions, hospitals are taking real steps to improve follow-up care and to reduce hospital-acquired infections. While the metrics should be adjusted so that they don’t unfairly penalize low-income hospitals, the overall emphasis on quality rather than quantity is moving things in the right direction.

We are still light years from where we need to be, and bad incentives continue to plague everything from government to finance to education. But slowly, as we get better at measuring and storing data, I think we are getting better at picking the right metrics.

Independent t-tests and the 83% confidence interval: A useful trick for eyeballing your data.

Like most people who have analyzed data using frequentist statistics, I have often found myself staring at error bars and trying to guess whether my results are significant. When comparing two independent sample means, this practice is confusing and difficult. The conventions that we use for testing differences between sample means are not aligned with the conventions we use for plotting error bars. As a result, it’s fair to say that there’s a lot of confusion about this issue.

Some people believe that two independent samples have significantly different means if and only if their standard error bars (68% confidence intervals for large samples) don’t overlap. This belief is incorrect. Two samples can have nonoverlapping standard error bars and still fail to reach statistical significance at \alpha=.05. Other people believe that two means are significantly different if and only if their 95% confidence intervals don’t overlap. This belief is also incorrect. For one-sample t-tests, it is true that significance is reached exactly when the 95% confidence interval excludes the test parameter \mu_0. But for two-sample t-tests, which are more common in research, statistical significance can occur with overlapping 95% confidence intervals.
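Both failure modes are easy to reproduce with made-up numbers. The sketch below uses the large-sample normal approximation and hypothetical means and standard errors (they are not data from this post); the function names are mine.

```python
from math import sqrt
from statistics import NormalDist


def two_sample_p(mean1, mean2, se1, se2):
    """Two-sided large-sample p-value for a difference in independent means."""
    z = (mean1 - mean2) / sqrt(se1 ** 2 + se2 ** 2)
    return 2 * (1 - NormalDist().cdf(abs(z)))


def bars_overlap(mean1, mean2, se1, se2, multiplier):
    """True if the intervals mean +/- multiplier * se overlap."""
    return abs(mean1 - mean2) < multiplier * (se1 + se2)


# Hypothetical case 1: standard error bars (multiplier 1) do not overlap,
# yet the difference is not significant at alpha = .05.
print(bars_overlap(0.0, 2.5, 1.0, 1.0, 1.0), two_sample_p(0.0, 2.5, 1.0, 1.0))

# Hypothetical case 2: 95% intervals (multiplier 1.96) overlap,
# yet the difference is significant at alpha = .05.
print(bars_overlap(0.0, 3.0, 1.0, 1.0, 1.96), two_sample_p(0.0, 3.0, 1.0, 1.0))
```

In the first case the p-value lands above .05 even though the standard error bars are clearly separated; in the second it lands below .05 even though the 95% intervals overlap.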

If neither the 68% confidence interval nor the 95% confidence interval tells us anything about statistical significance, what does? In most situations, the answer is the 83.4% confidence interval.

Figure 1: Two samples with a barely significant difference in means (p=.05). Each panel shows a different type of confidence interval. Only the 83.4% confidence intervals shown in the third panel are barely overlapping, reflecting the barely significant results.

To see why, let’s start by defining the t-statistic for two independent samples:

t = \frac{\overline{X_1} - \overline{X_2}}{\sqrt{se_1^2 + se_2^2}}

where \overline{X_1} and \overline{X_2} are the means of the two samples, and se_1 and se_2 are their standard errors. By rearranging, we can see that significant results will be barely obtained (p=.05) if the following condition holds:

\overline{X_1} - \overline{X_2} = 1.96\times\sqrt{se_1^2 + se_2^2}

where 1.96 is the large sample t cutoff for significance. Assuming equal standard errors (more on this later), the equation simplifies to:

\overline{X_1} - \overline{X_2} = 1.96\times{\sqrt{2}}\times{se}

On a graph, the quantity \overline{X_1} - \overline{X_2} is the distance between the means. If we want our error bars to just barely touch each other, we should set the length of the half-error bar to be exactly half of this, or:

\frac{1.96\times\sqrt{2}}{2}\times{se} \approx 1.386\times{se}

This corresponds to an 83.4% confidence interval on the normal distribution. While this result assumes a large sample size, it remains quite useful for sample sizes as low as 20. The 83.4% confidence interval can also become slightly less useful when the samples have strongly different standard errors, which can stem from very unequal sample sizes or variances. If you really want a solution that generalizes to this situation, you can set your half-error bar on your first sample to:

\frac{1.96\times{\sqrt{se_1^2 + se_2^2}}\times{se_1}}{se_1 + se_2}

and make the appropriate substitutions to compute the half-error bar in your second sample. However, this solution has the undesirable property that the error bar for one sample depends on the standard error of the other sample. For most purposes, it’s probably better to just plot the 83% confidence interval. If you are eyeballing data for a project that requires frequentist statistics, it is arguably more useful than plotting the standard error or the 95% confidence interval.
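If you want to check the 83.4% figure yourself, here is a minimal sketch of the equal-standard-error case using only Python's standard library. The variable and function names are mine, and the standard error of 1.0 in the sanity check is an arbitrary illustrative value.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()
Z_CRIT = STD_NORMAL.inv_cdf(0.975)  # two-sided 5% cutoff, about 1.96


def touching_multiplier(z_crit=Z_CRIT):
    """Half-bar length, in units of se, at which the bars just touch at p = .05."""
    return z_crit * sqrt(2) / 2


def coverage(multiplier):
    """Coverage of a normal interval of mean +/- multiplier * se."""
    return 2 * STD_NORMAL.cdf(multiplier) - 1


multiplier = touching_multiplier()
level = coverage(multiplier)
print(f"half-bar = {multiplier:.3f} x se, coverage = {level:.1%}")
# prints: half-bar = 1.386 x se, coverage = 83.4%

# Sanity check with an arbitrary se: at the barely significant difference,
# the gap between the two bars is zero.
se = 1.0
diff = Z_CRIT * sqrt(2) * se
gap = diff - 2 * multiplier * se
```

The same `coverage` helper gives roughly 68% for a multiplier of 1 and 95% for 1.96, which is why neither of the usual error bars lines up with the two-sample test.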

Update: Jeff Rouder helpfully points me to Tryon and Lewis (2008), which presents an error bar that generalizes both to unequal standard errors and small samples. Like the last equation presented above, it has the undesirable property that the size of the error bar around a particular sample depends on both samples. But on the plus side, it’s guaranteed to tell you about significance.

Jumping quickly between deep directories

I often need to jump between different directories with very deep paths, like this:

$ cd some/very/deep/directory/project1
$ # do stuff in Project 1
$ cd different/very/deep/directory/project2
$ # do stuff in Project 2

While it only takes a handful of seconds to switch directories, the extra mental effort often derails my train of thought. Some solutions exist, but they all have their limitations. For example, pushd and popd don’t work well for directories you haven’t visited in a while. Aliases require you to manually add a new alias to your .bashrc every time you want to save a new directory.

I recently found a solution, inspired by this post from Jeroen Janssens, that works great and feels totally natural. All it takes is a one-time change to your .bashrc that will allow you to easily save directories and switch between them. To save a directory, just use the mark function:

$ pwd
$ mark project1

To navigate to a saved directory, just use the cdd function:

$ cdd project1
$ # do stuff in Project 1
$ cdd project2
$ # do stuff in Project 2

You can display a list of your saved directories with the marks function, and you can remove a directory from the list with the unmark function:

$ unmark project1

For any of this to work, you’ll need to add this to your .bashrc, assuming you have a Mac and use the bash shell.

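# Marks are stored as symlinks in this directory ($HOME/.marks is an assumed location).
export MARKPATH="$HOME/.marks"
function cdd {     # jump to a saved directory
    cd -P "$MARKPATH/$1" 2>/dev/null || echo "No such mark: $1"
}
function mark {    # save the current directory under a name
    mkdir -p "$MARKPATH"; ln -s "$(pwd)" "$MARKPATH/$1"
}
function unmark {  # remove a saved directory (asks for confirmation)
    rm -i "$MARKPATH/$1"
}
function marks {   # list saved directories as "name -> path"
    \ls -l "$MARKPATH" | tail -n +2 | sed 's/  / /g' | cut -d' ' -f9- | awk -F ' -> ' '{printf "%-10s -> %s\n", $1, $2}'
}

function _cdd {    # tab completion for cdd: offer the saved mark names
    local cur=${COMP_WORDS[COMP_CWORD]}
    COMPREPLY=( $(compgen -W "$( ls "$MARKPATH" )" -- "$cur") )
}
complete -F _cdd cdd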

This differs from Jeroen’s original code in a couple of ways. First, to be more brain-friendly, it names the function “cdd” instead of “jump”. Second, the tab completion works better.

Update: John McDonnell points me to autojump.

How failed replications change our effect size estimates

Yesterday I posted a very unscientific survey asking researchers to describe how failed replications changed their subjective estimates of effect sizes. The main survey asked for “ballpark estimates” of effect sizes, but an alternative interactive version allowed researchers to also report their uncertainty by specifying both the mean and variance of their posterior distributions. Thanks to everyone who participated. I won’t be analyzing any new data after this, but it’s never too late to publicly share your estimates!

Here are the questions. 

Question 1. A 2009 experiment with 50 subjects (25 per cell) is published in Psych Science. The experiment does not require any special equipment other than a questionnaire. It is not pre-registered. The results show an effect size of d=0.5. Let’s define the true effect size to be the average effect size of an infinite number of replications that the original experimenter would deem “reasonably exact” in advance. Based on this information alone, what is your ballpark subjective estimate of the true effect size?

Question 2. What if the experiment had been pre-registered? 

Question 3. Assume again that the experiment was not pre-registered. Now imagine that a pre-registered replication attempt with the same sample size estimated the effect size to be d=0.0. At the time of pre-registration, the original experimenter deemed it “reasonably exact”. Based on this replication and the original experiment, what is your ballpark subjective estimate of the true effect size?

Question 4. What if the replication attempt had 300 subjects per cell?

Here are the results.


Keeping in mind all the caveats about sampling bias and other issues, here are a few observations:

  • The original study reported an effect size of d=0.5, but the results for Question 1 tell us that most researchers believed the true effect size was closer to d=0.2, which is roughly in line with my own estimate. Had I allowed researchers to state their uncertainty, I suspect that many would find it quite possible that even the sign of the effect was wrong. This isn’t really surprising to me, but I think we should take a moment to reflect on what this means. When a scientist reports a result, most other researchers believe it is massively overstated. I know that there are still some researchers who want little or no changes to the status quo, but I’d like to live in a world where people actually believe the claims that scientists make. That’s why I’m a strong supporter of all the attempts to fundamentally change how scientists do research.
  • If you want people to have more confidence in your findings, pre-registration can make a big difference.
  • While it’s not apparent from the plot, almost all respondents reduced their effect size estimate upon hearing about failed replications (Question 3 and 4 compared to Question 1).
  • As some have pointed out, the original experiment falls a bit short of statistical significance. This was an oversight, as I forgot to check the p-value after changing some of the values. I don’t think this is a huge deal, since posterior estimates shouldn’t really depend too much on whether the results cross an arbitrary threshold. But apologies for the error.
  • My estimates were .25, .40, .10, .05.
  • I wish I had included another question asking what people would have thought of the original study if it had been conducted in 2014.
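For anyone who wants to verify the point about significance in the list above, here is a rough check. It converts d=0.5 with 25 subjects per cell into a test statistic and uses the normal approximation for the p-value, which is slightly more lenient than the exact t-test, so the conclusion carries over. The function name is mine.

```python
from math import sqrt
from statistics import NormalDist


def t_from_d(d, n_per_cell):
    """Two-sample test statistic implied by Cohen's d with equal cell sizes."""
    return d / sqrt(2.0 / n_per_cell)


stat = t_from_d(0.5, 25)
# Normal approximation; the exact t-test p-value is somewhat larger.
p_normal = 2 * (1 - NormalDist().cdf(stat))
print(f"statistic = {stat:.2f}, approximate p = {p_normal:.3f}")
```

The statistic comes out below the 1.96 cutoff, so the hypothetical original study does not reach p<.05.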

Jason Mitchell’s essay

As of yesterday I thought the debate about replication in psychology was converging on consensus in at least one respect. While there was still some disagreement about tone, basically everyone agreed that there was value in failed replications. But then this morning, Jason Mitchell posted this essay, in which he describes his belief that failed replication attempts can contain errors and therefore “cannot contribute to a cumulative understanding of scientific phenomena”. It’s hard to know where to begin when someone comes from a worldview so different from one’s own. Since there’s clearly a communication problem here, I’ll just give two examples to illustrate how I think about science.

  • Example 1. A rigorous lab conducts an experiment using a measurement device that requires special care. The effect size is d=0.5. Later, a different lab with no experience using the device tries to quickly replicate the experiment and computes an effect size of d=0.0.
  • Example 2. A small sample experiment in a field with a history of p-hacking shows an effect size of d=0.5. Another lab tries to replicate the study with a much larger sample and computes an effect size of d=0.0.

In both cases, I’d have subjective beliefs about the true effect size. For the first example, my posterior distribution might peak around d=0.4. For the second example, my posterior distribution might peak around d=0.1. In both cases, the replication would influence my posterior, but to varying degrees. In the first example, it would cause a small shift. In the second, it would cause a big shift. Reasonable people can disagree on the exact positions of the posteriors, but basically everyone ought to agree that our posteriors should incrementally adjust as we acquire new information, and that the size of these shifts should depend on a variety of factors, including the possibility of errors in either the original experiment or in the replication attempt. Maybe it’s because I’m stuck in a worldview, but none of this even seems very hard to understand. 
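One simple way to make "the size of the shift depends on the evidence" concrete is a precision-weighted average of the two estimates. This is only a sketch under strong assumptions: it treats both studies as unbiased, uses a rough small-effect approximation for the standard error of d, and borrows the 25 and 300 subjects-per-cell figures from the survey questions purely for illustration. It deliberately ignores p-hacking and lab experience, which matter just as much in the examples above, so the numbers are not meant to match my subjective posteriors.

```python
from math import sqrt


def approx_se_d(n_per_cell):
    """Rough standard error of Cohen's d for two equal cells (small-d approximation)."""
    return sqrt(2.0 / n_per_cell)


def pooled_estimate(estimates, standard_errors):
    """Precision-weighted mean: each estimate is weighted by 1 / se^2."""
    weights = [1.0 / se ** 2 for se in standard_errors]
    return sum(w * e for w, e in zip(weights, estimates)) / sum(weights)


original_d, replication_d = 0.5, 0.0

# Replication with the same sample size as the original: a moderate shift.
same_size = pooled_estimate(
    [original_d, replication_d], [approx_se_d(25), approx_se_d(25)]
)

# Replication with a much larger sample: a much bigger shift toward zero.
large_replication = pooled_estimate(
    [original_d, replication_d], [approx_se_d(25), approx_se_d(300)]
)

print(f"same-size replication:  {same_size:.3f}")
print(f"large replication:      {large_replication:.3f}")
```

Even in this stripped-down model, the failed replication always moves the estimate, and it moves it further when the replication is more precise.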

Jason Mitchell sees things differently. For him, all failed replications contain “no meaningful evidentiary value” and “do not constitute scientific output”. I don’t doubt the sincerity of his beliefs, but I suspect that most scientists and nonscientists alike will find these assertions to be pretty bizarre. NHST isn’t the only thing causing the crisis in psychology, but it’s pretty clear that this is what happens when people get too immersed in it. 

How I use Twitter

Next week I’m going to start a new job as a data scientist at Twitter and I am thrilled. Aside from Google search, no other website has had a more positive impact on my life than Twitter. Twitter is just so much fun, and I have learned so much from it. 

Because my experience has been so good, it saddens me to hear that some people don’t really “get” Twitter. Some people who try it feel frustrated and stop using it. Others use it occasionally but don’t really see what all the fuss is about.

I want to share my approach to using Twitter so that others can try. There are probably other ways to enjoy it, but this approach has worked well for me:

  • I don’t necessarily follow my friends, and I don’t expect them to follow me. I use Twitter for a limited set of interests, and not all of my friends tweet about those interests.
  • I generally don’t follow organizations. They tend to tweet too much and their content is often too promotional.
  • Instead, I follow opinionated people who tweet about a small set of topics that I’m interested in.
  • I make sure that my tweet stream is slow enough that I can read every tweet. I do this by limiting the number of people I follow and by making sure that I don’t follow people who tweet too much, even if they have good content.

That’s it. Follow opinionated strangers who tweet about topics you are interested in. Maybe you have a different approach that works well for you. But if you are still trying to figure out the incredible appeal of Twitter, you might want to give my approach a shot.