How to know when to switch: Monty Hall without the math

You may have encountered, in a book, article, or possibly a classroom, the now well-known game show puzzle sometimes called the Monty Hall problem. It goes like this: you are a contestant on a game show, presented with three closed doors. Behind one of the doors is a new car; the other two, goats.

Here’s how the game proceeds:

  1. First, you choose any door you’d like.

  2. Second, the host opens a door you didn’t choose, which (purposefully) reveals a goat.

  3. Third, you have the option to switch your initial guess.

Then the door you finally settled on is opened, and you get whatever it reveals! The puzzle is this: to get the car, is it to your benefit to switch after the host opens a door? That is, will your odds of winning the car increase if you change your initial guess?

The mathematical solution to this problem is hard to grasp completely without a college-level understanding of probability. In fact, when the puzzle was first widely publicized by Marilyn vos Savant, it fooled hundreds of respected mathematicians and physicists. To this day, a basic internet search for the problem turns up countless forum posts asking for help in understanding it, and as many diverse attempts to explain the answer, often to no avail.

But the answer should be obvious, right? After the host reveals one of the goats, you must have a 50-50 shot at the car: it shouldn’t matter whether or not you switch doors. That turns out not to be correct: you’re actually twice as likely to take home the car if you switch your guess!

The issue people usually have with this puzzle is that the answer is unintuitive: even if you understand the math behind the right answer, when you put yourself in the story, it doesn’t feel like switching should make a difference. The goal of this short post is to conjure the right gut reaction for this game: to call into being the intuition needed to “feel out” the answer.

Consider a modification of the game in which there are not three but one hundred doors. There is still only 1 car, but now there are 99 goats. The rules can be generalized, like this:

  1. First, you choose any door you’d like.

  2. Second, the host purposefully reveals all but one goat (and that goat is behind either your door or the final door).

  3. Third, you have the option to switch your initial guess.

As in the three-door game, by step three you are down to two options: either your initial guess was correct, or the final door hides the car. Now, ask yourself: does this choice still feel like a 50-50 shot? In any 100-door game, it is quite likely (a 99% chance) that your first guess is wrong, and therefore that the car is among the other doors. And since the host only ever reveals goats, whenever your first guess is wrong the car must be behind the final door. The best strategy, then, is to switch your guess.

The same principle applies in the 3-door game. More likely than not, your first guess was wrong: you only had a 1/3 chance of being right. After your initial guess, the host shows you all but one goat*. So, under a day-to-day interpretation of probability, there’s a 2/3 chance the car is behind the final door.

*Notice that the 3-door game is the smallest game in which the host can take this action. In the 2-door game, it really is a 50-50 shot at the end: but that’s because your first guess is a 50-50 shot, and the host can do nothing.
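
If you’d rather simulate than philosophize, here’s a short sketch in R (entirely my own construction; the function name and trial counts are arbitrary) that plays the game many times under both strategies:

monty <- function(n_doors = 3, switch_doors = TRUE) {
  car   <- sample(n_doors, 1)  # the door hiding the car
  guess <- sample(n_doors, 1)  # your initial pick
  # The host opens every goat door except one: the door left closed is
  # the car whenever your guess was wrong, and a random goat otherwise.
  final <- if (guess == car) sample(setdiff(1:n_doors, car), 1) else car
  if (switch_doors) final == car else guess == car
}
mean(replicate(10000, monty()))                      # switch: about 0.67
mean(replicate(10000, monty(switch_doors = FALSE)))  # stay:   about 0.33
mean(replicate(10000, monty(n_doors = 100)))         # switch, 100 doors: about 0.99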

 

Understanding Mixed-Effects Models with Cersei Lannister



This post is mostly for people who know a bit about doing statistics and want a primer on mixed-effects models. I’ve used the plainest language possible, so even if you don’t fit that bill, you might still find it readable. At the very least, you’ll get to laugh at my nerdy attempts to make statistics more entertaining to non-experts. My main goals are:

  1. Give accessible examples (the last being an online services simulation) of fitting mixed-effects models in R
  2. Discuss the interpretation of and motivation for mixed-effects models
  3. Provide short, educational GOT fan-fic, with pictures

Some of this is in response to a great post from folks at Google. They show a cool result about the predictive efficacy of a mixed-effects model, so I decided to try a related experiment. However, their introduction of the model didn’t (for me) distinguish it thoroughly enough from a Bayesian approach, so I’ll be discussing that as well.

I’ll provide some code I used, but not all of it. Most code I provide in-line is meant to display the basic functionality of the R library I chose to fit the mixed-effects models, called lme4. I put some other code at the end, and the rest is available upon request.

Hat tip to this cool website for providing me with endless Westeros-style names! In the following example, set in the Game of Thrones universe, I go over what it means to have different experimental units in a regression model. This is an important concept for understanding why we might want to use a mixed-effects model in the first place. Enter the Lannister family:

Jaime’s archers

Jaime was preparing for a long journey, on official Lannister business. Knowing that things would likely come to swords, he assembled a small force to accompany him. He had some time to take special care when outfitting his troops, so he decided to run an experiment with two different kinds of bows.

Normally, Lannister forces use bows made from Northwood, as common knowledge says they are stronger than Southwood bows. However, since Jaime is a Southerner, he prefers Southwood bows out of pride and convenience. He wanted to see if Southwood bows cause a difference in archer range.

So, he chose three archers randomly from the kingdom’s forces, who each would shoot a Northwood bow and a Southwood bow 10 times. He bullied some local city-folk into recording the distance of the shots, in meters. Some of the resulting data is shown below:

##   Distance_      Bow_          Name_
## 1       163 Northwood Deonte Skinner
## 2       163 Northwood Deonte Skinner
## 3       159 Northwood Deonte Skinner
## 4       149 Northwood Deonte Skinner
## 5       152 Northwood Deonte Skinner
## 6       155 Northwood Deonte Skinner

The other two archers were named Fabiar Darklyn and Margan Drumm. Jaime chose to fit the following linear regression model to his experiment’s data: \[
Y_{ik} = \mu + \tau\mathbb{1}(i = 2) + \epsilon_{ik}
\]
Above, \(Y_{ik}\) is the \(k\)-th shot from bow \(i = 1,2\), and the Southwood bow is considered to be the second bow. The variables \(\mu\) and \(\tau\) represent (respectively) the mean shot distance and the mean effect of the Southwood bow. The final term \(\epsilon_{ik}\) is the random error (the Winds of Westeros, perhaps) contributing to the \(k\)-th shot from the \(i\)-th bow. There are \(3\times 10 = 30\) shots per bow, since each marksman shoots each bow 10 times. The errors are independent shot-to-shot and Normally distributed with constant variance \(\sigma^2_e\). Here’s the summary of Jaime’s analysis:

lm0 <- lm(Distance_ ~ Bow_, data = JaimeExp1$Data)
summary(lm0)
## 
## Call:
## lm(formula = Distance_ ~ Bow_, data = JaimeExp1$Data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -25.84 -15.59 -10.21  23.75  34.70 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    167.885      3.678  45.643   <2e-16 ***
## Bow_Southwood   -9.767      5.202  -1.878   0.0655 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20.15 on 58 degrees of freedom
## Multiple R-squared:  0.0573, Adjusted R-squared:  0.04105 
## F-statistic: 3.526 on 1 and 58 DF,  p-value: 0.06546

We see that when the archers switched to a Southwood bow, their shot distances decreased by about 10 meters, on average. However, Jaime realized his p-value was not below that magical 5% threshold! He strutted merrily out of the experiment grounds, confident that his results must be due to unfair winds. (Clearly, Jaime was absent for some important Statistical Lectures of the Crown.)

“A most excellent result”

Adding archer terms

He was, of course, stopped by his brother Tyrion, who observed the whole experiment. “Now Jaime, can’t you see that you chose archers of wildly different statures? Only a fool would not account for this in his analysis.” Grumpily, Jaime re-fit his model with additional terms for the archers:

\[
Y_{ijk} = \mu + \tau\mathbb{1}(i = 2) + \alpha_j + \epsilon_{ijk}
\]

Above, \(Y_{ijk}\) is the \(k\)-th shot (now \(k\) goes from 1 to 10) from the \(j\)-th archer when he is using the \(i\)-th bow. The \(\alpha_j\) parameters represent archer \(j\)’s baseline skill. Here were Jaime’s results, using the new model:

lm1 <- lm(Distance_ ~ Bow_ + Name_, data = JaimeExp1$Data)
summary(lm1)
## 
## Call:
## lm(formula = Distance_ ~ Bow_ + Name_, data = JaimeExp1$Data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.1652  -3.4554  -0.5064   2.5754  13.1611 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          156.427      1.238 126.386  < 2e-16 ***
## Bow_Southwood         -9.767      1.238  -7.891 1.17e-10 ***
## Name_Fabiar Darklyn   38.585      1.516  25.454  < 2e-16 ***
## Name_Margan Drumm     -4.212      1.516  -2.779  0.00741 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.794 on 56 degrees of freedom
## Multiple R-squared:  0.9485, Adjusted R-squared:  0.9457 
## F-statistic: 343.6 on 3 and 56 DF,  p-value: < 2.2e-16

(A small detail you’ll notice above is that there is no estimate for the 1st archer’s effect. That estimate is absorbed into the Intercept term; to predict shot distance for the 1st archer, we simply leave out all archer terms.)
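
To make that concrete, here’s a tiny sketch (my own illustration, not part of Jaime’s analysis) using lm1 to predict a Northwood shot for two of the archers; the expected values follow from the coefficient table above:

# Baseline archer: prediction is just the intercept
predict(lm1, newdata = data.frame(Bow_ = "Northwood", Name_ = "Deonte Skinner"))  # 156.43
# Other archers: intercept plus their Name_ coefficient
predict(lm1, newdata = data.frame(Bow_ = "Northwood", Name_ = "Margan Drumm"))    # 156.43 - 4.21 = 152.22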

After accounting for the different archers, the estimated effect of the Southwood bow became highly statistically significant. These results weren’t as pleasing to Jaime, who stormed past a smirking Tyrion. “Well? What read your results?” he heard as he slammed the training field gate behind him.

So what happened? In truth, there are two separable sources of error around the bow effect: per-shot random error, and variance among archer shooting strengths. Jaime (well, actually Tyrion) controlled for this in the second model by including archer effects.

“I have some experience in these matters”

If you’ve never encountered this statistical situation before, a visualization can help illustrate why it’s important to include archer effects, or in general, “experimental unit” effects. Here’s what Jaime’s data look like, without knowledge of the particular archer:

The red shots sit somewhat above the blue shots, but something’s fishy: the points seem to be layered, somehow. Now let’s give each archer’s points a different shape:

Now we can see clearly that the red shots are usually higher than the green shots, and the shots cluster around distinct locations that depend on the archer. By including archer terms in our model, we’re able to estimate the noise variance around the archer centers, rather than the overall center. This is beneficial from (at least) two standpoints:

  1. Interpretation: the variance due to archer skill is clearly not random error (Winds of Westeros), so a model that does not account for the marksmen will give a highly inaccurate estimate of the per-shot variance.
  2. Estimation of \(\tau\): Loosely speaking, we are better able to “see” the effect of the Southwood bow when we separate the shots by archer.
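
If you want to reproduce something like that second plot, here’s a sketch (this isn’t my original plotting code, and it assumes the shots are plotted in the order they were recorded):

library(ggplot2)
ggplot(JaimeExp1$Data,
       aes(x = seq_along(Distance_), y = Distance_,
           colour = Bow_, shape = Name_)) +
  geom_point(size = 2) +
  labs(x = "Shot", y = "Distance (m)", colour = "Bow", shape = "Archer")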

After cooling down a little, Jaime realized that (despite not achieving the result he desired) his findings could be of use to the Lannister army. He decided to run a larger experiment with 50 archers to solidify the results. Here was the outcome:

lm2 <- lm(Distance_ ~ Bow_ + Name_, data = JaimeExp2$Data)
summary(lm2)
## 
## Call:
## lm(formula = Distance_ ~ Bow_ + Name_, data = JaimeExp2$Data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.4424  -3.3524  -0.1689   3.2859  14.8538 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            168.8198     1.1263 149.883  < 2e-16 ***
## Bow_Southwood          -10.4792     0.3154 -33.221  < 2e-16 ***
## Name_Aldo Apperford    -12.2736     1.5772  -7.782 1.86e-14 ***
## Name_Antorn Feller      34.5155     1.5772  21.884  < 2e-16 ***
## Name_Arren Durrandon    29.3949     1.5772  18.637  < 2e-16 ***
## Name_Ashtin Orme        11.5839     1.5772   7.345 4.44e-13 ***
## Name_Barret Foral      -16.6079     1.5772 -10.530  < 2e-16 ***
## Name_Barrock Ryswell     6.4531     1.5772   4.091 4.65e-05 ***
## Name_Brack Mallister   -53.4039     1.5772 -33.860  < 2e-16 ***
## Name_Brandeth Erenford   6.1860     1.5772   3.922 9.41e-05 ***
## Name_Brayan Trante     -35.2405     1.5772 -22.344  < 2e-16 ***
## Name_Charres Apperford   0.3317     1.5772   0.210 0.833494    
## Name_Clarrik Parge       8.2105     1.5772   5.206 2.37e-07 ***
## Name_Codin Sadelyn      37.8102     1.5772  23.973  < 2e-16 ***
## Name_Culler Vyrwel       6.8461     1.5772   4.341 1.57e-05 ***
## Name_Curtass Shiphard  -33.6791     1.5772 -21.354  < 2e-16 ***
## Name_Denzin Garner     -21.6985     1.5772 -13.758  < 2e-16 ***
## Name_Deran Swygert       5.3668     1.5772   3.403 0.000695 ***
## Name_Donnal Brackwell  -18.5525     1.5772 -11.763  < 2e-16 ***
## Name_Dontar Brask        3.0880     1.5772   1.958 0.050534 .  
## Name_Dran Kenning       -4.3976     1.5772  -2.788 0.005406 ** 
## Name_Dwan Plumm        -14.3560     1.5772  -9.102  < 2e-16 ***
## Name_Eddard Meadows    -24.9352     1.5772 -15.810  < 2e-16 ***
## Name_Erock Conklyn      11.5451     1.5772   7.320 5.28e-13 ***
## Name_Evin Rane         -33.1439     1.5772 -21.014  < 2e-16 ***
## Name_Fabiar Darklyn    -33.2426     1.5772 -21.077  < 2e-16 ***
## Name_Harden Blest      -30.0566     1.5772 -19.057  < 2e-16 ***
## Name_Howar Clifton      -2.6749     1.5772  -1.696 0.090221 .  
## Name_Jaesse Cantell    -15.7270     1.5772  -9.971  < 2e-16 ***
## Name_Jares Grell         3.6472     1.5772   2.312 0.020966 *  
## Name_Jarvas Bywater      8.6005     1.5772   5.453 6.32e-08 ***
## Name_Jeffary Mudd       -4.4122     1.5772  -2.797 0.005254 ** 
## Name_Jeran Parsin       -8.9081     1.5772  -5.648 2.14e-08 ***
## Name_Kober Maeson       -5.4609     1.5772  -3.462 0.000559 ***
## Name_Koryn Staunton     32.5297     1.5772  20.625  < 2e-16 ***
## Name_Marak Herston     -51.0331     1.5772 -32.357  < 2e-16 ***
## Name_Mikal Netley      -11.2779     1.5772  -7.151 1.72e-12 ***
## Name_Mykal Ryser       -21.1308     1.5772 -13.398  < 2e-16 ***
## Name_Nelsor Baxter     -19.2805     1.5772 -12.225  < 2e-16 ***
## Name_Portar Parge       36.7979     1.5772  23.331  < 2e-16 ***
## Name_Raman Brakker      36.6719     1.5772  23.251  < 2e-16 ***
## Name_Randar Condon      -4.3076     1.5772  -2.731 0.006428 ** 
## Name_Roberd Spyre        3.6156     1.5772   2.292 0.022098 *  
## Name_Rolan Morrass      25.9665     1.5772  16.464  < 2e-16 ***
## Name_Samn Whitehill    -18.9911     1.5772 -12.041  < 2e-16 ***
## Name_Sarrac Ridman      -9.2904     1.5772  -5.890 5.34e-09 ***
## Name_Seamas Knigh        3.9833     1.5772   2.526 0.011713 *  
## Name_Seldan Harker     -21.1080     1.5772 -13.383  < 2e-16 ***
## Name_Sharun Mudd       -27.9034     1.5772 -17.692  < 2e-16 ***
## Name_Zandren Kneight   -53.3039     1.5772 -33.797  < 2e-16 ***
## Name_Zarin Goodbrother -21.9110     1.5772 -13.892  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.988 on 949 degrees of freedom
## Multiple R-squared:  0.9585, Adjusted R-squared:  0.9564 
## F-statistic: 438.8 on 50 and 949 DF,  p-value: < 2.2e-16

Having disregarded much of his scientific education, Jaime felt quite overwhelmed with this massive list of parameter estimates. In particular, he was confused by the few archers without a statistically significant estimate. Should he re-run the experiment without them?

(If you’re wondering why all the estimated standard errors for the archer effect estimates are the same, it’s because they all fired the same number of shots. Intuitively, this means we should have the same amount of certainty in each estimate.)
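
You can check the balance directly from the data; a quick sketch using the same JaimeExp2 object as above:

# Each archer fired each bow equally often (10 times apiece, if the first
# experiment's protocol was kept), so every archer coefficient is
# estimated with the same precision. Show the first three archers:
with(JaimeExp2$Data, table(Bow_, Name_))[, 1:3]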

This time, Cersei happened to be hanging out around the training grounds. Overhearing his concerns, she decided to step in. “Jaime, don’t trouble yourself with those worrisome parameter estimates for the archers. I looked at your charts from the first experiment. It is certainly proper to account for archers in the model, but the only quantity we should care about right now is the variance of their skills.” Later that afternoon she developed Westeros’s first linear mixed-effects model, in which archer effects are considered to be random draws from a larger population.

“With these equations I shall secure the reputation of the Lannister name in ways my father would never have dreamed”

Formulation of the mixed-effects model

Cersei’s key observation is that Jaime chose 50 archers out of the large population of archers in and around King’s Landing. Furthermore, he did so in a random fashion, caring only to distribute the bow test across a representative population of archers. In this sense, the archer effects are random. We can incorporate this in the model, by treating them as Normal random variables with a common, fixed variance: \[
Y_{ijk} = \mu + \tau\mathbb{1}(i = 2) + \alpha_j + \epsilon_{ijk}\qquad\qquad\alpha_j\sim\mathcal{N}(0, \sigma^2_a)
\]
The \(\alpha_j\)’s are now not of interest to estimate. They exist simply to separate out the contribution of marksman skill to per-shot variance. This “variance component”, \(\sigma^2_a\), is now the archer parameter of interest.

It is important to realize that, though we model the \(\alpha_j\)’s as random, they do not change shot-to-shot. From a physical perspective, they are still fixed features of the archer. We regard them as random only to focus on the underlying distribution of archer skill, and its effect on the predictive model.
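
One way to see this point is to simulate data under Cersei’s model. Here’s a sketch (the parameter values are illustrative picks of mine, roughly matching the estimates that appear later in this post): each archer’s \(\alpha_j\) is drawn once from the population and then held fixed across all of his shots; only the per-shot errors change.

set.seed(7)
n_archers <- 50; shots_per_bow <- 10
mu <- 168; tau <- -10; sigma_a <- 23; sigma_e <- 5
alpha <- rnorm(n_archers, 0, sigma_a)  # one draw per archer, then fixed
sim <- expand.grid(shot = 1:shots_per_bow,
                   Bow_ = c("Northwood", "Southwood"),
                   archer = 1:n_archers)
sim$Distance_ <- mu + tau * (sim$Bow_ == "Southwood") +
  alpha[sim$archer] + rnorm(nrow(sim), 0, sigma_e)  # fresh error per shot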

Random-effects vs. Bayesian randomness

Wait, if the parameters are random, isn’t this a Bayesian model? This is a reasonable question. One easy answer is that the bow effect is fixed, so it doesn’t make sense to give some parameters a Bayesian treatment and not others. But there are mixed-effects models that have no fixed parameters (usually termed “random-effects” models), so the distinction is more fundamental.

Here’s one perspective on the difference. As mentioned above, the parameters in a mixed model get their randomness from the unit (marksman) sampling process. In this sense, they are objectively random, because there is a real population (large enough to be considered infinite) of parameters being sampled from. In a Bayesian model, however, parameters are subjectively random in the sense that their prior and posterior distributions are meant to quantify our personal uncertainty about their values.

These are distinct conceptions of randomness. In fact, it is possible to model the random effects in a mixed-effects model as, additionally, subjectively random. Thus, we can fit both fixed-effects and mixed-effects models with fully Bayesian approaches. The choice to employ a mixed-effects model in the first place still exists in the Bayesian framework, and depends on all the same aspects of our modeling goals.

Let’s map out the Bayesian approach to our two types of models. Look back at the full fixed-effects model we considered for the bow experiments. In the most straightforward Bayesian framework, we would specify a conjugate set of prior distributions: Normal for \(\mu\), \(\tau\), and the \(\alpha_j\)’s, and Inverse Gamma for \(\sigma^2_e\).

For the full mixed-effects model, however, it sort of looks like we already have a prior on the \(\alpha\)’s. I would argue that this is an abuse of the term “prior”. The Normal distribution on the \(\alpha\)’s still represents the sampling uncertainty in the model. In a Gibbs sampler, it is used to update the \(\alpha\)’s, but if we store the \(\alpha\)’s, what we end up with should not be interpreted as a posterior distribution of a fixed parameter.

This is not merely a semantic difference: in the Bayesian approach to our mixed-effects model, we would need a prior on \(\sigma^2_a\), since that is the archer parameter of interest. In the Bayesian fixed-effects model, we gave the \(\alpha\)’s Normal priors directly, and \(\sigma^2_a\) was a fixed hyperparameter.
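
For concreteness, here is one way a fully Bayesian fit of the mixed-effects model could look. This is a sketch using the rstanarm package (my choice; this post doesn’t otherwise use it), and the priors are illustrative picks of mine, not recommendations:

library(rstanarm)
blme2 <- stan_lmer(
  Distance_ ~ Bow_ + (1 | Name_), data = JaimeExp2$Data,
  prior            = normal(0, 20),     # prior on the bow effect (tau)
  prior_intercept  = normal(160, 50),   # prior on mu
  prior_aux        = exponential(0.1),  # prior on sigma_e
  prior_covariance = decov()            # implies the needed prior on sigma_a
)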

But the semantic distinctions are important too. Though I’m no expert in Bayesian analysis, I feel like it would be inappropriate, for instance, to talk about Bayes Factors or credible intervals associated with the \(\alpha\)’s, since from a modeling perspective, we have decided not to treat them as values to estimate or do inference upon.

When not to use a mixed-effects model

Aren’t we always picking experimental units out of a larger population? What’s special about this situation? We are, but we don’t always do so randomly. Often, we care about particular units of interest that we knew about before the experiment, so we pick them directly. There’s nothing random about this process. Also, if we have that sort of interest in the units, usually estimating their effects directly is an important goal of the analysis.

For instance, Jaime might have also cared to estimate the skills of a few archers he knew about. In that case, he’d probably have hand-picked fewer archers (like in his first experiment) and had them take more shots. As a more realistic example, consider a small school district doing some sort of standardized test analysis on their three middle schools. There will be many, many observations from each middle school, and the district is not necessarily interested in considering the larger population of middle schools. Their aim is to build a predictive model for their particular district, and do inference on parameters for their schools only.

Fitting the mixed-effects model

Alright, enough philosophy. What can mixed-effects modeling actually do for us, from an inferential or decision-based perspective? First, let’s fit the model to Jaime’s 2nd set of experimental data:

library(lme4)
lme2 <- lmer(Distance_ ~ Bow_ + (1 | Name_), data = JaimeExp2$Data)

The switch to mixed-effects modeling shifts the focus from inference on the individual archers to inference on the population of archers. This focus is important in applications from many fields, like biology (where observations may come from many organisms), economics (where observations may come from many regions or agents), or clinical trials (where observations may come from many different labs). In our example, it makes sense for Cersei to want to learn about the variance in city-wide marksman skill and separate it from random error. The estimate \(\hat\sigma^2_a\) gives her precisely this, and she won’t get that from the fixed-effects model:

VarCorr(lme2)
##  Groups   Name        Std.Dev.
##  Name_    (Intercept) 22.9698 
##  Residual              4.9875
confint(lme2)[1:2, ]
##            2.5 %    97.5 %
## .sig01 18.907078 28.054556
## .sigma  4.768896  5.217741

The output above gives the estimates for \(\sigma_a\) and \(\sigma_e\), and the associated confidence intervals.
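
A handy number to pull out of these components is the fraction of total shot-to-shot variance attributable to archer skill (the intraclass correlation). A quick sketch, plugging in the estimates printed above:

sig_a <- 22.9698; sig_e <- 4.9875
sig_a^2 / (sig_a^2 + sig_e^2)  # about 0.955: archer skill dominates the noise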

Using the model to choose archers

“But wait,” Jaime might protest, “what if I care about the particular archer I happened to put to task? I now accept that Northwood bows are the obvious choice, but how does our model help me choose which men to take on my journey? At least with the fixed effects model I had parameter estimates!”

Actually, a mixed-effects model can rank the units (archers) in a functionally identical way. Once the model is fit, we can calculate the conditional expectation of a particular archer’s effect, given his experimental data. This can operate as a skill score for the archer: \[
\text{Score for archer }j = \mathbb{E}\left[\alpha_j | \bar y_{j}\right]
\]
Above, \(\bar y_{j}\) is the \(j\)-th archer’s average shot distance. Remember: in the original mixed-effects model, each \(\alpha_j\) is random, with mean zero, so taking expectations shouldn’t seem useful at first. However, knowledge of our experimental data shifts our expectation of \(\alpha_j\), depending on where the data fall relative to the overall mean. The above quantity is often taken as a “predictor” of \(\alpha_j\), and archers with higher \(\bar y_{j}\)’s have higher predictions. As it turns out, the predictions are perfectly correlated with the parameter estimates from the fixed-effects model:

# drop the 1st archer: his effect sits in lm2's Intercept
predictions <- ranef(lme2)$Name_[-1, ]
estimates <- unname(coef(lm2)[-c(1:2)])
cor(predictions, estimates)  # essentially 1
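
Why are the two so tightly related? In the balanced, random-intercept-only case (ignoring the bow adjustment, and stated here for intuition rather than as the exact lme4 computation), the conditional expectation has a well-known closed form: \[
\mathbb{E}\left[\alpha_j \mid \bar y_{j}\right] = \frac{\sigma^2_a}{\sigma^2_a + \sigma^2_e/n}\left(\bar y_{j} - \mu\right)
\]
where \(n\) is the number of shots per archer. The leading fraction is a shrinkage factor: each archer’s average is pulled toward the overall mean, and more strongly when the per-shot noise \(\sigma^2_e\) is large relative to the skill variance \(\sigma^2_a\). Because this is an increasing linear function of \(\bar y_{j}\), it preserves exactly the ordering given by the fixed-effects estimates.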