## Correlation Illustrated: Minimum and Maximum of Die Rolls

Suppose we have a fair standard die which we roll twice. The die has no magical properties that cause the outcome of one roll to influence the outcome of a subsequent roll, so we can assume that the outcomes of the two rolls are independent.

Let $X$ and $Y$ denote the maximum and minimum of the two rolls, respectively. By definition of maximum and minimum, $X\geq Y$. For example, if we know that the minimum $Y$ of the two rolls is $4$, then we know that the maximum $X$ is at least $4$.

This example illustrates that $X$ and $Y$ are not independent: knowing $Y$ provides us information on the distribution of $X$. Additionally, the above example tells us that larger values of $Y$ are associated with larger values of $X$, since the minimum of the rolls is a lower bound for the maximum of the rolls. Given this observation, we might suspect that $X$ and $Y$ are positively correlated. To validate this suspicion, we compute the joint distribution of $X$ and $Y$ using conditional probability.

Let $R_{1}$ and $R_{2}$ denote the outcomes of the first and second rolls, respectively. For integers $k,j\in\left\{1,\ldots,6\right\}$ with $k\geq j$, observe that

$\begin{array}{lcl}\displaystyle\mathbf{P}(X=k, Y=j)&=&\displaystyle\begin{cases}{\mathbf{P}(R_{1}=k,R_{2}=j)+\mathbf{P}(R_{1}=j,R_{2}=k)}&{k\neq j}\\ \\{\mathbf{P}(R_{1}=k,R_{2}=j)}&{k=j}\end{cases}\\[2 em]&=&\displaystyle\begin{cases}{\dfrac{1}{18}}&{k\neq j}\\ \\{\dfrac{1}{36}}&{k=j}\end{cases}\end{array}$
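As a sanity check (not part of the original derivation), we can enumerate all $36$ equally likely outcomes in a few lines of Python and tally the joint pmf exactly with `fractions.Fraction`:

```python
from fractions import Fraction
from itertools import product

# Enumerate all 36 equally likely (r1, r2) outcomes and tally
# the joint pmf of X = max(r1, r2), Y = min(r1, r2).
joint = {}
for r1, r2 in product(range(1, 7), repeat=2):
    x, y = max(r1, r2), min(r1, r2)
    joint[(x, y)] = joint.get((x, y), Fraction(0)) + Fraction(1, 36)

print(joint[(5, 4)])  # 1/18: two orderings, (4, 5) and (5, 4)
print(joint[(4, 4)])  # 1/36: only (4, 4)
```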

Using this formula, we can compute the marginal distributions of $X$ and $Y$. For $1\leq j\leq 6$,

$\displaystyle\mathbf{P}(Y=j)=\sum_{k=j}^{6}\mathbf{P}(X=k,Y=j) = \dfrac{1}{36}+(6-j)\cdot\dfrac{1}{18}$

and

$\displaystyle\mathbf{P}(X=j)=\sum_{k=1}^{j}\mathbf{P}(X=j,Y=k) = \dfrac{1}{36}+(j-1)\cdot\dfrac{1}{18}$
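The two closed forms are easy to check numerically; here is a short sketch that builds both marginals from the formulas above and confirms each sums to $1$:

```python
from fractions import Fraction

# Marginal pmfs from the closed forms above:
# P(X = j) = 1/36 + (j - 1)/18 and P(Y = j) = 1/36 + (6 - j)/18.
p_x = {j: Fraction(1, 36) + (j - 1) * Fraction(1, 18) for j in range(1, 7)}
p_y = {j: Fraction(1, 36) + (6 - j) * Fraction(1, 18) for j in range(1, 7)}

print(p_x[6], p_y[4])  # 11/36 5/36
```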

We summarize these results with the table below.

| $j$ | $1$ | $2$ | $3$ | $4$ | $5$ | $6$ |
|---|---|---|---|---|---|---|
| $\mathbf{P}(X=j)$ | $\frac{1}{36}$ | $\frac{3}{36}$ | $\frac{5}{36}$ | $\frac{7}{36}$ | $\frac{9}{36}$ | $\frac{11}{36}$ |
| $\mathbf{P}(Y=j)$ | $\frac{11}{36}$ | $\frac{9}{36}$ | $\frac{7}{36}$ | $\frac{5}{36}$ | $\frac{3}{36}$ | $\frac{1}{36}$ |

One key observation is that the conditional distribution of $X$ given $Y$ is not uniform. In the first draft of this post, I wrote, as an example, that $X\mid\left\{Y=4\right\}$ is uniformly distributed over the integers $4$, $5$, and $6$. This is false:

$\displaystyle\mathbf{P}\left(X=k\mid Y=4\right)=\dfrac{\mathbf{P}\left(X=k,Y=4\right)}{\mathbf{P}(Y=4)}=\begin{cases}\dfrac{\frac{1}{36}}{\frac{5}{36}}=\dfrac{1}{5}&{k=4}\\ \\ \dfrac{\frac{1}{18}}{\frac{5}{36}}=\dfrac{2}{5}&{k=5,6}\end{cases}$
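The same conditional probabilities fall out of a few lines of exact arithmetic, which is a quick way to catch the uniform-distribution mistake I made in the first draft:

```python
from fractions import Fraction

# P(X = k | Y = 4) via the joint pmf: 1/36 on the diagonal, 1/18 off it.
joint = {k: (Fraction(1, 36) if k == 4 else Fraction(1, 18)) for k in (4, 5, 6)}
p_y4 = sum(joint.values())                # P(Y = 4) = 5/36
cond = {k: p / p_y4 for k, p in joint.items()}

print(cond[4], cond[5], cond[6])  # 1/5 2/5 2/5 -- not uniform
```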

The reason for this is that there is only one way for the maximum of the two rolls to be $4$, given that the minimum is $4$: both rolls have outcomes of $4$. However, there are two ways for the maximum of the two rolls to be $5$ or $6$, given that the minimum is $4$: the first roll is $4$ and the second roll is $5$ or $6$, or the first roll is $5$ or $6$ and the second roll is $4$.

Now that we have the joint pmf of $X$ and $Y$, we can return to the original question of the correlation between $X$ and $Y$. Observe that

$\begin{array}{lcl}\displaystyle\mathbf{E}\left[XY\right]=\sum_{k=1}^{6}\sum_{j=1}^{k}(k\cdot j)\mathbf{P}(X=k,Y=j)&=&\displaystyle\sum_{i=1}^{6}\dfrac{i^{2}}{36}+\sum_{k>j}\dfrac{k\cdot j}{18}\\[1 em]&=&\displaystyle\sum_{i=1}^{6}\dfrac{i^{2}}{36}+\dfrac{\left(1+\cdots+6\right)^{2}}{36}-\sum_{i=1}^{6}\dfrac{i^{2}}{36}\\[1 em]&=&\displaystyle\dfrac{21^{2}}{36}\\[1 em]&=&\displaystyle\dfrac{441}{36}\end{array}$
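There is also a slick cross-check: since the maximum times the minimum of two numbers is just their product, $XY=R_{1}R_{2}$, and independence gives $\mathbf{E}[XY]=\mathbf{E}[R_{1}]\mathbf{E}[R_{2}]=(7/2)^{2}=441/36$. A direct enumeration confirms this:

```python
from fractions import Fraction
from itertools import product

# E[XY] by enumerating the two rolls; note max * min = r1 * r2.
exy = sum(Fraction(max(r1, r2) * min(r1, r2), 36)
          for r1, r2 in product(range(1, 7), repeat=2))

print(exy)  # 49/4 (Fraction reduces 441/36)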

Using the marginal distributions of $X$ and $Y$, we compute the expectations to be

$\displaystyle\mathbf{E}[X]=\sum_{i=1}^{6}i\cdot\dfrac{2i-1}{36}=\dfrac{6\cdot 7\cdot 13}{6\cdot 18}-\dfrac{21}{36}=\dfrac{182}{36}-\dfrac{21}{36}=\dfrac{161}{36}$

and

$\displaystyle\mathbf{E}[Y]=\sum_{i=1}^{6}(7-i)\cdot\dfrac{2i-1}{36}=7-\dfrac{161}{36}=\dfrac{252}{36}-\dfrac{161}{36}=\dfrac{91}{36}$

Putting these results together, we see that $X$ and $Y$ have positive covariance:

$\displaystyle\text{Cov}(X,Y)=\mathbf{E}[XY]-\mathbf{E}[X]\mathbf{E}[Y]=\dfrac{441}{36}-\dfrac{161}{36}\cdot\dfrac{91}{36}=\dfrac{1225}{1296}\approx 0.945$
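All three pieces of this computation can be verified with exact enumeration, which reproduces the covariance above without any rounding:

```python
from fractions import Fraction
from itertools import product

# Cov(X, Y) = E[XY] - E[X]E[Y], all three moments by exact enumeration.
outcomes = [(max(r1, r2), min(r1, r2))
            for r1, r2 in product(range(1, 7), repeat=2)]
ex = sum(Fraction(x, 36) for x, _ in outcomes)
ey = sum(Fraction(y, 36) for _, y in outcomes)
exy = sum(Fraction(x * y, 36) for x, y in outcomes)
cov = exy - ex * ey

print(cov, float(cov))  # 1225/1296 and roughly 0.945
```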

I leave it as an exercise to the reader to compute the Pearson $\rho$-correlation coefficient of $X$ and $Y$, which is defined by

$\displaystyle\rho(X,Y)=\dfrac{\text{Cov}(X,Y)}{\sqrt{\text{Var}(X)\text{Var}(Y)}}$

It’s not hard; since you know $\text{Cov}(X,Y)$, you just need to compute the variances of $X$ and $Y$ using their respective marginal distributions.

One of the reasons I like this problem is that it illustrates that covariance is a measure of association, not causation. By association, I mean how two random variables vary together. Statistics teachers try to drill this fact into students’ heads with the oft-repeated statement “correlation does not imply causation,” but all too often I read descriptions of studies that measure the correlation between two variables $X$ and $Y$ and summarize the conclusions by saying that $X$ causes $Y$ or $Y$ causes $X$. The media are the worst offenders when it comes to conflating correlation and causation; just read the headlines in the Health section of your Google News page.

Even though the minimum and maximum of the die rolls are positively correlated, as shown above, no one would say that a large minimum causes a large maximum; remember that we assume the rolls are independent. Knowing the minimum of the rolls only gives information on the possible values of the maximum, by definition of minimum and maximum. The maximum is tautologically at least as big as the minimum, so if $Y=4$, then we know that $X\neq 3$. The positive correlation stems from the information one random variable provides about the other.