SML310 Mini-Project 3: Statistical Inference and Multi-Level Models

Overview

For this mini-project, we are concerned with statistical inference and generative models. You will be “generating” fake data – i.e., writing functions that use random-number generators to create a new dataset. If you are using R, you may find Ch. 5 of Shalizi helpful.

Part 1: Low P-Values (20 pts)

For this part, you should demonstrate that low p-values do not necessarily imply a practically important association between two variables.

Write a function to generate a dataset \((X, Y)\) containing \(n\) data points, such that

$$Y^{(i)}=a_0 + a_1 X^{(i)} + \epsilon_i \text{ , with } \epsilon_i\sim N(0, \sigma) $$

Set the parameters to be such that the relationship between \(X\) and \(Y\) is very weak. State what the parameters are, and plot \(Y\) vs. \(X\).

Now, show that for larger \(n\), you will nearly always reject the hypothesis that \(a_1 = 0\), even if \(a_1\) is very close to 0. Make an appropriate figure to demonstrate this fact. Think of a way to make the figure convincing.

Part 2: Lurking Variables

Part 2(a): Warm-up (15 pts)

Generate a dataset \((X_1, X_2, Y)\) where the correlation between \(X_1\) and \(Y\) is negative, but the coefficient \(a_1\) for \(X_1\) is positive when you run the regression

$$Y = a_0 + a_1 X_1 + a_2 X_ 2.$$

State how you created the dataset, and show that it conforms to the required property.

Part 2(b): Challenge (10 pts)

Now, come up with a plausible scenario where \(X_2\) is a lurking variable. That is, in your report, come up with a story about the data \((X_1, X_2, Y)\). Now, re-generate the data in a way that makes clear that the data could be generated if the scenario actually happened.

Include your code, an explanation, and a demonstration of the effect of \(X_2\) as a lurking variable.

Part 3: Free Throws

Free throws are an important part of basketball. Here, we will be using a dataset of the number of free throws that Shaquille O’Neal , one of the greatest basketball players in the last few decades who was also known for his poor free-throwing ability , has taken in 23 different games. The total number of attempted free throws is recorded, as well as the number of times that the attempts were successful.

lines <- 
"Game   Scored  N.Attempts
1   4   5
2   5   11
3   5   14
4   5   12
5   2   7
6   7   10
7   6   14
8   9   15
9   4   12
10  1   4
11  13  27
12  5   17
13  6   12
14  9   9
15  7   12
16  3   10
17  8   12
18  1   6
19  18  39
20  3   13
21  10  17
22  1   6
23  3   12"
con <- textConnection(lines)
shaq <- read.csv(con, sep="")
shaq
##    Game Scored N.Attempts
## 1     1      4          5
## 2     2      5         11
## 3     3      5         14
## 4     4      5         12
## 5     5      2          7
## 6     6      7         10
## 7     7      6         14
## 8     8      9         15
## 9     9      4         12
## 10   10      1          4
## 11   11     13         27
## 12   12      5         17
## 13   13      6         12
## 14   14      9          9
## 15   15      7         12
## 16   16      3         10
## 17   17      8         12
## 18   18      1          6
## 19   19     18         39
## 20   20      3         13
## 21   21     10         17
## 22   22      1          6
## 23   23      3         12

Part 3(a): Good Days and Bad Days (15)

We are interested in determining the true probability of successfully scoring a free throw in each game. What is a hierarchical model that would make sense if Shaq had a different probability of scoring on different days?

Part 3(b): Fit a hierarchical model in Stan (15)

Fit the model you proposed using Stan. Display the posterior distribution for each of the parameters of the model.

From the posteriors that you plotted, does it appear that Shaq has good days and bad days? What are you basing this conclusion on?

Part 3(c): No-pooling, partial pooling, and complete pooling (15)

Suppose we want to model the probability of Shaq scoring a free throw, for each game. What are the no-pooling, partial-pooling, and complete-pooling models that would be used to estimate the probabilities?

Run the models, and display the posterior distributions for each day using each of the pooling methods. One way to display the distributions it to display the 95% predictive interval for each day, similarly to what we did with Minnesota counties in lecture.

What to submit

Please submit all of your Python code, as well as a report in PDF format. Your report should address every one of the tasks above. Your report should be reproducible: in the report, include the function calls that the TA should make to get the outputs that you are showing in your report. (You do not need to include helper functions in your report.

Report quality (10 pts)

Your report should be readable and reproducible. 10 pts will be awarded for very readable and professional reports. 5 pts will be awarded for reports that are readable with some effort. Points will be deducted for lack of reproducibility.