December 1, 2023
Print list of packages and cite them via Pandoc citation.
We use the HSB dataset from the merTools
package (Knowles & Frederick, 2024):
schid minority female ses mathach size schtype meanses
1 1224 0 1 -1.528 5.876 842 0 -0.428
2 1224 0 1 -0.588 19.708 842 0 -0.428
3 1224 0 0 -0.528 20.349 842 0 -0.428
4 1224 0 0 -0.668 8.781 842 0 -0.428
5 1224 0 0 -0.158 17.898 842 0 -0.428
6 1224 0 0 0.022 4.583 842 0 -0.428
Environment (often) shows natural hierarchical structures (e.g., students nested within classes/schools, time points within persons)
The hierarchical structures may (!) lead to dependency in data
Hierarchical structures as a research object/topic
Example: The relation between socio-economic status and achievement across all (actually a subsample) schools
dat |>
(\(d) d[1:550,])() |>
ggplot(aes(x = ses, y = mathach)) +
geom_point(shape=21,
fill="white",
color="black",
size=2) +
geom_smooth(method='lm',
formula=y~x,
color = "blue",
se = F,
linewidth = 1.5) +
geom_smooth(method='loess',
formula=y~x,
color = "red",
se = F,
linewidth = 1.5) +
labs(x = "SES",
y = "Math Achievemenet") +
theme_classic()
Call:
lm(formula = mathach ~ ses, data = (function(d) d[1:550, ])(dat))
Residuals:
Min 1Q Median 3Q Max
-18.0224 -4.5656 0.1607 4.8815 17.8936
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 12.3956 0.2806 44.17 <2e-16 ***
ses 4.1711 0.3551 11.74 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6.58 on 548 degrees of freedom
Multiple R-squared: 0.2011, Adjusted R-squared: 0.1996
F-statistic: 137.9 on 1 and 548 DF, p-value: < 2.2e-16
Example: The relation between socio-economic status and achievement in two selected schools
dat |>
subset(subset = schid == 1224 | schid == 1462) |>
ggplot(aes(x = ses, y = mathach)) +
geom_point(shape=21,
fill="white",
color="black",
size=2) +
geom_smooth(method='lm',
formula=y~x, color = "blue", se = F) +
geom_smooth(method='loess',
formula=y~x, color = "red", se = F) +
facet_wrap(~schid, nrow = 1) +
labs(x = "SES",
y = "Math Achievemenet") +
theme_classic()
\(𝐸(Y│X)= 10.81 + 2.51X\)
\(E(Y│X)=9.994−0.82X\)
Example: The relation between socio-economic status and achievement separated in J (14) schools
ggplot(dat[1:550,],
aes(x = ses, y = mathach)) +
geom_point(shape=21,
fill="white",
color="black",
size=2) +
geom_smooth(method='lm',
formula=y~x,
se = F,
color = "blue") +
geom_smooth(method='loess',
formula=y~x,
se = F,
color = "red") +
facet_wrap(~schid, nrow = 2) +
labs(x = "SES",
y = "Math Achievemenet") +
theme_classic()
One way to address the hierarchical data structure is to include indicator variables (i.e., dummy variables) for the clusters in the model → fixed effects approach
Using reference coding we build \((k-1)\) indicator variables \(I_c\)
Model equation (see Equation 1):
\[ Y = \beta_0 + \beta_1X + \beta_2I_2 + \beta_3I_3 + \dots + \beta_kI_k \qquad(1)\]
Example: The relation between socio-economic status \((X)\) and achievement \((Y)\) in J (160!) schools
Model equation (see Equation 2):
\[ Y = \beta_0 + \beta_1X + \sum\limits_{2}^{160} \beta_kI_k \qquad(2)\]
Instead of estimating a huge number of regression coefficients, it is possible to estimate only the distribution parameters (i.e., expected value and variance) of these parameters across clusters
In doing so, there are additional assumptions:
Level 1 residuals within all clusters follow a (multivariate) normal distribution (when the outcome is continuous): \(r_{ij} \sim N(0,\sigma^2)\)
Level 2 residuals (i.e., random effects) follow a multivariate normal distribution
\[ \left( \begin{matrix}u_{0j} \\ u_{1j}\end{matrix}\right) \sim N\left(\begin{bmatrix} 0 \\ 0\end{bmatrix}, \begin{bmatrix} VAR(u_{0j}) & COV(u_{0j}, u_{1j}) \\ COV(u_{1j}, u_{0j}) & VAR(u_{1j})\end{bmatrix}\right) \]
Description of this relationship within any school \(j\) by the equations
Level-specific equations: \[ \text{Level 1: } Y_{ij} = \beta_{0j} + \beta_{1j}X_{ij} + r_{ij} \qquad(3)\]
\[ \begin{aligned} \text{Level 2 (intercept): }& \beta_{0j} = \gamma_{00} + u_{0j} \end{aligned} \qquad(4)\]
\[ \begin{aligned} \text{Level 2 (slope): } &\beta_{1j} = \gamma_{01} + u_{1j} \end{aligned} \qquad(5)\]
Combined equation: Substituting Equation 4 and Equation 5 into Equation 3 yields the combined model equation (see Equation 6):
\[ Y_{ij} = \gamma_{00} + u_{0j} + (\gamma_{01} + u_{1j})X_{ij} + r_{ij} \qquad(6)\]
Random intercept: \(\beta_{0j} = \gamma_{00} + u_{0j}\)
Random slope: \(\beta_{1j} = \gamma_{01} + u_{1j}\)
Covariance between slopes and intercepts: \(COV(\beta_{0j}, \beta_{1j}) = \tau_{01}\)
Fixed effects: \(\gamma_{00}, \gamma_{01}\)
Random effects/residuals: \(u_{0j}, u_{1j}, r_{ij}\)
Random effects/ level 2 residuals:
\[ \begin{aligned} & \beta_{0j} = \gamma_{00} + u_{0j} \text{(rearrange equation)} \\ & $u_{0j} = \beta_{0j} - \gamma_{00} \end{aligned} \]
→ deviation of the cluster specific intercept (\(\beta_{0j}\)) from the average intercept across cluster (\(\gamma_{00}\))
\(\beta_{1j} = \gamma_{01} + u_{1j}\) (rearrange equation)
\(u_{1j} = \beta_{1j} - \gamma_{01}\) → deviation of the cluster specific slope (\(\beta_{1j}\)) from the average slope across cluster (\(\gamma_{01}\))
Random effects/ level 1 residual:
Recall EQ@ref(eq:comb-eq):
\(Y_{ij} = \gamma_{00} + u_{0j} + (\gamma_{01} + u_{1j})X_{ij} + r_{ij}\) (rearrange equation)
\(r_{ij} = Y_{ij} - \gamma_{00} - u_{0j} - \gamma_{01}X_{ij} - u_{1j}X_{ij}\) → deviation of the observed value (\(Y_{ij}\)) from the conditional expected value, given the predictor (\(X_{ij}\)) in cluster \(j\)
This is a test