9 Descriptive Statistics
9.1 Preface: Data matrix
Variables (e.g., characteristics), units (e.g., persons) and data (e.g., measurements) are often presented in matrix form. A matrix is a rectangular arrangement of \(n \cdot p\) quantities and has the following form:
\[ \begin{bmatrix} X_{11} & X_{12} & \cdots & X_{1p} \\ X_{21} & X_{22} & \cdots & X_{2p} \\ \vdots & \vdots & & \vdots \\ X_{n1} & X_{n2} & \cdots & X_{np} \end{bmatrix} \]
- \(n\) rows; a single row is also known as a row vector or row matrix
- \(p\) columns; a single column is also known as a column vector or column matrix
see Eid et al. (2013)
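To make the matrix notation concrete, here is a minimal sketch in R. The values and the names (`X`, `person1`, `height`, `weight`) are made up for illustration and do not come from the example data set.

```r
# Hypothetical data matrix: n = 3 persons (rows), p = 2 variables (columns)
X <- matrix(c(170, 65,
              182, 80,
              165, 58),
            nrow = 3, ncol = 2, byrow = TRUE,
            dimnames = list(c("person1", "person2", "person3"),
                            c("height", "weight")))

dim(X)          # number of rows (n) and number of columns (p)
X[2, ]          # one row: all measurements of the second person
X[, "weight"]   # one column: one variable across all persons
```

Indexing with `X[i, j]` mirrors the subscripts in \(X_{ij}\): the first index selects the row, the second the column.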
9.2 Overview
- Quantiles (not covered)
- Measures of variability
  - Standard deviation
  - Variance
  - Range (Minimum, Maximum)
  - Interquartile range (not covered)
  - Semi-interquartile range (not covered)
  - …
- Measures of shape
  - Skewness (not covered)
  - Kurtosis (not covered)
9.3 Example data set
Consider the following two vectors within the example data set.
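The definition of the example data set is not reproduced here. The sketch below is a reconstruction under assumptions: the values of `num_vec` are chosen because they reproduce the results reported in the following sections (mean 3.8, median 3, variance 7.7, range 1 to 8), and `chr_vec` is a purely hypothetical categorical vector.

```r
# Assumption: reconstructed values, not copied from the original data set
ex_dat <- data.frame(
  num_vec = c(1, 2, 5, 3, 8),            # consistent with the reported statistics
  chr_vec = c("a", "b", "a", "c", "a")   # hypothetical categorical vector
)

str(ex_dat)   # inspect the structure of the data set
```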
9.4 Absolute Frequencies
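One common way to obtain absolute frequencies in R is the `table()` function. The following is a minimal sketch; the vector `chr_vec` is hypothetical and only serves as an illustration.

```r
# Hypothetical categorical vector
chr_vec <- c("a", "b", "a", "c", "a")

# Count how often each category occurs
abs_freq <- table(chr_vec)
abs_freq
```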
9.5 Relative Frequencies
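Relative frequencies are the absolute frequencies divided by the number of observations. A minimal sketch, assuming a hypothetical vector `chr_vec`, uses `prop.table()` on the result of `table()`:

```r
# Hypothetical categorical vector
chr_vec <- c("a", "b", "a", "c", "a")

# Absolute frequencies divided by the total number of observations
rel_freq <- prop.table(table(chr_vec))
rel_freq

# The same proportions expressed as percentages
round(100 * rel_freq, 1)
```

By construction, the relative frequencies sum to 1.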
9.6 Mean
How to calculate the mean in R?
with(ex_dat,
     mean(num_vec))
[1] 3.8
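The mean is the sum of all values divided by the number of values. The sketch below computes it by hand; the vector `x` is an assumption, chosen so that it reproduces the result reported above.

```r
# Assumed values, consistent with the reported mean of 3.8
x <- c(1, 2, 5, 3, 8)

# Sum of the values divided by the number of values
manual_mean <- sum(x) / length(x)
manual_mean

# Compare with the built-in function
all.equal(manual_mean, mean(x))
```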
9.7 A brief note on missing data
In R, missing values are represented by the symbol NA. By default, many basic functions (e.g., mean(), sd(), min()) do not ignore missing values: if the input contains an NA, the result is NA as well.
To demonstrate this, we create another example vector (num_vec2).
num_vec2 <- c(1, 2, 5, 3, 8, NA)
mean(num_vec2)
[1] NA
If there are missing values, we need to set the argument na.rm to TRUE so that they are removed before the computation.
mean(num_vec2, na.rm = TRUE)
[1] 3.8
Omitting or deleting missing values should, in most scenarios, be avoided altogether (Enders, 2025; Schafer & Graham, 2002).
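Before deciding how to handle missing values, it helps to know how many there are. The following sketch uses the same values as `num_vec2`. It also illustrates that `na.rm = TRUE` amounts to dropping the missing values before computing the statistic.

```r
x <- c(1, 2, 5, 3, 8, NA)

is.na(x)                     # logical vector: TRUE where a value is missing
n_missing <- sum(is.na(x))   # number of missing values
p_missing <- mean(is.na(x))  # proportion of missing values

# na.rm = TRUE gives the same result as removing the NAs by hand
mean_narm   <- mean(x, na.rm = TRUE)
mean_manual <- mean(x[!is.na(x)])
all.equal(mean_narm, mean_manual)
```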
9.8 Median
How to calculate the median in R?
with(ex_dat,
     median(num_vec))
[1] 3
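The median is the middle value of the sorted data (or the mean of the two middle values if the number of observations is even). A minimal sketch of this logic, with `x` assumed to hold values consistent with the result above:

```r
# Assumed values, consistent with the reported median of 3
x <- c(1, 2, 5, 3, 8)

x_sorted <- sort(x)
n <- length(x_sorted)

manual_median <- if (n %% 2 == 1) {
  x_sorted[(n + 1) / 2]                  # odd n: the middle value
} else {
  mean(x_sorted[c(n / 2, n / 2 + 1)])    # even n: mean of the two middle values
}
manual_median
```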
9.9 Variance
How to calculate the variance in R?
with(ex_dat,
     var(num_vec))
[1] 7.7
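Note that `var()` computes the sample variance, i.e., the sum of squared deviations from the mean divided by \(n - 1\) (not \(n\)). The sketch below reproduces the result by hand; the values of `x` are assumed so that they match the output above.

```r
# Assumed values, consistent with the reported variance of 7.7
x <- c(1, 2, 5, 3, 8)
n <- length(x)

# Sum of squared deviations from the mean, divided by n - 1
manual_var <- sum((x - mean(x))^2) / (n - 1)
manual_var

all.equal(manual_var, var(x))
```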
9.10 Standard Deviation
How to calculate the standard deviation in R?
with(ex_dat,
     sd(num_vec))
[1] 2.774887
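The standard deviation is the square root of the variance, which can be checked directly. As before, the values of `x` are an assumption chosen to match the reported result.

```r
# Assumed values, consistent with the reported standard deviation
x <- c(1, 2, 5, 3, 8)

# Square root of the sample variance
manual_sd <- sqrt(var(x))
manual_sd

all.equal(manual_sd, sd(x))
```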
9.11 Range
How to calculate the range in R?
with(ex_dat,
     range(num_vec))
[1] 1 8
Alternatively, calculate minimum and maximum separately…
with(ex_dat, {
  c(min(num_vec),
    max(num_vec))
})
[1] 1 8
Note that range() returns the minimum and the maximum, not their difference. To compute the range as a single value, apply Equation 9.7.
with(ex_dat,
     max(num_vec) - min(num_vec))
[1] 7
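A compact alternative is to combine `range()` with `diff()`, which returns the difference between consecutive elements. This is only a sketch, with `x` assumed to hold values consistent with the output above.

```r
# Assumed values, consistent with the reported minimum (1) and maximum (8)
x <- c(1, 2, 5, 3, 8)

# range() returns c(min, max); diff() subtracts the first from the second
range_width <- diff(range(x))
range_width
```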
9.12 Put everything together
Let us calculate descriptive statistics (e.g., mean, standard deviation, minimum and maximum) for multiple variables. For this exercise, we use a subset of the HSB dataset, which is provided in the merTools package (Knowles & Frederick, 2025):
#install.packages("merTools")
dat <- merTools::hsb
head(dat, 10)
schid minority female ses mathach size schtype meanses
1 1224 0 1 -1.528 5.876 842 0 -0.428
2 1224 0 1 -0.588 19.708 842 0 -0.428
3 1224 0 0 -0.528 20.349 842 0 -0.428
4 1224 0 0 -0.668 8.781 842 0 -0.428
5 1224 0 0 -0.158 17.898 842 0 -0.428
6 1224 0 0 0.022 4.583 842 0 -0.428
7 1224 0 1 -0.618 -2.832 842 0 -0.428
8 1224 0 0 -0.998 0.523 842 0 -0.428
9 1224 0 1 -0.888 1.527 842 0 -0.428
10 1224 0 0 -0.458 21.521 842 0 -0.428
There are also functions such as colMeans(), colSums(), rowMeans() and rowSums().
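As a brief illustration of these helper functions, the sketch below uses a small hypothetical data frame (`toy`) instead of the HSB data, so that the results are easy to verify by hand.

```r
# Hypothetical data frame with two numeric variables
toy <- data.frame(a = c(1, 2, 3),
                  b = c(10, 20, 30))

colMeans(toy)   # mean of each column (variable)
colSums(toy)    # sum of each column
rowMeans(toy)   # mean of each row (unit)
rowSums(toy)    # sum of each row
```

All four functions accept the argument `na.rm`, which works in the same way as for `mean()`.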
A flexible approach would be to use the apply() function…
myVar <- c("Math achievement" = "mathach",   # (1)
           "Gender" = "female",
           "Socioeconomic status" = "ses",
           "Class size" = "size")
ex_descr <- apply(                           # (2)
  X = dat[, myVar],                          # (3)
  MARGIN = 2,                                # (4)
  FUN = function(x) {                        # (5)
    ret <- c(                                # (6)
      mean(x, na.rm = TRUE),
      sd(x, na.rm = TRUE),
      min(x, na.rm = TRUE),
      max(x, na.rm = TRUE)
    )
    return(ret)                              # (7)
  })
1. Create a (named) character vector of the variables by using the c() function.
2. Use the apply() function to apply one or more functions to the data (here: 4 columns).
3. The input is the dataset with the selected columns of interest (see 1.).
4. MARGIN = 2 indicates that the function should be applied over columns.
5. Create the function that should be applied. Here we calculate the mean(), sd(), min() and max().
6. Create a temporary R object, which is returned later (here: the vector ret).
7. Return the temporary object and close the functions.
Print the results…
print(ex_descr)
mathach female ses size
[1,] 12.747853 0.5281837 0.0001433542 1056.8618
[2,] 6.878246 0.4992398 0.7793551951 604.1725
[3,] -2.832000 0.0000000 -3.7580000000 100.0000
[4,] 24.993000 1.0000000 2.6920000000 2713.0000
This format is awkward: variables should be in rows, not in columns. We therefore transpose the matrix with the t() function…
[,1] [,2] [,3] [,4]
mathach 1.274785e+01 6.8782457 -2.832 24.993
female 5.281837e-01 0.4992398 0.000 1.000
ses 1.433542e-04 0.7793552 -3.758 2.692
size 1.056862e+03 604.1724993 100.000 2713.000
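The transposed matrix still has unlabeled columns (`[,1]` to `[,4]`). One possible remedy, sketched here with a hypothetical data frame `toy` rather than the HSB data, is to return a named vector from the function. The sketch uses `sapply()`, which iterates over the columns of a data frame directly, whereas `apply()` first converts its input to a matrix.

```r
# Hypothetical data frame; b contains one missing value
toy <- data.frame(a = c(1, 2, 3, 4),
                  b = c(10, 20, 30, NA))

# Returning a named vector yields labeled statistics
descr <- sapply(toy, function(x) {
  c(Mean = mean(x, na.rm = TRUE),
    SD   = sd(x, na.rm = TRUE),
    Min  = min(x, na.rm = TRUE),
    Max  = max(x, na.rm = TRUE))
})

# Transpose: variables in rows, statistics in columns
descr_t <- t(descr)
descr_t
```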
Better, but still not really convincing…
library(flextable)
ex_descr_table <- ex_descr |>                    # (1)
  t() |>                                         # (2)
  as.data.frame() |>
  (\(d) cbind(names(myVar), d))() |>             # (3)
  flextable() |>                                 # (4)
  theme_apa() |>                                 # (5)
  set_header_labels(                             # (6)
    "names(myVar)" = "Variables",
    V1 = "Mean",
    V2 = "SD",
    V3 = "Min",
    V4 = "Max") |>
  align(part = "body", align = "center") |>      # (7)
  align(j = 1, part = "all", align = "left") |>
  add_footer_lines(                              # (8)
    as_paragraph(as_i("Note. "),
                 "This is a footnote.")
  ) |>
  align(align = "left", part = "footer") |>
  width(j = 1, width = 2, unit = "in") |>        # (9)
  width(j = 2:5, width = 1, unit = "in")
ex_descr_table                                   # (10)
1. Take the results (here: the ex_descr object)…
2. …transpose them (using the t() function) and coerce the result to a data.frame object (as.data.frame()).
3. Use a so-called lambda (or anonymous) function to bind (using the cbind() function) the variable names as the first column to the dataset.
4. Apply the flextable() function.
5. Use the APA theme (theme_apa()).
6. Rename the column headers (set_header_labels()).
7. Center the body of the table (align()) and left-align the first column.
8. Add a footnote (add_footer_lines()) and align it to the left.
9. Set the column widths (width()) to 2 inches for the first column and 1 inch for the remaining columns.
10. Print the table.
| Variables | Mean | SD | Min | Max |
|---|---|---|---|---|
| Math achievement | 12.75 | 6.88 | -2.83 | 24.99 |
| Gender | 0.53 | 0.50 | 0.00 | 1.00 |
| Socioeconomic status | 0.00 | 0.78 | -3.76 | 2.69 |
| Class size | 1,056.86 | 604.17 | 100.00 | 2,713.00 |
| Note. This is a footnote. | | | | |
9.13 Descriptive statistics with the psych package
Alternatively, it is convenient to use additional R packages such as the psych package (Revelle, 2026) to calculate descriptive statistics.
Here we use the describe() function (with the fast argument set to TRUE) to calculate the descriptive statistics of all variables within the example data set.
| vars | n | mean | sd | median | min | max | range | skew | kurtosis | se |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 7,185.00 | 0.27 | 0.45 | 0.00 | 0.00 | 1.00 | 1.00 | 1.01 | -0.98 | 0.01 |
| 2 | 7,185.00 | 0.53 | 0.50 | 1.00 | 0.00 | 1.00 | 1.00 | -0.11 | -1.99 | 0.01 |
| 3 | 7,185.00 | 0.00 | 0.78 | 0.00 | -3.76 | 2.69 | 6.45 | -0.23 | -0.38 | 0.01 |
| 4 | 7,185.00 | 12.75 | 6.88 | 13.13 | -2.83 | 24.99 | 27.82 | -0.18 | -0.92 | 0.08 |
| 5 | 7,185.00 | 1,056.86 | 604.17 | 1,016.00 | 100.00 | 2,713.00 | 2,613.00 | 0.57 | -0.36 | 7.13 |
| 6 | 7,185.00 | 0.49 | 0.50 | 0.00 | 0.00 | 1.00 | 1.00 | 0.03 | -2.00 | 0.01 |
| 7 | 7,185.00 | 0.01 | 0.41 | 0.04 | -1.19 | 0.83 | 2.02 | -0.27 | -0.48 | 0.00 |
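The call that produced the table above is not shown. A minimal sketch of how `describe()` could be used looks as follows. It relies on a hypothetical data frame `toy` rather than the HSB data, and the guard with `requireNamespace()` is an addition so that the code also runs when the psych package is not installed.

```r
# Hypothetical data frame; b contains one missing value
toy <- data.frame(a = c(1, 2, 3, 4),
                  b = c(10, 20, 30, NA))

if (requireNamespace("psych", quietly = TRUE)) {
  # One row of descriptive statistics per variable
  descr_psych <- psych::describe(toy, fast = TRUE)
  print(descr_psych)
} else {
  message("Package 'psych' is not installed; run install.packages(\"psych\") first.")
}
```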
9.14 Some Questions
… and needs to be completed.
What is NOT a measure of central tendency?
Most functions in R can deal appropriately with missing data if the argument na.rm is set to TRUE.