10 Descriptive Statistics – Coding and Teaching Library

10.1 Preface: Data matrix

Variables (e.g., characteristics), units (e.g., persons) and data (e.g., measurements) are often presented in matrix form. A matrix is a rectangular arrangement of \(n \cdot p\) quantities in \(n\) rows and \(p\) columns:

\[ \begin{bmatrix} X_{11} & X_{12} & \cdots & X_{1p} \\ X_{21} & X_{22} & \cdots & X_{2p} \\ \vdots & \vdots & & \vdots \\ X_{n1} & X_{n2} & \cdots & X_{np} \end{bmatrix} \]

\(n\) rows; a single row is also called a row vector (or row matrix)
\(p\) columns; a single column is also called a column vector (or column matrix)

see Eid et al. (2013)
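
As a minimal sketch of this layout in R, the following builds a small matrix with \(n = 3\) rows and \(p = 2\) columns. The values and the column names var1 and var2 are made up for illustration and do not come from the example data used later.

```r
# Hypothetical data: n = 3 units (rows) measured on p = 2 variables (columns).
# matrix() fills column by column, so var1 = 1, 2, 3 and var2 = 4, 5, 6.
X <- matrix(c(1, 2, 3,
              4, 5, 6),
            nrow = 3, ncol = 2,
            dimnames = list(NULL, c("var1", "var2")))

dim(X)        # number of rows (n) and columns (p)
X[1, ]        # first row: all measurements of unit 1
X[, "var2"]   # second column: all measurements of variable 2
X[2, 1]       # single element X_21
```

Indexing follows the subscripts in the matrix above: the first index selects the row (unit), the second the column (variable).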

10.2 Overview

Frequencies
- Absolute
- Relative

Measures of central tendency
- Mean
- Weighted mean (not covered)
- Weighted geometric mean (not covered)
- Median
- Mode (not covered)
- …

Quantiles

Measures of variability
- Standard deviation
- Variance
- Range (Minimum, Maximum)
- Interquartile range (not covered)
- Semi-interquartile range (not covered)
- …

Measures of shape
- Skewness (not covered)
- Kurtosis (not covered)

10.3 Example data set

Consider the following 2 vectors within the example data set.

ex_dat <- data.frame(
  num_vec = c(1, 2, 5, 3, 8),
  chr_vec = c("low", "med", "low", "high", "high")
)

10.4 Absolute Frequencies

Absolute frequencies are the counts of how often a particular value or category occurs in a variable. The absolute frequency of category \(j\) is denoted \(n_j\); summed over all categories, the \(n_j\) give the total number of observations \(n\).

Example Frequency table
Category \(j\)	Absolute Frequency (\(n_j\))
low (\(j=1\))	2
med (\(j=2\))	1
high (\(j=3\))	2
\(\sum\)	\(\sum_{j=1}^3n_j=n=5\)

with(ex_dat,
     table(chr_vec))

chr_vec
high  low  med 
   2    2    1

An important argument (useNA) and another useful function (addmargins())…

with(ex_dat,
     table(chr_vec, useNA = "always")) |>
  addmargins()

chr_vec
high  low  med <NA>  Sum 
   2    2    1    0    5

ex_dat |>
  dplyr::group_by(chr_vec) |>
  dplyr::summarise(absFreq = dplyr::n())

# A tibble: 3 × 2
  chr_vec absFreq
  <chr>     <int>
1 high          2
2 low           2
3 med           1
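
Both approaches list the categories alphabetically (high, low, med), whereas the frequency table above uses the order low, med, high. A possible way to control the order is to convert the variable to a factor with explicit levels. This is a sketch that assumes low < med < high is the intended order; chr_fct and abs_freq are names introduced here for illustration.

```r
ex_dat <- data.frame(
  num_vec = c(1, 2, 5, 3, 8),
  chr_vec = c("low", "med", "low", "high", "high")
)

# Assumption: the categories have the natural order low < med < high
chr_fct <- factor(ex_dat$chr_vec, levels = c("low", "med", "high"))

# table() follows the order of the factor levels
abs_freq <- table(chr_fct)
abs_freq
```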

10.5 Relative Frequencies

Relative frequencies refer to the proportion of a specific value or category relative to the total number of observations (\(n\)).

\[ h_j=\frac{n_j}{n} \]

Example Frequency table
Category \(j\)	Absolute Frequency (\(n_j\))	Relative Frequency (\(h_j\))
low (\(j=1\))	2	0.40
med (\(j=2\))	1	0.20
high (\(j=3\))	2	0.40
\(\sum\)	\(\sum_{j=1}^3n_j=n=5\)	\(\sum_{j=1}^3h_j=1\)

with(ex_dat,
     table(chr_vec)/sum(table(ex_dat$chr_vec)))

chr_vec
high  low  med 
 0.4  0.4  0.2
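
The division by the total can also be delegated to the base R function prop.table(), which divides each cell of a table by the table's sum. A minimal sketch (rel_freq is a name chosen here for illustration):

```r
ex_dat <- data.frame(
  num_vec = c(1, 2, 5, 3, 8),
  chr_vec = c("low", "med", "low", "high", "high")
)

# prop.table() turns absolute frequencies n_j into relative frequencies h_j
rel_freq <- prop.table(table(ex_dat$chr_vec))
rel_freq

# The relative frequencies sum to 1
sum(rel_freq)
```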

Another useful function (sprintf()) forces two decimal places and appends a percent sign…

with(ex_dat,
     table(chr_vec)/sum(table(ex_dat$chr_vec))) |>
     (function(x) sprintf("%.2f%%", x*100))()

[1] "40.00%" "40.00%" "20.00%"

ex_dat |>
  dplyr::select(chr_vec) |>
  dplyr::group_by(chr_vec) |>
  dplyr::summarise(absFreq= dplyr::n()) |>
  dplyr::mutate(relFreq = absFreq/sum(absFreq))

# A tibble: 3 × 3
  chr_vec absFreq relFreq
  <chr>     <int>   <dbl>
1 high          2     0.4
2 low           2     0.4
3 med           1     0.2

10.6 Mean

The mean (or arithmetic mean, average) is the sum of a collection of numbers divided by the count of numbers in the collection. The formula is given in Equation 10.1.

\[ \bar{x} = \frac{1}{n} \sum_{i=1}^n x_i=\frac{x_1+x_2+\dots+x_n}{n} \tag{10.1}\]

For example, consider a vector of numbers: \(x = 1, 2, 5, 3, 8\)

\[ \bar{x} = \frac{(1+2+5+3+8)}{5}=3.8 \]

If the underlying data is a sample (i.e., a subset of a population), it is called the sample mean.

In R:

with(ex_dat,
     mean(num_vec))

[1] 3.8
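
To connect the function call with Equation 10.1, the mean can also be computed step by step. This is only an illustrative sketch; x, n and x_bar are names chosen here to mirror the symbols in the formula.

```r
x <- c(1, 2, 5, 3, 8)

n <- length(x)        # number of observations
x_bar <- sum(x) / n   # sum of all values divided by n (Equation 10.1)
x_bar

# Same result as the built-in function
all.equal(x_bar, mean(x))
```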

10.7 A brief note on missing data

In R, missing values are represented by the symbol NA. By default, many basic summary functions (e.g., mean()) do not skip missing values and return NA as soon as the input contains one.

To demonstrate this, we create another example vector (num_vec2).

num_vec2 <- c(1, 2, 5, 3, 8, NA)
mean(num_vec2)

[1] NA

To compute the statistic from the observed values only, we need to set the argument na.rm to TRUE.

mean(num_vec2, na.rm = TRUE)

[1] 3.8

Omitting or deleting missing values should, in most scenarios, be avoided altogether (Enders, 2025; Schafer & Graham, 2002).
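
Before setting na.rm = TRUE, it can help to check how much data is actually missing. The following sketch uses only base R functions; the vector is the same as above.

```r
num_vec2 <- c(1, 2, 5, 3, 8, NA)

anyNA(num_vec2)          # is there at least one missing value?
sum(is.na(num_vec2))     # number of missing values
mean(is.na(num_vec2))    # proportion of missing values
sum(!is.na(num_vec2))    # number of observed values used when na.rm = TRUE
```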

10.8 Median

The median is the value separating the higher half from the lower half of a data sample, a population, or a probability distribution. For a data set, it may be thought of as “the middle” value. The formulas are given in Equation 10.2.

\[ Mdn = \widetilde{x} = \begin{cases} x_{(n+1)/2} & \:\: \text{if } n \text{ is odd} \\ (x_{n/2} + x_{(n/2)+1}) / 2 & \:\: \text{if } n \text{ is even} \end{cases} \tag{10.2}\]

Consider again the vector of numbers: \(x = 1, 2, 5, 3, 8\) with length \(n = 5\). To calculate the median, first sort the vector, \(x = 1, 2, 3, 5, 8\), and then apply the corresponding formula (odd vs. even; here odd):

\[ \widetilde{x}=x_{\frac{(5+1)}{2}}=x_3 = 3 \]

In R:

with(ex_dat,
     median(num_vec))

[1] 3
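
The worked example covers only the odd case of Equation 10.2. The following sketch implements both cases; median_manual() is a hypothetical helper written for illustration, and the sixth value (13) in the second call is made up to obtain an even \(n\).

```r
# Illustrative implementation of Equation 10.2 (not a replacement for median())
median_manual <- function(x) {
  x_sorted <- sort(x)                # sort() also drops missing values
  n <- length(x_sorted)
  if (n %% 2 == 1) {
    x_sorted[(n + 1) / 2]            # odd n: the middle value
  } else {
    (x_sorted[n / 2] + x_sorted[n / 2 + 1]) / 2   # even n: mean of the two middle values
  }
}

median_manual(c(1, 2, 5, 3, 8))       # odd n
median_manual(c(1, 2, 5, 3, 8, 13))   # even n (hypothetical extra value)
```

For the even case the sorted vector is 1, 2, 3, 5, 8, 13, so the median is the mean of the third and fourth values.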

10.9 Variance

The variance is the expectation of the squared deviation of a random variable from its mean. A distinction is usually made between the population variance and the sample variance. The formula of the population variance is given in Equation 10.3.

\[ VAR(X) = \sigma^2 = \frac{1}{N} \sum\limits_{i=1}^N (x_i - \mu)^2 \tag{10.3}\]

The formula of the sample variance is given in Equation 10.4.

\[ VAR(X) = s^2 = \frac{1}{n-1} \sum\limits_{i=1}^n (x_i - \bar{x})^2 \tag{10.4}\]

Using again the vector \(x = 1, 2, 5, 3, 8\), the sample variance is calculated as follows:

\[ s^2 = \frac{1}{4}((1-3.8)^2 + (2-3.8)^2 + (5-3.8)^2 + (3-3.8)^2 + (8-3.8)^2) = 7.7 \]

In R:

with(ex_dat,
     var(num_vec))

[1] 7.7
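
Note that var() computes the sample variance (Equation 10.4). The sketch below reproduces it by hand and, purely for illustration, also applies Equation 10.3 by treating the five values as if they were the entire population. The names ss, s2 and sigma2 are chosen here and are not part of the original material.

```r
x <- c(1, 2, 5, 3, 8)
n <- length(x)

ss <- sum((x - mean(x))^2)   # sum of squared deviations from the mean

s2 <- ss / (n - 1)           # sample variance (Equation 10.4)
sigma2 <- ss / n             # population variance (Equation 10.3), illustration only

s2
sigma2
all.equal(s2, var(x))        # var() divides by n - 1
```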

10.10 Quantiles

A \(p\)-quantile is the value \(x_p\) \((0 < p < 1)\) such that at least \(p \cdot 100\%\) of the data are less than or equal to \(x_p\), and at least \((1 - p) \cdot 100\%\) of the data are greater than or equal to \(x_p\).

To calculate the \(p\)-quantile, we first sort the vector (\(x = 1, 2, 3, 5, 8\); see also Median) and then multiply the length of the vector (here 5) by \(p\) (e.g., 0.25 for the 25% quantile). If the product is not an integer, it is rounded up to the next integer, which gives the index of the quantile in the sorted vector.

\(x_{0.25} = x_{\lceil 5 \cdot 0.25 \rceil} = x_{\lceil 1.25 \rceil} = x_2 = 2\)

In R:

with(ex_dat,
     quantile(num_vec, probs = c(0.25, 0.75)))

25% 75% 
  2   5

p_quantile <- function(x, p) {
  x_sorted <- sort(x)
  n <- length(x_sorted)
  k <- ceiling(n * p)
  out <- x_sorted[k]
  names(out) <- paste0(p * 100, "%")
  out
}

p_quantile(ex_dat$num_vec, c(0.25, 0.75))

25% 75% 
  2   5
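
The agreement between quantile() and the rounding-up rule in this example should not be taken for granted: quantile() offers several definitions through its type argument, and the default (type = 7) interpolates linearly between neighbouring values. The following sketch compares the rule from the text with type = 1 and with the default; the additional probability 0.10 is chosen here to make the difference visible.

```r
x <- c(1, 2, 5, 3, 8)
probs <- c(0.10, 0.25, 0.75)

# Rule from the text: sort, then take the element at index ceiling(n * p)
by_rule <- sort(x)[ceiling(length(x) * probs)]
by_rule

# type = 1 (inverse of the empirical distribution function) gives the same values here
q_type1 <- quantile(x, probs = probs, type = 1)
q_type1

# Default (type = 7): linear interpolation, differs for the 10% quantile
q_type7 <- quantile(x, probs = probs)
q_type7
```

When reporting quantiles, it is therefore worth stating which definition was used.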

10.11 Standard Deviation

The standard deviation is defined as the square root of the variance. Again, a distinction is made between the population and the sample standard deviation. The formula of the population standard deviation is given in Equation 10.5.

\[ SD(X) = \sigma = \sqrt{\sigma^2} \tag{10.5}\]

The formula of the sample standard deviation is given in Equation 10.6.

\[ SD(X) = s = \sqrt{s^2} \tag{10.6}\]

Recall from the variance calculation in Section 10.9 that the (sample) variance of the vector is \(7.7\).

\[ SD(X) = s = \sqrt{7.7} \approx 2.774887 \]

In R:

with(ex_dat,
     sd(num_vec))

[1] 2.774887
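
As with var(), sd() returns the sample version (Equation 10.6). A short sketch of the relation to the variance follows; the population value is computed only for illustration by rescaling the sample variance, and s and sigma are names chosen here.

```r
x <- c(1, 2, 5, 3, 8)
n <- length(x)

s <- sqrt(var(x))                      # sample standard deviation (Equation 10.6)
all.equal(s, sd(x))                    # identical to sd()

# Population standard deviation (Equation 10.5), treating x as the whole population:
# var() divides by n - 1, so rescale by (n - 1) / n before taking the square root
sigma <- sqrt(var(x) * (n - 1) / n)
sigma
```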

10.12 Range

The range of a vector is the difference between the largest (maximum) and the smallest (minimum) values/observations.

\[ Range(x) = R = x_{max}-x_{min} \tag{10.7}\]

In R:

with(ex_dat,
     range(num_vec))

[1] 1 8

Alternatively, calculate minimum and maximum separately…

with(ex_dat,{
     c(min(num_vec),
       max(num_vec))})

[1] 1 8

To compute the range apply Equation 10.7.

with(ex_dat,
     max(num_vec)-min(num_vec))

[1] 7
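
Because range() returns the minimum and the maximum as a vector of length two, Equation 10.7 can also be written with diff(). A minimal sketch, including a made-up vector with a missing value:

```r
x <- c(1, 2, 5, 3, 8)

# diff() subtracts the first element (minimum) from the second (maximum)
diff(range(x))

# With missing values, pass na.rm = TRUE to range()
x_na <- c(x, NA)   # hypothetical vector with one missing value
diff(range(x_na, na.rm = TRUE))
```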

10.13 Put everything together

Exercise

Let us calculate descriptive statistics (e.g., mean, standard deviation, minimum and maximum) for multiple variables. For this exercise, we use a subset of the HSB dataset, which is provided in the merTools package (Knowles & Frederick, 2025; for some details see here):

#install.packages("merTools")
dat <- merTools::hsb
head(dat, 10)

   schid minority female    ses mathach size schtype meanses
1   1224        0      1 -1.528   5.876  842       0  -0.428
2   1224        0      1 -0.588  19.708  842       0  -0.428
3   1224        0      0 -0.528  20.349  842       0  -0.428
4   1224        0      0 -0.668   8.781  842       0  -0.428
5   1224        0      0 -0.158  17.898  842       0  -0.428
6   1224        0      0  0.022   4.583  842       0  -0.428
7   1224        0      1 -0.618  -2.832  842       0  -0.428
8   1224        0      0 -0.998   0.523  842       0  -0.428
9   1224        0      1 -0.888   1.527  842       0  -0.428
10  1224        0      0 -0.458  21.521  842       0  -0.428

There are also functions such as colMeans(), colSums(), rowMeans() and rowSums().
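
A brief sketch of these helpers on a small made-up data frame (toy is hypothetical and not part of the HSB data); like mean(), they accept na.rm:

```r
# Hypothetical data with one missing value
toy <- data.frame(a = c(1, 2, 3),
                  b = c(4, NA, 6))

col_means <- colMeans(toy, na.rm = TRUE)   # one mean per column
row_sums  <- rowSums(toy, na.rm = TRUE)    # one sum per row

col_means
row_sums
```

These functions are convenient for a single statistic, but each call computes only one kind of summary.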

A flexible approach would be to use the apply() function…

myVar <- c("Math achievement" = "mathach",   # (1)
           "Gender" = "female",
           "Socioeconomic status" = "ses",
           "Class size" = "size")

ex_descr <- apply(                           # (2)
  X = dat[, myVar],                          # (3)
  MARGIN = 2,                                # (4)
  FUN = function(x) {                        # (5)
    ret <- c(                                # (6)
      mean(x, na.rm = TRUE),
      sd(x, na.rm = TRUE),
      min(x, na.rm = TRUE),
      max(x, na.rm = TRUE)
    )
    return(ret)                              # (7)
  })

1: Create a (named) character vector of the variables by using the c() function.
2: Use the apply() function to apply one or more functions to the data (here: 4 columns).
3: The input is the dataset restricted to the columns of interest (see 1).
4: MARGIN = 2 indicates that the function should be applied over columns.
5: Create the function that should be applied. Here we calculate the mean(), sd(), min() and max().
6: Create a temporary R object that is returned later (here: the vector ret).
7: Return the temporary object and close the function.

Print the results…

print(ex_descr)

       mathach    female           ses      size
[1,] 12.747853 0.5281837  0.0001433542 1056.8618
[2,]  6.878246 0.4992398  0.7793551951  604.1725
[3,] -2.832000 0.0000000 -3.7580000000  100.0000
[4,] 24.993000 1.0000000  2.6920000000 2713.0000

This is a weird format; variables should be in rows not columns. Transpose…

ex_descr |>
  t() |>
  print()

                [,1]        [,2]    [,3]     [,4]
mathach 1.274785e+01   6.8782457  -2.832   24.993
female  5.281837e-01   0.4992398   0.000    1.000
ses     1.433542e-04   0.7793552  -3.758    2.692
size    1.056862e+03 604.1724993 100.000 2713.000
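
The uninformative column labels [,1] to [,4] arise because the applied function returns an unnamed vector. One possible variation, shown here on a small made-up data frame (toy and descr are hypothetical names), is to name the elements of the returned vector so that the statistics are labelled automatically. sapply() is used instead of apply() because it iterates over the columns of a data frame directly.

```r
# Hypothetical data, used only to keep the sketch self-contained
toy <- data.frame(a = c(1, 2, 3),
                  b = c(10, 20, 60))

# Naming the elements of the returned vector yields labelled rows
descr <- sapply(toy, function(x) {
  c(Mean = mean(x, na.rm = TRUE),
    SD   = sd(x, na.rm = TRUE),
    Min  = min(x, na.rm = TRUE),
    Max  = max(x, na.rm = TRUE))
})

# After transposing, variables are in rows and statistics in named columns
t(descr)
```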

Better, but still not really convincing…

library(flextable)
ex_descr_table <- ex_descr |>                    # (1)
  t() |>                                         # (2)
  as.data.frame() |>
  (\(d) cbind(names(myVar), d))() |>             # (3)
  flextable() |>                                 # (4)
  theme_apa() |>                                 # (5)
  set_header_labels(                             # (6)
    "names(myVar)" = "Variables",
    V1 = "Mean",
    V2 = "SD",
    V3 = "Min",
    V4 = "Max") |>
  align(part = "body", align = "center") |>      # (7)
  align(j = 1, part = "all", align = "left") |>
  add_footer_lines(                              # (8)
    as_paragraph(as_i("Note. "),
                 "This is a footnote.")
  ) |>
  align(align = "left", part = "footer") |>
  width(j = 1, width = 2, unit = "in") |>        # (9)
  width(j = 2:5, width = 1, unit = "in")

ex_descr_table                                   # (10)

1: Take the results (here: the ex_descr object)…
2: …transpose them (using the t() function) and coerce them to a data.frame object (as.data.frame()).
3: Use an anonymous (lambda) function to bind the variable names to the dataset as the first column (using the cbind() function).
4: Apply the flextable() function.
5: Use the APA theme (theme_apa()).
6: Rename the column headers (set_header_labels()).
7: Center the body of the table and left-align the first column (align()).
8: Add a footnote (add_footer_lines()) and align it to the left.
9: Change the column widths (width()) to 2 inches for the first column and 1 inch for the others.
10: Print the table.

Table 10.1: Descriptive statistics

Variables	Mean	SD	Min	Max
Math achievement	12.75	6.88	-2.83	24.99
Gender	0.53	0.50	0.00	1.00
Socioeconomic status	0.00	0.78	-3.76	2.69
Class size	1,056.86	604.17	100.00	2,713.00
Note. This is a footnote.

If you are not using Quarto (which you should be, though) and you need to export the table, you can use the save_as_docx() function.

ex_descr_table |>
  set_caption(caption = "Table X.\nDescriptive statistics") |>
  save_as_docx(path = "descr-tab.docx")

10.14 Descriptive statistics with the psych package

Alternatively, it is convenient to use additional R packages such as the psych package (Revelle, 2026) to calculate descriptive statistics.
Here we use the describe() function (with the fast argument set to TRUE) to calculate the descriptive statistics of all variables in the example data set except the school identifier (the first column).

dat |>
  subset(select = -c(1)) |>
  psych::describe(fast = TRUE) |>
  flextable() |>
  colformat_double(digits = 2)

Table 10.2: Descriptive statistics with the psych package

vars	n	mean	sd	median	min	max	range	skew	kurtosis	se
1	7,185.00	0.27	0.45	0.00	0.00	1.00	1.00	1.01	-0.98	0.01
2	7,185.00	0.53	0.50	1.00	0.00	1.00	1.00	-0.11	-1.99	0.01
3	7,185.00	0.00	0.78	0.00	-3.76	2.69	6.45	-0.23	-0.38	0.01
4	7,185.00	12.75	6.88	13.13	-2.83	24.99	27.82	-0.18	-0.92	0.08
5	7,185.00	1,056.86	604.17	1,016.00	100.00	2,713.00	2,613.00	0.57	-0.36	7.13
6	7,185.00	0.49	0.50	0.00	0.00	1.00	1.00	0.03	-2.00	0.01
7	7,185.00	0.01	0.41	0.04	-1.19	0.83	2.02	-0.27	-0.48	0.00

10.15 Some Questions

This is work in progress…

… and needs to be completed.

What is NOT a measure of central tendency?

Most functions in R can deal appropriately with missing data if the argument na.rm is set to TRUE.

Why do we divide by \(n-1\) instead of \(n\) when calculating the sample variance?
