Descriptive Statistics

March 4, 2026

Agenda

Preface: Data matrix
Descriptive statistics
Exercise

Preface: Data matrix

Variables (e.g., characteristics), units (e.g., persons), and data (e.g., measurements) are often presented in matrix form. A matrix is a rectangular arrangement of \(n \cdot p\) quantities and looks as follows:

\[ \begin{bmatrix} X_{11} & X_{12} & \cdots & X_{1p} \\ X_{21} & X_{22} & \cdots & X_{2p} \\ \vdots & \vdots & & \vdots \\ X_{n1} & X_{n2} & \cdots & X_{np} \end{bmatrix} \]

\(n\) rows; a single row is also known as a row vector (or row matrix)
\(p\) columns; a single column is also known as a column vector (or column matrix)
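
As a minimal sketch in R (the object X and its values are made up for illustration), a data matrix with \(n = 3\) rows and \(p = 2\) columns can be created and indexed as follows:

```r
# Toy data matrix (hypothetical values): n = 3 units, p = 2 variables
X <- matrix(c(1, 2, 3,
              4, 5, 6),
            nrow = 3, ncol = 2)   # filled column by column

dim(X)     # number of rows (n) and columns (p)
X[1, ]     # first row: all variables of unit 1
X[, 2]     # second column: variable 2 for all units
```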

Descriptive statistics

Overview

Frequencies
- Absolute
- Relative

Measures of central tendency
- Mean
- Weighted mean (not covered)
- Weighted geometric mean (not covered)
- Median
- Mode (not covered)
- …

Quantiles

Measures of variability
- Standard deviation
- Variance
- Range (Minimum, Maximum)
- Interquartile range (not covered)
- Semi-interquartile range (not covered)
- …

Measures of shape
- Skewness (not covered)
- Kurtosis (not covered)

Example data set

Consider the following 2 vectors within the example data set.

ex_dat <- data.frame(
  num_vec = c(1, 2, 5, 3, 8),
  chr_vec = c("low", "med", "low", "high", "high")
)

Absolute Frequencies

An absolute frequency is the number of times a particular value or category appears in a variable. It is commonly abbreviated as \(n_j\), the number of observations with value/category \(j\).

Example Frequency table
Category \(j\)	Absolute Frequency (\(n_j\))
low (\(j=1\))	2
med (\(j=2\))	1
high (\(j=3\))	2
\(\sum\)	\(\sum_{j=1}^3n_j=n=5\)

with(ex_dat,
     table(chr_vec))

chr_vec
high  low  med 
   2    2    1

An important argument (useNA) and another useful function (addmargins())…

with(ex_dat,
     table(chr_vec, useNA = "always")) |>
  addmargins()

chr_vec
high  low  med <NA>  Sum 
   2    2    1    0    5
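
The useNA argument only makes a visible difference once a variable actually contains missing values. A small sketch, assuming a made-up vector chr_na (not part of the example data set):

```r
# Hypothetical vector with one missing value
chr_na <- c("low", "med", NA, "high", "high")

table(chr_na)                     # NA is silently dropped
table(chr_na, useNA = "ifany")    # NA gets its own category if present
```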

ex_dat |>
  dplyr::group_by(chr_vec) |>
  dplyr::summarise(absFreq = dplyr::n())

# A tibble: 3 × 2
  chr_vec absFreq
  <chr>     <int>
1 high          2
2 low           2
3 med           1

Relative Frequencies

Relative frequencies refer to the proportion of a specific value or category relative to the total number of observations (\(n\)).

\[ h_j=\frac{n_j}{n} \]

Example Frequency table
Category \(j\)	Absolute Frequency (\(n_j\))	Relative Frequency (\(h_j\))
low (\(j=1\))	2	0.40
med (\(j=2\))	1	0.20
high (\(j=3\))	2	0.40
\(\sum\)	\(\sum_{j=1}^3n_j=n=5\)	\(\sum_{j=1}^3h_j=1\)

Relative Frequencies in R

base
dplyr

with(ex_dat,
     table(chr_vec)/sum(table(ex_dat$chr_vec)))

chr_vec
high  low  med 
 0.4  0.4  0.2
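
The same proportions can also be obtained with the base R function prop.table(), which divides each cell of a table by the table total. A brief sketch (the vector is copied from the example data set so that the block runs on its own):

```r
chr_vec <- c("low", "med", "low", "high", "high")

abs_freq <- table(chr_vec)        # absolute frequencies n_j
rel_freq <- prop.table(abs_freq)  # relative frequencies h_j = n_j / n
rel_freq
sum(rel_freq)                     # relative frequencies sum to 1
```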

Another useful function (sprintf()) to force two decimal places and append a percent sign…

with(ex_dat,
     table(chr_vec)/sum(table(ex_dat$chr_vec))) |>
     (function(x) sprintf("%.2f%%", x*100))()

[1] "40.00%" "40.00%" "20.00%"

ex_dat |>
  dplyr::select(chr_vec) |>
  dplyr::group_by(chr_vec) |>
  dplyr::summarise(absFreq= dplyr::n()) |>
  dplyr::mutate(relFreq = absFreq/sum(absFreq))

# A tibble: 3 × 3
  chr_vec absFreq relFreq
  <chr>     <int>   <dbl>
1 high          2     0.4
2 low           2     0.4
3 med           1     0.2

Mean

The mean (or arithmetic mean, average) is the sum of a collection of numbers divided by the count of numbers in the collection. The formula is given in Equation 1.

\[ \bar{x} = \frac{1}{n} \sum_{i=1}^n x_i=\frac{x_1+x_2+\dots+x_n}{n} \tag{1}\]

For example, consider a vector of numbers: \(x = 1, 2, 5, 3, 8\)

\[ \bar{x} = \frac{(1+2+5+3+8)}{5}=3.8 \]

If the underlying data is a sample (i.e., a subset of a population), it is called the sample mean.

In R:

with(ex_dat,
     mean(num_vec))

[1] 3.8
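
Equation 1 can also be reproduced step by step, which may help to see what mean() computes. A minimal sketch:

```r
x <- c(1, 2, 5, 3, 8)

n <- length(x)          # number of observations
x_bar <- sum(x) / n     # Equation 1: sum divided by n
x_bar                   # same value as mean(x)
```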

A brief note on missing data

In R, missing values are represented by the symbol NA. By default, many basic functions (e.g., mean()) return NA as soon as the input contains a missing value.

To demonstrate this, we create another example vector (num_vec2).

num_vec2 <- c(1, 2, 5, 3, 8, NA)
mean(num_vec2)

[1] NA

If there is missing data, we need to set the argument na.rm to TRUE so that missing values are removed before the calculation.

mean(num_vec2, na.rm = TRUE)

[1] 3.8

Omitting or deleting missing values should, in most scenarios, be avoided altogether (Enders, 2025; Schafer & Graham, 2002).
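
Whichever way missing data are handled, it helps to know how many values are missing. A short sketch using is.na() on the vector from above:

```r
num_vec2 <- c(1, 2, 5, 3, 8, NA)

is.na(num_vec2)                # logical vector: TRUE where a value is missing
sum(is.na(num_vec2))           # number of missing values
mean(is.na(num_vec2))          # proportion of missing values
sd(num_vec2, na.rm = TRUE)     # na.rm is also available in sd() and median()
```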

Median

The median is the value separating the higher half from the lower half of a data sample, a population, or a probability distribution. For a data set, it may be thought of as “the middle” value. The formulas are given in Equation 2, where \(x_k\) denotes the \(k\)-th value of the ordered data.

\[ Mdn = \widetilde{x} = \begin{cases} x_{(n+1)/2} & \:\: \text{if } n \text{ is odd} \\ (x_{n/2} + x_{(n/2)+1}) / 2 & \:\: \text{if } n \text{ is even} \end{cases} \tag{2}\]

Consider again the vector of numbers \(x = 1, 2, 5, 3, 8\) with length \(n = 5\). To calculate the median, first order the vector, \(x = 1, 2, 3, 5, 8\), and then apply the corresponding formula (odd vs. even; here odd):

\[ \widetilde{x}=x_{\frac{(5+1)}{2}}=x_3 = 3 \]

In R:

with(ex_dat,
     median(num_vec))

[1] 3
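
The even case of Equation 2 is not covered by the example above. The following sketch implements both cases in a hypothetical helper function (median_manual is an illustrative name, not a base R function) and applies it to a made-up vector of even length:

```r
# Illustrative helper that follows Equation 2 (no handling of missing values)
median_manual <- function(x) {
  x_sorted <- sort(x)
  n <- length(x_sorted)
  if (n %% 2 == 1) {
    x_sorted[(n + 1) / 2]                          # odd n: middle value
  } else {
    (x_sorted[n / 2] + x_sorted[n / 2 + 1]) / 2    # even n: mean of the two middle values
  }
}

x_even <- c(1, 2, 5, 3, 8, 4)   # sorted: 1, 2, 3, 4, 5, 8
median_manual(x_even)           # (3 + 4) / 2 = 3.5
median(x_even)                  # base R gives the same result
```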

Variance

The variance is the expected squared deviation of a random variable from its mean. A distinction is usually made between the population variance and the sample variance. The formula of the population variance is given in Equation 3.

\[ VAR(X) = \sigma^2 = \frac{1}{N} \sum\limits_{i=1}^N (x_i - \mu)^2 \tag{3}\]

The formula of the sample variance is given in Equation 4.

\[ VAR(X) = s^2 = \frac{1}{n-1} \sum\limits_{i=1}^n (x_i - \bar{x})^2 \tag{4}\]

Using again the vector \(x = 1, 2, 5, 3, 8\), the sample variance is calculated as follows:

\[ VAR(X) = s^2 = \frac{1}{4}((1-3.8)^2 + (2-3.8)^2 + (5-3.8)^2 + (3-3.8)^2 + (8-3.8)^2) = \frac{30.8}{4} = 7.7 \]

In R:

with(ex_dat,
     var(num_vec))

[1] 7.7
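
Note that var() always uses the denominator \(n - 1\) (Equation 4). If the population variance (Equation 3) is needed, it can be computed by hand; a minimal sketch, treating the five values as the complete population:

```r
x <- c(1, 2, 5, 3, 8)
n <- length(x)

ss <- sum((x - mean(x))^2)   # sum of squared deviations
ss / (n - 1)                 # sample variance, identical to var(x)
ss / n                       # population variance (Equation 3)
```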

Quantiles

A \(p\)-quantile is the value \(x_p\) \((0 < p < 1)\) such that at least \(p \cdot 100\,\%\) of the data are less than or equal to \(x_p\), and at least \((1 - p) \cdot 100\,\%\) of the data are greater than or equal to \(x_p\).

To calculate the \(p\)-quantile, we first sort the vector (\(x = 1, 2, 3, 5, 8\); see also Median) and then multiply the length of the vector (here 5) by \(p\) (e.g., 0.25 for the 25% quantile). If the result is not an integer, it is rounded up to the next integer, which gives the index of the quantile in the sorted vector.

\(x_{0.25} = x_{\lceil 5 \cdot 0.25 \rceil} = x_{\lceil 1.25 \rceil} = x_2 = 2\)

In R:

stats::quantile
stats::quantile approximation

with(ex_dat,
     quantile(num_vec, probs = c(0.25, 0.75)))

25% 75% 
  2   5

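p_quantile <- function(x, p) {
  x_sorted <- sort(x)                  # order the values (sort() drops NAs)
  n <- length(x_sorted)
  k <- ceiling(n * p)                  # index: n * p, rounded up
  out <- x_sorted[k]
  names(out) <- paste0(p * 100, "%")   # label the results like quantile() does
  out
}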

p_quantile(ex_dat$num_vec, c(0.25, 0.75))

25% 75% 
  2   5
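
The agreement between the two approaches is specific to this example. By default, quantile() uses type = 7, which interpolates linearly between neighbouring values, whereas the rounding-up rule closely matches type = 1 (the inverse of the empirical distribution function). A sketch with \(p = 0.1\), where the two differ:

```r
x <- c(1, 2, 5, 3, 8)

quantile(x, probs = 0.1)             # default (type = 7): interpolates between 1 and 2
quantile(x, probs = 0.1, type = 1)   # picks an observed value

sort(x)[ceiling(length(x) * 0.1)]    # manual rule: index ceiling(5 * 0.1) = 1
```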

Standard Deviation

The standard deviation is defined as the square root of the variance. Again, a distinction is made between the population and the sample standard deviation. The formula of the population standard deviation is given in Equation 5.

\[ SD(X) = \sigma = \sqrt{\sigma^2} \tag{5}\]

The formula of the sample standard deviation is given in Equation 6.

\[ SD(X) = s = \sqrt{s^2} \tag{6}\]

Recall from the variance calculation that the (sample) variance of the vector is \(7.7\).

\[ SD(X) = s = \sqrt{7.7} \approx 2.774887 \]

In R:

with(ex_dat,
     sd(num_vec))

[1] 2.774887
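
Equation 6 can be checked directly in R, and the population standard deviation (Equation 5) can be derived from the population variance. A minimal sketch:

```r
x <- c(1, 2, 5, 3, 8)
n <- length(x)

sqrt(var(x))                       # sample SD via Equation 6, same as sd(x)
sqrt(sum((x - mean(x))^2) / n)     # population SD via Equation 5
```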

Range

The range of a vector is the difference between the largest (maximum) and the smallest (minimum) values/observations.

\[ Range(x) = R = x_{max}-x_{min} \tag{7}\]

In R:

with(ex_dat,
     range(num_vec))

[1] 1 8

Alternatively, calculate minimum and maximum separately…

with(ex_dat,{
     c(min(num_vec),
       max(num_vec))})

[1] 1 8

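Note that range() returns the minimum and the maximum, not their difference. To compute the range as defined in Equation 7, subtract the minimum from the maximum.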

with(ex_dat,
     max(num_vec)-min(num_vec))

[1] 7
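
A compact alternative is to combine range() with diff(), which returns the difference between the two elements:

```r
x <- c(1, 2, 5, 3, 8)

diff(range(x))   # maximum minus minimum
```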

Exercise

Put everything together

Exercise

Let us calculate descriptive statistics (e.g., mean, standard deviation, minimum, and maximum) for multiple variables. For this exercise, we use a subset of the HSB dataset, which is provided in the merTools package (Knowles & Frederick, 2025):

#install.packages("merTools")
dat <- merTools::hsb
head(dat, 10)

   schid minority female    ses mathach size schtype meanses
1   1224        0      1 -1.528   5.876  842       0  -0.428
2   1224        0      1 -0.588  19.708  842       0  -0.428
3   1224        0      0 -0.528  20.349  842       0  -0.428
4   1224        0      0 -0.668   8.781  842       0  -0.428
5   1224        0      0 -0.158  17.898  842       0  -0.428
6   1224        0      0  0.022   4.583  842       0  -0.428
7   1224        0      1 -0.618  -2.832  842       0  -0.428
8   1224        0      0 -0.998   0.523  842       0  -0.428
9   1224        0      1 -0.888   1.527  842       0  -0.428
10  1224        0      0 -0.458  21.521  842       0  -0.428

Solution

Step 1: Calculate descriptives
Step 2: Print the results
Step 3: Create results table
Step 4: Export the table (optional)

A flexible approach would be to use the apply() function…

myVar <- c("Math achievement" = "mathach",  # <1>
           "Gender" = "female",
           "Socioeconomic status" = "ses",
           "Class size" = "size")

ex_descr <- apply(  # <2>
  X = dat[, myVar],  # <3>
  MARGIN = 2,  # <4>
  FUN = function(x) {  # <5>
    ret <- c(  # <6>
      mean(x, na.rm = TRUE),
      sd(x, na.rm = TRUE),
      min(x, na.rm = TRUE),
      max(x, na.rm = TRUE)
    )
    return(ret)  # <7>
  })

1: Create a (named) character vector of the variables by using the c() function.
2: Use the apply() function to apply one or more functions to the data (here: 4 columns).
3: The input is the dataset restricted to the columns of interest (see 1).
4: MARGIN = 2 indicates that the function is applied over columns.
5: Define the function to be applied. Here we calculate the mean(), sd(), min() and max().
6: Create a temporary R object that is returned later (here: the vector ret).
7: Return the temporary object, then close the function and the apply() call.
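
If the function returns a named vector, the statistics in the result are labelled. A sketch on a small made-up data frame (toy_dat and its values are hypothetical and not taken from the HSB data), using sapply() as a close relative of apply() that iterates over the columns of a data frame:

```r
# Hypothetical toy data (not the HSB data)
toy_dat <- data.frame(
  score = c(10, 12, 15, 9),
  size  = c(20, 25, 30, 25)
)

toy_descr <- sapply(toy_dat, function(x) {
  c(Mean = mean(x, na.rm = TRUE),
    SD   = sd(x, na.rm = TRUE),
    Min  = min(x, na.rm = TRUE),
    Max  = max(x, na.rm = TRUE))
})

t(toy_descr)   # variables in rows, named statistics in columns
```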

Print the results…

print(ex_descr)

       mathach    female           ses      size
[1,] 12.747853 0.5281837  0.0001433542 1056.8618
[2,]  6.878246 0.4992398  0.7793551951  604.1725
[3,] -2.832000 0.0000000 -3.7580000000  100.0000
[4,] 24.993000 1.0000000  2.6920000000 2713.0000

This is a weird format; variables should be in rows not columns. Transpose…

ex_descr |>
  t() |>
  print()

                [,1]        [,2]    [,3]     [,4]
mathach 1.274785e+01   6.8782457  -2.832   24.993
female  5.281837e-01   0.4992398   0.000    1.000
ses     1.433542e-04   0.7793552  -3.758    2.692
size    1.056862e+03 604.1724993 100.000 2713.000

Better, but still not really convincing…

library(flextable)
ex_descr_table <- ex_descr |>  # <1>
  t() |>  # <2>
  as.data.frame() |>
  (\(d) cbind(names(myVar), d))() |>  # <3>
  flextable() |>  # <4>
  theme_apa() |>  # <5>
  set_header_labels(  # <6>
    "names(myVar)" = "Variables",
    V1 = "Mean",
    V2 = "SD",
    V3 = "Min",
    V4 = "Max") |>
  align(part = "body", align = "center") |>  # <7>
  align(j = 1, part = "all", align = "left") |>
  add_footer_lines(  # <8>
    as_paragraph(as_i("Note. "),
                 "This is a footnote.")
  ) |>
  align(align = "left", part = "footer") |>
  width(j = 1, width = 2, unit = "in") |>  # <9>
  width(j = 2:5, width = 1, unit = "in")

ex_descr_table  # <10>

1: Take the results (here: ex_descr object)…
2: …transpose them (using the t() function) and coerce them to a data.frame object (as.data.frame()).
3: Use an anonymous (lambda) function to bind (using the cbind() function) the variable names as the first column to the dataset.
4: Apply the flextable() function.
5: Use the APA theme (theme_apa()).
6: Rename the column names (set_header_labels()).
7: Center the body of the table and left-align the first column (align()).
8: Add a footnote (add_footer_lines()) and align it to the left.
9: Change the column widths (width()) to 2 inches for the first column and 1 inch for the others.
10: Print the table.
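
Step 3 relies on calling an anonymous function on the right-hand side of the native pipe; the extra pair of parentheses at the end is what actually calls it. A standalone sketch (requires R 4.1 or later; the centring example is made up for illustration):

```r
x <- c(1, 2, 5, 3, 8)

# \(v) ... is shorthand for function(v) ...; the trailing () calls it
centered <- x |> (\(v) v - mean(v))()
centered
```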

Table 1: Descriptive statistics

Variables	Mean	SD	Min	Max
Math achievement	12.75	6.88	-2.83	24.99
Gender	0.53	0.50	0.00	1.00
Socioeconomic status	0.00	0.78	-3.76	2.69
Class size	1,056.86	604.17	100.00	2,713.00
Note. This is a footnote.

If you are not using Quarto (which you should be, though) and need to export the table, you can use the save_as_docx() function.

ex_descr_table |>
  set_caption(caption = "Table X.\nDescriptive statistics") |>
  save_as_docx(path = "descr-tab.docx")

Descriptive statistics with the psych package

Alternatively, it is convenient to use additional R packages such as the psych package (Revelle, 2026) to calculate descriptive statistics.
Here we use the describe() function (with the fast argument set to TRUE) to calculate the descriptive statistics of all variables in the example data set except the school identifier (schid), which is dropped first.

dat |>
  subset(select = -c(1)) |>
  psych::describe(fast = TRUE) |>
  flextable() |>
  colformat_double(digits = 2)

Table 2: Descriptive statistics with the psych package

vars	n	mean	sd	median	min	max	range	skew	kurtosis	se
1	7,185.00	0.27	0.45	0.00	0.00	1.00	1.00	1.01	-0.98	0.01
2	7,185.00	0.53	0.50	1.00	0.00	1.00	1.00	-0.11	-1.99	0.01
3	7,185.00	0.00	0.78	0.00	-3.76	2.69	6.45	-0.23	-0.38	0.01
4	7,185.00	12.75	6.88	13.13	-2.83	24.99	27.82	-0.18	-0.92	0.08
5	7,185.00	1,056.86	604.17	1,016.00	100.00	2,713.00	2,613.00	0.57	-0.36	7.13
6	7,185.00	0.49	0.50	0.00	0.00	1.00	1.00	0.03	-2.00	0.01
7	7,185.00	0.01	0.41	0.04	-1.19	0.83	2.02	-0.27	-0.48	0.00

Questions?

References

Eid, M., Gollwitzer, M., & Schmitt, M. (2013). Statistik und Forschungsmethoden: Lehrbuch ; mit Online-Materialien (3., korrigierte Auflage). Beltz.

Enders, C. K. (2025). Missing data: An update on the state of the art. Psychological Methods, 30(2), 322–339. https://doi.org/10.1037/met0000563

Knowles, J. E., & Frederick, C. (2025). merTools: Tools for analyzing mixed effect regression models. https://doi.org/10.32614/CRAN.package.merTools

Revelle, W. (2026). Psych: Procedures for psychological, psychometric, and personality research. https://personality-project.org/r/psych/

Schafer, J. L., & Graham, J. W. (2002). Missing data: Our view of the state of the art. Psychological Methods, 7(2), 147–177. https://doi.org/10.1037/1082-989X.7.2.147