Descriptive statistics

Author

Sven Rieger

Last updated on

May 15, 2024

Abstract

This material introduces basic descriptive statistics (e.g., mean and standard deviation). It is also available as a Revealjs presentation.

Warning

This page is a work in progress and under active development.

Revealjs Presentation

If you want to see the presentation in full screen go to Other Formats on the right.

Preface: Used packages

The following packages are used:

descrPkg <- c("merTools",
              "sn",
              "knitr",
              "flextable",
              "psych",
              "lavaan",
              "ggplot2")

Install packages when not already installed:

invisible(lapply(X = descrPkg,
                 FUN = function(x) {
                   # install only packages that are not yet available
                   if (!x %in% rownames(installed.packages())) {
                     install.packages(x)
                   }
                 }))

Load (a subset of) the required package(s) into the R session.

library(ggplot2)
library(flextable)

Preface: Cite the packages

Print list of packages and cite them via Pandoc citation.


```{r}
#| label: write-pkgs
#| code-fold: true
#| code-summary: "Show/hide fenced code"
#| output-location: fragment
#| output: asis

for (i in seq_along(descrPkg)) {
  
  cat(paste0(i, ". ",
             descrPkg[i],
             " [", "v", utils::packageVersion(descrPkg[i]),", @R-", descrPkg[i],
             "]\n"))
}
```

merTools (v0.6.1, Knowles & Frederick, 2024)
sn (v2.1.1, Azzalini, 2023)
knitr (v1.44, Xie, 2023)
flextable (v0.9.4, Gohel & Skintzos, 2024)
psych (v2.3.9, Revelle, 2024)
lavaan (v0.6.16, Rosseel et al., 2023)
ggplot2 (v3.5.0, Wickham et al., 2023)

Preface: Data matrix

Variables (e.g., characteristics), units (e.g., persons) and data (e.g., measurements) are often presented in matrix form. A matrix is a rectangular arrangement of \(n \cdot p\) quantities and looks as follows:

\[ \begin{bmatrix} X_{11} & X_{12} & \cdots & X_{1p} \\ X_{21} & X_{22} & \cdots & X_{2p} \\ \vdots & \vdots & & \vdots \\ X_{n1} & X_{n2} & \cdots & X_{np} \end{bmatrix} \]

\(n\) rows; a single row is also known as a row vector (or row matrix)
\(p\) columns; a single column is also known as a column vector (or column matrix)

see Eid et al. (2013)
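
As a minimal sketch of this idea in R, the following builds a small data matrix with \(n = 3\) units and \(p = 2\) variables. The values and the column names (`age`, `score`) are made up for illustration and are not part of the example data used later.

```r
# Hypothetical data: 3 units (rows) measured on 2 variables (columns).
# matrix() fills column by column by default.
X <- matrix(c(21, 34, 28,    # values of the first variable
              10, 14, 12),   # values of the second variable
            nrow = 3, ncol = 2,
            dimnames = list(NULL, c("age", "score")))

dim(X)         # number of rows (n) and columns (p)
X[2, ]         # all values of unit 2 (one row)
X[, "score"]   # all values of one variable (one column)
```

Indexing with `X[i, j]` mirrors the subscripts of \(X_{ij}\): the first index selects the unit, the second the variable.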

Overview

Frequencies
- Absolute
- Relative

Quantiles (not covered)

Measures of central tendency
- Mean
- Weighted mean (not covered)
- Weighted geometric mean (not covered)
- Median
- Mode (not covered)
- …

Measures of variability
- Standard deviation
- Variance
- Range (Minimum, Maximum)
- Interquartile range (not covered)
- Semi-interquartile range (not covered)
- …

Measures of shape
- Skewness (not covered)
- Kurtosis (not covered)

Example data set

Consider the following two vectors in the example data set:

exDat <- data.frame(
  numVec = c(1, 2, 5, 3, 8),
  chrVec = c("low", "med", "low", "high", "high")
)

Absolute Frequencies

Absolute frequencies refer to the number of times a particular value or category appears in a variable. They are commonly denoted \(n_j\), the number of observations with value/category \(j\).

Example Frequency table
Category \(j\)	Absolute Frequency (\(n_j\))
low (\(j=1\))	2
med (\(j=2\))	1
high (\(j=3\))	2
\(\sum\)	\(\sum_{j=1}^3n_j=n=5\)

with(exDat,
     table(chrVec))

chrVec
high  low  med 
   2    2    1

The useNA argument makes missing values visible in the table, and addmargins() adds the total:

with(exDat,
     table(chrVec, useNA = "always")) |>
  addmargins()

chrVec
high  low  med <NA>  Sum 
   2    2    1    0    5

exDat |>
  dplyr::group_by(chrVec) |>
  dplyr::summarise(absFreq = dplyr::n())

# A tibble: 3 × 2
  chrVec absFreq
  <chr>    <int>
1 high         2
2 low          2
3 med          1
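
Note that table() sorts character categories alphabetically (high, low, med), whereas the frequency table above lists them as low, med, high. One possible way to get that order, sketched here under the assumption that the categories are ordinal, is to convert the vector to a factor with explicit levels. The object names `chrFac` and `absFreq` are introduced only for this illustration.

```r
# Same categories as in exDat$chrVec
chrVec <- c("low", "med", "low", "high", "high")

# Declare the intended order of the categories explicitly
chrFac <- factor(chrVec, levels = c("low", "med", "high"))

# table() now follows the order of the factor levels
absFreq <- table(chrFac)
absFreq
```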

Relative Frequencies

Relative frequencies refer to the proportion of a specific value or category relative to the total number of observations (\(n\)).

\[ h_j=\frac{n_j}{n} \]

Example Frequency table
Category \(j\)	Absolute Frequency (\(n_j\))	Relative Frequency (\(h_j\))
low (\(j=1\))	2	0.40
med (\(j=2\))	1	0.20
high (\(j=3\))	2	0.40
\(\sum\)	\(\sum_{j=1}^3n_j=n=5\)	\(\sum_{j=1}^3h_j=1\)

Relative Frequencies in R

Two approaches are shown: base R first, then dplyr.

with(exDat,
     table(chrVec)/sum(table(chrVec)))

chrVec
high  low  med 
 0.4  0.4  0.2

Another useful function (sprintf()) forces two decimal places and appends a percent sign…

with(exDat,
     table(chrVec)/sum(table(chrVec))) |>
     (function(x) sprintf("%.2f%%", x*100))()

[1] "40.00%" "40.00%" "20.00%"

exDat |>
  dplyr::select(chrVec) |>
  dplyr::group_by(chrVec) |>
  dplyr::summarise(absFreq= dplyr::n()) |>
  dplyr::mutate(relFreq = absFreq/sum(absFreq))

# A tibble: 3 × 3
  chrVec absFreq relFreq
  <chr>    <int>   <dbl>
1 high         2     0.4
2 low          2     0.4
3 med          1     0.2

Mean

The mean (or arithmetic mean, average) is the sum of a collection of numbers divided by the count of numbers in the collection. The formula is given in Equation 1.

\[ \bar{x} = \frac{1}{n} \sum_{i=1}^n x_i=\frac{x_1+x_2+\dots+x_n}{n} \qquad(1)\]

For example, consider a vector of numbers: \(x = 1, 2, 5, 3, 8\)

\[ \bar{x} = \frac{(1+2+5+3+8)}{5}=3.8 \]

If the underlying data is a sample (i.e., a subset of a population), it is called the sample mean.

How to calculate the mean in R?

with(exDat,
     mean(numVec))

[1] 3.8
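
To connect the function call to Equation 1, the mean can also be computed step by step. This is only a sketch for illustration (the name `xBar` is chosen here); in practice mean() is preferable.

```r
x <- c(1, 2, 5, 3, 8)

n <- length(x)        # number of observations
xBar <- sum(x) / n    # sum of all values divided by n (Equation 1)
xBar                  # 3.8

# Same result as the built-in function
all.equal(xBar, mean(x))
```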

A brief note on missing data

In R, missing values/data are represented by the symbol NA. By default, many basic functions (e.g., mean()) return NA as soon as the input contains a missing value.

To demonstrate this, we create another example vector (numVec2).

numVec2 <- c(1, 2, 5, 3, 8, NA)
mean(numVec2)

[1] NA

If there are missing data, set the argument na.rm to TRUE so that missing values are removed before the calculation.

mean(numVec2, na.rm = TRUE)

[1] 3.8

Omitting or deleting missing values should, in most scenarios, be avoided altogether (Enders, 2023; Schafer & Graham, 2002).
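
Before deciding how to handle missing data, it helps to know how much is missing. A minimal sketch using is.na() follows; the object names `nMiss` and `propMiss` are introduced here for illustration.

```r
numVec2 <- c(1, 2, 5, 3, 8, NA)

# Logical vector: TRUE where a value is missing
is.na(numVec2)

# Number and proportion of missing values
nMiss <- sum(is.na(numVec2))
propMiss <- mean(is.na(numVec2))

nMiss      # 1
propMiss   # 1 out of 6 values
```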

Median

The median is the value separating the higher half from the lower half of a data sample, a population, or a probability distribution. For a data set, it may be thought of as “the middle” value. The formulas are given in Equation 2.

\[ Mdn = \widetilde{x} = \begin{cases} x_{(n+1)/2} & \:\: \text{if } n \text{ is odd} \\ (x_{n/2} + x_{(n/2)+1}) / 2 & \:\: \text{if } n \text{ is even} \end{cases} \qquad(2)\]

Consider again the vector of numbers: \(x = 1, 2, 5, 3, 8\) with length \(n = 5\). To calculate the median, first order the vector: \(x = 1, 2, 3, 5, 8\), and then apply the corresponding formula (odd vs. even; here odd):

\[ \widetilde{x}=x_{\frac{(5+1)}{2}}=x_3 = 3 \]

How to calculate the median in R?

with(exDat,
     median(numVec))

[1] 3
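
The two cases of Equation 2 can be sketched as a small function. The name `myMedian` is hypothetical and the function is for illustration only; it assumes a numeric vector with at least one non-missing value, and median() should be used in practice.

```r
myMedian <- function(x) {
  xSorted <- sort(x)      # order the values first (sort() drops NA)
  n <- length(xSorted)
  if (n %% 2 == 1) {
    # odd n: the middle value
    xSorted[(n + 1) / 2]
  } else {
    # even n: the mean of the two middle values
    (xSorted[n / 2] + xSorted[n / 2 + 1]) / 2
  }
}

myMedian(c(1, 2, 5, 3, 8))       # odd n: 3
myMedian(c(1, 2, 5, 3, 8, 10))   # even n: (3 + 5) / 2 = 4
```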

Variance

The variance is the expectation of the squared deviation of a random variable from its mean. A distinction is usually made between the population variance and the sample variance. The formula of the population variance is given in Equation 3.

\[VAR(X) = \sigma^2 = \frac{1}{N} \sum\limits_{i=1}^N (x_i - \mu)^2 \qquad(3)\]

The formula of the sample variance is given in Equation 4.

\[ VAR(X) = s^2 = \frac{1}{n-1} \sum\limits_{i=1}^n (x_i - \bar{x})^2 \qquad(4)\]

Using again the vector \(x = 1, 2, 5, 3, 8\), the sample variance is calculated as follows:

\[ VAR(X) = s^2 = \frac{1}{4}((1-3.8)^2 + (2-3.8)^2 + (5-3.8)^2 + (3-3.8)^2 + (8-3.8)^2) = \frac{30.8}{4} = 7.7 \]

How to calculate the variance in R?

with(exDat,
     var(numVec))

[1] 7.7
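
Note that var() computes the sample variance (denominator \(n-1\), Equation 4). The following sketch computes both variants by hand to show the difference; the object names (`ss`, `varSample`, `varPop`) are chosen here for illustration.

```r
x <- c(1, 2, 5, 3, 8)
n <- length(x)

# Sum of squared deviations from the mean
ss <- sum((x - mean(x))^2)

varSample <- ss / (n - 1)   # Equation 4: sample variance
varPop    <- ss / n         # Equation 3: treats x as the whole population

varSample   # 7.7, identical to var(x)
varPop      # 6.16
```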

Standard Deviation

The standard deviation is defined as the square root of the variance. Again, a distinction is made between the population and the sample standard deviation. The formula of the population standard deviation is given in Equation 5.

\[ SD(X) = \sigma = \sqrt{\sigma^2} \qquad(5)\]

The formula of the sample standard deviation is given in Equation 6.

\[ SD(X) = s = \sqrt{s^2} \qquad(6)\]

Recall from the variance calculation in the previous section that the (sample) variance of the vector is \(7.7\).

\[ SD(X) = s = \sqrt{7.7} \approx 2.774887 \]

How to calculate the standard deviation in R?

with(exDat,
     sd(numVec))

[1] 2.774887
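
As with var(), sd() returns the sample standard deviation (Equation 6). A short sketch, with object names chosen for illustration, shows that it equals the square root of the sample variance and how a population version (Equation 5) could be computed by hand.

```r
x <- c(1, 2, 5, 3, 8)

# Sample standard deviation: square root of the sample variance
sdSample <- sqrt(var(x))
all.equal(sdSample, sd(x))

# Population standard deviation: divide by n instead of n - 1
sdPop <- sqrt(sum((x - mean(x))^2) / length(x))
sdPop
```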

Range

The range of a vector is the difference between the largest (maximum) and the smallest (minimum) values/observations.

\[ Range(x) = R = x_{max}-x_{min} \qquad(7)\]

How to calculate the range in R?

with(exDat,
     range(numVec))

[1] 1 8

Alternatively, calculate minimum and maximum separately…

with(exDat,{
     c(min(numVec),
       max(numVec))})

[1] 1 8

To compute the range as a single number, apply Equation 7.

with(exDat,
     max(numVec)-min(numVec))

[1] 7
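
Because range() returns the minimum and maximum as a vector of length two, a compact alternative sketch is to pass its result to diff(), which subtracts the first element from the second. The object name `r` is chosen here for illustration.

```r
x <- c(1, 2, 5, 3, 8)

# diff() of c(min, max) gives max - min (Equation 7)
r <- diff(range(x))
r   # 7
```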

Put everything together 0

Exercise

Let us calculate several descriptive statistics (e.g., mean, standard deviation, minimum and maximum) for multiple variables. For this exercise, we use a subset of the HSB dataset which is provided in the merTools package (Knowles & Frederick, 2024) (for some details see here):

dat <- merTools::hsb
head(dat, 10)

   schid minority female    ses mathach size schtype meanses
1   1224        0      1 -1.528   5.876  842       0  -0.428
2   1224        0      1 -0.588  19.708  842       0  -0.428
3   1224        0      0 -0.528  20.349  842       0  -0.428
4   1224        0      0 -0.668   8.781  842       0  -0.428
5   1224        0      0 -0.158  17.898  842       0  -0.428
6   1224        0      0  0.022   4.583  842       0  -0.428
7   1224        0      1 -0.618  -2.832  842       0  -0.428
8   1224        0      0 -0.998   0.523  842       0  -0.428
9   1224        0      1 -0.888   1.527  842       0  -0.428
10  1224        0      0 -0.458  21.521  842       0  -0.428

Put everything together I

A flexible approach would be…

myVar <- c("Math achievement" = "mathach", # <1>
           "Gender" = "female",
           "Socioeconomic status" = "ses",
           "Class size" = "size")

exDescr <- apply( # <2>
  X = dat[,myVar], # <3>
  MARGIN = 2, # <4>
  FUN = function(x) { # <5>
    ret <- c( # <6>
             mean(x, na.rm = TRUE),
             sd(x, na.rm = TRUE),
             min(x, na.rm = TRUE),
             max(x, na.rm = TRUE)
             )
    return(ret) # <7>
    })

1: Create a (named) character vector of the variables by using the c() function.
2: Use the apply() function to apply one or more functions to the data (here: 4 columns).
3: The input is the dataset with the selected columns of interest (see 1.).
4: MARGIN = 2 indicates that the function should be applied over columns.
5: Create the function that should be applied. Here we calculate the mean(), sd(), min() and max().
6: Create a temporary R object to be returned later (here: the vector ret).
7: Return the temporary object and close the function.

There are also functions such as colMeans(), colSums(), rowMeans() and rowSums().
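
A brief sketch of these helpers on a small made-up data frame (`toyDat` and its values are hypothetical and unrelated to the HSB data). For statistics without a dedicated column function, such as the standard deviation, sapply() is one option.

```r
# Hypothetical toy data with one missing value
toyDat <- data.frame(a = c(1, 2, 3, NA),
                     b = c(10, 20, 30, 40))

# Column means, ignoring missing values
colM <- colMeans(toyDat, na.rm = TRUE)
colM

# Column standard deviations via sapply();
# na.rm = TRUE is passed on to sd()
colSd <- sapply(toyDat, sd, na.rm = TRUE)
colSd
```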

Put everything together II

Print the results…

exDescr |>
  print()

       mathach    female           ses      size
[1,] 12.747853 0.5281837  0.0001433542 1056.8618
[2,]  6.878246 0.4992398  0.7793551951  604.1725
[3,] -2.832000 0.0000000 -3.7580000000  100.0000
[4,] 24.993000 1.0000000  2.6920000000 2713.0000

This is a weird format; variables should be in rows not columns. Transpose…

exDescr |>
  t() |>
  print()

                [,1]        [,2]    [,3]     [,4]
mathach 1.274785e+01   6.8782457  -2.832   24.993
female  5.281837e-01   0.4992398   0.000    1.000
ses     1.433542e-04   0.7793552  -3.758    2.692
size    1.056862e+03 604.1724993 100.000 2713.000

Better, but still not really convincing…
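
One possible intermediate improvement, sketched here on made-up data (`toyDat` is hypothetical), is to name the elements of the returned vector so that the statistics are labelled, and to round the result before printing.

```r
# Hypothetical toy data
toyDat <- data.frame(a = c(1, 2, 3, 6),
                     b = c(10, 20, 30, 40))

toyDescr <- apply(
  X = toyDat,
  MARGIN = 2,
  FUN = function(x) {
    # Named elements become row names of the result
    c(Mean = mean(x, na.rm = TRUE),
      SD   = sd(x, na.rm = TRUE),
      Min  = min(x, na.rm = TRUE),
      Max  = max(x, na.rm = TRUE))
  })

# Variables in rows, labelled statistics in columns
toyDescrT <- round(t(toyDescr), 2)
toyDescrT
```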

Making a table I

exDescrTab <- exDescr |> # <1>
    t() |> # <2>
    as.data.frame() |>
    (\(d) cbind(names(myVar), d))() |> # <3>
    flextable() |> # <4>
    theme_apa() |> # <5>
    set_header_labels( # <6>
      "names(myVar)" = "Variables",
      V1 = "Mean",
      V2 = "SD",
      V3 = "Min",
      V4 = "Max") |>
    align(part = "body", align = "center") |> # <7>
    align(j = 1, part = "all", align = "left") |>
    add_footer_lines( # <8>
      as_paragraph(as_i("Note. "),
                   "This is a footnote.")
      ) |>
    align(align = "left", part = "footer") |>
    width(j = 1, width = 2, unit = "in") |> # <9>
    width(j = 2:5, width = 1, unit = "in")

1: Take the results (here: the exDescr object)…
2: …transpose them with t() and coerce them to a data.frame object with as.data.frame().
3: Use an anonymous (lambda) function to bind the variable names as the first column of the dataset with cbind().
4: Apply the flextable() function.
5: Use the APA theme (theme_apa()).
6: Rename the column names (set_header_labels()).
7: Center the body of the table (align()) and left-align the first column.
8: Add a footnote (add_footer_lines()) and align it to the left.
9: Set the column widths (width()) to 2 inches for the first column and 1 inch for the remaining columns.

Making a table II

Print the table.


exDescrTab

Table 1: Descriptive statistics

Variables	Mean	SD	Min	Max
Math achievement	12.75	6.88	-2.83	24.99
Gender	0.53	0.50	0.00	1.00
Socioeconomic status	0.00	0.78	-3.76	2.69
Class size	1,056.86	604.17	100.00	2,713.00
Note. This is a footnote.

Table export

If you want to export the table…

exDescrTab |>
  set_caption(caption = "Table X.\nDescriptive statistics") |>
  save_as_docx(path = "descr-tab.docx")

Descriptive statistics with the `psych` package

Alternatively, it is convenient to use additional R packages such as the psych package (Revelle, 2024) to calculate descriptive statistics.
Here we use the describe() function (with the fast argument set to TRUE) to calculate the descriptive statistics of all variables in the example data set except the first column (the school identifier schid).

dat |>
  subset(select = -c(1)) |>
  psych::describe(fast = TRUE) |>
  flextable() |>
  colformat_double(digits = 2)

Table 2: Descriptive statistics with the psych package

vars	n	mean	sd	min	max	range	se
1	7,185.00	0.27	0.45	0.00	1.00	1.00	0.01
2	7,185.00	0.53	0.50	0.00	1.00	1.00	0.01
3	7,185.00	0.00	0.78	-3.76	2.69	6.45	0.01
4	7,185.00	12.75	6.88	-2.83	24.99	27.82	0.08
5	7,185.00	1,056.86	604.17	100.00	2,713.00	2,613.00	7.13
6	7,185.00	0.49	0.50	0.00	1.00	1.00	0.01
7	7,185.00	0.01	0.41	-1.19	0.83	2.02	0.00

Exercise

Style the table according to your ideas/demands and export it to Word.

References

Azzalini, A. (2023). sn: The skew-normal and related distributions such as the skew-t and the SUN. http://azzalini.stat.unipd.it/SN/

Eid, M., Gollwitzer, M., & Schmitt, M. (2013). Statistik und Forschungsmethoden: Lehrbuch ; mit Online-Materialien (3., korrigierte Auflage). Beltz.

Enders, C. K. (2023). Missing data: An update on the state of the art. Psychological Methods. https://doi.org/10.1037/met0000563

Gohel, D., & Skintzos, P. (2024). flextable: Functions for tabular reporting. https://ardata-fr.github.io/flextable-book/

Knowles, J. E., & Frederick, C. (2024). merTools: Tools for analyzing mixed effect regression models. https://CRAN.R-project.org/package=merTools

Revelle, W. (2024). psych: Procedures for psychological, psychometric, and personality research. https://personality-project.org/r/psych/ https://personality-project.org/r/psych-manual.pdf

Rosseel, Y., Jorgensen, T. D., & De Wilde, L. (2023). lavaan: Latent variable analysis. https://lavaan.ugent.be

Schafer, J. L., & Graham, J. W. (2002). Missing data: Our view of the state of the art. Psychological Methods, 7(2), 147–177. https://doi.org/10.1037/1082-989X.7.2.147

Wickham, H., Chang, W., Henry, L., Pedersen, T. L., Takahashi, K., Wilke, C., Woo, K., Yutani, H., & Dunnington, D. (2023). ggplot2: Create elegant data visualisations using the grammar of graphics. https://ggplot2.tidyverse.org

Xie, Y. (2023). knitr: A general-purpose package for dynamic report generation in R. https://yihui.org/knitr/