Descriptive statistics
This page is work in progress and under active development.
Preface: Used packages
- The following packages are used:
- Install packages when not already installed:
Preface: Cite the packages
Print list of packages and cite them via Pandoc citation.
- merTools (v0.6.1, Knowles & Frederick, 2024)
- sn (v2.1.1, Azzalini, 2023)
- knitr (v1.44, Xie, 2023)
- flextable (v0.9.4, Gohel & Skintzos, 2024)
- psych (v2.3.9, Revelle, 2024)
- lavaan (v0.6.16, Rosseel et al., 2023)
- ggplot2 (v3.5.0, Wickham et al., 2023)
Preface: Data matrix
Variables (e.g., characteristics), units (e.g., persons) and data (e.g., measurements) are often presented in matrix form. A matrix is a rectangular arrangement of \(n \cdot p\) quantities and looks as follows:
\[ \begin{bmatrix} X_{11} & X_{12} & \cdots & X_{1p} \\ X_{21} & X_{22} & \cdots & X_{2p} \\ \vdots & \vdots & & \vdots \\ X_{n1} & X_{n2} & \cdots & X_{np} \end{bmatrix} \]
- \(n\) rows; a single row is also known as a row vector (or row matrix)
- \(p\) columns; a single column is also known as a column vector (or column matrix)
See Eid et al. (2013).
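As a small illustration of this layout, the following sketch builds a data matrix with n = 3 units and p = 2 variables. The object name X and its values are made up for this example.

```r
# Hypothetical data matrix: 3 units (rows) measured on 2 variables (columns)
X <- matrix(c(1, 2, 3,
              4, 5, 6),
            nrow = 3, ncol = 2)  # matrix() fills column by column

dim(X)   # number of rows (n) and columns (p)
X[1, ]   # first row: all measurements of unit 1
X[, 2]   # second column: variable 2 across all units
```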
Overview
- Quantiles (not covered)
- Measures of variability
  - Standard deviation
  - Variance
  - Range (Minimum, Maximum)
  - Interquartile range (not covered)
  - Semi-interquartile range (not covered)
  - …
- Measures of shape
  - Skewness (not covered)
  - Kurtosis (not covered)
Example data set
Consider the following 2 vectors within the example data set
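The code that creates the example data set is not shown here. The sketch below is a reconstruction under assumptions: the values of numVec are chosen so that they reproduce the results reported in the following sections (mean 3.8, median 3, variance 7.7), whereas chrVec and its values are purely hypothetical stand-ins for the second vector.

```r
# Assumed reconstruction of the example data set
exDat <- data.frame(
  numVec = c(1, 2, 5, 3, 8),            # inferred from the reported results
  chrVec = c("a", "b", "a", "c", "b")   # hypothetical second vector
)
str(exDat)
```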
Absolute Frequencies
Absolute Frequencies in R
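Absolute frequencies count how often each value occurs. A minimal sketch using the base R function table() on a made-up character vector (grp is a hypothetical name, not part of the example data set):

```r
grp <- c("a", "b", "a", "c", "b", "a")  # hypothetical categorical data

absFreq <- table(grp)  # count the occurrences of each distinct value
absFreq
```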
Relative Frequencies
Relative Frequencies in R
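Relative frequencies are the absolute frequencies divided by the total number of observations. A minimal sketch using prop.table() on a made-up character vector (grp is a hypothetical name, not part of the example data set):

```r
grp <- c("a", "b", "a", "c", "b", "a")  # hypothetical categorical data

relFreq <- prop.table(table(grp))  # counts divided by the total count
round(relFreq, 2)
```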
Mean
How to calculate the mean in R?
with(exDat,
mean(numVec))
[1] 3.8
A brief note on missing data
In R, missing values are represented by the symbol NA. By default, many basic functions (e.g., mean()) return NA when their input contains missing values.
To demonstrate this, we create another example vector (numVec2).
numVec2 <- c(1, 2, 5, 3, 8, NA)
mean(numVec2)
[1] NA
If there is missing data, we need to set the argument na.rm to TRUE so that missing values are removed before the calculation.
mean(numVec2, na.rm = TRUE)
[1] 3.8
Note, however, that omitting or deleting missing values should, in most scenarios, be avoided altogether (Enders, 2023; Schafer & Graham, 2002).
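Whatever strategy is chosen for handling missing data, it helps to know first how many values are missing. A short sketch with the base R functions anyNA() and is.na(), reusing the values of the vector from above:

```r
numVec2 <- c(1, 2, 5, 3, 8, NA)

hasNA  <- anyNA(numVec2)         # is at least one value missing?
nNA    <- sum(is.na(numVec2))    # number of missing values
propNA <- mean(is.na(numVec2))   # proportion of missing values

c(hasNA = hasNA, nNA = nNA, propNA = propNA)
```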
Median
How to calculate the median in R?
with(exDat,
median(numVec))
[1] 3
Variance
How to calculate the variance in R?
with(exDat,
var(numVec))
[1] 7.7
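Note that var() computes the sample variance, i.e., the sum of squared deviations from the mean divided by n - 1 rather than n. The following sketch reproduces the result by hand; x holds values that are assumed to match numVec (they reproduce the output above).

```r
x <- c(1, 2, 5, 3, 8)  # assumed values of numVec
n <- length(x)

varByHand <- sum((x - mean(x))^2) / (n - 1)  # sample variance by hand
varByHand
var(x)              # built-in function, same result
sqrt(varByHand)     # the standard deviation is the square root of the variance
```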
Standard Deviation
How to calculate the standard deviation in R?
with(exDat,
sd(numVec))
[1] 2.774887
Range
How to calculate the range in R?
with(exDat,
range(numVec))
[1] 1 8
Alternatively, calculate minimum and maximum separately…
with(exDat,{
c(min(numVec),
max(numVec))})
[1] 1 8
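Note that range() returns the minimum and the maximum, not their difference. To compute the range as a single value (maximum minus minimum), apply Equation 7.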
with(exDat,
max(numVec)-min(numVec))
[1] 7
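A compact alternative is to combine range() with diff(), which returns the difference between the two values. In this sketch, x holds values that are assumed to match numVec.

```r
x <- c(1, 2, 5, 3, 8)  # assumed values of numVec

rangeWidth <- diff(range(x))  # maximum minus minimum
rangeWidth
```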
Put everything together 0
Let us calculate several descriptive statistics (e.g., mean, standard deviation, minimum and maximum) for multiple variables. For this exercise, we use a subset of the HSB dataset, which is provided in the merTools package (Knowles & Frederick, 2024) (for some details see here):
dat <- merTools::hsb
head(dat, 10)
schid minority female ses mathach size schtype meanses
1 1224 0 1 -1.528 5.876 842 0 -0.428
2 1224 0 1 -0.588 19.708 842 0 -0.428
3 1224 0 0 -0.528 20.349 842 0 -0.428
4 1224 0 0 -0.668 8.781 842 0 -0.428
5 1224 0 0 -0.158 17.898 842 0 -0.428
6 1224 0 0 0.022 4.583 842 0 -0.428
7 1224 0 1 -0.618 -2.832 842 0 -0.428
8 1224 0 0 -0.998 0.523 842 0 -0.428
9 1224 0 1 -0.888 1.527 842 0 -0.428
10 1224 0 0 -0.458 21.521 842 0 -0.428
Put everything together I
A flexible approach would be…
myVar <- c("Math achievement" = "mathach",   # <1>
           "Gender" = "female",
           "Socioeconomic status" = "ses",
           "Class size" = "size")
exDescr <- apply(                            # <2>
  X = dat[, myVar],                          # <3>
  MARGIN = 2,                                # <4>
  FUN = function(x) {                        # <5>
    ret <- c(                                # <6>
      mean(x, na.rm = TRUE),
      sd(x, na.rm = TRUE),
      min(x, na.rm = TRUE),
      max(x, na.rm = TRUE)
    )
    return(ret)                              # <7>
  })
- 1: Create a (named) character vector of the variables by using the c() function.
- 2: Use the apply() function to apply one or more functions to the data (here: 4 columns).
- 3: The input is the dataset with the selected columns of interest (see 1).
- 4: MARGIN = 2 indicates that the function should be applied over columns.
- 5: Create the function that should be applied. Here we calculate the mean(), sd(), min() and max().
- 6: Create a temporary R object that will be returned later (here: the vector ret).
- 7: Return the temporary object and close the functions.
There are also functions such as colMeans(), colSums(), rowMeans() and rowSums().
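When only means or sums are needed, these helpers are a shorter alternative to apply(). A minimal sketch on a made-up data frame (toyDat is hypothetical), compared against the apply() approach:

```r
toyDat <- data.frame(a = c(1, 2, 3),
                     b = c(10, 20, NA))  # hypothetical data with one missing value

cm <- colMeans(toyDat, na.rm = TRUE)         # column means, ignoring NA
am <- apply(toyDat, 2, mean, na.rm = TRUE)   # same computation via apply()

cm
all.equal(cm, am)
```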
Put everything together II
Print the results…
mathach female ses size
[1,] 12.747853 0.5281837 0.0001433542 1056.8618
[2,] 6.878246 0.4992398 0.7793551951 604.1725
[3,] -2.832000 0.0000000 -3.7580000000 100.0000
[4,] 24.993000 1.0000000 2.6920000000 2713.0000
This is a weird format; variables should be in rows not columns. Transpose…
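Transposing swaps rows and columns, which is what the t() function does. A small self-contained sketch with a made-up matrix (m is hypothetical and only mimics the layout of the results above):

```r
# Hypothetical results: 3 statistics (rows) for 2 variables (columns)
m <- rbind(c(12.7, 0.5),
           c(6.9, 0.5),
           c(-2.8, 0.0))
colnames(m) <- c("mathach", "female")

mt <- t(m)  # variables are now in rows, statistics in columns
mt
```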
Better, but still not really convincing…
Making a table I
exDescrTab <- exDescr |>                          # <1>
  t() |>                                          # <2>
  as.data.frame() |>
  (\(d) cbind(names(myVar), d))() |>              # <3>
  flextable() |>                                  # <4>
  theme_apa() |>                                  # <5>
  set_header_labels(                              # <6>
    "names(myVar)" = "Variables",
    V1 = "Mean",
    V2 = "SD",
    V3 = "Min",
    V4 = "Max") |>
  align(part = "body", align = "center") |>       # <7>
  align(j = 1, part = "all", align = "left") |>
  add_footer_lines(                               # <8>
    as_paragraph(as_i("Note. "),
                 "This is a footnote.")
  ) |>
  align(align = "left", part = "footer") |>
  width(j = 1, width = 2, unit = "in") |>         # <9>
  width(j = 2:5, width = 1, unit = "in")
- 1: Take the results (here: the exDescr object)…
- 2: …transpose them (using the t() function) and coerce them to a data.frame object (as.data.frame()).
- 3: Use a so-called lambda (or anonymous) function to bind (using the cbind() function) the variable names as the first column of the dataset.
- 4: Apply the flextable() function.
- 5: Use the APA theme (theme_apa()).
- 6: Rename the column headers (set_header_labels()).
- 7: Center the body of the table and left-align the first column (align()).
- 8: Add a footnote (add_footer_lines()) and align it to the left.
- 9: Change the column widths (width()) to 2 inches for the first column and 1 inch for the remaining columns.
Making a table II
Print the table.
exDescrTab
Variables | Mean | SD | Min | Max |
---|---|---|---|---|
Math achievement | 12.75 | 6.88 | -2.83 | 24.99 |
Gender | 0.53 | 0.50 | 0.00 | 1.00 |
Socioeconomic status | 0.00 | 0.78 | -3.76 | 2.69 |
Class size | 1,056.86 | 604.17 | 100.00 | 2,713.00 |
Note. This is a footnote.
Table export
If you want to export the table…
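One possible way to export a flextable object to Word is the save_as_docx() function from the flextable package. This is a sketch and not necessarily the approach intended here: the table is built from the built-in mtcars data, and the file name and the use of a temporary directory are assumptions.

```r
library(flextable)

# Stand-in table built from built-in data (assumption for this sketch)
tab <- flextable(head(mtcars[, 1:4]))

# Hypothetical output file in the session's temporary directory
outFile <- file.path(tempdir(), "descriptives.docx")

save_as_docx(tab, path = outFile)  # write the table to a Word document
file.exists(outFile)
```

To export the table created above, pass exDescrTab instead of tab and choose a path in your project folder.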
Descriptive statistics with the psych package
Alternatively, it is convenient to use additional R packages such as the psych package (Revelle, 2024) to calculate descriptive statistics. Here we use the describe() function (with the fast argument set to TRUE) to calculate the descriptive statistics of all variables within the example data set.
dat |>
subset(select = -c(1)) |>
psych::describe(fast = TRUE) |>
flextable() |>
colformat_double(digits = 2)
vars | n | mean | sd | min | max | range | se |
---|---|---|---|---|---|---|---|
1 | 7,185.00 | 0.27 | 0.45 | 0.00 | 1.00 | 1.00 | 0.01 |
2 | 7,185.00 | 0.53 | 0.50 | 0.00 | 1.00 | 1.00 | 0.01 |
3 | 7,185.00 | 0.00 | 0.78 | -3.76 | 2.69 | 6.45 | 0.01 |
4 | 7,185.00 | 12.75 | 6.88 | -2.83 | 24.99 | 27.82 | 0.08 |
5 | 7,185.00 | 1,056.86 | 604.17 | 100.00 | 2,713.00 | 2,613.00 | 7.13 |
6 | 7,185.00 | 0.49 | 0.50 | 0.00 | 1.00 | 1.00 | 0.01 |
7 | 7,185.00 | 0.01 | 0.41 | -1.19 | 0.83 | 2.02 | 0.00 |
Exercise
Style the table according to your ideas/demands and export it to Word.