May 15, 2024
Print list of packages and cite them via Pandoc citation.
Variables (e.g., characteristics), units (e.g., persons) and data (e.g., measurements) are often presented in matrix form. A matrix is a rectangular arrangement of \(n \cdot p\) quantities with \(n\) rows and \(p\) columns and has the following form:
\[ \begin{bmatrix} X_{11} & X_{12} & \cdots & X_{1p} \\ X_{21} & X_{22} & \cdots & X_{2p} \\ \vdots & \vdots & & \vdots \\ X_{n1} & X_{n2} & \cdots & X_{np} \end{bmatrix} \]
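As a small illustration of the matrix form in R, the following sketch builds a matrix with \(n = 3\) units and \(p = 2\) variables. The values and the column names height and weight are hypothetical and not part of the example data set.

```r
# Hypothetical data: n = 3 units (rows) measured on p = 2 variables (columns)
X <- matrix(c(170, 65,
              182, 80,
              165, 58),
            nrow = 3, ncol = 2, byrow = TRUE,
            dimnames = list(NULL, c("height", "weight")))

dim(X)    # number of rows (n) and number of columns (p)
X[2, 1]   # element X_21: unit 2, variable 1
```

Rows index the units and columns index the variables, matching the subscripts \(X_{ij}\) in the matrix above.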
Consider the following 2 vectors within the example data set
Absolute frequencies are the number of times a particular value or category appears in a variable. They are denoted \(n_j\), the number of observations with value/category \(j\).
Category \(j\) | Absolute Frequency (\(n_j\)) |
---|---|
low (\(j=1\)) | 2 |
med (\(j=2\)) | 1 |
high (\(j=3\)) | 2 |
\(\sum\) | \(\sum_{j=1}^3n_j=n=5\) |
Relative frequencies refer to the proportion of a specific value or category relative to the total number of observations (\(n\)).
\[ h_j=\frac{n_j}{n} \]
Category \(j\) | Absolute Frequency (\(n_j\)) | Relative Frequency (\(h_j\)) |
---|---|---|
low (\(j=1\)) | 2 | 0.40 |
med (\(j=2\)) | 1 | 0.20 |
high (\(j=3\)) | 2 | 0.40 |
\(\sum\) | \(\sum_{j=1}^3n_j=n=5\) | \(\sum_{j=1}^3h_j=1\) |
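The frequency table can be reproduced with base R. The sketch below assumes a hypothetical vector whose categories match the counts in the table (2 low, 1 med, 2 high); the order of the observations is made up.

```r
# Hypothetical vector matching the table: 2 x low, 1 x med, 2 x high
x <- factor(c("low", "high", "med", "low", "high"),
            levels = c("low", "med", "high"))

n_j <- table(x)          # absolute frequencies n_j
h_j <- prop.table(n_j)   # relative frequencies h_j = n_j / n

n_j
h_j
sum(h_j)                 # relative frequencies sum to 1
```

Setting the levels explicitly keeps the categories in the order low, med, high instead of the alphabetical default.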
The mean (or arithmetic mean, average) is the sum of a collection of numbers divided by the count of numbers in the collection. The formula is given in Equation 1.
\[ \bar{x} = \frac{1}{n} \sum_{i=1}^n x_i=\frac{x_1+x_2+\dots+x_n}{n} \qquad(1)\]
For example, consider a vector of numbers: \(x = 1, 2, 5, 3, 8\)
\[ \bar{x} = \frac{(1+2+5+3+8)}{5}=3.8 \]
If the underlying data is a sample (i.e., a subset of a population), it is called the sample mean.
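A minimal sketch of the same calculation in R, once by hand following Equation 1 and once with the built-in function:

```r
x <- c(1, 2, 5, 3, 8)

sum(x) / length(x)   # Equation 1 by hand: 19 / 5
mean(x)              # built-in equivalent
```

Both lines give 3.8 for this vector.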
In R, missing values/data are represented by the symbol NA. By default, most of the basic functions cannot deal appropriately with missing data (e.g., mean() returns NA if the input contains a missing value).
Omitting or deleting missing values should, in most scenarios, be avoided altogether (Enders, 2023; Schafer & Graham, 2002).
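The default behaviour can be seen with a short sketch. The vector x_na is hypothetical; the na.rm argument is shown only to illustrate how the function reacts, not as a recommendation for handling missing data.

```r
# Hypothetical vector with one missing value
x_na <- c(1, 2, NA, 3, 8)

mean(x_na)                 # returns NA because of the missing value
mean(x_na, na.rm = TRUE)   # removes the NA first: (1 + 2 + 3 + 8) / 4
sum(is.na(x_na))           # counts the missing values
```

With na.rm = TRUE the mean is based on the four observed values only, which is exactly the kind of deletion the cited literature advises against in most scenarios.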
The median is the value separating the higher half from the lower half of a data sample, a population, or a probability distribution. For a data set, it may be thought of as “the middle” value. The formulas are given in Equation 2.
\[ Mdn = \widetilde{x} = \begin{cases} x_{(n+1)/2} & \:\: \text{if } n \text{ is odd} \\ (x_{n/2} + x_{(n/2)+1}) / 2 & \:\: \text{if } n \text{ is even} \end{cases} \qquad(2)\]
Consider again the vector of numbers: \(x = 1, 2, 5, 3, 8\) with length \(n = 5\). To calculate the median, first order the vector, \(x = 1, 2, 3, 5, 8\), and then apply the corresponding formula (odd vs. even; here odd):
\[ \widetilde{x}=x_{\frac{(5+1)}{2}}=x_3 = 3 \]
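The two steps (ordering, then picking the middle position) can be sketched in R as follows; the helper names x_sorted and n are only used for this illustration.

```r
x <- c(1, 2, 5, 3, 8)

x_sorted <- sort(x)     # order the vector: 1 2 3 5 8
n <- length(x)          # n = 5 (odd)

x_sorted[(n + 1) / 2]   # odd case of Equation 2: third ordered value
median(x)               # built-in equivalent
```

Both lines give 3 for this vector.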
The variance is the expectation of the squared deviation of a random variable from its mean. A distinction is usually made between the population variance and the sample variance. The formula of the population variance is given in Equation 3.
\[VAR(X) = \sigma^2 = \frac{1}{N} \sum\limits_{i=1}^N (x_i - \mu)^2 \qquad(3)\]
The formula of the sample variance is given in Equation 4.
\[ VAR(X) = s^2 = \frac{1}{n-1} \sum\limits_{i=1}^n (x_i - \bar{x})^2 \qquad(4)\]
Using again the vector \(x = 1, 2, 5, 3, 8\), the sample variance is calculated as follows:
\[ VAR(X) = s^2 = \frac{1}{4}((1-3.8)^2 + (2-3.8)^2 + (5-3.8)^2 + (3-3.8)^2 + (8-3.8)^2) = \frac{30.8}{4} = 7.7 \]
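A minimal sketch of both variance formulas in R. Note that the built-in var() function divides by \(n - 1\), so it corresponds to the sample variance in Equation 4; the last line treats the vector as a complete population purely for comparison.

```r
x <- c(1, 2, 5, 3, 8)
n <- length(x)

sum((x - mean(x))^2) / (n - 1)   # Equation 4 by hand: 30.8 / 4
var(x)                           # built-in sample variance (divides by n - 1)
sum((x - mean(x))^2) / n         # Equation 3, if x were the whole population: 30.8 / 5
```

The first two lines give 7.7, the last one 6.16.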
The standard deviation is defined as the square root of the variance. Again, a distinction is made between the population and the sample standard deviation. The formula of the population standard deviation is given in Equation 5.
\[ SD(X) = \sigma = \sqrt{\sigma^2} \qquad(5)\]
The formula of the sample standard deviation is given in Equation 6.
\[ SD(X) = s = \sqrt{s^2} \qquad(6)\]
Recall the variance calculation from the previous slide: the (sample) variance of the vector is \(7.7\).
\[ SD(X) = \sqrt{7.7}=2.774887 \]
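In R, the same value can be obtained either from the variance or directly; like var(), the built-in sd() function uses the sample version (Equation 6).

```r
x <- c(1, 2, 5, 3, 8)

sqrt(var(x))   # square root of the sample variance
sd(x)          # built-in sample standard deviation
```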
The range of a vector is the difference between the largest (maximum) and the smallest (minimum) values/observations.
\[ Range(x) = R = x_{max}-x_{min} \qquad(7)\]
Alternatively, calculate minimum and maximum separately…
To compute the range apply Equation 7.
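A short sketch of these steps for the vector used so far. Note that the base R function range() returns the minimum and the maximum, not their difference, so Equation 7 still has to be applied.

```r
x <- c(1, 2, 5, 3, 8)

range(x)          # minimum and maximum in one call: 1 8
min(x)            # minimum separately
max(x)            # maximum separately

max(x) - min(x)   # Equation 7: 8 - 1
diff(range(x))    # same result in one step
```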
Exercise
Let us calculate several descriptive statistics (e.g., mean, standard deviation, minimum and maximum) for multiple variables. For this exercise, we use a subset of the HSB dataset which is provided in the merTools package (Knowles & Frederick, 2024) (for some details see here):
schid minority female ses mathach size schtype meanses
1 1224 0 1 -1.528 5.876 842 0 -0.428
2 1224 0 1 -0.588 19.708 842 0 -0.428
3 1224 0 0 -0.528 20.349 842 0 -0.428
4 1224 0 0 -0.668 8.781 842 0 -0.428
5 1224 0 0 -0.158 17.898 842 0 -0.428
6 1224 0 0 0.022 4.583 842 0 -0.428
7 1224 0 1 -0.618 -2.832 842 0 -0.428
8 1224 0 0 -0.998 0.523 842 0 -0.428
9 1224 0 1 -0.888 1.527 842 0 -0.428
10 1224 0 0 -0.458 21.521 842 0 -0.428
A flexible approach would be…
Collect the variables with the c() function.
Use the apply function to apply one or multiple function(s) on data (here: 4 columns).
MARGIN = 2 indicates that the function should be applied over columns.
Calculate mean(), sd(), min() and max().
Create an R object, which should be later returned (here: the vector ret).
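The steps above can be sketched as follows. This is a reconstruction under assumptions, not the original slide code: the data frame dat10 contains only the first ten rows printed above (so the numbers differ from the full-data results), and the display labels in myVar are hypothetical. The object names myVar, exDescr and ret follow the names used elsewhere in the slides.

```r
# First ten rows of the printed HSB subset (four of the eight columns)
dat10 <- data.frame(
  mathach = c(5.876, 19.708, 20.349, 8.781, 17.898,
              4.583, -2.832, 0.523, 1.527, 21.521),
  female  = c(1, 1, 0, 0, 0, 0, 1, 0, 1, 0),
  ses     = c(-1.528, -0.588, -0.528, -0.668, -0.158,
              0.022, -0.618, -0.998, -0.888, -0.458),
  size    = rep(842, 10)
)

# Named vector: values are column names, names are hypothetical display labels
myVar <- c("Math achievement" = "mathach",
           "Female"           = "female",
           "SES"              = "ses",
           "School size"      = "size")

# Apply several functions to each selected column (MARGIN = 2)
exDescr <- apply(dat10[, myVar], MARGIN = 2, FUN = function(x) {
  ret <- c(mean(x), sd(x), min(x), max(x))   # one value per statistic
  return(ret)
})

exDescr   # 4 x 4 matrix: statistics in rows, variables in columns
```

Returning a vector from the anonymous function makes apply() stack the results column by column, which is why the statistics end up in rows and the variables in columns.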
Print the results…
mathach female ses size
[1,] 12.747853 0.5281837 0.0001433542 1056.8618
[2,] 6.878246 0.4992398 0.7793551951 604.1725
[3,] -2.832000 0.0000000 -3.7580000000 100.0000
[4,] 24.993000 1.0000000 2.6920000000 2713.0000
This is a weird format; variables should be in rows not columns. Transpose…
Better, but still not really convincing…
library(flextable)

exDescrTab <- exDescr |>
  t() |>                                   # variables in rows
  as.data.frame() |>
  (\(d) cbind(names(myVar), d))() |>       # variable names as first column
  flextable() |>
  theme_apa() |>
  set_header_labels(
    "names(myVar)" = "Variables",
    V1 = "Mean",
    V2 = "SD",
    V3 = "Min",
    V4 = "Max") |>
  align(part = "body", align = "c") |>
  align(j = 1, part = "all", align = "l") |>
  add_footer_lines(
    as_paragraph(as_i("Note. "),
                 "This is a footnote.")
  ) |>
  align(align = "left", part = "footer") |>
  width(j = 1, width = 2, unit = "in") |>  # first column 2 in, others 1 in
  width(j = 2:5, width = 1, unit = "in")
Take the results (the exDescr object)…
transpose (i.e., using the t() function) and coerce it to a data.frame object (as.data.frame()).
Add (using the cbind() function) the variable names as the first column to the dataset.
Create a table with the flextable() function.
Apply the APA theme (theme_apa()).
Set the header labels (set_header_labels()).
Align the columns (align()).
Add a footnote (add_footer_lines) and align it to the left.
Set the column widths (width) to 2 resp. 1 inch.
Print the table.
If you want to export the table…
psych package
Alternatively, it is convenient to use additional R packages such as the psych package (Revelle, 2024) to calculate descriptive statistics.
Here we use the describe function (with the fast argument set to TRUE) to calculate the descriptive statistics of all variables within the example data set:
dat |>
subset(select = -c(1)) |>
psych::describe(fast = TRUE) |>
flextable() |>
colformat_double(digits = 2)
vars | n | mean | sd | min | max | range | se |
---|---|---|---|---|---|---|---|
1 | 7,185.00 | 0.27 | 0.45 | 0.00 | 1.00 | 1.00 | 0.01 |
2 | 7,185.00 | 0.53 | 0.50 | 0.00 | 1.00 | 1.00 | 0.01 |
3 | 7,185.00 | 0.00 | 0.78 | -3.76 | 2.69 | 6.45 | 0.01 |
4 | 7,185.00 | 12.75 | 6.88 | -2.83 | 24.99 | 27.82 | 0.08 |
5 | 7,185.00 | 1,056.86 | 604.17 | 100.00 | 2,713.00 | 2,613.00 | 7.13 |
6 | 7,185.00 | 0.49 | 0.50 | 0.00 | 1.00 | 1.00 | 0.01 |
7 | 7,185.00 | 0.01 | 0.41 | -1.19 | 0.83 | 2.02 | 0.00 |
Style the table according to your ideas/demands and export it to Word.