In this part you will learn how to calculate descriptive statistics, item statistics, and scale scores.
Recall that to calculate the descriptive statistics for continuous and categorical variables as well as scales, we provide ready-to-use functions (see here).
R provides functions to calculate the typical descriptive statistics: mean, sd, min, max, table, …
In addition, there are packages such as the psych package (Revelle, 2023) that offer helpful functionality.
For example, if we want to calculate the mean and the standard deviation of the variable age, we can use the respective functions (i.e., mean & sd):
mean(exDat$age,
     na.rm = T)
[1] 10.03468
Or, if you are interested in absolute and/or relative frequencies, you can use the table function:
table(exDat$sex)
0 1
357 354
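The table call above gives absolute frequencies. For relative frequencies, one option is to wrap the call in the base R prop.table function; a minimal sketch:

```r
# relative frequencies of the sex variable (proportions instead of counts)
prop.table(table(exDat$sex))
```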
However, often you need to calculate the descriptive statistics not only for a single variable, but for many variables. Again, in R this can be done in many different ways. Here we depict a base R solution using the apply function (for more on the apply function see the Tip on the apply function).
In order to better retrace the calculation procedure, we present it in a two-step approach:
1. Execute the apply function and save the output in an object (here: exDescr):
- X is the example data set: exDat
- Set the MARGIN value to 2, because we want to apply the function over the columns of exDat
- FUN: Here we define a function that calculates the mean, standard deviation, minimum, and maximum (combined through the c function). Within the braces {} you can state several (consecutive) arguments. Finally, we return the object (fOut) of the function. Note that the fOut object is not saved in the workspace.
exDescr <- apply(exDat[,1:5],
                 MARGIN = 2,
                 FUN = function(x) {
                   fOut <- c(
                     mean(x, na.rm = T),
                     sd(x, na.rm = T),
                     min(x, na.rm = T),
                     max(x, na.rm = T)
                   )
                   return(fOut)
                 })
2. Transpose (with the t function) the returned matrix, transform it to a data.frame object, and provide the column names.
exDescr <- as.data.frame(
t(exDescr)
)
colnames(exDescr) <- c("Mean", "SD", "Min", "Max")
exDescr
Mean SD Min Max
msc1 2.518667 0.7373668 1.00000 4
msc2 2.544118 0.7186516 1.00000 4
msc3 2.488000 0.7482887 1.00000 4
msc4 2.480000 0.7463306 1.00000 4
age 10.034684 1.7561728 5.43976 30
(3. Finally, if you need a formatted table, the kbl function from the kableExtra package (Zhu, 2021) transforms matrices or data frames into nice tables.)
kableExtra::kbl(exDescr,
digits = 2) |>
kableExtra::kable_classic()
| | Mean | SD | Min | Max |
|---|---|---|---|---|
| msc1 | 2.52 | 0.74 | 1.00 | 4 |
| msc2 | 2.54 | 0.72 | 1.00 | 4 |
| msc3 | 2.49 | 0.75 | 1.00 | 4 |
| msc4 | 2.48 | 0.75 | 1.00 | 4 |
| age | 10.03 | 1.76 | 5.44 | 30 |
Of course, everything could also be written as a single nested call. Although it is probably a matter of coding taste, we do not recommend such a “nesting” approach, because it is difficult to read and comprehend. We like it neat. See also the note on What is the pipe operator?
kableExtra::kbl(x = t(
apply(exDat[,1:5],
MARGIN = 2,
FUN = function(x) {
fOut <- c(
mean(x, na.rm = T),
sd(x, na.rm = T),
min(x, na.rm = T),
max(x, na.rm = T)
)
return(fOut)
})
),
digits = 2,
col.names = c("Mean", "SD", "Min", "Max")) |>
kableExtra::kable_classic()
| | Mean | SD | Min | Max |
|---|---|---|---|---|
| msc1 | 2.52 | 0.74 | 1.00 | 4 |
| msc2 | 2.54 | 0.72 | 1.00 | 4 |
| msc3 | 2.49 | 0.75 | 1.00 | 4 |
| msc4 | 2.48 | 0.75 | 1.00 | 4 |
| age | 10.03 | 1.76 | 5.44 | 30 |
From the function description (see ?psych::describe):
The describe function in the psych package is meant to produce the most frequently requested stats in psychometric and psychology studies, and to produce them in an easy to read data.frame. If a grouping variable is called for in formula mode, it will also call describeBy to the processing.
The function has many input arguments (see again ?psych::describe), but requires only x: a data frame or matrix.
psych::describe(x = exDat)
vars n mean sd median trimmed mad min max range skew
msc1 1 750 2.52 0.74 3.00 2.52 1.48 1.00 4 3.00 -0.02
msc2 2 680 2.54 0.72 3.00 2.53 1.48 1.00 4 3.00 -0.02
msc3 3 750 2.49 0.75 2.00 2.49 1.48 1.00 4 3.00 0.00
msc4 4 750 2.48 0.75 2.00 2.48 1.48 1.00 4 3.00 0.06
age 5 670 10.03 1.76 10.03 10.02 1.46 5.44 30 24.56 2.11
sex 6 711 0.50 0.50 0.00 0.50 0.00 0.00 1 1.00 0.01
edu 7 750 2.67 1.25 3.00 2.79 1.48 0.00 4 4.00 -0.59
fLang* 8 730 4.89 0.58 5.00 5.00 0.00 1.00 7 6.00 -2.93
id 9 750 375.50 216.65 375.50 375.50 277.99 1.00 750 749.00 0.00
msc3r 10 750 2.51 0.75 3.00 2.51 1.48 1.00 4 3.00 0.00
msc4r 11 750 2.52 0.75 3.00 2.52 1.48 1.00 4 3.00 -0.06
msc 12 680 2.53 0.61 2.50 2.53 0.74 1.00 4 3.00 -0.07
kurtosis se
msc1 -0.31 0.03
msc2 -0.27 0.03
msc3 -0.34 0.03
msc4 -0.33 0.03
age 23.94 0.07
sex -2.00 0.02
edu -0.73 0.05
fLang* 19.28 0.02
id -1.20 7.91
msc3r -0.34 0.03
msc4r -0.33 0.03
msc -0.35 0.02
The describe output is easily transformed into a data.frame which then can be passed to the kbl function from the kableExtra package (Zhu, 2021).
kableExtra::kbl(x = as.data.frame(psych::describe(x = exDat))[c("n", "mean", "sd")],
caption = "Descriptive Statistics of the example data set calculated by the psych package [@R-psych]",
digits = 2) |>
kableExtra::kable_paper()
| | n | mean | sd |
|---|---|---|---|
| msc1 | 750 | 2.52 | 0.74 |
| msc2 | 680 | 2.54 | 0.72 |
| msc3 | 750 | 2.49 | 0.75 |
| msc4 | 750 | 2.48 | 0.75 |
| age | 670 | 10.03 | 1.76 |
| sex | 711 | 0.50 | 0.50 |
| edu | 750 | 2.67 | 1.25 |
| fLang* | 730 | 4.89 | 0.58 |
| id | 750 | 375.50 | 216.65 |
| msc3r | 750 | 2.51 | 0.75 |
| msc4r | 750 | 2.52 | 0.75 |
| msc | 680 | 2.53 | 0.61 |
If you need to calculate the descriptive statistics separately for groups, there is the describeBy function; use the group argument. If you set the fast argument to TRUE, only n, means, sds, min, max, and ranges are calculated.
psych::describeBy(x = exDat,
group ="sex",
fast = TRUE)
Descriptive statistics by group
sex: 0
vars n mean sd min max range se
msc1 1 357 2.48 0.72 1.00 4 3.00 0.04
msc2 2 323 2.53 0.71 1.00 4 3.00 0.04
msc3 3 357 2.52 0.71 1.00 4 3.00 0.04
msc4 4 357 2.50 0.73 1.00 4 3.00 0.04
age 5 320 10.13 1.96 5.44 30 24.56 0.11
sex 6 357 0.00 0.00 0.00 0 0.00 0.00
edu 7 357 2.71 1.26 0.00 4 4.00 0.07
fLang 8 347 NaN NA Inf -Inf -Inf NA
id 9 357 381.09 214.74 1.00 749 748.00 11.37
msc3r 10 357 2.48 0.71 1.00 4 3.00 0.04
msc4r 11 357 2.50 0.73 1.00 4 3.00 0.04
msc 12 323 2.50 0.59 1.00 4 3.00 0.03
------------------------------------------------------------
sex: 1
vars n mean sd min max range se
msc1 1 354 2.55 0.76 1.0 4.00 3.00 0.04
msc2 2 321 2.56 0.73 1.0 4.00 3.00 0.04
msc3 3 354 2.46 0.79 1.0 4.00 3.00 0.04
msc4 4 354 2.47 0.76 1.0 4.00 3.00 0.04
age 5 313 9.94 1.53 5.7 14.76 9.06 0.09
sex 6 354 1.00 0.00 1.0 1.00 0.00 0.00
edu 7 354 2.62 1.25 0.0 4.00 4.00 0.07
fLang 8 344 NaN NA Inf -Inf -Inf NA
id 9 354 369.27 220.29 2.0 750.00 748.00 11.71
msc3r 10 354 2.54 0.79 1.0 4.00 3.00 0.04
msc4r 11 354 2.53 0.76 1.0 4.00 3.00 0.04
msc 12 321 2.55 0.63 1.0 4.00 3.00 0.04
Coefficient \(\alpha\) is often called an internal consistency reliability coefficient, as it is based on the covariances among scale items and thus on the internal consistency among items. But internal consistency should not be conflated with homogeneity, where homogeneity implies that a single dimension underlies the set of items (Widaman & Revelle, 2022).
Following again the description and notation of Widaman & Revelle (2022):
Coefficient alpha, or \(\alpha\), can be written as
\[ \alpha = (\frac{p}{p-1})\left(\frac{\sigma_Y^2-\sum\limits_{j=1}^p\sigma_j^2}{\sigma_Y^2}\right) = (\frac{p}{p-1})\left(1- \frac{\sum\limits_{j=1}^p\sigma_j^2}{\sigma_Y^2}\right) \tag{9.1}\]
where \(p\) is the number of items on the scale, \(\sigma_j^2\) (\(j=1,...,p\)) is the variance of item \(j\), […]. \(\sigma_Y^2\) is the variance of the sum score.
Now let's calculate Cronbach's \(\alpha\). We use the variables msc1, msc2, msc3r, & msc4r from the example data set (exDat).
1. First, remove (with the na.omit function) all incomplete cases from the data set and store the number of items:
p <- length(mscItems)
2. Then calculate the sum (with the sum function) of the \(\sigma_j^2\) of all items using the apply function, and \(\sigma_Y^2\) of the total score (that is calculated by the rowSums function; see below for a detailed description of the function). A sketch of these calculations is shown below.
3. Finally, plug these quantities into Equation 9.1:
alpha <- (p/(p - 1)) * (1 - sigmaInd / sigmaTot)
alpha
[1] 0.8496289
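A minimal sketch of these three steps, assuming the item names are stored in the named list mscItems (the helper object mscDat and the use of var are assumptions):

```r
# Step 1: keep only complete cases on the items (na.omit) and count the items
mscDat <- na.omit(exDat[names(mscItems)])   # mscDat is a hypothetical helper object
p      <- length(mscItems)

# Step 2: sum of the item variances and variance of the sum score
sigmaInd <- sum(apply(mscDat, MARGIN = 2, FUN = var))
sigmaTot <- var(rowSums(mscDat))

# Step 3: plug everything into Equation 9.1
alpha <- (p/(p - 1)) * (1 - sigmaInd / sigmaTot)
alpha
```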
An important item statistic is the so-called item whole correlation for this item against the scale without this item (in the alpha function from the psych package (Revelle, 2023), this is called r.drop).
We use the item names stored in the named list object (i.e., mscItems):
names(mscItems)
[1] "msc1" "msc2" "msc3r" "msc4r"
The correlation of an item with the sum score of the remaining items can be calculated with the cor function. To ignore the item when calculating the sum score, we state [-1] after the names call of the named list. For the first item (msc1) this yields:
[1] 0.6657391
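A minimal sketch of this calculation for msc1; the use argument is an assumption (the items contain missing values, so some handling is needed):

```r
cor(x = exDat[, names(mscItems)[1]],            # the first item (msc1)
    y = rowSums(exDat[, names(mscItems)[-1]]),  # sum score without msc1
    use = "pairwise.complete.obs")              # assumption: missing-value handling
```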
To calculate this item whole correlation for all items at once, we can use the sapply function (see the sketch below). More specifically, we iterate through the item names (or, to be precise, through the length of the item names). In the input argument x of the cor function, we (only) use the respective item [itemNr], while we ignore it [-itemNr] when calculating the sum score in the second input argument y.
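A minimal sketch of that loop, again assuming the item names in mscItems and pairwise-complete observations (itemNr is just the loop index):

```r
sapply(1:length(names(mscItems)),
       FUN = function(itemNr) {
         cor(x = exDat[, names(mscItems)[itemNr]],           # the item itself
             y = rowSums(exDat[, names(mscItems)[-itemNr]]), # scale without the item
             use = "pairwise.complete.obs")                  # assumption: missing-value handling
       })
```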
In the psych package (Revelle, 2023), the alpha function is designed to calculate Cronbach's \(\alpha\). The function has several input arguments (see ?psych::alpha), but requires only x: a data.frame or matrix of data, or a covariance or correlation matrix.
The function returns a couple of lists, including:
- different \(\alpha\) estimates (i.e., raw_alpha, std.alpha)
- item statistics (e.g., the item whole correlation corrected for item overlap and scale reliability, the item whole correlation for this item against the scale without this item, …)
- response frequencies
- the calculated mean or sum score (depending on the cumulative argument)
- …
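Applied to the four msc items from above, the call (also echoed in the first line of the output below) is:

```r
psych::alpha(x = exDat[names(mscItems)])
```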
Reliability analysis
Call: psych::alpha(x = exDat[names(mscItems)])
raw_alpha std.alpha G6(smc) average_r S/N ase mean sd median_r
0.85 0.85 0.82 0.59 5.7 0.0089 2.5 0.62 0.57
95% confidence boundaries
lower alpha upper
Feldt 0.83 0.85 0.87
Duhachek 0.83 0.85 0.87
Reliability if an item is dropped:
raw_alpha std.alpha G6(smc) average_r S/N alpha se var.r med.r
msc1 0.82 0.82 0.76 0.60 4.5 0.012 0.01145 0.55
msc2 0.84 0.84 0.78 0.63 5.2 0.010 0.00617 0.59
msc3r 0.79 0.79 0.71 0.55 3.7 0.013 0.00080 0.55
msc4r 0.79 0.79 0.72 0.56 3.9 0.013 0.00063 0.55
Item statistics
n raw.r std.r r.cor r.drop mean sd
msc1 750 0.82 0.82 0.72 0.67 2.5 0.74
msc2 680 0.79 0.79 0.67 0.62 2.5 0.72
msc3r 750 0.87 0.86 0.82 0.74 2.5 0.75
msc4r 750 0.86 0.85 0.80 0.73 2.5 0.75
Non missing response frequency for each item
1 2 3 4 miss
msc1 0.07 0.42 0.44 0.08 0.00
msc2 0.06 0.41 0.45 0.07 0.09
msc3r 0.07 0.42 0.43 0.08 0.00
msc4r 0.08 0.41 0.44 0.08 0.00
Now, we calculate scale scores (i.e., \(msc_{sum}\), \(msc_{avg}\)) of the items \(msc_1, msc_2, msc_{3r}, msc_{4r}\).
The rowSums function needs one input (copied from the function description):
- x: an array of two or more dimensions, containing numeric, complex, integer or logical values, or a numeric data frame.
But the na.rm argument needs special attention:
- na.rm: logical. Should missing values (including NaN) be omitted from the calculations?
This argument is important when some items have missing data. The question is: Should the scores be built based on the available items (this procedure is called person mean imputation) or discarded?
Enders (2010) summarizes it as follows (p.51):
Until more research accumulates, you should use person mean imputation with caution and should perhaps avoid it altogether, particularly if there are high rates of item nonresponse.
This means we set na.rm = FALSE (see the sketch below). It is important to note that there are options to circumvent this issue, such as a model-based estimation of composite scores (Rose et al., 2019) or multiple imputation (Enders, 2010; see, e.g., Schafer & Graham, 2002).
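A minimal sketch of this rowSums approach with na.rm = FALSE, assuming the (already recoded) items msc1, msc2, msc3r, msc4r in exDat; the column names mscSum and mscAvg are placeholders:

```r
# Sum score: NA whenever a case misses at least one item (na.rm = FALSE)
exDat$mscSum <- rowSums(x = exDat[, c("msc1", "msc2", "msc3r", "msc4r")],
                        na.rm = FALSE)

# Average score: divide the sum score by the number of items
exDat$mscAvg <- exDat$mscSum / 4
```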
To calculate scale scores, you can also use the scoreItems function from the psych package (Revelle, 2023).
The scoreItems function needs at least two inputs (copied from the package description):
- keys: a list of scoring keys or a matrix or data.frame of -1, 0, or 1 weights for each item on each scale, which may be created by hand […]. Here we assign an equal weight (= 1) to all items.
- items: a matrix or data.frame of raw item scores.
However, there are more input arguments that are important:
- totals: if TRUE, find total scores (aka sum scores); if FALSE (default), find average scores.
- missing: missing = TRUE is the normal case, and data are imputed according to the impute option. With missing = FALSE, only complete cases are scored.
It is recommended to use missing = FALSE (see Approach 1: rowSums).
Because the function calculates several other statistics (e.g., Cronbach's \(\alpha\), the average correlation within a scale, …), we do it in a two-step approach (see the sketch below): execute the function and save the information in an object, and then extract (with the $ operator) the scores from the object (i.e., MscsumPsych$scores) while merging the scores with the example data set (i.e., by row names via by = 0). Merging is necessary because of missing = FALSE.
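A minimal sketch of these two steps; the keys list and the merge details beyond by = 0 are assumptions:

```r
# Step 1: score the msc items; totals = TRUE returns sum scores,
# missing = FALSE scores complete cases only
MscsumPsych <- psych::scoreItems(keys = list(msc = c("msc1", "msc2", "msc3r", "msc4r")),
                                 items = exDat,
                                 totals = TRUE,
                                 missing = FALSE)

# Step 2: extract the scores and merge them with the example data set
# by row names (by = 0)
exDat <- merge(x = exDat,
               y = MscsumPsych$scores,
               by = 0,
               all.x = TRUE)
```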
Because we need more than one scale for this section, you need to load the exDatComb data set (it is generated and merged here).
exDatComb <- readRDS("exDatComb.RDS")
The first step is to create a (nested) list that contains all the scales with the respective items (see the sketch below). Note that in this example, we assume that the reversed items were already recoded.
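A minimal sketch of such a nested list, assuming a second scale gsc with analogously named items; only the names of each element are used later, so the values (here 1) are placeholders:

```r
scaleList <- list(msc = c(msc1 = 1, msc2 = 1, msc3r = 1, msc4r = 1),   # msc items as above
                  gsc = c(gsc1 = 1, gsc2 = 1, gsc3r = 1, gsc4r = 1))   # assumed gsc item names
```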
Then we use the sapply function to iterate through the list elements and apply the rowSums function to the items within each list element (accessing the item names, which are the columns in the data set, with the names function). Note that we calculate average scores, because we divide the total score by the number of items.
exDatComb[,paste0(names(scaleList),
"Score")] <- sapply(1:length(scaleList),
function(s) rowSums(x = exDatComb[,names(scaleList[[s]])],
na.rm = FALSE)/length(scaleList[[s]]),
simplify = FALSE)
head(exDatComb[,c("id", paste0(names(scaleList), "Score"))])
id mscScore gscScore
1 1 2.25 2.75
2 2 1.75 3.25
3 3 2.50 1.75
4 4 2.25 2.50
5 5 2.25 2.75
6 6 2.75 2.75