9  Descriptive statistics and item analysis

In this part you will learn how to calculate descriptive statistics, item statistics and scale scores.

Note

Recall that we provide ready-to-use functions to calculate descriptive statistics for continuous and categorical variables as well as for scales (see here).

9.1 Descriptive statistics

Typical descriptive statistics are:

  • mean
  • standard deviation
  • minimum and maximum
  • absolute and relative frequencies

R provides functions to calculate the statistics: mean, sd, min, max, table, …

In addition, there are packages such as the psych package (Revelle, 2023) that offer helpful functionality.

For example, if we want to calculate the mean and the standard deviation of the variable age, we can use the respective functions (i.e., mean & sd):

mean(exDat$age,
     na.rm = T)
[1] 10.03468
sd(exDat$age,
   na.rm = T)
[1] 1.756173

Or, if we are interested in absolute and/or relative frequencies, we can use the table function:

table(exDat$sex)

  0   1 
357 354 
table(exDat$sex,
      useNA = "always")/length(exDat$sex)*100

   0    1 <NA> 
47.6 47.2  5.2 
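Equivalently, the division by length(exDat$sex) can be delegated to the base R prop.table function, which divides each cell of a table by the table total. A minimal sketch with a toy stand-in vector (hypothetical values):

```r
# Toy stand-in for exDat$sex (hypothetical values)
sex <- c(0, 1, 0, 1, NA)

# With useNA = "always", the table total equals length(sex),
# so this matches table(sex, useNA = "always") / length(sex) * 100
prop.table(table(sex, useNA = "always")) * 100
```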

However, often you need to calculate the descriptive statistics not only for a single variable, but for many variables. Again, in R this can be done in many different ways. Here we depict a base R solution using the apply function (for more on the apply function see the Tip on the apply function).

To make the calculation procedure easier to follow, we present it in a two-step1 approach:

  1. Using the apply function and save the output in an object (here: exDescr):
    • Input X is the example data set: exDat
    • Set MARGIN value to 2, because we want to apply the function over columns of exDat
    • FUN: Here we define a function2 that calculates the mean, standard deviation, minimum and maximum (combined through the c function). Within the braces {} you can state several (consecutive) expressions. Finally, we return the object (fOut) from the function. Note that the fOut object is not saved in the workspace.
exDescr <- apply(X = exDat[,1:5],
                 MARGIN = 2,
                 FUN = function(x) {
                   fOut <- c(
                     mean(x, na.rm = T),
                     sd(x, na.rm = T),
                     min(x, na.rm = T),
                     max(x, na.rm = T)
                     )
                   return(fOut)
                   })
  2. In the second step, we first transpose (t function) the returned matrix, then transform it into a data.frame3 object, and provide the column names.
exDescr <- as.data.frame(
  t(exDescr)
  )
colnames(exDescr) <- c("Mean", "SD", "Min", "Max")
exDescr
          Mean        SD     Min Max
msc1  2.518667 0.7373668 1.00000   4
msc2  2.544118 0.7186516 1.00000   4
msc3  2.488000 0.7482887 1.00000   4
msc4  2.480000 0.7463306 1.00000   4
age  10.034684 1.7561728 5.43976  30

  3. Finally, if you need a formatted table, the kbl function from the kableExtra package (Zhu, 2021) transforms matrices or data frames into nice tables.

kableExtra::kbl(exDescr,
                digits = 2) |> 
  kableExtra::kable_classic() 
Table 9.1: Descriptive Statistics of the example data set
       Mean   SD  Min Max
msc1   2.52 0.74 1.00   4
msc2   2.54 0.72 1.00   4
msc3   2.49 0.75 1.00   4
msc4   2.48 0.75 1.00   4
age   10.03 1.76 5.44  30
Note

Although it is probably a matter of coding taste, we do not recommend such a “nesting” approach, because it is difficult to read and comprehend. We like it neat. See also the note on What is the pipe operator?.

kableExtra::kbl(x = t(
  apply(exDat[,1:5],
        MARGIN = 2,
        FUN = function(x) {
          fOut <- c(
            mean(x, na.rm = T),
            sd(x, na.rm = T),
            min(x, na.rm = T),
            max(x, na.rm = T)
            )
          return(fOut)
          })
  ),
  digits = 2,
  col.names = c("Mean", "SD", "Min", "Max")) |>
  kableExtra::kable_classic()
Table 9.2: Descriptive Statistics of the example data set
       Mean   SD  Min Max
msc1   2.52 0.74 1.00   4
msc2   2.54 0.72 1.00   4
msc3   2.49 0.75 1.00   4
msc4   2.48 0.75 1.00   4
age   10.03 1.76 5.44  30

From the function description (see ?psych::describe):

The describe function in the psych package is meant to produce the most frequently requested stats in psychometric and psychology studies, and to produce them in an easy to read data.frame. If a grouping variable is called for in formula mode, it will also call describeBy to the processing.

The function has many input arguments (see again ?psych::describe), but requires only x: a data frame or matrix.

psych::describe(x = exDat)
       vars   n   mean     sd median trimmed    mad  min max  range  skew
msc1      1 750   2.52   0.74   3.00    2.52   1.48 1.00   4   3.00 -0.02
msc2      2 680   2.54   0.72   3.00    2.53   1.48 1.00   4   3.00 -0.02
msc3      3 750   2.49   0.75   2.00    2.49   1.48 1.00   4   3.00  0.00
msc4      4 750   2.48   0.75   2.00    2.48   1.48 1.00   4   3.00  0.06
age       5 670  10.03   1.76  10.03   10.02   1.46 5.44  30  24.56  2.11
sex       6 711   0.50   0.50   0.00    0.50   0.00 0.00   1   1.00  0.01
edu       7 750   2.67   1.25   3.00    2.79   1.48 0.00   4   4.00 -0.59
fLang*    8 730   4.89   0.58   5.00    5.00   0.00 1.00   7   6.00 -2.93
id        9 750 375.50 216.65 375.50  375.50 277.99 1.00 750 749.00  0.00
msc3r    10 750   2.51   0.75   3.00    2.51   1.48 1.00   4   3.00  0.00
msc4r    11 750   2.52   0.75   3.00    2.52   1.48 1.00   4   3.00 -0.06
msc      12 680   2.53   0.61   2.50    2.53   0.74 1.00   4   3.00 -0.07
       kurtosis   se
msc1      -0.31 0.03
msc2      -0.27 0.03
msc3      -0.34 0.03
msc4      -0.33 0.03
age       23.94 0.07
sex       -2.00 0.02
edu       -0.73 0.05
fLang*    19.28 0.02
id        -1.20 7.91
msc3r     -0.34 0.03
msc4r     -0.33 0.03
msc       -0.35 0.02

The describe output is easily transformed into a data.frame which then can be passed to the kbl function from the kableExtra package (Zhu, 2021).

kableExtra::kbl(x = as.data.frame(psych::describe(x = exDat))[c("n", "mean", "sd")],
                caption = "Descriptive Statistics of the example data set calculated by the psych package [@R-psych]",
                digits = 2) |>
  kableExtra::kable_paper()
Descriptive Statistics of the example data set calculated by the psych package [@R-psych]
         n   mean     sd
msc1   750   2.52   0.74
msc2   680   2.54   0.72
msc3   750   2.49   0.75
msc4   750   2.48   0.75
age    670  10.03   1.76
sex    711   0.50   0.50
edu    750   2.67   1.25
fLang* 730   4.89   0.58
id     750 375.50 216.65
msc3r  750   2.51   0.75
msc4r  750   2.52   0.75
msc    680   2.53   0.61

If you need to calculate the descriptive statistics separately for groups, there is the describeBy function; use the group argument. If you set the fast argument to TRUE, only n, means, sds, min, max, and ranges are calculated.

psych::describeBy(x = exDat,
                  group ="sex",
                  fast = TRUE)

 Descriptive statistics by group 
sex: 0
      vars   n   mean     sd  min  max  range    se
msc1     1 357   2.48   0.72 1.00    4   3.00  0.04
msc2     2 323   2.53   0.71 1.00    4   3.00  0.04
msc3     3 357   2.52   0.71 1.00    4   3.00  0.04
msc4     4 357   2.50   0.73 1.00    4   3.00  0.04
age      5 320  10.13   1.96 5.44   30  24.56  0.11
sex      6 357   0.00   0.00 0.00    0   0.00  0.00
edu      7 357   2.71   1.26 0.00    4   4.00  0.07
fLang    8 347    NaN     NA  Inf -Inf   -Inf    NA
id       9 357 381.09 214.74 1.00  749 748.00 11.37
msc3r   10 357   2.48   0.71 1.00    4   3.00  0.04
msc4r   11 357   2.50   0.73 1.00    4   3.00  0.04
msc     12 323   2.50   0.59 1.00    4   3.00  0.03
------------------------------------------------------------ 
sex: 1
      vars   n   mean     sd min    max  range    se
msc1     1 354   2.55   0.76 1.0   4.00   3.00  0.04
msc2     2 321   2.56   0.73 1.0   4.00   3.00  0.04
msc3     3 354   2.46   0.79 1.0   4.00   3.00  0.04
msc4     4 354   2.47   0.76 1.0   4.00   3.00  0.04
age      5 313   9.94   1.53 5.7  14.76   9.06  0.09
sex      6 354   1.00   0.00 1.0   1.00   0.00  0.00
edu      7 354   2.62   1.25 0.0   4.00   4.00  0.07
fLang    8 344    NaN     NA Inf   -Inf   -Inf    NA
id       9 354 369.27 220.29 2.0 750.00 748.00 11.71
msc3r   10 354   2.54   0.79 1.0   4.00   3.00  0.04
msc4r   11 354   2.53   0.76 1.0   4.00   3.00  0.04
msc     12 321   2.55   0.63 1.0   4.00   3.00  0.04
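A note on the output format: if you prefer the grouped results as one flat table (e.g., to pass them on to kableExtra::kbl), describeBy also offers the mat argument; mat = TRUE collapses the by-group list into a single matrix-like output. A sketch with hypothetical toy data:

```r
# Hypothetical toy data with a grouping variable
toyDat <- data.frame(score = c(2, 3, 2, 4, 1, 3, 2, 2),
                     grp   = rep(0:1, each = 4))

# mat = TRUE returns one flat table instead of a list of groups
psych::describeBy(x = toyDat,
                  group = "grp",
                  fast = TRUE,
                  mat = TRUE)
```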

9.2 Cronbach's \(\alpha\) and item statistics

Coefficient \(\alpha\) is often called an internal consistency reliability coefficient, as it is based on covariances among scale items and thus internal consistency among items. But internal consistency should not be conflated with homogeneity, where homogeneity implies that a single dimension underlies the set of items (Widaman & Revelle, 2022).

Coefficient \(\alpha\) debate

Some interesting reads: Cronbach (1951); Revelle & Zinbarg (2009); Sijtsma (2009); McNeish (2017); Savalei & Reise (2019)

Following again the description and notation of Widaman & Revelle (2022):

Coefficient alpha, or \(\alpha\), can be written as

\[ \alpha = (\frac{p}{p-1})\left(\frac{\sigma_Y^2-\sum\limits_{j=1}^p\sigma_j^2}{\sigma_Y^2}\right) = (\frac{p}{p-1})\left(1- \frac{\sum\limits_{j=1}^p\sigma_j^2}{\sigma_Y^2}\right) \tag{9.1}\]

where \(p\) is the number of items on the scale, \(\sigma_j^2\) (\(j=1,...,p\)) is the variance of item \(j\), […]. \(\sigma_Y^2\) is the variance of the sum score.

Now let's calculate Cronbach's \(\alpha\). We use the variables msc1, msc2, msc3r, & msc4r from the example data set (exDat).

  1. Before we calculate Cronbach's \(\alpha\), we remove (using the na.omit function) all incomplete cases from the data set.
itemDF <- na.omit(exDat[names(mscItems)])
  2. Get the number of items (\(p\))
p <- length(mscItems)
  3. Calculate the sum (i.e., sum function) of the item variances \(\sigma_j^2\) using the apply function, and the variance \(\sigma_Y^2\) of the sum score (calculated by the rowSums function; see below for a detailed description of that function).
sigmaInd <- sum(apply(itemDF, 2, sd)^2)
sigmaTot <- var(rowSums(itemDF))
  4. Apply Equation 9.1
alpha <- (p/(p - 1)) * (1 - sigmaInd/ sigmaTot )
alpha
[1] 0.8496289

An important item statistic is the so-called item-whole correlation of an item against the scale without this item (in the alpha function from the psych package (Revelle, 2023) this is called r.drop).

  1. Recall that the names of the items are stored in the named list object (i.e., mscItems).
names(mscItems)
[1] "msc1"  "msc2"  "msc3r" "msc4r"
  2. To correlate the first item with the sum score of the remaining items, we use the cor function. To ignore the item when calculating the sum score, we state [-1] after the names call of the named list.
cor(x = exDat[,"msc1"],
    y = rowSums(exDat[names(mscItems)[-1]]),
    use = "pairwise.complete.obs")
[1] 0.6657391
  3. Now we do it for all items using the sapply function. More specifically, we iterate through the item names (or, to be precise, through the length of the item names)4. In the input argument x of the cor function, we use (only) the respective item [itemNr], while we ignore it [-itemNr] when calculating the sum score in the second input argument y.
sapply(1:length(names(mscItems)),
       function(itemNr) cor(x = exDat[,names(mscItems)[itemNr]],
                            y = rowSums(exDat[names(mscItems)[-itemNr]]),
                            use = "pairwise.complete.obs"))
[1] 0.6657391 0.6269790 0.7422577 0.7205174

In the psych package (Revelle, 2023), the alpha function is designed to calculate Cronbach's \(\alpha\). The function has several input arguments (see ?psych::alpha), but requires only x: a data.frame or matrix of data, or a covariance or correlation matrix.

The function returns a couple of lists, including:

  • different \(\alpha\) estimates (i.e., raw_alpha, std.alpha)

  • item statistics (e.g., item whole correlation corrected for item overlap and scale reliability, item whole correlation for this item against the scale without this item, …)

  • response frequencies

  • calculated mean- or sum score (depending on the cumulative argument)

psych::alpha(x = exDat[names(mscItems)])

Reliability analysis   
Call: psych::alpha(x = exDat[names(mscItems)])

  raw_alpha std.alpha G6(smc) average_r S/N    ase mean   sd median_r
      0.85      0.85    0.82      0.59 5.7 0.0089  2.5 0.62     0.57

    95% confidence boundaries 
         lower alpha upper
Feldt     0.83  0.85  0.87
Duhachek  0.83  0.85  0.87

 Reliability if an item is dropped:
      raw_alpha std.alpha G6(smc) average_r S/N alpha se   var.r med.r
msc1       0.82      0.82    0.76      0.60 4.5    0.012 0.01145  0.55
msc2       0.84      0.84    0.78      0.63 5.2    0.010 0.00617  0.59
msc3r      0.79      0.79    0.71      0.55 3.7    0.013 0.00080  0.55
msc4r      0.79      0.79    0.72      0.56 3.9    0.013 0.00063  0.55

 Item statistics 
        n raw.r std.r r.cor r.drop mean   sd
msc1  750  0.82  0.82  0.72   0.67  2.5 0.74
msc2  680  0.79  0.79  0.67   0.62  2.5 0.72
msc3r 750  0.87  0.86  0.82   0.74  2.5 0.75
msc4r 750  0.86  0.85  0.80   0.73  2.5 0.75

Non missing response frequency for each item
         1    2    3    4 miss
msc1  0.07 0.42 0.44 0.08 0.00
msc2  0.06 0.41 0.45 0.07 0.09
msc3r 0.07 0.42 0.43 0.08 0.00
msc4r 0.08 0.41 0.44 0.08 0.00
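The printed report can also be accessed programmatically: the object returned by psych::alpha is a list, so components such as the \(\alpha\) estimates (in the total element) and the r.drop column (in the item.stats element) can be extracted with the $ operator. A sketch with hypothetical toy items:

```r
set.seed(123)
# Three hypothetical, positively correlated 4-point items
common <- sample(1:4, 200, replace = TRUE)
items  <- data.frame(i1 = common,
                     i2 = pmin(pmax(common + sample(-1:1, 200, replace = TRUE), 1), 4),
                     i3 = pmin(pmax(common + sample(-1:1, 200, replace = TRUE), 1), 4))

fit <- psych::alpha(x = items)
fit$total$raw_alpha    # the raw_alpha estimate shown in the report
fit$item.stats$r.drop  # item-whole correlations without the respective item
```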

9.3 Calculating scale scores

Now, we calculate scale scores (i.e., \(msc_{sum}\), \(msc_{avg}\)) of the items \(msc_1, msc_2, msc_{3r}, msc_{4r}\).

Scale scores

Some interesting reads: Widaman & Revelle (2022); McNeish & Wolf (2020); Rose et al. (2019)

The rowSums function needs one input (copied from the function description):

  • x: an array of two or more dimensions, containing numeric, complex, integer or logical values, or a numeric data frame.

But the na.rm argument needs special attention:

  • na.rm: logical. Should missing values (including NaN) be omitted from the calculations?

This argument is important when some items have missing data. The question is: Should the scores be built from the available items (a procedure called person mean imputation), or should persons with missing items receive no score?

Enders (2010) summarizes it as follows (p.51):

Until more research accumulates, you should use person mean imputation with caution and should perhaps avoid it altogether, particularly if there are high rates of item nonresponse.

This means we set na.rm = FALSE. It is important to note that there are options to circumvent this issue, such as model-based estimation of composite scores (Rose et al., 2019) or multiple imputation (Enders, 2010; see e.g., Schafer & Graham, 2002).
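The consequence of the two na.rm settings can be illustrated with a toy data set (hypothetical values). Note that for average scores, rowMeans with na.rm = TRUE is what implements person mean imputation, because it divides by the number of available (not total) items:

```r
# Two persons, three items; person 2 skipped item i2
itemToy <- data.frame(i1 = c(2, 3),
                      i2 = c(3, NA),
                      i3 = c(2, 4))

rowMeans(itemToy, na.rm = FALSE)  # person 2 gets NA
rowMeans(itemToy, na.rm = TRUE)   # person 2 gets (3 + 4) / 2 = 3.5 (person mean imputation)
```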

  1. Calculation of the sum score
exDat$mscsum <- rowSums(x = exDat[,names(mscItems)],
                        na.rm = FALSE)
  2. Calculation of the average score
exDat$mscavg <- rowSums(x = exDat[,names(mscItems)],
                        na.rm = FALSE)/length(names(mscItems))

To calculate scale scores, you can also use the scoreItems function from the psych package (Revelle, 2023).

The scoreItems function needs at least two inputs (copied from the package description):

  • keys: list of scoring keys or a matrix or data.frame of -1, 0, or 1 weights for each item on each scale which may be created by hand […]. Here we assign an equal weight (= 1) to all items.
  • items: Matrix or data.frame of raw item scores

However, there are more input arguments that are important:

  • totals: if TRUE, find total scores (aka sum scores); if FALSE (default), find average scores
  • missing: missing = TRUE is the normal case and data are imputed according to the impute option. missing = FALSE, only complete cases are scored.

It is recommended to use missing = FALSE (see Approach 1: rowSums).

Because the function calculates several other statistics (e.g., Cronbach's \(\alpha\), the average correlation within a scale, …), we do it in a two-step approach: we execute the function and save the output in an object, and then extract (with the $ operator) the scores from that object (i.e., MscsumPsych$scores) and merge them with the example data set (i.e., by row names via by = 0). Merging is necessary because of missing = FALSE.

  1. Calculation of the sum score
MscsumPsych <- psych::scoreItems(keys = rep(1, length(names(mscItems))),
                                 items = exDat[,names(mscItems)],
                                 totals = TRUE,
                                 missing = FALSE,
                                 min = 1,
                                 max = 4)


colnames(MscsumPsych$scores) <- "mscsum2"

exDat <- merge(exDat, MscsumPsych$scores,
               by = 0,
               all.x = T)
exDat$Row.names <- NULL
  2. Calculation of the average score
MscavgPsych <- psych::scoreItems(keys = rep(1, length(names(mscItems))),
                                 items = exDat[,names(mscItems)],
                                 totals = FALSE,
                                 missing = FALSE,
                                 min = 1,
                                 max = 4)


colnames(MscavgPsych$scores) <- "mscavg2"

exDat <- merge(exDat, MscavgPsych$scores,
               by = 0,
               all.x = T)
exDat$Row.names <- NULL

Because we need more than one scale for this section, you need to load the exDatComb data set (it is generated and merged here).

exDatComb <- readRDS("exDatComb.RDS")

The first step is to create a (nested) list that contains all the scales with their respective items. Note that in this example, we assume that the reversed items were already recoded.

scaleList <- list(msc = list("msc1" = c("I am good at mathematics."),
                             "msc2" = c("I was always good at mathematics."),
                             "msc3" = c("..."),
                             "msc4" = c("...")
                             ),
                  gsc = list("gsc1" = c("I am good at German."),
                             "gsc2" = c("I was always good at German."),
                             "gsc3" = c("..."),
                             "gsc4" = c("...")
                             )
                  )

We then use the sapply function to iterate through the list elements and apply the rowSums function to the items within the second-level list (accessing the item names, which are the columns in the data set, with the names function). Note that we calculate average scores, because we divide the sum score by the number of items.

exDatComb[,paste0(names(scaleList),
                  "Score")] <- sapply(1:length(scaleList),
                                      function(s) rowSums(x = exDatComb[,names(scaleList[[s]])],
                                                          na.rm = FALSE)/length(scaleList[[s]]),
                                      simplify = FALSE)

head(exDatComb[,c("id", paste0(names(scaleList), "Score"))])
  id mscScore gscScore
1  1     2.25     2.75
2  2     1.75     3.25
3  3     2.50     1.75
4  4     2.25     2.50
5  5     2.25     2.75
6  6     2.75     2.75

  1. That is actually a three-step approach.↩︎

  2. It is also possible to write a function in advance and use it in this argument. A short introduction on how to write functions can be found here.↩︎
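For example (a sketch; the function name descStats is hypothetical):

```r
# Define the function in advance ...
descStats <- function(x) {
  c(Mean = mean(x, na.rm = TRUE),
    SD   = sd(x, na.rm = TRUE))
}

# ... and pass it to the FUN argument of apply
apply(X = data.frame(a = 1:5, b = c(2, 4, NA, 8, 10)),
      MARGIN = 2,
      FUN = descStats)
```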

  3. Note that this is actually not necessary.↩︎

  4. Check it by writing: 1:length(names(mscItems))↩︎