Data documentation with R and Quarto

Authors

Noel Wytopil (presenter)

Sonja Gelzenleuchter (presenter & creator)

Sven Rieger (creator)

Workshop held on:

April 13, 2023

1 Preparation

This Quarto book provides an introduction to data documentation with R and Quarto and serves as the accompanying script for the workshop. For an overview of the workshop agenda, see the Introduction section.

Note

The material is a work in progress. It is the first time that the workshop will be held in this format. If you have feedback or encounter any bugs, please send us an email. The book was last updated on April 12, 2023.

Please prepare yourself by following the steps below:

If you encounter any problems, please send us an email.

1.1 Software installation

Please install the following software and make sure that you download the latest released version of each program:

1.2 Package installation

R is an integrated suite of software facilities for data manipulation, calculation and graphical display (see https://www.r-project.org).

One of R's strengths is its large collection of packages. During the workshop, we will use the following R packages:

pkgList <- c("rmarkdown",
             "knitr", # tables
             "kableExtra", # tables
             "tibble", # data frame
             "data.table", # rbindlist function
             "haven", # read data
             "lavaan", # generate data
             "tidyr", # reshape tidyverse
             "dplyr", # prepare data
             "moments", # skewness/kurtosis
             "car", # recoding
             "stringr", # strings
             "psych", # descriptive statistics
             "ggplot2",# plots
             "scales") # percent 

1. rmarkdown (v2.21, Allaire et al., 2023)
2. knitr (v1.42, Xie, 2023)
3. kableExtra (v1.3.4, Zhu, 2021)
4. tibble (v3.2.1, Müller & Wickham, 2023)
5. data.table (v1.14.8, Dowle & Srinivasan, 2023)
6. haven (v2.5.1, Wickham et al., 2022)
7. lavaan (v0.6.15, Rosseel et al., 2023)
8. tidyr (v1.3.0, Wickham, Vaughan, et al., 2023)
9. dplyr (v1.1.1, Wickham, François, et al., 2023)
10. moments (v0.14.1, Komsta & Novomestky, 2022)
11. car (v3.1.1, Fox et al., 2022)
12. stringr (v1.5.0, Wickham, 2022)
13. psych (v2.3.3, Revelle, 2023)
14. ggplot2 (v3.4.1, Wickham, Chang, et al., 2023)
15. scales (v1.2.1, Wickham & Seidel, 2022)

You can install them (check the versions!) with the following code:

# Install only those packages that are not installed yet
invisible(lapply(pkgList,
                 function(x)
                   if (!x %in% rownames(installed.packages())) install.packages(x)))

Session information:

R version 4.2.3 (2023-03-15 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19045)

Matrix products: default

locale:
[1] LC_COLLATE=German_Germany.utf8     LC_CTYPE=German_Germany.utf8      
[3] LC_MONETARY=German_Germany.utf8    LC_NUMERIC=C                      
[5] LC_TIME=English_United States.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] htmlwidgets_1.5.4 compiler_4.2.3    fastmap_1.1.0     cli_3.6.1        
 [5] tools_4.2.3       htmltools_0.5.4   rstudioapi_0.14   yaml_2.3.7       
 [9] rmarkdown_2.21    knitr_1.42        jsonlite_1.8.4    xfun_0.36        
[13] digest_0.6.29     rlang_1.1.0       evaluate_0.19    
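
The advice to check the versions can be sketched in a few lines of R. This is a minimal sketch, not part of the workshop code: the helper checkVersion() and the vector minVersions are hypothetical names, and the version numbers are a subset copied from the package list above.

```r
# Versions copied from the package list above (subset for brevity);
# minVersions and checkVersion() are illustrative names, not workshop code.
minVersions <- c(rmarkdown = "2.21", knitr = "1.42", dplyr = "1.1.1")

checkVersion <- function(pkg, minVer) {
  # Packages that cannot be found are reported instead of raising an error
  if (!requireNamespace(pkg, quietly = TRUE)) return("not installed")
  # packageVersion() objects can be compared with version strings
  if (utils::packageVersion(pkg) >= minVer) "ok" else "outdated"
}

# One status per package, named by package
status <- mapply(checkVersion, names(minVersions), minVersions)
print(status)
```

Packages reported as "outdated" or "not installed" can then be passed to install.packages().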

Note that we often do not load the packages but call their functions via the :: operator (e.g., psych::describe()).
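
To illustrate the difference between the two calling styles, here is a small sketch that uses the tools package, which ships with R, so that it runs without any extra installation. The file name is a made-up example.

```r
# tools ships with R but is not attached in a default session,
# so its functions are reached through the :: operator.
fileName <- "codebook.qmd"  # hypothetical file name
ext <- tools::file_ext(fileName)
print(ext)

# The same call after attaching the package with library()
library(tools)
ext2 <- file_ext(fileName)
identical(ext, ext2)
```

Writing the package name in front of the function makes it visible where each function comes from and avoids name clashes between packages.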

1.3 Data set

Finally, we will use a simulated example data set. To generate it, execute the following code:

PopMod <- "
eta1 =~ .8*msc1 + .8*msc2 + -.8*msc3 + -.8*msc4
eta1 ~~ 1*eta1
eta1 ~ 0*1

msc3 ~~ .2*msc4

msc1 | -1.5*t1 + 0*t2 + 1.5*t3
msc2 | -1.5*t1 + 0*t2 + 1.5*t3
msc3 | 1.5*t1 + 0*t2 + -1.5*t3
msc4 | 1.5*t1 + 0*t2 + -1.5*t3

age ~ 10*1
age ~~ 2.5*age

sex | 0*t1
sex ~*~ .5*sex

eta1 ~~ age + sex
"

exDat <- lavaan::simulateData(model = PopMod,
                              sample.nobs = seq(50,250, by = 50),
                              seed = 999)

Next, apply some cosmetic changes and add missing data.

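# Recode sex from 1/2 to 0/1 and derive edu (0-4) from the group variable
exDat$sex <- exDat$sex - 1
exDat$edu <- exDat$group - 1
exDat$group <- NULL

# Proportions of values to set to missing
propMiss1 <- .05
propMiss2 <- .1

# Set a value to NA whenever the Bernoulli draw for that row equals 1
exDat$sex <- ifelse(
  rbinom(nrow(exDat), size = 1, prob = propMiss1) == 1,
  NA,
  exDat$sex
)

exDat$age <- ifelse(
  rbinom(nrow(exDat), size = 1, prob = propMiss2) == 1,
  NA,
  exDat$age
)

exDat$msc2 <- ifelse(
  rbinom(nrow(exDat), size = 1, prob = propMiss2) == 1,
  NA,
  exDat$msc2
)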
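
The pattern used above (one Bernoulli draw per row, NA where the draw equals 1) can also be wrapped in a small helper. The following is only a sketch with toy data: addMissing() is a hypothetical function that is not used elsewhere in this book, and the seed and the simulated age vector are arbitrary choices for illustration.

```r
# Hypothetical helper: set each element of x to NA with probability prop
addMissing <- function(x, prop) {
  # One Bernoulli indicator per element; 1 means "set to NA"
  isMiss <- rbinom(length(x), size = 1, prob = prop) == 1
  x[isMiss] <- NA
  x
}

set.seed(123)                            # arbitrary seed for the toy example
age <- rnorm(1000, mean = 10, sd = 1.5)  # toy data, not the workshop data
ageMiss <- addMissing(age, prop = .1)

# Observed share of missing values; it varies around prop from draw to draw
mean(is.na(ageMiss))
```

Because the draws are random, the observed share of missing values only approximates the requested proportion.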

Add a character variable (fLang) with inconsistent spellings, missing values, and blank strings.

exDat$fLang <- rep(c("german", "ger", "germn",
                     "italian",
                     "french",
                     NA,
                     " ",
                     ""),
                   c(650, 49, 1, 10, 10, 20, 5, 5))
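
To see what this variable looks like before any cleaning, a frequency table is a quick check. The sketch below rebuilds the vector on its own (outside of exDat) so that it runs without the simulated data; tab, nDistinct, and nValid are illustrative names.

```r
# Rebuild the character vector from above as a stand-alone object
fLang <- rep(c("german", "ger", "germn", "italian", "french", NA, " ", ""),
             c(650, 49, 1, 10, 10, 20, 5, 5))

# Frequency table including missing values
tab <- table(fLang, useNA = "ifany")
print(tab)

# Number of distinct non-missing values and number of valid entries
nDistinct <- length(unique(fLang[!is.na(fLang)]))
nValid <- sum(!is.na(fLang))
c(nDistinct = nDistinct, nValid = nValid)
```

The blank string and the empty string count as valid (non-missing) values here, which is why such entries need attention during data preparation.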

Add an outlier to the variable age.

exDat[600, "age"] <- 30

Add an id variable.

exDat$id <- 1:nrow(exDat)

Descriptive statistics of the generated data

        vars    n   mean    sd  median  trimmed   mad   min  max  range   skew  kurtosis    se
msc1       1  750   2.52  0.74    3.00     2.52  1.48  1.00    4   3.00  -0.02     -0.31  0.03
msc2       2  680   2.54  0.72    3.00     2.53  1.48  1.00    4   3.00  -0.02     -0.27  0.03
msc3       3  750   2.49  0.75    2.00     2.49  1.48  1.00    4   3.00   0.00     -0.34  0.03
msc4       4  750   2.48  0.75    2.00     2.48  1.48  1.00    4   3.00   0.06     -0.33  0.03
age        5  670  10.03  1.76   10.03    10.02  1.46  5.44   30  24.56   2.11     23.94  0.07
sex        6  711   0.50  0.50    0.00     0.50  0.00  0.00    1   1.00   0.01     -2.00  0.02
edu        7  750   2.67  1.25    3.00     2.79  1.48  0.00    4   4.00  -0.59     -0.73  0.05
fLang*     8  730   4.89  0.58    5.00     5.00  0.00  1.00    7   6.00  -2.93     19.28  0.02

Correlation table of the generated data

       msc1   msc2   msc3   msc4    age    sex    edu
msc1   1.00
msc2   0.54   1.00
msc3  -0.59  -0.55   1.00
msc4  -0.58  -0.53   0.72   1.00
age    0.01   0.04   0.01   0.00   1.00
sex    0.05   0.02  -0.04  -0.02  -0.05   1.00
edu    0.01  -0.03   0.00   0.05  -0.02  -0.04   1.00