Data documentation with R and Quarto

Authors

Noel Wytopil (presenter)

Sonja Gelzenleuchter (presenter & creator)

Sven Rieger (creator)

Workshop held on:

April 13, 2023

1 Preparation

This Quarto book provides an introduction to data documentation with R and Quarto and serves as the accompanying script for the workshop. For an overview of the workshop agenda, see the Introduction section.

Note

The material is a work in progress, and this is the first time the workshop is held in this format. If you have feedback or encounter any bugs, please send us an email. The book was last updated on April 12, 2023.

Please prepare yourself by following the steps below:

Software installation
Package installation
Data set

If you encounter any problems, please send us an email.

1.1 Software installation

Please install the following software and make sure that you download the latest released version of each program:

R (v4.2.3, R Core Team, 2023): https://cran.r-project.org/bin/windows/base/
RStudio (v2023.3.0.386, Posit team, 2023): https://posit.co/downloads/
Quarto (v1.3, Allaire, 2022): https://quarto.org/docs/get-started/

1.2 Package installation

R is an integrated suite of software facilities for data manipulation, calculation and graphical display (see https://www.r-project.org).

One of R's strengths is its large collection of packages. During the workshop, we will use the following R packages:

pkgList <- c("rmarkdown",
             "knitr", # tables
             "kableExtra", # tables
             "tibble", # data frame
             "data.table", # rbindlist function
             "haven", # read data
             "lavaan", # generate data
             "tidyr", # reshape tidyverse
             "dplyr", # prepare data
             "moments", # skewness/kurtosis
             "car", # recoding
             "stringr", # strings
             "psych", # descriptive statistics
             "ggplot2",# plots
             "scales") # percent

rmarkdown (v2.21, Allaire et al., 2023)
knitr (v1.42, Xie, 2023)
kableExtra (v1.3.4, Zhu, 2021)
tibble (v3.2.1, Müller & Wickham, 2023)
data.table (v1.14.8, Dowle & Srinivasan, 2023)
haven (v2.5.1, Wickham et al., 2022)
lavaan (v0.6.15, Rosseel et al., 2023)
tidyr (v1.3.0, Wickham, Vaughan, et al., 2023)
dplyr (v1.1.1, Wickham, François, et al., 2023)
moments (v0.14.1, Komsta & Novomestky, 2022)
car (v3.1.1, Fox et al., 2022)
stringr (v1.5.0, Wickham, 2022)
psych (v2.3.3, Revelle, 2023)
ggplot2 (v3.4.1, Wickham, Chang, et al., 2023)
scales (v1.2.1, Wickham & Seidel, 2022)

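You can install any packages that are still missing with the following code. Packages that are already installed are skipped, so please check that their versions match the list above: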

lapply(pkgList,
       function(x) 
         if(!x %in% rownames(installed.packages())) install.packages(x))
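Because the install code skips packages that are already present, an older version may remain on your machine. The following is a minimal sketch of how the installed versions could be compared with the versions listed above. The helper `installedVersion()` and the object names are illustrative assumptions and not part of the workshop material, and only three of the packages are checked to keep the example short.

```r
# Subset of the workshop packages with the versions listed above
wanted <- c(knitr = "1.42", dplyr = "1.1.1", ggplot2 = "3.4.1")

# Hypothetical helper: installed version as a string, or NA if the package is missing
installedVersion <- function(pkg) {
  if (requireNamespace(pkg, quietly = TRUE)) {
    as.character(utils::packageVersion(pkg))
  } else {
    NA_character_
  }
}

have <- vapply(names(wanted), installedVersion, character(1))

# TRUE if the package is installed in at least the listed version
ok <- vapply(seq_along(wanted), function(i) {
  !is.na(have[i]) && utils::compareVersion(have[i], wanted[i]) >= 0
}, logical(1))

versionCheck <- data.frame(package   = names(wanted),
                           listed    = unname(wanted),
                           installed = unname(have),
                           ok        = ok)
versionCheck
```

Rows with `ok = FALSE` point to packages that are missing or older than the listed version; these can be updated with `install.packages()`.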

Information About the R Session

sessionInfo()

R version 4.2.3 (2023-03-15 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19045)

Matrix products: default

locale:
[1] LC_COLLATE=German_Germany.utf8     LC_CTYPE=German_Germany.utf8      
[3] LC_MONETARY=German_Germany.utf8    LC_NUMERIC=C                      
[5] LC_TIME=English_United States.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] htmlwidgets_1.5.4 compiler_4.2.3    fastmap_1.1.0     cli_3.6.1        
 [5] tools_4.2.3       htmltools_0.5.4   rstudioapi_0.14   yaml_2.3.7       
 [9] rmarkdown_2.21    knitr_1.42        jsonlite_1.8.4    xfun_0.36        
[13] digest_0.6.29     rlang_1.1.0       evaluate_0.19

Note that we often do not load the packages with library(), but call their functions via the :: operator instead (e.g., psych::describe()).
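As a small illustration of the `::` notation, the sketch below calls a function through its namespace. It uses the stats package, which ships with R, so that it runs without any additional installation; the vector `x` is made up for the example.

```r
x <- c(2, 4, 4, 5)

# Call median() via its namespace; no library() call is needed
stats::median(x)

# The same pattern works for any installed package,
# e.g. psych::describe(x), provided that psych is installed.
```

Writing the package name in front of the function also makes it visible in the script where each function comes from.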

1.3 Data set

Finally, we will use a simulated example data set. To generate it, execute the following code:

PopMod <- "
eta1 =~ .8*msc1 + .8*msc2 + -.8*msc3 + -.8*msc4
eta1 ~~ 1*eta1
eta1 ~ 0*1

msc3 ~~ .2*msc4

msc1 | -1.5*t1 + 0*t2 + 1.5*t3
msc2 | -1.5*t1 + 0*t2 + 1.5*t3
msc3 | 1.5*t1 + 0*t2 + -1.5*t3
msc4 | 1.5*t1 + 0*t2 + -1.5*t3

age ~ 10*1
age ~~ 2.5*age

sex | 0*t1
sex ~*~ .5*sex

eta1 ~~ age + sex
"

exDat <- lavaan::simulateData(model = PopMod,
                              sample.nobs = seq(50,250, by = 50),
                              seed = 999)
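Passing a vector to `sample.nobs` makes `lavaan::simulateData()` generate one group per element and add a `group` column to the result. The short sketch below only reproduces the arithmetic behind that argument; the object name `groupSizes` is an assumption introduced for illustration.

```r
# Group sizes as passed to sample.nobs
groupSizes <- seq(50, 250, by = 50)

groupSizes          # 50 100 150 200 250
length(groupSizes)  # 5 groups
sum(groupSizes)     # 750 rows in total
```

This is why the generated data set has 750 rows and a group variable with five values.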

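Next, some cosmetic changes (recoding sex to 0/1 and deriving edu from the group variable), followed by the introduction of missing values in sex, age, and msc2.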

exDat$sex <- exDat$sex-1
exDat$edu <- exDat$group-1
exDat$group <- NULL

propMiss1 <- .05
propMiss2 <- .1

exDat$sex <- ifelse (
  rbinom(
    nrow(exDat),
    size = 1,
    propMiss1) == 1,
  NA,
  exDat$sex
  )

exDat$age <- ifelse (
  rbinom(
    nrow(exDat),
    size = 1,
    propMiss2) == 1,
  NA,
  exDat$age
  )

exDat$msc2 <- ifelse (
  rbinom(
    nrow(exDat),
    size = 1,
    propMiss2) == 1,
  NA,
  exDat$msc2
  )
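The three blocks above follow the same pattern: each value is set to NA with a fixed probability. A minimal sketch of that pattern as a reusable function is shown below. The helper `addMissing()`, the demo vector, and the seed are assumptions made for illustration; they are not part of the workshop code.

```r
# Hypothetical helper: set each element of x to NA with probability prop
addMissing <- function(x, prop) {
  isMiss <- rbinom(length(x), size = 1, prob = prop) == 1
  x[isMiss] <- NA
  x
}

set.seed(1)  # arbitrary seed, only for this demo
demo <- addMissing(1:1000, prop = .1)

# Observed share of missing values: close to, but usually not exactly, .1
mean(is.na(demo))
```

Because the missing values are drawn at random, the realized proportion of missing values in each variable only approximates propMiss1 and propMiss2.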

Add a character variable with inconsistent spellings, missing values, and empty strings.

exDat$fLang <- rep(c("german", "ger", "germn",
                     "italian",
                     "french",
                     NA,
                     " ",
                     ""),
                   c(650, 49, 1, 10, 10, 20, 5, 5))
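To see what this variable contains, the sketch below rebuilds the vector on its own (without exDat) and tabulates it. The standalone object `fLang` and the decision to count whitespace-only entries as missing are assumptions for illustration.

```r
# Standalone copy of the variable created above
fLang <- rep(c("german", "ger", "germn",
               "italian", "french",
               NA, " ", ""),
             c(650, 49, 1, 10, 10, 20, 5, 5))

length(fLang)                    # one value per row of the data set
table(fLang, useNA = "always")   # frequencies, including NA

# Explicit missing values only
sum(is.na(fLang))

# Missing values plus empty and whitespace-only strings
sum(is.na(fLang) | trimws(fLang) == "")
```

The table shows three spellings of the same language as well as two kinds of empty entries that are not coded as NA.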

Add an outlier for the variable age.

exDat[600, "age"] <- 30
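To get a sense of how extreme this value is, it can be expressed in standard deviations of the population distribution specified in the model (mean 10, variance 2.5). The object names below are assumptions introduced for this sketch.

```r
popMean <- 10         # from `age ~ 10*1` in the population model
popSD   <- sqrt(2.5)  # from `age ~~ 2.5*age` in the population model

# Distance of the inserted value from the population mean, in SD units
zOutlier <- (30 - popMean) / popSD
round(zOutlier, 1)
```

A value more than 12 standard deviations above the mean would be extremely unlikely without manual intervention, which makes it a clear test case for outlier checks.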

Add an id variable.

exDat$id <- 1:nrow(exDat)

Some descriptive statistics

Descriptive statistics of the generated data
	vars	n	mean	sd	median	trimmed	mad	min	max	range	skew	kurtosis	se
msc1	1	750	2.52	0.74	3.00	2.52	1.48	1.00	4	3.00	-0.02	-0.31	0.03
msc2	2	680	2.54	0.72	3.00	2.53	1.48	1.00	4	3.00	-0.02	-0.27	0.03
msc3	3	750	2.49	0.75	2.00	2.49	1.48	1.00	4	3.00	0.00	-0.34	0.03
msc4	4	750	2.48	0.75	2.00	2.48	1.48	1.00	4	3.00	0.06	-0.33	0.03
age	5	670	10.03	1.76	10.03	10.02	1.46	5.44	30	24.56	2.11	23.94	0.07
sex	6	711	0.50	0.50	0.00	0.50	0.00	0.00	1	1.00	0.01	-2.00	0.02
edu	7	750	2.67	1.25	3.00	2.79	1.48	0.00	4	4.00	-0.59	-0.73	0.05
fLang*	8	730	4.89	0.58	5.00	5.00	0.00	1.00	7	6.00	-2.93	19.28	0.02

Correlation table of the generated data
	msc1	msc2	msc3	msc4	age	sex	edu
msc1	1.00
msc2	0.54	1.00
msc3	-0.59	-0.55	1.00
msc4	-0.58	-0.53	0.72	1.00
age	0.01	0.04	0.01	0.00	1.00
sex	0.05	0.02	-0.04	-0.02	-0.05	1.00
edu	0.01	-0.03	0.00	0.05	-0.02	-0.04	1.00