3 General information on data documentation

Important

We currently discussing some changes to data management and data documentation. The goal is to use the so-called Stamp (“Standardisierte Datenmanagementplan”) guidelines for educational research (ger only, see here: https://www.forschungsdaten-bildung.de/stamp-nutzen)

3.1 What is data documentation?

Data documentation is a central part of data management¹ in general and is part of the DFG Guidelines for safeguarding good scientific research practice (code of conduct)
- Guideline 12: Documentation (for more see below)

Guideline 12: Documentation

Researchers document all information relevant to the production of a research result as clearly as is required by and is appropriate for the relevant subject area to allow the result to be reviewed and assessed. In general, this also includes documenting individual results that do not support the research hypothesis. The selection of results must be avoided. Where subject-specific recommendations exist for review and assessment, researchers create documentation in accordance with these guidelines. If the documentation does not satisfy these requirements, the constraints and the reasons for them are clearly explained. Documentation and research results must not be manipulated; they are protected as effectively as possible against manipulation.

Explanations:

An important basis for enabling replication is to make available the information necessary to understand the research (including the research data used or generated, the methodological, evaluation and analytical steps taken, and, if relevant, the development of the hypothesis), to ensure that citations are clear, and, as far as possible, to enable third parties to access this information. Where research software is being developed, the source code is documented.

Data documentation is a continuous process and it should start at the beginning (see e.g., Figure 3.1) of each project that collects (any kind of) data

Figure 3.1: Proposed institutional research data management workflow by Wissik & Ďurčo (2015)

Note

In the following, we restrict the term data documentation to the documentation of a single data set.

3.2 Who uses data documentation?

Everybody who works with the data set(s). It serves as a guidance, especially for people who did not collect the data themselves.

3.3 How is data documentation structured?

Data documentation include several parts:

General information on the data set and the usage of it
Documentation of the survey process
Information about the variables (e.g., demographic variables), (achievement) tests, items and scales
- Scale level
- Item wording
- Descriptive Statistics (e.g., $N, M, SD,Min, Max,...$; for more see below)
- Cross-references to other studies
- Sources
Bibliography

An example of a table of conents of a so-called “Skalenhandbuch” can be seen in Figure 3.2.

Figure 3.2: Example of table of contents of a “Skalenhandbuch”

3.4 How to document (a specific) data (set)?

Tip: Data management (ger only)

Recommendations for data management (e.g., planning, generation, preparation & analyses, and publications)

Figure 3.3: Forschungsdatenmanagement in kleinen Forschungsprojekten

3.4.1 Basis

The basis of data documentation are the:

codebooks & questionnaires
the (final) data sets

To get an overview, you should start with carefully reviewing these sources and think about a meaningful structure and order.

Tip: How to design codebooks?

You can find some information at the https://datawizkb.leibniz-psychology.org website.

3.4.2 Order

It is advisable to group the variables according to their content. In the following example variable clusters are listed:
- Socio-demographic variables (e.g., age, sex, family background, income, household, …)
- Tests (e.g., school achievement tests, tests on cognitive ability)
- Psycho-social variables (e.g., Self-concepts, motivation, personality)
- …
Also, related scales should be reported together within a respective variable cluster (e.g., math, English and German self-concept scales)
Discuss the concrete order of the variables with your supervisor
Names of variables, items and scales must be identical across codebooks, questionaires, and data sets

3.4.3 Template

Important

You should always use a template! You can find it/them on the network drive L, too.

There are two options:

Calculate the required statistics and transfer the values into one of the millions word templates (not recommended)
Use the Quarto template (recommended)
- for now the template generates only html² documents

3.4.4 Functions

The following functions were written to make your job a little bit easier.

If you know what you are doing, feel free to change or adjust them.
If you found a bug or need more/other functions, please write us an email.

3.4.4.1 Table for variable overview

The varOverview function translates the general information into a table and has the following arguments:

name: Name of the variable
type: Input is a character: Either variable or scale
itemword: Requires a named list as input. Should contain the names and wording of the items.
recItems: Numeric vector of the reversed worded items.
questionaire: Input is a character: Which questionnaire?
prompt: Input is a character. Wording of the question,
response: Input is a character. Response format.
cross: Input is a character. Cross-reference to other studies.
source: Input is a character. Potential reference.
note: Input is a character. Room for additional notes.

Show/hide function

varOverview <- function ( name,
                          type = "variable",
                          itemword = NULL,
                          recItems = NULL,
                          questionaire = NULL,
                          prompt = NULL,
                          response = NULL,
                          cross = NULL,
                          source = NULL,
                          note = NULL) {
  
  if (type == "variable") {
    namDes <- "Name of variable:"
  }
  
  if (type == "scale") {
    namDes <- "Name of scale:"
  }
  
  if (type == "scale" & is.null(itemword) ) {
    stop("please provide the itemwording of the items")
  } 
  
  if(!is.null(itemword)) {
    
    if ( is.null(names(itemword)) ) {
      stop("please provide a named list as input: e.g., list(item1 = c('I like apples.'")
      
      
    } else {
      
      itemLab <- names(itemword)
      
      if (!is.null(recItems)) {
      itemLab[recItems] <- paste0(names(itemword)[recItems],"r")
      }
      
      itemword <- unlist(itemword)
      
      }
    
    
    
  } else {
    
    itemLab <- NULL
    itemword <- NULL
    
  }
  
  
  if (!is.null(questionaire)) {
    
    questLab <- "Questionaire:"
    questionaire <- paste0(questionaire, collapse = "; ")
    
  } else {
    
    questLab <- NULL
    
  }
  
  if (!is.null(prompt)) {
    
    promptLab <- "Prompt in Questionaire:"
    prompt <- paste0(prompt, collapse = ":")
    
  } else {
    
    promptLab <- NULL
    
  }
  
  if (!is.null(response)) {
    
    responseLab <- "Response format:"
    response <- paste0(response, collapse = "; ")
    
  } else {
    
    responseLab <- NULL
    
  }
  
  if (!is.null(cross)) {
    
    crossLab <- "Cross reference:"
    cross <- paste0(cross, collapse = "; ")
    
  } else {
    
    crossLab <- NULL
    
  }
  
  if (!is.null(source)) {
    
    sourceLab <- "Source:"
    source <- knitr::asis_output(paste0(source, collapse = "; "))
    
  } else {
    
    sourceLab <- NULL
    
  }
  
  if (!is.null(note)) {
    
    #generalTitel <- "Note:"
    noteLab <- "Note:"
    note <- paste0(note, collapse = "; ")
    
  } else {
    
    #generalTitel <- ""
    noteLab <- NULL
    
  }
  
  tempTab <- data.frame(
    "Col1" = c(namDes,
               questLab,
               promptLab,
               responseLab,
               crossLab,
               sourceLab,
               noteLab,
               "",
               itemLab),
    "Col2" = c(name,
               questionaire,
               prompt,
               response,
               cross,
               source,
               note,
               "",
               itemword)
    )
  
  kableExtra::kbl(tempTab,
               col.names = NULL,
               align = "l"
               ,booktabs = TRUE
               ) |>
    kableExtra::kable_styling(
      full_width = TRUE,
      bootstrap_options = c("hover", "responsive")
      , latex_options = "HOLD_position"
      ) |>
    kableExtra::column_spec(column = 1, width = "12em")
  
}

An example execution.

Show/hide code

varOverview(name = "gender",
            prompt = "The question in the questionaire...",
            response = c("0 = male",
                         "1 = female",
                         "2 = divers"),
            cross = c("A study", "Another study"),
            source = "@Kunter2002",
            note = "Here is room for some notes")

Table 3.1: Variable: gender
Name of variable:	gender
Prompt in Questionaire:	The question in the questionaire...
Response format:	0 = male; 1 = female; 2 = divers
Cross reference:	A study; Another study
Source:	@Kunter2002
Note:	Here is room for some notes

Note

The reference is not rendered correctly. This is a known bug: see here: https://github.com/quarto-dev/quarto-cli/issues/3340 and will be hopefully solved in the next release.

3.4.4.2 Continuous variables

The contVar function is based on the describe function from the psych package (Revelle, 2023) and has the following arguments:

variable: This should be a character or a character vector (name of variable)
data: The data frame
which.stat: Which statistics should be calculated? Default is: N, Mean, SD, Min, Max. Because the function is based on the describe function from psych package, you can also request: trimmed mean, median, mad, skew, kurtosis, standard error.
table: logical. Default is TRUE. Then the output is printed as a table using the kbl function from the kableExtra package (Zhu, 2021). If set to false, the output is printed as a data.frame.

Show/hide function

contVar <- function ( variable,
                      data,
                      which.stat = c("N",
                                     "Mean", "SD",
                                     "Min", "Max"),
                      table = TRUE) {

  
  
  # calculate descriptives
  tempDescr <- as.data.frame(
    psych::describe(
      data[,variable]
      )
    )[tolower(which.stat)]
  
  tempDescr$varNam <- variable
  rownames(tempDescr) <- NULL
  
  
  # calculate missing in %
  if ( length(variable) == 1 ) {
    
    missDF <- as.data.frame(mean(is.na(data[,variable]))*100)
    
  } else {
    
    missDF <- as.data.frame(colMeans(is.na(data[,variable]))*100)
    
  }
  
  colnames(missDF) <- "miss"
  
  tempDescr$miss <- paste0(
    sprintf(missDF$miss,
            fmt = '%#.1f'),
    "%")
  
  
  if ( table == TRUE ) {
    
    tempTab <- kableExtra::kbl(
      x = tempDescr[,c("varNam",
                       tolower(which.stat),
                       "miss")],
      col.names = c("Variable", which.stat, "Miss."),
      row.names = FALSE,
      digits = 2,
      align = c("l", rep("c",length(which.stat)), "c")
      ,booktabs = TRUE
      #,format = "html"
      ) |>
      kableExtra::kable_styling(
        full_width = TRUE,
        bootstrap_options = c("hover", "responsive")
        ,latex_options = "HOLD_position"
        )
    
    tempOut <- tempTab
    tempOut
    
    } else {
      
    tempOut <- tempDescr[,c("varNam",
                            tolower(which.stat),
                            "miss")]
    
    return(tempOut)
    
  }
  
}

An example execution.

Show/hide code

varOverview(name = "age",
            type = "variable",
            questionaire = "Student",
            prompt = "How old are you?",
            response = "open",
            cross = "not applicable",
            source = "not applicable",
            note = "")
contVar(variable = "age", data = exDat)

Table 3.2: Variable: age

(a) Overview
Name of variable:	age
Questionaire:	Student
Prompt in Questionaire:	How old are you?
Response format:	open
Cross reference:	not applicable
Source:	not applicable
Note:

(b) Calculated statistics
Variable	N	Mean	SD	Min	Max	Miss.
age	670	10.03	1.76	5.44	30	10.7%

3.4.4.3 Categorical variables

The catVar function is based on the table function from R (R Core Team, 2023) and has the following arguments:

variable: This should be a character or a character vector (name of variable)
data: The data frame
table: logical. Default is TRUE. Then the output is printed as a table using the kbl function from the kableExtra package (Zhu, 2021). If set to false, the output is printed as a data.frame.

Show/hide function

catVar <- function ( variable,
                     data,
                     table = TRUE) {

  
  # calculate freq. as function
  calcFreq <- function ( variable, data) {
  
    # calc abs and rel freq.
    tempFreq <- data.frame( abs = table(data[,variable],
                                        useNA = "always"))
    
    tempFreq$rel <- table(data[,variable],
                          useNA = "always")/length(data[,variable])*100
    
    colnames(tempFreq) <- c("var",
                            "Absolute",
                            "Relative")
  
    # calculate sum
    tempFreq <- rbind(tempFreq,
                      data.frame(var = "Sum",
                                 t(colSums(tempFreq[!is.na(tempFreq$var),-1])))
                      )
    
  
    # cosmetics
    
    tempFreq$Relative <- paste0(
      sprintf(tempFreq$Relative,
              fmt = '%#.1f'),
      "%")
  
    tempTab <- as.data.frame(t(tempFreq[,-1]))
    colnames(tempTab) <- tempFreq[,1]
    colnames(tempTab)[is.na(colnames(tempTab))] <- "Miss"
    
    tempTab <- as.data.frame(
      t(sapply(colnames(tempTab),
               function(x) paste0(tempTab[1,x],
                                  " (",
                                  tempTab[2,x], ")"),
               simplify = TRUE)
        )
      )
      
    return(tempTab)
    }
  
  tempFreq <- data.table::rbindlist(
    sapply(variable,
           function(x) 
             calcFreq(variable = x, data = data),
           simplify = FALSE),
    idcol = "Variable", use.names = T, fill = T)
  
  # get colnames to sort
  colToSort <- c("Variable", colnames(tempFreq)[!colnames(tempFreq) %in% 
                                                  c("Variable",
                                                    "Miss",
                                                    "Sum"
                                                    #"Sum (miss)"
                                                    )],
                 "Sum",
                 "Miss"
                 #"Sum (miss)",
                 )
  
  tempFreq <- tempFreq[,..colToSort]
  
  
  if ( table == TRUE ) {
    
    tempTab <- kableExtra::kbl(
      x = tempFreq,
      align = "c"
      ) |>
      kableExtra::kable_styling(
        full_width = TRUE,
        bootstrap_options = c("hover", "responsive")
        #,latex_options = "HOLD_position"
        ) 
    
    tempOut <- tempTab
    tempOut
    
    } else {
    
    tempOut <- tempFreq[,..colToSort]
      
    return(tempOut)
    
  }
  
}

An example execution.

Show/hide code

varOverview(name = "edu",
            type = "variable",
            questionaire = "Parent",
            prompt = "Welchen Schulabschluss haben Sie? Bitte geben Sie *nur* Ihren höchsten Abschluss an.",
            response = c(
              "0 = Kein Schulabschluss",
              "1 = Abschluss einer Sonderschule/Foerderschule",
              "2 = einer Polytechnischen Oberschule nach der 8. Klasse", 
              "3 = Hauptschulabschluss/ Volksschulabschluss",
              "4 = ..."),
            cross = "NEPS",
            source = "@Kunter2002",
            note = "")
catVar(variable = c("edu"),
       data = exDat)

Table 3.3: Variable: Education

(a) Overview
Name of variable:	edu
Questionaire:	Parent
Prompt in Questionaire:	Welchen Schulabschluss haben Sie? Bitte geben Sie nur Ihren höchsten Abschluss an.
Response format:	0 = Kein Schulabschluss; 1 = Abschluss einer Sonderschule/Foerderschule; 2 = einer Polytechnischen Oberschule nach der 8. Klasse; 3 = Hauptschulabschluss/ Volksschulabschluss; 4 = ...
Cross reference:	NEPS
Source:	@Kunter2002
Note:

(b) Calculated statistics
Variable	0	1	2	3	4	Sum	Miss
edu	50 (6.7%)	100 (13.3%)	150 (20.0%)	200 (26.7%)	250 (33.3%)	750 (100.0%)	0 (0.0%)

3.4.4.4 Scale score

The scaleScore function is based on the alpha function from the psych package (Revelle, 2023) and has the following arguments:

items: This should be a character vector (name of items) or a named list
data: The data frame
recItems: Numeric vector of the reversed worded items.
recode: logical. Default is FALSE. If TRUE the items in recItems will be recoded.
which.stat: Which statistics should be calculated? Default is: N, Mean, SD, Min, Max. Because the function is based on the describe function from psych package, you can also request: trimmed mean, median, mad, skew, kurtosis, standard error.
table: logical. Default is TRUE. Then the output is printed as a table using the kbl function from the kableExtra package (Zhu, 2021). If set to false, the scale score (and if requested the recoded items) are returned.

Important

The scale score is built based on only valid items (i.e., na.rm = FALSE argument of the rowSums function). This means if one item has missing value, no score is computed.

Show/hide function

scaleScore <- function ( items,
                         scaleNameShort = NULL,
                         recItems = NULL,
                         recode = FALSE,
                         data,
                         which.stat = c("N",
                                        "Mean", "SD",
                                        "Min", "Max"),
                         table = TRUE) {
  
  # name check
  
  if (is.null(scaleNameShort)) {
    stop("please provide the short name of the scale (variable name)")
  }
  
  
  # recode check 
  
  if (!is.null(recItems) & recode == FALSE) {
    warning("Are the items already recoded?\nIf not, you may use the recode argument.")
  }
  
  if (!is.null(names(items))) {
    items <- names(items)
  }
  
  
  # recode if necessary
  if ( recode == T ) {
    
    data[,paste0(items[recItems],"r")] <- sapply(
      items[recItems],
      function(x) {
        tempRec <- (max(data[,x]) + min(data[,x])) - data[,x]
        return(tempRec)
      }, simplify = FALSE)
    
    items[recItems] <- paste0(items[recItems],"r")
  }
  
  
  # compute sumscore
  data[,scaleNameShort] <- rowSums(data[,items]) / length(items)
  
  # calculate descriptives
  tempDescr <- as.data.frame(
    psych::describe(
      data[,c(items,scaleNameShort)]
      )
    )[tolower(which.stat)]
  
  # calculate missing in %
  missDF <- as.data.frame(colMeans(is.na(data[,c(items,scaleNameShort)]))*100)
  colnames(missDF) <- "miss"
  
  missDF$miss <- paste0(
    sprintf(missDF$miss,
            fmt = '%#.1f'),
    "%")
  
  # calculate alpha et al
  tempAlpha <- psych::alpha(
    data[,items]
    )
  
  # extract r.cor & r.drop
  rcordrop <- tempAlpha$item.stats[,c("r.cor", "r.drop")]
  alpha <- round(tempAlpha$total$raw_alpha,3)
  
  # merge dataframes
  tempTab <- merge(tempDescr, missDF, by = 0, sort = FALSE) 
  rownames(tempTab) <- tempTab$Row.names
  tempTab$Row.names <- NULL
  
  tempTab <- merge(tempTab, rcordrop, by = 0, all.x = T, sort = FALSE)
  rownames(tempTab) <- NULL
  
  # make tables
  if ( table == TRUE ) {
    
    tempScaleTab <- kableExtra::kbl(
      x = tempTab,
      col.names = c("Item", which.stat, "Missing", "$r_{it}$", "$r_{drop}$"),
      digits = 2,
      align = c("l", rep("c",length(which.stat)), "c", "c", "c"),
      booktabs = TRUE,
      format = "html",
      escape = FALSE
      ) |>
      kableExtra::kable_styling(
        full_width = TRUE,
        bootstrap_options = c("hover", "responsive")
        #,latex_options = "HOLD_position"
        ) |>
      kableExtra::footnote(general_title = "Cronbachs $\\\\alpha$ = ",
                           general = paste0(alpha),
                           footnote_as_chunk = TRUE,
                           threeparttable = TRUE) |>
      kableExtra::row_spec(row = nrow(tempTab),
                           background = "#d3d3d3")
    
    tempOut <- tempScaleTab
    tempOut
    
    } else {
      
      tempOut <- data[,c(items[recItems], scaleNameShort)]
      print(tempTab)
    
    return(tempOut)
    
  }
  
}

An example execution.

Before executing the function, we create a named list. Item 3 and 4 are reversed worded items.

mscItems <- list("msc1" = c("Ich bin gut in Mathematik."),
                 "msc2" = c("Ich war in Mathematik immer gut."),
                 "msc3" = c("Ich habe Verständnisschwierigkeiten bei allem, für das man Mathematik braucht."),
                 "msc4" = c("Ich bin bei Aufgaben, die mathematisches Denken
erfordern, nie gut."))

Then we execute the two functions. because the items 3 and 4 are not recoded in the data set, we have to pass the information to the scaleScore function.

Show/hide code

varOverview(name = "Math Self-concept",
            type = "scale",
            itemword = mscItems,
            recItems = c(3,4),
            prompt = "Wie sehr treffen die folgenden Aussagen auf Sie zu?",
            response = c("1 = trifft überhaupt nicht zu",
                        "2 = trifft eher nicht zu",
                        "3 = trifft eher zu",
                        "4 = trifft völlig zu"),
           cross = "TOSCA-R",
           source = "")
scaleScore(items = names(mscItems),
           scaleNameShort = "msc",
           recItems = c(3,4),
           recode = TRUE,
           data = exDat,
           table = T)

Table 3.4: Variable Math Self-concept

(a) Overview
Name of scale:	Math Self-concept
Prompt in Questionaire:	Wie sehr treffen die folgenden Aussagen auf Sie zu?
Response format:	1 = trifft überhaupt nicht zu; 2 = trifft eher nicht zu; 3 = trifft eher zu; 4 = trifft völlig zu
Cross reference:	TOSCA-R
Source:

msc1	Ich bin gut in Mathematik.
msc2	Ich war in Mathematik immer gut.
msc3r	Ich habe Verständnisschwierigkeiten bei allem, für das man Mathematik braucht.
msc4r	Ich bin bei Aufgaben, die mathematisches Denken erfordern, nie gut.

(b) Calculated statistics
Item	N	Mean	SD	Min	Max	Missing	$r_{it}$	$r_{drop}$
msc1	750	2.52	0.74	1	4	0.0%	0.72	0.67
msc2	680	2.54	0.72	1	4	9.3%	0.67	0.62
msc3r	750	2.51	0.75	1	4	0.0%	0.82	0.74
msc4r	750	2.52	0.75	1	4	0.0%	0.80	0.73
msc	680	2.53	0.61	1	4	9.3%	NA	NA
Cronbachs $\\alpha$ = 0.851

3.4.4.5 Clean character variables

The cleanStr function is designed to clean strings and has the following arguments:

stringToClean: This should be a character.
pattern: The pattern which should be replaced. Can be character vector.
replacement: The replacement. A character with length = 1.
replaceNA: logical. Should NAs be replaced? Default is TRUE.
replaceNAval: If replaceNA == TRUE. The replacement of NA values. Default is: "Unknown"
as.fac: logical. Output of the function (character or factor). Default is FALSE.
print: logical. Default is TRUE.

Show/hide function

cleanStr <- function ( stringToClean, pattern = NULL, replacement = "Unknown",
                       replaceNA = TRUE,
                       replaceNAval = "Unknown",
                       as.fac = FALSE,
                       print = TRUE) {
  
  if ( length(replacement) > 1 | length(replaceNAval) > 1 ) {
    
    stop("Values can only be replaced by one value. length > 1")
    
  }
  
  if ( !is.character(stringToClean) ) {
    
    stop("input is no character")
    
  } else {
      
    outStr <- stringToClean
    
    }
  
  if ( replaceNA == TRUE ) {
    outStr[is.na(outStr)] <- replaceNAval
    missMess <- paste0("Missing values (NA) are replaced with: '", replaceNAval, "'")
  } else {
    
    missMess <- NULL
    
  }
  
  outStr[outStr %in% pattern] <- replacement
  
  if (print == TRUE ) {
   
    message("Input\n")
    print( table(stringToClean, useNA = "always") )
    message("Output\n")
    print( table(outStr, useNA = "always") )
    
    message("\nThis is what happened:")
    cat("The following pattern(s): ",
        paste0(paste0(pattern, collapse = ", "),
        "\nwas/were replaced with: ",
        replacement), "\n\n",
        missMess)
    
    }
  if ( as.fac == FALSE ) {
    
    return(outStr)
    
  } else {
    
    outFac <- as.factor(outStr)
    return(outFac)
  }
  
}

An application of the function is shown in the Data cleaning part.

see also here: https://www.forschungsdaten-bildung.de/↩︎
This is (mainly) because there are some issues with the kableExtra package (Zhu, 2021) and latex implementation (i.e., format: pdf) in Quarto.↩︎

# General information on data documentation ::: {.callout-important} We currently discussing some changes to data management and data documentation. The goal is to use the so-called Stamp ("Standardisierte Datenmanagementplan") guidelines for educational research (ger only, see here: <https://www.forschungsdaten-bildung.de/stamp-nutzen>) ::: ## What is data documentation? - Data documentation is a central part of data management^[see also here: [https://www.forschungsdaten-bildung.de/](https://www.forschungsdaten-bildung.de/daten-managen?la=en)] in general and is part of the [DFG Guidelines](https://www.dfg.de/en/research_funding/principles_dfg_funding/good_scientific_practice/) for safeguarding good scientific research practice (code of conduct) - Guideline 12: [Documentation](https://wissenschaftliche-integritaet.de/en/code-of-conduct/documentation/) (for more see below) ::: {.callout-note appearance="simple" collapse="true" title="Guideline 12: Documentation"} > Researchers document all information relevant to the production of a research result as clearly as is required by and is appropriate for the relevant subject area to allow the result to be reviewed and assessed. In general, this also includes documenting individual results that do not support the research hypothesis. The selection of results must be avoided. Where subject-specific recommendations exist for review and assessment, researchers create documentation in accordance with these guidelines. If the documentation does not satisfy these requirements, the constraints and the reasons for them are clearly explained. Documentation and research results must not be manipulated; they are protected as effectively as possible against manipulation. **Explanations:** > An important basis for enabling replication is to make available the information necessary to understand the research (including the research data used or generated, the methodological, evaluation and analytical steps taken, and, if relevant, the development of the hypothesis), to ensure that citations are clear, and, as far as possible, to enable third parties to access this information. Where research software is being developed, the source code is documented. ::: - Data documentation is a continuous process and it should start at the beginning (see e.g., @fig-research-management) of each project that collects (any kind of) data ![Proposed institutional research data management workflow by @Wissik2015](images/Wissik_Durco_2015_Figure5.PNG){#fig-research-management width="90%"} ::: {.callout-note} In the following, we restrict the term data documentation to the documentation of a single data set. ::: ## Who uses data documentation? Everybody who works with the data set(s). It serves as a guidance, especially for people who did not collect the data themselves. ## How is data documentation structured? Data documentation include several parts: 1. General information on the data set and the usage of it 2. Documentation of the survey process 3. Information about the variables (e.g., demographic variables), (achievement) tests, items and scales - Scale level - Item wording - Descriptive Statistics (e.g., $N, M, SD,Min, Max,...$; for more see [below](item-analy-descr.qmd)) - Cross-references to other studies - Sources 4. Bibliography An example of a table of conents of a so-called *"Skalenhandbuch"* can be seen in @fig-example-scale-manual. ![Example of table of contents of a "Skalenhandbuch"](images/TOSCA-scale-manual-example.pdf){#fig-example-scale-manual width=80% height=5in} ## How to document (a specific) data (set)? ::: {.callout-tip collapse="true" appearance="simple" title="Tip: Data management (ger only)"} Recommendations for data management (e.g., planning, generation, preparation & analyses, and publications) ![Forschungsdatenmanagement in kleinen Forschungsprojekten](RatSWD (2023) - FDM-kleine Projekte.pdf){#fig-data-manage width=80% height=5in} ::: ### Basis The basis of data documentation are the: - codebooks & questionnaires - the (final) data sets To get an overview, you should start with carefully reviewing these sources and think about a meaningful structure and order. ::: {.callout-tip appearance="simple" collapse="false" title="Tip: How to design codebooks?"} You can find some information at the [https://datawizkb.leibniz-psychology.org](https://datawizkb.leibniz-psychology.org/index.php/during-data-collection/what-should-i-know-about-codebooks/) website. ::: ### Order - It is advisable to group the variables according to their content. In the following example variable clusters are listed: - Socio-demographic variables (e.g., age, sex, family background, income, household, ...) - Tests (e.g., school achievement tests, tests on cognitive ability) - Psycho-social variables (e.g., Self-concepts, motivation, personality) - ... - Also, related scales should be reported together within a respective variable cluster (e.g., math, English and German self-concept scales) - Discuss the concrete order of the variables with your supervisor - Names of variables, items and scales **must be identical** across codebooks, questionaires, and data sets ### Template {#template} ::: {.callout-important} You should always use a template! You can find it/them on the network drive L, too. ::: There are two options: 1. Calculate the required statistics and transfer the values into one of the millions word templates (not recommended) 2. Use the `Quarto` [template](template-scale-manual.qmd) (recommended) - for now the template generates *only* html^[This is (mainly) because there are some issues with the `kableExtra` package [@R-kableExtra] and `latex` implementation (i.e., `format: pdf`) in `Quarto`.] documents ### Functions {#functions} ```{r} #| label: read-exampleDat #| echo: false exDat <- readRDS("exampleDat.RDS") ``` The following functions were written to make your job a little bit easier. - If you know what you are doing, feel free to change or adjust them. - If you found a bug or need more/other functions, please write us an [email](mailto:sven.rieger@uni-tuebingen.de). ```{=html} <style type="text/css"> caption, .table-caption { text-align: left; } </style> ``` ```{=html} <style type="text/css"> .table-hover > tbody > tr:hover { background-color: #d3d3d3; } </style> ``` #### Table for variable overview The `varOverview` function translates the general information into a table and has the following arguments: - `name`: Name of the variable - `type`: Input is a `character`: Either `variable` or `scale` - `itemword`: Requires a named `list` as input. Should contain the names and wording of the items. - `recItems`: Numeric vector of the reversed worded items. - `questionaire`: Input is a `character`: Which questionnaire? - `prompt`: Input is a `character`. Wording of the question, - `response`: Input is a `character`. Response format. - `cross`: Input is a `character`. Cross-reference to other studies. - `source`: Input is a `character`. Potential reference. - `note`: Input is a `character`. Room for additional notes. ```{r} #| label: table-function #| code-fold: true #| code-summary: "Show/hide function" #| echo: true varOverview <- function ( name, type = "variable", itemword = NULL, recItems = NULL, questionaire = NULL, prompt = NULL, response = NULL, cross = NULL, source = NULL, note = NULL) { if (type == "variable") { namDes <- "Name of variable:" } if (type == "scale") { namDes <- "Name of scale:" } if (type == "scale" & is.null(itemword) ) { stop("please provide the itemwording of the items") } if(!is.null(itemword)) { if ( is.null(names(itemword)) ) { stop("please provide a named list as input: e.g., list(item1 = c('I like apples.'") } else { itemLab <- names(itemword) if (!is.null(recItems)) { itemLab[recItems] <- paste0(names(itemword)[recItems],"r") } itemword <- unlist(itemword) } } else { itemLab <- NULL itemword <- NULL } if (!is.null(questionaire)) { questLab <- "Questionaire:" questionaire <- paste0(questionaire, collapse = "; ") } else { questLab <- NULL } if (!is.null(prompt)) { promptLab <- "Prompt in Questionaire:" prompt <- paste0(prompt, collapse = ":") } else { promptLab <- NULL } if (!is.null(response)) { responseLab <- "Response format:" response <- paste0(response, collapse = "; ") } else { responseLab <- NULL } if (!is.null(cross)) { crossLab <- "Cross reference:" cross <- paste0(cross, collapse = "; ") } else { crossLab <- NULL } if (!is.null(source)) { sourceLab <- "Source:" source <- knitr::asis_output(paste0(source, collapse = "; ")) } else { sourceLab <- NULL } if (!is.null(note)) { #generalTitel <- "Note:" noteLab <- "Note:" note <- paste0(note, collapse = "; ") } else { #generalTitel <- "" noteLab <- NULL } tempTab <- data.frame( "Col1" = c(namDes, questLab, promptLab, responseLab, crossLab, sourceLab, noteLab, "", itemLab), "Col2" = c(name, questionaire, prompt, response, cross, source, note, "", itemword) ) kableExtra::kbl(tempTab, col.names = NULL, align = "l" ,booktabs = TRUE ) |> kableExtra::kable_styling( full_width = TRUE, bootstrap_options = c("hover", "responsive") , latex_options = "HOLD_position" ) |> kableExtra::column_spec(column = 1, width = "12em") } ``` An example execution. ```{r} #| label: tbl-demo-table-function #| echo: true #| code-fold: show #| tbl-cap: "Variable: gender" #| output: asis varOverview(name = "gender", prompt = "The question in the questionaire...", response = c("0 = male", "1 = female", "2 = divers"), cross = c("A study", "Another study"), source = "@Kunter2002", note = "Here is room for some notes") ``` ::: {.callout-note} The reference is not rendered correctly. This is a known bug: see here: <https://github.com/quarto-dev/quarto-cli/issues/3340> and will be hopefully solved in the next release. ::: #### Continuous variables The `contVar` function is based on the `describe` function from the `psych` package [@R-psych] and has the following arguments: - `variable`: This should be a character or a character vector (name of variable) - `data`: The data frame - `which.stat`: Which statistics should be calculated? Default is: `N`, `Mean`, `SD`, `Min`, `Max`. Because the function is based on the `describe` function from `psych` package, you can also request: `trimmed mean`, `median`, `mad`, `skew`, `kurtosis`, `standard error`. - `table`: logical. Default is `TRUE`. Then the output is printed as a table using the `kbl` function from the `kableExtra` package [@R-kableExtra]. If set to false, the output is printed as a `data.frame`. ```{r} #| label: contVar-function #| echo: true #| code-fold: true #| code-summary: "Show/hide function" contVar <- function ( variable, data, which.stat = c("N", "Mean", "SD", "Min", "Max"), table = TRUE) { # calculate descriptives tempDescr <- as.data.frame( psych::describe( data[,variable] ) )[tolower(which.stat)] tempDescr$varNam <- variable rownames(tempDescr) <- NULL # calculate missing in % if ( length(variable) == 1 ) { missDF <- as.data.frame(mean(is.na(data[,variable]))*100) } else { missDF <- as.data.frame(colMeans(is.na(data[,variable]))*100) } colnames(missDF) <- "miss" tempDescr$miss <- paste0( sprintf(missDF$miss, fmt = '%#.1f'), "%") if ( table == TRUE ) { tempTab <- kableExtra::kbl( x = tempDescr[,c("varNam", tolower(which.stat), "miss")], col.names = c("Variable", which.stat, "Miss."), row.names = FALSE, digits = 2, align = c("l", rep("c",length(which.stat)), "c") ,booktabs = TRUE #,format = "html" ) |> kableExtra::kable_styling( full_width = TRUE, bootstrap_options = c("hover", "responsive") ,latex_options = "HOLD_position" ) tempOut <- tempTab tempOut } else { tempOut <- tempDescr[,c("varNam", tolower(which.stat), "miss")] return(tempOut) } } ``` An example execution. ```{r} #| label: tbl-demo-contVar-function #| echo: true #| code-fold: show #| tbl-cap: "Variable: age" #| tbl-subcap: ["Overview", "Calculated statistics"] #| output: asis varOverview(name = "age", type = "variable", questionaire = "Student", prompt = "How old are you?", response = "open", cross = "not applicable", source = "not applicable", note = "") contVar(variable = "age", data = exDat) ``` #### Categorical variables The `catVar` function is based on the `table` function from R [@R-base] and has the following arguments: - `variable`: This should be a character or a character vector (name of variable) - `data`: The data frame - `table`: logical. Default is `TRUE`. Then the output is printed as a table using the `kbl` function from the `kableExtra` package [@R-kableExtra]. If set to false, the output is printed as a `data.frame`. ```{r} #| label: catVar-function #| echo: true #| code-fold: true #| code-summary: "Show/hide function" catVar <- function ( variable, data, table = TRUE) { # calculate freq. as function calcFreq <- function ( variable, data) { # calc abs and rel freq. tempFreq <- data.frame( abs = table(data[,variable], useNA = "always")) tempFreq$rel <- table(data[,variable], useNA = "always")/length(data[,variable])*100 colnames(tempFreq) <- c("var", "Absolute", "Relative") # calculate sum tempFreq <- rbind(tempFreq, data.frame(var = "Sum", t(colSums(tempFreq[!is.na(tempFreq$var),-1]))) ) # cosmetics tempFreq$Relative <- paste0( sprintf(tempFreq$Relative, fmt = '%#.1f'), "%") tempTab <- as.data.frame(t(tempFreq[,-1])) colnames(tempTab) <- tempFreq[,1] colnames(tempTab)[is.na(colnames(tempTab))] <- "Miss" tempTab <- as.data.frame( t(sapply(colnames(tempTab), function(x) paste0(tempTab[1,x], " (", tempTab[2,x], ")"), simplify = TRUE) ) ) return(tempTab) } tempFreq <- data.table::rbindlist( sapply(variable, function(x) calcFreq(variable = x, data = data), simplify = FALSE), idcol = "Variable", use.names = T, fill = T) # get colnames to sort colToSort <- c("Variable", colnames(tempFreq)[!colnames(tempFreq) %in% c("Variable", "Miss", "Sum" #"Sum (miss)" )], "Sum", "Miss" #"Sum (miss)", ) tempFreq <- tempFreq[,..colToSort] if ( table == TRUE ) { tempTab <- kableExtra::kbl( x = tempFreq, align = "c" ) |> kableExtra::kable_styling( full_width = TRUE, bootstrap_options = c("hover", "responsive") #,latex_options = "HOLD_position" ) tempOut <- tempTab tempOut } else { tempOut <- tempFreq[,..colToSort] return(tempOut) } } ``` An example execution. ```{r} #| label: tbl-edu #| eval: true #| code-fold: show #| tbl-cap: "Variable: Education" #| tbl-subcap: #| - "Overview" #| - "Calculated statistics" #| results: asis varOverview(name = "edu", type = "variable", questionaire = "Parent", prompt = "Welchen Schulabschluss haben Sie? Bitte geben Sie *nur* Ihren höchsten Abschluss an.", response = c( "0 = Kein Schulabschluss", "1 = Abschluss einer Sonderschule/Foerderschule", "2 = einer Polytechnischen Oberschule nach der 8. Klasse", "3 = Hauptschulabschluss/ Volksschulabschluss", "4 = ..."), cross = "NEPS", source = "@Kunter2002", note = "") catVar(variable = c("edu"), data = exDat) ``` #### Scale score The `scaleScore` function is based on the `alpha` function from the `psych` package [@R-psych] and has the following arguments: - `items`: This should be a character vector (name of items) or a named list - `data`: The data frame - `recItems`: Numeric vector of the reversed worded items. - `recode`: logical. Default is `FALSE`. If `TRUE` the items in `recItems` will be recoded. - `which.stat`: Which statistics should be calculated? Default is: `N`, `Mean`, `SD`, `Min`, `Max`. Because the function is based on the `describe` function from `psych` package, you can also request: `trimmed mean`, `median`, `mad`, `skew`, `kurtosis`, `standard error`. - `table`: logical. Default is `TRUE`. Then the output is printed as a table using the `kbl` function from the `kableExtra` package [@R-kableExtra]. If set to false, the scale score (and if requested the recoded items) are returned. ::: {.callout-important} The scale score is built based on only valid items (i.e., `na.rm = FALSE` argument of the `rowSums` function). This means if one item has missing value, no score is computed. ::: ```{r} #| label: scales-function #| echo: true #| eval: true #| code-fold: true #| code-summary: "Show/hide function" scaleScore <- function ( items, scaleNameShort = NULL, recItems = NULL, recode = FALSE, data, which.stat = c("N", "Mean", "SD", "Min", "Max"), table = TRUE) { # name check if (is.null(scaleNameShort)) { stop("please provide the short name of the scale (variable name)") } # recode check if (!is.null(recItems) & recode == FALSE) { warning("Are the items already recoded?\nIf not, you may use the recode argument.") } if (!is.null(names(items))) { items <- names(items) } # recode if necessary if ( recode == T ) { data[,paste0(items[recItems],"r")] <- sapply( items[recItems], function(x) { tempRec <- (max(data[,x]) + min(data[,x])) - data[,x] return(tempRec) }, simplify = FALSE) items[recItems] <- paste0(items[recItems],"r") } # compute sumscore data[,scaleNameShort] <- rowSums(data[,items]) / length(items) # calculate descriptives tempDescr <- as.data.frame( psych::describe( data[,c(items,scaleNameShort)] ) )[tolower(which.stat)] # calculate missing in % missDF <- as.data.frame(colMeans(is.na(data[,c(items,scaleNameShort)]))*100) colnames(missDF) <- "miss" missDF$miss <- paste0( sprintf(missDF$miss, fmt = '%#.1f'), "%") # calculate alpha et al tempAlpha <- psych::alpha( data[,items] ) # extract r.cor & r.drop rcordrop <- tempAlpha$item.stats[,c("r.cor", "r.drop")] alpha <- round(tempAlpha$total$raw_alpha,3) # merge dataframes tempTab <- merge(tempDescr, missDF, by = 0, sort = FALSE) rownames(tempTab) <- tempTab$Row.names tempTab$Row.names <- NULL tempTab <- merge(tempTab, rcordrop, by = 0, all.x = T, sort = FALSE) rownames(tempTab) <- NULL # make tables if ( table == TRUE ) { tempScaleTab <- kableExtra::kbl( x = tempTab, col.names = c("Item", which.stat, "Missing", "$r_{it}$", "$r_{drop}$"), digits = 2, align = c("l", rep("c",length(which.stat)), "c", "c", "c"), booktabs = TRUE, format = "html", escape = FALSE ) |> kableExtra::kable_styling( full_width = TRUE, bootstrap_options = c("hover", "responsive") #,latex_options = "HOLD_position" ) |> kableExtra::footnote(general_title = "Cronbachs $\\\\alpha$ = ", general = paste0(alpha), footnote_as_chunk = TRUE, threeparttable = TRUE) |> kableExtra::row_spec(row = nrow(tempTab), background = "#d3d3d3") tempOut <- tempScaleTab tempOut } else { tempOut <- data[,c(items[recItems], scaleNameShort)] print(tempTab) return(tempOut) } } ``` An example execution. 1. Before executing the function, we create a named list. Item 3 and 4 are reversed worded items. ```{r} #| label: named-list-msc mscItems <- list("msc1" = c("Ich bin gut in Mathematik."), "msc2" = c("Ich war in Mathematik immer gut."), "msc3" = c("Ich habe Verständnisschwierigkeiten bei allem, für das man Mathematik braucht."), "msc4" = c("Ich bin bei Aufgaben, die mathematisches Denken erfordern, nie gut.")) ``` 2. Then we execute the two functions. because the items 3 and 4 are not recoded in the data set, we have to pass the information to the `scaleScore` function. ```{r} #| label: tbl-msc #| echo: true #| eval: true #| code-fold: show #| tbl-cap: "Variable Math Self-concept" #| tbl-subcap: ["Overview", "Calculated statistics"] varOverview(name = "Math Self-concept", type = "scale", itemword = mscItems, recItems = c(3,4), prompt = "Wie sehr treffen die folgenden Aussagen auf Sie zu?", response = c("1 = trifft überhaupt nicht zu", "2 = trifft eher nicht zu", "3 = trifft eher zu", "4 = trifft völlig zu"), cross = "TOSCA-R", source = "") scaleScore(items = names(mscItems), scaleNameShort = "msc", recItems = c(3,4), recode = TRUE, data = exDat, table = T) ``` #### Clean character variables The `cleanStr` function is designed to clean strings and has the following arguments: - `stringToClean`: This should be a character. - `pattern`: The pattern which should be replaced. Can be character vector. - `replacement`: The replacement. A character with length = 1. - `replaceNA`: logical. Should `NA`s be replaced? Default is `TRUE`. - `replaceNAval`: If `replaceNA` == `TRUE`. The replacement of `NA` values. Default is: `"Unknown"` - `as.fac`: logical. Output of the function (`character` or `factor`). Default is `FALSE`. - `print`: logical. Default is `TRUE`. ```{r} #| label: cleanStr-function #| echo: true #| eval: true #| code-fold: true #| code-summary: "Show/hide function" cleanStr <- function ( stringToClean, pattern = NULL, replacement = "Unknown", replaceNA = TRUE, replaceNAval = "Unknown", as.fac = FALSE, print = TRUE) { if ( length(replacement) > 1 | length(replaceNAval) > 1 ) { stop("Values can only be replaced by one value. length > 1") } if ( !is.character(stringToClean) ) { stop("input is no character") } else { outStr <- stringToClean } if ( replaceNA == TRUE ) { outStr[is.na(outStr)] <- replaceNAval missMess <- paste0("Missing values (NA) are replaced with: '", replaceNAval, "'") } else { missMess <- NULL } outStr[outStr %in% pattern] <- replacement if (print == TRUE ) { message("Input\n") print( table(stringToClean, useNA = "always") ) message("Output\n") print( table(outStr, useNA = "always") ) message("\nThis is what happened:") cat("The following pattern(s): ", paste0(paste0(pattern, collapse = ", "), "\nwas/were replaced with: ", replacement), "\n\n", missMess) } if ( as.fac == FALSE ) { return(outStr) } else { outFac <- as.factor(outStr) return(outFac) } } ``` An application of the function is shown in the [Data cleaning](data-cleaning.qmd) part. ```{r} #| label: save-functions #| echo: false #| eval: false scaleManualFunctions <- list(varOverview = varOverview, catVar = catVar, contVar = contVar, scaleScore = scaleScore, cleanStr = cleanStr ) save(scaleManualFunctions, file = "scaleManualFunctions.Rdata") ```