8 Data transformation
In this part you will learn how to transform or convert variables into a usable form for analysis. The steps include building new variables, recoding variables, and working with strings.
8.1 Building new variables
To build new variables and add them to a data set, we can use all functions and operators that R offers.
In R, you can create objects (from existing ones) and store them as new objects using functions and operators. In the following example, we create the objects x and y and assign them numbers.
x <- 5
y <- 4
Then we sum them up:
xy <- x + y
xy
[1] 9
This is the same as:
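One equivalent form, shown here as a minimal sketch, uses the base sum function, which adds all of its arguments. The objects x and y are redefined so that the block runs on its own.

```r
# Recreate the two objects from the example
x <- 5
y <- 4

# sum() adds all of its arguments, so this matches x + y
xy <- sum(x, y)
xy
#> [1] 9
```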
Finally, we may delete all objects using the rm function.
rm(x, y, xy)
For example, we might want to sum item indicators into a so-called scale score (for now, only the first two items; for a more detailed examination, see the section Descriptive statistics and item analysis) and add it as a new column to the example data set exDat.
There are several ways to do that, for example:
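Two base R variants are sketched below. The toy data frame exToy and the column names mscSum and mscSum2 are assumptions for illustration; with the workshop data, exDat would take the place of exToy.

```r
# Hypothetical stand-in for exDat with two item indicators
exToy <- data.frame(msc1 = c(1, 2, 4, 3),
                    msc2 = c(2, 2, 3, NA))

# Variant 1: add the columns element-wise with the + operator
exToy$mscSum <- exToy$msc1 + exToy$msc2

# Variant 2: rowSums() over the selected item columns
exToy$mscSum2 <- rowSums(exToy[, c("msc1", "msc2")])

exToy
```

Both variants return NA for a row with a missing item. Whether that is the desired behavior depends on how missing responses should be treated in the scale score.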
To delete a variable, assign NULL to it.
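As a minimal sketch with a hypothetical toy data frame (exToy and mscSum are illustrative names, not part of the workshop data), assigning NULL removes the column:

```r
# Hypothetical stand-in for exDat
exToy <- data.frame(msc1 = c(1, 2, 4),
                    msc2 = c(2, 2, 3))
exToy$mscSum <- exToy$msc1 + exToy$msc2

# Assigning NULL drops the column from the data frame
exToy$mscSum <- NULL
names(exToy)
#> [1] "msc1" "msc2"
```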
To add a variable to a data set, we may use the mutate function from the dplyr package.
From the function description: dplyr::mutate creates new columns that are functions of existing variables. It can also modify (if the name is the same as an existing column) and delete columns (by setting their value to NULL).
Add the new variable mscSum6.
exDat <- exDat |>
  dplyr::mutate(mscSum6 = msc1 + msc2)
Delete the variable mscSum6.
exDat <- exDat |>
  dplyr::mutate(mscSum6 = NULL)
8.2 Recoding variables
Recoding is the process of reassigning values (old → new) for a variable in a data set.
The old values can either be overwritten by the new values or kept, with the new values saved as a new variable. Saving them as a new variable has two advantages:
- old values are not lost
- errors during recoding can be reproduced
A common approach to recoding variables is the base::ifelse function. It requires three input arguments:
- test: an object that can be coerced to logical mode
- yes: return values for true elements of test
- no: return values for false elements of test
The function returns a value with the same shape as test which is filled with elements selected from either yes or no depending on whether the element of test is TRUE or FALSE.
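ifelse(1 == 1,
       yes = "That is correct!",
       no = FALSE)
[1] "That is correct!"
Applied to a test of length four, the function returns one value per element. Because yes is a character string, the logical FALSE is coerced to the string "FALSE":
[1] "That is correct!" "FALSE" "FALSE" "That is correct!"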
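A call of this kind can be sketched as follows. The input vector testVec is a hypothetical example, chosen so that only the first and last elements pass the test.

```r
# Hypothetical test vector; only the first and last elements equal 1
testVec <- c(1, 2, 3, 1)

res <- ifelse(testVec == 1,
              yes = "That is correct!",
              no = FALSE)
res

# The result is a character vector, so FALSE was coerced to "FALSE"
class(res)
#> [1] "character"
```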
To apply the function in a more “meaningful” setting, let's transform (i.e., recode) the variable age into a categorical variable (here: ageCat) with the categories old and young. All units older than 10 get the value "old"; all others get the value "young".
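This recoding can be sketched with base::ifelse. The toy data frame exToy and its ages are made up for illustration; with the workshop data, exDat$age would be used instead.

```r
# Hypothetical stand-in for exDat; the ages are made up for illustration
exToy <- data.frame(age = c(8, 10, 11, 14, NA))

# Units older than 10 become "old", all others "young";
# a missing age stays missing because the test itself is NA
exToy$ageCat <- ifelse(exToy$age > 10,
                       yes = "old",
                       no = "young")

# Cross-tabulate old and new values to check the recoding
table(exToy$age, exToy$ageCat, useNA = "always")
```

Saving the result as ageCat rather than overwriting age keeps the original values available for this check.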
Psychological instruments often contain items that are designed to measure the opposite of the actual construct (e.g., “I am good at math.” vs. “I am bad at math.”). These items are called reverse-scored or negatively keyed items. In the example data set exDat, the variables msc3 and msc4 are (simulated as) reverse-scored and thus need to be recoded.
The first step is to check the response categories.
table(exDat$msc3, useNA = "always")
   1    2    3    4 <NA>
  60  320  314   56    0
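In this approach, we subtract the item from the sum of the item's maximum and minimum (here: maximum = 4, minimum = 1). A response of 1 thus becomes 5 - 1 = 4, and a response of 4 becomes 1.
Note that this approach is not very robust across different recoding strategies. It is also fragile when the sample size is small and not all response categories are used, because a minimum or maximum computed from the observed data may then differ from the theoretical one.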
It is important to evaluate the result.
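The subtraction and its evaluation can be sketched as follows. The toy data frame exToy, the new column msc3r, and the helper objects itemMin and itemMax are assumptions for illustration.

```r
# Hypothetical stand-in for exDat with a 4-point item
exToy <- data.frame(msc3 = c(1, 2, 3, 4, 2, NA))

# Theoretical response range of the item (assumed: 1 to 4)
itemMin <- 1
itemMax <- 4

# Reverse scoring: 1 -> 4, 2 -> 3, 3 -> 2, 4 -> 1
exToy$msc3r <- (itemMax + itemMin) - exToy$msc3

# Evaluate the result: every old value should map to exactly one new value
table(exToy$msc3, exToy$msc3r, useNA = "always")
```

Setting itemMin and itemMax by hand, rather than computing them with min() and max(), keeps the recoding correct even if a response category is never observed.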
In this approach, we use the recode function of the car package (Fox et al., 2022). This function needs at least two inputs (copied from the package description):
- var: numeric vector, character vector, or factor.
- recodes: character string of recode specifications
Additional arguments such as as.factor and as.numeric control the class of the output.
Again, evaluating the result.
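A minimal sketch of the car::recode approach and its evaluation, assuming the car package is installed. The toy data frame exToy and the column msc3r are illustrative names.

```r
# Hypothetical stand-in for exDat with a 4-point item
exToy <- data.frame(msc3 = c(1, 2, 3, 4, 2, NA))

# recodes is one character string of old=new pairs separated by semicolons
exToy$msc3r <- car::recode(exToy$msc3,
                           recodes = "1=4; 2=3; 3=2; 4=1")

# Evaluate the result with a cross-tabulation of old and new values
table(exToy$msc3, exToy$msc3r, useNA = "always")
```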
The result of a dplyr-based recoding should be evaluated in the same way. Note the following (copied from the ?dplyr::recode function description):
dplyr::recode is superseded in favor of dplyr::case_match, which handles the most important cases of dplyr::recode with a more elegant interface. dplyr::recode_factor is also superseded, however, its direct replacement is not currently available but will eventually live in forcats. For creating new variables based on logical vectors, use dplyr::if_else(). For even more complicated criteria, use dplyr::case_when.
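A minimal sketch of the same reverse scoring with dplyr::case_match, assuming a dplyr version that provides this function (1.1.0 or later). The toy data frame exToy and the column msc3r are illustrative names.

```r
# Hypothetical stand-in for exDat with a 4-point item
exToy <- data.frame(msc3 = c(1, 2, 3, 4, 2, NA))

# Each formula maps an old value (left) to a new value (right);
# values without a match, including NA, become NA by default
exToy <- exToy |>
  dplyr::mutate(msc3r = dplyr::case_match(msc3,
                                          1 ~ 4,
                                          2 ~ 3,
                                          3 ~ 2,
                                          4 ~ 1))

# Evaluate the result with a cross-tabulation of old and new values
table(exToy$msc3, exToy$msc3r, useNA = "always")
```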
8.3 Working with strings (character objects)
This section briefly introduces how to work with strings (character objects). Recall, what is a character (in R)?
Everything that appears within single (') or double (") quotes (double quotes are recommended) is treated as a string (i.e., a character object). It is important to know that character objects are space- and case-sensitive.
" " == ""
[1] FALSE
"hello world!" == "Hello world!"
[1] FALSE
In survey research, strings are used to transfer the information from (mostly) open fields in a questionnaire into the data set (e.g., “What is your first language or mother tongue? (Please specify)”).
The variable fLang in the exDat data set contains the different answers to such a question.
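table(exDat$fLang, useNA = "always")
                  french     ger  german   germn italian    <NA>
       5       5      10      49     650       1      10      20
The first two columns have no visible label because they hold the empty string "" and a single space " ".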
To clean such a character variable, R offers a couple of functions (e.g., base::grep, base::grepl, base::regexpr, base::gregexpr, …). There are also packages such as stringr (Wickham, 2022) or stringi (Gagolewski et al., 2022) that offer great functionality.¹
In the following example, we use the base::gsub function to search the character vector anotherChr for matches of the pattern " " and replace them with "".
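A minimal sketch of such a call. The contents of anotherChr and the result name cleanedChr are assumptions made up for illustration.

```r
# Hypothetical character vector with stray spaces
anotherChr <- c("ger man", " german", "german ")

# fixed = TRUE treats the pattern as a literal string, not a regular expression
cleanedChr <- gsub(pattern = " ", replacement = "", x = anotherChr, fixed = TRUE)
cleanedChr
#> [1] "german" "german" "german"
```

Note that gsub removes every space, including spaces inside a value, which is not always what is wanted for multi-word answers.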
The cleanStr function (introduced in the Functions section of the General information on data documentation part) is designed to clean strings and has the following arguments:
- stringToClean: This should be a character.
- pattern: The pattern which should be replaced. Can be a character vector.
- replacement: The replacement. A character with length = 1.
- replaceNA: logical. Should NAs be replaced? Default is TRUE.
- replaceNAval: If replaceNA == TRUE, the replacement of NA values. Default is "Unknown".
- as.fac: logical. Output of the function (character or factor). Default is FALSE.
- print: logical. Default is TRUE.
exDat$fLangR <- cleanStr(exDat$fLang,
                         pattern = c(" ", ""),
                         replacement = "Unknown",
                         replaceNA = TRUE)
Input
stringToClean
                  french     ger  german   germn italian    <NA>
       5       5      10      49     650       1      10      20
Output
outStr
  french     ger  german   germn italian Unknown    <NA>
      10      49     650       1      10      30       0
This is what happened:
The following pattern(s):  ,
was/were replaced with: Unknown
Missing values (NA) are replaced with: 'Unknown'
- Use the cleanStr function with the pattern and replacement arguments to recode the remaining variants "ger" and "germn" to "german".
exDat$fLangR <- cleanStr(exDat$fLangR,
                         pattern = c("ger", "germn"),
                         replacement = "german",
                         replaceNA = FALSE)
Input
stringToClean
  french     ger  german   germn italian Unknown    <NA>
      10      49     650       1      10      30       0
Output
outStr
  french  german italian Unknown    <NA>
      10     700      10      30       0
This is what happened:
The following pattern(s): ger, germn
was/were replaced with: german
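For comparison, a replacement of whole values can be sketched in base R with %in%. This is not the cleanStr implementation, which is defined elsewhere in the workshop; the toy vector fLangToy is an assumption for illustration.

```r
# Hypothetical toy version of the fLang answers
fLangToy <- c("ger", "german", "germn", "french", "italian", NA)

# Replace only elements that equal one of the patterns exactly,
# so "german" itself is left untouched and NA stays NA
fLangToy[fLangToy %in% c("ger", "germn")] <- "german"

table(fLangToy, useNA = "always")
```

Matching whole values avoids the side effect a substring replacement would have here, where replacing "ger" inside "german" would corrupt the correct answers.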
¹ A comprehensive introduction to these functions or packages is beyond the scope of this workshop. Hence, we focus on some mechanics of these functions.