8 Data transformation
In this part you will learn how to transform or convert variables into a usable form for analysis. The steps include building new variables, recoding variables, and working with strings (character objects).
8.1 Building new variables
To build new variables and add them to a data set, we can use all functions and operators that R offers.
In R, you can create objects (from existing ones) and store them as new objects using functions and operators. In the following example, we create the objects x and y and assign them numbers.
x <- 5
y <- 4
Then we sum them up:
xy <- x+y
xy
[1] 9
This is the same as:
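Assuming the intended comparison is with the function form of addition, a minimal sketch of two equivalent calls is:

```r
x <- 5
y <- 4

# Function form: sum() adds all of its arguments.
sum(x, y)

# Operators are functions too, so `+` can be called in prefix form.
`+`(x, y)

# Both results are identical to x + y.
identical(sum(x, y), x + y)
```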
Finally, we may delete all objects using the rm function.
rm(x, y, xy)
For example, we might want to sum up item indicators to a so-called scale score (for now only the first 2; for a more detailed examination see the section Descriptive statistics and item analysis) and add it as a new column to the example data set exDat.
There are several ways to do that, for example:
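A minimal base R sketch of two such ways follows. The small data frame is a hypothetical stand-in for exDat, which is assumed to contain the numeric item columns msc1 and msc2.

```r
# Hypothetical stand-in for the example data set (values are made up).
exDat <- data.frame(msc1 = c(1, 2, 4), msc2 = c(3, 2, 1))

# Way 1: add the columns element-wise with the + operator.
exDat$mscSum6 <- exDat$msc1 + exDat$msc2

# Way 2: rowSums() over the selected item columns.
# Unlike +, it can ignore missing values via na.rm = TRUE.
exDat$mscSum6 <- rowSums(exDat[, c("msc1", "msc2")])

exDat$mscSum6
```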
To delete a variable, you can assign NULL to it.
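As a sketch, assigning NULL to a column removes it from the data frame. The data frame below is a hypothetical stand-in for exDat.

```r
# Hypothetical stand-in for the example data set (values are made up).
exDat <- data.frame(msc1 = c(1, 2, 4), msc2 = c(3, 2, 1))
exDat$mscSum6 <- exDat$msc1 + exDat$msc2

# Assigning NULL deletes the column.
exDat$mscSum6 <- NULL

names(exDat)
```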
To add a variable to a data set, we may use the mutate function from the dplyr package.
From the function description: dplyr::mutate creates new columns that are functions of existing variables. It can also modify (if the name is the same as an existing column) and delete columns (by setting their value to NULL).
Add the new variable mscSum6.
exDat <- exDat |>
dplyr::mutate(mscSum6 = msc1 + msc2)
Delete the variable mscSum6.
exDat <- exDat |>
dplyr::mutate(mscSum6 = NULL)
8.2 Recoding variables
Recoding is the process of reassigning values (old → new) for a variable in a dataset.
The old values can either be overwritten by the new values or saved as a new variable. Saving them as a new variable has two advantages:
- old values are not lost
- errors during recoding can be reproduced
A common approach to recode variables is to use the base::ifelse function. It requires 3 input arguments:
- test: an object that can be coerced to logical mode
- yes: return values for true elements of test
- no: return values for false elements of test
The function returns a value with the same shape as test, filled with elements selected from either yes or no depending on whether the element of test is TRUE or FALSE.
ifelse(1 == 1,
yes = "That is correct!",
no = FALSE)
[1] "That is correct!"
[1] "That is correct!" "FALSE" "FALSE" "That is correct!"
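Because ifelse is vectorized, a test with several elements returns one value per element. The sketch below uses a hypothetical test vector chosen to reproduce the pattern of the second output line above. Note that the logical FALSE is coerced to the string "FALSE", because the result also has to hold the character value.

```r
# Hypothetical input: the first and fourth elements satisfy the test.
answers <- c(1, 2, 3, 1)

res <- ifelse(answers == 1,
              yes = "That is correct!",
              no = FALSE)
res
class(res)
```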
To apply the function in a more “meaningful” setting, let's transform (i.e., recode) the variable age into a categorical variable (here: ageCat) with the categories old and young. All units that are older than 10 get the value "old"; all others get the value "young".
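A minimal sketch of this recoding with ifelse, using a hypothetical age vector in place of the data set's variable:

```r
# Hypothetical ages; NA marks a missing value.
age <- c(8, 10, 11, 15, NA)

# Older than 10 -> "old", otherwise "young"; NA stays NA.
ageCat <- ifelse(age > 10, yes = "old", no = "young")

# Cross-tabulate old and new values to check the recoding.
table(age, ageCat, useNA = "always")
```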
Psychological instruments often contain items that are designed to measure the opposite of the actual construct (e.g., “I am good at math.” vs. “I am bad at math.”). These items are called reverse-scored or negatively-keyed items. In the example data set exDat, the variables msc3 and msc4 are (simulated as) reverse-scored and thus need to be recoded.
The first step is to check the response categories.
table(exDat$msc3, useNA = "always")
1 2 3 4 <NA>
60 320 314 56 0
In this approach, we subtract the item from the sum of the maximum and minimum of the item (here: maximum = 4, minimum = 1).
Note that this approach is not very robust: it does not carry over to other recoding strategies, and if the maximum and minimum are taken from the data, it can give wrong results when the sample size is small and not all response categories are used.
It is important to evaluate the result.
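A minimal sketch of the subtraction approach and of the check that follows it. The item vector and the name msc3r are hypothetical; the scale endpoints 1 and 4 are taken from the table above and are hard-coded rather than computed from the data.

```r
# Hypothetical responses on a 1-4 scale (NA = missing).
msc3 <- c(1, 2, 2, 3, 4, NA)

# Scale endpoints, fixed by the questionnaire rather than read from the data.
scaleMin <- 1
scaleMax <- 4

# Reverse-score: 1 -> 4, 2 -> 3, 3 -> 2, 4 -> 1.
msc3r <- (scaleMax + scaleMin) - msc3

# Evaluate the result: each old value should map to exactly one new value.
table(msc3, msc3r, useNA = "always")
```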
In this approach, we use the recode function of the car package (Fox et al., 2022). This function needs at least 2 inputs (copied from the package description):
- var: numeric vector, character vector, or factor.
- recodes: character string of recode specifications.
There are additional arguments such as as.factor and as.numeric, which control the class of the output.
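A hedged sketch of a car::recode call for reverse-scoring a four-point item. The input vector and the name msc3r are hypothetical, and the recoding only runs if the car package is installed.

```r
# Hypothetical responses on a 1-4 scale (NA = missing).
msc3 <- c(1, 2, 2, 3, 4, NA)

if (requireNamespace("car", quietly = TRUE)) {
  # The recode specification is one string of old=new pairs separated by ";".
  msc3r <- car::recode(msc3, recodes = "1=4; 2=3; 3=2; 4=1")

  # Evaluate the result with a cross-tabulation of old and new values.
  print(table(msc3, msc3r, useNA = "always"))
}
```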
Again, evaluating the result.
Evaluating the result.
(copied from the ?dplyr::recode function description)
dplyr::recode is superseded in favor of dplyr::case_match, which handles the most important cases of dplyr::recode with a more elegant interface. dplyr::recode_factor is also superseded, however, its direct replacement is not currently available but will eventually live in forcats. For creating new variables based on logical vectors, use dplyr::if_else(). For even more complicated criteria, use dplyr::case_when().
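Following that recommendation, here is a sketch of the same reverse-scoring with dplyr::case_match. It assumes dplyr version 1.1.0 or later is installed; the input vector and the name msc3r are hypothetical.

```r
# Hypothetical responses on a 1-4 scale (NA = missing).
msc3 <- c(1, 2, 2, 3, 4, NA)

if (requireNamespace("dplyr", quietly = TRUE) &&
    utils::packageVersion("dplyr") >= "1.1.0") {
  # Each formula maps an old value (left) to a new value (right).
  # Values without a match, including NA, become NA by default.
  msc3r <- dplyr::case_match(msc3,
                             1 ~ 4,
                             2 ~ 3,
                             3 ~ 2,
                             4 ~ 1)

  # Evaluate the result.
  print(table(msc3, msc3r, useNA = "always"))
}
```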
8.3 Working with strings (character objects)
This section briefly introduces how to work with strings (character objects). Recall: what is a character (in R)?
Everything that appears within single (') or double quotes ("; double quotes are recommended) is treated as a string (i.e., a character object). It is important to know that character objects are sensitive to whitespace and case.
" " == ""
[1] FALSE
"hello world!" == "Hello world!"
[1] FALSE
When it comes to survey research, strings are used to transfer the information from (mostly) open fields in a questionnaire into the data set (e.g., “What is your first language or mother tongue? (Please specify)”).
The variable fLang in the exDat data set contains the different answers to such a question.
table(exDat$fLang, useNA = "always")
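                 french     ger  german   germn italian    <NA>
      5       5      10      49     650       1      10      20
The first two counts (5 and 5) have no visible label because they belong to the empty string "" and the single space " ".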
To clean such a character variable, R offers a couple of functions (e.g., base::grep, base::grepl, base::regexpr, base::gregexpr, …). Also, there are packages such as stringr (Wickham, 2022) or stringi (Gagolewski et al., 2022) that offer great functionality¹.
In the following example, we use the base::gsub function to search for matches of the pattern " " in the character vector anotherChr and replace them with "".
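A minimal sketch of such a call. The contents of anotherChr are hypothetical.

```r
# Hypothetical character vector with stray spaces.
anotherChr <- c("ger man", " french", "italian ")

# Replace every match of the pattern " " with "" (i.e., delete all spaces).
# fixed = TRUE treats the pattern as a literal string, not a regular expression.
gsub(pattern = " ", replacement = "", x = anotherChr, fixed = TRUE)
```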
The cleanStr function (introduced in the Functions section of the General information on data documentation part) is designed to clean strings and has the following arguments:
- stringToClean: This should be a character.
- pattern: The pattern which should be replaced. Can be a character vector.
- replacement: The replacement. A character with length = 1.
- replaceNA: logical. Should NAs be replaced? Default is TRUE.
- replaceNAval: If replaceNA == TRUE, the replacement of NA values. Default is "Unknown".
- as.fac: logical. Output of the function (character or factor). Default is FALSE.
- print: logical. Default is TRUE.
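The actual definition of cleanStr lives in the section referenced above. To illustrate the mechanics only, here is a simplified, hypothetical stand-in (cleanStrSketch, not the workshop's function). It assumes that pattern is matched against whole values rather than substrings, which is inferred from the printed output of the calls in this section, and it omits the as.fac and print arguments.

```r
# Hypothetical, simplified stand-in for cleanStr; not the original function.
cleanStrSketch <- function(stringToClean,
                           pattern,
                           replacement,
                           replaceNA = TRUE,
                           replaceNAval = "Unknown") {
  stopifnot(is.character(stringToClean), length(replacement) == 1)
  outStr <- stringToClean
  # Replace values that are exactly equal to one of the patterns.
  outStr[outStr %in% pattern] <- replacement
  # Optionally replace missing values.
  if (replaceNA) {
    outStr[is.na(outStr)] <- replaceNAval
  }
  outStr
}

# Hypothetical answers to an open question.
fLang <- c("german", "ger", "germn", "", " ", NA, "french")
cleanStrSketch(fLang, pattern = c(" ", ""), replacement = "Unknown")
```

Whole-value matching keeps "german" untouched when "ger" is later passed as a pattern; a substring-based replacement would alter it.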
exDat$fLangR <- cleanStr(exDat$fLang,
pattern = c(" ", ""),
replacement = "Unknown",
replaceNA = TRUE)
Input
stringToClean
                 french     ger  german   germn italian    <NA>
      5       5      10      49     650       1      10      20
Output
outStr
french ger german germn italian Unknown <NA>
10 49 650 1 10 30 0
This is what happened:
The following pattern(s): ,
was/were replaced with: Unknown
Missing values (NA) are replaced with: 'Unknown'
- Use the cleanStr function with the pattern and replacement arguments.
exDat$fLangR <- cleanStr(exDat$fLangR,
pattern = c("ger", "germn"),
replacement = "german",
replaceNA = FALSE)
Input
stringToClean
french ger german germn italian Unknown <NA>
10 49 650 1 10 30 0
Output
outStr
french german italian Unknown <NA>
10 700 10 30 0
This is what happened:
The following pattern(s): ger, germn
was/were replaced with: german
¹ A comprehensive introduction to these functions or packages is beyond the scope of this workshop. Hence, we focus on some mechanics of these functions.