8  Data transformation

In this part you will learn how to transform or convert variables into a usable form for analysis. The steps include:

  1. Building new variables
  2. Recoding variables
  3. Working with strings (character objects)

8.1 Building new variables

To build new variables and add them to a data set, we can use all functions and operators that R offers.

In R, you can create objects (from existing ones) and store them as new objects using functions and operators. In the following example, we create the objects x and y and assign them numbers.

x <- 5
y <- 4

Then we sum them up:

xy <- x+y
xy
[1] 9

This is the same as:

sum(c(x,y)) == xy
[1] TRUE

Finally, we may delete all objects using the rm function.

rm(x,y,xy)
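
The + operator works element by element on vectors, whereas sum collapses all values into a single number. A minimal sketch with two made-up vectors (a and b are hypothetical and not part of exDat) illustrates the difference:

```r
# Made-up vectors for illustration only
a <- c(1, 2, 3)
b <- c(10, 20, 30)

# Element-wise addition: one result per position
elementWise <- a + b
elementWise

# sum() adds up all values and returns a single number
total <- sum(a, b)
total
```

Because a column of a data set is a vector, the element-wise behaviour is usually what we want when a new variable should hold one value per row.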

For example, we might want to sum up item indicators to a so-called scale score (for now only the first 2; for a more detailed examination see the section Descriptive statistics and item analysis) and add them as a new column to the example data set exDat.

There are several ways to do that, for example:

exDat$mscSum1 <- exDat$msc1+exDat$msc2
exDat$mscSum2 <- with(exDat, msc1 + msc2)
exDat[,"mscSum3"] <- with(exDat, msc1 + msc2)

mscSum4 <- exDat$msc1+exDat$msc2
exDat <- cbind(exDat, mscSum4)

exDat$mscSum5 <- rowSums(exDat[,c("msc1","msc2")], na.rm = TRUE)
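
The variants above are not fully interchangeable once values are missing. The following sketch uses a small made-up data frame (toy, with hypothetical values) to show that + propagates NA, while rowSums with na.rm = TRUE silently drops missing values:

```r
# Toy data (hypothetical values, not taken from exDat)
toy <- data.frame(msc1 = c(1, 2, NA),
                  msc2 = c(3, NA, NA))

# `+` returns NA as soon as one of the items is missing
toy$sumPlus <- toy$msc1 + toy$msc2

# rowSums() with na.rm = TRUE ignores missing values;
# a row in which all items are missing gets the sum 0
toy$sumRow <- rowSums(toy[, c("msc1", "msc2")], na.rm = TRUE)

toy
```

Which behaviour is appropriate depends on the analysis; a score of 0 for a person who answered no item at all is rarely what is intended.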

To delete variables, you can assign NULL to them.

exDat[, paste0("mscSum", 1:5)] <- NULL

To add a variable to a data set, we may use the mutate function from the dplyr package.

From the function description: dplyr::mutate creates new columns that are functions of existing variables. It can also modify (if the name is the same as an existing column) and delete columns (by setting their value to NULL).

Add the new variable mscSum6.

exDat <- exDat |>
  dplyr::mutate(mscSum6 = msc1 + msc2)

Delete the variable mscSum6.

exDat <- exDat |>
  dplyr::mutate(mscSum6 = NULL)
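
A single mutate call can also create several variables at once, and later expressions may refer to columns created earlier in the same call. The sketch below uses a made-up data frame (toy) and hypothetical column names; it assumes that the dplyr package is installed.

```r
library(dplyr)

# Toy data (hypothetical values, not taken from exDat)
toy <- data.frame(msc1 = c(1, 2, 4),
                  msc2 = c(3, 2, 1))

toy <- toy |>
  mutate(
    mscSum  = msc1 + msc2,  # new column: sum of both items
    mscMean = mscSum / 2    # uses the column created one line above
  )

toy
```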

8.2 Recoding variables

Recoding is the process of reassigning values (old → new) for a variable in a dataset.

The old values can either be overwritten by the new values or saved as a new variable.

We always create a new variable when we recode a variable!
  • old values are not lost
  • errors during recoding can be reproduced

A common approach to recode variables is to use the base::ifelse function. It requires 3 input arguments:

  • test: which is an object that can be coerced to logical mode
  • yes: return values for true elements of test
  • no: return values for false elements of test

The function returns a value with the same shape as test which is filled with elements selected from either yes or no depending on whether the element of test is TRUE or FALSE.

ifelse(1 == 1,
       yes = "That is correct!",
       no = FALSE)
[1] "That is correct!"
ifelse(1 == c(1,2,3,1),
       yes = "That is correct!",
       no = FALSE)
[1] "That is correct!" "FALSE"            "FALSE"            "That is correct!"

To apply the function in a more “meaningful” setting, let's transform (i.e., recode) the variable age into a categorical variable (here: ageCat) with the categories old and young. All units older than 10 get the value "old"; all others get the value "young".

exDat$ageCat <- ifelse(exDat$age > 10, "old", "young")
table(exDat$ageCat)

  old young 
  342   328 
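
When the tested variable contains missing values, ifelse returns NA for those elements, and table omits them unless useNA is set. A minimal sketch with a made-up age vector (hypothetical values, not taken from exDat):

```r
# Made-up ages, including one missing value
age <- c(8, 12, NA, 10)

# The test `age > 10` is NA for the missing value, so the result is NA as well
ageCat <- ifelse(age > 10, "old", "young")
ageCat

# Show the missing values explicitly in the frequency table
table(ageCat, useNA = "always")
```

Note that an age of exactly 10 falls into the category "young", because the test uses a strict inequality.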

Psychological instruments often contain items that are designed to measure the opposite of the actual construct (e.g., “I am good at math.” vs. “I am bad at math.”). These items are called reverse-scored or negatively-keyed items. In the example data set exDat, the variables msc3 and msc4 are (simulated as) reverse-scored and thus need to be recoded.

The first step is to check the response categories.

table(exDat$msc3, useNA = "always")

   1    2    3    4 <NA> 
  60  320  314   56    0 

In the first approach, we subtract the item from the sum of the maximum and minimum of the item (here, maximum = 4 and minimum = 1).

exDat$msc3r1 <- sum(max(exDat$msc3, na.rm = TRUE),
                    min(exDat$msc3, na.rm = TRUE)) - exDat$msc3

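Note that this approach is not very robust: it only reverses a scale and does not carry over to other recoding strategies. It also gives wrong results when not all response categories are observed (e.g., in small samples), because the observed minimum or maximum then differs from the theoretical one.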
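
The problem with unused categories can be sketched with a made-up item in which the highest category (4) was never chosen. The helper reverseItem is hypothetical (it is not part of the workshop material) and takes the theoretical minimum and maximum as arguments instead of computing them from the data:

```r
# Hypothetical helper: reverse-score with the theoretical scale range
reverseItem <- function(x, minVal, maxVal) {
  (minVal + maxVal) - x
}

# Made-up responses on a 1-4 scale; category 4 is never observed
x <- c(1, 2, 3, 2, NA)

# Observed range (1 to 3): the item is reversed around the wrong midpoint
revObserved <- sum(max(x, na.rm = TRUE), min(x, na.rm = TRUE)) - x
revObserved

# Theoretical range (1 to 4): correct reversal
revTheoretical <- reverseItem(x, minVal = 1, maxVal = 4)
revTheoretical
```

With the observed range, a response of 1 is recoded to 3 instead of 4, which is why checking the response categories first is worthwhile.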

It is important to evaluate the result.

with(exDat, table(msc3, msc3r1))
    msc3r1
msc3   1   2   3   4
   1   0   0   0  60
   2   0   0 320   0
   3   0 314   0   0
   4  56   0   0   0

In the second approach, we use the recode function of the car package (Fox et al., 2022). This function needs at least 2 inputs (copied from the package description):

  • var: numeric vector, character vector, or factor.
  • recodes: character string of recode specifications

Additional arguments such as as.factor and as.numeric control the class of the output.

exDat$msc3r2 <- car::recode(var = exDat$msc3,
                            recodes = "
                            1 = 4;
                            2 = 3;
                            3 = 2;
                            4 = 1;
                            NA = NA",
                            as.factor = FALSE,
                            as.numeric = TRUE)

Again, we evaluate the result.

with(exDat, table(msc3, msc3r2))
    msc3r2
msc3   1   2   3   4
   1   0   0   0  60
   2   0   0 320   0
   3   0 314   0   0
   4  56   0   0   0

In the third approach, we use the case_when function of the dplyr package.

exDat <- exDat |>
  dplyr::mutate(
    msc3r3 = dplyr::case_when(
      msc3 == 1 ~ 4,
      msc3 == 2 ~ 3,
      msc3 == 3 ~ 2,
      msc3 == 4 ~ 1
      ))

Once more, we evaluate the result.

with(exDat, table(msc3, msc3r3))
    msc3r3
msc3   1   2   3   4
   1   0   0   0  60
   2   0   0 320   0
   3   0 314   0   0
   4  56   0   0   0

(copied from the ?dplyr::recode function description)

dplyr::recode is superseded in favor of dplyr::case_match, which handles the most important cases of dplyr::recode with a more elegant interface. dplyr::recode_factor is also superseded, however, its direct replacement is not currently available but will eventually live in forcats. For creating new variables based on logical vectors, use dplyr::if_else(). For even more complicated criteria, use dplyr::case_when.
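
As a sketch of the case_match interface mentioned in this description, the following reverses a made-up item (toy data with hypothetical values, not exDat). It assumes dplyr version 1.1.0 or later, in which case_match was introduced.

```r
library(dplyr)

# Toy data (hypothetical values), including one missing response
toy <- data.frame(msc3 = c(1, 2, 3, 4, NA))

toy <- toy |>
  mutate(
    msc3r = case_match(
      msc3,
      1 ~ 4,   # old value ~ new value
      2 ~ 3,
      3 ~ 2,
      4 ~ 1
    )
  )

# Cross-tabulate old and new values to check the recoding
with(toy, table(msc3, msc3r, useNA = "ifany"))
```

Values that match none of the listed old values (here: NA) are set to NA, because no .default is supplied.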

8.3 Working with strings (character objects)

This section briefly introduces how to work with strings (character objects). Recall what a character object is in R:

aChr <- "This is character"
class(aChr)
[1] "character"
print(aChr)
[1] "This is character"

Everything that appears within single (') or double quotes ("; double quotes are recommended) will be treated as a string (i.e., character object). It is important to know that character objects are space and case sensitive.

" " == ""
[1] FALSE
"hello world!" == "Hello world!"
[1] FALSE
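
Because of this sensitivity, differently typed versions of the same answer count as different values. One possible way to harmonize them, sketched here with made-up strings, combines the base functions trimws (removes leading and trailing whitespace) and tolower (converts to lower case):

```r
# Made-up answers that differ only in case and surrounding spaces
raw <- c("German", " german", "GERMAN ", "german")

# Without cleaning, only the last element equals "german"
raw == "german"

# Strip surrounding whitespace, then convert to lower case
clean <- tolower(trimws(raw))
clean == "german"
```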

When it comes to survey research, strings are used to transfer the information from (mostly) open fields in a questionnaire into the data set (e.g., “What is your first language or mother tongue? (Please specify)”).

The variable fLang in the exDat data set contains the different answers to such a question.

table(exDat$fLang, useNA = "always")

                 french     ger  german   germn italian    <NA> 
      5       5      10      49     650       1      10      20 

To clean such a character variable, R offers a couple of functions (e.g., base::grep, base::grepl, base::regexpr, base::gregexpr, …). There are also packages such as stringr (Wickham, 2022) or stringi (Gagolewski et al., 2022) that offer great functionality¹.

In the following example, we use the base::gsub function to find all matches of the pattern " " (a single space) in the character vector anotherChr and replace them with "" (an empty string).

anotherChr <- c(" ", "", "hello", " hello")

gsub(" ", "", x = anotherChr)
[1] ""      ""      "hello" "hello"

The cleanStr function (introduced in the Functions section of the General information on data documentation part) is designed to clean strings and has the following arguments:

  • stringToClean: This should be a character.
  • pattern: The pattern which should be replaced. Can be a character vector.
  • replacement: The replacement. A character with length = 1.
  • replaceNA: logical. Should NAs be replaced? Default is TRUE.
  • replaceNAval: If replaceNA == TRUE. The replacement of NA values. Default is: "Unknown"
  • as.fac: logical. Output of the function (character or factor). Default is FALSE.
  • print: logical. Default is TRUE.

In a first step, we replace empty strings, single spaces, and missing values with "Unknown":

exDat$fLangR <- cleanStr(exDat$fLang,
                         pattern = c(" ", ""),
                         replacement = "Unknown",
                         replaceNA = TRUE)
Input
stringToClean
                 french     ger  german   germn italian    <NA> 
      5       5      10      49     650       1      10      20 
Output
outStr
 french     ger  german   germn italian Unknown    <NA> 
     10      49     650       1      10      30       0 

This is what happened:
The following pattern(s):   , 
was/were replaced with: Unknown 

 Missing values (NA) are replaced with: 'Unknown'

Exercise: Use the cleanStr function with its pattern and replacement arguments to correct the misspelled entries "ger" and "germn" to "german".

exDat$fLangR <- cleanStr(exDat$fLangR,
                         pattern = c("ger", "germn"),
                         replacement = "german",
                         replaceNA = FALSE)
Input
stringToClean
 french     ger  german   germn italian Unknown    <NA> 
     10      49     650       1      10      30       0 
Output
outStr
 french  german italian Unknown    <NA> 
     10     700      10      30       0 

This is what happened:
The following pattern(s):  ger, germn
was/were replaced with: german 
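
The output suggests that whole values are matched rather than substrings (otherwise "german" itself would be altered by the pattern "ger"). The following is not the cleanStr function, only a simplified base R sketch under that assumption; the name cleanStrSketch and the made-up input vector are hypothetical, and the printing and factor options are left out.

```r
# Hypothetical, simplified stand-in for cleanStr (assumes exact matching)
cleanStrSketch <- function(stringToClean, pattern, replacement,
                           replaceNA = TRUE, replaceNAval = "Unknown") {
  stopifnot(is.character(stringToClean), length(replacement) == 1)
  out <- stringToClean
  # Replace only elements that are identical to one of the patterns
  out[out %in% pattern] <- replacement
  # Optionally replace missing values
  if (replaceNA) out[is.na(out)] <- replaceNAval
  out
}

# Made-up answers (not taken from exDat)
fLang <- c("german", "ger", "germn", " ", "", NA, "french")

step1 <- cleanStrSketch(fLang, pattern = c(" ", ""),
                        replacement = "Unknown")
step2 <- cleanStrSketch(step1, pattern = c("ger", "germn"),
                        replacement = "german", replaceNA = FALSE)

table(step2, useNA = "always")
```

Exact matching via %in% is the design choice that keeps correctly spelled values untouched; a substring replacement with gsub would need anchored patterns to achieve the same.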

  1. A comprehensive introduction to these functions or packages is beyond the scope of this workshop. Hence, we focus on some mechanics of these functions.