Functions in R

Last updated on October 25, 2024

Abstract

Writing functions is one of the best ways to enhance your data processing skills. Functions allow you to automate tasks that need to be repeated more than twice. This section gives a brief introduction to writing your own functions. Other sources cover this topic in much more detail (see R for Data Science by Hadley Wickham & Garrett Grolemund).

Why you should avoid copy/paste and use functions instead

One of the best ways to improve your reach as a data scientist is to write functions. Functions allow you to automate common tasks in a more powerful and general way than copy-and-pasting. Writing a function has four big advantages over using copy-and-paste:

  1. You can give a function an evocative name that makes your code easier to understand.

  2. As requirements change, you only need to update code in one place, instead of many.

  3. You eliminate the chance of making incidental mistakes when you copy and paste (i.e. updating a variable name in one place, but not in another).

  4. It makes it easier to reuse work from project-to-project, increasing your productivity over time.

See R for Data Science (2e), Chapter 25.

What is a function in R?

A function is an object that may contain multiple related operations, statements and functions. These are executed in a predefined order.

Functions in R are created with the function keyword and consist of roughly three parts:

  1. name of the function
  2. arguments and parameters: may vary across calls of the function
  3. body that contains the code which is executed across function calls

name <- function( arguments ) {

  body 

}
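
As a minimal illustration of the three parts, the sketch below defines a hypothetical function squareValue (an assumption for illustration, not part of this lesson's example) with one argument and a one-line body.

```r
# Hypothetical example: name = squareValue, argument = x, body = x^2
squareValue <- function(x) {
  x^2
}

squareValue(4)
# [1] 16
```

In R, the value of the last evaluated expression is returned, so an explicit return() is optional in such a short function.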

In addition, you might want to include warning and/or error messages.
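
The difference between the two kinds of messages can be sketched as follows. The function safeLog is hypothetical and only serves as an illustration: stop() aborts the function with an error, whereas warning() reports a problem and lets the function continue.

```r
# Hypothetical function illustrating warning() and stop()
safeLog <- function(x) {

  # stop() aborts the function with an error
  if (!is.numeric(x)) {
    stop("'x' must be numeric")
  }

  # warning() reports a problem but lets the function continue
  if (any(x <= 0, na.rm = TRUE)) {
    warning("non-positive values in 'x' are set to NA")
    x[which(x <= 0)] <- NA
  }

  log(x)
}

safeLog(c(1, 10))      # no message
safeLog(c(-1, 1))      # warning; the first element of the result is NA
# safeLog("a")         # error: 'x' must be numeric
```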

How to write a function?

Example: A function that calculates scale scores for a set of item indicators.

Step 1

Provide a concise and meaningful name (here: calcScaleScore). Once the function has been defined, its name appears as an object in the R environment.

calcScaleScore <- function( arguments ) {

  body 

}
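
Whether the function object exists can be checked directly. The sketch below uses a placeholder definition with an empty argument list and a NULL body (an assumption, so that the block runs on its own).

```r
# Placeholder definition; arguments and body are filled in during the next steps
calcScaleScore <- function() {
  NULL
}

exists("calcScaleScore")      # TRUE: the object is now defined
is.function(calcScaleScore)   # TRUE: it is a function object
```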

Step 2

Define the inputs (also known as arguments or parameters) of the function. These inputs provide the necessary data and information and, at the same time, control how the function operates. The inputs are written within regular parentheses ( ). Optional: Prepare if conditions that branch on the inputs.

To calculate a scale score, we need the following inputs:

  • Dataset: A data frame containing the items.
  • Items: A character vector specifying the names of the items.
  • Score: A character string specifying the type of score, either "sum" (the default) or "mean".

calcScaleScore <- function( data,
                            items,
                            score = "sum" ) {
  
  if (score == "sum") {

    # code to calculate sum score

  } else if ( score == "mean" ) {

    # code to calculate mean score

  } else {

    stop("The 'score' argument must be either 'sum' or 'mean'")

  }
}
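
An alternative to the explicit if/else check of the score argument is base R's match.arg(). The following is only a sketch of that idiom, not the approach used in this lesson: the allowed values are listed as the default vector, and match.arg() selects the first one unless the caller supplies another.

```r
# Sketch of an alternative: allowed values are given as the default vector
calcScaleScore <- function(data,
                           items,
                           score = c("sum", "mean")) {

  # Errors automatically if 'score' is not one of the allowed values
  score <- match.arg(score)

  score  # returned here only to show which value was selected
}

calcScaleScore()                   # "sum" (the first value is the default)
calcScaleScore(score = "mean")     # "mean"
# calcScaleScore(score = "median") # error: 'arg' should be one of the allowed values
```

The trade-off is that the error message is generated by R rather than written by you, and match.arg() also accepts unambiguous abbreviations such as "me".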

Step 3

Provide the actual code in the body of the function. This code is written inside the curly brackets { }. Do not forget to return() the results.

To calculate the sum score, we use the rowSums() function; for the mean score, we use the rowMeans() function. The calculated scores are temporarily stored in the ret object within the function (the ret object does not exist in the global R environment).

calcScaleScore <- function( data,
                            items,
                            score = "sum" ) {

  if (score == "sum") {

    # drop = FALSE keeps a data frame even if only one item is selected
    ret <- rowSums(data[, items, drop = FALSE])

  } else if ( score == "mean" ) {

    ret <- rowMeans(data[, items, drop = FALSE])

  } else {

    stop("score argument must be either 'sum' or 'mean'")

  }

  return(ret)

}
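
That objects created inside a function stay local can be checked with a small sketch. The data frame toyDat and the function rowTotal are made up for illustration; rowTotal mirrors only the sum branch of the function above.

```r
# Toy data frame (made up for illustration)
toyDat <- data.frame(a = c(1, 2), b = c(3, 4))

# Hypothetical mini function with a local object named 'ret'
rowTotal <- function(data, items) {
  # drop = FALSE keeps a data frame even if only one item is selected
  ret <- rowSums(data[, items, drop = FALSE])
  return(ret)
}

rowTotal(toyDat, c("a", "b"))   # row sums: 4 and 6
exists("ret")                   # FALSE, provided no 'ret' was created elsewhere
```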

Step 4

Test the function.

A simulated data set (wideLSdat) can be found in the Example Data section.

head(wideLSdat[,1:6])
         Y11        Y21        Y31        Y12        Y22        Y32
1 -1.1184017 -1.6943188  0.1323046  0.5748477  0.3150744  1.4075321
2  1.4531887  2.7288805 -1.2379256 -0.4005724  3.2846794  0.6107107
3 -2.2427060 -0.1185009  0.9572514  0.3580144  0.1341493  1.0716963
4 -1.0682233 -1.0543955  0.1124673 -0.4524512 -1.3420486 -1.2007979
5 -1.7815469 -0.2275640 -2.1083167 -1.3481591 -1.7817689 -0.2680896
6 -0.4298616 -0.6126364 -1.6585705 -0.1314354  0.6786605  0.4220211

Create a named list that assigns the item names to each scale.

wideLSVar <- list("Y1" = paste0("Y", 1:3, 1),
                  "Y2" = paste0("Y", 1:3, 2),
                  "Y3" = paste0("Y", 1:3, 3))
wideLSVar
$Y1
[1] "Y11" "Y21" "Y31"

$Y2
[1] "Y12" "Y22" "Y32"

$Y3
[1] "Y13" "Y23" "Y33"

Calculate the sum score for the first scale (Y1).

wideLSdat$Y1 <- calcScaleScore(data = wideLSdat,
                               items = wideLSVar$Y1,
                               score = "sum")

Evaluate the results.

table(rowSums(wideLSdat[, wideLSVar$Y1]) == wideLSdat$Y1)

TRUE 
1000 
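
The named list also makes it possible to score all scales in one go. The sketch below is one possible way to do this with lapply(). It uses a small random stand-in (simDat) instead of wideLSdat and a reduced copy of calcScaleScore so that the block runs on its own; both are assumptions for illustration.

```r
# Stand-in data with the same column names as wideLSdat (values are random)
set.seed(1)
simDat <- as.data.frame(matrix(rnorm(5 * 9), nrow = 5))
scaleItems <- list("Y1" = paste0("Y", 1:3, 1),
                   "Y2" = paste0("Y", 1:3, 2),
                   "Y3" = paste0("Y", 1:3, 3))
colnames(simDat) <- unlist(scaleItems)

# Reduced copy of calcScaleScore (sum and mean only)
calcScaleScore <- function(data, items, score = "sum") {
  if (score == "sum") {
    ret <- rowSums(data[, items, drop = FALSE])
  } else if (score == "mean") {
    ret <- rowMeans(data[, items, drop = FALSE])
  } else {
    stop("score argument must be either 'sum' or 'mean'")
  }
  return(ret)
}

# One call per scale; the result is a list with one score vector per scale
scores <- lapply(scaleItems, function(x) {
  calcScaleScore(data = simDat, items = x, score = "mean")
})

# Add the scores as new columns named Y1, Y2 and Y3
simDat[names(scores)] <- scores
head(simDat[, names(scores)])
```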

Input validation

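Input validation is useful: it catches incorrect inputs early and lets the function stop with an informative error message instead of failing later or returning wrong results.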
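
The general pattern can be sketched with a hypothetical helper, checkInputs (not part of the lesson's function), that tests the type of each input before any computation takes place.

```r
# Hypothetical helper that validates inputs before any computation
checkInputs <- function(data, score) {

  if (!is.data.frame(data)) {
    stop("'data' must be a data frame")
  }

  if (!is.character(score) || length(score) != 1) {
    stop("'score' must be a single character string")
  }

  invisible(TRUE)
}

checkInputs(data.frame(a = 1), score = "sum")   # passes silently
# checkInputs(1:3, score = "sum")               # fails with: 'data' must be a data frame
```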

Exercise (15 min): Input validation

Add an input validation that checks whether all items exist in the data set. Bonus: Print the missing items.

Solution

calcScaleScore <- function( data,
                            items,
                            score = "sum" ) {

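  # Check which of the requested items are columns of the data set
  validItems <- items %in% colnames(data)

  missingItems <- items[!validItems]

  if (length(missingItems) > 0) {

    stop("The following item(s) is/are not in the dataset: ",
         paste(missingItems, collapse = ", "))

  }

  if (score == "sum") {

    # drop = FALSE keeps a data frame even if only one item is selected
    ret <- rowSums(data[, items, drop = FALSE])

  } else if ( score == "mean" ) {

    ret <- rowMeans(data[, items, drop = FALSE])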

  } else {

    stop("score argument must be either 'sum' or 'mean'")

  }

  return(ret)

}
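
To see the validation in action, the function can be called with item names that are not in the data. The sketch below repeats the solution in compact form so that it runs on its own and uses a made-up data frame (toyDat); tryCatch() captures the error so that its message can be inspected.

```r
# Compact copy of the solution so that this block runs on its own
calcScaleScore <- function(data, items, score = "sum") {
  missingItems <- items[!items %in% colnames(data)]
  if (length(missingItems) > 0) {
    stop("The following item(s) is/are not in the dataset: ",
         paste(missingItems, collapse = ", "))
  }
  if (score == "sum") {
    ret <- rowSums(data[, items, drop = FALSE])
  } else if (score == "mean") {
    ret <- rowMeans(data[, items, drop = FALSE])
  } else {
    stop("score argument must be either 'sum' or 'mean'")
  }
  return(ret)
}

# Toy data (made up); "Y41" and "Y51" are deliberately not in the data
toyDat <- data.frame(Y11 = c(1, 2), Y21 = c(3, 4), Y31 = c(5, 6))

# Capture the error message instead of aborting the script
msg <- tryCatch(
  calcScaleScore(toyDat, items = c("Y11", "Y41", "Y51")),
  error = function(e) conditionMessage(e)
)
msg
# [1] "The following item(s) is/are not in the dataset: Y41, Y51"
```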