3  Introduction to R

Last updated on

April 10, 2025

Abstract

This chapter provides a short introduction to the statistical programming language R, independent of the concept of reproducibility. In addition, it touches on the R ecosystem and the differences between base R and the tidyverse package collection. Finally, it introduces the concepts of functions and loops in R.

3.1 What is R?

R is a free software environment for statistical computing and graphics. It compiles and runs on a wide variety of UNIX platforms, Windows and MacOS. (r-project.org)

Install R from https://cran.rstudio.com/

3.2 Short introduction to the language

To understand computations in R, two slogans are helpful:

  • Everything that exists is an object.
  • Everything that happens is a function call.

– John Chambers (creator of the S programming language)

For comprehensive introductions

3.2.1 Some Basics

Before looking at the data types and structures, there are a few basics you need to know.

  • R has a help system to get help on functions and packages
help("mean")
# or equivalently
?mean
  • R is a calculator
sqrt(25) + 2^2
[1] 9
  • R is case-sensitive
"name" == "Name"
[1] FALSE
  • Values can be assigned to objects using <-
a <- 2
b <- 4
a + b
[1] 6
  • Arguments in functions are assigned using =
df <- data.frame(
  x = 1:4,
  y = 3:6
)

3.2.2 Data Types

The basic data types in R are depicted in Table 3.1.

Table 3.1: Basic data types in R

Type      | Description                                            | Value (example)
Numeric   | Numbers with a decimal value or fraction               | 3.7
Integer   | Counting numbers and their additive inverses           | 2, -115
Character | Letters enclosed by quotes in the output (aka string)  | "Hello World!", "4"
Logical   | Boolean                                                | TRUE, FALSE
Factor    | Categorical data                                       |
          | - Level: characteristic value as seen by R             | 0, 1
          | - Label: designation of the characteristic attributes  | male, female
Complex1  | Numbers with a real and an imaginary part              | 2 + 3i
Special   | - Missing values: unknown cell value                   | NA
          | - Impossible values: not a number                      | NaN
          | - Empty values: known empty cell value                 | NULL


You can check the class of an object with the class() function.

class(a)
[1] "numeric"
class(df)
[1] "data.frame"
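
As a minimal sketch, the following lines apply class() to one example value per row of Table 3.1. The object name sex_fct and the example values are assumptions chosen for illustration only.

```r
class(3.7)      # "numeric"
class(2L)       # "integer" (the L suffix marks an integer literal)
class("4")      # "character"
class(TRUE)     # "logical"
class(2 + 3i)   # "complex"

# Factor: levels are the values found in the data, labels are the names shown
sex_fct <- factor(c(0, 1, 1), levels = c(0, 1), labels = c("male", "female"))
class(sex_fct)   # "factor"
levels(sex_fct)  # "male" "female"

# The special values each have their own test function
is.na(NA)       # TRUE
is.nan(0 / 0)   # TRUE
is.null(NULL)   # TRUE
```

Note that a plain 2 is stored as numeric; only 2L (or the result of as.integer()) has class "integer".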

3.2.3 Data Structures

R has five2 fundamental data structures: vectors, matrices, arrays, lists, and data frames.

A vector is…

  • a one-dimensional array and
  • the elements are of the same data type (here: numeric/integer)


vec <- c(45, 6, -83)
vec
[1]  45   6 -83

Create a vector with the c() function

v <- c(45, 6, -83, 23, 61)
v
[1]  45   6 -83  23  61

Or a named vector…

vNam <- c(a = 45, b = 6, c = -83, d = 23, e = 61)
vNam
  a   b   c   d   e 
 45   6 -83  23  61 

Count the number of elements contained in a vector

length(v)
[1] 5

Vector indexing (by position)

v[1]
[1] 45
v[-3]
[1] 45  6 23 61

Slicing vectors

v[3:5]
[1] -83  23  61
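
Besides positions and ranges, vectors can also be indexed by several positions at once, by a logical condition, or by name. The following sketch reuses the values of v and vNam from above; it illustrates base R indexing and is not an exhaustive list.

```r
v <- c(45, 6, -83, 23, 61)
vNam <- c(a = 45, b = 6, c = -83, d = 23, e = 61)

v[c(1, 5)]         # several positions at once: 45 61
v[v > 0]           # logical condition: 45 6 23 61
vNam["b"]          # by name: the element named b
vNam[c("a", "e")]  # several names at once
```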

Generate regular sequences using the seq() function

seq(from = 0,
    to = 20,
    by = 2)
 [1]  0  2  4  6  8 10 12 14 16 18 20

A matrix is…

  • a two-dimensional array (rows and columns) and
  • the elements are of the same data type (here: numeric/integer)


matrix(
  data = c(1, 2, 3, 45),
  nrow = 2,
  ncol = 2,
  byrow = TRUE
 )
     [,1] [,2]
[1,]    1    2
[2,]    3   45

The matrix() function creates a matrix from the given set of values

m <- matrix(data = c(1, 2, 3, 45, 36, 52),
            nrow = 2,
            ncol = 3,
            byrow = TRUE)
m
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]   45   36   52

Slicing also works on matrices: m[row, column]

m[, 1:2]
     [,1] [,2]
[1,]    1    2
[2,]   45   36
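
A few more indexing patterns for the same matrix, as a short sketch. The drop = FALSE argument is a standard base R option that keeps the result a matrix instead of simplifying it to a vector.

```r
m <- matrix(data = c(1, 2, 3, 45, 36, 52),
            nrow = 2,
            ncol = 3,
            byrow = TRUE)

m[2, 3]               # single element: 52
m[1, ]                # first row as a vector: 1 2 3
m[, 2, drop = FALSE]  # second column, kept as a 2 x 1 matrix
dim(m)                # 2 3
```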

An array is…

  • a multi-dimensional data structure and
  • the elements must be of the same data type

Dimensions can be 2D (matrix), 3D, or higher (see cran.r-project.org/doc/manuals)


arr <- array(1:12,
             dim = c(3, 4))
arr
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12

Create an array with the array() function (here a 3×4 matrix):

arr <- array(1:12, dim = c(3, 4))
arr
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12

A 3D array (2 “sheets” of 3×4):

arr3d <- array(1:24, dim = c(3, 4, 2))
arr3d
, , 1

     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12

, , 2

     [,1] [,2] [,3] [,4]
[1,]   13   16   19   22
[2,]   14   17   20   23
[3,]   15   18   21   24

Check the dimensions:

dim(arr3d)
[1] 3 4 2

Indexing (row, column, layer):

arr3d[2, 3, 1]   # element in row 2, col 3, layer 1
[1] 8
arr3d[ , , 2]  # layer 2
     [,1] [,2] [,3] [,4]
[1,]   13   16   19   22
[2,]   14   17   20   23
[3,]   15   18   21   24

Slicing across dimensions:

arr3d[1:2, , ]
, , 1

     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11

, , 2

     [,1] [,2] [,3] [,4]
[1,]   13   16   19   22
[2,]   14   17   20   23

A list is …

  • an ordered collection of elements (order is preserved), and
  • can contain elements of various data types

Lists are one-indexed (indexing starts with 1), which means that the first element of a list is accessed with index 1, not 0.


list("hi", 2, NULL)
[[1]]
[1] "hi"

[[2]]
[1] 2

[[3]]
NULL

Create lists (with different elements, i.e., numbers and letters) with the list() function

l1 <- list(1:5)
l2 <- list(letters[1:5])
l3 <- list(LETTERS[1:5])

Create a nested list…

l4 <- list(l1, l2, l3)
l4
[[1]]
[[1]][[1]]
[1] 1 2 3 4 5


[[2]]
[[2]][[1]]
[1] "a" "b" "c" "d" "e"


[[3]]
[[3]][[1]]
[1] "A" "B" "C" "D" "E"

…or a named (nested) list

l4Nam <- list("Numbers" = l1,
              "SmallLetters" = l2,
              "CapitalLetters" = l3)
l4Nam
$Numbers
$Numbers[[1]]
[1] 1 2 3 4 5


$SmallLetters
$SmallLetters[[1]]
[1] "a" "b" "c" "d" "e"


$CapitalLetters
$CapitalLetters[[1]]
[1] "A" "B" "C" "D" "E"

Access list or nested list elements

l4[2]
[[1]]
[[1]][[1]]
[1] "a" "b" "c" "d" "e"
l4[[2]][3]
[[1]]
NULL
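
The NULL above appears because l4[[2]] is itself a list of length 1, so position 3 does not exist in it. The following sketch rebuilds l4 with the same contents and shows, step by step, how single brackets (which return a list) differ from double brackets (which return the element itself).

```r
l2 <- list(letters[1:5])
l4 <- list(list(1:5), l2, list(LETTERS[1:5]))

l4[2]            # single brackets: a list of length 1
l4[[2]]          # double brackets: the element itself (here again a list)
l4[[2]][[1]]     # the character vector stored inside that list
l4[[2]][[1]][3]  # its third element: "c"
```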

Unlist the list to get a vector that contains all the atomic components

unlist(l1)
[1] 1 2 3 4 5
unlist(l4)
 [1] "1" "2" "3" "4" "5" "a" "b" "c" "d" "e" "A" "B" "C" "D" "E"

Count the number of items contained in a list

length(l4)
[1] 3
length(unlist(l4))
[1] 15

You might want to check

A data frame

  • consists of multiple columns, and
  • each column may have a different data type

Usually, variables are stored in columns and units in rows.


data.frame(
  id = 1:4,
  age = c(12, 13, 12, 14),
  sex = c("male", "female", "female", "male")
)
  id age    sex
1  1  12   male
2  2  13 female
3  3  12 female
4  4  14   male
ex_df <- data.frame(
  id = 1:4,
  age = c(12, 13, 12, 14),
  sex = c("male", "female", "female", "male")
)
ex_df
  id age    sex
1  1  12   male
2  2  13 female
3  3  12 female
4  4  14   male

Number of observations

nrow(ex_df)
[1] 4

Show the dimensions (rows, columns) of the data frame

dim(ex_df)
[1] 4 3

Column names

colnames(ex_df)
[1] "id"  "age" "sex"

Show the first two rows of the data frame

head(ex_df, 2)
  id age    sex
1  1  12   male
2  2  13 female
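
Columns and rows of a data frame can be accessed much like list elements and matrix cells. The sketch below recreates ex_df and shows a few common base R patterns; the chosen filter conditions are arbitrary examples.

```r
ex_df <- data.frame(
  id = 1:4,
  age = c(12, 13, 12, 14),
  sex = c("male", "female", "female", "male")
)

ex_df$age                                     # one column as a vector
ex_df[["sex"]]                                # same idea, using the column name as a string
ex_df[ex_df$age > 12, ]                       # rows where age is above 12
ex_df[ex_df$sex == "female", c("id", "age")]  # filter rows and select columns
```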


The structure of an object can be checked using the str() function.

str(vec) 
 num [1:3] 45 6 -83
str(ex_df)
'data.frame':   4 obs. of  3 variables:
 $ id : int  1 2 3 4
 $ age: num  12 13 12 14
 $ sex: chr  "male" "female" "female" "male"

3.3 Base R, Additional Packages and tidyverse

R has a large ecosystem of libraries (often called packages) that extend the functionality of base R. Base R itself consists of a set of packages that are loaded by default when you start an R session. These can be checked with the sessionInfo() function.

sessionInfo()$basePkgs
[1] "stats"     "graphics"  "grDevices" "datasets"  "utils"     "methods"  
[7] "base"     

On the Comprehensive R Archive Network (CRAN), there are innumerable packages available that can be installed and loaded to add specific functionalities such as enhanced data manipulation, specific data analyses (e.g., linear mixed models, structural equation modeling), or data visualization to R. The following example demonstrates how to install a package (i.e., ggplot2 for data visualization) from CRAN and load it into an R session.

install.packages("ggplot2") 
library(ggplot2) 
Note: Modern package manager for R

You might also want to consider using the pak package (Csárdi & Hester, 2025) for installing packages. pak is a modern package manager that is faster, more efficient (e.g., regarding dependency management), and more user-friendly (e.g., improved error handling) than the install.packages() function.

For educational research, you may be interested in the following list of packages that are well established and can be used without concerns regarding their reliability and maintenance:

Finally, there is the so-called tidyverse (Wickham, 2023) within R. The tidyverse is a collection of R packages (see Figure 3.1) that “share an underlying design philosophy, grammar, and data structures” and are (specifically) designed for data science.

Figure 3.1: tidyverse package collection

Within the tidyverse package collection, the dplyr package (Wickham, François, et al., 2026) provides a set of convenient functions for manipulating data. Together with the pipe operator %>% from the magrittr package (Bache & Wickham, 2025), it offers a powerful approach to manipulating data in a clear and comprehensible way. The native3 R pipe |> was introduced with R 4.1.0, and since R 4.3.0 the base pipe covers most everyday uses of the magrittr pipe.
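
As a minimal base R sketch of what the native pipe does: x |> f() is evaluated as f(x). The object names values, nested and piped are illustrative, and the example assumes R 4.1.0 or later.

```r
values <- c(1, 4, 9)

# Nested call: read from the inside out
nested <- sum(sqrt(values))

# Piped call: read from left to right (requires R >= 4.1.0)
piped <- values |> sqrt() |> sum()

identical(nested, piped)  # TRUE
```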

The tidyverse style guide suggests using the pipe operator “to emphasize a sequence of actions”. The pipe operator can be understood as “take the object and then” pass it to the next function. To illustrate the idea behind the pipe operator, we will use the following example dataset (ex_dat) which contains two numeric variables (X and Y).

ex_dat <- data.frame(
  X = rnorm(100, mean = 2, sd = 1),
  Y = rnorm(100, mean = 3, sd = 1) 
)

In the following, the use of the base R pipe operator is demonstrated through a simple table creation example.

ex_dat |>                          # (1)
  dplyr::select(X, Y) |>           # (2)
  psych::describe(fast = TRUE) |>  # (3)
  knitr::kable(digits = 2)         # (4)

(1) Take the data frame ex_dat and then
(2) Select the variables X and Y and then
(3) Calculate descriptive statistics using the describe function from the psych package (Revelle, 2026) and then
(4) Create a table with the kable function from the knitr package (Xie, 2025)

  | vars |   n | mean |   sd | median |   min |  max | range |  skew | kurtosis |   se
X |    1 | 100 | 2.07 | 0.94 |   2.09 | -0.82 | 4.33 |  5.15 | -0.28 |     0.27 | 0.09
Y |    2 | 100 | 2.87 | 0.87 |   2.89 |  0.06 | 5.26 |  5.20 | -0.11 |     0.83 | 0.09


If you are convinced by the pipe approach, you can stop reading here and jump to the next section More about R Programming. If not, let us compare the pipe approach with two alternative ways to write the same code:

  • a nested approach

  • a sequential approach

The nested approach looks like this:

knitr::kable(psych::describe(ex_dat[, c("X", "Y")], fast=TRUE), digits = 2) 

…or, formatted:

knitr::kable(
  psych::describe(ex_dat[, c("X", "Y")],
                  fast=TRUE),
  digits = 2) 

Although the second version slightly improves clarity through formatting, both versions require the reader to parse the code from the inside out. This is less readable and less intuitive than the pipe approach.

The sequential approach looks like this:

ex_dat_sub <- ex_dat[, c("X", "Y")]
ex_descr <- psych::describe(ex_dat_sub, 
                            fast=TRUE)

knitr::kable(ex_descr, digits = 2)

This approach is readable and intuitive because it explicitly separates all processing steps. However, creating many so-called intermediate objects (i.e., ex_dat_sub, ex_descr) clutters the workspace/environment, which becomes messy and difficult to comprehend as the project grows. To avoid this, consider putting such code sequences into self-written (wrapper) functions (see the next section What is a function in R?).
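
A minimal sketch of such a wrapper function is shown below. To stay self-contained it uses base R only instead of psych and knitr; the function name describe_vars, its arguments, and the toy data are hypothetical and only illustrate the idea that intermediate objects stay inside the function.

```r
# Hypothetical wrapper: name and arguments are illustrative
describe_vars <- function(data, vars) {
  stopifnot(is.data.frame(data), all(vars %in% colnames(data)))

  # Intermediate object exists only inside the function
  dat_sub <- data[, vars, drop = FALSE]

  data.frame(
    n    = colSums(!is.na(dat_sub)),
    mean = vapply(dat_sub, mean, numeric(1), na.rm = TRUE),
    sd   = vapply(dat_sub, sd, numeric(1), na.rm = TRUE)
  )
}

toy <- data.frame(X = c(1, 2, 3), Y = c(2, 4, 6))
res <- describe_vars(toy, vars = c("X", "Y"))
res
```

After the call, only res (and the function itself) is added to the workspace; dat_sub is discarded when the function returns.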

3.4 More about R Programming

When you are familiar with the basics of R, you might want to learn more about two fundamental programming concepts (in R):

  1. functions and
  2. loops.

The combination thereof allows you to automate repetitive tasks (e.g., calculating scale scores, running different analyses) which makes your code more efficient, easier to maintain, and enhances reproducibility. This idea is also highlighted in R for Data Science (2e), Chapter 25:

[…] Functions allow you to automate common tasks in a more powerful and general way than copy-and-pasting. Writing a function has four big advantages over using copy-and-paste:

  1. You can give a function an evocative name that makes your code easier to understand.

  2. As requirements change, you only need to update code in one place, instead of many.

  3. You eliminate the chance of making incidental mistakes when you copy and paste (i.e. updating a variable name in one place, but not in another).

  4. It makes it easier to reuse work from project-to-project, increasing your productivity over time.

3.4.1 Functions

3.4.1.1 What is a function in R?

Definition: Function

A function is an object that may contain multiple related operations, statements and functions. These are executed in a predefined order.

In R, functions can be created using the function() statement and they consist of 3 parts:

  1. name of the function
  2. arguments and parameters: may vary across calls of the function
  3. body that contains the code which is executed across function calls
name <- function( arguments ) {

  body 

}
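
Filled in with concrete parts, the template might look like the following sketch. The function add_two is a hypothetical example: its name is add_two, its single argument is x, and its body adds 2 to x.

```r
# name: add_two; argument: x; body: x + 2 (all hypothetical)
add_two <- function(x) {
  x + 2
}

add_two(3)        # 5
add_two(c(1, 2))  # 3 4
```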

3.4.1.2 How to write a function?

To demonstrate how to write a function, we will use the following example. A common (repetitive) task is to recode item indicators. Consider two variables which need to be recoded (i.e., from 1 → 4, 2 → 3, 3 → 2, 4 → 1):

ex_rec <- data.frame(
  id = 1:6,
  item1 = c(1, 2, 2, 4, NA, 3),
  item2 = c(4, 3, NA, 1, 4, 1)
)

First, provide a concise and meaningful name (here: rec_items). Once the function is defined, its name appears as an object in the R environment.

rec_items <- function( inputs ) {

  # body

}

Next, we need to define the inputs (also known as arguments or parameters) of the function. These inputs provide the necessary data and information and concurrently define how the function operates. The inputs are written within regular parentheses (...).

To recode an item, we need the following inputs:

  • Dataset: A data frame containing the items.
  • Item: A character string specifying the name of the item.
rec_items <- function( data,
                       item ) {

  # Check if data is a data frame
  stopifnot(is.data.frame(data))
  
  # Check if item exists in the dataset
  if (!item %in% colnames(data)) {
    stop(sprintf("Item '%s' not found in dataset.", item), call. = FALSE)
  }

 # body goes here

}

Optional: A function profits from input validation. This means we can include checks and error messages within the function body (e.g., check if the dataset is a data frame, check if the item exists in the dataset, etc.).

Third, provide the actual code in the body of the function. This code is written inside the curly brackets { }. Do not forget to return the results.

In this recoding approach, we subtract the item from the sum of its maximum and minimum. Note that this approach is not very robust: it relies on the observed minimum and maximum, so it fails when not all response categories occur in the data (e.g., in small samples).

rec_items <- function( data,
                       item ) {

  # Check if data is a data frame
  stopifnot(is.data.frame(data))
  
  # Check if item exists in the dataset
  if (!item %in% colnames(data)) {
    stop(sprintf("Item '%s' not found in dataset.", item), call. = FALSE)
  }

  x <- data[[item]]
  
  if (all(is.na(x))) {
    stop(sprintf("Item '%s' contains only NA values.", item), call. = FALSE)
  }

  max_x <- max(x, na.rm = TRUE)
  min_x <- min(x, na.rm = TRUE)

  ret <- (max_x + min_x) - x
  ret # or return(ret)

}

Lastly, we test the function by executing it on the example dataset ex_rec and check if the item was recoded correctly.

ex_rec$item1_r <- rec_items(data = ex_rec,
                            item = "item1")

with(ex_rec,
     table(item1, item1_r,
           useNA = "ifany"))
      item1_r
item1  1 2 3 4 <NA>
  1    0 0 0 1    0
  2    0 0 2 0    0
  3    0 1 0 0    0
  4    1 0 0 0    0
  <NA> 0 0 0 0    1
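
If the observed minimum and maximum cannot be trusted, one possible alternative is to pass the theoretical scale endpoints explicitly. The following is a sketch under that assumption; the function name rec_items_range and its arguments min_val and max_val are hypothetical and not part of the function above.

```r
# Hypothetical variant: scale endpoints are supplied by the user
rec_items_range <- function(x, min_val, max_val) {
  stopifnot(is.numeric(x), min_val < max_val)
  (max_val + min_val) - x
}

# Category 4 is never observed here, yet 1 is still recoded to 4
rec_items_range(c(1, 2, 2, NA, 3), min_val = 1, max_val = 4)
# 4 3 3 NA 2
```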

3.4.2 Loops

3.4.2.1 What is a loop in R?

Definition: Loop

Looping is the repeated evaluation of a statement or block of statements. Base R provides functions for explicit (i.e., for, while, repeat) and implicit looping (e.g., apply, sapply, lapply,…).

There are also other packages (e.g., parallel, purrr, furrr) that offer more advanced or parallelized looping capabilities, providing more efficient and convenient ways to iterate over data, particularly for complex workflows or large datasets.
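
To make the distinction between explicit and implicit looping concrete, the following sketch computes the same squares once with a for loop and once with sapply(). The object names are illustrative.

```r
values <- c(2, 4, 6)

# Explicit loop: pre-allocate the result, then fill it step by step
squares_for <- numeric(length(values))
for (i in seq_along(values)) {
  squares_for[i] <- values[i]^2
}

# Implicit loop: sapply() applies the function to each element
squares_sapply <- sapply(values, function(v) v^2)

identical(squares_for, squares_sapply)  # TRUE
```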

3.4.2.2 lapply

We begin with the lapply function because it is a little more beginner-friendly than for loops. The lapply function applies a(nother) function over a list or vector. It needs two arguments as inputs:

  • X: a vector (atomic or list)
  • FUN: the function to be applied to each element of X

lapply returns a list of the same length as the input X (see ?lapply).

Two (nearly equivalent) examples with lapply are shown in the following. The first example is shorter, but the second example is often preferred because it offers the option to further customize the operations within the applied function (e.g., calculating the square of x).

printList1 <- lapply(X = 1:3,
                     FUN = print) 
[1] 1
[1] 2
[1] 3
printList2 <- lapply(X = 1:3,
                     FUN = function(x) {   # (1)
                       ret <- x^2 |>       # (2)
                         print()
                       return(ret)
                     })
[1] 1
[1] 4
[1] 9

(1) Defines a so-called anonymous function that takes an argument x.
(2) This approach offers further customization of operations such as calculating the square of x (for more see Functions in R).

Note: Use your own function with the apply family.

If the function() to be applied becomes more complex, it might be reasonable to define it first and then apply it with lapply (or other apply functions).
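
A minimal sketch of this pattern: the helper is defined once under a name and then passed to lapply. The function name square_value is hypothetical.

```r
# Hypothetical helper, defined once and reused
square_value <- function(x) {
  x^2
}

squares <- lapply(X = 1:3, FUN = square_value)
unlist(squares)  # 1 4 9
```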

3.4.2.3 for loops

work in progress…

3.4.2.4 When not to loop?

The typical answer is: whenever vectorization is possible. To demonstrate the difference between a loop and vectorization, we will use the following (“stupid”) example. We create a large vector (10 million random values) …

set.seed(999)
x <- rnorm(1e7)  

… and multiply it by 2 using first a loop and then vectorization.

Loop approach

system.time({

  y <- numeric(length(x))

  for (i in seq_along(x)) {
    y[i] <- x[i] * 2
  }

})
   user  system elapsed 
  0.243   0.007   0.250 

Vectorized approach

system.time({

  y <- x * 2

})
   user  system elapsed 
  0.003   0.003   0.005 

The next question is how we can know that vectorization is possible. Vectorization is possible whenever an operation is defined element-wise or column-wise on an entire object (e.g., vector, matrix, data frame). Examples include:

  • Arithmetic operations (e.g., +, -, …)
  • Logical comparisons (e.g., ==, >, …)
  • Mathematical functions (e.g., sqrt(), log(), …)

If an operation requires sequential dependence (e.g., each iteration depends on a previous result), then vectorization is usually not possible, and a loop is required.
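
As an illustration of sequential dependence, consider a sketch in which each value is half of the previous value plus a new input. The object names shocks and state and the factor 0.5 are arbitrary assumptions made for this example.

```r
shocks <- c(1, 0, 0, 2)

state <- numeric(length(shocks))
state[1] <- shocks[1]

# Each step needs the result of the previous step
for (t in 2:length(shocks)) {
  state[t] <- 0.5 * state[t - 1] + shocks[t]
}

state  # 1.000 0.500 0.250 2.125
```

Because state[t] cannot be computed before state[t - 1] is known, a simple element-wise expression such as shocks * 0.5 does not reproduce this result.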

3.5 Some Questions

Note: This is work in progress…

… and needs to be completed.

Is the R language a dialect of the S programming language?

Is R case-sensitive?

What is NOT considered a data type in R?


  1. “In mathematics, a complex number is an element of a number system that extends the real numbers with a specific element denoted i, called the imaginary unit and satisfying the equation \(i^2 = -1\); every complex number can be expressed in the form \(a+bi\), where a and b are real numbers.” (wikipedia)

  2. In addition, factors are a special data type used to represent categorical variables.

  3. For the initial differences between |> and %>%, see https://ivelasq.rbind.io/blog/understanding-the-r-pipe/