3 Introduction to R

Last updated on April 10, 2025

Abstract

This chapter provides a short introduction to the statistical programming language R, detached from the concept of reproducibility. In addition, it touches on the R ecosystem and the differences between base R and the tidyverse package collection. Finally, it provides an introduction to the concept of functions and loops in R.

3.1 What is R?

R is a free software environment for statistical computing and graphics. It compiles and runs on a wide variety of UNIX platforms, Windows and MacOS. (r-project.org)

Install R from https://cran.rstudio.com/

3.2 Short introduction to the language

To understand computations in R, two slogans are helpful:

Everything that exists is an object.

Everything that happens is a function call.

– John Chambers (creator of the S programming language)

For comprehensive introductions, see:

R Manual on the CRAN website
The R Manuals: A re-styled version of the original R manuals
R for Data Science by Hadley Wickham and Garrett Grolemund
Hands-On Programming with R by Garrett Grolemund

3.2.1 Some Basics

Before looking at the data types and structures, there are a few basics you need to know.

R has a help system to get help on functions and packages

help("mean")
# or equivalently
?mean

R is a calculator

sqrt(25) + 2^2

[1] 9

R is case-sensitive

"name" == "Name"

[1] FALSE

Values can be assigned to objects using <-

a <- 2
b <- 4
a + b

[1] 6

Arguments in functions are assigned using =

df <- data.frame(
  x = 1:4,
  y = 3:6
)

3.2.2 Data Types

The basic data types in R are depicted in Table 3.1.

Table 3.1: Basic data types in R

Type	Description	Value (example)
Numeric	Numbers with a decimal value or fraction	`3.7`
Integer	Counting numbers and their additive inverses	`2L`, `-115L`
Character	Letters enclosed by quotes in the output (aka string)	`"Hello World!"`, `"4"`
Logical	Boolean values	`TRUE`, `FALSE`
Factor	Categorical data. Level: characteristic value as seen by R. Label: designation of the characteristic attributes	Levels: `0`, `1`; labels: `male`, `female`
Complex¹	Numbers with a real and an imaginary part	`2 + 3i`
Special	Missing values: unknown cell value. Impossible values: not a number. Empty values: known empty cell value	`NA`, `NaN`, `NULL`

You can check the class of an object with the class() function.

class(a)

[1] "numeric"

class(df)

[1] "data.frame"
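As a small sketch, class() can be applied to one example value per basic data type from Table 3.1. The object names examples and types are chosen for illustration only. Note that a whole number typed without the L suffix is stored as numeric, which is why class(a) returned "numeric" above.

```r
# One example value per basic data type
examples <- list(3.7, 2L, "Hello World!", TRUE,
                 factor(c("male", "female")), 2 + 3i)

# Apply class() to each value and collect the results in a character vector
types <- vapply(examples, class, character(1))
types
# "numeric" "integer" "character" "logical" "factor" "complex"

# Special values have their own test functions
is.na(NA)       # TRUE: missing value
is.nan(0 / 0)   # TRUE: 0 / 0 is not a number
is.null(NULL)   # TRUE: NULL is the empty object
```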

3.2.3 Data Structures

R has five² fundamental data structures: vectors, matrices, arrays, lists, and data frames.

A vector is…

a one-dimensional array and
the elements are of the same data type (here: numeric)

vec <- c(45, 6, -83)
vec

[1]  45   6 -83
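The remaining four data structures can be sketched in the same way. The object names (mat, arr, lst, dat) and their contents are made up for illustration.

```r
# Matrix: two-dimensional, all elements share one data type
mat <- matrix(1:6, nrow = 2, ncol = 3)

# Array: like a matrix, but with any number of dimensions
arr <- array(1:24, dim = c(2, 3, 4))

# List: elements may differ in type and length
lst <- list(id = 1L, name = "Anna", scores = c(2.5, 3.0))

# Data frame: a list of equally long columns, one column per variable
dat <- data.frame(x = 1:4, y = 3:6)

dim(mat)     # 2 3
dim(arr)     # 2 3 4
length(lst)  # 3
dim(dat)     # 4 2

# Mixing types in a vector coerces all elements to one common type
class(c(1, "a", TRUE))  # "character"
```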

3.3 Base R, Additional Packages and tidyverse

R has a large ecosystem of libraries (often called packages) that extend the functionality of base R. Base R itself consists of a set of packages that are loaded by default when you start an R session. These can be checked with the sessionInfo() function.

https://cran.r-project.org/

sessionInfo()$basePkgs

[1] "stats"     "graphics"  "grDevices" "datasets"  "utils"     "methods"  
[7] "base"

On the Comprehensive R Archive Network (CRAN), there are innumerable packages available that can be installed and loaded to add specific functionalities such as enhanced data manipulation, specific data analyses (e.g., linear mixed models, structural equation modeling), or data visualization to R. The following example demonstrates how to install a package (i.e., ggplot2 for data visualization) from CRAN, and load it into an R session.

install.packages("ggplot2") 
library(ggplot2)

Modern package manager for R

You might also want to consider using the pak package (Csárdi & Hester, 2025) for installing packages. pak is a modern package manager that offers a faster, more efficient (e.g., regarding dependency management), and more user-friendly (e.g., improved error handling) way to install and manage R packages than the install.packages() function.
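One possible way to combine both installers is sketched below. The helper install_if_missing() is hypothetical (it is not part of pak or base R); it only assumes that pak::pkg_install() accepts a character vector of package names and falls back to install.packages() when pak is not available.

```r
# Hypothetical helper: install only the packages that are not yet available
install_if_missing <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  missing_pkgs <- pkgs[!installed]

  if (length(missing_pkgs) > 0) {
    if (requireNamespace("pak", quietly = TRUE)) {
      pak::pkg_install(missing_pkgs)   # install via pak
    } else {
      install.packages(missing_pkgs)   # fall back to the base installer
    }
  }

  # Return (invisibly) the names of the packages that had to be installed
  invisible(unname(missing_pkgs))
}

# Packages shipped with R are already available, so nothing is installed here
install_if_missing(c("stats", "utils"))
```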

For educational research, you may be interested in the following list of packages that are well established and can be used without concerns regarding their reliability and maintenance:

dplyr (Wickham, François, et al., 2026) for data manipulation (see also below)
psych (Revelle, 2026) for psychometric analyses
survey (Lumley et al., 2025) for complex survey analysis
lme4 (Bates et al., 2025) for linear mixed-effects models
lavaan (Rosseel et al., 2025) for structural equation modeling
ggplot2 (Wickham, Chang, et al., 2026) for data visualization
…

Finally, there is the so-called tidyverse (Wickham, 2023) within R. The tidyverse is a collection of R packages (see Figure 3.1) that “share an underlying design philosophy, grammar, and data structures” and are (specifically) designed for data science.

See https://www.tidyverse.org/

Figure 3.1: tidyverse package collection

Within the tidyverse package collection, the dplyr package (Wickham, François, et al., 2026) provides a set of convenient functions for manipulating data. Together with the pipe operator %>% from the magrittr package (Bache & Wickham, 2025), it offers a powerful way to manipulate data clearly and comprehensibly. The native³ R pipe |> was introduced with R 4.1.0; with the _ placeholder added in R 4.2.0 and extended in R 4.3.0, the base pipe covers most everyday uses of the magrittr pipe.
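A minimal base R sketch of the native pipe, assuming R 4.2.0 or later for the placeholder: x |> f() is just another way of writing f(x). The object names are illustrative, and mtcars is an example dataset shipped with R.

```r
x <- c(1, 4, 9)

nested <- sum(sqrt(x))            # read from the inside out
piped  <- x |> sqrt() |> sum()    # read from left to right

identical(nested, piped)  # TRUE

# The placeholder _ passes the left-hand side to a named argument
fit <- mtcars |> lm(mpg ~ wt, data = _)
class(fit)  # "lm"
```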

What does the pipe operator |> do?

The tidyverse style guide suggests using the pipe operator “to emphasize a sequence of actions”. The pipe operator can be understood as “take the object and then” pass it to the next function. To illustrate the idea behind the pipe operator, we will use the following example dataset (ex_dat) which contains two numeric variables (X and Y).

ex_dat <- data.frame(
  X = rnorm(100, mean = 2, sd = 1),
  Y = rnorm(100, mean = 3, sd = 1) 
)

In the following, the use of the base R pipe operator is demonstrated through a simple table creation example.

ex_dat |>                          # 1
  dplyr::select(X, Y) |>           # 2
  psych::describe(fast = TRUE) |>  # 3
  knitr::kable(digits = 2)         # 4

1: Take the data frame ex_dat and then
2: Select the variables: X and Y and then
3: Calculate descriptive statistics using the describe function from the psych package (Revelle, 2026) and then
4: Create a table with the kable function from the knitr package (Xie, 2025)

	vars	n	mean	sd	median	min	max	range	skew	kurtosis	se
X	1	100	2	0.99	2	-0.89	4.19	5.08	-0.23	-0.02	0.10
Y	2	100	3	1.05	3	0.63	5.52	4.89	0.04	-0.40	0.11

If you are convinced by the pipe approach, you can stop reading here and jump to the next section More about R Programming. If not, let us compare the pipe approach with two alternative ways to write the same code:

a nested approach
a sequential approach

The nested approach looks like this:

knitr::kable(psych::describe(ex_dat[, c("X", "Y")], fast=TRUE), digits = 2)

…or, formatted:

knitr::kable(
  psych::describe(ex_dat[, c("X", "Y")],
                  fast=TRUE),
  digits = 2)

Although the second version slightly improves clarity through formatting, both versions require the reader to parse the code from the inside out. This is less readable and less intuitive than the pipe approach.

The sequential approach looks like this:

ex_dat_sub <- ex_dat[, c("X", "Y")]
ex_descr <- psych::describe(ex_dat_sub, 
                            fast=TRUE)

knitr::kable(ex_descr, digits = 2)

This approach is readable and intuitive because it explicitly separates all processing steps. However, creating many so-called intermediate objects (i.e., ex_dat_sub, ex_descr) clutters the workspace/environment, which becomes messy and difficult to comprehend as the project grows. To avoid this, you should consider putting such code sequences into self-written (wrapper) functions (see the next section What is a function in R?).
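Such a wrapper could look like the following sketch. To keep it runnable without additional packages, it computes a few descriptive statistics with base R instead of psych::describe(); the function name describe_vars() and its arguments are hypothetical.

```r
# Hypothetical wrapper: the intermediate objects live only inside the function
describe_vars <- function(data, vars) {
  stopifnot(is.data.frame(data), all(vars %in% colnames(data)))

  sub <- data[, vars, drop = FALSE]

  data.frame(
    n    = vapply(sub, function(v) sum(!is.na(v)), integer(1)),
    mean = vapply(sub, mean, numeric(1), na.rm = TRUE),
    sd   = vapply(sub, sd, numeric(1), na.rm = TRUE),
    row.names = vars
  )
}

set.seed(123)
ex_dat <- data.frame(
  X = rnorm(100, mean = 2, sd = 1),
  Y = rnorm(100, mean = 3, sd = 1)
)

# Only the result appears in the workspace, not the subset
ex_descr <- describe_vars(ex_dat, vars = c("X", "Y"))
ex_descr
```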

3.4 More about R Programming

When you are familiar with the basics of R, you might want to learn more about two fundamental programming concepts (in R):

functions and
loops.

The combination thereof allows you to automate repetitive tasks (e.g., calculating scale scores, running different analyses) which makes your code more efficient, easier to maintain, and enhances reproducibility. This idea is also highlighted in R for Data Science (2e), Chapter 25:

[…] Functions allow you to automate common tasks in a more powerful and general way than copy-and-pasting. Writing a function has four big advantages over using copy-and-paste:

You can give a function an evocative name that makes your code easier to understand.

As requirements change, you only need to update code in one place, instead of many.

You eliminate the chance of making incidental mistakes when you copy and paste (i.e. updating a variable name in one place, but not in another).

It makes it easier to reuse work from project-to-project, increasing your productivity over time.

3.4.1 Functions

3.4.1.1 What is a function in R?

Definition: Function

A function is an object that may contain multiple related operations, statements and functions. These are executed in a predefined order.

https://cran.r-project.org/doc/manuals/r-release/R-lang.html#Function-objects

In R, functions can be created using the function() statement and they consist of 3 parts:

name of the function
arguments and parameters: may vary across calls of the function
body that contains the code which is executed across function calls

name <- function( arguments ) {

  body 

}
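As a concrete instance of this skeleton, consider the following sketch. The function scale_score() is a made-up example: its name is scale_score, its arguments are x and na.rm (with a default value), and its body is the code between the curly brackets.

```r
# name <- function(arguments) { body }
scale_score <- function(x, na.rm = TRUE) {
  mean(x, na.rm = na.rm)
}

# Arguments may vary across calls of the function
scale_score(c(1, 2, NA, 3))                 # 2
scale_score(c(1, 2, NA, 3), na.rm = FALSE)  # NA
```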

3.4.1.2 How to write a function?

To demonstrate how to write a function, we will use the following example. A common (repetitive) task is to recode item indicators. Consider two variables which need to be recoded (i.e., from 1 → 4, 2 → 3, 3 → 2, 4 → 1):

ex_rec <- data.frame(
  id = 1:6,
  item1 = c(1, 2, 2, 4, NA, 3),
  item2 = c(4, 3, NA, 1, 4, 1)
)

First, provide a concise and meaningful name (here: rec_items). After the function has been defined, the name of the function object appears in the R environment.

rec_items <- function( inputs ) {

  # body

}

Next, we need to define the inputs (also known as arguments or parameters) of the function. These inputs provide the necessary data and information and concurrently define how the function operates. The inputs are written within regular parentheses (...).

To recode an item, we need the following inputs:

Dataset: A data frame containing the items.
Item: A character input specifying the name of the item.

rec_items <- function( data,
                       item ) {

  # Check if data is a data frame
  stopifnot(is.data.frame(data))
  
  # Check if item exists in the dataset
  if (!item %in% colnames(data)) {
    stop(sprintf("Item '%s' not found in dataset.", item), call. = FALSE)
  }

 # body goes here

}

Optional: A function profits from input validation. This means we can include checks and error messages within the function body (e.g., check if the dataset is a data frame, check if the item exists in the dataset, etc.).

Third, provide the actual code in the body of the function. This code is written inside the curly brackets { }. Do not forget to return the results.

In this recoding approach, we subtract the item from the sum of its observed maximum and minimum. Note that this approach is not very robust and does not generalize to other recoding strategies: it returns wrong values whenever the lowest or highest response category is not observed (e.g., in small samples), because the observed range then differs from the range of the response scale.

rec_items <- function( data,
                       item ) {

  # Check if data is a data frame
  stopifnot(is.data.frame(data))
  
  # Check if item exists in the dataset
  if (!item %in% colnames(data)) {
    stop(sprintf("Item '%s' not found in dataset.", item), call. = FALSE)
  }

  x <- data[[item]]
  
  if (all(is.na(x))) {
    stop(sprintf("Item '%s' contains only NA values.", item), call. = FALSE)
  }

  max_x <- max(x, na.rm = TRUE)
  min_x <- min(x, na.rm = TRUE)

  ret <- (max_x + min_x) - x
  ret # or return(ret)

}

Lastly, we test the function by executing it on the example dataset ex_rec and check if the item was recoded correctly.

ex_rec$item1_r <- rec_items(data = ex_rec,
                            item = "item1")

with(ex_rec,
     table(item1, item1_r,
           useNA = "ifany"))

      item1_r
item1  1 2 3 4 <NA>
  1    0 0 0 1    0
  2    0 0 2 0    0
  3    0 1 0 0    0
  4    1 0 0 0    0
  <NA> 0 0 0 0    1
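The limitation of the data-driven recoding can be made visible with an item on which the lowest category was never chosen. The following sketch also shows one possible alternative in which the scale range is passed explicitly; the function rec_items_fixed() and its arguments min_val and max_val are hypothetical and not part of the example above.

```r
# Item on a 1 to 4 scale, but category 1 was never chosen
x <- c(2, 3, 4, 4, NA, 3)

# Data-driven recoding uses the observed range (2 to 4): 4 becomes 2, not 1
data_driven <- (max(x, na.rm = TRUE) + min(x, na.rm = TRUE)) - x
data_driven
# 4 3 2 2 NA 3

# Hypothetical alternative: state the scale range explicitly
rec_items_fixed <- function(x, min_val, max_val) {
  stopifnot(is.numeric(x), min_val < max_val)

  out_of_range <- !is.na(x) & (x < min_val | x > max_val)
  if (any(out_of_range)) {
    stop("Values outside the stated scale range.", call. = FALSE)
  }

  (min_val + max_val) - x
}

fixed <- rec_items_fixed(x, min_val = 1, max_val = 4)
fixed
# 3 2 1 1 NA 2
```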

3.4.2 Loops

3.4.2.1 What is a loop in R?

Definition: Loop

Looping is the repeated evaluation of a statement or block of statements. Base R provides functions for explicit (i.e., for, while, repeat) and implicit looping (e.g., apply, sapply, lapply,…).

There are also other packages (e.g., parallel, purrr, furrr) that offer more advanced or parallelized looping capabilities, providing more efficient and convenient ways to iterate over data, particularly for complex workflows or large datasets.

https://cran.r-project.org/doc/manuals/r-release/R-lang.html#Looping
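To make the distinction concrete, the following sketch computes the squares of 1 to 3 with the three explicit loop constructs and with one implicit loop. The object names are illustrative.

```r
# Explicit: for iterates over a known sequence
squares_for <- numeric(3)
for (i in 1:3) {
  squares_for[i] <- i^2
}

# Explicit: while repeats as long as a condition holds
squares_while <- numeric(0)
i <- 1
while (i <= 3) {
  squares_while <- c(squares_while, i^2)
  i <- i + 1
}

# Explicit: repeat runs until break is called
squares_repeat <- numeric(0)
i <- 1
repeat {
  if (i > 3) break
  squares_repeat <- c(squares_repeat, i^2)
  i <- i + 1
}

# Implicit: sapply hides the iteration inside a function call
squares_sapply <- sapply(1:3, function(i) i^2)

squares_sapply
# 1 4 9
```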

3.4.2.2 lapply

We begin with the lapply function because it is a little more beginner-friendly than for loops. The lapply function applies a(nother) function over a list or vector. It needs two arguments as inputs:

X: a vector (atomic or list)
FUN: the function to be applied to each element of X

lapply returns a list of the same length as the input X (see ?lapply).

Two (nearly equivalent) examples with lapply are shown in the following. The first example is shorter, but the second example is often preferred because it offers the option to further customize the operations within the applied function (e.g., calculating the square of x).

printList1 <- lapply(X = 1:3,
                     FUN = print)

[1] 1
[1] 2
[1] 3

printList2 <- lapply(X = 1:3,
                     FUN = function(x) {  # 1
                       ret <- x^2 |>      # 2
                         print()
                       return(ret)
                     }
)

1: Defines a so-called anonymous function that takes an argument x.
2: This approach offers further customization of operations such as calculating the square of x (for more see Functions in R).

[1] 1
[1] 4
[1] 9

Use your own function with the apply family.

If the function() to be applied becomes more complex, it might be reasonable to define it first and then apply it with lapply (or other apply functions).
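A minimal sketch of this pattern: the function center() and the data frame items are made up for illustration. Because a data frame is a list of columns, lapply applies the function to each column in turn.

```r
# Define the function first ...
center <- function(x) {
  x - mean(x, na.rm = TRUE)
}

items <- data.frame(
  item1 = c(1, 2, 3),
  item2 = c(2, 4, 6)
)

# ... and then apply it to every column
centered <- lapply(X = items, FUN = center)

centered$item2
# -2 0 2
```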

3.4.2.3 for loops

work in progress…

3.4.2.4 When not to loop?

The typical answer is: whenever vectorization is possible. To demonstrate the difference between a loop and vectorization, we will use the following (“stupid”) example. We create a large vector (10 million random values) …

set.seed(999)
x <- rnorm(1e7)

… and multiply it by 2 using first a loop and then vectorization.

Loop approach

system.time({

  y <- numeric(length(x))

  for (i in seq_along(x)) {
    y[i] <- x[i] * 2
  }

})

   user  system elapsed 
  0.248   0.006   0.254

Vectorized approach

system.time({

  y <- x * 2

})

   user  system elapsed 
  0.003   0.003   0.005

For more, see https://stackoverflow.com/questions/58568392/how-do-i-know-a-function-or-an-operation-in-r-is-vectorized and https://www.noamross.net/archives/2014-04-16-vectorization-in-r-why/

The next question is how we can know whether vectorization is possible. Vectorization is possible whenever an operation is defined element-wise or column-wise on an entire object (e.g., vector, matrix, data frame). Examples include:

Arithmetic operations (e.g., +, -, …)
Logical comparisons (e.g., ==, >, …)
Mathematical functions (e.g., sqrt(), log(), …)

If an operation requires sequential dependence (e.g., each iteration depends on a previous result), then vectorization is usually not possible, and a loop is required.
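The contrast can be sketched with a small made-up example: an element-wise operation that is vectorized, and a recursion in which each value depends on the previous one and is therefore written as a loop. The coefficient 0.5 and the object names are arbitrary choices for illustration.

```r
set.seed(1)
e <- rnorm(5)

# Element-wise: every result depends only on its own input, so no loop is needed
doubled <- e * 2

# Sequential dependence: y[i] needs y[i - 1], so the values are computed in order
y <- numeric(length(e))
y[1] <- e[1]
for (i in 2:length(e)) {
  y[i] <- 0.5 * y[i - 1] + e[i]
}

# Some common sequential computations have vectorized helpers, e.g. cumsum()
running_total <- cumsum(e)
```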

3.5 Some Questions

This is work in progress…

… and needs to be completed.

Is the R language a dialect of the S programming language?

Is R case-sensitive?

What is NOT considered a data type in R?

¹ “In mathematics, a complex number is an element of a number system that extends the real numbers with a specific element denoted i, called the imaginary unit and satisfying the equation \(i^2 = -1\); every complex number can be expressed in the form \(a+bi\), where a and b are real numbers.” (wikipedia)
² In addition, factors are a special data type used to represent categorical variables.
³ For the initial differences between |> and %>%, see https://ivelasq.rbind.io/blog/understanding-the-r-pipe/