This part gives a brief introduction to implementing a project-oriented workflow for any data-related project in R.
Definition
A project-oriented workflow begins by creating a folder (working directory) to store all files related to a specific project.
This approach provides several benefits, such as:
improved organization of project-related files (e.g., data, scripts, results, and documentation)
use of relative instead of absolute file paths (no need for setwd(), which makes the project portable to other systems)
easier version control (e.g., with Git)
easier package and environment management (e.g., with the renv package for R)
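To make the point about relative paths concrete, here is a minimal sketch. The file and folder names are hypothetical and only serve as an illustration; nothing is read from disk.

```r
# Absolute path: only valid on one particular machine (hypothetical example)
absolute_path <- "C:/Users/alice/projects/ProjectName/data/raw/rawData-1.csv"

# Relative path: resolved against the project folder, so it works wherever
# the project folder is copied to
relative_path <- file.path("data", "raw", "rawData-1.csv")

# With the project folder as working directory, the file could be read like this:
# myData <- read.csv(relative_path)
```

Because the relative path never mentions the user's home directory or drive letter, a collaborator can run the same script unchanged.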
Implementation
The easiest way to implement a project-oriented workflow is to use an Integrated Development Environment (IDE) such as RStudio or Positron. These IDEs offer built-in features that help with creating and managing projects efficiently.
Exercise (5min)
Create a working directory (i.e., project folder) through your IDE.
Create a project: File > New Project... and select project type.
Choose a location (typically a folder called “projects”) and create a new directory within it.
Select R version.
In general, it is recommended to use a version control system (e.g., Git) and a package for creating reproducible environments (e.g., the renv package for R; Ushey & Wickham, 2024); see below.
The structure of a project folder should look something like the following example. Different components of a project (i.e., data, code, and output files) should be stored in separate directories.
ProjectName/
├── data/
│   ├── raw/                        # Original datasets (read-only)
│   │   ├── rawData-1.csv
│   │   ├── rawData-2.csv
│   │   └── ...
│   ├── processed/                  # Cleaned, processed and final datasets
│   │   ├── 01_dataCleaning.csv
│   │   ├── 02_dataTransformation.csv
│   │   ├── ...
│   │   └── dataToShare.csv
├── code/
│   ├── src/                        # reusable (custom) functions, helper utilities
│   │   ├── _functions.r
│   │   └── ...
│   ├── scripts/                    # scripts for data processing and core analysis
│   │   ├── 01_dataCleaning.qmd
│   │   ├── 02_dataTransformation.qmd
│   │   ├── 03_analysis.qmd
│   │   └── ...
├── output/                         # results
│   ├── figures/
│   │   ├── histogram.png
│   │   ├── resultPlot.png
│   │   └── ...
│   ├── tables/
│   │   ├── summaryTable.csv
│   │   └── ...
├── report.qmd                      # document that combines everything
├── report.pdf                      # aka rendered report.qmd
├── images/                         # images that need to be included
├── README.md                       # provides a project overview
├── .gitignore                      # useful when using Git
├── _quarto.yml                     # Quarto Projects only
├── .Rprofile / renv.lock           # information about environment
├── codebook.md
└── ...
To enhance the reproducibility of the project, it is essential to provide clear and comprehensive documentation. Adding a README.md file at the root of your working directory helps others understand the structure and usage of your project.
Create the (sub)directories (programmatically)
First, create a file (File > New File > R Script) and name it, for example, createDirs.R. Next, to efficiently create all (sub)directories, you need to define a character vector that contains all the directories (paths).
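A minimal sketch of such a vector could look as follows. The folder names are assumptions modeled on the example structure shown earlier (the "docs" entry is purely hypothetical); adapt them to your own project.

```r
# Character vector with all (sub)directories to create (illustrative names)
myFolders <- c(
  file.path("data", "raw"),
  file.path("data", "processed"),
  file.path("code", "src"),
  file.path("code", "scripts"),
  file.path("output", "figures"),
  file.path("output", "tables"),
  "images",  # top-level directory without subdirectories
  "docs"     # hypothetical second top-level directory
)

# Inspect the resulting paths
print(myFolders)
```

Listing only the deepest folders is enough here, because parent directories such as data/ can be created along the way.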
The c() function combines values into a vector (here: a character vector).
The file.path() function constructs platform-independent file paths.
No subdirectories in these two directories.
These directories will be created in the current working directory!
You may want to check the working directory using the getwd() function. If you created an R Project, this is most likely not an issue, because the working directory is set to the project folder.
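A quick check along these lines might look like the sketch below. Looking for an .Rproj file is only a heuristic and assumes an RStudio project; other IDEs may not create such a file.

```r
# Show where relative paths (and new directories) will be resolved
wd <- getwd()
message("Current working directory: ", wd)

# Heuristic, assuming an RStudio project: look for an .Rproj file in this folder
has_rproj <- length(list.files(wd, pattern = "\\.Rproj$")) > 0
if (!has_rproj) {
  message("No .Rproj file found here; check that this is the project folder.")
}
```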
Finally, use a for loop to iterate over the paths and call the dir.create() function for each one. The loop also checks whether each directory already exists to prevent redundant operations. The recursive = TRUE argument ensures that all necessary parent directories are created if they do not exist.
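# Create each directory in myFolders unless it already exists
for (x in myFolders) {
  if (!dir.exists(x)) {
    # recursive = TRUE also creates missing parent directories
    success <- dir.create(x, recursive = TRUE)
    if (!success) {
      warning("Failed to create directory: '", x, "'")
    }
  } else {
    message("Directory '", x, "' already exists.")
  }
}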
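To try the loop without touching a real project, one option is to wrap it in a function and point it at a temporary folder. The helper name create_dirs() and its root argument are hypothetical and not part of the original script.

```r
# Hypothetical helper wrapping the loop; `root` is the folder to create paths in
create_dirs <- function(paths, root = ".") {
  created <- logical(length(paths))
  for (i in seq_along(paths)) {
    target <- file.path(root, paths[i])
    if (!dir.exists(target)) {
      # recursive = TRUE also creates missing parent directories (incl. root)
      created[i] <- dir.create(target, recursive = TRUE)
      if (!created[i]) {
        warning("Failed to create directory: '", target, "'")
      }
    } else {
      message("Directory '", target, "' already exists.")
    }
  }
  invisible(created)  # TRUE for each directory created by this call
}

# Dry run in a temporary sandbox folder
sandbox <- tempfile("demoProject")
demo_paths <- c(file.path("data", "raw"), "output")
first_run  <- create_dirs(demo_paths, root = sandbox)  # creates both folders
second_run <- create_dirs(demo_paths, root = sandbox)  # only prints messages
```

Running the function twice shows the effect of the dir.exists() check: the second call creates nothing and reports that the directories already exist.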
Exercise (10min)
Decide on the directory structure you want to create.
Adjust the myFolders vector accordingly.
Use the loop to create the (sub)directories (just copy it).
Repository (Repo): The place where your project lives. It contains all the files and the entire revision history.
Commit: Making a commit is making a snapshot of your repository at a specific time point. Each commit records the current state of your project and has a unique identifier.
Branch: A branch is a separate line of project development (e.g., to try out new ideas in an isolated area). The ‘main’ (previously ‘master’) branch is usually considered the definitive branch.
Merge: Merging means incorporating changes from a different branch into the main branch.
Pull Request: When collaborating, you make changes in your branch and then ask others to review and merge them. This request is called a pull request.
Clone: Making a local copy of a remote repository.
Fork: Making your own copy of somebody else's project, so that you can change it without affecting the original project.
While installing Git on Windows is straightforward (just run git-current-version.exe), on macOS it requires an additional step of installing a package manager (here: Homebrew), before proceeding with the Git installation.
Git installation for macOS only.
Copy and paste the following command into a macOS terminal and follow the steps.
After installation, you might want to check the installed version of Git. Copy and paste the following command into the terminal.
git --version
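The same check can also be run from within R. The sketch below uses base R only and assumes that Git, if installed, is available on the PATH.

```r
# Look up the Git executable on the PATH; Sys.which() returns "" if none is found
git_path <- unname(Sys.which("git"))

if (nzchar(git_path)) {
  # Ask Git for its version, equivalent to running `git --version` in a terminal
  git_version <- system2(git_path, "--version", stdout = TRUE)
  message("Found Git at ", git_path, ": ", git_version)
} else {
  message("Git was not found on the PATH.")
}
```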
Combine it with GitHub
GitHub provides a home for Git-based projects and allows other people to see the project.
Creating a reproducible environment
In R, the renv package (Ushey & Wickham, 2024) is designed to create a reproducible environment.
How does it work? When initiating a project with the renv package, it…
creates a separate library (instead of having one library containing the packages used in all projects)
creates a lockfile (i.e., renv.lock) that records metadata about all packages
creates a .Rprofile file that is automatically run every time you start the project
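A typical renv workflow can be sketched as follows. The wrapper function names are hypothetical; they only exist so that nothing is executed when the script is sourced. The renv calls inside them (init(), snapshot(), restore()) assume that the renv package is installed.

```r
# Hypothetical wrappers around the main renv steps; nothing runs until called

# Step 1: set up the project library, renv.lock, and .Rprofile
start_renv <- function(project = getwd()) {
  if (!requireNamespace("renv", quietly = TRUE)) {
    stop("Please install the renv package first: install.packages('renv')")
  }
  renv::init(project = project)
}

# Step 2: after installing or updating packages, record them in renv.lock
record_packages <- function(project = getwd()) {
  renv::snapshot(project = project)
}

# Step 3: on another machine, reinstall the versions recorded in renv.lock
rebuild_library <- function(project = getwd()) {
  renv::restore(project = project)
}
```

In an interactive session, the same steps are usually typed directly in the console, for example renv::init() right after creating the project.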
But…no panacea for reproducibility
The renv package does not manage the R version, Pandoc (on which R Markdown and Quarto rely), the operating system, the versions of system libraries, or compiler versions.
In general, the .Rprofile file is a user-controllable file that enables the user to set default options (e.g., options(digits = 4)) and environment variables at either the user or the project level. The .Rprofile file is run automatically every time you start R or open a certain project.
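As an illustration, the lines below could be placed in a project-level .Rprofile. The values are examples only, and the environment variable PROJECT_DATA_DIR is a hypothetical name.

```r
# Example contents of a project-level .Rprofile (illustrative values only)
options(digits = 4)                     # print numbers with 4 significant digits
Sys.setenv(PROJECT_DATA_DIR = "data")   # hypothetical environment variable

# Effect of the settings above in the current session
print(pi)                               # [1] 3.142
data_dir <- Sys.getenv("PROJECT_DATA_DIR")
```

Settings like these change how results are printed, so it helps to keep them few and to mention them in the project documentation.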
In the context of the renv package, the .Rprofile file sources the activate.R script that was created by renv. Recall that this script runs every time you (or somebody else) open the project, and it sets up the project environment (e.g., the project-specific library).
Ensure that renv.lock, .Rprofile, renv/settings.json, and renv/activate.R are committed to version control. Without these files, the environment cannot be recreated.
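Before committing, a quick sanity check can confirm that these files exist in the project folder. The helper below is a hypothetical sketch; it only checks for the presence of the files on disk, not whether they are actually tracked by Git.

```r
# Files that should be under version control according to the text above
renv_files <- c(
  "renv.lock",
  ".Rprofile",
  file.path("renv", "settings.json"),
  file.path("renv", "activate.R")
)

# Hypothetical helper: return the renv files missing from a project folder
missing_renv_files <- function(project = ".") {
  renv_files[!file.exists(file.path(project, renv_files))]
}

missing <- missing_renv_files()
if (length(missing) > 0) {
  message("Missing renv files: ", paste(missing, collapse = ", "))
}
```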
From Wikipedia: JavaScript Object Notation (JSON) is an open standard file format and data interchange format that uses human-readable text to store and transmit data objects consisting of attribute–value pairs and arrays (or other serializable values).