Set up a project folder

Author

Sven Rieger

Last updated on

April 13, 2023

Abstract

This part gives a brief introduction to setting up a project folder for any data-related project.

Structure of project folder

The structure of a project folder should look something like the following example. Different components of a project (i.e., data, code, and output files) should be stored in separate directories. To enhance the reproducibility of the project, it is essential to provide clear and comprehensive documentation (e.g., using README files and Quarto scripts).

.md = Markdown, .qmd = Quarto Markdown, .csv = comma-separated values, .png = Portable Network Graphics, .pdf = Portable Document Format, .yml = YAML (YAML Ain't Markup Language, originally Yet Another Markup Language)

ProjectName/
├── data/
│   ├── raw/ # Original datasets (read-only)
│   │   ├── rawData-1.csv 
│   │   ├── rawData-2.csv 
│   │   └── ...
│   ├── processed/ # Cleaned, processed and final datasets
│   │   ├── 01_dataCleaning.csv
│   │   ├── 02_dataTransformation.csv
│   │   ├── ...
│   │   └── dataToShare.csv
├── code/
│   ├── src/ # reusable (custom) functions, helper utilities
│   │   ├── _functions.r
│   │   └── ...
│   ├── scripts/ # scripts for data processing and core analysis
│   │   ├── 01_dataCleaning.qmd
│   │   ├── 02_dataTransformation.qmd
│   │   ├── 03_analysis.qmd
│   │   └── ...
├── output/ # results
│   ├── figures/
│   │   ├── histogram.png
│   │   ├── resultPlot.png
│   │   └── ...
│   ├── tables/
│   │   ├── summaryTable.csv
│   │   └── ...
├── report.qmd # document that combines everything
├── report.pdf # aka rendered report.qmd  
├── images/ 
├── README.md # provides a project overview
├── .gitignore # useful when using Git 
├── _quarto.yml # Quarto Projects only
├── .Rprofile / renv.lock # information about environment
├── codebook.md
└── ...

Do not create the folders yet. We will first create the project folder, and then proceed to create the necessary subfolders programmatically.
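As a rough sketch of what "programmatically" could look like in R, the subfolders from the tree above can be created with base R's `dir.create()`. The `root` variable is an assumption made for this example: it points to a temporary folder so that the code can be tried safely, whereas inside an opened project it would usually be `"."`.

```r
# Sketch: create the subfolders from the tree above relative to a project root.
# `root` is a hypothetical location used for illustration only;
# inside an opened RStudio/Positron project it would usually be ".".
root <- file.path(tempdir(), "ProjectName")

subfolders <- c(
  "data/raw",
  "data/processed",
  "code/src",
  "code/scripts",
  "output/figures",
  "output/tables",
  "images"
)

for (folder in file.path(root, subfolders)) {
  # recursive = TRUE also creates missing parent directories;
  # showWarnings = FALSE keeps reruns quiet if a folder already exists
  dir.create(folder, recursive = TRUE, showWarnings = FALSE)
}

# Show the folders that now exist below the project root
list.dirs(root, full.names = FALSE, recursive = TRUE)
```

Because existing folders are left untouched, the loop can be rerun at any time without harm.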

Project-oriented Workflow

Projects…

  • help to organize the files (see above)

  • make file path referencing neat (no more setwd())

  • make version control & package management easier
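To illustrate the point about file paths: when a project is opened, the working directory is the project root, so files can be referenced relative to it. The sketch below assumes the folder layout and the file name `rawData-1.csv` from the example structure above; the object name `rawData` is made up for this example.

```r
# Assumption: the working directory is the project root, which is what
# RStudio and Positron set when a project is opened (no setwd() needed).
raw_path <- file.path("data", "raw", "rawData-1.csv")
raw_path

# Reading the file then works on any machine that has the project folder.
# The check only guards this sketch against the file not existing.
if (file.exists(raw_path)) {
  rawData <- read.csv(raw_path)
}
```

Building the path with `file.path()` instead of typing an absolute path keeps the script independent of where the project folder is stored.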

Create a project in RStudio: File > New Project...

Create a project in Positron: File > New Project...

For all (data-related) projects, it is highly recommended to use Git as a version control system and a package that creates reproducible environments (for R projects, for example, the renv package; Ushey & Wickham, 2024).

For more see R for Data Science by Hadley Wickham & Garrett Grolemund.

Version control: Git

What is version control and why should you use it?

Version control means tracking and recording changes to all kinds of files (within a project) over time.

  • Backup: It records the history of your project and allows for easy recovery of earlier versions.
  • Collaboration: It allows multiple people to work on the same project without overwriting each other’s work.
  • Understanding & Traceability: It helps to track why changes were made, who made them, and when.

Time machine analogy¹

Git is like the “Track Changes” feature of Microsoft Word on steroids (https://happygitwithr.com/big-picture)

Git basics

  1. Repository (Repo): The place where your project lives. It contains all the files and the entire revision history.
  2. Commit: Making a commit means taking a snapshot of your repository at a specific point in time. Each commit records the current state of your project and has a unique identifier.
  3. Branch: A branch is a separate line of project development (e.g., to try out new ideas in an isolated area). The ‘main’ (previously ‘master’) branch is usually considered the definitive branch.
  4. Merge: Merging means incorporating changes from a different branch into the main branch.
  5. Pull Request: When collaborating, you make changes in your branch and then ask others to review and merge them. This request is called a pull request.
  6. Clone: Making a local copy of a remote repository.
  7. Fork: Making your own copy of somebody else's project without affecting the original project.
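The first four terms above can be tried out in a throwaway folder. The following is only a sketch that calls the Git command line from R via `system2()`. It assumes that Git (version 2.28 or later, for `init -b`) is installed and on the PATH. The helper `git()`, the folder, the file and branch names, and the demo identity are all made up for this example.

```r
# Sketch only. Assumptions: Git (>= 2.28) is installed and on the PATH;
# folder, file and branch names and the identity below are hypothetical.
repo <- tempfile("gitDemo")
dir.create(repo)

# Helper: run a Git command inside `repo` without changing the working directory
git <- function(...) {
  system2("git", c("-C", shQuote(repo),
                   "-c", "user.name=Demo",
                   "-c", "user.email=demo@example.com",
                   ...))
}

if (nzchar(Sys.which("git"))) {
  git("init", "-b", "main")                     # Repository with a 'main' branch
  writeLines("# Demo", file.path(repo, "README.md"))
  git("add", "README.md")
  git("commit", "-m", shQuote("Add README"))    # Commit: a snapshot of the repo

  git("switch", "-c", "idea")                   # Branch: an isolated line of work
  writeLines("x <- 1", file.path(repo, "analysis.R"))
  git("add", "analysis.R")
  git("commit", "-m", shQuote("Try an idea"))

  git("switch", "main")
  git("merge", "idea")                          # Merge: bring 'idea' into 'main'
  git("log", "--oneline")                       # one line per commit
}
```

Pull requests and forks are features of hosting platforms such as GitHub and are therefore not part of this local sketch.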

Git in RStudio I

  1. Download & install Git: https://git-scm.com/downloads
  2. Go to Tools > Global Options
  3. Click Git/SVN
  4. Click Enable version control interface for RStudio projects
  5. If necessary, enter the path to your Git executable where provided.
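If the path in step 5 is unknown, base R can look it up. This is a small sketch that assumes Git is on the PATH; the object name `git_path` is made up for this example.

```r
# Look up the Git executable on the PATH.
# An empty string means that R could not find Git.
git_path <- unname(Sys.which("git"))

if (nzchar(git_path)) {
  message("Git found at: ", git_path)
} else {
  message("Git not found. Is it installed and on the PATH?")
}
```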

Git in RStudio II

Enable it when creating an R project: Click ‘Create a git repository’

Combine it with GitHub

GitHub provides a home for Git-based projects and allows other people to see the project

Happy Git and GitHub for the useR: https://happygitwithr.com/

The renv package

What does the renv package (Ushey & Wickham, 2024) do to create a reproducible environment for R projects?

It…

  • creates a separate library for each project (instead of having one library containing the packages used in all projects)

  • creates a lockfile (i.e., renv.lock) that records metadata (e.g., name, version, and source) about all packages used in the project

  • creates a .Rprofile file that is automatically run every time you start the project

No panacea for reproducibility

renv does not help with the R version, Pandoc (on which R Markdown and Quarto rely), the operating system, versions of system libraries, or compiler versions.
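Since renv does not manage these parts of the environment, one option is to at least write them down alongside the project. The sketch below uses base R only. The list `env_info` and its element names are made up for this example, and the Pandoc entry is only the location of the executable on the PATH (if any), not its version.

```r
# Sketch: collect information about parts of the environment
# that renv does not manage (names in this list are hypothetical).
env_info <- list(
  r_version   = R.version.string,                # e.g., "R version x.y.z (...)"
  platform    = R.version$platform,              # CPU/OS platform string
  os          = sessionInfo()$running,           # operating system description
  pandoc_path = unname(Sys.which("pandoc"))      # "" if Pandoc is not on the PATH
)

str(env_info)
```

The output could, for example, be copied into the README file of the project.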

renv package in R projects (within RStudio) I

Use it when creating an R project: Click ‘Use renv with this project’

renv package in R projects (within RStudio) II

Or use functions from the package to set up a project infrastructure:

renv::init()
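Beyond `renv::init()`, a typical workflow might use a few more functions of the package. The following sketch assumes that renv is installed and that the working directory is the project root. The calls are meant to be run interactively at different points in a project, so they are wrapped in `if (FALSE)` and do nothing when the code is sourced as a whole.

```r
# Check whether renv is installed before using it
renv_available <- requireNamespace("renv", quietly = TRUE)

# Run the following calls interactively, one at a time, when needed.
if (FALSE) {
  renv::init()      # once: set up the project library, renv.lock and .Rprofile
  renv::snapshot()  # after installing or updating packages: update renv.lock
  renv::status()    # check whether library and lockfile are in sync
  renv::restore()   # e.g., on another machine: install the recorded versions
}
```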

References

Ushey, K., & Wickham, H. (2024). renv: Project environments. https://rstudio.github.io/renv/

Footnotes

  1. Image was created with ChatGPT