Topic 3 | Setting Up for Reproducible Data Analysis

Note: Class Details

📅 Date: April 2026
📖 Synopsis: Installing the toolchain, creating a GitHub account, and connecting your project.

Class overview

By the end of this class you will have:

  • assessed your current R skills;
  • installed R, RStudio, and Git on your personal computer;
  • created a GitHub account with SSH authentication;
  • created a structured project directory;
  • connected your local project to GitHub.


Segment | Duration
R confidence check + discussion | 30 min
Install R, RStudio, Git | 60 min
GitHub account + SSH key | 40 min
Project creation + structure + first push | 50 min
Wrap-up + mini-project introduction | 20 min
Buffer / troubleshooting | 20 min
Total | ~220 min (~4 h)

R Confidence Check

Work through the following prompts in RStudio. There are no right or wrong approaches: the goal is to see how you currently think in R.

# 1. Create a numeric vector of 10 values and compute its mean and sd.

# 2. Load the built-in dataset `iris`. How many rows and columns does it have?
#    What are the column names?

# 3. Subset `iris` to only rows where Species is "setosa"
#    and Petal.Length is greater than 1.5.

# 4. Using base R or dplyr (your choice), compute the mean Sepal.Length
#    grouped by Species.

# 5. Make a scatter plot of Sepal.Length vs Petal.Length,
#    coloured by Species, using either base R or ggplot2.

# Advanced users:
# 6. Write a function that takes a numeric vector and returns
#    a named list with its mean, median, and sd.

Discussion

We will share and compare solutions as a group. Key things to reflect on:

  • Did you use <- or = for assignment? (discuss conventions).
  • Did anyone already reach for the native pipe |> or the magrittr pipe %>%?
  • Who has used R Markdown or Quarto before?

Installing the Toolchain

We are now going to make sure every personal computer has the same foundation. This is itself an act of reproducibility: we want our computational environment to be as explicit and intentional as our analysis.

1. Install R

Go to https://cran.r-project.org and download the installer for your operating system.

Windows: download and run the .exe installer. Accept all defaults.

macOS: download and run the .pkg installer. If you are on Apple Silicon (M1/M2/M3/M4), make sure you choose the arm64 build (it will be clearly labelled on the download page).

Verify the installation by opening a terminal and running:

R --version

You should see R version 4.x.x.

2. Install build tools

Windows users need to install Rtools, a collection of build tools required to compile R packages from source. Many packages on CRAN and GitHub are distributed as source code and will fail to install without it.

Go to https://cran.r-project.org/bin/windows/Rtools/ and download the version of Rtools that matches your R version - the page makes this explicit. Run the .exe installer and accept all defaults.

Once installed, verify that R can find Rtools by running this in the R console:

pkgbuild::has_build_tools(debug = TRUE)

You should see TRUE. If the pkgbuild package is not yet installed, run install.packages("pkgbuild") first.

macOS: no action needed. macOS uses the Xcode Command Line Tools (installed alongside Git in a later step) to compile packages from source.

3. Install RStudio Desktop

Go to https://posit.co/download/rstudio-desktop/ - the page auto-detects your OS and recommends the right download.

Windows: run the .exe installer and accept all defaults.

macOS: open the .dmg file and drag RStudio to your Applications folder.

Open RStudio and confirm that the R version shown in the Console pane matches the one you just installed.

4. Install Git

If you need more detailed instructions for your operating system (Linux, macOS, Windows), see Happy Git and GitHub for the useR (https://happygitwithr.com), which covers installing Git on each.

Windows: download Git for Windows from https://git-scm.com/download/win and run the installer.

During installation, pay attention to these two prompts:

  • Adjusting your PATH environment: choose “Git from the command line and also from 3rd-party software” (the recommended option).
  • Choosing the default editor used by Git: change from Vim to whatever you are comfortable with. Notepad is fine for now since we will drive Git from RStudio.
  • Everything else: accept the defaults.

macOS: open Terminal and run:

git --version

If Git is not present, macOS will prompt you to install the Xcode Command Line Tools. Accept and let it run (this may take a few minutes).

Verify the installation

In RStudio, open a terminal via Tools > Terminal > New Terminal and run:

git --version

You should see git version 2.x.x.

5. Configure your Git identity

In the RStudio Terminal (or any terminal), run the following with your own details:

git config --global user.name "Firstname Lastname"
git config --global user.email "your.email@example.com"

Important

Git records who made every change. This name and email will appear in every commit you make, including in your project repository on GitHub. Use the same email you will register with on GitHub.

Verify:

git config --global --list
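
As a complement to reading the full list, the shell sketch below checks the two identity settings one at a time. The function name check_git_identity is hypothetical, introduced here only for illustration; it relies on `git config --global --get`, which prints a single value and exits with a non-zero status when the key is not set.

```shell
# Hypothetical helper: report whether the two identity settings are present.
check_git_identity() {
  missing=0
  for key in user.name user.email; do
    # --get prints the value, or exits non-zero if the key is not set
    if value=$(git config --global --get "$key"); then
      echo "$key = $value"
    else
      echo "$key is not set; run: git config --global $key \"...\""
      missing=1
    fi
  done
  return "$missing"
}

check_git_identity || echo "Fix the missing settings above before committing."
```

If both lines show the name and email you expect, the identity step is complete.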

6. GitHub Account and SSH Authentication

Create a GitHub account

Go to https://github.com/join and create a free account.

A few things worth considering when choosing a username:

  • This will appear on your CV, in shared links, and potentially in publications. Choose something professional.
  • Something like firstnamelastname or f-lastname works well.
  • Use the same email you configured in Git above.

The free tier is sufficient for everything in this course.

Set up SSH authentication

We will use SSH keys so that RStudio can push to GitHub without you typing a password every time. This is the standard approach in professional settings.

Generate the key pair

In the RStudio Terminal:

ssh-keygen -t ed25519 -C "your.email@example.com"

Accept the default file location (~/.ssh/id_ed25519). You can leave the passphrase empty for simplicity in this course, or set one if you prefer.

Copy the public key

Run the following in the RStudio Terminal, then select the output, right-click, and choose “Copy”.

cat ~/.ssh/id_ed25519.pub

Add the key to GitHub

  1. On GitHub, click your avatar (top right) > Settings > SSH and GPG keys > New SSH key.
  2. Give it a descriptive title, e.g. personal_laptop_2026.
  3. Paste the public key into the Key field.
  4. Click Add SSH key.
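
One optional way to confirm that the key GitHub stored is the one on your machine is to compare fingerprints: GitHub shows a SHA256 fingerprint next to each key on the SSH and GPG keys page. The sketch below prints the local fingerprint, assuming the default key location accepted earlier; the function name show_key_fingerprint is hypothetical.

```shell
# Hypothetical helper: print the fingerprint of a public key file.
# Defaults to the key location accepted during the ssh-keygen step.
show_key_fingerprint() {
  key_pub="${1:-$HOME/.ssh/id_ed25519.pub}"
  if [ -f "$key_pub" ]; then
    # -l prints the fingerprint, -f selects the key file
    ssh-keygen -l -f "$key_pub"
  else
    echo "No public key found at $key_pub"
    return 1
  fi
}

show_key_fingerprint || echo "Generate a key first (see the ssh-keygen step)."
```

The SHA256 string in the output should match the one GitHub displays for the key you just added.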

Test the connection

In the RStudio Terminal run:

ssh -T git@github.com

Expected response:

Hi username! You've successfully authenticated, but GitHub does not provide shell access.

Creating and Structuring the Project

Set up a clean directory structure

A reproducible analysis depends on a consistent, self-explanatory folder structure. Here is a minimal but solid starting layout.

In RStudio:

  1. Navigate to the location where you want the project to live and create a new folder for it. Give it a meaningful name, like my-data-analysis.

  2. Create the following folders through the Files pane: data, data/raw, data/processed, scripts, outputs, and docs. The folder (or directory) structure should look like this:

my-data-analysis/
├── data/
│   ├── raw/          # original, never-edited data files
│   └── processed/    # cleaned or derived data
├── scripts/          # reusable functions and scripts
├── outputs/          # plots, tables, model outputs
└── docs/             # reports, notes
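
If you prefer the terminal to the Files pane, the same layout can be created in one go. This is a minimal shell sketch, assuming the project folder is named my-data-analysis and that you run it from the parent folder; `mkdir -p` creates nested folders and does nothing if they already exist.

```shell
# Assumed project name; change it to match your own folder.
project="my-data-analysis"

# -p creates parent folders as needed and is safe to re-run
mkdir -p "$project/data/raw" \
         "$project/data/processed" \
         "$project/scripts" \
         "$project/outputs" \
         "$project/docs"

# List the folders that were created
find "$project" -type d | sort
```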

Create an R Project with Git

In RStudio:

  1. Go to File > New Project > Existing Directory.
  2. Browse to the project folder you just created.
  3. Click Create Project.
  4. Go to Tools > Project Options > Git/SVN and set Version control system to Git.
  5. When asked to confirm a new Git repository, click Yes.
  6. When asked to restart RStudio, click Yes.

RStudio will create the project and open it. The Git pane (top right panel) should now be visible, confirming that Git is tracking this directory.

Warning

Avoid paths that contain spaces or accented characters - this is a common source of problems on Windows. For example, prefer C:\projects\my-data-analysis over C:\Users\Jane Doe\Documents\my-data-analysis.

Start documenting your project

Create and open a README.md file and add a few sentences describing what the project is about. You will expand this as the project develops.

Update .gitignore

The .gitignore file tells Git which files and folders to leave out: anything matching its entries is not tracked and is never included in commits.

Now, open .gitignore and add entries for files or folders that should never be version controlled:

# R artefacts
.Rhistory
.RData
.Rproj.user/

# Large data files: adjust extensions to match your data
data/raw/*.csv
data/raw/*.xlsx

# Rendered outputs that can be regenerated
*.html
*.pdf

Note

Raw data files are often excluded from version control due to size or confidentiality constraints. The standard practice is to document where the data came from in the README, so that anyone with appropriate access can reproduce the starting point of the analysis.
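
To check that the patterns behave as intended, `git check-ignore -v` reports which .gitignore rule, if any, matches a given path. The sketch below runs in a throwaway repository so it cannot affect your project; the file names (survey.csv, report.html, clean.R) are made up for illustration.

```shell
# Work in a temporary repository so nothing in your project is touched
demo=$(mktemp -d)
git -C "$demo" init -q

# A subset of the rules listed above
printf '%s\n' '.Rhistory' '.RData' '.Rproj.user/' \
  'data/raw/*.csv' '*.html' > "$demo/.gitignore"

# Hypothetical files, for illustration only
mkdir -p "$demo/data/raw" "$demo/scripts"
touch "$demo/data/raw/survey.csv" "$demo/report.html" "$demo/scripts/clean.R"

# -v shows the matching rule; paths that are not ignored print nothing
git -C "$demo" check-ignore -v \
  data/raw/survey.csv report.html scripts/clean.R || true

# Ignored files do not appear in the status listing
git -C "$demo" status --short
```

In your own project, run `git check-ignore -v path/to/file` from the project folder to test a single path.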

Make your first commit

In the Git pane in RStudio:

  1. Tick the Staged checkbox next to each changed file (.gitignore, README.md). Note that empty folders do not appear here, because Git does not track empty directories.
  2. Click Commit to open the commit window.
  3. Write a commit message, for example: First commit.
  4. Click Commit.
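
For reference, the Git pane steps above correspond roughly to the following terminal commands. This is a minimal sketch that runs in a throwaway repository; the identity values are placeholders set only for that demo repository, and the file contents are made up.

```shell
demo=$(mktemp -d)
git -C "$demo" init -q

# Placeholder identity, local to the demo repository only
git -C "$demo" config user.name "Firstname Lastname"
git -C "$demo" config user.email "your.email@example.com"

echo "# My data analysis" > "$demo/README.md"
echo ".Rhistory" > "$demo/.gitignore"

# Staging corresponds to ticking the Staged checkbox
git -C "$demo" add README.md .gitignore

# Committing with a message corresponds to the commit window
git -C "$demo" commit -q -m "First commit"

# One line per commit: short hash and message
git -C "$demo" log --oneline
```

In your own project you would drop the `-C "$demo"` part and run the add, commit, and log commands from the project folder.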

Create a GitHub repository and connect it

Now we create the remote repository on GitHub and link the local project to it.

On GitHub:

  1. While logged in to your GitHub account, click the + icon in the top right and choose New repository.
  2. Use exactly the same name as your local folder, e.g. my-data-analysis.
  3. Add a one-sentence description.
  4. Set it to Public or Private (Public is fine for this course).
  5. Do NOT initialise the repository with a README, .gitignore, or license. The repository must be empty so it can receive your local project.
  6. Click Create repository.

GitHub will show you a page with setup instructions. Copy the SSH URL, which looks like:

git@github.com:username/my-data-analysis.git

Now go back to RStudio and open the Terminal (Tools > Terminal > New Terminal). Run the following, replacing the URL with your own:

git remote add origin git@github.com:username/my-data-analysis.git
git branch -M main
git push -u origin main

Note

git remote add origin tells your local repository where the remote lives. git branch -M main renames the current branch to main. git push -u origin main pushes your local commits and sets origin/main as the upstream of your local main branch, so future pushes from RStudio's Git pane need no further configuration.
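
To see what these three commands do without touching GitHub, the sketch below repeats them against a local bare repository that stands in for the remote. The stand-in remote, the placeholder identity, and the file contents are assumptions for illustration; in a real project the URL would be your own SSH URL.

```shell
work=$(mktemp -d)
remote=$(mktemp -d)

# A bare repository plays the role of the empty GitHub repository
git init -q --bare "$remote"

git -C "$work" init -q
git -C "$work" config user.name "Firstname Lastname"
git -C "$work" config user.email "your.email@example.com"
echo "# My data analysis" > "$work/README.md"
git -C "$work" add README.md
git -C "$work" commit -q -m "First commit"

# The same three commands as above, with the stand-in URL
git -C "$work" remote add origin "$remote"
git -C "$work" branch -M main
git -C "$work" push -q -u origin main

# Show where origin points and which remote branch main tracks
git -C "$work" remote -v
git -C "$work" rev-parse --abbrev-ref --symbolic-full-name "@{u}"
```

Running the last two commands in your own project is a quick way to confirm that origin points at your GitHub repository and that main has an upstream.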

Refresh your GitHub repository page - it should now show your project files.

NOTE: Sometimes it takes a minute to display the newly added changes to your GitHub repository.

Tip: Clone an existing GitHub repository

If you are joining a project that already exists on GitHub, or if you prefer to start from GitHub, the workflow is the reverse: create the repository on GitHub first (this time choosing to add a README and a .gitignore file), then in RStudio go to File > New Project > Version Control > Git, paste the SSH URL, and click Create Project. Both workflows end up in the same state.

Tip: Checkpoint

Every student should have their repository on GitHub showing at least README.md and an updated .gitignore before we move on.


Wrap-up and Mini-project Introduction

What was accomplished today

Each piece of the toolchain has a specific role in the reproducibility workflow:

  • R and RStudio | The analysis environment.
  • Git | The mechanism for tracking every change and recording why it was made (version control).
  • GitHub | The remote backup, the collaboration layer, and ultimately the public record of the analysis.
  • The project directory | A contract with your future self (and collaborators) about where everything data-analysis-related lives.

What comes next

In the next class we will write our first analysis inside this project. Everything you do (data import, transformation, visualisation, and discussion) will live in a single reproducible file that renders to HTML or PDF. The project you set up today is where that file will live.

Common problems and fixes

Problem | Likely cause | Fix
ssh -T git@github.com fails | Public key not saved correctly on GitHub, or wrong email | Re-check the key on GitHub under Settings > SSH and GPG keys
RStudio does not find Git | Git executable not on PATH | Go to Tools > Global Options > Git/SVN and set the path. On Windows: C:\Program Files\Git\bin\git.exe
Push rejected | Cloned via HTTPS instead of SSH | Switch the remote: git remote set-url origin git@github.com:username/repo.git
Errors on paths with spaces (Windows) | Project cloned into a path like C:\Users\Jane Doe\... | Move the project to a path without spaces or accented characters
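
The HTTPS-to-SSH fix listed above for a rejected push can be checked before and after with `git remote get-url`. This is a minimal sketch in a throwaway repository; username/repo is the same placeholder used above, and no network access is involved because these commands only edit local configuration.

```shell
demo=$(mktemp -d)
git -C "$demo" init -q

# Simulate a repository whose remote was set up over HTTPS
git -C "$demo" remote add origin https://github.com/username/repo.git
git -C "$demo" remote get-url origin

# Switch the remote to the SSH form
git -C "$demo" remote set-url origin git@github.com:username/repo.git
git -C "$demo" remote get-url origin
```

A remote URL that starts with https:// means the repository is using HTTPS; after the switch it should start with git@github.com:.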

Suggested reading

  • Jenny Bryan's Happy Git and GitHub for the useR (https://happygitwithr.com). Focus particularly on the mental model of a repository and the commit/diff/push cycle.

Homework

  • Add a paragraph to your README.md describing the data source you plan to use for the mini-project.
  • Commit and push that change.
  • Browse the commit history on GitHub and confirm your name and email appear correctly.
