Computational Biology | Transcriptomics Analysis using R

Author

Isabel Duarte

Published

March 2023


Computational Biology course | Module 1
Masters in Health Sciences - Disease Mechanisms
Faculdade de Medicina e Ciências Biomédicas - Universidade do Algarve, Faro, Portugal
Isabel Duarte | giduarte at ualg dot pt | Website: http://iduarte.eu/


General Information

Learning outcomes | Knowledge, Skills, and Competences

The students will acquire basic knowledge in the application of computational techniques and tools to study genetics, molecular biology, and structural biology, through the analysis of Gene expression (Module 1), Systems and networks (Module 2), and Sequence and structure of proteins (Module 3).

The following topics will be addressed in Module 1:

  • Statistical analysis applied to medical/genetic studies;
  • Introduction to R programming;
  • Brief overview of Transcriptomics (gene expression) analysis.


Evaluation

The evaluation for Module 1 (Biostatistics and R Programming) will be the sum of the following two evaluation criteria:

  • A. Performance in class (completing the tutorials and assignments, participation in class, and collaboration with classmates): 5 points.

  • B. One final written exam with one section of multiple choice questions, plus one section with questions to write R code for simple programming tasks: 15 points.


Final grade improvements will be assessed with:

  • One individual assignment, to be completed at home, that must be presented and discussed in a 30-minute individual oral exam: 20 points.


Classes documents


Contents & Bibliography

  • Syllabus | Isabel Duarte & Ana Marreiros
    • 1. Introduction to Biostatistics | Prof. Ana Marreiros
      • Basic concepts in statistics: Univariate and Bivariate analysis
      • Categorical data (Nominal or Ordinal) VS Numeric data (Discrete or Continuous)
      • Descriptive/Exploratory studies: Mean, Standard deviation, Variance, Mode, Median
      • Linear regression and Correlation coefficient (Pearson and Spearman)
      • Inferential studies
      • Parametric VS Non-Parametric tests
      • Z-score (Standard score)
      • Hypothesis testing (Null hypothesis and Alternative hypothesis)
      • Unilateral VS Bilateral tests
      • P-value
    • 2. Introduction to R
      • Introduction to R programming
      • Descriptive statistics in R
      • Hypothesis testing in R
      • Statistical significance in R
    • 3. Transcriptomics analysis using R (brief overview)
      • Current technologies
      • Microarrays and RNA sequencing
      • Biomedical applications


  • Pedagogical goals | At the end of Module 1, the students will be able to:
    • 1. Introduction to Biostatistics:
      1. Identify the type of a variable (Numeric - Continuous or Discrete, Categorical - Ordinal or Nominal);
      2. Formulate hypotheses for hypothesis testing (t-test);
      3. Decide between bilateral and unilateral testing;
      4. Calculate and interpret the p-value of a test.
    • 2. Introduction to R:
      1. Create an RStudio project;
      2. Install packages from major repositories, namely CRAN and Bioconductor;
      3. Identify 4 types of data structures available in R: Vectors, Matrices, Data frames, and Lists;
      4. Recognize the 4 main vector data types: Logical (TRUE or FALSE), Numeric (e.g. 1,2,3…), Character (e.g. “Universidade”, “do”, “Algarve”), and Complex (e.g. 3+2i);
      5. Obtain help regarding R functions (using ? or help);
      6. Create vectors;
      7. Assign results to named variables using the assignment operators <- and = ;
      8. Convert between data types;
      9. Understand vectorized arithmetic, i.e. operations between vectors are applied element-wise;
      10. Understand vector recycling, i.e. if an operation is conducted between vectors of different lengths, the elements of the shorter vector are reused from the beginning;
      11. Construct code iterations using for loops;
      12. Construct conditional statements (if statements);
      13. Load a dataset in R;
      14. Inspect the data loaded;
      15. Obtain information about the dimensions of the dataset, such as the number of rows and number of columns;
      16. Subset a dataset based on row/column number (with []), or based on column name (with $);
      17. Obtain summary statistics on the dataset (mean, maximum, minimum, quartiles, standard deviation, and variance);
      18. Graphically explore the data with boxplots and histograms;
      19. Export results to a file (data analyses and figures);
      20. Save the workspace with all analysis results in an .RData file;
      21. Understand hypothesis testing: interpreting the p-value;
    • 3. Transcriptomics in R
      1. Understand the definition of transcriptomics as the measurement of the expression levels of all genes simultaneously;
      2. Distinguish between microarray and RNA-seq technologies (strengths and weaknesses of each technology);
      3. Understand the limitations of transcriptomics;
      4. List the several steps needed to analyse a transcriptomics dataset in R: data loading, quality control, normalization, visualization, and extraction of gene expression values;
      5. Understand differential gene expression: what it is, how it is achieved, and what it can reveal;
      6. Define a design matrix and a contrast matrix, used for differential gene expression;
      7. Identify the applications of transcriptomics (as a data driven discovery strategy as opposed to classical hypothesis driven research).
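Two of the R concepts listed above (goals 9 and 10, vectorized arithmetic and recycling) can be previewed with toy vectors:

```r
x <- c(1, 2, 3, 4)
y <- c(10, 20, 30, 40)

x + y         # element-wise addition: 11 22 33 44
x * 2         # the single value 2 is recycled: 2 4 6 8
x + c(1, 2)   # the shorter vector is recycled: 2 4 4 6
```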


  • Online resources and Bibliography (for future learning)

    • Websites

    • Books

      • Data Analysis for the Life Sciences (LeanPub, 2021, Rafael A. Irizarry and Michael I. Love, https://leanpub.com/dataanalysisforthelifesciences)
      • A first course in statistical programming with R (CUP, Braun and Murdoch, 2016)
      • R for Data Science: Import, Tidy, Transform, Visualize, and Model Data (O’Reilly, Wickham and Grolemund, 2017) (for advanced users)

Software requirements

To follow the classes for the first module, you will need to use R and RStudio.

You have 2 options:

  1. Install R and RStudio in your own personal computer.
  2. Use UAlg’s VDI (Virtual Desktop Infrastructure) and connect to the BIOCOMP pool. To access it from home, first install UAlg’s VPN to reach the university’s network, and then use the client installed on your personal computer (see below).


Option 1 | Install R and RStudio in your personal computer

  • First install R | https://www.r-project.org/

To download R, please choose any CRAN mirror from the following url: https://cran.r-project.org/mirrors.html

  • Next install RStudio | https://posit.co/download/rstudio-desktop/#download


Option 2 | Install the university’s VPN first (if not already installed).

Site: https://www.ualg.pt/suporte-informatico

Follow the instructions in: “Como aceder remotamente à UAlg? Como posso configurar a VPN?” (“How can I access UAlg remotely? How can I configure the VPN?”)

  • Turn the VPN on (with your student credentials: student number and password).

  • Next, install the VMware Horizon Client required to access the university’s VDI (Virtual Desktop Infrastructure) following the instructions in the file: http://iduarte.eu/classes/compbiol_2023/software_required_Instal_VMware_Horizon_Client_access_VDIs.pdf.

  • Once the VMware client is launched, you should see the collection of VDIs available. Choose the BIOCOMP virtual desktop, and authenticate with your student credentials. Click on the RStudio icon on the Desktop.


Protocols

Welcome to Module 1 of the Computational Biology class. This course is ambitious. You will be challenged to learn many new concepts in a short amount of time. Do not hesitate to ask for help in order to gain the most out of these classes.

These classes are organized by protocols to be completed on each day:

  • first you will get a brief introduction to R programming;
  • then you will practice R by analysing a dataset using descriptive statistics, and learning some plotting functions to visualize the data;
  • you will have a self-assessment exam to test your newly acquired R programming knowledge, and identify possible difficulties;
  • finally, you will proceed with a brief transcriptomics microarray data analysis using R.

1. Introduction to R


Introduction to R

  • First, we will discuss the purpose of learning to program in R.
  • Then, you will be shown how to install and use R and RStudio.
  • Next, you will follow an interactive, self-paced learning tutorial using the swirl package, which you will install in your R.
  • Finally, you will recap some important notes about R programming.

Questions to debate in class:

  • Why do we need to use computer programming?
  • Why do we use R?
  • Are there other alternatives to R?

1. Using RStudio

The practical classes for this course require R and RStudio to be installed on your system.

Please read the tab “Software requirements” to get information on how to install R and RStudio in your system, or how to use the software installed at the virtual machines from UAlg (requires login with UAlg student number).

The friendliest way to learn R is to use RStudio, which is an Integrated Development Environment (IDE) that includes:

  1. a text editor with syntax-highlighting (where you save the R code in a script to run again in the future);
  2. an R console to execute code (where the R instructions are executed and evaluated to generate the output);
  3. a workspace and history management;
  4. and tabs to visualize plots and export images, browse the files in the workspace, manage packages, get help with functions and packages, and view html/pdf or presentation files created within RStudio (using RMarkdown).

Figure 1: RStudio Graphical User Interface (GUI)


2. Install the swirl package & Follow the guided course

After you become familiar with RStudio, we shall progress to the interactive, self-paced, hands-on tutorial provided by the swirl package. You will learn R inside R. For more info about this package, visit: https://swirlstats.com/


Type these instructions in the R console (panel number 2 of RStudio). The words after a # sign are called comments, and are not interpreted by R, so you do not need to copy them.
You can copy-paste the code, and press enter after each command.

install.packages("swirl")   # Install swirl - this step will take around 2 minutes

library("swirl")            # Load the swirl package to make it available for you to use

swirl()                     # Start swirl

After starting swirl, follow the instructions given by the program, and complete all of the following lessons:

Please install the course:
1: R Programming: The basics of programming in R

Choose the course:
1: R Programming

Complete the following lessons, in this order:
1: Basic Building Blocks
3: Sequences of Numbers
4: Vectors
5: Missing Values
6: Subsetting Vectors
7: Matrices and Data Frames
8: Logic
12: Looking at Data
15: Base Graphics


Please call the instructor for any question, or if you require help.
Now, just Keep Calm… and Good Work!


3. Review & Extra notes about R and RStudio

After finishing the introduction to R with swirl, please recall the following information and hints about R and R programming.

  1. R is case sensitive - be aware of capital letters (b is different from B).
  2. All R code lines starting with the # (hash) sign are interpreted as comments, and therefore are not evaluated.
# This is a comment
# 3 + 4   # this code is not evaluated, so it does not print any result
2 + 3     # the code before the hash sign is evaluated, so it prints the result (value 5)
[1] 5
  3. Expressions in R are evaluated from the innermost parentheses toward the outermost (following standard mathematical rules).
# Example with parenthesis:
((2+2)/2)-2
[1] 0
# Without parenthesis:
2+2/2-2
[1] 1
  4. Spaces in variable names are not allowed — use a dot . or underscore _ to create longer, more descriptive variable names, e.g. my.variable_name.
  5. Spaces between variables and operators do not matter: 3+2 is the same as 3 + 2, and function (arg1 , arg2) is the same as function(arg1,arg2).
  6. A new line (enter) ends a command in R. If you want to write 2 expressions/commands on the same line, you must separate them with a ; (semi-colon).
#Example:
3 + 2 ; 5 + 1  
[1] 5
[1] 6
  7. To access the help pages on R functions, just type help(function_name) or ?function_name.
# Example: open the documentation about the function sum
help (sum)    
# Quick access to help page about sum
?sum          
  8. RStudio auto-completes your commands, showing possible alternatives as soon as you type 3 consecutive characters. For fewer than 3 characters, just press tab to display the available options. Tip: Use auto-complete as much as possible to avoid typing mistakes.

  9. There are 4 main vector data types in R: Logical (TRUE or FALSE); Numeric (e.g. 1,2,3…); Character (i.e. text, for example “universidade”, “do”, “Algarve”) and Complex (e.g. 3+2i).

  10. Vectors are ordered sets of elements. In R, vectors are 1-based, i.e. the first index position is number 1 (as opposed to other programming languages whose indices start at zero, like Python).
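For example (with an illustrative character vector):

```r
v <- c("a", "b", "c")
v[1]     # "a"  (the first element is at index 1, not 0)
v[2:3]   # "b" "c"  (elements 2 through 3)
```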

  11. R objects can be divided into two main groups: Functions and Data-related objects. Functions receive arguments inside round brackets ( ) and objects receive arguments inside square brackets [ ]:

    function (arguments)
    data.object [arguments]
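A minimal illustration of the two bracket types (x is a toy vector):

```r
x <- c(10, 20, 30)
sum(x)   # function call: round brackets -> 60
x[2]     # object subsetting: square brackets -> 20
```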

  12. There are five basic data structures in R: Vector, Matrix, Array, Data frame, and List (see following figure):

12.1 The basic data structure in R is the vector, which requires all of its elements to be of the same type (e.g. all numeric; all character (text); all logical (TRUE or FALSE)).

12.2 Matrices are two-dimensional vectors (tables), where all columns have the same length and all elements are of the same type.

12.3 Data frames are the most flexible and commonly used R data structures, used to store datasets in spreadsheet-like tables. The columns can be vectors of different types (i.e. text, numbers, logicals, etc. can all be stored in the same data frame), but within each column all values must be of the same type.

12.4 Lists are ordered sets of elements that can be arbitrary R objects (vectors, data frames, matrices, strings, functions, other lists, etc.) and heterogeneous, i.e. each element can have a different type and length.
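For instance, a single list can mix a numeric vector, a character string, and a logical value (the element names here are illustrative):

```r
my_list <- list(weights  = c(24.0, 21.8, 23.9),
                study    = "mice diet",
                finished = TRUE)

my_list$weights      # access an element by name with $
my_list[["study"]]   # or with double square brackets
length(my_list)      # 3 elements, each of a different type
```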

  13. R is column-major by default, meaning that the elements of a multi-dimensional array are linearly stored in memory column-wise. This is important for example when using data to populate a matrix.
matrix(1:9, ncol = 3)
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
  14. When subsetting matrices and data frames, the first index is the row, and the second index is the column. If left empty (no value), then the full row or the full column is returned.
# My letters matrix
(my_letters <- matrix(rep(LETTERS[1:3], each=3), ncol = 3))
     [,1] [,2] [,3]
[1,] "A"  "B"  "C" 
[2,] "A"  "B"  "C" 
[3,] "A"  "B"  "C" 
# Element in Row 1, Column 2
my_letters[1,2]
[1] "B"
# Element in Row 2, Column 3
my_letters[2,3]
[1] "C"
# Row 3
my_letters[3,]
[1] "A" "B" "C"
# Column 1
my_letters[,1]
[1] "A" "A" "A"


Working directory

The working directory is the location in the filesystem (folder) where R will look for input data and where it will save the output from your analysis. In RStudio you can graphically check this information.

dir()   # list all files in your working directory
getwd() # find out the path to your working directory
setwd("/home/isabel") # example of setting a new working directory path
Saving your workspace in R

The R environment is controlled by hidden files (files that start with a dot .) in the start-up directory: .RData, .Rhistory and .Rprofile (optional).

  • .RData is a file containing all the objects, data, and functions created during a work-session. This file can then be loaded for future work without requiring the re-computation of the analysis. (Note: it can potentially be a very large file);
  • .Rhistory saves all commands that have been typed during the R session;
  • .Rprofile useful for advanced users to customize RStudio behavior.

You can save the workspace and the history under custom names:

# DO NOT RUN
save.image (file="myProjectName.RData")
savehistory (file="myProjectName.Rhistory")


To quit R just close RStudio, or use the q() function. You will then be asked if you want to save the workspace image (i.e. the .RData file):

q()

Save workspace image to ~/path/to/your/working/directory/.RData? [y/n/c]:

If you type y (yes), then the entire R workspace will be written to the .RData file (which can be very large). Often it is sufficient to just save an analysis script (i.e. a reproducible protocol) in an R source file. This way, one can quickly regenerate all data sets and objects for future analysis. The .RData file is particularly useful to save the results from analyses that require a long time to compute.
In RStudio, to quit your session, just hit the close button (x), as you would to quit any other application on your computer.


Package repositories

In R, the fundamental unit of shareable code is the package. A package bundles together code, data, documentation, and tests, and is easy to share with others. These packages are stored in online repositories from which they can be easily retrieved and installed on your computer (R packages by Hadley Wickham). There are 2 main R repositories:

  • CRAN | The Comprehensive R Archive Network, the main general-purpose repository;
  • Bioconductor | Dedicated to bioinformatics and genomics packages.

This huge variety of packages is one of the reasons why R is so successful: the chances are that someone has already developed a method to solve the problem that you’re working on, and you can benefit from their work by downloading their package for free.

In this course, we will not use many packages. However, if you continue to use R for your data analyses you will need to install many more useful packages, particularly from Bioconductor — an open source, open development software project to provide tools for the analysis and comprehension of high-throughput genomics data.

Later in this course we will install Bioconductor and two dedicated packages for transcriptomics analysis.


How to Install and Load packages

There are several alternative ways to install packages in R. Depending on the repository from which you want to install a package, there are dedicated functions that facilitate this task:

  • install.packages() built-in function to install packages from the CRAN repository;
  • BiocManager::install() to install packages from the Bioconductor repository;
  • remotes::install_github() to install packages from GitHub (a general code repository, not dedicated to R).

After installing a package, you must load it to make its contents (functions and/or data) available. The loading is done with the function library(). Alternatively, you can prepend the name of the package followed by :: to the function name to use it (e.g. ggplot2::qplot()).

install.packages("ggplot2")    # install the package called ggplot2 from CRAN
library("ggplot2")             # load the package ggplot2

help(package = ggplot2)        # help(package="package_name") to get help about a specific package
vignette("ggplot2")            # open the package vignette (a long-form usage guide)


Creating an RStudio project

RStudio Projects are a great feature, easing the transition between dataset analyses and allowing fast navigation to your analysis/working directory. To create a new project:

File > New Project... > New Directory > New Project
Directory name: compBiol_module1
Create project as a subdirectory of: ~/
                           Browse... (directory/folder to save the class data)
Create Project

Projects should be personalized by clicking on the menu in the upper right corner. The general options - R General - are the most important to customize, since they define the RStudio “behavior” when the project is opened. The following suggestions are particularly useful:

Restore .RData at startup - Yes (for analyses with +1GB of data, you should choose "No")
Save .RData on exit - Ask
Always save history - Yes

Figure 2: Customize Project

2. Practicing R


Descriptive statistics in R

To practice your newly acquired R skills, you will be doing a brief summary statistics analysis of a simple dataset.

The scientific experiment | Imagine that you are interested in determining whether a high-fat diet causes mice to gain weight after one month. For this study, the scientists obtained data from 60 mice, where half were fed a lean diet, and the other half a high-fat diet. All other living conditions were the same. Four weeks later, all mice were weighed, and their sex and age were also recorded. The results were saved in the file mice_data.txt.

Start by discussing the experimental design
  • What is the research question? What is the hypothesis?
  • How many variables are in the study?
  • Which variable(s) are dependent? (Dependent or Response variables are the variables that we are interested in predicting or explaining.)
  • Which variable(s) are independent? (Independent or Explanatory variables are used to explain or predict the dependent variable.)
  • Which variable(s) are covariates? (Covariates are variables that are potentially related to the outcome of interest in a study, but are not the main variable under study - used to control for potential confounding factors in a study.)
  • Are the “controls” appropriate? Why?


Recall the most common variable types. The types of the variables will define the analyses and visualizations that are possible.
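In R, you can query how a value will be treated with the class() function; for example:

```r
class(21.79)                      # "numeric"   (continuous)
class(3L)                         # "integer"   (discrete)
class("lean")                     # "character" (text)
class(TRUE)                       # "logical"
class(factor(c("lean", "fat")))   # "factor"    (categorical)
```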


Create a new R script file

In order to save your work, you must create a new R script file. Go to File > New File > R Script. A blank text file will appear above the console. Save it to your project folder with the name compBiol_biostatistics.R

Important Note | Why do we need scripts?
In computer programming, a script is a text file containing a sequence of instructions that are interpreted by the computer in order to perform a given task. In R, you will save the commands that you have issued at the command line so that you can later re-run your complete analysis. This makes your results instantly reproducible, objective and fast to run again if there are changes/updates to your data.
As such, this automation is one of the major strengths of using a programming language like R for your statistical analysis — you invest time developing the analysis script, but once it is done, you will advance much faster. Additionally, you can share it with others (collaborators, friends, peers) so that they can see what you have done, help with improving it, and add new features. Currently, most reputable scientific journals require that the scripts used for analysis be submitted with the manuscripts for publication.


Load data and inspect it
  1. Download the file mice_data.txt (mice weights according to diet) from here.
  2. Save the file in your current working directory where the RProject was created.
  3. Type the instructions inside grey boxes in pane number 2 of RStudio — the R Console. As you already know, the words after a # sign are comments not interpreted by R, so you do not need to copy them.
    • In the R console, you must hit enter after each command to obtain the result.
    • In the script file (R file), you must run the command by pressing the run button (on the top panel), or by selecting the code you want to run and pressing ctrl + enter.
  4. Save all your relevant/final commands (R instructions) to your script file to be available for later use.
# Load the file and save it to object mice_data
mice_data <- read.table(file="data/mice_data.txt", header = TRUE,
                        sep = "\t", dec = ".",
                        stringsAsFactors = TRUE)
# Briefly explore the dataset
View (mice_data)       # Open a tab in RStudio showing the whole table
head (mice_data, 10)   # Show the first 10 rows
   diet weight gender age_months
1  lean  24.02      F         19
2  lean  21.79      F         17
3  lean  23.90      F         20
4  lean  11.15      M         10
5  lean  17.73      F         15
6  lean  12.89      M         12
7  lean  20.12      F         16
8  lean  23.04      F         18
9  lean  22.84      F         19
10 lean  18.92      M         15
tail (mice_data, 10)   # Show the last 10 rows
   diet weight gender age_months
51  fat  23.75      M         18
52  fat  21.84      M         17
53  fat  26.60      F         20
54  fat  21.13      M         17
55  fat  24.20      M         19
56  fat  30.69      M         23
57  fat  23.99      F         18
58  fat  19.35      M         17
59  fat  26.37      F         22
60  fat  28.84      M         20
str(mice_data)         # Describe the class of each column in the dataset
'data.frame':   60 obs. of  4 variables:
 $ diet      : Factor w/ 2 levels "fat","lean": 2 2 2 2 2 2 2 2 2 2 ...
 $ weight    : num  24 21.8 23.9 11.2 17.7 ...
 $ gender    : Factor w/ 2 levels "F","M": 1 1 1 2 1 2 1 1 1 2 ...
 $ age_months: int  19 17 20 10 15 12 16 18 19 15 ...
summary (mice_data)    # Get the summary statistics for all columns
   diet        weight      gender   age_months   
 fat :30   Min.   :10.62   F:30   Min.   :10.00  
 lean:30   1st Qu.:19.24   M:30   1st Qu.:17.00  
           Median :22.79          Median :18.00  
           Mean   :22.43          Mean   :17.98  
           3rd Qu.:25.63          3rd Qu.:20.00  
           Max.   :34.76          Max.   :24.00  

To facilitate further analysis, we will create 2 separate data frames: one for each type of diet.

# Filter the diet column by lean or fat and save results in a data frame
lean <- subset (mice_data, diet == "lean")
fat <- subset (mice_data, diet == "fat")

# Look at the new tables
head (lean)
  diet weight gender age_months
1 lean  24.02      F         19
2 lean  21.79      F         17
3 lean  23.90      F         20
4 lean  11.15      M         10
5 lean  17.73      F         15
6 lean  12.89      M         12
head (fat)
   diet weight gender age_months
31  fat  15.67      F         14
32  fat  28.18      M         22
33  fat  29.50      M         22
34  fat  23.89      M         20
35  fat  21.61      F         18
36  fat  25.53      M         21


Descriptive statistics and Plots using R

Now, we should look at the distributions of the variables. First we will use descriptive statistics that summarize the sample data: measures of central tendency (Mean, Median, and Mode), and measures of dispersion or variability (Standard Deviation, Variance, Maximum, and Minimum).
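Note that base R has no function for the statistical mode (the built-in mode() returns an object's storage type instead). A small helper written for this tutorial (stat_mode is not a base R function) can compute it from a frequency table:

```r
stat_mode <- function(x) {
  counts <- table(x)                 # frequency of each distinct value
  names(counts)[which.max(counts)]   # the value with the highest count
}

stat_mode(c(1, 2, 2, 3, 3, 3))   # "3" (returned as text; use as.numeric() if needed)
```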

# Summary statistics per type of diet - min, max, median, average, standard deviation and variance 
summary(lean)      # quartiles, median, mean, max and min
   diet        weight      gender   age_months   
 fat : 0   Min.   :10.62   F:20   Min.   :10.00  
 lean:30   1st Qu.:17.86   M:10   1st Qu.:15.00  
           Median :20.86          Median :17.00  
           Mean   :20.37          Mean   :16.77  
           3rd Qu.:23.03          3rd Qu.:18.75  
           Max.   :30.12          Max.   :22.00  
sd (lean$weight)   # standard deviation of the weight
[1] 4.86655
var(lean$weight)   # variance of the weight (var=sd^2)
[1] 23.68331
summary(fat)
   diet        weight      gender   age_months   
 fat :30   Min.   :15.46   F:10   Min.   :14.00  
 lean: 0   1st Qu.:21.71   M:20   1st Qu.:18.00  
           Median :24.11          Median :19.00  
           Mean   :24.50          Mean   :19.20  
           3rd Qu.:27.79          3rd Qu.:20.75  
           Max.   :34.76          Max.   :24.00  
sd (fat$weight)   
[1] 4.484297
var(fat$weight)
[1] 20.10892


How is the variable “mouse weight” distributed in each diet? | Histograms

After summarizing the data, we should find appropriate plots to look at it. A first approach is to look at the frequency of the mouse weight values per diet using a histogram.

Recall | Histograms plot the distribution of a continuous variable (x-axis), in which the data is divided into a set of intervals (or bins), and the count (or frequency) of observations falling into each bin is plotted as the height of the bar.

# Histogram - BASIC
hist(lean$weight)

# Add x axis names and title
hist(lean$weight,   # the data object to be plotted
     xlab = "Mouse weight",  # label for x axis
     main = "Lean Diet | Histogram of mouse weight")   # title

# Add color
# install the package RColorBrewer if it is not installed using
   # install.packages("RColorBrewer")
library(RColorBrewer) 

hist(lean$weight,
     xlab = "Mouse weight",                         
     main = "Lean Diet | Histogram of mouse weight", 
     col = brewer.pal(5, "YlOrRd"))   # using 5 colors of the Yellow to Red palette

# Make the same plot for the fat diet, using our own colors
   # to see the other color names: colors()
hist(fat$weight,
     xlab = "Mouse weight",
     main = "Fat Diet | Histogram of mouse weight", 
      col = brewer.pal(5, "Greens"))

# Plot both histograms in same image
par(mfrow=c(1,2))   # set the parameters for the number of rows and columns of plots

hist(lean$weight, col = brewer.pal(5, "YlOrRd"),
     xlab = "Mouse weight",                            
     main = "Lean Diet | Histogram of weight")

hist(fat$weight,
     xlab = "Mouse weight",
     main = "Fat Diet | Histogram of weight", 
     col = brewer.pal(5, "Greens"))


How is the variable “mouse weight” distributed in each diet? | Boxplots

Since our data of interest consist of one categorical variable (type of diet) and one continuous variable (weight), a boxplot is one of the most informative plots.

Note | A boxplot represents the distribution of a continuous variable. The box in the middle represents the interquartile range (IQR), which is the range of values from the first quartile to the third quartile, and the line inside the box represents the median value (i.e. the second quartile). The lines extending from the box are called whiskers, and represent the range of the data outside the box, i.e. the maximum and the minimum, excluding any outliers, which are shown as points outside the whiskers (not present in this dataset). Outliers are defined as values that are more than 1.5 times the IQR below the first quartile or above the third quartile.
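The quantities in this definition can be computed directly in R; here is a toy example with one obvious outlier (the data are illustrative, not from mice_data):

```r
w <- c(10, 12, 14, 15, 16, 18, 40)   # toy weights with one outlier
q <- quantile(w, c(0.25, 0.75))      # first and third quartiles
iqr <- IQR(w)                        # interquartile range (Q3 - Q1)
lower <- q[[1]] - 1.5 * iqr          # lower outlier threshold
upper <- q[[2]] + 1.5 * iqr          # upper outlier threshold
w[w < lower | w > upper]             # values flagged as outliers: 40
```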

# Box and whiskers plot - BASIC
boxplot(lean$weight, fat$weight)

# Add color
boxplot(lean$weight, fat$weight, col=c("lightpink", "lightgreen"))  # choosing our own colors

# Add X axis names and Y axis name and limits
boxplot(lean$weight, fat$weight, col=c("lightpink", "lightgreen"),
        names=c("Lean diet", "Fat diet"),
        ylab="Mouse weight (g)",
        ylim = c(5, 40))   # setting the limits of the y axis

# Plot individual points and add them to the boxplot
   # pch is the point character, i.e. the symbol used for the points
stripchart(list(lean$weight, fat$weight),
           vertical = TRUE, method = "stack",
           pch = 21, col="grey42", bg="lightgrey",  
           add = TRUE)

How are the other variables distributed?

There are other variables in our data for each mouse that could influence the results, namely gender (categorical variable) and age (discrete variable). We should also look at these data.

# create table with weights per gender
females <- subset (mice_data, gender == "F")
males <- subset (mice_data, gender == "M")

# Box and whiskers plot
boxplot(lean$weight, fat$weight, females$weight, males$weight,
        ylim = c(5, 40),
        col=c("lightpink", "lightgreen", "skyblue", "orange"),
        names=c("Lean diet", "Fat diet", "Females", "Males"),
        ylab="Mouse weight (g)", main = "Boxplot of mice weight")

# Plot individual points and add them to the boxplot
stripchart(list(lean$weight, fat$weight, females$weight, males$weight),
           vertical = TRUE, method = "jitter",
           pch = 21, col="grey42", bg="grey80",
           add = TRUE)

# Look at the distribution of age
   # brewer.pal is from the RColorBrewer package, which must be loaded first
library(RColorBrewer)
hist(mice_data$age_months, 
     xlab="Age (months)", 
     col = brewer.pal(5, "Pastel1"),
     main="Histogram of mice age")

What is the frequency of each variable?

When exploring the results of an experiment, we want to learn about the variables measured (age, gender, weight) and how many observations we have for each category (number of females, number of males …) or combination of categories, for example, the number of females on the lean diet. This is easily done with the function table, which outputs a frequency table, i.e. the frequency (counts) of all combinations of the variables of interest.

# How many measurements do we have for each gender (a categorical variable)
table(mice_data$gender)

 F  M 
30 30 
# How many measurements do we have for each diet (a categorical variable)
table(mice_data$diet)

 fat lean 
  30   30 
# How many measurements do we have for each gender in each diet? (Count the number of observations in the combination between the two categorical variables).
table(mice_data$diet, mice_data$gender)
      
        F  M
  fat  10 20
  lean 20 10
# We can also use this for numerical discrete variables, like age.
# How many measurements of each age (a discrete variable) do we have by gender? 
table(mice_data$age_months, mice_data$gender)
    
     F M
  10 1 1
  12 0 2
  14 3 0
  15 3 2
  16 1 0
  17 7 5
  18 4 5
  19 4 3
  20 3 4
  21 1 3
  22 3 3
  23 0 1
  24 0 1
# And by diet type? 
table(mice_data$age_months, mice_data$diet)
    
     fat lean
  10   0    2
  12   0    2
  14   2    1
  15   1    4
  16   0    1
  17   3    9
  18   6    3
  19   4    3
  20   6    1
  21   1    3
  22   5    1
  23   1    0
  24   1    0
# What if we want to know the results for each of the three variables: age, diet and gender?
   # Using ftable instead of table to format the output in a more friendly way
ftable(mice_data$age_months, mice_data$diet, mice_data$gender)
         F M
            
10 fat   0 0
   lean  1 1
12 fat   0 0
   lean  0 2
14 fat   2 0
   lean  1 0
15 fat   1 0
   lean  2 2
16 fat   0 0
   lean  1 0
17 fat   0 3
   lean  7 2
18 fat   2 4
   lean  2 1
19 fat   1 3
   lean  3 0
20 fat   2 4
   lean  1 0
21 fat   0 1
   lean  1 2
22 fat   2 3
   lean  1 0
23 fat   0 1
   lean  0 0
24 fat   0 1
   lean  0 0
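Counts can also be expressed as proportions with the function prop.table. A small sketch on a hand-built table mirroring the diet-by-gender structure above (the numbers are hard-coded here just for illustration):

```r
# Toy two-way table of counts (diet x gender)
counts <- matrix(c(10, 20, 20, 10), nrow = 2, byrow = TRUE,
                 dimnames = list(diet = c("fat", "lean"),
                                 gender = c("F", "M")))

prop.table(counts)              # proportions of the grand total
prop.table(counts, margin = 1)  # proportions within each diet (rows sum to 1)
```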


Bivariate Analysis | Linear regression and Correlation coefficient

Is there a dependency between the age and the weight of the mice in our study?

To test whether two variables are correlated, we will (1) make a scatter plot of the two variables, (2) calculate the Pearson correlation coefficient, and (3) fit a linear model to the data to evaluate how the weight changes depending on the age of the mice.

# Create the vectors with the variables of interest
my.weight <- mice_data$weight 
my.age <- mice_data$age_months

# First step: scatter plot of age and weight 
# Note that the dependent variable is the weight, so it should be on the y axis, while the independent variable should be on the x axis.
plot(mice_data$age_months, mice_data$weight,
     ylab = "Weight (g)",
     xlab = "Age (months)", 
     pch = 19)  # character used for the points

# Second step: Calculate the Pearson coefficient of correlation (r)
my.correlation <- cor(my.weight, my.age, method = "pearson")
my.correlation
[1] 0.9539404
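As a sanity check, the Pearson coefficient is simply the covariance of the two variables scaled by their standard deviations. A minimal sketch with toy vectors (not the mouse data):

```r
# Toy paired measurements
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 6)

r_manual <- cov(x, y) / (sd(x) * sd(y))    # definition of Pearson's r
r_cor    <- cor(x, y, method = "pearson")  # same value from cor()

all.equal(r_manual, r_cor)                 # TRUE
```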
# Third step: fit a linear model (using the function lm) and draw it on the scatter plot (using the function abline)
my.lm <- lm(my.weight ~ my.age)  # model of the form response ~ input, where response is the dependent variable (y) and input is the independent variable (x)
my.lm

Call:
lm(formula = my.weight ~ my.age)

Coefficients:
(Intercept)       my.age  
     -6.487        1.608  
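The fitted coefficients can be read directly: an intercept of about -6.49 g and a slope of about 1.61 g per month of age. For example, the model's predicted weight for a 20-month-old mouse would be (a quick sketch using the coefficients printed above as plain numbers):

```r
# Predicted weight = intercept + slope * age
intercept <- -6.487
slope     <- 1.608

intercept + slope * 20   # about 25.7 g for a 20-month-old mouse

# Equivalently, using the fitted model object:
# predict(my.lm, newdata = data.frame(my.age = 20))
```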
# Plot the fitted line on the scatter plot
plot(mice_data$age_months, mice_data$weight, 
     ylab = "Weight (g) [Dependent variable]", xlab = "Age (months) [Independent variable]", pch = 19, 
     col = c(rep("lightgreen", 30), rep("orange", 30)))   # color the points from lean and fat diet

# add the line to the plot
abline(my.lm, col="grey50", lwd=2)

# add a legend to the plot
legend(30, 14, legend=c("Lean diet", "Fat diet"),
       col=c("lightgreen", "orange"), pch=19)


Hypothesis testing and Statistical significance using R

Going back to our original question: Does the type of diet influence the body weight of mice?
Can we answer this question just by looking at the plot? Are these observations compatible with a scenario where the type of diet does not influence body weight?

Remember the basic statistical methods covered in the Biostatistics lectures.

Here enters hypothesis testing. In hypothesis testing, the investigator formulates a null hypothesis (H0) that usually states that there is no difference between the two groups, i.e. the observed weight differences between the two groups of mice occurred only due to sampling fluctuations (like when you repeat an experiment drawing samples from the same population). In other words, H0 corresponds to an absence of effect.
The alternative hypothesis (H1) states that the effect is present between the two groups, i.e. that the samples were taken from different populations.

Hypothesis testing proceeds by using a statistical test to try to reject H0. For this experiment, we will use a t-test that compares the difference between the means of the two diet groups, yielding a p-value that we will use to decide whether to reject the null hypothesis, at a 5% significance level (p-value < 0.05). This means that, if we repeated this experiment 100 times with different mice, in about 5 of those experiments we would reject the null hypothesis even though it is true.

# Apply a T-test to the lean and fat diet weights 

### Explanation of the arguments used ###
  # alternative="two.sided" :  two-sided because we want to test any difference between the means, and not only weight gain or weight loss (in which case it would be a one-sided test)

  # paired = FALSE : because we measured the weight in 2 different groups of mice (never the same individual). If we measure a variable 2 times in the same individual the data would be paired.

  # var.equal = TRUE : the classical two-sample t-test assumes equal variances in the two groups, so we set it to TRUE and R pools the variance estimate (if we choose FALSE, R uses a similar method called the Welch (or Satterthwaite) approximation) 

ttest <- t.test(lean$weight, fat$weight,
                alternative="two.sided", 
                paired = FALSE, 
                var.equal = TRUE)

# Print the results
ttest

    Two Sample t-test

data:  lean$weight and fat$weight
t = -3.4197, df = 58, p-value = 0.001154
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -6.550137 -1.713197
sample estimates:
mean of x mean of y 
 20.36700  24.49867 


Now that we have run the t-test, should we reject the null hypothesis or not? What outputs does the t.test function return in R?
# Find the names of the output from the function t.test
names(ttest)
 [1] "statistic"   "parameter"   "p.value"     "conf.int"    "estimate"   
 [6] "null.value"  "stderr"      "alternative" "method"      "data.name"  
# Extract just the p-value
ttest$p.value
[1] 0.00115364
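For intuition about where the t statistic comes from, it can be reproduced by hand: the difference between the group means divided by the pooled standard error. A minimal sketch on toy vectors (not our mouse data):

```r
# Toy data: two small groups
g1 <- c(20, 22, 19, 24, 21)
g2 <- c(25, 27, 24, 28, 26)

n1 <- length(g1); n2 <- length(g2)

# Pooled variance (the equal-variance assumption, as in var.equal = TRUE)
sp2 <- ((n1 - 1) * var(g1) + (n2 - 1) * var(g2)) / (n1 + n2 - 2)

t_manual <- (mean(g1) - mean(g2)) / sqrt(sp2 * (1/n1 + 1/n2))
t_r      <- unname(t.test(g1, g2, var.equal = TRUE)$statistic)

all.equal(t_manual, t_r)   # TRUE
```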


Final discussion

Take some time to discuss the results with your classmates, and decide if H0 should be rejected or not, and how confident you are that your decision is reasonable. Can you propose solutions to improve your confidence on the results? Is the experimental design appropriate for the research question being asked? Is this experiment well controlled and balanced?

Evaluation task

At the end of the class, save all the code written while following the tutorial in an R script, add a brief discussion of the final questions, and send it to the tutor for evaluation.


3. Self-evaluation


Self-evaluation exam

The first task for today is to complete an individual self-evaluation exam, where you will have one hour to answer some data analysis questions using R. This is not a performance evaluation: the correctness of your answers will not be graded, but your effort, participation, and willingness to help your colleagues will be. It is mostly intended to help you identify your difficulties and consolidate your knowledge.

  1. For this exam, please install the ualg.compbio package:
# Make sure you have the package `remotes` installed:
install.packages('remotes')

# Now you can install the ualg.compbio package from GitHub:
remotes::install_github("instructr/ualg.compbio")
  2. Run the self_eval_exam from the ualg.compbio package:
learnr::run_tutorial('self_eval_exam', package = 'ualg.compbio')
  3. Please complete the exam. You have 60 minutes.

4. Transcriptomics in R


Sequencing and Transcriptomics

  • What is DNA sequencing?

DNA sequencing is the process of determining the order of the four nucleotides (A, T, C, and G) that make up a DNA molecule. The sequence of nucleotides in DNA encodes genetic information that determines the traits and characteristics of an organism.

Sequencing technologies have advanced significantly in recent years, making it possible to sequence the entire genomes of organisms. This has revolutionized biological research, boosting the fields of genomics, transcriptomics, and epigenomics, and leading to many new discoveries and insights into the functioning of living cells.

There are several different sequencing technologies available, namely:

  • Sanger sequencing: a relatively old method that is still used for sequencing small DNA fragments;
  • Next-generation sequencing (NGS): allows for the sequencing of millions of DNA fragments in parallel, providing high-throughput and cost-effective sequencing;
  • Third-generation sequencing: such as single-molecule real-time (SMRT) sequencing and nanopore sequencing, can generate longer read lengths and are particularly useful for sequencing complex genomic regions and detecting structural variations.


  • What is transcriptomics?

Transcriptomics is the study of the complete set of RNA transcripts in a cell or tissue at a given moment. With the advent of high-throughput sequencing technologies, transcriptomics has become a powerful tool for studying gene expression, alternative splicing, and other aspects of RNA biology on a large scale.


  • Which transcriptomics technologies are currently available?

    • Microarrays: Microarrays are a hybridization-based technology that allows for the detection of the expression levels of thousands of genes simultaneously. They use nucleic acid probes that are hybridized to complementary RNA molecules to detect their abundance. Microarrays are less sensitive than RNA-seq, but can be useful for detecting changes in gene expression in a high-throughput manner.

    • RNA-sequencing (RNA-seq): RNA-seq is a widely used transcriptomics technique that allows for the quantitative analysis of gene expression. It involves the conversion of RNA into cDNA, which is then sequenced using high-throughput sequencing technologies. RNA-seq provides high resolution and coverage of the transcriptome, allowing for the detection of low abundance transcripts, alternative splicing events, and non-coding RNAs. There are various protocols for RNA-seq library preparation, including poly(A) enrichment, ribosomal RNA depletion, and strand-specific sequencing, which can affect the accuracy and sensitivity of the results.

    • Single-cell sequencing: Single-cell sequencing allows for the analysis of gene expression in individual cells, providing insights into cell-to-cell variability and heterogeneity. Single-cell RNA-seq (scRNA-seq) involves the isolation and sequencing of RNA from individual cells, which can reveal cell type-specific gene expression patterns and identify rare cell populations.

    • Long-read sequencing: Long-read sequencing technologies, such as PacBio and Oxford Nanopore, can provide full-length transcript sequencing, allowing for the detection of isoforms and splice variants. Long-read sequencing can be particularly useful for identifying novel transcripts and alternative splicing events.

    • Spatial transcriptomics: Spatial transcriptomics is a relatively new technology that allows for the analysis of gene expression patterns within intact tissues. It involves the imaging of gene expression in situ, followed by the spatially-resolved sequencing of the corresponding regions. Spatial transcriptomics can provide insights into the spatial organization of cells and tissues and their roles in complex biological processes.


  • Why do we need transcriptomics? Current applications for transcriptomics studies.

Transcriptomics is an important field of research that helps us understand the expression and regulation of genes at the transcript level, which provides critical information about the biological processes that are occurring in cells and tissues. Overall, transcriptomics is a powerful tool for understanding gene expression and regulation, and has a wide range of applications in basic research, medicine, and biotechnology.

Transcriptomics relevance for the biomedical research community:

  • Understand gene expression: Transcriptomics allows researchers to study gene expression patterns in different cell types, tissues, and developmental stages, providing insight into how genes are regulated and how they contribute to different biological processes.

  • Identify novel genes and transcripts: Transcriptomics enables the discovery of previously unknown genes and transcripts, which can provide new targets for drug discovery and help us understand the genetic basis of diseases.

  • Characterize alternative splicing: Transcriptomics allows researchers to study alternative splicing, which is the process by which a single gene can produce multiple transcripts with different exon compositions. Alternative splicing plays a key role in generating protein diversity and can have important implications for disease.

  • Study non-coding RNAs: Transcriptomics can be used to study non-coding RNAs, which are RNA molecules that do not code for proteins but have important regulatory functions in the cell. Transcriptomics can help us understand the roles of non-coding RNAs in gene regulation, development, and disease.

  • Compare gene expression across conditions: Transcriptomics can be used to compare gene expression patterns across different conditions, such as disease states or drug treatments, providing insight into the molecular mechanisms underlying these conditions and identifying potential therapeutic targets.

Current applications of transcriptomics studies include:

  • Gene expression profiling: study gene expression patterns in different cell types, tissues, and developmental stages, providing insight into how genes are regulated and how they contribute to different biological processes.

  • Biomarker discovery: identify potential biomarkers for diseases, which can be used for early diagnosis or to monitor disease progression.

  • Drug discovery: identify potential drug targets or to screen drugs for their effects on gene expression.

  • Precision medicine: identify genetic variants that are associated with specific diseases, allowing for more personalized and targeted treatments.

  • Functional genomics: study the function of specific genes or to identify gene networks that are involved in specific biological processes.


  • What are the challenges with transcriptomics data analyses?

Transcriptomics data analysis can be challenging due to several factors, including the complexity of the data, the variety of platforms and methods available, and the need for advanced computational tools and statistical analyses. Some of the main challenges associated with transcriptomics data analysis are:

  • Data quality: Transcriptomics data can be affected by various sources of noise and variability, including technical factors such as batch effects, platform-specific biases, and low signal-to-noise ratios. Ensuring data quality is critical for accurate downstream analysis.

  • Data pre-processing: Transcriptomics data requires preprocessing to correct for technical artifacts, normalize for differences in sequencing depth, and filter out low-quality reads. There is often no consensus on the best way to preprocess data, and different approaches can lead to different results.

  • Data analysis: Transcriptomics data analysis involves identifying differentially expressed genes, clustering genes by expression pattern, and identifying enriched biological pathways. However, the choice of statistical tests, normalization methods, and downstream analyses can greatly affect the results and interpretation of the data.

  • Interpretation: Transcriptomics data can produce large amounts of data, making it challenging to interpret the results and identify meaningful biological insights. Additionally, the interpretation of results can be affected by biases in the selection of genes or pathways for analysis.

  • Integration: Integrating transcriptomics data with other types of omics data, such as genomics or proteomics, can be challenging due to differences in data types and the need for specialized computational tools.


  • Differential gene expression analysis: what is it and why do we need it?

Differential gene expression analysis is a technique that involves comparing the expression levels of each gene across different conditions to identify genes that are up- or down-regulated in response to a particular treatment or condition.

Differential gene expression analysis allows researchers to identify the genes and biological pathways that are affected by a particular condition or treatment (for example, to identify the genes that are up- or down-regulated in diseased tissues compared to healthy tissues, or in response to a particular treatment or drug). This can provide insight into the molecular mechanisms underlying diseases, identify potential therapeutic targets, or identify the genes and pathways that are affected by a particular drug.

There are several statistical methods and software tools available for performing differential gene expression analysis in R, including DESeq2, edgeR, and limma. These tools use statistical models to identify genes that are significantly differentially expressed between conditions.
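To make the idea concrete before the guided tutorial, here is a conceptual sketch of differential expression on a simulated expression matrix, using a plain per-gene t-test followed by multiple-testing correction. Note that this is only an illustration: real analyses use the moderated statistics of limma, DESeq2 or edgeR, which are far more reliable for small sample sizes, and all names and numbers below are made up.

```r
set.seed(42)

# Simulated matrix: 100 genes (rows) x 10 samples (5 control, 5 treated)
expr <- matrix(rnorm(100 * 10), nrow = 100,
               dimnames = list(paste0("gene", 1:100), NULL))
group <- factor(rep(c("ctrl", "trt"), each = 5))

# Spike a large expression increase into the first 5 genes of the treated group
expr[1:5, group == "trt"] <- expr[1:5, group == "trt"] + 8

# One t-test per gene, then adjust the p-values for multiple testing (FDR)
pvals <- apply(expr, 1, function(g) t.test(g ~ group)$p.value)
padj  <- p.adjust(pvals, method = "fdr")

rownames(expr)[padj < 0.05]   # should include the spiked-in genes (gene1..gene5)
```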



Hands-on microarray data analysis in R

The scientific experiment | Imagine that you are interested in studying ovarian cancer, in particular, the effect of TGF-beta on the growth of human ovarian cancer cells.

Motivation | Advanced ovarian cancer is one of the most lethal gynecologic malignancies. Ovarian cancer cells are known to have diminished response to TGF-beta, but it remains unclear whether TGF-beta can modulate ovarian cancer cell growth in an indirect manner through cancer-associated fibroblasts (CAFs).

Experimental approach | Using transcriptome profiling analyses on TGF-beta-treated ovarian fibroblasts, you will be able to identify the gene signature of TGF-beta-regulated genes in the ovarian microenvironment, hence helping with understanding the role of TGF-beta in ovarian cancer progression.

Data | GSE40266 dataset from GEO (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE40266).

Experimental design | The human telomerase-immortalized ovarian fibroblast line NOF151 was treated with 5ng/mL of either TGF-beta-1 or TGF-beta-2. Total RNA was isolated from control samples and TGF-beta-treated fibroblasts samples at 48 hours post-treatment, followed by cDNA synthesis, IVT and biotin labeling. Samples were then hybridized onto Affymetrix Human Genome U133 Plus 2.0 microarrays. For each treatment group, three independent samples were prepared for the microarray experiment.

Our analysis | To conduct a transcriptomics analysis of this microarray dataset, we must load the data, visualize it, process it, and run a differential gene expression analysis to find genes that are differentially regulated between the 3 experimental groups: control cells (untreated), TGFb1 (cells treated with TGF-beta 1), and TGFb2 (cells treated with TGF-beta 2).


Guided analysis tutorial

Note | Performing a transcriptomics analysis using R requires intermediate programming skills. Since this course aims at introducing R to the students, we will run the analysis from a guided interactive tutorial, step by step.
The tutor will explain each command in detail. Do not hesitate to ask if anything is not clear to you.
In the end, there is a small quiz to help you diagnose your difficulties.

Good work!

learnr::run_tutorial('transcriptomics2023', package = 'ualg.compbio')



For future reference | R script used in the transcriptomics guided tutorial

For your reference, here is the R code used for the transcriptomics analysis.

# Transcriptomics microarray analysis
# Computational Biology - Module 1
# Masters Biomedical Sciences - UAlg
#
# R script generated from GEO2R
# Adapted, Customized and Improved by: Isabel Duarte 
# Date: March 2023
################################################################
###         Microarray Transcriptomics Data Analysis         ###
###            Differential Expression with limma            ###
################################################################

# Install Bioconductor (if not yet installed)
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

### Install packages from Bioconductor 
BiocManager::install("GEOquery", version = "3.8")
BiocManager::install("limma", version = "3.8")

### Load packages
library(Biobase)
library(GEOquery)
library(limma)

# Set working directory to the folder where the class data is located
setwd("/Users/isabelduarte/Desktop/OneDrive - Universidade do Algarve/UAlg/3.classes/bioComputacional_2022/protocols2022/")  # Change this path to the path from your system.

# IMPORTANT: Make sure you create a directory inside your working directory named "R-output"
  # to save the files download from GEO, and the results from the data analysis.

###############################################################
### Load microarray data, and annotation, from a GEO series ###
###############################################################
gset <- getGEO("GSE40266", destdir="R-output/", GSEMatrix =TRUE, AnnotGPL=TRUE, getGPL = TRUE)

# inspect gset - find what type of object it is
gset
class(gset)
length(gset)

# The list has only 1 ExpressionSet object, so extract it from its list format 
gset <- gset[[1]]
class(gset)
View(gset)

# Discover the measured variable names for which there are annotation data in the gset object
fvarLabels(gset)
# make proper column names to match the names that will be used by limma package (toptable function)
fvarLabels(gset) <- make.names(fvarLabels(gset))

### Log2 transform the expression values
# keep a copy of the non-log transformed values
ex_raw <- exprs(gset)

# Before log transforming the values, we will auto-detect if the values are already in log space by looking at the distribution (quantiles) of the expression values.
# create a vector of quantiles 
qx <- as.numeric(quantile(ex_raw, probs=c(0., 0.25, 0.5, 0.75, 0.99, 1.0), na.rm=TRUE))
LogC <- (qx[5] > 100) ||
  (qx[6]-qx[1] > 50 && qx[2] > 0) ||
  (qx[2] > 0 && qx[2] < 1 && qx[4] > 1 && qx[4] < 2)
# log transform the expression values
if (LogC) {
  ex_raw[which(ex_raw<= 0)] <- NaN
  exprs(gset) <- log2(ex_raw)
}

# Compare the expression matrix BEFORE and AFTER making the values log2
   # (columns are arrays (samples), rows are probesets (genes))
head (ex_raw)       # Before Log2
head (exprs(gset))  # After Log2


############################################################
### Differential Expression analysis using limma package ###
############################################################
# Set up the data and group names for all samples
gsms <- "000111222"
sml <- unlist (strsplit(gsms, ""))
sml <- paste("G", sml, sep="")    # set group names
fl <- as.factor(sml)
gset$description <- fl

# Create a design matrix to use with limma - Linear Model fitting package
design <- model.matrix(~ description + 0, gset)
colnames(design) <- levels(fl)
View(design)

# Fit the gene expression values with linear models, using the design matrix
fit <- lmFit(gset, design)   
View(fit)

# Create the contrast matrix - with the comparisons that we want to make for differential expression
cont.matrix <- makeContrasts(G2-G0, G1-G0, G2-G1, levels=design)
fit2 <- contrasts.fit(fit, cont.matrix)  # Fit the contrast matrix to the matrix that was previously fit
fit2 <- eBayes(fit2, 0.01)               # Apply eBayes error correction

# Extract all the differentially expressed genes (even the ones below significance), and, separately, only the 500 most significant
tT <- topTable(fit2, adjust="fdr", sort.by="B", number=nrow(fit2))
tT500 <- topTable(fit2, adjust="fdr", sort.by="B", number=500)
head (tT500)
colnames(tT500)
View(tT500)

# Keep only the columns with data that we are interested in
tT <- subset(tT, select=c("ID","Gene.symbol","Gene.title","G2...G0","G1...G0","G2...G1","AveExpr","adj.P.Val","P.Value","F"))
tT500 <- subset(tT500, select=c("ID","Gene.symbol","Gene.title","G2...G0","G1...G0","G2...G1","AveExpr","adj.P.Val","P.Value","F"))
View(tT500)
write.table(tT, file=stdout(), row.names=F, sep="\t")
write.table(tT500, file=stdout(), row.names=F, sep="\t")

#####################
### Plot the data ###
#####################
# Boxplot

# order samples by group
ex <- exprs(gset)[ , order(sml)]
sml <- sml[order(sml)]
fl <- as.factor(sml)
labels <- c("control","tgfB1","tgfB2")

# set parameters and draw the plot
palette(c("#dfeaf4","#f4dfdf","#f2cb98", "#AABBCC"))
par(mar=c(2+round(max(nchar(sampleNames(gset)))/2),4,2,1))
title <- paste ("GSE40266", '/', annotation(gset), " selected samples", sep ='')

boxplot(ex, boxwex=0.6, notch=T, main=title, outline=FALSE, las=2, col=fl)
legend("topleft", labels, fill=palette(), bty="n", cex = 0.8)

___ END ___

Back to top