Computational Biology
Computational Biology course | Intro to R data analysis (Module 1)
Masters in Health Sciences - Disease Mechanisms
Faculdade de Medicina e Ciências Biomédicas - Universidade do Algarve, Faro, Portugal
Isabel Duarte | giduarte at ualg dot pt | Website: http://iduarte.eu/
General Information
Learning outcomes | Knowledge, Skills, and Competences
The students will acquire basic knowledge in the application of computational analysis techniques to biological data using R.
The following topics will be addressed in the ‘Intro to R data analysis’ (Module 1):
- Introduction to R programming;
- Descriptive statistics using R;
- Brief exploratory data analysis of a biological dataset.
Evaluation
The grade for Module 1 (Intro to R data analysis) is the sum of the following two evaluation criteria:
A. Performance in class, measured by completion of the tutorials and assignments, participation in class, and collaboration with classmates: 5 points.
B. One final written exam with one section of multiple-choice questions, plus one section of questions that require writing R code for simple programming tasks: 15 points.
Students seeking to improve their final grade will be assessed with:
- One individual assignment, to be completed at home, that must be presented and discussed in a 30-minute individual oral exam: 20 points.
Class documents
Additional files required for exercises
Prerequisites
To attend these classes, students should be familiar with the following basic statistical concepts:
- Basic concepts in statistics: Univariate and Bivariate analysis
- Categorical data (Nominal or Ordinal) vs Numeric data (Discrete or Continuous)
- Descriptive/Exploratory studies: Mean, Median, Min, Max, Standard deviation, Variance, Mode, Interquartile range
- Linear regression and Correlation coefficient (Pearson and Spearman)
- Inferential studies
- Parametric vs Non-Parametric tests
- Z-score (Standard score)
- Hypothesis testing (Null hypothesis and Alternative hypothesis)
- Unilateral vs Bilateral tests
- P-value
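As a quick refresher on how some of these concepts look in R, here is a minimal sketch. The two groups of measurements are made-up values used only for illustration, and the variable names are hypothetical; they are not part of the course materials.

```r
# Hypothetical measurements for two groups (illustrative values only)
control <- c(5.1, 4.9, 5.6, 5.8, 5.0, 5.3)
treated <- c(6.2, 5.9, 6.8, 6.1, 6.5, 6.0)

# Descriptive statistics for one group
mean(control)
median(control)
sd(control)    # standard deviation
var(control)   # variance
IQR(control)   # interquartile range

# Z-scores: distance of each value from the mean, in standard deviations
z_scores <- (control - mean(control)) / sd(control)

# Bilateral (two-sided) test: the alternative hypothesis is "the means differ"
two_sided <- t.test(treated, control, alternative = "two.sided")

# Unilateral (one-sided) test: the alternative hypothesis is
# "the treated mean is greater than the control mean"
one_sided <- t.test(treated, control, alternative = "greater")

# P-values of both tests
two_sided$p.value
one_sided$p.value
```

Because the observed difference lies in the direction stated by the one-sided alternative, the one-sided p-value here is half of the two-sided one.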
Syllabus
- 1. Brief recap of basic statistics concepts
- 2. Introduction to R
- Introduction to R programming
- Descriptive statistics in R
- Hypothesis testing in R
- Statistical significance in R
- 3. Mini-project: Exploratory data analysis using R
- Tidy data concept: How to organize data into tidy tables
- Visualization of descriptive statistics, and variable distributions
- Principal Component Analysis (PCA) for dimensionality reduction
- Fitting simple linear models
- Finding and visualizing correlations
- Strategies to derive knowledge from data
- Others (according to students' requests)
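To give an idea of what the mini-project steps might look like, here is a minimal sketch using only base R. It assumes the built-in iris dataset as a stand-in, since the dataset used in class is not specified here; the object names are hypothetical.

```r
# Built-in example dataset (an assumption; not the course dataset)
data(iris)

# Inspect the structure and summary statistics of the table
str(iris)
summary(iris)

# Visualize descriptive statistics and variable distributions
boxplot(Sepal.Length ~ Species, data = iris, ylab = "Sepal length (cm)")
hist(iris$Petal.Length, main = "Petal length", xlab = "Length (cm)")

# PCA on the four numeric columns, centred and scaled to unit variance
pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)
summary(pca)   # proportion of variance explained by each component

# Fit a simple linear model: petal width as a function of petal length
fit <- lm(Petal.Width ~ Petal.Length, data = iris)
coef(fit)      # intercept and slope

# Correlations: Pearson, Spearman, and the full correlation matrix
cor_pearson  <- cor(iris$Petal.Length, iris$Petal.Width, method = "pearson")
cor_spearman <- cor(iris$Petal.Length, iris$Petal.Width, method = "spearman")
cor_matrix   <- cor(iris[, 1:4])
```

Scaling before PCA is a common choice when variables are measured on different scales; whether it is appropriate depends on the dataset.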
Pedagogical goals
At the end of Module 1, the students will be able to:
- 1. Biostatistics:
- Identify the type of a variable (Numeric - Continuous or Discrete, Categorical - Ordinal or Nominal);
- Formulate hypotheses for hypothesis testing (t-test);
- Decide between bilateral and unilateral testing;
- Calculate and interpret the p-value of a test.
- 2. Introduction to R:
- Create an RStudio project;
- Install packages from major repositories, namely CRAN and Bioconductor;
- Identify 4 types of data structures available in R: Vectors, Matrices, Data frames, and Lists;
- Recognize the 4 main vector data types: Logical (TRUE or FALSE), Numeric (e.g. 1, 2, 3…), Character (e.g. “Universidade”, “do”, “Algarve”), and Complex (e.g. 3+2i);
- Obtain help regarding R functions (using ? or help);
- Create vectors;
- Assign results to named variables using the assignment operators <- and =;
- Convert between data types;
- Understand vectorized arithmetic, i.e. operations between vectors are applied element-wise;
- Understand vector recycling, i.e. if an operation is conducted between vectors of different length, the elements from the shorter vector are reused from the beginning;
- Construct code iterations using for loops;
- Construct conditional statements (if statements);
- Load a dataset in R;
- Inspect the data loaded;
- Obtain information about the dimensions of the dataset, such as the number of rows and number of columns;
- Subset a dataset based on row/column number (with []), or based on column name (with $);
- Obtain summary statistics on the dataset (mean, maximum, minimum, quartiles, standard deviation, and variance);
- Graphically explore the data with boxplots and histograms;
- Export results to a file (data analyses and figures);
- Save the workspace with all analysis results in a .Rdata file;
- Understand hypothesis testing: interpreting the p-value.
- 3. Mini-project: Exploratory data analysis using R:
- Inspect and assess the structure and quality of a dataset;
- Perform structured exploratory data analysis using appropriate visualisations;
- Interpret distributions, group differences, and correlations in biological terms;
- Recognise multivariate patterns and potential confounding effects;
- Translate exploratory findings into hypotheses for downstream modelling.
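Several of the R goals listed above can be illustrated with a short sketch. The example below is only an assumption of what such code might look like: the small data frame, its values, and the file names are hypothetical, and output files are written to a temporary folder.

```r
# The four main vector data types
is_treated <- c(TRUE, FALSE, TRUE)                  # logical
counts     <- c(1, 2, 3)                            # numeric
words      <- c("Universidade", "do", "Algarve")    # character
z          <- 3+2i                                  # complex
class(is_treated); class(counts); class(words); class(z)

# Getting help on a function: ?mean or help(mean)

# Converting between data types
as.character(counts)
as.numeric("3.5")

# Vectorized arithmetic: operations are applied element-wise
a <- c(1, 2, 3, 4)
b <- c(10, 20, 30, 40)
a + b            # 11 22 33 44

# Vector recycling: the shorter vector is reused from the beginning
a * c(1, 10)     # 1 20 3 40

# A for loop combined with an if statement
for (value in a) {
  if (value %% 2 == 0) {
    cat(value, "is even\n")
  } else {
    cat(value, "is odd\n")
  }
}

# A small hypothetical dataset stored as a data frame
df <- data.frame(gene = c("geneA", "geneB", "geneC"),
                 expression = c(2.5, 7.1, 4.3))

# Dimensions: rows and columns
dim(df); nrow(df); ncol(df)

# Subsetting by row/column number with [] and by column name with $
df[2, ]          # second row
df[, 2]          # second column
df$expression    # column named "expression"

# Summary statistics
summary(df$expression)
sd(df$expression)
var(df$expression)

# Export results to a file and save the workspace (temporary folder)
csv_path   <- file.path(tempdir(), "results.csv")
rdata_path <- file.path(tempdir(), "analysis.RData")
write.csv(df, csv_path, row.names = FALSE)
save.image(file = rdata_path)
```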
Bibliography
Online resources and Bibliography (for future learning)
Websites
- R Project (The developers of R.)
- Quick-R (Roadmap and R code to quickly use R.)
- R-bloggers (Great resource for posts related to alternative ways to do the same thing in R.)
- Bioconductor workflows (R code for pipelines of genomic analyses.)
Free Online Books
- R for Data Science (Hadley Wickham, Mine Cetinkaya-Rundel & Garrett Grolemund. A great book for structured learning of R for data science, starting from simple concepts and building up step by step.)
- Modern Statistics for Modern Biology (Wolfgang Huber and Susan P. Holmes. Great book to learn Biostatistics using R.)
- Introduction to Data Science: Data Wrangling and Visualization with R (Rafael A. Irizarry. High-quality R code, with clear and rigorous explanations, demonstrating how to format data and visualise it using ggplot2.)
- Introduction to Data Science: Statistics and Prediction Algorithms Through Case Studies (Rafael A. Irizarry. Hands-on tutorials focused on learning how to use R for data analysis using carefully chosen case studies, where data analysis brings information to light.)
- Cookbook for R (Winston Chang. Well-structured R scripts (practical “recipes”) for common programming tasks.)