Week 5 | Exploratory analysis critique
Class Details
📅 Date: 08 April, 2025
⏰ Time: 09:30h - 12:00h
📖 Synopsis: Metadata documentation, data visualization for validation, critical review of scientific datasets and their publications.
1. Metadata Files
- Discussion Topics:
- What is a metadata file, and why is it important for reproducibility?
- Minimum information to include:
- Variable names and descriptions
- Units of measurement
- Data types (e.g., numeric, categorical)
- Valid value ranges
- Missing value encoding
- Encoding not applicable or missing values:
- Use standard codes (e.g., `NA`, `null`, or a defined code like `999`), and document them clearly
- Categorical variables:
- Define all possible levels
- Include labels or descriptions for each level, especially when acronyms are used
- Best practices:
- Keep metadata in a separate, human-readable file (e.g., CSV, TXT)
- Use consistent naming between metadata and data files
- Regularly update metadata as data evolves
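As a minimal sketch of the points above, the following shows a metadata file stored as CSV and a small check of a data file against it. The variable names, ranges, and values are hypothetical and chosen only for illustration. Python's standard library is used here; the same idea carries over to R.

```python
import csv
import io

# Hypothetical data dictionary: one row per variable in the data file.
METADATA_CSV = """variable,description,unit,type,min,max,missing_code
site,Sampling site identifier,,categorical,,,NA
depth_m,Water depth at sampling point,m,numeric,0,50,NA
temp_c,Water temperature,degC,numeric,-2,40,NA
"""

# Hypothetical data file that uses the same variable names as the metadata.
DATA_CSV = """site,depth_m,temp_c
A,12.5,18.2
B,NA,19.0
C,75,17.4
"""


def load_metadata(text):
    """Return a dict mapping each variable name to its metadata row."""
    return {row["variable"]: row for row in csv.DictReader(io.StringIO(text))}


def validate(data_text, metadata):
    """Return a list of (row_number, variable, problem) tuples."""
    problems = []
    reader = csv.DictReader(io.StringIO(data_text))
    # Consistent naming: every data column should be documented.
    for name in reader.fieldnames:
        if name not in metadata:
            problems.append((0, name, "undocumented variable"))
    for i, row in enumerate(reader, start=1):
        for name, value in row.items():
            meta = metadata.get(name)
            if meta is None or value == meta["missing_code"]:
                continue  # undocumented (already reported) or declared missing
            if meta["type"] == "numeric":
                try:
                    number = float(value)
                except ValueError:
                    problems.append((i, name, "not numeric"))
                    continue
                if not float(meta["min"]) <= number <= float(meta["max"]):
                    problems.append((i, name, "out of range"))
    return problems


metadata = load_metadata(METADATA_CSV)
problems = validate(DATA_CSV, metadata)
print(problems)
```

In this toy example the missing depth in the second data row is accepted because `NA` is the documented missing-value code, while the depth of 75 in the third row is flagged because it exceeds the documented maximum of 50.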
- Activity:
2. Principal Component Analysis (PCA)
- Discussion Topics:
- Can categorical variables be included in PCA?
- Generally, PCA is for continuous variables
- Categorical variables need to be encoded numerically (e.g., one-hot encoding), which may affect interpretation
- Consider alternative techniques like Multiple Correspondence Analysis (MCA) for purely categorical data
- Resource:
- Watch and evaluate StatQuest video on PCA (Link here)
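To make the one-hot encoding point concrete, here is a small sketch in plain Python with a hypothetical `habitat` variable. It only shows the encoding step and two properties that matter for interpretation; it does not run a PCA.

```python
def one_hot(values):
    """Return (levels, rows), where each row has one 0/1 column per level."""
    levels = sorted(set(values))
    rows = [[1 if v == level else 0 for level in levels] for v in values]
    return levels, rows


# Hypothetical categorical variable with three levels.
habitat = ["reef", "sand", "reef", "seagrass", "sand", "reef"]
levels, encoded = one_hot(habitat)

print(levels)
for row in encoded:
    print(row)

# Every row sums to exactly 1, so the indicator columns are linearly
# dependent: once centered, k columns carry at most k - 1 dimensions.
row_sums = [sum(row) for row in encoded]

# The variance of an indicator column depends only on how frequent the
# level is: p * (1 - p), where p is the proportion of rows at that level.
n = len(habitat)
variances = [(sum(col) / n) * (1 - sum(col) / n) for col in zip(*encoded)]
print(variances)
```

Because each indicator's variance is set by level frequency rather than by any measured quantity, components computed from such columns partly reflect how common each level is. This is one reason the encoding can affect interpretation.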
3. Visualizing Distributions of Variables
What to look for in plots:
- Outliers
- Skewness or symmetry
- Bimodal or multimodal distributions
- Unexpected patterns or values
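Numerical summaries can complement a plot when checking for outliers and skewness. The sketch below uses the common 1.5 × IQR rule (the rule behind boxplot whiskers) and the moment-based skewness coefficient. The measurements and the name `shoot_length_cm` are hypothetical.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


def skewness(values):
    """Moment-based skewness: about 0 if symmetric, > 0 for a long right tail."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


# Hypothetical measurements with one suspicious value (possible entry error).
shoot_length_cm = [10.1, 10.4, 9.8, 10.0, 10.3, 9.9, 10.2, 10.0, 31.0]

print(iqr_outliers(shoot_length_cm))
print(skewness(shoot_length_cm))
```

A flagged value is a prompt to go back to the raw records, not an automatic reason to delete it.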
Questions to ask when selecting plots:
- What story does this plot tell? What information does it convey? Is this information what I am aiming to show?
- Is the plot relevant to the hypothesis or goal of my analysis?
- Is it easy to interpret by the target audience?
- Are there simpler visualizations for this variable? “Less is more!”
- Use earlier results, plots, and analyses to decide which plots to show next, and which to keep in the main text versus move to the supplementary material.
Data validation through visualization:
- Identify data entry errors or inconsistencies
- Spot missing or unusual values
- Confirm assumptions (e.g., normality, bimodality…)
- Help ensure robustness and reliability
- Whenever possible, show all data points (Transparency is key for reproducibility)
Choose your plots wisely!
Useful online galleries of plots, per type of data or information to be displayed:
- Includes R code for all plots: from Data to Viz
- Visual vocabulary
- The Do’s of chart design
- Financial Times: Chart-Doctor
4. Critical Review of Nature Scientific Data Papers
- Papers for student-led discussion:
- Paper 1: High-quality genome assembly of the azooxanthellate coral Tubastraea coccinea (Lesson, 1829)
DOI: https://doi.org/10.1038/s41597-025-04839-7
- Paper 2: Seagrass morphometrics at species level in Moreton Bay, Australia from 2012 to 2013
DOI: https://doi.org/10.1038/sdata.2017.60
- Guiding questions for critical review of a data descriptor article:
- Did the authors successfully show the relevance of their data to their scientific community?
- Is the data of high quality?
- Are all visualizations necessary and informative?
- Are they easy to interpret?
- Are they well-described?
- Is any crucial information, about the data or the experimental design, missing?
- Would you feel confident using this dataset in a new analysis? Why? Why not?
- Experimental design figure, in vector format (PDF, EPS, SVG).
Top Tips for an Experimental Design Figure
- Keep it simple and clear
- Show the workflow or timeline of the experiment
- Include key elements: groups, treatments, time points, sample sizes
- Use consistent labels and colors
- Make sure it’s readable and matches the methods
- Export as a vector (PDF/SVG) for best quality
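One possible way to produce such a figure programmatically is sketched below: it builds a simple group-by-time-point diagram directly as SVG using Python's standard library. The groups, sample sizes, time points, and layout constants are all hypothetical. Plotting libraries or drawing tools are equally valid routes, as long as the export is a vector format.

```python
import xml.etree.ElementTree as ET

# Hypothetical design: two groups sampled at three time points.
GROUPS = [("Control", 12), ("Treatment", 12)]  # (label, sample size)
TIME_POINTS = ["Day 0", "Day 7", "Day 14"]


def design_figure(groups, time_points, width=480, row_height=60):
    """Return an SVG string with one timeline row per experimental group."""
    height = row_height * (len(groups) + 1)
    svg = ET.Element("svg", {
        "xmlns": "http://www.w3.org/2000/svg",
        "width": str(width),
        "height": str(height),
    })
    left = 140  # x position of the first time point
    step = (width - 160) / max(len(time_points) - 1, 1)
    # Time-point labels along the top.
    for j, label in enumerate(time_points):
        text = ET.SubElement(svg, "text", {
            "x": str(left + j * step), "y": "25", "text-anchor": "middle",
        })
        text.text = label
    # One labelled timeline per group, with a marker at each time point.
    for i, (name, n) in enumerate(groups):
        y = row_height * (i + 1) + 10
        label = ET.SubElement(svg, "text", {"x": "10", "y": str(y + 5)})
        label.text = f"{name} (n = {n})"
        ET.SubElement(svg, "line", {
            "x1": str(left), "y1": str(y),
            "x2": str(left + step * (len(time_points) - 1)), "y2": str(y),
            "stroke": "black",
        })
        for j in range(len(time_points)):
            ET.SubElement(svg, "circle", {
                "cx": str(left + j * step), "cy": str(y),
                "r": "6", "fill": "black",
            })
    return ET.tostring(svg, encoding="unicode")


svg_text = design_figure(GROUPS, TIME_POINTS)
print(svg_text[:80])
```

Writing `svg_text` to a file with an `.svg` extension gives a figure that scales without loss of quality and can be opened in a browser or a vector graphics editor for further polishing.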
- Next week, each student will receive an individual R code review. Please complete and format your R code to the best of your current ability. Focus areas for next week’s feedback will be:
- Code correctness and clarity
- Annotation and commenting practices
- Proper use of literate programming (e.g., R Markdown)
- Adherence to best practices in reproducible research workflows (Git)
Please make notes of any difficulties and questions to guide our discussion next week.