Week 5 | Exploratory analysis critique

Class Details

📅 Date: 08 April, 2025
Time: 09:30h - 12:00h
📖 Synopsis: Metadata documentation, data visualization for validation, critical review of scientific datasets and their publications.

1. Metadata Files

  • Discussion Topics:
    • What is a metadata file, and why is it important for reproducibility?
    • Minimum information to include:
      • Variable names and descriptions
      • Units of measurement
      • Data types (e.g., numeric, categorical)
      • Valid value ranges
      • Missing value encoding
    • Encoding not applicable or missing values:
      • Use standard codes (e.g., NA, null, or a defined numeric code such as 999 that cannot occur as a real measurement), and document them clearly
    • Categorical variables:
      • Define all possible levels
      • Include a label or description for each level, especially when acronyms or abbreviations are used
    • Best practices:
      • Keep metadata in a separate, human-readable file (e.g., CSV, TXT)
      • Use consistent naming between metadata and data files
      • Regularly update metadata as data evolves
  • Activity:
    • Analyze and discuss a good example of data and metadata from a Figshare dataset record (Link here).
    • Look at the paper that analyzes the dataset above (Link here).
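
The minimum metadata fields listed in this section can be kept in a small data dictionary and checked against the data automatically. The sketch below is only an illustration, written in Python with the standard library: the column names, the plant variables, and the `load_metadata` and `validate` helpers are all hypothetical and do not come from the Figshare record discussed in class.

```python
import csv
import io

# Hypothetical data dictionary: one row per variable in the data file.
METADATA_CSV = """variable,description,unit,type,min,max,levels,missing_code
plant_id,Unique plant identifier,,categorical,,,,NA
height_cm,Plant height at harvest,cm,numeric,0,300,,NA
treatment,Watering regime,,categorical,,,CTL=control;DRY=drought,NA
"""

# Hypothetical data file described by the dictionary above.
DATA_CSV = """plant_id,height_cm,treatment
p01,120.5,CTL
p02,NA,DRY
p03,999,DRY
p04,88.0,WET
"""


def load_metadata(text):
    """Return {variable name: metadata row} from the data dictionary."""
    return {row["variable"]: row for row in csv.DictReader(io.StringIO(text))}


def validate(data_text, metadata):
    """Check every value against the type, range and levels in the metadata."""
    problems = []
    reader = csv.DictReader(io.StringIO(data_text))
    # Variable names must match between the data file and the metadata file.
    for name in reader.fieldnames:
        if name not in metadata:
            problems.append(f"column '{name}' is not documented")
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        for name, value in row.items():
            meta = metadata.get(name)
            if meta is None or value == meta["missing_code"]:
                continue  # undocumented column, or a declared missing value
            if meta["type"] == "numeric":
                try:
                    number = float(value)
                except ValueError:
                    problems.append(f"line {line_no}: {name}='{value}' is not numeric")
                    continue
                too_low = meta["min"] != "" and number < float(meta["min"])
                too_high = meta["max"] != "" and number > float(meta["max"])
                if too_low or too_high:
                    problems.append(f"line {line_no}: {name}={value} is outside the valid range")
            elif meta["levels"]:
                allowed = {pair.split("=")[0] for pair in meta["levels"].split(";")}
                if value not in allowed:
                    problems.append(f"line {line_no}: {name}='{value}' is not a documented level")
    return problems


problems = validate(DATA_CSV, load_metadata(METADATA_CSV))
for problem in problems:
    print(problem)
```

In this made-up data, the value 999 is reported because the dictionary declares only NA as the missing code, which is the kind of undocumented encoding a metadata file is meant to prevent.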

2. Principal Component Analysis (PCA)

  • Discussion Topics:
    • Can categorical variables be included in PCA?
      • Generally, PCA is for continuous variables
      • Categorical variables need to be encoded numerically (e.g., one-hot encoding), which may affect interpretation
      • Consider alternative techniques like Multiple Correspondence Analysis (MCA) for purely categorical data
  • Resource:
    • Watch and evaluate StatQuest video on PCA (Link here)
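
To make the one-hot encoding point concrete, the sketch below encodes a categorical variable, standardizes all columns, and approximates the first principal component by power iteration on the covariance matrix. It is a minimal illustration in standard-library Python, not the method used in the video: the `height`, `weight` and `site` values are invented, and in practice a statistics package would replace the hand-written linear algebra.

```python
import math


def one_hot(values):
    """Encode a categorical variable as one 0/1 column per level."""
    levels = sorted(set(values))
    return levels, [[1.0 if v == level else 0.0 for level in levels] for v in values]


def standardize(rows):
    """Centre each column on 0 and scale it to unit sample variance."""
    n = len(rows)
    scaled_cols = []
    for col in zip(*rows):
        mean = sum(col) / n
        sd = math.sqrt(sum((x - mean) ** 2 for x in col) / (n - 1))
        scaled_cols.append([(x - mean) / sd for x in col])
    return [list(row) for row in zip(*scaled_cols)]


def covariance(rows):
    """Covariance matrix of columns that are already centred."""
    n, p = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(p)]
            for i in range(p)]


def first_component(cov, iterations=500):
    """Power iteration: approximate the leading eigenvector (PC1 loadings)."""
    p = len(cov)
    v = [float(i + 1) for i in range(p)]  # arbitrary non-zero starting vector
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * cov[i][j] * v[j] for i in range(p) for j in range(p))
    return v, eigenvalue


# Invented example: two continuous variables and one categorical variable.
height = [1.2, 1.5, 1.1, 2.0, 2.2, 1.9, 3.1, 3.0, 2.8]
weight = [10, 12, 9, 20, 22, 19, 30, 31, 29]
site = ["A", "A", "A", "B", "B", "B", "C", "C", "C"]

levels, site_cols = one_hot(site)
table = [[h, w] + dummies for h, w, dummies in zip(height, weight, site_cols)]
cov = covariance(standardize(table))
loadings, eigenvalue = first_component(cov)
explained = eigenvalue / sum(cov[i][i] for i in range(len(cov)))

names = ["height", "weight"] + [f"site={level}" for level in levels]
for name, loading in zip(names, loadings):
    print(f"{name:>8}: {loading:+.2f}")
print(f"PC1 explains {explained:.0%} of the total variance")
```

Note that the single `site` variable now contributes three columns to the total variance, and those columns are correlated with each other by construction. This is one reason the loadings become harder to interpret once categorical variables are encoded.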

3. Visualizing Distributions of Variables

  • What to look for in plots:

    • Outliers
    • Skewness or symmetry
    • Bimodal or multimodal distributions
    • Unexpected patterns or values
  • Questions to ask when selecting plots:

    • What story does this plot tell? What information does it convey? Is this information what I am aiming to show?
    • Is the plot relevant to the hypothesis or goal of my analysis?
    • Is it easy to interpret by the target audience?
    • Are there simpler visualizations for this variable? “Less is more!”
    • Use previous results, plots, and analyses to decide which plots to show next, and which belong in the main text rather than in the supplementary material.
  • Data validation through visualization:

    • Identify data entry errors or inconsistencies
    • Spot missing or unusual values
    • Confirm assumptions (e.g., normality, bimodality…)
    • Help ensure robustness and reliability
    • Whenever possible, show all data points (transparency is key for reproducibility)
  • Choose your plots wisely!

    Useful online galleries of plots, per type of data or information to be displayed:
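
As a numeric companion to the visual checks listed in this section, a few of them (missing values, skewness, outliers) can be computed before or alongside a histogram or boxplot. The sketch below uses only the Python standard library; the `describe` helper and the reaction-time values are hypothetical, and the 1.5 × IQR rule is just the usual boxplot convention, not a definitive outlier test.

```python
import statistics


def describe(values, missing_code=None):
    """Numeric companions to a histogram or boxplot of one variable."""
    observed = [v for v in values if v is not None and v != missing_code]
    q1, median, q3 = statistics.quantiles(observed, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # boxplot whisker limits
    mean = statistics.fmean(observed)
    sd = statistics.pstdev(observed)
    # Moment-based skewness: > 0 suggests a longer right tail, < 0 a longer left tail.
    skewness = sum(((v - mean) / sd) ** 3 for v in observed) / len(observed)
    return {
        "n_missing": len(values) - len(observed),
        "median": median,
        "skewness": skewness,
        "outliers": [v for v in observed if v < low or v > high],
    }


# Invented reaction times in seconds; None marks a missing measurement.
reaction_times = [0.41, 0.39, 0.45, 0.43, 0.40, 0.44, 0.42, 4.2, None, 0.38]
summary = describe(reaction_times)
for key, value in summary.items():
    print(f"{key}: {value}")
```

A flagged value such as 4.2 is a prompt to go back to the plot and the raw records (a possible unit or data entry error), not a reason to delete the point automatically.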


4. Critical Review of Nature Scientific Data Papers

  • Papers for student-led discussion:
  • Guiding questions for critical review of a data descriptor article:
    1. Did the authors successfully show the relevance of their data to their scientific community?
    2. Is the data of high quality?
    3. Are all visualizations necessary and informative?
      • Are they easy to interpret?
      • Are they well-described?
    4. Is any crucial information, about the data or the experimental design, missing?
    5. Would you feel confident using this dataset in a new analysis? Why? Why not?
  1. Experimental design figure, in a vector format (PDF, EPS, or SVG).

Top Tips for an Experimental Design Figure

  • Keep it simple and clear
  • Show the workflow or timeline of the experiment
  • Include key elements: groups, treatments, time points, sample sizes
  • Use consistent labels and colors
  • Make sure it’s readable and matches the methods
  • Export as a vector (PDF/SVG) for best quality
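
One way to follow these tips is to generate the figure from code, so that it stays in sync with the methods and is a vector graphic from the start. The sketch below writes a simple SVG with one row per group and one marker per time point. It is only an illustration: the `design_figure` helper, the group names, sample sizes and time points are all hypothetical, and a plotting library or drawing tool would work equally well.

```python
from xml.sax.saxutils import escape


def design_figure(groups, time_points, path=None):
    """Return an SVG with one row per group and one marker per time point."""
    left, top, row_height, step = 220, 50, 50, 110
    width = left + step * len(time_points)
    height = top + row_height * len(groups) + 20
    colours = ["#1b9e77", "#d95f02", "#7570b3", "#e7298a"]  # one colour per group
    parts = [
        f'<svg xmlns="http://www.w3.org/2000/svg" width="{width}" '
        f'height="{height}" font-family="sans-serif" font-size="14">'
    ]
    # Column headers: the sampling time points.
    for j, label in enumerate(time_points):
        x = left + step * j + step // 2
        parts.append(
            f'<text x="{x}" y="{top - 20}" text-anchor="middle">{escape(label)}</text>'
        )
    # One labelled timeline per group, with the sample size next to the name.
    for i, (name, n) in enumerate(groups):
        y = top + row_height * i + row_height // 2
        colour = colours[i % len(colours)]
        parts.append(f'<text x="10" y="{y + 5}">{escape(name)} (n = {n})</text>')
        parts.append(
            f'<line x1="{left}" y1="{y}" x2="{width - 20}" y2="{y}" '
            f'stroke="{colour}" stroke-width="2"/>'
        )
        for j in range(len(time_points)):
            x = left + step * j + step // 2
            parts.append(f'<circle cx="{x}" cy="{y}" r="7" fill="{colour}"/>')
    parts.append("</svg>")
    svg = "\n".join(parts)
    if path is not None:  # only write a file when a path is given
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(svg)
    return svg


# Hypothetical design: two groups sampled at three time points.
groups = [("Control", 12), ("Drought", 12)]
time_points = ["Day 0", "Day 7", "Day 14"]
svg = design_figure(groups, time_points)
print(svg)
```

Passing `path="experimental_design.svg"` saves the file, which can then be opened in a browser or a vector editor for final adjustments.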
  2. Next week, each student will receive an individual R code review. Please complete and format your R code to the best of your current ability. Focus areas for next week’s feedback will be:
    • Code correctness and clarity
    • Annotation and commenting practices
    • Proper use of literate programming (e.g., R Markdown)
    • Adherence to best practices in reproducible research workflows (Git)

Please make notes of any difficulties and questions to guide our discussion next week.
