Topic 1 | Introduction to Bioinformatics

Class Details

📅 Date: 15/17 September 2025
📖 Synopsis: Introduction to Bioinformatics and Biological Databases

Lecture Topics:

Definition and scope of bioinformatics.
Bioinformatics in drug discovery.
Overview of databases per biological data types.
Primary and secondary databases.
Bioinformatics as an experimental science: search, compare, model, and integrate data.

1. What is bioinformatics?

Bioinformatics is the interdisciplinary science of managing and analysing biological data. The term was coined by Dutch theoretical biologists Paulien Hogeweg and Ben Hesper in the 1970s to describe ‘informatic processes in biotic systems’. The field began with sequence analysis but now also includes modelling, image analysis, and comparison of both linear sequences and 3D structures (Fig. 1).

Figure 1: Overview of the data types in bioinformatics: originally centered on sequence data, now expanded to include structural, chemical, and systems biology. (From: https://www.ebi.ac.uk/training/online/courses/bioinformatics-terrified)

2. Who uses bioinformatics?

Molecular life sciences are now strongly data-driven and rely on open-access databases in both basic and applied research.

You do not need to be a bioinformatician to use bioinformatics databases, methods, and tools. However, as large datasets become central to biomedical and pharmaceutical research, it is increasingly important for all molecular life scientists to understand both the possibilities and the limits of bioinformatics, and to collaborate with bioinformatics experts in designing, analysing, and interpreting experiments.

2.1 Bioinformatics in drug discovery

This is equally true in the pharmaceutical sciences, where bioinformatics supports drug discovery, biomarker development, and precision medicine.
For example, bioinformatics is used to:
- Identify and validate potential drug targets by integrating genomic, transcriptomic, and proteomic data.
- Discover and evaluate biomarkers for diagnosis, prognosis, and treatment response.
- Apply precision medicine approaches, matching patients to therapies based on their molecular profiles.
- Support drug repurposing by linking existing compounds to new disease indications.

3. The role of public databases

A few bioinformatics centres of excellence worldwide are responsible for collecting, organising, and providing open access to published biological data. Key centres include:

EMBL-European Bioinformatics Institute (EMBL-EBI)
US National Center for Biotechnology Information (NCBI)
National Institute of Genetics in Japan (NIG)

This work began in the early 1980s, when DNA sequence data started to accumulate in scientific publications. To manage this, the European Nucleotide Archive (ENA) was created, as well as GenBank at NCBI and DDBJ at NIG.

Who owns the data? From the start, the bioinformatics community has promoted open data sharing and made it a reality through international collaborations, for example:

For nucleotide sequences: International Sequence Database Collaboration;
For protein sequences: UniProt Consortium;
For macromolecular structures: Worldwide Protein Data Bank.

This openness has allowed researchers to fully benefit from large-scale projects such as the Human Genome Project and The Encyclopedia of DNA Elements (ENCODE). Importantly, open sharing is not limited to big collaborations. Increasingly, funding policies reflect the principle that if research is publicly funded, the resulting data should also be made publicly accessible for others to use.

4. Making data useful

Open access is only the first step. For data to be truly useful, it must also be recorded and organised in a way that supports consistent interpretation and reuse. This is where the distinction between primary and secondary databases becomes important.

4.1 Primary vs Secondary databases

Primary databases in bioinformatics are repositories that store raw, unprocessed biological data generated directly from experimental methods, such as nucleotide sequences, protein sequences, and macromolecular structures. Once given a database accession number, the data in primary databases are never changed: they form part of the scientific record.

Secondary databases contain data derived from analysing primary data. They provide curated, processed, or interpreted information, often combining diverse sources of primary data, and so add functional insights and higher-level biological context to the public record of science. Secondary databases have become the molecular biologist’s reference library, offering extensive information on nearly every studied gene or gene product. Although often overwhelming, these resources hold enormous potential for discovery through data mining.

4.2 Primary databases

Store original and unprocessed data from experiments.
Examples: ENA (European Nucleotide Archive) and GenBank (nucleotide sequences), Protein Data Bank (PDB; 3D macromolecular structures), ArrayExpress (functional genomics data).
Data is archival and serves as the scientific record, maintaining the integrity of experimental data.

4.3 Secondary databases

Contain information derived by analyzing, annotating, and curating data from primary databases.
Examples: InterPro (protein families, integrative protein data), UniProt (protein sequences and functional information).
Data includes interpretations, predictions, and functional annotations which enrich biological context and usability for researchers.

4.4 Comparison table

Category	Primary Database	Secondary Database
Data	Raw, experimental	Processed, curated, interpreted
Source	Direct experiment/researcher input	Analysis of primary database data
Examples	ENA, GenBank and DDBJ (nucleotide sequence); ArrayExpress and GEO (functional genomics data); PDB (3D macromolecular structures)	UniProt (sequence and functional information on proteins); Ensembl (variation, function, regulation and more layered onto whole genome sequences); InterPro (protein families, motifs and domains)
Function	Archival record	Adds annotations and context
Modification	Data not altered	Data is analyzed, can be modified

These two types of databases are fundamental to bioinformatics, working together to organize, preserve, and interpret biological data for research and discovery.

4.5 Hybrid databases

Some databases combine both primary and secondary features. For example, UniProt holds experimental peptide sequences but also infers sequences and adds extensive automated (TrEMBL) and manual (Swiss-Prot) annotations. Others separate the two functions into different branches, such as ArrayExpress (raw functional genomics data) and the Expression Atlas (derived expression patterns).

4.6 Molecular databases per data type

Bioinformatics spans many different types of biological data, each with its own uses:
- Genomic data: DNA sequences, variants, and structural alterations.
- Transcriptomic data: RNA expression and splicing patterns.
- Proteomic data: protein expression, modifications, and interactions.
- Metabolomic data: metabolites and metabolic pathways.
- Structural data: 3D structures of proteins and other macromolecules.
- Phenotypic and clinical data: patient traits, disease states, and treatment outcomes.

Together, these diverse datasets provide complementary perspectives and, when integrated, can generate powerful insights into biology and disease.

Figure 2 shows an overview of molecular biology data resources: genomic (DNA & genes), transcriptomic (RNA expression), proteomic (proteins & structures), pathway/integrated knowledge, and chemical/drug databases, highlighting some of the main open-access repositories that support research and drug discovery.

Figure 2: Overview of bioinformatics online public resources.

5. Bioinformatics as an experimental science

Bioinformatics goes beyond data storage to include experiments on data. Simple searches retrieve information, but drawing conclusions (such as finding protein homologues) requires the same rigour as lab experiments, including appropriate methods and controls. Core databases serve as gateways to structured knowledge, enabling systematic exploration of the literature.

Bioinformatics experiments can be grouped into four types: searching, comparing, modelling, and integrating.

5.1 Searching

The simplest type of bioinformatics experiment is to search a public database for information on a specific gene or protein. For example, with EBI Search (Figure 3), you can search across a large number of public databases simultaneously, without first having to choose which database to search (see also the Integrating section below).

Figure 3: Data types that can be searched using EBI Search; an interactive version of this map can be explored on the EBI Search site.

Controls

A simple search is not an experiment and does not require controls. But once search results are used to answer a biological question, it becomes an experiment and controls are essential.

Search only	Experiment + controls
Retrieve information only	Use search results to answer a biological question
Example: find all protein sequences with the keyword globin	Example: identify kinases that are in a pathway and upregulated in a disease
No controls required	Controls needed: – Check if terms map to unrelated pathways – Confirm terms are correctly linked to entries – Look for consistency in related records

5.2 Comparing

The most frequently used type of comparison in bioinformatics is sequence comparison, in which nucleotide or protein sequences are aligned to database entries to help reveal relationships in function, evolution, or both. Alignments account for insertions, deletions, and substitutions that may have occurred since divergence from a common ancestor. Matches can suggest evolutionary or functional relationships.

Alignments may be pairwise or multiple, with different tools suited to different contexts: for example, BLAST is designed for finding regions of local similarity between sequences across large databases with high sensitivity, while BLAT is optimized for quickly aligning nearly identical sequences to a reference genome with lower sensitivity but much greater speed.
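To make the idea of an alignment concrete, here is a minimal sketch of a pairwise global alignment using classic dynamic programming (Needleman-Wunsch). This is not how BLAST or BLAT work internally (both rely on faster heuristics), and the scoring values (match, mismatch, gap) are illustrative assumptions rather than recommended settings.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment with a linear gap penalty."""
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # match or substitution
                score[i - 1][j] + gap,      # residue of a opposite a gap
                score[i][j - 1] + gap,      # residue of b opposite a gap
            )
    # Trace back from the bottom-right corner to recover one optimal alignment.
    top, bottom = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return score[len(a)][len(b)], "".join(reversed(top)), "".join(reversed(bottom))


best_score, row_a, row_b = global_align("ACGT", "AGT")
print(best_score)  # 1 (three matches, one gap)
print(row_a)       # ACGT
print(row_b)       # A-GT
```

The gap in the second row marks a position where one sequence has a residue the other lacks, which is how an alignment represents an insertion or deletion.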

Controls

The key challenge is judging whether an alignment is significant, not just producing it. Significance is commonly reported as the expectation value (E-value): the number of matches with an equal or better score that would be expected by chance in a database of that size. The lower the E-value, the less likely it is that the match arose by chance, and therefore the more likely it reflects true homology. Controls include aligning random or shuffled sequences and checking the scores obtained for unrelated sequences. Tool choice also matters, as tools differ in speed, accuracy, and suitability for specific applications.
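As a rough illustration of why database size matters when judging significance, the sketch below uses the Karlin-Altschul relation for BLAST-style bit scores, E = m × n × 2^(−S), where m is the query length, n the database length, and S the bit score. Real tools apply effective lengths and other corrections, so the function and its parameter names are a simplified assumption, not the exact calculation any particular program performs.

```python
def e_value(bit_score, query_length, database_length):
    """Expected number of chance hits scoring at least `bit_score`.

    Simplified Karlin-Altschul relation: E = m * n * 2 ** (-S).
    """
    return query_length * database_length * 2.0 ** (-bit_score)


# The same bit score is less convincing in a larger database:
small_db = e_value(bit_score=40, query_length=300, database_length=1_000_000)
large_db = e_value(bit_score=40, query_length=300, database_length=2_000_000)
print(large_db / small_db)  # 2.0: doubling the database doubles the E-value

# Each additional bit of score halves the E-value:
print(e_value(41, 300, 1_000_000) / small_db)  # 0.5
```

This is why the same alignment can be reported with different E-values when searched against different databases.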

5.3 Modelling

Structural modelling helps generate hypotheses about the 3D structure of macromolecules and their biochemical functions. A major breakthrough came with AlphaFold (Google DeepMind, 2021), which predicts protein structures from sequences. Its impact was recognised with the 2024 Nobel Prize in Chemistry: one half was awarded jointly to Demis Hassabis and John M. Jumper for protein structure prediction, and the other half to David Baker for computational protein design. In collaboration with EMBL-EBI, AlphaFold predictions are freely available in the AlphaFold Protein Structure Database.

Another widely used option for structural modelling is SWISS-MODEL, a web-based service developed at the Swiss Institute of Bioinformatics (SIB). It provides automated homology modelling of protein 3D structures by using known experimental structures as templates. This allows researchers to generate reliable structural models when experimental data are not available, making SWISS-MODEL a valuable complement to resources like AlphaFold.

In the Drug-Design module of this course, you will learn more about structural modelling.

Beyond individual molecules, researchers also model biological processes and interactions (systems biology). Standardised mathematical models can be accessed from the BioModels Database. Systems modelling follows an iterative cycle:

Build a model based on current knowledge;
Test it against biological data;
Refine it;
Repeat until the model reliably reflects reality.
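
The cycle above can be sketched with a deliberately simple toy model. The example assumes a quantity that decays exponentially and a single unknown rate; the function names, starting values, and the "observed" data are all hypothetical and stand in for a real systems model and real measurements.

```python
import math


def predict(rate, times, start=100.0):
    """Build: a toy model in which a quantity decays exponentially."""
    return [start * math.exp(-rate * t) for t in times]


def error(rate, times, observed):
    """Test: sum of squared differences between model and data."""
    return sum((p - o) ** 2 for p, o in zip(predict(rate, times), observed))


def refine(times, observed, rate=1.0, step=0.5, tolerance=1e-6):
    """Refine and repeat: keep only changes that improve the fit."""
    while step > tolerance:
        candidates = [rate, rate + step, max(rate - step, 0.0)]
        best = min(candidates, key=lambda r: error(r, times, observed))
        if best == rate:
            step /= 2    # no improvement: search more finely
        else:
            rate = best  # improvement: adopt the refined model
    return rate


times = [0, 1, 2, 3, 4]
observed = predict(0.3, times)  # stand-in for experimental measurements
fitted = refine(times, observed)
print(round(fitted, 3))  # 0.3
```

Real systems models have many parameters and noisy data, so dedicated fitting methods are used in practice; the loop only illustrates the build, test, refine, repeat structure.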

5.4 Integrating

Data integration has long been a challenge in bioinformatics, yet it is a powerful way to test hypotheses. By combining results, for example from transcriptomics, proteomics, and metabolomics, researchers can gather strong evidence for the involvement of a pathway in disease or drug resistance. Like systems modelling, integration supports hypothesis generation, but experimental validation remains essential.
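
A minimal sketch of this kind of integration, assuming each experiment has already been reduced to a set of flagged gene identifiers. The identifiers and the three sets are invented for illustration; real integration also has to reconcile identifiers, units, and statistical thresholds across data types.

```python
# Hypothetical gene identifiers flagged as changed in each data type.
transcriptomics = {"GENE_A", "GENE_B", "GENE_C"}
proteomics = {"GENE_B", "GENE_C", "GENE_D"}
metabolomics_linked = {"GENE_C", "GENE_E"}


def count_evidence(*layers):
    """Count how many data layers support each gene."""
    counts = {}
    for layer in layers:
        for gene in layer:
            counts[gene] = counts.get(gene, 0) + 1
    return counts


support = count_evidence(transcriptomics, proteomics, metabolomics_linked)

# Genes supported by every layer are the strongest candidates for follow-up.
strong = sorted(gene for gene, n in support.items() if n == 3)
print(strong)  # ['GENE_C']
```

Agreement across layers strengthens a hypothesis but does not replace experimental validation.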

The good news is that many resources now handle much of the complex work of linking and integrating data. One example is EBI Search, which maps related entities efficiently (also mentioned above in the Searching section).

Another is Open Targets, a platform that integrates large volumes of public data to support discovery. Open Targets is specifically designed to help researchers explore and visualise drug targets in the context of disease.

References

This content was adapted from EBI’s “Bioinformatics for the terrified: An introduction to the science of bioinformatics”

Activity 1: From Data to Databases: Using Dice to Illustrate Principles of Bioinformatics

Duration: ~1.5 hours
Learning goals:
- Experience generating raw data from simple systems (dice).
- Understand the importance of consistent data recording and organization.
- Appreciate how analysis depends on well-structured data.
- Connect classroom activities to bioinformatics databases.
Materials:
- Dice (D4, D6, D8, D10, D12, and D20): one per group (1 or 2 students).
- Paper sheets for data recording.
- Worksheets or laptops.

Activity overview

In this activity we will create a class dataset using dice. Along the way, you will recognize:

Why standardisation of data is essential to make data usable.
How sample size affects confidence in results, hence the importance of combining datasets.
What tidy data means.
How different dice (D4, D6, D8, D10, D12, D20) relate to variables with different possible measurements.
The difference between single-variable and multivariate analysis.

Step 1. Roll (15 min)

Each group is assigned one die.
Roll it 10 times.
Record your results in any format you like.

Hint: Record the data in the way that you find most useful and understandable.

Step 2. Compare & Standardise (15 min)

Join all groups with the same die type (all D4 groups, all D6 groups…).
Compare how different groups recorded their results.
Notice differences and discuss the pros and cons of alternative recordings.
Agree on one standard format.

Lesson: Bioinformatics databases use a standardised data format so that data from different experiments can be pooled.
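
As a sketch of what standardising can involve, the example below converts two recording styles that groups might plausibly choose (a free-form string of results and a tally of counts per face) into one agreed format, a plain list of integers. The formats and function names are assumptions for illustration; the string parser also assumes the text contains only roll results and no other numbers.

```python
import re


def parse_rolls(text):
    """Turn a free-form string of rolls into a list of integers."""
    return [int(token) for token in re.findall(r"\d+", text)]


def expand_tally(tally):
    """Turn a {face: count} tally into a sorted list of individual rolls."""
    rolls = []
    for face, count in sorted(tally.items()):
        rolls.extend([face] * count)
    return rolls


print(parse_rolls("3, 5, 1"))         # [3, 5, 1]
print(parse_rolls("3 5 1"))           # [3, 5, 1]
print(expand_tally({1: 2, 4: 1}))     # [1, 1, 4]
```

Note that the tally has already lost the order of the rolls, which cannot be recovered afterwards. Deciding what to record before collecting data avoids this kind of loss.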

Step 3. Combine (15 min)

All groups with the same die pool their results.
Example: one group’s 10 rolls of a D6 = limited evidence, but 50 rolls across all groups = strong evidence about probabilities.

Lesson: Larger sample size = greater confidence in conclusions.
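
One way to put a number on this is the standard error of an estimated proportion, sqrt(p(1 − p)/n), which shrinks as the number of rolls n grows. The sketch below applies it to the chance of rolling a six on a fair D6; treating the die as fair is an assumption of the example.

```python
import math


def standard_error(p, n):
    """Standard error of a proportion estimated from n independent rolls."""
    return math.sqrt(p * (1 - p) / n)


p_six = 1 / 6  # probability of a six on a fair D6
se_10 = standard_error(p_six, 10)
se_50 = standard_error(p_six, 50)

print(round(se_10, 3))  # 0.118
print(round(se_50, 3))  # 0.053
```

With 10 rolls the uncertainty (about 0.12) is almost as large as the probability being estimated (about 0.17); pooling 50 rolls cuts it by a factor of sqrt(5), roughly 2.2.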

Step 4. Make it Tidy (15 min)

Does the agreed-upon recording standard follow the tidy data rules?
- Each variable = a column
- Each observation = a row
- Each value = a cell

Lesson: This is the principle behind how most bioinformatics databases are structured.
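
A minimal sketch of reshaping a "one entry per group" record into tidy rows, where each row is a single roll and each column a single variable. The group names, dice, results, and column names are made up for illustration.

```python
# A compact but untidy record: several observations packed into one entry.
wide = {
    "group_1": {"die": "D6", "rolls": [3, 5, 1]},
    "group_2": {"die": "D20", "rolls": [17, 4, 12]},
}


def to_tidy(wide_records):
    """Return one row per roll, with columns group, die, roll, and result."""
    rows = []
    for group, record in wide_records.items():
        for number, result in enumerate(record["rolls"], start=1):
            rows.append(
                {"group": group, "die": record["die"], "roll": number, "result": result}
            )
    return rows


tidy = to_tidy(wide)
print(len(tidy))  # 6 rows: one per observation
print(tidy[0])    # {'group': 'group_1', 'die': 'D6', 'roll': 1, 'result': 3}
```

Once every roll is its own row, filtering by die type or pooling across groups no longer requires knowing how each group originally stored its data.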

Step 5. Reflect on Variables (10 min)

Different dice = different numbers of possible outcomes.
- D4 = 4 outcomes
- D20 = 20 outcomes

Lesson: Biology is similar:

Some variables have few states (genotype with 3 possible states).
Others are more complex (gene expression with many possible values).

Step 6. Single vs. Multivariate Analysis (10 min)

Single variable: Analyse only one column (e.g. “What is the most common result for D6?”).
Multivariate: Combine columns (e.g. “When D6 is high, is D20 also high?”).

Lesson: Bioinformatics often moves from univariate summaries (each gene/protein separately) to multivariate models (how variables interact).
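
The two kinds of question can be sketched as follows: the most common value of one column (single variable) and the Pearson correlation between two columns (multivariate). The paired rolls are invented so that a relationship is visible; real dice are independent, so pooled class data should give a correlation close to zero.

```python
import math
from collections import Counter

# Made-up paired results: roll i of the D6 recorded alongside roll i of the D20.
d6 = [6, 2, 5, 1, 6, 3]
d20 = [18, 4, 15, 2, 20, 9]


def most_common(values):
    """Single variable: the value that occurs most often in one column."""
    return Counter(values).most_common(1)[0][0]


def pearson(xs, ys):
    """Multivariate: linear correlation between two columns (-1 to 1)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


print(most_common(d6))         # 6
print(pearson(d6, d20) > 0.9)  # True for this invented data
```

The single-variable summary needs only one column, while the correlation only makes sense if the data record which D6 roll goes with which D20 roll, another reason the recording format matters.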

Step 7. Wrap-up Reflection (5 min)

Questions:

What was hardest: generating, recording, or analysing?
How does this compare to working with real biological data?
Why do we need curated bioinformatics databases?

Take home message

By rolling dice:

you’ve built a mini database;
experienced the challenge of standardisation;
explored sample size effects;
practised tidy data;
understood variable ranges; and
compared single vs. multivariate analysis.

By the end of this practical class, you should know:

1. What made it possible to build a class database?
- Standardisation.
2. Why combine results?
- Sample size = confidence.
3. Why tidy format?
- Analysis-ready data.
4. What do different dice represent?
- Different measurement complexities (possible outcomes).
5. What’s the difference between single and multivariate analysis?
- One variable vs. relationships between variables.