vignettes/4. diversity-metrics.Rmd
4. diversity-metrics.RmdlifesimulatoR
Diversity is important in origin-of-life simulations because a system with more molecular variety can explore more possibilities. A diverse molecular population may contain more potential structures, interactions, catalytic patterns, or replication-like behaviours.
However, diversity alone is not the same as life. A random chemical mixture can be highly diverse but poorly organized. A highly selected system may have lower diversity but stronger functional structure. For this reason, diversity metrics should be interpreted carefully. They are useful indicators, not complete measures of life or complexity.
In simplified origin-of-life models:
We begin by creating a prebiotic molecular pool. This represents a simplified early chemical environment containing symbolic molecular sequences.
pool <- create_prebiotic_pool(
n_molecules = 100,
alphabet = c("A", "U", "G", "C"),
min_length = 5,
max_length = 15,
seed = 123
)
head(pool)## [1] "UGUUUGA" "UUAUGCAG" "ACAAAGCUGUAUGCU" "GGACG"
## [5] "AGAAUGGCAGAGCU" "UAACCGAUAAGAU"
This pool contains many symbolic molecules. Each molecule can be treated as a possible chemical variant in a simplified prebiotic environment.
Before examining diversity, it is useful to summarize the molecular population. A population may contain many molecules, making it difficult to understand its overall characteristics by inspecting individual sequences.
The function summarize_molecules() calculates simple
population-level statistics.
molecules <- c(
"AUGC",
"AUGC",
"UUUU",
"GCGCGC",
"AUAUAUAUAU"
)
summary_stats <- summarize_molecules(
molecules = molecules,
generation = 0
)
summary_stats## # A tibble: 1 × 6
## generation n_molecules mean_length mean_fitness diversity max_fitness
## <dbl> <int> <dbl> <dbl> <int> <dbl>
## 1 0 5 5.6 0.787 4 1.20
Depending on the package version, the summary may include:
These statistics provide a snapshot of the molecular population.
Population summaries are useful because evolutionary change is often easier to observe at the population level than at the level of individual molecules.
For example:
The function summarize_molecules() is also useful
because it is used internally by simulate_abiogenesis() to
build a time series of population-level change.
The function shannon_entropy() calculates Shannon
entropy from a numeric vector of counts or abundances. In
lifesimulatoR, entropy can be used as a simple measure of
diversity or uncertainty in a molecular population.
counts <- c(10, 5, 1)
shannon_entropy(counts)## [1] 1.198192
A higher entropy value means the counts are more evenly distributed across categories. A lower entropy value means one or a few categories dominate.
Two populations can have the same number of categories but very different diversity.
low_diversity <- c(100, 1, 1, 1)
high_diversity <- c(25, 25, 25, 25)
shannon_entropy(low_diversity)## [1] 0.2361547
shannon_entropy(high_diversity)## [1] 2
The high-diversity population has a more even distribution. The low-diversity population is dominated by one category.
Entropy is useful because it captures more than just the number of unique molecule types.
Consider two populations:
Both populations have the same richness, but Population B is more evenly distributed. Shannon entropy can reflect this difference.
In origin-of-life simulations, entropy can help us ask:
A simulation can be used to ask whether diversity increases, decreases, or stabilizes over generations.
sim <- simulate_abiogenesis(
n_molecules = 100,
generations = 100,
mutation_rate = 0.02,
selection_strength = 1,
seed = 123
)
head(sim)## # A tibble: 6 × 6
## generation n_molecules mean_length mean_fitness diversity max_fitness
## <int> <int> <dbl> <dbl> <int> <dbl>
## 1 0 100 12.6 1.00 100 1.25
## 2 1 100 12.7 1.04 67 1.25
## 3 2 100 12.3 1.05 61 1.25
## 4 3 100 12.3 1.11 61 1.25
## 5 4 100 12.5 1.11 48 1.25
## 6 5 100 12.8 1.13 53 1.25
If the simulation output contains diversity metrics by generation, those can be plotted directly.
plot_simulation(
sim,
x = "generation",
y = "diversity"
)
A plot can help users see whether the system becomes more diverse, less diverse, or more stable over time.
Mutation tends to introduce new variants. This can increase molecular diversity, especially when selection is weak or moderate.
sim_low_mutation <- simulate_abiogenesis(
n_molecules = 100,
generations = 100,
mutation_rate = 0.001,
selection_strength = 1,
seed = 123
)
sim_high_mutation <- simulate_abiogenesis(
n_molecules = 100,
generations = 100,
mutation_rate = 0.10,
selection_strength = 1,
seed = 123
)
head(sim_low_mutation)## # A tibble: 6 × 6
## generation n_molecules mean_length mean_fitness diversity max_fitness
## <int> <int> <dbl> <dbl> <int> <dbl>
## 1 0 100 12.6 1.00 100 1.25
## 2 1 100 12.7 1.04 58 1.25
## 3 2 100 12.2 1.07 46 1.25
## 4 3 100 12.2 1.09 37 1.25
## 5 4 100 12.2 1.11 30 1.25
## 6 5 100 12.3 1.13 29 1.25
head(sim_high_mutation)## # A tibble: 6 × 6
## generation n_molecules mean_length mean_fitness diversity max_fitness
## <int> <int> <dbl> <dbl> <int> <dbl>
## 1 0 100 12.6 1.00 100 1.25
## 2 1 100 12.7 1.04 90 1.25
## 3 2 100 13.0 1.07 90 1.25
## 4 3 100 13.0 1.10 88 1.25
## 5 4 100 12.6 1.10 94 1.25
## 6 5 100 12.6 1.12 88 1.25
Low mutation may preserve successful sequences but explore new possibilities slowly. High mutation may create many new variants, but it may also disrupt stable or successful molecules.
This creates a useful conceptual trade-off:
A system needs variation to evolve, but too much variation may prevent information from being preserved.
Selection can reduce diversity if a small number of high-fitness molecules dominate. Mutation can increase diversity by generating new variants. The balance between mutation and selection is therefore central to molecular evolution.
weak_selection <- simulate_abiogenesis(
n_molecules = 100,
generations = 100,
mutation_rate = 0.02,
selection_strength = 0.2,
seed = 123
)
strong_selection <- simulate_abiogenesis(
n_molecules = 100,
generations = 100,
mutation_rate = 0.02,
selection_strength = 3,
seed = 123
)
head(weak_selection)## # A tibble: 6 × 6
## generation n_molecules mean_length mean_fitness diversity max_fitness
## <int> <int> <dbl> <dbl> <int> <dbl>
## 1 0 100 12.6 1.00 100 1.25
## 2 1 100 13.2 1.01 74 1.25
## 3 2 100 13.1 0.993 67 1.25
## 4 3 100 13.0 1.01 64 1.25
## 5 4 100 12.3 1.02 56 1.25
## 6 5 100 12.6 1.05 52 1.25
head(strong_selection)## # A tibble: 6 × 6
## generation n_molecules mean_length mean_fitness diversity max_fitness
## <int> <int> <dbl> <dbl> <int> <dbl>
## 1 0 100 12.6 1.00 100 1.25
## 2 1 100 11.8 1.10 66 1.25
## 3 2 100 11.4 1.15 55 1.25
## 4 3 100 11.6 1.17 55 1.25
## 5 4 100 11.7 1.20 47 1.25
## 6 5 100 11.7 1.21 45 1.25
In a weak-selection scenario, many molecular types may persist. In a strong-selection scenario, high-fitness molecules may dominate more quickly. This can reduce diversity, even while increasing average fitness.
It is tempting to say that higher diversity means higher complexity, but this is not always true. Complexity can involve diversity, organization, interaction, information storage, persistence, and functional integration.
For example:
Therefore, diversity metrics should be used alongside other outputs such as:
This tutorial can support discussions in:
Students can be asked to compare how different parameter choices affect diversity. This helps connect abstract concepts such as entropy and selection to visible simulation outputs.
Try changing mutation_rate and
selection_strength and ask:
The metrics in this package are simplified educational tools. They do not fully capture biochemical complexity, functional organization, thermodynamics, or information storage in real prebiotic systems. They are best used as conceptual aids for exploring how diversity, mutation, and selection interact.