Diversity analysis

When we talk about “diversity” in microbiomes, we’re asking two key questions: how many kinds of microbes are present, and how do their communities differ across samples or groups? Measuring diversity helps us understand the complexity, stability, and ecological dynamics of microbial communities—whether in soil, roots, gut, or any other habitat.

In microbiome research, diversity is usually split into two main types:

Alpha diversity: the variety within a sample or group.
Beta diversity: the differences between samples or groups.

Both give complementary insights: alpha diversity tells us about richness and evenness inside a community, while beta diversity reveals patterns of similarity or difference between communities.

Richness
Evenness

Richness refers to the number of species present in a community. Simply put, it answers the question: how many different microbes are there? A community with more species is considered richer, which can indicate greater ecological complexity or stability.

Evenness describes how equally the individuals are distributed among the species in a community. A community where all species have similar abundances is highly even, while one dominated by just a few species is less even. Evenness gives insight into how balanced a community is, which can affect interactions, resilience, and ecosystem function.

Alpha Diversity

Alpha diversity is often described at the sample level, but in practice, researchers almost always summarize it across groups of samples (e.g., treatment groups, host species, timepoints). So wording it as “within a group” works fine.

Alpha diversity captures two important aspects:

Richness: the number of species present in a community. Simply put, how many different microbes are there?
Evenness: how evenly the individuals are distributed among those species. A community with similar numbers of each species is more even, while one dominated by a few species is less even.

Different metrics capture these aspects in different ways:

Observed: counts the species we see (richness).
Chao1: estimates rare or unseen species (richness).
Shannon: balances richness and evenness.
Simpson, InvSimpson, and Fisher: provide additional perspectives on community structure, weighting common versus rare species differently.

To find out if groups differ in their alpha diversity (for example, between host species), we turn to statistics. First, we check whether the data follow a normal distribution—this decides whether to use parametric or non-parametric tests. Then we can compare across groups and see which ones truly stand apart.

Fun fact: Even if two groups have the same number of species (richness), one can still be more diverse if the species are more evenly distributed (evenness).

# Load microeco and create alpha diversity object per treatment group
t1 <- trans_alpha$new(dataset = Defoliation_mt_rfiltered, group = "Treatment")

# Inspect the first few rows of the transformed alpha diversity data
head(t1$data_alpha)

# Inspect the summary statistics for each group
head(t1$data_stat)

# Load dplyr for data manipulation
library(dplyr)

# Extract the alpha diversity values
alpha_values <- t1$data_alpha

# Run Shapiro-Wilk test for each metric and treatment
shapiro_results <- alpha_values %>%
  group_by(Measure, Treatment) %>%
  summarise(
    W = shapiro.test(Value)$statistic,              # Shapiro-Wilk statistic
    p_value = shapiro.test(Value)$p.value,          # Corresponding p-value
    Normality = ifelse(p_value > 0.05, "Pass", "Fail"),  # Label normality
    .groups = "drop"
  )

# View the results
shapiro_results

Visualizing Alpha Diversity Across Treatments

It’s always helpful to see the data in addition to running statistical tests. Here, we use boxplots to compare each alpha diversity metric across treatments.

library(ggplot2)
library(tidyr)

# Convert to long format if necessary (already long in t1$data_alpha)
alpha_long <- t1$data_alpha %>%
  select(Treatment, Measure, Value)

# Plot all alpha diversity metrics faceted by Measure
ggplot(alpha_long, aes(x = Treatment, y = Value, fill = Treatment)) +
  geom_boxplot() +
  facet_wrap(~ Measure, scales = "free_y") +
  theme_minimal() +
  labs(
    title = "Alpha Diversity Metrics Across Treatments",
    x = "Treatment",
    y = "Alpha Diversity Value"
  ) +
  theme(legend.position = "none")

Testing for Significant Differences in Diversity (ex. Pielou)

Since all our alpha diversity metrics appear normally distributed and passed the Shapiro-Wilk test, we can use ANOVA (a parametric test) to check whether diversity differs significantly across treatments. Let’s use Pielou’s evenness index as an example:

# ---------------------------------------------
# ANOVA for Pielou's Evenness
# ---------------------------------------------

# Load dplyr for filtering
library(dplyr)

# 1. Filter Pielou diversity data
pielou_data <- t1$data_alpha %>%
  filter(Measure == "Pielou")

# 2. Perform ANOVA
pielou_aov <- aov(Value ~ Treatment, data = pielou_data) # You can also run with interactions ex; Value ~ Treatment*Crop,  

# 3. Print ANOVA table
summary(pielou_aov)

Here’s how the ANOVA results should appear:

> summary(pielou_aov)
            Df    Sum Sq   Mean Sq F value Pr(>F)  
Treatment    2 0.0002252 1.126e-04   3.745 0.0346 *
Residuals   32 0.0009621 3.007e-05                 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Since the treatment shows a significant effect on the groups, we can next use Tukey’s HSD (Honestly Significant Difference) test to see which groups differ from each other.

# Tukey HSD test + group letters
hsd <- HSD.test(pielou_aov, "Treatment", group = TRUE)
hsd$groups

> hsd$groups
            Value groups
D90     0.9161052      a
Control 0.9122825     ab
D50     0.9099236      b

Next, let’s visualize the group differences using a boxplot with letters indicating significant differences from Tukey’s HSD test.

# 4. Extract letters into a tibble
letters_df <- hsd$groups %>%
  tibble::rownames_to_column("Treatment") %>%
  tibble::as_tibble() %>%
  dplyr::select(Treatment, groups)

# 5. Compute group means (for positioning letters)
mean_df <- pielou_data %>%
  dplyr::group_by(Treatment) %>%
  dplyr::summarise(mean = mean(Value), max_val = max(Value), .groups = "drop")

# 6. Join letters with group stats
letters_df <- dplyr::left_join(mean_df, letters_df, by = "Treatment")

# 7. Boxplot with letters (letters above max whisker for clarity)
ggplot(pielou_data, aes(x = Treatment, y = Value, fill = Treatment)) +
  geom_boxplot(width = 0.6, alpha = 0.7, color = "black") +
  geom_text(data = letters_df,
            aes(x = Treatment, y = max_val + 0.02, label = groups),
            inherit.aes = FALSE,
            vjust = 0) +
  labs(x = "Treatment", y = "Pielou's Evenness") +
  theme_minimal(base_size = 14) +
  theme(legend.position = "none")

Although we used the classical approach here, the microeco package offers a simpler and more streamlined way to perform this type of analysis.

NOTE: One limitation, however, is that some of the scripts doesn’t really show you the underlying principles, or parameters behind the analyses. Therefore, it’s worth paying attention to what’s happening behind the scenes—especially the default tests being used. At first glance, the script seems very straightforward, but there’s quite a bit going on under the hood that you should be aware of. For example, instead of the Tukey test we used in our previous example, the default script applies Duncan’s multiple range test. If you want more control over the analysis, you’ll need to explore the script and tweak some of the parameters to suit your needs.

All of the steps we did above—calculating alpha diversity, running ANOVA, and examining group differences—can be done in just a few lines using microeco.

t1 <- trans_alpha$new(dataset = Defoliation_mt_rfiltered, group = "Treatment")
# return t1$data_stat
head(t1$data_stat)
t1$cal_diff(method = "anova")
# return t1$res_diff
head(t1$res_diff)
t1$plot_alpha(measure = "Pielou")

As I mentioned earlier, one caveat is that some functions in microeco are a bit hidden within the methods, but you can still customize them to fit your needs. For a closer look at all the functions and options, simply run ?microeco in R to view the package documentation. To give you some insight, this is what cal_diff() is doing under the hood:

cal_diff()

trans_alpha$cal_diff(
measure = NULL,
method = c(“KW”, “KW_dunn”, “wilcox”, “t.test”, “anova”, “scheirerRayHare”, “lm”,
“lme”, “betareg”, “glmm”, “glmm_beta”)[1],
formula = NULL,
p_adjust_method = “fdr”,
KW_dunn_letter = TRUE,
alpha = 0.05,
anova_post_test = “duncan.test”,
anova_varequal_test = FALSE,
return_model = FALSE,
…
)

And the arguments in the description are as follows;
measure
default NULL; character vector; If NULL, all indexes will be used; see names of microtable$alpha_diversity, e.g. c(“Observed”, “Chao1”, “Shannon”).

method
default “KW”; see the following available options:

‘KW’
Kruskal-Wallis Rank Sum Test for all groups (>= 2)

‘KW_dunn’
Dunn’s Kruskal-Wallis Multiple Comparisons <10.1080/00401706.1964.10490181> based on dunnTest function in FSA package

‘wilcox’
Wilcoxon Rank Sum Test for all paired groups When by_ID parameter is provided in creating the object of the class, paired Wilcoxon test will be performed.

‘t.test’
Student’s t-Test for all paired groups. When by_ID parameter is provided in creating the object of the class, paired t-test will be performed.

‘anova’
Variance analysis. For one-way anova, the default post hoc test is Duncan’s new multiple range test. Please use anova_post_test parameter to change the post hoc method. For multi-way anova, Please use formula parameter to specify the model and see aov for more details

‘scheirerRayHare’
Scheirer-Ray-Hare test (nonparametric test) for a two-way factorial experiment; see scheirerRayHare function of rcompanion package

‘lm’
Linear Model based on the lm function

‘lme’
Linear Mixed Effect Model based on the lmerTest package

‘betareg’
Beta Regression for Rates and Proportions based on the betareg package

‘glmm’
Generalized linear mixed model (GLMM) based on the glmmTMB package. A family function can be provided using parameter passing, such as: family = glmmTMB::lognormal(link = “log”)

‘glmm_beta’
Generalized linear mixed model (GLMM) with a family function of beta distribution. This is an extension of the GLMM model in ‘glmm’ option. The only difference is in glmm_beta the family function is fixed with the beta distribution function, facilitating the fitting for proportional data (ranging from 0 to 1). The link function is fixed with “logit”.

formula
default NULL; applied to two-way or multi-factor analysis when method is “anova”, “scheirerRayHare”, “lm”, “lme”, “betareg” or “glmm”; specified set for independent variables, i.e. the latter part of a general formula, such as ‘block + N*P*K’.

p_adjust_method
default “fdr” (for “KW”, “wilcox”, “t.test” methods) or “holm” (for “KW_dunn”); P value adjustment method; For method = ‘KW’, ‘wilcox’ or ‘t.test’, please see method parameter of p.adjust function for available options; For method = ‘KW_dunn’, please see dunn.test::p.adjustment.methods for available options.

KW_dunn_letter
default TRUE; For method = ‘KW_dunn’, TRUE denotes significances are presented by letters; FALSE means significances are shown by asterisk for paired comparison.

alpha
default 0.05; Significant level; used for generating significance letters when method is ‘anova’ or ‘KW_dunn’.

anova_post_test
default “duncan.test”. The post hoc test method for one-way anova. The default option represents the Duncan’s new multiple range test. Other available options include “LSD.test” (LSD post hoc test) and “HSD.test” (HSD post hoc test). All those are the function names from agricolae package.

anova_varequal_test
default FALSE; whether conduct Levene’s Test for equality of variances. Only available for one-way anova. Significant P value means the variance among groups is not equal.

return_model
default FALSE; whether return the original “lm”, “lmer” or “glmm” model list in the object.

…
parameters passed to kruskal.test (when method = “KW”) or wilcox.test function (when method = “wilcox”) or dunnTest function of FSA package (when method = “KW_dunn”) or agricolae::duncan.test/agricolae::LSD.test/agricolae::HSD.test (when method = “anova”, one-way anova) or rcompanion::scheirerRayHare (when method = “scheirerRayHare”) or stats::lm (when method = “lm”) or lmerTest::lmer (when method = “lme”) or betareg::betareg (when method = “betareg”) or glmmTMB::glmmTMB (when method = “glmm”).

Returns
res_diff, stored in object with the format data.frame.
When method is “betareg”, “lm”, “lme” or “glmm”, “Estimate” and “Std.Error” columns represent the fitted coefficient and its standard error, respectively.

It may seem long and detailed at first, but once you understand it, you’ll realize that the more you know, the more control you have over what you’re doing.

Since our data were normally distributed, ANOVA was appropriate. If they hadn’t been, we’d use a non-parametric alternative like the Kruskal-Wallis test. Parametric tests assume a certain distribution, while non-parametric tests are more flexible and don’t rely on those assumptions.

t1 <- trans_alpha$new(dataset = Defoliation_mt_rfiltered, group = "Treatment")
t1$cal_diff(method = "KW_dunn", KW_dunn_letter = TRUE)
t1$plot_alpha(plot_type = "ggboxplot", add = "none")

Correlations with diversity

coming soon …

Beta Diversity

Alpha diversity tells us about diversity within a single sample, but beta diversity lets us compare samples to see how their communities differ. Next, we’ll dive into beta diversity analysis.

# Install and load the GUniFrac package (for UniFrac distance calculation)
# install.packages("GUniFrac")
library(GUniFrac)

# Calculate beta diversity (including UniFrac if needed)
Defoliation_mt_rfiltered$cal_betadiv(unifrac = TRUE)

# Create a new beta diversity object grouped by Treatment
t1 <- trans_beta$new(dataset = Defoliation_mt_rfiltered, group = "Treatment", measure = "bray")

# Perform ordination using PCoA
t1$cal_ordination(method = "PCoA")

# Check the class of ordination results
class(t1$res_ordination)

# Plot PCoA with confidence ellipses
t1$plot_ordination(
  plot_color = "Treatment", 
  plot_shape = "Treatment", 
  plot_type = c("point", "ellipse")
)