4. Microbiome analysis

In the previous section, we imported and formatted our data, created the microtable, and then filtered out the unwanted parts and subset the data. Now the fun begins.

Microbiome data can be analysed in several ways. These different approaches give you different answers and address different research questions. Sometimes the same approach goes by different names: for example, some might call it a correlation network while others call it a co-occurrence network, but the underlying principle is the same.

The different ways you can analyse your data are listed below:

  • Taxonomic Analysis – Identifies and quantifies the microbial taxa (e.g., species, genera, families) present in the samples.
  • Functional Analysis – Predicts or identifies the functional potential of the microbiome, focusing on metabolic pathways and gene functions.
  • Differential abundance analysis – Identifies microbial taxa or genes that differ significantly in abundance between groups.
  • Integration with Host and Environmental Data – Links microbiome data with host factors (e.g., genetics, health) or environmental variables (e.g., temperature, soil type).
  • Network and Co-occurrence Analysis – Examines interactions between microbial taxa or between microbes and environmental variables.
  • Composition-based Microbiome Analysis – Focuses on the relative abundances of microbial taxa (e.g., species, genera, or higher levels) in a given sample or set of samples. Instead of analyzing absolute counts of microbes, it examines how different taxa are distributed as proportions of the total community.

Let's begin with composition-based microbiome analysis.

Composition-based microbiome analysis

Before doing the actual analysis, we need to normalize our data. Why?

  1. Address Unequal Sequencing Depths – Different samples often have varying numbers of sequences due to technical variations in sequencing (e.g., library preparation, machine performance).
  2. Avoid Sampling Bias – High sequencing depth might artificially inflate diversity measures (e.g., alpha diversity) because rare taxa are more likely to be detected in deeply sequenced samples (see the toy demonstration after this list).
  3. Ensure Consistency in Statistical Analysis – Many statistical tests assume that data points are derived from similar distributions. Unequal sequencing depths violate this assumption.
  4. Simplify Diversity Metrics – Calculating metrics like observed richness, Shannon index, or Simpson index directly on unnormalized data can lead to biased results due to sequencing depth.
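
To make point 2 concrete, here is a toy simulation (the community and read counts are entirely made up, just for illustration): the same community sequenced at two different depths appears to contain different numbers of taxa.

# Toy demonstration (hypothetical data): the same community sampled at a shallow
# and a deep sequencing depth shows different observed richness, purely due to depth.
set.seed(1024)
community <- paste0("OTU", 1:300)                    # 300 taxa with uneven abundances
probs     <- rev(sort(rexp(300)))
shallow   <- sample(community, 2000,  replace = TRUE, prob = probs)
deep      <- sample(community, 50000, replace = TRUE, prob = probs)
c(shallow = length(unique(shallow)), deep = length(unique(deep))) # the deep sample "finds" more taxa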

There are a dozen ways to normalize data (also called transformation, scaling, etc., depending on the strategy). However, rarefaction is widely accepted and:

  • ensures that metrics such as observed richness and the Shannon index reflect biological differences, not technical artifacts.
  • creates a dataset with uniform sequencing depth, satisfying the assumptions of many statistical models.
  • reduces the influence of sequencing depth on diversity estimates, enabling more accurate comparisons.
  • standardizes the number of sequences per sample, making them comparable.

But there is a catch: rarefaction discards data from the higher-depth samples and loses statistical power.
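
Conceptually, rarefaction just subsamples each sample's reads down to a common depth without replacement. Here is a minimal base-R sketch of the idea (the counts are made up; in the real workflow we will use rarefy_samples() or rarefy_even_depth() as shown later):

# Minimal sketch of rarefaction on one hypothetical sample (not real data):
set.seed(1024)
counts <- c(OTU1 = 500, OTU2 = 300, OTU3 = 200)    # one sample with 1000 reads
reads  <- rep(names(counts), counts)               # expand to one entry per read
table(sample(reads, size = 600, replace = FALSE))  # keep a random 600 reads (the common depth)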

Regardless, rarefaction is very important in diversity analysis (e.g., alpha and beta diversity).

1.1 Check the sequence numbers

Let's peek into our samples to check the sequence numbers in each one. This shows why it is important to normalize our data.

Drought_mt$sample_sums() %>% range()
# [1] 35874 80766  (the lowest and highest sequence counts are 35874 and 80766)

1.2 Rarefaction

Usually, rarefaction is done with rarefy_samples(sample.size = 35874), using the lowest read count. However, that means we are cutting all samples down to the depth of the shallowest one. What if this lowest value is an outlier? If we rarefy all our samples at that depth, we may lose a lot of valuable data because of that single outlier.
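
One rough way to check whether the shallowest sample really is an outlier is to apply a simple rule of thumb to the per-sample read counts (the 1.5 * IQR cutoff below is just one possible, arbitrary threshold):

# Flag unusually shallow samples before choosing a rarefaction depth.
# The 1.5 * IQR rule is a hypothetical threshold, not a hard recommendation.
depths <- Drought_mt$sample_sums()
cutoff <- quantile(depths, 0.25) - 1.5 * IQR(depths)
sort(depths[depths < cutoff])   # candidate low-depth outliers, if any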

Therefore, it's best to look at your data first and then decide. In this case, I'm converting my microtable into a phyloseq object to simplify the plotting process. Once it's in this format, plotting rarefaction curves becomes much easier. Sure, there are ways to examine the rarefaction curves directly from the microtable itself, but why not go on a bit of an adventure and explore the process using a phyloseq object? It'll open up more possibilities and make things even smoother down the line. Let's dive in and see where this takes us! The first step is to convert your microtable into a phyloseq object.

library(phyloseq)
library(file2meco)
# Let's convert from microtable to phyloseq so it's easier to handle
drought_physeq <- meco2phyloseq(Drought_mt)
drought_physeq

library(MicrobiotaProcess)
library(patchwork)
library(ggplot2)
# for reproducible random numbers
set.seed(1024)
rareres <- get_rarecurve(obj = drought_physeq, chunks = 400)

p_rare <- ggrarecurve(obj = rareres,
                      indexNames = c("Observe", "Chao1", "ACE")) +
  theme(legend.spacing.y = unit(0.01, "cm"),
        legend.text = element_text(size = 6))
p_rare

The rarefaction curves might already give you a glimpse of any outliers, but to make things even clearer, let’s dive into the numbers. We’ll identify any outliers by examining the number of reads in each sample—this way, we can pinpoint exactly where things might be going off track. Let’s take a closer look!

# Extract the data from the 'rareres' object
rareres_data <- rareres$data

# Use dplyr to group by sample, calculate the maximum readsNums, and arrange in ascending order
library(dplyr)

max_reads_per_sample_sorted <- rareres_data %>%
  group_by(sample) %>%
  summarise(max_reads = max(readsNums)) %>%
  arrange(max_reads)

# Print the sorted result
print(max_reads_per_sample_sorted)

The sorted read counts look like this:

> print(max_reads_per_sample_sorted)
# A tibble: 35 × 2
   sample  max_reads
   <chr>       <dbl>
 1 DR20M-1     35874
 2 DR20M-2     54302
 3 DR20M-3     56321
 4 DR20M-4     57431
 5 DR20M-5     57980
 6 DR20M-6     59872
 7 DR20X-1     60002
 8 DR20X-3     54300
 9 DR20X-4     54300
10 DR20X-5     54300

Now we can perform the rarefaction based on these numbers. With the rarefaction curve graph right in front of us, this is the perfect moment to assess what we are losing and retaining at each depth we might choose. It gives us a clear view of how our data change as we adjust the rarefaction depth, so let's dig into it!

To decide the rarefaction depth, take a look at the x-axis, which shows the sequencing depth at different points. Then, check the y-axis to see what you’ll be cutting off at the chosen rarefaction depth. This helps you understand how much of your data is retained at each depth, ensuring you make an informed decision about where to rarefy your samples.
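
Alongside the curves, it can also help to list which samples sit below a candidate depth, since samples shallower than the chosen depth cannot be rarefied to it (54300 is just the example depth used in the next step):

# For a candidate rarefaction depth, list the samples that fall below it.
candidate_depth <- 54300
depths <- Drought_mt$sample_sums()
sort(depths[depths < candidate_depth])  # samples shallower than the candidate depth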

Now we can do the rarefaction.

# As an example, rarefy to 54300 sequences in each sample
Drought_mt$rarefy_samples(sample.size = 54300)
Drought_mt$sample_sums() %>% range()  # the minimum and maximum number of reads we are left with
Drought_mt$sample_sums() %>% sort()   # the number of reads we are left with in each sample

Even though we've been using the phyloseq object to visualize the rarefaction curves, keep in mind that we're switching back to the microtable for the actual analysis. This transition is important, as the microtable is where we'll conduct most of the downstream analysis, so make sure everything aligns as we move forward!

In case you want to do the rarefaction on the phyloseq object instead of the microtable, you are not left alone:

# Calculate the per-sample read sums
sums <- sample_sums(drought_physeq)
sums
# Sort the sums in increasing order and pick the second lowest
second_lowest_sum <- sort(sums)[2]  # the index [2] decides the depth to rarefy

# Set a random seed for reproducibility
set.seed(1024)

# Rarefy to the second lowest sum, with rngseed set for reproducibility
rarefied_physeq <- rarefy_even_depth(drought_physeq, sample.size = second_lowest_sum, rngseed = TRUE)

# Check that the rarefaction worked as expected
set.seed(1024)
rareres <- get_rarecurve(obj = rarefied_physeq, chunks = 400)

p_rare <- ggrarecurve(obj = rareres,
                      indexNames = c("Observe", "Chao1", "ACE")) +
  theme(legend.spacing.y = unit(0.01, "cm"),
        legend.text = element_text(size = 6))
p_rare

Then, let's calculate the taxa abundance at each taxonomic rank (back in the microtable).

Drought_mt$cal_abund() # default parameter (rel = TRUE) denotes relative abundance
Drought_mt$taxa_abund$Phylum[1:5, 1:5] # Show part of the relative abundance at Phylum level
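
As a quick sanity check (assuming the default rel = TRUE), the relative abundances within each sample column should sum to roughly 1:

# Sanity check: with relative abundances, each sample (column) should sum to ~1
Drought_mt$taxa_abund$Phylum %>% colSums() %>% head()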

1.3 Visualize taxonomic abundance

Let’s make a bar plot! I can already hear the pros scoffing: “Stacked bar charts? Really?” And to the newbies in microbiome analysis, who are currently nodding along like they totally get what’s happening—don’t worry, you’ll see soon enough. Yeah, they’re not the most informative visual tool, but hey—they do give us a nice bird’s-eye view of the data. Maybe they were all the rage back in the day, like bell-bottom jeans or floppy disks. So, even if they’re not perfect, we’re going to roll with it. Let’s just say we’re honoring tradition, shall we?

# create trans_abund object
t1 <- trans_abund$new(dataset = Drought_mt, taxrank = "Phylum", ntaxa = 8) # select top 8 abundant Phyla.
t1$plot_bar(others_color = "grey70", facet = "TRT", xtext_keep = TRUE, legend_text_italic = FALSE) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))  # Rotate x-axis text
# return a ggplot2 object

Plot by group mean. Looking at each individual replicate in every treatment is not the best approach either, so we can look at the mean of each group instead.

# The groupmean parameter can be used to obtain the group-mean barplot.
t1 <- trans_abund$new(dataset = Drought_mt, taxrank = "Phylum", ntaxa = 10, groupmean = "TRT")
g1 <- t1$plot_bar(others_color = "grey70", legend_text_italic = FALSE)
g1 + theme_classic() + theme(axis.title.y = element_text(size = 18))

Alluvial plots are better because they show how the composition flows from sample to sample (let's look at the individual samples first).

t1 <- trans_abund$new(dataset = Drought_mt, taxrank = "Genus", ntaxa = 8)
# require ggalluvial package
# use_alluvium = TRUE make the alluvial plot, clustering =TRUE can be used to reorder the samples by clustering
# bar_type = FALSE can remove 'others'
t1$plot_bar(bar_full = FALSE, use_alluvium = TRUE, clustering = TRUE, xtext_angle = 30, xtext_size = 3, color_values = RColorBrewer::brewer.pal(8, "Set2"))

And the same, by group mean:

t1 <- trans_abund$new(dataset = Drought_mt, taxrank = "Phylum", ntaxa = 8, groupmean = "TRT")
# require ggalluvial package
# use_alluvium = TRUE make the alluvial plot, clustering =TRUE can be used to reorder the samples by clustering
# bar_type = FALSE can remove 'others'
t1$plot_bar(bar_full = FALSE, use_alluvium = TRUE, clustering = TRUE, xtext_angle = 30, xtext_size = 10, color_values = RColorBrewer::brewer.pal(8, "Set2"))

Now, remember I mentioned that we would switch from one package to another? Well, we did it once when we looked at the rarefaction curves, so let's do it again (for the rarefied data, of course). The objective is to convert our data to an MPSE S4 object, the class used by the MicrobiotaProcess package. There may be a way to do it directly, but this is what I came up with. Once the data are in MPSE form, it's much easier to visualize them filtered by the group of interest.

library(phyloseq)
library(MicrobiotaProcess)

physeq_drought <- meco2phyloseq(Drought_mt)     # convert the microtable to a phyloseq object
mpse4_drought  <- physeq_drought %>% as.MPSE()  # convert the phyloseq object to an MPSE object
mpse4_drought
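
As a small illustration of why this is convenient: MPSE objects support dplyr-style verbs, so subsetting samples by a metadata column can look like the sketch below (the treatment level "Drought" is made up; use a level that actually exists in your TRT column).

# Sketch only: filter the MPSE object by a sample metadata column (TRT).
# "Drought" is a hypothetical level; replace it with one present in your data.
library(dplyr)
mpse4_drought %>%
  dplyr::filter(TRT == "Drought")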

I initially thought I could complete it, but I’m currently busy with my experiments and manuscript preparations. I’ll return as soon as I have some time.