In the previous section we imported and formatted our data, created the microtable, and then filtered out the unwanted parts to subset our data. Now the fun begins.
Ways to Explore a Microbiome
Microbiome data can be examined from different perspectives, and each approach provides a unique piece of the puzzle. No single method captures the whole picture, but together they can reveal a more complete understanding of microbial communities. Here are some of the main approaches:
- Diversity Analysis 📊
  Measures variation within a sample (alpha diversity) and between samples (beta diversity), helping to understand how diverse microbial communities are and how they differ across conditions.
- Composition-Based Analysis 🥧
  Examines the relative proportions of taxa within a sample, emphasizing how different groups contribute to the overall community structure.
- Differential Abundance Analysis ⚖️
  Compares groups to identify taxa or genes that show significant differences in abundance under varying conditions (e.g., healthy vs. diseased, treated vs. untreated).
- Network and Co-occurrence Analysis 🔗
  Investigates relationships among taxa or between microbes and environmental factors, often through co-occurrence or correlation-based networks.
- Functional Analysis ⚙️
  Moves beyond “who is there” to consider “what they might be doing,” focusing on metabolic pathways and gene functions associated with the community.
- Integration with Host and Environment 🌍
  Links microbiome patterns to host traits (such as genetics, diet, or health) or environmental variables (such as soil type, temperature, or pH) to understand broader influences.
In practice, these approaches are often combined. Together, they help move from describing “who is present” to understanding “what roles they play” and “how they interact within their environment.”
Before diving into the actual analysis, we need to normalize our data. I know — it feels like it took forever to get here, and just when you think you can start the fun part, another hurdle appears. But hang in there, it’s an important step!
So, why normalize?
- Address Unequal Sequencing Depths – Different samples often have varying numbers of sequences due to technical variations in sequencing (e.g., library preparation, machine performance).
- Avoid Sampling Bias – High sequencing depth might artificially inflate diversity measures (e.g., alpha diversity) because rare taxa are more likely to be detected in deeply sequenced samples.
- Ensure Consistency in Statistical Analysis – Many statistical tests assume that data points are derived from similar distributions. Unequal sequencing depths violate this assumption.
- Simplify Diversity Metrics – Calculating metrics like observed richness, Shannon index, or Simpson index directly on unnormalized data can lead to biased results due to sequencing depth.
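The sampling-bias point is easy to demonstrate yourself. Here is a toy sketch (simulated data, not the tutorial's dataset): we draw reads from the *same* simulated community at two depths and count how many taxa we observe. The deeper "sample" picks up more of the rare taxa purely because of depth.

```r
# Toy demo: one simulated community, sampled at two sequencing depths
set.seed(42)

# 200 taxa with log-normal abundances -- many rare, a few dominant
true_abund <- rlnorm(200, meanlog = 2, sdlog = 2)
probs <- true_abund / sum(true_abund)

# Draw reads (taxon labels) at a shallow and a deep depth
shallow <- sample(200, size = 5000,  replace = TRUE, prob = probs)
deep    <- sample(200, size = 80000, replace = TRUE, prob = probs)

length(unique(shallow))  # observed richness at 5,000 reads
length(unique(deep))     # observed richness at 80,000 reads -- higher, same community
```

Both "samples" come from an identical community, yet the deeper one reports higher observed richness. That inflation is exactly what normalization tries to remove.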
There are a dozen ways to normalize data (also called transformation or scaling, depending on the strategy). However, rarefaction is widely accepted because it:
- ensures that metrics such as observed richness and the Shannon index reflect biological differences, not technical artifacts.
- creates a dataset with uniform sequencing depth, satisfying the assumptions of many statistical models.
- reduces the influence of sequencing depth on diversity estimates, enabling more accurate comparisons.
- standardizes the number of sequences per sample, making them comparable.
Regardless of which strategy you prefer, rarefaction is especially important in diversity analysis (e.g., alpha and beta diversity).
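Under the hood, rarefaction is nothing mysterious: for each sample, you subsample its reads down to a fixed depth without replacement. Here is a minimal sketch in base R (the `rarefy_counts` helper is illustrative, not a package function; real tools like `rarefy_samples` do this for you):

```r
# Minimal sketch of rarefying one sample's counts to a fixed depth
rarefy_counts <- function(counts, depth) {
  stopifnot(sum(counts) >= depth)
  # expand counts into individual reads (taxon indices),
  # subsample *without* replacement, then re-tabulate
  reads <- rep(seq_along(counts), times = counts)
  keep  <- sample(reads, size = depth)
  tabulate(keep, nbins = length(counts))
}

set.seed(1024)
sample_counts <- c(OTU1 = 500, OTU2 = 120, OTU3 = 3, OTU4 = 1)
rarefied <- rarefy_counts(sample_counts, depth = 100)
sum(rarefied)  # exactly 100 reads retained
```

Notice that rare taxa (like OTU4 with a single read) can disappear entirely after subsampling; that loss is the price of comparability, and it is why choosing the depth carefully matters.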
1.1 Check the sequence numbers
Let's peek into our samples to check the number of sequences in each one. This shows why it is important to normalize our data.
Drought_mt$sample_sums() %>% range
# [1] 35874 80766 -- the lowest and highest sequence counts are 35874 and 80766
1.2 Rarefaction
Usually, rarefaction is done using something like rarefy_samples(sample.size = 35874) based on the sample with the lowest sequence count. But here’s the catch — this approach cuts all samples down to that depth. What if that lowest value is just an outlier? Rarefying everything to match it could mean losing valuable data. Honestly, I was a bit shocked to see some tutorials online doing exactly that — rarefying all samples to the lowest value without even glancing at the data. (I’ll show you a classic example of what can happen using a dataset I recently analyzed.)
Think of it like trimming a bunch of differently-sized strings to the same length. If one string is unusually short, cutting all the others to match it wastes a lot of material — and a lot of potential insights. That’s why it’s always worth checking your data before deciding how deep to rarefy.
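One quick, informal way to check for a suspiciously shallow sample is the classic IQR fence on the per-sample read totals. The `flag_shallow` helper below is illustrative (not a package function), and the example depths mirror the ones we see in this dataset; with the microtable itself you would pass in `Drought_mt$sample_sums()`.

```r
# Flag samples whose read totals fall below the lower IQR fence --
# a rough screening heuristic, not a formal outlier test
flag_shallow <- function(depths) {
  q <- quantile(depths, c(0.25, 0.75))
  lower_fence <- q[[1]] - 1.5 * (q[[2]] - q[[1]])
  depths[depths < lower_fence]
}

# With the tutorial's microtable you could run:
#   flag_shallow(Drought_mt$sample_sums())
depths <- c(35874, 54300, 54302, 56321, 57431, 57980, 59872, 60002, 80766)
flag_shallow(depths)  # the shallow 35874-read sample stands out
```

If this flags a sample, that is your cue to consider rarefying to the *next* depth up (and dropping the outlier) rather than cutting everything down to it.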
In this case, I’m converting my micro table into a phyloseq object to make plotting rarefaction curves simpler. Sure, you can do it directly from the micro table, but working in phyloseq opens up more possibilities and makes the process smoother downstream.
So, the first step is to convert your micro table into a phyloseq object. Once it’s in that format, creating and exploring rarefaction curves becomes much easier — and a little more fun too!
library(phyloseq)
library(file2meco)
# Lets convert from microtable to phyloseq object so it's easy to handle
drought_physeq <- meco2phyloseq(Drought_mt)
drought_physeq
library(MicrobiotaProcess)
library(patchwork)
# for reproducible random numbers
set.seed(1024)
rareres <- get_rarecurve(obj=drought_physeq, chunks=400)
p_rare <- ggrarecurve(obj=rareres,
                      indexNames=c("Observe","Chao1","ACE")) +
  theme(legend.spacing.y=unit(0.01,"cm"),
        legend.text=element_text(size=6))
p_rare
The rarefaction curves might already give you a glimpse of any outliers, but to make things even clearer, let’s dive into the numbers. We’ll identify any outliers by examining the number of reads in each sample—this way, we can pinpoint exactly where things might be going off track. Let’s take a closer look!
# Extract the data from the 'rareres' object
rareres_data <- rareres$data

# Use dplyr to group by sample, calculate the maximum readsNums, and arrange in ascending order
library(dplyr)
max_reads_per_sample_sorted <- rareres_data %>%
  group_by(sample) %>%
  summarise(max_reads = max(readsNums)) %>%
  arrange(max_reads)

# Print the sorted result
print(max_reads_per_sample_sorted)
Here are the sorted read counts:
> print(max_reads_per_sample_sorted)
# A tibble: 35 × 2
   sample  max_reads
   <chr>       <dbl>
 1 DR20M-1     35874
 2 DR20X-3     54300
 3 DR20X-4     54300
 4 DR20X-5     54300
 5 DR20M-2     54302
 6 DR20M-3     56321
 7 DR20M-4     57431
 8 DR20M-5     57980
 9 DR20M-6     59872
10 DR20X-1     60002
# ℹ 25 more rows
Now, we can perform the rarefaction based on the numbers. With the rarefaction curve graph right in front of us, this is the perfect moment to assess what we’re losing and retaining at each point we decide to rarefy our samples. It gives us a clear view of how our data changes as we adjust the rarefaction depth—let’s dig into it!
To decide the rarefaction depth, take a look at the x-axis, which shows the sequencing depth at different points. Then, check the y-axis to see what you’ll be cutting off at the chosen rarefaction depth. This helps you understand how much of your data is retained at each depth, ensuring you make an informed decision about where to rarefy your samples.
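To make that trade-off concrete, here is a small hypothetical helper (`depth_tradeoff` is not part of any package; the example depths mirror the table above). For a candidate rarefaction depth it reports how many samples survive and roughly what fraction of the total reads you keep.

```r
# For a candidate depth: how many samples are kept/dropped,
# and what fraction of all reads survives rarefaction?
depth_tradeoff <- function(depths, candidate) {
  kept <- depths[depths >= candidate]
  c(samples_kept    = length(kept),
    samples_dropped = length(depths) - length(kept),
    reads_retained  = length(kept) * candidate / sum(depths))
}

depths <- c(35874, 54300, 54300, 54300, 54302,
            56321, 57431, 57980, 59872, 60002)

depth_tradeoff(depths, 35874)  # keeps every sample, but trims the deeper ones hard
depth_tradeoff(depths, 54300)  # drops the one shallow sample, retains more reads overall
```

Here, rarefying at 54300 instead of 35874 sacrifices one outlier sample but keeps a noticeably larger share of the sequencing effort, which is exactly the kind of judgment call the curves help you make.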
Now we can do the rarefaction.
# As an example, rarefy to 54300 sequences in each sample
# Note: samples with fewer reads than sample.size (here DR20M-1) are dropped
Drought_mt$rarefy_samples(sample.size = 54300)

Drought_mt$sample_sums() %>% range   # minimum and maximum number of reads we are left with
Drought_mt$sample_sums() %>% sort()  # number of reads we are left with for each sample
Even though we’ve been using the phyloseq object to visualize the rarefaction curves, keep in mind that we’re switching back to the micro table for the actual analysis. This transition is important, as the micro table is where we’ll conduct most of the downstream analysis, so make sure everything aligns as we move forward!
In case you want to do the rarefaction on the phyloseq object instead of the microtable, you are not left alone:
# Calculate the sample sums
sums <- sample_sums(drought_physeq)
sums

# Sort the sums in increasing order and pick the second lowest
second_lowest_sum <- sort(sums)[2] # this index [2] decides the depth to rarefy

# Rarefy using the second lowest sum; rngseed makes the subsampling reproducible
rarefied_physeq <- rarefy_even_depth(drought_physeq, sample.size = second_lowest_sum, rngseed = 1024)
# Check if the rarefaction is done properly
set.seed(1024)
rareres <- get_rarecurve(obj=rarefied_physeq, chunks=400)
p_rare <- ggrarecurve(obj=rareres,
                      indexNames=c("Observe","Chao1","ACE")) +
  theme(legend.spacing.y=unit(0.01,"cm"),
        legend.text=element_text(size=6))
p_rare
Now that we’ve leveled the playing field with rarefaction, it’s finally time to dive into the fun stuff — exploring diversity hiding in our samples! 🎉
But before we get carried away, let’s take a quick pit stop for one last round of data cleanup: removing singletons.
Singletons are those quirky OTUs or ASVs that appear only once in your entire dataset. Most of the time, they aren’t biologically meaningful — they’re just little artifacts of sequencing errors, PCR gremlins, or random one-off reads. Keeping them is like trying to hear a whisper in a noisy room; they just add fuzziness and can distort your diversity estimates. Filter them out, and suddenly your dataset is sharper, cleaner, and the patterns you see reflect real biology, not random noise.
Once singletons are gone, your data is lean, tidy, and ready for the real adventure: alpha and beta diversity analysis.
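If you are curious how many singletons you are about to drop, a quick base-R check does the job. The `count_singletons` helper and the toy table below are illustrative; the layout assumed is rows = OTUs, columns = samples, as in a microtable's `otu_table`.

```r
# How many OTUs appear exactly once across the whole dataset?
count_singletons <- function(otu) sum(rowSums(otu) == 1)

# Toy OTU table with two singletons (OTU3 and OTU4)
otu <- matrix(c(10, 5,
                 0, 8,
                 1, 0,
                 0, 1),
              nrow = 4, byrow = TRUE,
              dimnames = list(paste0("OTU", 1:4), c("S1", "S2")))
count_singletons(otu)

# On the tutorial's rarefied data this would be:
#   count_singletons(Drought_mt$otu_table)
```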
Now, I know I keep slowing you down with all these “hold-on” moments, but trust me — this is important! Most microbiome studies take it a step further. After removing singletons, researchers often filter out low-abundance ASVs — for example, those appearing fewer than 5 times or in less than 10–20% of all samples. Why? Because these rare ASVs are often sequencing errors, PCR artifacts, or contaminants. Removing them improves the signal-to-noise ratio and ensures that the taxa you focus on are biologically relevant and consistently present.
OK, here we go…
# 1. Clone the original microtable to keep the raw data safe
# ---------------------------
mt_rfiltered <- clone(Drought_mt)  # the rarefied microtable from the steps above
# ---------------------------
# 2. Extract the OTU table
# ---------------------------
otu <- mt_rfiltered$otu_table
# ---------------------------
# 3. Filter OTUs
# Keep OTUs with at least 5 counts in at least 4 samples
# (Adjust thresholds as needed for your study)
# ---------------------------
keep_otus <- rownames(otu)[apply(otu, 1, function(x) sum(x >= 5) >= 4)]
# ---------------------------
# 4. Subset the microtable object
# ---------------------------
mt_rfiltered$otu_table <- otu[keep_otus, , drop = FALSE]
mt_rfiltered$tax_table <- mt_rfiltered$tax_table[keep_otus, , drop = FALSE]
if (!is.null(mt_rfiltered$phylo_tree)) {
  mt_rfiltered$phylo_tree <- ape::keep.tip(mt_rfiltered$phylo_tree, keep_otus)
}
if (!is.null(mt_rfiltered$rep_fasta)) {
mt_rfiltered$rep_fasta <- mt_rfiltered$rep_fasta[keep_otus]
}
# ---------------------------
# 5. Verify the updated object
# ---------------------------
mt_rfiltered
Think of it like trimming the obvious weeds in your garden before you start exploring the hidden treasures — a little patience now makes the discovery phase much more meaningful. 🌱
Alright, the cleanup is done — singletons and low-abundance OTUs are out of the way, and our dataset is lean, tidy, and ready to shine. ✨
No more hold-ups, I promise — now it’s time for the fun part: diving into alpha diversity, beta diversity, and all the exciting patterns hiding in your microbial community. Let’s see what stories your microbes have been waiting to tell! 🦠📊