Step-by-Step Guide: Removing Singletons in a microtable (microeco)

Rarefying your dataset is just the first step. Before calculating diversity metrics, it’s good practice to remove singletons — OTUs that appear only once across the whole dataset. These are usually sequencing noise and can artificially inflate richness estimates.

Here’s how to do it safely using microeco:


Step 1: Clone the original microtable

Always keep your raw data intact. Start by cloning your microtable object:

mt_nosingletons <- clone(mt_rarefied)

Step 2: Extract the OTU table

otu <- mt_nosingletons$otu_table

Step 3: Identify singletons

Count OTUs that appear only once across all samples:

otu_sums <- rowSums(otu)
singletons <- names(otu_sums[otu_sums == 1])
cat("Number of singletons:", length(singletons), "\n")

Step 4: Remove singletons

Keep only OTUs with total abundance >1:

keep_otus <- rownames(otu)[otu_sums > 1]

# Subset OTU table
mt_nosingletons$otu_table <- otu[keep_otus, , drop = FALSE]

# Subset taxonomy table
mt_nosingletons$tax_table <- mt_nosingletons$tax_table[keep_otus, , drop = FALSE]

# Subset phylogenetic tree (skip if no tree is attached)
if (!is.null(mt_nosingletons$phylo_tree)) {
  mt_nosingletons$phylo_tree <- ape::keep.tip(mt_nosingletons$phylo_tree, keep_otus)
}

# Subset representative sequences if present
if (!is.null(mt_nosingletons$rep_fasta)) {
  mt_nosingletons$rep_fasta <- mt_nosingletons$rep_fasta[keep_otus]
}

# Let microeco reconcile all components after the manual subsetting
mt_nosingletons$tidy_dataset()

Step 5: Verify filtering

Check taxa counts before and after:

cat("Taxa before filtering:", nrow(otu), "\n")
cat("Taxa after filtering: ", length(keep_otus), "\n")
cat("Removed:", nrow(otu) - length(keep_otus), "singletons\n")
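As an extra safety net, you can assert that the subsetted components still line up. A minimal sketch (assuming row names in both tables are OTU IDs, as in the code above):

# Hedged sanity check: the OTU and taxonomy tables should share row names,
# and every remaining OTU should still be a tip in the tree (if one is attached).
stopifnot(identical(rownames(mt_nosingletons$otu_table),
                    rownames(mt_nosingletons$tax_table)))
if (!is.null(mt_nosingletons$phylo_tree)) {
  stopifnot(all(rownames(mt_nosingletons$otu_table) %in%
                mt_nosingletons$phylo_tree$tip.label))
}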

Step 6: Compare alpha diversity (Observed richness & Shannon)

# Observed richness
obs_before <- colSums(mt_rarefied$otu_table > 0)
obs_after  <- colSums(mt_nosingletons$otu_table > 0)

# Shannon diversity
shannon_before <- apply(mt_rarefied$otu_table, 2, vegan::diversity, index = "shannon")
shannon_after  <- apply(mt_nosingletons$otu_table, 2, vegan::diversity, index = "shannon")

# Combine into tidy dataframe for plotting
alpha_df <- data.frame(
  Sample = names(obs_before),
  Observed_before = obs_before,
  Observed_after  = obs_after,
  Shannon_before  = shannon_before,
  Shannon_after   = shannon_after
)

alpha_long <- tidyr::pivot_longer(
  alpha_df, -Sample,
  names_to = c("Metric", "Stage"),
  names_sep = "_", values_to = "Value"
)

Step 7: Visualize alpha diversity before vs after

library(ggplot2)

ggplot(alpha_long, aes(x = Stage, y = Value)) +
  geom_boxplot(outlier.shape = NA) +
  geom_jitter(width = 0.12, size = 1.6, alpha = 0.7) +
  facet_wrap(~ Metric, scales = "free_y") +
  theme_minimal() +
  labs(title = "Alpha diversity: before vs after removing singletons",
       x = "", y = "Value")

Step 8: Save the singleton-free dataset (optional)

saveRDS(mt_nosingletons, "path/mt_nosingletons.rds")

Summary:

  • Count singletons to see how many “one-off” OTUs exist.
  • Remove singletons safely, keeping the microtable intact.
  • Check before-and-after taxa counts.
  • Compare alpha diversity (Observed richness & Shannon) to see the effect.
  • Your filtered dataset is now ready for clean diversity analysis.

Why Rarefaction Isn’t Just “Pick the Lowest and Pray”

Rarefaction isn’t just about chopping your data down to size — it’s about finding the balance between fairness and keeping as much information as possible. Here’s the long (but sensible!) version of why it matters, and how to do it right.

A lot of tutorials will tell you to just rarefy everything down to the lowest sequencing depth—or hand you a script and let you follow it without thinking twice. It’s quick and easy, sure. But once you look a little closer, it’s kind of shocking how much good data gets tossed out that way. The real trick is finding the sweet spot: keep as many ASVs as you can without cutting so many replicates that your treatment groups fall apart. That way, your stats stay solid and your story stays believable.

To really see what’s going on, let’s look at a dataset. In this case, we’ll plot the number of sequences against the number of ASVs. This kind of plot helps reveal how sequencing depth influences the diversity we capture — and where we might start losing valuable information.

First, let’s quickly convert our microtable into a phyloseq object. This makes downstream plotting (like rarefaction curves) much easier and more flexible:

library(microeco)
library(file2meco)
library(phyloseq)

ps <- meco2phyloseq(mt)

To plot and explore the data, we’ll need a few extra packages. If you don’t have them already, here’s how to install and load everything:

# Let's install the packages
if (!require("devtools", quietly = TRUE)) {
  install.packages("devtools")
}
devtools::install_github("adrientaudiere/MiscMetabar")

## install iNEXT and patchwork from CRAN
install.packages(c("iNEXT", "patchwork"))

## install Bioconductor packages
if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
BiocManager::install(c("dada2", "MicrobiotaProcess"))

library("MicrobiotaProcess")
library("MiscMetabar")
library("ggplot2")
library("patchwork")
library("iNEXT")

👉 Now we’re ready to start plotting rarefaction curves and exploring how sequencing depth affects ASV richness!

Let’s plot how many ASVs we pick up as we sequence more and more. Think of it like scooping marbles into a jar — at first, you keep finding new colors (ASVs), but after a while, you mostly get repeats. This helps us see how well we’ve captured biodiversity.

observed <- accu_plot(
  ps,
  fact = "sample.id",     # group curves by sample
  add_nb_seq = TRUE,      # show number of sequences along the curve
  by.fact = TRUE,         # draw a curve for each sample
  step = 1000             # sequence step size 
) + theme(legend.position = "top")

# Display the plot
observed

# ⏳ Note: Depending on your step size, this may take ~10 minutes to run. 

Quick refresher:

  • x-axis = sequencing depth (how many reads you’ve got)
  • y-axis = number of ASVs observed (a measure of diversity)
  • each curve = a sample

The trick is to pick a cutoff on the x-axis that keeps as much of the y-axis signal as possible for all samples. For privacy, I’ve stripped the real sample IDs and just relabeled four of them as S1–S4. Let’s use S4 as our “example victim” and see what happens at different cutoffs.


👉 Cut at point a (S1, ~23k reads):
Everything is technically “equalized.” You keep all the samples, including S4. But look closely—S4’s curve (and many others) is still climbing. That means you’re chopping it off mid-growth and throwing away diversity. Ouch.

👉 Cut at point b (S2, ~62k reads):
Much better. S4 suddenly gains more ASVs (~1,250 more, marked as “x” on the y-axis). You’ve kept more of the real diversity signal. The downside? S1 drops out of the dataset.

👉 Cut at point c (S3):
Now S4 only picks up ~100 more ASVs (marked as “y”). Not nearly as dramatic as the jump from S1 to S2. Plus, most curves are already flattening out, meaning you’re not gaining much extra diversity at this deeper cutoff. The trade-off: both S1 and S2 get removed.

👉 Cut at point d:
What do you actually gain on the y-axis? Practically nothing. But the cost? You lose around 12 samples that don’t make it to this depth. That’s a pretty steep price for a tiny return.


The takeaway:
In this dataset, the smart move is to remove S1 and S2 when rarefying. These aren’t random samples—they simply have lower sequencing depth. Cutting them preserves most of the biological signal in samples like S4 without wasting reads. The catch: if S1 and S2 were from the same treatment, dropping them would weaken statistical power; from different treatments, it’s less of an issue. Rarefaction isn’t a mindless chop—it’s about informed trade-offs. Check your curves, think about which samples you can afford to lose, and pick a cutoff that keeps as much real diversity as possible while minimizing data loss.

That’s why rarefaction isn’t just a mechanical step. You have to look, think, and decide where the trade-off makes sense for your experiment.

From Clueless to Confident: A Newcomer’s Guide to Microbiome Research

When I started my journey into microbiome research, I was completely new to the field. I had no clue what I was doing. Zilch. Nada. But I was determined to learn—not just the science, but also the bioinformatics involved. So, like any eager newbie, I followed every tutorial I could find. And guess what? They worked beautifully…on their sample data. Then came the moment of truth: using my own data. That’s when the real struggle began. Steps that seemed crystal clear with example datasets turned into a tangled mess with my files. Suddenly, “easy” tutorials became confusing, frustrating, and downright maddening. To be fair, I don’t blame the authors of those tutorials. Maybe—just maybe—they wrote them assuming their readers already had some background in bioinformatics and coding. Perhaps they never imagined a total newbie like me would be bold (or crazy) enough to give them a try!

This guide is the product of my trials, errors, and eventual triumphs. It’s all about harnessing the power of R for 16S rRNA data analysis using a mix of awesome packages like microeco, metacoder, and MicrobiotaProcess (among others). The pipeline I’ve laid out is cobbled together from half a dozen online tutorials, each offering a nugget of wisdom but often leaving big gaps to fill. I’ve taken those pieces, filled in the blanks, and turned it into a cohesive roadmap that works—especially when you’re using your own data!

Disclaimer:

Here’s the deal: this pipeline is for research purposes only. It’s not a magic wand, and I can’t promise perfect results. You’re still responsible for checking your data quality and interpreting your results based on your research goals. This guide is offered “as is,” and I can’t be held liable for errors or unexpected outcomes. If your research is critical, double-check everything and consult an expert if needed.

Why I Wrote This Guide:

Let’s be real: no one should have to spend hours wrestling with data prep and formatting before getting to the actual analysis. That’s exactly why I wrote this guide: to save you time, frustration, and maybe even a little sanity. If you’re planning to use the DADA2 pipeline, followed by a powerhouse mix of packages in R for downstream analysis, this guide will walk you through the setup step by step. That way, you can skip the headaches and dive straight into the exciting part—analyzing your microbiome data!

Enough about the struggles—let’s move forward and dive into the solutions. It’s time to get our raw sequences ready and set the stage for some serious analysis.

Is this tutorial complete yet? Not quite. 😅

I started this with the best intentions—thinking I could finish it alongside my analysis. How optimistic of me! Between two ongoing experiments, processing samples from two completed ones, submitting two manuscripts (with two more in the works), organizing a faculty-wide graduate research conference as president of the ALES Graduate Student Association, and presenting my research at other conferences… well, let’s just say the to-do list never ends!

This tutorial currently takes you up to the diversity analysis.
Thanks for your patience—I’ll be back with the exciting stuff very soon!