Why Rarefaction Isn’t Just “Pick the Lowest and Pray”

Rarefaction isn’t just about chopping your data down to size — it’s about finding the balance between fairness and keeping as much information as possible. Here’s the long (but sensible!) version of why it matters, and how to do it right.

A lot of tutorials will tell you to just rarefy everything down to the lowest sequencing depth—or hand you a script and let you follow it without thinking twice. It’s quick and easy, sure. But once you look a little closer, it’s kind of shocking how much good data gets tossed out that way. The real trick is finding the sweet spot: keep as many ASVs as you can without cutting so many replicates that your treatment groups fall apart. That way, your stats stay solid and your story stays believable.

To really see what’s going on, let’s look at a dataset. In this case, we’ll plot the number of sequences against the number of ASVs. This kind of plot helps reveal how sequencing depth influences the diversity we capture — and where we might start losing valuable information.

First, let’s quickly convert our microtable into a phyloseq object. This makes downstream plotting (like rarefaction curves) much easier and more flexible:

library(microeco)
library(file2meco)
library(phyloseq)

ps <- meco2phyloseq(mt)
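Before plotting anything, it helps to eyeball the raw library sizes, since those are what set the candidate cutoffs later. A minimal sketch, assuming `ps` is the phyloseq object we just created:

```r
# Per-sample sequencing depth (total reads per sample)
depths <- sample_sums(ps)

# Sort ascending so the shallowest libraries are easy to spot
head(sort(depths))

# Five-number summary of depth across all samples
summary(depths)
```

The samples at the top of that sorted list are the ones a low rarefaction cutoff is protecting, and the ones a high cutoff will drop.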

To plot and explore the data, we’ll need a few extra packages. If you don’t have them already, here’s how to install and load everything:

# Let's install the packages
# MiscMetabar is on CRAN; use devtools only if you want the development version
if (!require("devtools", quietly = TRUE)) {
  install.packages("devtools")
}
devtools::install_github("adrientaudiere/MiscMetabar")

## install iNEXT and the plotting helpers from CRAN
install.packages(c("iNEXT", "ggplot2", "patchwork"))

## MicrobiotaProcess lives on Bioconductor
if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
BiocManager::install("MicrobiotaProcess")

library("MicrobiotaProcess")
library("MiscMetabar")
library("ggplot2")
library("patchwork")
library("iNEXT")

👉 Now we’re ready to start plotting rarefaction curves and exploring how sequencing depth affects ASV richness!

Let’s plot how many ASVs we pick up as we sequence more and more. Think of it like scooping marbles into a jar — at first, you keep finding new colors (ASVs), but after a while, you mostly get repeats. This helps us see how well we’ve captured biodiversity.

observed <- accu_plot(
  ps,
  fact = "sample.id",     # group curves by sample
  add_nb_seq = TRUE,      # show number of sequences along the curve
  by.fact = TRUE,         # draw a curve for each sample
  step = 1000             # sequence step size 
) + theme(legend.position = "top")

# Display the plot
observed

# ⏳ Note: Depending on your step size, this may take ~10 minutes to run. 
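Since we loaded iNEXT above, here's an alternative way to draw the same kind of curve, with extrapolation beyond the observed depth thrown in. This is a hedged sketch, not the workflow from the text: it assumes `ps` is our phyloseq object, and like `accu_plot()` it can be slow with many samples.

```r
# iNEXT wants a species-by-sample abundance table
abund <- as.data.frame(otu_table(ps))
if (!taxa_are_rows(ps)) abund <- as.data.frame(t(abund))

# q = 0 means species (ASV) richness
out <- iNEXT(abund, q = 0, datatype = "abundance")

# type = 1: sample-size-based rarefaction/extrapolation curves
ggiNEXT(out, type = 1) + theme(legend.position = "none")
```

The dashed (extrapolated) portion of each curve is a nice sanity check: it hints at how much richness a deeper cutoff could realistically recover.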

Quick refresher:

  • x-axis = sequencing depth (how many reads you’ve got)
  • y-axis = number of ASVs observed (a measure of diversity)
  • each curve = a sample

The trick is to pick a cutoff on the x-axis that keeps as much of the y-axis signal as possible for all samples. For privacy, I’ve stripped the real sample IDs and just relabeled four of them as S1–S4. Let’s use S4 as our “example victim” and see what happens at different cutoffs.


👉 Cut at point a (S1, ~23k reads):
Everything is technically “equalized.” You keep all the samples, including S4. But look closely—S4’s curve (and many others) is still climbing. That means you’re chopping it off mid-growth and throwing away diversity. Ouch.

👉 Cut at point b (S2, ~62k reads):
Much better. S4 suddenly gains ~1,250 more ASVs (marked as “x” on the y-axis). You’ve kept more of the real diversity signal. The downside? S1 drops out of the dataset.

👉 Cut at point c (S3):
Now S4 only picks up ~100 more ASVs (marked as “y”). Not nearly as dramatic as the jump from S1 to S2. Plus, most curves are already flattening out, meaning you’re not gaining much extra diversity at this deeper cutoff. The trade-off: both S1 and S2 get removed.

👉 Cut at point d:
What do you actually gain on the y-axis? Practically nothing. But the cost? You lose around 12 samples that don’t make it to this depth. That’s a pretty steep price for a tiny return.
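A quick way to put numbers on these trade-offs is to count how many samples survive each candidate cutoff. A sketch, assuming `ps` is our phyloseq object; the 23k and 62k depths come from points a and b above, while the deeper two are hypothetical stand-ins for points c and d:

```r
# Candidate rarefaction depths (points a-d; c and d are placeholders)
cutoffs <- c(23000, 62000, 90000, 120000)

# For each cutoff: how many samples keep at least that many reads?
survivors <- sapply(cutoffs, function(d) sum(sample_sums(ps) >= d))
names(survivors) <- paste0(cutoffs / 1000, "k")
survivors
```

Reading this table next to the rarefaction curves makes the decision concrete: the cutoff where the survivor count drops sharply is usually past the sweet spot.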


The takeaway:
In this dataset, the smart move is to remove S1 and S2 before rarefying. These aren’t random casualties; they simply have the lowest sequencing depth. Dropping them preserves most of the biological signal in samples like S4 without wasting reads. The catch: if S1 and S2 came from the same treatment group, losing both would weaken your statistical power; if they come from different treatments, it’s less of an issue. Rarefaction isn’t a mindless chop; it’s an informed trade-off. Check your curves, think about which samples you can afford to lose, and pick a cutoff that keeps as much real diversity as possible while minimizing data loss.
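Once you’ve settled on a cutoff, the rarefaction itself is a single call to phyloseq’s `rarefy_even_depth()`. A sketch using the ~62k point-b cutoff from the example; the seed value is arbitrary, but record it so the subsampling is reproducible:

```r
ps_rare <- rarefy_even_depth(
  ps,
  sample.size = 62000,   # the chosen cutoff (point b in the example)
  rngseed     = 123,     # rarefaction subsamples randomly; fix the seed
  replace     = FALSE,   # subsample without replacement
  trimOTUs    = TRUE     # drop ASVs left with zero counts everywhere
)

# Samples below the cutoff (here, S1) are removed automatically
nsamples(ps); nsamples(ps_rare)
```

Comparing `nsamples()` before and after confirms exactly which samples paid the price for your chosen depth.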

That’s why rarefaction isn’t just a mechanical step. You have to look, think, and decide where the trade-off makes sense for your experiment.