The five key files: Building blocks of your analysis
Here’s the plan: we’ll process our raw sequences step-by-step, and by the end, we’ll have these five essential files in this order:
- table.qza – A feature table summarizing the abundance of sequences across samples.
- rep-seqs.qza – Representative sequences after trimming, capturing unique sequence variants.
- taxonomy.qza – Taxonomic classifications for our sequences, revealing the microbial communities.
- rooted-tree.qza – A phylogenetic tree to support diversity analyses.
Oh! Did you notice I said five but only listed four? Don’t worry—keep going, and you’ll find out about the fifth file soon enough.
These output files will serve as the cornerstone for meaningful downstream analyses. Let’s dive in!
Once you’ve got QIIME2 installed and the environment activated, you’re all set to move forward! You can find the installation guide on their official website: QIIME2 Installation.
Whether you’re on a PC or Mac, or working with an Intel-based or Apple Silicon chipset, the documentation will guide you through the process step-by-step. I won’t dive into those details here to avoid redundancy, but make sure to double-check the requirements for your platform before proceeding.
Activate your QIIME2 environment using the following command in your terminal:
conda env list  # find the environments
conda activate qiime2-amplicon-2024.10  # activate the QIIME 2 environment
1.1 Importing Raw Sequences: Creating the Master .qza File
Now it's time to import the raw sequences! This step will create a .qza file (the Master .qza file) that includes all the paired sequences from your raw data, setting you up for the next stages of processing in QIIME2. To keep everything organized, it's a good idea to store your raw files in a dedicated folder. For my workflow, I've created a separate folder for the outputs at /Users/chathu/Documents/qiime2_outputs/. This keeps everything neat and easy to find as we move forward with the analysis!
qiime tools import \
  --type 'SampleData[PairedEndSequencesWithQuality]' \
  --input-format CasavaOneEightSingleLanePerSampleDirFmt \
  --input-path /Users/chathu/Documents/DeSilvaV3V4 \
  --output-path /Users/chathu/Documents/qiime2_outputs/demultiplexed-sequences.qza
What Does It Mean?
Casava Output Format:
When sequencing data is generated by Illumina machines, it is often processed through the CASAVA software, which organizes the output into a directory structure where each sample has its own FASTQ file(s).
OneEight:
The term “OneEight” indicates the format introduced with CASAVA version 1.8, which is widely used and follows specific naming conventions for files.
Single Lane Per Sample Directory Format:
This part means that the data is organized so that each sample’s sequences (forward and reverse reads) are stored as separate FASTQ files within a directory.
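To make this concrete, here's a minimal sketch of what a Casava 1.8-style input directory could look like. The directory and sample names below are hypothetical placeholders; only the naming pattern matters, and your real directory would of course contain actual gzipped FASTQ files rather than empty ones.

```shell
# Casava 1.8 naming: <sample-name>_<sample-number>_L<lane>_R<1|2>_001.fastq.gz
# Placeholder files only -- real data would be gzipped FASTQs with reads inside.
mkdir -p DeSilvaV3V4
touch DeSilvaV3V4/sample-1_S1_L001_R1_001.fastq.gz \
      DeSilvaV3V4/sample-1_S1_L001_R2_001.fastq.gz \
      DeSilvaV3V4/sample-2_S2_L001_R1_001.fastq.gz \
      DeSilvaV3V4/sample-2_S2_L001_R2_001.fastq.gz
ls DeSilvaV3V4   # one R1 (forward) and one R2 (reverse) file per sample
```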
Let's assess the sequences and their sequencing depth using the qiime demux summarize command. It takes the .qza file we've just created and generates a file that can be visualized in QIIME 2 View, providing a detailed overview of the number of sequences per sample and offering insights into the quality of those sequences.
qiime demux summarize \
  --i-data /Users/chathu/Documents/qiime2_outputs/demultiplexed-sequences.qza \
  --o-visualization /Users/chathu/Documents/qiime2_outputs/demultiplexed-sequences-summ.qzv
Now you can visualize the .qzv file in QIIME 2 View.
1.2 Deciding Where to Truncate: Interpreting Quality Score Visualizations
Here’s what you’ll see in the visualization and how to decide where to truncate your sequences:
The visualization will display quality scores across the length of your sequences. You’ll notice that the quality tends to drop off towards the ends of the reads. This is where sequencing errors, like poor-quality base calls, are more common.
To decide the truncation position, look for the point where the quality score starts to decline significantly. The goal is to cut off the lower-quality regions while preserving the high-quality portion. Typically, you'll want to truncate your sequences just before the quality score drops below an acceptable threshold (usually a 25th-percentile score of 30 or above). In other words, when you see 29 as your 25th percentile, step back one position to the left and record that position (286 in my case).
By doing this, you’ll maximize the reliability of your data and improve the overall quality of your analysis!

Do the same for the reverse read, and now you have the two positions where the forward and reverse reads need to be truncated.
Then we can move forward with denoising. Here's what's going to happen.
My samples were amplified using the Bacteria-specific (“Illumina”) V3-V4 primers:
- 341F = CCTACGGGNGGCWGCAG
- 805R = GACTACHVGGGTATCTAATCC
These were then sequenced on an Illumina MiSeq using a 2x300bp kit. The hypervariable region targeted by these primers is 464bp in length, so with 300bp reads, the sequences will overlap, allowing us to perform paired-end analysis during downstream processing.
To determine the overlap between the paired-end reads, we subtract the length of the target region (464 bp) from the combined length of the two reads. Since both the forward and reverse reads are 300 bp each, we can calculate the overlap as follows:
Overlap = 300 bp + 300 bp − 464 bp = 136 bp
The overlap between the two reads initially spans 136 base pairs. However, once we apply quality filtering and truncate the sequences, this overlap shrinks significantly—often down to just 20–30 base pairs. This reduction occurs because we trim off the low-quality regions at both ends of the reads, leaving a much smaller overlapping region for paired-end merging.
To put this into perspective: the DADA2 pipeline requires a minimum of 12 bases of overlap for paired-end reads to successfully merge. So, it’s crucial to keep this in mind when evaluating sequence quality and setting truncation parameters.
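The arithmetic above is easy to sanity-check in the shell before committing to truncation positions. This sketch uses the 464 bp amplicon length and the truncation positions chosen from my quality plots (286 forward, 207 reverse); the left-trim of the primers shortens the 5' ends, so the 3' overlap is governed by the truncation lengths alone.

```shell
# Hypothetical values: amplicon length and the truncation positions chosen above.
AMPLICON=464
TRUNC_F=286
TRUNC_R=207
OVERLAP=$((TRUNC_F + TRUNC_R - AMPLICON))
echo "Overlap after truncation: ${OVERLAP} bp"
if [ "$OVERLAP" -ge 12 ]; then
  echo "OK: meets DADA2's 12 bp merging minimum"
else
  echo "Too short: reads will fail to merge; relax your truncation positions"
fi
```

With these numbers the remaining overlap is 29 bp, comfortably above the 12 bp floor; if your quality plots force more aggressive truncation, rerun this check before launching DADA2.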
Here’s a quick visual guide to make this concept clearer.

Now that we've decided our truncation and trimming positions, ready? Let's dive into the DADA2 pipeline and get started!
1.3. DADA2
What is DADA2? Great question! Before diving into the pipeline, let’s take a moment to understand what DADA2 is.
What does DADA2 do in a nutshell?
- Filters and trims low-quality sequences.
- Learns the error rates of your data.
- Identifies and corrects errors to create a set of error-free ASVs.
- Merges paired-end reads (if applicable).
- Removes chimeric sequences.
- Outputs a high-resolution table of ASVs for downstream analysis.
Ready to see it in action? Let’s continue with the DADA2 pipeline!
qiime dada2 denoise-paired \
  --i-demultiplexed-seqs /Users/chathu/Documents/qiime2_outputs/demultiplexed-sequences.qza \
  --p-trunc-len-f 286 \
  --p-trunc-len-r 207 \
  --p-trim-left-f 17 \
  --p-trim-left-r 21 \
  --p-n-threads 10 \
  --o-table /Users/chathu/Documents/qiime2_outputs/table.qza \
  --o-representative-sequences /Users/chathu/Documents/qiime2_outputs/rep-seqs.qza \
  --o-denoising-stats /Users/chathu/Documents/qiime2_outputs/denoising-stats.qza \
  --verbose
Command Purpose
The qiime dada2 denoise-paired command processes paired-end sequencing data to remove errors, merge reads, and produce a feature table and representative sequences.
Key Parameters
- --i-demultiplexed-seqs: the demultiplexed sequences file (.qza).
- --p-trim-left-f 17 / --p-trim-left-r 21: remove primer bases at the start of the reads.
- --p-trunc-len-f 286 / --p-trunc-len-r 207: trim low-quality bases at the ends.
- --p-n-threads 10: speed up processing by using 10 CPU cores.
- --o-table: feature table (table.qza).
- --o-representative-sequences: unique ASVs (rep-seqs.qza).
- --o-denoising-stats: DADA2 stats (denoising-stats.qza).
- --verbose: print detailed logs during processing.
Now that we've generated several output files, let's take a closer look at one of them to understand the denoising process. Specifically, let's examine denoising-stats.qza. This file provides detailed insights into the quality filtering and error correction steps applied to your data during denoising. By exploring it, we'll get a better sense of how the raw sequencing reads were processed and prepared for downstream analysis.
Converting the denoising-stats.qza file into .qzv format is quick and easy, making it the best way to take a closer look at the denoising process. Once converted, we can interactively explore the details using QIIME 2 View. This visualization will provide clear insights into the quality filtering and error correction steps applied to the data. Let's go ahead and do that!
qiime metadata tabulate \
  --m-input-file /Users/chathu/Documents/qiime2_outputs/denoising-stats.qza \
  --o-visualization /Users/chathu/Documents/qiime2_outputs/dada2-stats-summ.qzv
Here’s what you’ll see: the table includes several column headers that are quite self-explanatory, so I won’t go into detail about them. The information presented is straightforward, allowing you to easily interpret the stats from the denoising process.

Having said that, what you’ll want to pay close attention to is the last column: percentage of input non-chimeric. This column provides critical information about the proportion of your reads that passed the chimera-checking step, offering valuable insight into the quality and integrity of your sequencing data.
As a rule of thumb, you should aim to retain at least 50-60% of non-chimeric reads after the denoising process. If the percentage is significantly lower, it may indicate issues with your sequencing data, such as the presence of a high number of chimeric sequences or suboptimal primer design.
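If you'd rather check this programmatically across many samples, here's a sketch: export denoising-stats.qza to a TSV (e.g. with qiime tools export), then flag any sample retaining under 50% of input reads. The two rows below are fabricated purely for illustration, and the column layout (percentage in column 4) is a simplification of the real exported table.

```shell
# Fabricated example rows standing in for an exported denoising-stats TSV.
printf 'sample-id\tinput\tnon-chimeric\tpercentage of input non-chimeric\n' >  stats.tsv
printf 'sample-1\t50000\t41000\t82.0\n'                                     >> stats.tsv
printf 'sample-2\t48000\t19200\t40.0\n'                                     >> stats.tsv
# Flag samples retaining under 50% of input reads after chimera removal.
awk -F'\t' 'NR > 1 && $4 + 0 < 50 {print $1 " retained only " $4 "% -- investigate"}' stats.tsv
```

Here only the fabricated sample-2 trips the threshold; in a real run, any flagged sample is your cue to revisit truncation parameters or primer removal.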
Troubleshooting DADA2
Now, let’s head over to the safe haven of troubleshooting and get back on track.
Check out DADA2 troubleshooting tips
If you’ve been following along, congratulations! Remember I mentioned something about five essential files? Well, we now have two of them: table.qza and rep-seqs.qza.
Here's how you can quickly examine your two files. As you might have guessed, the first step is to convert them into .qzv format. But now comes the moment where the fifth file, metadata, comes into play.
1.4. The fifth file
You may already have a metadata file prepared when you submitted your samples for sequencing. If not, a simple approach would be to download an example metadata file formatted for Qiime2 and edit it to include your sample-specific information.
To avoid any confusion, here’s an example of how a Qiime2-formatted metadata file should look. You DO NOT need the second column (barcode-sequence), so feel free to delete it. This column is only necessary if your sequences are not demultiplexed, but in our case, we are working with demultiplexed data.

Once ready, here’s how you can visualize your feature table and representative sequences using the metadata file:
qiime feature-table summarize \
  --i-table /Users/chathu/Documents/qiime2_outputs/table.qza \
  --m-sample-metadata-file /Users/chathu/Documents/qiime2_outputs/sample-metadata.tsv \
  --o-visualization /Users/chathu/Documents/qiime2_outputs/table.qzv

qiime feature-table tabulate-seqs \
  --i-data /Users/chathu/Documents/qiime2_outputs/rep-seqs.qza \
  --o-visualization /Users/chathu/Documents/qiime2_outputs/rep-seqs.qzv
These commands will create .qzv visualizations that you can explore interactively with QIIME 2 View to better understand your data.
Great progress so far—let’s keep building on this momentum!
1.5. Taxonomy assignment
The next step—brace yourselves—is a tricky one. We’re about to assign taxonomic ranks to our reads. This is a critical process that connects our sequences to known microbial communities, and it requires precision to ensure accurate classifications. Let’s tackle this step carefully!
qiime feature-classifier classify-sklearn \
  --i-classifier gg-13-8-99-515-806-nb-classifier.qza \
  --i-reads filtered-sequences-1.qza \
  --o-classification taxonomy.qza
While this might seem straightforward, it’s important to understand that we are aligning our sequences to the file gg-13-8-99-515-806-nb-classifier.qza, which contains a pre-trained classifier based on the Greengenes reference database (just as an example). In our analysis, however, we will be using the SILVA database. This is a crucial decision because the choice of classifier directly impacts the accuracy of your taxonomic assignments. The reference database you select will define how your sequences are classified, so take time to ensure that the chosen classifier is well-suited for your data.
So, here’s my advice: don’t just grab the pre-prepared databases floating around online. Always double-check and make sure you’re working with the most up-to-date version to ensure your analysis is accurate and reflects the latest research.
Before we dive in, here’s a helpful resource: this link will take you to a detailed online tutorial on how to prepare your classifier file. While I’ll walk you through the most important and necessary steps here, feel free to follow the tutorial if you prefer a more in-depth guide. Either way, you’ll be well on your way to getting your classifier ready for your analysis!
1.5.1. Download SILVA RNA Sequences and Taxonomy
Let’s head over to https://www.arb-silva.de/ and find the latest version of the database. Once you’ve figured out the version number, Qiime2 provides a convenient method to fetch the SILVA database, specifically the RNA sequences (SSURef_NR99
) for version 138.2
:
qiime rescript get-silva-data \
  --p-version '138.2' \
  --p-target 'SSURef_NR99' \
  --o-silva-sequences path/qiime2_outputs/silva-138.2-ssu-nr99-rna-seqs.qza \
  --o-silva-taxonomy path/qiime2_outputs/silva-138.2-ssu-nr99-tax.qza
1.5.2. Transform RNA Sequences to DNA
Since our analysis requires DNA sequences, we’ll reverse-transcribe the RNA sequences into DNA using the following command:
qiime rescript reverse-transcribe \
  --i-rna-sequences path/qiime2_outputs/silva-138.2-ssu-nr99-rna-seqs.qza \
  --o-dna-sequences path/qiime2_outputs/silva-138.2-ssu-nr99-seqs.qza
1.5.3. Removing Low-Quality Sequences with cull-seqs
In this step, we will filter out sequences containing 5 or more ambiguous bases (IUPAC-compliant ambiguity codes) and sequences with homopolymers of 8 or more consecutive bases. These filtering criteria are the default settings. For additional information, refer to the --help documentation.
qiime rescript cull-seqs \
  --i-sequences path/qiime2_outputs/silva-138.2-ssu-nr99-seqs.qza \
  --o-clean-sequences path/qiime2_outputs/silva-138.2-ssu-nr99-seqs-cleaned.qza
1.5.4. Filtering sequences by length and taxonomy
To avoid bias in database selection, we’ll filter reference sequences based on their taxonomy rather than a blanket length threshold. Removing sequences below 1000 or 1200 bp would disproportionately affect Archaea and some Bacteria, potentially leading to the retention of lower-quality sequences from Bacteria or Eukaryota. Instead, we’ll apply specific length filters: Archaea (16S) ≥ 900 bp, Bacteria (16S) ≥ 1200 bp, and Eukaryota (18S) ≥ 1400 bp. See the help text for more details.
qiime rescript filter-seqs-length-by-taxon \
  --i-sequences path/qiime2_outputs/silva-138.2-ssu-nr99-seqs-cleaned.qza \
  --i-taxonomy path/qiime2_outputs/silva-138.2-ssu-nr99-tax.qza \
  --p-labels Archaea Bacteria Eukaryota \
  --p-min-lens 900 1200 1400 \
  --o-filtered-seqs path/qiime2_outputs/silva-138.2-ssu-nr99-seqs-filt.qza \
  --o-discarded-seqs path/qiime2_outputs/silva-138.2-ssu-nr99-seqs-discard.qza
I know these were a lot of steps, but hang in there—we’re almost done!
1.5.5. Dereplicating
The SILVA 138.2 NR99 release may contain identical sequences with matching or differing taxonomies. To prevent confusion, we will dereplicate the data to remove redundancies before downstream processing.
qiime rescript dereplicate \
  --i-sequences path/qiime2_outputs/silva-138.2-ssu-nr99-seqs-filt.qza \
  --i-taxa path/qiime2_outputs/silva-138.2-ssu-nr99-tax.qza \
  --p-mode 'uniq' \
  --o-dereplicated-sequences path/qiime2_outputs/silva-138.2-ssu-nr99-seqs-derep-uniq.qza \
  --o-dereplicated-taxa path/qiime2_outputs/silva-138.2-ssu-nr99-tax-derep-uniq.qza
Now, we’ve hit a crossroads. You have two paths ahead:
1️⃣ Build a classifier for full-length SSU sequences – a versatile option for broader applications.
2️⃣ Create an amplicon-region-specific classifier – fine-tuned for targeted analysis.
Why choose now when you can have both? Let’s set them up so you can pick the right tool when the time comes.
Oh, and just so you know—having the right classifier in some labs makes you a legend (well, maybe not a god, but pretty close). Let’s get started! 🚀
1.5.6. Make classifier for use on full-length SSU sequences
Making the classifier for use on full-length SSU sequences is pretty straightforward.
qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads path/qiime2_outputs/silva-138.2-ssu-nr99-seqs-derep-uniq.qza \
  --i-reference-taxonomy path/qiime2_outputs/silva-138.2-ssu-nr99-tax-derep-uniq.qza \
  --o-classifier path/qiime2_outputs/silva-138.2-ssu-nr99-classifier.qza
1.5.7. Make amplicon-region specific classifier
In this section, we’ll focus on creating an amplicon-specific classifier, which improves the accuracy of taxonomic classification. To do this, we’ll extract the amplicon region from our reference database using the same primer sequences used in PCR/sequencing. Remember to enter primers in the 5′-3′ direction, just as you would when ordering oligos.
For this tutorial, we'll extract the Bacteria-specific ("Illumina") V3-V4 region using our previously filtered full-length sequences. Since the SILVA database is curated in a forward orientation, we'll set --p-read-orientation 'forward' to speed up processing and avoid handling mixed orientations during primer search.
Let’s put those files from earlier to good use! 🚀
qiime feature-classifier extract-reads \
  --i-sequences path/qiime2_outputs/silva-138.2-ssu-nr99-seqs-derep-uniq.qza \
  --p-f-primer CCTACGGGNGGCWGCAG \
  --p-r-primer GACTACHVGGGTATCTAATCC \
  --p-n-jobs 10 \
  --p-read-orientation 'forward' \
  --o-reads path/qiime2_outputs/silva-138.2-ssu-nr99-seqs-341f-805r.qza
Dereplicate
You might be wondering—why dereplicate sequences that were already dereplicated? Well, previously, we worked with full-length sequences (well, not quite full), but after extracting the shorter amplicon regions based on our primers, duplicates can reappear. Dereplicating them again helps reduce database size without losing meaningful information. So, let’s clean things up once more!
qiime rescript dereplicate \
  --i-sequences path/qiime2_outputs/silva-138.2-ssu-nr99-seqs-341f-805r.qza \
  --i-taxa path/qiime2_outputs/silva-138.2-ssu-nr99-tax-derep-uniq.qza \
  --p-mode 'uniq' \
  --o-dereplicated-sequences path/qiime2_outputs/silva-138.2-ssu-nr99-seqs-341f-805r-uniq.qza \
  --o-dereplicated-taxa path/qiime2_outputs/silva-138.2-ssu-nr99-tax-341f-805r-derep-uniq.qza
Amplicon-region specific classifier
Voila!
qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads path/qiime2_outputs/silva-138.2-ssu-nr99-seqs-341f-805r-uniq.qza \
  --i-reference-taxonomy path/qiime2_outputs/silva-138.2-ssu-nr99-tax-341f-805r-derep-uniq.qza \
  --o-classifier path/qiime2_outputs/silva-138.2-ssu-nr99-341f-805r-classifier.qza
Alright, if you’ve lost track of why we’re here—no worries, that’s totally normal! 😉 Let’s get you up to speed.
We’re at Step 1.5: Taxonomy Assignment. The plan was to run the script below, but we needed a trained classifier file first. Steps 1.5.1 to 1.5.7 got us there.
Now that we have what we need, let’s get this taxonomy annotation done! Just to keep everyone on the same page—we’re assigning taxonomy using our trained classifier on the representative sequences we obtained from the DADA2 outputs back in step 1.3. This step ensures each sequence gets its proper taxonomic identity.
qiime feature-classifier classify-sklearn \
  --i-classifier path/qiime2_outputs/silva-138.2-ssu-nr99-341f-805r-classifier.qza \
  --i-reads path/qiime2_outputs/rep-seqs.qza \
  --o-classification path/qiime2_outputs/taxonomy.qza

qiime metadata tabulate \
  --m-input-file path/qiime2_outputs/taxonomy.qza \
  --o-visualization path/qiime2_outputs/taxonomy.qzv
Amazing! 🎉 Congratulations—we now have all five essential files! Time to say goodbye to the terminal and import the data into RStudio.
In the meantime, you can also visualize your taxonomy-assigned sequences in QIIME 2 View. 🚀