The five key files: Building blocks of your analysis
Here’s the plan: we’ll process our raw sequences step-by-step, and by the end, we’ll have these five essential files in this order:
- table.qza – A feature table summarizing the abundance of sequences across samples.
- rep-seqs.qza – Representative sequences after trimming, capturing unique sequence variants.
- taxonomy.qza – Taxonomic classifications for our sequences, revealing the microbial communities.
- rooted-tree.qza – A phylogenetic tree to support diversity analyses.
Oh! Did you notice I said five but only listed four? Don’t worry—keep going, and you’ll find out about the fifth file soon enough.
These output files will serve as the cornerstone for meaningful downstream analyses. Let’s dive in!
Once you’ve got QIIME2 installed and the environment activated, you’re all set to move forward! You can find the installation guide on their official website: QIIME2 Installation.
Whether you’re on a PC or Mac, or working with an Intel-based or Apple Silicon chipset, the documentation will guide you through the process step-by-step. I won’t dive into those details here to avoid redundancy, but make sure to double-check the requirements for your platform before proceeding.
Activate your QIIME2 environment using the following command in your terminal:
conda env list  # find the environments
conda activate qiime2-amplicon-2024.10  # activate the QIIME 2 environment
1.1 Importing Raw Sequences: Creating the Master .qza File
Now it's time to import the raw sequences! This step will create a .qza file (the Master .qza file) that includes all the paired sequences from your raw data, setting you up for the next stages of processing in QIIME2. To keep everything organized, it's a good idea to store your raw files in a dedicated folder. For my workflow, I've created a separate folder for the outputs at /Users/chathu/Documents/qiime2_outputs/. This keeps everything neat and easy to find as we move forward with the analysis!
qiime tools import \
  --type 'SampleData[PairedEndSequencesWithQuality]' \
  --input-format CasavaOneEightSingleLanePerSampleDirFmt \
  --input-path /Users/chathu/Documents/DeSilvaV3V4 \
  --output-path /Users/chathu/Documents/qiime2_outputs/demultiplexed-sequences.qza
What Does It Mean?
Casava Output Format:
When sequencing data is generated by Illumina machines, it is often processed through the CASAVA software, which organizes the output into a directory structure where each sample has its own FASTQ file(s).
OneEight:
The term “OneEight” indicates the format introduced with CASAVA version 1.8, which is widely used and follows specific naming conventions for files.
Single Lane Per Sample Directory Format:
This part means that the data is organized so that each sample’s sequences (forward and reverse reads) are stored as separate FASTQ files within a directory.
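To make this concrete, here's a minimal sketch of what a Casava 1.8-style input directory could look like. The directory and sample names below are hypothetical placeholders; only the naming pattern matters, and your real directory would of course contain actual gzipped FASTQ files rather than empty ones.

```shell
# Casava 1.8 naming: <sample-name>_<sample-number>_L<lane>_R<1|2>_001.fastq.gz
# Placeholder files only -- real data would be gzipped FASTQs with reads inside.
mkdir -p DeSilvaV3V4
touch DeSilvaV3V4/sample-1_S1_L001_R1_001.fastq.gz \
      DeSilvaV3V4/sample-1_S1_L001_R2_001.fastq.gz \
      DeSilvaV3V4/sample-2_S2_L001_R1_001.fastq.gz \
      DeSilvaV3V4/sample-2_S2_L001_R2_001.fastq.gz
ls DeSilvaV3V4   # one R1 (forward) and one R2 (reverse) file per sample
```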
Let's assess the sequences and their sequencing depth using the qiime demux summarize command. It takes the .qza file we've just created and generates a file that can be visualized in QIIME 2 View, providing a detailed overview of the number of sequences per sample and offering insights into the quality of those sequences.
qiime demux summarize \
  --i-data /Users/chathu/Documents/qiime2_outputs/demultiplexed-sequences.qza \
  --o-visualization /Users/chathu/Documents/qiime2_outputs/demultiplexed-sequences-summ.qzv
Now you can visualize the .qzv file in QIIME 2 View.
1.2 Deciding Where to Truncate: Interpreting Quality Score Visualizations
Here’s what you’ll see in the visualization and how to decide where to truncate your sequences:
The visualization will display quality scores across the length of your sequences. You’ll notice that the quality tends to drop off towards the ends of the reads. This is where sequencing errors, like poor-quality base calls, are more common.
To decide the truncation position, look for the point where the quality score starts to decline significantly. The goal is to cut off the lower-quality regions while preserving the high-quality portion. Typically, you'll want to truncate your sequences just before the quality score drops below an acceptable threshold (usually a 25th-percentile score of 30 or above). In other words, when you see 29 as your 25th percentile, step back one position to the left and record that position (286 in my case).
By doing this, you’ll maximize the reliability of your data and improve the overall quality of your analysis!

Do the same for the reverse read, and now you have the two positions where the forward and reverse reads need to be truncated.
Then we can move forward with denoising. Here's what's going to happen.
My samples were amplified using the Bacteria-specific (“Illumina”) V3-V4 primers:
- 341F = CCTACGGGNGGCWGCAG
- 805R = GACTACHVGGGTATCTAATCC
These were then sequenced on an Illumina MiSeq using a 2x300bp kit. The hypervariable region targeted by these primers is 464bp in length, so with 300bp reads, the sequences will overlap, allowing us to perform paired-end analysis during downstream processing.
To determine the overlap between the paired-end reads, we subtract the length of the target region (464 bp) from the combined length of the two reads. Since both the forward and reverse reads are 300 bp each, we can calculate the overlap as follows:
Overlap = 300 bp + 300 bp − 464 bp = 136 bp
The overlap between the two reads initially spans 136 base pairs. However, once we apply quality filtering and truncate the sequences, this overlap shrinks significantly—often down to just 20–30 base pairs. This reduction occurs because we trim off the low-quality regions at both ends of the reads, leaving a much smaller overlapping region for paired-end merging.
To put this into perspective: the DADA2 pipeline requires a minimum of 12 bases of overlap for paired-end reads to successfully merge. So, it’s crucial to keep this in mind when evaluating sequence quality and setting truncation parameters.
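The arithmetic above is easy to sanity-check in the shell before committing to truncation positions. This sketch uses the 464 bp amplicon length and the truncation positions chosen from my quality plots (286 forward, 207 reverse); the left-trim of the primers shortens the 5' ends, so the 3' overlap is governed by the truncation lengths alone.

```shell
# Hypothetical values: amplicon length and the truncation positions chosen above.
AMPLICON=464
TRUNC_F=286
TRUNC_R=207
OVERLAP=$((TRUNC_F + TRUNC_R - AMPLICON))
echo "Overlap after truncation: ${OVERLAP} bp"
if [ "$OVERLAP" -ge 12 ]; then
  echo "OK: meets DADA2's 12 bp merging minimum"
else
  echo "Too short: reads will fail to merge; relax your truncation positions"
fi
```

With these numbers the remaining overlap is 29 bp, comfortably above the 12 bp floor; if your quality plots force more aggressive truncation, rerun this check before launching DADA2.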
Here’s a quick visual guide to make this concept clearer.

Now that we've decided our truncation and trimming positions, ready? Let's dive into the DADA2 pipeline and get started!
1.3. DADA2
What is DADA2? Great question! Before diving into the pipeline, let’s take a moment to understand what DADA2 is.
What does DADA2 do in a nutshell?
- Filters and trims low-quality sequences.
- Learns the error rates of your data.
- Identifies and corrects errors to create a set of error-free ASVs.
- Merges paired-end reads (if applicable).
- Removes chimeric sequences.
- Outputs a high-resolution table of ASVs for downstream analysis.
Ready to see it in action? Let’s continue with the DADA2 pipeline!
qiime dada2 denoise-paired \
  --i-demultiplexed-seqs /Users/chathu/Documents/qiime2_outputs/demultiplexed-sequences.qza \
  --p-trunc-len-f 286 \
  --p-trunc-len-r 207 \
  --p-trim-left-f 17 \
  --p-trim-left-r 21 \
  --p-n-threads 10 \
  --o-table /Users/chathu/Documents/qiime2_outputs/table.qza \
  --o-representative-sequences /Users/chathu/Documents/qiime2_outputs/rep-seqs.qza \
  --o-denoising-stats /Users/chathu/Documents/qiime2_outputs/denoising-stats.qza \
  --verbose
Command Purpose
The qiime dada2 denoise-paired command processes paired-end sequencing data to remove errors, merge reads, and produce a feature table and representative sequences.
Key Parameters
- --i-demultiplexed-seqs: the demultiplexed sequences file (.qza).
- --p-trim-left-f 17 / --p-trim-left-r 21: remove primer bases at the start of the reads.
- --p-trunc-len-f 286 / --p-trunc-len-r 207: trim low-quality bases at the ends.
- --p-n-threads 10: speed up processing by using 10 CPU cores.
- --o-table: feature table (table.qza).
- --o-representative-sequences: unique ASVs (rep-seqs.qza).
- --o-denoising-stats: DADA2 stats (denoising-stats.qza).
- --verbose: print detailed logs during processing.
Now that we've generated several output files, let's take a closer look at one of them to understand the denoising process. Specifically, let's examine denoising-stats.qza. This file provides detailed insights into the quality filtering and error correction steps applied to your data during denoising. By exploring it, we'll get a better sense of how the raw sequencing reads were processed and prepared for downstream analysis.
Converting the denoising-stats.qza file into .qzv format is quick and easy, making it the best way to take a closer look at the denoising process. Once converted, we can interactively explore the details using QIIME 2 View. This visualization will provide clear insights into the quality filtering and error correction steps applied to the data. Let's go ahead and do that!
qiime metadata tabulate \
  --m-input-file /Users/chathu/Documents/qiime2_outputs/denoising-stats.qza \
  --o-visualization /Users/chathu/Documents/qiime2_outputs/dada2-stats-summ.qzv
Here’s what you’ll see: the table includes several column headers that are quite self-explanatory, so I won’t go into detail about them. The information presented is straightforward, allowing you to easily interpret the stats from the denoising process.

Having said that, what you’ll want to pay close attention to is the last column: percentage of input non-chimeric. This column provides critical information about the proportion of your reads that passed the chimera-checking step, offering valuable insight into the quality and integrity of your sequencing data.
As a rule of thumb, you should aim to retain at least 50-60% of non-chimeric reads after the denoising process. If the percentage is significantly lower, it may indicate issues with your sequencing data, such as the presence of a high number of chimeric sequences or suboptimal primer design.
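If you'd rather check this programmatically across many samples, here's a sketch: export denoising-stats.qza to a TSV (e.g. with qiime tools export), then flag any sample retaining under 50% of input reads. The two rows below are fabricated purely for illustration, and the column layout (percentage in column 4) is a simplification of the real exported table.

```shell
# Fabricated example rows standing in for an exported denoising-stats TSV.
printf 'sample-id\tinput\tnon-chimeric\tpercentage of input non-chimeric\n' >  stats.tsv
printf 'sample-1\t50000\t41000\t82.0\n'                                     >> stats.tsv
printf 'sample-2\t48000\t19200\t40.0\n'                                     >> stats.tsv
# Flag samples retaining under 50% of input reads after chimera removal.
awk -F'\t' 'NR > 1 && $4 + 0 < 50 {print $1 " retained only " $4 "% -- investigate"}' stats.tsv
```

Here only the fabricated sample-2 trips the threshold; in a real run, any flagged sample is your cue to revisit truncation parameters or primer removal.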
Troubleshooting DADA2
Now, let’s head over to the safe haven of troubleshooting and get back on track.
Check out DADA2 troubleshooting tips
If you’ve been following along, congratulations! Remember I mentioned something about five essential files? Well, we now have two of them: table.qza and rep-seqs.qza.
Here's how you can quickly examine your two files. As you might have guessed, the first step is to convert them into .qzv format. But now comes the moment where the fifth file, metadata, comes into play.
1.4. The fifth file
You may already have a metadata file prepared when you submitted your samples for sequencing. If not, a simple approach would be to download an example metadata file formatted for Qiime2 and edit it to include your sample-specific information.
To avoid any confusion, here’s an example of how a Qiime2-formatted metadata file should look. You DO NOT need the second column (barcode-sequence), so feel free to delete it. This column is only necessary if your sequences are not demultiplexed, but in our case, we are working with demultiplexed data.

Once ready, here’s how you can visualize your feature table and representative sequences using the metadata file:
qiime feature-table summarize \
  --i-table /Users/chathu/Documents/qiime2_outputs/table.qza \
  --m-sample-metadata-file /Users/chathu/Documents/qiime2_outputs/sample-metadata.tsv \
  --o-visualization /Users/chathu/Documents/qiime2_outputs/table.qzv

qiime feature-table tabulate-seqs \
  --i-data /Users/chathu/Documents/qiime2_outputs/rep-seqs.qza \
  --o-visualization /Users/chathu/Documents/qiime2_outputs/rep-seqs.qzv
These commands will create .qzv visualizations that you can explore interactively with QIIME 2 View to better understand your data.
Great progress so far—let’s keep building on this momentum!
1.5. Taxonomy assignment
The next step—brace yourselves—is a tricky one. We’re about to assign taxonomic ranks to our reads. This is a critical process that connects our sequences to known microbial communities, and it requires precision to ensure accurate classifications. Let’s tackle this step carefully!
qiime feature-classifier classify-sklearn \
  --i-classifier gg-13-8-99-515-806-nb-classifier.qza \
  --i-reads filtered-sequences-1.qza \
  --o-classification taxonomy.qza
While this might seem straightforward, it’s important to understand that we are aligning our sequences to the file gg-13-8-99-515-806-nb-classifier.qza, which contains a pre-trained classifier based on the Greengenes reference database (just as an example). In our analysis, however, we will be using the SILVA database. This is a crucial decision because the choice of classifier directly impacts the accuracy of your taxonomic assignments. The reference database you select will define how your sequences are classified, so take time to ensure that the chosen classifier is well-suited for your data.
So, here’s my advice: don’t just grab the pre-prepared databases floating around online. Always double-check and make sure you’re working with the most up-to-date version to ensure your analysis is accurate and reflects the latest research.
Before we dive in, here’s a helpful resource: this link will take you to a detailed online tutorial on how to prepare your classifier file. While I’ll walk you through the most important and necessary steps here, feel free to follow the tutorial if you prefer a more in-depth guide. Either way, you’ll be well on your way to getting your classifier ready for your analysis!
1.5.1. Download SILVA RNA Sequences and Taxonomy
Let’s head over to https://www.arb-silva.de/ and find the latest version of the database. Once you’ve figured out the version number, Qiime2 provides a convenient method to fetch the SILVA database, specifically the RNA sequences (SSURef_NR99
) for version 138.2
:
qiime rescript get-silva-data \
  --p-version '138.2' \
  --p-target 'SSURef_NR99' \
  --o-silva-sequences path/qiime2_outputs/silva-138.2-ssu-nr99-rna-seqs.qza \
  --o-silva-taxonomy path/qiime2_outputs/silva-138.2-ssu-nr99-tax.qza
1.5.2. Transform RNA Sequences to DNA
Since our analysis requires DNA sequences, we’ll reverse-transcribe the RNA sequences into DNA using the following command:
qiime rescript reverse-transcribe \
  --i-rna-sequences path/qiime2_outputs/silva-138.2-ssu-nr99-rna-seqs.qza \
  --o-dna-sequences path/qiime2_outputs/silva-138.2-ssu-nr99-seqs.qza
1.5.3. Removing Low-Quality Sequences with cull-seqs
In this step, we will filter out sequences containing 5 or more ambiguous bases (IUPAC-compliant ambiguity codes) and sequences with homopolymers of 8 or more consecutive bases. These filtering criteria are the default settings. For additional information, refer to the --help documentation.
qiime rescript cull-seqs \
  --i-sequences path/qiime2_outputs/silva-138.2-ssu-nr99-seqs.qza \
  --o-clean-sequences path/qiime2_outputs/silva-138.2-ssu-nr99-seqs-cleaned.qza
1.5.4. Filtering sequences by length and taxonomy
To avoid bias in database selection, we’ll filter reference sequences based on their taxonomy rather than a blanket length threshold. Removing sequences below 1000 or 1200 bp would disproportionately affect Archaea and some Bacteria, potentially leading to the retention of lower-quality sequences from Bacteria or Eukaryota. Instead, we’ll apply specific length filters: Archaea (16S) ≥ 900 bp, Bacteria (16S) ≥ 1200 bp, and Eukaryota (18S) ≥ 1400 bp. See the help text for more details.
qiime rescript filter-seqs-length-by-taxon \
  --i-sequences path/qiime2_outputs/silva-138.2-ssu-nr99-seqs-cleaned.qza \
  --i-taxonomy path/qiime2_outputs/silva-138.2-ssu-nr99-tax.qza \
  --p-labels Archaea Bacteria Eukaryota \
  --p-min-lens 900 1200 1400 \
  --o-filtered-seqs path/qiime2_outputs/silva-138.2-ssu-nr99-seqs-filt.qza \
  --o-discarded-seqs path/qiime2_outputs/silva-138.2-ssu-nr99-seqs-discard.qza
I know these were a lot of steps, but hang in there—we’re almost done!
1.5.5. Dereplicating
The SILVA 138.2 NR99 release may contain identical sequences with matching or differing taxonomies. To prevent confusion, we will dereplicate the data to remove redundancies before downstream processing.
qiime rescript dereplicate \
  --i-sequences path/qiime2_outputs/silva-138.2-ssu-nr99-seqs-filt.qza \
  --i-taxa path/qiime2_outputs/silva-138.2-ssu-nr99-tax.qza \
  --p-mode 'uniq' \
  --o-dereplicated-sequences path/qiime2_outputs/silva-138.2-ssu-nr99-seqs-derep-uniq.qza \
  --o-dereplicated-taxa path/qiime2_outputs/silva-138.2-ssu-nr99-tax-derep-uniq.qza
Now, we’ve hit a crossroads. You have two paths ahead:
1️⃣ Build a classifier for full-length SSU sequences – a versatile option for broader applications.
2️⃣ Create an amplicon-region-specific classifier – fine-tuned for targeted analysis.
Why choose now when you can have both? Let’s set them up so you can pick the right tool when the time comes.
Oh, and just so you know—having the right classifier in some labs makes you a legend (well, maybe not a god, but pretty close). Let’s get started! 🚀
1.5.6. Make classifier for use on full-length SSU sequences
Making the classifier for use on full-length SSU sequences is pretty straightforward.
qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads path/qiime2_outputs/silva-138.2-ssu-nr99-seqs-derep-uniq.qza \
  --i-reference-taxonomy path/qiime2_outputs/silva-138.2-ssu-nr99-tax-derep-uniq.qza \
  --o-classifier path/qiime2_outputs/silva-138.2-ssu-nr99-classifier.qza
1.5.7. Make amplicon-region specific classifier
In this section, we’ll focus on creating an amplicon-specific classifier, which improves the accuracy of taxonomic classification. To do this, we’ll extract the amplicon region from our reference database using the same primer sequences used in PCR/sequencing. Remember to enter primers in the 5′-3′ direction, just as you would when ordering oligos.
For this tutorial, we'll extract the Bacteria-specific ("Illumina") V3-V4 region using our previously filtered full-length sequences. Since the SILVA database is curated in a forward orientation, we'll set --p-read-orientation 'forward' to speed up processing and avoid handling mixed orientations during primer search.
Let’s put those files from earlier to good use! 🚀
qiime feature-classifier extract-reads \
  --i-sequences path/qiime2_outputs/silva-138.2-ssu-nr99-seqs-derep-uniq.qza \
  --p-f-primer CCTACGGGNGGCWGCAG \
  --p-r-primer GACTACHVGGGTATCTAATCC \
  --p-n-jobs 10 \
  --p-read-orientation 'forward' \
  --o-reads path/qiime2_outputs/silva-138.2-ssu-nr99-seqs-341f-805r.qza
Dereplicate
You might be wondering—why dereplicate sequences that were already dereplicated? Well, previously, we worked with full-length sequences (well, not quite full), but after extracting the shorter amplicon regions based on our primers, duplicates can reappear. Dereplicating them again helps reduce database size without losing meaningful information. So, let’s clean things up once more!
qiime rescript dereplicate \
  --i-sequences path/qiime2_outputs/silva-138.2-ssu-nr99-seqs-341f-805r.qza \
  --i-taxa path/qiime2_outputs/silva-138.2-ssu-nr99-tax-derep-uniq.qza \
  --p-mode 'uniq' \
  --o-dereplicated-sequences path/qiime2_outputs/silva-138.2-ssu-nr99-seqs-341f-805r-uniq.qza \
  --o-dereplicated-taxa path/qiime2_outputs/silva-138.2-ssu-nr99-tax-341f-805r-derep-uniq.qza
Amplicon-region specific classifier
Voila!
qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads path/qiime2_outputs/silva-138.2-ssu-nr99-seqs-341f-805r-uniq.qza \
  --i-reference-taxonomy path/qiime2_outputs/silva-138.2-ssu-nr99-tax-341f-805r-derep-uniq.qza \
  --o-classifier path/qiime2_outputs/silva-138.2-ssu-nr99-341f-805r-classifier.qza
Alright, if you’ve lost track of why we’re here—no worries, that’s totally normal! 😉 Let’s get you up to speed.
We’re at Step 1.5: Taxonomy Assignment. The plan was to run the script below, but we needed a trained classifier file first. Steps 1.5.1 to 1.5.7 got us there.
Now that we have what we need, let’s get this taxonomy annotation done! Just to keep everyone on the same page—we’re assigning taxonomy using our trained classifier on the representative sequences we obtained from the DADA2 outputs back in step 1.3. This step ensures each sequence gets its proper taxonomic identity.
qiime feature-classifier classify-sklearn \
  --i-classifier path/qiime2_outputs/silva-138.2-ssu-nr99-341f-805r-classifier.qza \
  --i-reads path/qiime2_outputs/rep-seqs.qza \
  --o-classification path/qiime2_outputs/taxonomy.qza

qiime metadata tabulate \
  --m-input-file path/qiime2_outputs/taxonomy.qza \
  --o-visualization path/qiime2_outputs/taxonomy.qzv
Amazing! 🎉 Congratulations—we now have all five essential files! Time to say goodbye to the terminal and import the data into RStudio.
In the meantime, you can also visualize your taxonomy-assigned sequences in QIIME 2 View. 🚀