Alright, I know what you’re thinking—numbering these chapters is a bit of a wild ride, right? But please bear with me! I tend to reserve the number “1” for when we’re truly diving into something. A little bit of suspense never hurts, right? 😄
What you will work on:
After sequencing, you’ve probably received an email with a link to a Dropbox where your sequence data is hanging out, patiently waiting for you to get started. Most likely, you’ll get a compressed file (usually in zip format). Once you unzip it, you’ll discover a collection of files inside the folder. Here’s an example:

The files `D00M1_S126_L001_R1_001.fastq.gz` and `D00M1_S126_L001_R2_001.fastq.gz` correspond to the same sample ID. They represent paired-end reads, where:
- `R1`: Contains Read 1, also known as the forward read.
- `R2`: Contains Read 2, commonly referred to as the reverse read.
If your files follow this structure, with each sample having its own paired FASTQ files, congratulations! Your sequences have already been demultiplexed and assigned to individual samples using their barcodes. This is a key detail because it determines the next steps in your analysis and how we need to import the data into our workflow.
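If you want to check the pairing yourself, here is a minimal sketch in base R that matches forward and reverse files by sample ID. It assumes the Illumina-style naming shown above (`<sample>_S<number>_L<lane>_R1_001.fastq.gz`); the `D00M2_S127` file names are made up for illustration, and in practice you would build `files` with `list.files()` on your own folder.

```r
# Example file names. The D00M2_S127 pair is hypothetical, added for illustration.
files <- c(
  "D00M1_S126_L001_R1_001.fastq.gz",
  "D00M1_S126_L001_R2_001.fastq.gz",
  "D00M2_S127_L001_R1_001.fastq.gz",
  "D00M2_S127_L001_R2_001.fastq.gz"
)

# Assumed naming scheme: <sample>_S<number>_L<lane>_R<1 or 2>_001.fastq.gz
pattern <- "^(.+)_S[0-9]+_L[0-9]+_R([12])_001\\.fastq\\.gz$"

# Ignore anything that does not follow the naming scheme
files <- files[grepl(pattern, files)]

sample_id <- sub(pattern, "\\1", files)  # e.g. "D00M1"
read_dir  <- sub(pattern, "\\2", files)  # "1" = forward, "2" = reverse

forward <- setNames(files[read_dir == "1"], sample_id[read_dir == "1"])
reverse <- setNames(files[read_dir == "2"], sample_id[read_dir == "2"])

# Keep only samples that have both a forward and a reverse file
paired_samples <- intersect(names(forward), names(reverse))
pairs <- data.frame(
  sample  = paired_samples,
  forward = unname(forward[paired_samples]),
  reverse = unname(reverse[paired_samples]),
  stringsAsFactors = FALSE
)
print(pairs)
```

Any sample missing from `pairs` has only one of its two files, which is worth sorting out before importing anything.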
To confirm and get a closer look at the files, let’s familiarize ourselves with what we’re working with. It’s time to dive in and examine one of those `fastq.gz` files in R. Let’s see what insights they hold!
```r
# If you don't have the required packages, install them first
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}
BiocManager::install(c("ShortRead", "Biostrings"))

library(ShortRead)

# Path to the file
file_path <- "/Users/chathu/Documents/PhD/TRIALS/Defoliation/DeSilvaV3V4/D00M1_S126_L001_R1_001.fastq.gz"

# Read the FASTQ file
fastq_data <- readFastq(file_path)

# Inspect the data
fastq_data           # Summary of the object
sread(fastq_data)    # Extract sequences
quality(fastq_data)  # Extract quality scores

library(Biostrings)

# Read sequences
sequences <- readDNAStringSet(file_path, format = "fastq")

# Inspect the sequences
sequences

# Read the raw text of the first 10 lines (adjust n as needed)
fastq_lines <- readLines(gzfile(file_path), n = 10)

# Display those lines
cat(fastq_lines, sep = "\n")
```
In your console, you will see something like this:

Each FASTQ file from the sequencing run contains the nucleotide sequence of every read, along with a quality score for each base. The sequences are what we are mainly interested in.

A FASTQ file consists of multiple entries, each representing a sequencing read. Each entry typically has four lines:
- Header line: Starts with `@` and contains metadata about the read.
- Sequence line: The actual DNA sequence.
- Separator line: A single `+`, sometimes followed by a repeat of the header information (optional).
- Quality line: Encodes the quality scores for each base in the sequence.
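To make the four-line layout concrete, here is a small base R sketch that groups raw FASTQ lines into records. The two toy reads are invented for illustration, and the code assumes each sequence sits on a single line (as in Illumina output), so every record is exactly four lines long.

```r
# Toy FASTQ text with two records (made up for illustration)
fastq_lines <- c(
  "@read1", "ACGT", "+", "IIII",
  "@read2", "GGCA", "+", "CCFG"
)

# With unwrapped sequences, the line count must be a multiple of 4
stopifnot(length(fastq_lines) %% 4 == 0)

# Index of the header line of each record: 1, 5, 9, ...
idx <- seq(1, length(fastq_lines), by = 4)

records <- data.frame(
  header   = fastq_lines[idx],      # line 1 of each record
  sequence = fastq_lines[idx + 1],  # line 2
  quality  = fastq_lines[idx + 3],  # line 4
  stringsAsFactors = FALSE
)

# Sanity checks that follow from the format description
stopifnot(all(startsWith(records$header, "@")))
stopifnot(all(startsWith(fastq_lines[idx + 2], "+")))
stopifnot(all(nchar(records$sequence) == nchar(records$quality)))

print(records)
```

The last check is a handy one to remember: the quality line always has exactly one character per base in the sequence line.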
Explanation of the Example
First Entry
- Header line: `@M06628:347:000000000-LPWW8:1:2119:19293:22866 1:N:0:126`
  - `M06628`: Instrument ID (the sequencing machine). The leading `@` marks the start of the header.
  - `347`: Run number on the instrument.
  - `000000000-LPWW8`: Flow cell identifier.
  - `1`: Lane number on the flow cell.
  - `2119:19293:22866`: Location of the cluster on the flow cell: tile number (`2119`), then the x (`19293`) and y (`22866`) coordinates within that tile.
  - `1:N:0:126`:
    - `1`: Read number (e.g., Read 1 in paired-end sequencing).
    - `N`: Filter flag. `Y` means the read was filtered out (it failed the quality filter); `N` means it was not, so this read passed.
    - `0`: Control number (`0` when no control bits are set).
    - `126`: Index (barcode) sequence or sample number. Here it appears to be the sample number, matching `S126` in the file name.
- Sequence line: `CCTACGGGGGCAGCAGTGGGGAATATTGCGCAATGGGCGGA...`
  - The actual DNA sequence for this read.
- Separator line: `+`
  - Marks the end of the sequence; the quality scores follow on the next line. Sometimes repeats the header but is often just a `+`.
- Quality line: `CCCCCGGGGGGGGGGGFGGGGGGGGGGGGGGGGGGGGGGGG...`
  - Encodes the quality of each base in the sequence line. The characters correspond to Phred quality scores:
    - Each character represents a score, with higher scores indicating better confidence in the base call.
    - The score is the character's ASCII code minus 33 (e.g., `C` is ASCII 67, so Q = 34; `F` is ASCII 70, so Q = 37).
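The header fields and the Phred offset can also be checked by hand. The sketch below, in base R, splits the example header into its parts and converts a quality string to scores. The list element names are my own labels, not an official API, and `quality` is a shortened string built from the same characters as the example line. The error probability uses the standard Phred definition, P = 10^(-Q/10).

```r
header  <- "@M06628:347:000000000-LPWW8:1:2119:19293:22866 1:N:0:126"
quality <- "CCCCCGGGGGF"  # shortened illustration, not the full quality line

# Drop the leading "@", then split on the space between the two halves
parts       <- strsplit(sub("^@", "", header), " ", fixed = TRUE)[[1]]
id_fields   <- strsplit(parts[1], ":", fixed = TRUE)[[1]]
read_fields <- strsplit(parts[2], ":", fixed = TRUE)[[1]]

# Element names below are descriptive labels chosen for this example
header_info <- list(
  instrument  = id_fields[1],
  run         = id_fields[2],
  flowcell    = id_fields[3],
  lane        = id_fields[4],
  tile        = id_fields[5],
  x           = id_fields[6],
  y           = id_fields[7],
  read        = read_fields[1],
  is_filtered = read_fields[2],  # "N" = not filtered out, i.e. the read passed
  control     = read_fields[3],
  index       = read_fields[4]
)
str(header_info)

# Phred+33: quality score = ASCII code of the character minus 33
phred <- utf8ToInt(quality) - 33L

# Standard Phred definition: probability that the base call is wrong
error_prob <- 10^(-phred / 10)

print(data.frame(
  char       = strsplit(quality, "")[[1]],
  phred      = phred,
  error_prob = signif(error_prob, 2)
))
```

A score in the mid to high 30s corresponds to an error probability of a few in ten thousand, so the bases shown in this example are high-confidence calls.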
Now that we’ve taken a closer look at our sequences, it’s time to move on to the exciting phase of preparing them for analysis—what we like to call upstream analysis. Think of this as transforming your raw data into a polished, ready-to-analyze form.
There are plenty of ways to approach this step, but QIIME2 truly shines as a comprehensive microbiome analysis platform. What sets it apart? Its strong emphasis on transparency, from data processing to analysis, ensuring your work is robust and reproducible.
The only catch (well, more like a small learning curve) is that you’ll need to dabble in some command-line coding. Don’t worry, though—once you get the hang of it, the power and versatility of QIIME2 will make it all worthwhile!
Let’s move on to upstream analysis.