1. Import data

We’ll kick things off by importing four datasets from your QIIME 2 outputs. Importing the data and formatting it correctly up front is just as important as any troubleshooting you’ll do later: double-check the data structure and explicitly reformat it as needed to avoid issues down the line. Trust me, taking a moment now to ensure everything is properly set up can save you countless headaches later. Let’s make sure everything is seamless before moving forward!

Got other variables like biomass measurements or environmental factors? Bring them along! These are the secret sauce for meaningful analysis. Linking these variables to changes in your microbiome (later in Step 1.5) is what makes the whole exercise worthwhile. After all, what’s the point of staring at pretty graphs if they don’t connect to the experimental variables you actually care about? This is where the real magic happens—tying microbiome shifts to your study’s key drivers!

Resolve dataframe issues: Here’s a heads-up: I ran into some hiccups when using the tidy_dataset function later in the pipeline. This function ensures that OTU (ASV) and sample information is consistent across all files in the dataset object, but errors can pop up if the data frames aren’t formatted correctly.

To dodge those errors, we’ll address potential data frame issues right after importing the data. A bit of extra effort now will save you a ton of frustration later!

NOTE: So, you think converting it to a data frame will magically fix everything?

Well… think again! 😅 There are still plenty of ways things can go sideways. But let’s not panic just yet. First, we’ll make sure everything holds up through the data import process. Once we’re there, I’ll guide you through troubleshooting—step by step. Stay with me! 🚀
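One housekeeping note before we start: load the packages this section leans on. This is a minimal sketch of what I load, assuming you already have qiime2R, the tidyverse, magrittr, and microeco installed; adjust it to match your own setup.

library(qiime2R)   # read_qza() for importing .qza artifacts
library(tidyverse) # read_tsv()/read_csv() plus the dplyr/tidyr/stringr verbs used below
library(magrittr)  # the %<>% compound-assignment pipe
library(microeco)  # tidy_taxonomy() here, tidy_dataset() later in the pipeline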

1.1. Importing Metadata

sample_info_16S <- read_tsv("path/metadata.txt")
sample_info_16S <- as.data.frame(sample_info_16S) # convert sample_info_16S to a data.frame
class(sample_info_16S)
head(sample_info_16S)
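Optionally, you can also set the sample IDs as row names now, since the microtable we build later generally expects sample names as the row names of its sample table. A quick sketch, assuming your metadata stores sample IDs in a column called SampleID (swap in your actual column name):

rownames(sample_info_16S) <- sample_info_16S$SampleID # "SampleID" is an assumed column name
head(rownames(sample_info_16S))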

1.2. Importing Feature table (or ASV/OTU table)

otu_table_16S <- read_qza("path/286-211-17-21-table.qza")
otu_table_16S <- as.data.frame(otu_table_16S$data) # converts to data.frame
class(otu_table_16S) 
head(otu_table_16S)
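Before moving on, it doesn’t hurt to confirm that the sample names in the feature table line up with your metadata. Here’s a quick, optional check; it assumes your metadata has a SampleID column, so adjust the column name to match your file.

# Every sample (column) in the feature table should appear in the metadata
all(colnames(otu_table_16S) %in% sample_info_16S$SampleID)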

1.3. Importing Taxonomy table

taxonomy_table_16s <- read_qza("path/silva138.2-aligned-taxonomy.qza")
taxonomy_table_16s <- taxonomy_table_16s$data # extract the taxonomy data from the qza object
head(taxonomy_table_16s)
class(taxonomy_table_16s)
str(taxonomy_table_16s)

When you run the head function now, you’ll notice something interesting: all the taxonomy ranks—Kingdom, Phylum, Class, and so on—are bundled together in a single column labeled Taxon. While it’s neat to have everything in one place, it’s not quite practical for our future analyses. To make the most out of this data, we need to separate it into individual columns for each taxonomic rank. Let’s tackle that next!

# Assuming taxonomy_table_16s is your data frame
taxonomy_table_16s <- taxonomy_table_16s %>%
  separate(
    col = Taxon,
    into = c("Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species"),
    sep = ";",
    extra = "merge",  # Merge any extra levels into the last column
    fill = "warn",    # Warn when a row has fewer than seven ranks; missing ranks become NA
    remove = FALSE    # Retain the original Taxon column initially
  ) %>%
  select(-Taxon)  # Remove the Taxon column

# View the resulting data frame
head(taxonomy_table_16s)
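Because fill = "warn" pads missing ranks with NA, it’s worth a quick look at how many features lack the deeper assignments. An optional check:

# Count features with no species-level (or genus-level) assignment
sum(is.na(taxonomy_table_16s$Species))
sum(is.na(taxonomy_table_16s$Genus))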

Troubleshooting memory issues

For some strange reason, my RStudio hit a memory cap—something that never happened in the past eight months. 🤯 I tried different fixes, but nothing worked. So instead of battling it out, I took a different approach.

Here’s the alternative code that got the job done:

library(tidyverse)

# Define taxonomic ranks
tax_ranks <- c("Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species")

# Process taxonomy data while keeping prefixes
taxonomy_table_16s <- taxonomy_table_16s %>%
  mutate(Taxon = str_split(Taxon, ";")) %>%        # Split each Taxon string into a list of ranks
  unnest_wider(Taxon, names_sep = "_") %>%         # Expand list columns into separate columns
  rename_with(~ tax_ranks, starts_with("Taxon_"))  # Rename the new columns

# Check the result
head(taxonomy_table_16s)

# Convert back to a plain data.frame for the steps below
taxonomy_table_16s <- as.data.frame(taxonomy_table_16s)

Even after splitting the taxonomy into individual columns, taxonomic tables often come with hidden issues (empty slots with NA, taxa with different prefixes) that can complicate analysis. To tackle these and ensure a clean dataset, we’ll use the tidy_taxonomy function. It’s a quick and efficient fix—this command should run in just a couple of seconds.

taxonomy_table_16s %<>% tidy_taxonomy
head(taxonomy_table_16s)

At this point, you might notice something strange—some of the Feature.ID and Confidence values have prefixes like f__ or c__. I’m not exactly sure why this happens, but rather than diving into a time-consuming troubleshooting session, let’s take the simpler route and clean them up.

We can easily remove these prefixes using the following code:

taxonomy_table_16s %<>% 
  tidy_taxonomy %>%
  mutate(Feature.ID = gsub("^f__", "", Feature.ID), # Remove "f__" from Feature.ID
         Confidence = gsub("^c__", "", Confidence) # Remove "c__" from Confidence
  )
head(taxonomy_table_16s)
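If you want to confirm the cleanup worked, a quick optional check is to look for any leftover prefixes in those two columns:

any(grepl("^f__", taxonomy_table_16s$Feature.ID)) # should return FALSE
any(grepl("^c__", taxonomy_table_16s$Confidence)) # should return FALSE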

If everything worked as expected, your cleaned taxonomy table should now show each taxonomic rank in its own column, with no stray prefixes left in Feature.ID or Confidence, when you run head(taxonomy_table_16s).

The next step is to designate Feature.ID as the row names for our taxonomy table. This helps ensure that each feature is uniquely identified and aligns with downstream analyses.

We can achieve this with the following code:

rownames(taxonomy_table_16s) <- taxonomy_table_16s$Feature.ID
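As an optional sanity check, the row names of the taxonomy table should now correspond to the feature IDs used as row names of the feature table from Step 1.2:

all(rownames(otu_table_16S) %in% rownames(taxonomy_table_16s)) # ideally TRUE; mismatches are worth investigating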

1.4. Importing phylogenetic tree

phylo_tree_16S <- read_qza("path/Feature_TreeRooted.qza")
# Extract the tree data, which is already a phylo object
phylo_tree_16S <- phylo_tree_16S$data
class(phylo_tree_16S) # Should return as "phylo"
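Optionally, you can also verify that the tree tips match the feature IDs in your feature table; a mismatch here will cause trouble once everything is bundled into one object.

all(phylo_tree_16S$tip.label %in% rownames(otu_table_16S)) # ideally TRUE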

1.5. Importing Environmental data (response variables such as biomass, TOC, etc.)

You can use these data to correlate with changes in the microbiome—not mandatory, but highly recommended. It’s always exciting to see how your variables align with shifts in the microbiome. Understanding these relationships can add depth to your findings and make your analysis far more insightful.

env_data_16S <- read_csv("path/env_data.csv")
colnames(env_data_16S)[colnames(env_data_16S) == "SampleID"] <- "" # Blank out the "SampleID" column name in env_data_16S
env_data_16S <- as.data.frame(env_data_16S)
head(env_data_16S)
class(env_data_16S)
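Since we just blanked out the SampleID column name, here is an optional check written under the assumption that the sample IDs sit in the first column of env_data.csv; it confirms that every environmental sample also exists in the metadata.

all(env_data_16S[[1]] %in% sample_info_16S$SampleID) # first column assumed to hold the sample IDs
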
Have you checked everything? Once you’ve completed Steps 1.1, 1.2, 1.3, and 1.5, don’t forget to test your data. This check is already included in the steps above, but consider this your friendly nudge to double-check everything before diving deeper. It’s like tuning an instrument before the big performance: small adjustments now ensure harmony later!
class(taxonomy_table_16s) # example for checking the taxonomy table

Make sure they return [1] "data.frame" for Steps 1.1, 1.2, 1.3, and 1.5, and [1] "phylo" for Step 1.4.
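For reference, here is the full set of checks in one place, using the object names from this section:

class(sample_info_16S)    # [1] "data.frame"
class(otu_table_16S)      # [1] "data.frame"
class(taxonomy_table_16s) # [1] "data.frame"
class(phylo_tree_16S)     # [1] "phylo"
class(env_data_16S)       # [1] "data.frame"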

If everything looks good with the taxonomy table and Feature.ID has been successfully set as the row names, we’re ready to move on and create the microtable object! 🚀

However, if something doesn’t look right, don’t ignore it. It’s important to troubleshoot and resolve any issues with the table before proceeding.