🧬 Microbiome analysis using Qiime2

In this tutorial, we will show how Coretex can be integrated with external tools and platforms such as Qiime2. We have migrated the steps of the official Qiime2 Moving Pictures tutorial into Coretex as individual Job Templates, and we will show how to use those templates to analyze DNA samples in Coretex. It is assumed that you have already worked with Projects, Tasks and Datasets, so the steps for creating them will be skipped.

Overview of the DNA processing pipeline:


Before running any of the Tasks in this tutorial, you need to create a set of new Tasks of the Bioinformatics type, one Task for each step of the workflow. These Tasks define the code for each step and are available out of the box in the list of Task templates.

The parameters required to start a run are specified in the Task code, more precisely in its config file. They can be adjusted just before starting the run if needed.

The dataset that we will use in the first step of this tutorial can be downloaded from the official Qiime2 website. It contains DNA fragments from different samples that were pooled and sequenced together.

There are three methods available for uploading a Dataset. For further information regarding the uploading process, please refer to the Dataset page.

Note: You can use your own dataset, but you have to upload it to Coretex.ai, and its Dataset Samples must have the following structure:

DataSample.zip:

  • sequences

    • sequences.fastq.gz

    • barcodes.fastq.gz

  • sample_metadata.json
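As a sanity check before uploading, the expected archive layout can be verified programmatically. A minimal Python sketch, assuming the two fastq files sit inside a `sequences` folder and `sample_metadata.json` sits at the archive root:

```python
import io
import zipfile

# Expected entries of a Dataset Sample archive. Whether sample_metadata.json
# sits at the archive root (rather than inside "sequences") is an assumption.
EXPECTED = {
    "sequences/sequences.fastq.gz",
    "sequences/barcodes.fastq.gz",
    "sample_metadata.json",
}

def has_expected_layout(zip_bytes: bytes) -> bool:
    """Check that the archive contains all expected file entries."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        names = {n for n in zf.namelist() if not n.endswith("/")}
    return EXPECTED.issubset(names)

# Build a minimal placeholder archive in memory to demonstrate the check.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    for name in EXPECTED:
        zf.writestr(name, b"")

print(has_expected_layout(buffer.getvalue()))  # True
```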

When running Jobs from this tutorial, try selecting the option for caching the environment so that execution times are much shorter.

Every time we start a Run, on the 'Run task' screen, next to the fields for entering the name of the run, choosing the job, and setting the parameters, we can also choose the environment.


This tutorial consists of 7 sequential steps, so you will need to create 7 distinct Tasks and start 7 individual Runs.

Workflow steps and Tasks:

  • Step 1: Import sequencing reads into a Qiime2 format

  • Step 2: Demultiplexing sequences

  • Step 3: DADA2 Denoising

  • Step 4: OTU Clustering

  • Step 5: Taxonomic Analysis

  • Step 6: Phylogenetic Diversity Analysis

  • Step 7: Alpha and Beta Diversity Analysis

As a result of each run, we get a Dataset that is used as an input parameter for the next run. For example, the result of the first step is the Dataset that we will use to start the Step 2 Run.

Each Run in this tutorial takes one (or more) Datasets as input and always creates a single Dataset as output.

Import sequencing reads into a Qiime2 format

The initial step involves formatting the data into a structure compatible with QIIME 2's requirements for further processing and analysis.

Both multiplexed and demultiplexed sequencing reads, as well as both single-end and paired-end sequencing reads are supported.



Parameters:

  • Dataset which contains multiplexed or demultiplexed fastq sequences.

  • Name of the sequence metadata file from the dataset.
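Under the hood, this step corresponds to the import command from the Moving Pictures tutorial. A sketch of that CLI call, assembled but not executed; the artifact type and file names are taken from the official tutorial and may differ in your setup:

```python
# Sketch of the QIIME 2 CLI call behind this step (Moving Pictures tutorial).
# "emp-single-end-sequences" is a folder holding sequences.fastq.gz and
# barcodes.fastq.gz; the output is a .qza Qiime2 artifact.
import_cmd = [
    "qiime", "tools", "import",
    "--type", "EMPSingleEndSequences",
    "--input-path", "emp-single-end-sequences",
    "--output-path", "emp-single-end-sequences.qza",
]
print(" ".join(import_cmd))
```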

Demultiplexing sequences

The sequenced data contains multiple samples pooled together, so demultiplexing is an important step to separate the samples according to their source. To differentiate between samples, each of them had a unique barcode attached before sequencing.

You can inspect this data using Dataset Preview on Coretex, by opening the Dataset you created for this tutorial.

Before doing any kind of analysis on the sequence data, you need to extract the per-sample reads associated with each barcode. To do this, use the "Demultiplexing sequences" Task Template on Coretex.
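For reference, this Task corresponds to the EMP single-end demultiplexing command from the Moving Pictures tutorial. A sketch of that call, assembled but not executed; the metadata file name and barcode column are assumptions based on the official tutorial:

```python
# Sketch of the QIIME 2 demultiplexing call behind this Task Template.
# sample-metadata.tsv maps each sample to its barcode sequence.
demux_cmd = [
    "qiime", "demux", "emp-single",
    "--i-seqs", "emp-single-end-sequences.qza",
    "--m-barcodes-file", "sample-metadata.tsv",
    "--m-barcodes-column", "barcode-sequence",
    "--o-per-sample-sequences", "demux.qza",
]
print(" ".join(demux_cmd))
```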

To inspect the output, you can open the Dataset Preview for the Dataset that was created as a result of running this Task.

Here is one of the outputs of demultiplexing sequences which shows sequence count per sample:

For more information click here.

Sequence quality control and feature table construction

Sequenced samples contain some errors, so in this step of the workflow you need to apply a denoising algorithm to clean up the data.

These algorithms are currently supported:

  • DADA2

  • Deblur

The choice between these two algorithms depends on specific analysis needs and preferences, as well as the characteristics of the data you want to analyze.

You can change the algorithm by changing the parameter associated with it.

Besides setting the algorithm as a parameter, the data obtained from the previous run can be used to determine good values for the other parameters required to execute this run.

In the image above we can see that there are trimLeft and truncLen parameters. To pick good values for these parameters you must take a look at the quality plot generated in the previous step:

You can see that the quality score of the bases at the start of the sequences is high, so no trimming is needed and trimLeft should be set to 0. The quality starts to drop at around base 120, so truncLen will be set to that value.
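The reasoning above can be sketched as a small heuristic: truncate at the first base position where the quality drops below an acceptable threshold. The quality profile and threshold below are made-up illustrations, not values read from the real plot:

```python
# Illustrative heuristic: pick truncLen as the first position where the
# median quality score falls below a threshold.
def pick_trunc_len(median_quality, threshold=25):
    for position, score in enumerate(median_quality):
        if score < threshold:
            return position
    return len(median_quality)  # no drop: keep the full length

# Made-up profile: high quality until ~base 120, then a drop.
qualities = [38] * 120 + [20] * 30
print(pick_trunc_len(qualities))  # 120
```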

If you are not sure about the best value for a parameter, you can turn the parameter into a list and enter all the values you think could work; one run will be started per value.
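Conceptually, a list-valued parameter expands into one run per value. A small illustration with hypothetical candidate values (the numbers are made up):

```python
# Illustration: a list-valued parameter fans out into one run per value.
trunc_len_candidates = [110, 120, 130]
runs = [{"trimLeft": 0, "truncLen": value} for value in trunc_len_candidates]
print(len(runs))  # 3 runs, one per candidate value
```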

To do this you should use "Sequence quality control and feature table construction" Task Template on Coretex.
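For reference, with the DADA2 algorithm selected this step corresponds to the denoise command from the Moving Pictures tutorial. A sketch of that call, assembled but not executed, using the trimLeft/truncLen values discussed above; file names are assumptions:

```python
# Sketch of the DADA2 single-end denoising call behind this Task Template.
dada2_cmd = [
    "qiime", "dada2", "denoise-single",
    "--i-demultiplexed-seqs", "demux.qza",
    "--p-trim-left", "0",
    "--p-trunc-len", "120",
    "--o-representative-sequences", "rep-seqs.qza",
    "--o-table", "table.qza",
    "--o-denoising-stats", "stats.qza",
]
print(" ".join(dada2_cmd))
```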

For more information click here.

OTU Clustering

The output of the third step (merged paired-end reads) is used as the input dataset.

The clusteringMethod parameter determines which method is used for OTU clustering.

Methods:

  • De Novo

  • Closed Reference

  • Open Reference

De Novo clustering algorithms identify sequence similarities to group sequences into clusters. This clustering method does not depend on any reference dataset.

The output of De Novo clustering methods is the organization of sequences into clusters based on their shared similarities.
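For reference, De Novo OTU clustering in Qiime2 is typically done through the vsearch plugin. A sketch of such a call, assembled but not executed; the 97% identity threshold and file names are assumptions, not values mandated by this Task Template:

```python
# Sketch of a De Novo OTU clustering call via the QIIME 2 vsearch plugin.
# --p-perc-identity 0.97 is the conventional 97% OTU threshold (an assumption).
cluster_cmd = [
    "qiime", "vsearch", "cluster-features-de-novo",
    "--i-table", "table.qza",
    "--i-sequences", "rep-seqs.qza",
    "--p-perc-identity", "0.97",
    "--o-clustered-table", "table-dn-97.qza",
    "--o-clustered-sequences", "rep-seqs-dn-97.qza",
]
print(" ".join(cluster_cmd))
```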

Taxonomic analysis

Taxonomic analysis is used to extract the taxonomy (taxonomic composition) of the samples. It achieves this by using an ML model (a trained classifier).

The info that gets extracted is also visualized, so you can easily explore multiple levels of taxonomic composition of the samples.

The only parameter that needs to be provided externally for this Run is the URL of the classifier itself. This Run only supports classifiers generated by the Qiime2 pipeline for training classifiers. Any kind of URL which points to a file is supported.

Output data is the taxonomic composition of the provided samples.
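For reference, this step corresponds to Qiime2's sklearn-based classification command. A sketch of that call, assembled but not executed; `classifier.qza` stands in for the artifact downloaded from the provided URL, and the other file names are assumptions:

```python
# Sketch of the taxonomy classification call behind this Task Template.
# classifier.qza is the trained classifier fetched from the provided URL.
taxonomy_cmd = [
    "qiime", "feature-classifier", "classify-sklearn",
    "--i-classifier", "classifier.qza",
    "--i-reads", "rep-seqs.qza",
    "--o-classification", "taxonomy.qza",
]
print(" ".join(taxonomy_cmd))
```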

For more information click here.

Generate a tree for phylogenetic diversity analyses

To do downstream analysis, you first need to generate a phylogenetic tree representation of the sequences produced by the previous step. This creates a tree covering all representative sequences in the data.

To do this you should use "Generate a tree for phylogenetic diversity analyses" Task Template on Coretex.
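For reference, this step corresponds to the alignment-and-tree pipeline from the Moving Pictures tutorial. A sketch of that call, assembled but not executed; file names are assumptions:

```python
# Sketch of the phylogenetic tree pipeline behind this Task Template:
# MAFFT alignment, masking, FastTree, then rooting, in one command.
phylogeny_cmd = [
    "qiime", "phylogeny", "align-to-tree-mafft-fasttree",
    "--i-sequences", "rep-seqs.qza",
    "--o-alignment", "aligned-rep-seqs.qza",
    "--o-masked-alignment", "masked-aligned-rep-seqs.qza",
    "--o-tree", "unrooted-tree.qza",
    "--o-rooted-tree", "rooted-tree.qza",
]
print(" ".join(phylogeny_cmd))
```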

For more information click here.

Alpha and beta diversity analysis

Downstream analysis provides the most information about the sequences. There are two kinds of downstream analysis:

  • Alpha diversity

  • Beta diversity

Alpha diversity provides detailed information about which bacteria are contained inside a single sample.

Beta diversity provides detailed information about regional and local diversity between samples.
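To make the alpha diversity idea concrete, one common alpha diversity metric is the Shannon index, which rewards samples whose features are both numerous and evenly distributed. A small sketch with made-up feature counts:

```python
import math

# Shannon index: H = -sum(p * ln(p)) over feature proportions p in one sample.
def shannon(counts):
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

even = shannon([25, 25, 25, 25])   # four evenly abundant features
skewed = shannon([97, 1, 1, 1])    # one dominant feature
print(even > skewed)  # True: the even community is more diverse
```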

You can see the list of parameters for this step here:

You can see that there are two parameters that will affect the result of this Run: "samplingDepth" and "maxDepth". The "samplingDepth" parameter defines the total frequency (number of sequences) that each sample is subsampled, or rarefied, to; samples with fewer sequences than this are dropped from the analysis.

This value should be picked based on the feature table visualization output done in the previous steps.

In the table above, samples are sorted by frequency in descending order and the bottom of the table is shown. You can see that the values stay more or less consistent down to a frequency of 1103, which makes this value a good choice of sampling depth for these sequences.
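The trade-off behind this choice can be sketched as: pick the largest depth that still retains most samples. The frequencies below are made up, except 1103, the value used in this tutorial:

```python
# Illustrative heuristic: choose the sampling depth as the frequency of the
# last sample we are willing to keep (here, 90% of samples survive).
def pick_sampling_depth(frequencies, keep_fraction=0.9):
    ordered = sorted(frequencies, reverse=True)
    keep = max(1, int(len(ordered) * keep_fraction))
    return ordered[keep - 1]  # every kept sample has at least this many reads

frequencies = [9820, 8770, 4630, 3110, 2970, 2060, 1880, 1620, 1103, 900]
print(pick_sampling_depth(frequencies))  # 1103
```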

"maxDepth" parameter value defines how many rarefied tables will be generated at each sample depth. At each sample depth 10 tables are generated. In this tutorial value of 4000 was used to provide most of the details about data, and samples will be grouped in the resulting visualization.

To do this you should use "Alpha and beta diversity analysis" Task Template on Coretex.
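For reference, this step corresponds to the core diversity metrics and alpha rarefaction commands from the Moving Pictures tutorial. A sketch of those calls, assembled but not executed; the sampling depth of 1103 and max depth of 4000 are the values discussed above, while the file names are assumptions:

```python
# Sketch of the two diversity calls behind this Task Template.
core_metrics_cmd = [
    "qiime", "diversity", "core-metrics-phylogenetic",
    "--i-phylogeny", "rooted-tree.qza",
    "--i-table", "table.qza",
    "--p-sampling-depth", "1103",
    "--m-metadata-file", "sample-metadata.tsv",
    "--output-dir", "core-metrics-results",
]
rarefaction_cmd = [
    "qiime", "diversity", "alpha-rarefaction",
    "--i-table", "table.qza",
    "--i-phylogeny", "rooted-tree.qza",
    "--p-max-depth", "4000",
    "--m-metadata-file", "sample-metadata.tsv",
    "--o-visualization", "alpha-rarefaction.qzv",
]
print(" ".join(core_metrics_cmd))
print(" ".join(rarefaction_cmd))
```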

For more information click here.

Last updated