EVAtool

EVAtool: An optimized reads assignment tool for extracellular vesicle small ncRNA quantification

What is EVAtool? EVAtool is designed to quantify small RNA-seq datasets from extracellular vesicles.

Overview

Extracellular vesicles (EVs) play an important role as agents of cell-to-cell communication by transferring bioactive molecules. EVs carry DNA fragments, proteins, lipids, metabolites and multiple RNA types, including messenger RNAs (mRNAs) and small non-coding RNAs (sncRNAs), that can be delivered from donor cells to target cells. Many EV-associated sncRNA types affect cancer cell behavior but remain to be quantified in various physiological and pathophysiological processes. Therefore, quality control and accurate quantification of sncRNAs in EVs are indispensable.

EVAtool is an optimized reads assignment tool for extracellular vesicle small ncRNA quantification. By default, EVAtool includes references for seven ncRNA types (miRNA, snoRNA, piRNA, snRNA, rRNA, tRNA and Y RNA) to evaluate the abundance of each small ncRNA in EVs. Using current versions of its dependencies (mainly bowtie2, samtools, fastq-dump, bedtools and trimmomatic-0.39.jar) and the Optimized Reads Assignment Algorithm (ORAA), the tool can process short reads that map to multiple sncRNAs from small EVs (sEVs) or large EVs (lEVs). It can also process other sncRNA-seq data with minor modifications to the configuration file. Finally, EVAtool visualizes almost all results and provides an online report.

Github link: https://github.com/xieguiyan/EVAtool

Docker link: https://hub.docker.com/r/guobioinfolab/evatool

Guided tutorial in Python: https://pypi.org/project/evatool

Summary

Let's start

The detailed flow diagram below illustrates the data processing workflow in four parts (left) and the Optimized Reads Assignment Algorithm (right).

Pseudocode for the ORAA:

Name: optimized reads assignment algorithm (ORAA)
Author: Gui-Yan Xie
Time: 2022-05-04 14:53

Input: All reads mapped to the genome and the seven types of small ncRNA
Output: The abundance of each type of small ncRNA


function ncRNA_abundance (all_mapped_reads)
    for each_read in all_mapped_reads
        if reads_mismatch == 0
            if reads_mapped_ncrna_types == 1
                candidate_ncRNA += read_count
            else
                high_proportion_ncRNA += read_count
            end if
        end if
        if reads_mismatch == 1 and mirna_editing existed
            if reads_mapped_ncrna_types == 1
                final_ncRNA += read_count
            else
                high_proportion_ncRNA += read_count
            end if
        end if
    end for
    final_abundance_of_each_ncRNA = each_ncRNA_abundance_in_high_proportion + candidate_ncRNA
    return final_abundance_of_each_ncRNA
end function

EVAtool is developed using Python 3.6.5 and is freely available at https://pypi.org/project/evatool. All pipeline files are included in a Docker image deposited in a public repository on Docker Hub, which resolves dependency issues across Linux platforms and Windows.

Attention: For about 1 GB of input data, a run takes no more than 10 minutes and needs at least 12 GB of memory.

EVAtool can be easily installed on many platforms (Linux, macOS and Windows) via the Python Package Index or Docker Hub. Before installation, create a working directory. Here, we use "example" as the folder name:

1. Create a working directory in the current directory.

mkdir ./example

Download the human genome reference, the references for the seven ncRNA types and the configuration file. All three kinds of files are zipped in one package (4.4 GB), which can be downloaded in two ways:

1. Download all files by clicking the link below:

Download all files

2. Use wget to download all files:

# Go to the working directory
cd ./example
# Download the references and config file
wget "http://bioinfo.life.hust.edu.cn/EVAtool/ref/refs.zip"
# Unzip the downloaded file
unzip refs.zip

Download example data by clicking: example.fastq.gz, or download through the command line.

wget http://bioinfo.life.hust.edu.cn/EVAtool/example/example.fastq.gz

Provide the filename of your smRNA-seq data and the output directory. If the data is not located in the working directory, specify the absolute path of your data.

Notice: The 'refs' folder should be placed in the corresponding working directory, otherwise the -c parameter needs to be specified.

1. Run the Python package for EVAtool:

# Install EVAtool by pip
pip install evatool
# Run EVAtool
evatool -i example.fastq.gz -o {absolute path for output or .}

2. Run the Docker image of EVAtool:

# Pull from Docker hub
docker pull guobioinfolab/evatool
# Run EVAtool through docker image
# -v: Bind mount a volume; -w: Working directory inside the container
docker run -it -v $PWD:/{work_path} -w /{work_path} guobioinfolab/evatool -i example.fastq.gz -o {absolute path for output or .}

At least two parameters are required. The meaning of each parameter is as follows:

-i: sra file with path (required)

-o: output directory (required)

-c: configure file with path (optional)

-n: ncRNA type list (optional)
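
For scripted runs, the command line can be assembled in Python. The following is a minimal sketch, assuming `evatool` is installed and on the PATH; the helper name `build_evatool_command` is hypothetical and not part of EVAtool. Only the -i, -o and -c parameters listed above are used; -n is left out because the format of its value is not described here.

```python
import shlex
from typing import List, Optional


def build_evatool_command(input_file: str, output_dir: str,
                          config_file: Optional[str] = None) -> List[str]:
    """Assemble the argument list for one evatool run (hypothetical helper)."""
    command = ["evatool", "-i", input_file, "-o", output_dir]
    if config_file is not None:
        # -c is only needed when 'refs' is not in the working directory
        command += ["-c", config_file]
    return command


if __name__ == "__main__":
    cmd = build_evatool_command("example.fastq.gz", ".")
    # Show the command; to launch it, pass cmd to subprocess.run(cmd, check=True)
    print(shlex.join(cmd))
```

Building the command as a list rather than a single string avoids shell quoting problems when paths contain spaces.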

For more details, please input the following command:

evatool -h
# or
docker run guobioinfolab/evatool -h

The result report is generated in the output directory as an HTML file, which can be opened in any web browser.

The online report mainly consists of three parts. The following are screenshots from the online report of the example; click for more details:

Part 1. Input parameters:

Part 2. Analysis results:

Part 3. Other results:

In EVAtool, users can customize the parameters according to their needs.
By default, EVAtool provides references for 7 types of small ncRNA and reports the expression (count and RPM) of these ncRNAs. Notably, EVAtool also allows users to add references of interest in order to process other types of ncRNA by modifying the config file. The steps are as follows:

1. Copy the reference(s) to the 'refs' folder;
2. Add the 'bowtie_para_4_{RNA name}' parameter to the 'reference_config.json' file in the 'refs' folder;
3. Add the '{RNA name}_index' parameter to the 'reference_config.json' file;
4. Save and run as described in the 'Run EVAtool' part.
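
The config edits in the steps above can also be scripted. The sketch below assumes that 'reference_config.json' is plain JSON without comment lines; the function names are hypothetical, and the RNA name "vtRNA" and its path in the usage note are made-up examples, not references shipped with EVAtool.

```python
import json
from pathlib import Path


def add_reference(config: dict, rna_name: str, index_path: str,
                  bowtie_para: str) -> dict:
    """Return a copy of the config with the two entries for a new ncRNA type."""
    updated = dict(config)
    updated[f"bowtie_para_4_{rna_name}"] = bowtie_para
    updated[f"{rna_name}_index"] = index_path
    return updated


def update_config_file(config_path: str, rna_name: str, index_path: str,
                       bowtie_para: str) -> None:
    """Read the JSON config, add the new entries and write it back."""
    path = Path(config_path)
    config = json.loads(path.read_text())
    updated = add_reference(config, rna_name, index_path, bowtie_para)
    path.write_text(json.dumps(updated, indent=4))
```

For example, `update_config_file("refs/reference_config.json", "vtRNA", "/refs/vtRNA/vtRNA.fa", "--no-head --no-sq -D 15 -R 2 -N 1 -k 1 -L 12 -i S,1,1.15")` would register a hypothetical vtRNA reference, reusing the default Y RNA bowtie2 parameters as a starting point.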

For more usage, please see https://pypi.org/project/evatool

EVAtool includes five modules ('Quality control', 'Mapping', 'ORAA', 'Quantification & RPM normalization' and 'Visualization report'). Each module processes the data quickly and accurately.
1. Quality control: EVAtool provides a set of quality control tools to check the quality of the input data. The quality control process mainly includes:

Trimming (trimmomatic-0.39.jar): trims adapter sequences and low-quality bases and discards reads that are too short (MINLEN:15 in the default configuration);
FastQC (FastQC v0.11.9): a quality control tool for high-throughput sequence data, used to check the quality of the input data.

2. Mapping: EVAtool uses Bowtie2 to map your data to the various references.

Bowtie2: Bowtie2 is a fast and sensitive aligner for short sequencing reads. It aligns each read to long reference sequences by finding the best matches.

3. ORAA: EVAtool provides ORAA to assign multi-mapped reads and calculate the abundance of each ncRNA type.

ORAA can be summarized in the following steps:
1. The clean reads of each sample are mapped (one mismatch allowed) to the seven ncRNA references simultaneously, and all reads mapped to ncRNAs are gathered for assignment;
2. Reads with 0 or 1 mismatch that map to only one ncRNA type are used to calculate ncRNA abundance;
3. Reads with 0 mismatches that map to only one ncRNA type are regarded as confident reads and are used to calculate the ncRNA proportions in EVs, which serve as the reference ncRNA abundance for later assignment;
4. Reads with 0 or 1 mismatch that map to multiple ncRNA types are each assigned to exactly one ncRNA type according to the reference ncRNA abundance priority;
5. The final ncRNA abundance is determined by the assigned ncRNA-mapped reads.
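
The assignment steps above can be illustrated with a small Python sketch. This is not EVAtool's own code: the `MappedRead` structure and the `assign_reads` function are hypothetical, the miRNA editing check from the pseudocode is omitted, and ties between equally abundant types are broken alphabetically as an assumption.

```python
from collections import Counter
from typing import FrozenSet, Iterable, NamedTuple


class MappedRead(NamedTuple):
    count: int                    # number of identical reads
    mismatches: int               # mismatches in the alignment
    ncrna_types: FrozenSet[str]   # ncRNA types the read maps to


def assign_reads(reads: Iterable[MappedRead]) -> Counter:
    """Assign every read to exactly one ncRNA type and sum the counts."""
    reads = list(reads)

    # Step 3: confident reads (0 mismatches, one type) give the reference abundance
    reference = Counter()
    for read in reads:
        if read.mismatches == 0 and len(read.ncrna_types) == 1:
            (rna_type,) = read.ncrna_types
            reference[rna_type] += read.count

    final = Counter()
    for read in reads:
        if read.mismatches > 1 or not read.ncrna_types:
            continue  # only reads with 0 or 1 mismatch are considered
        if len(read.ncrna_types) == 1:
            # Step 2: reads mapped to a single type count for that type
            (rna_type,) = read.ncrna_types
        else:
            # Step 4: multi-mapped reads go to the type with the highest
            # reference abundance (alphabetical tie-break is an assumption)
            rna_type = max(sorted(read.ncrna_types), key=lambda t: reference[t])
        final[rna_type] += read.count
    return final  # Step 5: final abundance per ncRNA type
```

With 10 confident miRNA reads and 3 confident tRNA reads, a read that maps to both miRNA and tRNA would be counted as miRNA in this sketch, because miRNA has the higher reference abundance.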

4. Quantification & RPM normalization:

1. Quantification: EVAtool provides two types of transcript abundance: the number of mapped reads counted for each ncRNA, and the RPM (Reads Per Million) value obtained by normalizing these counts.
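
As a worked example of RPM normalization, the sketch below scales each count to reads per million. The function name `rpm` is hypothetical, and the choice of denominator (here the total number of reads passed in by the caller) is an assumption; which total EVAtool uses is not stated in this section.

```python
from typing import Dict


def rpm(counts: Dict[str, int], total_reads: int) -> Dict[str, float]:
    """Convert raw read counts to Reads Per Million."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return {name: count * 1_000_000 / total_reads
            for name, count in counts.items()}


if __name__ == "__main__":
    # 14 miRNA reads and 5 tRNA reads out of 2,000,000 total reads
    print(rpm({"miRNA": 14, "tRNA": 5}, total_reads=2_000_000))
    # {'miRNA': 7.0, 'tRNA': 2.5}
```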

5. Visualization report: EVAtool visualizes most of the output results and provides an online report in HTML format, which can be opened in any web browser.

Report_result.html

#Trimmomatic parameters and adaptor sequences
"trimmomatic_sRNA_para": "2:10:4:5 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:15",
"adp_path": "/refs/sRNA.fa",

#Tag filter
"tag_cut": "0",

#CPU number
"cpu_number": "8",

#sncRNA types
"RPM": "total",

#bowtie2 parameters
"bowtie_para_4_miRNA": "--no-head --no-sq -D 15 -R 2 -N 1 -k 2 -L 15 -i S,1,1.15",
"bowtie_para_4_rRNA": "--no-head --no-sq -D 15 -R 2 -N 1 -k 1 -L 12 -i S,1,1.15",
"bowtie_para_4_tRNA": "--no-head --no-sq -D 15 -R 2 -N 1 -k 1 -L 12 -i S,1,1.15",
"bowtie_para_4_piRNA": "--no-head --no-sq -D 15 -R 2 -N 1 -k 1 -L 12 -i S,1,1.15",
"bowtie_para_4_snRNA": "--no-head --no-sq -D 15 -R 2 -N 1 -k 3 -L 10 -i S,1,1.15",
"bowtie_para_4_snoRNA": "--no-head --no-sq -D 15 -R 2 -N 1 -L 12 -i S,1,1.15",
"bowtie_para_4_YRNA": "--no-head --no-sq -D 15 -R 2 -N 1 -k 1 -L 12 -i S,1,1.15",
"bowtie_para_4_genome": "-D 15 -R 2 -N 1 -k 1 -L 10 -i S,1,1.15",

# bowtie2 index
"miRNA_index": "/refs/miRNA/hsa.hairpin.fa",
"mirbase": "/refs/miRNA/hsa.mirbase.txt",
"piRNA_index": "/refs/piRNA/piRNA.fa",
"YRNA_index": "/refs/YRNA/YRNA.fa",
"snoRNA_index": "/refs/snoRNA/snoRNA.fa",
"snRNA_index": "/refs/snRNA/snRNA.fa",
"rRNA_index": "/refs/rRNA/rRNA.fa",
"tRNA_index": "/refs/tRNA/tRNA.fa",
"genome_index": "/refs/Homo_sapiens_v86/Homo_sapiens_v86.chromosomes.fasta",

# Annotation for genome
"gtf_annotation": "/refs/Homo_sapiens.GRCh38.86.chr.gtf",
"trans_bed_annotation": "/refs/four_elements.sort.bed",
"exon_bed_annotation": "/refs/two_elements.sort.bed"
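
Because each reference in the configuration above has both a 'bowtie_para_4_{name}' entry and a '{name}_index' entry, a quick consistency check can catch a missing half before a run. The sketch below is a hypothetical helper, not part of EVAtool, and assumes the config has already been loaded into a Python dict.

```python
from typing import Dict, List, Tuple

PARA_PREFIX = "bowtie_para_4_"
INDEX_SUFFIX = "_index"


def check_references(config: Dict[str, str]) -> Tuple[List[str], List[str]]:
    """Return (complete, incomplete) reference names found in the config."""
    with_para = {key[len(PARA_PREFIX):] for key in config
                 if key.startswith(PARA_PREFIX)}
    with_index = {key[:-len(INDEX_SUFFIX)] for key in config
                  if key.endswith(INDEX_SUFFIX)}
    complete = sorted(with_para & with_index)    # both entries present
    incomplete = sorted(with_para ^ with_index)  # only one of the two present
    return complete, incomplete
```

Keys that match neither pattern, such as "mirbase" or "cpu_number", are ignored by this check.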

Tool	Version	Description
FastQC	v0.11.9	FastQC is a quality control tool for high throughput sequence data. It is used to check the quality of your data.
Trimmomatic	v0.39	Trimmomatic is a fast and accurate adapter trimming tool for Illumina NGS data. It is used to trim the adapter sequences from your data.
Bowtie2	v2.4.2	Bowtie2 is a fast and sensitive aligner for short sequencing reads. It aligns each read to long reference sequences by finding the best matches.
Samtools	v1.12	Samtools is a set of utilities that manipulate alignments in the SAM format, including sorting, merging, indexing, and generating alignments in a variety of formats.
Bedtools	v2.30.0	Bedtools allows one to intersect, merge, count, complement, and shuffle genomic intervals from multiple files in widely-used genomic file formats.
Python	v3.5.6	Python can be used alongside software to create workflows.

The references for the seven ncRNA types were obtained from their authoritative databases without any modification, except that pseudogene sequences were deleted from the Y RNA references. The miRNA reference was downloaded from miRBase V22; snoRNAs were from snoDB; tRNAs were from GtRNAdb; piRNAs and rRNAs were from RNAcentral; snRNAs and Y RNAs were from NCBI GenBank. The human reference genome GRCh38 and the corresponding GTF annotation files were downloaded from GENCODE.

Reference	Source	Link	Why
genome GRCh38	GENCODE	https://www.gencodegenes.org/	The GENCODE Comprehensive set contains more alternative splicing (AS), more novel CDSs, more novel exons and a higher genomic coverage than other existing annotation databases.
miRNA	miRBase V22	https://mirbase.org/	miRBase is the primary public repository and online resource for microRNA sequences and annotation.
snoRNA	snoDB	http://scottgroup.med.usherbrooke.ca/snoDB/	snoDB houses 2064 human snoRNAs, integrating annotations from most existing databases, such as RefSeq (https://www.ncbi.nlm.nih.gov/refseq/), RNAcentral and Rfam (https://rfam.xfam.org/).
tRNA	GtRNAdb	http://gtrnadb.ucsc.edu/	GtRNAdb contains tRNA gene predictions on complete or nearly complete genomes.
piRNA	RNAcentral	https://rnacentral.org/	piRBase is a database of piRNA-associated data that supports piRNA functional studies, and it has been imported into RNAcentral.
rRNA	RNAcentral	https://rnacentral.org/	The expert databases for rRNA, such as 5SrRNAdb (http://combio.pl/rrna/), Greengenes (http://greengenes.secondgenome.com/?prefix=downloads/greengenes_database/gg_13_5/) and RDP (http://rdp.cme.msu.edu/), have been imported into RNAcentral.
snRNA	NCBI GenBank	https://www.ncbi.nlm.nih.gov/nuccore/?term=human+AND+%22snRNA%22	There is no independent ncRNA database for snRNA and Y RNA, so we downloaded these sequences from GenBank (https://www.ncbi.nlm.nih.gov/genbank/), which is an annotated collection of all publicly available DNA sequences.
Y RNA	NCBI GenBank	https://www.ncbi.nlm.nih.gov/nuccore/?term=human+AND+%22Y+RNA%22+NOT+pseudogene	There is no independent ncRNA database for snRNA and Y RNA, so we downloaded these sequences from GenBank (https://www.ncbi.nlm.nih.gov/genbank/), which is an annotated collection of all publicly available DNA sequences.

Supplementary figures

Supplementary code

When using EVAtool, please cite: Xie G Y, Liu C J, Guo A Y. EVAtool: an optimized reads assignment tool for small ncRNA quantification and its application in extracellular vesicle datasets[J]. Briefings in Bioinformatics, 2022. Link: https://doi.org/10.1093/bib/bbac310

Guo Lab

Welcome to Dr. An-Yuan Guo's Lab at the Huazhong University of Science and Technology.

Resources

EVAtlas

EVmiRNA

Contact us
An-Yuan Guo: guoay@hust.edu.cn
Gui-Yan Xie: xieguiyan@hust.edu.cn

Acknowledgment

If you have any comments or suggestions, please contact us.
