What is EVAtool? EVAtool is designed to quantify small RNA-seq datasets from extracellular vesicles.


Overview

Extracellular vesicles (EVs) play an important role as agents of cell-to-cell communication. They carry various bioactive molecules, such as DNA fragments, proteins, lipids, metabolites and multiple RNA types, including messenger RNAs (mRNAs) and small non-coding RNAs (sncRNAs), that can be delivered from donor cells to target cells. Many EV-associated sncRNA types affect cancer cell behavior but remain to be quantified in various physiological and pathophysiological processes. Therefore, quality control and correct quantification of sncRNAs in EVs are indispensable.

EVAtool is an optimized reads assignment tool for extracellular vesicle small ncRNA quantification. In EVAtool, we collected references for seven ncRNA types (miRNA, snoRNA, piRNA, snRNA, rRNA, tRNA and Y RNA) as defaults to evaluate the abundance of each small ncRNA in EVs. With up-to-date dependencies (mainly bowtie2, samtools, fastq-dump, bedtools and trimmomatic-0.39.jar) and the Optimized Reads Assignment Algorithm (ORAA), the tool can process short reads that map to multiple sncRNAs from small EVs (sEVs) or large EVs (lEVs). It can also process other sncRNA-seq data with minor modifications to the configuration file. Finally, EVAtool visualizes almost all results and provides an online report.

Github link: https://github.com/xieguiyan/EVAtool

Docker link: https://hub.docker.com/r/guobioinfolab/evatool

Guided tutorial in Python: https://pypi.org/project/evatool



Summary
home-fig


Let's start

  1. The flow diagram below illustrates the four parts of the data-processing workflow (left) and the Optimized Reads Assignment Algorithm (right).

    workflow

    Pseudocode for the ORAA:

    Name: optimized reads assignment algorithm (ORAA)
    Author: Gui-Yan Xie
    Time: 2022-05-04 14:53
    
    Input: All reads mapped to the genome and the seven types of small ncRNA
    Output: The abundance of each type of small ncRNA
    
    
    function ncRNA_abundance (all_mapped_reads)
        for each_read in all_mapped_reads
            if reads_mismatch == 0
                if reads_mapped_ncrna_types == 1
                    candidate_ncRNA += read_count
                else
                    high_proportion_ncRNA += read_count
                end if
            end if
            if reads_mismatch == 1 and mirna_editing existed
                if reads_mapped_ncrna_types == 1
                    final_ncRNA += read_count
                else
                    high_proportion_ncRNA += read_count
                end if
            end if
        end for
        final_abundance_of_each_ncRNA = each_ncRNA_abundance_in_high_proportion + candidate_ncRNA
        return final_abundance_of_each_ncRNA
    end function

  2. EVAtool was developed in Python 3.6.5 and is freely available at https://pypi.org/project/evatool. All pipeline files are included in a Docker image deposited in a public repository on Docker Hub, which resolves dependency issues across Linux platforms and Windows.

    Attention: For about 1G of input data, a run takes no more than 10 minutes and needs at least 12G of memory.

    EVAtool can be easily installed on many platforms (Linux, macOS and Windows) via the Python Package Index or Docker Hub. Before installation, create a working directory. Here, we use "example" as the folder name:

    1. Create a working directory in the current directory.

    mkdir ./example
  3. Download the human genome reference, the references for the seven ncRNA types and the configuration file. All three kinds of files are zipped in one package (4.4G), which can be downloaded in two ways:

    1. Download all files by clicking the link below:

    Download all files

    2. Use wget to download all files:

    # Go to the working directory
    cd ./example
    # Download the references and config file
    wget "http://bioinfo.life.hust.edu.cn/EVAtool/ref/refs.zip"
    # Unzip the downloaded file
    unzip refs.zip

    Download example data by clicking: example.fastq.gz, or download through the command line.

    wget http://bioinfo.life.hust.edu.cn/EVAtool/example/example.fastq.gz
  4. Provide the filename of your smRNA-seq data and the output directory. If the data is located in the working directory, the filename is enough; otherwise, specify the absolute path of your data.

    Notice: The 'refs' folder should be placed in the corresponding working directory, otherwise the -c parameter needs to be specified.

    1. Run the Python package for EVAtool:

    # Install EVAtool by pip
    pip install evatool
    # Run EVAtool
    evatool -i example.fastq.gz -o {absolute path for output or .}

    2. Run the Docker image of EVAtool:

    # Pull from Docker hub
    docker pull guobioinfolab/evatool
    # Run EVAtool through docker image
    # -v: Bind mount a volume; -w: Working directory inside the container
    docker run -it -v $PWD:/{work_path} -w /{work_path} guobioinfolab/evatool -i example.fastq.gz -o {absolute path for output or .}

    Only two parameters are required. The meaning of each parameter is as follows:

      -i: sra file with path (required),

      -o: output directory (required),

      -c: configure file with path (not required),

      -n: ncRNA type list (not required)

    For more details, please input the following command:

    evatool -h
    # or
    docker run guobioinfolab/evatool -h
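
When several samples need processing, the command line above can be assembled from a script. The following is a minimal Python sketch, not part of EVAtool itself: the helper names `build_evatool_command` and `batch_commands`, and the choice of one output folder per sample, are assumptions made for illustration. Only the `-i`, `-o` and `-c` options documented above are used.

```python
from pathlib import Path


def build_evatool_command(fastq, output_dir, config=None):
    """Return the evatool argument list for one input file.

    Hypothetical helper: it only assembles the documented -i, -o and -c options.
    """
    cmd = ["evatool", "-i", str(fastq), "-o", str(output_dir)]
    if config is not None:
        # -c is only needed when the 'refs' folder is not in the working directory.
        cmd += ["-c", str(config)]
    return cmd


def batch_commands(input_dir, output_root):
    """Build one command per *.fastq.gz file found in input_dir."""
    commands = []
    for fastq in sorted(Path(input_dir).glob("*.fastq.gz")):
        sample = fastq.name[: -len(".fastq.gz")]
        # Assumption: each sample gets its own output folder under output_root.
        commands.append(build_evatool_command(fastq, Path(output_root) / sample))
    return commands


print(build_evatool_command("example.fastq.gz", "."))
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` from the working directory that contains the 'refs' folder.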

  5. The result report is generated in the output directory as an HTML file that can be opened in any web browser.

    The online report consists mainly of three parts. The following screenshots are taken from the online report of the example data; click for more details:

    Part 1. Input parameters:

    input parameters

    Part 2. Analysis results:

    analysis result

    Part 3. Other results:

    other result
  6. In EVAtool, users can customize the parameters according to their needs.
    By default, EVAtool provides references for 7 types of small ncRNA and reports the expression (count and RPM) of these ncRNAs. Notably, EVAtool also allows users to add a reference of interest and process other types of ncRNA by modifying the config file. The steps are as follows:

    1. Copy the reference(s) to the 'refs' folder;
    2. Add the 'bowtie_para_4_{RNA name}' parameter to the 'reference_config.json' file in the 'refs' folder;
    3. Add the '{RNA name}_index' parameter to the 'reference_config.json' file;
    4. Save and run as described in the 'Run EVAtool' part.
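
The two config entries for a custom reference can also be added programmatically. The sketch below is an illustration under stated assumptions: `add_reference` is a hypothetical helper, "myRNA" and its FASTA path are made-up examples, and the bowtie2 parameter string is simply copied from the default Y RNA entry. It works on a plain dictionary; loading and saving 'reference_config.json' with the `json` module assumes the file is valid JSON.

```python
import json


def add_reference(config, rna_name, fasta_path, bowtie_params):
    """Return a copy of the config with the two entries for a new ncRNA type."""
    updated = dict(config)
    updated[f"bowtie_para_4_{rna_name}"] = bowtie_params
    updated[f"{rna_name}_index"] = fasta_path
    return updated


# Example with a tiny stand-in config; "myRNA" is a hypothetical RNA name.
config = {"cpu_number": "8"}
config = add_reference(
    config,
    "myRNA",
    "/refs/myRNA/myRNA.fa",
    "--no-head --no-sq -D 15 -R 2 -N 1 -k 1 -L 12 -i S,1,1.15",
)
print(json.dumps(config, indent=4))
```

Returning a copy instead of editing in place keeps the original settings available for comparison before the file is overwritten.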

    For more usage, please see https://pypi.org/project/evatool

  7. EVAtool includes five modules ('Quality control', 'Mapping', 'ORAA', 'Quantification & RPM normalization' and 'Visualization report'). Each module processes the data quickly and accurately.
    1. Quality control: EVAtool provides a set of quality control tools to check the quality of the input data. The quality control process mainly includes:

    Trimming (trimmomatic-0.39.jar): Trims adapter sequences and low-quality bases, and discards reads shorter than the minimum length;
    Fastqc (FastQC v0.11.9): FastQC is a quality control tool for high throughput sequence data. It is used to check the quality of the input data.
    

    2. Mapping: EVAtool uses Bowtie2 to map your data to the various references.

    Bowtie2: Bowtie2 is a fast and sensitive aligner for short sequencing reads. It aligns short reads to long reference sequences by finding the best matches.

    3. ORAA: EVAtool provides ORAA to calculate the abundance of ncRNA reasonably.

    ORAA can be summarized in the following steps:
    1. The clean reads of each sample are mapped (one mismatch allowed) to the seven ncRNA references simultaneously, and all ncRNA-mapped reads are gathered to prepare for assignment;
    2. Reads with 0 or 1 mismatch that map to only one ncRNA type are used to calculate ncRNA abundance;
    3. Reads with 0 mismatches that map to one ncRNA type are regarded as confident reads. They are used to calculate the ncRNA proportions in EVs, which serve as the reference ncRNA abundance for later assignment;
    4. Reads with 0 or 1 mismatch that map to multiple ncRNA types are assigned to exactly one ncRNA type according to the priority given by the reference ncRNA abundance;
    5. The final ncRNA abundance is determined by the assigned ncRNA-mapped reads.
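
The steps above can be sketched in Python as follows. This is a simplified illustration, not EVAtool's actual implementation: the read representation (a dict with `count`, `mismatch` and `types`), the function name `assign_reads` and the alphabetical tie-breaking rule are assumptions, and the miRNA-editing condition from the pseudocode is left out.

```python
from collections import Counter


def assign_reads(reads):
    """Assign read counts to ncRNA types following the five ORAA steps.

    Each read is a dict with 'count' (int), 'mismatch' (int) and
    'types' (set of ncRNA type names the read maps to).
    """
    confident = Counter()  # step 3: 0-mismatch reads mapped to one type
    final = Counter()      # running abundance per ncRNA type
    ambiguous = []         # reads mapped to several ncRNA types

    for read in reads:
        if read["mismatch"] > 1 or not read["types"]:
            continue  # only reads with 0 or 1 mismatch are considered
        if len(read["types"]) == 1:
            (rna_type,) = read["types"]
            final[rna_type] += read["count"]  # step 2
            if read["mismatch"] == 0:
                confident[rna_type] += read["count"]
        else:
            ambiguous.append(read)

    # Step 4: give each multi-mapped read to the candidate type with the
    # highest confident abundance (ties broken alphabetically, an assumption).
    for read in ambiguous:
        best = max(sorted(read["types"]), key=lambda t: confident[t])
        final[best] += read["count"]

    return dict(final)  # step 5


example_reads = [
    {"count": 10, "mismatch": 0, "types": {"miRNA"}},
    {"count": 4, "mismatch": 0, "types": {"tRNA"}},
    {"count": 3, "mismatch": 1, "types": {"tRNA"}},
    {"count": 5, "mismatch": 0, "types": {"miRNA", "tRNA"}},
    {"count": 2, "mismatch": 2, "types": {"miRNA"}},
]
print(assign_reads(example_reads))
```

In this toy input the multi-mapped read (count 5) goes to miRNA, because miRNA has more confident reads (10) than tRNA (4).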

    4. Quantification & RPM normalization:

    1. Quantification: EVAtool provides two types of transcript abundance: the number of mapped reads is counted for each ncRNA, and RPM (Reads Per Million) normalization is performed for the input data.
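
As a worked example of RPM normalization, the sketch below scales raw counts so that they sum to one million. It is an illustration only: the function name `rpm_normalize` is hypothetical, and using the sum of the supplied counts as the denominator is an assumption, since the exact denominator EVAtool uses is not stated here.

```python
def rpm_normalize(counts):
    """Convert raw read counts to Reads Per Million (RPM).

    Assumption: the denominator is the sum of all counts passed in.
    """
    total = sum(counts.values())
    if total == 0:
        return {name: 0.0 for name in counts}
    return {name: count * 1_000_000 / total for name, count in counts.items()}


raw_counts = {"miRNA-a": 750, "miRNA-b": 250}
print(rpm_normalize(raw_counts))
```

RPM values make samples with different sequencing depths comparable, because each value expresses a share of the total rather than an absolute count.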

    5. Visualization report: EVAtool visualizes most of the output results and provides an online report in HTML format which can be opened in any web browser.

    Report_result.html
    report result


  8. Default parameters in the configuration file:
    #Trimmomatic parameters and adaptor sequences
    "trimmomatic_sRNA_para": "2:10:4:5 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:15",
    "adp_path": "/refs/sRNA.fa",
    
    #Tag filter
    "tag_cut": "0",
    
    #CPU number
    "cpu_number": "8",
    
    #sncRNA types
    "RPM": "total",
    
    #bowtie2 parameters
    "bowtie_para_4_miRNA": "--no-head --no-sq -D 15 -R 2 -N 1 -k 2 -L 15 -i S,1,1.15",
    "bowtie_para_4_rRNA": "--no-head --no-sq -D 15 -R 2 -N 1 -k 1 -L 12 -i S,1,1.15",
    "bowtie_para_4_tRNA": "--no-head --no-sq -D 15 -R 2 -N 1 -k 1 -L 12 -i S,1,1.15",
    "bowtie_para_4_piRNA": "--no-head --no-sq -D 15 -R 2 -N 1 -k 1 -L 12 -i S,1,1.15",
    "bowtie_para_4_snRNA": "--no-head --no-sq -D 15 -R 2 -N 1 -k 3 -L 10 -i S,1,1.15",
    "bowtie_para_4_snoRNA": "--no-head --no-sq -D 15 -R 2 -N 1 -L 12 -i S,1,1.15",
    "bowtie_para_4_YRNA": "--no-head --no-sq -D 15 -R 2 -N 1 -k 1 -L 12 -i S,1,1.15",
    "bowtie_para_4_genome": "-D 15 -R 2 -N 1 -k 1 -L 10 -i S,1,1.15",
    
    # bowtie2 index
    "miRNA_index": "/refs/miRNA/hsa.hairpin.fa",
    "mirbase": "/refs/miRNA/hsa.mirbase.txt",
    "piRNA_index": "/refs/piRNA/piRNA.fa",
    "YRNA_index": "/refs/YRNA/YRNA.fa",
    "snoRNA_index": "/refs/snoRNA/snoRNA.fa",
    "snRNA_index": "/refs/snRNA/snRNA.fa",
    "rRNA_index": "/refs/rRNA/rRNA.fa",
    "tRNA_index": "/refs/tRNA/tRNA.fa",
    "genome_index": "/refs/Homo_sapiens_v86/Homo_sapiens_v86.chromosomes.fasta",
    
    # Annotation for genome
    "gtf_annotation": "/refs/Homo_sapiens.GRCh38.86.chr.gtf",
    "trans_bed_annotation": "/refs/four_elements.sort.bed",
    "exon_bed_annotation": "/refs/two_elements.sort.bed" 
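
To compare the bowtie2 settings of different ncRNA types, a parameter string from the listing above can be split into individual options. The sketch below is a small illustration; `parse_bowtie_params` is a hypothetical helper, not an EVAtool function. In Bowtie2, `-N` is the number of mismatches allowed in a seed alignment, `-L` is the seed length and `-k` is the maximum number of alignments reported per read.

```python
import shlex


def parse_bowtie_params(param_string):
    """Split a bowtie2 parameter string into a {flag: value} dictionary.

    Flags without a value (such as --no-sq) are stored as True.
    """
    tokens = shlex.split(param_string)
    options = {}
    i = 0
    while i < len(tokens):
        flag = tokens[i]
        has_value = i + 1 < len(tokens) and not tokens[i + 1].startswith("-")
        if has_value:
            options[flag] = tokens[i + 1]
            i += 2
        else:
            options[flag] = True
            i += 1
    return options


mirna = parse_bowtie_params("--no-head --no-sq -D 15 -R 2 -N 1 -k 2 -L 15 -i S,1,1.15")
snorna = parse_bowtie_params("--no-head --no-sq -D 15 -R 2 -N 1 -L 12 -i S,1,1.15")
print(mirna["-k"], mirna["-L"], "-k" in snorna)
```

Parsed this way, the defaults show that the miRNA entry uses a longer seed (`-L 15`) and reports up to two alignments per read, while the snoRNA entry sets no `-k` value at all.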

  9. Tool | Version | Description
    FastQC | v0.11.9 | FastQC is a quality control tool for high throughput sequence data. It is used to check the quality of your data.
    Trimmomatic | v0.39 | Trimmomatic is a fast and accurate adapter trimming tool for Illumina NGS data. It is used to trim the adapter sequences from your data.
    Bowtie2 | v2.4.2 | Bowtie2 is a fast and sensitive aligner for short sequencing reads. It aligns short reads to long reference sequences by finding the best matches.
    Samtools | v1.12 | Samtools is a set of utilities that manipulate alignments in the SAM format, including sorting, merging, indexing, and generating alignments in a variety of formats.
    Bedtools | v2.30.0 | Bedtools allows one to intersect, merge, count, complement, and shuffle genomic intervals from multiple files in widely-used genomic file formats.
    Python | v3.5.6 | Python can be used alongside software to create workflows.

  10. The references for the seven ncRNA types were obtained from their authoritative databases without any modification, except for the deletion of pseudogene sequences from the Y RNA references. The miRNA reference was downloaded from miRBase V22; snoRNAs were from snoDB; tRNAs were from GtRNAdb; piRNAs and rRNAs were from RNAcentral; snRNAs and Y RNAs were from NCBI GenBank. The human reference genome GRCh38 and the corresponding GTF annotation files were downloaded from GENCODE.

    Reference | Source | Link | Why
    genome GRCh38 | GENCODE | https://www.gencodegenes.org/ | The GENCODE Comprehensive set contains more AS, more novel CDSs, more novel exons and a higher genomic coverage than the existing annotation database.
    miRNA | miRBase V22 | https://mirbase.org/ | miRBase is the primary public repository and online resource for microRNA sequences and annotation.
    snoRNA | snoDB | http://scottgroup.med.usherbrooke.ca/snoDB/ | SnoDB houses 2064 human snoRNAs, integrating the annotations of most existing databases, such as RefSeq (https://www.ncbi.nlm.nih.gov/refseq/), RNAcentral and Rfam (https://rfam.xfam.org/).
    tRNA | GtRNAdb | http://gtrnadb.ucsc.edu/ | GtRNAdb contains tRNA gene predictions on complete or nearly complete genomes.
    piRNA | RNAcentral | https://rnacentral.org/ | piRBase is a database of various piRNA-associated data that supports piRNA functional studies, and it has been imported into RNAcentral.
    rRNA | RNAcentral | https://rnacentral.org/ | The expert databases for rRNA, such as 5SrRNAdb (http://combio.pl/rrna/), Greengenes (http://greengenes.secondgenome.com/?prefix=downloads/greengenes_database/gg_13_5/) and RDP (http://rdp.cme.msu.edu/), are imported into RNAcentral.
    snRNA | NCBI GenBank | https://www.ncbi.nlm.nih.gov/nuccore/?term=human+AND+%22snRNA%22 | There is no independent ncRNA database for snRNA and Y RNA, so we downloaded them from GenBank (https://www.ncbi.nlm.nih.gov/genbank/), which is an annotated collection of all publicly available DNA sequences.
    Y RNA | NCBI GenBank | https://www.ncbi.nlm.nih.gov/nuccore/?term=human+AND+%22Y+RNA%22+NOT+pseudogene | There is no independent ncRNA database for snRNA and Y RNA, so we downloaded them from GenBank (https://www.ncbi.nlm.nih.gov/genbank/), which is an annotated collection of all publicly available DNA sequences.

    Supplementary figures

    Supplementary code


When using EVAtool, please cite: Xie G Y, Liu C J, Guo A Y. EVAtool: an optimized reads assignment tool for small ncRNA quantification and its application in extracellular vesicle datasets[J]. Briefings in Bioinformatics, 2022. Link: https://doi.org/10.1093/bib/bbac310