AnimalTFDB is a comprehensive database including classification and annotation of genome-wide transcription factors (TFs), transcription co-factors and chromatin remodeling factors in 65 animal genomes. The TFs are further classified into 70 families based on their DNA-binding domain (DBD).
Methods for predicting TFs, transcription co-factors and chromatin remodeling factors
Transcription factors (TFs) are key regulators that bind to specific DNA sequences to activate or repress gene expression. Each TF has at least one DNA-binding domain (DBD), which is conserved in evolution, and TFs can be classified into families based on their DBDs. After reviewing the literature, we collected and curated 70 animal TF families plus a group named "Others", which contains orphan TFs. We identified TFs based on the Hidden Markov Model (HMM) profiles of their DBDs. Among the 70 defined families, 56 had HMM profiles for their DBDs in the Pfam database (v27.0), which we downloaded directly. For the remaining DBDs without Pfam HMM profiles, we built the profiles ourselves from sequences of representative species (human, mouse, zebrafish and fly): we aligned the DBD sequences with ClustalW2 and used the hmmbuild program in the HMMER package to build the HMM profiles. We then applied the hmmsearch program to search all protein sequences of each species against the HMM profiles to predict TFs. Based on our manual curation, we used an E-value of 0.0001 as the cutoff. In addition to the predicted TFs, we also collected some TFs reported in publications. None of them could be assigned to any of the defined TF families, so we placed them in the "Others" group.
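The E-value filtering step described above can be sketched as follows. This is a minimal illustration, not the pipeline used to build the database. It assumes the search results were written with the `--tblout` option of hmmsearch, where the first whitespace-delimited column is the target (protein) name, the third is the query profile name and the fifth is the full-sequence E-value. The function name, the sample rows and the choice to keep hits with an E-value equal to the cutoff are assumptions made for illustration.

```python
E_VALUE_CUTOFF = 1e-4  # cutoff stated in the text


def predicted_tfs(tblout_lines, cutoff=E_VALUE_CUTOFF):
    """Return {protein_id: (profile_name, evalue)} for hits passing the cutoff.

    When a protein matches several profiles, only the hit with the
    smallest E-value is kept.
    """
    best = {}
    for line in tblout_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        fields = line.split()
        protein, profile, evalue = fields[0], fields[2], float(fields[4])
        if evalue > cutoff:
            continue  # hit is weaker than the cutoff
        if protein not in best or evalue < best[protein][1]:
            best[protein] = (profile, evalue)
    return best


# Made-up rows in tblout column order (target, acc, query, acc, E-value, ...).
sample = [
    "# target name  accession  query name  accession  E-value  score  bias",
    "ENSP0001 - Homeobox PF00046 2.1e-20 70.3 0.1",
    "ENSP0002 - Homeobox PF00046 0.003 12.0 0.0",
    "ENSP0001 - Pou PF00157 5.0e-30 99.0 0.2",
]
hits = predicted_tfs(sample)
```

In this sketch the second protein is discarded because its E-value is above the cutoff, and the first protein keeps only its strongest match.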
Transcription co-factors are proteins that interact with TFs in the transcription complex but do not bind DNA directly. To identify them, we first collected human transcription co-factors from the Tcof-DB database and the GO database using the GO terms "transcription coactivator activity", "transcription corepressor activity", "transcription cofactor activity" and "regulation of transcription". After removing redundant genes and genes overlapping with TFs, we obtained 415 human transcription co-factors.
Chromatin remodeling factors were defined as proteins that regulate transcription by modifying chromatin formation. We obtained the human chromatin remodeling factors from the GO database. A gene was considered a chromatin remodeling factor if it had one of the following GO annotations: "chromatin remodeling", "chromatin-mediated maintenance of transcription", "histone *ylation", "histone .*ylase activity", "histone *transferase activity". After manual curation, we obtained 142 human chromatin remodeling factors.
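The GO term matching above can be sketched with regular expressions. The text mixes "*" and ".*" as wildcards; the sketch below reads both as "any run of characters", which is an assumption. The function and constant names are hypothetical.

```python
import re

# Wildcard patterns from the text, rewritten as regular expressions.
# Reading "*" as ".*" is an assumption about the intended meaning.
CHROMATIN_GO_PATTERNS = [
    r"chromatin remodeling",
    r"chromatin-mediated maintenance of transcription",
    r"histone .*ylation",
    r"histone .*ylase activity",
    r"histone .*transferase activity",
]
_CHROMATIN_RE = re.compile("|".join(f"(?:{p})" for p in CHROMATIN_GO_PATTERNS))


def is_chromatin_remodeling_candidate(go_terms):
    """Return True if any GO term name matches one of the patterns."""
    return any(_CHROMATIN_RE.search(term) for term in go_terms)
```

Because `search` is used rather than `fullmatch`, a term such as "regulation of histone acetylation" would also match. Whether such terms should count is a design choice that the text does not settle, and candidates were manually curated in any case.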
To identify transcription co-factors and chromatin remodeling factors in the other 64 species, we performed reciprocal best-hit BLAST searches between human and each of the other species, requiring e-value<=1e-4, coverage>=50% and identity>=30%.
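The reciprocal best-hit logic can be sketched as below. It is a simplified illustration under stated assumptions: hits are already parsed into tuples, coverage and identity are given as percentages, and the "best" hit is the one with the highest bit score. The text does not specify how the best hit was ranked or how coverage was computed, and all names here are hypothetical.

```python
def best_hits(hits, max_evalue=1e-4, min_coverage=50.0, min_identity=30.0):
    """Map each query to its best subject among hits passing the thresholds.

    Each hit is (query, subject, identity_pct, coverage_pct, evalue, bitscore).
    "Best" means highest bit score, which is an assumption.
    """
    best = {}
    for query, subject, identity, coverage, evalue, bitscore in hits:
        if evalue > max_evalue or coverage < min_coverage or identity < min_identity:
            continue  # fails at least one of the three conditions
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(human_vs_other, other_vs_human):
    """Return {human_protein: other_protein} pairs that are each other's best hit."""
    forward = best_hits(human_vs_other)
    backward = best_hits(other_vs_human)
    return {h: o for h, o in forward.items() if backward.get(o) == h}


# Made-up hits for illustration.
human_vs_mouse = [
    ("H1", "M1", 85.0, 90.0, 1e-50, 300.0),
    ("H1", "M2", 40.0, 60.0, 1e-10, 80.0),
    ("H2", "M2", 25.0, 80.0, 1e-6, 50.0),  # identity below 30%
]
mouse_vs_human = [
    ("M1", "H1", 85.0, 90.0, 1e-50, 300.0),
    ("M2", "H1", 40.0, 60.0, 1e-10, 80.0),
]
pairs = reciprocal_best_hits(human_vs_mouse, mouse_vs_human)
```

Here only H1 and M1 are each other's best hit, so M1 would inherit the annotation of H1.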
After systematically reviewing the recently published literature, we added two TF families that were not in AnimalTFDB v1.0: NCU-G1 and CEP-1. CEP-1 exists only in Caenorhabditis elegans. In addition, we reclassified the nuclear receptor family. In version 1.0, this family was divided into 12 sub-families based on InterPro and Pfam annotations. In the updated version, we divided it into 7 sub-families following the classification of the nuclear receptor nomenclature committee [1, 2]. The nuclear transcription factor Y (NFY) family is also divided into 3 sub-families corresponding to its three subunits.
In most cases, a TF has only one kind of DBD and is therefore easy to assign to a family. In some cases, however, a TF has more than one kind of DBD. To classify such TFs correctly, we examined all human and mouse TFs containing multiple kinds of DBDs and set up two rules. First, if a superfamily has several subfamilies, we classified its TFs based on the subfamily DBD. For example, the Homeobox superfamily has four subfamilies: Pou, CUT, TF_Otx and other Homeobox. All TFs in this superfamily have a Homeobox domain, and some also have one of the Pou, CUT or TF_Otx subfamily signature domains; we assigned these to the specific subfamily indicated by their signature domain. Second, if a TF has more than one unrelated DBD, we assigned it to the family of the DBD with the smallest E-value. We checked all classification results for human and mouse and found that this method gave reasonable assignments.
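The two rules can be expressed as a short function. This is a sketch limited to the Homeobox example given above. Applying the subfamily rule before the E-value rule, and breaking ties between co-occurring signature domains by E-value, are assumptions. The function name and input format are hypothetical.

```python
# Subfamily signature domains of the Homeobox superfamily named in the text.
HOMEOBOX_SIGNATURES = {"Pou", "CUT", "TF_Otx"}


def assign_family(dbd_hits):
    """Assign one family to a TF from its {dbd_name: evalue} matches.

    Rule 1: a Homeobox TF carrying a subfamily signature domain is
            assigned to that subfamily.
    Rule 2: otherwise the DBD with the smallest E-value decides.
    """
    if "Homeobox" in dbd_hits:
        signatures = [d for d in dbd_hits if d in HOMEOBOX_SIGNATURES]
        if signatures:
            # Several signatures at once: pick the smallest E-value (assumption).
            return min(signatures, key=dbd_hits.get)
    return min(dbd_hits, key=dbd_hits.get)
```

For example, `assign_family({"Homeobox": 1e-20, "Pou": 1e-15})` returns `"Pou"` even though the Homeobox match has the smaller E-value, because the signature domain takes precedence under rule 1.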
|Family||DNA-binding domain||Pfam ID or InterPro ID|
1. Gene basic information
This part shows the basic information for a gene, which was extracted from different sources. The Ensembl ID, Gene Symbol, Gene Type, Orientation, Length, Position and Transcripts information were extracted from the Ensembl database. The Entrez ID and HGNC ID were obtained through Ensembl BioMart, and we used these two IDs to get the gene Alias and Full Name from the NCBI and HGNC databases. Summary information was retrieved from the NCBI database by Entrez ID. Cross links were extracted from UniGene, OMIM, GeneCards, MGI, RGD and FlyBase.
For 7 model species (human, mouse, rat, chicken, zebrafish, fruit fly and worm), we indicate whether a TF, co-factor or chromatin remodeling factor is experimentally verified or putative, based on GO annotations. A TF is considered experimentally validated if it has a GO annotation of "regulation of transcription" or "transcription factor" that is marked with an experimental evidence code. A co-factor is considered experimentally validated if it has a GO annotation of "transcription coactivator/corepressor/cofactor activity" or "regulation of transcription" with an experimental evidence code. A chromatin remodeling factor is considered experimentally validated if it has a GO annotation of "chromatin remodeling", "chromatin-mediated maintenance of transcription", "histone *ylation", "histone .*ylase activity" or "histone *transferase activity" with an experimental evidence code. Otherwise, the factor is considered putative.
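The evidence rule can be illustrated with a small function. The text does not list which evidence codes were treated as experimental; the sketch assumes the classical GO experimental codes (EXP, IDA, IPI, IMP, IGI, IEP). The function name and input format are hypothetical.

```python
# Classical GO experimental evidence codes. The exact set used for the
# database is not stated in the text, so this list is an assumption.
EXPERIMENTAL_CODES = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"}

COFACTOR_TERMS = {
    "transcription coactivator activity",
    "transcription corepressor activity",
    "transcription cofactor activity",
    "regulation of transcription",
}


def evidence_status(annotations, qualifying_terms):
    """Return "experimental" or "putative" for one gene.

    annotations: iterable of (go_term_name, evidence_code) pairs.
    qualifying_terms: GO term names that count for this factor type.
    """
    for term, code in annotations:
        if term in qualifying_terms and code in EXPERIMENTAL_CODES:
            return "experimental"
    return "putative"
```

A co-factor annotated with "transcription coactivator activity" under code IDA would be labeled experimental, while the same term under IEA (inferred from electronic annotation) would leave it putative.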
2. Gene model
This part shows the distribution of the CDS, UTRs and introns of a gene along the chromosome, based on the information in Ensembl GTF files.
3. Function domain
The function domain section displays the domain distribution of the longest protein of each gene. To identify the function domains of TFs, we first downloaded all HMM profiles from the Pfam database (version 27.0). We then ran PfamScan with default settings to search the longest protein sequence of each TF against these profiles. After filtering for domain coverage>=70%, we obtained the final function domains for each sequence.
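The coverage filter can be sketched as follows. The text does not define domain coverage; the sketch assumes it is the fraction of the profile HMM spanned by the match, computed from the HMM start, HMM end and HMM length of each match. The function names and example values are hypothetical.

```python
def domain_coverage(hmm_start, hmm_end, hmm_length):
    """Percentage of the profile HMM covered by a match (1-based, inclusive)."""
    return 100.0 * (hmm_end - hmm_start + 1) / hmm_length


def filter_domains(matches, min_coverage=70.0):
    """Keep matches whose domain coverage is at least min_coverage percent.

    Each match is (domain_name, hmm_start, hmm_end, hmm_length).
    """
    return [m for m in matches if domain_coverage(m[1], m[2], m[3]) >= min_coverage]


# Made-up matches: the first spans its whole profile, the second about a quarter.
matches = [("Homeobox", 1, 57, 57), ("P53", 10, 60, 196)]
kept = filter_domains(matches)
```

Under this definition a partial match to a long profile is dropped even if its E-value is significant, which is presumably the purpose of the filter.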
4. Gene ontology
The GO annotations were parsed from the gene2go file, which was downloaded from the NCBI FTP site.
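A minimal parsing sketch is shown below. It assumes the tab-delimited gene2go layout with the columns tax_id, GeneID, GO_ID, Evidence, Qualifier, GO_term, PubMed and Category, and a header line starting with "#". The function name is hypothetical and the sample rows are illustrative rather than copied from the real file.

```python
import io
from collections import defaultdict


def parse_gene2go(handle, tax_id):
    """Collect (GO_ID, evidence, GO_term, category) tuples per Entrez Gene ID
    for one species, identified by its NCBI taxonomy ID."""
    annotations = defaultdict(list)
    for line in handle:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and the "#tax_id ..." header
        fields = line.rstrip("\n").split("\t")
        if fields[0] != str(tax_id):
            continue  # annotation belongs to another species
        gene_id, go_id, evidence = fields[1], fields[2], fields[3]
        go_term, category = fields[5], fields[7]
        annotations[gene_id].append((go_id, evidence, go_term, category))
    return dict(annotations)


# Illustrative rows in gene2go column order.
sample_gene2go = (
    "#tax_id\tGeneID\tGO_ID\tEvidence\tQualifier\tGO_term\tPubMed\tCategory\n"
    "9606\t7157\tGO:0003700\tIDA\t-\tDNA-binding transcription factor activity\t-\tFunction\n"
    "10090\t22059\tGO:0003700\tIDA\t-\tDNA-binding transcription factor activity\t-\tFunction\n"
)
human_go = parse_gene2go(io.StringIO(sample_gene2go), 9606)
```

Filtering by taxonomy ID while reading keeps memory use low, since the file covers all species in one table.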
We parsed the disease information from MalaCards and Ensembl BioMart.
7. Protein-protein Interaction
We extracted the ortholog information using the Ensembl API.
Paralogs were also extracted using the Ensembl API.
10. Gene expression
The mRNA expression profiles of human, mouse and rhesus monkey for different tissues and cell lines were obtained. In addition, the protein expression profile of human was also downloaded.
AnimalTFDB 2.0 provides gene expression information for 9 model species, covering normal tissues, cell lines and development stages, as well as cancers in human.
We downloaded the human cancer gene expression data from TCGA and the tissue and cell line data from the EMBL-EBI Expression Atlas. The expression data of the human proteome were parsed from two recently published Nature papers [1,2]. The gene expression data of D. melanogaster and C. elegans were extracted from the data published by Li et al. Our collaborators Drs. Yu Xue and Haibo Jia kindly provided unpublished gene expression data for Danio rerio. We downloaded the raw data for Rattus norvegicus, Bos taurus and Gallus gallus, published by the Burge group, from NCBI GEO DataSets and calculated gene expression with the TopHat and Cufflinks programs. The gene expression data for Mus musculus and Macaca mulatta were downloaded from RhesusBase, where they were calculated from the RNA-Seq data published by the Burge, Kaessmann and Chuan-Yun Li groups.
11. Family Multi-alignment and Phylogenetic Tree
We performed multiple sequence alignment of the DBD sequences with ClustalW2 and constructed phylogenetic trees for the TFs of the same family in each species using the PHYLIP Neighbor-Joining method with 100 bootstrap replicates. The multiple sequence alignments and phylogenetic trees are displayed with WebLogo and Phylogeny.fr, respectively.
The family names and assignment rules can be found on the TF family assignment rules page.
The current URL of AnimalTFDB is http://bioinfo.life.hust.edu.cn/AnimalTFDB/. AnimalTFDB 2.0 provides multiple ways to access the data in our database: browsing, keyword search, BLAST search and download. We also provide an online TF prediction server.