AnimalTFDB 2.0

AnimalTFDB is a comprehensive database including classification and annotation of genome-wide transcription factors (TFs), transcription co-factors and chromatin remodeling factors in 65 animal genomes. The TFs are further classified into 70 families based on their DNA-binding domain (DBD).


Methods for predicting TFs, transcription co-factors and chromatin remodeling factors

Transcription factors (TFs) are key regulators that bind to specific DNA sequences to activate or repress gene expression. Each TF has at least one DNA-binding domain (DBD), which is evolutionarily conserved, so TFs can be classified into families based on their DBDs. After reviewing the literature, we collected and curated 70 animal TF families, plus a group named "Others" that contains orphan TFs. We identified TFs using the Hidden Markov Model (HMM) profiles of their DBDs. For 56 of the 70 defined families, HMM profiles of the DBDs were available in the Pfam database (v27.0), and we downloaded them directly. For the remaining DBDs, which had no Pfam HMM profile, we built profiles from the sequences of representative species (human, mouse, zebrafish and fly): we aligned the DBD sequences with ClustalW2 and then ran the hmmbuild program from the HMMER package on the alignment. We then used the hmmsearch program to search all protein sequences of each species against the HMM profiles to predict TFs. Based on manual curation, we set the E-value cutoff to 0.0001. In addition to the predicted TFs, we found some TFs reported in publications that could not be assigned to any of the defined families; we placed them in the "Others" group.
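The E-value filtering step can be illustrated with a minimal sketch. This is not the code used to build the database. It assumes that hmmsearch was run with the --tblout option, that the full-sequence E-value (fifth column of that table) is the value compared against the cutoff, and that the cutoff is inclusive. The function name and the two example records are hypothetical.

```python
E_VALUE_CUTOFF = 1e-4  # cutoff from the text; inclusive comparison is an assumption


def parse_tblout(lines, cutoff=E_VALUE_CUTOFF):
    """Return (protein_id, profile_name, e_value) for hits passing the cutoff.

    Expects lines in the HMMER3 --tblout layout, whose whitespace-separated
    columns start with: target name, target accession, query name,
    query accession, full-sequence E-value.
    """
    hits = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        fields = line.split()
        protein_id, profile_name = fields[0], fields[2]
        e_value = float(fields[4])
        if e_value <= cutoff:
            hits.append((protein_id, profile_name, e_value))
    return hits


# Made-up records for illustration only.
example = [
    "# target name  accession  query name  accession  E-value  score  bias",
    "protA  -  P53  PF00870  1.2e-90  300.1  0.1",
    "protB  -  P53  PF00870  0.02  12.3  0.0",
]
predicted = parse_tblout(example)
print(predicted)
```

Only protA passes in this example, because the E-value of protB is above the cutoff.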


Transcription co-factors are proteins that interact with TFs in the transcription complex but do not bind DNA directly. To identify them, we first collected human transcription co-factors from the Tcof-DB database and the GO database using the GO terms "transcription coactivator activity", "transcription corepressor activity", "transcription cofactor activity" and "regulation of transcription". After removing redundant genes and genes that overlapped with TFs, we obtained 415 human transcription co-factors.


Chromatin remodeling factors were defined as proteins that regulate transcription by modifying chromatin structure. We obtained the human chromatin remodeling factors from the GO database. A gene was considered a chromatin remodeling factor if it had at least one of the following GO annotations: "chromatin remodeling", "chromatin-mediated maintenance of transcription", "histone *ylation", "histone .*ylase activity", "histone *transferase activity". After manual curation, we obtained 142 human chromatin remodeling factors.
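The GO-based selection can be sketched as a pattern match on GO term names. This is only an illustration under stated assumptions: the wildcards "*" and ".*" quoted above are both read as "any run of characters", matching is case-insensitive and applied to the full term name, and the function names and gene identifiers are hypothetical.

```python
import re

# Patterns quoted in the text, rewritten as regular expressions.
# Reading "*" and ".*" alike as "any run of characters" is an assumption.
CHROMATIN_GO_PATTERNS = [
    r"chromatin remodeling",
    r"chromatin-mediated maintenance of transcription",
    r"histone .*ylation",
    r"histone .*ylase activity",
    r"histone .*transferase activity",
]
_COMPILED = [re.compile(p, re.IGNORECASE) for p in CHROMATIN_GO_PATTERNS]


def is_chromatin_remodeling_term(go_term_name):
    """True if the GO term name fully matches one of the patterns."""
    return any(rx.fullmatch(go_term_name) for rx in _COMPILED)


def candidate_genes(gene_to_terms):
    """Sorted gene IDs that have at least one matching GO term name."""
    return sorted(
        gene
        for gene, terms in gene_to_terms.items()
        if any(is_chromatin_remodeling_term(term) for term in terms)
    )


# Hypothetical gene IDs with GO term names, for illustration only.
annotations = {
    "geneA": ["histone acetylation", "DNA binding"],
    "geneB": ["DNA binding"],
    "geneC": ["histone deacetylase activity"],
}
print(candidate_genes(annotations))
```

The resulting list would still need the manual curation described above.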


To identify transcription co-factors and chromatin remodeling factors in the other 64 species, we performed reciprocal best-hit BLAST between human and each of the other species, with the thresholds E-value <= 1e-4, coverage >= 50% and identity >= 30%.
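A minimal sketch of the reciprocal best-hit step is given below. It is an illustration, not the original pipeline, and it makes several assumptions: hits are filtered by the three thresholds before the best hit is chosen, the best hit is the one with the lowest E-value, and coverage is the percentage of the query covered by the alignment. The Hit record and all function names are hypothetical.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    query: str
    subject: str
    evalue: float
    coverage: float  # percent of the query covered (assumed definition)
    identity: float  # percent identity


def passes(hit, max_evalue=1e-4, min_coverage=50.0, min_identity=30.0):
    """Apply the three thresholds given in the text."""
    return (
        hit.evalue <= max_evalue
        and hit.coverage >= min_coverage
        and hit.identity >= min_identity
    )


def best_hits(hits):
    """Map each query to its best passing subject (lowest E-value, first seen wins ties)."""
    best = {}
    for hit in hits:
        if not passes(hit):
            continue
        current = best.get(hit.query)
        if current is None or hit.evalue < current.evalue:
            best[hit.query] = hit
    return {query: hit.subject for query, hit in best.items()}


def reciprocal_best_hits(human_vs_other, other_vs_human):
    """Return sorted (human_id, other_id) pairs that are each other's best hit."""
    forward = best_hits(human_vs_other)
    backward = best_hits(other_vs_human)
    return sorted(
        (human_id, other_id)
        for human_id, other_id in forward.items()
        if backward.get(other_id) == human_id
    )
```

Usage with made-up hits: H1 and M1 are reciprocal best hits, while the H2 hit fails the coverage threshold and is dropped.

```python
fwd = [Hit("H1", "M1", 1e-50, 90, 80), Hit("H1", "M2", 1e-10, 60, 40),
       Hit("H2", "M2", 1e-20, 40, 70)]
bwd = [Hit("M1", "H1", 1e-48, 88, 80), Hit("M2", "H2", 1e-20, 70, 70)]
pairs = reciprocal_best_hits(fwd, bwd)
```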


After systematically reviewing the recently published literature, we added two TF families that were not in AnimalTFDB v1.0: NCU-G1 and CEP-1. CEP-1 exists only in Caenorhabditis elegans. In addition, we reclassified the nuclear receptor family. In version 1.0, this family was divided into 12 sub-families based on InterPro and Pfam annotations. In the updated version, we divided it into 7 sub-families following the classification of the nuclear receptor nomenclature committee [1, 2]. The nuclear transcription factor Y (NFY) family is also divided into 3 sub-families, one for each of its three subunits.


In most cases, a TF has only one kind of DBD, so it is easy to assign it to the correct family. In some cases, however, a TF has more than one kind of DBD. To classify such TFs correctly, we examined all human and mouse TFs that contain multiple kinds of DBDs and set up two rules. First, if a superfamily has several subfamilies, we classify its TFs based on the subfamily DBD. For example, the Homeobox superfamily has four subfamilies: Pou, CUT, TF_Otx and other Homeobox. All TFs in this superfamily have a Homeobox domain, and some of them also have one of the Pou, CUT or TF_Otx subfamily signature domains; we assign these TFs to the subfamily indicated by their signature domain. Second, if a TF has more than one unrelated DBD, we classify it into the family of the DBD with the smallest E-value. We checked all classification results for human and mouse and found the method to be reasonable.
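The two rules can be sketched as follows. This is a simplified illustration with assumptions: only the Homeobox superfamily is handled for the first rule, the first rule takes precedence over the second, and if several signature domains are present the one with the smallest E-value is used. The function returns the name of the deciding DBD; mapping that name to a family via the table on this page is left out. All names in the code are hypothetical.

```python
HOMEOBOX = "Homeobox"
# Subfamily signature domains of the Homeobox superfamily named in the text.
HOMEOBOX_SIGNATURES = ("Pou", "CUT", "TF_Otx")


def deciding_domain(domain_hits):
    """Pick the DBD that decides the family from (dbd_name, e_value) hits.

    Rule 1: a Homeobox TF with a subfamily signature domain is decided by
            that signature domain.
    Rule 2: otherwise the DBD with the smallest E-value decides.
    Returns None when there are no hits.
    """
    if not domain_hits:
        return None
    # Keep the smallest E-value seen for each kind of DBD.
    best = {}
    for name, e_value in domain_hits:
        if name not in best or e_value < best[name]:
            best[name] = e_value
    if HOMEOBOX in best:
        signatures = [s for s in HOMEOBOX_SIGNATURES if s in best]
        if signatures:
            return min(signatures, key=lambda s: best[s])
    return min(best, key=best.get)


# Made-up E-values for illustration.
print(deciding_domain([("Homeobox", 1e-30), ("Pou", 1e-10)]))
print(deciding_domain([("HMG_box", 1e-5), ("zf-C2H2", 1e-20)]))
```

In the first call the Pou signature domain decides even though the Homeobox hit has a smaller E-value; in the second call the two DBDs are unrelated, so the smaller E-value decides.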


Family                DNA-binding domain  Pfam ID or InterPro ID
AF-4                  AF-4                PF05110
AP-2                  AP_2                IPR004979 (self-build)
ARID                  ARID                PF01388
bHLH                  HLH                 PF00010
CBF                   CBF_alpha           PF02312
CEP-1                 CEP1-DNA_bind       PF09287
CSL                   BTD                 PF09270
NF-Y
  NF-YA               CBFB_NFYA           PF02045
  NF-YB               NF-YB               self-build
  NF-YC               NF-YC               self-build
CG-1                  CG-1                PF03859
CP2                   CP2                 PF04516
CSD                   CSD                 PF00313
E2F                   E2F_TDP             PF02319
ETS                   Ets                 PF00178
Fork head             Fork_head           PF00250
GCM                   GCM                 PF03615
GTF2I                 GTF2I               PF02946
HMG                   HMG_box             PF00505
HMGI/HMGY             HMGI/HMGY           IPR000116 (self-build)
HSF                   HSF_DNA-bind        PF00447
HTH                   HTH_psq             PF05225
IRF                   IRF                 PF00605
MYB                   Myb_DNA-bd          PF00249
MBD                   MBD                 PF01429
NCU-G1                NCU-G1              PF15065
NDT80/PhoG            NDT80_PhoG          PF05224
Nrf1                  Nrf1_DNA-bind       PF10491
PC4                   PC4                 PF02229
P53                   P53                 PF00870
PAX                   PAX                 PF00292
HPD                   HPD                 PF05044
RFX                   RFX                 PF02257
RHD                   RHD                 PF00554
Runt                  Runt                PF00853
SAND                  SAND                PF01342
SRF                   SRF                 PF00319
STAT                  STAT_bind           PF02864
T-box                 T-box               PF00907
TEA                   TEA                 PF01285
COE                   COE                 IPR003523 (self-build)
TSC22                 TSC22               PF01166
Tub                   Tub                 PF01167
bZIP
  TF_bZIP             bZIP                self-build
  C/EBP               bZIP                self-build
MH1
  CTF/NFI             MH1                 PF00859
  MH1                 MH1                 PF03165
Homeobox
  Homeobox            Homeobox            PF00046
  Pou                 Homeobox, Pou       PF00157
  CUT                 Homeobox, CUT       PF02376
  TF_Otx              Homeobox, TF_Otx    PF03529
Zinc finger
  zf-C2HC             zf-C2HC             PF01530
  zf-GAGA             zf-GAGA             PF09237
  zf-BED              zf-BED              PF02892
  zf-C2H2
    ZBTB              zf-C2H2             PF00651
    zf-C2H2           zf-C2H2             PF00096
  Nuclear Receptor
    Miscellaneous     zf-C4               self-build
    THR-like          zf-C4               self-build
    RXR-like          zf-C4               self-build
    ESR-like          zf-C4               self-build
    NGFIB-like        zf-C4               self-build
    SF-like           zf-C4               self-build
    GCNF-like         zf-C4               self-build
  DM                  DM                  PF00751
  zf-GATA             zf-GATA             PF00320
  zf-LITAF-like       zf-LITAF-like       PF10601
  zf-MIZ              zf-MIZ              PF02891
  zf-NF-X1            zf-NF-X1            PF01422
  THAP                THAP                PF05485
Others

References:

  • 1. A unified nomenclature system for the nuclear receptor superfamily. Cell. 1999 Apr 16; 97(2):161-3. PMID 10219237.
  • 2. Evolution of the nuclear receptor superfamily: early diversification from an ancestral orphan receptor. J Mol Endocrinol. 1997 Dec; 19(3):207-26. PMID 9460643.

1. Gene basic information

This section shows basic information for a gene, compiled from several sources. The Ensembl ID, Gene Symbol, Gene Type, Orientation, Length, Position and Transcripts information were extracted from the Ensembl database. The Entrez ID and HGNC ID were obtained through Ensembl BioMart, and these two IDs were then used to retrieve the gene Alias and Full Name from the NCBI and HGNC databases. Summary information was retrieved from the NCBI database by Entrez ID. Cross-links were extracted from UniGene, OMIM, GeneCards, MGI, RGD and FlyBase.

For 7 model species (human, mouse, rat, chicken, zebrafish, fruit fly and worm), we indicate whether a TF, co-factor or chromatin remodeling factor is experimentally verified or putative, based on its GO annotations. A TF is considered experimentally validated if it has a GO annotation of "regulation of transcription or transcription factor" that is marked with an experimental evidence code. A co-factor is considered experimentally validated if it has a GO annotation of "transcription coactivator/corepressor/cofactor activity" or "regulation of transcription" with an experimental evidence code. A chromatin remodeling factor is considered experimentally validated if it has a GO annotation of "chromatin remodeling", "chromatin-mediated maintenance of transcription", "histone *ylation", "histone .*ylase activity" or "histone *transferase activity" with an experimental evidence code. Factors that do not meet these criteria are marked as putative.
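The labeling rule can be sketched for co-factors as follows. This is an illustration under assumptions: the set of "experimental" evidence codes is taken to be the six codes of the GO experimental evidence group (EXP, IDA, IPI, IMP, IGI, IEP), since the text does not list the codes it used, and GO terms are compared by exact name. The function and variable names are hypothetical.

```python
# GO evidence codes of the "experimental" group. Which codes the database
# actually counted as experimental is not stated in the text (assumption).
EXPERIMENTAL_CODES = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"}

# Co-factor GO terms listed in the text.
COFACTOR_TERMS = {
    "transcription coactivator activity",
    "transcription corepressor activity",
    "transcription cofactor activity",
    "regulation of transcription",
}


def evidence_status(annotations, qualifying_terms):
    """Label a gene from its (go_term_name, evidence_code) annotations.

    Returns "experimental" if any qualifying term carries an experimental
    evidence code, otherwise "putative".
    """
    for term, code in annotations:
        if term in qualifying_terms and code in EXPERIMENTAL_CODES:
            return "experimental"
    return "putative"


# IDA is an experimental code; IEA (inferred from electronic annotation) is not.
print(evidence_status([("transcription coactivator activity", "IDA")], COFACTOR_TERMS))
print(evidence_status([("transcription coactivator activity", "IEA")], COFACTOR_TERMS))
```

The same function could be applied to TFs or chromatin remodeling factors by passing a different set of qualifying terms.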


2. Gene model

This section shows the distribution of the CDS, UTR and intron regions of a gene along the chromosome, based on the Ensembl GTF files.


3. Function domain

This section displays the domain distribution of the longest protein of each gene. To identify the functional domains of TFs, we first downloaded all HMM profiles from the Pfam database (version 27.0). We then ran PfamScan with default settings to search the longest protein sequence of each TF against these profiles. Finally, we kept only the domains with coverage >= 70% as the functional domains of each sequence.
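The coverage filter can be sketched as below. The text does not define coverage, so this sketch assumes it is the fraction of the domain model covered by the alignment, computed from the HMM start, HMM end and HMM length values that PfamScan reports for each hit. The dictionary keys, function names and example numbers are hypothetical.

```python
def domain_coverage(hmm_start, hmm_end, hmm_length):
    """Fraction of the domain model covered by the hit (1-based, inclusive)."""
    return (hmm_end - hmm_start + 1) / hmm_length


def filter_domains(domain_hits, min_coverage=0.70):
    """Keep hits whose coverage of the domain model is at least min_coverage."""
    return [
        hit
        for hit in domain_hits
        if domain_coverage(hit["hmm_start"], hit["hmm_end"], hit["hmm_length"])
        >= min_coverage
    ]


# Made-up hits: the first covers most of its model, the second only a fragment.
hits = [
    {"name": "domA", "hmm_start": 1, "hmm_end": 190, "hmm_length": 196},
    {"name": "domB", "hmm_start": 10, "hmm_end": 30, "hmm_length": 55},
]
kept = filter_domains(hits)
print([hit["name"] for hit in kept])
```

If coverage were instead defined relative to the protein sequence, only domain_coverage would need to change.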


4. Gene ontology

The GO annotations were parsed from the gene2go file, which was downloaded from the NCBI FTP site.


5. Pathway

The pathway annotations were obtained from the KEGG and BioCarta databases.


6. Phenotype

We parsed the disease information from MalaCards and Ensembl BioMart.


7. Protein-protein Interaction

The protein-protein interactions were extracted from the BioGRID (version 3.2) and HPRD databases.


8. Ortholog

We extracted the ortholog information using the Ensembl API.


9. Paralog

Paralogs were also extracted using the Ensembl API.


10. Gene expression

The mRNA expression profiles of human, mouse and rhesus monkey in different tissues and cell lines were obtained. In addition, the human protein expression profile was downloaded.


AnimalTFDB 2.0 provides gene expression information for 9 model species, covering normal tissues, cell lines and developmental stages, as well as cancers in human.


We downloaded the human cancer gene expression data from TCGA, and the human tissue and cell line data from the EMBL-EBI Expression Atlas. The expression data of the human proteome were parsed from two recently published Nature papers [1, 2]. The gene expression data of D. melanogaster and C. elegans were extracted from the data published by Li et al. [3]. Our collaborators Drs. Yu Xue and Haibo Jia kindly provided unpublished gene expression data for Danio rerio. For Rattus norvegicus, Bos taurus and Gallus gallus, we downloaded the raw data published by the Burge group from NCBI GEO DataSets and calculated gene expression with TopHat and Cufflinks. The gene expression data for Mus musculus and Macaca mulatta were downloaded from RhesusBase, where they had been calculated from RNA-Seq data published by the Burge, Kaessmann and Chuan-Yun Li groups.


11. Family Multi-alignment and Phylogenetic Tree

We performed multiple sequence alignment of the DBD sequences with ClustalW2 and constructed phylogenetic trees for the TFs of the same family in each species using the PHYLIP Neighbor-Joining method with 100 bootstrap replicates. The multiple sequence alignments and phylogenetic trees are displayed with WebLogo and Phylogeny.fr, respectively.


References

  • 1. Kim, M.S., Pinto, S.M. et al. (2014) A draft map of the human proteome. Nature, 509, 575-581.
  • 2. Wilhelm, M., Schlegl, J. et al. (2014) Mass-spectrometry-based draft of the human proteome. Nature, 509, 582-587.
  • 3. Li, J.J., Huang, H., Bickel, P.J. and Brenner, S.E. (2014) Comparison of D. melanogaster and C. elegans developmental stages, tissues, and cells by modENCODE RNA-seq data. Genome Research, 24, 1086-1101.

AnimalTFDB is a comprehensive database including classification and annotation of genome-wide transcription factors (TFs), transcription co-factors and chromatin remodeling factors in 65 animal genomes. The TFs are further classified into 70 families based on their DNA-binding domain (DBD). The family names and assignment rules can be found on the TF family assignment rules page.


The current URL of AnimalTFDB is http://bioinfo.life.hust.edu.cn/AnimalTFDB/. AnimalTFDB 2.0 provides multiple ways to browse, search by keyword, run BLAST searches against and download the data in the database. We also provide an online TF prediction server.


Browse

  • 1. Browse by species. Users can browse the database by clicking the logo of a species on the phylogenetic tree or by clicking its name in the left treeview. This browse mode follows the cascade species->families->family gene list->single gene annotation. The family gene list page also shows the multiple sequence alignment of the DBDs, the WebLogo graph of the alignment and the phylogenetic tree of the TFs, as well as a brief introduction to the TF family with references.
  • 2. Browse by family. Users can browse the database by clicking the logo of a TF family or the bars for transcription co-factors and chromatin remodeling factors. The names of TFs, transcription co-factors and chromatin remodeling factors in the left treeview can also be clicked to browse the database. The full TF list of each species can be browsed by clicking the "Transcription Factor Family" bar. This browse mode follows the cascade families->species->family gene list->single gene annotation.

Search

  • 1. A quick search box for Ensembl gene ID, Entrez gene ID or gene symbol is located at the top of each page.
  • 2. The advanced search page provides multiple ways to search the database. Users can search by basic information (multiple gene IDs, gene symbol, alias and full name), by annotation information (protein-protein interactions, gene ontology, pathway, orthologs and paralogs), or by the mRNA or protein expression of a TF. For the expression search, users can select the species, the types of tissues, cell lines, developmental stages and cancers, and a minimum gene expression level to filter the search results.

TF prediction

  • To help users identify TFs among their own protein sequences, we set up a TF prediction server. The prediction method and the TF family assignment rules can be found on the prediction page and the TF family assignment rules page. Currently, users can upload up to 1000 protein sequences at a time and obtain results within a few minutes. The prediction result provides the TF family, the alignment E-value and detailed alignment information.

BLAST

  • To help users find homologous genes and explore the functions of poorly studied TFs, a BLAST tool is provided to search protein or DNA sequences against the TFs, transcription co-factors and chromatin remodeling factors in our database. The protein sequences of all species or of one specific species can be selected as the BLAST database, and the E-value threshold can be chosen.

Download

  • The gene lists of TFs, transcription co-factors and chromatin remodeling factors for each species can be downloaded from the download page. The longest protein sequences of all the genes are also provided on the download page.