Vertebrate Genome Annotation Project
The Vertebrate Genome Annotation (VEGA) project is a biological database that helps researchers locate specific regions of vertebrate genomes and annotate the genes or regions within them. The VEGA browser is built on Ensembl web code and infrastructure and provides publicly available manual curation of known vertebrate genes for the scientific community. The VEGA website is updated frequently to keep its information about vertebrate genomes current, and it aims to present consistently high-quality annotation of all its published vertebrate genomes or genome regions. VEGA was developed by the Wellcome Trust Sanger Institute and works in close association with other annotation resources, such as ZFIN (The Zebrafish Information Network), the Havana Group and GenBank. Manual annotation is currently more accurate than automated methods at identifying splice variants, pseudogenes, polyadenylation features, non-coding regions and complex gene arrangements.
The Vertebrate Genome Annotation (VEGA) database was first made public in 2004 by the Wellcome Trust Sanger Institute. It was designed for viewing manual annotations of human, mouse and zebrafish genomic sequences, and it is the central cache where genome sequencing centers deposit their annotation of human chromosomes. Manual annotation of genomic data is extremely valuable for producing an accurate reference gene set, but it is expensive compared with automatic methods and so has been limited to model organisms. Annotation tools developed at the Wellcome Trust Sanger Institute (WTSI) are now being used to fill that gap: because they can be used remotely, they make community annotation collaborations viable. The HAVANA and VEGA projects are currently run by Dr. Jennifer Harrow of the Wellcome Trust Sanger Institute.
The Human Genome
The VEGA database is the central repository where the majority of genome sequencing centers deposit their annotation of human chromosomes. Since the original VEGA publication, the number of annotated human gene loci has more than doubled to over 49,000 (September 2012 release), over 20,000 of which are predicted to be protein coding. As part of the consensus coding sequence (CCDS) collaboration and the whole-genome extension of the ENCODE project, the Havana Group has fully manually annotated the human genome, which is available for reference, comparative analysis and sequence searches on the VEGA database.
The VEGA database brings together the information from individual vertebrate genome databases to give researchers easier access and to support comparative analysis. The human and vertebrate analysis and annotation (Havana) team at the Wellcome Trust Sanger Institute (WTSI) manually annotates the human, mouse and zebrafish genomes using the Otterlace/ZMap genome annotation tool. The Otterlace manual annotation system comprises a relational database, based on the Ensembl schema, that stores manual annotation data and supports the graphical interface, ZMap.
The zebrafish genome is being fully sequenced and manually annotated. It currently has 18,454 annotated VEGA genes, of which 16,588 are projected protein-coding genes (September 2012 release).
The mouse genome currently has 23,322 annotated VEGA genes, of which 14,805 are projected protein-coding genes (June 2012 release). The loci chosen for manual annotation are spread throughout the genome, but some regions have received more focus than others: chromosomes 2, 4, 11 and X have been fully annotated. The annotation shown in this release of VEGA is from a data freeze taken on 19 March 2012, and the gene structures are presented in the merged mouse gene set shown in Ensembl release 67. VEGA also shows artificial loci generated by the mouse knockout programs.
The pig genome currently has 2,842 annotated VEGA genes, of which 2,264 are projected protein-coding genes (September 2012 release). The pig major histocompatibility complex (MHC), also known as the swine leukocyte antigen complex (SLA), spans a 2.4Mbp region of submetacentric chromosome 7 (SSC7p1.1-q1.1). Implicated in the control of immune response and susceptibility to a range of diseases, the pig MHC plays a unique role in histocompatibility. Chromosomes X-WTSI and Y-WTSI are currently being annotated by Havana.
Dog, Chimpanzee, Wallaby, Gorilla
The dog genome currently has 45 annotated VEGA genes, of which 29 are projected protein-coding genes (February 2005 release). The chimpanzee genome currently has 124 annotated VEGA genes, of which 52 are projected protein-coding genes (January 2012 release). The wallaby genome currently has 193 annotated VEGA genes, of which 76 are projected protein-coding genes (March 2009 release). The gorilla genome currently has 324 annotated VEGA genes, of which 176 are projected protein-coding genes (March 2009 release).
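The counts quoted above can be turned into protein-coding fractions for a quick comparison across species. The following is a minimal Python sketch rather than anything provided by VEGA: the names `VEGA_COUNTS` and `coding_fraction` are illustrative, and the numbers are simply those cited in this article for the stated releases. Human is left out because the article gives only lower bounds ("over 49,000" and "over 20,000").

```python
# Gene counts as quoted in this article: (annotated VEGA genes, projected protein-coding genes).
# VEGA_COUNTS and coding_fraction are illustrative names, not part of any VEGA interface.
VEGA_COUNTS = {
    "zebrafish": (18454, 16588),
    "mouse": (23322, 14805),
    "pig": (2842, 2264),
    "dog": (45, 29),
    "chimpanzee": (124, 52),
    "wallaby": (193, 76),
    "gorilla": (324, 176),
}


def coding_fraction(total: int, coding: int) -> float:
    """Return the share of annotated genes that are projected to be protein coding."""
    if total <= 0:
        raise ValueError("total gene count must be positive")
    return coding / total


fractions = {
    species: coding_fraction(total, coding)
    for species, (total, coding) in VEGA_COUNTS.items()
}

# Print species from the highest to the lowest protein-coding share.
for species, fraction in sorted(fractions.items(), key=lambda item: item[1], reverse=True):
    print(f"{species:<11}{fraction:6.1%}")
```

Because the releases differ by species (from February 2005 to September 2012), the fractions describe the state of each annotation set at its own release date and are not strictly comparable.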
How to Use VEGA
The VEGA website is designed to be user-friendly, with readily available links and pictures for every organism on the site. Clicking any one of the vertebrates opens a page showing a table of statistics for that genome, as well as a chart depicting the amount of the genome annotated to date. The table of statistics gives the numbers of VEGA annotated genes, protein-coding genes, processed transcripts, pseudogenes, clones, total bases and annotated transcripts. Researchers can search within each vertebrate genome from this page, or they can use the quick search at the top of the VEGA homepage. If information is wanted about sections of the genome that are homologous across species, search results show in which species they have been annotated. Search results also give the domain and chromosome numbers specific to these vertebrates, as well as any primary literature that can be found on the gene query.
In addition to full genomes, and unlike other browsers, VEGA also displays small finished regions of interest from genomes of other vertebrates, human haplotypes and mouse strains. Currently this comprises the finished sequence and annotation of the major histocompatibility complex (MHC) from different human haplotypes, and from dog and pig (the latter of which is currently otherwise only available in very limited form in Ensembl Pre!). Additionally there is mouse NOD (non-obese diabetic) strain annotation of IDD (insulin-dependent diabetes) candidate regions and two more pig regions.
VEGA contains comparative pairwise analysis between specific genomic regions from either different species or different haplotypes or strains. This is in contrast to Ensembl, where many comparisons of whole genome against whole genome are performed. The analysis in VEGA involves:
1. The identification of genomic alignments using LastZ.
2. Prediction of the orthologue pairs using the Ensembl gene tree pipeline. Note that although the pipeline generates phylogenetic gene trees, the limited scope of the VEGA comparative analysis means that these will necessarily be incomplete, and consequently only orthologues are shown on the website.
3. The manual identification of alleles in either different human haplotypes or mouse strains.
There are five sets of analyses:
1. The MHC region has been compared between dog, pig (two assemblies), gorilla, chimpanzee, wallaby, mouse and eight human haplotypes:
- dog chromosome 12-MHC
- gorilla chromosome 6-MHC
- chimpanzee chromosome 6-MHC
- wallaby chromosome 2-MHC
- pig chromosome 7 on Sscrofa10.2 (24.7Mbp to 29.8Mbp)
- pig chromosome 7-MHC
- mouse chromosome 17 (33.3Mbp to 38.9Mbp)
- chromosome 6 on the human reference assembly (28Mbp to 34Mbp)
- chromosome 6 MHC region in the human COX, QBL, APD, DBB, MANN, MCF and SSTO haplotypes (full length chromosome fragments)
2. Comparisons between the LRC regions of pig, gorilla and human (nine haplotypes):
- pig chromosome 6 (53.6Mbp to 54.0Mbp)
- gorilla chromosome 19-LRC
- human chromosome 19q13.4 (54.6Mbp to 55.6Mbp) on the reference assembly
- chromosome 19 LRC region in the COX_1, COX_2, PGF_1, PGF_2, DM1A, DM1B, MC1A and MC1B haplotypes (full length chromosome fragments)
3. Insulin-dependent diabetes (Idd) regions on six mouse chromosomes (1, 3, 4, 6, 11 and 17) have been compared between the C57BL/6 reference and one or more of the DIL Non-Obese Diabetic (NOD), CHORI-29 NOD, and the 129 strains. The regions of the C57BL/6 reference assembly used in these comparisons are:
- Idd3.1: chromosome 3, clones AC117584.11 to AC115749.12
- Idd4.1: chromosome 11, clones AL596185.12 to AL663042.5
- Idd4.2: chromosome 11, clones AL663082.5 to AL604065.7
- Idd4.2Q: chromosome 11, clones AL596111.7 to AL645695.18
- Idd5.1: chromosome 1, clones AL683804.15 to AL645534.20
- Idd5.3: chromosome 1, clones AC100180.12 to AC101699.9
- Idd5.4: chromosome 1, clones AC123760.9 to AC109283.8
- Idd6.1 + Idd6.2: chromosome 6, clones AC164704.4 to AC164090.3
- Idd6.3: chromosome 6, clones AC171002.2 to AC163356.2
- Idd9.1: chromosome 4, clones AL627093.17 to AL670959.8
- Idd9.1M: chromosome 4, clones AL611963.24 to AL669936.12
- Idd9.2: chromosome 4, clones CR788296.8 to AL626808.28
- Idd9.3: chromosome 4, clones AL607078.26 to AL606967.14
- Idd10.1: chromosome 3, clones AC167172.3 to AC131184.4
- Idd16.1: chromosome 17, clones AC125141.4 to AC167363.3
- Idd18.1: chromosome 3, clones AL845310.4 to AL683824.8
- Idd18.2: chromosome 3, clones AC123057.4 to AC129293.9
4. Comparisons between three specific regions:
- pig chromosome 17 (58.2Mbp to 67.4Mbp)
- human chromosome 20q13.13-q13.33 (45.8Mbp to 62.4Mbp)
- mouse chromosome 2 (168.3Mbp to 179.0Mbp)
5. Pairwise comparisons between three pairs of full length mouse and human chromosomes:
- human chromosome 1 and mouse chromosome 4
- human chromosome 17 and mouse chromosome 11
- human chromosome X and mouse chromosome X
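To make the coordinate-based comparisons above more concrete, the sketch below represents the three regions of the fourth analysis set as simple records, computes the length of each region, and enumerates the pairs that a pairwise comparison would cover. It is only an illustration: the `Region` type and its field names are hypothetical, the coordinates are copied from the list above, and treating every pair within the set as compared is an assumption based on the description of the analysis as pairwise.

```python
from itertools import combinations
from typing import NamedTuple


class Region(NamedTuple):
    """A genomic interval in megabase pairs (hypothetical record, not a VEGA type)."""
    species: str
    chromosome: str
    start_mbp: float
    end_mbp: float

    @property
    def span_mbp(self) -> float:
        # Round to one decimal place to match the precision quoted in the article.
        return round(self.end_mbp - self.start_mbp, 1)


# Coordinates of the fourth analysis set, as listed in this article.
REGION_SET_4 = [
    Region("pig", "17", 58.2, 67.4),
    Region("human", "20", 45.8, 62.4),
    Region("mouse", "2", 168.3, 179.0),
]

for region in REGION_SET_4:
    print(f"{region.species} chromosome {region.chromosome}: {region.span_mbp} Mbp")

# Assumption: each pair of regions within the set is compared once.
pairs = list(combinations(REGION_SET_4, 2))
for first, second in pairs:
    print(f"{first.species} chr{first.chromosome} vs {second.species} chr{second.chromosome}")
```

The region lengths differ noticeably between species (roughly 9 to 17 Mbp here), which is one reason the comparison is defined by named regions rather than by matching coordinates.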