The Gene Ontology Annotation (GOA) database aims to provide high-quality electronic and manual annotations to the UniProt Knowledgebase (Swiss-Prot, TrEMBL and PIR-PSD) using the standardized vocabulary of the Gene Ontology (GO). As a supplementary archive of GO annotation, GOA promotes a high level of integration of the knowledge represented in UniProt with other databases. This is achieved by converting UniProt annotation into a recognized computational format.
GOA provides annotated entries for nearly 60 000 species (GOA-SPTr) and is the largest and most comprehensive open-source contributor of annotations to the GO Consortium annotation effort. By integrating GO annotations from other model organism groups, GOA consolidates specialized knowledge and expertise to ensure the data remain a key reference for up-to-date biological information. Furthermore, the GOA database fully endorses the Human Proteomics Initiative by prioritizing the annotation of proteins likely to benefit human health and disease.
In addition to a non-redundant set of annotations to the human proteome (GOA-Human) and monthly releases of its GO annotation for all species (GOA-SPTr), a series of GO mapping files and specific cross-references in other databases are also regularly distributed. GOA can be queried through a simple user-friendly web interface or downloaded in a parsable format via the EBI and GO FTP sites. The GOA data set can be used to enhance the annotation of particular model organism or gene expression data sets, although increasingly it has been used to evaluate GO predictions generated from text mining or protein interaction experiments.
In 2004, the GOA team will build on its success and will continue to supplement the functional annotation of UniProt and work towards enhancing the ability of scientists to access all available biological information. Researchers wishing to query or contribute to the GOA project are encouraged to email: firstname.lastname@example.org.
ELECTRONIC GO ANNOTATION
The large-scale assignment of GO terms to UniProt entries has been made possible by successfully converting a proportion of the pre-existing knowledge held within the flat files into GO terms (7).
For example, UniProt description lines (DE) may contain Enzyme Commission (EC) numbers. Using an existing mapping of EC numbers to the GO molecular function ontology (ec2go) and a mapping of protein accession numbers to EC numbers, GOA can produce a UniProt to GO association. In a similar fashion, the GOA group maintains a Swiss-Prot keyword to GO mapping (spkw2go).
This mapping file is routinely used to generate a large number of annotations to the GO process, function and component ontologies (see contents of current release on the GOA home page). Bi-directional database cross-references also help to integrate GO annotations. For example, the majority of UniProt entries will cross-reference an InterPro identification number and vice versa.
InterPro is a key database maintained at the EBI (11,12). It provides an integrated documentation resource for proteins, families and domains. A single InterPro entry provides comprehensive annotation describing a set of related proteins, some of which may have identical functions, be involved in the same processes and act in the same locations. During the curation of each InterPro entry, high-level GO terms are manually curated, based on a review of the literature available on the related proteins. This annotation is used to generate an InterPro2go mapping and also serves as a biological summary in the InterPro entry.
So far, the application of the InterPro2go mapping in the electronic assignment of GO terms to gene products has produced the most coverage in the GOA data set (see contents of current release on the GOA home page). Both spkw2go and InterPro2go mappings are maintained in-house and distributed on the GO and EBI FTP sites regularly. To support interoperability, InterPro2go has been used to generate GO mappings to its member databases (see Table 1) and these are also available for download.
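The mapping-based transfer described above (ec2go, spkw2go, InterPro2go) amounts to composing two lookups: protein to external identifier, and external identifier to GO term. The following is a minimal sketch under stated assumptions, not GOA's production pipeline. It assumes the mapping file uses one `external ID > GO:term name ; GO:nnnnnnn` entry per line with `!` comment lines; the function names and the accession `ACC_0001` are hypothetical and used for illustration only.

```python
from collections import defaultdict


def parse_external2go(lines):
    """Parse mapping lines into {external_id: {go_id, ...}}.

    Assumed line layout: 'EC:1.1.1.1 > GO:term name ; GO:0004022'.
    Lines starting with '!' are treated as comments.
    """
    mapping = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("!"):
            continue
        external_id, sep, rest = line.partition(" > ")
        if not sep:
            continue  # not a mapping line
        go_id = rest.rsplit(";", 1)[-1].strip()
        mapping[external_id.strip()].add(go_id)
    return dict(mapping)


def transfer_annotations(protein_to_external, external_to_go):
    """Compose protein -> external ID and external ID -> GO into associations.

    Every association produced this way is electronic, so it is tagged IEA.
    """
    associations = set()
    for accession, external_ids in protein_to_external.items():
        for external_id in external_ids:
            for go_id in external_to_go.get(external_id, ()):
                associations.add((accession, go_id, "IEA"))
    return sorted(associations)


# Illustrative data only: ACC_0001 is a placeholder accession.
ec2go_lines = [
    "! example mapping file",
    "EC:1.1.1.1 > GO:alcohol dehydrogenase (NAD+) activity ; GO:0004022",
]
protein_to_ec = {"ACC_0001": ["EC:1.1.1.1"]}
print(transfer_annotations(protein_to_ec, parse_external2go(ec2go_lines)))
```

The same composition would apply to keyword or InterPro identifiers; only the mapping file and the cross-reference source change.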
The GO assignments are released monthly, in accordance with a GO Consortium agreed format, within a 'gene association file'. As the mapping files used by GOA are manually curated, GOA is confident that its electronic annotation is of a high standard. Despite this, it is important that users have the ability to distinguish electronic from manually verified GO annotation. For this reason, the certainty of each GOA association is supported by annotating to one of 10 Consortium agreed evidence codes.
Electronically generated associations are labelled as 'inferred from electronic annotation' or IEA. Proteins assigned this code are 'likely' to be involved in a particular GO activity.
MANUAL GO ANNOTATION
The large-scale assignment of GO terms to UniProt proteins using electronic techniques is a fast and efficient way of associating high-level terms with a large number of proteins.
However, to provide more reliable and specific annotation, the GOA project also makes use of manual curation using information extracted from published scientific literature (7). This process is slower than the use of electronic techniques but provides more accurate information, as all annotation is validated by a team of skilled biologists. GOA recommends that users wishing to analyse GO annotation understand how GO is organized and how GO assignments are made.
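Because electronic associations carry the IEA evidence code, the two kinds of annotation can be separated when reading a gene association file. The sketch below assumes the tab-delimited GO Consortium layout in which the object identifier, GO ID, reference and evidence code occupy columns 2, 5, 6 and 7, and that comment lines begin with `!`. The function name and the sample rows are hypothetical.

```python
import io

# Zero-based column positions, assumed from the tab-delimited
# GO Consortium gene association layout.
DB_OBJECT_ID, GO_ID, REFERENCE, EVIDENCE = 1, 4, 5, 6


def split_by_evidence(handle):
    """Return (electronic, manual) lists of (accession, go_id, reference, evidence)."""
    electronic, manual = [], []
    for line in handle:
        if line.startswith("!"):
            continue  # header or comment line
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= EVIDENCE:
            continue  # malformed or empty row
        record = (fields[DB_OBJECT_ID], fields[GO_ID],
                  fields[REFERENCE], fields[EVIDENCE])
        if fields[EVIDENCE] == "IEA":
            electronic.append(record)
        else:
            manual.append(record)
    return electronic, manual


# Placeholder rows for illustration; identifiers are not real entries.
sample = io.StringIO(
    "! example gene association file\n"
    "UniProt\tACC_0001\tGENE1\t\tGO:0004022\tREF:0000001\tIEA\n"
    "UniProt\tACC_0002\tGENE2\t\tGO:0006350\tPMID:0000000\tTAS\n"
)
electronic, manual = split_by_evidence(sample)
print(len(electronic), "electronic;", len(manual), "manual")
```

Treating every non-IEA code as manual is a simplification made for this sketch; a stricter analysis would list the accepted evidence codes explicitly.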
Guidelines for GO annotation have been detailed before (10) and are published on the GO home page (http://www.geneontology.org/). Each assigned term is associated with a GO experimental evidence code (see GO home page for details of evidence codes) and a PubMed ID, which allows users to track the exact literature source and type of experiment used to support the annotation.
Priority is given in the GOA project to the annotation of data from the human proteome. This complements the efforts of the other consortium members, as no other member freely provides human-specific data. The GOA project also assigns GO terms to proteins from a variety of other species. There are almost 60 000 different species represented in the UniProt databases. Approximately 500 of these have already had GO terms manually assigned and this number continues to increase daily.
New terms are requested as required to adequately describe the many species in UniProt, thus enhancing the GO ontologies and extending their scope.
UTILITY OF GO ANNOTATION
Manual GO annotation generates high-quality, reliable information that is more accurate than electronic annotation. It also allows comparisons to be made with new annotation approaches and is an important tool for validation of these methods. However, manual annotation is time consuming and dependent on skilled biologists capable of extracting key information from the published literature.
In view of this, greater emphasis has been placed by the bioinformatics community on the development of new automatic annotation methods, such as automated information extraction and the conversion of this knowledge into the GO vocabulary (9,17-20). This has resulted in a variety of GO prediction servers with varying abilities to interpret accurately the subtleties of scientific natural language as well as GO structure, mappings and annotation styles (see GO Tools list on GO home page). To assess these information extraction techniques and allow users to apply the methods judiciously, the BioLINK group (http://www.pdg.cnb.uam.es/BioLINK/) organized the BioCreative (Critical Assessment of Information Extraction systems in Biology) competition.
In collaboration with BioLINK, GOA provided one of the gold standard training and test sets of GO annotation. UniProt curators also took part in manual verification of GO terms.
GOA can also be used to answer specific biological problems. As GO represents a universal set of curated keywords, many users wish to retrieve all possible annotations to a high-level GO term in a candidate-based approach. According to GO philosophy, every child term inherits the meaning of all of its parent terms.
As such, every annotation to a child term must be true for every parent of that child; this is called the 'true path rule'. If a user wanted to analyse all proteins involved in the process of transcription, they would have to retrieve all proteins annotated to the GO term for 'transcription' (GO:0006350) and to the children of this GO term. Retrieving the annotation to the children and parent GO term is possible via SRS (25) but requires prior knowledge of this powerful retrieval system.
Another way of performing the query is to use the protein assignments to a set of GO-slim terms. Essentially, GO-slim is a list of high-level GO terms that cover the main aspects of each of the three GO ontologies. As each community has different needs, a variety of GO-slim files have been archived on the GO home page by Consortium members.
GOA has created its own GO-slim (goslim_goa.2002) to summarize the GO annotation of each completed proteome on the Proteome Analysis pages (26, Table 1). As an additional service, this mapping of GO annotation to both the GOA (goslim_goa.2002) and a generic set of GO-slim terms (generic.0208) is available for download on the EBI FTP site (Table 1). From there, users can download all possible annotations to the GO-slim term for transcription, 'GO:0006350'.
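The retrieval implied by the true path rule can be illustrated with a small graph traversal: collect a term together with all of its descendants, then select every protein annotated to any of them. This is a sketch under stated assumptions, not the SRS query or the GOA download itself. The ontology is represented as a hypothetical child-to-parents dictionary, and apart from GO:0006350 all term and protein identifiers are placeholders.

```python
from collections import defaultdict


def descendants(term, parents_of):
    """Return every term that reaches `term` by following child -> parent links."""
    children_of = defaultdict(set)
    for child, parents in parents_of.items():
        for parent in parents:
            children_of[parent].add(child)
    seen, stack = set(), [term]
    while stack:
        current = stack.pop()
        for child in children_of[current]:
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def proteins_for_term(term, parents_of, annotations):
    """Proteins annotated to `term` or, by the true path rule, to any descendant."""
    wanted = descendants(term, parents_of) | {term}
    return sorted({acc for acc, go_id in annotations if go_id in wanted})


# Placeholder ontology fragment and annotations for illustration.
parents_of = {
    "GO:child_a": ["GO:0006350"],
    "GO:grandchild_b": ["GO:child_a"],
    "GO:unrelated": ["GO:other_root"],
}
annotations = [
    ("ACC_0001", "GO:grandchild_b"),
    ("ACC_0002", "GO:0006350"),
    ("ACC_0003", "GO:unrelated"),
]
print(proteins_for_term("GO:0006350", parents_of, annotations))
```

A query on the parent term alone would miss ACC_0001, which is annotated only to a more specific descendant.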
Users wishing to use a different set of GO-slim terms are advised to use the map2slim.pl script archived on the Berkeley Drosophila Genome Project (BDGP) home page (http://www.fruitfly.org/developers/src/go.dev/apps/query-utils/). This script uses the GO MySQL database and requires prior knowledge of the Perl API
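For readers who want to see the idea behind mapping annotations to a custom GO-slim without the Perl and MySQL setup, the following is a simplified Python sketch. It is not the map2slim.pl algorithm: it reports every slim term found among an annotated term and its ancestors, without any further filtering. The child-to-parents dictionary, the function names and all identifiers other than GO:0006350 are hypothetical.

```python
def ancestors(term, parents_of):
    """Return all terms reachable from `term` by following parent links."""
    seen, stack = set(), [term]
    while stack:
        for parent in parents_of.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def map_to_slim(annotations, parents_of, slim_terms):
    """Map each (accession, go_id) pair onto the slim terms at or above go_id."""
    slim_terms = set(slim_terms)
    mapped = set()
    for accession, go_id in annotations:
        lineage = ancestors(go_id, parents_of) | {go_id}
        for slim_id in lineage & slim_terms:
            mapped.add((accession, slim_id))
    return sorted(mapped)


# Placeholder ontology fragment, annotations and slim for illustration.
parents_of = {
    "GO:child_a": ["GO:0006350"],
    "GO:grandchild_b": ["GO:child_a"],
}
annotations = [("ACC_0001", "GO:grandchild_b"), ("ACC_0002", "GO:not_in_slim")]
print(map_to_slim(annotations, parents_of, ["GO:0006350"]))
```

Annotations whose lineage contains no slim term, such as ACC_0002 here, are simply dropped in this sketch; a fuller tool would typically report them separately.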