Naming
- Problem ⇒ genes and proteins are named inconsistently (e.g. many different names for the same gene or protein)
- Solution ⇒ consistent, standardized gene naming by the HUGO Gene Nomenclature Committee (HGNC), which assigns a unique name and symbol to each human gene
- For non-human species → model organism databases serve as central repositories!
- NCBI lists many approved gene names in its Gene database
- In computer science and information science, an ontology is a formal representation of a set of concepts within a domain and the relationships between those concepts.
- Examples include the Unified Medical Language System (UMLS), the Gene Ontology (GO) and Medical Subject Headings (MeSH)
- Gene Ontology (GO) ⇒ used to annotate genes and proteins in genomic and protein databases, across all species.
- Consists of three structured, controlled vocabularies (ontologies) that describe gene products in a species-independent manner
- Biological Processes ⇒ a commonly recognized series of events (e.g. cell division)
- Cellular Components ⇒ where in the cell does the protein act? (e.g. mitochondrial inner membrane)
- Molecular Functions ⇒ protein activities (e.g. an enzymatic activity)
- Terms from these ontologies are then used to annotate gene products (i.e. making associations between the ontologies and the genes)
- GO ontologies are structured as a hierarchical directed acyclic graph (DAG) in which a term can have more than one parent and zero, one or more children. There are two types of relations between terms: “is-a” and “part-of”.
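⚠️ Ontology graphs allow for reasoning! A gene product annotated with a term is implicitly also associated with all ancestors of that term (reached via “is-a” and “part-of” relations).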
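The kind of reasoning a DAG allows can be sketched in a few lines of Python. This is a minimal sketch on a toy graph: the term names and edges below are simplified for illustration and are not a copy of the real GO, and `PARENTS` and `ancestors` are hypothetical names.

```python
# Toy fragment of a GO-like DAG: child term -> list of (relation, parent term).
# Illustrative only; real GO terms carry identifiers and many more edges.
PARENTS = {
    "mitochondrial inner membrane": [
        ("is_a", "organelle inner membrane"),
        ("part_of", "mitochondrial envelope"),
    ],
    "organelle inner membrane": [("is_a", "membrane")],
    "mitochondrial envelope": [("part_of", "mitochondrion")],
    "mitochondrion": [("is_a", "intracellular organelle")],
    "membrane": [],
    "intracellular organelle": [],
}


def ancestors(term, graph=PARENTS):
    """Return every term reachable from `term` by following parent edges."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for _relation, parent in graph.get(current, []):
            if parent not in seen:  # a term may be reachable via several paths
                seen.add(parent)
                stack.append(parent)
    return seen


# A protein annotated to the inner membrane is implicitly located in the mitochondrion.
print("mitochondrion" in ancestors("mitochondrial inner membrane"))  # True
```

Note that the term has two parents here, which is exactly what makes the structure a DAG rather than a tree.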
Gene Ontology & Annotation
- Annotation is the process of assigning GO terms to gene products. Each annotation carries an evidence code (i.e. where does the supporting evidence come from)
- Experimental evidence codes (e.g. IDA, Inferred from Direct Assay)
- Computational analysis evidence codes (e.g. ISO, Inferred from Sequence Orthology)
- Annotation based on statements of authors/curators (e.g. TAS, Traceable Author Statement)
- Automatically assigned annotations (e.g. IEA, Inferred from Electronic Annotation)
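Evidence codes are typically used to filter annotations by how trustworthy they are. The snippet below is a minimal sketch: the four codes are the ones listed above, but the gene names, the annotation tuples and the helper `filter_by_category` are hypothetical and only serve as illustration.

```python
# Evidence code -> category, limited to the codes mentioned in these notes.
EVIDENCE_CATEGORY = {
    "IDA": "experimental",
    "ISO": "computational",
    "TAS": "author/curator statement",
    "IEA": "automatic",
}

# Hypothetical annotations: (gene product, GO term, evidence code).
ANNOTATIONS = [
    ("geneA", "cell division", "IDA"),
    ("geneA", "mitochondrial inner membrane", "IEA"),
    ("geneB", "cell division", "ISO"),
    ("geneC", "enzymatic activity", "TAS"),
]


def filter_by_category(annotations, allowed):
    """Keep only annotations whose evidence code falls in an allowed category."""
    return [
        annotation
        for annotation in annotations
        if EVIDENCE_CATEGORY.get(annotation[2]) in allowed
    ]


# Keep only annotations backed by a direct experiment.
experimental_only = filter_by_category(ANNOTATIONS, {"experimental"})
print(experimental_only)
```

Dropping the "automatic" category is a common way to restrict an analysis to annotations that were reviewed by a person, at the cost of far fewer annotations.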
Crucial Databases
- Ensembl ⇒ joint project to annotate genomes
- Covers (mostly) vertebrate genomes, organized as a genome tree of the species represented in Ensembl.
- It provides a genome-centric view of an organism through:
- Genes and transcripts mapped to the genome
- Sequence variations
- Functional genomics (i.e. what does the genome code for)
- Comparative genomics
- Bioinformatics tools such as BLAST/BLAT and BioMart
- Protein Data Bank (PDB) ⇒ experimentally determined 3D structures of proteins (and other biological macromolecules), together with their sequences.
- UniProt ⇒ collection of functional information on proteins with annotations (⚠️ very different from PDB)
- UniProt is accessible through NCBI and Ensembl
⚠️ Why model organisms? ⇒ Findings in one organism are often relevant for many other organisms. The vast majority of research is performed on model organisms or cellular models, since they are easy to work with and allow for large-scale, systematic experimentation.
Sharing Information
To share information, it is crucial that several databases contain the same unambiguous and unique IDs!
- The BioMart initiative links databases with each other, which facilitates scientific collaboration and the scientific discovery process.
- BioMart is a freely available, open source, federated database system that provides unified access to disparate, geographically distributed data sources.
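Why shared, unique IDs matter can be shown with a small in-memory join. This is a minimal sketch of the principle only and does not use the BioMart API: the two "databases", the IDs of the form `GENE:000x` and the function `link_records` are all hypothetical.

```python
# Two hypothetical data sources that use the same gene IDs as keys.
genome_db = {
    "GENE:0001": {"symbol": "geneA", "chromosome": "1"},
    "GENE:0002": {"symbol": "geneB", "chromosome": "7"},
}
protein_db = {
    "GENE:0002": {"protein": "protB", "function": "kinase"},
    "GENE:0003": {"protein": "protC", "function": "transporter"},
}


def link_records(left, right):
    """Merge records from two sources for every ID present in both."""
    shared_ids = left.keys() & right.keys()
    return {gene_id: {**left[gene_id], **right[gene_id]} for gene_id in sorted(shared_ids)}


linked = link_records(genome_db, protein_db)
print(linked)
```

Only `GENE:0002` can be linked, because it is the only ID both sources agree on. If each source used its own naming scheme, the join would need an extra mapping table, which is the problem that unambiguous shared IDs avoid.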