Practical 2 - Knowledge Management & Specific Databases



Naming
  • Problem ⇒ the naming of genes and proteins is ambiguous (e.g. many names for the same protein or gene)
  • Solution ⇒ consistent, scientific naming of genes by the HUGO Gene Nomenclature Committee (HGNC), which provides unique human gene names and symbols
    • For non-human species → model organism databases serve as central repositories!
    • NCBI provides many approved gene names in its gene database
  • An ontology in CS and information science is a formal representation of a set of concepts within a domain and the relationships between those concepts.
    • Examples include the Unified Medical Language System (UMLS), The Gene Ontology and Medical Subject Headings (MeSH)
  • Gene Ontology (GO) ⇒ annotation of genes and proteins in genomic and protein databases, across all species.
    • Consists of three structured, controlled vocabularies (ontologies) that describe gene products in a species-independent manner
        1. Biological Processes ⇒ a commonly recognized series of events (e.g. cell division)
        2. Cellular Components ⇒ where in the cell does the protein act? (e.g. mitochondrial inner membrane)
        3. Molecular Functions ⇒ protein activities (e.g. an enzymatic activity)
    • Terms from these ontologies are then used to annotate gene products (i.e. making associations between the ontologies and the genes)
    • GO ontologies are structured as a hierarchical directed acyclic graph (DAG) in which a term can have more than one parent and zero, one or more children. There are two types of relations: “is-a” and “part-of”.
      • ⚠️ Important: ontology graphs allow for reasoning! (e.g. a gene product annotated with a term is implicitly also annotated with that term's parents)
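The reasoning that a DAG allows can be sketched in a few lines of Python. This is only a toy illustration: the graph below is a hand-written, simplified fragment (term names instead of real GO identifiers), and the helper names `PARENTS`, `ancestors` and `implied_annotations` are made up for this example.

```python
# Toy GO-like DAG: child term -> list of (relation, parent term).
# Simplified and hand-written for illustration; not real GO data.
PARENTS = {
    "mitochondrial inner membrane": [
        ("is-a", "organelle inner membrane"),
        ("part-of", "mitochondrial envelope"),
    ],
    "organelle inner membrane": [("is-a", "organelle membrane")],
    "mitochondrial envelope": [("part-of", "mitochondrion")],
    "organelle membrane": [("is-a", "cellular component")],
    "mitochondrion": [("is-a", "cellular component")],
    "cellular component": [],  # root: no parents
}


def ancestors(term, parents=PARENTS):
    """Return every term reachable by following parent links upward."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for _relation, parent in parents.get(current, []):
            if parent not in seen:  # a term can be reached via several paths
                seen.add(parent)
                stack.append(parent)
    return seen


def implied_annotations(term, parents=PARENTS):
    """A gene product annotated with `term` is implicitly annotated with its ancestors too."""
    return {term} | ancestors(term, parents)


if __name__ == "__main__":
    for t in sorted(implied_annotations("mitochondrial inner membrane")):
        print(t)
```

Note that "mitochondrial inner membrane" has two parents here, which is exactly what distinguishes a DAG from a plain tree.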
 
Gene Ontology & Annotation
  • Annotation is the process of assigning GO terms to gene products and is based on annotation evidence codes (i.e. where does the annotation or evidence come from)
    • Experimental evidence codes (e.g. IDA, inferred from direct assay)
    • Computational analysis evidence codes (e.g. ISO, inferred from sequence orthology)
    • Annotation based on assessments of authors/curators (e.g. TAS, traceable author statement)
    • Automatically assigned annotations (e.g. IEA, inferred from electronic annotation)
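As a rough sketch of how evidence codes can be used in practice, the snippet below groups the four example codes from the list above into their categories and filters a set of annotations by category. The gene names and annotations are invented for illustration, and `Annotation`, `EVIDENCE_CATEGORY` and `filter_by_category` are hypothetical names, not part of any GO tool.

```python
from collections import namedtuple

# One annotation = a gene product, a GO term, and the evidence code behind it.
Annotation = namedtuple("Annotation", ["gene", "go_term", "evidence"])

# Only the four example codes from the notes; real GO defines more codes.
EVIDENCE_CATEGORY = {
    "IDA": "experimental",
    "ISO": "computational",
    "TAS": "author/curator statement",
    "IEA": "automatic",
}

# Invented example data (hypothetical genes).
ANNOTATIONS = [
    Annotation("geneA", "cell division", "IDA"),
    Annotation("geneA", "mitochondrial inner membrane", "IEA"),
    Annotation("geneB", "cell division", "ISO"),
    Annotation("geneC", "enzymatic activity", "TAS"),
]


def filter_by_category(annotations, category):
    """Keep only annotations whose evidence code belongs to `category`."""
    return [a for a in annotations if EVIDENCE_CATEGORY.get(a.evidence) == category]


if __name__ == "__main__":
    for a in filter_by_category(ANNOTATIONS, "experimental"):
        print(a.gene, a.go_term, a.evidence)
```

Filtering like this is useful when only annotations with a certain level of support (e.g. experimental) should be trusted for an analysis.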

1 - Crucial Databases

  • Ensembl ⇒ joint project to annotate genomes
    • The genomes represented in Ensembl are (mostly) vertebrate and are organized in a genome tree.
    • It provides a genome-centric view of an organism through
      • Genes and transcripts mapped to the genome
      • Sequence variations
      • Functional genomics (i.e. what does the genome code for)
      • Comparative genomics
      • Bioinformatics tools such as BLAST/BLAT and BioMart
  • Protein Data Bank (PDB) ⇒ 3D structures of proteins, together with their sequences.
  • UniProt ⇒ collection of functional information on proteins with annotations (⚠️ very different from PDB)
    • UniProt is accessible through NCBI and Ensembl
 
⚠️ Why model organisms? ⇒ Findings in one organism are often relevant to many other organisms. The vast majority of research is performed on model organisms or cellular models, since they are easy to work with and allow for large-scale, systematic experimentation.
 
Sharing Information
To share information, it is crucial that several databases contain the same unambiguous and unique IDs!
  • To link databases, the BioMart Initiative facilitates scientific collaboration and the scientific discovery process.
  • BioMart is a freely available, open source, federated database system that provides unified access to disparate, geographically distributed data sources.
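Why shared, unique IDs matter can be shown with a minimal sketch: two independent tables can only be joined reliably when they use the same key. The records and IDs below are invented, and `link_by_id` is a hypothetical helper; this is not how BioMart itself is implemented or queried.

```python
# Two hypothetical data sources keyed by the same unique ID.
# All IDs and values are invented for illustration.
GENE_TABLE = {
    "ID0001": {"symbol": "GENE1", "chromosome": "1"},
    "ID0002": {"symbol": "GENE2", "chromosome": "X"},
}

PROTEIN_TABLE = {
    "ID0001": {"protein_name": "Protein one"},
    "ID0003": {"protein_name": "Protein three"},
}


def link_by_id(left, right):
    """Merge records from two sources for every ID that both sources contain."""
    shared_ids = left.keys() & right.keys()
    return {i: {**left[i], **right[i]} for i in sorted(shared_ids)}


if __name__ == "__main__":
    for record_id, record in link_by_id(GENE_TABLE, PROTEIN_TABLE).items():
        print(record_id, record)
```

Records whose ID appears in only one source (here `ID0002` and `ID0003`) cannot be linked, which is why consistent identifiers across databases are so important.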