Practical 1 - Sequence Databases & Data Formats



  • Bioinformatics can be seen as an iterative cycle
    • Hypotheses → Formulated from knowledge (e.g. publications, further hypothesis generation, computational reasoning and modelling)
    • Experiments → Follow from an experimental design driven by the hypotheses
    • Data (Processing & Structuring) → Occurs after experiments have been conducted. This involves databases, data sharing and retrieval
    • Data Analysis → Using algorithms, statistics, data integration, etc., the shared/retrieved/obtained data can be analyzed
    • Knowledge → Insights gained during data analysis become new knowledge, which feeds the next round of hypotheses
  • Sharing data for biology is important ⇒ public databases play a central role!
    • GenBank (now searchable through NCBI Nucleotide) ⇒ One of the first public sequence databases
      • Sequence Database
        A sequence database is a type of database that stores biological sequence data, such as DNA, RNA, or protein sequences. These databases are essential tools for bioinformatics research, allowing scientists to store, retrieve, and analyze large amounts of sequence information.
    • Sequence Read Archive (SRA) ⇒ Primary public archive (repository) for next-gen sequencing data. It combines data from NCBI, EBI and DDBJ, whose archives are now synchronized.
 

1 - Databases

  • Databases exist in many different forms:
    • In its simplest form, a database is just a table containing data → flat file databases
      • First column contains indices for fast retrieval
      • Repetitive structure
    • Currently the most common type is a relational database.
      • Information is split over different linked tables
      • No duplication (i.e. advantage compared to flat file DB)
      • More flexibility and accessibility for complex queries
    • There also exist other types such as object-oriented databases.
  • Purpose ⇒ Store data efficiently, retrieve it quickly and integrate data access with the functional requirements of computer programs
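The flat file idea described above can be sketched in a few lines. This is only an illustration under stated assumptions: the table layout, the IDs (SEQ1, SEQ2, SEQ3) and the length values are made up, and the "index" is simply the first column used as a dictionary key.

```python
import csv
import io

# Hypothetical flat file database: one table, first column is the index.
# The organism name is repeated on every row (the "repetitive structure").
# All IDs and length values are made up for illustration.
FLAT_FILE = (
    "id\torganism\tgene\tlength\n"
    "SEQ1\tHomo sapiens\tTP53\t1182\n"
    "SEQ2\tHomo sapiens\tBRCA1\t5592\n"
    "SEQ3\tMus musculus\tTrp53\t1173\n"
)


def load_flat_file(text):
    """Read the tab-separated table and key every record by its first column."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return {row["id"]: row for row in reader}


records = load_flat_file(FLAT_FILE)

# Retrieval through the index column is a single dictionary lookup.
print(records["SEQ2"]["gene"])

# Any question about another column needs a scan over all rows.
human_genes = [r["gene"] for r in records.values() if r["organism"] == "Homo sapiens"]
print(human_genes)
```

Note how "Homo sapiens" is stored twice. Splitting organisms into their own table and linking to them by key is what a relational design does to avoid that duplication.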
 
Relational Databases
  • General Properties
    • Data is distributed over a collection of tables
    • Each table has rows (called instances/records/tuples) and columns (called fields/attributes), and each table has a primary ID or key
    • Data can be of different datatypes (e.g. integer, bool, string, BLOB = binary large object)
    • A good DB is normalized (i.e. duplicated data, non-atomic values, etc. are removed)
    • Querying through e.g. SQL (structured query language)
  • Tables
    • Each bit of information (attribute) is located in its own column
    • Every row has a unique identifier key called the primary key (PK)
    • Columns can contain primary keys for other tables (i.e. this allows linkage between tables)
  • Relationship
    • Tables can be linked if a column in one table contains the PK of another table ⇒ relationship between two tables
    • A column that holds the PK of another table is called a foreign key (FK)
    • A relationship has a cardinality
      • One-to-one ⇒ every row in a table will link to only one row in another table
      • One-to-many ⇒ a row of table A links to many rows of table B and a single row in table B refers to a single row in table A
      • Many-to-many ⇒ A row in table A links to many rows in table B and a row in table B links to many rows in table A
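The table, key and cardinality concepts above can be tried out with Python's built-in sqlite3 module. This is a minimal sketch, not a recommended schema: the table names (organism, gene, pathway, gene_pathway) and all rows are assumptions chosen for illustration. organism → gene is one-to-many, and gene ↔ pathway is many-to-many through a linking table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when asked

# Hypothetical schema: every table has a primary key (PK);
# foreign keys (FK) hold the PK of another table.
conn.executescript(
    """
    CREATE TABLE organism (
        organism_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL
    );
    CREATE TABLE gene (
        gene_id     INTEGER PRIMARY KEY,
        symbol      TEXT NOT NULL,
        organism_id INTEGER NOT NULL REFERENCES organism(organism_id)
    );
    CREATE TABLE pathway (
        pathway_id INTEGER PRIMARY KEY,
        name       TEXT NOT NULL
    );
    -- Linking table: turns many-to-many into two one-to-many relationships.
    CREATE TABLE gene_pathway (
        gene_id    INTEGER NOT NULL REFERENCES gene(gene_id),
        pathway_id INTEGER NOT NULL REFERENCES pathway(pathway_id),
        PRIMARY KEY (gene_id, pathway_id)
    );
    """
)

# Illustrative rows only.
conn.executemany("INSERT INTO organism VALUES (?, ?)",
                 [(1, "Homo sapiens"), (2, "Mus musculus")])
conn.executemany("INSERT INTO gene VALUES (?, ?, ?)",
                 [(10, "TP53", 1), (11, "BRCA1", 1), (12, "Trp53", 2)])
conn.executemany("INSERT INTO pathway VALUES (?, ?)",
                 [(100, "DNA repair"), (101, "Apoptosis")])
conn.executemany("INSERT INTO gene_pathway VALUES (?, ?)",
                 [(10, 100), (10, 101), (11, 100)])

# One-to-many: each organism name is stored once and joined in via the FK.
genes_per_organism = conn.execute(
    """
    SELECT o.name, g.symbol
    FROM organism AS o
    JOIN gene AS g ON g.organism_id = o.organism_id
    ORDER BY o.name, g.symbol
    """
).fetchall()

# Many-to-many: two joins through the linking table.
genes_in_dna_repair = conn.execute(
    """
    SELECT g.symbol
    FROM gene AS g
    JOIN gene_pathway AS gp ON gp.gene_id = g.gene_id
    JOIN pathway AS p ON p.pathway_id = gp.pathway_id
    WHERE p.name = ?
    ORDER BY g.symbol
    """,
    ("DNA repair",),
).fetchall()

# A gene pointing at a non-existent organism violates the FK and is rejected.
try:
    conn.execute("INSERT INTO gene VALUES (13, 'ORPHAN', 999)")
    fk_enforced = False
except sqlite3.IntegrityError:
    fk_enforced = True

print(genes_per_organism, genes_in_dna_repair, fk_enforced)
```

The second query already needs two joins for a simple question, which hints at why complex data can become expensive to query in this model.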
(Dis)Advantages
  • ✅ Standardized model ⇒ transparent design
  • ✅ Separation of design and data storage
  • ✅ Efficient querying
 
  • ❌ Complex data is inefficiently stored and queried (e.g. requires many joins on tables, BLOBs are not efficient, etc)
 
Database Types by Content
  • Primary ⇒ Direct experimental results with factual annotations
    • Example: INSDC (sequence database)
  • Secondary ⇒ Annotations that interpret and process primary data sources
    • Example: RefSeq
  • Integrated databases ⇒ Databases that report many sources of information
    • Example:
      • Gene (NCBI) for gene centric data
      • UniProt for protein centric data
       
  • RefSeq
    • Secondary database derived from GenBank whose goal is to provide a reference sequence for each molecule in the central dogma (DNA, mRNA, protein)
    • Each RefSeq entry represents a single, naturally occurring molecule from one organism and is only represented once (i.e. no duplication)
  • NCBI
    • USA-based resource for molecular biology information
    • Expanded to host a large set of databases including
      • OMIM → Online Mendelian Inheritance in Man
      • MMDB → Molecular Modeling DB of 3D protein structures
      • UniGene
  • Gene
    • Integrated DB that combines information around a genome locus
    • Linked to other DB such as PubMed Central
 
File Formats
  • Fasta
    • “>” indicates a new sequence entry
    • Each entry starts with a header line for metadata, terminated by an EOL character (e.g. “\n”); the sequence follows on the next line(s)
  • GenBank
    • Richer annotated format compared to Fasta
    • Contains numerous informative fields (e.g. locus, definition, accession, version, source organism, etc)
    • Also contains features on the sequence
    • Finally, also contains the sequence in a viewable format
  • SAM
    • = Sequence Alignment Map
    • TAB-delimited text format with a header section (optional) and alignment section
    • It stores the alignment of reads (i.e. sequenced DNA/RNA fragments) to a reference genome sequence
    • TAKE A LOOK AT CIGAR STRINGS AND QUAL before exam!!!
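As a revision aid for the SAM points above, here is a small sketch that splits one alignment line into its 11 mandatory TAB-delimited columns, decodes the CIGAR string and converts QUAL to Phred scores. The read name, reference name and quality string are made up, and the helper names (parse_alignment, parse_cigar, etc.) are hypothetical. The sketch assumes the usual Phred+33 encoding for QUAL and is not a replacement for a real SAM library.

```python
import re

# The 11 mandatory SAM columns, in order.
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")
CONSUMES_REFERENCE = set("MDN=X")  # operations that advance along the reference
CONSUMES_QUERY = set("MIS=X")      # operations that advance along the read


def parse_alignment(line):
    """Split one alignment line into a dict of the mandatory fields."""
    values = line.rstrip("\n").split("\t")
    record = dict(zip(SAM_FIELDS, values[:11]))
    for key in ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN"):
        record[key] = int(record[key])
    return record


def parse_cigar(cigar):
    """Turn e.g. '8M2I4M1D3M' into [(8, 'M'), (2, 'I'), ...]."""
    return [(int(n), op) for n, op in CIGAR_RE.findall(cigar)]


def reference_span(ops):
    """Number of reference bases covered by the alignment."""
    return sum(n for n, op in ops if op in CONSUMES_REFERENCE)


def query_length(ops):
    """Number of read bases described by the CIGAR (should equal len(SEQ))."""
    return sum(n for n, op in ops if op in CONSUMES_QUERY)


def phred_scores(qual):
    """Decode QUAL assuming Phred+33: Q = ord(char) - 33; '*' means not stored."""
    if qual == "*":
        return []
    return [ord(c) - 33 for c in qual]


# Made-up alignment line (17-base read, 15 high-quality bases then two poor ones).
line = "\t".join(["read1", "0", "chrHypothetical", "7", "30", "8M2I4M1D3M",
                  "*", "0", "0", "TTAGATAAAGGATACTG", "I" * 15 + "5#"])

rec = parse_alignment(line)
ops = parse_cigar(rec["CIGAR"])
scores = phred_scores(rec["QUAL"])
end_pos = rec["POS"] + reference_span(ops) - 1  # last aligned reference position (1-based)

print(ops)
print(reference_span(ops), query_length(ops), end_pos)
print(scores)
```

Reading the CIGAR here: 8 matches/mismatches, 2 bases inserted in the read, 4 matches/mismatches, 1 reference base deleted from the read, 3 matches/mismatches. A Phred score Q corresponds to an error probability of 10^(-Q/10), so Q = 40 means 1 in 10,000.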
 
Data Exchange
Data is often exchanged through “flat” text files or XML formats, which impose a somewhat systematic structure. This has many downsides, however. Binary formats are an alternative: BAM files, for example, hold the many short reads produced by next-gen DNA sequencers and are a compressed binary version of the text-based SAM format.
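To make the flat text exchange idea concrete, the following sketch reads the Fasta layout from the File Formats section: a “>” header line followed by one or more sequence lines. The function name read_fasta and both records are hypothetical, and the parser is deliberately minimal (no validation of sequence characters).

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from a Fasta-formatted text handle."""
    header = None
    chunks = []
    for line in handle:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new ">" line closes the previous entry, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            # Sequence lines are concatenated until the next header.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Made-up example file content.
EXAMPLE = (
    ">seq1 hypothetical example record\n"
    "ATGGCGTTA\n"
    "GGCTAA\n"
    ">seq2 another made-up record\n"
    "ATGAAATAG\n"
)

entries = list(read_fasta(io.StringIO(EXAMPLE)))
for header, sequence in entries:
    print(header.split()[0], len(sequence))
```

The parser has to infer structure from line prefixes alone, which is one reason plain text formats are easy to produce but fragile to consume.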