Practical 3 - Biological Sequence Analysis
3️⃣

Practical 3 - Biological Sequence Analysis


  • Similarity between organisms can be explained either by
    • Homology ⇒ similarity explained by shared ancestry
      • A simple example is the similar bone structure in vertebrates.
    • Analogy ⇒ similarity not based on shared ancestry
  • Homology can be seen as similarity in sequence due to shared ancestry and not due to chance or convergence!
    • Orthology ⇒ Genes that descended from the same ancestral gene separated by a speciation event.
      • Speciations are mutations and adaptations that drive sub-populations of one species into sexual incompatibility.
    • Paralogy ⇒ Genes that arose from a duplication event within a genome.
  • Why is sequence homology important?
    • Conservation arises from evolutionary pressure → conserved regions (in a sequence) are functionally important
    • Gives information on evolutionary paths (e.g. phylogeny)
    • Functions can be inferred because of homology (e.g. DNA binding domain)
    • It allows the principled use of a simple model organism to study a homologous gene or process

Principles of Sequence Alignment

  • Sequence Alignment = Difficult Problem
    • Sequences can change in many ways during evolution (e.g. point mutations, insertions, deletions and rearrangements). These changes can hide the conservation of properties of the DNA or protein.
  • Alignment = searching for identifying similar regions in sequences to maximize their similarity (the similarity needs to be quantified ⇒ scoring)
    • How to compare biological sequences? ⇒ Nucleotide or protein sequences are typically represented with a single letter nucleotide or AA coding respectively.
    • Not only identical matches are positively scored
    • Necessary to allow for insertions and deletions
  • Example → What determines if (similar) amino-acids can replace each other in a protein?
    • The residue charge
    • Size of the AA
    • Hydrophobicity
    • Turning of the AA-chain (i.e. proline breaks -helix as its rigid side group kinks the helix backbone)
    • Capacity to form ion and disulfide bonds (and/or hydrogen bonds)
    • Aromatic stacking
    • Specific roles in active sites
  • For scoring alignment between two elements (i.e. nucleotide or amino acid) a substitution matrix is used, which is either obtained through a set of confirmed sequence alignments where the frequencies of the elements and their substitution score are computed.
    • Let be the set of frequencies of the elements. These are obtained by estimating the probability of observing a certain element in a sequence. In other words, the probability of observing a certain element is given by
      • where can intuitively be seen as the total number of elements in all observed sequences.
    • Let be the mutation probability matrix. If is the number of substitutions occurrences where element was substituted for element , then the probability that this mutation occurs, is given by
    • For randomly aligned sequences, the probability of seeing the two sequences aligned in a particular way is given by
      • where is the length of the aligned sequences.
    • Assuming substitution, the probability that the two sequences share a common ancestor where some elements are mutated is given by
    • The substitution score is given by the the substitution probability, divided by the probabilities of each element occurring. This likelihood ratio is then used as a score between two different elements and .
      • Often the log is taken since it improves numerical stability.
  • BLOSUM-L matrix (BLOcks SUbstitution Matrix)
    • Forms a basis for ungapped local alignment of protein families.
    • Groups sequences with more than identical amino acids to compute the substitution scores . Specifically, the substitution matrix is derived from the observed substitution frequency of elements between different groups, while correcting for group size. The substitution matrix then receives the integer closest to as its score for the elements and where
  • PAM matrix (Point Accepted Mutation)
    • Substitution matrix of an amino acid tolerated by natural selection as the protein function does not change.
      • A PAM-unit is the time in which 1% of the AA’s in a protein sequence undergo a PAM.
      • The mutation matrix MUT-1 is constructed such that the entry represents the probability of the -th AA mutating into the -th AA given the passage of 1 PAM unit of time.
        • where is the total number of AA’s in the alignment and is a scaling factor to force the 1% definition.
      • In this model, longer mutation times are considered as a simple probabilistic process (i.e. Markov-chain) such that .
      • The -entry in the PAM- matrix is defined as
  • How is the matrix chosen?
    • ⇒ Depends on the problem to be solved.
    • ⇒ The expected evolutionary time is important (e.g. longer time requires lower BLOSUM numbers due to natural selection and evolutionary changes).
    • Some specialist matrices exist
  • Gaps in an alignment can be used to compensate for insertions and deletions between sequences.
    • A gap penalty serves to keep gaps at a reasonable number. There often is a cost for opening a gap and a cost for the extension or size of the gap.
  • Finally, the full alignment score between two sequences where and are two sequences can mathematically be expressed as
    • where gives a defined gap score.

Sequence Alignment Algorithms

  • Naïve Search for optimal alignment is computationally intractable
    • There is a combinatorial explosion!
    • The number of options to align two sequences of length and is given by
      • To give an idea
      • Sequences of length 5 ⇒ alignments
      • Sequences of length 100 ⇒ alignments
      • Sequences of length 1000 ⇒ alignments
  • Alignment Algorithms use dynamic programming to find an alignment with the highest score:
    • Smith-Waterman for local alignment
    • Needleman-Wunsch for global alignment
    • ⇒ Dynamic programming is a general method for optimization where
    • The original problem is split into repeated subproblems
    • Each subproblem is solved one at a time and a partial solution is stored
    • The partial solutions are combined to an overall solution
    • The main advantage of dynamic programming is that it is efficient to solve just a series of small problems. Although that for some cases optimal solutions may be guaranteed, in general local solutions may not lead to global maxima (i.e. dynamic programming methods are greedy).
       
      In the case for sequence alignment, dynamic programming is used to split the problem into simple decisions on which path to follow.
    • The global score is a sum of the aligned individual positions
    • Contributions to overall score of these positions are independent
  • Needleman-Wunsch (⇒ Global Alignment → i.e. finds the globally optimal alignment for two sequences given a scoring function)
    • ⚠️ Optimal = highest score ⇒ gap penalties + positive and negative values from the substitution matrix
      1. A matrix with columns and rows representing the sequence (and the cells representing possible paths) is filled using a recurrence relation
      1. Trace-back step on this matrix finds the best alignment (arrows are kept from where the score was computed). The trace can be found by following back the final element (last comparison, i.e. if and are the lengths of the sequences.
    • Intuitively, the first row and column are always decreasing with the gap-penalty, since only deletions or insertions take place if only going vertical or horizontal in the matrix respectively.
  • Smith-Waterman (⇒ Local alignment → i.e. finds regions of optimal alignment)
    • Smith-Waterman updates the recurrence relation of Needleman-Wunsch as follows:
    • The trace-back step now finds the maximum value in the matrix (or set of most maximum values for the best alignments) and traces back until 0 is reached.
    • The first row/column are initialized with 0!
  • ⚠️ Global alignment IS NOT local alignment!
    • Global alignment ⇒ finds the best alignment across the whole two sequences
    • Local alignment ⇒ finds regions of high similarity in parts of the sequences
  • Multiple Sequence Alignment ⇒ aligns more than two sequences
    • Generally a more reliable assessment of similarity than two aligned sequences (due to more information on the role of individual positions and protein parts)
    • In general, alignments can be improved by incorporating additional information such as
      • Known structural or functional features that can guide alignment (e.g. alpha-helices, beta-sheets, loops, …)

Searching Databases

  • Large databases with DNA, RNA and protein sequences are available (e.g. UniProt, NCBI, etc…)
  • Goal ⇒ If there is a new sequence, can we find similar sequences?
  • BLAST
    • Basic Local Alignment Search Tool
    • Efficiently calculates similarity for biological sequences
      • Produces local alignments, i.e. only a portion of each sequence must be aligned!
      • It is a heuristic method ⇒ guaranteed optimality is traded for efficiency.
      • Uses statistical theory to determine if a match might have occurred by chance.
    • The algorithm works as follows
        1. Preprocessing → Determines the positions of all -mers in the sequences the query is compared to (i.e. they are indexed for fast recovery)
        1. Identification of High-scoring Pairs (HSP) → BLAST identifies pairs of hits, which are located on the same diagonal within a defined spatial distance. The pair of hits is elongated to an HSP. If the alignment score (i.e. local alignment) of a HSP exceeds a certain threshold, an elongation with gaps is started.
        1. Elongation with gaps → Starting from the HSP seed, BLAST extends the alignment via dynamic programming in both directions. The elongation is stopped if the score value falls below a certain threshold.
        1. Output → If the -value (i.e. expected number of hits given the database size) of a comparison (derived from the score) is lower than a predefined threshold, the alignment is added to the output list.
    • Blast exists for
      • Protein-protein comparisons
      • Nucleotide-nucleotide comparisons
      • A bunch others…
    • Look at PSI-BLAST