Finding motif
GENE RELATED MOTIFS ALIREZA ALIKHANI
introduction
DNA Human Dec. 2013
What is a gene?
 Classically, a unit of inheritance. In practice, a gene is a segment of
DNA on a chromosome that encodes a protein, together with the
regulatory sequences (such as the promoter) required to control
expression of that protein.
What is a motif?
 A conserved element of a protein sequence alignment that usually
correlates with a particular function. Motifs are generated from a
local multiple protein sequence alignment corresponding to a
region whose function or structure is known. It is sufficient that the
region is conserved: a conserved region is likely to be predictive of
any subsequent occurrence of such a structural or functional region
in any other novel protein sequence.
 Motifs are also described as patterns that can be used to identify a gene.
Defining Motifs
 To define a motif, let's say we know where the motif starts
in each sequence
 The motif start positions in their sequences can be
represented as s = (s1, s2, s3, …, st)
Defining Motifs
 Line up the patterns
by their start indexes
s = (s1, s2, …, st)
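As a minimal sketch of lining up the patterns, the Python snippet below cuts one l-mer out of each sequence at its start index and counts the nucleotides in each column. The sequences, start positions, and function names are made up for illustration, and the indexes are 0-based here, whereas the slides count from 1.

```python
from collections import Counter

def align_lmers(dna, starts, l):
    """Cut the l-mer beginning at starts[i] out of dna[i] (0-based starts)."""
    return [seq[s:s + l] for seq, s in zip(dna, starts)]

def profile(alignment):
    """Count each nucleotide in every column of the aligned l-mers."""
    return [Counter(column) for column in zip(*alignment)]

# Hypothetical toy input: the pattern "acgt" sits at a different offset in each sequence.
dna = ["ttacgtgg", "acgtcccc", "ggggacgt"]
starts = [2, 0, 4]

alignment = align_lmers(dna, starts, 4)
columns = profile(alignment)
for row in alignment:
    print(row)
```

With the correct start positions every row of the alignment shows the same l-mer, so each column count is concentrated on a single nucleotide.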
How to find a motif?
 Fddfdd gggatactcttgtgaatggatttttaactga cacattagata
 fddfdd gggatactgataccgtatttggcctaggc cacattagata
 Fddfdd gggatactgataccgtatttggcctaggc cacattagata
 fddfdd gactgataccgtaccgtatttggcctaggc cacattagata
……………………………………………………………………………….....
…………………………………………………………………………………..
 fddfdd gggatactgataccgtatttggcctaggc cacattagata
 Fddfdd gggatactgataccgtatttggcctaggc cacattagata
 fddfdd gggatactgatacccttgtgaatggattgc cacattagata
 fddfdd gggatactgataccgtgatacccctaggc cacattagata
 fddfdd gggatactgataccgatactgcctaggc cacattagata
 fddfdd gggatactgataccgtatttggcctaggc cacattagata
Finding Motif Algorithms
Exact Match
Non Exact Match
Non Exact Match motif: acgtacgt
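To make the difference between exact and non-exact matching concrete, here is a small sketch that scans a sequence for the motif acgtacgt and accepts a window when it differs in at most a given number of positions. The test sequence and the function name `find_matches` are hypothetical, and the naive window scan is only one possible approach.

```python
def find_matches(text, motif, max_mismatches=0):
    """Return (index, window, mismatches) for every window close enough to motif."""
    hits = []
    for i in range(len(text) - len(motif) + 1):
        window = text[i:i + len(motif)]
        mismatches = sum(a != b for a, b in zip(window, motif))
        if mismatches <= max_mismatches:
            hits.append((i, window, mismatches))
    return hits

# Hypothetical sequence holding one exact copy and one copy with a single mutation.
text = "ttacgtacgtccacgtaagtgg"
exact = find_matches(text, "acgtacgt")
approx = find_matches(text, "acgtacgt", max_mismatches=1)
```

Exact matching reports only the unmutated copy, while allowing one mismatch also recovers the mutated copy acgtaagt.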
Distance with real motifs
 The distance between a real motif and the consensus sequence is
generally less than the distance between two real motifs
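A short sketch can illustrate this claim using Hamming distance (the number of positions at which two equal-length strings differ). The three motif instances below are invented for illustration: each is the motif acgtacgt with two mutated positions.

```python
from collections import Counter

def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def consensus(instances):
    """Most frequent letter in each column of the aligned instances."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*instances))

# Hypothetical instances, each two mutations away from acgtacgt.
instances = ["ccgtacgg", "aggtactt", "acatccgt"]

cons = consensus(instances)
to_consensus = [hamming(cons, m) for m in instances]
pairwise = [hamming(a, b)
            for i, a in enumerate(instances)
            for b in instances[i + 1:]]
```

Here every instance is at distance 2 from the consensus, while any two instances are at distance 4 from each other, because their mutations fall at different positions.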
Types of Algorithms
 EPatternBranching
 PatternBranching
 PMS: Exhaustive Motif Search
 PMS1, PMS2, …
 PMSP
 Search Trees
 Projection
 The Gold Bug Problem
 EM
 Brute Force Motif Finding
 The Median String Problem
 Branch-and-Bound Motif Search
 Branch-and-Bound Median String Search
 Consensus and Pattern Branching: Greedy Motif Search
 Boyer-Moore
 Knuth-Morris-Pratt
 Suffix Array Construction
Defining Some Terms
 t : number of sample DNA sequences
 n : length of each DNA sequence
 DNA : sample of DNA sequences (t × n array)
 l : length of the motif (l-mer)
 si : starting position of an l-mer in sequence i
 s = (s1, s2, …, st) : array of the motif's starting positions
Parameters
Score function
The Motif Finding Problem
 The starting positions s are usually not given. How
can we find the “best” profile matrix?
The Motif Finding Problem: Formulation
 Goal: Given a set of DNA sequences, find a set of l-mers, one
from each sequence, that maximizes the consensus score
 Input: A t × n matrix of DNA, and l, the length of the pattern to
find
 Output: An array of t starting positions s = (s1, s2, …, st)
maximizing Score(s, DNA)
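The sketch below shows one way Score(s, DNA) could be computed, assuming it is the consensus score named in the Goal: the sum, over the columns of the aligned l-mers, of the count of the most frequent nucleotide. That definition is an assumption here, the toy sequences are hypothetical, and start positions are 0-based rather than 1-based.

```python
from collections import Counter

def score(starts, dna, l):
    """Consensus score: per column, count the most common letter, then sum."""
    lmers = [seq[s:s + l] for seq, s in zip(dna, starts)]
    return sum(Counter(col).most_common(1)[0][1] for col in zip(*lmers))

# Hypothetical input with "acgt" at offsets 2, 0, and 4.
dna = ["ttacgtgg", "acgtcccc", "ggggacgt"]

best = score([2, 0, 4], dna, 4)   # l-mers line up perfectly
worse = score([0, 0, 0], dna, 4)  # l-mers "ttac", "acgt", "gggg"
```

A perfect alignment of t sequences reaches the maximum possible score of l × t (12 here), while poorly chosen start positions score much lower.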
Brute Force Motif Finding Solution
 Compute the scores for each possible combination of
starting positions s
 The best score will determine the best profile and the
consensus pattern in DNA
 The goal is to maximize Score(s, DNA) by varying the
starting positions si, where:
si ∈ {1, …, n – l + 1}
i ∈ {1, …, t}
Brute Force Search Algorithm
BruteForceMotifSearch(DNA, t, n, l)
  bestScore ← 0
  for each s = (s1, s2, …, st) from (1, 1, …, 1) to (n-l+1, …, n-l+1)
    if Score(s, DNA) > bestScore
      bestScore ← Score(s, DNA)
      bestMotif ← (s1, s2, …, st)
  return bestMotif
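The pseudocode above can be sketched in Python as follows. This is an illustrative version rather than a reference implementation: it assumes Score is the consensus score, uses 0-based start positions, and the toy sequences are invented. It is only usable on tiny inputs.

```python
from collections import Counter
from itertools import product

def score(starts, dna, l):
    """Consensus score (assumed definition): sum of per-column majority counts."""
    lmers = [seq[s:s + l] for seq, s in zip(dna, starts)]
    return sum(Counter(col).most_common(1)[0][1] for col in zip(*lmers))

def brute_force_motif_search(dna, l):
    """Try every combination of start positions and keep the best-scoring one."""
    best_score, best_motif = 0, None
    ranges = [range(len(seq) - l + 1) for seq in dna]
    for starts in product(*ranges):
        current = score(starts, dna, l)
        if current > best_score:
            best_score, best_motif = current, starts
    return best_motif, best_score

# Hypothetical input with "acgt" at offsets 2, 0, and 4.
dna = ["ttacgtgg", "acgtcccc", "ggggacgt"]
motif, value = brute_force_motif_search(dna, 4)
```

`itertools.product` enumerates the start-position tuples in the same order as the loop in the pseudocode, from all-first positions to all-last positions.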
Brute Force features
 No preprocessing phase.
 Constant extra space needed.
 Always shifts the window by exactly 1 position to the right.
 Comparisons can be done in any order.
 There are (n – l + 1)^t possible sets of starting positions, and for each
set the scoring function makes l operations, so the complexity is
l (n – l + 1)^t = O(l n^t).
 That means that for t = 8, n = 1000, l = 10 we must perform
approximately 10 × 991^8 ≈ 10^25 computations, which is far too many to be practical.
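The operation count can be checked by evaluating the formula l (n – l + 1)^t directly. The helper name below is hypothetical, and the count is the formula's estimate, not a measured running time.

```python
def brute_force_operations(t, n, l):
    """Evaluate l * (n - l + 1)^t, the operation count of the brute force search."""
    return l * (n - l + 1) ** t

ops = brute_force_operations(t=8, n=1000, l=10)
print(f"{ops:.2e}")
```

Because the exponent is t, adding a single sequence multiplies the work by roughly n, which is why the exhaustive approach stops being usable so quickly.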
PMS (Planted Motif Search)
 Generate all possible l-mers from each input
sequence Si. Let Ci be the collection of these
l-mers.
 Example:
Si = AAGTCAGGAGT
Ci = 3-mers:
AAG-AGT-GTC-TCA-CAG-AGG-GGA-GAG-AGT
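The l-mer collection from the example can be reproduced with a sliding window. The function name is illustrative only.

```python
def lmers(sequence, l):
    """All length-l substrings of sequence, in order of their start position."""
    return [sequence[i:i + l] for i in range(len(sequence) - l + 1)]

c_i = lmers("AAGTCAGGAGT", 3)
print("-".join(c_i))
```

A sequence of length n yields n – l + 1 l-mers, so the 11-letter example gives 9 3-mers, with AGT appearing twice.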
All patterns at Hamming distance d = 1
Sort the lists
Eliminate duplicates
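These three steps can be sketched as follows for the example sequence AAGTCAGGAGT with l = 3 and d = 1. The sketch assumes that each l-mer's list holds the l-mer itself plus every string at Hamming distance at most d from it; the function names are hypothetical.

```python
from itertools import combinations, product

ALPHABET = "ACGT"

def neighbors(pattern, d):
    """All strings at Hamming distance at most d from pattern (pattern included)."""
    result = {pattern}
    for k in range(1, d + 1):
        for positions in combinations(range(len(pattern)), k):
            # For each chosen position, the letters that differ from the original.
            choices = [[c for c in ALPHABET if c != pattern[p]] for p in positions]
            for letters in product(*choices):
                variant = list(pattern)
                for p, c in zip(positions, letters):
                    variant[p] = c
                result.add("".join(variant))
    return result

def variant_list(sequence, l, d):
    """Generate variants of every l-mer, then sort and drop duplicates."""
    all_variants = []
    for i in range(len(sequence) - l + 1):
        all_variants.extend(neighbors(sequence[i:i + l], d))
    total = len(all_variants)          # count before duplicates are removed
    return sorted(set(all_variants)), total

l_i, total = variant_list("AAGTCAGGAGT", 3, 1)
```

Each 3-mer has 10 patterns within distance 1 (itself plus 3 positions × 3 alternative letters), so the 9 3-mers of the example produce 90 patterns before duplicates are eliminated.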
Find motif common to all lists
 Follow this procedure for all sequences
 Find the motif common to all Li, where Li is the sorted variant
list of sequence Si (once duplicates have been eliminated)
 This is the planted motif
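Putting the steps together, a compact sketch of the whole procedure might look like this. It uses Python sets in place of the sort-and-merge step, which gives the same intersection but is not how the lists are handled in the slides. The sequences are hypothetical: each contains GATTACA with one mutated position.

```python
def neighbors(pattern, d, alphabet="ACGT"):
    """Strings within Hamming distance d of pattern, built recursively."""
    if d == 0 or not pattern:
        return {pattern}
    keep_first = {pattern[0] + rest for rest in neighbors(pattern[1:], d, alphabet)}
    change_first = {c + rest
                    for c in alphabet if c != pattern[0]
                    for rest in neighbors(pattern[1:], d - 1, alphabet)}
    return keep_first | change_first

def planted_motif_search(sequences, l, d):
    """Intersect the variant sets of all sequences; survivors are motif candidates."""
    common = None
    for seq in sequences:
        l_i = set()
        for i in range(len(seq) - l + 1):
            l_i |= neighbors(seq[i:i + l], d)
        common = l_i if common is None else common & l_i
    return sorted(common or [])

# Hypothetical input: GATTGCA, GATCACA, and GTTTACA are each one mutation from GATTACA.
sequences = ["CCGATTGCACC", "TGGATCACATG", "AAGTTTACAAA"]
candidates = planted_motif_search(sequences, l=7, d=1)
```

The result can contain more than one candidate when l is short or d is large relative to l, so the output is best read as a list of candidate motifs.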
PMS Algorithm
PMS Running Time
 It takes time to
 Generate variants
 Sort lists
 Find and eliminate duplicates
 Running time of this algorithm:
w is the word length of the computer
m is the total number of patterns; in the previous example, m = 90
