Finding motif

Gene Related Motifs
Alireza Alikhani

Introduction
DNA Human Dec. 2013

What is a gene?
 Classically, a gene is a unit of inheritance. In practice, it is a segment of DNA on a chromosome that encodes a protein, together with the regulatory sequences (such as the promoter) required to control expression of that protein.

What is a motif?
 A motif is a conserved element of a protein sequence alignment that usually correlates with a particular function. Motifs are generated from a local multiple protein sequence alignment corresponding to a region whose function or structure is known. Because the region is conserved, the motif is likely to be predictive of any subsequent occurrence of such a structural or functional region in a novel protein sequence.
 Motifs can also be viewed as patterns that can be used to identify a gene

Defining Motifs
 To define a motif, let's say we know where the motif starts in each sequence
 The motif start positions in their sequences can be represented as s = (s1, s2, s3, …, st)

Defining Motifs
 Line up the patterns by their start indexes s = (s1, s2, …, st)
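Lining up the patterns by their start indexes can be sketched in a few lines of Python. This is only an illustration: the three sequences, the start positions, and the helper name extract_lmers are hypothetical and do not come from the slides. Start positions are 1-based, as in s = (s1, s2, …, st).

```python
def extract_lmers(dna, starts, l):
    """Return the l-mer that begins at each 1-based start position."""
    return [seq[s - 1 : s - 1 + l] for seq, s in zip(dna, starts)]

# Hypothetical toy data: the same 8-mer placed at different offsets.
dna = [
    "ttacgtacgtgg",
    "acgtacgtccaa",
    "ggggacgtacgt",
]
starts = (3, 1, 5)  # s = (s1, s2, s3)

# Printing the l-mers one per line shows them lined up column by column.
for lmer in extract_lmers(dna, starts, 8):
    print(lmer)
```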

How to find a motif?
 fddfdd gggatactcttgtgaatggatttttaactga cacattagata
 fddfdd gggatactgataccgtatttggcctaggc cacattagata
 fddfdd gggatactgataccgtatttggcctaggc cacattagata
 fddfdd gactgataccgtaccgtatttggcctaggc cacattagata
…
 fddfdd gggatactgataccgtatttggcctaggc cacattagata
 fddfdd gggatactgatacccttgtgaatggattgc cacattagata
 fddfdd gggatactgataccgtgatacccctaggc cacattagata
 fddfdd gggatactgataccgatactgcctaggc cacattagata

Finding Motif Algorithms
 Exact Match
 Non Exact Match

Non Exact Match
 motif: acgtacgt
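A non-exact match can be illustrated with a small Python sketch that reports every window of a text within a given number of mismatches of the motif acgtacgt. The text string, the mismatch limit d, and the function names are assumptions made for this example; the slides do not specify them.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def approximate_occurrences(text, motif, d):
    """1-based start positions where motif matches text with at most d mismatches."""
    l = len(motif)
    return [
        i + 1
        for i in range(len(text) - l + 1)
        if hamming(text[i:i + l], motif) <= d
    ]

# Hypothetical text: one exact copy of the motif and one copy with a substitution.
text = "ttacgtacgtggacgaacgtcc"
print(approximate_occurrences(text, "acgtacgt", 0))  # exact match only
print(approximate_occurrences(text, "acgtacgt", 1))  # allow one mismatch
```

With d = 0 this reduces to exact matching; raising d admits mutated copies of the motif.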

Distance with real motifs
 The distance between a real motif and the consensus sequence is
generally less than the distance between two real motifs
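The claim can be checked on a toy example. The sketch below assumes Hamming distance as the distance measure and uses four hypothetical motif instances, each derived from acgtacgt by two substitutions; none of this data comes from the slides.

```python
from collections import Counter
from itertools import combinations

def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def consensus(instances):
    """Most frequent letter in each column of the aligned instances."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*instances))

# Hypothetical instances: "acgtacgt" with two substitutions each.
instances = ["ccgtacgg", "acgaaagt", "agttacgt", "acgtgcga"]

cons = consensus(instances)
to_consensus = [hamming(cons, m) for m in instances]
pairwise = [hamming(a, b) for a, b in combinations(instances, 2)]

print("consensus:", cons)
print("distances to consensus:", to_consensus)
print("pairwise distances:", pairwise)
```

In this example every instance is closer to the consensus than to any other instance, because two instances accumulate the mutations of both.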

Types of Algorithms
 EPatternBranching
 PatternBranching
 PMS: Exhaustive Motif Search
 PMS1, PMS2, …
 PMSP
 Search Trees
 Projection
 The Gold Bug Problem
 EM (Expectation Maximization)
 Brute Force Motif Finding
 The Median String Problem
 Branch-and-Bound Motif Search
 Branch-and-Bound Median String Search
 Consensus and Pattern Branching: Greedy Motif Search
 Boyer-Moore
 Knuth-Morris-Pratt
 Suffix Array Construction

Defining Some Terms
 t: number of sample DNA sequences
 n: length of each DNA sequence
 DNA: sample of DNA sequences (t × n array)
 l: length of the motif (l-mer)
 si: starting position of an l-mer in sequence i
 s = (s1, s2, …, st): array of the motif's starting positions

The Motif Finding Problem
 The starting positions s are usually not given. How can we find the “best” profile matrix?

The Motif Finding Problem: Formulation
 Goal: Given a set of DNA sequences, find a set of l-mers, one from each sequence, that maximizes the consensus score
 Input: A t × n matrix of DNA, and l, the length of the pattern to find
 Output: An array of t starting positions s = (s1, s2, …, st) maximizing Score(s, DNA)
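The slides shown here do not define Score(s, DNA). A common choice, assumed in the sketch below, is the consensus score: align the chosen l-mers, count the most frequent letter in each column, and add up those counts. The toy sequences and the function name score are hypothetical.

```python
from collections import Counter

def score(starts, dna, l):
    """Consensus score (assumed definition): for each column of the aligned
    l-mers, take the count of the most common letter, then sum over columns."""
    lmers = [seq[s - 1 : s - 1 + l] for seq, s in zip(dna, starts)]
    return sum(max(Counter(col).values()) for col in zip(*lmers))

# Hypothetical toy input: t = 3 sequences of length n = 12, motif length l = 8.
dna = ["ttacgtacgtgg", "acgtacgtccaa", "ggggacgtacgt"]

print(score((3, 1, 5), dna, 8))  # start positions that line up identical l-mers
print(score((1, 1, 1), dna, 8))  # an arbitrary choice of start positions
```

Under this definition the best possible score is l · t, reached when all t chosen l-mers are identical.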

Brute Force Motif Finding Solution
 Compute the score for each possible combination of starting positions s
 The best score determines the best profile and the consensus pattern in DNA
 The goal is to maximize Score(s, DNA) by varying the starting positions si, where si ∈ [1, …, n − l + 1] for i = 1, …, t

Brute Force Algorithm Search
BruteForceMotifSearch(DNA, t, n, l)
    bestScore ← 0
    for each s = (s1, s2, …, st) from (1, 1, …, 1) to (n − l + 1, …, n − l + 1)
        if Score(s, DNA) > bestScore
            bestScore ← Score(s, DNA)
            bestMotif ← (s1, s2, …, st)
    return bestMotif
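The pseudocode above translates fairly directly into Python. The sketch below is one possible rendering, not the author's implementation: it assumes the consensus score as Score(s, DNA), takes t and n from the input instead of passing them explicitly, and uses hypothetical toy sequences small enough to enumerate.

```python
from collections import Counter
from itertools import product

def score(starts, dna, l):
    """Consensus score (assumed): sum of the top letter count in each column."""
    lmers = [seq[s - 1 : s - 1 + l] for seq, s in zip(dna, starts)]
    return sum(max(Counter(col).values()) for col in zip(*lmers))

def brute_force_motif_search(dna, l):
    """Try every tuple of 1-based start positions and keep the best-scoring one."""
    best_score, best_motif = 0, None
    # Each si ranges over 1 .. n - l + 1 for its own sequence.
    ranges = [range(1, len(seq) - l + 2) for seq in dna]
    for starts in product(*ranges):
        current = score(starts, dna, l)
        if current > best_score:
            best_score, best_motif = current, starts
    return best_motif, best_score

# Hypothetical toy input: t = 3, n = 8, l = 4, so only 5^3 = 125 tuples to try.
dna = ["ttacgtgg", "acgtccaa", "ggggacgt"]
print(brute_force_motif_search(dna, 4))
```

itertools.product plays the role of the "for each s" loop, enumerating all (n − l + 1)^t tuples.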

Brute Force features
 No preprocessing phase.
 Constant extra space needed.
 Always shifts the window by exactly 1 position to the right.
 Comparisons can be done in any order.
 For each set of starting positions, the scoring function makes l operations, so the complexity is $l\,(n - l + 1)^t = O(l\,n^t)$.
 That means that for t = 8, n = 1000, l = 10 we must perform approximately $10^{25}$ computations ($10 \cdot 991^8 \approx 9.3 \times 10^{24}$); even at a billion operations per second this would take hundreds of millions of years.
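The size of the search space can be checked directly from the complexity expression l (n − l + 1)^t. The short calculation below uses the values t = 8, n = 1000, l = 10 from the slide; the machine speed of one billion operations per second is an assumption for illustration only.

```python
t, n, l = 8, 1000, 10

position_tuples = (n - l + 1) ** t   # number of start-position tuples s
operations = l * position_tuples     # l operations to score each tuple

print(f"{position_tuples:.2e} tuples, {operations:.2e} operations")

ops_per_second = 1e9                 # assumed machine speed
seconds_per_year = 365.25 * 24 * 3600
years = operations / ops_per_second / seconds_per_year
print(f"about {years:.1e} years at {ops_per_second:.0e} operations per second")
```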

PMS (Planted Motif Search)
 Generate all possible l-mers from the input sequence Si. Let Ci be the collection of these l-mers.
 Example: for the sequence AAGTCAGGAGT and l = 3,
Ci (3-mers): AAG, AGT, GTC, TCA, CAG, AGG, GGA, GAG, AGT
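Generating the collection Ci is a sliding window over the sequence. The sketch below reproduces the example with the sequence AAGTCAGGAGT and l = 3; the function name all_lmers is a hypothetical choice for this illustration.

```python
def all_lmers(sequence, l):
    """Every length-l substring of sequence, in order of start position."""
    return [sequence[i:i + l] for i in range(len(sequence) - l + 1)]

c_i = all_lmers("AAGTCAGGAGT", 3)
print(len(c_i), "3-mers:", "-".join(c_i))
```

A sequence of length n yields n − l + 1 l-mers, here 11 − 3 + 1 = 9, and repeated l-mers such as AGT are kept at this stage.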

All patterns at Hamming distance d = 1
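The set of patterns near a given l-mer can be enumerated by substituting letters at up to d positions. The sketch below is one way to do this; it assumes the DNA alphabet ACGT and counts the l-mer itself among its variants (distance at most d), which is an interpretation and not something the slide states.

```python
from itertools import combinations, product

ALPHABET = "ACGT"

def neighbors(pattern, d):
    """All strings within Hamming distance d of pattern, pattern included."""
    result = set()
    for k in range(d + 1):
        # Choose k positions, then try every combination of letters there.
        for positions in combinations(range(len(pattern)), k):
            for letters in product(ALPHABET, repeat=k):
                candidate = list(pattern)
                for pos, letter in zip(positions, letters):
                    candidate[pos] = letter
                result.add("".join(candidate))
    return result

print(sorted(neighbors("AAG", 1)))

# The nine 3-mers of AAGTCAGGAGT, each with its variants at d = 1.
lmers = ["AAG", "AGT", "GTC", "TCA", "CAG", "AGG", "GGA", "GAG", "AGT"]
print(sum(len(neighbors(x, 1)) for x in lmers))
```

Each 3-mer has 1 + 3 · 3 = 10 such variants, so the nine 3-mers give 90 patterns before duplicates are eliminated, which appears consistent with the m = 90 quoted for the running time.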

Find motif common to all lists
 Follow this procedure for all sequences
 Find the motif common to all lists Li (once duplicates have been eliminated)
 This is the planted motif
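The whole procedure can be sketched as follows. This is a simplified illustration, not the algorithm as published: Python sets stand in for the sorted lists Li and for duplicate elimination, the alphabet is assumed to be ACGT, and the three input sequences are hypothetical, each containing a copy of ACGT with one substitution.

```python
from itertools import combinations, product

ALPHABET = "ACGT"

def neighbors(pattern, d):
    """All strings within Hamming distance d of pattern, pattern included."""
    result = set()
    for k in range(d + 1):
        for positions in combinations(range(len(pattern)), k):
            for letters in product(ALPHABET, repeat=k):
                candidate = list(pattern)
                for pos, letter in zip(positions, letters):
                    candidate[pos] = letter
                result.add("".join(candidate))
    return result

def planted_motif_search(sequences, l, d):
    """Patterns that occur within distance d in every sequence."""
    common = None
    for seq in sequences:
        # Variant set for this sequence (plays the role of the list Li).
        variants = set()
        for i in range(len(seq) - l + 1):
            variants |= neighbors(seq[i:i + l], d)
        common = variants if common is None else common & variants
    return sorted(common)

# Hypothetical input: ACGA, AGGT and TCGT are each one substitution from ACGT.
sequences = ["TTACGAGG", "CCAGGTAA", "GGTCGTCC"]
motifs = planted_motif_search(sequences, 4, 1)
print("ACGT" in motifs, len(motifs))
```

The intersection may contain more than one candidate when sequences are short; longer inputs and larger l narrow it down.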

PMS Running Time
 It takes time to
 Generate variants
 Sort lists
 Find and eliminate duplicates
 Running time of this algorithm:
w is the word length of the computer
m is the total number of patterns; in the previous example, m = 90

