Tools For Motif and Pattern Searching: Prat Thiru

This document discusses tools for motif and pattern searching in biological sequences. It defines motifs as conserved regions in protein or DNA sequences. It describes different algorithms used for motif searching including enumeration, probabilistic optimization, and deterministic optimization. It provides an overview of several popular motif searching programs including MEME, MAST, AlignAce, Gibbs Motif Sampler, and others. It also outlines typical workflow and strategies for motif analysis.



TOOLS FOR MOTIF AND PATTERN SEARCHING


Prat Thiru
OUTLINE

• What are motifs?


• Algorithms Used and Programs Available
• Workflow and Strategies
• MEME/MAST Demo (online and command line)
Protein Motifs
DNA Motifs
MEME Output
Definitions
• Motif: Conserved regions of protein or DNA sequences
• Pattern: Qualitative description of a motif, e.g. regular expression C[AT]AAT[CG]X
• Profile: Quantitative description of a motif, e.g. position weight matrix
Patterns
• Regular Expression Symbols
  - [ ] – OR, e.g. [GA] means G or A
  - { } – NOT, e.g. {P,V} means neither P nor V
  - ( ) – repeats, e.g. A(3) means AAA
  - X or N or “.” – any residue
• Complex patterns are difficult to represent
• Lose frequency information, e.g. [AT] vs. 20% A, 80% T
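
The pattern symbols above map closely onto ordinary regular expressions. The following is a minimal sketch, assuming DNA patterns only (so that X and N can safely be treated as wildcards); `pattern_to_regex` is a hypothetical helper written for illustration, not part of any tool named in these slides.

```python
import re

def pattern_to_regex(pattern: str) -> str:
    """Translate the slide's DNA pattern notation into a Python regex string."""
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "[":                      # [GA] is already valid regex: copy as-is
            j = pattern.index("]", i)
            out.append(pattern[i:j + 1])
            i = j + 1
        elif ch == "{":                    # {P,V} (NOT) becomes a negated class [^PV]
            j = pattern.index("}", i)
            out.append("[^" + pattern[i + 1:j].replace(",", "") + "]")
            i = j + 1
        elif ch == "(":                    # A(3) (repeat) becomes the quantifier A{3}
            j = pattern.index(")", i)
            out.append("{" + pattern[i + 1:j] + "}")
            i = j + 1
        elif ch in "XN.":                  # wildcard: any base
            out.append(".")
            i += 1
        else:                              # literal base
            out.append(ch)
            i += 1
    return "".join(out)

regex = pattern_to_regex("C[AT]AAT[CG]X")            # "C[AT]AAT[CG]."
hits = [m.start() for m in re.finditer(regex, "GGCAAATCGTT")]
print(regex, hits)
```

Note that the resulting expression treats A and T at the second position as equally acceptable, which is exactly the loss of frequency information mentioned above.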
Profiles
Sequence Logos
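
As a rough illustration of a profile (position weight matrix) and of the per-position information content, in bits, that a sequence logo displays, the sketch below builds a frequency matrix from a few aligned sites. The example sites, the uniform background of 0.25, the pseudocount, and all function names are assumptions chosen for illustration.

```python
import math

ALPHABET = "ACGT"

def frequency_matrix(sites, pseudocount=1.0):
    """Column-wise base frequencies of equal-length aligned sites."""
    width = len(sites[0])
    matrix = []
    for col in range(width):
        counts = {base: pseudocount for base in ALPHABET}
        for site in sites:
            counts[site[col]] += 1
        total = sum(counts.values())
        matrix.append({base: counts[base] / total for base in ALPHABET})
    return matrix

def log_odds_score(word, matrix, background=0.25):
    """Sum of log2(frequency / background) over the positions of `word`."""
    return sum(math.log2(matrix[col][base] / background)
               for col, base in enumerate(word))

def information_content(column):
    """Bits of information in one column: log2(alphabet size) minus entropy."""
    entropy = -sum(p * math.log2(p) for p in column.values() if p > 0)
    return math.log2(len(column)) - entropy

sites = ["TATAAT", "TATAAT", "TACAAT", "TATACT"]     # made-up aligned sites
pwm = frequency_matrix(sites)
print([round(information_content(col), 2) for col in pwm])
print(log_odds_score("TATAAT", pwm) > log_odds_score("GGGGGG", pwm))
```

The pseudocount keeps every frequency above zero, so the log-odds score stays finite for bases that never occur in the training sites.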
Algorithms
• Enumeration
• Probabilistic Optimization
• Deterministic Optimization

1. Identify motifs
2. Build a consensus
Enumeration
• Exhaustive search (word-counting method): count all n-mers and look for overrepresentation
• Less likely to get stuck in a local optimum
• Computationally expensive
  - YMF: http://wingless.cs.washington.edu/YMF/YMFWeb/YMFInput.pl
  - Weeder: http://159.149.109.9/weederaddons/locator.html
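
The word-counting idea can be sketched in a few lines. This is a simplified illustration, not how YMF or Weeder score words: it assumes a separate set of background sequences is available and ranks n-mers by a pseudocount-smoothed ratio of target frequency to background frequency. The function names and toy sequences are hypothetical.

```python
from collections import Counter

def count_kmers(sequences, k):
    """Count every overlapping k-mer in a list of sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts

def overrepresented(target, background, k, pseudocount=1.0):
    """Rank k-mers seen in `target` by target/background frequency ratio."""
    t_counts = count_kmers(target, k)
    b_counts = count_kmers(background, k)
    n_words = 4 ** k                                  # possible DNA k-mers
    t_total = sum(t_counts.values()) + pseudocount * n_words
    b_total = sum(b_counts.values()) + pseudocount * n_words
    scores = {}
    for kmer, count in t_counts.items():
        t_freq = (count + pseudocount) / t_total
        b_freq = (b_counts[kmer] + pseudocount) / b_total
        scores[kmer] = t_freq / b_freq
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

target = ["ACGTGACC", "TTGTGACA", "GGGTGACT"]        # toy data with GTGAC planted
background = ["ACGTACGT", "TTTTAAAA", "GGCCGGCC"]
ranking = overrepresented(target, background, k=5)
print(ranking[0][0])
```

Because every n-mer is visited, the cost grows with the total sequence length and, for methods that also allow mismatches, with the number of possible words, which is the expense noted above.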
Probabilistic Optimization
• Uses a Gibbs sampling approach
• One n-mer from each sequence is picked at random to build the initial model. In each subsequent iteration, one sequence, i, is removed and the model is recalculated from the remaining sequences; a new motif location is then picked in sequence i. Iterate until convergence.
• Assumes most sequences will have the motif
  - AlignACE: http://atlas.med.harvard.edu/cgi-bin/alignace.pl
  - Gibbs Motif Sampler: http://bayesweb.wadsworth.org/gibbs/gibbs.html
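
The sampling loop described above can be sketched as follows. This is a bare-bones site sampler written for illustration, not the AlignACE or Gibbs Motif Sampler implementation: it assumes exactly one site per sequence, a uniform background of 0.25, and a fixed number of sweeps in place of a real convergence test. All names are hypothetical.

```python
import random

def gibbs_site_sampler(seqs, w, iterations=200, seed=0, pseudocount=1.0):
    """Return one sampled motif start position per sequence."""
    rng = random.Random(seed)
    alphabet = "ACGT"
    # Initial model: one randomly chosen w-mer from each sequence.
    pos = [rng.randrange(len(s) - w + 1) for s in seqs]
    for _ in range(iterations):
        for i, seq in enumerate(seqs):
            # Leave sequence i out and rebuild the profile from the others.
            others = [s[p:p + w]
                      for j, (s, p) in enumerate(zip(seqs, pos)) if j != i]
            profile = []
            for col in range(w):
                counts = {a: pseudocount for a in alphabet}
                for site in others:
                    counts[site[col]] += 1
                total = sum(counts.values())
                profile.append({a: c / total for a, c in counts.items()})
            # Weight every candidate start in sequence i by its
            # likelihood ratio against the uniform background.
            weights = []
            for start in range(len(seq) - w + 1):
                weight = 1.0
                for col in range(w):
                    weight *= profile[col][seq[start + col]] / 0.25
                weights.append(weight)
            # Sample (rather than take the maximum) the new location.
            pos[i] = rng.choices(range(len(weights)), weights=weights)[0]
    return pos

seqs = ["ACGTTGACGTAA", "TTGACGCCATGA", "GGATGACGTTCA"]   # toy sequences
positions = gibbs_site_sampler(seqs, w=5, iterations=20, seed=1)
print(positions)
```

Sampling the new position in proportion to its weight, instead of always taking the best one, is what lets the procedure escape poor starting points; different seeds can therefore give different answers on small inputs.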
Deterministic Optimization
• Based on expectation maximization (EM)
• EM: iteratively finds maximum-likelihood estimates of the model parameters when part of the data (here, the motif locations) is hidden
  I. Expectation step: Use current parameters (and observations) to reconstruct hidden structure
  II. Maximization step: Use that hidden structure (and observations) to re-estimate parameters
  - MEME (Multiple EM for Motif Elicitation): http://meme.sdsc.edu
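
The two steps can be illustrated with a toy EM loop. This is a sketch under stated assumptions, not MEME's actual implementation: it assumes exactly one motif occurrence per sequence, a uniform background of 0.25, and a starting model seeded from a single word with an arbitrary 0.7/0.1 weighting. The function name and the example data are hypothetical.

```python
def em_one_site_per_sequence(seqs, w, init_site, iterations=50, pseudocount=0.1):
    """Toy EM for a motif of width w; returns (profile, posteriors)."""
    alphabet = "ACGT"
    # Starting model: favour the bases of the seed word at each column.
    profile = [{a: (0.7 if a == init_site[c] else 0.1) for a in alphabet}
               for c in range(w)]
    posteriors = []
    for _ in range(iterations):
        # E-step: posterior probability of each start position (the hidden
        # structure) in each sequence, given the current profile.
        posteriors = []
        for seq in seqs:
            likes = []
            for start in range(len(seq) - w + 1):
                like = 1.0
                for c in range(w):
                    like *= profile[c][seq[start + c]] / 0.25
                likes.append(like)
            total = sum(likes)
            posteriors.append([like / total for like in likes])
        # M-step: re-estimate column frequencies from the expected counts.
        counts = [{a: pseudocount for a in alphabet} for _ in range(w)]
        for seq, post in zip(seqs, posteriors):
            for start, z in enumerate(post):
                for c in range(w):
                    counts[c][seq[start + c]] += z
        profile = [{a: col[a] / sum(col.values()) for a in alphabet}
                   for col in counts]
    return profile, posteriors

seqs = ["CCCCTATACCCC", "GGGTATAGGGGG", "TATACCCCGGGG"]   # TATA planted in each
profile, posteriors = em_one_site_per_sequence(seqs, w=4, init_site="TATA")
best_starts = [post.index(max(post)) for post in posteriors]
print(best_starts)
```

Unlike the Gibbs sampler, every step here is deterministic: the same starting word always leads to the same final model, which is why the choice of starting point matters.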
MEME
• Starting from a single site, EM alternates between assigning sites and updating the motif model
• Performs a single EM iteration from each n-mer in the target sequences, selects the best resulting motif, and then iterates only that one to convergence
• Search space increases significantly with increasing number of sequences and/or sequence length
Programs Available*

*incomplete list

Fraenkel, E., et al. Practical Strategies for Discovering Regulatory DNA Sequence Motifs PLoS Computational Biology 2:201-210 (2006)
Programs Available: EMBOSS
Motif Searching
http://iona.wi.mit.edu/bio/tools/emboss/
• wordcount: Counts words of a specified size in a DNA sequence
• prophecy: Creates matrices/profiles from multiple alignments
• profit: Scans a sequence or database with a matrix or profile
Programs Available: EMBOSS
Pattern Searching
http://iona.wi.mit.edu/bio/tools/emboss/
• fuzznuc: Nucleic acid pattern search
• fuzzpro: Protein pattern search
Programs Available: Other
• Allegro (Expression)
http://acgt.cs.tau.ac.il/allegro/

• CisGenome (ChIP-Seq)
http://www.biostat.jhsph.edu/~hji/cisgenome
Workflow and Strategies

Fraenkel, E., et al. Practical Strategies for Discovering Regulatory DNA Sequence Motifs PLoS Computational Biology 2:201-210 (2006)
Further Reading
• Practical Strategies for Discovering Regulatory DNA Sequence Motifs
http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.0020036
• How does DNA sequence motif discovery work?
http://www.nature.com/nbt/journal/v24/n8/full/nbt0806-959.html
• MEME
Bailey, T. L. et al. Nucl. Acids Res. 2006 34:W369-W373; doi:10.1093/nar/gkl198
MEME/MAST Demo
http://meme.sdsc.edu
MEME/MAST Demo
MEME Suite
MEME/MAST Demo
Command Line (on tak)
Usage: MEME (finds ungapped motifs in unaligned sequences)

e.g. meme sample.fa -dna -maxw 10 -nmotifs 5 -mod zoops -pal -maxsize 1000000 -o sample_meme
meme <dataset> [optional arguments]
• <dataset> file containing sequences in FASTA format
• [-text] output in text format (default is HTML)
• [-dna] sequences use DNA alphabet
• [-protein] sequences use protein alphabet
• [-mod oops|zoops|anr] distribution of motifs
• [-nmotifs <nmotifs>] maximum number of motifs to find
• [-evt <ev>] stop if motif E-value is greater than <ev>
• [-minw <minw>] minimum motif width
• [-maxw <maxw>] maximum motif width

For a complete list of options, enter “meme” at the command prompt
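
When MEME is driven from a script, the argument list can be assembled programmatically. The sketch below uses only the options listed above; `build_meme_command` is a hypothetical helper, and it only constructs the command without running it, since whether `meme` is installed and which options a given version accepts are assumptions to check locally.

```python
import shlex

def build_meme_command(dataset, alphabet="dna", mod=None, nmotifs=None,
                       minw=None, maxw=None, evt=None, text=False):
    """Build an argument list for the `meme` command line shown above."""
    if alphabet not in ("dna", "protein"):
        raise ValueError("alphabet must be 'dna' or 'protein'")
    if mod is not None and mod not in ("oops", "zoops", "anr"):
        raise ValueError("mod must be 'oops', 'zoops' or 'anr'")
    cmd = ["meme", dataset, "-" + alphabet]
    if text:
        cmd.append("-text")
    # Add only the options that were explicitly requested.
    for flag, value in (("-mod", mod), ("-nmotifs", nmotifs), ("-evt", evt),
                        ("-minw", minw), ("-maxw", maxw)):
        if value is not None:
            cmd += [flag, str(value)]
    return cmd

cmd = build_meme_command("sample.fa", mod="zoops", nmotifs=5, maxw=10)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where MEME is available.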


MEME/MAST Demo
Command Line (on tak)
Usage: MAST (searches a sequence database for occurrences of known motifs)

e.g. mast motifs.txt -d data.fa


mast <mfile> <database> [ optional arguments ... ]
• <mfile> file containing motifs to use; may be a MEME output file or similar file
• [-d <database> | -stdin] search sequences in <database> with motifs

For a complete list of options, enter “mast” at the command prompt
