Lecture 6 Protein Modeling
June 7, 2007
Protein Structure Prediction
Other Topics
Protein Architecture
proteins are polymers consisting of amino acids linked by peptide bonds
each amino acid consists of:
  a central carbon atom
  an amino group (NH2)
  a carboxyl group (COOH)
  a side chain
differences in side chains distinguish different amino acids
Peptide Bonds
[Figure labels: amino group, side chain, carboxyl group, alpha carbon (common reference point for coordinates of a structure)]
Amino Acid Side Chains
side chains vary in: shape, size, charge, polarity
Levels of Description
protein structure is often described at four different scales:
  primary structure
  secondary structure
  tertiary structure
  quaternary structure
Secondary Structure
secondary structure refers to certain common repeating structures
it is a local description of structure
two common secondary structures:
  helices
  strands/sheets
a third category, called coil or loop, refers to everything else
Helices
[Figure labels: carbon, individual amino acid, hydrogen bond]
Sheets
Ribbon Diagram Showing Secondary Structures
The Protein Folding Problem
we know that the function of a protein is determined in large part by its 3D shape (fold, conformation)
can we predict the 3D shape of a protein given only its amino-acid sequence?
in general, NO: current methods cannot do this accurately
but the methods can often provide a partial description of the 3D structure, which is often helpful
Motivation
Want to identify the function of genes we find, and what different mutations/alleles do
One gene = one protein (sort of)
  Function of protein = function of gene
Function can be determined in many ways
  Gene expression, knockouts, etc.
But these take time, and are prone to mistakes
Goal: If we can determine the structure of every protein, learning their functions isn't too far away
Thornton et al 2000 (Nature)
Similar problems
Straight-up 3D prediction is hard (Nobel awaits)
Subproblem 1: Identify patterns in sequence
  Profile HMMs, multiple sequence alignments
Subproblem 2: Identify common motifs
  Various methods
Subproblem 3: Identify classes of proteins
  SCOP
Subproblem 4: Identify homologs
  BLAST
http://www.ludwig.edu.au/course/course2002/
What Determines Conformation?
in general, the amino-acid sequence of a protein determines the 3D shape of a protein [Anfinsen et al., 1950s]
but there are some exceptions:
  all proteins can be denatured
  some proteins are inherently disordered (i.e. lack a regular structure)
  some proteins get folding help from chaperones
  there are various mechanisms through which the conformation of a protein can be changed in vivo:
    post-translational modifications such as phosphorylation
    prions
    etc.
What Determines Conformation?
what physical properties of the protein determine its fold?
  rigidity of the protein backbone
  interactions among amino acids, including:
    electrostatic interactions
    van der Waals forces
    volume constraints
    hydrogen, disulfide bonds
  interactions of amino acids with water
Determining Protein Structures
protein structures can be determined experimentally (in many cases) by:
  x-ray crystallography
  nuclear magnetic resonance (NMR)
DNA
Picture by Anthony North
Myoglobin
From www.inst.bnl.gov/GasDetectorLab/x-rays/SRI94.htm
Myoglobin
S.E.V. Phillips. "Structure and refinement of oxymyoglobin at 1.6 Å resolution." J. Mol. Biol. 1980, 142, 531.
NMR
Nuclear Magnetic Resonance Spectroscopy
Unlike X-ray crystallography, cannot handle large proteins
Exploits the chemical environment to return distances between atoms
  Can use knowledge of restraints to identify positions of atoms that produce peaks
Protein structure determination in solution by NMR spectroscopy Wuthrich K. J Biol Chem. 1990 December 25;265(36):22059-62
Experimental Methods
Very expensive and time-consuming
Computational methods can help with time (Frank DiMaio)
Many proteins still cannot be done in this manner
More motivation
there is a large sequence-structure gap:
  158K protein sequences in SwissProt database
  27K protein structures in PDB database
key question: can we predict structures by computational means instead?
Approaches to Protein Structure Prediction
prediction in 1D
  secondary structure
  solvent accessibility (which residues are exposed to water, which are buried)
  transmembrane helices (which residues span membranes)
prediction in 2D
  inter-residue/strand contacts
prediction in 3D
  homology modeling
  fold recognition (e.g. via threading)
  ab initio prediction (e.g. via molecular dynamics)
Prediction in 1D, 2D and 3D
predicted secondary structure and solvent accessibility
known secondary structure (E = beta strand) and solvent accessibility
Figure from B. Rost, Protein Structure in 1D, 2D, and 3D, The Encyclopaedia of Computational Chemistry, 1998
2D Prediction Approaches
use secondary structure predictions to predict short-range contacts (e.g. hydrogen bonds in helices)
use secondary structure predictions to predict strand alignments
use correlated mutations to predict contacts
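One common way to quantify correlated mutations is the mutual information between two columns of a multiple sequence alignment: columns that co-vary score high and are candidate contacts. The following is a minimal sketch under that assumption, not necessarily the method the lecture had in mind; the function name and the toy alignment are hypothetical.

```python
import math
from collections import Counter


def column_mutual_information(alignment, i, j):
    """Mutual information (in bits) between alignment columns i and j."""
    n = len(alignment)
    col_i = [seq[i] for seq in alignment]
    col_j = [seq[j] for seq in alignment]
    count_i = Counter(col_i)
    count_j = Counter(col_j)
    count_ij = Counter(zip(col_i, col_j))
    mi = 0.0
    for (a, b), c in count_ij.items():
        p_ab = c / n
        # p_ab / (p_a * p_b), written with raw counts
        mi += p_ab * math.log2(p_ab * n * n / (count_i[a] * count_j[b]))
    return mi


# Hypothetical toy alignment: columns 0 and 1 always change together,
# while column 2 varies independently of column 0.
toy_alignment = ["ADKE", "ADRE", "GEKK", "GERK"]
mi_covarying = column_mutual_information(toy_alignment, 0, 1)
mi_independent = column_mutual_information(toy_alignment, 0, 2)
```

In practice such scores are usually corrected for phylogenetic bias and small sample sizes; this sketch omits both.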
Prediction in 3D
homology modeling
  given: a query sequence Q, a database of protein structures
  do:
    find protein P such that
      structure of P is known
      P has high sequence similarity to Q
    return P's structure as an approximation to Q's structure
fold recognition
  given: a query sequence Q, a database of known folds
  do:
    find fold F such that Q can be aligned with F in a highly compatible manner
    return F as an approximation to Q's structure
ab initio prediction
  given: a query sequence Q (assuming no similar sequence or fold is known)
  do: return a predicted structure S for Q
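The homology modeling recipe above can be sketched as a template search. This is a minimal illustration, assuming the query and database sequences are already aligned to the same length (with "-" for gaps) and that similarity is measured as fraction identity; the function names, the 0.30 cutoff, and the toy database are hypothetical.

```python
def fraction_identity(a, b):
    """Fraction of aligned (non-gap) positions where a and b agree."""
    if len(a) != len(b):
        raise ValueError("sequences must be pre-aligned to equal length")
    aligned = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not aligned:
        return 0.0
    return sum(1 for x, y in aligned if x == y) / len(aligned)


def best_template(query, database, min_identity=0.30):
    """Return (name, identity) of the most similar protein with a known
    structure, or (None, identity) if nothing reaches the cutoff."""
    best_name, best_identity = None, 0.0
    for name, seq in database.items():
        identity = fraction_identity(query, seq)
        if identity > best_identity:
            best_name, best_identity = name, identity
    if best_identity < min_identity:
        return None, best_identity
    return best_name, best_identity


# Hypothetical toy database of proteins whose structures are assumed known.
known_structures = {"P_a": "MKTAYIAKQL", "P_b": "GSHMLEDPVN"}
template = best_template("MKTAYIAKQR", known_structures)
```

A real pipeline would compute the alignment itself (for example with BLAST, as mentioned earlier) and then build a model from the template's coordinates.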
Homology Modeling
most pairs of proteins with similar structure are remote homologs (< 25% sequence identity)
homology modeling usually doesn't work for remote homologs; most pairs of proteins with < 25% sequence identity are unrelated
[Figure: axis of pairwise sequence identity with ticks at 0%, 20%, 30%, and 100%; regions labeled, in order, probably unrelated, remote homologs, and homologs]
Threading
Form of fold recognition
From ai.stanford.edu/~serafim/CS262_2006/Slides/prediction.ppt
Proteomics
Microarrays are useful primarily because mRNA concentrations serve as a surrogate for protein concentrations
Would like to measure protein concentrations directly, but at present cannot do so in the same high-throughput manner
Proteins do not have obvious direct complements
Could build molecules that bind, but binding is greatly affected by protein structure
Time-of-Flight (TOF) Mass Spectrometry (thanks Sean McIlwain)
Measures the time for an ionized particle, starting from the sample plate, to hit the detector
[Figure labels: laser, sample, +V, detector]
Time-of-Flight (TOF) Mass Spectrometry 2
Matrix-Assisted Laser Desorption-Ionization (MALDI)
Crystalloid structures made using proton-rich matrix molecule
Hitting crystalloid with laser causes molecules to ionize and fly towards detector
[Figure labels: laser, sample, +V, detector]
Time-of-Flight Demonstration (animation frames 0 to 9)
Frames 0 to 3: the components are introduced: sample plate, matrix molecules, protein molecules, laser, detector, and a +10 kV positive potential under the sample plate
Frame 4: laser pulsed directly onto sample; a proton is kicked off a matrix molecule onto another molecule
Frame 5: lots of protons kicked off matrix ions, giving rise to more positively charged molecules
Frame 6: the high positive potential under the sample plate causes positively charged molecules to accelerate towards the detector
Frame 7: smaller-mass molecules hit the detector first, while heavier ones are detected later
Frame 8: the flight time is measured from when the laser is pulsed until the molecule hits the detector
Frame 9: experiment repeated a number of times, counting frequencies of flight times
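Why lighter ions arrive first follows from energy conservation. As a simplified sketch, assume an ion of mass m and charge ze gains all its kinetic energy zeV from the accelerating potential V and then drifts a field-free distance L: zeV = (1/2) m v^2, so t = L * sqrt(m / (2 z e V)). The drift length of 1 m is a hypothetical value; the 10 kV potential is the one shown on the slides.

```python
import math

ELEMENTARY_CHARGE = 1.602176634e-19   # coulombs
DALTON = 1.66053906660e-27            # kilograms


def flight_time(mass_da, charge=1, voltage=10_000.0, drift_length_m=1.0):
    """Idealized flight time in seconds: t = L * sqrt(m / (2 z e V)).

    Ignores the time spent in the acceleration region and any spread in
    initial velocities; drift_length_m is an assumed instrument parameter.
    """
    mass_kg = mass_da * DALTON
    energy = charge * ELEMENTARY_CHARGE * voltage
    return drift_length_m * math.sqrt(mass_kg / (2.0 * energy))


t_light = flight_time(1000.0)   # a 1000 Da singly charged ion
t_heavy = flight_time(4000.0)   # four times the mass, same charge
```

Because t scales with sqrt(m/z), quadrupling the mass doubles the flight time, which is why the spectrum's time axis can be converted to M/Z.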
Example Spectra from a Competition by Lin et al. at Duke
These are different fractions from the same sample.
[Figure: spectra, intensity versus M/Z]
Trypsin-Treated Spectra
[Figure: spectrum, frequency versus M/Z]
Many Challenges Raised by Mass Spectrometry Data
Noise: extra peaks from handling of sample, from machine and environment (electrical noise), etc.
M/Z values may not align exactly across spectra (resolution ~0.1%)
Intensities not calibrated across spectra: quantification is difficult
Cannot get all proteins; typically only several hundred. To improve odds of getting the ones we want, may fractionate our sample by 2D gel electrophoresis or liquid chromatography.
Challenges (Continued)
Better results if we partially digest proteins (break into smaller peptides) first
Can be difficult to determine what proteins we have from the spectrum
Isotopic peaks: C13 and N15 atoms in varying numbers cause multiple peaks for a single peptide
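To see why isotopes spread a single peptide over several peaks, the number of C13 atoms in a peptide can be modeled as binomial. This is a minimal sketch assuming each carbon is independently C13 with probability about 0.0107 (the approximate natural abundance) and ignoring N15 and other isotopes; the function name and the 50-carbon peptide are hypothetical.

```python
import math

C13_ABUNDANCE = 0.0107  # approximate natural abundance of carbon-13


def c13_peak_probabilities(n_carbons, max_extra=4, p=C13_ABUNDANCE):
    """Probability that a peptide with n_carbons carbons contains
    exactly k C13 atoms, for k = 0..max_extra (binomial model)."""
    return [
        math.comb(n_carbons, k) * p**k * (1.0 - p) ** (n_carbons - k)
        for k in range(max_extra + 1)
    ]


# Hypothetical peptide with 50 carbon atoms: relative heights of the
# monoisotopic peak (k = 0) and the peaks shifted by 1, 2, ... mass units.
peaks = c13_peak_probabilities(50)
```

Larger peptides have more carbons, so more of their signal shifts away from the monoisotopic peak.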
Handling Noise: Peak Picking
Want to pick peaks that are statistically distinguishable from the noise signal
Want to use these as features in our learning algorithms
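One simple way to pick peaks is to keep local maxima that rise well above a robust noise estimate. The sketch below assumes a threshold of median + k times the scaled median absolute deviation (MAD); this is one illustrative choice, not the lecture's specific method, and the function name, k = 5, and the toy spectrum are hypothetical.

```python
import statistics


def pick_peaks(intensities, k=5.0):
    """Return indices of local maxima above median + k * (1.4826 * MAD)."""
    baseline = statistics.median(intensities)
    mad = statistics.median([abs(x - baseline) for x in intensities])
    threshold = baseline + k * 1.4826 * mad  # 1.4826 scales MAD to a std. dev.
    peaks = []
    for i in range(1, len(intensities) - 1):
        value = intensities[i]
        is_local_max = intensities[i - 1] < value >= intensities[i + 1]
        if is_local_max and value > threshold:
            peaks.append(i)
    return peaks


# Hypothetical toy spectrum: low-level noise with two clear peaks.
toy_spectrum = [1, 2, 1, 2, 1, 30, 2, 1, 2, 1, 12, 1, 2]
peak_indices = pick_peaks(toy_spectrum)
```

The selected peak positions (converted to M/Z) and their intensities can then serve as features for a learning algorithm.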
Many Supervised Learning Tasks
Learn to predict proteins from spectra, when the organism's proteome is known
Learn to identify isotopic distributions
Learn to predict disease from either proteins, peaks or isotopic distributions as features
Construct pathway models
Using Mass Spectrometry for Early Detection of Ovarian Cancer [Petricoin et al., 2002]
Ovarian cancer difficult to detect early, often leading to poor prognosis
Trained and tested on mass spectra from blood serum
100 training cases, 50 with cancer
Held-out test set of 116 cases, 50 with cancer
100% sensitivity, 95% specificity (63/66) on held-out test set
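The reported figures can be checked from the counts on the slide: 116 test cases with 50 cancers leaves 66 non-cancer cases, of which 63 were classified correctly. The sketch below assumes all 50 cancers were detected (which is what 100% sensitivity implies); the variable names are illustrative.

```python
def sensitivity(true_pos, false_neg):
    """Fraction of actual positives that are detected."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg, false_pos):
    """Fraction of actual negatives that are correctly cleared."""
    return true_neg / (true_neg + false_pos)


# Counts implied by the held-out test set described on the slide.
cancer_cases, total_cases = 50, 116
healthy_cases = total_cases - cancer_cases        # 66
test_sensitivity = sensitivity(true_pos=50, false_neg=0)
test_specificity = specificity(true_neg=63, false_pos=healthy_cases - 63)
```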
Not So Fast
Data mining methodology seems sound
But Keith Baggerly argues that cancer samples were handled differently than normal samples, and perhaps data were preprocessed differently too
If we run cancer samples Monday and normals Wednesday, could get differences from machine breakdown or nearby electrical equipment that's running on Monday but not Wednesday
Lesson: tell collaborators they must randomize samples for the entire processing phase, and of course all our preprocessing must be the same
Debate is still raging; results not replicated in trials
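Randomizing the processing order can be as simple as shuffling samples within each class and dealing them evenly across processing batches, so that no batch (for example, a day of instrument time) contains only cancers or only normals. This is a minimal sketch of that idea; the function name, sample IDs, and batch count are hypothetical.

```python
import random


def randomized_batches(samples, n_batches, seed=0):
    """Spread each class evenly over batches, in random order.

    samples: list of (sample_id, label) pairs.
    Returns a list of n_batches lists of (sample_id, label) pairs.
    """
    rng = random.Random(seed)
    by_label = {}
    for sample_id, label in samples:
        by_label.setdefault(label, []).append(sample_id)
    batches = [[] for _ in range(n_batches)]
    slot = 0
    for label in sorted(by_label):
        ids = by_label[label]
        rng.shuffle(ids)                      # random assignment within class
        for sample_id in ids:
            batches[slot % n_batches].append((sample_id, label))
            slot += 1
    for batch in batches:
        rng.shuffle(batch)                    # random run order within batch
    return batches


# Hypothetical study: 4 cancer and 4 normal samples run over 2 days.
study = [(f"C{i}", "cancer") for i in range(4)] + [(f"N{i}", "normal") for i in range(4)]
run_plan = randomized_batches(study, n_batches=2)
```

Recording the seed alongside the run plan makes the assignment reproducible for later audits.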
Other Proteomics: Interactions
Figure from Ideker et al., Science 292(5518):929-934, 2001
each node represents a gene product (protein)
blue edges show direct protein-protein interactions
yellow edges show interactions in which one protein binds to DNA and affects the expression of another
Protein-Protein Interactions
Yeast 2-Hybrid
Immunoprecipitation
  Antibodies (immuno) are made by combining certain proteins combinatorially
  Millions of antibodies can be made, to recognize a wide variety of different antigens (invaders), often by recognizing specific proteins
[Figure labels: antibody, protein]
Protein-Protein Interactions
[Figure: Immunoprecipitation and Co-Immunoprecipitation, each with the antibody labeled]
Many Supervised Learning Tasks
Learn to predict protein-protein interactions: protein 3D structures may be critical
Use protein-protein interactions in construction of pathway models
Learn to predict protein function from interaction data
ChIP-Chip Data
Immunoprecipitation can also be done to identify proteins interacting with DNA rather than other proteins
Chromatin immunoprecipitation (ChIP): grab sample of DNA bound to a particular protein (transcription factor)
ChIP-Chip: run this sample of DNA on a microarray to see which DNA was bound
Example of analysis of such new data: Keles et al., 2006
Metabolomics
Measures concentration of each low-molecular-weight molecule in a sample
These typically are metabolites, or small molecules produced or consumed by reactions in biochemical pathways
These reactions are typically catalyzed by proteins (specifically, enzymes)
These data typically also come from mass spectrometry, though they could also come from NMR
Lipomics
Analogous to metabolomics, but measuring concentrations of lipids rather than metabolites
Could potentially help induce biochemical pathway information, or help with disease diagnosis or treatment choice
To Design a Drug:
Identify Target Protein
  Knowledge of proteome/genome
  Relevant biochemical pathways
Determine Target Site Structure
  Crystallography, NMR
  Difficult if Membrane-Bound
Synthesize a Molecule that Will Bind
  Imperfect modeling of structure
  Structures may change at binding
  And even then...
Molecule Binds Target But May:
Bind too tightly or not tightly enough.
Be toxic.
Have other effects (side-effects) in the body.
Break down as soon as it gets into the body, or not leave the body soon enough.
Not get to where it should in the body (e.g., crossing blood-brain barrier).
Not diffuse from gut to bloodstream.
And Every Body is Different:
Even if a molecule works in the test tube and works in animal studies, it may not work in people (will fail in clinical trials).
A molecule may work for some people but not others.
A molecule may cause harmful side-effects in some people but not others.
Typical Practice when Target Structure is Unknown
High-Throughput Screening (HTS): Test many molecules (1,000,000) to find some that bind to target (ligands).
Infer (induce) shape of target site from 3D structural similarities.
Shared 3D substructure is called a pharmacophore.
Perfect example of a machine learning task with spatial target.
An Example of Structure Learning
Inactive
Active
Common Data Mining Approaches
Represent a molecule by thousands to millions of features and use standard techniques (e.g., KDD Cup 2001)
Represent each low-energy conformer by feature vector and use multiple-instance learning (e.g., Jain et al., 1998)
Relational learning
  Inductive logic programming (e.g., Finn et al., 1998)
  Graph mining
Supervised Learning Task
Given: a set of molecules, each labeled by activity (binding affinity for target protein), and a set of low-energy conformers for each molecule
Do: Learn a model that accurately predicts activity (may be Boolean or real-valued)
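Under the multiple-instance view mentioned earlier, a molecule is a bag of conformers and is active if at least one conformer fits the target. The sketch below illustrates only the prediction side of that idea, assuming each conformer has already been turned into a numeric feature vector and that a linear scoring model has already been learned; the weights, bias, and feature values are all hypothetical.

```python
def score_conformer(features, weights, bias):
    """Linear score for a single conformer's feature vector."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def predict_active(conformers, weights, bias):
    """Multiple-instance rule: the molecule is predicted active if its
    best-scoring conformer has a positive score."""
    best = max(score_conformer(c, weights, bias) for c in conformers)
    return best > 0.0


# Hypothetical learned model and two molecules, each a bag of conformers.
weights, bias = [1.0, -1.0], -0.5
molecule_a = [[0.2, 0.9], [1.5, 0.3]]   # second conformer scores 0.7
molecule_b = [[0.1, 0.4], [0.3, 0.2]]   # no conformer scores above zero
prediction_a = predict_active(molecule_a, weights, bias)
prediction_b = predict_active(molecule_b, weights, bias)
```

For real-valued activity, the maximum conformer score itself could be used as the prediction instead of thresholding it.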
Clinical Databases of the Future (Dramatically Simplified)
PatientID  Gender  Birthdate
P1         M       3/22/63

PatientID  Date    Physician  Symptoms      Diagnosis
P1         1/1/01  Smith      palpitations  hypoglycemic
P1         2/1/03  Jones      fever, aches  influenza

PatientID  Date    Lab Test       Result
P1         1/1/01  blood glucose  42
P1         1/9/01  blood glucose  45

PatientID  SNP1  SNP2  ...  SNP500K
P1         AA    AB    ...  BB
P2         AB    BB    ...  AA

PatientID  Date Prescribed  Date Filled  Physician  Medication  Dose  Duration
P1         5/17/98          5/18/98      Jones      prilosec    10mg  3 months
Final Wrap-up
Molecular biology is collecting lots and lots of data in the post-genome era
Opportunity to connect molecular-level information to diseases and treatment
Need analysis tools to interpret the data
Data mining opportunities abound
Hopefully this tutorial provided a solid start toward applying data mining to high-throughput biological data