Lecture 6 Protein Modeling
June 7, 2007
Protein Structure Prediction Other topics
Protein Architecture
- proteins are polymers consisting of amino acids linked by peptide bonds
- each amino acid consists of
  - a central carbon atom
  - an amino group (NH2)
  - a carboxyl group (COOH)
  - a side chain
- differences in side chains distinguish different amino acids
Peptide Bonds
[Figure labels: amino group, side chain, carboxyl group, α carbon (common reference point for coordinates of a structure)]
Amino Acid Side Chains
- side chains vary in: shape, size, charge, polarity
Levels of Description
- protein structure is often described at four different scales
  - primary structure
  - secondary structure
  - tertiary structure
  - quaternary structure
Secondary Structure
- secondary structure refers to certain common repeating structures
- it is a local description of structure
- two common secondary structures
  - α helices
  - β strands/sheets
- a third category, called coil or loop, refers to everything else
α Helices
[Figure labels: α carbon, individual amino acid, hydrogen bond]
β Sheets
Ribbon Diagram Showing Secondary Structures
The Protein Folding Problem
- we know that the function of a protein is determined in large part by its 3D shape (fold, conformation)
- can we predict the 3D shape of a protein given only its amino-acid sequence?
- in general, NO: current methods cannot do this accurately
- but the methods can often provide a partial description of the 3D structure, which is often helpful
Motivation
- Want to identify the function of the genes we find, and what different mutations/alleles do
- One gene = one protein (sort of)
  - Function of protein = function of gene
- Function can be determined in many ways
  - Gene expression, knockouts, etc.
  - But these take time and are prone to mistakes
- Goal: if we can determine the structure of every protein, learning their functions isn't too far away
Thornton et al 2000 (Nature)
Similar problems
- Straight up 3D prediction hard (Nobel awaits)
- Subproblem 1: Identify patterns in sequence
  - Profile HMMs, multiple sequence alignments
- Subproblem 2: Identify common motifs
  - Various methods
- Subproblem 3: Identify classes of proteins
  - SCOP
- Subproblem 4: Identify homologs
  - BLAST
http://www.ludwig.edu.au/course/course2002/
What Determines Conformation?
- in general, the amino-acid sequence of a protein determines the 3D shape of a protein [Anfinsen et al., 1950s]
- but some exceptions
  - all proteins can be denatured
  - some proteins are inherently disordered (i.e. lack a regular structure)
  - some proteins get folding help from chaperones
  - there are various mechanisms through which the conformation of a protein can be changed in vivo
    - post-translational modifications such as phosphorylation
    - prions
    - etc.
What Determines Conformation?
- what physical properties of the protein determine its fold?
  - rigidity of the protein backbone
  - interactions among amino acids, including
    - electrostatic interactions
    - van der Waals forces
    - volume constraints
    - hydrogen, disulfide bonds
  - interactions of amino acids with water
Determining Protein Structures
- protein structures can be determined experimentally (in many cases) by
  - x-ray crystallography
  - nuclear magnetic resonance (NMR)
DNA
Picture by Anthony North
Myoglobin
From www.inst.bnl.gov/GasDetectorLab/x-rays/SRI94.htm
Myoglobin
S.E.V. Phillips. "Structure and refinement of oxymyoglobin at 1.6 Å resolution." J. Mol. Biol. 1980, 142, 531.
NMR
- Nuclear Magnetic Resonance Spectroscopy
- Unlike X-ray crystallography, cannot handle large proteins
- Exploits the chemical environment to return distances between atoms
  - Can use knowledge of restraints to identify positions of atoms that produce peaks
Protein structure determination in solution by NMR spectroscopy Wuthrich K. J Biol Chem. 1990 December 25;265(36):22059-62
Experimental Methods
- Very expensive and time-consuming
  - Computational methods can help with time (Frank DiMaio)
- Many protein structures still cannot be determined in this manner
More motivation
- there is a large sequence-structure gap
  - 158K protein sequences in SwissProt database
  - 27K protein structures in PDB database
- key question: can we predict structures by computational means instead?
Approaches to Protein Structure Prediction
- prediction in 1D
  - secondary structure
  - solvent accessibility (which residues are exposed to water, which are buried)
  - transmembrane helices (which residues span membranes)
- prediction in 2D
  - inter-residue/strand contacts
- prediction in 3D
  - homology modeling
  - fold recognition (e.g. via threading)
  - ab initio prediction (e.g. via molecular dynamics)
Prediction in 1D, 2D and 3D
predicted secondary structure and solvent accessibility
known secondary structure (E = beta strand) and solvent accessibility
Figure from B. Rost, Protein Structure in 1D, 2D, and 3D, The Encyclopaedia of Computational Chemistry, 1998
2D Prediction Approaches
- use secondary structure predictions to predict short-range contacts (e.g. hydrogen bonds in α helices)
- use secondary structure predictions to predict β strand alignments
- use correlated mutations to predict contacts
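One common way to quantify correlated mutations is the mutual information between two columns of a multiple sequence alignment: columns that co-vary across homologs are candidate contacts. The slides do not say which measure is intended, so the following is only a minimal sketch under that assumption; the function names and the toy alignment are hypothetical.

```python
import math
from collections import Counter


def column(alignment, index):
    """Extract one column of an alignment given as equal-length strings."""
    return "".join(seq[index] for seq in alignment)


def mutual_information(col_i, col_j):
    """Mutual information (in bits) between two alignment columns."""
    n = len(col_i)
    count_i = Counter(col_i)
    count_j = Counter(col_j)
    count_ij = Counter(zip(col_i, col_j))
    mi = 0.0
    for (a, b), count in count_ij.items():
        p_ab = count / n
        p_a = count_i[a] / n
        p_b = count_j[b] / n
        mi += p_ab * math.log2(p_ab / (p_a * p_b))
    return mi


# Hypothetical toy alignment: columns 1 and 2 always change together,
# while columns 1 and 3 vary independently of each other.
alignment = ["ARDL", "AKEL", "ARDV", "AKEV"]
covarying = mutual_information(column(alignment, 1), column(alignment, 2))
independent = mutual_information(column(alignment, 1), column(alignment, 3))
```

In practice such scores are usually corrected for phylogenetic bias and small sample sizes before being read as contact predictions; none of that is attempted here.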
Prediction in 3D
- homology modeling
  given: a query sequence Q, a database of protein structures
  do:
  - find protein P such that
    - structure of P is known
    - P has high sequence similarity to Q
  - return P's structure as an approximation to Q's structure
- fold recognition
  given: a query sequence Q, a database of known folds
  do:
  - find fold F such that Q can be aligned with F in a highly compatible manner
  - return F as an approximation to Q's structure
- ab initio prediction
  given: a query sequence Q (assuming no similar sequence or fold is known)
  do: return a predicted structure S for Q
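The homology modeling recipe above can be sketched as a template search. This is only an illustration: it assumes the query has already been pairwise aligned to each candidate (gaps written as "-"), uses percent identity as the similarity measure, and applies a 30% cutoff that is an assumption rather than a rule from the lecture. The names `percent_identity` and `pick_template` and the sequences are hypothetical.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity over aligned columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    matches = sum(1 for a, b in pairs if a == b)
    return 100.0 * matches / len(pairs)


def pick_template(alignments, min_identity=30.0):
    """Return (name, identity) of the most similar template, or None.

    `alignments` maps a template name to (aligned_query, aligned_template).
    """
    scored = [(percent_identity(q, t), name) for name, (q, t) in alignments.items()]
    if not scored:
        return None
    identity, name = max(scored)
    return (name, identity) if identity >= min_identity else None


# Hypothetical pairwise alignments of one query against two known structures.
alignments = {
    "P1": ("MKTAYIAK", "MKTAYLAK"),
    "P2": ("MKTAYIAK", "GSHM-LPE"),
}
best = pick_template(alignments)
```

A real pipeline would find candidates with a database search tool and then build coordinates from the chosen template; this sketch covers only the "find P" step.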
Homology Modeling
- most pairs of proteins with similar structure are remote homologs (< 25% sequence identity)
- homology modeling usually doesn't work for remote homologs; most pairs of proteins with < 25% sequence identity are unrelated
[Figure: scale of pairwise sequence identity with ticks at 0%, 20%, 30%, and 100%; regions labeled, from low to high identity: probably unrelated, remote homologs, homologs]
Threading
- Form of fold recognition
- See prediction.ppt, from ai.stanford.edu/~serafim/CS262_2006/Slides/
Proteomics
- Microarrays are useful primarily because mRNA concentrations serve as a surrogate for protein concentrations
- We would like to measure protein concentrations directly, but at present cannot do so in the same high-throughput manner
  - Proteins do not have obvious direct complements
  - Could build molecules that bind, but binding is greatly affected by protein structure
Time-of-Flight (TOF) Mass Spectrometry (thanks Sean McIlwain)
- Measures the time for an ionized particle, starting from the sample plate, to hit the detector
[Figure labels: Detector, Laser, Sample, +V]
Time-of-Flight (TOF) Mass Spectrometry 2
- Matrix-Assisted Laser Desorption-Ionization (MALDI)
- Crystalloid structures made using proton-rich matrix molecule
- Hitting crystalloid with laser causes molecules to ionize and fly towards detector
[Figure labels: Detector, Laser, Sample, +V]
Time-of-Flight Demonstration 0
[Figure label: Sample Plate]
Time-of-Flight Demonstration 1
[Figure label: Matrix Molecules]
Time-of-Flight Demonstration 2
[Figure label: Protein Molecules]
Time-of-Flight Demonstration 3
[Figure labels: Laser, Detector, +10 kV, Positive Charge]
Time-of-Flight Demonstration 4
Laser pulsed directly onto sample
Proton kicked off matrix molecule onto another molecule
Time-of-Flight Demonstration 5
Lots of protons kicked off matrix ions, giving rise to more positively charged molecules
Time-of-Flight Demonstration 6
The high positive potential under the sample plate causes positively charged molecules to accelerate towards the detector
Time-of-Flight Demonstration 7
Smaller-mass molecules hit the detector first, while heavier ones are detected later
Time-of-Flight Demonstration 8
The incident time is measured from when the laser is pulsed until the molecule hits the detector
Time-of-Flight Demonstration 9
Experiment repeated a number of times, counting frequencies of flight times
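Why lighter ions arrive first can be made concrete with an idealized model. The sketch below assumes an ion of charge z gains kinetic energy z·e·V from the accelerating potential and then drifts at constant speed over a field-free path; the 1 m drift length is a hypothetical value, the 10 kV default echoes the figure label, and the time spent in the acceleration region is ignored.

```python
import math

ELEMENTARY_CHARGE = 1.602176634e-19  # coulombs
DALTON = 1.66053906660e-27           # kilograms per dalton


def flight_time(mass_da, charge=1, voltage=10_000.0, drift_length=1.0):
    """Idealized drift time in seconds for an ion of the given mass (in Da).

    From z*e*V = (1/2)*m*v^2, the drift speed is v = sqrt(2*z*e*V / m),
    so the flight time grows with the square root of mass over charge.
    """
    mass_kg = mass_da * DALTON
    energy = charge * ELEMENTARY_CHARGE * voltage
    velocity = math.sqrt(2.0 * energy / mass_kg)
    return drift_length / velocity


# Under this model, quadrupling the mass doubles the flight time.
ratio = flight_time(4000.0) / flight_time(1000.0)
```

Inverting the same relation is what lets an instrument convert a measured flight time into an M/Z value, usually after calibration against ions of known mass.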
Example Spectra from a Competition by Lin et al. at Duke
These are different fractions from the same sample.
[Figure: spectra plotted as intensity versus M/Z]
Trypsin-Treated Spectra
[Figure: spectrum plotted as frequency versus M/Z]
Many Challenges Raised by Mass Spectrometry Data
- Noise: extra peaks from handling of sample, from machine and environment (electrical noise), etc.
- M/Z values may not align exactly across spectra (resolution ~0.1%)
- Intensities not calibrated across spectra: quantification is difficult
- Cannot get all proteins; typically only several hundred. To improve odds of getting the ones we want, may fractionate our sample by 2D gel electrophoresis or liquid chromatography.
Challenges (Continued)
- Better results if partially digest proteins (break into smaller peptides) first
- Can be difficult to determine what proteins we have from spectrum
- Isotopic peaks: C13 and N15 atoms in varying numbers cause multiple peaks for a single peptide
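The isotopic-peak effect can be illustrated with a simple binomial model. This sketch considers carbon only, assumes each carbon is independently C13 with probability of about 1.07% (the approximate natural abundance), and ignores N15 and other isotopes; the function name and the 50-carbon peptide are hypothetical.

```python
import math

C13_ABUNDANCE = 0.0107  # approximate natural abundance of carbon-13


def isotope_envelope(n_carbons, max_extra=4, p=C13_ABUNDANCE):
    """Probability that a peptide carries k C13 atoms, for k = 0..max_extra.

    Each entry corresponds to a peak about one mass unit above the previous one.
    """
    return [
        math.comb(n_carbons, k) * p**k * (1.0 - p) ** (n_carbons - k)
        for k in range(max_extra + 1)
    ]


# Hypothetical peptide with 50 carbon atoms.
envelope = isotope_envelope(50)
```

For larger peptides the all-C12 peak is no longer the tallest, which is one reason learning to recognize isotopic distributions is useful.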
Handling Noise: Peak Picking
- Want to pick peaks that stand out from the noise signal with statistical significance
- Want to use these as features in our learning algorithms
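A very simple peak picker keeps local maxima that rise well above a robust estimate of the noise floor. The sketch below uses the median plus k times the median absolute deviation (MAD) as the threshold; this is a heuristic stand-in, not a formal significance test and not necessarily the method used in the lecture, and both `pick_peaks` and the toy spectrum are hypothetical.

```python
import statistics


def pick_peaks(intensities, k=5.0):
    """Indices of local maxima whose height exceeds median + k * MAD."""
    median = statistics.median(intensities)
    mad = statistics.median([abs(x - median) for x in intensities])
    threshold = median + k * mad
    peaks = []
    for i in range(1, len(intensities) - 1):
        y = intensities[i]
        # A peak must clear the noise threshold and be a local maximum.
        if y > threshold and y > intensities[i - 1] and y >= intensities[i + 1]:
            peaks.append(i)
    return peaks


# Hypothetical spectrum: low-level noise with two clear peaks.
spectrum = [1, 2, 1, 2, 30, 2, 1, 2, 1, 18, 1, 2]
peak_indices = pick_peaks(spectrum)
```

The picked indices (or their M/Z values and heights) can then serve as features. Note that the MAD is zero for perfectly flat noise, in which case any local maximum above the median is reported.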
Many Supervised Learning Tasks
- Learn to predict proteins from spectra, when the organism's proteome is known
- Learn to identify isotopic distributions
- Learn to predict disease from either proteins, peaks or isotopic distributions as features
- Construct pathway models
Using Mass Spectrometry for Early Detection of Ovarian Cancer [Petricoin et al., 2002]
- Ovarian cancer difficult to detect early, often leading to poor prognosis
- Trained and tested on mass spectra from blood serum
- 100 training cases, 50 with cancer
- Held-out test set of 116 cases, 50 with cancer
- 100% sensitivity, 95% specificity (63/66) on held-out test set
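The reported figures follow directly from the test-set counts. With 116 held-out cases, 50 with cancer and therefore 66 without, 100% sensitivity means all 50 cancers were detected, and 63/66 means 3 false positives. A minimal check (the function name is hypothetical):

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Return (sensitivity, specificity) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # fraction of true cases detected
    specificity = tn / (tn + fp)  # fraction of non-cases correctly cleared
    return sensitivity, specificity


# Counts implied by the held-out test set: 50 cancers, 66 non-cancers.
sens, spec = sensitivity_specificity(tp=50, fn=0, tn=63, fp=3)
```

Here `spec` is 63/66, about 0.955, which the slide reports as 95%.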
Not So Fast
- Data mining methodology seems sound
- But Keith Baggerly argues that cancer samples were handled differently than normal samples, and perhaps data were preprocessed differently too
- If we run cancer samples Monday and normals Wednesday, could get differences from machine breakdown or nearby electrical equipment that's running on Monday but not Wednesday
- Lesson: tell collaborators they must randomize samples for the entire processing phase, and of course all our preprocessing must be the same
- Debate is still raging; results not replicated in trials
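Randomizing the processing order is easy to do in code. The sketch below shuffles cases and controls together with a recorded seed so that the run order is reproducible and not tied to class label; the sample identifiers are hypothetical.

```python
import random


def randomized_run_order(sample_ids, seed=0):
    """Return the samples in a reproducible random processing order."""
    order = list(sample_ids)
    random.Random(seed).shuffle(order)
    return order


# Hypothetical sample identifiers for cases and controls.
samples = [f"cancer_{i}" for i in range(3)] + [f"normal_{i}" for i in range(3)]
run_order = randomized_run_order(samples, seed=42)
```

With many samples one might instead randomize within batches (for example, by day) so that each batch contains a similar mix of cases and controls.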
Other Proteomics: Interactions
Figure from Ideker et al., Science 292(5518):929-934, 2001
- each node represents a gene product (protein)
- blue edges show direct protein-protein interactions
- yellow edges show interactions in which one protein binds to DNA and affects the expression of another
Protein-Protein Interactions
- Yeast 2-Hybrid
- Immunoprecipitation
  - Antibodies (immuno) are made by combinatorially assembling certain protein components
  - Millions of antibodies can be made, to recognize a wide variety of different antigens (invaders), often by recognizing specific proteins
[Figure labels: antibody, protein]
Protein-Protein Interactions
[Figure panels: Immunoprecipitation; Co-Immunoprecipitation (antibody labeled in each)]
Many Supervised Learning Tasks
- Learn to predict protein-protein interactions: protein 3D structures may be critical
- Use protein-protein interactions in construction of pathway models
- Learn to predict protein function from interaction data
ChIP-Chip Data
- Immunoprecipitation can also be done to identify proteins interacting with DNA rather than other proteins
- Chromatin immunoprecipitation (ChIP): grab sample of DNA bound to a particular protein (transcription factor)
- ChIP-Chip: run this sample of DNA on a microarray to see which DNA was bound
- Example of analysis of such new data: Keles et al., 2006
Metabolomics
- Measures concentration of each low-molecular-weight molecule in sample
- These typically are metabolites, or small molecules produced or consumed by reactions in biochemical pathways
- These reactions are typically catalyzed by proteins (specifically, enzymes)
- These data typically also come from mass spectrometry, though NMR could also be used
Lipomics
- Analogous to metabolomics, but measuring concentrations of lipids rather than metabolites
- Could potentially help induce biochemical pathway information, or help with disease diagnosis or treatment choice
To Design a Drug:
- Identify Target Protein
  - Knowledge of proteome/genome
  - Relevant biochemical pathways
- Determine Target Site Structure
  - Crystallography, NMR
  - Difficult if Membrane-Bound
- Synthesize a Molecule that Will Bind
  - Imperfect modeling of structure
  - Structures may change at binding
- And even then...
Molecule Binds Target But May:
- Bind too tightly or not tightly enough.
- Be toxic.
- Have other effects (side-effects) in the body.
- Break down as soon as it gets into the body, or not leave the body soon enough.
- Not get to where it should in the body (e.g., crossing blood-brain barrier).
- Not diffuse from gut to bloodstream.
And Every Body is Different:
- Even if a molecule works in the test tube and works in animal studies, it may not work in people (will fail in clinical trials).
- A molecule may work for some people but not others.
- A molecule may cause harmful side-effects in some people but not others.
Typical Practice when Target Structure is Unknown
- High-Throughput Screening (HTS): Test many molecules (1,000,000) to find some that bind to target (ligands).
- Infer (induce) shape of target site from 3D structural similarities.
- Shared 3D substructure is called a pharmacophore.
- Perfect example of a machine learning task with spatial target.
An Example of Structure Learning
Inactive
Active
Common Data Mining Approaches
- Represent a molecule by thousands to millions of features and use standard techniques (e.g., KDD Cup 2001)
- Represent each low-energy conformer by feature vector and use multiple-instance learning (e.g., Jain et al., 1998)
- Relational learning
  - Inductive logic programming (e.g., Finn et al., 1998)
  - Graph mining
Supervised Learning Task
- Given: a set of molecules, each labeled by activity (binding affinity for target protein), and a set of low-energy conformers for each molecule
- Do: Learn a model that accurately predicts activity (may be Boolean or real-valued)
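One way to picture this task is the multiple-instance view: a molecule is a bag of conformers, and it is predicted active if at least one conformer scores highly. The sketch below shows only that data layout and prediction rule. The `Molecule` class, the single-feature conformer vectors, and `toy_score` (which pretends that a distance near 4.0 between two groups is what matters) are hypothetical and stand in for a learned model.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Molecule:
    name: str
    conformers: List[Sequence[float]]  # one feature vector per low-energy conformer
    active: bool                       # Boolean activity label


def predict_active(
    mol: Molecule,
    conformer_score: Callable[[Sequence[float]], float],
    threshold: float = 0.5,
) -> bool:
    """Multiple-instance rule: active if the best-scoring conformer clears the threshold."""
    return max(conformer_score(c) for c in mol.conformers) >= threshold


def toy_score(features: Sequence[float]) -> float:
    """Hypothetical scorer: rewards conformers whose first feature is near 4.0."""
    return 1.0 if abs(features[0] - 4.0) < 0.5 else 0.0


mol_a = Molecule("A", conformers=[[7.2], [4.1], [5.9]], active=True)
mol_b = Molecule("B", conformers=[[6.8], [7.5]], active=False)
predictions = {m.name: predict_active(m, toy_score) for m in (mol_a, mol_b)}
```

Learning then amounts to finding a conformer scorer whose bag-level predictions agree with the activity labels; for real-valued activity the same maximum over conformers can feed a regression model.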
Clinical Databases of the Future (Dramatically Simplified)
PatientID  Gender  Birthdate
P1         M       3/22/63

PatientID  Date    Physician  Symptoms      Diagnosis
P1         1/1/01  Smith      palpitations  hypoglycemic
P1         2/1/03  Jones      fever, aches  influenza

PatientID  Date    Lab Test       Result
P1         1/1/01  blood glucose  42
P1         1/9/01  blood glucose  45

PatientID  SNP1  SNP2  ...  SNP500K
P1         AA    AB    ...  BB
P2         AB    BB    ...  AA

PatientID  Date Prescribed  Date Filled  Physician  Medication  Dose  Duration
P1         5/17/98          5/18/98      Jones      prilosec    10mg  3 months
Final Wrap-up
- Molecular biology is collecting lots and lots of data in the post-genome era
- Opportunity to connect molecular-level information to diseases and treatment
- Need analysis tools to interpret
- Data mining opportunities abound
- Hopefully this tutorial provided a solid start toward applying data mining to high-throughput biological data