STRUCTURAL DATABASES
PDB, CSD, CATH
INTRODUCTION:
• Structural databases are essential tools for all crystallographic work.
• They are used when producing, solving, refining and publishing the structure of a new material.
THE COMMON INFORMATION FOUND IN A STRUCTURAL DATABASE INCLUDES:
• Bibliographic information: author names, journal reference.
• The chemical compound name, formula and oxidation states of the elements present.
• Number of formula units per unit cell (contents).
• Dimensions and symmetry of the unit cell.
• Symmetry of the structure (space group).
• Atomic coordinates, occupancies and thermal parameters.
• Any special features of the experiment used to collect the diffraction data.
• The structures in the database have been solved by X-ray, neutron or electron diffraction on a sample, by computational modelling, or by NMR.
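One way to picture the fields listed above is as a single record type. The sketch below is only an illustration: the class names `AtomSite` and `StructureEntry`, the field names, and the sodium chloride values are assumptions chosen for the example, not the schema of any particular database.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class AtomSite:
    """One atom position within the unit cell (hypothetical layout)."""
    label: str
    fractional_xyz: Tuple[float, float, float]
    oxidation_state: int = 0
    occupancy: float = 1.0
    thermal_parameter: float = 0.0  # isotropic, in square angstroms


@dataclass
class StructureEntry:
    """Hypothetical container mirroring the fields listed above."""
    authors: List[str]
    journal_reference: str
    compound_name: str
    formula: str
    formula_units_per_cell: int
    cell_lengths: Tuple[float, float, float]  # a, b, c in angstroms
    cell_angles: Tuple[float, float, float]   # alpha, beta, gamma in degrees
    space_group: str
    method: str
    atoms: List[AtomSite] = field(default_factory=list)
    experiment_notes: str = ""


# Illustrative values only; the bibliographic fields are placeholders.
entry = StructureEntry(
    authors=["A. N. Author"],
    journal_reference="(placeholder reference)",
    compound_name="sodium chloride",
    formula="NaCl",
    formula_units_per_cell=4,
    cell_lengths=(5.64, 5.64, 5.64),
    cell_angles=(90.0, 90.0, 90.0),
    space_group="F m -3 m",
    method="X-ray diffraction",
    atoms=[
        AtomSite("Na1", (0.0, 0.0, 0.0), oxidation_state=1),
        AtomSite("Cl1", (0.5, 0.5, 0.5), oxidation_state=-1),
    ],
)
print(entry.formula, entry.space_group, len(entry.atoms))
```

Real databases store far more detail per entry; the point here is only that each bullet maps naturally onto a typed field.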
PDB (PROTEIN DATA BANK):
• The Protein Data Bank contains information about the 3D structures of proteins.
• The structures are determined by methods such as X-ray crystallography or nuclear magnetic resonance (NMR) spectroscopy.
• The PDB is overseen by an organisation called the Worldwide Protein Data Bank (wwPDB).
• It is available at
• www.wwpdb.org
• www.pdbe.org
• www.pdbj.org
• Each entry in the PDB is given a unique identifier called the PDB ID. It is a 4-character code made up of alphanumeric characters (for example, 1CBS).
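A quick format check for the four-character PDB ID described above can be sketched as follows. The pattern assumes the classic convention of a digit from 1 to 9 followed by three alphanumeric characters; that detail and the function names `is_pdb_id` and `normalize_pdb_id` are assumptions of this example rather than something stated in the slides.

```python
import re

# Assumption: classic IDs are one digit (1-9) plus three alphanumerics.
_PDB_ID = re.compile(r"[1-9][A-Za-z0-9]{3}")


def is_pdb_id(text: str) -> bool:
    """Return True if text has the shape of a four-character PDB ID."""
    return _PDB_ID.fullmatch(text) is not None


def normalize_pdb_id(text: str) -> str:
    """Strip surrounding whitespace and upper-case a valid ID."""
    candidate = text.strip()
    if not is_pdb_id(candidate):
        raise ValueError(f"not a four-character PDB ID: {text!r}")
    return candidate.upper()


print(is_pdb_id("1CBS"), is_pdb_id("CBS1"), normalize_pdb_id(" 1cbs "))
```

A check like this only confirms the shape of the identifier; it does not tell you whether an entry with that ID exists.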
PDB FILE FORMAT:
The PDB file format is the standard file format for protein structure files. It describes how the atoms of a protein are arranged in its 3-D structure.
• A file contains hundreds or thousands of lines called records. Each record type provides a different kind of information, for example:
• HEADER: This record contains the classification of the molecule, the date of deposition and the PDB ID.
• TITLE: This record contains the title of the PDB entry.
• COMPND: This record contains the protein name.
• SOURCE: This record contains the name of the organism from which the protein was obtained.
• KEYWDS: This record contains keywords that describe the protein.
• EXPDTA: This record identifies the experimental method used to determine the structure.
• AUTHOR: This record contains the names of the contributors who deposited the data.
• REVDAT: This record contains the revision history of the entry (dates of modification).
• JRNL: This record contains the journal citation for the literature describing the structure.
• REMARK: This record contains remarks about the protein structure.
• DBREF: This record contains cross-references to the protein in sequence databases.
• SEQRES: This record contains the amino acid sequence of the protein.
• HET: This record contains details about the non-protein groups in the structure.
• HETNAM: This record contains the compound names of the non-protein groups.
• HETSYN: This record contains synonyms for the compound names of the non-protein groups.
• FORMUL: This record contains the chemical formulas of the non-protein groups.
• HELIX: This record identifies the helical substructures.
• LINK: This record identifies inter-residue bonds.
• ATOM: This record contains the atomic coordinates for the structure.
• HETATM: This record contains the atomic coordinates for the non-protein groups.
• CONECT: This record contains details of the bonds involving non-protein atoms.
• MASTER: This record contains counts of the records in the file (REMARK, HET, HELIX, CONECT, SEQRES, etc.).
• END: This record marks the end of the file.
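Because every line starts with its record name, a file can be summarised by reading the first six columns of each line. The sketch below counts record types and pulls x, y, z from a coordinate line, assuming the usual fixed-column layout in which the coordinates occupy columns 31-54. The helper names and the coordinate values are made up for illustration.

```python
from collections import Counter


def record_name(line: str) -> str:
    """Columns 1-6 of a PDB line hold the record name."""
    return line[:6].strip()


def count_records(lines):
    """Count how many lines of each record type are present."""
    return Counter(record_name(line) for line in lines if line.strip())


def format_coordinate_line(record, serial, atom, residue, chain, res_seq, x, y, z):
    """Build an ATOM/HETATM line with x, y, z in columns 31-54."""
    fmt = ("{:<6s}{:>5d} {:^4s} {:>3s} {:1s}{:>4d}" + " " * 4
           + "{:8.3f}{:8.3f}{:8.3f}")
    return fmt.format(record, serial, atom, residue, chain, res_seq, x, y, z)


def parse_coordinates(line: str):
    """Read x, y, z (in angstroms) from an ATOM or HETATM line."""
    return float(line[30:38]), float(line[38:46]), float(line[46:54])


# Illustrative lines; the coordinates are invented for the example.
atom_line = format_coordinate_line("ATOM", 1, "CA", "ALA", "A", 1,
                                   12.345, -6.789, 0.5)
water_line = format_coordinate_line("HETATM", 2, "O", "HOH", "A", 201,
                                    1.0, 2.0, 3.0)
lines = [
    "HEADER    RETINOIC-ACID TRANSPORT",
    "COMPND    CELLULAR RETINOIC-ACID-BINDING PROTEIN TYPE II COMPLEXED",
    "COMPND   2 WITH ALL-TRANS-RETINOIC ACID (THE PRESUMED PHYSIOLOGICAL",
    atom_line,
    water_line,
    "END",
]
counts = count_records(lines)
print(counts)
print(parse_coordinates(atom_line))
```

Slicing by column rather than splitting on whitespace matters here, because adjacent fields in a PDB line are not always separated by spaces.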
THE PDB FORMAT
123456789+123456789+123456789+123456789+123456789+123456789+123456789+123456789+
HEADER    RETINOIC-ACID TRANSPORT                 28-SEP-94   1CBS      1CBS   2
COMPND    CELLULAR RETINOIC-ACID-BINDING PROTEIN TYPE II COMPLEXED      1CBS   3
COMPND   2 WITH ALL-TRANS-RETINOIC ACID (THE PRESUMED PHYSIOLOGICAL     1CBS   4
COMPND   3 LIGAND)                                                      1CBS   5
SOURCE    HUMAN (HOMO SAPIENS)                                          1CBS   6
SOURCE   2 EXPRESSION SYSTEM: (ESCHERICHIA COLI) BL21 (DE3)             1CBS   7
SOURCE   3 PLASMID: PET-3A                                              1CBS   8
SOURCE   4 GENE: HUMAN CRABP-II                                         1CBS   9
AUTHOR    G.J.KLEYWEGT,T.BERGFORS,T.A.JONES                             1CBS  10
REVDAT   1   26-JAN-95 1CBS    0                                        1CBS  11
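The 1CBS excerpt above can be read by column position. The sketch below assumes the usual HEADER layout (classification in columns 11-50, deposition date in 51-59, ID code in 63-66) and assumes that in this older style of file the text of a record ends at column 70, with the entry ID and a line serial in columns 73-80. The helper names are hypothetical, and the lines are rebuilt with padding helpers so that the column positions are explicit.

```python
def record(name: str, text: str, continuation: str = "") -> str:
    """Record name in columns 1-6, continuation number right-aligned to column 10."""
    return name.ljust(6) + continuation.rjust(4) + text


def old_style_line(body: str, serial: int) -> str:
    """Pad the body to column 72, then append the entry ID and line serial."""
    return body.ljust(72) + "1CBS" + str(serial).rjust(4)


header_body = record(
    "HEADER", "RETINOIC-ACID TRANSPORT".ljust(40) + "28-SEP-94" + "   " + "1CBS"
)
LINES = [
    old_style_line(header_body, 2),
    old_style_line(record(
        "COMPND", "CELLULAR RETINOIC-ACID-BINDING PROTEIN TYPE II COMPLEXED"), 3),
    old_style_line(record(
        "COMPND", " WITH ALL-TRANS-RETINOIC ACID (THE PRESUMED PHYSIOLOGICAL", "2"), 4),
    old_style_line(record("COMPND", " LIGAND)", "3"), 5),
]


def parse_header(line: str) -> dict:
    """Split a HEADER line into its three fixed-column fields."""
    return {
        "classification": line[10:50].strip(),
        "deposition_date": line[50:59].strip(),
        "pdb_id": line[62:66].strip(),
    }


def join_record(lines, name: str) -> str:
    """Join the text of a record that continues over several lines."""
    parts = [line[10:70].strip() for line in lines if line[:6].strip() == name]
    return " ".join(parts)


print(parse_header(LINES[0]))
print(join_record(LINES, "COMPND"))
```

The deposition date is kept as a string here, since turning a two-digit year into a full date requires a choice of century that the record itself does not make.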
CATH:
• CATH stands for Class, Architecture, Topology and Homologous superfamily. It is a classification database for protein structures.
• It was created by Janet Thornton and colleagues at University College London.
• It is available at
• http://www.biochem.ucl.ac.uk/bsm/cath
• http://www.cathdb.info
IT CONSISTS OF FOUR LEVELS
• Class: Groups proteins by their secondary structure content (alpha, beta, alpha/beta, etc.).
• Architecture: Describes the gross orientation of the secondary structures, and so the overall shape of the folded polypeptide chain.
• Topology: Groups structures by the topological arrangement of their secondary structures, that is, by both overall shape and connectivity. Groups at this level are also called fold families.
• Homologous superfamily: Groups proteins whose sequences and structures are similar enough to suggest a common ancestor. It helps to trace evolutionary relationships among proteins.
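The four levels form a hierarchy, so a classification can be written as four numbers, one per level. The sketch below assumes the dotted "C.A.T.H" notation and the class numbering 1 to 4 used by CATH; neither appears in the slides, and the two identifiers compared at the end are chosen purely for illustration. `CathId` and `shared_level` are hypothetical names.

```python
from dataclasses import astuple, dataclass

LEVELS = ("Class", "Architecture", "Topology", "Homologous superfamily")

# Assumed class numbering; not given in the slides.
CLASS_NAMES = {
    1: "mainly alpha",
    2: "mainly beta",
    3: "alpha beta",
    4: "few secondary structures",
}


@dataclass(frozen=True)
class CathId:
    class_: int
    architecture: int
    topology: int
    superfamily: int

    @classmethod
    def parse(cls, text: str) -> "CathId":
        """Parse a dotted identifier such as '2.40.128.20'."""
        parts = text.split(".")
        if len(parts) != 4:
            raise ValueError(f"expected four dot-separated numbers: {text!r}")
        return cls(*(int(part) for part in parts))


def shared_level(a: CathId, b: CathId):
    """Return the deepest level at which two classifications agree, or None."""
    deepest = None
    for name, x, y in zip(LEVELS, astuple(a), astuple(b)):
        if x != y:
            break
        deepest = name
    return deepest


first = CathId.parse("2.40.128.20")
second = CathId.parse("2.40.50.140")
print(CLASS_NAMES[first.class_], shared_level(first, second))
```

Two domains that agree down to the last number sit in the same homologous superfamily; agreement that stops earlier indicates a progressively looser structural relationship.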
CATH
• CATH aims to provide an official release of its protein structure classification every 12 months.
• It is a free, publicly available online resource.
• The release of CATH described here contains 114,215 domains, 2,178 homologous superfamilies and 1,110 fold groups.
THE CATH SERVER
• CATH has set up a server that lets users submit the coordinates of a newly determined structure for automatic classification in CATH.
DOMAIN BOUNDARIES AND SEQUENCE COMPARISON:
• The server uses the DETECTIVE program, which is good at identifying multidomain proteins.
• The results from DETECTIVE are returned to the user in less than a minute.
• The identified domains are scanned against non-identical representatives from CATH using a global sequence alignment method.
• If the sequence identity is 95% or higher, the domain is treated as identical to one already in CATH.
• If the sequence identity is less than 30%, the structure is compared against all the sequence families (S-level).
ASSESSING STRUCTURAL SIMILARITY:
• TOPSCAN compares the secondary structures with those in each fold family to identify the fold families to which the new structure may belong.
• Subsequently, a fast version of the structure comparison method SSAP scans representatives from all the families.
• Structural pairs with an SSAP score above 80 are possible homologues, while pairs scoring 70-80 have no sequence or functional similarity.
• Finally, the SSAP structural alignment is displayed using a graphical display package.
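The thresholds quoted for the server can be read as a two-stage decision. The sketch below only encodes the numbers given above; it does not run DETECTIVE, TOPSCAN or SSAP. The function names are hypothetical, treating "95%" as "95% or higher" is an assumption, and the ranges the slides do not mention are returned as "not described".

```python
def sequence_stage(identity_percent: float) -> str:
    """Decide what the sequence comparison implies for a new domain."""
    if identity_percent >= 95:  # assumption: 95% is an inclusive lower bound
        return "identical to an existing CATH domain"
    if identity_percent < 30:
        return "compare structures against the sequence families"
    return "not described"  # 30-95% is not covered in the text


def structure_stage(ssap_score: float) -> str:
    """Interpret an SSAP score using the thresholds quoted above."""
    if ssap_score > 80:
        return "possible homologue"
    if 70 <= ssap_score <= 80:
        return "no sequence or functional similarity"
    return "not described"  # scores below 70 are not covered in the text


for identity, score in [(98.0, None), (22.0, 85.0), (22.0, 74.0)]:
    decision = sequence_stage(identity)
    if score is not None and decision.startswith("compare"):
        decision = structure_stage(score)
    print(identity, score, decision)
```

Keeping the two stages as separate functions mirrors the order in which the server works: the structure comparison is only reached when the sequence comparison finds no close match.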
CSD
• The Cambridge Structural Database (CSD) is both a repository and a validated resource for 3-D structural data of molecules containing carbon and hydrogen.
• It is used to look up the structures of organic, metal-organic and organometallic molecules.
• Its entries are complementary to those of the PDB and the Inorganic Crystal Structure Database (ICSD).
• The data in the CSD are typically obtained by X-ray crystallography and, less frequently, by neutron diffraction.
• The data in the CSD are submitted by crystallographers and chemists from all over the world.
• The CSD is maintained by an incorporated company, the Cambridge Crystallographic Data Centre (CCDC).
• Structures deposited with the CCDC are publicly available for download at the point of publication.
• The CSD is updated with about 50,000 new structures each year, which are freely available to support teaching and other activities.
• The CSD is available at
• www.ccdc.cam.ac.uk
• webcsd.ccdc.cam.ac.uk
STRUCTURAL DATABASE APPLICATIONS
• Prediction, analysis, mining, comparison, classification, structure refinement, database annotation.