Intro To Bioinformatics

Introduction to bioinformatics


Prof. Naglaa Abdallah
Course contents
• 12 weeks
• Lectures, exercise, discussion.
• Materials (presentations, links, books, etc.)

Class Structure
2 hours lecture
1 hour tutorial
Grading
• Final exam 60%
• Practical exam 20%
• Quizzes (Homework assignments), midterm 10%
• Oral 10%
Course description
An introduction to theory and practice of Bioinformatics and
computational biology

Goals for the course:


The course will familiarize the students with the tools and principles of
contemporary bioinformatics. By the end of the course, students will
have a working knowledge of a variety of publicly available data and
computational tools important in bioinformatics and a grasp of the
underlying principles that is adequate for them to evaluate and use
novel techniques as they arise in the future.
What is Bioinformatics?
• Bioinformatics: collection and storage of biological
information
• It is the field of science in which biology, computer
science, and information technology merge to form a
single discipline
• Computational biology: development of statistical models
to analyze biological data
• Ultimate goal: to enable the discovery of new biological
insights as well as to create a global perspective from
which unifying principles in biology can be discerned
• Bioinformatics: any use of computers to handle biological information.
• Bioinformatics (Oxford English Dictionary): The branch of science
concerned with information and information flow in biological systems,
esp. the use of computational methods in genetics and genomics.
• Molecular bioinformatics: involves the use of computational tools to
discover new information in complex data sets (from the one-dimensional
information of DNA through the two-dimensional information of RNA
and the three-dimensional information of proteins, to the four-dimensional
information of evolving living systems).
The field of science in which biology, computer science and
information technology merge into a single discipline
Biologists
Collect molecular data: DNA and protein sequences,
gene expression, etc.

Bioinformaticians
Study biological questions by analyzing molecular data.

Computer scientists (+ mathematicians, statisticians, etc.)
Develop tools, software, and algorithms
to store and analyze the data.
• Store data
• Create interfaces to the data
• Build tools to analyze data
The objective of biological experimentation is not just to generate
biological data, but also to analyze the data and extract information and
knowledge from it. The high complexity and volume of these data
require the use of computers for storage and analysis,
and make bioinformatics an integral part of modern molecular
science.
Origin of Bioinformatics
• Bioinformatics started in the 1960s; the term ‘Bioinformatics’
appeared in 1989.
• Bioinformatics was created to serve molecular biology, the
study of life at the molecular level.
• It was enabled by the development of fast computers and good
algorithmic techniques.
• Bioinformatics is the application of many sciences (applied
mathematics, statistics, physics, biology, genetics, and
biochemistry).
History of Bioinformatics
1953: DNA structure discovered.
1956: First protein sequenced (insulin).
1960: Assembly of protein sequence databases.
1972: Protein Data Bank (PDB).
1977: Sanger sequencing technique developed.
1979: First DNA Data Bank (GenBank).
1987: Multiple sequence alignment.
1988: National Center for Biotechnology Information (NCBI) is
established.
History of Bioinformatics
1990: Human Genome Project started.
1990: BLAST program introduced by Altschul and colleagues, using
the statistics of Karlin and Altschul.
1993: The first genome database (C. elegans).
1995: Haemophilus influenzae genome sequenced (1.8 Mb).
2000: Drosophila genome sequenced (180 Mb).
2001: The human genome (3,000 Mbp) is published.
The evolution of bioinformatics as seen in the 90’s
The requirements of bioinformatics
• Data collection techniques (DNA sequencing, protein
sequencing, microarrays)
• Theoretical concepts (concepts of DNA structure, protein
structure, evolution)
• Programs (BLAST, FASTA)
• Databases
• Institutions
• Complex genomic and high-throughput data
The importance of Bioinformatics
• Application areas include
• Medicine
• Pharmaceutical drug design
• Toxicology
• Molecular evolution
• Biosensors
• Biomaterials
• Biological computing models
• DNA computing
What could Bioinformatics offer?
Analyze and interpret biological data: genomic sequences,
transcriptomic sequences, RNA structures, and protein sequences and
structures
Develop new algorithms and tools to: assess biological
information, handle large datasets, find relationships between data
sources, etc.
Basic science:
- Understand the living cell
- Find the function of a new protein
- Find the genes/proteins that are unique to humans
Medical applications: identify the mutations (SNPs) that cause
genetic diseases, diagnose disease, and find and develop new and better
drugs
Agricultural applications: higher-yield crops, increased shelf life, etc.
• Structural genomics is a field of genomics that involves the
characterization of genome structures.
• This knowledge can be useful in the practice of manipulating the
genes and DNA segments of a species.
• Functional genomics is a field of molecular biology that
attempts to describe gene functions and interactions.
• Functional genomics makes use of the large data sets generated by
genomic and transcriptomic projects.
Bioinformatics could be used to:
• Sequence complete genomes
• Identify protein-coding regions
• Identify unique genes
• Gene knockout
• Functional analysis (phenotype,
detailed functional characterization, etc.)
• Structural studies, drug development

• The most logical way to look at how bioinformatics assists
molecular biology is to start from the central dogma.
• Bioinformatics plays a role at each stage of the central dogma of
molecular biology.

First, there is DNA.

• DNA is the most basic data gathered from molecular experiments,
and the data types associated with DNA are genomes, genes, and gene
features.

Central Paradigm in Molecular Biology

Gene (DNA) → mRNA → Protein

21st century:

Genome → Transcriptome → Proteome


The second part of the dogma is mRNA.

• Data is generated in many areas of experimentation involving
mRNA, including the expression levels of the mRNA.
• Typically these would be microarray experiments.
• There is also data associated with the structure of RNA, and data
associated with other kinds of RNA.
• These include ribosomal RNA, transfer RNA, and studies involving
RNAi.

Next are data associated with proteins.

• Especially in modern molecular biology and biochemistry,
proteomics is a growing field with many resources being allocated
to the molecular study of proteins.
• The data types most associated with proteins are sequence data,
structural data, and phylogenetic data.

• Today, the field of systems biology is very high on the list of important research
topics. It studies the complex interactions of genes, proteins, and
other cellular elements, and is very important for advancing
knowledge of how life functions. Metabolic pathway
determination and modeling is a very important aspect of systems biology.
• The highest level at which molecular biologists work is the
phenotypic level, trying to explain phenotypes from the genetic and
proteomic makeup of cells.
• Human disease as a result of genetic flaws is a big topic of study,
and bioinformatics plays a major role in determining the causes of genetic
disease, as well as helping with the search for cures for these conditions.
• The final resource type, genome browsers, exists only in silico (inside
computers). These resources provide an integrated look at entire organisms
and at how everything known about their biology is interrelated.
Central Dogma and Genome Browsers
Integrated Resources

DNA        RNA                            Protein         Systems              Phenotype
Genomes    Expression                     Sequence        Metabolic pathways   Disease
Genes      Structure                      Structure
Features   Other RNA (tRNA, rRNA, RNAi)   Phylogenetics

DNA

Genomes
• Automated sequencing
• 200 million by 1998
• 1.5 billion by 2003 (Human Genome complete)

• Roche 454: >1 Gb/day
• Cost drops too
• ATCGATCGATCATGCTAGCTAGCTAGCTAGCTAGCG
CTATGCTAGCTCGTGCTAGCATGATCGATCATG.......
DNA

TCATCGGTCATGCATGC TCATCGGTCATGCAATCGA

TCATCGGTCA ACCTGTGTTCATCGGTCATGC
TCATCGGTCATGC TCATCGGTCA
TCATCGGTCATGCACGGTTA
TCATCGGTCATGC
• Huge number of sequences from library (1,000,000,000 clones)
• Sequence must be determined
• Must be assembled
• Must be stored
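
The "must be assembled" step can be illustrated with a minimal sketch of greedy overlap assembly. It is a toy under stated assumptions, not how production assemblers work: the reads are hypothetical (loosely modeled on the fragments shown above), they are error-free and from one strand, and the function names `overlap` and `greedy_assemble` are made up for this example.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for size in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0


def greedy_assemble(reads, min_len: int = 3):
    """Repeatedly merge the pair of reads with the largest overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_size, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    size = overlap(a, b, min_len)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            break  # no remaining overlaps: leave separate contigs
        merged = reads[best_i] + reads[best_j][best_size:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads


# Hypothetical reads, chosen so that consecutive reads overlap by 4 bases.
example_reads = ["ACCTGTGTTC", "GTTCATCGGT", "CGGTCATGCA"]
contigs = greedy_assemble(example_reads)
print(contigs)
```

The quadratic pairwise comparison is acceptable only for a handful of reads; with a billion clones, real assemblers rely on indexed data structures instead.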
Genome storage: what do we need?
Computational tools are needed to distill pathways of interest
from large molecular interaction databases

• Need modern computer systems
• Files
• Specific formats
• Databases
• Manage data
• Searchable
• Fast, reliable, available
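
As one example of a "specific format", sequences are very often stored as FASTA text: a header line starting with `>` followed by one or more sequence lines. The sketch below is a minimal reader, assuming well-formed input; the function name `parse_fasta` and the record names are hypothetical, and real projects would normally use an established library instead.

```python
def parse_fasta(text: str) -> dict:
    """Parse FASTA-formatted text into {record_id: sequence}."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:].split()
            name = header[0] if header else ""  # id is the first word
            records[name] = []
        elif name is not None:
            records[name].append(line.upper())
    # Join the wrapped sequence lines of each record into one string.
    return {n: "".join(parts) for n, parts in records.items()}


# Hypothetical two-record FASTA text.
example = """>seq1 hypothetical example
ATCG
atcg
>seq2
GGCC
"""
print(parse_fasta(example))
```

Keeping records in a dictionary makes them searchable by identifier, which is the smallest version of the "searchable" requirement listed above.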
DNA

• Genes (Eukaryotic cell)


DNA

• Genome annotation
• Finding genes and the features associated with the genes
• Must be done on computers

TTTATATAGGAGGAGATTCTCGCGATATCGGATCATCGAGCGCGCGCTATATATAGCTAGCTAGCTAGCTACTAGATCATCGATCGATCGCGCGATTACGATGCTAGCTACTACGGAAT
TACGATCGATCGATGCATCGATGCATGCATGCATGCATGCTAGCTAGCTAGCTAGCTAGCATGCTAGCTAGCTGATCGATCGGCGATCGATGCATCGATGCATCGATCGTAGCTAGCT
AAAGAGAGAGAGATCTCTCTTATAATTATAGCGCGATATATATGCGCATATATATATGCGATCATCGACTGCGCTATATACGATCGATCTAGCATCTAGCGCTATATACGATCATCTAA
TATGCTAGCTACTACTATCATCGATGCTATCAGCTCGGCGCAATTATATATATTTTCTCTTATATAACTCGATAGCTACTACTACCTACTAGTAGCTAGCTAGCTAGCTAGCATCGTACG
TTTATATAGGAGGAGATTCTCGCGATATCGGATCATCGAGCGCGCGCTATATATAGCTAGCTAGCTAGCTACTAGATCATCGATCGATCGCGCGATTACGATGCTAGCTACTACGGAAT
TACGATCGATCGATGCATCGATGCATGCATGCATGCATGCTAGCTAGCTAGCTAGCTAGCATGCTAGCTAGCTGATCGATCGGCGATCGATGCATCGATGCATCGATCGTAGCTAGCT
AAAGAGAGAGAGATCTCTCTTATAATTATAGCGCGATATATATGCGCATATATATATGCGATCATCGACTGCGCTATATACGATCGATCTAGCATCTAGCGCTATATACGATCATCTAA
TATGCTAGCTACTACTATCATCGATGCTATCAGCTCGGCGCGCATTATATATATTTTCTCTTCTCTCTCTCGATAGCTACTACTAGCTACTAGTAGCTAGCTAGCTAGCTAGCATCGTACG
TTTATATAGGAGGAGATTCTCGCGATATCGGATCATCGAGCGCGCGCTATATATAGCTAGCTAGCTAGCTACTAGATCATCGATCGATCGCGCGATTACGATGCTAGCTACTACGGAAT
TACGATCGATCGATGCATCGATGCATGCATGCATGCATGCTAGCTAGCTAGCTAGCTAGCATGCTAGCTAGCTGATCGATCGGCGATCGATGCATCGATGCATCGATCGTAGCTAGCT
AAAGAGAGAGAGATCTCTCTTATAATTATAGCGCGATATATATGCGCATATATATATGCGATCATCGACTGCGCTATATACGATCGATCTAGCATCTAGCGCTATATACGATCATCTAA
TATGCTAGCTACTACTATCATCGATGCTATCAGCTCGGCGCGCATTATATATATTTTCTCTTCTCTCTCTCGATAGCTACTACTAGCTACTAGTAGCTAGCTAGCTAGCTAGCATCGTACG
TTTATATAGGAGGAGATTCTCGCGATATCGGATCATCGAGCGCGCGCTATATATAGCTAGCTAGCTAGCTACTAGATCATCGATCGATCGCGCGATTACGATGCTAGCTACTACGGAAT
TACGATCGATCGATGCATCGATGCATGCATGCATGCATGCTAGCTAGCTAGCTAGCTAGCATGCTAGCTAGCTGATCGATCGGCGATCGATGCATCGATGCATCGATCGTAGCTAGCT
AAAGAGAGAGAGATCTCTCTTATAATTATAGCGCGATATATATGCGCATATATATATGCGATCATCGACTGCGCTATATACGATCGATCTAGCATCTAGCGCTATATACGATCATCTAA
TATGCTAGCTACTACTATCATCGATGCTATCAGCTCGGCGCGCATTATATATATTTTCTCTTCTCTCTCTCGATAGCTACTACTAGCTACTAGTAGCTAGCTAGCTAGCTAGCATCGTACG
TTTATATAGGAGGAGATTCTCGCGATATCGGATCATCGAGCGCGCGCTATATATAGCTAGCTAGCTAGCTACTAGATCATCGATCGATCGCGCGATTACGATGCTAGCTACTACGGAAT
TACGATCGATCGATGCATCGATGCATGCATGCATGCATGCTAGCTAGCTAGCTAGCTAGCATGCTAGCTAGCTGATCGATCGGCGATCGATGCATCGATGCATCGATCGTAGCTAGCT
AAAGAGAGAGAGATCTCTCTTATAATTATAGCGCGATATATATGCGCATATATATATGCGATCATCGACTGCGCTATATACGATCGATCTAGCATCTAGCGCTATATACGATCATCTAA
TATGCTAGCTACTACTATCATCGATGCTATCAGCTCGGCGCGCATTATATATATTTTCTCTTCTCTCTCTCGATAGCTACTACTAGCTACTAGTAGCTAGCTAGCTAGCTAGCATCGTACG
[Figure: annotated gene region. Labeled features: promoter, TF binding
site, transcription start site, ribosome binding site]

CCTGACAAATTCGACGTGCGGCATTGCATGCAGACGTGCATG
CGTGCAAATAATCAATGTGGACTTTTCTGCGATTATGGAAGAA
CTTTGTTACGCGTTTTTGTCATGGCTTTGGTCCCGCTTTGTTC
AGAATGCTTTTAATAAGCGGGGTTACCGGTTTGGTTAGCGAGA
AGAGCCAGTAAAAGACGCAGTGACGGAGATGTCTGATG CAA
TAT GGA CAA TTG GTT TCT TCT CTG AAT
.................................
.............. TGAAAAACGTA

ORF = Open Reading Frame
CDS = Coding Sequence
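
Finding open reading frames is one of the simplest annotation steps to automate. The sketch below scans the three forward reading frames for stretches that begin with ATG and end at the first in-frame stop codon. It is illustrative only: it ignores the reverse strand, introns, and alternative start codons, and the name `find_orfs` and the test sequence are assumptions for this example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna: str, min_codons: int = 2):
    """Return (start, end, sequence) for each ATG...stop ORF on the forward strand.

    `start` and `end` are 0-based, end-exclusive, and the stop codon is
    included in the returned sequence. `min_codons` counts the codons
    before the stop codon, including the ATG.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(dna) - 2, 3):
            codon = dna[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos  # first start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (pos - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3, dna[start:pos + 3]))
                start = None  # look for the next ORF in this frame
    return orfs


# Hypothetical sequence with a single short ORF in the third frame.
print(find_orfs("CCATGGCCTGATAA"))
```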

RNA

• Polynucleotide molecules
• mRNA is the best known
• Carries gene information for translation
• Expression analysis for genes
• Comparison between different environmental stimuli or different
cellular states
• Level of protein production is taken to be roughly proportional to
RNA level, so RNA levels give a good indication of gene expression
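
Comparing expression between two conditions is commonly summarized as a log2 fold change per gene. The sketch below shows the arithmetic only, under stated assumptions: the gene names and counts are invented, the pseudocount of 1 is an arbitrary choice to avoid division by zero, and no normalization or statistical testing is done.

```python
import math


def log2_fold_change(control: float, treated: float, pseudocount: float = 1.0) -> float:
    """log2 of the treated/control ratio, with a pseudocount added to both."""
    return math.log2((treated + pseudocount) / (control + pseudocount))


def classify(control: dict, treated: dict, threshold: float = 1.0) -> dict:
    """Label each gene 'up', 'down', or 'unchanged' by its log2 fold change."""
    labels = {}
    for gene in control:
        change = log2_fold_change(control[gene], treated[gene])
        if change >= threshold:
            labels[gene] = "up"
        elif change <= -threshold:
            labels[gene] = "down"
        else:
            labels[gene] = "unchanged"
    return labels


# Hypothetical expression levels in two cellular states.
control = {"geneA": 7, "geneB": 15, "geneC": 10}
treated = {"geneA": 31, "geneB": 3, "geneC": 11}
print(classify(control, treated))
```

A threshold of 1 on the log2 scale corresponds to a two-fold change in either direction.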


RNA

• Structure-function relationship is important
• Levels of structure: primary, secondary, tertiary

Primary structure (sequence):
5’AACUCGAGC
UACUAGCUAG
GCGCGUUAAU
UAUCGUACUA
UAGCUACUAC
UUCGCGUAAU
UAUUACGAUG
UUCGGCUAGA
UUAGCGAUAU
UAUUACGAUA
UAUAUGCGCA
UAUCAGAUU3’

Protein
• Relatively easy to find DNA sequence
• More than 200 organisms sequenced
• Putative proteomes are abundant
• Bioinformatics: computational tools to determine protein
structure and function from sequence
• Proteomics → analyze protein sequences
• Translation of RNA into proteins. The process of genome
sequencing is relatively simple, and to date several organisms have had
their entire genome sequenced.

• Through various annotation processes using computers, it is possible
to predict the genes, and the proteins they will encode.
• This creates the need for bioinformatics tools to computationally
determine protein structure and function from sequence, with the end
goal being that the accuracy of predicting the structure and
function of proteins would be extremely high.
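
Predicting the protein a gene encodes starts with translating its coding sequence codon by codon. The sketch below builds the standard genetic code from the usual compact 64-letter string and translates until the first stop codon. It assumes the standard code and an in-frame coding sequence; the name `translate` and the example sequences are made up for illustration.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds: str) -> str:
    """Translate a coding sequence (DNA or RNA) up to the first stop codon."""
    cds = cds.upper().replace("U", "T")  # accept mRNA input as well
    protein = []
    for pos in range(0, len(cds) - 2, 3):
        amino_acid = CODON_TABLE.get(cds[pos:pos + 3], "X")  # X = unknown codon
        if amino_acid == "*":
            break  # stop codon ends translation
        protein.append(amino_acid)
    return "".join(protein)


print(translate("ATGGCCTGA"))
print(translate("AUGUUUUAA"))
```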
Protein function

• Classify proteins (database of protein motifs)
• Choose and express representative proteins from all families
• Determine structure by X-ray crystallography
• Predict the rest by homology modeling

• When comparing protein sequences, it can be assumed that
proteins with similar sequences should have similar
functions.
• This assumption also holds for subsections of a particular sequence, not
only for entire sequences.
• In proteomics, many discoveries about the specific function of a
particular stretch of sequence (motif) have been made. These
discoveries were stored and are used today in sequence
function assignment.
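
"Similar sequences" is usually quantified as percent identity over an alignment. The sketch below computes it for two sequences that are assumed to be already aligned to the same length, with `-` marking gaps; producing the alignment itself is a separate problem that is not shown. The function name and the peptide strings are hypothetical.

```python
def percent_identity(a: str, b: str) -> float:
    """Percent of aligned (non-gap) positions where the residues match."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    # Keep only columns where neither sequence has a gap.
    pairs = [(x, y) for x, y in zip(a.upper(), b.upper()) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    matches = sum(x == y for x, y in pairs)
    return 100.0 * matches / len(pairs)


# Hypothetical aligned peptides differing at one position.
print(percent_identity("MKTAYIAK", "MKTAHIAK"))
```

Ignoring gap columns is one of several conventions; other definitions divide by the full alignment length or by the shorter sequence, so reported identities should state which was used.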
• Protein structures have been available in the public domain for
longer than any other type of biological data, with the first public
repository being created in the early 1970s.
• More than 33,000 structures are available for download via the
internet.
• These structures can be downloaded for the purpose of homology
modeling.
• If two sequences are homologous and functionally similar,
then their three-dimensional structures should be similar as well.
This approach can be useful in drug discovery, or simply in studies
of how a particular protein functions.
Protein

• Similar sequence → similar function
• Smaller stretches of sequence carry similar function
• Motifs or signature sequences
• DNA binding motifs

[Figure: Sequence A and Sequence B]
Protein

• Motifs and signatures → identify unknown proteins
• Search protein database for proteins with probable functionality
• Databases of protein signatures and motifs
• Pfam, Prosite, Prints, BLOCKS
• Various methods of representation
• HMMs - Pfam
• Regular expressions - Prosite
• PSSMs - BLOCKS
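
The regular-expression representation can be sketched by converting a PROSITE-style pattern into a Python regular expression and scanning a protein with it. The converter below handles only the basic syntax (`x`, `[..]`, `{..}`, `(n)` and `(n,m)` repeats, `<` and `>` anchors) and is not a complete implementation of the PROSITE pattern language. The pattern shown is the N-glycosylation site pattern; the protein string and function names are hypothetical.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Convert a basic PROSITE-style pattern into a Python regex string."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        element = element.replace("{", "[^").replace("}", "]")  # {P} -> [^P]
        element = element.replace("(", "{").replace(")", "}")   # x(2,4) -> x{2,4}
        element = element.replace("x", ".")                     # x = any residue
        element = element.replace("<", "^").replace(">", "$")   # terminus anchors
        parts.append(element)
    return "".join(parts)


def find_motif(pattern: str, protein: str):
    """Return 0-based start positions of all (possibly overlapping) matches."""
    regex = prosite_to_regex(pattern)
    # A lookahead lets overlapping occurrences be reported.
    return [m.start() for m in re.finditer(f"(?=({regex}))", protein)]


# N-glycosylation site pattern applied to a hypothetical protein.
print(prosite_to_regex("N-{P}-[ST]-{P}"))
print(find_motif("N-{P}-[ST]-{P}", "MKNGSAANPSA"))
```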
Systems

• Knowing the function of one protein is not enough. If we are to
understand the way life functions, we must know and understand the way
in which all proteins in the cell function together.
• We must understand the complex relationships between the
molecules in the cell that make it function.
• The first step is to understand the metabolic
pathways.
• These are networks into which functional groups of proteins are divided.
• We must then add all the different pathways and their effects
together.
• The field of study concerned with doing this is systems biology, in
which the biological system as a whole is studied, with the end goal
of understanding the combined effects of all the pathways.
• Bioinformatics is an essential component of systems biology.
• Many software packages exist for the study of metabolic pathways
and the greater system as a whole.

• One such package is CellDesigner (http://www.celldesigner.org/).
• It is a structured diagram editor for drawing gene-regulatory and
biochemical networks.
• It is used to create in silico metabolic systems.
• These models can then be assigned metabolic characteristics, for example
the effect of the upregulation of one component on another. This can
eventually lead to the modeling of entire cascading effects in the system.
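
The idea of a change in one component cascading through a network can be sketched with a small signed graph. This is a purely qualitative toy and has nothing to do with CellDesigner's own file format or interface: the network, the component names, and the function `propagate` are hypothetical, each edge carries +1 (activation) or -1 (inhibition), and when a component is reachable by several routes only the first route found is used.

```python
from collections import deque


def propagate(network: dict, source: str, change: int = 1) -> dict:
    """Predict the direction of change (+1 up, -1 down) for reachable components.

    `network` maps a component to a list of (target, sign) edges.
    """
    effects = {source: change}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for target, sign in network.get(node, []):
            if target not in effects:  # first route found wins
                effects[target] = effects[node] * sign
                queue.append(target)
    return effects


# Hypothetical pathway: A activates B, B inhibits C, C activates D.
pathway = {
    "A": [("B", 1)],
    "B": [("C", -1)],
    "C": [("D", 1)],
}
print(propagate(pathway, "A", change=1))
```

A quantitative model would replace the signs with rate equations, which is the kind of detail dedicated modeling packages are built to manage.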
Cell Designer (http://www.celldesigner.org/)


Phenotype

• All individual systems together → phenotype
• Phenotype studies → substantial funding required
• Human health → phenotype studies
• Agriculture → phenotype studies
