Bioinformatics

Bioinformatics is a scientific discipline that integrates biology, computer science, and information technology to manage complex biological data. The field has evolved significantly since the 1990s with advancements in high-throughput DNA sequencing and the growth of various 'omics' projects, necessitating sophisticated computational tools for data analysis. Key tasks in bioinformatics include sequence alignment, protein folding, and evolutionary analysis, with various databases and algorithms available for researchers to utilize.

Uploaded by

georginaroudri

Bioinformatics:

Copyright© Kerstin Wagner

Introduction: What is bioinformatics?
Bioinformatics can be defined as the body of tools and algorithms needed to handle large and complex biological information.

Bioinformatics is a scientific discipline created from the interaction of biology and computer science.

The NCBI defines bioinformatics as:

"Bioinformatics is the field of science in which biology, computer science, and information technology merge into a single discipline."

Genomics era: High-throughput DNA sequencing

The first high-throughput genomics technology was automated DNA sequencing, in the early 1990s.

In 1995, Venter and Hamilton Smith used a whole-genome shotgun sequencing strategy to sequence the genomes of Mycoplasma and Haemophilus.

In September 1999, Celera Genomics completed the sequencing of the Drosophila genome.

The 3-billion-bp human genome sequence was generated in a competition between the publicly funded Human Genome Project and Celera.

High-throughput DNA sequencing

[Figure, top image: confocal detection of fluorescently labeled DNA by the MegaBACE sequencer]

That was then. How about now?

The trend of data growth
The 21st century is a century of biotechnology and OMICS:

[Figure: growth of GenBank; y-axis: nucleotides (billions, 0 to 8); x-axis: years (1980 to 2000)]

- Genomics: New sequence information is being produced at increasing rates. (The contents of GenBank double every year.)

- Transcriptomics: Microarrays give a global expression analysis: RNA levels of every gene in the genome are analyzed in parallel. Progressively replaced by RNA-seq.

- Proteomics: Global protein analysis generates large mass spectra libraries.

- Metabolomics: Global metabolite analysis: 25,000 secondary metabolites characterized.

How to handle the large amount of information?

Drew Sheneman, New Jersey--The Newark Star Ledger

Answer: bioinformatics and the Internet

Bioinformatics history
In the 1960s: the birth of bioinformatics

[Image: IBM 7090 computer]

Margaret Oakley Dayhoff created:

- The first protein database
- The first program for sequence assembly

There is a need for computers and algorithms that allow:

Access, processing, storing, sharing, retrieving, visualizing, annotating…

Why do we need the Internet?
"Omics" projects and the information associated with them involve a huge amount of data that is stored on computers all over the world.
Because it is impossible to maintain up-to-date copies of all relevant databases within the lab, access to the data is via the Internet.

[Figure: database storage; "You are here"]

Scope of this lab
The lab will touch on the following computational tasks:
- Similarity search
- Sequence comparison: alignment, multiple alignment, retrieval
- Sequence analysis: signal peptide, transmembrane domain,…
- Protein folding: secondary structure from sequence
- Sequence evolution: phylogenetic trees

The lab will also make you familiar with the bioinformatics resources available on the web to do these tasks.

Applying algorithms to analyze genomics data
You have just cloned a gene. Questions to ask:

- Is it already in databases? Accession #? Annotation?
- Protein characteristics? Sub-localization? Soluble? 3D fold?
- Other information? Expression profile? Mutants?
- Are there conserved regions? Alignments? Domains?
- Are there similar sequences? % identity? Family member?
- Evolutionary relationship? Phylogenetic tree

A critical failure of current bioinformatics is the lack of a single software package that can perform all of these functions.

DNA (nucleotide sequences) databases
They are big databases, and searching any one of them should produce
similar results because they exchange information routinely.

-GenBank (NCBI): http://www.ncbi.nlm.nih.gov

-Ensembl: http://useast.ensembl.org/index.html

-DDBJ (DNA DataBase of Japan): http://www.ddbj.nig.ac.jp

-TIGR: http://tigr.org/tdb/tgi

-Yeast: http://yeastgenome.org

-Microbes: http://img.jgi.doe.gov/cgi-bin/pub/main.cgi
Protein (amino acid) databases
Known proteins:
-Swiss-Prot (very high level of annotation)
http://au.expasy.org/

-PIR (Protein Information Resource): the world's most comprehensive catalog of information on proteins
http://www.pir.uniprot.org/

Translated databases:
-TrEMBL (translated EMBL): includes entries that have not yet been annotated into Swiss-Prot.
http://www.ebi.ac.uk/trembl/access.html

-GenPept (translation of coding regions in GenBank)

-PDB (sequences derived from the 3D structures in the Brookhaven PDB)
http://www.rcsb.org/pdb/

Database homology searching
Similarity searches use algorithms that give the search an efficient mathematical basis, so that scores can be translated into statistical significance.

They assume that sequence, structure, and function are inter-related.

All similarity searching methods rely on the concepts of alignment and distance between sequences.

A similarity score is calculated from a distance: the number of DNA bases or amino acids that are different between two sequences.
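
The distance described above can be sketched in a few lines of Python. This is only a minimal illustration, assuming the two sequences are already aligned to the same length; the function names are hypothetical, and real search tools use weighted substitution matrices rather than a plain mismatch count.

```python
def mismatch_distance(seq_a: str, seq_b: str) -> int:
    """Count the positions at which two aligned, equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for x, y in zip(seq_a, seq_b) if x != y)


def percent_identity(seq_a: str, seq_b: str) -> float:
    """Turn the mismatch count into a simple similarity score (0 to 100)."""
    if not seq_a:
        raise ValueError("sequences must not be empty")
    distance = mismatch_distance(seq_a, seq_b)
    return 100.0 * (len(seq_a) - distance) / len(seq_a)


if __name__ == "__main__":
    # Two made-up DNA strings that differ at two of seven positions.
    print(mismatch_distance("GATTACA", "GACTATA"))
    print(round(percent_identity("GATTACA", "GACTATA"), 1))
```

The fewer the differences, the higher the similarity score, which is the relationship the slide states.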
Database search methods: Sequence Alignment
Two broad classes of sequence alignments exist:

- Global alignment: aligns the two sequences over their whole length; not sensitive
QKESGPSSSYC
VQQESGLVRTTC

- Local alignment: aligns only the best-matching regions; faster
ESG
ESG

The most widely used local similarity algorithms are:

- Smith-Waterman (http://www.ebi.ac.uk/MPsrch/)
- Basic Local Alignment Search Tool (BLAST, http://www.ncbi.nih.gov)
- Fast Alignment (FASTA, http://fasta.genome.jp; http://www.ebi.ac.uk/fasta33/; http://www.arabidopsis.org/cgi-bin/fasta/nph-TAIRfasta.pl)
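
To make the idea of local alignment concrete, here is a minimal Smith-Waterman sketch in Python, applied to the two peptides from the global alignment example. The scoring values (match +2, mismatch -1, linear gap -2) are assumptions chosen for illustration; production tools use substitution matrices such as BLOSUM or PAM and affine gap penalties.

```python
def smith_waterman(a: str, b: str, match: int = 2, mismatch: int = -1, gap: int = -2):
    """Return (best score, aligned part of a, aligned part of b) for a local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] holds the best score of any local alignment ending at a[i-1], b[j-1].
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The floor of 0 lets an alignment restart anywhere: this is what makes it local.
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until the score drops to zero.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if H[i][j] == diag:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    score, top, bottom = smith_waterman("QKESGPSSSYC", "VQQESGLVRTTC")
    print(score)
    print(top)
    print(bottom)
```

The full table has one cell per pair of residues, so the cost grows with the product of the two sequence lengths. That is why an exhaustive Smith-Waterman search over a whole database is slow compared with heuristic tools such as BLAST and FASTA.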
Which algorithm to use for database similarity search?

Speed:
BLAST > FASTA > Smith-Waterman (which is VERY SLOW and uses a LOT OF COMPUTER POWER)

Sensitivity/statistics:
FASTA is more sensitive than BLAST and misses fewer homologues.
Smith-Waterman is even more sensitive.

BLAST calculates probabilities.

FASTA is more accurate than BLAST for DNA-DNA searches.

Tools to search databases
The dilemma: DNA or protein?

Search by similarity: using the nucleotide sequence, or using the amino acid sequence?

- Is the comparison of two nucleotide sequences accurate?

- By translating into amino acid sequence, are we losing information?

The genetic code is degenerate (two or more codons can represent the same amino acid).

- Very different DNA sequences may code for similar protein sequences.
We certainly do not want to miss those cases!
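
A small Python sketch of this degeneracy follows. The two DNA strings are made up for illustration, and the table is the standard genetic code written in the usual TCAG ordering; the helper names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, TTG, TCT, ...).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate a DNA string in frame 1, ignoring any incomplete trailing codon."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def identical_positions(seq_a: str, seq_b: str) -> int:
    """Count positions at which two equal-length sequences carry the same letter."""
    return sum(x == y for x, y in zip(seq_a, seq_b))


if __name__ == "__main__":
    dna_1 = "CTGAGCCGT"  # hypothetical coding sequence
    dna_2 = "TTATCAAGA"  # a different sequence using synonymous codons
    print(translate(dna_1), translate(dna_2))
    print(identical_positions(dna_1, dna_2), "of", len(dna_1), "bases identical")
```

Here the two DNA sequences share only 2 of 9 bases, yet both translate to the same three residues (Leu-Ser-Arg), so a protein-level comparison finds a match that a DNA-level comparison would likely miss.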
Reasons for translating
- Comparing DNA sequences gives more random matches:
[Figure: a good alignment with end-gaps next to a very poor alignment that still shows almost 50% identity]

- Conservation of protein in evolution (DNA similarity decays faster!)

Conclusion:
It is almost always better to compare coding sequences in their amino acid form, especially if they are very divergent.
- Very highly similar nucleotide sequences may give better results when compared as DNA.
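
The claim that DNA comparisons give more random matches can be checked with a small simulation. This is a rough sketch under simplifying assumptions: letters are drawn uniformly at random, sequences are compared position by position without gaps, and the function name and parameters are hypothetical.

```python
import random

DNA_ALPHABET = "ACGT"
PROTEIN_ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def mean_chance_identity(alphabet: str, length: int = 200, trials: int = 200, seed: int = 0) -> float:
    """Average fraction of identical positions between pairs of random sequences."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    total = 0.0
    for _ in range(trials):
        seq_a = rng.choices(alphabet, k=length)
        seq_b = rng.choices(alphabet, k=length)
        total += sum(x == y for x, y in zip(seq_a, seq_b)) / length
    return total / trials


if __name__ == "__main__":
    print("DNA    :", round(mean_chance_identity(DNA_ALPHABET), 3))
    print("protein:", round(mean_chance_identity(PROTEIN_ALPHABET), 3))
```

With four letters, unrelated DNA sequences agree at roughly one position in four by chance alone, against roughly one in twenty for a 20-letter amino acid alphabet. Allowing gaps raises the chance identity further, so even a poor DNA alignment can reach a figure that looks high.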
BLAST and FASTA variants

FASTA: Compares a DNA query to a DNA database, or a protein query to a protein database.
FASTX: Compares a translated DNA query to a protein database.
TFASTA: Compares a protein query to a translated DNA database.

BLASTN: Compares a DNA query to a DNA database.

BLASTP: Compares a protein query to a protein database.

BLASTX: Compares the 6-frame translations of a DNA query to a protein database.

TBLASTN: Compares a protein query to the 6-frame translations of a DNA database. You can, however, define your frame of interest.

TBLASTX: Compares the 6-frame translations of a DNA query to the 6-frame translations of a DNA database (each translated comparison is comparable to a BLASTP search!)

PSI-BLAST: Performs iterative database searches. The results from each round are incorporated into a 'position specific' score matrix, which is used for further searching.
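
Several of these programs rely on a 6-frame translation. A minimal Python sketch is shown below, assuming the standard genetic code and an input containing only A, C, G and T; the function names and the frame labels (+1 to +3, -1 to -3) are illustrative conventions, not the output format of BLAST or FASTA.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna: str) -> str:
    """Translate from the first base, ignoring any incomplete trailing codon."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def reverse_complement(dna: str) -> str:
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def six_frame_translations(dna: str) -> dict:
    """Translate three frames on each strand: +1, +2, +3 and -1, -2, -3."""
    frames = {}
    for sign, strand in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{sign}{offset + 1}"] = translate(strand[offset:])
    return frames


if __name__ == "__main__":
    for name, protein in six_frame_translations("ATGGCC").items():
        print(name, protein)
```

Because a translated search compares the query against all six frames, it does roughly six times the work of a direct protein search, which is part of why TBLASTX (six frames against six frames) is the most expensive variant.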
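A practical example of sequence alignment
http://www.ncbi.nlm.nih.gov

[Screenshots: BLAST results; detailed BLAST results]

E value: the expectation value, i.e. the number of hits with a score at least as good as this one that you would expect to find by chance in a database of this size. For small values, E is close to the probability of finding such a hit by chance. The lower the E, the more significant the score.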
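
The relationship between score, search space and E can be sketched as follows. This assumes the bit-score form of the Karlin-Altschul statistics, E = m × n × 2^(-S'), where m is the query length, n the database length and S' the bit score; BLAST itself uses length-corrected (effective) values, so the numbers here are illustrative only and the function names are hypothetical.

```python
import math


def expected_chance_hits(bit_score: float, query_len: int, db_len: int) -> float:
    """E value: expected number of chance hits scoring at least bit_score."""
    return query_len * db_len * 2.0 ** (-bit_score)


def probability_of_at_least_one(e_value: float) -> float:
    """Convert an E value into the probability of one or more chance hits."""
    # -expm1(-E) equals 1 - exp(-E) but stays accurate for very small E.
    return -math.expm1(-e_value)


if __name__ == "__main__":
    e = expected_chance_hits(bit_score=50.0, query_len=300, db_len=1_000_000_000)
    print(f"E = {e:.2e}")
    print(f"P = {probability_of_at_least_one(e):.2e}")
```

Two consequences follow directly from the formula: every extra bit of score halves E, and the same alignment becomes less significant as the database grows.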
Database searching tips
- Use the latest database version.
- Use BLAST first, then a finer tool (FASTA,…)
- Search both strands when using FASTA.
- Translate sequences where relevant.
- Search the 6-frame translation of a DNA database.
- E < 0.05 is statistically significant, usually biologically interesting.
- If the query has repeated segments, delete them and repeat the search.
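
The E < 0.05 rule of thumb can be applied automatically to search results. The sketch below assumes hits saved in the default 12-column tabular format of NCBI BLAST+ (`-outfmt 6`), where the E value is the 11th column; the three hit lines are made-up examples, not real search output, and the function name is hypothetical.

```python
EVALUE_COLUMN = 10  # zero-based index of the E value in default BLAST+ tabular output
SUBJECT_COLUMN = 1  # zero-based index of the subject (database) sequence id


def significant_hits(tabular_text: str, max_evalue: float = 0.05):
    """Return (subject id, E value) pairs below the cutoff, best hit first."""
    hits = []
    for line in tabular_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        evalue = float(fields[EVALUE_COLUMN])
        if evalue < max_evalue:
            hits.append((fields[SUBJECT_COLUMN], evalue))
    return sorted(hits, key=lambda hit: hit[1])


# Hypothetical hits: query, subject, % identity, length, mismatches, gap opens,
# query start/end, subject start/end, E value, bit score.
EXAMPLE_HITS = "\n".join([
    "query1\thypA\t98.5\t200\t3\t0\t1\t200\t1\t200\t1e-110\t400",
    "query1\thypB\t35.0\t80\t50\t2\t10\t89\t5\t84\t0.002\t38.5",
    "query1\thypC\t28.0\t50\t36\t0\t20\t69\t7\t56\t3.1\t22.0",
])

if __name__ == "__main__":
    for subject, evalue in significant_hits(EXAMPLE_HITS):
        print(subject, evalue)
```

A cutoff like this only screens for statistical significance; whether a hit is biologically interesting still has to be judged from the alignment itself.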
