SEQUENCE COMPARISON AND PHYLOGENY USING BIOINFORMATICS
INTERNSHIP STUDY SUBMITTED TO PSGR KRISHNAMMAL COLLEGE FOR WOMEN
 IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE AWARD OF THE
   DEGREE OF BACHELOR OF COMPUTER APPLICATIONS OF BHARATHIAR
                     UNIVERSITY, COIMBATORE – 641046
                                 Submitted by,
                             SATHYA PRABHA K
                                  III BCA B
                                  19BCA087
                                  Guided by,
                 Mrs. BHARATHI. V, M.C.A., M.Phil., (Ph.D)
                  Assistant Professor, Department of BCA,
                   PSGR Krishnammal College For Women,
                             Coimbatore- 641 004.
                                AUGUST 2021
                          DEPARTMENT OF BCA
            PSGR KRISHNAMMAL COLLEGE FOR WOMEN
                        COLLEGE OF EXCELLENCE
       (An Autonomous Institution – Affiliated To Bharathiar University)
                    (Reaccredited with A+ Grade by NAAC)
                    (An ISO 9001:2015 Certified Institution)
                       Peelamedu, Coimbatore – 641 004
                                   CERTIFICATE
                PSGR KRISHNAMMAL COLLEGE FOR WOMEN
                        College of Excellence, NIRF 10th Rank
            (An Autonomous Institution Affiliated to Bharathiar University) 
                        Reaccredited with ‘A’ Grade by NAAC 
                       An ISO 9001:2015 Certified Institution  
                            Peelamedu, Coimbatore-641004
This is to certify that SATHYA PRABHA K (19BCA087) of BCA has undergone Internship
         training on SEQUENCE COMPARISON AND PHYLOGENY USING
                       BIOINFORMATICS during August 2021.
___________________________                      _______________________________
      FACULTY GUIDE                                HEAD OF THE DEPARTMENT
Mrs. BHARATHI.V M.CA.,                           Mrs. GEETHALAKSHMI. K, MCA
      M.Phil., (Ph.D)                                  M.Phil, B.Ed.,(Ph.D)
Company certificate
                                    WORK DIARY
   DATE                                PARTICULARS                            SIGNATURE
July 12, 2021  Cell and cell components
July 13, 2021  DNA, Gene and Protein
July 14, 2021  Biological databases – Primary and secondary databases
July 15, 2021  NCBI, PDB, NDB, EMBL, File formats
July 16, 2021  PROSITE, PRINTS, Pfam, SCOP, CATH
July 17, 2021  Sequence Alignment and comparison
July 19, 2021  Expasy protein server
July 20, 2021  Gene Prediction Tools (FGENESH, GENSCAN, HMMgene)
July 21, 2021  Metabolic pathway databases and enzyme classification - KEGG
July 22, 2021  Introduction to cheminformatics
July 23, 2021  DrugBank, PubChem, ChEMBL databases - Data retrieval and analysis
July 24, 2021  Protein structure prediction tools - RasMol, MolView
July 26, 2021  Drug interaction studies - Chimera
July 27, 2021  Report preparation
July 28, 2021  Test
                                 ACKNOWLEDGEMENT:
       I take this opportunity to acknowledge, with great pleasure, deep satisfaction and
 gratitude, the contribution of the many individuals who helped in the successful completion
 of the project.
       I express my wholehearted thanks to Dr. R. NANDINI, Chairperson, PSGR
 Krishnammal College for Women, Coimbatore, for providing me with the necessary
 infrastructure for the successful completion of the project work.
       I convey my profound gratitude to Dr. N. YESODHA DEVI, M.Com., M.Phil.,
 Ph.D., Secretary, PSGR Krishnammal College for Women, Coimbatore, for giving me the
 opportunity to undergo this course and to undertake this project.
       I express my gratitude to Dr. (Mrs.) S. NIRMALA, MBA, M.Phil., Ph.D.,
 Principal, PSGR Krishnammal College for Women, for granting me
 permission to do the project work.
       I am extremely grateful to Mrs. K. GEETHALAKSHMI, MCA, M.Phil., B.Ed.,
 (Ph.D), Head, Department of BCA, for the guidance and enthusiasm provided throughout
 the project work.
       I am highly indebted to my guide, Mrs. BHARATHI. V, Department of BCA, for her
 valuable guidance, which has gone a long way to make this report a successful one.
       My sincere thanks go to all the staff of our department for their constant support
 and encouragement.
       I express my heartfelt thanks to Mrs. BANU, Proprietor, Accentz Techno Soft, for
providing me with an opportunity to undertake the Internship Study in her esteemed concern. I
also offer my heartfelt thanks to Mrs. Geethalakshmi, who taught the internship classes in a
disciplined manner that made them easy to understand.
                                     DECLARATION
I hereby declare that this Internship study entitled “SEQUENCE COMPARISON AND
PHYLOGENY USING BIOINFORMATICS”, submitted to PSGR Krishnammal College for
Women, Coimbatore for the award of the Degree of Bachelor of Computer Applications, is a
record of original work done by SATHYA PRABHA K (19BCA087) under the guidance of
Mrs. BHARATHI. V, M.C.A., M.Phil., (Ph.D), Assistant Professor, Department of BCA,
PSGR Krishnammal College for Women, Coimbatore, and that this internship study has not formed
the basis for the award of any Degree/Diploma or similar title to any candidate of any university.
Place : Coimbatore                            SATHYA PRABHA K (19BCA087)
Date :
                                        Endorsed by
Place : Coimbatore                                         Mrs.BHARATHI.V
Date :                                                      M.C.A., M.Phil., (Ph.D)
                                                              (Faculty guide)
CONTENTS
     SEQUENCE COMPARISON AND PHYLOGENY USING BIOINFORMATICS
                                       ABSTRACT
This is a study in the field of bioinformatics, which draws on information technology,
computer science, mathematics and statistics. It covers the cell and its components. In it we
learn about many proteins and genes, where to find their databases, and how to analyse them.
Many tools, both offline and online, are available for finding the structure, molecular
weight, protein type, amino acid count and gene products of a sequence. By comparing unknown
sequences we can identify which sequences are related to the query sequence. We can trace the
pathway of a molecule in a biological system, which shows the energy-related or other
reactions it takes part in within the human body. We can also find how drugs are used, the
drug molecules involved and further details about them. Bioinformatics is considered a field
of growing value for the future.
INTRODUCTION TO BIOINFORMATICS:
         Bioinformatics is the application of tools of computation and analysis to the capture
and interpretation of biological data. A bioinformatics solution usually involves the following
steps:
                    ⮚ Collect statistics from biological data.
                    ⮚ Build a computational model.
                    ⮚ Solve a computational modelling problem.
                    ⮚ Test and evaluate a computational algorithm.
         Bioinformatics is essential for the management of data in modern biology and
medicine. The bioinformatics toolbox includes computer software programs such as BLAST,
which depend on the availability of the internet. Analysis of genome sequence data,
particularly the data produced by the Human Genome Project, is one of the main achievements
of bioinformatics. Prospects in the field of bioinformatics include its future contribution
to the functional understanding of the human genome, leading to enhanced discovery of drug
targets and individualised therapy.
         WHAT DO WE DO IN BIOINFORMATICS?
                -Analyse and interpret the various types of biological data:
                       1. Genomic Sequences (DNA).
                       2. Transcriptomic Sequences (RNA).
                       3. Proteomic Sequences (Proteins).
                       4. Protein Structures (Proteins).
                       5. RNA Structures (RNA).
                -Develop new algorithms and tools to:
                       1. Access the biological information.
                       2. Handle large datasets.
                       3. Find relationships between data sources, etc.
PROTEIN CHOSEN AND FASTA FORMAT
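For readers unfamiliar with the format: a FASTA record is a header line beginning with ">" followed by one or more lines of sequence. The block below is a minimal sketch of reading such text in Python. The function name parse_fasta and the two toy records are hypothetical and are used only for illustration; they are not the secretin sequence used in this study.

```python
from typing import Dict


def parse_fasta(text: str) -> Dict[str, str]:
    """Parse FASTA-formatted text into a {header: sequence} dictionary."""
    records: Dict[str, str] = {}
    header = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]        # drop the leading '>'
            records[header] = ""
        elif header is not None:
            records[header] += line  # a sequence may span several lines
    return records


# Hypothetical toy records, NOT the real secretin sequence.
example = """>toy_protein_1 hypothetical example
MKTAYIAK
QRQISFVK
>toy_protein_2 hypothetical example
MKTAYLAK
"""

records = parse_fasta(example)
for name, seq in records.items():
    print(name, len(seq), seq)
```

The sequence lines of one record are simply joined together, so the length printed for each record is the number of residues in it.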
BLASTp
INTERPRETATION:
       I have chosen secretin preprotein [Homo sapiens] for Multiple Sequence Alignment.
The RID for secretin [Homo sapiens] is GSEFZVX801R, the program is BLASTP, and the
query ID is lcl|Query_679205, with a length of 121 and a molecule type of amino acid. I have
collected 10 sequences (the query and nine hits) for Multiple Sequence Alignment. The query
matches secretin [Gorilla gorilla gorilla], secretin [Nomascus leucogenys], secretin
[Piliocolobus tephrosceles], secretin [Sapajus apella], hypothetical protein
DBR06_SOUSAS35410027 [Sousa chinensis], secretin [Ursus maritimus], secretin [Canis lupus
dingo], secretin [Peromyscus leucopus], and secretin isoform X2 [Arvicola amphibius].
Gorilla gorilla gorilla matches 94.40%, Nomascus leucogenys matches 89.68%, Piliocolobus
tephrosceles matches 88.00%, Sapajus apella matches 78.46%, Sousa chinensis matches 69.23%,
Ursus maritimus matches 65.71%, Canis lupus dingo matches 65.20%, Peromyscus leucopus
matches 57.76%, and Arvicola amphibius matches 54.92%. These are the percentages by which
the sequences match the query, secretin preprotein [Homo sapiens].
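As a hedged illustration of how such match percentages can be read, the sketch below assumes the figures above are BLAST-style percent identities, that is, identical positions divided by the alignment length, with gap columns counted as mismatches. The function name percent_identity and the toy aligned strings are hypothetical. The second half converts the percentages reported above into simple p-distances (1 - identity/100), which is one common way to rank hits by similarity to the query.

```python
def percent_identity(aligned_query: str, aligned_hit: str) -> float:
    """Identical positions as a percentage of the alignment length.

    Both strings must already be aligned (same length, '-' for gaps).
    Gap columns count toward the length but never as matches.
    """
    if len(aligned_query) != len(aligned_hit):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_query:
        return 0.0
    identical = sum(
        1 for q, h in zip(aligned_query, aligned_hit) if q == h and q != "-"
    )
    return 100.0 * identical / len(aligned_query)


# Hypothetical toy alignment (not the secretin data): 7 of 9 columns match.
toy_identity = percent_identity("MKT-AYIAK", "MKTLAYLAK")
print(round(toy_identity, 2))

# Percentages as reported in the interpretation above.
reported = {
    "Gorilla gorilla gorilla": 94.40,
    "Nomascus leucogenys": 89.68,
    "Piliocolobus tephrosceles": 88.00,
    "Sapajus apella": 78.46,
    "Sousa chinensis": 69.23,
    "Ursus maritimus": 65.71,
    "Canis lupus dingo": 65.20,
    "Peromyscus leucopus": 57.76,
    "Arvicola amphibius": 54.92,
}

# p-distance: the fraction of positions that differ from the query.
p_distance = {name: 1.0 - pct / 100.0 for name, pct in reported.items()}
closest = min(p_distance, key=p_distance.get)
for name in sorted(p_distance, key=p_distance.get):
    print(f"{name:28s} {p_distance[name]:.4f}")
```

Under this assumption the smallest distance belongs to Gorilla gorilla gorilla and the largest to Arvicola amphibius, which agrees with the ordering of the percentages in the text.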
INTRODUCTION TO MULTIPLE SEQUENCE ALIGNMENT AND ITS USES:
       Multiple Sequence Alignment (MSA) is a basic tool for aligning three or more
biological sequences, generally protein, DNA or RNA. It refers to a series of algorithmic
solutions for the alignment of evolutionarily related sequences, taking into account
evolutionary events such as mutations, insertions, deletions and rearrangements under certain
conditions. It is used to study closely related genes or proteins in order to find the
evolutionary relationships between genes and to identify shared patterns among functionally
or structurally related genes. In many cases, the input set of query sequences is assumed to
have an evolutionary relationship, by which the sequences share a lineage and are descended
from a common ancestor. MSA has assumed a key role in comparative structure and function
analysis of biological sequences. It often leads to fundamental biological insight into the
sequence-structure-function relationships of nucleotide or protein sequence families.
Multiple sequence alignments can be used to create a phylogenetic tree. One reason this is
possible is that functional domains known in annotated sequences can be used for alignment
in non-annotated sequences. MSA algorithms can deal with sequences that are quite different
but, as in the pairwise case, when the sequences are very different they may have problems
creating a good alignment. A good alignment should line up the homologous positions, or the
positions with the same structure or function. Computational algorithms are used to produce
and analyse MSAs because of the difficulty and intractability of manually processing
sequences of biologically relevant length. MSAs require more sophisticated methodologies
than pairwise alignment because they are more computationally complex.
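To make the idea of "shared patterns" in an alignment concrete, the sketch below scans a finished alignment column by column and marks the columns in which every sequence carries the same residue, in the way ClustalW output marks fully conserved columns with "*". It does not perform the alignment itself. The function names and the three toy rows are hypothetical and are not the secretin alignment.

```python
from typing import List


def conserved_columns(alignment: List[str]) -> List[int]:
    """Return 0-based indices of columns that are identical in every row.

    Columns containing a gap ('-') are never counted as conserved.
    """
    if not alignment:
        return []
    width = len(alignment[0])
    if any(len(row) != width for row in alignment):
        raise ValueError("all aligned rows must have the same length")
    conserved = []
    for col in range(width):
        residues = {row[col] for row in alignment}
        if len(residues) == 1 and "-" not in residues:
            conserved.append(col)
    return conserved


def marker_line(alignment: List[str]) -> str:
    """Build a line with '*' under conserved columns and ' ' elsewhere."""
    if not alignment:
        return ""
    conserved = set(conserved_columns(alignment))
    return "".join("*" if i in conserved else " " for i in range(len(alignment[0])))


# Hypothetical toy alignment of three short sequences.
toy_alignment = [
    "MKT-AYIAK",
    "MKTLAYLAK",
    "MRTLAY-AK",
]

for row in toy_alignment:
    print(row)
print(marker_line(toy_alignment))
```

Long runs of marked columns in a real alignment are the candidates for functionally or structurally important regions mentioned above.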
CLUSTALW:
RESULTS:
MENU IN CLUSTALW:
INTRODUCTION TO PHYLOGENETIC ANALYSIS
       Phylogenetic analysis provides an in-depth understanding of how species evolve
through genetic changes. Using phylogenetics, scientists can evaluate the path that connects a
present-day organism with its ancestral origin, and can also predict the genetic divergence
that may occur in the future. Its aim is to construct a visual representation (a tree) that
describes the assumed evolution occurring between and among different groups (individuals,
populations, species, etc.) and to study the reliability of the consensus tree. Phylogenetic
analysis can be useful in comparative genomics, which studies the relationship between the
genomes of different species. In this context, one major application is gene prediction or
gene finding, which means locating specific genetic regions along a genome.
       Parts of a phylogenetic tree:
       [Figure: phylogenetic tree with the root, node, branch, ingroup and outgroup labelled]
TYPES OF TREES:
       There are three types of phylogenetic tree: the cladogram, the phylogram and the
ultrametric tree. A cladogram carries no numerical information: its branch lengths are
arbitrary, and it is only a rough representation of the phylogenetic tree. A phylogram shows
the amount of genetic change: the length of each branch is proportional to the change that
has accumulated along it. An ultrametric tree shows time: the length of each branch is
proportional to the time taken for the change, measured in my (million years). For example,
if a lineage takes 2 my to diverge, its branch might be drawn 2 cm long. The diagram below
shows the three types of tree.
       All show the same evolutionary relationships, or branching orders, between the taxa.
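The point that the tree types share one branching order can be sketched with the Newick text format, in which a branch length is written as ":number" after a label or a closing bracket. Removing the lengths from a tree with lengths leaves a cladogram with the same topology. This is a minimal sketch: the function name to_cladogram, the taxon labels A to D and the branch lengths are hypothetical, and the pattern assumes plain decimal or exponent-style lengths.

```python
import re

# Matches a Newick branch length such as ":0.25", ":3" or ":1.2e-3".
BRANCH_LENGTH = re.compile(r":[0-9]+(?:\.[0-9]+)?(?:[eE][+-]?[0-9]+)?")


def to_cladogram(newick: str) -> str:
    """Strip branch lengths from a Newick string, keeping only the topology."""
    return BRANCH_LENGTH.sub("", newick)


# Hypothetical four-taxon tree with branch lengths.
with_lengths = "((A:0.10,B:0.25):0.05,(C:0.30,D:0.12):0.08);"
topology_only = to_cladogram(with_lengths)

print(with_lengths)
print(topology_only)
```

Both strings group A with B and C with D; only the information about how long each branch is has been dropped.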
       The goal of phylogeny inference is to resolve the branching order of lineages in an
evolutionary tree. In the diagram, the first tree is a star phylogeny, which is completely
unresolved; the second is a partially resolved phylogeny; and the third is a fully resolved,
bifurcating phylogeny. The arrows point to a polytomy (multifurcation) and to a
bifurcation of the evolutionary tree.
Taxonomical Reaction:
       There are three possible unrooted trees for four taxa (A, B, C, D). Phylogenetic tree
building (or inference) methods are aimed at discovering which of the possible unrooted trees
is “correct”. We would like this to be the “true” biological tree – that is, one that accurately
represents the evolutionary history of the taxa. However, we must settle for discovering the
computationally correct or optimal tree for the phylogenetic method of choice.
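The count of three unrooted trees for four taxa follows from the standard formula for unrooted bifurcating trees, (2n - 5)!! = 1 x 3 x 5 x ... x (2n - 5) for n taxa. The sketch below evaluates that formula and lists the three four-taxon groupings explicitly. The function names are hypothetical and chosen only for this illustration.

```python
from typing import List, Sequence


def count_unrooted_trees(n_taxa: int) -> int:
    """Number of distinct unrooted bifurcating trees: (2n - 5)!! for n >= 3."""
    if n_taxa < 3:
        raise ValueError("at least three taxa are needed")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 5.
    for k in range(3, 2 * n_taxa - 4, 2):
        count *= k
    return count


def quartet_topologies(taxa: Sequence[str]) -> List[str]:
    """List the three ways to split four taxa into two pairs."""
    if len(taxa) != 4:
        raise ValueError("exactly four taxa are required")
    first, *rest = taxa
    splits = []
    for partner in rest:
        others = [t for t in rest if t != partner]
        splits.append(f"({first},{partner})|({others[0]},{others[1]})")
    return splits


print(count_unrooted_trees(4))
print(quartet_topologies(["A", "B", "C", "D"]))
print(count_unrooted_trees(10))
```

The number grows very quickly: for the 10 sequences used in this study the formula gives 2,027,025 possible unrooted trees, which is why tree-building methods search for an optimal tree rather than inspecting every one.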
PHYLOGENETIC TREE:
INTERPRETATION:
       A phylogenetic tree is a branching diagram showing the evolutionary
relationships among various biological species or other entities, based upon similarities and
differences in their physical or genetic characteristics. All life on Earth is part of a single
phylogenetic tree, indicating common ancestry. I have taken the secretin sequences of Homo
sapiens [EAX02368.1], Gorilla gorilla gorilla [XP_004050413.2], Nomascus leucogenys
[XP_030657519.1], Piliocolobus tephrosceles [XP_023039225.1], Sapajus apella [0321227361.2],
Sousa chinensis [TEA36508.1], Ursus maritimus [XP_040494734.1], Canis lupus dingo
[XP_025308592], Peromyscus leucopus [XP_028728132] and Arvicola amphibius
[XP_038172384.1] for phylogenetic comparison.
       The phylogenetic tree has three parts: the root, which represents the earliest
ancestor; the node, which is a junction in the tree; and the branch, which is the line that
shows the evolutionary relationship between organisms.
       The root of the phylogenetic tree is Piliocolobus tephrosceles [XP_023039225.1].
The first branch holds Peromyscus leucopus [XP_028728132] and Arvicola amphibius
[XP_038172384.1], which are evolutionarily related to each other. The second branch holds
Canis lupus dingo [XP_025308592] and Ursus maritimus [XP_040494734.1], which are also
evolutionarily related to each other. The third branch is Sousa chinensis [TEA36508.1]; the
first and second sets are related to this third set. The fourth branch is Sapajus apella
[0321227361.2], which is evolutionarily related to the first three sets. The fifth is the
root of the phylogenetic tree, Piliocolobus tephrosceles [XP_023039225.1]. The sixth is
Nomascus leucogenys [XP_030657519.1], which is related to the sets above, and the final
branch holds Homo sapiens [EAX02368.1] and Gorilla gorilla gorilla [XP_004050413.2], which
are related to all six preceding branches. This is the structure of the phylogenetic tree,
which shows the evolutionary relationships between the organisms.