Intro to Python programming and iPython
3/30/14	
  
1	
  
R. Burke Squires
Computational Genomics Specialist
Bioinformatics and Computational Biosciences Branch (BCBB)
Bioinformatics & Computational Biology Branch (BCBB)
Why Python?
Source: http://xkcd.com/353/
Topics
§  iPython & Integrated Development Environments
§  Printing and manipulating text
§  Reading and writing files
§  Lists and Loops
§  Writing your own functions
§  Conditional tests
§  Regular Expressions
§  Dictionaries
§  Files, programs and user input
Resource:
Bioinformatics Programming
Goals
§  Introduce you to the basics of the Python programming language
§  Introduce you to the iPython environment and integrated development environments (IDE)
§  Enable you to write or assemble scripts of your own or modify existing scripts for your own purposes
§  Prepare you for the next session “Introduction to Biopython Programming”
Programming…is it Magic?
§  No…BUT it can seem like it at times!
§  Working with text files
Python Scripts vs. Programs
§  Script: each function is interpreted and executed (slower)
§  Program: code is compiled once; executed as machine code (fastest)
In Your Toolbelt…
§  Python environment
§  Integrated Development Environment (IDE)
–  Continuum Analytics Anaconda
§  http://continuum.io/downloads.html
–  Enthought Canopy Express
§  https://www.enthought.com/products/epd/free/
–  iPython Notebook
§  http://ipython.org
–  PyCharm CE (Community Edition)
§  http://www.jetbrains.com/pycharm/
Python Environment
§  Open Terminal
§  Type “python” and hit return
•  You should see “>>>”
§  Enter print('hello world') and hit return
§  Congratulations! You have just written your first Python script!
§  You could also put code in a text file and execute it:
•  $ python script_name.py
iPython
§  IPython provides an architecture for interactive computing:
•  Powerful interactive shells (terminal and Qt-based).
•  A browser-based notebook with support for code, text, mathematical expressions, inline plots and other rich media.
•  Support for interactive data visualization and use of GUI toolkits.
•  Easy-to-use, high-performance tools for parallel computing.
Source: ipython.org
iPython
§  Already installed
•  Source: Continuum Analytics Anaconda
–  http://continuum.io/downloads.html
§  Double-click on icon on desktop:
§  Launch the ipython-notebook
iPython – Home Screen
iPython – New Notebook
iPython
§  Add text using Cell -> Markdown
•  Type #Intro to Python
•  Type “This is my first iPython notebook.”
•  (To edit change to raw text)
§  Add a code cell
§  Type print("Hello world")
§  Click play or run button (or Cell -> Run)
Source: ipython.org
iPython Notebook
iPython Notebook Help
§  Add images to your notebook
•  “!["DNA"](files/DNA_chain.jpg)”
•  In the same folder as notebook
§  Add YouTube Videos to your notebook:
•  from IPython.display import YouTubeVideo
•  YouTubeVideo('iwVvqwLDsJo')
Additional Tools
Canopy PyCharm
Advantages of IDEs
§  PyCharm features:
•  Intelligent Editor:
–  Code completion, on-the-fly error highlighting, auto-fixes, etc.
•  Automated code refactorings and rich navigation
capabilities
•  Integrated debugger and unit testing support
•  Native version control system (VCS) integrations
•  Customizable UI and key-bindings, with VIM
emulation available
Source: http://www.jetbrains.com/pycharm/
Lastly: Python IDEs in the Cloud
§  Python Anywhere
•  http://www.pythonanywhere.com
§  Python Fiddle: Python Cloud IDE
•  http://pythonfiddle.com
§  Koding: Free Programming Virtual Machine
•  http://koding.com
Printing and manipulating text:
“Hello World”
§  While in iPython:
•  Type print("Hello world")
•  “Run” the program
§  The whole thing is a statement; print is a function
§  Comments
•  # This is a comment
Printing and manipulating text:
Storing Strings in Variables
# store a short DNA sequence in the variable my_dna
my_dna = "ATGCGTA"

# now print the DNA sequence
print(my_dna)
Source: Python for Biologists, Dr. Martin Jones
Printing and manipulating text:
Concatenation
my_dna = "AATT" + "GGCC"
print(my_dna)

upstream = "AAA"
my_dna = upstream + "ATGC"
# my_dna is now "AAAATGC"
Source: Python for Biologists, Dr. Martin Jones
Printing and manipulating text:
Finding the Length of a String
# store the DNA sequence in a variable
my_dna = "ATGCGAGT"

# calculate the length of the sequence and store it in a variable
dna_length = len(my_dna)

# print a message telling us the DNA sequence length
# (this line raises a TypeError: a string cannot be joined to a number)
print("The length of the DNA sequence is " + dna_length)

# corrected version: convert the number to a string with str()
my_dna = "ATGCGAGT"
dna_length = len(my_dna)
print("The length of the DNA sequence is " + str(dna_length))
Source: Python for Biologists, Dr. Martin Jones
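The first print call on this slide fails because + cannot join a string and an integer. A minimal sketch of two ways to build the message, assuming Python 3.6 or later for the f-string form (the variable names message_concat and message_fstring are only illustrative):

```python
my_dna = "ATGCGAGT"
dna_length = len(my_dna)

# Option 1: convert the number to a string before concatenating
message_concat = "The length of the DNA sequence is " + str(dna_length)

# Option 2: an f-string (Python 3.6+) formats the number in place
message_fstring = f"The length of the DNA sequence is {dna_length}"

print(message_concat)
print(message_fstring)
```

Both forms produce the same text; the f-string is often easier to read when several values are mixed into one message.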
Printing and manipulating text:
Replacement
protein = "vlspadktnv"

# replace valine with tyrosine
print(protein.replace("v", "y"))

# we can replace more than one character
print(protein.replace("vls", "ymt"))

# the original variable is not affected
print(protein)
Source: Python for Biologists, Dr. Martin Jones
Printing and manipulating text:
Extracting Part of a String
protein = "vlspadktnv"

# print positions three and four (the stop position, five, is not included)
print(protein[3:5])

# positions start at zero, not one
print(protein[0:6])

# if we use a stop position beyond the end, it's the same as using the end
print(protein[0:60])
Source: Python for Biologists, Dr. Martin Jones
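The slice rules on this slide (start included, stop excluded, stop clipped at the end) can be checked with a short sketch. The variable names are illustrative, and the last two lines use the optional shorthand of leaving out the start or the stop:

```python
protein = "vlspadktnv"

# start is included, stop is excluded: positions 3 and 4 only
middle = protein[3:5]

# positions 0 through 5
first_six = protein[0:6]

# a stop position beyond the end is clipped to the end of the string
whole = protein[0:60]

# leaving out start or stop means "from the beginning" or "to the end"
prefix = protein[:3]
suffix = protein[7:]

print(middle, first_six, whole, prefix, suffix)
```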
Printing and manipulating text:
Counting
protein = "vlspadktnv"

# count amino acid residues
valine_count = protein.count('v')
lsp_count = protein.count('lsp')
tryptophan_count = protein.count('w')

# now print the counts
print("valines: " + str(valine_count))
print("lsp: " + str(lsp_count))
print("tryptophans: " + str(tryptophan_count))
Source: Python for Biologists, Dr. Martin Jones
Printing and manipulating text:
Homework
Calculating AT content
Here's a short DNA sequence:

ACTGATCGATTACGTATAGTATTTGCTATCATACATATATATCGATGCGTTCAT

Write a program that will print out the AT content of this DNA sequence. Hint: you can use normal mathematical symbols like add (+), subtract (-), multiply (*), divide (/) and parentheses to carry out calculations on numbers in Python.

Reminder: if you're using Python 2 rather than Python 3, include this line at the top of your program:
from __future__ import division
Source: Python for Biologists, Dr. Martin Jones
Reading and Writing Files:
Reading a File
my_file = open("dna.txt")
file_contents = my_file.read()
print(file_contents)

my_file = open("dna.txt")
my_file_contents = my_file.read()

# remove the newline from the end of the file contents
my_dna = my_file_contents.rstrip("\n")
dna_length = len(my_dna)
print("sequence is " + my_dna + " and length is " + str(dna_length))
Source: Python for Biologists, Dr. Martin Jones
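A common alternative to calling open and close by hand is the with statement, which closes the file automatically. The sketch below first creates a small example file so that it runs on its own; the temporary folder and the file contents are assumptions made for illustration, not part of the slides.

```python
import os
import tempfile

# create a small example file so the sketch is self-contained
# (the folder and the sequence are made up for illustration)
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "dna.txt")
with open(path, "w") as handle:
    handle.write("ACTGTACGTGCACTGATC\n")

# the with statement closes the file when the block ends
with open(path) as my_file:
    my_dna = my_file.read().rstrip("\n")

print("sequence is " + my_dna + " and length is " + str(len(my_dna)))
```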
Reading and Writing Files:
Writing to a File
my_file = open("out.txt", "w")
my_file.write("Hello world")

# remember to close the file
my_file.close()

# a file can also be opened by giving its full path
my_file = open("/Users/martin/Desktop/myfolder/myfile.txt")
Source: Python for Biologists, Dr. Martin Jones
Reading and Writing Files:
Homework
Writing a FASTA file
FASTA file format is a commonly used DNA and protein sequence file format. A single sequence in FASTA format looks like this:
>sequence_name
ATCGACTGATCGATCGTACGAT
Write a program that will create a FASTA file for the following three sequences. Make sure that all sequences are in upper case and only contain the bases A, T, G and C.
Sequence header    DNA sequence
ABC123             ATCGTACGATCGATCGATCGCTAGACGTATCG
DEF456             actgatcgacgatcgatcgatcacgact
HIJ789             ACTGAC-ACTGT--ACTGTA----CATGTG
Source: Python for Biologists, Dr. Martin Jones
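One possible approach to this exercise, offered as a sketch rather than the reference solution. The helper name clean_sequence and the output location are assumptions; the headers and sequences are the ones listed on the slide.

```python
import os
import tempfile

# (header, raw sequence) pairs taken from the slide
records = [
    ("ABC123", "ATCGTACGATCGATCGATCGCTAGACGTATCG"),
    ("DEF456", "actgatcgacgatcgatcgatcacgact"),
    ("HIJ789", "ACTGAC-ACTGT--ACTGTA----CATGTG"),
]

def clean_sequence(seq):
    """Upper-case the sequence and remove gap characters (hypothetical helper)."""
    return seq.upper().replace("-", "")

# the output location is an assumption; any writable path would do
out_path = os.path.join(tempfile.mkdtemp(), "sequences.fasta")
with open(out_path, "w") as fasta:
    for header, seq in records:
        fasta.write(">" + header + "\n")
        fasta.write(clean_sequence(seq) + "\n")

# read the file back to check what was written
with open(out_path) as fasta:
    fasta_text = fasta.read()
print(fasta_text)
```

Keeping the cleaning step in its own function makes it easy to test separately from the file writing.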
Lists and Loops:
Lists
apes = ["Homo sapiens", "Pan troglodytes", "Gorilla gorilla"]
conserved_sites = [24, 56, 132]
print(apes[0])
third_site = conserved_sites[2]

chimp_index = apes.index("Pan troglodytes")
# chimp_index is now 1

nucleotides = ["T", "C", "A", "G"]
last_ape = apes[-1]
>>> 'MNKMDLVADVAEKTDLSKAKATEVIDAVFA'[-1]
'A'
>>> 'MNKMDLVADVAEKTDLSKAKATEVIDAVFA'[-5]
'D'
>>> 'MNKMDLVADVAEKTDLSKAKATEVIDAVFA'[7 // 2]
'M'
>>> 'MNKMDLVADVAEKTDLSKAKATEVIDAVFA'[50]
Traceback (most recent call last):
  File "<pyshell#14>", line 1, in <module>
    'MNKMDLVADVAEKTDLSKAKATEVIDAVFA'[50]
IndexError: string index out of range
Slicing
[m:n]
Source: Python for Biologists, Dr. Martin Jones & O’Reilly Bioinformatics Programming Using Python
Lists and Loops:
Slicing & Appending Lists
ranks = ["kingdom", "phylum", "class", "order", "family"]
lower_ranks = ranks[2:5]
# lower ranks are class, order and family

apes = ["Homo sapiens", "Pan troglodytes", "Gorilla gorilla"]
print("There are " + str(len(apes)) + " apes")
apes.append("Pan paniscus")
print("Now there are " + str(len(apes)) + " apes")
Source: Python for Biologists, Dr. Martin Jones
Lists and Loops:
Concatenating, Reversing & Sorting Lists
apes = ["Homo sapiens", "Pan troglodytes", "Gorilla gorilla"]
monkeys = ["Papio ursinus", "Macaca mulatta"]
primates = apes + monkeys
print(str(len(apes)) + " apes")
print(str(len(monkeys)) + " monkeys")
print(str(len(primates)) + " primates")

ranks = ["kingdom", "phylum", "class", "order", "family"]
print("at the start : " + str(ranks))
ranks.reverse()
print("after reversing : " + str(ranks))
ranks.sort()
print("after sorting : " + str(ranks))
Source: Python for Biologists, Dr. Martin Jones
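reverse() and sort() change the list in place. When the original order should be kept, the built-in sorted() and reversed() functions return new sequences instead. A small sketch, with illustrative variable names:

```python
ranks = ["kingdom", "phylum", "class", "order", "family"]

# sorted() and reversed() leave the original list untouched
alphabetical = sorted(ranks)
backwards = list(reversed(ranks))

# sort() changes the list in place and returns None
ranks_copy = list(ranks)
result = ranks_copy.sort()

print(ranks)
print(alphabetical)
print(backwards)
```

A frequent mistake is writing ranks = ranks.sort(), which replaces the list with None.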
Lists and Loops:
Looping through Lists
apes = ["Homo sapiens", "Pan troglodytes", "Gorilla gorilla"]
for ape in apes:
    print(ape + " is an ape")

apes = ["Homo sapiens", "Pan troglodytes", "Gorilla gorilla"]
for ape in apes:
    name_length = len(ape)
    first_letter = ape[0]
    print(ape + " is an ape. Its name starts with " + first_letter)
    print("Its name has " + str(name_length) + " letters")
Source: Python for Biologists, Dr. Martin Jones
Python:
Indentation
apes = ["Homo sapiens", "Pan troglodytes", "Gorilla gorilla"]
for ape in apes:
    name_length = len(ape)
    first_letter = ape[0]
    print(ape + " is an ape. Its name starts with " + first_letter)
    print("Its name has " + str(name_length) + " letters")

Indentation errors

Use tabs or spaces but not both
Source: Python for Biologists, Dr. Martin Jones
Lists and Loops:
Using Strings as Lists, Splitting
name = "martin"
for character in name:
    print("one character is " + character)

names = "melanogaster,simulans,yakuba,ananassae"
species = names.split(",")
print(str(species))
Source: Python for Biologists, Dr. Martin Jones
Lists and Loops:
Looping through File, Line by Line
file = open("some_input.txt")
for line in file:
    # do something with the line, for example print it
    print(line)
Source: Python for Biologists, Dr. Martin Jones
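Each line read from a file still carries its trailing newline, so it is usually stripped before further processing. The sketch below creates its own throwaway input file; the file contents and the lengths list are assumptions made for illustration.

```python
import os
import tempfile

# build a throwaway input file (name and contents are illustrative)
path = os.path.join(tempfile.mkdtemp(), "some_input.txt")
with open(path, "w") as handle:
    handle.write("ATGC\nGGCCTA\nTT\n")

lengths = []
with open(path) as input_file:
    for line in input_file:
        sequence = line.rstrip("\n")  # drop the trailing newline
        lengths.append(len(sequence))
        print(sequence + " has length " + str(len(sequence)))
```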
Lists and Loops:
Looping with Ranges
protein = "vlspadktnv"
# goal: print the growing substrings
# vls
# vlsp
# vlspa…

stop_positions = [3,4,5,6,7,8,9,10]
for stop in stop_positions:
    substring = protein[0:stop]
    print(substring)

# range(3, 8) produces 3, 4, 5, 6, 7
for number in range(3, 8):
    print(number)

# range(6) produces 0, 1, 2, 3, 4, 5
for number in range(6):
    print(number)
Source: Python for Biologists, Dr. Martin Jones
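The hand-written list of stop positions can be replaced by a range. A minimal sketch that combines the two ideas on this slide (the substrings list is only there to make the result easy to inspect):

```python
protein = "vlspadktnv"

substrings = []
# range(3, len(protein) + 1) produces 3, 4, ..., 10
for stop in range(3, len(protein) + 1):
    substrings.append(protein[0:stop])

for substring in substrings:
    print(substring)
```

Because the upper limit comes from len(protein), the loop keeps working if the sequence changes length.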
Lists and Loops:
Homework
§  Processing DNA in a file
•  The file input.txt contains a number of DNA sequences, one per line. Each sequence starts with the same 14 base pair fragment, a sequencing adapter that should have been removed. Write a program that will (a) trim this adapter and write the cleaned sequences to a new file and (b) print the length of each sequence to the screen.
Source: Python for Biologists, Dr. Martin Jones
Writing Your Own Functions:
Convert Code to Function
my_dna = "ACTGATCGATTACGTATAGTATTTGCTATCATACATATATATCGATGCGTTCAT"
length = len(my_dna)
a_count = my_dna.count('A')
t_count = my_dna.count('T')
at_content = (a_count + t_count) / length
print("AT content is " + str(at_content))
==============================================
from __future__ import division  # if using Python 2

def get_at_content(dna):
    length = len(dna)
    a_count = dna.count('A')
    t_count = dna.count('T')
    at_content = (a_count + t_count) / length
    return at_content
==============================================
print("AT content is " + str(get_at_content("ATGACTGGACCA")))
Source: Python for Biologists, Dr. Martin Jones
Writing Your Own Functions:
Improving our Function
def get_at_content(dna, sig_figs):
    length = len(dna)
    a_count = dna.upper().count('A')
    t_count = dna.upper().count('T')
    at_content = (a_count + t_count) / length
    return round(at_content, sig_figs)

test_dna = "ATGCATGCAACTGTAGC"
print(get_at_content(test_dna, 1))
print(get_at_content(test_dna, 2))
print(get_at_content(test_dna, 3))
Source: Python for Biologists, Dr. Martin Jones
Writing Your Own Functions:
Improving our Function
§  Functions do not always have to take parameters
§  Functions do not always have to return a value

def get_at_content():
    test_dna = "ATGCATGCAACTGTAGC"
    length = len(test_dna)
    a_count = test_dna.upper().count('A')
    t_count = test_dna.upper().count('T')
    at_content = (a_count + t_count) / length
    print(round(at_content, 2))

§  What are the disadvantages of doing these things?
Source: Python for Biologists, Dr. Martin Jones
Writing Your Own Functions:
Defaults & Named Arguments
§  Function arguments can be named
§  Order then does not matter

get_at_content(dna="ATCGTGACTCG", sig_figs=2)
get_at_content(sig_figs=2, dna="ATCGTGACTCG")

§  Functions can have default values
§  Default values do not need to be provided unless a different value is desired

def get_at_content(dna, sig_figs=2):
    (function code)
Source: Python for Biologists, Dr. Martin Jones
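Putting the default value and the named arguments together, a complete version of the function might look like the sketch below. It assumes Python 3 division; the variable names default_call and named_call are illustrative.

```python
def get_at_content(dna, sig_figs=2):
    """Return the AT fraction of dna, rounded to sig_figs decimal places."""
    length = len(dna)
    a_count = dna.upper().count("A")
    t_count = dna.upper().count("T")
    at_content = (a_count + t_count) / length
    return round(at_content, sig_figs)

# sig_figs is left out, so the default of 2 is used
default_call = get_at_content("ATCGTGACTCG")

# named arguments can be given in any order
named_call = get_at_content(sig_figs=3, dna="ATCGTGACTCG")

print(default_call)
print(named_call)
```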
Writing Your Own Functions:
Testing Functions
§  Functions should be tested with known good values
§  Functions should be tested with known bad values

assert get_at_content("ATGC") == 0.5
assert get_at_content("A") == 1
assert get_at_content("G") == 0
assert get_at_content("AGG") == 0.33
assert get_at_content("AGG", 1) == 0.3
assert get_at_content("AGG", 5) == 0.33333
Source: Python for Biologists, Dr. Martin Jones
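The assertions on this slide cover good values. As a sketch of a "known bad value" test, the example below checks what happens with an empty sequence. It assumes the version of get_at_content with a default of two decimal places and Python 3 division; the flag name empty_raises is illustrative.

```python
def get_at_content(dna, sig_figs=2):
    """Return the AT fraction of dna, rounded to sig_figs decimal places."""
    length = len(dna)
    a_count = dna.upper().count("A")
    t_count = dna.upper().count("T")
    return round((a_count + t_count) / length, sig_figs)

# known good values
assert get_at_content("ATGC") == 0.5
assert get_at_content("atgc") == 0.5  # lower case input is handled

# known bad value: an empty sequence has length zero,
# so the division inside the function fails
try:
    get_at_content("")
    empty_raises = False
except ZeroDivisionError:
    empty_raises = True

print("empty sequence raises an error: " + str(empty_raises))
```

Whether an empty sequence should raise an error or return some agreed value is a design decision; the test simply documents the current behaviour.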
Conditional Tests:
True, False, if…elif…else
§  Python has the built-in values True and False
§  Conditions evaluate to True or False
§  If statements run a block of code only when a condition is True
expression_level = 125
if expression_level > 100:
    print("gene is highly expressed")

expression_level = 125
if expression_level > 100:
    print("gene is highly expressed")
else:
    print("gene is lowly expressed")
Source: Python for Biologists, Dr. Martin Jones
Conditional Tests:
True, False, if…elif…else
file1 = open("one.txt", "w")
file2 = open("two.txt", "w")
file3 = open("three.txt", "w")
accs = ['ab56', 'bh84', 'hv76', 'ay93', 'ap97', 'bd72']
for accession in accs:
    if accession.startswith('a'):
        file1.write(accession + "\n")
    elif accession.startswith('b'):
        file2.write(accession + "\n")
    else:
        file3.write(accession + "\n")
# close the files so that everything is written to disk
file1.close()
file2.close()
file3.close()
Source: Python for Biologists, Dr. Martin Jones
Conditional Tests:
While Loops
§  While loops repeat as long as a condition is true
count = 0
while count < 10:
    print(count)
    count = count + 1
Source: Python for Biologists, Dr. Martin Jones
Conditional Tests:
Building Complex Conditions
§  Use “and”, “or” and “not” to build complex conditions
accs = ['ab56', 'bh84', 'hv76', 'ay93', 'ap97', 'bd72']
for accession in accs:
    if accession.startswith('a') and accession.endswith('3'):
        print(accession)
Source: Python for Biologists, Dr. Martin Jones
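The slide shows a condition built with "and". A short sketch of the other two keywords, "or" and "not", using the same accession list (the list names a_or_b and not_a are illustrative):

```python
accs = ['ab56', 'bh84', 'hv76', 'ay93', 'ap97', 'bd72']

a_or_b = []
not_a = []
for accession in accs:
    # true when at least one of the two tests is true
    if accession.startswith('a') or accession.startswith('b'):
        a_or_b.append(accession)
    # not turns True into False and False into True
    if not accession.startswith('a'):
        not_a.append(accession)

print(a_or_b)
print(not_a)
```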
Regular Expressions:
Patterns in Biology
§  There are a lot of patterns in biology:
–  protein domains
–  DNA transcription factor binding motifs
–  restriction enzyme cut sites
–  runs of mononucleotides
§  Patterns in strings inside text:
–  read mapping locations
–  geographical sample coordinates
–  taxonomic names
–  gene names
–  gene accession numbers
–  BLAST searches
Source: Python for Biologists, Dr. Martin Jones
Regular Expressions:
Patterns in Biology
§  Many problems that we want to solve require more flexible patterns:
–  Given a DNA sequence, what's the length of the poly-A tail?
–  Given a gene accession name, extract the part between the third character and the underscore
–  Given a protein sequence, determine if it contains this highly-redundant domain motif
Source: Python for Biologists, Dr. Martin Jones
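Two of the questions above can be sketched with Python's re module. The DNA sequence and the accession name are made up for illustration, and the second pattern assumes that "between the third character and the underscore" means everything after the first three characters.

```python
import re

# made-up example data
dna = "ATGCGTACGTTAGC" + "A" * 8
accession = "ABC12345_1"

# A+ matches a run of one or more A; $ anchors it to the end of the string
tail_match = re.search(r"A+$", dna)
if tail_match:
    tail_length = len(tail_match.group())
else:
    tail_length = 0

# skip three characters, then capture everything up to the underscore
acc_match = re.search(r"^.{3}([^_]+)_", accession)
if acc_match:
    middle_part = acc_match.group(1)
else:
    middle_part = None

print(tail_length)
print(middle_part)
```

re.search returns None when nothing matches, which is why both results are checked before calling group().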
Regular Expressions:
Modules in Python
§  To search for these patterns, we use the regular expression module “re”
import re

re.search(pattern, string)

dna = "ATCGCGAATTCAC"
if re.search(r"GAATTC", dna):
    print("restriction site found!")

if re.search(r"GC(A|T|G|C)GC", dna):
    print("restriction site found!")

if re.search(r"GC[ATGC]GC", dna):
    print("restriction site found!")
Source: Python for Biologists, Dr. Martin Jones
Regular Expressions:
Get String and Position of Match
§  Get the string that matched
dna = "ATGACGTACGTACGACTG"
# store the match object in the variable m
m = re.search(r"GA([ATGC]{3})AC([ATGC]{2})AC", dna)
print("entire match: " + m.group())
print("first bit: " + m.group(1))
print("second bit: " + m.group(2))

§  Get the positions of the match

print("start: " + str(m.start()))
print("end: " + str(m.end()))
Source: Python for Biologists, Dr. Martin Jones
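re.search reports only the first match. To collect the positions of every match, re.finditer can be used in a loop. A minimal sketch on the same sequence; the pattern "ACG" is chosen only for illustration.

```python
import re

dna = "ATGACGTACGTACGACTG"

positions = []
# finditer yields one match object per non-overlapping match
for m in re.finditer(r"ACG", dna):
    positions.append((m.start(), m.end()))
    print("match " + m.group() + " from " + str(m.start()) + " to " + str(m.end()))
```

As with slices, the end position is one past the last matched character.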
Dictionaries:
Storing Paired Data
enzymes = {}
enzymes['EcoRI'] = r'GAATTC'
enzymes['AvaII'] = r'GG(A|T)CC'
enzymes['BisI'] = r'GC[ATGC]GC'

# remove the EcoRI enzyme from the dict
enzymes.pop('EcoRI')

dna = "AATGATCGATCGTACGCTGA"
counts = {}
for base1 in ['A', 'T', 'G', 'C']:
    for base2 in ['A', 'T', 'G', 'C']:
        for base3 in ['A', 'T', 'G', 'C']:
            trinucleotide = base1 + base2 + base3
            count = dna.count(trinucleotide)
            counts[trinucleotide] = count
print(counts)
Source: Python for Biologists, Dr. Martin Jones
Dictionaries:
Storing Paired Data

{'ACC': 0, 'ATG': 1, 'AAG': 0, 'AAA': 0, 'ATC': 2, 'AAC': 0,
'ATA': 0, 'AGG': 0, 'CCT': 0, 'CTC': 0, 'AGC': 0, 'ACA': 0,
'AGA': 0, 'CAT': 0, 'AAT': 1, 'ATT': 0, 'CTG': 1, 'CTA': 0,
'ACT': 0, 'CAC': 0, 'ACG': 1, 'CAA': 0, 'AGT': 0, 'CAG': 0,
'CCG': 0, 'CCC': 0, 'CTT': 0, 'TAT': 0, 'GGT': 0, 'TGT': 0,
'CGA': 1, 'CCA': 0, 'TCT': 0, 'GAT': 2, 'CGG': 0, 'TTT': 0,
'TGC': 0, 'GGG': 0, 'TAG': 0, 'GGA': 0, 'TAA': 0, 'GGC': 0,
'TAC': 1, 'TTC': 0, 'TCG': 2, 'TTA': 0, 'TTG': 0, 'TCC': 0,
'GAA': 0, 'TGG': 0, 'GCA': 0, 'GTA': 1, 'GCC': 0, 'GTC': 0,
'GCG': 0, 'GTG': 0, 'GAG': 0, 'GTT': 0, 'GCT': 1, 'TGA': 2,
'GAC': 0, 'CGT': 1, 'TCA': 0, 'CGC': 1}

print(counts['TGA'])
Source: Python for Biologists, Dr. Martin Jones
Dictionaries:
Storing Paired Data
if 'AAA' in counts:
    print(counts['AAA'])

for trinucleotide in counts.keys():
    if counts.get(trinucleotide) == 2:
        print(trinucleotide)

for trinucleotide in sorted(counts.keys()):
    if counts.get(trinucleotide) == 2:
        print(trinucleotide)

for trinucleotide, count in counts.items():
    if count == 2:
        print(trinucleotide)
Source: Python for Biologists, Dr. Martin Jones
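An alternative sketch stores only the trinucleotides that actually occur, by sliding a window of width three along the sequence and using get with a default of 0. Note that this counts overlapping occurrences, whereas the count method on strings counts non-overlapping ones; for this sequence the two approaches agree.

```python
dna = "AATGATCGATCGTACGCTGA"

counts = {}
# slide a window of width 3 along the sequence
for start in range(len(dna) - 2):
    trinucleotide = dna[start:start + 3]
    # get returns 0 when the key has not been seen yet
    counts[trinucleotide] = counts.get(trinucleotide, 0) + 1

for trinucleotide in sorted(counts.keys()):
    if counts[trinucleotide] == 2:
        print(trinucleotide)

# a trinucleotide that never occurs is simply absent
print(counts.get('AAA', 0))
```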
Files, Programs, & User Input:
Basic File Manipulation
§  Rename a file

import os
os.rename("old.txt", "new.txt")

§  Rename a folder

os.rename("/home/martin/old_folder", "/home/martin/new_folder")

§  Check to see if a file exists

if os.path.exists("/home/martin/email.txt"):
    print("You have mail!")
Source: Python for Biologists, Dr. Martin Jones
Files, Programs, & User Input:
Basic File Manipulation
§  Remove a file
os.remove("/home/martin/unwanted_file.txt")
§  Remove an empty folder
os.rmdir("/home/martin/empty")
§  To delete a folder and all the files in it, use shutil.rmtree
import shutil
shutil.rmtree("/home/martin/full")
Source: Python for Biologists, Dr. Martin Jones
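Because these calls change or delete real files, it is safer to try them first inside a temporary folder. The sketch below does that; the folder comes from tempfile.mkdtemp and the file names are illustrative.

```python
import os
import shutil
import tempfile

# work inside a throwaway folder so nothing important is touched
base = tempfile.mkdtemp()

old_path = os.path.join(base, "old.txt")
new_path = os.path.join(base, "new.txt")
with open(old_path, "w") as handle:
    handle.write("example\n")

# rename the file and confirm that only the new name exists
os.rename(old_path, new_path)
renamed_ok = os.path.exists(new_path) and not os.path.exists(old_path)

# remove the file, then the folder and anything left in it
os.remove(new_path)
shutil.rmtree(base)
cleaned_up = not os.path.exists(base)

print(renamed_ok, cleaned_up)
```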
Files, Programs, & User Input:
Running External Programs
§  Run an external program
import subprocess
subprocess.call("/bin/date")

§  Run an external program with options

subprocess.call("/bin/date +%B", shell=True)

§  Saving program output

current_month = subprocess.check_output("/bin/date +%B", shell=True)
Source: Python for Biologists, Dr. Martin Jones
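In Python 3, check_output returns bytes rather than a string, so the result usually needs decoding. The sketch below also passes the command as a list, which avoids shell=True. To stay runnable on systems without /bin/date it calls the Python interpreter itself as the external program; that substitution is an assumption made for portability.

```python
import subprocess
import sys

# sys.executable is the running Python interpreter; it stands in for
# /bin/date so the sketch also runs where that program is missing
output = subprocess.check_output([sys.executable, "-c", "print('March')"])

# output is bytes; decode it and strip the trailing newline
current_month = output.decode().strip()
print(current_month)
```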
Files, Programs, & User Input:
User Input
§  Interactive user input
accession = input("Enter the accession name")
# do something with the accession variable
§  Capture command line arguments

import sys
print(sys.argv)
# python myprogram.py one two three
# sys.argv[0] is the script name; sys.argv[1] is the first argument ("one")
Source: Python for Biologists, Dr. Martin Jones
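The layout of sys.argv can be sketched with a small helper that separates the script name from the arguments. The function name split_arguments is hypothetical, and the example list mimics what sys.argv would hold for the command shown on the slide.

```python
import sys

def split_arguments(argv):
    """Separate the script name (position 0) from the remaining arguments."""
    script_name = argv[0]
    arguments = argv[1:]
    return script_name, arguments

# what sys.argv would contain for: python myprogram.py one two three
example_argv = ["myprogram.py", "one", "two", "three"]
script_name, arguments = split_arguments(example_argv)

print("script: " + script_name)
print("arguments: " + str(arguments))

# the real list for the current run
print(sys.argv)
```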
Goals
§  Introduce you to the basics of the Python programming language
§  Introduce you to the iPython environment
§  Prepare you for the next session “Introduction to Biopython for Scientists”
§  Enable you to write or assemble scripts of your own or modify existing scripts for your own purposes
Resources: Website
§  Websites
•  http://pythonforbiologists.com
•  http://www.pythonforbeginners.com
•  http://www.pythontutor.com/visualize.html#mode=display
§  Free eBook in HTML / PDF
•  http://pythonforbiologists.com
•  http://greenteapress.com/thinkpython/
•  http://openbookproject.net/books/bpp4awd/index.html
§  Python Regular Expressions (pattern matching)
•  http://www.pythonregex.com
§  Python Style Guide
•  http://www.python.org/dev/peps/pep-0008/
Additional Seminars
§  Introduction to BioPython for Scientists
§  Introduction to Data Analysis with Python
•  Utilizing NumPy and pandas modules
Collaborations welcome
One-on-one training available for those on NIH campus and related
agencies
ScienceApps at niaid.nih.gov
80

Intro to Python programming and iPython

  • 1.
    3/30/14   1   R.Burke Squires Computational Genomics Specialist Bioinformatics and Computational Biosciences Branch (BCBB) 2 Bioinformatics & Computational Biology Branch (BCBB) Why Python? 4Source: http://xkcd.com/353/ Topics §  iPython & Integrated Development Environments §  Printing and manipulating text §  Reading and writing files §  Lists and Loops §  Writing your own functions §  Conditional tests §  Regular Expressions §  Dictionaries §  Files, programs and user input 5 Resource: Bioinformatics Programming 6
  • 2.
    3/30/14   2   Goals § Introduce you to the basics of the python programming language §  Introduce you to the iPython environment and integrated development environments (IDE) §  Enable you to write or assemble scripts of your own or modify existing scripts for your own purposes §  Prepare you for the next session “Introduction to Biopython Programming” 7 8 Programming…is it Magic? §  No…BUT it can seem like it at times! J §  Working with text files 9 Python Scripts vs. Program 10 Each function is interpreted and executed (slower) Code is compiled once; executed as machine code (fastest) 11 In Your Toolbelt… §  Python environment §  Integrated Development Environment (IDE) –  Continuum Analytics Anaconda §  http://continuum.io/downloads.html –  Enthought Canopy Express §  https://www.enthought.com/products/epd/free/ –  iPython Notebook §  http://ipython.org –  PyCharm CE (Community Edition) §  http://www.jetbrains.com/pycharm/ 12
  • 3.
    3/30/14   3   PythonEnvironment §  Open Terminal §  Type “python” and hit return •  You should see “>>>” §  Enter “print(‘hello world’)” and hit return §  Congratulation! You have just written your first python script! §  You could also put code in text file and execute: •  $python script_name.py 13 iPython 14 iPython §  IPython provides architecture for interactive computing: •  Powerful interactive shells (terminal and Qt-based). •  A browser-based notebook with support for code, text, mathematical expressions, inline plots and other rich media. •  Support for interactive data visualization and use of GUI toolkits. •  Easy to use, high performance tools for parallel computing. 15 Source: ipython.org iPython §  Already installed •  Source: Continuum Analytics Anaconda –  http://continuum.io/downloads.html §  Double-click on icon on desktop: §  Launch the ipython-notebook 16 iPython – Home Screen 17 iPython – New Notebook 18
  • 4.
    3/30/14   4   iPython § Add text using Cell -> Markdown •  Type #Intro to Python •  Type “This is my first iPython notebook.” •  (To edit change to raw text) §  Add a code cell §  Type “print(“Hello world”) §  Click play or run button (or Cell -> Run) 19 Source: ipython.org iPython Notebook 20 iPython Notebook Help 21 iPython Notebook Help §  Add images to your notebook •  “!["DNA"](files/DNA_chain.jpg)” •  In the same folder as notebook §  Add YouTube Videos to your notebook: •  from IPython.display import YouTubeVideo •  YouTubeVideo('iwVvqwLDsJo') 22 Additional Tools Canopy PyCharm 23 Advantages of IDEs 24 §  PyCharm features: •  Intelligent Editor: –  Code completion, on-the-fly error highlighting, auto-fixes, etc. •  Automated code refactorings and rich navigation capabilities •  Integrated debugger and unit testing support •  Native version control system (VCS) integrations •  Customizable UI and key-bindings, with VIM emulation available Source: http://www.jetbrains.com/pycharm/
  • 5.
    3/30/14   5   Lastly:Python IDEs in the Cloud 25 §  Python Anywhere •  http://www.pythonanywhere.com §  Python Fiddle: Python Cloud IDE •  http://pythonfiddle.com §  Koding: Free Programming Virtual Machine •  http://koding.com 26 Printing and manipulating text: “Hello World” §  While in iPython: •  Type print(“Hello world”) •  “Run” the program §  The whole thing is a statement; print is a function §  Comments •  # This is a comment! 27 Printing and manipulating text: Storing String in Variables # store a short DNA sequence in the variable my_dna! my_dna = "ATGCGTA"! ! # now print the DNA sequence! print(my_dna) 28Source: Python for Biologists, Dr. Martin Jones Printing and manipulating text: Concatenation my_dna = "AATT" + "GGCC"! print(my_dna)! ! upstream = "AAA"! my_dna = upstream + "ATGC"! # my_dna is now "AAAATGC" 29Source: Python for Biologists, Dr. Martin Jones Printing and manipulating text: Finding the Length of a String # store the DNA sequence in a variable! my_dna = "ATGCGAGT”! ! # calculate the length of the sequence and store it in a variable! dna_length = len(my_dna)! ! # print a message telling us the DNA sequence lenth! print("The length of the DNA sequence is " + dna_length)! ! my_dna = "ATGCGAGT"! dna_length = len(my_dna)! print("The length of the DNA sequence is " + str(dna_length)) 30Source: Python for Biologists, Dr. Martin Jones
  • 6.
    3/30/14   6   Printingand manipulating text: Replacement protein = "vlspadktnv"! ! # replace valine with tyrosine! print(protein.replace("v", "y"))! ! # we can replace more than one character! print(protein.replace("vls", "ymt"))! ! # the original variable is not affected! print(protein) 31Source: Python for Biologists, Dr. Martin Jones Printing and manipulating text: Replacement protein = "vlspadktnv"! ! # print positions three to five! print(protein[3:5])! ! # positions start at zero, not one! print(protein[0:6])! ! # if we use a stop position beyond the end, it's the same as using the end! print(protein[0:60]) 32Source: Python for Biologists, Dr. Martin Jones Printing and manipulating text: Replacement protein = "vlspadktnv"! ! # count amino acid residues! valine_count = protein.count('v')! lsp_count = protein.count('lsp')! tryptophan_count = protein.count('w')! ! # now print the counts! print("valines: " + str(valine_count))! print("lsp: " + str(lsp_count))! print("tryptophans: " + str(tryptophan_count)) 33Source: Python for Biologists, Dr. Martin Jones Printing and manipulating text: Homework Calculating AT content! Here's a short DNA sequence:! ! ACTGATCGATTACGTATAGTATTTGCTATCATACATATATATCGATGCGTTCAT! ! Write a program that will print out the AT content of this DNA sequence. Hint: you can use normal mathematical symbols like add (+), subtract (-), multiply (*), divide (/) and parentheses to carry out calculations on numbers in Python.! ! Reminder: if you're using Python 2 rather than Python 3, include this line at the top of your program:! from __future__ import division 34Source: Python for Biologists, Dr. Martin Jones 35 Reading and Writing Files: Reading a File my_file = open("dna.txt")! file_contents = my_file.read()! print(file_contents)! ! my_file = open("dna.txt")! my_file_contents = my_file.read()! ! # remove the newline from the end of the file contents! my_dna = my_file_contents.rstrip("n")! dna_length = len(my_dna)! 
print("sequence is " + my_dna + " and length is " + str(dna_length)) 36Source: Python for Biologists, Dr. Martin Jones
  • 7.
    3/30/14   7   Readingand Writing Files: Writing to a File my_file = open("out.txt", "w")! my_file.write("Hello world")! ! # remember to close the file! my_file.close()! ! my_file = open("/Users/martin/Desktop/myfolder/myfile.txt") 37Source: Python for Biologists, Dr. Martin Jones Reading and Writing Files: Homework Writing a FASTA file FASTA file format is a commonly-used DNA and protein sequence file format. A single sequence in FASTA format looks like this: >sequence_name ATCGACTGATCGATCGTACGAT Write a program that will create a FASTA file for the following three sequences – make sure that all sequences are in upper case and only contain the bases A, T, G and C. Sequence header DNA sequence ABC123 ATCGTACGATCGATCGATCGCTAGACGTATCG DEF456 actgatcgacgatcgatcgatcacgact HIJ789 ACTGAC-ACTGT--ACTGTA----CATGTG 38Source: Python for Biologists, Dr. Martin Jones 39 Lists and Loops: Lists apes = ["Homo sapiens", "Pan troglodytes", "Gorilla gorilla"]! conserved_sites = [24, 56, 132]! print(apes[0])! first_site = conserved_sites[2]! ! chimp_index = apes.index("Pan troglodytes")! # chimp_index is now 1! ! nucleotides = ["T", ”C", ”A”, “G”] last_ape = apes[-1]! ! ! 40 −1 >>> 'MNKMDLVADVAEKTDLSKAKATEVIDAVFA'[−1] 'A' >>> 'MNKMDLVADVAEKTDLSKAKATEVIDAVFA'[−5] 'D' >>> 'MNKMDLVADVAEKTDLSKAKATEVIDAVFA'[7 // 2] 'K' 0 >>> 'MNKMDLVADVAEKTDLSKAKATEVIDAVFA'[50] Traceback (most recent call last): File "<pyshell#14>", line 1, in <module> 'MNKMDLVADVAEKTDLSKAKATEVIDAVFA'[50] IndexError: string index out of range Slicing [m:n] 10 | Chapter 1: Primitives Source: Python for Biologists, Dr. Martin Jones & O’Reilly Bioinformatics Programming Using Python Lists and Loops: Slicing & Appending Lists ranks = ["kingdom", "phylum", "class", "order", "family"]! lower_ranks = ranks[2:5]! # lower ranks are class, order and family! ! ! apes = ["Homo sapiens", "Pan troglodytes", "Gorilla gorilla"]! print("There are " + str(len(apes)) + " apes")! apes.append("Pan paniscus")! 
print("Now there are " + str(len(apes)) + " apes")! 41Source: Python for Biologists, Dr. Martin Jones Lists and Loops: Concatenating, Reversing & Sorting Lists apes = ["Homo sapiens", "Pan troglodytes", "Gorilla gorilla"]! monkeys = ["Papio ursinus", "Macaca mulatta"]! primates = apes + monkeys! print(str(len(apes)) + " apes")! print(str(len(monkeys)) + " monkeys")! print(str(len(primates)) + " primates")! ! ! ranks = ["kingdom", "phylum", "class", "order", "family"]! print("at the start : " + str(ranks))! ranks.reverse()! print("after reversing : " + str(ranks))! ranks.sort()! print("after sorting : " + str(ranks))! 42Source: Python for Biologists, Dr. Martin Jones
  • 8.
    3/30/14   8   Listsand Loops: Looping through Lists apes = ["Homo sapiens", "Pan troglodytes", "Gorilla gorilla"]! for ape in apes:! print(ape + " is an ape")! ! ! apes = ["Homo sapiens", "Pan troglodytes", "Gorilla gorilla"]! for ape in apes:! name_length = len(ape)! first_letter = ape[0]! print(ape + " is an ape. Its name starts with " + "! " first_letter)! print("Its name has " + str(name_length) + " letters")! 43Source: Python for Biologists, Dr. Martin Jones Python: Indentation apes = ["Homo sapiens", "Pan troglodytes", "Gorilla gorilla"]! for ape in apes:! name_length = len(ape)! first_letter = ape[0]! print(ape + " is an ape. Its name starts with " + ! " first_letter)! print("Its name has " + str(name_length) + " letters")! ! Indentation errors! ! Use tabs or spaces but not both ! 44Source: Python for Biologists, Dr. Martin Jones Lists and Loops: Using Strings as Lists, Splitting name = "martin"! for character in name:! print("one character is " + character)! ! ! names = "melanogaster,simulans,yakuba,ananassae"! species = names.split(",")! print(str(species))! 45Source: Python for Biologists, Dr. Martin Jones Lists and Loops: Looping through File, Line by Line file = open("some_input.txt")! for line in file:! # do something with the line! 46Source: Python for Biologists, Dr. Martin Jones Lists and Loops: Looping with Ranges protein = "vlspadktnv”! vls! vlsp! vlspa…! ! ! stop_positions = [3,4,5,6,7,8,9,10]! for stop in stop_positions:! substring = protein[0:stop]! print(substring)! ! for number in range(3, 8):! print(number)! ! for number in range(6):! print(number)! 47Source: Python for Biologists, Dr. Martin Jones Lists and Loops: Looping with Ranges protein = "vlspadktnv”! vls! vlsp! vlspa…! ! ! stop_positions = [3,4,5,6,7,8,9,10]! for stop in stop_positions:! substring = protein[0:stop]! print(substring)! ! for number in range(3, 8):! print(number)! ! for number in range(6):! print(number)! 48Source: Python for Biologists, Dr. Martin Jones
Lists and Loops: Homework
§  Processing DNA in a file
•  The file input.txt contains a number of DNA sequences, one per line. Each sequence starts with the same 14 base pair fragment, a sequencing adapter that should have been removed. Write a program that will (a) trim this adapter and write the cleaned sequences to a new file and (b) print the length of each sequence to the screen.
Source: Python for Biologists, Dr. Martin Jones

Writing Your Own Functions: Convert Code to Function
    my_dna = "ACTGATCGATTACGTATAGTATTTGCTATCATACATATATATCGATGCGTTCAT"
    length = len(my_dna)
    a_count = my_dna.count('A')
    t_count = my_dna.count('T')
    at_content = (a_count + t_count) / length
    print("AT content is " + str(at_content))
==============================================
    from __future__ import division  # only needed on Python 2

    def get_at_content(dna):
        length = len(dna)
        a_count = dna.count('A')
        t_count = dna.count('T')
        at_content = (a_count + t_count) / length
        return at_content
==============================================
    print("AT content is " + str(get_at_content("ATGACTGGACCA")))
Source: Python for Biologists, Dr. Martin Jones

Writing Your Own Functions: Improving our Function
    def get_at_content(dna, sig_figs):
        length = len(dna)
        a_count = dna.upper().count('A')
        t_count = dna.upper().count('T')
        at_content = (a_count + t_count) / length
        return round(at_content, sig_figs)

    test_dna = "ATGCATGCAACTGTAGC"
    print(get_at_content(test_dna, 1))
    print(get_at_content(test_dna, 2))
    print(get_at_content(test_dna, 3))
Source: Python for Biologists, Dr. Martin Jones

Writing Your Own Functions: Improving our Function
§  Functions do not always have to take parameters
§  Functions do not always have to return a value
    def get_at_content():
        # the sequence and the rounding are fixed inside the function
        dna = "ATGCATGCAACTGTAGC"
        length = len(dna)
        a_count = dna.upper().count('A')
        t_count = dna.upper().count('T')
        at_content = (a_count + t_count) / length
        print(round(at_content, 2))
§  What are the disadvantages of doing these things?
Source: Python for Biologists, Dr. Martin Jones

Writing Your Own Functions: Defaults & Named Arguments
§  Function arguments can be named
§  Order then does not matter!
    get_at_content(dna="ATCGTGACTCG", sig_figs=2)
    get_at_content(sig_figs=2, dna="ATCGTGACTCG")
§  Functions can have default values
§  Default values do not need to be provided unless a different value is desired
    def get_at_content(dna, sig_figs=2):
        # (function code)
Source: Python for Biologists, Dr. Martin Jones
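To make the defaults and named-arguments slide concrete, here is a minimal runnable sketch that fills in the "(function code)" placeholder with the body from the earlier slides. It assumes Python 3, where / is true division. The variable names default_result and named_result are illustrative. Note that round() rounds to a number of decimal places, so sig_figs here really means decimal places.

```python
def get_at_content(dna, sig_figs=2):
    """Return the AT fraction of dna, rounded to sig_figs decimal places."""
    dna = dna.upper()
    length = len(dna)
    a_count = dna.count("A")
    t_count = dna.count("T")
    return round((a_count + t_count) / length, sig_figs)

# sig_figs is omitted, so the default of 2 is used
default_result = get_at_content("ATCGTGACTCG")

# named arguments can be given in any order
named_result = get_at_content(sig_figs=1, dna="ATCGTGACTCG")

print(default_result)
print(named_result)
```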
Writing Your Own Functions: Testing Functions
§  Functions should be tested with known good values
§  Functions should be tested with known bad values
    assert get_at_content("ATGC") == 0.5
    assert get_at_content("A") == 1
    assert get_at_content("G") == 0
    assert get_at_content("ATGC") == 0.5
    assert get_at_content("AGG") == 0.33
    assert get_at_content("AGG", 1) == 0.3
    assert get_at_content("AGG", 5) == 0.33333
Source: Python for Biologists, Dr. Martin Jones

Conditional Tests: True, False, If…else…elif…then
§  Python has the built-in values "True" and "False"
§  Conditions evaluate to True or False
§  if statements use conditions to decide which code runs
    expression_level = 125
    if expression_level > 100:
        print("gene is highly expressed")

    expression_level = 125
    if expression_level > 100:
        print("gene is highly expressed")
    else:
        print("gene is lowly expressed")
Source: Python for Biologists, Dr. Martin Jones

Conditional Tests: True, False, If…else…elif…then
    file1 = open("one.txt", "w")
    file2 = open("two.txt", "w")
    file3 = open("three.txt", "w")
    accs = ['ab56', 'bh84', 'hv76', 'ay93', 'ap97', 'bd72']
    for accession in accs:
        if accession.startswith('a'):
            file1.write(accession + "\n")
        elif accession.startswith('b'):
            file2.write(accession + "\n")
        else:
            file3.write(accession + "\n")
    # close the files so everything is written to disk
    file1.close()
    file2.close()
    file3.close()
Source: Python for Biologists, Dr. Martin Jones

Conditional Tests: While Loops
§  While loops repeat as long as a condition is true
    count = 0
    while count < 10:
        print(count)
        count = count + 1
Source: Python for Biologists, Dr. Martin Jones
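The if/elif/else slide above writes accessions to three files, which makes the branching hard to check at a glance. As a sketch under that same logic, the version below collects the names in three lists instead. The function name bin_accessions and the list names are assumptions made for illustration, not part of the slides.

```python
def bin_accessions(accessions):
    """Split accession names into three lists based on their first letter."""
    a_names, b_names, other_names = [], [], []
    for accession in accessions:
        if accession.startswith("a"):
            a_names.append(accession)
        elif accession.startswith("b"):
            # only reached when the first test was False
            b_names.append(accession)
        else:
            # reached when neither test above was True
            other_names.append(accession)
    return a_names, b_names, other_names

accs = ['ab56', 'bh84', 'hv76', 'ay93', 'ap97', 'bd72']
a_names, b_names, other_names = bin_accessions(accs)
print(a_names)
print(b_names)
print(other_names)
```

Each accession lands in exactly one list, because Python stops at the first branch whose condition is true.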
Conditional Tests: Building Complex Conditions
§  Use "and", "or" and "not" to build complex conditions
    accs = ['ab56', 'bh84', 'hv76', 'ay93', 'ap97', 'bd72']
    for accession in accs:
        if accession.startswith('a') and accession.endswith('3'):
            print(accession)
Source: Python for Biologists, Dr. Martin Jones

Regular Expressions: Patterns in Biology
§  There are a lot of patterns in biology:
–  protein domains
–  DNA transcription factor binding motifs
–  restriction enzyme cut sites
–  runs of mononucleotides
§  Patterns in strings inside text:
–  read mapping locations
–  geographical sample coordinates
–  taxonomic names
–  gene names
–  gene accession numbers
–  BLAST searches
Source: Python for Biologists, Dr. Martin Jones

Regular Expressions: Patterns in Biology
§  Many problems that we want to solve require more flexible patterns:
–  Given a DNA sequence, what's the length of the poly-A tail?
–  Given a gene accession name, extract the part between the third character and the underscore
–  Given a protein sequence, determine if it contains this highly-redundant domain motif
Source: Python for Biologists, Dr. Martin Jones

Regular Expressions: Modules in Python
§  To search for these patterns, we use the regular expression module "re"
    import re

    # general form: re.search(pattern, string)

    dna = "ATCGCGAATTCAC"
    if re.search(r"GAATTC", dna):
        print("restriction site found!")

    if re.search(r"GC(A|T|G|C)GC", dna):
        print("restriction site found!")

    if re.search(r"GC[ATGC]GC", dna):
        print("restriction site found!")
Source: Python for Biologists, Dr. Martin Jones

Regular Expressions: Get String and Position of Match
§  Get the string that matched
    dna = "ATGACGTACGTACGACTG"
    # store the match object in the variable m
    m = re.search(r"GA([ATGC]{3})AC([ATGC]{2})AC", dna)
    print("entire match: " + m.group())
    print("first bit: " + m.group(1))
    print("second bit: " + m.group(2))
§  Get the positions of the match
    print("start: " + str(m.start()))
    print("end: " + str(m.end()))
Source: Python for Biologists, Dr. Martin Jones
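One of the questions raised earlier, the length of a poly-A tail, can be answered with the same match-object methods. The sketch below is one possible approach, not the one from the source book. The function name poly_a_tail_length and the test sequences are made up for illustration.

```python
import re

def poly_a_tail_length(dna):
    """Return the length of the run of A's at the end of dna (0 if there is none)."""
    # A+ matches one or more A's; $ anchors the run to the end of the string
    m = re.search(r"A+$", dna)
    if m is None:
        # re.search returns None when nothing matches
        return 0
    return m.end() - m.start()

print(poly_a_tail_length("ATGCGTAAAAAA"))
print(poly_a_tail_length("ATGCGT"))
```

Checking for None before calling m.start() matters: calling a method on a failed match raises an AttributeError.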
Dictionaries: Storing Paired Data
    enzymes = {}
    enzymes['EcoRI'] = r'GAATTC'
    enzymes['AvaII'] = r'GG(A|T)CC'
    enzymes['BisI'] = r'GC[ATGC]GC'

    # remove the EcoRI enzyme from the dict
    enzymes.pop('EcoRI')

    dna = "AATGATCGATCGTACGCTGA"
    counts = {}
    for base1 in ['A', 'T', 'G', 'C']:
        for base2 in ['A', 'T', 'G', 'C']:
            for base3 in ['A', 'T', 'G', 'C']:
                trinucleotide = base1 + base2 + base3
                count = dna.count(trinucleotide)
                counts[trinucleotide] = count
    print(counts)
Source: Python for Biologists, Dr. Martin Jones

Dictionaries: Storing Paired Data
    {'ACC': 0, 'ATG': 1, 'AAG': 0, 'AAA': 0, 'ATC': 2, 'AAC': 0, 'ATA': 0, 'AGG': 0,
     'CCT': 0, 'CTC': 0, 'AGC': 0, 'ACA': 0, 'AGA': 0, 'CAT': 0, 'AAT': 1, 'ATT': 0,
     'CTG': 1, 'CTA': 0, 'ACT': 0, 'CAC': 0, 'ACG': 1, 'CAA': 0, 'AGT': 0, 'CAG': 0,
     'CCG': 0, 'CCC': 0, 'CTT': 0, 'TAT': 0, 'GGT': 0, 'TGT': 0, 'CGA': 1, 'CCA': 0,
     'TCT': 0, 'GAT': 2, 'CGG': 0, 'TTT': 0, 'TGC': 0, 'GGG': 0, 'TAG': 0, 'GGA': 0,
     'TAA': 0, 'GGC': 0, 'TAC': 1, 'TTC': 0, 'TCG': 2, 'TTA': 0, 'TTG': 0, 'TCC': 0,
     'GAA': 0, 'TGG': 0, 'GCA': 0, 'GTA': 1, 'GCC': 0, 'GTC': 0, 'GCG': 0, 'GTG': 0,
     'GAG': 0, 'GTT': 0, 'GCT': 1, 'TGA': 2, 'GAC': 0, 'CGT': 1, 'TCA': 0, 'CGC': 1}

    print(counts['TGA'])
Source: Python for Biologists, Dr. Martin Jones

Dictionaries: Storing Paired Data
    if 'AAA' in counts:
        print(counts['AAA'])

    for trinucleotide in counts.keys():
        if counts.get(trinucleotide) == 2:
            print(trinucleotide)

    for trinucleotide in sorted(counts.keys()):
        if counts.get(trinucleotide) == 2:
            print(trinucleotide)

    for trinucleotide, count in counts.items():
        if count == 2:
            print(trinucleotide)
Source: Python for Biologists, Dr. Martin Jones

Files, Programs, & User Input: Basic File Manipulation
§  Rename a file
    import os
    os.rename("old.txt", "new.txt")
§  Rename a folder
    os.rename("/home/martin/old_folder", "/home/martin/new_folder")
§  Check to see if a file exists
    if os.path.exists("/home/martin/email.txt"):
        print("You have mail!")
Source: Python for Biologists, Dr. Martin Jones
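The paths on the file-manipulation slide are specific to one machine, so the calls cannot be tried as written. The sketch below exercises os.rename and os.path.exists inside a temporary directory created with the standard tempfile module. The use of tempfile and the names workdir, old_path and new_path are assumptions added for illustration.

```python
import os
import tempfile

# Work inside a throwaway directory so no real files are touched
with tempfile.TemporaryDirectory() as workdir:
    old_path = os.path.join(workdir, "old.txt")
    new_path = os.path.join(workdir, "new.txt")

    # Create a small file to rename
    with open(old_path, "w") as handle:
        handle.write("ATGC\n")

    os.rename(old_path, new_path)

    # After the rename only the new name exists
    old_exists = os.path.exists(old_path)
    new_exists = os.path.exists(new_path)
    print(old_exists, new_exists)
```

The temporary directory and its contents are deleted automatically when the with block ends.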
Files, Programs, & User Input: Basic File Manipulation
§  Remove a file
    os.remove("/home/martin/unwanted_file.txt")
§  Remove an empty folder
    os.rmdir("/home/martin/empty")
§  To delete a folder and all the files in it, use shutil.rmtree
    import shutil
    shutil.rmtree("/home/martin/full")
Source: Python for Biologists, Dr. Martin Jones

Files, Programs, & User Input: Running External Programs
§  Run an external program
    import subprocess
    subprocess.call("/bin/date")
§  Run an external program with options
    subprocess.call("/bin/date +%B", shell=True)
§  Save program output
    current_month = subprocess.check_output("/bin/date +%B", shell=True)
Source: Python for Biologists, Dr. Martin Jones

Files, Programs, & User Input: User Input
§  Interactive user input
    accession = input("Enter the accession name")
    # do something with the accession variable
§  Capture command line arguments
    import sys
    print(sys.argv)
    # python myprogram.py one two three
    # sys.argv[0] holds the script name; sys.argv[1] is 'one'
Source: Python for Biologists, Dr. Martin Jones

Goals
§  Introduce you to the basics of the python programming language
§  Introduce you to the iPython environment
§  Prepare you for the next session "Introduction to Biopython for Scientists"
§  Enable you to write or assemble scripts of your own or modify existing scripts for your own purposes

Resources: Website
§  Websites
•  http://pythonforbiologists.com
•  http://www.pythonforbeginners.com
•  http://www.pythontutor.com/visualize.html#mode=display
§  Free eBook in HTML / PDF
•  http://pythonforbiologists.com
•  http://greenteapress.com/thinkpython/
•  http://openbookproject.net/books/bpp4awd/index.html
§  Python Regular Expressions (pattern matching)
•  http://www.pythonregex.com
§  Python Style Guide
•  http://www.python.org/dev/peps/pep-0008/
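Returning to the slide on running external programs: in Python 3, subprocess.check_output returns bytes rather than text, and a command can be passed as a list of arguments instead of a single shell string. The sketch below illustrates both points. It launches the current Python interpreter via sys.executable instead of /bin/date so that it runs on any platform. That substitution and the variable names raw_output and message are assumptions, not part of the slides.

```python
import subprocess
import sys

# Pass the command as a list of arguments; no shell is involved
raw_output = subprocess.check_output(
    [sys.executable, "-c", "print('hello from a child process')"]
)

# In Python 3 check_output returns bytes, so decode before treating it as text
message = raw_output.decode().strip()
print(message)
```

The list form avoids shell=True, which matters whenever part of the command comes from user input.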
Additional Seminars
§  Introduction to BioPython for Scientists
§  Introduction to Data Analysis with Python
•  Utilizing NumPy and pandas modules

Collaborations welcome
One-on-one training available for those on NIH campus and related agencies
ScienceApps at niaid.nih.gov