Perl Notes
Perl is a high-level programming language with an eclectic heritage written by Larry Wall and a
cast of thousands. It derives from the ubiquitous C programming language and to a lesser
extent from sed, awk, the Unix shell, and at least a dozen other tools and languages. Perl's
process, file, and text manipulation facilities make it particularly well-suited for tasks involving
quick prototyping, system utilities, software tools, system management tasks, database access,
graphical programming, networking, and web programming. These strengths make it especially
popular with system administrators and web developers, but mathematicians, geneticists,
journalists, and even managers also use Perl. Maybe you should, too.
The most important thing about programming is that it's a hands-on learning activity, like
dancing, playing music, cooking, or some other family-oriented activity. You can read about it,
but you can't actually do it until you actually do it.
While learning to program in Perl, you need to read about how Perl works, as you will in the
chapters that follow. You also need to look at plenty of examples of programs. But you
especially need to attempt to write your own programs, as you are asked to do in the exercises
at the end of the later chapters. Only this kind of direct experience will make you a
programmer.
So I want to give you an overview of the most important tasks involved in writing programs, to
help you approach your first programs with a clearer idea of what's really involved.
What exactly will you be doing at the computer? The bulk of a programmer's work involves the
steps of writing or revising a program in an editor, then running the program and watching how
it behaves, and on the basis of that behavior going back and revising the program again. A
typical programmer spends more than half of his or her time editing the program.
Ease of Programming
Computer languages differ in which things they make easy. By "easy" I mean easy for a
programmer to program. Perl has certain features that simplify several common
bioinformatics tasks. It can deal with information in ASCII text files or flat files, which are exactly
the kinds of files in which much important biological data appears, in the GenBank and PDB
databases, among others. Perl makes it easy to process and manipulate long sequences such as DNA and
proteins. Perl makes it convenient to write a program that controls one or more other
programs. As a final example, Perl is used to put biology research labs, and their results, on
their own dynamic web sites. Perl does all this and more.
Interpreters
An interpreter normally means a computer program that executes, i.e. performs, instructions
written in a programming language. An interpreter may be a program that either
1. executes the source code directly.
2. translates source code into some efficient intermediate representation (code) and
immediately executes this.
3. explicitly executes stored precompiled code made by a compiler which is part of the
interpreter system.
Perl, Python, MATLAB, and Ruby are examples of type 2, while UCSD Pascal and Java are type 3:
Source programs are compiled ahead of time and stored as machine independent code, which
is then linked at run-time and executed by an interpreter and/or compiler (for JIT systems).
Some systems, such as Smalltalk, BASIC and others, may also combine 2 and 3. While
interpreting and compiling are the two main means by which programming languages are
implemented, these are not fully distinct categories, one of the reasons being that most
interpreting systems also perform some translation work, just like compilers. The terms
"interpreted language" or "compiled language" merely mean that the canonical
implementation of that language is an interpreter or a compiler; a high level language is
basically an abstraction which is (ideally) independent of particular implementations.
CPAN is the Comprehensive Perl Archive Network, a large collection of Perl software and
documentation. You can begin exploring from either http://www.cpan.org/ or any of the mirrors
listed at http://www.cpan.org/SITES.html.
CPAN is also the name of a Perl module, CPAN.pm, which is used to download and install Perl
software from the CPAN archive. These notes cover only a little about the CPAN module; you
can find its documentation by running perldoc CPAN on the command line or on the web
at http://search.cpan.org/dist/CPAN/lib/CPAN.pm.
Perl Variables
Perl variables, and the techniques for handling them, are an important part of the Perl language.
As a scripting language, Perl was designed to handle large amounts of text data. Working with
variables is fairly straightforward: you do not need to define or allocate them in advance, and
no sophisticated techniques are needed to release the memory they occupy.
The names of Perl variables may contain alphabetic characters, numbers and the underscore (_)
character, and they are case sensitive.
A distinctive feature of the language is that each variable has a non-alphabetic prefix, which can
make Perl look somewhat cryptic. The advantage is that the prefix tells you immediately what type
of variable it is and what can be done with it. According to the first character of the variable
name, we have:
$   scalar variables
@   array variables
%   hash variables (associative arrays)
The $, @ and % characters thus determine the variable type in Perl. Perl also offers
some built-in predefined variables that simplify and shorten program code.
Below are the most important characteristics of the three types of Perl variables and some
examples of their use.
Scalar Perl variables.
Scalars are simple variables that can contain a single element: a string, a number or a reference to an
object. Strings are sequences of characters including any symbol, letter or number.
Numbers may be integers or decimal values, and may contain exponents. A reference is a scalar value that
contains the memory address where a scalar, array or hash variable is stored. A reference is
created by placing the \ character in front of the variable being referenced.
Below you can see some examples of code on how to use the scalar variable type.
$pi = 3.14;
$height = "20 cm";
$message = "Hello World!\n";
print $message;
$message = "Welcome!\n";
print $message;
$ref_a=\$message;    # \ creates a reference to $message
print $$ref_a;       # $$ dereferences it, printing "Welcome!"
Array Perl variables.
Arrays are ordered lists of scalars, where the first element of the list has the index 0. An
array can hold a practically unlimited number of elements and its name begins with the character @. An
element of the array is written with the $ prefix followed by the array name and its index
placed in square brackets.
For an array named @timeUnits whose first element is "hours", @timeUnits means the entire
array and $timeUnits[0] means the first element of the array, in our case "hours". In order to print all the
elements of the @timeUnits array, we could use a code snippet like this:
foreach $i (@timeUnits)
{
    print "$i\n";
}
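The array operations described above can be sketched as follows. Only "hours" as the first element comes from the text; the other values in @timeUnits are made up for illustration, and the script uses strict and my, which is a style choice rather than a requirement.

```perl
use strict;
use warnings;

# Hypothetical contents: only "hours" is taken from the text above.
my @timeUnits = ("hours", "minutes", "seconds");

my $first = $timeUnits[0];        # first element, index 0
my $last  = $timeUnits[-1];       # negative indexes count from the end
my $count = scalar @timeUnits;    # number of elements

print "first=$first last=$last count=$count\n";

push @timeUnits, "milliseconds";  # arrays grow as needed
print "now ", scalar(@timeUnits), " elements\n";
```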
Hash Perl variables.
Hashes, or associative arrays, consist of a group of pairs of elements: a key and a data value.
Whereas arrays are indexed by numbers, hashes are indexed by strings. Perl hash names
are prefixed by the % character.
If you want to refer to a single element of a hash such as %animalColors, you must use the $ scalar
symbol and curly brackets around the key, as in $animalColors{"bear"}.
If you want to print the whole %animalColors hash, you can loop over its keys with foreach and the
keys function and print each key together with its value. Running such a loop prints one key and
value per line, in no particular order, because hashes are not stored in a fixed order.
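A minimal sketch of a hash and a loop that prints it. The key "bear" appears in the text; the other keys and all of the colors are hypothetical values chosen for illustration.

```perl
use strict;
use warnings;

# Hypothetical data: the colors and the extra keys are illustrative only.
my %animalColors = (
    "bear" => "brown",
    "crow" => "black",
    "swan" => "white",
);

# A single element is a scalar, so it is written with $ and curly brackets.
print "A bear is $animalColors{bear}\n";

# Hashes are unordered, so sort the keys to get a stable listing.
foreach my $animal (sort keys %animalColors) {
    print "$animal: $animalColors{$animal}\n";
}
```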
The Perl language has a lot of special variables you should know. You can modify the values of
these variables, except those that are read-only. The predefined variables are
of scalar, array or hash type. One of the most important is the scalar variable $_, which is used
by default by many operators and functions. For example, if you assign the text "Hello World!"
to $_ and then call print with no arguments, print uses $_ and the text "Hello World!" is printed.
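A short sketch of $_ acting as the default argument. The word list is made up for illustration.

```perl
use strict;
use warnings;

$_ = "Hello World!\n";
print;                           # same as: print $_;

my @lengths;
foreach ("light", "house") {     # no loop variable is named, so $_ is used
    push @lengths, length;       # same as: length($_)
}
print "@lengths\n";              # prints "5 5"
```

The foreach loop only borrows $_ for its duration, so after the loop $_ holds "Hello World!" again.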
There are three built-in Perl data types: scalars, arrays and hashes (associative arrays), which
make the Perl language a powerful tool for text manipulation.
1. The first of the Perl data types is the scalar, which is Perl’s fundamental unit of data. It can
be a single string, a number or a reference to a specific object. Scalars take two forms:
- scalar literals (or constants), which don’t change over the life of a program; one simple
  example is the value of pi
- scalar variables, which let you hold data and manipulate them:
    o each variable is associated with a name that enables you to refer to the data and
      the address of a chunk of memory where the variable value is stored
    o the variable value can be changed during the execution of the program.
Scalar variables are used for storing scalar literals. We represent a scalar variable with the dollar
sign $ followed by the name of the variable: $Company, $country, $count, $x.
The name of a variable can contain any alphabetic characters, numbers or underscore. Note
that the first character of a scalar variable name can’t be a number and the variable names are
case sensitive, which means there is a significant difference between upper and lowercase
characters. In Perl it’s not necessary to declare a scalar variable; you just name it and
use it, as in this scalar variable assignment example:
$pi = 3.14;
A scalar variable can also store the memory address of another piece of data. A scalar value
that contains a memory address is called a reference. Perl allocates and deallocates
the memory for references automatically. Look at the following code snippet to see how you
can use a reference variable:
$a="Hello World!";
$ref_a=\$a;
print $$ref_a;
In the first line of code, the variable named $a stores the string “Hello World!”. In the second
line, the scalar variable $ref_a is assigned a reference to the variable $a (note that $a is
preceded by the character \). The third line of the snippet is an example of how to print the
string stored in the variable $a using a reference: we say that we dereference the scalar
variable $ref_a before printing. It’s good practice to begin the name of a scalar reference with the
substring ref_, because this tells you that it is a reference variable and that you
must dereference it before using the value it points to.
Another kind of constant is the literal list, which is used to initialize an array or a hash.
Creating a literal list is very easy: just enclose a set of scalar values in parentheses, as in the
following example:
(1, 'hello', 'world', $a)
2. The second basic Perl data type is the array, which is indexed by a number. To
create an array you simply put something into it. It is not necessary to declare it or specify
its dimension. The name of an array begins with the character @, and in the example below we
assign the literal list presented above to the array @things:
@things = (1, 'hello', 'world', $a);
3. The third basic Perl data type is the hash or associative array, which, like the
array, contains a number of scalars. Hashes are indexed by strings and each hash element has two parts:
a key that identifies the element and a value, which is the data associated with the
key. The name of a hash begins with the character %. To assign a value to a hash key,
we write a line of the following form, shown first in general and then with concrete keys and values:
$hash{$key} = $value;
$NotebookPrices{"Toshiba"}=650;
$NotebookPrices{"HP"}=550;
$NotebookPrices{"Acer"}=750;
You can combine the three Perl data types enumerated above to get more complex data
structures (like array of arrays, array of hashes, and so on).
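As a rough sketch of one such combination, the following builds an array of hashes by storing hash references in an array. The brands and prices reuse the values from the hash example above; the structure and the names @notebooks, $item and $total are made up for illustration.

```perl
use strict;
use warnings;

# An array whose elements are references to hashes.
my @notebooks = (
    { brand => "Toshiba", price => 650 },
    { brand => "HP",      price => 550 },
    { brand => "Acer",    price => 750 },
);

# $notebooks[1] is a hash reference; -> dereferences it.
print "Second brand: $notebooks[1]->{brand}\n";

my $total = 0;
foreach my $item (@notebooks) {
    $total += $item->{price};
}
print "Total: $total\n";
```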
Operators
In Perl, there are different versions of the operators for numbers and strings. For instance, if you want to
compare a number, you would use a traditional symbol such as <, >, and so on. However, when you
compare two strings, the less-than and greater-than signs are not used. Instead, a special version is used
to compare strings. Less-than would be the two letters lt and greater-than would be gt. In the lists that
follow, there will be separate listings for numerical and string operators where this is necessary.
Arithmetic Operators
These operators are used to perform mathematical calculations on numbers. Keep in mind, though, that they
are not used to combine strings; there are special string operators for that. Note that the
assignment operator works with both numbers and strings; we used it to assign values to variables in the last section.
Here is the list:
Operator   Function
+          Addition
-          Subtraction
*          Multiplication
/          Division
%          Modulus
**         Exponent
To use these, you will place them in your statements like a mathematical expression. So, if you want to
store the sum of two variables in a third variable, you would write something like this:
$adrevenue=20;
$sales=10;
$total_revenue=$adrevenue+$sales;
You can use the other arithmetic operators in the same way; they behave much as they do in other programming
languages.
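As a small sketch of the remaining arithmetic operators, the following reuses the two variables from the example above; the other variable names are made up for illustration.

```perl
use strict;
use warnings;

my $adrevenue = 20;
my $sales     = 10;

my $difference = $adrevenue - $sales;   # subtraction
my $product    = $adrevenue * $sales;   # multiplication
my $quotient   = $sales / 4;            # division is not truncated to an integer
my $remainder  = $adrevenue % 3;        # remainder after dividing by 3
my $squared    = $sales ** 2;           # exponent

print "$difference $product $quotient $remainder $squared\n";   # prints "10 200 2.5 2 100"
```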
Assignment Operators
We have already used the = sign as an assignment operator to assign values to a variable. You can also
use the = sign with another arithmetic operator to perform a special type of assignment. You can precede
the = sign with the + operator, for example:
$revenue+=10;
What this does is create a shorthand for writing out the following statement:
$revenue=$revenue+10;
It takes the variable $revenue and assigns it the value of $revenue (itself) plus 10. So, if you had an initial
value for $revenue set at 5:
$revenue=5;
$revenue=$revenue+10;
After these statements, $revenue is 15. It added 10 to the value it had before the new assignment.
The others work the same way, but perform different operations. Here is a list of the arithmetic
operators we used above combined with the assignment operator:
Operator   Function
=          Normal Assignment
+=         Add and Assign
-=         Subtract and Assign
*=         Multiply and Assign
/=         Divide and Assign
%=         Modulus and Assign
**=        Exponent and Assign
Remember, these are used for the sake of typing less or cutting the file size of the code. You can write
the statements out the long way if it makes it more understandable when you read your code.
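To see the shorthand forms at work, here is a brief sketch that applies several of them in turn to one variable. The starting value matches the example above; the sequence of operations is arbitrary.

```perl
use strict;
use warnings;

my $revenue = 5;
$revenue += 10;    # 15, same as $revenue = $revenue + 10
$revenue -= 3;     # 12
$revenue *= 2;     # 24
$revenue /= 4;     # 6
$revenue **= 2;    # 36
$revenue %= 10;    # 6

print "$revenue\n";
```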
Increment/Decrement
Another shorthand method is to use the increment and decrement operators. Rather than writing out
something like this:
$revenue=$revenue+1;
you can write:
$revenue++;
However, using these operators you must remember that you could also write something like this:
++$revenue;
If you place the ++ before the variable name, the variable adds one to itself before it is used or
evaluated. For example, if you write:
$revenue=5;
$total= ++$revenue + 10;
The $revenue variable is incremented before it is used in the calculation, so it is changed to 6 before 10 is
added to it. Thus, $total turns out to be 16 here.
If you want to increment the variable after it is used, you use the ++ after the variable name:
$revenue=5;
$total= $revenue++ + 10;
This way $total is only 15 because $revenue is used before being incremented, so it stays at 5 for this
expression. If you use $revenue again after this, it will have a value of 6.
With that in mind, here is the short list of the two operators:
Operator Function
++ Increment (Add 1)
-- Decrement (Subtract 1)
The -- operator works the same way as ++, but it subtracts one from the value of the variable (decrements
it).
Adding Strings
Like I mentioned at the beginning of this section, there are different operators for strings under certain
conditions. If you want to put two strings together (also called concatenate), you will want to use the dot
operator. Unlike C and JavaScript (where it is used with objects), the dot operator in Perl concatenates
two strings.
For example, if you want to place two strings together, you could do this:
$full_string="light" . "house";
This would make $full_string have the value of lighthouse. This is more useful if you are using variables
for this:
$word1="light";
$word2="house";
$full_string=$word1 . $word2;
print "If I had a $word1 and a $word2, would I be able to make a $full_string?";
This can also be used with the assignment operator to do what we did with numbers earlier. In this case,
it gives the string the value of itself put together with another string:
$full_string="light";
$full_string.="house";
Of course, we again get the value of lighthouse for the $full_string variable. Here is the list for these two
string operators:
Operator   Function
.          Concatenate Strings
.=         Concatenate and Assign
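Numeric Comparison
These operators compare two numbers:
Operator   Function
==         Equal to
!=         Not Equal to
>          Greater than
<          Less than
>=         Greater than or Equal to
<=         Less than or Equal to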
So, suppose you want to execute some code only if one number is equal to another. You would use the if
() condition with the == operator above:
$money=5;
if ($money==5)
{
    # ...more code...
}
Since $money is equal to 5, it would execute the code you place between the curly brackets. It works the
same way if you use one of the other operators:
$money=5;
if ($money<3)
{
    # ...more code...
}
This time the code would not run, because 5, the value of $money, is not less than 3.
String Comparison
These are similar to the numerical comparisons, but they work with strings. We will note a few differences
in the way these work after the list below:
Operator Function
eq Equal to
ne Not Equal to
gt Greater than
lt Less than
Strings are equal if they are exactly the same. So, "cool" and "cool" are equal, but "cool" and "coolz" are
not. Here is a string equality example:
$i_am="cool";
if ($i_am eq "cool")
{
    # ...more cool code...
}
The greater-than and less-than operators compare strings using alphabetical order (more precisely,
character-code order, so uppercase letters sort before lowercase ones). Here is a sample:
$i_am="all right";
if ($i_am lt "cool")
{
    print "You are not very cool, dude.";
}
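The difference between the numeric and string operators matters most when the values look like numbers. The following sketch, with made-up values, compares the same pair both ways.

```perl
use strict;
use warnings;

# Compared as strings, "10" sorts before "9" because "1" comes before "9".
my $as_strings = ("10" lt "9") ? "true" : "false";

# Compared as numbers, 10 is not less than 9.
my $as_numbers = (10 < 9) ? "true" : "false";

print "string comparison:  $as_strings\n";
print "numeric comparison: $as_numbers\n";

# String comparison is case sensitive: uppercase letters sort first.
my $case = ("Cool" lt "cool") ? "true" : "false";
print "Cool lt cool: $case\n";
```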
Logical Operators
These are often used when you need to check more than one condition. Here they are:
Operator Function
&& AND
|| OR
! NOT
So, if you want to see if a number is less than or equal to 10, and also greater than zero:
$number=5;
if (($number <= 10) && ($number > 0))
{
    # ...code...
}
Notice the nested parentheses. Since we are checking for two conditions, we want to be sure the
comparison is done first. Thus, they have their own sets of parentheses within the parentheses for the if ()
condition.
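A short sketch of the three logical operators applied to the same test value. The variable names and the empty "typed" name are hypothetical.

```perl
use strict;
use warnings;

my $number = 5;

my $in_range     = ($number <= 10 && $number > 0)  ? "yes" : "no";
my $out_of_range = ($number > 10  || $number <= 0) ? "yes" : "no";
my $not_positive = !($number > 0)                  ? "yes" : "no";

print "in range: $in_range, out of range: $out_of_range, not positive: $not_positive\n";

# && and || stop as soon as the result is known, so || can supply a fallback value.
my $typed_name = "";                        # hypothetical empty user input
my $name       = $typed_name || "anonymous";
print "name: $name\n";
```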
PERL functions
Perl's built-in functions are available to you at any place in your script and do not need any declaration
before use, because they are part of the language itself. The perlfunc documentation (run perldoc perlfunc)
presents all the built-in Perl functions, grouped by category and alphabetically. If you
want to write your own function, you can define a subroutine and then call it anywhere within
your Perl code.
It’s beyond the scope of this section to cover all the aspects of functions in a programming
language. In general, a function has a name and represents a block of code that we can call anywhere
we need it in a program. We use functions so that we do not have to repeat the same block of
code by copying and pasting it.
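A few built-in functions in action, as a rough sketch. The sample string and variable names are made up for illustration; length, uc, reverse, split and join are standard Perl built-ins.

```perl
use strict;
use warnings;

my $dna = "acgtac";

my $len      = length($dna);       # number of characters
my $upper    = uc($dna);           # uppercase copy
my $reversed = reverse($dna);      # scalar context: reverses the string
my @bases    = split(//, $dna);    # one element per character
my $joined   = join("-", @bases);  # glue the elements back together

print "$len $upper $reversed $joined\n";
```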
PERL modules
This section covers Perl modules. A module is a well-structured
component of a system that exposes a defined interface to other components.
Since version 5, Perl has supported modules, and thousands are available through CPAN
(Comprehensive Perl Archive Network) or other resources.
A module is a piece of code that others have already written, which you can integrate into your
own Perl script to save a lot of time. You can design your own modules to organize your
code or use modules made by others. Refactoring source code by creating your own modules is
beyond the scope of this page, so we’ll discuss here only how to make use of the
modules that are already available.
If you have a problem to solve in your script, it is very likely that somebody has already
solved it and made the solution available on CPAN, so reusing code through a module interface is usually a good
idea.
There are two types of Perl modules: those that are distributed with Perl, which you can
use immediately after installing Perl, and those you can download from CPAN or other sources and
install yourself. Each module available on CPAN has detailed documentation, which you can read before
downloading the package to see whether it’s what you are looking for. Some modules
depend on other modules, so read the associated documentation carefully before you download
and install them.
To use a module it’s not necessary to know how all of the behind-the-scenes magic works; it’s
only important to understand its interface and know its functions or subroutines.
To find modules that are not distributed with Perl, you can browse the CPAN categories or
use a search engine.
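As a minimal sketch of using a module, the following loads List::Util, which is distributed with Perl, and imports three of its functions. The price list reuses the notebook prices from the hash example earlier; the variable names are illustrative.

```perl
use strict;
use warnings;

# List::Util is distributed with Perl, so no installation is needed.
# The qw(...) list names the functions to import into this script.
use List::Util qw(sum max min);

my @prices = (650, 550, 750);

my $total   = sum(@prices);
my $highest = max(@prices);
my $lowest  = min(@prices);

print "total=$total highest=$highest lowest=$lowest\n";
```

Only the interface matters here: the script calls sum, max and min without knowing how the module implements them.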
PERL subroutines
Like any good programming language, Perl allows the user to define their own functions,
called subroutines. They may be placed anywhere in your program, but it's probably best to put them all at
the beginning or all at the end. A subroutine has the form
sub mysubroutine
{
    print "Not a very interesting routine\n";
    print "This does the same thing every time\n";
}
regardless of any parameters that we may want to pass to it. A subroutine is called with an & character
in front of its name, with or without parameters. All of the following will work to call this subroutine:
&mysubroutine;              # Call the subroutine
&mysubroutine($_);          # Call it with a parameter
&mysubroutine(1+2, $_);     # Call it with two parameters
Parameters
In the above case the parameters are acceptable but ignored. When the subroutine is called, any
parameters are passed as a list in the special @_ array variable. This variable has absolutely nothing
to do with the $_ scalar variable. The following subroutine merely prints out the list that it was called with.
It is followed by a couple of examples of its use.
sub printargs
{
    print "@_\n";
}
&printargs("perly", "king");        # Prints "perly king"
&printargs("frog", "and", "toad");  # Prints "frog and toad"
Just like any other array, the individual elements of @_ can be accessed with the square bracket
notation:
sub printfirsttwo
{
    print "Your first argument was $_[0]\n";
    print "and $_[1] was your second\n";
}
Again it should be stressed that the indexed scalars $_[0] and $_[1] and so on have nothing to do with the
scalar $_, which can also be used without fear of a clash.
Returning values
The result of a subroutine is always the last thing evaluated. This subroutine returns the maximum of two
input parameters. An example of its use follows.
sub maximum
{
    if ($_[0] > $_[1])
    {
        $_[0];
    }
    else
    {
        $_[1];
    }
}
$biggest = &maximum(37, 24);    # Now $biggest is 37
The &printfirsttwo subroutine above also returns a value, in this case 1. This is because the last thing that
subroutine did was a print statement, and the result of a successful print statement is always 1.
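Relying on the last evaluated expression works, but many programmers prefer to state the result explicitly. The following is a sketch of the same logic as maximum above written with a return statement; the name maximum_explicit and its parameter names are hypothetical.

```perl
use strict;
use warnings;

# Same logic as maximum, with named parameters and an explicit return.
sub maximum_explicit {
    my ($first, $second) = @_;    # copy the arguments out of @_
    if ($first > $second) {
        return $first;
    }
    return $second;
}

my $biggest = maximum_explicit(37, 24);
print "$biggest\n";
```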
Local variables
The @_ variable is local to the current subroutine, and so of course are $_[0], $_[1], $_[2], and so on.
Other variables can be made local too, and this is useful if we want to start altering the input parameters.
The following subroutine tests to see if one string is inside another, spaces notwithstanding. An example
follows.
sub inside
{
    local($a, $b);                  # Make local variables
    ($a, $b) = ($_[0], $_[1]);      # Assign values
    $a =~ s/ //g;                   # Strip spaces from
    $b =~ s/ //g;                   # local variables
    ($a =~ /$b/ || $b =~ /$a/);     # Is $b inside $a
                                    # or $a inside $b?
}
&inside("lemon", "dole money");     # true
In fact, it can even be tidied up by replacing the first two lines with
local($a, $b) = ($_[0], $_[1]);
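In current Perl code, my is usually preferred over local for private variables, because a my variable is visible only inside the enclosing block. The sketch below is a variation on inside, not the version from the text: the name inside_lexical and its variable names are hypothetical, and \Q...\E is added so that any regular expression metacharacters in the strings are matched literally.

```perl
use strict;
use warnings;

sub inside_lexical {
    my ($haystack, $needle) = @_;   # private copies; the caller's values are untouched
    $haystack =~ s/ //g;            # strip spaces from both copies
    $needle   =~ s/ //g;
    # \Q...\E quotes metacharacters so the strings are compared as plain text.
    my $found = ($haystack =~ /\Q$needle\E/ || $needle =~ /\Q$haystack\E/) ? 1 : 0;
    return $found;
}

print inside_lexical("lemon", "dole money"), "\n";   # prints 1
print inside_lexical("abc", "xyz"), "\n";            # prints 0
```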
Opening files
Opening a file in Perl is straightforward:
open FILE, "filename.txt" or die $!;
The command above will associate the FILE filehandle with the file filename.txt. You can use the filehandle to
read from the file. If the file doesn't exist, or you cannot read it for any other reason, then the script
will die with the appropriate error message stored in the $! variable.
What if you wanted to modify the file instead of just reading from it? Then you'd have to specify the
appropriate mode using the three-argument form of open.
open FILEHANDLE, MODE, EXPR
The available modes are the following:
mode          operand   create   truncate
read          <
write         >         ✓        ✓
append        >>        ✓
Each of the above modes can also be prefixed with the + character to allow for simultaneous reading
and writing.
mode          operand   create   truncate
read/write    +<
read/write    +>        ✓        ✓
read/append   +>>       ✓
Notice how both +< and +> open the file in read/write mode, but the latter also creates the file if it
doesn't exist or truncates (empties) an existing file. So, if you wanted to open a file for writing,
creating it if it doesn't exist and truncating it first if it does, you'd do the following:
open FILE, ">", "filename.txt" or die $!;
This operation might fail if, for example, you don't have the appropriate
permissions. In this case $! will be set appropriately.
The mode and the filename in the three-argument form can be combined, so the above can also be
written as:
open FILE, ">filename.txt" or die $!;
As you might have guessed already, if you
just want read access you can skip the mode, just as we did in the very first example above.
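The write, append and read modes can be tried out with a short script. This sketch uses lexical filehandles ($out, $app, $in) rather than the bareword FILE handle used in the text, which is a common alternative style; the filename is hypothetical and the file is deleted at the end.

```perl
use strict;
use warnings;

# Hypothetical filename, created here so the example is self-contained.
my $filename = "example_notes.txt";

# Write mode: creates the file, or truncates it if it already exists.
open(my $out, ">", $filename) or die "Cannot write $filename: $!";
print $out "first line\n";
close($out) or die "Cannot close $filename: $!";

# Append mode: keeps the existing content and adds to the end.
open(my $app, ">>", $filename) or die "Cannot append to $filename: $!";
print $app "second line\n";
close($app) or die "Cannot close $filename: $!";

# Read mode: the file must already exist.
open(my $in, "<", $filename) or die "Cannot read $filename: $!";
my @lines = <$in>;
close($in);

print "Read ", scalar(@lines), " lines\n";
unlink $filename;    # remove the example file
```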
Reading files
If you want to read a text file line-by-line then you can do it as such:
my @lines = <FILE>;
The <FILE> operator, where FILE is a previously opened filehandle, returns all the
unread lines of the text file in list context or a single line in scalar context. Hence, if you had a
particularly large file and you wanted to conserve memory you could process it line by line:
while (<FILE>) {
    print $_;
}
The $_ variable is automatically set for you to the contents of the current line. If you wish you may
name your line variable instead:
while (my $line = <FILE>) { ...
will set the $line variable to
the contents of the current line. The newline character at the end of the line is not removed
automatically. If you wish to remove it you can use the chomp command. After all lines have been
read the <FILE> operator will return a false value, hence causing the loop to terminate.
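A sketch of the line-by-line loop with chomp. To keep it self-contained, it reads from an in-memory filehandle (opening a reference to a string) instead of a file on disk; the sample data and variable names are made up for illustration.

```perl
use strict;
use warnings;

# In-memory filehandle standing in for a real file (illustrative data).
my $text = "ACGT\nGGCC\nTTAA\n";
open(my $fh, "<", \$text) or die "Cannot open in-memory file: $!";

my $line_count = 0;
my $sequence   = "";
while (my $line = <$fh>) {
    chomp $line;          # remove the trailing newline
    $line_count++;
    $sequence .= $line;
}
close($fh);

print "$line_count lines, sequence is $sequence\n";
```

With a real file, only the open line changes; the loop stays the same.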
There may be cases where you need to read a file only a few characters at a time instead of line-by-line.
This may be the case for binary data. To do just that you can use the read command.
open FILE, "picture.jpg" or die $!;
binmode FILE;
my ($buf, $data, $n);
while (($n = read FILE, $data, 4) != 0) {
    print "$n bytes read\n";
    $buf .= $data;
}
close(FILE);
There is a lot going on here so let's take it step by step. In the first line of the above code fragment a
file is opened. As you can guess from the filename it is a binary file. Binary files need to be treated
differently than text files on some operating systems (e.g., Windows). The reason is that on these
platforms a newline "character" is actually represented within text files by the two-character
sequence \cM\cJ (that's control-M, control-J). When reading the text file Perl will convert the \cM\cJ
sequence into a single \n newline character. The converse also holds when writing files. Clearly,
when reading binary data this behavior is undesired, and calling binmode on the filehandle will make
sure that this conversion is avoided.
The read command takes either 3 or 4 arguments. The 3-argument form is:
read FILEHANDLE, SCALAR, LENGTH
while the 4-argument form is:
read FILEHANDLE, SCALAR, LENGTH, OFFSET
In the first case LENGTH characters of data are read into the variable specified by SCALAR from FILEHANDLE.
The return value of read is the number of characters actually read, 0 at the end of the file
or undef in the case of an error. Returning to our example above, the read call in the while condition will read at
most 4 characters of data into the $data variable. The number of characters read will be stored
in $n. Each successive read operation on the same filehandle leaves the current file position just
before the first unread character. Thus the code above will read the contents of the
file picture.jpg and store them in $buf, printing the number of characters read at every iteration.
If OFFSET is specified then the characters read will be placed at that position within the SCALAR.
Taking advantage of this we could rewrite the loop above as such:
my ($data, $n);
my $offset = 0;
while (($n = read FILE, $data, 4, $offset) != 0) {
    $offset += $n;
}
Even though the example above demonstrates binary reading, the read command works just as well
on text files; just make sure to use (for binary) or not use (for text) binmode accordingly.
Writing files
Now that you know how to open and read files, learning how to write to them is straightforward. Take
a look at the following code:
open FILE, ">file.txt" or die $!;
print FILE $str;
close FILE;
Not much is new here. The only thing to observe is the two-argument use of print,
the first argument being the FILEHANDLE to write to and the second an expression to be written. The
expression can be anything: a scalar, a list, a hash, etc. Appending to a file can be accomplished
in exactly the same manner, apart from specifying the appropriate (>>) mode of course.
At this point you are probably thinking that a description of the write command is what follows. Actually, as the
manual page for write points out, write is not the opposite of read.
Instead write is used to write formatted records to a file, a subject outside the scope of this article.
Closing files
Once you are done reading and writing you should close any open filehandles.
open FILE1, "file.txt" or die $!;
...
close FILE2;
close FILE1;
If you forget to close a filehandle Perl will do it for you before your script exits, but it
is good practice to close yourself what you have opened.
The close command may also fail, returning false, e.g., if you try to close a closed filehandle. If you
want to catch these errors you can check the return value of close and the appropriate error
message stored in $!, as is done in the following example:
close FILE or die $!;
The open function opens a file for input (i.e. for reading). The first parameter is
the filehandle which allows Perl to refer to the file in future. The second parameter is
an expression denoting the filename. If the filename was given in quotes then it is
taken literally without shell expansion. So the expression '~/notes/todolist' will not be
interpreted successfully. If you want to force shell expansion then use angled
brackets: that is, use <~/notes/todolist> instead.
There are a few useful points to add to this discussion on filehandling. First,
the open statement can also specify a file for output and for appending as well as for
input. To do this, prefix the filename with a > for output and a >> for appending:
open(INFO, $file); # Open for input
open(INFO, ">$file"); # Open for output
open(INFO, ">>$file"); # Open for appending
open(INFO, "<$file"); # Also open for input
Second, if you want to print something to a file you've already opened for output then
you can use the print statement with an extra parameter. To print a string to the file
with the INFO filehandle use
print INFO "This line goes to the file.\n";
Third, you can use the following to open the standard input (usually the keyboard) and
standard output (usually the screen) respectively:
open(INFO, '-'); # Open standard input
open(INFO, '>-'); # Open standard output
Once a file has been opened with the INFO filehandle, Perl reads from it using angled
brackets. So the statement
@lines = <INFO>;
reads the file denoted by the filehandle into the array @lines. Note that the <INFO>
expression reads in the file entirely in one go. This is because the reading takes place in
the context of an array variable. If @lines is replaced by the scalar $lines then only
the next line would be read in. In either case each line is stored complete with its
newline character at the end.
Flow control
Flow control is the order in which the statements of a program are executed. A
program executes from the first statement at the top of the program to the last
statement at the bottom, in order, unless told to do otherwise. There are two ways to
tell a program to do otherwise: conditional statements and loops. A conditional
statement executes a group of statements only if the conditional test succeeds;
otherwise, it just skips the group of statements. A loop repeats a group of statements
until an associated test fails.
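The two kinds of flow control can be sketched in a few lines. The list of bases and the variable names are made up for illustration.

```perl
use strict;
use warnings;

my @bases   = ("A", "C", "G", "T", "G");
my $g_count = 0;

# Loop: the block runs once for each element of @bases.
foreach my $base (@bases) {
    # Conditional: the block runs only when the test succeeds.
    if ($base eq "G") {
        $g_count++;
    }
}
print "Found $g_count G bases\n";

# A while loop repeats its block until the test fails.
my $countdown = 3;
while ($countdown > 0) {
    print "$countdown\n";
    $countdown--;
}
```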
Finding motifs
One of the most common things we do in bioinformatics is to look for motifs, short
segments of DNA or protein that are of particular interest. They may be regulatory
elements of DNA or short stretches of protein that are known to be conserved across
many species. (The PROSITE web site at http://www.expasy.ch/prosite/ has extensive
information about protein motifs.)
The motifs you look for in biological sequences are usually not one specific sequence.
They may have several variants—for example, positions in which it doesn't matter
which base or residue is present. They may have variant lengths as well.
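Both kinds of variation can be expressed with regular expressions. The following is only a sketch: the DNA string and the three patterns are made up for illustration and are not real motifs.

```perl
use strict;
use warnings;

# Assumption: a made-up DNA string, not real data.
my $dna = 'ATGCGAATTCGGAAC';

# A character class lets one position vary: A or G in the third position.
my $found_variant = ( $dna =~ /GA[AG]TTC/ ) ? 1 : 0;

# A quantifier lets the length vary: two to four Gs between TC and AA.
my $g_run = '';
if ( $dna =~ /TC(G{2,4})AA/ ) {
    $g_run = $1;    # the parentheses capture the run of Gs
}

# A motif that is not present simply fails to match.
my $found_missing = ( $dna =~ /TTTT/ ) ? 1 : 0;

print "Variant motif found: $found_variant\n";
print "Run of Gs matched: $g_run\n";
print "Missing motif found: $found_missing\n";
```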
Perl has a handy set of features for finding things in strings. This, as much as
anything, has made it a popular language for bioinformatics. Example 5-3 introduces
this string-searching capability; it does something genuinely useful, and similar
programs are used all the time in biology research. It does the following:
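#!/usr/bin/perl -w
# Searching for motifs
# Ask the user for the name of the file containing the protein sequence data
print "Please type the filename of the protein sequence data: ";
$proteinfilename = <STDIN>;
# Remove the newline from the protein filename
chomp $proteinfilename;
# Open the file, or exit
unless ( open(PROTEINFILE, $proteinfilename) ) {
    print "Cannot open file \"$proteinfilename\"\n\n";
    exit;
}
# Read the protein sequence data from the file, and store it
# into the array variable @protein
@protein = <PROTEINFILE>;
# Close the file - we've read all the data into @protein now.
close PROTEINFILE;
# Put the lines into a single string, so that a motif that spans
# a line break can still be found
$protein = join( '', @protein );
# Remove whitespace
$protein =~ s/\s//g;
# In a loop, ask the user for a motif, search for the motif,
# and report if it was found.
# Exit if no motif is entered.
do {
    print "Enter a motif to search for: ";
    $motif = <STDIN>;
    # Remove the newline at the end of $motif
    chomp $motif;
    # Look for the motif
    if ( $protein =~ /$motif/ )
    {
        print "I found it!\n\n";
    }
    else
    {
        print "I couldn\'t find it.\n\n";
    }
# Exit on an empty user input
} until ( $motif =~ /^\s*$/ );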
Counting Nucleotides
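#!/usr/bin/perl -w
# Determining frequency of nucleotides, take 2
print "Please type the filename of the DNA sequence data: ";
$dna_filename = <STDIN>;
chomp $dna_filename;
# Does the file exist?
unless ( -e $dna_filename ) {
    print "File \"$dna_filename\" doesn\'t seem to exist!!\n";
    exit;
}
# Can we open the file?
unless ( open(DNAFILE, $dna_filename) ) {
    print "Cannot open file \"$dna_filename\"\n\n";
    exit;
}
@DNA = <DNAFILE>;
close DNAFILE;
# Join the lines into a single string
$DNA = join( '', @DNA );
# Remove whitespace
$DNA =~ s/\s//g;
# Initialize the counts
$count_of_A = $count_of_C = $count_of_G = $count_of_T = $errors = 0;
# Look at each base in turn and increment the appropriate count
for ( $position = 0 ; $position < length $DNA ; ++$position ) {
    $base = substr($DNA, $position, 1);
    if ( $base eq 'A' )
    { ++$count_of_A; }
    elsif ( $base eq 'C' )
    { ++$count_of_C; }
    elsif ( $base eq 'G' )
    { ++$count_of_G; }
    elsif ( $base eq 'T' )
    { ++$count_of_T; }
    else
    {
        print "!!!!!!!! Error - I don\'t recognize this base: $base\n";
        ++$errors;
    }
}
# Print the results
print "A = $count_of_A, C = $count_of_C, G = $count_of_G, T = $count_of_T, errors = $errors\n";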
( -e $dna_filename) explanation
Files have several attributes, such as size, permissions, location in the
filesystem, and type of file, and many of these can be tested easily with
the file test operators. The -e operator checks whether a file exists at the location specified.
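A few of the file test operators can be tried out with the sketch below. It assumes a temporary file created with the standard File::Temp module in place of a real sequence file; the variable names are invented for this example.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Assumption: a temporary file stands in for a real sequence file.
my ( $fh, $filename ) = tempfile( UNLINK => 1 );
print $fh "ACGT";
close $fh;

my $exists = ( -e $filename ) ? 1 : 0;    # does the file exist?
my $size   = -s $filename;                # size in bytes, if not empty
my $is_dir = ( -d $filename ) ? 1 : 0;    # is it a directory?

# A name that (almost certainly) does not exist fails the -e test.
my $missing = ( -e "$filename.missing" ) ? 1 : 0;

print "exists: $exists, size: $size, directory: $is_dir, missing file exists: $missing\n";
```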
Everything else is familiar, until you hit the for loop; it requires a little explanation:
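for ( $position = 0 ; $position < length $DNA ; ++$position ) {
# the statements in the block
}
This is equivalent to the following while loop:
$position = 0;
while ( $position < length $DNA ) {
# the same statements in the block, plus ...
++$position;
}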
Take a moment and compare these two loops. You'll see the same statements but in
different locations.
As you can see, the for loop brings the initialization and increment of a counter
($position) into the loop statement, whereas in the while loop, they are separate
statements. In the for loop, both the initialization and the increment statement are
placed between parentheses, whereas you find only the conditional test in
the while loop. In the for loop, you can put initializations before the first semicolon
and increment statements after the second semicolon. The initialization statement is
done just once before starting the loop, and the increment statement is done at the end
of each iteration through the block before going back to the conditional test. It's really
just a shorthand for the equivalent while loop as just shown.
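To convince yourself that the two forms really are equivalent, you can run something like the following sketch, which records the positions visited by each loop. The sequence and the two array names are made up for the illustration.

```perl
use strict;
use warnings;

my $DNA = 'ACGTAC';    # assumption: a made-up sequence

# for loop: initialization, test, and increment all in the loop statement
my @for_positions;
for ( my $position = 0 ; $position < length $DNA ; ++$position ) {
    push @for_positions, $position;
}

# while loop: initialization before the loop, increment inside the block
my @while_positions;
my $position = 0;
while ( $position < length $DNA ) {
    push @while_positions, $position;
    ++$position;
}

print "for:   @for_positions\n";      # for:   0 1 2 3 4 5
print "while: @while_positions\n";    # while: 0 1 2 3 4 5
```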
The conditional test checks to see if the position reached in the string is less than the
length of the string. It uses the length Perl function. Obviously, you don't want to
check characters beyond the length of the string. But a word is in order here about
the numbering of positions in strings and arrays.
By default, Perl assumes that a string begins at position 0 and its last character is at a
position that's numbered one less than the length of the string. Why do it this way
instead of numbering the positions from 1 up to and including the length of the string?
There are reasons, but they're somewhat abstruse; see the documentation for
enlightenment. If it's any comfort, many other programming languages make the same
choice.
This way of numbering is important to biologists because they are used to numbering
sequences beginning with 1, not with 0 the way Perl does it. You sometimes have to
add 1 to a position before printing out results so they'll make sense to
nonprogrammers. It's mildly annoying, but you'll get used to it.
The same holds true for numbering the elements of an array. The first element of an
array is element 0; the last is element $length-1.
Anyway, you see that the conditional test evaluates to true while the value
of $position is length-1 or less and fails when $position reaches the same value as
the length of the string. For example, say you have a string that contains the text
"seeing." This has a length of six characters. The "s" is at position 0, and the "g" is at
position 5, which is one less than the string length 6.
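The "seeing" example can be checked directly. This is just a sketch; the variable names are invented, and the letter "i" is picked arbitrarily to show the add-1 adjustment for reporting positions.

```perl
use strict;
use warnings;

my $word   = 'seeing';
my $length = length $word;                      # 6

my $first = substr( $word, 0, 1 );              # 's' is at position 0
my $last  = substr( $word, $length - 1, 1 );    # 'g' is at position 5

# Arrays are numbered the same way: $#letters is the last index.
my @letters    = split //, $word;
my $last_index = $#letters;                     # 5

# index returns a 0-based position; add 1 before showing it to a biologist.
my $position = index( $word, 'i' );             # 3
my $reported = $position + 1;                   # 4

print "first: $first, last: $last, last index: $last_index\n";
print "'i' is at Perl position $position, reported as position $reported\n";
```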
Back in the block, you call the substr function to look into the string:
$base = substr($DNA, $position, 1);
This is a fairly general-purpose function for working with strings; you can also insert
and delete things. Here, you look at just one character, so you call substr on the
string $DNA, ask it to look in position $position for one character, and save the result
in scalar variable $base.
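The other uses of substr mentioned above, inserting and deleting, rely on its optional fourth argument, the replacement string. A minimal sketch with a made-up sequence and invented variable names:

```perl
use strict;
use warnings;

my $DNA = 'ACGTACGT';    # assumption: a made-up sequence

# Look at one character: start at position 2, take 1 character.
my $base = substr( $DNA, 2, 1 );        # 'G'

# Delete: replace the first three characters with the empty string.
my $deleted = $DNA;
substr( $deleted, 0, 3, '' );           # 'TACGT'

# Insert: replace zero characters at position 4 with 'NNN'.
my $inserted = $DNA;
substr( $inserted, 4, 0, 'NNN' );       # 'ACGTNNNACGT'

print "$base $deleted $inserted\n";
```

The copies ($deleted and $inserted) are made because the four-argument form changes the string it is given.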
http://www.bioon.com/book/biology/Beginning%20Perl%20for%20Bioinformatics/44.htm