Introduction to Perl
Matt Hudson
(with thanks to Stuart Brown of NYU, for some great examples and
teaching ideas)
• clustalw: align protein or DNA sequences
• fasta34: search a sequence using an older, slower, but sometimes more flexible algorithm
grep – my favorite
• Allows you to pick out lines of a text file that match a query, count them, and retrieve lines around the
match.
grep 'Query=' myblast.txt
What sequences did I BLAST?
grep -c '>' testprotein.txt
How many sequences are in this file?
grep -A 10 '>' testprotein.txt
Give me the header line of each protein plus the ten lines after it
ftp commands
• ftp ftp.ncbi.nih.gov: go to the NCBI site
• open: open a connection
• ls: same as UNIX
• cd: same as UNIX
• get: get me this file
• mget: get more than one file
• put: put a file on the server
• lcd: local cd
• !: local shell
• close: close connection
• bye: exit the ftp program
Test time
• OK. You are now up and running with UNIX, and can use it to do some fairly sophisticated bioinformatics.
• We’re going to concentrate on Perl
scripting from now on
UNIX books
• You might find that your UNIX skills need some refreshing from time to time. I recommend having one of these books around in case you need some help using the command line:
• For students who haven’t done much UNIX:
Sams Teach Yourself Unix in 24 Hours (4th Edition) (Paperback)
by Dave Taylor
• For more advanced UNIX users:
UNIX System V: A Practical Guide (3rd Edition) (Paperback)
by Mark G. Sobell
• Also, for those of you not so familiar with bioinformatics:
Bioinformatics for Dummies (Paperback)
by Jean-Michel Claverie and Cedric Notredame
I have heard good things about this one but have not used it much myself:
Beginning Perl, Second Edition (Paperback)
This is a classic, but slow going if you know no programming:
Learning Perl, Fourth Edition (Paperback)
This is better if you have little programming experience, but it is not a textbook:
Perl for Dummies (Fourth Edition) (Paperback)
by Paul Hoffman
• Once you get started:
Programming Perl, 3rd Edition
Why use Perl?
• Interpreted language – quick to program
• Easy to learn compared to most
languages
• Designed for working with text files
• Free for all operating systems
• Most popular language in bioinformatics – many scripts available you can
“borrow”, also ready made modules
• Run the program
• Look at the output
• Correct the errors (debugging)
• Edit the script and try again.
All programming courses traditionally start with a
program that prints “Hello, world!” So in keeping with
that tradition:
Note:
No line numbers.
Each command line ends with a semicolon
Remember your program?
#!/usr/bin/perl
print "Hello, world\n";
– Use \n in a text string to signify a newline.
– The \ character is called “backslash”.
– It is an “escape”: it changes the meaning of the character after it. In this case it changes “n” to “newline”. Other
examples are \t (tab) or \$ (= print an actual dollar sign;
normally a dollar sign has a special meaning).
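A short sketch of these three escapes in use. The text being printed is made up for illustration, and the strict and warnings lines are an optional addition:

```perl
#!/usr/bin/perl
use strict;      # optional: catch typos and common mistakes
use warnings;

# \n ends the line
print "Hello, world\n";
# \t inserts a tab between the two words
print "first\tsecond\n";
# \$ prints a literal dollar sign instead of starting a variable name
print "An actual dollar sign: \$\n";
```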
Program details
• Perl programs on UNIX start with a line like:
#!/usr/bin/perl
• Perl itself ignores anything after a #; this first line is an
instruction not to Perl, but to UNIX, telling it which program should run the script.
• Elsewhere in the program # is used for
comments to explain the code.
• Lines that are Perl commands end with a semicolon (;).
Run your Perl program
• In Perl, strings are very important. They are just a series of any text characters – letters, numbers, ><?>:$%^&*, etc.
• In the statement
print "Hello, world\n";
the text "Hello, world\n" is a string
Numbers, etc
• The other common type of data is a number.
• Perl can handle numbers in most common formats, without any complications:
456 5.6743 6.3E-26
• Arithmetic functions:
+ (add)
- (minus)
/ (divide)
* (multiply)
** (exponentiation)
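A minimal sketch that tries each arithmetic operator once. The numbers 7 and 2 are arbitrary choices for illustration:

```perl
#!/usr/bin/perl
use strict;      # optional: catch typos and common mistakes
use warnings;

print 7 + 2, "\n";     # add: prints 9
print 7 - 2, "\n";     # minus: prints 5
print 7 / 2, "\n";     # divide: prints 3.5
print 7 * 2, "\n";     # multiply: prints 14
print 7 ** 2, "\n";    # exponentiation: prints 49
```

Note that division does not round: 7 / 2 gives 3.5, not 3.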
A program using numbers
#!/usr/bin/perl
print "2+2\n";
print 3*4 , "\n";
print "8/2=" , 8/2 , "\n";
Do you get it?
Numbers in quotes are part of a string.
Numbers outside quotes are numbers, and Perl does the arithmetic on them.
• Up till now, we’ve been telling the computer exactly what to print. But in order for the program to generate what
is printed, we need to use variables
• A variable name starts with “$”
• It can be either a string or a number
Assigning values
In pretty much all programming languages, = means
“assign this value to this variable”.
The “my” command in Perl declares the variable. This is optional but highly recommended.
So, you assign values to a variable as follows:
my $number = 123;
A program with variables
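As a minimal sketch of such a program: the variable names and values here are illustrative assumptions, not the original example.

```perl
#!/usr/bin/perl
use strict;      # optional: catch typos and common mistakes
use warnings;

my $number = 4;                  # assign a number to a variable
my $text = "The number is ";     # assign a string to a variable
$number = $number + 5;           # change the contents: $number is now 9
print $text, $number, "\n";      # prints: The number is 9
```

The same variable can be assigned again later; the new value simply replaces the old one.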
• If you put a variable inside double quotes, Perl
interpolates the variable
print "The number is $number\n";
The number is 9
• If you use single quotes, no interpolation happens
print 'The number is $number\n';
The number is $number\n
• A more flexible way to do this is to “escape” the $
print "The value of \$number is $number\n";
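The three cases can be run side by side. This sketch assumes $number holds 9, as in the output shown above:

```perl
#!/usr/bin/perl
use strict;      # optional: catch typos and common mistakes
use warnings;

my $number = 9;

# Double quotes: $number is replaced by its value and \n is a newline
print "The number is $number\n";             # The number is 9

# Single quotes: nothing is replaced, not even \n,
# so a real newline is printed separately
print 'The number is $number\n', "\n";       # The number is $number\n

# Escaping only the first $ keeps it literal; the second is interpolated
print "The value of \$number is $number\n";  # The value of $number is 9
```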
Variables - summary
• A variable name starts with a $
• It contains a number or a text string
• Use my to define a variable
• Use = to assign a value
• Use \ to stop the variable being
interpolated
• Take care with variable names and with changing the contents of variables
Standard Input
• To make the program do something, we need to input data
– <STDIN> makes the program expect input, by default from the keyboard
– Usually this is assigned to a variable
print "Please type a number: ";
my $num = <STDIN>;
print "Your number is $num\n";
Perl evaluates the expression (1 == 1)
Note TWO NOT ONE EQUALS SIGNS!
The if operator causes the command in curly
brackets to be executed ONLY IF the expression is true
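A minimal sketch of the kind of statement being described. The printed messages are made up for illustration:

```perl
#!/usr/bin/perl
use strict;      # optional: catch typos and common mistakes
use warnings;

# (1 == 1) is true, so the command in curly brackets runs
if (1 == 1) {
    print "One equals one\n";
}

# (1 == 2) is false, so this command is skipped
if (1 == 2) {
    print "One equals two\n";
}
```

Only the first message is printed.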
• if evaluates some statement in
parentheses (must be true or false)
• Note: conditional block is indented,
using tabs
– Perl doesn’t care about indents, but it
makes your code more “human readable”
Comparing variables
if ($one == $two) {print "one equals two";}
Note there are TWO equals signs in this expression. If you remember, = means “assign this variable this value”. So == actually means “equals”. You can also use
> Greater than
< Less than
>= Greater than or equal to
<= Less than or equal to
!= Not equal to
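A short sketch that exercises several of these comparisons. The values 1 and 2 are arbitrary:

```perl
#!/usr/bin/perl
use strict;      # optional: catch typos and common mistakes
use warnings;

my $one = 1;
my $two = 2;

if ($one == $two) { print "one equals two\n"; }                      # not printed
if ($one != $two) { print "one does not equal two\n"; }              # printed
if ($one <  $two) { print "one is less than two\n"; }                # printed
if ($two >= $one) { print "two is greater than or equal to one\n"; } # printed
```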
What’s a block?
• In the case of an “if” statement:
• If the test is true, execute all the
command lines inside the {} brackets. If not, then go on past the closing } to the statements below
• You can also do stuff in a block over and
over again using a loop – more later.
die, scum
• die kills your script safely and prints a
message
• It is often used to prevent you doing
something regrettable – e.g. running your script on a file that doesn’t exist, or
overwriting an existing file
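A hedged sketch of die guarding against a missing file. The file name is hypothetical, and -e is Perl's built-in "does this file exist?" test. The eval wrapper is an addition used only so this demonstration can display the message; a real script would normally leave it out and let die stop the script.

```perl
#!/usr/bin/perl
use strict;      # optional: catch typos and common mistakes
use warnings;

my $file = "no_such_file_for_demo.txt";   # hypothetical file name

# eval catches the die and stores its message in $@
eval {
    if (! -e $file) {
        die "Cannot find $file\n";
    }
    print "Found $file, safe to continue\n";
};

if ($@) {
    print "die stopped the block with: $@";
}
```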
Exercising the Perl muscles
• Now let’s write a script to ask the user their age, and then deliver an insult
specific to the age bracket:
• Over 25 - old fogey
• Under 15 – callow youth
• 15-25 – (insert your own insult here)
Conditional Blocks, summary
• An if test can be used to control multiple lines of commands, as in this example:
print "Enter your age: ";
my $age = <STDIN>;
chomp $age;
if ($age < 15) {
    print "You are too young for this kind of work!\n";
    die "too young";
}
if ($age > 25) {
    print "You're old enough to know better!\n";
    die "too old";
}
• An array can store multiple pieces of data
• They are essential for the most useful
functions of Perl. They can store data such as:
– the lines of a text file (e.g. primer sequences)
– a list of numbers (e.g. BLAST e values)
• Arrays are designated with the symbol @
my @bases = ("A", "C", "G", "T");
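A minimal sketch of getting data back out of an array. Reading a single element with $bases[0] and counting elements with scalar() are standard Perl, but they go slightly beyond what is stated above:

```perl
#!/usr/bin/perl
use strict;      # optional: catch typos and common mistakes
use warnings;

my @bases = ("A", "C", "G", "T");

# Elements are numbered from 0, and a single element is written with $
print "First base: $bases[0]\n";                   # A
print "Last base: $bases[3]\n";                    # T

# scalar() gives the number of elements
print "Number of bases: ", scalar(@bases), "\n";   # 4

# Inside double quotes the whole array is printed with spaces between elements
print "All bases: @bases\n";                       # A C G T
```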
Converting a variable to an array
split splits a variable into parts and puts them
in an array.
my $dnastring = "ACGTGCTA";
my @dnaarray = split //, $dnastring;
@dnaarray is now (A, C, G, T, G, C, T, A)
@dnaarray = split /T/, $dnastring;
@dnaarray is now (ACG, GC, A)
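The two splits above can be checked by printing the results. This sketch adds only the print statements and the element counts:

```perl
#!/usr/bin/perl
use strict;      # optional: catch typos and common mistakes
use warnings;

my $dnastring = "ACGTGCTA";

# An empty pattern splits between every character
my @dnaarray = split //, $dnastring;
print scalar(@dnaarray), " letters: @dnaarray\n";   # 8 letters: A C G T G C T A

# Splitting on T removes each T and cuts the string there
@dnaarray = split /T/, $dnastring;
print scalar(@dnaarray), " pieces: @dnaarray\n";    # 3 pieces: ACG GC A
```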
Converting an array to a variable
• join combines the elements of an array into a single scalar variable (a string)
$dnastring = join('', @dnaarray);
The first argument is the spacer (empty here); the second is the array to join.
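A short sketch of join with an empty spacer and with a visible one. The dash spacer and the variable $dashed are illustrative additions:

```perl
#!/usr/bin/perl
use strict;      # optional: catch typos and common mistakes
use warnings;

my @dnaarray = ("A", "C", "G", "T");

my $dnastring = join('', @dnaarray);    # empty spacer: ACGT
my $dashed    = join('-', @dnaarray);   # dash spacer: A-C-G-T

print "$dnastring\n";
print "$dashed\n";
```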
• A loop repeats a bunch of commands until it is done. The commands are placed in a BLOCK – some code delimited with curly brackets {}
• Loops are really useful with arrays.
• The “foreach” loop is probably the most useful of all:
foreach my $base (@dnaarray) {
    print "$base ";
}
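A complete, runnable version of the loop, as a sketch. The counter $count is an illustrative addition showing that the block runs once per element:

```perl
#!/usr/bin/perl
use strict;      # optional: catch typos and common mistakes
use warnings;

my @dnaarray = split //, "ACGTGCTA";
my $count = 0;

# The block runs once for each element; $base holds the current one
foreach my $base (@dnaarray) {
    print "$base ";
    $count = $count + 1;
}
print "\n$count bases printed\n";   # 8 bases printed
```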
Comparing strings
• String comparison (is the text the same?)
• eq (equal)
• ne (not equal)
There are others but beware of them!
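A minimal sketch of eq and ne. The sequences are example values chosen for illustration:

```perl
#!/usr/bin/perl
use strict;      # optional: catch typos and common mistakes
use warnings;

my $site = "GAATTC";

# eq is true when the two strings contain exactly the same text
if ($site eq "GAATTC") { print "The text is the same\n"; }     # printed

# ne is true when they differ
if ($site ne "GGATCC") { print "The text is different\n"; }   # printed

# == and != compare as numbers, so they are the wrong tool for text
```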
Getting part of a string
• substr takes characters out of a string
$letter = substr($dnastring, $position, 1);
The arguments are: which string, where in the string to start (positions are counted from 0), and how many letters to take.
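A short sketch of substr taking one letter and then three. The string, the position, and the variable $codon are illustrative assumptions:

```perl
#!/usr/bin/perl
use strict;      # optional: catch typos and common mistakes
use warnings;

my $dnastring = "ACGTGCTA";
my $position = 2;                                 # positions count from 0

my $letter = substr($dnastring, $position, 1);    # one letter at position 2: G
my $codon  = substr($dnastring, 0, 3);            # first three letters: ACG

print "$letter $codon\n";                         # G ACG
```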
Combining strings
• Strings can be concatenated (joined)
• Use the dot . operator
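A minimal sketch of the dot operator. The two half-sequences are made up for illustration:

```perl
#!/usr/bin/perl
use strict;      # optional: catch typos and common mistakes
use warnings;

my $left  = "GAA";
my $right = "TTC";

my $site = $left . $right;        # GAATTC
print "Site: " . $site . "\n";    # several pieces can be chained together
```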
Making Decisions - review
• The if operator is generally used together
with numerical or string comparison
operators, inside an (expression)
numerical: ==, !=, >, <, >=, <=
string: eq, ne
• You can make decisions on each member
of an array using a loop which puts each
part of the array through the test, one at a time.
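As a sketch of a loop that puts each array member through a test, this counts the G bases in a sequence. The sequence and the counter name $gcount are assumptions for illustration:

```perl
#!/usr/bin/perl
use strict;      # optional: catch typos and common mistakes
use warnings;

my @bases = split //, "ACGTGCTA";
my $gcount = 0;

foreach my $base (@bases) {
    # string comparison, applied to one element at a time
    if ($base eq "G") {
        $gcount = $gcount + 1;
    }
}
print "Found $gcount G bases\n";   # Found 2 G bases
```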
More healthy exercise
• Write a program that asks the user for a DNA restriction site, and then tells them whether that particular sequence matches the site for the restriction enzyme EcoRI, BamHI, or HindIII.
• Site for EcoRI: GAATTC
• BamHI: GGATCC
• HindIII: AAGCTT