An Introduction to Perl
Sources and inspirations:
http://www.cs.utk.edu/~plank/plank/classes/cs494/494/notes/Perl/lecture.html
Randal L Schwartz and Tom Christiansen,
“Learning Perl” 2nd ed., O’Reilly
Randal L Schwartz and Tom Phoenix,
“Learning Perl” 3rd ed., O’Reilly
Dr Nathalie Japkowicz, Dr Alan Williams
Go O'Reilly!
Perl overview (1)
• Perl = Practical extraction and report language
• Perl = Pathologically eclectic rubbish lister
• It is a powerful general-purpose language, which is particularly useful for writing “quick and dirty”
Perl overview (2)
• In the hierarchy of programming languages, Perl sits halfway between high-level languages
such as Pascal, C and C++, and shell scripts
(languages that add control structure to Unix command-line instructions) such as sh, sed and awk
• By the way:
– awk = Aho, Weinberger, Kernighan
– sed = Stream Editor
Advantages of Perl (2)
• Perl offers extremely strong regular expression
capabilities, which allow fast, flexible and reliable
string handling operations, especially pattern
Disadvantages of Perl
• Perl is a jumble! It contains many, many
features from many languages and tools
• It contains different constructs for the same
functionality (for example, there are at least 5
ways to perform a one-line if statement)
• It is not a very readable language
• You cannot distribute a Perl program as an
opaque binary. That is, you cannot really
commercialize products you develop in Perl
Perl resources and versions
• http://www.perl.org tells you everything that you want to know about Perl
• What you will see here is Perl 5
• Perl 5.8.0 was released in July 2002
• Perl 6 (http://dev.perl.org/perl6/) is the next
version, still under development, but moving
along nicely. The first book on Perl 6 is in stores (http://www.oreilly.com/catalog/perl6es)
Scalar data: strings and numbers
Scalars need not be defined or have their types declared:
Perl understands from context.
% cat hellos.pl
#!/usr/bin/perl -w
print "Hello" . " " . "world\n";
print "hi there " . 2 . " worlds!" . "\n";
print (("5" + 6) . " eggs\n" . " in " . "
Scalar variables
Scalar variable names start with a dollar sign. They
do not have to be declared.
12
$k\n
Quotes and substitution
Suppose $x = 3
Single-quotes ' ' allow no substitution except for the
escape sequences \\ and \'
print('$x\n'); gives $x\n and no new line
Double-quotes " " allow substitution of variables like $x and control codes like \n (newline)
print("$x\n"); gives 3 (and a new line).
Back-quotes ` ` also allow substitution, then try to
execute the result as a system command, returning as the final value whatever the system command outputs
$y = `date`; print($y); prints the output of the date command (the current date and time)
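The difference between single and double quotes can be checked by looking at string lengths. This is a minimal sketch; the variable names $single and $double are chosen here for illustration and do not come from the slides. Back-quotes are left out because their result depends on the system.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $x = 3;

# Single quotes: no substitution, so the string holds the
# four characters  $  x  \  n
my $single = '$x\n';

# Double quotes: $x is replaced by 3 and \n becomes a newline,
# so the string holds two characters
my $double = "$x\n";

print length($single), "\n";   # prints 4
print length($double), "\n";   # prints 2
```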
Control statements: if, else, elsif
print "'$name' follows 'fred'\n";}
elsif ($name eq 'fred') {
print "both names are 'fred'\n";}
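Only part of the example survives on this slide. The sketch below is one way a complete if / elsif / else chain of this kind could look; the subroutine name compare_to_fred and the use of the string operators gt and eq are assumptions made for illustration, not a reconstruction of the original program.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: compares a name with 'fred' as strings.
sub compare_to_fred {
    my ($name) = @_;
    if ($name gt 'fred') {
        return "'$name' follows 'fred'";
    }
    elsif ($name eq 'fred') {
        return "both names are 'fred'";
    }
    else {
        return "'$name' precedes 'fred'";
    }
}

print compare_to_fred($_), "\n" for ('wilma', 'fred', 'barney');
```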
Control statements: loops (1)
% oddsum_while.pl
10
Use of uninitialized value at
oddnums.pl line 6, <STDIN> chunk 1.
Control statements: loops (2)
• End-line comments begin with #
• It is okay, though not nice, to use a variable
without initialization (like $sum). Such a
variable is treated as 0 if it is first used as a
number or as the empty string "" if it is first
used as a string. In fact, it is always undef,
variously converted
• Perl can, if asked, issue a warning (use the -w
flag)
• Of course, while is only one of many looping
constructs in Perl. Read on
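As a small sketch of the points above, here is a while loop that sums odd numbers. It is not the original oddsum_while.pl: the limit is fixed at 10 instead of being read from STDIN, and $sum is initialized explicitly so that no "uninitialized value" warning appears.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $limit = 10;   # assumed value; the slide's script reads it from input
my $sum   = 0;    # explicit initialization keeps -w / warnings quiet
my $n     = 1;

while ($n <= $limit) {
    $sum += $n;   # add the current odd number
    $n   += 2;    # move to the next odd number
}

print "$sum\n";   # 1 + 3 + 5 + 7 + 9 = 25
```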
Control statements: loops (3)
Control statements: loops (4)
We also have do-while and do-until, and we have
foreach. Read on.
Control statements: loops (5)
Control constructs compared
            C                       Perl (braces required)
the same    if () { }               if () { }
            if (! ) { }             unless () { }
different   } else if () { }        } elsif () { }
the same    while () { }            while () { }
the same    for (aa;bb;cc) { }      for (aa;bb;cc) { }
                                    foreach $v (@array) { }
similar     0 is FALSE              0, "0", and "" are FALSE
similar     != 0 is TRUE            anything not false is TRUE
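The truth rules in the last two rows can be tried out directly. The sketch below uses unless and foreach from the table; the list of test values is chosen here for illustration. Note that the strings "00", "0.0" and " " count as true because only the exact string "0" and the empty string are false.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @candidates = (0, "0", "", "00", "0.0", " ", 1);

# grep keeps the elements for which the block is true
my @false = grep { !$_ } @candidates;
my @true  = grep {  $_ } @candidates;

foreach my $v (@candidates) {
    unless ($v) { print "'$v' is FALSE\n"; }
    else        { print "'$v' is TRUE\n";  }
}

print scalar(@false), " false, ", scalar(@true), " true\n";   # 3 false, 4 true
```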
Lists and arrays
• A list is an ordered collection of scalars. An array is a variable that contains a list
• Each element is an independent scalar value. A list can hold numbers, strings, undef values, or any
mixture of kinds of scalar values
• To use an array element, prefix the array name with
a $; place a subscript in square brackets
• To access the whole array, prefix its name with a @
• You can copy an array into another. You can use the
Command-line arguments
Suppose that a Perl program stored in the file cleanUp
is invoked in Unix/Linux with the command:
cleanUp -o result.htm data.htm
The built-in list named @ARGV then contains three
elements:
('-o', 'result.htm', 'data.htm')
These three elements can be accessed as:
$ARGV[0]
$ARGV[1]
$ARGV[2]
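A short sketch of walking through @ARGV by index. So that it runs without any arguments, it fills @ARGV with the three values from the cleanUp example when none are given; that fallback is an addition for demonstration only.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Simulate "cleanUp -o result.htm data.htm" when run without arguments
@ARGV = ('-o', 'result.htm', 'data.htm') unless @ARGV;

# $#ARGV is the index of the last element
for my $i (0 .. $#ARGV) {
    printf "\$ARGV[%d] = %s\n", $i, $ARGV[$i];
}

print "number of arguments: ", scalar(@ARGV), "\n";
```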
John Nathalie Zebra
hello nil notary
Array examples (2A)
Array examples (2B)
% each_rev.pl
a bc d efg efg d bc a
hi j
j hi klm nopq st
Array examples (3)
Reversing a text file (whole lines)
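The code for this slide is missing. A minimal sketch of the idea, assuming the lines are held in an array: reverse in list context returns the elements in the opposite order. The sample lines are invented for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Stand-in for the lines of a file; with real input one could
# write  print reverse <>;  to read the files named on the command line
my @lines = ("first\n", "second\n", "third\n");

my @reversed = reverse @lines;   # whole lines, last one first

print @reversed;
```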
• A hash is similar to an array, but instead of subscripts, we
can have anything as a key, and we use curly brackets
rather than square brackets
• The official name is associative array (known to be
implemented by hashing)
• Keys and values can be any scalars; keys are always
converted to strings
• To refer to a hash as a whole, prefix its name with a %
• If you assign a hash to an array, it becomes a simple list
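The bullets above can be illustrated with a small sketch. The hash names %age and %h and their contents are invented for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my %age = (fred => 35, wilma => 33);   # % names the whole hash
$age{barney} = 30;                     # $ and curly brackets for one element

for my $name (sort keys %age) {
    print "$name\t$age{$name}\n";
}

# Keys are converted to strings: the number 1.0 becomes the key "1"
my %h;
$h{1.0} = 'one';
print "key is '$_'\n" for keys %h;

# Assigning a hash to an array gives a flat list of keys and values
# (the order of the pairs is not specified)
my @flat = %age;
print scalar(@flat), "\n";   # 6
```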
<> loops over the files listed
Hash examples III: character frequency count
# end of input, print %count
for $c (sort keys %count) {
    print "$c\t$count{$c}\n";
}
Character frequency count (2)
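Only the printing loop of the original program survives. The sketch below shows one possible complete version under stated assumptions: the input is a fixed list of lines rather than <>, newlines are not counted, and characters are obtained with split on the empty pattern.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Stand-in for input lines (the original presumably read them with <>)
my @lines = ("hello world\n", "perl\n");

my %count;
for my $line (@lines) {
    chomp $line;                         # drop the trailing newline
    $count{$_}++ for split //, $line;    # one hash entry per character
}

# end of input, print %count
for my $c (sort keys %count) {
    print "$c\t$count{$c}\n";
}
```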
• A subroutine is a user-defined function. The
syntax is very simple; so are the semantics
• A call is marked with &. The value returned is that of
the last expression evaluated
Subroutines (2)
A few housekeeping rules
• You can place your definitions anywhere in the file,
though it is recommended to have them at the beginning
• Perl always uses the latest definition in the file—any
preceding one is ignored
• Certain elements of the syntax are optional
• The & might sometimes be omitted (but it is not a good idea).
• The return operator may precede a value to be
returned (this can be useful):
if ( $x > $y ) { return $x }
Subroutines (3)
• Clearly, the use of global variables is much
too limited. Subroutines take arguments, and work on them via a predefined list variable
@_ or its elements $_[0], $_[1] and so on
Subroutines (4)
• $_[0], $_[1] are not fun to work with. We
can rename them locally, using the my
operator, which creates a sub's private variables.
Here, we declare two such variables and
initialize them right away
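The code for this slide is missing. A minimal sketch of a two-argument max that copies @_ into private variables with my; the names $x and $y follow the return example shown earlier, the rest is an assumption.

```perl
#!/usr/bin/perl
use strict;
use warnings;

sub max {
    my ($x, $y) = @_;               # private copies of the first two arguments
    if ( $x > $y ) { return $x }
    $y;                             # value of the last expression is returned
}

print max(16, 19), "\n";   # prints 19
print max(7, 3), "\n";     # prints 7
```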
• This produces 19 (23 gets ignored) and
26 (the second value is undef, that is, 0)
Subroutines (6)
• We could stop the subroutine if the number
of arguments is wrong. The (generally very useful!) operator die does that for us
The script is stopped after printing this:
max needs two arguments: 16 19 23
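A sketch of how such a check could be written. The message text is taken from the slide; wrapping the call in eval is an addition made here so that the example can show the message and keep running, whereas without eval the script would stop.

```perl
#!/usr/bin/perl
use strict;
use warnings;

sub max {
    # @_ in a double-quoted string is joined with spaces
    die "max needs two arguments: @_\n" unless @_ == 2;
    my ($x, $y) = @_;
    return $x > $y ? $x : $y;
}

# eval catches the die; the message is left in $@
my $result = eval { max(16, 19, 23) };
print "caught: $@" unless defined $result;

print max(16, 19), "\n";   # prints 19
```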
Subroutines (7)
• We can have just a warning, if we use the
operator warn instead
The script prints this:
max needs two arguments: 16 19 23
Subroutines (8)
• It is, by the way, not a bad idea to generalize max
by allowing it to take any number of arguments
    $curr_max
}
$z = &max ( );
if ( defined $z ) { print $z . "\n"; }
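Most of the generalized subroutine is missing from the slides. The sketch below is one way to write it, keeping the name $curr_max from the surviving fragment; the loop body is an assumption. With no arguments, shift returns undef, which is why the caller tests with defined.

```perl
#!/usr/bin/perl
use strict;
use warnings;

sub max {
    my $curr_max = shift;            # first argument, or undef if there is none
    for my $v (@_) {                 # remaining arguments
        $curr_max = $v if $v > $curr_max;
    }
    $curr_max;
}

print max(16, 19, 23), "\n";         # prints 23

my $z = max();
if ( defined $z ) { print $z . "\n"; }
else              { print "no arguments, no maximum\n"; }
```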
Regular expressions (1)
• A regular expression (also called a pattern) is a
template that describes a class of strings. A string can either match or not match the pattern
• The simplest pattern is one character
• A character class (the pattern matches any of
these characters) is written in square brackets:
[01234567] an octal digit
[0-7] an octal digit
[0-9A-F] a hex digit
[^A-Za-z] not a letter (^ "negates")
[0-9-] a decimal digit or a minus
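The character classes listed above can be tried on single characters. In this sketch the test characters are chosen for illustration, and each class is wrapped in the anchors ^ and $ so that the whole one-character string must match.

```perl
#!/usr/bin/perl
use strict;
use warnings;

for my $s ('7', '8', 'F', 'g', '-') {
    my @hits;
    push @hits, 'octal digit'    if $s =~ /^[0-7]$/;
    push @hits, 'hex digit'      if $s =~ /^[0-9A-F]$/;
    push @hits, 'not a letter'   if $s =~ /^[^A-Za-z]$/;
    push @hits, 'digit or minus' if $s =~ /^[0-9-]$/;
    print "$s: ", join(', ', @hits), "\n";
}
```

The character g matches none of the four classes, so its line shows no hits.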
Regular expressions (2)
• Metacharacters:
. (dot) any character except \n
• Anchors:
^ the beginning of a string
$ the end of a string
Regular expressions (3)
$x = "01239876AGH";
if ( $x =~ /^0[1-9]{4,}/ ) { print "yes1\n"; }
if ( $x =~ /[A-Z]{3}$/ ) { print "yes2\n"; }
if ( $x =~ /^.*[A-Z]{4}$/ ) { print "yes3\n"; }
• The Boolean operator =~ tries to match a string
with a regular expression written inside slashes
• Patterns can be grouped by parentheses
(the whole pattern becomes one item)
• Alternation is denoted by the bar |
• Some character classes are predefined:
class    not class
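The rows of the table are missing here. Perl's standard predefined classes include \d (digit), \w (word character) and \s (whitespace), with the negated forms \D, \W and \S. The sketch below uses them on an invented sample string; a match with the g flag in list context returns all matches.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $s = "Room 42, floor 3";

my @digits   = $s =~ /\d/g;    # every single digit: 4, 2, 3
my @words    = $s =~ /\w+/g;   # runs of word characters: Room, 42, floor, 3
my @nonwords = $s =~ /\W+/g;   # runs of anything else: 3 separators

print scalar(@digits),   " digits\n";
print scalar(@words),    " words\n";
print scalar(@nonwords), " separators\n";
```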
Regular expression examples (1)
Regular expression examples (2)
$j = "JjJjJjJj";
Regular expression examples (3)
$k = "Boom Boom, out go the lights!";
$k =~ /(Boom\W){2}/; # yes: \W is space, comma
$k =~ /\Bgo\B/; # no: "go" is a complete word
Regular expression substitution (1)
We can modify a string variable by applying a substitution. The operator is =~ and the substitution is written as s/pattern/replacement/
Regular expression substitution (2)
Matched patterns are remembered in built-in variables
$1, $2, $3 etc. These variables keep their values until
the next successful match
Each set of parentheses in a pattern corresponds to a
'just' a single string to play with
a single string to play with, just just
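The code that produced the two output lines above is missing. The sketch below is a guess at substitutions that give the same results, assuming the starting string was "just a single string to play with"; the patterns and variable names are not from the slides.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $s = "just a single string to play with";

# $1 is the first word; put quotes around it
(my $quoted = $s) =~ s/^(\w+)/'$1'/;

# $1 is the first word, $2 is the rest; move the word to the end, twice
(my $moved = $s) =~ s/^(\w+) (.*)$/$2, $1 $1/;

print "$quoted\n";   # 'just' a single string to play with
print "$moved\n";    # a single string to play with, just just
```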
Regular expression substitution (3)
A substitution can be applied to all occurrences of
the pattern, that is, globally:
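A small sketch of the difference the g flag makes, using an invented sample string. Without g only the first occurrence is replaced; with g all of them are, and the substitution returns the number of replacements.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $t = "one fish two fish red fish";

(my $first = $t) =~ s/fish/cat/;    # first occurrence only
(my $all   = $t) =~ s/fish/cat/g;   # every occurrence

my $count = ($t =~ s/fish/cat/g);   # changes $t, returns how many were replaced

print "$first\n";   # one cat two fish red fish
print "$all\n";     # one cat two cat red cat
print "$count\n";   # 3
```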
Regular expression substitution (4)
$v = "This is a double double word.";
$v =~ s/(\b\w+\b) \1/$1/;
print "$v\n";
This is a double word.
$v = "This is a triple triple triple word.";
Regular expression substitution (5)
# Find all dates, selecting and reinserting the context.
# $1 and $6 match the context Superfluous digits,
# as 43 and 55 in 432001-01-2255, belong in the context.
# "Dates" such as April 31 or February 30 are allowed.
# There are no provisions for leap years.
s/(\D*)(($Year)-($Month)-($Day))(\D|.*$)/$1<date>$2<\/date>$6/g;
s/(\D*)(($Day)-($Month)-($Year))(\D|.*$)/$1<date>$2<\/date>$6/g;
print $_;
}
Here is a more realistic example (last year's homework).
It needs some explanation, which is best given in class.
Regular expression substitution (6)
DATA
Both 12-09-2000 and 25-8-324 are good dates,
but 30-14-1955 and 10-10-10 are not OTOH, 10-10-010 is.
In another course
• Predefined variables (lots!)
• More on lists, arrays and hashes
• More on regular expressions
Mistakes that novices make (1)
Adapted from Programming Perl, page 361
1. Testing "all-at-once" instead of incrementally,
either bottom-up or top-down
2. Optimistically skipping print scaffolding to
dump values and show progress
3. Not running with the perl -w switch to catch
obvious typographical errors
4. Leaving off $ or @ or % from the front of a
variable
5. Forgetting the trailing semicolon
Thanks to Alan Williams for this list
Mistakes that novices make (2)
7. Unbalanced (), {}, [], "", '', ``, and sometimes
<>
8. Confusing '' and "", or / and \
9. Using == instead of eq, != instead of ne, =
instead of ==, and so on
• ('White' == 'Black') and ($x = 5) evaluate
as (0 == 0) and (5) and thus are true!
10. Using "else if" instead of "elsif"
11. Putting a comma after the file handle in a
print statement
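Mistake 9 can be seen in a short sketch. The variable names are invented, and the numeric warning is switched off inside one block only so that the demonstration prints cleanly; with warnings left on, Perl would complain that the strings are not numeric, which is exactly the help the -w switch gives.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my ($p, $q) = ('White', 'Black');

my $numeric_equal;
{
    no warnings 'numeric';                  # silence "isn't numeric" for the demo
    $numeric_equal = ($p == $q) ? 1 : 0;    # both strings convert to 0
}
my $string_equal = ($p eq $q) ? 1 : 0;      # compared as strings

print "== says $numeric_equal, eq says $string_equal\n";   # == says 1, eq says 0
```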
Mistakes that novices make (3)
12. Not chopping the output of backquotes `date` or not
normally start at 0, not 1
14. Using $_, $1, or other side-effect variables, then modifying the code in a way that unknowingly affects or is affected
by these
15. Forgetting that regular expressions are greedy, seeking