152.152.98.11 - - [16/Nov/2005:16:32:50 -0500] "GET /jobs/ HTTP/1.1" 200 15140
Perl for Web Log Analysis
Perl - introduction
A full-featured, fast, and easy to use scripting language
Very powerful pattern-matching facilities
More powerful than gawk; very popular for web
programming and CGI files
Many Perl tutorials are available, e.g.
learn.perl.org/
www.perl.com/pub/a/2000/10/begperl1.html
www.perlmonks.org/index.pl?node=Tutorials
Perl – historical note
Perl is often expanded as Practical Extraction and Report Language
Developed by Larry Wall
Perl 1.0 was released on Usenet in 1987
Perl became one of the most popular web programming languages, thanks to its powerful text manipulation and quick development
Perl is widely known as
Perl - running
First Perl script (on Unix): file1.pl
#!/usr/local/bin/perl -w
print "Hi there!\n";
Note: On Windows, first line usually is
#!c:/Perl/bin/perl.exe -w
% file1.pl
Result: Hi there!
Perl for Windows
ActivePerl – ready-to-install Perl distribution
Free download
www.activestate.com/Products/ActivePerl/
Perl basics
Two data types: numbers and strings
Perl uses many special characters, such as $, @, and %, as part of its syntax
Perl variables:
Scalars (simple variables, things) start with $, e.g. $count
Arrays (lists) start with @, e.g. @array1
Hashes (associative arrays) start with %
Usual control structures
Full introduction to Perl is beyond the scope of this
module
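The three kinds of variables listed above can be seen side by side in a minimal sketch. The variable names and values are illustrative assumptions, not taken from the lecture's programs.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Illustrative values only (hypothetical, not from the lecture code).
my $count = 3;                          # scalar holding a number
my $site  = "KDnuggets";                # scalar holding a string
my @codes = (200, 304, 404);            # array: an ordered list
my %hits  = ("/jobs/" => 2, "/" => 5);  # hash: key => value pairs

# A single item of an array or hash is a scalar, so it is written with $.
my $summary = "$site: $count codes, first is $codes[0], /jobs/ hits: $hits{'/jobs/'}";
print "$summary\n";
```

Note that `$codes[0]` and `$hits{'/jobs/'}` start with `$` even though the containers are written `@codes` and `%hits`.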
What does this code do?
The Tao of Coding
computer time
It is much better (and faster) to develop programs using methods that AVOID mistakes than to try to find bugs in badly written programs
Perl style: understandability first
Perl lets you write tricky programs to save a few lines of text
AVOID this approach
Use careful, step-by-step development
Test after every step
A good program should be easy to understand
Only after you have an understandable program, and only if you need to, should you improve efficiency
Perl coding
Variables can be declared implicitly by their first use, e.g.
$oldvar=$nevar+27;
If $nevar was not assigned before, it is undefined and is treated as zero in this arithmetic
Danger! Can lead to hard-to-find errors (what if the variable was misspelled and was supposed to be $newvar ?)
Much better to declare variables explicitly, e.g.
my $newvar = 0;
Enforced by the pragma
use strict;
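A small sketch of how `use strict` turns the misspelling above into an immediate error. The string `eval` is only an assumed device so that the failed compile can be shown without stopping the script; in a normal program the error would simply abort compilation.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $newvar = 5;
my $oldvar = $newvar + 27;    # correctly spelled: works as intended
print "oldvar = $oldvar\n";

# The same statement with the typo $nevar, compiled inside a string eval.
# Under "use strict" the undeclared name is rejected at compile time,
# so eval returns undef and the message is left in $@.
my $ok    = eval q{ my $total = $nevar + 27; 1 };
my $error = $@;
print "strict caught the typo: $error" unless $ok;
```

Without `use strict`, the misspelled statement would run silently and produce 27.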
Sample log file
We will again use file d100.log – first 100 lines from the Nov 16, 2005 KDnuggets log file.
We will give useful code examples. You are encouraged to try the code examples in this lecture on this file
You should get the same answers!
Perl for parsing a web log file
Program 0: logparse0.pl - read and print log file
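The listing of logparse0.pl is not reproduced here, so the following is only a sketch of what a read-and-print program could look like. The sub name `print_log` and the command-line usage are assumptions.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Read a log from an open filehandle, echo each line, return the line count.
sub print_log {
    my ($fh) = @_;
    my $cnt = 0;
    while (my $line = <$fh>) {
        chomp $line;          # drop the trailing newline
        print "$line\n";
        $cnt++;
    }
    return $cnt;
}

# Assumed usage: perl logparse0.pl d100.log
if (@ARGV) {
    open(my $fh, '<', $ARGV[0]) or die "Cannot open $ARGV[0]: $!";
    my $cnt = print_log($fh);
    close($fh);
    print " read $cnt lines\n";
}
```

Checking the result of `open` with `or die` follows the "avoid mistakes" advice above: a missing file is reported at once instead of producing an empty run.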
Perl regular expressions, 1
Usage:
$var =~ / regex /
where regex is a regular expression. E.g.
$line =~ /google/
will match all lines containing "google"
Note: / delimit regular expression, so / can't be used inside (unless escaped like this \/ )
Perl log parsing, 1
Check how many lines refer to google
print " $cnt lines matched google";
Applying this code to d100.log, you get:
2 lines matched google
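Only the final print statement of this program survives above, so here is a hedged sketch of the counting loop around it. The sub name `count_google` and the three sample lines are hypothetical; real input would come from d100.log.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Count the lines that contain the string "google".
sub count_google {
    my (@lines) = @_;
    my $cnt = 0;
    foreach my $line (@lines) {
        $cnt++ if $line =~ /google/;
    }
    return $cnt;
}

# Hypothetical lines for illustration only.
my @sample = (
    '10.0.0.1 - - [16/Nov/2005:00:00:01 -0500] "GET / HTTP/1.1" 200 100 "http://www.google.com/search?q=kdd" "Mozilla/4.0"',
    '10.0.0.2 - - [16/Nov/2005:00:00:02 -0500] "GET /jobs/ HTTP/1.1" 200 200 "-" "Mozilla/4.0"',
    '10.0.0.3 - - [16/Nov/2005:00:00:03 -0500] "GET /news/ HTTP/1.1" 200 300 "-" "Googlebot/2.1"',
);
my $cnt = count_google(@sample);
print " $cnt lines matched google\n";
```

The match is case-sensitive, so the third sample line ("Googlebot") is not counted; `/google/i` would count it as well.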
Perl regular expressions, 2
Special characters:
. : matches any one character (except a newline)
a* : matches zero or more repeats of "a"
a+ : matches 1 or more repeats of "a"
\S : matches any non-white space character
^ : anchor – matches beginning of string
$ : anchor – matches end of string
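The special characters above can be tried out with a short sketch. The helper `matches` and the test strings are assumptions made for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return 1 if $string matches $regex, otherwise 0.
sub matches {
    my ($string, $regex) = @_;
    return $string =~ $regex ? 1 : 0;
}

my @checks = (
    [ 'cat',            qr/c.t/     ],   # . stands for exactly one character
    [ 'ct',             qr/c.t/     ],   # nothing between c and t: no match
    [ 'b',              qr/^ba*$/   ],   # a* allows zero repeats
    [ 'b',              qr/^ba+$/   ],   # a+ needs at least one a: no match
    [ 'baaa',           qr/^ba+$/   ],
    [ 'two words',      qr/^\S+$/   ],   # \S does not match the space: no match
    [ 'www.google.com', qr/^google/ ],   # ^ anchors at the start: no match
    [ 'www.google.com', qr/com$/    ],   # $ anchors at the end
);
foreach my $check (@checks) {
    my ($string, $regex) = @$check;
    my $verdict = matches($string, $regex) ? "match" : "no match";
    print "'$string' against $regex: $verdict\n";
}
```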
Log parse 2: IP address
IP address is the first item on the log line
In almost all log files it is followed by " - - ", representing missing "ident_user" and
"auth_user" fields
Regular expression for matching these 3 fields: $line =~ /^(\S+) - - /;
Perl regex: parentheses capture match variables
Perl regex items enclosed in parentheses ()
correspond to special match variables
Variable $1 contains the value matched by the regular expression in the first parentheses, $2 the second, and so on
Perl regex: match variables
print " processed $cnt log lines\n";
Note: the first line (the path to Perl) is probably different on your machine
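A sketch of how the IP-address regex and the match variable `$1` could fit together. The sub name `extract_ip` is an assumption; the first sample line is the one shown on the title slide.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the IP address at the start of a log line, or undef if no match.
sub extract_ip {
    my ($line) = @_;
    if ($line =~ /^(\S+) - - /) {
        return $1;    # text captured by the first pair of parentheses
    }
    return;
}

my @lines = (
    '152.152.98.11 - - [16/Nov/2005:16:32:50 -0500] "GET /jobs/ HTTP/1.1" 200 15140',
    'not a log line',
);
my $cnt = 0;
foreach my $line (@lines) {
    $cnt++;
    my $ip = extract_ip($line);
    print "IP: $ip\n" if defined $ip;
}
print " processed $cnt log lines\n";
```

`$1` is only meaningful after a successful match, which is why it is read inside the `if` block.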
Perl regular expression 4: brackets
Brackets [ ] match any one of the characters inside
Example:
[cmt]an will match can, man or tan,
will not match ban or dan
Perl regular expression 4b: brackets [^ ]
[^x] will match any character except x
(note: here ^ is not the beginning of text anchor)
Example: [^:]* will match any string that does not include a colon :
Example: if $date is 16/Nov/2005:031415 , then after
$date =~ /([^:]*):.*/
[^:]* will match 16/Nov/2005
Because it was enclosed in (), the match result is stored in $1
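Both kinds of brackets can be checked with a short sketch. The word list comes from the example above; the timestamp string is taken from the sample log line and is otherwise an arbitrary choice.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# [cmt]an: one character from the set c, m, t, followed by "an".
my @hits = grep { /[cmt]an/ } qw(can man tan ban dan);
print "matched: @hits\n";

# [^:]* takes the longest run of non-colon characters; () stores it in $1.
my $stamp = '16/Nov/2005:16:32:50';
my $date;
if ($stamp =~ /([^:]*):.*/) {
    $date = $1;
    print "date: $date\n";
}
```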
Parsing log: Date, Time
Date, Time is specified in the log as
[DD/Mon/YYYY:HH:MM:SS timezone]
Matching regular expression
\[([^:]+):(..):(..):(..) -0500\]
Parsing log: Date, Time
Matching regular expression in detail
\[([^:]+):(..):(..):(..) -0500\]
\[ and \] match the literal brackets
[^:]+ matches a string that does not contain : (the date); value in $1
first ( ) after the date will match HH (hours); value in $2
second ( ) will match MM; in $3
third ( ) matches SS; in $4
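A sketch of the date and time match applied to the sample log line. The sub name `parse_timestamp` is hypothetical, and `(\d\d)` is an assumed stand-in for the two-character time fields; it is not necessarily what the lecture's program uses.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Split the bracketed timestamp into date, hours, minutes and seconds.
# Assumption: each time field is written as exactly two digits, (\d\d).
sub parse_timestamp {
    my ($line) = @_;
    if ($line =~ /\[([^:]+):(\d\d):(\d\d):(\d\d) -0500\]/) {
        return ($1, $2, $3, $4);
    }
    return;    # empty list when the line does not match
}

my $line = '152.152.98.11 - - [16/Nov/2005:16:32:50 -0500] "GET /jobs/ HTTP/1.1" 200 15140';
my ($date, $hh, $mm, $ss) = parse_timestamp($line);
print "date=$date time=$hh:$mm:$ss\n";
```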
Parsing log: Time Zone
The time zone is relative to GMT
The time zone in the log file is for the SERVER, not for the visitor, so it is nearly always the same throughout the log,
but it changes during daylight saving time
In our test log file the time zone is -0500, US
Eastern time zone
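Because the zone changes with daylight saving time, a pattern that hard-codes -0500 would miss lines logged under a different offset. One possible variation, not taken from the lecture's code, captures the zone instead; the sub name `parse_zone` and the second sample timestamp are assumptions.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the time zone offset (a sign and four digits) from the bracketed
# timestamp of a log line, or undef if there is none.
sub parse_zone {
    my ($line) = @_;
    return $line =~ /\[[^\]]+ ([+-]\d{4})\]/ ? $1 : undef;
}

my @stamps = (
    '152.152.98.11 - - [16/Nov/2005:16:32:50 -0500] "GET /jobs/ HTTP/1.1" 200 15140',
    '152.152.98.11 - - [16/Jul/2005:16:32:50 -0400] "GET /jobs/ HTTP/1.1" 200 15140',  # hypothetical summer line
);
foreach my $line (@stamps) {
    my $zone = parse_zone($line);
    print "zone: $zone\n" if defined $zone;
}
```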
Parsing log: Request
HTTP version - usually ignored
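The request field can be split with the request part of the full pattern used later in this lecture. This is a sketch; the sub name `parse_request` is an assumption.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Split the quoted request into method, URL and HTTP version.
sub parse_request {
    my ($line) = @_;
    if ($line =~ /"(GET|HEAD|POST|OPTIONS) (\S+) HTTP(\S+)"/) {
        return ($1, $2, $3);
    }
    return;    # empty list when the line does not match
}

my $line = '152.152.98.11 - - [16/Nov/2005:16:32:50 -0500] "GET /jobs/ HTTP/1.1" 200 15140';
my ($method, $url, $version) = parse_request($line);
print "method=$method url=$url version=HTTP$version\n";
```

The third capture holds what follows the literal HTTP, here `/1.1`; it is captured but usually not used further.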
Parsing log: Status code and Object size
Status (Response) code is always a 3-digit number,
followed by space, so it can be matched with
(\d\d\d)
Object size is either a number or "-", followed by space. Simplest regex to match it is
(\S+)
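A sketch of the two patterns applied together. The sub name `parse_status_size` is hypothetical, and anchoring on the closing quote of the request is an assumption made so that the fragment works on its own. Treating "-" as zero bytes is likewise an illustrative choice.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return (status, size) from a log line; size may be a number or "-".
sub parse_status_size {
    my ($line) = @_;
    # The quote and space before the status code come from the end of
    # the request field (assumed anchor for this standalone fragment).
    if ($line =~ /" (\d\d\d) (\S+)/) {
        return ($1, $2);
    }
    return;
}

my $line = '152.152.98.11 - - [16/Nov/2005:16:32:50 -0500] "GET /jobs/ HTTP/1.1" 200 15140';
my ($status, $size) = parse_status_size($line);
my $bytes = $size eq '-' ? 0 : $size;    # assumption: count "-" as 0 bytes
print "status=$status bytes=$bytes\n";
```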
Parsing log: Referrer
The Referrer is a string enclosed in double quotes "…"
Can have anything inside except for a double quote
Can also be "-" in case of a direct request
Not documented, but can be "" (nothing between the quotes)
Referrer can be matched by:
"([^"]*)"
Parsing log: User agent
User agent is also a string enclosed in double quotes "…", that can have anything inside except for a double quote
It can also be "-"
User agent can be matched by:
"([^"]+)"
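A sketch that pulls the referrer and user agent from the end of a log line. The sub name `parse_referrer_agent`, the end-of-line anchor, and the referrer and user-agent values in the sample line are all assumptions made for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return (referrer, user agent): the last two quoted strings on the line.
sub parse_referrer_agent {
    my ($line) = @_;
    # [^"]* lets the referrer be empty; [^"]+ requires a non-empty agent.
    if ($line =~ /"([^"]*)" "([^"]+)"\s*$/) {
        return ($1, $2);
    }
    return;
}

# Hypothetical line: the referrer and user agent are invented for this sketch.
my $line = '152.152.98.11 - - [16/Nov/2005:16:32:50 -0500] "GET /jobs/ HTTP/1.1" 200 15140 '
         . '"http://www.google.com/search?q=jobs" "Mozilla/4.0 (compatible; MSIE 6.0)"';
my ($referrer, $agent) = parse_referrer_agent($line);
print "referrer=$referrer\n";
print "agent=$agent\n";
```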
Parsing a web log line: putting all together
The matching is done by the following (should be all on one line)
if ($line =~ /^(\S+) - - \[([^:]+):(..):(..):(..) -0500\] "(GET|HEAD|POST|OPTIONS) (\S+) HTTP(\S+)" (\d\d\d) (\S+) "([^"]*)" "([^"]+)"/ ) {
…
}
Full code is in program weblog_parse.pl
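weblog_parse.pl itself is not reproduced here. The sketch below shows one way the same pattern could be wrapped in a sub, written with the `/x` modifier so it can be spread over several commented lines. Under `/x`, a literal space must be written as `\ `. The field names, the sub name `parse_line`, the `(\d\d)` time fields, and the referrer and user agent in the sample line are assumptions.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumed field names, one per pair of parentheses, in order.
my @FIELDS = qw(ip date hour min sec method url http status size referrer agent);

my $LOG_RE = qr{
    ^(\S+)\ -\ -\                                  # IP address, ident_user, auth_user
    \[([^:]+):(\d\d):(\d\d):(\d\d)\ -0500\]\       # date, hours, minutes, seconds, zone
    "(GET|HEAD|POST|OPTIONS)\ (\S+)\ HTTP(\S+)"\   # method, URL, HTTP version
    (\d\d\d)\ (\S+)\                               # status code, object size
    "([^"]*)"\ "([^"]+)"                           # referrer, user agent
}x;

# Return a hash reference of named fields, or undef if the line does not match.
sub parse_line {
    my ($line) = @_;
    my @values = $line =~ $LOG_RE;    # list context: the twelve captures
    return unless @values;
    my %field;
    @field{@FIELDS} = @values;
    return \%field;
}

# Hypothetical line: referrer and user agent are invented for this sketch.
my $line = '152.152.98.11 - - [16/Nov/2005:16:32:50 -0500] "GET /jobs/ HTTP/1.1" 200 15140 '
         . '"http://www.google.com/search?q=jobs" "Mozilla/4.0 (compatible; MSIE 6.0)"';
my $rec = parse_line($line);
print "$rec->{ip} requested $rec->{url} with status $rec->{status}\n" if $rec;
```

Returning named fields instead of using `$1` to `$12` directly is a design choice in line with the "understandability first" advice above.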
Perl arrays
A Perl array is an ordered list of items
Array names begin with @
Array initialization:
@days=("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat");
Perl arrays, number of items
When referring to a single array item, the name begins with "$". E.g. we print the first array item (index 0) using
print $days[0] ;
The index of the last item in an array is
$#array
so the number of items is $#array + 1, or scalar(@array)
$#days is 6, so @days has 7 items
Perl array iteration
Iterating over entire array
foreach $day (@days) {print $day,"\n" } ;
is the same as
for ($n=0; $n <7; $n++) {
print $days[$n],"\n" } ;
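The array operations above, collected into one runnable sketch under `use strict`. The helper variable names are assumptions.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @days = ("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat");

my $first      = $days[0];         # single item: name starts with $
my $last_index = $#days;           # index of the last item
my $n_items    = scalar(@days);    # number of items = last index + 1
print "first=$first last index=$last_index items=$n_items\n";

# Two equivalent ways to visit every item.
foreach my $day (@days) {
    print $day, "\n";
}
for (my $n = 0; $n < @days; $n++) {    # @days in numeric context is the item count
    print $days[$n], "\n";
}
```

Using `$n < @days` instead of the literal 7 keeps the loop correct if the array changes size.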
Perl hash
A hash is an unordered collection of key-value pairs.
Hash initialization:
%capitals=("USA", "Washington D.C.",
"France", "Paris",
"China", "Beijing") ;
Perl hash reference
When referring to a single hash item, the name begins with "$", e.g. $capitals{"France"}
Perl hash iteration
Iteration over the entire hash
foreach $country (keys %capitals) {
print "$country capital $capitals{$country}\n"; }
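The hash operations above in one runnable sketch. Sorting the keys and the `exists` check are additions made for illustration, since a hash does not keep its keys in any fixed order.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my %capitals = ("USA"    => "Washington D.C.",
                "France" => "Paris",
                "China"  => "Beijing");

# Single item: the name starts with $ and the key goes in braces.
my $capital = $capitals{"France"};
print "France capital $capital\n";

# keys returns the keys in no particular order; sort makes the output stable.
my @countries = sort keys %capitals;
foreach my $country (@countries) {
    print "$country capital $capitals{$country}\n";
}

# exists tests for a key without creating it.
my $known = exists $capitals{"Spain"} ? "yes" : "no";
print "Spain known: $known\n";
```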
Additional tools for Web log