
Perl for web log analysis


Page 1

152.152.98.11 - - [16/Nov/2005:16:32:50 -0500] "GET /jobs/ HTTP/1.1" 200 15140

Perl for Web Log Analysis

Page 2

Perl - introduction

• A full-featured, fast, and easy-to-use scripting language
• Very powerful pattern-matching facilities
• More powerful than gawk; very popular for web programming and CGI scripts
• Many Perl tutorials, e.g.
  learn.perl.org/
  www.perl.com/pub/a/2000/10/begperl1.html
  www.perlmonks.org/index.pl?node=Tutorials

Page 3

Perl – historical note

• PERL is commonly expanded as Practical Extraction and Report Language
• Developed by Larry Wall
• Perl 1.0 was released to Usenet's alt.comp.sources in 1987
• Perl is one of the most popular web programming languages, due to powerful text manipulation and quick development
• Perl is widely known as

Page 4

Perl - running

• First Perl script (on Unix), file1.pl:

#!/usr/local/bin/perl -w
print "Hi there!\n";

Note: On Windows, the first line is usually

#!c:/Perl/bin/perl.exe -w

Run it (after making the file executable with chmod +x file1.pl):

% ./file1.pl

Result: Hi there!

Page 5

Perl for Windows

• ActivePerl: a ready-to-install Perl distribution
• Free download:
  www.activestate.com/Products/ActivePerl/

Page 6

Perl basics

• Two basic data types: numbers and strings
• Perl uses many special characters ($, @, %) as part of its syntax
• Perl variables:
  ▪ Scalars (simple variables, single things) start with $, e.g. $count
  ▪ Arrays (lists) start with @, e.g. @array1
  ▪ Hashes (associative arrays) start with %
• Usual control structures
• A full introduction to Perl is beyond the scope of this module
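The three variable kinds can be seen side by side in a short sketch. The values and the names $site and %status_text are made up for illustration; only $count and @array1 come from the slide.

```perl
use strict;
use warnings;

# Scalar: a single number or string.
my $count = 3;
my $site  = "KDnuggets";

# Array: an ordered list, indexed from 0.
my @array1 = ("GET", "HEAD", "POST");

# Hash: key/value pairs (these keys and values are illustrative).
my %status_text = (200 => "OK", 404 => "Not Found");

# A single element of an array or hash is a scalar, so it is written with $.
print "Site $site uses $count methods; the first is $array1[0]\n";
print "Status 404 means $status_text{404}\n";
```

Note that the prefix follows what is being accessed: $array1[0] and $status_text{404} are single values, so both start with $.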

Page 7

What does this code do?

Page 8

The Tao of Coding

computer time

• It is much better (and faster) to develop programs using methods that AVOID mistakes than to try to find bugs in badly written programs

Page 9

Perl style: understandability first

• Perl lets you write tricky programs to save a few lines of text
  ▪ AVOID this approach
• Use careful, step-by-step development
  ▪ Test after every step
• A good program should be easy to understand
• Only after you have an understandable program, and only if you need to, should you improve efficiency

Page 10

Perl coding

• Variables can be declared implicitly by their first use, e.g.
  $oldvar = $nevar + 27;
  ▪ if $nevar was not set before, it is undefined and counts as zero in the addition
• Danger! This can lead to hard-to-find errors (what if the variable was misspelled and was supposed to be $newvar?)
• Much better to declare variables explicitly, e.g.
  my $newvar = 0;
  ▪ Enforced by the pragma
  use strict;
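A minimal sketch of what use strict buys you. The value 15 is arbitrary, and the misspelled line is compiled inside a string eval only so that this script can show the error message instead of refusing to start; in a normal script the typo would stop compilation outright.

```perl
use strict;
use warnings;

my $newvar = 15;              # explicit declaration
my $oldvar = $newvar + 27;
print "oldvar = $oldvar\n";

# The string eval inherits "use strict", so the undeclared $nevar
# is rejected when the snippet is compiled.
my $ok    = eval q{ my $sum = $nevar + 27; 1 };
my $error = $ok ? "" : $@;
print "strict rejected the typo: $error" if $error;
```

Without use strict the same expression would quietly treat $nevar as zero, which is exactly the hard-to-find error the slide warns about.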

Page 11

Sample log file

• We will again use the file d100.log: the first 100 lines from the Nov 16, 2005 KDnuggets log file.
• We will give useful code examples. You are encouraged to try the code examples in this lecture on this file
• You should get the same answers!

Page 12

Perl for parsing a web log file

Program 0: logparse0.pl - read and print log file
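The listing for this slide did not survive, so the following is only a sketch of what a read-and-print program might look like, not the original logparse0.pl. The sub name print_log and the in-memory fallback are assumptions added so the example runs without d100.log.

```perl
use strict;
use warnings;

# Read every line from a filehandle, print it, and return the line count.
sub print_log {
    my ($fh) = @_;
    my $cnt = 0;
    while (my $line = <$fh>) {
        print $line;
        $cnt++;
    }
    return $cnt;
}

# With a file name on the command line (e.g. d100.log) read that file;
# otherwise fall back to a one-line in-memory sample.
my $sample = qq{152.152.98.11 - - [16/Nov/2005:16:32:50 -0500] "GET /jobs/ HTTP/1.1" 200 15140\n};
my $source = @ARGV ? $ARGV[0] : \$sample;

open(my $fh, '<', $source) or die "Cannot open log: $!";
my $cnt = print_log($fh);
close($fh);
print "Read $cnt lines\n";
```

Usage, assuming the script is saved as logparse0.pl: perl logparse0.pl d100.log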

Page 13

Perl regular expressions, 1

• Usage:

$var =~ /regex/

where regex is a regular expression. E.g.

$line =~ /google/

is true for every line that contains "google"

Note: the slashes / delimit the regular expression, so a / inside it must be escaped like this: \/
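URLs in a log are full of slashes, so the escaping rule comes up quickly. A small sketch on the sample line from the title slide; the m{...} form is standard Perl but is not mentioned in the slides.

```perl
use strict;
use warnings;

my $line = '152.152.98.11 - - [16/Nov/2005:16:32:50 -0500] "GET /jobs/ HTTP/1.1" 200 15140';

# Slashes inside a /.../ pattern must be escaped.
my $escaped = ($line =~ /GET \/jobs\//) ? 1 : 0;

# With the m operator another delimiter can be chosen; inside m{...}
# slashes need no escaping.
my $braces = ($line =~ m{GET /jobs/}) ? 1 : 0;

# A pattern that does not occur in the line simply fails to match.
my $google = ($line =~ /google/) ? 1 : 0;

print "escaped=$escaped braces=$braces google=$google\n";
```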

Page 14

Perl log parsing, 1

Check how many lines refer to google

print " $cnt lines matched google";

Applying this code to d100.log, you get:

2 lines matched google
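Only the final print of this program survives, so here is a sketch of the counting loop it implies. The three log lines are hypothetical stand-ins, not lines from d100.log, so the count differs from the result quoted above.

```perl
use strict;
use warnings;

# Hypothetical sample lines; with the real file, read d100.log line by line.
my @lines = (
    '10.0.0.1 - - [16/Nov/2005:00:00:01 -0500] "GET / HTTP/1.1" 200 1234 "http://www.google.com/search?q=data+mining" "Mozilla/4.0"',
    '10.0.0.2 - - [16/Nov/2005:00:00:02 -0500] "GET /jobs/ HTTP/1.1" 200 15140 "-" "Mozilla/4.0"',
    '10.0.0.3 - - [16/Nov/2005:00:00:03 -0500] "GET /news/ HTTP/1.1" 200 2048 "-" "Googlebot/2.1"',
);

my $cnt = 0;
foreach my $line (@lines) {
    $cnt++ if $line =~ /google/;   # case-sensitive: "Googlebot" does not count
}
print " $cnt lines matched google\n";
```

Matching is case-sensitive by default; writing /google/i would also count the Googlebot line.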

Page 15

Perl regular expressions, 2

Special characters:

.  : matches any one character (except a newline)

a* : matches zero or more repeats of "a"

a+ : matches 1 or more repeats of "a"

\S : matches any non-whitespace character

^ : anchor – matches beginning of string

$ : anchor – matches end of string
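Each special character can be checked on a tiny string. The patterns and test strings below are invented for illustration; the third column records whether a match is expected, and $surprises counts any disagreement.

```perl
use strict;
use warnings;

# Each entry: [ pattern, string, whether a match is expected ].
my @checks = (
    [ qr/j.bs/,   "jobs",       1 ],  # . matches any one character
    [ qr/^ba*d$/, "bd",         1 ],  # a* allows zero repeats
    [ qr/^ba+d$/, "bd",         0 ],  # a+ needs at least one "a"
    [ qr/^ba+d$/, "baaad",      1 ],
    [ qr/^\S+$/,  "no-spaces",  1 ],  # \S+ is a run of non-whitespace
    [ qr/^\S+$/,  "two words",  0 ],
    [ qr/^GET/,   "GET /jobs/", 1 ],  # ^ anchors at the start
    [ qr/GET$/,   "GET /jobs/", 0 ],  # $ anchors at the end
);

my $surprises = 0;
foreach my $check (@checks) {
    my ($regex, $string, $expected) = @$check;
    my $matched = ($string =~ $regex) ? 1 : 0;
    $surprises++ if $matched != $expected;
    printf "%-14s %-14s %s\n", $regex, "'$string'", $matched ? "match" : "no match";
}
print "unexpected results: $surprises\n";
```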

Page 16

Log parse 2: IP address

• The IP address is the first item on the log line
• In almost all log files it is followed by " - - ", representing the missing "ident_user" and "auth_user" fields
• Regular expression for matching these 3 fields:
  $line =~ /^(\S+) - - /;

Page 17

Perl regex: parentheses capture match variables

• Perl regex items enclosed in parentheses () correspond to special match variables
• Variable $1 contains the value matched by the regular expression in the first parentheses, $2 the second, etc.

Page 18

Perl regex: match variables

print " processed $cnt log lines\n";

Note: The first line of the script (the #! path to Perl) is probably different on your machine
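Most of this program is missing from the slide, so the following is a sketch of how $1 might be used with the IP pattern from the earlier slide. The second input line is deliberately malformed and the array @ips is an illustrative addition.

```perl
use strict;
use warnings;

# Hypothetical input; the real input would be d100.log.
my @lines = (
    '152.152.98.11 - - [16/Nov/2005:16:32:50 -0500] "GET /jobs/ HTTP/1.1" 200 15140',
    'this line is not in log format',
);

my $cnt = 0;
my @ips;
foreach my $line (@lines) {
    $cnt++;
    # Use $1 only when the match succeeded; after a failed match it
    # still holds the value from an earlier successful one.
    if ($line =~ /^(\S+) - - /) {
        push @ips, $1;
        print "IP: $1\n";
    }
}
print " processed $cnt log lines\n";
```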

Page 19

Perl regular expression 4: brackets

• Brackets [ ] match any one of the characters inside them
• Example:
  ▪ [cmt]an will match can, man or tan,
  ▪ but will not match ban or dan
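The bracket example can be checked directly. The anchors ^ and $ are added here so that each word has to match as a whole.

```perl
use strict;
use warnings;

my @words   = qw(can man tan ban dan);
my @matched = grep { /^[cmt]an$/ } @words;   # keep only words the class accepts
print "matched: @matched\n";
```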

Page 20

Perl regular expression 4b: brackets [^ ]

[^x] will match any one character except x

  ▪ (note: here ^ is not the beginning-of-string anchor)

Example: [^:]* will match any run of characters that does not include a colon :

Example: if $date is 16/Nov/2005:031415, after

$date =~ /([^:]*):.*/

[^:]* will match 16/Nov/2005

Because it was enclosed in (), the match result is stored in $1
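The same example as a runnable sketch. The variable $day is an illustrative name for the captured part.

```perl
use strict;
use warnings;

my $date = "16/Nov/2005:031415";
my $day  = "";

# [^:]* consumes everything before the first colon; the parentheses save it in $1.
if ($date =~ /([^:]*):.*/) {
    $day = $1;
}
print "date part: $day\n";
```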

Page 21

Parsing log: Date, Time

• Date, Time is specified in the log as

[DD/Mon/YYYY:HH:MM:SS timezone]

Matching regular expression

\[([^:]+):(\d\d):(\d\d):(\d\d) -0500\]

Page 22

Parsing log: Date, Time

Matching regular expression in detail

\[([^:]+):(\d\d):(\d\d):(\d\d) -0500\]

\[ and \] match the literal square brackets

([^:]+) matches everything up to the first colon, i.e. the date DD/Mon/YYYY; value in $1

first (\d\d) will match HH (hours); value in $2

second (\d\d) will match MM (minutes); value in $3

third (\d\d) matches SS (seconds); value in $4
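A sketch of the date and time pattern applied to the sample line from the title slide. It assumes each time field is exactly two digits, written here as (\d\d), and it keeps the hard-coded -0500 time zone.

```perl
use strict;
use warnings;

my $line = '152.152.98.11 - - [16/Nov/2005:16:32:50 -0500] "GET /jobs/ HTTP/1.1" 200 15140';

my ($date, $hh, $mm, $ss);
if ($line =~ /\[([^:]+):(\d\d):(\d\d):(\d\d) -0500\]/) {
    # Copy the match variables right away; the next match overwrites them.
    ($date, $hh, $mm, $ss) = ($1, $2, $3, $4);
    print "date=$date time=", join(":", $hh, $mm, $ss), "\n";
}
```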

Page 23

Parsing log: Time Zone

• The time zone is given as an offset relative to GMT
• The time zone in the log file is for the SERVER, not for the visitor, so it is nearly always the same throughout the log
  ▪ but it changes during daylight saving time
• In our test log file the time zone is -0500, the US Eastern time zone
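Because the offset changes with daylight saving time, a pattern with a fixed -0500 would skip summer entries. One possible variation, not taken from the slides, captures the offset instead; both timestamps below are made up.

```perl
use strict;
use warnings;

# Hypothetical timestamps: one in standard time, one in daylight saving time.
my @stamps = ('[16/Nov/2005:16:32:50 -0500]', '[16/Jul/2005:16:32:50 -0400]');

my @zones;
foreach my $stamp (@stamps) {
    # ([+-]\d{4}) accepts any offset: a sign followed by four digits.
    if ($stamp =~ /\[([^:]+):(\d\d):(\d\d):(\d\d) ([+-]\d{4})\]/) {
        push @zones, $5;
        print "$1 has offset $5\n";
    }
}
```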

Page 24

Parsing log: Request

HTTP version: usually ignored
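The request pattern itself is missing from this slide. The sketch below borrows the request part of the combined pattern shown on the "putting all together" slide; the variable names are illustrative.

```perl
use strict;
use warnings;

my $line = '152.152.98.11 - - [16/Nov/2005:16:32:50 -0500] "GET /jobs/ HTTP/1.1" 200 15140';

my ($method, $url, $version);
if ($line =~ /"(GET|HEAD|POST|OPTIONS) (\S+) HTTP(\S+)"/) {
    ($method, $url, $version) = ($1, $2, $3);
    # $version keeps the leading slash, e.g. "/1.1"; it is usually ignored anyway.
    print "method=$method url=$url version=$version\n";
}
```

Requests using a method outside the four listed alternatives will not match this pattern.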

Page 25

Parsing log: Status code and Object size

The status (response) code is always a 3-digit number followed by a space, so it can be matched with

(\d\d\d)

The object size is either a number or "-", followed by a space. The simplest regex to match it is

(\S+)
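A sketch of both fields on the sample line. Anchoring on the closing quote of the request and turning "-" into 0 are choices made for this example, not something the slides prescribe.

```perl
use strict;
use warnings;

my $line = '152.152.98.11 - - [16/Nov/2005:16:32:50 -0500] "GET /jobs/ HTTP/1.1" 200 15140';

my ($status, $size);
# The closing quote of the request pins the match just before the status code.
if ($line =~ /" (\d\d\d) (\S+)/) {
    ($status, $size) = ($1, $2);
    $size = 0 if $size eq "-";   # assumption: treat a missing size as 0 bytes
    print "status=$status size=$size\n";
}
```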

Page 26

Parsing log: Referrer

The Referrer is a string enclosed in double quotes "…". It can have anything inside except for a double quote

It can also be "-" in case of a direct request

Not documented, but it can also be "" (nothing between the quotes)

Referrer can be matched by:

"([^"]*)"

Page 27

Parsing log: User agent

User agent is also a string enclosed in double quotes "…" that can have anything inside except for a double quote. It can also be "-"

User agent can be matched by:

"([^"]+)"
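A sketch of the two quoted fields, using the sub-patterns from the combined expression on the "putting all together" slide. The referrer and user agent values are made up for illustration.

```perl
use strict;
use warnings;

# Tail of a log line; the quoted values are invented for this example.
my $tail = '200 15140 "http://www.google.com/search?q=jobs" "Mozilla/4.0 (compatible)"';

my ($referrer, $agent);
if ($tail =~ /"([^"]*)" "([^"]+)"$/) {
    ($referrer, $agent) = ($1, $2);
    print "referrer=$referrer\nagent=$agent\n";
}

# [^"]* (zero or more) also accepts the undocumented empty referrer "".
my $empty_ok = ('200 512 "" "-"' =~ /"([^"]*)" "([^"]+)"$/) ? 1 : 0;
print "empty referrer accepted: $empty_ok\n";
```

The referrer uses * because it may be empty, while the user agent uses + and so needs at least one character.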

Page 28

Parsing a web log line: putting it all together

The matching is done by the following (should be all on one line)

if ($line =~ /^(\S+) - - \[([^:]+):(\d\d):(\d\d):(\d\d) -0500\] "(GET|HEAD|POST|OPTIONS) (\S+) HTTP(\S+)" (\d\d\d) (\S+) "([^"]*)" "([^"]+)"/ ) {

}

Full code is in program weblog_parse.pl
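weblog_parse.pl itself is not reproduced here, so the following is only a sketch of the combined match. It spreads the pattern over several lines with the /x modifier (a standard Perl feature the slides do not use), assumes two-digit time fields written as (\d\d), and stores the twelve captures in a hash with illustrative key names. The referrer and user agent on the sample line are made up.

```perl
use strict;
use warnings;

# The request part is the sample line from the slides; the two quoted
# fields appended to it are invented for illustration.
my $line = '152.152.98.11 - - [16/Nov/2005:16:32:50 -0500] "GET /jobs/ HTTP/1.1" 200 15140 "-" "Mozilla/4.0"';

# Under /x, whitespace in the pattern is ignored, so a literal space is "\ ".
my $log_re = qr{
    ^(\S+) \ -\ -\                           # IP, ident_user, auth_user
    \[ ([^:]+) : (\d\d) : (\d\d) : (\d\d)    # date, HH, MM, SS
    \ -0500 \] \                             # time zone, fixed as in the slides
    "(GET|HEAD|POST|OPTIONS) \ (\S+) \ HTTP(\S+)" \    # request
    (\d\d\d) \ (\S+) \                       # status code, object size
    "([^"]*)" \ "([^"]+)"                    # referrer, user agent
}x;

my %f;
if ($line =~ $log_re) {
    @f{qw(ip date hh mm ss method url version status size referrer agent)}
        = ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $12);
    print "$_ = $f{$_}\n" for qw(ip date method url status size agent);
}
```

Lines that do not match, for instance ones with another time zone or request method, leave %f empty and should be counted or reported separately.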

Page 29

Perl arrays

• A Perl array is an ordered list of items
• Array names begin with @
• Array initialization:

@days = ("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat");

Page 30

Perl arrays, num of items

• When referring to a single array item, the name begins with "$". E.g. we print the first array item (index 0) using

print $days[0];

• The index of the last item in an array is

$#array

so the number of items is $#array + 1, or scalar(@array)

For @days, $#days is 6 and scalar(@days) is 7
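The difference between the last index and the item count is easy to check with the @days array. The variable names on the left are illustrative.

```perl
use strict;
use warnings;

my @days = ("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat");

my $first      = $days[0];        # single item: $ prefix, index starts at 0
my $last_index = $#days;          # index of the last item
my $num_items  = scalar(@days);   # number of items

print "first=$first last_index=$last_index items=$num_items\n";
```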

Page 31

Perl array iteration

• Iterating over the entire array

foreach $day (@days) { print $day, "\n" }

• is the same as

for ($n = 0; $n < 7; $n++) {
    print $days[$n], "\n";
}
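Both loop styles visit the items in the same order, which the sketch below checks by collecting them into two arrays (@by_foreach and @by_index are illustrative names). The index loop uses @days as its bound instead of the literal 7, so it keeps working if the array changes.

```perl
use strict;
use warnings;

my @days = ("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat");

my @by_foreach;
foreach my $day (@days) {
    push @by_foreach, $day;
}

my @by_index;
for (my $n = 0; $n < @days; $n++) {   # @days in a numeric comparison is the item count
    push @by_index, $days[$n];
}

print "same order: ", ("@by_foreach" eq "@by_index" ? "yes" : "no"), "\n";
```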

Page 32

Perl hash

• A hash is an unordered collection of key, value pairs.
• Hash initialization:

%capitals = ("USA", "Washington D.C.",
             "France", "Paris",
             "China", "Beijing");

Page 33

Perl hash reference

• When referring to a single hash item, the name begins with "$" and the key goes in braces, as in $capitals{$country}

Page 34

Perl hash iteration

Iteration over the entire hash

foreach $country (keys %capitals) {
    print "$country capital $capitals{$country}\n";
}
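A runnable version of the hash lookup and iteration. Sorting the keys is an addition made here so the output order is repeatable, since keys returns them in no particular order.

```perl
use strict;
use warnings;

my %capitals = ("USA",    "Washington D.C.",
                "France", "Paris",
                "China",  "Beijing");

# A single value: $ prefix and the key in braces.
print "Capital of France: $capitals{France}\n";

# Sort the keys so that repeated runs print the same order.
my @countries = sort keys %capitals;
foreach my $country (@countries) {
    print "$country capital $capitals{$country}\n";
}
```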

Page 35

Additional tools for Web log

Date posted: 23/10/2014, 16:11
