hi" | shcat http://publish.ora.com/ Grep out URL References When you need to quickly get a list of all the references in an HTML page, here's a utility you can use to fetch an HTML page
Chapter 4: The Socket Library - Part 2
Now we wait for a response from the server. We read in the response and selectively echo it out, looking at the $response, $header, and $data variables to see if the user is interested in each part of the reply:
# get the HTTP response line
# get the entity body
if ($all || defined $data) {
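A minimal sketch of that read-and-echo logic, assuming the server connection is open on filehandle F and that $all, $response, $header, and $data carry the command-line flags (any other names here are illustrative):

# get the HTTP response line
my $the_response = <F>;
print $the_response if ($all || defined $response);

# get the header
my $header_line;
while (($header_line = <F>) !~ m/^[\r\n]*$/) {
    print $header_line if ($all || defined $header);
}

# get the entity body
if ($all || defined $data) {
    print while (<F>);
}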
# parse command line arguments
getopts('hHrd');
# print out usage if needed
if (defined $opt_h || $#ARGV<0) { help(); }
# if it wasn't an option, it was a URL
while($_ = shift @ARGV) {
hcat($_, $opt_r, $opt_H, $opt_d);
print "Hypertext cat help\n\n";
print "This program prints out documents on a remote web server.\n";
print "By default, the response code, header, and data are printed\n";
print "but can be selectively printed with the -r, -H, and -d options.\n\n";
print " -h help\n";
print " -r print out response\n";
print " -H print out header\n";
print " -d print out data\n\n";
# we're only interested in HTTP URL's
return if ($the_url[0] !~ m/http/i);
# connect to server specified in 1st parameter
if (!defined open_TCP('F', $the_url[1], $the_url[2])) {
# request the path of the document to get
print F "GET $the_url[3] HTTP/1.0\n";
# get the entity body
if ($all || defined $data) {
With hcat, one can easily retrieve documents from remote web servers. But there are times when a client request needs to be more complex than hcat is willing to allow. To give the user more flexibility in sending client requests, we'll change hcat into shcat, a shell utility that accepts methods, headers, and entity-body data from standard input. With this program, you can write shell scripts that specify different methods, send custom headers, and submit form data.
All of this can be done by changing a few lines around. In hcat, where you see this:
# request the path of the document to get
print F "GET $the_url[3] HTTP/1.0\n";
print F "Accept: */*\n";
print F "User-Agent: hcat/1.0\n\n";
Replace it with this:
# copy STDIN to network connection
while (<STDIN>) {print F;}
and save it as shcat. Now you can say whatever you want on shcat's STDIN, and it will forward it on to the web server you specify. This allows you to do things like HTML form postings with POST, or a file upload with PUT, and selectively look at the results. At this point, it's really all up to you what you want to say, as long as it's HTTP-compliant.
Here's a UNIX shell script example that calls shcat to do a file upload:
Trang 10hi" | shcat http://publish.ora.com/
Grep out URL References
When you need to quickly get a list of all the references in an HTML page, here's a utility you can use to fetch an HTML page from a server and print out the URLs referenced within the page. We've taken the hcat code and modified it a little. There's also another function that we added to parse out URLs from the HTML. Let's go over that first:
# while there are HTML tags
skip_others: while ($data =~ s/<([^>]*)>//) {
$link =~ s/[\n\r]//g;   # remove newlines, returns anywhere in url
push (@urls, $link);
next skip_others;
}
# handle case when url isn't in quotes (ie: <a href=...>)
$link =~ s/[\n\r]//g;   # remove newlines, returns anywhere in url
push (@urls, $link);
grab_urls( ) looks for tag parameters of the form: <tag parameter=" ">. The outer if statement looks for HTML tags, like <A>, <IMG>, <BODY>, <FRAME>. The inner if statement looks for parameters to the tags, like SRC and HREF, followed by text. Upon finding a match, the referenced URL is pushed into an array, which is returned at the end of the function. We've saved this in web.pl, and will include it in the hgrepurl program with a require 'web.pl'.
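Pulling those pieces together, the function might look something like this minimal sketch (an illustration of the approach just described; the exact regular expressions in web.pl may differ):

# grab_urls: pull URL references out of a block of HTML.
# Takes the HTML text and a list of tag/attribute pairs to look for,
# e.g. ('img', 'src', 'a', 'href'), and returns the matching URLs.
sub grab_urls {
    my ($data, %tags) = @_;
    my @urls;

    # while there are HTML tags
    skip_others: while ($data =~ s/<([^>]*)>//) {
        my $in_brackets = $1;
        foreach my $tag (keys %tags) {
            # outer test: is this one of the tags we care about?
            if ($in_brackets =~ /^\s*$tag\s+/i) {
                # inner test: quoted parameter, e.g. <a href="...">
                if ($in_brackets =~ /\s$tags{$tag}\s*=\s*"([^"]*)"/i) {
                    my $link = $1;
                    $link =~ s/[\n\r]//g;   # remove newlines, returns anywhere in url
                    push(@urls, $link);
                    next skip_others;
                }
                # handle case when url isn't in quotes (ie: <a href=...>)
                elsif ($in_brackets =~ /\s$tags{$tag}\s*=\s*(\S+)/i) {
                    my $link = $1;
                    $link =~ s/[\n\r]//g;   # remove newlines, returns anywhere in url
                    push(@urls, $link);
                    next skip_others;
                }
            }
        }
    }
    @urls;
}

A call such as grab_urls($data, ('a', 'href')) then returns every quoted or unquoted HREF value found in $data.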
The second major change from hcat to hgrepurl is the addition of:
if (defined $images || $all) {
@links=grab_urls($data, ('img', 'src', 'body', 'background'));
}
if (defined $hyperlinks || $all) {
@links2= grab_urls($data, ('a', 'href'));
}
my $link;
for $link (@links, @links2) { print "$link\n"; }
This appends the entity-body to the scalar $data. From there, we call grab_urls( ) twice. The first call looks for image references by recognizing <img src=" "> and <body background=" "> in the HTML. The second looks for hyperlinks by searching for instances of <a href=" ">. Each call to grab_urls( ) returns an array of URLs, stored in @links and @links2, respectively. Finally, we print the results out.
Other than that, there are some smaller changes. For example, we look at the response code; if it isn't 200 (OK), we skip it:
# if not an "OK" response of 200, skip it
if ($the_response !~ m@^HTTP/\d+\.\d+\s+200\s@) {return;}
We've retrofitted the reading of the response line, headers, and entity-body to not echo to STDOUT. This isn't needed anymore in the context of this program. Also, instead of parsing the -r, -H, and -d command-line arguments, we look for -i to display image links only, and -l to display only hyperlinks.
So, to see just the image references at www.ora.com, one would do this:
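With the -i option just described, such a command would presumably look like this:

hgrepurl -i http://www.ora.com/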
# print out usage if needed
if (defined $opt_h || $#ARGV<0) { help(); }
# if it wasn't an option, it was a URL
while($_ = shift @ARGV) {
hgu($_, $opt_i, $opt_l);
# Subroutine to print out help text along with usage information
sub help {
print "Hypertext grep URL help\n\n";
print "This program prints out hyperlink and
image links that\n";
print "are referenced by a user supplied URL on a web server.\n\n";
sub hgu {
# grab parameters
my($full_url, $images, $hyperlinks)=@_;
my $all = !($images || $hyperlinks);
print "Please use fully qualified valid URL\n";
exit(-1);
}
# we're only interested in HTTP URL's
return if ($the_url[0] !~ m/http/i);
# connect to server specified in 1st parameter
if (!defined open_TCP('F', $the_url[1], $the_url[2])) {
# request the path of the document to get
print F "GET $the_url[3] HTTP/1.0\n";
print F "Accept: */*\n";
print F "User-Agent: hgrepurl/1.0\n\n";
# get the entity body
if (defined $images || $all) {
@links=grab_urls($data, ('img', 'src', 'body', 'background'));
}
if (defined $hyperlinks || $all) {
@links2= grab_urls($data, ('a', 'href'));
}
my $link;
for $link (@links, @links2) { print "$link\n"; }
}
Client Design Considerations
Now that we've done a few examples, let's address some issues that arise when developing, testing, and using web client software. Most of these issues are automatically handled by LWP, but when programming directly with sockets, you have to take care of them yourself.
How does your client handle tag parameters?
The decision to process or ignore extra tag parameters depends on the application of the web client. Some tag parameters change the tag's appearance by adjusting colors or sizes. Other tags are informational, like variable names and hidden variable declarations in HTML forms. Your client may need to pay close attention to these tags. For example, if your client sends form data, it may want to check all the parameters. Otherwise, your client may send data that is inconsistent with what the HTML specified; e.g., an HTML form might specify that a variable's value may not exceed a length of 20 characters. If the client ignored this parameter, it might send data over 20 characters. As the HTML standard evolves, your client may require some updating.
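As a concrete illustration, here is a minimal sketch of such a length check; the form HTML, field name, and value are made up for the example:

# Truncate a form value if the HTML declares a maxlength for it.
my $form_html = '<input type="text" name="username" maxlength="20">';
my ($field, $value) = ('username', 'a_rather_long_user_name_indeed');

if ($form_html =~ /<input[^>]*\bname\s*=\s*"?$field"?[^>]*\bmaxlength\s*=\s*"?(\d+)"?/i) {
    my $max = $1;
    if (length($value) > $max) {
        warn "truncating $field to $max characters\n";
        $value = substr($value, 0, $max);
    }
}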
What does your client do when the server's expected HTML format changes?
If the format of the HTML returned by the server changes, it is usually best to abort the current operation, and to notify someone to look into it. The client may need to be updated to handle the changes at the server.
Does the client analyze the response line and headers?
It is not advisable to write clients that skip over the HTTP response line and headers. While it may be easier to do so, it often comes back to haunt you later. For example, if the URL used by the client becomes obsolete or is changed, the client may interpret the entity-body incorrectly. Media types for the URL may change, and could be noticed in the HTTP headers returned by the server. In general, the client should be equipped to handle variations in metadata as they occur.
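For example, a client reading a reply on filehandle F could note the media type from the headers before deciding how to treat the entity-body; a small sketch along those lines (variable names are illustrative):

# Read the headers and remember the Content-type before touching the body.
my $content_type = '';
while (my $line = <F>) {
    last if ($line =~ /^[\r\n]*$/);          # blank line ends the headers
    if ($line =~ m{^Content-type:\s*([^\s;]+)}i) {
        $content_type = $1;
    }
}
warn "unexpected media type: $content_type\n"
    if ($content_type && $content_type ne 'text/html');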
Does your client handle URL redirection? Does it need to?
Perhaps the desired data still exists, but not at the location specified by your client. In the event of a redirection, will your client handle it? Does it examine the Location header? The answers to these questions depend on the purpose of the client.
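One way a client could notice and act on a redirect, assuming the response line is in $the_response and the reply headers have been collected into a hash %headers (both names are assumptions following the earlier examples):

# If the server answered with a redirect (3xx), pick up the new location.
if ($the_response =~ m@^HTTP/\d+\.\d+\s+3\d\d@) {
    if (defined $headers{'location'}) {
        my $new_url = $headers{'location'};
        print "document moved to $new_url\n";
        # a client could now repeat the request against $new_url,
        # for instance by calling hcat($new_url, $opt_r, $opt_H, $opt_d)
    }
}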
Does the client send authorization information when it shouldn't?
Two or more separate organizations may have CGI programs on the same server. It is important for your client not to send authorization information unless it is requested. Otherwise, the client may expose its authentication to an outside organization. This opens up the user's account to outsiders.
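One way to follow that advice is to hold back credentials until the server answers 401 with a WWW-Authenticate challenge; a minimal sketch, assuming the status code and headers have already been parsed into $status and %headers, and using MIME::Base64 for the encoding:

use MIME::Base64 qw(encode_base64);

my ($user, $password) = ('someone', 'secret');   # illustrative credentials
# Only send Authorization after the server has actually asked for it.
if ($status == 401 && ($headers{'www-authenticate'} || '') =~ /^Basic/i) {
    my $credentials = encode_base64("$user:$password", "");   # no trailing newline
    print F "Authorization: Basic $credentials\n";
}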
What does your client do when the server is down?
When the server is down, there are several options. The most obvious option is for the client to attempt the HTTP request at a later time. Other options are to try an alternate server or to abort the transaction. The programmer should give the user some configuration options for the client's actions.
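A simple retry policy built around the open_TCP( ) call used in this chapter might look like this sketch (the three attempts and 30-second pause are arbitrary choices):

# Try the connection a few times before giving up.
my $connected = 0;
for my $attempt (1 .. 3) {
    if (defined open_TCP('F', $the_url[1], $the_url[2])) {
        $connected = 1;
        last;
    }
    warn "attempt $attempt failed, retrying in 30 seconds\n";
    sleep 30;
}
die "could not reach $the_url[1]\n" unless $connected;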
What does your client do when the server response time is long?
For simple applications, it may be better to allow the user to interrupt the application. For user-friendly or unattended batch applications, it is desirable to time out the connection and notify the user.
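On UNIX, a common way to impose such a timeout from Perl is alarm( ) with a SIGALRM handler; a minimal sketch, assuming the connection is open on filehandle F and using an arbitrary 60-second limit:

# Give up if the server takes more than 60 seconds to answer.
my $the_response;
eval {
    local $SIG{ALRM} = sub { die "timeout\n" };
    alarm(60);
    $the_response = <F>;      # read the HTTP response line
    alarm(0);
};
if ($@ && $@ eq "timeout\n") {
    warn "server did not respond within 60 seconds\n";
}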
What does your client do when the server has a higher version of HTTP?
And what happens when the client doesn't understand the response? The most logical thing is to attempt to talk on a common denominator. Chances are that just about anything will understand HTTP/1.0, if that's what you feel comfortable using. In most cases, if the client doesn't understand the response, it would be nice to tell the user, or at least let the user know to get a client that understands the latest version of HTTP!
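At a minimum, a client can notice the mismatch by checking the version number on the response line; a small sketch along those lines, assuming the response line is in $the_response:

# Warn if the server speaks a newer major version of HTTP than we do.
if ($the_response =~ m@^HTTP/(\d+)\.(\d+)@) {
    my ($major, $minor) = ($1, $2);
    if ($major > 1) {
        warn "server speaks HTTP/$major.$minor; this client only knows HTTP/1.0\n";
    }
} else {
    warn "unrecognized response: $the_response";
}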