hi" | shcat http://publish.ora.com/ Grep out URL References When you need to quickly get a list of all the references in an HTML page, here's a utility you can use to fetch an HTML page
Chapter 4: The Socket Library - Part 2
Now we wait for a response from the server. We read in the response and selectively echo it out, looking at the $response, $header, and $data variables to see if the user is interested in each part of the reply:
# get the HTTP response line
# get the entity body
if ($all || defined $data) {
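A minimal sketch of that read-and-echo logic, assuming the server connection is open on filehandle F and that $all, $response, $header, and $data carry the command-line flags (any other names here are illustrative):

# get the HTTP response line
my $the_response = <F>;
print $the_response if ($all || defined $response);

# get the header
my $header_line;
while (($header_line = <F>) !~ m/^[\r\n]*$/) {
    print $header_line if ($all || defined $header);
}

# get the entity body
if ($all || defined $data) {
    print while (<F>);
}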
# parse command line arguments
getopts('hHrd');
# print out usage if needed
if (defined $opt_h || $#ARGV<0) { help(); }
# if it wasn't an option, it was a URL
while($_ = shift @ARGV) {
hcat($_, $opt_r, $opt_H, $opt_d);
print "Hypertext cat help\n\n";
print "This program prints out documents on a remote web server.\n";
print "By default, the response code, header, and data are printed\n";
print "but can be selectively printed with the -r, -H, and -d options.\n\n";
print " -h help\n";
print " -r print out response\n";
print " -H print out header\n";
print " -d print out data\n\n";
# we're only interested in HTTP URL's
return if ($the_url[0] !~ m/http/i);
# connect to server specified in 1st parameter
if (!defined open_TCP('F', $the_url[1], $the_url[2])) {
# request the path of the document to get
print F "GET $the_url[3] HTTP/1.0\n";
# get the entity body
if ($all || defined $data) {
With hcat, one can easily retrieve documents from remote web servers. But there are times when a client request needs to be more complex than hcat is willing to allow. To give the user more flexibility in sending client requests, we'll change hcat into shcat, a shell utility that accepts methods, headers, and entity-body data from standard input. With this program, you can write shell scripts that specify different methods, send custom headers, and submit form data.
All of this can be done by changing a few lines around. In hcat, where you see this:
# request the path of the document to get
print F "GET $the_url[3] HTTP/1.0\n";
print F "Accept: */*\n";
print F "User-Agent: hcat/1.0\n\n";
Replace it with this:
# copy STDIN to network connection
while (<STDIN>) {print F;}
and save it as shcat. Now you can say whatever you want on shcat's STDIN, and it will forward it on to the web server you specify. This allows you to do things like HTML form postings with POST, or a file upload with PUT, and selectively look at the results. At this point, it's really all up to you what you want to say, as long as it's HTTP-compliant.
Here's a UNIX shell script example that calls shcat to do a file upload:
Trang 10hi" | shcat http://publish.ora.com/
Grep out URL References
When you need to quickly get a list of all the references in an HTML page, here's a utility you can use to fetch an HTML page from a server and print out the URLs referenced within the page. We've taken the hcat code and modified it a little. There's also another function that we added to parse out URLs from the HTML. Let's go over that first:
# while there are HTML tags
skip_others: while ($data =~ s/<([^>]*)>//) {
$link =~ s/[\n\r]//g;   # remove newlines, returns anywhere in url
push (@urls, $link);
next skip_others;
}
# handle case when url isn't in quotes (ie: <a href=...>)
$link =~ s/[\n\r]//g;   # remove newlines, returns anywhere in url
push (@urls, $link);
grab_urls( ) looks for tag parameters of the form: <tag parameter=" ">. The outer if statement looks for HTML tags, like <A>, <IMG>, <BODY>, <FRAME>. The inner if statement looks for parameters to the tags, like SRC and HREF, followed by text. Upon finding a match, the referenced URL is pushed into an array, which is returned at the end of the function. We've saved this in web.pl, and will include it in the hgrepurl program with a require 'web.pl'.
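Pulling those pieces together, the function might look something like this minimal sketch (an illustration of the approach just described; the exact regular expressions in web.pl may differ):

# grab_urls: pull URL references out of a block of HTML.
# Takes the HTML text and a list of tag/attribute pairs to look for,
# e.g. ('img', 'src', 'a', 'href'), and returns the matching URLs.
sub grab_urls {
    my ($data, %tags) = @_;
    my @urls;

    # while there are HTML tags
    skip_others: while ($data =~ s/<([^>]*)>//) {
        my $in_brackets = $1;
        foreach my $tag (keys %tags) {
            # outer test: is this one of the tags we care about?
            if ($in_brackets =~ /^\s*$tag\s+/i) {
                # inner test: quoted parameter, e.g. <a href="...">
                if ($in_brackets =~ /\s$tags{$tag}\s*=\s*"([^"]*)"/i) {
                    my $link = $1;
                    $link =~ s/[\n\r]//g;   # remove newlines, returns anywhere in url
                    push(@urls, $link);
                    next skip_others;
                }
                # handle case when url isn't in quotes (ie: <a href=...>)
                elsif ($in_brackets =~ /\s$tags{$tag}\s*=\s*(\S+)/i) {
                    my $link = $1;
                    $link =~ s/[\n\r]//g;   # remove newlines, returns anywhere in url
                    push(@urls, $link);
                    next skip_others;
                }
            }
        }
    }
    @urls;
}

A call such as grab_urls($data, ('a', 'href')) then returns every quoted or unquoted HREF value found in $data.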
The second major change from hcat to hgrepurl is the addition of:
if (defined $images || $all) {
@links=grab_urls($data, ('img', 'src', 'body', 'background'));
}
if (defined $hyperlinks || $all) {
@links2= grab_urls($data, ('a', 'href'));
}
my $link;
for $link (@links, @links2) { print "$link\n"; }
This appends the entity-body to the scalar $data. From there, we call grab_urls( ) twice. The first call looks for image references by recognizing <img src=" "> and <body background=" "> in the HTML. The second looks for hyperlinks by searching for instances of <a href=" ">. Each call to grab_urls( ) returns an array of URLs, stored in @links and @links2, respectively. Finally, we print the results out.
Other than that, there are some smaller changes. For example, we look at the response code; if it isn't 200 (OK), we skip it:
# if not an "OK" response of 200, skip it
if ($the_response !~ m@^HTTP/\d+\.\d+\s+200\s@) {return;}
We've retrofitted the reading of the response line, headers, and entity-body to not echo to STDOUT. This isn't needed anymore in the context of this program. Also, instead of parsing the -r, -H, and -d command-line arguments, we look for -i to display image links only, and -l to display only hyperlinks.
So, to see just the image references at www.ora.com, one would do this:
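With the -i option just described, such a command would presumably look like this:

hgrepurl -i http://www.ora.com/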
# print out usage if needed
if (defined $opt_h || $#ARGV<0) { help(); }
# if it wasn't an option, it was a URL
while($_ = shift @ARGV) {
hgu($_, $opt_i, $opt_l);
# Subroutine to print out help text along with usage information
sub help {
print "Hypertext grep URL help\n\n";
print "This program prints out hyperlink and
image links that\n";
print "are referenced by a user supplied URL on a web server.\n\n";
sub hgu {
# grab parameters
my($full_url, $images, $hyperlinks)=@_;
my $all = !($images || $hyperlinks);
print "Please use fully qualified valid URL\n";
exit(-1);
}
# we're only interested in HTTP URL's
return if ($the_url[0] !~ m/http/i);
# connect to server specified in 1st parameter
if (!defined open_TCP('F', $the_url[1], $the_url[2])) {
# request the path of the document to get
print F "GET $the_url[3] HTTP/1.0\n";
print F "Accept: */*\n";
print F "User-Agent: hgrepurl/1.0\n\n";
# get the entity body
if (defined $images || $all) {
@links=grab_urls($data, ('img', 'src', 'body', 'background'));
}
if (defined $hyperlinks || $all) {
@links2= grab_urls($data, ('a', 'href'));
}
my $link;
for $link (@links, @links2) { print "$link\n"; }
}
Client Design Considerations
Now that we've done a few examples, let's address some issues that arise when developing, testing, and using web client software. Most of these issues are automatically handled by LWP, but when programming directly with sockets, you have to take care of them yourself.
How does your client handle tag parameters?
The decision to process or ignore extra tag parameters depends on the application of the web client. Some tag parameters change the tag's appearance by adjusting colors or sizes. Other tags are informational, like variable names and hidden variable declarations in HTML forms. Your client may need to pay close attention to these tags. For example, if your client sends form data, it may want to check all the parameters. Otherwise, your client may send data that is inconsistent with what the HTML specified; e.g., an HTML form might specify that a variable's value may not exceed a length of 20 characters. If the client ignored this parameter, it might send data over 20 characters. As the HTML standard evolves, your client may require some updating.
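As a concrete illustration, here is a minimal sketch of such a length check; the form HTML, field name, and value are made up for the example:

# Truncate a form value if the HTML declares a maxlength for it.
my $form_html = '<input type="text" name="username" maxlength="20">';
my ($field, $value) = ('username', 'a_rather_long_user_name_indeed');

if ($form_html =~ /<input[^>]*\bname\s*=\s*"?$field"?[^>]*\bmaxlength\s*=\s*"?(\d+)"?/i) {
    my $max = $1;
    if (length($value) > $max) {
        warn "truncating $field to $max characters\n";
        $value = substr($value, 0, $max);
    }
}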
What does your client do when the server's expected HTML format changes?
If the format of the HTML returned by the server changes, it is usually best to abort the current operation, and to notify someone to look into it. The client may need to be updated to handle the changes at the server.
Does the client analyze the response line and headers?
It is not advisable to write clients that skip over the HTTP response line and headers. While it may be easier to do so, it often comes back to haunt you later. For example, if the URL used by the client becomes obsolete or is changed, the client may interpret the entity-body incorrectly. Media types for the URL may change, and could be noticed in the HTTP headers returned by the server. In general, the client should be equipped to handle variations in metadata as they occur.
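For example, a client reading a reply on filehandle F could note the media type from the headers before deciding how to treat the entity-body; a small sketch along those lines (variable names are illustrative):

# Read the headers and remember the Content-type before touching the body.
my $content_type = '';
while (my $line = <F>) {
    last if ($line =~ /^[\r\n]*$/);          # blank line ends the headers
    if ($line =~ m{^Content-type:\s*([^\s;]+)}i) {
        $content_type = $1;
    }
}
warn "unexpected media type: $content_type\n"
    if ($content_type && $content_type ne 'text/html');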
Does your client handle URL redirection? Does it need to?
Perhaps the desired data still exists, but not at the location specified by your client. In the event of a redirection, will your client handle it? Does it examine the Location header? The answers to these questions depend on the purpose of the client.
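One way a client could notice and act on a redirect, assuming the response line is in $the_response and the reply headers have been collected into a hash %headers (both names are assumptions following the earlier examples):

# If the server answered with a redirect (3xx), pick up the new location.
if ($the_response =~ m@^HTTP/\d+\.\d+\s+3\d\d@) {
    if (defined $headers{'location'}) {
        my $new_url = $headers{'location'};
        print "document moved to $new_url\n";
        # a client could now repeat the request against $new_url,
        # for instance by calling hcat($new_url, $opt_r, $opt_H, $opt_d)
    }
}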
Does the client send authorization information when it shouldn't?
Two or more separate organizations may have CGI programs on the same server. It is important for your client not to send authorization information unless it is requested. Otherwise, the client may expose its authentication to an outside organization. This opens up the user's account to outsiders.
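One way to follow that advice is to hold back credentials until the server answers 401 with a WWW-Authenticate challenge; a minimal sketch, assuming the status code and headers have already been parsed into $status and %headers, and using MIME::Base64 for the encoding:

use MIME::Base64 qw(encode_base64);

my ($user, $password) = ('someone', 'secret');   # illustrative credentials
# Only send Authorization after the server has actually asked for it.
if ($status == 401 && ($headers{'www-authenticate'} || '') =~ /^Basic/i) {
    my $credentials = encode_base64("$user:$password", "");   # no trailing newline
    print F "Authorization: Basic $credentials\n";
}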
What does your client do when the server is down?
When the server is down, there are several options. The most obvious option is for the client to attempt the HTTP request at a later time. Other options are to try an alternate server or to abort the transaction. The programmer should give the user some configuration options for the client's actions.
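A simple retry policy built around the open_TCP( ) call used in this chapter might look like this sketch (the three attempts and 30-second pause are arbitrary choices):

# Try the connection a few times before giving up.
my $connected = 0;
for my $attempt (1 .. 3) {
    if (defined open_TCP('F', $the_url[1], $the_url[2])) {
        $connected = 1;
        last;
    }
    warn "attempt $attempt failed, retrying in 30 seconds\n";
    sleep 30;
}
die "could not reach $the_url[1]\n" unless $connected;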
What does your client do when the server response time is long?
For simple applications, it may be better to allow the user to interrupt the application. For user-friendly or unattended batch applications, it is desirable to time out the connection and notify the user.
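On UNIX, a common way to impose such a timeout from Perl is alarm( ) with a SIGALRM handler; a minimal sketch, assuming the connection is open on filehandle F and using an arbitrary 60-second limit:

# Give up if the server takes more than 60 seconds to answer.
my $the_response;
eval {
    local $SIG{ALRM} = sub { die "timeout\n" };
    alarm(60);
    $the_response = <F>;      # read the HTTP response line
    alarm(0);
};
if ($@ && $@ eq "timeout\n") {
    warn "server did not respond within 60 seconds\n";
}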
What does your client do when the server has a higher version of HTTP?
And what happens when the client doesn't understand the response? The most logical thing is to attempt to talk on a common denominator. Chances are that just about anything will understand HTTP/1.0, if that's what you feel comfortable using. In most cases, if the client doesn't understand the response, it would be nice to tell the user, or at least let the user know to get a client that understands the latest version of HTTP!
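At a minimum, a client can notice the mismatch by checking the version number on the response line; a small sketch along those lines, assuming the response line is in $the_response:

# Warn if the server speaks a newer major version of HTTP than we do.
if ($the_response =~ m@^HTTP/(\d+)\.(\d+)@) {
    my ($major, $minor) = ($1, $2);
    if ($major > 1) {
        warn "server speaks HTTP/$major.$minor; this client only knows HTTP/1.0\n";
    }
} else {
    warn "unrecognized response: $the_response";
}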