Web Client Programming with Perl

Chapter 2: Demystifying the Browser

Before you start writing your own web programs, you have to become comfortable with the fact that your web browser is just another client. Lots of complex things are happening: user interface processing, network communication, operating system interaction, and HTML/graphics rendering. But all of that is gravy; without actually negotiating with web servers and retrieving documents via HTTP, the browser would be as useless as a TV without a tuner.

HTTP may sound intimidating, but it isn't as bad as you might think. Like most other Internet protocols, HTTP is text-based. If you were to look at the communication between your web browser and a web server, you would see text, and lots of it. After a few minutes of sifting through it all, you'd find out that HTTP isn't too hard to read. By the end of this chapter, you'll be able to read HTTP and have a fairly good idea of what's going on during typical everyday transactions over the Web.

The best way to understand how HTTP works is to see it in action. You actually see it in action every day, with every click of a hyperlink; it's just that the gory details are hidden from you. In this chapter, you'll see some common web transactions: retrieving a page, submitting a form, and publishing a web page. In each example, the HTTP for each transaction is printed as well. From there, you'll be able to analyze and understand how your actions with the browser are translated into HTTP. You'll learn a little bit about how HTTP is spoken between a web client and server.

After you've seen bits and pieces of HTTP in this chapter, Chapter 3, Learning HTTP, introduces HTTP in a more thorough manner. In Chapter 3, you'll see all the different ways that a client can request something, and all the ways a server can reply. In the end, you'll get a feel for what is possible under HTTP.

Behind the Scenes of a Simple Document

Let's begin by visiting a hypothetical web server at http://hypothetical.ora.com/. Its imaginary (and intentionally sparse) web page appears in Figure 2-1.

Figure 2-1. A hypothetical web page

This is something you probably do every day: request a URL and then view it in your browser. But what actually happened in order for this document to appear in your browser?

The Browser's Request

Your browser first takes in a URL and parses it. In this example, the browser is given the following URL:

http://hypothetical.ora.com/

The browser interprets the URL as follows:

http://

In the first part of the URL, you told the browser to use HTTP, the Hypertext Transfer Protocol.

hypothetical.ora.com

In the next part, you told the browser to contact a computer over the network with the hostname of hypothetical.ora.com.

/

Anything after the hostname is regarded as a document path. In this example, the document path is /.
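As a rough illustration of the parsing step described above, the following Python sketch splits the example URL into the same three pieces and fills in port 80 when the URL names no port. The function name split_url and the DEFAULT_PORTS table are made up for this example; a real browser's URL handling is considerably more involved.

```python
from urllib.parse import urlsplit

# Assumption for this sketch: only HTTP is handled, and its default port is 80.
DEFAULT_PORTS = {"http": 80}


def split_url(url):
    """Break a URL into the pieces needed to open a connection."""
    parts = urlsplit(url)
    # parts.port is None when the URL does not name a port explicitly.
    port = parts.port if parts.port is not None else DEFAULT_PORTS[parts.scheme]
    # Treat an empty path as the document root.
    path = parts.path or "/"
    return parts.scheme, parts.hostname, port, path


print(split_url("http://hypothetical.ora.com/"))
```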

So the browser connects to hypothetical.ora.com using the HTTP protocol. Since no port was specified, it assumes port 80, the default port for HTTP. The message that the browser sends to the server at port 80 is:

GET / HTTP/1.0
Connection: Keep-Alive
User-Agent: Mozilla/3.0Gold (WinNT; I)
Host: hypothetical.ora.com
Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, */*

Let's look at what these lines are saying:

1. The first line of this request (GET / HTTP/1.0) requests a document at / from the server. HTTP/1.0 is given as the version of the HTTP protocol that the browser uses.

2. The second line tells the server to keep the TCP connection open until explicitly told to disconnect. If this header is not provided, the server has no obligation to stick around under HTTP 1.0, and disconnects after responding to the client's request. The behavior of the client and server depends on what version of HTTP is spoken. (See the discussion of persistent connections in Chapter 3 for the full scoop.)

3. In the third line, beginning with the string User-Agent, the client identifies itself as Mozilla (Netscape) version 3.0, running on Windows NT.

4. The fourth line tells the server what the client thinks the server's hostname is. Since the server may have multiple hostnames, the client indicates which hostname was used. In this environment, a web server can have a different document tree for each hostname it owns. If the client hasn't specified the server's hostname, the server may be unable to determine which document tree to use.

5. The fifth line tells the server what kind of documents are accepted by the browser. This is discussed more in the section "Media Types" in Chapter 3.

Together, these five lines constitute a request. Lines 2 through 5 are request headers.
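To make the request line/header split concrete, here is a small Python sketch that takes the request shown above apart. The parse_request helper is hypothetical, and the sketch assumes lines end in CRLF and that the header block ends at the first blank line; it does no error handling.

```python
# The request from the text, with explicit CRLF line endings.
REQUEST = (
    "GET / HTTP/1.0\r\n"
    "Connection: Keep-Alive\r\n"
    "User-Agent: Mozilla/3.0Gold (WinNT; I)\r\n"
    "Host: hypothetical.ora.com\r\n"
    "Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, */*\r\n"
    "\r\n"
)


def parse_request(text):
    """Split a request into (method, path, version) and a dict of headers."""
    head = text.split("\r\n\r\n", 1)[0]
    lines = head.split("\r\n")
    # Line 1 is the request line: method, document path, protocol version.
    method, path, version = lines[0].split(" ", 2)
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        # Header names are case-insensitive, so store them in lowercase.
        headers[name.strip().lower()] = value.strip()
    return (method, path, version), headers


request_line, headers = parse_request(REQUEST)
print(request_line)
print(headers["host"])
```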

The Server's Response

Given a request like the one previously shown, the server looks for the file associated with "/" and returns it to the browser, preceding it with some "header information":

HTTP/1.0 200 OK
Date: Fri, 04 Oct 1996 14:31:51 GMT
Server: Apache/1.1.1
Content-type: text/html
Content-length: 327
Last-modified: Fri, 04 Oct 1996 14:06:11 GMT

<title>Sample Homepage</title>
<img src="/images/oreilly_mast.gif">
<h1>Welcome</h1>
Hi there, this is a simple web page. Granted, it may not be as elegant
as some other web pages you've seen on the net, but there are
some common qualities:
<ul>
<li> An image,
<li> Text,
<li> and a <a href="/example2.html"> hyperlink </a>
</ul>

If you look at this response, you'll see that it begins with a series of lines that specify information about the document and about the server itself. Then, after a blank line, it returns the document. The series of lines before the first blank line is called the response header, and the part after the first blank line is called the body, entity, or entity-body. Let's look at the header information:

1. The first line, HTTP/1.0 200 OK, tells the client what version of the HTTP protocol the server uses. But more importantly, it says that the document has been found and is going to be transmitted.

2. The second line indicates the current date on the server. The time is expressed in Greenwich Mean Time (GMT).

3. The third line tells the client what kind of software the server is running. In this case, the server is Apache version 1.1.1.

4. The fourth line (Content-type) tells the browser the type of the document. In this case, it is HTML.

5. The fifth line tells the client how many bytes are in the entity body that follows the headers. In this case, the entity body is 327 bytes long.

6. The sixth line specifies the most recent modification time of the document requested by the client. This modification time is often used for caching purposes, so a browser may not need to request the entire HTML file again if its modification time doesn't change.

After all that, a blank line and the document text follow.
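The header/body split described here can be sketched in a few lines of Python. This is only an illustration: parse_response is a hypothetical helper, and the sample body is shortened to a single line, so its Content-length is 30 rather than the 327 of the full page.

```python
# A shortened stand-in for the response in the text (body trimmed to one line).
RESPONSE = (
    b"HTTP/1.0 200 OK\r\n"
    b"Date: Fri, 04 Oct 1996 14:31:51 GMT\r\n"
    b"Server: Apache/1.1.1\r\n"
    b"Content-type: text/html\r\n"
    b"Content-length: 30\r\n"
    b"\r\n"
    b"<title>Sample Homepage</title>"
)


def parse_response(raw):
    """Return (version, status code, reason, headers, body) from raw bytes."""
    # Everything before the first blank line is the response header.
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("latin-1").split("\r\n")
    version, code, reason = lines[0].split(" ", 2)
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return version, int(code), reason, headers, body


version, code, reason, headers, body = parse_response(RESPONSE)
print(version, code, reason)
print(headers["content-type"], len(body))
```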

Figure 2-2 shows the transaction.

Figure 2-2. A simple transaction

Parsing the HTML

The document is in HTML (as promised in the Content-type line). The browser retrieves the document and then formats it as needed; for example, each <li> item between the <ul> and </ul> is printed as a bullet and indented, the <img> tag displays a graphic on the screen, etc.

And while we're on the topic of the <img> tag, how did that graphic get on the screen? While parsing the HTML file, the browser sees:

<img src="/images/oreilly_mast.gif">

and figures out that it needs the data for the image as well. Your browser then sends a second request, such as this one, through its connection to the web server:

GET /images/oreilly_mast.gif HTTP/1.0

The server responds with:

HTTP/1.0 200 OK
Date: Fri, 04 Oct 1996 14:32:01 GMT
Content-type: image/gif
Content-length: 9487
Last-modified: Tue, 31 Oct 1995 00:03:15 GMT

[data of GIF file]

Figure 2-3 shows the complete transaction, with the image requested as well as the original document.

Figure 2-3. Simple transaction with embedded image

There are a few differences between this request/response pair and the previous one. Based on the <img> tag, the browser knows where the image is stored on the server. From <img src="/images/oreilly_mast.gif">, the browser requests a document at a different location than "/":

GET /images/oreilly_mast.gif HTTP/1.0

The server's response is basically the same, except that the content type is different:

Content-type: image/gif

From the declared content type, the browser knows what kind of image it will receive and can render it as required. The browser shouldn't guess the content type based on the document path; it is up to the server to tell the client.
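A minimal sketch of that rule, assuming the response headers have already been collected into a dictionary keyed by lowercase header name. The function name choose_handler and the labels it returns are hypothetical; the point is only that the decision is driven by the Content-type header, never by the document path.

```python
def choose_handler(headers):
    """Pick a rendering step from the Content-type header alone."""
    # Ignore any parameters after ';' and compare case-insensitively.
    content_type = headers.get("content-type", "").split(";")[0].strip().lower()
    if content_type == "text/html":
        return "parse as HTML"
    if content_type.startswith("image/"):
        return "decode as image"
    return "unknown type"


print(choose_handler({"content-type": "image/gif"}))
print(choose_handler({"content-type": "text/html"}))
```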

The important thing to note here is that the HTML formatting and image rendering are done at the browser end. All the server does is return documents; the browser is responsible for how they look to the user.
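The step in which the browser scans the HTML for things it still needs to fetch can be approximated with Python's standard html.parser module. This is a sketch, not a description of how any real browser is built: the ResourceFinder class is invented for the example, and the PAGE string is the sample page with its <img> tag placed near the top as an assumption.

```python
from html.parser import HTMLParser

# The sample page; the position of the <img> tag is assumed for this sketch.
PAGE = """<title>Sample Homepage</title>
<img src="/images/oreilly_mast.gif">
<h1>Welcome</h1>
<ul>
<li> An image,
<li> Text,
<li> and a <a href="/example2.html"> hyperlink </a>
</ul>
"""


class ResourceFinder(HTMLParser):
    """Collect image sources and hyperlink targets while parsing."""

    def __init__(self):
        super().__init__()
        self.images = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img" and "src" in attrs:
            # Each image requires a further request to the server.
            self.images.append(attrs["src"])
        elif tag == "a" and "href" in attrs:
            # Links are only fetched if the user clicks on them.
            self.links.append(attrs["href"])


finder = ResourceFinder()
finder.feed(PAGE)
print(finder.images)
print(finder.links)
```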

Clicking on a Hyperlink

When you click on a hyperlink, the client and server go through something similar to what happened when we visited http://hypothetical.ora.com/. For example, when you click on the hyperlink from the previous example, the browser looks at its associated HTML:

<a href="/example2.html"> hyperlink </a>

From there, it knows that the next location to retrieve is /example2.html. The browser then sends the following to hypothetical.ora.com:

GET /example2.html HTTP/1.0

The server responds with:

HTTP/1.0 200 OK
Date: Fri, 04 Oct 1996 14:32:14 GMT
Content-length: 431
Last-modified: Thu, 03 Oct 1996 08:39:45 GMT

[HTML data]

And the browser displays the new HTML page on the user's screen.
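How the browser gets from the href in the link to the request line can be sketched with Python's urllib.parse. The variable names are illustrative only; the sketch assumes the link is resolved against the URL of the page it appeared on.

```python
from urllib.parse import urljoin, urlsplit

current_page = "http://hypothetical.ora.com/"
href = "/example2.html"

# Resolve the link against the page it appeared on.
target = urljoin(current_page, href)
parts = urlsplit(target)

# The host says where to connect; the path goes into the request line.
request_line = f"GET {parts.path} HTTP/1.0"

print(target)
print(parts.hostname)
print(request_line)
```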

Retrieving a Document Manually

Now that you see what a browser does, it's time for the most empowering statement in this book: There's nothing in these transactions that you can't do yourself. And you don't need to write a program; you can just do it by hand, using the standard telnet client and a little knowledge of HTTP.

Telnet to www.ora.com at port 80. From a UNIX shell prompt:[1]

% telnet www.ora.com 80
Trying 198.112.208.23...
Connected to www.ora.com.
Escape character is '^]'.

(The second argument for telnet specifies the port number to use. By default, telnet uses port 23. Most web servers use port 80. If you are behind a firewall, you may have problems accessing www.ora.com directly from your machine. Replace www.ora.com with the hostname of a web server inside your firewall for the same effect.)

Now type in a GET command[2] for the document root:

GET / HTTP/1.0

Press ENTER twice, and you receive what a browser would receive:

HTTP/1.0 200 OK
Server: WN/1.15.1
Date: Mon, 30 Sep 1996 14:14:20 GMT
Last-modified: Fri, 20 Sep 1996 17:04:18 GMT
Title: O'Reilly & Associates
Link: <mailto:webmaster@ora.com>; rev="Made"

<HTML>
<HEAD>

When the document is finished, your shell prompt should return. The server has closed the connection.

Congratulations! What you've just done is simulate the behavior of a web client.
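The same manual exchange can be approximated in a short program. The Python sketch below is one possible way to do it with the standard socket module: build_request and fetch are hypothetical names, and the code assumes a plain HTTP/1.0 server that closes the connection once the document has been sent.

```python
import socket


def build_request(path="/"):
    """Return what was typed in the telnet session: a request line, then a blank line."""
    return f"GET {path} HTTP/1.0\r\n\r\n".encode("ascii")


def fetch(host, port=80, path="/"):
    """Send one HTTP/1.0 request and read until the server closes the connection."""
    with socket.create_connection((host, port), timeout=10) as sock:
        sock.sendall(build_request(path))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # an empty read means the server closed the connection
                break
            chunks.append(data)
    return b"".join(chunks)


print(build_request("/"))
```

Calling fetch("www.ora.com") would mirror the telnet session above, although what that host returns today may differ from the 1996 transcript. As with telnet, a machine behind a firewall may need the hostname of an internal web server instead.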

Behind the Scenes of an HTML Form

You've probably seen fill-out forms on the Web, in which you enter information into your browser and submit the form. Common uses for forms are guestbooks, accessing databases, or specifying keywords for a search engine.
