About the Plug-in
This plug-in takes the URL of a web page and parses it looking only for <a href links, and returns all that it finds in an array. It takes a single argument:
• $page A web page URL, including the http:// preface and domain name
Variables, Arrays, and Functions
$contents String containing the HTML contents of $page
$urls Array holding the discovered URLs
$dom Document object of $contents
$xpath XPath object for traversing $dom
$hrefs Object containing all href link elements in $dom
$j Integer loop counter for iterating through $hrefs
PIPHP_RelToAbsURL() Function to convert relative URLs to absolute
How It Works
This plug-in first reads the contents of $page into the string $contents (returning NULL if there's an error). Then it creates a new Document Object Model (DOM) of $contents in $dom using the loadhtml() method. The statement is prefaced with an @ character to suppress any warning or error messages. Even poorly formatted HTML is generally usable with this method, making the URLs easy to extract and parse.
Then a new XPath object is created in $xpath with which to search $dom for all instances of href elements, and all those discovered are then placed in the $hrefs object.
Next a for loop is used to iterate through the $hrefs object and extract all the attributes, which in this case are the links we want. Prior to storing the URLs in $urls, each one is passed through the PIPHP_RelToAbsURL() function to ensure they are converted to absolute URLs (if not already).
Once extracted, the links are then returned as an array.
Figure 5-2 Using this plug-in you can extract and return all the links in a web page.
Note that this plug-in makes use of plug-in 21, PIPHP_RelToAbsURL(), and so it must also be pasted into (or included by) your program.
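Plug-in 21 itself is not listed in this section. If you just want to try out the code below before pasting in the real thing, the following is a minimal stand-in sketch, assuming only that the function takes the base page URL and a link and returns an absolute URL; it is not the book's actual implementation and it ignores edge cases such as ../ paths and protocol-relative links:

function PIPHP_RelToAbsURL($page, $url) {
   // Leave already-absolute URLs unchanged
   if (preg_match('/^https?:\/\//i', $url)) return $url;

   // Break the base page URL into its parts
   $parts  = parse_url($page);
   $scheme = isset($parts['scheme']) ? $parts['scheme'] : 'http';
   $host   = isset($parts['host'])   ? $parts['host']   : '';

   // Root-relative link: just prepend the scheme and host
   if (substr($url, 0, 1) == '/') return "$scheme://$host$url";

   // Otherwise resolve against the directory of the base page
   $path = isset($parts['path']) ? $parts['path'] : '/';
   if (substr($path, -1) == '/') $dir = $path;
   else $dir = rtrim(dirname($path), '/\\') . '/';

   return "$scheme://$host$dir$url";
}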
The Plug-in
function PIPHP_GetLinksFromURL($page) {
   // Fetch the page, suppressing warnings; give up on failure
   $contents = @file_get_contents($page);
   if (!$contents) return NULL;

   $urls  = array();
   $dom   = new domdocument();
   @$dom->loadhtml($contents);
   $xpath = new domxpath($dom);

   // Collect every <a> element in the document body
   $hrefs = $xpath->evaluate("/html/body//a");

   // Store each link, converted to an absolute URL if necessary
   for ($j = 0 ; $j < $hrefs->length ; $j++)
      $urls[$j] = PIPHP_RelToAbsURL($page,
         $hrefs->item($j)->getAttribute('href'));

   return $urls;
}
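For quick reference, the plug-in can be called along these lines (the URL shown is only a placeholder):

$links = PIPHP_GetLinksFromURL("http://myserver.com");

if (is_null($links)) echo "Could not load the page.";
else foreach ($links as $link) echo "$link<br />";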
Check Links
The two previous plug-ins provide the foundation for being able to crawl the Internet (see the sketch after this list) by:
• Reading in a third-party web page
• Extracting all URLs from the page
• Converting all the URLs to absolute
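To see how these pieces fit together, here is a rough sketch of a single-level crawl, assuming both earlier plug-ins are pasted in or included; the starting URL is a placeholder:

$start = "http://myserver.com";              // Placeholder starting URL
$found = PIPHP_GetLinksFromURL($start);
if (is_null($found)) exit("Could not load $start");

foreach ($found as $url) {
   echo "<b>$url</b><br />";

   // Fetch and list the links on each page found at the first level
   $links = PIPHP_GetLinksFromURL($url);
   if (is_null($links)) continue;            // Skip pages that fail to load

   foreach ($links as $link) echo "- $link<br />";
}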
Armed with these abilities, it's now a simple matter for this plug-in to offer the facility to check all links on a web page and test whether the pages they refer to actually load or not. This is a great way to alleviate the frustration of your users upon encountering dead links or mistyped URLs. Figure 5-3 shows this plug-in being used to check the links on the alexa.com home page.
About the Plug-in
This plug-in takes the URL of a web page (yours or a third party's) and then tests all the links found within it to see whether they resolve to valid pages. It takes these three arguments:
• $page A web page URL, including the http:// preface and domain name
• $timeout The number of seconds to wait for a web page before considering it unavailable
• $runtime The maximum number of seconds your script should run before timing out
Variables, Arrays, and Functions
$contents String containing the HTML contents of $page
$checked Array of URLs that have been checked
$failed Array of URLs that could not be retrieved
$fail Integer containing the number of failed URLs
$urls Array of URLs extracted from $page
$context Stream context to set the URL load timeout
PIPHP_GetLinksFromURL() Function to retrieve all links from a page
PIPHP_RelToAbsURL() Function to convert relative URLs to absolute
How It Works
The first thing this plug-in does is set the maximum execution time of the script using the ini_set() function. This is necessary because crawling a set of web pages can take a considerable time, so I recommend you experiment with maximums of up to 180 seconds or more. If the script ends without returning anything, try increasing the value.
Figure 5-3 The plug-in has been run on the alexa.com home page, with all URLs reported present and correct.
The contents of $page are then loaded into $contents. After this, two arrays are initialized. The first, $checked, will contain all the URLs that have been checked so that, where a page links to another more than once, a second check is not made for that URL. The second, $failed, will hold any URLs that cannot be retrieved. All the links in $page are then extracted into $urls using PIPHP_GetLinksFromURL(), and a stream context is created in $context to apply the $timeout value to each request.
A loop then works through every URL in $urls. Each URL not already checked is added to the $checked array and the file_get_contents() function is called (with the $context object) to attempt to fetch the first 256 bytes of the web page. If that fails, the URL is added to the $failed array and $fail is incremented.
Once the loop has completed, an array is returned with the first element containing 0 if there were no failed URLs. Otherwise, it contains the number of failures, while the second element contains an array listing all the failed URLs.
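As a side note, the timeout and partial-fetch mechanism used inside the loop can be tried on its own with a snippet such as this; the URL and the two-second timeout are placeholder values only:

// Requests taking longer than the timeout (2 seconds here) return FALSE
$context = stream_context_create(array('http' => array('timeout' => 2)));

// Fetch at most the first 256 bytes of the page
$head = @file_get_contents("http://example.com/", 0, $context, 0, 256);

if ($head === FALSE) echo "Link could not be retrieved.";
else echo "Fetched " . strlen($head) . " bytes.";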
How to Use It
To check all the links on a web page, call the function using code such as this:
$page = "http://myserver.com";
$result = PIPHP_CheckLinks($page, 2, 180);
To then view or otherwise use the returned values, use code such as the following, which either displays a success message or lists the failed URLs:
if ($result[0] == 0) echo "All URLs successfully accessed.";
else for ($j = 0 ; $j < $result[0] ; ++$j)
   echo $result[1][$j] . "<br />";
Because this plug-in makes use of plug-in 22, PIPHP_GetLinksFromURL(), which itself relies on plug-in 21, PIPHP_RelToAbsURL(), you must ensure you have copied both of them into your program file, or that they are included by it.
TIP Because crawling like this can take time, when nothing is displayed to the screen you may wonder whether your program is actually working. So, if you wish to view the plug-in's progress, you can uncomment the line shown to have each URL displayed as it's processed.
The Plug-in
function PIPHP_CheckLinks($page, $timeout, $runtime) {
   ini_set('max_execution_time', $runtime);

   $contents = @file_get_contents($page);
   if (!$contents) return array(1, array($page));

   $checked = array();
   $failed  = array();
   $fail    = 0;
   $urls    = PIPHP_GetLinksFromURL($page);

   // Apply the timeout to every URL fetched below
   $context = stream_context_create(array('http' =>
      array('timeout' => $timeout)));

   for ($j = 0 ; $j < count($urls); $j++) {
      if (!in_array($urls[$j], $checked)) {
         $checked[] = $urls[$j];

         // Uncomment the following line to view progress
         // echo " $urls[$j]<br />\n"; ob_flush(); flush();

         // Fetch just the first 256 bytes; record the URL on failure
         if (!@file_get_contents($urls[$j], 0, $context, 0, 256))
            $failed[$fail++] = $urls[$j];
      }
   }

   return array($fail, $failed);
}
Directory List
When you need to know the contents of a directory on your server—for example, because you support file uploads and need to keep tabs on them—this plug-in returns all the filenames using a single function call. Figure 5-4 shows the plug-in in action.
Figure 5-4 Using the Directory List plug-in under Windows to return the contents of Zend Server CE's document root
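As a rough idea of what such a function involves, here is a minimal sketch built around PHP's scandir() function. The helper name and the path in the example call are placeholders only, and this is not the book's actual plug-in:

// Hypothetical helper, not the book's plug-in: return the filenames in $path,
// skipping the "." and ".." entries
function DirectoryListSketch($path) {
   $files   = array();
   $entries = @scandir($path);
   if (!$entries) return $files;   // Empty array if the path can't be read

   foreach ($entries as $file)
      if ($file != "." && $file != "..") $files[] = $file;

   return $files;
}

// Example call with a placeholder path
foreach (DirectoryListSketch("C:/Program Files/Zend/Apache2/htdocs") as $file)
   echo "$file<br />";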