Curl Get Contents
Some web sites don't like to be accessed by anything other than a web browser, which can make it difficult to fetch data from them with a PHP program using a function such as file_get_contents(). Such sites generally block your program by checking for a User Agent string, which is something all browsers send to the web sites they visit and which can vary widely. They look something like this:
Mozilla/5.0 (Windows; U; Windows NT 6.1; en-GB; rv:1.9.1) Gecko/20090624 Firefox/3.5 (.NET CLR 3.5.30729)
Therefore, to access these sites it is necessary to simulate being a browser, which, as shown in Figure 10-2, this plug-in will do for you.
About the Plug-in
This plug-in is intended to replace the PHP file_get_contents() function when used to fetch a web page. It accepts the URL of a page and a browser User Agent to emulate, and on success it returns the contents of the page at the given URL. On failure, it returns FALSE. It requires these arguments:
• $url The URL to fetch
• $agent The User Agent string of a browser
FIGURE 10-2 This plug-in is used to fetch and display the www.pluginphp.com home page.
Variables, Arrays, and Functions
• $ch CURL handle to an opened curl_init() session
• $result The returned result from the curl_exec() call
How It Works
This plug-in uses the Mod CURL (Client URL) library extension to PHP. If it fails, then you need to read your server and/or PHP installation instructions, or consult your server administrator about enabling Mod CURL. What the plug-in does is open a session with curl_init(), passing a handle for the session to $ch. Such a session can perform a wide range of URL-related tasks.
But first the plug-in uses curl_setopt() to set up the various options required prior to making the curl_exec() call. These include setting CURLOPT_URL to the value of $url and CURLOPT_USERAGENT to the value of $agent. Additionally, a number of other options are set to sensible values.
The curl_exec() function is then called, with the result of the call being placed in $result. The session is then closed with a call to curl_close(), and the value in $result is returned.
How to Use It
Using this plug-in is as easy as replacing calls to file_get_contents() with PIPHP_CurlGetContents(). As long as you have also passed a sensible-looking User Agent string, the plug-in will then be able to return some pages that could not be retrieved using the former function call. For example, you can load in and display the contents of a web page like this:
$agent = 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-GB; ' .
         'rv:1.9.1) Gecko/20090624 Firefox/3.5 (.NET CLR ' .
         '3.5.30729)';
$url = 'http://pluginphp.com';
echo PIPHP_CurlGetContents($url, $agent);
This will display the main page of the www.pluginphp.com web site, which should look like Figure 10-2. There's a comprehensive explanation (and collection) of User Agent strings at www.useragentstring.com.
CAUTION Sometimes the reason a web site only allows browser access to a web page is that other programs are not permitted to access it. So please check how you are allowed to access information from such a web site, and what you are allowed to do with it, before using this plug-in.
The Plug-in
function PIPHP_CurlGetContents($url, $agent) {
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_USERAGENT, $agent);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_ENCODING, "gzip");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_FAILONERROR, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 8);
curl_setopt($ch, CURLOPT_TIMEOUT, 8);
$result = curl_exec($ch);
curl_close($ch);
return $result;
}
Fetch Wiki Page
Wikipedia is an excellent resource with several million articles. Even if you take into account that some of the information may not always be correct, due to any user being able to edit a page, on the whole most of the web site is factual, and it contains a summary of almost the whole depth and breadth of human knowledge.
What's even better is that Wikipedia is published under the GNU Free Documentation License (see www.gnu.org/copyleft/fdl.html). Essentially this means that you can use any text from it as long as you give full attribution of the source, and also offer the text (with any amendments) under the same license. As a consequence, I now have the entire Wikipedia database stored in my iPhone so that I can instantly look up any entry, even when mobile connectivity is limited. By using data compression techniques, and keeping only the main article text, it takes up just 2GB of space.
The GFDL license also means you can use programs such as this plug-in to reformat and reuse articles from Wikipedia, as shown in Figure 10-3, in which just the text has been extracted from its article on PHP.
FIGURE 10-3 Using this plug-in, you can extract just the text from a Wikipedia entry.
If you also take a look at Figure 10-4, you'll see the original article at Wikipedia and, comparing the two, you'll notice that the plug-in has completely ignored all the formatting, graphics, tables, and other extras, leaving behind just the text of the article.
Using it you could create your own reduced-size local copy of Wikipedia, or perhaps use it to add hyperlinks to words or terms you wish to explain to your readers. I have used this code to add short encyclopedia entries to searches returned by a customized Google search engine I wrote.
Combined with other plug-ins from this book, you could reformat articles into RSS feeds, translate them into "friendly" text, or, well, once you have access to the Wikipedia text, it's really only up to your imagination what you choose to do with it.
About the Plug-in
This plug-in takes the title of a Wikipedia entry and returns just the text of the article, or on failure it returns FALSE. It requires this argument:
• $entry A Wikipedia article title
FIGURE 10-4 The original article about PHP on the Wikipedia web site.
Variables, Arrays, and Functions
• $agent String containing a browser User Agent string
• $url String containing the URL of Wikipedia's XML export API
• $page String containing the result of fetching the Wikipedia entry
• $xml SimpleXML object created from $page
• $title String containing the article title as returned by Wikipedia
• $text String containing the article text
• $sections Array of four section headings at which to truncate the text
• $section String containing each element of $sections in turn
• $ptr Integer offset into $text indicating the start of $section
• $data Array of search and replace strings for converting raw Wikipedia data
• $j Integer loop counter for processing search and replace actions
• $url String containing the URL of the original Wikipedia article
How It Works
Wikipedia has kindly created an API with which you can export selected articles from its database. You can access it at:
http://en.wikipedia.org/wiki/Special:Export
Unfortunately, they have set this API to deny access to programs that do not present it with a browser User Agent string. Luckily, the previous plug-in provides just that functionality, so using it, along with this plug-in, it's possible to export any Wikipedia page as XML, which can then be transformed into just the raw text.
This is done by setting up a browser User Agent string and then calling the Export API using PIPHP_CurlGetContents(), passing the Export API URL along with the article title and the browser agent. Before making the call, though, $entry is passed through the rawurlencode() function to convert non-URL-compatible characters into acceptable equivalents, such as spaces into %20 codes.
The XML page returned from this call is then parsed into an XML object using the simplexml_load_string() function, the result being placed in $xml. Next, the only two items of information that are required, the article title and its text, are extracted from $xml->page->title and $xml->page->text into $title and $text.
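To make these steps concrete, here is a minimal sketch of the fetch, parse, and extraction just described. It assumes the previous plug-in, PIPHP_CurlGetContents(), is available, that the article title can simply be appended to the Special:Export URL, and that the title and text sit at the element paths mentioned above; the real plug-in and the export XML layout may differ.

$agent = 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-GB; rv:1.9.1) ' .
         'Gecko/20090624 Firefox/3.5';
$entry = 'PHP';

// Encode the title so spaces and other characters are URL safe,
// then fetch the export XML while posing as a browser
$url   = 'http://en.wikipedia.org/wiki/Special:Export/' .
         rawurlencode($entry);
$page  = PIPHP_CurlGetContents($url, $agent);

// Parse the XML and pull out the title and raw wiki text
// (element paths as described above; adjust if the export
// schema places the text elsewhere)
$xml   = simplexml_load_string($page);
$title = (string) $xml->page->title;
$text  = (string) $xml->page->text;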
Notice that all of this occurs inside a while loop. This is because by far the majority of Wikipedia articles are redirects from misspellings or different capitalizations. What the loop does is look for the string #REDIRECT in a response and, if one is discovered, the loop goes around again using the redirected article title, which is placed in $entry by using preg_match() to extract it from between a pair of double square brackets. The loop can handle multiple redirects, which are not as infrequent as you might think, given Wikipedia's age and the number of times many articles have been moved by now.
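Continuing the earlier sketch, the redirect handling could look something like this; the loop structure and the regular expression are assumptions made for illustration, not the plug-in's actual code:

while (stripos($text, '#REDIRECT') !== FALSE &&
       preg_match('/\[\[(.+?)\]\]/', $text, $match))
{
    // Follow the redirect: the real title sits between [[ and ]]
    $entry = $match[1];
    $url   = 'http://en.wikipedia.org/wiki/Special:Export/' .
             rawurlencode($entry);
    $page  = PIPHP_CurlGetContents($url, $agent);
    $xml   = simplexml_load_string($page);
    $text  = (string) $xml->page->text;  // refresh before re-testing
}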
So, with the raw Wikipedia text now loaded into $text, the next section truncates the string at whichever of the five headings References, See Also, External Links, Notes, or Further Reading (if any) appears first, because those entries are not part of the main article and are to be ignored. This is done by using a foreach loop to iterate through the headings, which are enclosed by pairs of == symbols, Wikipedia's markup to indicate an <h2> heading. Because some Wikipedia authors use spaces inside the ==, both cases (with and without spaces) are tested. Each heading in turn is searched for using the stripos() function and, if a heading is found in $text, $ptr will point to its start so that $text is then truncated to end at that position.
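As a rough sketch of that truncation step (the heading list and the exact markup strings tested are assumptions; the plug-in's own array may differ):

$sections = array('References', 'See Also', 'External Links',
                  'Notes', 'Further Reading');

foreach ($sections as $section)
{
    // Test both the "==Heading==" and "== Heading ==" spellings
    foreach (array("==$section==", "== $section ==") as $heading)
    {
        $ptr = stripos($text, $heading);
        if ($ptr !== FALSE) $text = substr($text, 0, $ptr);
    }
}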