Now that $text has the raw article we want, it’s time to convert Wikipedia’s special markup into the text and basic HTML this plug-in supports. Before writing this plug-in, I spent hours searching for other code that already did the job. While there were a few examples, they were all quite long-winded and seemed overly complicated, which is why I chose to write my own code.
In the end, it turned out that fewer than a couple of dozen rules were enough to make sense of most of Wikipedia’s markup. For example, you’ve already seen how ==Heading== stands for <h2>Heading</h2>. Similarly, ===Subheading=== stands for <h3>Subheading</h3>, and so on. Meanwhile, '''word''' (three single quotes on either side of some text) stands for <b>word</b>, and ''word'' (two single quotes on either side of some text) stands for <i>word</i>. Ordered and unordered lists are indicated by starting a new line with a # or a * symbol for each item, so for simplicity, I chose to convert both into the HTML bullet entity, ●, and to treat nested lists as if they were all on the same level.
Tables begin by starting a new line with a { symbol, so the code ignores everything from \n{ up to a closing } symbol, and double newlines, \n\n, are converted into <p> tags. There’s also some more complicated markup such as [[Article]], meaning “Place a hyperlink here to Wikipedia’s article entitled Article,” or [[Article|Look at this]], which means “Add a hyperlink to Wikipedia’s article entitled Article here, but display the hyperlink text Look at this.” A few more variations on this theme exist, plus there are several types of markup I chose to ignore, such as [[Image…]], [[File…]], and [[Category…]], which contain material supplemental to the main text, and [http…], which contains hyperlinks I didn’t want to use.
What’s more, there are also sections such as <gallery> and <ref>, which I decided should also be ignored, and some major sections appearing within pairs of {{ and }} symbols, which are often nested with sub- and sub-sub-sections. Again, all of these provide richer content for a standard Wikipedia article, but they are not necessary when we simply want the main text.
Therefore, the $data array contains a sequence of regular expressions to be searched for, accompanied by strings with which to replace the matches. Using a for loop, the array is iterated through a pair at a time, passing each pair of strings to the preg_replace() function. If you want to learn more about the regular expressions used, there’s a lot of information at http://en.wikipedia.org/wiki/Regular_expression.
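To see the pair-at-a-time technique in isolation, here is a minimal sketch using only a small, hypothetical subset of the full $data array (the complete set of rules appears in the plug-in listing itself):

```php
<?php
// A small subset of the search/replace pairs: even indexes hold
// regular expressions, odd indexes hold their replacement strings.
$data = array('\'{3}(.*?)\'{3}',   '<b>$1</b>',   // '''bold'''
              '\'{2}(.*?)\'{2}',   '<i>$1</i>',   // ''italic''
              '\n(\*|#)+',         '<br /> ● ',   // list items
              '\n={2}([^=]+)={2}', '<h2>$1</h2>'  // ==Heading==
             );

$text = "\n==History==\n* First item\nThis is '''important''' and ''subtle''.";

// Step through the array two elements at a time, applying each pair.
for ($j = 0 ; $j < count($data) ; $j += 2)
    $text = preg_replace("/$data[$j]/", $data[$j + 1], $text);

echo $text;
?>
```

Note that the bold rule must run before the italic rule, since a pair of quotes would otherwise match inside a triple-quoted sequence.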
Anyway, having massaged the text into almost plain text (with the exception of <h1> through <h7> headings, and the <p>, <br>, <b>, and <i> tags), the strip_tags() function is called to remove any other tags that remain (except those just mentioned).
Finally, before returning the article text, a notice and hyperlink are appended to it, showing the original Wikipedia article from which the text was derived.
In all, I think you’ll find that these rules handle the vast majority of Wikipedia pages very well, although you will encounter the odd page that doesn’t come out quite right. In such cases, you should be able to spot the markup responsible and add a translation for it to the $data array.
If you use this plug-in on a production server, you’ll also need to comply with Wikipedia’s licensing requirements by adding a link to the GNU Free Documentation License, and by indicating that your version of the article is also released under this license. For details, please see http://en.wikipedia.org/wiki/Wikipedia_Copyright.
How to Use It
To use this plug-in, just pass it a Wikipedia article title and you can display the result returned, like this:
$result = PIPHP_FetchWikiPage('Climate Change');
if (!$result) echo "Could not fetch article.";
else echo $result;
Incidentally, I chose this article because it is one of those that returns the previously mentioned #REDIRECT string. In this case, Climate Change is redirected to Climate change (with a lowercase c in the second word), which serves to show that the code correctly handles redirects.
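The redirect format itself is simple, and the extraction is the same test and preg_match() call the plug-in performs inside its while loop (the sample $text below is illustrative):

```php
<?php
// What Wikipedia returns for a redirected title (illustrative sample).
$text = '#REDIRECT [[Climate change]]';

// The same test and extraction the plug-in performs: if the article
// starts with #REDIRECT, pull the target title out of the [[...]].
if (substr($text, 0, 9) == '#REDIRECT')
{
    preg_match('/\[\[(.+)\]\]/', $text, $matches);
    $entry = $matches[1];
}

echo $entry; // Climate change
?>
```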
Because Wikipedia makes use of the UTF-8 character set to enable all the different languages it supports, you may also need to include the following HTML <meta> tag in the <head> section of your HTML output to ensure that all characters display correctly:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
To save on thrashing Wikipedia’s servers, and to cut down on the processing power required on your own, you should definitely consider saving the result from each call to this plug-in, either as a text file or, preferably, in a MySQL database, and then serving up the cached copy whenever future requests are made for the same article.
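A minimal file-based version of that caching idea might look like the following sketch. The function name, the cache directory, and the one-day expiry are all assumptions for illustration, not part of the plug-in:

```php
<?php
// Hypothetical cache wrapper around the plug-in. Assumes a writable
// ./cache directory and a one-day expiry; adjust both to taste.
function CachedWikiPage($entry, $maxage = 86400)
{
    $file = 'cache/' . md5($entry) . '.html';

    // Serve the cached copy if it exists and is still fresh.
    if (file_exists($file) && time() - filemtime($file) < $maxage)
        return file_get_contents($file);

    // Otherwise fetch the article and cache the result on success.
    $result = PIPHP_FetchWikiPage($entry);
    if ($result) file_put_contents($file, $result);

    return $result;
}
?>
```

A MySQL-backed cache would work the same way, with the article title (or its md5() hash) as the key column.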
If you wish to compile your own database of Wikipedia articles using this plug-in, you
can find all the various indexes at http://en.wikipedia.org/wiki/Portal:Contents.
Remember, when you use this plug-in, you must also copy and paste the PIPHP_CurlGetContents() plug-in into your program, or otherwise include it, because it is called by this plug-in.
The Plug-in
function PIPHP_FetchWikiPage($entry)
{
   $agent = 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-GB; ' .
            'rv:1.9.1) Gecko/20090624 Firefox/3.5 (.NET CLR ' .
            '3.5.30729)';
   $text  = '';

   while ($text == '' || substr($text, 0, 9) == '#REDIRECT')
   {
      $entry = rawurlencode($entry);
      $url   = "http://en.wikipedia.org/wiki/Special:Export/$entry";
      $page  = PIPHP_CurlGetContents($url, $agent);
      $xml   = simplexml_load_string($page);
      $title = $xml->page->title;
      $text  = $xml->page->revision->text;

      if (substr($text, 0, 9) == '#REDIRECT')
      {
         preg_match('/\[\[(.+)\]\]/', $text, $matches);
         $entry = $matches[1];
      }
   }

   $sections = array('References', 'See also', 'External links',
                     'Notes', 'Further reading');

   foreach($sections as $section)
   {
      $ptr = stripos($text, "==$section==");
      if ($ptr) $text = substr($text, 0, $ptr);
      $ptr = stripos($text, "== $section ==");
      if ($ptr) $text = substr($text, 0, $ptr);
   }

   $data = array('\[{2}Imag(\[{2})*.*(\]{2})*\]{2}', '',
                 '\[{2}File(\[{2})*.*(\]{2})*\]{2}', '',
                 '\[{2}Cate(\[{2})*.*(\]{2})*\]{2}', '',
                 '\{{2}([^\{\}]+|(?R))*\}{2}',       '',
                 '\'{3}(.*?)\'{3}',                  '<b>$1</b>',
                 '\'{2}(.*?)\'{2}',                  '<i>$1</i>',
                 '\[{2}[^\|\]]+\|([^\]]*)\]{2}',     '$1',
                 '\[{2}(.*?)\]{2}',                  '$1',
                 '\[(http[^\]]+)\]',                 ' ',
                 '\n(\*|#)+',                        '<br /> ● ',
                 '\n:.*?\n',                         '',
                 '\n\{[^\}]+\}',                     '',
                 '\n={7}([^=]+)={7}',                '<h7>$1</h7>',
                 '\n={6}([^=]+)={6}',                '<h6>$1</h6>',
                 '\n={5}([^=]+)={5}',                '<h5>$1</h5>',
                 '\n={4}([^=]+)={4}',                '<h4>$1</h4>',
                 '\n={3}([^=]+)={3}',                '<h3>$1</h3>',
                 '\n={2}([^=]+)={2}',                '<h2>$1</h2>',
                 '\n={1}([^=]+)={1}',                '<h1>$1</h1>',
                 '\n{2}',                            '<p>',
                 '<gallery>([^<]+?)<\/gallery>',     '',
                 '<ref>([^<]+?)<\/ref>',             '',
                 '<ref [^>]+>',                      '');

   for ($j = 0 ; $j < count($data) ; $j += 2)
      $text = preg_replace("/$data[$j]/", $data[$j+1], $text);

   $text  = strip_tags($text, '<h1><h2><h3><h4><h5><h6><h7>' .
                              '<p><br><b><i>');
   $url   = "http://en.wikipedia.org/wiki/$title";
   $text .= "<p>Source: <a href='$url'>Wikipedia ($title)</a>";

   return trim($text);
}
Fetch Flickr Stream
If you enjoy looking at photographs, chances are you have used the Flickr photo-sharing service, and you may also have discovered a few photographers whose Flickr streams you like to follow. Well, now you can offer the same facility to your users with this plug-in.
Using it, you can look up any public Flickr stream and return the (up to) 20 most recent photographs from it. Figure 10-5 shows the result of pointing the plug-in at a new account I created at Flickr. In this instance, I chose to display links to the photos, but you can also embed them in your web pages if you wish.
About the Plug-in
This plug-in takes the name of a public Flickr account and returns the most recent photos. Upon success, it returns a two-element array, the first element of which is the number of photos returned, and the second an array containing URLs for each photo. On failure, it returns a single-element array with the value FALSE. It requires this argument:
• $account A Flickr account name such as xxxxxxxx@Nxx (where the x symbols represent digits), or the more friendly Flickr usernames such as mine, which is robinfnixon
Variables, Arrays, and Functions
$url String containing the Flickr photo stream base URL
$page String containing the Flickr stream HTML page contents
$rss String containing the location of the RSS feed for $page
$xml String containing the contents of $rss
$sxml SimpleXML object created from $xml
$pics Array containing the image URLs
$item SimpleXML object extracted from item in $sxml
$j Integer loop variable for iterating through image URLs
$t String used for transforming URLs into the form required
Figure 10-5 With this plug-in you can view the stream of a public Flickr user.
How It Works
This plug-in takes the base Flickr stream URL and appends the account name in $account to it. This HTML page is then fetched using the file_get_contents() function, and its contents are stored in $page. The @ symbol prefacing the function suppresses any error messages should the call fail. And, if it does fail, a value of FALSE is returned in a single-element array.
Next, the array that will hold the image URLs, $pics, is initialized, and the program screen scrapes the HTML page to locate the position of the RSS link within it. Screen scraping is the term given to the process of extracting information from HTML pages that hasn’t been explicitly provided to you via an API or another method. Actually, there are Flickr APIs to do this, but these three lines of code are simpler and represent all the coding required to find the RSS feed on the page and return its URL to the variable $rss.
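As a rough sketch of that kind of scrape, consider the following. The alternate-link markup in $page is an assumption for illustration; the exact HTML Flickr emits may differ, so treat the search strings as placeholders:

```php
<?php
// Hypothetical example of scraping an RSS <link> out of an HTML page.
// The alternate-link markup below is assumed for illustration only.
$page = '<html><head><link rel="alternate" type="application/rss+xml" ' .
        'href="http://api.flickr.com/services/feeds/photos.rss" />' .
        '</head><body></body></html>';

// Find the RSS link's type attribute, move to the start of its URL,
// then read everything up to the closing quote.
$ptr = strpos($page, 'rss+xml');
$ptr = strpos($page, 'href="', $ptr) + 6;
$rss = substr($page, $ptr, strpos($page, '"', $ptr) - $ptr);

echo $rss; // http://api.flickr.com/services/feeds/photos.rss
?>
```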
Using this URL, the RSS feed is fetched and placed in the string $xml, from where it is transformed into a SimpleXML object in $sxml. This is a DOM (Document Object Model) object that can be easily traversed. To do this, a foreach loop iterates through the items in $sxml->entry, placing each in a new object called $item.
Then a for loop is used to iterate through all the items in $item->link, which contains the URLs we are interested in. If $item->link[$j]['type'] has the value image, then $item->link[$j]['href'] will contain a URL, so this is extracted into the variable $t, first removing any _t or _m sequences from the URL, since they represent different sizes of the photo that we are not interested in. Once $t contains the URL wanted, its value is assigned to the next available element of the $pics array, and the foreach loop continues. The plug-in returns a two-element array, with the first element containing the number of photos found, calculated using the count() function, and the second containing an array of the photo URLs.
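The _t/_m removal step can be sketched like this. The sample URL and the use of str_replace() here are illustrative assumptions, not lines taken from the plug-in:

```php
<?php
// Hypothetical example: strip Flickr's thumbnail (_t) and medium (_m)
// size suffixes so the URL points at the default-size photo instead.
$href = 'http://farm3.static.flickr.com/2522/3708788611_5a9964f24d_m.jpg';

$t = str_replace(array('_t.', '_m.'), '.', $href);

echo $t; // http://farm3.static.flickr.com/2522/3708788611_5a9964f24d.jpg
?>
```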
Figure 10-6 shows a photo taken at random from the list returned and entered into a browser. In this case, it has the following Flickr URL:
http://farm3.static.flickr.com/2522/3708788611_5a9964f24d_o.jpg
Figure 10-6 The plug-in determines the exact URL required for each photo.