Data Mashups in R
by Jeremy Leipzig and Xiao-Yi Li
Copyright © 2009 O’Reilly Media
ISBN: 9780596804770
Released: June 5, 2009
This article demonstrates how real-world data is imported, managed, visualized, and analyzed within the R statistical framework. Presented as a spatial mashup, this tutorial introduces the user to R packages, R syntax, and data structures. The user will learn how the R environment works with R packages as well as its own capabilities in statistical analysis. We will be accessing spatial data in several formats (html, xml, shapefiles, and text), locally and over the web, to produce a map of home foreclosure auctions and perform statistical analysis on these events.
Contents
Messy Address Parsing
Shaking the XML Tree
The Many Ways to Philly (Latitude)
Exceptional Circumstances
Taking Shape
Developing the Plot
Turning Up the Heat
Statistics of Foreclosure
Final Thoughts
Appendix: Getting Started
Programmers can spend a good part of their careers scripting code to conform to commercial statistics packages, visualization tools, and domain-specific third-party software. The same tasks can force end users to spend countless hours in copy-paste purgatory, each minor change necessitating another grueling round of formatting tabs and screenshots. R scripting provides some reprieve. Because this open source project garners the support of a large community of package developers, the R statistical programming environment provides an amazing level of extensibility. Data from a multitude of sources can be imported into R and processed using R packages to aid statistical analysis and visualization. R scripts can also be configured to produce high-quality reports in an automated fashion, saving time, energy, and frustration.
This article will attempt to demonstrate how real-world data is imported, managed, visualized, and analyzed within R. Spatial mashups provide an excellent way to explore the capabilities of R, giving glimpses of R packages, R syntax, and data structures. To keep this tutorial in line with the 2009 zeitgeist, we will be plotting and analyzing actual current home foreclosure auctions. Through this exercise, we hope to provide a general idea of how the R environment works with R packages as well as its own capabilities in statistical analysis. We will be accessing spatial data in several formats (html, xml, shapefiles, and text), locally and over the web, to produce a map of home foreclosures and perform statistical analysis on these events.
Messy Address Parsing
To illustrate how to combine data from disparate sources for statistical analysis and visualization, let’s focus on one of the messiest sources of data around: web pages.
The Philadelphia Sheriff’s office posts foreclosure auctions on its website [http://www.phillysheriff.com/properties.html] each month. How do we collect this data, massage it into a reasonable form, and work with it? First, let’s create a new folder (e.g. ~/Rmashup) to contain our project files. It is helpful to change the R working directory to your newly created folder.
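The setup step can be sketched in base R as follows. The helper name useProjectDir is hypothetical (it is not part of the article), and the example call simply reuses the ~/Rmashup folder suggested above.

```r
# Hypothetical helper: create a project folder if needed and make it the
# R working directory, so relative file names resolve inside it.
useProjectDir <- function(path) {
  dir.create(path, showWarnings = FALSE, recursive = TRUE)
  setwd(path)
  invisible(getwd())  # return the new working directory without printing it
}

# Example call, using the folder name suggested in the text:
# useProjectDir("~/Rmashup")
```

Calling getwd() afterwards confirms where R will read and write files.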
Here is some of this webpage’s source html, with addresses highlighted:
<center><b> 258-302 </b></center>
84 E Ashmead St.
    22nd Ward
974.96 sq ft BRT# 121081100 Improvements: Residential Property
<br><b>
Homer Simpson
   
</b> C.P November Term, 2008 No 01818     $55,132.65
    Some Attorney & Partners, L.L.P.
<hr />
<center><b> 258-303 </b></center>
1916 W York St.
    16th Ward
992 sq ft BRT# 162254300 Improvements: Residential Property
The Sheriff’s raw html listings are inconsistently formatted, but with the right regular expression we can identify street addresses: notice how they appear alone on a line. Our goal is to submit viable addresses to the geocoder. Here are some typical addresses that our regular expression should match:
</b> C.P August Term, 2008 No 002804
R has built-in functions that allow the use of perl-type regular expressions. (For more info on regular expressions, see Mastering Regular Expressions [http://oreilly.com/catalog/9780596528126/] and Regular Expression Pocket Reference [http://oreilly.com/catalog/9780596514273].)
With some minor deletions to clean up address idiosyncrasies, we should be able to correctly identify street addresses from the mess of other data contained in properties.html. We’ll use a single regular expression pattern to do the cleanup. For clarity, we can break the pattern into the familiar elements of an address (number, name, suffix):
> stNum<-"^[0-9]{2,5}(\\-[0-9]+)?"
> stName<-"([NSEW]\\ )?[0-9A-Z ]+"
> stSuf<-"(St|Ave|Place|Blvd|Drive|Lane|Ln|Rd)(\\.?)$"
> myStPat<-paste(stNum,stName,stSuf,sep=" ")
Note that the backslash characters themselves must be escaped with a backslash to avoid conflict with R syntax. Let’s test this pattern against our examples using R’s grep() function:
> grep(myStPat,"3331 Morning Glory Rd.",perl=TRUE,value=FALSE,ignore.case=TRUE)
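As a further check, here is a small sketch that runs the same pattern over several lines at once with grepl(), which returns one logical value per input. The candidate lines are taken from the html excerpt and examples above; the variable names candidates and isAddress are only illustrative.

```r
# Rebuild the address pattern from its three pieces (as defined above).
stNum   <- "^[0-9]{2,5}(\\-[0-9]+)?"
stName  <- "([NSEW]\\ )?[0-9A-Z ]+"
stSuf   <- "(St|Ave|Place|Blvd|Drive|Lane|Ln|Rd)(\\.?)$"
myStPat <- paste(stNum, stName, stSuf, sep = " ")

# A mix of address lines and non-address lines from the listing.
candidates <- c("3331 Morning Glory Rd.",
                "84 E Ashmead St.",
                "1916 W York St.",
                "</b> C.P August Term, 2008 No 002804",
                "992 sq ft BRT# 162254300 Improvements: Residential Property")

# grepl() is vectorized: TRUE where the whole line looks like an address.
isAddress <- grepl(myStPat, candidates, perl = TRUE, ignore.case = TRUE)
candidates[isAddress]
```

Only the first three lines should be kept, since the other two either do not start with a street number or do not end in a recognized suffix.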
Next, we collect the fragments we want to delete from an address into a second pattern:
> badStrings<-"(\\r| a\\/?[kd]\\/?a.+$| - Premise.+$| assessed as.+$|, Unit.+|<font size=\"[0-9]\">|Apt\\ +| #.+$|[,\"]|\\s+$)"
Test this against some examples using R’s gsub() function:
> gsub(badStrings,'',"205 N 4th St., Unit BG, a/k/a 205-11 N 4th St., Unit BG", perl=TRUE)
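The two patterns are meant to work together: first delete the clutter, then test what remains. The sketch below chains them on two sample strings. The second string (an address with trailing spaces) is an invented test case, and the badStrings pattern is written on one line with a single space after the comma in ", Unit".

```r
# Address pattern, as defined earlier.
myStPat <- paste("^[0-9]{2,5}(\\-[0-9]+)?",
                 "([NSEW]\\ )?[0-9A-Z ]+",
                 "(St|Ave|Place|Blvd|Drive|Lane|Ln|Rd)(\\.?)$",
                 sep = " ")

# Clutter pattern, as defined above.
badStrings <- "(\\r| a\\/?[kd]\\/?a.+$| - Premise.+$| assessed as.+$|, Unit.+|<font size=\"[0-9]\">|Apt\\ +| #.+$|[,\"]|\\s+$)"

rawLines <- c("205 N 4th St., Unit BG, a/k/a 205-11 N 4th St., Unit BG",
              "84 E Ashmead St.   ")   # invented case: trailing whitespace

# Step 1: strip the clutter. Step 2: keep lines that now look like addresses.
cleaned  <- gsub(badStrings, "", rawLines, perl = TRUE)
matched  <- grepl(myStPat, cleaned, perl = TRUE, ignore.case = TRUE)
cleaned[matched]
```

Neither raw string matches the address pattern on its own, which is why the cleanup has to run first.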
Obtaining Latitude and Longitude Using Yahoo
To plot our foreclosures on a map, we’ll need to get latitude and longitude coordinates for each street address. Yahoo Maps provides such a service (called “geocoding”) as a REST-enabled web service. Via HTTP, the service accepts a URL containing a partial or full street address, and returns an XML document with the relevant information. It doesn’t matter whether a web browser or a robot is submitting the request, as long as the URL is formatted correctly. The URL must contain an appid parameter and as many street address arguments as are known.
To use this service with your mashup, you must sign up with Yahoo! and receive an Application ID. Use that ID with the ‘appid’ parameter of the request url. Sign up here: http://developer.yahoo.com/maps/rest/V1/geocode.html
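A request URL of this kind can be assembled with paste(). The following is a minimal sketch: the function name buildGeocodeUrl is hypothetical, the endpoint and parameter names (appid, street, city, state) are the ones that appear in the example request later in this article, and the encoding step only turns spaces into "+" rather than handling every special character. The block builds a string and makes no network request.

```r
# Hypothetical helper: assemble a geocoding request URL from its parts.
buildGeocodeUrl <- function(street, city, state, appid) {
  base   <- "http://local.yahooapis.com/MapsService/V1/geocode"
  params <- c(appid = appid, street = street, city = city, state = state)
  # Simplistic encoding: replace spaces with "+" (other characters untouched).
  encoded <- gsub(" ", "+", params, fixed = TRUE)
  query   <- paste(names(params), encoded, sep = "=", collapse = "&")
  paste(base, query, sep = "?")
}

requestUrl <- buildGeocodeUrl("1 South Broad Street", "Philadelphia", "PA",
                              appid = "MY_APP_ID")  # placeholder ID
requestUrl
```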
Shaking the XML Tree
Parsing well-formed and valid XML is much easier than parsing the Sheriff’s html. An XML parsing package is available for R; here’s how to install it from CRAN’s repository:
> install.packages("XML")
> library("XML")
Warning
If you are behind a firewall or proxy and getting errors:
On Unix: Set your http_proxy environment variable.
On Windows: Try the custom install R wizard with the internet2 option instead of “standard”. For additional info, see [http://cran.r-project.org/bin/windows/base/rw-FAQ.html#The-Internet-download-functions-fail_00].
Our goal is to extract values contained within the <Latitude> and <Longitude> leaf nodes. These nodes live within the <Result> node, which lives inside a <ResultSet> node, which itself lies inside the root node.
To find an appropriate function for getting these values, call library(help=XML). This command lists the functions in the XML package.
> library(help=XML) #hit space to scroll, q to exit
> ?xmlTreeParse
We can see that the function xmlTreeParse will accept an XML file or url and return an R structure. Paste in this block after inserting your Yahoo App ID:
> library(XML)
> appid<-'<put your appid here>'
> street<-"1 South Broad Street"
the jugular by using what we can glean from the data structure that
$ text: list()
- attr(*, "class")= chr [1:5] "XMLTextNode" "XMLNode" "RXMLAb
- attr(*, "class")= chr [1:4] "XMLNode" "RXMLAbstractNode" "XMLA
(snip)
That’s kind of a mess, but we can see our Longitude and Latitude are lists inside of lists inside of a list inside of a list.
Tom Short’s R reference card, an invaluable resource, tells us that the element named name in a list x is retrieved with x[['name']]: http://cran.r-project.org/doc/contrib/Short-refcard.pdf
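To see that indexing rule in isolation, here is a small base R sketch on a hand-built nested list. The list only mimics the nesting described above (it is not the structure xmlTreeParse returns), and the name nested is illustrative.

```r
# A plain nested list standing in for "lists inside of lists".
nested <- list(doc = list(ResultSet = list(Result = list(
  Latitude = "39.951405"))))

# x[['name']] extracts one element; chaining walks down the nesting.
viaBrackets <- nested[["doc"]][["ResultSet"]][["Result"]][["Latitude"]]

# The $ operator is an equivalent shorthand for named elements.
viaDollar <- nested$doc$ResultSet$Result$Latitude

identical(viaBrackets, viaDollar)
```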
The Many Ways to Philly (Latitude)
Using Data Structures
Using R’s list indexing notation, we can get to the nodes we need:
> lat<-xmlResult[['doc']][['ResultSet']][['Result']][['Latitude']][['text']]
> long<-xmlResult[['doc']][['ResultSet']][['Result']][['Longitude']][['text']]
> lat
39.951405
This looks good, but if we examine it further:
> str(lat)
list()
- attr(*, "class")= chr [1:5] "XMLTextNode" "XMLNode" "RXMLAbstractNode" "XML
Although it has a decent display value, this variable still considers itself an XMLNode and contains no index to obtain the raw leaf value we want: the descriptor just says list() instead of something we can use (like $lat). We’re not quite there yet.
Using Helper Methods
Fortunately, the XML package offers a method to access the leaf value: xmlValue.
> lat<-xmlValue(xmlResult[['doc']][['ResultSet']][['Result']][['Latitude']])
> str(lat)
chr "39.951405"
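Since the live geocoder response is not reproduced here, the sketch below applies xmlValue to a hand-written XML string instead. The sample document is an assumption modeled on the node names mentioned in the text, the longitude value is invented, and the sketch navigates from xmlRoot() rather than through the [['doc']] indexing used above.

```r
library(XML)

# Hand-written stand-in for a geocoder response (longitude is made up).
sampleXml <- paste0(
  "<ResultSet><Result>",
  "<Latitude>39.951405</Latitude><Longitude>-75.16</Longitude>",
  "</Result></ResultSet>")

# asText = TRUE tells xmlTreeParse the argument is XML content, not a file.
doc    <- xmlTreeParse(sampleXml, asText = TRUE)
result <- xmlRoot(doc)[["Result"]]

# xmlValue returns the text inside a node as a character string;
# as.numeric converts it into a number we can plot.
lat  <- as.numeric(xmlValue(result[["Latitude"]]))
long <- as.numeric(xmlValue(result[["Longitude"]]))
c(lat, long)
```

Converting to numeric at this point avoids carrying character coordinates into the plotting code.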
Using Internal Class Methods
There are usually multiple ways to accomplish the same task in R. Another means to get to our character lat/long data is to use the “value” method provided by the node itself:
> lat<-xmlResult[['doc']][['ResultSet']][['Result']][['Latitude']][['text']]$value
If we were really clever, we would have understood that the XML doc class provided us with useful methods all the way down! Try neurotically holding down the tab key after typing:
> lat<-xmlResult$ (now hold down the tab key)
Exceptional Circumstances
The Unmappable Fake Street
Now we have to deal with the problem of bad street addresses, where either the Sheriff’s office enters a typo or our parser lets a bad street address pass. For example: http://local.yahooapis.com/MapsService/V1/geocode?appid=YD-9G7bey8_JXxQP6rxl.fBFGgCdNjoDMACQA&street=1+Fake+St&city=Philadelphia&state=PA
According to the Yahoo documentation, when confronted with an address that cannot be mapped, the geocoder will return coordinates pointing to the center of the city.
Note that the “precision” attribute of the result is “zip” instead of “address”, and there is a warning attribute as well:
warning="The street could not be found. Here is the center of the city.">
We need to get hold of the attributes within <Result> to distinguish bad geocoding events, or else we could accidentally record events in the center of the city as foreclosures. By reading the RSXML FAQ [http://www.omegahat.org/RSXML/FAQ.html], it becomes clear we need to turn on the addAttributeNamespaces parameter in our xmlTreeParse call if we are ever to see the precision attribute.
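One way to act on the precision attribute is sketched below, again on a hand-written XML string rather than a live response. The sample document and its coordinates are assumptions, the warning text is the one quoted above, and xmlGetAttr is used here as one of several ways the XML package offers to read an attribute.

```r
library(XML)

# Hand-written stand-in for the response to an unmappable address.
badXml <- paste0(
  "<ResultSet>",
  "<Result precision=\"zip\" warning=\"The street could not be found. ",
  "Here is the center of the city.\">",
  "<Latitude>39.95</Latitude><Longitude>-75.16</Longitude>",
  "</Result></ResultSet>")

doc    <- xmlTreeParse(badXml, asText = TRUE, addAttributeNamespaces = TRUE)
result <- xmlRoot(doc)[["Result"]]

# Read the attribute and accept the coordinates only for street-level hits.
precision <- xmlGetAttr(result, "precision")
isUsable  <- identical(precision, "address")
isUsable
```

A geocoding function could return NA coordinates whenever isUsable is FALSE, so that city-center fallbacks never reach the map.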
We will compile all this code into a single function once we know how to merge it with a map (see Developing the Plot).
Taking Shape
Finding a Usable Map
To display a map of Philadelphia with our foreclosures, we need to find a polygon of the county as well as a means of plotting our lat/long coordinates onto it. Both these requirements are met by the ubiquitous ESRI shapefile format. The term shapefile [http://en.wikipedia.org/wiki/Shapefile] collectively refers to a .shp file, which contains polygons, and related files which store other features, indices, and metadata.
Googling “philadelphia shapefile” returns several promising results, including this page: http://www.temple.edu/ssdl/Shape_files.htm
“Philadelphia Tracts” seems useful because it has US Census Tract information included. We can use these tract ids to link to other census data. Tracts are standardized to contain roughly 1500-8000 people, so densely populated tracts tend to be smaller. This particular shapefile is especially appealing because the map “projection” uses the same WGS84 [http://en.wikipedia.org/wiki/World_Geodetic_System] Lat/Long coordinate system that our address geocoding service uses, as opposed to a “state plane coordinate system”, which can be difficult to transform (transformations require the rgdal [http://cran.r-project.org/web/packages/rgdal/index.html] package and gdal [http://www.gdal.org/] executables).
Save and unzip the following to your project directory: http://en.wikipedia.org/wiki/World_Geodetic_System
PBSmapping is a popular R package that offers several means of interacting with spatial data. It relies on some base functions from the maptools package to read ESRI shapefiles, so we need both packages:
> install.packages(c("maptools","PBSmapping"))
As with other packages, we can see the functions using library(help=PBSmapping) and view function descriptions using ?topic: http://cran.r-project.org/web/packages/PBSmapping/index.html
We can use str to examine the structure of the shapefile imported by PBSmapping::importShapefile:
> library(PBSmapping)
PBS Mapping 2.59 Copyright (C) 2003-2008 Fisheries and Oceans Canada
PBS Mapping comes with ABSOLUTELY NO WARRANTY; for details see the
file COPYING This is free software, and you are welcome to redistribute
it under certain conditions, as outlined in the above file.
A complete user's guide 'PBSmapping-UG.pdf' appears
in the ' /library/PBSmapping/doc' folder.
To see demos, type '.PBSfigs()'.
Built on Oct 7, 2008
Pacific Biological Station, Nanaimo
> myShapeFile<-importShapefile("tracts2000",readDBF=TRUE)
Loading required package: maptools
Loading required package: foreign
Loading required package: sp
$ TRT2000 : Factor w/ 381 levels "000100","000200", : 1 2 3 4 5 6 7 8 9
$ STFID : Factor w/ 381 levels "42101000100", : 1 2 3 4 5 6 7 8 9 10
$ TRACTID : Factor w/ 381 levels "1","10","100", : 1 114 226 313 327 337
$ PARK : num 0 0 0 0 0 0 0 0 0 0
$ OLDID : num 1 1 1 1 1 1 1 1 1 1
$ NEWID : num 2 2 2 2 2 2 2 2 2 2
- attr(*, "parent.child")= num 1 1 1 1 1 1 1 1 1 1
- attr(*, "shpType")= int 5
- attr(*, "prj")= chr "Unknown"
- attr(*, "projection")= num 1
While the shapefile itself consists of 16290 points that make up Philadelphia, it appears a lot of the polygon data associated with this shapefile is stored as an attribute of myShapeFile. We should set that to a top-level variable for easier access.
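Pulling an attribute out into its own variable is done with attr(). The sketch below demonstrates the idea on a toy data frame rather than the real shapefile; the attribute name "PolyData" is an assumption about what PBSmapping attaches, and all values are invented.

```r
# Toy stand-in for an imported shapefile: polygon vertices in a data frame,
# with per-polygon data attached as an attribute (assumed name: "PolyData").
toyShape <- data.frame(PID = c(1, 1, 1), POS = 1:3,
                       X = c(-75.2, -75.1, -75.1),
                       Y = c(39.9, 39.9, 40.0))
attr(toyShape, "PolyData") <- data.frame(PID = 1, TRACTID = "1")

# Copy the attribute into a top-level variable for easier access.
toyPolyData <- attr(toyShape, "PolyData")
str(toyPolyData)
```

With the real object, the same attr() call on myShapeFile would apply, provided the attribute name matches what str(myShapeFile) reports.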
Developing the Plot
Preparing to Add Points to Our Map
To use PBSmapping’s addPoints function, the reference manual suggests we treat our foreclosures as “EventData”. The EventData format is a standard R data frame (more on data frames below) with required columns X, Y, and a unique row identifier EID. With this in mind, we can write a function around our geocoding code that will accept a list of streets and return a kosher EventData-like data frame.
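The shape of such a data frame can be sketched without any geocoding at all. In the block below, makeEventData is a hypothetical helper (not the article's function) that takes coordinates already in hand, and the two sample points are invented.

```r
# Hypothetical helper: wrap longitude/latitude vectors in a data frame with
# the columns the EventData format requires (EID, X, Y).
makeEventData <- function(lon, lat) {
  stopifnot(is.numeric(lon), is.numeric(lat), length(lon) == length(lat))
  data.frame(EID = seq_along(lon),  # unique row identifier
             X   = lon,             # longitude
             Y   = lat)             # latitude
}

events <- makeEventData(lon = c(-75.16, -75.15), lat = c(39.95, 39.99))
events
```

Note that X holds longitude and Y holds latitude, which is easy to swap by accident when coordinates are usually spoken of as "lat/long".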
Figure 2