DIRS: Disconnected Information Retrieval System
Gregory Tracy
Department of Computer Sciences
University of Wisconsin, Madison
1210 West Dayton Street, Madison, WI 53706
gtracy@cs.wisc.edu
Patrick Votruba
Department of Computer Sciences
University of Wisconsin, Madison
1210 West Dayton Street, Madison, WI 53706
votruba@cs.wisc.edu
Abstract
The World Wide Web gives individuals access to huge amounts of data. This includes access to information also found in traditional formats such as news copy. This study addresses a desire to blend these two mediums in such a way that media consumers can move transparently from a hard copy of a given article to an electronic copy. Document retrieval experiments were performed in an attempt to determine the feasibility of implementing a handheld scanning device used to mark traditional newspaper articles for subsequent online retrieval. Several thousand random articles were fetched from two popular news search services to emulate the scanning of print media also available online. Experiments were performed on these articles to quantify the success of searching with various article attributes. Query success is quantified by measuring whether or not the article is found, and how deep into the query results we must parse to locate the correct article. When searching on the title of a news article, it was retrieved correctly 98% of the time with an average depth of one. When searching for an article based on a randomly chosen, 30-character string, 92% of the articles were retrieved successfully with an average depth of two.
1 Introduction

The World Wide Web provides individuals with access to a tremendous amount of information. Although there are web search services that do a terrific job of locating specific items of interest, it can still be a difficult and time-consuming process. In addition, it requires that a person be connected to the network. Given that lifestyles have become more hectic, there exists a market for products and services to help manage this wealth of information. The information age has already spawned the use of cellular telephones and personal digital assistants (PDAs) to manage our daily lives. However, more can be done to bridge the divide between traditional sources of information and their more recent electronic counterparts.
People’s daily routines will cause them to shift from the connected world to the disconnected world and back again. There is a need for a service that can make this transition easier; a service which makes the disconnected moments feel more connected and the transitions back to the connected world less cumbersome. For instance, our quest for information does not stop when we step away from our computer. Walking down the street, we can be drawn to various data points that we may want to do more research on. A flyer on a telephone pole may advertise a concert we wish to attend or learn more about. A certain dish on a Thai restaurant menu may tease our curiosity about recipes for home. A visit to Home Depot may generate some home improvement thoughts. These moments of curiosity can be captured in many different ways, including a simple scrap of paper. But there is a need to make a transparent transition from these disconnected moments to a connected world without the manual transfer of thoughts from a notation device to an online data query.
Perhaps the best example of consuming information while disconnected is print media. Now that the vast majority of print media is available online via the World Wide Web, it should be possible to make note of an article printed in a periodical and easily find the corresponding digital version via one of the several Internet search engines. For instance, if someone finds a newspaper article of interest while waiting in an airport or a doctor's office, but doesn't have time to finish it, they could later use a search engine to retrieve the web-based version of the article. This scenario raises two primary questions: 1) What sort of information needs to be captured from the article so that it can be fed to a search engine for a reliable, automated retrieval? 2) Is there an electronic device capable of capturing the relevant information? This paper presents the results of experiments attempting to answer the first question while speculating on the feasibility of using a pen-sized scanning device to address the second question.
2 Background

The inspiration for this work was a Sony product called the eMarker [7]. The now discontinued eMarker was a small, inexpensive device that allowed users to "bookmark" songs they heard on the radio by pressing the lone button on the device. Users could later connect the device to their personal computer (PC) via a Universal Serial Bus (USB) connection. The eMarker retrieved song information by recording the time that the user pressed the device button. It then used a separate service, Broadcast Data Systems (BDS), which provides the play lists for over 1,000 radio stations. A user provides a set of favorite radio stations as part of the eMarker account preferences, which allows a radio station's play list to be easily searched using the time recorded by the device. This allows retrieval of the primary attributes, such as title, artist, and album, of the song that was playing at the time the user set the bookmark.
The eMarker was innovative in many ways. It relied on a simple device that could be purchased for a low cost ($19.95) and that interfaced with existing technology (personal computers with web browsers) via a standardized interface (USB). Furthermore, the eMarker's reliance on a pre-existing service (BDS) whose primary source of revenue was already well established minimized the overhead needed to provide the necessary services to its customers.
We propose that producing an analogous device for marking print material for later retrieval via an Internet search engine would be relatively straightforward. The device used to capture information could be a pen-sized scanning device, and the information from the bookmarked articles could then be fed to a web-based search engine.
In the next section, we present an approach using the Yahoo! News search engine [8]. Yahoo! News is a specialized search engine designed for retrieving news articles. The advantage of using a specialized news search engine is that it is updated with wire articles from the Associated Press and Reuters news services as soon as they are released; these articles are typically archived by Yahoo! News for two weeks. Yahoo! News also archives articles from other news sources, such as The New York Times, Business Week, and USA Today, for about one month. Although the number of publishers is limited, it is our belief that all print media will be available online in the not too distant future.
3 Methodology

The approach of our feasibility study parallels that of an actual implementation for a prospective retrieval service. If this service were to be successful, all manual data entry would need to be eliminated from the process of retrieving articles. Furthermore, the intensity and volume of running the large experiments in this study makes manual data entry prohibitive. To automate the process of performing web searches, we created Perl scripts that perform two primary tasks: 1) collect large numbers of news articles being stored on the web to be used as test cases, and 2) generate query strings based on the attributes of the articles collected, feed them into an online search engine, and check the results to determine whether the search successfully found the respective article.
3.1 Data Collection
In order to evaluate the success of searching with various search criteria, we first needed to collect a set of online articles. As we later determined, it was important to collect articles repeatedly to have a "fresh" bank of articles to test against. To simplify this process we developed a Perl script, fetchit, which takes a generic search topic as an argument and produces all of the online articles that match the search string. Under the hood of fetchit is an HTTP interface to the Yahoo! News web portal.
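A minimal sketch of such an interface is shown below. The query URL and its p= parameter are hypothetical placeholders (the exact Yahoo! News query syntax is not reproduced here), and the link extraction is deliberately crude; the sketch only illustrates the shape of the fetch step.

    #!/usr/bin/perl
    # Sketch of a fetchit-style fetcher. The search URL and its parameter
    # are hypothetical; a real implementation would use the news portal's
    # actual query syntax.
    use strict;
    use warnings;
    use LWP::UserAgent;
    use URI::Escape qw(uri_escape);

    my $topic = shift @ARGV or die "usage: fetchit.pl <search topic>\n";
    my $ua    = LWP::UserAgent->new(timeout => 30);

    my $url  = 'http://news.search.yahoo.com/search/news?p=' . uri_escape($topic);
    my $resp = $ua->get($url);
    die "fetch failed: ", $resp->status_line, "\n" unless $resp->is_success;

    # Pull candidate article URLs out of the result page with a crude
    # regular expression; a production script would parse the HTML properly.
    my @links = ($resp->decoded_content =~ m{<a\s+href="(http://[^"]+)"}gi);
    print "$_\n" for @links;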
Rather than focus on a particular publisher, we did not discriminate, and chose instead to treat all articles as if they came from print media and thus could be scanned with a prospective pen-scanning device. In terms of the experiments we performed, the source of the article is not as important as the fact that Yahoo! News or some other online search service is caching it. For example, one cannot locate an article online, no matter how much detail has been saved by a scanning device, if the search engine being used does not crawl the site that contains the article. As stated earlier, it is assumed that sometime soon, all print material will be available online and thus searchable by a news search engine such as Yahoo! News. Based on this argument, if Yahoo! News has cached an article, we assume that there is a print version that can be scanned.
Although we limited ourselves to the publishers cached by Yahoo! News, this is not an architectural issue associated with the prospective scanning device itself. As we show later, we were able to use Excite's news search engine to do similar experiments. When implemented, the searching can be extended to any number of online sources, including a publisher's own content site.
Each article retrieved from an online source is actually a content-rich web page with the article embedded within it. In order to simplify the experiments, each web page is parsed as the fetchit script reads it in. The HTML code is broken down into the pieces that make up a print version and fall under the following categories:
• Title
• Author
• Publisher
• Text body of the article
This information is stored in a unique file along with the URL where the article resides online. Collectively, this information makes up the meta-data our experiments can use to create search queries as well as to determine whether the search was successful or not. An example of a meta-data file is shown in Figure 1.
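A minimal sketch of the meta-data output step appears below, assuming the article attributes have already been extracted into a hash; the helper name and field handling are illustrative rather than the exact fetchit code.

    # Sketch: write one meta-data file per article in the tagged format
    # shown in Figure 1. The extraction itself is assumed done elsewhere.
    use strict;
    use warnings;

    sub write_meta_file {
        my ($path, $meta) = @_;   # $meta: hash ref with url/title/author/date/publisher/text
        open my $fh, '>', $path or die "cannot write $path: $!";
        print $fh "URL: $meta->{url}\n";
        print $fh "TITLE: $meta->{title}\n";
        print $fh "AUTHOR: $meta->{author}\n";
        print $fh "DATE: $meta->{date}\n";
        print $fh "PUBLISHER: $meta->{publisher}\n";
        print $fh "TEXT:\n$meta->{text}\n";
        close $fh;
    }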
When articles were collected and put into this meta-data format, a wide range of topics was chosen. From "Rudy Giuliani" to "Green Bay Packers football", anywhere from 50 to 200 articles were retrieved for each topic, totaling between 1500 and 2000 articles. With this process automated, we were able to focus our attention on the feasibility experiments.
3.2 Article Searching
Once we were able to collect large sets of news articles, it was much easier to generate scenarios to understand and measure the success or failure of searches that could be performed by users of this proposed service. The search mechanism is similar in nature to fetchit. We used a Perl script, matchit, which performed three tasks:
1) Determine search criteria: This is intended to simulate the choices a user might have when scanning print media. The search criterion is a combination of those article attributes stored in each meta-data file. By selecting one attribute, such as article title, or a set of attributes, we are able to experiment with various combinations and quantify their success in finding the correct article. For example, we may set up an experiment in which the search criterion is made up of the article's title and author. These two attributes are extracted from a meta-data file and used to construct a web search query. If the criterion includes a portion of the text body, we also specify the number of characters from the text to be used. As will be explained later in Section 4.2, simply specifying the number of consecutive characters was not enough information to produce successful search strings.
2) Apply search criteria: Once the search criteria have been determined for an experiment, they are applied to every article in our test bank. The matchit script reads in a meta-data file corresponding to a single article, extracts the data associated with the search criteria, concatenates a search string, and sends an HTTP request to an online news search service.
3) Parse search results: When the search engine returns its results, we break down the HTML code and examine only the URLs that pertain to actual search results. We compare these URLs against the article URL we store in the respective meta-data file, and look for a match. The success of the search is quantified in two ways: 1) Did we find the article? 2) If it was found, how deep into the search results did we have to comb before we found the link to the
URL: http://dailynews.yahoo.com/h/ap/20011209/sp/bba_red_sox_coaches_1.html
TITLE: Red Sox Talking to Herzog, Alou
AUTHOR: Jimmy Golen
DATE: December 09
PUBLISHER: AP
TEXT:
By JIMMY GOLEN, AP Sports Writer BOSTON (AP) - The Red Sox have approached former major league managers Whitey Herzog and Felipe Alou about serving as bench coaches alongside inexperienced Boston manager Joe Kerrigan………

Figure 1: One example of a meta-data file created by the fetchit script. Each line begins with a tag identifying an article attribute. In this example, the "TEXT" section has been cut short for space considerations.
article? We refer to the first metric as the hit rate. For example, if nine out of ten articles are successfully found in a given experiment, the hit rate is 90%. We refer to the second metric as the URL depth. If a search returns ten URLs, and we find the matching URL five links deep into the results list, the URL depth is five. This value is averaged over all article searches for a given experiment.
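Expressed as code, the two metrics reduce to a few lines; the following sketch (with an illustrative helper name, not taken from the matchit source) computes both from the per-article match positions.

    # Sketch: compute hit rate and average URL depth for one experiment.
    # @results holds, per article, the 1-based position of the matching URL
    # in the search results, or undef if the article was not found.
    sub score_experiment {
        my @results = @_;
        my @hits = grep { defined } @results;
        my $hit_rate = @results ? @hits / @results : 0;
        my $depth_sum = 0;
        $depth_sum += $_ for @hits;
        my $avg_depth = @hits ? $depth_sum / @hits : 0;
        return ($hit_rate, $avg_depth);
    }

    # Example: 9 of 10 articles found, each at depth 1 or 2.
    my ($rate, $depth) = score_experiment(1, 2, 1, 1, undef, 2, 1, 1, 1, 2);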
The matchit script repeats steps two and three for each article (meta-data file) we have collected. In the case of the searches using the text body as criteria, we loop over the set of articles five times and take the average of the results. This is because when selecting text body for a query, we seek to a random location in the file to use as a starting point. Averaging over several runs allows us to detect and avoid any abnormalities produced by the random function. As it turned out, there was very little variance amongst the five loop iterations.
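The outer averaging loop can be sketched as follows, where run_text_experiment is a hypothetical stand-in for one matchit pass over the article bank:

    # Sketch: average a text-body experiment over five runs to smooth out
    # the randomness of the chosen text offsets.
    my $RUNS = 5;
    my ($rate_sum, $depth_sum) = (0, 0);
    for my $run (1 .. $RUNS) {
        my ($rate, $depth) = run_text_experiment();   # hypothetical helper
        $rate_sum  += $rate;
        $depth_sum += $depth;
    }
    printf "avg hit rate %.1f%%, avg URL depth %.1f\n",
           100 * $rate_sum / $RUNS, $depth_sum / $RUNS;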
Given the ease with which we were able to set up experiments and evaluate results with the matchit script, we were able to focus our attention on the breadth of scenarios to test. What are some alternatives for recording information that give us the best opportunity for retrieving an article later? It is a very basic question, and one that we used to direct our experiments.
4 Experimental Results

All experiments were performed on Linux PCs in the Computer Sciences Department at the University of Wisconsin, Madison. The only apparent bottleneck of the scripts is the latency of the search engine servers and the Internet itself. It took between one and two seconds on average to successfully find an article we were targeting. Complete experiments typically ran for around thirty minutes.
4.1 Searching On Title
Perhaps the most unique characteristic of an article is its title. If one were able to scan the entire title of an article before walking away from the waiting room, the chances of finishing that article later are remarkable. As shown in the first row of Table 1, the correct article will be found 98% of the time. Furthermore, when the article is found, the correct URL will be in one of the first two returned by the search. This experiment was set up using the string found after the TITLE tag in each article's meta-data file. The exact title string was submitted to the search engine surrounded by quotation marks. If the quotation marks are removed, the success rate falls only slightly (less than 2%), but the URL depth degrades more significantly: it grows by a factor of four. While four doesn't seem like a huge number, that's four more articles (on average) that a user may have to visit before they find the one they intended to find.
Table 1 also lists the results of our experiments when we used various combinations of search criteria that included the title string. In these cases, each article attribute was surrounded by quotation marks and the combination was concatenated together. For example, given the article found in Figure 1, a search using title and author would produce the following query string¹:

"Red Sox Talking to Herzog, Alou" +"Jimmy Golen"

The results are mixed, but adding any other criteria to the title search consistently makes the search results worse. Interestingly, however, the URL depth is very consistent. If the article is in fact found, we will usually find it within the first two articles returned by the search engine. This is investigated in more detail in Section 4.4.

¹ The format of these strings is a subtlety of each search engine. Although we experimented with different combinations that were successful with Yahoo! News, the format may not yield similar results with other search engines. In fact, using any form of quotations with Excite News produces a hit rate of 0%.
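A query of this form can be assembled mechanically from the meta-data attributes. The sketch below follows the quoting convention shown above, which, per the footnote, is specific to Yahoo! News:

    # Sketch: build a quoted query string from selected meta-data attributes,
    # following the Yahoo! News convention used in the example above.
    sub build_query {
        my @attrs = @_;                          # e.g. (title, author)
        my ($first, @rest) = map { qq{"$_"} } @attrs;
        return join ' ', $first, map { "+$_" } @rest;
    }

    my $q = build_query('Red Sox Talking to Herzog, Alou', 'Jimmy Golen');
    # yields: "Red Sox Talking to Herzog, Alou" +"Jimmy Golen"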
Date experiments can be improved using a special interface to Yahoo! News. Currently, when the date is used in the search criteria, it gets passed to the search engine as if it were just another text string. It is clear, however, that these dates are not indexed by the search engine the way other words in the article are. Yahoo! News has added an interface which allows a query to specify a date (or range of dates) on which the article can be found. Although this is appealing when using Yahoo! News, the solution does not port well to other services and thus is not studied in great detail in this paper.
4.2 Searching On Text
Users may not always have the option of scanning an article title. Font sizes, graphics, and page layout may prevent this type of operation. One other unique characteristic of an article is a string of consecutive characters from within the article body. It is not uncommon, however, for an arbitrary string of characters to be found in multiple articles. Naturally, the longer the string is, the more likely it is that it is unique to one article. In an attempt to understand these properties, we conducted a set of experiments that searched for articles using a variable number of consecutive characters from the text body of the article.
Criteria    Hit Rate    URL Depth

Table 1: The hit rate and average URL depth for queries using Yahoo! News with combinations of article attributes.
The string is chosen through the use of the TEXT tag in the meta-data file. A random offset into the text body is chosen for each search. We seek to that location and then slide forward, if necessary, to find the first word boundary². The end point of the string is the start offset plus the specified number of characters, plus a variable number of characters, if necessary, to get to the first word boundary beyond that.

² It was discovered very early in the experiments that it is imperative to send search queries that fall on word boundaries. Even though a set of strings may be unique to an article, the search engines are indexing words and not random character strings.
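This selection procedure can be written down directly; the sketch below is illustrative (the helper name and edge-case handling are ours, not from the matchit source):

    # Sketch: pick ~$n characters from $text, starting at a random offset
    # and snapping both ends forward to word boundaries, as described above.
    sub pick_search_string {
        my ($text, $n) = @_;
        my $max = length($text) - $n;
        $max = 1 if $max < 1;
        my $start = int(rand($max));
        # Slide forward to the next word boundary if we landed mid-word.
        $start++ while $start > 0 && $start < length($text)
            && substr($text, $start - 1, 1) =~ /\S/;
        # Extend the end to the first word boundary beyond start + $n chars.
        my $end = $start + $n;
        $end = length($text) if $end > length($text);
        $end++ while $end < length($text) && substr($text, $end, 1) =~ /\S/;
        return substr($text, $start, $end - $start);
    }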
The number of characters we experimented with ranged from five to fifty, and the results are reported in Figure 2. The hit rate very quickly ramps up to 90% as the number of characters used to search grows from five to fifteen. In fact, based solely on the hit rate, there is little motivation to use more than fifteen characters from the article body to create a search query. As shown by Figure 2, the hit rate hits a plateau and sustains approximately a 92% success rate up to fifty-character search strings. We did not test strings longer than fifty characters.
This experiment emphasizes the importance of our second quantitative metric for success, the URL depth. As more characters are added to the search string, the range of our search depth decreases and approaches one. As was stated earlier, this is a very practical measurement of the search results. As the uniqueness of our search string grows, the number of articles returned by the query shrinks and a user is able to locate the correct article more quickly. Figure 3 isolates the average URL depth measurements. It illustrates that by jumping from fifteen to thirty characters, the search depth improves by a factor of three.
4.3 Effects Of Time
When the search criterion is set to title, the results are not always stellar. Beyond the examples shown in Table 1, the results also degrade as the data ages. In an attempt to simulate this behavior, we collected sets of articles (and the corresponding meta-data files) each day for seven days. On the eighth day, we began running the title searches on the file sets for each day. Figure 4 illustrates the deterioration of the hit rate as the data ages. The primary reason for this behavior is that Yahoo! News stops caching some articles after a specified period of time. Although this isn't happening within a seven-day span, this does show that on any given day, a person may pick up an article in print and potentially be limited in the number of days they could successfully search for the article.
Figure 2: The average URL depth and average hit rate for Yahoo! News searches performed using a variable number of characters (5 to 50) from the article's text body.

Figure 3: The average URL depth for Yahoo! News searches using a variable number of characters from the article's text body (blown up from Figure 2).
4.4 Analyzing Misses
When an article title was combined with other attributes, we were surprised to see the dramatic decrease in hit rate. For instance, when an article's author(s) is combined with its title, the hit rate falls from 98% to 61%, as shown in Table 1. In the latter case, the 39% of the articles that constitute "misses" (where the article could not be found) were analyzed to try to understand why the misses occur. It was discovered that the success of searches containing the author depends almost exclusively on the source of the article. The publishers responsible for the majority of the misses are Reuters, the New York Daily News, and the New York Post. Reuters was the publisher for 82% of all misses in the title/author search. In another experiment, in which strings from the text body were combined with the author, Reuters was the publisher of 78% of those articles that could not be found. As for the 61% of the articles that could be found, 76% of those hits came from the AP. Some of the other publications that performed well with author searches included The Sporting News, The Daily Herald, Business Week, and the San Jose Mercury News. It is not clear what makes these publishers more "author friendly". We could not find anything inherent in the HTML documents themselves to indicate that in some cases the author is part of the article's indexed text and in others it is not. In each case, the author has clear markup so that it stands out when rendered by the browser. The answer lies somewhere in the interface between Yahoo! News and the publishers themselves.
5 Validation
In an effort to demonstrate that the results presented here extend beyond the use of the Yahoo! News engine, the hit rate experiment shown in Figure 2 was repeated with the Excite News search engine using Perl scripts analogous to fetchit and matchit. Figure 5 compares the Yahoo! News results with those generated using Excite's News service. The performance of Yahoo! News is clearly superior, but Excite News also produces good hit rates if a long enough search string is used. Despite the mixed results produced by Excite News, its high hit rate with longer search strings leads us to believe that this model could be extended to use search engines other than Yahoo! News. Therefore, we assert that the implementation of this scheme does not rely solely on the service provided by Yahoo! News.
In addition to the automated experiments, we were able to construct a prototype service and generate experiments that resembled a real-world situation involving a handheld pen scanner. Using the Quick Link™ pen from Wizcom Technologies, a group of articles from a printed copy of The New York Times was used to test the success rate of various text scans. The matchit Perl script was adapted into a cgi-bin script to complete the real-world setting as well as to ease the implementation of the pen experiments. A web page on the department's web server was created which had a text input form and an action button that sent the text to the matchit cgi script. The Quick Link™ pen can be configured such that scans are transferred directly into the text box of a browser window³. Once the text is there, the action button is pressed and a list of potential web links is returned. Among the feasibility experiments run earlier, only the text-body experiment was duplicated with the pen scanner. This was due to two factors. First, the font sizes of the article titles in The New York Times were too large for the Quick Link™ scanner to capture; this prevented any experiments involving the article title. Second, the process is labor intensive, which is what led us to the earlier methodology in the first place.

³ The Quick Link™ pen scanner can also be configured so that each individual scan makes up a unique "note file". All notes can be uploaded at the same time and posted to the cgi script individually.
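A minimal sketch of such a cgi-bin wrapper is shown below, assuming the query-and-parse logic of Section 3.2 is factored into a hypothetical search_and_match routine; it is an illustration of the shape of the adaptation, not the script actually used.

    #!/usr/bin/perl
    # Sketch of the matchit cgi-bin wrapper: read scanned text from the web
    # form and return candidate article links. search_and_match() stands in
    # for the query/parse logic described in Section 3.2.
    use strict;
    use warnings;
    use CGI;

    my $q    = CGI->new;
    my $text = $q->param('text') // '';

    print $q->header('text/html'),
          "<html><body><h2>Possible matches</h2><ul>\n";
    for my $url (search_and_match($text)) {    # hypothetical helper
        print qq{<li><a href="$url">$url</a></li>\n};
    }
    print "</ul></body></html>\n";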
Figure 4: The hit rate for a Yahoo! News search using article titles, plotted against article age in days. The results progressively get worse as the titles become older.
Figure 5: The hit rate for text string searches using both Yahoo! News and Excite News, as a function of the number of characters in the search string.
Therefore, we scanned portions of ten different articles. Each article had three non-overlapping text-body selections scanned and uploaded into the web browser displaying our cgi form. Each individual scan was one string that stretched an entire column of the article; this is a logical start point and stop point for a user trying to mark a particular article. The scans ranged in length from 27 to 29 characters (not including spaces), which resulted in 5 to 8 words for each scan.
Among the ten articles scanned, five were successfully retrieved from the web. In each of these five cases, all three scans were successful with a URL depth of one. Overall, the 50% hit rate was disappointing. However, after more investigation it was discovered that Yahoo! News was not caching the articles that constituted misses. No matter what the search criteria may have been, those articles would not be found. This may sound disappointing, but this is just one news service available. There are other services - which come at some expense - that index nearly every publisher in the United States. Therefore, every article can be found given the right search engine with minimal query data. The contribution of this paper is that it demonstrates the characteristics and properties of accurate web searches given limited query data, common in a disconnected application.
6 Future Work

Future work in this area will need to address two primary questions. First, how can the services described here be integrated into existing hardware? Second, can this technology be applied to the retrieval of other forms of media besides print material?
6.1 Integrating Document Retrieval Service
One possible solution for the implementation of this document retrieval scheme is a pen-sized text-scanning device such as the Quick Link™ described in Section 5. A pen device has many desirable features. They are small, lightweight, and use standard communication interfaces to easily transfer data to and from PCs and PDAs. One drawback of pen devices is the relatively steep price when compared to the eMarker, which retailed for $20. Another alternative would be to incorporate small scanning devices into existing, well-established technologies such as PDAs and cellular telephones. Consumers may be more likely to use a service like this if it were available through devices that they are already likely to own. Another possible approach would be to use a dictation device that records whatever notations are spoken into it. Using voice recognition, article attributes such as title and author could be identified and sent to a search engine to retrieve the desired information.
6.2 Alternative Data Retrieval Applications
The experiments described in this paper addressed the retrieval of documents found in periodicals. Work should be done to see if a variation of this scheme could be used for retrieval of other types of media, such as music. The proliferation of online music has made it feasible to locate music titles and artists without the use of a time-stamped database like the one used by the eMarker. A "smart" eMarker device would rely on recorded music samples as a means to locate the song's album and artist. Using one of the popular music sharing services like AudioGalaxy or Gnutella, it could be possible either to search for a string of music notes, or to use voice recognition technologies to gather song lyrics and use them in a search query.
7 Conclusion

As more traditional media producers make their product available in electronic form, the distinction between print media and web media begins to blur. One of the benefits of this trend is the ability to locate information whenever a user is connected to the network. However, many media consumers still lack tools to help them easily manage personal notations as they transition from their disconnected lives into their connected ones. The waiting room of a doctor's office or a city bus ride will many times become a playground of information as people become exposed to new genres or topics which spark interest. Through the use of paper scraps, most people manage to mark information they would like to revisit later. This paper addresses an alternative approach to simplify the interaction between an individual and information on the web.
In this feasibility study we have shown that it is possible to ease the transition through the use of small text scanning devices already available commercially. The best results have been seen through the use of article titles as the primary search query. If one is able to scan the title, they will be able to retrieve the same article electronically over 90% of the time. If, while riding the bus, one is only able to find the last page of a torn newspaper article, we have shown that the prospect of retrieving that article in its entirety is still very promising; Yahoo! News had a 91% success rate when searching with a string of fifteen characters from an article. We have shown that such a service is viable, and the only roadblock we see to the adoption of this technology is the breadth at which search engines are able to crawl and cache traditional print media.
Acknowledgments

The authors thank David DeWitt for inspiring the ideas proposed in this work as well as for his helpful discussions concerning the direction of this project.
References
[1] Brin, S. and L. Page. "The Anatomy of a Large-Scale Hypertextual Web Search Engine". WWW-7, April 1998.
[2] Hawking, D., N. Craswell, P. Thistlewaite, and D. Harman. "Results and Challenges in Web Search Evaluation". WWW-8 Proceedings, 1999.
[3] Hu, W., Y. Chen, M. S. Schmalz, and G. X. Ritter. "An Overview of World Wide Web Search Technologies". In Proceedings of the 5th World Multi-Conference on Systemics, Cybernetics and Informatics, SCI 2001, pages 356-361, Orlando, Florida, July 22-25, 2001.
[4] Kobayashi, M. and K. Takeda. "Information Retrieval on the Web". ACM Computing Surveys, 32(2), pp. 144-173, 2000.
[5] Marchiori, M. "The Quest for Correct Information on the Web: Hyper Search Engines". Proceedings of the Sixth International World Wide Web Conference (WWW6), Santa Clara, CA, 1997.
[6] Mendelzon, A., G. Mihaila, and T. Milo. "Querying the World Wide Web". In Proceedings of PDIS'96, December 1996.
[7] Sony eMarker. http://www.emarker.com
[8] Yahoo! News. http://news.yahoo.com