DIRS: Disconnected Information Retrieval System
Gregory Tracy
Department of Computer Sciences
University of Wisconsin, Madison
1210 West Dayton Street, Madison, WI 53706
gtracy@cs.wisc.edu
Patrick Votruba
Department of Computer Sciences
University of Wisconsin, Madison
1210 West Dayton Street, Madison, WI 53706
votruba@cs.wisc.edu
Abstract
The World Wide Web gives individuals access to huge amounts of data. This includes access to information also found in traditional formats such as news copy. This study addresses a desire to blend these two mediums in such a way that media consumers can move transparently from a hard copy of a given article to an electronic copy. Document retrieval experiments were performed in an attempt to determine the feasibility of implementing a handheld scanning device used to mark traditional newspaper articles for subsequent online retrieval. Several thousand random articles were fetched from two popular news search services to emulate the scanning of print media also available online. Experiments were performed on these articles to quantify the success of searching with various article attributes. Query success is quantified by measuring whether or not the article is found, and how deep into the query results we must parse to locate the correct article. When searching on the title of a news article, it was retrieved correctly 98% of the time with an average depth of one. When searching for an article based on a randomly chosen, 30-character string, 92% of the articles were retrieved successfully with an average depth of two.
1 Introduction

The World Wide Web provides individuals with access to a tremendous amount of information. Although there are web search services that do a terrific job of locating specific items of interest, it can still be a difficult and time-consuming process. In addition, it requires that a person be connected to the network. Given that lifestyles have become more hectic, there exists a market for products and services to help manage this wealth of information. The information age has already spawned the use of cellular telephones and personal digital assistants (PDAs) to manage our daily lives. However, more can be done to bridge the divide between traditional sources of information and their more recent electronic counterparts.
People’s daily routines will cause them to shift from the connected world to the disconnected world and back again. There is a need for a service that can make this transition easier; a service which makes the disconnected moments feel more connected and the transitions back to the connected world less cumbersome. For instance, our quest for information does not stop when we step away from our computer. Walking down the street, we can be drawn to various data points that we may want to do more research on. A flyer on a telephone pole may advertise a concert we wish to attend or learn more about. A certain dish on a Thai restaurant menu may tease our curiosity about recipes for home. A visit to Home Depot may generate some home improvement thoughts. These moments of curiosity can be captured in many different ways, including a simple scrap of paper. But there is a need to make a transparent transition from these disconnected moments to a connected world without the manual transfer of thoughts from a notation device to an online data query.
Perhaps the best example of consuming information while disconnected is print media. Now that the vast majority of print media is available online via the World Wide Web, it should be possible to make note of an article printed in a periodical and easily find the corresponding digital version via one of the several Internet search engines. For instance, if someone finds a newspaper article of interest while waiting in an airport or a doctor's office, but doesn't have time to finish it, they could later use a search engine to retrieve the web-based version of the article. This scenario raises two primary questions: 1) What sort of information needs to be captured from the article so that it can be fed to a search engine for a reliable, automated retrieval? 2) Is there an electronic device capable of capturing the relevant information? This paper presents the results of experiments attempting to answer the first question while speculating on the feasibility of using a pen-sized scanning device to address the second question.
2 Background

The inspiration for this work was a Sony product called the eMarker [7]. The now discontinued eMarker was a small, inexpensive device that allowed users to "bookmark" songs they heard on the radio by pressing the lone button on the device. Users could later connect the device to their personal computer (PC) via a Universal Serial Bus (USB) connection. The eMarker retrieved song information by recording the time that the user pressed the device button. It then used a separate service, Broadcast Data Systems (BDS), which provides the play lists for over 1,000 radio stations. A user provides a set of favorite radio stations as part of the eMarker account preferences, which allows a radio station's play list to be easily searched using the time recorded by the device. This allows retrieval of the primary attributes, such as title, artist, and album, of the song that was playing at the time the user set the bookmark.
The eMarker was innovative in many ways. It relied on a simple device that could be purchased for a low cost ($19.95) and that interfaced with existing technology (personal computers with web browsers) via a standardized interface (USB). Furthermore, the eMarker's reliance on a pre-existing service (BDS) whose primary source of revenue was already well established minimized the overhead needed to provide the necessary services to its customers.
We propose that producing an analogous device for marking print material for later retrieval via an Internet search engine would be relatively straightforward. The device used to capture information could be a pen-sized scanning device, and the information from the bookmarked articles could then be fed to a web-based search engine.
In the next section, we present an approach using the Yahoo! News search engine [8]. Yahoo! News is a specialized search engine designed for retrieving news articles. The advantage of using a specialized news search engine is that it is updated with wire articles from the Associated Press and Reuters news services as soon as they are released; these articles are typically archived by Yahoo! News for two weeks. Yahoo! News also archives articles from other news sources, such as The New York Times, Business Week, and USA Today, for about one month. Although the number of publishers is limited, it is our belief that all print media will be available online in the not too distant future.
3 Methodology

The approach of our feasibility study parallels that of an actual implementation for a prospective retrieval service. If this service were to be successful, all manual data entry would need to be eliminated from the process of retrieving articles. Furthermore, the intensity and volume of running the large experiments in this study makes manual data entry prohibitive. To automate the process of performing web searches, we created Perl scripts that perform two primary tasks: 1) collect large numbers of news articles being stored on the web to be used as test cases, and 2) generate query strings based on the attributes of the articles collected, feed them into an online search engine, and check the results to determine whether the search successfully found the respective article.
3.1 Data Collection
In order to evaluate the success of searching with various search criteria, we first needed to collect a set of online articles. As we later determined, it was important to collect articles repeatedly to have a "fresh" bank of articles to test against. To simplify this process we developed a Perl script, fetchit, which takes a generic search topic as an argument and produces all of the online articles that match the search string. Under the hood of fetchit is an HTTP interface to the Yahoo! News web portal.
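A minimal sketch of such an interface is shown below. The query URL and its p= parameter are hypothetical placeholders (the exact Yahoo! News query syntax is not reproduced here), and the link extraction is deliberately crude; the sketch only illustrates the shape of the fetch step.

    #!/usr/bin/perl
    # Sketch of a fetchit-style fetcher. The search URL and its parameter
    # are hypothetical; a real implementation would use the news portal's
    # actual query syntax.
    use strict;
    use warnings;
    use LWP::UserAgent;
    use URI::Escape qw(uri_escape);

    my $topic = shift @ARGV or die "usage: fetchit.pl <search topic>\n";
    my $ua    = LWP::UserAgent->new(timeout => 30);

    my $url  = 'http://news.search.yahoo.com/search/news?p=' . uri_escape($topic);
    my $resp = $ua->get($url);
    die "fetch failed: ", $resp->status_line, "\n" unless $resp->is_success;

    # Pull candidate article URLs out of the result page with a crude
    # regular expression; a production script would parse the HTML properly.
    my @links = ($resp->decoded_content =~ m{<a\s+href="(http://[^"]+)"}gi);
    print "$_\n" for @links;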
Rather than focus on a particular publisher, we did not discriminate, and chose instead to treat all articles as if they came from print media and thus could be scanned with a prospective pen-scanning device. In terms of the experiments we performed, the source of the article is not as important as the fact that Yahoo! News or some other online search service is caching it. For example, one cannot locate an article online, no matter how much detail has been saved by a scanning device, if the search engine being used does not crawl the site that contains the article. As stated earlier, it is assumed that sometime soon, all print material will be available online and thus searchable by a news search engine such as Yahoo! News. Based on this argument, if Yahoo! News has cached an article, we assume that there is a print version that can be scanned.
Although we limited ourselves to the publishers cached by Yahoo! News, this is not an architectural issue associated with the prospective scanning device itself. As we show later, we were able to use Excite's news search engine to do similar experiments. When implemented, the searching can be extended to any number of online sources, including a publisher's own content site.
Each article retrieved from an online source is actually a content-rich web page with the article embedded within it. In order to simplify the experiments, each web page is parsed as the fetchit script reads it in. The HTML code is broken down into the pieces that make up a print version and fall under the following categories:
• Title
• Author
• Publisher
• Text body of the article
This information is stored in a unique file along with the URL where the article resides online. Collectively, this information makes up the meta-data our experiments can use to create search queries as well as to determine whether the search was successful or not. An example of a meta-data file is shown in Figure 1.
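A minimal sketch of the meta-data output step appears below, assuming the article attributes have already been extracted into a hash; the helper name and field handling are illustrative rather than the exact fetchit code.

    # Sketch: write one meta-data file per article in the tagged format
    # shown in Figure 1. The extraction itself is assumed done elsewhere.
    use strict;
    use warnings;

    sub write_meta_file {
        my ($path, $meta) = @_;   # $meta: hash ref with url/title/author/date/publisher/text
        open my $fh, '>', $path or die "cannot write $path: $!";
        print $fh "URL: $meta->{url}\n";
        print $fh "TITLE: $meta->{title}\n";
        print $fh "AUTHOR: $meta->{author}\n";
        print $fh "DATE: $meta->{date}\n";
        print $fh "PUBLISHER: $meta->{publisher}\n";
        print $fh "TEXT:\n$meta->{text}\n";
        close $fh;
    }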
When articles were collected and put into this meta-data format, a wide range of topics was chosen. From "Rudy Giuliani" to "Green Bay Packers football", anywhere from 50 to 200 articles were retrieved for each topic, totaling between 1500 and 2000 articles. With this process automated, we were able to focus our attention on the feasibility experiments.
3.2 Article Searching
Once we were able to collect large sets of news articles, it was much easier to generate scenarios to understand and measure the success or failure of searches that could be performed by users of this proposed service. The search mechanism is similar in nature to fetchit. We used a Perl script, matchit, which performed three tasks:
1) Determine search criteria: This is intended to simulate the choices a user might have when scanning print media. The search criterion is a combination of those article attributes stored in each meta-data file. By selecting one attribute, such as article title, or a set of attributes, we are able to experiment with various combinations and quantify their success in finding the correct article. For example, we may set up an experiment in which the search criterion is made up of the article's title and author. These two attributes are extracted from a meta-data file and used to construct a web search query. If the criterion includes a portion of the text body, we also specify the number of characters from the text to be used. As will be explained later in Section 4.2, simply specifying the number of consecutive characters was not enough information to produce successful search strings.
2) Apply search criteria: Once the search criteria have been determined for an experiment, they are applied to every article in our test bank. The matchit script reads in a meta-data file corresponding to a single article, extracts the data associated with the search criteria, concatenates a search string, and sends an HTTP request to an online news search service.
3) Parse search results: When the search engine returns its results, we break down the HTML code and examine only the URLs that pertain to actual search results. We compare these URLs against the article URL we store in the respective meta-data file, and look for a match. The success of the search is quantified in two ways: 1) Did we find the article? 2) If it was found, how deep into the search results did we have to comb before we found the link to the
URL: http://dailynews.yahoo.com/h/ap/20011209/sp/bba_red_sox_coaches_1.html
TITLE: Red Sox Talking to Herzog, Alou
AUTHOR: Jimmy Golen
DATE: December 09
PUBLISHER: AP
TEXT:
By JIMMY GOLEN, AP Sports Writer BOSTON (AP) - The Red Sox have approached former major league managers Whitey Herzog and Felipe Alou about serving as bench coaches alongside inexperienced Boston manager Joe Kerrigan………

Figure 1: One example of a meta-data file created by the fetchit script. Each line begins with a tag identifying an article attribute. In this example, the "TEXT" section has been cut short for space considerations.
article? We refer to the first metric as the hit rate. For example, if nine out of ten articles are successfully found in a given experiment, the hit rate is 90%. We refer to the second metric as the URL depth. If a search returns ten URLs, and we find the matching URL five links deep into the results list, the URL depth is five. This value is averaged over all article searches for a given experiment.
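Expressed as code, the two metrics reduce to a few lines; the following sketch (with an illustrative helper name, not taken from the matchit source) computes both from the per-article match positions.

    # Sketch: compute hit rate and average URL depth for one experiment.
    # @results holds, per article, the 1-based position of the matching URL
    # in the search results, or undef if the article was not found.
    sub score_experiment {
        my @results = @_;
        my @hits = grep { defined } @results;
        my $hit_rate = @results ? @hits / @results : 0;
        my $depth_sum = 0;
        $depth_sum += $_ for @hits;
        my $avg_depth = @hits ? $depth_sum / @hits : 0;
        return ($hit_rate, $avg_depth);
    }

    # Example: 9 of 10 articles found, each at depth 1 or 2.
    my ($rate, $depth) = score_experiment(1, 2, 1, 1, undef, 2, 1, 1, 1, 2);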
The matchit script repeats steps two and three for each article (meta-data file) we have collected. In the case of the searches using the text body as criteria, we loop over the set of articles five times and take the average of the results. This is because when selecting text body for a query, we seek to a random location in the file to use as a starting point. Averaging over several runs allows us to detect and avoid any abnormalities produced by the random function. As it turned out, there was very little variance amongst the five loop iterations.
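The outer averaging loop can be sketched as follows, where run_text_experiment is a hypothetical stand-in for one matchit pass over the article bank:

    # Sketch: average a text-body experiment over five runs to smooth out
    # the randomness of the chosen text offsets.
    my $RUNS = 5;
    my ($rate_sum, $depth_sum) = (0, 0);
    for my $run (1 .. $RUNS) {
        my ($rate, $depth) = run_text_experiment();   # hypothetical helper
        $rate_sum  += $rate;
        $depth_sum += $depth;
    }
    printf "avg hit rate %.1f%%, avg URL depth %.1f\n",
           100 * $rate_sum / $RUNS, $depth_sum / $RUNS;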
Given the ease with which we were able to set up experiments and evaluate results with the matchit script, we were able to focus our attention on the breadth of scenarios to test. What are some alternatives for recording information that give us the best opportunity for retrieving an article later? It is a very basic question, and one that we used to direct our experiments.
4 Experimental Results

All experiments were performed on Linux PCs in the Computer Sciences Department at the University of Wisconsin, Madison. The only apparent bottleneck of the scripts is the latency of the search engine servers and the Internet itself. It took between one and two seconds on average to successfully find an article we were targeting. Complete experiments typically ran for around thirty minutes.
4.1 Searching On Title
Perhaps the most unique characteristic of an article is its title. If one were able to scan the entire title of an article before walking away from the waiting room, the chances of finishing that article later are remarkable. As shown in the first row of Table 1, the correct article will be found 98% of the time. Furthermore, when the article is found, the correct URL will be in one of the first two returned by the search. This experiment was set up using the string found after the TITLE tag in each article's meta-data file. The exact title string was submitted to the search engine surrounded by quotation marks. If the quotation marks are removed, the success rate falls only slightly (less than 2%), but the URL depth degrades more significantly: it grows by a factor of four. While four doesn't seem like a huge number, that's four more articles (on average) that a user may have to visit before they find the one they intended to find.
Table 1 also lists the results of our experiments when we used various combinations of search criteria that included the title string. In these cases, each article attribute was surrounded by quotation marks and the combination was concatenated together. For example, given the article found in Figure 1, a search using title and author would produce the following query string¹:

"Red Sox Talking to Herzog, Alou" +"Jimmy Golen"

The results are mixed, but adding any other criteria to the title search consistently makes the search results worse. Interestingly, however, the URL depth is very consistent. If the article is in fact found, we will usually find it within the first two articles returned by the search engine. This is investigated in more detail in Section 4.4.

¹ The format of these strings is a subtlety of each search engine. Although we experimented with different combinations that were successful with Yahoo! News, the format may not yield similar results with other search engines. In fact, using any form of quotations with Excite News produces a hit rate of 0%.
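A query of this form can be assembled mechanically from the meta-data attributes. The sketch below follows the quoting convention shown above, which, per the footnote, is specific to Yahoo! News:

    # Sketch: build a quoted query string from selected meta-data attributes,
    # following the Yahoo! News convention used in the example above.
    sub build_query {
        my @attrs = @_;                          # e.g. (title, author)
        my ($first, @rest) = map { qq{"$_"} } @attrs;
        return join ' ', $first, map { "+$_" } @rest;
    }

    my $q = build_query('Red Sox Talking to Herzog, Alou', 'Jimmy Golen');
    # yields: "Red Sox Talking to Herzog, Alou" +"Jimmy Golen"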
Date experiments can be improved using a special interface to Yahoo! News. Currently, when the date is used in the search criteria, it gets passed to the search engine as if it were just another text string. It is clear, however, that these dates are not indexed by the search engine the way other words in the article are. Yahoo! News has added an interface which allows a query to specify a date (or range of dates) on which the article can be found. Although this is appealing when using Yahoo! News, the solution does not port well to other services and thus is not studied in great detail in this paper.
4.2 Searching On Text
Users may not always have the option of scanning an article title. Font sizes, graphics, and page layout may prevent this type of operation. One other unique characteristic of an article is a string of consecutive characters from within the article body. It is not uncommon, however, for an arbitrary string of characters to be found in multiple articles. Naturally, the longer the string is, the more likely it is that it is unique to one article. In an attempt to understand these properties, we conducted a set of experiments that searched for articles using a variable number of consecutive characters from the text body of the article.
Criteria    Hit Rate    URL Depth

Table 1: The hit rate and average URL depth for queries using Yahoo! News with combinations of article attributes.
The string is chosen through the use of the TEXT tag in the meta-data file. A random offset into the text body is chosen for each search. We seek to that location and then slide forward, if necessary, to find the first word boundary². The end point of the string is the start offset plus the specified number of characters, plus a variable number of characters, if necessary, to get to the first word boundary beyond that.

² It was discovered very early in the experiments that it is imperative to send search queries that fall on word boundaries. Even though a set of strings may be unique to an article, the search engines are indexing words and not random character strings.
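This selection procedure can be written down directly; the sketch below is illustrative (the helper name and edge-case handling are ours, not from the matchit source):

    # Sketch: pick ~$n characters from $text, starting at a random offset
    # and snapping both ends forward to word boundaries, as described above.
    sub pick_search_string {
        my ($text, $n) = @_;
        my $max = length($text) - $n;
        $max = 1 if $max < 1;
        my $start = int(rand($max));
        # Slide forward to the next word boundary if we landed mid-word.
        $start++ while $start > 0 && $start < length($text)
            && substr($text, $start - 1, 1) =~ /\S/;
        # Extend the end to the first word boundary beyond start + $n chars.
        my $end = $start + $n;
        $end = length($text) if $end > length($text);
        $end++ while $end < length($text) && substr($text, $end, 1) =~ /\S/;
        return substr($text, $start, $end - $start);
    }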
The number of characters we experimented with ranged from five to fifty, and the results are reported in Figure 2. The hit rate very quickly ramps up to 90% as the number of characters used to search grows from five to fifteen. In fact, based solely on the hit rate, there is little motivation to use more than fifteen characters from the article body to create a search query. As shown by Figure 2, the hit rate hits a plateau and sustains approximately a 92% success rate up to fifty-character search strings. We did not test strings longer than fifty characters.
This experiment emphasizes the importance of our second quantitative metric for success, the URL depth. As more characters are added to the search string, the range of our search depth decreases and approaches one. As was stated earlier, this is a very practical measurement of the search results. As the uniqueness of our search string grows, the number of articles returned by the query shrinks and a user is able to locate the correct article more quickly. Figure 3 isolates the average URL depth measurements. It illustrates that by jumping from fifteen to thirty characters, the search depth improves by a factor of three.
4.3 Effects Of Time
When the search criterion is set to title, the results are not always stellar. Beyond the examples shown in Table 1, the results also degrade as the data ages. In an attempt to simulate this behavior, we collected sets of articles (and the corresponding meta-data files) each day for seven days. On the eighth day, we began running the title searches on the file sets for each day. Figure 4 illustrates the deterioration of the hit rate as the data ages. The primary reason for this behavior is that Yahoo! News stops caching some articles after a specified period of time. Although this isn't happening within a seven-day span, this does show that on any given day, a person may pick up an article in print and potentially be limited in the number of days they could successfully search for the article.
Figure 2: The average URL depth and average hit rate for Yahoo! News searches performed using a variable number of characters (5 to 50) from the article's text body.

Figure 3: The average URL depth for Yahoo! News searches using a variable number of characters from the article's text body (blown up from Figure 2).
4.4 Analyzing Misses
When an article title was combined with other attributes, we were surprised to see the dramatic decrease in hit rate. For instance, when an article's author(s) is combined with its title, the hit rate falls from 98% to 61%, as shown in Table 1. In the latter case, the 39% of the articles that constitute "misses" (where the article could not be found) were analyzed to try to understand why the misses occur. It was discovered that the success of searches containing the author depends almost exclusively on the source of the article. The publishers responsible for the majority of the misses are Reuters, the New York Daily News, and the New York Post. Reuters was the publisher for 82% of all misses in the title/author search. In another experiment, in which strings from the text body were combined with the author, Reuters was the publisher of 78% of those articles that could not be found. As for the 61% of the articles that could be found, 76% of those hits came from the AP. Some of the other publications that performed well with author searches included The Sporting News, The Daily Herald, Business Week, and the San Jose Mercury News. It is not clear what makes these publishers more "author friendly". We could not find anything inherent in the HTML documents themselves to indicate that in some cases the author is part of the article's indexed text and in others it is not. In each case, the author has clear markup so that it stands out when rendered by the browser. The answer lies somewhere in the interface between Yahoo! News and the publishers themselves.
5 Validation
In an effort to demonstrate that the results presented here extend beyond the use of the Yahoo! News engine, the hit rate experiment shown in Figure 2 was repeated with the Excite News search engine using Perl scripts analogous to fetchit and matchit. Figure 5 compares the Yahoo! News results with those generated using Excite's News service. The performance of Yahoo! News is clearly superior, but Excite News also produces good hit rates if a long enough search string is used. Despite the mixed results produced by Excite News, its high hit rate with longer search strings leads us to believe that this model could be extended to use search engines other than Yahoo! News. Therefore, we assert that the implementation of this scheme does not rely solely on the service provided by Yahoo! News.
In addition to the automated experiments, we were able to construct a prototype service and generate experiments that resembled a real-world situation involving a handheld pen scanner. Using the Quick Link™ pen from Wizcom Technologies, a group of articles from a printed copy of The New York Times was used to test the success rate of various text scans. The matchit Perl script was adapted into a cgi-bin script to complete the real-world setting as well as to ease the implementation of the pen experiments. A web page on the department's web server was created which had a text input form and an action button that sent the text to the matchit cgi script. The Quick Link™ pen can be configured such that scans are transferred directly into the text box of a browser window³. Once the text is there, the action button is pressed and a list of potential web links is returned. Among the feasibility experiments run earlier, only the text-body experiment was duplicated with the pen scanner. This was due to two factors. First, the font sizes of the article titles in The New York Times were too large for the Quick Link™ scanner to capture; this prevented any experiments involving the article title. Second, the process is labor intensive, which is what led us to the earlier methodology in the first place.

³ The Quick Link™ pen scanner can also be configured so that each individual scan makes up a unique "note file". All notes can be uploaded at the same time and posted to the cgi script individually.
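A minimal sketch of such a cgi-bin wrapper is shown below, assuming the query-and-parse logic of Section 3.2 is factored into a hypothetical search_and_match routine; it is an illustration of the shape of the adaptation, not the script actually used.

    #!/usr/bin/perl
    # Sketch of the matchit cgi-bin wrapper: read scanned text from the web
    # form and return candidate article links. search_and_match() stands in
    # for the query/parse logic described in Section 3.2.
    use strict;
    use warnings;
    use CGI;

    my $q    = CGI->new;
    my $text = $q->param('text') // '';

    print $q->header('text/html'),
          "<html><body><h2>Possible matches</h2><ul>\n";
    for my $url (search_and_match($text)) {    # hypothetical helper
        print qq{<li><a href="$url">$url</a></li>\n};
    }
    print "</ul></body></html>\n";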
Figure 4: The hit rate for a Yahoo! News search using article titles, plotted against article age in days. The results progressively get worse as the titles become older.
Figure 5: The hit rate for text string searches using both Yahoo! News and Excite News, as a function of the number of characters in the search string.
Therefore, we scanned portions of ten different articles. Each article had three non-overlapping text-body selections scanned and uploaded into the web browser displaying our cgi form. Each individual scan was one string that stretched an entire column of the article; this is a logical start point and stop point for a user trying to mark a particular article. The scans ranged in length from 27 to 29 characters (not including spaces), which resulted in 5 to 8 words for each scan.
Among the ten articles scanned, five were successfully retrieved from the web. In each of these five cases, all three scans were successful with a URL depth of one. Overall, the 50% hit rate was disappointing. However, after more investigation it was discovered that Yahoo! News was not caching the articles that constituted misses. No matter what the search criteria may have been, those articles would not be found. This may sound disappointing, but this is just one news service available. There are other services - which come at some expense - that index nearly every publisher in the United States. Therefore, every article can be found given the right search engine with minimal query data. The contribution of this paper is that it demonstrates the characteristics and properties of accurate web searches given limited query data, common in a disconnected application.
6 Future Work

Future work in this area will need to address two primary questions. First, how can the services described here be integrated into existing hardware? Second, can this technology be applied to the retrieval of other forms of media besides print material?
6.1 Integrating Document Retrieval Service
One possible solution for the implementation of this document retrieval scheme is a pen-sized text-scanning device such as the Quick Link™ described in Section 5. A pen device has many desirable features. They are small, lightweight, and use standard communication interfaces to easily transfer data to and from PCs and PDAs. One drawback of pen devices is the relatively steep price when compared to the eMarker, which retailed for $20. Another alternative would be to incorporate small scanning devices into existing, well-established technologies such as PDAs and cellular telephones. Consumers may be more likely to use a service like this if it were available through devices that they are already likely to own. Another possible approach would be to use a dictation device that records whatever notations are spoken into it. Using voice recognition, article attributes such as title and author could be identified and sent to a search engine to retrieve the desired information.
6.2 Alternative Data Retrieval Applications
The experiments described in this paper addressed the retrieval of documents found in periodicals. Work should be done to see if a variation of this scheme could be used for retrieval of other types of media, such as music. The proliferation of online music has made it feasible to locate music titles and artists without the use of a time-stamped database like the one used by the eMarker. A "smart" eMarker device would rely on recorded music samples as a means to locate the song's album and artist. Using one of the popular music sharing services like AudioGalaxy or Gnutella, it could be possible either to search for a string of music notes, or to use voice recognition technologies to gather song lyrics and use them in a search query.
7 Conclusion

As more traditional media producers make their product available in electronic form, the distinction between print media and web media begins to blur. One of the benefits of this trend is the ability to locate information whenever a user is connected to the network. However, many media consumers still lack tools to help them easily manage personal notations as they transition from their disconnected lives into their connected ones. The waiting room of a doctor's office or a city bus ride will many times become a playground of information as people become exposed to new genres or topics which spark interest. Through the use of paper scraps, most people manage to mark information they would like to revisit later. This paper addresses an alternative approach to simplify the interaction between an individual and information on the web.
In this feasibility study we have shown that it is possible to ease the transition through the use of small text scanning devices already available commercially. The best results have been seen through the use of article titles as the primary search query. If one is able to scan the title, they will be able to retrieve the same article electronically over 90% of the time. If, while riding the bus, one is only able to find the last page of a torn newspaper article, we have shown that the prospect of retrieving that article in its entirety is still very promising; Yahoo! News had a 91% success rate when searching with a string of fifteen characters from an article. We have shown that such a service is viable, and the only roadblock we see to the adoption of this technology is the breadth at which search engines are able to crawl and cache traditional print media.
Acknowledgments

The authors thank David DeWitt for inspiring the ideas proposed in this work as well as for his helpful discussions concerning the direction of this project.
References
[1] Brin, S. and L. Page. "The Anatomy of a Large-Scale Hypertextual Web Search Engine". WWW-7, April 1998.
[2] Hawking, D., N. Craswell, P. Thistlewaite, and D. Harman. "Results and Challenges in Web Search Evaluation". WWW-8 Proceedings, 1999.
[3] Hu, W., Y. Chen, M. S. Schmalz, and G. X. Ritter. "An Overview of World Wide Web Search Technologies". In Proceedings of the 5th World Multi-Conference on Systemics, Cybernetics and Informatics, SCI 2001, pages 356-361, Orlando, Florida, July 22-25, 2001.
[4] Kobayashi, M. and K. Takeda. "Information Retrieval on the Web". ACM Computing Surveys, 32(2), pp. 144-173, 2000.
[5] Marchiori, M. "The Quest for Correct Information on the Web: Hyper Search Engines". Proceedings of the Sixth International World Wide Web Conference (WWW6), Santa Clara, CA, 1997.
[6] Mendelzon, A., G. Mihaila, and T. Milo. "Querying the World Wide Web". In Proceedings of PDIS'96, December 1996.
[7] Sony eMarker. http://www.emarker.com
[8] Yahoo! News. http://news.yahoo.com