
QuantNet – A Database-Driven Online Repository of Scientific Information

A Master’s Thesis Presented

by Anton Andriyashin

(188779) to

CASE – Center of Applied Statistics and Economics

Humboldt University, Berlin

in partial fulfillment of the requirements

for the degree of Master of Science

Berlin, June 20, 2007


Declaration of Authorship

I hereby confirm that I have authored this Master’s thesis independently and without use of resources other than those indicated. All passages that are taken literally or in substance from publications or other resources are marked as such.

Anton Andriyashin

Berlin, June 20, 2007

1.1 Motivation

1.2 QuantNet: A Look Inside

1.3 An Online Repository of Information

1.4 What Is Wrong With Regular HTML Publishing?

2 Single Document Setup

2.1 Typical Structure of a Submitted ASCII File

2.2 What is XML?

2.3 XML and XSLT – A Single Document in HTML

2.4 ASCII to XML: Atox and XSLT

3 Multiple Documents Setup

3.1 From a Single Document to Multiple Documents

3.2 mySQL and PHP

3.3 Javascript, CSS and PHP

3.4 Putting Everything Together

4.1 Scalability – User-defined Tags

4.2 Ease of Administration

4.3 Ways to Make QuantNet Even More Powerful

4.4 Concluding Remarks

Already in the 1980s the OECD recognized the importance of information as an asset in the global economy [10; 11]. It has been using Porat's definition of an information economy [12]: one where at least 50% of the GNP is produced in the so-called primary or secondary information sectors, i.e. sectors that employ information goods and services directly in production, distribution or information processing, or information services produced for internal consumption by companies that do not produce information for sale, and by government [7]. Today information is becoming one of the most valuable assets in the world economy.

New information technologies are able to broaden horizons and tackle traditional challenges in unexpected ways. Consider, for instance, the Hypertext Markup Language (HTML). Its first published specification was drafted by Berners-Lee with Dan Connolly and was published in 1993 by the IETF [5], and already in 2000 HTML became an international standard (ISO/IEC 15445:2000). This language offered a new way of navigating content: the reader can switch quickly between author-defined parts of a document via so-called hyperlinks. Having recognized the many advantages of modern IT, many "offline" journals and magazines established an online presence with unique ways of delivering information that paper-based editions lack. High-resolution photographic materials, video content, audio podcasts, quick search, archives and links to other resources: these are just a few of the comme il faut elements that almost any high-end online edition has to offer today. In the last 15 years HTML has become the cornerstone of the whole Internet, which has changed the lifestyle of the majority of people on the planet.

And HTML is just one example. New markup languages like the Extensible Markup Language [...] style instructions (as opposed to, for instance, font color and many other similar properties in HTML). Along with Extensible Stylesheet Language Transformations (XSLT), XML can be transformed into another XML document, ASCII, HTML or even a PDF file. It is therefore no accident that many web sites employ a data-driven approach to construction and are able to provide automatic updates based on new incoming information in real time. Just imagine an online weather forecast service that receives remote data (in XML format) from a research center and is able to update the site with almost no delay. If HTML were used instead of XML and XSLT, the update would take much longer because of the manual corrections required inside the HTML code. And since XML is now supported by many prominent software applications, e.g. Oracle, Informix, mySQL or Microsoft Office, not to mention the support of XML by many programming languages like PHP, Python and Perl, the potential for employment of XML in the Internet is [...]

The aim of this work is to provide a semi-automated core called QuantNet that allows significant amounts of scientific information to be published online in a setting where regular updates are expected and, most importantly, where the authors of submitted materials are not assumed to know any markup language, i.e. the materials can be submitted as ASCII files with the simplest structure.

It is not the ultimate goal of this study to provide a ready-to-ship commercial web application. Instead, the implementation of the core and the examination of its possibilities and limitations are of particular interest. At the same time, only minor effort should later be needed to deploy a full-scale online system based on the created core.

Consider a virtual example of a project or a procedure that is about to be submitted online via QuantNet. A typical ASCII file could look as follows:

The first part of the file contains some general information about the project like its name, author and so on, while the second part may contain a detailed description and/or computer code. As can be seen from the listing, the ASCII file does not contain any language-specific markup tags. The only tags employed are natural field descriptors, preceded by the @ symbol. The author of the submitted document does not have to care about auxiliary properties like font size, color, family and so on. The only thing required is to follow the sample structure.

But what is the next step? How is this ASCII file to be transformed into a well-formed HTML file that will ultimately be rendered by the client’s browser? Several steps have to be taken, but the crucial one is to transform the data from the ASCII file into an XML file that could look as follows:

At the same time, advanced users of QuantNet are supposed to profit from the full range of possibilities offered by native HTML, XML and XSLT, so inline tags, if present, should be processed adequately. For instance, if the <bold> tag is allowed in the ASCII file and stands for its HTML counterpart <b>, then QuantNet should be able to process the following ASCII file adequately:

The file is screened for the <bold> tag, which is substituted with the valid HTML tag <b> for bold text style.
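As a rough illustration of this screening step, the following Python sketch substitutes user-level inline tags with their HTML counterparts. It is not QuantNet's actual implementation: the function name substitute_tags, the TAG_MAP dictionary and the sample sentence are assumptions made for this example, and only the <bold> to <b> correspondence comes from the text.

```python
import re

# Hypothetical mapping from user-level inline tags to HTML tags.
# Only <bold> -> <b> is taken from the text; other entries would be added the same way.
TAG_MAP = {"bold": "b"}


def substitute_tags(text, tag_map=TAG_MAP):
    """Replace opening and closing user-level tags with their HTML counterparts."""

    def swap(match):
        slash, name = match.group(1), match.group(2)
        html_name = tag_map.get(name.lower())
        if html_name is None:
            return match.group(0)  # leave unknown tags untouched
        return f"<{slash}{html_name}>"

    # Matches simple tags without attributes, e.g. <bold> or </bold>.
    return re.sub(r"<(/?)([A-Za-z][A-Za-z0-9]*)>", swap, text)


print(substitute_tags("This procedure is <bold>very</bold> fast."))
# prints: This procedure is <b>very</b> fast.
```

Keeping the correspondence in a single dictionary means that allowing a further tag only requires one more entry, while tags that are not listed pass through unchanged.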

And, of course, extra tags should by no means be limited to the markup group. In principle even MathML, when supported by the browser (e.g. Mozilla Firefox), should be rendered adequately inside, say, the <math> tag. That can be very handy for documents containing a lot [...] a complete and scalable content system like QuantNet.

In this work the representation of the content is left solely to XSLT, while string manipulation accounts only for the preparation of the necessary raw data files in XML. The logic of Textile, for instance, is employed exactly at this stage – while creating XML files out of submitted ASCII files – but with one key difference: no style options that would later appear inside the HTML code are considered.

Part 2 focuses on the implementation issues in the single document framework. Section 2.1 introduces the structure of ASCII files recommended by QuantNet. XML is presented in Section 2.2 in the context of a dynamic weather forecasting web application. Section 2.3 focuses on the conjunction of XSLT and XML and discusses the issues that would arise in implementing a multiple document web application if HTML were the only language employed.

Part 3 primarily concerns the multiple document nature of QuantNet. Section 3.1 provides a short overview of the implementation tools necessary to deploy a web application of this type. Section 3.2 introduces mySQL, a popular database management system for online applications, and PHP, a scripting language that is mostly used on the server side. Later, in Section 3.3, the motivation for Javascript as a client-side scripting language in conjunction with PHP and CSS is provided. A step-by-step overview of the implementation of QuantNet, available in Section 3.4, concludes Part 3.

Finally, Part 4 focuses on the potential of QuantNet as a scalable web application. Section 4.1 concentrates on the ability of QuantNet to handle a potentially unlimited number of additional tags in ASCII files. Several useful applications of this feature as well as the implementation logic are provided there. Section 4.2 considers the process of adding a new project to QuantNet or changing the application’s content structure. Validation by means of XML schemas and analytic grammar, based on Backus-Naur form, as well as scripting are discussed in Section 4.3.


1.3 An Online Repository of Information

A typical application of QuantNet could be an online interdisciplinary repository of research materials submitted by various parties, from professional researchers to university students. These materials could contain not only results and algorithm descriptions, which is the traditional form of almost any publication, but also source code, when available, as well as other supplementary data if the author wishes.

If the target institution is a big organization like a leading university or a research center, then in many cases different departments introduce their own web-publishing standards and may have different hosting. That could create additional obstacles for an end-user of the content, who may be unsure where to find this content and, more importantly, may wonder whether relevant research results are available at some other department.

Therefore, the aim of QuantNet is to introduce a centralized system made up of documents from different scientific areas submitted by various departments. Centralized content management not only eases navigation for the end-user but also provides significant advantages in terms of administration. All documents are located on the same server, they share the same predefined structure, they are easy to catalog, and if there is a decision to change the structure of QuantNet in some way, for instance to introduce a new version of the layout, these changes can be applied automatically and will not affect the original submitted ASCII files.

Every publishing entity like a journal has its own styling instructions. The same applies to web publishing. The aim of QuantNet is to avoid any prerequisites that come from the markup field. Instead, QuantNet imposes only a few restrictions on the original ASCII data files with submitted projects, so that each of them contains the author’s name, the name of the project etc.; refer to Table 2 for more details. And that is all! A researcher need not worry about what font size to employ for a certain heading unless he or she knows specific HTML tags well enough to take advantage of them.

In this sense the submitted ASCII files are normally to contain only data and a minimum number of (or no) markup tags. This is the fundamental feature of QuantNet: a user supplies a structured data file, and QuantNet semi-automatically processes this file and incorporates it into the proper cell of the system. The plain data ASCII file becomes a well-formed HTML document with adequate graphic elements and navigation tools.

Structured fields like @Author, @Name or @Area inside ASCII files could potentially lead to an effective search mechanism inside QuantNet. Although it is not a goal of this study to introduce elements like that, this possibility is important to mention.

While the advantages of submitting a research study as a data ASCII file may be clear for a person who does not know HTML for online publication, several administration aspects, which may be not so obvious, are worth mentioning here.

Suppose the author of the project to be published online has the file in HTML format. Does that automatically mean that this person knows HTML? Not necessarily. Even Microsoft Word, one of the most widely used word processors, can save its output as an HTML file [8]. LaTeX2HTML is another solution for those preferring LaTeX to Word.

So what is wrong with HTML as a submission format, or even with Microsoft Word output that can later be converted to HTML? If there is a single document to be published, probably nothing is wrong. If there are style and/or document structure prerequisites, they can be met. However, in the multiple documents setup several problems do arise.

Imagine that, for instance, a new version of the graphical design of the web application is to be introduced. If, say, there are 500 HTML documents contained in the system, each of them must be changed one by one! Not to mention the problems of navigation across these individual files and the difficulty of introducing content-driven dynamic functions like automatic generation of links to auxiliary materials given, for instance, the project name.

Figure 1: BBC Weather web page: navigation menu example

Would it not be better if the user intending to give the project a name had to type something [...]

And this is only one element, the name. An HTML document with rich formatting contains dozens of such elements. If one of them is to be changed, then all the documents in the system have to be updated.

And what about navigation? Assuming that HTML frames are not employed, following the recommendations of leading web designers, there is no easy solution for navigation elements in a multiple documents setup.

In Figure 1 the BBC Weather web site is considered as an example of a system that provides navigation to elements created out of raw XML data in real time. What matters is that the left part of the page contains some fixed links like Weather Home, UK, World, Sports, Coast and Sea, Climate Change and others, connecting different documents into a well-formed weather forecasting portal. Working with a single page does not restrict the end-user in any way: the links to other areas are always available. At the same time the current document does not contain any explicit links to other parts of the portal because these links are created on the fly. If plain HTML were employed, the left navigation pane would be unavailable unless every page were post-processed manually to add these controls, a real nightmare for an administrator.
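The idea of creating the navigation on the fly can be illustrated with a minimal Python sketch in which a single function wraps every document in the same layout. This is only an analogy under stated assumptions: the function render_page, the NAV_LINKS list, its URLs and the sample page bodies are hypothetical and do not describe how the BBC site or QuantNet is actually built.

```python
from html import escape

# Hypothetical navigation entries; the labels echo the BBC Weather example,
# the URLs are placeholders invented for this sketch.
NAV_LINKS = [("Weather Home", "/"), ("UK", "/uk"), ("World", "/world")]


def render_page(title, body_html, nav_links=NAV_LINKS):
    """Wrap one document's content in the shared layout with the navigation pane."""
    nav_items = "\n".join(
        f'    <li><a href="{escape(url, quote=True)}">{escape(label)}</a></li>'
        for label, url in nav_links
    )
    return (
        "<html>\n"
        f"<head><title>{escape(title)}</title></head>\n"
        "<body>\n"
        f'  <ul class="nav">\n{nav_items}\n  </ul>\n'
        f'  <div class="content">{body_html}</div>\n'
        "</body>\n"
        "</html>"
    )


# Every page receives the same navigation pane without being edited by hand.
pages = {
    day: render_page(day, f"<p>Forecast for {day}</p>")
    for day in ("Tuesday", "Wednesday")
}
```

Changing NAV_LINKS once changes the pane on every generated page, which is the administrative advantage over hand-maintained HTML files.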

Fortunately XML, XSLT and PHP can provide a much more efficient solution in these terms. Before addressing these areas in more detail, let us have a closer look at what QuantNet gets as input: an ASCII file with raw project information.

Since every XML file normally contains only raw structured information, as in a database, the use of ASCII files with no markup elements fits XML perfectly in this sense.

The very basic structure of an ASCII file describing, say, some statistical procedure could look as follows. The first file block refers to project cataloging, i.e. it contains relevant information about authors, software platform, project stage and so on. The aim of this substructure is to present summarized information about the project in a compact form when it is being viewed by the end-user.

The second block contains the project itself with supplementary computer code for this particular example. Most of the information is located in the @desc field, while @input and @output refer solely to the algorithm implementation.

While XSLT is a style template applied to a given XML file, the XML itself can be generated out of the submitted ASCII file. This important aspect will be covered later in Section 2.4.

XML – the Extensible Markup Language – is a markup language with user-defined tags used for information management. While HTML is another markup language and XML files are used to create HTML output, there are several noticeable differences between these two languages.

Figure 2: BBC Weather web page: generation of dynamic HTML out of raw XML files with weather forecast data

First and foremost, HTML consists of a fixed set of allowed tags that are in charge of the proper representation of text and graphics on the web page. An XML file can contain any user-defined tag; in fact there are no predefined tags at all. How is that possible for a language? Since the aim of XML is just to structure the information and not to display it, this approach is quite natural, because it is impossible to make predefined templates for all or at least the vast majority of different information sets.

Let us continue with the example of the BBC Weather web page. In Figure 2 one can see the forecasts made for five days of the current week.

A part of the underlying XML file that is used to generate the forecasts could look as follows. Since the object of main interest is the forecast, the root tag can have the same name, <forecast>. Following the structure of the page in Figure 2, every forecast gets its own group tag like <tuesday> or <wednesday>. Inside these tags all numerical information appearing on the page can also be easily structured by imposing extra tags for each element, e.g. the forecasted sun index or the wind speed; this information may then be stored inside <sun_index> or <wind_sp>.

Figure 3: BBC Weather web page: Tuesday forecast

Tag             Description
<sunrise_h>     Hours part of the sunrise time
<sunrise_m>     Minutes part of the sunrise time
<sunset_h>      Hours part of the sunset time
<sunset_m>      Minutes part of the sunset time
<p_weather>     Predominant weather
<min_night>     Minimum temperature during the night
<wind_dir>      Wind direction
<r_humidity>    Relative humidity

Table 1: Description of some of the employed XML tags
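To make the structure described above concrete, the following Python sketch parses a small forecast fragment with the standard library. The fragment is an illustration, not the actual BBC data file: the underscore spelling of the tag names is an assumption, the sun index values 2 and 4 follow the Tuesday and Wednesday values mentioned in the text, and the wind directions are invented placeholders.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment following the <forecast>/<tuesday>/<sun_index> layout.
FORECAST_XML = """
<forecast>
  <tuesday>
    <sun_index>2</sun_index>
    <wind_dir>SW</wind_dir>
  </tuesday>
  <wednesday>
    <sun_index>4</sun_index>
    <wind_dir>W</wind_dir>
  </wednesday>
</forecast>
"""

root = ET.fromstring(FORECAST_XML.strip())

# Each child of the root is one day; its own children hold the raw values.
sun_index = {day.tag: int(day.findtext("sun_index")) for day in root}
print(sun_index)
# prints: {'tuesday': 2, 'wednesday': 4}
```

The file carries no presentation at all: nothing in it says how a sun index of 2 should look on the page, which is exactly the separation of data and display discussed here.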

Once again: unlike HTML, XML is not designed to present the stored content directly; however, the HTML code processed by the browser on the end-user’s computer is generated on the fly from XML. Although only part of the data from XML comes directly to the web page, e.g. the maximum day temperature or the wind speed, other elements like the predominant weather category trigger specific graphic objects to appear. Consider once again Figure 2 and the forecasts for different days of the week. There the predominant weather condition determines the image type employed to indicate that in a more obvious form: one can clearly distinguish between the weather states of Tuesday and Thursday, for instance. Analogously, graphic objects for the wind direction are put on the page according to the values stored in the <wind_dir> cell.

However, up to this moment it has not been described how the final HTML output comes to look as it does in, for instance, Figure 2. For example, why is the sign of the forecasted sun index, which is equal to two on Tuesday, green, while the same sign for Wednesday, where the sun index is expected to be equal to four, is yellow?

Different mechanisms could be employed to process the relevant information from XML. Naturally one could assume a tool that establishes a correspondence between the forecasted sun index and the graphical object to appear on the HTML page. While there exist different tools to employ in this situation, XSLT is the one that is examined in more detail next. PHP could also become such a tool if server-side processing is allowed: XSLT can produce the HTML output on a client machine, while PHP is mostly a server-side scripting language. PHP will be covered in more detail in Section 3.2, but from a perspective different from the one for XSLT.

Let us switch back to the setup of QuantNet as a system. Assuming that submitted ASCII files can effectively be transformed into well-formed XML documents (this challenge is described in Section 2.4), the analogy with the example presented in Figure 2 is straightforward. If XSLT is the way to get the HTML output from XML, then even ASCII files with no markup information can later be turned into richly formatted HTML documents!

Figure 4 summarizes this process. With the structure rules defined as in Section 2.1, one is able, first, to transform the project or algorithm documentation into an ASCII file with intuitive and clear fields and tags, if any. Later, when the ASCII file is represented as a well-formed XML document, HTML output is produced by applying an XSLT template to this XML object; that aspect will be covered in more detail in the next section.


Figure 4: From the project documentation to the final HTML output

As was mentioned before, XSLT is a powerful means to transform structured information from XML into a richly formatted representation like HTML or even PDF. XSLT is a set of stylesheet rules applied to specific portions of an XML document and resulting in the creation of different style elements that are frequently content-driven. For instance, if the processed portion of information from XML is a heading, then the XSLT template may apply the <h3> HTML tag.

Let us consider the following code example to understand the architecture of XSLT.

13 </xsl:template>

The part of the template presented here is the heading of the actual template employed in QuantNet. What matters is that it has a scalable structure: the third line starts to define one global template that matches the root of any XML file and thus covers all of its elements. The final HTML content is mostly generated through the recursive template application <xsl:apply-templates/> [9]. These templates are defined after the <xsl:template match="/"> element is closed.
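The recursive principle behind <xsl:apply-templates/> can be mimicked in a few lines of Python. The sketch below is an analogy rather than XSLT and not QuantNet's template: the RULES dictionary, the function render_element and the sample <project> document are assumptions made for illustration. Each element is rendered by its own rule, if one exists, and its children are then processed by the same function.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical rules: XML element name -> HTML tag to wrap its content in.
RULES = {"name": "h3", "desc": "p"}


def render_element(element):
    """Render one element and, recursively, all of its children."""
    inner = escape(element.text or "")
    for child in element:
        # Recursion plays the role of <xsl:apply-templates/>.
        inner += render_element(child) + escape(child.tail or "")
    html_tag = RULES.get(element.tag)
    # Elements without a rule contribute only their content.
    return f"<{html_tag}>{inner}</{html_tag}>" if html_tag else inner


document = "<project><name>Sample project</name><desc>A short description.</desc></project>"
page = "<html><body>" + render_element(ET.fromstring(document)) + "</body></html>"
print(page)
# prints: <html><body><h3>Sample project</h3><p>A short description.</p></body></html>
```

The outer wrapper corresponds to the global template for the document root, while the entries in RULES correspond to the templates defined after it.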

For instance, a summary table element (refer to Figure 9) defined in QuantNet’s XSLT file


Figure 5: Production of HTML output via XML and XSLT

As one can see, this is a very general table definition. Table styles are defined with the help of HTML/CSS, and the number of table elements is arbitrary. So how does XSLT know where to stop building the table? The <xsl:template match="head/*"> tag tells it to put into the table those elements of the XML file that are nested inside the <head> tag. In this way one can maintain a truly flexible structure of QuantNet, because if later some extra elements are to be added, for instance, to the summary table environment, it would suffice just to ensure that these elements are present in the XML file and properly nested.
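The effect of matching head/* can be sketched in Python as well. The example below is an illustration under assumptions, not the QuantNet stylesheet: the sample document, its field names and the function summary_table are hypothetical. The table receives one row per element nested in <head>, however many there are.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical project file; only the children of <head> belong to the summary table.
PROJECT_XML = """<project>
  <head>
    <name>Sample project</name>
    <author>A. N. Author</author>
    <area>Statistics</area>
  </head>
  <desc>Detailed description.</desc>
</project>"""


def summary_table(xml_text):
    """Build an HTML table with one row per element nested inside <head>."""
    root = ET.fromstring(xml_text)
    rows = []
    for field in root.findall("head/*"):  # same path as the XSLT match pattern
        label = escape(field.tag)
        value = escape((field.text or "").strip())
        rows.append(f"<tr><td>{label}</td><td>{value}</td></tr>")
    return "<table>" + "".join(rows) + "</table>"


print(summary_table(PROJECT_XML))
```

Adding a further element inside <head> yields a further row without any change to the function, which mirrors the flexibility described above.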

What matters is that a single general XSLT file can be applied to numerous XML documents, which allows one to define a single metastyle and render appropriate HTML content for every project file inde-

2.4 ASCII to XML: Atox and XSLT

One of the very first processing steps in QuantNet is the transformation of the structured ASCII files into well-formed XML ones. But what does structured mean? As was mentioned before, an ASCII file is supposed to contain predefined fields or cells.

Field             Description
@short_desc       Project short description, usually one or two sentences
@function_call    Function call of the submitted computer program, if any

Table 2: Fields employed by QuantNet in ASCII files

By design of QuantNet, ASCII files should maintain as easy and natural a structure as possible. Therefore, the choice of auxiliary elements like the indication of field start or end should be transparent as well. In this case even an inexperienced user would have no difficulty understanding how to replicate this structure for his or her own project description for online publishing.

How does any text processing unit work when looking for a particular element? First of all, it looks for a set of symbols defining the beginning of a field. QuantNet assumes a field every time the symbol @ appears. So, for instance, @ABCD would mean that the field ABCD is about to begin.
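The field convention can be sketched as a small Python parser that turns an ASCII file into XML. It is a minimal sketch, not the Atox-based processing named in the section title: the function ascii_to_xml, the root tag project and the sample text are assumptions, and the sketch anticipates the end-of-field rule used in this section, namely that a blank line (a double new line) closes a field.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical submission: each field starts with @ and ends at a blank line.
SAMPLE = """@Name
Sample project

@Author
A. N. Author

@desc
First line of the description,
continued on a second line.
"""


def ascii_to_xml(text, root_tag="project"):
    """Turn @-prefixed fields, separated by blank lines, into XML elements."""
    root = ET.Element(root_tag)
    for block in re.split(r"\n\s*\n", text.strip()):
        match = re.match(r"@(\w+)\s*(.*)", block.strip(), flags=re.DOTALL)
        if match is None:
            continue  # blocks without a leading @ are ignored in this sketch
        ET.SubElement(root, match.group(1)).text = match.group(2).strip()
    return root


print(ET.tostring(ascii_to_xml(SAMPLE), encoding="unicode"))
```

Because the field name itself becomes the element name, a new field in the ASCII file appears in the XML output without any change to the parser.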

Certainly, the end of a field could be defined analogously, i.e. by introducing some extra auxiliary symbol like #. Instead, QuantNet uses the double new line character as the trigger for the end of a field, so there is no need to introduce an additional visible character. As in ordinary text, different paragraphs are separated by new lines. Since QuantNet is assumed to contain possibly

Title	QuantNet – A Database-Driven Online Repository of Scientific Information
Author	Anton Andriyashin
Supervisor	Prof. Dr. Wolfgang Härdle
University	Humboldt University
Major	Applied Statistics and Economics
Type	Master’s thesis
Year of publication	2007
City	Berlin
