QuantNet – A Database-Driven Online Repository of Scientific Information
A Master’s Thesis Presented
by Anton Andriyashin
(188779) to
CASE – Center of Applied Statistics and Economics
Humboldt University, Berlin
in partial fulfillment of the requirements
for the degree of Master of Science
Berlin, June 20, 2007
Declaration of Authorship
I hereby confirm that I have authored this Master’s thesis independently and without use of resources
other than those indicated. All passages which are taken literally or in general substance from publications or other resources are marked as such.
Anton Andriyashin
Berlin, June 20, 2007
1.1 Motivation
1.2 QuantNet: A Look Inside
1.3 An Online Repository of Information
1.4 What Is Wrong With Regular HTML Publishing?
2 Single Document Setup
2.1 Typical Structure of a Submitted ASCII File
2.2 What is XML?
2.3 XML and XSLT – A Single Document in HTML
2.4 ASCII to XML: Atox and XSLT
3 Multiple Documents Setup
3.1 From a Single Document to Multiple Documents
3.2 mySQL and PHP
3.3 Javascript, CSS and PHP
3.4 Putting Everything Together
4.1 Scalability – User-defined Tags
4.2 Ease of Administration
4.3 Ways to Make QuantNet Even More Powerful
4.4 Concluding Remarks
Already in the 1980s the OECD realized the importance of information as an asset in the global
economy [10; 11] and has been using the definition of Porat to identify an information economy [12] as one where at least 50% of the GNP is produced in the so-called primary or
secondary information sectors, i.e. sectors that employ information goods and services directly in production, distribution or information processing, or information services produced for internal
consumption by companies that do not produce information for sale, and by government [7]. Nowadays information is becoming one of the most valuable assets in the world economy.
New information technologies are able to broaden the horizons and tackle the traditional
challenges in unexpected ways – consider, for instance, the Hypertext Markup Language (HTML). Its first published specification was drafted by Berners-Lee with Dan Connolly and was published
in 1993 by the IETF [5], and already in 2000 HTML became an international standard (ISO/IEC 15445:2000). This language offered a new way of content navigation by making it possible to
switch quickly between author-defined parts of the entity via so-called hyperlinks. Having realized the many advantages of modern IT, many "offline" journals and magazines established an online presence with
unique features of delivering information that paper-based editions lack. High-resolution photographic materials, video content, audio podcasts, quick search, archives and links to other resources
– these are just a few of the comme il faut elements that almost any high-end online edition has to offer today. In the last 15 years HTML became the cornerstone of the whole Internet, which changed the
lifestyle of the majority of people on the planet.
And HTML is just one example. New markup languages like the Extensible Markup Language (XML) separate the data from
style instructions (as opposed to, for instance, font color and many other similar properties in
HTML). Along with Extensible Stylesheet Language Transformations (XSLT), XML can
be transformed into another XML document, ASCII, HTML or even a PDF file. Therefore, it is
not for nothing that many web sites employ the data-driven approach to construction and are able to provide automatic updates based on new incoming information in real time. Just imagine
an online weather forecast service that receives remote data (in XML format) from a research center and is able to update the site with almost no delay. If HTML were used instead of XML
and XSLT, the update would take much longer due to the manual corrections required inside the HTML code. And since XML is now supported by many prominent software applications, e.g. Oracle,
Informix, mySQL or Microsoft Office, not to mention the support of XML by many programming languages like PHP, Python and Perl, the potential for employment of XML on the Internet is considerable.
The aim of this work is to provide a semi-automated core called QuantNet that allows
significant amounts of scientific information to be published online in a situation where regular updates are expected and, most importantly, where the authors of submitted materials are not assumed to know
any markup language, i.e. the materials can be submitted as ASCII files with the simplest structure.
It is not the ultimate goal of this study to provide a ready-to-ship commercial web application.
Instead, the implementation of the core and the examination of its possibilities and limitations are of
particular interest. At the same time, only minor efforts should later be required to deploy a full-scale online system based on the created core.
Consider a virtual example of a project or a procedure that is about to be submitted online via
QuantNet. A typical ASCII file could look as follows:
The first part of the file contains some general information about the project like its name,
author and so on, while the second part may contain a detailed description and/or computer code.
As can be seen from the listing, the ASCII file does not contain any language-specific markup
tags. The only tags employed are natural field descriptors, preceded by the @ symbol. The author
of the submitted document does not have to care about auxiliary properties like font size, color,
family and so on. The only thing required is to follow the sample structure.
But what is the next step? How is this ASCII file to be transformed into a well-formed HTML
file that will ultimately be rendered by the client’s browser? There are several steps to
be undertaken, but the crucial one is to transform the data from the ASCII file into an XML file
that could look as follows:
At the same time, advanced users of QuantNet are supposed to profit from the full
range of possibilities offered by native HTML, XML and XSLT, so inline tags, if present, should
be processed adequately. For instance, if the <bold> tag is allowed in the ASCII file and
stands for the <b> HTML counterpart, then QuantNet should be able to process the following ASCII file adequately:
The file is screened for the <bold> tag, which is substituted with the valid HTML <b> tag
for bold text style.
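The screening step can be illustrated with a minimal Python sketch. It is not the QuantNet implementation: the function name `substitute_tags` and the whitelist `TAG_MAP` are hypothetical, and only the pair <bold> to <b> is taken from the text.

```python
import re

# Hypothetical whitelist: custom tag allowed in the ASCII file -> HTML counterpart.
# Only the pair bold -> b comes from the text; further pairs would be assumptions.
TAG_MAP = {"bold": "b"}

def substitute_tags(text, tag_map=TAG_MAP):
    """Replace whitelisted opening and closing tags with their HTML counterparts."""
    def swap(match):
        slash, name = match.group(1), match.group(2)
        html_name = tag_map.get(name)
        # Tags that are not in the whitelist are left untouched.
        return f"<{slash}{html_name}>" if html_name else match.group(0)
    return re.sub(r"<(/?)(\w+)>", swap, text)

print(substitute_tags("This result is <bold>important</bold>."))
```

Keeping the allowed tags in a single mapping means that a further user-defined tag would require one new entry rather than new screening code.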
And, of course, extra tags should by no means be limited to the markup group only. In principle
even MathML, when supported by the browser (e.g. Mozilla Firefox), should be displayed adequately inside, say, the <math> tag. That can be very handy for documents containing a lot
a complete and scalable content system like QuantNet
In this work the representation part of the content is delegated solely to XSLT, while string manipulation accounts only for the preparation of the necessary raw data files in XML. The logic of Textile, for
instance, is employed exactly at this stage – while creating XML files out of submitted ASCII files
– but with one key difference: no style options to appear later inside HTML code are considered.
Part 2 focuses on the implementation issues in the single document framework. Section 2.1
introduces the structure of ASCII files recommended by QuantNet. XML is presented in Section 2.2
in the context of a dynamic weather forecasting web application. Section 2.3 focuses on the
conjunction of XSLT and XML and discusses the issues that implementing a multiple document web application would raise if HTML were the only language employed.
Part 3 primarily concerns the multiple document nature of QuantNet. Section 3.1 provides
a short overview of the implementation tools necessary to deploy a web application of this type. Section 3.2 introduces mySQL – a popular database management system for online applications –
and PHP – a scripting language that is mostly used on the server side. Later, in Section 3.3, the motivation for Javascript as a client-side scripting language in conjunction with PHP and CSS
is provided. A step-by-step overview of the implementation of QuantNet, available in Section 3.4, concludes Part 3.
Finally, Part 4 focuses on the potential of QuantNet as a scalable web application. Section 4.1
concentrates on the ability of QuantNet to handle a potentially unlimited number of additional tags
in ASCII files. Several useful applications of this feature as well as the implementation logic are
provided there. In Section 4.2 the process of adding a new project to QuantNet or changing
the application’s content structure is considered. Validation by means of XML schemas and
an analytic grammar based on Backus-Naur form, as well as scripting, are discussed in Section 4.3.
1.3 An Online Repository of Information
A typical application of QuantNet could be an online interdisciplinary repository of research materials submitted by various parties – from professional researchers to university students. These
materials could contain not only results and algorithm descriptions, which is the traditional form of almost any publication, but also source code, when available, as well as other supplementary data
at the author’s wish.
If the target institution is a big organization like a leading university or a research center, then
in many cases different departments introduce their own web-publishing standards and may have
different hosting. That could create additional obstacles for an end-user of the content, who may be unsure where to find this content and, more importantly, may wonder whether some relevant
research results are available at some other department.
Therefore, the aim of QuantNet is to introduce a centralized system constituted by documents from different scientific areas submitted by various departments. Centralized content
management not only eases navigation for the end-user but also provides significant advantages
in terms of administration. All documents are located on the same server, they share the same
predefined structure, they are easy to catalog, and if there is a decision to change the structure
of QuantNet in some way, for instance, to introduce a new version of the layout, these changes can be
applied automatically, and they will not affect the original submitted ASCII files.
Every publishing entity like a journal has its own styling instructions. The same applies to web-publishing. The aim of QuantNet is to avoid any prerequisites that come from the markup
field. Instead, QuantNet imposes only a few restrictions on the original ASCII data files with submitted projects, so that each of them contains the author’s name, the name of the project etc.;
refer to Table 2 for more details. And that is all! A researcher need not worry about what font size to employ for a certain heading unless he or she is well aware of specific HTML tags to take
advantage of.
In this sense the submitted ASCII files are normally to contain only data and a minimum amount
of (or no) markup tags. This is the fundamental feature of QuantNet – a user supplies a structured data file, and QuantNet semi-automatically processes this file and incorporates it in the proper cell
of the system: the plain data ASCII file becomes a well-formed HTML document with adequate graphic elements and navigation tools.
Structured fields like @Author, @Name or @Area inside ASCII files could potentially lead to
an effective search mechanism inside QuantNet. Although it is not a goal of this study to introduce elements like that, this possibility is important to mention.
While it may be clear what advantages submitting a research study as a data ASCII
file provides for a person who is not familiar with HTML for online publication, several administration aspects, which may not be so obvious, are worth mentioning here.
Suppose the author of the project to be published online has the file in HTML format. Does
that automatically mean that this person knows HTML? Not necessarily. Even Microsoft Word – one of the most widely used text processors – can save its output as an HTML file [8]. LaTeX2HTML
is another solution for those preferring LaTeX to Word.
So what is wrong with HTML as a submission format, or even Microsoft Word output that can later be converted to HTML? If there is a single document to be published, probably nothing is wrong. If
there are style and/or document structure prerequisites, they can be matched. However, in the multiple documents setup several problems do arise.
Imagine that, for instance, a new version of the graphical design of the web application is to be
introduced. If, say, there are 500 HTML documents contained in the system, each of them must be changed one by one! Not to mention the problems of navigation across these individual
files and the difficulty of introducing content-driven dynamic functions like automatic generation of
links to auxiliary materials given, for instance, the project name.
Figure 1: BBC Weather web page: navigation menu example
Would it not be better if the user intending to give the project a name had to type something
And this is only one element – the name. An HTML document with rich formatting contains dozens
of such elements. If one of them is to be changed, then all the documents in the system have to be
updated.
And what about the navigation? Assuming that HTML frames are not employed, following the recommendations of leading web designers, there is no easy solution for navigation elements
in a multiple documents setup.
In Figure 1 the BBC Weather web site is considered as an example of a system that provides
navigation to elements created out of raw XML data in real time. It is important that the left part
of the page contains some fixed links like Weather Home, UK, World, Sports, Cost and Area, Climate
Change and others, connecting different documents into a well-formed weather forecasting portal. Working with a single page does not restrict the end-user in any way: the links to other areas
are always available. At the same time the current document does not contain any explicit links
to other parts of the portal because these links are created on the fly. If plain HTML were employed,
the left navigation pane would be unavailable unless every page were post-processed manually to add these controls – a real nightmare for an administrator.
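The idea of links created on the fly can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the link list `NAV_LINKS`, the function `render_page` and the HTML layout are hypothetical and describe neither the BBC site nor QuantNet.

```python
from html import escape

# Hypothetical navigation entries (label, URL); a real system might read them from a database.
NAV_LINKS = [("Home", "index.html"), ("Projects", "projects.html")]

def render_page(title, body_html, nav_links):
    """Wrap a document body in a shared layout with a generated navigation pane."""
    items = "\n".join(
        f'    <li><a href="{escape(url)}">{escape(label)}</a></li>'
        for label, url in nav_links
    )
    return (
        "<html>\n"
        f"<head><title>{escape(title)}</title></head>\n"
        "<body>\n"
        f'  <ul class="nav">\n{items}\n  </ul>\n'
        f'  <div class="content">{body_html}</div>\n'
        "</body>\n"
        "</html>"
    )

# Two documents share the same pane; neither of them stores any links itself.
pages = {
    name: render_page(name, f"<p>{name} text</p>", NAV_LINKS)
    for name in ("Project A", "Project B")
}
print(pages["Project A"])
```

Adding an entry to the link list changes the pane of every page at the next rendering, so no stored document has to be edited by hand.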
Fortunately, XML, XSLT and PHP could provide a much more efficient solution in this respect.
Before addressing these areas in more detail, let us have a closer look at what QuantNet gets as
an input – an ASCII file with raw project information.
Since every XML file normally contains only raw structured information, as in a database, the employment of ASCII files with no markup elements fits XML perfectly in this sense.
The very basic structure of an ASCII file describing, say, some statistical procedure could look
as follows. The first file block refers to project cataloging, i.e. it contains relevant information about authors, software platform, project stage and so on. The aim of this substructure is to present
summarized information about the project in a compact form when it is being viewed by the end-user.
The second block contains the project itself with supplementary computer code for this
particular example. Most of the information is located in the @desc field, while @input and @output refer solely to the algorithm implementation.
While XSLT is a style template applied to a given XML file, the XML itself could be generated out of
the submitted ASCII file. This important aspect will be regarded later in Section 2.4.
XML – the Extensible Markup Language – is a markup language with user-defined tags used for information management. While HTML is another markup language and XML files are used to
create HTML output, there are several noticeable differences between these two languages.
Figure 2: BBC Weather web page: generation of dynamic HTML out of raw XML files with weather forecast data
First and foremost, HTML is constituted by a fixed set of allowed tags that are in charge of the proper representation of text and graphics on the web page. An XML file can contain any user-defined
tag; in fact, there are no predefined tags at all. And how is that possible for a language? Since the aim of XML is just to structure the information and not to display it, this approach is quite
natural, because it is impossible to make predefined templates for all or at least the vast majority
of different information sets.
Let us continue with the example of the BBC Weather web page. In Figure 2 one can see the
forecasts made for five days of the current week.
A part of the underlying XML file that is used to generate the forecasts could look as
follows. Since the object of main interest is the forecast, the root tag can have the same name –
<forecast>. Following the structure of the page in Figure 2, every forecast gets its own group tag like <tuesday> or <wednesday>. Inside these tags all numerical information appearing on the
page can also be easily structured by imposing extra tags for each element, e.g. the forecasted sun index or the wind speed – this information may then be stored inside <sun_index> or <wind_sp>.
Figure 3: BBC Weather web page: Tuesday forecast
<sunrise_h>    Hours part of the sunrise time
<sunrise_m>    Minutes part of the sunrise time
<sunset_h>     Hours part of the sunset time
<sunset_m>     Minutes part of the sunset time
<p_weather>    Predominant weather
<min_night>    Minimum temperature during the night
<wind_dir>     Wind direction
<r_humidity>   Relative humidity
Table 1: Description of some of the employed XML tags
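To make the tag structure concrete, the following Python sketch builds a small forecast document and reads it back with the standard library. It is only an illustration: the tag names are written with underscores because XML element names cannot contain spaces, the sun index values (2 on Tuesday, 4 on Wednesday) are those used in the text, and the wind values as well as the helper `read_forecast` are made up.

```python
import xml.etree.ElementTree as ET

# Illustrative data only: the wind values are invented for this sketch.
FORECAST_XML = """<forecast>
  <tuesday>
    <sun_index>2</sun_index>
    <wind_sp>10</wind_sp>
    <wind_dir>SW</wind_dir>
  </tuesday>
  <wednesday>
    <sun_index>4</sun_index>
    <wind_sp>14</wind_sp>
    <wind_dir>W</wind_dir>
  </wednesday>
</forecast>"""

def read_forecast(xml_text):
    """Return {day: {tag: value}} for every day group under the root tag."""
    root = ET.fromstring(xml_text)
    return {
        day.tag: {item.tag: (item.text or "").strip() for item in day}
        for day in root
    }

forecast = read_forecast(FORECAST_XML)
print(forecast["tuesday"]["sun_index"])
```

The reader never needs to know the tag names in advance, which reflects the point that XML has no predefined tags.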
Once again – unlike HTML, XML is not designed to present the stored content directly; however, the HTML code processed by the browser on the end-user’s computer is generated on the fly
from XML. Although only part of the data from XML comes directly to the web page, e.g. the maximum day temperature or the wind speed, other elements like the predominant weather category trigger
specific graphic objects to appear – consider once again Figure 2 and the forecasts for different days of the week. There the predominant weather condition determines the image type employed to indicate
it in a more obvious form – one can clearly distinguish between the weather states of Tuesday
and Thursday, for instance. Analogously, graphic objects for the wind direction are put on the page according to the values stored in the <wind_dir> cell.
However, up to this moment it has not been described how the final HTML output comes to look as it does in, for
instance, Figure 2. For example, why is the sign of the forecasted sun index, which is equal to two on Tuesday, green, while the same sign for Wednesday, where the sun
index is expected to be equal to four, is yellow?
There could be different mechanisms employed to process the relevant information from XML. Naturally, one could assume a tool that establishes a correspondence between the forecasted sun index
and the graphical object to appear on the HTML page. While there exist different tools to employ
in this situation, XSLT is the one that is to be examined in more detail next. PHP could also
become such a tool if server-side intervention is allowed – XSLT can produce the HTML output
on a client machine while PHP is mostly a server-side scripting language. PHP will be covered in more
detail in Section 3.2, but from a perspective different from the one for XSLT.
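Such a correspondence between a stored value and a graphical property can be sketched as a simple lookup, here in Python rather than XSLT or PHP. Only two points are known from the example (an index of 2 is shown in green, an index of 4 in yellow); the thresholds, the "red" category, the class names and both function names are assumptions made for this illustration.

```python
def sun_index_colour(sun_index):
    """Map a forecasted sun index to the colour of its sign.

    Known from the text: 2 -> green, 4 -> yellow.
    Assumed for this sketch: the cut-off values and the "red" category.
    """
    if sun_index <= 2:
        return "green"
    if sun_index <= 5:
        return "yellow"
    return "red"

def sun_index_html(sun_index):
    """Render the sign as an HTML fragment whose class depends on the value."""
    colour = sun_index_colour(sun_index)
    return f'<span class="sun-{colour}">{sun_index}</span>'

print(sun_index_html(2))
print(sun_index_html(4))
```

The raw data carry only the number; the colour is decided at rendering time, so a change of thresholds never touches the stored forecasts.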
Let us switch back to the setup of QuantNet as a system. Assuming that submitted ASCII files can effectively be transformed into well-formed XML documents – this challenge is to be described
in Section 2.4 – the analogy with the example presented in Figure 2 is straightforward. If XSLT is the way to get the HTML output from XML, then even ASCII files with no markup information
can later be turned into richly formatted HTML documents!
Figure 4 summarizes this process. With the structure rules defined as in Section 2.1, one is able, first, to transform the project or algorithm documentation into an ASCII file with intuitive
and clear fields and tags, if any. Later, when the ASCII file is represented as a well-formed XML document, HTML output is produced by applying an XSLT template to this XML object – that
aspect will be covered in more detail in the next section.
Figure 4: From the project documentation to the final HTML output
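The middle step of this chain, from already extracted fields to a well-formed XML document, can be sketched with Python's standard library. The root tag `project`, the sample field values and the function `fields_to_xml` are hypothetical and are not taken from QuantNet.

```python
import xml.etree.ElementTree as ET

def fields_to_xml(fields, root_tag="project"):
    """Turn a mapping of field names to text into a well-formed XML string.

    Each field becomes one child element of the (hypothetical) root tag.
    Special characters such as < or & are escaped by ElementTree, so inline
    tags inside a field would need separate handling (not shown here).
    """
    root = ET.Element(root_tag)
    for name, value in fields.items():
        ET.SubElement(root, name).text = value
    return ET.tostring(root, encoding="unicode")

xml_text = fields_to_xml({"name": "Demo project", "author": "A. N. Other"})
print(xml_text)
```

Because the serializer takes care of escaping and nesting, the output is well-formed by construction, which is the property the subsequent XSLT step relies on.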
As was mentioned before, XSLT is a powerful means to transform structured information from
XML into a richly formatted representation like HTML or even PDF. XSLT is a set of stylesheet rules applied to specific portions of an XML document and resulting in the creation of different style elements
that are frequently content-driven. For instance, if the processed portion of information from XML
is a heading, then the XSLT template may apply the <h3> HTML tag.
Let us consider the following code example to understand the architecture of XSLT.
13 </xsl:template>
The part of the template presented here is the heading of the actual template employed in
QuantNet. It is important that it has a scalable structure – the third line starts to define one global template that matches all possible elements of any XML file. The final HTML content is mostly
generated through a recursive template application, <xsl:apply-templates/> [9]. These templates are defined after the <xsl:template match="/"> element is closed.
For instance, a summary table element (refer to Figure 9) defined in QuantNet’s XSLT file
Figure 5: Production of HTML output via XML and XSLT
As one can see, this is a very general table definition. Table styles are defined with the help
of HTML/CSS, and the number of table elements is arbitrary. So how does XSLT know where to stop
building the table? The <xsl:template match="head/*"> tag tells it to include in the table those elements of
the XML file that are nested inside the <head> tag. In this way one can maintain a truly flexible structure of QuantNet, because if some extra elements are later to be added, for instance,
to the summary table environment, it would suffice to ensure that these elements are present
in the XML file and properly nested.
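The effect of matching every child of <head> can be imitated in Python to show why the table grows without any change to the template. This is an analogy, not XSLT and not QuantNet's stylesheet: the function `summary_table`, the root tag `project` and the sample fields are hypothetical.

```python
import xml.etree.ElementTree as ET
from html import escape

def summary_table(xml_text):
    """Build one HTML table row per element nested directly inside <head>."""
    root = ET.fromstring(xml_text)
    head = root.find("head")
    rows = []
    # Iterating over an element yields its direct children, whatever their names are.
    for child in (head if head is not None else []):
        rows.append(
            f"<tr><th>{escape(child.tag)}</th><td>{escape(child.text or '')}</td></tr>"
        )
    return "<table>\n" + "\n".join(rows) + "\n</table>"

doc = (
    "<project><head><name>Demo</name><author>A. N. Other</author></head>"
    "<body>text</body></project>"
)
print(summary_table(doc))
```

If a further element, say <area>, is nested inside <head>, the same function produces a third row; elements outside <head> are ignored, which mirrors the role of the match pattern.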
It is important that a single general XSLT file can be applied to numerous XML documents, which makes it possible to define one metastyle and render appropriate HTML content for every project file independently.
2.4 ASCII to XML: Atox and XSLT
One of the very first processing steps in QuantNet is the transformation of the structured ASCII files into well-formed XML ones. But what does it mean – structured? As was mentioned before,
an ASCII file is supposed to contain the predefined fields or cells.
@short_desc      Project short description, usually one or two sentences
@function_call   Function call of the submitted computer program, if any
Table 2: Fields employed by QuantNet in ASCII files
By design of QuantNet, ASCII files should maintain as easy and natural a structure as possible. Therefore, the choice of auxiliary elements like the indication of field start or end should be
transparent as well. In this case even an inexperienced user would have no difficulty understanding how to replicate this structure for his/her own project description for online publishing.
How does any text processing unit work when looking for a particular element? First of all, it looks
for a set of symbols defining the beginning of a field. QuantNet assumes a field every time the symbol
@ appears. So, for instance, @ABCD would mean that the field ABCD is about to begin.
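The field-start rule can be sketched in Python as follows. The sketch is not the Atox tool named in the section title; the function `parse_fields` and the sample text are hypothetical, and the sketch assumes in addition that a blank line closes a field.

```python
import re

def parse_fields(text):
    """Split a structured ASCII text into {field_name: content}.

    A field starts where a block begins with @name; blocks are assumed to be
    separated by one or more blank lines. Blocks that do not start with @ are
    ignored in this simplified sketch.
    """
    fields = {}
    for block in re.split(r"\n\s*\n", text.strip()):
        match = re.match(r"@(\w+)\s*(.*)", block, flags=re.DOTALL)
        if match:
            fields[match.group(1)] = match.group(2).strip()
    return fields

sample = "@name Demo project\n\n@author A. N. Other\n\n@desc First line\nsecond line\n"
print(parse_fields(sample))
```

A field may span several lines as long as no blank line occurs inside it; supporting multi-paragraph fields would require a more elaborate rule than the one shown.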
Most certainly, the end of a field could be defined analogously, i.e. by introducing some extra
auxiliary symbol like #. Instead, QuantNet uses the double new line character as the trigger for the end of a field – so there is no need for an additional visible character to be introduced. As in ordinary text,
different paragraphs are separated by new lines. Since QuantNet is assumed to contain possibly