Introduction
In this preliminary chapter, we highlight the growing significance of the Web's usage and development over the past decade. We focus in particular on the diversity and multitude of languages present in this dynamic, ever-evolving environment. The inherently international nature of the Web naturally leads us to consider a predominant phenomenon: the role of multilingualism in the design and structuring of information.
The study of multilingual information resources on the Web encompasses various disciplines, including information retrieval, information extraction, and categorization. Researchers such as Adriani, Bertoldi, and Besançon have contributed to information retrieval, while Azzam and Maynard have focused on information extraction. In categorization, scholars such as Cavnar and Giguet have made significant advances. Within these disciplines, multilingualism is often viewed as a challenge involving the federation of models and methods derived mainly from previous research efforts.
We also review research on Natural Language Processing (NLP) tools, including morphological and syntactic analyzers, dictionaries, automatic translators, and automatic summarization generators.
Our bibliographic studies revealed a partial lack of interest in the structuring of information within multilingual resources. However, we noted significant growth in the design of multilingual websites. Additionally, research indicates that users appear to familiarize themselves easily with multilingual web document structures and to navigate them without difficulty.
1.2 The diversity of languages on the Web
At the dawn of the Internet, monolingualism, particularly the dominance of English, prevailed in the vast majority of information shared on this new medium. This phenomenon was a logical outcome of the Internet's history, which began with ARPANET, a network introduced by the U.S. Department of Defense in 1969. Similarly, the creation of the World Wide Web in 1989 further solidified this linguistic trend.
In 1990, Tim Berners-Lee developed the World Wide Web at CERN; this was followed by the wide distribution of the Mosaic web browser in November 1993. This browser, a precursor of Netscape, first gained significant popularity in North America and then globally.
Distribution in terms of information (quantity of information)
In 1997, a significant presence of various languages on the Internet was observed. A study conducted by the Babel team³, a collaboration between Alis Technologies and the Internet Society, revealed that English dominated the online landscape, accounting for 82.3% of the available information. It was followed by German at 4%, Japanese at 1.6%, French at 1.5%, Spanish and Swedish both at 1.1%, and Italian at 1.0%.
1 History of the Internet (http://www.historyoftheinternet.com/).
2 About the World Wide Web Consortium (W3C) (http://www.w3.org/Consortium/).
3 Palmarès des langues de la Toile, June 1997 (http://alis.isoc.org/palmares.html).
Distribution in terms of sites (number of sites)
Comprehensive statistical studies conducted as part of the OCLC (Online Computer Library Center) Web Characterization Project revealed that 29 languages were represented on the Web in 1999, up from 24 languages in 1998. According to the same source, 80% of websites in 1999 were in English, a slight decrease from 84% in 1998.
Distribution in terms of Web pages (number of pages)
O'Neill observed that from 1999 to 2002, English-language pages comprised approximately 72% of all web pages. During this period, the number of pages in Japanese increased significantly, rising from 3% in 1999 to 6% in 2002, while the proportion of pages in other languages remained relatively stable, with German at 7%, French at 3%, and Spanish at 3%. Italian saw a slight increase from 2% in 1999 to 3% in 2002. In contrast, Chinese temporarily declined from 3% in 1999 to 2% in 2002, as did Portuguese, from 2% in 1999 to 1% in 2002 [OLB03].
A 2000 UNESCO report revealed that just over two-thirds (68%) of web pages were written in English, a figure rising to 96% for e-commerce sites.
In 2003, a study by Global Reach revealed the distribution of languages on the web, with English dominating at 68.4%. Other languages included Japanese at 5.9%, German at 5.8%, Chinese at 3.9%, French at 3.0%, Spanish at 2.4%, Russian at 1.9%, Italian at 1.6%, Portuguese at 1.4%, Korean at 1.3%, and various other languages making up 4.6%.
The number of non-English-speaking internet users has grown more rapidly than that of English-speaking users. In 1999, 48.7% of internet users did not have English as their first language, marking a 20% increase since 1996. By 2000, UNESCO reported that English speakers would no longer be the majority online, dropping to 49% from 92.2% in 1998.
In September 2003, the Global Reach study⁴ showed that English speakers represented only 35.8% of Internet users, followed by Chinese speakers (13.7%), Spanish speakers (9%), German speakers (6.9%), French speakers (4.2%), Japanese speakers (8.4%), Korean speakers (3.9%), Italian speakers (3.8%), Portuguese speakers (3.1%), Arabic speakers (1.7%), and Russian speakers (0.8%).
4 Global Internet statistics (by language) (http://www.glreach.com/globstats/).
In 2005, it was anticipated that Asian internet users would experience significant growth, with projections indicating that Chinese users would comprise 20% of the online population, followed by Japanese users at 9% and South Korean users at 4.3%. In contrast, English speakers were expected to represent only 29% of internet users, while Spanish and German speakers would account for 7% and 6%, respectively.
In response to linguistic diversity and the influx of multinational users on the web, multilingual search engines emerged in 1995, with AltaVista the most notable. By 2001, several search engines had been developed in multiple languages, showcasing their plurilingual capabilities: Google supported 25 languages, Excite 11, AltaVista 19, and AllTheWeb 44.
Today, it is common for English not to be the default language for disseminating information on the Internet, a scenario unimaginable a decade ago. However, English-speaking communities, including American and Australian English, remain dominant, and English continues to be the most widely spoken first foreign language owing to economic factors. O'Neill noted that in 1998, 7% of websites were multilingual, a figure that dropped to 5% in 1999. Lavoie surveyed 156 multilingual websites in 1999, revealing that English was present on 100% of them, followed by French and German at 31%, Italian and Spanish at 21%, Japanese and Portuguese at 10%, Swedish at 10%, and Chinese at 7%. This study indicates that powerful languages still attract a significant number of users, even though users prefer to search the web in their native languages.
The significance of multilingualism is underscored by the extensive research conducted on the topic, especially when compared with other aspects of Internet development, such as multimedia, electronic libraries, and databases, which remain relatively underexplored. The growing interest in multilingualism can also be attributed to the increasing sophistication of automatic translation software in recent years.
1.3 Typologies of multilingual websites
According to the W3C, multilingualism refers to websites that use multiple languages, and it encompasses more than just language selection. The definition also covers pages that feature several languages within the same content, highlighting the capability of a multilingual site to blend various languages on a single page.
The concept of a multilingual website is often misunderstood, as it covers a wide range of multilingual phenomena. The classification of a multilingual website relies on specific characteristics and evaluation criteria related to its multilingual properties, as well as on the various types of source anchors identified.
As a first step, and in order to determine the most concrete properties of multilingual websites, we carried out a series of simple (semi-automatic) analyses of a set of sites. These analyses allowed us to retain three primitive characteristics of multilingual websites:
– the navigation structure,
– the internal logical structure of the Web pages,
– the content.
These characteristics enable us to extract correlated sequences across the different languages used on a multilingual website. In most multilingual websites, we observe a strong similarity in the behavior of the first and third characteristics, the navigation structure and the content. However, the internal logical structure of documents can vary across languages for several reasons, likely related to cultural differences in the logical structuring of information.
The internal logical structure of web pages does not provide adequate information regarding content visualization, yet this is a fundamental element of user interface interaction management.
Key questions regarding international and multilingual websites emphasize that visualization is the user's perceived experience, focusing primarily on the visual modality, and that the navigation structure often does not faithfully represent the operating logic of a multilingual website.
An in-depth analysis focused on language recognition in a multilingual website allowed us to project all the characteristics of a website onto three essential axes that explain its behavior and classification:
– the visualization (the perceived part),
– the operating logic,
– the content.
Explanatory characteristic   Representation
Visualization                Physical structure of the Web pages
                             Interface: colors, page layout, etc.
Logic                        Navigation structure
                             User-website interactions
Content                      Text, figures, etc.

Tab. 1.1 – Explanatory characteristics of a multilingual website
Based on these observations, we have introduced the concept of multilingual parallelism as the most significant criterion that can explain the similarities found in the various correlation structures explored.
Multilingual parallelism is evidenced by the identification of similarities or equivalences among the essential features of languages that can coexist on a multilingual website.
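To make the parallelism criterion concrete, here is a minimal sketch that compares the navigation structure of two language versions of a site by measuring the overlap of their link targets once the language prefix is stripped. It assumes a hypothetical /en/, /fr/ URL convention and uses the requests and BeautifulSoup libraries; a score near 1 would indicate strong parallelism on the navigation axis.

    import requests
    from bs4 import BeautifulSoup

    def nav_paths(url, lang_prefix):
        """Collect outgoing link targets, with the language prefix removed."""
        soup = BeautifulSoup(requests.get(url).text, "html.parser")
        paths = set()
        for a in soup.find_all("a", href=True):
            href = a["href"]
            if href.startswith(lang_prefix):
                paths.add(href[len(lang_prefix):])
        return paths

    def navigation_parallelism(url_a, prefix_a, url_b, prefix_b):
        """Jaccard overlap of two navigation structures (0 = none, 1 = full)."""
        pa = nav_paths(url_a, prefix_a)
        pb = nav_paths(url_b, prefix_b)
        return len(pa & pb) / len(pa | pb) if (pa | pb) else 0.0

    # Hypothetical usage for a site following an /en/, /fr/ convention:
    # score = navigation_parallelism("http://example.org/en/", "/en/",
    #                                "http://example.org/fr/", "/fr/")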
Referring to this parallelism criterion, we distinguish three levels of the «multilingual» property of a website (cf. table 1.2).
It should be recalled that the evaluation of the multilingual property of websites falls squarely within the objective of this study, which is to define a universal approach, independent of linguistic concepts and hence of languages, for the classification and recognition of multilingual websites.
6 European Environment Information and Observation Network: Multilingual websites structures and definitions (http://www.eionet.eu.int/software/design/multilinguality/websitestructures).
Parallelism across the languages                 Multilingual property
(visualization / logic / content)
Completely parallel                              Strong

Tab. 1.2 – Levels of the multilingual property of a website
Indeed, we distinguish at least two fairly widespread types of multilingual websites:
The first type allows users to select their preferred language for content presentation, with the option to switch languages at any time using hyperlinks.
The second type of website offers multiple languages (mixed) on the same webpage to present content simultaneously. For ergonomic reasons, and to ensure effective information structuring across various devices, this type of website limits the number of languages used per page and the amount of content provided in each language.
The main characteristics of these two types are presented in table 1.3.

                                    First type           Second type
Logic     Language switching        Direct or indirect   No
Content   Number of languages       Many                 Limited

Tab. 1.3 – Characteristics of the two types of multilingual websites
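As an illustration of the distinction in table 1.3, the rough sketch below guesses the type of a page by counting the languages detected in its text blocks: a single dominant language suggests the first type, several languages on the same page suggest the second. The 90% threshold and the langdetect library are assumptions of this sketch, not part of the classification itself.

    from collections import Counter
    from bs4 import BeautifulSoup
    from langdetect import detect  # pip install langdetect

    def language_profile(html):
        """Count the language detected for each sufficiently long text block."""
        soup = BeautifulSoup(html, "html.parser")
        langs = Counter()
        for block in soup.find_all(["p", "li", "td", "h1", "h2", "h3"]):
            text = block.get_text(" ", strip=True)
            if len(text) > 40:       # fragments too short to identify reliably
                try:
                    langs[detect(text)] += 1
                except Exception:
                    pass             # undetectable block, ignore it
        return langs

    def guess_site_type(html):
        langs = language_profile(html)
        total = sum(langs.values())
        if total == 0:
            return "unknown"
        dominant_share = langs.most_common(1)[0][1] / total
        if dominant_share > 0.9:
            return "first type (one language per page)"
        return "second type (mixed languages on the page)"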
A specific type of website to consider is the association of monolingual sites, where the content for each language is independently localized and typically accessed via distinct URLs. This approach is commonly seen in international organizations and large companies such as Coca-Cola⁷ and Nike. For the purposes of our research, however, these sites are not classified as truly multilingual.
To strengthen these classification hypotheses, we evaluated several websites of international organizations (see Appendix A for their full names). The results of these evaluations are shown in table 1.4. It should be noted, however, that our evaluations were limited to identifying only the so-called dominant languages⁹.
1.4 Strategies for language switching in a multilingual website
Language-switch source anchors: primary and secondary
The functioning of language-switch anchors varies across multilingual websites, leading to potential inconsistencies in content. For instance, there may be pages in one language that are no longer accessible from the other languages. This often occurs when web pages in one language lack their equivalents in the others, because of development delays or overlooked translations. Conversely, certain web pages may be deemed less important, resulting in a lack of multilingual support.
7 http://www.coca-cola.com/
9 Abbreviations: an (English), ar (Arabic), ch (Chinese), de (German), es (Spanish), fr (French), ru (Russian).
Website   Dominant languages                                          Multilingual property   Language-switch anchors
EUROPA    20 languages                                                Strong                  In all languages
FAO       5 languages (an, ar, ch, es, fr)                            Strong                  In four languages: an, fr, es, ar
ILO       6 languages (an, ar, de, es, fr, ru)                        Strong                  In three languages: an, fr, es
IMF       3 languages (an, es, fr)                                    Strong                  Between the languages: fr-an, fr-es, es-an, es-fr
UN        6 languages (an, ar, ch, fr, es, ru)
UNDP      3 languages (an, es, fr) (excluding 18 regional sites)
UNFPA     4 languages (an, ar, es, fr)                                Strong                  In three languages: an, fr, es
UNICEF    3 languages (an, es, fr) (excluding 37 regional sites,
          distributed over several regional sites)
WTO       3 languages (an, es, fr)                                    Strong                  In all languages

Tab. 1.4 – Characteristics of the studied websites
Fig. 1.1 – Language switching: the direct method
Fig. 1.2 – Pages with no language-switch source anchors pointing to their translations, even though these exist
For multilingual websites of the first type, language switching can be done in two ways (cf. table 1.5).
There are two methods for changing the language on a website: the direct method and the indirect method. In the direct method, a language-switch anchor redirects the user to the corresponding page in the target language. In the indirect method, language-switch anchors redirect the user to the homepage of the selected language.
On very large websites, it is common to find varied forms of language-switch anchors on certain pages. For instance, in a bilingual French-English website, most English pages may carry a source anchor labeled in French that points directly to their French translations, while other English pages use an anchor that leads indirectly to the French side of the site.
We differentiate between primary anchor sources, the language-switch anchors found on the homepage, and secondary anchor sources, which appear on the other pages of the site in a form different from the primary anchors. Typically, secondary anchor sources are abbreviated versions of the primary textual anchors.
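A small sketch of how primary or secondary language-switch anchors might be spotted automatically: it scans the anchors of a page for labels matching common language names or their abbreviated (secondary) forms. The label list is illustrative and would have to be extended in practice.

    from bs4 import BeautifulSoup

    LANGUAGE_LABELS = {"english", "francais", "français", "espanol", "español",
                       "deutsch", "en", "fr", "es", "de", "ar", "ru", "zh"}

    def language_switch_anchors(html):
        """Return (label, target) pairs for anchors that look like language switches."""
        soup = BeautifulSoup(html, "html.parser")
        candidates = []
        for a in soup.find_all("a", href=True):
            label = a.get_text(strip=True).lower()
            if label in LANGUAGE_LABELS:
                candidates.append((label, a["href"]))
        return candidates

    # On a homepage this finds primary anchors ("English", "Français", ...);
    # on inner pages it also matches abbreviated secondary forms ("en", "fr").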
Source anchors shared by several languages
In many multilingual websites, there are source anchors that remain unchanged across the different languages. For instance, the source anchor "Einstein" may be present in all the languages of a multilingual site. We consider such a source anchor to be shared among several languages.
Notion 2. Source anchor shared by several languages: a shared source anchor is present in several languages of the multilingual website. In each language, it leads to a page in that same language.
In this scenario, we identify two types of situations. The first involves a source anchor shared by several languages that does not trigger a language change, resulting in a monolingual hyperlink effect. For instance, in a bilingual English-French website, the source anchor "Einstein" may be available on both the French and the English pages. When it appears on an English page, this anchor points to an English page; similarly, when it appears on a French page, it points to a French page (see figure 1.3).
The second situation is a source anchor shared by several languages that consistently links to a fixed document available in only one language. This type of anchor can trigger a language change. For instance, in a bilingual English-French website, a source anchor such as "Einstein" present on a page, whether French or English, directs the user to an English page.
The second type of situation presents significant challenges due to the confusion between source anchors shared by multiple languages and source anchors used for interlingual references.
Source anchors for interlingual reference
This type of source anchor is encountered when a document is written in a language that references existing materials in other languages within a multilingual website.
Notion 3. Source anchor for interlingual reference: this is not a shared anchor; it is an anchor located on a page in some language and used to point to another (reference) page in a different language.
For instance, a bilingual website, as illustrated in Figure 1.4, demonstrates that the source anchor "Newton" present on a page in Language A refers to a different page in Language B.
Fig. 1.3 – Example of a source anchor shared by two languages
Fig. 1.4 – Example of a source anchor for interlingual reference
The second shared-source-anchor situation can be viewed as a specific case of a source anchor for interlingual reference. For instance, in a bilingual website, the source anchor "Einstein" appears in two languages (A and B) but refers solely to a single page in language B. It therefore qualifies as a source anchor for interlingual reference (see figure 1.5).
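The confusion described here can be probed mechanically. The sketch below, under the assumption that the language of a page is what a statistical identifier such as langdetect reports, compares the language of the page carrying an anchor with the language of the page it points to: a match suggests the monolingual shared-anchor situation, a mismatch an interlingual reference (or a shared anchor of the second kind).

    import requests
    from bs4 import BeautifulSoup
    from langdetect import detect  # pip install langdetect

    def page_language(url):
        """Detected language of the visible text of a page."""
        html = requests.get(url).text
        text = BeautifulSoup(html, "html.parser").get_text(" ", strip=True)
        return detect(text)

    def classify_anchor(source_url, target_url):
        src, tgt = page_language(source_url), page_language(target_url)
        if src == tgt:
            return "shared anchor with a monolingual effect (first situation)"
        return "interlingual reference, or shared anchor of the second kind"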
Fig. 1.5 – Example of the confusion between a source anchor for interlingual reference and a source anchor shared by two languages
Complementary documents
One of the notable characteristics that appears very often in the structure of multilingual websites is the existence of complementary documents.
Notion 4. Complementary document: a document that exists in several of the languages used on a multilingual website, but not in all of them.
In this context, we distinguish two types of situations:
– the first type: the document has not been translated into the other languages,
– the second type: the document is not considered important, and the author does not wish (or forgets) to put links to its translations, even though these exist.
Complementary documents can also be of great importance when dealing with information retrieval or information extraction from multilingual websites.
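A minimal sketch of how complementary documents could be flagged, assuming the site has already been crawled and each page path has been normalized by removing its language prefix; pages present in some language versions but absent from others are reported.

    def complementary_documents(pages_by_lang):
        """pages_by_lang: dict mapping a language code to the set of
        language-neutral page paths found in that version of the site."""
        all_paths = set().union(*pages_by_lang.values())
        report = {}
        for path in all_paths:
            present = {lang for lang, pages in pages_by_lang.items()
                       if path in pages}
            if len(present) < len(pages_by_lang):
                report[path] = present       # exists here, missing elsewhere
        return report

    # complementary_documents({"en": {"news/a", "about"}, "fr": {"news/a"}})
    # -> {"about": {"en"}}   # "about" has no French counterpart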
1.6 Evolution of multilingual websites
Several years ago, one of the most significant events on the web was the introduction of online translation services by AltaVista and FreeTranslation. These services played a crucial role in accelerating the development of automatic translation tools for websites. The improvement in the quality of machine translators has created new opportunities for building multilingual websites: monolingual websites can now be translated into several languages at reasonable cost using specialized software.
It is important to recognize that the quality of machine translation remains problematic, particularly between languages from different families. These challenges are primarily linguistic, involving syntactic, morphological, and especially semantic issues. Consequently, machine translators are often unable to meet the demand for high-quality translations. Beyond scientific, technical, and commercial texts, machine translation holds little promise for cultural works, or for legal, diplomatic, and political documents.
Conclusion
The diversity of languages on the Internet and the Web arises from two main dynamics: the rapid standardization of information technologies globally, and the Web's potential as a resource distribution environment. Concurrently, human activities online reflect trends of internationalization and globalization. This situation suggests that European languages such as English and French continue to attract a significant number of Internet users because of their cultural, economic, and political influence, as well as historical factors, often at the expense of native languages. To overcome this cultural barrier, the emergence and rapid growth of multilingualism on the Web have become essential.
In this chapter, we explored various types of multilingual websites and the different strategies for navigating between languages. We also highlighted several notions, considerations, and sources of confusion that may arise when dealing with multilingual sites.
We supported our approach with experiments aimed at evaluating hypotheses related to linguistic parallelism and source anchors. We also presented the various types of source anchors that can exist on multilingual websites: language-switch anchors, source anchors shared by several languages, and interlingual reference anchors.
In this thesis, we focus solely on the first type of multilingual website (non-mixed languages), as the usage and development of the second type (mixed languages) are becoming increasingly rare. These elements contribute to a comprehensive understanding of the typology of multilingual websites.
We devote the next chapter to our survey of the state of the art in information extraction from multilingual websites.
Information extraction from multilingual websites
Introduction
Information extraction refers to the automatic extraction of specified types of information from natural-language texts. This process is essential for creating structured information, particularly databases, from unstructured textual sources. The term has its roots in computational linguistics and natural language processing research, heavily influenced by text comprehension studies until the late 1980s¹.
This chapter provides a brief history of automatic and semi-automatic information extraction, highlighting the evolving trends in this field. It also discusses the strategies currently being developed within this discipline, with a particular focus on information extraction from multilingual websites.
1 In particular, Popov considered information extraction as a new discipline within natural language processing [PKK+03].
2.2 An approach based on a reduced representation of information
Form of information
Information is a multifaceted concept that varies across scientific disciplines: thermodynamics with entropy, physics with signal theory, biology with genome theory, and economics with decision theory. This diversity is also evident in professional sectors such as journalism and public administration². However, the principal architect of information theory is Shannon.
In 1948, Shannon built upon the studies of Kupfmuller (1924), Hartley (1928), and Whittaker (1935) to develop his theory of information. He fundamentally transformed the concept of information into a quantifiable physical phenomenon, defining it as a measure of entropy and of the degradation of a signal in the presence of noise.
Shannon's theory has been the object of numerous criticisms concerning the applications of the statistical theory of communication. In opposition to Shannon's position, various authors have explored different avenues in the field of semantic information theory. Bar-Hillel developed a semantic information theory based on propositional logic, independent of transmission aspects. Barwise's work on situation theory focused on pragmatics, emphasizing the strong connection between information and context. Jakobson proposed a model of human communication comprising an emitter, a receiver, a context, a contact between them, a common code, and a message. Kerbrat-Orecchioni reformulated this model by incorporating the notion of a discourse universe, which encompasses the concrete conditions of communication, the constraints on the discourse topic, and the specific nature of both the emitter and the receiver.
Still in the spirit of Shannon's theory, the term information, in its technical usage, carries two distinct senses:
2 Encyclopédie Hachette Multimédia 2005.
– in the strict sense of information theory, information is defined quantitatively, by a formula similar to the one used by the physicist Ludwig Boltzmann at the end of the 19th century to measure the entropy of a gas, albeit with an inverted sign;
– the term information also designates a numerical symbol in binary code (0 or 1) [Bre93].
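For reference, the quantity alluded to in the first sense is Shannon's entropy of a source emitting symbols with probabilities p_i; the formula below is supplied here for clarity and is not spelled out in the passage itself:

    H(X) = -\sum_{i=1}^{n} p_i \log_2 p_i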
This technical usage of the term information stems from a crucial distinction between the form and the meaning of a message. In this thesis, we have compelling reasons to adopt an interpretation of information that focuses on its form rather than its meaning. Consequently, our research emphasizes the structure of this form as a fundamental element for the treatments we propose, namely structural analyses aimed at recognizing the multilingual characteristics of a website.
Structure of information
Géry identifies four types of information based on its structure: atomic information, structured information, hyper-information, and contextual hyper-information. He emphasizes that information cannot exist independently of its context and is closely tied to the types of documents that carry it: atomic documents, structured documents, hyperdocuments, and contextual hyperdocuments (dynamic, such as a website).
In this context, the concept of a document is linked to the organization of its information-carrying components. A document has a structure defined by appropriate coding, such as SGML (Standard Generalized Markup Language), HTML (Hypertext Markup Language), or XML (Extensible Markup Language). It also requires suitable representations (audio, visual, etc.) across various material formats, including digital and printed media. According to Estival, "any knowledge that is memorized, stored on a medium, fixed by writing, or inscribed through mechanical, physical, chemical, or electronic means constitutes a document." Any evolution of the document concept is embodied in the introduction of new document models, such as text, hypertext, and hypermedia³. A document model is a standard that proposes a formal system or notation for describing and constructing documents; a single language can define several document models.
3 The Agora Encyclopedia (http://agora.qc.ca/).
Structuring a document involves identifying and describing its various textual and non-textual elements. Generally, two types of structure are distinguished: the physical structure and the logical structure. The physical structure pertains to the layout of the document, detailing the arrangement of text areas and their typographic characteristics, such as font, color, bold, and italics. The logical structure, in contrast, focuses on the role, behavior, and nature of each document element, as well as the hierarchical and logical connections between them.
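The following sketch contrasts the two structures on a toy HTML fragment: the logical structure is approximated by the roles of the elements (headings, paragraphs), the physical structure by their presentation attributes. The fragment and the chosen tags are illustrative only.

    from bs4 import BeautifulSoup

    html = """<html><body>
      <h1>Chapter</h1>
      <p style="color:red"><b>Bold red</b> paragraph.</p>
      <h2>Section</h2><p>Plain text.</p>
    </body></html>"""

    soup = BeautifulSoup(html, "html.parser")

    # Logical structure: role and nature of each element, in document order
    logical = [(el.name, el.get_text(" ", strip=True))
               for el in soup.find_all(["h1", "h2", "p"])]

    # Physical structure: typographic hints attached to the elements
    physical = [(el.name, el["style"]) for el in soup.find_all(style=True)]
    physical += [(el.name, "bold") for el in soup.find_all("b")]

    print(logical)   # [('h1', 'Chapter'), ('p', 'Bold red paragraph.'), ...]
    print(physical)  # [('p', 'color:red'), ('b', 'bold')]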
2.3 Information extraction
First period: before the MUC program
Initially, information extraction was limited to a few projects that Gaizauskas categorized as work prior to the MUC program⁴. Among these, he highlighted two long-term natural language processing research projects:
– the Linguistic String Project (LSP), which began in the mid-1960s and ran until the early 1980s at New York University; this project consisted of developing a computational grammar of English to create regularized forms of information (that is, templates) in the medical domain;
– FRUMP, based on R. Schank's model and carried out at Yale University for language understanding, in particular of news-story texts [DeJ82].
4 Gaizauskas categorizes developments in information extraction into three main categories: early work prior to the MUC program, research conducted within the MUC program, and studies outside the MUC program [GW98].
Following these projects, the 1980s saw the development of the first commercial systems [GW98]:
– ATRANS, for the automated processing of money-transfer messages between banks; it uses an approach developed at Yale University and initiates the transfers automatically after a manual verification step;
– JASPER, developed by the Carnegie Group for Reuters, which analyzes PR Newswire press releases through a module that extracts information on company earnings and dividends;
– SCISOR, developed by General Electric to analyze corporate mergers and acquisitions [JR90].
Two further academic research projects were also carried out. The first, developed by J. Cowie, extracted regular descriptions (that is, templates) of plants from flower guides, using a specific domain and a set of manually chosen keywords. The second, led by G. P. Zarri, aimed to automatically translate French historical texts into a metalanguage capturing certain semantic relationships related to biographical details. The common feature of these early projects was template filling with information extracted from natural-language texts, still processed manually and tailored to specific domains.
Second period: during the MUC program
In the early period before MUC, information extraction was significantly shaped by research in text comprehension, primarily relying on linguistic approaches and techniques.
Subsequent research focused on extracting information from textual data to address linguistic challenges, with the exception of a few transitional projects that emerged at the end of the first MUC stage, spurred by the accelerating production of semi-structured Web documents on the Internet.
Three dominant trends characterized this second period: the adaptation of linguistic processing to the specificities of systems (automata), the automatic acquisition of extraction rules, and the integration of relatively independent modules.
The best-known systems arising from this movement, from MUC-3 (1991) onwards, include TACITUS [Hob91], Proteus [GS93], and PIE [Lin95]. The field adopted an official mission and a coordinated set of tasks centered primarily on template extraction, an evolution clearly demonstrated by the projects presented at the meetings from MUC-4 (1992) onwards.
In 1995, the MUC-6 conference showcased various systems such as SRI, FASTUS, SRA, and TIPSTER. Most of these systems lacked advanced computational linguistic processing; however, systems such as LASIE, PLUM, and MITRE incorporated more sophisticated linguistic treatments.
The five fundamental information extraction tasks defined by the MUC program are:
– NE (Named Entity): recognition of named entities,
– CO (Coreference): coreference resolution,
– TE (Template Element): construction of template elements,
– TR (Template Relation): construction of template relations,
– ST (Scenario Template): production of scenario templates.
To establish a standard model of information extraction, Hobbs proposed a system comprising ten modules [HAT+92]:
– text segmentation,
– pre-processing of a text segment into sentences,
– sentence filtering,
– pre-parsing of lexical structures such as noun groups, verb groups, and appositions,
– parsing of lexical items and sentence structures,
– combination of fragments,
– semantic interpretation,
– lexical disambiguation,
– coreference resolution,
This period was also characterized by the completion of several European projects, including POETIQUE (Portable Extendable Traffic Information Collator), SINTESI (Integrated Systems for Tests in Italian), TREE (Trans-European Employment), and FACILE (Fast Accurate Categorization of Information using Language Engineering), as well as various initiatives of the LC CEC program (Language Engineering, Commission of the European Communities).
Third period: after the MUC program
During this period, several systems were introduced, including WHISK, RAPIER, SRV, WIEN, SoftMealy, and STALKER. Three emerging trends were observed: the portability of information extraction systems, automatic content extraction, and annotation for the semantic Web.
Portability of information extraction systems
Adapting existing systems to new application domains was a difficult task, in which three main lines of work stood out [Cun05]:
– learning extraction rules from annotated examples [Car97],
– developing machine learning algorithms that build rules and models by observing and analyzing the tasks performed by qualified staff.
The ACE (Automatic Content Extraction) program, initiated in September 1999, led to a new generation of robust natural language processing applications. This advance was driven by the faster development of autonomous systems for processing annotated corpora, with significant potential spin-offs for document retrieval, data mining, the development of large knowledge bases, and automatic annotation for the Semantic Web.
Annotation for the Semantic Web
The annotation of Web pages, as well as the creation of ontologies, have become automatic or semi-automatic tasks. This has given rise to an entirely new research field called OBIE (Ontology-Based Information Extraction) [BW04]. OBIE has set itself two main challenges [Bri98]:
– identifying new concepts and instances in text to enrich the Web ontology,
– classifying semantic annotation platforms into several primary categories, based on patterns, on machine learning⁵, or on a combination of the two approaches.
Several systems have been developed to address these challenges, including AeroDAML, Armadillo, KIM, Magpie, MnM, MUSE, Ont-O-Mat, and SemTag.
AeroDAML employs a pattern-based approach to assign proper names and common relationships to the corresponding classes and attributes defined by the DAML⁶ ontology. The system features AeroText, a Java API for information extraction, which organizes ontology usage at two levels: a top level consisting of a WordNet hierarchy and a bottom level comprising a knowledge base. AeroText includes four main components: a compiler that transforms linguistic data into a knowledge base, an engine for processing source documents, an Integrated Development Environment (IDE) for constructing and testing the knowledge base, and a common knowledge base containing domain-independent rules for extracting proper names and relationships.
Armadillo [DCW03] is an evolution of the Amilcare system, incorporating an induction wrapper module for websites with a highly regular structure. The system employs a pattern-based approach to search for named entities. No manual annotation is required; the system calls on the Google and CiteSeer Web services to verify and then confirm or reject the entities found.
5 Semantic annotation platforms based on machine learning use two approaches: probabilistic and inductive.
6 DAML - DARPA Agent Markup Language (http://www.daml.org/).
KIM [PKK+04] consists of an ontology, a knowledge base, semantic annotation, indexing, and a server. It uses the SESAME RDF repository [BKvH02] to store the ontology and the knowledge base. The semantic annotation process is based on a pre-built ontology, KIMO, and an inter-domain knowledge base. The information extraction component for semantic annotation reuses elements of the GATE tool [CMBT02].
Magpie [DD04] automatically associates a semantic layer with a web resource instead of relying on manual annotation, using an ontology in the sense proposed by [Gru93]. This innovation positions Magpie as a significant step towards a semantic web browser.
MnM [VVMD+02] offers a platform for the manual annotation of a training corpus. It also incorporates induction mechanisms based on the Lazy-NLP algorithm. The results take the form of a library of induction rules that support information extraction from texts (corpora).
MUSE [MTB+03] is designed for the recognition of named entities and coreferences, and is implemented within the GATE framework [May03]. Information extraction modules (Processing Resources) serve as a processing pipeline to identify entities, while semantic tagging is achieved through JAPE [CMBT02].
Ont-O-Mat [HSC02] is an implementation of the S-CREAM semantic annotation framework (Semi-automatic CREAtion of Metadata). It uses Amilcare's information extraction tools, which rely on the ANNIE module (A Nearly-New IE system) of GATE. ANNIE sends its results to Amilcare, which generates extraction rules. The annotation module of Ont-O-Mat was subsequently replaced by the PANKOW algorithm (Pattern-based Annotation through Knowledge On the Web) [CHS04], similar to the one used by Armadillo [DCW03].
SemTag [DEG+03] is a semantic annotation module within the Seeker platform, designed to annotate web pages in three main phases: identification, learning, and labeling. The system is extensible, allowing new implementations to replace the Taxonomy-Based Disambiguation (TBD) algorithm. SemTag uses the TAP taxonomy, which covers a wide range of lexical and taxonomic information drawn from diverse, non-specialized material on topics such as music, cinema, sports, and health.
2.4 Information extraction strategies
Adapters based on descriptive languages
One of the earliest, primitive approaches involves defining descriptive languages to help users build adapters. Notable systems in this area include Minerva, TSIMMIS, Web-OQL, and LIXTO.
The structural property of a web document is a relative notion that depends on specific criteria. Hsu proposed a classification that distinguishes three types of web documents: unstructured, semi-structured, and structured.
Minerva [CM98] is a central module of the Araneus system, designed for creating adapters with a descriptive grammar written in EBNF (Extended Backus Naur Form). For each document, a set of "productions" is defined, each production describing the structure of a non-terminal symbol of the grammar. Minerva also features a language for document search and restructuring, called Editor, which additionally provides basic text editing functions.
TSIMMIS is a system that allows users to specify rules for extracting semi-structured data from web pages. It includes adaptable components that are configured through user-written specification files. Each specification file consists of a sequence of commands that describe the extraction process. Commands take the form [variables, source, pattern], where "variables" holds the extraction results, "source" designates the web document, and "pattern" describes the data to be recognized.
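As an illustration of the [variables, source, pattern] commands just described, here is a toy interpreter in that spirit; the command syntax is a deliberate simplification of real TSIMMIS specification files, and the HTML snippet is invented.

    import re

    def run_command(variables, source_html, pattern):
        """Bind each variable to the corresponding group captured by the pattern."""
        match = re.search(pattern, source_html, re.DOTALL)
        if not match:
            return {}
        return dict(zip(variables, match.groups()))

    html = "<tr><td>EUROPA</td><td>20 languages</td></tr>"
    command = (["site", "languages"],              # variables
               html,                               # source
               r"<td>(.*?)</td><td>(.*?)</td>")    # pattern
    print(run_command(*command))
    # {'site': 'EUROPA', 'languages': '20 languages'}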
Web-OQL is a declarative query language designed to extract specific patterns from HTML pages. It uses a generic adapter to analyze the input page, presenting the result as an abstract syntax tree of the HTML, known as a hypertree, which represents the document. This syntax allows users to write queries that locate the desired data within the tree and transform it into structured formats, such as tables.
LIXTO [BFG01] is a system designed to assist users in the semi-automatic creation of adapters through a visual and interactive interface. It features an extraction rule description language (Elog) based on first-order logic. LIXTO can also convert data extracted from an HTML page into XML.
Inductive generation of adapters from labeled pages
This approach builds adapters by learning from labeled example pages. We present, in order, the systems WIEN, STALKER, and SOFTMEALY.
WIEN [Kus97] introduced the first formalization of the inductive generation of an adapter, together with a set of tools for automatically labeling documents. The process uses inductive learning to create an adapter consisting of a set of rules derived from a collection of labeled pages.
STALKER [MMK98] is a supervised learning system for discovering extraction rules. It introduces the concept of Embedded Catalog (EC) trees to describe the logical structure of documents. STALKER also transforms documents into sequences of symbols, represents extraction rules as automata, adapts a method of induction through successive refinements, and broadens the notion of delimiters.
SOFTMEALY [HD98] primarily aims to address issues related to the order in which attributes appear and to missing attributes. The extraction rules must take into account the various permutations of attributes present in the occurrences of a relation to be extracted. SOFTMEALY introduces an approach that relies on separators rather than delimiters. A separator characterizes a position based on the text immediately before and after it, taking into account both the left and the right context of the determined position. This position can correspond either to the beginning or to the end of a value, so that even the format of the content of that value is taken into account by the separator.
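A sketch of the separator idea on an invented snippet: the position of a value is characterized by its left and right context, rendered here with regular-expression lookbehind and lookahead (SOFTMEALY itself uses finite-state transducers, not regexes).

    import re

    html = "<b>Name:</b> Ada Lovelace<br><b>Born:</b> 1815<br>"

    # Left context "Name:</b> " and right context "<br>" act as the two
    # separators delimiting the value.
    value = re.search(r"(?<=Name:</b> ).*?(?=<br>)", html)
    print(value.group(0))   # "Ada Lovelace"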
2.4.3 Generation of adapters by pattern extraction: analyses of document structure
The regularity of tag sequences in document structures allows patterns to be extracted through structural analysis, generating expressions that describe the general format of the data to be extracted. Notable systems exemplifying this approach include W4F, XWRAP, ROADRUNNER, and IEPAD.
W4F [SA99] is a tool for building adapters that divides the development process into three phases: the user first specifies how to access the document, then describes the desired data, and finally defines the structure for storing the results. When a document is retrieved according to the search rules, W4F passes it to a parser that constructs a Document Object Model (DOM) tree. Users write extraction rules in the HTML Extraction Language (HEL) to extract data from the tree. The extracted data is stored in a Nested String List (NSL) format before being transformed into other formats for various applications.
XWRAP [LPH00] represents documents as trees by building a component library and an interactive interface, which helps users create Java adapters tailored for each specific data source.
ROADRUNNER [CMM01] analyzes HTML features to generate adapters automatically. The method compares the HTML structure of two or more pages belonging to the same class in order to derive a data schema for the information contained in these pages. From this schema, a grammar is derived to identify instances of the schema's attributes within web pages.
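A much-simplified sketch of the comparison idea: two pages of the same class are tokenized into tags and text, positions where they agree are kept as template, and disagreements are marked as data slots. Real ROADRUNNER performs a far more elaborate alignment (handling optional and repeated parts); this toy version assumes token sequences of equal length.

    import re

    def tokens(html):
        return re.findall(r"<[^>]+>|[^<]+", html)

    def infer_template(page_a, page_b):
        return [a if a == b else "#DATA#"
                for a, b in zip(tokens(page_a), tokens(page_b))]

    p1 = "<html><b>Title:</b>Alpha<i>2001</i></html>"
    p2 = "<html><b>Title:</b>Beta<i>2003</i></html>"
    print(infer_template(p1, p2))
    # ['<html>', '<b>', 'Title:', '</b>', '#DATA#', '<i>', '#DATA#', '</i>', '</html>']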
IEPAD [CL01] is a system designed to discover extraction rules from web pages automatically. It can identify the boundaries between fields by exploiting frequent patterns and multiple sequence alignment. The frequent patterns are discovered using a PAT tree, a specialized data structure. Moreover, the frequent patterns can be extended through alignment, allowing a broader range of examples to be covered.
Generation of adapters by natural language processing techniques
Certain systems are built on natural language processing techniques, as implemented in tools such as WHISK, RAPIER, and SRV. These systems learn extraction rules for data present in texts, based on syntactic and semantic constraints.
WHISK [Sod99] is a machine learning system that generates extraction rules for a wide range of documents, from structured to unstructured. The extraction patterns are special regular expressions with two components: one describes the context of the pattern, while the other specifies the delimiters of the information to be extracted.
RAPIER [Cal98] learns extraction patterns by using both syntactic information and semantic class information. Its model includes a pre-filler pattern and a post-filler pattern (which act as left and right delimiters) and a filler pattern that describes the structure of the information to be extracted.
SRV is a tool for learning extraction rules from textual data. It operates on a processing method that uses a set of token-oriented features. These features can be simple or relational: a simple feature assigns a discrete value to a term, while a relational feature links one term to another. Rule learning proceeds by identifying and generalizing the features found in examples. SRV can also extract data from HTML pages by means of specific features.
Generation of adapters from templates
This approach extracts data segments from documents in order to fill pre-constructed templates. Two examples of this method are the NoDoSE and DEByE systems.
NoDoSE [Ade98] is an interactive tool for semi-automatically determining document structures in order to extract semi-structured data. Through the interface, the user hierarchically decomposes the document's structure by selecting and describing groups of data. At each level of decomposition, the user creates a complex object and then breaks it down into objects with a simpler structure. NoDoSE learns how the user identifies objects by inducing a grammar of the documents from the constructed objects.
DEByE [RNLdS99] is an interactive tool that collects a set of objects from a web page and generates extraction patterns to identify new objects in similar pages.
Generation of adapters from ontologies
This approach does not rely on the document's data structure to generate rules or patterns for information extraction; extraction is performed directly on the data. For a specific domain, an ontology is used to identify data segments within a document, from which objects are constructed. A well-known system using this approach is BYU.
BYU [ECJ+99] is a tool developed by the Data Extraction Group at Brigham Young University. It analyzes an ontology constructed manually by experts in order to generate a database automatically from the associated documents.
In addition to these approaches, Habegger cites adapters generated from extracted relations and adapters generated from knowledge bases [Hab04].
Adapters generated from extracted relations
This approach aims to develop a set of patterns that allow a subset of the instances of a given relation to be extracted. These patterns can be applied across all web pages. An example of this method is the DIPRE system.
DIPRE [Bri98] is based on the hypothesis of a duality between patterns and relations: for a given relation, there exists a set of patterns that help identify some of its occurrences. The algorithm automatically identifies new patterns from the initial relation instances, and these patterns are then used to extract new instances. This iterative process continues until a fixed point is reached, where no new examples of the relation are produced, or until the user is satisfied with the number of extracted examples.
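A compact sketch of the DIPRE loop for an invented (author, book) relation: seed pairs yield occurrence patterns, the patterns yield new pairs, and the process iterates until no new example appears. Pattern induction is reduced here to the literal middle context between the two fields, a drastic simplification of [Bri98].

    import re

    corpus = ("... Herman Melville wrote Moby Dick in ... "
              "... Jane Austen wrote Emma in ...")

    def find_patterns(pairs, text):
        """The 'middle' context between author and book, for each known pair."""
        patterns = set()
        for author, book in pairs:
            m = re.search(re.escape(author) + r"(.{1,10}?)" + re.escape(book), text)
            if m:
                patterns.add(m.group(1))            # e.g. " wrote "
        return patterns

    def apply_patterns(patterns, text):
        """Find new capitalized (author, book) pairs around each pattern."""
        pairs = set()
        for middle in patterns:
            for m in re.finditer(r"([A-Z]\w+ [A-Z]\w+)" + re.escape(middle)
                                 + r"([A-Z]\w+( [A-Z]\w+)*)", text):
                pairs.add((m.group(1), m.group(2)))
        return pairs

    relation = {("Herman Melville", "Moby Dick")}   # seed examples
    for _ in range(3):                              # a few bootstrap rounds
        new = apply_patterns(find_patterns(relation, corpus), corpus) - relation
        if not new:
            break                                   # fixed point reached
        relation |= new
    print(relation)   # now also contains ("Jane Austen", "Emma")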
Adapters generated from knowledge bases
The goal of this approach is not to create a specific adapter for a given source. Instead, it leverages domain knowledge to develop a generic adapter that can be applied to any page belonging to a given domain. Two systems follow this approach: one by Gao [GS99] and the other by Seo [SYC01].
Gao [GS99] introduced a hybrid representation of semi-structured data schemas, in which a schema is depicted as a hierarchy of concepts together with a collection of knowledge units. A generic adapter algorithm was developed to exploit the created schemas and take advantage of page structures.
XTROS [SYC01] represents applied domain knowledge and automatically builds an adapter for each information source. The adapter generation algorithm identifies logical lines within a document using domain knowledge, and then searches for the most frequent pattern in the sequence of these logical lines. The adapter is then built from the position and structure of this pattern.
Automation is a key criterion for comparing systems that extract information from web sources. Traditionally, the natural approach is to construct extraction rules or patterns manually. In this manual method, a system can either come with predefined extraction rules, without detailing how they were derived, or provide tools to assist users in creating rules. However, building adapters by hand is a tedious task, especially given the multitude of sources requiring such efforts. This led to the idea of automating the process, first realized in a pioneering method of automation proposed by Kushmerick.
According to Laender [LRNdST02], three groups of systems can be distinguished by their level of automation:
– manual systems: Minerva [CM98], TSIMMIS [HGMN+97], Web-OQL [AM99], and BYU [ECJ+99],
– semi-automatic systems: W4F, WHISK, RAPIER, SRV, WIEN, SoftMealy, STALKER, NoDoSE, and DEByE,
– automatic systems: XWRAP and RoadRunner.
Multilingual information extraction
Since MUC-6, DARPA has added the Multi-lingual Entity Task (MET) for the first task, the recognition of multilingual named entities. However, multilingual information extraction remains closely related to interlingual information extraction; at times, the two draw on each other in a fundamental approach known as interlingual projection.
Numerous multilingual information extraction research projects, such as ECRAN and MIETTA, have been introduced. These models share common features: they use machine translation tools to handle the targeted languages, together with language-independent modules that perform similar tasks across these languages.
Masche proposed a fairly complete model for multilingual information extraction, based on the following ideas [Mas04] (a sketch of such a pipeline follows the list):
1. Documents written in various languages are identified by a natural language processing module and translated into a preferred language, for which a corresponding monolingual information extraction system is available.
2. Language-independent modules, specifically multilingual modules, perform common, predefined tasks such as word segmentation and named entity recognition; other tasks remain monolingual, including morphological analysis, syntactic processing, and coreference analysis.
3. The extracted templates are translated into several languages according to the user's needs.
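A hedged sketch of the pipeline implied by these three ideas, where identify(), translate(), extract(), and translate_template() are hypothetical stand-ins for a language identifier, a machine translation system, a monolingual extractor, and a template translator:

    PIVOT = "en"   # preferred pivot language (an assumption of this sketch)

    def multilingual_extract(documents, target_langs):
        # identify(), translate(), extract(), translate_template() are
        # hypothetical stand-ins, not real library calls.
        results = []
        for doc in documents:
            lang = identify(doc)                           # idea 1: identification
            pivot_doc = doc if lang == PIVOT else translate(doc, lang, PIVOT)
            template = extract(pivot_doc)                  # monolingual extraction
            results.append({t: translate_template(template, PIVOT, t)
                            for t in target_langs})        # idea 3: translation
        return results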
These models show that the core concept of multilingual information extraction may lead to certain tasks being carried out repeatedly in several languages. This approach is likely to suffer from several weaknesses:
– it neglects the structure of the multilingual corpus when one exists, as in the case of a multilingual website,
– it duplicates work on information repeated in several languages,
– it misses information that does not exist in one or several languages of the multilingual corpus.
Conclusion
This leads us to consider some new fundamental problems to be solved in order to improve the performance of multilingual information extraction systems:
– identifying the additional information available in one or more languages but absent from the others in the corpus,
– establishing the translation correspondence between documents across the different languages of the collection.
The field of information extraction has clearly defined its development strategies since the early 1990s. Many other fields call upon it, and vice versa⁸.
The most significant consequence of the MUC program for our project is its focus on prioritizing the extraction of informational templates in the process of extracting information from documents. This focus is what distinguishes information extraction from text comprehension. Consequently, we emphasize structural analyses over linguistic treatments of documents, particularly hyperdocuments.
Currently, we identify three key trends in the field of information extraction: the portability of extraction systems, automatic content extraction, and ontology-based automatic annotation. At the heart of these trends, adapters are a striking technical phenomenon that embodies much of the field's development in recent years.
We have also observed the emergence of a new movement that will probably dominate this research discipline in the years to come: multilingual information extraction. Since it relies essentially on automatic translation tools, whose quality remains problematic, this emerging field of study has yet to define and validate its theoretical and methodological foundations.
8 Several authors consider the relationship between information extraction and text mining, web mining, and information retrieval to be bilateral: on the one hand, information extraction serves as a preprocessing task for these other disciplines; on the other hand, they are integral components of the information extraction process.
Recognition of the dominant languages in a multilingual website
Introduction
In this chapter, we analyze the structure of the Web at three levels: the internal structure of a web page, the external structure of a web page, and the macroscopic structure of the Web. This analytical approach distinguishes between the hierarchical structure and the hypertextual structure of the Web.
This chapter presents the key properties of the Web identified to date, examining them from two perspectives: the macroscopic view, which asks whether the Web graph can be decomposed into large components, and the microscopic view, which explores the presence of specific local structures within the Web graph. It also discusses the statistical properties of the Web graph and the most significant models used in its study.
Finally, we focus on the surface structure of the Web, consisting of a collection of static HTML pages, in order to introduce our model for representing websites. We also highlight some useful concepts, such as source anchors and the source anchor graph.
Structure of the Web
Hierarchical structure of the Web
By definition, a structured document is composed of a set of elements (or objects) organized according to a logic that is most often hierarchical (the logical structure).
Within the framework of this thesis, the notion of a structured document comprises three main components: the content, the structures, and the reading strategies [Gér02].
1. The content of a structured document refers to its textual or multimedia information, represented as a set of components (figures, images, tables, paragraphs, etc.).
2. Standards for representing structured documents, such as ODA (Office Document Architecture) and SGML (Standard Generalized Markup Language), distinguish two types of structure, the physical structure and the logical structure, defined as follows:
The physical structure refers to the way data is organized and displayed within a document. It depends on the presentation environment, such as the paper format or the type of screen used (personal computer, handheld device, mobile phone).
The logical structure corresponds to the hierarchical organization of the document's data, implicitly suggesting a reading strategy. This structure is generally independent of the presentation environment.
3. The reading strategy of a structured document consists of reading its successive parts in an implicitly known order, until reaching the conclusion or deciding to stop reading.
The concept of hierarchical structure is prominent both in the design of HTML pages and in the organization of websites (homepages, directories). It is important to distinguish the hierarchical structure of individual pages from that of entire sites: the logical structure can be articulated within a single HTML page as well as across multiple HTML pages.
Intra-page hierarchical structure
HTML pages (or their equivalents) possess an internal structure, called the intra-page hierarchical structure, which makes it possible to define elements of different granularities.
Various approaches have been developed to extract or identify the hierarchical structure within a web hyperdocument, utilizing logical structures described through HTML tags or other types of structured description languages, such as SGML.
– Fuller proposes fragmenting a textual document, expressed in SGML, into a set of nodes and composition relations in order to transform this structure into a hypertext [FMSDW93].
– Riahi suggests the use of an object-oriented database structured around informational units, which are extracted and organized according to HTML tags.
– Carchiolo models the internal logical structure of websites by combining the structure described by HTML tags with the structural similarity of document parts [CLM00].
– Géry analyzes the internal structuring of HTML pages at three levels of HTML granularity: the sentence, the paragraph, and the section [Gér02].
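To make the tag-based idea concrete, here is a minimal Python sketch that recovers a heading outline from an HTML page using only the standard library; it illustrates the general principle and is not a reimplementation of any of the systems cited above.

# Minimal sketch: recovering an intra-page hierarchy from HTML headings.
from html.parser import HTMLParser

class HeadingOutliner(HTMLParser):
    """Collects (level, title) pairs for <h1>..<h6> tags."""
    def __init__(self):
        super().__init__()
        self.outline = []           # list of (level, title)
        self._level = None
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if len(tag) == 2 and tag[0] == "h" and tag[1].isdigit():
            self._level = int(tag[1])
            self._buffer = []

    def handle_data(self, data):
        if self._level is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if self._level is not None and tag == f"h{self._level}":
            self.outline.append((self._level, "".join(self._buffer).strip()))
            self._level = None

parser = HeadingOutliner()
parser.feed("<h1>Site</h1><h2>Section A</h2><p>text</p><h2>Section B</h2>")
for level, title in parser.outline:
    print("  " * (level - 1) + title)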
Other approaches use patterns to integrate semi-structured data from heterogeneous sources into a unified document model. We have also examined Salton's work on similarity search among text segments to identify semantic hyperlinks within a document.
Intra-site hierarchical structure
A website's structure contains two main types of hyperlink: referential and organizational (structural). Referential hyperlinks create pathways between source and destination documents, guiding the reader through the content. Organizational hyperlinks, in contrast, establish the hierarchical tree structure of a website, linking parent documents to child documents and vice versa.
Thanks to standardized notations such as URLs (Uniform Resource Locators), hyperlinks can be established between various resources, creating an internal hierarchical structure of a website, known as the intra-site hierarchical structure. This structure allows different sections to be split across multiple HTML documents rather than contained in a single one. However, these standards do not indicate whether the website represents a single structured document (linear reading) or a collection of documents organized hypertextually (navigational reading).
Botafogo showed that hierarchical (organizational) hyperlinks can be differentiated automatically from referential hyperlinks by extracting a root and the hierarchy that follows from it [BRS92]. He considers that a root gives access to all nodes except isolated ones, that it lies at a short distance from the other nodes, and that it has a considerable number of children. The first two criteria hold as soon as the node has at least one child; the third makes it possible to eliminate nodes that play only an index role without actually being the root of the site [Gér02].
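The following Python fragment sketches how root criteria in this spirit might be scored over a site graph given as an adjacency list; the combination of the three criteria into one score is an illustrative assumption, not Botafogo's actual formulation.

# Hedged sketch: scoring candidate roots of a site graph with criteria in the
# spirit of Botafogo's: reachability, short distances, and many children.
from collections import deque

def bfs_distances(graph, start):
    """Shortest hyperlink distance from `start` to every reachable page."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for succ in graph.get(node, ()):
            if succ not in dist:
                dist[succ] = dist[node] + 1
                queue.append(succ)
    return dist

def root_score(graph, node):
    dist = bfs_distances(graph, node)
    reachable = len(dist) - 1                              # criterion 1
    avg_dist = sum(dist.values()) / max(len(dist) - 1, 1)  # criterion 2
    children = len(graph.get(node, ()))                    # criterion 3
    return reachable - avg_dist + children                 # illustrative mix

site = {"/": ["/a", "/b"], "/a": ["/a1", "/a2"], "/b": [], "/a1": [], "/a2": []}
print(max(site, key=lambda n: root_score(site, n)))  # "/" under this weighting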
Aguiar highlights the difficulty of identifying structural hyperlinks within a website and considers two hypotheses: either structural hyperlinks exist but are mixed with other types, requiring a method to sort them out; or structural hyperlinks do not necessarily exist and must be extracted. Favoring the latter hypothesis, the author proposes a method based on the statistical analysis of the distribution of terms within and between pages, as well as the distribution of hyperlinks among pages, to extract these structural hyperlinks.
Géry further developed the extraction of an intra-site hierarchical structure, introducing an algorithm based on simple heuristics over URL syntax. This approach exploits the hierarchical organization of the server's directories.
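The fragment below sketches the kind of URL-syntax heuristic involved, deriving a page's hypothetical parent by trimming one segment of the server's directory path; it illustrates the idea only and is not Géry's published algorithm.

# Minimal sketch: treating the server's directory hierarchy as the site
# hierarchy, one path segment at a time.
from urllib.parse import urlparse

def parent_url(url):
    """Return the hypothetical parent of a URL, or None at the site root."""
    parts = urlparse(url)
    segments = [s for s in parts.path.split("/") if s]
    if not segments:
        return None
    parent_path = "/" + "/".join(segments[:-1])
    return f"{parts.scheme}://{parts.netloc}{parent_path}".rstrip("/") + "/"

for u in ["http://example.org/",
          "http://example.org/docs/",
          "http://example.org/docs/intro.html"]:
    print(u, "->", parent_url(u))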
Hypertext structure of the Web
A website is a hypertext consisting of nodes (HTML pages) interconnected by hyperlinks defined through URLs. The Web can thus be viewed as a collection of hypertexts, each website forming a distinct hypertext that is structured independently and organizes its information autonomously.
Hypertext is a model for organizing information into independent, autonomous, and interconnected units known as nodes. Each node represents a web page that can potentially link to numerous other nodes, yielding a dynamic, navigable information structure.
The concept of hypertext was first introduced by Bush, who emphasized the ability for users to manage documents non-linearly by organizing information associatively. Nelson later popularized the term "hypertext," envisioning a network of cooperating machines giving access to a distributed body of knowledge. In this context, a node is the fundamental unit of information within a hypertext and can take various forms: text, graphics, animations, images, or multimedia elements. Hypertext links can also connect nodes belonging to different hypertexts, allowing external connections between sites.
The internal hypertext structure of a website, known as the intra-site hypertext structure, organizes the HTML documents within a site. It allows users to navigate the website by choosing their own reading paths, in contrast to structured documents, which impose a fixed reading order.
Bray analyzed a collection of 11 million HTML pages and found that, while hyperlink density is significant (an average of 14 outgoing links per page, with only 25% of pages standing alone), pages tend to cluster together. He formalizes this observation with the concept of a website: a group of highly interconnected pages with few links to the broader Web, four out of five pages linking exclusively to other pages within the same site.
Many websites are isolated: 80% are linked to by fewer than ten other sites, and a similar proportion do not link to any other site.
The intra-site hypertext structure requires determining the role of a page rather than its position in a hierarchy. Pirolli proposes a classification of web pages according to their role in the hypertext, showing that a page's type can be identified by combining network topology analysis, document similarity, site usage statistics (such as access frequency and navigation patterns), and various other criteria (title, author, page size). Each page is represented by a feature vector, which is compared to a predefined list of vectors representing the characteristics of each class, as sketched below.
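A minimal sketch of this kind of nearest-prototype comparison follows; the feature set, the prototype values, and the choice of cosine similarity are illustrative assumptions rather than Pirolli's actual parameters.

# Hedged sketch: classifying pages by comparing a feature vector against
# prototype vectors, one per page role.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Illustrative features: (in-degree, out-degree, access frequency, size in KB)
prototypes = {
    "head page":    (5.0, 40.0, 100.0, 10.0),
    "index page":   (2.0, 80.0, 20.0, 5.0),
    "content page": (10.0, 3.0, 50.0, 60.0),
}

def classify(page_vector):
    return max(prototypes, key=lambda role: cosine(page_vector, prototypes[role]))

print(classify((8.0, 4.0, 45.0, 55.0)))  # closest prototype: "content page"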
Spertus assumes that the Web has a general structure, particularly within websites, that can be extracted from URLs. She establishes a classification of web pages and formulates rules for gathering information about a site's pages. These rules rely on the data contained in hyperlinks, enabling pages to be classified through a syntactic analysis of their URLs.
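The sketch below shows URL-based classification rules in this spirit; the specific patterns are invented examples, not the rules from Spertus's work.

# Minimal sketch: inferring a page's type from its URL alone.
import re

RULES = [
    (re.compile(r"/~\w+/?$"), "personal homepage"),
    (re.compile(r"/(index|default)\.html?$"), "index page"),
    (re.compile(r"/$"), "directory page"),
]

def classify_url(url):
    for pattern, label in RULES:
        if pattern.search(url):
            return label
    return "content page"

print(classify_url("http://example.org/~alice/"))          # personal homepage
print(classify_url("http://example.org/docs/index.html"))  # index page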
When considering only the hyperlinks that leave a website, pages must be viewed in the broader context of the Web rather than locally within a single site. This macroscopic structure organizes websites in relation to one another, through the connections their links establish.
Most methods for extracting a macroscopic structure of the Web operate on groups of pages rather than individual pages. For instance, clusters of websites often exhibit typical structures, such as Web rings.
Géry distinguishes two types of approach for extracting a macroscopic structure of the Web [Gér02]:
1. Processing a page or a site relative to the global Web: this approach has its origins in the analysis of citations and co-citations in the scientific literature, i.e. bibliometrics adapted to the Web [Kes63], [Sma74], [WM89]. Several methods seek to extract web pages that play a particular role in the hyperlink network, relying on a "score" to identify authority pages (referenced by many pages) or hub pages (which reference many pages) [Bri98]. This score can be refined by integrating a notion of quality [The01] or reputation [RM00], although these notions remain subjective as long as they are evaluated from the hyperlink network alone. Finally, one can also rely on scores combining authority and hubness [Kle99a], [Kle99b]; a sketch of such a combined score follows this list.
2. Processing a group of pages or a group of sites: the macroscopic structure of the Web is extracted by analyzing the connectivity of the inter-site hyperlink network. According to Kleinberg, these are community structures that identify communities of interest [GKR98], [KKR+99].
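As announced under the first approach, here is a compact sketch of a combined authority/hub score in the style of Kleinberg's HITS algorithm [Kle99a]; the toy graph and the fixed iteration count are assumptions made for the example.

# Hedged sketch: iterative authority/hub scoring on a toy hyperlink graph.
import math

def hits(graph, iterations=20):
    """graph: {page: [pages it links to]}. Returns (authority, hub) scores."""
    nodes = set(graph) | {v for targets in graph.values() for v in targets}
    auth = {n: 1.0 for n in nodes}
    hub = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # A page's authority grows with the hub scores of the pages citing it.
        auth = {n: sum(hub[u] for u in nodes if n in graph.get(u, ()))
                for n in nodes}
        # A page's hub score grows with the authority of the pages it cites.
        hub = {n: sum(auth[v] for v in graph.get(n, ())) for n in nodes}
        for scores in (auth, hub):                # normalize to keep bounded
            norm = math.sqrt(sum(s * s for s in scores.values())) or 1.0
            for n in scores:
                scores[n] /= norm
    return auth, hub

graph = {"a": ["c"], "b": ["c"], "c": ["d"], "d": []}
auth, hub = hits(graph)
print(max(auth, key=auth.get))  # "c": cited by the two best hubs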
Initial findings from large-scale analyses of web topology reveal strong connectivity within the hyperlink network. A study by Albert, covering 325,000 pages and 1.5 million hyperlinks, indicated that the average shortest distance between two nodes in the collection, viewed as a directed graph, would be d = 0.35 + 2.06 * log(N), where N is the number of nodes. Albert extrapolated this estimate to the entire Web, then evaluated at 800 million documents, yielding an estimated Web diameter of 18.59 hyperlinks.
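As a quick check of that figure, substituting N = 8 * 10^8 into the formula (assuming, as in Albert's study, a base-10 logarithm) gives d = 0.35 + 2.06 * log(8 * 10^8) = 0.35 + 2.06 * 8.90, i.e. roughly 18.7, close to the 18.59 quoted above; the small gap presumably comes from rounding in the fitted coefficients.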