HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY
Master's Thesis in Data Science and Artificial Intelligence
Weak Supervision Learning for Information Extraction
NGUYEN HOANG LONG (NH202189M)
Problem overview
Cengroup is a leading real estate service provider dedicated to closely monitoring market trends to inform effective business strategies. Our data technology department ensures that real estate market information remains accurate and up-to-date, enabling us to adapt swiftly to changing conditions and maintain a competitive edge in the industry.
We initially developed a Scrapy Crawler to extract data from major websites, but as the number of target sites increased, manual programming became inefficient. To address this, we upgraded to an AI Crawler with integrated machine learning models, enabling automatic classification of web pages and efficient data extraction. This smarter system significantly improves scalability and accuracy in web data collection.
During the development of the AI Crawler's machine learning models, particularly the Information Extraction Model, we faced a significant challenge: the training data was unlabeled, and manual labeling is both time-consuming and costly. Effective data labeling is crucial for improving model accuracy, but obtaining quality labeled data remains a major hurdle in building robust AI systems. Overcoming this obstacle is essential for advancing the AI Crawler's performance and ensuring reliable information extraction.
After evaluating various approaches to reduce labeling costs, including Active Learning, Transfer Learning, and Weak Supervision Learning, we ultimately selected Weak Supervision due to its effectiveness and suitability for our specific problem.
Goals of the thesis
The goals of the thesis are:
• Building an AI Crawler that automatically detects and extracts information from many different Websites.
• Applying Weak Supervision in data labeling to effectively reduce labeling costs.
Thesis contributions
The work in this thesis proposes improvements to the existing information extraction system. Generally, the research approach can be applied to similar systems and implemented in other domains.
Main content and Structure of the thesis
This thesis addresses the scalability challenges of the Scrapy Crawler system as the number of websites increases and proposes improvements, outlined in Chapter 2. It details the development of the machine learning models in Chapters 3, 4, and 5, with a specific focus on using Weak Supervision techniques to create training datasets in Chapters 4 and 5. The thesis concludes with a summary of key findings and future development directions in Chapter 7, highlighting the system's performance improvements and areas for further research.
In this chapter, we present the problems associated with the Scrapy Crawler and solutions to them.
Problems with the Scrapy Crawler
The Scrapy Crawler is built on the Scrapy platform, as shown in Fig. 2.1. Each website has a dedicated data extraction program called a Spider, designed to retrieve specific data from its pages. In the Spider's parser function, rules are established to identify pages containing the desired data, which is then extracted using CSS selectors. However, this architecture presents three significant challenges affecting scalability and efficiency:
• When we needed to crawl a new Website, we had to write a new Spider to execute it.
• When a Website changes its layout, we have to reprogram the Spider for it.
• When more information is needed, we have to update all the Spiders.
These problems make the system difficult to scale to more Websites. Furthermore, scaling with this approach makes maintenance more expensive.
if response.css('div.real-estate').get() is not None:
    yield {
        'address': response.css('div.real-estate > div.address::text').get(),
        'price': response.css('div.real-estate > div.price::text').get(),
        'acreage': response.css('div.real-estate > div.acreage::text').get(),
    }
Figure 2.2: A website has many pages, hence many categories
Requirements for the AI Crawler
In the AI Crawler, we set up the crawler to be able to classify Pages and extract information automatically, with the following requirements:
• The AI Crawler must identify which web pages to fetch data from, focusing on specific categories such as real estate classified advertising pages (REAL_ESTATE_PAGE). It must programmatically distinguish these target pages from other page types, with URL-based page classification as the key method: by analyzing the URL structure before downloading, the crawler can decide whether a page belongs to the desired category.
• The AI Crawler should be able to automatically extract information from the Raw HTML without the predefined rules as in Scrapy Crawler.
Solutions analysis
Page Classification
We developed a URL-based text classifier to meet the page classification requirement, achieving good accuracy on trained websites. However, its performance declined significantly on unseen websites, despite preprocessing and normalization efforts. This suggests that diverse and short URLs limit the model's ability to extract sufficient distinguishing features, underscoring the difficulty of URL-based classification.
Using a separate model for each website can improve results, as the current model performs well on the websites it was trained on. When applying page classification to new websites using titles and descriptions, the results also remained promising. Nonetheless, two issues persist:
• Data needs to be retrieved from the Page before we can classify it.
• Some Websites do not have Title and Description.
To address the first issue, we build a general Page Classification model trained on the Titles and Descriptions of pages from existing websites. For a new website, this model generates a labeled dataset of URLs, which is then used to train a dedicated URL Classification Model for that website. This approach provides scalable and reliable classification for both known and new websites.
To address the second issue, we use the main text of the page in place of the missing title and description. Details of this method are explained in Chapter 3.
To summarize, the solution to the Page Classification problem is as follows:
• We build a Content Classification Model based on Title, Description, or Main Text on each Page of all known Websites.
• With this model, we build a training dataset of URLs for each new Website, then train a URL Classification Model for that Website using this dataset.
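The following is a minimal sketch of this bootstrapping step, assuming both the content classifier and the URL classifier are FastText models and that the page records (URL and page text) of the new Website have already been collected; the function name and file names are illustrative, not taken from the actual system.

import fasttext  # pip install fasttext

# Assumed to already exist: a content classifier trained on Title/Description/Main Text.
content_model = fasttext.load_model("content_classifier.bin")

def build_url_training_file(pages, out_path="url_train.txt"):
    # pages: iterable of (url, page_text) pairs collected from one new Website.
    # Label each page with the content classifier, then write the (label, url)
    # pairs in FastText's supervised file format.
    with open(out_path, "w", encoding="utf-8") as f:
        for url, page_text in pages:
            labels, _ = content_model.predict(page_text.replace("\n", " "))
            f.write(f"{labels[0]} {url}\n")
    return out_path

# Train a per-Website URL classifier on the weakly labeled URLs:
# url_model = fasttext.train_supervised(input=build_url_training_file(pages),
#                                       epoch=25, wordNgrams=2)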
Extract information from web page
To fulfill the second requirement, the AI Crawler must automatically extract information from Raw HTML without relying on predefined rules, using an Information Extraction Model for HTML data. During the development of this model, a significant challenge was the lack of labeled training data, which is essential for effective machine learning.
To address the data labeling challenge, we employed Weak Supervision, a training technique that combines multiple labeling methods to produce a more accurately labeled dataset. This approach improves data quality and model performance. Detailed steps for implementing this model are provided in Chapter 4.
When testing the model on real estate detail pages, we observed that it often extracted multiple values for the same label, such as addresses, prices, and acreages, leading to uncertain and inconsistent information. This happened because the model retrieved data from various sections of the page, causing redundancy and confusion in the extracted data. To improve accuracy, the model should focus only on the relevant part of the page.
An HTML page consists of several parts, including the header, footer, sidebar, and main content, and the essential information is found primarily in the main section. To improve extraction accuracy, we therefore run the model only on the main HTML content rather than the entire Raw HTML. This is explained in greater detail in Chapter 3.
Overall solution
System Architecture
The AI Crawler includes two components (Fig 2.3) as follows:
• Website Explorer: Explores new Website and trains URL Classification Model for each new Website.
• Parser Crawler: Crawls data from known Websites, gets Extracted Data from each Page, then saves it to the database.
The data flow in the AI Crawler system (Fig. 2.4) is as follows:
1. When a user needs to collect data from a new Website, the Website's URL is sent to the Website Explorer.
2. The Website Explorer starts to crawl data from the new Website and saves the Raw HTML of each Page to its database. When the number of Pages collected from the Website is sufficient, it halts.
3. With the collected data, the Website Explorer uses the Main Content Detection Model to determine the Main Text from the Raw HTML of each Page, then uses the Content Classification Model to classify the Pages into two Categories: real_estate_site and outlier_site.
4. The Website Explorer uses this dataset to train the URL Classification Model for the new domain using the URL Classification Trainer.
Figure 2.5: Website Explorer surveys a new Website
5. The user checks the list of explored Websites on the AI Crawler and starts the Parser Crawler with the Website they want. The Parser Crawler launches a new process to collect data from the requested Website.
6. The Parser Crawler uses the URL Classification Model to classify the URLs to be followed and places them in a priority queue, where the URLs of real estate detail Pages have higher priority.
7. At each real estate detail Page, the Parser Crawler uses the Main Content Detection Model to extract the Main HTML from the Raw HTML of the Page, then uses the Information Extraction Model to get the Extracted Data and stores it in a relational database.
Website Explorer
The Website Explorer uses a queue to manage URLs awaiting processing. When a response is received, the Raw HTML of the page is stored alongside its URL. In addition, all URLs found within the page are extracted for further exploration.
The Website Explorer then checks the extracted URLs for duplicates; non-duplicate URLs are pushed to the processing queue. A counter tracks the number of URLs added to the queue, and once it reaches a predefined limit, the queue stops accepting additional URLs.
Once enough Pages of the new Website have been collected, the Website Explorer analyzes all stored Pages with the Main Content Detection Model to extract the Main Text, then applies the Content Classification Model to categorize each Page. This dataset is subsequently used to train the URL Classification Model for the new Website.
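A minimal sketch of this exploration loop is shown below; the Scrapy-style spider, the page limit of 5,000, and the helper names are assumptions for illustration, not the exact implementation.

import scrapy

class ExplorerSpider(scrapy.Spider):
    """Collects Raw HTML from a new Website until a page limit is reached."""
    name = "website_explorer"

    def __init__(self, start_url, page_limit=5000, **kwargs):
        super().__init__(**kwargs)
        self.start_urls = [start_url]
        self.page_limit = page_limit       # stop condition for exploration
        self.seen_urls = set()             # duplicate check
        self.counter = 0                   # URLs pushed to the queue so far

    def parse(self, response):
        # Store the Raw HTML together with its URL (storage layer assumed).
        self.save_page(response.url, response.text)

        # Extract every link on the page and enqueue the non-duplicates.
        for href in response.css("a::attr(href)").getall():
            url = response.urljoin(href)
            if url in self.seen_urls or self.counter >= self.page_limit:
                continue
            self.seen_urls.add(url)
            self.counter += 1
            yield scrapy.Request(url, callback=self.parse)

    def save_page(self, url, raw_html):
        pass  # e.g. insert (url, raw_html) into the explorer database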
Parser Crawler
The Parser Crawler prioritizes real estate detail URLs using a priority queue. When it receives a response, it uses the URL Classification Model built by the Website Explorer to decide whether the page is a real estate detail page. If so, it uses the Main Content Detection Model to extract the Main HTML, followed by the Information Extraction Model to gather the relevant data, which is then stored in the database. The crawler continues with the new URLs found in the Raw HTML, skipping duplicate URLs.
The Parser Crawler (Fig. 2.7) uses the URL Classification Model to categorize the URLs it encounters; URLs identified as real estate detail pages are pushed onto the queue with higher priority, so that real estate listings are crawled first.
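The prioritization can be sketched with Python's heapq module, assuming a url_model with a FastText-style predict interface and a hypothetical label name for real estate detail pages.

import heapq
import itertools

REAL_ESTATE = "__label__REAL_ESTATE_PAGE"  # assumed label name

class UrlPriorityQueue:
    """Priority queue that serves real estate detail URLs before other URLs."""

    def __init__(self, url_model):
        self.url_model = url_model
        self.heap = []
        self.counter = itertools.count()  # tie-breaker keeps FIFO order

    def push(self, url):
        labels, _ = self.url_model.predict(url)
        priority = 0 if labels[0] == REAL_ESTATE else 1  # 0 = served first
        heapq.heappush(self.heap, (priority, next(self.counter), url))

    def pop(self):
        _, _, url = heapq.heappop(self.heap)
        return url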
This chapter presents the first three models of the system, covering the related research and the training and evaluation of each model.
Main Content Detection Model
Requirements
The Main Content Detection Model identifies the Main Text and the Main HTML within the Raw HTML of a Page, and it is used by two AI Crawler components. The Website Explorer uses the model to extract the Main Text, which substitutes for the Title and Description when they are missing, while the Parser Crawler uses it to extract the Main HTML from the Raw HTML. It is crucial that the Main Content Detection Model works across many websites, including new ones, to keep the crawler adaptable.
Problem analysis and Solution direction
Figure 3.1 illustrates a typical webpage layout, with components such as the header, footer, sidebar, navigation bar, and main content area. Recognizing these distinct parts suggests a first approach: detecting the main content based on display factors.
Following this approach, we surveyed several studies and examined VIPS.
VIPS is an algorithm designed to extract the semantic structure of a web page based on its visual presentation, as illustrated in Figure 3.2. It works by dividing the page into blocks and analyzing the visual distances between them to determine their Degree of Coherence, which indicates how distinct each block is from the others. These blocks are then organized into a semantic tree that represents the overall structure of the web page.
This algorithm's key advantage is that it does not require model training; it derives the page hierarchy directly from the page itself. However, a significant limitation is its processing time, because the system must render the entire webpage in a browser to determine the position of each element. Our benchmarks indicate that processing a single page takes approximately 8 to 12 seconds, with about 60% of this time spent on rendering, so the total time can be three times longer than just fetching the Raw HTML. This makes the algorithm impractical for time-sensitive crawling.
The display factor-based approach is therefore unsuitable for our problem, since rendering web pages in a browser significantly increases crawl time. Instead, we explored techniques based on structural and text factors, which take Raw HTML as input. After reviewing studies in this area, we selected two candidates for testing: Dragnet [9] and Web2Text [13].
Related Work
Dragnet [9] employs a hybrid approach that combines a single block definition with the integration of the entire CETR algorithm within a machine learning framework. It uses a carefully crafted set of features designed to extract semantic information from HTML code. Notably, many modern HTML tags include descriptive id and class attributes such as "comment", "header", and "nav", which programmers choose deliberately, thereby embedding meaningful semantic cues about the content within these attributes.
During the preprocessing stage, the HTML document is transformed into a DOM (Document Object Model) tree and subsequently into a block tree, where each block represents an element containing text, while non-text elements are ignored. Using Cleaneval [3], the main data sections are identified and marked for further processing.
During the feature extraction stage, a set of features is used, including the two most predictive shallow text (ST) features described in prior research. The CETR algorithm is then applied to smooth the ratio of text length to tag count for each block, and these results serve as additional features in Dragnet's model. Semantic features derived from the element's id and class attributes are also incorporated. However, details of Dragnet's network architecture were not disclosed in the original paper.
Web2text [13] addresses the problem of boilerplate removal for HTML data, using the textual and structural features of the HTML format to train a CNN-based model.
Web page content extraction, also known as boilerplate removal or web page segmentation, involves separating the main text from other page elements. Effective methods typically use rule-based or machine learning algorithms to identify and isolate the primary content. The most successful approaches first divide the web page into distinct text blocks, then apply binary classification to label each block as either main content or boilerplate.
Web2text combines a hidden Markov model with neural potentials to remove boilerplate content from web pages. It uses convolutional neural networks (CNNs) to learn unary and pairwise potentials over page blocks, capturing non-linear relationships among traditional DOM-based features. At prediction time, Web2text uses the Viterbi algorithm [5] to find the most likely block labeling by maximizing the joint probability of the label sequence.
Web2text treats the task as a sequence segmentation problem by converting an HTML page into a sequence of blocks, each representing an HTML element containing text. The core question is whether each block belongs to the main content or not. The data processing pipeline in Web2text, summarized in Figure 3.4, transforms the HTML data as follows:
1. Preprocessing: The HTML data is converted to a DOM (Document Object Model) tree, and each element containing text is represented by a block (Figure 3.5). The entire HTML page thus becomes an ordered block sequence: the leaves of the collapsed DOM tree of a web page form an ordered sequence of blocks to be labeled, with Cleaneval [3] used to mark the ground-truth labels.
2. Feature extraction: For each block, Web2text extracts a number of DOM-tree-based features. Two separate convolutional networks operating on this sequence of features yield two sets of potentials: unary potentials for each block and pairwise potentials for pairs of neighboring blocks, which are combined in a hidden Markov model framework [13]. Two groups of features are used: Block Features and Edge Features.
The Block Features include text features, block tags, the position of the block on the page, and the stop-word rate of each text block. These features are computed for each block of text in the DOM (Document Object Model) tree, collecting statistics from its CDOM node, parent, grandparent, and the root node, for example whether the node is a <p> element, the average word length, the relative position in the source code, and binary indicators such as "the parent node's text contains an email address".
The Edge Features describe the relationship between pairs of adjacent text blocks, focusing on their proximity and structural connection within the page. They consider whether the blocks share the same ancestor and measure the tree distance, defined as the total number of hops from each node to their first common ancestor. Key edge features include binary indicators for tree distances of 2, 3, 4, and greater than 4, as well as whether there is a line break between the two blocks in an unstyled rendering of the HTML page.
3. Training: Web2text trains CNN models on the Block and Edge features, a unary CNN for the Block Features and a pairwise CNN for the Edge Features. The Viterbi algorithm is then used to find the labeling that maximizes the total sequence probability predicted by the networks.
4. Post-processing: Using the Web2text model, we extract the Main Text of the page from the blocks labeled as main content.
Figure 3.5: Collapsed DOM procedure example [13]
To extract the Main HTML from the Raw HTML, we modified the Web2text source code. The Web2text model's output depends on whether each block is labeled as part of the main content: when a block is identified as main, the model returns its text; otherwise, it does not. We changed this behavior so that the model returns the CSS selector of each main block instead. Using these CSS selectors, we can locate the main part of the webpage and assemble the Main HTML of the page.
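As an illustration, the selector-to-Main-HTML step could look roughly like the sketch below, using lxml (with the cssselect package) and assuming the modified model returns a list of CSS selectors for the main blocks; this is a simplified stand-in for the actual modified Web2text code.

from lxml import html  # requires lxml and cssselect

def build_main_html(raw_html, main_selectors):
    # Collect the HTML of the blocks whose CSS selectors were predicted as
    # main content and wrap them in a single container element.
    doc = html.fromstring(raw_html)
    fragments = []
    for selector in main_selectors:
        for element in doc.cssselect(selector):
            fragments.append(html.tostring(element, encoding="unicode"))
    return "<div>" + "".join(fragments) + "</div>"

# Example with hypothetical selectors produced by the modified model:
# main_html = build_main_html(raw_html, ["div.detail > h1", "div.detail > div.description"])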
Cleaneval is a shared task and competitive evaluation focused on cleaning arbitrary web pages, with the goal of preparing web data as a corpus for linguistic and language technology research. Initiated in 2007, it aims to improve methods for extracting and cleaning web content. Its evaluation setup, results, and lessons learned highlight the importance of cleaning techniques that can handle the diversity and complexity of online data.
In Dragnet, Web2text, and this thesis, Cleaneval is used as a preprocessing stage for marking the Main Text in the Raw HTML.
Dataset
We collected approximately 80,000 real estate web pages from 17 leading websites in the industry, as detailed in Table 3.1. Each page in the dataset consists of its Raw HTML content and its URL.
The processing steps to get training data for Dragnet and Web2text are as follows:
1. First, on each Website, we relied on its URL characteristics or its Raw HTML layout characteristics to classify each Page into one of two categories, REAL_ESTATE_PAGE and OTHER_PAGE (see Fig. 3.1).

Table 3.1: Number of Pages collected from each Website

Website | Raw | Usable | REAL_ESTATE_PAGE | OTHER_PAGE
www.nhadatviet247.net | 8433 | 7991 | 3153 | 4838
ancu.me | 7776 | 7604 | 3598 | 4006
alonhadat.com.vn | 6620 | 6168 | 2495 | 3673
mogi.vn | 6507 | 6237 | 3554 | 2683
timdat.net | 6506 | 6044 | 3956 | 2088
bds123.vn | 6004 | 5898 | 3835 | 2063
123nhadat.vn | 5498 | 5051 | 3348 | 1703
nhadatdongnai.com | 5379 | 5055 | 2533 | 2522
nhadatbacninh.vn | 5164 | 4886 | 2695 | 2191
timmuanhadat.com.vn | 3693 | 3461 | 2574 | 887
muabannhadat.tv | 3212 | 2826 | 1566 | 1260
batdongsancantho.vn | 3105 | 2779 | 1329 | 1450
batdongsandanang.net | 3073 | 3000 | 894 | 2106
chothuenha.me | 2599 | 2405 | 1124 | 1281
homedy.com | 2108 | 2016 | 1549 | 467
nhadat24h.net | 1653 | 1408 | 1198 | 210
muabanchinhchu.net | 1613 | 1123 | 208 | 915
Sum | 78943 | 73952 | 39609 | 34343
2. Second, we converted the Raw HTML of each Page to a DOM tree; in the REAL_ESTATE_PAGE category of each Website, we used a Start Selector and an End Selector to mark the main part of the Raw HTML, then took the text of that main part as the Main Text. In effect, we produced a Cleaneval-style annotated dataset as input for Dragnet and Web2text.
3. To train the models, we split the dataset into a train set and a test set, with each set containing Pages from different Websites. Each test used a different combination of Websites. Because the number of Pages per Website was not the same, when training the model we took the same number of Pages from each Website; the details are in Table 3.2.
Results and Discussion
To evaluate the results, we compare the main text produced by the model, given the Raw HTML of a page, with the correct main text.

Table 3.2: Data used to train the Main Content Detection Model

Set | Number of Pages | Number of Websites | Number of Pages | Number of Websites
5 | 999 | 3 | 490 | 14

We use two metrics to evaluate the model:
• Similarity between the result and the expected main text, called Coincidence. We take the average value across the data set.
• The percentage of pages with a Coincidence greater than 85% in the test set, called Coverage.
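A minimal sketch of the two metrics is shown below, assuming Coincidence is computed as a string-similarity ratio between the predicted and expected main text; the use of difflib here is an illustrative choice, not necessarily the exact similarity measure used in the thesis.

from difflib import SequenceMatcher

def coincidence(predicted_text, expected_text):
    # String similarity in [0, 1] between predicted and expected main text.
    return SequenceMatcher(None, predicted_text, expected_text).ratio()

def evaluate(pages):
    # pages: list of (predicted_text, expected_text) pairs from the test set.
    scores = [coincidence(p, e) for p, e in pages]
    avg_coincidence = sum(scores) / len(scores)
    coverage = sum(s > 0.85 for s in scores) / len(scores)
    return avg_coincidence, coverage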
Based on the test results in table 3.3 and table 3.4, we have the following findings:
• Both models have a similar average Coincidence when the training set contains many websites. Web2text has higher Coverage, which shows that it works more consistently than Dragnet.
• Reducing the number of websites in the training set affects the two models differently: Dragnet's Coincidence and Coverage decline, while Web2text maintains its effectiveness. This suggests that Web2text generalizes better to unseen websites.
• Based on these results, we decided to use the Web2text model as the Main Content Detection Model.
Table 3.3: Result of the Dragnet Model
Set Average of Coincidence Coverage
Table 3.4: Result of the Web2text Model
Set Average of Coincidence Coverage
Content Classification Model
Solution analysis
The Content Classification Model in the Website Explorer classifies Pages into categories based on their content. It analyzes text elements such as the page Title, Description, and Main Text, and assigns each Page to either REAL_ESTATE_PAGE or OTHER_PAGE.
This is a Text Classification problem, one of the most fundamental tasks in Natural Language Processing (NLP) [8]. Text classification underlies many NLP applications, such as sentiment analysis, topic labeling, question answering, and dialog act classification. In this thesis, we use FastText to build the text classification models.
Brief about Fasttext
FastText, developed by Facebook, is a machine learning library for efficient text processing. It supports both unsupervised learning of word embeddings and supervised learning of text classification models, and is designed for rapid training on large datasets.
Table 3.5: Data for Content Classification Model
Dataset Sum Number Website REAL_ESTATE_PAGE OTHER_PAGE
Text classification is a core component of Natural Language Processing, used in applications such as web search, information retrieval, ranking, and document categorization. FastText's main advantage is its training speed, which allows models to be trained efficiently on large datasets.
FastText extends the traditional Bag of Words model with a bag of n-grams, allowing it to capture local word-order information while keeping computation simple. As a result, FastText achieves results comparable to methods that explicitly model word order, making it a practical choice for text classification.
Dataset
To build the Content Classification Model, we needed a training dataset of page texts and their categories. Our dataset consists of approximately 80,000 pages collected from 17 real estate websites, where each page includes the Raw HTML, URL, Main Text, and category label. The data was prepared for FastText as follows:
• For each page in the dataset, we extract the category and the page text. If the Raw HTML contains a Title and Description, these are combined to form the page text; otherwise, we use the Main Text instead.
• We split the dataset into three smaller sets, Train, Test, and Validate, with different Websites in each set (as in Table 3.5).
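A minimal sketch of this training step with the FastText Python API is shown below, assuming the prepared files follow FastText's __label__ format; the file names and hyperparameters are illustrative.

import fasttext

# Each line of the training file looks like:
# __label__REAL_ESTATE_PAGE ban nha mat pho quan cau giay dien tich 50m2 ...
model = fasttext.train_supervised(
    input="content_train.txt",
    epoch=25,          # illustrative hyperparameters
    lr=0.5,
    wordNgrams=2,      # bag of bigrams captures local word order
)

# Evaluate on the held-out test file: returns (N, precision@1, recall@1).
print(model.test("content_test.txt"))

# Classify a new page text (Title + Description, or Main Text as fallback).
labels, probs = model.predict("cho thue can ho 2 phong ngu gan trung tam")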
Results and Discussion
The results in Table 3.6 show that the Content Classification Model is trained fairly well. When we examined the results in more detail, we found the following common mistakes:
Table 3.6: Result of Content Classification Model
REAL_ESTATE_PAGE OTHER_PAGE
• Some pages in the OTHER_PAGE category lack a Title and Description, and no Main Text could be extracted because the current Main Content Detection Model was only trained on the REAL_ESTATE_PAGE category. We plan to improve this by extending the model to more categories.
• Some news sites write Titles and Descriptions in a way similar to REAL_ESTATE_PAGE, so the model can be confused.
URL Classification Model
Solution analysis
The URL Classification Model is used by the Parser Crawler to classify the URLs of a Website into categories. One model is trained per Website by the Website Explorer: once the page classifier has produced results for a new Website, those results are used to train a URL Classification Model dedicated to that Website.
The URL classification problem is a type of text classification, similar to the Content Classification Model. We again use FastText to train this model.
Dataset
In our system design, the URL Classification Model is trained using data derived from the outputs of the Content Classification Model. The training and evaluation flow for the URL classifier is described below.
Table 3.7: Result of URL Classification Model

Domain | Sum | Train | Validate | Test | TRUE | F1
nhadatdongnai.com | 5055 | 3033 | 1011 | 1011 | 932 | 0.92
nhadatbacninh.vn | 4886 | 2932 | 978 | 976 | 857 | 0.88
timmuanhadat.com.vn | 3461 | 2077 | 693 | 691 | 641 | 0.93
muabannhadat.tv | 2826 | 1696 | 566 | 564 | 502 | 0.89
• We use the Content Classification Model to categorize the Pages of the Website in the test set; for every Page, we keep its URL and the Category predicted by the model.
• For each Website, we split this dataset into three smaller sets (train, validate, test), then train the URL Classification Model using the train and validate sets.
• To validate the result, we compare the model output with the expected Category in the test set.
Result and Discussion
The URL Classification Model produces results close to those of the Content Classification Model. On review, most errors stem from the following issues:
• The error of the model originates mainly from the URLs being too short or too similar between different Categories.
• Incorrect labels produced by the Main Content Detection Model also contribute to the overall errors.
Requirements
The Information Extraction Model is used by the Parser Crawler to extract the necessary information from the Main HTML of a Page. The Extracted Data returned by the model includes:
• Address: The address of the real estate being listed, with four sub-types:
– Province
– District
– Ward
– Street
Problems analysis and Solutions direction
The requirements of this problem lead us to the Named Entity Recognition (NER) problem, a sub-problem of Information Extraction (IE).
Information Extraction (IE) is the task of extracting structured information from unstructured or semi-structured text. It primarily comprises two components: Named Entity Recognition (NER), which identifies and classifies entities such as people, organizations, and locations, and Relation Extraction (RE), which determines the relationships between these entities.
Two main families of methods are used for the NER problem: rule-based and machine-learning-based approaches. The rule-based approach relies on analyzing data characteristics to handcraft extraction rules; for example, price information on a real estate detail page can be recognized as a number followed by a price unit and preceded by keywords such as "sell", "rental", or "priced". This method offers high precision but can only extract the information its rules anticipate. The machine-learning-based approach can handle more diverse and unseen data, but it requires labeled training datasets, which are costly and time-consuming to produce.
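A hedged sketch of such a handcrafted price rule is shown below; the keyword list and the regular expression are illustrative reconstructions of the rule described above, not the exact production rule.

import re

# Vietnamese price units and trigger keywords (illustrative subsets).
PRICE_UNITS = r"(tỷ|ty|triệu|trieu|tr)"
TRIGGERS = r"(bán|ban|cho thuê|cho thue|giá|gia)"

# A trigger keyword, then a number, then a price unit, e.g. "giá 2,5 tỷ".
PRICE_RULE = re.compile(
    rf"{TRIGGERS}\D{{0,20}}(\d+(?:[.,]\d+)?)\s*{PRICE_UNITS}",
    re.IGNORECASE,
)

def extract_price(text):
    # Return (amount, unit) if the rule matches, otherwise None.
    match = PRICE_RULE.search(text)
    if match is None:
        return None
    return match.group(2), match.group(3)

# extract_price("Bán nhà 3 tầng, giá 2,5 tỷ")  ->  ("2,5", "tỷ")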
This thesis explores Weak Supervision, a machine learning approach that replaces fully manual data labeling by combining multiple data sources and labeling methods. Weak Supervision lets us integrate various weak sources of labels, improving the quality and coverage of the resulting dataset while keeping labeling cost low.
Our company already owns several useful resources: an address extraction model, a real estate description extraction model, and a language model trained on real estate data, as well as datasets such as an address tree, address dictionaries, and a list of project names. By reusing these existing resources, we reduce manual labeling effort and cost while maintaining the quality of the final models.
The input data for this model is HTML, a structured format, which poses a particular challenge for NER. Instead of converting HTML into plain text and applying conventional NER techniques, our approach exploits the structural features of the HTML format. To do so, we used the Fonduer framework with some modifications to its source code, allowing the model to take advantage of HTML structure directly.
Weak Supervision
Overview
Labeled training data is essential for building supervised machine learning models, but obtaining or creating such datasets can be difficult and time-consuming. To address this, numerous studies have explored alternative approaches to producing training data [16], including:
• Active Learning: selects the examples with the highest value for the model to train on, thereby minimizing the overall labeling cost.
• Semi-Supervised Learning: The idea of this approach is to use unlabeled data as leverage to improve the performance of the model.
• Transfer Learning: Find a way to transform an existing model into a new one that can be applied to another domain.
Weak Supervision reduces data labeling cost by combining various labeling methods, such as heuristics, pre-trained models, existing knowledge bases, and third-party services. Each of these methods produces a differently labeled, imperfect version of the dataset.
Weak Supervision then combines these multiple labeled datasets of varying accuracy into a better training dataset. By integrating the labeled sets so that they complement each other, we can improve data quality and, in turn, model performance, even when the individual labels are noisy or imprecise.
Weak Supervision is built around label functions, which let users define their own labeling methods; this technique is known as Programmatic Weak Supervision (PWS) [16]. Each label function processes a data sample and assigns a label, resulting in an n × m label matrix, where n is the number of data points and m is the number of label functions (weak data sources). There are two main categories of PWS methods, as illustrated in Fig. 4.1.
The one-stage method, also known as the joint model, builds the final predictive model directly on the label matrix. When deploying such a model, the label matrix must be reconstructed from the weak data sources before the final model can make predictions. This integrates label construction and model training into a single step, but requires the weak sources to be available at prediction time.
The two-stage method first uses a Label Model to merge the label matrix into either one-hot hard labels or probabilistic soft labels, producing a labeled dataset. In the second stage, this dataset is used to train the final predictive model. The weak data sources are needed only during labeling, not when running the final model.
Labeling Functions
Using a user-defined label function allows a wide variety of labeling types to be used; some of the more common types of labeling include:
• Heuristic rules: users define labeling criteria based on their experience observing the data. These rules are easy to adjust, and their quality directly affects the quality of the resulting labels.
• Existing knowledge bases, trained models, and third-party labeling tools: reusing these resources improves accuracy, streamlines the workflow, and reduces cost by leveraging prior work.
• Crowd-sourced labels: an additional weak label source that can be used case by case. Although potentially inaccurate, incomplete, or noisy, they can still contribute to model performance when combined with the other sources.
To improve efficiency when designing weak supervision sources, there are three main directions in the generation of LFs [16]:
• Automatic Generation: start from a small labeled dataset to automatically initialize simple label functions.
• Interactive Generation: Build an iterative process that allows for gradual improvement of labeling results.
• Guided Generation: select a small development set to reduce the effort of constructing label functions; Active Learning can be used to choose representative examples from this dev set.
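To make the idea of programmatic label functions concrete, here is a minimal sketch in Snorkel's style, using a heuristic rule and a small dictionary to label candidate spans as PRICE or not; the label constants, the dictionary, and the candidate fields (number_text, next_word) are illustrative assumptions, not taken from the actual system.

import re
from snorkel.labeling import labeling_function

# Illustrative label constants.
ABSTAIN, NOT_PRICE, PRICE = -1, 0, 1

PRICE_UNITS = {"tỷ", "triệu", "tr"}  # assumed small unit dictionary

@labeling_function()
def lf_number_before_unit(x):
    # Heuristic: a number immediately followed by a price unit is a price.
    if re.fullmatch(r"\d+([.,]\d+)?", x.number_text) and x.next_word in PRICE_UNITS:
        return PRICE
    return ABSTAIN

@labeling_function()
def lf_too_many_digits(x):
    # Heuristic: very long digit strings (e.g. phone numbers) are not prices.
    return NOT_PRICE if len(x.number_text.replace(".", "")) >= 9 else ABSTAIN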
Label Model
Label functions may overlap or conflict on a sample: several functions can return the same label, or different ones. The weakly supervised pipeline uses a Label Model to resolve these overlaps and conflicts and produce accurate probabilistic labels.
Figure 4.2: Overview of the Snorkel system [10]
The Label Model first identifies dependencies between label functions, such as similarity, reinforcement, fixing, or exclusivity; to automate this, a weighted factor graph is constructed over the label functions, keeping the non-zero-weight connections. Next, the model estimates the reliability of each label function and determines the most likely label for each example. Finally, it computes a confidence score for each label of an example, based on the combined confidence of the label functions that voted for that label.
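With Snorkel, this aggregation step looks roughly like the sketch below, reusing the label functions from the previous sketch and assuming a pandas DataFrame df_train of candidates; the hyperparameters are illustrative.

from snorkel.labeling import PandasLFApplier, LFAnalysis
from snorkel.labeling.model import LabelModel

lfs = [lf_number_before_unit, lf_too_many_digits]

# Apply every label function to every candidate: L_train has shape (n, m).
applier = PandasLFApplier(lfs=lfs)
L_train = applier.apply(df=df_train)

# Inspect coverage, overlaps, and conflicts of each label function.
print(LFAnalysis(L=L_train, lfs=lfs).lf_summary())

# Fit the Label Model and produce probabilistic (soft) labels.
label_model = LabelModel(cardinality=2, verbose=True)
label_model.fit(L_train=L_train, n_epochs=500, seed=123)
probs_train = label_model.predict_proba(L=L_train)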
Framework and library
Snorkel
Snorkel is an open-source machine learning library that provides programming tools for rapidly constructing training datasets with weakly supervised learning (see Fig. 4.2). Its core functions are:
• Labeling: users define label functions that assign labels to data samples. These weak labels are then modeled and combined by Snorkel's Generative Model, which estimates the accuracy of each label function without knowing the true labels. The key idea is to measure the agreement among label functions on each data point: when the number of examples is large enough, the variance of a label function for each label approximates the accuracy of that label function for that label.

Figure 4.3: Overview of Fonduer [15]
The Generative Model refines these accuracy estimates by computing the label covariance matrix of the label functions. From this it builds a correlation graph describing the relationships between the label functions and the unknown correct label, and uses these correlations to assess the accuracy of each label function.
• Transforming: this function produces more labeled examples from an original example. Users only need to write transformation functions to define these transformations.
• Slicing: slicing functions let users divide the dataset into smaller segments, so that labeling and transformations can target specific or difficult subsets of the data.
Fonduer
Fonduer [15] is a Python framework for Knowledge Base Construction (KBC) from richly formatted data. It supports a variety of formats, including text, CSV, HTML, and PDF, and can exploit textual, structural, tabular, and visual features of the data (see Fig. 4.3).
In building the knowledge base, there are four object types that play an important role: Entity, Relation, Mention of Entity, and Relation of Mention.
• Entity: an object in the Knowledge Base representing a distinct person, thing, or place in the real world.
Figure 4.4: Component and Pipeline Process in Fonduer
• Relation: entities related to each other are represented through a Relation between them. A Relation can involve one or more entities.
• Mention: a span of text in the data that refers to an entity.
• Relation of Mention: the relationships between the mentions in the data.
The Fonduer framework includes several components with different functions to assist users in building the knowledge base (see Fig. 4.4):
• Parsing: this component transforms input data in various formats into Fonduer's standardized data model. The Preprocessor and Parser handle multiple data formats; once the data passes through the Parser, it is stored in the Data Model for the downstream extraction steps.
• Candidate Extractor: the second stage extracts candidates, i.e. potential Relations of Mention between entity mentions, which are later verified for correctness. Users define mentions and candidates and use constructs such as Mention Space, Matcher, and Throttler to extract them from the data model; Fonduer also provides many built-in methods to support this step.
• FeatureExtractor: this function allows the user to define the characteristics of a candidate. Fonduer provides a number of built-in methods to help users extract common features such as Text Features, Structure Features, and Display Features.

Figure 4.5: Parsing Component in Fonduer [15]

Figure 4.6: Fonduer Data Model [15]
• Supervision: labeling candidates as true Relations of Mention is a supervised classification task, so labeled training data is required. Fonduer integrates the Snorkel library, letting users write label functions that label the training data automatically.
• Learning: the final stage trains the model on the labeled candidates and their features. Fonduer supports common models such as LSTM and Logistic Regression.
Fonduer also allows the complete pipeline, including parsing, extraction, featurization, and classification, to be packaged and deployed to remote servers, using the MLflow model format for serving.
Dataset
The model is built on the Main HTML of the pages labeled REAL_ESTATE_PAGE, taken from the output of the Main Content Detection Model described in Chapter 3. No additional preprocessing was applied to this HTML data.
Implementation Idea
Based on the information to be extracted, we designed four candidates as follows (Fig. 5.1):
• Address Candidate: made up of an Address Mention, with the labels Province, District, Ward, Street, and Project.
• Number Candidate: made up of a Number Mention, covering the numeric values on the page, with seven labels: Acreage, Surface Width, Surface Length, Street Width, Number of Rooms, Number of Toilets, and Number of Floors.
• Price Candidate: a pair consisting of a number and a unit standing side by side, with two labels: Price and False (not a price).
• Date Candidate: made up of a Date Mention, with two labels: Submission Date and False (not the submission date).
Implementation
Setup
To get started, we install Fonduer, Jupyter Notebook, vi-spacy [11], and PostgreSQL. Next, we create a new database named "real_estate_fonduer" in PostgreSQL, then open a Jupyter Notebook and configure the database connection, following the steps in Appendix A1.
Fonduer uses the PostgreSQL database to store data between processing steps, so that unchanged earlier steps do not have to be re-run.
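The connection setup is roughly as follows, using Fonduer's Meta object; the user name, password, and port are placeholders.

from fonduer import Meta

# Placeholder credentials; the database name matches the one created above.
conn_string = "postgresql://user:password@localhost:5432/real_estate_fonduer"
session = Meta.init(conn_string).Session()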
Parser
Our data is in HTML format, so we use Fonduer's built-in HTMLDocPreprocessor, which converts the HTML input into Fonduer's Data Model.
Since our dataset is in Vietnamese, we use an additional language model for tokenization and part-of-speech tagging. Fonduer integrates with spaCy models, and for this project we used the Vietnamese language model vi-spacy [11], developed by Tran Viet Trung (see Appendix A2).
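A minimal sketch of the parsing step, assuming the Main HTML files are stored in a local directory and that the vi-spacy model is available to spaCy under the "vi" language code; the directory path, document limit, and parser options are illustrative.

from fonduer.parser import Parser
from fonduer.parser.preprocessors import HTMLDocPreprocessor

# Read the Main HTML documents from disk (path and max_docs are placeholders).
doc_preprocessor = HTMLDocPreprocessor("data/main_html/", max_docs=1000)

# Parse into Fonduer's data model with structural features and Vietnamese
# linguistic processing (visual features are not used here).
corpus_parser = Parser(session, structural=True, lingual=True, language="vi")
corpus_parser.apply(doc_preprocessor)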
Candidate Extractor
The model identifies the required information through four Mentions and four corresponding Candidates, defined as detailed in Appendix A3.
• Number Mention: words denoting real numbers or integers. To identify these mentions we use the RegexMatcher provided by Fonduer (see Appendix A4).
• Address Mention: extracted as N-grams of 1 to 5 words, with a matcher that removes spans containing special characters, since addresses normally do not contain them (see Appendix A5).
• Unit Mention and Date Mention: Unit Mentions capture price units such as billion or million (possibly per square meter); since there are only about ten such keywords, we use a DictionaryMatcher over this keyword set. Date Mentions, which are similar to numbers, are identified with a RegexMatcher and hand-written regular expressions.
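Below is a condensed sketch of these definitions with Fonduer's candidate API, reusing the session from the setup sketch; the regular expressions, dictionary contents, and class names are illustrative simplifications of the definitions in Appendices A3-A5.

from fonduer.candidates import CandidateExtractor, MentionExtractor, MentionNgrams
from fonduer.candidates.matchers import DictionaryMatch, RegexMatchSpan
from fonduer.candidates.models import candidate_subclass, mention_subclass
from fonduer.parser.models import Document

# Parsed documents from the corpus.
docs = session.query(Document).order_by(Document.name).all()

# Mention classes, extraction spaces, and matchers (patterns are illustrative).
NumberMention = mention_subclass("NumberMention")
UnitMention = mention_subclass("UnitMention")
number_space = MentionNgrams(n_min=1, n_max=1)
unit_space = MentionNgrams(n_min=1, n_max=2)
number_matcher = RegexMatchSpan(rgx=r"\d+([.,]\d+)?")
unit_matcher = DictionaryMatch(d=["tỷ", "triệu", "triệu/m2", "tr"])

mention_extractor = MentionExtractor(
    session,
    [NumberMention, UnitMention],
    [number_space, unit_space],
    [number_matcher, unit_matcher],
)
mention_extractor.apply(docs)

# Price Candidate: a (number, unit) pair from the same context.
PriceCandidate = candidate_subclass("PriceCandidate", [NumberMention, UnitMention])
candidate_extractor = CandidateExtractor(session, [PriceCandidate])
candidate_extractor.apply(docs, split=0)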
Labeling
Fonduer uses the Weak Supervision approach through the Snorkel library for data labeling: users write label functions that annotate the data automatically, which greatly reduces manual effort and makes labeling scalable and cost-effective.
Our data labeling process follows an iterative cycle: write label functions, train the model, evaluate, and refine the label functions accordingly. In the first iteration, we built label functions from pre-trained models and simple rules, as detailed in Appendices A6, A7, A8, and A9.
To evaluate the results, we manually labeled ten pages from each website to create a Gold Label dataset (see Appendix A10). The Gold Label lets us monitor the model's progress across iterations, measure changes in the effectiveness of the label functions, and adjust them to improve the final model.
The analysis results for each label are used to update and improve the label functions in subsequent iterations. Table 5.1 evaluates the address label functions on parameters such as dataset coverage, degree of overlap, conflicts between label functions, and accuracy against the Gold Label.
The analysis of label functions reveals that the `lf_address_extract_model` exhibits the highest coverage (0.3357) and overlap (0.3185), indicating its effectiveness in extracting address-related information with minimal conflicts (0.0172) The `lf_in_dict` also demonstrates significant coverage (0.2435) while maintaining a manageable conflict rate (0.1055), highlighting its reliability in dictionary-based address extraction Conversely, specialized functions like `lf_pre_thanh_pho_is_tinh` and `lf_pre_thanh_pho_is_quan` show low coverage (0.0012 and 0.0054 respectively), reflecting their limited scope in identifying specific geographic entities Overall, these label functions contribute variably to the model's performance, with some providing broad coverage and others offering targeted accuracy for particular address components.
Feature Extraction
After labeling, a Label Model is produced for each label, which could be used directly as the final model in the one-stage method. However, that approach depends on the label functions at prediction time, including third-party services, which is undesirable. We therefore use the two-stage method: we train a new model on the Label Model's outputs and the candidates' features, removing the dependence on label functions. The dataset is first divided into training, testing, and development sets (see Appendix A11).
To extract candidate features, we use the Fonduer framework, which supports display (visual), structural, and textual features. The desired feature types are configured through the "fonduer-config.yaml" file, as detailed in Appendix A12. In our application we use only structural and textual features, because display features require rendering the page to compute element distances and sizes, which is not feasible without a browser. Using Fonduer's Featurizer, we generate a feature matrix for the candidates (see Appendix A13).
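A sketch of the feature-matrix step, assuming a Fonduer session and a candidate class (here a hypothetical AddressCandidate) have already been set up as in Appendices A12 and A13; the exact calls may differ slightly between Fonduer versions.

from fonduer.features import Featurizer

# Visual/display features are disabled in fonduer-config.yaml, so only
# structural and textual features are generated here.
featurizer = Featurizer(session, [AddressCandidate])
featurizer.apply(split=0, train=True, parallelism=4)   # split 0 = training set

# Sparse feature matrix for the training candidates.
F_train = featurizer.get_feature_matrices(train_cands)[0]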
Train Final Model
During the training phase, we use the LogisticRegression model provided by Fonduer for this standard classification problem, with the textual and structural features described above. We evaluate the model by comparing its predictions with the GoldLabel set, which gives an accurate assessment of how well it extracts the target information.
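Because the exact API of Fonduer's built-in LogisticRegression varies by version, the sketch below uses scikit-learn's LogisticRegression as a stand-in for the second stage. It assumes F_train and train_marginals from the earlier sketches are aligned over the same training candidates; all names are placeholders.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Collapse the LabelModel's probabilistic labels to hard targets (Fonduer's
# own discriminative model can train on the marginals directly).
y_train = np.argmax(train_marginals, axis=1)

# Second stage: a plain logistic regression over the structural + textual
# feature matrix, with no dependence on the labeling functions at inference.
clf = LogisticRegression(max_iter=1000)
clf.fit(F_train, y_train)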
Table 5.2: Results of the Information Extraction Model.
Columns: Label, Gold Label, F1 score, Coverage.
Result evaluation
Evaluation Method
To develop a high-quality model, we first create a manually labeled dataset called the Gold Label, consisting of 250 pages collected by randomly selecting 10 to 15 pages from each website. The dataset covers seven information types: address, price, area, surface size, street width, phone number, and posting date. The Gold Label is used to evaluate the model's accuracy by comparing its extracted results against the manual annotations. Model training follows the iterative cycle of labeling, training, and evaluation until the desired accuracy is reached.
We performed the assessment on the 7 information fields to be extracted (Figure 5.2). The columns of the results table are:
• Gold Label: the total number of labels across the 250 pages in the Gold Label dataset. Note that one page can yield 1 to 4 Address labels, corresponding to Province, District, Ward, and Street.
• F1 score: computed over the 250 pages of the Gold Label set for each information field.
• Coverage: the ratio of pages with extracted information to the total number of pages (39,609 pages); a small computation sketch follows this list.
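As referenced above, a small sketch of how the two reported columns can be computed for a single information field; the function and variable names are illustrative.

from sklearn.metrics import f1_score

TOTAL_PAGES = 39609  # size of the full crawled dataset

def field_metrics(y_gold, y_pred, pages_with_extraction):
    """F1 on the 250 GoldLabel pages and coverage over the full crawl."""
    return {
        "f1": f1_score(y_gold, y_pred),
        "coverage": pages_with_extraction / TOTAL_PAGES,
    }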
Results and Discussion
Based on the results in Table 5.2, we discuss the findings as follows:
Address extraction accuracy is limited because labeling relies on an address dictionary, which hampers the model's ability to handle addresses it has not seen before. Using separate labels for Province and District yields higher accuracy, since these fields contain fewer new values than Ward and Street. In addition, overlaps between similar names across labels cause errors; labeling functions that encode the relationships between labels help reduce these mistakes and improve overall accuracy.
NumberCandidate labels are generally more accurate because they rely on explicitly defined rules, which leaves little ambiguity. Phone numbers, however, often contain special characters such as spaces or dashes, so we add a normalization step to standardize their format. This noticeably improves phone number recognition and overall data quality.
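An illustrative version of the phone number normalization step; the production rules may differ.

import re

def normalize_phone(raw: str) -> str:
    """Strip spaces, dots and dashes, and map the +84 country prefix to a leading 0."""
    digits = re.sub(r"[^\d+]", "", raw)   # drop spaces, dashes, dots
    if digits.startswith("+84"):
        digits = "0" + digits[3:]
    return digits

# Example: "090-123 4567" and "+84 90 123 4567" both normalize to "0901234567".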
The coverage ratio over the entire dataset is an important metric: it ensures that the labeling functions are not overly restrictive, which would inflate accuracy on the covered pages while reducing how much of the data the model can handle. Maintaining a reasonable coverage ratio balances accuracy against the model's ability to generalize.
Building this model took only two weeks from start to finish. The rapid development was possible because existing pre-trained models were used to label data automatically, which greatly reduced annotation time, and because system setup and labeling are streamlined. This shows how pre-trained models can accelerate the creation of new AI solutions at low cost.
Figure 6.1 illustrates the deployment architecture of the system in the production environment. The system has two main components, the AI Crawler and the Website Explorer, which together handle data collection and analysis, along with several supporting components:
• Tor proxy: a proxy between the system and the Internet that allows the crawler to change its IP address while collecting data, improving anonymity.
• AWS S3 Storage: where the Website Explorer stores URL models of Websites and the AI Crawler retrieves them for use.
• MongoDB: where the data collected by the AI Crawler is stored.
• Scheduler: schedules periodic re-collection of data from websites so the latest information is kept up to date.
• Management System: a CMS that allows administrators to add new websites and manage existing ones. See Figure 6.2.
• Metadata Storage: a relational database that stores metadata about known websites for the CMS and the Scheduler.
This thesis presents our research on Weak Supervision for Information Extraction in building an AI Crawler system. We addressed the key challenges of extracting information from HTML data and of building a data collection system that operates across many websites, and the results demonstrate effective solutions for improving extraction accuracy and efficiency in web crawling applications.
The study introduces a data collection system that uses machine learning to extract information from web pages without predefined extraction rules. The new design improves extraction efficiency and flexibility and adapts to a variety of web sources. The system has been deployed at Cengroup, demonstrating its practical effectiveness and its potential for broader application in automated web data collection.
In addition, we have successfully applied the Weak Supervision method to data labeling for the Information Extraction problem, which reduces the cost of building this model.
We plan to extend this approach to other domains such as e-commerce, employment, and NFTs to validate its effectiveness across industries. In the longer term, we aim to release a user-friendly, open-source toolkit that lets others build similar AI-driven systems for their own sectors.
[2] https://www.w3schools.com/cssref/css_selectors.asp.
[3] Marco Baroni, Francis Chantree, Adam Kilgarriff, and Serge Sharoff. Cleaneval: a competition for cleaning web pages. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08), Marrakech, Morocco, May 2008.
[4] Deng Cai, Shipeng Yu, Ji-Rong Wen, and Wei-Ying Ma. VIPS: a vision-based page segmentation algorithm. 2003.
[5] G. David Forney. The Viterbi algorithm. Proceedings of the IEEE, 61(3):268–278, 1973.
[6] Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759, 2016.
[7] Christian Kohlschütter, Peter Fankhauser, and Wolfgang Nejdl. Boilerplate detection using shallow text features. In Proceedings of the Third ACM International Conference on Web Search and Data Mining, 2010.
[8] Qian Li et al. A survey on text classification: from traditional to deep learning. 2022.
[9] Matthew E. Peters and Dan Lecocq. Content extraction using diverse feature sets. In Proceedings of the 22nd International Conference on World Wide Web, pages 89–90, 2013.
[10] Alexander Ratner, Stephen H. Bach, Henry Ehrenberg, Jason Fries, Sen Wu, and Christopher Ré. Snorkel: rapid training data creation with weak supervision. In Proceedings of the VLDB Endowment International Conference on Very Large Data Bases, volume 11, page 269. NIH Public Access, 2017.
[11] Tran Viet Trung. Vietnamese language model for spacy.io. Unpublished paper, 2021.
[12] Yuli Vasiliev. Natural Language Processing with Python and spaCy: A Practical Introduction. No Starch Press, 2020.
[13] Thijs Vogels, Octavian-Eugen Ganea, and Carsten Eickhoff. Web2Text: deep structured boilerplate removal. 2018.
[14] Tim Weninger, William H. Hsu, and Jiawei Han. CETR: content extraction via tag ratios. In Proceedings of the 19th International Conference on World Wide Web, pages 971–980, 2010.
[15] Sen Wu, Luke Hsiao, Xiao Cheng, Braden Hancock, Theodoros Rekatsinas, Philip Levis, and Christopher Ré. Fonduer: knowledge base construction from richly formatted data. In Proceedings of the 2018 International Conference on Management of Data, 2018.
[16] Jieyu Zhang, Cheng-Yu Hsieh, Yue Yu, Chao Zhang, and Alexander Ratner. A survey on programmatic weak supervision. arXiv preprint arXiv:2202.05433, 2022.
AI Crawler The new crawler with integrated AI models.
Category The type of a Page.
Content Classification Model The classification model to determine what Category a text (such as Title, Description, Main Text) belongs to.
Description The description of a Raw HTML of a Page; the text inside meta['description'] of the HTML.
Domain An area of knowledge, such as Real Estate, E-commerce, Education.
End Element The last element of the main Part in a Raw HTML.
End Selector The CSS Selector [2] of an End Element.
Extracted Data The result when the Information Extraction Model extracts a Main HTML, containing the Labels and their Values found in the Main HTML.
Information Extraction Model The model that returns the Extracted Data of a Main HTML.
Label A label in Extracted Data, such as Address, Price, Acreage.
Main Content Detection Model The model that returns the Main HTML and Main Text of a Raw HTML.
Main HTML Raw HTML after removing all Parts except main. This is the result of the Main Content Detection Model after processing a Raw HTML.
Main Text The main content in text format of a Page; the text returned by the Main Content Detection Model after processing a Raw HTML.
OTHER_PAGE A Category of the Real Estate Domain that refers to a different type, not to be confused with REAL_ESTATE_PAGE.
Page A specific page within a Website, associated with a unique URL. A website can contain multiple Pages.
Page Classification The act of determining what Category a Page belongs to; may use Content Classification Model or URL Classification Model.
Parser Crawler A component inside the AI Crawler that crawls data from known Websites.
Part A part of an HTML page, such as header, footer, navigation bar, sidebar, main.
Raw HTML The HTML response of a Page when requested to the URL of this Page.
REAL_ESTATE_PAGE A Category of the Real Estate Domain, including the Real Estate detail Page in the Websites of the Real Estate Domain.
Scrapy Crawler The old data crawler based on Scrapy platform, without the use of AI models.
Spider A module used in Scrapy Crawler to extract data from an HTML page.
Start Element The first element of the main Part in a Raw HTML.
Start Selector The CSS Selector [2] of a Start Element.
Title The title of a Raw HTML of a Page; the text inside the <title> tag of the HTML.
URL The URL of a Page.
URL Classification Model The classification model to determine what Category a URL belongs to.
URL Classification Trainer The module used to train the URL Classification Model of a new Website.
Value Value of a Label in Extracted Data. For example, with the Label "Address", the Value is "Ha Noi".
Weak Supervision As described in Chapter 4.
Website A collection of web pages and related content that belongs to a particular Domain (Real Estate), for example batdongsan.com.vn or muabannhadat.tv.
Website Explorer The tool applied to a new Website to generate the URL Classification Model for this Website.
Some source code used in the thesis
Listing A.2: Parse HTML input to Fonduer Model
if lower_unit in ['triệu/tháng', 'triệu/th', 'tr/th']:
Listing A.10: Labeling GoldLabel for Address
data = [(doc.name, doc) for doc in docs]
Listing A.11: Split data train-test