GRADUATION THESIS
DEVELOPMENT OF A WEB APPLICATION TO MONITOR
STATISTICAL DATA OF REAL ESTATE
PROPERTIES
DEPARTMENT OF SOFTWARE ENGINEERING
INSTRUCTOR: Assoc. Prof. Quan Thanh Tho
REVIEWER: Assoc. Prof. Bui Hoai Thang
STUDENT: Pham Minh Tuan 1752595
Ho Chi Minh 7/2021
- Investigate technologies to crawl real estate data: Scrapy framework, BeautifulSoup Python library.
- Investigate Selenium webdriver to collect data on anti-crawling websites.
- Investigate and build a price-predicting model using Linear Regression based on collected real estate data.
- Crawl data every day and update new data into the database.
- Investigate Python libraries to implement the model and train the model with the collected data.
- Designed and implemented a web-based application to show statistics of the collected data, and performed some experiments on predicting prices.
The referenced contents, which I read from the documentation of the technologies used in my web application, are listed in the References page, and I promise that I do not copy or take the content of others' theses without permission.
First, I want to say thanks to my instructor, Assoc. Prof. Quan Thanh Tho, for guiding my topic as well as providing the knowledge to implement the web application from the first step. Besides, I am extremely happy because of my family's support during the time I did the thesis and throughout three years of my study at Ho Chi Minh City University of Technology. This work could not have been finished well if I had not received so much encouragement from my parents and my little brother.
This kind of industry has started growing fast in recent years to respond to the needs of customers, and some real estate companies want to collect as much data in this area as possible to analyze the specific areas that attract the most investors as well as sellers. However, the data gained about real estate is enormous and spread over lots of websites, so it is hard for them to gather all the information at once. This thesis will introduce a web application that will collect information from real estate websites so that real estate agents can give a precise analysis of the resource: real property.
Contents
1 Overview 7
1.1 Introduction 8
1.2 Objectives and Scope 8
2 Background Knowledge 9
2.1 Frontend 10
2.1.1 Single-page Application (SPA) 10
2.1.2 JSX 12
2.1.3 Components and Props 13
2.1.4 States 14
2.1.5 Events 15
2.1.6 Code Splitting 16
2.1.7 Fragment 17
2.2 Backend 18
2.2.1 Model Layer 19
2.2.2 View Layer 20
2.2.3 Template Layer 20
2.3 AJAX Request - Response 21
2.4 Postgres SQL 21
2.5 Scrapy 22
2.5.1 Basic concept 23
2.5.2 Scrapy’s Architecture 29
2.5.3 Working with Dynamically-loaded Content 30
2.5.4 Selenium 32
2.6 Linear Regression 36
2.6.1 Linear Model 36
2.6.2 Cost Function 37
2.7 Polynomial Features 37
2.8 Evaluation Metrics 38
2.8.1 Mean Squared Error (MSE) 38
2.8.2 R-Squared Score (R2 ) 39
2.8.3 Cross Validation Score 39
2.9 Underfitting and Overfitting 40
2.10 Regularization 41
2.10.1 Ridge Regression 41
2.10.2 Lasso Regression 42
2.11 Feature Engineering 42
2.11.1 Handling missing data 42
2.11.2 Outliers removal 43
2.11.3 Log Transformation 44
2.12 Supported Python Libraries 46
2.12.1 Numpy 46
2.12.2 Pandas 46
2.12.3 Matplotlib 47
2.12.4 Scikit-Learn 47
3 System Implementation 48
3.1 Use-case Diagram 49
3.2 Architecture Diagram 50
3.3 Database 51
3.4 Workflow 51
3.5 Crawling Data 52
3.5.1 Handling duplicated data 52
3.5.2 Formatting Item's name 52
3.5.3 Handling web pages with Selenium 53
3.6 Data Modeling 54
3.6.1 Data Preparation 55
3.6.2 Training Models 59
3.6.3 Model Evaluation Result 60
3.7 Web Application 61
3.7.1 Register page 61
3.7.2 Login page 62
3.7.3 Dashboard page 62
3.7.4 Data page 63
3.7.5 Price prediction page 64
3.7.6 Admin page 65
4 Summary 66
4.1 Achievement 67
4.2 Future Development 67
4.2.1 Thesis limitation 67
4.2.2 Further development 68
List of Figures
2.1 Single-page Application 10
2.2 Scrapy’s architecture 29
2.3 How Selenium WebDriver works 33
2.4 Non-linear dataset 38
2.5 Five-folds cross-validation 40
2.6 Underfitting & Overfitting 41
2.7 Boxplot components 44
2.8 Matplotlib Boxplot 44
2.9 Skewed Data 45
2.10 Data before using Log Transform 45
2.11 Data after using Log Transform 46
3.1 Use-case diagram 49
3.2 Architecture diagram 50
3.3 CRED System Database Schema 51
3.4 Scrapy collected data sample 54
3.5 Cu Chi Selling Land dataset 55
3.6 Dataset in Ba Thien, Nhuan Duc, Cu Chi 55
3.7 Origin dataset in Nguyen Huu Canh, 22, Binh Thanh 56
3.8 Dataset after using Log Transformation 56
3.9 Dataset Histogram before & after Log Transformation 57
3.10 Dataset after removing duplicates 57
3.11 Boxplot of area values 58
3.12 Divide dataset into equal parts 58
3.13 Outliers detection after applying on each part 58
3.14 Polynomial Regression degree choosing by train set’s RMSE 59
3.15 Polynomial Regression degree choosing by validation set’s RMSE 59
3.16 Register page 61
3.17 Login page 62
3.18 Dashboard page 62
3.19 Dashboard page 63
3.20 Data page 63
3.21 Price prediction page 64
3.22 Admin page - Manage users 65
3.23 Admin page - Retrain models 65
1 — Overview
In this chapter, I am going to introduce my thesis topic, then I will show the targets and the scope of my thesis for the web application.
1.1 Introduction 8
1.2 Objectives and Scope 8
1.1 Introduction
Crawling data is a common feature that appears in many web applications whose main function is collecting data. However, a web app that collects and monitors data in the real estate field for common real estate agents is rare. Therefore, they need software to help them synthesize all data from other real estate websites and then start making analyses of the collected data.
In the thesis phase, I will implement a web application that will collect information on different real estate websites and display them on my web page. Clients can register accounts, log in to the web app, and view the data. The web app also provides filter functionality to let users filter the data for custom viewing, and it helps users predict the price of a specific real estate property based on elements like area, street, ward, and district. Moreover, the web app provides users with a general view of the real estate data through the charts displayed on the Dashboard page.
The objectives of the thesis are:
 Implement a web application to show the collected data.
 Build a custom crawler to get the data from three main real estate websites: batdongsan.com.vn, homedy.com, and propzy.vn.
 Provide some real estate data statistics based on the collected data.
 Build machine learning models to predict the price of real estate properties based on the collected data.
For the topic scope:
 The web app is limited to a few users only (fewer than ten users) and it works locally.
 The collected real estate posts are limited to Ho Chi Minh City only. The real estate post type is selling.
2 — Background Knowledge
In this chapter, I am going to illustrate the knowledge I researched to implement the web application, including some definitions of the technologies I will use in both the Frontend and Backend parts.
2.1 Frontend 10
2.2 Backend 18
2.3 AJAX Request - Response 21
2.4 Postgres SQL 21
2.5 Scrapy 22
2.6 Linear Regression 36
2.7 Polynomial Features 37
2.8 Evaluation Metrics 38
2.9 Underfitting and Overfitting 40
2.10 Regularization 41
2.11 Feature Engineering 42
2.12 Supported Python Libraries 46
2.1 Frontend
2.1.1 Single-page Application (SPA)
A single-page application is a web application that can interact with the client dynamically with only one single web page at a time. Whenever a user changes the content in the web page, it will be rewritten with the new content instead of reloading a new page.
Each page of the web application usually has a JavaScript layer at the header or bottom to communicate with the web services on the server side. The content of the web page is loaded from the server side in response to events that users make in the current web page, like clicking on a link or a button.
Figure 2.1: Single-page Application
Nowadays, with the development of the technologies used to build web applications, especially in the frontend area, there are three popular frameworks that most developers use: React JS, Angular JS, and Vue JS. I will state the comparison between them, and why I choose to use React JS to implement the frontend of my web application.

Angular JS
Angular JS is a full-fledged MVC framework which provides a set of defined libraries and functionality for building a web application. It is developed and maintained by Google, was first released in 2010, and is based on TypeScript rather than JavaScript itself.
Angular is unique with its built-in two-way data binding feature, unlike React, which uses one-way data binding. Both React JS and Angular JS use Components to build the web application, meaning that when a component (or a model) is rendered in the view it can have different data inside based on the web page the user is seeing. For Angular, when the data in the view changes it also leads to a change of the data in the model, while in React it does not. This makes it more convenient for developers to build a web application easily. However, in my opinion, it would be hard to manage the data when the web application becomes larger, as well as to debug it.
React JS
React JS is actually a JavaScript library, not a framework like the other two; it is developed and maintained by Facebook. It was released in March 2013, and is described as "A JavaScript library for building user interfaces".
Since ReactJS is mostly used to create interactive UIs in the client view, ReactJS is not really difficult to learn at first. However, once I got used to some concepts it provides, it became easier to maintain the content of the web page. Moreover, it comes with big library resources to support developers in creating websites quickly with the help of available components. Developers have freedom in choosing the suitable libraries to use in building a web application, because React JS only works at the client view, unlike Angular JS, which is built as an MVC framework, so to develop the full web application developers have to strictly follow the template it provides.
Vue JS
Vue JS is a frontend framework that is developed to make flexible web applications. It was developed by the ex-Google employee Evan You in 2014. It shares many similarities with the two above: it uses Components to build the web application, and it also has two-way data binding like Angular.
Vue is versatile, and it helps you with multiple tasks; it comes with flexibility in designing the web application architecture. Vue syntax is simple, so people who have just started with a JavaScript background can still learn it, and it also supports TypeScript as well.
Why React?
The reason I choose React JS to build the frontend of the web application is because of its simplicity and its freedom in building the web app. React JS lets developers freely build a web page with the libraries of their choice, unlike Angular; besides, Angular requires developers to get used to TypeScript, and it is also hard for a beginner with frontend frameworks to start with, since Angular is considered the most complex one among the three frontend web frameworks I listed above. For Vue JS, although it is simple and easy to approach for developers at first, it is not as stable as the other two and its support community is small.
To conclude, each framework has its pros and cons depending on how programmers decide to implement their web application. For me, I found that React JS is a good one to start with to build the web application.
2.1.2 JSX
Firstly, I want to introduce JSX before presenting some main concepts that are used in building the view of the web application in React JS.
const element = <h4>I am JSX!</h4>
The line above is neither HTML nor a String; it is called JSX and this syntax is used in React JS to render elements. It is like a combination between JavaScript and XML (Extensible Markup Language). It is more like a tag in HTML, but it can be customized. By using JSX, a developer can manipulate the data inside an element by embedding it inside the HTML tag.
const element = <h4>I am JSX!</h4>
const hello = <h1>Hello, {element}!</h1>
By default, when React DOM renders an element, it will escape any value that is embedded inside JSX to prevent XSS (cross-site scripting) attacks. Each element defined in JSX can then be rendered in the web page by React DOM.
const element = <h4>I am JSX!</h4>
const hello = <h1>Hello, {element}!</h1>
ReactDOM.render(
element,
document.getElementById('root')
);
element is rendered inside a div tag with id="root".
2.1.3 Components and Props
Components in React JS let developers split a web page into independent, reusable pieces, each of which works in isolation from the others. A web page can contain lots of components inside; whenever a user changes the contents of the web page, the current page will be rewritten with the new component, or the data inside the current component will be updated, without reloading the whole page.
The way components work is similar to how we use functions or methods in most programming languages. When we define a function, we declare the parameters for the function; these are like props in a Component. Then, we can call it inside ReactDOM's render to render that component in the web page we want to show.
There are two ways to define a component in React JS: using a JavaScript function and using an ES6 JavaScript class:
const helloComponent = <HelloReact name="tuanminh" />
A function that does not modify its input is what React JS calls a pure function: "All React components must act like pure functions with respect to their props." However, we want our web application to be designed dynamically, with content in the UI that can change over time; therefore React JS comes with a concept to be used in Components, which is state.
2.1.4 States
State is similar to props, but it is private to the Component only and it is changeable. State is introduced in React JS when we define a Component as a JavaScript class. However, since v16.8, React has introduced a new concept called Hooks that lets developers use state inside a function without writing a class. During the time I read the documentation about React JS, I only learned to define Components as classes, because that is the basic way of building and customizing a Component. Hooks are a new concept, and most of the libraries that support React JS have changed to use Hooks to build their Components now. When I start building the full web application in the next phase of the thesis, I will try to use them to build the Components in the web application.
To define state in a React Component, we first create a constructor and then create the state with its initial value:
class HelloReact extends React.Component {
  constructor(props) {
    super(props);
    this.state = {
      name: 'tuanminh',
    };
  }
}
Lifecycle
One important concept related to using state in React is lifecycle methods. In the case of loading content with many components into the web page, it is necessary to free up resources taken by previous components when they are destroyed, or, after new content has loaded, we may want to run a function to operate the functionality of the current web page, like getting data from the server. That is how React comes with componentDidMount() and componentWillUnmount() to control when a component is mounted or unmounted.
Modify State
As state is changeable compared to props, to change the state value we must use setState() provided by React Component. State cannot be changed directly like this:
this.state.name = 'tuanminh';
Instead:
// This will change the state name of Component
this.setState({ name: 'tuanminh' })
2.1.5 Events
Handling events with React elements is similar to handling events on DOM elements: we pass a function as the event handler inside the JSX attribute.
<button onClick={handleClick}>Click me</button>
However, when we declare a function inside a JavaScript class, we need to use bind() to bind the current function to the class when we use the this keyword. This is a JavaScript characteristic: the default binding of this in JavaScript is the global context, which is window, not the current class we declare our function in.
class ReactComponent extends React.Component {
  constructor(props) { super(props); this.handleClick = this.handleClick.bind(this); }
  handleClick() { /* ... */ }
}
In case you want to use a parameter in handleClick(), we can try a JavaScript arrow function or use JavaScript bind() instead.
<button onClick={() => this.handleClick(param)}>Click me</button>
//Or:
<button onClick={this.handleClick.bind(this, param)}>Click me</button>
2.1.6 Code Splitting
This concept introduces the import() syntax in React JS; it allows developers to import a Component from a library or a custom Component to be reused in our own Component.
import { Component } from 'react';
import NavButton from './NavButton';
class Navbar extends Component {
  // simplified render using the imported NavButton
  render() { return <nav><NavButton /></nav>; }
}
At first, we need to import the standard Component from react to implement our custom Component. Then we import the custom button to use it in the navbar.
2.1.7 Fragment
The Component class in React JS allows developers to return one element only, which could be a div tag or a Component that is imported from a library. However, with the help of Fragment, it is possible to return multiple elements.
2.2 Backend
Since React only provides the client view, to communicate with the database and the crawler to get the data for the web app, we need to build the backend to handle this problem. React JS can integrate easily with most backend frameworks nowadays, as long as the backend provides an API for React JS to render the content in the view. Nowadays it is not hard to find yourself a backend framework to create a full-stack web app. Yet we can build the backend part from scratch too: we can start with building the architecture, like the common MVC architecture that is used in much software, not just in web applications, then the database design and how to manipulate the data in the database, and finally the routing and control of the view that will be rendered to the client.
These parts are now handled by most backend web frameworks. Besides providing a way to manage the architecture more efficiently, with a framework we do not need to care much about the web services (localhost) to run the server side or the security of our web application. It also lets developers deploy their web app quickly, since frameworks already provide the reusable pieces of code that most web applications have, for example cookie or session management for users when they log in to the web app. The framework handles that job immediately for you, yet we can customize that part to let it work individually.
One disadvantage of using a framework appears if we want to make a unique function based on a current one it provides. The framework may not have the functionality or the API to let us customize it on our own, so we will need to rewrite that function entirely, and that piece of work will not be simple if we do not look at the way it is implemented. Eventually, there is still a way to change it, by overriding that function with our own function based on the one the framework provides.
The backend framework I choose to implement my web application in is Django. It is recommended as one of the best frameworks on the Internet, since Django lets developers use Python to work with the server side of the web. Python is a high-level programming language, yet friendly for most programmers due to its straightforward syntax and support of various libraries and tools. Besides, Django comes with Django REST framework, which provides RESTful APIs that work well with React JS.
Django follows the Model-View-Controller (MVC) architecture, in which the Model is for managing the database, the View is the client's view, and the Controller acts as the middleman between Model and View. When a client sends a request from the View to the server, the Controller will handle the request and communicate with the Model to retrieve, update, or add new data to the database.
Django provides developers with three separate layers, which are:
 The Model Layer, which plays the role of the Model in the MVC architecture. This layer provides classes and modules for structuring and manipulating the data of the web application.
 The View Layer, which acts as the Controller. This layer lets developers define request-response functions or classes that can be used to access data through the Model layer.
 The Template Layer contains the templates, which store the HTML files that are used to display the view to the client of the web application. However, in my web app React JS already handles this role, so I will not present much about this layer.
2.2.1 Model Layer
A Model in Django is a Python class that maps to a database table; the attributes inside the Model stand for the fields in the database. For example:
from django.db import models
class RealEstateData(models.Model):
url = models.CharField(max_length=100)
post_title = models.CharField(max_length=50)
url and post_title are the two fields of the table, which Django names myapp_realestatedata by default.
To use models, we first need to add the name of the app to INSTALLED_APPS = [...] in settings.py. Then, when we define a model in models.py, we can save that model into the database as a table with two CLI commands, makemigrations and migrate (run as python manage.py makemigrations and python manage.py migrate). The first command is used when we first define a model or change some characteristics of the model, like adding attributes, changing the type of attributes, or deleting attributes. After that, we use migrate to apply those changes to the data inside the database.
Making queries
Models provided by Django give developers a database-abstraction API that lets you directly manipulate the data inside the database without writing any SQL queries. I will state some common methods:
ma- To create an object, we simply call the name of models given available attributes
inside the model like creating a class in Python, then object.save() let the object
saved in database
data = RealEstateData('www.realestate.com', 'mua_ban_nha_dat')
data.save()
 To retrieve all objects, simply call the method objects.all().
 To retrieve only part of the objects, we can use filter(); the inputs of filter are attribute lookups and the values to match, as sketched below.
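For example, a minimal sketch using the two fields of the RealEstateData model defined above (the lookup value is only illustrative):

# Posts whose title contains a given keyword (illustrative value)
posts = RealEstateData.objects.filter(post_title__contains='nha_dat')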
2.2.2 View Layer
The View layer defines which function or class will handle which request. We configure this in the urls.py file in our project:
from djangọurls import path
from myapp import views
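The file then maps URL paths to view functions through a urlpatterns list. A minimal sketch (the path and the view name data_view are only illustrative):

urlpatterns = [
    path('data/', views.data_view),  # requests to /data/ are handled by views.data_view
]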
2.3 AJAX Request - Response
AJAX stands for Asynchronous JavaScript And XML; it is the use of XMLHttpRequest to communicate with the backend. It allows the frontend to send and receive data in different formats like text, JSON, and HTML, and to update the data in the frontend view without having to refresh the current page.
React JS provides the interactive view for the client to view the web page; however, it is just a view, so whenever the user triggers an event in the view that makes a component's data change, the view should communicate with the server side to load the suitable data into the current component. To communicate with the backend, the changeable component in React will send an AJAX request to the backend, then the backend responds with data back to the requesting component and the data displayed in the component changes. This update will not refresh the whole page of the web application, less data will be loaded at a time, and this will improve the user's experience in using the web app.
There are two common ways for React JS to integrate with the backend: using the fetch() API provided by the browser, or using the JavaScript library axios. In my web app, I decided to use axios to handle AJAX requests/responses rather than fetch(), because axios is a famous library for handling AJAX in most web applications developed using JavaScript, and it provides trusted security to protect the communication between the frontend and backend. I do not recommend using jQuery's AJAX in a web application developed with React, since React and jQuery operate differently from each other and they will conflict if you put them in one place.
2.4 Postgres SQL
I choose Postgres SQL to set up the database of my web application. Postgres SQL is a powerful, open-source object-relational database system that uses and extends the SQL language, combined with many features that safely store and scale the most complicated data workloads. Postgres is famous for its proven architecture, reliability, data integrity, robust feature set, extensibility, and the dedication of the open-source community that stands behind the software to consistently deliver performant and innovative solutions.
Postgres SQL comes with many features aimed at helping developers build applications and administrators protect data integrity and build fault-tolerant environments. Moreover, it helps developers manage the data no matter how big or small the dataset is. Postgres SQL is a free and open-source database, yet it is highly extensible. It is available on most operating systems, and it is used in many web applications as well as mobile and analytics applications.
2.5 Scrapy
The main function of the web application is to collect data from different real estate websites to display in my web app at once, so I need to make a crawler for crawling that information. Currently, I am researching Scrapy, a Python application framework for crawling websites.
Scrapy is an application framework for crawling websites and extracting structured data which can be used for a wide range of useful applications, like data mining, information processing, or historical archival. Although Scrapy was originally designed for web scraping, it can also be used to extract data using APIs or as a general-purpose web crawler. The main component of Scrapy is the spider.
import scrapy

class QuotesSpider(scrapy.Spider):  # class header restored; the spider name is illustrative
    name = 'quotes'
    start_urls = ['http://quotes.toscrape.com/tag/humor/']

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'author': quote.xpath('span/small/text()').get(),
                'text': quote.css('span.text::text').get(),
            }
        next_page = response.css('li.next a::attr("href")').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)
Above is the sample code of using a spider for crawling the author and text on the sample website 'http://quotes.toscrape.com/tag/humor/'. To run the spider we use this command:
scrapy runspider quotes_spider.py -o quotes.jl
After the command finishes executing, it will create a file named quotes.jl, which contains a list of quotes in JSON Lines format.
quotes.jl
{"author": "Jane Austen", "text": "\u201cThe person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.\u201d"}
{"author": "Steve Martin", "text": "\u201cA day without sunshine is like, you know, night.\u201d"}
When we run the command scrapy runspider quotes_spider.py, Scrapy looks for a Spider definition inside it and runs it through its crawler engine.
The crawl starts by making requests to the URLs defined in the start_urls attribute (in this case, only the URL for quotes in the humor category) and calls the default callback method parse, passing the response object as an argument. In the parse callback, we loop through the quote elements using a CSS Selector, yield a Python dict with the extracted quote text and author, look for a link to the next page, and schedule another request using the same parse method as the callback. Furthermore, I will go through some basic concepts and tools provided by Scrapy to perform crawling of real estate data.
2.5.1 Basic concept
Spiders
Spiders are classes defined to instruct Scrapy to perform a crawl on a certain website or list of websites and process the crawled data into structured items. They are the places where you can customize how your data is crawled and control the number of websites involved.
To start writing your custom spider, you need to inherit from the default Scrapy spider, the class scrapy.spiders.Spider. This is the simplest spider that comes bundled with Scrapy; it provides some necessary methods and attributes for crawling:
 name - A string attribute used to define the name of the spider. It should be a unique name.
 allowed_domains - An attribute containing a list of valid domains defined by the user to control the requested links. If a requested link has a domain that does not belong to the list, it will not be allowed.
 start_urls - An attribute that is used to store a list of URLs where the Spider will begin to crawl.
 start_requests - This method returns an iterable with the first Requests to crawl. It is called by Scrapy when the spider starts crawling. If the user uses start_urls, each URL in this list will be requested by the default start_requests method automatically.
 parse(response) - This is the default callback method after Scrapy finishes downloading a response. Inside the parse method you can extract data into smaller items and process them.
There are other attributes and methods, but since they are not necessary I will skip them; you can refer to Scrapy's documentation. Besides the default Scrapy spider, there are other generic spiders defined depending on the needs of the scraping site. By using generic spiders, users can easily customize the spider for their usage purpose.
 CrawlSpider - This is the most commonly used spider for crawling websites. This spider comes with specific rules to control the behavior of crawling websites (a minimal sketch follows this list).
XMLFeedSpider - This spider is designed for parsing XML feeds
 CSVFeedSpider - This spider is similar to the XML one, except that it handles CSV files.
 SitemapSpider - This spider allows users to crawl a site by discovering the URLs using Sitemaps.
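A minimal CrawlSpider sketch is shown below. The domain is one of the sites targeted by this thesis, but the start URL, link pattern, spider name, and callback body are only illustrative assumptions:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class RealEstateSpider(CrawlSpider):
    name = 'realestate'
    allowed_domains = ['batdongsan.com.vn']
    start_urls = ['https://batdongsan.com.vn/']  # illustrative entry page
    rules = (
        # Follow post links matching the (illustrative) pattern and parse each one
        Rule(LinkExtractor(allow=r'/ban-'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        yield {'url': response.url, 'post_title': response.css('h1::text').get()}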
Selectors
After Scrapy finishes downloading a response, it will call a callback function, whose default is parse(response), or your custom one. The response will be an HTML source; this time you will need to extract your data from this source, and to achieve it Scrapy provides a mechanism to handle this job: selectors. Selectors select parts of the HTML source by using expressions like CSS or XPath. CSS is the way to select components in HTML files, while XPath is used in XML documents. To query the response using XPath or CSS, we use response.xpath(<selector>) and response.css(<selector>); inside the method is your component selector, just passed as a string. For example:
<div class="images">
< href="image1.png" alt="this is image">My image</a
Trang 36Thesis 25
And XPath will be:
response.xpath('//div[@class="images"]/a')
To get used to these kinds of selectors, you can refer to CSS Selectors and XPath usage guides on the Internet or other trusted resources for more details. Scrapy only uses them as a way to manipulate data after we already have an HTML source file.
To extract textual data, we use the get() and getall() methods; get() returns a single result only, while getall() returns a list of results.
response.xpath('//div[@class="images"]/a').get()
response.xpath('//div[@class="images"]/a').getall()
In case you want to select an element attribute, for example the href attribute inside the <a> tag, Scrapy provides the attrib property of Selector to look up the attributes of an HTML element.
response.xpath('//div[@class="images"]/a').attrib['href']
Selectors are often used inside the callback function, parse(response) (if you define a custom parse function, you can pass your function in the callback attribute of your Scrapy request; I will discuss this kind of request in the Request & Response section).
Items
The term item in Scrapy represents a piece of structured information. When I scrape a real estate page, I try to get as much data as possible, like post type, address, email, phone, etc. However, this data is unstructured and may be hard to retrieve for later use, and Items can help me access this information conveniently. Scrapy supports four main types of Items via itemadapter: dictionaries, Item objects, dataclass objects, and attrs objects.
 Dictionaries: this is simply a Python dictionary.
 Item objects: Scrapy provides a class for Item with a dict-like API (the user can access the Item like a dictionary). Besides, this kind of Item provides the Field class to define field names, so that an error (like KeyError) can be raised when there is an exception that you want to handle or when you want to stop the crawling on the current page. This is the type of item I choose to use in Scrapy, since it is the easiest way to handle Items for accessing and storing. By defining field names in the Item class, we can visually control which fields we want to scrape.
 Dataclass objects: dataclass() allows defining item classes with field names, so that item exporters can export all fields by default even if the first scraped object does not have values for all of them. Besides, dataclass() also lets each field define its own type, like str, int, or float.
 attrs objects: attr.s() can also be used to define item classes with field names, similar to dataclasses (this requires the attrs package).
Working with Item Objects
To start using Item objects, I need to modify the items.py file created with the Scrapy project and create a class for the Item object:
import scrapy

class CrawlerItem(scrapy.Item):
    url = scrapy.Field()
    content = scrapy.Field()
    post_type = scrapy.Field()

# In the spider, the item can then be created and accessed like a dict:
item = CrawlerItem()
item['post_type'] = something
Besides the default items provided by Scrapy, there is another related type, ItemAdapter. It is mainly used in Item Pipelines or Spider Middlewares; ItemAdapter is a wrapper class to interact with data container objects, providing a common interface to extract and set data without having to take the object's type into account.
Scrapy Shell
Scrapy shell is an interactive shell for users to debug Scrapy code without running a spider. It is used to test XPath or CSS expressions to see if they work successfully; also, the Scrapy shell runs against the website I want to scrape, so I can know whether that page is scraped without errors or intercepted. The Scrapy shell is similar to the Python shell if you have already worked with it. The Scrapy shell also works well with IPython, which is a powerful interactive Python interpreter with syntax highlighting and many other modern features. If users have IPython installed, the Scrapy shell will use it as the default shell.
Launch the Shell
To open the shell, we use the shell command like this:
scrapy shell <url>
Work with Scrapy Shell
The Scrapy shell provides some additional functions and objects to be used directly in the shell:
 shelp: print help with the list of available objects and functions.
 fetch(url[, redirect=True]): fetch a new response with the given URL and update all related objects.
 fetch(request): fetch a new response from the given request and update all related objects.
 view(response): open the given response in the browser. If the response is an HTML file, it is re-rendered from text into an HTML file.
 crawler: the current Crawler object.
 spider: the Spider used to handle the current URL.
 request: a Request object of the last fetched URL.
 response: the Response returned from the last fetched URL.
 settings: the current Scrapy settings.
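As a small illustration, a session in the shell might look like the following (the page and selector come from the quotes example above; outputs are omitted):

# launched with: scrapy shell 'http://quotes.toscrape.com/tag/humor/'
fetch('http://quotes.toscrape.com/tag/humor/')    # (re)download the page and update response
response.css('div.quote span.text::text').get()   # test a CSS expression on the response
view(response)                                     # open the downloaded page in the browser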
Item Pipeline
After an item is scraped in a spider, it will then be sent to the Item Pipeline. This is where the item is manipulated, for example changing the format of the item or deciding whether to keep or drop the item. To write an Item Pipeline, we need to create a class inside pipelines.py, in the same directory as the items.py file. The naming convention of the class is <Name of class>Pipeline; we can follow an available class already implemented inside the file.
Work with Item Pipeline
The Item Pipeline provides some methods to process items; it is recommended to use ItemAdapter to work with items. To use ItemAdapter, you need to import it into the file:
from itemadapter import ItemAdapter
I will list some useful methods of the Item Pipeline below:
 process_item(self, item, spider): This method is called for every item by each pipeline component; it must return an item object in the end. Inside this method, you can access the item by creating an ItemAdapter object:
adapter = ItemAdapter(item)
# Then you can access through item like normally:
adapter['post_type'] = something
To drop an item, I use the DropItem exception provided by scrapy.exceptions; you just need to import it like ItemAdapter:
from scrapy.exceptions import DropItem
# To drop item you raise DropItem:
raise DropItem("drop item because post type is missing", adapter['post_type'])
 open_spider(self, spider): This method is called when the spider is opened.
 close_spider(self, spider): This method is called when the spider is closed. A minimal pipeline combining these methods is sketched after this list.
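Putting the pieces above together, a minimal pipeline sketch might look like this (it drops items missing the post_type field defined in CrawlerItem; the class name is only illustrative):

from itemadapter import ItemAdapter
from scrapy.exceptions import DropItem

class CrawlerPipeline:
    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        if not adapter.get('post_type'):
            # discard incomplete posts instead of storing them
            raise DropItem("drop item because post type is missing")
        return item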
Activate Item Pipeline in Scrapy Setting
To make the Item Pipeline able to process items, you need to activate it in the Scrapy settings file, settings.py, through the ITEM_PIPELINES setting.
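A minimal sketch of that setting (the module path matches the illustrative CrawlerPipeline above):

ITEM_PIPELINES = {
    'crawler.pipelines.CrawlerPipeline': 300,  # lower numbers run earlier (range 0-1000)
}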
The integer value assigned to each pipeline determines the order in which they run: items go through pipelines from lower to higher values.
Request & Response
A Scrapy Request is used to send a request to a given URL in the spider. If you define your crawled URLs in the start_urls list, Scrapy will issue the default request for every URL in the list. However, in some cases you want to rewrite the request with additional arguments or a custom callback function, not the default self.parse, so you need to call the request yourself to get the page source. The Response is the object returned when executing a request; both Request and Response can be used by instantiating their objects. The Response is commonly used in Middlewares, for example, to return a custom Response when a request is executed successfully.
Passing additional data to callback function
Users may want to pass data to their callback function, which is called when the request is executed successfully. You can achieve this by using the cb_kwargs attribute; the passed value should be a dictionary.
def parse(self, response):
    yield scrapy.Request(url='something', callback=self.parse_response,
                         cb_kwargs=dict(url='something'))

def parse_response(self, response, url):
    item['url'] = url  # this url is passed through cb_kwargs
2.5.2 Scrapy’s Architecture
Below is the picture illustrating the Data Flow in Scrapy:
Figure 2.2: Scrapy’s architecture
These flows are controlled by the execution engine and go in the order of the numbers defined in the picture:
1. The Spider sends the request to the Engine.
2. Then the Engine transfers it to the Scheduler for scheduling the request and continues listening for the next request to crawl.