GRADUATION THESIS
DEVELOPMENT OF A WEB APPLICATION TO MONITOR
STATISTICAL DATA OF REAL ESTATE
PROPERTIES
DEPARTMENT OF SOFTWARE ENGINEERING
INSTRUCTOR: Assoc. Prof. Quan Thanh Tho
REVIEWER: Assoc. Prof. Bui Hoai Thang
STUDENT: Pham Minh Tuan 1752595
Ho Chi Minh 7/2021
- Investigate technologies to crawl real estate data: Scrapy framework, BeautifulSoup Python library.
- Investigate Selenium webdriver to collect data on anti-crawling websites.
- Investigate and build a price-predicting model using Linear Regression based on collected real estate data.
- Crawl data every day and update new data into the database.
- Investigate Python libraries to implement the model and train the model with the collected data.
- Designed and implemented a web-based application to show statistics of the collected data, and performed some experiments on predicting prices.
The referenced contents, which I read from the documentation of the technologies used in my web application, are listed in the References page, and I promise that I do not copy or take the content of others' theses without permission.
First, I want to say thanks to my instructor, Assoc. Prof. Quan Thanh Tho, for guiding my topic as well as providing the knowledge to implement the web application from the first step. Besides, I am extremely happy because of my family's support during the time I did the thesis and throughout three years of my study at Ho Chi Minh City University of Technology. This work could not have been finished well if I had not received so much encouragement from my parents and my little brother.
This kind of industry has started growing fast in recent years to respond to the needs of customers, and some real estate companies want to collect as much data in this area as possible to analyze the specific areas that attract the most investors as well as sellers. However, the data gained about real estate is enormous and spread over lots of websites, so it is hard for them to gather all the information at once. This thesis will introduce a web application that will collect information from real estate websites so that real estate agents can give a precise analysis of the resource: real property.
Contents
1 Overview 7
1.1 Introduction 8
1.2 Objectives and Scope 8
2 Background Knowledge 9
2.1 Frontend 10
2.1.1 Single-page Application (SPA) 10
2.1.2 JSX 12
2.1.3 Components and Props 13
2.1.4 States 14
2.1.5 Events 15
2.1.6 Code Splitting 16
2.1.7 Fragment 17
2.2 Backend 18
2.2.1 Model Layer 19
2.2.2 View Layer 20
2.2.3 Template Layer 20
2.3 AJAX Request - Response 21
2.4 Postgres SQL 21
2.5 Scrapy 22
2.5.1 Basic concept 23
2.5.2 Scrapy’s Architecture 29
2.5.3 Working with Dynamically-loaded Content 30
2.5.4 Selenium 32
2.6 Linear Regression 36
2.6.1 Linear Model 36
2.6.2 Cost Function 37
2.7 Polynomial Features 37
2.8 Evaluation Metrics 38
2.8.1 Mean Squared Error (MSE) 38
2.8.2 R-Squared Score (R2 ) 39
2.8.3 Cross Validation Score 39
2.9 Underfitting and Overfitting 40
2.10 Regularization 41
2.10.1 Ridge Regression 41
2.10.2 Lasso Regression 42
2.11 Feature Engineering 42
2.11.1 Handling missing data 42
2.11.2 Outliers removal 43
2.11.3 Log Transformation 44
2.12 Supported Python Libraries 46
2.12.1 Numpy 46
2.12.2 Pandas 46
2.12.3 Matplotlib 47
2.12.4 Scikit-Learn 47
3 System Implementation 48
3.1 Use-case Diagram 49
3.2 Architecture Diagram 50
3.3 Database 51
3.4 Workflow 51
3.5 Crawling Data 52
3.5.1 Handling duplicated data 52
3.5.2 Formatting Item's name 52
3.5.3 Handling web pages with Selenium 53
3.6 Data Modeling 54
3.6.1 Data Preparation 55
3.6.2 Training Models 59
3.6.3 Model Evaluation Result 60
3.7 Web Application 61
3.7.1 Register page 61
3.7.2 Login page 62
3.7.3 Dashboard page 62
3.7.4 Data page 63
3.7.5 Price prediction page 64
3.7.6 Admin page 65
4 Summary 66
4.1 Achievement 67
4.2 Future Development 67
4.2.1 Thesis limitation 67
4.2.2 Further development 68
List of Figures
2.1 Single-page Application 10
2.2 Scrapy’s architecture 29
2.3 How Selenium WebDriver works 33
2.4 Non-linear dataset 38
2.5 Five-folds cross-validation 40
2.6 Underfitting & Overfitting 41
2.7 Boxplot components 44
2.8 Matplotlib Boxplot 44
2.9 Skewed Data 45
2.10 Data before using Log Transform 45
2.11 Data after using Log Transform 46
3.1 Use-case diagram 49
3.2 Architecture diagram 50
3.3 CRED System Database Schema 51
3.4 Scrapy collected data sample 54
3.5 Cu Chi Selling Land dataset 55
3.6 Dataset in Ba Thien, Nhuan Duc, Cu Chi 55
3.7 Origin dataset in Nguyen Huu Canh, 22, Binh Thanh 56
3.8 Dataset after using Log Transformation 56
3.9 Dataset Histogram before & after Log Transformation 57
3.10 Dataset after removing duplicates 57
3.11 Boxplot of area values 58
3.12 Divide dataset into equal parts 58
3.13 Outliers detection after applying on each part 58
3.14 Polynomial Regression degree choosing by train set’s RMSE 59
3.15 Polynomial Regression degree choosing by validation set’s RMSE 59
3.16 Register page 61
3.17 Login page 62
3.18 Dashboard page 62
3.19 Dashboard page 63
3.20 Data page 63
3.21 Price prediction page 64
3.22 Admin page - Manage users 65
3.23 Admin page - Retrain models 65
1 — Overview
In this chapter, I am going to introduce my thesis topic, then I will show the targets and the scope of my thesis for the web application.
1.1 Introduction 8
1.2 Objectives and Scope 8
1.1 Introduction
Crawling data is a common feature that appears in many web applications whose main function is collecting data. However, a web app that collects and monitors data in the real estate field for common real estate agents is rare. Therefore, they need software to help them synthesize all data from other real estate websites and then start making analyses of the collected data.
In the thesis phase, I will implement a web application that will collect information on different real estate websites and display them on my web page. Clients can register accounts, log in to the web app, and view the data. The web app also provides filter functionality to let users filter the data for custom viewing, and it helps users predict the price of a specific real estate property based on elements like area, street, ward, and district. Moreover, the web app provides users with a general view of the real estate data through the charts displayed on the Dashboard page.
The objectives of the thesis are:
 Implement a web application to show the collected data.
 Build a custom crawler to get the data from three main real estate websites: batdongsan.com.vn, homedy.com, and propzy.vn.
 Provide some real estate data statistics based on the collected data.
 Build machine learning models to predict the price of real estate properties based on the collected data.
For the topic scope:
 The web app is limited to a few users only (fewer than ten users) and it works locally.
 The collected real estate posts are limited to Ho Chi Minh City only. The real estate post type is selling.
2 — Background Knowledge
In this chapter, I am going to illustrate the knowledge I researched to implement the web application, including some definitions of the technologies I will use in both the Frontend and Backend parts.
2.1 Frontend 10
2.2 Backend 18
2.3 AJAX Request - Response 21
2.4 Postgres SQL 21
2.5 Scrapy 22
2.6 Linear Regression 36
2.7 Polynomial Features 37
2.8 Evaluation Metrics 38
2.9 Underfitting and Overfitting 40
2.10 Regularization 41
2.11 Feature Engineering 42
2.12 Supported Python Libraries 46
2.1 Frontend
2.1.1 Single-page Application (SPA)
A single-page application is a web application that can interact with the client dynamically with only one single web page at a time. Whenever a user changes the content in the web page, it will be rewritten with the new content instead of reloading a new page.
Each page of the web application usually has a JavaScript layer at the header or bottom to communicate with the web services on the server side. The content of the web page is loaded from the server side in response to events that users make in the current web page, like clicking on a link or a button.
Figure 2.1: Single-page Application
Nowadays, with the development of the technologies used to build web applications, especially in the frontend area, there are three popular frameworks that most developers use: React JS, Angular JS, and Vue JS. I will state the comparison between them, and why I choose to use React JS to implement the frontend of my web application.

Angular JS
Angular JS is a full-fledged MVC framework which provides a set of defined libraries and functionality for building a web application. It is developed and maintained by Google, was first released in 2010, and is based on TypeScript rather than JavaScript itself.
Angular is unique with its built-in two-way data binding feature, unlike React, which uses one-way data binding. Both React JS and Angular JS use Components to build the web application, meaning that when a component (or a model) is rendered in the view it can have different data inside based on the web page the user is seeing. For Angular, when the data in the view changes it also leads to a change of the data in the model, while in React it does not. This makes it more convenient for developers to build a web application easily. However, in my opinion, it would be hard to manage the data when the web application becomes larger, as well as to debug it.
React JS
React JS is actually a JavaScript library, not a framework like the other two; it is developed and maintained by Facebook. It was released in March 2013, and is described as "A JavaScript library for building user interfaces".
Since ReactJS is mostly used to create interactive UIs in the client view, ReactJS is not really difficult to learn at first. However, once I got used to some concepts it provides, it became easier to maintain the content of the web page. Moreover, it comes with big library resources to support developers in creating websites quickly with the help of available components. Developers have freedom in choosing the suitable libraries to use in building a web application, because React JS only works at the client view, unlike Angular JS, which is built as an MVC framework, so to develop the full web application developers have to strictly follow the template it provides.
Vue JS
Vue JS is a frontend framework that is developed to make flexible web applications. It was developed by the ex-Google employee Evan You in 2014. It shares many similarities with the two above: it uses Components to build the web application, and it also has two-way data binding like Angular.
Vue is versatile, and it helps you with multiple tasks; it comes with flexibility in designing the web application architecture. Vue syntax is simple, so people who have just started with a JavaScript background can still learn it, and it also supports TypeScript as well.
Why React?
The reason I choose React JS to build the frontend of the web application is because of its simplicity and its freedom in building the web app. React JS lets developers freely build a web page with the libraries of their choice, unlike Angular; besides, Angular requires developers to get used to TypeScript, and it is also hard for a beginner with frontend frameworks to start with, since Angular is considered the most complex one among the three frontend web frameworks I listed above. For Vue JS, although it is simple and easy to approach for developers at first, it is not as stable as the other two and its support community is small.
To conclude, each framework has its pros and cons depending on how programmers decide to implement their web application. For me, I found that React JS is a good one to start with to build the web application.
2.1.2 JSX
Firstly, I want to introduce JSX before presenting some main concepts that are used in building the view of the web application in React JS.
const element = <h4>I am JSX!</h4>
The line above is neither HTML nor a String; it is called JSX and this syntax is used in React JS to render elements. It is like a combination between JavaScript and XML (Extensible Markup Language). It is more like a tag in HTML, but it can be customized. By using JSX, a developer can manipulate the data inside an element by embedding it inside the HTML tag.
const element = <h4>I am JSX!</h4>
const hello = <h1>Hello, {element}!</h1>
By default, when React DOM renders an element, it will escape any value that is embedded inside JSX to prevent XSS (cross-site scripting) attacks. Each element defined in JSX can then be rendered in the web page by React DOM.
const element = <h4>I am JSX!</h4>
const hello = <h1>Hello, {element}!</h1>
ReactDOM.render(
element,
document.getElementById('root')
);
element is rendered inside a div tag with id="root".
2.1.3 Components and Props
Components in React JS let developers split a web page into independent, reusable pieces, each of which works in isolation from the others. A web page can contain lots of components inside; whenever a user changes the contents of the web page, the current page will be rewritten with the new component, or the data inside the current component will be updated, without reloading the whole page.
The way components work is similar to how we use functions or methods in most programming languages. When we define a function, we declare the parameters for the function; these are like props in a Component. Then, we can call it inside ReactDOM's render to render that component in the web page we want to show.
There are two ways to define a component in React JS: using a JavaScript function and using an ES6 JavaScript class:
const helloComponent = <HelloReact name="tuanminh" />
A function that does not modify its input is what React JS calls a pure function: "All React components must act like pure functions with respect to their props." However, we want our web application to be designed dynamically, with content in the UI that can change over time; therefore React JS comes with a concept to be used in Components, which is state.
2.1.4 States
State is similar to props, but it is private to the Component only and it is changeable. State is introduced in React JS when we define a Component as a JavaScript class. However, since v16.8, React has introduced a new concept called Hooks that lets developers use state inside a function without writing a class. During the time I read the documentation about React JS, I only learned to define Components as classes, because that is the basic way of building and customizing a Component. Hooks are a new concept, and most of the libraries that support React JS have changed to use Hooks to build their Components now. When I start building the full web application in the next phase of the thesis, I will try to use them to build the Components in the web application.
To define state in a React Component, we first create a constructor and then create the state with its initial value:
class HelloReact extends React.Component {
  constructor(props) {
    super(props);
    this.state = {
      name: 'tuanminh',
    };
  }
}
Lifecycle
One important concept related to using state in React is lifecycle methods. In the case of loading content with many components into the web page, it is necessary to free up resources taken by previous components when they are destroyed, or, after new content has loaded, we may want to run a function to operate the functionality of the current web page, like getting data from the server. That is how React comes with componentDidMount() and componentWillUnmount() to control when a component is mounted or unmounted.
Modify State
As state is changeable compared to props, to change the state value we must use setState() provided by React Component. State cannot be changed directly like this:
this.state.name = 'tuanminh';
Instead:
// This will change the state name of Component
this.setState({ name: 'tuanminh' })
2.1.5 Events
Handling events with React elements is similar to handling events on DOM elements: we pass a function as the event handler inside the JSX attribute.
<button onClick={handleClick}>Click me</button>
However, when we declare a function inside a JavaScript class, we need to use bind() to bind the current function to the class when we use the this keyword. This is a JavaScript characteristic: the default binding of this in JavaScript is the global context, which is window, not the current class we declare our function in.
class ReactComponent extends React.Component {
  constructor(props) { super(props); this.handleClick = this.handleClick.bind(this); }
  handleClick() { /* ... */ }
}
In case you want to use a parameter in handleClick(), we can try a JavaScript arrow function or use JavaScript bind() instead.
<button onClick={() => this.handleClick(param)}>Click me</button>
//Or:
<button onClick={this.handleClick.bind(this, param)}>Click me</button>
2.1.6 Code Splitting
This concept introduces the import() syntax in React JS; it allows developers to import a Component from a library or a custom Component to be reused in our own Component.
import { Component } from 'react';
import NavButton from './NavButton';
class Navbar extends Component {
  // simplified render using the imported NavButton
  render() { return <nav><NavButton /></nav>; }
}
At first, we need to import the standard Component from react to implement our custom Component. Then we import the custom button to use it in the navbar.
2.1.7 Fragment
The Component class in React JS allows developers to return one element only, which could be a div tag or a Component that is imported from a library. However, with the help of Fragment, it is possible to return multiple elements.
2.2 Backend
Since React only provides the client view, to communicate with the database and the crawler to get the data for the web app, we need to build the backend to handle this problem. React JS can integrate easily with most backend frameworks nowadays, as long as the backend provides an API for React JS to render the content in the view. Nowadays it is not hard to find yourself a backend framework to create a full-stack web app. Yet we can build the backend part from scratch too: we can start with building the architecture, like the common MVC architecture that is used in much software, not just in web applications, then the database design and how to manipulate the data in the database, and finally the routing and control of the view that will be rendered to the client.
These parts are now handled by most backend web frameworks. Besides providing a way to manage the architecture more efficiently, with a framework we do not need to care much about the web services (localhost) to run the server side or the security of our web application. It also lets developers deploy their web app quickly, since frameworks already provide the reusable pieces of code that most web applications have, for example cookie or session management for users when they log in to the web app. The framework handles that job immediately for you, yet we can customize that part to let it work individually.
One disadvantage of using a framework appears if we want to make a unique function based on a current one it provides. The framework may not have the functionality or the API to let us customize it on our own, so we will need to rewrite that function entirely, and that piece of work will not be simple if we do not look at the way it is implemented. Eventually, there is still a way to change it, by overriding that function with our own function based on the one the framework provides.
The backend framework I choose to implement my web application in is Django. It is recommended as one of the best frameworks on the Internet, since Django lets developers use Python to work with the server side of the web. Python is a high-level programming language, yet friendly for most programmers due to its straightforward syntax and support of various libraries and tools. Besides, Django comes with Django REST framework, which provides RESTful APIs that work well with React JS.
Django follows the Model-View-Controller (MVC) architecture, in which the Model is for managing the database, the View is the client's view, and the Controller acts as the middleman between Model and View. When a client sends a request from the View to the server, the Controller will handle the request and communicate with the Model to retrieve, update, or add new data to the database.
Django provides developers with three separate layers, which are:
 The Model Layer, which plays the role of the Model in the MVC architecture. This layer provides classes and modules for structuring and manipulating the data of the web application.
 The View Layer, which acts as the Controller. This layer lets developers define request-response functions or classes that can be used to access data through the Model layer.
 The Template Layer contains the templates, which store the HTML files that are used to display the view to the client of the web application. However, in my web app React JS already handles this role, so I will not present much about this layer.
2.2.1 Model Layer
A Model in Django is a Python class that maps to a database table; the attributes inside the Model stand for the fields in the database. For example:
from django.db import models
class RealEstateData(models.Model):
url = models.CharField(max_length=100)
post_title = models.CharField(max_length=50)
url and post_title are the two fields of the table, which Django names myapp_realestatedata by default.
To use models, we first need to add the name of the app to INSTALLED_APPS = [...] in settings.py. Then, when we define a model in models.py, we can save that model into the database as a table with two CLI commands, makemigrations and migrate (run as python manage.py makemigrations and python manage.py migrate). The first command is used when we first define a model or change some characteristics of the model, like adding attributes, changing the type of attributes, or deleting attributes. After that, we use migrate to apply those changes to the data inside the database.
Making queries
Models provided by Django give developers a database-abstraction API that lets you directly manipulate the data inside the database without writing any SQL queries. I will state some common methods:
ma- To create an object, we simply call the name of models given available attributes
inside the model like creating a class in Python, then object.save() let the object
saved in database
data = RealEstateData('www.realestate.com', 'mua_ban_nha_dat')
data.save()
 To retrieve all objects, simply call the method objects.all().
 To retrieve only part of the objects, we can use filter(); the inputs of filter are attribute lookups and the values to match, as sketched below.
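For example, a minimal sketch using the two fields of the RealEstateData model defined above (the lookup value is only illustrative):

# Posts whose title contains a given keyword (illustrative value)
posts = RealEstateData.objects.filter(post_title__contains='nha_dat')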
2.2.2 View Layer
The View layer defines which function or class will handle which request. We configure this in the urls.py file in our project:
from djangọurls import path
from myapp import views
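The file then maps URL paths to view functions through a urlpatterns list. A minimal sketch (the path and the view name data_view are only illustrative):

urlpatterns = [
    path('data/', views.data_view),  # requests to /data/ are handled by views.data_view
]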
2.3 AJAX Request - Response
AJAX stands for Asynchronous JavaScript And XML; it is the use of XMLHttpRequest to communicate with the backend. It allows the frontend to send and receive data in different formats like text, JSON, and HTML, and to update the data in the frontend view without having to refresh the current page.
React JS provides the interactive view for the client to view the web page; however, it is just a view, so whenever the user triggers an event in the view that makes a component's data change, the view should communicate with the server side to load the suitable data into the current component. To communicate with the backend, the changeable component in React will send an AJAX request to the backend, then the backend responds with data back to the requesting component and the data displayed in the component changes. This update will not refresh the whole page of the web application, less data will be loaded at a time, and this will improve the user's experience in using the web app.
There are two common ways for React JS to integrate with the backend: using the fetch() API provided by the browser, or using the JavaScript library axios. In my web app, I decided to use axios to handle AJAX requests/responses rather than fetch(), because axios is a famous library for handling AJAX in most web applications developed using JavaScript, and it provides trusted security to protect the communication between the frontend and backend. I do not recommend using jQuery's AJAX in a web application developed with React, since React and jQuery operate differently from each other and they will conflict if you put them in one place.
2.4 Postgres SQL
I choose Postgres SQL to set up the database of my web application. Postgres SQL is a powerful, open-source object-relational database system that uses and extends the SQL language, combined with many features that safely store and scale the most complicated data workloads. Postgres is famous for its proven architecture, reliability, data integrity, robust feature set, extensibility, and the dedication of the open-source community that stands behind the software to consistently deliver performant and innovative solutions.
Postgres SQL comes with many features aimed at helping developers build applications and administrators protect data integrity and build fault-tolerant environments. Moreover, it helps developers manage the data no matter how big or small the dataset is. Postgres SQL is a free and open-source database, yet it is highly extensible. It is available on most operating systems, and it is used in many web applications as well as mobile and analytics applications.
2.5 Scrapy
The main function of the web application is to collect data from different real estate websites to display in my web app at once, so I need to make a crawler for crawling that information. Currently, I am researching Scrapy, a Python application framework for crawling websites.
Scrapy is an application framework for crawling websites and extracting structured data which can be used for a wide range of useful applications, like data mining, information processing, or historical archival. Although Scrapy was originally designed for web scraping, it can also be used to extract data using APIs or as a general-purpose web crawler. The main component of Scrapy is the spider.
import scrapy

class QuotesSpider(scrapy.Spider):  # class header restored; the spider name is illustrative
    name = 'quotes'
    start_urls = ['http://quotes.toscrape.com/tag/humor/']

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'author': quote.xpath('span/small/text()').get(),
                'text': quote.css('span.text::text').get(),
            }
        next_page = response.css('li.next a::attr("href")').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)
Above is the sample code of using a spider for crawling the author and text on the sample website 'http://quotes.toscrape.com/tag/humor/'. To run the spider we use this command:
scrapy runspider quotes_spider.py -o quotes.jl
After the command finishes executing, it will create a file named quotes.jl, which contains a list of quotes in JSON Lines format.
quotes.jl
{"author": "Jane Austen", "text": "\u201cThe person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.\u201d"}
{"author": "Steve Martin", "text": "\u201cA day without sunshine is like, you know, night.\u201d"}
When we run the command scrapy runspider quotes_spider.py, Scrapy looks for a Spider definition inside it and runs it through its crawler engine.
The crawl starts by making requests to the URLs defined in the start_urls attribute (in this case, only the URL for quotes in the humor category) and calls the default callback method parse, passing the response object as an argument. In the parse callback, we loop through the quote elements using a CSS Selector, yield a Python dict with the extracted quote text and author, look for a link to the next page, and schedule another request using the same parse method as the callback. Furthermore, I will go through some basic concepts and tools provided by Scrapy to perform crawling of real estate data.
2.5.1 Basic concept
Spiders
Spiders are classes defined to instruct Scrapy to perform a crawl on a certain website or list of websites and process the crawled data into structured items. They are the places where you can customize how your data is crawled and control the number of websites involved.
To start writing your custom spider, you need to inherit from the default Scrapy spider, the class scrapy.spiders.Spider. This is the simplest spider that comes bundled with Scrapy; it provides some necessary methods and attributes for crawling:
 name - A string attribute used to define the name of the spider. It should be a unique name.
 allowed_domains - An attribute containing a list of valid domains defined by the user to control the requested links. If a requested link has a domain that does not belong to the list, it will not be allowed.
 start_urls - An attribute that is used to store a list of URLs where the Spider will begin to crawl.
 start_requests - This method returns an iterable with the first Requests to crawl. It is called by Scrapy when the spider starts crawling. If the user uses start_urls, each URL in this list will be requested by the default start_requests method automatically.
 parse(response) - This is the default callback method after Scrapy finishes downloading a response. Inside the parse method you can extract data into smaller items and process them.
There are other attributes and methods, but since they are not necessary I will skip them; you can refer to Scrapy's documentation. Besides the default Scrapy spider, there are other generic spiders defined depending on the needs of the scraping site. By using generic spiders, users can easily customize the spider for their usage purpose.
 CrawlSpider - This is the most commonly used spider for crawling websites. This spider comes with specific rules to control the behavior of crawling websites (a minimal sketch follows this list).
XMLFeedSpider - This spider is designed for parsing XML feeds
 CSVFeedSpider - This spider is similar to the XML one, except that it handles CSV files.
 SitemapSpider - This spider allows users to crawl a site by discovering the URLs using Sitemaps.
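A minimal CrawlSpider sketch is shown below. The domain is one of the sites targeted by this thesis, but the start URL, link pattern, spider name, and callback body are only illustrative assumptions:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class RealEstateSpider(CrawlSpider):
    name = 'realestate'
    allowed_domains = ['batdongsan.com.vn']
    start_urls = ['https://batdongsan.com.vn/']  # illustrative entry page
    rules = (
        # Follow post links matching the (illustrative) pattern and parse each one
        Rule(LinkExtractor(allow=r'/ban-'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        yield {'url': response.url, 'post_title': response.css('h1::text').get()}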
Selectors
After Scrapy finishes downloading a response, it will call a callback function, whose default is parse(response), or your custom one. The response will be an HTML source; this time you will need to extract your data from this source, and to achieve it Scrapy provides a mechanism to handle this job: selectors. Selectors select parts of the HTML source by using expressions like CSS or XPath. CSS is the way to select components in HTML files, while XPath is used in XML documents. To query the response using XPath or CSS, we use response.xpath(<selector>) and response.css(<selector>); inside the method is your component selector, just passed as a string. For example:
<div class="images">
< href="image1.png" alt="this is image">My image</a
Trang 36Thesis 25
And XPath will be:
response.xpath('//div[@class="images"]/a')
To get used to these kinds of selectors, you can refer to CSS Selectors and XPath usage guides on the Internet or other trusted resources for more details. Scrapy only uses them as a way to manipulate data after we already have an HTML source file.
To extract textual data, we use the get() and getall() methods; get() returns a single result only, while getall() returns a list of results.
response.xpath('//div[@class="images"]/a').get()
response.xpath('//div[@class="images"]/a').getall()
In case you want to select an element attribute, for example the href attribute inside the <a> tag, Scrapy provides the attrib property of Selector to look up the attributes of an HTML element.
response.xpath('//div[@class="images"]/a').attrib['href']
Selectors are often used inside the callback function, parse(response) (if you define a custom parse function, you can pass your function in the callback attribute of your Scrapy request; I will discuss this kind of request in the Request & Response section).
Items
The term item in Scrapy represents a piece of structured information. When I scrape a real estate page, I try to get as much data as possible, like post type, address, email, phone, etc. However, this data is unstructured and may be hard to retrieve for later use, and Items can help me access this information conveniently. Scrapy supports four main types of Items via itemadapter: dictionaries, Item objects, dataclass objects, and attrs objects.
 Dictionaries: this is simply a Python dictionary.
 Item objects: Scrapy provides a class for Item with a dict-like API (the user can access the Item like a dictionary). Besides, this kind of Item provides the Field class to define field names, so that an error (like KeyError) can be raised when there is an exception that you want to handle or when you want to stop the crawling on the current page. This is the type of item I choose to use in Scrapy, since it is the easiest way to handle Items for accessing and storing. By defining field names in the Item class, we can visually control which fields we want to scrape.
 Dataclass objects: dataclass() allows defining item classes with field names, so that item exporters can export all fields by default even if the first scraped object does not have values for all of them. Besides, dataclass() also lets each field define its own type, like str, int, or float.
 attrs objects: attr.s() can also be used to define item classes with field names, similar to dataclasses (this requires the attrs package).
Working with Item Objects
To start using Item objects, I need to modify the items.py file created with the Scrapy project and create a class for the Item object:
import scrapy

class CrawlerItem(scrapy.Item):
    url = scrapy.Field()
    content = scrapy.Field()
    post_type = scrapy.Field()

# In the spider, the item can then be created and accessed like a dict:
item = CrawlerItem()
item['post_type'] = something
Besides the default items provided by Scrapy, there is another related type, ItemAdapter. It is mainly used in Item Pipelines or Spider Middlewares; ItemAdapter is a wrapper class to interact with data container objects, providing a common interface to extract and set data without having to take the object's type into account.
Scrapy Shell
Scrapy shell is an interactive shell for users to debug Scrapy code without running a spider. It is used to test XPath or CSS expressions to see if they work successfully; also, the Scrapy shell runs against the website I want to scrape, so I can know whether that page is scraped without errors or intercepted. The Scrapy shell is similar to the Python shell if you have already worked with it. The Scrapy shell also works well with IPython, which is a powerful interactive Python interpreter with syntax highlighting and many other modern features. If users have IPython installed, the Scrapy shell will use it as the default shell.
Launch the Shell
To open the shell, we use the shell command like this:
scrapy shell <url>
Work with Scrapy Shell
The Scrapy shell provides some additional functions and objects to be used directly in the shell:
 shelp: print help with the list of available objects and functions.
 fetch(url[, redirect=True]): fetch a new response with the given URL and update all related objects.
 fetch(request): fetch a new response from the given request and update all related objects.
 view(response): open the given response in the browser. If the response is an HTML file, it is re-rendered from text into an HTML file.
 crawler: the current Crawler object.
 spider: the Spider used to handle the current URL.
 request: a Request object of the last fetched URL.
 response: the Response returned from the last fetched URL.
 settings: the current Scrapy settings.
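As a small illustration, a session in the shell might look like the following (the page and selector come from the quotes example above; outputs are omitted):

# launched with: scrapy shell 'http://quotes.toscrape.com/tag/humor/'
fetch('http://quotes.toscrape.com/tag/humor/')    # (re)download the page and update response
response.css('div.quote span.text::text').get()   # test a CSS expression on the response
view(response)                                     # open the downloaded page in the browser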
Item Pipeline
After an item is scraped in a spider, it will then be sent to the Item Pipeline. This is where the item is manipulated, for example changing the format of the item or deciding whether to keep or drop the item. To write an Item Pipeline, we need to create a class inside pipelines.py, in the same directory as the items.py file. The naming convention of the class is <Name of class>Pipeline; we can follow an available class already implemented inside the file.
Work with Item Pipeline
The Item Pipeline provides some methods to process items; it is recommended to use ItemAdapter to work with items. To use ItemAdapter, you need to import it into the file:
from itemadapter import ItemAdapter
I will list some useful methods of the Item Pipeline below:
 process_item(self, item, spider): This method is called for every item by each pipeline component; it must return an item object in the end. Inside this method, you can access the item by creating an ItemAdapter object:
adapter = ItemAdapter(item)
# Then you can access through item like normally:
adapter['post_type'] = something
To drop an item, I use the DropItem exception provided by scrapy.exceptions; you just need to import it like ItemAdapter:
from scrapy.exceptions import DropItem
# To drop item you raise DropItem:
raise DropItem("drop item because post type is missing", adapter['post_type'])
 open_spider(self, spider): This method is called when the spider is opened.
 close_spider(self, spider): This method is called when the spider is closed. A minimal pipeline combining these methods is sketched after this list.
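Putting the pieces above together, a minimal pipeline sketch might look like this (it drops items missing the post_type field defined in CrawlerItem; the class name is only illustrative):

from itemadapter import ItemAdapter
from scrapy.exceptions import DropItem

class CrawlerPipeline:
    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        if not adapter.get('post_type'):
            # discard incomplete posts instead of storing them
            raise DropItem("drop item because post type is missing")
        return item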
Activate Item Pipeline in Scrapy Setting
To make the Item Pipeline able to process items, you need to activate it in the Scrapy settings file, settings.py, through the ITEM_PIPELINES setting.
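A minimal sketch of that setting (the module path matches the illustrative CrawlerPipeline above):

ITEM_PIPELINES = {
    'crawler.pipelines.CrawlerPipeline': 300,  # lower numbers run earlier (range 0-1000)
}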
The integer value assigned to each pipeline determines the order in which they run: items go through pipelines from lower to higher values.
Request & Response
A Scrapy Request is used to send a request to a given URL in the spider. If you define your crawled URLs in the start_urls list, Scrapy will issue the default request for every URL in the list. However, in some cases you want to rewrite the request with additional arguments or a custom callback function, not the default self.parse, so you need to call the request yourself to get the page source. The Response is the object returned when executing a request; both Request and Response can be used by instantiating their objects. The Response is commonly used in Middlewares, for example, to return a custom Response when a request is executed successfully.
Passing additional data to callback function
Users may want to pass data to their callback function, which is called when the request is executed successfully. You can achieve this by using the cb_kwargs attribute; the passed value should be a dictionary.
def parse(self, response):
    yield scrapy.Request(url='something', callback=self.parse_response,
                         cb_kwargs=dict(url='something'))

def parse_response(self, response, url):
    item['url'] = url  # this url is passed through cb_kwargs
2.5.2 Scrapy’s Architecture
Below is the picture illustrating the Data Flow in Scrapy:
Figure 2.2: Scrapy’s architecture
These flows are controlled by the execution engine and go in the order of the numbers defined in the picture:
1. The Spider sends the request to the Engine.
2. Then the Engine transfers it to the Scheduler for scheduling the request and continues listening for the next request to crawl.