Named Entity Recognition for Vietnamese Real
Estate Advertisements
Son Huynh, Khiem Le, Nhi Dang, Bao Le, Dang Huynh, Binh T. Nguyen
University of Science, Ho Chi Minh City, Vietnam; Vietnam National University, Ho Chi Minh City, Vietnam
Trung T. Nguyen, Nhi Y. T. Ho
Hung Thinh Corp.
Ho Chi Minh City, Vietnam
Abstract—With the booming development of the Internet and e-Commerce, advertising has appeared in almost all areas of life, especially in the real estate domain. Understanding these advertising posts is necessary to capture the status of real estate transactions and rent and sale prices in different areas with various properties. Motivated by that, we present the first manually annotated Vietnamese dataset in the real estate domain. Remarkably, our dataset is annotated for the named entity recognition task with 16 entity types. In comparison to other Vietnamese NER datasets, our dataset contains the largest number of entities. We empirically investigate a strong baseline on our dataset using the API supported by the spaCy library, which comprises four main components: tokenization, embedding, encoding, and parsing. For the encoding, we conduct experiments with various encoders, including Convolutions with Maxout activation (MaxoutWindowEncoder), Convolutions with Mish activation (MishWindowEncoder), and bidirectional Long Short-Term Memory (BiLSTMEncoder). The experimental results show that the MishWindowEncoder gives the best performance in terms of micro F1-score (90.72%). Finally, we aim to publish our dataset later to contribute to the current research community related to named entity recognition.
Keywords—Named Entity Recognition, Embedding, Convolution, Skip Connection, LSTM
I. INTRODUCTION
Named Entity Recognition (NER), also called Entity Identification or Entity Extraction, has become an essential and fundamental task in Natural Language Processing (NLP), which involves identifying named entities in a text and classifying them into predefined categories. A named entity is a real-life object with proper identification that can be denoted with an appropriate name. Named entities can be a place, person, organization, time, object, or geographic entity. NER has been investigated for many years [1]; however, the majority of existing research relies on a reasonably large annotated dataset, which is mainly available in popular languages such as English and French. This is a bottleneck for low-resource languages like Vietnamese, so it is worth creating a novel manually annotated NER dataset for Vietnamese to accelerate Vietnamese NER research.
Nowadays, real estate news and advertisement sources are massive and posted daily on many different real estate websites. Extracting key entities from these data sources leads to understanding the status of real estate transactions and the customer's demand. Detecting entities is also helpful for building downstream information extraction, text summarization, or chatbot systems in the real estate domain. Moreover, this helps store data more efficiently, facilitate data analysis, or build dashboards for data visualization.

Corresponding author: Binh T. Nguyen (VNU-HCM University of Science, Ho Chi Minh City, Vietnam) (Email: ngtbinh@hcmus.edu.vn).
To summarize, our contributions are as follows:
1) We introduce and provide the community with the first manually annotated Vietnamese dataset in the real estate domain for the NER task. Our dataset is annotated with 16 different named entity types, more than the 3 of VLSP-2018 and the 10 of PhoNER-COVID-19. Also, our dataset has the largest number of entities, consisting of over 53,000 entities.
2) We conduct experiments using strong baselines with the support of the spaCy library and empirically investigate three different encoders: MaxoutWindowEncoder, MishWindowEncoder, and BiLSTMEncoder. The experimental results show that MishWindowEncoder has the best performance in Recall, Precision, and F1-score.
II. RELATED WORK
Compared to other languages, data resources for Vietnamese NLP tasks are limited, specifically for the NER task. To the best of our knowledge, there are only two public datasets for the Vietnamese NER task. The first one is the VLSP-2018 NER dataset [2], an extension of the VLSP-2016 NER dataset with more data. This dataset recognizes generic entities of person names, organizations, and locations in daily news articles. The second one is the recently released PhoNER-COVID-19 [3] in the COVID-19 domain, which helps facilitate many types of research and downstream applications, such as building question-answering systems for pandemic prevention tasks. In this work, we develop and release the first Vietnamese NER dataset in the real estate domain.

Existing research on Vietnamese NER approaches is also rather limited; some techniques have been proposed using various learning models, such as classifier voting [4] and CRF [5]. Vu et al. [6] propose a method that normalizes a tweet before taking it as the input of a learning model for NER in Vietnamese tweets. Thang et al. [7] combined BiLSTM and CRF; moreover, they enhanced word embeddings with information from characters to achieve competitive results on the VLSP-2018 NER dataset. Quang et al. [8] proposed an online learning algorithm, i.e., MIRA, in combination with CRF and bootstrapping. Recently, Thinh et al. [3] investigated BiLSTM-CNN-CRF and the pre-trained language models XLM-R and PhoBERT, as well as the effect of automatic Vietnamese word segmentation on the Vietnamese NER task. The most relevant work to ours is by Lien et al. [9], who built an information extraction system for Vietnamese online real estate advertisements, but they use a rule-based approach.
III. OUR DATASET
This section presents how we crawl data from many sources on the Internet, then preprocess and annotate the raw data. One can observe our system clearly in Figure 1.
Fig. 1. Our Dataset Building Process.
A. Data Collection
For this study, we crawled real estate advertisement posts
dated between August 2020 and September 2020 from three
different real estate websites in Vietnam, including:
• propzy.vn
• nhadat247.com.vn
• batdongsan.com.vn
B. Data Processing
Before training models to identify named entities, we preprocess the real-estate posts to clean noisy data and follow a standard format. In detail, we first remove posts that do not contain clear and critical information about real estate, then normalize Unicode text and the position of Vietnamese tone marks in words. Then, we preprocess each string with the following steps:
1) Remove meaningless characters in a string, like \n, \r (ASCII codes), emotion icons, etc.
2) Split multiple joined words.
3) Fix a spaCy format issue by separating dots or commas connected to a word; however, we need to keep dots or commas connected to a number (like 12.5 or 1,000,000).
4) Replace multiple spaces with a single space in the string.
Table I shows some examples for each case above.
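As a concrete illustration, the four steps can be sketched with regular expressions as below. The exact rules of our pipeline are more involved, so the patterns here (especially the word-splitting in step 2) should be read as assumptions, not the precise implementation:

```python
import re

def preprocess_post(text: str) -> str:
    # 1) Remove control characters, non-breaking spaces, and emoticons.
    text = re.sub(r"[\n\r\t\xa0]", " ", text)
    text = re.sub(r"[\u2600-\u27bf\U0001F300-\U0001FAFF]", "", text)
    # 2) Split joined tokens at digit-to-letter boundaries,
    #    e.g. "12,89tỷ" -> "12,89 tỷ" (other join patterns need extra rules).
    text = re.sub(r"(?<=\d)(?=[^\W\d_])", " ", text)
    # 3) Detach dots and commas glued to words, while keeping them inside
    #    numbers so "12.5" and "1,000,000" survive intact.
    text = re.sub(r"(?<=[^\W\d_])([.,])", r" \1", text)
    text = re.sub(r"([.,])(?=[^\W\d_])", r"\1 ", text)
    # 4) Collapse runs of whitespace into a single space.
    return re.sub(r"\s+", " ", text).strip()
```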
C. Data Annotation
We use Doccano [10] as the tool for labeling this dataset and meticulously define 16 entities containing the critical information of a real estate sale advertisement. A comprehensive description of each entity type is briefly shown in Table II.
TABLE I
SEVERAL EXAMPLES FOR THE DATA PROCESSING PHASE

Case | Raw Text | Preprocessed Text
1 | -DT: \xa0195.35m2 \n -DTSD: 152.50m2 | DT 195.35m2 DTSD 152.50m2
2 | Đất Quận 6Đã lên thổ cư giá 12,89tỷ và 1tỷ700tr | Đất Quận 6 Đã lên thổ cư giá 12,89 tỷ và 1 tỷ 700 tr
3 | thuộc tầng 6 | thuộc tầng 6
D. Data Partitions
After annotating the dataset, we have 3152 real estate advertisements as a golden dataset. We then split this dataset into training, validation, and testing sets with a ratio of 60%, 20%, and 20%, respectively; a minimal split sketch is shown below. Statistics of our dataset are presented in Table III.
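Such a 60/20/20 split can be produced with scikit-learn, which we use for splitting (Section V); the function below is a sketch, and the random seed and the `docs` argument (the list of annotated advertisements) are assumptions:

```python
from sklearn.model_selection import train_test_split

def split_dataset(docs, seed=42):
    """Split the annotated advertisements 60/20/20 into train/val/test."""
    train, rest = train_test_split(docs, test_size=0.4, random_state=seed)
    val, test = train_test_split(rest, test_size=0.5, random_state=seed)
    return train, val, test
```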
TABLE III
STATISTICS OF OUR DATASET

# | Entity Type | Num. of Entities
1 | district_name | 3123
2 | place_name | 7471
3 | transaction_type | 3490
4 | property_certificate | 2515
5 | property_type | 10397
7 | number_street_name | 5415
10 | province_city | 1065
11 | host_name | 2079
12 | ward_name | 1254
14 | direction | 737
15 | front_road | 961

# Entities in total: 53515
# Sentences in total: 3152
IV. METHODOLOGY
In this paper, we aim to investigate a named entity recognition system for Vietnamese real estate documents. This system can help users parse the possible real estate information fields of an advertisement automatically. It is worth noting that such a system is crucial and has become an indispensable tool in the real estate market. In what follows, we present the problem formulation of our paper and how we extract features from real estate documents and train our proposed model for this problem.
A. Problem Formulation
The problem of our paper is as follows: given a Vietnamese real estate advertisement, our model must detect, for each word, the corresponding entity among the 16 entity types defined in Section III-C.
B. Feature Extraction
This section describes the layers we use as feature extractors for real estate documents. First, we push the input data, which are annotated documents about real estate, into a tokenizer that splits sentences into lists of words, including punctuation. A fixed number of UTF-8 byte characters is utilized for each word. We add a <padding> token at the end of each list so that all lists have equal length. We then put these numerical lists into an embedding layer named CharacterEmbed in spaCy [11] to vectorize them into N × M matrices (where N is the number of words in each sentence and M is the number of dimensions representing a word) that represent the meaning of each sentence.

TABLE II
THE NAMED ENTITY DEFINITIONS FOR A GIVEN REAL ESTATE MENTIONED IN ONE ADVERTISEMENT POST

Label | Definition
district_name | The district name where the real estate is located.
place_name | The name of one specific location, such as a building, a shopping mall, or an airport.
transaction_type | The transaction type of the real estate advertisement post, including sell, buy, or rent.
property_certificate | The property certificate information of the real estate.
property_type | The property type of the real estate, such as home, apartment, or land.
phone | The phone number of a real estate contact.
number_street_name | The street name, or the house number with the street name.
area | The area nearby the real estate.
distance | A distance, such as 10m, 20m, 300m, etc.
province_city | The name of the province or city where the real estate is located.
host_name | The host name of the real estate.
ward_name | The ward name where the real estate is located.
price | The price related to the real estate mentioned in the advertisement.
direction | The house direction information of the real estate, e.g., East or West.
front_road | The front-road information of the real estate.
email | The contact email.
Next, we use one of the following four architectures to perform feature extraction on the real estate advertisements:
1) MaxoutWindowEncoder: MaxoutWindowEncoder is an architecture that takes an embedding vector as input. This feature is pushed into a Convolution 1D layer with a window size of 2 × 2 and 4 filters. A skip connection [12] adds the embedding vector to the features extracted by the Convolution 1D layer. After that, this information goes through a Maxout activation function. Finally, the feature is normalized by BatchNormalization. BatchNormalization has the effect of avoiding overfitting and making the model more straightforward to converge, while the residual connection helps the model retain information from before feature extraction through the convolution layer. One can see more detail in Figure 2.
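For reference, the Maxout activation takes the element-wise maximum over k learned linear transformations of its input (k corresponds to spaCy's maxout_pieces hyperparameter):

$$\operatorname{Maxout}(x) = \max_{i \in \{1, \dots, k\}} \left( W_i x + b_i \right).$$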
Fig. 2. The architecture of MaxoutWindowEncoder and MishWindowEncoder.
2) MishWindowEncoder: This encoder has an architecture similar to MaxoutWindowEncoder. However, the difference is that this encoder utilizes Mish [13] as the activation function instead of Maxout, as one can also see in Figure 2. According to Diganta Misra, in 75 experimental tasks with various models (DenseNet, Inception v3, Xception Net), Mish outperforms ReLU in 55/75 tasks and Swish in 53/75 tasks.
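For reference, Mish is a smooth, non-monotonic activation function defined as

$$\operatorname{Mish}(x) = x \cdot \tanh\left(\operatorname{softplus}(x)\right) = x \cdot \tanh\left(\ln(1 + e^{x})\right).$$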
3) LSTMEncoder: LSTM (Long Short-Term Memory) [14] was first introduced by Hochreiter and Schmidhuber in 1997. This architecture is a particular structure of the RNN (Recurrent Neural Network) proposed by David Rumelhart [15]. According to the authors, LSTM is designed to resolve the long-term dependency problem, where an RNN cannot store information over long sequences, and to avoid the vanishing and exploding gradient problems faced by RNNs. In our experiment, the embedding vector is passed through an LSTM network whose number of hidden states equals N, the number of words in the input sentence, to obtain the extracted features.
4) BiLSTMEncoder: BiLSTM (Bidirectional Long Short-Term Memory) [16] is based on both LSTM [14] and BiRNN [17]. This architecture is similar to the LSTMEncoder in Section IV-B3. However, instead of using one LSTM network, this approach includes two LSTMs stacked on top of each other: one processes the sequence forwards, whereas the other processes it backwards. BiLSTMs effectively increase the amount of information available to the network, improving the context available to the algorithm; a minimal code sketch follows Figure 3.
Fig. 3. The architecture of BiLSTMEncoder; removing one LSTM network yields the LSTMEncoder.
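To illustrate the idea (independently of spaCy's implementation), here is a minimal PyTorch sketch of a bidirectional LSTM over a sentence of embedded words; the dimensions are assumptions for illustration only:

```python
import torch
import torch.nn as nn

# One LSTM reads the embedded sentence forwards, the other backwards,
# and their per-token hidden states are concatenated.
embed_dim, hidden_dim = 64, 32
bilstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)

x = torch.randn(1, 12, embed_dim)  # a batch of one sentence with 12 word vectors
out, _ = bilstm(x)                 # out.shape == (1, 12, 2 * hidden_dim)
```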
C. Modeling

The spaCy API provides a powerful model for the named entity recognition task called TransitionBasedParser. As per the authors of spaCy, transition-based parsing is an approach to structured prediction where the task of predicting the structure is mapped to a series of state transitions¹. The authors claim that transition-based parsing is currently superior to and quicker than Stanford's CoreNLP [18]. One can see more detail by visiting spaCy's blog².
In this experiment, after using one of the four encoders mentioned in Section IV-B to extract features from the real estate documents, we push this informative feature into the TransitionBasedParser model to recognize the entity of each word in the text. One can observe our end-to-end pipeline in detail in Figure 4; a configuration sketch follows below.
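To make the pipeline concrete, the following is a hedged sketch of how such a model can be assembled with spaCy v3. The architecture names (CharacterEmbed, Tok2Vec, MishWindowEncoder, TransitionBasedParser) are spaCy's registered functions; the hyperparameter values are illustrative assumptions, not our exact training settings:

```python
import spacy

config = {
    "model": {
        "@architectures": "spacy.TransitionBasedParser.v2",
        "state_type": "ner",
        "extra_state_tokens": False,
        "hidden_width": 64,
        "maxout_pieces": 2,
        "use_upper": True,
        "tok2vec": {
            "@architectures": "spacy.Tok2Vec.v2",
            "embed": {
                # Character-based word embedding (nC bytes per word).
                "@architectures": "spacy.CharacterEmbed.v2",
                "width": 64, "rows": 2000, "nM": 64, "nC": 8,
                "include_static_vectors": False,
            },
            "encode": {
                # Swap in MaxoutWindowEncoder.v2 or a BiLSTM encoder here.
                "@architectures": "spacy.MishWindowEncoder.v2",
                "width": 64, "window_size": 1, "depth": 4,
            },
        },
    },
}

nlp = spacy.blank("vi")  # the Vietnamese class uses pyvi for tokenization
ner = nlp.add_pipe("ner", config=config)
```

In practice, these settings would typically live in a spaCy training config file and the model would be trained with the `spacy train` CLI.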
V. EXPERIMENT

We run all experiments on a computer with an Intel(R) Core(TM) i7 with 2 CPUs running at 2.4GHz, 8GB of RAM, and an Nvidia GeForce RTX 2080Ti GPU with 11GB of VRAM. In the data processing step of this study, we use different Python packages, including NLTK³ and Regex⁴, as tools to clean the data. Additionally, the Scikit-learn⁵ package is applied as a tool to split our dataset. Finally, we use spaCy [11] as the toolkit for the named entity recognition problem.
A. Experiment Settings

We formulate the real estate information NER task for Vietnamese with the BIO labeling scheme (short for beginning, inside, outside), which was presented by Ramshaw and Marcus in 1995 [19]. In our experiment, we used the four different encoders with two widths, W = 64 and W = 300, which spaCy defines as the input width of a sentence; one can find more information in their documentation⁶. From that, we have eight combinations for measuring the performance of NER for the real estate sale advertisement problem. Table IV displays the settings of our pipeline.
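As an illustration of the BIO scheme (a hypothetical fragment, not drawn from the dataset), a phrase like "bán nhà Quận 7" ("sell house District 7") would be tagged:

bán  → B-transaction_type
nhà  → B-property_type
Quận → B-district_name
7    → I-district_name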
TABLE IV
THE HYPER-PARAMETERS OF OUR MODELING PIPELINE

Hyper-parameters | Values
Learning rate | 0.001
Optimizer | Adam with beta1 = 0.9, beta2 = 0.99
1 https://spacy.io/api/architectures
2 https://explosion.ai/blog/parsing-english-in-python
3 https://www.nltk.org/
4 https://regexr.com/
5 https://scikit-learn.org/stable/
6 https://spacy.io/api/architectures
B. Performance Metrics

In this experiment, we choose Precision, Recall, and F1-score as the critical metrics for measuring the performance of our proposed models on each entity:

$$P = \frac{TP}{TP + FP}, \quad R = \frac{TP}{TP + FN}, \quad F1 = 2 \times \frac{P \times R}{P + R}, \qquad (1)$$

where P stands for Precision, R is the Recall, and F1 is the F1-score; TP denotes true positives, TN indicates true negatives, and FP and FN are false positives and false negatives. After that, we average each metric over all entities to calculate the general performance of our proposed approach.
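For completeness, Eq. (1) transcribes directly into code; the zero-division guards below are an addition for robustness, not part of the original formula:

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Entity-level Precision, Recall, and F1 from Eq. (1)."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```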
C. Results

We compare the performance of four different backbones, including MaxoutWindowEncoder, MishWindowEncoder, LSTM, and BiLSTM, each combined with a width (defining the input and output width), for which spaCy recommends width = 64 or width = 300. One can find our experimental results in more detail in Table V.

In general, the four feature extractors that we apply in our experiment achieve good initial results in terms of Precision, Recall, and F1-score; even the lowest results in each measure are 0.8486, 0.8331, and 0.8450, respectively.
Next, we compare the four feature extractors using width = 64 on our dataset. The experimental results show that the WindowEncoders outperform LSTM and BiLSTM in all three measures. In other words, the Precision and Recall of the WindowEncoders are higher than those of the LSTM variants, especially in terms of F1-score, which is a critical metric in machine learning: the results of MaxoutWindowEncoder and MishWindowEncoder are 0.8775 and 0.8673, respectively, whereas the F1-scores of LSTM and BiLSTM are 0.8556 and 0.8450, correspondingly. Interestingly, when using the four feature extractors with width = 300, the WindowEncoder methods once again surpass LSTM and BiLSTM in Precision, Recall, and F1-score. It is thus worth noting that the two WindowEncoder types always perform better than LSTM and BiLSTM. One possible reason is that the skip connection in the WindowEncoders helps the model stabilize gradient updates, keeping information from being lost by connecting previous layers to following layers and skipping some intermediate layers. Furthermore, the normalization layer allows faster training and stabilizes deep neural networks by stabilizing the distribution of layer inputs during training; as a result, the model is easier to converge. Additionally, using Mish [13] as the activation function instead of Maxout can help the model increase performance. This is because the Mish activation function is bounded below, which yields regularization effects and reduces overfitting. Moreover, the best of our eight experiments is the model that uses MishWindowEncoder with width = 300 as the feature extractor; this approach achieves a Precision, Recall, and F1-score of 0.8914, 0.9237, and 0.9072, respectively.

Fig. 4. Our proposed data pipeline for the named entity recognition problem.
Finally, one can see detailed experimental results for each entity in terms of F1-score, Precision, and Recall in Tables VI, VII, and VIII. The performance on almost every entity is pretty stable. However, the entity ward_name is a challenge for our model: our best model (MishWindowEncoder with width = 300) obtains an F1-score of 0.7741 and a Recall of 0.6738 on it. To put it differently, the ratio of correctly predicted ward_name entities to the total number of ward_name entities is just 0.6738. We aim to solve this issue in the future.
TABLE V
THE AVERAGE RESULTS OF DIFFERENT METHODS

Methods | Precision | Recall | F1-score
MaxoutWindowEncoder W64 | 0.8623 | 0.8933 | 0.8775
MishWindowEncoder W64 | 0.8677 | 0.8669 | 0.8673
MaxoutWindowEncoder W300 | 0.8739 | 0.8871 | 0.8805
MishWindowEncoder W300 | 0.8914 | 0.9237 | 0.9072
BiLSTM W300 | 0.8524 | 0.8549 | 0.8535
VI. CONCLUSION

In this paper, we contribute a new dataset of 3152 advertisements for the real estate information named entity recognition task in Vietnamese, covering 16 entity types, and evaluate eight methods to measure the initial performance in terms of Recall, Precision, and F1-score. We find that using MishWindowEncoder yields experimental results that outperform all other techniques in all metrics. In the future, we aim to extend our results to different datasets and apply new approaches to improve the proposed algorithms' performance.
ACKNOWLEDGMENTS

We want to thank the University of Science, Vietnam National University in Ho Chi Minh City, Hung Thinh Corp., and AISIA Research Lab in Vietnam for supporting us throughout this paper. This research is funded by Hung Thinh Corp. under grant number HTHT2021-18-01.
REFERENCES

[1] E. F. Tjong Kim Sang and F. De Meulder, "Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition," in Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, 2003, pp. 142–147. [Online]. Available: https://aclanthology.org/W03-0419
[2] H. Nguyen, Q. Ngo, L. Vu, V. Tran, and H. Nguyen, "VLSP shared task: Named entity recognition," Journal of Computer Science and Cybernetics, vol. 34, pp. 283–294, 01 2019.
[3] T. H. Truong, M. H. Dao, and D. Q. Nguyen, "COVID-19 named entity recognition for Vietnamese," in Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2021.
[4] P. T. X. Thao, T. Q. Tri, D. Dien, and N. Collier, "Named entity recognition in Vietnamese using classifier voting," ACM Transactions on Asian Language Information Processing, vol. 6, no. 4, Dec. 2008. [Online]. Available: https://doi.org/10.1145/1316457.1316460
[5] H.-Q. Le, M.-V. Tran, N.-N. Bui, N.-C. Phan, and Q.-T. Ha, "An integrated approach using conditional random fields for named entity recognition and person property extraction in Vietnamese text," in 2011 International Conference on Asian Language Processing, 2011, pp. 115–118.
[6] V. Nguyen Hong, H. Nguyen, and V. Snasel, "Text normalization for named entity recognition in Vietnamese tweets," Computational Social Networks, vol. 3, 12 2016.
[7] L. Viet-Thang and L. K. Pham, "ZA-NER: Vietnamese named entity recognition at VLSP 2018 evaluation campaign," Proceedings of Vietnamese Speech and Language Processing (VLSP), 2018.
[8] Q. H. Pham, M.-L. Nguyen, B. T. Nguyen, and N. V. Cuong, "Semi-supervised learning for Vietnamese named entity recognition using online conditional random fields," in Proceedings of the Fifth Named Entity Workshop. Beijing, China: Association for Computational Linguistics, Jul. 2015, pp. 50–55. [Online]. Available: https://aclanthology.org/W15-3907
[9] L. V. Pham and S. B. Pham, "Information extraction for Vietnamese real estate advertisements," in 2012 Fourth International Conference on Knowledge and Systems Engineering, 2012, pp. 181–186.
[10] H. Nakayama, T. Kubo, J. Kamura, Y. Taniguchi, and X. Liang, "doccano: Text annotation tool for human," 2018. [Online]. Available: https://github.com/doccano/doccano
[11] M. Honnibal, I. Montani, S. Van Landeghem, and A. Boyd, "spaCy: Industrial-strength natural language processing in Python," 2020. [Online]. Available: https://doi.org/10.5281/zenodo.1212303
[12] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.
[13] D. Misra, "Mish: A self regularized non-monotonic activation function," in BMVC, 2020.
[14] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, Nov. 1997. [Online]. Available: https://doi.org/10.1162/neco.1997.9.8.1735
[15] D. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning representations by back-propagating errors," Nature, vol. 323, pp. 533–536, 1986.
[16] A. Graves and J. Schmidhuber, "Framewise phoneme classification with bidirectional LSTM networks," in Proceedings 2005 IEEE International Joint Conference on Neural Networks, vol. 4, 2005, pp. 2047–2052.
[17] M. Schuster and K. Paliwal, "Bidirectional recurrent neural networks," IEEE Transactions on Signal Processing, vol. 45, no. 11, pp. 2673–2681, 1997.
[18] C. D. Manning, M. Surdeanu, J. Bauer, J. R. Finkel, S. Bethard, and D. McClosky, "The Stanford CoreNLP natural language processing toolkit," in ACL (System Demonstrations). The Association for Computer Linguistics, 2014, pp. 55–60.
[19] L. Ramshaw and M. Marcus, "Text chunking using transformation-based learning," in Proceedings of the Third ACL Workshop on Very Large Corpora, 1995.
TABLE VI
THE F1-SCORE RESULTS OF DIFFERENT METHODS

Entity | MaxoutWindowEncoder W64 | LSTM W64 | MishWindowEncoder W64 | BiLSTM W64 | MaxoutWindowEncoder W300 | LSTM W300 | MishWindowEncoder W300 | BiLSTM W300
district_name | 0.8846 | 0.8267 | 0.8564 | 0.825 | 0.8853 | 0.8786 | 0.9256 | 0.8092
transaction_type | 0.9104 | 0.8272 | 0.6539 | 0.8198 | 0.845 | 0.8171 | 0.8816 | 0.9291
property_certificate | 0.7204 | 0.8239 | 0.815 | 0.9101 | 0.9311 | 0.9284 | 0.9367 | 0.6782
property_type | 0.8362 | 0.9205 | 0.9285 | 0.6609 | 0.714 | 0.6595 | 0.7676 | 0.8256
number_street_name | 0.8601 | 0.9814 | 0.9858 | 0.8109 | 0.988 | 0.9846 | 0.9923 | 0.8372
province_city | 0.8128 | 0.9101 | 0.9371 | 0.7195 | 0.8772 | 0.8012 | 0.8619 | 0.9042
TABLE VII
THE PRECISION RESULTS OF DIFFERENT METHODS

Entity | MaxoutWindowEncoder W64 | LSTM W64 | MishWindowEncoder W64 | BiLSTM W64 | MaxoutWindowEncoder W300 | LSTM W300 | MishWindowEncoder W300 | BiLSTM W300
district_name | 0.8831 | 0.7925 | 0.8507 | 0.8543 | 0.8779 | 0.8712 | 0.9419 | 0.8419
transaction_type | 0.8831 | 0.8568 | 0.7366 | 0.8525 | 0.8718 | 0.8279 | 0.8787 | 0.9283
property_certificate | 0.7128 | 0.7714 | 0.8204 | 0.9243 | 0.8954 | 0.8973 | 0.9073 | 0.6738
property_type | 0.7758 | 0.8922 | 0.9353 | 0.6567 | 0.6963 | 0.6538 | 0.7634 | 0.8154
number_street_name | 0.8745 | 0.9803 | 0.9868 | 0.8137 | 0.9869 | 0.989 | 0.9934 | 0.8823
TABLE VIII
THE RECALL RESULTS OF DIFFERENT METHODS

Entity | MaxoutWindowEncoder W64 | LSTM W64 | MishWindowEncoder W64 | BiLSTM W64 | MaxoutWindowEncoder W300 | LSTM W300 | MishWindowEncoder W300 | BiLSTM W300
district_name | 0.8861 | 0.8639 | 0.8622 | 0.7976 | 0.8929 | 0.8861 | 0.9099 | 0.7789
transaction_type | 0.9394 | 0.7996 | 0.588 | 0.7895 | 0.8199 | 0.8066 | 0.8846 | 0.9298
property_certificate | 0.7283 | 0.8842 | 0.8096 | 0.8963 | 0.9697 | 0.9617 | 0.9681 | 0.6826
property_type | 0.9067 | 0.9506 | 0.9219 | 0.6652 | 0.7326 | 0.6652 | 0.7717 | 0.836
number_street_name | 0.8462 | 0.9825 | 0.9847 | 0.8082 | 0.9891 | 0.9803 | 0.9912 | 0.7966
province_city | 0.8736 | 0.9162 | 0.9347 | 0.7299 | 0.8621 | 0.7874 | 0.8966 | 0.9054