
Standardization procedure for automatic environmental data: a case study in Hanoi, Vietnam

Linh Nguyen Duc, Man Duc Chuc, Bui Quang Hung, Nguyen Thi Nhat Thanh

Center of Multidisciplinary Integrated Technology for Field Monitoring, University of Engineering and Technology, Vietnam National University, Hanoi, Vietnam

linhnd@fimo.edu.vn

Abstract - In Vietnam, environmental data collected from ground-based stations may contain abnormal or missing values due to problems during operation, e.g. sensor faults. This paper proposes a standardization procedure that tries to detect unusual values and fill in missing data. Experiments were conducted on PM10 data. Two datasets measured in 01/2011 and 01/2012 at the Nguyen Van Cu station in Hanoi, Vietnam are used. In the abnormal detection process, unusual data are reported to the data analysts at ground stations for judgment. In the missing filling process, the first dataset is used as training data to construct regression models for predicting missing values, and the second dataset is used as testing data. In the worst case, assuming 100% of PM10 is missing, the Root Mean Square Error (RMSE) and Mean Absolute Percentage Error (MAPE) are 51 μg/m3 and 45% respectively, and the correlation coefficient (R) between original and predicted PM10 data is 0.56. In addition, scenarios with different percentages of missing data in the testing dataset are considered. Experimental results show that the missing filling process works best on datasets that contain 10% to 30% missing data. In this case, RMSE ranges from 15 to 25 μg/m3 and MAPE varies from 5% to 13%.

Keywords - environmental data, abnormal detection, missing filling, PM10

I. INTRODUCTION

Environmental monitoring data are obtained by measuring one or more indicators of the physical properties and the chemical and biological components of the environment, according to a preset plan covering time, space, methods and measurement process, in order to provide reliable and accurate field information.

Ground-based environmental data can be used in various real-life applications such as air pollution modeling and healthcare studies [11]. For example, the healthcare sector can use the data to analyze and assess the impact of physical, chemical and biological factors on dermatological, respiratory or epidemic diseases [12]. The data can also help managers make decisions and design appropriate measures to limit the severe decline in air quality that Vietnam currently faces.

In Vietnam, there are two systems of environmental monitoring stations, both managed by the Ministry of Natural Resources and Environment [2]. Most of the stations are automated. The stations measure meteorological and air pollution indicators hourly. Measured data is stored in local memory and transferred to the main center daily or weekly. The data contains many abnormal values and many gaps due to problems during operation, such as sensor faults and station maintenance. Furthermore, the data has not undergone any correction or recovery process. This creates obstacles for researchers who use the data.

Currently, the authorities mainly use traditional statistical tools, e.g. Microsoft Excel, which may increase processing time, especially when the data volume is large. Additionally, detecting abnormal data or filling in missing data by hand is time-consuming and costly. An automatic tool is therefore needed to help the authorities and researchers work with the data.

The current problems in the data measured at the ground stations are described below:

- Inconsistent data: data is not stored in a commonly standardized output. It is stored in different structures using different units of measurement, column names, and date and time formats. This makes the data difficult to analyze.

- Noisy data: occurs in several cases such as equipment failure, transmission errors and unidentified errors.

- Missing data: data is lost in situations such as unexpected breakdown of the monitoring modules, power failure, or a change in the position of the measuring devices.

In this paper, we address the second and third problems. The proposed standardization procedure helps in the synthesis, cleaning and missing filling of data, saving time and effort for managers and researchers when working with the data.

2016 Eighth International Conference on Knowledge and Systems Engineering (KSE)

II. DATA

As mentioned before, Vietnam has two networks of automatic air monitoring stations, both managed by the Ministry of Natural Resources and Environment. The first is the monitoring network for meteorological and environmental parameters (10 stations); the second is the network of national environmental monitoring stations (7 stations). The stations record data hourly. The air pollution parameters measured at all of the stations include carbon monoxide (CO), nitric oxide (NO), nitrogen dioxide (NO2), sulfur dioxide (SO2), ozone (O3) and PM. In addition, the stations measure meteorological and instrument parameters: wind speed, wind direction, temperature, relative humidity, barometric pressure, radiation and inner temperature.

In Hanoi, there are three air monitoring stations: one is located at Phao Dai Lang, Dong Da, and the other two at 556 Nguyen Van Cu and Ho Chi Minh Mausoleum. In this study, we used data from the Nguyen Van Cu station. The station was launched in 2009 and is regularly maintained. It is located in the Centre for Environmental Monitoring (CEM), has the most stable operation, and its data can be considered representative of the Hanoi area.

Particulate matter (PM) consists of solid and liquid particles suspended in the atmosphere. PM includes both organic and inorganic particles such as dust, pollen, soot, smoke, and liquid droplets. These particles vary greatly in size, composition, and origin. PM can be divided into three categories based on particle diameter: PM10, PM2.5 and PM1. The dust monitoring data includes all three, and PM10 is the main focus of this study. PM10 data were collected from 01/01/2011 to 31/01/2011 and from 01/01/2012 to 31/01/2012 at the Nguyen Van Cu station, Hanoi. These periods are close to the time when the station was set up, so data quality is guaranteed. During 2013-2016, the monitoring modules were not well maintained, resulting in more errors and missing data.

A. Structure and volume

The two datasets collected from the Nguyen Van Cu station consist of 15 indicators: wind speed, wind direction, temperature, relative humidity, barometer, radiation, inner temperature, NO, NO2, SO2, CO, O3, PM10, PM2.5 and PM1. The data is stored in Microsoft Excel format (.xls), with each day saved in a separate file. The two datasets therefore contain 62 files corresponding to the 62 days from 01/01/2011 to 31/01/2011 and from 01/01/2012 to 31/01/2012 (Table 1). The total number of records in each dataset is 744 (31 days of 24 hourly records).

Table 1. Statistics on data structure and volume in 01/2011 and 01/2012

Monitoring time | Number of .xls files | Indicators
01/2011 | 31 | Wind speed, wind direction, temperature, relative humidity, barometer, radiation, inner temperature, NO, NO2, SO2, CO, O3, PM10, PM2.5 and PM1
01/2012 | 31 | Same 15 indicators as 01/2011

B. Missing status

The first dataset, collected in 01/2011, has a low missing rate: about 2% for PM and 0% for the other indicators. The second dataset, collected in 01/2012, has a higher missing rate: 23% for SO2 and 37.4% for O3. The PM indicators in this dataset, however, were fully recorded (Table 2).

Table 2. Statistics on the number of missing records by indicator in the two datasets

Based on these statistics, the first dataset (01/2011) is used as the training dataset, because its PM10 data is nearly complete and the other indicators are highly complete. The second dataset (01/2012) is used as the testing dataset.

III. METHODOLOGY

Based on the characteristics of the data, we propose a standardization procedure for automatic environmental data (Fig 1), described below:

1. Data collection: collect data from the stations, then build a common dataset defined by a conventional structure. The aim is to create a dataset with a standard data structure that simplifies managing and analyzing the data. If the dataset structure is not correct, collect the data again; when it is correct, go to the Data overview step.

2. Data overview (based on statistics): use statistical methods to extract the statistical characteristics and trends of the data, and prescreen it against reality. This is just to get an overview of the data and a feel for whether the data is noisy or missing. This step helps us assess the quality of the existing data. If the dataset has good quality, proceed to the next step. If not, go back to the data source in the first step.

3. Noise detecting: flag data based on a data reliability range or on correlation analysis. The aim is to find the days that have abnormal observational data and to suggest them to the analysts, who make the decision on the data. If the detected days turn out not to be noise, re-evaluate the noise detecting method; otherwise go to the next step.

4. Fill in missing data: use correlation analysis between the target indicator and other indicators to build linear regression models. The models are used to predict values for the missing records of the target indicator. If the filled dataset is acceptable, finish the process; otherwise re-evaluate the missing filling method.

IV. EXPERIMENTS AND RESULTS

A. Data Collection and Data Overview

Based on the data and basic statistical indicators, we can draw some conclusions about the PM10 data in the two datasets, as shown in Table 3.

Table 3. Statistical indicators of PM10 calculated on the two datasets

Month | Mean | Median | Mode | Q1 | Q3
01/2011 | 141.37 | 129.68 | 40.91 | 56.07 | 210.41
01/2012 | 87.18 | 75.39 | 97.22 | 49.61 | 113.61

Overall, the average PM10 concentrations range from about 85 to 140 μg/m3. This is close to the limit in the QCVN 05:2013/BTNMT standard, which sets the air quality standard for PM10 in Vietnam at 150 μg/m3. In general, the statistical indicators of PM10 in the second dataset have lower values than those of the first dataset, the mode being the exception.
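The indicators in Table 3 can be computed with Python's standard library. The sketch below is only illustrative: the function name `describe` and the sample readings are assumptions, and the paper does not say which quartile convention was used, so the "inclusive" method here is a guess.

```python
import statistics

def describe(values):
    """Return mean, median, mode, Q1 and Q3 of a list of PM10 readings."""
    # With n=4, statistics.quantiles returns the three quartile cut points.
    # The quartile convention ("inclusive") is an assumption of this sketch.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "q1": q1,
        "q3": q3,
    }

# Hypothetical hourly PM10 readings in ug/m3 (not taken from the datasets).
sample = [40, 55, 55, 70, 90, 120, 150, 210]
summary = describe(sample)
```

For real-valued sensor readings the mode depends on how values are rounded, so it is the least stable of the five indicators.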

A previous study conducted in Hanoi showed that the averages of the monitoring indicators are often higher in winter and lower in summer [1]. The maximum PM10 value is often observed in the period from October to January, with average PM10 values ranging from 100 to 150 μg/m3. This is similar to the statistics above.

The same study also described the evolution of air pollution levels in 05/2003 and 09/2003 in Hanoi [1]. During these days, the air pollution level tends to rise during peak hours, from 7:00 to 9:00 and from 18:00 to 20:00. Furthermore, the highest peaks in the morning are often similar to those in the evening. This is due to the high volume of vehicles during peak hours every day. The average PM10 value for each hour, calculated from the data of each month, agrees with this general trend (Fig 2). Similar evaluation methods were applied to other indicators such as NO2, SO2 and CO. The results showed that the two datasets are reliable and follow the general trends reported in the literature. This supports conducting the following steps.
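The hourly averages behind the daily trend can be sketched as follows. This is a minimal illustration, assuming records are available as (hour, PM10) pairs with `None` for a missing value; the function name `hourly_profile` and the sample values are hypothetical.

```python
from collections import defaultdict

def hourly_profile(records):
    """Average PM10 per hour of day.

    `records` is an iterable of (hour, pm10) pairs with hour in 0..23.
    A pm10 of None marks a missing record and is skipped.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for hour, pm10 in records:
        if pm10 is None:
            continue
        sums[hour] += pm10
        counts[hour] += 1
    return {h: sums[h] / counts[h] for h in sorted(counts)}

# Hypothetical readings for two days at 08:00 and 14:00.
records = [(8, 200.0), (14, 80.0), (8, 180.0), (14, None)]
profile = hourly_profile(records)
```

Applied to a full month, the result is 24 averages, one per hour, which is the curve plotted as the daily trend.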

Fig 1. Data processing framework proposed [flowchart: Start, Data collection, Data overview, Noise detecting, Fill in missing, Finish; each step is followed by an Evaluation that leads to the next step, or back when problems are found]

Fig 2. PM10 daily trend in 01/2011 and 01/2012

B. Noise detecting

Noise detecting aims to find potentially abnormal data on a daily basis. It can be based on constructing a reliable data range, on correlation analysis, or on a combination of both methods.

A confidence interval can be used to determine a reliable range of values, which is then used to filter out noisy data. This method requires analysts with long experience of working with observational data in order to construct a good data range. Based on research and environmental reports [2, 3, 4, 5, 6, 7], we propose [0, 400] μg/m3 as the range of reliable values for PM10. Applying the proposed range to the training dataset yields 4 potentially abnormal records, listed in Table 4:

Table 4. Records with values outside the confidence interval in 01/2011

Datetime | Observed value of PM10
12/01/2011 10:00 | 490
17/01/2011 08:00 | 420.656
17/01/2011 17:00 | 462.044
17/01/2011 18:00 | 425.139
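The range check itself is a simple filter. The sketch below assumes records are (timestamp, value) pairs; the function name `out_of_range` is hypothetical, and only the [0, 400] bounds come from the text.

```python
PM10_RANGE = (0.0, 400.0)  # reliable range in ug/m3 proposed in the text

def out_of_range(records, low=PM10_RANGE[0], high=PM10_RANGE[1]):
    """Return the (timestamp, value) pairs whose value lies outside [low, high].

    Missing values (None) are ignored here; they belong to the filling step.
    """
    return [(ts, v) for ts, v in records
            if v is not None and not (low <= v <= high)]

# The first and third values appear in Table 4; the second is a made-up
# in-range reading added for contrast.
records = [
    ("12/01/2011 10:00", 490.0),
    ("12/01/2011 11:00", 135.2),
    ("17/01/2011 08:00", 420.656),
]
flagged = out_of_range(records)
```

The flagged records are candidates for review by an analyst rather than values to delete automatically.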

Correlation analysis is another way to detect abnormal data. We propose to detect potentially abnormal data by analyzing the correlation between daily data and monthly average data. First, for each hour of the day, the average value is calculated from all data observed at that hour during the month. Each month thus yields 24 average PM10 values, one per hour, which are taken to represent the daily trend of PM10 for that month. Correlation analysis is then conducted between the PM10 data measured on a particular day and these average values. If the correlation coefficient is low, the day's data is considered noisy or abnormal. Specifically, the range [-0.3; 0.3] is used to filter out potentially abnormal data for further analysis and evaluation. This range was chosen because it corresponds to negligible correlation according to Mukaka [14].

In addition, evaluating abnormal data requires professional experience in meteorology and environment, plus further assessment of the pollution sources (such as traffic and industrial zones), of the surrounding area at the time of measurement, and of the status of the measurement equipment at that time. Applying the proposed range [-0.3; 0.3] to the training dataset yields 8 days with a low correlation coefficient, listed in Table 5.

Table 5. Dates with a low correlation coefficient between the daily data and the monthly average in the training dataset

Observation date | Correlation coefficient
03/01/2011 | -0.2829
04/01/2011 | 0.2108
09/01/2011 | -0.0953
11/01/2011 | 0.1110
13/01/2011 | 0.1502
17/01/2011 | 0.2299
19/01/2011 | -0.2411
23/01/2011 | -0.0405
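The correlation-based screening can be sketched as follows. It is an illustration under simplifying assumptions: every day has a complete hourly series, no series is constant (which would make the correlation undefined), and the names `pearson` and `low_correlation_days` are hypothetical. The toy example uses four "hours" per day instead of 24.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)  # undefined (division by zero) for constant input

def low_correlation_days(days, threshold=0.3):
    """Flag days whose hourly series correlates weakly with the monthly profile.

    `days` maps a date label to a list of hourly PM10 values of equal length.
    Returns {date: r} for every day with r inside [-threshold, threshold].
    """
    n_hours = len(next(iter(days.values())))
    # Monthly profile: average of each hour over all days.
    profile = [sum(d[h] for d in days.values()) / len(days)
               for h in range(n_hours)]
    flagged = {}
    for date, series in days.items():
        r = pearson(series, profile)
        if abs(r) <= threshold:
            flagged[date] = r
    return flagged

# Toy data: three days follow a rising pattern, day "D" does not.
days = {
    "A": [10, 20, 30, 40],
    "B": [10, 20, 30, 40],
    "C": [10, 20, 30, 40],
    "D": [20, 10, 10, 20],
}
flagged = low_correlation_days(days)
```

In this toy case only day "D" falls inside [-0.3; 0.3]. Note that each day also contributes to the profile it is compared against, as in the procedure described above.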

C. Filling missing

Previous studies show that some environmental indicators are significantly correlated [9, 13]. This means that missing PM10 data can be recovered from suitable environmental parameters by constructing linear regression models.

In this study, we build linear regression models using the training dataset. The models are then used to predict missing PM10 values in the testing dataset. Table 6 shows the correlations between PM10 and the other environmental indicators, derived from the training dataset:

Table 6. Correlation between PM10 and other environmental indicators in the training dataset

Indicator | Correlation with PM10
WindSpd | 0.04982
WindDir | 0.03815
Temp | 0.08365
RH | 0.34409
Barometer | 0.03855
Radiation | -0.0124
InnerTemp | 0.02089
NO | 0.23985
NO2 | 0.59005
SO2 | 0.53962
CO | 0.44486
O3 | 0.09338

According to the table, three indicators are highly correlated with PM10: NO2, SO2 and CO. Seven linear regression models, one for each non-empty combination of the three parameters, are constructed to predict PM10. As described before, 15 records in the training dataset have missing PM10 values. To ensure complete data for building the regression models, incomplete records are removed, resulting in a training dataset of 725 records. After building the seven linear regression models, we validated the PM10 values predicted by each model against the actual PM10 values. Assuming that 100% of PM10 is missing, R2, RMSE and MAPE are used to quantitatively assess the performance of each model (Table 7).

Table 7. Validation results of 7 linear regression models on the training dataset (100% PM10 missing)

Parameters for model | R2 | RMSE* | MAPE*
SO2, NO2 | 0.43 | 67.9 | 68.9

* Predicted PM10 and actual PM10
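The paper does not spell out the formulas for RMSE and MAPE. The sketch below assumes the usual definitions, with MAPE expressed in percent; the function names and the sample values are illustrative only.

```python
import math

def rmse(actual, predicted):
    """Root mean square error, in the units of the data (ug/m3 for PM10)."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def mape(actual, predicted):
    """Mean absolute percentage error in percent.

    Every actual value must be non-zero, since it is used as the divisor.
    """
    ratios = [abs((a - p) / a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(ratios) / len(ratios)

# Hypothetical actual and predicted PM10 values.
actual = [100.0, 200.0, 50.0]
predicted = [110.0, 180.0, 50.0]
error_rmse = rmse(actual, predicted)   # about 12.91
error_mape = mape(actual, predicted)   # about 6.67
```

Because MAPE divides by the actual value, hours with very low PM10 can dominate it, which is worth keeping in mind when comparing MAPE across datasets.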

Based on the results, a priority is assigned to each model for use in real-life problems, as shown in Table 8:

Table 8. Models ordered by priority

Parameters for model | Linear regression equation | Priority
SO2, NO2, CO | Y = -8.98 + 2.02*SO2 + 1.35*NO2 + 0.011*CO | 1
SO2, NO2 | Y = 0.79 + 1.87*SO2 + 1.80*NO2 | 2
SO2, CO | Y = -1.95 + 2.59*SO2 + 0.028*CO | 3
NO2, CO | Y = 20.5 + 2.51*NO2 - 0.0004*CO | 4
NO2 | Y = 20.2 + 2.5*NO2 | 5
SO2 | Y = 52.9 + 3.01*SO2 | 6
CO | Y = 42.5 + 0.04*CO | 7
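The equations in Table 8 can be applied directly once they are written down as data. In the sketch below the coefficients are copied from Table 8, while the dictionary layout, the function name `predict_pm10` and the sample record are assumptions; the paper does not state the units of the predictors, so the sample values are arbitrary.

```python
# Coefficients copied from Table 8, keyed by priority:
# (intercept, {predictor name: slope}).
MODELS = {
    1: (-8.98, {"SO2": 2.02, "NO2": 1.35, "CO": 0.011}),
    2: (0.79, {"SO2": 1.87, "NO2": 1.80}),
    3: (-1.95, {"SO2": 2.59, "CO": 0.028}),
    4: (20.5, {"NO2": 2.51, "CO": -0.0004}),
    5: (20.2, {"NO2": 2.5}),
    6: (52.9, {"SO2": 3.01}),
    7: (42.5, {"CO": 0.04}),
}

def predict_pm10(priority, record):
    """Evaluate the linear equation of the model with the given priority."""
    intercept, slopes = MODELS[priority]
    return intercept + sum(slope * record[name]
                           for name, slope in slopes.items())

# Hypothetical record; the values are arbitrary, not measurements.
record = {"SO2": 10.0, "NO2": 20.0, "CO": 1000.0}
y_model_2 = predict_pm10(2, record)  # 0.79 + 18.7 + 36.0 = 55.49
y_model_5 = predict_pm10(5, record)  # 20.2 + 50.0 = 70.2
```

The seven entries cover every non-empty subset of {SO2, NO2, CO}, which is why exactly seven models exist.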

In practice, NO2, SO2 and CO are not always available; they can be missing just like PM10. Deciding which model to use therefore requires an understanding of the data to be recovered. Table 9 lists the suggested model for each status of the data.

Table 9. Cases of missing data and suggested linear regression models

Record status | Linear regression model number
Records with SO2, NO2 and CO all available | 1
Records missing CO | 2
Records missing SO2, NO2 | 7
Records missing SO2, CO | 5
Records missing NO2, CO | 6
Records missing all of SO2, NO2, CO | Cannot predict
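One way to express the selection rule is to pick the highest-priority model whose inputs are all present in the record. This is an assumed reading of the table rather than the authors' stated algorithm, although it reproduces the rows of Table 9; the names `MODEL_INPUTS` and `choose_model` are hypothetical.

```python
# Predictor sets of the seven models, in the priority order of Table 8.
MODEL_INPUTS = [
    (1, {"SO2", "NO2", "CO"}),
    (2, {"SO2", "NO2"}),
    (3, {"SO2", "CO"}),
    (4, {"NO2", "CO"}),
    (5, {"NO2"}),
    (6, {"SO2"}),
    (7, {"CO"}),
]

def choose_model(record):
    """Return the number of the highest-priority usable model, or None.

    A predictor counts as available when its key is present in `record`
    with a value other than None.
    """
    available = {name for name in ("SO2", "NO2", "CO")
                 if record.get(name) is not None}
    for number, inputs in MODEL_INPUTS:
        if inputs <= available:  # every input of the model is available
            return number
    return None  # all three predictors missing: cannot predict

model_without_so2 = choose_model({"SO2": None, "NO2": 30.0, "CO": 900.0})
model_without_co = choose_model({"SO2": 5.0, "NO2": 30.0, "CO": None})
```

Under this rule a record lacking only SO2 is filled from {NO2, CO}, which matches the combination reported for the SO2-missing records in the testing experiment.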

Next, we validated the models on the testing dataset. This dataset has no missing PM10, CO or NO2 values, but SO2 is missing in 170 of 744 records. This is a good basis for the assessment process.

Assuming that 100% of the PM10 data is missing from the testing dataset, the correlation coefficient between predicted and actual PM10 is 0.56, and RMSE and MAPE are 51 μg/m3 and 45% respectively (Table 10). This result is acceptable because it ensures data completeness, and R, RMSE and MAPE are at a medium level. The MAPE in this case is smaller than the MAPE in Table 7 because two linear regression models were applied to predict PM10, so the error is smaller than when only one model is used.

Table 10. Results after filling missing PM10 in the testing dataset, assuming that 100% of the PM10 data is missing in 01/2012

Number of records {NO2, SO2, CO} | Number of records {NO2, CO} | Correlation coefficient* | RMSE* | MAPE*
574 | 170 | 0.56 | 51.4 | 45.3

* Predicted PM10 and actual PM10

A test to evaluate the impact of the missing data rate was also performed. PM10 missing rates of 10%, 20%, 30%, 40% and 50% are assumed. For each missing rate, 10 datasets are randomly generated, and the average results over the 10 datasets are reported in Table 11.

Table 11. PM10 missing filling results for different missing rates in the testing dataset

Missing percent | 10% | 20% | 30% | 40% | 50%
Total records | 744 | 744 | 744 | 744 | 744
Correlation* | 0.94 | 0.91 | 0.86 | 0.78 | 0.75
RMSE* | 15.75 | 20.77 | 24.92 | 34.29 | 36.8
MAPE* | 4.93 | 9.10 | 13.46 | 18.04 | 23.06

* Predicted PM10 and actual PM10
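Generating the test scenarios amounts to hiding a random subset of known PM10 values so that the filled values can later be compared with the originals. The sketch below covers only this masking step; the function names, the fixed seed and the synthetic series are assumptions, and the paper does not describe how its random datasets were drawn.

```python
import random

def mask_random(values, rate, rng):
    """Hide round(rate * n) randomly chosen entries of `values`.

    Returns (masked, hidden): a copy with None at the hidden positions,
    and the set of hidden indices for later scoring.
    """
    n_missing = round(rate * len(values))
    hidden = set(rng.sample(range(len(values)), n_missing))
    masked = [None if i in hidden else v for i, v in enumerate(values)]
    return masked, hidden

def make_scenarios(values, rates=(0.1, 0.2, 0.3, 0.4, 0.5), repeats=10, seed=0):
    """Build `repeats` randomly masked copies of `values` for each rate."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    return {rate: [mask_random(values, rate, rng) for _ in range(repeats)]
            for rate in rates}

# Synthetic stand-in for the 744 hourly PM10 records of the testing dataset.
pm10 = [50.0 + (i % 24) for i in range(744)]
scenarios = make_scenarios(pm10)
```

Each masked copy would then be filled with the regression models and scored against the hidden originals, with the metrics averaged over the 10 repeats per rate.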

In general, the results differ significantly across missing rates. Specifically, when 10% of PM10 is missing, R, RMSE and MAPE are 0.94, 15.75 μg/m3 and 4.9% respectively, the best recovery results in the test. With 20% to 30% of the data missing, MAPE and RMSE range from 9% to 13% and from 20 to 25 μg/m3 respectively. When the missing rate exceeds 30%, MAPE increases from 18% to 23%. The worst case is the 50% missing rate, with an RMSE of 36.8 μg/m3 and R = 0.75. From these results, it is better to perform the missing filling process when the missing rate is 30% or less. Overall, the results show the potential of applying the method to real-life problems.

V. CONCLUSION

In this paper, we propose a framework for automatic environmental data that covers the path from data collection to a dataset with a standardized structure and acceptable quality. The proposed workflow includes data collection, data overview, noise detecting, missing filling and evaluation stages. Different techniques for each stage are introduced and experimentally evaluated.

Although the framework describes an overall process, analysts can customize every step in it. The framework still has some unresolved issues, including the historical knowledge needed for noise removal and the completeness of the other environmental indicators needed to estimate missing PM10 values. In future work, we plan to use meteorological or weather stations in the same area to bring in more environmental indicators and improve the overall quality of the workflow.

ACKNOWLEDGMENT

REFERENCES

[1] Pham Duy Hien. Current status and laws of changes of air quality in Hanoi, 03/2006.

[2] The Ministry of Natural Resources and Environment. Vietnam National environmental report in 2013.

[3] The Ministry of Natural Resources and Environment. Vietnam National environmental report in 2010.

[4] Ngo Tho Hung. Urban Air Quality Modelling and Management in Hanoi, Vietnam. PhD Thesis, AARHUS University, 2010.

[5] Clean Air Initiative for Asian Cities (CAI-Asia) Center. Viet Nam: Air Quality Profile, 2010 Edition.

[6] Cao Dung Hai, Nguyen Thi Kim Oanh. Effects of local, regional meteorology and emission sources on mass and compositions of particulate matter in Hanoi. Atmospheric Environment, Volume 78, October 2013, Pages 105–112.

[7] Nguyen Tran Huong Giang, Nguyen Thi Kim Oanh. Roadside levels and traffic emission rates of PM2.5 and BTEX in Ho Chi Minh City, Vietnam. Atmospheric Environment, Volume 94, September 2014, Pages 806–816.

[8] Dang Manh Doan, Tran Thi Dieu Hang, Phan Ban Mai. The situation of air pollution in Hanoi and recommendations to reduce pollution. Institute of Meteorology, Hydrology and Environment, 2007.

[9] Jung-Moon Yoo, Yu-Ri Lee, Dongchul Kim, Myeong-Jae Jeong, William R. Stockwell, Prasun K. Kundu, Soo-Min Oh, Dong-Bin Shin, Suk-Jo Lee. New indices for wet scavenging of air pollutants (O3, CO, NO2, SO2, and PM10) by summertime rain. Atmospheric Environment, Volume 82, January 2014, Pages 226–237.

[10] Ping Wang, Junji Cao, Xuexi Tie, Gehui Wang, Guohui Li, Tafeng Hu, Yaoting Wu, Yunsheng Xu, Gongdi Xu, Youzhi Zhao, Wenci Ding, Huikun Liu, Rujin Huang, Changlin Zhan. Impact of Meteorological Parameters and Gaseous Pollutants on PM2.5 and PM10 Mass Concentrations during 2010 in Xi’an, China. Aerosol and Air Quality Research, 15: 1844–1854, 2015.

[11] Gharehchahi E, Mahvi AH, Amini H, Nabizadeh R, Akhlaghi AA, Shamsipour M, et al. Health impact assessment of air pollution in Shiraz, Iran: a two-part study. J Environ Health Sci Eng 2013; 11: 1–8.

[12] Brauer M, Amann M, Burnett RT, Cohen A, Dentener F, Ezzati M, et al. Exposure assessment for estimation of the global burden of disease attributable to outdoor air pollution. Environ Sci Technol 2012; 46: 652–660.

[13] Dragan M. Marković, Dragan A. Marković, Anka Jovanović, Lazar Lazić, Zoran Mijić. Determination of O3, NO2, SO2, CO and PM10 measured in Belgrade urban area. Environmental Monitoring and Assessment, October 2008, Volume 145, Issue 1, pp 349-359.

[14] M. M. Mukaka. A Guide to Appropriate Use of Correlation Coefficient in Medical Research. Malawi Medical Journal, Vol 24, No 3, 2012, pp 69-71.
