Standardization procedure for automatic
environmental data: a case study in Hanoi, Vietnam
Linh Nguyen Duc, Man Duc Chuc, Bui Quang Hung, Nguyen Thi Nhat Thanh
Center of Multidisciplinary Integrated Technology for Field Monitoring, University of Engineering and Technology, Vietnam National University
Hanoi, Vietnam linhnd@fimo.edu.vn
Abstract - In Vietnam, environmental data collected from
ground-based stations may contain abnormal or missing values
due to problems during operation, e.g. sensor failures.
This paper proposes a standardization procedure that
detects unusual values and fills in missing data. Experiments were
conducted on PM10 data. Two datasets measured in 01/2011 and
01/2012 at Nguyen Van Cu station in Hanoi, Vietnam are used for
the experiments. In the abnormal detection process, unusual data
are reported to the data analysts at ground stations for
judgment. In the missing filling process, the first dataset is used as
the training dataset to construct regression models for predicting
missing data, and the second dataset is used as the testing dataset. In the
worst case, when 100% of PM10 is missing, the Root Mean Square
Error (RMSE) and Mean Absolute Percentage Error (MAPE)
are 51 μg/m3 and 45% respectively, and the correlation coefficient (R)
between original and predicted PM10 data is 0.56. In
addition, scenarios with different percentages of
missing data in the testing dataset are considered.
Experimental results show that the missing filling process
works best on datasets that contain 10% to 30% missing
data. In this case, RMSE ranges from 15 to 25 μg/m3 and MAPE
from 5% to 13%.
Keywords—environmental data, abnormal detection, missing
filling, PM10
I INTRODUCTION
Environmental monitoring data is a dataset obtained
by measuring one or more indicators of the physical
properties and the chemical and biological components of the
environment, according to a preset plan covering time,
space, methods and measurement process, in order to provide
reliable and accurate field information.
Ground-based environmental data can be used in
various real-life applications such as air pollution modeling and
healthcare studies [11]. For example, the healthcare sector can use
the data to analyze and assess the impact of physical,
chemical and biological factors on dermatological, respiratory
or epidemic diseases [12]. The data can also support
managers in deciding on appropriate
measures to limit the current severe decline in air quality in
Vietnam.
In Vietnam, there are two systems of environmental monitoring stations, both managed by the Ministry of Natural Resources and Environment [2]. Most of the stations are automated. The stations measure meteorological indicators and air pollution indicators hourly. Measured data
is stored in local memory and transferred to the main center daily
or weekly. The data contains many abnormal values and many gaps
due to problems during operation such as sensor failures and station maintenance. Furthermore, the data has not undergone any correction or recovery process. This creates obstacles for researchers who use the data.
Currently, the authorities mainly use traditional statistical tools, e.g. Microsoft Excel, which may result in long processing times, especially when the data volume is large. Additionally, detecting abnormal data or filling in missing data manually is very time-consuming and costly. Thus an automatic tool is needed to help the authorities and researchers work with the data.
Current problems in the data measured at the ground stations are described below:
- Inconsistent data: data is not stored in a common standardized format. It is stored in different structures using different units of measurement, column names, and date and time formats. This causes many difficulties when analyzing the data.
- Noisy data: occurs in cases such as equipment failure, transmission errors and unidentified errors.
- Missing data: data is missing in situations such
as unexpected failure of monitoring modules, power failure or relocation of the measuring devices.
In this paper, we address the second and third problems. The proposed standardization procedure helps in
the synthesis, cleaning and missing filling of data, saving time
and effort for managers and researchers when working with the
data.
2016 Eighth International Conference on Knowledge and Systems Engineering (KSE)
II DATA
As mentioned before, in Vietnam there are two
networks of automatic air monitoring stations managed by the
Ministry of Natural Resources and Environment. The first is
a monitoring network for meteorological and environmental
parameters (10 stations); the second is a network of national
environmental monitoring stations (7 stations). The
stations record data hourly. The air pollution parameters
measured at all of the stations include carbon monoxide
(CO), nitric oxide (NO), nitrogen dioxide (NO2),
sulfur dioxide (SO2), ozone (O3) and PM. In addition, the stations
measure meteorological information such as wind speed, wind
direction, temperature, relative humidity, barometric pressure,
radiation and inner temperature.
In Hanoi, there are three air monitoring stations: one
is located at Phao Dai Lang, Dong Da, and the other two are
located at 556 Nguyen Van Cu and Ho Chi Minh Mausoleum.
In this study, we used data from the Nguyen Van Cu station for
analysis. The station was launched in 2009 and is regularly
maintained. It is located at the Centre for
Environmental Monitoring (CEM), has the most stable
operation, and its data can be considered representative of the
Hanoi area.
Particulate matter (PM) consists of solid and liquid particles
suspended in the atmosphere. PM includes both organic and
inorganic particles such as dust, pollen, soot, smoke, and liquid
droplets. These particles vary greatly in size, composition, and
origin. PM can be divided into three categories based on particle
diameter: PM10, PM2.5 and PM1. The dust monitoring
data includes PM10, PM2.5 and PM1; PM10 is the main focus
of this study. PM10 data was collected from 01/01/2011 to
31/01/2011 and from 01/01/2012 to 31/01/2012 at the Nguyen Van
Cu station, Hanoi. This is close to the time when the
station was set up, when data quality was more reliable. In the
period 2013-2016, the monitoring modules were not
well maintained, resulting in more errors and missing data.
A Structure and volume
The two datasets collected from Nguyen Van Cu
station consist of 15 indicators: wind speed, wind
direction, temperature, relative humidity, barometer, radiation,
inner temperature, NO, NO2, SO2, CO, O3, PM10, PM2.5 and
PM1. The data is stored in Microsoft Excel format (.xls). Data
for each day is saved in a separate file. Thus the two datasets
contain 62 files corresponding to the 62 days from 01/01/2011 to
31/01/2011 and from 01/01/2012 to 31/01/2012 (Table 1). Each dataset contains 744 hourly records (31 days x 24 hours).
Table 1 Statistics on data structure and volume in 01/2011 and
01/2012
Monitoring time   Number of xls files
01/2011           31
01/2012           31
Indicators (both months): wind speed, wind direction, temperature, relative humidity, barometer, radiation, inner temperature, NO, NO2, SO2, CO, O3, PM10, PM2.5 and PM1
B Missing status
The first dataset, collected in 01/2011, has a low missing rate: about 2% for PM and 0% for other indicators. The second dataset, collected in 01/2012, has a higher missing rate: 23% for SO2 and 37.4% for O3. However, PM indicators in this dataset were fully recorded (Table 2).
Table 2 Statistics on the number of missed records according
to indicators in two datasets
According to these statistics, the first dataset (01/2011)
is used as the training dataset because its PM10 data is nearly complete and the other indicators are fully recorded. The second dataset (01/2012) is used as the test dataset.
III METHODOLOGY
Based on the characteristics of the data, we propose a standardization procedure for automatic environmental data (Fig 1) as described below:
1 Data collection: collect data from the stations, then
build a common dataset with a conventional structure. The aim is to create a dataset with a standard structure that simplifies managing and analyzing the data. If the dataset structure is not correct, collect the data again; once it is correct, go to the Data overview step.
2 Data overview (based on statistics): use statistical methods to extract statistical characteristics and trends of the data and prescreen it against reality.
This gives an overview of the data and a sense of
whether the data is noisy or missing, which helps us assess the
quality of the existing data. If the dataset has good quality,
go to the next step. If not, re-examine the data source in the
first step.
3 Noise detecting: flag data based on a reliable data
range or using correlation analysis. This step
detects the days that have abnormal observations and
suggests unusual data to the analysts, who make the
decision on the data. If the detected days turn out not to be
noisy, re-evaluate the noise detecting method; otherwise go to
the next step.
4 Fill in missing data: use correlation analysis between the
target indicator and other indicators to build linear
regression models. The models are used to predict values
for records where the target indicator is missing. If the
filled dataset is acceptable, the process finishes; otherwise
re-evaluate the missing filling method.
IV EXPERIMENTS AND RESULTS
A Data Collection and Data Overview
Based on the data and basic statistical indicators, we can
draw some conclusions about the PM10 data in the two datasets
(Table 3).
Table 3 Statistical indicators of PM10 (ug/m3) calculated on
the two datasets
Month Mean Median Mode Q1 Q3
01/2011 141.37 129.68 40.91 56.07 210.41
01/2012 87.18 75.39 97.22 49.61 113.61
Overall, the average PM10 concentrations range from
85 to 140 ug/m3. This is close to the QCVN 05:2013/BTNMT
standard, which sets the air quality limit for PM10 in Vietnam
at 150 ug/m3. In general, the statistical indicators of
PM10 in the second dataset are mostly lower than those
of the first dataset.
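As an illustration of the Data overview step, indicators like those in Table 3 could be computed along the following lines. This is a minimal sketch rather than the tool used in the study: the function name `summarize`, the use of `None` for missing records and the quartile method (Python's default "exclusive" method) are assumptions, so quartiles obtained this way may differ slightly from those in Table 3.

```python
import statistics


def summarize(values):
    """Return mean, median, mode, Q1 and Q3 of the non-missing values.

    Missing records are represented by None (an assumption of this sketch)
    and are ignored.
    """
    present = sorted(v for v in values if v is not None)
    # quantiles(..., n=4) returns the three quartile cut points.
    q1, _, q3 = statistics.quantiles(present, n=4)
    return {
        "mean": statistics.mean(present),
        "median": statistics.median(present),
        "mode": statistics.mode(present),
        "q1": q1,
        "q3": q3,
    }


# Hypothetical hourly PM10 values with one missing record.
summary = summarize([10, 20, None, 20, 30, 40])
```

Comparing such summaries between months, or against values reported in the literature, is one way to prescreen a dataset before the noise detecting step.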
A previous study conducted in Hanoi showed that the
averages of monitoring indicators are often higher in winter and
lower in summer [1]. The maximum PM10 value is often
observed in the period from October to January, with average
PM10 values ranging from 100 to 150 ug/m3. This is similar to
the statistics above.
The same study also showed the evolution of air
pollution levels in 05/2003 and 09/2003 in Hanoi [1]. During
these periods, air pollution levels tended to rise during the peak
hours of 7:00-9:00 and 18:00-20:00. Furthermore, the highest peaks of
air pollution in the morning are often similar to those in
the evening. This is because of the high volume of vehicles
during peak hours every day. The average PM10 values for each hour, calculated from the data of each month, agree with this general trend (Fig 2). Similar evaluations were applied to other indicators such as NO2, SO2 and CO. The results showed that the two datasets are reliable and follow the general trends reported in the literature. This supports carrying out the following steps.
Fig 1 Data processing framework proposed (flowchart: Start, Data collection, Data overview, Noise detecting, Fill in missing, Finish; each step has an Evaluation with "Next" and "Problems" outcomes)
Fig 2 PM10 daily trend in 01/2011 and 01/2012
B Noise detecting
Noise detecting aims to detect potentially abnormal
data on a daily basis. It can be based on constructing a reliable
data range, on correlation analysis, or on a combination of both
methods.
A confidence interval can be used to determine a
reliable range of values, which is then used to flag noisy data.
This method requires analysts with long experience of
working with observational data in order to
construct a good data range. Based on research and
environmental reports [2, 3, 4, 5, 6, 7], we propose a
reliable range for PM10 of [0-400] ug/m3. Applying the
proposed range to the training dataset yields 4 potentially
abnormal records, as listed in Table 4:
Table 4 List of records with values outside the confidence
interval in 01/2011
Datetime Observed value of PM10 (ug/m3)
12/01/2011 10:00 490
17/01/2011 08:00 420.656
17/01/2011 17:00 462.044
17/01/2011 18:00 425.139
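The range check itself is simple to automate. The sketch below assumes records are available as (timestamp, value) pairs with `None` for missing values; the function name `flag_out_of_range` and the sample list are illustrative, with two values taken from Table 4 and the others made up.

```python
def flag_out_of_range(records, low=0.0, high=400.0):
    """Return the (timestamp, value) pairs whose value lies outside [low, high].

    Missing values (None) are skipped here; they are handled by the
    missing filling step rather than by noise detection.
    """
    return [
        (timestamp, value)
        for timestamp, value in records
        if value is not None and not (low <= value <= high)
    ]


sample = [
    ("12/01/2011 09:00", 180.0),    # hypothetical in-range value
    ("12/01/2011 10:00", 490.0),    # value listed in Table 4
    ("12/01/2011 11:00", None),     # hypothetical missing record
    ("17/01/2011 08:00", 420.656),  # value listed in Table 4
]
flagged = flag_out_of_range(sample)
```

Flagged records are only candidates: as described above, the final decision on whether a value is abnormal is left to the analysts.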
Correlation analysis is another way to detect abnormal data. We propose to detect potentially abnormal data by analyzing the correlation between daily data and monthly average data. First, the average value for each hour of the day is calculated from all values observed at that hour during the month. Thus, for each month, 24 average PM10 values corresponding to the 24 hours of the day are obtained. These values are taken to represent the daily trend of PM10 for the month. Correlation analysis is then conducted between the PM10 data measured on a particular day of the month and the average PM10 values. If the correlation coefficient is low, the data for that day is considered noisy or abnormal. Specifically, days with a correlation coefficient in the range [-0.3; 0.3] are filtered out as potentially abnormal for further analysis and evaluation. This range was chosen because it corresponds to negligible correlation according to Mukaka [14]. In addition, evaluating abnormal data requires professional experience in meteorology and the environment, plus further assessment of the origin of the pollution, such as traffic, industrial zones and the surrounding area at the time of measurement, and of the status of the measurement equipment at that time. Applying the range [-0.3; 0.3] to the training dataset yields 8 days with a low correlation coefficient, as listed in Table 5.
Table 5 List of dates with a low correlation coefficient
between daily and monthly average data in the training dataset
Observation date   Correlation coefficient
03/01/2011   -0.2829
04/01/2011   0.2108
09/01/2011   -0.0953
11/01/2011   0.1110
13/01/2011   0.1502
17/01/2011   0.2299
19/01/2011   -0.2411
23/01/2011   -0.0405
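The correlation-based screening described above can be sketched as follows. The data layout (a dictionary mapping each date to its list of hourly values, with `None` for missing hours), the function names and the rule of skipping days with fewer than three usable hours are assumptions of this sketch; the Pearson coefficient and the [-0.3; 0.3] range follow the text.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient, or None if either series is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None
    return sxy / math.sqrt(sxx * syy)


def monthly_hourly_profile(days):
    """Average the observed values at each hour over all days of the month."""
    hours = len(next(iter(days.values())))
    profile = []
    for h in range(hours):
        observed = [series[h] for series in days.values() if series[h] is not None]
        profile.append(sum(observed) / len(observed) if observed else None)
    return profile


def flag_low_correlation_days(days, low=-0.3, high=0.3):
    """Return {date: r} for days whose correlation with the profile is in [low, high]."""
    profile = monthly_hourly_profile(days)
    flagged = {}
    for date, series in days.items():
        # Use only the hours where both the day and the profile have a value.
        pairs = [(v, p) for v, p in zip(series, profile)
                 if v is not None and p is not None]
        if len(pairs) < 3:
            continue  # too few points for a meaningful coefficient
        r = pearson([v for v, _ in pairs], [p for _, p in pairs])
        if r is not None and low <= r <= high:
            flagged[date] = r
    return flagged
```

With real data each list would hold 24 hourly PM10 values. Note that a strongly negative coefficient (below -0.3) is not flagged by this rule, which mirrors the range stated in the text.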
C Filling missing
Previous studies show that some environmental indicators are significantly correlated [9, 13]. This means that missing PM10 data can be recovered from suitable environmental parameters by constructing linear regression models.
In this study, we build linear regression models using the training dataset. The models are used to predict missing PM10 values in the testing dataset. Table 6 shows the correlations between PM10 and other environmental indicators derived from the training dataset:
Table 6 Correlation between PM10 and other
environmental indicators in the training dataset
Indicator Correlation Indicator Correlation
WindSpd 0.04982 InnerTemp 0.02089
WindDir 0.03815 NO 0.23985
Temp 0.08365 NO2 0.59005
RH 0.34409 SO2 0.53962
Barometer 0.03855 CO 0.44486
Radiation -0.0124 O3 0.09338
From the table, three indicators have the highest
correlation with PM10: NO2, SO2 and CO. Seven
linear regression models, one for each non-empty combination
of the three parameters, are constructed to predict PM10.
As described before, 15 records in the training dataset
have missing PM10 values. To ensure completeness
of the data used to build the regression models, these records are
removed, resulting in a training dataset containing 725 records.
After building the seven linear regression models, we validated the
PM10 values predicted by each model against the actual PM10
values. Assuming that 100% of PM10 is missing,
R2, RMSE and MAPE are used to quantitatively assess the
performance of each model (Table 7).
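The seven models correspond to the seven non-empty subsets of {SO2, NO2, CO}. A minimal sketch of how such models could be fitted is given below. It assumes ordinary least squares solved through the normal equations, records stored as dictionaries with `None` for missing values, and the hypothetical function names `solve_linear_system`, `fit_ols` and `fit_all_subsets`; the paper does not specify the fitting software or estimator beyond "linear regression".

```python
from itertools import combinations


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system; predictors may be collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_ols(rows, predictors, target="PM10"):
    """Return [intercept, coef_1, ...] minimizing the squared error."""
    design = [[1.0] + [row[p] for p in predictors] for row in rows]
    y = [row[target] for row in rows]
    k = len(predictors) + 1
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * t for d, t in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def fit_all_subsets(rows, candidates=("SO2", "NO2", "CO"), target="PM10"):
    """Fit one model per non-empty subset of the candidate predictors."""
    # Keep only records where the target and all candidates are present.
    complete = [r for r in rows
                if all(r.get(k) is not None for k in (*candidates, target))]
    models = {}
    for size in range(1, len(candidates) + 1):
        for subset in combinations(candidates, size):
            models[subset] = fit_ols(complete, subset, target)
    return models
```

For three candidate predictors this yields 3 + 3 + 1 = 7 models, keyed by the tuple of predictors they use.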
Table 7 Validation results of the 7 linear regression models on the
training dataset (100% PM10 missing)
Parameters for model   R2   RMSE*   MAPE*
SO2, NO2   0.43   67.9   68.9
* Predicted PM10 and Actual PM10
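For reference, the two error metrics used in the validation can be written as below, using their standard definitions. The function names are illustrative, and the MAPE function assumes that no actual value is zero.

```python
import math


def rmse(actual, predicted):
    """Root Mean Square Error, in the same unit as the data (ug/m3 for PM10)."""
    n = len(actual)
    return math.sqrt(sum((p - a) ** 2 for a, p in zip(actual, predicted)) / n)


def mape(actual, predicted):
    """Mean Absolute Percentage Error in percent; actual values must be non-zero."""
    n = len(actual)
    return 100.0 / n * sum(abs((a - p) / a) for a, p in zip(actual, predicted))


# Hypothetical actual and predicted PM10 values.
example_rmse = rmse([100, 200, 400], [110, 180, 400])
example_mape = mape([100, 200, 400], [110, 180, 400])
```

RMSE penalizes large individual errors more strongly, while MAPE expresses the error relative to the measured concentration, which is why both are reported.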
Based on the results, a priority is assigned to each model
for application to real-life problems, as shown in Table 8:
Table 8 Models ordered by priority
Parameters for model   Linear regression equation   Priority
SO2, NO2, CO   Y = -8.98 + 2.02*SO2 + 1.35*NO2 + 0.011*CO   1
SO2, NO2       Y = 0.79 + 1.87*SO2 + 1.80*NO2   2
SO2, CO        Y = -1.95 + 2.59*SO2 + 0.028*CO   3
NO2, CO        Y = 20.5 + 2.51*NO2 - 0.0004*CO   4
NO2            Y = 20.2 + 2.5*NO2   5
SO2            Y = 52.9 + 3.01*SO2   6
CO             Y = 42.5 + 0.04*CO   7
In practice, NO2, SO2 and CO are not always available.
They can be missing just like PM10. Therefore, deciding which
model to use requires an understanding of the data to be recovered. Table 9 lists the suggested models for different states of the data.
Table 9 Cases of missing data and suggested linear regression
models
Record status   Linear regression model number
Records with SO2, NO2 and CO all available   1
Records missing CO   2
Records missing SO2, NO2   7
Records missing SO2, CO   5
Records missing NO2, CO   6
Records missing all SO2, NO2, CO   Cannot predict
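The selection rule in Tables 8 and 9 amounts to using the highest-priority model whose inputs are all present. A sketch is given below; the coefficients are copied from Table 8, while the record layout (a dictionary with `None` or absent keys for missing values) and the name `predict_pm10` are assumptions.

```python
# (intercept, {predictor: coefficient}) in priority order, from Table 8.
MODELS_BY_PRIORITY = [
    (-8.98, {"SO2": 2.02, "NO2": 1.35, "CO": 0.011}),
    (0.79, {"SO2": 1.87, "NO2": 1.80}),
    (-1.95, {"SO2": 2.59, "CO": 0.028}),
    (20.5, {"NO2": 2.51, "CO": -0.0004}),
    (20.2, {"NO2": 2.5}),
    (52.9, {"SO2": 3.01}),
    (42.5, {"CO": 0.04}),
]


def predict_pm10(record):
    """Return (model number, predicted PM10) using the first applicable model.

    Returns (None, None) when SO2, NO2 and CO are all missing.
    """
    for number, (intercept, coefs) in enumerate(MODELS_BY_PRIORITY, start=1):
        if all(record.get(name) is not None for name in coefs):
            value = intercept + sum(c * record[name] for name, c in coefs.items())
            return number, value
    return None, None


# Hypothetical record in which SO2 is missing: model 4 (NO2, CO) applies.
number, value = predict_pm10({"SO2": None, "NO2": 20.0, "CO": 1000.0})
```

Walking the list in priority order reproduces the rows of Table 9, for example model 2 when only CO is missing and model 7 when both SO2 and NO2 are missing.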
Next, we validated the models on the testing dataset. This dataset has no missing PM10, CO or NO2 values, but SO2 is missing in 170 of 744 records. This makes it a good basis for the assessment.
Assuming 100% of PM10 data is missing from the testing dataset, the correlation coefficient between predicted and actual PM10 is 0.56, and RMSE and MAPE are 51 ug/m3 and 45% respectively (Table 10). This result is acceptable because it ensures data completeness and R, RMSE and MAPE are at a medium level. The MAPE value in this case is smaller than the MAPE in Table 7. We attribute this to two linear regression models being applied to predict PM10, which gives a lower error rate than using only one model.
Table 10 Results after filling missing PM10 in the testing dataset,
assuming that 100% of PM10 data is missing in 01/2012
Number of records {NO2, SO2, CO}   Number of records {NO2, CO}   Correlation coefficient*   RMSE*   MAPE*
574   170   0.56   51.4   45.3
* Predicted PM10 and Actual PM10
A test to evaluate the impact of the missing data rate was also performed. PM10 missing rates of 10%, 20%, 30%, 40% and 50% were assumed. For each missing rate, 10 datasets were randomly generated. The average results of the 10 assessments performed on these 10 datasets are reported in Table 11.
Table 11 PM10 missing filling results, considering different
missing rates in the testing dataset
Missing per cent 10% 20% 30% 40% 50%
Total records 744
Correlation* 0.94 0.91 0.86 0.78 0.75
RMSE* 15.75 20.77 24.92 34.29 36.8
MAPE* 4.93 9.10 13.46 18.04 23.06
* Predicted PM10 and Actual PM10
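The missing-rate experiment can be sketched as a small simulation. The paper does not state whether the metrics are computed over all 744 records or only over the removed ones; this sketch assumes the former, with observed values left untouched, which would be consistent with MAPE growing roughly in proportion to the missing rate in Table 11. The function name, the fixed seed and the input layout (two equally long lists of actual and model-predicted PM10, with non-zero actual values) are also assumptions.

```python
import math
import random


def simulate_missing(actual, predicted, rate, trials=10, seed=0):
    """Average RMSE and MAPE over random trials at a given missing rate.

    In each trial a random fraction `rate` of the records is treated as
    missing and replaced by the model prediction; all other records keep
    their observed value. Metrics are computed over the whole series.
    """
    rng = random.Random(seed)
    n = len(actual)
    k = round(rate * n)
    rmse_values, mape_values = [], []
    for _ in range(trials):
        masked = set(rng.sample(range(n), k))
        filled = [predicted[i] if i in masked else actual[i] for i in range(n)]
        squared = sum((f - a) ** 2 for f, a in zip(filled, actual))
        relative = sum(abs(f - a) / a for f, a in zip(filled, actual))
        rmse_values.append(math.sqrt(squared / n))
        mape_values.append(100.0 * relative / n)
    return sum(rmse_values) / trials, sum(mape_values) / trials


# Hypothetical series: every prediction is 10% too high.
avg_rmse, avg_mape = simulate_missing([100.0] * 10, [110.0] * 10, rate=0.5)
```

Averaging over several random masks, as done in the study with 10 datasets per rate, reduces the dependence of the result on which particular records happen to be removed.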
In general, the results differ significantly across missing rates.
Specifically, when 10% of PM10 is missing, R, RMSE and MAPE
are 0.94, 15.75 ug/m3 and 4.9%, the best recovery
result in the test. With 20% to 30% of the data missing,
MAPE and RMSE range from 9% to 13% and from 20 to 25 ug/m3
respectively. When the missing rate is higher than 30%,
MAPE increases from 18% to 23%. The worst
case is the 50% missing rate, with an RMSE of 36.8 ug/m3 and R =
0.75. From these results, it is better to perform the
missing filling process when the missing rate is 30% or
less. Overall, these results show the potential of applying the
method to real-life problems.
V CONCLUSION
In this paper, we propose a framework for processing automatic
environmental data, from data collection to building a dataset
with a standardized structure and acceptable
quality. The proposed workflow consists of several stages:
data collection, data overview, noise detecting,
missing filling and evaluation. Different techniques at each
stage are introduced and experimentally evaluated.
Although the framework describes an overall process,
analysts can customize every step. The
framework still has some unresolved issues, including the need for
historical knowledge for noise removal and the dependence on the
completeness of other environmental indicators to estimate missing
PM10 values. In the future, we plan to exploit meteorological or
weather stations in the same area to obtain more environmental
indicators and improve the overall quality of the workflow.
ACKNOWLEDGMENT
REFERENCES
[1] Pham Duy Hien. Current status and laws of changes of air quality in Hanoi, 03/2006.
[2] The Ministry of Natural Resources and Environment. Vietnam National environmental report in 2013.
[3] The Ministry of Natural Resources and Environment. Vietnam National environmental report in 2010.
[4] Ngo Tho Hung. Urban Air Quality Modelling and Management in Hanoi, Vietnam. PhD Thesis, Aarhus University, 2010.
[5] Clean Air Initiative for Asian Cities (CAI-Asia) Center. Viet Nam: Air Quality Profile, 2010 Edition.
[6] Cao Dung Hai, Nguyen Thi Kim Oanh. Effects of local, regional meteorology and emission sources on mass and compositions of particulate matter in Hanoi. Atmospheric Environment, Volume 78, October 2013, Pages 105–112.
[7] Nguyen Tran Huong Giang, Nguyen Thi Kim Oanh. Roadside levels and traffic emission rates of PM2.5 and BTEX in Ho Chi Minh City, Vietnam. Atmospheric Environment, Volume 94, September 2014, Pages 806–816.
[8] Dang Manh Doan, Tran Thi Dieu Hang, Phan Ban Mai. Institute of Meteorology, Hydrology and Environment. The situation of air pollution in Hanoi and recommendations to reduce pollution, 2007.
[9] Jung-Moon Yoo, Yu-Ri Lee, Dongchul Kim, Myeong-Jae Jeong, William R Stockwell, Prasun K Kundu, Soo-Min Oh, Dong-Bin Shin, Suk-Jo Lee. New indices for wet scavenging of air pollutants (O3, CO, NO2, SO2, and PM10) by summertime rain. Atmospheric Environment, Volume 82, January 2014, Pages 226–237.
[10] Ping Wang, Junji Cao, Xuexi Tie, Gehui Wang, Guohui Li, Tafeng Hu, Yaoting Wu, Yunsheng Xu, Gongdi Xu, Youzhi Zhao, Wenci Ding, Huikun Liu, Rujin Huang, Changlin Zhan. Impact of Meteorological Parameters and Gaseous Pollutants on PM2.5 and PM10 Mass Concentrations during 2010 in Xi'an, China. Aerosol and Air Quality Research, 15: 1844–1854, 2015.
[11] Gharehchahi E, Mahvi AH, Amini H, Nabizadeh R, Akhlaghi AA, Shamsipour M, et al. Health impact assessment of air pollution in Shiraz, Iran: a two-part study. J Environ Health Sci Eng 2013; 11: 1–8.
[12] Brauer M, Amann M, Burnett RT, Cohen A, Dentener F, Ezzati M, et al. Exposure assessment for estimation of the global burden of disease attributable to outdoor air pollution. Environ Sci Technol 2012; 46: 652–660.
[13] Dragan M Marković, Dragan A Marković, Anka Jovanović, Lazar Lazić, Zoran Mijić. Determination of O3, NO2, SO2, CO and PM10 measured in Belgrade urban area. Environmental Monitoring and Assessment, October 2008, Volume 145, Issue 1, pp 349-359.
[14] M M Mukaka. A Guide to Appropriate Use of Correlation Coefficient in Medical Research. Malawi Medical Journal, Vol 24, No 3, 2012, pp 69-71.