TABLE OF CONTENTS
Acknowledgement
List of figures
Section 1: Introduction
Introduction
Rationale
Object and the range of study
Aim of the study
Research method
Section 2: Time series:
Theoretical basis
Time Series Decomposition
ACF and PACF:
ARIMA MODEL:
Fit model:
Section 3: Application
Load libraries and data:
Import data:
Data cleaning:
Convert data to a time series object:
Time series decomposition
Test stationarity:
ADF test:
Autocorrelation (ACF & PACF)
Remove trend and seasonal effect
ADF test:
ACF & PACF test:
FIT model:
ARIMA Model
Forecast
Section 4: Conclusion
Section 5: R Code
Reference
Acknowledgement
First of all, we would like to express our deep appreciation to Professor Nguyen Tien Dung for giving us the opportunity to work with RStudio, an important software tool for statistical research. We are also grateful that you have conveyed an abundant amount of knowledge about Probability and Statistics to us. This has been a great chance for us to operate RStudio. The software not only broadens our knowledge but also gives us ideas for future projects.
List of figures
Figure 1: Time series of the data
Figure 2: Time series decomposition
Figure 3: ACF diagram with trend and seasonality
Figure 4: PACF diagram with trend and seasonality
Figure 5: ACF diagram without trend and seasonality
Figure 6: PACF diagram without trend and seasonality
Figure 7: Linear regression model of the data
Figure 8: Diagrams of different analyses of residuals for model selection
Figure 9: Histogram and Q-Q Plot of the residuals
Figure 10: Time series forecast
3. Test the stationarity
4. Fit a model using an automated algorithm
5. Calculate forecasts
Rationale
We chose this topic because CO2 makes up 77% of greenhouse gas emissions and is the fourth most abundant gas in the Earth's atmosphere. In a normal concentration range, it is a harmless gas with no color or smell. In order to reduce pollution, analysis is necessary: by analyzing the data, we can forecast the future trend of CO2. Consequently, using information about the rate at which CO2 is rising, we may determine techniques to minimize the amount of CO2 in the air.
Object and the range of study
We choose to analyze the atmospheric CO2 levels at Mauna Loa, Hawaii. At the Mauna Loa Observatory, the atmospheric carbon dioxide concentration displays a yearly pattern that is remarkably consistent year after year. The amplitude of this seasonal signal can be expressed either as peak-to-peak concentration fluctuations or as a series of harmonic terms. Moreover, the topic relates to our specialized skills in analyzing chemicals in the environment around us.
We use time series in this topic because we want to analyze a series of data measured at specific moments in time, and then use the trend of the data to predict its trend in the future.
Aim of the study
A thorough investigation of the calibration procedures and data analysis techniques used throughout this lengthy record fails to find any discrepancies that are significant enough to account for the increase. It is likely that at least some of the increase is a result of rising plant activity, because the northern hemisphere's yearly cycle of CO2 is assumed to be primarily caused by the metabolic activity of terrestrial plants.
The purpose of time-series data mining is to extract all meaningful knowledge from the shape of the data. Even if humans have a natural capacity to perform these tasks, it remains a complex problem for computers. In this article we intend to provide a survey of the techniques applied for time-series data mining. The first part is devoted to an overview of the tasks that have captured most of the interest of researchers. Considering that in most cases a time-series task relies on the same components for implementation, we divide the literature according to these common aspects, namely representation techniques, distance measures, and indexing methods.
Time Series Decomposition
We can decompose the time series into trend, seasonal, and error components.
The additive model is:
Y(t) = T(t) + S(t) + e(t)
where:
Y(t) is the concentration of CO2 at time t,
T(t) is the trend component at time t,
S(t) is the seasonal component at time t,
e(t) is the random error component at time t.
Classical decomposition of a time series is performed using the decompose function. In the decomposed plots we can again see the trend and seasonality as inferred previously, but we can also observe the estimate of the random component depicted under the "remainder".
ACF and PACF:
In order to test the stationarity of the time series, let's run the Augmented Dickey-Fuller test using the adf.test function.
First, set up the hypothesis test:
The null hypothesis H0: the time series is non-stationary.
The alternative hypothesis HA: the time series is stationary.
When the p-value is less than 5%, we have strong evidence against the null hypothesis, so we reject it. In this case, if the test result is greater than 0.05, we fail to reject the null hypothesis that the time series is non-stationary.
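A minimal sketch of this test in R (our own illustration, not code from the report), supposing the series is stored in a ts object named co and the tseries package is available:
library(tseries)   # provides adf.test
adf.test(co)       # H0: non-stationary; a small p-value rejects H0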
A stationary time series has the property that its mean, variance, and autocovariance are not functions of time. In order to fit ARIMA models, the time series is required to be stationary. We will use two methods to test the stationarity.
Another way to test for stationarity is to use autocorrelation. We will use the autocorrelation function acf and the partial autocorrelation function pacf. These functions plot the correlation between a series and its lags, i.e., previous observations, with a 95% confidence interval shown in blue. If the autocorrelation crosses the dashed blue line, it means that the specific lag is significantly correlated with the current series.
Autocorrelation Function (ACF) and Partial Autocorrelation Function (PACF): the ACF and PACF are used to figure out the order of AR, MA, and ARMA models. The ACF and PACF plots can be obtained from the original data as well as from the residuals of a model. On the original data, these plots can help detect any autoregressive or moving average terms that may be significant in the time series. When applied to the residuals, these plots can detect any remaining autocorrelation in the model. This also provides insight into whether additional AR or MA terms need to be included in the model. Similarly, they can detect any seasonal behaviour that must be accounted for in the model.
Differencing is a commonly used technique to remove non-stationarity. This differencing is called the integration part in AR(I)MA. Now we have three parameters:
p represents AR
d represents I
q represents MA
An autoregressive (AR(p)) component refers to the use of past values in the regression equation for the series Y. The autoregressive parameter p specifies the number of lags used in the model.
The d represents the degree of differencing in the integrated (I(d)) component. Differencing a series involves simply subtracting its current and previous values d times.
A moving average (MA(q)) component represents the error of the model as a linear combination of previous error terms e(t). The order q determines the number of terms to include in the model.
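As a quick hedged illustration (assuming a ts object named co; the orders are only for demonstration), differencing in R is done with the base diff function:
diff(co)                    # first difference, d = 1
diff(co, differences = 2)   # second difference, d = 2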
Seasonality can easily be incorporated into the ARIMA model directly.
ARIMA stands for AutoRegressive Integrated Moving Average. It is specified by three ordered parameters (p, d, q):
p is the order of the autoregressive model (number of time lags)
d is the degree of differencing (number of times the data have had past values subtracted)
q is the order of the moving average model
Due to the fact that our time series exhibits seasonality, we will actually use a model called SARIMA, that is, as the name suggests, a seasonal ARIMA. We write SARIMA as ARIMA(p,d,q)(P,D,Q)m, with the terms listed below (a short code sketch of this notation follows the list):
p — the number of autoregressive terms
d — the degree of differencing
q — the number of moving average terms
m — the number of periods in each season
(P, D, Q) — the (p,d,q) parameters for the seasonal part of the time series
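As a hedged illustration of this notation in R (the orders below are hypothetical and not the model fitted later; a monthly ts object named co and the Arima function from the forecast package are assumed):
library(forecast)
# ARIMA(1,1,1)(0,1,1)[12]: p=1, d=1, q=1, P=0, D=1, Q=1, m=12 (m taken from the monthly frequency)
Arima(co, order = c(1, 1, 1), seasonal = c(0, 1, 1))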
We use the auto.arima function to fit the best model and coefficients, given the default parameters, including seasonality set to TRUE.
ARIMA models are frequently used for forecasting, for example when estimating future atmospheric CO2 concentrations. This gives managers solid parameters to rely on when making judgments about how to limit pollution. Based on historical data, ARIMA models can also be used to forecast how much CO2 our environment will contain in the future.
Fit model:
Model fitting is a measure of how well a model generalizes to data similar to that on which it was trained. A well-fitted model produces more accurate outcomes; an overfitted model matches the training data too closely, while an underfitted model does not match it closely enough.
Because a model that generalizes well closely approximates the outcome for unseen inputs, good fitting is what allows us to make predictions and classify new data. The process of fitting a model involves adjusting its parameters in order to increase its accuracy, using data for which the target variable is known ("labeled" data). In our case, we decline to use a linear model because it does not capture the seasonality and additive effects over time.
Section 3: Application
Load libraries and data:
The first thing to do is to load the data set we will use. This data set contains observations on the concentration of carbon dioxide (CO2) in the atmosphere made at Mauna Loa from 1958 to 2020. A built-in version of this data set also exists in R and can be loaded via the data function.
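A minimal sketch of the library setup (our assumption, based on the functions used later in this section):
library(ggplot2)    # ggplot graphics, geom_smooth
library(forecast)   # auto.arima, forecast, autoplot methods for time series
library(ggfortify)  # autoplot for decomposed series, ggtsdiag diagnostics
library(tseries)    # adf.test (Augmented Dickey-Fuller test)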
Convert data to a time series object:
co <- ts(ts1$val, start = 1958, end = 2020, frequency = 12)
Figure 1: Time series of the data
According to the graph, the variance of the CO2 concentration remains relatively constant throughout the survey period, which suggests an additive model. An additive model is one in which the time series is the sum of trend, seasonality, and remainder.
Time series decomposition
decomposeCO2 <- decompose(co,"additive")
autoplot(decomposeCO2)
- autoplot is a generic function to visualize various data objects
- it tries to give better default graphics and customized choices for each data type, making it quick and convenient to explore the data
- decompose decomposes a time series into seasonal, trend, and irregular components using moving averages; it deals with an additive or multiplicative seasonal component
In this case, we use the additive model.
The additive model: Y(t) = T(t) + S(t) + e(t), where:
Y(t) is the concentration of CO2 at time t,
T(t) is the trend component at time t,
S(t) is the seasonal component at time t,
e(t) is the random error component at time t.
Figure 2: Time series decomposition
Based on the graph of the trend, it can be seen that there is an upward trend. The fact that the seasonality panel displays the same frequency and magnitude throughout solidifies the appropriateness of the additive model. Besides, since the mean value of the remainder is around 0, there is little to no correlation between the random values, confirming the fit of the model.
Autocorrelation (ACF & PACF)
autoplot(acf(co, plot = FALSE)) + labs(title = "Correlogram of CO2 from 1958 to 2020")
autoplot(pacf(co, plot = FALSE)) + labs(title = "Correlogram of CO2 from 1958 to 2020")
Figure 3: ACF diagram with trend and seasonality
Figure 4: PACF diagram with trend and seasonality
The ACF and PACF figures show that the lags of the ACF series decrease gradually, whilst the lags of the PACF series die out quickly. Hence, the series is most likely non-stationary.
Remove trend and seasonal effect
autoplot(acf(diff(log(co)), plot = FALSE)) + labs(title = "Correlogram of CO2 from 1958 to 2020")
autoplot(pacf(diff(log(co)), plot = FALSE)) + labs(title = "Correlogram of CO2 from 1958 to 2020")
diff(log(co)): removes the trend and seasonal effect to create a new series for testing.
Figure 5: ACF diagram without trend and seasonality
Figure 6: PACF diagram without trend and seasonality
After removing the trend and the seasonality, the two graphs show only a few significant lags that die out quickly, which is evidence that the series is most likely stationary.
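The stationarity of the transformed series can also be rechecked with the ADF test listed in the table of contents; a minimal sketch, assuming the tseries package is loaded:
adf.test(diff(log(co)))   # a small p-value supports stationarity of the transformed series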
FIT model:
Since there is an upward trend, we will look at a linear model first for comparison. We plot the raw data set with a linear model.
autoplot(co) + geom_smooth(method = "lm") + labs(x = "Year", y = ylab, title = "Mauna Loa CO2 (PPM) from 1958 to 2020")
Figure 7: Linear regression model of the data
This may not be the best model to fit, as it does not capture the seasonality and additive effects over time.
ARIMA Model
arimaCO2 <- auto.arima(co)
arimaCO2
geom_smooth() adds a trend line over an existing plot.
method = "lm" uses a linear regression model to specify the trend line.
We use the auto.arima function to fit the best model and coefficients, given the default parameters, including seasonality set to TRUE.
Figure 8: Diagrams of different analyses of residuals for model selection
The figure above shows the ACF of the residuals for the model. The "lag" (time span between observations) is shown along the horizontal axis, and the autocorrelation is on the vertical axis. The lines indicate bounds for statistical significance. The residual plot appears to be centered around 0 like noise, with no pattern. The Ljung-Box test shows a fairly high p-value, so the SARIMA model is a fairly good fit.
SARIMA is the best model for forecasting.
ggtsdiag: plots time series diagnostics.
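A minimal sketch of how these diagnostics could be produced (assuming the fitted object arimaCO2 from above and the ggfortify package; the Ljung-Box statistic can also be checked directly with Box.test):
ggtsdiag(arimaCO2)                                            # standardized residuals, ACF of residuals, Ljung-Box p-values
Box.test(residuals(arimaCO2), lag = 24, type = "Ljung-Box")   # portmanteau test on the residuals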
qqnorm(residuals(arimaCO2))
qqline(residuals(arimaCO2))
Figure 9: Histogram and Q-Q Plot of the residuals
The remainder data is approximately normally distributed, so we can conclude that this model fits our data well, since all of the correlation between the data points in the data set has been taken into account.
Based on the graph, we see that the residuals are mostly concentrated along the expected normal distribution line, so the assumption of normally distributed residuals is considered satisfied.
The second graph (Normal Q-Q) plots the normalized error values, allowing us to test the assumption that the residuals are normally distributed.
level of confidence: 95%
h: forecast horizon, in monthly periods
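A minimal sketch of the forecasting step under these settings (assuming the fitted object arimaCO2 and a hypothetical 24-month horizon, i.e., the two years discussed below):
forecastCO2 <- forecast(arimaCO2, level = 95, h = 24)   # 95% interval, 24 monthly periods ahead
autoplot(forecastCO2)                                   # plot the series together with the forecast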
Figure 10: Time series forecast
Based on what the forecast shows, a continual increase in CO2 concentration will be witnessed during the next two years.
Section 4: Conclusion
Thanks to the forecast, it is obvious that the concentration of CO2 on Earth will continue to rise without any human intervention. If we do not stop producing additional CO2, its concentration will eventually reach a level that threatens human survival. Therefore, it is our mission to cut down on the amount of CO2 from our industries that is released into the environment so as not to worsen the current situation. The reduction of CO2 can be accomplished by switching from fossil fuels to green alternative energy such as solar or wind power. By refraining from using gas-powered cars and turning to electric cars, we can also help to reduce the emission of CO2. Besides, the transition from commuting by car to bike also helps, as bikes do not emit CO2 and they also improve people's health. Governments play a vital role in this campaign, as they are the only ones capable of passing laws that force businesses to stop releasing too much CO2 by making them treat their polluted air waste before it is released into the environment.