Self organizing maps based hybrid approaches to short term load forecasting


Chapter 1 Introduction

This first chapter offers a general description of the short term load forecasting (STLF) problem and its significance for the power industry. The two main approaches to STLF, the statistical approach and the artificial neural networks approach, are then introduced and detailed, followed by the motivation for this thesis and its contributions. Finally, a bibliographic review of STLF methods from these two disciplines is presented, and the structure of the thesis is explained.

1.1 Load Forecasting

Load forecasting has always been an issue of major interest for the electricity industry. During the operation of a power system, the system response closely follows the load requirements, so when the load demand increases or decreases, the power generation has to be increased or decreased accordingly. To be able to provide this on-demand power generation, the electric utility operator needs to have a sufficient quantity of generation resources available. Thus, if the operator has some a priori knowledge of the future load requirements, he can optimally allocate the generation resources.

There are three kinds of load forecasting: short term, medium term, and long term forecasts. Utility operators need to perform all three, as they influence different aspects of the power supply chain. Short term load forecasts typically cover one hour to one week ahead, and are needed for the daily operation of the power system. Medium term forecasts typically cover one week to one year ahead, and are needed for fuel supply planning and maintenance. Long term load forecasts usually cover a period longer than a year, and are needed for power system planning.

1.2 Importance of Short Term Load Forecasting

Short term load forecasting (STLF) is the keystone of the operation of today’s power systems. Without access to good short term forecasts, it would be impossible for any electric utility to operate in an economical, reliable and secure manner. STLF provides the input data for load flow studies and contingency analysis, which utilities need to perform to calculate the generating requirements of each


1.3 Approaches to Short Term Load Forecasting

STLF methods, and more generally, time series prediction (TSP) methods, can be broadly divided into two categories: statistical methods and computational intelligence (CI) methods.

1.3.1 Statistical Methods

1.3.1.1 Time Series Models

Modern statistical methods for time series prediction can be said to have begun in 1927, when Yule came up with an autoregressive technique to predict the annual number of sunspots. According to this model, the next-step value is a weighted average of previous observations of the series. To model more interesting behavior from this linear system, outside intervention in the form of noise was introduced. For the next half-century, the reigning paradigm for predicting any time series remained that of a linear model driven by noise. The popular models developed during this period include moving average models, exponential smoothing methods, and the Box-Jenkins approach to modeling autoregressive moving average (ARMA) and autoregressive integrated moving average (ARIMA) models. These models, referred to together as time series models, assume that the data follows a stationary pattern, i.e., the series is normally distributed with a constant mean and variance over a long time period. They also assume that the series has uncorrelated random errors and contains no outliers.
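Yule's autoregressive idea can be sketched in a few lines: estimate the coefficient of an AR(1) model by least squares, then forecast the next value as a weighted value of the last observation. This is an illustrative sketch on synthetic data with an assumed true coefficient of 0.8; it is not the estimation procedure used later in this thesis.

```python
import random

def fit_ar1(series):
    """Least-squares estimate of phi in x[t] = phi * x[t-1] + noise."""
    num = sum(series[t - 1] * series[t] for t in range(1, len(series)))
    den = sum(series[t - 1] ** 2 for t in range(1, len(series)))
    return num / den

def predict_next(series, phi):
    """One-step-ahead forecast: a weighted value of the last observation."""
    return phi * series[-1]

# Synthetic AR(1) series with true phi = 0.8 (zero-mean Gaussian noise)
random.seed(0)
series = [0.0]
for _ in range(200):
    series.append(0.8 * series[-1] + random.gauss(0.0, 0.3))

phi_hat = fit_ar1(series)
forecast = predict_next(series, phi_hat)
```

With 200 observations, the estimate lands close to the true coefficient; longer series tighten it further.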

Applied to load forecasting, time series methods provide satisfactory results as long as the variables affecting the load demand, such as environmental variables, do not change suddenly. Whenever there is an abrupt change in such variables, the accuracy of the time series models suffers. Also, the assumption of stationarity of the load series is rather restrictive, and whenever the historical load data deviates significantly from this assumption, the forecasting accuracy decreases.

1.3.1.2 Regression Models

Regression methods are another popular tool for load forecasting. Here the load is modeled as a linear combination of relevant variables such as weather conditions and day type. Among the weather variables, temperature is usually the most important factor, though its importance depends on the kind of forecast and the type of climate; for STLF, for example, temperature effects might be more critical in tropical regions than in temperate ones. Typically, temperature is modeled in a nonlinear fashion. Other weather variables such as wind velocity, humidity and cloud cover can be included in the regression model to obtain higher accuracy. Clearly, no two utilities are the same, and a detailed case study of the different geographical, meteorological, and social factors affecting the load demand needs to be carried out before proceeding with regression methods. Once the variables have been determined, their coefficients can be estimated using least squares or other regression methods.
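A minimal sketch of this procedure: fit a toy load-regression model by least squares, with a squared temperature term standing in for the nonlinear temperature effect and a weekend flag standing in for day type. The data, features and coefficients below are synthetic, illustrative assumptions, not utility data.

```python
def solve(A, b):
    """Gaussian elimination with partial pivoting for a small square system."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def least_squares(X, y):
    """Solve the normal equations X^T X beta = X^T y."""
    n = len(X[0])
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(n)] for i in range(n)]
    Xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(n)]
    return solve(XtX, Xty)

# Features: intercept, temperature, temperature^2, weekend flag (hypothetical)
data = [(20, 0), (25, 0), (30, 0), (35, 0), (22, 1), (28, 1), (33, 1)]
X = [[1.0, t, t * t, w] for t, w in data]
# Synthetic load generated from known coefficients so the fit can be checked
true_beta = [100.0, -2.0, 0.1, -15.0]
y = [sum(b * f for b, f in zip(true_beta, row)) for row in X]

beta = least_squares(X, y)
```

Because the synthetic load is noise-free, the recovered coefficients match the generating ones up to rounding; with real load data the residuals would of course be nonzero.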

Though regression methods are popular tools for STLF among electric utilities, they have their share of drawbacks. The relationship between the load demand and the influencing factors is nonlinear and complex, and developing an accurate model is a challenge. On-site tests have shown that the performance of regression methods deteriorates when the weather changes abruptly, leading to load deviation [3]. This drawback occurs in particular because the model is linearized so as to obtain its coefficients. But the load patterns are nonlinear, so a linearized model fails to represent the load demand accurately during certain distinct time periods.

1.3.1.3 Kalman Filtering Based Models

Towards the end of the 1980s, as computers became more powerful, it became possible to record longer time series and apply more complex algorithms to them. Drawing on ideas from differential topology and dynamical systems, it became possible to represent a time series as being generated by deterministic governing equations. The Kalman filtering approach characterizes dynamical systems by a state-space representation. The theory of Kalman filtering provides an efficient computational (recursive) means to estimate the state of a process, in a way that minimizes the mean of the squared error. The filter supports estimation of past, present and even future states, and it can do so even when the precise nature of the modeled system is unknown [4]. A significant challenge in the use of Kalman filtering based methods is the estimation of the state-space model parameters.
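The predict-correct recursion described above can be sketched with the simplest possible case: a scalar random-walk state observed through noise. The state model and the noise variances below are illustrative assumptions, not parameters of any load model.

```python
def kalman_1d(measurements, q=1e-4, r=0.25, x0=0.0, p0=1.0):
    """Estimate a slowly varying level (random-walk state) from noisy data.
    q: process noise variance, r: measurement noise variance."""
    x, p = x0, p0
    estimates = []
    for z in measurements:
        # Predict: random-walk state, so uncertainty simply grows by q
        p = p + q
        # Correct: the Kalman gain blends prediction and measurement
        # in the proportion that minimizes the mean squared error
        k = p / (p + r)
        x = x + k * (z - x)
        p = (1.0 - k) * p
        estimates.append(x)
    return estimates

# Noisy readings of a true level of 5.0 (synthetic)
zs = [5.3, 4.8, 5.1, 4.9, 5.2, 5.0, 4.7, 5.1, 5.0, 4.95]
est = kalman_1d(zs)
```

The recursive structure is the point: each step needs only the previous estimate and its variance, not the whole history, which is what makes the filter computationally efficient.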

1.3.1.4 Non-linear Time Series Models

To overcome the limitations of the linear time series models, a second generation of non-linear statistical time series models has been developed. Some of these models, such as the autoregressive conditional heteroscedastic (ARCH) and generalized autoregressive conditional heteroscedastic (GARCH) models, attempt to model the variance of the time series as a function of its past values. These models achieved limited success for STLF, since they were mostly specialized for particular problems in particular domains, for example volatility clustering in financial indices.

Regime-switching models, developed first for econometrics, are gradually being applied successfully to STLF as well. As the name suggests, these models involve switching between a finite number of linear regimes. The models differ only in their assumptions about the stochastic process generating the regime.


i. The mixture of normal distributions model has state transition probabilities which are independent of the history of the regime. Compared to a single normal distribution, this approach is better able to model fatter-than-normal tails and skewness [5].

ii. In the Markov-switching model, the switching between two or more regimes is governed by a discrete-state homogeneous Markov chain [6]. In one possible formulation, the model can be divided into two parts: first, a regression model that regresses the model variable over hidden state variables, and second, an autoregressive model that describes the hidden state variables.

iii. In the threshold autoregressive (TAR) model [7][8], the switching between two or more linear autoregressive models is governed by an observable variable, called the threshold variable. In the case where this threshold variable is a lagged value of the time series, the model is called a self-exciting threshold autoregressive (SETAR) model.

iv. In the smooth transition autoregressive (STAR) model, the switching is governed by an observable threshold variable, similar to the TAR model, but a smooth transition between the two regimes is enforced.

As a few of these non-linear time series models form the basis of the hybrid models proposed in this work, they are explained in detail in Chapter 2.
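The contrast between a hard threshold (SETAR) and a smooth transition (STAR) can be sketched with two AR(1) regimes. The regime coefficients, threshold and transition slope below are illustrative assumptions chosen for clarity, not estimates from load data.

```python
import math

def setar_step(x_prev, c=0.0, low=(0.5, 0.9), high=(-0.3, 0.4)):
    """One-step SETAR forecast with two AR(1) regimes, each given as
    (intercept, AR coefficient). The lagged value x_prev is the threshold
    variable: crossing c switches the regime abruptly."""
    a, phi = low if x_prev <= c else high
    return a + phi * x_prev

def star_step(x_prev, c=0.0, gamma=5.0, low=(0.5, 0.9), high=(-0.3, 0.4)):
    """STAR variant: a logistic weight enforces a smooth transition between
    the same two regimes; larger gamma makes the switch sharper."""
    g = 1.0 / (1.0 + math.exp(-gamma * (x_prev - c)))  # 0 -> low, 1 -> high
    lo = low[0] + low[1] * x_prev
    hi = high[0] + high[1] * x_prev
    return (1.0 - g) * lo + g * hi
```

At the threshold itself, the SETAR forecast jumps between regimes, while the STAR forecast averages them; as gamma grows, `star_step` approaches `setar_step`.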

1.3.2 Computational Intelligence Methods

Deregulated markets and the constant need to improve the accuracy of load forecasting have forced electricity utility operators to focus much attention on computational intelligence based forecasting methods. It has been calculated in [9] that a reduction of 1% in forecasting error could save a utility up to $1.6 million annually. Computational intelligence techniques broadly fall into four classes: expert systems, fuzzy logic systems, neural networks and evolutionary computation systems. A brief introduction to these four approaches follows.


1.3.2.1 Expert Systems

An expert system is a computer program which simulates the judgment and behavior of a human or an organization with expert knowledge and experience in a particular field. Typically, an expert system comprises four parts: a knowledge base, a data base, an inference mechanism, and a user interface. For STLF, the knowledge base is typically a set of rules in IF-THEN form, and can consist of relationships between changes in the load demand and changes in factors which affect the use of electricity. The data base is typically a collection of facts provided by human experts in interviews, together with facts obtained using the inference mechanism of the system. The inference mechanism is the “thinking” part of the expert system, because it makes the logical decisions using the knowledge from the knowledge base and the information from the data base. Forward chaining and backward chaining are two popular reasoning mechanisms used by the inference mechanism [10].

In terms of advantages, expert systems can be used to make decisions when human experts are unavailable, thus reducing their work burden. When human experts retire, their knowledge can still be retained in these systems.

1.3.2.2 Fuzzy Logic Systems

Fuzzy systems are knowledge-based software environments constructed from a collection of linguistic IF-THEN rules, realizing a nonlinear mapping with the interesting mathematical properties of “low-order interpolation” and “universal function approximation”. These systems facilitate the design of reasoning mechanisms for partially known, nonlinear and complex processes.

A fuzzy logic system comprises four parts: a fuzzifier, a fuzzy inference engine, a fuzzy rule base and a defuzzifier. The system takes a crisp input value, which is fuzzified (i.e., converted into the corresponding membership grades in the input fuzzy sets) and then fed to the fuzzy inference engine. Using the stored IF-THEN fuzzy rules from the rule base, the inference engine produces a fuzzy output that undergoes defuzzification to yield a crisp output.
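The fuzzifier, inference and defuzzifier stages can be sketched with two rules and triangular memberships, using weighted-average (Sugeno-style) defuzzification for brevity. The membership shapes, temperatures and rule consequents are illustrative assumptions, not a load model from this thesis.

```python
def tri(x, a, b, c):
    """Triangular membership function peaking at b over the support [a, c]."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def forecast_load(temp):
    """Two toy rules: IF temp is MILD THEN load is 60;
                      IF temp is HOT  THEN load is 90."""
    mild = tri(temp, 10.0, 20.0, 30.0)   # fuzzify the crisp input
    hot = tri(temp, 25.0, 35.0, 45.0)
    rules = [(mild, 60.0), (hot, 90.0)]  # firing strength -> crisp consequent
    den = sum(w for w, _ in rules)
    if den == 0.0:
        return None                      # no rule fires for this input
    return sum(w * out for w, out in rules) / den  # defuzzify
```

A temperature of 20 fires only the MILD rule (output 60), 35 fires only HOT (output 90), and inputs in the overlap interpolate smoothly between the two.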


Fuzzy logic is often combined with other computational intelligence methods such as expert systems and neural networks.

1.3.2.3 Artificial Neural Networks (ANN)

Artificial neural networks are massively parallel, distributed processing systems built by analogy to the human neural network, the fundamental information processing system. Generally speaking, the practical use of neural networks has been recognized mainly because of such distinguished features as:

i. general nonlinear mapping between a subset of the past time series values and the future time series values;

ii. the capability of capturing essential functional relationships among the data, which is valuable when such relationships are not known a priori or are very difficult to describe mathematically, and/or when the collected observation data are corrupted.

The multilayer perceptron (MLP) is one of the most researched network architectures. It is a supervised learning neural architecture, and it has been very popular for time series prediction in general, and STLF in particular. This is because, in its simplest form, a TSP problem can be rewritten as a supervised learning problem, with the current and past values of the time series as the input values to the network, and the one-step-ahead value as the output value. This formulation allows one to exploit the universal function approximation and subsequent generalization capability of the MLP. The radial basis function (RBF) network is another popular supervised learning architecture which can be used for the same purposes.
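The reformulation just described amounts to a sliding window over the series: past values become the network inputs, the next value becomes the target. A minimal sketch on a toy series:

```python
def make_supervised(series, window):
    """Turn a time series into (inputs, targets) pairs for
    one-step-ahead supervised learning."""
    X, y = [], []
    for t in range(window, len(series)):
        X.append(series[t - window:t])  # current and past values as inputs
        y.append(series[t])             # one-step-ahead value as the output
    return X, y

series = [10, 12, 11, 13, 14, 13, 15]
X, y = make_supervised(series, window=3)
```

The resulting pairs can be fed to any supervised learner, an MLP or RBF network included; the only modeling choice made here is the window length.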

The Self-Organizing Map (SOM) is an important unsupervised learning neural architecture, based on the unsupervised competitive-cooperative learning paradigm. In contrast to the supervised learning methods, the SOM has not been popular for time series prediction or STLF. This is mostly because the SOM is traditionally viewed as a data vector quantization and clustering algorithm [12][13], less suitable for function approximation by itself. Hence, when used for TSP, the SOM usually appears in a hybrid model, where it is first used for clustering, and subsequently another function approximation method such as the MLP or support vector regression (SVR) is used to learn the function.
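A minimal sketch of this hybrid scheme: a tiny one-dimensional SOM first clusters input windows, then a trivial local model (here, simply the mean target per cluster) stands in for the MLP/SVR stage. The map size, learning schedule and toy data are illustrative assumptions.

```python
import math
import random

def train_som(data, n_nodes=3, epochs=50, lr=0.3, sigma=1.0, seed=1):
    """Train a 1-D SOM: competitive (best matching unit) plus
    cooperative (neighbourhood) updates, with decaying rates."""
    random.seed(seed)
    dim = len(data[0])
    nodes = [[random.uniform(0.0, 1.0) for _ in range(dim)] for _ in range(n_nodes)]
    for _ in range(epochs):
        for x in data:
            bmu = min(range(n_nodes),
                      key=lambda i: sum((nodes[i][d] - x[d]) ** 2 for d in range(dim)))
            for i in range(n_nodes):
                # neighbours of the winner move too, weighted by grid distance
                h = math.exp(-((i - bmu) ** 2) / (2.0 * sigma ** 2))
                for d in range(dim):
                    nodes[i][d] += lr * h * (x[d] - nodes[i][d])
        lr *= 0.95     # decay the learning rate
        sigma *= 0.95  # shrink the neighbourhood so nodes specialize
    return nodes

def bmu_index(nodes, x):
    return min(range(len(nodes)),
               key=lambda i: sum((nodes[i][d] - x[d]) ** 2 for d in range(len(x))))

# Toy input windows paired with the next value of the series (the target)
windows = [[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9], [0.5, 0.5]]
targets = [0.15, 0.15, 0.85, 0.85, 0.5]

nodes = train_som(windows)
# Local model per cluster: the mean of the targets mapped to that node
local = {}
for x, t in zip(windows, targets):
    local.setdefault(bmu_index(nodes, x), []).append(t)
local_mean = {k: sum(v) / len(v) for k, v in local.items()}
```

To forecast, a new window is routed to its best matching unit and the local model of that cluster produces the prediction; in a real hybrid, each cluster would get its own trained regressor rather than a mean.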

As the MLP and the SOM form the basis of the work proposed in this thesis, they are reviewed in greater detail in Chapter 3.

1.3.2.4 Evolutionary Approach

The algorithms developed under the common term of evolutionary computation are inspired by the study of the evolutionary behavior of biological processes. They are mainly based on selecting a population as a possible initial solution of a given problem. Through stepwise processing of the initial population using evolutionary operators such as crossover, recombination, selection and mutation, the fitness of the population steadily improves.

Consider how a genetic algorithm might be applied to load forecasting. First, an appropriate model (either linear or nonlinear) is selected and an initial population of candidate solutions is created. A candidate solution is produced by randomly choosing a set of parameter values for the selected forecasting model. Each solution is then ranked based on its prediction error over a set of training data. A new population of solutions is generated by selecting fitter solutions and applying a crossover or mutation operation.
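The loop just described can be sketched on a toy forecasting model: candidates are AR(1) coefficients, fitness is the (negated) one-step-ahead squared error on training data, and new populations come from selecting the fitter half and mutating it. Crossover is omitted for brevity; the population size, mutation scale and data are illustrative assumptions.

```python
import random

def fitness(phi, series):
    """Negative sum of squared one-step-ahead errors (higher is better)."""
    return -sum((series[t] - phi * series[t - 1]) ** 2 for t in range(1, len(series)))

def evolve(series, pop_size=20, generations=40, seed=2):
    random.seed(seed)
    population = [random.uniform(-1.5, 1.5) for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=lambda p: fitness(p, series), reverse=True)
        parents = ranked[: pop_size // 2]                         # selection
        children = [p + random.gauss(0.0, 0.1) for p in parents]  # mutation
        population = parents + children
    return max(population, key=lambda p: fitness(p, series))

# Toy training series generated by an AR(1) process with phi = 0.7
random.seed(0)
series = [1.0]
for _ in range(100):
    series.append(0.7 * series[-1] + random.gauss(0.0, 0.2))

best_phi = evolve(series)
```

For this one-parameter problem least squares would of course be faster; the GA's appeal is that the same loop works unchanged when the forecasting model's error surface is nonlinear and non-differentiable.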


1.3.3 Hybrid Methods

In linear hybridization, two or more linear statistical models are combined. Though some work has been done in this field, as discussed in the literature review, it never really took off, because a linear hybrid model would still suffer from many of the problems of linear models.

The most heavily researched hybrid models are those involving two nonlinear models, especially two computational intelligence models. This is because the three popular CI models, ANNs, fuzzy logic and evolutionary computation, have their own capabilities and restrictions, which are usually complementary to each other. For example, the black-box modeling approach of neural networks might be well suited for process modeling or intelligent control, but not as suitable for decision control. Similarly, fuzzy logic systems can easily handle imprecise data and explain their decisions in linguistic form in the context of the available facts; however, they cannot automatically acquire the linguistic rules to make these decisions. It is these capabilities and restrictions of individual intelligent technologies which have driven their fusion into hybrid intelligent systems, which have been successfully applied to various complex problems, including STLF.

The third class of hybrid models, which this thesis is about, involves one statistical method and one computational intelligence method. Usually the CI method is a neural network, chosen for its flexibility and powerful pattern recognition capabilities. But when developed as a predictive model, neural networks become difficult to interpret due to their black-box nature, and it becomes hard to test the parameters for their statistical significance. Hence, time series models, linear ones such as ARMA or ARIMA, or nonlinear ones such as STAR, are introduced into the hybrid model to address the concern of interpretability.

1.4 Motivation

Though a comfortable state of performance has been achieved for electricity load forecasting, market players will always bring in new dynamic bidding strategies which, coupled with price-dependent load, will introduce new variability and non-stationarity into the electricity load demand series. Besides, stricter power quality requirements and the development of distributed energy resources are other reasons why the modern power system will always require more advanced and more accurate load forecasting tools.

Consider why a SOM based hybrid model is an appealing option. Though every possible approach has been applied to STLF, the more popular ones are the time series approaches and the computational intelligence approach of feed-forward neural networks; an extensive literature review is given in Section 1.6. Both these approaches attempt to build a single global model to describe the load dynamics. The difference between them is that while time series approaches build an exact model of the dynamics (“hard computing”), supervised learning neural networks allow some tolerance for imprecision and uncertainty to achieve tractability and robustness (“soft computing”). However, there is an exciting alternative to building a global model, which is to build local models for the series dynamics, where each local model handles a smaller section of the series dynamics. This is definitely an area which needs further study, because a time series such as the load demand series shows various stylized facts, discussed further in Chapter 4. The complexity of a global model increases considerably if it is to handle all the stylized facts, and working with multiple local models might bring down the complexity. On the other hand, the challenges faced in working with local models are manifold. Firstly, what factors should decide the division of the series dynamics into local models? Secondly, how do we combine the results from multiple local models to give the final prediction value?

In this thesis, SOM based hybrid models are proposed to explore this idea of local models. As mentioned earlier, SOMs have traditionally seen little application to STLF, which mostly has to do with the prevalent attitude among researchers that SOMs are an unsupervised learning method, suitable only for data vector quantization and clustering [12][13]. But this same clustering property makes SOMs an excellent tool for building local models.

Another motivation for this thesis is to further explore the idea of transitions between local models. Once the local models have been built, how does the transition from one model to another take place? Is it a sudden jump, where a local model M1 describes the series on a particular day and a different local model M2 is used for the next day? Analysis of the electricity load demand series showed that regimes were present in the series, due to seasonal effects and market effects, and that the transition between these regimes was smooth. Hence a sudden jump from one local model to another might not be the best approach. This thesis therefore studies the NCSTAR model in Chapter 6, which allows smooth transitions between local models. The idea is to obtain highly accurate learning and prediction not only for test samples which clearly belong to a particular local model, but also for test samples which represent the transition from one local model to another.

Earlier researchers have proposed working with local models for STLF in different ways, and to attain different aims. For example, in [14], the same wavelet-based neural network is trained four times over different periods of the year to handle the four seasons. But this paper does not treat the transitions between the local models, i.e., the seasons, as smooth. Not much work has been done on enforcing smooth transitions between regimes or local models for STLF. After an extensive literature review (see Section 1.6), the only paper found to handle smooth transitions between local models for electricity load forecasting is [30]. So definitely more study has to be done on how to identify local models, how to implement smooth transitions between them, and how introducing the smooth transition will affect the prediction accuracy of the overall model for STLF. This is exactly what this thesis sets out to do.

1.5 Contribution of the Thesis

In this work, two SOM based hybrid models are proposed for STLF.

In the first model, a load forecasting technique is proposed which uses a weighted SOM to split the historical data into clusters. In the standard SOM, all the inputs to the neural network are equally weighted. This is a drawback compared to supervised learning methods, which have procedures to adjust their network weights, e.g. the back-propagation method for MLPs and the pseudo-inverse method for RBFs. Hence, a strategy is proposed which weights the inputs according to their correlation with the output. Once the training with the weighted SOM is complete, the time series has been divided into smaller clusters, one cluster for each neuron. Next, a local linear model is built for each of these clusters using an autoregressive model, which helps to smooth the results.
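The input-weighting idea can be sketched as follows: weight each input dimension by the absolute correlation between that input and the target, and let the weights enter the SOM's distance computation. The Pearson formula is standard; the toy data, in which the second input is constant and therefore uninformative, is an illustrative assumption rather than the scheme's actual evaluation.

```python
import math

def pearson(xs, ys):
    """Standard Pearson correlation; 0.0 if either variable is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0

def input_weights(X, y):
    """One weight per input dimension: |corr(input_d, target)|."""
    dim = len(X[0])
    return [abs(pearson([row[d] for row in X], y)) for d in range(dim)]

def weighted_dist2(w, a, b):
    """Weighted squared Euclidean distance used when matching SOM nodes."""
    return sum(wd * (ad - bd) ** 2 for wd, ad, bd in zip(w, a, b))

# Toy data: the first input tracks the target, the second carries no signal
X = [[1.0, 3.0], [2.0, 3.0], [3.0, 3.0], [4.0, 3.0]]
y = [1.1, 2.0, 2.9, 4.2]
w = input_weights(X, y)
```

Replacing the plain Euclidean distance with `weighted_dist2` in the best-matching-unit search makes the map's clustering driven by the inputs that actually predict the output, which is the intent of the weighted SOM.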

In the second hybrid model, the aim is to allow smooth transitions between the local models. Here the model of interest is a linear model with time varying coefficients which are the outputs of a single hidden layer feedforward neural network. The hidden layer is responsible for partitioning the input space into multiple sub-spaces through multivariate thresholds, and for the smooth transition between the sub-spaces. Significant research has already been done on the specification, estimation and evaluation of this model. In this thesis, a new SOM-based method is proposed to initialize the weights of the hidden layer intelligently before the network training. First, a SOM network is applied to split the historical data dynamics into clusters. Then the Ho-Kashyap algorithm is used to obtain the equations of the hyperplanes separating the clusters. These hyperplane equations are then used to initialize the weights and biases of the hidden layer of the network.

1.6 Literature Survey

The two approaches to STLF, and TSP in general, statistical methods and CI methods, have already been discussed above, and their different sub-categories have been introduced. Some of the approaches described, such as non-linear time series models, SOMs, and MLPs, are more relevant to the work in this thesis than other models. What follows is a bibliographical survey of methods in STLF, with more emphasis given to methods relevant to the work done in this thesis.

1.6.1 Statistical Methods

In the field of linear approaches to time series, the Box-Jenkins methodology is the most popular approach to handling ARMA and ARIMA models, and consists of model identification and selection, parameter estimation and model checking. The Box-Jenkins methodology is among the oldest methods applied to STLF; it was proposed in [15], and further developed in [16]. With a more modern perspective, [17] is an influential text on nonlinear time series models, including several of those described in Section 1.3.1.4. ARMA and ARIMA continue to be very popular for STLF.

In [18], the load demand is modeled as the sum of two terms, the first depending on the time of day and the normal weather pattern for that day, and the second being a residual term which models the random disturbances using an ARMA model. Usually the Box-Jenkins models assume Gaussian noise; the ARMA-modeling method proposed in [19] allows for non-Gaussian noise as well. Other works which use the Box-Jenkins method for STLF are [20][21].

In [22], a periodic autoregression model is used to develop 24 seasonal equations, using the last 48 load values within each equation. The motivation is that by following a seasonal-modeling approach, it is possible to incorporate a priori information concerning the seasonalities at several levels (daily, weekly, yearly, etc.) by appropriately choosing the model structure and estimation method. In [23], an ARMAX model is proposed for STLF, where the X represents an exogenous variable, temperature in this case. This is actually a hybrid model, as it uses a computational intelligence method, particle swarm optimization, to determine the order of the model as well as its coefficients, instead of the traditional Box-Jenkins approach.

An ARIMA model uses differencing to handle the non-stationarity of the series, and then uses ARMA to handle the resulting stationary series. In [24], six methods are compared for STLF, and ARIMA is found to be a suitable benchmark. In [25], a modified ARIMA model is proposed which takes as input not only past loads but also the estimates of past loads provided by human experts. Thus this model, in a sense, incorporates the knowledge of experienced human operators. The method is shown to be superior to both ANN and ARIMA.

Now consider the previous work in STLF on regime-switching models, i.e., the non-linear statistical time series models discussed earlier in Section 1.3.1.4. The threshold autoregressive (TAR) model was proposed by [7] and [8]. In [26], a TAR model with multiple thresholds is developed for load forecasting; the optimum number of thresholds is chosen as the one which minimizes the sum of threshold variances.

A generalization of the TAR model is the smooth transition autoregressive (STAR) model, which was initially proposed in [27], and further developed in [28] and [29]. A modified STAR model for load forecasting is proposed in [30], where temperature plays the role of the threshold variable. This method uses periodic autoregressive models to represent the linear regime, as they better capture the fact that the autocorrelation at a particular lag of one half-hour varies across the week. Such switching regime models have also been proposed for electricity price forecasting [31][32].

1.6.2 Computational Intelligence Methods

Four CI methods were introduced earlier in Section 1.3.2, but the following literature review focuses mostly on neural networks, as these are the most popular of the four for STLF, and also the most relevant to the work done in this thesis. There are several kinds of ANN models, classified by their architecture, processing and training. For STLF, the popular ones have all been used, e.g. radial basis function networks [33][34], self-organizing maps [35] and recurrent neural networks [36][37]. However, the most popular network architecture is the multi-layer perceptron described in Section 1.3.2.3, as its structure lends itself naturally to unknown function approximation. In [38], a fully connected three-layer feedforward ANN is implemented with the backpropagation learning rule, the input variables being historical hourly load data, day of the week and temperature. In [3], a multi-layered feedforward ANN is developed which takes three types of variables as inputs: season related inputs, weather related inputs, and historical loads. In [39], electricity price is also considered as a main characteristic of the load. Other recent work involving the MLP for STLF includes [40][41][34][42].

In [43], in order to reduce the neural network structure and learning time, an hour-ahead load forecasting method is proposed which uses the correction of similar day data. In this method, the forecast load power is obtained by adding a correction to the selected similar day data. In [44], weather ensemble predictions are used for STLF. A weather ensemble prediction consists of multiple scenarios for a weather variable; these scenarios are used to produce multiple scenarios for the load forecasts. In [45], the network committee technique, from the neural network literature, is applied to improve the accuracy of forecasting the next-day peak load.

1.6.3 Hybrid Methods

Hybrid models combining statistical models and neural networks are rare for STLF, though they have been proposed in other TSP fields. In [46], a hybrid ARIMA/ANN model is proposed. Because of the complexity of a moving trend as well as a cyclic seasonal variation, an adaptive ARIMA model is first used to forecast the monthly load, and then the forecast load of the ARIMA model is used as an additive input to the ANN. The prediction accuracy of this approach is shown to be better than that of the traditional time series and regression methods. In [47], a recurrent neural network is trained on features extracted from ARIMA analyses and used for predicting the mid-term price trend of the Taiwan stock exchange weighted index. In [48], an ARIMA model and a neural network model are again combined, to forecast time series of reliability data with a growth trend, and the results are shown to be better than either of the component models. In [49], a seasonal ARIMA (SARIMA) model and the MLP neural network are combined to forecast time series with seasonality.

It was mentioned earlier in Section 1.3.2.3 that a neural network can be implemented for both supervised and unsupervised learning. But unsupervised learning architectures such as SOMs have traditionally been used for data vector quantization and clustering. Hence, when used for TSP, the SOM usually appears in a hybrid model, where it is first used for clustering, and subsequently another function approximation method such as the MLP or support vector regression (SVR) is used to learn the function. In [50][51][52], a two-stage adaptive hybrid network is proposed: in the first stage, a SOM network is applied to cluster the input data into several subsets in an unsupervised manner; in the next stage, support vector machines (SVMs) are used to fit the training data of each subset in a supervised manner. In [53], profiling is done through SOMs, followed by prediction through radial basis function networks. In [54], a first SOM module is used to forecast normal and abnormal days, and a second MLP module makes the load model sensitive to weather factors such as temperature.

As was mentioned in Section 1.3.3, the most heavily researched hybrid models for TSP in general are those where both component models are computational intelligence methods. In [55], a real-time pricing scenario is envisioned where energy prices change on an hourly basis, and the consumer is able to react to those price signals by changing his load demand. In [56], attention is paid to special days: an ANN provides the forecast scaled load curve, and fuzzy inference models give the forecast maximum and minimum loads of the special day. Similarly, significant work has been done on hybridizing evolutionary algorithms with neural networks. In [57], a genetic algorithm is used to tune the parameters of a neural network used for STLF; a similar approach is presented in [58]. In [59], a fuzzy neural network is combined with a chaos-search genetic algorithm and simulated annealing, and is found to be able to exploit all the original methods' advantages. Similarly, particle swarm optimization is a recent CI approach which has been hybridized with other CI approaches such as neural networks [60][61] and support vector machines [62] to successfully improve the prediction accuracy for STLF.

1.7 Structure of the Thesis

The thesis consists of the following chapters.


In this first chapter, short term load forecasting was introduced. The two approaches to short term load forecasting, the statistical approach and the computational intelligence based approach, were introduced, and their hybrid methods were discussed. Relevant work from past research was presented. Finally, the motivation for this thesis and its contributions were presented.

In the second chapter, statistical methods for time series analysis are briefly discussed. These include the more traditional Box-Jenkins methodology, Holt-Winters exponential smoothing, and the more recent regime-switching models.

In the third chapter, two popular neural network models, the multilayer perceptron for supervised learning and the self-organizing map for unsupervised learning, are described. The architecture, the learning rule and relevant issues are presented.

In the fourth chapter, the stylized facts of the load demand series are presented. It is necessary to understand the unique properties of the load demand series before any attempt is made to model them.

In the fifth chapter, the first hybrid model is presented. First, it is explained how an unsupervised model such as a self-organizing map can be used for time series prediction. Then the hybrid model, involving autocorrelation-weighted input to the self-organizing map and an autoregressive model, is explained, along with the motivation for weighting with autocorrelation coefficients.

In the sixth chapter, the second hybrid model is proposed to overcome certain issues with the first proposed model. The need for smooth transitions between regimes in the load series is highlighted. The contribution of this thesis, a novel method to smartly initialize the weights of the hidden layer of the neural network model NCSTAR, is presented.

The final chapter concludes this thesis with some directions for future work.


Chapter 2 Statistical Models for Time Series Analysis

In this chapter, the classical tools for time series prediction are reviewed, and recent developments in nonlinear modeling are detailed. First, the commonly used Box-Jenkins approach to time series analysis is described. Then another commonly used classical method, the Holt-Winters exponential smoothing procedure, is explained. Finally, an overview of the more recent regime-switching models is given.

2.1 Box-Jenkins Methodology

ARMA models, as described by the Box-Jenkins methodology, are a very rich class of possible models. The assumptions for this class of models are that (a) the series is stationary, or can be transformed into a stationary one using a simple transformation such as differencing, and (b) the series follows a linear model.

The original Box-Jenkins modeling procedure involves an iterative three-stage procedure of model identification, model estimation and model validation. Later work [63] adds a preliminary stage of data preparation and a final stage of forecasting.

• Data preparation can involve several sub-steps. If the variance of the series changes with the level, then a transformation of the data, such as taking logarithms, might be necessary to make it a homoscedastic (constant variance) series. Similarly, it needs to be determined whether the series is stationary, and whether there is any significant seasonality which needs to be modeled. Differencing can be used to achieve stationarity and to remove seasonality.

• Model identification involves identifying the order of the autoregressive and moving average terms to obtain a good fit to the data. Several graph-based approaches exist, including the autocorrelation function and partial autocorrelation function approaches, and newer model selection tools such as Akaike's Information Criterion have been developed.

• Model estimation involves finding the values of the model coefficients in order to obtain a good fit to the data. The main approaches are non-linear least squares and maximum likelihood estimation.


• Model validation involves testing the residuals. As the Box-Jenkins models assume that the error term follows a stationary univariate process, the residuals should have nearly the properties of i.i.d. normal random variables. If these assumptions are not satisfied, then a more appropriate model needs to be found; the residual analysis should provide some clues on how to develop it.
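As an illustration of the validation step (not part of the original text), the sample autocorrelations of the residuals can be checked against the approximate ±2/√n white-noise bounds; the simulated i.i.d. residuals below stand in for the residuals of a well-specified model:

```python
import numpy as np

def sample_acf(e, max_lag):
    """Sample autocorrelations r_1..r_max_lag of a residual series e."""
    e = np.asarray(e, dtype=float) - np.mean(e)
    denom = float(np.dot(e, e))
    return np.array([np.dot(e[:-k], e[k:]) / denom for k in range(1, max_lag + 1)])

rng = np.random.default_rng(1)
resid = rng.normal(0.0, 1.0, 2000)        # i.i.d. residuals (well-specified model)
r = sample_acf(resid, 10)

# For i.i.d. residuals, each r_k should lie within roughly +/- 2/sqrt(n) of zero.
bound = 2.0 / np.sqrt(len(resid))
white = bool(np.all(np.abs(r) < 2 * bound))   # deliberately loose sanity check
```

Residual autocorrelations well outside the band would indicate remaining structure and hence a misspecified model.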

2.1.1 AR Model

An autoregressive model of order p ≥ 1 is defined as

Xt = b1 Xt-1 + … + bp Xt-p + εt (2.1)

where {εt} ~ N(0, σ²) is white noise. This model is written as an AR(p) process. The equation explicitly specifies the linear relationship between the current value and its past values.
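As a small numerical illustration of Equation 2.1 (not from the original text), an AR(2) series can be simulated and its coefficients recovered by ordinary least squares on the lagged values; the coefficient values and sample size below are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate a stationary AR(2) process  X_t = b1*X_{t-1} + b2*X_{t-2} + eps_t
b1, b2 = 0.6, 0.3
n = 5000
x = np.zeros(n)
eps = rng.normal(0.0, 1.0, n)
for t in range(2, n):
    x[t] = b1 * x[t - 1] + b2 * x[t - 2] + eps[t]

# Estimate (b1, b2) by ordinary least squares on the lagged regressors
X = np.column_stack([x[1:-1], x[:-2]])   # columns: X_{t-1}, X_{t-2}
y = x[2:]
b_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
```

With a few thousand observations the estimates land close to the true coefficients, which is the behavior the estimation stage of the Box-Jenkins procedure relies on.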

2.1.2 MA Model

A moving average model of order q ≥ 1 is defined as

Xt = εt + a1 εt-1 + … + aq εt-q (2.2)

where {εt} ~ N(0, σ²) is white noise. This model is written as an MA(q) process. For h ≤ q, there is a correlation between Xt and Xt-h because they depend on the same error terms εt-j.

2.1.3 ARMA Model

Combining the AR and MA forms gives the popular autoregressive moving average (ARMA) model, defined as

Xt = b1 Xt-1 + … + bp Xt-p + εt + a1 εt-1 + … + aq εt-q (2.3)

where {εt} ~ N(0, σ²) is white noise, and (p, q) are the orders of the AR and MA parts. ARMA models are a popular choice for approximating various stationary processes.


2.1.4 ARIMA Model

An autoregressive integrated moving average (ARIMA) model is a generalization of an ARMA model. A time series which needs to be differenced to be made stationary is said to be an "integrated" version of a stationary series. So an ARIMA(p, d, q) process is one where the series needs to be differenced d times to obtain an ARMA(p, q) process. This model, as mentioned in Section 1.6.1, continues to be popular for STLF, and has been used as a benchmark in this work.
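A minimal sketch of the "integration" idea (illustrative, not from the original text): a random walk with drift is non-stationary, but one round of differencing (d = 1) yields a stationary series, after which an ARMA model could be fitted. The drift value is an arbitrary assumption:

```python
import numpy as np

rng = np.random.default_rng(2)

# A random walk with drift is non-stationary: its variance grows with time.
drift = 0.5
steps = drift + rng.normal(0.0, 1.0, 4000)
walk = np.cumsum(steps)

# First differencing (the "I" in ARIMA with d = 1) recovers a stationary series
# fluctuating around the drift with constant variance.
diff = np.diff(walk)
mean_diff = float(diff.mean())
```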

2.2 Holt Winters Exponential Smoothing Method

Single exponential smoothing, used for short-range smoothing, assumes that the data fluctuate around a reasonably stable mean (no trend or seasonality). The double exponential smoothing method is used when the data show a trend. Finally, the method which is most interesting for this thesis, triple exponential smoothing, also called Holt-Winters smoothing, can handle both trend and seasonality.

There are two main Holt-Winters smoothing models, depending on the type of seasonality: the multiplicative seasonal model and the additive seasonal model. The difference between the two is that in the multiplicative case, the size of the seasonal fluctuations varies with the overall level of the series, whereas in the additive case, the series shows steady seasonal fluctuations. So an additive seasonal model is appropriate for a time series in which the amplitude of the seasonal pattern is independent of the average level of the series.


In the additive model, the series is decomposed into the following components:

b1 is the base signal, called the permanent component;

b2 is a linear trend component, which may be deleted if necessary;

St is an additive seasonal factor such that, for a season length of L periods, ∑t=1..L St = 0;

εt is the random error component.

2.2.3 Notation Used for the Updating Process

Let the current deseasonalized level of the process at the end of period T be denoted by RT. At the end of a time period t, let

Rt be the estimate of the deseasonalized level,

Gt be the estimate of the trend, and

St be the estimate of the seasonal component.

2.2.4 Procedure for Updating the Estimates of Model Parameters

2.2.4.1 Overall smoothing

Rt = α (yt − St-L) + (1 − α) (Rt-1 + Gt-1) (2.5)

where 0 < α < 1 is a smoothing constant. St-L is the seasonal factor for this period, computed one season (L periods) ago. Subtracting St-L from yt deseasonalizes the data, so that only the trend component and the prior value of the permanent component enter into the updating process for Rt.


2.2.4.2 Smoothing of the trend factor

Gt = β (Rt − Rt-1) + (1 − β) Gt-1 (2.6)

where 0 < β < 1 is another smoothing constant. The estimate of the trend component is simply the smoothed difference between two successive estimates of the deseasonalized level.

2.2.4.3 Smoothing of the seasonal component

St = γ (yt − Rt) + (1 − γ) St-L (2.7)

where 0 < γ < 1 is the third smoothing constant. The estimate of the seasonal component is a combination of the most recently observed seasonal factor, given by the demand yt after removing the deseasonalized level estimate Rt, and the previous best seasonal factor estimate for this time period. The parameters α, β, and γ are estimated by minimizing the sum of squared one-step-ahead in-sample errors. The initial smoothed values for the level, trend and seasonal components are estimated by averaging the early observations.

2.2.4.4 Value of forecast

The forecast for the next period is given by:

ŷt = Rt-1 + Gt-1 + St-L (2.8)

Note that the best available estimate of the seasonal factor for this time period in the season is used, which was last updated L periods ago.
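The updating Equations 2.5–2.8 can be sketched directly in code. The initialization below (averaging the first two seasons) is a simplification of the scheme described above, and the synthetic series and smoothing constants are illustrative assumptions:

```python
import numpy as np

def holt_winters_additive(y, L, alpha=0.3, beta=0.1, gamma=0.2):
    """One-step-ahead additive Holt-Winters forecasts (Eqs. 2.5-2.8)."""
    y = np.asarray(y, dtype=float)
    # Crude initialization from the first two seasons
    R = y[:L].mean()
    G = (y[L:2 * L].mean() - y[:L].mean()) / L
    S = list(y[:L] - y[:L].mean())
    forecasts = []
    for t in range(L, len(y)):
        forecasts.append(R + G + S[t - L])                            # Eq. 2.8
        R_new = alpha * (y[t] - S[t - L]) + (1 - alpha) * (R + G)     # Eq. 2.5
        G = beta * (R_new - R) + (1 - beta) * G                       # Eq. 2.6
        S.append(gamma * (y[t] - R_new) + (1 - gamma) * S[t - L])     # Eq. 2.7
        R = R_new
    return np.array(forecasts)

# A noiseless trend + weekly-seasonal series should be forecast closely
L = 7
t = np.arange(20 * L)
season = np.tile([5, 3, 1, -1, -2, -3, -3], 20)   # sums to zero over a season
y = 100 + 0.5 * t + season
f = holt_winters_additive(y, L)
err = float(np.abs(f - y[L:]).mean())
```

After the rough initial estimates are smoothed away, the one-step forecasts track the deterministic series with a small mean absolute error.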

2.2.5 Exponential smoothing for double seasonality

When dealing with daily load forecasting, the series shows only one significant seasonality, the within-week cycle. Hence the method described above can be satisfactorily applied in that scenario.


But when concerned with hourly load forecasting, there are two seasonalities: the within-day cycle and the within-week cycle. To handle this double seasonality, [64] proposes an extension of the classical seasonal Holt-Winters smoothing method. Using a new formulation where St and Tt denote the smoothed level and trend, Dt and Wt are the seasonal indices (intra-day and intra-week), s1 and s2 are the seasonal periodicity lengths for the intra-day and intra-week cycles respectively, α, γ, δ, and ω are the smoothing parameters, and ŷt(k) is the k-step-ahead forecast made from forecast origin t, the updating equations are:

St = α (yt / (Dt-s1 Wt-s2)) + (1 − α) (St-1 + Tt-1)

Tt = γ (St − St-1) + (1 − γ) Tt-1

Dt = δ (yt / (St Wt-s2)) + (1 − δ) Dt-s1

Wt = ω (yt / (St Dt-s1)) + (1 − ω) Wt-s2

ŷt(k) = (St + k Tt) Dt-s1+k Wt-s2+k

In [65], a comparison of several univariate methods for STLF is presented. Besides the exponential smoothing for double seasonality described above, the other methods compared are a double seasonal ARIMA model, an artificial neural network, and a regression method with principal component analysis. It is reported that in terms of mean absolute percentage error (MAPE), the best approach is double seasonal exponential smoothing. Hence in this thesis, the standard Holt-Winters exponential smoothing has been used as a benchmark for daily load forecasting, and the double seasonal exponential smoothing proposed in [64] has been used as a benchmark for hourly load forecasting.


Henceforth the following notation will be used:

• yt is the value of a time series {yt} at time t

• x̃t ∈ ℜp is a p × 1 vector of lagged values of yt and/or some exogenous variables

• xt ∈ ℜp+1 is defined as xt = [1, x̃tT]T, where the first element is referred to as an intercept

• The general nonlinear model is then expressed as

yt = Φ(xt ; ψ) + εt

where Φ(xt ; ψ) is a nonlinear function of the variable xt with parameter ψ, and {εt} is a sequence of independently normally distributed random variables with zero mean and variance σ².

• The logistic function which is used later on, defined over the domain ℜp, is usually written as

f(γ(xt − β)) = 1 / (1 + exp(−γ(xt − β))) (2.11a)

where γ, the slope parameter, determines the smoothness of the change between models, i.e. the smoothness of the transition from one regime to another, and β can be considered as the threshold which marks the regime switch. In its one-dimensional form it can be written as

f(γ(yt-d − c)) = 1 / (1 + exp(−γ(yt-d − c))) (2.11b)

where d is the delay parameter.
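A quick numerical sketch of the logistic transition function of Equation 2.11 (the parameter values are chosen purely for illustration): a large γ makes the transition nearly a step function, while γ → 0 flattens it to a constant ½:

```python
import math

def logistic_transition(s, gamma, c):
    """First-order logistic transition f(gamma*(s - c)) as in Eq. 2.11."""
    return 1.0 / (1.0 + math.exp(-gamma * (s - c)))

# Large gamma: nearly a step (TAR-like); tiny gamma: ~0.5 everywhere (linear AR)
low  = logistic_transition(-1.0, gamma=100.0, c=0.0)   # deep inside regime 1
high = logistic_transition(+1.0, gamma=100.0, c=0.0)   # deep inside regime 2
flat = logistic_transition(+1.0, gamma=1e-6, c=0.0)    # transition washed out
```

These two limits correspond to the TAR and linear-AR special cases discussed in the following sections.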

2.3.1 Threshold Autoregressive Model (TAR)

To overcome the limitations of the linear approach, the threshold autoregressive (TAR) model was proposed, which allows a locally linear approximation over a number of regimes. It can be formulated as

yt = ∑i=1..k {ωi,0 + ωi,1 yt-1 + ωi,2 yt-2 + … + ωi,p yt-p} I(st ∈ Ai) + εt (2.12b)

where st is the threshold variable, I is an indicator (or step) function, ωi are the autoregressive parameters of the ith linear regime, and {Ai} forms a partition of (−∞, ∞), with ∪i=1..k Ai = (−∞, ∞) and Ai ∩ Aj = ∅ for all i ≠ j. So essentially one of the k autoregressive models is activated, depending on the value of the threshold variable st relative to the partition {Ai}.
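A one-step TAR forecast per Equation 2.12b can be sketched as below; the two-regime setup, the threshold, and the coefficient values are hypothetical illustrations, with st = yt−d selecting the active autoregression:

```python
import numpy as np

def tar_step(y_hist, d, c, coefs_low, coefs_high):
    """One-step TAR forecast with two regimes split at threshold c.

    The regime is chosen by the threshold variable s_t = y_{t-d}; each regime
    has its own AR coefficients (intercept first), as in Eq. 2.12b.
    """
    s = y_hist[-d]
    w = coefs_low if s <= c else coefs_high
    lags = y_hist[::-1][: len(w) - 1]        # y_{t-1}, y_{t-2}, ...
    return w[0] + float(np.dot(w[1:], lags))

f_low  = tar_step(np.array([0.2, -0.1, -1.4]), d=1, c=0.0,
                  coefs_low=(0.0, 0.5), coefs_high=(1.0, -0.5))   # regime 1
f_high = tar_step(np.array([0.2, -0.1, 1.4]), d=1, c=0.0,
                  coefs_low=(0.0, 0.5), coefs_high=(1.0, -0.5))   # regime 2
```

The indicator-function switch between the two coefficient sets is exactly the discontinuity that the STAR model of the next section smooths out.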

2.3.2 Smooth Transition Autoregressive Model (STAR)

If one has good reason to believe that the transitions between the regimes are smooth, and not discontinuous as assumed by the TAR model, then one can choose the smooth transition autoregressive (STAR) model. In this model, the indicator function I(·) changes from a step function to a smooth function, such as the sigmoid function of Equation 2.11b. This STAR model with k regimes is defined as

yt = α1 xt + ∑i=2..k αi xt F(st; γi, ci) + εt (2.13)

where F(st; γi, ci) is the transition function, with slope γi and threshold ci, and each regime i has its own


parameter vector α. Finally, we obtain a model with smoothly changing parameters if the transition variable is a linear time trend, i.e. st = t.

The observable variable st and the associated value of F(st; γi, ci) determine the regime that occurs at time t. Different types of regime-switching behavior can be obtained by different choices of the transition function. The first-order logistic function, Equation 2.11b, is a popular choice for F(st; γi, ci), and the resulting model is called the logistic STAR (LSTAR).

In the LSTAR model, the transition function F(st; γi, ci) in Equation 2.13 is defined by the first-order logistic function, F(st; γi, ci) = 1 / (1 + exp(−γi(st − ci))).

The LSTAR nests the threshold autoregressive (TAR) model as a special case: as the slope parameter γ becomes very large, the logistic function approaches the indicator function I(·), and at st = c, F(st; γi, ci) changes from 0 to 1 instantaneously. The LSTAR also has the linear AR model as a special case, when γ → 0.

The regime switches in the LSTAR model are associated with small and large values of the transition variable st relative to c. In certain applications, it may be more


interesting to specify the transition function such that the regimes are associated with small and large absolute values of st relative to c. In this scenario, it is advisable to use the exponential function

f(γ(yt-d − c)) = 1 − exp(−γ(yt-d − c)²), γ > 0 (2.16)

and the resulting model is named the exponential STAR (ESTAR). Another frequently used function is the normal distribution function, which yields the normal STAR (NSTAR).

2.3.3 Autoregressive Neural Network Model (AR-NN)

In [66], a neural network is considered as a statistical nonlinear model, and statistical inference is applied to the problem of model specification. A "bottom-up" strategy is devised, which works from a specific to a general model.

The autoregressive single hidden layer neural network model is defined as

yt = α xt + ∑i=1..k υi f(ωi xt) + εt (2.17)

where υi are the weights of the neural network. The function f(·) is the activation function of the hidden layer neurons, and is assumed to be logistic in the paper. More details on neural networks are given in Chapter 3.

Consider the geometric interpretation of this model. Each ωi xt defines a hyperplane in the p-dimensional Euclidean space, so the AR-NN divides this space into several polyhedral regions. The output is computed as a sum of the contributions of each region, controlled by the smoothing function f(·). Another interpretation of the AR-NN is as a generalization of the LSTAR where the transition variable can also be a linear combination of multiple stochastic variables.

2.3.4 Neuro-Coefficient Smooth Transition Autoregressive Model (NCSTAR)

This model is a recent development in threshold based models; it was first proposed in [67], and further developed in [68] and [69]. It is a generalization of the previously described models which can handle multiple regimes and multiple transition variables. This model can be considered as a linear model whose parameters change over time,

yt = φt xt + εt (2.18)

where φt = (φt(0), φt(1), …, φt(p)) ∈ ℜp+1 is the vector of coefficients of the model. The time evolution of the coefficient φt(j) is given by the output of a single hidden layer neural network with k hidden units,

φt(j) = υj0 + ∑i=1..k υji f(ωi zt) (2.19)

where υj0 and υji are real coefficients, zt is a q × 1 vector of transition variables, and ωi = (ω1i, ω2i, …, ωqi) are real parameters.

Combining Equations 2.18 and 2.19, the resulting model is obtained:

yt = υ1 xt + ∑i=2..k υi xt f(ωi zt) + εt (2.20)

By observation, this model generalizes both the LSTAR and the AR-NN. When zt = yt-d, the model becomes an LSTAR with k regimes. When υi = (υ0i, 0, …, 0), the model becomes an AR-NN model with k hidden units.
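Equation 2.20 can be sketched as a direct computation; the dimensions and parameter values below are illustrative assumptions. Setting the hidden-unit coefficient vectors to zero reduces the output to the linear part, mirroring the special cases above:

```python
import numpy as np

def ncstar_output(x, z, upsilon, omega):
    """y_t per Eq. 2.20: linear part upsilon_1 x_t plus k-1 smooth units.

    x: (p+1,) regressor vector [1, lagged values]; z: (q,) transition vector;
    upsilon: (k, p+1) linear coefficient vectors; omega: (k-1, q) hidden weights.
    """
    def f(u):                                  # logistic activation
        return 1.0 / (1.0 + np.exp(-u))
    y = float(upsilon[0] @ x)                  # upsilon_1 x_t (linear regime)
    for i in range(1, len(upsilon)):
        y += float(upsilon[i] @ x) * f(float(omega[i - 1] @ z))
    return y

x = np.array([1.0, 0.5, -0.2])                 # [1, y_{t-1}, y_{t-2}]
z = np.array([0.5])                            # a single transition variable
ups = np.array([[0.1, 0.6, 0.2],               # linear coefficients
                [0.0, 0.0, 0.0]])              # second regime switched off
lin = ncstar_output(x, z, ups, omega=np.array([[2.0]]))
```

With the second coefficient vector zeroed, the output is simply 0.1 + 0.6·0.5 + 0.2·(−0.2) = 0.36, i.e. the plain linear AR part.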


Chapter 3 Neural Network Models

In this chapter, two popular neural network architectures are briefly described; the architecture, the learning rule and relevant issues are presented for each. Among the supervised learning architectures, the multilayer perceptron is presented. Among the unsupervised learning architectures, the self-organizing map is presented.

3.1 Introduction

Artificial neural networks are computational networks which attempt to simulate the networks of biological nerve cells (neurons) in the central nervous system. While largely inspired by the inner workings of the brain, many of the finer details of an ANN arise more out of mathematical and computational convenience than actual biological plausibility: an interconnected group of artificial neurons is built around a mathematical or computational model for information processing, based on what is known as a connectionist approach to computation. In many cases a neural network is an adaptive system that changes its structure (through its topology and/or synaptic weight parameters) based on external or internal information that 'flows' through the network [70].

At a fundamental level, a neural network behaves much like a functional mapper between an input and an output space, where the object of modeling is to 'learn' the relationship between the data presented at the inputs and the signals desired at the outputs. Neural networks are a particularly useful method of non-parametric data modeling because they can capture and represent complex input-output relationships via a learning algorithm based on an iterative optimization routine. The entire concept underlying a neural network can be deconstructed into two parts: first, there is an architectural or structural model, and second, there is a separate, somewhat independent learning mechanism. Under normal circumstances, the design of a neural network based approach to solving a problem would first require determining an architecture of suitable structural complexity to meet problem-specific requirements. The second step of applying the learning algorithm is intimately tied to (i) the network topology, and (ii) the problem to be solved. The learning algorithm is basically a systematic way of modifying the network parameters in an iterative and automated


manner, so that a pre-defined loss or error function is minimized. In most cases, a sum-of-squared-error or a mean-squared-error is used as the error function. A gradient-descent method can be used for training the network.

3.2 Multi-Layer Perceptron

Multilayer perceptrons, also known as multilayer feedforward neural networks, have a layered structure which processes information flow in a unidirectional manner, hence the name feedforward. There is an input layer consisting of sensory nodes, one or more hidden layers of computational nodes, and then an output layer where the outputs of the network are calculated. Multilayer perceptrons play a fundamental role in neural computation because of their universal function approximation property, and they have been widely applied in a range of areas including intelligent control, pattern recognition, image processing, and time series prediction. According to the universal approximation theorem, a multilayer perceptron with a single hidden layer is sufficient to compute a uniform approximation for a given training set and its desired outputs.

The typical architecture is shown in Figure 3.1. At each neuron of the network, the output is calculated as a function of the neuron's inputs and the weights applied to that neuron. The sigma sign in the circle denotes the summation of the weighted inputs, and the non-linearity sign in the box shows the function, possibly nonlinear, which is applied to this sum. This function is referred to as the activation function, and is continuously differentiable. Popular activation functions are the logistic function, the hyperbolic tangent function and the linear function.


The learning is achieved through the backpropagation algorithm. In this method, there are two passes through the network: in the forward pass, the input signal propagates in a forward direction through the network, moving from layer to layer, with the weights fixed; in the backward pass, the error signal calculated at the output layer is propagated backwards, and based on an error-correction rule, the weights are adjusted. Although very popular in real-world applications, this method does have two significant shortcomings: sensitivity to parameters and slow learning speed. Even for a simple problem, where a small network has to be trained, many iterations are required. It must be remembered that the more complex the network architecture, the more local minima there are likely to be, and the greater the chance of the backpropagation algorithm getting trapped in one. In [71], the sensitivity to initial states, learning parameters and perturbations has been analyzed.

The standard backpropagation algorithm uses first-order gradient-descent methods to correct the weights of the network iteratively. A Newton-Raphson framework which uses second-order information can offer faster convergence, but at increased complexity. As such, second-order approaches are not very popular, as calculating the second-order derivatives is computationally expensive.

Figure 3.1 Typical Multi-Layer Perceptron Architecture


Let dk denote the desired output and ok the actual output appearing at the kth output neuron, with error ek = dk − ok. The mean squared error over the C training patterns is then written as

E = (1/C) ∑p ∑k ½ (ek)² = (1/C) ∑p ∑k ½ (dk − ok)²

For the output layer weights wkj, the partial derivative ∂E/∂wkj is computed using the rule of partial differentiation. Similarly, for the hidden layer weights vkj, the partial derivative ∂E/∂vkj is computed. The optimization of the error function over the weights wkj and vkj is done using the steepest descent algorithm, which means that the successive adjustments applied to the weights are in the direction of steepest descent, i.e. opposite to the gradient:

Δwkj = −η ∂E/∂wkj,  Δvkj = −η ∂E/∂vkj

where η is a positive constant called the learning rate.
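A minimal numerical sketch of backpropagation with steepest descent (logistic activations, squared-error loss) on the classic XOR problem; the network size, learning rate and iteration count are arbitrary illustrative choices, not part of the original text:

```python
import numpy as np

rng = np.random.default_rng(3)

# XOR training patterns: inputs X and desired outputs d
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
d = np.array([[0.], [1.], [1.], [0.]])

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

V = rng.normal(0, 1, (2, 4))   # input -> hidden weights
bV = np.zeros(4)
W = rng.normal(0, 1, (4, 1))   # hidden -> output weights
bW = np.zeros(1)
eta = 1.0                      # learning rate

def loss():
    h = sigmoid(X @ V + bV)
    o = sigmoid(h @ W + bW)
    return float(np.mean((d - o) ** 2))

loss0 = loss()
for _ in range(5000):
    # forward pass: weights fixed, signal flows layer to layer
    h = sigmoid(X @ V + bV)
    o = sigmoid(h @ W + bW)
    # backward pass: error signals propagated from output to hidden layer
    delta_o = (o - d) * o * (1 - o)
    delta_h = (delta_o @ W.T) * h * (1 - h)
    # steepest-descent weight adjustments, opposite to the gradient
    W -= eta * h.T @ delta_o / len(X)
    bW -= eta * delta_o.mean(axis=0)
    V -= eta * X.T @ delta_h / len(X)
    bV -= eta * delta_h.mean(axis=0)
loss1 = loss()
```

Even on this toy problem, thousands of iterations are needed, which is the slow-learning shortcoming mentioned above.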

The MLP is the most popular architecture among the supervised neural network architectures. In a supervised architecture, the training data consist of many pairs of input/output training patterns; the learning therefore benefits from the assistance of a teacher. So, for example, in the MLP an error criterion is used as the basis for updating the weights of the ANN as the network is trained with successive input patterns.

Contrast that with an unsupervised learning rule, where the training set consists only of input training patterns. Here the network is trained without the benefit of a teacher, and has to learn to adapt based on the experience collected through the previous training patterns. The most popular architecture for unsupervised learning is the self-organizing map, which is discussed next.

3.3 Self-Organizing Maps

3.3.1 Introduction

The SOM algorithm, first introduced by Kohonen in [72], is one of the most popular ANN models based on the unsupervised competitive learning paradigm. Learning from past examples, the SOM creates a mapping from a continuous high dimensional input space φ onto a discretized low dimensional output space χ.

The discrete output space χ consists of q neurons which are arranged according to some topology, the two most popular being the rectangular and the hexagonal grid. Usually a one-dimensional or a two-dimensional grid is used, because an important aim of the SOM is also to reduce the dimensionality of the original data set, as high dimensionality makes it impossible for the analyst to visually inspect the data set. Figure 3.2 shows an example each of a rectangular and a hexagonal grid of size 5×5. As can be seen, each neuron is connected to four or six of its direct neighbors through communication links. Faster learning is achieved with the hexagonal architecture because of the higher number of communication links to the neighbors compared to the rectangular architecture. For the same reason, the hexagonal network is also more flexible and has a better ability to replicate the actual shape of the data surface.


3.3.2 Learning Rule

The mapping c(x) : φ → χ is defined by the weight vectors W = (w1, w2, …, wq) of the q neurons. For each training sample x(t), the first task is to find the neuron which is closest to the sample x(t); this node is called the winning neuron. Through this procedure, an adjustment of the neural grid to the observed data is guaranteed. The winning neuron can be found using

i*(t) = arg mini { ||x(t) − wi(t)|| } (3.5)

where ||·|| denotes the Euclidean distance and t is the current discrete iteration. It is important to note that the weight vectors have the same dimensionality as the input patterns.

A competitive-cooperative learning rule is used to train the weight vectors. When an input vector is presented to the network, the weight vectors of the winning neuron and its neighbors are updated as

wi(t+1) = wi(t) + α(t) h(i*, i ; t) [x(t) − wi(t)] (3.6)

so the weight vectors of the adapted neurons are moved towards the input vector. The amount of movement is controlled by the learning rate α, which decreases exponentially with time. The number of neurons surrounding the winning neuron which are affected by this adaptation is determined by a neighborhood function h. Typically the neighborhood function is unimodal, symmetric and monotonically decreasing with increasing distance to the winner. A popular choice is the Gaussian function:

Figure 3.2 Neural grids of size 5×5 of form (a) rectangular and (b) hexagonal


h(i*, i ; t) = exp( −||ri* − ri||² / (2σ²(t)) ) (3.7)

where ri* and ri denote the grid positions of the winning neuron and of neuron i, and σ(t) is the neighborhood radius.
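Equations 3.5–3.7 can be combined into a small one-dimensional SOM sketch; the grid size, the exponential decay schedules for α and σ, and the two-cluster data are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)

# Scalar training data drawn from two well-separated clusters
data = np.concatenate([rng.normal(-2.0, 0.1, 200), rng.normal(2.0, 0.1, 200)])
rng.shuffle(data)

q = 10                                   # neurons on a 1-D grid
w = rng.normal(0.0, 0.5, q)              # weight "vectors" (scalars here)
grid = np.arange(q, dtype=float)         # neuron positions on the grid

T = 3000
for t in range(T):
    x = data[t % len(data)]
    alpha = 0.5 * (0.01 / 0.5) ** (t / T)         # exponentially decaying rate
    sigma = 3.0 * (0.1 / 3.0) ** (t / T)          # shrinking neighborhood radius
    winner = int(np.argmin(np.abs(x - w)))        # Eq. 3.5 (nearest weight)
    h = np.exp(-((grid - winner) ** 2) / (2 * sigma ** 2))   # Gaussian, Eq. 3.7
    w += alpha * h * (x - w)                      # cooperative update, Eq. 3.6

# After training, the map should have neurons covering both clusters
near_neg = float(np.min(np.abs(w - (-2.0))))
near_pos = float(np.min(np.abs(w - 2.0)))
```

The wide early neighborhood orders the map globally; the shrinking radius then lets individual neurons specialize on the two clusters.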

3.3.3 Neighborhood Function and Learning Rate

The neighborhood function h and the learning rate α have key roles in the SOM algorithm; together, these two parameters determine the speed at which the neural network converges.

The neighborhood function affects the learning process in two ways. First, it determines the radius around the winning neuron within which the neighboring neurons will be affected by the update of the winning neuron. Second, the strength of the updating effect on the neighbors is also determined by the neighborhood function.

The neighborhood function is selected to cover a large area of the output space χ at the beginning of the learning, and it is gradually reduced such that towards the end of the process only the winner is adapted. This can be achieved by defining the radius σ(t) so that it decays from an initial value σ0 towards a final value σ1, for instance linearly, σ(t) = σ0 + (σ1 − σ0) t/tmax, though any other suitable decreasing function may be employed as well.

Usually the learning rate is connected to the neighborhood function in a multiplicative manner. The learning rate has a high value at the beginning of the learning process, and decreases later on. The idea behind this is that the system should be able to use the information obtained during the past iteration steps in order to converge faster.

3.3.4 Convergence

The map is said to have converged when the global ordering of the weight vectors reaches a steady state. An important feature of the resulting map is the preservation of neighborhood relations, i.e. nearby data vectors in the input space are mapped onto neighboring neurons in the output space. Due to this topology-preserving property, the low dimensional output space is able to show the structure hidden in the high-dimensional data, such as clusters and spatial relationships [12][13].


Chapter 4 Stylized Facts of the Load Demand Series

In this short chapter, some of the main features of the load demand series are examined. Before any attempt is made to develop an appropriate load forecasting model, it is necessary to understand the series first. This chapter also introduces the weekend effect, the holiday effect and the temperature effect, and how they shall be handled in the subsequent models.

4.1 Introduction

In Figure 4.1, the semi-hourly electricity demand of England and Wales from 1 July 2005 to 30 June 2006 is shown. The key intra-annual features to notice are the weekly seasonal cycle, the strong influence of the holiday period around Christmas and, most importantly, the weather-sensitive part of the load caused by the changing seasons. Most relevant to the work done in Chapter 6 is the intuitive fact that the transition between different seasons is rather smooth, taking place over several days.

4.2 Intraday Patterns

From the same electricity market of England and Wales, the load demand during two weeks of the summer season (1/5/2005 to 14/5/2005) and two weeks of the winter season (30/11/2005 to 13/12/2005) is shown in Figure 4.2. The daily load profile is thus periodic, but it depends on the day of the week and the time of the year.

Figure 4.1 Semi-hourly electricity demand in England and Wales from July 1, 2005 to June 30, 2006


It is generally assumed that there are four seasons in Britain: spring (March to May), summer (June to August), autumn (September to November) and winter (December to February). Figure 4.3 shows the average daily demand patterns by season. Not only is the average load demand different between the four seasons, but the shape of the load demand pattern also varies between the seasons. For example, the evening peak load between 5 pm and 7 pm is distinctly visible in autumn and winter, while it is clearly missing in spring and summer. Similarly, there is a jump in load demand in the early morning hours from midnight to 5 am which is more distinct in autumn and winter than in spring and summer. This heating load jump is because of the "Economy 7" cheaper-rate units for off-peak overnight hours in Britain. To reiterate a previous point, while the daily load pattern has been shown here by four different curves representative of

Figure 4.2 Electricity demand in England and Wales for two weeks in summer and winter respectively


of other weekdays. This is because on Monday mornings industry has just started to work, and the evening load on Friday is different because of its proximity to the weekend. Similarly, the load profile might differ between Saturday and Sunday because of different industry practices or cultural issues.

In this thesis, whenever the weekend effect needs to be handled, a simple approach is taken: as a sufficient amount of data is available, the data are simply divided into seven sub-sets, one for each day of the week.
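The day-of-week split can be sketched with standard library tools; the dates and demand values below are synthetic placeholders, not actual data from this thesis:

```python
from datetime import date, timedelta

# Hypothetical daily load series: (date, demand) pairs for four weeks
start = date(2005, 7, 1)
series = [(start + timedelta(days=i), 30000 + 100 * i) for i in range(28)]

# Split into seven sub-sets, one per day of the week (0 = Monday .. 6 = Sunday)
by_weekday = {k: [] for k in range(7)}
for day, demand in series:
    by_weekday[day.weekday()].append(demand)

sizes = [len(v) for v in by_weekday.values()]   # four observations per weekday
```

Each sub-set can then be modeled independently, sidestepping the Saturday/Sunday and Monday/Friday differences described above.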

Figure 4.3 Average daily demand patterns by season for England and Wales N.B Winter=December to February, Spring=March to May, Summer=June to August, Autumn=September to November


Just looking at the time series, it might be difficult to visualize the issues discussed earlier, such as whether the load on Saturdays is similar to that on Sundays, or whether Fridays and Mondays need to be considered separately from the other weekdays. An interesting way to visualize this is through scatter plots. For example, for the daily electricity load series of England and Wales from 1/6/2005 to 30/5/2008, Figure 4.4 shows seven sub-plots with the load demand on day t+1 on the x-axis plotted against the load demand on day t−6, t−5, …, t respectively on the y-axis. Consider sub-plot (g), the scatter plot of day t+1 against day t. The thin strip above the central thick strip is caused by the sudden drop in electricity demand from Friday to Saturday, and the thin strip below the central strip is caused by the sudden jump in electricity demand from Sunday to Monday. The bottom thin strip is closer to the central thick strip than the top thin strip, which suggests that the demand drop from Friday to Saturday is significantly larger than the demand jump from Sunday to Monday. Thus using the same model for Saturday and Sunday might not be the best approach.

Figure 4.4 Scatter plot between the electricity load demand on day t+1 and day (a) t-6 (b) t-5 (c) t-4 (d) t-3 (e) t-2 (f) t-1 (g) t
