An Assessment of Water Supply Outlook Forecasts in the Colorado River Basin
Jean C. Morrill1, Holly C. Hartmann1 and Roger C. Bales2,a
1Department of Hydrology and Water Resources, University of Arizona, Tucson, AZ, USA
2School of Engineering, University of California, Merced, CA, USA
aCorresponding author
A variety of forecast skill measures of interest to stakeholders were used to assess the strengths and weaknesses of seasonal water supply outlooks (WSOs) at 54 sites in the Colorado River basin, and to provide a baseline against which alternative and experimental forecast methods can be compared. These included traditional scalar measures (linear correlation, linear root-mean-square error and bias), categorical measures (false alarm rate, threat score), probabilistic measures (Brier score, rank probability score) and distribution-oriented measures (resolution, reliability and discrimination). Despite their shortcomings, the WSOs are generally an improvement over climatology. The majority of forecast points have very conservative predictions of seasonal flow, with below-average flows often overpredicted and above-average flows underpredicted. Late-season forecasts at most locations are generally better than those issued in January. There is a low false alarm rate for both low and high flows at most sites; however, these flows are not forecast nearly as often as they are observed. Moderate flows have a very high probability of detection, but are forecast more often than they occur. There is also good discrimination between high and low flows, i.e., when high flows are forecast, low flows are not observed, and vice versa. The diversity of forecast performance metrics reflects the multi-attribute nature of forecasts and ensembles.
Seasonal water supply outlooks, or forecasts of total seasonal runoff volume, are routinely used by decision makers in the southwestern United States for making commitments for water deliveries, determining industrial and agricultural water allocations, and carrying out reservoir operations. These forecasts are based primarily on statistical regression equations developed from monthly precipitation, recent snow-water equivalent, and a subset of past streamflow observations (Day, 1985). In the Colorado River basin, the National Weather Service's Colorado Basin River Forecast Center (CBRFC) and the Natural Resources Conservation Service (NRCS) jointly issue seasonal water supply outlook (WSO) forecasts of naturalized, or unimpaired, flow, i.e., the flow that would most likely occur in the absence of diversions. These forecasts were not always issued jointly (Hartmann et al., 200X?).
Currently, WSOs are issued once each month from January to June. However, until the mid-1990s, the forecasts were only issued through May. Each forecast contains the most probable value for the forecast period, a comparison to a historical, climatological mean value (usually a 10- to 30-year mean), a reasonable maximum (usually the 10% exceedance value), and a reasonable minimum (usually the 90% exceedance value). In some locations with strongly skewed flow distributions, the comparison is to a historical median rather than the mean.
The forecast period is the period of time over which the forecasted flow is predicted to occur. It is not the same for all sites, all years at one location, or even all months in a single year. In the past decade, the most common forecast period has been April-July for most sites in the upper Colorado River basin and January-May for the lower Colorado, for each month a forecast was issued. However, many sites previously used April-September forecast periods, and prior to that the forecast period for the January forecast was January-September, for the February forecast it was February-September, etc.
Most of the sites at which forecasts are issued are impaired, i.e., have diversions above the forecast and gauging location. Therefore the CBRFC combines measured discharges with historical estimates of diversions to reconstruct the unimpaired observed flow (Ref bulletins). Despite the shortcomings of this approach, it provides the best estimate against which to assess the skill of WSOs.
Forecast verification is important for assessing forecast quality and performance, improving forecasting procedures, and providing users with information helpful in applying the forecasts (Murphy and Winkler, 1987). Decision makers take account of forecast skill in using forecast information and are interested in having access to a variety of skill measures (Bales et al., 2004; Franz et al., 2003). [Any additional result about skill??]

The work reported here assesses the skill of forecasts relative to naturalized streamflow across the Colorado River basin, using a variety of methods of interest to stakeholders: traditional scalar measures (linear correlation, linear root-mean-square error and bias), categorical measures (false alarm rate, threat score), probabilistic measures (Brier score, rank probability score) and distributive measures (resolution, reliability and discrimination). The purpose was to assess the strengths and weaknesses of the current water supply forecasts, and to provide a baseline against which alternative and experimental forecast methods can be compared.
1 Data and Methods
1.1 Data
WSO records from 136 forecast points on 84 water bodies were assembled, including some forecast locations that are no longer active. NEED TO APPEND DATA. Reconstructed flows were made available by the CBRFC and NOAA (T. Tolsdorf and Shumate, personal communication); however, data were not available for all forecast locations. Many current forecast points were established in 1993, and so do not yet have good long-term records. For this study we chose 54 sites having at least 10 years of both forecast and observed data (Figure 1). Another 33 sites have fewer than 10 years of data, but most are still active, and so should be more useful for statistical analysis in a few years' time. The earliest water supply forecasts used in this study were issued in 1953 at 22 of the 54 locations.
These 54 forecasting sites were divided into nine smaller basins (or, in the case of Lake Powell, a single location), compatible with the divisions used by the CBRFC in the tables and graphs accompanying the WSO forecasts (Table 1). The maximum number of years in the combined forecast and observation record was 48 (1953-2000), the minimum used was 21, and the median and average number of years were 46 and 41.5, respectively. Each forecast includes the most likely value, a reasonable maximum (usually the 10% exceedance value), and a reasonable minimum (usually the 90% exceedance value). These were used to calculate the 30 and 70% exceedance values associated with each forecast. Five forecast flow categories were calculated for each forecast, based on exceedance probability: 0-10%, >10-30%, >30-70%, >70-90%, and >90%. The probability of the flow falling within each of these categories is 0.1, 0.2, 0.4, 0.2 and 0.1, respectively.
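The category assignment described above can be sketched as follows. This is a minimal illustration, not the CBRFC procedure: the function name and the convention that exceedance probability decreases with increasing flow (so the 10% exceedance value is the high-flow bound) are ours.

```python
# Hypothetical helper: assign an observed flow to one of the five
# exceedance-probability categories used for the WSO forecasts, given
# the flow values at the 10/30/70/90% exceedance levels.
# Categories (indices 0..4): 0-10%, >10-30%, >30-70%, >70-90%, >90%.

def flow_category(obs, q10, q30, q70, q90):
    """Return category index 0..4 for an observed flow volume."""
    if obs >= q10:        # very high flow, exceeded only 10% of the time
        return 0
    elif obs >= q30:
        return 1
    elif obs >= q70:      # middle 40% of the distribution
        return 2
    elif obs >= q90:
        return 3
    else:                 # very low flow
        return 4

# Fixed probability assigned to each category, from the text:
CATEGORY_PROBS = [0.1, 0.2, 0.4, 0.2, 0.1]
```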
1.2 Summary and correlation measures
Summary measures are scalar measures of accuracy for forecasts of continuous variables, and include the mean absolute error (MAE) and mean square error (MSE):

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} \left| f_i - o_i \right|$$

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} \left( f_i - o_i \right)^2$$

where, for a given location, $f_i$ is the forecast seasonal runoff for period $i$ and $o_i$ the naturalized observed flow for the same period. Since MSE is computed by squaring the forecast errors, it is more sensitive to large errors than is MAE. It increases from zero for perfect forecasts to large positive values as the discrepancies between forecasts and observations become larger. RMSE is the square root of the MSE.
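The summary measures above can be written directly from their definitions; a minimal sketch (function and variable names are ours):

```python
import math

def mae(f, o):
    """Mean absolute error of forecasts f against observations o."""
    return sum(abs(fi - oi) for fi, oi in zip(f, o)) / len(f)

def mse(f, o):
    """Mean square error; penalizes large errors more heavily than MAE."""
    return sum((fi - oi) ** 2 for fi, oi in zip(f, o)) / len(f)

def rmse(f, o):
    """Root-mean-square error, in the same units as the flows."""
    return math.sqrt(mse(f, o))
```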
Often an accuracy measure is not meaningful by itself, and is compared to a reference value, usually based on the historical record. In order for a forecast technique to be worthwhile, it must generate better results than simply using the cumulative distribution of the climatological record, i.e., assuming that the most likely flow next year is the average flow in the climatological record. To judge this, skill scores are calculated for the accuracy measures:
$$SS_A = \frac{A - A_{ref}}{A_{perf} - A_{ref}} \times 100\%$$

where $SS_A$ is a generic skill score for accuracy measure $A$, $A_{ref}$ is the accuracy of a reference set of values (e.g., the climatological record) and $A_{perf}$ is the value of $A$ given by perfect forecasts. If $A = A_{perf}$, $SS_A$ will be at its maximum, 100%. If $A = A_{ref}$, then $SS_A = 0\%$, indicating no improvement over the reference forecast. If $SS_A < 0\%$, then the forecasts are not as good as the reference (Wilks, 1995). For MSE:
$$SS_{MSE} = \frac{MSE - MSE_{cl}}{MSE_{perf} - MSE_{cl}} \times 100\% = \left( 1 - \frac{MSE}{MSE_{cl}} \right) \times 100\%$$

since $MSE_{perf} = 0$, where $MSE_{cl}$ is the MSE of climatological forecasts.
The most widely used correlation measure is the coefficient of determination, $R^2$, which describes the proportion of the variability of the observations that is linearly accounted for by the forecasts:

$$R^2 = \frac{\left[ \sum_{i=1}^{n} (f_i - \bar{f})(o_i - \bar{o}) \right]^2}{\sum_{i=1}^{n} (f_i - \bar{f})^2 \; \sum_{i=1}^{n} (o_i - \bar{o})^2}$$

It represents a quantitative summary measure of the joint distribution of the forecasts and observations. However, it does not account for any forecast bias, and when bias is large, the correlation is not likely to be informative.

A related measure is the Nash-Sutcliffe coefficient (NSC):

$$NSC = 1 - \frac{\sum_{i=1}^{n} (f_i - o_i)^2}{\sum_{i=1}^{n} (o_i - \bar{o})^2} \qquad (5)$$
It has a maximum of 1 for a perfect forecast and a minimum of negative infinity. Physically, NSC is 1 minus the ratio of MSE to the variance of the observed data. If NSC > 0, the forecast is a better predictor of flow than is the observed mean, but if NSC < 0, the observed mean is a better predictor and there is a lack of correlation between the forecast and observed values.
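Equation (5) translates directly into code; a minimal sketch (function name is ours):

```python
def nsc(f, o):
    """Nash-Sutcliffe coefficient: 1 - MSE / variance of observations.

    1.0 for a perfect forecast; 0.0 when the forecast is no better a
    predictor than the observed mean; negative when it is worse."""
    o_mean = sum(o) / len(o)
    num = sum((fi - oi) ** 2 for fi, oi in zip(f, o))   # sum of squared errors
    den = sum((oi - o_mean) ** 2 for oi in o)           # variance * n
    return 1.0 - num / den
```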
Discussion of correlation is often combined with that of the percent bias, which measures the difference between the average forecasted and observed values (Wilks, 1995):

$$Pbias = \frac{\sum_{i=1}^{n} (f_i - o_i)}{\sum_{i=1}^{n} o_i} \times 100\%$$

1.3 Categorical measures

Categorical measures are computed from a 2×2 contingency table of forecast and observed events. An event (e.g., flow within a given category of the observed distribution) that is successfully forecast (both forecast and observed) occurs $a$ times. An event that is forecast but not observed occurs $b$ times, and an event that is observed but not forecast occurs $c$ times. An event that is not forecast and not observed for the same period occurs $d$ times. The total number of forecasts in the data set is $n = a + b + c + d$. A perfectly accurate binary (2×2) categorical forecast will have $b = c = 0$ and $a + d = n$. However, few forecasts are perfect. Several measures can be used to examine the accuracy of the forecast, including hit rate, threat score, probability of detection and false alarm rate (Wilks, 1995).
The hit rate is the proportion correct:

$$HR = \frac{a + d}{n} \qquad (7)$$

and ranges from one (perfect) to zero (worst).
The threat score, also known as the critical success index, is the proportion of correctly forecast events out of the total number of times the event was either forecast or observed, and does not take into account the accurate non-occurrence of events:

$$TS = \frac{a}{a + b + c}$$

It also ranges from one (perfect) to zero (worst).
The probability of detection is the ratio of the number of times the event was correctly forecast to the number of times it actually occurred, or the probability of the forecast given the observation:

$$POD = \frac{a}{a + c}$$

A perfect POD is 1 and the worst is 0.
A related statistic is the false alarm rate, FAR, which is the fraction of forecasted events that do not happen. In terms of conditional probability, it is the probability of not observing an event given the forecast:

$$FAR = \frac{b}{a + b}$$

Unlike the other categorical measures described, the FAR has a negative orientation, with the best possible FAR being 0 and the worst being 1.
The bias of the categorical forecasts compares the average forecast with the average observation, and is represented by the ratio of "yes" forecasts to "yes" observations:

$$bias = \frac{a + b}{a + c}$$

An unbiased forecast has a value of 1, showing that the event occurred the same number of times that it was forecast. If the bias is greater than 1, the event is overforecast (forecast more often than observed); if the bias is less than one, the event is underforecast. Since the bias does not actually show anything about whether the forecasts matched the observations, it is not an accuracy measure.
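The categorical measures above all follow from the four contingency counts; a minimal sketch (function names are ours):

```python
# a = event forecast and observed        b = forecast but not observed
# c = observed but not forecast          d = neither forecast nor observed

def hit_rate(a, b, c, d):
    """Proportion correct; 1 is perfect, 0 is worst."""
    return (a + d) / (a + b + c + d)

def threat_score(a, b, c, d):
    """Critical success index; ignores correct non-occurrences (d)."""
    return a / (a + b + c)

def pod(a, b, c, d):
    """Probability of detection: fraction of observed events forecast."""
    return a / (a + c)

def far(a, b, c, d):
    """False alarm rate: fraction of forecast events that did not occur."""
    return b / (a + b)

def bias(a, b, c, d):
    """Ratio of 'yes' forecasts to 'yes' observations; 1 is unbiased."""
    return (a + b) / (a + c)
```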
1.4 Probabilistic Measures
Whereas categorical forecasts contain no expression of uncertainty, probabilistic forecasts do. Linear error in probability space (LEPS) assesses forecast errors with respect to their difference in probability, rather than their overall magnitude:

$$LEPS = \frac{1}{n} \sum_{i=1}^{n} \left| F_c(f_i) - F_c(o_i) \right|$$

where $F_c(o)$ refers to the climatological cumulative distribution function of the observations, and $F_c(f)$ to the corresponding distribution for the forecasts. The corresponding skill score is:

$$SS_{LEPS} = 1 - \frac{\sum_{i=1}^{n} \left| F_c(f_i) - F_c(o_i) \right|}{\sum_{i=1}^{n} \left| 0.5 - F_c(o_i) \right|}$$

using the climatological median as reference forecast.
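LEPS can be sketched with an empirical climatological CDF built from the observed record; the plotting-position formula ($k/(n+1)$) is our assumption, not necessarily the one used for the WSOs:

```python
def empirical_cdf(record):
    """Return F_c: the fraction of the climatological record not
    exceeding x, using a Weibull-type plotting position."""
    ranked = sorted(record)
    n = len(ranked)
    def F(x):
        return sum(1 for r in ranked if r <= x) / (n + 1)
    return F

def leps(f, o, record):
    """Mean absolute forecast error in probability space."""
    F = empirical_cdf(record)
    return sum(abs(F(fi) - F(oi)) for fi, oi in zip(f, o)) / len(f)
```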
The Brier score is analogous to MSE:

$$BS = \frac{1}{n} \sum_{i=1}^{n} (f_i - o_i)^2$$

except that here $f_i$ is the probability assigned to an event and $o_i$ records whether that event occurred, instead of comparing the actual forecast and observation. Therefore $f_i$ ranges from 0 to 1, $o_i = 1$ if the event occurred or $o_i = 0$ if the event did not occur, and $BS = 0$ for perfect forecasts. The corresponding skill score is:

$$SS_{BS} = 1 - \frac{BS}{BS_{ref}}$$

where the reference forecast is generally the climatological relative frequency.
The ranked probability score (RPS) is essentially an extension of the Brier score to multi-event situations. Instead of just looking at the probability associated with one event or condition, it looks simultaneously at the cumulative probability of multiple events occurring. RPS uses the forecast cumulative probability,

$$F_m = \sum_{j=1}^{m} f_j$$

For the five flow categories {0-10%, >10-30%, >30-70%, >70-90%, and >90%}, $F_m$ = {0.1, 0.3, 0.7, 0.9, 1.0} and $J = 5$. The observation occurs in only one of the flow categories, which is given a value of 1; all the others are given a value of zero, so that the cumulative observation is $O_m = \sum_{j=1}^{m} o_j$ and

$$RPS = \sum_{m=1}^{J} (F_m - O_m)^2 \qquad (19)$$
A perfect forecast will assign all the probability to the same percentile in which the event occurs, which will result in RPS = 0. The RPS has a lower bound of 0 and an upper bound of $J - 1$. RPS values are rewarded for the observation being closer to the highest probability category. The RPS skill score is defined as:

$$SS_{RPS} = 1 - \frac{RPS}{RPS_{ref}}$$

where $RPS_{ref}$ is the reference value.
The Brier score focuses on how well the forecasts perform in a single flow category;
RPS is a measure of overall forecast quality.
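The Brier score and RPS as defined above can be sketched as follows (function names are ours); for the WSO categories, a climatological forecast is the fixed vector [0.1, 0.2, 0.4, 0.2, 0.1]:

```python
def brier(probs, occurred):
    """Brier score over (probability, 0/1 outcome) pairs for one event
    category; 0 for perfect forecasts."""
    return sum((p, o)[0] ** 2 - 2 * p * o + o ** 2
               for p, o in zip(probs, occurred)) / len(probs)

def rps(f, obs_cat):
    """Ranked probability score for one forecast of J ordered
    categories; f holds the category probabilities and obs_cat the
    index of the category in which the observation fell."""
    score, F, O = 0.0, 0.0, 0.0
    for m, fm in enumerate(f):
        F += fm                               # cumulative forecast prob
        O += 1.0 if m == obs_cat else 0.0     # cumulative observation
        score += (F - O) ** 2
    return score
```

For a forecast that puts all probability in the observed category, RPS = 0; the climatological vector scores RPS = 0.2 when the middle category verifies.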
1.5 Distributive Measures
We used two distributive measures, reliability and discrimination, to assess the forecasts in various categories (i.e., low, medium, high). The same five forecast probabilities used for RPS were used to represent the probability given to each of the three flow categories. Our application of these measures follows that outlined by Franz et al. (2003).
Reliability uses the conditional distribution $p(o|f)$ and describes how often an observation occurred given a particular forecast. Ideally, $p(o = 1 | f) = f$ (Murphy and Winkler, 1987). That is, for a set of forecasts where a forecast probability value $f$ was given to a particular observation $o$, the forecasts are considered perfectly reliable if the relative frequency of the observation equals the forecast probability (Murphy et al., 1992). For example, given all the times in which high flows were forecast with a 50% probability, the forecasts would be considered perfectly reliable if the actual flows turned out to be high in 50% of the cases.
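The reliability calculation amounts to binning the forecast probabilities and tabulating the observed relative frequency in each bin. A minimal sketch (function name and bin edges, taken from the five WSO categories, are our choices):

```python
def reliability(forecast_probs, outcomes, bins=(0.1, 0.3, 0.7, 0.9)):
    """For each forecast-probability bin, return (sample size,
    observed relative frequency p(o=1|f)); frequency is None for
    empty bins. Perfectly reliable forecasts plot on the 1:1 line."""
    edges = (0.0,) + bins + (1.0001,)   # upper edge > 1 so p=1 is binned
    stats = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        hits = [o for p, o in zip(forecast_probs, outcomes) if lo <= p < hi]
        freq = sum(hits) / len(hits) if hits else None
        stats.append((len(hits), freq))
    return stats
```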
On a reliability diagram (Figure 2), the conditional distribution $p(o|f)$ of a set of perfectly reliable forecasts will fall along the 1:1 line. Forecasts that fall to the left of the line are underforecasting, or not assigning enough probability to the subsequent observation. Those that fall to the right of the line are overforecasting. Forecasts that fall on the no-resolution line are unable to identify occasions when the event is more or less likely than the overall climatology (Wilks, 1995). Conditional distributions of forecasts lacking resolution plot along the horizontal line associated with their climatology value.

The discrimination diagram displays the conditional probability distributions $p(f|o)$
of each possible flow category as a function of forecast probability (Figure 3). If the forecasts are discriminatory, the probability distribution functions of the forecasted flow categories will have minimal overlap on the discrimination diagram (Murphy et al., 1989). Ideally, a forecast issued prior to an observation of a low flow should state that there is a 100% chance of a low flow and a 0% chance of high or middle flows. A set of forecasts that consistently provides such strong and accurate statements is perfectly discriminatory and will produce a discrimination diagram like Figure 3a. Figure 3b illustrates a case where the sample of forecasts is unable to consistently assign the largest probability to the occurrence of low flows; users of forecasts from such a system could have no confidence in the predictions.

A discrimination diagram is produced for occurrences of observations in each flow category; therefore, forecasts that were issued prior to observations that occurred in the lowest 30% (low flows), middle 40% (mid-flows), etc. are plotted on separate discrimination diagrams. The number of forecasts represented on each plot depends upon the number of historical observations in the respective flow category.
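Discrimination conditions in the opposite direction from reliability: it histograms the forecast probabilities separately for occasions when the event did and did not occur. A minimal sketch (names and bin edges are our choices):

```python
def discrimination(forecast_probs, outcomes, bins=(0.1, 0.3, 0.7, 0.9)):
    """Conditional distributions p(f|o) for a discrimination diagram.

    Returns (freq_given_occurred, freq_given_absent): the relative
    frequency of the forecast probability falling in each bin when the
    event occurred (o=1) and when it did not (o=0)."""
    edges = (0.0,) + bins + (1.0001,)
    def hist(ps):
        counts = [sum(1 for p in ps if lo <= p < hi)
                  for lo, hi in zip(edges[:-1], edges[1:])]
        total = sum(counts)
        return [c / total for c in counts] if total else counts
    occurred = [p for p, o in zip(forecast_probs, outcomes) if o == 1]
    absent = [p for p, o in zip(forecast_probs, outcomes) if o == 0]
    return hist(occurred), hist(absent)
```

Minimal overlap between the two returned distributions indicates good discrimination.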
2 Results
2.1 Scalar measures
The New Fork River near Big Piney (USGS site number 9205000) in the Upper Green River basin captures many of the patterns seen at the different sites, and is used to illustrate the different types of output results presented across the basin. ADD TWO OTHER EXAMPLES. The three examples of Pbias in Figure 4 show that 1997 for XXX represents an almost perfect forecast year, with forecast-to-observed ratios very close to 1. It is an excellent example of consistency in forecasting, with the concentric circles showing that the January forecast was the same as the July forecast. In many of the other years, such as 1992, there is forecast drift, with the January forecast farthest from 1 and values getting progressively better with each month. Comparing Figures 4a and 4b shows that years of above-average flow (e.g., 1982, 1983 and 1986) are often associated with forecasts being too low; conversely, in years of below-average flow (e.g., 1988 and 1992), forecasts were too high. This is seen more clearly by plotting $f_i/o_i$ vs. $o_i/\bar{o}$ (Figure 5a). Ideally, all points should lie on the horizontal line $f_i/o_i = 1$, which would indicate that no matter how high or low (above or below average) the observed flow, the forecast values equal the observed values. It can also be seen that forecasts issued in May are an improvement over those issued earlier in the year.
Note that different years were used to produce the climatology against which forecasts were compared (Figure 4c). For example, for 1975-1980, data for 1958-1972 were used, while for 1993-2000, data from the 1960-1999 period were used. This pattern is repeated for all the forecast locations. Every five or ten years, the definition of average observed flow changes, and different sites may use data from different time periods, although in 1991-2000 the majority of the forecasts were based on the 1961-1990 climatology. Starting in 2001, forecasts were based on the 1971-2000 climatology. Another problem in comparing forecasts from one year to another is that the forecast period, or the months during which the forecasted flow is supposed to occur, changes, sometimes from month to month and other times from year to year (Figure 4d). For example, for 1975-1979, the forecast period for January was January-September and for May it was May-September. For 1980-1990, the forecast period was April through September for every month of issue, and from 1991 to present, it was April to July. For this location no one forecasting period has a visibly better correlation than another, nor do forecasts show any marked improvement over the period of record.
In examining forecasts issued in January through June, the correlation between forecast and observation clearly improves as the year progresses. Note that April and May values lie much closer to the 1:1 line than do the January and February values. [This discussion is for the left panel of Fig. 5, which was deleted?]
$R^2$ values across all the sites are lowest in January (all sites < 0.5) and become progressively higher through May (0.4-0.9, with the highest around 0.8, although there is little difference among February through April values) (Figure 6a). Even in April and May there are still many poorly correlated sites.
[Present these or remove from description: MAE, MSE, SS_MSE, RMSE; RMSE should be presented, even if the others are not. Although I think that RMSE is as much related to flow volume as anything else. Of the 10 sites with the lowest RMSE, 5 are tributaries in the Lower Green basin and the other five are smaller creeks/rivers as well. Of the 10 with the highest (worst) RMSE, 4 are on the Colorado River and 2 are on the San Juan River. Others are on the Green River, Gunnison River, Yampa River and Salt River. Maybe skip the absolute values and just look at skill scores? Report AE_i for example, and MAE ranges as a function of ō for all. Just skill scores. Also RMSE and skill score.]
A similar pattern is seen for NSC, although there is little difference in February through April values (Figure 6b). Two sites have negative values during all five months, indicating that the forecasts are not an improvement over the climatology: the Strawberry River near Duchesne in the Lower Green basin (#9288180), and the Florida River inflow to Lemon Reservoir (#9363100) in the San Juan River basin. Of the five sites with the highest average NSC values for all five months, three are in the Upper Green: Green River at Warren Bridge (#9188500), Pine Creek above Fremont Lake (#9196500), and Fontenelle Reservoir inflow (#9211150). The other two are the Virgin River near Virgin (#9406000), which had good correlations in March-May despite very low January values, and the Gunnison River inflow to Blue Mesa Reservoir (#9124800).
DISCUSS SS ON FIG 6
WHERE IS FIGURE 7 MENTIONED?
Overall, the forecasts display a tendency toward a negative skew of forecast errors, G (see Equations 9 and 10). As can be seen in Figure 8, this is most pronounced in the April forecasts in the Gila River basin, although most of the other basins had some sites with a negative skew of forecast errors, some sites with no skew, and no sites with a positive skew. A large negative G means that the overall tendency of the forecasts is to underpredict rather than overpredict.
2.2 Categorical measures
Hit rate, threat score, false alarm rate and probability of detection (Figures 9-12) for each month and flow category need to be considered together. Eighty to ninety percent of sites have HR above 0.6 for correct predictions in the lower and upper 30% flow categories (Figure 9), meaning that these flows actually occur a majority of the time that the forecast is for high or low flows. Similarly, FAR (0 is perfect) is best for the low and high flows (Figure 11).

However, the POD shows that the majority of high and low flows that occur are not being accurately forecast (Figure 12). In January-April, under 5% of flows in the upper or lower 30% were correctly forecast, i.e., the POD was near 0. There were very few points with POD above 0.5 for the high and low flows. POD for the mid 40% was high, because most forecasts predict that conditions will fall in this category. For the same reason, HR was low and FAR high in the middle category. Note that TS (Figure 10) combines some features of HR and POD; while it is similar to HR for the mid 40%, it is low for the upper and lower 30%. The bias (not shown) was near 0-0.25 (very low) for low and high flows, showing that they are underpredicted, and between 2 and 4 (very high) for moderate flows, showing that they are overpredicted. [add more on bias???]
2.3 Probabilistic measures
At the New Fork River near Big Piney (#9205000), the LEPS is clearly lower than LEPS_ref, and the skill score increased from January through May (Figure 13a). The Brier scores of the forecasts were all higher than those of the reference set as well. Comparing average LEPS skill scores across the basins shows that the lowest values occurred in January and the highest in April or May (Figure 14a). The Lower Green River basin showed negative SS_LEPS values for January and February as a result of the very large negative value at the Strawberry River near Soldier Springs (#9285000). With LEPS near 0.5 and LEPS_ref near 0.1, SS_LEPS is several hundred percent below zero, and the average skill scores of both the Lower Green River basin and the whole Colorado basin are lowered. With this point removed, these skill scores are similar to those of the other basins.
Brier skill scores, SS_BS, also tend to improve with time, although not always (Figure 14b). For example, the February skill scores in the lower Green River and the San Juan River basins are lower than those in January. Five of the sites in the Gila River basin have negative SS_BS values in March, making the basin average negative. The Strawberry River near Soldier Springs also has negative SS_BS each month. The Virgin River basin had the highest average SS_BS.
Twenty-two of the 54 sites have a negative average SS_RPS for January-May. Thirteen have negative SS_RPS values for each month that a forecast was issued. Seven of these were Gila River basin locations; two were in the San Juan River basin (the San Juan River near Bluff and the San Juan River inflow to Navajo Reservoir); one was along the main stem of the upper Colorado (Colorado River near Cisco); and one each was in the upper Green, lower Green, and Yampa and White River basins (Henry's Fork near Manila, Duchesne River at Myton, and Little Snake River near Dixon, respectively). However, four of the remaining San Juan River basin sites had SS_RPS values in the top ten (averaging 30-40) and four of the Yampa and White River basin sites were among the top fourteen (Figure 14c ??).
2.4 Distributive measures
Using the Green River near Warren Bridge (#9188500) [why a different example site???] as an example, one sees that the best resolution for this site occurs in May for the upper 30% of flows (Figure 15). Later months have better resolution than earlier months, which have a larger fraction of flows being forecast with only 30-70% likelihood, especially moderate flows. For the low flows, forecasts of non-occurrence (<10% probability) are much more frequent than forecasts of occurrence (>90% probability).
Table 2 shows the sum of the resolution in the <0.1 and >0.9 categories for each of the basins and the Colorado basin as a whole. In a forecast system with perfect resolution, this sum should equal 1. For the entire Colorado basin, the basin average of this sum increases from 0.5 in January to 0.8 in May for low flows, while for high flows it increases from 0.5 to 0.7. For the moderate flows, this sum is lower, usually averaging less than 0.5, with values less than or equal to 0.3 in January and March in many of the basins. Low and high flows have the poorest resolution in the Virgin River basin; the best average resolution for high and low flows occurs in the lower Green basin.
Table 3 shows this sum for the top 10 and bottom 10 sites in the high and low flow categories. Six of the sites with the best resolution for high and low flows are in the lower Green River basin. The Gila River at Calva (#9466500) has the sixth-best resolution of low flows and the second-worst resolution of high flows. The poorest resolution of low flows occurred mostly at sites on the main stem of the upper Colorado River and in the upper Green River basin. The Eagle River below Gypsum (#9070000) and the Virgin River near Virgin, UT (#9406000) show poor resolution in all flow categories.
Reliability for the Green River near Warren Bridge (#9188500) shows similar patterns for all five months (Figure 15). Low flows are underconfident at low probability, have no reliability at moderate probability, and are overconfident at high probability. High flows are overforecast at low probabilities and at 30-70% and 70-90% likelihood, but overall seem to have better reliability than the low flows.
Discrimination at this site, however, is better for low flows than for high flows (Figure 16). High flows are rarely observed when low flows are predicted. In March-May, when low flows were observed, 80-90% of the forecasts predicted less than a 10% probability of high flow, and low flows were accurately predicted 50% of the time in April and 80% of the time in May. When moderate flows are observed, all flow categories are given about an equal chance of occurring, and no flow is given a high probability of occurring, even late in the year. In the high flow category at this site, even in May, high flows are only predicted to occur about 50% of the time that they are observed, and are forecast not to occur about 30% of the time that they are observed. However, low flows are almost never observed when high flows are predicted. When high flows are observed, forecast discrimination of moderate flows is accurate in March-May as well.
3 Discussion
3.1 General observations
[The following needs context – specific to current results] As forecasts become sharper, or more refined, the forecast probability becomes more narrowly distributed and is more frequently assigned to the extreme non-exceedance categories (i.e., 0-10% and >90-100%) (Murphy et al., 1987). Thus, the sample sizes within the middle probability categories become smaller with sharper forecasts. A relative frequency diagram displays forecast resolution and also allows the user to determine which reliability results may be most valid, based on the sample size within the probability category (bin). Statistics