
Ensemble Mean Forecast Skill and Applications with the T213 Ensemble Prediction System

doi: 10.1007/s00376-016-6155-2

Ensemble forecasting has become the prevailing method in operational weather forecasting. Although ensemble mean forecast skill has been studied for many ensemble prediction systems (EPSs) and many cases, theoretical analysis of ensemble mean forecast skill has rarely been undertaken, especially quantitative analysis that makes no assumptions about the ensemble members. This paper investigates fundamental questions about the ensemble mean, such as the advantage of the ensemble mean over individual members, the potential skill of the ensemble mean, and the skill gained by the ensemble mean as ensemble size increases. The average error coefficient between each pair of ensemble members is the most important factor in ensemble mean forecast skill: it determines both the mean-square error of ensemble mean forecasts and the skill gain with increasing ensemble size. Additional members are more useful when the errors of the members are less correlated with each other, and vice versa. The theoretical results in this study are verified by application to the T213 EPS. A typical EPS has an average error coefficient of between 0.5 and 0.8; the 15-member T213 EPS used here reaches a saturation degree of 95% (i.e., at most a 5% further skill gain from adding new members with skill similar to the existing members) for 1-10-day lead times, as far as the mean-square error is concerned.
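The scaling at the heart of the abstract can be sketched with a standard idealization (not the paper's own derivation): if every member has error variance sigma^2 and every pair of member errors has correlation rho (playing the role of the "average error coefficient"), the ensemble-mean MSE is sigma^2 (1 + (n-1) rho) / n, which saturates at sigma^2 rho as n grows. A minimal Python check, with all numbers illustrative:

```python
import numpy as np

def ensemble_mean_mse(sigma2, rho, n):
    """MSE of an n-member ensemble mean when every member has error
    variance sigma2 and every pair of member errors has correlation rho."""
    return sigma2 * (1.0 + (n - 1) * rho) / n

# Monte Carlo check: draw equicorrelated member errors and average them.
rng = np.random.default_rng(0)
sigma2, rho, n = 1.0, 0.6, 15
cov = sigma2 * (rho * np.ones((n, n)) + (1 - rho) * np.eye(n))
errors = rng.multivariate_normal(np.zeros(n), cov, size=200_000)
empirical = float(errors.mean(axis=1).var())

print(round(ensemble_mean_mse(sigma2, rho, n), 4), round(empirical, 4))
# As n grows the MSE saturates at sigma2 * rho, so extra members with the
# same error statistics buy less and less once rho is moderate or high.
```

Under this idealization the benefit of each added member shrinks quickly once the error correlation is in the 0.5-0.8 range the abstract quotes, which is why a modest ensemble can already be near saturation.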
    Bohn T. J., M. Y. Sonessa, and D. P. Lettenmaier, 2010: Seasonal hydrologic forecasting: Do multimodel ensemble averages always yield improvements in forecast skill? J. Hydrometeorol., 11(4), 1358-1372. Multimodel averaging techniques have proven useful in improving forecast skill in many applications, including hydrology. Seasonal hydrologic forecasting in large basins represents a special case of hydrologic modeling, in which postprocessing techniques such as temporal aggregation and time-varying bias correction are often employed to improve forecast skill. To investigate the effects that these techniques have on the performance of multimodel averaging, the performance of three hydrological models [Variable Infiltration Capacity, Sacramento/Snow-17, and the Noah land surface model] and two multimodel averages [simple model average (SMA) and multiple linear regression (MLR) with monthly varying model weights] is examined in three snowmelt-dominated basins in the western United States. These evaluations were performed for both simulating and forecasting [using the Ensemble Streamflow Prediction (ESP) method] monthly discharge, with and without monthly bias corrections. The single best bias-corrected model outperformed the multimodel averages of raw models in both retrospective simulations and ensemble mean forecasts in terms of RMSE. Forming an MLR multimodel average from bias-corrected models added only slight improvements over the best bias-corrected model. Differences in performance among all bias-corrected models and multimodel averages were small. For ESP forecasts, both bias correction and multimodel averaging generally reduced the RMSE of the ESP ensemble means at lead times of up to 6 months in months when flow is dominated by snowmelt, with the reduction increasing as lead time decreased. The primary reason for this is that aggregating simulated streamflows from daily to monthly time scales increases model cross correlation, which in turn reduces the effectiveness of multimodel averaging in reducing those components of model error that bias correction cannot address. This effect may be stronger in snowmelt-dominated basins because the interannual variability of winter precipitation is a common input to all models. It was also found that both bias correcting and multimodel averaging using monthly varying parameters yielded much greater error reductions than methods using time-invariant parameters.
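The SMA-versus-MLR comparison described in this entry can be sketched on synthetic data. The "models", their biases, and the train/test split below are invented for illustration; only the combining logic (per-model bias correction, simple averaging, and regression-weighted averaging) follows the entry:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic "observed" monthly flows and three biased model simulations.
obs = rng.gamma(shape=2.0, scale=50.0, size=240)             # 20 years of months
models = np.stack([1.3 * obs + rng.normal(0, 30, obs.size),  # wet bias
                   0.8 * obs + rng.normal(0, 25, obs.size),  # dry bias
                   obs + 20 + rng.normal(0, 35, obs.size)])  # additive bias

train = slice(0, 120)    # first 10 years for fitting
test = slice(120, 240)   # last 10 years for evaluation

# Linear bias correction per model, fitted on the training period only.
corrected = np.empty_like(models)
for i, m in enumerate(models):
    a, b = np.polyfit(m[train], obs[train], 1)
    corrected[i] = a * m + b

sma = corrected.mean(axis=0)                  # simple model average (SMA)

# MLR weights (with intercept) fitted on the training period.
X = np.vstack([np.ones(obs.size), corrected]).T
w, *_ = np.linalg.lstsq(X[train], obs[train], rcond=None)
mlr = X @ w

def rmse(pred):
    return float(np.sqrt(np.mean((pred[test] - obs[test]) ** 2)))

print({"best single": min(rmse(c) for c in corrected),
       "SMA": rmse(sma), "MLR": rmse(mlr)})
```

With strongly biased members, most of the gain typically comes from the bias correction itself, mirroring the entry's finding that multimodel averaging adds little once each model is bias-corrected.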
    Bougeault, P., and Coauthors, 2010: The THORPEX Interactive Grand Global Ensemble. Bull. Amer. Meteor. Soc., 91, 1059-1072. Ensemble forecasting is increasingly accepted as a powerful tool to improve early warnings for high-impact weather. Recently, ensembles combining forecasts from different systems have attracted a considerable level of interest. The Observing System Research and Predictability Experiment (THORPEX) Interactive Grand Global Ensemble (TIGGE) project, a prominent contribution to THORPEX, has been initiated to enable advanced research and demonstration of the multimodel ensemble concept and to pave the way toward operational implementation of such a system at the international level. The objectives of TIGGE are 1) to facilitate closer cooperation between the academic and operational meteorological communities by expanding the availability of operational products for research, and 2) to facilitate exploring the concept and benefits of multimodel probabilistic weather forecasts, with a particular focus on high-impact weather prediction. Ten operational weather forecasting centers producing daily global ensemble forecasts to 1-2 weeks ahead have agreed to deliver in near-real time a selection of forecast data to the TIGGE data archives at the China Meteorological Administration, the European Centre for Medium-Range Weather Forecasts, and the National Center for Atmospheric Research. The volume of data accumulated daily is 245 GB (1.6 million global fields), offered to the scientific community as a new resource for research and education. The TIGGE data policy is to make each forecast accessible via the Internet 48 h after it was initially issued by the originating center. Quicker access can also be granted for field experiments or projects of particular interest to the World Weather Research Programme and THORPEX. A few examples of initial results based on TIGGE data are discussed in this paper, and the case is made for additional research in several directions.
    Buizza R., T. N. Palmer, 1998: Impact of ensemble size on ensemble prediction. Mon. Wea. Rev., 126, 2503-2518.
    Chen Q. Y., M. M. Yao, and Y. Wang, 2004: A new generation of operational medium-range weather forecast model T213L31 in National Meteorological Center. Meteorological Monthly, 30(8), 16-21. (in Chinese) The medium-range numerical weather forecast system T213 became operational on 1 September 2002 at the National Meteorological Center. As the core of the new system, the global model T213L31 uses new numerical techniques and a new time integration scheme, including the semi-Lagrangian treatment of advection, the use of a reduced Gaussian grid, improvements to the model's basic architecture, and the application of distributed-memory and shared-memory parallelization, making it possible to run a high-resolution model on the computers available at the National Meteorological Center. More importantly, T213L31 uses new, more physically realistic parameterization schemes (for radiation, subgrid-scale orographic drag, convection, clouds, and the land surface), thereby overcoming many of the problems that T106L19 suffered from and clearly improving forecast skill.
    Clark, A. J., and Coauthors, 2011: Probabilistic precipitation forecast skill as a function of ensemble size and spatial scale in a convection-allowing ensemble. Mon. Wea. Rev., 139, 1410-1418. Probabilistic quantitative precipitation forecasts (PQPFs) from the storm-scale ensemble forecast system run by the Center for Analysis and Prediction of Storms during the spring of 2009 are evaluated using the area under the relative operating characteristic curve (ROC area). ROC area, which measures discriminating ability, is examined for ensemble sizes n from 1 to 17 members and for spatial scales ranging from 4 to 200 km. As expected, incremental gains in skill decrease with increasing n. Significance tests comparing ROC areas for each n to those of the full 17-member ensemble revealed that more members are required to reach statistically indistinguishable PQPF skill relative to the full ensemble as forecast lead time increases and spatial scale decreases. These results appear to reflect the broadening of the forecast probability distribution function (PDF) of future atmospheric states associated with decreasing spatial scale and increasing forecast lead time. They also illustrate that efficient allo...
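ROC area as a function of ensemble size can be illustrated with a toy reconstruction. The "event", the noise levels, and the 17-member setup below are invented; only the ROC-area estimator (the rank-sum form of the Mann-Whitney statistic) and the relative-frequency probabilities follow the entry:

```python
import numpy as np
from scipy.stats import rankdata

def roc_area(scores, outcomes):
    """ROC area (AUC) via the rank-sum statistic; average ranks handle ties."""
    r = rankdata(scores)
    pos = np.asarray(outcomes, dtype=bool)
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return float((r[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg))

rng = np.random.default_rng(2)
n_cases, full = 4000, 17
truth = rng.normal(size=n_cases)                                # "true" state per case
members = truth[:, None] + rng.normal(0, 1.0, (n_cases, full))  # member forecasts
event = truth > 1.0                                             # observed binary event

for m in (1, 5, 17):
    p = (members[:, :m] > 1.0).mean(axis=1)     # ensemble relative frequency
    print(m, round(roc_area(p, event), 3))
```

The printed ROC areas grow with m but with shrinking increments, the qualitative behavior the entry quantifies with significance tests.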
    Deque M., 1997: Ensemble size for numerical seasonal forecasts. Tellus A, 49, 74-86. The predictability of 500 hPa height, 850 hPa temperature and precipitation is studied using the "perfect model" approach with ensemble numerical forecasts. The sea surface temperature interannual variability is introduced to provide some source of seasonal predictability. The mean scores of 16 winter forecasts show a large potential in the tropics and a weak one in the midlatitudes, in particular over Europe. The weakness of the scores in the midlatitudes may be partly explained by the model's underestimation of the amplitude of the seasonal anomalies. The seasonal forecasts are based on 9 individual model integrations starting from slightly different initial conditions. The variation of the scores with the size of the ensemble is estimated empirically. It is shown that, for a perfect seasonal forecast, the ensemble size necessary to approach the saturation score is about 3 for tropical precipitation, 20 for midlatitude height, and 40 for temperature over Europe.
    Du J., S. L. Mullen, and F. Sanders, 1997: Short-range ensemble forecasting of quantitative precipitation. Mon. Wea. Rev., 125, 2427-2459. The impact of initial condition uncertainty (ICU) on quantitative precipitation forecasts (QPFs) is examined for a case of explosive cyclogenesis that occurred over the contiguous United States and produced widespread, substantial rainfall. The Pennsylvania State University-National Center for Atmospheric Research (NCAR) Mesoscale Model Version 4 (MM4), a limited-area model, is run at 80-km horizontal resolution with 15 layers to produce a 25-member, 36-h forecast ensemble. Lateral boundary conditions for MM4 are provided by ensemble forecasts from a global spectral model, the NCAR Community Climate Model Version 1 (CCM1). The initial perturbations of the ensemble members possess a magnitude and spatial decomposition that closely match estimates of global analysis error, but they are not dynamically conditioned. Results for the 80-km ensemble forecast are compared to forecasts from the then-operational Nested Grid Model (NGM), a single 40-km/15-layer MM4 forecast, a single 80-km/29-layer MM4 forecast, and a second 25-member MM4 ensemble based on a different cumulus parameterization and slightly different unperturbed initial conditions. Large sensitivity to ICU marks ensemble QPF. Extrema in 6-h accumulations at individual grid points vary by as much as 3.00 in. Ensemble averaging reduces the root-mean-square error (rmse) of QPF. Nearly 90% of the improvement is obtainable with ensemble sizes as small as 8-10. Ensemble averaging can adversely affect the bias and equitable threat scores, however, because of its smoothing nature. Probabilistic forecasts for five mutually exclusive, completely exhaustive categories are found to be skillful relative to a climatological forecast. Ensemble sizes of approximately 10 can account for 90% of the improvement in categorical forecasts relative to the average of individual forecasts. The improvements due to short-range ensemble forecasting (SREF) techniques exceed any due to doubling the resolution...
    Epstein E. S., 1969: Stochastic dynamic prediction. Tellus, 21, 739-759.
    Fritsch J. M., J. Hilliker, J. Ross, and R. L. Vislocky, 2000: Model consensus. Wea. Forecasting, 15, 571-582.
    Hagedorn R., F. J. Doblas-Reyes, and T. N. Palmer, 2005: The rationale behind the success of multi-model ensembles in seasonal forecasting - I. Basic concept. Tellus A, 57(3), 219-233. The DEMETER multi-model ensemble system is used to investigate the rationale behind the multi-model concept. A comprehensive documentation of the differences in the single and multi-model performance in the DEMETER hindcast data set is given. Both deterministic and probabilistic diagnostics are used, and a variety of analyses demonstrate the improvements achieved by using multi-model instead of single-model ensembles. In order to understand the reason behind the multi-model superiority, basic scenarios describing how the multi-model approach can improve over single-model skill are discussed. It is demonstrated that multi-model superiority is caused not only by error compensation but in particular by its greater consistency and reliability.
    Hagedorn R., R. Buizza, T. M. Hamill, M. Leutbecher, and T. N. Palmer, 2012: Comparing TIGGE multimodel forecasts with reforecast-calibrated ECMWF ensemble forecasts. Quart. J. Roy. Meteor. Soc., 138, 1814-1827. Forecasts provided by the THORPEX Interactive Grand Global Ensemble (TIGGE) project were compared with reforecast-calibrated ensemble predictions from the European Centre for Medium-Range Weather Forecasts (ECMWF) in extratropical regions. Considering the statistical performance of global probabilistic forecasts of 850 hPa and 2 m temperatures, a multimodel ensemble containing nine ensemble prediction systems (EPSs) from the TIGGE archive did not improve on the performance of the best single model, the ECMWF EPS. However, a reduced multimodel system, consisting of only the four best ensemble systems, provided by Canada, the USA, the United Kingdom and ECMWF, showed improved performance. The multimodel ensemble provides a benchmark for the single-model systems contributing to the multimodel. However, reforecast-calibrated ECMWF EPS forecasts were of comparable or superior quality to the multimodel predictions when verified against two different reanalyses or observations. This improved performance was achieved by using the ECMWF reforecast dataset to correct for systematic errors and spread deficiencies. The ECMWF EPS was the main contributor to the improved performance of the multimodel ensemble; that is, if the multimodel system did not include the ECMWF contribution, it was not able to improve on the performance of the ECMWF EPS alone. These results were shown to be only marginally sensitive to the choice of verification dataset.
    Hamill T. M., R. Hagedorn, and J. S. Whitaker, 2008: Probabilistic forecast calibration using ECMWF and GFS ensemble reforecasts. Part II: Precipitation. Mon. Wea. Rev., 136, 2620-2632. As a companion to Part I, which discussed the calibration of probabilistic 2-m temperature forecasts using large training datasets, Part II discusses the calibration of probabilistic forecasts of 12-hourly precipitation amounts. Again, large ensemble reforecast datasets from the European Centre for Medium-Range Weather Forecasts (ECMWF) and the Global Forecast System (GFS) were used for testing and calibration. North American Regional Reanalysis (NARR) 12-hourly precipitation analysis data were used for verification and training. Logistic regression was used to perform the calibration, with power-transformed ensemble means and spreads as predictors. Forecasts were produced and validated for every NARR grid point in the conterminous United States (CONUS). Training sample sizes were increased by including data from 10 nearby grid points with similar analyzed climatologies. "Raw" probabilistic forecasts from each system were considered, in which probabilities were set according to ensemble relative frequency. Calibrated forecasts were also considered based on three amounts of training data: the last 30 days of forecasts (available for 2005 only), weekly reforecasts during 1982-2001, and daily reforecasts during 1979-2003 (GFS only). Several main results were found. (i) Raw probabilistic forecasts from the ensemble prediction systems' relative frequency possessed little or negative skill when skill was computed with a version of the Brier skill score (BSS) that does not award skill solely on the basis of differences in climatological probabilities among samples. ECMWF raw forecasts had larger skill than GFS raw forecasts. (ii) After calibration with weekly reforecasts, ECMWF forecasts were much improved in reliability and were moderately skillful. Similarly, GFS-calibrated forecasts were much more reliable, albeit somewhat less skillful. Nonetheless, GFS-calibrated forecasts were much more skillful than ECMWF raw forecasts. (iii) The last 30 days of training data produced calibrated forecasts of light-precipitation events that were nearly as skillful as those with weekly reforecast data. However, for higher precipitation thresholds, calibrated forecasts using the weekly reforecast datasets were much more skillful, indicating the importance of large sample size for the calibration of unusual and rare events. (iv) Training with daily GFS reforecast data provided calibrated forecasts with skill only slightly improved relative to that from the weekly data.
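The calibration step this entry describes (logistic regression on power-transformed ensemble mean and spread) can be sketched end to end on synthetic "precipitation" data. The 0.4 power, the 5 mm event threshold, and the error model are illustrative assumptions here, not the paper's values:

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic gamma-distributed "12-hourly precipitation" truth, and a
# 20-member ensemble whose members share a case-dependent (biased) error.
n = 5000
truth = rng.gamma(2.0, 2.0, n)
case_err = rng.lognormal(0.2, 0.5, n)                     # common wet bias
ens = (truth * case_err)[:, None] * rng.lognormal(0.0, 0.3, (n, 20))
event = (truth > 5.0).astype(float)                       # event: amount > 5 mm

# Predictors: power-transformed ensemble mean and spread (the general idea
# in the entry; the 0.4 power is an illustrative choice).
X = np.column_stack([np.ones(n),
                     ens.mean(axis=1) ** 0.4,
                     ens.std(axis=1) ** 0.4])

# Logistic regression fitted by Newton-Raphson iterations.
beta = np.zeros(3)
for _ in range(30):
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    grad = X.T @ (event - p)
    hess = X.T @ (X * (p * (1 - p))[:, None])
    beta += np.linalg.solve(hess, grad)

p_cal = 1.0 / (1.0 + np.exp(-X @ beta))
p_raw = (ens > 5.0).mean(axis=1)          # raw ensemble relative frequency

brier = lambda q: float(np.mean((q - event) ** 2))
print({"raw": round(brier(p_raw), 4), "calibrated": round(brier(p_cal), 4)})
```

Because the raw members carry a systematic wet bias, their relative frequencies overforecast the event; the fitted regression removes that miscalibration, which is the mechanism behind result (i) versus (ii) in the entry.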
    Hashino T., A. A. Bradley, and S. S. Schwartz, 2007: Evaluation of bias-correction methods for ensemble streamflow volume forecasts. Hydrology and Earth System Sciences, 11(2), 939-950. Ensemble prediction systems are used operationally to make probabilistic streamflow forecasts for seasonal time scales. However, hydrological models used for ensemble streamflow prediction often have simulation biases that degrade forecast quality and limit the operational usefulness of the forecasts. This study evaluates three bias-correction methods for ensemble streamflow volume forecasts. All three adjust the ensemble traces using a transformation derived from simulated and observed flows in a historical simulation. The quality of probabilistic forecasts issued with the three bias-correction methods is evaluated using a distributions-oriented verification approach. Comparisons are made of retrospective forecasts of monthly flow volumes for the Des Moines River, issued sequentially for each month over a 48-year record. The results show that all three bias-correction methods significantly improve forecast quality by eliminating unconditional biases and enhancing the potential skill. Still, subtle differences in the attributes of the bias-corrected forecasts have important implications for their use in operational decision-making. Diagnostic verification distinguishes these attributes in a context meaningful for decision-making, providing criteria for choosing among bias-correction methods with comparable skill.
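One common transformation of the kind this entry evaluates is empirical quantile mapping against a historical simulation. The sketch below is a generic illustration of that idea, not one of the paper's three specific methods, and all data are synthetic:

```python
import numpy as np

def quantile_map(sim_hist, obs_hist, sim_new):
    """Adjust simulated values by matching their empirical quantiles in the
    historical simulation to the observed historical distribution."""
    sim_sorted, obs_sorted = np.sort(sim_hist), np.sort(obs_hist)
    q = np.interp(sim_new, sim_sorted, np.linspace(0.0, 1.0, sim_sorted.size))
    return np.interp(q, np.linspace(0.0, 1.0, obs_sorted.size), obs_sorted)

rng = np.random.default_rng(4)
obs_hist = rng.gamma(2.0, 40.0, 480)                        # 40 years of monthly volumes
sim_hist = 1.25 * obs_hist + rng.normal(0.0, 15.0, 480)     # model with a wet bias

# Raw ensemble traces from the same biased model, then bias-corrected.
traces = 1.25 * rng.gamma(2.0, 40.0, 50) + rng.normal(0.0, 15.0, 50)
adjusted = quantile_map(sim_hist, obs_hist, traces)
print(round(float(traces.mean()), 1), round(float(adjusted.mean()), 1))
```

By construction, mapping the historical simulation through itself reproduces the observed distribution exactly, which is the sense in which this family of corrections removes unconditional bias.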
    Houtekamer P. L., J. Derome, 1995: Methods for ensemble prediction. Mon. Wea. Rev., 123, 2181-2196.
    Jeong D., Y. O. Kim, 2009: Combining single-value streamflow forecasts - A review and guidelines for selecting techniques. J. Hydrol., 377(3-4), 284-299. Selecting an appropriate method for combining single-value forecasts should depend on characteristics of the individual forecasts being combined and their relationships with each other. This study attempts to develop a guideline for choosing effective combining techniques by using analytical derivations and/or hydrological experiments. The two most popular combining techniques, Simple Average (SA) ...
    Krishnamurti T. N., C. M. Kishtawal, T. E. LaRow, D. R. Bachiochi, Z. Zhang, C. E. Williford, S. Gadgil, and S. Surendran, 1999: Improved weather and seasonal climate forecasts from multimodel superensemble. Science, 285(5433), 1548-1550. A method for improving weather and climate forecast skill has been developed. It is called a superensemble, and it arose from a study of the statistical properties of a low-order spectral model. Multiple regression was used to determine coefficients from multimodel forecasts and observations. The coefficients were then used in the superensemble technique. The superensemble was shown to outperform all model forecasts for multiseasonal, medium-range weather and hurricane forecasts. In addition, the superensemble was shown to have higher skill than forecasts based solely on ensemble averaging.
    Krishnamurti T. N., C. M. Kishtawal, Z. Zhang, T. LaRow, D. Bachiochi, E. Williford, S. Gadgil, and S. Surendran, 2000: Multimodel ensemble forecasts for weather and seasonal climate. J. Climate, 13(23), 4196-4216. In this paper the performance of a multimodel ensemble forecast analysis that shows superior forecast skill is illustrated and compared to all of the individual models used. The model comparisons include global weather, hurricane track and intensity forecasts, and seasonal climate simulations. The performance improvements are attributed entirely to the collective information of all models used in the statistical algorithm. The proposed concept is first illustrated for a low-order spectral model from which the multimodels and a "nature run" were constructed. Two hundred time units are divided into a training period (70 time units) and a forecast period (130 time units). The multimodel forecasts and the observed fields (the nature run) during the training period are subjected to a simple linear multiple regression to derive the statistical weights for the member models. The multimodel forecasts, generated for the next 130 forecast units, outperform all the individual models. This procedure was deployed for the multimodel forecasts of global weather, multiseasonal climate simulations, and hurricane track and intensity forecasts. For each type an improvement of the multimodel analysis is demonstrated and compared to the performance of the individual models. Seasonal and multiseasonal simulations demonstrate a major success of this approach for atmospheric general circulation models where the sea surface temperatures and sea ice are prescribed. In many instances, a major improvement in skill over the best models is noted.
    Leith C. E., 1974: Theoretical skill of Monte Carlo forecasts. Mon. Wea. Rev., 102, 409-418.
    Ma J. H., Y. J. Zhu, R. Wobus, and P. X. Wang, 2012: An effective configuration of ensemble size and horizontal resolution for the NCEP GEFS. Adv. Atmos. Sci., 29, 782-794, doi: 10.1007/s00376-012-1249-y. Two important questions are addressed in this paper using the Global Ensemble Forecast System (GEFS) from the National Centers for Environmental Prediction (NCEP): (1) How many ensemble members are needed to better represent forecast uncertainties with limited computational resources? (2) What is the relative impact on forecast skill of increasing model resolution versus ensemble size? Two-month experiments at T126L28 resolution were used to test the impact of varying the ensemble size from 5 to 80 members for 500 hPa geopotential height. Results indicate that increasing the ensemble size leads to significant improvements in performance for all forecast ranges when measured by probabilistic metrics, but these improvements are not significant beyond 20 members for long forecast ranges when measured by deterministic metrics. An ensemble of 20 to 30 members is the most effective configuration of ensemble size when quantifying the tradeoff between ensemble performance and the cost of computational resources. Two representative configurations of the GEFS (the T126L28 model with 70 members and the T190L28 model with 20 members, which have equivalent computing costs) were compared. Results confirm that, for the NCEP GEFS, increasing the model resolution is more (less) beneficial than increasing the ensemble size for a short (long) forecast range.
    Najafi M. R., H. Moradkhani, 2016: Ensemble combination of seasonal streamflow forecasts. Journal of Hydrologic Engineering, 21(2), 04015043. Various hydrologic models with different complexities have been developed to represent the characteristics of river basins, improve streamflow forecasts such as seasonal volumetric flow predictions, and meet other demands from different stakeholders. Because no single hydrologic model is able to perfectly simulate the observed flow, multimodel combination techniques have been developed to combine forecasts obtained from different models and to quantify the uncertainties, with the goal of improving upon single-model performance. In this study, a comprehensive set of multimodel ensemble averaging techniques with varying complexities is investigated for operational forecasting over four river basins in the western United States. Ensemble merging models are divided into three categories (simple, intermediate, and complex), and the classes are compared using a bootstrap approach. Analysis suggests that model combination effectively improves most of the individual seasonal forecasts and can outperform the best forecast model. Simple average, median, Bates-Granger, constrained linear regression, and Bayesian model averaging optimized by expectation maximization showed better results compared with other methods over three basins. For the Rogue River basin, the intermediate and complex models outperformed most of the individual forecasts and the simple methods. Multimodeling techniques based on information criteria showed similar performances.
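Of the simple combination methods this entry lists, Bates-Granger weighting is the easiest to state: weight each model by the inverse of its training-period MSE. A toy sketch, with the models and error magnitudes invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(5)

obs = rng.normal(100.0, 20.0, 300)
# Three unbiased forecasts with different (independent) error variances.
fcsts = np.stack([obs + rng.normal(0, s, 300) for s in (10.0, 18.0, 30.0)])

train, test = slice(0, 150), slice(150, 300)

# Bates-Granger: weight each model by the inverse of its training MSE.
mse = ((fcsts[:, train] - obs[train]) ** 2).mean(axis=1)
w = (1.0 / mse) / (1.0 / mse).sum()

combo = (w[:, None] * fcsts).sum(axis=0)
rmse = lambda f: float(np.sqrt(np.mean((f[test] - obs[test]) ** 2)))
print([round(rmse(f), 2) for f in fcsts], round(rmse(combo), 2), w.round(3).tolist())
```

With independent errors, inverse-MSE weights approximate the optimal linear combination, so the weighted mean beats the equal-weight average whenever the models differ much in accuracy.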
    Raftery A. E., T. Gneiting, F. Balabdaoui, and M. Polakowski, 2005: Using Bayesian model averaging to calibrate forecast ensembles. Mon. Wea. Rev., 133(3), 1155-.
    Reifen C., R. Toumi, 2009: Climate projections: Past performance no guarantee of future skill? Geophys. Res. Lett., 36, L13704, doi: 10.1029/2009GL038082. The principle of selecting climate models based on their agreement with observations has been tested for surface temperature using 17 of the IPCC AR4 models. Those models simulating global mean, Siberian and European 20th century surface temperature with a lower error than the total ensemble for one period on average do not do so for a subsequent period. Error in the ensemble mean decreases sys...
    Richardson D. S., 2001: Measures of skill and value of ensemble prediction systems, their interrelationship and the effect of ensemble size. Quart. J. Roy. Meteor. Soc., 127, 2473-2489. Ensemble forecasts provide probabilistic predictions for the future state of the atmosphere. Usually the probability of a given event E is determined from the fraction of ensemble members which predict the event, so there is a degree of sampling error inherent in the predictions. In this paper a theoretical study is made of the effect of ensemble size on forecast performance, as measured by a reliability diagram and the Brier (skill) score, and on users, by means of a simple cost-loss decision model. The relationship between skill and value, and a generalized skill score dependent on the distribution of users, are discussed. The Brier skill score is reduced from its potential level for all finite-sized ensembles. The impact is most significant for small ensembles, especially when the variance of forecast probabilities is also small. The Brier score for a set of deterministic forecasts is a measure of potential predictability, assuming the forecasts are representative selections from a reliable ensemble prediction system (EPS). There is a consistent effect of finite ensemble size on the reliability diagram. Even if the underlying distribution is perfectly reliable, sampling it with only a small number of ensemble members introduces considerable unreliability. There is a consistent over-forecasting which appears as a clockwise tilt of the reliability diagram. It is important to be aware of the expected effect of ensemble size to avoid misinterpreting results. An ensemble of ten or so members should not be expected to provide reliable probability forecasts. Equally, when comparing the performance of different ensemble systems, any difference in ensemble size should be considered before attributing performance differences to other differences between the systems. The usefulness of an EPS to individual users cannot be deduced from the Brier skill score (nor even directly from the reliability diagram). An EPS with minimal Brier skill may nevertheless be of substantial value to some users, while small differences in skill may hide substantial variation in value. Using a simple cost-loss decision model, the sensitivity of users to differences in ensemble size is shown to depend on the predictability and frequency of the event and on the cost-loss ratio of the user. For an extreme event with low predictability, users with a low cost-loss ratio will gain significant benefit from increasing ensemble size from 50 to 100 members, with potential for substantial additional value from further increases in the number of members. This sensitivity to large ensemble sizes is not evident in the Brier skill score. A generalized skill score, dependent on the distribution of users, allows a summary performance measure to be tuned to a particular aspect of EPS performance.
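The finite-ensemble effect this entry analyzes has a simple closed form under its own "perfectly reliable" assumption: estimating probabilities as k/m from m members inflates the expected Brier score by E[p(1-p)]/m over the infinite-ensemble limit. A Monte Carlo sketch (the Beta(2, 5) climatology of event probabilities is an arbitrary choice for illustration):

```python
import numpy as np

rng = np.random.default_rng(6)

n = 200_000
p_true = rng.beta(2.0, 5.0, n)            # underlying reliable event probabilities
event = rng.random(n) < p_true            # observed binary outcomes

def brier(m):
    """Brier score when the probability is estimated as k/m from m members
    drawn from the (perfectly reliable) forecast distribution."""
    k = rng.binomial(m, p_true)           # members forecasting the event
    return float(np.mean((k / m - event) ** 2))

bs_inf = float(np.mean((p_true - event) ** 2))   # infinite-ensemble limit
for m in (10, 50, 100):
    # Compare the simulated score with bs_inf + E[p(1-p)] / m.
    print(m, round(brier(m), 4),
          round(bs_inf + float(np.mean(p_true * (1 - p_true))) / m, 4))
```

The 1/m penalty is largest for small ensembles, matching the entry's warning that an ensemble of ten or so members cannot be expected to yield reliable probabilities.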
    Sanders F., 1963: On subjective probability forecasting. J. Appl. Meteor., 2, 191-201. The subjective process of probability forecasting is analyzed. It is found to contain a sorting aspect, in which the forecaster distributes all instances into an ordered set of categories of likelihood of occurrence, and a labeling aspect, in which the forecaster assigns an anticipated relative frequency, or probability, of occurrence for each category. These two aspects are identified with the concepts of sharpness and validity, which have been introduced by other writers. The verification score proposed by Brier is shown to consist of the sum of measures of these two qualities. A satisfactory measure of synoptic skill is obtained by applying the Brier score to the synoptic probability forecast and to a control forecast of the climatological probability, and by expressing the difference as a percentage of the control score. In an analysis of a large number of short-range probability forecasts made by instructors and students in the synoptic laboratory of the Massachusetts Institute of Technology...
    Su X., H. L. Yuan, Y. J. Zhu, Y. Luo, and Y. Wang, 2014: Evaluation of TIGGE ensemble predictions of Northern Hemisphere summer precipitation during 2008-2012. J. Geophys. Res. Atmos., 119, 7292-7310. The ensemble mean quantitative precipitation forecasts (QPFs) and probabilistic QPFs (PQPFs) from six operational global ensemble prediction systems (EPSs) in The Observing System Research and Predictability Experiment Interactive Grand Global Ensemble (TIGGE) dataset are evaluated against Tropical Rainfall Measuring Mission observations using a series of area-weighted verification metrics during June to August 2008-2012 in the Northern Hemisphere (NH) midlatitudes and tropics. Results indicate that the European Centre for Medium-Range Weather Forecasts generally performs best, while the Canadian Meteorological Centre (CMC) is relatively good for short-range QPFs and for PQPFs at light precipitation thresholds. The overall forecast skill is better in the NH midlatitudes than in the NH tropics. QPFs and PQPFs from the China Meteorological Administration (CMA) have very little ability to discriminate between different observed rain events in the NH tropics. The day +1 QPFs from the Japan Meteorological Agency have remarkably large moist biases in the NH tropics, which leads to a discontinuity of forecast performance with lead time. Performance changes due to major EPS upgrades during the five summers are also examined, using the forecasts from CMA as the reference to eliminate interannual variation. After its EPS upgrade, CMC improves the PQPF skill at light precipitation thresholds, while its excessively enlarged ensemble spread increases the overall QPF and PQPF errors.
    Vislocky R. L., J. M. Fritsch, 1995: Improved model output statistics forecasts through model consensus. Bull. Amer. Meteor. Soc., 76( 5), 1157- 1164.10.1175/1520-0477(1995)0762.0.CO; forecasts are computed by averaging model output statistics (MOS) forecasts based on the limited-area fine-mesh (LFM) model and the nested grid model (NGM) for the three-year period 1990-92. The test consists of four weather elements (max/ min temperature, wind speed, probability of cloud amount, and 12-h probability of precipitation) at four projection times from each initialization (0000 and 1200 UTC) for roughly 250-350 stations. Verification results clearly indicate a substantial improvement forthe consensus MOS over both the LFM and NGM MOS forecasts for all variables and all lead times. The accuracy increase is on par with a 2-8-yr scientific advancement and a 4-12-h lead time improvement. Moreover, performance of the consensus MOS forecasts is similar to subjective forecasts issued by the National Weather Service. These results are illustrative of the broad need to adopt a strategy of statistically combining available forecast products rather than relying upon the single most superior product (such as the newest numerical model). Furthermore, there appears to be strong justification to continue support for the entire LFM MOS product both in terms of its full availability and its equation upgrade.
    Vrugt J. A., M. P. Clark, C. G. H. Diks, Q. Y. Duan, and B. A. Robinson, 2006: Multi-objective calibration of forecast ensembles using Bayesian model averaging. Geophys. Res. Lett., 33(17),L19817, doi: 10.1029/2006GL027126.10.1029/ Model Averaging (BMA) has recently been proposed as a method for statistical postprocessing of forecast ensembles from numerical weather prediction models. The BMA predictive probability density function (PDF) of any weather quantity of interest is a weighted average of PDFs centered on the bias-corrected forecasts from a set of different models. However, current applications of BMA calibrate the forecast specific PDFs by optimizing a single measure of predictive skill. Here we propose a multi-criteria formulation for postprocessing of forecast ensembles. Our multi-criteria framework implements different diagnostic measures to reflect different but complementary metrics of forecast skill, and uses a numerical algorithm to solve for the Pareto set of parameters that have consistently good performance across multiple performance metrics. Two illustrative case studies using 48-hour ensemble data of surface temperature and sea level pressure, and multi-model seasonal forecasts of temperature, show that a multi-criteria formulation provides a more appealing basis for selecting the appropriate BMA model.
    Wang Y., H. Qian, J.-J. Song, and M.-Y. Jiao, 2008: Verification of the T213 global spectral model of China National Meteorology Center over the East-Asia area. J. Geophys. Res., 113,D10110, doi: 10.1029/2007JD008750.10.1029/ September 2002, the global spectral model T213L31 has been put into operational use at the National Meteorology Center of China. To acquire a comprehensive assessment of T213's performance, four verifications have been implemented, (1) temporal analysis of its forecast accuracy series; (2) spatial analysis and lag correlation analysis of the forecast accuracy; (3) precipitation verification and (4) comparison between the models of T213, T106 (prior version of T213) and ECMWF. The verification illustrates that, after adopting a finer grid and improving many physical schemes, T213 has largely enhanced its forecast accuracy over its prior version. However, its forecast is still poorer than the ECMWF model, and T213 needs to especially improve its 3-5 days' forecast performance. The precipitation verification indicates that T213's forecast for light rain is up-to-standard (0.561 for 24 h forecast), but the forecast accuracy for the larger precipitation drops rapidly. The time series verification shows that the T213's daily forecast exhibits a seasonal trend: the forecast for summer is worse than other seasons and the forecast accuracy decreases to the minimum at July, which suggests the possible impact of the moisture forecast error on the decreases of weather forecast accuracy. This impact is confirmed by the spatial analysis and lag correlation analysis, which show that the specific humidity's forecast has a lagged influence on the accuracies of both the temperature forecast and the geo-potential height forecast, and therefore indicates that the further improvement on specific humidity forecasting and the related moisture parameterization schemes are the crucial points in the future development of the T213 model.
    Weigel A. P., M. A. Liniger, and C. Appenzeller, 2008: Can multi-model combination really enhance the prediction skill of probabilistic ensemble forecasts? Quart. J. Roy. Meteor. Soc., 134( 630), 241- 260.10.1002/ Available
    Weisheimer A., Coauthors, 2009: ENSEMBLES: A new multi-model ensemble for seasonal-to-annual predictions——Skill and progress beyond DEMETER in forecasting tropical Pacific SSTs. Geophys. Res. Lett., 36, L21711, doi: 10.1029/2009GL040896.10.1029/ new 46-year hindcast dataset for seasonal-to-annual ensemble predictions has been created using a multi-model ensemble of 5 state-of-the-art coupled atmosphere-ocean circulation models. The multi-model outperforms any of the single-models in forecasting tropical Pacific SSTs because of reduced RMS errors and enhanced ensemble dispersion at all lead-times. Systematic errors are considerably reduced over the previous generation (DEMETER). Probabilistic skill scores show higher skill for the new multi-model ensemble than for DEMETER in the 4-6 month forecast range. However, substantially improved models would be required to achieve strongly statistical significant skill increases. The combination of ENSEMBLES and DEMETER into a grand multi-model ensemble does not improve the forecast skill further. Annual-range hindcasts show anomaly correlation skill of 0.5 up to 14 months ahead. A wide range of output from the multi-model simulations is becoming publicly available and the international community is invited to explore the full scientific potential of these data. Copyright 2009 by the American Geophysical Union.
    Winter C. L., D. Nychka, 2010: Forecasting skill of model averages. Stochastic Environmental Research and Risk Assessment, 24( 3), 633- 638.10.1007/ a collection of science-based computational models that all estimate states of the same environmental system, we compare the forecast skill of the average of the collection to the skills of the individual members. We illustrate our results through an analysis of regional climate model data and give general criteria for the average to perform more or less skillfully than the most skillful individual model, the “best” model. The average will only be more skillful than the best model if the individual models in the collection produce very different forecasts; if the individual forecasts generally agree, the average will not be as skillful as the best model.
    Yoo J. H., I. S. Kang, 2005: Theoretical examination of a multi-model composite for seasonal prediction. Geophys. Res. Lett., 32(16), L18707, doi: 10.1029/2005GL023513.10.1029/ performance of a multi-model composite for seasonal prediction is theoretically examined in terms of a correlation skill. On the basis of theoretical analysis, we discuss the improvement of skill in the multi-model composite using the APCN multi-model seasonal prediction dataset. Although the skill of multi-model composite is generally increased by increasing the number of models, the highest skill can be obtained by selecting several skillful models which are less dependent each other.
    Yuan H. L., X. G. Gao, S. L. Mullen, S. Sorooshian, J. Du, and H. M. H. Juang, 2007: Calibration of probabilistic quantitative precipitation forecasts with an artificial neural network. Wea. Forecasting, 22, 1287- 1303.10.1175/ feed-forward neural network is configured to calibrate the bias of a high-resolution probabilistic quantitative precipitation forecast (PQPF) produced by a 12-km version of the NCEP Regional Spectral Model (RSM) ensemble forecast system. Twice-daily forecasts during the 2002–2003 cool season (1 November–31 March, inclusive) are run over four U.S. Geological Survey (USGS) hydrologic unit regions of the southwest United States. Calibration is performed via a cross-validation procedure, where four months are used for training and the excluded month is used for testing. The PQPFs before and after the calibration over a hydrological unit region are evaluated by comparing the joint probability distribution of forecasts and observations. Verification is performed on the 4-km stage IV grid, which is used as “truth.” The calibration procedure improves the Brier score (BrS), conditional bias (reliability) and forecast skill, such as the Brier skill score (BrSS) and the ranked probability skill score (RPSS), relative to the sample frequency for all geographic regions and most precipitation thresholds. However, the procedure degrades the resolution of the PQPFs by systematically producing more forecasts with low nonzero forecast probabilities that drive the forecast distribution closer to the climatology of the training sample. The problem of degrading the resolution is most severe over the Colorado River basin and the Great Basin for relatively high precipitation thresholds where the sample of observed events is relatively small.
Manuscript History

Manuscript received: 19 June 2016
Manuscript revised: 27 July 2016
Manuscript accepted: 08 August 2016
Ensemble Mean Forecast Skill and Applications with the T213 Ensemble Prediction System

  • 1. Key Laboratory of Mesoscale Severe Weather, Ministry of Education, School of Atmospheric Sciences, Nanjing University, Nanjing 210023, China

Abstract: Ensemble forecasting has become the prevailing method in current operational weather forecasting. Although ensemble mean forecast skill has been studied for many ensemble prediction systems (EPSs) and different cases, theoretical analysis regarding ensemble mean forecast skill has rarely been investigated, especially quantitative analysis without any assumptions of ensemble members. This paper investigates fundamental questions about the ensemble mean, such as the advantage of the ensemble mean over individual members, the potential skill of the ensemble mean, and the skill gain of the ensemble mean with increasing ensemble size. The average error coefficient between each pair of ensemble members is the most important factor in ensemble mean forecast skill, which determines the mean-square error of ensemble mean forecasts and the skill gain with increasing ensemble size. More members are useful if the errors of the members have lower correlations with each other, and vice versa. The theoretical investigation in this study is verified by application with the T213 EPS. A typical EPS has an average error coefficient of between 0.5 and 0.8; the 15-member T213 EPS used here reaches a saturation degree of 95% (i.e., maximum 5% skill gain by adding new members with similar skill to the existing members) for 1-10-day lead time predictions, as far as the mean-square error is concerned.

1. Introduction
  • The principle of combining forecasting outputs from different models and members into an ensemble was proposed several decades ago (Sanders, 1963; Epstein, 1969; Leith, 1974) and has been widely employed in meteorology and other fields since the 1990s, especially the arithmetic average of all ensemble members, i.e., the ensemble mean. From an experimental perspective, it is well known that the ensemble mean often outperforms its individual members in operational forecasts (Vislocky and Fritsch, 1995; Fritsch et al., 2000). More recently, complex methods have been developed to construct unequally weighted or bias-corrected ensembles instead of the arithmetic mean, such as linear regressions (Krishnamurti et al., 1999, 2000), nonlinear regressions (Hamill et al., 2008), Bayesian averages (Raftery et al., 2005; Vrugt et al., 2006), artificial neural networks (Yuan et al., 2007), and time-varying weighted bias correction methods (Hashino et al., 2007; Bohn et al., 2010). The improvements in the ensemble mean due to these statistical methods vary case by case and are not stable when the number of samples is insufficient (Weisheimer et al., 2009). In fact, persistence in the relative skill of the members is required by the complex weighting combination methods, but not by simple arithmetic averaging (Reifen and Toumi, 2009). Therefore, the arithmetic ensemble mean remains one of the most effective methods in operational forecasts for many cases (Najafi and Moradkhani, 2016).

    From a theoretical perspective, the pioneering work of Leith (1974) first examined the potential skill of Monte Carlo forecasts and found that the sample mean could estimate the real state better than conventional single forecasts, indicating that the improvement of such a Monte Carlo ensemble in terms of mean-square skill is a consequence of the optimal filtering nature of the procedure. Several more recent studies have attempted to reveal the essence of the forecast skill of an ensemble mean. Hagedorn et al. (2005) argued that the success of multimodel ensemble means is mainly due to error cancellation and the nonlinearity of skill score metrics. Weigel et al. (2008) further found that a "poorer" member can also contribute to the skill of an ensemble mean. Other studies have examined the advantages of the ensemble mean by studying the relationships between ensemble members. For example, members that have higher skill and are less dependent on each other have been recommended for an ensemble prediction system (EPS) to achieve the best ensemble mean skill (Yoo and Kang, 2005). Jeong and Kim (2009) demonstrated that neither the equally nor the unequally weighted mean method can effectively improve the forecast skill if significant correlations exist between the members; however, their study only targeted two-member combinations and assumed that the two members were unbiased. Winter and Nychka (2010) conceptually indicated that the ensemble mean can outperform the best individual member if the forecasting outputs of the ensemble members differ markedly from each other. However, the relationship between ensemble mean skill and the correlation of the ensemble members has rarely been quantitatively deduced without assumptions.

    Another important issue is the role of ensemble size in the performance of EPSs. Previous studies have concluded that a limited number of ensemble members is sufficient to achieve a saturated skill (Houtekamer and Derome, 1995; Deque, 1997; Buizza and Palmer, 1998). Du et al. (1997) indicated that an ensemble size of 8-10 can account for a nearly 90% reduction in the RMSEs of ensemble mean precipitation forecasts. Clark et al. (2011) revealed that the skill gain decreases with increasing ensemble size. Ma et al. (2012) found that more members are required to increase the forecast skill, especially for long-range forecasts, although the improvements were insignificant beyond 20 members when measured by deterministic metrics. All of the above research was based on experimental studies. From a theoretical perspective, Richardson (2001) discussed the impact of ensemble size on probabilistic forecasts in terms of Brier scores, reliability diagrams and potential economic value, and found that the sufficient ensemble size differs among metrics. However, the impact of ensemble size on the mean-square error (MSE), one of the most commonly used deterministic metrics, has rarely been discussed in a theoretical context.

    This study aims to investigate the potential forecast skill of the ensemble mean, including the optimum ensemble mean and its superiority over its individual members, and the impact of ensemble size, without specific assumptions regarding the ensemble members. The theoretical analyses related to the fundamental questions of the ensemble mean are described in section 2. Experimental studies based on the China Meteorological Administration (CMA) T213L31 EPS (hereafter, T213 EPS) are presented in section 3. Section 4 gives a summary and discussion.

2. Theoretical analysis of the ensemble mean
  • For a finite number of data points \(K\), the forecasts from \(M\) ensemble members \((F_1,F_2,\ldots,F_M)\) can be combined to construct an ensemble mean \(F^\ast\), where \(F_i=(F_{i,1},F_{i,2},\ldots,F_{i,K})\), \(i=1,2,\ldots,M\), and \(F^\ast=(F_1^\ast,F_2^\ast,\ldots,F_K^\ast)\). \(F_{i,k}\) denotes the forecast at the \(k\)th data point predicted by the \(i\)th member, and \(T=(T_1,T_2,\ldots,T_K)\) denotes the corresponding validation values.

    Let \(E_1,E_2,\ldots,E_M\) and \(E^\ast\) denote the errors of each member and of the ensemble mean, respectively, where \(E_i=(e_{i,1},\ldots,e_{i,K})\) and \(e_{i,k}=F_{i,k}-T_k\). The errors \(E_1,E_2,\ldots,E_M,E^\ast\) can be viewed as random variables with expectations \(\overline{E}_1,\overline{E}_2,\ldots,\overline{E}_M,\overline{E}^\ast\), where \(\overline{E}_i=(1/K)\sum_{k=1}^K e_{i,k}\).

    The forecast error is expressed by the MSE: \begin{equation} \label{eq1} R_i^2=\frac{1}{K}\left[\sum_{k=1}^K{(F_{i,k}-T_k)^2}\right]= \dfrac{1}{K}\sum_{k=1}^K{e_{i,k}^2},\quad i=1,2,\ldots,M . (1)\end{equation}

    For the error of the ensemble mean, \(E^\ast=(e_1^\ast,e_2^\ast,\ldots,e_K^\ast)\), there exists \begin{eqnarray} e_k^\ast&=&\dfrac{1}{M}\sum_{i=1}^M{e_{i,k}},\quad k=1,2,\ldots,K ,\nonumber\\ R^{\ast2}&=&\dfrac{1}{K}\sum_{k=1}^Ke_k^{\ast2} =\dfrac{1}{K}\sum_{k=1}^K\left(\frac{1}{M}\sum_{i=1}^Me_{i,k}\right)^2\nonumber\\ &=&\dfrac{1}{M^2}\left[\sum_{i=1}^M\left(\dfrac{1}{K}\sum_{k=1}^K{e_{i,k}^2}\right)+ 2\sum_{i,j=1\atop i\ne j}^M\left(\dfrac{1}{K}\sum_{k=1}^Ke_{i,k}e_{j,k}\right)\right] .(2) \end{eqnarray}

    Let the following denote the error covariance between the ith and jth members: \begin{equation} R_{i,j}=\dfrac{1}{K}\sum_{k=1}^K{e_{i,k}e_{j,k}} .(3) \end{equation} The MSE of the ensemble mean can be calculated as \begin{equation} R^{\ast2}=\dfrac{1}{M^2}\left[\sum_{i=1}^M{R_i^2}+2\sum_{i,j=1\atop i\ne j}^M{R_{i,j}}\right] . (4)\end{equation} The errors of all ensemble members can be represented by a matrix E: \begin{equation} \label{eq2} E=\left[ \begin{array}{c@{\quad}c@{\quad}c} {R_{1}^2} & \cdots & {R_{{1},M}}\\ \vdots & \ddots & \vdots \\ {R_{M{,1}}} & \cdots & {R_M^2} \end{array} \right]_{M\times M} . (5)\end{equation}

    The MSE of the ensemble mean in Eq. (4) is equal to the average of the elements of matrix E in Eq. (5). The matrix E is symmetric because \(R_{i,j}=R_{j,i}\). The diagonal elements represent the forecast skill of the individual ensemble members according to Eq. (1), whereas the off-diagonal elements \(R_{i,j}\) represent the relationship between the errors of any two members. This reveals the mathematical essence of the ensemble mean: its skill depends on both the skills of the individual ensemble members and the relationship between the errors of any two members. This result generalizes that of Jeong and Kim (2009) because no assumptions are made in Eq. (4).
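The identity that the ensemble-mean MSE equals the average of all elements of the matrix E can be checked numerically. Below is a minimal Python sketch (not part of the original study; the synthetic truth and member forecasts are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
M, K = 5, 1000                      # ensemble size and number of data points
T = rng.normal(size=K)              # synthetic "truth" T_k (illustrative)
F = T + rng.normal(size=(M, K))     # member forecasts F_{i,k}
E = F - T                           # member errors e_{i,k}

# Error covariance matrix, Eqs. (3) and (5): E_mat[i, j] = (1/K) sum_k e_{i,k} e_{j,k}
E_mat = (E @ E.T) / K

# MSE of the ensemble mean, computed directly from the averaged errors
mse_ensemble_mean = np.mean(E.mean(axis=0) ** 2)

# Eq. (4): the ensemble-mean MSE equals the average of all M*M elements of E_mat
print(np.isclose(mse_ensemble_mean, E_mat.mean()))   # True
```

The check is an exact algebraic identity, so it holds for any error configuration, not just this synthetic one.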

    If we want to add a new member FM+1 to the already existing M-member ensemble, the MSE of the new ensemble is equal to the average elements of matrix EM+1: \begin{eqnarray} \label{eq3} E_{M{+1}}&=&\left[ \begin{array}{c@{\quad}c@{\quad}c@{\quad}c} {R_{1}^2} & \cdots & {R_{{1},M}} & {R_{{1},M+{1}}}\\ \vdots & \ddots & \vdots & \vdots \\ {R_{M,{1}}} & \cdots & {R_M^2} & {R_{M,M+{1}}}\\ {R_{M+{1,1}}} & \cdots & {R_{M+{1},M}} & {R_{M+{1}}^2} \end{array} \right]_{(M+{1})\times(M+{1})}\nonumber\\ &=&\left[ \begin{array}{c@{\quad}c@{\quad}c} {E_M} & & {R_{{1},M+{1}}}\\ & & \vdots \\ {R_{M+{1,1}}} & \cdots & {R_{M+{1}}^2} \end{array} \right] . (6)\end{eqnarray}

    The (M+1)-member ensemble outperforms the existing M-member ensemble if and only if the average of the elements of \(E_{M+1}\) is smaller than that of \(E_M\). This means that the average of the newly added elements in Eq. (6) should be smaller than the average of the elements of \(E_M\), i.e., the current \(R^{\ast2}\): \begin{equation} \label{eq4} \dfrac{1}{2M+1}\left(R_{M+{1}}^2{+}2\sum_{i=1}^M{R_{i,M+{1}}}\right)<R^{\ast 2} .(7) \end{equation}

    Equation (7) gives the necessary and sufficient condition under which a new member enhances the skill of the existing ensemble mean. Rather than simply having better skill, a newly added member should be weakly correlated with the existing ensemble members, because of the weights in Eq. (7). This explains why a "poorer" member can still enhance the skill of the ensemble mean, as discovered by Weigel et al. (2008). Conversely, even if the new member has higher skill than the existing members, it can still degrade the ensemble mean if it is highly correlated with them.
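The condition in Eq. (7) can be verified directly: the average of the 2M+1 newly added matrix elements is below the current \(R^{\ast2}\) exactly when appending the candidate member lowers the ensemble-mean MSE. A minimal Python sketch (the deliberately correlated candidate member and all data are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
M, K = 4, 2000
E = rng.normal(size=(M, K))                  # errors of the existing M members
# Candidate member: strongly correlated with the existing ensemble (illustrative)
e_new = 0.9 * E.mean(axis=0) + 0.3 * rng.normal(size=K)

mse = lambda e: np.mean(e ** 2)
R_star2 = mse(E.mean(axis=0))                # MSE of the current M-member mean

# Left-hand side of Eq. (7): average of the 2M+1 newly added elements of E_{M+1}
lhs = (mse(e_new) + 2 * sum(np.mean(E[i] * e_new) for i in range(M))) / (2 * M + 1)

improves = mse(np.vstack([E, e_new]).mean(axis=0)) < R_star2
print((lhs < R_star2) == improves)           # True: Eq. (7) predicts the outcome exactly
```

Because both sides of Eq. (7) are computed from the same error samples, the prediction and the actual change in MSE agree exactly, whatever the candidate member looks like.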

    For different \(i\) and \(j\), the following holds: \begin{equation} \label{eq5} R_{i,j}=\frac{1}{K}\sum_{k=1}^K e_{i,k}e_{j,k}\le\frac{1}{K}\sum_{k=1}^K\frac{1}{2}(e_{i,k}^2+e_{j,k}^2)=\frac{1}{2}(R_i^2+R_j^2) .(8) \end{equation}

    From Eqs. (4) and (8), the following can be obtained: \begin{eqnarray} \label{eq6} R^{\ast 2}&\le&\frac{1}{M^2}\left[\sum_{i=1}^M{R_i^2}+\sum_{i,j=1\atop i\ne j}^M(R_i^2+R_j^2)\right]\nonumber\\ &=&\frac{1}{M^2}\left[\sum_{i=1}^M{R_i^2}+(M-1)\sum_{i=1}^M{R_i^2}\right]\nonumber\\ &=&\frac{1}{M}\sum_{i=1}^M{R_i^2} . (9)\end{eqnarray} Equation (9) demonstrates that the MSE of the ensemble mean never exceeds the average MSE of the individual members. Thus, the ensemble mean avoids the risk of choosing a "poorer" single member when the relative performance of the individual members, or the best member, is unknown. This explains why the ensemble mean often achieves satisfactory skill in practice.
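Equation (9) holds for any error configuration, which can be spot-checked with random, arbitrarily correlated member errors (a sketch; the shared error component is an illustrative assumption used to induce correlation between members):

```python
import numpy as np

rng = np.random.default_rng(2)
for _ in range(100):
    M, K = int(rng.integers(2, 10)), 500
    shared = rng.normal(size=K)                   # common error component (illustrative)
    E = rng.normal(size=(M, K)) + shared          # correlated member errors e_{i,k}
    mse_mean = np.mean(E.mean(axis=0) ** 2)       # R*^2, MSE of the ensemble mean
    avg_member_mse = np.mean(E ** 2)              # (1/M) sum_i R_i^2
    assert mse_mean <= avg_member_mse + 1e-12     # Eq. (9) never fails
```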

    Let \(R_\min^2=\min(R_1^2,\ldots,R_M^2)\) denote the MSE of the best ensemble member. Moreover, let the following denote the average of the individual MSE Ri2: \begin{equation} \label{eq7} U=\frac{1}{M}\sum_{i=1}^M{R_i^2} . (10)\end{equation} The average of all possible Ri,j (i≠ j) can be expressed as \begin{equation} \label{eq8} L=\dfrac{2}{M(M-1)}\sum_{i,j=1\atop i\ne j}^M{R_{i,j}} . (11)\end{equation} From Eqs. (4), (10) and (11), the MSE of the ensemble mean can be written as \begin{equation} \label{eq9} R^{\ast2}=\dfrac{1}{M}U+\left(1-\dfrac{1}{M}\right)L .(12) \end{equation}

    As a result, the ensemble mean outperforms the best individual member if and only if \begin{equation} \label{eq10} R_{\min}^2>\dfrac{1}{M}U+\left(1-\dfrac{1}{M}\right)L . (13)\end{equation}

    Equation (12) can also be explained in terms of the matrix E in Eq. (5) because U represents the average of the diagonal elements of E, whereas L represents the average of all other elements of E.

    Equation (13) gives the necessary and sufficient condition under which the ensemble mean achieves higher skill than the best individual member, which occurs only under specific conditions, i.e., when the members have similar skills and low error covariances. This result is consistent with previous studies (Yoo and Kang, 2005; Jeong and Kim, 2009; Winter and Nychka, 2010), and it further indicates that the ensemble mean cannot outperform the best individual member if the members are highly correlated (larger L) or there are distinctly poorer members (a noticeable increase in U). It also explains why a multimodel ensemble occasionally fails to outperform its best individual model in numerical weather prediction (Hagedorn et al., 2012): the individual models operated by different centers may be highly correlated, or the best model may perform distinctly better than the others.
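The decomposition in Eq. (12), with U the average diagonal element and L the average off-diagonal element of E, can likewise be confirmed numerically (a sketch under synthetic-error assumptions; the shared error component is illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
M, K = 6, 1000
E = rng.normal(size=(M, K)) + 0.5 * rng.normal(size=K)  # correlated member errors

E_mat = (E @ E.T) / K                                   # error covariance matrix, Eq. (5)
U = np.trace(E_mat) / M                                 # Eq. (10): average member MSE
L = (E_mat.sum() - np.trace(E_mat)) / (M * (M - 1))     # Eq. (11): average R_{i,j}, i != j

R_star2 = np.mean(E.mean(axis=0) ** 2)
print(np.isclose(R_star2, U / M + (1 - 1 / M) * L))     # True: Eq. (12)

# Eq. (13): the ensemble mean beats the best member iff R_min^2 exceeds this value
R_min2 = np.diag(E_mat).min()
print(R_min2 > U / M + (1 - 1 / M) * L)
```

Whether the second check prints True depends on how similar and how correlated the members are, which is precisely the point of Eq. (13).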

  • From Eqs. (9), (10) and (12), the following is valid: \begin{equation} \label{eq11} U\ge R^{\ast 2}\ge L , (14)\end{equation} which shows that U and L can be treated as upper and lower bounds of the MSE of the ensemble mean.

    Figure 1.  MSE of the ensemble mean relative to its individual members, \(R^{\ast2}/U\), as a function of \(\rho\) and \(M\), according to Eq. (17).

    The MSE of the ensemble mean is equal to a weighted mean of U and L [Eq. (12)]. When the ensemble size M increases, the weight of the larger term U in Eq. (12) decreases, whereas the weight of the smaller term L increases. As a result, the error correlation for each pair of ensemble members becomes the main factor that determines the forecast skill of the ensemble mean. If the newly added members have attributes similar to those of the existing members, so that U and L remain roughly constant as the ensemble grows, the MSE of the ensemble mean decreases toward its lower bound L and reaches a saturated skill level.

    In the limit of infinite ensemble size, Eq. (12) reduces to \begin{equation} \label{eq12} \lim_{M\to \infty}R^{\ast2}=L . (15)\end{equation} Thus, the lower bound L represents the potential skill of the ensemble mean as the ensemble size increases.

    Let \(\rho\) conceptually express the average error correlation coefficient between each pair of members: \begin{equation} \label{eq13} \rho=\frac{L}{U} .(16) \end{equation} The parameter \(\rho\) describes the similarity among ensemble members: a larger \(\rho\) implies that the members are more similar to each other. Clearly, \(\rho\le 1\).

    From Eqs. (12) and (16), the MSE of the ensemble mean relative to its individual members, \(R^{\ast2}/U\), can be written as a function of \(\rho\) and \(M\): \begin{equation} \label{eq14} \frac{R^{\ast2}}{U}=\frac{1}{M}+\left(1-\frac{1}{M}\right)\rho .(17) \end{equation}

    Obviously, ρ≤ R*2/U≤ 1. When the ensemble size M increases, R*2/U saturates to its lower bound ρ. The effect of the ensemble mean compared with its individual members is dependent on the average error coefficients and the ensemble size (Fig. 1). If the ensemble size M is sufficient, ρ exactly determines the effect of the ensemble mean. Smaller ρ leads to a better ensemble mean compared with individual members, which implies that the errors of the members should have lower correlations with each other to improve an ensemble mean.
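This dependence follows directly from dividing Eq. (12) by U and substituting \(\rho=L/U\) from Eq. (16). A short sketch tabulating the relation (the value \(\rho=0.8\) is illustrative):

```python
# Relative MSE of the ensemble mean, R*^2 / U = 1/M + (1 - 1/M) * rho,
# obtained by dividing Eq. (12) by U and using rho = L/U from Eq. (16).
def relative_mse(rho, M):
    return 1.0 / M + (1.0 - 1.0 / M) * rho

# With highly correlated member errors (rho = 0.8), even a large ensemble
# cannot push the relative MSE below 0.8; most of the gain comes early.
for M in (2, 5, 15, 50):
    print(M, round(relative_mse(0.8, M), 3))
```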

    The "saturation degree" can be defined to describe the relative distance between the MSE of the ensemble mean and its potential skill L: \begin{equation} S=\left(1-\frac{R^{\ast 2}-L}{L}\right)\times 100\% .(18) \end{equation} The saturation degree S increases with the ensemble size, and its upper bound is 100%. By combining Eqs. (12) and (18), the saturation degree S can be simplified to \begin{equation} S=\left(1-\frac{1-\rho}{M\rho}\right)\times 100\% .(19) \end{equation} Equation (19) can be rewritten as \begin{equation} \label{eq17} M_{\rm saturate}=\frac{1-\rho}{(1-S)\rho} . (20)\end{equation}

    Equation (20) implies that the minimal ensemble size to reach a given saturated skill is determined by the error correlation coefficients between each pair of ensemble members. Fewer members are required for a larger ρ, and vice versa (Table 1). When \(\rho\to 0\), which implies that \(L\to 0\) and the members are independent of each other, the skill of the ensemble mean can be effectively improved with increasing ensemble size. Conversely, when \(\rho \to 1\), which implies that the individual members are highly dependent, increasing the ensemble size is ineffective, and the improvement in the ensemble mean is negligible compared with the single members.
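For the range of average error coefficients typical of an EPS (\(\rho\) between 0.5 and 0.8), a quick evaluation of Eq. (20) gives the ensemble sizes needed for a 95% saturation degree (a sketch; the \(\rho\) values are illustrative):

```python
# Eq. (20): minimal ensemble size to reach saturation degree S for a given rho
def m_saturate(rho, S):
    return (1.0 - rho) / ((1.0 - S) * rho)

# Target 95% saturation (at most 5% further skill gain from similar new members)
for rho in (0.5, 0.6, 0.7, 0.8):
    print(rho, round(m_saturate(rho, 0.95), 1))
```

With \(\rho=0.8\) only about 5 members suffice, while \(\rho=0.5\) requires about 20, consistent with the 15-member T213 EPS reaching roughly 95% saturation.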

3. Application with the T213 EPS
  • The T213 EPS (Su et al., 2014) forecasts, which are provided by the CMA, have been archived in the TIGGE (Bougeault et al., 2010) database. The breeding initial perturbation method has been applied to the T213L31 (60 km and 31 vertical levels) global spectral model (Chen et al., 2004; Wang et al., 2008) to generate 15 ensemble members, including one control run and seven pairs of perturbed members. This study uses the daily forecasts and corresponding analysis (validation) data from the T213 EPS over the Northern Hemisphere in 2008; the data have a 1°× 1° output grid and 1-10-day lead time.

    Figure 2.  MSE of the control run, the ensemble mean of the T213 EPS, and the upper and lower bounds $U$ and $L$ for 1-10-day lead times: (a) 500 hPa geopotential height; (b) 850 hPa temperature; (c) 850 hPa specific humidity; (d) 200 hPa wind speed.

    The MSE of the 500 hPa geopotential height, 850 hPa temperature and specific humidity, and 200 hPa wind speed (Fig. 2) shows that, for a 1-3-day lead time, the ensemble mean of the 15 members performs slightly better than the control run and the average MSE of all its members U. With increasing lead time, the advantage of the ensemble mean becomes increasingly significant for medium-range forecasts (4-10 days), despite the analysis field favoring the control run. Although the average MSE of the individual members U is appreciably larger, the ensemble mean outperforms the control run and is close to its lower bound L, because the smaller term L carries a weight exceeding 90% (14/15) in determining R*2 according to Eq. (10) for the 15-member T213 EPS. With increasing lead time, the MSE of the individual members (including the control run) increases rapidly, whereas L increases relatively slowly. As a result, the error correlation coefficients between the ensemble members decrease with lead time [Fig. 3; Eq. (14)]. This explains why the advantage of the ensemble mean is more significant in medium-range forecasts than in short-range predictions.

    The relationship between the forecast skill of the ensemble mean and the ensemble size is also explored. There are 15!/[(15-i)!i!] ways to select i members from the 15-member ensemble. Among these, the best choice for each i is the one with the lowest MSE of the ensemble mean. For the short-range forecasts (1-3 days), the skill (Fig. 4) of the best ensemble mean is barely improved by increasing the ensemble size. For the medium-range forecasts (4-10 days), the MSE of the ensemble mean decreases rapidly as the ensemble size increases, and the skill gain gradually becomes marginal as the ensemble mean saturates.
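The brute-force subset search described above can be sketched as follows; the data here are synthetic stand-ins for the T213 fields (hypothetical, for illustration only), not the actual forecasts:

```python
import itertools
import numpy as np

def best_subset_mse(members, analysis, i):
    """Among all C(M, i) subsets of i members, return the lowest MSE
    of the subset's ensemble mean against the analysis field."""
    best = float("inf")
    for idx in itertools.combinations(range(members.shape[0]), i):
        subset_mean = members[list(idx)].mean(axis=0)
        best = min(best, float(np.mean((subset_mean - analysis) ** 2)))
    return best

# Synthetic 15-member ensemble on a small grid (illustrative only):
# each member is the "analysis" plus independent noise.
rng = np.random.default_rng(0)
analysis = rng.normal(size=(20, 20))
members = analysis + rng.normal(scale=0.5, size=(15, 20, 20))
for i in (1, 3, 15):
    print(i, round(best_subset_mse(members, analysis, i), 4))
```

Because the synthetic member errors are independent (ρ ≈ 0), the best-subset MSE keeps dropping as i grows; with the correlated errors of a real EPS the curve flattens much earlier, as Fig. 4 shows.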

    The terms in Eq. (10) vary with the ensemble size. For example, with a 10-day lead time (Fig. 5), both U and L remain constant with increasing ensemble size, whereas the MSE of the ensemble mean decreases because of the change in the weights in Eq. (10). When the ensemble size M increases, the weight of the smaller term L increases toward 1, whereas the weight of the larger term U decreases toward 0. This explains why the skill of the ensemble mean increases with the ensemble size and eventually saturates: the change in the weights becomes smaller for large M.

    Figure 3.  The average error correlation coefficients for each pair of ensemble members $\rho$ and the saturation degree $S$ of the T213 EPS for 1-10-day lead times: (a) 500 hPa geopotential height; (b) 850 hPa temperature; (c) 850 hPa specific humidity; (d) 200 hPa wind speed.

    Figure 4.  MSE of the ensemble mean, as a function of the ensemble size, for lead times of 1, 3, 5, 7 and 10 days: (a) 500 hPa geopotential height; (b) 850 hPa temperature; (c) 850 hPa specific humidity; (d) 200 hPa wind speed.

    In this case, for the 1-10-day forecasts of the different fields of interest, such as the 500 hPa geopotential height, the 850 hPa temperature and specific humidity, and the 200 hPa wind speed, the parameter ρ varies between 0.5 and 0.8 (Fig. 3). The ensemble size required for a saturated ensemble mean can be deduced from Eq. (20) and Table 1. The saturated ensemble size differs among the meteorological elements (Fig. 6). More members are required for medium-range forecasts than for short-range predictions. For the four meteorological elements considered in this paper, 2-4 members can achieve a saturation degree of 80%, 3-8 members a saturation degree of 90%, and 6-16 members are enough to obtain a saturation degree of 95%. This can explain the previous results of Du et al. (1997) and Ma et al. (2012).
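These numbers can be cross-checked against Eq. (19): for M = 15 members, ρ = 0.5 gives S ≈ 93% and ρ = 0.8 gives S ≈ 98%, which brackets the saturation reported for the T213 EPS. A short sketch:

```python
def saturation_degree(rho, M):
    """Eq. (19), as a fraction: saturation degree for ensemble size M
    and average error correlation coefficient rho."""
    return 1.0 - (1.0 - rho) / (M * rho)

# The 15-member case for the two ends of the observed rho range.
for rho in (0.5, 0.8):
    print(rho, round(saturation_degree(rho, 15), 3))
```

Note that the lowest-correlation case (ρ = 0.5) falls slightly short of 95% at M = 15, consistent with the 500 hPa geopotential height exception at the 10-day lead time noted below.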

    Figure 5.  MSE of the 500 hPa geopotential height for the ensemble mean $R^\ast 2$, and its two factors $U$ and $L$, as a function of the ensemble size $M$, for a lead time of 10 days (units: gpm$^2$).

    Figure 6.  The minimum ensemble sizes required to reach saturation degrees of 80%, 90% and 95%: (a) 500 hPa geopotential height; (b) 850 hPa temperature; (c) 850 hPa specific humidity; (d) 200 hPa wind speed.

    The existing 15-member T213 EPS can reach a saturation degree of 95% for 1-10-day lead times (Fig. 3), except for the 10-day lead time of the 500 hPa geopotential height predictions, as far as the MSE is concerned. For short-range predictions, the saturation degree is even higher. This implies that adding new members to the 15-member T213 EPS can reduce the MSE of the ensemble mean by at most 5%, unless the skills and error covariances of the individual members are significantly improved. Note that the correlations between the existing members of the T213 EPS are high, which limits the skill gain of the ensemble mean when more ensemble members are added. For a better configured EPS with lower correlations among the ensemble members, a larger skill gain from more members is expected.

4. Summary and discussion
  • Ensemble methods, especially the arithmetic mean, have been widely used in weather and climate forecasting. This paper set out to reveal the rationale behind the success of the ensemble mean. The ensemble mean cannot always outperform the best single member, although its skill is better than the average skill of all individual members. The skill of the ensemble mean depends not only on the skills of the individual members, but even more so on the error covariances between each pair of ensemble members. This suggests that, to achieve better ensemble mean skill, ensemble members with lower error covariances with each other should be chosen.

    It is inappropriate to blindly add new members into an already existing ensemble. A greater ensemble size does not necessarily yield higher skill. Even if a new member has a higher skill, it can still decrease the ensemble skill if it is highly correlated with the already existing ensemble members. In addition, the ensemble mean skill tends to saturate toward its potential skill when the ensemble size increases under the condition that the newly added members have similar attributes to the already existing ensemble members. This also indicates that increasing ensemble size will benefit the ensemble mean more when the added members have lower covariances with existing members.

    The average error coefficient between individual ensemble members is the most important factor to determine the ensemble skill. It not only determines the effect of the ensemble mean compared with individual members, but also the potential skill and the saturation degree of the ensemble mean. More members are useful if the errors of the members have lower correlations with each other, and vice versa.

    The T213 EPS forecasts confirm the above theoretical results. The ensemble mean of the T213 EPS outperforms its control run, especially for medium-range forecasts, because the error covariances between each pair of ensemble members are lower than the MSEs of the individual members. The skill of the ensemble mean can be improved by increasing the ensemble size for medium-range forecasts, which saturates gradually, under the condition that the perturbed members have similar attributes to each other. However, the ensemble mean skill of the short-range forecasts saturates quickly with a small ensemble size.

    For an ensemble with an average error correlation coefficient between 0.5 and 0.8, 15 members already yield a saturated ensemble mean. The 15-member T213 EPS can reach a saturation degree of 95% for 1-10-day lead time predictions, as far as the MSE is concerned. For short-range forecasts, the saturated ensemble size is even smaller, which can be attributed to the greater correlation between ensemble members in short-range forecasts. An existing ensemble can barely be improved by simply adding new members with attributes similar to those of the existing members. The T213 EPS members show high correlations, and for this reason its ensemble mean skill saturates quickly at around 10 members, especially for specific humidity forecasts at shorter lead times. Therefore, efforts should be made to reduce the correlations among the ensemble members in order to benefit from more members. In addition, this study only examines the ensemble mean skill score in a deterministic sense and does not address the probabilistic forecasting aspect. It is very likely that probabilistic forecasting skill can further benefit from more ensemble members, even when the ensemble mean skill score ceases to improve with additional members. Further research from the probabilistic forecasting perspective is still needed.

    In this paper, the MSE is used as the metric to evaluate the ensemble mean. For different metrics, the theoretical frameworks differ. Theoretical analyses based on other metrics, and the internal relationships between different metrics, still require further study.

    Although the theoretical analyses in this study focus on the ensemble mean with equal weights, they can also be generalized to an unequally weighted ensemble mean. Obviously, to obtain a better weighted mean, larger weights should be assigned to the members with higher skill. Further research is needed on weight-setting methods.

    This study is based on the EPS of a single center (the CMA); exploiting the error covariances between the outputs of different centers in the THORPEX TIGGE database may further improve the skill of the ensemble mean. Multi-center ensembles require further study.



