3.1. Statistical models
-
In this study, a multiple linear regression (MLR) approach, which is both robust and easily applicable, is used to estimate the relations between predictors and predictands. MLR seeks to summarize the linear relationship between the predictand and multiple predictor variables and is ultimately expressed as a linear combination of K predictors that can generate the least error for the prediction of \(\hat{y}\) given observations of each predictor xi from the training dataset, as follows: \begin{equation} \hat{y}=\beta_0+\sum_{i=1}^K\beta_i x_i \ \ (1)\end{equation} where βi is the coefficient determined by the regression. The MLR models are developed independently for 1-h INCA precipitation forecasts initiating from 0000, 0100, …., 2300 UTC, respectively.
Screening the regression, i.e., selecting a high-quality set of predictors from a pool of potential predictors to avoid the problem of over-fitting and to remove redundant predictors, is an important step in the statistical prediction procedure (Wilks, 2011). In this study, the predictors are selected using backward elimination (Wilks, 2011), and the relative importance of the selected predictors is discussed in section 4.
3.2. Candidate predictor specification
-
Assuming that the initial time of each MLR forecast model is valid at t=T UTC, the candidate predictors are provided from the four following sources:
(1) QPF: INCA hourly operational extrapolation precipitation forecast itself, valid at time T+1 h UTC (i.e., the forecast accumulated precipitation from [T, T+1 h] UTC), which is based on the Lagrangian persistence.
(2) The radar products [(i)-(iii) below]: products derived from composite measurements of five C-band (5.33 cm) radars covering the INCA domain valid at [T+0.16 h] UTC.
(i) MAXCAPPI: Analysis of the maximum constant-altitude plan position indicator above sea level. A CAPPI (constant altitude plan position indicator) and a MAXCAPPI (maximum CAPPI) product are provided every 5 min on a Cartesian grid with a spatial resolution of 1 km. Both products are reduced to 14 reflectivity classes ("no rain", 11.8, 14.0, 19.5, 22.0, 26.7, 30.0, 34.2, 38.0, 41.8, 46.0, 50.2, 54.3, and 58.0 dBZ). The MAXCAPPI product in combination with the Marshall-Palmer relationship is Z=200 r1.6 (Marshall and Palmer, 1948) and is used for the estimation of precipitation, where Z is the sum of six powers of drop diameters in unit volume, and r is the rate of rainfall.
(ii) VIL: Vertically integrated liquid water content. This product can be used as a measure of severe weather potential, such as an indicator of the size of prospective hail and the potential amount of rain during a thunderstorm (Greene and Clark, 1972). It is derived from the CAPPI product by integrating the liquid water content calculated from reflectivity within a vertical column.
(iii) ECHOTOP: The radar echo top defined as the maximum height where a reflectivity of at least 4 dBZ is recorded. It is an estimate of the top of an area of precipitation that gives an impression of the intensity of updrafts. The vertical resolution of ECHOTOP in this study is limited to 1 km by the resolution of the CAPPI data.
(3) The INCA convective parameter analysis valid at time T UTC, including (i) the CAPE, (ii) the CIN, and (iii) the moisture convergence (MCONV). The areas with positive MCONV values (convergence of moisture flux) indicate the potential for convective forcing, whereas negative values indicate divergence on the ground and thus subsidence in the air mass above.
(4) QPE: The INCA hourly operational precipitation analysis valid at time T UTC; i.e., the analyzed accumulated precipitation from [T-1, T] UTC acquired from the combination of radar, rain gauge information, and the inclusion of a parameterization of elevation effects in mountain areas. This analysis comprises the precipitation truth 1 h ahead of the current forecast time. This predictor represents the possible correlation of persistent precipitation between the two consecutive time periods of [T-1, T] UTC and [T, T+1] UTC.
The list of predictor variables is given in Table 1, and the generalized form of Eq. (1) can be rewritten as Eq. (2), with all potential predictors and β representing the model coefficients (Table 1): \begin{eqnarray} {\rm QPF}(T+1)&=&\beta_0+\beta_1{\rm QPF}(T+1)+\beta_2{\rm QPE}(T)+\nonumber\\ &&\beta_3{\rm MCAPPI}(T+0.16)+\beta_4{\rm VIL}(T+0.16)+\nonumber\\ &&\beta_5{\rm ECHOTOP}(T+0.16)+\beta_6{\rm CAPE}(T)+\nonumber\\ &&\beta_7{\rm CIN}(T)+\beta_8{\rm MCONV}(T) \ \ (2)\end{eqnarray}
3.3. Sample collection strategy
-
In this paper, the historical radar composite products, INCA analysis and forecasts, which are available every 5 and 15 min, are from three summer seasons (June to September) over the period 2012-14. For every MLR model constructed for each initiating forecast hour, the total 12-month dataset is divided into the following two parts: the training dataset and verification dataset. The training part contains 10 randomly selected months of data from which the regression equations are derived. The remaining two months, August 2013 and July 2014, make up the verification part, which will be used to validate the forecast performance of the MLR.
In this study, on a grid point basis, the number of qualified precipitation events is still limited in order to be statistically significant. Therefore, an alternative method to construct a sample dataset and derive a regression model for a cluster of N× N grid points has been applied. Here, we set N as 10, as this value is large enough to ensure that the samples collected grid-by-grid may reach a significant value, but small enough to be representative of the local orographic and precipitation features. Therefore, the entire INCA domain with 700× 400 1-km2 grid cells is decomposed into 2800 small-scale patches of 10× 10 grid points (Fig. 1). For every patch domain, a common sample dataset is constructed by collecting qualified samples ( QPF>0) from each of the 100 grid points; and thereafter, a common MLR equation is then derived using the dataset, and will be assigned to all points within the domain. Overall, the sample size of the training dataset for each patch domain is approximately 5000-10 000.
The 2800 small patches are traversed to calculate regression coefficients and perform a backward elimination of predictors and statistical output correction. For the predictors that fail to pass the backward elimination, their corresponding coefficients will be set to 0.
3.4. Correction of statistical output
-
There are two different types of bias that need to be considered. The first is the systematic bias usually shown from the regression models derived from the various fitting approaches (Roberts et al., 2015). The important source of bias in MLR nowcasts is from the nonlinear nature of precipitation. Therefore, both incomplete thermodynamic processes and the linear approximation of the regression may contribute to systematic errors in forecasts of the MLR models. Second, because the phenomena that an MLR can linearly describe are the characteristics of samples that concentrate near the climatological mean of the training dataset, the behavior of the MLR mostly depends on the probability of the predictand in the training dataset. The precipitation record samples with less probability, such as the observed large-amount precipitation events, would be treated as outliers, although they are correct. Regression often has difficulty replicating the occurrence of extremes as frequently as they are observed (Roberts et al., 2015); therefore, it is also important to ensure that the predicted precipitation has the correct relative frequency of amounts.
In this study, the direct output from the application of the MLR models to the training dataset reveals that statistical models usually tend to overpredict small amounts of precipitation but underpredict large amounts. As illustrated in Fig. 2a, the direct outputs of the MLR models tend to systematically underpredict the moderate to strong precipitation events with intensities greater than 4 mm h-1. Also, a large portion of 2-4 mm h-1 precipitation is also underpredicted to 0-2 mm h-1 while overpredicting weak precipitation events below 1 mm h-1. Consequently, MLR overpredicts the frequencies of the observed precipitation amount intervals between 1 and 2 mm h-1 but underpredicts those below 1 mm h-1, which leads to a distorted probability distribution of the MLR fitted values (Fig. 2e) compared with observations (Fig. 2d) and significantly non-Gaussian forecast errors (Fig. 2g). To resolve the above two deficiencies, both bias and distribution corrections need to be performed.
To measure the goodness-of-fit of a regression, the coefficient of determination, R2, can be computed from \begin{equation} R^2=\dfrac{\sum(\hat{y}-\bar{y})^2}{\sum(y-\bar{y})^2}=1-\dfrac{\sum r^2}{\sum(y-\bar{y})^2} , (3)\end{equation} where y, \(\hat{y}\) and \(\bar{y}\) are the predictand, regression prediction and the mean of the predictand, respectively; and r2 is the residual. R2 is the proportion of the variation of the predictand that is described or accounted for by the regression. If no effective regression can be reached, R2 is equal to 0, while for a perfect regression R2 is equal to 1. In this way, R2 can be used as an index to evaluate the quality of a linear regression and quantify the nature and strength of the linear relationship between the predictand and predictors (Wilks, 2011).
(Pore et al., 1974) and (Roberts et al., 2015) described a bias correction approach by multiplying the MLR predictions by the reciprocal of the multiple correlation coefficients, or \(1/\sqrt{R^2}\), which is also equivalent to the square root of the coefficient of determination defined as Eq. (3). An identical approach is used in this study. As shown in Fig. 2b, the bias-corrected output of the MLR is generally pushed towards the direction with larger fitted values. In this way, the underprediction for moderate observations is partly corrected, but with the sacrifice of a higher overprediction for minor precipitation, which leads to the non-Gaussian distribution with a higher frequency of positive forecast errors (Fig. 2h).
Distribution correction is performed after bias correction by compulsorily constructing the distribution of the output of the MLRs to be close to that of the corresponding predictand (Sokol and Pesice, 2012). First, for the observed predictand Ok,k=1,…,M and its direct MLR forecasted counterpart yk, out,k=1,…,M from the training dataset, their percentile values of 0.01,1,2,…,99, and 99.9, denoted as Oi,i=1,…,101 and Yi,i=1,…,101, can be obtained and archived as look-up tables for each MLR equation. The modified value yk, mod,k=1,…,M can be calculated using linear interpolation as follows: \begin{equation} y_{k,{\rm mod}}=O_i+\dfrac{y_{k,{\rm out}}-Y_i}{Y_{i+1}-Y_i}(O_{i+1}-O_i) , \ \ (4)\end{equation} where Yi≤ yk, out≤ Yi+1, O0=0 and O102=100 mm are set to ensure that for any yk, out, the corresponding Oi and Oi+1 exist.
After applying the two-step corrections, the probability frequency distribution of forecast values (Fig. 2f) looks similar to those of the observations, and the distributions of the forecast errors are closer to Gaussian (Fig. 2i). It should be noted that some slight negative forecasted values generated by the statistical models (Fig. 2a), which partly contribute to the distortion of probability frequency (Fig. 2e), are also corrected after the distribution correction taken. This result demonstrates that the correction procedures will significantly improve the quality and distribution of the model outputs.
3.5. Summary of development and implementation steps for the MLR models
-
The MLR models are developed for the predictand (T+1 h precipitation), initiating every hour at 0000, 0100, …, 2300 UTC. With the entire domain decomposed into square patches with N× N grid points (N=10 in this study), the MLR models are constructed using the training dataset for each patch domain as follows:
(1) MLR regression equations are derived and optimized using a backward elimination technique.
(2) For each patch domain, the MLR regression coefficients are updated by multiplying by the reciprocal of the multiple correlation coefficients for bias correction, and the look-up tables containing the percentile values of the predictand Oi and fitted values Yi of the MLR model are then obtained for distribution correction.
(3) The identical configuration of the regression coefficients and its corresponding percentile look-up table are shared with all grid points inside the current patch domain.
(4) Regression coefficients βi, i=0,1,…,K in Eq. (1) of every grid point of the entire domain are then assigned, with all patch domains traversed.
In an operational environment, the implementation of the MLR model must fit the operation of the INCA system and the radar data update frequency. Assuming the current time is [T+0.16 h] UTC, the predictors that can be obtained are as follows: (i) the latest radar products valid at [T+0.16 h] UTC; (ii) a 1-h precipitation analysis valid at T UTC (with precipitation from [T-1 h, T] UTC); (iii) a 1-h extrapolated precipitation forecast valid at T+1 h UTC; and (iv) convective analysis fields from T UTC. With the above data, the prepared MLR models will be implemented using the following steps:
(1) Calculation of predictor values and forecasts with the MLR models.
(2) Bias and distribution correction of the updated forecasts of the MLR with the look-up tables from Eq. (4).
The above implementation procedure can be completed within 1 min. Therefore, the updated T+1 h precipitation forecasts made by the MLR models can be issued with only a very short delay.