Let $ S = \{1,2,\dots,C\} $ be the index set of the $ C $ candidate predictors, and let $ y \in \mathbb{R}^D $ be the target. The goal of predictor selection is to identify a subset $ S^* \subset S $ of predictors that yields a model of better prediction from the given samples $ (x^{(i)}, y^{(i)})_{i = 1,\dots,n} $, where a model is said to be of better prediction if it uses fewer input predictors and achieves better accuracy than the model that takes all candidate predictors as input. We refer to the model that inputs all candidate predictors as the reference model.
Our solution to the predictor selection problem is inspired by Ye and Sun (2018), in which feature selection for deep neural networks (not CNNs) is performed by iterative elimination. However, such procedures are computationally expensive because the contribution of each feature (or group of features) must be calculated separately with a modified model over the whole training dataset. In this context, our solution computes the importance metrics of the input predictors using gradient-based feature attribution, which requires only a single pass over the validation dataset. The gradient is one of the simplest and most efficient measures of feature importance in differentiable models (Ancona et al., 2018). The main reason we choose it over alternatives such as layer-wise relevance propagation is that the downscaling model used in this study has a simple architecture (only sequentially connected convolutional layers and ReLU activations), for which the gradient-based method is easy to interpret and theoretically well understood. The computed gradients are then aggregated to define the contribution of a single predictor. The procedure is detailed as follows.
Let $ y = F(x) \in \mathbb{R}^D $ denote the (fitted) CNN model that maps a stack of predictors $ x \in \mathbb{R}^{C \times P \times Q} $ to the target $ y $, where $ C $ is the number of selected predictors. In our statistical downscaling application, $ P $, $ Q $, and $ D $ are 6, 8, and 157, respectively. Given an input $ x_0 $, we can approximate $ y_d $ with a linear function near $ x_0 $ by computing the first-order Taylor expansion:

$$ y_d \approx F(x_0)_d + \sum \left[ \boldsymbol{\omega}_d \circ (x - x_0) \right], \tag{1} $$

where $ \circ $ represents the Hadamard product operator, $ \sum $ denotes the sum of all elements of the matrix, and $ \boldsymbol{\omega}_d \in \mathbb{R}^{C \times P \times Q} $ is the gradient of $ y_d $ with respect to the input at the point $ x_0 $:

$$ \boldsymbol{\omega}_d = \left. \frac{\partial y_d}{\partial x} \right|_{x = x_0}. \tag{2} $$

Equation (1) implies that $ y_d $ is approximately proportional to the entries of $ \boldsymbol{\omega}_d $ around $ x_0 $. Therefore, a reasonable definition of the $ c $-th predictor's contribution $ A_c $ at a given point $ x_0 $ is the sum of the absolute values of the gradients corresponding to that predictor over all entries of $ y $:

$$ A_c = \sum_{d=1}^{D} \sum_{p=1}^{P} \sum_{q=1}^{Q} \left| (\omega_d)_{c,p,q} \right|. \tag{3} $$
The gradients $ {\boldsymbol{\omega}}_d $ of deep neural networks, including CNNs, can be computed efficiently with the backpropagation algorithm (LeCun et al., 2015). Several variants of backpropagation exist, such as guided backpropagation (Springenberg et al., 2015) and integrated gradients (Sundararajan et al., 2017). This study uses guided backpropagation because it is more robust to noise than standard backpropagation (Nie et al., 2018) and more efficient than integrated gradients. The contribution of each predictor to the CNN model is averaged over the whole validation set. See Algorithm 1 (Table 1) for an outline of the computation, where lines 7 and 8 correspond to Eqs. (2) and (3), respectively.

Algorithm 1 Calculation of predictor contributions
1: procedure PREDICTORCONTRIBUTION($ F $, $ X $) $\triangleright$ $ F $ and $ X $ are the fitted model and validation set
2: $ N \gets \text{Length}(X) $ $\triangleright$ Number of samples
3: $ A \gets (0,0,\dots,0) $ $\triangleright$ Of length $ C $
4: for $ n = 1,\dots,N $ do
5: $ x_0 \gets X[n] $ $\triangleright$ The $ n $-th sample
6: $ y \gets F(x_0) $ $\triangleright$ Forward pass of the CNN
7: $ \omega \gets \left(\left.\frac{\partial y_1}{\partial x} \right|_{x=x_0}, \left.\frac{\partial y_2}{\partial x} \right|_{x=x_0}, \cdots, \left.\frac{\partial y_D}{\partial x} \right|_{x=x_0} \right) $ $\triangleright$ Compute gradients with guided backpropagation
8: $ A^\prime \gets \left(\sum_{d=1}^{D} \sum_{p=1}^{P} \sum_{q=1}^{Q} \left|(\omega_{d})_{1,p,q}\right|, \cdots, \sum_{d=1}^{D} \sum_{p=1}^{P} \sum_{q=1}^{Q} \left|(\omega_{d})_{C,p,q}\right| \right) $
9: $ A \gets A + A^\prime $ $\triangleright$ Accumulate the contribution metric
10: end for
11: return $ A/N $ $\triangleright$ Average and return
12: end procedure

Table 1. Greedy predictor elimination with predictor contribution calculation.
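For concreteness, the following is a minimal PyTorch sketch of Algorithm 1, not the authors' released code. Guided backpropagation is emulated by clamping negative gradients at every ReLU during the backward pass, and the fitted model is assumed to map a (1, C, P, Q) input to a (1, D) output.

import torch
import torch.nn as nn

def _guided_relu_hook(module, grad_in, grad_out):
    # Guided backpropagation: let only positive gradients flow back through
    # the ReLU (the standard ReLU backward already masks positions where the
    # forward input was negative).
    return (grad_in[0].clamp(min=0.0),)

def predictor_contribution(model: nn.Module, val_set):
    """Average per-predictor contribution A (length C) over a validation
    set, where each sample is a tensor of shape (C, P, Q)."""
    hooks = [m.register_full_backward_hook(_guided_relu_hook)
             for m in model.modules() if isinstance(m, nn.ReLU)]
    model.eval()
    A, n = None, 0
    for x0 in val_set:
        x0 = x0.unsqueeze(0).requires_grad_(True)   # (1, C, P, Q)
        y = model(x0)                               # (1, D), forward pass
        grads = []
        for d in range(y.shape[1]):                 # line 7: one gradient per output
            g, = torch.autograd.grad(y[0, d], x0, retain_graph=True)
            grads.append(g.abs())
        contrib = torch.stack(grads).sum(dim=(0, 1, 3, 4))  # line 8: -> (C,)
        A = contrib if A is None else A + contrib   # line 9: accumulate
        n += 1
    for h in hooks:
        h.remove()
    return A / n                                    # line 11: average

The loop over the $D$ outputs is a direct transcription of line 7; in practice it can be batched with a Jacobian routine for speed.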
Based on this contribution metric, the overall greedy elimination algorithm for the predictor selection problem is summarized in Algorithm 2 (Table 1). Notably, the predictor contributions are calculated and averaged over multiple model runs, and the evaluation scores, presented next, are computed in the same way. Furthermore, there is no termination condition that stops the iteration when the accuracy drops; instead, the optimal subset of candidate predictors is determined after evaluating all of the constructed CNN models. Not counting multiple runs and cross-validation, the number of CNN models that must be trained and evaluated equals exactly the number of candidate predictors, which is more efficient than the greedy algorithm presented in Ye and Sun (2018).
Algorithm 2 Greedy predictor elimination algorithm
1: Initialization: $S=\{1,2,\dots,C\},\; S^\prime=\varnothing$ $\triangleright$ Sets of indices of candidate and eliminated predictors
2: $ S^* \gets S \setminus S^\prime $ $\triangleright$ Set of indices of remaining predictors
3: while $ |S^*| \geq 1 $ do $\triangleright$ $ |S^*| $ is the cardinality of set $ S^* $
4: $ A \gets (0,0,\dots,0) $ $\triangleright$ Of length $ C $
5: for $ k=1, 2, \dots, 6 $ do
6: $ X^{k} \gets $ Validation set in fold $ k $
7: $ F^{k} \gets $ Fitted model trained using predictors in $ S^* $ $\triangleright$ Multiple-run
8: $ A^\prime \gets \text{PREDICTORCONTRIBUTION}(F^{k},\; X^{k}) $ $\triangleright$ Multiple-run and average
9: $ A \gets A + A^\prime $
10: end for
11: $ A \gets A / 6 $
12: $ i \gets $ Index of the predictor whose contribution is $ \min(A) $
13: $ S^\prime \gets S^\prime \cup \{i\} $
14: $ S^* \gets S \setminus S^\prime $
15: end while
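A schematic Python sketch of this outer loop follows, reusing predictor_contribution from the previous sketch; train_model(predictors, fold) and validation_set(fold) are hypothetical stand-ins for the paper's 6-fold training pipeline.

import torch

def greedy_elimination(C, train_model, validation_set, n_folds=6):
    remaining = list(range(C))      # S*: indices of remaining predictors
    order = []                      # S': elimination order
    while remaining:
        A = torch.zeros(len(remaining))
        for k in range(n_folds):                   # cross-validation folds
            model = train_model(remaining, k)      # multiple-run training
            A += predictor_contribution(model, validation_set(k))
        A /= n_folds
        worst = remaining[int(torch.argmin(A))]    # line 12: least contribution
        order.append(worst)
        remaining.remove(worst)
    return order   # the optimal subset is chosen after evaluating all models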
According to Algorithm 2, a succession of models with different numbers of input predictors is constructed and evaluated. Three scores are used to measure accuracy: RMSE (root-mean-square error), CC (correlation coefficient), and ATCC (anomaly temporal correlation coefficient). RMSE and CC are calculated month by month and measure the spatial errors and Pearson correlations between the predictions and observations; ATCC is the Pearson correlation coefficient between the predicted and observed grid-wise time series. When calculating ATCC, the seasonal cycle is first removed from each time series by subtracting the climatological averages of the observations for the 12 calendar months, computed over a 30-yr period (1981−2010). As mentioned before, the scores are computed and averaged over multiple runs. In addition to these accuracy measures, we use FLOPs (floating point operations), the theoretical number of multiply−add operations in a CNN, to estimate the computational cost of the CNN models. FLOPs are deterministic and independent of the dataset.
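Under our reading of these definitions, the three scores can be sketched in numpy as follows; the array shapes and the precomputed monthly climatology clim (shape (12, n_grids), from the 1981−2010 observations) are assumptions.

import numpy as np

def monthly_rmse(pred, obs):
    # pred, obs: (n_months, n_grids); one spatial RMSE per month.
    return np.sqrt(np.mean((pred - obs) ** 2, axis=1))

def monthly_cc(pred, obs):
    # Pearson correlation over the grid points, one value per month.
    p = pred - pred.mean(axis=1, keepdims=True)
    o = obs - obs.mean(axis=1, keepdims=True)
    return (p * o).sum(axis=1) / np.sqrt((p ** 2).sum(axis=1) * (o ** 2).sum(axis=1))

def atcc(pred, obs, months, clim):
    # months: calendar month (1-12) of each sample.  Remove the seasonal
    # cycle, then correlate the grid-wise time series.
    pa = pred - clim[months - 1]
    oa = obs - clim[months - 1]
    pa = pa - pa.mean(axis=0)
    oa = oa - oa.mean(axis=0)
    return (pa * oa).sum(axis=0) / np.sqrt((pa ** 2).sum(axis=0) * (oa ** 2).sum(axis=0))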
In our implementation, each experiment takes about two and a half hours, including training and evaluation of the CNN and linear models; without multiple runs, about 15 minutes would suffice. The program runs on one GPU (NVIDIA RTX 2080Ti, 11 GB) and one CPU (Intel i9-7920X, 2.90 GHz).
If redundancy exists among the candidate predictors and the defined contribution metric is sufficiently indicative, models of better prediction than the reference models should exist. To verify this expectation, we applied the constructed CNN and linear models to the test set and compared their outputs with the observations. Note that all model outputs must be de-standardized by multiplying them by the standard deviations and adding back the means. Due to the use of cross-validation, the test set consists of 441 months, and there are 157 target grid points in the region; therefore, for each evaluated model, a total of 441 RMSE scores, 441 CC scores, and 157 ATCC scores are obtained.
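The de-standardization is the exact inverse of the target standardization; as a one-line sketch with per-grid mean and std arrays:

import numpy as np

def destandardize(y_pred, mean, std):
    # Invert the (y - mean) / std transform applied when preparing targets.
    return y_pred * std + mean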
Figure 3 shows the mean evaluation scores of the CNN models throughout the predictor selection procedure, together with the scores of linear models constructed with the same selected predictors. RMSE, CC, and ATCC are plotted in Figs. 3a, 3b, and 3c, respectively. For the CNN models, the accuracy remains stable at first as predictors are removed one by one, and models of better prediction than the reference even appear. Once only seven predictors are left, further elimination leads to significant degradation of model performance (see Figs. 3a and 3c). Although the CC score starts to drop only after four predictors are left (see Fig. 3b), seven is the smallest number of input predictors acceptable under all three evaluation metrics. With 9 predictors remaining, both RMSE and CC are at their best, and ATCC is close to its optimum (reached with 11 predictors). In the following, we refer to the CNN models with nine and seven input predictors as BEST and LEAST, respectively. For the linear models, the effect of predictor removal is greater, especially on RMSE and CC; moreover, once 10 predictors are left, continued removal leads to a significant decrease in ATCC. Therefore, we assert that in this experiment 10 predictors is the best choice for the linear models. The results are in line with our expectations: for both the CNN model and the linear method, models of better prediction with fewer input predictors exist compared to the reference models.
Figure 3. Mean scores of the CNN (convolutional neural network) and LR (linear regression) models throughout the predictor elimination procedure. The x-axis is the number of predictors. For each type of score, the CNNs of better prediction are highlighted with different markers, among which the BEST and LEAST models are marked specifically.
Comparing the results of the CNN and linear models, the CNN always performs better than the linear method, demonstrating its advantage. The large difference in the effect of predictor elimination indicates that the CNN model is more robust to data redundancy, while the linear model may be more prone to overfitting when the input dimension is too high. Furthermore, the minimum number of input predictors for the CNN is seven, three fewer than for the linear method, from which we infer that the CNN is more capable of nonlinear feature extraction, consistent with its highly nonlinear character. Redundancy exists mainly because the input predictors are not completely independent of one another; as some predictors are removed, their contribution to the model can be replaced by others. As shown in Fig. 3a, the RMSE is stable at first as predictors are removed and increases significantly and rapidly once fewer than seven remain: the remaining predictors are more independent of each other and thus cannot compensate for the loss in accuracy caused by removing any of them. Somewhat contrary to expectation, the experimental results show that, as predictors are eliminated, the model accuracy does not first increase strictly monotonically and then decrease at some point. This is mainly because the evaluation results are influenced by the generalization ability of the models; after all, the models are built on the training and validation sets, while the evaluation is performed on the test set.
In addition to the mean values shown in Fig. 3, we analyzed the distribution of each metric. Figure 4 shows box plots of the RMSE (Fig. 4a), CC (Fig. 4b), and ATCC (Fig. 4c) scores of the reference, BEST, and LEAST CNN models. The distributions of each score are very close across the three models. Specifically, the BEST model outperforms the other two in terms of the mean and the individual score quartiles, while using only nine predictors, one fewer than half of those of the reference model. The LEAST model scores closest to the reference model but uses only 7 input predictors, 13 fewer than the reference model.
Figure 4. Box plots of the evaluation scores of the reference, BEST, and LEAST CNN models. A six-number summary of the scores is displayed: the box and whiskers cover the 25−75th and 5−95th percentile ranges, respectively; the median and mean are plotted with an orange line and a green triangle, respectively. The mean value is shown above each box.
Additionally, the geographical distribution of the ATCC score of the reference model, and the score deviations of the BEST and LEAST models from it, are presented in Fig. 5. The deviation distributions show, firstly, that the variation of the BEST and LEAST scores with respect to the reference model is tiny (between −0.04 and 0.04), and secondly, that the regions where the scores increase account for the vast majority of the entire region, especially for the BEST model. These findings suggest that the greedy predictor elimination algorithm improves model accuracy not only in the average statistical sense but also in the overall distribution. Although the ATCC decreases in small areas after predictor removal, the magnitude is within an acceptable range and is mainly attributable to model randomness. Accordingly, the rationale for improving model performance through predictor removal is that eliminating highly correlated or redundant variables, which are detrimental to the CNN, yields the improvement. One exception is Hainan (in the south of the region), where the ATCC becomes worse, if only slightly, under both the BEST and LEAST models. This implies that some removed variables, while not important for downscaling precipitation over all of South China, may matter more in some local regions. We therefore believe that the predictor selection algorithm proposed in this paper should be used for relatively small regional downscaling tasks, since the factors affecting precipitation differ from region to region (Jaagus et al., 2010; Jonah et al., 2021).
Figure 5. The geographic distributions of ATCC of the reference CNN model (1st column) and ATCC bias between the BEST (LEAST) and reference CNN models [2nd (3rd) column].
Although the improvement in model performance achieved through predictor selection is not large, the selection improves our understanding of the data: the above experiments show that about half or fewer of the candidate variables are the most relevant to regional monthly precipitation in the studied area. In addition, CNN models with fewer variables have fewer parameters and lower FLOPs, and thus better computational efficiency. Table 2 presents the average scores, number of model parameters, and FLOPs of the models constructed with 9 and 7 predictors, as well as of the reference model with its 20 input predictors. Compared with the reference model, the 9-predictor CNN reduces the RMSE by 0.8%; although the accuracy gain is modest, it uses fewer than half of the predictors and decreases FLOPs by 20.4% with better model performance. With seven predictors, the number of CNN parameters is reduced by 6.0% and FLOPs by 24.1%, without loss of accuracy.
Predictors | RMSEs | CCs | ATCCs | Parameters | FLOPs
20 | 1.793 | 0.633 | 0.577 | 98,102 | 1,163,677
9 | 1.779 (−0.8%) | 0.641 (+1.3%) | 0.592 (+1.7%) | 93,152 (−5.0%) | 926,077 (−20.4%)
7 | 1.790 (−0.2%) | 0.637 (+0.6%) | 0.583 (+0.7%) | 92,257 (−6.0%) | 882,877 (−24.1%)

Table 2. Comparison of the models constructed using nine (2nd row) and seven (3rd row) predictors with the reference model (1st row). Values in parentheses are percentage differences relative to the reference model.
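The asymmetry between the parameter and FLOP reductions has a simple explanation: only the first convolutional layer sees the C input channels, but that layer acts on every output location. A back-of-the-envelope sketch follows; the 3×3 kernel, 50-channel width, and 6×8 feature map are assumptions, not the paper's exact architecture.

def conv_params(c_in, c_out, k=3):
    return c_out * (c_in * k * k + 1)          # weights + biases

def conv_multiply_adds(c_in, c_out, h, w, k=3):
    return h * w * c_out * (c_in * k * k)      # multiply-adds per forward pass

for C in (20, 9, 7):
    dp = conv_params(C, 50) - conv_params(20, 50)
    df = conv_multiply_adds(C, 50, 6, 8) - conv_multiply_adds(20, 50, 6, 8)
    print(f"C={C}: first-layer params {dp:+}, multiply-adds {df:+}")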
Next, we performed a reverse predictor selection experiment to demonstrate that the performance variation of the model is not determined solely by the number of predictors. That is, in line 12 of Algorithm 2, the predictor to be eliminated is the one with the maximal rather than minimal contribution, with the rest of the procedure unchanged. Figure 6 shows the results of the reverse experiment in the same form as Fig. 3. The performance of both the CNN and linear models decreases significantly as the important predictors indicated by the metric are removed, which suggests that the defined predictor importance metric is indeed indicative. Since the predictors are not independent of one another, the effect of reverse removal on the model is relatively small at first and then grows, but in general removing significant predictors harms model accuracy. Consequently, we can infer that the contribution of the predictor removed at each step is too significant to be entirely replaced by the remaining predictors. Contrary to the results of the normal greedy elimination experiment, the CNN's performance in this reverse experiment varies more strongly than the linear models'. This may be because the generalization performance of the linear method improves as the input dimension decreases, leading to better accuracy; at the same time, the accuracy of the linear approach also decreases with the removal of significant predictors, and the two effects partially cancel, resulting in a less pronounced change for the linear model than for the CNN.
Figure 6. Same as Fig. 3, except for the reverse predictor elimination procedure.
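In terms of the greedy-loop sketch given earlier, the reverse experiment amounts to a one-line change (shown here only as an illustrative fragment):

# Reverse experiment: eliminate the MOST important predictor instead.
worst = remaining[int(torch.argmax(A))]   # was torch.argmin(A)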
Since deep neural networks can learn highly nonlinear and complex relationships between inputs and outputs, we believe that traditional variable selection methods, such as correlation analysis, are not adequate for deep neural network models. To test this point, we conducted another greedy predictor elimination experiment for comparison. Specifically, the Pearson correlation coefficients of the individual predictors are calculated grid-wise with respect to the total regional precipitation. The importance metric of each predictor is then set to the average of the absolute values of the correlation coefficients of the predictor over its $ 6 \times 8 $ grids. The time range is 1981−2010, a total of 360 months. The predictors and total regional precipitation are standardized month by month to remove the seasonal variation. All candidate predictors are sorted by their importance metrics in ascending order, which is also the order of their elimination. We used this new elimination sequence, applying the same cross-validation and multiple-run strategies, to train and evaluate the CNN models. The elimination order will be presented later.

Score results are shown in Fig. 7. Note that the reference model is exactly the same as in Figs. 3 and 6, so it requires no new training. Our method is advantageous in several respects. First and foremost, the BEST (LEAST) CNN models have five (four), four (three), and four (one) fewer input predictors according to RMSE, CC, and ATCC, respectively. Second, the linear models with the fewest input predictors that still predict better than the linear reference model have three, three, and four fewer input predictors than under the correlation-analysis-based method. Third, more CNN models of better prediction than the reference are found: under RMSE (CC, ATCC), our method finds 9 (16, 12) such CNN models, whereas the correlation-analysis-based approach finds only 6 (13, 12). This comparison shows that, compared to the gradient-based importance measure defined in this paper, the correlation coefficient does not adequately represent the contributions of the input variables in a CNN model. After all, CNNs are complex black-box models, and the advantage of our metric is that it uses the backpropagation of the CNN model itself.
Figure 7. Same as Fig. 3, except for the predictor elimination procedure based on correlation analysis.
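The correlation-analysis baseline described above can be sketched as follows; the function and array names are ours, and the inputs are assumed to be already standardized month by month as described.

import numpy as np

def correlation_importance(X, y):
    # X: (n_months, C, P, Q) predictors; y: (n_months,) total regional
    # precipitation.  Returns the mean absolute Pearson r per predictor.
    n, C, P, Q = X.shape
    Xf = X.reshape(n, C * P * Q)
    Xc = (Xf - Xf.mean(axis=0)) / Xf.std(axis=0)
    yc = (y - y.mean()) / y.std()
    r = (Xc * yc[:, None]).mean(axis=0)            # Pearson r per grid cell
    return np.abs(r).reshape(C, P * Q).mean(axis=1)

Sorting the candidate predictors by this score in ascending order gives the elimination sequence used for Fig. 7.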
The above experiments demonstrate that the defined gradient-based predictor importance metric is representative in measuring predictor contributions in a CNN model and instructive for predictor selection. Figure 8 (right) shows the square root of the contribution metric of each predictor in the different CNN models; the square root is taken for better visual contrast. The vertical coordinates, from top to bottom, give the order of predictor elimination under our method. Note that the magnitude of each predictor's metric within the same model measures the importance of that predictor. Although the metric values of a given predictor are not highly comparable across models, it is evident that, as some predictors are eliminated, the remaining predictors provide increasingly important contributions to the model. Additionally, the left subplot of Fig. 8 displays a bar plot of the importance metrics (averaged absolute correlation coefficients) of the candidate predictors computed in the correlation analysis experiment, with the removal indices (from 1 to 20) labeled to the right of each bar. Comparing the two elimination sequences, it is hard to find a meaningful relationship between them, except that some predictors of the same variable are removed in about the same order. For example, shum500 is the last specific humidity predictor, and vwnd850 the last meridional wind component predictor, to be eliminated under both approaches.
Figure 8. (Left) Bar plot of the importance metrics (correlation coefficients) of the predictors calculated using the correlation analysis method. Predictors of the same variable are rendered in the same color. The indices of the predictors in the elimination sequence under the correlation-analysis-based method are labeled to the right of the bars. (Right) The square root of the contribution metrics of all predictors in the CNN models with different numbers of input predictors (x-axis) throughout the selection procedure.
The results under both predictor selection schemes suggest that the humidity and wind components are the variables most critical to precipitation. This is to some extent consistent with Ramseyer and Mote (2016), who found lower-tropospheric humidity and wind to be among the more important of 37 predictors for precipitation in a neural network model. The same conclusion was reached by Hu and Zhao (2016), who investigated the primary influence on precipitation over South China of moisture transport driven by wind and humidity. However, some aspects of the predictor removal process are difficult to explain fully, such as which predictors take over the contribution of the removed ones in the new model, or why the relative magnitudes of the contributions of some variables change across models; these questions are left for future work.
Next, we select one grid point of interest from the 157 grids in the downscaling region and compute the correlation coefficients between the precipitation at that grid point and the gridded predictors, as well as the gradients of that grid point's output with respect to the input predictors (the $ \omega_d $ in Eq. (2), averaged over the validation set), calculated in the reference model using guided backpropagation. Both the correlation coefficients and the gradients are tensors of shape $ 6 \times 8 \times 20 $, where 20 is the number of candidate predictors. To obtain comparable visualizations, we scaled both tensors to lie between −1 and 1 by dividing the gradients and correlation coefficients by their respective maximum absolute values. Heatmaps of the scaled correlation coefficients and gradients of three predictors are shown in Fig. 9. The selected predictors come from three different circulation variables, all of which are among the input predictors of the LEAST model.

Figure 9. Heatmaps of three predictors' scaled correlation coefficients (left) and gradients (right). The selected grid point of interest is highlighted with cyan dots. Red and black dots are the grids of the predictors and the predictand, respectively.
Overall, the two importance measures show similar spatial distribution patterns, which indicates that although we use the backpropagation algorithm of a black-box model, the results are not incomprehensible. Specifically, the importance of shum500 (Fig. 9a) at the grid point of interest decreases toward the southwest and northeast; uwnd500 and vwnd1000 (Figs. 9b and 9c) both show pronounced north−south and east−west differences. Of course, there are also many differences between the correlation coefficients and the gradients, such as the relative data magnitudes and the locations of the boundaries between positive and negative values. In terms of importance, vwnd1000 has the largest correlation coefficients (darkest red shade) among the three predictors, while shum500 has the most significant gradient values. The similarity of the distribution patterns between the correlation coefficients and the gradients lends support to the definition of the gradient-based variable importance measure. The differences between them arise mainly because the correlation coefficients are linear and are calculated independently, without considering the interactions between predictors, whereas the gradient is computed by backpropagation through a CNN built with all predictors, a highly nonlinear relationship that exploits the interactions between variables.
The above experiments and discussion illustrate that the proposed method can find a better subset of the candidate variables for the statistical downscaling of monthly precipitation, where the models are built on a year-round dataset, so that the selected variables can be considered the most relevant factors affecting precipitation in the given region throughout the year. However, the physical factors affecting precipitation in the cold and warm seasons often differ (Gutowski et al., 2004). We therefore apply the greedy predictor selection algorithm to analyze the main contemporaneous factors affecting precipitation in South China in the cold and warm seasons separately. To this end, we divide the augmented dataset into two parts by month, with April to September as the warm season and the remaining six months as the cold season, so that each sub-dataset is about half of the original; the greedy predictor selection algorithm is then applied to each sub-dataset (see the sketch below). Generally, the warm season in southern China has more precipitation and higher temperatures than the cold season.
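The season split is a straightforward month-based partition of the samples; as a sketch:

import numpy as np

def split_by_season(X, y, months):
    # months: calendar month (1-12) of each sample.
    warm = (months >= 4) & (months <= 9)   # April-September = warm season
    return (X[warm], y[warm]), (X[~warm], y[~warm])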
Figure 10 compares the ATCC scores of the resulting CNN models in the warm and cold seasons. Figure 10a shows the change in the ATCC scores of the constructed CNN models on the test set as predictors are removed: the accuracy for both the cold and warm seasons follows a trend similar to the year-round results (see Fig. 3), first increasing gradually and then decreasing. Note that although the same horizontal coordinate in Fig. 10a means the same number of input predictors for both seasons, the combinations of predictors may differ. The models perform better in the cold season than in the warm season, and the numbers of predictors used in the BEST and LEAST models are higher in the warm season (12 and 7) than in the cold season (9 and 5). This indicates that warm-season precipitation in South China is more difficult to simulate and has more influencing factors, consistent with past findings for other regions (Davis et al., 2003; Bukovsky and Karoly, 2011). The box plots of the ATCC scores of the cold-season and warm-season reference, LEAST, and BEST models in Fig. 10b show that although the cold-season models have higher mean scores, their scores are less stable (wider quantile ranges of the boxes). This may be due to the low precipitation in the cold season, where small changes in prediction bias can cause large fluctuations in the scores.
Notably, the main factors affecting precipitation in the cold and warm seasons are very similar to those for the full year. Taking the LEAST models in Fig. 10 as an example, the cold season has five main factors: shum1000, shum700, uwnd1000, uwnd500, and vwnd700; the warm season has seven: shum1000, shum500, uwnd500, uwnd700, vwnd1000, vwnd700, and vwnd850. The factors most relevant to precipitation in both seasons are thus still the wind and humidity variables. Moreover, the main factors of both seasons form a subset of the top 10 year-round main factors (see Fig. 8), though with some differences. For instance, shum700 and uwnd1000, main factors in the cold season, are not as important in the warm season. In addition, the top-10 year-round factor uwnd850 was not selected in the LEAST model of either season, though it was one of the 12 predictors of the warm-season BEST model. The results show that the wind components (uwnd and vwnd) that mainly affect precipitation need not come from the same pressure level, because uwnd and vwnd (i.e., east−west and north−south winds) are treated as different predictors in this study. The wind field at a particular pressure level may be important, yet only one of its components (e.g., the north−south wind) may drive the meridional water vapor transport that forms precipitation.
The cold- and warm-season experiments demonstrate that the proposed predictor selection algorithm can be applied to the two seasons separately to filter out the main factors affecting precipitation in each, and these main factors can in turn be used to build more accurate statistical downscaling CNN models. Of course, the dataset could be divided further, into four seasons (spring, summer, autumn, and winter) or 12 months, to explore the principal factors affecting precipitation at finer temporal resolution. We have not explored this here because too many divisions reduce the sample size and would affect the generalization ability of the resulting CNN models.