-
We first evaluated the GFS, ConvLSTM-ED, and different hybrid models against in situ observations (Figs. 3, 4). The ConvLSTM-ED model performed reasonably well in short-term predictions, with predictability comparable to that of the SMAP L4 data (i.e., the training target SM). However, its performance degraded dramatically as the forecast time scale increased (Fig. 3), particularly in southeastern China (Figs. 4g–i). The GFS model performed worse than the ConvLSTM-ED model at all forecast time scales, especially in short-term forecasting (Fig. 3). Although the ConvLSTM-ED model outperformed the GFS model at all forecast time scales, it underperformed the GFS model in wet regions (e.g., southeastern China) (Figs. 4d–f, 4g–i). Notably, we cannot conclusively determine from these results which model (GFS or ConvLSTM-ED) was superior; moreover, such a determination was beyond the scope of our paper (the reasons are given in Text S5).
Figure 3. The mean (a) R and (b) ubRMSE of the different forecast models at different forecast time scales. Dashed lines denote the performance of the SMAP L4 data evaluated against in situ observations. The abbreviations of the model names are the same as in section 3.
Figure 4. The spatial distribution of performance (R) in the 1-, 7- and 16-day forecasts of the different models. We used the average model as the baseline hybrid model against which to evaluate the performances of the different hybrid models. Panels (a–c) show the performance of the average model, while the remaining rows show the difference between the R of the target model and the R of the average model. Red points indicate improved performance compared to the average model, while blue points indicate a decline in performance.
The average model dramatically improved upon the ConvLSTM-ED model at nearly all forecast time scales, demonstrating the benefit of adding physical information to DL models. However, the performance of the average model still dropped dramatically for long-term forecasting (Fig. 3), particularly in southeastern China (Figs. 4a–c), consistent with the behavior of the ConvLSTM-ED model (i.e., decreasing performance in southeastern China as the forecast time scale increased). Moreover, the average model performed poorly in northern China because it inherited the bias of the GFS model in this region. These results indicate that simple averaging could not fully exploit the benefits of both the PB and DL models.
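The averaging scheme itself is straightforward; a minimal sketch (assuming the DL and PB forecasts are already on a common grid and in the same units) might look like:

```python
import numpy as np

def average_hybrid(dl_sm, gfs_sm):
    """Baseline hybrid: the per-pixel, per-lead-time mean of the
    ConvLSTM-ED (DL) and GFS (PB) soil moisture forecasts."""
    dl_sm, gfs_sm = np.asarray(dl_sm), np.asarray(gfs_sm)
    return 0.5 * (dl_sm + gfs_sm)
```

Because the weights are fixed at 0.5 regardless of region or lead time, the scheme necessarily carries the bias of whichever member is worse at a given location, which is consistent with the degraded performance seen in northern China.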
The condition model greatly improved the long-term predictability compared to the average model at nearly all stations (Fig. 3, Figs. 4j–l), demonstrating that adding the GFS SM forecasts as exogenous inputs to the decoder of the ConvLSTM-ED model can significantly improve long-term predictions. However, the performance of the condition model decreased significantly in short-term predictions compared to the average model (Fig. 3). In short-term predictions, the regions where the condition model underperformed (blue dots in Fig. S3b) were consistent with those of the GFS model (red dots in Fig. S3a in the ESM) when compared to the ConvLSTM-ED model. This result highlights a problem with the condition model: it introduced biases from the short-term predictions of the GFS model into the DL model. Notably, although the condition model could propagate short-term forecast errors, incorporating the PB SM evolution (i.e., sharpened predictions from GFS) in the decoder still improved the performance of the ConvLSTM-ED model significantly. This emphasizes the significance of integrating physically consistent predictions into pure DL models.
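The decoder conditioning can be pictured as a loop in which each decoding step receives the GFS forecast for that lead time as an exogenous input. The toy sketch below uses a hypothetical linear cell (with illustrative weights) standing in for the trained ConvLSTM decoder; it is meant only to show the interface, not the actual model:

```python
def toy_cell(x_prev, gfs_t, state):
    """Toy stand-in for a ConvLSTM decoder cell: blends the carried state
    with the exogenous GFS field. The 0.6/0.4 weights are illustrative;
    x_prev is unused by this toy cell."""
    state = 0.6 * state + 0.4 * gfs_t
    return state, state

def condition_decode(cell, init_state, gfs_forecasts):
    """Decoder loop conditioned on the GFS forecast at each lead time."""
    preds, state, x = [], init_state, init_state
    for gfs_t in gfs_forecasts:
        x, state = cell(x, gfs_t, state)
        preds.append(x)
    return preds
```

The sketch also makes the failure mode visible: because every step ingests the GFS field directly, any short-term GFS bias flows straight into the early decoder outputs.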
The attention model further improved the short-term predictability and matched the long-term predictability of the condition model (Fig. 3). Spatially, the attention model retained the high predictability of the ConvLSTM-ED model in northern China and effectively overcame its deficiencies in southeastern China (Figs. 5g–i). Moreover, the attention model outperformed the condition model in most regions (Figs. 5j–l), with particularly marked improvement in short-term predictions where the condition model had introduced the bias of the GFS model (Fig. 5). These findings indicate that the attention mechanism can adaptively learn to exploit the benefits of the ConvLSTM-ED and GFS models for different forecast time scales and soil conditions, thereby significantly improving model performance in both space and time.
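Conceptually, the attention mechanism learns, for each location and lead time, how much weight to place on the DL stream versus the PB stream. A minimal NumPy sketch of such a weighted fusion (the scores would come from a trained scoring network, which is not shown; here they are free inputs):

```python
import numpy as np

def softmax(s, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(s - s.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_fuse(dl_sm, gfs_sm, scores):
    """Blend the DL and GFS forecasts with attention-style weights.
    scores: array (..., 2) of unnormalized relevances for [DL, GFS]."""
    w = softmax(np.asarray(scores, dtype=float))
    return w[..., 0] * np.asarray(dl_sm) + w[..., 1] * np.asarray(gfs_sm)
```

With equal scores this reduces to the average model; with strongly one-sided scores it effectively selects a single member, which is how an adaptive scheme can favor the DL model at short leads and the PB model at long leads.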
Figure 5. The R of the (a–c) GFS, (d–f) ConvLSTM-ED, and (g–i) attention models, and (j–l) the improvement of the attention model compared to the condition model.
The ensemble model outperformed all the other hybrid models (in terms of both R and ubRMSE) at all forecast time scales (Fig. 3), especially in long-term predictions; for example, it improved the mean R by 65% (from 0.205 to 0.340) relative to the ConvLSTM-ED model for the 16-day predictions. Moreover, compared to the attention model, the ensemble model further reduced the bias in southeastern China inherited from the ConvLSTM-ED model (Figs. 4p–r), and it outperformed the ConvLSTM-ED model at 79.5% of the in situ stations. These results underline the value of ensemble methods and emphasize the exceptional spatiotemporal predictability of the ensemble model.
-
We further evaluated the predictability using gridded datasets via TCA, which evaluates performance with respect to an unknown truth. The spatial distribution of SNR is shown in Fig. 6. The results were similar to those of the in situ evaluation and likewise demonstrated the superior performance of the proposed hybrid models (attention and ensemble). For example, the condition model enhanced the long-term predictability but decreased the short-term predictability (Figs. 6j–l, 6p–r). In addition, the attention model further corrected the bias of the condition model in short-term predictions (Figs. 6j–l, 6m–o), whereas the ensemble model achieved the best performance among all hybrid models (Fig. 6). We also applied TCA to another collocated triplet (i.e., SMOS data and in situ data from the CMA); the result was again consistent with that of the in situ evaluation (Text S6, Fig. S4 in the ESM).
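For reference, the TCA-based SNR of one member of a triplet can be estimated from sample covariances, following the standard triple collocation formulation (this is a generic sketch under the usual TCA assumptions of linearly related datasets with mutually independent errors, not the exact code used in the study):

```python
import numpy as np

def tca_snr(x, y, z):
    """Triple-collocation SNR (in dB) of dataset x, given a triplet (x, y, z)
    of collocated time series with independent errors. The signal variance of
    x is estimated as cov(x,y)*cov(x,z)/cov(y,z); the rest of var(x) is noise.
    SNR > 0 dB means the signal variance exceeds the noise variance."""
    c = np.cov(np.vstack([x, y, z]))
    signal = c[0, 1] * c[0, 2] / c[1, 2]   # estimated signal variance of x
    noise = c[0, 0] - signal               # residual error variance of x
    return 10.0 * np.log10(signal / noise)
```

To score another member of the triplet, the arguments are simply rotated (e.g., `tca_snr(y, x, z)`).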
Figure 6. TCA-based SNR of different models. The triplets of the TCA are [*, ERA5-Land, SoMo.ml], where * denotes the forecast models. Panels (a–c) show the average model. The remaining rows show the difference between the SNR of the target model and the average model.
Figure 7 shows the SNR of the SMAP L4 data (i.e., the training target of the DL models) and of the ConvLSTM-ED and ensemble models. The ConvLSTM-ED model showed inferior performance compared with the SMAP L4 data in most regions, and the regions where it underperformed were consistent with those of the SMAP L4 data. This indicates that the performance of the ConvLSTM-ED model depended strongly on the quality of the SMAP L4 data, which acted as its performance ceiling. Notably, the ensemble model outperformed the SMAP L4 data in most regions for short-term predictions (Fig. 7), particularly in drought-prone areas (e.g., the North China Plain), suggesting that the ensemble of different hybrid models could “break” the performance ceiling imposed by the training data in some areas. This is attributable to the introduction of physical information into the pure DL models. The in situ validation of the SMAP L4 data and the ensemble model further confirmed this result (Fig. S5 in the ESM). However, the long-term predictability of the ensemble model was still far inferior to that of the SMAP L4 data. Moreover, in long-term predictions, none of the forecast models showed satisfactory performance (i.e., signal larger than noise, SNR > 0) in more than half of the regions (Fig. S6c in the ESM), underscoring the challenge of long-term SM forecasting, which necessitates further investigation.
-
We further evaluated the drought predictability of the different forecast models. Figure 8 illustrates the kernel density curves of the SWDI of the in situ observations and the different models. Surprisingly, the SWDI distribution of the in situ observations contained two peaks, located near SWDI values of −10 and −2. We further applied more stringent quality control to the in situ observations (Dorigo et al., 2013) and found the same two-peak structure (Fig. S7 in the ESM). These two peaks may be a unique property of the in situ observation datasets used in our study. The GFS and average models tended to reproduce the right-hand peak of the in situ observations (i.e., they gave relatively stable predictions and missed some extreme events, such as SWDI < −10). In contrast, the attention model captured the left-hand peak (i.e., extreme drought events) better than the other hybrid models, showing the effectiveness of the attention mechanism for extreme drought forecasting. Furthermore, although the ensemble model provided the best overall performance, it tended to forecast the mean SWDI of the observations. This result emphasizes that ensemble methods can provide a more stable prediction by correcting the biases of individual models but may also “remove” some extreme events, which may make them unsuitable for drought forecasting.
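For context, the SWDI is commonly computed from soil moisture together with two soil hydraulic reference values, the field capacity and the wilting point (a generic sketch of the usual definition; under this convention, values below about −10 are typically classed as extreme drought, consistent with the left-hand observational peak):

```python
def swdi(theta, theta_fc, theta_wp):
    """Soil Water Deficit Index: 10 * (theta - theta_fc) / (theta_fc - theta_wp),
    where theta is the soil moisture, theta_fc the field capacity, and
    theta_wp the wilting point (all in the same units, e.g., m3/m3).
    SWDI >= 0 indicates no water deficit; increasingly negative values
    indicate increasingly severe drought."""
    return 10.0 * (theta - theta_fc) / (theta_fc - theta_wp)
```

At field capacity the index is 0, and at the wilting point it is −10, so the two observational peaks near −2 and −10 correspond to mild-deficit and near-wilting-point conditions, respectively.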
Figure 8. The kernel density curves of the SWDI of the in situ observations and the different forecast models (lines with different colors) for the (a) week-1 and (b) week-2 forecasts.
We further evaluated the fraction of observed drought events that were correctly detected by the forecast models. Table 1 summarizes the POD values of the different models over the Köppen–Geiger major climate zones. Overall, the attention model accurately detected 60.6% and 56.8% of drought events in the 1- and 2-week forecasts, respectively, and achieved the best detection over the arid, temperate, cold, and polar regions. Moreover, the ensemble-average operation (see the average and ensemble models) consistently yielded drought-event detection skill intermediate among its members, reinforcing the prior results. Notably, the GFS model excelled in the temperate region but performed the worst of all the forecast models over the arid, cold, and polar regions, indicating a poor representation of SM dynamics in these regions.
Table 1. The probability of an accurate drought event detection (POD) by the different models over different climate regions, based on in situ SM observations. The abbreviations of the model names are the same as in Fig. 1. The week-1 and week-2 columns represent the ability to forecast 1-week and 2-week drought, respectively. n denotes the number of stations located in the target climate region.

Model | Tropical (n=16) | Arid (n=91) | Temperate (n=642) | Cold (n=350) | Polar (n=30)
 | Week 1 | Week 2 | Week 1 | Week 2 | Week 1 | Week 2 | Week 1 | Week 2 | Week 1 | Week 2
GFS | 0.578 | 0.493 | 0.511# | 0.477# | 0.665* | 0.582 | 0.506# | 0.469# | 0.396# | 0.370#
ConvLSTM | 0.720 | 0.661 | 0.573* | 0.521 | 0.605# | 0.560 | 0.575 | 0.532 | 0.656 | 0.637
average | 0.521 | 0.479 | 0.536 | 0.492 | 0.643 | 0.592 | 0.542 | 0.502 | 0.529 | 0.502
condition | 0.744* | 0.693* | 0.543 | 0.519 | 0.605# | 0.532# | 0.582 | 0.545 | 0.640 | 0.578
attention | 0.655 | 0.630 | 0.570 | 0.536* | 0.629 | 0.598* | 0.599* | 0.550* | 0.696* | 0.644*
ensemble | 0.506# | 0.474# | 0.551 | 0.531 | 0.613 | 0.564 | 0.571 | 0.538 | 0.622 | 0.577
*Best model to detect drought events over the target climate region.
#Worst model to detect drought events over the target climate region.
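The POD reported in Table 1 is the standard hit rate from the contingency table of forecast versus observed drought events; a minimal sketch over boolean event series:

```python
def pod(forecast_events, observed_events):
    """Probability of detection: hits / (hits + misses), where a hit is an
    observed drought event that the forecast also flags, and a miss is an
    observed event that the forecast fails to flag."""
    hits = sum(1 for f, o in zip(forecast_events, observed_events) if f and o)
    misses = sum(1 for f, o in zip(forecast_events, observed_events) if o and not f)
    return hits / (hits + misses)
```

Note that POD alone does not penalize false alarms; a model that flags drought everywhere attains POD = 1, so it is best read alongside the general skill scores in Fig. 3.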
-
It was found in this study that embedding physical information in DL models through suitable hybrid methods dramatically improved the SM predictability compared to pure DL models, which can be attributed to several possible reasons. Firstly, it is well known that pure DL models may produce unrealistic predictions because they lack physical consistency (e.g., mass and energy balance). For example, Fang et al. (2019) found that pure DL models produced non-physical, highly fluctuating simulations. Thus, physical information provided by PB models, which obey physical laws, can be used to correct the non-physical predictions of pure DL models. Secondly, pure DL models can benefit from the assimilation of high-quality observations in PB models (Fang and Shen, 2020); for example, a pure DL model cannot predict the corresponding SM variation if a rainfall event is missing from the forcing data. However, data assimilation can remedy such forcing errors with high-quality observations, resulting in a better temporal representation of SM dynamics. One benefit of using the GFS forecasts (which include data assimilation) in our study was thus to help the pure DL models correct the bias induced by forcing errors. Thirdly, Daw et al. (2022) pointed out that pure DL models rely heavily on the quality of the training data and can only depict the evolution of SM present in those data (Klocek et al., 2022). This may lead to significant biases over regions with poor-quality data (e.g., wet regions in the SMAP L4 data). In contrast, PB models can depict the dynamics of SM under different soil conditions (e.g., precipitation infiltrates more easily in regions with high soil porosity) and can provide stable and realistic simulations given high-quality rainfall forcing [e.g., wet regions; see Maggioni et al. (2012)].
In addition, the GFS model can simulate SM in different water states (e.g., solid, liquid) through soil dynamics, which pure DL models struggle to do accurately because of the poor quality of the training datasets during the freezing period. Thus, incorporating physical information into pure DL models might help to overcome the deficiencies derived from data (Daw et al., 2022).
Although we introduced PB features to improve the model, the proposed hybrid models still inherited the uncertainties from the supervised DL models, i.e., the uncertainty from the training data. In addition, another source of uncertainty came from the selection of hybrid schemes, as demonstrated in section 4. Furthermore, the quality of the PB models also contributed to the uncertainty. Parameterizations and inadequate representation of land processes can introduce uncertainties in hybrid models. However, when compared with the PB models, the hybrid models benefited from the fitting ability of the DL algorithm and the vast amount of data, which could partially correct systematic errors. Moreover, the introduction of PB features also alleviated the limitation of the training data when compared to the pure DL models. These findings suggest that hybrid models are a promising way of enhancing the prediction skill for meteorological and hydroclimatic variables (Slater et al., 2023).
The potential applications of SM forecasting models have been comprehensively discussed in Peng et al. (2021), and we highlighted two important application directions. Firstly, the proposed model could provide accurate initializations of land-surface conditions for numerical weather prediction (NWP) systems. Indeed, the integration of SM into several NWP models has been found to improve forecasts of atmospheric variables (Dharssi et al., 2011; Muñoz-Sabater et al., 2019; De Rosnay et al., 2020). Secondly, accurate predictions of SM could be utilized for monitoring, analyzing and providing early warnings of hydrometeorological disasters, including agricultural drought (Mishra et al., 2017) and floods (Li et al., 2018). Additionally, these predictions could inform decision-making processes, such as in watershed management (Heimhuber et al., 2017) and irrigation water management (Lawston et al., 2017).
In our study, we aimed to investigate the benefits of incorporating physical information into DL models, but exploring the interpretability of the proposed models is beyond the scope of the present paper. However, these complex hybrid models may have low interpretability and should be used with caution in practical applications. Explainable artificial intelligence (XAI) provides tools to aid decision-making when applying DL models in real-world applications. Several studies have explored the interpretability of DL SM forecasting models using XAI tools. For example, Huang et al. (2023) adopted various post-hoc interpretation methods to assess feature effects on SM predictions and showed that a comprehensive understanding of the relationship between input features and predicted SM could be achieved. The interpretation methods used in their study, such as Shapley values and partial dependence plots, could be used to investigate the contributions of different features (e.g., GFS forecasted values) to our proposed models, which deserves further exploration.
We end our discussion by pointing out some limitations of our study. Firstly, we did not identify the “best” hybrid scheme for achieving the “best” forecast (i.e., general performance and drought predictability) across different forecast time scales and spatial regions. For example, the ensemble model achieved the best general performance at all forecast time scales (section 4.1), but the ensemble method may “remove” some extreme drought events (section 4.3). Therefore, we highlight that the choice of hybrid method may depend on the application: for example, the ensemble model is suited to long-term, stable predictions that mainly concern the average state of SM, while the attention model is suited to forecasting extreme drought events. Secondly, we integrated GFS with the ConvLSTM-ED model because of its efficiency and widespread use (Fan and van den Dool, 2011; Yin et al., 2019). However, neither GFS nor ConvLSTM-ED is the “best” PB or DL model for SM prediction, so our results may not fully represent the properties of PB and DL models. Nonetheless, we demonstrated the improvements of the different hybrid methods based on these two widely used models. Thirdly, hybrid models can adopt different frameworks, e.g., physically guided DL (Willard et al., 2022a) or differentiable programming (Feng et al., 2022). In this study, we focused only on using PB model outputs and observational features in a hybrid modeling setup to generate strong-performing SM predictions; we did not introduce physical laws or principles to guide the DL models. Several “deep” hybrid frameworks have been developed (Read et al., 2019; Liu et al., 2022) that can “force” DL models to forecast with physical consistency, thereby possibly providing more realistic and stable predictions (Willard et al., 2022a).
Moreover, pre-training DL models on PB model outputs and fine-tuning them on the target data (i.e., transfer learning) may also exploit the physical information.