-
The fires in the WUS considered in this study are from California (CA), Oregon (OR) and Washington (WA), while the hail in the CUS is from Montana (MT), Wyoming (WY), Colorado (CO), New Mexico (NM), North Dakota (ND), South Dakota (SD), Nebraska (NE), Kansas (KS), Oklahoma (OK), and Texas (TX), as shown in Fig. 1. MT, WY, CO, and NM are near the Rocky Mountains and are referred to as the original column 1 states (original CS1), while ND, SD, NE, KS, OK and TX are further downstream and are referred to as the original column 2 states (original CS2) (Fig. 1).
Figure 1. Map of fire states in the WUS and hail states in the CUS. The three fire states in the WUS (highlighted by red points) are WA, OR and CA. The CUS states are divided into two columns: the original CS1 (i.e., MT, WY, CO, and NM) and the original CS2 (i.e., ND, SD, NE, KS, OK and TX). States scattered with blue points are those more likely to be affected by fires in the WUS. The green dashed rectangle denotes the region of generally westerly winds, which is considered as the plume transport path from the WUS to the CUS.
The study period is from 2001 to 2020. Only the warm season from March to September is considered since hailstorms are minimal in the cold season. The data selection process for co-occurrences of WUS fires and CUS hail is detailed in section 2.1.1. The two tree-based ML models (i.e., RF and XGB) are developed to understand the relationships between the daily hail occurrence for large hail with hail size ≥ 2.54 cm in the CUS and the fire features in the WUS (e.g., fire size, fire intensity, and smoke aerosols), with consideration of the co-located meteorological variables (e.g., air temperature, U-wind) over fire regions and along the path of fire plumes. The trained ML models are used to identify the important fire features and co-located meteorological variables contributing to the hail characteristics for physical understanding. Section 2.1 introduces the data used in this study, and section 2.2 describes the methodology of the tree-based ML models, the model performance evaluation metrics, and variable ranking.
-
The hail observational datasets used for this study are from the National Oceanic and Atmospheric Administration’s Storm Prediction Center (SPC) database. The advantage of using hail reports from SPC is the confidence in the occurrence of hailstones on the ground. However, hail reports could underestimate the hail size, for example, due to surface melting prior to and during hail size measurement (Blair et al., 2011, 2017). This underestimation in hailstone size is more obvious for smaller hail sizes, as they tend to melt faster. Smaller-sized hailstones are also less likely to cause direct damage compared to large hail (size ≥ 2.54 cm). Therefore, in this study, we mainly focus on large hail with size ≥ 2.54 cm, which covers both severe hail (2.54–5.08 cm) and significantly severe hail (≥ 5.08 cm). For each state over the study region of the CUS, we calculate the large hail occurrence and the large hail count from March to September (warm season) over the period of 2001–2020 (Table 1). The daily large hail occurrence in a specific state is a binary label: 1 if the daily total large hail count in that state is greater than a threshold, and 0 otherwise. The threshold excludes very minor hail events, which would not produce much impact and are difficult to predict. To check whether the choice of threshold affects the prediction of large hail occurrence in a specific state, two hail count thresholds (i.e., 10 and 20) are considered. These labeled hail occurrences are then used as the target variables in the ML classification models.
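The labeling step described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `label_hail_occurrence` and the daily counts are hypothetical, and the strict comparison follows the "greater than a threshold" wording.

```python
def label_hail_occurrence(daily_counts, threshold):
    """Return 1 for days whose large-hail count is greater than the threshold, else 0.

    The strict comparison (>) mirrors the wording of the text; it is an
    assumption of this sketch, not a confirmed implementation detail.
    """
    return [1 if count > threshold else 0 for count in daily_counts]


# Hypothetical daily counts of large-hail (>= 2.54 cm) reports for one state.
daily_counts = [0, 3, 12, 25, 9, 31]

labels_10 = label_hail_occurrence(daily_counts, threshold=10)
labels_20 = label_hail_occurrence(daily_counts, threshold=20)

print(labels_10)  # [0, 0, 1, 1, 0, 1]
print(labels_20)  # [0, 0, 0, 1, 0, 1]
```

Raising the threshold from 10 to 20 turns the day with 12 reports from a positive into a negative sample, which is the kind of sensitivity the two thresholds are meant to probe.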
Table 1. Target and predictor variables used in the ML models.
As mentioned in the Introduction, this study follows on from the modeling study of the impacts of WUS wildfires on weather hazards in the CUS in Zhang2022. Unlike Zhang2022, which used wildfire data from the Fire Program Analysis Fire-Occurrence Database (FPA-FOD), we use the thermal anomaly datasets from the Terra Moderate Resolution Imaging Spectroradiometer (MODIS) Thermal Anomalies and Fire Daily (MOD14A1) Version 6. FPA-FOD only includes wildfires reported by federal, state, tribal, and local governments, whereas MOD14A1 detects fire-related thermal anomalies and therefore captures all types of fires, including wildfires, agricultural field burning, prescribed fires, etc. (Huang et al., 2012; Wang and Wang, 2020). The datasets are generated at ~1 km spatial resolution and daily temporal resolution. The variables include the fire mask, pixel quality indicators, maximum fire radiative power (maxFRP), and the position of the fire pixel within the scan. Individual 1-km pixels are assigned to one of nine fire mask pixel classes, which indicate the different confidence levels of fire occurrence. In this study, we only use the fire pixels with the highest confidence level to calculate the daily fire features for individual fire states (i.e., CA, OR, or WA) as well as the whole WUS region (i.e., CA + OR + WA). The fire features include the mean and maximum maxFRP, the total number of fire pixels, and the temporal change of fire pixels (the change in the daily total number of fire pixels compared with that of the previous day) for each fire day (Table 1). The black carbon and organic carbon aerosols (BC+OC), characteristic of smoke aerosols in the fire regions, are also considered. We use the column-integrated mass concentrations of BC+OC from the Modern-Era Retrospective Analysis for Research and Applications, Version 2 (MERRA-2) (Gelaro et al., 2017) to represent the smoke aerosols over WUS fire regions.
Here, we only consider fires with daily burned areas no smaller than 20 km² (i.e., the top 30% of daily burned areas of WUS fires). Small fires with total burned areas less than 20 km² over the WUS are not considered in this study, as their impacts on remote hailstorms should be minor or even negligible, especially when the fire pixels are sparsely distributed over the WUS.
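As a rough illustration of the daily fire features listed above (mean and maximum maxFRP, number of fire pixels, and day-to-day change in pixel count), the following sketch computes them from per-day lists of maxFRP values. The function name, the dictionary keys (loosely modeled on the abbreviations in Table 1), the input values, and the handling of days without fire pixels are all assumptions made for this example.

```python
from statistics import mean


def daily_fire_features(frp_by_day):
    """Compute simple daily fire features.

    frp_by_day: list of lists; each inner list holds the maxFRP values of the
    high-confidence fire pixels detected on one day (hypothetical input layout).
    """
    features = []
    previous_count = None
    for frp_values in frp_by_day:
        count = len(frp_values)
        features.append({
            "ngrids": count,  # total number of fire pixels
            # change relative to the previous day; undefined on the first day
            "gdiff": None if previous_count is None else count - previous_count,
            # assumption: days without fire pixels get 0.0 for both statistics
            "maxFRP_m": mean(frp_values) if frp_values else 0.0,
            "maxFRP_max": max(frp_values) if frp_values else 0.0,
        })
        previous_count = count
    return features


# Hypothetical maxFRP values for three consecutive days.
frp_by_day = [[10.0, 30.0], [20.0, 40.0, 60.0], [50.0]]
features = daily_fire_features(frp_by_day)
for day_features in features:
    print(day_features)
```

In this toy input the pixel count rises from 2 to 3 and then drops to 1, so `gdiff` is undefined on the first day, then 1, then -2.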
To identify co-occurring events of WUS fires and CUS hailstorms, we require WUS fire days (fires ≥ 20 km² in size) to coincide with hailstorm days (daily hail counts ≥ 10 or 20 over the CUS). To account for the time lag of the remote effect of WUS fires, we also require fires to exist within 2–4 days before the occurrence of hailstorms, based on the estimate of aerosol optical depth changes in Zhang2022. For example, for a selected storm occurring on 26 July, not only that day but also 24 and 25 July must be fire days to satisfy the 2-day requirement (we also tested 3 and 4 days to judge the sensitivity). The co-occurring events identified with daily hail counts ≥ 20 over the CUS in each year are shown in Fig. 2. As discussed earlier, the use of MODIS fire data, which include all kinds of fires, increases the sample size of co-occurring events compared with Zhang2022, in which only wildfires were considered. We have about 30 co-occurring events in 2008 and 2009, and more than 25 events in 2013, 2015 and 2016. There is no significant trend in the time series (Fig. 2).
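The selection rule above (a hailstorm day that is itself a fire day and is preceded by consecutive fire days) can be sketched as follows. The function names, the year, and the sample data are hypothetical; fire days are assumed to have already been filtered by the 20 km² burned-area criterion.

```python
from datetime import date, timedelta


def is_cooccurring(hail_day, fire_days, lag_days=2):
    """True if hail_day and each of the lag_days preceding days are fire days."""
    return all(
        (hail_day - timedelta(days=k)) in fire_days for k in range(lag_days + 1)
    )


def select_events(hail_counts, fire_days, min_count=20, lag_days=2):
    """Return hail days with enough reports that also satisfy the fire-lag rule."""
    return sorted(
        day
        for day, count in hail_counts.items()
        if count >= min_count and is_cooccurring(day, fire_days, lag_days)
    )


# Hypothetical fire days (already filtered by burned area) and daily CUS hail counts.
fire_days = {date(2015, 7, 24), date(2015, 7, 25), date(2015, 7, 26)}
hail_counts = {
    date(2015, 7, 25): 5,   # too few hail reports
    date(2015, 7, 26): 35,  # fire on 24, 25 and 26 July
    date(2015, 7, 27): 40,  # 27 July itself is not a fire day
}

events = select_events(hail_counts, fire_days, min_count=20, lag_days=2)
print(events)  # [datetime.date(2015, 7, 26)]
```

With `lag_days=3`, the 26 July storm would also need 23 July to be a fire day and would be dropped in this toy data set, which mirrors the 3- and 4-day sensitivity tests mentioned above.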
-
Other than WUS fire features (e.g., fire size, fire intensity, and smoke aerosols), the meteorological variables over the fire region and along the paths of fire plumes are considered in this study based on physical mechanisms revealed in Zhang2022. These meteorological variables include air temperature (T), relative humidity (RH), and U-wind. Behaviors of individual fires are determined by fire weather characterized by atmospheric elements such as T, RH, wind etc. (Liu et al., 2013). As with the smoke aerosols (BC+OC), the meteorological variables are also obtained from MERRA-2 to be physically consistent. The MERRA-2 data for BC+OC, T, RH, and U-wind are available every three hours at an approximate spatial resolution of 0.5° × 0.625° and 72 hybrid-eta levels. Here, we use the values at 850 hPa for BC+OC, RH and T, and values at 250 hPa and 850 hPa for U-wind. The daily mean and maximum values for BC+OC, RH, and T are summarized over individual states (i.e., CA, OR, or WA) as well as the whole WUS region (i.e., CA + OR + WA) within 2–4 days before each hail day. Here, a hail day is defined as a day with total hail counts of at least 10 or 20 over the CUS. For U-wind, the daily mean and maximum values are calculated approximately along the paths of fire plumes [green dashed rectangle (38°–44°N, 125°–112°W) in Fig. 1] within 1–2 days before each hail day. Combining all these attributes as shown in Table 1, we have a total of 91 variables in the predictor matrix, which will be used as the inputs for training the ML models.
-
To study the impacts of fire features and their associated meteorological variables in the WUS on the hail characteristics in the CUS, RF and XGB models are built to model their complex and nonlinear relationships using the hail occurrence as the target variable. RF is a tree-based ensemble ML method for regression and classification, which was developed by Breiman (2001). It is mainly used to construct a prediction model in a supervised learning problem. It can also be used to evaluate the predictor variables with respect to their ability to predict the response (Boulesteix et al., 2012). XGB is an ensemble learning method based on the idea of boosting (Chen and Guestrin, 2016). The boosting approach incorporates multiple decision trees and combines all the predictions to obtain the final prediction. XGB is an implementation of gradient boosted decision trees, a weighted ensemble of weak prediction models. It is designed to prevent overfitting and to be computationally more efficient than the gradient boosting machine.
The CS1 and CS2 states (Fig. 1) have different distances from the WUS regions and the fire impacts are expected to be different. Therefore, ML models are built separately for CS1 and CS2 states. We further investigated the effects on the states located further downstream, specifically in the Midwest, and found the impact to be minimal. Consequently, our focus is primarily on the CS1 and CS2 states. The ML models for predicting hail occurrence (i.e., daily hail occurrence with hail size ≥ 2.54 cm in a specific state) in CS1 and CS2 are built by randomly selecting 80% of the dataset for training and 20% for testing. We then validated the ML models using five-fold cross-validation. Each ML model is formulated as
$$ y_p = f_{\mathrm{classifier}}(x_1, x_2, \ldots, x_i), $$
where $ f_{\mathrm{classifier}}(\cdot) $ is an RF or XGB classifier, built to predict the probability of hail occurrence ($ y_p $) in the CS1 states and CS2 states, and $ x_1, x_2, \ldots, x_i $ are the predictor variables, as shown in Table 1.
Various evaluation metrics are used to evaluate model performance. The classification models are evaluated by their accuracy, precision, recall, and F1 score. Precision and recall are defined as follows:
$$ \mathrm{Precision} = \frac{\mathrm{true\ positive}}{\mathrm{true\ positive} + \mathrm{false\ positive}}, $$
$$ \mathrm{Recall} = \frac{\mathrm{true\ positive}}{\mathrm{true\ positive} + \mathrm{false\ negative}}, $$
where “true positive” means that the occurrence of large hail is correctly predicted by the model, “false positive” means that large hail does not occur but is predicted as an event, and “false negative” means that the model fails to predict the occurrence of large hail when it does occur. The F1 score measures a model’s accuracy by combining the precision and recall as their harmonic mean:
$$ \mathrm{F1} = 2 \times \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}. $$
The F1 score has a maximum value of 1 and a minimum value of 0, and a higher F1 indicates a higher balance between precision and recall.
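For concreteness, the four metrics can be computed directly from the counts of true positives, false positives, and false negatives. The sketch below uses the standard definitions; the function name and the toy labels are hypothetical, and returning 0.0 when a denominator is zero is a convention chosen for this example.

```python
def classification_scores(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = large hail)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    # Convention for this sketch: a metric with a zero denominator is set to 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical observed and predicted hail occurrence for eight days.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

scores = classification_scores(y_true, y_pred)
print(scores)
```

Here two hail days are caught, one is missed, and one non-hail day is flagged, so precision, recall, and F1 all equal 2/3 while accuracy is 0.75. The gap shows why accuracy alone can look favorable when non-hail days dominate.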
In binary classifications, a model gives us a probability instead of the prediction (0/1) itself, so we need to convert this probability into a prediction by applying a classification threshold (e.g., default threshold of 0.5). However, the default threshold of 0.5 may not represent an optimal interpretation of the predicted probabilities, particularly for a classification problem with very imbalanced data. For example, in this study, the hail days over March to September from 2001 to 2020 are less than 20% of the total days in either CS1 or CS2. One way to find the optimal classification threshold is by checking and balancing the precision and recall values for the RF and XGB models and adjusting the classification thresholds ranging from 0 to 1.
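One possible way to carry out such a threshold search is sketched below. It scans candidate thresholds and keeps the one with the highest F1 score. Using F1 as the balancing criterion is an assumption of this example (the text only states that precision and recall are checked and balanced), and the function names and predicted probabilities are hypothetical.

```python
def precision_recall(y_true, y_prob, threshold):
    """Precision and recall after converting probabilities to 0/1 predictions."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def best_threshold(y_true, y_prob, candidates):
    """Return (threshold, F1) for the candidate with the highest F1 score."""
    best, best_f1 = None, -1.0
    for threshold in candidates:
        precision, recall = precision_recall(y_true, y_prob, threshold)
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        if f1 > best_f1:  # ties keep the lowest threshold
            best, best_f1 = threshold, f1
    return best, best_f1


# Hypothetical imbalanced sample: 2 hail days out of 10.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_prob = [0.05, 0.10, 0.10, 0.15, 0.20, 0.20, 0.32, 0.45, 0.35, 0.40]

candidates = [i / 20 for i in range(1, 20)]  # 0.05, 0.10, ..., 0.95
threshold, f1 = best_threshold(y_true, y_prob, candidates)
print(threshold, f1)
print(precision_recall(y_true, y_prob, 0.5))  # default threshold finds no hail day
```

In this toy sample the default threshold of 0.5 predicts no hail days at all, whereas a threshold of 0.35 recovers both hail days at the cost of one false alarm.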
As mentioned above, the set of predictor variables listed in Table 1 has a total of 91 variables. Such high dimensionality of the predictor matrix is usually associated with issues like collinearity, which may affect the variable rankings of the constructed RF and XGB models. To gain more robust variable rankings, we use the SHapley Additive exPlanations (SHAP) (Nohara et al., 2019) values from both RF and XGB to evaluate the importance of a variable to the prediction of hail characteristics. SHAP is an approach based on game theory for explaining local (individual-prediction) and global variable importance (Lundberg and Lee, 2017). When applying game theory to the explanation of variable importance, the predictor variables are considered as “players” in a cooperative game in which the goal is a prediction for a single observation. Each predictor variable obtains a “payout” based on its contribution to the game, so the “payout” is the corresponding variable importance. For a predictor variable, the SHAP value considers the difference in the model predictions made by including and excluding the predictor variable for all combinations of the other predictors. Variables with a larger mean absolute SHAP value (MASV) are relatively more important; that is, they have higher predictive power and contribute more to the prediction of the target variables. In this study, we combine the MASV from both RF and XGB to evaluate the variable importance by introducing the relative MASV. The relative MASV for a specific variable is calculated as
where
$ R_i $ is the relative MASV for variable $ i $, $ n $ is the total number of predictor variables, $ \mathrm{MASV}_i^{\mathrm{RF}} $ is the MASV for variable $ i $ from the RF model, and $ \mathrm{MASV}_i^{\mathrm{XGB}} $ is the MASV for variable $ i $ from the XGB model. Based on the relative MASV, variables that are in the top rankings are identified as important predictors for states in CS1 and CS2.
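The game-theoretic idea behind SHAP, namely averaging the change in prediction from including versus excluding a predictor over all combinations of the other predictors, can be illustrated with an exact Shapley computation on a toy example. This is a sketch of the underlying definition only, not the SHAP implementation used for the RF and XGB models. The value function `toy_prediction`, its coefficients, and the three predictor names are hypothetical. Exact enumeration is feasible here only because there are three predictors rather than 91.

```python
from itertools import combinations
from math import factorial


def shapley_values(value_fn, players):
    """Exact Shapley values by enumerating every coalition of the other players."""
    n = len(players)
    phi = {}
    for player in players:
        others = [p for p in players if p != player]
        total = 0.0
        for size in range(n):  # coalition sizes 0 .. n-1
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for coalition in combinations(others, size):
                members = frozenset(coalition)
                # marginal contribution of `player` to this coalition
                total += weight * (value_fn(members | {player}) - value_fn(members))
        phi[player] = total
    return phi


def toy_prediction(known):
    """Hypothetical predicted hail probability given the set of known predictors."""
    value = 0.1  # baseline prediction with no predictors known
    if "BCOC" in known:
        value += 0.3
    if "T" in known:
        value += 0.1
    if "BCOC" in known and "U850" in known:
        value += 0.2  # interaction: smoke matters more with strong westerlies
    return value


players = ["BCOC", "T", "U850"]
phi = shapley_values(toy_prediction, players)
print(phi)
```

The interaction term is shared equally between the two predictors involved, giving approximately 0.4 for `BCOC` and 0.1 each for `T` and `U850`. The three values sum to the difference between the full prediction (0.7) and the baseline (0.1). Averaging the absolute values of such attributions over all observations gives the MASV discussed above.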
Target variables | Abbreviation | Temporal resolution | Data Source |
Daily occurrence of hail with size ≥ 2.54 cm (0/1) in a state | Hail occurrence | daily | SPC |
Daily hail count for hail with size ≥ 2.54 cm in a state | Hail count | daily | SPC |
Predictor variables | Abbreviation | Temporal resolution | Data Source |
Mean maxFRP for fire grids in three WUS states within t days before hail | maxFRP_m_COW_dt | daily | MODIS |
Maximum maxFRP for fire grids in three WUS states within t days before hail | maxFRP_max_COW_dt | daily | MODIS |
Mean maxFRP for fire grids in states within t days before hail | maxFRP_m_s_dt | daily | MODIS |
Maximum maxFRP for fire grids in states within t days before hail | maxFRP_max_s_dt | daily | MODIS |
Total number of fire grids in three WUS states within t days before hail | ngrids_COW_dt | daily | MODIS |
Total number of fire grids in states within t days before hail | ngrids_s_dt | daily | MODIS |
Temporal change of fire grids in three WUS states within t days before hail | gdiff_COW_dt | daily | MODIS |
Temporal change of fire grids in states within t days before hail | gdiff_s_dt | daily | MODIS |
Mean BC+OC over fire grids in three WUS states within t days before hail | BCOC_m_COW_dt | daily | MERRA-2 |
Maximum BC+OC for all grids in three WUS states within t days before hail | BCOC_max_COW_dt | daily | MERRA-2 |
Mean BC+OC over fire grids in states within t days before hail | BCOC_m_s_dt | daily | MERRA-2 |
Maximum BC+OC for all grids in states within t days before hail | BCOC_max_s_dt | daily | MERRA-2 |
Mean RH at 850 hPa over three WUS states within t days before hail | RH850_m_dt | daily | MERRA-2 |
Maximum RH at 850 hPa over three WUS states within t days before hail | RH850_max_dt | daily | MERRA-2 |
Mean air temperature at 850 hPa over three WUS states within t days before hail | T_m_dt | daily | MERRA-2 |
Maximum air temperature at 850 hPa over three WUS states within t days before hail | T_max_dt | daily | MERRA-2 |
Mean U-wind at 850 hPa for grids along fire path within t days before hail | U850_m_dt | daily | MERRA-2 |
Maximum U-wind at 850 hPa for grids along fire path within t days before hail | U850_max_dt | daily | MERRA-2 |
Mean U-wind at 250 hPa for grids along fire path within t days before hail | U250_m_dt | daily | MERRA-2 |
Maximum U-wind at 250 hPa for grids along fire path within t days before hail | U250_max_dt | daily | MERRA-2 |
Notes: t ∈ [1, 2] for U-wind; t ∈ [2, 4] for other variables; s ∈ {CA, OR, WA}; fire transport region: 38°–44°N, 125°–112°W |