Meshless Surface Wind Speed Field Reconstruction Based on Machine Learning


doi: 10.1007/s00376-022-1343-8

  • We propose a novel machine learning approach to reconstruct meshless surface wind speed fields, i.e., to reconstruct the surface wind speed at any location, based on meteorological background fields and geographical information. The random forest method is selected to develop the machine learning data reconstruction model (MLDRM-RF) for wind speeds over Beijing from 2015–19. We use temporal, geospatial attribute and meteorological background field features as inputs. The wind speed field can be reconstructed at any station in the region not used in the training process to cross-validate model performance. The evaluation considers the spatial distribution of and seasonal variations in the root mean squared error (RMSE) of the reconstructed wind speed field across Beijing. The average RMSE is 1.09 m s−1, considerably smaller than the result (1.29 m s−1) obtained with inverse distance weighting (IDW) interpolation. Finally, we extract the important feature permutations by the method of mean decrease in impurity (MDI) and discuss the reasonableness of the model prediction results. MLDRM-RF is a reasonable approach with excellent potential for the improved reconstruction of historical surface wind speed fields with arbitrary grid resolutions. Such a model is needed in many wind applications, such as wind energy and aviation safety assessments.
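    Abstract (translated from the Chinese): We propose a new machine learning method for reconstructing meshless surface wind speed fields, i.e., reconstructing the surface wind speed at any location from meteorological background fields and geographical information. Taking the Beijing area as an example, we trained a random-forest-based wind speed reconstruction model using station data from 2015–19. The input features of the model include time, geographical information, and the local meteorological background field. We treated stations excluded from the training process as hypothetical prediction points, used cross-validation to extend the set of prediction points, and evaluated the model's performance. The root mean squared error of the model predictions is 1.09 m s−1, better than the result obtained with conventional inverse distance weighting interpolation (1.29 m s−1). In terms of seasonal variation, the model performs better in autumn and summer than in spring and winter. Finally, we used the mean decrease in impurity method to rank feature importance, assessed how strongly each selected feature influences the model, and discussed the reasonableness of the model predictions. The proposed machine-learning-based wind speed reconstruction model is a new, reasonable, and accurate model with potential for improvement and application. Such a model is needed in many wind-related applications, such as wind energy and aviation safety assessments.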
  • Figure 1.  Spatial distribution of observations (left) and elevation map in Beijing (right).

    Figure 2.  Flow chart of the wind speed reconstruction model (MLDRM-RF). Steps 1−4 are described in section 2.3.2. LOOCV stands for “leave-one-out cross-validation”, IDW for the “inverse distance weighting” interpolation method, and MDI for “mean decrease in impurity”.

    Figure 3.  Schematic diagram of sample dataset construction from observed data. Sample (i, j) refers to the record at time i at location j. Every sample consists of 17 variables (features) and one label, as shown in the figure. $ {M}^{i} $, $ {t}^{i} $, and $ {G}_{j} $ represent the three groups of variables: meteorological background, time, and geographic, respectively. Wind speed is the label of this model.

    Figure 4.  Error distribution for different tree counts and maximum depths of the decision trees. The boxes in panel (a) show the RMSE for different combinations of tree number and depth. In panel (b), for a given number of trees, we define the “minimum RMSE” as the value at which the RMSE stabilizes as the depth increases. In our tests, the stabilized RMSE is always the smallest RMSE for a given number of trees. The “maximum OOBS” is defined as the OOBS at the depth where the RMSE stabilizes. Panel (b) shows how the “minimum RMSE” and “maximum OOBS” change with the number of trees.

    Figure 5.  Wind speed prediction score map with different model parameters (number and depth of trees). The horizontal axis represents the number of trees, and the vertical axis represents the depth of trees.

    Figure 6.  Dataset partitioning for the different models, following the leave-one-out validation idea. For model 1, the testing set consists of the records from station 1, and the training set consists of the remaining samples. Model 2 uses the records from station 2 for testing, and its training set likewise consists of the records from the other 225 stations. The other models follow the same pattern.

    Figure 7.  RMSE (m s−1) spatial distribution (a) and error probability density distribution (b). Error probability density represents the density distribution of the RMSE for 226 models. The horizontal axis shows the magnitude of the error, and the vertical axis shows the density.

    Figure 8.  RMSE in different seasons and their spatial distributions, presented individually for (a) spring, (b) summer, (c) autumn, (d) winter.

    Figure 9.  Probability density distribution of RMSE (m s−1) in different seasons, presented individually for (a) spring, (b) summer, (c) autumn, (d) winter. The horizontal axis represents the size of the RMSE.

    Figure 10.  Spatial distribution of the error differences between MLDRM-RF and IDW. The colors represent the difference between the RMSEs (m s−1) of MLDRM-RF and IDW. Positive values indicate that MLDRM-RF performs better than IDW.

    Figure 11.  Probability density curves of the MLDRM-RF and IDW RMSE (m s−1). Blue shading represents the RMSE distribution for MLDRM-RF, and black shading represents the RMSE distribution for IDW.
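The leave-one-station-out evaluation (Fig. 6) and the IDW baseline (Figs. 10 and 11) can be illustrated with a small sketch. Everything specific here is an assumption made for illustration: the station coordinates and wind speeds are invented, distance is plain Euclidean distance in degrees, and the power parameter of 2 is a common default, since this section does not state the IDW settings used in the paper.

```python
import math

def idw_predict(target, neighbors, power=2.0):
    """Inverse distance weighting estimate at `target` = (lon, lat).

    `neighbors` is a list of ((lon, lat), value) pairs. The power of 2 is
    an illustrative assumption, not a setting taken from the paper.
    """
    num = 0.0
    den = 0.0
    for (lon, lat), value in neighbors:
        dist = math.hypot(lon - target[0], lat - target[1])
        if dist == 0.0:
            return value  # target coincides with a station
        weight = 1.0 / dist ** power
        num += weight * value
        den += weight
    return num / den

def rmse(predicted, observed):
    """Root mean squared error between two equal-length sequences."""
    squared = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared) / len(squared))

def leave_one_station_out_rmse(stations):
    """Hold out each station in turn and score IDW over its time series.

    `stations` maps a station id to ((lon, lat), [wind speed per hour]).
    Returns a dict of per-station RMSE values.
    """
    scores = {}
    for held_out, (coords, series) in stations.items():
        predictions = []
        for hour in range(len(series)):
            neighbors = [(c, s[hour])
                         for sid, (c, s) in stations.items()
                         if sid != held_out]
            predictions.append(idw_predict(coords, neighbors))
        scores[held_out] = rmse(predictions, series)
    return scores

# Hypothetical toy network: three stations, two hourly records each.
toy_stations = {
    "A": ((116.0, 40.0), [2.0, 3.0]),
    "B": ((116.2, 40.0), [4.0, 5.0]),
    "C": ((116.1, 40.1), [3.0, 4.0]),
}
per_station_rmse = leave_one_station_out_rmse(toy_stations)
mean_rmse = sum(per_station_rmse.values()) / len(per_station_rmse)
```

Averaging the per-station values gives a single network-wide score, which is one plausible way to arrive at a summary number such as the average RMSE quoted in the abstract.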

    Table 1.  The variables used in the models. All variables come from the station observation records and are divided into three groups: meteorological, time, and geographic variables. The average wind variables represent the average wind speed and direction over a 10-minute period at 10-m height. The wind components U and V are the zonal and meridional components, respectively. Because wind direction is cyclical, the wind directions are not input into the models. Instead, we input four components ($ {\mathrm{A}\mathrm{W}}_{u},{\mathrm{A}\mathrm{W}}_{v},{\mathrm{E}\mathrm{W}}_{u},{\mathrm{E}\mathrm{W}}_{v} $) to introduce the influence of wind direction.

    | Class | Variable name | Abbr. | Range |
    |---|---|---|---|
    | Meteorological variables (M) | Surface temperature | T | Jan. 01, 2015−Aug. 31, 2019 (hourly) |
    | | Relative humidity | RH | |
    | | Precipitation (in 1 hour) | Pn | |
    | | Pressure | Pe | |
    | | Average wind speed | AW | |
    | | Extreme wind speed | EW | |
    | | Average wind speed U component | $ {\mathrm{A}\mathrm{W}}_{u} $ | |
    | | Average wind speed V component | $ {\mathrm{A}\mathrm{W}}_{v} $ | |
    | | Extreme wind speed U component | $ {\mathrm{E}\mathrm{W}}_{u} $ | |
    | | Extreme wind speed V component | $ {\mathrm{E}\mathrm{W}}_{v} $ | |
    | Time variables (t) | Year | Y | 2015−19 |
    | | Month | M | 1−12 |
    | | Day | D | 1−31 |
    | | Hour | H | 0−23 |
    | Geographic variables (G) | Longitude | lon | 226 stations |
    | | Latitude | lat | |
    | | Altitude | alt | |
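Table 1 explains that wind direction is cyclical and is therefore replaced by U and V components. A minimal sketch of that decomposition follows. It assumes the usual meteorological convention (direction is where the wind blows from, in degrees clockwise from north); this section does not state which convention the authors used, and the function name is hypothetical.

```python
import math

def wind_components(speed, direction_deg):
    """Split a wind observation into zonal (u) and meridional (v) parts.

    Assumes the meteorological convention: `direction_deg` is the direction
    the wind blows FROM, in degrees clockwise from north.
    """
    theta = math.radians(direction_deg)
    u = -speed * math.sin(theta)  # positive u: wind blowing toward the east
    v = -speed * math.cos(theta)  # positive v: wind blowing toward the north
    return u, v

# Directions of 359 and 1 degrees describe nearly identical winds. Their raw
# direction values differ by 358, but their components are close.
u1, v1 = wind_components(5.0, 359.0)
u2, v2 = wind_components(5.0, 1.0)
```

Feeding the components rather than the raw angle avoids the artificial jump between 359 and 0 degrees, which a tree-based model would otherwise treat as two distant values.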

    Table 2.  Wind speed prediction score. We adopt this evaluation index from the study of Yu et al. (2020). The left column of the table gives the range of the observed wind speed (m s−1), and the top row gives the range of the predicted wind speed (m s−1). Every predicted value can be assigned a score based on this table.

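    | Observed (m s−1) \ Prediction (m s−1) | 0.0−0.2 | 0.3−1.5 | 1.6−3.3 | 3.4−5.4 | 5.5−7.9 | 8.0−10.7 | 10.8−13.8 | 13.9−17.1 | 17.2−20.7 | 20.8−24.4 | 24.5−28.4 | 28.5−32.6 | 32.7−36.9 | ≥37.0 |
    |---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
    | 0.0−0.2 | 1 | 0.6 | 0.4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
    | 0.3−1.5 | 0.6 | 1 | 0.6 | 0.4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
    | 1.6−3.3 | 0.4 | 0.6 | 1 | 0.6 | 0.4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
    | 3.4−5.4 | 0 | 0.4 | 0.6 | 1 | 0.6 | 0.4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
    | 5.5−7.9 | 0 | 0 | 0.4 | 0.6 | 1 | 0.6 | 0.4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
    | 8.0−10.7 | 0 | 0 | 0 | 0.4 | 0.6 | 1 | 0.6 | 0.4 | 0 | 0 | 0 | 0 | 0 | 0 |
    | 10.8−13.8 | 0 | 0 | 0 | 0 | 0.4 | 0.6 | 1 | 0.6 | 0.4 | 0 | 0 | 0 | 0 | 0 |
    | 13.9−17.1 | 0 | 0 | 0 | 0 | 0 | 0.4 | 0.6 | 1 | 0.6 | 0.4 | 0 | 0 | 0 | 0 |
    | 17.2−20.7 | 0 | 0 | 0 | 0 | 0 | 0 | 0.4 | 0.6 | 1 | 0.6 | 0.4 | 0 | 0 | 0 |
    | 20.8−24.4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.4 | 0.6 | 1 | 0.6 | 0.4 | 0 | 0 |
    | 24.5−28.4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.4 | 0.6 | 1 | 0.6 | 0.4 | 0 |
    | 28.5−32.6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.4 | 0.6 | 1 | 0.6 | 0.4 |
    | 32.7−36.9 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.4 | 0.6 | 1 | 0.6 |
    | ≥37.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.4 | 0.6 | 1 |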
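Read row by row, the scores in Table 2 appear to depend only on how many wind speed classes separate the observation from the prediction: 1 for the same class, 0.6 for adjacent classes, 0.4 for classes two apart, and 0 otherwise. The lookup below is a sketch of that reading. The function names are hypothetical, and rounding to one decimal before classifying is an assumption, because the table lists class limits to one decimal and this section does not say how intermediate values such as 0.25 m/s are treated.

```python
from bisect import bisect_right

# Lower bounds (m/s) of the 14 wind speed classes listed in Table 2.
LOWER_BOUNDS = [0.0, 0.3, 1.6, 3.4, 5.5, 8.0, 10.8, 13.9,
                17.2, 20.8, 24.5, 28.5, 32.7, 37.0]

# Score as a function of the number of classes between observation and
# prediction; any larger distance scores 0.
SCORE_BY_DISTANCE = {0: 1.0, 1: 0.6, 2: 0.4}

def speed_class(speed):
    """Return the index (0 to 13) of the Table 2 class containing `speed`."""
    value = round(speed, 1)  # assumption: classify after rounding to 0.1 m/s
    if value < 0.0:
        raise ValueError("wind speed must be non-negative")
    return bisect_right(LOWER_BOUNDS, value) - 1

def prediction_score(observed, predicted):
    """Score one prediction against one observation, following Table 2."""
    distance = abs(speed_class(observed) - speed_class(predicted))
    return SCORE_BY_DISTANCE.get(distance, 0.0)

# Example: observed 2.0 m/s and predicted 4.0 m/s fall in adjacent classes.
example_score = prediction_score(2.0, 4.0)
```

Because the table is symmetric, swapping the observed and predicted values does not change the score.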

    Table 3.  Importance ranking of the features used in the RF model. Importance is shown as a percentage. The wind speed components U and V are the zonal and meridional components, respectively.

  • Alizadeh, M. J., M. R. Kavianpour, B. Kamranzad, and A. Etemad-Shahidi, 2019: A Weibull distribution based technique for downscaling of climatic wind field. Asia-Pacific Journal of Atmospheric Sciences, 55, 685−700, https://doi.org/10.1007/s13143-019-00106-z.
    Bernier, N. B., S. Bélair, B. Bilodeau, and L. Y. Tong, 2014: Assimilation and high resolution forecasts of surface and near surface conditions for the 2010 Vancouver Winter Olympic and Paralympic Games. Pure Appl. Geophys., 171, 243−256, https://doi.org/10.1007/s00024-012-0542-0.
    Bosch, J., I. Staffell, and A. D. Hawkes, 2017: Temporally-explicit and spatially-resolved global onshore wind energy potentials. Energy, 131, 207−217, https://doi.org/10.1016/j.energy.2017.05.052.
    Breiman, L., 2001: Random forests. Machine Learning, 45, 5−32, https://doi.org/10.1023/A:1010933404324.
    Franco, B. M., L. Hernández-Callejo, and L. M. Navas-Gracia, 2020: Virtual weather stations for meteorological data estimations. Neural Computing and Applications, 32, 12 801−12 812.
    Gielen, D., F. Boshell, D. Saygin, M. D. Bazilian, N. Wagner, and R. Gorini, 2019: The role of renewable energy in the global energy transformation. Energy Strategy Reviews, 24, 38−50, https://doi.org/10.1016/j.esr.2019.01.006.
    Hengl, T., and Coauthors, 2017: SoilGrids250m: Global gridded soil information based on machine learning. PLoS One, 12, e0169748, https://doi.org/10.1371/journal.pone.0169748.
    Hou, Y. K., Y. F. He, H. Chen, C. Y. Xu, J. Chen, J. S. Kim, and S. L. Guo, 2019: Comparison of multiple downscaling techniques for climate change projections given the different climatic zones in China. Theor. Appl. Climatol., 138, 27−45, https://doi.org/10.1007/s00704-019-02794-z.
    Isaac, G. A., and Coauthors, 2014: Science of nowcasting Olympic weather for Vancouver 2010 (SNOW-V10): A World Weather Research Programme project. Pure Appl. Geophys., 171, 1−24, https://doi.org/10.1007/s00024-012-0579-0.
    Jing, W. L., P. Y. Zhang, H. Jiang, and X. D. Zhao, 2017: Reconstructing satellite-based monthly precipitation over northeast China using machine learning algorithms. Remote Sensing, 9, 781, https://doi.org/10.3390/rs9080781.
    Joe, P., and Coauthors, 2010: Weather services, science advances, and the Vancouver 2010 Olympic and Paralympic Winter Games. Bull. Amer. Meteor. Soc., 91, 31−36, https://doi.org/10.1175/2009BAMS2998.1.
    Kadow, C., D. M. Hall, and U. Ulbrich, 2020: Artificial intelligence reconstructs missing climate information. Nature Geoscience, 13, 408−413, https://doi.org/10.1038/s41561-020-0582-5.
    Karpatne, A., and S. Liess, 2015: A guide to earth science data: Summary and research challenges. Computing in Science & Engineering, 17, 14−18, https://doi.org/10.1109/MCSE.2015.127.
    Karpatne, A., I. Ebert-Uphoff, S. Ravela, H. A. Babaie, and V. Kumar, 2019: Machine learning for the geosciences: Challenges and opportunities. IEEE Transactions on Knowledge and Data Engineering, 31, 1544−1554, https://doi.org/10.1109/TKDE.2018.2861006.
    Keck, R. E., and N. Sondell, 2020: Validation of uncertainty reduction by using multiple transfer locations for WRF-CFD coupling in numerical wind energy assessments. Wind Energy Science, 5, 997−1005, https://doi.org/10.5194/wes-5-997-2020.
    Krasnopolsky, V. M., and M. S. Fox-Rabinovitz, 2006: Complex hybrid models combining deterministic and machine learning components for numerical climate modeling and weather prediction. Neural Networks, 19, 122−134, https://doi.org/10.1016/j.neunet.2006.01.002.
    Leinonen, J., A. Guillaume, and T. L. Yuan, 2019: Reconstruction of cloud vertical structure with a generative adversarial network. Geophys. Res. Lett., 46, 7035−7044, https://doi.org/10.1029/2019GL082532.
    Li, J., and A. D. Heap, 2011: A review of comparative studies of spatial interpolation methods in environmental sciences: Performance and impact factors. Ecological Informatics, 6, 228−241, https://doi.org/10.1016/j.ecoinf.2010.12.003.
    Liu, J. K., Z. Q. Gao, L. L. Wang, Y. B. Li, and C. Y. Gao, 2018a: The impact of urbanization on wind speed and surface aerodynamic characteristics in Beijing during 1991−2011. Meteorol. Atmos. Phys., 130, 311−324, https://doi.org/10.1007/s00703-017-0519-8.
    Liu, Y. C., D. Y. Chen, S. W. Li, and P. W. Chan, 2018b: Discerning the spatial variations in offshore wind resources along the coast of China via dynamic downscaling. Energy, 160, 582−596, https://doi.org/10.1016/j.energy.2018.06.205.
    Liu, Y. H., J. M. Feng, Z. L. Yang, Y. H. Hu, and J. L. Li, 2019: Gridded statistical downscaling based on interpolation of parameters and predictor locations for summer daily precipitation in North China. J. Appl. Meteorol. Climatol., 58, 2295−2311, https://doi.org/10.1175/JAMC-D-18-0231.1.
    Louppe, G. J., 2014: Understanding random forests: From theory to practice. arXiv: 1407.7502.
    Miao, Y. C., J. P. Guo, S. H. Liu, H. Liu, Z. Q. Li, W. C. Zhang, and P. M. Zhai, 2017: Classification of summertime synoptic patterns in Beijing and their associations with boundary layer structure affecting aerosol pollution. Atmospheric Chemistry and Physics, 17, 3097−3110, https://doi.org/10.5194/acp-17-3097-2017.
    Nechaj, P., L. Gaál, J. Bartok, O. Vorobyeva, M. Gera, M. Kelemen, and V. Polishchuk, 2019: Monitoring of low-level wind shear by ground-based 3D lidar for increased flight safety, protection of human lives and health. International Journal of Environmental Research and Public Health, 16, 4584, https://doi.org/10.3390/ijerph16224584.
    Nikulin, G., and Coauthors, 2018: Dynamical and statistical downscaling of a global seasonal hindcast in eastern Africa. Climate Services, 9, 72−85, https://doi.org/10.1016/j.cliser.2017.11.003.
    Pirhalla, M., D. Heist, S. Perry, S. Hanna, T. Mazzola, S. P. Arya, and V. Aneja, 2020: Urban wind field analysis from the Jack Rabbit II special sonic anemometer study. Atmos. Environ., 243, 117871, https://doi.org/10.1016/j.atmosenv.2020.117871.
    Prasanna, V., H. W. Choi, J. Jung, Y. G. Lee, and B. J. Kim, 2018: High-resolution wind simulation over Incheon International Airport with the Unified Model's Rose nesting suite from KMA operational forecasts. Asia-Pacific Journal of Atmospheric Sciences, 54, 187−203, https://doi.org/10.1007/s13143-018-0003-5.
    Reichstein, M., G. Camps-Valls, B. Stevens, M. Jung, J. Denzler, N. Carvalhais, and Prabhat, 2019: Deep learning and process understanding for data-driven Earth system science. Nature, 566, 195−204, https://doi.org/10.1038/s41586-019-0912-1.
    Rodrigues, E. R., I. Oliveira, R. Cunha, and M. Netto, 2018: DeepDownscale: A deep learning strategy for high-resolution weather forecast. 2018 IEEE 14th International Conference on E-Science, Amsterdam, IEEE, 415−422.
    Rose, S., and J. Apt, 2015: What can reanalysis data tell us about wind power. Renewable Energy, 83, 963−969, https://doi.org/10.1016/j.renene.2015.05.027.
    Rose, S., and J. Apt, 2016: Quantifying sources of uncertainty in reanalysis derived wind speed. Renewable Energy, 94, 157−165, https://doi.org/10.1016/j.renene.2016.03.028.
    Salvação, N., and C. G. Soares, 2018: Wind resource assessment offshore the Atlantic Iberian coast with the WRF model. Energy, 145, 276−287, https://doi.org/10.1016/j.energy.2017.12.101.
    Seiler, C., F. W. Zwiers, K. I. Hodges, and J. F. Scinocca, 2018: How does dynamical downscaling affect model biases and future projections of explosive extratropical cyclones along North America's Atlantic coast. Climate Dyn., 50, 677−692, https://doi.org/10.1007/s00382-017-3634-9.
    Szewc, K., B. Graca, and A. Dołęga, 2021: Atmospheric deposition of microplastics in the coastal zone: Characteristics and relationship with meteorological factors. Science of the Total Environment, 761, 143272, https://doi.org/10.1016/j.scitotenv.2020.143272.
    Torralba, V., F. J. Doblas-Reyes, and N. Gonzalez-Reviriego, 2017: Uncertainty in recent near-surface wind speed trends: A global reanalysis intercomparison. Environmental Research Letters, 12, 114019, https://doi.org/10.1088/1748-9326/aa8a58.
    Wang, G. S., X. D. Wang, H. Wang, M. Hou, Y. Li, W. J. Fan, and Y. L. Liu, 2020: Evaluation on monthly sea surface wind speed of four reanalysis data sets over the China seas after 1988. Acta Oceanologica Sinica, 39, 83−90, https://doi.org/10.1007/s13131-019-1525-0.
    Wei, G., C. H. Peng, Q. A. Zhu, X. L. Zhou, and B. Yang, 2021: Application of machine learning methods for paleoclimatic reconstructions from leaf traits. International Journal of Climatology, 41, E3249−E3262, https://doi.org/10.1002/joc.6921.
    Willison, J., W. A. Robinson, and G. M. Lackmann, 2015: North Atlantic storm-track sensitivity to warming increases with model resolution. J. Climate, 28, 4513−4524, https://doi.org/10.1175/JCLI-D-14-00715.1.
    Yan, Z. W., S. Bate, R. E. Chandler, V. Isham, and H. Wheater, 2002: An analysis of daily maximum wind speed in northwestern Europe using generalized linear models. J. Climate, 15, 2073−2088, https://doi.org/10.1175/1520-0442(2002)015<2073:AAODMW>2.0.CO;2.
    Yang, P., G. Y. Ren, P. C. Yan, and J. M. Deng, 2020: Tempospatial pattern of surface wind speed and the "urban stilling island" in Beijing city. J. Meteor. Res., 34, 986−996, https://doi.org/10.1007/s13351-020-9135-5.
    Yu, C., H. C. Li, J. J. Xia, H. Q. Z. Wen, and P. W. Zhang, 2020: A data-driven random subfeature ensemble learning algorithm for weather forecasting. Communications in Computational Physics, 28, 1305−1320, https://doi.org/10.4208/cicp.OA-2020-0006.
    Yu, J., T. J. Zhou, Z. H. Jiang, and L. W. Zou, 2019: Evaluation of near-surface wind speed changes during 1979 to 2011 over China based on five reanalysis datasets. Atmosphere, 10, 804, https://doi.org/10.3390/atmos10120804.
    Zhai, S. X., and Coauthors, 2019: Fine particulate matter (PM2.5) trends in China, 2013−2018: Separating contributions from anthropogenic emissions and meteorology. Atmospheric Chemistry and Physics, 19, 11 031−11 041.
    Zhang, D., L. Y. Chen, F. M. Zhang, J. Tan, and C. H. Wang, 2020: Numerical simulation of near-surface wind during a severe wind event in a complex terrain by multisource data assimilation and dynamic downscaling. Advances in Meteorology, 2020, 7910532, https://doi.org/10.1155/2020/7910532.
    Zhang, L., Z. Q. Zhang, C. Y. Feng, M. R. Tian, and Y. N. Gao, 2021a: Impact of various vegetation configurations on traffic fine particle pollutants in a street canyon for different wind regimes. Science of the Total Environment, 789, 147960, https://doi.org/10.1016/j.scitotenv.2021.147960.
    Zhang, L. Q., and Coauthors, 2021b: Reconstruction of ESA CCI satellite-derived soil moisture using an artificial neural network technology. Science of the Total Environment, 782, 146602, https://doi.org/10.1016/j.scitotenv.2021.146602.

Manuscript History

Manuscript received: 06 September 2021
Manuscript revised: 15 December 2021
Manuscript accepted: 15 February 2022
    Corresponding author: Jiangjiang XIA, xiajj@tea.ac.cn
  • 1. Key Laboratory of Regional Climate-Environment for Temperate East Asia (RCE-TEA), Institute of Atmospheric Physics, Chinese Academy of Sciences, Beijing 100029, China
  • 2. University of Chinese Academy of Sciences, Chinese Academy of Sciences, Beijing 100049, China
  • 3. Center for Artificial Intelligence in Atmospheric Science, Institute of Atmospheric Physics, Chinese Academy of Sciences, Beijing 100029, China
  • 4. Qi Zhi Institute, Shanghai 200232, China
  • 5. Beijing Meteorological Service Center, BMSC, Beijing 100089, China
  • 6. School of Mathematical Sciences, Peking University, Beijing 100871, China
  • 7. School of Science, Beijing University of Posts and Telecommunications, Beijing 100876, China
  • 8. Lab of Meteorological Big Data, Beijing 100086, China

    1.   Introduction
    • Wind speed is one of the fundamental variables in the basic atmospheric equations. Surface wind speeds at a hyperfine resolution are needed in many applications. For instance, wind energy plays a role in the global energy transition with regard to the mitigation of global warming (Bosch et al., 2017), and a continuous wind speed field is essential for evaluating the wind power capacity in different areas (Gielen et al., 2019). Street-scale wind fields, which vary with local building density and height, are important for the diffusion and deposition of pollutants (Miao et al., 2017; Zhai et al., 2019; Pirhalla et al., 2020; Szewc et al., 2021; Zhang et al., 2021a). A high-resolution wind field over airport runways is required when assessing the risk of takeoff and landing, and super-resolution wind field data are useful for planning airport construction (Prasanna et al., 2018; Nechaj et al., 2019). Moreover, many outdoor events at the Winter Olympic Games are restricted by wind speed, so predicting wind speeds at very fine spatial and temporal scales is imperative (Joe et al., 2010; Bernier et al., 2014; Isaac et al., 2014). In such applications, continuous wind speed fields with a resolution of hundreds or even tens of meters are required. In short, historical super-resolution surface wind fields are useful in many applications. The problem is how to obtain such fields quickly from limited information.

      Observations and numerical simulations (including reanalysis) are common data sources for wind field construction. Observation sites provide discrete reference records. However, station networks are usually sparse and inhomogeneous, as it is generally expensive to establish a dense observation network. Reanalysis methods provide global gridded data, but the resolution is often insufficient for local application scenarios. Most reanalysis datasets are characterized by large biases and uncertainties in describing local wind climatology and climate trends (Rose and Apt, 2015, 2016; Torralba et al., 2017; Yu et al., 2019; Wang et al., 2020).

      Conventional methods used to obtain high-resolution meteorological fields include interpolation and downscaling. Yan et al. (2002) modeled a continuous wind field in association with large-scale climate factors based on a generalized linear method, but such statistical techniques are difficult to apply when reconstructing local-scale winds. Downscaling can be divided into two broad classes: dynamic downscaling (DD) and statistical downscaling (SD). In DD, local-scale climate patterns are estimated via a high-resolution mesoscale dynamic model or regional climate model (RCM) coupled with a global coupled model (GCM), with boundary conditions determined from the GCM output (Salvação and Soares, 2018; Zhang et al., 2020). The mesoscale Weather Research and Forecasting (WRF) model and computational fluid dynamics (CFD) models are often applied for local wind applications (Liu et al., 2018b; Salvação and Soares, 2018; Keck and Sondell, 2020). DD provides continuous gridded results consistent with physical principles, but some inevitable problems remain (Willison et al., 2015; Liu et al., 2018b; Zhang et al., 2020). First, it is quite challenging for numerical models to capture the detailed dynamic structures of near-surface wind patterns dominated by local microtopography, so near-surface wind simulations are often of limited accuracy. Second, long-term DD simulations at high resolutions require vast computational resources, and for current mesoscale models, such as the WRF model, the computing cost grows as the resolution becomes finer. Third, regional dynamical simulations are sensitive to boundary conditions, physical parameterizations, and systematic model error. Such errors can quickly accumulate and reach an unacceptable level.

      SD provides more local information via the statistical relationships among local variables (usually observations) and large-scale variables (usually simulated by GCMs) (Liu et al., 2019). SD, as well as interpolation methods, produces fast and accessible results and requires far less computational time than DD, but there are also some limitations (Nikulin et al., 2018; Seiler et al., 2018; Alizadeh et al., 2019; Hou et al., 2019). First, SD relies on not only the accuracy of observations but also the validity of dynamical simulations, especially the relationships among the large-scale and local variables used. Second, SD, as well as many statistical interpolation methods, is usually based on a steady empirical relationship (function), which is characterized by a priori error for local variables. Moreover, SD may not fully consider the temporal physical interactions among variables at different scales, which can lead to spatial and temporal discontinuities in high-resolution outputs.

      Machine learning (ML), which is popular due to its data-driven nature, has increasingly been applied in geoscience for data reconstruction (Reichstein et al., 2019). ML algorithms can address many challenges encountered in geoscience problems, such as those in remote sensing and model simulations (Karpatne and Liess, 2015; Rodrigues et al., 2018; Karpatne et al., 2019; Reichstein et al., 2019). ML algorithms have displayed high accuracy in various applications and can balance computational cost and run time objectives (Krasnopolsky and Fox-Rabinovitz, 2006). For instance, Jing et al. (2017) reconstructed precipitation data for the regions not covered by the Tropical Rainfall Measuring Mission 3B43 (TRMM) precipitation dataset by using a random forest model. Kadow et al. (2020) used a deep learning model to reconstruct historical sea surface temperatures from 1870 to 2005 based on two distinct datasets. Machine learning models have also been used in other fields to reconstruct data, such as soil (Hengl et al., 2017; Zhang et al., 2021b), cloud structure (Leinonen et al., 2019), and paleoclimate (Wei et al., 2021) data, among others. The data-driven nature of ML methods makes them unique, but ML models are often not easily explainable. Therefore, it is beneficial to clearly interpret and improve ML models by integrating a priori knowledge (Reichstein et al., 2019).

      In this study, we propose a new approach to reconstruct the meshless field of hourly wind speed in Beijing. The model can fit the wind field to the resolution of the available geographical factors in the region. We select a random forest to build machine learning models. We introduce and preprocess model inputs and parameters and then evaluate the models by comparison with conventional methods in the following sections. A physical explanation of model performance is given based on the importance of the features involved. Finally, a summary of the conclusions is presented.

    2.   Data and method
    • ML models require as many station observations as possible for training and testing. We collected data from 226 stations in Beijing (including urban and suburban areas) (Fig. 1). The stations are distributed fairly uniformly, so the station datasets are representative of the region. In addition, the Beijing surface wind speed field is influenced by geography and large-scale climate change (Liu et al., 2018a; Yang et al., 2020). Beijing is a highly developed city with a large population density; these factors may increase the uncertainty of local wind speed downscaling. Therefore, Beijing is a good location for the present data reconstruction study.

      Figure 1.  Spatial distribution of observations (left) and elevation map in Beijing (right).

    • The hourly observations used in this study were collected from 226 stations in Beijing from 2015−19. The dataset includes 17 variables, as shown in Table 1. We divided the 17 variables into three classes: meteorological variables (10), time variables (4) and geographical variables (3). We used these data to construct over 9 300 000 hourly samples as ML model inputs. In this study, we decompose wind speed into meridional and zonal wind speeds. This approach helps identify the effects of different predictive features from the perspective of atmospheric dynamics.

      | Class | Variable name | Abbr. | Range |
      |---|---|---|---|
      | Meteorological variables (M) | Surface temperature | T | Jan. 01, 2015−Aug. 31, 2019 (hourly) |
      | | Relative humidity | RH | |
      | | Precipitation (in 1 hour) | Pn | |
      | | Pressure | Pe | |
      | | Average wind speed | AW | |
      | | Extreme wind speed | EW | |
      | | Average wind speed U component | $ {\mathrm{A}\mathrm{W}}_{u} $ | |
      | | Average wind speed V component | $ {\mathrm{A}\mathrm{W}}_{v} $ | |
      | | Extreme wind speed U component | $ {\mathrm{E}\mathrm{W}}_{u} $ | |
      | | Extreme wind speed V component | $ {\mathrm{E}\mathrm{W}}_{v} $ | |
      | Time variables (t) | Year | Y | 2015−19 |
      | | Month | M | 1−12 |
      | | Day | D | 1−31 |
      | | Hour | H | 0−23 |
      | Geographic variables (G) | Longitude | lon | 226 stations |
      | | Latitude | lat | |
      | | Altitude | alt | |

      Table 1.  The variables used in the models. All variables come from the station records and are divided into three classes: meteorological, time, and geographic variables. Average wind variables represent the average wind speed and direction in a 10-minute period at 10 m height. Wind components U and V represent the zonal and meridional components, respectively. Given the cyclical nature of the wind direction, the wind directions are not input into the models. Instead, we input four components ($ {\mathrm{A}\mathrm{W}}_{u},{\mathrm{A}\mathrm{W}}_{v},{\mathrm{E}\mathrm{W}}_{u},{\mathrm{E}\mathrm{W}}_{v} $) into the model to introduce the influence of wind direction.

      In Table 1, AW represents the average wind speed at 10 m height over a 10-minute period, EW refers to the maximum instantaneous wind speed at 10 m height within a one-hour period, and the instantaneous wind speed refers to the 3-second average wind speed. Variables $ {\mathrm{A}\mathrm{W}}_{u} $ and $ {\mathrm{A}\mathrm{W}}_{v} $ are calculated from AW and its direction and represent the decomposed wind speeds in the zonal and meridional directions, respectively. We obtain $ {\mathrm{E}\mathrm{W}}_{u} $ and $ {\mathrm{E}\mathrm{W}}_{v} $ in the same way. With this decomposition, we do not need to add wind direction features, because the zonal and meridional wind speeds already describe the wind direction. Such decomposed variables introduce the structure of the meridional and zonal circulation in the upper atmosphere into the lower layer, which is more consistent with the physical knowledge of meteorology. We remove the incomplete or missing wind speed records, resulting in over 9.3 million hourly samples of wind speed used in the study. The time variables include the year (ranging from 2015 to 2019), month (ranging from 1 to 12), day (ranging from 1 to 31) and hour (ranging from 0 to 23). Surface wind speed is extremely sensitive to topographic features, and adding comprehensive geographical features could improve the predictive performance of the model. Here, we select only basic geographical factors, including longitude, latitude and altitude, to investigate the lower limit of modeling performance.
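      The decomposition of a wind speed and direction into zonal and meridional components described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `wind_components` is hypothetical, and the sketch assumes the common meteorological convention in which the direction is the bearing the wind blows from, measured clockwise from north. The text does not state which convention was used.

```python
import math


def wind_components(speed, direction_deg):
    """Split a wind speed into zonal (u) and meridional (v) components.

    Assumption: direction_deg is the direction the wind blows FROM,
    measured clockwise from north (meteorological convention).
    """
    theta = math.radians(direction_deg)
    u = -speed * math.sin(theta)  # positive toward the east
    v = -speed * math.cos(theta)  # positive toward the north
    return u, v


# A 5 m/s wind from the west (270 degrees) is a purely eastward flow.
u, v = wind_components(5.0, 270.0)
```

      Because u and v vary continuously with direction, they avoid the artificial jump between 359° and 0° that a raw direction feature would carry.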

    • The meshless data reconstruction (MDR) process here refers to a method that uses discrete station data to predict the wind speed at any location in the study region as long as the basic geographic information (latitude, longitude and altitude) is available for this location. Such a model allows us to convert the wind speed field from the station distribution to gridded data of any resolution for which geographic information is available. The present model can reconstruct the 10-minute-mean wind speed at 10 m height at any location and any historical time.

      We transform the data reconstruction problem into a regression problem and then solve the problem with a machine learning algorithm. Specifically, we train a machine learning data reconstruction model (MLDRM) to learn the map between the considered features (predictive factors) and the wind speed (label or target) at any given place and time. The features include the meteorological background ($ \boldsymbol{M} $), geographic information ($ \boldsymbol{G} $) and time variables ($ \boldsymbol{t} $). In this study, we treat one observed station as a supposed forecast point and reconstruct the data at this station with a model trained on the other stations. In this way, forecast points have true values against which to evaluate model performance, and the predicted stations remain independent of the model.

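      For a prediction point (station j) at time (i), the reconstructed wind speed $ {\mathrm{W}\mathrm{S}}_{j}^{i} $ is obtained as

      $$ {\mathrm{W}\mathrm{S}}_{j}^{i}=\mathrm{M}\mathrm{L}\mathrm{D}\mathrm{R}\mathrm{M}\left({\boldsymbol{t}}^{i},{\boldsymbol{M}}^{i},{\boldsymbol{G}}_{j}\right), $$

      where $ {\boldsymbol{t}}^{i} $ represents time variables at time (i), $ {\boldsymbol{M}}^{i} $ represents 10 meteorological background variables at time (i), and $ {\boldsymbol{G}}_{j} $ represents the geographic variables for station (j).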

      A random forest (RF) is a supervised ensemble learning algorithm with better interpretability and fewer parameters than other machine learning methods (Jing et al., 2017). An RF can represent nonlinear relationships and outperform many conventional models based on fitting performance (Hengl et al., 2017). Additionally, the form of the objective function does not need to be preset, and complex interactions among features can be considered. Furthermore, an RF model can quantify the impact of features, thus aiding in assessing and improving model performance. Therefore, we apply an RF to build the MLDRM-RF model.

      In this study, we choose the root mean squared error (RMSE) as a measure of model performance, and it is defined as

      $$ \mathrm{R}\mathrm{M}\mathrm{S}\mathrm{E}=\sqrt{\frac{1}{n}\sum _{i=1}^{n}{\left({P}_{i}-{O}_{i}\right)}^{2}}, $$

      where $ {O}_{i} $ is the observed wind speed data, $ {P}_{i} $ is the predicted wind speed data, and n represents the number of samples in the test set. A small RMSE indicates good predictive capability.
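      As a small illustration of this metric, the sketch below computes the RMSE between observed and predicted wind speeds with the Python standard library. The function name `rmse` and the sample values are illustrative only.

```python
import math


def rmse(observed, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(observed) != len(predicted) or len(observed) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(p - o) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Illustrative values in m/s, not taken from the study.
example_error = rmse([2.0, 3.5, 1.0], [2.5, 3.0, 1.5])
```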

    • The MLDRM-RF model can predict the wind speed at any location in the study region with meteorological background and local geographic information; to evaluate its performance, we used one of the stations as a supposed target location. All records from this station formed a testing set, and the other station records formed a training set. By repeating this modeling process for all stations, we maximized the utilization of data while guaranteeing the independence of the testing set from the training set. Based on such cross-validations, we could obtain a general assessment of the predictive ability of the model.

      As shown in Fig. 2, there are four steps in building and evaluating the model of regional wind speed field reconstruction.

      Figure 2.  Flow chart of the wind speed reconstruction model (MLDRM-RF). Steps 1−4 are described in section 2.3.2. LOOCV represents “leave-one-out cross-validation”. IDW represents the “inverse distance weighting” interpolation method. MDI represents “mean decrease in impurity”.

    • In this step, we processed the data into samples that could be input into the MLDRM-RF. Each sample was established based on the corresponding time and station (Fig. 3). At a certain time (i), we averaged the records from all stations at time (i) for each of the 10 meteorological variables (Table 1) to obtain the hourly regional climate background fields ($ {\boldsymbol{M}}^{i} $). Moreover, we introduced four time scales (year, month, day, and hour) as time features (${\boldsymbol{t}}^{i}$). For station (j), we added 3 geographic variables ($ {\boldsymbol{G}}_{j} $) as features: longitude, latitude, and altitude. Finally, every sample was composed of 17 features and 1 label (wind speed). The model dataset ($ \mathbf{S}\mathbf{T} $) spanned 226 stations and 40 896 hours, with over 9 200 000 samples.

      Figure 3.  Schematic diagram of the construction of sample datasets from observed data. Sample (i, j) refers to the record at time i at location j. Every sample consists of 17 variables (features) and one label, as shown in the figure. $ {M}^{i} $, $ {t}^{i} $ and $ {G}_{j} $ represent the meteorological background, time, and geographic parts of these variables, respectively. Wind speed is the label of this model.
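      The sample construction described in this step can be sketched as follows. The data layout (dictionaries keyed by time and station) and the names `build_samples`, `records` and `stations` are assumptions made for illustration, not the authors' implementation. The real dataset uses 10 meteorological variables per record, whereas the toy example uses 2 to stay short.

```python
from statistics import mean


def build_samples(records, stations):
    """Assemble (station, features, label) samples.

    records:  {(time, station_id): {"met": [...], "wind_speed": float}}
              where time is a (year, month, day, hour) tuple.
    stations: {station_id: (lon, lat, alt)}
    """
    # Regional background M^i: average each meteorological variable
    # over all stations reporting at the same time.
    by_time = {}
    for (time, _station), rec in records.items():
        by_time.setdefault(time, []).append(rec["met"])
    background = {
        time: [mean(column) for column in zip(*vectors)]
        for time, vectors in by_time.items()
    }

    samples = []
    for (time, station), rec in records.items():
        # Feature vector = time features + background + geography.
        features = list(time) + background[time] + list(stations[station])
        samples.append((station, features, rec["wind_speed"]))
    return samples


# Toy data: one hour, two stations, two meteorological variables.
t0 = (2015, 1, 1, 0)
records = {
    (t0, "s1"): {"met": [1.0, 2.0], "wind_speed": 1.5},
    (t0, "s2"): {"met": [3.0, 4.0], "wind_speed": 2.5},
}
stations = {"s1": (116.3, 39.9, 50.0), "s2": (116.6, 40.2, 300.0)}
samples = build_samples(records, stations)
```

      All samples that share a time step share the same background vector, so within one hour only the geographic features distinguish one station from another.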

    • Model parameters have a considerable influence on model performance. Here, we focus on optimizing two parameters, the maximum tree depth and the number of regression trees, which are two commonly used hyperparameters in RF modeling (Breiman, 2001) and have the greatest impact on model performance. To obtain the best parameters for $ \mathbf{S}\mathbf{T} $, we divided $ \mathbf{S}\mathbf{T} $ randomly into a training set and a validation set at an 80 to 20 percent ratio. We then trained the model on the training set repeatedly, varying the number of decision trees from 5 to 100 and the maximum depth from 10 to 100, to find the parameters that optimize performance. We used the RMSE and out-of-bag score (OOBS) as performance metrics.

      OOBD and OOBS: A random forest model consists of multiple decision trees, each built on a bootstrap resample of the training set. Each tree therefore leaves out some training samples, which are called the “out-of-bag” data (OOBD) of that tree. For the forest as a whole, each sample is evaluated only by the trees that did not use it. Because the OOBD are not involved in building the corresponding tree, they can be used to test the generalization capability of the model. The prediction errors on these out-of-bag data are averaged and normalized to give the out-of-bag score (OOBS).
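      To make the out-of-bag idea concrete, the sketch below draws one bootstrap resample and lists the indices that were never drawn. It illustrates the bookkeeping only and does not train a tree. The function name `bootstrap_split` and the sample size are illustrative assumptions.

```python
import random


def bootstrap_split(n_samples, rng):
    """Draw a bootstrap resample and return (in_bag, out_of_bag) indices."""
    # Sample n indices with replacement: the training data of one tree.
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    drawn = set(in_bag)
    # Indices never drawn form the out-of-bag data of that tree.
    out_of_bag = [i for i in range(n_samples) if i not in drawn]
    return in_bag, out_of_bag


rng = random.Random(0)
n = 10_000
in_bag, out_of_bag = bootstrap_split(n, rng)
oob_fraction = len(out_of_bag) / n
```

      For large n, the expected out-of-bag fraction approaches 1/e (about 37%), so every tree has a sizeable held-out set without any extra data split.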

      As Fig. 4a shows, for a given number of trees, the RMSE always decreases with increasing depth of decision trees, but little change is observed after the depth reaches 30, and no change is observed at depths above 70. Figure 4b shows the error variations for different numbers of decision trees at a constant depth (70). Furthermore, we evaluate the influence of parameters based on another indicator, the wind speed prediction score (Table 2) (Yu et al., 2020), and obtain a similar result, as shown in Fig. 5. In Table 2, the left column gives the range of observed wind speed (m s−1), and the top row gives the range of predicted wind speed. Every predicted value can be assigned a score based on this table, with a higher score indicating a better prediction.

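      | Observed (m s−1) \ Prediction (m s−1) | 0.0−0.2 | 0.3−1.5 | 1.6−3.3 | 3.4−5.4 | 5.5−7.9 | 8.0−10.7 | 10.8−13.8 | 13.9−17.1 | 17.2−20.7 | 20.8−24.4 | 24.5−28.4 | 28.5−32.6 | 32.7−36.9 | ≥37.0 |
      |---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
      | 0.0−0.2 | 1 | 0.6 | 0.4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
      | 0.3−1.5 | 0.6 | 1 | 0.6 | 0.4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
      | 1.6−3.3 | 0.4 | 0.6 | 1 | 0.6 | 0.4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
      | 3.4−5.4 | 0 | 0.4 | 0.6 | 1 | 0.6 | 0.4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
      | 5.5−7.9 | 0 | 0 | 0.4 | 0.6 | 1 | 0.6 | 0.4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
      | 8.0−10.7 | 0 | 0 | 0 | 0.4 | 0.6 | 1 | 0.6 | 0.4 | 0 | 0 | 0 | 0 | 0 | 0 |
      | 10.8−13.8 | 0 | 0 | 0 | 0 | 0.4 | 0.6 | 1 | 0.6 | 0.4 | 0 | 0 | 0 | 0 | 0 |
      | 13.9−17.1 | 0 | 0 | 0 | 0 | 0 | 0.4 | 0.6 | 1 | 0.6 | 0.4 | 0 | 0 | 0 | 0 |
      | 17.2−20.7 | 0 | 0 | 0 | 0 | 0 | 0 | 0.4 | 0.6 | 1 | 0.6 | 0.4 | 0 | 0 | 0 |
      | 20.8−24.4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.4 | 0.6 | 1 | 0.6 | 0.4 | 0 | 0 |
      | 24.5−28.4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.4 | 0.6 | 1 | 0.6 | 0.4 | 0 |
      | 28.5−32.6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.4 | 0.6 | 1 | 0.6 | 0.4 |
      | 32.7−36.9 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.4 | 0.6 | 1 | 0.6 |
      | ≥37.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.4 | 0.6 | 1 |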

      Table 2.  Wind speed prediction score. We adopt this evaluation index from the study of Yu et al. (2020). The left column gives the range of observed wind speed (m s−1), and the top row gives the range of predicted wind speed (m s−1). Every predicted value can be assigned a score based on this table.
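      The scores in Table 2 depend only on how many wind speed classes separate the observation from the prediction: 1 for the same class, 0.6 for adjacent classes, 0.4 for classes two apart, and 0 otherwise. A minimal lookup sketch is given below. The names `wind_class` and `prediction_score` are hypothetical, and rounding speeds to 0.1 m s−1 before classification is an assumption made here because the class limits are given to one decimal place.

```python
from bisect import bisect_left

# Upper limits (m/s) of the first 13 classes in Table 2;
# the 14th class is everything at or above 37.0 m/s.
UPPER_BOUNDS = [0.2, 1.5, 3.3, 5.4, 7.9, 10.7, 13.8,
                17.1, 20.7, 24.4, 28.4, 32.6, 36.9]

# Score as a function of the class gap between observation and prediction.
SCORE_BY_GAP = {0: 1.0, 1: 0.6, 2: 0.4}


def wind_class(speed):
    """Return the 0-based class index of a wind speed in m/s."""
    # Assumption: speeds are rounded to 0.1 m/s before classification.
    return bisect_left(UPPER_BOUNDS, round(speed, 1))


def prediction_score(observed, predicted):
    """Score one prediction against its observation using Table 2."""
    gap = abs(wind_class(observed) - wind_class(predicted))
    return SCORE_BY_GAP.get(gap, 0.0)


same_class = prediction_score(1.0, 1.2)      # both in 0.3-1.5
adjacent_class = prediction_score(1.0, 2.0)  # 0.3-1.5 vs 1.6-3.3
```

      Averaging `prediction_score` over all test samples gives a single score for one parameter combination.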

      Figure 4.  Error distribution for different tree counts and maximum depths of the decision trees. The boxes in panel (a) represent the RMSE for different combinations of tree number and depth. In panel (b), for a given number of trees, we define the “minimum RMSE” as the value at which the RMSE stabilizes with increasing depth. In our test, the stabilized RMSE is always the smallest RMSE for a given number of trees. The “maximum OOBS” is defined as the corresponding OOBS when the RMSE reaches stability. Panel (b) shows how the “minimum RMSE” and “maximum OOBS” change with the number of trees.

      Figure 5.  Wind speed prediction score map with different model parameters (number and depth of trees). The horizontal axis represents the number of trees, and the vertical axis represents the depth of trees.

      In the parameter selection step, we aim to maximize model performance and improve model efficiency. According to the results in Fig. 4 and Fig. 5, we selected 50 trees and a maximum depth of 30 as the parameters of the MLDRM-RF in subsequent model evaluation steps.

    • First, to divide the training set, we label $ \mathbf{S}\mathbf{T} $ records by station as s1, s2, …, s226. As Fig. 6 shows, the first training set (${\mathbf{S}\mathbf{T}}_{\rm{train}}$) consists of samples s2, s3, …, s226, and s1 is the testing set (${\mathbf{S}\mathbf{T}}_{\rm{test}}$). We use ${\mathbf{S}\mathbf{T}}_{\rm{train}}$ to train MLDRM-RF (50 trees with a maximum depth of 30) and use ${\mathbf{S}\mathbf{T}}_{\rm{test}}$ to evaluate the performance of the model. The RMSE is used to evaluate the difference between real and reconstructed values. In addition, we analyzed the feature importance ranking to find a physical explanation for the model performance. In MLDRM-RF, different features influence local wind speed predictions to varying degrees. Tree-based models provide a measure of feature importance based on the mean decrease in impurity (MDI) (Louppe, 2014). Impurity is quantified by the splitting criterion of the decision trees (Gini, entropy or mean squared error). To explore the possible physical explanation of the predictive capacity of the model, we calculated the feature importance ranking by MDI on the testing data. Specifically, we generated a set of random numbers equal in size to the OOBD set, used this random-number series to replace the original values of a feature, and observed how the errors changed. Empirically, if a feature is of high importance, then random variation in this feature should considerably degrade the model performance. The model normalizes the feature importance so that the sum of all importance scores is 1.
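      The randomization procedure described above (replace one feature with random values and observe the change in error) can be sketched generically as follows. This is not the impurity-based MDI score that a tree library reports, and it is not the authors' code. Shuffling a feature column stands in for the random-number series, negative changes are clipped to zero, and the names `randomized_importance` and `predict` are hypothetical.

```python
import math
import random


def randomized_importance(predict, X, y, rng):
    """Importance of each feature as the RMSE increase after randomizing it.

    predict: function mapping one feature row (list) to a prediction.
    X:       list of feature rows (lists of equal length).
    y:       list of true labels.
    Returns importance scores normalized to sum to 1 when any are positive.
    """
    def error(rows):
        total = sum((predict(r) - t) ** 2 for r, t in zip(rows, y))
        return math.sqrt(total / len(y))

    baseline = error(X)
    increases = []
    for k in range(len(X[0])):
        # Assumption: shuffling column k supplies the random values.
        column = [row[k] for row in X]
        rng.shuffle(column)
        randomized = [row[:k] + [value] + row[k + 1:]
                      for row, value in zip(X, column)]
        increases.append(max(error(randomized) - baseline, 0.0))

    total_increase = sum(increases)
    if total_increase == 0.0:
        return increases
    return [inc / total_increase for inc in increases]


# Toy model that depends on the first feature only.
X = [[float(i), float(i % 3)] for i in range(50)]
y = [2.0 * row[0] for row in X]
importance = randomized_importance(lambda row: 2.0 * row[0], X, y,
                                   random.Random(0))
```

      In the toy case the unused second feature receives zero importance and the first feature receives all of it, which is the behavior the normalization in the text implies.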

      Figure 6.  Dataset partitioning for the different models, following the idea of leave-one-out validation. For model 1, the testing set consists of records from station 1, and the training set consists of the remaining samples. Model 2 uses the records from station 2 as the testing set, and its training set consists of the other 225 stations. The other models follow suit.

    • We repeated Step 3 but designated samples from one of the remaining stations (e.g., s2) as the testing set, resulting in a new model (model 2 in Fig. 6). This approach is similar to leave-one-out cross-validation (LOOCV) in statistical modeling, and it can enhance model reliability. This process was repeated until we had built 226 models with different training and testing sets. Finally, we used the average performance of the 226 models to evaluate MLDRM-RF modeling performance.
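      The station-wise partitioning in Steps 3 and 4 can be sketched as a generator that holds out one station at a time. The sample layout `(station_id, features, label)` and the name `leave_one_station_out` are assumptions for illustration. Model training is omitted because it depends on the chosen library.

```python
def leave_one_station_out(samples):
    """Yield (held_out_station, train, test) for every station.

    samples: list of (station_id, features, label) tuples.
    """
    station_ids = sorted({station for station, _, _ in samples})
    for held_out in station_ids:
        # All records of the held-out station form the testing set.
        test = [s for s in samples if s[0] == held_out]
        # Records of every other station form the training set.
        train = [s for s in samples if s[0] != held_out]
        yield held_out, train, test


# Toy data: three stations with two hourly records each.
toy_samples = [
    ("s1", [0.1], 1.0), ("s1", [0.2], 1.1),
    ("s2", [0.3], 2.0), ("s2", [0.4], 2.1),
    ("s3", [0.5], 3.0), ("s3", [0.6], 3.1),
]
folds = list(leave_one_station_out(toy_samples))
```

      Training one model per fold and averaging the per-station test RMSE then gives the overall score. With 226 stations this yields the 226 models described in the text.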

    • To assess the performance of MLDRM-RF, we obtained an interpolated result from the IDW method as a baseline. The IDW method assumes that the dependent variable is affected by the distance from sampling locations and the power of this distance. IDW is one of the most frequently applied methods in spatial data interpolation (Li and Heap, 2011; Franco et al., 2020) because it provides relatively fast and reasonably accurate results. Here, we use a basic version of IDW, formulated as

      $$ \hat{Z}=\sum _{i\in \mathbb{O}}{\lambda }_{i}{Z}_{i},\quad {\lambda }_{i}=\frac{{d}_{i}^{-p}}{\sum _{k\in \mathbb{O}}{d}_{k}^{-p}}, $$

      where $ \hat{Z} $ is the interpolated wind speed at the forecasted point; $ {Z}_{i} $ is the observed wind speed at predictor point i; $ \mathbb{O} $ is the predictor point set, which consists of the 15 points nearest to the forecasted point; $ {\lambda }_{i} $ represents the weight of the wind speed on predictor point i; $ {d}_{i} $ represents the distance between the predictor point and forecasted point; and p is the power of the distance.
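      A minimal sketch of basic IDW with the 15 nearest predictor points is given below. The distance power is not stated in the text, so `power=2.0` is an assumption, as are the planar (x, y) coordinates and the function name `idw_interpolate`. A real application over Beijing would need distances computed from longitude and latitude.

```python
import math


def idw_interpolate(target, points, values, k=15, power=2.0):
    """Inverse distance weighting from the k nearest predictor points.

    target: (x, y) of the forecasted point.
    points: list of (x, y) predictor locations.
    values: wind speeds observed at those locations.
    power:  distance exponent (assumed value, not given in the text).
    """
    # Keep the k predictor points closest to the target.
    nearest = sorted(
        (math.dist(target, p), v) for p, v in zip(points, values)
    )[:k]
    # A predictor that coincides with the target returns its own value.
    if nearest[0][0] == 0.0:
        return nearest[0][1]
    weights = [d ** -power for d, _ in nearest]
    weighted_sum = sum(w * v for w, (_, v) in zip(weights, nearest))
    return weighted_sum / sum(weights)


# Two predictor points at equal distance contribute equally.
midpoint_speed = idw_interpolate((0.0, 0.0),
                                 [(1.0, 0.0), (-1.0, 0.0)],
                                 [1.0, 3.0])
```

      Unlike MLDRM-RF, this estimate uses only the wind speeds and distances of neighboring stations, with a weighting function fixed in advance.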

      RF and IDW methods require almost the same calculation time, and they are relatively easy to implement. MLDRM-RF can replace IDW as a new method for quickly obtaining regional wind fields, and it provides a higher resolution. It is worth noting that MLDRM-RF differs from an interpolation approach. First, as shown in Fig. 3, MLDRM-RF contains various predictors, and both physical and geographic factors are considered. Second, MLDRM-RF does not use a preset mapping function; therefore, the upper limit of model performance is improved. MLDRM is a basic model with the potential to be extended with various features and data sources in practical applications.

    3.   Results
    • First, the performance of the random forest wind speed reconstruction model is evaluated based on the root mean square error. The spatial error distribution is shown in Fig. 7. The average RMSE for all 226 model prediction samples is 1.09 m s−1. As illustrated by the probability density distribution in Fig. 7, the RMSE is less than 1.2 m s−1 for most stations, with a median of 0.91 m s−1. In general, the model performance is better for the southeastern part of Beijing than for the northwestern part. Large RMSE values (>2 m s−1) occur in western Beijing, likely because the meteorological background features in that area are based on the mean conditions in the whole region. However, most stations are located in plains areas, and only a few mountainous stations are located in northern and western Beijing.

      Figure 7.  RMSE (m s−1) spatial distribution (a) and error probability density distribution (b). Error probability density represents the density distribution of the RMSE for 226 models. The horizontal axis shows the magnitude of the error, and the vertical axis shows the density.

    • The geographical distribution of seasonal RMSE is shown in Fig. 8. The spatial average RMSE is 1.10 m s−1 (DJF), 1.10 m s−1 (MAM), 0.88 m s−1 (JJA), and 0.94 m s−1 (SON). According to Fig. 8, MLDRM-RF performs better in summer and autumn than in spring and winter. The large-scale background wind is strongest (weakest) in winter and spring (summer and autumn) in Beijing (Liu et al., 2018a; Yang et al., 2020), which probably causes seasonal variations in model performance.

      Figure 8.  RMSE in different seasons and their spatial distributions, presented individually for (a) spring, (b) summer, (c) autumn, (d) winter.

      As shown in Fig. 9, in each season the RMSE of most stations is less than 1.0 m s−1, and the probability density functions (PDFs) tend to be skewed by extremely large values. The median seasonal RMSE is 0.99 m s−1 for winter, 0.97 m s−1 for spring, 0.79 m s−1 for summer, and 0.83 m s−1 for autumn. The model exhibits the best performance in summer and autumn.

      Figure 9.  Probability density distribution of RMSE (m s−1) in different seasons, presented individually for (a) spring, (b) summer, (c) autumn, (d) winter. The horizontal axis represents the size of the RMSE.

    • To assess the potential advantage of ML models over conventional methods, we interpolated wind speeds at every station based on records from the other 225 stations by using the IDW method for comparison. The RMSE of MLDRM-RF is smaller than that of IDW for most stations: as shown in Fig. 10, at most sites the MLDRM-RF model performs better than IDW by approximately 0.5−1.0 m s−1 in terms of the RMSE.

      Figure 10.  Spatial distribution of the error differences between MLDRM-RF and IDW. The colors represent the difference between the RMSEs (m s−1) of MLDRM-RF and IDW. Positive values indicate that MLDRM-RF performs better than IDW.

      Figure 11 compares the PDFs of RMSE for the two models. The median RMSE of MLDRM-RF is considerably smaller than that of IDW (MLDRM-RF: 0.91 m s−1; IDW: 1.14 m s−1); additionally, the errors of MLDRM-RF are concentrated in a smaller range, implying that it is more stable than IDW. The mean RMSE of predictions by IDW is 1.29 m s−1, approximately 18% larger than that of the RF model. In addition, the tail of the PDF curve in Fig. 11 indicates that MLDRM-RF produces fewer extreme errors than IDW.
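      As a check on the relative differences, the mean and median RMSE values reported above give

      $$ \frac{1.29-1.09}{1.09}\approx 0.18,\qquad \frac{1.14-0.91}{0.91}\approx 0.25, $$

      that is, the IDW error exceeds the MLDRM-RF error by about 18% in the mean and about 25% in the median.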

      Figure 11.  Probability density curves of the RMSE (m s−1) of MLDRM-RF and IDW. The blue shading represents the RMSE distribution for MLDRM-RF, and the black shading represents the RMSE distribution for IDW.

    • In MLDRM-RF, different features influence local wind speed predictions to varying degrees. We calculated the feature importance ranking by MDI on the testing data predictions. We obtain the importance scores for all 226 models with all features and normalize the results to obtain Table 3. The most influential feature is the regional average 10-minute wind speed, with an importance level of 44.35%. The next three most important factors are altitude, longitude and latitude; these are all geographic features, and their total importance amounts to 21.2%. Therefore, the performance of MLDRM-RF relies heavily on the regional background wind speed, while the basic geographic information shapes the wind speed distribution within the region. The other basic ground meteorological variables and time features account for the remaining 34.5% of feature importance; among them, wind speed-related variables have the most significant influence, accounting for 18.23% of this contribution. Temperature, relative humidity and pressure have almost the same influence on the results at approximately 2% to 3%. Time variables are far less important, possibly because some diurnal and seasonal temporal factors are encompassed in the trends in meteorological variables. The low importance of precipitation may be related to the small number of hourly rainfall samples. The total importance of zonal wind (6.74%) is higher than that of meridional wind (4.82%), which indicates that zonal motion is important for reconstructing surface wind speed in Beijing.

      Table 3.  Importance rank of the features used in the RF model. Importance is shown as a percentage. Wind speed components U and V represent the zonal and meridional components, respectively.

    4.   Conclusions and discussion
    • Surface wind speed fields with fine resolution are useful in many applications. This paper introduces an ML model based on an RF algorithm (MLDRM-RF) to reconstruct meshless wind speed fields. We input time, geospatial attributes and meteorological background fields into the algorithm and evaluated the models based on cross-validation. The RMSE of MLDRM-RF is 1.09 m s−1, and the median error is approximately 0.91 m s−1. In terms of the RMSE, the model performs better in summer and autumn than in spring and winter by approximately 14.5%. Additionally, the results are better in the southeast than in the west and north of Beijing. We compared the reconstructed values with the results from the classic IDW method and verified that the MLDRM-RF outperforms the IDW method by approximately 18%, with both models displaying similar computational times. Furthermore, we ranked the importance of features by MDI to explore the reasonableness of the model prediction. We found that the most important feature is the regional average wind speed, which contributed 44.35% of the feature importance. Geospatial features contributed 21.18%, while meteorological features other than average wind speed accounted for 27.61%. The prediction from the random forest model is reasonable and does not overfit to physically unreasonable variables. After the reconstruction time point is specified, the model uses the regional background wind field as the basis for prediction, introduces the combined influence of the other meteorological variables on the wind speed, uses geographic information to locate the prediction point, and finally makes the prediction.

      Our model is highly customizable and could be expanded to include additional features and samples. We can build special models for small regions and train these models with specific samples, such as high-elevation samples. For instance, the terrain has complex effects on wind patterns in many mountainous regions. When a reconstruction model is applied in these regions, we could introduce additional geographical characteristics as features to increase model performance; additionally, we can introduce meteorological variables associated with high pressure levels from reanalysis datasets as features to adapt to complex regions. MLDRM-RF has the potential to become a new baseline in the ML data reconstruction field.

      Acknowledgements. This work was supported by the Strategic Priority Research Program of the Chinese Academy of Sciences (Grant No. XDA19030402), the Key Special Projects for International Cooperation in Science and Technology Innovation between Governments (Grant No. 2017YFE0133600), and the Beijing Municipal Natural Science Foundation Youth Project 8214066: Application Research of Beijing Road Visibility Prediction Based on Machine Learning Methods.

Reference
