Abstract:
Soybean is one of the world's four major grain crops and the most important source of vegetable oil and protein. China, the world's fourth-largest soybean producer and largest consumer, relies heavily on Northeast China for domestic production; the region accounts for approximately half of the national output. The interannual variability of soybean yield in Northeast China is primarily driven by meteorological factors, and its accurate prediction is crucial for food security and market stability. Previous statistical prediction studies have been limited to local areas or single provinces and have reported only fitting skill or short-term (≤5 years) prediction skill. To address these limitations, this study developed prediction models for interannual soybean yield variations at the provincial scale in Northeast China during 1981-2018, using six machine learning methods based on meteorological factors. The main findings are: (1) Ridge regression showed the best overall performance among the six methods, with cross-validation correlation coefficients reaching 0.48 (P<0.01), 0.58 (P<0.001), and 0.72 (P<0.001) in Heilongjiang, Jilin, and Liaoning provinces, respectively; (2) Compared with stepwise linear regression, ridge regression performed better in both correlation coefficient (R) and root mean square error (RMSE), with slightly lower accuracy only in amplitude prediction for Jilin and Liaoning provinces; (3) Predictor selection and sample augmentation generally improved the cross-validation prediction skill of the machine learning models; (4) The critical meteorological impact window was concentrated in the flowering and pod-setting period (July-August), during which the positive effects of temperature, precipitation, and sunshine duration significantly enhanced final yields by promoting pod formation, grain development, and photosynthesis.
These findings provide scientific support for soybean yield prediction and agricultural risk management in Northeast China.
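
For illustration, the cross-validated skill measures reported above (R and RMSE for ridge regression) can be sketched as follows. This is a minimal sketch under stated assumptions, not the study's actual pipeline: the leave-one-year-out scheme, the penalty value `alpha`, the helper names `ridge_fit` and `loo_cv_skill`, and the synthetic predictors are all hypothetical, since the abstract does not specify the cross-validation design or the tuning procedure.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form ridge regression on centered data; returns (coef, intercept)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    # Solve (Xc^T Xc + alpha * I) coef = Xc^T yc; the intercept is not penalized.
    coef = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    return coef, y_mean - x_mean @ coef

def loo_cv_skill(X, y, alpha=1.0):
    """Leave-one-year-out predictions, scored by correlation (R) and RMSE."""
    n = len(y)
    pred = np.empty(n)
    for i in range(n):
        train = np.arange(n) != i          # hold out year i
        coef, intercept = ridge_fit(X[train], y[train], alpha)
        pred[i] = X[i] @ coef + intercept  # predict the held-out year
    r = np.corrcoef(y, pred)[0, 1]
    rmse = np.sqrt(np.mean((y - pred) ** 2))
    return r, rmse, pred

# Synthetic stand-in data (assumption): 38 "years" and 3 meteorological predictors.
rng = np.random.default_rng(0)
X = rng.standard_normal((38, 3))
y = X @ np.array([0.5, 0.3, 0.2]) + 0.5 * rng.standard_normal(38)

r, rmse, pred = loo_cv_skill(X, y, alpha=1.0)
print(f"cross-validated R = {r:.2f}, RMSE = {rmse:.2f}")
```

Because each year is predicted by a model that never saw it, the resulting R and RMSE measure prediction skill rather than fitting skill, which is the distinction the abstract draws against earlier studies.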