-
Air turbulence is a common occurrence in civil aviation and poses a threat to flight safety (Eick, 2014; Golding, 2002). Due to the intensification of global warming and the growth of the aviation industry, turbulence has become more frequent in civil aviation worldwide (HU et al., 2022; Wright, 2023). Williams and Josh (2013) and Williams (2017) verified that the amount of transatlantic wintertime clear-air turbulence in the atmosphere will increase significantly in all aviation-relevant strength categories as the climate changes. Hu et al. (2021, 2022) also found a rising trend of clear-air turbulence over East Asia. As a hazard to aviation, turbulence is responsible for numerous injuries each year and is the underlying cause of many people’s fear of air travel (Sharman et al., 2012). Eleven patients hospitalized after a Delta flight experienced severe turbulence en route to Atlanta (Freiman, 2023). Estimates indicate that there are over 63 000 encounters with moderate-or-greater turbulence and 5000 encounters with severe-or-greater turbulence annually (Sharman et al., 2006). More seriously, turbulence can also cause structural damage to aircraft (Clark et al., 2000). The operational challenges associated with any future increase in turbulence are compounded by the projected future growth of the aviation sector. According to Boeing’s commercial market outlook, air travel is forecast to continue growing faster than global economic activity driven by tourism demand and increased service levels, particularly in developing markets (BOEING, 2023). Recording and understanding the occurrence of turbulence is crucial for enhancing passenger comfort and preventing significant losses.
There are many ways to measure turbulence. Specifically, the eddy dissipation rate (EDR) has been widely employed to assess the vortices generated by aircraft in ambient atmospheric turbulence (Chan, 2011; Bradshaw, 2013). The International Civil Aviation Organization (ICAO) has published that the turbulence shall be reported in terms of the cube root of the EDR (ICAO, 2010). The EDR is a parameter to determine the amount of energy absorbed by the viscous forces in the turbulent flow, with units of velocity squared per second (m2/3 s−1) (Sharman and Pearson, 2017). The EDR may be estimated using radar spectral widths or profiles of refractivity (Hocking and Mu, 1997), high-resolution rawinsonde data (e.g., Clayson and Kantha, 2008), or structure function estimates based on wind measurements from commercial aircraft (Frehlich and Sharman, 2010).
The routine EDR estimates use the quick access record (QAR) as a data source. The QAR is an airborne flight data recorder specifically designed to facilitate quick and easy access to raw flight data. It records the environmental, equipment, and operating parameters of aircraft throughout the entire flight duration.
The detection of turbulence through EDR algorithms is based on conventional intuition/hypothesis-driven model creation (Cornman et al., 1995). In a turbulence model symposium held in 2017, the emphasis was placed on turbulence modeling in a predictive context for complex problems, rather than focusing on turbulence or descriptive modeling. Instead of relying on hypothesis-driven models, we adopt a data-driven approach by utilizing machine learning algorithms. Our conjecture is that QAR data can directly provide information on turbulence-induced anomalies. Machine learning has been widely used for processing flight data. Li et al. (2020) presented a machine learning approach to evaluate the performance of aircraft using on-board sensor information on commercially scheduled flights. Zhao et al. (2021) proposed a novel incremental anomaly detection method based on a Gaussian mixture model to identify common patterns and detect outliers in flight operations from digital flight data. As researchers are acquiring flight anomaly information from flight data using machine learning (Lee et al., 2020; Sheridan et al., 2020; e Silva and Murça, 2023), we speculate that turbulence anomaly information can also be extracted from QAR data without calculating EDR. The objective of this study is to introduce a new generally applicable methodology for detecting turbulence-induced anomalies that can achieve comparable performance with existing EDR approaches.
In this paper, we initially analyze the QAR data and perform preprocessing before utilizing the results for machine learning. Subsequently, we select a symbolic classification algorithm based on genetic programming (GP) depending on the structure and characteristics of the QAR data. In the experimental section, we employ QAR data with anomaly information to train the model and subsequently utilize the trained model to process QAR data from other flights to verify its universal applicability. We also apply various machine learning algorithms, such as random forest and logistic regression models, to the QAR data, and the proposed method outperforms these algorithms.
-
To accomplish the aforementioned objectives, we introduce a GP-based symbolic classifier to detect turbulence anomalies. We hypothesize that time points corresponding to aircraft shaking caused by atmospheric turbulence are anomalous data points.
-
As previously mentioned, QAR data contain multiple types of data and have a mixed data type structure, with continuous numerical variables as well as discrete categorical factors. Prior to applying the symbolic classifier to the QAR data, preprocessing is needed. The selection of our parameters mainly refers to the parameters used in the EDR algorithm (Chen et al., 2019, Huang et al., 2019). As our selected parameters have varying ranges and units, the value of each parameter is normalized within a range of 0 to 1. We then use the normalized parameters as our model’s input.
In our work, we use recognized EDR values as a reference for anomaly detection. According to IACO’s classification of turbulence levels, 0.1 is used as a division between light turbulence and none (ICAO, 2010). To establish a training dataset, we define an EDR value equal to or larger than 0.1 as an anomaly time point. We only use the value of EDR as the reference standard for data division and there will be no EDR in our training or predicting stage. Moreover, as the value of the EDR is calculated from a series of continuous QAR data at time points, an EDR value does not directly correspond to a QAR record. From our data, the QAR data are sampled at 1 sample per second and the EDR is calculated every 17 seconds. To solve this problem, we use linear interpolation to expand the EDR dataset. Linear interpolation is a technique used to estimate values between two known data points along a straight line.
-
GP is an evolutionary computation technique that automatically solves problems without requiring the user to know or specify the form or structure of the solution in advance (O’Neill, 2009). Based on the experience of numerous researchers over many years, GP and other evolutionary computational methods have been especially productive in some research areas (Sette and Boullart, 2001; McCall, 2005; Zhang et al., 2021). GP has been used to automatically construct convolutional neural network architectures for an image classification task (Suganuma et al., 2017). For predicting noncohesive particle setting velocity, a machine learning approach based on GP has been used (Goldstein and Coco, 2014). Competitive performance has been verified in the detection of financial statement fraud using GP (Ravisankar et al., 2011). One of the particular values of GP is in exploring poorly understood domains that are suitable for QAR data, where the interrelationships among the relevant variables and anomaly state are unknown or cannot be directly observed.
A symbolic classifier is an estimator for which a formula is used to represent a certain relationship. The formulas are expressed as tree-like structures with mathematical functions that are recursively applied to variables and constants (Revanasiddappa et al., 2014; Wang et al., 2019).
The symbolic classifier utilizes GP to evolve a mathematical formula that models the relationship among variables, which is similar to the process of calculating the EDR. First, a population of random formulaic expressions represented as tree structures is established. The nodes of the trees consist of mathematical functions, variables, and constants that are recursively organized. The algorithm then iteratively improves this population through an evolutionary process (Bojarczuk et al., 2000). In GP, the variables and constants are called terminals, and the arithmetic operations are associated with internal nodes called functions. The sets of allowed functions and terminals together form the primitive set of a GP system (Ando et al., 2007).
In each generation, the fittest individuals are selected to undergo genetic operations such as crossover, mutation, and reproduction to produce the next generation. This process mimics biological evolution, with the formulas evolving over successive generations to become better suited for representing the pattern in the corresponding data (Kinnear et al., 1994). The iterative process continues until a stopping criterion is met, at which point the best-performing formula is output as the final symbolic classification model (Fig. 1).
-
In GP, the individuals in the initial population are typically randomly generated. There are numerous approaches used to generate this preliminary random initial population. Two common approaches are filling and growth methods (O’Neill, 2009). In both, the initial individuals are constrained by the user to not surpass a predefined maximal depth set. In the filling method, nodes are selected at random from the function set until the maximum tree depth is reached. The growth method, on the contrary, allows for the creation of trees of more varied sizes and shapes.
While the filling method and growth method allow for some variation in the sizes and shapes of the initial trees, neither approach alone can generate the wide array we desire. Therefore, we opted for the ramped half-and-half method proposed by Koza (1992). In this approach, half of the initial population is constructed using the filling method, and half is constructed using the growth method.
-
As with most evolutionary algorithms, genetic operators in GP are applied to individuals who are probabilistically selected based on fitness, such that fitter individuals are more likely to have more child programs than inferior individuals. The most common method for selecting individuals in GP is tournament selection. In tournament selection, numerous individuals are chosen at random from the population and compared, with the best individuals chosen as parents. When performing crossover, two parents are needed, necessitating two selection tournaments.
The random selection of candidates induces an element of noise in tournament selection. Therefore, although the best candidate is preferred, even average-quality candidates have some chance of having children. Since tournament selection is easy to implement and provides automatic fitness rescaling, it is commonly used in GP.
-
GP departs significantly from other evolutionary algorithms in the implementation of crossover and mutation operators. The predominant form of crossover utilized in this work is subtree crossover (Angeline, 1997). In subtree crossover, crossover points, specified as nodes, are randomly and independently selected from each parent tree. The offspring are generated by replacing the subtree rooted at the crossover point in a copy of the first parent with the subtree rooted at the crossover point in the second parent (Fig. 2a).
Figure 2. Schematic maps of the (a) subtree crossover process, (b) subtree mutation process, (c) hoist mutation process, and (d) point mutation process.
To enhance the diversity of, and number of possibilities in, the evolutionary process, we utilize three mutation techniques: subtree mutation, hoist mutation, and point mutation (Figs. 2b–d) (Goldberg, 1989; Kinnear, 1994; Angeline, 1997). Together, these three mutation operators provide beneficial diversity during evolution.
The last operation mentioned before is reproduction. Reproduction simply involves the selection of an individual based on fitness and the insertion of a copy of it in the next generation.
-
The experimental data are entirely derived from the actual flight data of a certain airline. The original training data utilized in this work comprised the QAR data from a Boeing 737 flight from Sanya to Jinan on 22 January 2023; as documented in the flight report, this flight encountered turbulence. We selected 24 parameters as inputs of our model (Table 1), primarily incorporating the parameters utilized in the EDR algorithm (Chen et al., 2019; Huang et al., 2019). Since symbolic classifiers can select the most important parameters during the evolution process, some redundant parameters, such as the number of flights, have no influence on the performance of the classifier.
Parameter Unit Parameter Unit AOA1 ° Pitch Angle ° AOA2 ° Roll Angle ° Flight Number − Wind Direction ° Gross Weight kilogram Ground Speed knots Altitude feet Instruction Air Speed knots Latitude ° Mach − Longitude ° True Air Speed knots Radio Height feet Wind Speed knots Total Air Temp °C Vertical Acceleration G Static Air Temp °C Lateral Acceleration G Display Heading ° Longitudinal Acceleration G Drift Angle ° Vertical Speed Feet min−1 Table 1. Parameters used for turbulence detection.
The QAR dataset contained 11 140 data points, with each representing 24 parameters and 1 anomaly class information after preprocessing. We applied a linear interpolation method to make the EDR values and QAR records directly correspond (Fig. 3a). The data points were classified into two categories based on the EDR values computed from the QAR data, denoting the presence or absence of turbulence. Class 0 indicates that no turbulence occurred at that time point, and Class 1 signifies that turbulence was present. With this EDR-based turbulence classification scheme, approximately 80% of the 11 140 flight data points fell into Class 0, with the remaining 2228 points labelled Class 1 (Fig. 3b). We then randomly split the data into a training set and a test set at a ratio of 7:3.
-
In this work, Python 3.10 was systematically used, from data processing to model testing. For model training, we used the gplearn library and Scikit-learn library. Prior to model training, the key GP parameters were initialized as follows. We conducted multiple repetitions of the experiment using different parameters, and consistently obtained similar results in terms of accuracy. Specifically, since the classifier not only chooses from the terminal set but also chooses from the function set to establish a suitable formula, we trained the model using different function sets to find the most appropriate one (section 3.4).
We selected a set of parameters that can be considered relatively effective. The terminal set comprised all normalized training parameters and random constants ranging from −1 to 1. To expand possibilities beyond the standard arithmetic operators, the function set was augmented to include square root, logarithm, and absolute value functions. The population size was set to 5000 individuals, with evolution conducted for 30 generations. Tournament selection utilized a tournament size of 20. The range of tree depths for the initial population was set from 2 to 6. The crossover rate was set to a 0.9 probability for tournament winners, and the three mutation methods shared an equal 0.01 probability of being selected. When the total of the crossover and mutation rates summed to a value p less than 100%, a reproduction rate of 1 − p (0.07 here) was selected.
-
In this analysis, no stopping criterion was set, and the model was considered to be fully trained after 30 generations. With a first-generation length of 11, the average length of the individuals in the population stabilized at approximately 17 after multiple generations of evolution (Fig. 4a), and the length of the best individual stabilized at approximately 24 (Fig. 4b).
Figure 4. (a) Changes in the average length and fitness of the individuals in the population during the iterative process. (b) Changes in the length and fitness of the best individual during the iterative process. (c) The formula evolved with the GP-based symbolic classifier after training completion. X5, X10, X16, and X23 each represent a parameter in the QAR dataset.
As previously described, the symbolic classifier evolves a formula with QAR data as inputs and anomaly detection results (0 or 1) as outputs. The tree-structured formula obtained after training selects the most salient QAR parameters as inputs (Fig. 5c). The formula also shows the advantage of interpretability of the symbolic classifier based on GP. In the formula, X5 represents the latitude, X10 represents the display heading, X16 represents the instruction airspeed, and X23 represents the vertical speed. As location information other than latitude is not included in the formula, the detection of turbulence primarily relies on the aircraft’s status and its environment. This characteristic of our method makes it potentially universally applicable. Specifically, since the vertical speed is the most frequently utilized parameter in the formula, we speculate that the classifier primarily relies on the vertical speed as the main indicator for turbulence detection, while other parameters provide auxiliary assistance. The utilization of vertical speed by our trained model shows similarities to an onboard algorithm that converts vertical acceleration into EDR estimates:
Figure 5. (a) ROC curve of the trained symbolic classifier for the training data and confusion matrix of the training data. The confusion matrix contained 6152 true positives, 1480 true negatives, 74 false positives, and 92 false negatives. (b) ROC curve of the trained symbolic classifier for the test data and confusion matrix of the test data.
To evaluate the anomaly detection performance, the true positive rate (TPR) and false positive rate (FPR) were utilized as key metrics. The TPR constitutes the proportion of positive instances correctly classified, whereas the FPR represents the proportion of negative instances incorrectly predicted as positive. Moreover, the accuracy (ACC) indicates the proportion of correct predictions by the model among all predictions. The corresponding formulas are as follows:
where FN is the percentage of normal data detected as abnormal; FP is the percentage of abnormal data detected as normal; TN is the percentage of abnormal data correctly detected as abnormal; and TP is the percentage of normal data correctly detected as normal.
According to the confusion matrix (Figs. 5a, b), an ACC of 97.87% and 98.08% was achieved for the training and testing data, respectively. The symbolic classifier yielded a TPR of 98.68% and an FPR of 4.30% for the test dataset. Meanwhile, the receiver operating characteristic curve (ROC curve) shows the trained model has a good performance on the training and test set. The ROC curve illustrates the performance of a binary classifier model at varying threshold values. It is composed of the FPR and TPR under different thresholds. The closer to the upper left corner, the higher the model detection accuracy. These results effectively validate the proficiency of the GP-based symbolic classifier for turbulence anomaly detection.
-
Since the function set is an important part of the formula generated by the classifier, the diversity of the function set also influences the diversity of the formula. We selected different function sets to examine their impact on the training results and evaluated two different function sets: one with basic arithmetic operations (+, −, ×, /) and the other with additional functions (sin, cos, max, min, abs, log, etc.). Training on different initial populations will result in different models. Models using different function sets were trained separately on 50 different initial populations while keeping other parameters constant. The average results represent the performance of the three function sets. A t-test hypothesis test is utilized to examine the independence between the ACC data of each group. At a significance level of 0.05, there is a significant difference in the accuracy of the three function sets.
The results show that the average performance of the three function sets is similar (Table 2). The function sets we selected perform relatively better and have the lowest complexity. However, it also shows that the choice of function set has limited impact on the performance of the model. Appropriate addition of some functions can bring more diversity and lower complexity to the model.
Function number Average ACC on test data Average length Significance Function sets used 7 97.85% 25 Yes Basic function sets 4 97.10% 37 Yes Extra function sets 14 97.60% 27 Yes Table 2. The performance results using different function sets.
-
As described above, our objective is to establish a universally applicable method for turbulence detection. However, the trained symbolic classifier demonstrates high performance based on the above results, and it should be noted that both the training and test data originate from the same flight. Given that the training data and test data are not completely independent, our method’s applicability across different routes or flights cannot be confirmed. To further assess the effectiveness and applicability of the model, we employed the trained classifier to detect turbulence anomalies for two separate flights: one from Chongqing to Jinan on 1 February 2023 and one from Jinan to Harbin on 10 July 2023. The routes of the two flights are completely different from the flight routes used in the training data. We selected the same parameters used for the training data from the QAR for these two flights (Fig. 6a). Based on identical evaluation criteria, the model demonstrated ACCs of 93.91% (Figs. 6b, c) and 95.98% (Figs. 6d, e) compared to the EDRs of the two flights. Our method can identify most of the time points when the EDR is greater than 0.1; that is, when turbulence occurs. The performance of our method on new data confirmed its proficiency and applicability in accurately detecting turbulence anomalies among different flights. This means that our method has the potential to be universally applicable to all flights, similar to the EDR.
Figure 6. (a) The workflow of the detection process. (b) The confusion matrix of the detection results for a flight from Chongqing to Jinan. The TPR and FPR are 99.28% and 17.95%, respectively. (c) The detection results and EDR values for the flight from Chongqing to Jinan. (d) The confusion matrix of the detection results for a flight from Jian to Harbin. The TPR and FPR are 97.49% and 9.64%, respectively. (e) The detection results and EDR values for the flight from Jinan to Harbin.
-
For comparative evaluation, several additional classification algorithms were applied to the QAR dataset. We chose extra trees, random forest, support vector machines (SVM), and logistic regression to make a comparison. We used the Scikit-learn library to implement these algorithms. We used grid search to help us find suitable parameters for these algorithms. The hyperparameters of these algorithms are as follows:
For random forest, the hyperparameters are criterion="log_loss", n_estimators=80, min_samples_split=2, max_depth=7, min_samples_leaf=1, max_features=2, and random_state=1.
For extra trees, the hyperparameters are n_estimators=100, max_depth=10, min_samples_split=2, max_features='sqrt', and min_samples_leaf=2.
For SVM, the hyperparameters are degree=3, kernel='poly', C=10, gamma='scale', probability=True, max_iter=5,000, and tol=1e-4.
For logistic regression, the hyperparameters are solver="newton-cholesky", max_iter=5000.
In order to compare the performance of different models, we introduced the area under curve (AUC). The model with a larger AUC value shows a better performance on the data (Figs. 7a, b). Interestingly, extra trees, random forest and SVM all achieved significantly high performance on both training and test data. The ACCs of extra trees, random forest and SVM on the test data achieved 99.85%, 99.52% and 99.04%, respectively. Only logistic regression demonstrates inferior performance in the same QAR data classification task.
Figure 7. (a) The ROC curves of different algorithms for the training data. (b) The ROC curves of different algorithms for the test data. (c) Confusion matrix of extra trees on flight from Chongqing to Jinan. The confusion matrix contained 4020 true positives, 1763 true negatives, 432 false positives, and 829 false negatives. (d) Confusion matrix of random forest on flight from Chongqing to Jinan. (e) Confusion matrix of SVM on flight from Chongqing to Jinan.
Due to the random splitting of the QAR data from a complete flight into training and test sets, the independence of these datasets cannot be assured. It is speculated that the absence of independence between the training and test sets leads to overfitting in certain algorithms. To further explore the performance of random forest, extra trees and SVM on their applicability, the trained models of these three algorithms were also employed for the aforementioned independent flights. Notably, the algorithms demonstrate a significant decline in performance when applied to new data, as opposed to their performance on the test data (Figs. 7c–e). Table 3 lists the ACCs of these algorithms. The results indicate that the symbolic classifier exhibits greater robustness compared to other algorithms and is more applicable when dealing with new data. Since our objective is to establish a universally applicable method for turbulence detection, the symbolic classifier built on GP is more suitable, considering that the dataset is unlikely to encompass data from all routes worldwide. Conversely, the remaining algorithms display impressive performance on the test data. It is envisioned that these algorithms may yield improved performance when constructing a turbulence detection model specific to a particular route.
Algorithm TPR FPR ACC Symbolic classifier 99.28% 17.95% 93.91% Extra trees 82.90% 19.68% 82.10% Random forest 82.47% 16.49% 82.79% SVM 12.83% 7.88% 37.54% Table 3. The ACC of the algorithms on the flight from Chongqing to Jinan.
-
In this study, we developed and implemented a data-driven turbulence detection method. Instead of employing the conventional method of calculating the EDR, we utilized a GP-based symbolic classifier to directly detect turbulence anomalies using QAR data. The individuals in the population of the symbolic classifier are represented by a formula. After the evolution process was complete, we obtained a model with a formula for which the QAR data were the input and the detection results (0 or 1) were the output. Its applicability has been verified on new data and is more robust than other algorithms. Since modern aircraft are universally equipped with QARs, our method can be widely applied to detect turbulence-induced anomalies during or after flights, eliminating the need for direct EDR computation.
Although our method demonstrates good performance and high accuracy in anomaly detection, there is still room for improvement. The proposed method, which extracts turbulence information directly from QAR data, is only capable of determining the presence or absence of turbulence. However, given that turbulence is categorized into four levels by ICAO, we intend to enhance this method by developing a multi-classifier for classifying turbulence levels. This will be achieved once we acquire a sufficient amount of moderate and severe turbulence data, where the EDR exceeds 0.4. Additionally, EDR offers a more intuitive means of quantifying the severity of turbulence. Therefore, our future research will also concentrate on obtaining turbulence information by building a regression model for estimating the EDR.
Acknowledgements. This work was supported by the Meteorological Soft Science Project (Grant No. 2023ZZXM29), the Natural Science Fund Project of Tianjin, China (Grant No. 21JCYBJC00740), and the Key Research and Development–Social Development Program of Jiangsu Province, China (Grant No. BE2021685).
Parameter | Unit | Parameter | Unit |
AOA1 | ° | Pitch Angle | ° |
AOA2 | ° | Roll Angle | ° |
Flight Number | − | Wind Direction | ° |
Gross Weight | kilogram | Ground Speed | knots |
Altitude | feet | Instruction Air Speed | knots |
Latitude | ° | Mach | − |
Longitude | ° | True Air Speed | knots |
Radio Height | feet | Wind Speed | knots |
Total Air Temp | °C | Vertical Acceleration | G |
Static Air Temp | °C | Lateral Acceleration | G |
Display Heading | ° | Longitudinal Acceleration | G |
Drift Angle | ° | Vertical Speed | Feet min−1 |