The experimental data are derived entirely from the actual flight data of a certain airline. The original training data used in this work comprised the QAR data of a Boeing 737 flight from Sanya to Jinan on 22 January 2023; as documented in the flight report, this flight encountered turbulence. We selected 24 parameters as inputs to our model (Table 1), primarily incorporating the parameters used in the EDR algorithm (Chen et al., 2019; Huang et al., 2019). Because the symbolic classifier selects the most important parameters during the evolution process, redundant parameters such as the flight number have no influence on the performance of the classifier.
Parameter | Unit | Parameter | Unit |
AOA1 | ° | Pitch Angle | ° |
AOA2 | ° | Roll Angle | ° |
Flight Number | − | Wind Direction | ° |
Gross Weight | kilogram | Ground Speed | knots |
Altitude | feet | Instruction Air Speed | knots |
Latitude | ° | Mach | − |
Longitude | ° | True Air Speed | knots |
Radio Height | feet | Wind Speed | knots |
Total Air Temp | °C | Vertical Acceleration | G |
Static Air Temp | °C | Lateral Acceleration | G |
Display Heading | ° | Longitudinal Acceleration | G |
Drift Angle | ° | Vertical Speed | feet min−1 |
Table 1. Parameters used for turbulence detection.
After preprocessing, the QAR dataset contained 11 140 data points, each comprising 24 parameters and one anomaly-class label. We applied linear interpolation so that the EDR values and the QAR records correspond directly in time (Fig. 3a). The data points were classified into two categories based on the EDR values computed from the QAR data, denoting the presence or absence of turbulence: Class 0 indicates that no turbulence occurred at that time point, and Class 1 signifies that turbulence was present. Under this EDR-based turbulence classification scheme, approximately 80% of the 11 140 flight data points fell into Class 0, with the remaining 2228 points labelled Class 1 (Fig. 3b). We then randomly split the data into a training set and a test set at a ratio of 7:3.
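The preprocessing steps above can be sketched as follows. The timestamps, EDR series, and the 0.1 turbulence threshold (the threshold used later for the independent flights) are illustrative stand-ins for the actual QAR and EDR data:

```python
import numpy as np

# Hypothetical timestamps: QAR records are sampled more densely than EDR values.
qar_time = np.arange(0.0, 100.0, 1.0)             # 100 QAR records, 1 s apart
edr_time = np.arange(0.0, 100.0, 5.0)             # EDR computed every 5 s
edr_vals = np.abs(np.sin(edr_time / 10.0)) * 0.2  # placeholder EDR series

# Linearly interpolate the EDR onto the QAR timestamps so that every QAR
# record has a directly corresponding EDR value.
edr_on_qar = np.interp(qar_time, edr_time, edr_vals)

# Label each point: Class 1 (turbulence) when the EDR exceeds 0.1, else Class 0.
labels = (edr_on_qar > 0.1).astype(int)

# Random 7:3 split into training and test indices.
rng = np.random.default_rng(seed=1)
idx = rng.permutation(len(qar_time))
n_train = int(0.7 * len(qar_time))
train_idx, test_idx = idx[:n_train], idx[n_train:]
```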
In this work, Python 3.10 was used throughout, from data processing to model testing. For model training, we used the gplearn and Scikit-learn libraries. Prior to model training, the key GP parameters were initialized as follows. We repeated the experiment multiple times with different parameter settings and consistently obtained similar accuracies. In particular, since the classifier builds a suitable formula by choosing from both the terminal set and the function set, we trained the model with different function sets to find the most appropriate one (section 3.4).
We selected a set of parameters that can be considered relatively effective. The terminal set comprised all normalized training parameters and random constants ranging from −1 to 1. To expand possibilities beyond the standard arithmetic operators, the function set was augmented with square root, logarithm, and absolute value functions. The population size was set to 5000 individuals, with evolution conducted for 30 generations. Tournament selection used a tournament size of 20. The tree depths of the initial population ranged from 2 to 6. The crossover rate was set to a probability of 0.9 for tournament winners, and the three mutation methods were each assigned an equal probability of 0.01. Since the crossover and mutation rates summed to a value p less than 100%, the remaining probability 1 − p (0.07 here) was used for reproduction.
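These settings can be summarized as a parameter sketch whose keys mirror the keyword arguments of gplearn's `SymbolicClassifier` (e.g. `SymbolicClassifier(**gp_params)`); the values are taken from the text, while the dictionary itself is illustrative and exact usage may vary with the gplearn version:

```python
# GP hyperparameters as described in the text; keys follow gplearn's
# SymbolicClassifier argument names.
gp_params = {
    "population_size": 5000,
    "generations": 30,
    "tournament_size": 20,
    "init_depth": (2, 6),              # tree depths of the initial population
    "const_range": (-1.0, 1.0),        # random constants in the terminal set
    "function_set": ("add", "sub", "mul", "div", "sqrt", "log", "abs"),
    "p_crossover": 0.9,
    "p_subtree_mutation": 0.01,
    "p_hoist_mutation": 0.01,
    "p_point_mutation": 0.01,
}

# Whatever probability is left over after crossover and the three mutation
# methods is used for plain reproduction (1 - 0.93 = 0.07 here).
p_reproduction = 1.0 - (
    gp_params["p_crossover"]
    + gp_params["p_subtree_mutation"]
    + gp_params["p_hoist_mutation"]
    + gp_params["p_point_mutation"]
)
```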
In this analysis, no stopping criterion was set, and the model was considered fully trained after 30 generations. Starting from an average length of 11 in the first generation, the average length of the individuals in the population stabilized at approximately 17 after multiple generations of evolution (Fig. 4a), and the length of the best individual stabilized at approximately 24 (Fig. 4b).
Figure 4. (a) Changes in the average length and fitness of the individuals in the population during the iterative process. (b) Changes in the length and fitness of the best individual during the iterative process. (c) The formula evolved with the GP-based symbolic classifier after training completion. X5, X10, X16, and X23 each represent a parameter in the QAR dataset.
As previously described, the symbolic classifier evolves a formula with QAR data as inputs and anomaly detection results (0 or 1) as outputs. The tree-structured formula obtained after training selects the most salient QAR parameters as inputs (Fig. 4c). The formula also demonstrates the interpretability advantage of the GP-based symbolic classifier. In the formula, X5 represents the latitude, X10 the display heading, X16 the instruction airspeed, and X23 the vertical speed. As no location information other than latitude appears in the formula, the detection of turbulence relies primarily on the aircraft's status and its environment, which makes our method potentially universally applicable. Specifically, since the vertical speed is the most frequently used parameter in the formula, we speculate that the classifier relies on the vertical speed as the main indicator for turbulence detection, while the other parameters play an auxiliary role. The use of vertical speed by our trained model resembles onboard algorithms that convert vertical acceleration into EDR estimates.
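How such an evolved formula produces a 0/1 detection can be sketched as follows. The expression inside `evolved_formula` is a hypothetical stand-in for the actual tree in Fig. 4c (which is produced by training and will differ); the sigmoid-and-threshold step mirrors how gplearn's `SymbolicClassifier` converts a raw program output into a class label:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Illustrative stand-in for an evolved tree; X5, X10, X16 and X23 denote the
# normalized latitude, display heading, instruction airspeed and vertical
# speed, respectively.
def evolved_formula(x5, x10, x16, x23):
    return abs(x23) * 2.0 - x16 + x10 * x23 - x5   # hypothetical expression

def classify(x5, x10, x16, x23, threshold=0.5):
    # The raw program output is squashed through a sigmoid and thresholded
    # to produce the 0/1 anomaly label.
    return int(sigmoid(evolved_formula(x5, x10, x16, x23)) > threshold)
```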
Figure 5. (a) ROC curve of the trained symbolic classifier for the training data and confusion matrix of the training data. The confusion matrix contained 6152 true positives, 1480 true negatives, 74 false positives, and 92 false negatives. (b) ROC curve of the trained symbolic classifier for the test data and confusion matrix of the test data.
To evaluate the anomaly detection performance, the true positive rate (TPR) and false positive rate (FPR) were used as key metrics. The TPR is the proportion of positive instances correctly classified, whereas the FPR is the proportion of negative instances incorrectly predicted as positive. Moreover, the accuracy (ACC) is the proportion of correct predictions among all predictions made by the model. The corresponding formulas are as follows:

TPR = TP / (TP + FN), FPR = FP / (FP + TN), ACC = (TP + TN) / (TP + TN + FP + FN),
where FN is the number of normal data points detected as abnormal; FP is the number of abnormal data points detected as normal; TN is the number of abnormal data points correctly detected as abnormal; and TP is the number of normal data points correctly detected as normal.
According to the confusion matrices (Figs. 5a, b), ACCs of 97.87% and 98.08% were achieved for the training and test data, respectively. The symbolic classifier yielded a TPR of 98.68% and an FPR of 4.30% for the test dataset. Meanwhile, the receiver operating characteristic (ROC) curve shows that the trained model performs well on both the training and test sets. The ROC curve illustrates the performance of a binary classifier at varying decision thresholds: it plots the TPR against the FPR as the threshold changes, and the closer the curve lies to the upper-left corner, the better the classifier. These results validate the proficiency of the GP-based symbolic classifier for turbulence anomaly detection.
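The training-set ACC of 97.87% can be reproduced directly from the confusion-matrix counts in Fig. 5a (here "positive" is the normal class, following the convention used above):

```python
# Confusion-matrix counts for the training data (Fig. 5a).
TP, TN, FP, FN = 6152, 1480, 74, 92

tpr = TP / (TP + FN)
fpr = FP / (FP + TN)
acc = (TP + TN) / (TP + TN + FP + FN)

print(f"TPR = {tpr:.2%}, FPR = {fpr:.2%}, ACC = {acc:.2%}")
# → TPR = 98.53%, FPR = 4.76%, ACC = 97.87%
```

(The TPR of 98.68% and FPR of 4.30% quoted in the text are for the test set, whose counts appear in Fig. 5b.)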
Since the function set is an important part of the formula generated by the classifier, its diversity also influences the diversity of the formula. To examine this impact, we evaluated two function sets in addition to the one used above: one with only the basic arithmetic operations (+, −, ×, /) and one with additional functions (sin, cos, max, min, abs, log, etc.). Because training on different initial populations yields different models, the models using each function set were trained separately on 50 different initial populations while keeping the other parameters constant, and the averaged results represent the performance of the three function sets. A two-sample t-test was used to examine whether the mean ACC differed between the groups; at a significance level of 0.05, there is a significant difference in the accuracy of the three function sets.
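The significance test over the 50 repetitions can be sketched as follows. The ACC samples here are synthetic stand-ins (means and spreads chosen for illustration only); in the experiment they would be the 50 per-population test accuracies of each group:

```python
import numpy as np
from scipy import stats

# Synthetic stand-ins for the 50 per-population test ACCs of two groups.
rng = np.random.default_rng(seed=0)
acc_used = rng.normal(loc=0.9785, scale=0.004, size=50)   # "function sets used"
acc_basic = rng.normal(loc=0.9710, scale=0.004, size=50)  # basic arithmetic only

# Two-sample t-test on the mean ACC of the two groups (Welch's variant,
# which does not assume equal variances).
t_stat, p_value = stats.ttest_ind(acc_used, acc_basic, equal_var=False)

significant = p_value < 0.05   # significance level used in the text
```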
The results show that the average performance of the three function sets is similar (Table 2). The function set we selected performs relatively better and produces the formulas of lowest complexity. This also indicates that the choice of function set has limited impact on model performance, and that an appropriate addition of a few functions can bring more diversity and lower complexity to the model.
Function set | Function number | Average ACC on test data | Average length | Significance |
Function sets used | 7 | 97.85% | 25 | Yes |
Basic function sets | 4 | 97.10% | 37 | Yes |
Extra function sets | 14 | 97.60% | 27 | Yes |
Table 2. The performance results using different function sets.
As described above, our objective is to establish a universally applicable method for turbulence detection. Although the trained symbolic classifier demonstrates high performance in the above results, it should be noted that both the training and test data originate from the same flight. Given that the training and test data are therefore not completely independent, the method's applicability across different routes or flights cannot be confirmed from them alone. To further assess the effectiveness and applicability of the model, we employed the trained classifier to detect turbulence anomalies on two separate flights: one from Chongqing to Jinan on 1 February 2023 and one from Jinan to Harbin on 10 July 2023. The routes of these two flights are completely different from the route used in the training data. We selected the same parameters used for the training data from the QAR of these two flights (Fig. 6a). Under identical evaluation criteria, the model achieved ACCs of 93.91% (Figs. 6b, c) and 95.98% (Figs. 6d, e) against the EDRs of the two flights. Our method identifies most of the time points at which the EDR exceeds 0.1, i.e. when turbulence occurs. This performance on new data confirms the method's proficiency and applicability in detecting turbulence anomalies across different flights, and suggests that it has the potential to be universally applicable to all flights, similar to the EDR.
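The evaluation on an unseen flight (Fig. 6a) amounts to comparing the classifier's 0/1 predictions with labels derived from that flight's EDR series via the 0.1 threshold. A minimal sketch, with made-up predictions and EDR values:

```python
import numpy as np

def evaluate_on_new_flight(predictions, edr_values, edr_threshold=0.1):
    # Reference labels: Class 1 wherever the flight's EDR exceeds the threshold.
    labels = (np.asarray(edr_values) > edr_threshold).astype(int)
    predictions = np.asarray(predictions)
    # ACC is the fraction of time points where prediction and label agree.
    return np.mean(predictions == labels)

# Toy example (hypothetical values, not the actual flight data).
preds = [0, 0, 1, 1, 0]
edr = [0.02, 0.05, 0.15, 0.30, 0.08]
print(evaluate_on_new_flight(preds, edr))   # → 1.0 on this toy input
```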
Figure 6. (a) The workflow of the detection process. (b) The confusion matrix of the detection results for a flight from Chongqing to Jinan. The TPR and FPR are 99.28% and 17.95%, respectively. (c) The detection results and EDR values for the flight from Chongqing to Jinan. (d) The confusion matrix of the detection results for a flight from Jinan to Harbin. The TPR and FPR are 97.49% and 9.64%, respectively. (e) The detection results and EDR values for the flight from Jinan to Harbin.
For comparative evaluation, several additional classification algorithms were applied to the QAR dataset. We chose extra trees, random forest, support vector machines (SVM), and logistic regression for comparison, implemented them with the Scikit-learn library, and used grid search to find suitable hyperparameters. The hyperparameters of these algorithms are as follows:
For random forest, the hyperparameters are criterion="log_loss", n_estimators=80, min_samples_split=2, max_depth=7, min_samples_leaf=1, max_features=2, and random_state=1.
For extra trees, the hyperparameters are n_estimators=100, max_depth=10, min_samples_split=2, max_features='sqrt', and min_samples_leaf=2.
For SVM, the hyperparameters are degree=3, kernel='poly', C=10, gamma='scale', probability=True, max_iter=5000, and tol=1e-4.
For logistic regression, the hyperparameters are solver="newton-cholesky", max_iter=5000.
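The four comparison models with the hyperparameters listed above can be instantiated as follows (a sketch only; each model would then be fit on the 7:3 QAR split and evaluated as in the text):

```python
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

# Comparison models with the grid-searched hyperparameters reported above.
models = {
    "random_forest": RandomForestClassifier(
        criterion="log_loss", n_estimators=80, min_samples_split=2,
        max_depth=7, min_samples_leaf=1, max_features=2, random_state=1),
    "extra_trees": ExtraTreesClassifier(
        n_estimators=100, max_depth=10, min_samples_split=2,
        max_features="sqrt", min_samples_leaf=2),
    "svm": SVC(degree=3, kernel="poly", C=10, gamma="scale",
               probability=True, max_iter=5000, tol=1e-4),
    "logistic_regression": LogisticRegression(
        solver="newton-cholesky", max_iter=5000),
}
```

Note that criterion="log_loss" and solver="newton-cholesky" require reasonably recent Scikit-learn releases (1.1 and 1.2, respectively).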
To compare the performance of the different models, we introduce the area under the ROC curve (AUC); a model with a larger AUC performs better on the data (Figs. 7a, b). Notably, extra trees, random forest, and SVM all achieved very high performance on both the training and test data: their ACCs on the test data reached 99.85%, 99.52%, and 99.04%, respectively. Only logistic regression demonstrated inferior performance on the same QAR data classification task.
Figure 7. (a) The ROC curves of different algorithms for the training data. (b) The ROC curves of different algorithms for the test data. (c) Confusion matrix of extra trees on flight from Chongqing to Jinan. The confusion matrix contained 4020 true positives, 1763 true negatives, 432 false positives, and 829 false negatives. (d) Confusion matrix of random forest on flight from Chongqing to Jinan. (e) Confusion matrix of SVM on flight from Chongqing to Jinan.
Because the QAR data of a single complete flight were split randomly into training and test sets, the independence of these sets cannot be assured; we speculate that this lack of independence leads to overfitting in certain algorithms. To further explore the applicability of random forest, extra trees, and SVM, the trained models of these three algorithms were also applied to the aforementioned independent flights. Notably, their performance declines markedly on the new data compared with their performance on the test data (Figs. 7c–e); Table 3 lists the corresponding ACCs. The results indicate that the symbolic classifier is more robust than the other algorithms and more applicable to new data. Since our objective is to establish a universally applicable method for turbulence detection, and since a dataset is unlikely to encompass data from all routes worldwide, the GP-based symbolic classifier is the more suitable choice. Conversely, the remaining algorithms display impressive performance on the test data, and may yield improved results when constructing a turbulence detection model specific to a particular route.
Algorithm | TPR | FPR | ACC |
Symbolic classifier | 99.28% | 17.95% | 93.91% |
Extra trees | 82.90% | 19.68% | 82.10% |
Random forest | 82.47% | 16.49% | 82.79% |
SVM | 12.83% | 7.88% | 37.54% |
Table 3. The ACC of the algorithms on the flight from Chongqing to Jinan.