In this section, we discuss expert ratings of variable importance for the six science drivers. In order to understand whether participants' responses differed depending on their degree of expertise, we first divided the participants into two experience groups: those who rated themselves as "very experienced" in evaluating model fidelity were placed into the "high experience" group (N=36); all other participants were placed into the "low experience" group (N=60).
We emphasize that our "low experience" group consists largely of working climate scientists over the age of 30 (95%), with a median of 10 years of experience in climate modeling. In other words, our "low experience" group mostly consists not of laypersons, students or trainees, but of early-to-mid-career climate scientists with moderate levels of experience in evaluating and tuning climate models. Our "high experience" group consists largely of mid-to-late career scientists: the majority are over the age of 50 (53%), with a median of 20.5 years of experience in climate modeling. Researchers on the development of expertise have argued that roughly 10 years of experience are needed for the development and maturation of expertise (Ericsson, 1996); 86% of our "high experience" group members have 10 years or more of climate modeling experience.
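The grouping rule and the summary statistics described above can be sketched in a few lines of Python. This is a minimal illustration, not the study's analysis code: the records, the response labels other than "very experienced", and the helper names `split_by_experience` and `share_with_at_least` are hypothetical.

```python
from statistics import median

# Hypothetical survey records: (self-rated experience, years in climate modeling).
# These values are invented for illustration and are not the study's data.
participants = [
    ("very experienced", 22),
    ("very experienced", 19),
    ("moderately experienced", 10),
    ("somewhat experienced", 6),
    ("moderately experienced", 12),
]


def split_by_experience(records):
    """Put "very experienced" respondents in the high group, all others in the low group."""
    high = [years for label, years in records if label == "very experienced"]
    low = [years for label, years in records if label != "very experienced"]
    return high, low


def share_with_at_least(years_list, threshold=10):
    """Fraction of a group with at least `threshold` years of experience."""
    return sum(y >= threshold for y in years_list) / len(years_list)


high, low = split_by_experience(participants)
print("high:", len(high), median(high), share_with_at_least(high))
print("low:", len(low), median(low), share_with_at_least(low))
```

The 10-year threshold in `share_with_at_least` mirrors the expertise criterion cited above; any other cutoff can be passed as `threshold`.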
3.1.1. Science Driver 1: How well does the model reproduce the overall features of the Earth's climate?
Our first Science Driver asked respondents to assess the importance of different variables to "the overall features of Earth's climate". We believe that this statement summarizes the primary aim of most experts when calibrating a climate model. However, experts' typical practices are likely to be influenced by factors such as the tools and practices used by their mentors and immediate colleagues, their disciplinary background, and their research interests. Such factors could contribute to differences in judgments of what constitutes a "good" model simulation. The aim of this Science Driver is to understand what experts prioritize when the goal is relatively imprecisely defined as optimizing the "overall features" of climate; these responses can then be contrasted with the more specific questions in the following five Science Drivers.
Figures 2 and 3 show the distribution of responses for each variable in Science Driver 1 for the high and low experience groups. Figure 4 (top) summarizes the mean and standard deviation of importance ratings for all variables in Science Driver 1. Overall, the variables most likely to be identified as "extremely important" were (in ranked order): rain flux (N=31), 2-m air temperature (N=28), longwave cloud forcing (N=22), shortwave cloud forcing (N=21), and sea level pressure (N=20). The complete distributions of responses for all science drivers by experience group, together with statistical summary variables and significance tests, are shown in Tables S1-13.
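As an illustration of how such a per-variable summary could be computed, the following Python sketch derives the mean, standard deviation, and number of "extremely important" responses for each variable. It assumes a 1-7 rating scale with 7 = "extremely important" and uses the sample standard deviation; both are assumptions, and the ratings are invented rather than taken from the survey.

```python
from statistics import mean, stdev

EXTREMELY_IMPORTANT = 7  # assumed top of a 1-7 importance scale

# Hypothetical ratings per variable (not the survey data).
ratings = {
    "rain flux": [7, 7, 6, 5, 7, 6],
    "aerosol optical depth": [4, 2, 6, 5, 3, 4],
}


def summarize(values):
    """Mean, sample standard deviation, and count of top-category responses."""
    return {
        "mean": mean(values),
        "std": stdev(values),
        "n_extremely_important": sum(v == EXTREMELY_IMPORTANT for v in values),
    }


summary = {name: summarize(vals) for name, vals in ratings.items()}

# Rank variables by how often they were rated "extremely important".
ranked = sorted(
    summary,
    key=lambda name: summary[name]["n_extremely_important"],
    reverse=True,
)
for name in ranked:
    s = summary[name]
    print(f"{name}: mean={s['mean']:.2f}, std={s['std']:.2f}, "
          f"N(extremely important)={s['n_extremely_important']}")
```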
The distribution of responses and the degree of consensus are similar between the two groups, with no statistically significant differences for most variables (see Supplementary Tables S4-S6). This suggests that once an initial level of experience is acquired, additional experience may not lead to significant differences in judgments about model fidelity.
It is instructive to examine which variables are the exceptions to this general rule; these exceptions hint at where and how greater experience matters most in informing the judgments experts make about model fidelity. The distributions of responses of the high and low experience groups differed for only one item in Science Driver 1: the oceanic surface wind stress (p<0.01). For this variable, the median responses of the high and low experience groups were "very important" and "moderately important," respectively. We speculate that the high experience group may be more sensitive to this variable due to (1) its critical importance to ocean-atmosphere coupling, and (2) awareness of the relatively high-quality observational constraints available from wind scatterometer data.
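A comparison of two groups' ratings for a single variable can be sketched as follows. This section does not state which significance test was used, so the example swaps in a simple two-sided permutation test on the difference in mean rating as an assumption; because the ratings are ordinal, a rank-based test may be a more natural choice in practice. The function name `permutation_test` and the example ratings are hypothetical.

```python
import random


def permutation_test(group_a, group_b, n_resamples=10_000, seed=0):
    """Two-sided permutation test on the absolute difference in mean rating.

    Returns an estimated p-value: the share of random relabelings whose
    difference in means is at least as large as the observed one.
    """
    rng = random.Random(seed)
    n_a = len(group_a)
    n_b = len(group_b)
    observed = abs(sum(group_a) / n_a - sum(group_b) / n_b)
    pooled = list(group_a) + list(group_b)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b)
        if diff >= observed - 1e-12:  # small tolerance for float rounding
            hits += 1
    # Add-one correction keeps the estimate away from exactly zero.
    return (hits + 1) / (n_resamples + 1)


# Hypothetical importance ratings for one variable (1-7 scale assumed).
high_group = [6, 6, 7, 5, 6, 6, 7, 5]
low_group = [5, 4, 5, 6, 4, 5, 3, 5]
p_value = permutation_test(high_group, low_group)
print(f"estimated p-value: {p_value:.4f}")
```

A permutation test is used here only because it needs no distributional assumptions and no external libraries; it should not be read as the procedure behind the p-values reported in this paper.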
We also investigated the degree of consensus on the importance of different variables. We observe a clearly higher degree of consensus for some variables than for others. Across all participants (high and low experience groups together), there is a comparatively high degree of consensus on the importance of shortwave cloud forcing (A=0.67), longwave cloud forcing (A=0.62), and rain flux (A=0.62). In contrast, there is comparatively little agreement on the importance of oceanic surface wind stress (A=0.39), due to the discrepancy between experience groups on this item, and on the aerosol optical depth (AOD; A=0.42). The data we collected do not allow us to be certain of the reasoning behind importance ratings, but the lack of consensus on AOD importance is perhaps unsurprising in light of the high uncertainty associated with the magnitude of aerosol impacts on climate (Stocker et al., 2013), and recent controversies among climate modelers on the importance of aerosols to climate, or lack thereof (Booth et al., 2012; Stevens, 2013; Seinfeld et al., 2016).
3.1.2. Science Driver 2: How well does the model reproduce features of the global water cycle?
Our second Science Driver included a comparatively limited number of variables related to the global water cycle (Fig. 4: middle). These should be considered in combination with Science Driver 6, which addresses the assessment of simulated clouds using a satellite simulator (Fig. 6).
While the differences did not pass our criteria for statistical significance, we note a slight tendency for the high experience group to assign higher mean importance ratings to net top-of-atmosphere (TOA) radiative fluxes and precipitable water amount. We speculate that this might be due to a slightly greater awareness of, and sensitivity to, observational uncertainties among the high experience group, expressed as a higher importance rating for variables with stronger observational constraints from satellite measurements. This interpretation is supported by the comment of one study participant (with 20 years' experience in climate modeling), who observed that "surface LH [latent heating] and SH [sensible heating] are not well constrained from obs[ervations]. While important, that means they aren't much use for tuning."
3.1.3. Science Driver 3: How well does the model simulate Southern Ocean climate?
For Southern Ocean climate, surface interactions that affect ocean-atmosphere coupling, including wind stress, latent heat flux (evaporation) and rain flux, together with shortwave cloud forcing, were identified as among the most important variables by our participants (Fig. 4: bottom).
The high experience group rated rain fluxes as more important (median: "very" important) than the low experience group did (median: "moderately" important; p=0.02).
It is interesting to compare the responses with Science Driver 1, which included many of the same variables. For instance, for AOD, the low experience group assigned a higher mean importance for overall climate (mean: 4.32; σ: 1.41) than for Southern Ocean climate (mean: 4.04; σ: 1.49); the high experience group likewise assigned a higher mean importance for overall climate (mean: 4.64; σ: 1.16) than for Southern Ocean climate (mean: 4.34; σ: 1.13).
The reasons for this discrepancy are unclear. One possibility is that the high experience group may be more aware that over the Southern Ocean, AOD provides a poor constraint on cloud condensation nuclei (Stier, 2016), and is affected by substantial observational uncertainties, with estimates varying widely between different satellite products.
3.1.4. Science Driver 4: How well does the model simulate important features of the water cycle in the Amazon watershed?
On Science Driver 4, which addresses the water cycle in the Amazon watershed (Fig. 5: top), participants identified surface sensible and latent heat flux, specific humidity, and rain flux as the most important variables for evaluation. It is possible that the more experienced group is more sensitive to the critical role of land-atmosphere coupling in the Amazonian water cycle. This interpretation would be consistent with the additional variables suggested by our survey participants for this science driver, which also focused on variables critical to land-atmosphere coupling, e.g. "soil moisture", "water recycling ratio", and "plant transpiration" (Supplementary Table S2). While the variables selected for the survey focused largely on mean thermodynamic variables, commenters also mentioned critical features of local dynamics in the Amazon region, such as surface topography and "wind flow over the Andes", "convection", and vertical velocity at 850 hPa.
3.1.5. Science Driver 5: How well does the model simulate important features of the water cycle in the Asian watershed?
For Science Driver 5, focused on the Asian watershed, participants rated rain flux, surface latent heat flux, and net shortwave radiative flux at the surface as the most important variables (Fig. 5: bottom). For variables included in both Science Drivers, the order of variable importance was the same as in the Amazon watershed, but different from that in the Southern Ocean; some of these differences are discussed in section 3.3. Written responses again mentioned soil moisture (3×) and moisture advection (2×) as important variables missing from the list.
3.1.6. Science Driver 6: How well does the model simulate the climate impact of clouds globally?
The final Science Driver addressed the evaluation of cloud properties in the model (Fig. 6) using a satellite simulator, which produces simulated satellite observations and retrievals based on radiative transfer calculations in the model. "Very important" (6) was the most common response for all variables in Science Driver 6 (Supplementary Table S15).
While differences in responses between the two experience groups did not pass our bar for statistical significance, the high experience group selected "extremely important" more frequently than the low experience group for the "high level cloud cover" and "low cloud cover" items, which also had the highest mean importance ratings in this Science Driver.
Five participants indicated that longwave cloud forcing and shortwave cloud forcing should have been included, and one respondent noted: "A complete vertical distribution of cloud properties would be even more interesting than 'low', 'medium' and 'high' cloud cover. Cloud particle size and number would also be interesting." Another responded that "cloud fraction is a model convenience but is quite arbitrary."