A Bayesian framework for emergent constraints: case studies of climate sensitivity with PMIP

In this paper we introduce a Bayesian framework, which is flexible and explicit about the prior assumptions, for using model ensembles and observations together to constrain future climate change. The emergent constraint approach has seen broad application in recent years, including studies constraining the equilibrium climate sensitivity (ECS) using the Last Glacial Maximum (LGM) and the mid-Pliocene Warm Period (mPWP). Most of these studies were based on Ordinary Least Squares 5 (OLS) fits between a variable of the climate state, such as tropical temperature, and climate sensitivity. Using our Bayesian method, and considering the LGM and mPWP separately, we obtain values of ECS of 2.7 K (1.1–4.8, 5–95 percentiles) using the PMIP2, PMIP3 and PMIP4 data sets for the LGM, and 2.4 K (0.4–5.0) with the PlioMIP1 and PlioMIP2 data sets for the mPWP. Restricting the ensembles to include only the most recent version of each model, we obtain 2.7 K (1.1–4.3) using the LGM and 2.4 K (0.4–5.1) using the mPWP. An advantage of the Bayesian framework is that it is possible to combine the two 10 periods assuming they are independent, whereby we obtain a slightly tighter constraint of 2.6 K (1.1–3.9). We have explored the sensitivity to our assumptions in the method, including considering structural uncertainty, and in the choice of models, and this leads to 95% probability of climate sensitivity mostly below 5 and never exceeding 6 K. The approach is compared with other approaches based on OLS, a Kalman filter method and an alternative Bayesian method. An interesting implication of this work is that OLS-based emergent constraints on ECS generate tighter uncertainty estimates, in particular at the lower end, 15 suggesting a higher bound by construction in case of weaker correlation. Although some fundamental challenges related to the use of emergent constraints remain, this paper provides a step towards a better foundation of their potential use in future probabilistic estimation of climate sensitivity.


Introduction
In recent years, researchers have identified a number of relationships between observational properties and a future cli-20 mate change, which was not immediately obvious a priori, but which exists across the ensemble of global climate models (GCMs) (Allen and Ingram, 2002;Hall and Qu, 2006;Boé et al., 2009;Cox et al., 2018) participating in the Climate Model Intercomparison Project (CMIP). These relationships are generally referred to as 'emergent constraints' as they emerge from the ensemble behaviour as a whole rather than from explicit physical analysis.
Such emergent constraints have been broadly used to constrain properties of the Earth's climate system which are not easily 25 or directly observable. These are usually presented in probabilistic terms, mostly based on Ordinary Least Squares (OLS) methods. For example, studies have explored the constraint on equilibrium climate sensitivity (ECS), which is the global mean equilibrium temperature after a sustained doubling of CO 2 over pre-industrial levels, using model outputs from the Paleoclimate Model Intercomparison Project (PMIP) Schmidt et al., 2014;Hopcroft and Valdes, 2015;Hargreaves and Annan, 2016). Because of their relatively strong temperature signal, paleoclimate states like the Last Glacial Maximum 30 (LGM) and the mid-Pliocene Warm Period (mPWP) are often considered as promising constraints for the ECS Hargreaves and Annan, 2016), in particular at the high end.
Almost all emergent constraint studies have used OLS to establish the link between variables in the model ensembles.
However, whether ECS or another climate parameter was investigated, the theoretical foundations for the calculations have not previously been clearly presented. An additional problem arising from this is the resulting difficulty in synthesising estimates 35 of climate system properties generated by different statistical methods with different, and often not explicitly introduced, assumptions. These methods include OLS but also alternative Bayesian approaches such as estimates of the climate sensitivity using energy-balance models Aldrin et al., 2012;Bodman and Jones, 2016).
Two recent papers have also addressed the question of emergent constraints in different ways. Bowman et al. (2018) presented a hierarchical statistical framework which went a long way to closing the gap in theoretical understanding of emergent 40 constraints. Conceptually, it is very similar to a single step Kalman filter. Specifically, it uses the model distribution (approximated as a Gaussian) as a prior, which is then updated using the observation to a posterior. However, such prior and the underlying assumptions attached to it could be seen as a restrictive choice to impose on the climate sensitivity area of research.
In particular, it is fairly difficult to generate posterior values which are outside of the prior range, even when the observation is outside the range covered by models. Because of that, it does not appear to correspond to the choice which is usually made, Schmidt et al., 2014;Cox et al., 2018). A summary of several different emergent constraints on climate sensitivity was made by Caldwell et al. (2018). This approach using emergent constraints is meaningful only if we believe that reality satisfies the same relationship, and it was not observed purely by chance in the model ensemble. There is a risk in searching for such relationships in a small ensemble that we may find examples which are coincidental, with no real predictive value (Caldwell et al., 2014). Spurious relationships could also be found because of model limitations (Fasullo and Trenberth, 2012;Grise et al., 75 2015;Notz, 2015).
In this study, we focus on the relationship between equilibrium climate sensitivity, defined here as S, and the temperature change in the tropics which is observed at the Last Glacial Maximum (LGM) and the mid-Pliocene Warm Period (mPWP), defined as T tropical . We posit that a relationship between climate sensitivity and temperature change is physically plausible, as we expect the long-term quasi-equilibrium temperature to be mainly influenced by radiative forcing, and past model ensembles 80 variations in climate sensitivity have been dominated by tropical feedbacks, mostly arising from low-level clouds (Bony et al., 2006;Vial et al., 2013). Furthermore, since mixing with the deep oceans happens mostly at high latitudes and equilibrating the deep ocean temperatures in a climate model takes several thousand years, there is the risk that the temperature in these regions have not yet equilibrated to the same extent in all models. This can cause high latitude temperature variations that are not related to climate sensitivity, but to how modelling centres conducted their experiments.

Ordinary Least Squares
The most widely-used approach to emergent constraint analysis is to find an observable phenomenon that exhibits some relationship to the parameter of interest, and use this as a predictor in a linear regression framework. The Ordinary Least Squares (OLS) method has been widely used because of its simplicity, and so we also use it here as a benchmark for comparison with alternative statistical methods. In the context of constraining climate sensitivity, the parameter of interest (i.e. the ECS) is 90 considered as a predicted variable Schmidt et al., 2014;Hargreaves and Annan, 2016). This may be written as where S is the climate sensitivity, α and β two unknown constants, T tropical the temperature anomaly averaged over the tropical region for the given paleo-time interval, and the residual term which is drawn from a Gaussian distribution N (0, σ) 95 and which accounts for deviations from the linear fit. When we use this approach, the unknown constants of the linear fit are estimated via ordinary least squares (OLS) using the (T i tropical , S i ) pairs representing the model ensemble (here i indexes the models) and then the equation is used to predict the true value of S for the climate system, based on the observed value T o tropical . A confidence interval for the predictor variable can be generated by accounting for uncertainties in the fit and in the observed value as was demonstrated by Hargreaves et al. (2012). This procedure makes the assumption that reality satisfies the 100 same regression relationship as the models, i.e. is likely to be at a similar distance from the line as the model points are.
Integrating the intrinsically frequentist OLS-based confidence intervals into a Bayesian framework is somewhat unclear. One issue is the misinterpretation of frequentist confidence intervals as Bayesian posterior credible intervals, where the former is the representation of a percentage of intervals to contain the fixed true value of the parameter, while Bayesian credible interval is the probability of the true value to be included within a given interval. For instance, if there is an observed T tropical = 1 K, with 105 an assumed Gaussian observational uncertainty of σ = 0.25 at one standard deviation, then stating that there is a close-to-95% probability of having the true value of the parameter within the interval 0.5-1.5 K is a Bayesian credible interval interpretation.
However, the latter is a common interpretation of frequentist-based studies. This confusion has inherent drawbacks on the analysis of posterior outputs, as shown in various fields of science (Hoekstra et al., 2014) and more recently, for climate sensitivity computations (Annan and Hargreaves, 2019). Williamson and Sansom (2019) have presented a Bayesian interpretation of this 110 approach using reference priors on ψ, as defined by Cox et al. (2018) as a metric of global mean temperature variability, and the regression coefficients. However, this approach does not appear to readily allow for the use of any arbitrary prior distribution for S which may either be desired for comparison with other research, or else have arisen through a previous unrelated analysis.
The Bayesian linear regression approach that we introduce in the next section avoids these problems.

115
The (subjective) Bayesian paradigm is based on the premise that we use probability distributions to describe our uncertain beliefs concerning unknown parameters. We use Bayes' Theorem to update a prior probability distribution function (pdf) for the equilibrium climate sensitivity via where P (S|T tropical ) is the posterior after conditioning on the data, P (S) is the prior and P (T tropical ) is a normalisation constant. The likelihood P (T tropical |S) is a function that takes any value of S and generates a probabilistic prediction of what we would expect to observe as T tropical if that value was correct. The use of the Bayesian paradigm requires us to create such a function. Using the principles of emergent constraint analyses in which a linear relationship between these two parameters, which was seen in the GCM ensemble, is believed to apply also to reality, it is natural to use the regression relationship Note that this reverses the rôles of predictor and predictand compared to the OLS-based approach (Eq. 1). The values of α, β and σ, where σ represents the standard deviation of as ∈ N (0, σ), are estimated from the model ensemble via a Bayesian linear regression procedure, which requires priors to be defined over these parameters.
In this way, we create a statistical model, conditioned on the model ensemble, that can generate a predictive pdf for the tropical temperature change at the LGM or at the mPWP P (T tropical |S), for any given sensitivity. There is a structural difference 130 between this approach and that of Eq. 1, in that here the residual uncertainties ∈ N (0, σ) represent our inability to perfectly predict the tropical temperature anomaly arising from a given sensitivity, and are probabilistically independent of the latter rather than the former variable. The issue here is not a matter of which regression line is 'correct', but rather how, given the model ensemble, we can create a plausible likelihood model for P (T tropical |S).
The regression prediction for the temperature change as a function of sensitivity, together with the observed tropical temper-135 ature change as estimated through analysis of proxy data T o tropical , naturally leads to a likelihood function for the sensitivity of the climate system P (T tropical = T o tropical |S). That is, for a given sensitivity S, we can calculate the probability of the observation of tropical temperature that we have, as the composition of the predictive pdf for actual tropical temperature, together with the uncertain observation operator. In practice this is performed by a simple sampling algorithm.
A prior belief both on climate sensitivity (P (S)), and on the parameters of the regression model, has to be assumed. There is 140 no clearly uncontested choice for prior distribution for climate sensitivity. However,  argued that a Cauchy distribution has a reasonable behaviour with a long tail to high values, but unlike the uniform prior, does not assign high probability to these values. Thus we adopt this prior for our main analyses. In section 3.5 we test the sensitivity of the results to this choice and compare the results obtained using Gamma and uniform prior distributions. Priors for the parameters of the regression model are chosen with reference to the specific experiment and are intended to represent our reasonable (albeit 145 uncertain) expectation that models do indeed generate a regression relationship as described.
An additional issue, that was briefly mentioned above, is that we may like to consider the probability that reality is qualitatively and quantitatively distinct from all models. This issue, which was explicitly argued in the context of emergent constraint analysis by Williamson and Sansom (2019), seems reasonable since all models do share a theoretical heritage and certain limitations. However, this issue remains challenging to quantify. It has not been considered in most previous studies which also 150 makes it difficult to compare. We investigate this issue in Section 3.6. Whilst the proposed resolution remains preliminary and although the concept is promising for understanding emergent constraints, we decide to omit it for the rest of the analysis to enable more direct comparisons with previous studies. The Bayesian Linear Regression (BLR) method is more explicit than the standard OLS approach, as the prior assumptions have to be given by the user. This transparency leads to more freedom and control of the statistical model. Moreover, it is less 155 sensitive to outliers as the prior on the regression coefficients provides a form of regularisation. This should result in lower variance in the results, particularly when, as in the examples studied here, we have small model ensembles.
Additionally, the Bayesian method allows the user to add multiple lines of evidence by sequentially updating the chosen prior for S. The method for combining independent constraints is reasonably simple, as it only requires us to calculate and store the posterior of the first emergent constraint analysed, and use this distribution as the prior for the second emergent constraint.

160
Thus it is a direct form of sequential Bayesian updating. This process results in a posterior distribution which will generally be narrower than either of the two posteriors that would have been generated from either of the emergent constraints separately.
Although it may be tempting to simply combine all emergent constraints in this way, it is necessary to also consider possible dependencies between the uncertainties in the different emergent constraints before this can be done with confidence (Annan and Hargreaves, 2017). two dimensions for the Gaussian, these being the scalar predictor (e.g. sensitivity) and predictand (e.g. tropical temperature 170 change). While this approach is a natural and attractive option in many respects, it has the specific drawback of using the distribution of model samples as a prior. Existing literature on emergent constraints does not make this assumption and this could be seen as a limiting aspect of the method, as it implies that the model ensemble is already a credible predictor even before consideration of the observational constraint. Some implications of this approach are that the posterior estimate is equal to the model distribution in the case that no constraint exists, either because there is in fact no relationship between observation 175 and predictand, or else when the observational uncertainty is excessively large. It is also difficult for the method to generate posterior estimates that include values significantly outside the model range, even in the case where the observed value is outside the model spread. We present results generated with a Kalman filter in Section 3.1 for comparison with our main analysis.

180
The BLR method may be applied to any emergent constraint. In this study, we use the model outputs and data syntheses arisen from phases 2 and 3 of PMIP (Braconnot et al., 2007;Haywood et al., 2011;Harrison et al., 2014), summarised in Table 1. The Last Glacial Maximum (19-23 ka) corresponds to the period of the last ice age where ice sheets and sea ice had their maximum extent. Due to its temporal proximity, relative abundance of proxy data, and substantial radiative forcing anomaly, the LGM is widely considered one of the best paleoclimate intervals for testing global climate models and has been featured in all of the PMIP consortium experiments. A representation of several model LGM simulations compared to the SAT reconstruction of Annan and Hargreaves (2013) is shown in Fig. 1-(a).
Previous results from PMIP2 showed a significant correlation between LGM tropical temperatures and climate sensitivity in the models , although the equivalent calculation for the PMIP3 models found no significant correlation (Schmidt et al., 2014;Hopcroft and Valdes, 2015). These two similar sized ensembles with contrasting characteristics are a good test-bed for exploring the properties of the different methods. For the tropical temperature anomaly relative to preindustrial we use a value from Annan and Hargreaves (2013), for 20 • S to 30 • N, a T o tropical of -2.2 K with a Gaussian observational uncertainty of ± 0.7 K (5-95% confidence interval). Several data compilations are presently in development as part of PMIP4, but these have yet to be integrated into a global temperature field so revising the temperature estimate from Annan and Hargreaves (2013) is a topic for future work.

195
Interest in the mPWP (2.97-3.29 million years ago) as a more direct analogy for future climate change, has grown during the past years. This is the most recent period with a sustained high level of greenhouse gases and concomitant warmth relative to the pre-industrial period, however, the data are more sparse and uncertain. In Fig. 1-(b), the sea-surface temperature anomaly of different climate models which performed a mPWP simulation is displayed, as well as the PRISM3 SST reconstruction (Dowsett et al., 2009). Previous results for this period from the PlioMIP experiment, which was part of PMIP3, indicated a 200 fairly strong correlation between tropical temperature and climate sensitivity in the models, but the confidence with which this can be used to constrain climate sensitivity was low due to high uncertainty in various observationally derived components as well as various compromises in the way the protocol was formulated (Hargreaves and Annan, 2016). For the mPWP, a tropical temperature anomaly of 0.8 ± 1.6 K (5-95% interval) is taken from Hargreaves and Annan (2016) for 30 • S to 30 • N, assuming the largest 5-95% uncertainty showed in that work. The reconstruction used here is the PRISM3 (Pliocene Research, 205 Interpretation and Synoptic Mapping) SST anomaly field as described in Dowsett et al. (2009).
The Last Interglacial (127 ka, referred as lig127k in CMIP6) and the mid-Holocene (6 ka) are part of the PMIP simulations and also relatively warm climates. The forcings are, however, seasonal and regional in nature, mostly influencing the patterns of climate change. The global change in temperature and the global climate forcing are both very small, and this coupled with the large uncertainty in paleoclimate data makes these intervals poor candidates for constraining climate sensitivity. We do not 210 explore these intervals further here.
Climate sensitivity has various definitions and there are also a number of different ways of approximating the value in climate models that have not been run to equilibrium. For PMIP3 LGM the model values are mostly based on the regression method of Gregory et al. (2004), but for the models which contributed to PMIP2 LGM and PlioMIP the exact definition and derivation used in each case is not always clear in the literature. In order to make comparisons with previous work, here we 215 use the same values as those used in Hargreaves et al. (2012), Schmidt et al. (2014) and Hargreaves and Annan (2016) with two exceptions to ensure that only one value of sensitivity is used for identical versions of the same model across different experiments. Specifically, for FGOALS-g2 we use the value of 3.37 K (Yoshimori, pers. comm.) for both PMIP3 LGM and PMIP3 PlioMIP, and for HadCM3 we use 3.3 K  for both PMIP2 LGM and PMIP3 PlioMIP. Previous values used by Hargreaves and Annan (2016) for PMIP3 PlioMIP were 3.7 K for FGOALS-g2 (Zheng et al., 2013) and 3.1 K for HadCM3 . These changes are minor compared to the ensemble range of climate sensitivity and thus, they have no significant effect on the posterior outputs.
In addition to the already published results from PMIP2 and PMIP3 we add to our ensembles the results that are currently available from PMIP4 in section 3.3. While the LGM protocol (Kageyama et al., 2017) remains very similar to that in previous iterations of PMIP, the mPWP protocol (Haywood et al., 2016) has more significant differences which address several of the 225 limitations of the previous version. Most importantly, PlioMIP2 seeks to represent a specific quasi-equilibrium climate state in the past rather than representing an amalgamation of different warm peak climates as had been the case for PlioMIP1. A priori we are therefore less confident about combining the results from PlioMIP1 and PlioMIP2 and do so mostly to indicate where the new models lie in the ensemble and to highlight the potential for future research in this area once more model results based on the PlioMIP2 protocol become available.

Applications and Results
In order to apply the Bayesian Linear Regression and compute the likelihood P (T tropical |S), several priors have to be established as initial conditions. Specifically, for both the LGM and the mPWP we use Eq. 3 as the basis for our likelihood function.
The prior expectations of the three unknown parameters α, β and the standard deviation of the residual , referred to as σ, need to be defined. The relative complexity of the likelihood function with three a priori unknown parameters requires the use of a 235 sampling method for computational efficiency. In this study, we use the Markov Chain Monte Carlo (MCMC) method NUTS as described by Hoffman and Gelman (2014). The NUTS method is also included in the MCMC python package PyMC3 (Salvatier et al., 2016) which is applied here. Other MCMC methods which have been tested, such as Metropolis sampling or Hamiltonian Monte Carlo, give equivalent results.
Depending on the strength of the correlation among the dataset, one could expect a sensitivity of the regression to the choice 240 of prior parameters. In the following sections, we first describe the physical arguments behind the choice of priors over α, β and σ, and then present the outputs of the BLR for both the PMIP2 and PMIP3 dataset of the LGM and the PlioMIP1 dataset of the mPWP. Then, we include the CMIP6 data in the BLR framework for both paleo intervals, and present an approach of combining the two emergent constraints. Finally, we explore the sensitivity of the BLR approach to the choice of priors over the climate parameter of choice (i.e. the climate sensitivity) and to the hypothetical inadequacy of climate models.

The Last Glacial Maximum
From consideration of energy balance arguments and fundamental physical properties, such as the response of Earth to an increase of CO 2 , we have a prior expectation of a relationship between sensitivity and global LGM temperature anomaly (e.g. Lorius et al., 1990), and model experiments of Hargreaves et al. (2007) as well as simple physical arguments about the spatial distribution of forcing suggest that this relationship may be most clearly visible when we focus on the tropical region. While the 250 total negative forcing at the LGM is roughly twice as large as the positive forcing that would be caused by a doubling of CO 2 , the temperature response at low latitudes is generally expected to be lower than the global mean, due to polar amplification and the related presence of high latitude ice sheets. Thus we might reasonably expect the tropical temperature change at the LGM to be roughly equal to the global temperature rise under a doubling of CO 2 . It would also be unexpected if the correlation had the opposite sign to that based on simple energy balance arguments, such that a more sensitive model had a lower temperature 255 change at the LGM. However we cannot justify imposing a precise constraint on the slope and therefore our choice of prior for α is N (−1, 1). As for β, we expect the regression line to pass close to the origin, as a model with no sensitivity to CO 2 would probably have little response to any other forcing changes, especially in the tropical region where the influence of ice sheets is remote. However, we do not expect a precise fit to the origin and therefore, the prior chosen for β is N (0, 1). Finally, we chose a wide prior for σ, a Half Cauchy with a scale parameter of 5. The Cauchy is fairly close to uniform for values smaller than the 260 scale parameter, decaying gradually for higher values.
Deviations from the regression line may be due to different efficacies of other forcing components, especially ice sheets or dust. To take into account the uncertainty on the strength of the response, we performed two additional analyses where the prior response was smaller (α defined as N (−0.5, 1)) and larger (α defined as N (−2, 1)). We do not see much difference in the results using the three priors over α: the difference is approximately 0.2 K of climate sensitivity for both the upper and lower 265 percentiles quoted, giving us confidence in our choice of N (−1, 1). The computed 5-95% posterior climate sensitivity ranges for different values of α are summarised in Table 2.
The MCMC algorithm samples the posterior distribution of regression parameters which is represented by the ensemble of predictive regression lines in Fig. 2. This ensemble is used to infer the climate sensitivity following the Bayesian inference approach using the geological reconstruction of the LGM tropical temperature. The posterior distributions are computed using 270 a truncated-at-zero Cauchy prior with a peak of 2.5 and a scale of 3, which corresponds to a wide 5-95% prior interval of 0.5-28.7 K. Such a prior was used previously by  because it has a long tail, allowing for a substantial probability of having high climate sensitivity while still maintaining some preference for more moderate values. Additionally, the Cauchy prior has a finite integral, unlike the uniform distribution (which is sometimes referred as an "improper prior" for this reason). However, the sensitivity of Bayesian statistics to the choice of prior has often been noted. Thus, two alternative 275 priors, including the widely used uniform prior, and their corresponding posterior distributions, are investigated in Section 3.5.
To test the robustness of the method and also to compare it with the statistical methods used in previous studies, three cases are investigated in which we use different combinations of the available model ensembles. The results are shown in Fig. 2 and Table 2. For the PMIP2 ensemble, the correlation between tropical temperature and climate sensitivity was found to be reasonably 280 strong and in this study the resulting 90% range for inferred climate sensitivity is 1.0-4.5 K (Fig. 2-(b)). The range is slightly better constrained at the lower end than the 0.5-4 K from Hargreaves et al. (2012), however we have used the revised value for the LGM tropical anomaly of -2.2 ± 0.7 K rather than the value of -1.8 ± 0.7 K that was used by Hargreaves et al. (2012).
When all the models of PMIP2 and PMIP3 (see Table 1) were considered jointly the correlation became weaker and the corresponding 5-95% range generated by the BLR method is 0.7-4.8 K (Fig. 2-(d)). Schmidt et al. (2014) obtained 1.6-4.5 K 285 using a similar ensemble although in that case multiple results obtained from the same modelling centre were combined by 9 https://doi.org/10.5194/cp-2019-162 Preprint. Discussion started: 17 January 2020 c Author(s) 2020. CC BY 4.0 License.
averaging. Using the OLS method on our ensemble we obtain 1.8-4.3 K. The BLR method generates a wider range here, as the correlation is weaker and the prior starts to influence the posterior.
Finally, we consider the PMIP3 models in isolation. For this ensemble no correlation is found so for the BLR method the result is heavily dependent on our prior assumptions. We obtain a 5-95% range here of 0.7-5.5 K (Fig. 2-(f)). Applying the 290 OLS method on the PMIP3 dataset gives a 5-95% range of 2.2-4.7 K. As previously argued for the combination of PMIP2 and PMIP3, the OLS produces a tighter posterior range. It is also close to the range of the initial ensemble because of the lack of correlation for this dataset.
The Kalman filtering approach presented by Bowman et al. (2018) has not previously been used for emergent constraint analyses in paleoclimate research. Thus, we also use this method to explore both PMIP2 and the combination of PMIP2 and 295 PMIP3 (Fig. 3). With the same geological reconstruction value, and a prior 5-95% range (based on the PMIP2 GCM ensemble) of 1.8-4.6 K, a posterior range of 1.3-4.6 K is inferred. By combining the PMIP2 and PMIP3 models, the prior 5-95% range becomes 2.0-4.5 K and the posterior range is 1.6-4.5 K. The increase in lower bound in these calculations is the largest change compared to our Bayesian linear regression method. However, this is strongly forced by the underlying assumptions of a Kalman filter (Section 2.3) which uses the model ensemble as a prior, making it difficult to compute a posterior range outside 300 of the model range, in particular when the observed value is considered as excessively uncertain.

The mid-Pliocene Warm Period
As for the LGM, priors parameters have to be defined to perform the BLR with the mPWP data. In principle these may be different to those used for the LGM experiment but in practice we have adopted the same priors for our base case, apart from the obvious sign change for α. Regarding the slope term α, the total positive forcing of the mPWP is not as large as the negative 305 forcing of the LGM. Therefore, it seems reasonable to expect a roughly similar slope in the regression. We performed the same sensitivity experiments as for the LGM, with three different priors over alpha: N (1, 1), N (0.5, 1), N (2, 1). There was only a small difference between the results using the three priors: the differences at the 5th percentile being less than than 0.1 K and the differences at the 95th percentile being approximately 0.3 K (see Table 2). Regarding β and σ, there is no physical reason for them to be substantially different than the ones chosen for the LGM. Indeed, although the mPWP is a warm climate, it 310 should also be expected that there is little temperature change to other forcing if the climate sensitivity is null. Thus, a N (0, 1) prior for β is selected and the same prior for σ as for the LGM analysis is chosen.
The Bayesian inference method applied above for the LGM model outputs is now applied on the mPWP model outputs (Fig. 4). With less abundant models and less well-constrained temperature data, we prefer to assume large uncertainties in the mPWP SST reconstruction (0.8 ± 1.6 K, 5-95% confidence). We adopt the Cauchy prior on climate sensitivity as for the LGM 315 analysis (5-95% interval of 0.5-28.7 K) and compute a 5-95% interval for the ECS of 0.5-5.0 K for the PlioMIP1 dataset.
Similar to the results for the LGM, the OLS method (Hargreaves and Annan, 2016) resulted in a slightly narrower 5-95% range than the BLR method (1.3-4.2 K, assuming 1.6 K of uncertainty on the data).

Inclusion of CMIP6 / PMIP4 data
The ongoing PMIP4 experiments have produced LGM and mPWP (PlioMIP2) simulations. Here we add those results to our 320 ensembles. There are two model runs available for the LGM and three for the mPWP (see Table 2).
For the LGM we have previously combined the PMIP2 and PMIP3 results, and the protocol for PMIP4 is not very different.
If we combine all three ensembles we obtain a 5-95% range for the ECS of 0.8-4.7 K using the BLR method ( Fig. 5-(b)). The ensemble size is now 16, but we note that this includes several models coming from the same modelling centres.
Past studies have investigated the proximity of models with hierarchical trees (Masson and Knutti, 2011;Knutti et al., 2013) 325 and the influence of their dependency on statistical methods (Annan and Hargreaves, 2017). Thus, although we believe such dependencies exist in the ensemble, it is in reality difficult to quantify and correct for this. How to deal with this possible duplication of information is therefore a subjective decision. In Schmidt et al. (2014) it was taken into account by averaging the results from models from the same modelling centre. Here we take an alternative approach of including only the latest version of each model. This gives an ensemble size of 9 models (Fig. 6-(a)) and a rather well-constrained 5-95% climate sensitivity 330 range of 1.1-4.3 K with the BLR method ( Fig. 6-(b)).
For PlioMIP and PlioMIP2 the situation is a little more complex as the protocol has been redesigned to represent a specific interglacial state rather than a generic warm climate, referred to as a "time slab" in the PlioMIP protocol. Thus there could be a different regression relationship for these two ensembles. However, when we plot the PlioMIP1 ensemble members (Fig. 5-(d)) we see that they do not look different to the PlioMIP2 ensemble members. The straight combination of PlioMIP1 and PlioMIP2

335
gives an ensemble range of 12 models and we computed a 5-95% range of 0.4-5.0 K (Fig. 5-(d)). Including only the most recent versions of models results in an ensemble size of 9 models (Fig. 6-(c)) and generates a merely identical 5-95% climate sensitivity range of 0.4-5.1 K with the mPWP simulation ( Fig. 6-(d)).

Combining multiple constraints
As described in section 2.4, the mPWP and the LGM are very different climates. If the observational data are generated by 340 unrelated analyses, we may be able to consider the two lines of evidence to be independent, and combine them using Bayes theorem to create a new posterior which is likely to be narrower than that arising from either analysis alone. Assuming that the uncertainties arising from the mPWP and the LGM analyses are independent of each other may be plausible as the proxy reconstructions use different observations and analyses to estimate both the tropical temperatures and the other variables that act as boundary conditions for the model experiments. Moreover, modelling uncertainties that influence the regression analysis 345 are expected to arise from rather different sources, such as the response to ice sheets and a cold climate in one case, versus the influence of a warmer climate in the other. Having said that, model biases influencing the simulation of one climate change may also influence the other, which means that if similar models occur in both ensembles, this could lead to dependencies.
Using Bayes theorem to combine the constraints means that it is not necessary for the same set of models to be used for each ensemble but, as we can see from Table 1, a few models do occur in both ensembles.
It is straightforward to first compute the posterior estimate of S from the LGM analysis as previously described, and then use this as a prior for the mPWP analysis. Priors over the regression coefficients are considered independent between the two analyses. Because of the issues discussed above, we perform an analysis using both ensembles of latest model versions in the LGM and the mPWP as described in section 3.3. The posterior of the LGM is used as the prior for the mPWP analysis and the resulting posterior from this process has a narrower 5-95% interval for S of 1.1-3.9 K (Fig. 7).

Alternative Priors on sensitivity
A major strength of the Bayesian analysis developed here is the way that the prior on the parameter of interest, here climate sensitivity, can easily be specified independently of all other aspects of the analysis. A uniform prior for S has been widely used (e.g. Tomassini et al., 2007;Aldrin et al., 2012). However, it has also been argued that such prior could give an unrealistically high probability to high climate sensitivity . Here we test our method with the 360 commonly-used uniform prior U[0;10] which has a 5-95% range of 0.5-9.5 K. The resulting posterior 5-95% range for climate sensitivity is 0.8-5.0 K when analysing the LGM PMIP2 models only, and 0.6-5.4 K with the LGM PMIP2 and PMIP3 models together. These posteriors are wider than the ranges previously computed with a Cauchy prior, particularly for the case of combining PMIP2 and PMIP3 where the correlation is rather weak, meaning that the prior has a relatively higher influence.
These results are shown in Fig. 8. Due to the questions which have arisen over the use of a uniform prior and the fact that it has 365 an infinite integral, we also perform a comparison with an alternative prior which features a decaying tail and a finite integral.
For this purpose, a Gamma prior is chosen with a shape parameter of 2 and a scale of 2, which corresponds to a similar 5-95% prior range of 0.7-9.5 K. The posterior computed 5-95% range is 1.0-4.5 K for LGM PMIP2 models and 0.9-4.8 K for the combination of PMIP2 and PMIP3, which is very close to the one computed with the Cauchy prior. Although the Bayesian paradigm will inevitably involve such subjective choices, the sensitivity of the results to a sensible choice of prior appears to 370 be low as long as a reasonable correlation exists in the ensemble.

Model Inadequacy
As previously explored and described by Williamson and Sansom (2019), we investigate the probability that all models deviate from reality to a certain extent, mainly because of computational limitations and their shared technical heritage. Statistically, this issue is best described by the terminology that while the models are considered 'exchangeable' with each other, they are 375 not exchangeable with reality. Williamson and Sansom (2019) provide further discussion on this point. In our methodology, this can simply be accounted for by considering that the regression prediction of S for reality has a larger residual than that arising for the models themselves: where the superscript t indicates here that we are referring to the truth (i.e. the real climate system) and * has the distribution 380 N(0,σ * ) for some σ * > σ. There can be various reasons why such an inadequacy, represented as * in Eq. 4, may be thought to exist. Models all share a common heritage and theoretical basis, which is certainly incomplete even if not substantially wrong, and computational constraints limit their performance. Particularly in the paleoclimate context, there may be biases in the experimental protocol and differences in number of feedbacks included in the different model systems, e.g. interactive vegetation and prognostic dust. Such errors would lead to reality being some distance from the model regression line, even if 385 the models were otherwise perfect. Such issues are pertinent to both the LGM, where there are significant uncertainties relating to dust and vegetation effects, and the mPWP where even the GHG forcing is somewhat uncertain, and furthermore where the older simulations are designed as a general representation of interglacial warm periods rather than a specific quasi-equilibrium climate state.
However, while we may anticipate reality deviating further from the regression line, it is difficult to quantify such deviation.

390
Here, we perform two sensitivity tests where we define σ * = 2σ, that is to say the distribution for the residual term * is defined as N(0,2σ) for our predictions. We consider that this corresponds to a rather large inadequacy term. To compare with our previous analysis, we investigate the effect of the model inadequacy using the data set of PMIP2 and PMIP3 combined for the case of the LGM, and the data set of PlioMIP1 for the case of the mPWP, and present them in Fig. 9. For the LGM, the 5-95% posterior range computed after doubling σ is 0.5-5.8 K ( Fig. 9-(b)), while the 5-95% posterior range for the mPWP 395 is 0.5-5.4 K ( Fig. 9-(d)). When we consider the 'latest model version' approach outlined in Section 3.3 and take the same approach of doubling the estimated residual, the 5-95% posterior ranges increase to 0.7-5.1 K for the LGM and a 5-95% posterior range of 0.4-5.7 K for the mPWP. Thus these sensitivity tests typically add around half a degree to the upper bound obtained, while having much less influence on the lower bounds in these examples.

400
Past climates are relevant sources of information on the properties of the climate system, specifically the equilibrium climate sensitivity, due to the quasi-equilibrium changes in response to external forcing, which are of similar magnitude to the projected future climate changes. In this study, we have described a new statistical method based on Bayesian inference to approach the question of emergent constraints. We believe this method provides a reasonable representation within the Bayesian paradigm of the underlying structure of emergent constraint principles. This Bayesian method is designed to be as explicit and flexible 405 as possible. Previous work using Ordinary Least Squares usually applied implicit assumptions. Because of these assumptions, OLS tends to generate tight posterior ranges, particularly on the lower end and when the correlation is rather weak.
By applying the method to the LGM tropical temperature model ensemble used in Schmidt et al. (2014), which included 14 models from the PMIP2 and PMIP3 generations, we estimate the climate sensitivity to be 2.6 K (0.7-4.8, 5-95 percentiles).
Similarly, applying the method to the mPWP tropical temperature data set of Hargreaves and Annan (2016) gives a climate 410 sensitivity of 2.4 K (0.5-5.0), but with the more uncertain ensemble of models which contributed to PlioMIP1.
With the new generation of climate models, the LGM and mPWP analyses have been widened by the addition of several CMIP6 model outputs. By adding the PMIP4 LGM simulations, we computed a 5-95% interval for climate sensitivity of 0.8-4.7 K. We performed the same analysis by combining PlioMIP1 and PlioMIP2 models and obtained a 5-95% interval of mPWP could lead to biased results, since the experimental protocol substantially changed in PlioMIP2.
An alternative approach is to consider solely the latest version of each model. By doing this we reduce expected redundancy in the ensemble, and so improve our confidence in the result, despite the smaller ensemble sizes. This leads to a more tightlyconstrained climate sensitivity of 2.7 (1.1-4.3) for the LGM simulations, and a less well-constrained sensitivity 2.4 (0.4-5.1) for the mPWP simulations. Our experiment considering a substantial model inadequacy term resulted in an increase of up to a 420 degree in the upper bounds presented here, though this aspect is as yet poorly understood and quantified.
Our results obtained by analysing the LGM or the mPWP in isolation are consistent with results obtained by other statistical methods used in previous studies. The differences between the way the information is obtained from the paleo record for the mPWP and the LGM and the different dominant climate features of the intervals suggest it may be reasonable to consider these estimates to be statistically independent, given climate sensitivity. It is then possible to combine them within the same 425 Bayesian framework to compute a narrower range of climate sensitivity. By doing so, we evaluated the climate sensitivity to be 2.6 K (1.1-3.9). Nevertheless, this approach requires independence between the different combined emergent constraints.
It is, in principle, straightforward to include other independent emergent constraints into our Bayesian framework. As well as evidence from historical or present day analyses, other past climates are starting to be explored by modellers and may be potential candidates for future analyses, such as the Eocene, the Miocene and the last deglaciation. Over the next couple of 430 years we expect new outputs for models from CMIP6 and new data analyses to become available, which will enable these preliminary analyses to be compared with results from expanded LGM and mPWP ensembles and improved data estimates.
The "Latest" models ensembles are those created from the most recent versions of each model (see Section 3.3).