For context, make sure you read this first: A toy model for forecasting global temperatures – 2011 redux, part 1
Here is a list and brief description of the data sources I will be analysing:
1. Annual global temperature (WTI): This is the ‘dependent variable‘ in the model. The time span is 1979 to 2010. For the first analysis, I will only fit to the period 1970 – 2005, leaving 2006-2010 separate from the analysis for use in out-of-sample validation. The data stream starts in 1979 because this is the year when the satellite data (RSS and UAH) begin. I will use the seasonal average global temperature anomaly (TA) based on the composite measure provided at WoodforTrees. I have done this in order to it side-steps the ‘debate’ over which of the four major temperature measures is ‘best’ – it uses all of them and corrects for different baselines. Seasons are Dec-Jan-Feb, Mar-Apr-May, Jun-Jul-Aug and Sep-Oct-Nov. This gives 108 data points through to end-2005, and 128 through to end-2010.
2. ERSL CO2 from Mauna Loa (CO2): carbon dioxide measurements, in parts per million atmospheric concentration.
3. Total solar irradiance (TSI): PMOD composite values, measuring the intensity of incoming solar energy, in W/m2.
4. ERSL multivariate ENSO index (MEI): based on a sea-level pressure, zonal and meridional components of the surface wind, sea surface temperature, surface air temperature, and total cloudiness fraction of the sky. Details here.
5. JISAO PDO index (PDO): The “Pacific Decadal Oscillation” is a long-lived El Niño-like pattern of Pacific climate variability. Details here.
6. Volcano: a binary categorical variable (1/0), based on MLO apparent transmission data. This flags the El Chichón (1982) and Pinatubo (1991) large equatorial eruptions.
Here is a useful list of other potentially relevant climate data sources.
All of my analyses will be done in Program R. Some of the continuous independent data vectors (CO2 and TSI) were centered by subtracting their respective means; this is because they include large constants that are irrelevant to this exercise.
Here is what the structure of the data frame looks like:
'data.frame': 108 obs. of 7 variables:
$ DATE: num 1979 1979.25 1979.5 1979.75 1980 ...
$ WTI : num -0.1881 -0.1939 -0.098 0.0488 0.0022 ...
$ CO2 : num -24.4 -22.4 -25.7 -26 -22.7 ...
$ TSI : num 0.53 0.6 0.33 0.58 0.62 0.51 0.69 0.54 0.78 0.72 ...
$ MEI : num 0.327 0.357 0.587 0.794 0.646 0.887 0.471 0.185 0.01 0.248 ...
$ PDO : num -0.54 0.72 0.79 0.37 0.77 0.82 0.28 0.54 1.01 1.63 ...
$ VOL : int 0 0 0 0 0 0 0 0 0 0 ...
First up, let’s look at a cross-correlation table:
Read this by looking at a row, then reading across to the column. A value of 0 would indicate no relationship whereas a value of 1 would mean perfect (100%) correlation. So, the correlation between WTI (temperatures) and MEI (the ENSO measure) is 0.121, or 12.1%. Okay?
The values in bright red (and slightly larger font) are correlations >50%; the dark red are some other potentially interesting values. As you can see, there is a very strong relationship between DATE and CO2, which is not surprising as it has been rising steadily, year after year. Interestingly, the correlation between WTI and CO2 is slightly higher than that between WTI and DATE — so although WTI are rising over time, and so is CO2, the CO2 predictor is slightly better than DATE. For the later regression models (in the next post), we’ll leave out DATE, since it’s so highly correlated with CO2 anyway (and is not a forcing). But in this post, I’ll use it to look at the overall trend in the data.
Other things to note are that TSI has a slightly negative correlation with DATE over this period, i.e., solar irradiance has declined slightly over this period (1979-2005). The PDO has also weakened a little over the past three decades. Finally, there is a strong relationship between PDO and MEI, as you might expect, since they both measure oceanic conditions. On this basis there are grounds to leave one of them out of the analysis, but for the sake of completeness I’ll leave them both in and let a model-selection statistic tell us whether it is worthwhile including both indices.
You can see the WTI vs CO2 plot is quite bunched together along a ~45 degree line, whereas the WTI vs PDO, or CO2 vs MEI (for instance) are little more than a random sea of points.
The final thing to note from this most simple of analyses is that the correlation between WTI and TSI is extremely small, i.e., in isolation terms (as a direct effect on the WTI trend), solar forcing is irrelevant over this period of time. BUT, when incorporated into a model that also includes other variables (to capture trend etc.), it may yet prove useful in explaining some additional variation in the data.
Now let’s look at the relationship between WTI and DATE, to get a better feel for the trend. Here is the results of a simple ‘gaussian’ fit, with DATE adjusted to remove the constant (1979 subtracted to start the regression at x=0):
glm(formula = WTI ~ YEAR, family = gaussian)
Min 1Q Median 3Q Max
-0.3444864 -0.1064622 0.0003404 0.0884084 0.3903914
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.163348 0.025262 -6.466 3.16e-09 ***
YEAR 0.016595 0.001632 10.169 < 2e-16 ***
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for gaussian family taken to be 0.01747063)
Null deviance: 3.6586 on 107 degrees of freedom
Residual deviance: 1.8519 on 106 degrees of freedom
Some things to note. The slope coefficient for YEAR is 0.016595, or ~0.17 degrees C per decade (1.7 C per century). Comparing the residual to null deviance, we can work out that the percent of deviance explained by this model is (3.6586-1.8519)/3.6586*100 = 49.4%. So about half of the total variation in the WTI data is explained by the DATE regression line (as compared to simple the average of the WTI data, which, by definition, explains 0%). I will use the “AIC” value in the next post, when I look at other (more complex) models.
Here is a plot of the fit, and the distribution of the residuals.
You can see that the plot of WTI vs Date follows the data reasonably well, but there are some clear anomalies, e.g. some seasons during 1991, 1998. The histogram of residuals is roughly symmetrically distributed about zero, and follows a normal (gaussian) distribution reasonably well. This tells me that we are meeting the assumptions of this statistical model, and so it will be okay to use a gaussian linear model and an identity link (i.e., no transformation of the dependent data) for the rest of this exercise.
Okay, let’s pause. This is an important topic, and working systematically through this problem is interesting to me. As such, I’d like to make sure that this exercise is broken down into manageable components, to aid in understanding. So, it looks like it will take me more than three posts to get through it all (contrary to what I’d previously anticipated). I guess it will take as many posts as it takes…
Any questions so far? Any other correlates you’d like to argue for inclusion? Any other issues/critiques to raise at this point?
Next post in this series I will fit a range of statistical models (of varying complexity) based on these correlates, do some bootstrap resampling to evaluate robustness of the results to autocorrelation, and make a ‘forecast’ for the 2006-2010 period.