For context, make sure you read this first: A toy model for forecasting global temperatures – 2011 redux, part 1
Here is a list and brief description of the data sources I will be analysing:
1. Global temperature (WTI): This is the ‘dependent variable‘ in the model. The time span is 1979 to 2010. For the first analysis, I will only fit to the period 1979 – 2005, leaving 2006 – 2010 out of the fit for use in out-of-sample validation. The data stream starts in 1979 because this is the year when the satellite records (RSS and UAH) begin. I will use the seasonal average global temperature anomaly (TA) based on the composite measure provided at WoodforTrees. I have done this because it side-steps the ‘debate’ over which of the four major temperature measures is ‘best’ – it uses all of them and corrects for their different baselines. Seasons are Dec-Jan-Feb, Mar-Apr-May, Jun-Jul-Aug and Sep-Oct-Nov. This gives 108 data points through to end-2005, and 128 through to end-2010.
2. ESRL CO2 from Mauna Loa (CO2): carbon dioxide measurements, in parts per million (ppm) atmospheric concentration.
3. Total solar irradiance (TSI): PMOD composite values, measuring the intensity of incoming solar energy, in W/m2.
4. ESRL multivariate ENSO index (MEI): based on sea-level pressure, the zonal and meridional components of the surface wind, sea surface temperature, surface air temperature, and the total cloudiness fraction of the sky. Details here.
5. JISAO PDO index (PDO): The “Pacific Decadal Oscillation” is a long-lived El Niño-like pattern of Pacific climate variability. Details here.
6. Volcano: a binary categorical variable (1/0), based on MLO apparent transmission data. This flags the El Chichón (1982) and Pinatubo (1991) large equatorial eruptions.
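The seasonal averaging described in (1) is simple bookkeeping, and a toy R sketch may make it concrete. The monthly values below are random placeholders, not real temperature data; the point is only that 27 years of months (Dec 1978 through Nov 2005) collapse into the 108 seasonal means mentioned above.

```r
# Toy sketch of the seasonal averaging: collapse a monthly anomaly
# series into DJF/MAM/JJA/SON means. Monthly values are random
# placeholders standing in for the WoodforTrees composite.
set.seed(1)
monthly <- rnorm(324)                              # Dec 1978 .. Nov 2005 = 27 yr x 12
season  <- rep(1:(length(monthly) / 3), each = 3)  # 3 consecutive months per season
TA <- as.numeric(tapply(monthly, season, mean))    # one mean per season
length(TA)                                         # 108 points through end-2005
```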
Here is a useful list of other potentially relevant climate data sources.
All of my analyses will be done in Program R. Some of the continuous independent data vectors (CO2 and TSI) were centered by subtracting their respective means, because they include large constant offsets that are irrelevant to this exercise.
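The centering step is one line in R. The numbers below are invented for illustration; the real CO2 series is the ESRL Mauna Loa record in ppm.

```r
# Center a vector by subtracting its mean (made-up values; the real
# series comes from ESRL Mauna Loa).
CO2  <- c(386.2, 387.1, 388.5, 389.9)
CO2c <- CO2 - mean(CO2)     # equivalently: scale(CO2, scale = FALSE)
mean(CO2c)                  # ~0: the large constant offset is gone
```

Centering does not change the fitted slopes in a regression, only the intercept, so nothing of substance is lost.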
Here is what the structure of the data frame looks like:
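As a rough mock-up of that layout (column names follow the list above; every value is invented for illustration, one row per season):

```r
# Hypothetical mock-up of the model data frame. All values are
# invented; only the column layout follows the variable list above.
dat <- data.frame(
  TA      = c(-0.12, 0.05, 0.21),   # seasonal temperature anomaly (WTI)
  CO2     = c(-25.3, -24.9, -24.4), # centered ppm
  TSI     = c(0.20, -0.10, 0.05),   # centered W/m2
  MEI     = c(0.6, 1.1, 0.4),
  PDO     = c(-0.3, 0.4, 0.1),
  Volcano = factor(c(0, 0, 1))      # binary eruption flag
)
str(dat)
```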