Posts Tagged ‘ccc-gistemp’

Analysis of Canada data

In an earlier post I describe the trials and tribulations of tracking down some station data from Environment Canada’s website.

The obvious question to ask is, how does this affect the ccc-gistemp analysis?

For starters, how much extra data do we get, once we’ve merged all the duplicates and rejected short records and so on? Here’s the station count by year for the GHCN stations (dark lines), and the extra Environment Canada stations (lighter lines):

This count is made after ccc-gistemp Step 2 processing, so duplicate records for the same station have been merged, short records have been discarded, and urban stations that could not be adjusted have been dropped (the log tells me there are 18 such stations). New Environment Canada stations (recall that some of the Environment Canada data is for stations that are not in GHCN) do not get any brightness information in the v2.inv file; it so happens that in ccc-gistemp this means they get marked as rural, more by accident by design. I should probably fix this (by calculating brightnesses for the new stations, and rejecting stations with no brightness data), but this will certainly do for a preliminary analysis.

The 1990s still don’t reach the dizzying peaks of the 1960s (in terms of station count), but the Environment Canada data is certainly a welcome contribution. More than doubling the number of stations for recent decades.

The effect of this on the analysis? Here’s the arctic zone:

The first thing to note if you haven’t seen one of these before, is the scale. The swings in this zone are much larger than the global average (this zone is 5% of the Earth’s surface); the recent warming in this zone is over 5 °C per century! The remaining points of note are the slight differences here and there in the very recent period. That large dip in the 2000s is 2004, and the new analysis has the anomaly some 0.16 °C colder (+0.57 versus +0.73). A warm spike is 1995 is 0.09 °C warmer. The same blips are also just about visibly different on the Northern Hemisphere analysis, but the differences smaller.

The additional Environment Canada is welcome, and does affect the result just enough to be visible, but the trends and any conclusion one could derive are not affected at all.


The data are available here, but you don’t need to download that if you’re using ccc-gistemp. Run «python tool/» to download the data, and then run «python tool/» to generate a mapping table. «python tool/ -d ‘data_sources=ghcn ushcn scar hohenpeissenberg ca.v2’» will then run the analysis.


It’s easy to note that GHCN is a bit lacking when it comes to recent temperature data for Canada:

Canada is of course just a particular aspect of the global case, which is well known and discussed in the literature ([Jones and Moberg 2003] for example).

Note in particular the drop off at around 1990, shortly before GHCN v2 was compiled in 1997. Canada is of some interest because: it has many relatively high latitude stations; and, it’s an industrialised nation that ought to have a good weather station network. The recent periods of GHCN are updated via CLIMAT messages. Who knows why Canada’s data doesn’t make it into GHCN, perhaps they prohibit NOAA from republishing the data, perhaps their BUFR format isn’t quote what NOAA expect.

Environment Canada publish (some of) their monthly temperature data on the web. Using ScraperWiki I wrote a scraper to collect it, and another program to format it into GHCN v2 format so that it could be used as an input for ccc-gistemp. Whilst there are far fewer stations in the scraped data, there is more data in the 2 most recent decades (note different y-axis scale compared to above):

GHCN is bumping along with record counts in the 30s and 40s for the last 2 decades, Environment Canada has 100 or more records for most of those years. Until 2007 in fact (change of government?).

Quite a few of the stations in the scraped data have fewer than 20 years. I exclude those because they’re unlikely to be of use in the ccc-gistemp analysis (in principle they could be combined in Step 1 with existing GHCN duplicates even when they’re short, but meh). It turns out this also avoids some thorny and potentially embarrassing data cleansing war stories (such as why are the reported locations for WMO station 71826 13 degrees apart?)

There are 106 stations from the scraped data with 20 or more years of monthly data.

So, the first question is how do we merge this dataset with GHCN? I suppose I could simply add all the scraped data alongside the GHCN data, but that would be wrong. The scraped data has some stations which are in GHCN, and those would be “double counted”. We ought to try and identify which stations are the same in the two sources; so how do we do that? Most of the scraped stations are WMO stations (87 out of 106) and they come with WMO identifiers. So it’s easy to match those against a possible counterpart in GHCN (WMO stations in GHCN have their 11-digit identifier ending with “000”). Both GHCN and the scraped data come with location information, so we can try matching on that as well. In the case of the matched WMO stations (67), 5 have locations that were more than 0.015 degrees apart, but they’re all within 0.03 degrees.

Sometimes a scraped WMO station will not match a GHCN station by WMO identifier, but will have a location match. This is probably a station in GHCN that should be marked as a WMO station, but by mistake isn’t.

WMO station 71808 represents a good win. The existing GHCN data (station 40371808000) runs from 1967 to 1982. The scraped data (Environment Canada identifier 7040813) runs from 1982 to 2010. In fact there is no overlap at all, GHCN finishes in 1982-07 and Environment Canada starts in 2010-08. Curious.

WMO station 71808: Blanc Sablon

We need a rule of thumb to decide whether to add the scraped data as a new station or as a duplicate of an existing GHCN station. Scraped WMO stations that match a WMO station in GHCN will always be added as a duplicate. For each remaining scraped station I select the nearest GHCN station as a comparison (this is in Canada in all but one case). An overlap and q score is computed for each GHCN duplicate, and the “best” duplicate is picked. If there are duplicates with “long” overlaps (meaing >= 120 months) then the one with lowest q score is chosen for “best”, otherwise (all the overlaps must be short) the duplicate with longest overlap is chosen. q is sqrt(sum(d[i]**2)/n) where d[i] is the difference between the two series at time i; the sum is over the common period; n is the number of elements (months) in the common period.

So for the non WMO stations (some of these will be WMO stations, just without an identified WMO counterpart in GHCN) the rule of thumb is: if s+q < 1, then the scraped data is added as a duplicate, otherwise it is added as a new station. s is the separation between the scraped station and the “best” station in degrees; q is the q score of the “best” duplicate.

The added value of this dataset, compared to GHCN, is recent data for high latitude stations. Alert is a classic case, North of 80:

The scraped data brings almost 20 years of recent data to the table.

So what does this all add up to when we incorporate this new data into the GISTEMP analysis? Well, I’ve done that, but I’ll have to leave the results for another post.


[Jones and Moberg 2003] Jones, P. D., A. Moberg, 2003: Hemispheric and Large-Scale Surface Air Temperature Variations: An Extensive Revision and an Update to 2001. J. Climate, 16, 206–223.

ccc-gistemp release 0.6.0

[ed: 2010-10-29: This was an announcement for release 0.6.0, but that has a bug (see comments). Please use release 0.6.1 instead. I’ve edited the article to update the links]

I’m pleased to announce ccc-gistemp release 0.6.1 (making the previous buggy 0.6.0 obsolete). ccc-gistemp is our project to reimplement the NASA GISS GISTEMP algorithm in clearer code (in Python). This release is the first release made under the aegis of the Climate Code Foundation.

Many of the significant changes in this release have already been previewd in earlier blog posts:

Further details are available in the release notes.

We intend to carry on our work on ccc-gistemp, and we urge you to download our code, try it, and read it. We welcome contributions.

The work for release 0.6.0 was carried out by David Jones and Nick Barnes, and the work for release 0.6.1 was carried out by David Jones.

A real land mask

As mentioned in the previous post, I now have a real land mask from ISLSCP Initiative II.

The format of the files in refreshingly simple. 180 rows with 360 space separated numbers. It takes nothing more than a few dozen lines of Python to convert this into the 8000 cell format that ccc-gistemp uses for its step5mask.

Here’s the 8000 GISTEMP cells that contain any land (according to the ISLSCP file). Drawn on a Plate Carrée projection:

Even at this blobby scale there are some suspicious features (missing). Where is Ascension or Cocos Islands? I don’t think it’s an artefact of my processing, because they don’t appear on the ISLSCP file either; that’s probably a bug. So accepting that their might be some minor issues with islands and so on, we can go on to do a run of ccc-gistemp using this land mask:

As mentioned in the previous article, restricting to land de-emphasises coastal stations which generally warm less slowly. So the restricted version of the analysis has a stronger warming trend.

Many of the partial land cells will have ocean data and will not have a nearby land station to count as a land cell in the usual GISTEMP analysis. So we find that excluding cells that have ocean data (and no nearby land station) results in an even stronger trend. The red graph is reprised from the previous article:

(pedants should note that in the above graph “cells with ocean data” means “cells with ocean data and where the nearest land station to the cell’s centre is more than 100 km away”)

There are countless variations one could try such as weighting cells by their coverage, or using a threshold of 50% land cover. As far as an estimate of land temperatures go, such variations amount to interpolating between those two curves on the previous chart. Roughly.


[update: 2010-08-26: the “no arctic” masks were wrong, fixed now]

I added a land mask feature to ccc-gistemp. The land mask is used In Step 5, where land and ocean series are combined into zonal averages. For each cell, ocean data is used unless: there is very little ocean data (fewer than 240 months); or, there is a land station within 100 km. With the new land mask feature an external file (input/step5mask which is the same format as the newly output work/step5mask) specifies whether to use land or ocean data for each cell.

This enables me to do something new: a run of ccc-gistemp using only land-data but restricted to locations where there is no ocean data:

I don’t actually have a true land mask to hand (a list of cells that cover the Earth’s land surface), but I can do a normal run of ccc-gistemp that uses both land and ocean data and use the generated mask. That mask tells me which cells have no ocean data (and therefore land data is used, if there is any).

That’s the mask I use above. This reduces the effects of the 1200 km smoothing radius, by preventing land series from being interpolated over the ocean (unless there is no ocean data). This mask has 3077 cells (8000 in total), so clearly some ocean cells are still using land data. It looks like this:

Note that black is not land, but it is where land data is used if there is any. Note the arctic ocean has no ocean data so land data (generally interpolated) is used. Also some of southern ocean will pick up land data.

The fact that the land series when restricted to exclude cells with ocean data shows greater warming can be explained by considering coastal stations. In the usual ccc-gistemp land-only analysis every station has an influence that extends out to a disc of radius 1200 km. Coastal stations will have a disc of influence that extends into the ocean. When excluding cells that have ocean data that means that a coastal station has its disc chopped in half (roughly), and an island station is almost entirely excluded. The overall effect is to weight stations in the continental interior more strongly. And warming is greater in the continent interior.

[edit: Chad also shows a version of the same effect (scroll down the very long post until you reach the red and black graphs): de-emphasising coastal stations increases trend]

We can show the effect of excluding Arctic and Antarctic zones (everything north of 60N and south of 60S is excluded):

Only a small effect. The broken mask version had an even smaller effect, but I’m still a little surprised by how small this difference is.

Again we can exclude cells with ocean data, and again this shows increased warming. This more than cancels out the cooling by excluding the high latitude zones:

Anyone got a real land mask? [edit: Yes, Steven Mosher pointed me at one; so expect a post using a real land mask soon]

No US temperatures

Along with other changes I’ve been making to ccc-gistemp I recently added a feature to remove GHCN stations that are in the contiguous US. I did that partly because GISTEMP’s Fortran has a similar feature, and partly because it allows me to do the following whimsical experiment:

Running ccc-gistemp with no USHCN data and dropping contiguous US stations from GHCN:

tool/ -p 'retain_contiguous_US=False;data_sources=ghcn,scar,hohenpeissenberg'

The -p allows a parameter from to be set, without editing that file. retain_contiguous_US and data_sources are recently added parameters.

Onto the graphs!


Hmm, not much difference.

What about removing the ocean? Removing it should bring the differences due to land stations into sharper focus:

Tiny differences, not easily distinguishable from rounding.

What about zooming into the Northern Hemisphere, where the US is located?

Aha, some slightly less tiny differences appear which are not due to rounding. US citizens may be interested to note that their contribution to the recent trend is to lower it by a tiny amount (the trend is lower when US stations are included).

ccc-gistemp produces zonal anomaly files (because GISTEMP does). I’ve not really looked at them much before, but we can zoom in further on Zone 24N to 44N where most of the contiguous US is found. Note I had to rescale the next chart to fit the higher peaks in:

What above the next (GISTEMP) zone north, which includes the remaining part of the contiguous US but which has a slightly less dense station network and more volatile temperature swings:

The differences are all lost in the noise?

The US is only about 1½ percent of the Earth’s surface after all.

Just 440 stations

Over at his blog, Nick Stokes selects rural stations with more than 90 years of data and that report in 2010. In GHCN he counts just 61 stations.

We can do something similar with ccc-gistemp. We get a result that is unreasonably similar to the official GISTEMP with full series.

That major differences in methodology are:

Here’s the short (but not very clear) Python script that identifies long rural stations:

rural = set((row[:11] for row in open('input/v2.inv') if
  int(row[102:106]) <= 10))
def id11(x):return x[:11]
def id12(x):return x[:12]
import itertools
rurlong = [group for group,rows in
  ((g,list(r)) for g,r in itertools.groupby(open('work/v2.step1.out'), id12))
  if id11(group) in rural and len(rows) > 90 and rows[-1][12:16] == '2010']
print len(rurlong)

There are 440 stations. With these 12-digit record identifiers in hand it is a trivial matter to create a new v2.mean file («open(‘v2.longrural’, ‘w’).writelines((row for row in open(‘work/v2.step1.out’) if row[:12] in rurlong))» if you must).

As I said before the results are pretty close to the standard analysis:

The new v2.mean file is over 10 times smaller (uncompressed) than the official GHCN v2.mean file. The analysis is correspondingly about 10 times quicker to run (a few minutes on my old laptop). In case you want to use this file, to replicate my results or run it through your own analysis, I’ve made it available from our googlecode repository.


In this comment Bob Koss notices something a little about some of the 1980 anomalies near Byrd station (-80.0-119.4). He points out there is a patch of suspiciously zero anomaly north of Byrd. Illustrated by this graphic (the dot is Byrd, roughly):

Here’s an extract from ccc-gistemp’s Step 3 output (work/v2.step3.out) showing that several cells have flat 0 anomalies for 1980:

-77.2-121.5C1980-9999-9999    0    0    0    0    0    0    0    0    0    0
-77.2-112.5C1980-9999-9999    0    0    0    0    0    0    0    0    0    0
-77.2-103.5C1980-9999-9999    0    0    0    0    0    0    0    0    0    0
-77.2-094.5C1980-9999-9999    0    0    0    0    0    0    0    0    0    0
-74.8-130.5C1980-9999-9999    0    0    0    0    0    0    0    0    0    0
-74.8-121.5C1980-9999-9999    0    0    0    0    0    0    0    0    0    0
-74.8-112.5C1980-9999-9999    0    0    0    0    0    0    0    0    0    0
-74.8-103.5C1980-9999-9999    0    0    0    0    0    0    0    0    0    0
-74.8-094.5C1980-9999-9999    0    0    0    0    0    0    0    0    0    0
-72.7-139.5C1980-9999-9999    0    0    0    0    0    0    0    0    0    0
-72.7-130.5C1980-9999-9999    0    0    0    0    0    0    0    0    0    0
-72.7-121.5C1980-9999-9999    0    0    0    0    0    0    0    0    0    0
-72.7-112.5C1980-9999-9999    0    0    0    0    0    0    0    0    0    0
-72.7-103.5C1980-9999-9999    0    0    0    0    0    0    0    0    0    0
-70.9-139.5C1980-9999-9999    0    0    0    0    0    0    0    0    0    0
-70.9-130.5C1980-9999-9999    0    0    0    0    0    0    0    0    0    0
-70.9-121.5C1980-9999-9999    0    0    0    0    0    0    0    0    0    0
-70.9-112.5C1980-9999-9999    0    0    0    0    0    0    0    0    0    0
-70.9-103.5C1980-9999-9999    0    0    0    0    0    0    0    0    0    0
-69.2-121.5C1980-9999-9999    0    0    0    0    0    0    0    0    0    0

(this file is a v2.mean style file where the 11-digit station identifiers are in fact the centre of the grid cell rounded to the nearest .1 degree in latitude and longitude.)

It turns out that all of those cells have a time series starting in 1980. And this leads to me having to explain a curiosity in the algorithm used to produce anomalies.

The anomalies produce by the gridding step, Step 3, have a baseline of 1951 to 1980 (see gridding_reference_period in What this means is that just before the end of Step 3 each cell’s time series is adjusted: For each of the 12 months in the year, an average is computed for the period 1951 to 1980 and this average is subtracted from that month’s time series.

If there is a cell that is unlucky enough to have its series start in 1980, then the average for the period 1951 to 1980 is simply the 1980 value. That’s subtracted from the series, so after Step 3 all the anomalies for 1980 will be 0 (see list above).

This doesn’t happen a lot. Recall that the GISTEMP analysis uses a grid with 8000 cells. We can see that the vast majority of the cells (that have any series at all) have 30 years of data for the period 1951 to 1980:

(The chart above represents a simplification, but a reasonable one: A year is marked as “present” if it has any valid monthly data)

The vast majority of the cells have a full 30 years for the base period.

What if there are no values between 1951 and 1980? Then in that case the algorithm uses an average over the entire series for its baseline. There are 39 such cells (that’s the tiny blip on the left of the graph).

So some cells will have a different baseline, some cells will have a pathetically small number of values to average over in the baseline (the 1980 example above is the extreme example). This is unfortunate if you’re the sort of person that worries about individual years in localised regions, but it doesn’t affect the larger scale averages at all.

Koss also asks why the cell just north and west of Byrd does not have a yearly anomaly for 1980. A reasonable question since it’s close to Byrd and Byrd has data for 1980.

Here’s a Google Earth screenshot, showing stations south of -70. Taken from roughly above the South Pole with 0 degrees longitude to the right.

Note that this shows stations after Step 2 (I ran the command «tool/ --lat -90,-70 work/v2.step2.out > step2.kml»), and Step 2 discards all stations with a short series, which means throwing out several Antarctic stations.

Let’s zoom in a little on Byrd and add the gridded cells (a select few of the gridded cells from Step 3):

For the cells (with locations of the form -NN.N-EEE.E), the 4 digit number is the first year with data (technically, the first year with enough data to produce a yearly anomaly, but that’s the same in these cases).

So why does cell -77.2-121.5 (just north of Byrd) have a series that starts in 1980? Answer: Because it uses data from Byrd, and Byrd’s data starts in 1980. The little squiggle inside the icon is the yearly anomaly series and you can just about see that -77.2-121.5 is the same squiggle as Byrd.

What about cell -77.2-130.5 (just north and west of Byrd)? Why does that have no anomaly for 1980? Because its series starts in 1985. Why is that? Because this cell uses data from Gill and not Byrd (again, compare the squiggle to Gill’s squiggle). Why is that? Because there are 3 stations within 1200 km of this cell. I recently added a bit of logging to Step 3 and it’s now possible to see for each cell what stations contributed and the weights of their contribution:

$ grep -e -77.2-130.5C log/step3.log 
-77.2-130.5C stations [('700893760009', 0.10965896795479568), ('700893240009', 0.0), ('700891250000', 0.0)]

70089376000, Gill, -80.00-178.60
70089324000, Byrd, -80.00-119.40
70089125000, Byrd Station, -80.02-119.53

Byrd and Byrd Station are presumably different stations in more or less the same location. Byrd Station has a temperature record starting in 1957 (yay for International Geophysical Year) running continuously up to 1971, and then sporadic southern summers up to 1987. Byrd is presumably the modern station; its record starts in 1980 and runs up to 2010 (not quite continuously).

When considering the order in which to combine stations to produce a series for a cell, stations are considered “longest record first” by a simple count of the total number of months of valid data. It turns out that Gill, 70089376000, has 246 months of data, while Byrd, 70089324000, has 216 months of data. So Gill comes first (even though Byrd is much closer).

Referring back to the log output for the three stations, we see that records 700893240009 and 700891250000 contributed with 0.0 weight. In other words, not at all. Why is that?

In order for a station to combine with a cell series it has to have a certain number of years of common overlap: parameters.gridding_min_overlap. This is normally 20. Each month of the year is considered separately; it turns out that February is the month for which Gill and Byrd have the greatest overlap, with 18 years in common. Just falling short of the cut. If both Gill and Byrd continue to have temperature records then in 2012 their February record will be combined. This still won’t be enough to compute a yearly anomaly for 1980 for that cell, because only February will combine, not the other months of the year. And one month is not enough to compute a yearly anomaly.

For completeness we should consider Byrd Station, 70089125000, as well. Gill starts in 1985, and Byrd Station has only 4 monthly values since 1985 (and in fact, only 8 for the whole of the 1980’s). So no chance of enough overlap.

However for the cell containing Byrd itself, -80.1-121.5, we can see that its record starts in 1957. Again grepping station contributors out of the log:

$ grep -e -80.1-121.5 log/step3.log 
-80.1-121.5C stations [('700890090008', 0.079848008479036836),
('700893760009', 0.12098360649264683),
('700893240009', 0.9657843990261038),
('700891250000', 0.96815637481310624)]

The first station is Amundsen–Scott, 70089009000, with a record that starts in 1957 (International Geophysical Year again). We have increased overlap, and so it turns out that Gill and both records from Byrd contribute. You can think of Amundsen-Scott as being like a catalyst that allows Gill and Byrd to combine when they ordinarily would not. Amundsen-Scott is only just within range to contribute to this cell. Cells further north will not use Amundsen-Scott, so Gill and Byrd will not combine (unless there is some other station in range to provide an overlapping series).

Conclusion: Some individual cells can have slightly unintuitive series, but they are merely the result of the documented computation (the key facts—that records are combined longest first, stations from up to 1200 km away are used, and records require an overlap for combining—are all documented in Hansen and Lebedeff 1987). A tiny fraction of cells in the anomaly maps either have misleadingly short anomaly periods or entirely different anomaly periods. This makes no difference to zonal averages. Probably when browsing the maps such cells should be marked or eliminated.


tool/ has been recently updated so that it computes and draws trends (the work was done by me and Nick Barnes). Here’s some recent comparisons redrawn with trends:

The “before 1992 / after 1992 stations” from “The 1990s station dropout does not have a warming effect”:

The short trends are done with the last 30 years of data for each series (which since one series ends in 1991, is a different period for each). Notice how similar the recent trends are.

Reprising the Urban Adjustment post:

I don’t think I’ve done a combined land and ocean chart comparing hemispheres for the blog before, but here it is now:

Nick Barnes added the calculation of R2 whilst I was writing this post, causing me to redraw all the charts.

Nick has also been exploiting ccc-gistemp‘s new module, and did a run with the somewhat experimental 250km smoothing rather than the traditional 1200km smoothing. The parameter is named gridding_radius and it affects gridding in Step 3; setting it to 250km essentially reduces each station’s influence to very roughly the size of the cell used in gridding.

The effect on the trends is most visible in the Northern Hemisphere:

Trends are just one minor example of the way in which the ccc-gistemp code can be continuously improved. We don’t just draw trends for one graph, we improve the code so that all graphs can have trends.