It’s easy to note that GHCN is a bit lacking when it comes to recent temperature data for Canada:

Canada is of course just a particular aspect of the global case, which is well known and discussed in the literature ([Jones and Moberg 2003] for example).

Note in particular the drop off at around 1990, shortly before GHCN v2 was compiled in 1997. Canada is of some interest because: it has many relatively high latitude stations; and, it’s an industrialised nation that ought to have a good weather station network. The recent periods of GHCN are updated via CLIMAT messages. Who knows why Canada’s data doesn’t make it into GHCN, perhaps they prohibit NOAA from republishing the data, perhaps their BUFR format isn’t quote what NOAA expect.

Environment Canada publish (some of) their monthly temperature data on the web. Using ScraperWiki I wrote a scraper to collect it, and another program to format it into GHCN v2 format so that it could be used as an input for ccc-gistemp. Whilst there are far fewer stations in the scraped data, there is more data in the 2 most recent decades (note different y-axis scale compared to above):

GHCN is bumping along with record counts in the 30s and 40s for the last 2 decades, Environment Canada has 100 or more records for most of those years. Until 2007 in fact (change of government?).

Quite a few of the stations in the scraped data have fewer than 20 years. I exclude those because they’re unlikely to be of use in the ccc-gistemp analysis (in principle they could be combined in Step 1 with existing GHCN duplicates even when they’re short, but meh). It turns out this also avoids some thorny and potentially embarrassing data cleansing war stories (such as why are the reported locations for WMO station 71826 13 degrees apart?)

There are 106 stations from the scraped data with 20 or more years of monthly data.

So, the first question is how do we merge this dataset with GHCN? I suppose I could simply add all the scraped data alongside the GHCN data, but that would be wrong. The scraped data has some stations which are in GHCN, and those would be “double counted”. We ought to try and identify which stations are the same in the two sources; so how do we do that? Most of the scraped stations are WMO stations (87 out of 106) and they come with WMO identifiers. So it’s easy to match those against a possible counterpart in GHCN (WMO stations in GHCN have their 11-digit identifier ending with “000”). Both GHCN and the scraped data come with location information, so we can try matching on that as well. In the case of the matched WMO stations (67), 5 have locations that were more than 0.015 degrees apart, but they’re all within 0.03 degrees.

Sometimes a scraped WMO station will not match a GHCN station by WMO identifier, but will have a location match. This is probably a station in GHCN that should be marked as a WMO station, but by mistake isn’t.

WMO station 71808 represents a good win. The existing GHCN data (station 40371808000) runs from 1967 to 1982. The scraped data (Environment Canada identifier 7040813) runs from 1982 to 2010. In fact there is no overlap at all, GHCN finishes in 1982-07 and Environment Canada starts in 2010-08. Curious.

WMO station 71808: Blanc Sablon

We need a rule of thumb to decide whether to add the scraped data as a new station or as a duplicate of an existing GHCN station. Scraped WMO stations that match a WMO station in GHCN will always be added as a duplicate. For each remaining scraped station I select the nearest GHCN station as a comparison (this is in Canada in all but one case). An overlap and q score is computed for each GHCN duplicate, and the “best” duplicate is picked. If there are duplicates with “long” overlaps (meaing >= 120 months) then the one with lowest q score is chosen for “best”, otherwise (all the overlaps must be short) the duplicate with longest overlap is chosen. q is sqrt(sum(d[i]**2)/n) where d[i] is the difference between the two series at time i; the sum is over the common period; n is the number of elements (months) in the common period.

So for the non WMO stations (some of these will be WMO stations, just without an identified WMO counterpart in GHCN) the rule of thumb is: if s+q < 1, then the scraped data is added as a duplicate, otherwise it is added as a new station. s is the separation between the scraped station and the “best” station in degrees; q is the q score of the “best” duplicate.

The added value of this dataset, compared to GHCN, is recent data for high latitude stations. Alert is a classic case, North of 80:

The scraped data brings almost 20 years of recent data to the table.

So what does this all add up to when we incorporate this new data into the GISTEMP analysis? Well, I’ve done that, but I’ll have to leave the results for another post.


[Jones and Moberg 2003] Jones, P. D., A. Moberg, 2003: Hemispheric and Large-Scale Surface Air Temperature Variations: An Extensive Revision and an Update to 2001. J. Climate, 16, 206–223.

Leave a Reply