Posts Tagged ‘canada’

Analysis of Canada data

In an earlier post I describe the trials and tribulations of tracking down some station data from Environment Canada’s website.

The obvious question to ask is, how does this affect the ccc-gistemp analysis?

For starters, how much extra data do we get, once we’ve merged all the duplicates and rejected short records and so on? Here’s the station count by year for the GHCN stations (dark lines), and the extra Environment Canada stations (lighter lines):

This count is made after ccc-gistemp Step 2 processing, so duplicate records for the same station have been merged, short records have been discarded, and urban stations that could not be adjusted have been dropped (the log tells me there are 18 such stations). New Environment Canada stations (recall that some of the Environment Canada data is for stations that are not in GHCN) do not get any brightness information in the v2.inv file; it so happens that in ccc-gistemp this means they get marked as rural, more by accident than by design. I should probably fix this (by calculating brightnesses for the new stations, and rejecting stations with no brightness data), but this will certainly do for a preliminary analysis.

The 1990s still don’t reach the dizzying peaks of the 1960s (in terms of station count), but the Environment Canada data is certainly a welcome contribution, more than doubling the number of stations for recent decades.

The effect of this on the analysis? Here’s the arctic zone:

The first thing to note, if you haven’t seen one of these before, is the scale. The swings in this zone are much larger than the global average (this zone is 5% of the Earth’s surface); the recent warming in this zone is over 5 °C per century! The remaining points of note are the slight differences here and there in the very recent period. That large dip in the 2000s is 2004, and the new analysis has the anomaly some 0.16 °C colder (+0.57 versus +0.73). The warm spike in 1995 is 0.09 °C warmer. The same blips are also just about visibly different in the Northern Hemisphere analysis, but the differences are smaller.

The additional Environment Canada data is welcome, and does affect the result just enough to be visible, but the trends and any conclusions one could derive are not affected at all.


The data are available here, but you don’t need to download that if you’re using ccc-gistemp. Run «python tool/» to download the data, and then run «python tool/» to generate a mapping table. «python tool/ -d ‘data_sources=ghcn ushcn scar hohenpeissenberg ca.v2’» will then run the analysis.


It’s easy to note that GHCN is a bit lacking when it comes to recent temperature data for Canada:

Canada is of course just a particular aspect of the global case, which is well known and discussed in the literature ([Jones and Moberg 2003] for example).

Note in particular the drop off at around 1990, shortly before GHCN v2 was compiled in 1997. Canada is of some interest because: it has many relatively high latitude stations; and, it’s an industrialised nation that ought to have a good weather station network. The recent periods of GHCN are updated via CLIMAT messages. Who knows why Canada’s data doesn’t make it into GHCN? Perhaps they prohibit NOAA from republishing the data; perhaps their BUFR format isn’t quite what NOAA expect.

Environment Canada publish (some of) their monthly temperature data on the web. Using ScraperWiki I wrote a scraper to collect it, and another program to format it into GHCN v2 format so that it could be used as an input for ccc-gistemp. Whilst there are far fewer stations in the scraped data, there is more data in the 2 most recent decades (note different y-axis scale compared to above):

GHCN is bumping along with record counts in the 30s and 40s for the last 2 decades, whereas Environment Canada has 100 or more records for most of those years. Until 2007, in fact (change of government?).

Quite a few of the stations in the scraped data have fewer than 20 years of data. I exclude those because they’re unlikely to be of use in the ccc-gistemp analysis (in principle they could be combined in Step 1 with existing GHCN duplicates even when they’re short, but meh). It turns out this also avoids some thorny and potentially embarrassing data cleansing war stories (such as why the reported locations for WMO station 71826 are 13 degrees apart).

There are 106 stations from the scraped data with 20 or more years of monthly data.
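The length filter can be sketched in a few lines of Python. This is only an illustration, not the program actually used: the station layout (a dict mapping a station identifier to a dict keyed by (year, month)) and the function names are hypothetical, and so is the definition of a “year of data” (here, any calendar year with at least one monthly value).

```python
def years_of_data(series):
    """Count distinct calendar years that have at least one monthly value.

    `series` maps (year, month) -> temperature. Counting a year as soon as
    it has a single month is an assumption made for this sketch.
    """
    return len({year for (year, month) in series})


def long_enough(stations, min_years=20):
    """Keep only the stations with at least `min_years` years of data.

    `stations` maps a station identifier to its monthly series.
    """
    return {
        station_id: series
        for station_id, series in stations.items()
        if years_of_data(series) >= min_years
    }
```

A stricter definition (say, only counting years with all 12 months present) would reject more stations; only `years_of_data` would need to change.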

So, the first question is how do we merge this dataset with GHCN? I suppose I could simply add all the scraped data alongside the GHCN data, but that would be wrong. The scraped data has some stations which are in GHCN, and those would be “double counted”. We ought to try and identify which stations are the same in the two sources; so how do we do that? Most of the scraped stations are WMO stations (87 out of 106) and they come with WMO identifiers. So it’s easy to match those against a possible counterpart in GHCN (WMO stations in GHCN have their 11-digit identifier ending with “000”). Both GHCN and the scraped data come with location information, so we can try matching on that as well. In the case of the matched WMO stations (67), 5 have locations that are more than 0.015 degrees apart, but they’re all within 0.03 degrees.
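As a rough sketch of the identifier match: the code below assumes that a GHCN v2 identifier is a 3-digit country code, followed by the 5-digit WMO number, followed by a 3-digit modifier that is “000” for WMO stations (this agrees with station 40371808000 for WMO station 71808, mentioned later in this post). The function names are made up for illustration and are not part of ccc-gistemp.

```python
def wmo_number(ghcn_id):
    """Return the 5-digit WMO number embedded in an 11-digit GHCN v2
    identifier, or None when the identifier does not end in "000".

    Assumed layout: 3-digit country code + 5-digit WMO number + modifier.
    """
    if len(ghcn_id) != 11 or not ghcn_id.isdigit():
        raise ValueError("expected an 11-digit identifier: %r" % (ghcn_id,))
    if not ghcn_id.endswith("000"):
        return None
    return ghcn_id[3:8]


def match_by_wmo(scraped_wmo_numbers, ghcn_ids):
    """Map each scraped WMO number to its GHCN identifier, where one exists."""
    by_wmo = {}
    for ghcn_id in ghcn_ids:
        number = wmo_number(ghcn_id)
        if number is not None:
            by_wmo[number] = ghcn_id
    return {n: by_wmo[n] for n in scraped_wmo_numbers if n in by_wmo}
```

Scraped stations left unmatched by this step are the ones that need the location comparison.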

Sometimes a scraped WMO station will not match a GHCN station by WMO identifier, but will have a location match. This is probably a station in GHCN that should be marked as a WMO station, but by mistake isn’t.

WMO station 71808 represents a good win. The existing GHCN data (station 40371808000) runs from 1967 to 1982. The scraped data (Environment Canada identifier 7040813) runs from 1982 to 2010. In fact there is no overlap at all: GHCN finishes in 1982-07 and Environment Canada starts in 1982-08. Curious.

WMO station 71808: Blanc Sablon

We need a rule of thumb to decide whether to add the scraped data as a new station or as a duplicate of an existing GHCN station. Scraped WMO stations that match a WMO station in GHCN will always be added as a duplicate. For each remaining scraped station I select the nearest GHCN station as a comparison (this is in Canada in all but one case). An overlap and q score is computed for each GHCN duplicate, and the “best” duplicate is picked. If there are duplicates with “long” overlaps (meaning >= 120 months) then the one with the lowest q score is chosen as “best”; otherwise (all the overlaps must be short) the duplicate with the longest overlap is chosen. q is sqrt(sum(d[i]**2)/n) where d[i] is the difference between the two series at time i; the sum is over the common period; n is the number of elements (months) in the common period.
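The scoring and selection just described might look something like the following. It is a minimal sketch rather than the code actually used: each series is assumed to be a dict mapping (year, month) to a temperature, the function names are hypothetical, and returning a q of None when there is no overlap is a choice made here, not something the rule above specifies.

```python
import math

LONG_OVERLAP = 120  # months; the threshold for a "long" overlap


def overlap_and_q(a, b):
    """Return (n, q) for two monthly series keyed by (year, month).

    n is the number of common months; q is sqrt(sum(d[i]**2)/n) over
    those months. q is None when there is no common period.
    """
    common = set(a) & set(b)
    n = len(common)
    if n == 0:
        return 0, None
    q = math.sqrt(sum((a[t] - b[t]) ** 2 for t in common) / n)
    return n, q


def best_duplicate(scraped, duplicates):
    """Pick the "best" duplicate; returns (name, n, q), or None if
    `duplicates` (a dict of name -> series) is empty."""
    scored = []
    for name, series in duplicates.items():
        n, q = overlap_and_q(scraped, series)
        scored.append((name, n, q))
    if not scored:
        return None
    long_ones = [s for s in scored if s[1] >= LONG_OVERLAP]
    if long_ones:
        # At least one long overlap: the lowest q score wins.
        return min(long_ones, key=lambda s: s[2])
    # All overlaps are short: the longest overlap wins.
    return max(scored, key=lambda s: s[1])
```

Note that a duplicate with a short overlap never beats one with a long overlap, however low its q score.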

So for the non-WMO stations (some of these will be WMO stations, just without an identified WMO counterpart in GHCN) the rule of thumb is: if s+q < 1, then the scraped data is added as a duplicate, otherwise it is added as a new station. s is the separation between the scraped station and the “best” station in degrees; q is the q score of the “best” duplicate.
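Written as code, the rule of thumb is a one-line comparison. The sketch below uses a hypothetical function name and takes s and q as already computed. Treating a station with no overlap at all (q of None) as a new station is an assumption of this sketch; the rule as stated does not cover that case.

```python
def classify(s, q, threshold=1.0):
    """Apply the rule of thumb: 'duplicate' if s + q < threshold, else 'new'.

    s: separation from the "best" station, in degrees.
    q: q score of the "best" duplicate, or None if there was no overlap
       (assumed here to mean "add as a new station").
    """
    if q is None:
        return "new"
    return "duplicate" if s + q < threshold else "new"
```

For example, a station 0.2 degrees away with a q score of 0.5 would be added as a duplicate, while one 0.5 degrees away with the same q score would be added as a new station, since the comparison is strict.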

The added value of this dataset, compared to GHCN, is recent data for high latitude stations. Alert is a classic case, North of 80:

The scraped data brings almost 20 years of recent data to the table.

So what does this all add up to when we incorporate this new data into the GISTEMP analysis? Well, I’ve done that, but I’ll have to leave the results for another post.


[Jones and Moberg 2003] Jones, P. D., A. Moberg, 2003: Hemispheric and Large-Scale Surface Air Temperature Variations: An Extensive Revision and an Update to 2001. J. Climate, 16, 206–223.